Windows 2012 R2 sometimes hangs at splash screen after reboot

Hello

Sometimes our Windows 2012 R2 servers hang at the splash screen (spinning dots) and never boot. They are virtual machines running on ESXi 5.5. To resolve the issue we just have to reset the VM, and then Windows boots normally.

All of our servers are affected. No memory dump is generated and nothing looks wrong in Event Viewer. Any ideas?

September 18th, 2014 10:23am

Hi,

Where did you check the events: on the physical machine or in the guest? It would be better to check the events on the physical machine, and to make sure all of its drivers are installed properly.

Zia

September 18th, 2014 1:12pm

Hi kompakt,

First, please keep your server up to date.

If the issue persists (it still hangs sometimes), I would suggest you contact VMware:

https://communities.vmware.com/welcome

http://partnerweb.vmware.com/GOSIG/Windows_Server_2012.html

Best Regards

Elton Ji

September 22nd, 2014 6:32am

Hello,

exact same situation over here. 

Fully up to date vSphere 5.5 infrastructure. Only Windows 2012 R2 VMs affected. 2012 R2 running on physical hardware never showed this behavior. This happens after a reboot initiated after installing patches once a month (using LANdesk patch manager).

Regards,

Andreas

October 7th, 2014 9:29am

We are currently experiencing the same issue with our Server 2012 R2 VMs.  The VMs get stuck at the Windows splash screen, and the only way to fix them right now is to power them off completely and power them back on.  We are running vSphere 5.5 Update 1 and all of our VMs are running the latest guest tools.
October 21st, 2014 6:01pm

Unfortunately I have the issue too. We use System Center 2012 to deploy the patches. We are also on VMware 5.5.
October 29th, 2014 6:16pm

I just created a case with Microsoft this afternoon and I am waiting to hear back from them.  We also use SC 2012 for patch deployment.  In most of our cases this happens after a patch deployment; however, so far I can't duplicate the issue just by rebooting the VM itself.  The most recent event was when a DBA was installing additional SQL roles on a Windows Server 2012 VM and a reboot was required to finish the install.  The VM had to be powered off and powered back on again.

I have also created a case with VMware and so far, they have turned up nothing.

October 29th, 2014 6:26pm

We have the exact same issue.  VMware ESXi 5.5 Update 1 and Update 2 servers.  The Windows VMs are patched by WSUS in 4 different patching times/groups.  We have over 160 VMs, of which 25+ are now Windows 2012 R2 servers.  All VMs patch correctly, but a RANDOM PORTION of the Windows 2012 R2 servers fail to complete their boot after patching (6 in the last patching cycle).  They hang at the spinning circle of dots on the boot screen.

I have so far not been able to track this one down.  They seem to be hanging very early in the boot cycle - early enough that the volumes (drives) are not marked as dirty when you 'reset' the VM (i.e. Data Protection Manager does not need to run a consistency check over the volumes at next boot-up).

I suspected this has to do with heavy I/O on the underlying datastores, but have not been able to prove it.  I moved a Windows 2012 R2 VM to a separate LUN and then generated lots of I/O against that LUN whilst I rebooted the VM.  The datastore latency went up to 160ms+ but the VM still rebooted just fine...  That doesn't rule out latency, but I just can't prove it...

Another option I have considered, but haven't tried yet, is to replace the virtual LSI controller with VMware ParaVirtual.  It's not my standard, but if this is a bug in the LSI driver it would get around it.  The ParaVirtual driver comes with caveats for MS clustered VMs.

Will be watching this thread with interest...  There is definitely something wrong here.  And I know I will be battling more and more Windows 2012 R2 that fail their late night patching cycles as the weeks go by. :-(  So much for 'automated' patching...

November 1st, 2014 3:37am

I have a Class C ticket open with both Microsoft and VMware. VMware is still going through all of the logs that I sent them. I hate to say it, but the Microsoft engineer isn't much help. He says there is very little he can do since the issue isn't consistent. I have been busy fighting other things at work and haven't had time to argue with him about gathering more data. That is interesting about the VMware paravirtual SCSI controller and the datastore. I will look into that on Monday. Maybe there is a correlation with where the VMs are located across the datastores. I checked the VMs having this issue, thinking that maybe they had the wrong hardware profile assigned, but that isn't the case. So far I haven't been able to duplicate the issue at all.
November 1st, 2014 2:09pm

I have also opened a Premier support call with Microsoft, and they recommended I turn on boot logging and capture the memory through VMware snapshots when a failure happens. They will then analyze it. Hopefully we will have some more failures next patch cycle and they can find something.
November 5th, 2014 3:45pm

We have the same issue - I have been researching this since April.  I thought it might be related to the Automatic updates Microsoft saw fit to turn on - however I have been unable to find any common thread among the configurations of the settings.  Thanks for all the postings - will be watching with interest.
November 5th, 2014 9:25pm

GJMFL, could you tell us a little about your environment? Is it VMware 5.5 or something else? Maybe we can find the common thread.
November 5th, 2014 9:29pm

VMware ESXi 5.5 Update 1 - Server 2012 R2.  I am looking at everything: other users who never logged off, automatic update settings (we use a WSUS server), although it also happens sometimes when I go out to the Microsoft update site. I have been trying to test various scenarios to find something.
November 6th, 2014 1:07pm

I have been on PTO this weekend where I work so I haven't been checking my email until this morning.  VMware finally has gotten back to me.  There is a bug with ESXi and their engineers are working on a fix.  VMware suggests making one of the changes below.  If anyone implements any of these changes, please let me know if it does or doesn't work.

Starting with Windows 8 / Windows 2012 Server, during its boot process the operating system will reset the TSC (TimeStampCounter, which increments by 1 for each passed cycle) on CPU0. It does not reset the TSC of the other vCPUs and the resulting discrepancy between two vCPUs' TSC can result in the OS not booting past the Windows splash screen, and a full power off and on will fix it.
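
To get a feel for the scale of such a discrepancy, here is a rough back-of-the-envelope sketch (it is not from VMware). It assumes the TSC ticks at a constant 2.5 GHz and that the counter on the other vCPUs has been running since the last full power cycle. The function name tsc_gap_hours and all numbers are made up for illustration.

```shell
#!/bin/sh
# Illustration only: convert a TSC gap between two vCPUs into wall-clock hours.
# The 2.5 GHz frequency used in the example call is an assumption, not a measured value.
tsc_gap_hours() {
    # $1 = TSC on the vCPU that was reset
    # $2 = TSC on a vCPU that was not reset
    # $3 = TSC frequency in Hz
    gap_cycles=$(( $2 - $1 ))
    gap_seconds=$(( gap_cycles / $3 ))
    echo $(( gap_seconds / 3600 ))
}

# Example: CPU0 was reset to 0, another vCPU kept counting for 60 days at 2.5 GHz.
tsc_gap_hours 0 12960000000000000 2500000000
```

Under those assumptions the example call prints 1440, that is, a gap of 60 days between CPU0 and the other vCPUs.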

Our engineering team here are currently working on a code change to accommodate this.
There is a workaround suggested from engineering to add a line of code to the vmx (configuration) file of the VM to prevent this from reoccurring.
This will basically tell the vmx file that the TSC for all vCPUs should be reset to zero on a soft reset of the machine, and not just CPU0.

Please note that this has not been tested extensively by engineering, and should be run at your own risk as it is just a workaround which has not been fully QE tested.


This can be done a few ways:

First method: Manually editing the VM's vmx file one VM at a time.
1. Power off the VM
2. Add the following line to the vmx file:
monitor_control.enable_softResetClearTSC = "TRUE"
3. Reload the VM
4. Power on the VM again.
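
Step 2 of the first method could be scripted along these lines. This is only a sketch: add_tsc_workaround is a hypothetical helper name, it assumes the VM is already powered off (step 1), and it does not reload or power on the VM.

```shell
#!/bin/sh
# Hypothetical helper: append the workaround line to a .vmx file,
# unless the key is already present. Assumes the VM is powered off.
add_tsc_workaround() {
    vmx="$1"
    [ -f "$vmx" ] || { echo "no such file: $vmx" >&2; return 1; }
    # Match the key at the start of a line, ignoring case.
    if grep -qi '^[[:space:]]*monitor_control\.enable_softResetClearTSC[[:space:]]*=' "$vmx"; then
        echo "already set: $vmx"
        return 0
    fi
    cp "$vmx" "$vmx.bak" || return 1      # keep a backup before editing
    # Make sure the new entry starts on its own line.
    [ -n "$(tail -c 1 "$vmx")" ] && echo >> "$vmx"
    printf '%s\n' 'monitor_control.enable_softResetClearTSC = "TRUE"' >> "$vmx"
    echo "updated: $vmx"
}
```

It would be called with the path to the VM's .vmx file, for example add_tsc_workaround /vmfs/volumes/path/to/virtualmachine.vmx. Running it twice leaves a single copy of the line in the file.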


Second method: Doing this to every VM on a host at one time.
1. SSH to the ESX host
2. Run the following command:
echo 'monitor_control.enable_softResetClearTSC = "TRUE"' >> /etc/vmware/config
3. Run the following command to do a suspend-resume in order to apply the setting so that affected guests won't hang during the next reboot:
vim-cmd vmsvc/getallvms | sed -n 's/\(^[0-9]\+\).* windows8.*Guest.*$/\1/p' | while read vmid; do state=$(vim-cmd vmsvc/power.getstate ${vmid} | sed -n 's/^.*\(Powered on\).*$/\1/p'); if [ "$state" ]; then vim-cmd vmsvc/power.suspendResume ${vmid} && sleep 5; fi; done;
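
Before running the loop in step 3, it may be worth previewing which VM IDs its sed filter selects. The sketch below wraps the same filter in a function (select_windows8_vmids is a made-up name) and writes the pattern with [0-9][0-9]* instead of \+ so it does not depend on a sed extension. The sample rows in the usage note only approximate the shape of the getallvms listing and are invented.

```shell
#!/bin/sh
# Read a "vim-cmd vmsvc/getallvms"-style listing on stdin and print the IDs
# of VMs whose guest ID looks like windows8*Guest (Windows 8 / Server 2012).
select_windows8_vmids() {
    sed -n 's/^\([0-9][0-9]*\).* windows8.*Guest.*$/\1/p'
}
```

On a host this would be used as vim-cmd vmsvc/getallvms | select_windows8_vmids, which only prints IDs and does not suspend anything. Fed an invented row such as "12 web01 [ds1] web01/web01.vmx windows8Server64Guest vmx-10" it prints 12, while a row whose guest ID is rhel6_64Guest prints nothing.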


Last method: Using PowerCLI to do this to every VM in the environment.
Open PowerCLI, connect to vCenter server and run the following command:

    # Apply the workaround to every Windows 8 / Server 2012 guest in the inventory.
    ForEach ($vm in (Get-VM)) {
        $vmv = Get-VM $vm | Get-View
        $guestid = $vmv.Summary.Config.GuestId
        $state = $vmv.Summary.Runtime.PowerState
        # Build a config spec that carries the single extra option.
        $vmx = New-Object VMware.Vim.VirtualMachineConfigSpec
        $vmx.extraConfig += New-Object VMware.Vim.OptionValue
        $vmx.extraConfig[0].key = "monitor_control.enable_softResetClearTSC"
        $vmx.extraConfig[0].value = "TRUE"
        if ($guestid -like "windows8*Guest") {
            $vmv.ReconfigVM_Task($vmx)
            if ($state -eq "poweredOn") {
                # Migrate the running VM to the host it is already on.
                # ($_ is not set inside a ForEach statement, so use $vmv.)
                $vmv.MigrateVM_Task($null, $vmv.Runtime.Host, 'highPriority', $null)
            }
        }
    }


Note:
If you are using Solaris VMs in the environment, do not run this against those Solaris VMs as they could potentially hang with that setting in the vmx.
Also, when the script is running, do not do a vmotion, suspend, clone, or snapshot operation at the same time - this is very important, as it could cause the script to fail.

From looking at the logs, it seems like you are not running Solaris as an OS anyway, at least on these 2 hosts:
rhayden@scripts-prod-3 HostLogs29thOct $ find esx*/vmfs/volumes/ -maxdepth 3 -name "*.vmx" -exec grep 'guestOS'  {} \; | awk '{print $NF }' | sort | uniq -c
      5 "longhorn"
      1 "longhorn-64"
      1 "rhel6-64"
      1 "sles11-64"
     54 "windows7srv-64"
     19 "windows8srv-64"
      1 "winnetenterprise-64"
      7 "winnetstandard"
      1 "winNetStandard



If after applying these settings to the VMs this does not work after the next patching / updating (you are still seeing the issue), what we would need to do at that point is get the suspended state file for the VM to send to engineering, as we cannot reproduce this issue in-house.

If this occurs, this is how to gather the information we would need to send to engineering:
(Do not reboot the VM's until this is done)

1. SSH to the host and run the following command:
vm-support --listvms

2. Now run this command:

vm-support --performance --manifests="HungVM:Coredump_VM HungVM:Suspend_VM" --groups="Fault Hardware Logs Network Storage System Userworld VirtualMachines" --vm="</vmfs/volumes/path/to/virtualmachine.vmx>"

(Change the path of the VM in the command above to the actual path).
That will put a tgz file in /var/tmp. The file name is displayed when complete. Copy this file off the host manually.


November 7th, 2014 10:05am

Chris,

Do you have any update from VMware on this? We have a bunch of servers experiencing this problem.

November 11th, 2014 4:32am

Nathaniel, Other than the 3 workarounds that they suggested I make, no. I have inquired about when a real fix will be created and I haven't heard back as of yet.
November 11th, 2014 9:14am

Chris,

Have you implemented any of the workarounds, or are you waiting for an update? Please keep us posted on VMware's response.

November 12th, 2014 7:29pm

I plan on making these changes to 8-12 VMs tomorrow and then waiting to see what happens.  We will be patching our QA environment over the weekend.

As of right now, there is no time frame for if or when VMware will create a patch for this.

November 12th, 2014 11:04pm

Sorry for the delay in posting my results.

I have updated 10 2012 R2 VMs with the changes in the .vmx file.  None of them experienced any issues rebooting after they were patched.  I am going to expand my sample in our QA environment to 20-30 VMs; however, I am going to say that making those modifications did help.

There isn't a public KB article from VMware about this issue other than this one:

http://kb.vmware.com/kb/2082042

There is no ETA on a patch from VMware.  I hope that this information helps.


November 26th, 2014 2:54pm

We had this issue several months ago and Microsoft pointed us to clock time mismatch.  They suggested we go to our Hypervisor vendor.  VMware did minor investigations and found nothing of course.  We are currently back to the same issue again like your post.  Finding your post here has made VMware release the internal document stating what you found, to us.  It is still internal only to VMware and Microsoft.

We have applied the PowerCLI script to all of our servers and it does modify the VMX without any issue.  The problem is you still need to do a reset or power off/on via the virtual power buttons in VMware.  OS reboots do not work.  So we are in the middle of scheduling outages for our 400+ 2012 servers.  

Thanks again for the post.  I will post what VMware gave me on the symptoms for this issue to happen.

Symptoms:
Under the following conditions, you are:
Running Windows 8 or 2012 Server or later as the guest operating system on the virtual machine
Running on ESXi 5.5 or later with virtual machine hardware version 10 (vmx-10)
The virtual machine has not experienced a full power cycle (powered off / powered on) for more than two months.
The virtual machine is configured with more than one vCPU.
You might see the following symptoms:
After rebooting, Windows 8 or 2012 Server virtual machines might hang during the Microsoft Windows boot splash screen 

After resetting or power cycling the virtual machine, it will boot successfully.
The virtual machine might resume booting after multiple hours or days
A memory dump analysis might reveal thread blocking on a timer expiry hours or days in the future
The blocking thread might be stuck in KeDelayExecutionThread() during PciStallForPowerChange()
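
The conditions listed above that are visible in a VM's configuration can be checked with a sketch like the following. It assumes the usual .vmx keys guestOS, virtualHW.version and numvcpus, treats a missing numvcpus as a single vCPU, and only matches guestOS values starting with windows8 (as in the host log excerpt earlier in this thread). The time since the last full power cycle is not stored in the .vmx file and is not checked. Both function names are hypothetical.

```shell
#!/bin/sh
# Print the quoted value of a .vmx key, or nothing if the key is absent.
# $1 = key as a regex (dots escaped), $2 = path to the .vmx file
vmx_get() {
    grep -i "^[[:space:]]*$1[[:space:]]*=" "$2" | head -n 1 |
        sed 's/^[^=]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/'
}

# Return 0 if the .vmx file matches the configuration-related conditions:
# Windows 8 / Server 2012 guest, hardware version 10, more than one vCPU.
vmx_matches_conditions() {
    vmx="$1"
    guest=$(vmx_get 'guestOS' "$vmx")
    hw=$(vmx_get 'virtualHW\.version' "$vmx")
    cpus=$(vmx_get 'numvcpus' "$vmx")
    : "${cpus:=1}"                       # assumption: absent key means 1 vCPU
    case "$guest" in
        windows8*) ;;
        *) return 1 ;;
    esac
    [ "$hw" = "10" ] || return 1
    [ "$cpus" -gt 1 ] 2>/dev/null || return 1
    return 0
}
```

A match only means the VM fits the listed profile, not that it will hang.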

Cause:
Starting with Windows 8 / Windows 2012 Server, during the boot process the operating


December 1st, 2014 6:17pm

Thanks for all the info in this thread. I have the same problem using ESXi 5.1 and 2012 R2 servers. Has anyone experienced this problem using 5.1?

December 16th, 2014 12:46am

Has anyone received any updates from VMware on this?  We are experiencing the same issues after Windows updates.  Any issues reported with the proposed workarounds?

Thanks,

Derek

January 5th, 2015 4:47pm

This is most likely low on VMware's radar. I have not heard when a fix will be issued.  We have implemented this workaround on our QA/Dev VMs (about 100 of them) and we have not had any issues since the .vmx modifications were made.
January 6th, 2015 12:21am

I discussed this with VMware Support yesterday. Here's an official KB article, hot off the press:

http://kb.vmware.com/kb/2092807

It has a few details not yet discussed on this thread, so definitely check it out if you're affected by the problem.

Joe.

January 21st, 2015 1:30pm

I've also been told that the fix/workaround is proposed to be included with ESXi 5.5 Update 3 and ESXi 6.0 Update 1.

January 27th, 2015 2:52pm

Does anyone see this working in their environment? We have a few VMs that still hung on reboot with this applied. When comparing the Advanced configuration properties we noticed the script set the parameter as "monitor_control.enable_softResetClearTSC = TRUE" while other parameters show their values as "true". Not sure if the "TRUE" vs "true" makes a difference.
March 3rd, 2015 9:36pm

Machines in our environment are also still hanging with the "monitor_control.enable_softResetClearTSC = TRUE" parameter set. Maybe it requires a server reboot before the setting starts to apply?  In that case the upcoming patching will show whether that is true or not.
I also think there is no difference between "TRUE" and "true".

March 4th, 2015 1:23pm

I've applied the workaround, restarted hosts and still have the issue.

Let's wait and see what 5.5 U3 brings. There is no chance I'm touching ESXi 6 until U1 comes out, and when it does, hopefully the fix will be in there too.

There is every chance it won't be, though. When I previously looked into this issue (maybe 3 or 4 months ago now), I was led to believe it was a Microsoft fault rather than a VMware one. The argument for this was a good one, and I have yet to see Microsoft admit to anything.

March 9th, 2015 9:44pm

It requires a complete shutdown and restart.

Parameters are not case sensitive.
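
To audit whether the line actually made it into a VM's .vmx file regardless of how the value is capitalized, something like this sketch could be used. The name has_tsc_workaround is hypothetical, and the case-insensitive match relies on the statement above that parameters are not case sensitive.

```shell
#!/bin/sh
# Return 0 if the .vmx file sets the workaround key to "true" in any letter case.
has_tsc_workaround() {
    grep -qi '^[[:space:]]*monitor_control\.enable_softResetClearTSC[[:space:]]*=[[:space:]]*"true"' "$1"
}
```

For example, has_tsc_workaround /vmfs/volumes/path/to/virtualmachine.vmx || echo "setting missing". This only inspects the file; it says nothing about whether the running VM has been fully shut down and restarted since the line was added.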

March 13th, 2015 1:37pm

Has anyone heard anything more on this issue?   I have applied the setting change and it does not make a difference...

I have to reboot multiple times to get my VMs to come up.

March 17th, 2015 1:47pm

Hi all,

Can someone please summarize this?

Doesn't VMware KB 2092807 have the resolution? Does it not solve this bug?
If I use the PS script, do I still need to restart my VMs?

(The script seems to do a "localhost" vMotion, which should create a new vmx file?)

March 18th, 2015 3:41pm

Has anyone opened a case with Microsoft on this issue? Is anyone seeing this in Hyper-V, Stand Alone, Xen? VMware has reported this to be a Microsoft issue and are unable to find any problems on the vmware side on our system.
March 18th, 2015 4:39pm

I can confirm that the solution VMware provided in KB 2092807 does not solve the bug. The required parameter was set on all Win2012 and Win2012 R2 machines in our environment. All servers were rebooted after that, but during this month's patching some of them still hung.
March 20th, 2015 3:32am

@andriktr  
Did you use the ps script?

March 20th, 2015 7:54am

Yes, the script was used to set the parameter.
To set the parameter without the script you have to turn off the VM. It's not possible to edit the VM config manually and set this parameter while the VM is turned on. Using the script you can set the parameter without turning off the VM.
March 20th, 2015 10:23am

Yep, but I don't want to run the script on my production VMs if it doesn't solve anything.....
March 20th, 2015 12:05pm

Can now confirm that I also experience the same problem even though I applied the "fix" ......
March 20th, 2015 1:43pm

Can also confirm that we are seeing this issue, have been for months and finally know why. Going to be opening cases regarding this issue.  Same behavior, after being online for about a month 2012 R2 servers will get hung during automatic patch reboot.
March 20th, 2015 5:44pm

I also created a ticket with VMware support. Let's wait for the answer. :)
March 31st, 2015 3:30am

This topic is archived. No further replies will be accepted.
