Sometimes, Windows 2012 R2 servers hang at the splash screen (spinning dots) and never boot. They are virtual machines running on ESXi 5.5. To resolve the issue, we just have to reset the VM; Windows then boots normally.
All of our servers are affected. No memory dump is generated and nothing is wrong in Event Viewer. Any ideas?
Hi,
Where did you check the events: on the physical machine or in the guest? It would be better to check the events on the physical machine, and to make sure that all the drivers are installed properly on the physical machine.
Zia
Hi kompakt,
First, please keep your server up to date.
If the issue persists (it hangs only sometimes), I would suggest contacting VMware:
https://communities.vmware.com/welcome
http://partnerweb.vmware.com/GOSIG/Windows_Server_2012.html
Best Regards
Elton Ji
Hello,
exact same situation over here.
Fully up to date vSphere 5.5 infrastructure. Only Windows 2012 R2 VMs affected. 2012 R2 running on physical hardware never showed this behavior. This happens after a reboot initiated after installing patches once a month (using LANdesk patch manager).
Regards,
Andreas
I just created a case with Microsoft this afternoon and I am waiting to hear back from them. We also use SC 2012 for patch deployment. In most of our cases, this happens after a patch deployment; however, I haven't been able to duplicate the issue so far with a reboot of the VM itself. The most recent event was when a DBA was installing additional SQL roles on a Windows Server 2012 VM and a reboot was required to finish the install. The VM had to be powered off and powered back on again.
I have also created a case with VMware and so far, they have turned up nothing.
We have the exact same issue. VMware ESXi 5.5 Update 1 and Update 2 servers. The Windows VM servers are patched by WSUS with 4 different patching times/groups. We have over 160 VMs, of which 25+ are now Windows 2012 R2 servers. All VMs patch correctly, but a RANDOM PORTION of the Windows 2012 R2 servers fail to complete their boot after patching (6 in the last patching cycle). They hang at the spinning circle of dots on the boot screen.
I have so far not been able to track this one down. They seem to be hanging very early in the boot cycle - early enough that the volumes (drives) are not marked as dirty when you 'reset' the VM (i.e. Data Protection Manager does not need to run a consistency check over the volumes at next boot-up).
I 'suspected' this has to do with heavy I/O on the underlying datastores, but have not been able to prove it. I moved a Windows 2012 R2 VM to a separate LUN and then generated lots of I/O against the LUN whilst I rebooted the VM. The datastore latency went up to 160ms+ but the VM still rebooted just fine... It doesn't rule out latency, but I just can't prove it...
Another option I have considered, but haven't tried yet, is to replace the virtual LSI controller with VMware ParaVirtual. It's not my standard, but if it is a bug in the LSI driver it would get around it. The ParaVirtual driver comes with caveats for MS clustered VMs.
Will be watching this thread with interest... There is definitely something wrong here. And I know I will be battling more and more Windows 2012 R2 that fail their late night patching cycles as the weeks go by. :-( So much for 'automated' patching...
I have been on PTO this weekend where I work so I haven't been checking my email until this morning. VMware finally has gotten back to me. There is a bug with ESXi and their engineers are working on a fix. VMware suggests making one of the changes below. If anyone implements any of these changes, please let me know if it does or doesn't work.
Starting with Windows 8 / Windows 2012 Server, during its boot process the operating system will reset the TSC (Time Stamp Counter, which increments by 1 for each passed cycle) on CPU0. It does not reset the TSC of the other vCPUs and the resulting discrepancy between two vCPUs' TSC can result in the OS not booting past the Windows splash screen, and a full power off and on will fix it.
Our engineering team here are currently working on a code change to accommodate this.
There is a workaround suggested from engineering to add a line of code to the vmx (configuration) file of the VM to prevent this from reoccurring.
This will basically tell the vmx file that the TSC for all vCPUs should be reset to zero on a soft reset of the machine, and not just CPU0.
Please note that this has not been tested extensively by engineering, and should be run at your own risk as it is just a workaround which has not been fully QE tested.
This can be done a few ways:
First method: Manually editing the VM's vmx file one VM at a time.
1. Power off the VM
2. Add the following line to the vmx file:
monitor_control.enable_softResetClearTSC = "TRUE"
3. Reload the VM
4. Power on the VM again.
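For anyone who wants to script the manual edit, here is a minimal sketch of step 2 as a small shell function. It is not from VMware: the function name is made up, it assumes the VM is already powered off, and it assumes the full key name is the one quoted elsewhere in this thread (monitor_control.enable_softResetClearTSC) with the value written in the usual quoted .vmx style. It only appends the line if the key is not already present and keeps a .bak copy of the file.

```shell
#!/bin/sh
# Sketch only: append the workaround setting to a .vmx file if it is missing.
# Assumptions: the VM is powered off, and the key name is the one quoted
# elsewhere in this thread. add_tsc_workaround is a hypothetical helper name.

KEY='monitor_control.enable_softResetClearTSC'

add_tsc_workaround() {
    vmx="$1"
    [ -f "$vmx" ] || { echo "no such file: $vmx" >&2; return 1; }

    # Leave the file alone if the key is already there (any value, any case).
    if grep -qi "^[[:space:]]*${KEY}[[:space:]]*=" "$vmx"; then
        echo "already set: $vmx"
        return 0
    fi

    # Keep a backup next to the original before changing anything.
    cp "$vmx" "$vmx.bak" || return 1

    # Make sure the new setting starts on its own line.
    if [ -n "$(tail -c 1 "$vmx")" ]; then
        echo >> "$vmx"
    fi

    printf '%s = "TRUE"\n' "$KEY" >> "$vmx"
    echo "updated: $vmx"
}
```

Call it with the path to the VM's .vmx file, then reload and power on the VM as in steps 3 and 4. Running it a second time on the same file should change nothing.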
Second method: Doing this to every VM on a host at one time.
1. SSH to the ESX host
2. Run the following command:
echo 'monitor_control.enable_
3. Run the following command to do a suspend-resume in order to apply the setting so that affected guests won't hang during the next reboot:
vim-cmd vmsvc/getallvms | sed -n 's/\(^[0-9]\+\).* windows8.*Guest.*$/\1/p' | while read vmid; do state=$(vim-cmd vmsvc/power.getstate ${vmid} | sed -n 's/^.*\(Powered on\).*$/\1/p'); if [ "$state" ]; then vim-cmd vmsvc/power.suspendResume ${vmid} && sleep 5; fi; done;
Last method: Using PowerCLI to do this to every VM in the environment.
Open PowerCLI, connect to vCenter server and run the following command:
ForEach ($vm in (Get-VM)){
    $vmv = Get-VM $vm | Get-View
    $name = $vmv.Name
    $guestid = $vmv.Summary.Config.GuestId
    $state = $vmv.Summary.Runtime.
    $vmx = New-Object VMware.Vim.
    $vmx.extraConfig += New-Object VMware.Vim.OptionValue
    $vmx.extraConfig[0].key = "monitor_control.enable_softResetClearTSC"
    $vmx.extraConfig[0].value = "TRUE"
    # Only touch Windows 8 / Server 2012 guests
    if ($guestid -like "windows8*Guest") {
        ($vmv).ReconfigVM_Task($vmx)
        if ($state -eq "poweredOn") {
            # Migrate to the VM's current host; $_ is not set inside a ForEach statement
            $vmv.MigrateVM_Task($null, $vmv.Runtime.Host, 'highPriority', $null)
        }
    }
}
Note:
If you are using Solaris VMs in the environment, do not run this against those Solaris VMs as they could potentially hang with that setting in the vmx.
Also, while the script is running, do not do a vMotion, suspend, clone, or snapshot operation at the same time. This is very important, as it could cause the script to fail.
From looking at the logs, it seems like you are not running Solaris as an OS anyway, at least on these 2 hosts:
rhayden@scripts-prod-3 HostLogs29thOct $ find esx*/vmfs/volumes/ -maxdepth 3 -name "*.vmx" -exec grep 'guestOS' {} \; | awk '{print $NF }' | sort | uniq -c
5 "longhorn"
1 "longhorn-64"
1 "rhel6-64"
1 "sles11-64"
54 "windows7srv-64"
19 "windows8srv-64"
1 "winnetenterprise-64"
7 "winnetstandard"
1 "winNetStandard
If after applying these settings to the VMs this does not work after the next patching / updating (you are still seeing the issue), what we would need to do at that point is get the suspended state file for the VM to send to engineering, as we cannot reproduce this issue in-house.
If this occurs, this is how to gather the information we would need to send to engineering:
(Do not reboot the VMs until this is done)
1. SSH to the host and run the following command:
vm-support --listvms
2. Now run this command:
vm-support --performance --manifests="HungVM:Coredump_
(Change the path of the VM in the command above to the actual path).
That will put a tgz file in /var/tmp. The file name is displayed when complete. Copy this file off the host manually.
- Edited by Chris Bonsted Friday, November 07, 2014 10:06 AM
- Proposed as answer by MJMorris Thursday, May 14, 2015 1:33 PM
Chris,
Do you have any update from VMware on this? We have a bunch of servers experiencing this problem.
Chris,
Have you implemented any of the workarounds or are you waiting for an update? Please keep us posted on VMware's response.
I plan on making these changes to 8-12 VMs tomorrow and will wait to see what happens. We will be patching our QA environment over the weekend.
As of right now, there isn't a time frame for if or when VMware will release a patch for this.
Sorry for the delay in posting my results.
I have updated 10 2012 R2 VMs with the changes in the .vmx file. None of them experienced any issues upon rebooting when they were patched. I am going to expand my sample in our QA environment to 20-30 VMs; however, I am going to say that making those modifications did help.
There isn't a public KB article from VMware about this issue other than this one:
http://kb.vmware.com/kb/2082042
There is no ETA on a patch from VMware. I hope that this information helps.
- Edited by Chris Bonsted Wednesday, November 26, 2014 3:01 PM
We had this issue several months ago and Microsoft pointed us to a clock time mismatch. They suggested we go to our hypervisor vendor. VMware did minor investigations and found nothing, of course. We are now seeing the same issue again, as described in your post. Pointing VMware to your post made them release to us the internal document stating what you found. It is still internal only to VMware and Microsoft.
We have applied the PowerCLI script to all of our servers and it does modify the VMX without any issue. The problem is you still need to do a reset or power off/on via the virtual power buttons in VMware. OS reboots do not work. So we are in the middle of scheduling outages for our 400+ 2012 servers.
Thanks again for the post. I will post what VMware gave me on the symptoms for this issue to happen.
Symptoms:
Under the following conditions, you are:
Running Windows 8 or 2012 Server or later as the guest operating system on the virtual machine
Running on ESXi 5.5 or later with virtual machine hardware version 10 (vmx-10)
The virtual machine has not experienced a full power cycle (powered off / powered on) for more than two months.
The virtual machine is configured with more than one vCPU.
You might see the following symptoms:
After rebooting, Windows 8 or 2012 Server virtual machines might hang during the Microsoft Windows boot splash screen
After resetting or power cycling the virtual machine, it will boot successfully.
The virtual machine might resume booting after multiple hours or days
A memory dump analysis might reveal thread blocking on a timer expiry hours or days in the future
The blocking thread might be stuck in KeDelayExecutionThread() during PciStallForPowerChange()
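To get a rough idea of which VMs match the static conditions in that list, something like the following could be run against the .vmx files. This is only a sketch under assumptions: the helper names are made up, it relies on the common .vmx keys guestOS, virtualHW.version and numvcpus (numvcpus is normally absent for a single-vCPU VM), it matches guestOS values starting with windows8 in the same spirit as the scripts earlier in this thread, and it cannot see how long ago the VM was last power cycled.

```shell
#!/bin/sh
# Sketch only: flag a .vmx file that matches the static conditions listed above.
# vmx_value and matches_hang_conditions are hypothetical helper names.

vmx_value() {
    # Print the value of key $2 in file $1, without the surrounding quotes.
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\"\(.*\)\"[[:space:]]*\$/\1/p" "$1" | head -n 1
}

matches_hang_conditions() {
    vmx="$1"
    guest=$(vmx_value "$vmx" 'guestOS')
    hw=$(vmx_value "$vmx" 'virtualHW\.version')
    cpus=$(vmx_value "$vmx" 'numvcpus')

    # Guest must be in the Windows 8 / Server 2012 family.
    case "$guest" in
        windows8*) ;;
        *) return 1 ;;
    esac

    # Virtual hardware version 10, as stated in the conditions.
    [ "${hw:-0}" -eq 10 ] || return 1

    # More than one vCPU (a missing numvcpus is treated as 1).
    [ "${cpus:-1}" -gt 1 ] || return 1

    return 0
}
```

The function returns success for a candidate VM and failure otherwise, so it can be used in a loop over .vmx files, printing the path of each file that matches. The "no full power cycle for more than two months" condition still has to be checked separately.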
Cause:
Starting with Windows 8 / Windows 2012 Server, during the boot process the operating
Thanks for all the info in this thread. I have the same problem using ESXi 5.1 and 2012 R2 servers. Has anyone experienced this problem using 5.1?
Has anyone received any updates from VMware on this? We are experiencing the same issues after Windows updates. Any issues reported with the proposed workarounds?
Thanks,
Derek
I discussed this with VMware Support yesterday. Here's an official KB article, hot off the press:
http://kb.vmware.com/kb/2092807
It has a few details not yet discussed on this thread, so definitely check it out if you're affected by the problem.
Joe.
I've also been told that the fix/workaround is proposed to be included with ESXi 5.5 Update 3 and ESXi 6.0 Update 1.
Machines in our environment are also still hanging with the "monitor_control.enable_softResetClearTSC = TRUE" parameter specified. Maybe the setting requires a server reboot before it takes effect? In that case, the upcoming patching will show whether that is true or not.
I also think there is no difference between "TRUE" and "true".
- Edited by Andrej Trusevic Wednesday, March 04, 2015 1:28 PM
- Proposed as answer by MiliusXP Friday, March 13, 2015 1:36 PM
- Unproposed as answer by MiliusXP Friday, March 13, 2015 1:36 PM
I've applied the workaround, restarted the hosts and still have the issue.
Let's wait and see what 5.5 U3 brings. No chance I'm touching ESXi 6 until U1 comes out, and when it does, hopefully the fix will be in there too.
There is every chance it won't be, though. When I previously looked into this issue (maybe 3 or 4 months ago now), I was led to believe it was a Microsoft fault rather than a VMware one. The argument for this was a good one, and I have yet to see Microsoft admit to anything.
It requires a complete shutdown and restart.
The parameters are not case sensitive.
Has anyone heard anything more on this issue? I have applied the setting change and it does not make a difference...
I have to reboot multiple times to get my VMs to come up.
Hi all,
Can someone please summarize this?
Doesn't VMware KB 2092807 have the resolution? Doesn't it solve this bug?
If I use the PS script, do I still need to restart my VMs?
(The script seems to do a "localhost" vMotion, which should create a new vmx file?)
@andriktr
Did you use the ps script?
To set the parameter without the script, you have to turn off the VM. It's not possible to manually edit the VM config and set this parameter while the VM is turned on. Using the script, you can set the parameter without turning off the VM.
- Edited by Andrej Trusevic Friday, March 20, 2015 10:30 AM
The good news is that they also confirmed this problem will be fixed in 5.5 U3, which will be released between the 2nd and 3rd quarter.
Hello everybody,
kind of late to the party. We are running 2012 R2 on a physical machine using the Hyper-V role. This is a no-HA lab machine. We are having the same issues as described here, just with Hyper-V. Again, the HOST system is the Hyper-V server, guests are a mix of XP to Server 2012 R2.
I have no idea how I could apply any of the fixes described here to a physical machine.
Is there any news from Microsoft on this issue?
Regards,
Michael
For those who say the "fix" using the PowerShell script did not work: you did read in the KB that "The virtual machine(s) need to be shutdown and powered on for the changes to take affect."?
Was that done, or did you simply reboot the VMs (which would not fix it)? Some say they rebooted the ESX host (which is not the fix either).
What about reloading the VMX settings while the VM is running (Reference: http://kb.vmware.com/kb/1026043), and then restarting normally? Has anyone tried that? It seems to work for other settings that normally don't take effect without a full shutdown and power-on.
Also, we've seen this same exact behavior since we installed Patch 4 for McAfee VirusScan Enterprise 8.8. It's a known issue with Patch 4 (Reference: https://kc.mcafee.com/corporate/index?page=content&id=KB78495 - issue 1020874). Patch 5 is supposed to be released to the general public next week.
- Edited by Random Anonymous Name Thursday, May 14, 2015 5:27 PM
I don't think this is the whole problem.
It seems this only happens after installing an update (maybe a specific one); a normal reboot works just fine.
It looks like the issue happens only if the VM has not been powered off for more than two months.
This only applies to virtual machine hardware version 10 as Windows resets the TSC on all CPUs on virtual machines with older hardware versions (which do not support hypervisor.cpuid.v2).
Pulling my hair out with this patch cycle and 2012 R2/5.5... found this thread. Sorry that we are all having this problem, but good to see I'm not the only one and not going crazy. Found a workaround for all of the small environments: if you shut down the server and start it from VMware, there is no problem. I find that better than "crashing" it every time it won't boot, which makes me a little nervous. I guess I'll just do this until U3 comes out.
Hope this helps someone out....have a good weekend.
The ESXi VMs use an LSI controller. Isn't this hotfix needed? It addresses hanging issues with LSI controllers:
https://support.microsoft.com/en-us/kb/2966870#/en-us/kb/2966870