Sometimes, our Windows 2012 R2 servers hang at the splash screen (spinning dots) and never finish booting. They are virtual machines installed on ESXi 5.5. To resolve the issue, we just have to reset the VM, and then Windows boots normally.
All of our servers are affected. No memory dump is generated and there is nothing wrong in Event Viewer. Any ideas?
Hi,
Where have you checked the events — on the physical machine or in the guest? You'd better check the physical machine's events, and make sure that all the drivers on the physical machine are installed properly...
Zia
Hi kompakt,
First, please keep your server up to date.
If the issue persists (it hangs sometimes), I would suggest you contact VMware:
https://communities.vmware.com/welcome
http://partnerweb.vmware.com/GOSIG/Windows_Server_2012.html
Best Regards
Elton Ji
Hello,
exact same situation over here.
Fully up to date vSphere 5.5 infrastructure. Only Windows 2012 R2 VMs affected. 2012 R2 running on physical hardware never showed this behavior. This happens after a reboot initiated after installing patches once a month (using LANdesk patch manager).
Regards,
Andreas
I just created a case with Microsoft this afternoon and I am waiting to hear back from them. We also use SC 2012 for patch deployment. In most of our cases this happens after a patch deployment; however, I can't duplicate this issue so far with a reboot of the VM itself. The most recent event was when a DBA was installing additional SQL roles on a Windows Server 2012 VM and a reboot was required to finish the install. The VM had to be powered off and powered back on again.
I have also created a case with VMware and so far, they have turned up nothing.
We have the exact same issue. VMware ESXi 5.5 Update 1 and Update 2 servers. The Windows VM servers are patched by WSUS with 4 different patching times/groups. We have over 160 VMs, of which 25+ are now Windows 2012 R2 servers. All VMs patch correctly, but a RANDOM PORTION of the Windows 2012 R2 servers fail to complete their boot after patching (6 last patching cycle). They hang at the spinning circle of dots on the boot screen.
I have so far not been able to track this one down. They seem to be hanging very early in the boot cycle - early enough that the volumes (drives) are not marked as dirty when you 'reset' the VM (i.e. Data Protection Manager does not need to run a consistency check over the volumes at next boot-up).
I suspected this has to do with heavy I/O on the underlying datastores, but have not been able to prove it. I moved a Windows 2012 R2 VM to a separate LUN and generated lots of I/O against that LUN while I rebooted the VM. The datastore latency went up to 160ms+ but the VM still rebooted just fine... It doesn't rule out latency, but I just can't prove it...
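For anyone wanting to reproduce that kind of load test, something as simple as dd against a scratch file on the target datastore works. This is a sketch only — the path is hypothetical, and busybox dd on ESXi may not support every flag:

```shell
# Hypothetical scratch path; point this at a file on the datastore under test.
target="${IOLOAD_TARGET:-/vmfs/volumes/testlun/ioload.bin}"
# 4 GiB of sequential writes, synced to disk so the array actually sees the load.
if dd if=/dev/zero of="$target" bs=1M count=4096 conv=fsync 2>/dev/null; then
    rm -f "$target"
else
    echo "could not write to $target (check the datastore path)"
fi
```

Run it in one SSH session while rebooting the VM from another, and watch datastore latency in esxtop.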
Another option I have considered, but haven't tried yet, is to replace the virtual LSI controller with VMware ParaVirtual. It's not my standard, but if it is a bug in the LSI driver it would get around it. The ParaVirtual driver comes with caveats for MS clustered VMs.
Will be watching this thread with interest... There is definitely something wrong here. And I know I will be battling more and more Windows 2012 R2 VMs that fail their late-night patching cycles as the weeks go by. :-( So much for 'automated' patching...
I have been on PTO this weekend so I haven't been checking my email until this morning. VMware has finally gotten back to me. There is a bug in ESXi and their engineers are working on a fix. VMware suggests making one of the changes below. If anyone implements any of these changes, please let me know whether or not it works.
Starting with Windows 8 / Windows Server 2012, during its boot process the operating system resets the TSC (TimeStampCounter, which increments by 1 for each elapsed cycle) on CPU0. It does not reset the TSC of the other vCPUs, and the resulting discrepancy between two vCPUs' TSCs can result in the OS not booting past the Windows splash screen; a full power off and on will fix it.
Our engineering team is currently working on a code change to accommodate this.
There is a workaround suggested by engineering: add a line to the vmx (configuration) file of the VM to prevent this from reoccurring.
This basically tells the VM that the TSC for all vCPUs should be reset to zero on a soft reset of the machine, and not just CPU0's.
Please note that this has not been tested extensively by engineering, and should be run at your own risk as it is just a workaround which has not been fully QE tested.
This can be done a few ways:
First method: Manually editing the VM's vmx file one VM at a time.
1. Power off the VM
2. Add the following line to the vmx file:
monitor_control.enable_softResetClearTSC = "TRUE"
3. Reload the VM
4. Power on the VM again.
Second method: Doing this to every VM on a host at one time.
1. SSH to the ESX host
2. Run the following command:
for vmx in $(find /vmfs/volumes/ -maxdepth 3 -name '*.vmx'); do echo 'monitor_control.enable_softResetClearTSC = "TRUE"' >> "$vmx"; done
3. Run the following command to do a suspend-resume in order to apply the setting so that affected guests won't hang during the next reboot:
vim-cmd vmsvc/getallvms | sed -n 's/\(^[0-9]\+\).* windows8.*Guest.*$/\1/p' | while read vmid; do state=$(vim-cmd vmsvc/power.getstate ${vmid} | sed -n 's/^.*\(Powered on\).*$/\1/p'); if [ "$state" ]; then vim-cmd vmsvc/power.suspendResume ${vmid} && sleep 5; fi; done
Last method: Using PowerCLI to do this to every VM in the environment.
Open PowerCLI, connect to vCenter server and run the following command:
ForEach ($vm in (Get-VM)){
    $vmv = Get-VM $vm | Get-View
    $name = $vmv.Name
    $guestid = $vmv.Summary.Config.GuestId
    $state = $vmv.Summary.Runtime.PowerState
    $vmx = New-Object VMware.Vim.VirtualMachineConfigSpec
    $vmx.extraConfig += New-Object VMware.Vim.OptionValue
    $vmx.extraConfig[0].key = "monitor_control.enable_softResetClearTSC"
    $vmx.extraConfig[0].value = "TRUE"
    if ($guestid -like "windows8*Guest") {
        ($vmv).ReconfigVM_Task($vmx)
        if ($state -eq "poweredOn") {
            $vmv.MigrateVM_Task($null, $vmv.Runtime.Host, 'highPriority', $null)
        }
    }
}
Note:
If you are using Solaris VMs in the environment, do not run this against those Solaris VMs as they could potentially hang with that setting in the vmx.
Also, when the script is running, do not do a vmotion, suspend, clone, or snapshot operation at the same time - this is very important, as it could cause the script to fail.
From looking at the logs, it seems like you are not running Solaris as an OS anyway, at least on these 2 hosts:
rhayden@scripts-prod-3 HostLogs29thOct $ find esx*/vmfs/volumes/ -maxdepth 3 -name "*.vmx" -exec grep 'guestOS' {} \; | awk '{print $NF }' | sort | uniq -c
5 "longhorn"
1 "longhorn-64"
1 "rhel6-64"
1 "sles11-64"
54 "windows7srv-64"
19 "windows8srv-64"
1 "winnetenterprise-64"
7 "winnetstandard"
1 "winNetStandard
If after applying these settings to the VMs you are still seeing the issue after the next patching/updating, what we would need to do at that point is get the suspended state file for the VM to send to engineering, as we cannot reproduce this issue in-house.
If this occurs, this is how to gather the information we would need to send to engineering:
(Do not reboot the VMs until this is done.)
1. SSH to the host and run the following command:
vm-support --listvms
2. Now run this command:
vm-support --performance --manifests="HungVM:Coredump_
(Change the path of the VM in the command above to the actual path).
That will put a tgz file in /var/tmp. The file name is displayed when complete. Copy this file off the host manually.
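Before uploading, the bundle can be sanity-checked with standard tar usage. The file name below is a placeholder — use whatever name vm-support actually printed:

```shell
# Placeholder name; substitute the file vm-support reported in /var/tmp.
bundle="/var/tmp/esx-support-bundle.tgz"
if [ -f "$bundle" ]; then
    # Peek at the bundle contents without extracting it.
    tar -tzf "$bundle" | head -20
else
    echo "bundle not found at $bundle"
fi
```

Once it looks sane, copy it off the host with scp or the datastore browser.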
- Edited by Chris Bonsted Friday, November 07, 2014 10:06 AM
Chris,
Do you have any update from VMWare on this? We have a bunch of servers experiencing this problem.
Chris,
Have you implemented any of the workarounds, or are you waiting for an update? Please keep us posted on VMware's response.
I plan on making these changes to 8-12 VMs tomorrow and will wait to see what happens. We will be patching our QA environment over the weekend.
As of right now, there isn't a time frame for if or when VMware will create a patch for this.
Sorry for the delay in posting my results.
I have updated 10 of the 2012 R2 VMs with the changes in the .vmx file. None of them experienced any issues upon rebooting when they were patched. I am going to expand my sample in our QA environment to 20-30 VMs; however, I am going to say that making those modifications did help.
There isn't a public KB article from VMware about this issue other than this one:
http://kb.vmware.com/kb/2082042
There is no ETA on a patch from VMware. I hope that this information helps.
- Edited by Chris Bonsted Wednesday, November 26, 2014 3:01 PM
We had this issue several months ago and Microsoft pointed us to a clock time mismatch. They suggested we go to our hypervisor vendor. VMware did minor investigations and found nothing, of course. We are currently back to the same issue again, as in your post. Finding your post here prompted VMware to release to us the internal document stating what you found. It is still internal-only to VMware and Microsoft.
We have applied the PowerCLI script to all of our servers and it modifies the VMX without any issue. The problem is that you still need to do a reset or a power off/on via the virtual power buttons in VMware; OS-level reboots do not work. So we are in the middle of scheduling outages for our 400+ 2012 servers.
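Since OS-level reboots don't pick up the setting, the hard power cycle can at least be scripted per host, reusing the vim-cmd/sed pattern from the workaround post earlier in this thread. This is a sketch only and hard-powers-off every windows8-family guest on the host, so run it strictly inside a scheduled outage window:

```shell
# Hard power-cycle every powered-on Windows 8 / 2012-family VM on this host.
# Sketch only; requires the ESXi shell, where vim-cmd is available.
if command -v vim-cmd >/dev/null; then
    vim-cmd vmsvc/getallvms | sed -n 's/\(^[0-9]\+\).* windows8.*Guest.*$/\1/p' | while read vmid; do
        state=$(vim-cmd vmsvc/power.getstate "${vmid}" | sed -n 's/^.*\(Powered on\).*$/\1/p')
        if [ "$state" ]; then
            vim-cmd vmsvc/power.off "${vmid}" && sleep 5
            vim-cmd vmsvc/power.on "${vmid}"
        fi
    done
else
    echo "vim-cmd not found; run this on the ESXi host"
fi
```

Unlike a guest OS reboot, power.off followed by power.on is a full power cycle, so the new vmx setting takes effect.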
Thanks again for the post. I will post what VMware gave me on the symptoms for this issue to happen.
Symptoms:
This issue occurs when all of the following conditions apply:
You are running Windows 8 / Windows Server 2012 or later as the guest operating system on the virtual machine
You are running ESXi 5.5 or later with virtual machine hardware version 10 (vmx-10)
The virtual machine has not experienced a full power cycle (powered off / powered on) for more than two months
The virtual machine is configured with more than one vCPU
You might see the following symptoms:
After rebooting, Windows 8 or 2012 Server virtual machines might hang during the Microsoft Windows boot splash screen
After resetting or power cycling the virtual machine, it will boot successfully.
The virtual machine might resume booting after multiple hours or days
A memory dump analysis might reveal thread blocking on a timer expiry hours or days in the future
The blocking thread might be stuck in KeDelayExecutionThread() during PciStallForPowerChange()
Cause:
Starting with Windows 8 / Windows Server 2012, during the boot process the operating system resets the TSC on CPU0 but not on the other vCPUs, and the resulting discrepancy can keep the OS from booting past the splash screen.
Thanks for all the info in this thread. I have the same problem using ESXi 5.1 and 2012 R2 servers. Has anyone experienced this problem using 5.1?
Has anyone received any updates from VMware on this? We are experiencing the same issues after Windows updates. Any issues reported with the proposed workarounds?
Thanks,
Derek
I discussed this with VMware Support yesterday. Here's an official KB article, hot off the press:
http://kb.vmware.com/kb/2092807
It has a few details not yet discussed on this thread, so definitely check it out if you're affected by the problem.
Joe.
I've also been told that the fix/workaround is proposed to be included with ESXi 5.5 Update 3 and ESXi 6.0 Update 1.
Machines in our environment are still hanging even with the "monitor_control.enable_softResetClearTSC = TRUE" parameter set. Maybe it requires a server reboot before the setting starts applying? In that case the upcoming patching cycle will show whether that's true or not.
I also think there is no difference between "TRUE" and "true".
I've applied the workaround, restarted hosts, and still have the issue.
Let's wait and see what 5.5 U3 brings; there's no chance I'm touching ESXi 6 until U1 comes out, and when that does, hopefully the fix will be in there too.
There is every chance it won't be, though; when I previously looked into this issue (maybe 3 or 4 months ago now), I was led to believe it was a Microsoft fault rather than VMware's. The argument for this was a good one, and I have yet to see Microsoft admit to anything.
Requires a complete shutdown and restart.
The parameters are not case sensitive.
Has anyone heard anything more on this issue? I have applied the setting change and it does not make a difference...
I have to reboot multiple times to get my VMs to come up.
Hi all,
Can someone please summarize this?
Doesn't VMware KB 2092807 have the resolution? It doesn't solve this bug?
If I use the PS script, do I still need to restart my VMs?
(The script seems to do a "localhost" vMotion, which should create a new vmx file?)
@andriktr
Did you use the ps script?
To set the parameter without the script, you are required to turn off the VM; it's not possible to manually edit the VM config and set this parameter while the VM is powered on. Using the script, you can set the parameter without turning off the VM.
- Edited by Andrej Trusevic Friday, March 20, 2015 10:30 AM