Server 2012 Virtual Machine BSOD after hardware crash - 4 times

I have now had 4 Server 2012 VMs fail to boot after a hardware failure in the cluster, requiring the servers to be rebuilt. This has happened in a span of 2 months since we started deploying this configuration and using Server 2012.

This has happened at 3 different locations, meaning it is NOT tied to a specific SAN, switch, or server.
-The first and second failures occurred when both Hyper-V servers BSODed at the same time, killing the guests. The BSOD was due to a network driver issue; both systems received the same bad packet.
-The third failure was in our lab as we were building the cluster prior to production deployment. A power outage took down the servers but left the storage and switch running.
-The fourth failure was when the iSCSI switch rebooted on us. None of the Server 2008 machines were affected, but 1 of our 2012 Domain Controllers BSODed and needed to be rebuilt from scratch.

Here's the hardware setup for each location. I have this setup in 15+ locations, and the problem has happened at several different ones, so it is not specific to one location's hardware.

-2 Cisco UCS servers running Server 2012 Datacenter with Hyper-V in a cluster.
-Nimble storage array for iSCSI volumes.
-HP switch for iSCSI backbone.
-All hardware and drivers are updated at the time of the build.

The dead systems all share these things:
-Running Server 2012 Standard
-Domain controller
-Created from the same template VHDX file

At this point I have no faith that any VM will survive any sort of hardware crash or power outage. I could see a crash occasionally leaving a VM unable to boot, but we've had 5 system crashes, and 4 of those 5 have forced me to rebuild a Server 2012 guest VM. That's an 80% failure rate.

Anyone else experiencing this level of failure?

I've been doing iSCSI VMs on VMware for 6+ years, seen countless crashes, and never had to rebuild a VM because of one. This is my first go at Hyper-V. I'm not trying to bash it and say VMware is better; I'm just amazed at what I'm seeing, as it's not what I expected.

I've followed all of Microsoft's and the storage vendor's guidelines for iSCSI configuration. The problem doesn't appear to be storage related, though. The VM will start to boot, so the VHDX file itself is not corrupt; it just seems that Server 2012 as a VM is not very crash resilient.
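
For anyone wanting to sanity-check the same things, here's roughly what I ran (a sketch; the VHDX path is a placeholder, and both cmdlets ship with Server 2012):

    # Ask Hyper-V whether the virtual disk file itself is valid (path is an example)
    Test-VHD -Path 'C:\ClusterStorage\Volume1\DC01.vhdx'

    # Confirm the iSCSI sessions to the array are connected and persistent
    Get-IscsiSession | Format-Table TargetNodeAddress, IsConnected, IsPersistent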

I have some systems in the lab now getting ready to ship, so I am going to see if I can BSOD a Server 2012 guest running on local storage.

May 17th, 2013 12:07am

Hi,

Blue screen errors in Windows are usually related to hardware failure, such as a storage fault or a memory error.

As you mentioned, the Hyper-V physical hardware has had failures in the past, such as the storage power outage, which can leave bad data blocks on disk. If a VM's VHD file happens to use such a block, and that block stores critical data, the VM may BSOD.

So I think you should remove the guest VMs that BSOD, run Disk Defragmenter, and then redeploy the guest VMs.

Also, if you have VM backups, you can restore your VMs to a backup point.
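
Before redeploying, it may also be worth checking the host volume that holds the VHD files for bad blocks; a rough sketch (the drive letter is only an example):

    # Scan the volume for file system errors and bad sectors (D: is an example)
    chkdsk D: /f /r

    # Or the Server 2012 cmdlet equivalent for an online scan
    Repair-Volume -DriveLetter D -Scan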

For more information, please refer to the following MS article:

Demystifying the 'Blue Screen of Death'
http://technet.microsoft.com/en-us/library/cc750081.aspx

If you want to track down the cause of the BSOD, please collect a dump file and contact Microsoft Customer Support Services (CSS) so that a dedicated Support Professional can help you with this issue.
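
To make sure a usable dump is captured first, something along these lines inside the guest may help (a sketch; value 2 selects a kernel memory dump, written to %SystemRoot%\MEMORY.DMP):

    # Enable a kernel memory dump on the next bug check
    Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl' `
        -Name CrashDumpEnabled -Value 2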

To obtain phone numbers for a specific technology request, please refer to the website listed below:

http://support.microsoft.com/default.aspx?scid=fh;EN-US;PHONENUMBERS

If you are outside the US, please refer to http://support.microsoft.com for regional support phone numbers.

Thanks for your understanding.

May 17th, 2013 7:13am

Lawrence,

Thanks for the response. I realize a BSOD can be caused by a hardware fault and by blocks getting corrupted in the VHD file.

My concern is how frequently this is happening. I have an 80% failure rate on Server 2012 VMs. The physical Server 2012 systems have not experienced this problem in the crashes, and the 2008 and 2008 R2 systems have not been affected in this way when crashes happen.

I am going to build a Server 2012 Hyper-V box, then throw a bunch of Server 2012 VMs on it and pull the power to see if I can replicate this behavior repeatedly.

I've seen hundreds of VM crashes over the past few years, but in that time I have never seen this failure rate. I have seen maybe 3 systems needing rebuilds in all that time, which is closer to a 1% failure rate.

I'll report back if I find anything after my tests.

May 17th, 2013 2:57pm

Update: I built 3 Server 2012 VMs running on local disk. The physical machine is a Cisco UCS with Server 2012 Hyper-V installed.


The VMs:
Server1 was promoted to be a DC in a test domain, so there were no users or computers other than the DC and the default users.

Server2 and Server3 were left as a bare-bones OS, not joined to a domain.

I pulled the power on the physical server and let it boot and launch the VMs 3 times. All 3 times all VMs were fine.

I then decided to add Server2 and Server3 as DCs in our real domain, which has 1000+ users and computers. Once they were properly promoted and stable, I pulled the power on the physical server. THIS RESULTED IN A DEAD DC on the FIRST CRASH.

I will continue testing, but to anyone virtualizing Server 2012 DCs: I recommend being very cautious. Or maybe this is just something to watch out for with Server 2012 DCs in general. For reference, a sketch of the test setup is below.
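
Roughly how the test guests were built (a sketch with placeholder names, sizes, and domain; Install-ADDSDomainController prompts for the remaining details):

    # On the Hyper-V host: create and start a test guest on local storage
    New-VM -Name 'Server2' -MemoryStartupBytes 2GB `
        -NewVHDPath 'D:\VMs\Server2.vhdx' -NewVHDSizeBytes 60GB
    Start-VM -Name 'Server2'

    # Inside the guest: promote it to a DC in the existing domain
    Install-WindowsFeature AD-Domain-Services
    Install-ADDSDomainController -DomainName 'corp.example.com' -InstallDns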


May 18th, 2013 12:21am

Were you ever able to solve this, foodandbikes? I have the exact same problem, but it seems like the sort of thing that should be recoverable...
May 30th, 2013 7:16pm

I've seen this same issue so far with 2 host servers running 2012 Datacenter, and with guest VMs that were 2008 R2 DCs, Exchange servers, or anything for that matter... the issue turned out to be the RAM in use. I had to remove the RAM and test each and every one of the sticks; after replacing some of them, everything ran smoothly.

The RAM sticks were all new, by the way; brand new hardware, all for the purpose of testing and clustering. Seeing that you have this many servers and this many locations failing, this will undoubtedly be a nightmare!

The hosts were running without any issues. I even did stress tests, overclocking, underclocking, memory tests... everything under the sun you could think of, and no apparent errors came back showing any sign of bad RAM, yet the VMs kept crashing. I was losing my mind, but I decided to test the RAM manually anyway. Sure enough, as soon as I started taking out sticks and testing different combinations, it became apparent which RAM sticks were not agreeing with Server 2012. I still don't understand why this was happening, as those same sticks work elsewhere with no problem... DDR3 8GB 1600MHz sticks.
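
For anyone else chasing RAM, the built-in diagnostic plus its event-log results are a quick first pass (a sketch; in my case it still took physically swapping sticks):

    # Schedule the built-in Windows Memory Diagnostic for the next reboot
    mdsched.exe

    # After the reboot, pull the results from the System event log
    Get-WinEvent -LogName System |
        Where-Object { $_.ProviderName -eq 'Microsoft-Windows-MemoryDiagnostics-Results' } |
        Select-Object TimeCreated, Message -First 5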

I'll keep watching this to see if you end up with the same results or find a different solution! This is plain ridiculous... when installing multiple servers you would be sitting there guessing whether the RAM sticks will work or not. Especially when you have 192 GB of RAM or more, that's a lot of work and money.

June 6th, 2013 8:41pm

Hey guys,

After many attempts and different tests... I wanted to see if you guys could try disabling the WinRM service on the host machine and see if this fixes the issues. I'm testing this now. I've noticed a couple of glitches with the RAM readings, and one machine that had been stable for a month had WinRM disabled...

Let me know if this helps... clearly, if the hosts are on the domain and you access them from the command line, then this may not be the best solution, but if you use a third-party monitoring tool or remote access you should be fine with this setting. A sketch of the change is below.
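
If you want to try the same thing, this is roughly what I'm running on the hosts (easily reversible by setting the startup type back to Automatic):

    # Stop WinRM and keep it from starting on the next boot
    Stop-Service -Name WinRM
    Set-Service -Name WinRM -StartupType Disabled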

cheers!

June 12th, 2013 2:20am

Great news!

I've confirmed that disabling WinRM definitely solves the BSOD issues. As soon as I disabled it, rebooted, and retested, everything ran smoothly.

The workaround would have to be using third-party monitoring tools, or simply remote access for remote management... definitely not ideal for a large enterprise, however it's manageable. It's crappy that you lose functionality in such a large module.

They need to make it more lenient towards compatibility so that it stops crashing the VMs. Wait for MS to bring out their new patch!

G

June 13th, 2013 2:10pm

Can you tell me if disabling the WinRM service worked for you?
June 21st, 2013 5:28pm

I want to revive this thread to see if anybody else has experienced similar problems with 2012 Hyper-V hosts.

I have done extensive testing on this issue, and for me the problem seems to revolve around the VHDX file format. I have tested 4 separate environments (completely different servers at different locations) with Server 2008 R2 domain controllers running as VMs on dynamic VHDX files. In each environment a power failure has caused corruption within the file system of the domain controller and left the DC unusable. During testing not every power failure caused corruption, but a high percentage (10+%) of power failures did. I have managed numerous Hyper-V installations since the original 2008 server version was released, and I have never seen corruption like this until 2012 and VHDX.

For the past few days I have been testing fixed-size VHD VMs on a 2012 host, and I have not been able to reproduce the data corruption issue. I seem to be able to reproduce the problem only when using dynamic VHDX files. I have not done any testing on 2012 hosts with fixed-size VHDX files or dynamic VHD files. A sketch of what I'm provisioning for the comparison is below.
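
For anyone repeating the comparison, these are roughly the disks I'm creating (paths and sizes are just examples):

    # Dynamic VHDX: the format I can corrupt with a power pull
    New-VHD -Path 'D:\VMs\dc-dyn.vhdx' -SizeBytes 60GB -Dynamic

    # Fixed-size VHD: the format that has survived my tests so far
    New-VHD -Path 'D:\VMs\dc-fixed.vhd' -SizeBytes 60GB -Fixed

    # Converting an existing dynamic VHDX to a fixed VHD
    Convert-VHD -Path 'D:\VMs\dc-dyn.vhdx' -DestinationPath 'D:\VMs\dc-fixed.vhd' -VHDType Fixed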

If you have experienced similar issues please respond so that we can compare notes and hopefully get to the bottom of a serious problem.

June 26th, 2013 5:20am

Please see my blog post and the related hotfix for this type of issue.

Important Hyper-V Fixes Included in KB 2855336 (Virtualized Active Directory corruption (bug check c00002e2) and vmms process hang)
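
A quick way to check whether the fix is already on a host (using the KB number from the title above):

    # Shows the update record if KB 2855336 is installed (no output if it is missing)
    Get-HotFix -Id KB2855336 -ErrorAction SilentlyContinue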
July 9th, 2013 8:01pm

Sorry for not responding earlier. I lost track of this thread and have just been crossing my fingers when we have unexpected outages.

I came to update the thread since I found a patch for the issue, but I see that Taylor has beaten me to it... by a month.

Just glad to know I'm not losing my mind and this was a bug.

http://support.microsoft.com/kb/2853952/en-us

August 8th, 2013 12:44am

thanks!

I was losing my mind, as I lost a domain controller today because of this issue. It was the third time, and it was driving me crazy. Luckily the main DC was a 2008 R2 machine and was not affected by the power outage. Then I saw Taylor Brown's post, and it will probably save me in the future, since our branch is on a construction site and is hit by power surges every day.

Apart from the fact that I needed to delete the 2012 DC VM and install a fresh one, hopefully the problem will not happen again.

Thank you all, cheers from Brazil

September 24th, 2013 5:43pm

BTW, just a touch-up on this thread if anyone decides to come back and have a read... I haven't seen any of these same issues with 2012 R2, and after lots and lots of searching it came down to bad ECC or non-ECC RAM being the cause of these issues in my case. Mind you, no more issues on 2012 R2 :) It has been running now for about 15 months and is still going strong :)

Cheers!

May 19th, 2015 12:08am

This topic is archived. No further replies will be accepted.
