Hyper-V host and all guests become unresponsive, event ID 153 fills system log

We have several Windows Server 2012 R2 Hyper-V hosts with dozens of VMs running legacy and departmental systems (XP, Win2003, Win2008, Win2012R2). They had worked for years without any problem.

Last December, we approved the monthly Windows Updates for the guest systems. The next day, our IBM x3650 hosts reported disk errors and hung, and the disk array was lost after a restart.

The host system logs showed these records in Windows Event Viewer:

  • Megasas2 ID 129 - reset to device, \device\raidport0, was issued.
  • Disk ID 153 - the IO operation at logical block address 0x for disk 2 (PDO: \device\0000005d) was retried.
  • Megasas2 ID 11 - the driver detected a controller error on \device\raidport0

The systems running on HP DL160/360 were not affected. The problematic environment is IBM x3650 M4 HD 5460, configured to boot from the internal LSI MegaRAID M5210e. We use fixed-size VHDX disks.

All diagnostics run after a reboot report no errors, and the hosts work fine after a reinstall until an updated guest is started. The same VM guests run normally in another environment.

We have therefore tried reinstalling the host system with every version of the LSI MEGASAS drivers we could find, with and without Windows updates on the host OS. In every attempt, the server crashed and the disk array was lost when a guest started.

Has anyone experienced this problem? How can we confirm that this is a driver issue?


March 11th, 2015 7:49pm

Hi Elton,

Sorry for my bad English. I'm working on improving it. Let me clarify.

When the problem occurs, the host stops responding. There is no dump or blue screen. Any application we try to interact with stops responding. Resource Monitor shows zero write queue usage, while the read queue length goes off the scale. If CHKDSK is running, it reports many "file record segment ### is unreadable" messages. Event Viewer shows errors 11, 129 and 153. Within a few minutes, the host freezes.

After restarting, the boot partition is unreachable and the recovery process fails. All data on the disks was lost, including the virtual machines. However, if we reinstall, the system works fine and passes all diagnostics and disk checks without errors.

Yesterday, we found some actions that trigger the problem almost instantly: for example, running a full antivirus scan in a guest, or just formatting a volume in a guest without antivirus. This happens only in XP or Windows 2003 guests.

The host OS and the host's hardware drivers are up to date. The problem occurs only on our xSeries servers. We think it's a driver bug. There is an open support request with the hardware supplier, but they are not convinced by the driver bug hypothesis. They suggest it's a Hyper-V bug.

Is this a driver or hypervisor issue? How can we confirm it?

Regards.

March 13th, 2015 12:56pm

This sounds like a driver issue. Have you installed the latest IBM drivers from the vendor? Another good place to look is the Hyper-V VMMS operational log: Event Viewer\Applications and Services Logs\Microsoft\Windows\Hyper-V-VMMS
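For anyone who prefers the command line to the Event Viewer tree, here is a minimal Python sketch that lists the Hyper-V VMMS channels present on a host. It relies on `wevtutil el`, which enumerates event log names on Windows. The helper names `filter_channels` and `list_vmms_channels` are made up for illustration, and the exact channel names vary by Windows version, so treat the output as something to check rather than a fixed list.

```python
import subprocess


def filter_channels(names, needle="Hyper-V-VMMS"):
    """Return the sorted log names that contain `needle` (case-insensitive)."""
    needle = needle.lower()
    return sorted(name.strip() for name in names if needle in name.lower())


def list_vmms_channels():
    """Enumerate event logs with `wevtutil el` and keep the VMMS ones.

    Only works on a Windows host, where wevtutil is available.
    """
    result = subprocess.run(
        ["wevtutil", "el"], capture_output=True, text=True, check=True
    )
    return filter_channels(result.stdout.splitlines())
```

On the Hyper-V host, `print(list_vmms_channels())` should show which VMMS logs exist before you query any of them.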

March 13th, 2015 7:46pm

We have tried the latest driver versions provided by MS, IBM and LSI.

  • MS: lsi_sas2 v2.00.60.82 / lsi_sas3 v2.50.65.01 / megasas2 v6.600.21.8
  • IBM: lsi_sas2 v2.00.69.01 / lsi_sas3 v2.50.75.00 / megasas2 v6.704.12.00
  • LSI: lsi_sas2 v2.00.72.00 / lsi_sas3 v2.50.92.00 / megasas2 v6.705.05.00
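When comparing driver builds like these, it helps to compare the dotted versions numerically, field by field, rather than as strings. The short Python sketch below does that for the versions listed above. The helper names `parse_version` and `newest_source` are hypothetical, and the sketch assumes every version is a plain dotted sequence of integers.

```python
def parse_version(text):
    """Turn a string such as 'v6.704.12.00' into a tuple of ints: (6, 704, 12, 0)."""
    return tuple(int(part) for part in text.lstrip("v.").split("."))


# Driver versions as listed in the post, keyed by supplier.
DRIVERS = {
    "MS": {"lsi_sas2": "v2.00.60.82", "lsi_sas3": "v2.50.65.01", "megasas2": "v6.600.21.8"},
    "IBM": {"lsi_sas2": "v2.00.69.01", "lsi_sas3": "v2.50.75.00", "megasas2": "v6.704.12.00"},
    "LSI": {"lsi_sas2": "v2.00.72.00", "lsi_sas3": "v2.50.92.00", "megasas2": "v6.705.05.00"},
}


def newest_source(drivers, name):
    """Return the supplier whose build of driver `name` has the highest version."""
    return max(drivers, key=lambda source: parse_version(drivers[source][name]))
```

For example, `newest_source(DRIVERS, "megasas2")` picks the supplier with the highest megasas2 build. Tuple comparison avoids the string-ordering trap where "21.8" sorts above "21.10".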

The Hyper-V VMMS-Storage log shows: Failed to open attachment ... Error: 'The file or directory is corrupted and unreadable.'

The problem only occurs when Integration Components are installed.

March 16th, 2015 4:40pm

Hi Sir,

>>Resource Monitor shows zero write queue usage, while the read queue length goes off the scale. If CHKDSK is running, it reports many "file record segment ### is unreadable" messages. Event Viewer shows errors 11, 129 and 153. Within a few minutes, the host freezes.

>>After restarting, the boot partition is unreachable and the recovery process fails. All data on the disks was lost, including the virtual machines. However, if we reinstall, the system works fine and passes all diagnostics and disk checks without errors.

>>The problem occurs only in our xSeries.

Sorry for the delay.

As you mentioned, the issue only happens on the xSeries servers. Based on my experience, there is likely a conflict.

Please refer to the following website to check whether the x3650 is supported for 2012 R2:

http://windowsservercatalog.com/results.aspx?text=IBM+x3650+&=Go&bCatID=1282&avc=10&ava=0&OR=5&chtext=&cstext=&csttext=&chbtext=

Best Regards,

Elton Ji

March 25th, 2015 2:12am

Hi guys,

The hardware is 2012R2 certified (http://windowsservercatalog.com/item.aspx?idItem=a8ee4fd3-a6de-8d5c-9426-b8f8d1cba0d0&bCatID=1282). 

Continuing our trial-and-error approach, some tests this week revealed new information about the issue. The problem occurs when the Hyper-V Storage Accelerator (HSA) driver is enabled in an XP/2003 x32 guest OS. If it is disabled, everything runs fine. An x64 guest OS also runs fine, even with the HSA driver enabled.

Yesterday, we tested another option: disabling data protection (T10-DIF PI) in the RAID BIOS. Since then, we have had no errors or issues. It looks like a workaround.

However, T10 data protection prevents silent corruption such as FIFO overruns and underruns or firmware errors (such as arithmetic overflow or incorrect pointer usage). Is it safe to disable it?

Regards,
March 25th, 2015 7:26pm

Hi Sir,

>>However, T10 data protection prevents silent corruption such as FIFO overruns and underruns or firmware errors (such as arithmetic overflow or incorrect pointer usage). Is it safe to disable it?

Glad to hear that you have found a workaround.

For this hardware setting, however, I suggest contacting the hardware vendor to check whether there is any potential issue.

Best Regards,

El

March 27th, 2015 1:29pm

Hi Rafael,

As you know, these events can be logged for multiple reasons. The underlying cause can range from something as simple as a large file copy performed buffered when it should be unbuffered, or an insufficient controller command queue (queue depth), through workload-related problems, down to firmware/drivers and hardware.

Q: Where are these events being logged (within host server or the VMs)?

Q: Can you copy the "Details" of Event 153 and paste it here?

Right-click event 153 > Copy > Copy details as text
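If copying from Event Viewer is awkward on a host that is about to freeze, the same details can likely be pulled from the command line. The Python sketch below builds a `wevtutil qe` query for given event IDs in the System log, newest first, as text. The function names `build_event_query` and `run_event_query` are hypothetical, and the sketch assumes the events are in the host's System log, as the original post describes.

```python
import subprocess


def build_event_query(event_ids, log="System", count=20):
    """Return the wevtutil argument list for the newest `count` events with the given IDs."""
    if not event_ids:
        raise ValueError("at least one event ID is required")
    id_filter = " or ".join(f"EventID={int(event_id)}" for event_id in event_ids)
    return [
        "wevtutil", "qe", log,
        f"/q:*[System[({id_filter})]]",  # XPath filter on the event ID
        f"/c:{int(count)}",              # maximum number of events to return
        "/rd:true",                      # reverse direction: newest events first
        "/f:text",                       # plain-text output instead of XML
    ]


def run_event_query(event_ids, log="System", count=20):
    """Run the query and return its text output. Only works on a Windows host."""
    result = subprocess.run(
        build_event_query(event_ids, log, count),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On the host, `print(run_event_query([153]))` should give output that can be pasted into a reply, and `run_event_query([11, 129, 153])` covers all three events from the first post.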

Also, what is the workload on this server?

What type of IO operation is occurring when this happens?

Thanks

April 2nd, 2015 12:23am

This topic is archived. No further replies will be accepted.
