Windows 2008 R2 reboots every 14 days due to Exchange 2010
Hello guys, I have a Windows Server 2008 R2 SP1 server with Exchange 2010 SP2 installed and all the latest updates applied.

Hardware: Dell PowerEdge R900, 2 x Intel Xeon CPU E7430 @ 2.13 GHz, 32 GB RAM. It also has Openfiler iSCSI storage attached, plus one 2 TB USB disk holding the Exchange DB and one 2 TB USB disk for backup (a bad choice, but one I can't avoid).

Config: Windows Server 2008 R2 Enterprise (x64) SP1, Exchange 2010 Mailbox server role with DAG.

We are experiencing a strange problem: the server reboots every second Tuesday (every 14 days). So far I have done the following:

1. Updated all firmware and drivers to the latest versions. The Dell PowerEdge R900 hardware was tested with the hardware diagnostic tools and no problem was found.
2. Installed all Windows drivers.
3. Collected the memory dump below. Sometimes the crash does not generate a memory dump at all.

5: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

CRITICAL_OBJECT_TERMINATION (f4)
A process or thread crucial to system operation has unexpectedly exited or been
terminated. Several processes and threads are necessary for the operation of the
system; when they are terminated (for any reason), the system can no longer function.
Arguments:
Arg1: 0000000000000003, Process
Arg2: fffffa801aa10b30, Terminating object
Arg3: fffffa801aa10e10, Process image file name
Arg4: fffff800023db8b0, Explanatory message (ascii)

Debugging Details:
------------------
Page 2e039e not present in the dump file. Type ".hh dbgerr004" for details
TRIAGER: Could not open triage file : C:\Program Files (x86)\Windows Kits\8.0\Debuggers\x64\triage\modclass.ini, error 2

PROCESS_OBJECT: fffffa801aa10b30
DEBUG_FLR_IMAGE_TIMESTAMP: 0
MODULE_NAME: wininit
FAULTING_MODULE: 0000000000000000
PROCESS_NAME: msexchangerepl
BUGCHECK_STR: 0xF4_msexchangerepl
DEFAULT_BUCKET_ID: WIN7_DRIVER_FAULT
CURRENT_IRQL: 0
LAST_CONTROL_TRANSFER: from fffff800024625e2 to fffff800020d7c40

STACK_TEXT:
fffff880`08a03b08 fffff800`024625e2 : 00000000`000000f4 00000000`00000003 fffffa80`1aa10b30 fffffa80`1aa10e10 : nt!KeBugCheckEx
fffff880`08a03b10 fffff800`0240f99b : ffffffff`ffffffff fffffa80`56502060 fffffa80`1aa10b30 fffffa80`1e7e8b30 : nt!PspCatchCriticalBreak+0x92
fffff880`08a03b50 fffff800`0238f448 : ffffffff`ffffffff 00000000`00000001 fffffa80`1aa10b30 00000000`00000008 : nt! ?? ::NNGAKEGL::`string'+0x176d6
fffff880`08a03ba0 fffff800`020d6ed3 : fffffa80`1aa10b30 fffff880`ffffffff fffffa80`56502060 00000000`00001044 : nt!NtTerminateProcess+0xf4
fffff880`08a03c20 00000000`76f415da : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`217ae7b8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x76f415da

STACK_COMMAND: kb
FOLLOWUP_NAME: MachineOwner
IMAGE_NAME: wininit.exe
FAILURE_BUCKET_ID: X64_0xF4_msexchangerepl_IMAGE_wininit.exe
BUCKET_ID: X64_0xF4_msexchangerepl_IMAGE_wininit.exe
Followup: MachineOwner
---------

5: kd> !process fffffa801aa10b30 3
PROCESS fffffa801aa10b30
    SessionId: 0  Cid: 01d4  Peb: 7fffffd7000  ParentCid: 0198
    DirBase: 268faa000  ObjectTable: fffff8a000cbf4a0  HandleCount: 92.
    Image: wininit.exe
    VadRoot fffffa801b872240 Vads 61 Clone 0 Private 455. Modified 2. Locked 2.
    DeviceMap fffff8a000008bc0
    Token                             fffff8a000087820
    ElapsedTime                       20:02:21.124
    UserTime                          00:00:00.031
    KernelTime                        00:00:00.062
    QuotaPoolUsage[PagedPool]         95920
    QuotaPoolUsage[NonPagedPool]      9984
    Working Set Sizes (now,min,max)   (1315, 50, 345) (5260KB, 200KB, 1380KB)
    PeakWorkingSetSize                1319
    VirtualSize                       47 Mb
    PeakVirtualSize                   50 Mb
    PageFaultCount                    1480
    MemoryPriority                    BACKGROUND
    BasePriority                      13
    CommitCharge                      547

    THREAD fffffa801aa11b60  Cid 01d4.01d8  Teb: 000007fffffde000  Win32Thread: fffff900c00c7950  WAIT: (UserRequest) UserMode Non-Alertable
        fffffa801ad9bbc0  NotificationEvent

    THREAD fffffa801aa99b60  Cid 01d4.022c  Teb: 000007fffffd5000  Win32Thread: 0000000000000000  WAIT: (UserRequest) UserMode Alertable
        fffffa801aa938c0  SynchronizationTimer
        fffffa801aa418c0  SynchronizationTimer
        fffffa801aa9eb30  ProcessObject
        fffffa801ab06910  ProcessObject
        fffffa801ab0fb30  ProcessObject
        fffffa801aa8f8e0  SynchronizationTimer

    THREAD fffffa801f444060  Cid 01d4.0fcc  Teb: 000007fffffdc000  Win32Thread: 0000000000000000  WAIT: (WrQueue) UserMode Alertable
        fffffa801a9c5ac0  QueueObject
April 24th, 2012 1:09pm

First, I am not an Exchange person, but I found your post interesting. This article indicates you may be seeing poor I/O; Exchange 2010 SP1 adds a feature to bugcheck the machine in that case. If you can't resolve the I/O issue, you could possibly turn this feature off.

New High Availability and Site Resilience Functionality in Exchange 2010 SP1
http://technet.microsoft.com/en-us/library/ff625233.aspx

DisableBugcheckOnHungIo
HKLM\Software\Microsoft\Exchange Server\V14\Replay\Parameters
DWORD value; when set to any value other than 0, the hung I/O bugcheck feature is disabled. If hung I/O occurs, only an event is logged.

The Extensible Storage Engine (ESE) has been updated to detect when I/O is hung and to take corrective action to automatically recover the server. ESE maintains an I/O monitoring thread that detects when an I/O has been outstanding for a specific period of time. By default, if an I/O for a database is outstanding for more than one minute, ESE logs an event. If a database has an I/O outstanding for more than four minutes, ESE logs a specific failure event, if it's possible to do so. ESE event 507, 508, 509, or 510 may or may not be logged, depending on the nature of the hung I/O. If the problem affects the operating system volume or the ability to write to the event log, the events aren't logged. If the events are logged, the Microsoft Exchange Replication service (MSExchangeRepl.exe) intentionally terminates the wininit.exe process to cause a bugcheck of Windows.

Something to check at least.

Dave Guenthner [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights. http://blogs.technet.com/b/davguents_blog
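If you decide to go that route, here is a rough sketch of setting the value from an elevated PowerShell prompt. The key path is the one quoted above from the TechNet article; confirm it matches what actually exists on your server before relying on it, and test the change in a lab first.

# Sketch only: create the DisableBugcheckOnHungIo DWORD and set it to 1 so a
# hung I/O only logs an event instead of bugchecking the node.
$key = 'HKLM:\SOFTWARE\Microsoft\Exchange Server\V14\Replay\Parameters'
if (-not (Test-Path $key)) { New-Item -Path $key -Force | Out-Null }
New-ItemProperty -Path $key -Name 'DisableBugcheckOnHungIo' -PropertyType DWord -Value 1 -Force | Out-Null
# Confirm the value took effect.
Get-ItemProperty -Path $key -Name 'DisableBugcheckOnHungIo'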
April 24th, 2012 1:40pm

Yes, Dave is right. Here is an interesting article: these errors occur when disks are too busy to handle the I/O load generated on the server, and can result from storage that is simply not capable of handling the load, or from a configuration that is not optimal and slows I/O under some conditions. Exchange looks for hanging I/Os because it wants to protect databases against potential loss. Obviously, it's not good when an I/O has not completed, as the I/O may be relevant to essential data. A hanging I/O may never complete, as it might be in that charming condition called "hung in cyber never-land". It's worth noting that Exchange is not the only component that can force a Windows 2008 R2 server into a BSOD: Failover Clustering will also force a server bugcheck under certain conditions that Windows considers unrecoverable without a reboot.

http://thoughtsofanidlemind.wordpress.com/2011/02/10/sp1-and-bsod/

You may also try posting this in the Exchange-related forums:
http://social.technet.microsoft.com/forums/en-US/category/exchangeserver/

http://www.arabitpro.com
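To see whether the database engine actually recorded a hung I/O before one of the reboots, you could search the Application log for the ESE events Dave mentioned (507-510). A rough PowerShell sketch; the event IDs are the ones listed above, but the exact provider and wording can vary, so treat it only as a starting point:

# Look for ESE hung-I/O events in the Application log.
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 507,508,509,510 } -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, ProviderName, Message |
    Format-List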
April 24th, 2012 1:46pm

Hi Syed, thank you. What steps should I take to find out whether failover clustering might be causing the reboot?
April 24th, 2012 2:04pm

No. What we are saying is that Exchange is seeing very poor disk I/O, and as a result Exchange is dumping the machine so that customers can get insight into what may be causing the slow performance. If you are using USB, you already have the answer. Follow my post to turn off this feature if you cannot improve the disk performance.

San G is referring to a similar feature built into Failover Clustering to help troubleshoot hang conditions:
http://blogs.technet.com/b/askcore/archive/2009/06/12/why-is-my-2008-failover-clustering-node-blue-screening-with-a-stop-0x0000009e.aspx

Dave Guenthner [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights. http://blogs.technet.com/b/davguents_blog
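One way to quantify "very poor disk I/O" before turning anything off is to sample the logical-disk latency counters while the databases are under load. A hedged sketch; the counter names are the standard Windows ones, and the ~20 ms guideline is a general rule of thumb rather than something from this thread:

# Sample Avg. Disk sec/Read and /Write every 5 seconds for one minute and show
# the worst values; sustained readings well above ~0.020 (20 ms) suggest the
# storage cannot keep up.
$counters = '\LogicalDisk(*)\Avg. Disk sec/Read', '\LogicalDisk(*)\Avg. Disk sec/Write'
Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples } |
    Sort-Object CookedValue -Descending |
    Select-Object -First 10 Path, CookedValue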
April 24th, 2012 3:10pm

Hi Dave, I have applied the changes and can update you only on 8th May, which is the next scheduled reboot. If that does not work, I will try the Failover Clustering troubleshooting, which does not seem to be the issue at first glance. Thanks for your help.
April 27th, 2012 9:02am

May 8th works for me, San G. Have a nice weekend. Dave Guenthner [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights. http://blogs.technet.com/b/davguents_blog
April 27th, 2012 11:54am

Any updates on your issue? Jeff Ren, TechNet Community Support
May 3rd, 2012 2:56am

I bring sad news. The server rebooted again in the same way. It did not create a memory dump, so I forcefully generated one by crashing the server again with another method. Are there any more ideas? Now I should focus on the failover cluster problem. Any suggestions from you?
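Before the next occurrence I will also double-check the crash dump configuration, since no dump was written this time. A minimal check, just reading the standard CrashControl key (make sure the page file on the system drive is also large enough for the chosen dump type):

# CrashDumpEnabled: 1 = complete, 2 = kernel, 3 = small (mini) dump.
Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl' |
    Select-Object CrashDumpEnabled, DumpFile, MinidumpDir, AutoReboot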
May 9th, 2012 5:35am

To help you more, I paste my cluster configuration here. Please point out any mistakes. Both nodes are on the same LAN and connected to a gigabit switch.

PS C:\Windows\system32> Get-Cluster | fl *

Domain                           : mydomain
Name                             : DAG
AddEvictDelay                    : 60
BackupInProgress                 : 0
ClusSvcHangTimeout               : 60
ClusSvcRegroupOpeningTimeout     : 5
ClusSvcRegroupPruningTimeout     : 5
ClusSvcRegroupStageTimeout       : 7
ClusSvcRegroupTickInMilliseconds : 300
ClusterGroupWaitDelay            : 30
ClusterLogLevel                  : 3
ClusterLogSize                   : 100
CrossSubnetDelay                 : 1000
CrossSubnetThreshold             : 5
DefaultNetworkRole               : 2
Description                      :
FixQuorum                        : 0
HangRecoveryAction               : 3
IgnorePersistentStateOnStartup   : 0
LogResourceControls              : 0
PlumbAllCrossSubnetRoutes        : 0
QuorumArbitrationTimeMax         : 90
RequestReplyTimeout              : 60
RootMemoryReserved               : 4294967295
SameSubnetDelay                  : 1200
SameSubnetThreshold              : 10
SecurityLevel                    : 2
SharedVolumesRoot                : C:\ClusterStorage
ShutdownTimeoutInMinutes         : 20
WitnessDatabaseWriteTimeout      : 300
WitnessRestartInterval           : 15
EnableSharedVolumes              : Disabled
Id                               : 05c39a0f-dc66-474a-b43e-811fedf53ae0
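If it helps to focus the review, the heartbeat and hang-recovery values can be pulled on their own like this (a sketch; it assumes the FailoverClusters PowerShell module is available on the node, and it simply reruns the same query shown above with a narrower property list):

Import-Module FailoverClusters
Get-Cluster | Select-Object Name, SameSubnetDelay, SameSubnetThreshold,
    CrossSubnetDelay, CrossSubnetThreshold, HangRecoveryAction, ClusSvcHangTimeout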
May 9th, 2012 5:45am

Hi San_G, have you made any progress on the issue? We have the same issue in our organization, although we are using SAS storage (Dell MD1200). Our restart frequency is higher: it happens roughly every 3 days and always during the evening. Configuring DisableBugcheckOnHungIo is not really an option for us, as it is risky and could cause catastrophic issues; it should be used only as a last resort, when there is no way to improve disk I/O. I don't mean to hijack your thread, I just want to know whether you made any progress and, if so, what you did. Thanks.
May 29th, 2012 3:49am

