Windows 2008 R2 reboots every 14 days due to Exchange 2010
Hello guys,
I have a Windows Server 2008 R2 SP1 server with Exchange 2010 SP2 installed and all the latest updates applied.
Hardware: Dell PowerEdge R900, 2 x Intel Xeon CPU E7430 @ 2.13 GHz, 32 GB RAM.
It also has Openfiler iSCSI storage attached, plus one 2 TB USB drive holding the Exchange DB and one 2 TB USB drive for backup (a bad choice, but I can't avoid it).
Config: Windows Server 2008 R2 Enterprise (x64) SP1
Exchange 2010 Mailbox server role with DAG.
We are experiencing a strange problem: the server reboots every second Tuesday (every 14 days).
So far I have done the following:
1. Updated all firmware and drivers to the latest versions. The Dell PowerEdge R900 hardware was tested with hardware diagnostic tools and no problem was found.
2. Installed all Windows drivers.
3. Captured the memory dump below. Sometimes it does not generate a memory dump at all.
5: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
CRITICAL_OBJECT_TERMINATION (f4)
A process or thread crucial to system operation has unexpectedly exited or been
terminated.
Several processes and threads are necessary for the operation of the
system; when they are terminated (for any reason), the system can no
longer function.
Arguments:
Arg1: 0000000000000003, Process
Arg2: fffffa801aa10b30, Terminating object
Arg3: fffffa801aa10e10, Process image file name
Arg4: fffff800023db8b0, Explanatory message (ascii)
Debugging Details:
------------------
Page 2e039e not present in the dump file. Type ".hh dbgerr004" for details
TRIAGER: Could not open triage file : C:\Program Files (x86)\Windows Kits\8.0\Debuggers\x64\triage\modclass.ini, error 2
PROCESS_OBJECT: fffffa801aa10b30
DEBUG_FLR_IMAGE_TIMESTAMP: 0
MODULE_NAME: wininit
FAULTING_MODULE: 0000000000000000
PROCESS_NAME: msexchangerepl
BUGCHECK_STR: 0xF4_msexchangerepl
DEFAULT_BUCKET_ID: WIN7_DRIVER_FAULT
CURRENT_IRQL: 0
LAST_CONTROL_TRANSFER: from fffff800024625e2 to fffff800020d7c40
STACK_TEXT:
fffff880`08a03b08 fffff800`024625e2 : 00000000`000000f4 00000000`00000003 fffffa80`1aa10b30 fffffa80`1aa10e10 : nt!KeBugCheckEx
fffff880`08a03b10 fffff800`0240f99b : ffffffff`ffffffff fffffa80`56502060 fffffa80`1aa10b30 fffffa80`1e7e8b30 : nt!PspCatchCriticalBreak+0x92
fffff880`08a03b50 fffff800`0238f448 : ffffffff`ffffffff 00000000`00000001 fffffa80`1aa10b30 00000000`00000008 : nt! ?? ::NNGAKEGL::`string'+0x176d6
fffff880`08a03ba0 fffff800`020d6ed3 : fffffa80`1aa10b30 fffff880`ffffffff fffffa80`56502060 00000000`00001044 : nt!NtTerminateProcess+0xf4
fffff880`08a03c20 00000000`76f415da : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x13
00000000`217ae7b8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x76f415da
STACK_COMMAND: kb
FOLLOWUP_NAME: MachineOwner
IMAGE_NAME: wininit.exe
FAILURE_BUCKET_ID: X64_0xF4_msexchangerepl_IMAGE_wininit.exe
BUCKET_ID: X64_0xF4_msexchangerepl_IMAGE_wininit.exe
Followup: MachineOwner
---------
5: kd> !process fffffa801aa10b30 3
PROCESS fffffa801aa10b30
SessionId: 0 Cid: 01d4 Peb: 7fffffd7000 ParentCid: 0198
DirBase: 268faa000 ObjectTable: fffff8a000cbf4a0 HandleCount: 92.
Image: wininit.exe
VadRoot fffffa801b872240 Vads 61 Clone 0 Private 455. Modified 2. Locked 2.
DeviceMap fffff8a000008bc0
Token fffff8a000087820
ElapsedTime 20:02:21.124
UserTime 00:00:00.031
KernelTime 00:00:00.062
QuotaPoolUsage[PagedPool] 95920
QuotaPoolUsage[NonPagedPool] 9984
Working Set Sizes (now,min,max) (1315, 50, 345) (5260KB, 200KB, 1380KB)
PeakWorkingSetSize 1319
VirtualSize 47 Mb
PeakVirtualSize 50 Mb
PageFaultCount 1480
MemoryPriority BACKGROUND
BasePriority 13
CommitCharge 547
THREAD fffffa801aa11b60 Cid 01d4.01d8 Teb: 000007fffffde000 Win32Thread: fffff900c00c7950 WAIT: (UserRequest) UserMode Non-Alertable
fffffa801ad9bbc0 NotificationEvent
THREAD fffffa801aa99b60 Cid 01d4.022c Teb: 000007fffffd5000 Win32Thread: 0000000000000000 WAIT: (UserRequest) UserMode Alertable
fffffa801aa938c0 SynchronizationTimer
fffffa801aa418c0 SynchronizationTimer
fffffa801aa9eb30 ProcessObject
fffffa801ab06910 ProcessObject
fffffa801ab0fb30 ProcessObject
fffffa801aa8f8e0 SynchronizationTimer
THREAD fffffa801f444060 Cid 01d4.0fcc Teb: 000007fffffdc000 Win32Thread: 0000000000000000 WAIT: (WrQueue) UserMode Alertable
fffffa801a9c5ac0 QueueObject
April 24th, 2012 1:09pm
First, I am not an Exchange person, but I found your post interesting. The article below indicates you may be seeing poor I/O; Exchange 2010 SP1 adds a feature that bugchecks the machine when I/O hangs.
If you can't resolve the I/O issue, you could possibly turn this feature off.
New High Availability and Site Resilience Functionality in Exchange 2010 SP1
http://technet.microsoft.com/en-us/library/ff625233.aspx
DisableBugcheckOnHungIo
HKLM\Software\Microsoft\Exchange Server\V14\Replay\Parameters
DWORD value; when set to any value other than 0, the hung I/O bugcheck feature is disabled. If hung I/O occurs, only an event is logged.
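If you decide to disable the feature, the registry change above can be made with a few lines of PowerShell. This is only a sketch based on the path quoted in this post (V14 corresponds to Exchange 2010); verify it matches your installation, and note that restarting MSExchangeRepl so it rereads the value is my assumption:

```powershell
# Disable the hung-I/O bugcheck feature (path taken from the post above;
# V14 = Exchange 2010). Any nonzero DWORD disables the bugcheck.
$path = 'HKLM:\Software\Microsoft\Exchange Server\V14\Replay\Parameters'
if (-not (Test-Path $path)) { New-Item -Path $path -Force | Out-Null }
New-ItemProperty -Path $path -Name 'DisableBugcheckOnHungIo' `
    -PropertyType DWord -Value 1 -Force | Out-Null

# Assumed: restart the Microsoft Exchange Replication service
# so the new setting takes effect.
Restart-Service MSExchangeRepl
```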
Extensible Storage Engine (ESE) has been updated to detect when I/O is hung and to take corrective action to automatically recover the server. ESE maintains an I/O monitoring thread that detects when an I/O has been outstanding for a specific period of time.
By default, if an I/O for a database is outstanding for more than one minute, ESE logs an event. If a database has an I/O outstanding for more than 4 minutes, ESE logs a specific failure event, if it's possible to do so. ESE event 507, 508, 509, or 510
may or may not be logged, depending on the nature of the hung I/O. If the problem affects the operating system volume or the ability to write to the event log, the events aren't logged. If the events are logged, the Microsoft Exchange
Replication service (MSExchangeRepl.exe) intentionally terminates the wininit.exe process to cause a bugcheck of Windows.
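To confirm whether ESE was actually seeing hung I/O before each crash, you could search the Application log for the event IDs mentioned above. A sketch (the provider name 'ESE' and the 14-day window are my assumptions; adjust to match your reboot schedule):

```powershell
# Look for ESE hung-I/O events (507-510, per the article above)
# in the Application log since the previous scheduled reboot.
Get-WinEvent -FilterHashtable @{
    LogName      = 'Application'
    ProviderName = 'ESE'        # assumed event source name
    Id           = 507, 508, 509, 510
    StartTime    = (Get-Date).AddDays(-14)
} -ErrorAction SilentlyContinue |
    Format-Table TimeCreated, Id, Message -AutoSize
```

If these events appear shortly before each bugcheck, that points squarely at slow storage rather than at Failover Clustering.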
Something to check, at least. Dave Guenthner [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights. http://blogs.technet.com/b/davguents_blog
April 24th, 2012 1:40pm
Yes, Dave is right. Here is an interesting article:
These errors occur when disks are too busy to handle the I/O load generated on the server. They can result from storage that is simply not capable of handling the load, or from a configuration that is not optimal and slows I/O under some conditions.
Exchange looks for hanging I/Os because it wants to protect databases against potential loss. Obviously, it's not good when an I/O has not completed, as the I/O may be relevant to essential data. A hanging I/O may never complete, as it might be in that charming
condition called "hung in cyber never-land."
It's worth noting that Exchange is not the only component that can force a Windows 2008 R2 server into a BSOD: Failover Clustering will also force a server bugcheck under certain conditions that Windows considers unrecoverable without a reboot.
http://thoughtsofanidlemind.wordpress.com/2011/02/10/sp1-and-bsod/
You may also try posting this in the Exchange-related forums:
http://social.technet.microsoft.com/forums/en-US/category/exchangeserver/
http://www.arabitpro.com
April 24th, 2012 1:46pm
Hi Syed,
Thank you. What steps should I take to find out whether failover clustering might be causing the reboot?
April 24th, 2012 2:04pm
No. What we are saying is that Exchange is seeing very poor disk I/O and, as a result, Exchange is bugchecking the machine so that customers can get insight into what may be causing the slow performance. If you are using USB storage, you already have the answer.
Follow my earlier post to turn off this feature if you cannot improve the disk performance.
San G is referring to a similar feature built into Failover Clustering to help troubleshoot hang conditions.
http://blogs.technet.com/b/askcore/archive/2009/06/12/why-is-my-2008-failover-clustering-node-blue-screening-with-a-stop-0x0000009e.aspx
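If you want to rule Failover Clustering in or out for a given reboot, generating the cluster debug log around the crash time is one way to check. A sketch, assuming the FailoverClusters module is available on the node (the 72-hour window is illustrative):

```powershell
# Generate the cluster debug log covering the last 72 hours
# (TimeSpan is in minutes) and write the .log files to C:\Temp.
Get-ClusterLog -TimeSpan (72 * 60) -Destination C:\Temp
```

Search the resulting log for node-removal or hang-recovery entries near the reboot timestamp; if there are none, clustering was likely not the trigger.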
Dave Guenthner [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights. http://blogs.technet.com/b/davguents_blog
April 24th, 2012 3:10pm
Hi Dave,
I have applied the changes and can give you an update on 8th May, which is the next scheduled reboot. If that does not work, I will try the failover clustering troubleshooting, although that does not seem to be the issue at first glance.
Thanks for your help.
April 27th, 2012 9:02am
May 8th works for me, San G. Have a nice weekend. Dave Guenthner [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights. http://blogs.technet.com/b/davguents_blog
April 27th, 2012 11:54am
Any updates on your issue? Jeff Ren, TechNet Community Support
May 3rd, 2012 2:56am
I bring sad news: the server rebooted again in the same way. It did not create a memory dump, so I forcefully generated one by crashing the server again using another method. Are there any more ideas?
Now I should focus on the failover cluster problem. Any suggestions?
May 9th, 2012 5:35am
To help you further, I am pasting my cluster configuration here. Please point out any mistakes. Both nodes are on the same LAN, on a gigabit switch.
PS C:\Windows\system32> Get-Cluster | fl *
Domain : mydomain
Name : DAG
AddEvictDelay : 60
BackupInProgress : 0
ClusSvcHangTimeout : 60
ClusSvcRegroupOpeningTimeout : 5
ClusSvcRegroupPruningTimeout : 5
ClusSvcRegroupStageTimeout : 7
ClusSvcRegroupTickInMilliseconds : 300
ClusterGroupWaitDelay : 30
ClusterLogLevel : 3
ClusterLogSize : 100
CrossSubnetDelay : 1000
CrossSubnetThreshold : 5
DefaultNetworkRole : 2
Description :
FixQuorum : 0
HangRecoveryAction : 3
IgnorePersistentStateOnStartup : 0
LogResourceControls : 0
PlumbAllCrossSubnetRoutes : 0
QuorumArbitrationTimeMax : 90
RequestReplyTimeout : 60
RootMemoryReserved : 4294967295
SameSubnetDelay : 1200
SameSubnetThreshold : 10
SecurityLevel : 2
SharedVolumesRoot : C:\ClusterStorage
ShutdownTimeoutInMinutes : 20
WitnessDatabaseWriteTimeout : 300
WitnessRestartInterval : 15
EnableSharedVolumes : Disabled
Id : 05c39a0f-dc66-474a-b43e-811fedf53ae0
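If anyone suspects the heartbeat settings above, the same-subnet values can be inspected and loosened so brief network stalls are tolerated. A hedged sketch; the values shown are illustrative only, not a recommendation for this cluster:

```powershell
# Inspect the current heartbeat tuning (compare with the output above).
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold,
    CrossSubnetDelay, CrossSubnetThreshold

# Illustrative change: 2000 ms between heartbeats, 10 missed
# heartbeats before a node is declared down. Test before using.
(Get-Cluster).SameSubnetDelay = 2000
(Get-Cluster).SameSubnetThreshold = 10
```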
May 9th, 2012 5:45am
Hi San_G,
Have you made any progress on the issue? We have the same issue in our organization; however, we are using SAS storage (Dell MD1200). Our restarts are more frequent: the server reboots roughly every 3 days, always during the evening.
Configuring DisableBugcheckOnHungIo is not really an option, as it is risky and could cause catastrophic issues; it should be used only as a last resort, if there is no way to improve disk I/O.
I don't mean to hijack your thread; I just want to know whether you made any progress and, if so, what you did.
Thanks
May 29th, 2012 3:49am