Cannot failover cluster

Hi all,

In my environment I have two Exchange 2013 servers, ex01 and ex02, in one DAG (DAG01). Recently ex02 has been having problems: some Exchange services crash. I'm still looking for the cause; it may be a lack of memory. I have a database named "Mailbox Database 01":

Get-MailboxDatabaseCopyStatus "Mailbox Database 01"
Name                      Status     ContentIndexState
Mailbox Database 01\EX01  Mounted        Healthy
Mailbox Database 01\EX02  Healthy        Healthy

When ex02 has a problem, something happens to the "Mailbox Database 01" copies on both ex01 and ex02: the database cannot be mounted on either server. The status of the copies keeps switching between Mounted, Mounting, Initializing, Disconnected and so on, like this:

Get-MailboxDatabaseCopyStatus "Mailbox Database 01"
Name                      Status     ContentIndexState
Mailbox Database 01\EX01  Mounting        Failed
Mailbox Database 01\EX02  Initializing    Failed

Get-MailboxDatabaseCopyStatus "Mailbox Database 01"
Name                      Status     ContentIndexState
Mailbox Database 01\EX01  Initializing        Failed
Mailbox Database 01\EX02  Mounting            Failed

Get-MailboxDatabaseCopyStatus "Mailbox Database 01"
Name                      Status     ContentIndexState
Mailbox Database 01\EX01  Disconnected       Failed
Mailbox Database 01\EX02  Mounting           Failed

This continues until I suspend the "Mailbox Database 01" copy on ex02; only then does the database mount successfully on ex01. After that, the "Mailbox Database 01" edb file on ex02 is in a "dirty shutdown" state, and I have to reseed the database manually from ex01 to ex02.

Get-ClusterGroup -Cluster EX01
Name              OwnerNode            State
ClusterGroup         ex01              Online
Available Storage    ex02              Offline

Get-DatabaseAvailabilityGroup -Status -Identity DAG01 | fl name,primaryActiveManager

Name                 : DAG01
PrimaryActiveManager : EX01

Please let me know if you need any information.
Thanks for your help.
Jack.


  • Edited by Jack Chuong Saturday, March 28, 2015 5:41 AM
March 28th, 2015 5:40am

Hi John,
I upgraded the memory on ex01 and ex02 yesterday to rule out the "lack of memory" problem in the future, and I found that failover doesn't work. Here is what happened:

Step 1: I ran the following commands on ex02 to put it into maintenance mode, following the instructions here: http://blogs.technet.com/b/nawar/archive/2014/03/30/exchange-2013-maintenance-mode.aspx

1. Drain active mail queues on the Mailbox server:
Set-ServerComponentState EX02 -Component HubTransport -State Draining -Requester Maintenance
2. To help the transport services pick up the state change immediately, run:
Restart-Service MSExchangeTransport
Restart-Service MSExchangeFrontEndTransport
3. To redirect messages pending delivery in the local queues to another Mailbox server, run:
Redirect-Message -Server EX02 -Target EX01
4. To prevent the node from being or becoming the PAM, pause the cluster node by running:
Suspend-ClusterNode EX02
5. To move all active databases currently hosted on the DAG member to other DAG members, run:
Set-MailboxServer EX02 -DatabaseCopyActivationDisabledAndMoveNow $True
6. Note the current value of DatabaseCopyAutoActivationPolicy (it is needed when taking the server out of maintenance mode later). Then, to prevent the server from hosting active database copies, run:
Set-MailboxServer EX02 -DatabaseCopyAutoActivationPolicy Blocked
7. To put the server into maintenance mode, run:
Set-ServerComponentState EX02 -Component ServerWideOffline -State Inactive -Requester Maintenance

After running the commands above, I can see that ex02 is in maintenance mode:

[PS] C:\Windows\system32>Get-ServerComponentState ex02 | ft Component,State -Autosize

Component                   State
---------                   -----
ServerWideOffline          Inactive
HubTransport               Inactive
FrontendTransport          Inactive
Monitoring                 Active
RecoveryActionsEnabled     Active
AutoDiscoverProxy          Inactive
ActiveSyncProxy            Inactive
EcpProxy                   Inactive
EwsProxy                   Inactive
ImapProxy                  Inactive
OabProxy                   Inactive
OwaProxy                   Inactive
PopProxy                   Inactive
PushNotificationsProxy     Inactive
RpsProxy                   Inactive
RwsProxy                   Inactive
RpcProxy                   Inactive
UMCallRouter               Inactive
XropProxy                  Inactive
HttpProxyAvailabilityGroup Inactive
ForwardSyncDeamon          Inactive
ProvisioningRps            Inactive
MapiProxy                  Inactive

But when I get the "Mailbox Database 01" copy status, I see this:

Get-MailboxDatabaseCopyStatus "Mailbox Database 01"
Name                      Status     ContentIndexState
Mailbox Database 01\EX01  Mounted         Healthy
Mailbox Database 01\EX02  Healthy         Healthy

Is that normal?

Step 2: I shut down ex02 to upgrade its RAM. After ex02 was off, I got the "Mailbox Database 01" copy status:

Get-MailboxDatabaseCopyStatus "Mailbox Database 01"
Name                      Status     ContentIndexState
Mailbox Database 01\EX01  Mounted         Healthy
Mailbox Database 01\EX02  ServiceDown     Unknown

Everything was fine at first, but a few seconds later the "Mailbox Database 01" copy on ex01 was dismounted (ex01 is the Primary Active Manager, as I showed in my previous replies). I got the copy status again:

Get-MailboxDatabaseCopyStatus "Mailbox Database 01"
Name                      Status     ContentIndexState
Mailbox Database 01\EX01  Dismounted      Failed
Mailbox Database 01\EX02  ServiceDown     Unknown

I tried to mount the "Mailbox Database 01" copy on ex01, but it didn't work:

Mount-Database -Identity "Mailbox Database 01"
Couldn't mount the database that you specified. Specified database: Mailbox Database 01; Error code: An Active Manager
operation failed. Error: An Active Manager operation encountered an error. To perform this operation, the server must
be a member of a database availability group, and the database availability group must have quorum. Error: Active
Manager encountered an error while trying to access the cluster database. [Server: EX01.mydomain.com].
    + CategoryInfo          : InvalidOperation: (Mailbox Database 01:ADObjectId) [Mount-Database], InvalidOperationExc
   eption
    + FullyQualifiedErrorId : [Server=EX01,RequestId=719edf90-3a8a-47e1-a85a-0e37902f0253,TimeStamp=3/29/2015 4:3
   3:26 AM] 9BA9C40F,Microsoft.Exchange.Management.SystemConfigurationTasks.MountDatabase
    + PSComputerName        : ex01.mydomain.com
A process is holding onto a transport performance counter. processId : 2888, counter : time in resource per second Value=0 SpinLock=0 Lifetime=Type:...

So my users experienced a "service down" for about 10 minutes. After I powered ex02 back on (it was still in maintenance mode when it started up), the "Mailbox Database 01" copy on ex01 was mounted and healthy again automatically, and everything went back to normal. I then ran these commands on ex02 to take it out of maintenance mode:

Set-ServerComponentState EX02 -Component ServerWideOffline -State Active -Requester Maintenance
Resume-ClusterNode EX02
Set-MailboxServer EX02 -DatabaseCopyActivationDisabledAndMoveNow $False
Set-MailboxServer EX02 -DatabaseCopyAutoActivationPolicy Unrestricted
Set-ServerComponentState EX02 -Component HubTransport -State Active -Requester Maintenance
Restart-Service MSExchangeTransport 
Restart-Service MSExchangeFrontEndTransport

I then activated the "Mailbox Database 01" copy on ex02 and repeated the same procedure on ex01 to upgrade its RAM, and the same thing happened.

My witness server runs Windows 7 Pro 32-bit and is joined to the domain. The witness directory is C:\Witness, and "Exchange Trusted Subsystem" has been added to the local Administrators group on the witness server.

I don't know what is wrong with my DAG.
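
The Mount-Database error above says the DAG must have quorum, so the vote arithmetic is worth spelling out. Below is a minimal Python sketch of simple majority voting, assuming the usual arrangement in which a DAG with an even number of members relies on the witness server for a tie-breaking vote. The function name and parameters are hypothetical, made up for this illustration; they are not part of Exchange or of Windows failover clustering, and the real Cluster service is more involved than this.

```python
def dag_has_quorum(members_up: int, total_members: int, witness_reachable: bool) -> bool:
    """Simplified majority check for a DAG (illustrative only, hypothetical helper).

    Assumption: with an even number of members the witness contributes one
    extra vote; with an odd number of members only the members vote.
    """
    if not 0 <= members_up <= total_members:
        raise ValueError("members_up must be between 0 and total_members")

    uses_witness = total_members % 2 == 0
    total_votes = total_members + (1 if uses_witness else 0)
    votes_online = members_up + (1 if uses_witness and witness_reachable else 0)

    # Quorum needs strictly more than half of all configured votes.
    return votes_online > total_votes // 2


if __name__ == "__main__":
    # Two members (like ex01 and ex02) plus a witness: 3 votes, 2 needed.
    for up, witness in [(2, True), (1, True), (1, False)]:
        print(f"members up={up}, witness reachable={witness}: "
              f"quorum={dag_has_quorum(up, 2, witness)}")
```

Under these assumptions, a two-member DAG should keep quorum with one member shut down as long as the witness vote is counted. Losing quorum as soon as one member goes offline would be consistent with the witness vote not being available, although this sketch cannot confirm that; checking the witness server and its share looks like a sensible next step.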


March 30th, 2015 3:51am

I'm sure that EX01 was the PAM before I shut down EX02; I checked beforehand:

Get-ClusterGroup -Cluster EX01
Name              OwnerNode            State
ClusterGroup         ex01              Online
Available Storage    ex02              Offline

Get-DatabaseAvailabilityGroup -Status -Identity DAG01 | fl name,primaryActiveManager

Name                 : DAG01
PrimaryActiveManager : EX01

Then, before I put EX01 into maintenance mode and shut it down, I ran this command to make EX02 the PAM:

Move-ClusterGroup -Name "Cluster Group" -Node EX02
Name             OwnerNode                      State
Cluster Group    EX02                          Online

Get-DatabaseAvailabilityGroup -Status -Identity DAG01 | fl name,primaryActiveManager
Name                 : DAG01
PrimaryActiveManager : EX02

Get-ClusterGroup -Cluster IDCEXC002
Name                   OwnerNode               State
Cluster Group          EX02                    Online
Available Storage      EX02                    Offline

But the problem still happened.
When I shut down EX02, many events appeared on EX01:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          3/29/2015 11:16:16 AM
Event ID:      1177
Task Category: Quorum Manager
Level:         Critical
Keywords:      
User:          SYSTEM
Computer:      EX01.mydomain.com
Description:
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. 
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Log Name:      System
Source:        Service Control Manager
Date:          3/29/2015 11:18:04 AM
Event ID:      7031
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      EX01.mydomain.com
Description:
The Microsoft Exchange Replication service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 5000 milliseconds: Restart the service.

Log Name:      System
Source:        Service Control Manager
Date:          3/29/2015 11:18:10 AM
Event ID:      7032
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      EX01.mydomain.com
Description:
The Service Control Manager tried to take a corrective action (Restart the service) after the unexpected termination of the Microsoft Exchange Replication service, but this action failed with the following error: 
An instance of the service is already running.

Log Name:      MSExchange Management
Source:        MSExchange CmdletLogs
Date:          3/29/2015 11:22:32 AM
Event ID:      6
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      EX01.mydomain.com
Description:
Cmdlet failed. Cmdlet Get-Notification, parameters {Summary=True}.

Log Name:      System
Source:        Service Control Manager
Date:          3/29/2015 11:26:37 AM
Event ID:      7024
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      EX01.mydomain.com
Description:
The Cluster Service service terminated with service-specific error The wait operation timed out..

Log Name:      System
Source:        Microsoft-Windows-DistributedCOM
Date:          3/29/2015 11:30:39 AM
Event ID:      10009
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      EX01.mydomain.com
Description:
DCOM was unable to communicate with the computer EX02.mydomain.com using any of the configured protocols.

Log Name:      MSExchange Management
Source:        MSExchange CmdletLogs
Date:          3/29/2015 11:33:26 AM
Event ID:      6
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      EX01.mydomain.com
Description:
Cmdlet failed. Cmdlet Mount-Database, parameters {Identity=Mailbox Database 01}.
(This one appeared when I tried to mount Mailbox Database 01 on EX01.)


March 30th, 2015 7:52am

Can anyone help me with this issue? I'm going to upgrade to CU7, and I have to make sure that my DAG can fail over when I put an Exchange server into maintenance mode and restart it. I don't want my users to experience a "service down" again.

Thanks for your help.

April 1st, 2015 11:12pm

This topic is archived. No further replies will be accepted.
