Exchange DAG failover causes 100% CPU for 30 minutes

Hello,

We are experiencing a strange problem with our Exchange 2013 environment.

We have two Exchange 2013 servers, each with both the CAS and Mailbox roles.
Clients are load-balanced with Citrix NetScaler, and the servers are virtual machines running on VMware vSphere 6 (but we had the same issue on vSphere 5.5).

Normally everything seems to work fine, but when I activate one of the databases on the other server, both servers become unavailable for client connections for about 30 minutes.

When this happens, both servers run at 100% CPU for the full 30 minutes. The processes causing this are the IIS worker process, the Exchange store worker and Exchange RPC Client Access.
After about 30 minutes things settle down, CPU usage drops to 30-40% and clients can connect again without a problem.
It seems to be related to Outlook clients connecting: when I do a failover outside normal office hours, the 100% CPU issue is still there, but only for 30 seconds or so, even though most ActiveSync connections are also active outside office hours.

The strange thing is that only about 500 to 600 Outlook clients are connected at the same time; I can't imagine that is a lot for Exchange to handle.

Does anyone have a clue what could cause this?

Thanks in advance.

July 14th, 2015 10:45am

Hi

How much RAM, CPU, etc. do you have allocated to each server?

What version of Exchange 2013 are you running?

July 14th, 2015 10:49am

I would start with the latest CU on the Exchange servers, and look for any errors in the event logs if this still happens after the CU update.

Exchange 2013 CU9 : http://www.microsoft.com/en-us/download/details.aspx?id=47679

July 14th, 2015 7:20pm

Thank you for your answers.

We are running Exchange 2013 Enterprise CU8 at the moment.
We have had this problem for a while now, also with previous CUs, but will upgrade to CU9 as soon as possible.

Will let you know if this resolves anything.

The strange thing is that I don't see any errors in the event logs that could explain this behaviour.

The VMware hosts are HP ProLiant BL465c Gen8 with AMD Opteron(tm) Processor 6380 (32 logical processors) and 256 GB of RAM.
The virtual machines running Exchange have 8 vCPUs (previously 4, with the same problems) and 32 GB of RAM.
The hosts are not overcommitted on CPU or memory, so I don't think it's a scheduling problem or anything like that.

July 15th, 2015 4:09am

Hi,

As I understand your description, all servers hit 100% CPU and both become unavailable for client connections when you activate a database on the other DAG member, and things return to normal after about 30 minutes.
If I have misunderstood your concern, please do not hesitate to let me know.

How do you activate the database? Do you run Move-ActiveMailboxDatabase?

Please check the DAG health with Get-MailboxDatabaseCopyStatus and Test-ReplicationHealth while the issue is occurring. Also check replication for the DCs and GCs, and check for any event log entries related to the RPC service.
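
For reference, the two health checks above could be run roughly like this from the Exchange Management Shell while the CPU spike is in progress. This is only a sketch: EX01 and EX02 are placeholder server names that do not come from this thread, and the columns selected are just the ones that seem most relevant here.

```powershell
# Run from the Exchange Management Shell during the CPU spike.
# EX01 and EX02 are placeholder names (assumption); use the two DAG members.
$servers = 'EX01', 'EX02'

foreach ($server in $servers) {
    # Status and queue lengths of every database copy hosted on this server
    Get-MailboxDatabaseCopyStatus -Server $server |
        Format-Table Name, Status, CopyQueueLength, ReplayQueueLength, ContentIndexState -AutoSize

    # Replication health checks for this DAG member
    Test-ReplicationHealth -Identity $server |
        Format-Table Server, Check, Result, Error -AutoSize -Wrap
}
```

Copies that are not Mounted or Healthy, or queue lengths that keep growing during the spike, would be the first things to look at in the output.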

Thanks

July 15th, 2015 5:02am

That's correct. 
And yes, I run Move-ActiveMailboxDatabase (or use the web interface) to activate the copy.

I will try to reproduce the problem outside office hours so I can check the event logs during the 100% CPU period.
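
A rough sketch of how that capture could look, assuming it is run in an elevated PowerShell session on each Exchange server as soon as the spike starts. The process name pattern is an assumption based on the usual Exchange 2013 process names for the three processes mentioned earlier; the sample interval and count are arbitrary.

```powershell
# Remember when the capture started so the event query can use the same window.
$start = Get-Date

# Assumed process names: w3wp (IIS worker), Microsoft.Exchange.Store.Worker,
# Microsoft.Exchange.RpcClientAccess.Service. Adjust if they differ on your servers.
$pattern = '^(w3wp|microsoft\.exchange\.store\.worker|microsoft\.exchange\.rpcclientaccess\.service)'

# Sample per-process CPU every 5 seconds, 12 times (about one minute).
# Values are relative to a single core, so they can exceed 100 on a multi-vCPU VM.
Get-Counter -Counter '\Process(*)\% Processor Time' -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples } |
    Where-Object { $_.InstanceName -match $pattern } |
    Sort-Object CookedValue -Descending |
    Select-Object -First 20 -Property Timestamp, InstanceName, CookedValue |
    Format-Table -AutoSize

# Critical (1), error (2) and warning (3) events logged since the capture started.
Get-WinEvent -FilterHashtable @{ LogName = 'Application', 'System'; Level = 1, 2, 3; StartTime = $start } -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, ProviderName, Id, LevelDisplayName, Message |
    Format-List
```

Running the same capture after an out-of-hours failover and after a failover during office hours would make the two cases directly comparable.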

But this does not only happen when activating a database. It also happens, for instance, when I let the NetScaler fail over all client connections to the other server.
Whenever something happens to all open connections to one of the servers and all Outlook clients reconnect at the same time, we see this issue.

July 15th, 2015 8:48am

This topic is archived. No further replies will be accepted.
