Migration Batches cause Exchange 2013 server issues

Hello,

We are currently in the process of migrating from Exchange 2007 to Exchange 2013; coexistence has been implemented and 20% of our mailboxes have been migrated.

In the past week or so I have had two occurrences where mailbox migration batches containing a high number of small mailboxes appear to have caused one or all of the Exchange 2013 servers to fall over. These batches are all started through PowerShell from a CSV containing each mailbox's primary email address and target database, so as to target multiple databases in a single batch:

New-MigrationBatch -Local -Name $BatchName -CSVData ([System.IO.File]::ReadAllBytes($CSV)) -BadItemLimit 100 -NotificationEmails $AdminEmail -AutoStart
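
For context, the CSV referenced above is just the layout New-MigrationBatch -Local expects when targeting multiple databases, along these lines (the addresses and database names are placeholders rather than our real values):

EmailAddress,TargetDatabase
first.user@ourdomain.com,EX2013-MBX01-DB01
second.user@ourdomain.com,EX2013-MBX02-DB01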

In addition, the concurrent mailbox move limit has been left at the default of 20. In both occurrences of this issue the batches contained target databases on three Exchange 2013 servers, meaning, as I understand it, we could have up to 60 synchronisations in progress at any one time during the batch.
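
For what it's worth, while a batch is running the number of moves actually in flight at any one time (and which databases they are hitting) can be checked with something along these lines; nothing here is environment-specific:

# Count in-progress moves per target database while the batch runs
Get-MoveRequest -MoveStatus InProgress | Group-Object TargetDatabase |
    Sort-Object Count -Descending | Format-Table Count,Name -AutoSize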

The initial occurrence of this was a migration batch of 408 users, all of whom have small mailboxes, so the entire batch totalled only 43GB. Roughly two hours after the batch had begun its initial sync, our service desk began to receive reports of mail delays. On investigation it appeared that the submission queue on one of the three target servers was backing up with messages that were unable to connect to the target databases on that server for delivery. Suspecting that the migration batch was the cause, we stopped the job, and within about 15 minutes everything had returned to normal. The batch was then deleted, split into three separate batches of roughly 130 users each based on target server, and re-run in order to identify whether this was an issue with the particular target server that had the problem; however, all three completed separately without issue.
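
For reference, the split into per-server batches was done along roughly these lines; the file paths and batch names are placeholders (with $AdminEmail defined as before), and this is only a sketch of the approach rather than the exact script:

# Split the original CSV into one migration batch per target server
$rows = Import-Csv "C:\Migration\Wave1.csv"    # columns: EmailAddress,TargetDatabase
$rows | Group-Object { (Get-MailboxDatabase $_.TargetDatabase).Server.Name } | ForEach-Object {
    $file = "C:\Migration\Wave1-$($_.Name).csv"
    $_.Group | Export-Csv $file -NoTypeInformation
    New-MigrationBatch -Local -Name "Wave1-$($_.Name)" `
        -CSVData ([System.IO.File]::ReadAllBytes($file)) `
        -BadItemLimit 100 -NotificationEmails $AdminEmail -AutoStart
}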

The second occurrence of the issue was, however, far more severe. In this case the batch was 120 mailboxes (again all small, totalling 17GB for the entire batch), as we had concluded after the previous issue that smaller batches were better. Roughly an hour after the start of the synchronisation, all three target servers became unresponsive to varying degrees:

  • Users on all three servers were disconnected from Outlook
  • One server would not load ECP; as time went on this degraded to none of them loading ECP
  • SMTP continued to process initially, however this gradually began to fail on each server
  • Exchange Management Shell would not load on two servers; the remaining server would hang when processing any EMS commands
  • One of the three would not accept any new RDP connections and the majority of applications would not run
  • All three, however, showed no noticeable problems from a resource point of view; CPU, memory, and disk latency were all normal.

Based on the experience of the previous issue, the first thing we did was stop the suspected migration batch; however, up until the point where ECP and EMS stopped functioning, none of the move requests went into a stopping or suspended state, and in turn this had no corrective impact on the issue.
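
To be clear, by stopping the batch I mean the standard route, roughly the following (the batch name is a placeholder), falling back to suspending the underlying move requests directly:

Stop-MigrationBatch -Identity "Wave2"
Get-MoveRequest -MoveStatus InProgress | Suspend-MoveRequest -Confirm:$false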

On the surface it initially appeared that IIS was unhappy on all three target servers; an iisreset, however, had no impact.

We took the view that restarting the worst-impacted server was the only course of action for that device. This reboot took a lot longer than normal but did restore connections to mailboxes on that server, so the other more severely impacted server was also rebooted.

During these reboots the Exchange Search service was stopped on the least-impacted server; this led to EMS commands completing, and the move requests were manually suspended. This server, however, continued to be unable to offer any client connectivity or access to ECP, so it ended up being rebooted as well once the others had returned.
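
For clarity, "the Exchange Search service" here is the Microsoft Exchange Search service on the 2013 Mailbox role; from memory the underlying service names are as below, but they are worth verifying before anyone relies on them:

Stop-Service MSExchangeFastSearch         # Microsoft Exchange Search (the one we stopped)
# Stop-Service HostControllerService      # Microsoft Exchange Search Host Controller, if needed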

I have concerns around this now as I am unable to track down why this issue happened. My suspicion is that the number of frequent and concurrent move requests doing their initial sync on such small mailboxes is causing one of the transport services to go into a tailspin and take other services out along the way; that said, no services crashed and there was no unusually high resource usage from any of the Exchange services during these events. I have also been toying with the idea that it may be related to indexing the mailboxes as they drop into a 'Synced' state, and to the number of indexing jobs running based on how quickly the mailboxes are syncing. That would explain the delay between the batch starting and the symptoms appearing, and why stopping the Search service seemed to somewhat alleviate some of the symptoms. If this were the case, however, I would have thought noderunner.exe would have been chewing up CPU constantly, whereas it only appeared intermittently near the top of the resource tables during the course of the problems.
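
If anyone wants to sanity-check that theory, this is the sort of thing I have been looking at on the target servers; nothing here is environment-specific:

# Content index state per database copy on the local server
Get-MailboxDatabaseCopyStatus * | Format-Table Name,Status,ContentIndexState -AutoSize

# Search worker (noderunner.exe) CPU and memory at a point in time
Get-Process noderunner | Format-Table Id,CPU,WorkingSet64 -AutoSize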

Is this likely to simply be a concurrency issue with move requests, whether in the number syncing at once or the number sitting open in total? Or is there something I'm missing here?

Thanks for any assistance anyone can offer.

February 4th, 2015 7:28am

Did you try clearing the existing migration batch and starting a new one, and see the results?
February 4th, 2015 11:57pm

Also, when I need to migrate from Exchange 2007 to 2013, I usually prefer to follow this informative TechNet resource, which provides step-by-step instructions: http://blogs.technet.com/b/meamcs/archive/2013/07/25/part-1-step-by-step-exchange-2007-to-2013-migration.aspx

As an alternative, you may also have a look at this automated solution (http://www.exchangemailboxmigration.com/), which could be a good approach to get this job done in a more hassle-free manner.

February 5th, 2015 3:11am

Hi,

Thanks for the response. In both of these occurrences we broke the batches down into smaller subsets targeting only a single server, and they processed without issue.

My main concern here is tracking down what caused this and whether it is an underlying issue with the environment we have put in place that will cause further issues beyond the scope of the migration. There is also, of course, the secondary concern that the extra time involved in doing much smaller batches adds to the period we need to remain in coexistence.

Really I was hoping for some diagnostic avenues to go down in order to best identify why we ended up in such a catastrophic situation during a synchronisation of mailboxes.

Thanks again

February 5th, 2015 12:05pm

This topic is archived. No further replies will be accepted.
