Exchange 2010 Architecture Storage and Recovery concerns for new deployment/migration.
Hi All,

I work for a large 150,000-user company that is migrating from Notes/Domino to Outlook/Exchange. I have a position of influence in the area of storage design. I have done a lot of internet reading, but I still have some questions for people who have real-life experience with Exchange 2010; the answers will influence some of the decisions we make.

Our current plan and objectives include:

- 5GB mail file quotas.
- No backup.
- Migration from Notes/Domino using Binarytree tools, taking across future calendar entries, contacts and 30 days' worth of mail.
- 3rd party archiving at some point, but nothing decided upon as yet.
- 3 HA real-time copies of each database: 2 in the primary DC, 1 activation-blocked copy in the secondary DC. I am personally wanting a 4th copy.
- 80% of users spread across 3 major regional datacentres of roughly 50K, 40K and 30K users each. Some countries require their own DCs and servers due to regulatory requirements, for as few as 23 staff :)
- SAN vs DAS JBOD vs DAS RAID still being worked out. My personal preference is DAS and JBOD SAS 7.2K 2TB drives but with 4 HA copies. I also like DAS RAID10 SAS 2TB, but given the doubling of physical spindles required, I feel you are better off adding another DAG member, as there are more reasons to require a failover and re-seed than losing the disk. Under either scenario we might be looking at something like 192 mail files per database, so if all were at the 5GB limit then roughly 960GB database sizes, not including content indexes, translogs (on the same LUN) etc.
- Outlook operating in local cached mode, most users connecting remotely to consolidated servers in the aforementioned DCs.
- Either 14 or 30 day single item recovery and un-deletion ability, depending on location.
- Probably F5 load balancers for CAS Arrays, except in the very small offices which might get Kemp... or something.

So with that background, my questions for you are as follows.
I'm not necessarily looking for answers to them all; if you have knowledge/experience of even one I would love to hear from you:

1. Have you ever experienced logical corruption? I have read it is supposedly exceedingly rare. In the absence of a lagged copy and backup, can you recover from it using ESEUtil?
2. In the absence of a lagged copy and backup, how do you deal with human error, such as admins or HR requesting deletion of the wrong mailbox?
3. How often can you expect a local OST file to corrupt, requiring a re-download of the entire mailfile (in our case, mostly remotely)?
4. Do you have experience with Binarytree migration tools, and can we expect complications?
5. How often should we expect physical corruption of a database? Is it easier/quicker to use ESEUtil to recover or just re-seed? I have heard this can happen frequently because of network problems.
6. Does anyone have any experience with how well Exchange works with Riverbed WAN Optimisation Controllers? Does it cache similar data well and drastically decrease bandwidth, or is it minimal?
7. Is there a great benefit to using RAID0 or RAID10 for the restore LUN so that maintenance utilities run faster, or are they not I/O bottlenecked?
8. I have read that MS suggest a 35-70GB per hour re-seed rate. I have also read that no one "in the wild" gets more than 20GB per hour. Do you have a figure to share?
9. Is 192 x 5GB mail files = 960GB a waste of space if using 2TB disks? Or is having ample free space a sound idea, especially if you occasionally need to move lots of mail files and therefore generate lots of log space, or maintain logs for 2 days in case of network outage?
10. Is an RPO of 0 minutes possible? If you can maintain real-time replication between primary and secondary sites, is that all that is required to achieve this?

Many thanks in advance to all that reply :)
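To put the re-seed rates from question 8 into perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a constant copy rate and a database in which every mailbox is at quota, and it ignores log replay and content index rebuilds; the function name `reseed_hours` is made up for illustration and the figures are only the ones quoted above, not measurements.

```python
# Rough re-seed duration for one database copy at several sustained rates.
# Inputs are the figures quoted in the post; nothing here is measured.

MAILBOXES_PER_DB = 192
QUOTA_GB = 5


def reseed_hours(db_size_gb: float, rate_gb_per_hour: float) -> float:
    """Hours to copy db_size_gb at a constant rate (simplifying assumption)."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return db_size_gb / rate_gb_per_hour


# Worst case: every mailbox in the database is at its quota.
db_size_gb = MAILBOXES_PER_DB * QUOTA_GB

for rate in (20, 35, 70):  # "in the wild" figure, then the suggested range
    print(f"{rate} GB/h: {reseed_hours(db_size_gb, rate):.1f} h")
```

Under these assumptions a full 960GB database takes about two days to re-seed at 20GB per hour, and roughly 14 to 27 hours at the 35-70GB per hour range.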
September 20th, 2011 6:40am

1. It is rare.
2. Don't give admins and HR access to mailboxes.
3. How often? Not that often. But users often have really terrible habits like powering down machines without shutting them down.
4. No experience.
5. It is rare unless you have faulty storage components.
6. I've heard great things, but you probably would be best off asking the vendor for references. I understand that Microsoft Support has a litany of support cases involving Riverbed devices.
7. RAID 0 doesn't offer any protection. RAID 1+0 is basically a mirror. Whether or not you need any protection depends on how many DAG copies you have.
8. YMMV. Build a testbed and try it yourself.
9. 192 5GB mail files? Are you deploying Microsoft Mail? Exchange is limited to 100 databases, and you ought to be looking at databases much larger than 5 GB in size.
10. Good luck with that.

Ed Crowley MVP
"There are seldom good technological solutions to behavioral problems."
September 20th, 2011 7:09am

Thanks very much for your reply Ed, it is appreciated :) To clarify some of my points:

Re point 2, it's not HR having access to mailboxes. It's when HR messes up and tells IT to terminate a user but sends the wrong name, or mistakenly terminates someone who is just on maternity leave. We've had 85 incidents of that just in our Sydney office of 2000 users in the past 3 years. Then there's mail admins just being careless. I'm under the impression that in the absence of a backup, the only way to protect against this is a lagged database copy. Or could you just institute a policy of disabling access to mail files without deleting them? Or is there a 3rd party product anyone recommends to handle terminations?

Re point 7, I understand the different RAID types. I was talking about just the restore LUN itself, which it has been recommended to me that we could use for maintenance utilities, so the question was whether the increased throughput of RAID0 or RAID10 would be worthwhile for running ISInteg and ESEUtil.

Re point 9, I think the term mail file vs mailbox was confusing here. I meant that each database may be made up of 192 mailboxes with a max quota of 5GB each, so 960GB just there, leaving the rest of the volume for content indexes, translogs, free space etc. Is this a wise use of space on a 2TB volume?
September 20th, 2011 8:19am

In response to point 2, you may want to implement a longer "keep deleted mailbox" period, in which case you can reconnect the mailbox if an admin removes it. You can configure this setting to a value from 0 to 24,855 days. Or, as you suggest, a lagged copy will work and will give you up to 14 days if configured.
September 20th, 2011 2:45pm

2. You'd need a really long lag to do that. You already have that protection with Deleted Mailbox Retention, which is set by default to 30 days but can be increased. You may want to change your procedures so that terminated employees' user accounts aren't deleted, but retained for a month or a few, simplifying their reinstatement later. Before considering a lagged copy, consider how you would actually make use of it in a procedural sense. It's not as simple as it might seem. Frankly, I don't know of too many organizations that have a coherent strategy for using lagged copies.

7. I would hope that you wouldn't be running ESEUTIL (ISINTEG is gone in Exchange 2010) so often that performance would be of great importance.

9. That's even more of a cushion than you think because few of your users will keep their mailboxes at the quota; lots of them will keep things clean. Also, service mailboxes, like conference room calendars, often don't get very large. If you're using a SAN, you'll also have considerations regarding backups and their own maintenance requirements. Generally, Exchange storage design requires you to acquire more disk spindles than you need for planned mailbox quota, so increasing quota is, in a sense, free. Your design should be based on the number of disk spindles you need for performance for your users; then you can allocate the databases across them pretty much however you want as long as it's relatively even. I recommend you use the Mailbox Role Storage Calculator to help you come up with such a design.

Ed Crowley MVP
"There are seldom good technological solutions to behavioral problems."
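The trade-off between retention and lag for point 2 can be illustrated with a small Python sketch. It only compares how long each protection window stays open, using the 30-day default retention and the 14-day lag ceiling mentioned in this thread; the dates and the `recoverable` helper are hypothetical, and real purge timing in Exchange depends on maintenance schedules that this ignores.

```python
from datetime import date, timedelta

DELETED_MAILBOX_RETENTION_DAYS = 30  # default cited in the thread
MAX_REPLAY_LAG_DAYS = 14             # upper bound cited in the thread


def recoverable(deleted_on: date, noticed_on: date, window_days: int) -> bool:
    """True if the mistake is noticed while the protection window is still open."""
    if noticed_on < deleted_on:
        raise ValueError("noticed_on precedes deleted_on")
    return noticed_on - deleted_on <= timedelta(days=window_days)


# Hypothetical incident: the wrong mailbox is deleted and HR notices 3 weeks later.
deleted = date(2011, 9, 1)
noticed = deleted + timedelta(days=21)

print("retention:", recoverable(deleted, noticed, DELETED_MAILBOX_RETENTION_DAYS))
print("lagged copy:", recoverable(deleted, noticed, MAX_REPLAY_LAG_DAYS))
```

In this made-up case the mailbox is still inside the retention window but already outside the longest possible lag, which is the sense in which a "really long lag" would be needed.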
September 20th, 2011 5:33pm

1. Personally, I haven't encountered logical corruption in a long while. As some colleagues stated: it's rare. If you want to recover from logical corruption, a lagged copy might help, for as long as the corruption hasn't "replayed" into the copy. Running ESEUTIL might solve the problem but will almost certainly involve data loss (that's just how ESEUTIL works).
2. If you've got no lagged copy and you're past the retention time (which you can increase), your only option is to restore from tape. (BACKUPS!)
3. Haven't experienced this in a while either, but it tends to happen now and then...
4. Unfortunately, I have only used Quest tools before.
5. Physical corruption really depends on the storage you're using. Running ESEUtil certainly is NOT the way to go (data loss!). Re-seeding will help (in case of more than one database copy).
6. No experience.
7. I would not recommend running RAID0 as it is quite error-prone. I don't really see the relation between maintenance utilities and the restore LUN. If you're referring to the restore LUN as an alternative location to (temporarily) put a database from a backup, it's less critical, but I would still advise using RAID 10.
8. Re-seeding is pretty much dependent on the network and the performance of the servers on both ends. Speeds vary a lot.
9. To calculate storage needs, I would definitely advise you to use the Mailbox Role Size Calculator from MSFT. It's your starting point for your design!
10. I was always told that the sky is the limit, but I frequently wonder: at what cost? There is almost always a slight delay between a database and a database copy, and that's where the dumpster comes into action. However, if you have totally lost your active copy, it's not recoverable, and you don't have all the non-replicated mails in the dumpster, you might have some data loss (which you cannot predict exactly).

Greets!
__________________________________
Michael Van Horenbeeck
Check out my blog @ Pro-Exchange (Belgian Usergroup)
September 20th, 2011 6:18pm

This is all really good stuff. It sounds like the retention time is a better answer than lagged. I am starting to see why even MS IT don't use lagged copies in their design.

I have run a fair few scenarios through the storage calculator. This is part of the reason why I like JBOD. For a start, if you use RAID5, then the storage calculator insists on using 10K RPM SAS disks as a minimum, presumably because of the small-write penalty on RAID5. That means the maximum disk size available is 600GB (although I have heard 900GB are on their way). When you're looking at a DC supporting 50K users and 3 HA copies, that's 3304 600GB disks. If you go JBOD 2TB SAS 7.2K RPM then it becomes 810 disks, although I would be more comfortable with 4 HA copies when using JBOD. Of course there is always RAID10, which is 1620 disks, but when you look at the costs for doubling the disks, you are arguably better off adding a 4th DAG member, because there are more things that can go wrong with your storage than just losing the disks, which is largely why I am asking these questions. Another option is a hybrid where PDC copies are JBOD and SDC copies are RAID10, but again, if we are facing regular re-seeding events and have to do so across the WAN...

Re the restore LUN and maintenance tools, I think the idea was that if you need to run ESEUtil on a database, you copy it to the restore LUN and run it there, because then the intensive (Yes/No?) I/O it generates will not cause any contention for the active databases in a RAID scenario. I guess it's not important for JBOD but the storage calculator still includes it.

According to the storage calculator, having 192 x 5GB mailboxes per database requires a database+log volume of 1761GB, even though 192 x 5 = 960GB. I know this is factoring in content indexes, translogs, free space etc., so I was looking for real world perspective on whether this was overly generous.

That being said, I don't have much (any) faith that we will have an archiving system in place before migration commences. That means it's a waiting game before people start hitting their quotas, which will be only 6 or so months in some cases. Yes, not everyone will come anywhere close to 5GB, but we have existing users with mailboxes in excess of 35GB (Yes I know!!). It's one of those things that is impossible to predict accurately when existing quota and max email size restrictions are being simultaneously removed upon migration, but getting some real world perspective certainly helps.

I'm off to the US in a couple of weeks for a face to face design session with MS, but I'm a big fan of real world experiences, and fore-warned is fore-armed :)
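For anyone wanting to check the spindle comparison, here is a rough Python sketch of the arithmetic. The disk counts are the calculator outputs quoted in this post; scaling linearly from 3 to 4 copies is my own simplifying assumption (the real calculator may round per server or per DAG), and `scale_copies` is just an illustrative helper.

```python
# Disk counts quoted from the storage calculator runs (50K-user DC, 3 HA copies).
DISKS_3_COPIES = {"RAID5 600GB": 3304, "JBOD 2TB": 810, "RAID10": 1620}


def scale_copies(disks: int, from_copies: int, to_copies: int) -> int:
    """Assume the disk count grows linearly with the number of database copies."""
    per_copy, remainder = divmod(disks, from_copies)
    if remainder:
        raise ValueError("disk count not divisible by copy count")
    return per_copy * to_copies


# A 4th JBOD copy versus staying at 3 copies on RAID10.
jbod_4 = scale_copies(DISKS_3_COPIES["JBOD 2TB"], 3, 4)
raid10_3 = DISKS_3_COPIES["RAID10"]

# Share of the calculator's 1761GB volume used by 192 mailboxes at 5GB each.
fill = (192 * 5) / 1761

print(jbod_4, raid10_3, f"{fill:.0%}")
```

Under that linear assumption, 4 JBOD copies come to 1080 disks against 1620 for 3 RAID10 copies, and a database full of at-quota mailboxes would occupy a little over half of the 1761GB volume.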
September 21st, 2011 8:21am

Hi all, just wanted to say I'm still monitoring this thread, so if anyone else can chime in with their own thoughts or experiences on any of these points it will be appreciated.
October 6th, 2011 10:34pm

This topic is archived. No further replies will be accepted.
