Site Resilience

Greetings,

New Exchange Server 2013 deployment.

Two sites (Main Site and DR Site).

Each site has one Edge server, one CAS server, and two MBX servers.

- External DNS records resolve to the Edge servers with round robin.

- The DR site should automatically respond in case of any failure in the Main Site.

- The CAS server in DR should respond to users if the CAS server in the Main Site is down, and vice versa.

Questions

1) What external and internal DNS records are needed to achieve high availability for the Edge and CAS servers? Should DNS be configured with round robin for the Edge and CAS records?

2) What DNS records should be present in a split DNS design (internal/external)?

Thank you,

Jamil

January 30th, 2015 2:32am

A DR site implies a manual switchover, not an automatic one. If you want automatic failover, you need a third datacenter for the File Share Witness and an even-node DAG with the same number of DAG members in each datacenter.

Each datacenter would be a peer, with the same level of service in each. If this is split-brain DNS, then the internal and external URLs for each service would match.

January 30th, 2015 4:41am

Thanks for your reply,

This is not an accurate answer; according to the Exchange 2013 documentation it is automatic, unlike 2010.

I am implementing a similar scenario and am just looking for the exact DNS records required and their configuration. And finally, I am asking about the Edge and CAS roles, not the Mailbox role and DAGs, as per the original question.

So if there is any other answer, please post it.

Thanks again


January 30th, 2015 1:14pm



Please post the documentation link that states that a DR site failover in 2013 is automatic.

As for the DNS records, as I stated, if you are using split-brain DNS, the CAS URLs should be the same internally and externally: the Outlook Anywhere hostnames and the HTTPS URLs (OAB, Autodiscover, etc.).
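
A quick way to sanity-check that is to resolve the shared names from an internal client and from an external one and confirm they exist in both zones. A rough Python sketch, assuming placeholder FQDNs (substitute your own namespace):

import socket

# Placeholder FQDNs for the shared (split-brain) namespace -- substitute your own.
NAMESPACE_FQDNS = ["mail.mydomain.com", "autodiscover.mydomain.com"]

for fqdn in NAMESPACE_FQDNS:
    try:
        # getaddrinfo returns every A record the resolver hands back, so a name
        # with one IP per datacenter should yield two entries here.
        infos = socket.getaddrinfo(fqdn, 443, proto=socket.IPPROTO_TCP)
        ips = sorted({info[4][0] for info in infos})
        print(f"{fqdn} -> {', '.join(ips)}")
    except socket.gaierror as err:
        print(f"{fqdn} does not resolve from this machine: {err}")

Run it once from inside the network and once from outside; with split-brain DNS both runs should succeed, each side returning the IPs appropriate to it.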

Otherwise, please post specific examples of what DNS records you are asking about.

January 30th, 2015 4:08pm

To provide availability of the Edge servers:

mydomain.com has two MX records, resolving to:

EDGE1  212.213.10.1  10

EDGE2  212.213.11.1  20

Then:

Two DNS A records resolve to the CAS servers in each site:

autodiscover.mydomain.com  212.213.10.2  10

autodiscover.mydomain.com  212.213.11.2  20

Question: are these records correct? And if EDGE1 fails and EDGE2 responds, which CAS server will serve client requests in that case, CAS1 or CAS2?
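
My understanding of the client side is that it simply gets both autodiscover IPs back and uses whichever one it can connect to first, roughly like the sketch below (port 443 and the 5-second timeout are my assumptions):

import socket

# The two autodiscover A records from my example above.
AUTODISCOVER_IPS = ["212.213.10.2", "212.213.11.2"]

def first_reachable(ips, port=443, timeout=5):
    """Return the first IP that accepts a TCP connection -- i.e. the CAS
    (or CAS site) that would actually end up serving this client."""
    for ip in ips:
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return ip
        except OSError:
            continue  # hard failure, move on to the next address
    return None

print("Client would be served by:", first_reachable(AUTODISCOVER_IPS) or "nothing reachable")

Is that correct, and is the choice of CAS really independent of which Edge accepted the inbound mail?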

Thanks

January 30th, 2015 5:25pm

I agree with Andy that the term DR site means manual failover.

Maybe your DR means something else, but there is no automatic failover for the Mailbox role if your primary site is totally down. It does not make sense to auto-fail only the CAS and Edge roles.

January 30th, 2015 6:17pm

Li Zhen ...

First, I did not ask about the DAG; if you read the entire thread you would see I am asking about the CAS and Edge roles, specifically about the DNS records for a two-site high-availability scenario.

Second, this is for you and Andy: I am quite sure you are out of touch with the technology. I am asking about Exchange Server 2013. Just search for "High Availability and Site Resilience" in the Exchange online help and see the "Site resilience" section; you will find:

==================

Site resilience


Although Exchange 2013 continues to use DAGs and Windows Failover
Clustering for Mailbox server role high availability and site resilience, site
resilience isn't the same in Exchange 2013. Site resilience is much better in
Exchange 2013 because it has been simplified. The underlying architectural
changes that were made in Exchange 2013 have significant impact on the recovery
aspects of a site resilience configuration.

In Exchange 2010, mailbox (DAG) and client access (Client Access
server array) recovery were tied together. If you lost all of your Client Access
servers, the VIP for the array, or a significant portion of your DAG, you were
in a situation where you needed to do a datacenter switchover. This is a
well-documented and generally well-understood process, although it takes time to
perform, and requires human intervention to begin the process.

In Exchange 2013, if you lose your Client Access server array for
whatever reason (for example, the load balancer fails), you don't need to
perform a datacenter switchover. With the proper configuration, failover happens
at the client level and clients are automatically redirected to a second
datacenter that has operating Client Access servers, and those operating Client
Access servers proxy the communication back to the user's Mailbox server, which
remains unaffected by the outage (because you don't do a switchover). Instead of
working to recover service, the service recovers itself and you can focus on
fixing the core issue (for example, replacing the failed load balancer).

Furthermore, with the namespace simplification, consolidation of
server roles, de-coupling of Active Directory site server role requirements,
separation of Client Access server array and DAG recovery, and load balancing
changes, there are changes in Exchange 2013 that now enable both Client Access
server and DAG recovery to be separate and automatic across sites, thereby
providing datacenter failover scenarios, if you have three locations.

In Exchange 2010, you could deploy a DAG across two datacenters and
host the witness in a third datacenter and enable failover for the Mailbox
server role for either datacenter. But you didn't get failover for the solution
itself, because the namespace still needed to be manually changed for the
non-Mailbox server roles.

In Exchange 2013, the namespace doesn't need to move with the DAG.
Exchange leverages fault tolerance built into the namespace through multiple IP
addresses, load balancing (and if need be, the ability to take servers in and
out of service). Modern HTTP clients work with this redundancy automatically.
The HTTP stack can accept multiple IP addresses for a fully qualified domain
name (FQDN), and if the first IP address it tries fails hard (that is, it can't
connect), it will try the next IP address in the list. In a soft failure
(connection is lost after the session is established, perhaps due to an
intermittent failure in the service where, for example, a device is dropping
packets and needs to be taken out of service), the user might need to refresh
their browser.

This means the namespace is no longer a single point of failure as
it was in Exchange 2010. In Exchange 2010, perhaps the biggest single point of
failure in the messaging system is the FQDN that you give to users because it
tells the user where to go. In the Exchange 2010 paradigm, changing where that
FQDN goes isn't easy because you have to change DNS, and then handle DNS
latency, which in some parts of the world is challenging. And you have name
caches in browsers that are typically about 30 minutes or more that also have to
be handled.

One of the changes in Exchange 2013 is to enable clients to have
more than one place to go. Assuming the client has the ability to use more than
one place to go (almost all the client access protocols in Exchange 2013 are
HTTP based (examples include Outlook, Outlook Anywhere, EAS, EWS, OWA, and EAC),
and all supported HTTP clients have the ability to use multiple IP addresses),
thereby providing failover on the client side. You can configure DNS to hand
multiple IP addresses to a client during name resolution. The client asks for
mail.contoso.com and gets back two IP addresses, or four IP addresses, for
example. However many IP addresses the client gets back will be used reliably by
the client. This makes the client a lot better off because if one of the IP
addresses fails, the client has one or more other IP addresses to try to connect
to. If a client tries one and it fails, it waits about 20 seconds and then tries
the next one in the list. Thus, if you lose the VIP for the Client Access server
array, recovery for the clients happens automatically, and in about 21
seconds.

The benefits include the following:


  • In Exchange 2010, if you lose the load balancer in your primary datacenter
    and you don't have another one in that site, you had to do a datacenter
    switchover. In Exchange 2013, if you lose the load balancer in your primary
    site, you simply turn it off (or maybe turn off the VIP) and repair or replace
    it. Clients that aren't already using the VIP in the secondary datacenter will
    automatically fail over to the secondary VIP without any change of namespace,
    and without any change in DNS. Not only does that mean you no longer have to
    perform a switchover, but it also means that all of the time normally associated
    with a datacenter switchover recovery isn't spent. In Exchange 2010, you had to
    handle DNS latency (hence, the recommendation to set the Time to Live (TTL) to 5
    minutes, and the introduction of the failback URL). In Exchange 2013, you don't
    need to do that because you get fast failover (20 seconds) of the namespace
    between VIPs (datacenters).
  • Because you can fail over the namespace between datacenters, all that's
    needed to achieve a datacenter failover is a mechanism for failover of the
    Mailbox server role across datacenters. To get automatic failover for the DAG,
    you simply architect a solution where the DAG is evenly split between two
    datacenters, and then place the witness server in a third location so that it
    can be arbitrated by DAG members in either datacenter, regardless of the state
    of the network between the datacenters that contain the DAG members.
  • In this scenario, the administrator's efforts are geared toward simply
    fixing the problem, and not spent restoring service. You simply fix the thing
    that failed; while service has been running and data integrity has been
    maintained. The urgency and stress level you feel when fixing a broken device is
    nothing like the urgency and stress you feel when you're working to restore
    service. It's better for the end user, and less stressful for the
    administrator.

You can allow failover to occur without having to perform
switchbacks (sometimes mistakenly referred to as failbacks). If you lose Client
Access servers in your primary datacenter and that results in a 20 second
interruption for clients, you might not even care about failing back. At this
point, your primary concern would be fixing the core issue (for example,
replacing the failed load balancer). After it's back online and functioning,
some clients will start using it, and other clients might remain operational
through the second datacenter.

Exchange 2013 also provides functionality that enables
administrators to deal with intermittent failures. An intermittent failure is
where, for example, the initial TCP connection can be made, but nothing happens
afterward. An intermittent failure requires some sort of extra administrative
action to be taken because it might be the result of a replacement device being
put into service. While this repair process is occurring, the device might be
powered on and accepting some requests, but not really ready to service clients
until the necessary configuration steps are performed. In this scenario, the
administrator can perform a namespace switchover by simply removing the VIP for
the device being replaced from DNS. Then during that service period, no clients
will be trying to connect to it. After the replacement process has completed,
the administrator can add the VIP back to DNS, and clients will eventually start
using it.


==================
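
The key part for my question is the paragraph above about the HTTP stack accepting multiple IP addresses per FQDN and moving to the next address after roughly 20 seconds on a hard failure. As a rough sketch of that behaviour (mail.mydomain.com and the 20-second timeout are assumptions taken from the text above):

import socket
import time

def connect_with_fallback(fqdn, port=443, per_address_timeout=20):
    """Try each IP returned for the FQDN in order, waiting up to ~20 seconds
    per address on a hard failure, as the excerpt above describes."""
    start = time.monotonic()
    for family, socktype, proto, _name, sockaddr in socket.getaddrinfo(
            fqdn, port, proto=socket.IPPROTO_TCP):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(per_address_timeout)
        try:
            sock.connect(sockaddr)
        except OSError:
            sock.close()
            continue  # this VIP is dead, fall through to the other datacenter's IP
        print(f"Connected to {sockaddr[0]} after {time.monotonic() - start:.1f}s")
        return sock
    raise ConnectionError(f"no address for {fqdn} is reachable")

# Example with a placeholder namespace:
# connect_with_fallback("mail.mydomain.com")

That is where the "about 21 seconds" figure comes from: roughly the 20-second timeout against the dead VIP plus the time it takes to connect to the surviving one.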

January 30th, 2015 6:54pm

And here is a link for you, Andy, as per your request:

https://technet.microsoft.com/en-us/library/dd638137(v=exchg.150).aspx

January 30th, 2015 7:40pm


Key there is this statement:

With the proper configuration... 

I know you are focused on the CAS and SMTP piece of this, but you have to look at the total picture. Automatic "CAS" failover means nothing if the mailbox servers are not accessible.

And really, your CAS and MBX roles should be together; there is no reason to have separate role servers. This exercise will be clearer if you do that.

The proper configuration means two peer datacenters and a third datacenter with the File Share Witness, as described here:

http://blogs.technet.com/b/exchange/archive/2014/04/21/the-preferred-architecture.aspx

If your organization has a third location with a network infrastructure that is isolated from network failures that affect the site resilient datacenter pair in which the DAG is deployed, then the recommendation is to deploy the DAGs witness server in that third location. This configuration gives the DAG the ability to automatically failover databases to the other datacenter in response to a datacenter-level failure event, regardless of which datacenter has the outage.

As for the DNS config, what are you using for load balancing across the datacenters? If you don't have a load balancer, you can use a DNS CNAME or round robin, but understand that manual intervention is required if one datacenter is down: you may have to remove an entry or change it in DNS.
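
If you go the round robin route, at least put a simple probe in place so you know when an entry needs to be pulled. Something along these lines would do as a starting point (the VIP addresses are placeholders, and /owa/healthcheck.htm is the kind of per-protocol health page a load balancer would normally probe):

import http.client
import ssl

# Placeholder CAS/VIP addresses, one per datacenter.
CAS_VIPS = {"Main Site": "212.213.10.2", "DR Site": "212.213.11.2"}

# Probing by IP, so skip certificate validation for this check only.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

for site, ip in CAS_VIPS.items():
    try:
        conn = http.client.HTTPSConnection(ip, 443, timeout=5, context=ctx)
        conn.request("GET", "/owa/healthcheck.htm")
        print(f"{site} ({ip}): HTTP {conn.getresponse().status}")
        conn.close()
    except (OSError, http.client.HTTPException):
        print(f"{site} ({ip}): not answering -- candidate for removal from DNS")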

As for MX records, SMTP is more flexible: if one Edge is down, the other should be tried.
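
The fallback you get for free from sending MTAs looks roughly like this (Edge IPs and preferences taken from your earlier example; a sketch, not production code):

import smtplib

# MX targets from the example earlier in the thread, as (preference, host) pairs.
MX_RECORDS = [(10, "212.213.10.1"), (20, "212.213.11.1")]

def first_answering_edge(mx_records, timeout=10):
    """Try each Edge in MX preference order, the way a sending MTA would,
    and return the first one that answers on port 25."""
    for preference, host in sorted(mx_records):
        try:
            with smtplib.SMTP(host, 25, timeout=timeout) as smtp:
                code, _banner = smtp.noop()
                print(f"Edge {host} (preference {preference}) answered: {code}")
                return host
        except (OSError, smtplib.SMTPException):
            print(f"Edge {host} (preference {preference}) not answering, trying next MX")
    return None

first_answering_edge(MX_RECORDS)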

January 31st, 2015 1:01am

I agree with Andy that the term DR site means manual failover.

Maybe your DR means something else, but there is no automatic failover for the Mailbox role if your primary site is totally down. It does not make sense to auto-fail only the CAS and Edge roles.


Exactly!
January 31st, 2015 1:02am

Gentlemen,

Why do you keep talking about DAGs? I asked about DNS records for the CAS and Edge roles to achieve high availability. It is clear that you are not getting the point and just want to post ANY REPLY to increase your post count!!!

Andy, now you admit it is there with the proper configuration!!! Just before the article you held another opinion!! Now, can you tell me how to configure the internal and external DNS records for CAS and Edge to achieve this proper configuration!!!????

Or will you jump again and talk about the DAG?

January 31st, 2015 8:01am


I did. Re-read my responses:

As for the DNS records, as I stated, if you are using split-brain DNS, the CAS URLs should be the same internally and externally: the Outlook Anywhere hostnames and the HTTPS URLs (OAB, Autodiscover, etc.).

and

As for the DNS config, what are you using for load balancing across the datacenters? If you don't have a load balancer, you can use a DNS CNAME or round robin, but understand that manual intervention is required if one datacenter is down: you may have to remove an entry or change it in DNS.

As for MX records, SMTP is more flexible: if one Edge is down, the other should be tried.

You are missing the point, however. You asked about "site resilience" in the title of your thread.

You also stated that the "DR Site should automatically respond in case of any failure in Main Site".

That cannot be achieved without a third datacenter that holds the FSW. I haven't changed my opinion or stance on that. You keep mentioning the client and SMTP pieces, and those are just part of the picture.

We keep mentioning DAGs because you are talking about high availability, and you can't remove the DAG discussion if you want true HA. As for points, no one cares about those, and you don't get them by simply responding, so I don't get your point. Sorry you think my responses are not adequate. We are trying to help you see the entire architecture that is required for HA and site resilience.

I will say no more. Good luck.

January 31st, 2015 8:25am


1. As I already said, if you only fail over the CAS but not the Mailbox role, it does not make any sense while all the databases are dismounted.

2. In terms of DAG technology, there is no difference between 2010 and 2013. I don't think I am out of touch with the technology; if you look at my profile, I am an MCSE: Exchange 2013 charter member.

January 31st, 2015 9:23am


What do you want to do if your CAS is auto-failed to the DR site and your Mailbox role is not? Do you know that Exchange is totally useless with only the CAS running while all the mailbox databases are dismounted?

January 31st, 2015 10:13am

Li

Because you did not understand the question, and never understood the replies that followed the original post, you keep talking about the DAG, the easiest part of Exchange availability. So I will answer you clearly here for the last time:

I know how to implement a DAG; I am asking how to achieve availability for the other roles (CAS/Edge) in order to achieve site availability.

So I don't need to discuss the DAG.

January 31st, 2015 1:47pm

This topic is archived. No further replies will be accepted.
