Restore a replica after failover

I'm experimenting with and reading about active geo-replication in the Premium tier. I like how easy it is to set up and then terminate replication, but I don't see any way to re-establish replication with an existing database.

Say, for example, I have one replica set up and need to fail over. I terminate replication, enable write access to the replica, and change my connection strings to point to the replica. I see no way in the Portal, in SQL, or in PowerShell to re-establish replication between the two databases. It seems I essentially need to re-create databases on the other end (twice, in fact). First I create a new replica on the Primary (after renaming or deleting the primary db), let it catch up, then stop replication, delete the NOW primary db, and reverse the process.

This is fast enough on a 1 GB database (though very cumbersome and a little unnerving with my data), but what if I had a 100 GB database? Is this just an interim approach to handling failover/failback, or is it seen as the long-term solution?

May 21st, 2014 11:37pm

Hello,

Failback in an Active Geo-Replication configuration requires setting up a new continuous copy relationship and reseeding.
The Failback section of the following BOL topic describes the detailed steps of the process. Please refer to:
Failover in an Active Geo-Replication Configuration

Regards,
Fanny Liu

May 22nd, 2014 1:32am

Yes, thank you, Fanny, for confirming what I had already found. But can anyone speak to whether this is seen as a long-term solution?
May 22nd, 2014 2:33pm

Hi George,

Let me address your concern. If the failover workflow you are designing is to be triggered by a catastrophic failure on the primary side (e.g. a prolonged outage in the region), then you will use a force terminate. Since we are using asynchronous replication, all committed but not yet synchronized transactions will be lost after termination. Once you start using the new primary for writes, the two copies will immediately diverge (aka split brain). Even if the original primary comes back later, reconnecting the two will in general not be possible due to conflicting transactions. This is why the force terminate is irreversible. The normal failover practice would be to use a different region as a new DR site after you have terminated, and yes, you will have to create a new continuous copy there.

If you are using regular termination, it synchronizes all committed transactions before disconnecting the two databases. So it is technically possible to reconnect, and we are looking into enabling this in the future. But keep in mind that this is a planned operation: it can only be executed from the source and cannot be used when the target database is not available.
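
The difference between the two termination modes can be illustrated with a small toy model. This is only a sketch under stated assumptions: `Database`, `ContinuousCopy`, and `can_catch_up` are hypothetical names invented for illustration and are not part of any Azure API. Replication is modeled as shipping an ordered log of committed transactions from the primary to the secondary.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Database:
    name: str
    log: List[str] = field(default_factory=list)  # committed transactions, in commit order


@dataclass
class ContinuousCopy:
    """Toy model of an asynchronous continuous copy (hypothetical, not an Azure API)."""
    primary: Database
    secondary: Database

    def commit(self, txn: str) -> None:
        # A commit lands on the primary only; shipping to the secondary happens later.
        self.primary.log.append(txn)

    def replicate(self, count: int) -> None:
        # Ship up to `count` not-yet-replicated transactions to the secondary.
        start = len(self.secondary.log)
        self.secondary.log.extend(self.primary.log[start:start + count])

    def terminate(self, force: bool) -> List[str]:
        """End replication; return the transactions the secondary never received."""
        if not force:
            # Planned termination drains all committed transactions first.
            self.replicate(len(self.primary.log))
        return self.primary.log[len(self.secondary.log):]


def can_catch_up(old_primary: Database, new_primary: Database) -> bool:
    # The old primary could be brought forward only if its history is a
    # prefix of the new primary's history, i.e. the two never diverged.
    return new_primary.log[:len(old_primary.log)] == old_primary.log


def demo(force: bool) -> Tuple[List[str], bool]:
    copy = ContinuousCopy(Database("original-primary"), Database("replica"))
    for txn in ("t1", "t2", "t3"):
        copy.commit(txn)
    copy.replicate(1)                # only t1 has reached the replica so far
    lost = copy.terminate(force=force)
    copy.secondary.log.append("t4")  # the replica now takes writes as the new primary
    return lost, can_catch_up(copy.primary, copy.secondary)


print(demo(force=False))  # ([], True)
print(demo(force=True))   # (['t2', 't3'], False)
```

In the forced case the original primary still holds t2 and t3 while the new primary has moved on with t4, so neither history contains the other. In the planned case the original primary's history is a prefix of the new primary's, which is why reconnecting is at least conceivable there.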

I hope this helps but please let me know if I am misunderstanding your scenario.

Thank you

May 22nd, 2014 8:23pm

Sasha, thanks for the response. I appreciate the insight into the split-brain scenario; that explanation makes a lot of sense to me, including why reconnecting can't be supported.

I did have in mind the controlled failover scenario you also mention, and being able to recover by having the original primary "catch up" to the replica when we want to fail back. So yes, you speak to exactly what I'd love to see, and I hope you continue down the path of seeing whether you can support that scenario.

The funny thing with this whole business continuity subject is that one hopes to never need it, and the odds are high it would be rarely used. But when the #%$ hits the fan is exactly when you want it to be as simple and seamless as possible, if you get what I mean. So my ask here may seem like a lot, but it contributes to that "simple and seamless" aspect.

Thanks for the great work in getting these business continuity features out the door.

May 22nd, 2014 9:56pm

George,

Re the simple and seamless failover. I assume by that you mean it should just happen and you as a user should not have to manage it. I admit this question comes up fairly regularly.

During a major outage, unless there is a smoking hole in the ground, our ops team would need time to troubleshoot and understand whether recovery is possible, how long it would take, and whether there is a quick mitigation. Because the failover would impact thousands of databases, they would take that decision only after exhausting every alternative. In the meantime, the impacted applications would have to wait, i.e. take downtime. Some applications would prefer to wait in order to avoid data loss. Many others, however, would prioritize availability over a small data loss. But an automatic failover by Azure DB would always prioritize data protection. Hence the model in which you can initiate the failover yourself, because you know what is more important for your application.
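
The trade-off described above, waiting to avoid data loss versus failing over to restore availability, can be sketched as an application-side policy. This is a minimal illustration only: `FailoverPolicy`, `should_fail_over`, and the thresholds are hypothetical names and values, and how an application would measure outage duration or replication lag is assumed rather than shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverPolicy:
    """Application-specific tolerances (hypothetical values chosen by the app owner)."""
    max_downtime_minutes: float   # how long the application can afford to wait
    max_data_loss_seconds: float  # how much unreplicated work it can afford to lose


def should_fail_over(policy: FailoverPolicy,
                     outage_minutes: float,
                     replication_lag_seconds: float) -> bool:
    # While the outage is still within tolerated downtime, keep waiting for recovery.
    if outage_minutes < policy.max_downtime_minutes:
        return False
    # Past that point, fail over only if the expected data loss is acceptable.
    return replication_lag_seconds <= policy.max_data_loss_seconds


availability_first = FailoverPolicy(max_downtime_minutes=30, max_data_loss_seconds=5)
data_first = FailoverPolicy(max_downtime_minutes=30, max_data_loss_seconds=0)

print(should_fail_over(availability_first, outage_minutes=45, replication_lag_seconds=2))
print(should_fail_over(data_first, outage_minutes=45, replication_lag_seconds=2))
```

An application that puts data protection first sets its data-loss tolerance to zero and therefore keeps waiting, which mirrors the choice an automatic, platform-initiated failover would make.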

Re the seamless part: as you can see, we allow you to choose the DR region, which is critical for some customers. In addition, we allow you to use the secondary for read-only queries so that you can do more with the secondary than DR. To make this possible, the secondary database has a different connection string than the primary. It does make failover less seamless, but I hope you can see the advantages here.

We published a document discussing different failover designs. Here is the link:

http://msdn.microsoft.com/en-us/library/azure/dn741328.aspx

Thank you

May 23rd, 2014 12:44am

Sasha, thanks so much for the additional comments. I actually am completely on board with what you're saying. When I referred to seamless failover I actually wasn't referring to everything being automated. In our scenario I actually much prefer to control the failover timing and process for exactly the reasons you mention. I want to control our destiny and not leave it in the hands of Azure Support.

For me the seamless reference was about making the actual failover and failback steps as simple and bulletproof as possible. Knowing that I control failover, but that I have to know "what" to do and "how" to do it, means that making those steps seamless is important, especially since some of the steps will be followed during a high-stress time. I think you have the failover pretty well nailed in the Portal; it is simple. What I'd love to see is the failback made just as simple. In particular, in the case where I am doing a regular termination of the replica and the sync completes fully, I'd love to be able to fail back to the original primary and then easily deem it the primary again. You stated this was a scenario that is technically possible and that you were looking into the possibility of implementing it. Please do count me as someone who would love to see that.

If you're interested, here is my ideal feature set for geo-replication:

1) All of the same geo-replication features are supported in both Standard and Premium tiers.

2) I can choose between a failover policy that I control and one that Azure controls (active vs non-active in your terms, I guess).

3) I can access the replica as read-only regardless of whether I choose active or not. I am OK with non-active having only one replica and with Azure controlling its location, but I'd really like to have read-only access to it.

4) The cost of the replicas is substantially reduced over the primary since I have limited access to them (I'm thinking 50% in my mind).

Thanks for listening Sasha.

May 23rd, 2014 6:40pm

Hi George,

I am picking up this thread from last year and hope you can respond. I get all the requirements you listed. Are you OK with having to change the connection string after the failover, and also with having to manage the security configuration on the target separately (matching logins, firewall, etc.)?

Also, can you please explain the need for read-only access?

Cheers,

Sasha

July 20th, 2015 7:05pm

Hi Sasha,

At this point I am "OK" with having to change the connection strings after a failover (because I've built the capability into my app), but I would much prefer not to have to do that. It would make the process MUCH easier and more seamless.

Having to match the security config on the target is a huge pain point for us. This is an area you could make big strides in. If you could automate that it would make the transition to the replica so much easier for us. Again we have built tooling to do this for us but we have to maintain it and as we build other apps we'd need the same tooling. A pain in the rear end.
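
The kind of tooling mentioned above could start with a simple drift check between the two servers. The sketch below is only an illustration under stated assumptions: `ServerSecurity` and `missing_on_target` are hypothetical names, and the login names and firewall rules are assumed to have already been read from each server by some other means (not shown, and not an Azure API).

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

# A firewall rule is modeled as (rule name, start IP, end IP); hypothetical shape.
FirewallRule = Tuple[str, str, str]


@dataclass(frozen=True)
class ServerSecurity:
    """Snapshot of one server's security configuration (illustrative only)."""
    logins: FrozenSet[str]
    firewall_rules: FrozenSet[FirewallRule]


def missing_on_target(source: ServerSecurity, target: ServerSecurity) -> Dict[str, List]:
    # Anything present on the source but absent on the target would break
    # the application after a failover, so report it for follow-up.
    return {
        "logins": sorted(source.logins - target.logins),
        "firewall_rules": sorted(source.firewall_rules - target.firewall_rules),
    }


primary = ServerSecurity(
    logins=frozenset({"app_user", "report_user"}),
    firewall_rules=frozenset({("office", "203.0.113.0", "203.0.113.255")}),
)
secondary = ServerSecurity(
    logins=frozenset({"app_user"}),
    firewall_rules=frozenset(),
)

print(missing_on_target(primary, secondary))
```

Running a check like this on a schedule, rather than only at failover time, would surface drift before the high-stress moment arrives.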

Our use of read-only access is simple. We use the primary for our transactional workloads, but we have a number of reporting processes (some offline, some online) that hit the database heavily, so we run this workload against the replica to get the best performance for transactional activity on the primary.
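
The workload split described here, together with the connection string change after a failover, can be sketched as a small router inside the application. This is a minimal sketch, not the poster's actual tooling: `ConnectionRouter` and the placeholder connection strings are hypothetical, and no real database driver is involved.

```python
from typing import Optional


class ConnectionRouter:
    """Chooses a connection string per workload (hypothetical helper, no real driver calls)."""

    def __init__(self, primary: str, replica: Optional[str]) -> None:
        self._primary = primary
        self._replica = replica

    def connection_for(self, workload: str) -> str:
        # Reporting reads go to the readable replica when one exists;
        # everything else, including all writes, goes to the primary.
        if workload == "reporting" and self._replica is not None:
            return self._replica
        return self._primary

    def fail_over(self) -> None:
        # After replication is terminated, the former replica becomes the
        # primary and no readable replica is left until a new copy is seeded.
        if self._replica is None:
            raise RuntimeError("no replica available to fail over to")
        self._primary, self._replica = self._replica, None


# Placeholder connection strings, for illustration only.
router = ConnectionRouter(primary="Server=primary.example;Database=app",
                          replica="Server=replica.example;Database=app")

print(router.connection_for("transactional"))
print(router.connection_for("reporting"))
router.fail_over()
print(router.connection_for("reporting"))
```

Keeping this choice in one place means a failover becomes a single state change in the application rather than a hunt through configuration files.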

I am still bothered that the replicas cost the same as the primary database. I do see that, from your perspective, the hardware requirements are the same. It's just that, from my perspective, I am getting much less use out of that replica, so it feels expensive to pay full price.

Hope this helps...

July 23rd, 2015 1:07pm

This topic is archived. No further replies will be accepted.
