RT? – Making Sense of High Availability

Kevin Mack

Dec 2, 2019

CPOL

Hello all. In keeping with the last post on the blog, I’ve started a series of posts on High Availability. The focus here is: how do I architect my solution to ensure that it meets the availability demands of my customers?

So odds are if you’ve started down this direction, you’ve heard 3 acronyms:

SLA – Service Level Agreement
RTO – Recovery Time Objective
RPO – Recovery Point Objective

So what does each of these mean, and how do they relate to your solution? I covered SLA pretty extensively in my previous post, so I would direct you there for a definition and for recommendations on how to approach that topic.

So the next question is really: what are RTO and RPO, and how do they relate to High Availability?

What is RTO?

RTO stands for Recovery Time Objective. In software terms, it answers the question: when something goes wrong, how fast do you recover?

So let’s take an example, because I work best with examples. Say I have a solution that is deployed in multiple regions, uses Traffic Manager, and is replicated into another region. If Traffic Manager checks the endpoint every 5 seconds, and 3 failures cause a failover, then it takes 3 × 5 = 15 seconds to detect the outage and fail over, so my RTO is 15 seconds.

By using a dual-region deployment, I’m able to keep my RTO relatively low. Now the above example is pretty simplistic. Really, we should do this analysis per service in our architecture to determine how long each failover takes; the longest of those is your solution’s RTO.
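The per-service analysis above can be sketched in a few lines of Python. This is only an illustration: the `ServiceFailover` type, the service names, and the second service's numbers are made up, and the sketch counts detection time only (probe interval times tolerated failures), as in the Traffic Manager example.

```python
from dataclasses import dataclass


@dataclass
class ServiceFailover:
    name: str
    probe_interval_s: float   # seconds between health probes
    failures_to_trip: int     # consecutive failed probes before failover

    @property
    def detection_time_s(self) -> float:
        # Worst-case time spent noticing the outage and tripping failover.
        return self.probe_interval_s * self.failures_to_trip


def solution_rto_s(services):
    # The solution is only recovered once its slowest service has failed over.
    return max(s.detection_time_s for s in services)


services = [
    # 5-second probe and 3 failures, as in the example from the text.
    ServiceFailover("traffic-manager-endpoint", probe_interval_s=5, failures_to_trip=3),
    # Hypothetical second service with a slower probe.
    ServiceFailover("hypothetical-api", probe_interval_s=10, failures_to_trip=2),
]
print(solution_rto_s(services))  # 20
```

Here the second service dominates, so the solution's RTO is 20 seconds even though the Traffic Manager endpoint on its own would recover in 15.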

How do we improve RTO?

Now, remember that RTO is really a measure of continuity of business, so we are looking at High Availability and Disaster Recovery. Ultimately we are talking about service uptime more than anything else.

So the best way to improve RTO is to enable replication and take steps to increase the speed of recovery. In the last discussion on SLA, we took steps to minimize downtime by increasing SLA. This conversation is about how we minimize the downtime caused by the failovers themselves.

The most important things involved in this are the following:

Monitoring
Response time
Data Replication
Failover

So the key metric to pay attention to is how long it takes to get up and running.

Monitoring is the cornerstone of your RTO target. If you don’t know there is a problem, you can’t fix it. Many blogs and articles focus on the next 3 parts, but let’s be honest: detection comes first, and any delay in it counts against you. If your logs operate on a 5-minute delay, then you need to factor those 5 minutes into your RTO.

From there the next piece is response time. And I mean this in the true sense of how quickly can you trigger a failover to your DR state. How quickly can you triage the problem and respond to the situation? The best RTO targets leverage as much automation as possible here.

Next, by looking at data replication, we can ensure that we are able to bring any data stores back up quickly and maintain continuity of business. This is important because every time we have to restore a data store, that takes time and stretches out our RTO. Being able to fail over in 2 minutes doesn’t do you much good if it takes 20 minutes to get the database up.

Finally, failover. If you are in a state where you need to fail over, how long does that take, and what automation and steps can you take to shorten that time significantly?
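To see how the four pieces add up, here is a rough back-of-the-envelope sketch. It assumes detection and response happen one after the other, and that the app failover and the data restore then run in parallel; that ordering is my assumption, not a rule. The function name and the 1-minute response time are made up for illustration.

```python
def estimated_rto_minutes(monitoring_delay, response, app_failover, data_restore):
    # Detection and response happen first, one after the other.
    # Assumption: app failover and data restore then run in parallel,
    # so the slower of the two sets the pace.
    return monitoring_delay + response + max(app_failover, data_restore)


# The 5-minute log delay and the 2-minute / 20-minute figures come from the text;
# the 1-minute response time and the 1-minute replicated restore are made up.
slow_db = estimated_rto_minutes(monitoring_delay=5, response=1,
                                app_failover=2, data_restore=20)
replicated_db = estimated_rto_minutes(monitoring_delay=5, response=1,
                                      app_failover=2, data_restore=1)
print(slow_db, replicated_db)  # 26 8
```

Under these assumptions, the monitoring delay is the biggest remaining cost once the database is replicated.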

Let’s work through an example. Say I have a solution that consists of the following in one region:

Azure App Service
Azure SQL

Say I’m deployed in a single environment, and my DR plan is to stand up another region in the event of a disaster. That solution has a pretty high RTO: if it takes 15 minutes to stand up that environment and deploy to it, then the RTO is 15 minutes. If I wanted to lower that, there are a couple of things I can do:

I can increase the automation I use to reduce that time.
I can spin up another region ahead of time, or leverage options to do replication.
I can set up automation around detection and response.
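As a rough sketch of what automation around detection and response could look like, the class below counts consecutive failed health probes and triggers a failover once a threshold is reached. `FailoverMonitor`, `probe`, and `failover` are hypothetical names; in a real system the probe would call your health endpoint and the failover callback would kick off your actual DR runbook.

```python
from typing import Callable


class FailoverMonitor:
    """Triggers a failover after N consecutive failed health probes."""

    def __init__(self, probe: Callable[[], bool], failover: Callable[[], None],
                 failures_to_trip: int = 3):
        self.probe = probe
        self.failover = failover
        self.failures_to_trip = failures_to_trip
        self.consecutive_failures = 0
        self.failed_over = False

    def tick(self) -> None:
        """Run one probe; call this on a timer (for example every 5 seconds)."""
        if self.failed_over:
            return
        if self.probe():
            self.consecutive_failures = 0  # a healthy probe resets the count
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_to_trip:
            self.failover()
            self.failed_over = True


# Simulated probe results: one blip, a recovery, then a real outage.
results = iter([True, False, True, False, False, False])
events = []
monitor = FailoverMonitor(probe=lambda: next(results),
                          failover=lambda: events.append("failover"))
for _ in range(6):
    monitor.tick()
print(events)  # ['failover']
```

Requiring consecutive failures keeps a single blip from triggering a failover, at the cost of a slightly longer detection time.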

What is RPO?

RPO stands for Recovery Point Objective, which focuses on recovery from a data perspective. So if you have a disaster, how much data would be lost? What would the impact be?

When looking at RPO, it all comes down to data and potential data loss. So how do we minimize the window for data loss and lower the chances of lost transactions in your application?
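One way to picture that window is sketched below. Given the times at which recovery points (backups or replication checkpoints) were taken and the time of the failure, the data lost is everything written since the last usable recovery point. The function name and the 15-minute schedule are made up for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(recovery_points, failure_time):
    # The most recent recovery point taken at or before the failure
    # is the best state we can restore to.
    usable = [p for p in recovery_points if p <= failure_time]
    if not usable:
        raise ValueError("no recovery point exists before the failure")
    return failure_time - max(usable)


start = datetime(2019, 12, 2, 9, 0)
# Hypothetical schedule: a recovery point every 15 minutes (9:00 to 9:45).
points = [start + timedelta(minutes=15 * i) for i in range(4)]
failure = datetime(2019, 12, 2, 9, 44)

print(data_loss_window(points, failure))  # 0:14:00
```

With a 15-minute schedule the worst case is just under 15 minutes of lost data, so shortening the interval shrinks the window directly.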

There are a few key elements that can assist with this, starting with how your application handles eventual consistency. It is possible to get to an RPO of 0, as long as you have constant data replication in your solution.

Now the most important part is that the replication needs to be executed in a synchronous fashion, meaning that a write must be replicated before an acknowledgment is sent. Eventual consistency will keep your RPO higher than zero, because it means that the replication will “eventually” get there, and anything not yet replicated when disaster strikes is lost.
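The difference is easy to see in a toy in-memory model. This is a sketch only: `Primary` is a hypothetical class, not any real database API, and a plain list stands in for the replica.

```python
from collections import deque


class Primary:
    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.replica = []        # data that survives losing the primary
        self.pending = deque()   # writes not yet shipped (async mode only)
        self.acknowledged = []   # writes the client was told succeeded

    def write(self, record):
        if self.synchronous:
            self.replica.append(record)   # replicate first...
        else:
            self.pending.append(record)   # ...or replicate "eventually"
        self.acknowledged.append(record)  # then acknowledge to the client

    def replicate_pending(self):
        # Background shipping of queued writes in async mode.
        while self.pending:
            self.replica.append(self.pending.popleft())

    def lost_if_primary_dies_now(self):
        # Acknowledged writes that the replica never received.
        return [r for r in self.acknowledged if r not in self.replica]


sync_db, async_db = Primary(synchronous=True), Primary(synchronous=False)
for db in (sync_db, async_db):
    db.write("order-1")
    db.write("order-2")
async_db.replicate_pending()   # the replica catches up...
for db in (sync_db, async_db):
    db.write("order-3")        # ...then one more write lands before the disaster

print(sync_db.lost_if_primary_dies_now())   # []
print(async_db.lost_if_primary_dies_now())  # ['order-3']
```

The synchronous primary never acknowledges a write the replica does not have, so nothing acknowledged is lost. The trade-off, which this toy model does not show, is that every write waits on the replica.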

How do we improve RPO?

The most important factor here is replication and data consistency. So we really need to make sure that the strength of transactions is maintained and that consistency rules are enforced. This is why data stores like Cosmos DB are popular where the requirements call for zero RPO and low RTO: they support models that can enforce this type of logic.

https://mathequality.files.wordpress.com/2014/01/math-meme-math-test-easy-or-wrong.png

Needless to say, this all comes down to operations and math and ultimately the requirements of your solution and balancing that against cost and impact. You really want to make sure you only take this to the level you need to as it can add a lot of cost and substantially raise the complexity of your solution.