Advanced Resiliency Concepts in the Public Cloud

By Larry Grant | Cloud Insight Blog | February 2, 2023

There are many forms of resiliency we can build into our applications. Public cloud providers extend these concepts with managed services such as load balancers that are redundant across availability zones and auto-scaling groups that can handle bursting. There are even health check services that integrate with other services to provide real-time monitoring and auto-replacement of applications and/or modules that aren’t behaving well. I classify these more as standard resiliency concepts for the public cloud.

Here are some advanced concepts to take your resiliency to the next level.

In-Region Advanced Resiliency

When implementing auto-scaling services, best practices dictate using multiple availability zones (AZs) to protect against an AZ outage. However, failures inside an AZ can still cause scaling or recovery operations to fail. Examples include subnet exhaustion, unavailability of the selected instance type, or even API throttling, although the latter usually applies at the account level rather than the AZ level. To limit exposure to failed scaling and recovery operations, there are some additional practices we can apply.
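To make subnet exhaustion concrete, here is a minimal Python sketch that estimates how much room a subnet has left before a scale-out. The function names and the example CIDR are hypothetical. The reserved-address count is an assumption: AWS reserves 5 addresses per subnet, and other providers differ, so adjust it for your CSP.

```python
import ipaddress

# Assumption: the provider reserves a fixed number of addresses per subnet.
# AWS reserves 5; other providers use different values.
RESERVED_PER_SUBNET = 5


def free_addresses(cidr: str, in_use: int, reserved: int = RESERVED_PER_SUBNET) -> int:
    """Return how many addresses are still available in the subnet."""
    total = ipaddress.ip_network(cidr).num_addresses
    return max(total - reserved - in_use, 0)


def can_scale(cidr: str, in_use: int, instances_needed: int) -> bool:
    """True if the subnet has room for the requested scale-out."""
    return free_addresses(cidr, in_use) >= instances_needed
```

A check like this could run before a scaling event, or feed an alert when free capacity in any subnet drops below the largest burst you expect.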

Consider deploying multiple concurrent copies of your application, each with minor tweaks. Proper automation can do this without extra developer effort. The recommendation when deploying into multiple availability zones is to choose at least 3. However, some regions offer 6 or more AZs to choose from. At first thought, you might consider deploying one copy of an application across all 6 AZs for maximum availability.
Instead, deploy one copy to 3 AZs and another to a different set of 3 AZs to help manage subnet exhaustion. Mixing up app deployments changes the ratio of IP usage across subnets, thereby limiting failures or isolating a failure to just one of the deployments.
Alternatively, in regions that do not have more than 3 AZs, you can use the same 3 AZs but different VPCs (or at least different subnets) to isolate IP usage between deployments.
Choose a different instance type for each deployment to minimize risk of the Cloud Service Provider (CSP) not being able to deliver the requested type.
Deploy each to a different account (micro account) to limit API throttling.
Use DNS Round-Robin or Weighted load balancing features to balance traffic between each deployment.
When upgrading or patching, deploy whole new copies and post-certify them before terminating the old deployments.
Keep in mind, if you normally deploy a minimum of 2 instances in an auto-scaling group for improved resiliency, this doesn’t mean you now have to deploy 4 (2×2). Instead, you can deploy a minimum of 1 in each deployment, still providing a total of 2 and still providing resiliency, without increasing cost (except for the price of an additional load balancer).
An alternative approach is to share a load balancer across separate deployments to save a few dollars. This is acceptable but doesn’t provide as much resiliency in case something goes awry with the load balancer itself. Furthermore, different scaling scenarios can be associated with each separate load balancer to better handle different types of situations.
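The practices above can be sketched as a small planning function. This is an illustration under stated assumptions, not a provider API: the `Deployment` class, the `plan_deployments` function, and the zone and instance-type names are all hypothetical. It splits the zones into disjoint sets when enough exist, gives each copy its own instance type, and divides the usual minimum instance count across the copies instead of multiplying it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    name: str
    zones: List[str]
    instance_type: str
    min_instances: int
    weight: int  # relative traffic share for weighted DNS or load balancing


def plan_deployments(zones, instance_types, copies=2, zones_per_copy=3, total_min=2):
    """Plan several concurrent copies of one application (hypothetical helper)."""
    if len(instance_types) < copies:
        raise ValueError("need one distinct instance type per deployment")
    if len(zones) < zones_per_copy:
        raise ValueError("not enough availability zones")

    # Use disjoint AZ sets when the region has enough zones. Otherwise reuse
    # the same zones; the copies should then live in separate VPCs or subnets.
    disjoint = len(zones) >= copies * zones_per_copy

    plans = []
    for i in range(copies):
        start = i * zones_per_copy if disjoint else 0
        # Split the usual minimum across copies (2 total -> 1 per copy).
        share = total_min // copies + (1 if i < total_min % copies else 0)
        plans.append(Deployment(
            name=f"app-copy-{i + 1}",
            zones=list(zones[start:start + zones_per_copy]),
            instance_type=instance_types[i],
            min_instances=max(1, share),
            weight=1,  # equal weights, i.e. an even split of traffic
        ))
    return plans
```

For example, `plan_deployments(["az-a", "az-b", "az-c", "az-d", "az-e", "az-f"], ["type-x", "type-y"])` yields two copies on disjoint sets of three zones, each with a minimum of one instance. Deploying each copy to its own account, as suggested above, would be handled by the automation that consumes the plan.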

Multi-Region Advanced Resiliency

Multi-region resiliency is a much more complicated topic, and covering every aspect of it is beyond the scope of this post.

However, there are a few ideas that are worth mentioning. First, let’s define a few common phrases:

Active-Active: The idea of deploying and running two copies of the same application or service, concurrently, in two different regions, and using some sort of load balancing to send traffic to each. Great care must be taken to handle data management and transactional consistency correctly.
Active-Passive: The idea of deploying two copies of the same application or service, in two different regions, but only sending requests to one of them, and dynamically switching to the alternate region when the primary region is unavailable. Often the failover process is automated at an application or service level, and dependencies are equally able to choose alternate routes or to tolerate the added latency. In some cases, a database must fail over to ensure writes only happen from the active region, or cross-region writes are performed to ensure data consistency. This is more popular between near-region deployments such as Virginia and Ohio.

Active-Failover: The idea of deploying two copies of the same application, in two different regions, but only sending requests to one of them, and manually switching to the alternate region when the primary region is unavailable.

With Active-Failover, the environment is usually fully deployed and ready to take traffic, and failover can be performed at an individual application level. However, due to complex dependencies, a manual decision is usually made to fail over a group of applications as a unit to minimize the impact on dependencies and avoid increased latency.
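The difference between the three patterns defined so far comes down to which regions receive traffic and who triggers the switch. Here is a minimal Python sketch of that decision. The `Pattern` enum, the `route_targets` function, and the region labels are hypothetical, and real failover involves data and dependency concerns that this deliberately ignores.

```python
from enum import Enum
from typing import List


class Pattern(Enum):
    ACTIVE_ACTIVE = "active-active"
    ACTIVE_PASSIVE = "active-passive"
    ACTIVE_FAILOVER = "active-failover"


def route_targets(pattern: Pattern, primary_healthy: bool, secondary_healthy: bool,
                  operator_approved: bool = False) -> List[str]:
    """Return the regions that traffic is routed to (hypothetical helper)."""
    if pattern is Pattern.ACTIVE_ACTIVE:
        # Both regions serve concurrently; unhealthy ones drop out.
        targets = []
        if primary_healthy:
            targets.append("primary")
        if secondary_healthy:
            targets.append("secondary")
        return targets

    can_switch = not primary_healthy and secondary_healthy
    if pattern is Pattern.ACTIVE_PASSIVE:
        # The switch is automatic once health checks fail.
        return ["secondary"] if can_switch else ["primary"]

    # Active-Failover: routing stays on the primary until a person decides.
    return ["secondary"] if can_switch and operator_approved else ["primary"]
```

Note that with Active-Failover an unhealthy primary keeps receiving traffic until an operator approves the switch, which is the trade-off made in exchange for failing over groups of applications as a unit.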

Disaster Recovery: The idea of deploying ALL applications to a single region, and having a method to quickly recover these applications in another region.

Sometimes this means restoring from backup, or replicating data, or pre-deploying an application but having it turned off to save costs. Normally this is an entire site failover for a major catastrophe. Very few if any custom code modifications are performed.

As far as the application is concerned, it behaves as a single-region application and is only ever active in one region. The Recovery Time Objective (RTO) is usually much longer than with any of the above methods, and the Recovery Point Objective (RPO) is set to a level of data loss the organization can accept.
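RTO and RPO are easiest to reason about as two time windows: how long the service was down, and how much data written before the outage could not be recovered. The following Python sketch compares both windows against their objectives for a recovery exercise. The function name and the timestamps in the usage example are hypothetical.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recovery_point: datetime, outage_start: datetime,
                     service_restored: datetime, rpo: timedelta, rto: timedelta) -> dict:
    """Compare an actual (or rehearsed) recovery against RPO and RTO."""
    # Data written after the last backup or replication point is lost.
    data_loss_window = outage_start - last_recovery_point
    # Downtime runs from the start of the outage until service is restored.
    downtime = service_restored - outage_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


result = meets_objectives(
    last_recovery_point=datetime(2023, 2, 2, 1, 0),
    outage_start=datetime(2023, 2, 2, 1, 45),
    service_restored=datetime(2023, 2, 2, 5, 45),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
```

In this example the data loss window is 45 minutes and the downtime is 4 hours, so both objectives are just met. Running the same check after each recovery rehearsal shows whether the chosen pattern still fits the objectives.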

Any of these approaches can take months or years of planning and design work to get right, and will usually require some custom code, be it scripting, deployments, or database modifications. Most organizations are not one-size-fits-all and will use a blend of the above patterns. Each of these patterns can also have further refinements and sub-patterns.

One thing I want you to remember is that whichever pattern you leverage, getting a good handle on advanced DNS routing options and integrated health checks can be key to enabling these patterns. Modern cloud DNS solutions offer many routing options including:

In-region or multi-region capabilities.
Internal (i.e., service-to-service calls) and public routing options. However, there are often limitations to internal routing between an office and the public cloud. A dedicated connection will often terminate in a specific region, thereby limiting more advanced routing choices.
Using Geo, Latency, Weighted, or Failover based routing for public routing.
Using Weighted, Round Robin or Failover for internal routing.
Properly integrating health checks at the right levels to ensure proper routing decisions are made.
Layering combinations of the above to get the perfect routing solutions.
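As an illustration of layering, the sketch below combines failover routing across regions with weighted routing inside each region, with health checks filtering both layers. It is a simplified Python model, not any DNS provider's API. The `Endpoint` class and both function names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str
    weight: int
    healthy: bool  # result of the integrated health check


def pick_weighted(endpoints: List[Endpoint], rng: random.Random) -> Optional[str]:
    """Weighted routing: choose among healthy endpoints by relative weight."""
    candidates = [e for e in endpoints if e.healthy and e.weight > 0]
    if not candidates:
        return None
    weights = [e.weight for e in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0].name


def resolve(regions_by_priority: List[List[Endpoint]],
            rng: Optional[random.Random] = None) -> Optional[str]:
    """Failover routing: use the first region that has a healthy endpoint."""
    rng = rng or random.Random()
    for endpoints in regions_by_priority:
        choice = pick_weighted(endpoints, rng)
        if choice is not None:
            return choice
    return None  # nothing healthy anywhere
```

Passing the primary region's endpoints first and the secondary region's second gives weighted balancing between deployments in normal operation and an automatic move to the secondary region only when every primary endpoint fails its health check.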

For companies that are looking to add that extra level of resiliency to their cloud deployments, you’ll first need to consider the specific needs of your business and the investment you’re willing to make. This should include identifying where you currently are on your cloud journey, your goals for shifting to the cloud, and what type of cloud technology you want to use (hybrid, on-prem, public, etc.). It’s helpful to have an outside vendor work with you or your IT team to evaluate each approach and determine which solutions best fit your needs. Not all applications are equal, and most often each solution needs to be customized to support the specific needs of both the application and the organization.
