When the AWS outage happened, the internet froze. But it wasn’t a cyberattack; it was a wake-up call. A subtle failure in DNS automation inside a core service triggered a cascading brownout across dependent services, showing that even the most advanced cloud platforms can degrade in unexpected ways. Yet AWS’s response also demonstrated the power of deep observability and operational discipline in containing impact and accelerating recovery.
In today’s complex architectures, visibility into internal systems is just as critical as monitoring external performance. Without insight into the platform’s inner workings, teams are flying blind: unable to see what’s changing beneath the surface. AWS had the right data, and it showed. The outage was disruptive, but it could have been far worse. Their ability to respond quickly and effectively was a testament to the value of platform-level observability.
Brownouts are the new blackouts
Modern platforms rarely fail outright. Instead, they degrade unevenly: some services work, others don’t, and the picture shifts minute by minute. These “brownouts” are harder to diagnose and more damaging to user trust. Worse, they often mask deeper issues like automation misfires, hidden dependencies, or control-plane instability. IT leaders should assume brownouts will happen and insist on the visibility and triage readiness to manage them (a minimal detection sketch follows the list below).
- Small cause, big effect: A rare race condition in automation can expose hidden dependencies and knock healthy services off balance
- Design limits show up at scale: Architects can’t anticipate every edge case; what matters is how fast the organization sees the patterns and contains the situation
- Customer impact is uneven: Partial failures confuse users and teams alike, lengthening time to resolution and multiplying reputational risk with every extra round of troubleshooting
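To make the brownout idea concrete, here is a minimal sketch in Python of treating health as a spectrum rather than a boolean. The endpoint names and thresholds are illustrative assumptions, not real services or recommended values: each dependency is probed individually, and the aggregate is classified as healthy, brownout, or blackout instead of a simple up or down.

```python
"""Minimal brownout detector: health as a spectrum, not a boolean.

All endpoint names and thresholds are illustrative assumptions,
not real service URLs or recommended values.
"""
import urllib.error
import urllib.request

# Hypothetical internal dependencies behind a single user-facing service.
DEPENDENCIES = {
    "auth": "https://auth.internal.example.com/healthz",
    "catalog": "https://catalog.internal.example.com/healthz",
    "checkout": "https://checkout.internal.example.com/healthz",
    "search": "https://search.internal.example.com/healthz",
}

def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the dependency answers with HTTP 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

def classify(results: dict[str, bool]) -> str:
    """Map per-dependency results to a coarse state instead of a binary up/down."""
    healthy_fraction = sum(results.values()) / len(results)
    if healthy_fraction == 1.0:
        return "healthy"
    if healthy_fraction == 0.0:
        return "blackout"
    return "brownout"  # the messy middle: some paths work, others do not

if __name__ == "__main__":
    results = {name: probe(url) for name, url in DEPENDENCIES.items()}
    print(classify(results), results)
```

Even this crude classification surfaces the messy middle state that binary health checks hide, which is exactly where triage decisions get hard.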
Cybersecurity Implication:
Partial failures can obscure signs of malicious activity. Without visibility into internal workflows, attackers can exploit blind spots – moving laterally, poisoning caches, or triggering automation in unintended ways.
Why traditional monitoring isn’t enough
Most dashboards emphasize infrastructure health, application performance, and user transactions. But what’s often missing is visibility into the connective tissue: the control planes, automations, and orchestration layers that quietly keep everything in sync.
Teams are generally well-equipped to monitor servers, services, and user journeys, but that’s only part of the picture. Without visibility into how these layers interact, they risk overlooking the early signals of cascading failures or security threats.
- What’s missing: Signals from CI/CD pipelines, DNS propagation, IAM updates, and policy engines
- Why it matters: These layers are often the first to fail or be exploited during an incident. Without seeing what’s happening under the surface, teams treat symptoms at the edge while the real issue amplifies in the middle (see the sketch after this list)
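As one hedged example, the sketch below (Python, using the third-party dnspython library) watches a single control-plane signal: whether different resolvers agree on the DNS answer for a critical internal endpoint. The endpoint name and resolver addresses are placeholders, not recommendations; the point is that drift or an empty answer in this layer can be caught before it shows up as user-facing errors.

```python
"""Watch one control-plane signal: DNS answers for a critical endpoint,
as seen by more than one resolver. Placeholder names and IPs throughout."""
import dns.exception
import dns.resolver  # third-party: pip install dnspython

ENDPOINT = "service.internal.example.com"                      # hypothetical endpoint
RESOLVERS = {"corp-dns": "10.0.0.2", "public-dns": "8.8.8.8"}  # illustrative resolvers

def answers_from(resolver_ip: str) -> set[str]:
    """Ask one specific resolver for A records; an empty set means no answer."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [resolver_ip]
    resolver.lifetime = 2.0
    try:
        return {record.address for record in resolver.resolve(ENDPOINT, "A")}
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer,
            dns.resolver.NoNameservers, dns.exception.Timeout):
        return set()

if __name__ == "__main__":
    views = {name: answers_from(ip) for name, ip in RESOLVERS.items()}
    if not all(views.values()):
        print("ALERT: at least one resolver returned no records", views)
    elif len({frozenset(v) for v in views.values()}) > 1:
        print("WARN: resolvers disagree about the endpoint", views)
    else:
        print("OK: consistent answers", views)
```

The same pattern extends to the other connective-tissue signals named above (CI/CD pipeline events, IAM updates, policy-engine decisions): anything that reveals the platform’s plumbing is drifting before the edge does.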
Case Study: the October 2025 AWS brownout
The AWS event in Northern Virginia stands as a clear, public example of how deeply embedded platform dependencies can surface under stress and how having the right internal data can dramatically accelerate recovery.
- A small trigger, big ripple: A subtle glitch in how AWS’s DynamoDB service managed DNS records caused one endpoint to go missing. That single misstep disrupted how other services located each other (like removing a key piece from a map) and set off a chain reaction across the platform.
- Interconnected systems revealed: Services that depended on DynamoDB (such as EC2’s orchestration and internal control systems) began to struggle. Lease renewals failed, network updates slowed, and health checks became unreliable, leading to unpredictable shifts in system capacity. It was a clear reminder of how tightly woven cloud services really are.
- Data-guided triage worked: AWS engineers quickly pinpointed the issue using rich internal observability. They manually fixed the DNS records, adjusted workflows, paused aggressive automation, and carefully reintroduced workloads while clearing backlogs. These precise, data-driven actions helped contain the impact and restore stability faster than most would expect; a simplified sketch of this throttle-and-stage pattern follows the list
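The sketch below is not AWS’s tooling; it is only a minimal Python illustration of the throttle-and-stage pattern described in the last bullet, with a hypothetical `process` callback standing in for real work. Backlogged items are reintroduced in small, growing batches, and the pace is cut whenever the error rate exceeds a budget.

```python
"""Illustrative staged-recovery loop: drain a backlog in growing batches and
back off when errors reappear. The `process` callback is a stand-in."""
import random
import time
from collections import deque

def staged_recovery(backlog: deque, process, start_batch: int = 10,
                    max_batch: int = 500, error_budget: float = 0.05,
                    pause: float = 1.0) -> None:
    batch_size = start_batch
    while backlog:
        batch = [backlog.popleft() for _ in range(min(batch_size, len(backlog)))]
        failed = [item for item in batch if not process(item)]
        backlog.extend(failed)                              # retry failures on a later pass
        if len(failed) / len(batch) > error_budget:
            batch_size = max(start_batch, batch_size // 2)  # throttle: halve the pace
            time.sleep(pause)                               # let the platform settle
        else:
            batch_size = min(max_batch, batch_size * 2)     # stage: ramp up gradually

if __name__ == "__main__":
    # Hypothetical demo: 1,000 queued requests against a slightly flaky stub.
    queue = deque(range(1000))
    staged_recovery(queue, lambda item: random.random() > 0.01)
    print("backlog drained")
```

The design choice worth noting is the asymmetry: ramping up is gradual, backing off is immediate, which is what keeps a recovering platform from tipping itself back into the brownout it just escaped.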
Positive Takeaway:
AWS engineers quickly identified the root cause, manually repaired DNS records, throttled workflows, and staged recovery. Their ability to choreograph a safe restart using deep platform signals helped contain the blast radius and minimize customer impact. This wasn’t just a recovery – it was a demonstration of how visibility into internal data flows can turn a potential crisis into a controlled event.
Check back for Part 2 of this article, which takes a deeper dive into what true observability for modern platforms must look like.

