The Amnesia of Outages - A Culture Shift

A major AWS outage made headlines this week. Across the world, teams watched dashboards flicker red while customers wondered what went wrong. In that case, services in the US-EAST-1 region were hit after a failure in internal monitoring systems. For most, it was a temporary inconvenience. For some, a direct hit to revenue, reputation, and trust. For all of us, a reminder of how dependent we’ve become on systems we don’t control.

I’ve lived through many outages, some at companies that ran the services affected, others that hit organizations I worked with, and plenty that disrupted my day as a consumer. They always start the same way, with silence, confusion, and the hum of a thousand people refreshing a page. Then come the updates, the bridge calls, the scramble to diagnose what’s failing and why, and in more mature organizations, resistance to premature conclusions and blame. In those moments, time turns into cost.

We recover, but we rarely reflect and follow through. Once dashboards turn green again, urgency dissolves and amnesia sets in. We celebrate uptime rather than resilience. We patch symptoms instead of tracing dependencies. We tell ourselves it was a one-off, until the next time.

The Moment of Failure

You are on the bridge. Sales is posting screenshots of angry customer chats. Support says churn threats are up. Finance wants an impact estimate. An engineer says failover will take ten minutes, then thirty, then unknown. Someone asks if legal should pre-draft credits. Your comms update is due in five, yet you still cannot say why logins are failing.

Someone else says this can’t be widespread. Another insists their metrics look fine. It always starts the same way, across clouds and across companies, wherever systems lean on systems they don’t fully control.

And you remember the review months ago when an engineer flagged the risk in plain words, “Auth remains single-homed. Region failover will not help.” It lost, as it often does, to a revenue feature.

The Forces That Guarantee Repetition

A system can span three regions and still fail as one, because dependencies often route through a single logical service or control plane. An architect insists it’s fully redundant, with multiple regions, replicas, and failovers. But all of it still routes through a single managed Redis. When that cluster hiccups, the façade of redundancy collapses. Every dashboard still shows green, until everything fails at once.
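
The trap is that every per-region view looks healthy on its own. A minimal sketch of the kind of dependency audit that surfaces the problem, using hypothetical service names rather than any real inventory: list what each region actually calls, then flag anything all of them share.

```python
# A rough sketch (hypothetical service names, not any specific system) of the
# kind of audit that exposes hidden single points of failure: list what each
# region depends on, then flag anything every region shares.
from collections import defaultdict

REGION_DEPENDENCIES = {
    "us-east-1": ["auth-api.us-east-1", "orders-db.us-east-1", "managed-redis.global"],
    "us-west-2": ["auth-api.us-west-2", "orders-db.us-west-2", "managed-redis.global"],
    "eu-west-1": ["auth-api.eu-west-1", "orders-db.eu-west-1", "managed-redis.global"],
}

def shared_dependencies(region_deps: dict[str, list[str]]) -> list[str]:
    """Return the dependencies that every region relies on.

    Anything returned here fails as one unit, no matter how many regions exist.
    """
    counts = defaultdict(int)
    for deps in region_deps.values():
        for dep in set(deps):
            counts[dep] += 1
    return [dep for dep, seen in counts.items() if seen == len(region_deps)]

if __name__ == "__main__":
    for dep in shared_dependencies(REGION_DEPENDENCIES):
        print(f"WARNING: {dep} is shared by every region -- one hiccup takes them all down")
```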

At some point the call turns to security. Someone says the failover can’t proceed without an exception. Someone else asks if that exception is approved. The CISO’s team joins with no context and is asked to make a decision in minutes that normally takes days. The same safeguards that protect us in calm conditions can amplify risk under stress. The failure isn’t security, it’s that the system treats safety as absolute when it should allow room to adapt.

The incentives that drive product and engineering rarely favor resilience. The engineer who ships a feature on time is rewarded. The one who spends a quarter reducing coupling must justify the delay. Product managers are measured by velocity. Executives by growth. KPIs exist for availability, not for completing the remediations we agree to after every incident.

Every leader knows the trade-off. Building for resilience takes more time, more people, and more patience. It adds layers of testing, coordination, and operational overhead. The cost isn’t just in infrastructure, it’s in attention. And attention is the one resource most organizations are already short of.

We call it amnesia, but it’s closer to self-preservation. Remembering would force trade-offs we’re not willing to make. True resilience costs time, talent, and money, all of which have quarterly limits. Forgetting becomes the path of least resistance. It’s not chosen, it just happens again and again.

We call this progress because the systems come back online faster. But faster recovery isn’t greater resilience. It’s the illusion of improvement built on practiced repetition.

Consumers have learned the same lesson. We reload, retry, and move on. A brief outage no longer breaks trust; it just resets expectations. “Try again later” or “Try someone else” has become normal.

The Choice

A few organizations break the pattern. They treat incidents as turning points, not interruptions.

In one outage, we failed over compute in minutes, yet customers still couldn’t sign in. The root cause was a shared entitlement call that lived in a single region because moving it would take a quarter. Afterward, we moved it anyway and changed the roadmap rule; no new feature could depend on that entitlement until it was decoupled.
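
One common shape for that kind of decoupling, sketched here with hypothetical names rather than the actual system, is to let sign-in fall back to a recently cached entitlement answer when the single-region service is unreachable.

```python
# A minimal sketch (hypothetical names, not the actual system) of degrading
# gracefully: prefer the live entitlement call, but fall back to a bounded
# cache during an outage so customers can still sign in.
import time

CACHE_TTL_SECONDS = 15 * 60                   # how long a stale answer stays acceptable
_cache: dict[str, tuple[dict, float]] = {}    # user_id -> (entitlements, fetched_at)

class EntitlementUnavailable(Exception):
    """Raised when the upstream entitlement service cannot be reached."""

def fetch_entitlements(user_id: str) -> dict:
    """Stand-in for the real cross-region call; assumed to raise during an outage."""
    raise EntitlementUnavailable(user_id)

def entitlements_for_login(user_id: str) -> dict:
    try:
        fresh = fetch_entitlements(user_id)
        _cache[user_id] = (fresh, time.monotonic())
        return fresh
    except EntitlementUnavailable:
        cached = _cache.get(user_id)
        if cached and time.monotonic() - cached[1] < CACHE_TTL_SECONDS:
            return cached[0]                  # degraded, but the customer still gets in
        raise                                 # no safe fallback: fail loudly, not silently
```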

The result was immediate. Support tickets about login errors dropped by half. Customers stopped losing access during partial failures. On-call engineers slept through the next regional event because the system degraded gracefully instead of collapsing. The fix we remember was architectural, but the real shift was cultural. The next quarter’s roadmap had resilience work in the first sprint, not the backlog. We even introduced a metric to track resilience work alongside feature delivery, measuring time between incidents, cost of partial failure, and the share of roadmap effort spent on reducing risk. It was a small signal, but it meant reliability was finally part of the plan.
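
The metric itself does not need to be sophisticated. A rough sketch, with made-up numbers rather than our actual data, of how time between incidents and the resilience share of a roadmap could be computed; cost of partial failure would come from finance’s impact estimates and is left out here.

```python
# A rough sketch (hypothetical data, not a real dashboard) of the kind of
# resilience signal described above: time between incidents and the share of
# roadmap effort explicitly aimed at reducing risk.
from datetime import date
from statistics import mean

incident_dates = [date(2025, 3, 4), date(2025, 6, 18), date(2025, 10, 20)]
roadmap_items = [
    {"name": "checkout revamp", "effort_weeks": 6, "resilience": False},
    {"name": "decouple entitlement call", "effort_weeks": 4, "resilience": True},
    {"name": "multi-region auth", "effort_weeks": 5, "resilience": True},
]

def mean_days_between_incidents(dates: list[date]) -> float:
    """Average gap, in days, between consecutive incidents."""
    ordered = sorted(dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return mean(gaps) if gaps else float("inf")

def resilience_share(items: list[dict]) -> float:
    """Fraction of planned effort explicitly aimed at reducing risk."""
    total = sum(item["effort_weeks"] for item in items)
    return sum(item["effort_weeks"] for item in items if item["resilience"]) / total

print(f"Mean days between incidents: {mean_days_between_incidents(incident_dates):.0f}")
print(f"Share of roadmap effort on resilience: {resilience_share(roadmap_items):.0%}")
```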

The best teams choose boring reliability over impressive recovery. Their systems are invisible, and their failures are uneventful. They understand that resilience isn’t a technical trait; it’s a leadership choice.

The rest will forget, until the next outage reminds them again.
