Your Website Just Went Dark. Here's What's Actually Happening Behind the Curtain.
What Actually Happens During an Outage (And How Engineers Fix It)
A user types your URL into their browser.
DNS resolves in milliseconds.
The request hits your load balancer, routes to a healthy backend, pulls data from cache or database, renders a response, and streams it back.
The whole dance takes maybe 200 milliseconds.
Then one morning you wake up and your site is down. Customers are angry. Your boss is asking questions you can’t answer yet. And somewhere in a stack of servers, containers, DNS records, certificates, and network paths, something has gone catastrophically wrong.
I’ve been on both sides of this.
The worst part isn’t the technical failure. It’s the chaos that follows when teams don’t understand where to look.
Most engineers, even experienced ones, treat a website outage as a single event. It’s not. It’s a cascade of potential failure points, any one of which can produce the same symptom: users can’t reach your site. The difference between a 15-minute incident and a 4-hour nightmare often comes down to whether your team understands the actual flow of a request and can quickly eliminate suspects.
To get there, let’s look at what’s happening under the hood so that when it breaks, and it will break, you know where to look.
Hi, I’m Maxine, a cloud infrastructure engineer who spends my days scaling databases, debugging production incidents, and writing about what actually works in production.
You can get a copy of my LLMs for Humans: From Prompts to Production (at 30% off right now), or for free when you become a paid subscriber.
It’s 20 chapters of practical applied AI with real production context, not theory. And it’ll help you get smarter about using AI tools in infrastructure workflows.
Plus, if you’re thinking about making a career move into cloud or DevOps and want a structured path to get there, get a copy of The DevOps Career Switch Blueprint.
Okay, let’s get into it.
The Request Lifecycle Nobody Draws on Whiteboards
Before we can diagnose what went wrong, we need to agree on what’s supposed to happen. Most architecture diagrams lie by omission. They show the happy path with boxes and arrows. They don’t show the 47 places where things can silently fail.
Here’s the actual flow when someone tries to reach your site:
User’s browser → Stub resolver → Recursive resolver → Authoritative nameserver → IP address returned → TCP handshake → TLS negotiation → Load balancer → Backend selection → Application processing → Database or cache query → Response assembly → Return through the whole chain
That’s at least twelve distinct failure points, and I’m being generous by collapsing some of them.
Let me walk through what each layer actually does and where it tends to break.
The DNS Layer
Your user’s device has a stub resolver that talks to a configured recursive resolver, usually their ISP’s or something like 8.8.8.8 or 1.1.1.1. That recursive resolver walks the DNS tree until it finds your authoritative nameserver, which returns the actual IP address for your domain.
This layer fails in ways that look like everything is fine on your end. Your servers are up, and your monitoring shows green. But users can’t reach you because DNS is returning stale records, or your registrar suspended your domain, or someone let the domain expire.
I once spent two hours debugging an “outage” that turned out to be an accidental change to our domain’s NS records in Route 53. The servers never went down. DNS just stopped pointing at them.
The Network Layer
Once DNS resolves, the user’s request has to actually reach your infrastructure. This involves BGP routing, transit providers, peering agreements, and a bunch of infrastructure you probably don’t control. DDoS attacks, BGP hijacks, and upstream provider issues all manifest as “the site is down” even when your application is perfectly healthy.
The TLS Layer
Certificate expiration is the outage that keeps on giving. I’ve seen major companies go down because someone forgot to renew a cert. I’ve also seen outages caused by intermediate certificate chain issues, cipher suite mismatches, and OCSP stapling problems.
The Load Balancer Layer
Your load balancer is doing health checks against your backends. If those health checks are poorly configured, you can end up with situations where the LB thinks all backends are unhealthy even when they’re fine. Or the opposite: it keeps routing traffic to dead backends because the health check is too permissive.
The Application Layer
Finally, your actual code. Memory leaks, thread exhaustion, database connection pool saturation, deadlocks, infinite loops, OOM kills. The application layer tends to produce the most creative failures because it’s where your custom logic lives.
What the First Five Minutes Look Like
When the alert fires, you have a narrow window to establish basic facts before chaos takes over. I’ve developed a mental checklist over years of incidents that I run through almost automatically now.
Can I reach the site from my local machine?
Can I reach it from a different network?
What does dig show for the domain?
What does curl -I return?
What’s the load balancer health status showing?
What do the application logs say in the last ten minutes?
These six questions, answered honestly in the first five minutes, will tell you which layer is broken about 80% of the time.
Here’s a concrete example. You run:
dig example.com +short
And you get nothing back. Or you get an IP that isn’t yours. That’s a DNS problem. You don’t need to look at your application yet.
Or you run:
curl -Iv https://example.com 2>&1 | head -20
And you see a certificate error. That’s TLS. Stop looking at your containers.
The mistake I see junior engineers make is jumping straight to the layer they’re most comfortable with. Backend devs assume it’s a code bug. Network engineers assume it’s routing. Platform folks assume it’s Kubernetes.
Start at the edge and work inward. Always.
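The edge-inward order can be sketched as a small Python script. This is a minimal sketch, not a full diagnostic tool: the layer names, `first_broken_layer`, and `build_probes` are illustrative names that don’t come from any library, and the probes use only the standard library.

```python
import socket
import ssl
import urllib.request
from typing import Callable, Dict, Optional

# Checked in this order: edge first, application last.
LAYER_ORDER = ["dns", "tcp", "tls", "http"]


def first_broken_layer(probes: Dict[str, Callable[[], None]]) -> Optional[str]:
    """Run probes edge-inward and return the first layer whose probe raises."""
    for layer in LAYER_ORDER:
        try:
            probes[layer]()
        except Exception:
            return layer
    return None  # every probe passed


def build_probes(host: str, timeout: float = 5.0) -> Dict[str, Callable[[], None]]:
    """Hypothetical probes for one HTTPS host on port 443."""

    def dns() -> None:
        socket.getaddrinfo(host, 443)  # raises socket.gaierror if resolution fails

    def tcp() -> None:
        socket.create_connection((host, 443), timeout=timeout).close()

    def tls() -> None:
        ctx = ssl.create_default_context()  # verifies the chain and the hostname
        with socket.create_connection((host, 443), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                pass

    def http() -> None:
        req = urllib.request.Request(f"https://{host}/", method="HEAD")
        urllib.request.urlopen(req, timeout=timeout).close()  # raises on 4xx/5xx

    return {"dns": dns, "tcp": tcp, "tls": tls, "http": http}
```

Calling `first_broken_layer(build_probes("example.com"))` returns the name of the first failing layer, or `None` if all four pass. Note that the DNS probe only catches “nothing came back”; catching “an IP that isn’t yours” would need a comparison against your known addresses.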
Failures That Will Ruin Your Week
Let me walk through the failures I’ve seen most often in production, not theoretical ones from textbooks, but the ones that actually wake people up at night.
1. DNS propagation after a “routine” change
Someone updates a DNS record. Maybe they’re migrating to a new load balancer or switching CDN providers. They make the change, test it from their laptop, it works, they go home.
Two hours later, half your users can’t reach the site.
What’s actually happening is that recursive resolvers cache DNS records according to the TTL you set. If your TTL was 3600 seconds and some resolvers cached the old record right before you changed it, those users won’t see the new IP for an hour. But it gets worse. Some ISP resolvers ignore your TTL entirely. They’ll cache records for 24 hours regardless of what you specify.
The fix is to lower your TTL well before the migration. Try to drop to 300 seconds at least 24 hours before any DNS change. After the change propagates, you can raise it again if you want to reduce DNS query volume.
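The timing here is simple arithmetic, sketched below. The function names and the date are made up for illustration, and the sketch assumes resolvers that respect TTLs; it says nothing about the ones that don’t.

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """A resolver that cached the record just before the TTL was lowered
    keeps it for up to the OLD TTL, so wait at least that long."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def worst_case_stale_seconds(current_ttl_seconds: int) -> int:
    """After the record changes, a TTL-respecting resolver can serve the
    old answer for at most one full TTL."""
    return current_ttl_seconds


# Arbitrary example: TTL lowered from 3600 to 300 at 09:00 UTC.
lowered = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
print(earliest_safe_cutover(lowered, 3600))  # 2026-05-01 10:00:00+00:00
print(worst_case_stale_seconds(300))         # 300
```

Waiting a full 24 hours, as suggested above, is far more than this minimum. The extra margin is what covers resolvers that hold records longer than they should.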
2. Certificate expiration during a holiday weekend
I know a team that had their wildcard certificate expire at 11 PM the night before Thanksgiving. Their automated renewal had been silently failing for three weeks because someone changed the DNS verification records during an unrelated migration.
No alerts. No warnings.
Just a hard down on one of the biggest traffic days of the year.
The technical cause was that Let’s Encrypt couldn’t verify domain ownership because the TXT record validation was pointed at an old account. The certificate renewal job was marked as “running” in their CI/CD, but it was actually just timing out and swallowing the error.
You need to monitor certificate expiration as a first-class metric. Not the renewal job status. The actual certificate expiration date. I’ve learned to trust nothing except a daily check that says “this cert expires in X days” and alerts when X drops below 14.
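A daily check along those lines might look like the following. It’s a minimal sketch using Python’s standard `ssl` module; the function names are my own, and how you schedule it and where the alert goes are left out.

```python
import socket
import ssl
import time
from typing import Optional

ALERT_THRESHOLD_DAYS = 14  # the threshold mentioned above


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate the server actually presents."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a notAfter string such as 'Jan  1 00:00:00 2030 GMT' to days remaining."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def should_alert(days_left: float, threshold: float = ALERT_THRESHOLD_DAYS) -> bool:
    return days_left < threshold
```

Usage would be something like `should_alert(days_until_expiry(fetch_not_after("example.com")))`. If the certificate has already expired, `fetch_not_after` raises a verification error during the handshake, which should be treated as an alert in its own right.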
3. Load balancer health checks that lie
Your health check endpoint returns 200 OK. Your load balancer thinks everything is fine. But the health check only verifies that your HTTP server can respond. It doesn’t check that your database connection pool isn’t exhausted or that your Redis cluster is reachable.
I’ve seen production traffic routed to backends that could serve the health check page but couldn’t actually process any real requests. Users got either timeouts or 500 errors while monitoring showed 100% healthy backends.
The fix is to make your health check meaningful. It should touch every critical dependency your application needs to serve traffic. Not a full integration test, but enough to verify that the backend can actually do its job.
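@app.route('/health')
def health():
    # db and redis are the application's existing client objects
    try:
        db.execute("SELECT 1")  # cheap query: proves the pool can hand out a connection
        redis.ping()            # confirms the Redis cluster is reachable
        return "OK", 200
    except Exception as e:
        # Any dependency failure marks this backend unhealthy
        return str(e), 503

Those few extra milliseconds of overhead are worth it.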
4. Connection pool exhaustion under load
Traffic spikes. Your application starts running slow. Database queries that normally take 5ms are now taking 500ms because the connection pool is saturated. Requests start queuing. Timeouts start firing. The load balancer sees backends not responding to health checks and starts removing them from rotation.
Now you have fewer backends handling the same traffic. The remaining ones get crushed even harder. This cascades until everything is down.
This one is insidious because it looks like your database is the problem. But the database is fine. Your application is just holding connections open too long because something upstream slowed down.
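A rough way to see why a slowdown saturates the pool is Little’s law: the average number of connections in use is roughly the request rate multiplied by how long each request holds a connection. In the sketch below, the 5 ms and 500 ms figures come from the scenario above, while the request rate and pool size are made-up numbers.

```python
def connections_in_use(requests_per_second: float, hold_seconds: float) -> float:
    """Little's law: average in-flight connections = arrival rate * hold time."""
    return requests_per_second * hold_seconds


POOL_SIZE = 20   # hypothetical pool size per backend
RATE = 200.0     # hypothetical requests per second per backend

normal = connections_in_use(RATE, 0.005)  # about 1 connection busy on average
slow = connections_in_use(RATE, 0.5)      # about 100 connections needed

print(f"normal: {normal:.0f}, slow: {slow:.0f}, pool: {POOL_SIZE}")
print("pool saturated:", slow > POOL_SIZE)
```

Same traffic, same database, but a hundredfold increase in hold time means a hundredfold increase in connections needed. Everything beyond the pool size queues.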
The fix involves proper connection pool sizing, request timeouts at every layer, and circuit breakers that fail fast instead of queuing indefinitely.
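To make “fail fast instead of queuing” concrete, here is a minimal circuit breaker sketch. It’s an illustration rather than a production implementation: the class and parameter names are my own, it isn’t thread-safe, and a mature library would be a better choice for real traffic.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised immediately while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Open: reject without touching the struggling dependency.
                raise CircuitOpenError("failing fast; dependency marked unhealthy")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

You’d wrap each call to a slow dependency, for example `breaker.call(lambda: db.execute("SELECT 1"))`, and turn `CircuitOpenError` into a quick error response instead of letting the request wait for a connection.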
5. Silent deployment failures
A deploy goes out. It “succeeds” according to your CI/CD pipeline. But one of twenty pods fails to start due to a missing environment variable. Kubernetes keeps the old pods running because your deployment spec says maxUnavailable: 0.
So the cluster looks healthy. Nineteen pods are running the new version. One is stuck in CrashLoopBackOff. You don’t notice until load shifts to that node and users start seeing intermittent errors.
This happens more than teams want to admit. The fix is treating deployment success as more than “the container started.” You need readiness probes that verify the application is actually functional, and you need to fail the deployment if any pods don’t reach ready state.
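One way to express “fail the deployment unless every pod is ready” is to check the Deployment’s status fields after the rollout. The sketch below assumes you’ve already fetched the Deployment object (for example as JSON) and passes in `spec.replicas`, `metadata.generation`, and the `status` block; the function name and the sample numbers are illustrative.

```python
from typing import Mapping


def rollout_complete(desired_replicas: int, generation: int,
                     status: Mapping[str, int]) -> bool:
    """True only when every desired replica runs the new spec and is available."""
    if status.get("observedGeneration", 0) < generation:
        return False  # the controller has not processed the latest spec yet
    updated = status.get("updatedReplicas", 0)
    return (
        updated == desired_replicas
        and status.get("replicas", 0) == updated            # no old pods left over
        and status.get("availableReplicas", 0) == updated   # every new pod is available
    )


# Illustrative status for a stalled rollout: one old pod is still running
# alongside the new ones because a new pod never became available.
stalled = {"observedGeneration": 2, "replicas": 21, "updatedReplicas": 20,
           "readyReplicas": 20, "availableReplicas": 20}
print(rollout_complete(20, 2, stalled))  # False
```

A pipeline step that runs this check (or `kubectl rollout status` with a timeout, which waits on similar conditions) and exits non-zero on failure turns a silently stalled rollout into a failed deploy.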
The Terraform Problem Nobody Warns You About
Infrastructure as code is supposed to make this better. In practice, it creates its own class of outages.
State drift during incidents
You’re in the middle of an incident. You need to make a change fast. Someone clicks through the console to bump a capacity setting or flip a feature flag. The change works. The incident resolves.
Now your Terraform state doesn’t match reality. The next time someone runs terraform apply, it tries to “fix” the drift by reverting your emergency change. I’ve seen this cause a second outage during what should have been routine maintenance.
The workaround is discipline. If you touch the console during an incident, you document it immediately and update the Terraform before anyone else runs a plan. In practice, this rarely happens under pressure.
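Since discipline fails under pressure, a scheduled check can at least surface the mismatch before the next apply. The sketch below relies on `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 2 when changes are pending, and 1 on error. The function names and the idea of running it on a schedule are my assumptions, not something from a specific setup.

```python
import subprocess
from typing import Sequence


def interpret_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in_sync"          # plan succeeded, no changes
    if code == 2:
        return "changes_pending"  # plan succeeded, reality and code differ
    return "error"                # plan itself failed


def check_drift(workdir: str, terraform_cmd: Sequence[str] = ("terraform",)) -> str:
    """Run a plan in workdir and report whether anything would change."""
    result = subprocess.run(
        [*terraform_cmd, "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret_plan_exit_code(result.returncode)
```

A `changes_pending` result doesn’t distinguish console drift from unapplied code changes, so someone still has to read the plan. But it tells you a surprise is waiting before routine maintenance finds it for you.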
The import problem
Some resources are hard to import into Terraform cleanly. Load balancer rules, IAM policies with complex conditions, security groups that were created by hand years ago. You end up with resources that exist in two states: partially managed by Terraform, partially managed by hand.
When the outage involves one of these hybrid resources, you’re in trouble. Is the current state the intended state? What was it supposed to be? Nobody knows because the source of truth is split.
I don’t have a clean solution for this. The honest answer is that some legacy infrastructure will never be fully captured in IaC, and you need runbooks that acknowledge this.
The provider version surprise
Terraform providers update. AWS resources change. Sometimes a provider update changes how a resource is configured in ways that look like no-ops but actually cause recreation. I’ve seen a provider update trigger a load balancer replacement that took down production because someone ran terraform apply without carefully reading the plan.
Always pin your provider versions. Always read the plan before applying.
These rules sound obvious until you’re rushing to push a fix.
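One guard against skimming past a replacement is to have a script read the plan for you. If you save a plan with `terraform plan -out=tfplan` and convert it with `terraform show -json tfplan`, each entry in `resource_changes` carries a list of actions, and a replacement appears as a delete paired with a create. The sketch below flags those; the function name and the resource addresses in the sample are hypothetical.

```python
import json
from typing import Any, Dict, List


def destructive_changes(plan: Dict[str, Any]) -> List[str]:
    """List resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions:
            kind = "replace" if "create" in actions else "delete"
            flagged.append(f"{kind}: {rc['address']}")
    return flagged


# Trimmed, made-up example of `terraform show -json` output.
sample = json.loads("""
{
  "resource_changes": [
    {"address": "aws_lb.main", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
    {"address": "aws_instance.old", "change": {"actions": ["delete"]}}
  ]
}
""")

print(destructive_changes(sample))
# ['replace: aws_lb.main', 'delete: aws_instance.old']
```

Failing the pipeline, or requiring a second approval, whenever this list is non-empty makes the load balancer replacement hard to miss even when you’re rushing.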
When the Right Answer Is “It Depends”
Some outage scenarios don’t have a clear correct response. The right call depends on context that only you have.
To rollback or push forward?
Your deploy broke something. Do you rollback to the previous version or push a fix forward? Both have risks. Rollback might re-introduce a different bug. Pushing forward means more changes during an incident.
I’ve seen teams have religious wars about this. In reality, the answer depends on how confident you are in the fix, how fast you can deploy, and how bad the current state is. There’s no universal rule.
Multi-region failover
Your primary region is having issues. Do you trigger failover to the secondary? If the issue is transient, failover might cause more disruption than waiting it out. If it’s extended, waiting costs you money and reputation.
I’ve gotten this wrong in both directions. I’ve failed over too early and caused data sync issues. I’ve also waited too long and let a five-minute blip turn into a thirty-minute outage. The judgment call doesn’t get easier with experience. You just get faster at making it.
Communicating during uncertainty
How do you write a status page update when you don’t know what’s wrong yet? Say too little and customers think you’re hiding something. Say too much and you commit to a theory that might be wrong.
My approach now is to communicate what we know, what we’re investigating, and when we’ll update next. Avoid speculation about root cause until you’re confident. People can handle “we’re still investigating” better than “we think it’s X” followed by “actually it was Y all along.”
What It Looks Like When Things Work
After years of incidents, I’ve learned to appreciate the quiet. When your infrastructure is healthy, you see patterns that are easy to take for granted:
Deploys complete in predictable time windows without manual intervention
Alerts fire only for actionable issues, not noise
On-call rotations are boring because nothing escalates past the first responder
New team members can follow runbooks to resolution without escalation
Post-incident reviews focus on process improvements, not blame
I remember a stretch of about four months where my team had zero production incidents. Not because we weren’t deploying; in fact, we shipped every day. But our health checks caught bad deploys before traffic reached them. Our capacity headroom absorbed traffic spikes. Our monitoring showed us slow degradation before it became visible to users.
That period didn’t show up on any dashboard or OKR. Nobody got a bonus for “didn’t break production.” But it was the direct result of years of learning from failures and building systems that failed gracefully.
The goal isn’t to never have outages. That’s not realistic.
The goal is to have short outages with clear diagnosis paths and confident remediation. And then to learn something from each one that makes the next one less likely or less severe.
What’s the outage that taught you the most about your own infrastructure? The one that changed how you build things now?
I’d love to hear about it in the comments.
With Love and DevOps,
Maxine
If you made it this far and you’re managing cloud infrastructure with Terraform, you might want to keep this one close too.
What Is Infrastructure as Code? A Beginner’s Guide to Terraform and Cloud Infrastructure is where I start people who are new to IaC or who understand it conceptually but haven’t had to debug it in a real environment yet. It covers the mental model behind declarative infrastructure so that articles like this one make sense end to end, not just the code snippets.
And if you’re working with AI in your stack or trying to understand where LLMs actually fit in a production system without the hype, LLMs for Humans: From Prompts to Production is the guide I wish existed when I started. Written by an engineer for engineers, covering RAG, function calling, and the operational reality of running AI in real systems.
Let’s stay connected
Last Updated: May 2026
I post about cloud infrastructure, DevOps, and AI in production a few times a week on LinkedIn. The real stuff: what I’m debugging, what I’m deploying, and the occasional thing that broke in a way nobody documented anywhere.
Come say hi. I actually respond.





