
20-Year Systems, Part 3: Incident Patterns and Fixes

The recurring incident patterns that almost took us down—and the fixes that stuck.

incidents, reliability, postmortem, enterprise



Long-lived systems collect scar tissue. Every outage leaves a mark—sometimes a fix that holds for a decade, sometimes a workaround that becomes its own problem.

This is Part 3 of the 20-Year Systems series. Parts 1 and 2 covered design constraints and operational playbooks. This one covers the incidents that almost killed us, the patterns behind them, and the fixes that actually stuck.

These aren't hypotheticals. These are the 3 a.m. calls, the "oh shit" Slack messages, the fixes we shipped while customers waited.


Pattern 1: Silent Queue Growth

The most dangerous outages are the ones that look normal until they don't.

What happened:

In 2015, our primary payment processor started throttling requests without warning. No error codes—just slower responses. Our retry logic interpreted "slow" as "failed" and queued retries. Those retries also got throttled, spawning more retries.

Within an hour, we had 40,000 queued payment attempts. The queue depth didn't trip alarms because we'd set thresholds based on normal variance, not exponential growth. Customers started calling about delayed vouchers before we noticed the backlog.

The root cause:

We trusted the happy path too much. Our queue monitoring measured depth but not growth rate. A queue going from 100 to 500 looks different than a queue going from 500 to 40,000, but our alerts treated them the same.
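That distinction can be encoded directly in monitoring. A minimal sketch, purely illustrative (the class name, window size, and thresholds are invented for this post, not our production alerting): alert on the ratio between the newest and oldest depth samples, not just the absolute number.

```python
from collections import deque

class QueueGrowthMonitor:
    """Alert on growth rate, not just absolute depth (illustrative sketch)."""

    def __init__(self, depth_limit=10_000, growth_factor=3.0, window=5):
        self.depth_limit = depth_limit
        self.growth_factor = growth_factor
        self.samples = deque(maxlen=window)  # recent depth samples

    def observe(self, depth):
        """Record a depth sample; return True if an alert should fire."""
        self.samples.append(depth)
        if depth >= self.depth_limit:
            return True  # classic absolute-depth alarm
        # Growth-rate alarm: depth multiplied across the sample window,
        # even though it is still far below the absolute limit.
        oldest = self.samples[0]
        return (len(self.samples) == self.samples.maxlen
                and oldest > 0
                and depth / oldest >= self.growth_factor)
```

With this shape, a queue going 100 → 150 → 450 fires long before 40,000, while a queue hovering at a high-but-stable depth does not.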

The fix that stuck:

  • Added queue depth SLOs (Service Level Objectives—the promises we make to customers about performance) with hard caps and automatic back-pressure. When queues hit 80% of capacity, new work gets a "busy, retry later" response instead of silent queueing.
  • Switched retry logic from fixed intervals to exponential backoff with jitter—meaning each retry waits longer than the last, with randomization to prevent thundering herds.
  • Created a "drain mode" runbook: steps to safely pause new work, process the backlog, and resume without losing data.
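The first two fixes above are only a few lines of code each. A hedged sketch, assuming the 80% threshold and "full jitter" backoff as described; the function names and defaults are invented for illustration:

```python
import random

BUSY_THRESHOLD = 0.8  # back-pressure kicks in at 80% of capacity

def accept_work(queue_depth: int, capacity: int) -> bool:
    """Back-pressure gate: refuse new work once the queue passes the
    threshold, so callers get 'busy, retry later' instead of silent queueing."""
    return queue_depth < capacity * BUSY_THRESHOLD

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    """'Full jitter' exponential backoff: the ceiling doubles with each
    attempt, and the actual delay is a random draw under that ceiling,
    which spreads retries out and prevents thundering herds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

The jitter matters as much as the exponent: without it, every throttled client retries at the same instant and the herd re-forms.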

The payoff:

The next time a vendor flinched—2018, different processor, similar throttling—the back-pressure kicked in automatically. Users saw a clear "system busy" message for about 20 minutes. No manual intervention. No 40,000-item backlog. No customer calls.


Pattern 2: Long-Running Jobs with Hidden State

Batch jobs are liars. They look stateless. They are not.

What happened:

In 2014, we needed to reprocess five years of credentialing history—about 3 million records—for a compliance audit. We kicked off the job on a Friday afternoon, expecting it to finish over the weekend.

By Sunday, it had processed 60% of records and then stalled. No errors in the logs. Just... stopped.

Investigation revealed the job had cached credential definitions at startup. A schema migration we'd deployed on Saturday added a new credential type. The running job had stale cache. When it hit records requiring the new type, it silently skipped them—no errors, no logs, just missing data.

The root cause:

The job loaded context once at startup and assumed it wouldn't change. A reasonable assumption for a job that runs in minutes. Catastrophic for a job that runs for days.

The fix that stuck:

  • Jobs became pure functions over explicit payloads. No startup caching. Every batch fetches fresh state from the source of truth.
  • Added versioned readers—jobs record which schema version they're using and fail fast if the schema changes mid-run.
  • Blocked mid-run deployments for critical jobs. If a long-running job is active, deployments to that domain wait or require explicit acknowledgment.
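A versioned reader can be as simple as pinning the schema version at job start and re-checking it before every batch. A sketch under those assumptions (the names here are hypothetical, not our actual job framework):

```python
class SchemaChangedError(RuntimeError):
    """Raised when the schema moves under a running job."""

class VersionedReader:
    """Pin the schema version at job start and fail fast if it drifts,
    rather than silently processing records with stale assumptions."""

    def __init__(self, get_schema_version):
        self.get_schema_version = get_schema_version  # callable -> version str
        self.pinned = get_schema_version()

    def read_batch(self, fetch_batch):
        """Verify the schema before every batch instead of caching at startup."""
        current = self.get_schema_version()
        if current != self.pinned:
            raise SchemaChangedError(
                f"schema moved {self.pinned} -> {current}; restart the job")
        return fetch_batch()
```

A loud failure mid-run is the point: a job that dies with `SchemaChangedError` at 60% is recoverable; a job that silently skips records is not.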

The payoff:

In 2018, another audit required reprocessing. Same scale—millions of records. The job ran for 36 hours across a deployment window. When the deployment happened, the job paused cleanly, picked up with fresh context, and completed without data loss.

That audit passed without custom tooling or weekend work.


Pattern 3: Third-Party Timeouts Cascading

When one system gets slow, every system downstream gets slower. Then they all fall over together.

What happened:

In 2016, a state credentialing API that normally responded in 200ms started responding in 8 seconds. Our default timeout was 30 seconds—plenty of margin, we thought.

What we didn't account for: thread pool exhaustion. Each slow request held a thread for 8 seconds instead of 200ms. Our thread pool that normally served 500 concurrent requests could now serve 20.

Requests backed up. Users saw spinning loaders. Some users refreshed, creating duplicate requests. Some users clicked "submit" multiple times, creating duplicate submissions. Our idempotency guards caught some but not all.

The root cause:

We had timeouts, but they were too generous. We didn't have circuit breakers—automatic mechanisms that stop calling a failing service to prevent cascade failures.

The fix that stuck:

  • Introduced circuit breakers with three states: closed (normal), open (failing, don't call), and half-open (testing recovery). When error rates exceed 50% over 30 seconds, the breaker opens and stops calling the dependency.
  • Set per-tenant timeouts based on actual P99 latency, not generous guesses. If normal is 200ms, timeout at 2 seconds, not 30.
  • Added hedged requests with idempotency keys—if a request takes longer than P95, we send a parallel request to a backup. Idempotency keys ensure only one succeeds.
  • When breakers open, we serve "last known good" cached data where safe, with clear UI indicators that data may be stale.
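The three-state breaker described above fits in a few dozen lines. This is a single-threaded sketch with invented names, parameterized with the 50%-over-30-seconds thresholds from the fix list; a production breaker also needs locking and sliding windows:

```python
import time

class CircuitBreaker:
    """Three states: closed (normal), open (failing, don't call),
    half-open (testing recovery). Illustrative sketch only."""

    def __init__(self, error_threshold=0.5, min_calls=10,
                 cooldown=30.0, clock=time.monotonic):
        self.error_threshold = error_threshold
        self.min_calls = min_calls  # don't trip on tiny samples
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.state = "closed"
        self.calls = 0
        self.errors = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "half-open"  # let one probe through
                return True
            return False
        return True

    def record(self, success: bool) -> None:
        if self.state == "half-open":
            # One probe decides: recover fully or re-open.
            self.state = "closed" if success else "open"
            if self.state == "open":
                self.opened_at = self.clock()
            self.calls = self.errors = 0
            return
        self.calls += 1
        self.errors += 0 if success else 1
        if (self.calls >= self.min_calls
                and self.errors / self.calls >= self.error_threshold):
            self.state = "open"
            self.opened_at = self.clock()
```

The dependency never sees the open state as traffic; it sees silence, which is exactly what a struggling service needs.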

The payoff:

In 2019, the same state API had another latency spike—12 seconds per request—during an unannounced maintenance window. Our breaker tripped in 45 seconds. Users saw a banner: "State verification delayed. Using cached data." Cached credentials were current as of that morning.

No thread pool exhaustion. No cascade. No duplicate submissions. The banner disappeared when the API recovered.
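The hedged-request tactic from the fix list is also small. A toy version: fire a backup call if the primary hasn't answered within the hedge delay (your measured P95 in practice), and take whichever answers first. Names and defaults are invented; deduplication via idempotency keys is assumed to happen on the server side.

```python
import threading

def hedged_call(primary, backup, hedge_after=0.3, timeout=5.0):
    """Run primary; if it hasn't answered within hedge_after seconds,
    start backup in parallel. Returns the first result to arrive.
    Both calls must carry the same idempotency key upstream so that
    only one takes effect."""
    result = {}
    done = threading.Event()

    def run(fn):
        try:
            value = fn()
        except Exception:
            return  # a failed leg just loses the race
        if not done.is_set():
            result.setdefault("value", value)
            done.set()

    threading.Thread(target=run, args=(primary,), daemon=True).start()
    if not done.wait(hedge_after):  # primary is slow: hedge
        threading.Thread(target=run, args=(backup,), daemon=True).start()
    done.wait(timeout)
    return result.get("value")
```

Hedging trades a small amount of duplicate load for a large cut in tail latency, which is why it only makes sense once idempotency is solid.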


Pattern 4: Unbounded Batch Imports

"Bulk import" is another way of saying "please break production."

What happened:

In 2013, a national testing provider sent us their quarterly data dump. Normal quarterly dumps were 50,000 records. This one was 2 million—they'd included historical data without warning.

The import job was designed for 50,000 records. It loaded everything into memory, processed sequentially, and wrote results transactionally. Two million records meant memory exhaustion, transaction timeouts, and—the fun part—deadlocks with live traffic trying to read the same tables.

Live registrations started failing. Support lit up. We killed the import job, but the damage was done: half-written data, confused state, angry customers.

The root cause:

Batch jobs had no resource limits. They assumed they'd finish before resource constraints mattered. When input size exploded, assumptions became production fires.

The fix that stuck:

  • Sharded imports by tenant and time window. Instead of one 2-million-record job, we'd run 200 ten-thousand-record jobs with configurable parallelism.
  • Enforced batch size limits at ingestion. Files larger than threshold get automatically split before processing.
  • Prioritized live traffic over bulk. Import jobs run at lower priority and yield when live traffic queues grow.
  • Added a "circuit breaker" for imports: if live traffic latency exceeds threshold, bulk jobs pause automatically.
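Sharding plus yield-to-live-traffic is mostly a loop with an escape hatch. A sketch with hypothetical names, and a callback standing in for whatever load signal you actually monitor:

```python
def shard_batches(records, batch_size=10_000):
    """Split an unbounded import into bounded batches, so a surprise
    2-million-record dump becomes many small, pausable jobs."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def run_import(records, process_batch,
               live_traffic_hot=lambda: False, batch_size=10_000):
    """Process batches, yielding to live traffic: if the system is hot,
    stop and return how many records were committed so far. The caller
    can resume from that offset once traffic calms down."""
    done = 0
    for batch in shard_batches(records, batch_size):
        if live_traffic_hot():
            break  # bulk pauses; live traffic wins
        process_batch(batch)
        done += len(batch)
    return done
```

Because each batch commits independently, a pause leaves the system in a clean, resumable state instead of the half-written mess from 2013.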

The payoff:

The fixes we'd built after the 2013 crisis were tested when a later quarterly dump arrived at 250,000 records—five times the normal size. The sharded importer processed it in the background without incident. Live traffic never noticed. No deadlocks. No memory issues. No drama.


Pattern 5: Why Audit Trails Save You

Manual overrides are necessary. The question is whether you can prove what happened afterward.

What happened:

In 2010, a support rep manually edited a customer record to resolve a complaint. The customer had been double-charged, and the rep refunded them by editing the payment record directly.

The edit fixed the customer's portal view. It also created a divergence: the payment processor still showed the original charge, but our system showed the refund. The nightly reconciliation flagged it as an anomaly.

Why this wasn't a crisis:

We'd had full audit logging from day one. Every manual edit captured who made it, when, what changed, and why. Within 20 minutes of the reconciliation flag, we'd pulled the audit log, seen exactly what happened, and identified the fix.

The finance team saw the before/after values, the timestamp, and the support rep's notes. What could have been hours of forensic work became a 15-minute resolution.

The lesson that held:

This incident became our go-to example for why we never skipped audit trails, even when they seemed like overhead:

  • Every manual change logged: actor, timestamp, reason, before/after values.
  • High-risk changes (payment edits, credential status changes) required two-person approval.
  • The "support tooling" layer performed common fixes through validated, logged pathways.
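A logged, validated pathway for manual changes can start as small as this sketch. Field names and the approval rule are illustrative, not our actual support tooling:

```python
import json
from datetime import datetime, timezone

def audit_entry(actor, reason, before, after, approved_by=None):
    """One audit record per manual change: who, when, why, and the
    before/after values. High-risk edits carry a second approver."""
    return {
        "actor": actor,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
        "before": before,
        "after": after,
        "approved_by": approved_by,
    }

def write_audit(log, entry, high_risk=False):
    """Append-only log; refuse high-risk changes that lack two-person
    approval rather than recording them."""
    if high_risk and not entry["approved_by"]:
        raise PermissionError("high-risk change requires a second approver")
    log.append(json.dumps(entry, sort_keys=True))
```

The key property is that the validated pathway is *easier* than raw database access, so reps use it by default and the log stays complete.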

The payoff:

In 2018, an auditor asked to see all manual credential changes for the past year. We ran a query against the audit log and handed them a report in 30 minutes. That same auditor told us most companies they reviewed couldn't answer that question at all.


What Stuck: The Meta-Patterns

After twenty years, the incidents blur together. The fixes don't. These are the principles we extracted:

  • Build for retries, back-pressure, and bounded work. Assume every queue will overflow. Assume every retry will stack. Design for it.
  • Keep jobs stateless and restartable. Any job that can't be killed and restarted safely is a liability.
  • Bound your batch imports. Shard, prioritize live traffic, and pause bulk when production gets stressed.
  • Audit everything from day one. When something goes wrong—and it will—you need to prove exactly what happened.
  • Prefer many small incidents you can rehearse over one big surprise you cannot. Practice the scary scenarios. Time how long detection takes. Fix the gaps.

Incident Hygiene That Actually Works

Process matters as much as technology. These rituals made incidents manageable:

One channel of record:

Incidents lived in a single Slack channel with timestamps, owners, and current status. No split-brain between chat, tickets, and email. If it wasn't in the incident channel, it didn't happen.

Templates that reduce cognitive load:

Every incident used the same format: trigger, impact, scope, mitigating actions, ETA to next update. Consistency reduced stress when everyone was tired.

Next-update discipline:

Even if there was "no update," we posted on schedule. Stakeholders calmed down because they knew when to expect news. Silence breeds anxiety; scheduled updates—even boring ones—build trust.

Debriefs with decisions, not discussions:

Retros ended with specific guardrails to implement, owners assigned, and due dates set. No "we should probably look into..." without accountability. If it didn't have an owner and a date, it didn't count.


Quick Plays for Common Patterns

When it's 3 a.m. and your brain is mush, decision trees beat reasoning:

Queues rising unexpectedly:

  1. Check circuit breaker state—is something tripped?
  2. Throttle heavy tenants if one is dominating
  3. Shed non-critical work (defer batch jobs, pause imports)
  4. Surface user-friendly "retry soon" messaging
  5. Call for help if queue growth doesn't stabilize in 15 minutes

Partner slowness cascading:

  1. Trip the circuit breaker manually if it hasn't auto-tripped
  2. Serve stale-but-safe cached data where allowed
  3. Queue retries with idempotency keys
  4. Post status update: "Partner system slow, using cached data"

Bad batch import overwhelming system:

  1. Pause bulk jobs immediately
  2. Shard remaining work into smaller batches
  3. Resume with monitoring on lock/wait time
  4. Prioritize live traffic until batch completes

Practice Before Production Does It For You

The best incident response is the one you've rehearsed:

  • Monthly "chaos hour": Pick one scenario—partner outage, queue flood, schema surprise—and run it in staging. Time detection. Time recovery. Document gaps.
  • Detection timing: How long until the right person is paged? How long until users see a graceful degradation message? If detection takes longer than impact, your monitoring is wrong.
  • Dashboard validation: After every drill, ask: did the dashboards show the signal we actually needed? If not, fix dashboards before the next drill.

The point isn't zero incidents. Zero incidents is a fantasy. The point is predictable, bounded incidents with fast, repeatable recoveries.


Context → Decision → Outcome → Metric

  • Context: 20-year healthcare credentialing platform, 500M+ transactions, 15+ external integrations, regulated industry requiring audit trails and compliance reporting.
  • Decision: Treated every major incident as a pattern to fix permanently, not a fire to forget. Built fixes into architecture, automation, and operational playbooks.
  • Outcome: Recurring incident rate dropped by 60% over five years. Mean time to detect fell from hours to minutes. Mean time to recover fell from hours to under 30 minutes for documented scenarios.
  • Metric: Zero repeat incidents from the same root cause after fixes were implemented. Audit trail completeness: 100% for all production changes.

Anecdote: The 3 A.M. Queue That Didn't Become a Crisis

In 2019, I woke up to a 3 a.m. page. Payment queue depth at 85% capacity, still growing.

Three years earlier, that page would have meant four hours of manual queue purging, customer impact, and a morning full of apology emails.

This time, I opened the dashboard, saw the back-pressure had already kicked in, and watched the queue stabilize at 90%. Users were seeing "temporarily busy" messages. Retries were backing off with jitter. No cascade. No duplicate charges.

I checked the circuit breaker status—tripped automatically against the payment processor. Checked the processor's status page—"degraded performance, investigating."

I posted an update in the incident channel: "Processor slow, breaker tripped, back-pressure active, monitoring." Then I went back to sleep.

The processor recovered at 5 a.m. The breaker closed automatically. The retry queue drained. By morning standup, there was nothing to report except "the system worked as designed."

That's the payoff of fixing patterns instead of fighting fires.

Mini Checklist: Incident Pattern Prevention

  • [ ] Queues have depth limits, growth rate alerts, and automatic back-pressure
  • [ ] Long-running jobs are stateless, restartable, and version-aware
  • [ ] Batch imports are sharded, bounded, and lower-priority than live traffic
  • [ ] Manual changes go through audited workflows with approval requirements
  • [ ] Circuit breakers protect against cascading failures from slow dependencies
  • [ ] Incident channel has one source of truth with templated updates
  • [ ] Post-incident fixes have owners, dates, and verification criteria