
20-Year Systems, Part 2: The Ops Playbook

The operating playbook that kept a multi-state credentialing platform reliable for two decades.

operations · reliability · SRE · enterprise


Architecture gets you potential reliability. Operations determines whether you cash it in or light it on fire.

Conductor's architecture was solid—seams, idempotency, heavy logging, all the patterns I wrote about in Part 1. But architecture alone doesn't keep a system alive for twenty years. The playbook does.

This is the operating discipline that kept a healthcare credentialing platform reliable across two decades of regulatory shifts, vendor churn, customer growth, and team turnover.


Runbooks for Everything That Hurts

The difference between a 10-minute incident and a 4-hour scramble is usually documentation.

The rule we enforced:

Every alert had to map to a play. No exceptions. If you couldn't answer these questions, the alert got disabled or rewritten:

  • Who owns this? Not "the team"—a name, a phone number
  • What does this alert mean in business terms? "Database slow" isn't good enough. "Registrations backing up" is
  • What do you do first? Step one, not "assess the situation"
  • What's the reversal path if that doesn't work? Because the first thing you try usually isn't the fix
  • When do you escalate, and to whom? Specific criteria, specific people

If a runbook didn't exist, you had two choices: write one or delete the alert. "We'll figure it out when it fires" was not an option.
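The "no alert without a play" rule is easy to enforce mechanically. Here's a minimal sketch of a validator that rejects alert definitions missing any of the required answers; the `Alert` fields and the check itself are illustrative, not our actual tooling:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    owner: str            # a named person, not "the team"
    owner_phone: str
    business_meaning: str  # e.g. "registrations backing up"
    first_step: str        # concrete step one, not "assess"
    reversal_path: str     # what to do when step one fails
    escalation: str        # specific criteria, specific people
    runbook_url: str

def validate(alert: Alert) -> list[str]:
    """Return reasons this alert should be disabled or rewritten.
    An empty list means the alert is allowed to exist."""
    problems = []
    for field_name in ("owner", "owner_phone", "business_meaning",
                       "first_step", "reversal_path", "escalation",
                       "runbook_url"):
        if not getattr(alert, field_name).strip():
            problems.append(f"missing {field_name}")
    if alert.owner.lower() in ("the team", "ops", "oncall"):
        problems.append("owner must be a named person")
    return problems
```

Run this in CI against your alert config and the "we'll figure it out when it fires" alerts never make it to production in the first place.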

The drills:

Quarterly, we rehearsed the scariest paths:

  • Failed payment processor: Can you flip to the backup and queue retries in under 15 minutes?
  • Delayed state database responses: Can you flip the circuit breaker and serve cached data without customers noticing?
  • Mass schedule reprocessing: Can you run the reconciliation job without corrupting live data or double-booking exams?

New operators watched. Then they shadowed. Then they ran a drill themselves before taking primary on-call.

The payoff:

In 2016, our primary payment processor went dark at 11 p.m. on a Thursday. The on-call engineer had been on the team for three months. She'd never touched the payment code.

But the runbook existed. She opened it, followed the steps—flip breaker, enable backup, queue retries—and the switch was done in 14 minutes. Customers never noticed. She went back to bed.
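The play itself is simple enough to sketch. This is a hypothetical version of those steps as a guarded, idempotent script; `primary`, `breaker`, `backup`, and `retry_queue` stand in for whatever your platform actually uses:

```python
# Hypothetical helpers; each object wraps a real integration.
def failover_to_backup(primary, breaker, backup, retry_queue, notify):
    """The runbook steps as code: confirm the outage, flip the
    breaker, enable the backup, queue retries, notify the channel.
    Each step is idempotent, so rerunning the script is safe."""
    if primary.healthy():                  # step 1: confirm it's down
        notify("primary healthy; aborting failover")
        return False
    breaker.open()                         # step 2: stop new traffic
    backup.enable()                        # step 3: bring up standby
    retry_queue.requeue_in_flight()        # step 4: replay when able
    notify("failover complete; retries queued")  # step 5
    return True
```

Whether the steps stay in prose or graduate to a script, the structure is the same: confirm, flip, enable, queue, notify.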

That runbook took an afternoon to write. It saved us countless hours and probably a contract renewal.


Health Signals Over Noise

Most monitoring setups drown you in noise and still miss what matters.

CPU spiked? Maybe. Disk filling up? Sure. But what the business actually cares about is:

  • Are registrations processing?
  • Are scores getting posted on time?
  • Are vouchers reconciling before the deadline?

The structure we built:

We defined health in business terms first. These were our Service Level Objectives (SLOs)—the promises we made to customers about performance:

  • Registrations processed within 60 seconds: The customer clicked "submit," the clock started
  • Scores posted within 15 minutes of receipt: From the moment the exam provider sent results
  • Voucher reconciliation completed before 6 a.m. daily: Because that's when the call center opened

Every dashboard started with these. System metrics—CPU, memory, queue depth—only mattered when they explained why a business metric was slipping.
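Measuring an SLO like "registrations processed within 60 seconds" needs nothing exotic: pair each submission timestamp with its completion timestamp and count the fraction inside the window. A minimal sketch, with the event shape assumed for illustration:

```python
from datetime import timedelta

def slo_compliance(events, threshold=timedelta(seconds=60)):
    """Fraction of registrations processed within the SLO window.
    `events` is a list of (submitted_at, processed_at) datetime
    pairs; an empty window counts as fully compliant."""
    if not events:
        return 1.0
    within = sum(1 for submitted, processed in events
                 if processed - submitted <= threshold)
    return within / len(events)
```

The same function works for "scores posted within 15 minutes" by swapping the threshold; the dashboard is just this number plotted over time.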

We ran three layers of signals:

  • Metrics for trends: "Is the system getting slower over time?"
  • Structured logs for forensics: "What happened to this specific request?"
  • Traces for weirdness: "Why did this transaction take 10x longer than normal?"

No single layer owned "truth." They reinforced each other.

The payoff:

In 2019, latency crept up slowly—so slowly that no single alert fired. But the SLO dashboard showed "scores posted within 15 minutes" dropping from 99.8% to 97%.

That was enough to trigger investigation. We found a slow query that had been fine for years but degraded as data volume grew. Fixed it before customers complained.

Without business-level SLOs, we would have noticed only when someone called angry.


Change Management with Guardrails

Velocity without guardrails is roulette.

We shipped thousands of changes over twenty years. Most were unremarkable. A few could have been disasters. The difference was the guardrails.

Strict deploy windows:

Deploys happened in defined windows—mid-day, mid-week, with someone watching. Emergency hotfixes existed, but we treated every hotfix as a failure of planning. Why didn't we catch this in testing? Why was this so urgent? Those questions got asked and answered.

Canaries and feature flags:

High-risk changes hit 5% of tenants first. Feature flags defaulted to off. "Turn it on everywhere" required verification in production with real traffic.

For one major workflow change, we ran a canary for six weeks. The 5% cohort caught three edge cases we'd missed. Each was fixed before the full rollout. Zero incidents at 100%.
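One detail that matters for a six-week canary: the 5% cohort has to be stable, or tenants flap in and out of the new behavior between requests. Hashing the tenant ID against the flag name gives a deterministic bucket. A sketch (names are illustrative):

```python
import hashlib

def in_canary(tenant_id: str, flag: str, percent: int = 5) -> bool:
    """Deterministically bucket a tenant into the canary cohort.
    Hashing (flag, tenant) keeps the cohort stable across deploys
    and makes cohorts independent between different flags."""
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent
```

Ramping to 100% is then just raising `percent`, and tenants already in the cohort stay in it.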

The payoff:

Over twenty years, we had exactly one catastrophic deploy—and it taught us the lesson that built the guardrails. In 2008, a migration script ran in production without a canary. It corrupted scheduling data for about 40 customers. Cleanup took two days.

After that, no exceptions. Canary or don't deploy. The rule held for fifteen more years.


Backlogs We Actually Work

"We'll fix it later" is the most expensive lie in software.

Ops debt—the reliability improvements that keep getting deferred—compounds. That flaky alert you ignored becomes a fire drill. That manual reconciliation you tolerated becomes a full-time job. That workaround you promised to remove becomes permanent infrastructure.

The discipline we enforced:

Ops debt was first-class work. When we burned through error budget—when SLOs dipped—it created a ticket. Not a discussion. A ticket, sized to a day or less, scheduled in the current sprint.

Every incident got a short RFO (Reason for Outage)—our term for a mini-postmortem:

  • What triggered it? Root cause, not symptoms
  • What was the blast radius? How many customers, how much revenue at risk
  • What detection gap did we have? Why didn't we catch it sooner?
  • What guardrail would prevent recurrence? Specific, actionable, assignable

Guardrails from RFOs became ops debt tickets. Ops debt tickets got scheduled. "Later" meant "this sprint or next," not "someday."

The payoff:

The number of recurring incidents dropped by half within a year of enforcing this discipline. The same problems stopped showing up because fixes were institutional, not heroic.

One recurring issue—a race condition in batch processing—had caused monthly late-night pages for two years. Everyone knew about it. Everyone assumed someone would fix it "eventually."

After we started treating ops debt as sprint work, it got fixed in three days. Three days of effort eliminated two years of pain.


Vendor Volatility Plan

External dependencies fail. Vendors get acquired. APIs change without notice. If your only payment processor goes down, what do you do?

The structure we maintained:

For every critical external dependency—payments, background checks, state credentialing APIs—we maintained a hot standby. A second vendor, legally contracted and technically integrated, even if volume was low.

We flowed a small percentage of traffic through the backup continuously. Just enough to know it worked.
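The routing itself can be as simple as a weighted coin flip plus a health check, so the backup path is exercised on every normal day and takes all traffic on a bad one. A sketch under those assumptions:

```python
import random

def pick_processor(primary_healthy: bool,
                   backup_share: float = 0.02) -> str:
    """Route a small, steady share of traffic through the backup so
    the integration stays proven; fail over fully when the primary
    is down. Share and names are illustrative."""
    if not primary_healthy:
        return "backup"
    return "backup" if random.random() < backup_share else "primary"
```

The specific share matters less than the fact that it's nonzero: a backup that never sees production traffic is a contract, not a capability.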

Contract renewals, price changes, and SLA modifications lived in a shared calendar with assigned owners. Surprises were escalations, not emergencies.

The payoff:

In 2015, a state contractor changed their onboarding requirements mid-cycle. The primary integration needed weeks of rework. But we had the backup.

Within days, we shifted volume to the standby flow. Customers saw no disruption. We fixed the primary integration at our pace, not theirs.

That backup integration cost us maybe $50K to maintain annually. It saved us from a potential $500K contract loss and months of emergency work.


Data Hygiene as Operations

Dirty data causes more outages than bad code.

When your platform state doesn't match your external dependencies, everything falls apart. Customers see mismatches. Reports lie. Support spends hours debugging phantom issues.

The discipline we enforced:

Daily automated reconciliations compared our platform state against external systems:

  • Our records vs. state credentialing databases: Catch license status changes before they bite
  • Our payment records vs. processor statements: Every penny accounted for, every dispute defensible
  • Our scheduling data vs. exam provider calendars: No "we thought you were booked" surprises

Deltas triggered human review. Not an email that got ignored—an item in a queue with a "resolve" button that required action.
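The core of each reconciliation is an outer join keyed by record ID: anything present on one side but not the other, or present on both with different values, becomes a delta for the review queue. A minimal sketch, assuming dict-of-dicts records:

```python
def reconcile(ours: dict, theirs: dict) -> list[dict]:
    """Compare our records against an external system, keyed by id.
    Returns every mismatch: missing on either side, or differing
    values. Each delta is an item for the human review queue."""
    deltas = []
    for key in ours.keys() | theirs.keys():
        mine, other = ours.get(key), theirs.get(key)
        if mine != other:
            deltas.append({"id": key, "ours": mine, "theirs": other})
    return deltas
```

Run nightly per external system; an empty result means the data agreed, and anything else requires a human to press "resolve."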

Every action was stamped in immutable audit logs: actor, source IP, payload hash, timestamp. Logs were queryable by ops, not just engineers. Support could trace a customer's entire history without asking engineering for help.
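An audit record with that shape is cheap to produce. One sketch, assuming JSON-serializable payloads; hashing a canonical serialization means you can later prove what was submitted without the log holding the raw payload:

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_entry(actor: str, source_ip: str, payload: dict) -> dict:
    """Build an append-only audit record: actor, source IP, payload
    hash, timestamp. Sorting keys canonicalizes the payload so
    logically equal payloads always hash identically."""
    canonical = json.dumps(payload, sort_keys=True).encode()
    return {
        "actor": actor,
        "source_ip": source_ip,
        "payload_sha256": hashlib.sha256(canonical).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
```

Write these to append-only storage and index them by customer, and support can answer "what happened to this account" without paging an engineer.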

The payoff:

In 2017, a customer disputed a payment—claimed they'd never authorized it. In a system with poor audit trails, that's a legal headache.

We pulled the logs. Showed the customer's IP, the timestamp, the signed request, the processor confirmation. Dispute resolved in one email. That audit trail wasn't just for regulators—it protected us every week.


Customer Communication Rituals

Silence during incidents breeds distrust.

When something breaks, customers don't just want it fixed. They want to know what's happening, when it will be fixed, and that someone competent is working on it.

The rituals we maintained:

Status page updates for every incident:

  • What's the impact? Plain language, not jargon. "You can't register for exams right now" beats "registration service degraded"
  • What's the scope? Which customers? Which features? Specifics reduce panic
  • When's the next update? A specific time. "We'll update by 3 PM" not "soon"

Quarterly reliability reports to enterprise customers:

  • Uptime numbers: The actual percentage, not "we met SLA"
  • Incident count and severity: How many, how bad, what they cost
  • What we fixed since last quarter: Proof that we learn from failures

No hiding, no spin. If we had a bad quarter, we said so and explained what we were doing about it.

The payoff:

In 2014, we had a rough month. Two significant incidents, both with customer impact. We could have minimized it, hoped no one noticed.

Instead, the quarterly report led with the incidents. What happened. What we learned. What guardrails we added.

Three enterprise customers emailed to say they appreciated the honesty. All three renewed. One specifically cited the transparency as a reason.

Customers stay through bumps when they see ownership and improvement. They leave when they feel lied to.


People Practices That Preserve Reliability

Burned-out engineers write unreliable code.

On-call that exhausts people becomes on-call that breaks systems. Heroics that save the day once become expectations that destroy teams.

The practices we enforced:

No solo on-call without rehearsed playbooks. New engineers shadowed for two full rotations—watching, asking questions, learning the system—before taking primary.

Every retro ended with named owners and due dates. Not a list of "things we should do"—actual assignments. "Alice will add the breaker by Friday" or it didn't count.

Guardrail work counted as "real work." If you spent two days writing runbooks, that was as valuable as shipping a feature. It was recognized and celebrated.

The payoff:

On-call stopped being punishment. People volunteered for rotations because the playbooks worked, the drills had prepared them, and they weren't alone.

In ten years, we had zero burnout-related departures from the ops rotation. Zero. That's the real metric.


The Daily/Weekly/Monthly Cadence

Reliability isn't a one-time project. It's a practice.

Daily:

  • Review SLO dashboards—are we meeting our promises to customers?
  • Skim error budget burn—are we trending toward trouble before month-end?
  • Scan quarantine queues—are bad records piling up waiting for human review?
  • Close the loop on any alerts that fired—no lingering "acknowledged" status without action

Weekly:

  • Schedule 1-2 ops debt tickets—pick from the backlog, assign to actual humans
  • Run a lightweight retro on any incidents—what can we prevent next time?
  • Rehearse one runbook in a sandbox—failover, drain mode, or rollback drill

Monthly:

  • Vendor SLA review—any contract renewals, price changes, or deprecations coming?
  • DR test for a critical path—does the backup actually work when you need it?
  • Update onboarding materials—what did we learn that new engineers need to know?

What to Steal for Your System

  • Tie every alert to a runbook and an owner—or delete it.
  • Measure health in business terms first; infrastructure metrics support the story.
  • Keep a backup vendor warm for your riskiest external dependencies.
  • Treat ops debt as sprint work, not backlog decoration.
  • Practice the scary stuff before it practices on you.

The playbook isn't clever. It's disciplined. That's the whole point.


Context → Decision → Outcome → Metric

  • Context: 20-year healthcare credentialing platform, multiple state integrations, regulated industry, team sizes ranging from 3 to 20 over the years.
  • Decision: Built and enforced an operating playbook: runbooks for every alert, business-level SLOs, canary deploys, ops debt as sprint work, backup vendors, daily reconciliations, transparent customer communication.
  • Outcome: Predictable incident response, zero burnout departures from on-call, sustained 99.9%+ uptime, contract renewals through difficult periods.
  • Metric: Mean time to resolve dropped from hours to minutes for documented scenarios; recurring incidents cut by 50% within one year of enforcing ops debt discipline.

Anecdote: The Three-Month On-Call Engineer

In 2016, our primary payment processor went dark at 11 p.m. on a Thursday. The on-call engineer had been on the team for three months. She'd never touched the payment code. She'd never seen a real payment incident.

But she'd done the drills. She'd shadowed two rotations. She'd read the runbooks.

When the alert fired, she opened the runbook—the one we'd written three years earlier and rehearsed quarterly ever since. Step one: confirm the processor is down. Step two: flip the breaker. Step three: enable the backup. Step four: queue retries. Step five: notify the ops channel.

Fourteen minutes later, the switch was done. Customers never noticed the outage. Payments queued during the downtime replayed automatically when the primary came back.

She went back to bed. The next morning, the team celebrated her in standup. Not because she was a hero—because the system worked exactly as designed.

That runbook took an afternoon to write. The drills took a few hours per quarter. The payoff was a junior engineer handling a critical incident alone, at night, without panic.

That's the playbook.

Mini Checklist: Operations Worth Stealing

  • [ ] Every alert has a runbook, an owner, and a reversal path
  • [ ] SLOs are defined in business terms and visible daily
  • [ ] High-risk changes go through canary before full rollout
  • [ ] Ops debt is scheduled as sprint work, not "someday" decoration
  • [ ] Backup vendors are integrated and tested, not just contracted
  • [ ] Daily reconciliations catch data drift before customers do
  • [ ] Incidents get RFOs with guardrails, not just "we fixed it"
  • [ ] On-call includes training, shadowing, and drills before going solo