
20 Years, 99.9% Uptime: What That Actually Took

The architecture decisions, operational discipline, and unglamorous work behind two decades of near-perfect uptime.

reliability · architecture · operations · uptime



Everyone loves to throw around uptime numbers.

“We’re at 99.9% uptime.” “Our SLA is five nines.”

Sounds great on a pitch deck.

Here’s the problem:
You will not find “99.9% uptime” written cleanly in any log file.

That number doesn’t live in:

  • One dashboard
  • One EC2 instance
  • One database

It lives in the sum of every ugly decision, every “do we cut this corner or not?”, every 2am “do we roll back or push forward?” call — over decades.

Conductor ran for roughly 20 years with what I’m comfortable calling 99.9%+ effective uptime for customers.

Not because the servers were magical.

Because the system was designed and run so that when something failed (and things always fail), the business didn’t.


1. The Numbers (As Honestly As You Can Talk About Them)

Let’s be blunt:

  • No, I don’t have a pristine, unified Grafana export from 2005–2025.
  • Yes, there were migrations, stack changes, hosting changes, and reporting changes over that time.
  • Logs rolled, dashboards changed, vendors came and went.

So how do I justify “99.9%+”?

Because over ~20 years:

  • Planned downtime was rare, scheduled, and communicated.

  • Unplanned downtime that actually impacted customers:

    • Was measured in hours per year, not days.
    • Typically affected subsets of functionality, not the entire platform.
    • Often had workarounds that kept customers operational even if a subsystem was unhappy.

99.9% uptime means a downtime budget of ~8.76 hours per year.

Over 20 years, that’s ~175 hours.
Spread that over thousands of business days, across multiple customers and time zones, with redundancy and fallback, and the reality is:

  • Most customers never experienced a hard platform outage in a way that derailed their business.
  • The system became “the thing that just works.”

That’s the real test:
If a platform runs for 20 years and your biggest “outage story” is “…uh, remember that one morning where X was slow for an hour?”, you did something right.
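The downtime budget above is easy to sanity-check yourself. A quick sketch of the arithmetic (ignoring leap days):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760

def downtime_budget_hours(uptime_pct: float, years: float = 1.0) -> float:
    """Hours of allowed downtime implied by an uptime percentage."""
    return HOURS_PER_YEAR * years * (1 - uptime_pct / 100)

print(downtime_budget_hours(99.9))        # ~8.76 hours/year
print(downtime_budget_hours(99.9, 20))    # ~175 hours over 20 years
print(downtime_budget_hours(99.999, 1))   # "five nines": ~5 minutes/year
```

Run the "five nines" case and you see why SLA claims deserve skepticism: that budget is a coffee break per year.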


2. The Architecture for Reliability

This wasn’t luck. It was a stack of deliberate architecture decisions made early and enforced ruthlessly.

Some of the key ones:

  1. Separation of concerns at the system level

    • Core transaction processing was isolated from:
      • Reporting
      • Admin tooling
      • Heavy background jobs
    • If reports got slow, transactions kept flowing.
    • If batch jobs clogged, front-line workflows stayed online.
  2. Stateless or “stateless-enough” application layer

    • App servers could be:
      • Restarted
      • Replaced
      • Scaled horizontally
    • No hidden state living in instance memory that would take the system down if one node died.
  3. Database design built for load and recovery, not just “it works”

    • Clear separation between:
      • Operational tables
      • Historical/archive tables
    • Indexed for real-world queries, not just “dev environment” fake data.
    • Backups and restore procedures that were actually tested, not just configured.
  4. Message queues and async processing where it mattered

    • Anything that didn’t need to be synchronous wasn’t.
    • That meant:
      • Less user-facing blocking
      • More ability to absorb spikes
      • Fewer “everything dies because one dependency is slow” chain reactions
  5. Graceful degradation patterns

    • If a downstream system was offline:
      • Data was queued
      • Users were warned, not blocked
      • The rest of the platform stayed available
    • The business could keep operating while specific integrations recovered.
  6. Boring, battle-tested technologies

    • No “cool new stack of the month” in the critical path.
    • Chosen based on:
      • Predictability
      • Operational maturity
      • Ease of monitoring and support
  7. Single-responsibility deployments

    • Changes could be rolled out per component instead of “all or nothing.”
    • If one piece misbehaved, you could roll back that piece without nuking the whole environment.

None of that is sexy in a conference talk. All of it is why the thing stayed up.
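Point 4, async where it matters, is one of the cheapest patterns on that list to implement. A minimal sketch (names are illustrative, not Conductor’s actual code), assuming a background worker drains work that doesn’t need to block the user:

```python
import queue
import threading

# Bounded queue: back-pressure instead of unbounded memory growth.
work_q = queue.Queue(maxsize=1000)
processed = []

def worker():
    """Drains the queue; the user-facing path never waits on this."""
    while True:
        item = work_q.get()
        processed.append(item)  # stand-in for the real slow work
        work_q.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_request(payload: str) -> str:
    """User-facing path: enqueue and return immediately."""
    try:
        work_q.put_nowait(payload)
        return "accepted"
    except queue.Full:
        # Degrade gracefully: tell the user, don't hang the request.
        return "busy, try again shortly"

print(handle_request("certify:123"))  # accepted
```

The bounded queue is the point: when the system is overwhelmed, users get a fast, honest answer instead of a hung request.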

Architecture diagram


3. The Operational Discipline

Architecture gets you potential reliability.
Operations is whether you cash it in or light it on fire.

Some of the habits that actually kept Conductor running:

  1. Deliberate deployment windows

    • No “cowboy deploys” at 4:55pm on a Friday.
    • Changes went out:
      • At predictable times
      • With a rollback plan
      • With someone who understood the blast radius on call
  2. Real checklists, not vibes

    • Pre-deploy:
      • What’s changing?
      • What could break?
      • How do we know if it’s going bad?
    • Post-deploy:
      • Specific metrics checked
      • Key workflows tested
      • Logs reviewed, not ignored
  3. Tight loop between dev and ops

    • The same people who wrote the critical pieces:
      • Saw the logs
      • Saw the alerts
      • Talked to customers when shit went sideways
    • That tends to focus the mind.
  4. Respect for maintenance

    • Indexes got tuned.
    • Old logs got archived.
    • Disks got watched.
    • SSL certs didn’t “surprise” anyone two hours before expiration.
  5. Zero tolerance for “we’ll fix it later” in core pathways

    • You can defer features.
    • You cannot defer:
      • Idempotency
      • Retry logic
      • Safe failure modes

4. Near-Misses (That Didn’t Become Headlines)

You don’t run something for 20 years without some “oh, shit” moments.

A couple of patterns:

Near-Miss 1: The Dependency That Fell Over

A critical third-party integration slowed to a crawl.

In a naive system, this would have:

  • Hung threads
  • Backed up requests
  • Taken down the entire API layer

Instead:

  • Calls were wrapped with timeouts and fallbacks
  • Work was queued instead of blocking users
  • Users got:
    • “We’ve received your request; processing may be delayed.”
  • The system:
    • Kept serving other traffic
    • Flushed the backlog once the dependency recovered

In a less well-architected system, that’s a full outage.
Here, it was an ugly graph and a minor customer notice.
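The pattern behind that near-miss is a timeout wrapped in a simple circuit breaker. A minimal sketch (thresholds and names are illustrative, not the actual implementation):

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; while open, fail fast to a fallback
    instead of hammering (and waiting on) a sick dependency."""
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened

    def call(self, fn, fallback):
        # While open, skip the dependency entirely until the reset window passes.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()

breaker = CircuitBreaker()

def flaky_dependency():
    raise TimeoutError("upstream hung")

def queue_for_later():
    return "We’ve received your request; processing may be delayed."

for _ in range(4):
    print(breaker.call(flaky_dependency, queue_for_later))
```

Every call returns the friendly message instead of a stack trace, and once the breaker is open, the dependency isn’t even touched until the reset window expires.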

Near-Miss 2: The Database Under Siege

At one point, reporting usage spiked hard.

  • Everyone wanted big, complex queries.
  • A naive design would let that stomp all over OLTP operations.

Instead:

  • Heavy reporting loads were segregated:
    • Read replicas / reporting-specific patterns
    • Throttling on expensive, ad-hoc queries
  • Operational workload:
    • Stayed within latency targets
    • Didn’t suddenly fall over because someone ran a “fun” 10-join monstrosity

No late-night scramble.
Just a busy graph, some tuning, and business as usual.
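Throttling expensive ad-hoc queries can be as simple as a bounded semaphore in front of the reporting path. A sketch, assuming all heavy reports funnel through one choke point (a simplification of what a real system would do):

```python
import threading

# At most 2 heavy report queries run concurrently; the rest are refused
# (or could be queued) instead of stomping on OLTP traffic.
report_slots = threading.BoundedSemaphore(2)

def run_report(query_fn):
    """Run a heavy query only if a slot is free; never block OLTP."""
    if not report_slots.acquire(blocking=False):
        return None  # caller shows "reports are busy, try again shortly"
    try:
        return query_fn()
    finally:
        report_slots.release()
```

The important property: a "fun" 10-join monstrosity can only ever consume a bounded slice of capacity, so the operational workload keeps its latency budget.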


5. The 3AM Pages That Never Happened

You can measure bad architecture in:

  • Pager volume
  • Tired engineers
  • Weekend fire drills

The quiet miracle of Conductor’s uptime is what didn’t happen:

  • No “every Saturday night batch job brings the system to its knees.”
  • No “one noisy customer kills performance for everyone.”
  • No “small code change bricks the entire operation because the system is insanely coupled.”

Those non-events were the payoff of:

  • Separation of concerns
  • Thoughtful capacity planning
  • Clear boundaries between components
  • A refusal to build “clever” but fragile hacks into the critical path

Clients never see this.

They only see:

  • "It works."
  • "It's always up when we need it."
  • "We don't think about it."

Which is the whole point.

The non-events


6. Monitoring and Alerting (What We Actually Watched)

You can’t run at 99.9%+ for 20 years on vibes alone.

Things we actually monitored:

  1. Core health

    • HTTP 5xx rates
    • Request latency distributions
    • Queue depths for async jobs
  2. Database

    • Connection counts
    • Slow query logs
    • Replication lag (if applicable)
    • Disk space and IOPS
  3. Key business transactions

    • “Did this type of transaction complete successfully?”
    • “Are we seeing abnormal failure rates or drop-offs in this workflow?”
  4. Infrastructure basics

    • CPU, memory, disk, network saturation
    • Node availability
    • SSL cert expiration
    • Backup success/failure
  5. Alerting rules

    • Alerts fired on:
      • Sustained 5xx above a low threshold
      • Latency spikes beyond agreed SLOs
      • Queue backlog above defined thresholds
      • DB metrics outside of safe ranges
    • Thresholds were tuned over time to avoid:
      • Noise
      • Pager fatigue
      • “Everything is red all the time”

Response expectation:

  • During business hours: minutes
  • Off-hours: on-call escalation, not “whenever someone notices”

Again: not glamorous. Just disciplined.
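"Sustained" is the operative word in those alerting rules: page on a window of bad samples, not a single bad scrape. A sketch of that evaluation logic (thresholds are illustrative):

```python
from collections import deque

class SustainedAlert:
    """Fires only when the error rate exceeds the threshold for every
    sample in the window, so one noisy scrape doesn't page anyone."""
    def __init__(self, threshold: float = 0.01, window: int = 3):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def observe(self, error_rate: float) -> bool:
        self.samples.append(error_rate)
        full = len(self.samples) == self.samples.maxlen
        return full and all(r > self.threshold for r in self.samples)

alert = SustainedAlert(threshold=0.01, window=3)
print(alert.observe(0.05))  # False: window not full yet
print(alert.observe(0.05))  # False
print(alert.observe(0.05))  # True: sustained breach, page someone
```

Tuning `threshold` and `window` over time is exactly the anti-pager-fatigue work described above.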


7. The Unsexy Truth: Real Reliability Is Invisible

99.9% uptime is a weird thing to brag about.

When you do it right:

  • No one thanks you.
  • No one writes case studies about it.
  • It doesn’t show up in the marketing deck.

Instead, you get:

  • “Oh yeah, that system? It just works. We don’t think about it.”

Which is both:

  • The ultimate compliment
  • And the fastest way to have your contribution ignored

Conductor processed serious volume and ran critical workflows.
And because it stayed up:

  • People assumed it was “easy”
  • Leadership focused on whatever was on fire elsewhere
  • The economics of that reliability were largely invisible

But “invisible” doesn’t mean “accidental.”


8. The Cost of Reliability (And Was It Worth It?)

What did 20 years of 99.9%+ actually cost?

  • Engineering time spent on:

    • Better design instead of hacks
    • Backward compatibility
    • “Do it right” instead of “ship it and run”
  • Infrastructure cost:

    • Extra capacity headroom
    • Redundancy where it mattered
    • Monitoring and backup tooling
  • Discipline cost:

    • A slower “yes” on risky shortcuts
    • Saying “no” or “not like that” to feature requests that would undermine stability
    • Boring maintenance work that never gets applause

Would I do it again?

Yes.

If you’re running a platform processing $100M/year and supporting mission-critical operations, reliability isn’t “nice to have.”

It’s part of the product.

The right question isn’t:

“Can we afford to invest in this level of uptime?”

It’s:

“Can we afford the fallout of not doing it?”


9. When It DID Go Down

Was it perfect? Of course not.

There were outages. When they happened, they had a few things in common:

  • Root causes tended to be:

    • External provider issues
    • Network incidents
    • Rare, complex edge cases that made it past testing
  • Duration:

    • Typically measured in minutes to low hours, not days
    • Contained and triaged with clear owners

How they were handled:

  1. Own it quickly

    • Acknowledge impact internally and externally.
    • Don’t bullshit or downplay it.
  2. Stabilize first, diagnose second

    • Get the system back into a safe, working state.
    • Then dig into deep root cause.
  3. Make the fix structural

    • If something took you down once, you fix it so:
      • It either can’t happen again
      • Or if it does, the impact is much smaller
  4. Capture the learning

    • What signal did we miss?
    • What can we monitor next time?
    • Where can we add guardrails?

Outages aren’t the story.
How you respond and evolve is.


10. Why It Matters (The Business Case)

Let’s talk money.

If you’re processing $100M/year through a platform like Conductor, what does downtime actually cost?

Some rough math:

  • $100M/year ≈ ~$8.3M/month
  • ≈ ~$275K/day
  • ≈ ~$11.4K/hour (on average; in reality, some hours are way more valuable)
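The rough math, made explicit (illustrative averages only; real revenue is lumpy):

```python
annual_revenue = 100_000_000

per_month = annual_revenue / 12          # ~$8.3M
per_day = annual_revenue / 365           # ~$274K
per_hour = annual_revenue / (365 * 24)   # ~$11.4K

# A 4-hour outage, in flow-through alone (before trust and churn costs):
outage_cost = 4 * per_hour
print(f"${per_hour:,.0f}/hour, ${outage_cost:,.0f} for a 4-hour outage")
```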

A 4-hour outage during peak time isn’t just:

  • Lost transactions
  • Scrambled staff
  • Manual rework

It’s also:

  • Lost trust
  • Contract risk
  • “We need to evaluate alternatives” conversations

Now spread that across:

  • Multiple customers
  • Renewals
  • Growth opportunities

The real cost of crappy reliability is:

  • Lost renewals
  • Churn
  • Deals that never even make it to RFP because someone says:

    “Yeah, we heard they go down a lot.”

Conductor’s uptime:

  • Protected renewals
  • Protected reputation
  • Protected the $100M/year flow that depended on it

You don’t get a line item on the P&L that says:

“Revenue preserved by good architecture: $X”

But it’s there.


Closing Thought

“99.9% uptime over 20 years” doesn’t come from:

  • Heroic debugging at 3am
  • One brilliant dev
  • A cool framework

It comes from:

  • Boring, thoughtful architecture
  • Relentless operational discipline
  • Saying “no” to shortcuts that would feel good this quarter and hurt you for the next ten years

If you’re running anything that matters — money, health, operations, compliance — stop treating reliability as a buzzword.

It’s not a marketing metric.

It’s a design choice.

And if you don’t make that choice on purpose, you’re still making a choice.

You’re just betting your business on luck.


Context → Decision → Outcome → Metric

  • Context: 20-year credentialing platform handling $100M/year, 1M+ certifications, 2TB+ database, 15+ integrations.
  • Decision: Treat reliability as a product feature: versioned adapters, idempotent writes, expand/contract migrations, runbooks for every alert, and strict change windows with canaries.
  • Outcome: 99.9%+ uptime over two decades, zero data-loss incidents, and zero contract losses attributed to downtime.
  • Metric: Four recorded customer-visible outages in five years, each <90 minutes; renewal rate >95%; incident MTTD in minutes, MTTR in low hours.

Anecdote: The Night the State API Died

One of the largest state APIs hung for 40 minutes during peak scheduling. Old me might have watched threads starve. Instead, the breaker flipped, cached “last known good” availability appeared with a banner, and retries queued with idempotency keys. Customers kept booking with stale-but-safe slots. When the API came back, reconciliations replayed automatically and no double-bookings occurred. The incident retro took 30 minutes. That breaker paid for itself in one night.

Mini Checklist: Shipping Reliability on Purpose

  • Require idempotency keys on all external writes; retry with backoff + jitter.
  • Separate schema changes from code (expand → backfill → contract) so rollbacks are real.
  • Add breakers and back-pressure before you need them; show users a clear message instead of timing out.
  • Map alerts to runbooks and owners; delete orphaned alerts.
  • Measure in business terms (scores posted, vouchers reconciled) not just CPU/latency.
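The first checklist item, in code form: a sketch that assumes the remote API accepts an idempotency key as a keyword argument (the parameter name is illustrative).

```python
import random
import time
import uuid

def call_with_retry(send, payload, max_attempts=5, base_delay=0.5):
    """Retry an external write with exponential backoff + full jitter.
    The SAME idempotency key is reused on every attempt, so a retry
    after an ambiguous failure can't double-apply the write."""
    idempotency_key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return send(payload, idempotency_key=idempotency_key)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: random delay in [0, base * 2^attempt]
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The jitter matters as much as the backoff: without it, a fleet of clients retries in lockstep and re-creates the spike that caused the failure.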