20-Year Systems, Part 1: Design Constraints That Age Well

The design constraints that prevented rewrites and kept a healthcare-scale platform running for two decades.

Tags: architecture, longevity, constraints, enterprise


Most software dies young. Rewrites. Acquisitions. Abandonment. The median lifespan of a production system is probably five years if you're generous.

Conductor ran for twenty.

Not because we got lucky. Not because we never changed anything. It ran that long because we chose constraints for survival, not novelty. Every decision was filtered through one question:

"Will this still make sense in ten years?"

Most of the time, the answer was "probably not." So we picked differently than the industry recommended.

Here are the constraints that aged well—and why.


Constraint 1: Interfaces as Contracts, Not Convenience

Every external dependency eventually betrays you.

State databases change APIs without warning. Payment processors deprecate endpoints. Exam providers get acquired and sunset their platforms. If you fuse these dependencies directly into your business logic, you're signing up for yearly rewrites.

The pattern we chose:

Every integration lived behind a contract-first adapter—a boundary we called a "seam." The seam defined:

  • What data goes in (versioned schema)
  • What data comes out (versioned schema)
  • What errors look like (typed, not just HTTP codes)

The core domain never knew whether we were talking to SOAP, REST, flat files, or carrier pigeons. It just called the adapter.
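The seam idea can be sketched in a few lines of Python. This is illustrative, not the platform's actual code: names like `CredentialQuery`, `CredentialRecord`, and `StateCredentialAdapter` are assumptions. The point is that the core depends only on the contract, while transport details live in concrete adapters.

```python
from dataclasses import dataclass
from typing import Protocol

SCHEMA_VERSION = "v2"  # versioned contract; bumped only with a migration plan


@dataclass(frozen=True)
class CredentialQuery:  # what goes in (versioned schema)
    schema: str
    practitioner_id: str


@dataclass(frozen=True)
class CredentialRecord:  # what comes out (versioned schema)
    schema: str
    practitioner_id: str
    status: str


class AdapterError(Exception):  # typed errors, not just HTTP codes
    """Base class for all seam-level failures."""


class StateCredentialAdapter(Protocol):
    """The seam: the only thing the core domain ever sees."""
    def fetch(self, query: CredentialQuery) -> CredentialRecord: ...


class InMemoryAdapter:
    """Stand-in transport; a real adapter would wrap SOAP, REST, or flat files."""
    def __init__(self, records: dict):
        self._records = records

    def fetch(self, query: CredentialQuery) -> CredentialRecord:
        if query.practitioner_id not in self._records:
            raise AdapterError(f"unknown practitioner {query.practitioner_id}")
        return CredentialRecord(SCHEMA_VERSION, query.practitioner_id,
                                self._records[query.practitioner_id])


def current_status(adapter: StateCredentialAdapter, practitioner_id: str) -> str:
    """Core domain logic: knows only the contract, never the transport."""
    record = adapter.fetch(CredentialQuery(SCHEMA_VERSION, practitioner_id))
    return record.status
```

Swapping vendors means writing a new class that satisfies `StateCredentialAdapter`; `current_status` and everything above it never changes.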

The payoff:

Over twenty years, we swapped adapters for five different state credentialing systems. Each time, the business rules—the reconciliation logic, the approval workflows, the reporting—stayed untouched. The swap took weeks, not quarters. No heroics. No death marches. Just methodical adapter replacement while the core hummed along.

One state moved from nightly batch files (FTP, yes, FTP) to a real-time API in 2019. We wrote a new adapter, ran it in shadow mode for a month, cut over on a Tuesday, and had zero customer-facing incidents. The seam earned its keep that day.


Constraint 2: Idempotency Everywhere

When you process millions of transactions, retries are not an edge case. They're a certainty.

Networks flake. Timeouts happen at the worst moment. Support staff hit "retry" because the customer is on the phone yelling. Payment processors return ambiguous errors that could mean "failed" or "maybe succeeded, who knows."

If your writes aren't idempotent—meaning running them twice produces the same result as running them once—you will eventually have:

  • Double charges
  • Duplicate certifications
  • Phantom records that appear and disappear
  • Support tickets that take hours to untangle

The pattern we chose:

Every externally facing write had an idempotency key—a deterministic identifier (usually a hash of the entity ID, action, and timestamp) that the system checked before doing anything.

At the database level:

  • Unique constraints on operation tables that rejected duplicates at the source
  • Stored procedures that short-circuited duplicates before touching data
  • "Already processed" responses that returned just as fast as "processing now"

For batch jobs: payload hashes stored with each run. If a job saw the same payload twice, it logged a skip and moved on. No guessing. No manual deduplication.
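A minimal sketch of the pattern, assuming a SQLite-backed operations table (table and function names are illustrative). The unique constraint, not application memory, is what rejects duplicates.

```python
import hashlib
import sqlite3

def idempotency_key(entity_id: str, action: str, timestamp: str) -> str:
    """Deterministic key: the same logical operation always hashes the same."""
    return hashlib.sha256(f"{entity_id}|{action}|{timestamp}".encode()).hexdigest()

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE operations (
        idem_key  TEXT PRIMARY KEY,  -- unique constraint rejects duplicates at the source
        entity_id TEXT NOT NULL,
        action    TEXT NOT NULL
    )
""")

def charge(entity_id: str, action: str, timestamp: str) -> str:
    key = idempotency_key(entity_id, action, timestamp)
    try:
        conn.execute("INSERT INTO operations VALUES (?, ?, ?)",
                     (key, entity_id, action))
        conn.commit()
        return "processing now"     # first time: do the real work here
    except sqlite3.IntegrityError:
        return "already processed"  # retry: short-circuit, no side effects
```

A retry of the same logical operation hits the constraint and returns "already processed" just as fast as the first attempt returned "processing now".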

The payoff:

In 2017, a payment processor had a 40-minute outage during peak hours. Our retry logic kicked in, queueing up thousands of payment attempts. When the processor came back, all those retries hit at once.

Without idempotency, we would have double-charged hundreds of customers. Support would have spent the next week issuing refunds. Instead, the idempotency guards caught every duplicate. Total customer impact: zero.

Support tickets for "double charges" dropped 90% after we enforced idempotency everywhere. Ops started trusting retries instead of fearing them.


Constraint 3: State Light, Logs Heavy

Auditors, regulators, and angry customers all ask the same question:

"Show me what happened and when."

You cannot answer that question from mutated rows. If your system updates a record in place, you've destroyed history. You've turned debugging into archaeology.

The pattern we chose:

Append-only logs for every critical event: payments, scoring, scheduling, credential status changes. Each log entry captured:

  • Actor: Who or what triggered it (user, system job, API call)
  • Payload hash: What the input looked like, cryptographically verifiable
  • Correlation ID: How to trace it across systems when things go sideways
  • Timestamp: When it happened, not when it was recorded—critical distinction for audits

Mutable tables existed for current state—"what is this user's status right now?" Logs existed for truth—"how did they get there?"
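The log-entry shape described above can be sketched as follows (class names are illustrative, and a real store would be a database table, not a Python list):

```python
import hashlib
import json
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class LogEntry:
    actor: str           # who or what triggered it
    payload_hash: str    # cryptographically verifiable input
    correlation_id: str  # traces the event across systems
    occurred_at: float   # when it happened, not when it was recorded
    event: str

class AuditLog:
    """Append-only: entries can be added and read, never updated or deleted."""
    def __init__(self):
        self._entries = []

    def append(self, actor, event, payload, correlation_id, occurred_at=None):
        # hash a canonical serialization so the input is verifiable later
        payload_hash = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()).hexdigest()
        self._entries.append(LogEntry(actor, payload_hash, correlation_id,
                                      occurred_at or time.time(), event))

    def timeline(self, correlation_id):
        """'Show me what happened and when' is a filter, not archaeology."""
        return [e for e in self._entries if e.correlation_id == correlation_id]
```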

The payoff:

In 2014, a state health department subpoenaed records for a fraud investigation. They wanted a complete timeline of credential changes for 200+ practitioners over three years.

A normal system would have triggered a war room. Ours took four hours. We queried the append-only logs, filtered by practitioner IDs, exported to a lawyer-friendly format, and sent it over.

That one feature—logs heavy—probably saved us $100K in emergency consulting and preserved a $2M contract renewal.


Constraint 4: Hard Limits and Back-Pressure

Unlimited queues are a lie you tell yourself.

"We'll handle whatever load comes in." No, you won't. You'll handle it until you don't, and then you'll collapse silently. Systems with no limits fail slowly and then all at once.

The pattern we chose:

Explicit limits everywhere:

  • Queue depth per tenant: No single customer could starve others, even accidentally
  • Concurrent jobs per worker pool: Predictable resource consumption, no surprise OOMs
  • API rate limits with user-facing explanations: Not just "429 Too Many Requests" but "Rate limited. You've used 1000/1000 requests this hour. Resets at 3:00 PM."

When limits hit, the system failed fast. Not with a cryptic timeout, but with a clear message: "We're busy. Retry in 60 seconds." Back-pressure signals fed dashboards so ops knew before customers did.
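A per-tenant depth limit with a fail-fast, user-facing rejection can be sketched like this (a hypothetical in-memory version; the real thing would sit in front of a durable queue):

```python
from collections import defaultdict

class TenantQueue:
    """Explicit per-tenant limit: no single customer can starve the others."""
    def __init__(self, max_depth_per_tenant: int):
        self.max_depth = max_depth_per_tenant
        self.queues = defaultdict(list)

    def enqueue(self, tenant: str, job) -> tuple:
        if len(self.queues[tenant]) >= self.max_depth:
            # back-pressure: reject with guidance instead of queueing forever
            return False, "We're busy. Retry in 60 seconds."
        self.queues[tenant].append(job)
        return True, "accepted"
```

When one tenant's traffic spikes 400%, its enqueues start returning the "busy" message while every other tenant's queue keeps accepting work.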

The payoff:

One November, a major exam provider ran a promotional campaign without telling us. Registration volume spiked 400% in two hours.

The queue depth limits caught it immediately. Users saw "busy" messages. The dashboard lit up. Ops added capacity within 30 minutes. No data loss. No 2 a.m. pages. No angry calls from the exam provider blaming us for their surprise traffic.

That incident became our standard example of why limits aren't restrictions—they're protection.


Constraint 5: Schemas Before Features

Most data pain is self-inflicted.

Nullable fields without meaning. Polymorphic columns that could be anything. "JSON blob now, figure it out later" as a design pattern. These choices feel expedient in the moment and cost you for years.

The pattern we chose:

Every table had a lifecycle:

  1. Draft schema with explicit intent
  2. Data contract review (what does each field mean?)
  3. Migration with guards (what happens to existing data?)
  4. Backfill plan (how do we populate historical records?)
  5. Monitoring for anomalies (are we seeing unexpected values?)

JSON columns only when the shape legitimately varied—user preferences, plugin configurations. Never for core business data.
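Schema-encoded intent can be as simple as a CHECK constraint. A sketch using SQLite (table and column names are illustrative): the schema itself rejects values with no declared meaning, so intent is enforced at the source rather than in scattered application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE credentials (
        practitioner_id TEXT NOT NULL,
        state           TEXT NOT NULL,
        -- explicit enum: supporting a new status is an additive migration,
        -- not a guessing game over what a nullable column might mean
        status          TEXT NOT NULL
                        CHECK (status IN ('pending', 'active', 'suspended', 'revoked'))
    )
""")

def insert_credential(practitioner_id, state, status):
    try:
        conn.execute("INSERT INTO credentials VALUES (?, ?, ?)",
                     (practitioner_id, state, status))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False  # the schema rejected a value with no declared meaning
```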

The payoff:

A decade in, we needed to add support for three new states with slightly different credentialing requirements. In a system with loose schemas, that's a rewrite. In ours, it was additive work: new enum values, new validation rules, new adapters. The core tables didn't change.

Reporting stayed fast because the schema encoded intent. Analysts could write queries without first asking engineers, "What does this column actually mean?"


Constraint 6: One Way to Do Each Hard Thing

Multiple patterns for the same problem guarantee drift and hidden bugs.

If you have three ways to send emails, you have three sets of bugs, three monitoring gaps, and three places where the next engineer will make a mistake. If you have two job queues, you'll spend every incident asking, "Which queue is this on?"

The pattern we chose:

One mailer. One scheduler. One job queue. One background worker pattern.

New integrations had to fit the existing shape or justify divergence with a written proposal. "This vendor's SDK wants us to do it differently" was never sufficient justification.

The payoff:

Onboarding took days instead of weeks. New engineers found fewer traps. Ops had fewer dials to watch. When something went wrong with email delivery, we looked at one system, not three.

Performance tuning was concentrated. When we optimized the job queue, every job got faster. When we added better monitoring to the mailer, every notification improved.


Constraint 7: Operability as a Feature

Uptime is a promise. If the system is opaque to operators, uptime is luck.

The pattern we chose:

Every feature shipped with:

  • Logs with correlation IDs: Trace a request across all services without grep gymnastics
  • Metrics with service-level indicators: Is this feature healthy? One glance tells you
  • A dashboard: What does "normal" look like? Because you can't spot anomalies without a baseline
  • A runbook: When it breaks at 3 a.m., what exactly do we do? Step by step

"How will we know it's broken?" was part of every acceptance checklist. If you couldn't answer that question, the feature wasn't done.
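One piece of that checklist, correlation IDs in logs, can be sketched with Python's standard `logging` and `contextvars` modules (the handler here captures lines in memory for illustration; a real system would ship them to a log store):

```python
import contextvars
import logging

# one correlation ID per request, carried implicitly through the call stack
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Stamps every log record with the current request's correlation ID."""
    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

class CapturingHandler(logging.Handler):
    """Collects formatted lines in memory, standing in for a log pipeline."""
    def __init__(self):
        super().__init__()
        self.lines = []
    def emit(self, record):
        self.lines.append(self.format(record))

handler = CapturingHandler()
handler.setFormatter(logging.Formatter("%(correlation_id)s %(levelname)s %(message)s"))
handler.addFilter(CorrelationFilter())
log = logging.getLogger("conductor")
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_request(req_id: str):
    correlation_id.set(req_id)
    # every log line in this request now carries req_id automatically
    log.info("verification started")
    log.info("verification finished")
```

Tracing a request across services becomes a search for one ID, not grep gymnastics.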

The payoff:

Incidents became procedures instead of heroics. Mean time to recovery shrank because signals were built-in, not bolted on after an outage taught us we needed them.

In 2020, a credential verification feature started throwing errors at 3 a.m. The on-call engineer had never touched that code. But the runbook existed. She followed the steps, restarted the affected service, escalated with full context, and went back to bed. Total customer impact: 12 minutes.


Constraint 8: No Hidden State in Jobs

Long-running jobs accumulate secrets.

They start with assumptions about ordering. They grow caches that aren't in the payload. They develop hidden dependencies on data shapes that weren't documented. Eventually, they become irreproducible—they work in production but not in staging, and no one knows why.

The pattern we chose:

Jobs were pure functions over explicit inputs. They:

  • Fetched fresh state at the start: No stale caches, no "well it worked yesterday"
  • Validated that state against explicit expectations: If something looked wrong, fail early with context
  • Wrote explicit outputs with audit trails: Every side effect documented, traceable, reversible

If a job needed context, the context was in the payload. No shared caches. No implicit dependencies on other jobs running first.
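The job-as-pure-function idea, sketched with illustrative names (`recertify_batch` and its payload fields are assumptions, not the platform's actual jobs):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobResult:
    """Explicit output with an audit trail: every side effect is documented."""
    job_id: str
    processed: list
    audit: list

def recertify_batch(payload: dict, fetch_state) -> JobResult:
    """A pure function over explicit inputs: all context is in the payload,
    fresh state is fetched at the start, and validation fails early."""
    ids = payload["practitioner_ids"]
    state = fetch_state(ids)                 # fresh state, no stale caches
    missing = [p for p in ids if p not in state]
    if missing:
        # fail early with context instead of producing partial, silent output
        raise ValueError(f"unknown practitioners: {missing}")
    processed = [p for p in ids if state[p] == "eligible"]
    audit = [f"{payload['job_id']}: {p} -> recertified" for p in processed]
    return JobResult(payload["job_id"], processed, audit)
```

Because nothing depends on caches or job ordering, replaying five-year-old payloads produces the same results as the original runs.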

The payoff:

In 2018, an audit required us to reprocess five years of credentialing history—millions of records. The jobs ran exactly the same as day one. No "works only in production" surprises. No hidden state to reconstruct.

That audit passed without custom tooling, without war rooms, without anyone staying late.


Why These Constraints Mattered

Twenty years is a long time. Technologies changed. Team members left. Business requirements evolved. Vendors came and went.

The constraints held because they optimized for the right things:

  • Replaceability over convenience: Swapping vendors is painful—two weeks of focused work. Rewriting core business logic? That's a year of your life you're not getting back.
  • Auditability over speed: The extra logging cost microseconds per request. It saved weeks per audit and untold legal fees per subpoena.
  • Predictability over flexibility: One way to do things feels limiting until you're debugging at 2 a.m. and grateful there's only one damn place to look.

If you're building something you hope will last, steal these constraints. Not because they're clever—they're not. They're boring. They're discipline.

That's the point.


Context → Decision → Outcome → Metric

  • Context: 20-year healthcare credentialing platform, $100M+ annual transaction volume, 15+ external integrations, regulated industry with audit requirements.
  • Decision: Chose constraints for survival (seams, idempotency, heavy logging, hard limits, schema discipline, single patterns, operability, pure jobs) over convenience or novelty.
  • Outcome: Zero rewrites in 20 years, adapters swapped for 5+ major vendors without core changes, passed multiple regulatory audits without war rooms.
  • Metric: 99.9%+ uptime, 90% reduction in duplicate-related support tickets, audit response time measured in hours not weeks.

Anecdote: The Seam That Saved a Contract

One state moved from nightly batch files to a real-time API with 60 days' notice—during our busiest season. The old integration used FTP pulls of fixed-width files. The new one was REST with OAuth and webhooks.

Because we'd built the seam correctly—versioned contracts, adapter isolation—we wrote the new adapter in parallel. Ran it in shadow mode for three weeks. Compared outputs. Cut over on a quiet Tuesday.

Zero downtime. Zero customer notices. The state administrator said, "That was the smoothest vendor transition we've ever seen." They renewed for another five years.

That seam cost us an extra week of work upfront. It saved us a year of emergency remediation.

Mini Checklist: Constraints Worth Stealing

  • [ ] Every external integration lives behind a contract-first adapter (seam)
  • [ ] Every write has an idempotency key; retries are safe
  • [ ] Critical events are append-only logs, not just mutated rows
  • [ ] Queues have explicit limits; failures are user-facing, not silent
  • [ ] Schemas have lifecycles; JSON blobs are exceptions, not defaults
  • [ ] One blessed pattern per cross-cutting concern (email, jobs, auth)
  • [ ] Features ship with logs, metrics, dashboards, and runbooks
  • [ ] Jobs are pure functions; all context is explicit in the payload