
Enterprise Integrations Without Breaking Everything

How to integrate state databases, payment processors, and legacy partners without turning every change into an outage.

Tags: integration, reliability, architecture, enterprise


Integrations are where enterprise systems go to die.

Over twenty years building Conductor, we connected to 15+ external systems: state credentialing databases, payment processors, exam providers, background check services, and various legacy partners with APIs ranging from "well-documented REST" to "FTP a CSV at midnight and pray."

Every integration was a potential source of outages, data corruption, and 3 a.m. pages. Most systems treat integrations as plumbing—pipes that move data between systems. That's the mistake.

Integrations are products. They need owners, contracts, testing, and operational playbooks just like any user-facing feature. Treat them as plumbing, and every change becomes a potential disaster. Treat them as products, and changes become manageable.

Here's how we made integrations survivable.


Treat Integrations as Products, Not Plumbing

The first mindset shift: integrations are products that happen to connect to external systems.

Contracts first:

Every integration has defined contracts:

  • Request schema: What data goes out, in what format, with what validation rules
  • Response schema: What data comes back, including error shapes
  • Error taxonomy: Not just HTTP codes, but business-level errors (invalid credentials, rate limited, temporarily unavailable)
  • Versioning: Every contract has a version number; breaking changes require new versions
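A minimal sketch of what such a contract might look like, assuming Python; the class and field names (`RegistrationRequestV2`, `candidate_id`, `exam_code`) are hypothetical, not the actual Conductor schema:

```python
from dataclasses import dataclass
from enum import Enum

class PartnerError(Enum):
    """Business-level error taxonomy, not just HTTP codes."""
    INVALID_CREDENTIALS = "invalid_credentials"
    RATE_LIMITED = "rate_limited"
    TEMPORARILY_UNAVAILABLE = "temporarily_unavailable"

@dataclass(frozen=True)
class RegistrationRequestV2:
    """Versioned request contract; breaking changes go in a V3, never here."""
    CONTRACT_VERSION = 2  # plain class attribute, not a dataclass field

    candidate_id: str
    exam_code: str

    def validate(self) -> list[str]:
        """Return human-readable validation failures (empty list = valid)."""
        errors = []
        if not self.candidate_id:
            errors.append("candidate_id is required")
        if len(self.exam_code) != 6:
            errors.append("exam_code must be 6 characters")
        return errors
```

The point is that the schema, validation rules, error taxonomy, and version number all live in one place, so the adapter is the single seam that has to change when a partner does.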

Code that bypasses adapters doesn't merge. Direct calls to partner APIs are banned. Everything goes through the contract layer.

Operational owner:

Every integration has a named owner—not "the team," a specific person. They're responsible for:

  • SLOs (Service Level Objectives—the promises we make about performance)
  • Runbooks for when things break
  • Test harnesses that run in CI
  • Relationship with the partner's technical contact

When an integration breaks at 3 a.m., someone is accountable. When a partner announces changes, someone is responsible for adapting.

The payoff:

Changes roll through one seam instead of rippling across the codebase. When the state credentialing API changed from SOAP to REST, we wrote a new adapter. The rest of the system—the business logic, the reporting, the user interfaces—didn't know or care.


Idempotency and Replay as First Principles

Networks fail. Partners flake. Users hit retry. If your integration writes aren't idempotent—meaning running them twice produces the same result as running them once—you will eventually create duplicates.

Idempotency keys:

Every write to partners carries a deterministic key: a hash of the operation, the entity, and the event's own timestamp (not the attempt time, or each retry would generate a different key). If the partner sees the same key twice, it treats the second request as a duplicate.

This isn't optional. It's not "nice to have for edge cases." It's required for every write operation to every external system.
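A minimal sketch of deterministic key generation, assuming Python's standard `hashlib`; the function name and inputs are illustrative, not the actual implementation:

```python
import hashlib

def idempotency_key(operation: str, entity_id: str, occurred_at: str) -> str:
    """Deterministic key: the same business event always hashes to the same
    key, so a retry carries the same key as the original attempt.
    occurred_at is the event's own timestamp, never the send time."""
    raw = f"{operation}:{entity_id}:{occurred_at}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Because the key is derived purely from the operation's identity, a queued retry fired hours later still carries the same key, which is exactly what lets the partner deduplicate.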

Replayable jobs:

Every integration job stores:

  • The original payload (what we tried to send)
  • The response (what we got back)
  • The outcome (success, failure, retry needed)

Replays are a button, not a script. Support can trigger a replay from a dashboard. Engineers don't need to craft SQL to fix data.
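The stored job plus a replay function might look like this minimal sketch (assuming Python; `IntegrationJob` and `replay` are hypothetical names, and the dashboard button would simply call `replay`):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class IntegrationJob:
    """Stored per job: enough to replay it later from a dashboard button."""
    job_id: str
    payload: dict                    # what we tried to send
    response: Optional[dict] = None  # what we got back
    outcome: str = "pending"         # pending | success | failure | retry

def replay(job: IntegrationJob, send: Callable[[dict], dict]) -> IntegrationJob:
    """Re-send the stored payload through the adapter. The idempotency key
    already inside the payload makes a double-send harmless."""
    try:
        job.response = send(job.payload)
        job.outcome = "success"
    except Exception:
        job.outcome = "retry"
    return job
```

Because the original payload is stored verbatim, a replay is byte-for-byte what was sent the first time; combined with idempotency keys, replaying is always safe.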

The payoff:

In 2017, a payment processor had a 40-minute outage. Our retry logic queued thousands of attempts. When the processor recovered, all those retries hit at once.

Because every payment had an idempotency key, zero duplicates. The processor deduplicated automatically. Support didn't spend the next week issuing refunds.

Support can fix issues without engineers manually crafting SQL. That's not just convenience—it's speed. Issues that used to take hours resolve in minutes.


Quarantine Bad Inputs, Don't Trust Them

External data is hostile. It doesn't follow your rules, and it changes without warning.

Validation at the edge:

Before external data touches production tables, it goes through validation:

  • Does it match the expected schema?
  • Are required fields present?
  • Are values within expected ranges?
  • Is this a known entity type?

Unknown fields or invalid states go to quarantine—a separate table or queue where they wait for human review. They don't corrupt production data. They don't break reports. They sit safely until someone figures out what they are.
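A minimal sketch of the triage step, assuming Python; the field names and status values are hypothetical placeholders for a real partner schema:

```python
# Hypothetical expected shape for one partner's credential records.
EXPECTED_FIELDS = {"credential_id", "holder_name", "status"}
KNOWN_STATUSES = {"active", "expired", "revoked"}

def triage(record: dict) -> tuple[str, dict]:
    """Route one inbound record: ("accept", record) goes to production,
    ("quarantine", record-with-reason) waits for human review."""
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        return "quarantine", {**record, "_reason": f"missing fields: {sorted(missing)}"}
    unknown = record.keys() - EXPECTED_FIELDS
    if unknown:
        return "quarantine", {**record, "_reason": f"unknown fields: {sorted(unknown)}"}
    if record["status"] not in KNOWN_STATUSES:
        return "quarantine", {**record, "_reason": f"unknown status: {record['status']}"}
    return "accept", record
```

Attaching the reason code at quarantine time is what makes the later human review fast: ops sees *why* each record was held, not just that it was.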

Contract tests:

Sample payloads from partners become test fixtures. Every build runs these fixtures through the adapter. If the partner's schema drifts—new fields, changed types, removed values—CI fails before production sees it.

We maintain a library of "known weird" payloads: edge cases we've seen in production, malformed data that partners have sent, boundary conditions that have caused problems. Every weird payload becomes a test case.
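A minimal sketch of how captured payloads become CI fixtures, assuming Python; the fixtures, the `parse_credential` adapter step, and its output shape are all hypothetical:

```python
# Captured partner payloads, including "known weird" ones, kept as fixtures.
FIXTURES = [
    {"name": "happy_path", "payload": {"credential_id": "c1", "status": "active"}},
    {"name": "null_status", "payload": {"credential_id": "c2", "status": None}},
]

def parse_credential(payload: dict) -> dict:
    """Adapter parse step: normalize the partner's shape into ours."""
    status = payload.get("status") or "unknown"
    return {"id": payload["credential_id"], "status": status}

def test_fixtures_parse() -> None:
    """Runs on every build; fails before production sees schema drift."""
    for fixture in FIXTURES:
        parsed = parse_credential(fixture["payload"])
        assert "id" in parsed and "status" in parsed, fixture["name"]
```

When a partner's schema drifts, the adapter raises or the assertion fires in CI, and the weird payload that caused it joins `FIXTURES` permanently.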

The payoff:

In 2017, a state changed their credentialing API without notice. They added three fields and deprecated one. Our ingestion quarantined the new records—it didn't recognize the shape. Alerted ops. Kept production clean.

We adapted the schema in two days, replayed the quarantine, and never showed incorrect data to customers. The state administrator asked how we caught it so fast. We didn't mention we'd been burned before.


Back-Pressure and Circuit Breakers

When partners slow down, don't let them take you down with them.

Per-tenant throttles:

Heavy tenants can't starve others. If one customer is doing a bulk import, their queue grows, but other customers' traffic continues normally. Rate limits are per-tenant, not global.
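One common way to implement per-tenant limits is a token bucket keyed by tenant; this is a minimal sketch under that assumption (Python, hypothetical class name), not the actual Conductor throttle:

```python
import time
from collections import defaultdict

class PerTenantThrottle:
    """One token bucket per tenant: a bulk import by one customer drains
    only that customer's bucket, never a shared global limit."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = defaultdict(lambda: float(burst))  # buckets start full
        self.last = defaultdict(time.monotonic)

    def allow(self, tenant_id: str) -> bool:
        """Refill this tenant's bucket for elapsed time, then try to spend one token."""
        now = time.monotonic()
        elapsed = now - self.last[tenant_id]
        self.last[tenant_id] = now
        self.tokens[tenant_id] = min(self.burst, self.tokens[tenant_id] + elapsed * self.rate)
        if self.tokens[tenant_id] >= 1:
            self.tokens[tenant_id] -= 1
            return True
        return False
```

Requests that return `False` go to that tenant's queue rather than being dropped, which is what lets their backlog grow while everyone else's traffic flows normally.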

Circuit breakers:

A circuit breaker is a pattern that stops calling a failing service to prevent cascade failures. It has three states:

  • Closed: Normal operation, requests flow through
  • Open: Partner is failing, requests immediately return errors
  • Half-open: Testing recovery, a few requests go through to see if partner is back

When a partner is slow or erroring, the breaker opens. Requests fail fast with a clear message. Retries queue for later. When the partner recovers, the breaker closes automatically.
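The three states above can be sketched as a small class; this is a minimal illustration (Python, hypothetical thresholds), not the production breaker:

```python
import time

class CircuitBreaker:
    """Closed: requests flow. Open: fail fast. Half-open: probe recovery."""

    def __init__(self, failure_threshold: int = 5, cooldown_sec: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_sec = cooldown_sec
        self.failures = 0
        self.opened_at = None  # set when the breaker trips

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.cooldown_sec:
            return "half-open"  # cooldown elapsed: let a probe through
        return "open"

    def call(self, fn):
        if self.state == "open":
            raise RuntimeError("breaker open: partner failing, not calling")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0       # any success closes the breaker again
        self.opened_at = None
        return result
```

The half-open state costs nothing extra here: once the cooldown elapses, the next call is the probe, and its success or failure decides whether the breaker closes or re-opens.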

Cached fallback:

For read operations, when the breaker is open, serve cached "last known good" data where it's safe to do so. The UI shows a banner: "Data may be slightly stale." Better than a spinner or an error page.

The payoff:

In 2019, a state API started responding in 8 seconds instead of 200ms. Without circuit breakers, our thread pool would have been exhausted. Requests would have backed up. Users would have seen timeouts.

With circuit breakers, the breaker tripped in 45 seconds. Users saw a banner: "State verification delayed. Using cached data." Cached credentials were current as of that morning. No cascade. No thread exhaustion. The banner disappeared when the API recovered.


Dual Vendors for Critical Paths

External dependencies fail. Vendors get acquired. APIs deprecate. If your only payment processor goes down, what do you do?

Hot standby:

For critical paths—payments, background checks, core credentialing—integrate a secondary vendor. Not "we have a contract somewhere," actually integrated. Actually tested. Actually receiving a trickle of traffic to stay warm.

We routed 5% of payment traffic through the backup processor continuously. Just enough to know the integration worked. When the primary went down, we could flip traffic in minutes.

Failover drills:

Practice switching. Monthly drills: flip traffic to backup, process some transactions, flip back. Document the credential swaps, the endpoint toggles, the reconciliation steps. Make the failover boring.

The payoff:

In 2015, our primary processor had a 4-hour outage during peak season. We flipped to the backup in 14 minutes. Customers never noticed. The processor's status page said "outage." Our customers saw "working normally."

That backup integration cost us maybe $50K/year to maintain. It saved us from potential contract losses worth 10x that.


Observability That Speaks Business

Technical metrics are necessary but not sufficient. Executives don't care about p99 latency. They care about "are registrations processing?"

Metrics layering:

  • Technical metrics: Success/error rates, latency percentiles, queue depth, breaker trips
  • Business metrics: Registrations per minute, vouchers reconciled, credentials verified
  • The connection: Every technical metric maps to a business impact

When queue depth rises, that's a technical signal. When "registrations per minute" drops, that's a business signal that triggers action.

Logs with correlation:

Every request gets a correlation ID that follows it through the entire flow: from user action, through our systems, to partner API, and back. The correlation ID includes tenant ID and business entity ID (registration, voucher, credential).

When support gets a ticket, they can search by correlation ID and see the entire journey: what we sent, what we got back, how long each step took.
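A minimal sketch of ID construction and propagation, assuming Python; the separator format and the `X-Correlation-ID` header name are illustrative choices, not necessarily what Conductor used:

```python
import uuid

def new_correlation_id(tenant_id: str, entity_id: str) -> str:
    """Embed tenant and business entity IDs so support can search by either;
    the random suffix keeps each request unique."""
    return f"{tenant_id}.{entity_id}.{uuid.uuid4().hex[:8]}"

def with_correlation(headers: dict, correlation_id: str) -> dict:
    """Attach the ID to an outbound partner call so the hop is traceable."""
    return {**headers, "X-Correlation-ID": correlation_id}
```

Embedding the business entity ID directly in the correlation ID is the design choice that pays off in support: a search for "registration 42" finds every hop of every request that touched it.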

The payoff:

When a customer calls asking "where is my credential verification?", support doesn't escalate. They search by the customer's ID, find the correlation ID, see that the state API returned "pending" at 3:47 PM, and explain the situation in one call.

That used to take hours of engineering time. Now it takes minutes of support time.


Migration and Versioning Discipline

Partners change. APIs evolve. If you couple schema changes with code changes in one deploy, you lose the ability to roll back.

Expand/contract migrations:

  1. Expand: Add new fields as nullable or with defaults. Don't remove anything yet. Deploy.
  2. Code change: Update code to use new fields. Old code still works. Deploy.
  3. Contract: Once stable, remove old fields. Deploy.

Each step can be rolled back independently. You never get trapped with new code depending on old schema or vice versa.

Versioned adapters:

Run v1 and v2 adapters in parallel with per-tenant flags. Cut over intentionally:

  • Week 1: 5% of traffic to v2
  • Week 2: 25% of traffic to v2 (if no issues)
  • Week 3: 100% of traffic to v2
  • Week 4: Remove v1
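A minimal sketch of deterministic per-tenant routing for a cutover like this, assuming Python; hashing on tenant ID (here via `zlib.crc32`, an illustrative choice) keeps each tenant on the same adapter version for a whole rollout stage instead of flapping per request:

```python
import zlib

def adapter_version(tenant_id: str, v2_percent: int) -> str:
    """Deterministic rollout flag: hash the tenant into one of 100 buckets
    and send the lowest v2_percent of them to the v2 adapter."""
    bucket = zlib.crc32(tenant_id.encode("utf-8")) % 100
    return "v2" if bucket < v2_percent else "v1"
```

Raising `v2_percent` week by week moves tenants from v1 to v2 without ever moving one back, and setting it to 0 is the rollback.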

The payoff:

When a state moved from batch files to real-time API with 60 days' notice, we wrote the new adapter while the old one kept running. Shadow mode for three weeks—both adapters processed, we compared outputs. Cut over on a quiet Tuesday. Zero disruption.

Rollbacks are real. Partners can change without freezing you.


Human Loops Where They Belong

Some problems require human judgment. Build the interfaces for it.

Exception queues:

A dashboard for ops to resolve quarantined records:

  • List of quarantined items with reason codes
  • Before/after preview for proposed fixes
  • Audit logging of every action
  • Undo where possible

Ops can act without engineering. Engineers can focus on systemic fixes, not individual data issues.

Customer messaging:

When a partner is down, don't hide it. Status pages and in-app banners in plain language:

  • "State credential verification is delayed. Your registration is queued and will process automatically."
  • Not: "Service temporarily unavailable. Error code: PARTNER_TIMEOUT_STATE_API_V2"

The payoff:

When the payment processor had issues, customers saw a banner. Support knew to say "we're aware, working on it, your payment is queued." Call volume dropped 60% compared to previous incidents where we hadn't communicated proactively.

Customers feel informed, not confused. Ops can respond without pulling engineering in every time.


Testing Strategy That Keeps You Sane

Integrations are hard to test because they depend on external systems you don't control. Here's how to make them testable.

Contract tests in CI:

Sample payloads from partners—real responses we've captured—run through adapters every build. If the adapter can't parse them, CI fails. If schemas drift, we know before production.

Replay harness:

Record real production payloads (anonymized where necessary). Replay them against sandboxes. Verify:

  • Idempotency: Same payload twice produces same result
  • Side effects: External calls are made correctly
  • Error handling: Malformed inputs are handled gracefully

Canary tenants:

Route a small percentage of traffic through new adapter versions. Full logging, detailed monitoring. Compare outputs to the old adapter. Only expand rollout when canary looks good.

Failure drills:

Simulate partner failures:

  • Inject latency: Does the breaker trip?
  • Return 500s: Does retry logic work?
  • Return malformed data: Does quarantine catch it?
  • Simulate timeout: Do users see appropriate messages?

Run these monthly. Every drill reveals gaps.


Visibility Playbook

When an integration fails at 3 a.m., you need answers fast.

Dashboards:

  • Success/error rates by integration and endpoint
  • Latency percentiles (p50, p95, p99)
  • Queue depth and age by integration
  • Breaker state (closed/open/half-open)
  • Quarantine counts by error type

Quarantine views:

  • Counts by error type
  • "Oldest item age" to prevent silent rot
  • Top error reasons to identify patterns
  • Quick action buttons for common fixes

Correlation ID search:

Support searches by customer ID, registration ID, or correlation ID. Results show:

  • Timeline of all events
  • Partner requests and responses
  • Processing times at each step
  • Current status

Context → Decision → Outcome → Metric

  • Context: Enterprise platform with 15+ external integrations including state databases, payment processors, and exam providers. Integration failures were a primary source of incidents and customer complaints.
  • Decision: Treated integrations as products: contract-first adapters, idempotency everywhere, edge validation with quarantine, circuit breakers, dual vendors for critical paths, business-level observability.
  • Outcome: Integration-related incidents dropped 70%. Mean time to resolve integration issues dropped from hours to minutes. Partner changes handled without customer impact.
  • Metric: Error rates dropped ~70%. Support can answer "where is my update?" in minutes without engineering. Three partner API changes handled with same-day adapter patches and zero customer-visible disruption.

Anecdote: The Partner Who Changed Everything

In 2016, one of our largest partners—responsible for 30% of our exam registrations—announced they were deprecating their API. Six months' notice. New API was completely different: different auth, different data model, different error handling.

In a system without seams, this would have been a multi-month death march. The API touched everything: registration workflows, scheduling, reporting, reconciliation.

Because we had the adapter pattern, here's what actually happened:

  • Week 1-2: Wrote the new adapter behind a feature flag. Old adapter continued processing.
  • Week 3-4: Shadow mode. Both adapters ran. We compared outputs. Found three edge cases where our interpretation of their docs was wrong.
  • Week 5: 5% canary traffic to new adapter. Monitored closely.
  • Week 6: 50% traffic. Still comparing, still monitoring.
  • Week 7: 100% traffic. Old adapter stood down.
  • Week 8: Removed old adapter code.

Total customer impact: zero. Total engineering panic: minimal. Total late nights: one (the shadow mode comparison revealed a bug in our code, not theirs).

The partner's project manager asked how we'd migrated so smoothly. Other vendors, apparently, had been scrambling. We said "adapters" and she nodded politely while clearly not understanding why that mattered.

It mattered because the seam existed. The contract was defined. The testing was in place. The migration was boring.

Mini Checklist: Enterprise Integration Survival

  • [ ] Every integration goes through an adapter with versioned contracts
  • [ ] Idempotency keys on every write to external systems
  • [ ] Edge validation with quarantine for unknown/invalid inputs
  • [ ] Contract tests in CI using captured partner payloads
  • [ ] Circuit breakers with cached fallback where safe
  • [ ] Per-tenant rate limits to prevent starvation
  • [ ] Dual vendors (hot standby) for critical integrations
  • [ ] Monthly failover drills to backup vendors
  • [ ] Correlation IDs from edge to partner and back
  • [ ] Business-level observability (not just technical metrics)
  • [ ] Exception queue UI for ops to resolve quarantined records
  • [ ] Customer-facing status messaging for partner issues
  • [ ] Expand/contract migrations for schema changes
  • [ ] Canary rollouts for adapter version changes