DecisionForge Postmortem: What Failed, What Shipped, What Stayed
A candid look at the multi-LLM debate product—dead ends, fixes, and the pieces that actually delivered value.

DecisionForge started as an idea that sounded brilliant in theory: "What if multiple AI models debated each other to reach better decisions?"
Eighteen months later, it's a working product. But the path from idea to useful tool was paved with failures, dead ends, and things we should have known wouldn't work.
This is the honest postmortem. Not the investor deck version. The real one.
The Original Idea (and Why It Was Wrong)
The pitch:
Multiple LLMs debate a decision. They argue back and forth. The best arguments win. Consensus emerges from intellectual combat.
It sounded great. We built a prototype in two weeks. The first demos were impressive—models generating articulate arguments, responding to each other, reaching conclusions.
The reality:
Consensus produced bland, overconfident answers.
Here's what we didn't anticipate: AI models want to be helpful. "Helpful" often means agreeable. When four models debate, they don't actually fight—they find the safest common ground and converge on it.
Our early debates read like committee reports. Every perspective was "considered." Every risk was "acknowledged." Every recommendation was hedged. The outputs weren't wrong, exactly. They were useless. Technically accurate pablum that didn't help anyone make a decision.
What we should have known:
Debate without structure produces false consensus. Real debates have constraints, roles, and stakes. Our models had none of these. They were just... politely discussing.
The Fix: Forced Role Conflict
The breakthrough came when we stopped asking models to debate and started assigning them to fight.
What changed:
Instead of "discuss this decision," each model got a specific role:
- Risk Officer: Your job is to find what could go wrong. You must list at least three risks. You're not allowed to be optimistic.
- Optimist: Your job is to find the upside. But you have to cite evidence. You can't hand-wave.
- Operator: Your job is to assess feasibility. What resources? What timeline? What could block execution?
- Customer Voice: Your job is to represent users. How would they experience this decision?
- Challenger: Your job is to disagree. If everyone agrees, find out why they're wrong.
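The role mandates above lend themselves to a simple structured encoding. This is a minimal sketch, assuming each role's mandate becomes a system prompt prefixed to the decision question; the wording and function names are illustrative, not DecisionForge's actual prompts:

```python
# Sketch: encoding the five role mandates as reusable prompt templates.
# Wording is paraphrased from the article, not the production prompts.
ROLES = {
    "risk_officer": (
        "Find what could go wrong. List at least three risks. "
        "You are not allowed to be optimistic."
    ),
    "optimist": (
        "Find the upside, but cite evidence for every claim. No hand-waving."
    ),
    "operator": (
        "Assess feasibility: required resources, timeline, and what could "
        "block execution."
    ),
    "customer_voice": (
        "Represent users. Describe how they would experience this decision."
    ),
    "challenger": (
        "Disagree with the emerging consensus. If everyone agrees, "
        "explain why they are wrong."
    ),
}

def build_prompt(role: str, question: str) -> str:
    """Combine a role mandate with the decision question."""
    return f"ROLE MANDATE: {ROLES[role]}\n\nDECISION QUESTION: {question}"
```

The key design point is that the mandate, not the question, carries the conflict: every role sees the same question but is structurally forbidden from converging on the same answer.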
Why it worked:
The Risk Officer had to find risks even when there weren't obvious ones. The Optimist couldn't just agree with the Risk Officer's caution. The Challenger had to disagree with the emerging consensus.
The roles created structural tension. Not models trying to be helpful—models executing a mandate that sometimes required them to be unhelpful.
Debate quality jumped immediately. Outputs became sharper, more actionable, more useful for actual decisions.
The lesson:
Don't ask AI to debate. Force it into conflict. Define the terms of the fight, not just the topic.
Technical Dead End #1: Prompt Soup
What we tried:
Early prompts were elaborate. Multi-paragraph instructions. Carefully crafted tone guidance. Examples and counter-examples. The prompts felt sophisticated.
Why it failed:
Long, clever prompts were brittle. Small changes broke behavior in unpredictable ways. A rewording that seemed equivalent would produce completely different outputs.
One prompt tweak—changing "evaluate the risks" to "analyze the risks"—caused the Risk Officer to stop listing specific risks and start philosophizing about risk in general. We lost two days figuring out why outputs had degraded.
What fixed it:
Short prompts with strict schemas.
Instead of prose describing what we wanted, we defined exactly what the output had to contain:
- Required fields: position, evidence, assumptions, confidence
- Format constraints: evidence must include sources, assumptions must be explicit
- Grounding requirements: unsupported claims get penalized
The prompts became less readable and more mechanical. They also became reliable. We could change routing logic without worrying that the Risk Officer would suddenly start producing poetry.
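A schema-first prompt is only half the story; the other half is rejecting outputs that don't conform. A minimal validation sketch, assuming each role returns JSON with the required fields named above (the evidence item shape is an assumption for illustration):

```python
# Sketch: schema validation for role outputs. Required field names mirror
# the article; the evidence item structure ({"claim", "source"}) is assumed.
REQUIRED_FIELDS = {"position", "evidence", "assumptions", "confidence"}

def validate_output(output: dict) -> list[str]:
    """Return a list of validation errors; an empty list means it passes."""
    errors = []
    missing = REQUIRED_FIELDS - output.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    # Grounding requirement: every piece of evidence must carry a source.
    for item in output.get("evidence", []):
        if not item.get("source"):
            errors.append(f"evidence without source: {item.get('claim')!r}")
    # Assumptions must be explicit, not silently omitted.
    if not output.get("assumptions"):
        errors.append("assumptions must be explicit (non-empty)")
    return errors
```

Failing outputs can be bounced back to the model with the error list, which is far more robust than hoping a prose instruction sticks.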
The lesson:
Prompts should be constraints, not prose. Less "tone," more structure.
Technical Dead End #2: Single Judge
What we tried:
One model scored all debates. Made sense—consistent evaluation from a single perspective.
Why it failed:
Feedback loops and bias.
The judge had blind spots. It consistently rated certain argument patterns higher, regardless of actual quality. Over time, the debating models learned to produce outputs that the judge liked, not outputs that were good.
We noticed because grounding scores were climbing while user satisfaction was flat. The models were gaming the judge, not improving quality.
What fixed it:
Two judges from different providers, with disagreement thresholds.
If Judge A and Judge B disagree by more than one point, the output gets flagged for human review. This caught cases where one judge had learned a bias that the other hadn't.
We also started evaluating judges monthly against "golden" outputs—human-verified high-quality examples. When a judge's scores drifted from human consensus, we investigated.
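Both mechanisms, the disagreement gate and the monthly drift check, are small pieces of logic. A sketch under the article's stated thresholds (the one-point disagreement rule is from the text; the drift tolerance is an assumed placeholder):

```python
# Sketch: dual-judge disagreement gate plus monthly drift check.
# The one-point disagreement threshold is from the article; the drift
# tolerance of 0.5 is an illustrative assumption.
DISAGREEMENT_THRESHOLD = 1.0

def needs_human_review(score_a: float, score_b: float) -> bool:
    """Flag an output when the two judges disagree by more than one point."""
    return abs(score_a - score_b) > DISAGREEMENT_THRESHOLD

def check_judge_drift(judge_scores: list[float],
                      golden_scores: list[float],
                      tolerance: float = 0.5) -> bool:
    """Monthly calibration: compare a judge's scores on golden outputs
    against human-verified reference scores. True means 'investigate'."""
    drift = sum(abs(j - g) for j, g in zip(judge_scores, golden_scores))
    return drift / len(golden_scores) > tolerance
```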
The lesson:
Don't trust a single scoring model. Adversarial evaluation works for models just like it works for code review.
Technical Dead End #3: Cost Explosions
What we tried:
Five premium models debating every decision. The outputs were great. The invoices were not.
Why it failed:
Running five models per request crushed our budget. Early users loved the quality. Finance loved nothing about our unit economics.
We couldn't ship a product where each decision cost $15 in API calls.
What fixed it:
Routing and caps.
A lightweight classifier examines each request and routes it:
- Simple, well-defined questions go to a single cheaper model
- Complex strategic questions get the full debate treatment
- Fact-heavy requests go to models with better grounding (regardless of cost)
Token caps per round prevent verbose models from blowing budget. Parallel execution reduced latency but also let us stop early if one role was clearly stuck.
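The routing table above can be sketched as a small mapping from a classified request to a model configuration. The model names and token budgets here are placeholders, and the real system classifies requests with a lightweight model rather than a string match:

```python
# Illustrative routing sketch. Model names and token caps are invented;
# the production classifier is a model, not a request-kind string.
from dataclasses import dataclass

@dataclass
class Route:
    models: list[str]          # which models participate
    max_tokens_per_round: int  # cap to stop verbose models blowing budget

def route(request_kind: str) -> Route:
    """Map a classified request to a model configuration."""
    if request_kind == "simple":
        # Well-defined questions: one cheaper model, tight budget.
        return Route(models=["cheap-model"], max_tokens_per_round=800)
    if request_kind == "fact_heavy":
        # Grounding quality trumps cost for fact-heavy requests.
        return Route(models=["grounded-model"], max_tokens_per_round=1200)
    # Complex strategic questions get the full five-role debate.
    return Route(models=["model-a", "model-b", "model-c", "model-d", "model-e"],
                 max_tokens_per_round=1500)
```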
The payoff:
Per-decision cost dropped by ~40% without quality loss on the metrics that mattered. We could ship a product people could actually afford.
The lesson:
Premium models everywhere is a prototype strategy, not a product strategy. Match the model to the task.
Product Miss #1: Transcript-Heavy UI
What we built:
The early UI showed debate transcripts. You could read every argument from every role, every critique, every score. It was comprehensive.
Why users hated it:
They didn't want to read 3,000 words of AI arguing with itself.
Users came with a question: "Should we do X?" They wanted an answer, risks, and next steps. Instead, we gave them a research document to wade through.
The irony: we'd built a debate engine to help people make faster decisions, then wrapped it in a UI that took 20 minutes to digest.
What fixed it:
A synthesis layer that outputs what users actually need:
- Decision recommendation (1-2 sentences)
- Top risks with likelihood and mitigation (bullet list)
- Decision criteria (what would change your mind)
- Next steps with owners and timelines
The full transcript became an attachment—available for audit, but not the primary output. Users could drill down if they wanted. Most didn't.
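The synthesis output shape can be captured in a few types. A sketch whose field names follow the bullets above; the `likelihood` encoding is an assumption:

```python
# Sketch: the synthesis layer's output shape. Decision first; transcript
# demoted to an audit attachment (excluded from the default repr).
from dataclasses import dataclass, field

@dataclass
class Risk:
    description: str
    likelihood: str   # assumed encoding: "low" / "medium" / "high"
    mitigation: str

@dataclass
class Synthesis:
    recommendation: str            # 1-2 sentences
    top_risks: list[Risk]          # with likelihood and mitigation
    decision_criteria: list[str]   # what would change your mind
    next_steps: list[str]          # with owners and timelines
    transcript: str = field(repr=False, default="")  # drill-down only
```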
The payoff:
Weekly active users doubled within a month of shipping the synthesis view. Turns out, people want decisions, not debates.
The lesson:
AI artifacts are not products. Wrap them in interfaces humans actually want.
Product Miss #2: No Follow-Through
What we shipped:
Beautiful decision documents. Comprehensive analysis. Clear recommendations.
What happened next:
Nothing. Users read the recommendation, closed the tab, and... didn't do anything with it.
We'd helped them think about the decision. We hadn't helped them act on it.
What fixed it:
Export hooks to task trackers and calendars.
"Schedule a meeting with finance" wasn't just a recommendation—it was a button that created a calendar event. "Get competitive analysis" wasn't just an idea—it was a task in their Asana.
The synthesis included actionable next steps. The UI let users push those steps directly into their workflow.
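A hypothetical sketch of one such hook: turning a recommended next step into a task-creation request against a tracker's REST API. The endpoint URL and payload field names are invented for illustration; a real integration would use the tracker's documented API and authentication:

```python
# Hypothetical action hook: build (not send) a POST request that creates
# a task from a recommended next step. Endpoint and fields are invented.
import json
from urllib import request as urlrequest

def push_task(step: str, owner: str, tracker_url: str) -> urlrequest.Request:
    """Package a next step as a task-creation request for a tracker API."""
    payload = json.dumps({"title": step, "assignee": owner}).encode()
    return urlrequest.Request(
        tracker_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The point isn't the HTTP plumbing; it's that every next step in the synthesis carries enough structure (title, owner) to become a first-class object in the user's workflow tool.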
The payoff:
Users reported actually completing the recommended next steps at 3x the rate after we added action hooks. The tool moved from "interesting analysis" to "actually helpful."
The lesson:
Decision support without execution support is just sophisticated procrastination.
What Shipped and Stuck
After eighteen months of iteration, here's what survived:
Debate roles with checklists:
The role system works. Risk Officer, Optimist, Operator, Customer Voice, Challenger—each with explicit requirements. The Challenger role alone prevents 80% of consensus-mush problems.
Evaluation harness:
Goldens (known-good outputs), near-misses (known-bad outputs), regression gates on every change. We can't ship prompt changes without the harness passing. This has caught three major quality regressions that would have hit production.
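The gate logic itself is simple. A sketch assuming a `score(output) -> float` grounding scorer already exists; the thresholds are illustrative:

```python
# Sketch: regression gate over goldens and near-misses. Assumes an
# existing score(output) -> float scorer; thresholds are illustrative.
from typing import Callable

def regression_gate(score: Callable[[str], float],
                    goldens: list[str],
                    near_misses: list[str],
                    min_golden: float = 4.0,
                    max_near_miss: float = 3.0) -> bool:
    """Block the change unless goldens still score high and near-misses low."""
    if any(score(g) < min_golden for g in goldens):
        return False  # a known-good output degraded
    if any(score(n) >= max_near_miss for n in near_misses):
        return False  # a known-bad output now slips through
    return True
```

The near-miss check is the easy one to skip and the one that matters: it catches a scorer (or prompt) that has drifted toward leniency, which goldens alone never reveal.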
Guardrails:
Grounding checks that verify citations. Schema validation that rejects incomplete outputs. Audit logs for every run. These aren't features—they're infrastructure that makes everything else trustworthy.
Smart routing:
Cheap models for simple tasks. Expensive models for complex ones. Fallbacks for failures. Token caps for cost control. This is what makes the product economically viable.
Synthesis + action:
Users want decisions and next steps, not transcripts. Action hooks that push recommendations into workflow tools. This is what makes the product useful.
Outcomes We Can Measure
Decision quality:
Users report clearer risks and better next steps. Internal scoring shows a 20-30% lift in grounding and risk disclosure after we added judges and role conflict.
We A/B tested against single-model outputs. Debate outputs scored higher on every dimension except brevity.
Reliability:
Regression gates caught a major quality drop when a model provider tweaked their model. We detected it within hours and rolled back while we investigated.
Before the harness, we would have shipped that regression and found out from user complaints days later.
Cost:
Per-decision cost dropped by ~40% after routing and token caps. We hit unit economics that made the product viable. This was the difference between a demo and a business.
Adoption:
Weekly active users doubled after the synthesis/action UI shipped. Users care when the tool helps them execute, not just think.
Things We'd Skip Next Time
Transcript-heavy interfaces:
We spent weeks building debate transcript views that almost nobody uses. Start with synthesis, add detail views later if users ask.
One-size-fits-all prompts:
Domain-specific templates beat generic strategy prompts. We wasted months trying to make one prompt work for all decisions. Building a library of specialized templates was faster in the end.
Assuming judges stay calibrated:
Judge models drift over time. Provider updates, fine-tuning changes, prompt sensitivity—all of these cause scoring to shift. Monthly recalibration against goldens is mandatory, not optional.
If You Build Something Similar
Force disagreement:
Define roles with explicit conflict requirements. Make the Challenger actually challenge. Consensus without friction is false consensus.
Score against rubrics, not vibes:
"Feels better" isn't a metric. Define rubrics with concrete anchors. Measure against them on every change.
Keep prompts short and structured:
Schemas beat prose. Required fields beat suggestions. Constraints beat guidance.
Add a synthesis layer:
Users want answers, not artifacts. Make the debate a backend detail and the decision the frontend reality.
Treat cost as a constraint:
Route simple tasks to cheap models. Cap tokens. Parallelize. Unit economics matter from day one.
Metrics That Mattered
Grounding/risk scores:
These tracked quality better than any subjective measure. A 20-30% lift told us the role conflict was working. When scores dropped, we knew something was wrong before users complained.
Adoption after synthesis:
Users voting with their feet. Weekly active users doubled when we stopped showing transcripts and started showing decisions. That was the product-market fit signal.
Cost per decision:
This determined whether we had a business. The 40% cost reduction from routing kept the product alive long enough to find paying customers.
Fast Start Template
If you're building multi-model decision support, start here:
- Define roles: Risk Officer, Optimist, Operator, Customer Voice, Challenger. Give each a checklist of requirements.
- Write structured prompts: Required fields, format constraints, grounding requirements. No prose.
- Add dual judges: Two providers, disagreement threshold, monthly calibration against goldens.
- Build evaluation: 30+ golden outputs, 30+ near-misses, regression suite that runs on every change.
- Synthesize for humans: Decision, risks, next steps, rationale. Transcripts are attachments.
- Route for cost: Cheap models for simple tasks. Full debate for complex ones.
DecisionForge worked once it became a product with constraints, not an experiment with vibes.
Context → Decision → Outcome → Metric
- Context: Building multi-model AI debate system for complex business decisions, initial prototype producing bland consensus outputs, high costs, low user adoption.
- Decision: Added forced role conflict (especially Challenger), dual-judge scoring, synthesis UI with action hooks, smart routing for cost control, comprehensive evaluation harness.
- Outcome: Sharper outputs that users actually acted on. Sustainable unit economics. Regression detection before production impact.
- Metric: 20-30% lift in grounding/risk scores. 2x weekly active users after synthesis UI. 40% cost reduction from routing. 3 major regressions caught by harness before production.
Anecdote: The Day We Almost Shipped Garbage
Six months into DecisionForge, we pushed a prompt update. Minor wording change. Passed code review. Looked fine in testing.
Three hours later, grounding scores dropped off a cliff. We'd gone from an average of 4.1 to 2.8. The models were producing confident recommendations without citing any sources.
Before the evaluation harness, we wouldn't have noticed until Monday when users complained. The harness caught it within hours.
Here's what had happened: the "minor wording change" subtly altered how the grounding requirement was interpreted. The models still understood they should cite sources. They just stopped doing it when citations were hard to find. Instead of admitting uncertainty, they just... asserted.
We rolled back the prompt. Scores recovered. Then we spent a week understanding why that specific wording mattered and how to get the behavior we wanted without the brittleness.
That incident taught us two things:
- There's no such thing as a minor prompt change. Every change runs through the harness.
- The harness exists for moments like this. Building it felt like overhead until it caught something real.
The harness has caught three major regressions since then. Each one would have hit production and damaged user trust. Each one was caught before users saw it.
That's why evaluation infrastructure isn't optional. It's what makes AI products reliable.
Mini Checklist: Building Multi-Model Decision Support
- [ ] Roles defined with explicit conflict requirements (especially Challenger)
- [ ] Prompts structured with required fields, not prose guidance
- [ ] Dual judges from different providers with disagreement thresholds
- [ ] Monthly judge calibration against golden outputs
- [ ] Golden set of 30+ known-good outputs for evaluation
- [ ] Near-miss set of 30+ known-bad outputs to catch regressions
- [ ] Regression suite that blocks deployment on quality drops
- [ ] Synthesis layer that outputs decisions, not transcripts
- [ ] Action hooks to push next steps into user workflows
- [ ] Routing logic that matches task complexity to model cost
- [ ] Token caps and parallel execution for cost control
- [ ] Audit logs for every run with correlation IDs