Benchmark v2 · Cost per Reliable Decision

The cheapest model is not the cheapest answer. The cheapest reliable decision is.

In a 7-day liquidity triage benchmark, aiBlue Core™ governed Fable 5, Opus 4.8, and Haiku 4.5 toward the same decision architecture. The surprise was not which model won — it was how small the gap became once the reasoning was governed.


The model thinks. The Core governs the thinking.
And once governed, models converge, which is where the economics begin.

9.5
Performance ceiling
Fable 5 + Core · Strongest governed output

9.2–9.3
Premium middle path
Opus 4.8 + Core · Near-ceiling, mature tone

9.1–9.2
Economic leverage
Haiku 4.5 + Core · Premium-band, low-cost profile

~0.4
Governed spread
Ceiling → sweet spot · The gap Core compresses

Three Theses

One architecture, measured three ways.

Model thesis

The model thinks. The Core governs the thinking.

Economic thesis

The most expensive model is not always the cheapest path to a reliable decision.

Benchmark thesis

We don't benchmark only model intelligence. We benchmark decision integrity per dollar.

The Test

The same messy pipeline, run through every configuration.

A single 7-day liquidity triage: 34 anonymized entities mixing receivables, proposals, recurring-revenue candidates, vague leads, status contacts, and product ideas — converted into one executable week without inventing contract status.

Scenario: 7-day liquidity triage under founder pressure
Entities: 34, preserved in a ledger
Constraint: no proposal counted as owed money
Output: ledger · classification · capacity-bound agenda · consequences
COLLECT · CLOSE · PARK · UNKNOWN · IGNORE

Establishing The Ceiling

Fable 5 vs. Fable 5 + aiBlue Core™

Fable 5 alone is the strongest raw operational model tested. Governed, it sets the performance ceiling — the bar every cheaper configuration is then measured against.

Criterion | Fable 5 | Fable 5 + Core (gain) | Score with Core
Overall answer quality | 9.2 | +0.3 traceability | 9.5
Entity preservation | 9.2 | +0.5 linked duplicates | 9.7
Evidence discipline | 9.2 | +0.5 forecast rules | 9.7
Classification precision | 9.3 | +0.4 UNKNOWN vs PARK | 9.7
Operational compression | 9.6 | +0.1 capacity-bound | 9.7
Strategic acoupling | 9.3 | +0.5 named-channel gate | 9.8
Distraction control | 9.0 | +0.8 no-reactivation rule | 9.8
Pipeline hygiene | 9.0 | +0.8 define-or-delete | 9.8
Deadline & consequence logic | 9.4 | +0.3 state transitions | 9.7
Forecast safety | 9.0 | +0.8 runway removal | 9.8
Productization readiness | 9.1 | +0.7 replicable protocol | 9.8
Enterprise readiness | 9.0 | +0.8 audit & teams | 9.8
Composite | 9.2 | decision integrity gain | 9.5

The Strategic Layer

Finding the sweet spot: performance, token cost, and reliable decisions.

Once the Core governs the reasoning, three configurations land within roughly four-tenths of a point of each other — but at very different cost profiles. That spread is the whole argument.

Performance ceiling
Fable 5 + Core
9.5 / 10

The strongest governed output in the benchmark. Operational sharpness preserved, full traceability and forecast protection added. This is the bar.

Premium middle path
Opus 4.8 + Core
9.2–9.3 / 10

Nearly the same governed structure with a more mature institutional tone — and, if its cost runs below the ceiling model, a strong premium alternative for client-facing output.

Sweet spot
Economic leverage
Haiku 4.5 + Core
9.1–9.2 / 10

A lower-cost model reaching premium-band governed performance: near-identical ledger, classification, capacity control, and pipeline hygiene. The clearest evidence that the Core compresses the gap.

The metric that matters
Cost per Reliable Decision = the total cost required to reach an answer that is entity-safe, evidence-disciplined, correctly classified, capacity-bound, deadline-driven, consequence-aware, and ready to execute without major human repair.
Entity-safe
Evidence-disciplined
Classification-correct
Capacity-bound
Deadline-driven
Consequence-aware
Execution-ready
No major repair
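The definition above can be read as a ratio: total spend divided by the number of answers that pass every criterion. The sketch below is one possible reading, not the benchmark's published formula. The `Run` type, the `cost` field, and the rule that a missing check counts as a failure are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# The eight criteria named in the definition above.
CRITERIA = (
    "entity_safe",
    "evidence_disciplined",
    "classification_correct",
    "capacity_bound",
    "deadline_driven",
    "consequence_aware",
    "execution_ready",
    "no_major_repair",
)


@dataclass
class Run:
    cost: float  # hypothetical: total spend for this run, corrections included
    checks: dict = field(default_factory=dict)  # criterion name -> bool

    def is_reliable(self) -> bool:
        # Reliable only if every criterion passes; a missing check counts as a failure.
        return all(self.checks.get(name, False) for name in CRITERIA)


def cost_per_reliable_decision(runs) -> float:
    """Assumed formula: total spend across all runs / number of reliable runs."""
    total_cost = sum(run.cost for run in runs)
    reliable = sum(1 for run in runs if run.is_reliable())
    if reliable == 0:
        return float("inf")  # no reliable decision was reached at any price
    return total_cost / reliable
```

Under this reading, unreliable runs still add to the numerator, so a configuration that often needs repair pays for its failures even when each run is cheap.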
Configuration Scoreboard

Observed benchmark signals across configurations.

Decision and governance scores reflect observed benchmark outputs. Cost-efficiency potential and correction-round reduction are directional estimates. Pending models require controlled, repeated runs before final publication.

Configuration | Benchmark role | Decision | Governance | Cost efficiency | Correction-round reduction
Fable 5, no Core (raw operational model) | Strong ungoverned ceiling | 9.2 | 8.8 | Medium | Medium
Fable 5 + Core (performance ceiling) | Performance benchmark | 9.5 | 9.8 | Medium | High
Opus 4.8 + Core (premium cost-balanced) | Strong premium alternative | 9.2–9.3 | 9.6 | High | High
Haiku 4.5 + Core (low-cost operational) | Best cost-performance candidate | 9.1–9.2 | 9.5 | Very high | High
GPT-4.1 + Core (baseline Core config) | Baseline | 8.2 | 8.8 | Medium | Medium
GPT-5.4 mini + Core (pending benchmark run) | Scale candidate | TBD | TBD | Very high | TBD
GPT-5.4 + Core (pending benchmark run) | Expected sweet-spot candidate | TBD | TBD | High | TBD
GPT-5.5 + Core (pending benchmark run) | High-quality governed candidate | TBD | TBD | Medium | TBD
GPT-5.5 Pro, no Core (pending access) | Expensive raw ceiling | TBD | TBD | Low | TBD
GPT-5.5 Pro + Core (future ceiling test) | Absolute governed ceiling | TBD | TBD | Low–Med | TBD

Token Economics

Cost per reliable decision — structure first, prices at publication.

The framework is fixed; the numbers are not yet final. Token counts, prices, and correction rounds are measured per controlled run and filled in at publication. What the benchmark already shows is the relationship, not the absolute figures.

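Configuration | Input tok | Output tok | Total tok | Est. cost / run | Avg correction rounds | Effective cost | Decision | Governance | CPRD index | Verdict
Fable 5, no Core (raw model) | pending | pending | pending | pending | pending | pending | 9.2 | 8.8 | pending | Ungoverned ceiling
Fable 5 + Core (governed ceiling) | pending | pending | pending | pending | pending | pending | 9.5 | 9.8 | pending | Performance ceiling
Opus 4.8 + Core (premium balance) | pending | pending | pending | pending | pending | pending | 9.2–9.3 | 9.6 | pending | Premium alternative
Haiku 4.5 + Core (low-cost sweet spot) | pending | pending | pending | pending | pending | pending | 9.1–9.2 | 9.5 | pending | Cost-performance leader
GPT-4.1 + Core (baseline) | pending | pending | pending | pending | pending | pending | 8.2 | 8.8 | pending | Baseline
GPT-5.4 mini + Core (pending run) | pending | pending | pending | pending | pending | pending | TBD | TBD | pending | Scale candidate
GPT-5.4 + Core (pending run) | pending | pending | pending | pending | pending | pending | TBD | TBD | pending | Expected sweet spot
GPT-5.5 + Core (pending run) | pending | pending | pending | pending | pending | pending | TBD | TBD | pending | High-quality governed
GPT-5.5 Pro, no Core (pending access) | pending | pending | pending | pending | pending | pending | TBD | TBD | pending | Expensive raw ceiling
GPT-5.5 Pro + Core (future test) | pending | pending | pending | pending | pending | pending | TBD | TBD | pending | Absolute governed ceiling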
Note: Pricing must be updated at publication time using the official provider pricing pages. The benchmark focuses on relative cost per reliable decision, not static token pricing. Decision and governance scores reflect observed benchmark outputs; all token, cost, correction-round, and CPRD-index cells are placeholders pending controlled measurement.
The Mechanism

Why a lower-cost model can win when governed.

Raw intelligence is not the only cost driver; correction rounds are one as well. A governed cheaper model that lands right the first time can beat a premium model that needs to be re-prompted.

Premium model · without Core

Powerful, but unbounded

  • High reasoning ability
  • High output cost
  • Variable governance
  • May require human review
  • Strong but non-repeatable answers

Lower-cost model · with Core

Bounded, and repeatable

  • Structured extraction
  • Governed classification
  • Bounded actions & state transitions
  • Fewer correction rounds
  • Lower cost per reliable decision

A cheaper model becomes expensive when it requires three correction rounds. A premium model becomes wasteful when governance can produce the same operational reliability with fewer tokens.
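The arithmetic behind that claim can be sketched in a few lines. The formula is an assumption, not the benchmark's published method: it treats each correction round as costing about as much as the original run. The cost figures in the usage note are illustrative units, not measured prices.

```python
def effective_cost(cost_per_run: float, avg_correction_rounds: float) -> float:
    """Assumed model: every correction round costs roughly one more full run."""
    if cost_per_run < 0 or avg_correction_rounds < 0:
        raise ValueError("cost and correction rounds must be non-negative")
    return cost_per_run * (1 + avg_correction_rounds)


# Illustrative units only: a 1-unit model needing three correction rounds
# versus a 3-unit model that lands right the first time.
cheap_ungoverned = effective_cost(1.0, 3)
premium_first_try = effective_cost(3.0, 0)
```

With these made-up units the cheap model ends up costing 4 units against 3 for the premium one, which is the first half of the statement above. The second half follows the same way: if governance removes most correction rounds from the cheap model, its effective cost drops back toward 1 unit.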

Cost / Reliability

The strategic positioning matrix.

Cost per run on the horizontal axis, decision reliability on the vertical. Governance lifts configurations into the upper band; price moves them left or right. The prize is the upper-left.

Figure: cost / reliability positioning matrix. Horizontal axis: cost per run. Vertical axis: decision reliability. Quadrant labels: SWEET SPOT, PREMIUM + CORE, UNGOVERNED · CHEAP, PREMIUM · NO CORE.
Plotted configurations: Haiku 4.5 + Core (9.1–9.2, sweet spot); Opus 4.8 + Core (9.2–9.3, premium balance); Fable 5 + Core (9.5, ceiling); GPT-4.1 + Core (8.2, baseline); Fable 5, no Core (9.2, ungoverned); GPT-5.5 Pro (pending, untested).

Positions are strategic, not measured coordinates: vertical placement reflects observed governance/decision scores, horizontal placement reflects each model's relative cost profile. Pending configurations are shown with unknown reliability until controlled runs and current pricing are added.

Reframing Procurement

The AI market has been trained to ask: Which model is best?

For operational teams, that is the wrong question. The better question is: which model becomes reliable when governed? aiBlue Core™ changes model procurement from a raw-intelligence decision into an economics-of-reliability decision.

Interpretation

What the sweet-spot test suggests.

These are directional signals from observed runs, not settled proof. Each is a hypothesis the pending benchmark rounds are designed to confirm or reject.

1. If Haiku 4.5 + Core keeps performing near Opus / Fable + Core, it becomes the strongest candidate for high-volume workflows.

2. If Opus 4.8 + Core stays close to Fable 5 + Core at lower cost, it becomes the premium cost-balanced option for client-facing output.

3. If GPT-5.4 + Core approaches GPT-5.5 / Pro quality, the Core becomes an economic arbitrage layer across model generations.

4. GPT-5.5 Pro should be tested as the expensive raw ceiling, but not assumed to be the best cost-per-decision configuration.

Core does not only improve answers. It compresses the performance gap between models.

The Decision Architecture

Five governed layers turn any model's answer into a decision.

The same scaffold runs above every model in this benchmark. That is why outputs converge — and why a cheaper model can reach the premium band.

00

Raw Pipeline

Unstructured input: receivables, proposals, leads, status contacts, product ideas, and ambiguous items — all mixed together.

Receivable A · Major Proposal · Strategic Contact · Unknown Lead · Possible Duplicate · Unattached Product Front

01

Entity Ledger

Every item is preserved and registered before any compression. Duplicates are linked, not merged. Nothing is silently dropped.
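A minimal sketch of a ledger that links duplicates instead of merging them, under stated assumptions. The `Entry` type, the `E01`-style IDs, and the duplicate test (matching labels after case and whitespace normalization) are hypothetical stand-ins. The source does not describe how the Core detects duplicates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    entity_id: str
    label: str
    duplicate_of: Optional[str] = None  # link to the first matching entry, if any


def build_ledger(raw_items):
    """Register every raw item; link suspected duplicates rather than merging them."""
    ledger = []
    first_seen = {}  # normalized label -> entity_id of its first occurrence
    for index, label in enumerate(raw_items, start=1):
        key = " ".join(label.lower().split())  # hypothetical duplicate test
        entry = Entry(f"E{index:02d}", label, duplicate_of=first_seen.get(key))
        first_seen.setdefault(key, entry.entity_id)
        ledger.append(entry)
    # Invariant from the text above: nothing is silently dropped.
    assert len(ledger) == len(raw_items)
    return ledger
```

The point of the design is that the ledger always has as many rows as the raw input, so a 34-entity pipeline yields 34 entries even when some of them point at each other.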

02

Evidence Classification

Stated facts are separated from assumptions. Each entity is tagged COLLECT / CLOSE / FOLLOW / PARK / UNKNOWN / IGNORE on evidence, not optimism.
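One way to picture "evidence, not optimism" is a guard that rejects a tag resting on assumptions. The sketch below is illustrative only: the `Item` fields, the `contractual_basis` fact name, and the choice to downgrade to UNKNOWN are assumptions, and the Core's actual classification rules are not given in the source.

```python
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    COLLECT = "COLLECT"
    CLOSE = "CLOSE"
    FOLLOW = "FOLLOW"
    PARK = "PARK"
    UNKNOWN = "UNKNOWN"
    IGNORE = "IGNORE"


@dataclass
class Item:
    name: str
    stated: frozenset = frozenset()   # facts explicitly present in the input
    assumed: frozenset = frozenset()  # inferences; never used as evidence


def guard_tag(proposed: Tag, item: Item) -> Tag:
    """Hypothetical rule: COLLECT requires a stated contractual basis."""
    if proposed is Tag.COLLECT and "contractual_basis" not in item.stated:
        return Tag.UNKNOWN  # assumed downgrade target
    return proposed
```

For example, a proposal whose contract exists only as an assumption would come back as UNKNOWN rather than COLLECT, which is the behaviour the "no proposal counted as owed money" constraint asks for.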

03

Capacity-Bound Action Plan

Active workload is limited to realistic execution slots. The week is compressed into what can be done — not everything that could be attempted.
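Capacity binding can be sketched as a simple greedy fill of a fixed number of execution slots. This is an assumed approach, not the Core's documented one: the priority numbers, slot counts, and the `plan_week` name are hypothetical.

```python
def plan_week(candidates, slots):
    """Fill a fixed slot budget by priority; whatever does not fit is deferred.

    candidates: list of (name, priority, slots_needed) tuples.
    Returns (active, deferred) lists of names.
    """
    active, deferred = [], []
    remaining = slots
    # Highest priority first; sorted() is stable, so ties keep input order.
    for name, _priority, needed in sorted(candidates, key=lambda c: c[1], reverse=True):
        if needed <= remaining:
            active.append(name)
            remaining -= needed
        else:
            deferred.append(name)  # kept visible, not dropped
    return active, deferred
```

Deferred items stay in the returned plan so that the week is compressed without losing track of what was left out.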

04

State Transitions & Forecast Protection

Every action defines what happens if it succeeds, stalls, or fails. Unverifiable receivables are removed from runway. The plan becomes auditable and repeatable.
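The two rules in this layer can be sketched as a completeness check on each action and a filter on runway. Everything named here is hypothetical: the state names, the single `verified` flag (standing in for named, validated, and tied to a contractual basis), and the amounts used in any example are illustrative, not taken from the benchmark.

```python
from dataclasses import dataclass

OUTCOMES = ("succeeds", "stalls", "fails")


@dataclass
class Action:
    name: str
    transitions: dict  # outcome -> next state (state names are hypothetical)


def next_state(action: Action, outcome: str) -> str:
    """Return the pre-declared next state; refuse actions with undefined outcomes."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    missing = [o for o in OUTCOMES if o not in action.transitions]
    if missing:
        raise ValueError(f"{action.name!r} has no transition for: {missing}")
    return action.transitions[outcome]


@dataclass
class Receivable:
    name: str
    amount: float
    verified: bool  # hypothetical flag: named, validated, contract-backed


def runway(receivables) -> float:
    """Sum only verified receivables; unverifiable ones stay out of runway."""
    return sum(r.amount for r in receivables if r.verified)
```

An action that only says what happens on success is rejected before it reaches the plan, and an unverified receivable contributes nothing to the forecast however large it is.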

Critical Differentiators

Where governance changes the outcome.

A · LEDGER

Entity Governance

The Core prevents state corruption by preserving every entity before compression — so nothing is merged, inflated, or lost.

B · EVIDENCE

Evidence Discipline

The Core prevents proposals, retainers, and vague opportunities from being treated as owed money.

C · ACOUPLING

Strategic Acoupling

The Core does not open new fronts unless they are attached to named buyers or channels already in the pipeline.

D · HYGIENE

Pipeline Hygiene

Unknowns are not left as mental noise. They enter a define-or-delete process that clears phantom opportunities.

E · FORECAST

Forecast Protection

If a receivable cannot be named, validated, or tied to a contractual basis, it is removed from runway planning.

F · PROTOCOL

Enterprise Repeatability

The output becomes a protocol a team can run again — not a one-time answer that depends on who prompted it.

Decision Integrity Per Dollar

Reduce the cost of reliable decisions.

aiBlue Core™ helps teams use models more intelligently — reduce correction rounds, preserve decision integrity, and find the right model for the right task.
