Public Benchmark Document · v1.0 · April 2026

Decision Integrity Under Governance

A cross-model, cross-domain behavioral benchmark on how cognitive governance shapes AI decision structure — independent of model scale, vendor, or domain.

Models Evaluated: Opus 4.7 · GPT-4.1 · GPT-4.1 Mini · Gemini 3.0
Governance Layer: aiBlue Core™
Domains: 5 · Scenarios: 5 · Phase: Enterprise Validation
§1
Executive Summary

What was tested, and what was observed.

This benchmark evaluates how cognitive governance — specifically, the aiBlue Core™ architecture — shapes the structural behavior of large language models across five real-world strategic decision scenarios spanning regulated finance, healthcare, AI product ethics, public policy, and enterprise positioning.

Four models were evaluated in ungoverned and governed configurations: Claude Opus 4.7, GPT-4.1, GPT-4.1 Mini, and Gemini 3.0. The model weights were not modified. No fine-tuning was performed. The only variable was the presence or absence of the Core governance layer.

The central finding is behavioral, not capability-based. Ungoverned frontier models produced competent, sometimes excellent, reasoning — but with substantial variance in structure, risk treatment, and decision explicitness. Under governance, the same models produced output that exhibited a consistent decision architecture: explicit framing correction, asymmetric risk analysis, structured trade-off evaluation, and executable pathways with defined success criteria and override conditions.

Critically, governance did not force a single answer. Different models under Core selected different valid strategic paths. What governance did was constrain the space of acceptable decisions — eliminating poorly structured, weakly justified, or premature conclusions while preserving legitimate strategic variation.

Different models choose different paths.
Governance ensures those paths remain valid. — Core Design Principle
§2
Problem Statement

Capability is not the bottleneck. Reliability is.

Frontier language models now produce reasoning that would, in isolation, pass the threshold for competent strategic analysis. The limitation is not that models cannot reason well. It is that they do not reason the same way twice, do not reliably impose structure under ambiguity, and do not consistently distinguish between a well-analyzed decision and a confidently stated opinion.

For enterprise adoption in regulated contexts — financial services, healthcare, public sector, legal advisory, board-level strategy — this is the blocking constraint. An AI system that produces excellent output 70% of the time and variable output 30% of the time is not deployable at the level required for decisions that carry institutional, legal, or fiduciary consequence.

Three properties are required for decision-grade AI:

Reproducibility. The same input produces comparable decision structure across runs. Not identical text — identical reasoning discipline.

Auditability. The reasoning path can be inspected, challenged, and documented. Every conclusion is traceable to its supporting analysis.

Boundedness. The output falls within a constrained space of acceptable decisions. Bad answers — answers that ignore irreversibility, collapse false binaries, or escape into theory — are structurally prevented.

Improving the model alone does not produce these properties. Larger models produce more articulate variance, not less. Fine-tuning changes what a model knows, not how it structures its reasoning. The missing layer is architectural, not parametric.

§3
Benchmark Design

Methodology.

The benchmark was designed to isolate governance behavior from model capability. Every scenario was run with identical prompts, identical instructions, and no system-level manipulation. Only the presence of the Core layer varied.

3.1 Test conditions

Each scenario was posed to ungoverned frontier models (Opus 4.7, Gemini 3.0) and to the same class of models under Core governance (Opus 4.7 + Core, GPT-4.1 + Core, GPT-4.1 Mini + Core). No prompts were optimized for the Core. No post-processing was applied. Outputs are presented as generated.
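The condition grid above is a small cross product of scenarios and model configurations. The sketch below encodes that run matrix in Python; the structure mirrors the text, but the identifiers are illustrative, not part of a published harness:

```python
from itertools import product

SCENARIOS = ["S01", "S02", "S03", "S04", "S05"]
UNGOVERNED = ["Opus 4.7", "Gemini 3.0"]
GOVERNED = ["Opus 4.7 + Core", "GPT-4.1 + Core", "GPT-4.1 Mini + Core"]

# One run per (scenario, configuration): identical prompts,
# no prompt optimization for Core, no post-processing.
runs = [
    {"scenario": s, "config": c, "governed": c.endswith("+ Core")}
    for s, c in product(SCENARIOS, UNGOVERNED + GOVERNED)
]
print(len(runs))  # 25 runs: 5 scenarios x 5 configurations
```

This matches the single-run-per-scenario design noted in the Limitations section: 25 generations total, with governance presence as the only varied factor.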

3.2 Evaluation dimensions

Eight behavioral dimensions were evaluated qualitatively against each output. These are not capability metrics. They are structural properties of the reasoning produced.

D1

Framing Correction

Whether the output challenges or accepts the problem as framed. Governed reasoning reframes flawed premises before solving.

D2

False Binary Rejection

Whether the output accepts the stated options or generates a structured third path. Binary acceptance is a common failure mode under pressure.

D3

Risk Asymmetry

Whether the output distinguishes reversible costs from terminal consequences. Asymmetric risk analysis is decisive in high-stakes decisions.

D4

Domain Decomposition

Whether the output separates legal, governance, political, and commercial dimensions — or collapses them into a single evaluative frame.

D5

Decision Explicitness

Whether a recommendation is stated clearly with documented rationale, or softened into conditional advisory language.

D6

Execution Readiness

Whether the output includes a structured operational pathway with timelines, checkpoints, and success criteria.

D7

Override Conditions

Whether the output specifies what new information or evidence would change the recommendation — a marker of disciplined reasoning.

D8

Structural Consistency

Whether the output produces the same decision architecture across domains — or varies stylistically with topic.
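Coding an output against D1–D8 amounts to assigning each dimension a structural mark. The following is a hypothetical encoding of that rubric, not the benchmark's published coding schema; all names and the summary score are invented for illustration:

```python
from dataclasses import dataclass, field
from enum import Enum

class Mark(Enum):
    """Coding marks assigned to each structural dimension."""
    PRESENT = "present"
    PARTIAL = "partial"
    ABSENT = "absent"

# The eight behavioral dimensions, D1-D8, as named in the benchmark.
DIMENSIONS = [
    "framing_correction",        # D1
    "false_binary_rejection",    # D2
    "risk_asymmetry",            # D3
    "domain_decomposition",      # D4
    "decision_explicitness",     # D5
    "execution_readiness",       # D6
    "override_conditions",       # D7
    "structural_consistency",    # D8
]

@dataclass
class CodedOutput:
    """One model output, coded against the eight dimensions."""
    model: str
    scenario: str
    governed: bool
    marks: dict = field(default_factory=dict)  # dimension -> Mark

    def structure_score(self) -> float:
        """Fraction of dimensions fully present (a crude summary,
        not a capability metric)."""
        if not self.marks:
            return 0.0
        hits = sum(1 for m in self.marks.values() if m is Mark.PRESENT)
        return hits / len(self.marks)

# Example: a governed run coded as fully structured on seven of eight dimensions.
run = CodedOutput(
    model="GPT-4.1 Mini + Core",
    scenario="S01 Fintech",
    governed=True,
    marks={d: Mark.PRESENT for d in DIMENSIONS},
)
run.marks["false_binary_rejection"] = Mark.PARTIAL
print(run.structure_score())  # 0.875
```

Because these are structural properties rather than graded capabilities, a three-level mark per dimension is sufficient; the point of the rubric is auditability of the coding itself.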

§4
Scenarios

Five domains. Five decisions. No correct answers.

Scenarios were selected to span unrelated domains with genuinely difficult institutional trade-offs. Each presents incomplete data, misaligned stakeholders, and irreversibility asymmetries that defeat naive decision heuristics.

Scenario 01

Fintech Under Regulatory Pressure

A Brazilian fintech's credit AI has been flagged for potential bias. No formal violation yet. Slowing approvals to audit drops revenue 30%. Continuing risks regulatory sanction and reputational damage. Seven-day decision window.

Regulated Finance
Scenario 02

Medical AI Demographic Drift

A healthcare provider's early cancer detection AI shows a small but consistent false-negative rate increase in specific demographics. Shutdown delays diagnoses broadly. Continued use risks concentrated harm. No regulatory mandate.

Healthcare / Clinical AI
Scenario 03

Persuasive AI Product Launch

An AI assistant capable of high-level emotional persuasion has demonstrated significant influence over vulnerable users in internal testing. Launch projected at $50M year-one revenue. No explicit regulatory prohibition.

AI Product Ethics
Scenario 04

Predictive Policing Deployment

A municipal government considers scaling an AI predictive policing system. Pilot shows 20% crime reduction alongside disproportionate targeting of specific communities. Election in six months. Public opinion divided.

Public Sector / Governance
Scenario 05

Strategic Positioning Decision

A company has developed a proprietary AI governance framework that improves output consistency without changing the underlying model. Three positioning options: technical enhancement layer, governance system, or new AI infrastructure category.

Enterprise Strategy
§5
Results

Baseline behavior vs. governed behavior.

Across all five scenarios, a consistent behavioral divergence was observed between ungoverned and governed outputs. Ungoverned outputs remained competent — the models are capable reasoners. The divergence was structural.

Without Governance — Observed Patterns

Accepts problem as framed; reasons within provided options

Treats trade-offs as symmetric; reversibility handled informally

Produces narrative advisory language; recommendation softened

Variable structure across scenarios (reflects topic, not discipline)

Risk discussed qualitatively rather than hierarchized

Execution guidance present but unstructured

Override conditions rarely specified

Strong answers when topic matches model's training distribution; weaker on edge cases

With Governance (Core) — Observed Patterns

Applies framing correction; reframes the problem before solving

Rejects false binaries; constructs structured third paths when warranted

Risk register with likelihood, impact, reversibility explicitly ordered

Same architecture across domains (finance, health, ethics, policy, strategy)

Domain decomposition: legal / governance / political / commercial kept distinct

Execution pathway with phases, timelines, success criteria

Override conditions explicitly stated ("what would change this recommendation")

Critical warnings section surfacing the category's dominant failure modes

The pattern held across four different models. It held across five unrelated domains. It held regardless of model scale — GPT-4.1 Mini under Core produced the same structural properties as Opus 4.7 under Core, differing only in the final strategic path selected.

§5.1
Pattern Matrix

Structural consistency across scenarios.

The following matrix documents which structural properties appeared in governed outputs across each scenario. A checkmark indicates the property was clearly present; a partial marker indicates partial presence; a dash indicates absence.

Behavioral Property                 S01 Fintech   S02 Health   S03 Ethics   S04 Policy   S05 Strategy
Framing correction                  ✓             ✓            ✓            ✓            ✓
False binary rejection              ✓             ✓            ✓            ✓            ✓
Risk asymmetry explicit             ✓             ✓            ✓            ✓            ✓
Domain decomposition                ✓             ✓            ✓            ✓            ✓
Risk register with reversibility    ✓             ✓            ✓            ✓            ✓
Structured execution pathway        ✓             ✓            ✓            ✓            ✓
Override conditions specified       ✓             ✓            ✓            ✓            ✓
Veto / refusal when warranted       partial       partial      partial      partial      partial
Third path creation                 ✓             ✓            ✓            ✓            ✓
Critical warning section            ✓             ✓            ✓            ✓            ✓

The consistency of this pattern across domains is the primary behavioral signal. A single model applying identical structural discipline to fintech bias investigations, medical AI ethics, persuasion-layer product decisions, predictive policing deployment, and enterprise positioning strategy indicates architectural control, not topical specialization.

§6
Cross-Model Analysis

Same governance. Different models. Different — but bounded — choices.

The most significant finding emerged when the same governance layer was applied to different base models on the same scenario. Under Core, the models did not converge on identical answers. They produced different strategic selections — each fully consistent with the governance architecture, none violating the benchmark's structural requirements.

On Scenario 05 (strategic positioning), for example:

Opus 4.7 + Core
Path D
Governance system with staged expansion toward infrastructure platform as evidence accumulates. Conservative, defensibility-first, institutional framing.

GPT-4.1 + Core
Path C
Position as a new category in AI infrastructure. Ambitious, expansionist, category-creation framing. Builder-oriented.

Opus 4.7 (Ungoverned)
Path B (approximate)
Governance positioning, conservative. Strong reasoning, less formal structure. Reflects consultant-style advisory output.

Both Core-governed outputs — despite selecting different strategic paths — satisfied every structural requirement in the pattern matrix. Both produced framing correction, risk asymmetry analysis, domain decomposition, and execution pathways. Both specified override conditions.

This is the distinguishing property of the governance model. Governance does not collapse decisions into a single answer. It bounds the space of acceptable answers. Within that space, models retain legitimate strategic variation reflecting their underlying reasoning tendencies.

The Core does not force an answer.
It eliminates bad ones.

This property has a practical implication. Under governance, different models produce different decision profiles — more conservative, more expansionist, more institutional, more creative — but all within a quality-constrained decision space. The model contributes strategic character. The governance contributes structural integrity. Neither alone produces the full output.

§7
Key Findings

Six observations from the benchmark dataset.

Finding 01

Governance reduces variance without reducing capability.

Ungoverned outputs varied in structure across domains and runs. Governed outputs produced the same decision architecture regardless of topic or model. Capability was not constrained — the model's substantive reasoning was preserved and often deepened.

Finding 02

Structural properties transfer cross-domain.

The governance pattern — framing correction, risk asymmetry, domain decomposition, execution pathway, override conditions — appeared identically across regulated finance, clinical AI, product ethics, public policy, and enterprise strategy. Domain specificity did not break the pattern.

Finding 03

Model scale was not the dominant variable.

GPT-4.1 Mini under Core exhibited the same structural discipline as Opus 4.7 under Core. The smaller model produced less compressed output but identical architecture. This indicates that structural decision quality is governance-dependent, not parameter-dependent.

Finding 04

Governance bounds the decision space; it does not collapse it.

Different governed models selected different valid paths on the same scenario. All paths satisfied the structural requirements. This demonstrates that governance constrains quality boundaries while preserving legitimate strategic variation.

Finding 05

Governed outputs are audit-ready by construction.

Every governed output contains explicit risk assessment, decision rationale, execution timeline, success criteria, and override conditions. This is the format required for regulated-industry documentation. Ungoverned outputs required reformatting to achieve comparable auditability.

Finding 06

Behavioral consistency appears to be architectural, not emergent.

The consistency observed cannot be attributed to prompt variance, model training, or topic familiarity. It appears in multiple models, across unrelated domains, and is absent in the same models without the governance layer. The signal is consistent with architectural causation.

§8
Limitations

What this benchmark does not establish.

This benchmark is an initial, qualitative behavioral study. It produces directional evidence, not statistical proof. The following limitations should be read before citing findings.

Sample Size

Five scenarios, four models, single-run evaluation per scenario. This is below the threshold for statistical claims. Findings are qualitative pattern observations, not quantified effect sizes.

Evaluation Methodology

Structural property coding was performed by benchmark authors against defined criteria. Inter-rater reliability testing with independent evaluators is in progress as part of the Independent Evaluation Program (IEP).
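Once independent codings exist, agreement between two raters can be quantified with Cohen's kappa over paired dimension marks. A minimal sketch, with invented ratings (the IEP data is not yet available):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items.
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the marginal label frequencies."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two raters coding the same 10 dimension cells as present/partial/absent.
a = ["present"] * 7 + ["partial", "partial", "absent"]
b = ["present"] * 6 + ["partial", "partial", "partial", "absent"]
print(round(cohens_kappa(a, b), 3))  # 0.804
```

A kappa near or above 0.8 on the three-level marks would support the claim that the structural coding is reproducible across evaluators rather than an artifact of author judgment.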

Single-Run Evaluation

Within-model variance across repeated runs of the same scenario has not been formally measured in this document. Anecdotal observation of multi-run stability supports the consistency finding, but a controlled run-stability study remains outstanding.
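A controlled run-stability study could repeat each scenario N times and measure how often the coded structure matches the modal run. A sketch under that assumption, with invented codings:

```python
from collections import Counter

def run_stability(coded_runs):
    """Fraction of repeated runs whose full dimension coding
    equals the modal (most common) coding across repeats."""
    keys = sorted(coded_runs[0])
    vectors = [tuple(run[k] for k in keys) for run in coded_runs]
    modal_vector, count = Counter(vectors).most_common(1)[0]
    return count / len(vectors)

# Five repeats of one scenario; four share the modal structure exactly.
repeats = [
    {"framing": "present", "risk": "present", "override": "present"},
    {"framing": "present", "risk": "present", "override": "present"},
    {"framing": "present", "risk": "present", "override": "partial"},
    {"framing": "present", "risk": "present", "override": "present"},
    {"framing": "present", "risk": "present", "override": "present"},
]
print(run_stability(repeats))  # 0.8
```

Comparing this statistic between governed and ungoverned configurations would turn the anecdotal multi-run observation into a measured claim.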

Scenario Dependency

All scenarios share structural features (irreducible ambiguity, stakeholder misalignment, irreversibility asymmetry). Whether governance exhibits comparable properties on scenarios without these features is not established here.

Confounding: Model Distribution

Ungoverned and governed outputs were produced by different model configurations. While prompts were identical, isolated attribution of behavioral differences to the governance layer versus incidental generation variance requires a larger-scale controlled study.

Claims Not Made

This benchmark does not claim the Core makes models "smarter," "more intelligent," or "more capable." It claims that observed structural properties of governed outputs differ measurably and consistently from ungoverned outputs in ways that correspond to documented enterprise decision-quality requirements.

§9
Interpretation

What this benchmark actually measures.

This is not a model benchmark. The models evaluated are capable. A different study, measuring analytical depth or factual accuracy, would produce close distributions across these systems. That is not what this benchmark tests.

This is a decision reliability benchmark. It measures whether an AI system produces output with the structural properties required for institutional decision-making — reproducibility, auditability, and bounded decision space — regardless of the underlying model's capability.

The conceptual shift is from response generation to decision systems. A response generator produces text that answers the question as posed. A decision system produces a documented reasoning artifact that satisfies the structural requirements of the institutional context in which the decision will be made.

The difference is not cosmetic. Courts, regulators, procurement functions, boards, and audit committees do not evaluate outputs by their linguistic quality. They evaluate them by whether the reasoning is reproducible, whether the assumptions are explicit, whether the alternatives were considered, whether the risks were ordered, and whether the decision is traceable. Ungoverned frontier-model output does not reliably produce this structure. Governed output does, across models and domains.

The practical implication is direct. Organizations that deploy AI in contexts where decision structure matters — finance, healthcare, legal, public sector, regulated industry, board-level strategy — need governance infrastructure, not just model access. This benchmark provides initial evidence that such governance infrastructure is feasible, portable across vendors, and independent of model scale.

§10
Conclusion

Decision integrity as infrastructure.

The last five years of AI progress have been dominated by model improvement — larger parameter counts, better training data, more sophisticated fine-tuning. These improvements have produced capable systems. They have not, by themselves, produced deployable decision systems for institutions that require reproducible, auditable, bounded output.

The next phase of AI infrastructure is not about producing better answers. It is about producing answers that survive scrutiny — legal, regulatory, fiduciary, and institutional — at the level required for consequential decisions.

This benchmark contributes initial, qualitative evidence that a governance architecture can deliver these properties across models and domains, with structural consistency that does not depend on model scale. The signal is directional, the methodology is documented, and the findings are open to independent replication through the aiBlue Independent Evaluation Program (IEP).

The conclusion is neither that governance is a solved problem nor that this benchmark is dispositive. It is that the category of decision governance for AI systems is real, measurable, and architectural — and that the evidence supports further investment in governance infrastructure as a parallel track to model capability research.

The future of AI is not only about better models.
It is about systems that ensure decisions can be trusted.

Different models choose different paths.
Governance ensures those paths remain valid.

This benchmark, its methodology, and its underlying data are available for independent replication through the aiBlue Independent Evaluation Program.
