The audit trail problem in regulated AI.

Most major model providers ship audit logs. None of them, on their own, satisfy the audit requirements of a regulated deployment. The gap is not a matter of completeness or retention; it is a matter of what the logs were engineered to do.

This paper details what regulators actually ask for when they audit an AI deployment, why provider logs fall short of those requests, and what an audit trail engineered for compliance contains. The argument has practical consequences. Deployments that treat provider logs as their audit posture spend the second year of deployment rebuilding it under pressure. Deployments that engineer the trail as the primitive the routing layer is built around do not.

What provider audit logs contain.

Provider logs are engineered to satisfy provider concerns. The concerns are reasonable and well-scoped: rate limiting, abuse detection, billing reconciliation, model-performance telemetry. The logs are designed to support these workflows at the scale the provider operates, which is the scale of all customers combined.

The consequences for a regulated deployment are predictable. Records are retained for periods determined by the provider's commercial considerations, not the regulator's retention rules. Access is gated by the provider's role model, not the deployment's. Exports are formatted for the provider's ingestion pipelines, not the regulator's review tools. None of this is a defect in the provider's logs. It is a mismatch between what the logs were built for and what the regulated deployment needs.

The most common failure mode is that the deployment cannot reconstruct a single member-facing interaction six months after the fact. The provider can confirm that a call happened, can show the prompt and the response, can attest to the model version. None of that is sufficient. The regulator's question is about the policy that applied, the context that was loaded, the decision that was made, the reviewer who would have been notified if the call had escalated, and the chain of custody from the inputs to the outputs. The provider log cannot answer this question because it was not built to.

What regulators ask for.

We have been through several dozen regulator engagements across health, financial services, insurance, and the public sector. The asks vary in form and converge in substance. Five primitives appear in nearly every engagement.

Decision provenance. The regulator wants to know, for any given interaction, what inputs the system received, what policy was applied, what context was loaded, what model was selected, what response was produced, and how each of those facts can be verified independently of the others.

Replay from inputs. Given the original inputs, the deployment must be able to reproduce the original decision, or formally explain why it cannot.

Role-based access. Different reviewers see different slices of the trail, and the access itself is logged.

Retention by jurisdiction. Different jurisdictions impose different minimum and maximum retention periods, sometimes on the same record. The trail must enforce both.

Export in formats the regulator's tooling can ingest. This is the unglamorous one and the one most deployments underestimate. A trail that can answer every regulator question in principle but only as a database query is not a trail the regulator can use.
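The retention primitive is the easiest of the five to illustrate. What follows is a minimal sketch, assuming each jurisdiction supplies a minimum retention period and an optional maximum, both in days. The names RetentionRule and retention_window are hypothetical, and any durations passed in are placeholders rather than real regulatory values.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetentionRule:
    jurisdiction: str          # hypothetical label, e.g. "A" or "B"
    min_days: int              # record must be kept at least this long
    max_days: Optional[int]    # record must be deleted by this age; None = no limit


def retention_window(created: date, rules: list[RetentionRule]) -> tuple[date, Optional[date]]:
    """Return (earliest_delete, latest_delete) that satisfies every rule at once.

    Raises ValueError when one rule's minimum exceeds another rule's maximum,
    because that conflict cannot be settled by arithmetic alone.
    """
    # The strictest minimum wins: keep the record until every minimum has passed.
    earliest = created + timedelta(days=max(r.min_days for r in rules))
    # The strictest maximum wins: delete before the shortest maximum expires.
    maxima = [r.max_days for r in rules if r.max_days is not None]
    latest = created + timedelta(days=min(maxima)) if maxima else None
    if latest is not None and earliest > latest:
        raise ValueError("conflicting retention rules; needs a documented resolution")
    return earliest, latest
```

A record subject to two jurisdictions gets the intersection of their windows. When the intersection is empty the sketch refuses to pick a side, on the assumption that such conflicts call for a documented compliance decision rather than a default.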

The regulator does not ask for these primitives in this language. They ask in the language of their own regimes. The translation is the work the deployment owes.

What an engineered audit trail contains.

A trail engineered for compliance treats every routing decision as a first-class artifact. The artifact contains the inputs in their original form, a reference to the policy version that was applied, a reference to the cultural module that resolved, the model identifier and version, the response identifier with a hash of the response body, and a cryptographic chain to the previous decision in the same deployment. Each field has a defined retention rule. Each access to the record is itself a record.
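The artifact described above can be sketched as a small immutable record. This is an illustration under stated assumptions, not a description of a production schema: the field names, the GENESIS sentinel, the use of SHA-256, and the canonical-JSON encoding are all our choices for the example. Per-field retention rules and access logging are left out to keep the sketch short.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

GENESIS = "0" * 64  # assumed sentinel: prev_hash of the first record in a deployment


@dataclass(frozen=True)
class DecisionRecord:
    inputs: str            # inputs in their original form
    policy_version: str    # reference to the policy version applied
    module_ref: str        # reference to the cultural module that resolved
    model_id: str
    model_version: str
    response_id: str
    response_hash: str     # SHA-256 hex digest of the response body
    prev_hash: str         # digest of the previous decision in this deployment

    def digest(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_decision(trail: list, *, inputs: str, policy_version: str, module_ref: str,
                    model_id: str, model_version: str, response_id: str,
                    response_body: bytes) -> DecisionRecord:
    """Append one routing decision, linking it to the current head of the trail."""
    prev = trail[-1].digest() if trail else GENESIS
    record = DecisionRecord(
        inputs=inputs,
        policy_version=policy_version,
        module_ref=module_ref,
        model_id=model_id,
        model_version=model_version,
        response_id=response_id,
        response_hash=hashlib.sha256(response_body).hexdigest(),
        prev_hash=prev,
    )
    trail.append(record)
    return record
```

Storing only a hash of the response body keeps the record small while still letting a reviewer confirm that a separately stored response is the one the decision produced.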

The chain is non-negotiable. Without a chain, a regulator cannot distinguish a complete trail from a tampered one. With a chain, the trail is either intact or visibly broken, and the breakage itself is auditable. The chain costs almost nothing to maintain — a single hash per write — and recovers a class of regulator confidence that no other primitive provides.
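The "intact or visibly broken" property can be checked with a single pass over the trail. The sketch below assumes records are plain dictionaries carrying a prev_hash field, that digests are SHA-256 over canonical JSON, and that the first record links to an all-zero sentinel. The function names are hypothetical.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed sentinel for the first record's prev_hash


def record_digest(record: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def first_broken_link(trail: list[dict]) -> Optional[int]:
    """Return the index of the first record whose prev_hash does not match the
    digest of its predecessor, or None when every link is intact."""
    expected = GENESIS
    for index, record in enumerate(trail):
        if record["prev_hash"] != expected:
            return index
        expected = record_digest(record)
    return None
```

Editing a record changes its digest, so the break appears at the record that follows it. The newest record has no successor, which means a check of this shape only protects the head of the trail if its digest is also anchored somewhere outside the trail.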

The trail is not the same as the logs. The logs continue to serve their operational role: rate limiting, debugging, telemetry. The trail is the artifact the regulator inspects, and it is engineered for that single purpose. Separating the two is the architectural decision that makes both possible.

What replay actually means.

Replay is the regulator's most demanding ask and the one most deployments interpret incorrectly. The ask is not that the system produce an identical output given identical inputs. Modern model providers do not guarantee determinism, and pretending otherwise is dishonest.

The ask is that the deployment can reproduce the decision — the routing decision, the policy decision, the cultural-module resolution — given the original inputs. The model response itself may differ; the path that led to it must not. A replay framework that satisfies this ask records the inputs, replays them through the same compiled policy and the same resolved module, and produces the same routing decision. The model response is then either retrieved from the trail or regenerated with the understanding that it is a fresh generation.
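A replay check of this kind can be sketched in a few lines. The sketch assumes the routing step is a pure function of the recorded inputs, with no clock, randomness, or network calls. RoutingDecision, replay_decision, and example_route are hypothetical names, and the routing logic is a toy stand-in for a compiled policy.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RoutingDecision:
    policy_version: str
    module_ref: str
    model_id: str


@dataclass(frozen=True)
class ReplayResult:
    matches: bool              # did replay reach the recorded routing decision?
    replayed: RoutingDecision
    response_source: str       # "trail" or "fresh generation"


def replay_decision(inputs: dict, recorded: RoutingDecision,
                    route: Callable[[dict], RoutingDecision],
                    stored_response: Optional[str] = None) -> ReplayResult:
    """Re-run the deterministic routing step and compare it with the record.

    The model response is never compared: it is either retrieved from the
    trail or labeled as a fresh generation.
    """
    replayed = route(inputs)
    source = "trail" if stored_response is not None else "fresh generation"
    return ReplayResult(replayed == recorded, replayed, source)


def example_route(inputs: dict) -> RoutingDecision:
    # Toy deterministic router standing in for the compiled policy and module.
    module = "module-" + inputs.get("locale", "default")
    model = "model-large" if inputs.get("escalated") else "model-small"
    return RoutingDecision("policy-v7", module, model)
```

The result labels where the response came from, so a reviewer can tell a retrieved response from a regenerated one without inferring it.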

This distinction matters because it determines what has to be deterministic in the system and what does not. The behavioral layer is deterministic and replayable. The model is not, and does not have to be, provided the trail makes the boundary clear. The deployments that have struggled with replay are the ones that conflated the two. The deployments that kept the two separate have not struggled.

What the trail enables operationally.

An engineered trail is expensive to build and pays back beyond the audit case. The same trail that satisfies the regulator supports internal incident review, model-comparison studies, deflection-quality analysis, and the kind of operational telemetry that operators actually need. The investment is dual-purpose.

The deployments that have leaned hardest on the trail operationally are the ones that have also passed audits most cleanly. The two reinforce each other. An operator who reviews routing decisions weekly will notice anomalies long before a regulator does, and the changes made in response are themselves recorded in a way that strengthens the next audit. This is the same compounding pattern documented in our sub-100ms work: the right primitives at the boundary determine what is possible at the center.

An audit trail is not a log. It is a contract with the regulator. The difference is what it costs to maintain.

The audit trail problem is solvable, but not by treating it as a logging concern. It has to be the primitive the routing layer is built around. Everything else — the latency budget, the cultural-module pattern, the replay framework — follows from that decision.

The deployments in the Brevor fleet that have passed every audit they have been through, in every jurisdiction they operate in, all share the same posture: the trail is the foundation, the logs are the convenience, and the boundary between them is enforced. The deployments that have had findings have always had findings of the same shape: the logs were treated as the trail, and the gap surfaced under regulator scrutiny. The architectural choice is small. The compounding consequence is large.

Recommended citation

Brevor Research, November 2022

RELATED RESEARCH

March 2026 · 12 min read

Behavioral routing at sub-100ms: lessons from 1.8B production calls.
