Why behavioral context belongs outside the model.

Behavioral context is what makes a generic model usable in a regulated deployment: the policy posture, the cultural calibration, the consent template, the escalation rule, and the audit posture that together turn a general-purpose model into a deployable system. The architectural question is where that context should live.

This paper argues that behavioral context belongs outside the model, in a layer that is versioned, audited, and swapped independently of model upgrades. The argument is the foundation of the Brevor product line, and the production patterns that follow from it are the ones we have built every subsequent system around. This is the earliest of the papers in the Brevor Research series, and the one whose conclusions have been most consistently reinforced by everything that came after.

The case against in-model context.

Context baked into the model — through fine-tuning, instruction-tuning, or system-prompt embedding — is context that cannot be swapped, audited, or held to a per-deployment policy. The deployment loses the ability to adapt without a model retrain, and the regulator loses the ability to inspect what posture the model is actually applying.

The deeper problem is that in-model context conflates two responsibilities. The model's responsibility is to produce coherent language conditioned on its inputs. The behavioral layer's responsibility is to define what those inputs should be, what policy governs the conditioning, and what the deployment commits to in front of its regulators. Conflating the two ties the deployment's policy posture to the model provider's release cadence, and that cadence is set by considerations the deployment does not control.

The practical consequence is that deployments built on in-model context age poorly. When the model provider releases a new version with different behavior, the deployment has to choose between staying on the old version — and losing the new capabilities — or upgrading and re-validating the entire policy posture from scratch. Neither outcome is sustainable across the life of a regulated deployment.

The architecture of the external layer.

An external behavioral layer sits between the deployment's surface and the model. It receives the prompt with the deployment's metadata, resolves the relevant policy and cultural module, evaluates the inputs against the policy, selects the model and the routing parameters, dispatches the call, and writes the decision to the audit trail.
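The request path described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Brevor implementation: the names Policy, Decision, and BehavioralLayer are hypothetical, the policy check is reduced to a blocked-term match, and the model is any callable that maps a prompt to a response.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Policy:
    # Hypothetical policy artifact; real policies would carry far more.
    policy_id: str
    version: str
    blocked_terms: Tuple[str, ...]
    model_name: str


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str
    policy_id: str
    policy_version: str
    model_name: str


@dataclass
class BehavioralLayer:
    policies: Dict[str, Policy]               # keyed by deployment id
    models: Dict[str, Callable[[str], str]]   # keyed by model name
    audit_trail: List[Decision] = field(default_factory=list)

    def decide(self, prompt: str, deployment_id: str) -> Decision:
        """Resolve the policy and evaluate the prompt. No model call happens here."""
        policy = self.policies[deployment_id]
        lowered = prompt.lower()
        hit = next((t for t in policy.blocked_terms if t in lowered), None)
        allowed = hit is None
        reason = "ok" if allowed else f"blocked term: {hit}"
        return Decision(allowed, reason, policy.policy_id,
                        policy.version, policy.model_name)

    def handle(self, prompt: str, deployment_id: str) -> Optional[str]:
        """Record the decision, then dispatch only if the policy allows it."""
        decision = self.decide(prompt, deployment_id)
        self.audit_trail.append(decision)
        if not decision.allowed:
            return None
        return self.models[decision.model_name](prompt)
```

Because the model is just an entry in a dictionary, replacing it leaves the policy and the audit trail untouched, which is the separation the section argues for.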

The layer is deterministic and replayable. The model is not. The boundary between the two is the discipline that lets either be replaced without rebuilding the other. A new model version can be evaluated against the same policy and the same audit posture; a new policy can be deployed against the same model. Neither change requires the other.
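One way to read "deterministic and replayable" is that the policy evaluation is a pure function of recorded inputs, so any audit entry can be re-evaluated later and compared with the verdict that was written at the time. The sketch below assumes a deliberately trivial policy (a prompt-length limit) and hypothetical names (AuditEntry, evaluate, record, replay); it illustrates the property rather than any particular production design.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AuditEntry:
    prompt: str
    policy_version: str
    max_prompt_chars: int   # the policy parameter in force at decision time
    verdict: str


def evaluate(prompt: str, max_prompt_chars: int) -> str:
    """Pure policy check: the same inputs always give the same verdict."""
    return "allow" if len(prompt) <= max_prompt_chars else "escalate"


def record(prompt: str, policy_version: str, max_prompt_chars: int) -> AuditEntry:
    """Evaluate once and capture everything needed to replay the decision."""
    verdict = evaluate(prompt, max_prompt_chars)
    return AuditEntry(prompt, policy_version, max_prompt_chars, verdict)


def replay(entries: Iterable[AuditEntry]) -> List[AuditEntry]:
    """Return the entries whose recorded verdict does not match re-evaluation."""
    return [e for e in entries
            if evaluate(e.prompt, e.max_prompt_chars) != e.verdict]
```

An empty result from replay means every recorded decision can be reproduced from its inputs. The model's output is deliberately absent from the entry, since it is the part that cannot be reproduced.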

The layer is also where the deployment's contract with its regulator lives. The policy is the artifact the regulator inspects, the audit trail is the artifact the regulator queries, and the cultural module is the artifact the regulator validates against jurisdictional norms. The model is, from the regulator's perspective, an interchangeable component bound by the policy. This framing is the one regulators have responded to most positively across every regime we have engaged with.

Three production patterns.

Three deployment patterns recur across the Brevor fleet, each appropriate to a different class of deployment.

The first is inline routing with hot-swappable context. The behavioral layer sits on the request path, resolves context per call, evaluates policy, and dispatches the model call. This is the pattern for member-facing and operator-facing interactive deployments, where latency budgets are tight and the policy is well-defined. The latency cost is the budget documented in our latency-budget work; the safety benefit is full per-call audit.

The second is side-car policy evaluation with a cryptographic chain. The behavioral layer runs alongside the request path rather than on it, evaluating policy in parallel with the model call and writing a chained audit record. This pattern fits deployments where the model call must complete with minimal added latency but the policy posture still needs to be enforced and audited. The tradeoff is that the policy cannot block the call; it can only flag, escalate, or remediate after the fact. The pattern is appropriate where the deployment's risk model permits after-the-fact enforcement.
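A chained audit record is commonly built as a hash chain, in which each record commits to the hash of the one before it. The sketch below assumes SHA-256 over a canonical JSON encoding and uses hypothetical names (GENESIS, append_record, verify_chain); the paper does not specify the chaining scheme, so treat this as one plausible construction.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed placeholder hash for the first record


def _digest(prev_hash: str, payload: Dict) -> str:
    """Hash the previous hash together with a canonical encoding of the payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(chain: List[Dict], payload: Dict) -> Dict:
    """Append a record that commits to the hash of the previous record."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    record = {"prev": prev_hash, "payload": payload,
              "hash": _digest(prev_hash, payload)}
    chain.append(record)
    return record


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every hash and check that each record links to its predecessor."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash:
            return False
        if _digest(record["prev"], record["payload"]) != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True
```

Editing or removing any earlier record changes its hash and breaks every later link, so tampering is detectable without trusting the writer of the log. In a side-car deployment the payload would carry the flag or escalation verdict, since the call itself has already been dispatched.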

The third is federated context across multi-jurisdiction deployments. A single deployment surface is served by multiple behavioral layers, one per jurisdiction, with a resolver that selects the appropriate layer per call. This is the pattern for the largest multi-national deployments, and it is the most operationally complex of the three. It is also the pattern that has aged best, because the per-jurisdiction layers can evolve independently as local guidance changes.
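The per-call resolver in the federated pattern can be pictured as a lookup from jurisdiction to layer. The sketch below is an assumption-laden illustration: JurisdictionLayer, Resolver, and UnknownJurisdiction are hypothetical names, the layer is reduced to a policy version and a consent template, and the choice to raise on an unknown jurisdiction rather than fall back to a default is a design assumption, not something the paper states.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class JurisdictionLayer:
    jurisdiction: str        # e.g. a country code
    policy_version: str
    consent_template: str


class UnknownJurisdiction(LookupError):
    """Raised when no layer is registered for the requested jurisdiction."""


class Resolver:
    def __init__(self, layers: Iterable[JurisdictionLayer]) -> None:
        self._layers: Dict[str, JurisdictionLayer] = {
            layer.jurisdiction: layer for layer in layers
        }

    def resolve(self, jurisdiction: str) -> JurisdictionLayer:
        """Select the layer for this call; refuse rather than guess a default."""
        try:
            return self._layers[jurisdiction]
        except KeyError:
            raise UnknownJurisdiction(jurisdiction) from None

    def update(self, layer: JurisdictionLayer) -> None:
        """Replace one jurisdiction's layer without touching the others."""
        self._layers[layer.jurisdiction] = layer
```

The update method is where the independence shows: a change in one jurisdiction's guidance replaces a single entry, and every other jurisdiction keeps resolving to the layer it had before.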

What this enables.

Context that lives outside the model can be versioned independently, audited separately, and swapped at deployment time without retraining. Each of these capabilities compounds.

Independent versioning lets the deployment respond to policy changes in hours rather than the months a fine-tune workflow requires. The deployments in the fleet that have used this capability most aggressively have absorbed regulatory changes mid-quarter that competitors are still working through at year-end. The architectural choice predates the regulatory event by years; the payback arrives when the event does.

Separate audit lets the regulator inspect the policy without inspecting the weights. This is the framing every regulator we have worked with has responded to, and it is the framing that survives a change in model provider. A deployment that has built its audit posture around the policy can replace its model provider without replacing its audit posture. A deployment that has built its audit posture around the model cannot.

Swap-without-retraining lets the deployment cross jurisdictions, add cultural modules, and update consent posture as the deployment grows. Each of these would be a multi-month engagement in a fine-tuned deployment. In an external-context deployment they are scheduled changes against a known artifact.

Why this has held.

This paper was the first in the Brevor Research series and it has been the foundation of every product we have shipped in the years since. The architectural argument has not had to be revised. Every subsequent paper — the audit-trail work, the cultural-module work, the latency-budget framework, the deflection-metric work, the sub-100ms architecture — has reinforced the same posture: the model is a tool, the behavioral context is the deployment, and the boundary between them is the discipline that lets either be replaced without rebuilding the other.

The rare cases where deployments in the fleet have struggled have, without exception, involved a temporary collapse of that boundary — usually under schedule pressure, occasionally under provider pressure. The collapse has always cost more to recover from than the discipline would have cost to maintain. The pattern is consistent enough that we now write the boundary into the deployment plan as a non-negotiable, and we have not regretted it.

The architectural choice compounds: deployments built on external context have aged well, while deployments built on in-model context have had to be rebuilt twice, and the rebuilds have cost more than the original deployments.

If there is a single thesis that the Brevor Research series defends, it is the one in this paper. Every subsequent paper extends it; none has overturned it. The discipline is to keep the boundary between the model and the behavioral layer clean, to treat the layer as the deployment's contract with its regulator, and to treat the model as the interchangeable component it actually is. The deployments that have committed to this posture have not had to revisit it. That is, in the end, the strongest evidence the architecture can offer.

Recommended citation

Brevor Research, March 2019
