Behavioral routing only matters if it disappears into the deployment. The operators we work with do not measure us by the elegance of the policy graph or the cleanliness of the audit export. They measure us by whether the conversation feels normal. Whether the clinician notices a pause before the model responds. Whether the underwriter has to wait long enough that the workflow breaks.
At 1.8 billion calls per quarter across forty-one production deployments, sub-100ms is the threshold below which the behavioral layer stops being something operators have to think about. Above it, every conversation carries a small tax that compounds across the day. This paper documents the architectural decisions that held at that scale, the ones that did not, and the rebuilds underway. It is a working document, not a finished one. The engine that ran our first thousand calls is unrecognizable from the engine that runs today, and the engine that runs today will be unrecognizable a year from now.
What we mean by sub-100ms.
The number we defend is the p95 routing overhead measured between the moment a prompt arrives at the behavioral layer and the moment the layer hands control to the selected model. It does not include model inference time, which we do not control, and it does not include network egress to the model provider, which we measure separately.
We defend p95 rather than median because the tail is where deployments break. A median of 40ms with a p95 of 600ms produces a deployment in which one call in twenty feels broken, and operators do not forgive one call in twenty. We learned this in the first health-system deployment that crossed a million calls. The median was excellent. The complaints were entirely about the tail.
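To make the gap between the two statistics concrete, here is a minimal sketch using a nearest-rank percentile. The sample is hypothetical (six slow calls in every hundred, chosen so the tail is visible), and the `percentile` helper is written for illustration rather than taken from any production code.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of pct% of n, which avoids floating-point rounding at the boundary.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latency sample in milliseconds: 94 fast calls and 6 slow ones.
latencies_ms = [40] * 94 + [600] * 6

median_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
print(f"median={median_ms}ms p95={p95_ms}ms")
```

The median reports a healthy system while the p95 reports the experience of the calls that generate complaints.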
The budget breaks down approximately as follows in a healthy deployment: 8ms for context fetch from the warm tier, 22ms for policy evaluation against the per-deployment policy graph, 14ms for cultural-module resolution, and 18ms for routing decision and provenance write. Those four stages total 62ms; the rest of the budget is absorbed by the boundaries between them. None of these numbers were the targets we started with. All of them are the result of two or three rebuilds.
What scaled cleanly.
Two subsystems held without modification across two orders of magnitude, and both held for the same reason: they were designed to be append-only with bounded fan-out.
The decision-replay store is the canonical example. Every routing decision writes a single immutable record that contains the inputs, the policy applied, the model selected, the response identifier, and a cryptographic chain to the previous decision in the same deployment. The store was sized at pilot for ten million records and has crossed nine billion without architectural change. The reason is uninteresting and important: append-only stores scale, and the fan-out of a single decision write is exactly one.
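The shape of such a record can be sketched in a few lines. This is an illustration under stated assumptions, not the production store: the class name `DecisionLog`, the field names, the use of SHA-256 over canonical JSON, and the all-zero genesis hash are all hypothetical choices.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for the first record in a deployment


def _digest(body):
    # Hash a canonical JSON encoding so the same body always yields the same digest.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class DecisionLog:
    """Append-only log in which each record carries the hash of its predecessor."""

    def __init__(self):
        self.records = []

    def append(self, inputs, policy_id, model_id, response_id):
        prev_hash = self.records[-1]["hash"] if self.records else GENESIS_HASH
        body = {
            "inputs": inputs,
            "policy_id": policy_id,
            "model_id": model_id,
            "response_id": response_id,
            "prev_hash": prev_hash,
        }
        record = {**body, "hash": _digest(body)}
        self.records.append(record)  # exactly one write per decision
        return record

    def verify(self):
        """Recompute every hash and check that each link points at its predecessor."""
        prev_hash = GENESIS_HASH
        for record in self.records:
            body = {k: v for k, v in record.items() if k != "hash"}
            if record["prev_hash"] != prev_hash or record["hash"] != _digest(body):
                return False
            prev_hash = record["hash"]
        return True


log = DecisionLog()
log.append({"prompt_id": "p1"}, "policy_v3", "model_a", "resp_1")
log.append({"prompt_id": "p2"}, "policy_v3", "model_b", "resp_2")
print(log.verify())
```

Because each record depends only on the one before it, a write never fans out, and any later edit to an earlier record breaks verification from that point forward.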
The policy-evaluation kernel is the other. We made the call early to compile each per-deployment policy graph into a flat decision table at deployment time rather than evaluating it as a tree at call time. The compile is expensive — sometimes several seconds for the largest financial-services deployments — and the runtime cost is a single table lookup. This was the right tradeoff and it has stayed the right tradeoff. The argument for it is documented in our earlier work on why behavioral context belongs outside the model, and the production data has reinforced rather than complicated that argument.
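The tradeoff is easier to see in miniature. The sketch below flattens a small decision tree into a dictionary keyed by attribute values, so that routing becomes one lookup. The tree layout, the attribute names, and the model identifiers are hypothetical, and the sketch assumes every path through the tree tests the same attributes in the same order.

```python
def compile_policy(node, path=()):
    """Flatten a decision tree into {tuple_of_attribute_values: outcome}."""
    if not isinstance(node, dict):
        return {path: node}  # leaf: the outcome for this combination of values
    table = {}
    for value, child in node["branches"].items():
        table.update(compile_policy(child, path + (value,)))
    return table


# Hypothetical per-deployment policy graph, expressed as a tree.
POLICY_TREE = {
    "attr": "jurisdiction",
    "branches": {
        "CA": {"attr": "channel", "branches": {"chat": "model_a", "voice": "model_b"}},
        "NY": {"attr": "channel", "branches": {"chat": "model_a", "voice": "model_c"}},
    },
}
ATTR_ORDER = ("jurisdiction", "channel")  # order in which the tree tests attributes

# Paid once, at deployment time.
DECISION_TABLE = compile_policy(POLICY_TREE)


def route(request):
    # Paid on every call: build the key and do a single dictionary lookup.
    return DECISION_TABLE[tuple(request[attr] for attr in ATTR_ORDER)]


print(route({"jurisdiction": "NY", "channel": "voice"}))
```

The table grows with the number of attribute combinations, which is one plausible reason the compile step is expensive for large policy graphs while the per-call cost stays flat.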
What broke at scale.
Context loading was the first bottleneck and the most painful. The pilot architecture fetched the relevant cultural module, deployment policy, and member context on every call. It worked at pilot load and it crumbled in production. The fetch path became the long tail.
The rewrite moved to a three-tier cache: a hot tier in process memory holding the active modules for the deployment, a warm tier in a co-located store holding modules with recent activity, and a cold tier for everything else. A per-deployment warmth signal — driven by call volume and operator schedule — promotes and demotes modules between tiers ahead of demand. The hot tier hit rate is now above 96% in mature deployments and the warm tier absorbs almost all of the remainder. The cold tier is hit by fewer than one call in two thousand.
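A stripped-down version of the tiering logic looks roughly like the following. It is a sketch only: `TieredModuleCache`, `cold_loader`, and the `promote` and `demote` methods are hypothetical names, both in-memory dictionaries stand in for what the text describes as process memory and a co-located store, and the warmth signal is assumed to be an external scheduler that calls `promote` and `demote`.

```python
class TieredModuleCache:
    """Look up a module in the hot tier, then the warm tier, then the cold loader."""

    def __init__(self, cold_loader):
        self.hot = {}    # stand-in for process memory
        self.warm = {}   # stand-in for a co-located store
        self.cold_loader = cold_loader
        self.hits = {"hot": 0, "warm": 0, "cold": 0}

    def get(self, module_id):
        if module_id in self.hot:
            self.hits["hot"] += 1
            return self.hot[module_id]
        if module_id in self.warm:
            self.hits["warm"] += 1
            return self.warm[module_id]
        self.hits["cold"] += 1
        module = self.cold_loader(module_id)
        self.warm[module_id] = module  # recent activity keeps it in the warm tier
        return module

    def promote(self, module_id):
        """Move a module into the hot tier ahead of expected demand."""
        if module_id not in self.hot:
            module = self.warm.pop(module_id, None)
            if module is None:
                module = self.cold_loader(module_id)
            self.hot[module_id] = module

    def demote(self, module_id):
        """Move a module from the hot tier back to the warm tier."""
        if module_id in self.hot:
            self.warm[module_id] = self.hot.pop(module_id)


cache = TieredModuleCache(cold_loader=lambda module_id: {"id": module_id})
cache.get("us-ca")      # cold fetch, lands in the warm tier
cache.get("us-ca")      # warm hit
cache.promote("us-ca")  # a warmth signal would trigger this ahead of demand
cache.get("us-ca")      # hot hit
print(cache.hits)
```

The point of promoting ahead of demand is that the cold fetch happens off the call path, so the first call of an operator's shift already finds the module hot.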
The audit-export path was the second. Regulators do not call us on the hot path, but they do call us on a schedule that the pilot architecture treated as a background concern. When a single export query started touching a billion records, the background concern stopped being background. The export path now runs against a read-optimized projection that is built incrementally as decisions are written, and the export itself is a range scan rather than a query. This is the same pattern documented in our work on the audit trail problem in regulated AI, applied at a scale we did not anticipate when we wrote it.
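The difference between a query and a range scan can be shown with a small in-memory projection. This is an illustrative sketch: `ExportProjection`, its method names, and the integer timestamps are hypothetical, and a sorted Python list stands in for whatever ordered storage a real projection would use.

```python
import bisect


class ExportProjection:
    """Read-optimized view kept in timestamp order as decisions are written."""

    def __init__(self):
        self._keys = []   # sorted (timestamp, decision_id) pairs
        self._rows = {}   # decision_id -> export-ready summary

    def on_decision_written(self, timestamp, decision_id, summary):
        # Incremental maintenance: one ordered insert per decision.
        bisect.insort(self._keys, (timestamp, decision_id))
        self._rows[decision_id] = summary

    def export(self, start, end):
        """Return summaries with start <= timestamp < end, in time order."""
        # A one-element tuple sorts before any (timestamp, id) pair with the same timestamp.
        lo = bisect.bisect_left(self._keys, (start,))
        hi = bisect.bisect_left(self._keys, (end,))
        return [self._rows[decision_id] for _, decision_id in self._keys[lo:hi]]


projection = ExportProjection()
projection.on_decision_written(30, "d3", "routed to model_c")
projection.on_decision_written(10, "d1", "routed to model_a")
projection.on_decision_written(20, "d2", "routed to model_b")
print(projection.export(10, 30))
```

The export cost depends on the size of the requested range rather than on the size of the whole store, because the two binary searches locate the slice and nothing outside it is touched.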
The third break was less obvious. Cultural-module resolution, which had been a constant-time lookup at pilot, became a hot spot once deployments crossed multiple jurisdictions per call. A single member interaction in a multi-state insurance deployment can touch three or four cultural modules, and the naive resolution path evaluated them sequentially. The fix was to resolve modules in parallel against a precomputed compatibility matrix per deployment, which collapsed the cost from the sum of the per-module lookups to roughly the cost of the slowest one.
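A minimal sketch of the parallel path, using `asyncio.gather` from the Python standard library. The compatibility matrix contents, the module identifiers, and the `resolve_module` function are hypothetical; the 10ms sleep stands in for a cache or store lookup.

```python
import asyncio

# Hypothetical precomputed matrix: which modules apply to a given set of jurisdictions.
COMPATIBLE_MODULES = {
    frozenset({"CA", "NV"}): ("mod_ca", "mod_nv", "mod_us_west"),
}


async def resolve_module(module_id):
    await asyncio.sleep(0.01)  # stand-in for one module lookup
    return {"id": module_id}


async def resolve_all(jurisdictions):
    # The matrix lookup replaces any per-call reasoning about which modules combine.
    module_ids = COMPATIBLE_MODULES[frozenset(jurisdictions)]
    # gather runs the lookups concurrently and returns results in argument order.
    return await asyncio.gather(*(resolve_module(m) for m in module_ids))


modules = asyncio.run(resolve_all(["CA", "NV"]))
print([m["id"] for m in modules])
```

With three modules, the sequential path would wait for three lookups back to back, while the gathered path waits for about one.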
What we'd build differently.
The policy compiler should have been ahead-of-time from the start. We treated it as a per-deployment runtime concern for two years longer than we should have, and the operational cost compounded. Every deployment carried a small ongoing tax that turned into a significant tax once we had forty of them. The ahead-of-time compile lifted that tax permanently, but it required a rebuild of the deployment pipeline that would have been much cheaper to do at year one.
The second thing we would build differently is the boundary between routing and provenance. We treated the provenance write as a side effect of the routing decision, which meant that under load the two contended for the same write path. Splitting them, with routing on the hot path and provenance on a guaranteed-delivery async queue, was the single largest tail-latency improvement we have ever shipped. The architectural cost of treating provenance as a side effect for too long was about 80ms of p99 latency.
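The split can be sketched with a queue and a background writer. This is an illustration, not the production design: the function and variable names are hypothetical, and the in-process `queue.Queue` used here does not provide guaranteed delivery, so a real system would need a durable queue in its place.

```python
import queue
import threading

provenance_queue = queue.Queue()
provenance_log = []  # stand-in for the durable provenance store


def provenance_writer():
    """Drain the queue off the hot path; None is the shutdown signal."""
    while True:
        record = provenance_queue.get()
        if record is None:
            provenance_queue.task_done()
            break
        provenance_log.append(record)  # stand-in for a durable write
        provenance_queue.task_done()


def route(request, decision_table):
    # Hot path: one lookup, one enqueue, no waiting on the provenance write.
    model_id = decision_table[request["policy_key"]]
    provenance_queue.put({"request_id": request["id"], "model_id": model_id})
    return model_id


writer = threading.Thread(target=provenance_writer, daemon=True)
writer.start()

selected = route({"id": "r1", "policy_key": "default"}, {"default": "model_a"})
provenance_queue.join()  # only so this example can inspect the log deterministically
print(selected, provenance_log)
```

The routing call returns as soon as the record is enqueued, so a slow provenance write lengthens the queue rather than the caller's wait.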
The third is observability. We instrumented the routing engine the way one instruments a service: counters, histograms, traces. None of that was wrong, but none of it was sufficient. The instrumentation we needed was the instrumentation that lets an operator answer a regulator's question about a specific decision six months later. We have that now, and the discipline of building it has changed how we think about every new subsystem. It is the same pattern, again, from the audit-trail work: the right primitives at the boundary determine what is possible at the center.
What the next two years look like.
The next architectural rebuild is already underway. The policy compiler is being moved from a per-deployment artifact to a per-tenant artifact, with module overlays applied at routing time rather than compile time. This lets us push policy updates without recompilation and recovers a class of deployment that has historically been too dynamic for the ahead-of-time path.
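One way to picture overlays applied at routing time is a layered lookup in which a small per-deployment mapping shadows a shared per-tenant table. The sketch below uses `collections.ChainMap` from the Python standard library. It is an assumption about the general shape of the idea, not a description of the engine being built, and every key and identifier in it is hypothetical.

```python
from collections import ChainMap

# Compiled once per tenant (hypothetical contents).
TENANT_TABLE = {
    ("claims", "chat"): "model_a",
    ("claims", "voice"): "model_b",
}

# Small per-deployment overlays that can change without recompiling the tenant table.
OVERLAYS = {
    "deployment_ny": {("claims", "voice"): "model_c"},
}


def route(deployment_id, key):
    # ChainMap searches the overlay first and falls back to the tenant table.
    view = ChainMap(OVERLAYS.get(deployment_id, {}), TENANT_TABLE)
    return view[key]


print(route("deployment_ny", ("claims", "voice")))
print(route("deployment_tx", ("claims", "voice")))

# A policy update touches only the overlay; the tenant table is left as compiled.
OVERLAYS["deployment_tx"] = {("claims", "chat"): "model_d"}
print(route("deployment_tx", ("claims", "chat")))
```

The routing-time cost is at most two dictionary lookups, which is the kind of overhead that would need to fit inside the existing budget.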
The second rebuild is the routing decision itself. Today the decision is a single deterministic evaluation against the compiled table. The next-generation engine will support a small set of bounded probabilistic decisions for cases where the policy explicitly defines a tolerance band — primarily in the cultural-module space, where strict determinism has produced brittle behavior at jurisdictional edges. The discipline here is that the probabilistic decision is itself audited, and the policy that permits it is the artifact regulators inspect.
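A bounded probabilistic decision with its audit record might look something like the following. The scoring scheme, the `decide` function, and the audit fields are hypothetical and serve only to show the discipline described above: the tolerance comes from policy, the choice is restricted to candidates inside the band, and the decision records what was eligible.

```python
import random


def decide(candidates, tolerance, rng, audit_log):
    """Pick one candidate whose score is within `tolerance` of the best score.

    candidates: list of (module_id, score) pairs.
    tolerance: band width defined by policy; 0 reduces to a deterministic choice
               whenever the best score is unique.
    """
    best = max(score for _, score in candidates)
    eligible = [module_id for module_id, score in candidates if best - score <= tolerance]
    choice = rng.choice(eligible)
    # The audit record captures the band and the alternatives, not just the outcome.
    audit_log.append({"eligible": eligible, "tolerance": tolerance, "choice": choice})
    return choice


audit_log = []
rng = random.Random(0)  # seeded so the example is repeatable
candidates = [("mod_ca", 0.90), ("mod_nv", 0.85), ("mod_tx", 0.50)]

choice = decide(candidates, tolerance=0.10, rng=rng, audit_log=audit_log)
strict = decide(candidates, tolerance=0.0, rng=rng, audit_log=audit_log)
print(choice, strict, audit_log)
```

The bound is what keeps the behavior inspectable: a reviewer can check from the audit record alone that the chosen module was inside the band the policy allowed.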
Neither rebuild changes the sub-100ms target. Both are designed around it. If anything they tighten it: the per-tenant compile recovers headroom that the per-deployment compile spent, and the bounded probabilistic decisions are cheaper to evaluate than the deterministic ones they replace. The target moves down to a p95 of 80ms over the next four quarters, with a stretch goal of 65ms in the largest deployments by the end of the planning horizon.
Sub-100ms is not just a latency target. It is the threshold below which the behavioral layer stops being something operators have to think about.
The architecture that runs the routing engine today is not the architecture we shipped on day one, and it is not the architecture we will run a year from now. The discipline is in knowing which parts to rebuild and when, and in being honest with operators about which parts are working and which parts are temporary.
The deeper lesson from 1.8 billion calls per quarter is that the right primitives compound and the wrong ones tax. Append-only stores compound. Ahead-of-time compilation compounds. Provenance as a first-class artifact compounds. Treating any of those as a convenience rather than a primitive taxes the deployment for years. We have paid that tax twice now. We will not pay it a third time.