Every behavioral routing decision costs something. The question for any deployment is not whether to pay that cost but where to spend it. A budget framed in the wrong units produces a deployment that feels slow on the calls that matter and over-engineered on the calls that do not.
This paper proposes a framework for allocating latency budget across use-case tiers. The framework is the one we use internally when we scope a new deployment, and it has held up across forty-plus production environments. The targets are not aspirational. They are the numbers below which the deployment ships without operator complaints and above which it does not.
Why budget at all.
Most teams that fail at this stage fail because they treat latency as a single number. They set a target (say, 100ms) and apply it uniformly. The result is a deployment that over-engineers background calls, which could tolerate far more overhead, and under-serves the calls a human is waiting on, which need less.
The right approach is to budget by tier. Each tier has its own target, its own tolerance for tail behavior, and its own permitted optimizations. The budgets are not independent; they share a common policy graph and a common context store. But the calls themselves run through different paths, and the paths are sized differently.
This is the same architectural posture documented in our sub-100ms work: the routing engine is a single system with multiple disciplines. The disciplines are what the tiers express.
Three tiers of latency budget.
The first tier is member-facing interactive. A member typing a message in a portal, a caller on a voice line, a patient in a triage flow. The target here is 80ms p95 for routing overhead. The user is waiting; the workflow assumes the response will arrive in a perceptually continuous way. Above 80ms, the conversation feels stilted. Above 150ms, it feels broken.
The second tier is operator-facing. An underwriter, a clinician, a case manager. These users are domain experts running a tool, not consumers waiting on a chat. They tolerate more overhead because they understand what is happening. The target is 120ms p95, with a hard ceiling of 250ms before the tool starts to feel sluggish. The same routing engine serves both tiers; the difference is which fast-paths and which audit paths are enabled.
The third tier is background and batch. Nightly summarization, scheduled audit-trail compaction, retrospective routing for cohort analysis. The target here is throughput, not latency, and a single call can spend 500ms or more on routing overhead without operator impact. The budget is set by the wall-clock window the batch must complete in, not by a per-call number.
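A minimal sketch of how the three tiers might be written down as data, assuming a Python configuration layer. The figures are the ones quoted above; the names `TierBudget`, `classify`, and `batch_per_call_window_ms` are illustrative and not part of any existing API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierBudget:
    name: str
    p95_target_ms: Optional[float]  # None for batch: judged by window, not per call
    ceiling_ms: Optional[float]     # point past which the experience degrades badly


# Numbers come from the text; the structure is an assumption for illustration.
TIERS = {
    "member": TierBudget("member-facing interactive", 80.0, 150.0),
    "operator": TierBudget("operator-facing", 120.0, 250.0),
    "batch": TierBudget("background and batch", None, None),
}


def classify(tier: str, p95_ms: float) -> str:
    """Return 'ok', 'over_target', or 'over_ceiling' for an observed p95 overhead."""
    budget = TIERS[tier]
    if budget.p95_target_ms is None:
        return "ok"  # batch has no per-call latency target
    if p95_ms <= budget.p95_target_ms:
        return "ok"
    if p95_ms <= budget.ceiling_ms:
        return "over_target"
    return "over_ceiling"


def batch_per_call_window_ms(window_s: float, calls: int, concurrency: int = 1) -> float:
    """Average wall-clock time available per call if the batch must finish in window_s."""
    return window_s * 1000.0 * concurrency / calls
```

For example, a one-hour window over 10,000 calls with no concurrency leaves 360ms of wall-clock time per call on average, which is why a 500ms routing overhead on an individual call is tolerable only when the batch is small or runs in parallel.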
Fast-paths for low-risk prompts.
Not every prompt needs the full behavioral evaluation. A prompt that asks for the deployment's hours of operation does not need the cultural module, the consent template, or the escalation policy. A prompt that asks for an account balance is in scope for a narrow policy that can be evaluated against a small table without touching the full graph.
The fast-path is the architectural recognition of this. A prompt that matches a well-understood pattern (defined per deployment, audited as a first-class artifact) bypasses the full evaluation and routes through a precompiled narrow policy. The latency saving is 30 to 50ms in practice, which against an 80ms target can be the difference between hitting the member-facing budget and missing it.
The risk of the fast-path is that it becomes a back door. The discipline is that every fast-path is itself audited, and every prompt that takes the fast-path is logged with the pattern it matched. The audit-trail work documents the primitives that make this safe; the policy that defines which patterns are fast-path-eligible is the artifact regulators review. We have never had a deployment in which the fast-path was the source of a finding, and the reason is that the fast-path is treated as policy, not as optimization.
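The logging discipline described above can be sketched as follows. This is an illustration under stated assumptions, not the production router: the pattern ids, regular expressions, policy names, and the in-memory `audit_log` are all hypothetical stand-ins.

```python
import re
from typing import List, NamedTuple


class FastPathPattern(NamedTuple):
    pattern_id: str     # identifier written to the audit log
    regex: re.Pattern   # what the prompt must match
    narrow_policy: str  # name of the precompiled narrow policy


# Hypothetical patterns. In a real deployment these are defined per deployment
# and reviewed as policy; they are kept deliberately narrow.
PATTERNS = [
    FastPathPattern("hours-v1", re.compile(r"\bhours of operation\b", re.I), "static_info"),
    FastPathPattern("balance-v1", re.compile(r"\baccount balance\b", re.I), "balance_lookup"),
]

audit_log: List[dict] = []  # stand-in for the real audit trail


def route(prompt: str) -> str:
    """Route via the first matching fast-path pattern, else full evaluation."""
    for p in PATTERNS:
        if p.regex.search(prompt):
            # Every fast-path hit records which pattern it matched.
            audit_log.append({"path": "fast", "pattern_id": p.pattern_id,
                              "policy": p.narrow_policy})
            return p.narrow_policy
    audit_log.append({"path": "full", "pattern_id": None, "policy": "full_evaluation"})
    return "full_evaluation"
```

The point of the design is that the pattern table is the reviewable artifact: a prompt cannot take the fast-path without leaving a record of which reviewed pattern admitted it.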
Hot-swappable context.
Context modules that load per-call dominate the budget. A cultural module that is fetched from cold storage on every call adds 40 to 80ms of overhead, which at the upper end consumes the entire 80ms member-facing target on its own, and produces a tail that defeats the entire budgeting exercise. The only sustainable architecture is hot-swappable: modules are loaded into the engine ahead of demand and held warm.
Hot-swappability has a second benefit beyond latency: it makes policy updates safe. A new cultural module can be loaded, validated, and promoted to the hot tier without restarting the routing engine. The audit trail records the moment of promotion as a first-class event, which gives the regulator a clean record of when policy changed.
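A minimal sketch of the load, validate, promote sequence, assuming modules are plain dictionaries held in process memory. The `HotTier` class, its method names, and the shape of the audit event are assumptions made for illustration.

```python
import threading
import time
from typing import Callable, Dict, List


class HotTier:
    """Holds validated context modules warm and swaps them without a restart."""

    def __init__(self) -> None:
        self._modules: Dict[str, dict] = {}
        self._lock = threading.Lock()
        self.audit_events: List[dict] = []  # stand-in for the real audit trail

    def promote(self, name: str, version: str, module: dict,
                validate: Callable[[dict], bool]) -> bool:
        """Validate a candidate module and, if it passes, swap it in atomically."""
        # Validation runs outside the lock so a slow check never stalls readers.
        if not validate(module):
            return False  # the previously promoted version keeps serving
        with self._lock:
            self._modules[name] = {"version": version, "body": module}
            # The promotion itself is recorded as an event with a timestamp.
            self.audit_events.append({"event": "promoted", "module": name,
                                      "version": version, "ts": time.time()})
        return True

    def get(self, name: str) -> dict:
        """Return the currently promoted module body (no cold-storage fetch)."""
        with self._lock:
            return self._modules[name]["body"]
```

A module that fails validation never reaches the hot tier, so readers continue to see the last promoted version, and the audit record shows exactly one promotion per policy change.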
The hot tier is sized per deployment, not per tenant. A multi-jurisdiction insurance deployment may keep twelve cultural modules warm; a single-state credit union may keep two. The sizing is part of the deployment plan, not part of the runtime configuration, and it is reviewed at every quarterly deployment review. Operators who try to manage this at runtime end up with a hot tier that drifts, and a drifting hot tier produces a drifting tail.
What breaks the budget.
Three failure modes dominate. The first is uncached context: a deployment that grows new modules faster than the warmth signal can promote them. The fix is to expand the hot tier ahead of the deployment growth, not to wait for the tail to degrade and react.
The second is policy bloat. A policy graph that grows linearly with deployment age — every new rule added, none ever removed — eventually exceeds the budget the compiler was sized for. The discipline is quarterly policy review with explicit retirement of rules that no longer trigger. We have one deployment that has retired 38% of the rules it shipped with, and it is among the fastest in the fleet.
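As a sketch of what the quarterly review might start from, the function below lists rules that have not triggered recently. The 90-day threshold, the function name, and the `last_triggered` mapping are assumptions; the output is a candidate list for human review, not an automatic deletion.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def retirement_candidates(last_triggered: Dict[str, Optional[datetime]],
                          now: datetime,
                          idle_days: int = 90) -> List[str]:
    """Return rules that have never fired or have not fired within idle_days."""
    cutoff = now - timedelta(days=idle_days)
    return sorted(
        rule for rule, ts in last_triggered.items()
        if ts is None or ts < cutoff
    )
```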
The third is provenance contention. Treating the audit write as a side effect of routing — the mistake we documented in the sub-100ms work — produces a budget that holds at median and collapses at p99. The fix is structural: provenance writes go through a guaranteed-delivery async path that does not block the routing decision. This is non-negotiable in any deployment we run.
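The structural point, that the routing decision returns without waiting on the audit write, can be sketched with a queue and a background worker. This is an illustration only: an in-memory queue does not by itself provide guaranteed delivery, which in practice needs a durable log or broker behind it. The names `ProvenanceWriter` and `route_with_provenance` are hypothetical.

```python
import queue
import threading
from typing import List


class ProvenanceWriter:
    """Accepts audit records immediately and writes them on a background thread."""

    def __init__(self) -> None:
        self._q: "queue.Queue[dict]" = queue.Queue()
        self.written: List[dict] = []  # stand-in for the durable audit store
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, record: dict) -> None:
        # put() on an unbounded queue returns at once; routing never waits on the store.
        self._q.put(record)

    def _drain(self) -> None:
        while True:
            record = self._q.get()
            try:
                self.written.append(record)  # replace with the real (slow) audit write
            finally:
                self._q.task_done()

    def flush(self) -> None:
        """Block until every submitted record has been written (shutdown, tests)."""
        self._q.join()


def route_with_provenance(prompt: str, writer: ProvenanceWriter) -> str:
    decision = "full_evaluation"  # placeholder for the real routing decision
    writer.submit({"prompt": prompt, "decision": decision})
    return decision               # returned without waiting for the write
```

Because the write is off the decision path, a slow audit store shows up as queue depth rather than as p99 routing latency, which is the failure the paragraph above describes.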
Latency is the tax on safety. Behavioral routing is worth it only if the tax stays small enough that nobody notices it.
Latency budgeting is one of the few areas where the right answer depends almost entirely on the deployment. The framework here is a starting point. The per-deployment work is where the budget gets defended, and the discipline of defending it is what separates a deployment that ages well from one that does not.
The deployments in the Brevor fleet that have held their budgets for multiple years all share two habits: they review the policy graph quarterly and they treat the hot tier as a deployment artifact, not a runtime knob. The deployments that have not held their budgets share the opposite habits. The framework is necessary; the discipline is what makes it work.