Measuring deflection that matters.

Deflection rate — the share of inquiries resolved without human intervention — is the headline metric most teams adopt when they deploy a behavioral AI system. It is also a vanity metric that incentivizes the wrong behaviors and obscures the outcomes the deployment was built to produce.

This paper proposes three replacement metrics that correlate with deployment value and that the deployments in the Brevor fleet have adopted in place of, or alongside, deflection rate. The metrics are harder to compute and harder to celebrate. They are the ones that age well.

Why deflection rate is misleading.

A high deflection rate can mean the system handled inquiries well, or it can mean the system aggressively closed inquiries that should have escalated, or it can mean the system handled the easy inquiries and routed the hard ones to a queue the operator does not measure. The metric does not distinguish among the three outcomes.

The second-order problem is that deflection rate creates incentives that produce the wrong system. A deployment optimized for deflection will tend toward over-confident responses on edge cases, premature closure of ambiguous interactions, and routing decisions that look good on the dashboard and feel bad to the member. We have seen deployments cross 80% measured deflection while member-satisfaction scores moved in the opposite direction. The dashboard was right and useless.

The third-order problem is that the metric is hard to argue with internally. A deflection rate that has gone up is a metric an operator can defend in a quarterly review. A deployment that has held a measured-value metric flat while replacing 15% of the routing volume with a higher-quality path is harder to defend, even though it is the better outcome. The political weight of the wrong metric is part of why it persists.

Completion-without-escalation.

The first replacement metric is the share of inquiries that completed to the member's satisfaction without requiring a human handoff. This is harder to measure than deflection rate, because it depends on a definition of completion that has to be deployment-specific. A clinical-triage deployment defines completion differently from an underwriting deployment.

The Brevor pattern is to define completion per deployment as a small set of observable outcomes (the member took the next action, the case was closed in the system of record, the follow-up survey came back positive) and to measure the AI's contribution as the share of inquiries that reached completion without crossing the escalation threshold. The metric is computed against the same audit trail that satisfies regulator review, which means it is computed once and is consistent across the deployment.

A mature deployment carries a completion-without-escalation rate that is lower than its deflection rate, sometimes by 10 to 20 percentage points. That gap is the most honest number in the deployment. Closing the gap is the work.
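To make the gap concrete, here is a minimal sketch of how the two rates could be computed from the same records. The `Inquiry` fields, the `any_outcome` completion rule, and the sample data are all hypothetical. A real deployment would substitute its own definition of completion and read the fields from its audit trail.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Inquiry:
    # Hypothetical fields, assumed to be derivable from the audit trail.
    escalated: bool         # crossed the escalation threshold (human handoff)
    took_next_action: bool  # the member took the next action
    case_closed: bool       # the case was closed in the system of record
    survey_positive: bool   # the follow-up survey came back positive


def any_outcome(inq: Inquiry) -> bool:
    # Assumed completion rule: any one observable outcome counts.
    # A deployment may instead require all of them, or a different set.
    return inq.took_next_action or inq.case_closed or inq.survey_positive


def deflection_rate(inquiries: Sequence[Inquiry]) -> float:
    # Share of inquiries that never reached a human.
    if not inquiries:
        raise ValueError("no inquiries to measure")
    return sum(not i.escalated for i in inquiries) / len(inquiries)


def completion_without_escalation_rate(
    inquiries: Sequence[Inquiry],
    is_complete: Callable[[Inquiry], bool] = any_outcome,
) -> float:
    # Share of inquiries that reached completion AND never reached a human.
    if not inquiries:
        raise ValueError("no inquiries to measure")
    hits = sum((not i.escalated) and is_complete(i) for i in inquiries)
    return hits / len(inquiries)


# Illustrative data only: ten inquiries, not drawn from any deployment.
inquiries = (
    [Inquiry(False, True, True, True)] * 6       # resolved by the system
    + [Inquiry(False, False, False, False)] * 2  # closed by the system, unresolved
    + [Inquiry(True, False, True, False)] * 2    # escalated to a human
)
d = deflection_rate(inquiries)
c = completion_without_escalation_rate(inquiries)
print(f"deflection {d:.0%}, completion {c:.0%}, gap {d - c:.0%}")
# prints: deflection 80%, completion 60%, gap 20%
```

The two inquiries that were closed without any completion outcome count toward deflection but not toward completion. They are the gap.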

Audit-clean rate.

The second metric is the share of routing decisions that pass audit review without remediation. The denominator is the sample of decisions reviewed by the credentialed audit function in the deployment. The numerator is the subset for which no remediation is required.

This metric is unusual in that it does not depend on the AI's accuracy directly. It depends on the AI's policy posture being aligned with what the audit function would have done in the same situation. A deployment can have a high audit-clean rate with modest accuracy if its policy is appropriately conservative, and it can have a low audit-clean rate with high accuracy if its policy is misaligned with the institution's standards.

We have found audit-clean rate to be the metric that correlates most strongly with the long-run health of a deployment. Deployments that hold an audit-clean rate above 95% across a meaningful sample tend to renew. Deployments that drift below 90% tend to enter remediation cycles. The metric is leading; the renewal is lagging.
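The calculation itself is simple, as the sketch below suggests. The function names and the band labels ("healthy", "watch", "at risk") are assumptions made for illustration. Only the 95% and 90% levels come from the observation above, and they describe a tendency, not a rule.

```python
from typing import Sequence


def audit_clean_rate(needs_remediation: Sequence[bool]) -> float:
    # Denominator: decisions sampled and reviewed by the audit function.
    # Numerator: reviewed decisions for which no remediation was required.
    if not needs_remediation:
        raise ValueError("no reviewed decisions in the sample")
    clean = sum(1 for flagged in needs_remediation if not flagged)
    return clean / len(needs_remediation)


def health_band(rate: float, healthy: float = 0.95, at_risk: float = 0.90) -> str:
    # Hypothetical labels for the ranges described in the text.
    if rate > healthy:
        return "healthy"
    if rate < at_risk:
        return "at risk"
    return "watch"


# Illustrative sample: 200 reviewed decisions, 6 of which needed remediation.
sample = [True] * 6 + [False] * 194
rate = audit_clean_rate(sample)
print(f"{rate:.1%} -> {health_band(rate)}")
# prints: 97.0% -> healthy
```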

Override frequency.

The third metric is the frequency with which credentialed reviewers override the system's decision when given the opportunity. The metric is computed only on the slice of interactions where a reviewer was in the loop, but that slice tends to be the most informative one.

A low override frequency is not unambiguously good. It can mean the system's decisions are aligned with the reviewer's; it can also mean the reviewer is not engaged. Operators distinguish between the two by sampling: a small share of low-override interactions is independently reviewed to confirm that the alignment is real and not a function of disengagement. The sampling itself is part of the deployment's quality posture.

A high override frequency is always a signal worth investigating. It is sometimes a sign that the policy is misaligned with the institution; it is sometimes a sign that the reviewer is overriding inappropriately. Both cases deserve attention, and the audit trail that supports the metric also supports the investigation.
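A minimal sketch of both steps, the frequency and the engagement sample, might look like the following. The `Review` record, the 5% sampling share, and the reading of "low-override interactions" as interactions the reviewer accepted without an override are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Review:
    # Hypothetical record of one reviewer-in-the-loop interaction.
    interaction_id: str
    reviewer_id: str
    overridden: bool  # the reviewer overrode the system's decision


def override_frequency(reviews: Sequence[Review]) -> float:
    # Computed only on interactions where a reviewer was in the loop.
    if not reviews:
        raise ValueError("no reviewed interactions")
    return sum(r.overridden for r in reviews) / len(reviews)


def sample_for_independent_review(
    reviews: Sequence[Review], share: float = 0.05, seed: int = 0
) -> List[Review]:
    # Draw a small random share of accepted (non-overridden) decisions
    # for a second, independent look. The seed makes the draw repeatable.
    accepted = [r for r in reviews if not r.overridden]
    if not accepted:
        return []
    k = min(len(accepted), max(1, round(share * len(accepted))))
    return random.Random(seed).sample(accepted, k)


# Illustrative data: 100 reviewed interactions, 4 overridden.
reviews = [
    Review(f"int-{n}", f"rev-{n % 3}", overridden=(n < 4)) for n in range(100)
]
freq = override_frequency(reviews)
second_look = sample_for_independent_review(reviews)
print(f"override frequency {freq:.0%}, {len(second_look)} sent for independent review")
```

If the independent reviewers disagree with the sampled decisions markedly more often than the original reviewers did, the low override frequency is more likely disengagement than alignment.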

How to use the metrics together.

No single metric is sufficient. The pattern we recommend, and that the deployments in the fleet have adopted, is to track all three replacement metrics alongside deflection rate, with deflection rate treated as a context number rather than a target.

The useful question is not 'is deflection going up' but 'is completion-without-escalation rising faster than deflection, are audit-clean rate and override frequency stable, and is the gap between deflection and completion closing or widening'. A deployment in which all three move in the right direction is a deployment that is improving. A deployment in which deflection rises while the others stagnate is a deployment that is mortgaging future renewal for present optics.
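The question can be reduced to a comparison between two reporting periods, sketched below. The `Snapshot` structure, the one-point stability tolerance, and the figures are hypothetical. Note that "completion rising faster than deflection" and "the gap closing" are the same condition stated two ways, so the sketch checks it once.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # One reporting period; every rate is a fraction between 0 and 1.
    deflection: float
    completion: float   # completion-without-escalation
    audit_clean: float
    override: float


def read_trend(prev: Snapshot, curr: Snapshot, tolerance: float = 0.01) -> dict:
    # Gap between what was deflected and what was actually resolved.
    gap_prev = prev.deflection - prev.completion
    gap_curr = curr.deflection - curr.completion
    return {
        # A negative change means completion rose faster than deflection.
        "gap_change": gap_curr - gap_prev,
        "gap_closing": gap_curr < gap_prev,
        # "Stable" is assumed to mean: moved by no more than the tolerance.
        "audit_clean_stable": abs(curr.audit_clean - prev.audit_clean) <= tolerance,
        "override_stable": abs(curr.override - prev.override) <= tolerance,
    }


# Illustrative figures only.
last_quarter = Snapshot(deflection=0.78, completion=0.60, audit_clean=0.96, override=0.05)
improving = Snapshot(deflection=0.80, completion=0.66, audit_clean=0.96, override=0.05)
optics = Snapshot(deflection=0.85, completion=0.60, audit_clean=0.96, override=0.05)

print(read_trend(last_quarter, improving)["gap_closing"])  # True
print(read_trend(last_quarter, optics)["gap_closing"])     # False
```

In the second comparison deflection rose seven points while completion did not move, which is the pattern the paragraph above describes as mortgaging future renewal for present optics.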

This posture requires a dashboard that operators are willing to read and a leadership culture that is willing to defend it. Both are harder than the one-metric alternative. Both compound.

Deflection rate measures what was deflected. The metrics that matter measure what was resolved.

The deflection-rate culture is hard to displace because the metric is easy to compute, easy to chart, and easy to celebrate. The replacement metrics are harder on all three counts, and they are the ones the deployment actually depends on.

The deployments that have made the switch in the Brevor fleet are alike in two respects: they treat deflection rate as a context number, and they hold their leadership accountable to completion-without-escalation. The pattern is the same one that recurs across the research: the right primitive at the boundary determines what is possible at the center. Deflection rate is the wrong primitive. The right ones are available, and they are the ones we recommend.

Recommended citation

Brevor Research, August 2021
