How should a Quality Threshold be defined for a T1 task class?

A T1 task class threshold has three components. First, the output schema: the precise structure, field types, field constraints, and validation rules that define a correct output for this task class. This is the same kind of specification you would write for a database schema. Second, the completeness requirement: what proportion of required fields must be populated for the output to be accepted? An output missing 20% of required fields may still conform to the schema, for example where the schema permits empty values; the threshold must specify whether that constitutes acceptable output. Third, the hallucination boundary: which field values, if incorrect, constitute a threshold failure regardless of schema conformance? A date that is correctly formatted but factually wrong is schema-compliant and still a failure. Define the fields where accuracy matters and test explicitly against them. Test the threshold against 100 production-representative inputs from the relevant task class before making the routing decision, not against synthetic examples, which models tend to handle better than real-world edge cases.
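The three components can be expressed as one check that returns a list of failures. The following is a minimal sketch, not a reference implementation: the invoice task class, the field names, the choice of critical fields, and the 100% completeness requirement are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical T1 task class: invoice field extraction.
# Field names and types are illustrative assumptions.
REQUIRED_FIELDS = {"invoice_id": str, "issue_date": str, "total": float, "currency": str}
CRITICAL_FIELDS = ("issue_date", "total")  # hallucination boundary (assumed)
MIN_COMPLETENESS = 1.0                     # assumed: all required fields populated


def is_iso_date(value: object) -> bool:
    """True if value is a string that parses as an ISO 8601 date."""
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def check_output(output: dict, reference: dict) -> list[str]:
    """Return the threshold failures for one output; an empty list is a pass."""
    failures: list[str] = []

    # 1. Output schema: populated fields must have the declared type and format.
    for name, expected_type in REQUIRED_FIELDS.items():
        value = output.get(name)
        if value is not None and not isinstance(value, expected_type):
            failures.append(f"schema: {name} is not {expected_type.__name__}")
    if output.get("issue_date") is not None and not is_iso_date(output["issue_date"]):
        failures.append("schema: issue_date is not an ISO 8601 date")

    # 2. Completeness: share of required fields that are actually populated.
    populated = sum(1 for name in REQUIRED_FIELDS if output.get(name) not in (None, ""))
    if populated / len(REQUIRED_FIELDS) < MIN_COMPLETENESS:
        failures.append(
            f"completeness: {populated}/{len(REQUIRED_FIELDS)} required fields populated"
        )

    # 3. Hallucination boundary: critical fields must match the labelled reference,
    #    even when they are well-formed.
    for name in CRITICAL_FIELDS:
        if output.get(name) != reference.get(name):
            failures.append(f"accuracy: {name} does not match reference")

    return failures


def pass_rate(cases: list[tuple[dict, dict]]) -> float:
    """Share of (output, reference) pairs that produce no threshold failures."""
    if not cases:
        raise ValueError("at least one test case is required")
    return sum(1 for out, ref in cases if not check_output(out, ref)) / len(cases)
```

In use, `cases` would hold the candidate model's outputs on the 100 production-representative inputs, each paired with a labelled reference, and `pass_rate` would be compared against whatever acceptance level the task class demands.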

Can a routing decision be reversed if the Quality Threshold is later found to be insufficient?

Yes, but reversing it costs more than getting it right the first time. A routing decision that has been in production for six months has propagated the cheaper model’s output characteristics through the downstream task classes that depend on it. If the threshold was set too permissively and the cheaper model has been producing subtly incorrect outputs, the [Operational Ledger](https://arcoventure.studio/lexicon/operational-ledger) may contain incorrectly resolved exceptions, the downstream [Context Architecture](https://arcoventure.studio/lexicon/context-architecture) may have been calibrated on incorrect data, and the [Escalation Rate](https://arcoventure.studio/lexicon/escalation-rate) may have risen while the Steward attributed the increase to other causes. Reversing the routing decision itself is straightforward: route back to the more capable model. Identifying and correcting the downstream consequences of the incorrect routing period is the harder work. The correct approach is to validate the Quality Threshold against production-representative data before the routing decision is made, not to treat the production system as the validation environment.

How does the Quality Threshold interact with the Intervention Threshold?

They govern adjacent but distinct boundaries. The [Intervention Threshold](https://arcoventure.studio/lexicon/intervention-threshold) defines the conditions under which the system escalates to the Steward: when the output of a task class requires human judgment to validate or extend. The [Quality Threshold](https://arcoventure.studio/lexicon/quality-threshold) defines the minimum acceptable output of the routed model and is evaluated before the output reaches the Intervention Threshold. An output that meets the Quality Threshold but triggers the Intervention Threshold is correct but complex, and it needs human review. An output that fails the Quality Threshold should not reach the Intervention Threshold at all; it should be re-routed to a more capable model or escalated as a routing failure before the Steward is burdened with a structurally incorrect output. The design sequence is: Quality Threshold first (is this output structurally correct?), Intervention Threshold second (is this correct output one the system should resolve autonomously or escalate?).
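The ordering of the two checks can be sketched as a small dispatch function. This is an illustrative sketch only: the `Disposition` names and the two predicate parameters are hypothetical and stand in for whatever checks a given system actually implements.

```python
from enum import Enum
from typing import Callable


class Disposition(Enum):
    # Hypothetical labels for the three outcomes described above.
    ROUTING_FAILURE = "re-route to a more capable model or flag as a routing failure"
    ESCALATE_TO_STEWARD = "correct but complex: needs human review"
    RESOLVE_AUTONOMOUSLY = "correct and within the system's remit"


def dispose(
    output: dict,
    meets_quality_threshold: Callable[[dict], bool],
    triggers_intervention: Callable[[dict], bool],
) -> Disposition:
    """Apply the Quality Threshold first, then the Intervention Threshold."""
    # Gate 1: a structurally incorrect output never reaches the Steward.
    if not meets_quality_threshold(output):
        return Disposition.ROUTING_FAILURE
    # Gate 2: only outputs that passed gate 1 are evaluated for escalation.
    if triggers_intervention(output):
        return Disposition.ESCALATE_TO_STEWARD
    return Disposition.RESOLVE_AUTONOMOUSLY
```

Because the first gate returns early, the intervention check is never evaluated for an output that has already failed the Quality Threshold.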

The Routing Decision

The Quality Threshold is the minimum acceptable output standard for a given task class. It is defined before routing decisions are made and used to bound Intelligence Arbitrage so that cost optimisation never routes a task to a model incapable of meeting the standard the revenue loop requires. The framing of model routing as a cost reduction tool is accurate but structurally incomplete. Routing to a cheaper model reduces cost. Routing to a cheaper model without a defined quality bound may also reduce output quality silently: no error signal fires, the Escalation Rate rises for task classes where the cheaper model is insufficient, and the system does not know it.

The Inference Floor argument established that frontier model capability has converged on most operational task classes — making model selection procurement rather than strategy. This convergence is what makes Intelligence Arbitrage economically significant: if multiple models can produce equivalent output on a given task class, the routing decision is straightforward — use the cheapest capable one. The engineering discipline that makes this decision safe is the Quality Threshold: the specification of what “equivalent output” means for each task class, defined precisely enough that it can be evaluated by logic rather than by impression.

Two task classes, two routing strategies

Structured output tasks and open-ended reasoning tasks require different routing strategies. For structured output tasks — extracting fields from a document, classifying an input into a predefined taxonomy, generating a formatted record from raw data — the Quality Threshold is a schema specification: the model must produce output that conforms to the defined structure, within the defined field constraints, without hallucinated values. This threshold can typically be met by small, fast, inexpensive models. The Inference Floor has already reached most structured extraction tasks. Routing these to a frontier model at frontier pricing is Operational Arbitrage surrendered.

For reasoning tasks (generating a sales coaching recommendation based on a call transcript, identifying the root cause of an escalation from an operational log, producing a novel exception resolution), the Quality Threshold is harder to specify because the correct output is not a schema. For these task classes, the Quality Threshold must specify an accuracy benchmark against a sample dataset: the model must produce outputs that a domain expert rates as acceptable on at least X% of test cases in the defined task domain. This threshold requires pre-validation before the routing decision is made. If X is 90, for instance, a model rated acceptable on 92% of test cases is eligible for routing; one rated acceptable on 76% is not, regardless of how much cheaper it is. The Quality Threshold is the instrument that makes this pre-validation systematic rather than anecdotal.
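The eligibility test for a reasoning task class reduces to a comparison between an expert acceptance rate and the required rate. The sketch below assumes each test case has already been rated acceptable or not by a domain expert; the 0.90 required rate is an illustrative assumption, since the text leaves X unspecified, and the function names are hypothetical.

```python
def acceptance_rate(ratings: list[bool]) -> float:
    """Fraction of test cases a domain expert rated as acceptable."""
    if not ratings:
        raise ValueError("at least one rated test case is required")
    return sum(ratings) / len(ratings)


def is_eligible(ratings: list[bool], required_rate: float) -> bool:
    """A model is eligible for routing only if it meets the required rate."""
    return acceptance_rate(ratings) >= required_rate


# Illustrative numbers: 100 rated test cases per candidate model.
REQUIRED_RATE = 0.90  # assumed value for X

candidate_a = [True] * 92 + [False] * 8   # rated acceptable on 92% of cases
candidate_b = [True] * 76 + [False] * 24  # rated acceptable on 76% of cases

print(is_eligible(candidate_a, REQUIRED_RATE))  # True
print(is_eligible(candidate_b, REQUIRED_RATE))  # False
```

Price does not appear anywhere in the eligibility check; cost only becomes relevant when choosing among models that have already passed.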

Structured output and routing compose directly

The most productive routing insight in practice is that structured output specification and model routing are architecturally complementary. A task class with a schema-defined Quality Threshold can be routed to small, cheap models because schema compliance is evaluable by logic — the routing decision is provably safe for any model that reliably produces schema-conformant output. This includes the majority of T1 task classes in most revenue loops: document processing, field extraction, classification, formatting, and data normalisation. For these tasks, the Quality Threshold is a one-time design decision that makes Intelligence Arbitrage available indefinitely: as new models emerge with equivalent schema-compliance capability at lower cost, the routing decision updates automatically within the defined threshold, and the cost advantage compounds without requiring architectural changes.
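The way a fixed threshold lets the routing decision update as new models appear can be sketched as "pick the cheapest candidate that meets the threshold". The model names, prices, and conformance rates below are invented for illustration, and `conformance_rate` is assumed to have been measured on production-representative inputs beforehand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str                  # hypothetical model identifier
    cost_per_1k_tokens: float  # illustrative price, arbitrary currency
    conformance_rate: float    # share of validation outputs meeting the schema threshold


def cheapest_eligible(candidates: list[Candidate], required_rate: float) -> Candidate:
    """Return the lowest-cost candidate whose measured conformance meets the threshold."""
    eligible = [c for c in candidates if c.conformance_rate >= required_rate]
    if not eligible:
        raise LookupError("no candidate meets the Quality Threshold")
    return min(eligible, key=lambda c: c.cost_per_1k_tokens)


REQUIRED_RATE = 0.99  # assumed threshold; it stays fixed as the model list changes

models = [
    Candidate("frontier-large", 10.0, 0.999),
    Candidate("small-fast", 1.0, 0.995),
    Candidate("tiny", 0.2, 0.90),  # cheapest, but below the threshold
]
print(cheapest_eligible(models, REQUIRED_RATE).name)  # small-fast

# A new model that passes the same validation at lower cost changes the route
# without any change to the threshold or to the calling code.
models.append(Candidate("new-small", 0.5, 0.996))
print(cheapest_eligible(models, REQUIRED_RATE).name)  # new-small
```

The threshold is the only input that encodes what the task class requires; the candidate list is the only input that changes when the model landscape does.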

AI Gateway implements this mechanism in production: a provider-neutral routing layer that connects to different LLMs and updates when new models are released. “You can try the latest model when it gets updated on a Sunday without touching your application code.” This is Architectural Decoupling at the routing layer: the application code defines the Quality Threshold, and the gateway handles the routing decision against that threshold as the model landscape evolves. The threshold is stable. The routing is dynamic. The cost advantage compounds.

What happens without the Quality Threshold

Without a defined Quality Threshold, routing becomes a cost tool that operates by trial and error. The team routes a task class to a cheaper model, observes whether the output “seems” acceptable, and makes the routing decision on impression. This approach has two failure modes. First, gradual output degradation that does not trigger visible errors: the cheaper model produces outputs that are structurally correct but subtly wrong in ways that compound in downstream task classes before the error becomes visible. The Escalation Rate for downstream task classes rises; the cause is the routing decision upstream. Second, Execution Divergence that appears as a routing failure but is actually a threshold failure: the cheaper model handles 92% of inputs correctly but fails on the 8% that require reasoning the model cannot perform. Without a defined threshold, the 8% failure rate is a surprise. With a threshold, it is a pre-validated exclusion.

The Operator’s Verdict

The Quality Threshold is a design decision, not an optimisation. Made at design time, it makes Intelligence Arbitrage available for the lifetime of the system, compounding the cost advantage every quarter as the Inference Floor advances to new task classes. Made after the fact, it is a debugging exercise that costs more than the routing savings it was designed to capture.

Technology changes what models cost. The Quality Threshold determines what routing safely captures.

KEY TAKEAWAY

What is the Quality Threshold and why is it the precondition for safe model routing?

The Quality Threshold is the minimum acceptable output standard for a given task class. It is defined before routing decisions are made and used to bound Intelligence Arbitrage routing so that cost optimisation never routes a task to a model incapable of meeting the standard the revenue loop requires. For structured output tasks, the threshold is a schema specification: the model must produce schema-conformant output without hallucinated values. For reasoning tasks, the threshold is an accuracy benchmark against a pre-validated test dataset. Without a Quality Threshold, routing to cheaper models degrades output quality silently: the model produces plausible results that violate business constraints without triggering error signals, raising the Escalation Rate for downstream task classes before the cause is visible. With a Quality Threshold, routing is provably safe for any model that meets it, and the cost advantage of routing to cheaper models compounds automatically as the Inference Floor advances to new task classes. Key point: structured output tasks, which make up the majority of T1 task classes, can be routed to small, inexpensive models when the Quality Threshold is schema-defined, because schema compliance is evaluable by logic. The Inference Floor has already reached most structured extraction tasks.