The Stewardship Model defines the Steward’s role as governance rather than operation: governing exceptions, refining the Exception Architecture, and maintaining the boundary between the Execution Layer and the Judgment Layer. The Audit Surface, the structured governance digest derived from the Proof of Action trail, is what makes that governance possible at operational tempo. What neither the Stewardship Model nor the Audit Surface memo addressed is the technical layer beneath the digest: the observability infrastructure that generates the trace data from which the Audit Surface is derived. A Steward governing without this infrastructure is making architectural decisions with incomplete information. The Nominal MTTI condition, where interventions are rare because the Steward has stopped reading the Audit Surface, is one failure mode. This memo addresses a different one: the Steward is reading, but the surface is built from insufficient data.
The two hard problems in production
Two problems define the transition from prototype to production for agentic systems. The first is accuracy: AI applications are built on non-deterministic models, and agents can regress while still returning 200 OK. A workflow that produced correct output last week may be producing subtly incorrect output this week because a prompt changed, a model was updated, or the data environment shifted — without triggering any error signal in the conventional sense. The second is token cost: agents call models in loops, generate multiple potential responses, and accumulate context across multi-step workflows. The operational cost of running agents at production volume can exceed the revenue they generate if the cost structure is not observable at the step level. Both problems have the same solution: observability.
Observability in the agentic context means the ability to see the input and output of every step in every agent workflow: not just the final output, but every intermediate call, every tool invocation, every model response, every escalation, and every exception. This is what a trace provides: the full execution tree of a workflow, from the triggering event through every step to the final output, with timing, cost, and status information for each node. The de facto standard for trace data is OpenTelemetry (OTel), an open, vendor-neutral standard supported by most major observability vendors and many agentic frameworks. Building on OTel keeps the trace data portable: the Steward can switch observability vendors without losing the ability to read historical traces, which maintains the Architectural Decoupling principle at the observability layer.
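To make the idea of an execution tree concrete, the following is a minimal sketch of a trace as nested steps carrying timing, token cost, and status. It is a simplified stand-in written with the Python standard library only, not the OpenTelemetry SDK; the `Span` class, its field names, and the example workflow are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    """One step in a workflow trace (hypothetical, simplified model)."""
    name: str
    duration_ms: float
    tokens: int = 0
    status: str = "OK"  # "OK" or "ERROR"
    children: List["Span"] = field(default_factory=list)

    def walk(self) -> Iterator["Span"]:
        """Yield this span and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def total_tokens(root: Span) -> int:
    """Token cost of the whole workflow, summed over every step."""
    return sum(span.tokens for span in root.walk())


def costliest_step(root: Span) -> Span:
    """The single step that consumed the most tokens."""
    return max(root.walk(), key=lambda span: span.tokens)


# Hypothetical workflow: one root span with four steps, one of which calls a tool.
trace = Span("handle_ticket", 4200.0, children=[
    Span("classify_intent", 300.0, tokens=450),
    Span("retrieve_context", 900.0),
    Span("draft_reply", 2800.0, tokens=3100, children=[
        Span("tool:lookup_order", 600.0),
    ]),
    Span("escalation_check", 200.0, tokens=250),
])

print(total_tokens(trace))         # 3800
print(costliest_step(trace).name)  # draft_reply
```

Because cost is recorded per node, the same structure answers both the workflow-level question (what did this run cost?) and the step-level one (which step drove the cost?).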
What the Steward actually needs from observability
Raw trace data is not a governance instrument. A Steward presented with the full OpenTelemetry trace for a 10-step agent workflow — thousands of spans, each with input/output JSON, timing, and status — has more information than they can process in the time available for governance. The Audit Surface Problem established that the Steward’s daily governance review must be completable in five minutes. Five minutes is not enough time to analyse raw trace data. It is enough time to review a structured digest that surfaces anomalies, confirms stable baselines, and identifies the specific traces worth investigating in depth.
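The reduction from raw traces to a reviewable digest can be sketched as follows. This is an illustrative assumption about what such a digest might contain, not a description of the Audit Surface’s actual schema; `TraceSummary`, `build_digest`, and the `token_budget` threshold are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TraceSummary:
    """One row per workflow run (hypothetical fields)."""
    trace_id: str
    task_class: str
    status: str  # "OK", "ERROR" or "ESCALATED"
    tokens: int


def build_digest(traces: List[TraceSummary], token_budget: int) -> Dict[str, object]:
    """Reduce many traces to counts plus a short list of traces worth opening."""
    by_status = Counter(t.status for t in traces)
    escalations = Counter(t.task_class for t in traces if t.status == "ESCALATED")
    # A trace is flagged if it did not complete cleanly or ran over budget.
    flagged = [
        t.trace_id for t in traces
        if t.status != "OK" or t.tokens > token_budget
    ]
    return {
        "total": len(traces),
        "by_status": dict(by_status),
        "escalations_by_class": dict(escalations),
        "investigate": flagged,
    }


traces = [
    TraceSummary("t-001", "invoice_match", "OK", 1200),
    TraceSummary("t-002", "invoice_match", "OK", 1350),
    TraceSummary("t-003", "refund_request", "ESCALATED", 2100),
    TraceSummary("t-004", "invoice_match", "OK", 9800),
]

digest = build_digest(traces, token_budget=5000)
print(digest["investigate"])  # ['t-003', 't-004']
```

The Steward reads the counts to confirm the baseline is stable and opens only the flagged traces, which is what keeps the review inside the five-minute budget.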
The Steward Intelligence Layer bridges this gap. It is the architectural component that combines trace data, evaluated patterns, and specialised interpretation agents to surface actionable architectural signals to the Steward, shortening the path from trace data to an architectural decision to a single review session. Three components constitute it. First, automated evals: the system runs each agent workflow against a defined test dataset after every deployment and flags regressions before the Steward’s review cycle. Second, anomaly detection: the system identifies deviations in execution time, token cost, Escalation Rate, and output pattern across task classes and surfaces them as signals in the governance digest. Third, specialised interpretation agents: lightweight agents that process the trace data, compare it against the v0 baseline, identify cost outliers at the step level, and generate plain-language summaries of what changed and what it might mean, so that the Steward arrives at architectural recommendations rather than raw data.
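The anomaly-detection component can be illustrated with a simple comparison against the v0 baseline. This is a minimal sketch assuming a fixed relative tolerance; a production system would more likely use per-metric thresholds or statistical tests. The metric names, the `find_anomalies` function, and the 25% tolerance are assumptions.

```python
from typing import Dict, List


def find_anomalies(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.25,
) -> List[str]:
    """Flag metrics whose relative change from the baseline exceeds tolerance."""
    signals = []
    for metric, base in baseline.items():
        now = current.get(metric)
        # Skip metrics that are missing or have no meaningful relative change.
        if now is None or base == 0:
            continue
        change = (now - base) / base
        if abs(change) > tolerance:
            signals.append(f"{metric}: {change:+.0%} vs baseline")
    return signals


# Hypothetical metrics for one task class.
baseline = {"p50_latency_ms": 1800.0, "tokens_per_run": 4000.0, "escalation_rate": 0.04}
current = {"p50_latency_ms": 1900.0, "tokens_per_run": 6000.0, "escalation_rate": 0.07}

for signal in find_anomalies(baseline, current):
    print(signal)
# tokens_per_run: +50% vs baseline
# escalation_rate: +75% vs baseline
```

The output is already phrased as a signal for the digest; an interpretation agent would take these lines as input and add the likely cause and the suggested architectural response.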
The target state is infrastructure that is itself agentic — that gives the Steward solutions rather than alerts, dashboards, and problems. The distinction is precise: an alert tells the Steward that something changed. An agentic observability layer tells the Steward what changed, why it matters, and what architectural response is appropriate. The first requires the Steward to diagnose. The second allows the Steward to govern.
Online and offline evals — the two observability modes
Evals are the quantitative instrument through which the Steward measures agent quality over time. Offline evals run against a fixed dataset before each deployment — they catch regressions before the deployment reaches production. Online evals run against live production traffic — they catch the failure modes that real users introduce that no synthetic dataset anticipates. Most teams start with offline evals and add online evals once the agent is in production. The mature Steward Intelligence Layer runs both simultaneously: offline evals gate deployments, online evals monitor drift.
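An offline eval used as a deployment gate might look like the sketch below. It assumes exact-match scoring, which is a deliberate simplification (real evals often use rubric-based or model-graded scoring), and every name in it (`EvalCase`, `pass_rate`, `gate_deployment`, `toy_agent`, the `max_drop` allowance) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected: str


def pass_rate(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Fraction of cases where the agent's output matches the expected output."""
    if not cases:
        raise ValueError("eval dataset is empty")
    passed = sum(1 for case in cases if agent(case.prompt).strip() == case.expected)
    return passed / len(cases)


def gate_deployment(
    agent: Callable[[str], str],
    cases: List[EvalCase],
    baseline_rate: float,
    max_drop: float = 0.02,
) -> bool:
    """Allow the deployment only if quality has not dropped more than max_drop."""
    return pass_rate(agent, cases) >= baseline_rate - max_drop


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent call; answers two of the three cases."""
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


cases = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("3*3", "9"),
]

print(gate_deployment(toy_agent, cases, baseline_rate=1.0))  # False
```

The same `pass_rate` function can score sampled production traffic for online evals; what differs is the source of the cases and the fact that the result feeds drift monitoring rather than a release decision.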
The eval dataset itself compounds. It is built in three stages: hand-curated cases that establish what correct output looks like for each task class; synthetically generated cases that expand coverage to edge cases; and production logs that capture the actual inputs real users provide. The production log component is the highest-signal source and the one most teams delay adding. It should be built from the first week of production — every exception resolved by the Steward is a potential eval case, and every exception encoded into the Exception Architecture is a regression test for the next deployment.
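The three-stage dataset can be sketched as a merge with provenance. The rule that production cases override curated ones, and curated override synthetic, when they share the same input is an assumption made here for illustration, motivated by the observation that production logs carry the highest signal. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    task_class: str
    input_text: str
    expected: str
    source: str  # "synthetic", "curated" or "production"


# Assumed precedence when two cases share the same task class and input.
PRIORITY = {"synthetic": 0, "curated": 1, "production": 2}


def merge_cases(*batches: Iterable[EvalCase]) -> List[EvalCase]:
    """Combine batches, keeping one case per (task_class, input_text)."""
    merged: Dict[Tuple[str, str], EvalCase] = {}
    for batch in batches:
        for case in batch:
            key = (case.task_class, case.input_text)
            kept = merged.get(key)
            if kept is None or PRIORITY[case.source] > PRIORITY[kept.source]:
                merged[key] = case
    return list(merged.values())


def case_from_exception(task_class: str, input_text: str, resolution: str) -> EvalCase:
    """Turn an exception resolved by the Steward into a regression case."""
    return EvalCase(task_class, input_text, resolution, "production")


curated = [EvalCase("refund", "Order arrived damaged", "approve_refund", "curated")]
synthetic = [
    EvalCase("refund", "Order arrived damaged", "approve_refund", "synthetic"),
    EvalCase("refund", "Changed my mind after 90 days", "deny_refund", "synthetic"),
]
production = [
    case_from_exception("refund", "Changed my mind after 90 days", "escalate_to_human"),
]

dataset = merge_cases(curated, synthetic, production)
print([(case.input_text, case.source) for case in dataset])
```

In this example the Steward’s resolution replaces the synthetic guess for the second input, so the next deployment is tested against what the Steward actually decided.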
The Operator’s Verdict
The Stewardship Model is only as effective as the information it operates on. A Steward with a well-designed Steward Intelligence Layer makes governance decisions on the basis of what the system is actually doing. A Steward without one makes them on the basis of what the system was doing when it was last explicitly checked, which may have been weeks ago and which may have been measured incompletely even then.
Technology changes what agents can do. Observability determines whether the Steward knows what they are doing.
KEY TAKEAWAY
What is the Steward Intelligence Layer and why does observability quality determine the Stewardship Model’s governance quality?
The Steward Intelligence Layer is the architectural component that combines OpenTelemetry trace data, automated evals, anomaly detection, and specialised interpretation agents to surface actionable architectural signals to the Steward, converting raw trace data into a governance digest that can be reviewed in minutes rather than hours. Observability determines governance quality because agents can regress while returning 200 OK: output quality degrades silently without triggering conventional error signals, while the Escalation Rate rises for downstream task classes and the Steward attributes it to other causes. A Steward without observability infrastructure makes architectural decisions on impression. A Steward with a Steward Intelligence Layer makes them on evidence. The Audit Surface defines what the Steward must verify. Observability is the infrastructure that makes verification possible. Key point: teams shipping agents typically spend months revisiting their observability tooling as they move from prototype to production, so observability is not a one-off setup task but an ongoing governance instrument.
