What does an OpenTelemetry trace show and why is it the standard for agentic observability?

An OpenTelemetry trace shows the full execution tree of a workflow: every function called, every model invoked, every tool used, and every piece of data retrieved, with the input and output of each call, the time taken, and the status. For an agentic workflow, this means the Steward can see exactly what the agent received as input at each step, what it passed to each model or tool, and what it received back. That makes it possible to identify exactly where in a multi-step workflow an incorrect output was produced, rather than inferring the failure point from the final output. OTel is the standard because it is portable: trace data generated in OTel format can be read by any OTel-compatible observability tool, so the Steward is not locked into a specific vendor’s trace format and [Architectural Decoupling](https://arcoventure.studio/lexicon/architectural-decoupling) is maintained at the observability layer.

How does the Steward Intelligence Layer differ from a standard observability dashboard?

A standard observability dashboard presents trace data in a visual format — timeline views, flamecharts, status indicators, cost summaries. It requires the Steward to analyse the data to identify what matters. The Steward Intelligence Layer inverts this: specialised interpretation agents process the trace data, compare it against baselines, identify anomalies, and generate plain-language architectural recommendations for the Steward to evaluate. The distinction is precise: an alert requires the Steward to diagnose. A recommendation requires evaluation. The Steward’s governance capacity is more efficiently spent evaluating recommendations than diagnosing alerts — which is the design principle the Steward Intelligence Layer implements. It also connects to the pull-based governance model: the Steward Intelligence Layer surfaces conditions that require attention rather than requiring the Steward to search for them, preventing the [Nominal MTTI](https://arcoventure.studio/lexicon/nominal-mtti) condition where long MTTI reflects absence of monitoring rather than genuine [Architectural Certainty](https://arcoventure.studio/lexicon/architectural-certainty).

The Steward Cannot Govern What They Cannot See

The Stewardship Model defines the Steward’s role as governance rather than operation: governing exceptions, refining the Exception Architecture, and maintaining the boundary between the Execution Layer and the Judgment Layer. The Audit Surface, the structured governance digest derived from the Proof of Action trail, is what makes that governance possible at operational tempo. What neither the Stewardship Model nor the Audit Surface memo addressed is the technical layer beneath the digest: the observability infrastructure that generates the trace data from which the Audit Surface is derived. A Steward governing without this infrastructure is making architectural decisions with incomplete information. The Nominal MTTI condition, where interventions are rare because the Steward has stopped reading the Audit Surface, is one failure mode. This memo addresses a different one: the Steward is reading, but the surface is built from insufficient data.

The two hard problems in production

Two problems define the transition from prototype to production for agentic systems. The first is accuracy: AI applications are built on non-deterministic models, and agents can regress while still returning 200 OK. A workflow that produced correct output last week may be producing subtly incorrect output this week because a prompt changed, a model was updated, or the data environment shifted — without triggering any error signal in the conventional sense. The second is token cost: agents call models in loops, generate multiple potential responses, and accumulate context across multi-step workflows. The operational cost of running agents at production volume can exceed the revenue they generate if the cost structure is not observable at the step level. Both problems have the same solution: observability.

Observability in the agentic context means the ability to see the input and output of every step in every agent workflow: not just the final output, but every intermediate call, every tool invocation, every model response, every escalation, and every exception. This is what a trace provides: the full execution tree of a workflow, from the triggering event through every step to the final output, with timing, cost, and status information for each node. The standard format for trace data is OpenTelemetry (OTel), an open standard supported by most major observability vendors and many agentic frameworks. Building on OTel keeps the trace data portable: the Steward can switch observability vendors without losing the ability to read historical traces, which maintains the Architectural Decoupling principle at the observability layer.

What the Steward actually needs from observability

Raw trace data is not a governance instrument. A Steward presented with the full OpenTelemetry trace for a 10-step agent workflow, potentially thousands of spans once loops and retries are counted, each with input/output JSON, timing, and status, has more information than they can process in the time available for governance. The Audit Surface Problem established that the Steward’s daily governance review must be completable in five minutes. Five minutes is not enough time to analyse raw trace data. It is enough time to review a structured digest that surfaces anomalies, confirms stable baselines, and identifies the specific traces worth investigating in depth.

The Steward Intelligence Layer bridges this gap. It is the architectural component that combines trace data, evaluated patterns, and specialised interpretation agents to surface actionable architectural signals to the Steward — reducing the decision-to-insight cycle to a single review session. Three components constitute it. First, automated evals: the system runs each agent workflow against a defined test dataset after every deployment and flags regressions before the Steward’s review cycle. Second, anomaly detection: the system identifies deviations in execution time, token cost, Escalation Rate, and output pattern across task classes and surfaces them as signals in the governance digest. Third, specialised interpretation agents: lightweight agents that process the trace data, compare it against the v0 baseline, identify cost outliers at the step level, and generate plain-language summaries of what changed and what it might mean — so the Steward arrives at architectural recommendations rather than raw data.
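The anomaly-detection component can be sketched as a comparison of current metrics against a stored baseline. The memo does not specify a detection method, so the rule below (flag anything more than three sample standard deviations from the baseline mean) is an assumption made for illustration, as are the metric names and numbers.

```python
from statistics import mean, stdev


def detect_anomalies(baseline, current, threshold=3.0):
    """Flag metrics whose current value lies more than `threshold`
    sample standard deviations from the baseline mean."""
    signals = []
    for metric, history in baseline.items():
        mu = mean(history)
        sigma = stdev(history)
        value = current[metric]
        if sigma == 0:
            # A perfectly flat baseline: any change at all is a deviation.
            z = float("inf") if value != mu else 0.0
        else:
            z = abs(value - mu) / sigma
        if z > threshold:
            direction = "above" if value > mu else "below"
            signals.append(f"{metric}: {value:g} is {direction} the baseline mean of {mu:g}")
    return signals


# Hypothetical per-run metrics for one task class.
baseline = {
    "latency_ms": [410, 395, 420, 405, 400],
    "tokens_per_run": [2400, 2350, 2450, 2380, 2420],
    "escalation_rate": [0.04, 0.05, 0.04, 0.06, 0.05],
}
current = {"latency_ms": 412, "tokens_per_run": 3900, "escalation_rate": 0.05}

signals = detect_anomalies(baseline, current)
for line in signals:
    print(line)
```

With these numbers only the token cost is flagged; latency and Escalation Rate sit within normal variation and would be reported as stable. An interpretation agent would then take the flagged signal and the underlying traces as input when drafting its plain-language summary.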

The target state is infrastructure that is itself agentic — that gives the Steward solutions rather than alerts, dashboards, and problems. The distinction is precise: an alert tells the Steward that something changed. An agentic observability layer tells the Steward what changed, why it matters, and what architectural response is appropriate. The first requires the Steward to diagnose. The second allows the Steward to govern.

Online and offline evals — the two observability modes

Evals are the quantitative instrument through which the Steward measures agent quality over time. Offline evals run against a fixed dataset before each deployment — they catch regressions before the deployment reaches production. Online evals run against live production traffic — they catch the failure modes that real users introduce that no synthetic dataset anticipates. Most teams start with offline evals and add online evals once the agent is in production. The mature Steward Intelligence Layer runs both simultaneously: offline evals gate deployments, online evals monitor drift.
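The two modes can be sketched side by side. Everything here is a simplified assumption: the scoring rule is exact match (real evals often need a graded or model-based score), and `candidate_agent`, the dataset, and the window size are hypothetical stand-ins.

```python
from collections import deque


def run_offline_evals(agent, dataset):
    """Return the fraction of cases where the agent's output equals the expected output."""
    passed = sum(1 for case in dataset if agent(case["input"]) == case["expected"])
    return passed / len(dataset)


def gate_deployment(agent, dataset, baseline_pass_rate, tolerance=0.0):
    """Offline mode: allow the deployment only if the pass rate holds against the baseline."""
    pass_rate = run_offline_evals(agent, dataset)
    return {"pass_rate": pass_rate, "deploy": pass_rate >= baseline_pass_rate - tolerance}


class OnlineMonitor:
    """Online mode: rolling pass rate over the most recent scored production outputs."""

    def __init__(self, window=100):
        self.scores = deque(maxlen=window)

    def record(self, passed):
        self.scores.append(1 if passed else 0)

    def pass_rate(self):
        return sum(self.scores) / len(self.scores) if self.scores else None


def candidate_agent(text):
    # Stand-in for a real agent call: routes refund requests, defaults to "general".
    return "refund" if "refund" in text.lower() else "general"


dataset = [
    {"input": "I want a refund", "expected": "refund"},
    {"input": "Where is my order?", "expected": "shipping"},
    {"input": "REFUND please", "expected": "refund"},
    {"input": "Hello", "expected": "general"},
]

result = gate_deployment(candidate_agent, dataset, baseline_pass_rate=1.0)
print(result)

monitor = OnlineMonitor(window=3)
for outcome in [True, True, False, False]:
    monitor.record(outcome)
print(monitor.pass_rate())
```

The gate blocks this candidate because it misroutes the shipping question. The monitor only remembers the last three outcomes, so a recent run of failures pulls the rolling rate down quickly, which is the drift signal the online mode exists to catch.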

The eval dataset itself compounds. It is built in three stages: hand-curated cases that establish what correct output looks like for each task class; synthetically generated cases that expand coverage to edge cases; and production logs that capture the actual inputs real users provide. The production log component is the highest-signal source and the one most teams delay adding. It should be built from the first week of production — every exception resolved by the Steward is a potential eval case, and every exception encoded into the Exception Architecture is a regression test for the next deployment.
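One way to picture the compounding dataset is as a single store that tags every case with its source, with a dedicated path for turning a resolved exception into a regression case. The class and method names below are hypothetical, and deduplicating on task class plus input is an assumption, not something the memo prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    task_class: str
    input: str
    expected: str
    source: str  # "curated", "synthetic", or "production"


class EvalDataset:
    def __init__(self):
        self._cases = {}

    def add(self, case):
        """Add a case; keep the first entry for a given (task_class, input) pair."""
        key = (case.task_class, case.input)
        if key in self._cases:
            return False
        self._cases[key] = case
        return True

    def add_resolved_exception(self, task_class, input_text, steward_resolution):
        """Turn an exception the Steward resolved into a production-sourced eval case."""
        return self.add(EvalCase(task_class, input_text, steward_resolution, "production"))

    def counts_by_source(self):
        counts = {}
        for case in self._cases.values():
            counts[case.source] = counts.get(case.source, 0) + 1
        return counts


evals = EvalDataset()
evals.add(EvalCase("routing", "I want a refund", "refund", "curated"))
evals.add(EvalCase("routing", "refund???", "refund", "synthetic"))
added = evals.add_resolved_exception("routing", "Where is my order?", "shipping")
duplicate = evals.add_resolved_exception("routing", "Where is my order?", "shipping")

print(evals.counts_by_source())
```

Tracking the source per case lets the Steward see how much of the dataset actually reflects production input, which the paragraph above identifies as the highest-signal component.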

The Operator’s Verdict

The Stewardship Model is only as effective as the information it operates on. A Steward with a well-designed Steward Intelligence Layer makes governance decisions on the basis of what the system is actually doing. A Steward without one makes them on the basis of what the system was doing when it was last explicitly checked — which may be weeks ago, and which may have been measured incompletely even then.

Technology changes what agents can do. Observability determines whether the Steward knows what they are doing.

KEY TAKEAWAY

What is the Steward Intelligence Layer and why does observability quality determine the Stewardship Model’s governance quality?

The Steward Intelligence Layer is the architectural component that combines OpenTelemetry trace data, automated evals, and specialised interpretation agents to surface actionable architectural signals to the Steward, converting raw trace data into a governance digest that can be reviewed in minutes rather than hours. Observability determines governance quality because agents can regress while returning 200 OK: output quality degrades silently without triggering conventional error signals, while the Escalation Rate rises for downstream task classes and the Steward attributes it to other causes. A Steward without observability infrastructure makes architectural decisions on impression. A Steward with a Steward Intelligence Layer makes them on evidence. The Audit Surface defines what the Steward must verify. Observability is the infrastructure that makes verification possible. Key point: teams shipping agents into production typically spend months reviewing their observability tools as they move from prototype to production, so observability is not a setup task but an ongoing governance instrument.