A business that an agent cannot name cannot be cited. Most sites that implement llms.txt treat it as a summary document—a tidy description of content for any LLM that happens to crawl it. That is the wrong unit of value. The correct unit is the term: specifically, whether the terms a business uses to describe its own methods are anchored to canonical definitions before an agent encounters them anywhere else. Arco's Lexicon was not built as a content asset. It was built as a declaration layer. The llms.txt file makes that declaration machine-parseable.
The llmstxt.org standard, a proposed convention now adopted across a growing number of publisher and platform sites, tells an LLM what a site contains, how to navigate it, and what context to apply when citing it. Most implementations are structural: title, description, links. Some include key terms. Fewer still understand why the key terms are the only element that materially affects citation quality. A list of URLs tells an agent where your content lives. A correctly anchored term tells an agent what your vocabulary means—and that distinction determines whether it cites you accurately or paraphrases you into something unrecognisable. Anchoring terminology at the declaration layer, before an agent reads anything else, is how Arco engineers LLM citation authority.
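For orientation, the file format itself is plain Markdown. A minimal skeleton following my reading of the llmstxt.org proposal, with illustrative section names and example.com URLs rather than Arco's actual entries:

```markdown
# Example Site

> One-sentence summary an agent should carry as context when citing the site.

## Docs

- [Getting started](https://example.com/docs/start): orientation for new readers

## Key Terms

- [Operational Drag](https://example.com/lexicon/operational-drag): defined ratio of non-output work to capacity
```

The H1 title and blockquote summary cover the title and description; the H2 link lists cover navigation; the key-terms list is where anchored vocabulary lives.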
The default llms.txt file most sites produce is human-readable: formatted for comprehension, structured around sections a person can scan, with key terms presented as bold labels and definitions written as prose. It describes the site. What it does not do is give an agent a parseable path from a term it encounters to the authoritative definition of that term. The agent reads the description, forms its own interpretation, and proceeds. Arco's llms.txt is structured differently: every term in the key terms section is a Markdown link pointing directly to its canonical Lexicon entry. The agent does not interpret the term—it is directed to the definition before interpretation begins. That structural difference is what this piece is about.
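To make the structural difference concrete, here are the two key-terms styles side by side, using a hypothetical term entry and an illustrative example.com URL (not Arco's actual Lexicon URL):

```markdown
<!-- Descriptive style: readable, but the agent must interpret the prose -->
## Key Terms

**Operational Drag**: the proportion of capacity consumed by work that
does not directly produce output.

<!-- Anchored style: the term is a parseable path to its canonical entry -->
## Key Terms

- [Operational Drag](https://example.com/lexicon/operational-drag): defined ratio of non-output work to capacity
```

In the first form the agent reads a description and forms its own interpretation; in the second, the link gives it a resolvable target before interpretation begins.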
Memo #13 described the architectural principle: an autonomous business is discoverable not because it ranks in search but because it is parseable by the agents doing the looking. The llms.txt question is what parseable actually requires at the vocabulary level. The same logic governs how we think about Operational Drag, the proportion of capacity consumed by work that does not directly produce output, as Memo #03 argues at the business level. A vocabulary layer that forces agents to re-derive your terminology from context is Operational Drag at the citation layer. It is overhead that compounds silently until it surfaces as a misattribution.
The Machine-Readable Interface layer handles transactions — the WebMCP standard makes that layer browser-native, allowing any page to declare its capabilities as callable tools for agents. The llms.txt file operates one layer before that: it handles terminology. Specifically, it handles the problem of an agent encountering a term it has not seen in its training data and either ignoring it, paraphrasing it incorrectly, or assigning it the nearest known meaning from a different context.
An LLM Anchor Block, the Q&A pair present in every Arco article, works at the content level. It gives a specific answer to a specific question in a crawlable, citable format. But an LLM Anchor Block operates on one page. The Lexicon operates across the entire entity: every term, defined once, at a canonical URL. A spec-compliant llms.txt file, with correctly formatted key-term links, points an agent directly to those definitions before it reads a single article.
The structural order is the argument. A business that builds its llms.txt before building a Lexicon behind it produces a declaration with nothing substantial to declare—URLs and descriptions, but no anchored vocabulary. The agent can find the content. It cannot extract the framework the content depends on. Arco's Lexicon preceded the llms.txt implementation, which means the spec-compliance work was not creating a declaration layer—it was exposing one that already existed. That is the distinction most implementations get backwards.
The practical consequence is about citation accuracy, not citation volume. An agent working from a correctly structured llms.txt, with canonical Lexicon term links in machine-parseable format, encounters Stewardship Model and finds a definition before forming its own interpretation. It finds Coordination Tax and understands it as a structural cost concept with a precise definition—not as approximate language for operational inefficiency. It finds Operational Drag as a defined ratio, not a synonym for friction. A definition the agent was never pointed to cannot be cited. A definition it was pointed to before reading anything else is the one that persists.
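The before-reading handoff can be sketched mechanically: an agent that fetches llms.txt only needs to parse the key-terms list into a term-to-URL map before touching any article. A minimal Python sketch, assuming a hypothetical key-terms section whose term names, glosses, and example.com URLs are illustrative:

```python
import re

# Hypothetical key-terms excerpt from an llms.txt file; the entries
# are illustrative, not Arco's actual Lexicon URLs.
KEY_TERMS = """\
## Key Terms

- [Operational Drag](https://example.com/lexicon/operational-drag): defined ratio of non-output work to capacity
- [Coordination Tax](https://example.com/lexicon/coordination-tax): structural cost of aligning actors
"""

# Matches a Markdown list item of the form "- [Term](URL): optional gloss".
LINK_RE = re.compile(r"^-\s+\[(?P<term>[^\]]+)\]\((?P<url>[^)\s]+)\)")

def extract_terms(text):
    """Map each declared term to its canonical definition URL."""
    terms = {}
    for line in text.splitlines():
        match = LINK_RE.match(line.strip())
        if match:
            terms[match.group("term")] = match.group("url")
    return terms

mapping = extract_terms(KEY_TERMS)
# mapping["Operational Drag"] resolves to its canonical Lexicon URL
```

Note what the descriptive bold-label format gives the same parser: nothing. A prose definition has no link syntax to match, so the term never enters the map and the agent falls back to its own interpretation.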
This is why we describe the Lexicon as infrastructure rather than content. Content serves a reader. Infrastructure serves anything that reads—including systems that will never produce a human-readable output but will pass your terminology into another agent's context window as established vocabulary. The Lexicon was designed for that second case. The llms.txt work made the connection explicit.
The Lexicon did not become useful when we reformatted the key terms section from descriptive bold text to machine-parseable Markdown links pointing to canonical Lexicon entries. It became more accessible to machines. Those are different things. The work that created the value was upstream: defining each term once, precisely, at a canonical URL, before any mechanism existed to declare it. That sequence—infrastructure first, declaration second—is the same logic that governs how we design the businesses we build. The interface is not the asset. The asset is what the interface points to.
KEY TAKEAWAY
What is the difference between an llms.txt file and a Lexicon?
An llms.txt file is a declarative manifest—a structured file that tells an LLM what a site contains and how to navigate it. A Lexicon is a canonical definition store: each term, precisely defined, at a permanent URL. Neither is sufficient alone. An llms.txt without a Lexicon behind it lists content but cannot anchor vocabulary. A Lexicon without a machine-parseable declaration pointing to it may not be reached before an agent forms its own interpretation. The combination—a spec-compliant llms.txt with correctly formatted key-term links pointing to Lexicon entries—is how an agent learns what your terminology means before it reads anything else you have published.
