Why SOM Matters: The Case for a Semantic Web Format for AI Agents

The web has evolved through three eras of consumption, each driven by a new class of consumer that needed the web to speak its language.

In the first era, browsers consumed HTML and rendered it into pixels on a screen. HTML was purpose-built for this: it encodes layout, typography, color, interactivity, and visual hierarchy. The entire specification assumes a human will look at the result.

In the second era, search engines needed to index and rank web content. HTML alone was not sufficient because search engines do not render pages visually. Publishers responded by adding structured metadata: sitemaps told crawlers which pages existed, robots.txt defined access rules, and Schema.org markup embedded machine-readable facts directly into HTML. These additions were designed specifically for non-human consumers.

In the third era, applications needed to consume web data programmatically. REST APIs, GraphQL endpoints, and webhooks emerged as purpose-built interfaces for machine-to-machine communication. No one expected an application to parse HTML to get structured data.

Now we are entering the fourth era. AI agents browse the web, read pages, reason about content, and take actions. They are fundamentally different from every prior consumer. They are not rendering pixels. They are not building an index. They are not calling a structured API. They are reading page content and using it as context for language model reasoning.

And they have no format designed for them.

What breaks when you feed HTML to an LLM

The problems with raw HTML as LLM input are both quantitative and qualitative.

The token cost problem

A typical web page contains 200KB to 400KB of HTML. After tokenization with cl100k_base (the tokenizer used by GPT-4 and similar models), this translates to 30,000 to 60,000 tokens. In our WebTaskBench evaluation across 50 real websites, the average was 33,181 input tokens per page.

The vast majority of these tokens encode information that is irrelevant to agent reasoning:

CSS class names make up a significant fraction of modern HTML. A single Tailwind CSS element might carry class="flex items-center justify-between px-4 py-2 bg-white border-b border-gray-200 shadow-sm sticky top-0 z-50". That is 25 tokens encoding visual presentation that an LLM cannot see and does not need.

JavaScript is embedded inline or referenced via script tags. React, Vue, and Angular applications often include hundreds of kilobytes of application code in the HTML response. None of this is useful as LLM context.

Tracking and analytics markup includes data attributes, pixel images, event handlers, and embedded JSON blobs for tools like Google Analytics, Segment, Hotjar, and dozens of others.

Navigation boilerplate is repeated on every page of a site. The header, footer, sidebar, and cookie consent banner appear identically across thousands of pages but consume tokens on every fetch.

Advertising markup on media sites can account for 30% to 50% of the total HTML, including ad containers, auction scripts, and fallback content.
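
The categories above can be made concrete with a minimal sketch that counts how many characters of a fragment go to class attributes and inline scripts versus visible text. It uses only Python's standard html.parser module. NoiseMeter and SAMPLE are hypothetical names introduced for illustration, and character counts are only a rough stand-in for tokens, not a tokenizer.

```python
from html.parser import HTMLParser


class NoiseMeter(HTMLParser):
    """Tallies characters in class attributes, script/style bodies, and text."""

    def __init__(self):
        super().__init__()
        self.class_chars = 0   # characters inside class="..." values
        self.script_chars = 0  # characters inside <script>/<style> bodies
        self.text_chars = 0    # visible text characters
        self._skip_depth = 0   # >0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            if name == "class" and value:
                self.class_chars += len(value)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            self.script_chars += len(data)
        else:
            self.text_chars += len(data.strip())


# Hypothetical fragment assembled from the patterns described in the text.
SAMPLE = (
    '<header class="flex items-center justify-between px-4 py-2 bg-white '
    'border-b border-gray-200 shadow-sm sticky top-0 z-50">'
    '<script>window.dataLayer = window.dataLayer || [];</script>'
    '<a class="nav-link" href="/login">Log In</a></header>'
)

meter = NoiseMeter()
meter.feed(SAMPLE)
meter.close()
print(meter.class_chars, meter.script_chars, meter.text_chars)
```

Even in this tiny fragment, the class values and the script body each outweigh the only text a reader would see, "Log In".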

The practical consequence is that agents burn through context windows and API budgets on noise. At $3 per million input tokens, processing 1,000 pages of raw HTML costs approximately $100. The same pages in SOM cost approximately $25.
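
The arithmetic behind these figures can be checked with a short sketch. It assumes the 33,181-token average reported above applies to every page; the function name is hypothetical, and the per-page SOM token count is only what the $25 figure implies, not a measured value.

```python
def input_cost_usd(pages: int, tokens_per_page: float, usd_per_million: float) -> float:
    """Input-token cost in USD for fetching `pages` pages."""
    return pages * tokens_per_page * usd_per_million / 1_000_000


# Raw HTML: 1,000 pages at the reported WebTaskBench average of 33,181 tokens.
html_cost = input_cost_usd(1_000, 33_181, 3.0)

# Working backwards: tokens per page implied by a $25 bill for the same 1,000 pages.
som_tokens_per_page = 25.0 * 1_000_000 / (1_000 * 3.0)

print(f"raw HTML: ${html_cost:.2f}")
print(f"implied SOM tokens per page: {som_tokens_per_page:.0f}")
```

Under these assumptions the raw HTML cost comes to just under $100, and the $25 figure corresponds to roughly 8,300 tokens per page.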

The ambiguity problem

Beyond cost, raw HTML creates ambiguity that degrades agent performance. Consider a simple task: "Click the login button on this page."

In HTML, the login button might be:

<button class="btn btn-primary sc-gsnTZi hover:bg-blue-600">Log In</button>

Or it might be:

<a href="/login" class="nav-link">Log In</a>

Or it might be:

<div role="button" tabindex="0" onclick="showLoginModal()">Log In</div>

Or it might be some other generic element with a click handler attached, or a form input. The model has to reason about which HTML patterns constitute "a button" across every possible site. This reasoning consumes tokens and time, and it is error-prone.
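
To illustrate how many separate rules this takes, here is a minimal sketch of a heuristic that finds "button-like" elements in the three snippets above. ClickableFinder and its rule labels are hypothetical, the rule set is deliberately incomplete, and it uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class ClickableFinder(HTMLParser):
    """Collects (tag, reason, label) for elements that look clickable."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._open = None  # [tag, reason, text so far] for the element being read

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        reason = None
        if tag == "button":
            reason = "native button"
        elif tag == "a" and "href" in a:
            reason = "link"
        elif a.get("role") == "button":
            reason = "ARIA role"
        elif "onclick" in a:
            reason = "inline click handler"
        if reason:
            self._open = [tag, reason, ""]

    def handle_data(self, data):
        if self._open is not None:
            self._open[2] += data

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            name, reason, text = self._open
            self.found.append((name, reason, text.strip()))
            self._open = None


# The three login-button variants shown above.
CANDIDATES = (
    '<button class="btn btn-primary sc-gsnTZi hover:bg-blue-600">Log In</button>'
    '<a href="/login" class="nav-link">Log In</a>'
    '<div role="button" tabindex="0" onclick="showLoginModal()">Log In</div>'
)

finder = ClickableFinder()
finder.feed(CANDIDATES)
finder.close()
for tag, reason, label in finder.found:
    print(f"{tag}: {label!r} ({reason})")
```

Each variant needs its own rule, and the sketch still misses handlers attached from JavaScript with addEventListener, which leave no trace in the static markup.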

The interaction gap

The most fundamental problem is that HTML does not explicitly declare what an agent can do. A