SOM-first Websites: How Publishers Can Serve AI Agents Directly
Every day, AI agents crawl your website. Each one independently renders your pages in a headless browser, extracts content through heuristic parsing, and discards the rest. If 50 agents visit the same page, that page is rendered 50 times, consuming server bandwidth, compute, and electricity on both sides.
This is wasteful in the same way that having every search engine re-render every page was wasteful before sitemaps and structured data existed. The solution is the same: give the consumer a purpose-built representation so it does not have to extract one from your HTML.
SOM-first publishing means serving a Semantic Object Model representation of your pages alongside your HTML. Agents that understand SOM fetch the structured representation directly. Traditional browsers see no change. Search engines continue indexing HTML normally.
This guide covers the implementation from scratch for static sites, dynamic sites, and CMS platforms.
Why publishers should care
Reduced infrastructure load
When an agent crawls your site today, it triggers a full page render. If your site uses server-side rendering, that means your server generates the full HTML response. If your site relies on client-side JavaScript, the agent must execute that JavaScript in a headless browser, which may hit your APIs, CDN, and database.
With SOM-first serving, agents fetch a single JSON file. For static sites, this is served directly from your CDN or file storage. For dynamic sites, you can cache the SOM representation with a TTL appropriate to your content freshness requirements. Either way, the load per agent request drops dramatically.
Content control
Without SOM, every agent interprets your HTML however it wants. Different agents use different extraction algorithms, producing different (and sometimes incorrect) representations of your content. You have no control over what they see.
With SOM, you declare the canonical semantic representation. You decide which content is included, how it is structured, and what metadata accompanies it. This is analogous to how Schema.org markup lets you tell search engines "this is the product name, this is the price, this is the rating" rather than hoping the search engine's parser gets it right.
Future-proofing
Agent traffic is growing rapidly. As more users delegate information gathering to AI assistants, sites that are invisible to agents will lose relevance. But sites that actively serve structured content to agents are positioned to be the preferred sources for AI-mediated discovery.
SOM-first serving is the cooperative alternative to the adversarial cycle of blocking and scraping. It signals to agents: "You are welcome here, and this is how I want you to consume my content."
The discovery mechanism
SOM-aware agents check three places to find a site's SOM representation:
- Well-known path: /.well-known/som.json (checked first by convention)
- HTML link tag: <link rel="alternate" type="application/som+json" href="/.well-known/som.json"> in the page <head>
- robots.txt directive: SOM-Endpoint: /.well-known/som.json
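As a rough illustration of how an agent might walk those three places, the sketch below collects candidate SOM URLs for one page. The function name somCandidates and the regex-based parsing are assumptions made for this example, not part of any SOM tooling; a real agent would likely use a proper HTML parser.

```javascript
// Sketch of SOM discovery for one page. Only the lookup order comes from the
// guide; the function name and the regex parsing are illustrative.
function somCandidates(pageUrl, html, robotsTxt) {
  const candidates = [];

  // 1. Well-known path, checked first by convention.
  candidates.push(new URL('/.well-known/som.json', pageUrl).href);

  // 2. <link rel="alternate" type="application/som+json" href="..."> in the HTML.
  for (const [tag] of html.matchAll(/<link\b[^>]*>/gi)) {
    const href = /\bhref\s*=\s*["']([^"']+)["']/i.exec(tag);
    if (/application\/som\+json/i.test(tag) && href) {
      candidates.push(new URL(href[1], pageUrl).href);
    }
  }

  // 3. "SOM-Endpoint: <path>" line in robots.txt.
  const directive = /^\s*SOM-Endpoint\s*:\s*(\S+)/im.exec(robotsTxt);
  if (directive) {
    candidates.push(new URL(directive[1], pageUrl).href);
  }

  // Drop duplicates while keeping the lookup order.
  return [...new Set(candidates)];
}
```

An agent would then try each candidate in order and stop at the first one that returns valid JSON.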
Implementation for static sites
Static sites (Hugo, Jekyll, Astro, Eleventy, or plain HTML) are the simplest case.
Step 1: Generate the SOM
Install Plasmate and fetch your homepage:
npm install -g plasmate
plasmate fetch https://your-site.com > som.json
Examine the output to verify it captures your content correctly:
cat som.json | python3 -m json.tool | head -30
You should see your page title, semantic regions (navigation, main, footer), and content elements with their roles and text.
Step 2: Place at the well-known path
mkdir -p public/.well-known
cp som.json public/.well-known/som.json
The public/ directory is the static assets root for several frameworks, but not all of them. Adjust the path for yours:
| Framework | Path |
|---|---|
| Hugo | static/.well-known/som.json |
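| Jekyll | _site/.well-known/som.json after the build, or .well-known/ in the source root with .well-known added to include in _config.yml (Jekyll skips dot-directories by default) |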
| Astro | public/.well-known/som.json |
| Next.js (static export) | public/.well-known/som.json |
| Plain HTML | .well-known/som.json (relative to document root) |
Step 3: Add the HTML link tag
In your base template or layout file, add this to the <head>:
<link rel="alternate" type="application/som+json"
href="/.well-known/som.json">
For Hugo, add it to layouts/partials/head.html. For Jekyll, add it to _includes/head.html. For Astro, add it to your base layout component.
Step 4: Add the robots.txt directive
Append to your robots.txt:
SOM-Endpoint: /.well-known/som.json
SOM-Version: 1.0
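If robots.txt is produced by a build script, the append can be made repeatable. The helper below is a sketch under that assumption; withSomDirectives is a hypothetical name, and the default values are simply the two directives shown above.

```javascript
// Returns robots.txt content with the SOM directives present exactly once.
function withSomDirectives(robotsTxt, endpoint = '/.well-known/som.json', version = '1.0') {
  // Remove any earlier SOM-* lines so repeated builds do not stack duplicates.
  const kept = robotsTxt
    .split(/\r?\n/)
    .filter((line) => !/^\s*SOM-(Endpoint|Version)\s*:/i.test(line));

  // Trim trailing blank lines before appending.
  while (kept.length > 0 && kept[kept.length - 1].trim() === '') {
    kept.pop();
  }

  return [...kept, `SOM-Endpoint: ${endpoint}`, `SOM-Version: ${version}`].join('\n') + '\n';
}
```

Running it twice on the same file gives the same result as running it once, so it is safe to call on every build.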
Step 5: Automate regeneration
Add the SOM generation to your build pipeline so the representation stays current:
# In your CI/CD script or Makefile
plasmate fetch https://your-site.com > public/.well-known/som.json
For sites deployed on Vercel, Netlify, or Cloudflare Pages, add this as a post-build step.
Implementation for dynamic sites
Dynamic sites (Express, Rails, Django, Laravel) serve different content per request. The SOM representation needs to be generated per page or per template.
Option A: build-time generation for key pages
If your site has a known set of important pages (homepage, about, pricing, docs), generate SOM for each during your build or deploy:
PAGES="https://your-site.com https://your-site.com/about https://your-site.com/pricing"
for url in $PAGES; do
  # Strip the origin, then turn slashes into underscores: /about becomes _about
  slug=$(echo "$url" | sed 's|https://your-site.com||' | sed 's|/|_|g')
  plasmate fetch "$url" > "public/.well-known/som${slug}.json"
done
Serve the appropriate SOM file based on the request path.
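One way to do that lookup is to apply the same slug rule to the incoming request path. The sketch below assumes the file naming produced by the loop above (som.json for the homepage, som_about.json for /about); somFileForPath is a hypothetical helper name.

```javascript
// Mirrors the build-time slug rule: strip the origin, then turn "/" into "_".
function somFileForPath(requestPath) {
  // Drop the query string and trailing slashes so "/about/" and "/about" match.
  const path = requestPath.split('?')[0].replace(/\/+$/, '');
  const slug = path.replace(/\//g, '_');
  return `public/.well-known/som${slug}.json`;
}
```

Because every slash is replaced, the result is always a single file name inside public/.well-known/, and a request for a page without a generated file simply maps to a file that does not exist.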
Option B: on-demand generation with caching
For sites with many pages or frequently changing content, generate SOM on demand and cache it:
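import { execFileSync } from 'child_process';

// Assumes `app` is your existing Express application.
const somCache = new Map();
const CACHE_TTL = 300_000; // 5 minutes

app.get('/.well-known/som.json', (req, res) => {
  // Accept only a site-relative path from ?page= so the query string alone
  // cannot point the fetch at another host.
  const requested = typeof req.query.page === 'string' ? req.query.page : '/';
  const page = requested.startsWith('/') && !requested.startsWith('//') ? requested : '/';
  const pageUrl = `${req.protocol}://${req.get('host')}${page}`;

  const cached = somCache.get(pageUrl);
  if (cached && Date.now() - cached.time < CACHE_TTL) {
    res.set('Content-Type', 'application/json');
    res.set('Cache-Control', 'public, max-age=300');
    return res.send(cached.data);
  }

  try {
    // execFileSync passes the URL as a single argument without invoking a
    // shell, so request input cannot inject commands.
    const som = execFileSync('plasmate', ['fetch', pageUrl], {
      timeout: 15000,
      encoding: 'utf-8',
    });
    somCache.set(pageUrl, { data: som, time: Date.now() });
    res.set('Content-Type', 'application/json');
    res.set('Cache-Control', 'public, max-age=300');
    res.send(som);
  } catch (err) {
    res.status(503).json({ error: 'SOM generation failed' });
  }
});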
Option C: use the SOM Cache as your SOM provider
Instead of running Plasmate locally, register your site with the SOM Cache at cache.plasmate.app. The cache handles generation, caching, and serving on your behalf:
<link rel="alternate" type="application/som+json"
href="https://cache.plasmate.app/v1/som?url=https://your-site.com">
This offloads all compute to the cache infrastructure. Agents that follow the link tag fetch the SOM from the cache instead of from your server.
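For per-page links, the page URL can itself contain a query string, so it is safer to percent-encode it before placing it in the url parameter. The sketch below assumes the cache accepts a standard percent-encoded url value, which the guide does not state explicitly; somCacheLinkTag is a hypothetical helper name.

```javascript
// Builds the link tag for one page, percent-encoding the page URL.
// Assumption: the cache endpoint decodes the `url` parameter the standard way.
function somCacheLinkTag(pageUrl) {
  const endpoint = new URL('https://cache.plasmate.app/v1/som');
  endpoint.searchParams.set('url', pageUrl);
  return `<link rel="alternate" type="application/som+json" href="${endpoint.href}">`;
}
```

Calling somCacheLinkTag from your layout with the current page's canonical URL gives every page its own SOM alternate.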
CMS and framework integration notes
Static and dynamic examples cover the concept, but many publishers run on frameworks and CMS platforms where the practical details live in plugins, middleware, and cache invalidation.
Next.js (App Router or Pages Router)
If you already have a Next.js site, the easiest SOM endpoint is an API route that returns a cached JSON payload.
A practical pattern:
- Use a route handler at app/.well-known/som.json/route.ts.
- Generate SOM in a background job or on deploy.
- Store the JSON in an object store (S3, R2) or a KV cache.
- In the route, return the stored blob with strong cache headers.
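A minimal sketch of such a handler follows. loadStoredSom is a hypothetical stand-in for a read from your object store or KV cache, the sample document is placeholder data, and the cache header values are one reasonable choice rather than a requirement. In a real route file the GET function would be exported.

```javascript
// Hypothetical loader: replace with a read from S3, R2, or your KV store.
async function loadStoredSom() {
  return JSON.stringify({
    som_version: '1.0',
    url: 'https://your-site.com/',
    title: 'Home',
    regions: [],
  });
}

// In the Next.js route file this would be `export async function GET()`.
async function GET() {
  const body = await loadStoredSom();
  // Response is the standard Fetch API class, available globally in Node 18+.
  return new Response(body, {
    status: 200,
    headers: {
      'Content-Type': 'application/json',
      // Short TTL for agents, longer TTL for shared caches such as a CDN.
      'Cache-Control': 'public, max-age=300, s-maxage=3600',
    },
  });
}
```

The handler does no SOM generation itself, so its cost per request is a single storage read.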
Cloudflare and edge caching
If you serve through a CDN, push caching to the edge.
- Set Cache-Control: public, max-age=300, s-maxage=3600 as a default.
- Add an ETag header so agents can revalidate cheaply.
- Consider stale-while-revalidate if your CDN supports it.
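The ETag part can be sketched independently of any framework. The function below is illustrative: somResponse is a hypothetical name, the ETag is a truncated SHA-256 of the document, and the If-None-Match handling is simplified to a single exact match (real headers may carry several tags or weak validators).

```javascript
// With ES modules, use: import { createHash } from 'node:crypto';
const { createHash } = require('node:crypto');

// Returns the status, headers, and body to send for a SOM request.
function somResponse(somJson, ifNoneMatch) {
  // Strong ETag derived from the exact content of the SOM document.
  const digest = createHash('sha256').update(somJson).digest('hex').slice(0, 32);
  const etag = `"${digest}"`;
  const headers = {
    'Content-Type': 'application/json',
    'Cache-Control': 'public, max-age=300, s-maxage=3600',
    ETag: etag,
  };

  // If the agent already holds this version, skip the body.
  if (ifNoneMatch === etag) {
    return { status: 304, headers, body: '' };
  }
  return { status: 200, headers, body: somJson };
}
```

An agent that sends back the ETag it received gets a 304 with no body until the SOM document actually changes.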
WordPress
WordPress is a natural home for SOM-first support because it already has:
- a publish workflow
- a plugin ecosystem
- established caching plugins
Two endpoints fit this model:
- /.well-known/som.json as a site-level index or homepage SOM
- /som?post= as a per-post SOM endpoint
Headless CMS platforms
If you run a headless CMS like Contentful or Sanity, you might already have a structured JSON representation of your content. That does not eliminate the need for SOM.
CMS JSON describes your content model. SOM describes the rendered semantic surface an agent experiences, including:
- navigation structure
- cross-links and related content blocks
- disclosure widgets, tabs, and other UI affordances
- tables and callouts that are assembled at render time
The two can coexist:
- keep the CMS JSON for developers and integrations
- publish SOM as the agent-friendly view of the rendered page
Validation and schema stability
If you want third parties to build against your SOM endpoint, validate the output.
- Keep required fields stable: page URL, title, regions, element ids, element roles.
- Allow additive evolution: new optional fields should not break consumers.
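One lightweight way to check for additive evolution is to compare a previous SOM document with a new one and list any key paths that disappeared. The sketch below is a smoke test under simplifying assumptions, not a schema validator: removedPaths is a hypothetical helper and it inspects only the first item of each array. The field names inside regions used in the usage example (id, role, label) are placeholders.

```javascript
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Lists key paths present in `previous` but missing from `next`.
// An empty result means the change was purely additive.
function removedPaths(previous, next, prefix = '') {
  const removed = [];
  for (const key of Object.keys(previous)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (!isPlainObject(next) || !(key in next)) {
      removed.push(path);
      continue;
    }
    const before = previous[key];
    const after = next[key];
    if (isPlainObject(before)) {
      removed.push(...removedPaths(before, after, path));
    } else if (Array.isArray(before) && isPlainObject(before[0])) {
      if (!Array.isArray(after)) {
        removed.push(`${path}[]`);
      } else if (after.length > 0) {
        // Compare the shape of the first item only.
        removed.push(...removedPaths(before[0], after[0], `${path}[]`));
      }
    }
  }
  return removed;
}
```

For example, adding an optional label to each region yields an empty list, while dropping role from regions yields ["regions[].role"]. Run in CI against the last published document, a non-empty result flags a potentially breaking change before it ships.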
Freshness and caching strategies
Different types of content have different freshness requirements:
| Content Type | Recommended TTL | Strategy |
|---|---|---|
| Static pages (about, docs) | 24 hours or more | Regenerate on deploy |
| News articles | 15 to 60 minutes | On-demand with short cache |
| Ecommerce product pages | 5 to 15 minutes | On-demand with cache invalidation on price change |
| Real-time data (stock prices) | 1 to 5 minutes | On-demand, low TTL |
| User-generated content | 30 to 60 minutes | On-demand with moderate cache |
Set Cache-Control headers on your SOM responses. The max-age value tells agents how long they can use a cached version before refetching.
For content that changes unpredictably (breaking news, flash sales), use stale-while-revalidate to serve stale content while regenerating in the background:
Cache-Control: public, max-age=60, stale-while-revalidate=300
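The TTL table can be turned into a small lookup that builds the header value. This is a sketch: the content type keys and the somCacheControl name are made up for the example, and each TTL is taken from the low end of the recommended range.

```javascript
// TTLs in seconds, using the low end of each recommended range.
const SOM_TTL_SECONDS = new Map([
  ['static', 24 * 60 * 60], // about, docs
  ['news', 15 * 60],
  ['ecommerce', 5 * 60],
  ['realtime', 60],
  ['ugc', 30 * 60], // user-generated content
]);

// Builds a Cache-Control value for a SOM response of the given content type.
function somCacheControl(contentType, { staleWhileRevalidate = 0 } = {}) {
  const maxAge = SOM_TTL_SECONDS.get(contentType);
  if (maxAge === undefined) {
    throw new Error(`Unknown content type: ${contentType}`);
  }
  const parts = ['public', `max-age=${maxAge}`];
  if (staleWhileRevalidate > 0) {
    parts.push(`stale-while-revalidate=${staleWhileRevalidate}`);
  }
  return parts.join(', ');
}
```

Calling somCacheControl('realtime', { staleWhileRevalidate: 300 }) produces the same header as the example above.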
Verifying your setup
After implementing SOM-first serving, verify that agents can discover and fetch your SOM:
Check the well-known path
curl -s https://your-site.com/.well-known/som.json | python3 -m json.tool | head -10
You should see valid JSON with som_version, url, title, and regions fields.
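To automate that check, a small function can report which of the four top-level fields are absent. This is only a presence check under the assumption that those four names are all you want to enforce; missingSomFields is a hypothetical helper and does not replace schema validation.

```javascript
// Returns the expected top-level fields that are missing from a SOM document.
function missingSomFields(jsonText) {
  let doc;
  try {
    doc = JSON.parse(jsonText);
  } catch {
    return ['<invalid JSON>'];
  }
  if (doc === null || typeof doc !== 'object' || Array.isArray(doc)) {
    return ['<not a JSON object>'];
  }
  // Field names as listed in the check above.
  return ['som_version', 'url', 'title', 'regions'].filter((field) => !(field in doc));
}
```

An empty array means all four fields are present.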
Check the HTML link tag
curl -s https://your-site.com | grep -i "som+json"
You should see the <link> tag.
Check robots.txt
curl -s https://your-site.com/robots.txt | grep -i "SOM"
You should see the SOM-Endpoint directive.
Validate the SOM output
Use the JSON Schema to validate your SOM document:
npm install -g ajv-cli
ajv validate -s node_modules/plasmate/specs/som-schema.json \
-d your-som-output.json
Who is already doing this
Six properties currently serve SOM alternates:
| Site | Type | SOM Path |
|---|---|---|
| plasmate.app | Product site | /.well-known/som.json |
| docs.plasmate.app | Documentation | /.well-known/som.json |
| plasmatelabs.com | Company site | /.well-known/som.json |
| somordom.com | Comparison tool | /.well-known/som.json |
| betterbrowser.ai | Landing page | /.well-known/som.json |
| cache.plasmate.app | API dashboard | /.well-known/som.json |
The bigger picture
SOM-first publishing is one of three infrastructure primitives we propose for the agentic web:
- SOM provides the structured representation that agents consume.
- Agent Web Protocol (AWP) provides the interaction protocol for agents to navigate and act on pages.
- Cooperative robots.txt directives provide the discovery and permission mechanism.
The detailed proposal is in our robots.txt for the agentic web documentation, and the full vision is described in The Agentic Web blog post.