The agent-ready web checklist: 23 things every indie site should ship in 2026
llms.txt, MCP card, agent.json, Content Signals, x402, Web Bot Auth. Every boring standard an AI agent expects to find. Here's what to ship and how to verify it with curl.
Crawlers in 2026 are not the crawlers of 2016. Claude, ChatGPT, Perplexity, Google AI Overviews, and a growing fleet of purpose-built agents need more than a sitemap. They need a stable, declarative account of what your site is, what it costs, who to contact, and what they're allowed to do with it. Here is the complete 2026 checklist I run before any site goes live, with the curl command I use to verify each one.
This is not about chasing trends. It is about reducing ambiguity. An agent that understands your site's structure, licensing, and capabilities without expensive guesswork is an agent that can represent your work more accurately and fairly. It is a small, one-time investment in clarity.
Here is the full list, then we will break it down.
| # | Path/spec | Priority | Verify-with-curl |
|---|---|---|---|
| 1 | /robots.txt | must | curl https://yoursite.com/robots.txt |
| 2 | /sitemap.xml | must | curl -sL https://yoursite.com/sitemap.xml \| xmllint --format - |
| 3 | /feed.xml (RSS) | must | curl -sL https://yoursite.com/feed.xml \| xmllint --format - |
| 4 | /atom.xml | should | curl -sL https://yoursite.com/atom.xml \| xmllint --format - |
| 5 | /.well-known/security.txt | must | curl -I https://yoursite.com/.well-known/security.txt |
| 6 | /.well-known/ai.txt | should | curl https://yoursite.com/.well-known/ai.txt |
| 7 | /llms.txt | must | curl https://yoursite.com/llms.txt |
| 8 | /llms-full.txt | should | curl https://yoursite.com/llms-full.txt |
| 9 | /agent.json | should | curl https://yoursite.com/agent.json |
| 10 | /mcp-card.json | nice | curl https://yoursite.com/mcp-card.json |
| 11 | Person schema (JSON-LD) | must | (view-source and search for "Person") |
| 12 | Article schema (JSON-LD) | must | (view-source and search for "Article") |
| 13 | WebSite schema (JSON-LD) | must | (view-source and search for "WebSite") |
| 14 | BreadcrumbList schema | should | (view-source and search for "BreadcrumbList") |
| 15 | SoftwareApplication schema | nice | (view-source and search for "SoftwareApplication") |
| 16 | HTTP 402 support | nice | curl -I https://yoursite.com/api/endpoint |
| 17 | Web Bot Auth (RFC 9421) | nice | curl https://yoursite.com/.well-known/http-message-signatures-directory |
| 18 | /.well-known/x-payment | nice | curl https://yoursite.com/.well-known/x-payment |
| 19 | /openapi.json | should | curl https://yoursite.com/openapi.json |
| 20 | Cloudflare Content Signals | must | curl https://yoursite.com/.well-known/content-signals/manifest.json |
| 21 | Explicit license | must | (check MDX frontmatter) |
| 22 | Canonical URL | must | (view-source and search for "canonical") |
| 23 | Last-Modified headers | must | curl -I https://yoursite.com/writing/some-post |
Group A: Crawlability and identity
These are the absolute basics. They tell agents "Here I am, here is my content, and here are the rules of engagement."
1. /robots.txt with explicit rules
The User-agent: * wildcard is no longer sufficient. Major model providers now operate multiple crawlers for different purposes (web search, training data, RAG). Be explicit. Your robots.txt should name the bots you care about.
# /robots.txt
User-agent: GPTBot
Disallow: /private/
User-agent: ClaudeBot
Disallow: /private/
User-agent: PerplexityBot
Disallow: /private/
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: *
Allow: /
Disallow: /private/
Sitemap: https://yoursite.com/sitemap.xml
Reasoning: Explicit rules remove ambiguity. Disallowing Google-Extended opts your content out of training and grounding for Gemini and Vertex AI without affecting your inclusion in Google Search. Being explicit is a cheap way to assert your preferences.
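To confirm that the bots you care about really have their own group, a small helper can list the ones that are missing. This is a minimal sketch: the function name `missing_bots` is my own, and it only checks that a matching `User-agent` line exists, not what the rules under it say.

```shell
#!/usr/bin/env bash
# Report which crawler tokens have no User-agent line of their own.
# Reads a robots.txt on stdin; the bot names to look for are the arguments.
# (missing_bots is a hypothetical helper name, not a standard tool.)
missing_bots() {
  local robots bot
  robots=$(cat)
  for bot in "$@"; do
    # Case-insensitive match on a whole "User-agent: <bot>" line.
    if ! printf '%s\n' "$robots" \
        | grep -qiE "^[[:space:]]*user-agent:[[:space:]]*${bot}[[:space:]]*$"; then
      echo "$bot"
    fi
  done
}

# Example against an inline sample rather than a live site.
sample='User-agent: GPTBot
Disallow: /private/
User-agent: *
Allow: /'
printf '%s\n' "$sample" | missing_bots GPTBot ClaudeBot
```

Against a live site, the same function can be fed from curl: `curl -s https://yoursite.com/robots.txt | missing_bots GPTBot ClaudeBot PerplexityBot`.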
2. /sitemap.xml
This is not new, but its importance has grown. Agents use sitemaps as a primary manifest for discovering all crawlable URLs. Ensure your build process generates this automatically from your routes.
Verify: curl -sL https://yoursite.com/sitemap.xml | xmllint --format -
3. /feed.xml (RSS 2.0)
Your RSS feed should contain the full content of your posts, not just an excerpt. Agents performing Retrieval Augmented Generation (RAG) prefer to ingest full content from a structured feed rather than scraping and parsing HTML. An excerpt forces a second round-trip, which is inefficient.
Reasoning: A full-content feed is the single best way to ensure your latest writing is indexed quickly and accurately by feed-first services like Perplexity.
4. /atom.xml
Some older or more specialized feed readers prefer Atom. Since most static site generators can produce both RSS 2.0 and Atom with a single template, it is cheap to offer both. It is a small signal that you care about web standards.
Reasoning: Redundancy for a key discovery mechanism is good.
5. /.well-known/security.txt
Defined in RFC 9116, this file tells security researchers how to contact you. Agents and security tools look for this file to verify a site's operational maturity.
Verify: curl -I https://yoursite.com/.well-known/security.txt
Example file:
# /.well-known/security.txt
Contact: mailto:security@yoursite.com
Expires: 2027-12-31T23:59:59Z
Preferred-Languages: en
Reasoning: It is a standard, professional courtesy. It signals you are a serious operator.
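Because the Expires field is easy to forget, it may be worth checking that it has not lapsed. The sketch below assumes the timestamp is written in UTC with a trailing Z, as in the example file, so that a plain string comparison orders the values correctly; `security_txt_is_current` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Succeeds if the Expires field of a security.txt (stdin) is later than "now".
# Both values must be UTC timestamps of the form YYYY-MM-DDTHH:MM:SSZ;
# other offsets are not handled by this sketch.
security_txt_is_current() {
  local now=$1 expires
  expires=$(tr -d '\r' | grep -i '^Expires:' | head -n 1 | sed 's/^[^:]*:[[:space:]]*//')
  [ -n "$expires" ] && [ "$expires" \> "$now" ]
}

sample='Contact: mailto:security@yoursite.com
Expires: 2027-12-31T23:59:59Z'

if printf '%s\n' "$sample" | security_txt_is_current "2026-01-01T00:00:00Z"; then
  echo "security.txt is current"
else
  echo "security.txt is missing Expires or has expired"
fi
```

For a live check, pass the current time instead of a fixed one: `curl -s https://yoursite.com/.well-known/security.txt | security_txt_is_current "$(date -u +%Y-%m-%dT%H:%M:%SZ)"`.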
6. /.well-known/ai.txt
While not a formal RFC, this is an emerging convention for a plain-text, human-readable summary of your AI policy. It can be a simpler version of what you declare in Content Signals.
Example file:
# /.well-known/ai.txt
AI Usage Policy for yoursite.com
- Training: Allowed on public content with attribution.
- Grounding: Allowed on all public content.
- Indexing: Allowed.
- Contact: ai-policy@yoursite.com
Reasoning: It provides a human-readable fallback for any agent or person who wants to quickly understand your terms.
Group B: Structured identity
These files provide machine-readable manifests of your content and capabilities, specifically for AI agents.
7. /llms.txt
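This is a short Markdown file that points crawlers at your most important pages. It is a hint about what content defines your site. Think of it as a curated, high-signal sitemap. Keep it under 20 lines. The proposal lives at llmstxt.org; it describes an H1 title, a blockquote summary, and H2 sections containing lists of links.
Example file (/llms.txt):
# yoursite.com
> Personal site of Kieran: writing and projects.
## Key pages
- [About](https://yoursite.com/about)
- [Most popular post](https://yoursite.com/writing/most-popular-post)
- [Main project](https://yoursite.com/projects/main-project)
- [Contact](https://yoursite.com/contact)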
Reasoning: Crawl budgets are finite. This file tells an agent "If you only read 5 pages, read these." It helps them build a more accurate summary of who you are and what you do.
8. /llms-full.txt
This is a single, concatenated text file containing the full content of your 20-50 most important pages. It is a pre-packaged corpus for any agent that wants to build a deep understanding of your site in a single request.
Reasoning: This dramatically lowers the cost for an agent to "read" your site. Instead of 20-50 separate HTTP requests and HTML parsing steps, it is one request and one plain-text parse. You are making it cheap and easy for them to get it right.
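One way to produce this file at build time is to concatenate a hand-picked list of source files. This is a sketch under assumptions: the `=====` separator line and the function name `build_llms_full` are my own choices, not part of any spec, and the demo uses throwaway files rather than real posts.

```shell
#!/usr/bin/env bash
# Concatenate a list of source files into one plain-text corpus.
# Each file is preceded by a separator line naming it, so a reader can tell
# where one page ends and the next begins. (Separator format is an assumption.)
build_llms_full() {
  local out=$1 file
  shift
  : > "$out"                       # start from an empty output file
  for file in "$@"; do
    printf '===== %s =====\n' "$file" >> "$out"
    cat "$file" >> "$out"
    printf '\n' >> "$out"
  done
}

# Demo with two throwaway files in a temp directory.
demo_dir=$(mktemp -d)
printf 'About me.\n' > "$demo_dir/about.md"
printf 'My best post.\n' > "$demo_dir/post.md"
build_llms_full "$demo_dir/llms-full.txt" "$demo_dir/about.md" "$demo_dir/post.md"
grep -c '^=====' "$demo_dir/llms-full.txt"   # number of pages included
```

Passing the file list explicitly keeps the selection curated; a glob over every post would defeat the point of choosing your 20-50 most important pages.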
9. /agent.json
This is a custom file I ship, not a standard, and I have seen it work well. It is a JSON file that describes the site's capabilities from an agent's perspective. JSON has no comment syntax, so the served file must not contain a path comment.
Example file (/agent.json):
{
  "owner": {
    "name": "Kieran",
    "contact": "mailto:kieran@yoursite.com"
  },
  "capabilities": {
    "search": "/api/search?q={query}",
    "ask": "/api/ask"
  },
  "rateLimits": {
    "anonymous": "10/minute",
    "authenticated": "100/minute"
  },
  "preferredAuth": "WebBotAuth"
}
Reasoning: It explicitly declares API endpoints and rate limits, information an autonomous agent would otherwise have to guess.
10. /mcp-card.json
If you expose a Model Context Protocol (MCP) endpoint, this card serves as its advertisement. MCP allows agents to request specific, structured context from your site. This is an advanced feature, but one to watch.
Reasoning: MCP is a promising standard for letting agents query your site's "brain" directly, instead of just scraping its pages.
Group C: On-page schema
Schema.org's JSON-LD implementation is the de facto standard for adding structured data to your pages. Agents rely heavily on it.
11. Person schema on /about
Clearly identify yourself as the author and operator of the site.
Reasoning: Establishes E-E-A-T (Experience, Expertise, Authoritativeness, and Trustworthiness) signals.
12. Article schema on every post
Include author, datePublished, and dateModified. This is critical for agents to understand the provenance and freshness of your content.
Reasoning: Without it, agents may misattribute your work or present old information as new.
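A quick way to catch a template that dropped one of these fields is to search the rendered HTML for the key names. This is only a sketch: it is a plain text search rather than a JSON-LD parser, so it proves the key appears somewhere in the markup and nothing more. `missing_schema_keys` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# List which of the given JSON-LD keys do not appear in a page's HTML (stdin).
# Text search only: it does not validate the JSON or the schema type.
missing_schema_keys() {
  local html key
  html=$(cat)
  for key in "$@"; do
    printf '%s\n' "$html" | grep -q "\"$key\"" || echo "$key"
  done
}

# Sample page fragment that forgot dateModified.
sample='<script type="application/ld+json">
{"@type":"Article","author":{"@type":"Person","name":"Kieran"},"datePublished":"2026-01-10"}
</script>'
printf '%s\n' "$sample" | missing_schema_keys author datePublished dateModified
```

Against a live post: `curl -s https://yoursite.com/writing/some-post | missing_schema_keys author datePublished dateModified`.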
13. WebSite schema on /
Include a potentialAction of type SearchAction that describes your site's search URL. Google used this markup for its sitelinks search box until it retired that feature in late 2024; the markup still gives other agents a declared search endpoint.
Reasoning: It is a simple way to signal your site's core functionality.
14. BreadcrumbList on deep pages
Helps agents understand the structure and hierarchy of your site, improving their ability to navigate and categorize your content.
Reasoning: Good for both SEO and agent comprehension.
15. SoftwareApplication schema on project pages
If you have a page for a software project, use this schema to describe what it is, its operating system requirements, and its category.
Reasoning: Provides structured data for tool-focused agents and app directories.
Group D: Payment and authentication
If your site has an API or interactive components, this group defines how agents can pay for and authenticate with them. If your site is fully static, you can skip this group.
16. HTTP 402 support
The 402 Payment Required status code was reserved for exactly this kind of use. When an agent hits a metered endpoint, respond with a 402 and tell the client how to pay, typically by following a protocol such as x402 or L402 (formerly LSAT).
Reasoning: Agents will not sign up for monthly SaaS plans. They are built for per-call micropayments. Supporting 402 is how you sell services to machines.
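The table's `curl -I` check for this item comes down to reading the status line. The sketch below does that on a captured header dump; `payment_status` is a hypothetical helper name, and it deliberately ignores the payment headers themselves, whose names depend on which protocol you adopt.

```shell
#!/usr/bin/env bash
# Read a raw response header dump (as printed by `curl -sI`) on stdin and
# report whether the endpoint is asking for payment.
payment_status() {
  local status
  # The status line looks like "HTTP/2 402" or "HTTP/1.1 402 Payment Required".
  # With redirects there can be several status lines, so keep the last one.
  status=$(tr -d '\r' | grep -E '^HTTP/' | tail -n 1 | awk '{print $2}')
  if [ "$status" = "402" ]; then
    echo "payment required"
  else
    echo "no payment required (status ${status:-unknown})"
  fi
}

# Sample header dump instead of a live request.
printf 'HTTP/1.1 402 Payment Required\r\nContent-Type: application/json\r\n\r\n' | payment_status
```

With a real endpoint this becomes `curl -sI https://yoursite.com/api/endpoint | payment_status`.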
17. Web Bot Auth (RFC 9421)
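Web Bot Auth builds on HTTP Message Signatures (RFC 9421). A bot signs each request with a private key and publishes the matching public keys at /.well-known/http-message-signatures-directory on its own domain. Your server fetches that directory and verifies the signature. You only need to host a directory yourself if your site also sends signed requests of its own.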
Reasoning: This is a decentralized, password-free way to identify and trust specific agents, allowing you to offer them higher rate limits.
18. /.well-known/x-payment
A simple text file listing the payment methods you accept, like USDC on Base or LNURL.
Reasoning: An agent needs to know how to pay you. This file is a simple, machine-readable way to advertise your accepted assets.
19. OpenAPI spec at /openapi.json
If you have an API, you must have an OpenAPI (formerly Swagger) specification. Agents can ingest this file to automatically generate clients and understand how to call your endpoints.
Reasoning: This is the universal language for describing APIs. Without it, you are asking the agent to guess.
Group E: Content signals and licensing
This group is about clearly stating your terms. What can agents do with your content, and how can they be sure it is fresh?
20. Cloudflare Content Signals
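Cloudflare's Content Signals Policy lets you declare your preferences for different uses of your content. As published, the policy is written as Content-Signal lines in robots.txt with three signals: search, ai-input, and ai-train. Treat the manifest at /.well-known/content-signals/manifest.json, which restates those preferences as training, grounding, and indexing, as a site-level convention rather than part of the published policy. You can learn more on the Cloudflare blog.
Reasoning: It declares your terms in a machine-readable format. Compliance by crawlers is voluntary, so read it as a clear statement of preference rather than an enforcement mechanism.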
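As I understand Cloudflare's published policy, the signals themselves travel as `Content-Signal` lines inside robots.txt. A minimal sketch for listing what a robots.txt currently declares follows; `content_signals` is a hypothetical helper name, and the sample values are illustrative, not a recommendation.

```shell
#!/usr/bin/env bash
# Print every Content-Signal directive found in a robots.txt (stdin),
# one "name=value" pair per line.
content_signals() {
  tr -d '\r' \
    | grep -i '^[[:space:]]*content-signal:' \
    | sed 's/^[^:]*:[[:space:]]*//' \
    | tr ',' '\n' \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}

# Inline sample; the values are placeholders.
sample='User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /'
printf '%s\n' "$sample" | content_signals
```

Run against a live site with `curl -s https://yoursite.com/robots.txt | content_signals`; empty output means no signals are declared.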
21. Explicit license on every MDX frontmatter
Include a license key in your frontmatter, for example license: "CC-BY-4.0". This data can then be exposed in your Article schema.
Reasoning: This creates per-page licensing, which is more flexible than a single site-wide policy. It is an unambiguous signal right next to the content itself.
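The table's "check MDX frontmatter" step can be automated. This sketch assumes frontmatter is the block between the first two `---` lines and that the key is written as `license:` at the start of a line; `files_missing_license` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the path of every given MDX file whose frontmatter has no license key.
# Frontmatter is taken to be the lines between the first two "---" lines.
files_missing_license() {
  local file
  for file in "$@"; do
    # awk prints only the frontmatter block; grep then looks for the key.
    if ! awk '/^---[[:space:]]*$/ { n++; next } n == 1 { print } n >= 2 { exit }' "$file" \
        | grep -q '^license:'; then
      echo "$file"
    fi
  done
}

# Demo: one licensed file, one where "license:" appears only in the body.
demo_dir=$(mktemp -d)
printf -- '---\ntitle: Licensed\nlicense: "CC-BY-4.0"\n---\nBody\n' > "$demo_dir/a.mdx"
printf -- '---\ntitle: Unlicensed\n---\nlicense: not in frontmatter\n' > "$demo_dir/b.mdx"
files_missing_license "$demo_dir"/*.mdx
```

Wired into a build step, a non-empty result can fail the build before an unlicensed post ships.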
22. Canonical URL on every post
Use <link rel="canonical" href="..."> on every single page. This is crucial for preventing agents from getting confused by duplicate content, especially from syndication or development previews.
Reasoning: It is the definitive signal for "this is the original source."
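Rather than reading view-source by eye, the canonical URL can be pulled out of the HTML and compared with the URL you expected. This sketch assumes the link tag sits on a single line and uses double-quoted attributes; a real HTML parser would be more robust. `canonical_url` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Extract the canonical URL from a page's HTML (stdin).
# Assumes the <link> tag is on one line with double-quoted attributes.
canonical_url() {
  grep -oiE '<link[^>]*rel="canonical"[^>]*>' \
    | head -n 1 \
    | sed -E 's/.*href="([^"]*)".*/\1/'
}

sample='<head><title>Post</title><link rel="canonical" href="https://yoursite.com/writing/some-post"></head>'
printf '%s\n' "$sample" | canonical_url
```

On a preview deployment, `curl -s <preview-url> | canonical_url` should still print the production URL, which is the case this item is meant to protect.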
23. Last-Modified headers set correctly
Ensure your server returns a Last-Modified HTTP header that accurately reflects when the content was last changed.
Verify: curl -sI https://yoursite.com/writing/some-post | grep -i "last-modified"
Reasoning: Agents use this header, along with ETags, to avoid re-downloading content that has not changed. It saves them bandwidth and saves you crawl budget.
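To see the saving for yourself, capture the validators from one response and replay them on the next request. The sketch below covers the first half, extracting a header from a `curl -sI` dump without caring about case, since HTTP/2 responses print header names in lowercase. `header_value` is a hypothetical helper name and the sample values are made up.

```shell
#!/usr/bin/env bash
# Pull one header value out of a `curl -sI` dump (stdin), ignoring case.
header_value() {
  local name=$1
  tr -d '\r' | grep -i "^${name}:" | head -n 1 | sed 's/^[^:]*:[[:space:]]*//'
}

# Sample header dump with made-up values.
sample='HTTP/2 200
etag: "abc123"
last-modified: Tue, 06 Jan 2026 10:00:00 GMT'
printf '%s\n' "$sample" | header_value ETag
printf '%s\n' "$sample" | header_value Last-Modified

# Replaying the validator on a live URL should yield 304 if nothing changed:
#   etag=$(curl -sI "$url" | header_value ETag)
#   curl -s -o /dev/null -w '%{http_code}\n' -H "If-None-Match: $etag" "$url"
```

If the replay returns 200 every time, the server is ignoring conditional requests and every crawl re-downloads the full page.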
Verification script
Here is a simple bash script you can run to check these endpoints on your site. It is not exhaustive, but it covers the key files.
#!/bin/bash
# check-agent-ready.sh
# Usage: ./check-agent-ready.sh https://yoursite.com
set -euo pipefail
# Default to empty so `set -u` does not abort before the usage message prints.
TARGET_HOST="${1:-}"
if [ -z "$TARGET_HOST" ]; then
  echo "Usage: $0 <https://your-domain.com>"
  exit 1
fi
TARGET_HOST="${TARGET_HOST%/}" # Drop a trailing slash so paths join cleanly
RED='\033[0;31m'
GREEN='\033[0;32m'
NC='\033[0m' # No Color
# Request one path and report PASS for any 2xx status, FAIL otherwise.
check() {
  local path=$1
  local url="${TARGET_HOST}${path}"
  local status_code
  # curl prints 000 when the connection fails; `|| true` keeps `set -e` from aborting.
  status_code=$(curl -s -o /dev/null -w "%{http_code}" "$url" || true)
  if [ "$status_code" -ge 200 ] && [ "$status_code" -lt 300 ]; then
    echo -e "${GREEN}[PASS]${NC} $path (Status: $status_code)"
  else
    echo -e "${RED}[FAIL]${NC} $path (Status: $status_code)"
    # exit 1 # Uncomment to make the script exit on first failure
  fi
}
echo "Checking agent-readiness for $TARGET_HOST..."
# Group A
check "/robots.txt"
check "/sitemap.xml"
check "/feed.xml"
check "/.well-known/security.txt"
# Group B
check "/llms.txt"
check "/agent.json" # May be 404 if you don't have one, that's ok.
# Group D (optional)
# check "/openapi.json"
# Group E
check "/.well-known/content-signals/manifest.json"
echo "Check complete."
Objections I hear often and answers
"This is too much."
No. Once written, none of these change weekly. Your robots.txt is set-and-forget. Your schemas are part of your templates. Your /llms.txt might update twice a year when you ship a major new project. This is a one-time setup cost, not ongoing maintenance.
"Agents will figure it out from HTML."
They can. They also cost 3 to 5 times more tokens to do so, and they answer less accurately. Parsing unstructured HTML is expensive and error-prone. Your llms.txt is a kindness to the agent and a kindness to the human who gets a better, faster, cheaper answer because of it.
"Content signals is a Cloudflare thing."
It is. It is also the spec gaining adoption fastest. Compliance by AI crawlers is voluntary, so treat it as a declared preference, not a guarantee. It is still better to adopt an imperfect, widely deployed standard than to wait for a perfect one that never arrives.
"My site is static; I can't host /api/ask."
Then skip Group D. The other 19 items still apply. The goal is not to implement every possible feature, but to clearly declare the features you have. A static site can and should still have excellent crawlability, structured data, and content signals.
What you can skip safely in 2026
- FOAF: Friend of a Friend is a lovely idea from the semantic web, but it is dead. Use JSON-LD with Person schema instead.
- Yahoo-specific meta tags: Dead for a decade.
- Microformats (h-entry, h-card): A noble effort, but JSON-LD won the war for structured data. Focus your efforts there.
- Webmentions: I love Webmentions, but they are for peer-to-peer conversation, not agent-to-site communication. They are orthogonal to this checklist.
What I'd add in 2027
- Per-page pricing manifests: A machine-readable file on each page declaring the cost to use it for training or RAG, enabling automated micropayments.
- Machine-readable policy files: A formal standard for agent.json, defining how an agent is expected to behave on your site (e.g., "do not summarize", "do not translate").
- Decentralized identity (DID) verification: A way to cryptographically prove that the site is operated by the person claimed in the Person schema. Waiting for a clear winner in the DID space.
Do it once. Then forget it.
The agent-ready web is not a different web. It is the same web, annotated more honestly. Every standard I listed already existed as an informal convention; we are just writing them down.
This checklist might seem long, but it is finite. You can implement most of it in a single afternoon. Do it once. Then forget about it, and get back to doing what you do best: creating the content the agents are trying to understand in the first place.