Problem Statement
AX Score currently lacks detection for several important AI discoverability signals, particularly for content-focused websites. The tool doesn't check for:
- `llms.txt` — An emerging standard (llmstxt.org) specifically designed to help LLMs discover and understand site content
- Rich JSON-LD schema types — Only checks for JSON-LD presence, not specific schema types that matter for AI understanding (WebSite, BlogPosting, Person, BreadcrumbList, FAQPage)
- AI crawler permissions — Whether robots.txt explicitly allows AI crawlers (GPTBot, ClaudeBot, PerplexityBot, anthropic-ai, Google-Extended)
- Content feed availability — RSS/Atom feeds for machine-readable content syndication
- Semantic HTML quality — Proper heading hierarchy; `<article>`, `<main>`, `<nav>` landmarks
These are the signals that actually determine whether AI agents can discover, understand, and cite a website's content.
Proposed Solution
1. llms.txt Detection (High Priority)
Check: `GET /llms.txt`
Scoring:
- File exists and is valid Markdown: +points
- Contains H1 heading (site name): +points
- Contains blockquote summary: +points
- Contains sectioned URL lists: +points
- Has companion llms-full.txt: +bonus points
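The checks above could be sketched roughly as follows. Python is used purely for illustration (AX Score's actual implementation language isn't stated here), and the point values are placeholders, not proposed weights:

```python
import re

def score_llms_txt(text: str) -> int:
    """Score an llms.txt body against the proposed checks.
    Point values are illustrative placeholders only."""
    score = 0
    lines = text.splitlines()
    # H1 heading with the site name, e.g. "# Example Site"
    if any(re.match(r"^# \S", ln) for ln in lines):
        score += 10
    # Blockquote summary, e.g. "> A short description."
    if any(ln.startswith("> ") for ln in lines):
        score += 10
    # Sectioned URL lists: at least one H2 section plus markdown links
    has_section = any(ln.startswith("## ") for ln in lines)
    has_links = any(re.search(r"\[.+?\]\(https?://", ln) for ln in lines)
    if has_section and has_links:
        score += 10
    return score

sample = """# Example Site
> A short summary of what the site covers.

## Docs
- [Getting started](https://example.com/start)
"""
print(score_llms_txt(sample))  # 30
```

Checking for a companion `llms-full.txt` would be a separate `GET`, so it is omitted from this pure-text sketch.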
2. JSON-LD Schema Type Analysis (High Priority)
Instead of just checking whether a `<script type="application/ld+json">` block exists, analyze the schema types present:
| Schema Type | Page Context | Points |
| --- | --- | --- |
| `WebSite` | Homepage | High |
| `BlogPosting` / `Article` | Article pages | High |
| `Person` / `Organization` | Any page | Medium |
| `BreadcrumbList` | All pages | Medium |
| `FAQPage` | FAQ pages | Medium |
| Uses stable `@id` references | Cross-page | Bonus |
| `sameAs` links to social profiles | Author entity | Bonus |
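Extracting the types is mostly a matter of walking every `ld+json` block, including nested objects and `@graph` arrays so wrapped schemas aren't missed. A minimal sketch (illustrative only, not AX Score's actual code):

```python
import json
import re

JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def extract_jsonld_types(html: str) -> set:
    """Collect every @type found in ld+json blocks, walking nested
    objects and @graph arrays."""
    types = set()

    def walk(node):
        if isinstance(node, dict):
            t = node.get("@type")
            if isinstance(t, str):
                types.add(t)
            elif isinstance(t, list):
                types.update(x for x in t if isinstance(x, str))
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    for block in JSONLD_RE.findall(html):
        try:
            walk(json.loads(block))
        except json.JSONDecodeError:
            continue  # invalid JSON-LD earns no schema-type points
    return types

page = '''<script type="application/ld+json">
{"@context": "https://schema.org", "@graph": [
  {"@type": "WebSite", "name": "Example"},
  {"@type": "BlogPosting", "author": {"@type": "Person", "name": "A"}}
]}
</script>'''
print(sorted(extract_jsonld_types(page)))  # ['BlogPosting', 'Person', 'WebSite']
```

The resulting set can then be matched against the table above, with page context (homepage vs. article page) deciding which types are expected where.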
3. AI Crawler Permissions (Medium Priority)
Parse `robots.txt` and check:
- Does it explicitly allow AI crawler user agents?
- Are there specific AI crawler rules (not just catch-all)?
- Known AI crawlers: `GPTBot`, `ChatGPT-User`, `ClaudeBot`, `anthropic-ai`, `PerplexityBot`, `Google-Extended`, `Bard`, `Applebot-Extended`, `CCBot`
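The "explicit rule, not just catch-all" distinction can be detected by scanning `User-agent:` lines. This sketch checks explicit naming only; deciding whether a named crawler is actually *allowed* would require the full group-matching semantics of the Robots Exclusion Protocol (RFC 9309), which is out of scope here:

```python
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "Google-Extended", "Applebot-Extended", "CCBot",
]

def explicit_ai_rules(robots_txt: str) -> dict:
    """Map each known AI crawler to whether robots.txt names it in a
    dedicated User-agent line (rather than only 'User-agent: *')."""
    named = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("user-agent:"):
            named.add(line.split(":", 1)[1].strip().lower())
    return {bot: bot.lower() in named for bot in AI_CRAWLERS}

robots = """User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin
"""
flags = explicit_ai_rules(robots)
print(flags["GPTBot"], flags["ClaudeBot"])  # True False
```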
4. Content Feed Detection (Medium Priority)
Check: `<link rel="alternate" type="application/rss+xml" ...>`
Check: `GET /rss.xml`, `/feed.xml`, `/atom.xml`
Scoring: Feed exists and returns valid XML with entries
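Detection could start from feeds declared in the page head, falling back to probing common paths with `GET` only when nothing is declared. A sketch of the declared-feed half (the fallback path list is a guess at common conventions, not exhaustive):

```python
import re

FEED_MIME_TYPES = ("application/rss+xml", "application/atom+xml")
# Probed with GET requests only if no feed is declared in the HTML:
FALLBACK_PATHS = ["/rss.xml", "/feed.xml", "/atom.xml"]

def declared_feeds(html: str) -> list:
    """Return hrefs of feeds advertised via <link rel="alternate"> tags."""
    feeds = []
    for tag in re.findall(r"<link\b[^>]*>", html, re.IGNORECASE):
        lowered = tag.lower()
        if "alternate" in lowered and any(t in lowered for t in FEED_MIME_TYPES):
            match = re.search(r'href=["\']([^"\']+)["\']', tag)
            if match:
                feeds.append(match.group(1))
    return feeds

html = '<link rel="alternate" type="application/rss+xml" href="/rss.xml">'
print(declared_feeds(html))  # ['/rss.xml']
```

Validating that the feed "returns valid XML with entries" would then be an XML parse of the fetched body, checking for at least one `<item>` (RSS) or `<entry>` (Atom).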
5. Semantic HTML Analysis (Lower Priority)
- Proper heading hierarchy (h1 → h2 → h3, no skips)
- Use of `<article>`, `<main>`, `<nav>`, `<header>`, `<footer>`
- Content accessible without JavaScript rendering
- Meaningful `<meta name="description">` present
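The heading-hierarchy and landmark checks are straightforward to sketch. This illustrative version uses regex tag scanning, which is enough for a signal check, though a real implementation would likely use a proper HTML parser:

```python
import re

LANDMARKS = ("article", "main", "nav", "header", "footer")

def heading_skips(html: str) -> list:
    """Return (from_level, to_level) pairs wherever the page jumps
    more than one heading level deeper, e.g. h1 straight to h3."""
    levels = [int(n) for n in re.findall(r"<h([1-6])\b", html, re.IGNORECASE)]
    return [(a, b) for a, b in zip(levels, levels[1:]) if b > a + 1]

def landmark_usage(html: str) -> dict:
    """Flag which semantic landmark elements appear at least once.
    (Prefix match only; a parser would avoid false positives.)"""
    lowered = html.lower()
    return {tag: f"<{tag}" in lowered for tag in LANDMARKS}

page = "<main><h1>Title</h1><h3>Skipped h2</h3><nav></nav></main>"
print(heading_skips(page))          # [(1, 3)]
print(landmark_usage(page)["nav"])  # True
```

The JavaScript-rendering check is different in kind: it needs a comparison between the raw HTML response and the rendered DOM, so it can't be expressed as a pure text scan like the above.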
Alternatives Considered
- Only add llms.txt: Quick win but misses the bigger picture of content discoverability.
- Rely on existing Discovery category: The current Discovery check (16% for a well-optimized blog) shows it's not comprehensive enough.
Use Case
As a content creator who has:
- ✅ Comprehensive JSON-LD (WebSite, BlogPosting, Person, BreadcrumbList, FAQPage schemas)
- ✅ `llms.txt` with curated content map
- ✅ robots.txt allowing 10+ AI crawlers
- ✅ RSS feed with all published posts
- ✅ XML sitemap
- ✅ Rich meta tags (OG, Twitter Cards, AEO tags)
I still get a Discovery score of 16% because the tool doesn't detect most of these signals. This severely undervalues well-optimized content sites and makes the score unreliable for content creators.
Additional Context
The `llms.txt` standard has growing adoption (over 1,000 domains) and is documented at llmstxt.org. While no major LLM provider has officially confirmed they follow `llms.txt` during crawling, it's a low-effort, high-signal file that clearly communicates site structure to AI systems.
Relevant research: