The DatoCMS Blog
Using a Headless CMS to index your content for LLMs
TLDR
llms.txt is a simple proposal: slap a Markdown index at /llms.txt, and expose clean .md versions of useful pages so LLM tools do not have to scrape your HTML. THIS IS A STANDARD PROPOSAL. It's not widely confirmed to be "the way" LLMs learn from websites, but since all the cool kids are doing it, we went ahead with our own approach to generating them.
With DatoCMS as your content source, you can generate
/llms.txt, /llms-full.txt, and per-page .md exports at build time in any framework. We like Next.js and Astro, so we've put in some examples. We recently released a package that converts Structured Text back into Markdown, so we're making things way easier for you.
We use this approach ourselves to generate
datocms.com/docs/llms-full.txt. Did it 10,000x our "GEO" traffic? No. But it was fun.
Let's talk LLM crawling
Ok first, let's get the whole buzzword bingo out of the way before I drive myself up a wall, because I'll want to avoid stuffing these terms everywhere going forward.
There's a proposal going around to standardize having a .md version of websites for LLMs to consume. Just like robots.txt and sitemap.xml, llms.txt is a new standard proposal for how LLMs should/could/would learn about your website content, just like the Big G uses the others to crawl your URLs to index on SERPs. The full docs on this are on https://llmstxt.org/
Also, in the very noisy world of Marketing we're all reading about how GEO is taking over SEO and how you're leaving billions on the table by not optimising your content for LLMs. Hype? Maybe. Worth listening to? Also maybe. Are we doing anything about it yet? Honestly, not much.
Anyways.
HTML is for browsers.
LLMs can read HTML, sure. But they also have to wade through your navigation, footers, cookie banners, repeated "Try for free" buttons, newsletter signups, styling, random layout text, and whatever else your frontend framework produced that day.
If you have ever pasted a docs page into ChatGPT and watched it hallucinate a method that does not exist, you have met the consequences of bad context.
So instead of letting AI tools scrape your UI, we give them the actual content, in a format they allegedly like. Good ol' Markdown.
The proposal is straightforward:
Publish a Markdown file at /llms.txt that gives background, guidance, and links to the good stuff.

Also provide a clean Markdown version of pages at the same URL with .md appended (and if the URL has no filename, you append index.html.md).
That is it. No magic. No "new crawler standard". Just a sane convention so tooling can reliably find high-signal content without playing DOM archaeologist.
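That second rule is mechanical enough to write down as code. Here's a minimal sketch of the URL mapping (the `markdownUrlFor` name is ours, not part of the proposal):

```typescript
// Map a page URL to its Markdown counterpart per the llms.txt proposal:
// append ".md" to the filename, or "index.html.md" when the path has none.
export function markdownUrlFor(pageUrl: string): string {
  const url = new URL(pageUrl);
  const lastSegment = url.pathname.split("/").pop() ?? "";
  if (lastSegment === "") {
    // Trailing slash or bare domain: no filename to append to.
    url.pathname += "index.html.md";
  } else {
    url.pathname += ".md";
  }
  return url.toString();
}
```

So `https://example.com/docs/page` maps to `https://example.com/docs/page.md`, while `https://example.com/docs/` maps to `https://example.com/docs/index.html.md`.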
If you want the "give me everything" version, the ecosystem has drifted toward a second artifact: llms-full.txt. Same idea, but it is one big compiled Markdown file. We shipped that for our docs too. You can find it on https://www.datocms.com/docs/llms-full.txt
Why does Markdown work better, though? This is not "Markdown is prettier". It is "Markdown is predictable".
Markdown preserves structure with minimal noise: headings, lists, code blocks, quotes. The llms.txt proposal explicitly calls out that Markdown is both human and LLM readable, and also consistent enough for deterministic processing.
HTML can represent structure too, but it also represents your layout. LLM tools do not care about your layout. They care about the content and its relationships.
LLM indexing today
From everything I could find while researching this topic for us, there seem to be two lanes, and they constantly get mixed up.
Lane 1: Training-time web data
Some model training datasets are derived from the great big web crawls. Common Crawl exists specifically to provide large-scale web crawl data, and it is widely used in research and industry. There are also well-known filtered datasets built from Common Crawl, like C4 (Colossal Clean Crawled Corpus).
Do I know anything about what these mean? No.
You do not control what gets used, when, or how it's filtered. Also, even if a page is crawled, that does not mean it ends up in a specific model's training data, or stays there forever.
Lane 2: Inference-time retrieval and ingesting
This is the one you actually should care about day to day.
Tools like Claude, Custom GPTs, Cursor, and coding assistants ingest a source, index it, and retrieve relevant chunks when you ask questions. Our own llms-full.txt post is basically a love letter to this workflow, because it turns "open 10 tabs and copy paste like a maniac" into "one file, full context, here you go, gobble it all up".
Also yes, vendors run crawlers too. OpenAI documents its crawlers and how site owners can manage them, and Anthropic documents its bots as well.
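If you want to explicitly welcome (or block) those crawlers, that happens in plain old robots.txt. A minimal sketch, using the crawler names the vendors document (GPTBot for OpenAI, ClaudeBot for Anthropic); check their docs for the current and complete list:

```text
# Allow the documented AI crawlers to fetch everything
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /
```

Swap `Allow` for `Disallow` if you'd rather opt out.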
But again, the easiest win is still to just publish better context.
OK BUT ENOUGH THEORY. VAMOS, LETS GET TO THE FUN STUFF!
GENERATING LLM exports from DatoCMS
Can you use your Headless CMS project as your source of truth to generate these files on the frontend? Yes.
Can you turn Structured Text into Markdown? Also yes. Have you met the all-new structured-text-to-markdown?
Can your repo zip up clean .md files at build time with all the new content added in? Also yes. Let's play around with how you can do that in Next.js and Astro.
The architecture is boring, which is good:
Content lives in DatoCMS.
Your site renders normally.
At build time, you generate:
/llms.txt as an index

/llms-full.txt as the "everything dump"

optional per-page .md endpoints for deep links, like we do.
But first, if you've got a project using Structured Text, let's get that one little rendering hiccup out of the way so you don't have to write your own logic.
Install the package which converts Structured Text nodes back into CommonMark-compatible Markdown, including headings, lists, code blocks, links, and formatting.
```shell
npm install datocms/structured-text-to-markdown
```

And now your pipeline can:
fetch records
convert Structured Text field outputs to Markdown
stitch outputs into llms-full.txt or per-page .md
Without you writing a custom renderer that breaks the first time someone pastes a table, or whatever.
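The "stitch" step is the boring part, so here's a sketch of it. The `Page` shape and `buildLlmsFullTxt` name are assumptions for illustration; adapt the fields to whatever your DatoCMS query actually returns, with the Structured Text fields already converted to Markdown:

```typescript
// Hypothetical shape of a docs record after the Markdown conversion step.
type Page = {
  title: string;
  slug: string;
  markdown: string; // Structured Text already converted to Markdown
};

// Stitch converted pages into a single llms-full.txt body.
export function buildLlmsFullTxt(siteTitle: string, pages: Page[]): string {
  const header = [`# ${siteTitle}`, "", "Compiled Markdown export of all our docs."];
  const sections = pages.map((p) =>
    [`## ${p.title}`, "", `<!-- source: /docs/${p.slug} -->`, "", p.markdown].join("\n")
  );
  return [...header, ...sections].join("\n\n");
}
```

Each section keeps an HTML comment pointing back at the source URL, which gives retrieval tools a breadcrumb to cite.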
Playing around in Next.js
Next.js App Router route handlers are perfect for this, because they are just Web Request and Response handlers, living wherever you want in app/.
One important detail though, GET route handlers are not cached by default. If you want this to behave like a build artifact, opt into caching with export const dynamic = 'force-static' (or another caching strategy).
So, you could have a simple setup for an app/docs/llms-full.txt/route.ts:
```typescript
export const dynamic = "force-static";

export async function GET() {
  // 1. Fetch docs records from DatoCMS
  // 2. Convert Structured Text to Markdown
  // 3. Join all this into one big .md boi
  const body = [
    "# DatoCMS Docs",
    "",
    "This is a compiled export of all our docs.",
    "",
    "## Getting started",
    "",
    "...",
  ].join("\n");

  return new Response(body, {
    headers: { "content-type": "text/plain; charset=utf-8" },
  });
}
```

For /docs/llms.txt, you do the same, but generate a smaller index that links out to your important .md pages. If you want per-page .md, you can add routes that map slugs to Markdown output. The proposal explicitly recommends the .md suffix convention for a "clean Markdown version of this page".
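That smaller index is just another string-builder. A sketch, where the `DocPage` fields (title, slug, summary) are assumptions about your DatoCMS model, and the H1 + blockquote layout follows the llms.txt proposal's suggested format:

```typescript
// Hypothetical shape of a docs record for the index.
type DocPage = { title: string; slug: string; summary?: string };

// Build the /llms.txt index that links to the per-page .md exports.
export function buildLlmsTxt(pages: DocPage[]): string {
  const lines = [
    "# DatoCMS Docs",
    "",
    "> Clean Markdown exports of our documentation.",
    "",
    "## Docs",
    "",
  ];
  for (const p of pages) {
    const link = `- [${p.title}](/docs/${p.slug}.md)`;
    lines.push(p.summary ? `${link}: ${p.summary}` : link);
  }
  return lines.join("\n");
}
```

Feed the result to the same `Response` pattern as above and you have both artifacts from one data fetch.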
Playing around in Astro
Astro is almost annoyingly well suited to something like this. Why? Endpoints can emit plain text, and if your site is statically generated, Astro will freshly bake that file at build time.
The setup is also extremely simple and straightforward:
```typescript
import type { APIRoute } from "astro";

export const GET: APIRoute = async () => {
  // 1. Fetch docs records from DatoCMS
  // 2. Convert Structured Text to Markdown
  // 3. Join all this into one big .md boi
  const body = [
    "# Docs",
    "",
    "This is a compiled export of all our docs.",
    "",
    "## Getting started",
    "",
    "...",
  ].join("\n");

  return new Response(body, {
    headers: { "content-type": "text/plain; charset=utf-8" },
  });
};
```

Astro also documents the convention that the filename determines the output path, so llms-full.txt.ts becomes /llms-full.txt. Clean and obvious.
The lazy conclusion
If you want LLMs to help you build, you have to give them context that does not suck.
Use a headless CMS to manage the content properly. Then publish an LLM-friendly Markdown representation at build time, using the conventions tools are starting to align on. That is the whole trick.
Also yes, this is us doing the hard work so you can be lazy later.
And if you're intimidated by this because it's new and complicated, don't be. I'm not even a developer and I managed to vibe-get this done (which, of course, boss man nuked and re-did properly, but hey, my thing still worked).