SEO and LLM Files
1. Overview
Three root-level files control how search engines and LLM crawlers discover and index the site:
| File | Purpose | Audience |
|---|---|---|
| robots.txt | Allow or disallow crawling of specific paths | Search engine bots (Googlebot, Bingbot, etc.) and AI crawlers |
| sitemap.xml | Map all indexable pages with priority and update frequency | Search engines (Google, Bing, etc.) |
| llms.txt | Curate high-signal content with structured context | LLM crawlers (GPTBot, ClaudeBot, Google-Extended, etc.) |
All three files are served from the site root (e.g., https://www.myriadevents.co.za/robots.txt). The sitemap is generated automatically at build time by the @astrojs/sitemap integration; the other two are static files in public/.
2. robots.txt
2.1. What It Does
robots.txt tells web crawlers which parts of the site they may or may not access. It is the first file any well-behaved crawler reads before indexing content.
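To make the allow/disallow mechanics concrete, here is a minimal sketch of how a crawler evaluates a path against robots.txt rules. It is simplified — the Robots Exclusion Protocol (RFC 9309) also covers wildcards and other details — but it shows the longest-match precedence that well-behaved crawlers apply:

```javascript
// Simplified robots.txt path matcher: the rule with the longest
// matching prefix wins; ties between Allow and Disallow are ignored here.
function isAllowed(path, rules) {
  // rules: array of { type: 'allow' | 'disallow', prefix: string }
  let verdict = true; // default when nothing matches: allowed
  let matched = -1;   // length of the longest matching prefix so far
  for (const { type, prefix } of rules) {
    if (path.startsWith(prefix) && prefix.length > matched) {
      matched = prefix.length;
      verdict = type === 'allow';
    }
  }
  return verdict;
}

// The rules from the structure shown below: Allow: / and Disallow: /api/
const rules = [
  { type: 'allow', prefix: '/' },
  { type: 'disallow', prefix: '/api/' },
];

console.log(isAllowed('/services/', rules));   // true
console.log(isAllowed('/api/contact', rules)); // false
```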
2.3. Structure
# Default rule — allow all crawlers
User-agent: *
Allow: /
Disallow: /api/ (1)
# Sitemap location
Sitemap: https://www.myriadevents.co.za/sitemap-index.xml (2)
# AI-specific crawler rules
User-agent: GPTBot (3)
Allow: /
| 1 | Block crawling of the API endpoint (the contact form function) |
| 2 | Points crawlers to the sitemap for full page discovery |
| 3 | Named AI crawler sections allow per-crawler control |
2.4. When to Update
- A new page is added — no change needed (the default Allow: / covers new pages)
- A path should be hidden from crawlers — add a Disallow: line
- A new AI crawler should be blocked or allowed — add a User-agent: section
- The domain changes — update the Sitemap: URL
2.5. Known AI Crawler User-Agents
| User-Agent | Operator |
|---|---|
| GPTBot | OpenAI |
| ClaudeBot | Anthropic |
| Google-Extended | Google (AI training, separate from Googlebot) |
| CCBot | Common Crawl (used by many AI datasets) |
| PerplexityBot | Perplexity AI |
To block a specific crawler, change Allow: / to Disallow: / under its section:
User-agent: GPTBot
Disallow: /
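If the list of AI crawler policies grows, maintaining them in one place and rendering the file can help. The helper below is hypothetical (not part of the project) and simply formats a policy map into robots.txt sections:

```javascript
// Hypothetical helper: render per-crawler robots.txt sections from a
// policy map. true = Allow: /, false = Disallow: /.
function renderRobots(policies, sitemapUrl) {
  const sections = Object.entries(policies).map(
    ([agent, allowed]) =>
      `User-agent: ${agent}\n${allowed ? 'Allow' : 'Disallow'}: /`
  );
  return [...sections, `Sitemap: ${sitemapUrl}`].join('\n\n') + '\n';
}

const robots = renderRobots(
  { '*': true, GPTBot: true, CCBot: false },
  'https://www.myriadevents.co.za/sitemap-index.xml'
);
console.log(robots);
```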
3. sitemap.xml
3.1. What It Does
sitemap.xml tells search engines which pages exist and how to crawl them efficiently.
3.2. How It Is Generated
The sitemap is generated automatically at build time by the @astrojs/sitemap integration. It discovers all static routes and produces a sitemap index at /sitemap-index.xml (which references /sitemap-0.xml).
The integration is configured in astro.config.mjs:
import sitemap from '@astrojs/sitemap';
export default defineConfig({
site: 'https://www.myriadevents.co.za',
integrations: [sitemap()],
// ...
});
The site property is required — the integration uses it to generate absolute URLs.
3.3. When to Update
No manual updates are needed for static pages. When you add or remove an .astro page in src/pages/, the sitemap updates automatically on the next build.
3.4. Excluding Pages
To exclude specific pages from the sitemap, pass a filter function:
integrations: [
sitemap({
filter: (page) => !page.includes('/internal/'),
}),
]
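The filter receives each page's full absolute URL as a string, so the predicate can be exercised on its own before wiring it into the config:

```javascript
// The same predicate as in the config above, tested in isolation.
// @astrojs/sitemap passes the full page URL as a string.
const filter = (page) => !page.includes('/internal/');

console.log(filter('https://www.myriadevents.co.za/services/'));       // true (kept)
console.log(filter('https://www.myriadevents.co.za/internal/notes/')); // false (excluded)
```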
3.5. Customising Priority and Change Frequency
The @astrojs/sitemap integration supports serialize for per-page customisation:
integrations: [
sitemap({
serialize: (item) => {
if (item.url === 'https://www.myriadevents.co.za/') {
item.priority = 1.0;
item.changefreq = 'weekly';
}
return item;
},
}),
]
When dynamic content (events, results) is published as individual pages in the future, revisit whether robots.txt and llms.txt should also be generated dynamically. See the project README for details.
4. llms.txt
4.1. What It Does
llms.txt is a Markdown file at the site root that provides LLM crawlers with a curated, high-signal summary of the site’s content. While robots.txt controls access and sitemap.xml lists pages, llms.txt provides context — helping LLMs understand what the site is about and which pages are most relevant.
This follows the llms.txt specification, an emerging convention for making websites more accessible to AI systems.
4.3. Structure
# Company Name (1)
> One-line description of the company. (2)
A paragraph with more context about the company. (3)
## Services (4)
- [Service Name](URL): One-line description of the service.
## Key Pages (5)
- [Page Name](URL): One-line description of the page.
## Contact (6)
- Phone: ...
- Email: ...
| 1 | H1 heading with the company or site name |
| 2 | Blockquote with a concise tagline or description |
| 3 | Introductory paragraph with key context |
| 4 | Grouped sections for services, pages, or topics |
| 5 | Links with inline descriptions — LLMs use these to understand page relevance |
| 6 | Contact details for the organisation |
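If the file grows, generating it from structured data keeps the sections consistent. This is an illustrative sketch only — the names and URLs below are placeholders, not the real site content:

```javascript
// Hypothetical builder: assemble llms.txt from structured data so the
// heading/blockquote/link layout above stays consistent.
function buildLlmsTxt({ name, tagline, intro, sections, contact }) {
  const lines = [`# ${name}`, '', `> ${tagline}`, '', intro, ''];
  for (const { heading, links } of sections) {
    lines.push(`## ${heading}`);
    for (const { title, url, description } of links) {
      lines.push(`- [${title}](${url}): ${description}`);
    }
    lines.push('');
  }
  lines.push('## Contact');
  for (const entry of contact) lines.push(`- ${entry}`);
  return lines.join('\n') + '\n';
}

const sample = buildLlmsTxt({
  name: 'Example Co', // placeholder data
  tagline: 'One-line description of the company.',
  intro: 'A paragraph with more context about the company.',
  sections: [
    {
      heading: 'Services',
      links: [
        {
          title: 'Service Name',
          url: 'https://example.com/service/',
          description: 'One-line description of the service.',
        },
      ],
    },
  ],
  contact: ['Email: info@example.com'],
});
console.log(sample);
```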
4.4. When to Update
- A new page or service is added — add a link entry under the appropriate section
- A page is removed — remove its entry
- Company details change — update the description, contact info, or tagline
- The domain changes — update all URLs
4.5. Writing Guidelines
- Keep descriptions to one sentence — LLMs parse these as structured metadata
- Use full absolute URLs (not relative paths)
- Group related pages under descriptive headings
- Lead with the most important content — LLMs may truncate long files
- Do not include internal/private pages, API endpoints, or admin URLs
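The absolute-URL rule is easy to enforce mechanically. A simple lint sketch (the regex is deliberately minimal, not a full Markdown parser) flags any link that is not an https URL:

```javascript
// Lint sketch for the guideline above: every markdown link in llms.txt
// should use a full absolute https URL. Returns the offending targets.
function findRelativeLinks(llmsTxt) {
  const linkPattern = /\[[^\]]*\]\(([^)]+)\)/g;
  const bad = [];
  for (const match of llmsTxt.matchAll(linkPattern)) {
    if (!match[1].startsWith('https://')) bad.push(match[1]);
  }
  return bad;
}

const sample = [
  '- [About](https://www.myriadevents.co.za/about/): Who we are.',
  '- [Contact](/contact/): Get in touch.',
].join('\n');

console.log(findRelativeLinks(sample)); // ['/contact/']
```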
4.6. Optional: Per-Page Markdown Versions
For higher-value LLM consumption (especially if the site becomes content-heavy), you can provide Markdown versions of key pages:
public/
├── llms.txt # Main index
├── llms/
│ ├── about.md # Clean text version of the About page
│ ├── services/
│ │ ├── event-management.md
│ │ └── timing-results.md
Then reference them from llms.txt:
## Services
- [Event Management](https://www.myriadevents.co.za/llms/services/event-management.md): Full-service event management.
This is a higher-effort, higher-value step — only worth doing if the site has substantial content that LLMs should consume in full.
5. File Summary
| File | What to Update | When | Effort |
|---|---|---|---|
| robots.txt | Add/remove Disallow: and User-agent: rules | When paths or crawler policy changes | Low |
| sitemap.xml | Adjust filter or serialize options in astro.config.mjs | Only when customisation is required | None (automatic) |
| llms.txt | Add/remove link entries, update descriptions | When pages, services, or company info changes | Low |