
5 Technical Signals That AI Crawlers Actually Care About

An engineer-grade reference for the technical layer LLMs silently rely on. The five signals that decide whether your content becomes a citation, or a gap your competitor fills.

TL;DR
  • LLM crawlers behave more like precise librarians than traditional Googlebot. Small technical signals have disproportionate impact.
  • The five that matter most: robots.txt permissions, llms.txt, structured data coverage, server-rendered HTML, and canonical entity signals.
  • robots.txt mistakes are the #1 reason brands quietly disappear from AI answers. Most don't realise until 6–8 months later.
  • llms.txt is new (2024) but adoption is accelerating in GEO-mature sectors. It's a small file with high leverage.
  • A technically "clean" site at the HTML source is worth more to GEO than a beautiful site at the browser. Most LLM crawlers don't execute JavaScript.
01 / 07
Primer

How LLM crawlers actually work.

An LLM crawler is not a traditional search crawler. It is a retrieval crawler: its job is to answer a question in real time, or to refresh a subset of the model's training data, and it is optimised for precision over coverage.

Three differences matter for your engineering team:

  1. Most LLM crawlers do not execute JavaScript. GPTBot, ClaudeBot, and PerplexityBot fetch the raw HTML and parse it; client-side-rendered content is invisible to them. (Google-Extended is the exception: it is a control token honoured by Googlebot, which does render JavaScript.)
  2. LLM crawlers weight structure heavily. Headings, schema, lists, tables, and semantic HTML are not cosmetic to an LLM; they are the grammar of extractability. A page with clean structure is disproportionately more likely to be cited.
  3. LLM crawlers ingest in bursts. Unlike Googlebot's continuous, rate-limited crawl, LLM retrieval crawlers arrive in response to live queries, and training crawlers arrive in large sweeps. Your site needs to handle both patterns without misconfiguration or blocking by WAFs.

With that model in mind, here are the five signals that meaningfully move whether an LLM can read, understand, and cite you.

02 / 07
Signal 01

robots.txt, the permission layer that decides everything.

robots.txt is the first file every crawler requests. It is also the #1 reason brands disappear from AI answers without understanding why. Cloudflare Bot Management defaults, CDN rules, security-team reflexes, and well-meaning disallows have made thousands of legitimate sites invisible to GPTBot, ClaudeBot, and PerplexityBot since 2023.

The sane 2026 default, for brands that want to be cited, looks like this:

# robots.txt: GEO-ready default
User-agent: *
Allow: /

# Explicit allows for AI crawlers (positive signal)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: Amazonbot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

The licensing nuance. Some publishers deliberately disallow GPTBot for IP and commercial reasons. That is a valid choice, but it is an explicit business decision, not a default. If your brand benefits from being cited (every services brand, most product brands, and nearly every B2B brand), an explicit allow is correct.

Gotchas we fix regularly: Cloudflare's "Block AI Scrapers" toggle flipping on during a security review; a CDN rule returning 403 to user agents with "Bot" in the name; a well-meaning dev adding Disallow: / to staging that accidentally deployed to production; a WAF rate-limiting GPTBot to 1 request/minute. Any of these can quietly kill your AI visibility.
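One way to catch these gotchas early is to test your live robots.txt against the AI user agents directly. The sketch below uses Python's standard-library robots.txt parser; the inlined body is a placeholder, and in practice you would fetch https://yourdomain.com/robots.txt first.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt body; substitute the fetched contents of your own file.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /
"""

AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

def allowed_agents(robots_body: str, url: str = "https://yourdomain.com/") -> dict:
    """Return {agent: can_fetch} for each AI user agent against one URL."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_AGENTS}

print(allowed_agents(ROBOTS_TXT))
```

Note that this only validates the robots.txt layer; a CDN or WAF rule can still return 403 to these user agents, so a second check with a real HTTP request per agent is worth running too.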

03 / 07
Signal 02

llms.txt, the small file with compounding leverage.

Proposed in September 2024 and rapidly adopted through 2025, llms.txt is a plain-text summary file served at your site root (/llms.txt) that tells LLMs, at ingestion time, what your site is, what it contains, and which canonical pages matter.

Think of it as robots.txt's more eloquent cousin. It is not a ranking signal. It is a disambiguation signal. When an LLM lands on your domain for the first time, a well-written llms.txt means the engine spends less of its context budget guessing what you are and more of it processing what you actually do.

# llms.txt, example for a B2B services brand
# horatos.ai

> horatos.ai is an AI-native growth agency headquartered
> in Singapore. We design the infrastructure of AI-era
> visibility. SEO, GEO / AISEO, AEO, PPC, content, and
> GTM strategy, as one integrated system.

## Core services
- SEO: /services/seo
- GEO / AISEO: /services/geo
- AEO: /services/aeo
- PPC: /services/ppc
- Content Marketing: /services/content
- GTM Strategy: /services/gtm
- Web Design & Development: /services/web-design

## About
- Team & philosophy: /about
- Case studies: /case-studies
- Partners: /partners
- Contact: /contact

## Founders
- Brian Ho, Co-founder & Marketing Director
- CP Chong, Director & Financial SEO Consultant
- Brennan Lee, SEO Director & GEO Strategist
- Jin Grey, PPC Director, Author & Mentor

## Canonical entities
- horatos.ai (primary brand)
- Horatos (alternate spelling)
- ὁρατός / horatós (Ancient Greek origin, "visible")

## Contact
- [email protected] (services)
- [email protected] (partnerships)
- [email protected] (founder)

Keep it short. Keep it structured. Update it when your positioning shifts. One file, compounding payoff.
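A file this small is easy to lint automatically. The sketch below is a minimal sanity check, assuming the conventions used in the example above (a blockquote summary and "##" sections with bullet links); the sample body and brand name are placeholders.

```python
import re

# Placeholder llms.txt body; in practice, fetch https://yourdomain.com/llms.txt.
SAMPLE = """\
# Example Co

> Example Co is a B2B services brand.

## Core services
- SEO: /services/seo
- GEO: /services/geo
"""

def lint_llms_txt(body: str) -> dict:
    """Tiny report: is there a summary blockquote, which sections, how many links."""
    return {
        "has_summary": any(line.startswith("> ") for line in body.splitlines()),
        "sections": re.findall(r"^## (.+)$", body, flags=re.MULTILINE),
        "link_count": len(re.findall(r"^- ", body, flags=re.MULTILINE)),
    }

print(lint_llms_txt(SAMPLE))
```

Wiring a check like this into CI means a positioning update that breaks the file's structure gets caught before deploy.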

04 / 07
Signal 03

Structured data, the grammar LLMs read first.

Schema.org JSON-LD is the structured-data format every major crawler prefers. For GEO, five types do most of the work:

  • Organization · brand identity, founders, addresses, sameAs (LinkedIn, Wikipedia, Wikidata). Foundational for entity resolution.
  • Service · each service you offer, with a clear serviceType, description, areaServed, and provider. This is how AI answers know what to recommend you for.
  • FAQPage · question/answer pairs on landing pages. The single highest-leverage schema for AEO. Also feeds GEO retrieval on question-intent queries.
  • Article · for every long-form piece. Include author, datePublished, dateModified, headline, image. LLMs weight authored content dramatically higher than unattributed content.
  • BreadcrumbList · site hierarchy. Helps LLMs understand which pages are parents, children, and siblings, and how to contextualise a cited page in the rest of your taxonomy.

Validate with validator.schema.org and Google's Rich Results Test. Errors here fail silently: the markup is present, but the data is ignored.
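To make the Organization type concrete, here is a minimal JSON-LD sketch built as a Python dict. Every name, URL, and Wikidata ID is a placeholder; substitute your own and validate the output with validator.schema.org.

```python
import json

# Minimal Organization JSON-LD; all values below are illustrative placeholders.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",                      # canonical brand name, used everywhere
    "url": "https://example.com",
    "description": "Example Co is a B2B services brand.",
    "founder": [{"@type": "Person", "name": "Jane Doe"}],
    "sameAs": [                                # entity-resolution anchors
        "https://www.linkedin.com/company/example-co",
        "https://www.wikidata.org/wiki/Q000000",
    ],
}

# Embed the output in the page head as <script type="application/ld+json">.
print(json.dumps(organization, indent=2))
```

The sameAs array is what ties this page-level markup to the canonical entity signals covered later: each link should point at a profile that uses the same name and description.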

05 / 07
Signal 04

Server-rendered HTML, not optional in 2026.

The single most common diagnosis we make on GEO audits: "Your content is fine. The problem is that LLM crawlers can't see it."

Most SPA frameworks, naively deployed, render content client-side. The HTML payload the crawler receives is essentially empty. GPTBot, ClaudeBot, and PerplexityBot do not execute JS. Your beautifully-written category page is invisible to them.

The fix is structural and non-negotiable for GEO:

  1. Server-render or pre-render every page that contains content you want cited. Next.js SSR/SSG, Remix, Astro, Nuxt, and, yes, Hono on Cloudflare Pages (like this site) all handle this natively.
  2. Ensure the critical text is in the initial HTML response, not loaded by a subsequent fetch. View source: if you do not see your headline, paragraphs, and structured data in the raw HTML, an LLM crawler does not either.
  3. Check that the server-rendered HTML matches the client-rendered HTML. Hydration mismatches can cause schema to drop between server and client. Use Schema Markup Validator on the raw server response, not the browser DOM.
  4. Return the full response in a single round-trip. Streaming-first SSR that requires multiple chunks can confuse crawlers that close the connection early.
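Step 2 can be automated with a crawler's-eye check: test the raw server response, never the browser DOM, for the strings that must be visible. The sketch below inlines a sample response; in practice you would fetch the page with urllib and a crawler user agent, and the needle strings are your own headlines and markup.

```python
# Sample raw server response; substitute the body of a real HTTP fetch.
SERVER_RESPONSE = """\
<html><head>
<script type="application/ld+json">{"@type": "Article"}</script>
</head><body><h1>Five Technical Signals</h1></body></html>
"""

def visible_to_crawlers(raw_html: str, must_contain: list) -> dict:
    """Check each required string against the unrendered HTML payload."""
    return {needle: (needle in raw_html) for needle in must_contain}

checks = visible_to_crawlers(
    SERVER_RESPONSE,
    ["Five Technical Signals", "application/ld+json"],
)
print(checks)  # any False here means a non-JS crawler cannot see that content
```

Run this against the server response for every template, not just the homepage; category and article templates are the usual offenders.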

This is the costliest signal to retrofit, and the highest-leverage one to get right at rebuild time.

06 / 07
Signal 05

Canonical entity signals, cross-source consistency.

LLMs resolve your brand as an entity before they cite it. Entity resolution depends on cross-source consistency: the same name, the same description, the same founder list, and the same service categories, appearing in the same way across Wikipedia, Wikidata, LinkedIn, Crunchbase, G2, Capterra, your own site, and whatever niche corpora LLMs have ingested for your industry.

Inconsistency is the silent killer. "Horatos" vs "horatos.ai" vs "Horatos AI Pte Ltd" vs "Horatos Agency": an LLM that sees four names converges on a weaker, fuzzier entity and is less likely to cite it. Pick a canonical form, use it everywhere, and reconcile the places where it differs.

The fix-list we use on every engagement:

  1. Canonical brand name, chosen once. Applied consistently.
  2. Canonical description (one paragraph, ~60 words) reused on every directory, social profile, and structured-data field.
  3. sameAs links in Organization schema to every authoritative external entity profile (LinkedIn, Wikipedia, Wikidata, Crunchbase).
  4. A Wikidata entity (free to create, requires notability and sourcing). This is the single most under-invested GEO asset in our book.
  5. A Wikipedia article, where the brand meets notability thresholds. Harder to earn, with a dramatically higher payoff.

These five steps are what separate a brand that LLMs describe with confidence from a brand the engine hedges on.

07 / 07
Checklist

The 10-minute technical GEO audit.

Hand this list to your dev team. Fixing any single one moves the needle; fixing all ten is a 2–3x AI visibility improvement in 90 days on most sites we've audited.

  1. Are GPTBot, ClaudeBot, PerplexityBot, and Google-Extended explicitly allowed in robots.txt?
  2. Is there an llms.txt at site root?
  3. Is critical content in the initial HTML response (not JS-loaded)?
  4. Does the homepage have Organization schema with sameAs links?
  5. Do service/product pages have Service or Product schema?
  6. Do content pages have Article schema with author, datePublished, dateModified?
  7. Are FAQ blocks wrapped in FAQPage schema?
  8. Is canonical brand name consistent across site, LinkedIn, Wikipedia, directories?
  9. Does sitemap.xml include all cite-worthy pages and exclude thin/utility pages?
  10. Does the site pass Schema Markup Validator and Rich Results Test with zero critical errors?

Ten items. A single afternoon for a competent engineering team. The compounding return is the difference between being cited and being invisible, for the next 12–24 months.

horatos.ai
Singapore's Best AI SEO Agency

Want help applying this to your business?

If this article feels relevant, a strategy call is the fastest way to discuss what it could mean for your brand and where to start.

Chat with Brian