AI Crawlers

A reference guide to the major AI crawlers currently active on the web — what each one does, and how to control their access through robots.txt.

Why This Matters

Each AI company operates its own crawler (or crawlers) with a distinct user-agent name. Controlling access at a per-crawler level — rather than a single blanket rule — lets you make deliberate choices about which AI systems can train on, search, or cite your content. See Content Signals for how to express more granular preferences per crawler.

OpenAI / ChatGPT

User-Agent	Purpose
GPTBot	Crawls content for model training
OAI-SearchBot	Indexes pages so they can be surfaced and linked in ChatGPT's search results
ChatGPT-User	Fetches pages in real time when a user asks ChatGPT to browse a specific site
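Because each purpose has its own user-agent token, a site can opt out of training while staying visible in search. The following is a minimal sketch of checking such a policy with Python's standard `urllib.robotparser`. The policy text, the `allowed_agents` helper, and the `example.com` URL are illustrative assumptions, not a recommended configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block the training crawler, allow the search
# and user-initiated fetchers. Adjust to your own preferences.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, per user-agent token, whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    result = allowed_agents(
        ROBOTS_TXT,
        "https://example.com/article",  # placeholder URL
        ["GPTBot", "OAI-SearchBot", "ChatGPT-User"],
    )
    for agent, ok in result.items():
        print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

This only shows how the rules parse. Whether a crawler honors them is up to its operator.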

Anthropic / Claude

User-Agent	Purpose
ClaudeBot	General crawling, including potential training use
Claude-SearchBot	Powers Claude's web search feature
Claude-User	Fetches pages in real time when a user asks Claude to browse a specific site

Perplexity

User-Agent	Purpose
PerplexityBot	Crawls and indexes content to power Perplexity's answer engine

Google AI

User-Agent	Purpose
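Google-Extended	Product token (not a separate crawler) that controls whether content Google has crawled may be used for Gemini model training and grounding, separately from standard Googlebot search indexing

Note: Google-Extended does not control regular Google Search crawling, which is governed by the standard Googlebot rules in your robots.txt. Search features such as AI Overviews fall under those Googlebot rules rather than Google-Extended.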

Meta AI

User-Agent	Purpose
FacebookBot	Crawls content related to Meta's AI products

Apple

User-Agent	Purpose
Applebot	Powers Siri, Spotlight, and Apple Intelligence features

Amazon

User-Agent	Purpose
Amazonbot	Crawls for Alexa and Amazon's AI systems

Cohere

User-Agent	Purpose
cohere-ai	Used in enterprise retrieval-augmented generation (RAG) pipelines

Example robots.txt Block

User-agent: GPTBot
Allow: /llms.txt
Allow: /semantic/
Allow: /markdown/
Content-Signal: ai-train=no, search=yes, ai-input=yes

User-agent: ClaudeBot
Allow: /llms.txt
Allow: /semantic/
Allow: /markdown/
Content-Signal: ai-train=no, search=yes, ai-input=yes

See robots.txt for the complete directive structure across all crawlers.
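When the same rules apply to several crawlers, the groups can be generated rather than copied by hand. The sketch below reproduces the example block above for any list of user-agents. The helper names `build_group` and `build_robots` are hypothetical, and the code only formats text; it assumes the `Content-Signal` syntax shown in the example and does not validate it.

```python
def build_group(user_agent: str, allow_paths: list[str], signals: dict[str, str]) -> str:
    """Format one robots.txt group: User-agent, Allow lines, Content-Signal."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Allow: {path}" for path in allow_paths]
    signal_text = ", ".join(f"{key}={value}" for key, value in signals.items())
    lines.append(f"Content-Signal: {signal_text}")
    return "\n".join(lines)


def build_robots(crawlers: list[str], allow_paths: list[str], signals: dict[str, str]) -> str:
    """Join one group per crawler, separated by blank lines."""
    groups = [build_group(name, allow_paths, signals) for name in crawlers]
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(
        build_robots(
            crawlers=["GPTBot", "ClaudeBot"],
            allow_paths=["/llms.txt", "/semantic/", "/markdown/"],
            signals={"ai-train": "no", "search": "yes", "ai-input": "yes"},
        )
    )
```

Keeping the crawler list in one place makes it easier to add a new user-agent later without editing every group.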

Standard (Non-AI) Search Crawlers

For comparison, these are the traditional search engine crawlers most sites already allow:

User-Agent	Engine
Googlebot	Google Search
Bingbot	Bing
Slurp	Yahoo
DuckDuckBot	DuckDuckGo

Verifying Crawler Activity

To see which of these crawlers are actually visiting your site:

Google Search Console → Settings → Crawl Stats → "Other agent type" covers Google's own crawlers beyond the main Googlebot types; it does not report third-party crawlers
Server access logs → filter by user-agent string for direct evidence
isitagentready.com → automated scan of your bot access configuration

See Validation for the full process.
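Filtering server access logs by user-agent, as described above, can be sketched in a few lines. The example assumes each log line contains the raw user-agent string, as in the common "combined" log format. The `count_crawler_hits` helper and the sample lines are hypothetical. Google-Extended is left out because it is a robots.txt token rather than a user-agent that appears in requests.

```python
from collections import Counter

# User-agent tokens from the tables above. Google-Extended is omitted
# because it is a robots.txt token, not a user-agent seen in logs.
AI_CRAWLERS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "FacebookBot", "Applebot", "Amazonbot", "cohere-ai",
]


def count_crawler_hits(log_lines, crawlers=AI_CRAWLERS) -> Counter:
    """Count log lines whose text contains a known crawler token."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for name in crawlers:
            if name.lower() in lowered:
                hits[name] += 1
                break  # attribute each request to at most one crawler
    return hits


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    sample = [
        '203.0.113.5 - - [10/Oct/2025:13:55:36 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
        '203.0.113.9 - - [10/Oct/2025:13:56:01 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
        '198.51.100.7 - - [10/Oct/2025:13:57:12 +0000] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36"',
    ]
    for name, count in count_crawler_hits(sample).items():
        print(f"{name}: {count}")
```

User-agent strings can be spoofed, so treat these counts as a first indication rather than proof of a crawler's identity.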

This List Will Change

New AI crawlers appear regularly as the field evolves. AIA Matrix keeps its generated robots.txt templates updated as new, significant crawlers emerge — Professional plan users receive these updates automatically with each re-scan.

robots.txt — full directive syntax
Content Signals — granular usage preferences per crawler
Validation — confirming crawler activity