AI Crawlers
A reference guide to the major AI crawlers currently active on the web — what each one does, and how to control their access through robots.txt.
Why This Matters
Each AI company operates its own crawler (or crawlers) with a distinct user-agent name. Controlling access at a per-crawler level — rather than a single blanket rule — lets you make deliberate choices about which AI systems can train on, search, or cite your content. See Content Signals for how to express more granular preferences per crawler.
OpenAI / ChatGPT
| User-Agent | Purpose |
|---|---|
| GPTBot | Crawls content for model training |
| OAI-SearchBot | Powers ChatGPT's search/browsing features |
| ChatGPT-User | Fetches pages in real time when a user asks ChatGPT to browse a specific site |
Anthropic / Claude
| User-Agent | Purpose |
|---|---|
| ClaudeBot | General crawling, including potential training use |
| Claude-SearchBot | Powers Claude's web search feature |
| Claude-User | Fetches pages in real time when a user asks Claude to browse a specific site |
Perplexity
| User-Agent | Purpose |
|---|---|
| PerplexityBot | Crawls and indexes content to power Perplexity's answer engine |
Google AI
| User-Agent | Purpose |
|---|---|
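| Google-Extended | Controls whether content Google has crawled may be used to train Gemini models and for grounding in Gemini Apps and Vertex AI, separately from standard Googlebot search indexing |
Note: Google-Extended is a robots.txt control token, not a separate crawler, so it does not appear as a user-agent in server logs. It does not control regular Google Search crawling or Search features such as AI Overviews; those are governed by the standard Googlebot rules in your robots.txt.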
Meta AI
| User-Agent | Purpose |
|---|---|
| FacebookBot | Crawls content related to Meta's AI products |
| meta-externalagent | Crawls content for uses such as AI model training and product improvement |
Apple
| User-Agent | Purpose |
|---|---|
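| Applebot | Powers Siri, Spotlight, and Apple Intelligence features |
| Applebot-Extended | robots.txt control token that governs whether content crawled by Applebot may be used to train Apple's foundation models; it does not crawl pages itself |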
Amazon
| User-Agent | Purpose |
|---|---|
| Amazonbot | Crawls for Alexa and Amazon's AI systems |
Cohere
| User-Agent | Purpose |
|---|---|
| cohere-ai | Used in enterprise retrieval-augmented generation (RAG) pipelines |
Example robots.txt Block
User-agent: GPTBot
Allow: /llms.txt
Allow: /semantic/
Allow: /markdown/
Content-Signal: ai-train=no, search=yes, ai-input=yes

User-agent: ClaudeBot
Allow: /llms.txt
Allow: /semantic/
Allow: /markdown/
Content-Signal: ai-train=no, search=yes, ai-input=yes

See robots.txt for the complete directive structure across all crawlers.
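
One way to sanity-check per-crawler rules before deploying them is to evaluate a robots.txt body against specific user-agents. The sketch below uses Python's standard-library `urllib.robotparser`. The `Disallow: /` line, the `example.com` URLs, and the `allowed` helper are assumptions added for illustration and are not part of the example above. This parser ignores directives it does not recognize (such as Content-Signal) and applies the first matching rule within a group, so individual crawlers may interpret the same file differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. The "Disallow: /" line is an assumption added
# so the check has something to block; the example above has only Allow rules.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /llms.txt
Allow: /semantic/
Allow: /markdown/
Disallow: /
Content-Signal: ai-train=no, search=yes, ai-input=yes

User-agent: ClaudeBot
Allow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("GPTBot", "https://example.com/llms.txt"),
        ("GPTBot", "https://example.com/blog/post"),
        ("ClaudeBot", "https://example.com/blog/post"),
        ("PerplexityBot", "https://example.com/blog/post"),
    ]
    for agent, url in checks:
        print(f"{agent:15} {url:35} allowed={allowed(agent, url)}")
```

In this sketch a crawler with no matching group (PerplexityBot here) is allowed by default, which is why a per-crawler policy needs an explicit group for every crawler you want to restrict.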
Standard (Non-AI) Search Crawlers
For comparison, these are the traditional search engine crawlers most sites already allow:
| User-Agent | Engine |
|---|---|
| Googlebot | Google Search |
| Bingbot | Bing |
| Slurp | Yahoo |
| DuckDuckBot | DuckDuckGo |
Verifying Crawler Activity
To see which of these crawlers are actually visiting your site:
- Google Search Console → Settings → Crawl Stats → "Other agent type" surfaces activity from Google's own crawlers beyond the main Googlebot types; it does not report third-party AI crawlers
- Server access logs → filter by user-agent string for direct evidence
- isitagentready.com → automated scan of your bot access configuration
See Validation for the full process.
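
The server-log approach can be sketched in a few lines of Python. This is a minimal illustration that assumes the combined log format, where the user-agent is the last double-quoted field. The function names, the sample log lines, and the version strings inside them are hypothetical. Google-Extended is left out because it is a robots.txt token rather than a request user-agent. A user-agent match is only a hint, since any client can send any string.

```python
from collections import Counter

# User-agent tokens taken from the tables in this guide.
AI_CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "FacebookBot", "Applebot", "Amazonbot", "cohere-ai",
]


def extract_user_agent(log_line: str) -> str:
    """Return the last double-quoted field of a log line, or "" if absent.

    Assumes the combined log format, where that field is the user-agent.
    """
    parts = log_line.rstrip().rsplit('"', 2)
    return parts[1] if len(parts) == 3 else ""


def count_ai_crawler_hits(log_lines) -> Counter:
    """Count requests per AI crawler token, matching case-insensitively."""
    counts = Counter()
    for line in log_lines:
        user_agent = extract_user_agent(line).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break  # count each request at most once
    return counts


# Hypothetical sample lines; real user-agent strings are longer.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2025:13:55:36 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [10/Oct/2025:13:56:02 +0000] "GET /markdown/guide.md HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [10/Oct/2025:13:57:10 +0000] "GET /blog/what-is-gptbot HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
    '203.0.113.5 - - [10/Oct/2025:13:58:44 +0000] "GET /semantic/ HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

if __name__ == "__main__":
    for token, hits in count_ai_crawler_hits(SAMPLE_LOG).most_common():
        print(f"{token}: {hits}")
```

Matching only the user-agent field, rather than the whole line, avoids counting ordinary browser requests whose path happens to contain a crawler name, as in the third sample line.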
This List Will Change
New AI crawlers appear regularly as the field evolves. AIA Matrix keeps its generated robots.txt templates updated as new, significant crawlers emerge — Professional plan users receive these updates automatically with each re-scan.
Related
- robots.txt — full directive syntax
- Content Signals — granular usage preferences per crawler
- Validation — confirming crawler activity