Question 1

Will blocking AI bots hurt my Google search ranking?

Accepted Answer

No. AI training bots (GPTBot, Google-Extended, ClaudeBot, CCBot) are separate from Googlebot, Bingbot, and other regular search crawlers. Google-Extended in particular is a pseudo-bot: it never crawls your site; it only tells Google whether content that Googlebot has already fetched may be used for Gemini training. Blocking Google-Extended does not change your Google Search ranking. The same holds for Applebot-Extended, which is separate from the Applebot crawler behind Spotlight and Siri. Regular search crawlers stay allowed through the 'User-agent: *' plus 'Allow: /' fallback that the tool includes by default.
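One way to convince yourself of this is to parse a candidate file with Python's standard-library robots.txt parser and ask it who may fetch a page. The sketch below is illustrative only: the file contents and the example.com URL are assumptions, not the tool's exact output, and the parser's matching rules may differ slightly from any given crawler's.

```python
from urllib.robotparser import RobotFileParser

# Assumed example file: training bots blocked, everything else allowed.
ROBOTS_TXT = """\
# Block AI training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

# Allow all other crawlers
User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url="https://example.com/article"):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "Bingbot", "GPTBot", "Google-Extended"):
        print(agent, "allowed" if allowed(agent) else "blocked")
```

The search crawlers fall through to the wildcard group, while the named training bots hit their own Disallow rule.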

Question 2

What's the difference between training bots and citation bots?

Accepted Answer

Training bots (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, CCBot, Bytespider) scrape the open web to build the datasets that future LLM versions are trained on. Citation bots (ChatGPT-User, Claude-Web, Perplexity-User, OAI-SearchBot) fetch your site at the moment a user asks about you, so the AI's answer can cite your URL and send you traffic. Most sites want to block training (to keep control of their content) but allow citation (to keep referral traffic). That is what the 'Block AI training' preset does.
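The preset described above can be sketched as a small generator. This is a hypothetical illustration rather than the tool's actual code: the function name build_robots_txt and its parameters are assumptions, and the bot lists are simply the ones named in this answer.

```python
# Bot names taken from the answer above; the split is the one it describes.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "Meta-ExternalAgent", "CCBot", "Bytespider",
]
CITATION_BOTS = ["ChatGPT-User", "Claude-Web", "Perplexity-User", "OAI-SearchBot"]


def build_robots_txt(block_training=True, block_citation=False, comments=True):
    """Build robots.txt text with one group per bot category (hypothetical helper)."""
    lines = []

    def add_group(label, bots, blocked):
        if comments:
            lines.append(f"# {'Block' if blocked else 'Allow'} {label}")
        lines.extend(f"User-agent: {bot}" for bot in bots)
        lines.append("Disallow: /" if blocked else "Allow: /")
        lines.append("")  # a blank line ends the group

    add_group("AI training", TRAINING_BOTS, block_training)
    add_group("AI citation", CITATION_BOTS, block_citation)

    if comments:
        lines.append("# Allow all other crawlers")
    lines.extend(["User-agent: *", "Allow: /"])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt())
```

Calling build_robots_txt() with the defaults corresponds to blocking training while allowing citation; passing comments=False gives the minimal output without explanatory lines.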

Question 3

Does robots.txt actually work? Can't bots just ignore it?

Accepted Answer

Bots from the major companies (OpenAI, Anthropic, Google, Apple, Meta, Common Crawl) honor robots.txt as a matter of public policy, and those companies publish documentation on the user-agent strings to use. Some bots are reported to ignore it: Perplexity has been called out by 404 Media and others for fetching blocked content via Perplexity-User even when PerplexityBot is disallowed, and Bytespider has a history of slow compliance. For stronger control on top of robots.txt, Cloudflare offers AI bot blocking at the firewall layer, you can filter by user-agent on the server and return 403 to known AI bots, or you can rate-limit aggressive crawlers. robots.txt is the polite baseline; harder controls layer on top.
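Server-side user-agent filtering can be as small as a wrapper around your application. The following is a minimal sketch assuming a Python WSGI app; the token list and the names is_blocked_agent and block_ai_bots are illustrative choices, not part of any framework. It only stops bots that identify themselves honestly in the User-Agent header.

```python
# Lowercase substrings to look for in the User-Agent header (assumed list).
BLOCKED_AGENT_TOKENS = ("gptbot", "claudebot", "ccbot", "bytespider", "perplexitybot")


def is_blocked_agent(user_agent):
    """Return True if the User-Agent header contains a blocked bot token."""
    lowered = user_agent.lower()
    return any(token in lowered for token in BLOCKED_AGENT_TOKENS)


def block_ai_bots(app):
    """Wrap a WSGI app so that requests from listed bots get a 403."""
    def middleware(environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

Usage would be along the lines of application = block_ai_bots(application) in the module your WSGI server loads.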

Question 4

What about bots that aren't on this list?

Accepted Answer

The list covers the 22 most-active and most-discussed AI crawlers in 2026. New bots appear regularly (new model labs, new scraping startups). The tool defaults to an 'Allow all other crawlers' fallback, meaning your robots.txt blocks only what is listed; everything else can crawl unless explicitly blocked. Watch your server logs, or a tracker such as Dark Visitors, and add specific User-agent entries when you see new bots. The list is updated as new bots become known, so check back periodically.
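Spotting unlisted bots in your own logs can be partly automated. The sketch below assumes an access log in the common "combined" format, where the User-Agent is the last quoted field on each line; the known-token set, the keyword heuristic, and the function name unknown_bot_agents are all assumptions for illustration. Bots that disguise themselves as browsers will not be caught this way.

```python
import re
from collections import Counter

# Lowercase tokens for bots you already handle (assumed, partial list).
KNOWN_TOKENS = {
    "gptbot", "claudebot", "ccbot", "bytespider", "perplexitybot",
    "googlebot", "bingbot", "applebot",
}

# Heuristic: user agents that describe themselves as automated.
BOT_HINT = re.compile(r"bot|crawler|spider|scraper", re.IGNORECASE)

# In the combined log format the User-Agent is the last quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def unknown_bot_agents(log_lines, known=KNOWN_TOKENS):
    """Count bot-like user agents that match none of the known tokens."""
    counts = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        agent = match.group(1)
        lowered = agent.lower()
        if BOT_HINT.search(agent) and not any(token in lowered for token in known):
            counts[agent] += 1
    return counts
```

Feeding it the lines of an access log and reading counts.most_common() gives a shortlist of candidates to research before adding new User-agent entries.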

Question 5

Will this work on any website host?

Accepted Answer

Yes. robots.txt is the universal web crawler standard, and every host can serve it. Place the generated file at the root of your domain so it is reachable at /robots.txt. For Vercel, Netlify, Cloudflare Pages, and most static-site hosts, that usually means public/robots.txt in your repo (the exact folder depends on your framework). For WordPress, use an SEO plugin that manages robots.txt, or your hosting control panel. For server-rendered apps, serve it as a static asset from the root path.

Question 6

Why include explanatory comments by default?

Accepted Answer

Two reasons. First, comments make the file maintainable: when you revisit it in six months, you'll know why each block exists. Second, comments are a positive signal: if an AI company audits why its bot is blocked, the comment 'Block AI training' is clearer than a bare User-agent line, and may inform future bot-naming or opt-out conventions. The comments add maybe 50 lines to a 100-line file; turn them off if you prefer minimal output.

Question 7

I want stronger blocking than robots.txt. What else can I do?

Accepted Answer

robots.txt is a request, not enforcement. For real enforcement: (1) Cloudflare's 'AI bot blocking' feature blocks known AI crawlers at the edge regardless of robots.txt. (2) Server-side user-agent filtering: drop or 403 any request whose User-Agent header matches a known AI bot. (3) Rate limiting: AI scrapers are typically aggressive, so strict rate limits hurt them more than human visitors. (4) Authentication: if your content is genuinely sensitive, gate it behind a login. (5) Watermarking: embed canary phrases that appear only in your content, then search for them in chatbot outputs to detect training-data inclusion.
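As an illustration of option (3), a token bucket is one common way to rate-limit per client. This is a minimal in-memory sketch under stated assumptions: the class name TokenBucketLimiter, the choice of key (IP address, User-Agent, or both), and the rate and capacity values are all hypothetical, and a production setup would more likely use the limiter built into your proxy or CDN.

```python
import time


class TokenBucketLimiter:
    """Per-client token bucket: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested
        self.buckets = {}   # client key -> (tokens, time of last update)

    def allow(self, key):
        """Spend one token for `key`; return False if the bucket is empty."""
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.capacity, now))
        # Refill for the time elapsed since the last request, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

A request handler would call limiter.allow(client_key) and answer 429 Too Many Requests when it returns False. A human reader rarely exhausts a small burst allowance, while a scraper fetching many pages per second does so almost immediately.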

Block AI crawlers.

· OpenAI

· Anthropic

· Google

· Perplexity

· Apple

· Meta

· Amazon

· ByteDance

· Common Crawl

· Cohere

· Diffbot

· ImageSift

· Webz.io

· You.com

· DuckDuckGo
