Allow only real Googlebot and Bingbot — and stop AI crawlers spoofing them
Updated · Related endpoint: /api/v1/is_bot
The afternoon "Googlebot" wasn't Googlebot
A Tuesday in March. Grafana was showing a slow climb on origin CPU and the cache-miss panel was lit up orange. I grepped the Nginx log for the noisiest path:
$ awk '$7 == "/products"' access.log | wc -l
4218
$ awk '$7 == "/products"' access.log | grep -ci googlebot
4011
Four thousand "Googlebot" hits on a single endpoint in one hour. We don't have four thousand products.
Google does not crawl a single URL four thousand times an hour. Whatever was hitting /products was using User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) and was something else entirely.
The rest of this guide is the playbook I wish I'd had open in another tab. It's about verifying that a request claiming to be Googlebot, Bingbot, GPTBot or ClaudeBot is the real thing, and about the production pattern most teams end up wanting: allow Googlebot and Bingbot, gate everything else, block the AI training crawlers entirely.
The User-Agent string is meaningless
It bears repeating because almost every "block bots" tutorial on the web is still telling people to regex the UA header. You can claim to be Googlebot from your laptop, right now:
$ curl -s -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
https://example.com/expensive-endpoint
The header is whatever the client puts in it. It is a hint, not an identity. Any allowlist that uses it as the primary signal is a free pass — and worse, it's a free pass that scrapers know about and target deliberately, because it bypasses naïve "skip the rate limiter for known good bots" branches.
The textbook fix: forward-confirmed reverse DNS
Google's own verifying Googlebot page lays out the canonical method: forward-confirmed reverse DNS (FCrDNS). Bing has the same protocol for Bingbot. The procedure is:
- Take the source IP from the request.
- Do a reverse DNS (PTR) lookup. The hostname must end in .googlebot.com or .google.com for Google; .search.msn.com for Bing.
- Forward-resolve that hostname back to an IP. The resolved IP must equal the original source IP.
If any step fails or the suffix is wrong, the request is not from the bot it claims to be. Worked example:
$ dig +short -x 66.249.66.1
crawl-66-249-66-1.googlebot.com.
$ dig +short crawl-66-249-66-1.googlebot.com
66.249.66.1
And an impostor — same UA, different IP:
$ dig +short -x 192.0.2.7
;; (NXDOMAIN, or a PTR that does NOT end in googlebot.com / google.com)
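The three steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not a hardened implementation: `verify_crawler`, `host_in_zone` and the `ALLOWED_ZONES` table are names invented for this example, the default resolvers use the blocking standard-library `socket` calls, and both resolvers are injectable so the logic can be exercised without live DNS.

```python
import ipaddress
import socket

# Zones taken from the procedure above; the dict layout itself is illustrative.
ALLOWED_ZONES = {
    "google": ("googlebot.com", "google.com"),
    "bing": ("search.msn.com",),
}


def host_in_zone(hostname, zone):
    """True if hostname is a strict subdomain of zone, compared label by label."""
    host_labels = hostname.rstrip(".").lower().split(".")
    zone_labels = zone.lower().split(".")
    return len(host_labels) > len(zone_labels) and host_labels[-len(zone_labels):] == zone_labels


def reverse_lookup(ip):
    """PTR lookup; returns a list of hostnames, empty if there is no PTR."""
    try:
        name, aliases, _ = socket.gethostbyaddr(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return []
    return [name, *aliases]


def forward_lookup(hostname):
    """A/AAAA lookup; returns the set of address strings the name resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return set()
    return {info[4][0] for info in infos}


def verify_crawler(ip, operator, reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS: PTR in an allowed zone AND name resolves back to ip."""
    source = ipaddress.ip_address(ip)
    for name in reverse(ip):
        if not any(host_in_zone(name, zone) for zone in ALLOWED_ZONES.get(operator, ())):
            continue
        if any(ipaddress.ip_address(addr) == source for addr in forward(name)):
            return True
    return False
```

Comparing parsed `ipaddress` objects rather than raw strings avoids false mismatches between equivalent IPv6 spellings. In a request handler the two lookups would normally sit behind a cache rather than run inline.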
Where FCrDNS quietly breaks in production
This is the part the textbooks leave out. I have hit every one of these:
- IPv6. Don't assume your reverse zone is wired up. dig -x 2001:4860:4801:10::24 should return a .google.com PTR; if your resolver mishandles ip6.arpa PTR queries you'll get NXDOMAIN, and if it drops AAAA queries the forward-confirmation step fails. Either way you falsely flag a real Googlebot as a spoof. Test with dig @8.8.8.8 -x <ipv6> before you trust your local view.
- Resolver latency on the hot path. Two synchronous DNS round-trips per request is fine for an analytics job. It is not fine for a request handler. If you must do this inline, cache aggressively: the PTR for 66.249.66.1 isn't changing this hour. A small in-process LRU keyed on the IP, with a 1-hour TTL, eliminates ~99% of the lookups on a busy server.
- PTR drift. When Bing rolled a new range a few years back, the PTR records lagged the IP range publication by several days. If you were checking PTR-only, you flagged real Bingbot as fake for a week.
- Sloppy suffix matching is a footgun. A PTR of my-evil-host.googlebot.com.attacker.tld ends in .attacker.tld, not .googlebot.com, but a substring check such as contains("googlebot.com") or an unanchored regex will accept it, and endsWith("googlebot.com") without the leading dot will accept evilgooglebot.com. Always split on dots and compare labels, or anchor the regex with \.googlebot\.com\.?$ (PTR answers often carry a trailing dot).
- The forward-confirmation step is not optional. An attacker who controls a reverse zone (rare but not unheard of, especially on poorly-configured colos) can put anything in PTR. The forward lookup back to the original IP is what makes it tamper-evident.
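The "small in-process LRU with a 1-hour TTL" from the latency item could look roughly like the sketch below. `TTLCache` and `get_or_compute` are names invented here, the size limit is an arbitrary assumption, and the class is not thread-safe; the clock is injectable only so that expiry can be tested without waiting an hour.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small LRU cache whose entries also expire after ttl_seconds."""

    def __init__(self, max_entries=10_000, ttl_seconds=3600.0, clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute(key) on a miss or expiry."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            self._entries.move_to_end(key)  # mark as most recently used
            return hit[1]
        value = compute(key)
        self._entries[key] = (now + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value
```

Keyed on the source IP, with `compute` set to whatever verification function you use, this caches negative verdicts as well as positive ones. That is usually what you want for spoofers, but it also means a real crawler hit by PTR drift stays misclassified until its entry expires.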
The other approach: published CIDR ranges
The major crawlers also publish their IP ranges as machine-readable JSON. This is what you want if you're classifying logs offline or building an allowlist into a config file — no DNS dependency, deterministic, cacheable on disk.
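As a rough illustration of the offline check, the sketch below matches an IP against a list shaped like googlebot.json. It assumes the file's layout is a `prefixes` array whose entries carry either an `ipv4Prefix` or an `ipv6Prefix` key; other publishers use different schemas. The two prefixes are sample data for the example, not a copy of the live list, and `load_networks` / `ip_in_networks` are hypothetical helper names.

```python
import ipaddress

# Sample data in the assumed googlebot.json shape; NOT the live published list.
SAMPLE_GOOGLEBOT_JSON = {
    "prefixes": [
        {"ipv4Prefix": "66.249.66.0/27"},
        {"ipv6Prefix": "2001:4860:4801:10::/64"},
    ]
}


def load_networks(doc):
    """Turn the parsed JSON document into a list of ip_network objects."""
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_networks(ip, networks):
    """True if ip falls inside any of the networks (IPv4 and IPv6 kept apart)."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in networks)


networks = load_networks(SAMPLE_GOOGLEBOT_JSON)
print(ip_in_networks("66.249.66.1", networks))
print(ip_in_networks("192.0.2.7", networks))
```

A linear scan is fine for a few hundred prefixes; for line-rate matching a prefix trie, or Nginx's own geo module, is the better fit.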
The asymmetry across publishers is the annoying part:
| Crawler | Published source | Refresh cadence |
|---|---|---|
| Googlebot | developers.google.com/search/apis/ipranges/googlebot.json | Roughly weekly |
| Bingbot | www.bing.com/toolbox/bingbot.json | Irregular; lag of a few days is normal |
| GPTBot (training) | openai.com/gptbot.json | Days-to-weeks |
| ChatGPT-User (user-initiated fetch) | openai.com/chatgpt-user.json | Days-to-weeks |
| OAI-SearchBot | openai.com/searchbot.json | Days-to-weeks |
| ClaudeBot / claude-web / claude-user | Anthropic support docs (JSON list) | Irregular |
| PerplexityBot | Perplexity docs (UA + small CIDR set) | Often stale |
| Applebot | FCrDNS only: .applebot.apple.com | No published JSON |
| DuckDuckBot | Plain-text IP list on duckduckgo.com/duckduckbot | Rarely changes |
| Common Crawl (CCBot) | No published IP list; UA only | n/a |
Maintaining ten of these by hand, fetching them on different cadences, normalising the schemas, dealing with the ones that just publish a paragraph of prose on a support page: that's the job /api/v1/is_bot exists to absorb. One HTTP call, every published list aggregated and refreshed daily, plus a record of retired ranges for analysing older logs (see the log analysis guide's historical and date flags).
Why most teams want "allow Google and Bing, block the AI crawlers"
This is the rule I end up writing for every site I've ever consulted on. The reasoning is mundane:
- Google and Bing send paying users. A page indexed by Googlebot or Bingbot can appear in search results and convert into a customer. Blocking them costs traffic and revenue.
- The AI training crawlers don't. A page slurped into a model's training set rarely drives attribution back to your site. When it does surface (in a Perplexity answer, in a ChatGPT response with citations), the user has already gotten what they wanted without a click. The trade is mostly one-way.
- robots.txt is unenforceable. It is a polite request. The well-behaved crawlers honour it; the rest do not. If your business model depends on content not being scraped, a Disallow: line is a sticker on a door, not a lock.
- The lawful path is IP-based. Allow the IP ranges of the bots you want, force everything else (whether it claims to be GPTBot, ClaudeBot, or "Mozilla/5.0") to clear whatever bar you set for unknown clients.
A subtlety that matters: not all "AI bots" are training crawlers. ChatGPT-User and Claude-User are user-initiated fetches: somebody pasted a URL into a chat and the assistant fetched it on their behalf. That is closer to a browser visit than to a scrape, and treating it like a training crawler will frustrate your real users. /api/v1/is_bot's categories field distinguishes these (see the decision table below). A real production allowlist usually treats USER_INITIATED_FETCHING the way it treats a browser, not the way it treats AI_TRAINING.
The finding that surprised me
Back to the Tuesday-afternoon log. I expected the spoofers to live in the usual places — DigitalOcean,
Hetzner, OVH, a fresh VPS pool. Most of them did. The surprise was a long tail of "Googlebot" hits
coming from inside Google's own ASN, AS15169, but not from the
googlebot.json ranges. Those weren't Google's crawler. They were Google Cloud customers
running scrapers from GCE instances and counting on operators to confuse "AS15169" with "Googlebot."
Sharing an ASN with a crawler is not a verification signal. If your detection is "ASN looks Googley, good enough," half of GCP's customer base looks Googley. The published CIDR is narrower than the ASN on purpose. The same trap exists for Microsoft (AS8075 ≠ Bingbot) and AWS (AS16509 hosts plenty of bots that aren't Amazonbot).
A second, smaller surprise: roughly 6% of "Googlebot" hits had a plausible-looking PTR but failed full verification. The PTR pointed at crawl-…-shaped hostnames in zones that weren't .googlebot.com at all. A check that only pattern-matched the hostname, without anchoring the suffix and forward-confirming, would have waved them through.
A working Nginx allowlist
The pattern I land on: a cron job pulls the verified-crawler CIDRs once a day, writes them to an Nginx
geo map file, reloads Nginx. The hot path stays in C — no DNS, no API call per request —
and the daily refresh keeps the list fresh.
Pull the lists each morning:
#!/bin/bash
set -euo pipefail
OUT=/etc/nginx/crawler-allowlist.conf
TMP=$(mktemp)
trap 'rm -f "$TMP"' EXIT
# Ask urlcap for every active CIDR, then keep only the operators we want to allow.
# The awk filter assumes column 1 is the CIDR and column 3 is the operator name;
# check the CSV header and adjust the column numbers if yours differ.
{
  echo "geo \$is_allowed_crawler {"
  echo "    default 0;"
  curl -fsS https://urlcap.com/api/v1/datasets/bot_cidrs.csv \
    -H "Authorization: Bearer $URLCAP_KEY" \
    | awk -F, 'NR>1 && ($3=="Google" || $3=="Bing") {print "    " $1 " 1;"}'
  echo "}"
} > "$TMP"
# Refuse to install an empty allowlist (for example, if the filter matched nothing).
grep -q ' 1;$' "$TMP" || { echo "no CIDRs matched; keeping the old allowlist" >&2; exit 1; }
install -m 0644 "$TMP" "$OUT"
nginx -t && systemctl reload nginx
And in the Nginx config (the include and the map belong in the http context; the checks go in the server block):
# http context: geo and map are not allowed inside server {}.
include /etc/nginx/crawler-allowlist.conf;
map $http_user_agent $claims_to_be_bot {
    default 0;
    "~*(Googlebot|bingbot|GPTBot|ClaudeBot|PerplexityBot|CCBot)" 1;
}
server {
    # ...
    # If the request claims to be a crawler but the IP isn't in the allow set, drop it.
    set $verdict "";  # initialise so the checks below never read an unset variable
    if ($claims_to_be_bot = 1) { set $verdict "claim"; }
    if ($is_allowed_crawler = 1) { set $verdict "${verdict}_ok"; }
    if ($verdict = "claim") { return 403; } # claimed crawler, IP not in our allow set
}
Two things this gets right that most "block bots" snippets get wrong: (1) it never treats the UA as identity; the UA is only the trigger for the IP check, not a substitute for it; and (2) it lets unflagged traffic (no crawler UA, not in the allow set) flow through normally to whatever else handles humans: Cloudflare, rate limiting, your app's own auth. We're not building a WAF here. We're closing the specific hole where "I am Googlebot" buys preferential treatment.
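Nginx has no boolean AND inside `if`, which is why the server block concatenates strings into a verdict variable. The truth table is easier to see in a few lines of Python. This is only a model of the config's logic, with `nginx_verdict` a name made up for the illustration:

```python
def nginx_verdict(claims_to_be_bot, ip_in_allow_set):
    """Mirror the three `if` lines of the server block and return the outcome."""
    verdict = ""
    if claims_to_be_bot:
        verdict = "claim"
    if ip_in_allow_set:
        verdict = f"{verdict}_ok"
    # Only the exact string "claim" (crawler UA, IP not allowlisted) is rejected.
    return "403" if verdict == "claim" else "pass"


for claims in (False, True):
    for allowed in (False, True):
        print(f"claims_bot={claims!s:5} ip_allowed={allowed!s:5} -> {nginx_verdict(claims, allowed)}")
```

Only one of the four combinations is rejected: a crawler User-Agent arriving from an IP outside the allow set. Everything else passes through to the rest of the stack.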
Doing it without the local cron + map dance
If you don't want a CIDR file on disk, ask /api/v1/is_bot per request (cache the answer for
~24h client-side). The single-IP call:
curl -s https://urlcap.com/api/v1/is_bot \
-H "Authorization: Bearer $URLCAP_KEY" \
-H "Content-Type: application/json" \
-d '{"ip":"66.249.66.1"}'
Want the PTR included so you can audit the FCrDNS chain in the same call?
curl -s "https://urlcap.com/api/v1/is_bot?reverseDns=true" \
-H "Authorization: Bearer $URLCAP_KEY" \
-H "Content-Type: application/json" \
-d '{"ip":"66.249.66.1"}'
The response includes reverseDns.names[] alongside the CIDR match. Cross-reference the
hostname suffix yourself if you want belt-and-braces verification; for most use cases the CIDR match is
already authoritative because the index is built from each operator's own published list.
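For callers that are not shell scripts, the same two requests can be built with Python's standard library. The sketch below mirrors only what the curl commands show (endpoint, bearer header, JSON body, optional reverseDns query flag). The function names are invented here, and the response is returned as parsed JSON without interpreting any fields, since its full schema is not described in this guide.

```python
import json
import urllib.request

API_URL = "https://urlcap.com/api/v1/is_bot"


def build_is_bot_request(ip, api_key, reverse_dns=False):
    """Build the POST request equivalent to the curl commands above."""
    url = f"{API_URL}?reverseDns=true" if reverse_dns else API_URL
    body = json.dumps({"ip": ip}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def fetch_is_bot(ip, api_key, reverse_dns=False, timeout=5.0):
    """Send the request and return the decoded JSON response as-is."""
    request = build_is_bot_request(ip, api_key, reverse_dns)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

To follow the ~24h caching advice, wrap `fetch_is_bot` in whatever cache your application already uses, keyed on the IP, so the network call stays off the hot path.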
The decision table: what to allow, what to drop
The categories field on every match is the practical signal for "what is this bot
for" — more useful than the operator name when you're writing a policy. The codes you'll see:
| Category | Examples | Typical policy |
|---|---|---|
| SEARCH_INDEXING | Googlebot, Bingbot, Yandex, DuckDuckBot | Allow |
| SEARCH_SPECIALIZED | Google AdsBot, Google Mobile-Friendly, image / news bots | Allow |
| AI_TRAINING | GPTBot, ClaudeBot, CCBot, Amazonbot, Applebot-Extended | Block (or paywall) |
| AI_SEARCH_OR_ANSWERING | OAI-SearchBot, PerplexityBot, DuckAssistBot | Allow if you want attribution / citations; block otherwise |
| USER_INITIATED_FETCHING | ChatGPT-User, Claude-User, Perplexity-User | Treat like a browser (it is one, on behalf of a real human) |
| SEO_ANALYTICS | AhrefsBot, SemrushBot, MJ12bot, MozBot | Block (no traffic upside, full content download) |
| SOCIAL_PREVIEW | Twitterbot, facebookexternalhit, LinkedInBot, Discordbot | Allow (one fetch per shared URL, generates traffic back) |
| WEB_DATASET_ARCHIVING | CCBot, Internet Archive, research crawlers | Block unless you actively want to be archived |
Branching on category instead of operator name means you don't have to rewrite your policy every time
a new AI lab launches a crawler. A new entrant categorised AI_TRAINING falls under the
same rule the day it's added to the index.
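A category-driven policy can be as small as a lookup table. The sketch below encodes the "Typical policy" column above. It assumes the categories value arrives as a list of strings and that a bot can carry more than one (CCBot appears in two rows), and it resolves conflicts with "block wins"; that precedence rule is a choice made for this example, not something the API prescribes. `POLICY` and `decide` are hypothetical names.

```python
ALLOW, BLOCK, BROWSER = "allow", "block", "treat_as_browser"

# One entry per category code from the decision table.
POLICY = {
    "SEARCH_INDEXING": ALLOW,
    "SEARCH_SPECIALIZED": ALLOW,
    "AI_TRAINING": BLOCK,
    "AI_SEARCH_OR_ANSWERING": ALLOW,  # switch to BLOCK if you don't want citations
    "USER_INITIATED_FETCHING": BROWSER,
    "SEO_ANALYTICS": BLOCK,
    "SOCIAL_PREVIEW": ALLOW,
    "WEB_DATASET_ARCHIVING": BLOCK,
}


def decide(categories, default=BROWSER):
    """Pick one action for a bot; the strictest matching rule wins."""
    actions = {POLICY[c] for c in categories if c in POLICY}
    if not actions:
        return default  # unknown category: same bar as any unknown client
    if BLOCK in actions:
        return BLOCK
    if BROWSER in actions:
        return BROWSER
    return ALLOW


print(decide(["SEARCH_INDEXING"]))
print(decide(["AI_TRAINING", "WEB_DATASET_ARCHIVING"]))
print(decide(["USER_INITIATED_FETCHING"]))
```

Because the function never looks at the operator name, a newly launched crawler needs no code change as long as it arrives with a category the table already covers.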
When this approach is the wrong tool
A few honest caveats. This guide is about identifying named crawlers — bots that publish identities and ranges. It is not the right tool for:
- Adversarial scraping that doesn't claim to be anyone. A scraper running on residential proxies with a real-looking browser UA and TLS fingerprint won't appear in any published list, because it's not a published crawler. You need behavioural detection (rate, request pattern, JS-challenge, TLS/JA4 fingerprint) for that — see /docs#ja4.
- Edge mitigation at line rate. If you need to drop traffic in microseconds before it touches origin, that's Cloudflare / Fastly / Akamai territory. The pattern in this guide adds ~1ms of latency if cached and ~30–100ms uncached. Fine for an auth middleware; not fine for a CDN.
- Geolocating users or detecting VPN/Tor. Different question, different vendor. IPinfo, MaxMind, and IPQualityScore do that well. /api/v1/is_bot only answers "is this IP a published crawler?"
Pricing & limits
Each /api/v1/is_bot call counts as one request unit. Batching up to 200 IPs per call is
free of charge beyond the single request. A daily cron that pulls the full allowlist via
/api/v1/datasets/bot_cidrs.csv counts as one request per fetch. For most production
sites that's well inside the free 1,000/mo. See /pricing.
Related
- Detect bots from server logs: the offline, batch-classify-a-CSV version of this story.
- API reference: /api/v1/is_bot, with the full parameter list, including reverseDns, historical, date.
- API reference: /api/v1/reverse_dns, for standalone PTR lookups with built-in TTL caching.
- urlcap's own crawler identity and CIDRs: same publishing discipline we ask others to follow.