Bot detection

Allow only real Googlebot and Bingbot — and stop AI crawlers spoofing them

Updated 20 May 2026 · Related endpoint: /api/v1/is_bot

Try this in 60 seconds. 5 free requests with no account, then 1,000/month on the free tier. Get a free API key →

The afternoon "Googlebot" wasn't Googlebot

A Tuesday in March. Grafana was showing a slow climb on origin CPU and the cache-miss panel was lit up orange. I grepped the Nginx log for the noisiest path:

shell

$ awk '$7 == "/products"' access.log | wc -l
4218
$ awk '$7 == "/products"' access.log | grep -ci googlebot
4011

Four thousand "Googlebot" hits on a single endpoint in one hour. We don't have four thousand products. Google does not crawl a single URL four thousand times an hour. Whatever was hitting /products was using User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) and was something else entirely.

The rest of this guide is the playbook I wish I'd had open in another tab. It's about verifying that a request claiming to be Googlebot, Bingbot, GPTBot or ClaudeBot is the real thing, and about the production pattern most teams end up wanting: allow Googlebot and Bingbot, gate everything else, block the AI training crawlers entirely.

The User-Agent string is meaningless

It bears repeating because almost every "block bots" tutorial on the web is still telling people to regex the UA header. You can claim to be Googlebot from your laptop, right now:

curl

$ curl -s -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
       https://example.com/expensive-endpoint

The header is whatever the client puts in it. It is a hint, not an identity. Any allowlist that uses it as the primary signal is a free pass — and worse, it's a free pass that scrapers know about and target deliberately, because it bypasses naïve "skip the rate limiter for known good bots" branches.

The textbook fix: forward-confirmed reverse DNS

Google's own "Verifying Googlebot" documentation lays out the canonical method: forward-confirmed reverse DNS (FCrDNS). Bing documents the same procedure for Bingbot. The steps are:

Take the source IP from the request.
Do a reverse DNS (PTR) lookup. The hostname must end in .googlebot.com or .google.com for Google; .search.msn.com for Bing.
Forward-resolve that hostname back to an IP. The resolved IP must equal the original source IP.

If any step fails or the suffix is wrong, the request is not from the bot it claims to be. Worked example:

dig — a real Googlebot

$ dig +short -x 66.249.66.1
crawl-66-249-66-1.googlebot.com.

$ dig +short crawl-66-249-66-1.googlebot.com
66.249.66.1

And an impostor — same UA, different IP:

dig — an impostor

$ dig +short -x 192.0.2.7
# (empty output on NXDOMAIN, or a PTR that does NOT end in .googlebot.com / .google.com)
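The three steps above can be sketched in Python. Treat this as a minimal sketch rather than a hardened implementation: the names `verify_fcrdns`, `host_in_domain` and `ALLOWED_DOMAINS` are mine, the suffixes are the ones listed in the procedure, and the two resolver functions are passed in as parameters so the logic can be exercised without live DNS.

```python
import ipaddress
import socket

# Suffixes from the procedure above; the dict name and keys are illustrative.
ALLOWED_DOMAINS = {
    "google": ("googlebot.com", "google.com"),
    "bing": ("search.msn.com",),
}


def host_in_domain(hostname, domain):
    """True only if hostname sits under domain at a label boundary."""
    host = hostname.rstrip(".").lower()
    return host.endswith("." + domain)


def reverse_lookup(ip):
    """Step 2: PTR lookup. Returns a hostname, or None if there is no PTR."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def forward_lookup(hostname):
    """Step 3: resolve the hostname to its set of A/AAAA addresses."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return set()
    # Strip any IPv6 scope id ("%eth0") before comparing.
    return {info[4][0].split("%")[0] for info in infos}


def verify_fcrdns(ip, operator, reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS check for one source IP."""
    hostname = reverse(ip)
    if not hostname:
        return False  # no PTR at all
    if not any(host_in_domain(hostname, d) for d in ALLOWED_DOMAINS[operator]):
        return False  # PTR exists but is not under an allowed suffix
    wanted = ipaddress.ip_address(ip)
    # The hostname must resolve back to the exact source IP.
    return any(ipaddress.ip_address(a) == wanted for a in forward(hostname))
```

Calling `verify_fcrdns("66.249.66.1", "google")` with the default resolvers performs the same two lookups as the dig commands above. Comparing parsed addresses rather than strings avoids false mismatches between different spellings of the same IPv6 address.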

Where FCrDNS quietly breaks in production

This is the part the textbooks leave out. I have hit every one of these:

IPv6. Don't assume your resolver path handles it. dig -x 2001:4860:4801:10::24 should return a PTR under .googlebot.com or .google.com; if your resolver mishandles ip6.arpa PTR queries, or drops the AAAA query needed for the forward step, you'll get NXDOMAIN or an empty answer and falsely flag a real Googlebot as a spoof. Test with dig @8.8.8.8 -x <ipv6> before you trust your local view.
Resolver latency on the hot path. Two synchronous DNS round-trips per request is fine for an analytics job. It is not fine for a request handler. If you must do this inline, cache aggressively — the PTR for 66.249.66.1 isn't changing this hour. A small in-process LRU keyed on the IP, with a 1-hour TTL, eliminates ~99% of the lookups on a busy server.
PTR drift. When Bing rolled a new range a few years back, the PTR records lagged the IP range publication by several days. If you were checking PTR-only, you flagged real Bingbot as fake for a week.
Sloppy suffix matching is a footgun. A PTR of my-evil-host.googlebot.com.attacker.tld belongs to attacker.tld, not googlebot.com, yet a substring check such as contains("googlebot.com") or an unanchored regex will accept it. A check for endsWith("googlebot.com") without the leading dot will accept fakegooglebot.com. Strip the trailing dot, lowercase the name, then either split on dots and compare labels or anchor the regex with \.googlebot\.com$.
The forward-confirmation step is not optional. An attacker who controls a reverse zone (rare but not unheard of, especially on poorly-configured colos) can put anything in PTR. The forward lookup back to the original IP is what makes it tamper-evident.
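For the resolver-latency point, the in-process cache can be as small as the sketch below. It is an illustration under assumptions, not a drop-in: `TTLCache` and `cached_verdict` are hypothetical names, the 1-hour TTL is the figure suggested above, the clock is injectable so expiry can be tested, and the class is not thread-safe as written.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small LRU cache whose entries expire after ttl_seconds."""

    def __init__(self, max_entries=10_000, ttl_seconds=3600, clock=time.monotonic):
        self._data = OrderedDict()  # key -> (expires_at, value)
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # stale: force a fresh lookup
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict least recently used


def cached_verdict(ip, verify, cache):
    """Return the cached True/False verdict for ip, calling verify(ip) on a miss."""
    hit = cache.get(ip)
    if hit is not None:  # False is a valid cached verdict
        return hit
    verdict = bool(verify(ip))
    cache.put(ip, verdict)
    return verdict
```

Negative verdicts are cached too, which is what keeps a spoofer hammering one endpoint from triggering two DNS lookups per request.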

The other approach: published CIDR ranges

The major crawlers also publish their IP ranges as machine-readable JSON. This is what you want if you're classifying logs offline or building an allowlist into a config file — no DNS dependency, deterministic, cacheable on disk.

The asymmetry across publishers is the annoying part:

Crawler	Published source	Refresh cadence
Googlebot	`developers.google.com/search/apis/ipranges/googlebot.json`	Roughly weekly
Bingbot	`www.bing.com/toolbox/bingbot.json`	Irregular; lag of a few days is normal
GPTBot (training)	`openai.com/gptbot.json`	Days-to-weeks
ChatGPT-User (user-initiated fetch)	`openai.com/chatgpt-user.json`	Days-to-weeks
OAI-SearchBot	`openai.com/searchbot.json`	Days-to-weeks
ClaudeBot / claude-web / claude-user	Anthropic support docs (JSON list)	Irregular
PerplexityBot	Perplexity docs (UA + small CIDR set)	Often stale
Applebot	FCrDNS only — `.applebot.apple.com`	No published JSON
DuckDuckBot	Plain-text IP list on duckduckgo.com/duckduckbot	Rarely changes
Common Crawl (CCBot)	No published IP list — UA only	n/a

Maintaining ten of these by hand, fetching them on different cadences, normalising the schemas, dealing with the ones that just publish a paragraph of prose on a support page — that's the job /api/v1/is_bot exists to absorb. One HTTP call, every published list aggregated and refreshed daily, plus a record of retired ranges for analysing older logs (see the log analysis guide's historical and date flags).
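If you do maintain a list yourself, the matching side is short. The sketch below assumes the Google-style layout, a `prefixes` array whose entries carry either `ipv4Prefix` or `ipv6Prefix`; other publishers may use different schemas. The embedded sample is illustrative data in that layout, not a copy of any live file, and the function names are mine.

```python
import ipaddress
import json

# Illustrative sample in the assumed layout; not a copy of any published list.
SAMPLE = """
{
  "prefixes": [
    {"ipv4Prefix": "66.249.66.0/27"},
    {"ipv6Prefix": "2001:4860:4801:10::/64"}
  ]
}
"""


def load_networks(text):
    """Parse a prefixes document into a list of ip_network objects."""
    doc = json.loads(text)
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_networks(ip, networks):
    """True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in networks)


networks = load_networks(SAMPLE)
```

A linear scan is fine for a few hundred prefixes in an offline log job; for a hot path, the Nginx geo map later in this guide does the same lookup in C.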

Why most teams want "allow Google and Bing, block the AI crawlers"

This is the rule I end up writing for every site I've ever consulted on. The reasoning is mundane:

Google and Bing send paying users. A page indexed by Googlebot or Bingbot can appear in search results and convert into a customer. Blocking them costs traffic and revenue.
The AI training crawlers don't. A page slurped into a model's training set rarely drives attribution back to your site. When it does surface (in a Perplexity answer, in a ChatGPT response with citations), the user has already gotten what they wanted without a click. The trade is mostly one-way.
robots.txt is unenforceable. It is a polite request. The well-behaved crawlers honour it; the rest do not. If your business model depends on content not being scraped, a Disallow: line is a sticker on a door, not a lock.
The enforceable path is IP-based. Allow the IP ranges of the bots you want, and force everything else (whether it claims to be GPTBot, ClaudeBot, or "Mozilla/5.0") to clear whatever bar you set for unknown clients.

A subtlety that matters: not all "AI bots" are training crawlers. ChatGPT-User and Claude-User are user-initiated fetches: somebody pasted a URL into a chat and the assistant fetched it on their behalf. That is closer to a browser visit than to a scrape, and treating it like a training crawler will frustrate your real users. The categories field returned by /api/v1/is_bot distinguishes these (see the decision table below). A production allowlist usually treats USER_INITIATED_FETCHING the way it treats a browser, not the way it treats AI_TRAINING.

The finding that surprised me

Back to the Tuesday-afternoon log. I expected the spoofers to live in the usual places — DigitalOcean, Hetzner, OVH, a fresh VPS pool. Most of them did. The surprise was a long tail of "Googlebot" hits coming from inside Google's own ASN, AS15169, but not from the googlebot.json ranges. Those weren't Google's crawler. They were Google Cloud customers running scrapers from GCE instances and counting on operators to confuse "AS15169" with "Googlebot."

Sharing an ASN with a crawler is not a verification signal. If your detection is "ASN looks Googley, good enough," half of GCP's customer base looks Googley. The published CIDR is narrower than the ASN on purpose. The same trap exists for Microsoft (AS8075 ≠ Bingbot) and AWS (AS16509 hosts plenty of bots that aren't Amazonbot).

A second, smaller surprise: roughly 6% of "Googlebot" hits passed a PTR suffix check but failed the forward-confirmation step. The PTRs were crawl-…-shaped hostnames that claimed .googlebot.com but came from reverse zones Google doesn't control, so the names never resolved back to the source IP. Suffix-only checks would have waved them through.

A working Nginx allowlist

The pattern I land on: a cron job pulls the verified-crawler CIDRs once a day, writes them to an Nginx geo map file, reloads Nginx. The hot path stays in C — no DNS, no API call per request — and the daily refresh keeps the list fresh.

Pull the lists each morning:

/etc/cron.daily/refresh-crawler-allowlist

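#!/bin/bash
set -euo pipefail

OUT=/etc/nginx/crawler-allowlist.conf
TMP=$(mktemp)
NEW=$(mktemp)
trap 'rm -f "$TMP" "$NEW"' EXIT

# Ask urlcap for every active CIDR, then keep only the operators we want to allow.
# Assumes column 1 is the CIDR and column 3 is the operator name ("Google", "Bing");
# check the CSV header and adjust the awk filter if the layout differs.
curl -fsS https://urlcap.com/api/v1/datasets/bot_cidrs.csv \
     -H "Authorization: Bearer $URLCAP_KEY" \
   | awk -F, 'NR>1 && ($3=="Google" || $3=="Bing") {print "    " $1 " 1;"}' \
   > "$TMP"

# An empty list would 403 every real crawler, so keep the old file instead.
[ -s "$TMP" ] || { echo "no CIDRs matched; allowlist left unchanged" >&2; exit 1; }

# Build the geo block in a scratch file, then swap it in.
{
    echo "geo \$is_allowed_crawler {"
    echo "    default 0;"
    cat "$TMP"
    echo "}"
} > "$NEW"

if [ -f "$OUT" ]; then cp -p "$OUT" "$OUT.bak"; fi
install -m 0644 "$NEW" "$OUT"

# Reload only if the new config parses; otherwise restore the previous list.
if nginx -t; then
    systemctl reload nginx
else
    if [ -f "$OUT.bak" ]; then mv "$OUT.bak" "$OUT"; fi
    exit 1
fi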

And in the server block:

/etc/nginx/conf.d/site.conf

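# geo matches $remote_addr; behind a CDN or load balancer, restore the client IP
# first (ngx_http_realip_module) or every request will look like the proxy.
include /etc/nginx/crawler-allowlist.conf;

map $http_user_agent $claims_to_be_bot {
    default                            0;
    "~*(Googlebot|bingbot|GPTBot|ClaudeBot|PerplexityBot|CCBot)" 1;
}

server {
    # ...

    # If the request claims to be a crawler but the IP isn't in the allow set, drop it.
    # Only Google and Bing CIDRs are in the set, so genuine AI crawlers are dropped too.
    set $verdict "";                          # initialise to avoid "uninitialized variable" warnings
    if ($claims_to_be_bot = 1) { set $verdict "claim"; }
    if ($is_allowed_crawler = 1) { set $verdict "${verdict}_ok"; }
    if ($verdict = "claim") { return 403; }   # claimed crawler, IP not in our allow set
}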

Two things this gets right that most "block bots" snippets get wrong: (1) it never treats the UA as identity; the UA is only the trigger for the IP check, not a substitute for it. (2) It lets unflagged traffic (no crawler UA, not in the allow set) flow through normally to whatever else handles humans: Cloudflare, rate limiting, your app's own auth. We're not building a WAF here. We're closing the specific hole where "I am Googlebot" buys preferential treatment.

Doing it without the local cron + map dance

If you don't want a CIDR file on disk, ask /api/v1/is_bot per request (cache the answer for ~24h client-side). The single-IP call:

curl

curl -s https://urlcap.com/api/v1/is_bot \
  -H "Authorization: Bearer $URLCAP_KEY" \
  -H "Content-Type: application/json" \
  -d '{"ip":"66.249.66.1"}'

Want the PTR included so you can audit the FCrDNS chain in the same call?

with reverse DNS attached

curl -s "https://urlcap.com/api/v1/is_bot?reverseDns=true" \
  -H "Authorization: Bearer $URLCAP_KEY" \
  -H "Content-Type: application/json" \
  -d '{"ip":"66.249.66.1"}'

The response includes reverseDns.names[] alongside the CIDR match. Cross-reference the hostname suffix yourself if you want belt-and-braces verification; for most use cases the CIDR match is already authoritative because the index is built from each operator's own published list.
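The same two calls from application code might look like the sketch below. It mirrors the curl examples exactly (POST, bearer token, JSON body with an `ip` key, optional `reverseDns=true` query parameter) and assumes nothing about the response beyond it being JSON, so the parsed document is returned untouched. The function names and the 5-second timeout are my own choices.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://urlcap.com/api/v1/is_bot"


def build_is_bot_request(ip, api_key, reverse_dns=False):
    """Build the POST request shown in the curl examples, without sending it."""
    url = API_URL
    if reverse_dns:
        url += "?" + urllib.parse.urlencode({"reverseDns": "true"})
    body = json.dumps({"ip": ip}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def is_bot(ip, api_key, reverse_dns=False, timeout=5.0):
    """Send the request and return the parsed JSON response as-is."""
    req = build_is_bot_request(ip, api_key, reverse_dns)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Keeping request construction separate from the network call makes it easy to put a cache in front of `is_bot` and to unit-test the request shape without hitting the API.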

The decision table: what to allow, what to drop

The categories field on every match is the practical signal for "what is this bot for" — more useful than the operator name when you're writing a policy. The codes you'll see:

Category	Examples	Typical policy
`SEARCH_INDEXING`	Googlebot, Bingbot, Yandex, DuckDuckBot	Allow
`SEARCH_SPECIALIZED`	Google AdsBot, Google Mobile-Friendly, image / news bots	Allow
`AI_TRAINING`	GPTBot, ClaudeBot, CCBot, Amazonbot, Applebot-Extended	Block (or paywall)
`AI_SEARCH_OR_ANSWERING`	OAI-SearchBot, PerplexityBot, DuckAssistBot	Allow if you want attribution / citations; block otherwise
`USER_INITIATED_FETCHING`	ChatGPT-User, Claude-User, Perplexity-User	Treat like a browser (it is one, on behalf of a real human)
`SEO_ANALYTICS`	AhrefsBot, SemrushBot, MJ12bot, MozBot	Block (no traffic upside, full content download)
`SOCIAL_PREVIEW`	Twitterbot, facebookexternalhit, LinkedInBot, Discordbot	Allow (one fetch per shared URL, generates traffic back)
`WEB_DATASET_ARCHIVING`	CCBot, Internet Archive, research crawlers	Block unless you actively want to be archived

Branching on category instead of operator name means you don't have to rewrite your policy every time a new AI lab launches a crawler. A new entrant categorised AI_TRAINING falls under the same rule the day it's added to the index.
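In code, branching on category reduces to a lookup table. The sketch below encodes the "Typical policy" column from the table above. The action names (`allow`, `block`, `browser`, `unknown`) are illustrative, and two choices are my assumptions rather than anything the API prescribes: when a bot carries several categories the strictest action wins, and an unrecognised category falls back to `unknown` so it gets whatever bar you set for unidentified clients.

```python
# Typical policy per category, from the decision table; action names are illustrative.
POLICY = {
    "SEARCH_INDEXING": "allow",
    "SEARCH_SPECIALIZED": "allow",
    "AI_TRAINING": "block",
    "AI_SEARCH_OR_ANSWERING": "allow",  # switch to "block" if you don't want citations
    "USER_INITIATED_FETCHING": "browser",
    "SEO_ANALYTICS": "block",
    "SOCIAL_PREVIEW": "allow",
    "WEB_DATASET_ARCHIVING": "block",
}

# Assumption: when categories disagree, the strictest action wins.
SEVERITY = {"allow": 0, "browser": 1, "unknown": 2, "block": 3}


def decide(categories, policy=POLICY):
    """Map a match's categories list to a single action."""
    actions = [policy.get(category, "unknown") for category in categories]
    if not actions:
        return "unknown"  # no categories: treat as an unidentified client
    return max(actions, key=SEVERITY.__getitem__)
```

A crawler listed under both AI_TRAINING and WEB_DATASET_ARCHIVING, as CCBot is in the table, resolves to `block` either way, and a newly added AI_TRAINING crawler inherits that rule without a policy change.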

When this approach is the wrong tool

A few honest caveats. This guide is about identifying named crawlers — bots that publish identities and ranges. It is not the right tool for:

Adversarial scraping that doesn't claim to be anyone. A scraper running on residential proxies with a real-looking browser UA and TLS fingerprint won't appear in any published list, because it's not a published crawler. You need behavioural detection (rate, request pattern, JS-challenge, TLS/JA4 fingerprint) for that — see /docs#ja4.
Edge mitigation at line rate. If you need to drop traffic in microseconds before it touches origin, that's Cloudflare / Fastly / Akamai territory. The pattern in this guide adds ~1ms of latency if cached and ~30–100ms uncached. Fine for an auth middleware; not fine for a CDN.
Geolocating users or detecting VPN/Tor. Different question, different vendor. IPinfo, MaxMind, and IP-Quality-Score do that well. /api/v1/is_bot only answers "is this IP a published crawler?"

Pricing & limits

Each /api/v1/is_bot call counts as one request unit. A batched call with up to 200 IPs still counts as a single request. A daily cron that pulls the full allowlist via /api/v1/datasets/bot_cidrs.csv counts as one request per fetch, so about 30 a month. For most production sites that's well inside the free 1,000/mo. See /pricing.

Detect bots from server logs — the offline, batch-classify-a-CSV version of this story.
API reference: /api/v1/is_bot — full parameter list, including reverseDns, historical, date.
API reference: /api/v1/reverse_dns — standalone PTR lookups with built-in TTL caching.
urlcap's own crawler identity and CIDRs — same publishing discipline we ask others to follow.
