urlcap

Allow only real Googlebot and Bingbot — and stop AI crawlers spoofing them

Updated · Related endpoint: /api/v1/is_bot

Try this in 60 seconds. 5 free requests with no account, then 1,000/month on the free tier.

The afternoon "Googlebot" wasn't Googlebot

A Tuesday in March. Grafana was showing a slow climb on origin CPU and the cache-miss panel was lit up orange. I grepped the Nginx log for the noisiest path:

shell
$ awk '$7 == "/products"' access.log | wc -l
4218
$ awk '$7 == "/products"' access.log | grep -ci googlebot
4011

Four thousand "Googlebot" hits on a single endpoint in one hour. We don't have four thousand products. Google does not crawl a single URL four thousand times an hour. Whatever was hitting /products was using User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) and was something else entirely.

The rest of this guide is the playbook I wish I'd had open in another tab. It's about verifying that a request claiming to be Googlebot, Bingbot, GPTBot or ClaudeBot is the real thing, and about the production pattern most teams end up wanting: allow Googlebot and Bingbot, gate everything else, block the AI training crawlers entirely.

The User-Agent string is meaningless

It bears repeating because almost every "block bots" tutorial on the web is still telling people to regex the UA header. You can claim to be Googlebot from your laptop, right now:

curl
$ curl -s -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
       https://example.com/expensive-endpoint

The header is whatever the client puts in it. It is a hint, not an identity. Any allowlist that uses it as the primary signal is a free pass — and worse, it's a free pass that scrapers know about and target deliberately, because it bypasses naïve "skip the rate limiter for known good bots" branches.

The textbook fix: forward-confirmed reverse DNS

Google's own documentation on verifying Googlebot lays out the canonical method: forward-confirmed reverse DNS (FCrDNS). Bing documents the same procedure for Bingbot. The steps are:

  1. Take the source IP from the request.
  2. Do a reverse DNS (PTR) lookup. The hostname must end in .googlebot.com or .google.com for Google; .search.msn.com for Bing.
  3. Forward-resolve that hostname back to an IP. The resolved IP must equal the original source IP.

If any step fails or the suffix is wrong, the request is not from the bot it claims to be. Worked example:

dig — a real Googlebot
$ dig +short -x 66.249.66.1
crawl-66-249-66-1.googlebot.com.

$ dig +short crawl-66-249-66-1.googlebot.com
66.249.66.1

And an impostor — same UA, different IP:

dig — an impostor
$ dig +short -x 192.0.2.7
;; (NXDOMAIN, or a PTR that does NOT end in googlebot.com / google.com)
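
The three steps above can be sketched in Python. This is a minimal sketch rather than a hardened implementation: the names `verify_crawler`, `has_allowed_suffix` and `ALLOWED_SUFFIXES` are hypothetical, the suffixes are the ones listed in step 2, and the two resolver functions are injectable so the logic can be exercised without touching DNS.

```python
import ipaddress
import socket

# Suffixes from step 2 of the procedure; the table name is illustrative.
ALLOWED_SUFFIXES = {
    "google": (".googlebot.com", ".google.com"),
    "bing": (".search.msn.com",),
}


def has_allowed_suffix(hostname, operator):
    """True if the PTR hostname ends in one of the operator's suffixes."""
    host = hostname.rstrip(".").lower()
    return host.endswith(ALLOWED_SUFFIXES[operator])


def reverse_lookup(ip):
    """PTR lookup via the system resolver; None when no PTR is found."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are both OSError
        return None


def forward_lookup(hostname):
    """Resolve a hostname to the set of address strings it points at."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return set()
    return {info[4][0] for info in infos}


def verify_crawler(ip, operator, reverse=reverse_lookup, forward=forward_lookup):
    """FCrDNS: PTR lookup, suffix check, then forward confirmation."""
    hostname = reverse(ip)
    if not hostname or not has_allowed_suffix(hostname, operator):
        return False
    source = ipaddress.ip_address(ip)
    resolved = {ipaddress.ip_address(a) for a in forward(hostname.rstrip("."))}
    return source in resolved
```

Calling `verify_crawler("66.249.66.1", "google")` performs live lookups with the system resolver. A production version would likely want explicit timeouts and a cache in front of both lookups.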

Where FCrDNS quietly breaks in production

This is the part the textbooks leave out. I have hit every one of these:

The other approach: published CIDR ranges

The major crawlers also publish their IP ranges as machine-readable JSON. This is what you want if you're classifying logs offline or building an allowlist into a config file — no DNS dependency, deterministic, cacheable on disk.

The asymmetry across publishers is the annoying part:

| Crawler | Published source | Refresh cadence |
| --- | --- | --- |
| Googlebot | developers.google.com/search/apis/ipranges/googlebot.json | Roughly weekly |
| Bingbot | www.bing.com/toolbox/bingbot.json | Irregular; lag of a few days is normal |
| GPTBot (training) | openai.com/gptbot.json | Days-to-weeks |
| ChatGPT-User (user-initiated fetch) | openai.com/chatgpt-user.json | Days-to-weeks |
| OAI-SearchBot | openai.com/searchbot.json | Days-to-weeks |
| ClaudeBot / claude-web / claude-user | Anthropic support docs (JSON list) | Irregular |
| PerplexityBot | Perplexity docs (UA + small CIDR set) | Often stale |
| Applebot | FCrDNS only: .applebot.apple.com | No published JSON |
| DuckDuckBot | Plain-text IP list on duckduckgo.com/duckduckbot | Rarely changes |
| Common Crawl (CCBot) | No published IP list; UA only | n/a |

Maintaining ten of these by hand, fetching them on different cadences, normalising the schemas, dealing with the ones that just publish a paragraph of prose on a support page — that's the job /api/v1/is_bot exists to absorb. One HTTP call, every published list aggregated and refreshed daily, plus a record of retired ranges for analysing older logs (see the log analysis guide's historical and date flags).
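
For a single publisher, the offline check is small. The sketch below assumes the `prefixes` layout that Google's googlebot.json uses (objects carrying either an `ipv4Prefix` or an `ipv6Prefix` key); other publishers may use a different schema. The embedded sample is a trimmed placeholder rather than the live list, and the function names are hypothetical.

```python
import ipaddress
import json

# Illustrative sample in the googlebot.json layout.
# These prefixes are placeholders for the example, not the live list.
SAMPLE = """
{
  "creationTime": "2024-01-01T00:00:00.000000",
  "prefixes": [
    {"ipv4Prefix": "66.249.66.0/27"},
    {"ipv6Prefix": "2001:4860:4801:10::/64"}
  ]
}
"""


def load_networks(document):
    """Parse a prefixes document into a list of ip_network objects."""
    networks = []
    for entry in json.loads(document)["prefixes"]:
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def in_published_ranges(ip, networks):
    """True if the address falls inside any published prefix."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)
```

In real use the document would be fetched on a schedule and cached on disk, so that classifying a log line never depends on the network.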

Why most teams want "allow Google and Bing, block the AI crawlers"

This is the rule I end up writing for every site I've ever consulted on. The reasoning is mundane:

A subtlety that matters: not all "AI bots" are training crawlers. ChatGPT-User and Claude-User are user-initiated fetches: somebody pasted a URL into a chat and the assistant fetched it on their behalf. That is closer to a browser visit than to a scrape, and treating it like a training crawler will frustrate your real users. /api/v1/is_bot's categories field distinguishes these (see "The decision table: what to allow, what to drop" below). A real production allowlist usually treats USER_INITIATED_FETCHING the way it treats a browser, not the way it treats AI_TRAINING.

The finding that surprised me

Back to the Tuesday-afternoon log. I expected the spoofers to live in the usual places — DigitalOcean, Hetzner, OVH, a fresh VPS pool. Most of them did. The surprise was a long tail of "Googlebot" hits coming from inside Google's own ASN, AS15169, but not from the googlebot.json ranges. Those weren't Google's crawler. They were Google Cloud customers running scrapers from GCE instances and counting on operators to confuse "AS15169" with "Googlebot."

Sharing an ASN with a crawler is not a verification signal. If your detection is "ASN looks Googley, good enough," half of GCP's customer base looks Googley. The published CIDR is narrower than the ASN on purpose. The same trap exists for Microsoft (AS8075 ≠ Bingbot) and AWS (AS16509 hosts plenty of bots that aren't Amazonbot).
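
When triaging logs it can help to make the "inside the ASN but outside the published range" bucket explicit. The sketch below uses hypothetical prefixes from the documentation address space: a broad block stands in for what an ASN announces and a narrow block stands in for the crawler's published list. Neither is a real Google, Microsoft or AWS range, and `classify` is an illustrative name.

```python
import ipaddress

# Hypothetical prefixes, chosen only for illustration:
# the broad block plays the role of "everything the ASN announces",
# the narrow block plays the role of "the crawler's published CIDR list".
ASN_PREFIXES = [ipaddress.ip_network("198.51.100.0/24")]
CRAWLER_PREFIXES = [ipaddress.ip_network("198.51.100.0/28")]


def classify(ip):
    """Three-way split: published crawler range, same ASN only, or unrelated."""
    address = ipaddress.ip_address(ip)
    if any(address in net for net in CRAWLER_PREFIXES):
        return "published-crawler-range"
    if any(address in net for net in ASN_PREFIXES):
        return "same-asn-not-crawler"
    return "unrelated"
```

Only the first bucket counts as verified. The second bucket is where cloud-hosted scrapers with a crawler UA end up.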

A second, smaller surprise: roughly 6% of "Googlebot" hits passed the PTR check but failed the forward-confirmation step — the PTR pointed at crawl-…-shaped hostnames in zones that weren't .googlebot.com at all. Suffix-only checks would have waved them through.

A working Nginx allowlist

The pattern I land on: a cron job pulls the verified-crawler CIDRs once a day, writes them to an Nginx geo map file, reloads Nginx. The hot path stays in C — no DNS, no API call per request — and the daily refresh keeps the list fresh.

Pull the lists each morning:

/etc/cron.daily/refresh-crawler-allowlist
#!/bin/bash
set -euo pipefail

# cron.daily runs with a minimal environment; fail early if the key is missing.
: "${URLCAP_KEY:?URLCAP_KEY must be set}"

OUT=/etc/nginx/crawler-allowlist.conf
TMP=$(mktemp)
trap 'rm -f "$TMP"' EXIT

# Fetch every active CIDR from urlcap and keep the rows whose third column is
# Google or Bing, emitting one "CIDR 1;" geo entry per row.
{
  echo "geo \$is_allowed_crawler {"
  echo "    default 0;"
  curl -fsS https://urlcap.com/api/v1/datasets/bot_cidrs.csv \
       -H "Authorization: Bearer $URLCAP_KEY" \
     | awk -F, 'NR>1 && ($3=="Google" || $3=="Bing") {print "    " $1 " 1;"}'
  echo "}"
} > "$TMP"

# An empty result would block every real crawler, so keep the previous file instead.
grep -q ' 1;$' "$TMP" || { echo "no CIDRs fetched; allowlist unchanged" >&2; exit 1; }

install -m 0644 "$TMP" "$OUT"
nginx -t && systemctl reload nginx

And in the server block:

/etc/nginx/conf.d/site.conf
include /etc/nginx/crawler-allowlist.conf;

map $http_user_agent $claims_to_be_bot {
    default                            0;
    "~*(Googlebot|bingbot|GPTBot|ClaudeBot|PerplexityBot|CCBot)" 1;
}

server {
    # ...

    # If the request claims to be a crawler but the IP isn't in the allow set, drop it.
    set $verdict "";   # initialise so Nginx doesn't log "uninitialized variable" warnings
    if ($claims_to_be_bot = 1) { set $verdict "claim"; }
    if ($is_allowed_crawler = 1) { set $verdict "${verdict}_ok"; }
    if ($verdict = "claim") { return 403; }   # claimed crawler, IP not in our allow set
}

Two things this gets right that most "block bots" snippets get wrong: (1) it never treats the UA as identity: the UA only triggers the IP check and never substitutes for it; and (2) it lets unflagged traffic (no crawler UA, not in the allow set) flow through normally to whatever else handles humans: Cloudflare, rate limiting, your app's own auth. Because the allow set holds only Google and Bing ranges, any request with a GPTBot, ClaudeBot, PerplexityBot or CCBot UA is rejected whether or not it is genuine, which is the "block the AI crawlers" half of the policy. We're not building a WAF here. We're closing the specific hole where "I am Googlebot" buys preferential treatment.
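
The same rule can be mirrored in a few lines of Python, which is handy for unit-testing expectations before touching the Nginx config. This is only a model of the logic, not something Nginx runs: `decide` is a hypothetical name and `ALLOW_SET` holds a placeholder prefix standing in for the generated geo file.

```python
import ipaddress
import re

# Same UA pattern as the Nginx map, matched case-insensitively.
CRAWLER_UA = re.compile(
    r"Googlebot|bingbot|GPTBot|ClaudeBot|PerplexityBot|CCBot", re.IGNORECASE
)

# Placeholder standing in for the contents of crawler-allowlist.conf.
ALLOW_SET = [ipaddress.ip_network("66.249.66.0/27")]


def decide(user_agent, remote_addr):
    """Return 403 for an unverified crawler claim, otherwise None (pass through)."""
    claims_to_be_bot = bool(CRAWLER_UA.search(user_agent))
    address = ipaddress.ip_address(remote_addr)
    is_allowed_crawler = any(address in net for net in ALLOW_SET)
    if claims_to_be_bot and not is_allowed_crawler:
        return 403
    return None
```

The four combinations of "claims to be a crawler" and "IP in allow set" give exactly one rejecting case, which is the property worth asserting in tests.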

Doing it without the local cron + map dance

If you don't want a CIDR file on disk, ask /api/v1/is_bot per request (cache the answer for ~24h client-side). The single-IP call:

curl
curl -s https://urlcap.com/api/v1/is_bot \
  -H "Authorization: Bearer $URLCAP_KEY" \
  -H "Content-Type: application/json" \
  -d '{"ip":"66.249.66.1"}'

Want the PTR included so you can audit the FCrDNS chain in the same call?

with reverse DNS attached
curl -s "https://urlcap.com/api/v1/is_bot?reverseDns=true" \
  -H "Authorization: Bearer $URLCAP_KEY" \
  -H "Content-Type: application/json" \
  -d '{"ip":"66.249.66.1"}'

The response includes reverseDns.names[] alongside the CIDR match. Cross-reference the hostname suffix yourself if you want belt-and-braces verification; for most use cases the CIDR match is already authoritative because the index is built from each operator's own published list.
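
A per-request integration mostly comes down to the cache. The sketch below sends the same POST as the curl call above and keeps each IP's answer for 24 hours. It returns the decoded JSON untouched and makes no assumption about the response fields; `CachedLookup` and `fetch_is_bot` are hypothetical names, and the in-memory dict is a stand-in for whatever cache you already run.

```python
import json
import time
import urllib.request

API_URL = "https://urlcap.com/api/v1/is_bot"
CACHE_TTL_SECONDS = 24 * 60 * 60  # the ~24h suggested above


def fetch_is_bot(ip, api_key):
    """POST {"ip": ...} to is_bot and return the decoded JSON body as-is."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps({"ip": ip}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


class CachedLookup:
    """Remember each IP's answer for a fixed TTL to avoid repeat calls."""

    def __init__(self, fetch, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # ip -> (expires_at, answer)

    def lookup(self, ip):
        now = self._clock()
        cached = self._entries.get(ip)
        if cached and cached[0] > now:
            return cached[1]
        answer = self._fetch(ip)
        self._entries[ip] = (now + self._ttl, answer)
        return answer
```

Wiring it up is a one-liner such as `CachedLookup(lambda ip: fetch_is_bot(ip, key))`, with `key` read from your own secret store. The fetch function is injected so the cache behaviour can be tested without network access.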

The decision table: what to allow, what to drop

The categories field on every match is the practical signal for "what is this bot for" — more useful than the operator name when you're writing a policy. The codes you'll see:

| Category | Examples | Typical policy |
| --- | --- | --- |
| SEARCH_INDEXING | Googlebot, Bingbot, Yandex, DuckDuckBot | Allow |
| SEARCH_SPECIALIZED | Google AdsBot, Google Mobile-Friendly, image / news bots | Allow |
| AI_TRAINING | GPTBot, ClaudeBot, CCBot, Amazonbot, Applebot-Extended | Block (or paywall) |
| AI_SEARCH_OR_ANSWERING | OAI-SearchBot, PerplexityBot, DuckAssistBot | Allow if you want attribution / citations; block otherwise |
| USER_INITIATED_FETCHING | ChatGPT-User, Claude-User, Perplexity-User | Treat like a browser (it is one, on behalf of a real human) |
| SEO_ANALYTICS | AhrefsBot, SemrushBot, MJ12bot, MozBot | Block (no traffic upside, full content download) |
| SOCIAL_PREVIEW | Twitterbot, facebookexternalhit, LinkedInBot, Discordbot | Allow (one fetch per shared URL, generates traffic back) |
| WEB_DATASET_ARCHIVING | CCBot, Internet Archive, research crawlers | Block unless you actively want to be archived |

Branching on category instead of operator name means you don't have to rewrite your policy every time a new AI lab launches a crawler. A new entrant categorised AI_TRAINING falls under the same rule the day it's added to the index.
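
A category-driven policy can be as small as a lookup table. The sketch below encodes the "Typical policy" column and assumes the categories arrive as a list of code strings, which is a guess at the response shape. Two further choices are assumptions, not anything the API prescribes: a bot with several categories is blocked if any of them is blocked, and an unrecognised category falls back to "review".

```python
# Policy per category code, following the "Typical policy" column.
POLICY = {
    "SEARCH_INDEXING": "allow",
    "SEARCH_SPECIALIZED": "allow",
    "AI_TRAINING": "block",
    "AI_SEARCH_OR_ANSWERING": "allow",  # flip to "block" if you don't want citations
    "USER_INITIATED_FETCHING": "allow",  # treated like a browser
    "SEO_ANALYTICS": "block",
    "SOCIAL_PREVIEW": "allow",
    "WEB_DATASET_ARCHIVING": "block",
}


def decide(categories, default="review"):
    """Map a list of category codes to one action; block wins over allow."""
    actions = [POLICY.get(category, default) for category in categories]
    if "block" in actions:
        return "block"
    if not actions or default in actions:
        return default
    return "allow"
```

With this table, a crawler tagged both AI_TRAINING and WEB_DATASET_ARCHIVING resolves to "block" without any operator-specific rule.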

When this approach is the wrong tool

A few honest caveats. This guide is about identifying named crawlers — bots that publish identities and ranges. It is not the right tool for:

Pricing & limits

Each /api/v1/is_bot call counts as one request unit. A call can batch up to 200 IPs and still counts as a single request. A daily cron that pulls the full allowlist via /api/v1/datasets/bot_cidrs.csv counts as one request per fetch. For most production sites that's well inside the free 1,000 requests per month. See /pricing.
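
The batching limit makes the arithmetic easy to plan: 450 distinct IPs fit in three calls, not 450. A small helper for deduplicating and chunking might look like the sketch below. The function names are hypothetical, and the sketch deliberately stops short of building the batch request body, whose format is not described here.

```python
import math

MAX_IPS_PER_CALL = 200  # batch limit stated above


def batches(ips, size=MAX_IPS_PER_CALL):
    """Deduplicate while preserving order, then yield chunks of at most `size`."""
    unique = list(dict.fromkeys(ips))
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


def request_units(ips, size=MAX_IPS_PER_CALL):
    """Number of calls needed when every call carries a full batch."""
    return math.ceil(len(set(ips)) / size)
```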
