Bot detection

Detect Googlebot, Bingbot, GPTBot and other AI crawlers from server logs

Updated 14 May 2026 · Related endpoint: /api/v1/is_bot

Try this in 60 seconds. 5 free requests with no account, then 1,000/month on the free tier. Get a free API key →

The problem: user-agent strings can't be trusted

An HTTP request claiming User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) tells you nothing reliable — anyone can set that header. Scrapers, security scanners, and grey-hat tools regularly spoof Googlebot, Bingbot, and the major AI crawlers to bypass rate limits and access controls.

The reliable signal is the source IP. Each well-known crawler publishes the CIDR ranges its production fleet operates from — Google's googlebot.json, Bing's bingbot.json, OpenAI's gptbot.json, Anthropic's ClaudeBot list, etc. A request from 66.249.66.0/27 is almost certainly Googlebot, regardless of what the user-agent string says.
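To make the idea concrete, here is a minimal sketch of the underlying range check using Python's standard ipaddress module. The function name in_published_ranges and the one-entry sample list are illustrative assumptions; a real published list contains many more ranges and changes over time.

```python
import ipaddress

def in_published_ranges(ip, cidrs):
    """Return True if `ip` falls inside any of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)
        # Only compare like with like: IPv4 against IPv4, IPv6 against IPv6.
        if addr.version == net.version and addr in net:
            return True
    return False

# Hypothetical one-entry list, using the range mentioned above.
GOOGLEBOT_SAMPLE = ["66.249.66.0/27"]

print(in_published_ranges("66.249.66.1", GOOGLEBOT_SAMPLE))  # True
print(in_published_ranges("192.0.2.7", GOOGLEBOT_SAMPLE))    # False
```

The check itself is trivial; the hard part is keeping the range lists complete and current.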

Maintaining those lists by hand is busywork; the lists change weekly and there are dozens of them. /api/v1/is_bot aggregates them, refreshes them daily, and lets you ask "does this IP belong to a known crawler?" with one HTTP call.

Who this is for

Engineers analysing old Nginx / Apache / CDN access logs after the fact — counting how much traffic was actually Googlebot vs. spoofed.
Teams building internal analytics dashboards that need a "bot vs. human vs. AI crawler" breakdown.
SRE / abuse teams who want a simple HTTP signal for backend logic ("if this IP is a known good bot, skip the rate limiter") without standing up a bot-management platform.

Step 1 — check a single IP

The smallest possible request checks a single IP. Both GET and POST are supported; the example below sends a POST with a JSON body (curl switches to POST when -d is given):

single IP · curl

curl -s https://urlcap.com/api/v1/is_bot \
  -H "Authorization: Bearer $URLCAP_KEY" \
  -H "Content-Type: application/json" \
  -d '{"ip":"66.249.66.1"}'

A successful response:

200 OK

{
  "version": "1",
  "requestId": "…",
  "data": {
    "ip": "66.249.66.1",
    "family": 4,
    "isBot": true,
    "matchCount": 1,
    "matches": [{
      "cidr": "66.249.66.0/27",
      "active": true,
      "botGroup":     { "id": 4, "description": "Common crawlers" },
      "searchEngine": { "id": 1, "description": "Google" },
      "categories":   [{ "code": "SEARCH_INDEXING", "confidence": "high" }]
    }]
  }
}

The fields you'll actually store in your analytics table are isBot, matches[0].searchEngine.description (the operator — "Google", "Bing", "OpenAI"…), matches[0].botGroup.description (the family — "Common crawlers", "AI crawlers"…), matches[0].categories[].code (intent — SEARCH_INDEXING, AI_TRAINING, SEO_TOOL…), and the matching cidr for audit.
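As a rough sketch of that mapping, the function below flattens one data object into a row for an analytics table. The function name to_row and the output column names are assumptions made for illustration; the input is the data object from the sample response above, and only the first match is used.

```python
def to_row(data):
    """Flatten one `data` object from /api/v1/is_bot into a flat dict."""
    matches = data.get("matches") or []
    first = matches[0] if matches else {}
    return {
        "ip": data["ip"],
        "is_bot": bool(data.get("isBot")),
        "operator": (first.get("searchEngine") or {}).get("description", ""),
        "bot_group": (first.get("botGroup") or {}).get("description", ""),
        "categories": [c["code"] for c in first.get("categories") or []],
        "cidr": first.get("cidr", ""),
    }

# The `data` object from the sample response above.
sample = {
    "ip": "66.249.66.1",
    "family": 4,
    "isBot": True,
    "matchCount": 1,
    "matches": [{
        "cidr": "66.249.66.0/27",
        "active": True,
        "botGroup": {"id": 4, "description": "Common crawlers"},
        "searchEngine": {"id": 1, "description": "Google"},
        "categories": [{"code": "SEARCH_INDEXING", "confidence": "high"}],
    }],
}

row = to_row(sample)
print(row["operator"], row["bot_group"], row["categories"], row["cidr"])
```

An IP with no matches produces empty strings and an empty category list, so the row shape stays the same for bots and non-bots.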

Step 2 — process a real Nginx access log

A typical Nginx access-log line looks like:

access.log

66.249.66.1 - - [14/May/2026:08:01:23 +0000] "GET /pricing HTTP/1.1" 200 18421 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.7   - - [14/May/2026:08:01:24 +0000] "GET /pricing HTTP/1.1" 200 18421 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
40.77.167.5 - - [14/May/2026:08:01:25 +0000] "GET /docs    HTTP/1.1" 200 88102 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

The second line claims Googlebot but the IP isn't in Google's published range — that's the kind of spoof you want to flag.
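One way to flag such lines is to compare what the user-agent claims with what the IP lookup reports. The sketch below assumes the Nginx "combined" log format shown above. The names LINE_RE, UA_TOKENS, claimed_operator and is_spoof are hypothetical, and the mapping from user-agent token to operator name is an assumption based on the operator descriptions mentioned earlier ("Google", "Bing", "OpenAI").

```python
import re

# Nginx "combined" format:
# ip - user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+)\s+\S+\s+\S+\s+\[(?P<time>[^\]]+)\]\s+"(?P<request>[^"]*)"\s+'
    r'(?P<status>\d{3})\s+(?P<bytes>\S+)\s+"(?P<referer>[^"]*)"\s+"(?P<ua>[^"]*)"'
)

# Assumed mapping from a user-agent token to the operator name the API reports.
UA_TOKENS = {"googlebot": "Google", "bingbot": "Bing", "gptbot": "OpenAI"}

def claimed_operator(user_agent):
    """Return the operator a user-agent string claims to be, or None."""
    ua = user_agent.lower()
    for token, operator in UA_TOKENS.items():
        if token in ua:
            return operator
    return None

def is_spoof(line, verified_operator):
    """True if the line claims a known crawler but the IP lookup disagrees.

    `verified_operator` is the operator description returned for the line's
    IP, or "" when the IP matched no known crawler.
    """
    m = LINE_RE.match(line)
    if not m:
        return False  # unparseable line: nothing to compare
    claimed = claimed_operator(m.group("ua"))
    return claimed is not None and claimed != verified_operator
```

For the three sample lines, is_spoof returns True only for the second one, provided the lookup returns "Google" for the first IP, no match for the second, and "Bing" for the third.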

Step 3 — batch up to 200 IPs per call

For log analysis, batching is essential. /api/v1/is_bot accepts up to 200 IPs per request via the ips field — one HTTP round-trip instead of 200.

batch · curl

curl -s https://urlcap.com/api/v1/is_bot \
  -H "Authorization: Bearer $URLCAP_KEY" \
  -H "Content-Type: application/json" \
  -d '{"ips":["66.249.66.1","192.0.2.7","40.77.167.5","8.8.8.8"]}'

Step 4 — a small Python pipeline

Extract unique IPs from a log, batch them in groups of 200, and emit a CSV with the bot classification:

classify-log.py

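#!/usr/bin/env python3
"""Classify the client IPs in an access log with /api/v1/is_bot; print CSV to stdout."""
import csv, ipaddress, json, os, sys, urllib.request

KEY   = os.environ["URLCAP_KEY"]
URL   = "https://urlcap.com/api/v1/is_bot"
BATCH = 200  # maximum number of IPs per request

def unique_ips(path):
    """Return the sorted, de-duplicated client IPs (first field of each log line)."""
    seen = set()
    with open(path, errors="replace") as f:
        for line in f:
            fields = line.split(None, 1)
            if not fields:
                continue  # blank line
            try:
                # ip_address() validates IPv4 and IPv6 and normalises the text form.
                seen.add(str(ipaddress.ip_address(fields[0])))
            except ValueError:
                pass  # first field is not an IP address (e.g. a hostname)
    return sorted(seen)

def classify(ips):
    """POST one batch of IPs and return the per-IP results."""
    req = urllib.request.Request(
        URL,
        data=json.dumps({"ips": ips}).encode(),  # a request body makes this a POST
        headers={"Authorization": f"Bearer {KEY}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as r:
        return json.load(r)["data"]["results"]

def main(path):
    ips = unique_ips(path)
    w = csv.writer(sys.stdout)
    w.writerow(["ip", "isBot", "operator", "group", "categories", "cidr"])
    for i in range(0, len(ips), BATCH):
        for row in classify(ips[i : i + BATCH]):
            # Use the first match only; rows with no match yield empty columns.
            m = (row.get("matches") or [{}])[0]
            w.writerow([
                row["ip"],
                row["isBot"],
                (m.get("searchEngine") or {}).get("description", ""),
                (m.get("botGroup") or {}).get("description", ""),
                ",".join(c["code"] for c in m.get("categories") or []),
                m.get("cidr", ""),
            ])

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: classify-log.py ACCESS_LOG")
    main(sys.argv[1])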

Run with python3 classify-log.py access.log > bots.csv.

Analysing old logs: `historical` and `date`

Bot operators rotate IP ranges. A CIDR that was Googlebot in 2024 might not be in their current list — but the log entry from 2024 was still a real Googlebot hit. Two flags help:

"historical": true — also match against retired CIDRs (any range that's ever appeared in the published list). Use when classifying logs older than a few weeks.
"date": "2024-09-01" — point-in-time lookup. Match against the CIDR set that was active on that date. The most precise option when each log line has its own timestamp.

point-in-time

curl -s https://urlcap.com/api/v1/is_bot \
  -H "Authorization: Bearer $URLCAP_KEY" -H "Content-Type: application/json" \
  -d '{"ip":"66.249.66.1","date":"2024-09-01"}'
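When each log line carries its own timestamp, one approach is to group IPs by log date and send one lookup per date. The sketch below only builds the request bodies. The helper names ips_by_date and payloads are hypothetical, and combining "ips" with "date" in a single request is an assumption, since the example above only shows "date" with a single "ip".

```python
from collections import defaultdict
from datetime import datetime

def ips_by_date(entries):
    """Group (ip, nginx_timestamp) pairs into {ISO date: sorted unique IPs}."""
    groups = defaultdict(set)
    for ip, ts in entries:
        # Nginx $time_local looks like "14/May/2026:08:01:23 +0000".
        # %b parses English month names under the default C locale.
        # The date is taken in the timestamp's own UTC offset.
        day = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z").date().isoformat()
        groups[day].add(ip)
    return {day: sorted(ips) for day, ips in sorted(groups.items())}

def payloads(entries, batch=200):
    """Yield one request body per (date, batch of up to `batch` IPs)."""
    for day, ips in ips_by_date(entries).items():
        for i in range(0, len(ips), batch):
            # Assumption: "ips" and "date" may be sent together.
            yield {"ips": ips[i:i + batch], "date": day}

entries = [
    ("66.249.66.1", "14/May/2026:08:01:23 +0000"),
    ("192.0.2.7", "14/May/2026:08:01:24 +0000"),
    ("66.249.66.1", "01/Sep/2024:23:59:59 +0000"),
]
for body in payloads(entries):
    print(body)
```

Grouping by day keeps the number of calls close to the plain batched case while still matching each IP against the ranges that were active when the request was logged.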

Production notes

Cache aggressively client-side. The bot index changes once a day, so an in-process LRU keyed on (ip, day) can eliminate the vast majority of repeat calls on a busy server.
Always batch. One call carries up to 200 IPs, so batching cuts the number of requests by up to 200× compared with per-IP calls, which is two orders of magnitude at the rate limits.
Treat 5xx as transient. Retry with exponential backoff and jitter; the daily index rebuild is the only window in which you might see a brief 5xx.
Don't trust this for live, hot-path mitigation. Bot identity is good for analytics, dashboards, and offline classification. If you need to block traffic at the edge in real time, see the comparison below.
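The caching and retry advice above could be sketched as follows. This is one possible shape, not a prescribed client: fetch is a hypothetical function you supply that performs the HTTP call for one IP and raises urllib.error.HTTPError on failure, and the names with_retries and make_cached_lookup, the attempt count and the delays are all illustrative assumptions.

```python
import functools
import random
import time
import urllib.error

def with_retries(fn, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on an HTTP 5xx, retry with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except urllib.error.HTTPError as exc:
            # 4xx means the request itself is wrong; retrying will not help.
            if exc.code < 500 or attempt == attempts - 1:
                raise
            # 0.5 s, 1 s, 2 s, ... each stretched by up to 100% random jitter.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))

def make_cached_lookup(fetch, maxsize=100_000):
    """Wrap fetch(ip) in an in-process LRU cache keyed on (ip, day)."""
    @functools.lru_cache(maxsize=maxsize)
    def lookup(ip, day):
        # `day` is only part of the cache key: when the date changes the
        # key changes too, so stale entries are never served again.
        return with_retries(lambda: fetch(ip))
    return lookup
```

Calling lookup("66.249.66.1", "2026-05-14") twice hits the network once; the same IP on the next day triggers a fresh call. Failed lookups are not cached, because lru_cache only stores return values, not exceptions.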

Security notes

IPs in your logs may be considered personal data under UK GDPR. /api/v1/is_bot processes the IP in memory and writes only the IP + endpoint + status to usage_events; the IP itself is subject to our retention policy (90 days at full detail).
If your logs contain end-user IPs, scrub or anonymise before sending if your data-protection policy requires it.

Pricing & limits

Each call counts as one request unit against your monthly quota, whether you send one IP or 200. A batch of 200 IPs once per hour (roughly 720 to 744 calls a month) fits comfortably inside the Free plan (1,000 req/mo); once per minute is roughly 43,000 to 45,000 calls a month, which needs the Developer plan. High-volume continuous log-analytics workloads typically land on Developer ($29/mo, 50,000 req/mo) or Startup ($99/mo, 250,000 req/mo). See /pricing.
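Because a call costs one unit regardless of how many IPs it carries, the quota you need depends on the number of batches, not the number of IPs. The arithmetic can be sketched as below; the function names calls_needed and smallest_plan are illustrative, and the quotas are the ones quoted above.

```python
import math

# Monthly request quotas quoted above.
PLANS = [("Free", 1_000), ("Developer", 50_000), ("Startup", 250_000)]

def calls_needed(unique_ips, batch=200):
    """Number of API calls (request units) needed to classify `unique_ips` addresses."""
    return math.ceil(unique_ips / batch)

def smallest_plan(calls_per_month):
    """Return the first listed plan whose quota covers the call count, else None."""
    for name, quota in PLANS:
        if calls_per_month <= quota:
            return name
    return None

print(calls_needed(150_000))                 # 750
print(smallest_plan(calls_needed(150_000)))  # Free
```

In other words, a one-off job over 150,000 unique IPs takes 750 calls when fully batched, which is why de-duplicating IPs before sending matters more than the raw size of the log.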

When urlcap is simpler than a bot-management platform

Bot detection and "bot management" sound similar but solve different problems. urlcap isn't trying to replace edge-protection platforms; it's a different shape of tool. Roughly:

Use case	Cloudflare Bot Management	IPinfo / IP Trust	urlcap is_bot
Block malicious bots at the edge in real time	Excellent	No	No
Classify a CSV / batch of IPs after the fact	Limited	Partial (intelligence)	Strong fit
Check if an IP is a known crawler (Googlebot, GPTBot…)	Indirect	Partial	Strong fit
Build internal scripts / dashboards from logs	Possible but heavy	Good for IP intelligence	Strong fit
Geolocation, ASN, VPN/proxy/Tor detection	Partial	Strong fit	No
Analyse logs by historical date	Not the main use case	Depends on provider	Strong fit (`date` + `historical`)
Site must be behind your CDN / proxy	Yes	No	No

Use Cloudflare when your traffic flows through Cloudflare and you want edge mitigation. Use IPinfo or IP Trust when you need broad IP intelligence — geolocation, ASN, VPN/proxy/Tor flags, hosting metadata. Use urlcap when your problem is narrower: "I have a list of IPs from a log file, tell me which ones are known crawlers".

Create a free API key and run this in 60 seconds.

No card. 1,000 requests / month on the free tier.