Bot detection
Detect Googlebot, Bingbot, GPTBot and other AI crawlers from server logs
Related endpoint: /api/v1/is_bot
The problem: user-agent strings can't be trusted
An HTTP request claiming User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
tells you nothing reliable — anyone can set that header. Scrapers, security scanners, and grey-hat tools regularly
spoof Googlebot, Bingbot, and the major AI crawlers to bypass rate limits and access controls.
The reliable signal is the source IP. Each well-known crawler publishes the CIDR ranges its
production fleet operates from — Google's googlebot.json, Bing's bingbot.json,
OpenAI's gptbot.json, Anthropic's ClaudeBot list, etc. A request from
66.249.66.0/27 is almost certainly Googlebot, regardless of what the user-agent string says.
Maintaining those lists by hand is busywork; the lists change weekly and there are dozens of them.
/api/v1/is_bot aggregates them, refreshes them daily, and lets you ask "does this IP belong to a
known crawler?" with one HTTP call.
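The underlying check is plain CIDR membership. Below is a minimal sketch using Python's standard ipaddress module. The SAMPLE_RANGES table and the match_crawler helper are illustrative names, not the service's data or implementation, and the single range is the one quoted in this guide:

```python
import ipaddress

# Illustrative sample only: this one range is the example used in this guide.
# Real published lists are much longer and change over time.
SAMPLE_RANGES = {
    "Google": ["66.249.66.0/27"],
}

def match_crawler(ip, ranges=SAMPLE_RANGES):
    """Return (operator, cidr) for the first range containing ip, else None."""
    addr = ipaddress.ip_address(ip)
    for operator, cidrs in ranges.items():
        for cidr in cidrs:
            net = ipaddress.ip_network(cidr)
            # Only compare addresses and networks of the same family (IPv4 vs IPv6).
            if addr.version == net.version and addr in net:
                return operator, cidr
    return None

print(match_crawler("66.249.66.1"))  # inside 66.249.66.0/27
print(match_crawler("192.0.2.7"))    # documentation range, no match
```

Doing this locally means keeping every operator's list current yourself, which is the part the endpoint takes off your hands.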
Who this is for
- Engineers analysing old Nginx / Apache / CDN access logs after the fact — counting how much traffic was actually Googlebot vs. spoofed.
- Teams building internal analytics dashboards that need a "bot vs. human vs. AI crawler" breakdown.
- SRE / abuse teams who want a simple HTTP signal for backend logic ("if this IP is a known good bot, skip the rate limiter") without standing up a bot-management platform.
Step 1 — check a single IP
The smallest possible request is a single IP. Both GET and POST are supported:
curl -s https://urlcap.com/api/v1/is_bot \
-H "Authorization: Bearer $URLCAP_KEY" \
-H "Content-Type: application/json" \
-d '{"ip":"66.249.66.1"}'
A successful response:
{
  "version": "1",
  "requestId": "…",
  "data": {
    "ip": "66.249.66.1",
    "family": 4,
    "isBot": true,
    "matchCount": 1,
    "matches": [{
      "cidr": "66.249.66.0/27",
      "active": true,
      "botGroup": { "id": 4, "description": "Common crawlers" },
      "searchEngine": { "id": 1, "description": "Google" },
      "categories": [{ "code": "SEARCH_INDEXING", "confidence": "high" }]
    }]
  }
}
The fields you'll actually store in your analytics table are isBot,
matches[0].searchEngine.description (the operator — "Google", "Bing", "OpenAI"…),
matches[0].botGroup.description (the family — "Common crawlers", "AI crawlers"…),
matches[0].categories[].code (intent — SEARCH_INDEXING, AI_TRAINING,
SEO_TOOL…), and the matching cidr for audit.
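As a sketch of that mapping, the function below flattens one result object into a flat row. The to_row name and the column names are assumptions chosen for illustration; the input shape is the data object from the sample response above:

```python
def to_row(data):
    """Flatten one is_bot result into the columns worth storing."""
    matches = data.get("matches") or []
    first = matches[0] if matches else {}  # an unmatched IP has no matches
    return {
        "ip": data.get("ip", ""),
        "is_bot": bool(data.get("isBot")),
        "operator": (first.get("searchEngine") or {}).get("description", ""),
        "group": (first.get("botGroup") or {}).get("description", ""),
        "categories": ",".join(c["code"] for c in first.get("categories") or []),
        "cidr": first.get("cidr", ""),
    }

# The "data" object from the sample response in this guide.
sample = {
    "ip": "66.249.66.1",
    "family": 4,
    "isBot": True,
    "matchCount": 1,
    "matches": [{
        "cidr": "66.249.66.0/27",
        "active": True,
        "botGroup": {"id": 4, "description": "Common crawlers"},
        "searchEngine": {"id": 1, "description": "Google"},
        "categories": [{"code": "SEARCH_INDEXING", "confidence": "high"}],
    }],
}

print(to_row(sample))
```

Only the first match is kept here. If an IP can match several ranges and you need all of them, store one row per match instead.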
Step 2 — process a real Nginx access log
Typical Nginx access-log lines look like:
66.249.66.1 - - [14/May/2026:08:01:23 +0000] "GET /pricing HTTP/1.1" 200 18421 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.7 - - [14/May/2026:08:01:24 +0000] "GET /pricing HTTP/1.1" 200 18421 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
40.77.167.5 - - [14/May/2026:08:01:25 +0000] "GET /docs HTTP/1.1" 200 88102 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
The second line claims Googlebot but the IP isn't in Google's published range — that's the kind of spoof you want to flag.
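One way to flag that mismatch is to compare what the user-agent claims with what the IP lookup confirmed. This is a rough sketch, assuming the combined log format shown above. The UA_CLAIMS table, the function names, and the verified mapping (IP to confirmed operator, built from is_bot results) are hypothetical:

```python
import re

# Combined log format: ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

# Hypothetical mapping from a user-agent token to the operator name
# the lookup would report for a genuine hit.
UA_CLAIMS = {"googlebot": "Google", "bingbot": "Bing", "gptbot": "OpenAI"}

def claimed_operator(user_agent):
    """Return the operator a user-agent string claims to be, or None."""
    ua = user_agent.lower()
    for token, operator in UA_CLAIMS.items():
        if token in ua:
            return operator
    return None

def is_spoof(line, verified):
    """True if the line claims a known crawler that the IP lookup did not confirm.

    verified maps ip -> operator name confirmed by IP lookup (absent if none).
    """
    m = LOG_RE.match(line)
    if not m:
        return False  # unparseable line: nothing to compare
    claim = claimed_operator(m["ua"])
    return claim is not None and verified.get(m["ip"]) != claim
```

Lines with an ordinary browser user-agent are never flagged; only a crawler claim that the IP lookup does not back up counts as a spoof.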
Step 3 — batch up to 200 IPs per call
For log analysis, batching is essential. /api/v1/is_bot accepts up to 200 IPs per
request via the ips field — one HTTP round-trip instead of 200.
curl -s https://urlcap.com/api/v1/is_bot \
-H "Authorization: Bearer $URLCAP_KEY" \
-H "Content-Type: application/json" \
-d '{"ips":["66.249.66.1","192.0.2.7","40.77.167.5","8.8.8.8"]}'
Step 4 — a small Python pipeline
Extract unique IPs from a log, batch them in groups of 200, and emit a CSV with the bot classification:
#!/usr/bin/env python3
"""Classify the unique client IPs in an access log with /api/v1/is_bot."""
import csv, ipaddress, json, os, sys, urllib.request

KEY = os.environ["URLCAP_KEY"]
URL = "https://urlcap.com/api/v1/is_bot"
BATCH = 200  # maximum number of IPs per request

def unique_ips(path):
    """Return the sorted unique IPv4/IPv6 addresses found in the first log field."""
    seen = set()
    with open(path, errors="replace") as f:
        for line in f:
            fields = line.split(None, 1)
            if not fields:
                continue  # blank line
            try:
                seen.add(str(ipaddress.ip_address(fields[0])))
            except ValueError:
                pass  # first field is not an IP address (e.g. a hostname)
    return sorted(seen)

def classify(ips):
    """POST one batch of IPs and return the per-IP results."""
    req = urllib.request.Request(
        URL,
        data=json.dumps({"ips": ips}).encode(),
        headers={"Authorization": f"Bearer {KEY}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as r:
        return json.load(r)["data"]["results"]

def main(path):
    ips = unique_ips(path)
    w = csv.writer(sys.stdout)
    w.writerow(["ip", "isBot", "operator", "group", "categories", "cidr"])
    for i in range(0, len(ips), BATCH):
        for row in classify(ips[i : i + BATCH]):
            m = (row.get("matches") or [{}])[0]  # first match, or empty if none
            w.writerow([
                row["ip"],
                row["isBot"],
                (m.get("searchEngine") or {}).get("description", ""),
                (m.get("botGroup") or {}).get("description", ""),
                ",".join(c["code"] for c in m.get("categories") or []),
                m.get("cidr", ""),
            ])

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: classify-log.py ACCESS_LOG")
    main(sys.argv[1])
Run with python3 classify-log.py access.log > bots.csv.
Analysing old logs: historical and date
Bot operators rotate IP ranges. A CIDR that was Googlebot in 2024 might not be in their current list, but the log entry from 2024 was still a real Googlebot hit. Two flags help:
- "historical": true. Also matches retired CIDRs (any range that has ever appeared in the published list). Use it when classifying logs older than a few weeks.
- "date": "2024-09-01". Point-in-time lookup: matches against the CIDR set that was active on that date. The most precise option when each log line has its own timestamp.
curl -s https://urlcap.com/api/v1/is_bot \
-H "Authorization: Bearer $URLCAP_KEY" -H "Content-Type: application/json" \
-d '{"ip":"66.249.66.1","date":"2024-09-01"}'
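When each log line carries its own timestamp, one approach is to group IPs by request date and send one body per date. The sketch below assumes that ips and date can be combined in a single request; this guide only shows date with a single ip, so check the API reference first. The function names are illustrative:

```python
from collections import defaultdict
from datetime import datetime

def log_date(line):
    """Return the request date (YYYY-MM-DD) from a combined-format log line, or None."""
    start, end = line.find("["), line.find("]")
    if start == -1 or end == -1:
        return None
    try:
        # Nginx $time_local, e.g. 14/May/2026:08:01:23 +0000 (English month names).
        stamp = datetime.strptime(line[start + 1:end], "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    return stamp.date().isoformat()

def payloads_by_date(lines, batch=200):
    """Yield one request body per date and chunk of up to `batch` unique IPs."""
    ips_by_day = defaultdict(set)
    for line in lines:
        day = log_date(line)
        fields = line.split()
        if day and fields:
            ips_by_day[day].add(fields[0])  # first field is the client IP
    for day in sorted(ips_by_day):
        ips = sorted(ips_by_day[day])
        for i in range(0, len(ips), batch):
            # Assumption: "ips" and "date" are accepted together in one request.
            yield {"ips": ips[i:i + batch], "date": day}
```

The date is taken in the log's own UTC offset, so a server logging in local time groups by local day.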
Production notes
- Cache aggressively client-side. The bot index changes once a day; an in-process LRU keyed on (ip, day) will eliminate 99% of repeat calls on a busy server.
- Always batch. 200 IPs per call is the sweet spot: under the same rate limits, batching classifies roughly two orders of magnitude more IPs than per-IP calls.
- Treat 5xx as transient. Retry with exponential backoff and jitter; the index rebuild happens once a day and is the only window when you might see a brief 5xx.
- Don't trust this for live, hot-path mitigation. Bot identity is good for analytics, dashboards, and offline classification. If you need to block traffic at the edge in real time, see the comparison below.
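The caching and retry advice above could be sketched roughly as follows. DayCache, TransientError, and with_retries are hypothetical names. The sketch assumes your HTTP wrapper raises TransientError when it sees a 5xx status, and the delay parameters are arbitrary starting points:

```python
import random
import time
from collections import OrderedDict
from datetime import date

class DayCache:
    """Small LRU cache keyed on (ip, day), so entries stop matching once the day changes."""

    def __init__(self, max_size=100_000):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, ip, day=None):
        key = (ip, day or date.today().isoformat())
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]
        return None

    def put(self, ip, result, day=None):
        key = (ip, day or date.today().isoformat())
        self._items[key] = result
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry

class TransientError(Exception):
    """Hypothetical: raised by your HTTP wrapper when the API answers with a 5xx."""

def with_retries(call, attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Run call(); on TransientError wait with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Full jitter (a random delay between zero and the current backoff ceiling) is one common choice; it keeps many clients from retrying in lockstep after the daily rebuild.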
Security notes
- IPs in your logs may be considered personal data under UK GDPR. /api/v1/is_bot processes the IP in memory and writes only the IP + endpoint + status to usage_events; the IP itself is subject to our retention policy (90 days at full detail).
- If your logs contain end-user IPs, scrub or anonymise before sending if your data-protection policy requires it.
Pricing & limits
Each call counts as one request unit against your monthly quota, whether you send one IP or 200. A batch of 200 IPs once per hour (about 720 calls a month) fits inside the Free plan (1,000 req/mo); once per minute is about 43,200 calls a month and needs the Developer plan. High-volume continuous log-analytics workloads typically land on Developer ($29/mo, 50,000 req/mo) or Startup ($99/mo, 250,000 req/mo). See /pricing.
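To estimate where a workload lands, the arithmetic is one call per 200 unique IPs. A small sketch using the plan quotas quoted above; the helper names are illustrative, and the plan table should be checked against /pricing:

```python
import math

BATCH = 200  # maximum IPs per request, per this guide
# (plan name, requests per month) as quoted above; confirm against /pricing.
PLANS = [("Free", 1_000), ("Developer", 50_000), ("Startup", 250_000)]

def calls_needed(unique_ips, batch=BATCH):
    """Number of API calls needed to classify unique_ips addresses."""
    return math.ceil(unique_ips / batch)

def smallest_plan(calls_per_month):
    """Return the first listed plan whose quota covers the monthly call count."""
    for name, quota in PLANS:
        if calls_per_month <= quota:
            return name
    return None  # above the plans listed here

print(calls_needed(150_000))  # → 750
print(smallest_plan(750))     # → Free
```

Deduplicating IPs before batching matters more than anything else here, since repeat visitors would otherwise be billed again.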
When urlcap is simpler than a bot-management platform
Bot detection and "bot management" sound similar but solve different problems. urlcap isn't trying to replace edge-protection platforms; it's a different shape of tool. Roughly:
| Use case | Cloudflare Bot Management | IPinfo / IP Trust | urlcap is_bot |
|---|---|---|---|
| Block malicious bots at the edge in real time | Excellent | No | No |
| Classify a CSV / batch of IPs after the fact | Limited | Partial (intelligence) | Strong fit |
| Check if an IP is a known crawler (Googlebot, GPTBot…) | Indirect | Partial | Strong fit |
| Build internal scripts / dashboards from logs | Possible but heavy | Good for IP intelligence | Strong fit |
| Geolocation, ASN, VPN/proxy/Tor detection | Partial | Strong fit | No |
| Analyse logs by historical date | Not the main use case | Depends on provider | Strong fit (date + historical) |
| Site must be behind the provider's CDN / proxy | Yes | No | No |
Use Cloudflare when your traffic flows through Cloudflare and you want edge mitigation. Use IPinfo or IP Trust when you need broad IP intelligence — geolocation, ASN, VPN/proxy/Tor flags, hosting metadata. Use urlcap when your problem is narrower: "I have a list of IPs from a log file, tell me which ones are known crawlers".
Related
- API reference: /api/v1/is_bot
- Datasets: maintain your own IP/CIDR collections
- The urlcap bot (our crawler's identity and CIDRs)
Create a free API key and run this in 60 seconds.
No card. 1,000 requests / month on the free tier.