Detect Googlebot, Bingbot, GPTBot and other AI crawlers from server logs

Related endpoint: /api/v1/is_bot

Try this in 60 seconds: 5 free requests with no account, then 1,000 requests/month on the free tier.

The problem: user-agent strings can't be trusted

An HTTP request claiming User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) tells you nothing reliable — anyone can set that header. Scrapers, security scanners, and grey-hat tools regularly spoof Googlebot, Bingbot, and the major AI crawlers to bypass rate limits and access controls.

The reliable signal is the source IP. Each well-known crawler publishes the CIDR ranges its production fleet operates from — Google's googlebot.json, Bing's bingbot.json, OpenAI's gptbot.json, Anthropic's ClaudeBot list, etc. A request from 66.249.66.0/27 is almost certainly Googlebot, regardless of what the user-agent string says.
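To make the range check concrete, here is a minimal local sketch using Python's standard ipaddress module. The single range is copied from the example above rather than fetched from any published list, and the helper name in_any_range is illustrative only.

```python
import ipaddress

def in_any_range(ip, cidrs):
    """Return True if ip falls inside at least one of the given CIDR ranges."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same family (IPv4 vs IPv6)
        if addr.version == net.version and addr in net:
            return True
    return False

# Example range taken from the text above; a real list has many more entries
EXAMPLE_RANGES = ["66.249.66.0/27"]

print(in_any_range("66.249.66.1", EXAMPLE_RANGES))  # True
print(in_any_range("192.0.2.7", EXAMPLE_RANGES))    # False
```

A /27 covers 32 addresses, so 66.249.66.0 through 66.249.66.31 match and 66.249.66.32 does not.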

Maintaining those lists by hand is busywork; the lists change weekly and there are dozens of them. /api/v1/is_bot aggregates them, refreshes them daily, and lets you ask "does this IP belong to a known crawler?" with one HTTP call.

Who this is for

Step 1 — check a single IP

The smallest possible request is a single IP. Both GET and POST are supported:

single IP · curl
curl -s https://urlcap.com/api/v1/is_bot \
  -H "Authorization: Bearer $URLCAP_KEY" \
  -H "Content-Type: application/json" \
  -d '{"ip":"66.249.66.1"}'

A successful response:

200 OK
{
  "version": "1",
  "requestId": "…",
  "data": {
    "ip": "66.249.66.1",
    "family": 4,
    "isBot": true,
    "matchCount": 1,
    "matches": [{
      "cidr": "66.249.66.0/27",
      "active": true,
      "botGroup":     { "id": 4, "description": "Common crawlers" },
      "searchEngine": { "id": 1, "description": "Google" },
      "categories":   [{ "code": "SEARCH_INDEXING", "confidence": "high" }]
    }]
  }
}

The fields you'll actually store in your analytics table are isBot, matches[0].searchEngine.description (the operator — "Google", "Bing", "OpenAI"…), matches[0].botGroup.description (the family — "Common crawlers", "AI crawlers"…), matches[0].categories[].code (intent — SEARCH_INDEXING, AI_TRAINING, SEO_TOOL…), and the matching cidr for audit.
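As a rough sketch of that mapping, the function below flattens the data object of a response into one row. The column names (operator, group and so on) and the function name flatten are choices made for this example, not part of the API.

```python
def flatten(data):
    """Turn the `data` object of an is_bot response into one flat row."""
    matches = data.get("matches") or []
    first = matches[0] if matches else {}
    return {
        "ip": data.get("ip", ""),
        "isBot": bool(data.get("isBot")),
        "operator": (first.get("searchEngine") or {}).get("description", ""),
        "group": (first.get("botGroup") or {}).get("description", ""),
        "categories": ",".join(c["code"] for c in first.get("categories") or []),
        "cidr": first.get("cidr", ""),
    }

# The `data` object from the sample response above
sample = {
    "ip": "66.249.66.1",
    "family": 4,
    "isBot": True,
    "matchCount": 1,
    "matches": [{
        "cidr": "66.249.66.0/27",
        "active": True,
        "botGroup": {"id": 4, "description": "Common crawlers"},
        "searchEngine": {"id": 1, "description": "Google"},
        "categories": [{"code": "SEARCH_INDEXING", "confidence": "high"}],
    }],
}

row = flatten(sample)
print(row["operator"], row["categories"])  # Google SEARCH_INDEXING
```

Only the first match is used here; if you care about overlapping ranges, iterate over every entry in matches instead.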

Step 2 — process a real Nginx access log

A typical Nginx access-log line looks like:

access.log
66.249.66.1 - - [14/May/2026:08:01:23 +0000] "GET /pricing HTTP/1.1" 200 18421 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.7   - - [14/May/2026:08:01:24 +0000] "GET /pricing HTTP/1.1" 200 18421 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
40.77.167.5 - - [14/May/2026:08:01:25 +0000] "GET /docs    HTTP/1.1" 200 88102 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

The second line claims Googlebot but the IP isn't in Google's published range — that's the kind of spoof you want to flag.
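One way to flag such lines, sketched under a few assumptions: the log uses the combined format shown above, the crawler-name pattern is a short illustrative list rather than a complete one, and is_bot_by_ip is a hand-written stand-in for results you would get from the API.

```python
import re

# Combined format: ip, identity, user, [time], "request", status, bytes, "referer", "user-agent"
LINE_RE = re.compile(
    r'^(\S+)\s+\S+\s+\S+\s+\[[^\]]+\]\s+"[^"]*"\s+\d{3}\s+\S+\s+"[^"]*"\s+"([^"]*)"'
)

# Illustrative, incomplete list of crawler tokens to look for in the user-agent
CLAIMED_BOT_RE = re.compile(r"googlebot|bingbot|gptbot|claudebot", re.IGNORECASE)

def spoof_suspects(lines, is_bot_by_ip):
    """Yield (ip, user_agent) where the UA claims a crawler but the IP did not match one."""
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        ip, ua = m.group(1), m.group(2)
        if CLAIMED_BOT_RE.search(ua) and not is_bot_by_ip.get(ip, False):
            yield ip, ua

log_lines = [
    '66.249.66.1 - - [14/May/2026:08:01:23 +0000] "GET /pricing HTTP/1.1" 200 18421 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7   - - [14/May/2026:08:01:24 +0000] "GET /pricing HTTP/1.1" 200 18421 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '40.77.167.5 - - [14/May/2026:08:01:25 +0000] "GET /docs    HTTP/1.1" 200 88102 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
]

# Assumed classification results, written by hand for this example
is_bot_by_ip = {"66.249.66.1": True, "192.0.2.7": False, "40.77.167.5": True}

suspects = list(spoof_suspects(log_lines, is_bot_by_ip))
print([ip for ip, _ in suspects])  # ['192.0.2.7']
```

This only checks that a crawler-claiming IP matched some known crawler. A stricter check would also compare the claimed name against the operator returned for the IP.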

Step 3 — batch up to 200 IPs per call

For log analysis, batching is essential. /api/v1/is_bot accepts up to 200 IPs per request via the ips field — one HTTP round-trip instead of 200.

batch · curl
curl -s https://urlcap.com/api/v1/is_bot \
  -H "Authorization: Bearer $URLCAP_KEY" \
  -H "Content-Type: application/json" \
  -d '{"ips":["66.249.66.1","192.0.2.7","40.77.167.5","8.8.8.8"]}'

Step 4 — a small Python pipeline

Extract unique IPs from a log, batch them in groups of 200, and emit a CSV with the bot classification:

classify-log.py
#!/usr/bin/env python3
"""Classify every unique client IP in an access log and print the result as CSV."""
import csv, ipaddress, json, os, sys, urllib.request

KEY   = os.environ["URLCAP_KEY"]            # API key, read from the environment
URL   = "https://urlcap.com/api/v1/is_bot"
BATCH = 200                                 # maximum IPs per request

def unique_ips(path):
    """Return the sorted unique client IPs (first field of each log line)."""
    seen = set()
    with open(path, errors="replace") as f:
        for line in f:
            token = line.split(None, 1)[0] if line.strip() else ""
            try:
                ipaddress.ip_address(token)  # accepts IPv4 and IPv6, rejects anything else
            except ValueError:
                continue
            seen.add(token)
    return sorted(seen)

def classify(ips):
    """POST one batch of IPs and return the per-IP results."""
    req = urllib.request.Request(
        URL,
        data=json.dumps({"ips": ips}).encode(),  # passing data makes this a POST
        headers={"Authorization": f"Bearer {KEY}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as r:
        return json.load(r)["data"]["results"]

def main(path):
    ips = unique_ips(path)
    w = csv.writer(sys.stdout)
    w.writerow(["ip", "isBot", "operator", "group", "categories", "cidr"])
    for i in range(0, len(ips), BATCH):
        for row in classify(ips[i : i + BATCH]):
            # Use the first match, or an empty dict when the IP matched nothing
            m = (row.get("matches") or [{}])[0]
            w.writerow([
                row["ip"],
                row["isBot"],
                (m.get("searchEngine") or {}).get("description", ""),
                (m.get("botGroup") or {}).get("description", ""),
                ",".join(c["code"] for c in m.get("categories") or []),
                m.get("cidr", ""),
            ])

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: classify-log.py ACCESS_LOG")
    main(sys.argv[1])

Run with python3 classify-log.py access.log > bots.csv.
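Once bots.csv exists, a short follow-up script can summarise it. The sketch below counts classified IPs per operator; it assumes the column names written by classify-log.py, and the helper name ips_by_operator and the inline sample data are made up for illustration.

```python
import csv
import io
from collections import Counter

def ips_by_operator(csv_file):
    """Count bot IPs per operator from the CSV written by classify-log.py."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # csv.writer serialises Python booleans as the strings "True" / "False"
        if row["isBot"] == "True":
            counts[row["operator"] or "(unknown)"] += 1
    return counts

# Hand-written sample in the same shape as bots.csv
sample = io.StringIO(
    "ip,isBot,operator,group,categories,cidr\r\n"
    "66.249.66.1,True,Google,Common crawlers,SEARCH_INDEXING,66.249.66.0/27\r\n"
    "192.0.2.7,False,,,,\r\n"
)

counts = ips_by_operator(sample)
print(counts)  # Counter({'Google': 1})
```

Because the pipeline deduplicates IPs, these are counts of distinct addresses, not of requests. To count hits, join the CSV back against the log by IP.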

Analysing old logs: historical and date

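Bot operators rotate IP ranges. A CIDR that was Googlebot in 2024 might not be in the operator's current list, but a 2024 log entry from that range was still a real Googlebot hit. Two request fields help: historical and date. The request below uses date to ask about a specific day: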

point-in-time
curl -s https://urlcap.com/api/v1/is_bot \
  -H "Authorization: Bearer $URLCAP_KEY" -H "Content-Type: application/json" \
  -d '{"ip":"66.249.66.1","date":"2024-09-01"}'
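When replaying old logs, the date can be taken from each log line's own timestamp. The sketch below builds a request body that way. It assumes the Nginx timestamp format shown earlier and, as a guess rather than documented behaviour, that date should be the UTC calendar day; point_in_time_payload is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

TS_RE = re.compile(r"\[([^\]]+)\]")  # the bracketed timestamp in an access-log line

def point_in_time_payload(log_line):
    """Build an is_bot request body dated to the UTC day of the log entry."""
    fields = log_line.split(None, 1)
    m = TS_RE.search(log_line)
    if not fields or not m:
        return None  # not a recognisable access-log line
    # e.g. "14/May/2026:08:01:23 +0000"; %b expects English month abbreviations
    when = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
    day = when.astimezone(timezone.utc).date().isoformat()  # assumption: UTC day
    return {"ip": fields[0], "date": day}

line = '66.249.66.1 - - [14/May/2026:08:01:23 +0000] "GET /pricing HTTP/1.1" 200 18421 "-" "-"'
print(point_in_time_payload(line))  # {'ip': '66.249.66.1', 'date': '2026-05-14'}
```

Grouping log lines by day before batching keeps the number of distinct dates, and therefore requests, small.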

Production notes

Security notes

Pricing & limits

Each call counts as one request unit against your monthly quota, whether you send one IP or 200. A batch of 200 IPs once per hour (roughly 720 to 744 calls a month) fits inside the Free plan (1,000 req/mo); once per minute would be about 43,200 calls a month and would not. High-volume continuous log-analytics workloads typically land on Developer ($29/mo, 50,000 req/mo) or Startup ($99/mo, 250,000 req/mo). See /pricing.
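The arithmetic is easy to script when sizing a job. This small sketch uses only the figures stated above (200 IPs per call, one unit per call); the function names are illustrative.

```python
import math

BATCH_LIMIT = 200  # maximum IPs per request

def request_units(unique_ip_count, batch_limit=BATCH_LIMIT):
    """Request units needed to classify this many unique IPs in full batches."""
    return math.ceil(unique_ip_count / batch_limit)

def max_ips_per_month(monthly_quota, batch_limit=BATCH_LIMIT):
    """Upper bound on IP lookups per month if every call carries a full batch."""
    return monthly_quota * batch_limit

print(request_units(10_000))       # 50
print(max_ips_per_month(1_000))    # 200000
print(max_ips_per_month(50_000))   # 10000000
```

Deduplicating IPs before batching, as the Python pipeline does, is what keeps the unit count low for large logs.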

When urlcap is simpler than a bot-management platform

Bot detection and "bot management" sound similar but solve different problems. urlcap isn't trying to replace edge-protection platforms; it's a different shape of tool. Roughly:

| Use case | Cloudflare Bot Management | IPinfo / IP Trust | urlcap is_bot |
|---|---|---|---|
| Block malicious bots at the edge in real time | Excellent | No | No |
| Classify a CSV / batch of IPs after the fact | Limited | Partial (intelligence) | Strong fit |
| Check if an IP is a known crawler (Googlebot, GPTBot…) | Indirect | Partial | Strong fit |
| Build internal scripts / dashboards from logs | Possible but heavy | Good for IP intelligence | Strong fit |
| Geolocation, ASN, VPN/proxy/Tor detection | Partial | Strong fit | No |
| Analyse logs by historical date | Not the main use case | Depends on provider | Strong fit (date + historical) |
| Site must be behind your CDN / proxy | Yes | No | No |

Use Cloudflare when your traffic flows through Cloudflare and you want edge mitigation. Use IPinfo or IP Trust when you need broad IP intelligence — geolocation, ASN, VPN/proxy/Tor flags, hosting metadata. Use urlcap when your problem is narrower: "I have a list of IPs from a log file, tell me which ones are known crawlers".
