urlcap
Changelog
What's new in the urlcap API. Breaking changes ship under a new version path; everything additive lands within the current version. Subscribe to updates by emailing info@urlcap.com.
v1 · new
Candidate-IP feed — /sites/{id}/bot-ip-candidates
IP sibling of the JA4-side /bot-ja4s feed: a small list of IPs exhibiting abusive behaviour on your site, ranked by a 5-component composite (block-ratio 40%, volume 20%, path breadth 15%, JA4 churn 15%, vuln-probe hits 10%). Already-classified IPs (in bot_observed_ips) and trust-list IPs are excluded from scoring entirely. The default min_score=0.50 score floor keeps the list small by construction. The format flag supports json (default operator shape), txt (one IP per line) and cidr (each IP as /32 or /128); the text formats are text/plain so you can feed them straight into ipset, iptables, or a Cloudflare IP list.
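A minimal sketch of how a five-component ranking like the one described above could be computed. The weights and the 0.50 floor come from this entry; the weighted-sum form, the assumption that every component is already normalised into [0, 1], and all names in the code (composite_score, rank_candidates, the component keys) are illustrative and not urlcap's implementation.

```python
# Weights as listed in the changelog entry. Everything else is an assumption.
WEIGHTS = {
    "block_ratio": 0.40,
    "volume": 0.20,
    "path_breadth": 0.15,
    "ja4_churn": 0.15,
    "vuln_probe_hits": 0.10,
}
MIN_SCORE = 0.50  # default score floor named in the entry


def composite_score(components):
    """Weighted sum; assumes each component is already normalised to [0, 1]."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


def rank_candidates(candidates, min_score=MIN_SCORE):
    """Return (ip, score) pairs at or above the floor, highest score first."""
    scored = ((ip, composite_score(parts)) for ip, parts in candidates.items())
    kept = [(ip, score) for ip, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical input: two documentation-range IPs with made-up component values.
candidates = {
    "203.0.113.7": {"block_ratio": 0.9, "volume": 0.5, "path_breadth": 0.4,
                    "ja4_churn": 0.2, "vuln_probe_hits": 0.0},
    "198.51.100.9": {"block_ratio": 0.2, "volume": 0.3, "path_breadth": 0.1,
                     "ja4_churn": 0.1, "vuln_probe_hits": 0.0},
}
ranked = rank_candidates(candidates)
```

With these made-up inputs only the first IP clears the floor, which is the "small by construction" effect the entry describes.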
internal
Bot-candidate auto-promote, auto-dismiss, de-promote & reopen
The candidate queue at /account/admin/bot-candidates is now self-maintaining. A daily 04:30 UTC job auto-promotes obvious bots that pass eight conservative gates (score ≥ 0.85, reqs ≥ 10k, distinct_ips ≥ 100, sites_seen ≥ 5, asset_ratio < 0.01, purchases = 0, age ≥ 48h, status = pending) into a single umbrella bot_group, Auto-flagged scraper (discovery), and auto-dismisses stale rows (last_seen > 14 days ago + score < 0.7) to keep the queue bounded.
The admin UI gains a header banner with live queue counts (pending / promoted-auto-vs-manual / dismissed) and two new per-row actions: De-promote (deletes the corresponding bot_observed_ja4s row, garbage-collects an orphaned per-candidate bot_group if any, sends the candidate back to pending) and Reopen (dismissed → pending for re-review).
Both automated paths are independently switchable via system_config: bot.discovery.auto_promote.enabled (default false) and bot.discovery.auto_dismiss.enabled (default true). An audit trail is kept on each promoted candidate (promotion_kind, promoted_to_bot_group_id, promoted_at) and on bot_groups created by manual promotion (discovery_kind = 'manual-from-candidate' or 'auto-umbrella').
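The eight gates and the stale-row rule above can be read as two predicates. The sketch below is only an illustration of that reading: the thresholds come from this entry, while the Candidate class, its field names, and the choice to measure "age" from a first_seen timestamp are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Candidate:
    # Hypothetical row shape; the real table layout is not documented here.
    score: float
    reqs: int
    distinct_ips: int
    sites_seen: int
    asset_ratio: float
    purchases: int
    first_seen: datetime
    last_seen: datetime
    status: str


def passes_auto_promote(c, now):
    """All eight gates must hold. 'age' is assumed to mean now - first_seen."""
    return (
        c.score >= 0.85
        and c.reqs >= 10_000
        and c.distinct_ips >= 100
        and c.sites_seen >= 5
        and c.asset_ratio < 0.01
        and c.purchases == 0
        and now - c.first_seen >= timedelta(hours=48)
        and c.status == "pending"
    )


def is_stale(c, now):
    """Auto-dismiss rule: not seen for more than 14 days and score below 0.7."""
    return now - c.last_seen > timedelta(days=14) and c.score < 0.7


now = datetime(2025, 1, 10, 4, 30, tzinfo=timezone.utc)  # arbitrary example run time
```

Keeping the gates as one conjunction makes it easy to see that a single failing condition, such as one recorded purchase, is enough to leave a candidate in the manual queue.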
v1 · new
Site-level quick-check endpoints — /url-stats & /bot-traffic
Two new helper endpoints on the sites surface that bottle up the most common investigation queries:
- GET /api/v1/sites/{id}/url-stats?host=&path=&days=N: totals + status breakdown + per-day chart + a 25-row non-2xx sample for a specific URL. Answers "any rejections on URL X today?" without writing ClickHouse queries by hand.
- GET /api/v1/sites/{id}/bot-traffic?bot_group=&days=N: recent traffic from a named bot_group (substring-match), with status mix and a sample of recent requests. Answers "are we blocking Google?" in one call. Bot identification matches every JA4 the discovery system has attributed to the bot_group via bot_observed_ja4s.
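As a rough illustration, the two quick-check URLs can be assembled with the standard library. The paths and query parameter names are taken from the entry above and the base URL from the versioned-API entry; the helper names and the days=1 default are assumptions, and authentication is left out.

```python
from urllib.parse import quote, urlencode

BASE = "https://urlcap.com/api/v1"  # versioned API base named in this changelog


def url_stats_url(site_id, host, path, days=1):
    """Build the /url-stats quick-check URL for one host + path."""
    query = urlencode({"host": host, "path": path, "days": days})
    return f"{BASE}/sites/{quote(str(site_id), safe='')}/url-stats?{query}"


def bot_traffic_url(site_id, bot_group, days=1):
    """Build the /bot-traffic quick-check URL for a bot_group substring."""
    query = urlencode({"bot_group": bot_group, "days": days})
    return f"{BASE}/sites/{quote(str(site_id), safe='')}/bot-traffic?{query}"


# The site id may be the numeric id or the base62 publicId.
example = url_stats_url("2DrxGfsYW0jv", "example.com", "/checkout")
```

urlencode percent-encodes the slash in the path value, so the target URL path travels safely as a single query parameter.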
v1 · new
Phase-level timings on /capture and monitor checks
/api/v1/capture's response.timings object now carries a phase breakdown: dnsMs, connectMs (TCP + TLS), requestSendMs, ttfbMs, bodyMs, plus the resolvedIp the socket actually landed on. Phase fields are omitted when their lifecycle hook didn't fire: pooled keep-alive reuse skips DNS / connect, and followRedirects=true only captures the first leg's timings.
Capture-kind URL monitors persist the same fields per check on monitor_checks and surface them on /monitors/{id}/checks + the monitor detail page. Extract-kind monitors don't get phase timings because HtmlUnit's connection layer doesn't expose them. While we were there: fixed the persisted check size_bytes column reading 0 on capture monitors (it was looking at sizeBytes in the response envelope, which is actually bodyBytes).
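Because phase fields can be absent, client code should not index them blindly. The sketch below shows one defensive way to read a timings object; the field names are the ones listed above, while the summary keys and the "likely reused connection" inference are assumptions made for illustration.

```python
PHASES = ("dnsMs", "connectMs", "requestSendMs", "ttfbMs", "bodyMs")


def summarise_timings(timings):
    """Split a response.timings dict into present and omitted phases."""
    present = {name: timings[name] for name in PHASES if name in timings}
    missing = [name for name in PHASES if name not in timings]
    return {
        "phases": present,
        "missing": missing,
        # Heuristic only: the entry says pooled keep-alive reuse skips DNS / connect.
        "likely_reused_connection": "dnsMs" not in timings and "connectMs" not in timings,
        "slowest_phase": max(present, key=present.get) if present else None,
    }


# Made-up sample in which the DNS and connect hooks did not fire.
sample = {"requestSendMs": 1, "ttfbMs": 120, "bodyMs": 30, "resolvedIp": "192.0.2.10"}
summary = summarise_timings(sample)
```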
v1 · new
URL monitors — up/down checks with webhook + email alerts
New /api/v1/monitors (plus /checks, /uptime, /pause, /resume) let you schedule a recurring capture or extract against any URL and get notified on state transitions. Same primitives as UptimeRobot, plus full headless-browser checks, JSON-API validation via the extract path, and the User-Agent personas from /user_agent_profiles.
Sub-minute polling on paid plans (30 s minimum on Startup + Business, 60 s on Developer, 300 s on Free). State transitions fire HMAC-SHA256-signed webhooks AND email through the platform mailer. Flapping is debounced via alertFailureThreshold, the number of consecutive failures required before flipping to down, so single-shot 502s don't page anyone. Check history is retained for 30 days; the sweep runs daily.
Manage from the UI at /account/monitors: create, edit, view recent checks, see uptime over the last 30 days.
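The debounce behaviour described above can be pictured as a small state machine. This is a sketch under assumptions, not the monitor's actual code: the class name, the default threshold of 3, and the rule that a single successful check flips the state back to up are all invented for illustration.

```python
class MonitorState:
    """Tracks up/down state with a consecutive-failure threshold."""

    def __init__(self, alert_failure_threshold=3):  # default value is an assumption
        self.threshold = alert_failure_threshold
        self.consecutive_failures = 0
        self.state = "up"

    def record(self, ok):
        """Record one check. Returns the new state on a transition, else None."""
        if ok:
            self.consecutive_failures = 0
            if self.state == "down":
                self.state = "up"  # assumed: one success is enough to recover
                return "up"
            return None
        self.consecutive_failures += 1
        if self.state == "up" and self.consecutive_failures >= self.threshold:
            self.state = "down"
            return "down"
        return None


monitor = MonitorState(alert_failure_threshold=3)
checks = [False, True, False, False, False, False, True]
transitions = [monitor.record(ok) for ok in checks]
```

Only the returned transitions would trigger a webhook or email, so the isolated failure at the start of the sequence never alerts.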
v1 · new
User-Agent profiles for capture & extract
New GET /api/v1/user_agent_profiles plus two new fields on capture + extract requests: userAgent (raw string override) and userAgentProfile (key into the catalogue). The catalogue is an operator-managed MySQL table, seeded with the common browser personas (chrome-latest-mac, firefox-latest-win, edge-latest-win, safari-latest-mac, …) plus googlebot and the default urlcap.
On /extract the profile selects both the UA string AND the HtmlUnit BrowserVersion (Chrome / Firefox / Edge) so the in-page navigator.userAgent and the JS engine flavour line up. safari-* profiles still run on Chromium (HtmlUnit has no Safari engine), so navigator.vendor will read Google Inc. The googlebot profile only sets the UA string; for cryptographic urlcap attribution use webBotAuth=true.
v1 · new
Per-minute traffic-shape signals on the bot-JA4 blocklist
/api/v1/sites/{id}/bot-ja4s now returns reqs_per_minute_peak, reqs_per_minute_mean (over active minutes only), active_minutes and burstiness (coefficient of variation) per JA4. Useful for catching attack-shape JA4s that hide under a moderate asset_ratio but spike to thousands of req/min during their active windows: the case where score >= 0.70 and asset_ratio sits between 0.05 and 0.50, so neither Tier 1 nor Tier 2 fires.
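To make the four fields concrete, here is a sketch that derives them from a list of per-minute request counts. The output keys match the field names above; computing burstiness over active minutes only and using the population standard deviation are assumptions, since the entry only says "coefficient of variation".

```python
from statistics import mean, pstdev


def traffic_shape(per_minute_counts):
    """Derive peak, mean, active-minute count and burstiness for one JA4."""
    active = [count for count in per_minute_counts if count > 0]
    if not active:
        return {"reqs_per_minute_peak": 0, "reqs_per_minute_mean": 0.0,
                "active_minutes": 0, "burstiness": 0.0}
    average = mean(active)
    return {
        "reqs_per_minute_peak": max(active),
        "reqs_per_minute_mean": average,
        "active_minutes": len(active),
        # Coefficient of variation = standard deviation / mean.
        "burstiness": pstdev(active) / average,
    }


steady = traffic_shape([10, 10, 0, 10, 10])
spiky = traffic_shape([0, 2, 4, 4, 4, 5, 5, 7, 9, 0])
```

A steady client scores a burstiness of zero, while a client that idles and then spikes scores well above it, even when both show the same asset_ratio.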
v1 · new
Bot-JA4 blocklist endpoint — one feed, edge-blocklist ready
New GET /api/v1/sites/{id}/bot-ja4s: returns every JA4 the discovery system has flagged on a site over a sliding window (1..90 days), labelled known_bot with bot-group attribution or candidate with discovery score. Browser-shaped fingerprints are excluded; the strings are the keys you feed into nginx, Cloudflare WAF, or any edge map. Each row also carries cross_customer_action (how many other sites have explicitly blocked, allowed, or challenged this JA4 via edge_action outcomes) plus block_likely_on_this_site, inferred from your own status-code mix.
v1 · new
Single-IP intelligence endpoint
GET /api/v1/ip/intelligence?ip=… composes every signal urlcap has on a single IP into one response: MaxMind geo, PTR records, bot-CIDR membership, every bot_observed_ips attribution row (cidr_match / ua_match / vuln_match / manual), trust-list status, and cross-site behavioural data over the last 1–7 days (request count, distinct JA4s/UAs/sites, status-code mix, asset ratio, vuln-probe hits, every JA4 with classification, top user agents, and cross-customer edge_action votes summed across every JA4 the IP has used). The operator-grade investigation endpoint.
v1 · quality
Base62 site public ids + per-domain kind
Every site now carries a 12-char base62 publicId alongside its numeric id. Use it wherever a site appears in a URL (/api/v1/sites/2DrxGfsYW0jv/bot-ja4s, /api/v1/ingest/2DrxGfsYW0jv/events) to avoid leaking enumeration order to anyone holding a URL. The numeric form remains accepted. Domains now carry a kind hint (doc / asset / api); informational only for v1, but expect future signals to lean on it (assets-vs-docs ratio per host, etc.).
v1 · new
Outcomes channel: tell urlcap what happened next
New POST /api/v1/ingest/{site}/outcomes accepts asynchronous verdicts keyed by the original event's request_id, with kinds js_challenge / auth / purchase / edge_action. urlcap resolves the JA4 server-side at ingest time and denormalises it onto the outcome row so verdicts survive the 7-day raw-event TTL. Eight new outcome-derived signals surface on /api/v1/ja4/intelligence (js_challenge_attempts/passes/pass_rate, auth_observations, distinct_users, purchases, total_purchase_cents, last_purchase_at) plus ja4_ua_consistency, a UA-spoofing tell where a JA4 keeps claiming different agent/OS values.
Plain-HTTP traffic (no TLS handshake, no JA4) now also flows through ingest: IP-keyed aggregates and vuln-probe / bot-discovery still see those rows; JA4-keyed materialised views filter them out.
v1 · new
Vuln-probe attribution + trust list
Every ingested request is tagged at write-time with a vuln_probe_id if its path matches the curated vuln-scan registry (52 paths across 9 categories: exposed WP plugins, exposed .git, common dropper paths, etc.). IPs hitting three or more distinct probes are auto-attributed to the synthetic "Vulnerability scanner" bot_group; the discovery loop currently catches ~50–70 new scanner IPs per hour.
New per-user trust list of IPs / CIDRs / domains the discovery scan never flags: protect monitoring probes, office IPs, and partner integrations from being auto-classified as scanners. Trust applies globally (signal modulation) while ownership stays per-user (admin audit).
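The attribution rule above (three or more distinct probes, trusted addresses never flagged) can be sketched as follows. The registry here is a tiny made-up stand-in for the curated 52-path registry, and the probe ids, function name and event shape are hypothetical.

```python
import ipaddress
from collections import defaultdict

# Hypothetical stand-in for the vuln-scan registry: path -> vuln_probe_id.
VULN_PROBES = {
    "/.git/config": "exposed-git",
    "/.env": "exposed-env",
    "/wp-login.php": "wp-login",
    "/shell.php": "dropper",
}


def scanner_ips(events, trusted_cidrs=(), threshold=3):
    """Return IPs that hit at least `threshold` distinct probes, skipping trusted ranges."""
    trusted = [ipaddress.ip_network(cidr) for cidr in trusted_cidrs]
    probes_by_ip = defaultdict(set)
    for ip, path in events:
        probe_id = VULN_PROBES.get(path)
        if probe_id is None:
            continue  # not a registered probe path
        addr = ipaddress.ip_address(ip)
        if any(addr.version == net.version and addr in net for net in trusted):
            continue  # trust list wins: never flagged
        probes_by_ip[ip].add(probe_id)
    return sorted(ip for ip, probes in probes_by_ip.items() if len(probes) >= threshold)


events = [
    ("198.51.100.7", "/.git/config"), ("198.51.100.7", "/.env"), ("198.51.100.7", "/shell.php"),
    ("203.0.113.5", "/.git/config"), ("203.0.113.5", "/.git/config"), ("203.0.113.5", "/index.html"),
    ("192.0.2.9", "/.git/config"), ("192.0.2.9", "/.env"), ("192.0.2.9", "/wp-login.php"),
]
flagged = scanner_ips(events, trusted_cidrs=["192.0.2.0/24"])
```

Counting distinct probe ids rather than raw hits is what keeps a client that retries one path from being classified as a scanner.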
v1 · new
JA4 intelligence: OS / agent / device classification + 25-metric snapshot
Four new read endpoints on top of the long-window aggregates:
- /api/v1/ja4/intelligence: the rolled-up profile for one JA4 over 7 / 30 / 90 days. Returns the top OS / agent / device guess with per-dimension confidence (the score IS the confidence, so apply your own threshold), diversity scores, and the new outcome-derived counts. Hourly recompute.
- /api/v1/ja4/profile: drill into one dimension (os_name, agent_name, device_class, country, asn and 4 more), top-N values with HLL-merged distinct counts.
- /api/v1/ja4/metrics: on-demand "prioritised 25" snapshot with a trailing-hour rollup (counts + 10 ratios), an optional IP+JA4 sub-block when ip= is supplied (with is_new_24h), and a top-10 JA4×UA breakdown over 24h with each UA's share of the JA4's volume.
- /api/v1/ip/profile: per-IP behavioural summary on one site over N days (distinct JA4s/UAs/paths/hosts + heuristic / challenge counts). The proxy / NAT detection lookup.
v1 · new
Bot discovery: auto-classify what your traffic actually contains
A background loop now ingests every request, attributes IPs and JA4s to known bot groups (CIDR match, UA self-identification, vuln-probe pattern), and queues unknown candidates for review. UA-self-identifying bots are auto-classified into new bot_groups: MistralBot, Meta-ExternalFetcher, DeepSeekBot and others have been added without human intervention. The bot-IP index rebuilds every hour and feeds /api/v1/is_bot; pending candidates surface in the admin queue.
v1 · new
Customer sites + ingest channel + JA4 signals
The headline release: you can now ship request events from your edge to urlcap and get per-JA4 signals back. New site management surface:
- /api/v1/sites: list and create sites (one per customer-facing property).
- /api/v1/sites/{id}/domains: attach hostnames; UNIQUE across all sites (a hostname can only belong to one site).
- /api/v1/sites/{id}/ingest_keys: mint per-site ingest tokens. Cleartext is returned exactly once at mint-time; urlcap only stores the SHA-256 hash.
- POST /api/v1/ingest/{site}/events: NDJSON batches of request events, authed with the per-site token (not your urlcap API key). 1000 events / 4 MB per batch; partial failures never fail the batch.
Reads land in /api/v1/ja4/signals: a Cloudflare-style "what does this JA4 look like right now?" snapshot, recomputed every minute, with 10 ratios (h2h3, browser, cache, error etc.), 4 per-site ranks, and 2 quantiles. Plus GeoIP (country / city / ASN) enrichment at ingest from MaxMind GeoLite2, also surfaced on /api/v1/is_bot, /api/v1/ip/lookup and /api/v1/ip/geo.
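On the sending side, the per-batch limits above suggest splitting an event stream before posting it. The sketch below is one way to do that with the standard library. The 1000-event limit is from the entry; reading "4 MB" conservatively as 4,000,000 bytes, the function name, and the event fields in the example are assumptions.

```python
import json

MAX_EVENTS = 1000        # per-batch event limit from the changelog entry
MAX_BYTES = 4_000_000    # "4 MB", read conservatively as decimal megabytes (assumption)


def ndjson_batches(events, max_events=MAX_EVENTS, max_bytes=MAX_BYTES):
    """Yield NDJSON request bodies (bytes) that respect both batch limits."""
    batch, size = [], 0
    for event in events:
        line = json.dumps(event, separators=(",", ":")).encode("utf-8") + b"\n"
        if len(line) > max_bytes:
            raise ValueError("a single event exceeds the batch size limit")
        if batch and (len(batch) >= max_events or size + len(line) > max_bytes):
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(line)
        size += len(line)
    if batch:
        yield b"".join(batch)


# Hypothetical events; real events carry more fields than a request_id.
batches = list(ndjson_batches({"request_id": str(i)} for i in range(2500)))
```

Each yielded body would be sent as one POST to the events endpoint, authenticated with the per-site ingest token rather than the account API key.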
v1 · new
Capture & extract can now sign outbound requests (Web Bot Auth)
Pass webBotAuth: true to /api/v1/capture or /api/v1/extract and urlcap signs every outbound HTTP request it makes with its own Ed25519 keypair, using the same RFC 9421 / Web Bot Auth machinery /web_bot_auth/verify already understands. Target sites that allow known crawlers but block unknown bots can verify our identity against the JWKS published at /.well-known/http-message-signatures-directory and choose to allow urlcap-attributed traffic. Signature covers @authority, signature-agent, @method and @target-uri; 60-second lifetime.
v1 · new
Web Bot Auth signature verification
New endpoint /api/v1/web_bot_auth/verify verifies an inbound HTTP request's RFC 9421 HTTP Message Signature against the operator's published Ed25519 key directory, the cryptographic identity check the Web Bot Auth draft proposes as a successor to IP-range and reverse-DNS checks. urlcap parses the Signature-Input + Signature + Signature-Agent headers, fetches the JWKS-style directory (TTL-cached 1 h), rebuilds the canonical signature base, and verifies with Ed25519 using the right keyid. Pairs naturally with /is_bot and /reverse_dns for a three-layer "is this bot really who it says?" check.
v1 · new
robots.txt suite — fetch, allow/deny check, change-tracker with webhooks
Four new endpoints under /api/v1/robots: fetch + parse a site's robots.txt, decide whether a URL is allowed for a given user-agent per RFC 9309, and create per-user watches that re-fetch on a schedule and POST an HMAC-SHA256-signed notification whenever the content changes. Snapshots are only stored on change, so a watch's history is the file's actual edit log. Fetched bodies are TTL-cached in-process for 1 hour.
/api/v1/is_bot matches now carry a botGroup.honoursRobots block with four booleans (robots, crawlDelay, allow, sitemap) for whether each known bot has ever been caught skipping that aspect of robots.txt. Seeded conservatively: null = not researched yet, false = at least one credible violation report.
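A receiver of the HMAC-SHA256-signed change notifications needs to recompute the signature over the raw body and compare it in constant time. The sketch below shows that general pattern only. How urlcap encodes the signature and which header carries it are not stated in this changelog, so the hex encoding and the function names are assumptions.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 over the raw request body, hex-encoded (encoding is assumed)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_hex: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected.encode("ascii"), received_hex.encode("utf-8"))


# Made-up secret and body, purely to exercise the functions.
secret = b"example-shared-secret"
body = b'{"event":"robots.changed"}'
signature = sign_payload(secret, body)
```

Verification must run against the exact bytes received, before any JSON parsing or re-serialisation, or the digest will not match.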
v1 · new
Reverse DNS (PTR) with FCrDNS confirmation
New endpoint /api/v1/reverse_dns resolves IPv4 / IPv6 addresses to their PTR names. With forwardConfirm=true, every PTR is re-resolved (A + AAAA) and we report whether the original IP is in the answer, the standard FCrDNS check used to verify crawlers like Googlebot or Bingbot. Results are cached in-process for the upstream record's TTL (clamped 30s–1h; negative answers 5 minutes), and every response reports ttlSeconds remaining plus cached. /is_bot picked up a matching reverseDns=true option for the common "is this IP a real bot and what's its PTR?" case.
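For readers unfamiliar with FCrDNS, the sketch below shows the generic check and the TTL clamp using only the Python standard library. It is not urlcap's resolver: the function names are invented, the system resolver stands in for a real DNS client, and the lookups are injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket


def cache_ttl(upstream_ttl_seconds, negative=False):
    """Clamp a record TTL to 30 s..1 h; negative answers get a flat 5 minutes."""
    if negative:
        return 300
    return max(30, min(3600, int(upstream_ttl_seconds)))


def ptr_names(ip):
    """PTR names for an address via the system resolver (empty list on failure)."""
    try:
        name, aliases, _addresses = socket.gethostbyaddr(ip)
    except OSError:
        return []
    return [name, *aliases]


def forward_addresses(hostname):
    """A + AAAA answers for a hostname via the system resolver."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return set()
    return {info[4][0] for info in infos}


def forward_confirmed(ip, ptr=ptr_names, forward=forward_addresses):
    """True if any PTR name resolves back to the original address."""
    target = ipaddress.ip_address(ip)
    for name in ptr(ip):
        for answer in forward(name):
            try:
                if ipaddress.ip_address(answer) == target:
                    return True
            except ValueError:
                continue  # skip answers that are not plain IP literals
    return False
```

Comparing parsed address objects rather than strings avoids false negatives from differing IPv6 spellings of the same address.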
v1 · update
Bot detection: date & historical
/api/v1/is_bot learned two point-in-time / lifetime parameters: date (an ISO-8601 as-of-date that answers "was this IP a known bot on this date?", including CIDRs that have since been retired) and historical=true (returns every CIDR ever published by any bot, regardless of whether it's currently active). Each match now carries addedAt, removedAt and active, so you can see when a CIDR was retired and why. Same in-memory index, same sub-millisecond lookups. See the reference.
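The as-of question can also be answered client-side from the addedAt and removedAt fields of a match. The sketch below assumes both are ISO-8601 calendar dates (YYYY-MM-DD), that removedAt is null while a CIDR is live, and that a CIDR stops counting on its removal date; none of those details are confirmed by this changelog.

```python
from datetime import date


def active_on(match, as_of):
    """Was this CIDR match live on the given date? (boundary rules are assumed)"""
    added = date.fromisoformat(match["addedAt"])
    removed_raw = match.get("removedAt")
    if as_of < added:
        return False
    return removed_raw is None or as_of < date.fromisoformat(removed_raw)


# Made-up matches using only the documented field names.
retired = {"addedAt": "2023-01-10", "removedAt": "2024-03-01", "active": False}
live = {"addedAt": "2023-01-10", "removedAt": None, "active": True}
```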
v1 · new
Bot detection: /api/v1/is_bot
One call to tell if an address is a known web crawler. GET /api/v1/is_bot?ip=66.249.66.1 (or ?ips=a,b,c, or a JSON body for batches up to 200) returns isBot plus, on a match, the bot's search engine, bot group, the matching CIDR, and its category codes (SEARCH_INDEXING, AI_TRAINING, AI_SEARCH_OR_ANSWERING, …) with per-link confidence. Backed by a daily-refreshed in-memory CIDR index, so there is no DB hit per request. Covers Googlebot, Bingbot, Yandex, DuckDuckBot & DuckAssistBot, Applebot, GPTBot & ChatGPT-User & OAI-SearchBot, ClaudeBot, PerplexityBot & Perplexity-User, AhrefsBot, Amazonbot, CCBot, and more. See the reference.
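To illustrate what an in-memory CIDR index does, here is a deliberately simple version built on the ipaddress module. It scans linearly and picks the most specific match, which is fine for a sketch but is not how a production index would be structured. The class, the result keys and the ExampleBot ranges (documentation address space) are all hypothetical.

```python
import ipaddress


class CidrIndex:
    """Toy in-memory index: a list of (network, label) pairs, scanned linearly."""

    def __init__(self, entries):
        self._nets = [(ipaddress.ip_network(cidr), label) for cidr, label in entries]

    def lookup(self, ip):
        """Return the most specific matching CIDR and its label, or None."""
        addr = ipaddress.ip_address(ip)
        matches = [(net, label) for net, label in self._nets
                   if net.version == addr.version and addr in net]
        if not matches:
            return None
        net, label = max(matches, key=lambda match: match[0].prefixlen)
        return {"cidr": str(net), "bot_group": label}


index = CidrIndex([
    ("192.0.2.0/24", "ExampleBot"),
    ("192.0.2.128/25", "ExampleBot-Images"),
    ("2001:db8::/32", "ExampleBot"),
])
hit = index.lookup("192.0.2.200")
```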
v1 · new
Datasets: /api/v1/datasets
Keep your own named, de-duplicated collections of IP / CIDR ranges or URLs: create a dataset, POST items to add, PUT to replace the whole set, DELETE items or the dataset itself, and ask GET /contains whether a value is in it (for IP datasets that includes "is this address covered by some CIDR in the dataset?"). Optional history mode keeps the items dropped by each replace, with the date they were deactivated, which is useful for tracking a set as it evolves. Per-plan caps for number of datasets, max items each, and whether history is allowed (free / developer / startup / business). See the reference.
v1 · new
Extract: JSON content & JSONPath
The extract tool now handles JSON responses: point a job at a URL that returns JSON (a REST API, say) and pull out values with JSONPath. selector becomes a JSONPath expression (e.g. $.results[*].name), with the familiar value / list / items shapes. JSON is auto-detected from the response, or force it with "content": "json". See the JSON content reference.
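For readers new to JSONPath, the example expression $.results[*].name selects the name member of every element under results. The plain-Python equivalent below covers the common case where results is an array. The sample document and function name are made up, and the snippet is not the extract engine's evaluator.

```python
import json

# Made-up API response used only to illustrate the selector.
document = json.loads("""
{"results": [{"name": "alpha", "id": 1},
             {"name": "beta", "id": 2},
             {"id": 3}]}
""")


def select_result_names(doc):
    """Plain-Python reading of $.results[*].name for an array-valued 'results'."""
    return [item["name"] for item in doc.get("results", [])
            if isinstance(item, dict) and "name" in item]


names = select_result_names(document)
```

Elements without a name member are skipped rather than producing nulls, which matches how JSONPath selection works in general.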
v1 · new
Scheduled tasks: /api/v1/schedules
Run a capture or an extract job at a future time: once (runAt) or on a recurring cron schedule (classic 5-field crontab, optional seconds field, @daily & co., evaluated in a time zone you choose). Each execution stores its full JSON result; fetch it from /api/v1/schedules/{id}/runs/{n}. Pause / resume / run-now / cancel, plus a Scheduled tasks page in your account. See the reference.
v1 · new
Accounts & sign-in
Self-service accounts: email/password sign-up with verification, password reset, per-user API keys, plans & billing, and one-click Sign in with Google, GitHub or Microsoft. API requests now also accept Authorization: Bearer api_… per-user keys alongside the legacy X-API-Key.
v1 · new
IP & CIDR tool: /api/v1/ip
Added IPv4/IPv6 tooling: /api/v1/ip/contains (is an address in a CIDR? Single or batch), /api/v1/ip/lookup (which of your stored ranges contain an address?), and /api/v1/ip/ranges to manage a list of named ranges. Stored in an optimised table: 16-byte IPv4-mapped keys, B-tree-indexed range bounds. See the IP & CIDR reference.
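The storage idea mentioned above (16-byte keys with IPv4 mapped into IPv6 space, ranges stored as lower and upper bounds) can be sketched with the ipaddress module. This shows the general technique only; the function names are invented and the real table layout is not described in this changelog.

```python
import ipaddress


def to_key(ip):
    """16-byte sortable key; IPv4 addresses are stored as ::ffff:a.b.c.d."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        return b"\x00" * 10 + b"\xff\xff" + addr.packed
    return addr.packed


def range_bounds(cidr):
    """Lower and upper key of a CIDR, suitable for an ordered (B-tree) index."""
    net = ipaddress.ip_network(cidr, strict=False)
    return to_key(str(net.network_address)), to_key(str(net.broadcast_address))


def contains(cidr, ip):
    """Range test on keys: equal-length bytes compare in numeric order."""
    low, high = range_bounds(cidr)
    return low <= to_key(ip) <= high


key = to_key("192.0.2.1")
```

Because every key has the same length, a byte-wise comparison gives the same ordering as comparing the addresses numerically, which is what lets one index serve both address families.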
v1 · new
Extract tool: /api/v1/extract
Added an async extract tool: submit a job model (load a URL, perform actions such as fill / click / wait / navigate / select, and pull out data with CSS/XPath selectors, optionally across paginated pages) and get a taskId; poll GET /api/v1/extract/{taskId} for the result. Every HTTP request the engine performs while running a job is recorded against the task. See the Extract reference.
v1 · new
Capture tool: /api/v1/capture & the workbench
Added the capture tool: send an HTTP request to any URL (with full control over method, headers and their order, body, redirects) and get the parsed response: status, headers, cookies, query parameters, body, timings. The home page is now a capture workbench. Every captured request and response (and all headers) is recorded. See the Capture reference.
v1
Versioned API & the /api/v1/totp endpoint
Introduced the versioned API base at https://urlcap.com/api/v1. The first endpoint, /api/v1/totp, mirrors the existing TOTP functionality but returns a structured JSON envelope, { version, requestId, data }, including the code, digit count, period, algorithm and seconds-until-rotation. Accepts both GET and POST.
- New consistent error envelope: { version, error: { type, message } } with 400 / 401 statuses.
- The legacy plain-text endpoint at /auth is unchanged and will stay that way.
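For context, the values the TOTP endpoint reports (code, digit count, period, algorithm, seconds until rotation) all fall out of the standard RFC 6238 computation. The sketch below implements that standard algorithm with the Python standard library. The dictionary keys are hypothetical stand-ins, since the changelog lists the fields of data without naming them.

```python
import hmac
import struct
import time


def totp(secret: bytes, at=None, digits=6, period=30, algorithm="sha1"):
    """RFC 6238 TOTP. Result keys are illustrative, not the API's field names."""
    now = time.time() if at is None else at
    counter = int(now // period)
    digest = hmac.new(secret, struct.pack(">Q", counter), algorithm).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation (RFC 4226)
    number = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return {
        "code": str(number % 10 ** digits).zfill(digits),
        "digits": digits,
        "period": period,
        "algorithm": algorithm.upper(),
        "secondsUntilRotation": period - int(now % period),
    }


# RFC 6238 Appendix B test vector: ASCII secret, T = 59 s, 8 digits, SHA-1.
vector = totp(b"12345678901234567890", at=59, digits=8)
```

The secret here is raw bytes; an otpauth:// URI carries it base32-encoded, so it would need decoding first.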
site
New developer site & API reference
Launched the redesigned urlcap site, a marketing landing page and a full API reference at /docs, with a dark-mode toggle, copy-to-clipboard code samples, and a responsive layout.
legacy
TOTP at /auth
The original endpoint: GET /auth?uri=otpauth://totp/... with an X-API-Key header, returning the current code as plain text. Still available, frozen for compatibility.
Looking for what's coming next? Custom HTTP request endpoints — full control over method, headers (including order), and body — are in progress.