urlcap changelog

2026-05-28 v1 · new

Candidate-IP feed — /sites/{id}/bot-ip-candidates

IP sibling of the JA4-side /bot-ja4s feed: a small list of IPs exhibiting abusive behaviour on your site, ranked by a 5-component composite (block-ratio 40%, volume 20%, path breadth 15%, JA4 churn 15%, vuln-probe hits 10%). Already-classified IPs (in bot_observed_ips) and trust-list IPs are excluded from scoring entirely. The default score floor, min_score=0.50, keeps the list small by construction.
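As a rough illustration of how a weighted composite with a score floor behaves, here is a minimal sketch. The weights come from the entry above; the component key names, the assumption that each component is already normalised to the range 0 to 1, and the helper names are hypothetical and not urlcap's actual implementation.

```python
# Weights as listed in the changelog entry; they sum to 1.0.
# The key names are illustrative, not the real field names.
WEIGHTS = {
    "block_ratio": 0.40,
    "volume": 0.20,
    "path_breadth": 0.15,
    "ja4_churn": 0.15,
    "vuln_probe_hits": 0.10,
}


def composite_score(components):
    """Weighted sum of components, each assumed pre-normalised to [0, 1]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Missing components count as 0; out-of-range values are clipped.
        value = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * value
    return total


def rank_candidates(rows, min_score=0.50):
    """Return (score, ip) pairs at or above the floor, highest score first."""
    scored = [(composite_score(parts), ip) for ip, parts in rows.items()]
    kept = [pair for pair in scored if pair[0] >= min_score]
    return sorted(kept, reverse=True)
```

With these weights, an IP needs a high block ratio plus at least one other strong signal to clear the 0.50 floor, which is one way to read "small by construction".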

format flag supports json (default operator shape), txt (one IP per line) and cidr (each IP as /32 or /128) — the text formats are text/plain so you can feed straight into ipset, iptables, or a Cloudflare IP list.

2026-05-28 internal

Bot-candidate auto-promote, auto-dismiss, de-promote & reopen

The candidate queue at /account/admin/bot-candidates is now self-maintaining. A daily 04:30 UTC job auto-promotes obvious bots that pass eight conservative gates (score ≥ 0.85, reqs ≥ 10k, distinct_ips ≥ 100, sites_seen ≥ 5, asset_ratio < 0.01, purchases = 0, age ≥ 48h, status = pending) into a single umbrella bot_group, "Auto-flagged scraper (discovery)", and auto-dismisses stale rows (last seen more than 14 days ago and score < 0.7) to keep the queue bounded.
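The gates above can be read as two simple predicates. The sketch below assumes a hypothetical Candidate record whose field names mirror the gate names; age_hours and days_since_last_seen are illustrative derived fields, and the real job may evaluate these conditions differently (for example in SQL).

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical record shape; names follow the gates in the entry above.
    score: float
    reqs: int
    distinct_ips: int
    sites_seen: int
    asset_ratio: float
    purchases: int
    age_hours: float
    status: str
    days_since_last_seen: float


def should_auto_promote(c):
    """True only when all eight gates pass."""
    return (
        c.score >= 0.85
        and c.reqs >= 10_000
        and c.distinct_ips >= 100
        and c.sites_seen >= 5
        and c.asset_ratio < 0.01
        and c.purchases == 0
        and c.age_hours >= 48
        and c.status == "pending"
    )


def should_auto_dismiss(c):
    """Stale and low-scoring: last seen over 14 days ago with score < 0.7."""
    return c.days_since_last_seen > 14 and c.score < 0.7
```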

The admin UI gains a header banner with live queue counts (pending / promoted-auto-vs-manual / dismissed) and two new per-row actions: De-promote (deletes the corresponding bot_observed_ja4s row, garbage-collects an orphaned per-candidate bot_group if any, sends the candidate back to pending) and Reopen (dismissed → pending for re-review).

Both automated paths are independently switchable via system_config: bot.discovery.auto_promote.enabled (default false) and bot.discovery.auto_dismiss.enabled (default true). An audit trail is kept on each promoted candidate (promotion_kind, promoted_to_bot_group_id, promoted_at) and on bot_groups created by promotion (discovery_kind = 'manual-from-candidate' or 'auto-umbrella').

2026-05-27 v1 · new

Site-level quick-check endpoints — /url-stats & /bot-traffic

Two new helper endpoints on the sites surface that package the most common investigation queries:

GET /api/v1/sites/{id}/url-stats?host=&path=&days=N — totals + status breakdown + per-day chart + a 25-row non-2xx sample for a specific URL. Answers "any rejections on URL X today?" without writing ClickHouse by hand.
GET /api/v1/sites/{id}/bot-traffic?bot_group=&days=N — recent traffic from a named bot_group (substring-match), with status mix and a sample of recent requests. Answers "are we blocking Google?" in one call. Bot identification matches every JA4 the discovery system has attributed to the bot_group via bot_observed_ja4s.

2026-05-26 v1 · new

Phase-level timings on /capture and monitor checks

/api/v1/capture's response.timings object now carries a phase breakdown: dnsMs, connectMs (TCP + TLS), requestSendMs, ttfbMs, bodyMs, plus the resolvedIp the socket actually landed on. Phase fields are omitted when their lifecycle hook didn't fire — pooled keep-alive reuse skips DNS / connect, and followRedirects=true only captures the first leg's timings.

Capture-kind URL monitors persist the same fields per check on monitor_checks and surface them on /monitors/{id}/checks + the monitor detail page. Extract-kind monitors don't get phase timings — HtmlUnit's connection layer doesn't expose them. While we were there: fixed the persisted check size_bytes column reading 0 on capture monitors (it was looking at sizeBytes in the response envelope, which is actually bodyBytes).

2026-05-26 v1 · new

URL monitors — up/down checks with webhook + email alerts

New /api/v1/monitors (plus /checks, /uptime, /pause, /resume) let you schedule a recurring capture or extract against any URL and get notified on state transitions. Same primitives as UptimeRobot — plus full headless-browser checks, JSON-API validation via the extract path, and the User-Agent personas from /user_agent_profiles.

Sub-minute polling on the Startup and Business plans (30 s minimum; 60 s on Developer, 300 s on Free). State transitions fire HMAC-SHA256-signed webhooks and email through the platform mailer. Flapping is debounced via alertFailureThreshold, the number of consecutive failures required before a monitor flips to down, so single-shot 502s don't page anyone. Check history is retained for 30 days; the sweep runs daily.
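To show what threshold-based debouncing means in practice, here is a small state machine. It is only a sketch: the class and method names are hypothetical, and it assumes a monitor recovers to up on the first successful check, which the entry does not state.

```python
class MonitorState:
    """Tracks up/down state with a consecutive-failure threshold."""

    def __init__(self, alert_failure_threshold=3):
        self.threshold = alert_failure_threshold
        self.consecutive_failures = 0
        self.state = "up"

    def record(self, check_ok):
        """Record one check; return 'down' or 'up' on a transition, else None."""
        if check_ok:
            self.consecutive_failures = 0
            if self.state == "down":
                # Assumption: a single success is enough to recover.
                self.state = "up"
                return "up"
            return None
        self.consecutive_failures += 1
        if self.state == "up" and self.consecutive_failures >= self.threshold:
            self.state = "down"
            return "down"
        return None
```

Only the non-None return values would trigger a webhook or email, so an isolated failure between two successes never produces an alert.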

Manage from the UI at /account/monitors — create, edit, view recent checks, see uptime over the last 30 days.

2026-05-26 v1 · new

User-Agent profiles for capture & extract

New GET /api/v1/user_agent_profiles plus two new fields on capture + extract requests — userAgent (raw string override) and userAgentProfile (key into the catalogue). The catalogue is an operator-managed MySQL table, seeded with the common browser personas (chrome-latest-mac, firefox-latest-win, edge-latest-win, safari-latest-mac, …) plus googlebot and the default urlcap.

On /extract the profile selects both the UA string AND the HtmlUnit BrowserVersion (Chrome / Firefox / Edge) so the in-page navigator.userAgent and the JS engine flavour line up. safari-* profiles still run on Chromium (HtmlUnit has no Safari engine), so navigator.vendor will read "Google Inc.". googlebot only sets the UA string; for cryptographic urlcap attribution use webBotAuth=true.

2026-05-23 v1 · new

Per-minute traffic-shape signals on the bot-JA4 blocklist

/api/v1/sites/{id}/bot-ja4s now returns reqs_per_minute_peak, reqs_per_minute_mean (over active minutes only), active_minutes and burstiness (coefficient of variation) per JA4. Useful for catching attack-shape JA4s that hide under a moderate asset_ratio but spike to thousands of req/min during their active windows — the case where score >= 0.70 and asset_ratio sits between 0.05 and 0.50 so neither Tier 1 nor Tier 2 fires.
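For intuition about these four fields, the sketch below derives them from a list of per-minute request counts. It assumes burstiness is the population standard deviation divided by the mean, both taken over active minutes only; the entry says "coefficient of variation" but does not specify the exact estimator, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def traffic_shape(per_minute_counts):
    """Summarise per-minute request counts for one JA4."""
    # Only minutes with at least one request count as active.
    active = [c for c in per_minute_counts if c > 0]
    if not active:
        return {
            "reqs_per_minute_peak": 0,
            "reqs_per_minute_mean": 0.0,
            "active_minutes": 0,
            "burstiness": 0.0,
        }
    avg = mean(active)
    return {
        "reqs_per_minute_peak": max(active),
        "reqs_per_minute_mean": avg,
        "active_minutes": len(active),
        # Coefficient of variation: spread relative to the mean.
        "burstiness": pstdev(active) / avg,
    }
```

A JA4 that sends a steady trickle has burstiness near 0, while one that idles at a few requests per minute and then spikes to thousands has a burstiness well above 1.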

2026-05-22 v1 · new

Bot-JA4 blocklist endpoint — one feed, edge-blocklist ready

New GET /api/v1/sites/{id}/bot-ja4s: returns every JA4 the discovery system has flagged on a site over a sliding window (1..90 days), labelled known_bot with bot-group attribution or candidate with discovery score. Browser-shaped fingerprints are excluded; the strings are the keys you feed into nginx, Cloudflare WAF, or any edge map. Each row also carries cross_customer_action — how many other sites have explicitly blocked, allowed, or challenged this JA4 via edge_action outcomes — plus block_likely_on_this_site inferred from your own status-code mix.

2026-05-22 v1 · new

Single-IP intelligence endpoint

GET /api/v1/ip/intelligence?ip=… composes every signal urlcap has on a single IP into one response — MaxMind geo, PTR records, bot-CIDR membership, every bot_observed_ips attribution row (cidr_match / ua_match / vuln_match / manual), trust-list status, and cross-site behavioural data over the last 1–7 days: request count, distinct JA4s/UAs/sites, status-code mix, asset ratio, vuln-probe hits, every JA4 with classification, top user agents, and cross-customer edge_action votes summed across every JA4 the IP has used. The operator-grade investigation endpoint.

2026-05-22 v1 · quality

Base62 site public ids + per-domain `kind`

Every site now carries a 12-char base62 publicId alongside its numeric id. Use it wherever a site appears in a URL — /api/v1/sites/2DrxGfsYW0jv/bot-ja4s, /api/v1/ingest/2DrxGfsYW0jv/events — to avoid leaking enumeration order to anyone holding a URL. The numeric form remains accepted. Domains now carry a kind hint (doc / asset / api); informational only for v1, but expect future signals to lean on it (assets-vs-docs ratio per host, etc.).

2026-05-21 v1 · new

Outcomes channel: tell urlcap what happened next

New POST /api/v1/ingest/{site}/outcomes accepts asynchronous verdicts keyed by the original event's request_id — kinds js_challenge / auth / purchase / edge_action. urlcap resolves the JA4 server-side at ingest time and denormalises it onto the outcome row so verdicts survive the 7-day raw-event TTL. Eight new outcome-derived signals surface on /api/v1/ja4/intelligence (js_challenge_attempts / passes / pass_rate, auth_observations, distinct_users, purchases, total_purchase_cents, last_purchase_at) plus ja4_ua_consistency — UA-spoofing tell where a JA4 keeps claiming different agent/OS values.

Plain-HTTP traffic (no TLS handshake, no JA4) now also flows through ingest — IP-keyed aggregates and vuln-probe / bot-discovery still see those rows; JA4-keyed materialised views filter them out.

2026-05-21 v1 · new

Vuln-probe attribution + trust list

Every ingested request is tagged at write-time with a vuln_probe_id if its path matches the curated vuln-scan registry (52 paths across 9 categories — exposed WP plugins, exposed .git, common dropper paths, etc.). IPs hitting three or more distinct probes are auto-attributed to the synthetic "Vulnerability scanner" bot_group; the discovery loop currently catches ~50–70 new scanner IPs per hour.

New per-user trust list of IPs / CIDRs / domains the discovery scan never flags — protect monitoring probes, office IPs, and partner integrations from being auto-classified as scanners. Trust applies globally (signal modulation) while ownership stays per-user (admin audit).

2026-05-20 v1 · new

JA4 intelligence: OS / agent / device classification + 25-metric snapshot

Four new read endpoints on top of the long-window aggregates:

/api/v1/ja4/intelligence — the rolled-up profile for one JA4 over 7 / 30 / 90 days. Returns the top OS / agent / device guess with per-dimension confidence (the score IS the confidence — apply your own threshold), diversity scores, and the new outcome-derived counts. Hourly recompute.
/api/v1/ja4/profile — drill into one dimension (os_name, agent_name, device_class, country, asn and 4 more), top-N values with HLL-merged distinct counts.
/api/v1/ja4/metrics — on-demand "prioritised 25" snapshot: trailing-hour rollup (counts + 10 ratios), optional IP+JA4 sub-block when ip= is supplied (with is_new_24h), and a top-10 JA4×UA breakdown over 24h with each UA's share of the JA4's volume.
/api/v1/ip/profile — per-IP behavioural summary on one site over N days (distinct JA4s/UAs/paths/hosts + heuristic / challenge counts). The proxy / NAT detection lookup.

2026-05-20 v1 · new

Bot discovery: auto-classify what your traffic actually contains

A background loop now ingests every request, attributes IPs and JA4s to known bot groups (CIDR match, UA self-identification, vuln-probe pattern), and queues unknown candidates for review. UA-self-identifying bots are auto-classified into new bot_groups — MistralBot, Meta-ExternalFetcher, DeepSeekBot and others have been added without human intervention. The bot-IP index rebuilds every hour and feeds /api/v1/is_bot; pending candidates surface in the admin queue.

2026-05-19 v1 · new

Customer sites + ingest channel + JA4 signals

The headline release: you can now ship request events from your edge to urlcap and get per-JA4 signals back. New site management surface:

/api/v1/sites — list and create sites (one per customer-facing property).
/api/v1/sites/{id}/domains — attach hostnames; UNIQUE across all sites (a hostname can only belong to one site).
/api/v1/sites/{id}/ingest_keys — mint per-site ingest tokens. Cleartext is returned exactly once at mint-time; urlcap only stores the SHA-256 hash.
POST /api/v1/ingest/{site}/events — NDJSON batches of request events, authed with the per-site token (not your urlcap API key). 1000 events / 4 MB per batch; partial failures never fail the batch.
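A client shipping events has to respect both batch limits. The generator below is one possible way to split a stream of events into NDJSON bodies; it is a sketch, the function name is hypothetical, and it reads "4 MB" conservatively as 4,000,000 bytes because the entry does not say whether MB or MiB is meant.

```python
import json

MAX_EVENTS = 1000
MAX_BYTES = 4_000_000  # assumption: the smaller reading of "4 MB"


def ndjson_batches(events, max_events=MAX_EVENTS, max_bytes=MAX_BYTES):
    """Yield NDJSON request bodies (bytes) that stay within both limits."""
    lines, size = [], 0
    for event in events:
        line = json.dumps(event, separators=(",", ":")).encode("utf-8") + b"\n"
        if len(line) > max_bytes:
            raise ValueError("single event exceeds the batch size limit")
        # Flush before adding if either limit would be exceeded.
        if lines and (len(lines) >= max_events or size + len(line) > max_bytes):
            yield b"".join(lines)
            lines, size = [], 0
        lines.append(line)
        size += len(line)
    if lines:
        yield b"".join(lines)
```

Each yielded body would be sent as one POST authenticated with the per-site ingest token.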

Reads land in /api/v1/ja4/signals: Cloudflare-style "what does this JA4 look like right now?" snapshot, recomputed every minute — 10 ratios (h2h3, browser, cache, error etc.), 4 per-site ranks, and 2 quantiles. Plus GeoIP (country / city / ASN) enrichment at ingest from MaxMind GeoLite2, also surfaced on /api/v1/is_bot, /api/v1/ip/lookup and /api/v1/ip/geo.

2026-05-16 v1 · new

Capture & extract can now sign outbound requests (Web Bot Auth)

Pass webBotAuth: true to /api/v1/capture or /api/v1/extract and urlcap signs every outbound HTTP request it makes with its own Ed25519 keypair — the same RFC 9421 / Web Bot Auth machinery /web_bot_auth/verify already understands. Target sites that allow known crawlers but block unknown bots can verify our identity against the JWKS published at /.well-known/http-message-signatures-directory and choose to allow urlcap-attributed traffic. Signature covers @authority, signature-agent, @method and @target-uri; 60-second lifetime.

2026-05-16 v1 · new

Web Bot Auth signature verification

New endpoint /api/v1/web_bot_auth/verify verifies an inbound HTTP request's RFC 9421 HTTP Message Signature against the operator's published Ed25519 key directory — the cryptographic identity check the Web Bot Auth draft proposes as a successor to IP-range and reverse-DNS checks. urlcap parses the Signature-Input + Signature + Signature-Agent headers, fetches the JWKS-style directory (TTL-cached 1 h), rebuilds the canonical signature base, and verifies with Ed25519 using the right keyid. Pairs naturally with /is_bot and /reverse_dns for a three-layer "is this bot really who it says?" check.

2026-05-16 v1 · new

robots.txt suite — fetch, allow/deny check, change-tracker with webhooks

Four new endpoints under /api/v1/robots: fetch + parse a site's robots.txt, decide whether a URL is allowed for a given user-agent per RFC 9309, and create per-user watches that re-fetch on a schedule and POST an HMAC-SHA256-signed notification whenever the content changes. Snapshots are only stored on change, so a watch's history is the file's actual edit log. Fetched bodies are TTL-cached in-process for 1 hour.
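A receiver of these change notifications would typically recompute the HMAC and compare it in constant time. The sketch below assumes the signature is the hex-encoded HMAC-SHA256 of the raw request body under a shared secret; the entry does not document the header name or the encoding, so treat both as assumptions and check the reference before relying on them.

```python
import hashlib
import hmac


def sign_body(secret, body):
    """Hex HMAC-SHA256 of the raw body bytes (assumed signature format)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret, body, received_hex):
    """Constant-time comparison of the expected and received signatures."""
    expected = sign_body(secret, body).encode("ascii")
    received = received_hex.strip().lower().encode("utf-8")
    return hmac.compare_digest(expected, received)
```

Verification has to run over the exact bytes received, before any JSON parsing or re-serialisation, otherwise whitespace differences will break the match.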

/api/v1/is_bot matches now carry a botGroup.honoursRobots block with four booleans — robots, crawlDelay, allow, sitemap — for whether each known bot has ever been caught skipping that aspect of robots.txt. Seeded conservatively: null = not researched yet, false = at least one credible violation report.

2026-05-15 v1 · new

Reverse DNS (PTR) with FCrDNS confirmation

New endpoint /api/v1/reverse_dns resolves IPv4 / IPv6 addresses to their PTR names. With forwardConfirm=true, every PTR is re-resolved (A + AAAA) and we report whether the original IP is in the answer — the standard FCrDNS check used to verify crawlers like Googlebot or Bingbot. Results are cached in-process for the upstream record's TTL (clamped 30s–1h; negative answers 5 minutes), and every response reports ttlSeconds remaining plus cached. /is_bot picked up a matching reverseDns=true option for the common "is this IP a real bot and what's its PTR?" case.
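The forward-confirmation step and the TTL clamp can be sketched as follows. This is an illustration only: resolve_forward is a hypothetical callable standing in for an A + AAAA lookup, and the function names are not part of the API.

```python
import ipaddress


def clamp_ttl(upstream_ttl, negative=False):
    """Cache lifetime in seconds: 30 s to 1 h, or 5 minutes for negative answers."""
    if negative:
        return 300
    return max(30, min(3600, upstream_ttl))


def forward_confirmed(ip, ptr_names, resolve_forward):
    """Map each PTR name to whether its forward lookup contains the original IP.

    resolve_forward(name) is assumed to return an iterable of address strings.
    """
    target = ipaddress.ip_address(ip)
    confirmed = {}
    for name in ptr_names:
        answers = {ipaddress.ip_address(a) for a in resolve_forward(name)}
        confirmed[name] = target in answers
    return confirmed
```

Comparing parsed address objects rather than strings avoids false negatives from formatting differences such as compressed versus expanded IPv6.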

2026-05-13 v1 · update

Bot detection: `date` & `historical`

/api/v1/is_bot learned two point-in-time / lifetime parameters: date (an ISO-8601 as-of date that answers "was this IP a known bot on this date?", including CIDRs that have since been retired) and historical=true (returns every CIDR ever published by any bot, regardless of whether it's currently active). Each match now carries addedAt, removedAt and active, so you can see when a CIDR was added and when it was retired. Same in-memory index, same sub-millisecond lookups; see the reference.
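The difference between the default, date and historical modes can be illustrated with a tiny matcher. This is a sketch under assumptions: the CidrRecord type and find_matches function are hypothetical, a CIDR is treated as retired from its removedAt date onwards, and a linear scan stands in for the real in-memory index.

```python
import ipaddress
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class CidrRecord:
    cidr: str
    added_at: date
    removed_at: Optional[date] = None  # None means still published


def find_matches(ip, records, as_of=None, historical=False):
    """Return matching records as dicts with addedAt, removedAt and active."""
    addr = ipaddress.ip_address(ip)
    hits = []
    for rec in records:
        if addr not in ipaddress.ip_network(rec.cidr):
            continue
        active_now = rec.removed_at is None
        if historical:
            keep = True
        elif as_of is not None:
            # Assumption: retired on removed_at itself, active strictly before.
            keep = rec.added_at <= as_of and (
                rec.removed_at is None or as_of < rec.removed_at
            )
        else:
            keep = active_now
        if keep:
            hits.append({
                "cidr": rec.cidr,
                "addedAt": rec.added_at,
                "removedAt": rec.removed_at,
                "active": active_now,
            })
    return hits
```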

2026-05-13 v1 · new

Bot detection — `/api/v1/is_bot`

One call to tell if an address is a known web crawler. GET /api/v1/is_bot?ip=66.249.66.1 (or ?ips=a,b,c, or a JSON body for batches up to 200) returns isBot plus, on a match, the bot's search engine, bot group, the matching CIDR, and its category codes (SEARCH_INDEXING, AI_TRAINING, AI_SEARCH_OR_ANSWERING, …) with per-link confidence. Backed by a daily-refreshed in-memory CIDR index — no DB hit per request. Covers Googlebot, Bingbot, Yandex, DuckDuckBot & DuckAssistBot, Applebot, GPTBot & ChatGPT-User & OAI-SearchBot, ClaudeBot, PerplexityBot & Perplexity-User, AhrefsBot, Amazonbot, CCBot, and more. See the reference.

2026-05-13 v1 · new

Datasets — `/api/v1/datasets`

Keep your own named, de-duplicated collections of IP / CIDR ranges or URLs: create a dataset, POST items to add, PUT to replace the whole set, DELETE items or the dataset itself, and ask GET /contains whether a value is in it (for IP datasets that includes "is this address covered by some CIDR in the dataset?"). Optional history mode keeps the items dropped by each replace, with the date they were deactivated — useful for tracking a set as it evolves. Per-plan caps for number of datasets, max items each, and whether history is allowed (free / developer / startup / business). See the reference.

2026-05-12 v1 · new

Extract: JSON content & JSONPath

The extract tool now handles JSON responses: point a job at a URL that returns JSON (a REST API, say) and pull out values with JSONPath — selector becomes a JSONPath expression (e.g. $.results[*].name), with the familiar value / list / items shapes. JSON is auto-detected from the response, or force it with "content": "json". See the JSON content reference.

2026-05-12 v1 · new

Scheduled tasks — `/api/v1/schedules`

Run a capture or an extract job at a future time — once (runAt) or on a recurring cron schedule (classic 5-field crontab, optional seconds field, @daily & co., evaluated in a time zone you choose). Each execution stores its full JSON result; fetch it from /api/v1/schedules/{id}/runs/{n}. Pause / resume / run-now / cancel, plus a Scheduled tasks page in your account. See the reference.

2026-05-12 v1 · new

Accounts & sign-in

Self-service accounts: email/password sign-up with verification, password reset, per-user API keys, plans & billing, and one-click Sign in with Google, GitHub or Microsoft. API requests now also accept Authorization: Bearer api_… per-user keys alongside the legacy X-API-Key.

2026-05-11 v1 · new

IP & CIDR tool — `/api/v1/ip`

Added IPv4/IPv6 tooling: /api/v1/ip/contains (is an address in a CIDR? — single or batch), /api/v1/ip/lookup (which of your stored ranges contain an address?), and /api/v1/ip/ranges to manage a list of named ranges. Stored in an optimised table — 16-byte IPv4-mapped keys, B-tree-indexed range bounds. See the IP & CIDR reference.
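To make the storage idea concrete, the sketch below maps both address families onto 16-byte keys (IPv4 as IPv4-mapped IPv6) and reduces a CIDR to a pair of range bounds, so containment becomes a byte-wise range comparison of the kind a B-tree index can serve. The function names are hypothetical and the actual table layout is not described beyond the sentence above.

```python
import ipaddress


def ip_key(ip):
    """16-byte sortable key; IPv4 is stored as ::ffff:a.b.c.d."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        addr = ipaddress.IPv6Address(f"::ffff:{addr}")
    return addr.packed


def range_bounds(cidr):
    """Lowest and highest key covered by a CIDR block."""
    net = ipaddress.ip_network(cidr, strict=False)
    return ip_key(str(net.network_address)), ip_key(str(net.broadcast_address))


def cidr_contains(cidr, ip):
    """Containment as a range check on fixed-width keys."""
    low, high = range_bounds(cidr)
    return low <= ip_key(ip) <= high
```

Because the keys have a fixed width, lexicographic byte order matches numeric address order, which is what lets a single index cover IPv4 and IPv6 ranges together.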

2026-05-11 v1 · new

Extract tool — `/api/v1/extract`

Added an async extract tool: submit a job model (load a URL, perform actions — fill / click / wait / navigate / select — and pull out data with CSS/XPath selectors, optionally across paginated pages) and get a taskId; poll GET /api/v1/extract/{taskId} for the result. Every HTTP request the engine performs while running a job is recorded against the task. See the Extract reference.

2026-05-11 v1 · new

Capture tool — `/api/v1/capture` & the workbench

Added the capture tool — send an HTTP request to any URL (with full control over method, headers and their order, body, redirects) and get the parsed response: status, headers, cookies, query parameters, body, timings. The home page is now a capture workbench. Every captured request and response (and all headers) is recorded. See the Capture reference.

2026-05-11 v1

Versioned API & the `/api/v1/totp` endpoint

Introduced the versioned API base at https://urlcap.com/api/v1. The first endpoint, /api/v1/totp, mirrors the existing TOTP functionality but returns a structured JSON envelope — { version, requestId, data } — including the code, digit count, period, algorithm and seconds-until-rotation. Accepts both GET and POST.

New consistent error envelope: { version, error: { type, message } } with 400 / 401 statuses.
The legacy plain-text endpoint at /auth is unchanged and will stay that way.
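For readers who want to cross-check the values the endpoint returns, here is a minimal RFC 6238 computation using only the standard library. It is a sketch, not the service's code: the function name and the keys of the returned dictionary are hypothetical stand-ins for the fields listed above, and the secret is passed as raw bytes (an otpauth URI carries it base32-encoded, so it would need decoding first).

```python
import hmac
import struct
import time


def totp(secret, now=None, digits=6, period=30, algorithm="sha1"):
    """Compute a TOTP code per RFC 6238 plus seconds until it rotates."""
    now = time.time() if now is None else now
    counter = int(now // period)
    # HOTP (RFC 4226) over the big-endian 8-byte time-step counter.
    mac = hmac.new(secret, struct.pack(">Q", counter), algorithm).digest()
    offset = mac[-1] & 0x0F
    binary = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    code = str(binary % 10 ** digits).zfill(digits)
    return {
        "code": code,
        "digits": digits,
        "period": period,
        "algorithm": algorithm.upper(),
        "secondsRemaining": period - int(now % period),
    }
```

The leading zeros matter: codes are fixed-width strings, not integers.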

2026-05-11 site

New developer site & API reference

Launched the redesigned urlcap site — a marketing landing page and a full API reference at /docs — with a dark-mode toggle, copy-to-clipboard code samples, and a responsive layout.

earlier legacy

TOTP at `/auth`

The original endpoint: GET /auth?uri=otpauth://totp/... with an X-API-Key header, returning the current code as plain text. Still available, frozen for compatibility.
