urlcap

Changelog

What's new in the urlcap API. Breaking changes ship under a new version path; everything additive lands within the current version. Subscribe to updates by emailing info@urlcap.com.

  1. v1 · new

    Candidate-IP feed — /sites/{id}/bot-ip-candidates

    The IP-side sibling of the JA4-side /bot-ja4s feed: a small list of IPs exhibiting abusive behaviour on your site, ranked by a 5-component composite score (block-ratio 40%, volume 20%, path breadth 15%, JA4 churn 15%, vuln-probe hits 10%). Already-classified IPs (in bot_observed_ips) and trust-list IPs are excluded from scoring entirely. The default score floor, min_score=0.50, keeps the list small by construction.
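
A minimal sketch of how a weighted composite like this could be computed. The weights come from the entry above; the component key names, and the assumption that each component is already normalised to the range 0 to 1, are illustrative and not confirmed by the changelog.

```python
# Weights as listed in the changelog entry; key names are hypothetical.
WEIGHTS = {
    "block_ratio": 0.40,
    "volume": 0.20,
    "path_breadth": 0.15,
    "ja4_churn": 0.15,
    "vuln_probe_hits": 0.10,
}


def composite_score(components: dict) -> float:
    """Weighted sum of components, each assumed pre-normalised to [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def passes_floor(components: dict, min_score: float = 0.50) -> bool:
    """True when the candidate clears the score floor (default 0.50)."""
    return composite_score(components) >= min_score


# Illustrative candidate: heavily blocked, moderate volume, no vuln probes.
example = {
    "block_ratio": 0.9,
    "volume": 0.5,
    "path_breadth": 0.2,
    "ja4_churn": 0.4,
    "vuln_probe_hits": 0.0,
}
```

With these made-up inputs the block-ratio term alone contributes 0.36, which shows how strongly that component dominates the ranking.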

    format flag supports json (default operator shape), txt (one IP per line) and cidr (each IP as /32 or /128) — the text formats are text/plain so you can feed straight into ipset, iptables, or a Cloudflare IP list.
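
For reference, the relationship between the txt and cidr shapes can be reproduced locally with Python's standard ipaddress module. This is a sketch of the conversion only; the helper names are hypothetical and nothing here calls the urlcap API.

```python
import ipaddress


def to_cidr(ip: str) -> str:
    """Render one address as a host network: /32 for IPv4, /128 for IPv6."""
    return str(ipaddress.ip_network(ip))


def txt_to_cidr(body: str) -> list:
    """Convert a one-IP-per-line text body into a list of host CIDRs."""
    return [to_cidr(line.strip()) for line in body.splitlines() if line.strip()]
```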

  2. internal

    Bot-candidate auto-promote, auto-dismiss, de-promote & reopen

    The candidate queue at /account/admin/bot-candidates is now self-maintaining. A daily 04:30 UTC job auto-promotes obvious bots that pass eight conservative gates (score ≥ 0.85, reqs ≥ 10k, distinct_ips ≥ 100, sites_seen ≥ 5, asset_ratio < 0.01, purchases = 0, age ≥ 48h, status = pending) into a single umbrella bot_group Auto-flagged scraper (discovery), and auto-dismisses stale rows (last_seen > 14 days ago + score < 0.7) to keep the queue bounded.
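
The gates above can be read as two simple predicates. The sketch below assumes a candidate is a plain dict; the key names age_hours and days_since_last_seen are hypothetical stand-ins for however the job actually measures age and staleness.

```python
def should_auto_promote(c: dict) -> bool:
    """All eight conservative gates must pass."""
    return (
        c["score"] >= 0.85
        and c["reqs"] >= 10_000
        and c["distinct_ips"] >= 100
        and c["sites_seen"] >= 5
        and c["asset_ratio"] < 0.01
        and c["purchases"] == 0
        and c["age_hours"] >= 48          # hypothetical field name
        and c["status"] == "pending"
    )


def should_auto_dismiss(c: dict) -> bool:
    """Stale and low-scoring: last seen more than 14 days ago and score < 0.7."""
    return c["days_since_last_seen"] > 14 and c["score"] < 0.7  # hypothetical field name


# Illustrative candidate that clears every promotion gate.
obvious_bot = {
    "score": 0.91, "reqs": 25_000, "distinct_ips": 340, "sites_seen": 7,
    "asset_ratio": 0.002, "purchases": 0, "age_hours": 72,
    "status": "pending", "days_since_last_seen": 0,
}
```

Because the gates are combined with AND, failing any single one (for example a single recorded purchase) keeps the candidate in the manual queue.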

    The admin UI gains a header banner with live queue counts (pending / promoted-auto-vs-manual / dismissed) and two new per-row actions: De-promote (deletes the corresponding bot_observed_ja4s row, garbage-collects an orphaned per-candidate bot_group if any, sends the candidate back to pending) and Reopen (dismissed → pending for re-review).

    Both automated paths are independently switchable via system_config: bot.discovery.auto_promote.enabled (default false) and bot.discovery.auto_dismiss.enabled (default true). An audit trail is recorded on each promoted candidate (promotion_kind, promoted_to_bot_group_id, promoted_at) and on bot_groups created by promotion (discovery_kind = 'manual-from-candidate' or 'auto-umbrella').

  3. v1 · new

    Site-level quick-check endpoints — /url-stats & /bot-traffic

    Two new helper endpoints on the sites surface that bottle up the most common investigation queries:

    • GET /api/v1/sites/{id}/url-stats?host=&path=&days=N — totals + status breakdown + per-day chart + a 25-row non-2xx sample for a specific URL. Answers "any rejections on URL X today?" without writing ClickHouse by hand.
    • GET /api/v1/sites/{id}/bot-traffic?bot_group=&days=N — recent traffic from a named bot_group (substring-match), with status mix and a sample of recent requests. Answers "are we blocking Google?" in one call. Bot identification matches every JA4 the discovery system has attributed to the bot_group via bot_observed_ja4s.

  4. v1 · new

    Phase-level timings on /capture and monitor checks

    /api/v1/capture's response.timings object now carries a phase breakdown: dnsMs, connectMs (TCP + TLS), requestSendMs, ttfbMs, bodyMs, plus the resolvedIp the socket actually landed on. Phase fields are omitted when their lifecycle hook didn't fire — pooled keep-alive reuse skips DNS / connect, and followRedirects=true only captures the first leg's timings.
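
Since phase fields can be absent, client code should not index them blindly. The sketch below uses the field names from the entry above and treats missing DNS and connect phases as a hint, not proof, that a pooled connection was reused; the helper name and the shape of its return value are assumptions.

```python
PHASES = ("dnsMs", "connectMs", "requestSendMs", "ttfbMs", "bodyMs")


def summarise_timings(timings: dict) -> dict:
    """Report which phases were captured without assuming any are present."""
    present = {p: timings[p] for p in PHASES if p in timings}
    return {
        "present": present,
        "missing": [p for p in PHASES if p not in timings],
        # Heuristic only: keep-alive reuse skips both DNS and connect.
        "likely_reused_connection": "dnsMs" not in timings and "connectMs" not in timings,
        "slowest_phase": max(present, key=present.get) if present else None,
    }


# Illustrative timings object from a request on a reused connection.
sample = {"requestSendMs": 1, "ttfbMs": 120, "bodyMs": 30, "resolvedIp": "192.0.2.10"}
```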

    Capture-kind URL monitors persist the same fields per check on monitor_checks and surface them on /monitors/{id}/checks + the monitor detail page. Extract-kind monitors don't get phase timings — HtmlUnit's connection layer doesn't expose them. While we were there: fixed the persisted check size_bytes column reading 0 on capture monitors (it was looking at sizeBytes in the response envelope, which is actually bodyBytes).

  5. v1 · new

    URL monitors — up/down checks with webhook + email alerts

    New /api/v1/monitors (plus /checks, /uptime, /pause, /resume) let you schedule a recurring capture or extract against any URL and get notified on state transitions. Same primitives as UptimeRobot — plus full headless-browser checks, JSON-API validation via the extract path, and the User-Agent personas from /user_agent_profiles.

    Sub-minute polling on paid plans (30 s minimum on Startup + Business, 60 s on Developer, 300 s on Free). State transitions fire HMAC-SHA256-signed webhooks AND email through the platform mailer. Flapping is debounced via alertFailureThreshold — consecutive failures required before flipping to down, so single-shot 502s don't page anyone. Check history retained 30 days; sweep runs daily.
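
One way to picture the alertFailureThreshold debounce is as a small state machine. This is a sketch under stated assumptions: the class name is hypothetical, and recovering to up on the first successful check is a guess, since the changelog only describes the down transition.

```python
class FlapDebouncer:
    """Flip to 'down' only after N consecutive failed checks."""

    def __init__(self, alert_failure_threshold: int, state: str = "up"):
        self.threshold = alert_failure_threshold
        self.state = state
        self.consecutive_failures = 0

    def observe(self, check_ok: bool):
        """Feed one check result; return the new state on a transition, else None."""
        if check_ok:
            self.consecutive_failures = 0
            if self.state == "down":
                self.state = "up"      # assumption: first success recovers
                return "up"
            return None
        self.consecutive_failures += 1
        if self.state == "up" and self.consecutive_failures >= self.threshold:
            self.state = "down"
            return "down"
        return None
```

With a threshold of 3, an isolated failure between successes never produces a transition, so no webhook or email would be sent for it.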

    Manage from the UI at /account/monitors — create, edit, view recent checks, see uptime over the last 30 days.

  6. v1 · new

    User-Agent profiles for capture & extract

    New GET /api/v1/user_agent_profiles plus two new fields on capture + extract requests — userAgent (raw string override) and userAgentProfile (key into the catalogue). The catalogue is an operator-managed MySQL table, seeded with the common browser personas (chrome-latest-mac, firefox-latest-win, edge-latest-win, safari-latest-mac, …) plus googlebot and the default urlcap.

    On /extract the profile selects both the UA string AND the HtmlUnit BrowserVersion (Chrome / Firefox / Edge) so the in-page navigator.userAgent and the JS engine flavour line up. safari-* profiles still run on Chromium (HtmlUnit has no Safari engine), so navigator.vendor will read Google Inc. The googlebot profile only sets the UA string; for cryptographic urlcap attribution use webBotAuth=true.

  7. v1 · new

    Per-minute traffic-shape signals on the bot-JA4 blocklist

    /api/v1/sites/{id}/bot-ja4s now returns reqs_per_minute_peak, reqs_per_minute_mean (over active minutes only), active_minutes and burstiness (coefficient of variation) per JA4. Useful for catching attack-shape JA4s that hide under a moderate asset_ratio but spike to thousands of req/min during their active windows — the case where score >= 0.70 and asset_ratio sits between 0.05 and 0.50 so neither Tier 1 nor Tier 2 fires.
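
To make the four signals concrete, here is a sketch that derives them from a list of per-minute request counts. It assumes that burstiness is the coefficient of variation taken over active minutes using the population standard deviation; the changelog does not say which minutes or which estimator the service uses.

```python
from statistics import mean, pstdev


def traffic_shape(per_minute_counts: list) -> dict:
    """Peak, mean over active minutes, active-minute count and CV."""
    active = [c for c in per_minute_counts if c > 0]
    if not active:
        return {"reqs_per_minute_peak": 0, "reqs_per_minute_mean": 0.0,
                "active_minutes": 0, "burstiness": 0.0}
    mu = mean(active)
    return {
        "reqs_per_minute_peak": max(active),
        "reqs_per_minute_mean": mu,
        "active_minutes": len(active),
        "burstiness": pstdev(active) / mu,  # coefficient of variation
    }
```

A JA4 that is silent most of the hour but sends thousands of requests in a few minutes shows a high peak relative to its mean and a burstiness well above zero, even when its asset_ratio looks unremarkable.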

  8. v1 · new

    Bot-JA4 blocklist endpoint — one feed, edge-blocklist ready

    New GET /api/v1/sites/{id}/bot-ja4s: returns every JA4 the discovery system has flagged on a site over a sliding window (1..90 days), labelled known_bot with bot-group attribution or candidate with discovery score. Browser-shaped fingerprints are excluded; the strings are the keys you feed into nginx, Cloudflare WAF, or any edge map. Each row also carries cross_customer_action — how many other sites have explicitly blocked, allowed, or challenged this JA4 via edge_action outcomes — plus block_likely_on_this_site inferred from your own status-code mix.

  9. v1 · new

    Single-IP intelligence endpoint

    GET /api/v1/ip/intelligence?ip=… composes every signal urlcap has on a single IP into one response — MaxMind geo, PTR records, bot-CIDR membership, every bot_observed_ips attribution row (cidr_match / ua_match / vuln_match / manual), trust-list status, and cross-site behavioural data over the last 1–7 days: request count, distinct JA4s/UAs/sites, status-code mix, asset ratio, vuln-probe hits, every JA4 with classification, top user agents, and cross-customer edge_action votes summed across every JA4 the IP has used. The operator-grade investigation endpoint.

  10. v1 · quality

    Base62 site public ids + per-domain kind

    Every site now carries a 12-char base62 publicId alongside its numeric id. Use it wherever a site appears in a URL — /api/v1/sites/2DrxGfsYW0jv/bot-ja4s, /api/v1/ingest/2DrxGfsYW0jv/events — to avoid leaking enumeration order to anyone holding a URL. The numeric form remains accepted. Domains now carry a kind hint (doc / asset / api); informational only for v1, but expect future signals to lean on it (assets-vs-docs ratio per host, etc.).
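
Client code that accepts either identifier form may want to tell them apart before building a URL. The sketch below assumes the publicId alphabet is the usual 0-9, A-Z, a-z set; the helper names are hypothetical. Note that an all-digit 12-character string is ambiguous, so digits-only input is treated as numeric here.

```python
import re

NUMERIC_ID = re.compile(r"[0-9]+")
PUBLIC_ID = re.compile(r"[0-9A-Za-z]{12}")  # assumed base62 alphabet


def classify_site_ref(ref: str) -> str:
    """Return 'numeric_id', 'public_id' or 'invalid' for a site reference."""
    if NUMERIC_ID.fullmatch(ref):
        return "numeric_id"
    if PUBLIC_ID.fullmatch(ref):
        return "public_id"
    return "invalid"


def bot_ja4s_path(ref: str) -> str:
    """Build the blocklist path for a validated site reference."""
    if classify_site_ref(ref) == "invalid":
        raise ValueError(f"not a site id: {ref!r}")
    return f"/api/v1/sites/{ref}/bot-ja4s"
```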

  11. v1 · new

    Outcomes channel: tell urlcap what happened next

    New POST /api/v1/ingest/{site}/outcomes accepts asynchronous verdicts keyed by the original event's request_id — kinds js_challenge / auth / purchase / edge_action. urlcap resolves the JA4 server-side at ingest time and denormalises it onto the outcome row so verdicts survive the 7-day raw-event TTL. Eight new outcome-derived signals surface on /api/v1/ja4/intelligence (js_challenge_attempts / passes / pass_rate, auth_observations, distinct_users, purchases, total_purchase_cents, last_purchase_at) plus ja4_ua_consistency — UA-spoofing tell where a JA4 keeps claiming different agent/OS values.

    Plain-HTTP traffic (no TLS handshake, no JA4) now also flows through ingest — IP-keyed aggregates and vuln-probe / bot-discovery still see those rows; JA4-keyed materialised views filter them out.

  12. v1 · new

    Vuln-probe attribution + trust list

    Every ingested request is tagged at write-time with a vuln_probe_id if its path matches the curated vuln-scan registry (52 paths across 9 categories — exposed WP plugins, exposed .git, common dropper paths, etc.). IPs hitting three or more distinct probes are auto-attributed to the synthetic "Vulnerability scanner" bot_group; the discovery loop currently catches ~50–70 new scanner IPs per hour.
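
The three-distinct-probes rule can be sketched as follows. The registry here is a made-up three-entry stand-in (the real one has 52 paths and is not listed in this changelog), and the probe ids and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical mini-registry mapping request path to vuln_probe_id.
REGISTRY = {"/.git/config": 1, "/wp-login.php": 2, "/.env": 3}


def scanner_ips(requests, min_distinct: int = 3) -> set:
    """IPs that hit at least `min_distinct` different registry probes.

    `requests` is an iterable of (ip, path) pairs.
    """
    probes_by_ip = defaultdict(set)
    for ip, path in requests:
        probe_id = REGISTRY.get(path)
        if probe_id is not None:
            probes_by_ip[ip].add(probe_id)
    return {ip for ip, probes in probes_by_ip.items() if len(probes) >= min_distinct}
```

Counting distinct probes, not raw hits, means a client that repeatedly requests one sensitive path is not classified as a scanner by this rule alone.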

    New per-user trust list of IPs / CIDRs / domains the discovery scan never flags — protect monitoring probes, office IPs, and partner integrations from being auto-classified as scanners. Trust applies globally (signal modulation) while ownership stays per-user (admin audit).

  13. v1 · new

    JA4 intelligence: OS / agent / device classification + 25-metric snapshot

    Four new read endpoints on top of the long-window aggregates:

    • /api/v1/ja4/intelligence — the rolled-up profile for one JA4 over 7 / 30 / 90 days. Returns the top OS / agent / device guess with per-dimension confidence (the score IS the confidence — apply your own threshold), diversity scores, and the new outcome-derived counts. Hourly recompute.
    • /api/v1/ja4/profile — drill into one dimension (os_name, agent_name, device_class, country, asn and 4 more), top-N values with HLL-merged distinct counts.
    • /api/v1/ja4/metrics — on-demand "prioritised 25" snapshot: trailing-hour rollup (counts + 10 ratios), optional IP+JA4 sub-block when ip= is supplied (with is_new_24h), and a top-10 JA4×UA breakdown over 24h with each UA's share of the JA4's volume.
    • /api/v1/ip/profile — per-IP behavioural summary on one site over N days (distinct JA4s/UAs/paths/hosts + heuristic / challenge counts). The proxy / NAT detection lookup.

  14. v1 · new

    Bot discovery: auto-classify what your traffic actually contains

    A background loop now ingests every request, attributes IPs and JA4s to known bot groups (CIDR match, UA self-identification, vuln-probe pattern), and queues unknown candidates for review. UA-self-identifying bots are auto-classified into new bot_groups — MistralBot, Meta-ExternalFetcher, DeepSeekBot and others have been added without human intervention. The bot-IP index rebuilds every hour and feeds /api/v1/is_bot; pending candidates surface in the admin queue.

  15. v1 · new

    Customer sites + ingest channel + JA4 signals

    The headline release: you can now ship request events from your edge to urlcap and get per-JA4 signals back. New site management surface:

    • /api/v1/sites — list and create sites (one per customer-facing property).
    • /api/v1/sites/{id}/domains — attach hostnames; UNIQUE across all sites (a hostname can only belong to one site).
    • /api/v1/sites/{id}/ingest_keys — mint per-site ingest tokens. Cleartext is returned exactly once at mint-time; urlcap only stores the SHA-256 hash.
    • POST /api/v1/ingest/{site}/events — NDJSON batches of request events, authed with the per-site token (not your urlcap API key). 1000 events / 4 MB per batch; partial failures never fail the batch.
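
A sender has to respect both batch limits. The sketch below splits an event stream into NDJSON bodies; it interprets "4 MB" conservatively as 4,000,000 bytes, which is an assumption, and the event fields other than request_id are hypothetical.

```python
import json

MAX_EVENTS = 1000
MAX_BYTES = 4_000_000  # assumption: conservative reading of "4 MB"


def ndjson_batches(events, max_events: int = MAX_EVENTS, max_bytes: int = MAX_BYTES):
    """Yield NDJSON request bodies that stay within both limits.

    A single event larger than max_bytes is still emitted, alone in its batch.
    """
    batch, size = [], 0
    for event in events:
        line = json.dumps(event, separators=(",", ":")).encode("utf-8") + b"\n"
        if batch and (len(batch) >= max_events or size + len(line) > max_bytes):
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(line)
        size += len(line)
    if batch:
        yield b"".join(batch)
```

Each yielded body would be sent as one POST, authenticated with the per-site ingest token.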

    Reads land in /api/v1/ja4/signals: Cloudflare-style "what does this JA4 look like right now?" snapshot, recomputed every minute — 10 ratios (h2h3, browser, cache, error etc.), 4 per-site ranks, and 2 quantiles. Plus GeoIP (country / city / ASN) enrichment at ingest from MaxMind GeoLite2, also surfaced on /api/v1/is_bot, /api/v1/ip/lookup and /api/v1/ip/geo.

  16. v1 · new

    Capture & extract can now sign outbound requests (Web Bot Auth)

    Pass webBotAuth: true to /api/v1/capture or /api/v1/extract and urlcap signs every outbound HTTP request it makes with its own Ed25519 keypair — the same RFC 9421 / Web Bot Auth machinery /web_bot_auth/verify already understands. Target sites that allow known crawlers but block unknown bots can verify our identity against the JWKS published at /.well-known/http-message-signatures-directory and choose to allow urlcap-attributed traffic. Signature covers @authority, signature-agent, @method and @target-uri; 60-second lifetime.
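
For readers unfamiliar with RFC 9421, the sketch below builds the signature base string for the four covered components named above, using only the standard library. It stops before the Ed25519 step, which needs a third-party crypto library. The Signature-Agent value and keyid shown in the usage are placeholders, and real signatures may carry further parameters (such as a tag or nonce) that this sketch omits.

```python
def signature_base(method: str, authority: str, target_uri: str,
                   signature_agent: str, created: int, keyid: str,
                   lifetime: int = 60):
    """Return (signature_base, signature_params) in RFC 9421 layout."""
    covered = '("@authority" "signature-agent" "@method" "@target-uri")'
    params = (f'{covered};created={created};'
              f'expires={created + lifetime};keyid="{keyid}"')
    lines = [
        f'"@authority": {authority.lower()}',     # authority is lowercased
        f'"signature-agent": {signature_agent}',  # header value, verbatim
        f'"@method": {method}',
        f'"@target-uri": {target_uri}',
        f'"@signature-params": {params}',         # always the last line
    ]
    return "\n".join(lines), params


base, params = signature_base(
    "GET", "Example.com", "https://example.com/page",
    '"https://urlcap.com"',  # placeholder Signature-Agent value
    1_700_000_000, "example-key-id",
)
```

The signer signs the bytes of base and sends params as the value of a Signature-Input entry; a verifier rebuilds the same string from the received request before checking the signature.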

  17. v1 · new

    Web Bot Auth signature verification

    New endpoint /api/v1/web_bot_auth/verify verifies an inbound HTTP request's RFC 9421 HTTP Message Signature against the operator's published Ed25519 key directory — the cryptographic identity check the Web Bot Auth draft proposes as a successor to IP-range and reverse-DNS checks. urlcap parses the Signature-Input + Signature + Signature-Agent headers, fetches the JWKS-style directory (TTL-cached 1 h), rebuilds the canonical signature base, and verifies with Ed25519 using the right keyid. Pairs naturally with /is_bot and /reverse_dns for a three-layer "is this bot really who it says?" check.

  18. v1 · new

    robots.txt suite — fetch, allow/deny check, change-tracker with webhooks

    Four new endpoints under /api/v1/robots: fetch + parse a site's robots.txt, decide whether a URL is allowed for a given user-agent per RFC 9309, and create per-user watches that re-fetch on a schedule and POST an HMAC-SHA256-signed notification whenever the content changes. Snapshots are only stored on change, so a watch's history is the file's actual edit log. Fetched bodies are TTL-cached in-process for 1 hour.
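
On the receiving side, an HMAC-SHA256-signed notification is typically checked by recomputing the MAC over the raw request body. This sketch assumes a hex-encoded signature computed over the exact body bytes with a shared secret; the changelog does not specify the header name, encoding, or signed payload, so treat those as assumptions.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (encoding is an assumption)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Constant-time comparison against the signature sent with the webhook."""
    return hmac.compare_digest(sign(secret, body), received_signature)
```

Verification must run on the bytes as received; re-serialising parsed JSON before hashing can change whitespace or key order and make a valid signature fail.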

    /api/v1/is_bot matches now carry a botGroup.honoursRobots block with four booleans — robots, crawlDelay, allow, sitemap — for whether each known bot has ever been caught skipping that aspect of robots.txt. Seeded conservatively: null = not researched yet, false = at least one credible violation report.

  19. v1 · new

    Reverse DNS (PTR) with FCrDNS confirmation

    New endpoint /api/v1/reverse_dns resolves IPv4 / IPv6 addresses to their PTR names. With forwardConfirm=true, every PTR is re-resolved (A + AAAA) and we report whether the original IP is in the answer — the standard FCrDNS check used to verify crawlers like Googlebot or Bingbot. Results are cached in-process for the upstream record's TTL (clamped 30s–1h; negative answers 5 minutes), and every response reports ttlSeconds remaining plus cached. /is_bot picked up a matching reverseDns=true option for the common "is this IP a real bot and what's its PTR?" case.
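
The forward-confirmation step and the TTL clamp can be sketched without touching the network by injecting the resolver. Both function names and the resolve callable are hypothetical; a real implementation would plug in an A + AAAA lookup.

```python
import ipaddress


def clamp_ttl(upstream_ttl: int, negative: bool = False) -> int:
    """Cache lifetime in seconds: 30s to 1h, or 5 minutes for negative answers."""
    if negative:
        return 300
    return max(30, min(upstream_ttl, 3600))


def forward_confirmed(ip: str, ptr_names: list, resolve) -> dict:
    """For each PTR name, report whether it resolves back to the original IP.

    `resolve(name)` must return an iterable of address strings (A + AAAA).
    """
    target = ipaddress.ip_address(ip)
    return {
        name: target in {ipaddress.ip_address(a) for a in resolve(name)}
        for name in ptr_names
    }


# Stand-in resolver backed by a dict, for illustration only.
fake_dns = {
    "crawler.example.com": ["192.0.2.10", "2001:db8::10"],
    "spoofed.example.net": ["203.0.113.99"],
}
```

A PTR record alone proves little, because whoever controls the IP's reverse zone can set it to any name; the check only means something when the name resolves back to the same address.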

  20. v1 · update

    Bot detection: date & historical

    /api/v1/is_bot learned two point-in-time / lifetime parameters: date (an ISO-8601 as-of-date — answers "was this IP a known bot on this date?", including CIDRs that have since been retired) and historical=true (returns every CIDR ever published by any bot, regardless of whether it's currently active). Each match now carries addedAt, removedAt and active, so you can see when a CIDR was retired and why. Same in-memory index, same sub-millisecond lookups — see the reference.
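
The as-of-date semantics can be illustrated with the addedAt and removedAt fields. This sketch assumes both are plain ISO dates (YYYY-MM-DD), that removedAt is absent or null for live CIDRs, and that a CIDR stops counting on its removal date; none of those boundary details are stated in the changelog.

```python
from datetime import date


def active_on(match: dict, as_of: date) -> bool:
    """Was this CIDR published on the given date?"""
    if as_of < date.fromisoformat(match["addedAt"]):
        return False
    removed = match.get("removedAt")
    # Assumption: the removal date itself is exclusive.
    return removed is None or as_of < date.fromisoformat(removed)


# Illustrative match rows, not real urlcap data.
retired = {"addedAt": "2023-01-10", "removedAt": "2024-03-01", "active": False}
current = {"addedAt": "2024-03-01", "removedAt": None, "active": True}
```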

  21. v1 · new

    Bot detection — /api/v1/is_bot

    One call to tell if an address is a known web crawler. GET /api/v1/is_bot?ip=66.249.66.1 (or ?ips=a,b,c, or a JSON body for batches up to 200) returns isBot plus, on a match, the bot's search engine, bot group, the matching CIDR, and its category codes (SEARCH_INDEXING, AI_TRAINING, AI_SEARCH_OR_ANSWERING, …) with per-link confidence. Backed by a daily-refreshed in-memory CIDR index — no DB hit per request. Covers Googlebot, Bingbot, Yandex, DuckDuckBot & DuckAssistBot, Applebot, GPTBot & ChatGPT-User & OAI-SearchBot, ClaudeBot, PerplexityBot & Perplexity-User, AhrefsBot, Amazonbot, CCBot, and more. See the reference.
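
Conceptually the lookup is a membership test of one address against a set of published CIDRs. The sketch below does a linear scan over documentation ranges with made-up group names; it is not how the service's in-memory index is built, and the cidr key in the result is a hypothetical field name.

```python
import ipaddress


def build_index(cidr_to_group: dict) -> list:
    """Parse CIDR strings once, up front."""
    return [(ipaddress.ip_network(c), g) for c, g in cidr_to_group.items()]


def is_bot(index: list, ip: str) -> dict:
    """Return the first matching network, if any (linear scan for clarity)."""
    addr = ipaddress.ip_address(ip)
    for net, group in index:
        if addr.version == net.version and addr in net:
            return {"isBot": True, "botGroup": group, "cidr": str(net)}
    return {"isBot": False}


# Documentation ranges with invented bot names, for illustration only.
index = build_index({
    "192.0.2.0/24": "ExampleBot",
    "2001:db8:1::/48": "ExampleBot",
    "198.51.100.0/25": "OtherBot",
})
```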

  22. v1 · new

    Datasets — /api/v1/datasets

    Keep your own named, de-duplicated collections of IP / CIDR ranges or URLs: create a dataset, POST items to add, PUT to replace the whole set, DELETE items or the dataset itself, and ask GET /contains whether a value is in it (for IP datasets that includes "is this address covered by some CIDR in the dataset?"). Optional history mode keeps the items dropped by each replace, with the date they were deactivated — useful for tracking a set as it evolves. Per-plan caps for number of datasets, max items each, and whether history is allowed (free / developer / startup / business). See the reference.

  23. v1 · new

    Extract: JSON content & JSONPath

    The extract tool now handles JSON responses: point a job at a URL that returns JSON (a REST API, say) and pull out values with JSONPath: selector becomes a JSONPath expression (e.g. $.results[*].name), with the familiar value / list / items shapes. JSON is auto-detected from the response, or force it with "content": "json". See the JSON content reference.
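
To show what an expression like $.results[*].name selects, here is a hand-written Python equivalent over an invented sample document. It does not use a JSONPath library or the extract API; it only mirrors the meaning of that one expression.

```python
import json

# Invented sample response body.
document = json.loads(
    '{"results": [{"name": "alpha", "id": 1}, {"name": "beta", "id": 2}, {"id": 3}]}'
)


def results_names(doc: dict) -> list:
    """Equivalent of $.results[*].name: the name of every results element that has one."""
    return [item["name"] for item in doc.get("results", []) if "name" in item]
```

Elements without a name member are skipped, since a JSONPath selection simply produces no node for them.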

  24. v1 · new

    Scheduled tasks — /api/v1/schedules

    Run a capture or an extract job at a future time — once (runAt) or on a recurring cron schedule (classic 5-field crontab, optional seconds field, @daily & co., evaluated in a time zone you choose). Each execution stores its full JSON result; fetch it from /api/v1/schedules/{id}/runs/{n}. Pause / resume / run-now / cancel, plus a Scheduled tasks page in your account. See the reference.

  25. v1 · new

    Accounts & sign-in

    Self-service accounts: email/password sign-up with verification, password reset, per-user API keys, plans & billing, and one-click Sign in with Google, GitHub or Microsoft. API requests now also accept Authorization: Bearer api_… per-user keys alongside the legacy X-API-Key.

  26. v1 · new

    IP & CIDR tool — /api/v1/ip

    Added IPv4/IPv6 tooling: /api/v1/ip/contains (is an address in a CIDR? — single or batch), /api/v1/ip/lookup (which of your stored ranges contain an address?), and /api/v1/ip/ranges to manage a list of named ranges. Stored in an optimised table — 16-byte IPv4-mapped keys, B-tree-indexed range bounds. See the IP & CIDR reference.
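
The 16-byte key idea can be sketched with the standard ipaddress module: IPv4 addresses are stored in their IPv4-mapped IPv6 form so both families share one sortable key space, and a range becomes a pair of lower and upper bounds. Function names are hypothetical and the actual table layout is not described here.

```python
import ipaddress


def key16(ip: str) -> bytes:
    """16-byte big-endian key; IPv4 is stored as ::ffff:a.b.c.d."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        addr = ipaddress.IPv6Address(f"::ffff:{addr}")
    return addr.packed


def range_bounds(cidr: str):
    """Lower and upper key for a CIDR, suitable for an indexed range scan."""
    net = ipaddress.ip_network(cidr, strict=False)
    return key16(str(net.network_address)), key16(str(net.broadcast_address))


def contains(cidr: str, ip: str) -> bool:
    """Byte-wise comparison works because all keys have equal length."""
    low, high = range_bounds(cidr)
    return low <= key16(ip) <= high
```

Fixed-width keys let a B-tree answer "which ranges contain this address?" with ordinary ordered comparisons on the two bound columns.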

  27. v1 · new

    Extract tool — /api/v1/extract

    Added an async extract tool: submit a job model (load a URL, perform actions — fill / click / wait / navigate / select — and pull out data with CSS/XPath selectors, optionally across paginated pages) and get a taskId; poll GET /api/v1/extract/{taskId} for the result. Every HTTP request the engine performs while running a job is recorded against the task. See the Extract reference.

  28. v1 · new

    Capture tool — /api/v1/capture & the workbench

    Added the capture tool — send an HTTP request to any URL (with full control over method, headers and their order, body, redirects) and get the parsed response: status, headers, cookies, query parameters, body, timings. The home page is now a capture workbench. Every captured request and response (and all headers) is recorded. See the Capture reference.

  29. v1

    Versioned API & the /api/v1/totp endpoint

    Introduced the versioned API base at https://urlcap.com/api/v1. The first endpoint, /api/v1/totp, mirrors the existing TOTP functionality but returns a structured JSON envelope — { version, requestId, data } — including the code, digit count, period, algorithm and seconds-until-rotation. Accepts both GET and POST.

    • New consistent error envelope: { version, error: { type, message } } with 400 / 401 statuses.
    • The legacy plain-text endpoint at /auth is unchanged and will stay that way.
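
For context, the values the endpoint reports can be reproduced locally with a standard RFC 6238 computation. This is a generic sketch, not urlcap's implementation; the dictionary keys are hypothetical and may not match the real data object.

```python
import hashlib
import hmac
import struct
import time


def totp(secret: bytes, now: float = None, digits: int = 6,
         period: int = 30, algorithm: str = "sha1") -> dict:
    """RFC 6238 TOTP code plus the metadata the changelog mentions."""
    now = time.time() if now is None else now
    counter = int(now // period)
    mac = hmac.new(secret, struct.pack(">Q", counter), algorithm).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation (RFC 4226)
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return {
        "code": str(value % 10 ** digits).zfill(digits),
        "digits": digits,
        "period": period,
        "algorithm": algorithm.upper(),
        "secondsRemaining": period - int(now % period),  # hypothetical key name
    }
```

The secret passed in is the raw key bytes; an otpauth:// URI carries it base32-encoded, so it would need decoding first.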

  30. site

    New developer site & API reference

    Launched the redesigned urlcap site — a marketing landing page and a full API reference at /docs — with a dark-mode toggle, copy-to-clipboard code samples, and a responsive layout.

  31. legacy

    TOTP at /auth

    The original endpoint: GET /auth?uri=otpauth://totp/... with an X-API-Key header, returning the current code as plain text. Still available, frozen for compatibility.

Looking for what's coming next? Custom HTTP request endpoints — full control over method, headers (including order), and body — are in progress.