url-ingest Overview

Turn URLs into structured, searchable context. Dedup-aware, cache-friendly, pluggable backends.

Pipeline

Detect — content type (HEAD / extension).
Fingerprint — MD5 of downloaded bytes (PDF / HTML) or metadata-only fingerprint (YouTube).
Dedup check — ask the storage backend (via /lookup) and verify url-ingest has the matching cost cache entries.
Process — route to the content-type-bound processing backend (OCR / parser / transcriber).
Store — ctx-storage (or your own storage backend) indexes markdown, chunks, and image blobs; returns credits used.
Serve — progressive-disclosure read + semantic search over the stored content.
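
The Detect and Fingerprint steps above can be sketched roughly as follows. This is a minimal illustration, not url-ingest's actual code: the function names, the extension table, the YouTube host list, and the choice to hash a canonical JSON rendering of the metadata for YouTube are all assumptions made for the example. Only the use of MD5 over the downloaded bytes for PDF and HTML comes from the description above.

```python
import hashlib
import json
from typing import Optional
from urllib.parse import urlparse

# Hypothetical lookup tables; the real detector may recognise more types.
EXTENSION_TYPES = {".pdf": "pdf", ".html": "html", ".htm": "html"}
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "youtu.be"}


def detect_content_type(url: str, head_content_type: Optional[str] = None) -> str:
    """Guess the content type from the host, a HEAD Content-Type value, or the extension."""
    parsed = urlparse(url)
    if parsed.hostname in YOUTUBE_HOSTS:
        return "youtube"
    if head_content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = head_content_type.split(";")[0].strip().lower()
        if mime == "application/pdf":
            return "pdf"
        if mime == "text/html":
            return "html"
    path = parsed.path.lower()
    for extension, kind in EXTENSION_TYPES.items():
        if path.endswith(extension):
            return kind
    return "unknown"


def fingerprint(content_type: str, *, body: Optional[bytes] = None,
                metadata: Optional[dict] = None) -> str:
    """Return a hex fingerprint for dedup purposes."""
    if content_type == "youtube":
        # Assumption: a metadata-only fingerprint is a hash of canonical JSON,
        # so key order in the metadata dict does not change the result.
        canonical = json.dumps(metadata or {}, sort_keys=True, separators=(",", ":"))
        return hashlib.md5(canonical.encode("utf-8")).hexdigest()
    if body is None:
        raise ValueError("PDF and HTML fingerprints need the downloaded bytes")
    return hashlib.md5(body).hexdigest()
```

In this sketch the fingerprint is only a dedup key, which is why a fast non-cryptographic use of MD5 is acceptable; it is not meant to guard against deliberately crafted collisions.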

Resource — one row per (content_md5, processing_backend_id); the same content processed by different backends is stored as separate resources, because each backend produces different output and is billed differently.
Access — each project that touches a resource gets a ResourceAccess row. Storage is only billed once per (project, resource, storage backend).
Cost cache — url-ingest caches the credits reported by platform backends, so dedup hits can be billed without re-running the pipeline.
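
One way to picture the Resource and Access rules above is as a pair of composite keys. The sketch below is illustrative only: `ResourceKey`, `Registry`, and `touch` are hypothetical names, and in-memory sets stand in for whatever tables url-ingest actually uses.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass(frozen=True)
class ResourceKey:
    """Resource identity: same bytes with a different backend is a different resource."""
    content_md5: str
    processing_backend_id: str


@dataclass
class Registry:
    # Hypothetical in-memory stand-ins for the Resource and ResourceAccess rows.
    resources: Set[ResourceKey] = field(default_factory=set)
    accesses: Set[Tuple[str, ResourceKey]] = field(default_factory=set)
    storage_billed: Set[Tuple[str, ResourceKey, str]] = field(default_factory=set)

    def touch(self, project_id: str, key: ResourceKey, storage_backend_id: str) -> bool:
        """Record an access; return True only the first time storage is billable."""
        self.resources.add(key)
        self.accesses.add((project_id, key))
        billing_key = (project_id, key, storage_backend_id)
        if billing_key in self.storage_billed:
            return False  # already billed for this (project, resource, storage backend)
        self.storage_billed.add(billing_key)
        return True
```

With this shape, a second request from the same project for the same resource and storage backend returns `False`, while a different project, or the same content routed through a different processing backend, is billable again.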

Self-deployed backends bypass the cost cache entirely — they pay the flat min_circulation_fee per billing event.
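
The billing split between platform and self-deployed backends might look like the sketch below. The function name, its parameters, and the choice to return `None` when no cache entry exists are assumptions for illustration; only the flat `min_circulation_fee` for self-deployed backends and the cached credits for platform backends come from the text above.

```python
from typing import Dict, Optional, Tuple

# Assumed cache key shape, mirroring the resource identity.
CacheKey = Tuple[str, str]  # (content_md5, processing_backend_id)


def credits_for_dedup_hit(
    self_deployed: bool,
    key: CacheKey,
    cost_cache: Dict[CacheKey, float],
    min_circulation_fee: float,
) -> Optional[float]:
    """Return the credits to bill for a dedup hit, or None if it cannot be billed from cache."""
    if self_deployed:
        # Self-deployed backends never consult the cost cache.
        return min_circulation_fee
    # Platform backends bill the credits reported on the original run.
    # A missing entry means the hit cannot be billed without re-running the pipeline.
    return cost_cache.get(key)
```

Returning `None` rather than a default amount keeps a missing cache entry visible to the caller, which matches the dedup check's requirement that matching cost cache entries exist before a hit is accepted.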