url-ingest Overview
Turn URLs into structured, searchable context. Dedup-aware, cache-friendly, pluggable backends.
Pipeline
- Detect — content type (HEAD / extension).
- Fingerprint — MD5 of downloaded bytes (PDF / HTML) or metadata-only fingerprint (YouTube).
- Dedup check — ask the storage backend (via /lookup) and verify url-ingest has the matching cost cache entries.
- Process — route to the content-type-bound processing backend (OCR / parser / transcriber).
- Store — ctx-storage (or your own storage backend) indexes markdown, chunks, and image blobs; returns credits used.
- Serve — progressive-disclosure read + semantic search over the stored content.
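The Fingerprint and Dedup check steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not url-ingest's actual code. The function names `fingerprint` and `dedup_check`, the `"youtube"` content-type label, hashing a video ID as the metadata-only fingerprint, and the in-memory `known` and `cost_cache` stand-ins for the `/lookup` call and the cost cache are all hypothetical.

```python
import hashlib
from typing import Dict, Optional, Set, Tuple

# Hypothetical key shape: (content_md5, processing_backend_id).
ResourceKey = Tuple[str, str]


def fingerprint(content_type: str, data: Optional[bytes] = None,
                metadata: Optional[str] = None) -> str:
    """Return an MD5 hex digest identifying the content behind a URL.

    PDF / HTML: hash the downloaded bytes.
    YouTube: hash metadata only (assumed here to be a video ID string).
    """
    if content_type == "youtube":
        if metadata is None:
            raise ValueError("metadata is required for a metadata-only fingerprint")
        return hashlib.md5(metadata.encode("utf-8")).hexdigest()
    if data is None:
        raise ValueError("downloaded bytes are required for a byte fingerprint")
    return hashlib.md5(data).hexdigest()


def dedup_check(content_md5: str, backend_id: str,
                known: Set[ResourceKey],
                cost_cache: Dict[ResourceKey, int]) -> bool:
    """Treat a lookup as a dedup hit only if both conditions hold:
    the storage backend knows the resource (stand-in for /lookup) and
    a matching cost cache entry exists on the url-ingest side.
    """
    key = (content_md5, backend_id)
    return key in known and key in cost_cache


# Usage: same bytes, two different processing backends.
md5 = fingerprint("pdf", data=b"example pdf bytes")
known = {(md5, "ocr-v1")}            # hypothetical backend IDs
cost_cache = {(md5, "ocr-v1"): 12}   # credits reported on the first run
hit = dedup_check(md5, "ocr-v1", known, cost_cache)
miss = dedup_check(md5, "parser-v2", known, cost_cache)
```

Here `hit` is true and `miss` is false: the same content processed by a different backend does not count as a duplicate, so the pipeline would run again for it.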
Key Concepts
- Resource — one row per (content_md5, processing_backend_id); different backends on the same content live as separate resources because they produce different output + billing.
- Access — each project that touches a resource gets a ResourceAccess row. Storage is only billed once per (project, resource, storage backend).
- Cost cache — platform backends cache their reported credits on url-ingest's side so dedup hits can bill without re-running the pipeline.
Self-deployed backends bypass the cost cache entirely and instead pay the flat min_circulation_fee per billing event.
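The billing rules described above can be modeled with a small in-memory ledger. This is a sketch under assumptions rather than the real schema: the `Ledger` class, its method names, and the numeric fee are hypothetical. Only the key shapes, (content_md5, processing_backend_id) for a resource and (project, resource, storage backend) for storage billing, and the `min_circulation_fee` name come from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

ResourceKey = Tuple[str, str]                # (content_md5, processing_backend_id)
StorageKey = Tuple[str, ResourceKey, str]    # (project, resource, storage backend)


@dataclass
class Ledger:
    """Hypothetical in-memory model of the cost cache and storage billing."""
    min_circulation_fee: int = 1             # placeholder value, not a real fee
    cost_cache: Dict[ResourceKey, int] = field(default_factory=dict)
    storage_billed: Set[StorageKey] = field(default_factory=set)

    def record_credits(self, key: ResourceKey, credits: int) -> None:
        """Cache the credits a platform backend reported for this resource."""
        self.cost_cache[key] = credits

    def processing_charge(self, key: ResourceKey,
                          self_deployed: bool) -> Optional[int]:
        """Return the charge for a billing event.

        Self-deployed backends skip the cache and pay the flat fee.
        Platform backends bill from the cache; None means there is no
        entry, so the pipeline has to run before anything can be billed.
        """
        if self_deployed:
            return self.min_circulation_fee
        return self.cost_cache.get(key)

    def storage_charge(self, project: str, key: ResourceKey,
                       storage_backend: str, credits: int) -> int:
        """Bill storage once per (project, resource, storage backend)."""
        billing_key = (project, key, storage_backend)
        if billing_key in self.storage_billed:
            return 0
        self.storage_billed.add(billing_key)
        return credits


# Usage: one resource touched by two projects.
ledger = Ledger(min_circulation_fee=2)
resource = ("abc123", "ocr-v1")              # hypothetical identifiers
ledger.record_credits(resource, 12)
first = ledger.storage_charge("project-a", resource, "ctx-storage", 5)
repeat = ledger.storage_charge("project-a", resource, "ctx-storage", 5)
other = ledger.storage_charge("project-b", resource, "ctx-storage", 5)
```

In this sketch `first` and `other` are charged in full while `repeat` is zero, which mirrors the once-per-(project, resource, storage backend) rule.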