url-ingest Overview

Turn URLs into structured, searchable context. Dedup-aware, cache-friendly, pluggable backends.

Pipeline

  • Detect: content type (HEAD / extension).
  • Fingerprint: MD5 of downloaded bytes (PDF / HTML) or metadata-only fingerprint (YouTube).
  • Dedup check: ask the storage backend (via /lookup) and verify url-ingest has the matching cost cache entries.
  • Process: route to the content-type-bound processing backend (OCR / parser / transcriber).
  • Store: ctx-storage (or your own storage backend) indexes markdown, chunks, and image blobs; returns credits used.
  • Serve: progressive-disclosure read + semantic search over the stored content.
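
The fingerprint and dedup-check steps above can be sketched roughly as follows. This is a minimal illustration, not url-ingest's actual code: the function names are hypothetical, the metadata fields used for the YouTube case are an assumption (the overview does not list them), and the two boolean inputs stand in for the /lookup response and the cost cache check.

```python
import hashlib
import json


def fingerprint_bytes(data: bytes) -> str:
    """MD5 hex digest of the downloaded bytes (PDF / HTML case)."""
    return hashlib.md5(data).hexdigest()


def fingerprint_metadata(metadata: dict) -> str:
    """Metadata-only fingerprint (YouTube case).

    Assumption: the fingerprint is an MD5 over a canonical JSON encoding
    of the metadata. Keys are sorted so field order does not matter.
    """
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def is_dedup_hit(found_in_storage: bool, has_cost_cache_entries: bool) -> bool:
    """A dedup hit needs both conditions from the pipeline:
    the storage backend knows the content, and url-ingest holds
    the matching cost cache entries."""
    return found_in_storage and has_cost_cache_entries
```

If either condition fails, this sketch treats the URL as a miss and the pipeline would continue with the process step.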

Key Concepts

  • Resource: one row per (content_md5, processing_backend_id). Different backends run on the same content live as separate resources because they produce different output and billing.
  • Access: each project that touches a resource gets a ResourceAccess row. Storage is billed only once per (project, resource, storage backend).
  • Cost cache: platform backends cache their reported credits on url-ingest's side so dedup hits can bill without re-running the pipeline. Self-deployed backends bypass the cost cache entirely; they pay the flat min_circulation_fee per billing event.
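
The billing rules in these concepts can be modeled with a small in-memory sketch. Everything here is illustrative: the Ledger class, the dedup_hit_charge function, and their parameters are hypothetical names, and the real schema and credit amounts are not described in this overview. Only the keys, (content_md5, processing_backend_id) for a resource and (project, resource, storage backend) for storage billing, come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# A resource is identified by (content_md5, processing_backend_id).
ResourceKey = Tuple[str, str]


@dataclass
class Ledger:
    """Tracks which (project, resource, storage backend) triples were billed."""
    billed: Set[Tuple[str, ResourceKey, str]] = field(default_factory=set)

    def storage_charge(self, project_id: str, resource: ResourceKey,
                       storage_backend_id: str, credits: int) -> int:
        """Return the storage credits to bill; 0 if this triple was already billed."""
        key = (project_id, resource, storage_backend_id)
        if key in self.billed:
            return 0
        self.billed.add(key)
        return credits


def dedup_hit_charge(resource: ResourceKey,
                     cost_cache: Dict[ResourceKey, int],
                     self_deployed: bool,
                     min_circulation_fee: int) -> int:
    """Credits to bill on a dedup hit without re-running the pipeline.

    Platform backends use the cached credits; self-deployed backends
    skip the cache and pay the flat fee.
    """
    if self_deployed:
        return min_circulation_fee
    return cost_cache[resource]
```

For example, two projects touching the same resource would each be billed for storage once, while a second ingest from the same project on the same storage backend would add no further storage charge.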