ADR-0001 — Semantic cache implemented in the wrapper
Status: accepted
Date: 2026-06-01
Context
Layer 4 costs (LLM calls) scale with N sessions × N tokens and are the only cost driver
that grows without bound as usage increases. Cortex provides no native semantic caching:
identical or near-identical prompts from different users each trigger a full model call.
On repetitive workloads (e.g. PLUTO_SCHOOL: "explain the Pythagorean theorem" asked by dozens of students) this means N LLM calls for what is semantically one question.
Decision
PinkyAI implements a semantic cache layer on top of AI_EMBED + VECTOR_COSINE_SIMILARITY.
- Each prompt is embedded before the Cortex call.
- The cache table (DB_{APP}.{client}.AI_CACHE) is queried for a vector with cosine similarity ≥ threshold (default 0.92).
- On hit: return the stored response (0 tokens consumed).
- On miss: call Cortex, store (embedding, response, model) in the cache.
Cache tables live in each client's schema, so they are dropped automatically when the client schema is dropped at unsubscription. No cross-client data bleed.
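
The hit/miss flow described above can be sketched in plain Python. This is a minimal in-memory illustration, not PinkyAI's actual implementation: `SemanticCache`, `embed`, `call_model` and the placeholder model name are hypothetical, and the `rows` list only stands in for one client's AI_CACHE table. In the real wrapper, the embedding and the similarity search would presumably be delegated to AI_EMBED and VECTOR_COSINE_SIMILARITY inside Snowflake.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SemanticCache:
    """Hypothetical stand-in for one client's AI_CACHE table."""

    embed: Callable[[str], Vector]    # assumption: would wrap AI_EMBED
    call_model: Callable[[str], str]  # assumption: would wrap the Cortex call
    model: str = "example-model"      # placeholder, not a real model name
    threshold: float = 0.92           # default threshold from this ADR
    rows: List[Tuple[Vector, str, str]] = field(default_factory=list)

    def ask(self, prompt: str, threshold: Optional[float] = None) -> Tuple[str, bool]:
        """Return (response, was_hit); callers may override the threshold."""
        limit = self.threshold if threshold is None else threshold
        vector = self.embed(prompt)  # every prompt is embedded first
        best = max(
            self.rows,
            key=lambda row: cosine_similarity(vector, row[0]),
            default=None,
        )
        if best is not None and cosine_similarity(vector, best[0]) >= limit:
            return best[1], True  # hit: stored response, no model call
        response = self.call_model(prompt)
        # miss: store (embedding, response, model) for later lookups
        self.rows.append((vector, response, self.model))
        return response, False
```

Keeping one `SemanticCache` instance per client in this sketch mirrors the per-client table: a lookup can only ever see rows written for the same client.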
Consequences
- Estimated cache hit rate of about 40% on repetitive educational workloads.
- Threshold 0.92 is the default; callers can override per use case (lower = more hits, higher = stricter match).
- This ADR is explicitly temporary: if Snowflake ships native semantic caching in Cortex,
the cache layer is removed and PinkyAI delegates directly.
- Adds one AI_EMBED call on top of the Cortex call for every cache miss (amortised over hits,
where the embedding replaces the model call).
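
To make the cost trade-off in these consequences concrete, the sketch below estimates the average cost per prompt with the cache in place and the hit rate at which the cache breaks even. The function names and the unit costs in the usage note are illustrative assumptions, not measured PinkyAI or Snowflake figures. Only the roughly 40% hit rate comes from this ADR.

```python
def cost_per_prompt(hit_rate: float, embed_cost: float, model_cost: float) -> float:
    """Average cost per prompt with the cache in place.

    Every prompt pays for one embedding (the lookup needs it);
    only cache misses also pay for the model call.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return embed_cost + (1.0 - hit_rate) * model_cost


def break_even_hit_rate(embed_cost: float, model_cost: float) -> float:
    """Hit rate above which the cache is cheaper than calling the model directly."""
    if model_cost <= 0.0:
        raise ValueError("model_cost must be positive")
    return embed_cost / model_cost
```

With hypothetical unit costs of 1 for an embedding and 50 for a model call, `cost_per_prompt(0.4, 1.0, 50.0)` comes to about 31 units against 50 without the cache, and `break_even_hit_rate(1.0, 50.0)` is 0.02. Under those assumed costs the cache pays for itself once more than 2% of prompts are hits.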