ADR-0001 — Semantic cache implemented in the wrapper

Status: accepted Date: 2026-06-01

Context

Layer 4 costs (LLM calls) scale as × N sessions × N tokens and are the only cost lever that grows unbounded as usage increases. Cortex provides no native semantic caching: identical or near-identical prompts from different users each trigger a full model call.

On repetitive workloads (e.g. PLUTO_SCHOOL: "explain the Pythagorean theorem" asked by dozens of students) this means N LLM calls for what is semantically one question.

Decision

PinkyAI implements a semantic cache layer on top of AI_EMBED + VECTOR_COSINE_SIMILARITY.

Each prompt is embedded before the Cortex call.
The cache table (DB_{APP}.{client}.AI_CACHE) is queried for a vector with cosine similarity ≥ threshold (default 0.92).
On hit: return the stored response — 0 tokens consumed.
On miss: call Cortex, store (embedding, response, model) in the cache.

Cache tables are per-client schema: they drop automatically when the client schema is dropped at unsubscription. No cross-client data bleed.

Consequences

Estimated ~40% cache hit rate on repetitive educational workloads.
Threshold 0.92 is the default; callers can override per use case (lower = more hits, higher = stricter match).
This ADR is explicitly temporary: if Snowflake ships native semantic caching in Cortex, the cache layer is removed and PinkyAI delegates directly.
Adds one AI_EMBED call per cache miss (amortised over hits).