Design — pinky-ai

Last updated: 2026-06-01 08:26

pinky-ai wraps all Snowflake Cortex AI functions behind a single PinkyAI class that adds semantic caching, model routing, and token compression — the three levers that prevent Layer 4 costs from scaling unbounded with usage.


Placement in the suite

pinky-ai is a standalone package. It requires a Snowpark session only when backend="cortex"; the "ollama" and "groq" backends run without one. It is not part of pinky-snowpark because AI concerns are orthogonal to Snowpark data transformation helpers.
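The session rule above can be sketched as a small validation step. This is a minimal illustration, not the package's actual code: the function name validate_backend and the two constant sets are hypothetical, and only the backend names "cortex", "ollama" and "groq" come from this document.

```python
from typing import Any, Optional

# Backend names taken from the text; the grouping into sets is an assumption.
SESSION_BACKENDS = {"cortex"}
SESSIONLESS_BACKENDS = {"ollama", "groq"}


def validate_backend(backend: str, session: Optional[Any] = None) -> str:
    """Return the backend name if the session requirement is satisfied.

    Hypothetical helper: "cortex" needs a Snowpark session object,
    while "ollama" and "groq" are accepted with or without one.
    """
    if backend in SESSION_BACKENDS:
        if session is None:
            raise ValueError('backend="cortex" requires a Snowpark session')
        return backend
    if backend in SESSIONLESS_BACKENDS:
        return backend
    raise ValueError(f"unknown backend: {backend!r}")
```

Failing early at construction time, rather than at the first AI call, keeps a missing session from surfacing as a late runtime error.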

Every feature in this package exists because Cortex does not provide it natively. When Snowflake adds native semantic caching or model routing, the corresponding layer will be removed. See the Note in ADR-0001.


Layer 4 cost model

Layer 0-3  →  fixed or amortised costs
Layer 4    →  variable: × N sessions × N tokens (the only layer whose cost scales unbounded)

pinky-ai intercepts Layer 4 with three levers:

Lever             | Mechanism                                      | Estimated saving
Semantic cache    | cosine-similarity on AI_EMBED, threshold 0.92  | ~40% fewer Cortex calls
Model router      | task_type → cheapest adequate model            | ~25% average cost reduction
Token compression | summarise oldest 2/3 of history at max_tokens  | ~45% input token reduction
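The three mechanisms in the table can be sketched in a few lines each. This is an illustration under stated assumptions, not the PinkyAI implementation: all class and function names are hypothetical, embeddings are passed in as plain lists (in the package they would come from AI_EMBED), the routing table uses placeholder task types and model names, and "oldest 2/3" is interpreted by message count rather than by token count.

```python
import math
from typing import Callable, Optional

SIMILARITY_THRESHOLD = 0.92  # threshold stated in the table


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCacheSketch:
    """Hypothetical in-memory cache keyed by embedding similarity."""

    def __init__(self, threshold: float = SIMILARITY_THRESHOLD) -> None:
        self.threshold = threshold
        self._entries: list[tuple[list[float], str]] = []

    def store(self, embedding: list[float], answer: str) -> None:
        self._entries.append((list(embedding), answer))

    def lookup(self, embedding: list[float]) -> Optional[str]:
        """Return the most similar cached answer, or None below the threshold."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


# Placeholder routing table: task types and model names are invented here.
ROUTES = {
    "classify": "small-model",
    "summarise": "medium-model",
    "reason": "large-model",
}


def route_model(task_type: str, default: str = "large-model") -> str:
    """Pick the cheapest model assumed adequate for the task type."""
    return ROUTES.get(task_type, default)


def compress_history(
    messages: list[str],
    token_count: Callable[[str], int],
    summarise: Callable[[list[str]], str],
    max_tokens: int,
) -> list[str]:
    """Once the history reaches max_tokens, replace the oldest two
    thirds of the messages with a single summary message."""
    total = sum(token_count(m) for m in messages)
    cut = (2 * len(messages)) // 3
    if total < max_tokens or cut == 0:
        return messages
    return [summarise(messages[:cut])] + messages[cut:]
```

In this sketch a cache hit skips the model call entirely, the router only matters on a miss, and compression only affects the prompt that is eventually sent, which is why the three savings apply at different points of the same request.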

Combined on repetitive workloads: ~75% reduction on Layer 4. Source: 06_pinky_ai design doc.
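One way to see where a figure near 75% can come from is to compose the three estimates multiplicatively. This is a back-of-the-envelope check, not the method of the cited design doc: it assumes the levers act independently and that the input token reduction translates directly into cost, neither of which is stated in the source.

```python
from typing import Iterable


def combined_reduction(savings: Iterable[float]) -> float:
    """Compose independent fractional savings: 1 - product of (1 - s)."""
    remaining = 1.0
    for s in savings:
        remaining *= 1.0 - s
    return 1.0 - remaining


# Estimated savings from the table, as fractions.
levers = {
    "semantic_cache": 0.40,
    "model_router": 0.25,
    "token_compression": 0.45,
}

total = combined_reduction(levers.values())
print(f"{total:.4f}")  # 0.7525, i.e. 1 - 0.60 * 0.75 * 0.55
```

Under these assumptions the remaining cost is 0.60 × 0.75 × 0.55 ≈ 0.25 of the baseline, which is consistent with the ~75% quoted for repetitive workloads.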


Key decisions

  • ADR-0001 — semantic cache
  • ADR-0002 — backend abstraction
  • ADR-0003 — model sovereignty (Mistral EU, no Meta)
  • ADR-0004 — Cortex (0 WH) for sessions; EAI for batch