Design — pinky-ai

Last updated: 2026-06-01 08:26

pinky-ai wraps all Snowflake Cortex AI functions behind a single PinkyAI class that adds semantic caching, model routing, and token compression — the three levers that prevent Layer 4 costs from scaling unbounded with usage.

Placement in the suite

pinky-ai is a standalone package. It requires a Snowpark session when backend="cortex" but runs without one when the backend is "ollama" or "groq". It is not part of pinky-snowpark because AI concerns are orthogonal to Snowpark data transformation helpers.
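The session requirement can be illustrated with a small validation helper. This is a minimal sketch only: the backend names come from the text, while `check_backend` and the two set constants are hypothetical names and not part of the documented PinkyAI API.

```python
from typing import Optional

# Backend names are taken from the text; the grouping and the helper
# below are hypothetical illustrations.
SESSION_BACKENDS = {"cortex"}
SESSIONLESS_BACKENDS = {"ollama", "groq"}


def check_backend(backend: str, session: Optional[object] = None) -> str:
    """Validate a backend choice against the Snowpark session requirement."""
    if backend in SESSION_BACKENDS:
        if session is None:
            raise ValueError(f'backend="{backend}" requires a Snowpark session')
        return backend
    if backend in SESSIONLESS_BACKENDS:
        # These backends can run with no Snowpark session at all.
        return backend
    raise ValueError(f"unknown backend: {backend!r}")
```

Failing fast at construction time, rather than on the first call, is one possible design choice; the source does not say where the real package performs this check.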

Every feature in this package exists because Cortex does not provide it natively. Once Snowflake adds native semantic caching or model routing, the corresponding layer will be removed. See the Note in ADR-0001.

Layer 4 cost model

Layers 0-3  →  fixed or amortised costs
Layer 4     →  variable: × number of sessions × number of tokens; the only layer whose cost scales unbounded

pinky-ai intercepts Layer 4 with three levers:

Lever	Mechanism	Estimated saving
Semantic cache	cosine-similarity on `AI_EMBED`, threshold 0.92	~40% fewer Cortex calls
Model router	`task_type` → cheapest adequate model	~25% average cost reduction
Token compression	summarise oldest 2/3 of history when it reaches `max_tokens`	~45% input token reduction
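The three mechanisms in the table can be sketched in a few lines each. This is an illustration under stated assumptions, not the package's implementation: the 0.92 threshold, `task_type`, `max_tokens`, and the 2/3 split come from the table, while `SemanticCache`, `route`, `compress_history`, the model names, and the callables `count_tokens` and `summarise` are hypothetical. Embeddings are passed in as plain lists of floats; in the real package they would presumably come from `AI_EMBED`.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

SIMILARITY_THRESHOLD = 0.92  # threshold from the lever table


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical in-memory cache keyed by prompt embeddings."""

    def __init__(self, threshold: float = SIMILARITY_THRESHOLD) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: List[float], answer: str) -> None:
        self.entries.append((embedding, answer))

    def get(self, embedding: List[float]) -> Optional[str]:
        """Return the closest cached answer if it clears the threshold."""
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(embedding, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


# Hypothetical routing table; the model names are placeholders.
ROUTES: Dict[str, str] = {"classify": "small-model", "summarise": "medium-model"}
DEFAULT_MODEL = "large-model"


def route(task_type: str) -> str:
    """Map a task type to the cheapest model assumed adequate for it."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


def compress_history(
    history: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int],
    summarise: Callable[[List[str]], str],
) -> List[str]:
    """Replace the oldest 2/3 of messages with one summary once the
    history reaches max_tokens; otherwise return it unchanged."""
    if sum(count_tokens(m) for m in history) < max_tokens:
        return history
    cut = (2 * len(history)) // 3
    if cut == 0:
        return history
    return [summarise(history[:cut])] + history[cut:]
```

A linear scan over cached embeddings is only reasonable for small caches; a production cache would more likely use a vector index, which the source does not describe.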

Combined estimate on repetitive workloads: ~75% reduction in Layer 4 cost. Source: 06_pinky_ai design doc.
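One way to arrive at a figure close to ~75% is to treat the three estimated savings as independent and multiplicative. The source does not state how the combined number was derived, so this is an assumption. It also treats the input-token reduction as a reduction in total cost, which makes the result an upper bound.

```python
# Estimated savings from the lever table. Independence of the three
# levers is an assumption made for this sketch.
cache_saving = 0.40        # fraction of Cortex calls avoided
router_saving = 0.25       # average cost reduction per remaining call
compression_saving = 0.45  # input token reduction (treated as cost here)

# Fraction of the original Layer 4 cost that remains after all three.
remaining = (1 - cache_saving) * (1 - router_saving) * (1 - compression_saving)
combined_saving = 1 - remaining

print(f"remaining cost: {remaining:.0%}, combined saving: {combined_saving:.0%}")
```

Under these assumptions about a quarter of the original cost remains, which is consistent with the ~75% estimate.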

Key decisions

ADR-0001 — semantic cache
ADR-0002 — backend abstraction
ADR-0003 — model sovereignty (Mistral EU, no Meta)
ADR-0004 — Cortex (0 WH) for sessions; EAI for batch