Design — pinky-ai
Update date: 2026-06-01 08:26
pinky-ai wraps all Snowflake Cortex AI functions behind a single PinkyAI class that adds
semantic caching, model routing, and token compression — the three levers that prevent Layer 4
costs from scaling unbounded with usage.
Placement in the suite
pinky-ai is a standalone package. It requires a Snowpark session for backend="cortex" but
can run without one (backends "ollama", "groq"). It is not part of pinky-snowpark because
AI concerns are orthogonal to Snowpark data transformation helpers.
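The session rule above can be sketched as a small validation step. This is only an illustration: `check_backend` is a hypothetical helper, not part of the documented PinkyAI interface, and the only facts taken from the text are the three backend names and which of them needs a Snowpark session.

```python
from typing import Optional

# Backends named in the text. Only "cortex" is stated to need a Snowpark session.
SESSION_REQUIRED = {"cortex"}
SESSION_OPTIONAL = {"ollama", "groq"}


def check_backend(backend: str, session: Optional[object] = None) -> None:
    """Hypothetical helper: reject unknown backends and a missing session."""
    if backend not in SESSION_REQUIRED | SESSION_OPTIONAL:
        raise ValueError(f"unknown backend: {backend!r}")
    if backend in SESSION_REQUIRED and session is None:
        raise ValueError(f"backend={backend!r} requires a Snowpark session")
```

With this shape, `check_backend("ollama")` passes without a session, while `check_backend("cortex")` raises until a session object is supplied.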
Every feature in this package exists because Cortex does not provide it natively. Once Snowflake adds native semantic caching or model routing, the corresponding layer will be removed. See the Note in ADR-0001.
Layer 4 cost model
Layer 0-3 → fixed or amortised costs
Layer 4 → variable: × N sessions × N tokens (the only layer whose cost scales without bound)
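As a rough illustration of the model above, the sketch below treats Layers 0-3 as a single fixed amount and Layer 4 as sessions × tokens × unit price. The function name, the parameters, and any prices passed to it are assumptions for illustration; the source gives no figures.

```python
def total_cost(
    fixed_cost: float,
    sessions: int,
    tokens_per_session: int,
    price_per_token: float,
) -> float:
    """Fixed Layer 0-3 cost plus a Layer 4 term that grows with usage."""
    # Layer 4: multiplies with both the session count and the tokens per session.
    layer4 = sessions * tokens_per_session * price_per_token
    return fixed_cost + layer4
```

Doubling either `sessions` or `tokens_per_session` doubles the Layer 4 term while `fixed_cost` stays constant, which is why the levers target Layer 4 only.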
pinky-ai intercepts Layer 4 with three levers:
| Lever | Mechanism | Estimated saving |
|---|---|---|
| Semantic cache | cosine-similarity on AI_EMBED, threshold 0.92 | ~40% fewer Cortex calls |
| Model router | task_type → cheapest adequate model | ~25% average cost reduction |
| Token compression | summarise oldest 2/3 of history at max_tokens | ~45% input token reduction |
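
The three mechanisms in the table can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the package's implementation: embeddings are passed in as plain float lists (in the real system they would come from AI_EMBED), the routing table and model names are placeholders, token counting is a whitespace split, and `summarise` is a stand-in for a model call. Only the 0.92 threshold, the task_type routing idea, and the "oldest 2/3 at max_tokens" rule come from the table.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.92  # threshold from the table


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cache_lookup(
    query_vec: Sequence[float],
    cache: List[Tuple[Sequence[float], str]],
    threshold: float = SIMILARITY_THRESHOLD,
) -> Optional[str]:
    """Return the cached answer closest to the query, or None below the threshold."""
    best_score, best_answer = 0.0, None
    for vec, answer in cache:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None


# Hypothetical routing table: task types and model names are placeholders.
ROUTES: Dict[str, str] = {
    "classify": "small-model",
    "summarize": "medium-model",
    "reason": "large-model",
}


def route_model(task_type: str, default: str = "large-model") -> str:
    """Pick the cheapest model assumed adequate for the task type."""
    return ROUTES.get(task_type, default)


def compress_history(
    history: List[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda m: len(m.split()),
    summarise: Callable[[List[str]], str] = lambda ms: f"[summary of {len(ms)} messages]",
) -> List[str]:
    """Once the history reaches max_tokens, replace its oldest 2/3 with one summary."""
    if sum(count_tokens(m) for m in history) < max_tokens:
        return history
    cut = (2 * len(history)) // 3
    if cut == 0:
        return history
    return [summarise(history[:cut])] + history[cut:]
```

A cache hit skips the Cortex call entirely, the router only matters on a miss, and compression shortens whatever prompt is eventually sent, so the three levers act at different points of the same request.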
Combined on repetitive workloads: ~75% reduction on Layer 4. Source: 06_pinky_ai design doc.
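The source does not say how the combined figure was derived. One plausible reading, shown here as an assumption, is that the three savings apply independently and multiply: each lever leaves a fraction of the remaining cost. The sketch also simplifies by treating the input-token reduction as a proportional cost reduction.

```python
def combined_reduction(*savings: float) -> float:
    """Combine independent fractional savings multiplicatively."""
    remaining = 1.0
    for s in savings:
        remaining *= 1.0 - s  # each lever keeps (1 - s) of what is left
    return 1.0 - remaining


# Estimated savings from the table: cache, router, compression.
combined = combined_reduction(0.40, 0.25, 0.45)
```

Under that assumption the remaining cost is 0.60 × 0.75 × 0.55 = 0.2475, so `combined` is roughly 0.75, in line with the ~75% quoted above. Simply adding the three percentages would exceed 100% and cannot be the intended reading.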