
ai

Cortex AI wrapper with semantic cache, model routing, and token compression.

ModelMap dataclass

Cortex model name → local/EAI equivalent lookup.

Field names use underscores; call .get() with the hyphenated Cortex name. The local mapping uses "mistral" (Ollama) as a uniform stand-in regardless of the Cortex model. Local mode is for pipeline integration testing, not output quality.

Source code in src/pinky_ai/ai.py
@dataclass(frozen=True)
class ModelMap:
    """Cortex model name → local/EAI equivalent lookup.

    Field names use underscores; call `.get()` with the hyphenated Cortex name.
    The local mapping uses `"mistral"` (Ollama) as a uniform stand-in regardless
    of the Cortex model. Local mode is for pipeline integration testing, not
    output quality.
    """

    mistral_7b: str = "mistral"
    mistral_large2: str = "mistral"
    claude_haiku_4_5: str = "mistral"
    claude_sonnet_4_6t: str = "mistral"

    def get(self, model: str, default: str = "mistral") -> str:
        """Return the local/EAI equivalent for *model*.

        Hyphens and dots in the Cortex model name are normalised to underscores
        before attribute lookup.

        Args:
            model: Cortex model name, e.g. `"claude-haiku-4-5"`.
            default: Returned when *model* is not a known field.

        Returns:
            Local model name string, e.g. `"mistral"`.
        """
        return getattr(self, model.replace("-", "_").replace(".", "_"), default)

get(model, default='mistral')

Return the local/EAI equivalent for model.

Hyphens and dots in the Cortex model name are normalised to underscores before attribute lookup.

Parameters:

  • model (str, required): Cortex model name, e.g. "claude-haiku-4-5".
  • default (str, default 'mistral'): Returned when model is not a known field.

Returns:

  • str: Local model name string, e.g. "mistral".
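The normalisation rule is easy to check in isolation. The snippet below copies the ModelMap dataclass from the listing above and exercises get(); the override value "mixtral" and the unknown model name are illustrative assumptions, not part of the package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelMap:
    # Copied from src/pinky_ai/ai.py as shown on this page.
    mistral_7b: str = "mistral"
    mistral_large2: str = "mistral"
    claude_haiku_4_5: str = "mistral"
    claude_sonnet_4_6t: str = "mistral"

    def get(self, model: str, default: str = "mistral") -> str:
        # "claude-haiku-4-5" becomes "claude_haiku_4_5" before the attribute lookup.
        return getattr(self, model.replace("-", "_").replace(".", "_"), default)


models = ModelMap()
local_haiku = models.get("claude-haiku-4-5")            # known field -> "mistral"
local_other = models.get("some-unknown-model", "phi3")  # not a field -> default

# Hypothetical override: point one Cortex model at a different Ollama model.
custom = ModelMap(claude_sonnet_4_6t="mixtral")
local_sonnet = custom.get("claude-sonnet-4-6t")
```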

PinkyAI

Cortex AI wrapper for Layer 4 cost optimisation.

Intercepts every LLM call with three levers:

  • Semantic cache: cosine-similarity lookup over AI_EMBED vectors before any Cortex call. The cache lives in DB_{APP}.{client}.AI_CACHE, one table per client schema. Default similarity threshold: 0.92. See ADR-0001.
  • Model router: resolves the cheapest adequate model from task_type via ROUTING_RULES. See ADR-0003 for sovereignty constraints.
  • Token compression: summarises the oldest 2/3 of the conversation history when the input exceeds max_tokens, reducing input by 60–70% on long sessions.
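The token-compression lever can be sketched as follows. This is not the package's implementation: the function name compress_history, the list-of-strings history format, the summarise callback (in practice a cheap model call) and the len(text) // 4 token estimate (the fallback that count_tokens documents for non-Cortex backends) are all assumptions made for illustration.

```python
from typing import Callable


def estimate_tokens(text: str) -> int:
    # Same rough rule count_tokens documents for non-Cortex backends.
    return len(text) // 4


def compress_history(
    history: list[str],
    max_tokens: int,
    summarise: Callable[[str], str],
) -> list[str]:
    """Hypothetical sketch: fold the oldest 2/3 of history into one summary message."""
    total = sum(estimate_tokens(message) for message in history)
    if total <= max_tokens or len(history) < 2:
        return list(history)  # under budget (or nothing to split): unchanged
    cut = (len(history) * 2) // 3  # oldest 2/3 before the cut, newest 1/3 after
    summary = summarise("\n".join(history[:cut]))
    return [f"Summary of earlier conversation: {summary}"] + history[cut:]


def fake_summarise(text: str) -> str:
    # Stand-in for a cheap summarisation call (e.g. a small model).
    return "stub summary"


history = [f"message {i}: " + "x" * 400 for i in range(9)]  # ~102 tokens each
compressed = compress_history(history, max_tokens=500, summarise=fake_summarise)
```

A single pass does not guarantee the result fits max_tokens when the newest third is itself large; a fuller version might repeat the fold or truncate.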

Backend selection (see ADR-0002 and ADR-0004):

  • "cortex" (default): serverless Cortex with no warehouse (0 WH); data stays in Snowflake. Required for interactive Streamlit sessions and for products handling data about minors.
  • "ollama": local Ollama server. Zero cost, no external network calls. Dev only.
  • "groq": Groq cloud API. Fast and cheap. Dev-cloud only.
  • "task": direct call to Anthropic or Mistral through an External Access Integration (EAI). Only valid from serverless tasks. Prohibited for products handling data about minors.
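The backend rules above can be encoded as a pre-flight check. The function check_backend and its flags are hypothetical (PinkyAI's documented signature has no such parameters), and treating "ollama"/"groq" as forbidden whenever production=True, and "cortex" as mandatory in production for interactive sessions and for data about minors, is one reading of the list rather than confirmed package behaviour.

```python
from typing import Any

VALID_BACKENDS = {"cortex", "ollama", "groq", "task"}


def check_backend(
    backend: str,
    session: Any,
    *,
    production: bool,
    interactive: bool = False,
    in_serverless_task: bool = False,
    handles_minor_data: bool = False,
) -> list[str]:
    """Hypothetical pre-flight check; returns rule violations (empty list = OK)."""
    if backend not in VALID_BACKENDS:
        return [f"unknown backend {backend!r}"]
    problems: list[str] = []
    if backend == "cortex" and session is None:
        problems.append("cortex needs an active Snowpark session")
    if backend in {"ollama", "groq"} and production:
        problems.append(f"{backend} is a dev-only backend")
    if backend == "task" and not in_serverless_task:
        problems.append("task backend is only valid inside a serverless task")
    if backend == "task" and handles_minor_data:
        problems.append("task backend is prohibited for data about minors")
    if production and (interactive or handles_minor_data) and backend != "cortex":
        problems.append("cortex is required for interactive sessions and data about minors")
    return problems
```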

Parameters:

  • session (Any, required): Active Snowpark session. Required for backend="cortex"; may be None for pure-local backends.
  • backend (str, default 'cortex'): Compute backend: "cortex", "ollama", "groq" or "task".
  • cache_table (str | None, default None): Fully qualified cache table name, e.g. "DB_PLUTO_SCHOOL.CLIENT_001.AI_CACHE". Required when using the semantic cache.
  • similarity_threshold (float, default 0.92): Minimum cosine similarity for a cache hit.
  • **kwargs (Any, optional): Backend-specific options: base_url (ollama), api_key (groq/task).

Example
from pinky_ai import get_ai

ai = get_ai(session)
answer = ai.complete("Explain the Pythagorean theorem", task_type="explanation")
Source code in src/pinky_ai/ai.py
class PinkyAI:
    """Cortex AI wrapper for Layer 4 cost optimisation.

    Intercepts every LLM call with three levers:

    - **Semantic cache** — cosine-similarity lookup over `AI_EMBED` vectors before
      any Cortex call. Cache stored in `DB_{APP}.{client}.AI_CACHE` per client schema.
      Default similarity threshold: 0.92. See ADR-0001.
    - **Model router** — resolves the cheapest adequate model from `task_type` via
      `ROUTING_RULES`. See ADR-0003 for sovereignty constraints.
    - **Token compression** — summarises the oldest 2/3 of conversation history when
      input exceeds `max_tokens`. Reduces input by 60–70% on long sessions.

    Backend selection (see ADR-0002 and ADR-0004):

    - `"cortex"` (default) — serverless Cortex, 0 WH, data stays in Snowflake.
      Required for interactive Streamlit sessions and for products handling minor data.
    - `"ollama"` — local Ollama server. Zero cost, zero network. Dev only.
    - `"groq"` — Groq cloud API. Fast and cheap. Dev-cloud only.
    - `"task"` — direct EAI call to Anthropic/Mistral. Only valid from serverless
      tasks. Prohibited for products handling minor data.

    Args:
        session: Active Snowpark session. Required for `backend="cortex"`.
            May be `None` for pure-local backends.
        backend: Compute backend. Defaults to `"cortex"`.
        cache_table: Fully-qualified cache table name, e.g.
            `"DB_PLUTO_SCHOOL.CLIENT_001.AI_CACHE"`. Required when using the
            semantic cache.
        similarity_threshold: Minimum cosine similarity for a cache hit. Default 0.92.
        **kwargs: Backend-specific options: `base_url` (ollama), `api_key` (groq/task).

    Example:
        ```python
        from pinky_ai import get_ai

        ai = get_ai(session)
        answer = ai.complete("Explain the Pythagorean theorem", task_type="explanation")
        ```
    """

    MODEL_MAP: ClassVar[ModelMap] = ModelMap()

    def __init__(
        self,
        session: Any,
        backend: str = "cortex",
        cache_table: str | None = None,
        similarity_threshold: float = 0.92,
        **kwargs: Any,
    ) -> None: ...

    def complete(
        self,
        prompt: str,
        model: str = "mistral-7b",
        task_type: str | None = None,
        **kwargs: Any,
    ) -> str:
        """Call COMPLETE with semantic cache lookup and optional model routing.

        If `task_type` is provided, the model is resolved from `ROUTING_RULES`
        instead of the `model` argument.

        Cache hit path (0 tokens consumed):
        1. Embed the prompt with `AI_EMBED`.
        2. Query `cache_table` for cosine similarity ≥ `similarity_threshold`.
        3. Return stored response if found.

        Cache miss path:
        1. Call Cortex (or the active backend).
        2. Store (embedding, response, model) in `cache_table`.
        3. Return the response.

        Args:
            prompt: User prompt.
            model: Cortex model name. Ignored when `task_type` is set.
            task_type: Routing key resolved against `ROUTING_RULES`. One of
                `"factual_simple"`, `"explanation"`, `"complex_reason"`,
                `"scoring_batch"`.
            **kwargs: Forwarded to the active backend.

        Returns:
            Model response text.
        """
        ...

    def extract(self, text: str, schema: dict[str, Any]) -> dict[str, Any]:
        """Extract structured fields from text using `AI_EXTRACT`.

        Args:
            text: Source text to extract from.
            schema: Target JSON schema describing fields to extract.

        Returns:
            Dict matching the provided schema.
        """
        ...

    def classify(self, text: str, categories: list[str]) -> str:
        """Classify text into one of the provided categories using `AI_CLASSIFY`.

        Args:
            text: Text to classify.
            categories: List of candidate category labels.

        Returns:
            The most probable category label.
        """
        ...

    def filter(self, text: str, condition: str) -> bool:  # noqa: A003
        """Check whether text satisfies a natural-language condition using `AI_FILTER`.

        Intended for guardrails: input validation, output quality checks.

        Args:
            text: Text to evaluate.
            condition: Natural-language condition to test against.

        Returns:
            `True` if the text satisfies the condition.
        """
        ...

    def embed(self, text: str) -> list[float]:
        """Embed text into a 768-dimension vector using `AI_EMBED`.

        Uses `e5-base-v2` in Cortex. Falls back to `nomic-embed-text` via Ollama
        in local mode. The embedding is consumed by the semantic cache layer.

        Args:
            text: Text to embed.

        Returns:
            List of 768 floats.
        """
        ...

    def parse_document(self, file_url: str) -> str:
        """Extract text from a document using `AI_PARSE_DOCUMENT` (OCR).

        Args:
            file_url: Snowflake stage URL or presigned URL pointing to the document.

        Returns:
            Extracted text with layout hints (`LAYOUT` mode).
        """
        ...

    def count_tokens(self, text: str, model: str = "mistral-7b") -> int:
        """Count tokens in text using `AI_COUNT_TOKENS`.

        Used for pre-call cost estimation and context window management.
        Approximates to `len(text) // 4` when backend is not `"cortex"`.

        Args:
            text: Input text.
            model: Model to use for token counting (tokenisation is model-specific).

        Returns:
            Estimated token count.
        """
        ...

classify(text, categories)

Classify text into one of the provided categories using AI_CLASSIFY.

Parameters:

  • text (str, required): Text to classify.
  • categories (list[str], required): List of candidate category labels.

Returns:

  • str: The most probable category label.

complete(prompt, model='mistral-7b', task_type=None, **kwargs)

Call COMPLETE with semantic cache lookup and optional model routing.

If task_type is provided, the model is resolved from ROUTING_RULES instead of the model argument.

Cache hit path (no completion tokens consumed; only the prompt embedding is computed):

  1. Embed the prompt with AI_EMBED.
  2. Query cache_table for cosine similarity ≥ similarity_threshold.
  3. Return the stored response if found.

Cache miss path:

  1. Call Cortex (or the active backend).
  2. Store (embedding, response, model) in cache_table.
  3. Return the response.

Parameters:

  • prompt (str, required): User prompt.
  • model (str, default 'mistral-7b'): Cortex model name. Ignored when task_type is set.
  • task_type (str | None, default None): Routing key resolved against ROUTING_RULES. One of "factual_simple", "explanation", "complex_reason", "scoring_batch".
  • **kwargs (Any, optional): Forwarded to the active backend.

Returns:

  • str: Model response text.

count_tokens(text, model='mistral-7b')

Count tokens in text using AI_COUNT_TOKENS.

Used for pre-call cost estimation and context-window management. When the backend is not "cortex", the count falls back to the approximation len(text) // 4.

Parameters:

  • text (str, required): Input text.
  • model (str, default 'mistral-7b'): Model whose tokeniser is used (tokenisation is model-specific).

Returns:

  • int: Estimated token count.
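The non-Cortex fallback is simple enough to show directly. The sketch below assumes only the len(text) // 4 rule stated above; fits_context and the window sizes in the example are hypothetical, not values taken from PinkyAI.

```python
def approx_tokens(text: str) -> int:
    # Roughly four characters per token; integer division rounds down.
    return len(text) // 4


def fits_context(prompt: str, context_window: int, reserved_for_output: int) -> bool:
    """Hypothetical pre-call check: prompt estimate plus output budget must fit the window."""
    return approx_tokens(prompt) + reserved_for_output <= context_window


prompt = "a" * 4000  # 4000 characters -> about 1000 tokens
fits_large = fits_context(prompt, context_window=8192, reserved_for_output=1024)
fits_small = fits_context(prompt, context_window=1500, reserved_for_output=1024)
```

The estimate rounds down, so very short strings count as zero tokens, and real tokenisers can diverge noticeably from it; the Cortex path uses AI_COUNT_TOKENS with a model-specific tokeniser instead.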

embed(text)

Embed text into a 768-dimension vector using AI_EMBED.

Uses e5-base-v2 in Cortex. Falls back to nomic-embed-text via Ollama in local mode. The embedding is consumed by the semantic cache layer.

Parameters:

  • text (str, required): Text to embed.

Returns:

  • list[float]: List of 768 floats.

extract(text, schema)

Extract structured fields from text using AI_EXTRACT.

Parameters:

  • text (str, required): Source text to extract from.
  • schema (dict[str, Any], required): Target JSON schema describing the fields to extract.

Returns:

  • dict[str, Any]: Dict matching the provided schema.

filter(text, condition)

Check whether text satisfies a natural-language condition using AI_FILTER.

Intended for guardrails: input validation, output quality checks.

Parameters:

  • text (str, required): Text to evaluate.
  • condition (str, required): Natural-language condition to test against.

Returns:

  • bool: True if the text satisfies the condition.

parse_document(file_url)

Extract text from a document using AI_PARSE_DOCUMENT (OCR).

Parameters:

  • file_url (str, required): Snowflake stage URL or presigned URL pointing to the document.

Returns:

  • str: Extracted text with layout hints (LAYOUT mode).

RoutingRules dataclass

Task type → Cortex model routing table.

Frozen dataclass, so it is safe to use as a class-level constant. Override fields at instantiation time for products with different quality/cost needs. See ADR-0003 for the model-sovereignty rationale (Mistral EU-first, no Meta models for data about minors).

Example
custom = RoutingRules(factual_simple="mistral-large2")
ai = PinkyAI(session, routing=custom)
Source code in src/pinky_ai/ai.py
@dataclass(frozen=True)
class RoutingRules:
    """Task type → Cortex model routing table.

    Frozen dataclass — safe to use as a class-level constant.
    Override at instantiation time for products with different quality/cost needs.
    See ADR-0003 for model sovereignty rationale (Mistral EU-first, no Meta for minor data).

    Example:
        ```python
        custom = RoutingRules(factual_simple="mistral-large2")
        ai = PinkyAI(session, routing=custom)
        ```
    """

    factual_simple: str = "mistral-7b"
    explanation: str = "claude-haiku-4-5"
    complex_reason: str = "claude-sonnet-4-6t"
    scoring_batch: str = "mistral-large2"

    def resolve(self, task_type: str, default: str = "claude-haiku-4-5") -> str:
        """Return the Cortex model for *task_type*, falling back to *default*.

        Args:
            task_type: One of `"factual_simple"`, `"explanation"`,
                `"complex_reason"`, `"scoring_batch"`.
            default: Model returned when *task_type* is not a known field.

        Returns:
            Cortex model name string.
        """
        return getattr(self, task_type, default)

resolve(task_type, default='claude-haiku-4-5')

Return the Cortex model for task_type, falling back to default.

Parameters:

  • task_type (str, required): One of "factual_simple", "explanation", "complex_reason", "scoring_batch".
  • default (str, default 'claude-haiku-4-5'): Model returned when task_type is not a known field.

Returns:

  • str: Cortex model name string.
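The routing table and its fallback can be exercised without a session. RoutingRules below is copied from the listing on this page; dataclasses.replace produces a modified copy of the frozen instance, and strict_resolve is a hypothetical variant showing one way to restrict lookups to declared fields, since the getattr-based resolve() also returns non-field attributes such as the resolve method itself.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class RoutingRules:
    # Copied from src/pinky_ai/ai.py as shown on this page.
    factual_simple: str = "mistral-7b"
    explanation: str = "claude-haiku-4-5"
    complex_reason: str = "claude-sonnet-4-6t"
    scoring_batch: str = "mistral-large2"

    def resolve(self, task_type: str, default: str = "claude-haiku-4-5") -> str:
        return getattr(self, task_type, default)


def strict_resolve(rules: RoutingRules, task_type: str, default: str = "claude-haiku-4-5") -> str:
    """Hypothetical variant: only declared dataclass fields count as routing keys."""
    names = {f.name for f in fields(rules)}
    return getattr(rules, task_type) if task_type in names else default


rules = RoutingRules()
cheap = rules.resolve("factual_simple")                   # "mistral-7b"
fallback = rules.resolve("translation")                   # unknown key -> "claude-haiku-4-5"
custom = replace(rules, factual_simple="mistral-large2")  # frozen: copy instead of mutate
quirk = rules.resolve("resolve")                          # getattr finds the method itself
strict = strict_resolve(rules, "resolve")                 # falls back to the default instead
```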

get_ai(session)

Resolve the active backend from environment variables and return a PinkyAI instance.

Environment variables:

  • PINKY_AI_BACKEND — backend selector. Default: "cortex".
  • PINKY_AI_URL — base URL for Ollama. Default: "http://localhost:11434".
  • PINKY_AI_KEY — API key for Groq or EAI task backend. Default: empty string.

Parameters:

  • session (Any, required): Active Snowpark session (required for backend="cortex").

Returns:

  • PinkyAI: Configured PinkyAI instance.

Example
# .env dev
# PINKY_AI_BACKEND=ollama
# PINKY_AI_URL=http://localhost:11434

ai = get_ai(session)  # backend resolved from env
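Because get_ai reads only three variables, its resolution logic can be tested without a session by separating it out. resolve_ai_config is a hypothetical helper that mirrors the os.getenv calls in get_ai (same names, same defaults) but reads from any mapping, so a test can pass a plain dict instead of mutating os.environ.

```python
import os
from collections.abc import Mapping


def resolve_ai_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Hypothetical helper mirroring get_ai's environment lookups and defaults."""
    return {
        "backend": env.get("PINKY_AI_BACKEND", "cortex"),
        "base_url": env.get("PINKY_AI_URL", "http://localhost:11434"),
        "api_key": env.get("PINKY_AI_KEY", ""),
    }


dev = resolve_ai_config({"PINKY_AI_BACKEND": "ollama"})
prod = resolve_ai_config({})
```

With this split, get_ai would reduce to PinkyAI(session, **resolve_ai_config()): backend binds to the named parameter, while base_url and api_key travel through **kwargs exactly as they do now.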
Source code in src/pinky_ai/ai.py
def get_ai(session: Any) -> PinkyAI:
    """Resolve the active backend from environment variables and return a `PinkyAI` instance.

    Environment variables:

    - `PINKY_AI_BACKEND` — backend selector. Default: `"cortex"`.
    - `PINKY_AI_URL` — base URL for Ollama. Default: `"http://localhost:11434"`.
    - `PINKY_AI_KEY` — API key for Groq or EAI task backend. Default: empty string.

    Args:
        session: Active Snowpark session (required for `backend="cortex"`).

    Returns:
        Configured `PinkyAI` instance.

    Example:
        ```python
        # .env dev
        # PINKY_AI_BACKEND=ollama
        # PINKY_AI_URL=http://localhost:11434

        ai = get_ai(session)  # backend resolved from env
        ```
    """
    backend = os.getenv("PINKY_AI_BACKEND", "cortex")
    return PinkyAI(
        session,
        backend=backend,
        base_url=os.getenv("PINKY_AI_URL", "http://localhost:11434"),
        api_key=os.getenv("PINKY_AI_KEY", ""),
    )