# Embedding Design — `lib/llm/embed.py`

## Goal

Embed texts (and optionally images) in parallel, cache results, rank by similarity to a query. Lives in `lib/llm/` since it's the natural sibling to `call_llm`/`stream_llm`.

## Core API

```python
from lib.llm.embed import embed, rank_by_similarity

# Embed texts (cached, parallel batched)
vectors = await embed(["text1", "text2", ...])

# Rank by closeness to query
query_vec = await embed(["what is the main risk?"])
ranked = rank_by_similarity(query_vec[0], vectors)
# → [(idx, score), (idx, score), ...] sorted by descending similarity
```

## Architecture

```
embed(texts, model=..., dims=...)
  │
  ├─ check cache: sha256(model + dims + text) → cached vector
  ├─ batch uncached texts → litellm.embedding(model, input=[...])
  ├─ store results in cache
  └─ return np.ndarray (N × dims)

rank_by_similarity(query_vec, corpus_vecs, k=None)
  └─ normalized dot product → sorted indices + scores
```
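
A minimal sketch of this flow, with a plain dict standing in for the SQLite cache and a stub in place of the litellm call (`_provider_embed` and `_CACHE` are hypothetical placeholders, not the real module):

```python
import asyncio
import hashlib

_CACHE: dict[str, list[float]] = {}  # stand-in for the SQLite cache

def _cache_key(model: str, dims: int, text: str) -> str:
    return hashlib.sha256(f"{model}:{dims}:{text}".encode()).hexdigest()

async def _provider_embed(model: str, texts: list[str]) -> list[list[float]]:
    # Stub: real code would await litellm.embedding(model=model, input=texts)
    return [[float(len(t))] for t in texts]

async def embed(texts: list[str], model: str = "gemini/gemini-embedding-001",
                dims: int = 3072) -> list[list[float]]:
    keys = [_cache_key(model, dims, t) for t in texts]
    # Only the cache misses go to the provider, in one batched call
    missing = [(i, t) for i, (k, t) in enumerate(zip(keys, texts)) if k not in _CACHE]
    if missing:
        fresh = await _provider_embed(model, [t for _, t in missing])
        for (i, _), vec in zip(missing, fresh):
            _CACHE[keys[i]] = vec
    return [_CACHE[k] for k in keys]  # results in input order
```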

## Model Selection

### Text-Only Models

| Model                    | Provider    | Cost       | Dims         | MTEB     | Matryoshka | Notes                          |
|--------------------------|-------------|------------|--------------|----------|------------|--------------------------------|
| gemini-embedding-001     | Google API  | ~$0.01/1Mt | 3072 (flex)  | **68.3** | Yes        | Best text quality, near-free   |
| text-embedding-3-small   | OpenAI API  | $0.02/1Mt  | 1536 (flex)  | ~62      | Yes        | Cheap, decent                  |
| text-embedding-3-large   | OpenAI API  | $0.13/1Mt  | 3072 (flex)  | ~64.6    | Yes        | Better quality, pricier        |
| nomic-embed-text-v1.5    | Local/free  | $0         | 768 (flex)   | ~62      | Yes        | Shares space w/ vision model   |

**Default text model**: `gemini/gemini-embedding-001` — best MTEB, near-free, Matryoshka truncation.

### Multimodal Models (Text + Image)

Speed estimates for M3 Max MBP (MPS backend). M3 Pro ~20% slower, M3 base ~40% slower.

| Model                       | Run   | Cost               | Dims        | Matryoshka | Quality (Flickr30k unless noted) | 1K frames (M3 Max) | HW req          |
|-----------------------------|-------|--------------------|-------------|------------|----------------------------------|--------------------|-----------------|
| **Nomic Embed Vision v1.5** | Local | $0                 | 768 (flex)  | 64–768     | ~71 ImageNet 0-shot              | **~18s**, 1 GB     | Any Mac, CPU ok |
| **SigLIP2 Base (86M)**      | Local | $0                 | 768         | No         | Good                             | ~18s, 1 GB         | Any Mac, CPU ok |
| **SigLIP2 SO400M**          | Local | $0                 | 1152        | No         | **~90+ R@1**                     | ~65s, 2.5 GB       | 16 GB+ RAM      |
| **Jina CLIP v2 (865M)**     | Local | $0                 | 1024 (flex) | 64–1024    | **98 R@1 I2T**                   | ~2–3 min, 4 GB     | 16 GB+ RAM, MPS |
| **OpenCLIP ViT-H/14**       | Local | $0                 | 1024        | No         | ~88 R@1                          | ~90s, 3 GB         | 16 GB+ RAM      |
| **OpenAI CLIP ViT-B/16**    | Local | $0                 | 512         | No         | ~83 R@1                          | ~22s, 1.2 GB       | Any Mac, CPU ok |
| **Google Vertex mm@001**    | API   | $0.0001/img        | 1408 (flex) | 128–1408   | N/A (proprietary)                | API latency        | None            |
| **Cohere Embed v4**         | API   | ~$0.47/1M tok      | 1536 (flex) | 256–1536   | SOTA (reported)                  | API latency        | None            |
| **Voyage multimodal-3.5**   | API   | $0.0003–0.0012/img | 1024 (flex) | 256–2048   | Strong                           | API latency        | None            |

**Recommended local**: Nomic Embed Vision v1.5 — 93M params, Matryoshka to 256d, ~18s/1K frames on M3 Max, shares space with a strong text model. Runs on CPU too (just 2–3× slower).
**Step up**: SigLIP2 SO400M — best open retrieval accuracy, ~65s/1K frames, needs 16 GB.
**Cheapest API**: Google Vertex multimodalembedding@001 — $0.10 per 1K images, flexible dims.
**Best quality local (heavy)**: Jina CLIP v2 — 98% Flickr I2T, Matryoshka, 89 languages, but 865M params.

## Caching Strategy

```
cache_key = sha256(f"{model}:{dims}:{text}")
```

Storage options (in order of simplicity):
1. **SQLite** — `embeddings_cache.db` with `(key TEXT PK, vector BLOB, model TEXT, dims INT, created_at)`
   - Store as `np.float16` (half the size, negligible quality loss for retrieval)
   - Query: `SELECT vector FROM cache WHERE key = ?`
2. **diskcache** — if we want LRU eviction
3. **numpy files** — one `.npy` per text (messier, but zero-dep)

**Matryoshka trick**: Cache at full dimensions, truncate at query time for fast initial ranking, use full dims for reranking top-K. This means one cache entry serves all dimension sizes.
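
The trick can be sketched in numpy — cache full-dim vectors, rank coarsely on a truncated prefix, rerank the survivors at full dims. The 256-dim coarse stage, 4×K over-fetch, and function name here are illustrative choices, not fixed:

```python
import numpy as np

def _normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def matryoshka_search(query: np.ndarray, corpus: np.ndarray,
                      coarse_dims: int = 256, k: int = 10) -> np.ndarray:
    # Stage 1: truncate, renormalize (required after truncation), coarse rank
    q_c = _normalize(query[:coarse_dims])
    c_c = _normalize(corpus[:, :coarse_dims])
    top = np.argsort(-(c_c @ q_c))[: k * 4]  # over-fetch candidates
    # Stage 2: rerank only the candidates at full dimensionality
    q_f, c_f = _normalize(query), _normalize(corpus[top])
    return top[np.argsort(-(c_f @ q_f))][:k]
```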

## Similarity Search

For <50K vectors (our typical scale): **numpy dot product** is sufficient.

```python
import numpy as np

# Assumes both sides are unit-normalized, so the dot product is cosine similarity
scores = corpus_vecs @ query_vec
ranked_indices = np.argsort(-scores)
```

For 50K–10M: `faiss` (IndexFlatIP for exact, IVF/HNSW for approximate).

## Batching & Parallelism

- litellm `embedding()` accepts `input=[list]` natively — provider handles batching
- OpenAI: up to 2048 texts per request
- Gemini: generous rate limits, batch API at 50% price
- For large corpora: chunk into batches of ~100, run with `asyncio.gather`
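
The chunked-gather approach can be sketched as follows, with `_embed_batch` as a stub standing in for one litellm embedding call per batch:

```python
import asyncio

async def _embed_batch(batch: list[str]) -> list[list[float]]:
    # Stub: real code would await litellm's embedding call for this batch
    return [[float(len(t))] for t in batch]

async def embed_corpus(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    # Chunk into ~100-text batches and fire them concurrently
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    results = await asyncio.gather(*(_embed_batch(b) for b in batches))
    # gather preserves order, so flattening restores input order
    return [vec for batch in results for vec in batch]
```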

## Image Embeddings

For text+image in the same vector space:
- **Cohere embed-v4** via litellm — production API, up to 1536 dims (flexible)
- **OpenCLIP** locally — free, well-tested, `pip install open-clip-torch`
- **Vertex multimodalembedding@001** — text+image+video, 1408 dims

### Video Frame Pipeline

```
Video → ffmpeg scene detection or 1fps sampling
      → embed frames (Nomic/SigLIP2, 256d via Matryoshka)
      → L2 normalize → UMAP(25d) → HDBSCAN cluster
      → pick centroid frame per cluster as representative
```

- **Scene detection first**: `ffmpeg -filter:v "select='gt(scene,0.3)'"` or PySceneDetect — far more efficient than fixed-rate
- **Fallback**: 1 fps for 30fps video (every 30th frame)
- **HDBSCAN > agglomerative**: handles noise (black frames, transitions), no cluster count needed
- **UMAP pre-reduction**: 768d → 25d before clustering improves HDBSCAN quality and speed
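
The last step of the pipeline — picking a representative frame per cluster — can be sketched given HDBSCAN-style labels (where `-1` marks noise); the function name is illustrative:

```python
import numpy as np

def representative_frames(embeddings: np.ndarray, labels: np.ndarray) -> dict[int, int]:
    """Map each cluster label to the index of its most central frame."""
    reps: dict[int, int] = {}
    for cluster in sorted(set(labels.tolist())):
        if cluster == -1:  # HDBSCAN noise (black frames, transitions): skip
            continue
        idx = np.flatnonzero(labels == cluster)
        centroid = embeddings[idx].mean(axis=0)
        # The frame nearest the centroid represents the cluster
        nearest = np.argmin(np.linalg.norm(embeddings[idx] - centroid, axis=1))
        reps[int(cluster)] = int(idx[nearest])
    return reps
```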

### Matryoshka for Images

256 dims retains ~95% of full-dimension retrieval quality. Image-to-image similarity (clustering) is even more forgiving — visual features survive truncation well, and 128 dims is still good. Below 64 dims, quality drops ~8–12%.

Models with image Matryoshka: Nomic Vision v1.5, Jina CLIP v2, Cohere v4, Voyage, Google Vertex.
Models without: SigLIP2, OpenCLIP (fixed output dim per model size).

### Apple Silicon Acceleration

Local models use PyTorch MPS backend on Apple Silicon automatically. For additional speed:
- **MLX**: Apple's ML framework — `mlx-clip` and `mlx-vlm` packages provide CLIP/SigLIP inference optimized for M-series. ~30–50% faster than PyTorch MPS for embedding workloads.
- **Core ML**: Convert models via `coremltools` — best for batch inference, but conversion effort is higher.
- **Practical**: PyTorch MPS is already fast enough for our scale (1K frames in <30s). MLX is worth trying if we hit bottlenecks.
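
A defensive device pick for the local path, falling back to CPU when torch or MPS is unavailable:

```python
# MPS is used automatically by most model wrappers, but an explicit device
# string is handy when constructing models by hand.
try:
    import torch
    DEVICE = "mps" if torch.backends.mps.is_available() else "cpu"
except ImportError:
    DEVICE = "cpu"  # torch not installed: CPU-only paths still work
```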

Image support is Phase 2 — text-only first, then add `embed_images()` with same cache/similarity API.

## Implementation Plan

1. **`lib/llm/embed.py`** — `embed()`, `rank_by_similarity()`, `EmbeddingCache`
2. **Cache in SQLite** — `lib/llm/data/embeddings_cache.db`
3. **Tests** — `lib/llm/tests/test_embed.py`
4. **CLI** — `python -m lib.llm.embed "text to embed"` for quick testing
5. **Integration** — use in brain, learning, people discovery

## Dependencies

- `numpy` (already in env)
- `litellm` (already in env)
- `faiss-cpu` (optional, for large-scale)

## Open Questions

- Should we support local models (sentence-transformers) as a fallback?
- Is there a use case for cross-model embedding comparison (embed with model A, search with model B)?
- Should the cache be per-project or global?
