# Healthy Gamer Portal — Design Document

**Date**: 2026-02-23
**Status**: Approved
**Location**: `projects/healthygamer/`
**Depends on**: `lib/semnet/`, `lib/vectors/`, `lib/llm/embed.py`

## Problem

We have 935 Healthy Gamer YouTube videos with VTT transcripts and metadata. We want an indexed, searchable portal where you can:
1. Browse a topic cloud showing what the channel covers
2. Navigate to any topic and find all relevant snippets across all videos
3. View a single video's structure: chapters, key insights, soundbites
4. Search semantically — find snippets by meaning, not just keywords
5. Apply this same system to any YouTube channel (a16z, Dwarkesh, Lex Fridman, etc.)

## Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| UI model | Static HTML + JS | Hosted on static.localhost / Cloudflare Pages. No server needed, works offline, shareable. |
| Architecture | SemanticNet adapter | HG is a domain adapter for `lib/semnet/`. Reuses all infra, serves VIC and HG alike. |
| Adapter hierarchy | YouTube base → HG specialization | `YouTubeChannelAdapter` handles VTT/chapters/chunking. `HGAdapter` adds mental health taxonomy + extraction prompt. Other channels reuse the base. |
| Chunking strategy | Topic-change detection | An LLM scans the transcript for topic shifts and uses those as chunk boundaries; when YouTube chapters are available, they serve as the primary boundaries. Better than fixed windows — conversations meander. |
| kb relationship | SemanticNet absorbs kb | `kb/` becomes a thin consumer. The JSONL corpus pattern is retired in favor of SQLite + Qdrant. `kb/wisdom/` migrates to `projects/vic/`. |

## Architecture

```
lib/semnet/
├── adapters/
│   ├── youtube.py           # YouTubeChannelAdapter base
│   └── web.py               # WebContentAdapter (future, replaces kb/scenario.py)
├── chunk.py                 # Topic-change chunking engine (shared)
├── embed.py                 # 3-level embedding
├── store.py                 # Qdrant + SQLite dual storage
├── query.py                 # Hybrid retrieval
├── portal.py                # Static site generator (generic)
├── extract.py               # Claim extraction (exists)
├── schema.py                # SQLite schema (exists)
├── adapter.py               # DomainAdapter base (exists)
├── models.py                # Dataclasses (exists)
└── tests/

projects/healthygamer/
├── README.md                # Vision doc
├── adapter.py               # HGAdapter(YouTubeChannelAdapter)
├── taxonomy_seed.yaml       # Mental health topic taxonomy
├── pipeline.py              # CLI: process videos, generate portal
├── portal/                  # Generated static site
│   ├── .share               # Opt-in for static.localhost
│   └── (generated HTML/JS/JSON)
└── tests/
```

### Adapter Hierarchy

**YouTubeChannelAdapter** (in `lib/semnet/adapters/youtube.py`) handles:
- VTT → plain text with timestamps (uses existing `lib/transcript_viewer/loader.py`)
- Chapter extraction from YouTube metadata (or LLM detection as fallback)
- Topic-change chunking (delegates to `lib/semnet/chunk.py`)
- Standard metadata: video_id, title, duration, upload_date, channel
- Content preprocessing: clean VTT artifacts, merge short cues

**HGAdapter** (in `projects/healthygamer/adapter.py`) adds:
- Mental health seed taxonomy (anxiety, depression, ADHD, relationships, meditation, etc.)
- HG-specific extraction prompt (extract insights/advice/soundbites, not investment claims)
- Direction field repurposed: "actionable" vs "conceptual" (instead of "bullish/bearish")

Other YouTube channels (a16z, Dwarkesh, Lex) will create their own adapters extending YouTubeChannelAdapter with domain-specific taxonomy and prompts.
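A minimal sketch of the split between the two adapters. Only the class names come from this doc; the fields and the `preprocess` method are illustrative assumptions, not the final `lib/semnet` interface:

```python
from dataclasses import dataclass, field


@dataclass
class YouTubeChannelAdapter:
    """Generic channel adapter: VTT preprocessing, chapters, chunking, metadata."""
    channel: str
    taxonomy_seed: dict = field(default_factory=dict)
    extraction_prompt: str = "Extract the key claims from this transcript chunk."

    def preprocess(self, vtt_text: str) -> str:
        # Strip VTT artifacts; the real version would also merge short cues.
        return "\n".join(
            line for line in vtt_text.splitlines()
            if line and "-->" not in line and not line.startswith("WEBVTT")
        )


@dataclass
class HGAdapter(YouTubeChannelAdapter):
    """Healthy Gamer specialization: mental health taxonomy + prompt."""
    channel: str = "healthy_gamer"
    extraction_prompt: str = (
        "Extract insights, actionable advice, and soundbites. "
        "Classify each as 'actionable' or 'conceptual'."
    )
```

An a16z or Dwarkesh adapter would be the same shape: subclass, swap the taxonomy seed and prompt, inherit everything else.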

## Processing Pipeline (per video)

```
1. Read metadata.json + VTT from video-analysis/content/healthy_gamer/{video_id}/
   ↓
2. Extract YouTube chapters
   - From metadata.json description field (timestamp + title pattern)
   - Or from yt-dlp --write-info-json if chapters field exists
   - Fallback: no chapters (many older HG videos lack them)
   ↓
3. Topic-change chunking (lib/semnet/chunk.py):
   - If chapters: use as primary boundaries, sub-chunk within at topic shifts
   - If no chapters: LLM scans full transcript, detects topic changes
   - Each chunk: {start_ts, end_ts, text, chapter_title?, topic_label}
   - Constraints: min 30s, max 180s per chunk
   ↓
4. Claim/insight extraction (per chunk, via lib/semnet/extract.py):
   - Key insights, soundbites, actionable advice
   - Classified into HG taxonomy
   - Tags: freeform descriptors
   ↓
5. Embed (lib/semnet/embed.py → lib/llm/embed.py):
   - Level 1: Video summary (LLM-generated from full transcript)
   - Level 2: Chunk embeddings (contextualized — chunk text + video title + chapter)
   - Level 3: Individual claim/insight embeddings
   ↓
6. Store (lib/semnet/store.py → lib/vectors/ + SQLite):
   - Qdrant: vectors with payloads (level, video_id, timestamps, taxonomy)
   - SQLite: claims, taxonomy_nodes, processing log
```
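Step 2's description-field fallback can be sketched as a timestamp-line parser. This is a hypothetical helper, assuming a `MM:SS Title` / `H:MM:SS Title` line pattern; the real adapter would prefer the metadata `chapters` field when present:

```python
import re

# Matches "0:00 Intro", "2:15 Anxiety basics", "1:02:03 Wrap-up"
_CHAPTER_RE = re.compile(r"^(?:(\d+):)?(\d{1,2}):(\d{2})\s+(.+)$")


def chapters_from_description(description: str) -> list[dict]:
    chapters = []
    for line in description.splitlines():
        m = _CHAPTER_RE.match(line.strip())
        if not m:
            continue
        h, mnt, sec, title = m.groups()
        start = int(h or 0) * 3600 + int(mnt) * 60 + int(sec)
        chapters.append({"start": start, "title": title.strip()})
    # A chapter list should start at 0:00; otherwise treat the description as prose.
    if not chapters or chapters[0]["start"] != 0:
        return []
    return chapters
```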

## Topic-Change Chunking

The key shared component in `lib/semnet/chunk.py`:

```python
async def chunk_by_topic_change(
    transcript_cues: list[Cue],         # Parsed VTT
    chapters: list[Chapter] | None,      # YouTube chapters if available
    *,
    min_chunk_seconds: int = 30,         # Don't split finer than this
    max_chunk_seconds: int = 180,        # Force split if no topic change detected
    model: str = "flash",                # Fast + cheap for scanning
) -> list[Chunk]:
```

Approach:
1. If chapters exist, treat each chapter as a segment
2. Within each segment (or the whole transcript if no chapters), send overlapping windows to Flash
3. Flash identifies topic shift points: "At timestamp X, speaker shifts from discussing Y to Z"
4. Split at those points, respecting min/max constraints
5. Each chunk gets a topic label from Flash's analysis

Cost: ~$0.001/video × 935 videos ≈ **$1 total** for topic detection (Flash pricing).
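Step 4's constraint pass is deterministic and worth pinning down. A minimal sketch, assuming shift points arrive as timestamps in seconds (function and parameter names are illustrative; `chunk.py` may structure this differently):

```python
def enforce_bounds(
    shifts: list[float], start: float, end: float,
    min_s: float = 30.0, max_s: float = 180.0,
) -> list[tuple[float, float]]:
    """Turn LLM-detected topic shifts into chunk spans within [min_s, max_s]."""
    bounds = [start]
    for t in sorted(shifts):
        # Honor a shift only if both resulting sides stay >= min_s.
        if t - bounds[-1] >= min_s and end - t >= min_s:
            bounds.append(t)
    bounds.append(end)
    chunks = []
    for a, b in zip(bounds, bounds[1:]):
        # Force-split overlong spans into equal pieces under max_s.
        n = int((b - a) // max_s) + 1
        step = (b - a) / n
        chunks.extend((a + i * step, a + (i + 1) * step) for i in range(n))
    return chunks
```

Shifts that land closer than 30s to an existing boundary are dropped; spans with no detected shift for more than 180s are split evenly rather than at an arbitrary cue.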

## Static Portal

### File Structure

```
projects/healthygamer/portal/
├── index.html              # Landing: topic cloud + recent videos + search bar
├── topics/
│   ├── index.json          # All topics with counts + descriptions
│   └── {slug}.html         # All snippets for one topic, across all videos
├── videos/
│   ├── index.json          # All videos with metadata
│   └── {video_id}.html     # Single video: chapters, timeline, snippets, claims
├── search/
│   ├── index.json          # Pre-computed search index (lunr.js compatible)
│   └── embeddings.bin      # Quantized embeddings for client-side similarity
└── assets/
    ├── style.css           # Dark theme (matches existing gallery.html pattern)
    └── app.js              # Topic cloud (D3), search, navigation
```
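`embeddings.bin` implies a compact on-disk vector format. One plausible scheme, sketched here as an assumption rather than a spec, is per-vector symmetric int8 quantization (a float32 scale followed by one byte per dimension):

```python
import struct


def quantize(vec: list[float]) -> bytes:
    """Pack one embedding as <f32 scale><int8 per dim>."""
    scale = max(abs(x) for x in vec) / 127 or 1.0
    q = bytes(round(x / scale) & 0xFF for x in vec)
    return struct.pack("<f", scale) + q


def dequantize(blob: bytes, dim: int) -> list[float]:
    (scale,) = struct.unpack_from("<f", blob)
    raw = blob[4:4 + dim]
    # Bytes > 127 are two's-complement negatives.
    return [((b - 256) if b > 127 else b) * scale for b in raw]
```

At 4 + dim bytes per vector this is what makes client-side similarity for ~10K chunks plausible; whether the precision loss is acceptable is part of open question 2.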

### Pages

**index.html** — Landing page:
- D3 force-directed topic cloud (sized by claim count, colored by category)
- Search bar with live results (lunr.js for keyword, client-side embedding similarity for semantic)
- Recent/popular videos grid

**topics/{slug}.html** — Topic page:
- All snippets matching this topic, across all videos
- Each snippet: video title, timestamp link, transcript excerpt, insight text
- Sort by: relevance, recency, video
- Many snippets per topic (not just the top match)

**videos/{video_id}.html** — Video page:
- Embedded YouTube player
- Chapter timeline (clickable)
- All extracted insights/claims, organized by chapter
- Soundbites highlighted
- Related videos (by embedding similarity)

### Generation

`lib/semnet/portal.py` — generic static site generator:
- Takes: SQLite DB path, Qdrant collection, Jinja2 templates, output dir
- Produces: HTML files + JSON indexes
- Domain adapter provides: templates, CSS overrides, metadata formatters

`projects/healthygamer/pipeline.py generate-portal` — HG-specific CLI command that calls the generator.
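The generator's shape can be sketched as: query SQLite, render one page per topic, emit JSON indexes. Table and column names below are assumptions (the real schema is in `lib/semnet/schema.py`), and `string.Template` stands in for the Jinja2 templates `portal.py` would actually use:

```python
import json
import sqlite3
from pathlib import Path
from string import Template  # stand-in for Jinja2 in this sketch


def generate_portal(db_path: str, out_dir: str) -> None:
    out = Path(out_dir)
    (out / "topics").mkdir(parents=True, exist_ok=True)
    con = sqlite3.connect(db_path)
    topics = con.execute(
        "SELECT slug, label, COUNT(*) FROM claims "
        "JOIN taxonomy_nodes USING (taxonomy_id) GROUP BY slug, label"
    ).fetchall()
    page = Template("<h1>$label</h1><p>$count snippets</p>")
    for slug, label, count in topics:
        (out / "topics" / f"{slug}.html").write_text(
            page.substitute(label=label, count=count)
        )
    # topics/index.json feeds the landing-page topic cloud.
    (out / "topics" / "index.json").write_text(
        json.dumps([{"slug": s, "label": l, "count": c} for s, l, c in topics])
    )
```

The domain adapter would inject templates and formatters at the render step; everything else stays generic.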

## Observability

| What | How |
|------|-----|
| Processing progress | SQLite table: `processing_log(video_id, stage, status, started_at, completed_at, error)` |
| Chunk quality | After first 10 videos, generate diagnostic HTML showing chunks + detected topics for manual review |
| Embedding coverage | Dashboard query: videos processed, total chunks, total claims, taxonomy distribution |
| Cost tracking | Existing `lib/llm` cost logging to `~/.coord/llm_costs.db` |
| Portal freshness | `portal/meta.json` with generation timestamp, video count, claim count |

## Estimated Costs

| Step | Cost |
|------|------|
| Topic-change chunking (935 × Flash) | ~$1 |
| Insight extraction (935 × ~10 chunks × Flash) | ~$10 |
| Video summaries (935 × Flash) | ~$1 |
| Embeddings (~10K chunks × text-embedding-3-small) | ~$0.50 |
| **Total** | **~$13** |

## End-to-End Build Order

1. **YouTubeChannelAdapter** — VTT parsing, chapter extraction, metadata loading
2. **Topic-change chunking** — `lib/semnet/chunk.py`
3. **Process 1 video** — wire adapter + chunking, inspect output
4. **HGAdapter** — taxonomy seed, extraction prompt
5. **Claim extraction on 1 video** — inspect quality
6. **Embedding** — build out `lib/semnet/embed.py` (wraps `lib/llm/embed.py`)
7. **Store** — build `lib/semnet/store.py` (wraps `lib/vectors/`)
8. **Process 10 videos** — spot-check diagnostic report
9. **Portal generator** — static HTML from SQLite + Qdrant data
10. **Process all 935** — batch pipeline with progress tracking
11. **Portal polish** — topic cloud, search, navigation

## Existing Infrastructure Reused

| Component | Location | What it provides |
|-----------|----------|-----------------|
| VTT parser | `lib/transcript_viewer/loader.py` | Parse WebVTT into timestamped cues |
| VectorStore | `lib/vectors/__init__.py` | Qdrant wrapper (local mode, string IDs) |
| Embeddings | `lib/llm/embed.py` | `embed_texts()` with cost tracking |
| SemanticNet extract | `lib/semnet/extract.py` | Claim extraction pipeline |
| SemanticNet schema | `lib/semnet/schema.py` | SQLite tables for claims + taxonomy |
| Static server | `static.localhost` | `.share` file pattern for hosting |
| Gallery patterns | `learning/pond/gallery.html` | Dark theme, carousel, grid layouts |
| Video data | `video-analysis/content/healthy_gamer/` | 935 videos with VTT + metadata.json |

## Open Questions

1. Should chapter extraction also call yt-dlp to re-fetch `--write-info-json` for the chapters field, or only parse description text?
2. Client-side embedding similarity: ship quantized embeddings (~5MB for 10K chunks) or only keyword search for static portal?
3. Should the portal generator live in `lib/semnet/portal.py` (generic) or `projects/healthygamer/generate.py` (HG-specific) initially?
