# Healthy Gamer Portal — Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Build an indexed, searchable static portal from 935 Healthy Gamer YouTube transcripts using SemanticNet.

**Architecture:** YouTubeChannelAdapter (shared VTT/chapter/chunking) → HGAdapter (taxonomy/prompts) → SemanticNet pipeline (extract → embed → store) → static HTML portal on static.localhost.

**Tech Stack:** SemanticNet, Qdrant (local), SQLite, Jinja2, D3.js, lunr.js, lib/llm/embed.py, lib/transcript_viewer/loader.py

**Design doc:** `docs/plans/2026-02-23-healthygamer-portal-design.md`

---

### Task 1: Data Models — Chunk and Chapter

**Files:**
- Modify: `lib/semnet/models.py`
- Test: `lib/semnet/tests/test_models.py`

**Step 1: Write failing test**

```python
# lib/semnet/tests/test_models.py
from lib.semnet.models import Chapter, Chunk

def test_chapter_from_description_line():
    ch = Chapter.from_description_line("0:05:30 Treatment Options")
    assert ch.title == "Treatment Options"
    assert ch.start_s == 330.0

def test_chunk_text_preview():
    c = Chunk(start_s=0, end_s=60, text="Hello world " * 50, topic_label="intro")
    assert len(c.text_preview) <= 200
    assert c.text_preview.endswith("...")

def test_chunk_duration():
    c = Chunk(start_s=10.5, end_s=70.5, text="test", topic_label="x")
    assert c.duration_s == 60.0
```

**Step 2:** Run: `pytest lib/semnet/tests/test_models.py -v` → FAIL

**Step 3: Implement**

Add to `lib/semnet/models.py`:

```python
# NOTE: `from __future__ import annotations` must sit at the very top of
# models.py; skip any of these imports the file already has.
from __future__ import annotations

import re
from dataclasses import dataclass

@dataclass
class Chapter:
    """A YouTube chapter (from description or info.json)."""
    title: str
    start_s: float
    end_s: float | None = None  # None = until next chapter or video end

    _TS_RE = re.compile(r"(?:(\d+):)?(\d+):(\d{2})")

    @classmethod
    def from_description_line(cls, line: str) -> Chapter | None:
        """Parse '0:05:30 Treatment Options' format."""
        m = cls._TS_RE.match(line.strip())
        if not m:
            return None
        h, mn, s = int(m.group(1) or 0), int(m.group(2)), int(m.group(3))
        title = line[m.end():].strip().lstrip("- –—")
        return cls(title=title or "Untitled", start_s=h * 3600 + mn * 60 + s)

    @classmethod
    def parse_description(cls, desc: str) -> list[Chapter]:
        """Extract chapters from a YouTube description with timestamps."""
        chapters = []
        for line in desc.splitlines():
            ch = cls.from_description_line(line)
            if ch is not None:
                chapters.append(ch)
        # Set end_s from next chapter's start
        for i in range(len(chapters) - 1):
            chapters[i].end_s = chapters[i + 1].start_s
        return chapters


@dataclass
class Chunk:
    """A segment of transcript, bounded by topic changes or chapters."""
    start_s: float
    end_s: float
    text: str
    topic_label: str = ""
    chapter_title: str = ""
    video_id: str = ""

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s

    @property
    def text_preview(self) -> str:
        if len(self.text) <= 200:
            return self.text
        return self.text[:197] + "..."
```

**Step 4:** Run: `pytest lib/semnet/tests/test_models.py -v` → PASS

**Step 5:** `git add lib/semnet/models.py lib/semnet/tests/test_models.py && git commit -m "feat(semanticnet): add Chapter and Chunk data models"`

---

### Task 2: YouTubeChannelAdapter — VTT Loading + Metadata

**Files:**
- Create: `lib/semnet/adapters/__init__.py` (empty)
- Create: `lib/semnet/adapters/youtube.py`
- Test: `lib/semnet/tests/test_youtube_adapter.py`

**Step 1: Write failing test**

```python
# lib/semnet/tests/test_youtube_adapter.py
from pathlib import Path
from lib.semnet.adapters.youtube import YouTubeChannelAdapter

# Use a real video from the HG dataset for integration
SAMPLE_DIR = Path("video-analysis/content/healthy_gamer")

class _TestAdapter(YouTubeChannelAdapter):
    """Minimal concrete adapter for testing."""
    name = "test_yt"
    def __init__(self):
        super().__init__(content_dir=SAMPLE_DIR, db_path=Path("/tmp/test_yt.db"))
    def get_seed_taxonomy(self): return {}
    def get_extraction_prompt(self, taxonomy_yaml): return ""
    def get_system_prompt(self): return ""

def test_list_video_ids():
    a = _TestAdapter()
    ids = a.get_doc_ids(limit=5)
    assert len(ids) == 5
    assert all(isinstance(i, str) for i in ids)

def test_load_document():
    a = _TestAdapter()
    ids = a.get_doc_ids(limit=1)
    doc = a.load_document(ids[0])
    assert "text" in doc
    assert "title" in doc
    assert "video_id" in doc
    assert len(doc["text"]) > 100  # VTT content loaded

def test_preprocess_strips_vtt_tags():
    a = _TestAdapter()
    raw = "hello<00:00:01.200><c> world</c> test"
    clean = a.preprocess(raw)
    assert "<c>" not in clean
    assert "<00:" not in clean
    assert "hello" in clean
    assert "world" in clean
```

**Step 2:** Run: `pytest lib/semnet/tests/test_youtube_adapter.py -v` → FAIL

**Step 3: Implement**

```python
# lib/semnet/adapters/youtube.py
"""YouTube channel adapter for SemanticNet.

Base class for any YouTube channel. Handles VTT parsing, chapter
extraction, metadata loading. Domain-specific adapters extend this.
"""
from __future__ import annotations

import json
import re
from abc import abstractmethod
from pathlib import Path

from loguru import logger

from lib.semnet.adapter import DomainAdapter
from lib.semnet.models import Chapter
from lib.transcript_viewer.loader import load_vtt

# VTT karaoke tags: <00:00:01.200><c> word</c>
_VTT_TAG_RE = re.compile(r"<[^>]+>")


class YouTubeChannelAdapter(DomainAdapter):
    """Base adapter for YouTube channel content."""

    def __init__(self, *, content_dir: Path, db_path: Path):
        self._content_dir = Path(content_dir)
        self._db_path = db_path

    @property
    def db_path(self) -> Path:
        return self._db_path

    def get_doc_ids(self, *, limit: int | None = None, holdout: bool = False) -> list[str]:
        """List video IDs that have VTT transcripts."""
        ids = []
        for d in sorted(self._content_dir.iterdir()):
            if not d.is_dir():
                continue
            # Must have a VTT file
            vtts = list(d.glob("*.vtt"))
            if vtts:
                ids.append(d.name)
            if limit and len(ids) >= limit:
                break
        return ids

    def load_document(self, doc_id: str) -> dict:
        """Load video metadata + transcript text."""
        vid_dir = self._content_dir / doc_id

        # Load metadata
        meta_path = vid_dir / "metadata.json"
        meta = {}
        if meta_path.exists():
            meta = json.loads(meta_path.read_text())

        # Load VTT → plain text
        vtts = list(vid_dir.glob("*.vtt"))
        text = ""
        cues = []
        if vtts:
            cues = load_vtt(vtts[0])
            text = "\n".join(c.get("text_display", c.get("text", "")) for c in cues)

        # Parse chapters from description if available
        chapters = []
        desc = meta.get("description", "")
        if desc:
            chapters = Chapter.parse_description(desc)

        return {
            "text": text,
            "video_id": doc_id,
            "title": meta.get("title", ""),
            "duration": meta.get("duration", 0),
            "upload_date": meta.get("upload_date", ""),
            "url": meta.get("url", ""),
            "cues": cues,
            "chapters": chapters,
        }

    def preprocess(self, raw_content: str) -> str:
        """Strip VTT tags, collapse whitespace."""
        text = _VTT_TAG_RE.sub("", raw_content)
        text = re.sub(r"\s+", " ", text).strip()
        return text
```

**Step 4:** Run: `pytest lib/semnet/tests/test_youtube_adapter.py -v` → PASS

**Step 5:** `git add lib/semnet/adapters/ lib/semnet/tests/test_youtube_adapter.py && git commit -m "feat(semanticnet): YouTubeChannelAdapter — VTT loading + metadata"`

---

### Task 3: Topic-Change Chunking

**Files:**
- Create: `lib/semnet/chunk.py`
- Test: `lib/semnet/tests/test_chunk.py`

**Step 1: Write failing test**

```python
# lib/semnet/tests/test_chunk.py
import asyncio
from lib.semnet.chunk import chunk_fixed_window, chunk_by_topic_change
from lib.semnet.models import Chapter

SAMPLE_CUES = [
    {"offset_s": i * 5.0, "end_s": (i + 1) * 5.0, "text": f"sentence {i}"}
    for i in range(60)  # 5 minutes of cues
]

def test_fixed_window_basic():
    chunks = chunk_fixed_window(SAMPLE_CUES, window_s=60, overlap_s=0)
    assert len(chunks) == 5  # 300s / 60s
    assert chunks[0].start_s == 0.0
    assert chunks[0].end_s == 60.0

def test_fixed_window_with_chapters():
    chapters = [
        Chapter("Intro", 0, 120),
        Chapter("Main", 120, 300),
    ]
    chunks = chunk_fixed_window(SAMPLE_CUES, window_s=60, overlap_s=0, chapters=chapters)
    # Should respect chapter boundaries
    assert chunks[0].chapter_title == "Intro"
    assert chunks[2].chapter_title == "Main"

def test_chunk_by_topic_change_returns_chunks():
    """Integration test — calls LLM. Use mock if needed."""
    # For unit testing, just verify the function signature works
    # with fixed-window fallback when model="mock"
    chunks = asyncio.run(
        chunk_by_topic_change(SAMPLE_CUES, chapters=None, model="mock")
    )
    assert len(chunks) > 0
    assert all(c.text for c in chunks)
```

**Step 2:** Run test → FAIL

**Step 3: Implement**

```python
# lib/semnet/chunk.py
"""Topic-change aware transcript chunking.

Two modes:
- chunk_fixed_window: Simple time-based windows (fast, no LLM)
- chunk_by_topic_change: LLM detects topic shifts (better quality)
"""
from __future__ import annotations

from lib.semnet.models import Chapter, Chunk
from loguru import logger


def _cues_in_range(cues: list[dict], start_s: float, end_s: float) -> list[dict]:
    """Filter cues within a time range."""
    return [c for c in cues if c.get("offset_s", 0) >= start_s
            and c.get("offset_s", 0) < end_s]


def _cues_to_text(cues: list[dict]) -> str:
    return " ".join(c.get("text_display", c.get("text", "")) for c in cues).strip()


def chunk_fixed_window(
    cues: list[dict],
    *,
    window_s: float = 60,
    overlap_s: float = 0,
    chapters: list[Chapter] | None = None,
) -> list[Chunk]:
    """Split transcript into fixed-duration windows, respecting chapter boundaries."""
    if not cues:
        return []

    total_end = max(c.get("end_s", c.get("offset_s", 0) + 5) for c in cues)
    chunks = []

    if chapters:
        # Chunk within each chapter
        for ch in chapters:
            ch_end = ch.end_s if ch.end_s else total_end
            t = ch.start_s
            while t < ch_end:
                seg_end = min(t + window_s, ch_end)
                seg_cues = _cues_in_range(cues, t, seg_end)
                if seg_cues:
                    chunks.append(Chunk(
                        start_s=t, end_s=seg_end,
                        text=_cues_to_text(seg_cues),
                        chapter_title=ch.title,
                    ))
                if seg_end >= ch_end:
                    break  # chapter done; stepping back by overlap_s here would loop forever
                t = seg_end - overlap_s if overlap_s else seg_end
    else:
        t = cues[0].get("offset_s", 0)
        while t < total_end:
            seg_end = min(t + window_s, total_end)
            seg_cues = _cues_in_range(cues, t, seg_end)
            if seg_cues:
                chunks.append(Chunk(
                    start_s=t, end_s=seg_end,
                    text=_cues_to_text(seg_cues),
                ))
            if seg_end >= total_end:
                break  # transcript done; stepping back by overlap_s here would loop forever
            t = seg_end - overlap_s if overlap_s else seg_end

    return chunks


async def chunk_by_topic_change(
    cues: list[dict],
    chapters: list[Chapter] | None = None,
    *,
    min_chunk_s: int = 30,
    max_chunk_s: int = 180,
    model: str = "flash",
) -> list[Chunk]:
    """Chunk transcript at topic change boundaries detected by LLM.

    Falls back to fixed windows for mock/test mode.
    """
    if model == "mock":
        return chunk_fixed_window(cues, window_s=60, chapters=chapters)

    from lib.llm import call_llm

    # Build segments to analyze (chapters or whole transcript)
    if not cues:
        return []

    total_end = max(c.get("end_s", c.get("offset_s", 0) + 5) for c in cues)
    segments: list[tuple[float, float, str]] = []  # (start, end, chapter_title)

    if chapters:
        for ch in chapters:
            segments.append((ch.start_s, ch.end_s or total_end, ch.title))
    else:
        segments.append((0, total_end, ""))

    all_chunks = []

    for seg_start, seg_end, ch_title in segments:
        seg_cues = _cues_in_range(cues, seg_start, seg_end)
        if not seg_cues:
            continue

        seg_duration = seg_end - seg_start
        if seg_duration <= max_chunk_s:
            # Short enough to be one chunk
            all_chunks.append(Chunk(
                start_s=seg_start, end_s=seg_end,
                text=_cues_to_text(seg_cues),
                chapter_title=ch_title,
            ))
            continue

        # Build timestamped transcript for LLM
        ts_lines = []
        for c in seg_cues:
            t = c.get("offset_s", 0)
            mm, ss = int(t // 60), int(t % 60)
            ts_lines.append(f"[{mm:02d}:{ss:02d}] {c.get('text_display', c.get('text', ''))}")
        ts_text = "\n".join(ts_lines)

        prompt = f"""\
Analyze this transcript segment and identify where the topic changes.
Return ONLY a JSON array of objects, each with:
- "time_s": timestamp in seconds where a new topic starts
- "topic": 3-5 word label for the topic starting here

Constraints:
- Minimum {min_chunk_s}s between topic changes
- Maximum {max_chunk_s}s per topic (force split if needed)
- First entry should be at {seg_start:.0f}s

Transcript:
{ts_text[:15000]}"""

        try:
            import json
            import re
            resp = await call_llm(model=model, prompt=prompt, temperature=0.0)
            # Strip markdown fencing the model may add despite instructions
            cleaned = resp.strip()
            if cleaned.startswith("```"):
                cleaned = re.sub(r'^```(?:json)?\s*', '', cleaned)
                cleaned = re.sub(r'\s*```$', '', cleaned)
            splits = json.loads(cleaned)
        except Exception as e:
            logger.warning(f"Topic detection failed, using fixed windows: {e}")
            all_chunks.extend(chunk_fixed_window(
                seg_cues, window_s=60, chapters=[Chapter(ch_title, seg_start, seg_end)] if ch_title else None
            ))
            continue

        # Build chunks from detected splits
        for i, split in enumerate(splits):
            t_start = float(split["time_s"])
            t_end = float(splits[i + 1]["time_s"]) if i + 1 < len(splits) else seg_end
            split_cues = _cues_in_range(cues, t_start, t_end)
            if split_cues:
                all_chunks.append(Chunk(
                    start_s=t_start, end_s=t_end,
                    text=_cues_to_text(split_cues),
                    topic_label=split.get("topic", ""),
                    chapter_title=ch_title,
                ))

    return all_chunks
```

**Step 4:** Run: `pytest lib/semnet/tests/test_chunk.py -v` → PASS

**Step 5:** `git add lib/semnet/chunk.py lib/semnet/tests/test_chunk.py && git commit -m "feat(semanticnet): topic-change chunking engine"`

---

### Task 4: Process 1 Video — Integration Wire-Up

**Files:**
- Create: `projects/healthygamer/adapter.py`
- Create: `projects/healthygamer/taxonomy_seed.yaml`
- Create: `projects/healthygamer/pipeline.py`
- Test: `projects/healthygamer/tests/test_adapter.py`

**Step 1: Write test**

```python
# projects/healthygamer/tests/test_adapter.py
from projects.healthygamer.adapter import HGAdapter

def test_hg_adapter_basics():
    a = HGAdapter()
    assert a.name == "healthygamer"
    ids = a.get_doc_ids(limit=1)
    assert len(ids) == 1

def test_hg_taxonomy_loaded():
    a = HGAdapter()
    tax = a.get_seed_taxonomy()
    assert "mental_health" in tax or "emotional_regulation" in tax
    # Should have multiple categories
    assert len(tax) >= 5

def test_hg_load_and_preprocess():
    a = HGAdapter()
    ids = a.get_doc_ids(limit=1)
    doc = a.load_document(ids[0])
    clean = a.preprocess(doc["text"])
    assert "<c>" not in clean
    assert len(clean) > 50
```

**Step 2:** Run → FAIL

**Step 3: Implement adapter**

```python
# projects/healthygamer/adapter.py
"""Healthy Gamer domain adapter for SemanticNet."""
from __future__ import annotations

from pathlib import Path

import yaml

from lib.semnet.adapters.youtube import YouTubeChannelAdapter

_ROOT = Path(__file__).parent
_CONTENT_DIR = Path("video-analysis/content/healthy_gamer")
_DB_PATH = _ROOT / "data" / "healthygamer.db"


class HGAdapter(YouTubeChannelAdapter):
    """Healthy Gamer YouTube channel adapter."""

    name = "healthygamer"

    def __init__(self):
        super().__init__(content_dir=_CONTENT_DIR, db_path=_DB_PATH)
        self._taxonomy = None

    def get_seed_taxonomy(self) -> dict:
        if self._taxonomy is None:
            tax_path = _ROOT / "taxonomy_seed.yaml"
            self._taxonomy = yaml.safe_load(tax_path.read_text())
        return self._taxonomy

    def get_system_prompt(self) -> str:
        return (
            "You are a mental health content analyst. Extract key insights, "
            "actionable advice, and notable soundbites from this video transcript. "
            "Return ONLY valid JSON — no markdown fencing, no explanation."
        )

    def get_extraction_prompt(self, taxonomy_yaml: str) -> str:
        return f"""\
Analyze this Healthy Gamer video transcript segment and extract ALL distinct insights.

For each insight, return a JSON object with:
- claim_text: The insight in 1-2 sentences (clear, standalone)
- evidence: Verbatim quote from transcript (1-3 sentences)
- category: Top-level taxonomy category (see below)
- sub_type: Sub-type within that category
- tags: 2-5 descriptive tags
- confidence: 0.0-1.0 classification confidence
- direction: "actionable" (advice/technique) or "conceptual" (explanation/framework)

If an insight doesn't fit existing categories, use category="uncategorized".

TAXONOMY:
{taxonomy_yaml}

Return a JSON object with a "claims" array.

<transcript>
{{text}}
</transcript>"""
```

**Step 3b: Create taxonomy seed** (`projects/healthygamer/taxonomy_seed.yaml`):

```yaml
mental_health:
  name: "Mental Health Conditions"
  description: "Specific conditions, symptoms, diagnosis"
  sub_types:
    depression:
      description: "Depression, anhedonia, low motivation"
      decision_rule: "Assign if discussing depressive symptoms or their management"
    anxiety:
      description: "Anxiety disorders, panic, social anxiety, OCD"
      decision_rule: "Assign if discussing anxiety-related conditions"
    adhd:
      description: "ADHD, focus, executive function, stimulant medication"
      decision_rule: "Assign if discussing attention deficit or executive function"
    addiction:
      description: "Gaming, internet, substance addiction"
      decision_rule: "Assign if discussing addictive behaviors or recovery"
    trauma:
      description: "PTSD, childhood trauma, emotional wounds"
      decision_rule: "Assign if discussing past trauma and its effects"

emotional_regulation:
  name: "Emotional Regulation & Processing"
  description: "Managing emotions, processing feelings"
  sub_types:
    awareness:
      description: "Identifying and naming emotions"
      decision_rule: "Assign if about recognizing emotional states"
    techniques:
      description: "Specific techniques for emotion management"
      decision_rule: "Assign if providing actionable emotion regulation methods"
    suppression:
      description: "Emotional suppression and its consequences"
      decision_rule: "Assign if discussing avoidance or suppression of emotions"

relationships:
  name: "Relationships & Social Skills"
  description: "Dating, friendships, family, loneliness"
  sub_types:
    dating:
      description: "Romantic relationships, dating challenges"
      decision_rule: "Assign if about romantic relationship dynamics"
    family:
      description: "Parent-child relationships, family conflict"
      decision_rule: "Assign if about family dynamics"
    social:
      description: "Friendships, social skills, loneliness"
      decision_rule: "Assign if about platonic relationships or social isolation"
    boundaries:
      description: "Setting and maintaining boundaries"
      decision_rule: "Assign if about establishing healthy interpersonal limits"

motivation:
  name: "Motivation & Productivity"
  description: "Willpower, discipline, procrastination, goals"
  sub_types:
    procrastination:
      description: "Understanding and overcoming procrastination"
      decision_rule: "Assign if discussing task avoidance patterns"
    discipline:
      description: "Building habits, willpower, consistency"
      decision_rule: "Assign if about sustained effort and habit formation"
    purpose:
      description: "Finding meaning, direction, dharma"
      decision_rule: "Assign if about life purpose or existential direction"

identity:
  name: "Identity & Self-Worth"
  description: "Self-esteem, ego, identity formation"
  sub_types:
    self_esteem:
      description: "Self-worth, confidence, shame"
      decision_rule: "Assign if about how one values oneself"
    ego:
      description: "Ego, ahamkara, sense of self"
      decision_rule: "Assign if discussing ego structures or false self"
    comparison:
      description: "Social comparison, envy, inadequacy"
      decision_rule: "Assign if about comparing oneself to others"

meditation:
  name: "Meditation & Mindfulness"
  description: "Meditation techniques, mindfulness, awareness practices"
  sub_types:
    technique:
      description: "Specific meditation instructions"
      decision_rule: "Assign if providing step-by-step meditation guidance"
    philosophy:
      description: "Yogic/Buddhist philosophy behind practices"
      decision_rule: "Assign if explaining philosophical foundations"
    benefits:
      description: "Effects of meditation practice"
      decision_rule: "Assign if discussing outcomes of meditation"

neuroscience:
  name: "Neuroscience & Psychology"
  description: "Brain science, dopamine, cognitive patterns"
  sub_types:
    dopamine:
      description: "Dopamine system, reward circuits"
      decision_rule: "Assign if about dopamine pathways or reward mechanisms"
    cognitive:
      description: "Cognitive biases, thought patterns"
      decision_rule: "Assign if about how the mind processes information"
    development:
      description: "Brain development, neuroplasticity"
      decision_rule: "Assign if about brain maturation or change capacity"

career:
  name: "Career & Life Skills"
  description: "Job searching, interviews, career decisions"
  sub_types:
    job_search:
      description: "Finding work, interviews, applications"
      decision_rule: "Assign if about practical job-seeking"
    skill_building:
      description: "Learning, education, skill development"
      decision_rule: "Assign if about acquiring capabilities"
    burnout:
      description: "Work burnout, work-life balance"
      decision_rule: "Assign if about exhaustion from work demands"
```
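As the seed taxonomy grows, a structural check catches missing fields before a malformed category ever reaches an extraction prompt. A minimal sketch (`validate_taxonomy` is a hypothetical helper; the inline `sample` fragment mirrors the YAML shape above — in practice pass the `yaml.safe_load` output):

```python
def validate_taxonomy(tax: dict) -> list[str]:
    """Return a list of problems; an empty list means the taxonomy is well-formed."""
    problems = []
    for cat, spec in tax.items():
        # Every category needs a display name, description, and sub_types map
        for key in ("name", "description", "sub_types"):
            if key not in spec:
                problems.append(f"{cat}: missing '{key}'")
        # Every sub-type needs a description and a decision rule
        for sub, sub_spec in spec.get("sub_types", {}).items():
            for key in ("description", "decision_rule"):
                if key not in sub_spec:
                    problems.append(f"{cat}.{sub}: missing '{key}'")
    return problems


# Inline fragment mirroring taxonomy_seed.yaml's shape
sample = {
    "mental_health": {
        "name": "Mental Health Conditions",
        "description": "Specific conditions, symptoms, diagnosis",
        "sub_types": {
            "adhd": {
                "description": "ADHD, focus, executive function",
                "decision_rule": "Assign if discussing attention deficit",
            },
        },
    },
}
assert validate_taxonomy(sample) == []
```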

**Step 3c: Create pipeline CLI** (`projects/healthygamer/pipeline.py`):

```python
#!/usr/bin/env python
"""Healthy Gamer portal pipeline — process videos and generate portal."""
from __future__ import annotations

import asyncio
import json
import sys
from pathlib import Path

import click
from loguru import logger

sys.path.insert(0, str(Path(__file__).resolve().parent.parent.parent))

from projects.healthygamer.adapter import HGAdapter
from lib.semnet.chunk import chunk_by_topic_change, chunk_fixed_window


@click.group()
def cli():
    """Healthy Gamer portal pipeline."""


@cli.command()
@click.argument("video_id")
@click.option("--mode", default="fixed", type=click.Choice(["fixed", "topic"]))
def chunk(video_id: str, mode: str):
    """Chunk a single video's transcript."""
    adapter = HGAdapter()
    doc = adapter.load_document(video_id)
    if not doc.get("cues"):
        click.echo(f"No transcript found for {video_id}")
        return

    if mode == "topic":
        chunks = asyncio.run(chunk_by_topic_change(
            doc["cues"], chapters=doc.get("chapters") or None))
    else:
        chunks = chunk_fixed_window(
            doc["cues"], window_s=60,
            chapters=doc.get("chapters") or None)

    click.echo(f"\n{doc['title']}")
    click.echo(f"Duration: {doc['duration']:.0f}s | Chunks: {len(chunks)}\n")
    for i, c in enumerate(chunks):
        mm_s, ss_s = int(c.start_s // 60), int(c.start_s % 60)
        mm_e, ss_e = int(c.end_s // 60), int(c.end_s % 60)
        label = f" [{c.topic_label}]" if c.topic_label else ""
        ch = f" ({c.chapter_title})" if c.chapter_title else ""
        click.echo(f"  {i+1:3d}. {mm_s:02d}:{ss_s:02d}-{mm_e:02d}:{ss_e:02d}{ch}{label}")
        click.echo(f"       {c.text_preview}\n")


@cli.command()
@click.option("--limit", default=None, type=int)
def list_videos(limit: int | None):
    """List available video IDs."""
    adapter = HGAdapter()
    ids = adapter.get_doc_ids(limit=limit)
    for vid in ids:
        doc = adapter.load_document(vid)
        click.echo(f"{vid}  {doc.get('title', '?')[:60]}")
    click.echo(f"\nTotal: {len(ids)} videos")


if __name__ == "__main__":
    cli()
```

**Step 4:** Run: `pytest projects/healthygamer/tests/test_adapter.py -v` → PASS

Then manual test: `python projects/healthygamer/pipeline.py list-videos --limit 3`
Then: `python projects/healthygamer/pipeline.py chunk <video_id> --mode fixed`

**Step 5:** Commit all HG adapter files.

---

### Task 5: SemanticNet Schema — Processing Log Table

**Files:**
- Modify: `lib/semnet/schema.py`
- Test: `lib/semnet/tests/test_schema.py`

Add `processing_log` and `chunks` tables: the former tracks per-stage pipeline progress, the latter stores transcript chunks.

**Step 1: Write test**

```python
# lib/semnet/tests/test_schema.py
import sqlite3
from pathlib import Path
from lib.semnet.schema import ensure_schema

def test_schema_creates_all_tables(tmp_path):
    db = tmp_path / "test.db"
    conn = ensure_schema(db)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'").fetchall()]
    assert "claims" in tables
    assert "taxonomy_nodes" in tables
    assert "processing_log" in tables
    assert "chunks" in tables
    conn.close()
```

**Step 2:** Run → FAIL

**Step 3:** Add to `lib/semnet/schema.py`:

```python
PROCESSING_LOG_SCHEMA = """
CREATE TABLE IF NOT EXISTS processing_log (
    video_id    TEXT NOT NULL,
    stage       TEXT NOT NULL,
    status      TEXT DEFAULT 'pending',
    started_at  TEXT,
    completed_at TEXT,
    error       TEXT,
    meta        TEXT DEFAULT '{}',
    PRIMARY KEY (video_id, stage)
)
"""

CHUNKS_SCHEMA = """
CREATE TABLE IF NOT EXISTS chunks (
    chunk_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    doc_id      TEXT NOT NULL,
    start_s     REAL NOT NULL,
    end_s       REAL NOT NULL,
    text        TEXT NOT NULL,
    topic_label TEXT DEFAULT '',
    chapter_title TEXT DEFAULT '',
    created_at  TEXT DEFAULT (datetime('now'))
)
"""
```

Add both statements to the `ensure_schema` DDL loop, and create two indexes: `idx_chunks_doc_id` on `chunks(doc_id)` and `idx_processing_log_video` on `processing_log(video_id)`.
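A self-contained sketch of the wiring (the flat `DDL` list here is illustrative — in the real module these statements join whatever structure `ensure_schema` already iterates over):

```python
import sqlite3

# DDL from this task plus the two supporting indexes
DDL = [
    """CREATE TABLE IF NOT EXISTS processing_log (
        video_id     TEXT NOT NULL,
        stage        TEXT NOT NULL,
        status       TEXT DEFAULT 'pending',
        started_at   TEXT,
        completed_at TEXT,
        error        TEXT,
        meta         TEXT DEFAULT '{}',
        PRIMARY KEY (video_id, stage)
    )""",
    """CREATE TABLE IF NOT EXISTS chunks (
        chunk_id      INTEGER PRIMARY KEY AUTOINCREMENT,
        doc_id        TEXT NOT NULL,
        start_s       REAL NOT NULL,
        end_s         REAL NOT NULL,
        text          TEXT NOT NULL,
        topic_label   TEXT DEFAULT '',
        chapter_title TEXT DEFAULT '',
        created_at    TEXT DEFAULT (datetime('now'))
    )""",
    "CREATE INDEX IF NOT EXISTS idx_chunks_doc_id ON chunks(doc_id)",
    "CREATE INDEX IF NOT EXISTS idx_processing_log_video ON processing_log(video_id)",
]

conn = sqlite3.connect(":memory:")
for stmt in DDL:
    conn.execute(stmt)
conn.commit()
```

Every statement is `IF NOT EXISTS`, so re-running `ensure_schema` against an existing database is a no-op.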

**Step 4:** Run → PASS

**Step 5:** Commit.

---

### Task 6: Claim Extraction on 1 Video

**Files:**
- Test: `projects/healthygamer/tests/test_extraction.py`

This task wires up the existing `lib/semnet/extract.py` with HGAdapter on a real video. Requires LLM API key.

**Step 1: Write test** (integration, marked slow)

```python
# projects/healthygamer/tests/test_extraction.py
import asyncio
import pytest
from projects.healthygamer.adapter import HGAdapter
from lib.semnet.extract import extract_claims

@pytest.mark.slow
def test_extract_one_video():
    adapter = HGAdapter()
    ids = adapter.get_doc_ids(limit=1)
    result = asyncio.run(extract_claims(adapter, ids[0], store=False))
    assert result.error is None
    assert len(result.claims) >= 3
    # Check claim structure
    c = result.claims[0]
    assert c.claim_text
    assert c.category
```

**Step 2:** Run: `pytest projects/healthygamer/tests/test_extraction.py -v -m slow` → PASS (calls real LLM)

**Step 3:** Inspect output quality. Adjust extraction prompt in adapter if needed.
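For the inspection step, a tiny helper makes skewed category distributions obvious at a glance (`category_histogram` is hypothetical — it assumes claims serialized as dicts with a `category` key, matching the extraction prompt's schema):

```python
from collections import Counter


def category_histogram(claims: list[dict]) -> dict[str, int]:
    """Count claims per taxonomy category, most common first."""
    counts = Counter(c.get("category", "uncategorized") for c in claims)
    return dict(counts.most_common())


claims = [
    {"claim_text": "...", "category": "mental_health"},
    {"claim_text": "...", "category": "mental_health"},
    {"claim_text": "...", "category": "meditation"},
]
print(category_histogram(claims))  # {'mental_health': 2, 'meditation': 1}
```

A large `uncategorized` bucket, or one category absorbing most claims, is the usual signal that the taxonomy's decision rules need tightening.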

**Step 4:** Commit.

---

### Task 7: SemanticNet Embed Module

**Files:**
- Create: `lib/semnet/embed.py`
- Test: `lib/semnet/tests/test_embed.py`

Wraps `lib/llm/embed.py` with three-level embedding logic (Level 1: document summaries, Level 2: contextual chunks, Level 3: individual claims).

**Step 1: Write test**

```python
# lib/semnet/tests/test_embed.py
import asyncio
import pytest
from lib.semnet.embed import embed_chunks, embed_document_summary
from lib.semnet.models import Chunk

@pytest.mark.slow
def test_embed_chunks():
    chunks = [
        Chunk(start_s=0, end_s=60, text="ADHD affects dopamine pathways", video_id="test"),
        Chunk(start_s=60, end_s=120, text="Meditation can help focus", video_id="test"),
    ]
    results = asyncio.run(embed_chunks(chunks, context_title="Test Video"))
    assert len(results) == 2
    assert len(results[0]) == 1536  # text-embedding-3-small dimension

@pytest.mark.slow
def test_embed_summary():
    vec = asyncio.run(embed_document_summary("A video about ADHD and meditation techniques"))
    assert len(vec) == 1536
```

**Step 2:** Run → FAIL

**Step 3: Implement**

```python
# lib/semnet/embed.py
"""SemanticNet embedding — 3-level embedding via lib/llm/embed.py."""
from __future__ import annotations

from lib.llm.embed import embed_texts
from lib.semnet.models import Chunk


async def embed_document_summary(summary: str, *, model: str = "text-embedding-3-small") -> list[float]:
    """Embed a single document summary (Level 1)."""
    vecs = await embed_texts([summary], model=model)
    return vecs[0]


async def embed_chunks(
    chunks: list[Chunk],
    *,
    context_title: str = "",
    model: str = "text-embedding-3-small",
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed chunks with contextual prefixes (Level 2).

    Each chunk's text is prefixed with the video title for Contextual Retrieval.
    """
    texts = []
    for c in chunks:
        prefix = f"Video: {context_title}. " if context_title else ""
        if c.chapter_title:
            prefix += f"Chapter: {c.chapter_title}. "
        texts.append(prefix + c.text)

    all_vecs = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        vecs = await embed_texts(batch, model=model)
        all_vecs.extend(vecs)

    return all_vecs


async def embed_claims(
    claim_texts: list[str],
    *,
    model: str = "text-embedding-3-small",
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed individual claims (Level 3)."""
    all_vecs = []
    for i in range(0, len(claim_texts), batch_size):
        batch = claim_texts[i:i + batch_size]
        vecs = await embed_texts(batch, model=model)
        all_vecs.extend(vecs)
    return all_vecs
```

**Step 4:** Run → PASS (with API key)

**Step 5:** Commit.

---

### Task 8: SemanticNet Store Module

**Files:**
- Create: `lib/semnet/store.py`
- Test: `lib/semnet/tests/test_store.py`

Wraps `lib/vectors/VectorStore` + SQLite for dual storage.

**Step 1: Write test**

```python
# lib/semnet/tests/test_store.py
from pathlib import Path
from lib.semnet.store import SemanticStore
from lib.semnet.models import Chunk

def test_store_and_retrieve_chunks(tmp_path):
    store = SemanticStore(db_path=tmp_path / "test.db", vector_path=tmp_path / "vectors")
    chunks = [
        Chunk(start_s=0, end_s=60, text="ADHD and dopamine", video_id="v1", topic_label="adhd"),
    ]
    # Fake embedding (1536-dim)
    vectors = [[0.1] * 1536]
    store.store_chunks(chunks, vectors)
    assert store.chunk_count() == 1

    # Retrieve by video
    stored = store.get_chunks_for_video("v1")
    assert len(stored) == 1
    assert stored[0]["text"] == "ADHD and dopamine"
    store.close()
```

**Step 2:** Run → FAIL

**Step 3: Implement**

```python
# lib/semnet/store.py
"""SemanticNet dual storage — SQLite (structured) + Qdrant (vectors)."""
from __future__ import annotations

from pathlib import Path

from loguru import logger

from lib.semnet.models import Chunk
from lib.semnet.schema import ensure_schema
from lib.vectors import VectorStore

EMBED_MODEL = "text-embedding-3-small"
EMBED_DIM = 1536
COLLECTION = "chunks"


class SemanticStore:
    """Combined SQLite + Qdrant storage for chunks and claims."""

    def __init__(self, *, db_path: Path, vector_path: Path):
        self._db_path = Path(db_path)
        self._conn = ensure_schema(self._db_path)
        self._vs = VectorStore(vector_path)
        self._vs.ensure_collection(COLLECTION, dim=EMBED_DIM, model=EMBED_MODEL)

    def store_chunks(self, chunks: list[Chunk], vectors: list[list[float]]) -> None:
        """Store chunks in both SQLite and Qdrant."""
        points = []
        for chunk, vec in zip(chunks, vectors):
            # SQLite
            cursor = self._conn.execute(
                """INSERT INTO chunks (doc_id, start_s, end_s, text, topic_label, chapter_title)
                   VALUES (?, ?, ?, ?, ?, ?)""",
                (chunk.video_id, chunk.start_s, chunk.end_s,
                 chunk.text, chunk.topic_label, chunk.chapter_title),
            )
            chunk_id = cursor.lastrowid

            # Qdrant
            points.append({
                "id": f"{chunk.video_id}:{chunk_id}",
                "vector": vec,
                "payload": {
                    "doc_id": chunk.video_id,
                    "chunk_id": chunk_id,
                    "start_s": chunk.start_s,
                    "end_s": chunk.end_s,
                    "text": chunk.text[:500],
                    "topic_label": chunk.topic_label,
                    "chapter_title": chunk.chapter_title,
                    "level": "chunk",
                },
            })

        self._conn.commit()
        if points:
            self._vs.upsert_batch(COLLECTION, points, model=EMBED_MODEL)
        logger.info(f"Stored {len(chunks)} chunks")

    def chunk_count(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]

    def get_chunks_for_video(self, video_id: str) -> list[dict]:
        rows = self._conn.execute(
            "SELECT * FROM chunks WHERE doc_id = ? ORDER BY start_s",
            (video_id,),
        ).fetchall()
        return [dict(r) for r in rows]

    def search_chunks(self, query_vector: list[float], *, limit: int = 20) -> list[dict]:
        return self._vs.search(COLLECTION, query_vector, limit=limit, model=EMBED_MODEL)

    def close(self):
        self._conn.close()
        self._vs.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

**Step 4:** Run → PASS

**Step 5:** Commit.

---

### Task 9: Process 10 Videos + Diagnostic Report

**Files:**
- Modify: `projects/healthygamer/pipeline.py` (add `process` and `diagnose` commands)
- Create: `projects/healthygamer/diagnose.py` (diagnostic HTML generator)

**Step 1:** Add `process` CLI command to pipeline.py:

```python
@cli.command()
@click.option("--limit", default=10, type=int)
@click.option("--mode", default="fixed", type=click.Choice(["fixed", "topic"]))
def process(limit: int, mode: str):
    """Process videos: chunk → extract → embed → store."""
    asyncio.run(_process(limit, mode))

async def _process(limit: int, mode: str):
    adapter = HGAdapter()
    store = SemanticStore(db_path=adapter.db_path, vector_path=adapter.db_path.parent / "vectors")
    ids = adapter.get_doc_ids(limit=limit)
    # ... process each video through chunk→extract→embed→store
```

**Step 2:** Add `diagnose` command that generates an HTML report showing:
- Chunks per video with topic labels and timestamps
- Taxonomy distribution (how many claims per category)
- Sample claims for spot-checking
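The taxonomy-distribution and per-video counts above are plain SQL aggregates. A minimal sketch of the queries the report generator could run, assuming the `chunks` table from the Task 8 schema (`doc_id`, `topic_label` columns) — function names here are illustrative, not part of the plan:

```python
import sqlite3


def taxonomy_distribution(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Count chunks per topic label, most common first."""
    rows = conn.execute(
        "SELECT topic_label, COUNT(*) AS n FROM chunks "
        "GROUP BY topic_label ORDER BY n DESC"
    ).fetchall()
    return [(r[0], r[1]) for r in rows]


def chunks_per_video(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Chunk count per video, for spotting over/under-segmented transcripts."""
    rows = conn.execute(
        "SELECT doc_id, COUNT(*) FROM chunks GROUP BY doc_id"
    ).fetchall()
    return [(r[0], r[1]) for r in rows]
```

The HTML report would render these rows into tables alongside sampled claim text.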

**Step 3:** Run `process --limit 10`, then `diagnose` to inspect output.

**Step 4:** Adjust taxonomy/prompts based on findings.

**Step 5:** Commit.

---

### Task 10: Static Portal Generator

**Files:**
- Create: `lib/semnet/portal.py` (generic)
- Create: `projects/healthygamer/templates/` (Jinja2 templates)
- Create: `projects/healthygamer/portal/.share`

**Step 1:** Build Jinja2 templates for index.html, topic page, video page.

**Step 2:** Portal generator reads SQLite, builds JSON indexes, renders templates.
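A sketch of the "read SQLite, build JSON indexes" half of this step, assuming the Task 8 `chunks` table; the output shape (one JSON file per topic, 200-char previews matching `Chunk.text_preview` from Task 1) is an assumption, not a fixed contract:

```python
import json
import sqlite3
from collections import defaultdict
from pathlib import Path


def build_topic_indexes(conn: sqlite3.Connection, out_dir: Path) -> list[Path]:
    """Group chunk previews by topic and write one JSON index per topic."""
    by_topic: dict[str, list[dict]] = defaultdict(list)
    for doc_id, start_s, text, topic in conn.execute(
        "SELECT doc_id, start_s, text, topic_label FROM chunks "
        "ORDER BY doc_id, start_s"
    ):
        by_topic[topic].append(
            {"video_id": doc_id, "start_s": start_s, "preview": text[:200]}
        )
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for topic, entries in sorted(by_topic.items()):
        path = out_dir / f"{topic}.json"
        path.write_text(json.dumps(entries, indent=2))
        written.append(path)
    return written
```

The Jinja2 templates then render the same rows server-side, while the topic pages fetch these JSON files for client-side interactions.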

**Step 3:** Add `generate-portal` command to pipeline.py.

**Step 4:** Test with 10-video dataset: `python projects/healthygamer/pipeline.py generate-portal`

**Step 5:** Verify on `static.localhost/healthygamer/`

**Step 6:** Commit.

---

### Task 11: Batch Process All 935 Videos

**Files:**
- Modify: `projects/healthygamer/pipeline.py` (add progress tracking, resume)

**Step 1:** Add resume capability — skip already-processed videos (check processing_log).
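The resume check is a set-difference against `processing_log`. A sketch assuming a `processing_log(doc_id, status)` table — the exact column names come from the Task 5 schema and may differ:

```python
import sqlite3


def pending_ids(conn: sqlite3.Connection, all_ids: list[str]) -> list[str]:
    """Filter out videos already marked done in processing_log."""
    done = {
        row[0]
        for row in conn.execute(
            "SELECT doc_id FROM processing_log WHERE status = 'done'"
        )
    }
    return [vid for vid in all_ids if vid not in done]


def mark_done(conn: sqlite3.Connection, doc_id: str) -> None:
    """Record a completed video; safe to call again on re-runs."""
    conn.execute(
        "INSERT OR REPLACE INTO processing_log (doc_id, status) VALUES (?, 'done')",
        (doc_id,),
    )
    conn.commit()
```

With this, `process --limit 935` can be killed and restarted without re-embedding finished videos.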

**Step 2:** Run: `python projects/healthygamer/pipeline.py process --limit 935`

**Step 3:** Monitor via processing_log queries.

**Step 4:** Regenerate portal: `python projects/healthygamer/pipeline.py generate-portal`

**Step 5:** Commit.

---

### Task 12: Portal Polish — Topic Cloud, Search, Navigation

**Files:**
- Modify: `projects/healthygamer/templates/index.html`
- Create: `projects/healthygamer/templates/assets/app.js`

**Step 1:** Add D3.js topic cloud to index page.

**Step 2:** Add lunr.js search index generation to portal generator.
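lunr.js can build its index in the browser from a flat list of JSON documents, so the Python side only needs to emit that list. A sketch of the generator half, assuming the Task 8 `chunks` table; the doc shape and 500-char body cap are assumptions:

```python
import json
import sqlite3
from pathlib import Path


def write_search_docs(conn: sqlite3.Connection, out_path: Path) -> int:
    """Emit one JSON doc per chunk for lunr.js to index client-side."""
    docs = [
        {
            "id": f"{doc_id}:{int(start_s)}",
            "title": chapter or "",
            "topic": topic or "",
            "body": text[:500],  # cap body size to keep the payload small
        }
        for doc_id, start_s, text, topic, chapter in conn.execute(
            "SELECT doc_id, start_s, text, topic_label, chapter_title FROM chunks"
        )
    ]
    out_path.write_text(json.dumps(docs))
    return len(docs)
```

For 935 videos this file may be large; if so, lunr also supports serializing a pre-built index instead of raw docs.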

**Step 3:** Add client-side search to app.js.

**Step 4:** Add snippet navigation (many results per topic, with timestamps linking to YouTube).
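The timestamp links use YouTube's standard `t=` query parameter, which takes whole seconds:

```python
def youtube_timestamp_url(video_id: str, start_s: float) -> str:
    """Deep link to a moment in a YouTube video."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(start_s)}s"
```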

**Step 5:** Final visual review + commit.

---

## Dependency Graph

```
Task 1 (models) → Task 2 (youtube adapter) → Task 3 (chunking) → Task 4 (HG adapter + pipeline)
                                                                        ↓
Task 5 (schema) ──────────────────────────────────────────────→ Task 6 (extraction)
                                                                        ↓
                                                                Task 7 (embed)
                                                                        ↓
                                                                Task 8 (store)
                                                                        ↓
                                                                Task 9 (process 10 + diagnose)
                                                                        ↓
                                                                Task 10 (portal generator)
                                                                        ↓
                                                                Task 11 (batch 935)
                                                                        ↓
                                                                Task 12 (polish)
```

Tasks 1-4 (models + adapters) can run in parallel with Task 5 (schema); everything from Task 6 onward is sequential.
