# Intel Data Quality Audit — Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Fix entity resolution errors in companies.db (starting with Micron), build a reusable verification pipeline, audit 20+ companies, and produce an HTML report.

**Architecture:** Cross-reference companies.db records against finnhub_profiles (812 cached) and live Finnhub API calls via `lib.finnhub.api()`. Use the vario `confirm` recipe for multi-model consensus on ambiguous cases. Output an HTML report to `reports/tmp/`, served via static.localhost.

**Tech Stack:** SQLite (companies.db), lib/finnhub (rate-limited API client), vario (confirm recipe), Jinja2-style HTML report generation.

---

## Task 1: Fix Micron Data (Direct DB Fix)

**Files:**
- Modify: `intel/companies/data/companies.db` (via Python script)

**Step 1: Verify current Micron state and Finnhub ground truth**

```bash
python -c "
from lib.finnhub import api
profile = api('stock/profile2', symbol='MU')
print(f'Finnhub name: {profile[\"name\"]}')
print(f'Market cap: {profile.get(\"marketCapitalization\")}')
print(f'Exchange: {profile.get(\"exchange\")}')
print(f'Industry: {profile.get(\"finnhubIndustry\")}')
print(f'Website: {profile.get(\"weburl\")}')
print(f'Country: {profile.get(\"country\")}')
"
```

Expected: name is "Micron Technology Inc", confirming that the DB entry "Micron Memory Japan, G.K." is wrong.
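The step also calls for checking the current DB state, not just Finnhub. A minimal sketch of the DB side (column names follow the UPDATE in Step 2; the helper name `inspect_micron` is illustrative, not an existing function):

```python
import sqlite3

def inspect_micron(db_path: str, ids=(6564, 8195)) -> list[dict]:
    """Print and return the current state of the Micron records before fixing."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    qmarks = ",".join("?" * len(ids))
    rows = [dict(r) for r in conn.execute(
        f"SELECT id, name, ticker, website, country FROM companies WHERE id IN ({qmarks})",
        ids,
    )]
    conn.close()
    for r in rows:
        print(r)
    return rows
```

Run with `inspect_micron('intel/companies/data/companies.db')` and compare the printed names against the Finnhub profile above.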

**Step 2: Fix company 6564 in DB**

```python
import sqlite3

conn = sqlite3.connect('intel/companies/data/companies.db')
conn.row_factory = sqlite3.Row

# Update the main Micron record (id=6564)
conn.execute("""
    UPDATE companies SET
        name = 'Micron Technology, Inc.',
        canonical_name = 'micron technology',
        website = 'https://www.micron.com',
        country = 'US',
        description = 'Leading global manufacturer of memory and storage semiconductors (DRAM, NAND, NOR). Serves data center, mobile, automotive, industrial, and consumer markets.',
        updated_at = datetime('now')
    WHERE id = 6564
""")

# Mark India subsidiary (id=8195) — add note in description, no ticker
conn.execute("""
    UPDATE companies SET
        description = 'Subsidiary of Micron Technology, Inc. (MU). Indian operations.',
        updated_at = datetime('now')
    WHERE id = 8195
""")

conn.commit()

# Verify
for r in conn.execute("SELECT id, name, canonical_name, ticker, website, country, description FROM companies WHERE id IN (6564, 8195)"):
    print(dict(r))

conn.close()
```

**Step 3: Also fix the finnhub_profiles cache for MU if stale**

```python
import sqlite3, json
conn = sqlite3.connect('intel/companies/data/companies.db')
row = conn.execute("SELECT name, fetched_at FROM finnhub_profiles WHERE symbol = 'MU'").fetchone()
if row:
    print(f"Cached Finnhub: name={row[0]}, fetched_at={row[1]}")
else:
    # Insert from live API
    from lib.finnhub import api
    p = api('stock/profile2', symbol='MU')
    conn.execute("""
        INSERT OR REPLACE INTO finnhub_profiles (symbol, name, market_cap, finnhub_industry, sector, ipo_date, country, exchange, raw_json)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, ('MU', p['name'], p.get('marketCapitalization'), p.get('finnhubIndustry'),
          p.get('sector'), p.get('ipo'), p.get('country'), p.get('exchange'), json.dumps(p)))
    conn.commit()
    print("Inserted MU into finnhub_profiles")
conn.close()
```

**Step 4: Commit**

```bash
git add intel/companies/data/companies.db
git commit -m "fix(intel): correct Micron entity — subsidiary name → parent company"
```

---

## Task 2: Build Verification Pipeline

**Files:**
- Create: `intel/companies/verify.py`
- Test: `intel/companies/tests/test_verify.py`

**Step 1: Write the test**

```python
# intel/companies/tests/test_verify.py
"""Tests for company data verification pipeline."""
import pytest
from unittest.mock import patch

from intel.companies.verify import (
    check_ticker_name_match,
    find_suspicious_companies,
    VerificationResult,
    Severity,
)


def test_check_ticker_name_match_correct():
    """Correct name returns no issue."""
    result = check_ticker_name_match(
        db_name="Micron Technology, Inc.",
        finnhub_name="Micron Technology Inc",
        ticker="MU",
    )
    assert result is None  # no issue


def test_check_ticker_name_match_subsidiary():
    """Subsidiary name flagged as error."""
    result = check_ticker_name_match(
        db_name="Micron Memory Japan, G.K.",
        finnhub_name="Micron Technology Inc",
        ticker="MU",
    )
    assert result is not None
    assert result.severity == Severity.HIGH
    assert "Micron Technology" in result.suggested_fix


def test_check_ticker_name_match_minor_difference():
    """Minor formatting difference is close enough — not flagged."""
    result = check_ticker_name_match(
        db_name="Apple Inc.",
        finnhub_name="Apple Inc",
        ticker="AAPL",
    )
    assert result is None  # close enough


def test_find_suspicious_companies_returns_results(tmp_path):
    """find_suspicious_companies returns companies matching heuristic filters."""
    import sqlite3
    db = sqlite3.connect(str(tmp_path / "test.db"))
    db.execute("""CREATE TABLE companies (
        id INTEGER PRIMARY KEY, name TEXT, canonical_name TEXT,
        ticker TEXT, symbol TEXT, market_cap_m REAL,
        description TEXT, website TEXT, country TEXT,
        sector TEXT, industry TEXT, is_public INTEGER DEFAULT 1,
        exchange TEXT, updated_at TEXT
    )""")
    db.execute("""INSERT INTO companies (id, name, ticker, market_cap_m, description)
        VALUES (1, 'Bad Corp, G.K.', 'BAD', 50000, NULL)""")
    db.execute("""INSERT INTO companies (id, name, ticker, market_cap_m, description)
        VALUES (2, 'Good Corp', 'GOOD', 50000, 'A fine company')""")
    db.commit()

    results = find_suspicious_companies(db)
    assert len(results) >= 1
    assert any(r['name'] == 'Bad Corp, G.K.' for r in results)
    db.close()
```

**Step 2: Run test to verify it fails**

```bash
cd /Users/tchklovski/all-code/rivus && python -m pytest intel/companies/tests/test_verify.py -v
```

Expected: ImportError — `intel.companies.verify` doesn't exist yet.

**Step 3: Implement `intel/companies/verify.py`**

```python
#!/usr/bin/env python
"""Company data verification pipeline.

Cross-references companies.db records against Finnhub ground truth
and heuristic checks to find entity resolution errors, missing data,
and suspicious records.

Usage:
    python -m intel.companies.verify [--limit N] [--fix] [--report]
"""

import json
import re
import sqlite3
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path

from loguru import logger

from intel.companies.common import DB_PATH


class Severity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


@dataclass
class VerificationResult:
    company_id: int
    company_name: str
    ticker: str | None
    issue_type: str  # name_mismatch, missing_description, ticker_conflict, duplicate, malformed_name
    severity: Severity
    detail: str
    suggested_fix: str | None = None
    ground_truth: dict = field(default_factory=dict)


# ── Name matching ──────────────────────────────────────────────

_LEGAL_SUFFIXES = re.compile(
    r"[,.\s]*(Inc\.?|Corp\.?|Corporation|LLC|Ltd\.?|Limited|PLC|"
    r"S\.?A\.?S?\.?|SE|AG|GmbH|N\.?V\.?|Co\.?,?\s*Ltd\.?|Group|"
    r"G\.?K\.?|Pvt\.?\s*Ltd\.?|B\.?V\.?|KK|Co\.?)\s*\.?\s*$",
    re.IGNORECASE,
)

_SUBSIDIARY_MARKERS = re.compile(
    r"\b(G\.?K\.?|Pvt\.?\s*Ltd\.?|GmbH|B\.?V\.?|S\.?A\.?S\.?|KK)\b",
    re.IGNORECASE,
)


def _normalize_name(name: str) -> str:
    """Strip legal suffixes and normalize for comparison."""
    clean = _LEGAL_SUFFIXES.sub("", name).strip()
    clean = re.sub(r"\s+", " ", clean).lower()
    return clean


def check_ticker_name_match(
    db_name: str, finnhub_name: str, ticker: str
) -> VerificationResult | None:
    """Check if DB name matches Finnhub name for a given ticker.

    Returns a VerificationResult if there's a mismatch, None if OK.
    """
    norm_db = _normalize_name(db_name)
    norm_fh = _normalize_name(finnhub_name)

    # Exact match after normalization
    if norm_db == norm_fh:
        return None

    # Check if one contains the other (e.g., "Apple" vs "Apple Inc")
    if norm_db in norm_fh or norm_fh in norm_db:
        return None

    # Check word overlap — if >70% of words match, it's close enough
    db_words = set(norm_db.split())
    fh_words = set(norm_fh.split())
    if db_words and fh_words:
        overlap = len(db_words & fh_words) / max(len(db_words), len(fh_words))
        if overlap >= 0.7:
            return None

    # It's a mismatch — determine severity
    has_subsidiary_marker = bool(_SUBSIDIARY_MARKERS.search(db_name))
    severity = Severity.HIGH if has_subsidiary_marker else Severity.MEDIUM

    return VerificationResult(
        company_id=0,  # filled by caller
        company_name=db_name,
        ticker=ticker,
        issue_type="name_mismatch",
        severity=severity,
        detail=f"DB: '{db_name}' vs Finnhub: '{finnhub_name}'",
        suggested_fix=finnhub_name,
        ground_truth={"finnhub_name": finnhub_name},
    )


# ── Heuristic filters ─────────────────────────────────────────

def find_suspicious_companies(conn: sqlite3.Connection) -> list[dict]:
    """Find companies that look like data quality issues using heuristics.

    Returns list of company dicts with 'suspicion_reason' field.
    """
    results = []

    # 1. Subsidiary markers on high-value tickered companies
    for row in conn.execute("""
        SELECT id, name, ticker, market_cap_m, description FROM companies
        WHERE ticker IS NOT NULL AND ticker != ''
          AND (name LIKE '%, G.K.%' OR name LIKE '%Pvt.%' OR name LIKE '% KK'
               OR name LIKE '%GmbH%' OR name LIKE '%B.V.%' OR name LIKE '%S.A.S%')
          AND market_cap_m > 1000
        ORDER BY market_cap_m DESC LIMIT 30
    """):
        results.append({
            "id": row[0], "name": row[1], "ticker": row[2],
            "market_cap_m": row[3], "description": row[4],
            "suspicion_reason": "subsidiary_marker_with_ticker",
        })

    # 2. High-value companies with no description
    for row in conn.execute("""
        SELECT id, name, ticker, market_cap_m, description FROM companies
        WHERE ticker IS NOT NULL AND ticker != ''
          AND market_cap_m > 10000
          AND (description IS NULL OR description = '')
        ORDER BY market_cap_m DESC LIMIT 30
    """):
        results.append({
            "id": row[0], "name": row[1], "ticker": row[2],
            "market_cap_m": row[3], "description": row[4],
            "suspicion_reason": "high_value_no_description",
        })

    # 3. Malformed names (missing spaces before Co., Ltd.)
    for row in conn.execute("""
        SELECT id, name, ticker, market_cap_m, description FROM companies
        WHERE ticker IS NOT NULL AND ticker != ''
          AND (name LIKE '%yCo.%' OR name LIKE '%sCo.%' OR name LIKE '%tCo.%'
               OR name LIKE '%nCo.%' OR name LIKE '%eCo.%' OR name LIKE '%lCo.%')
          AND market_cap_m > 100
        ORDER BY market_cap_m DESC LIMIT 20
    """):
        results.append({
            "id": row[0], "name": row[1], "ticker": row[2],
            "market_cap_m": row[3], "description": row[4],
            "suspicion_reason": "malformed_name",
        })

    # Deduplicate by company id
    seen = set()
    unique = []
    for r in results:
        if r["id"] not in seen:
            seen.add(r["id"])
            unique.append(r)
    return unique


# ── Finnhub cross-check ───────────────────────────────────────

def verify_against_finnhub(
    conn: sqlite3.Connection,
    company_ids: list[int] | None = None,
    limit: int = 20,
    use_cache: bool = True,
) -> list[VerificationResult]:
    """Cross-check company records against Finnhub profiles.

    Uses cached finnhub_profiles first, falls back to live API.
    """
    from lib.finnhub import api as finnhub_api, FinnhubError

    issues = []

    # Get companies to check
    if company_ids:
        placeholders = ",".join("?" * len(company_ids))
        rows = conn.execute(
            f"SELECT id, name, ticker, market_cap_m, description FROM companies WHERE id IN ({placeholders})",
            company_ids,
        ).fetchall()
    else:
        rows = conn.execute("""
            SELECT id, name, ticker, market_cap_m, description FROM companies
            WHERE ticker IS NOT NULL AND ticker != ''
              AND market_cap_m > 1000
            ORDER BY market_cap_m DESC LIMIT ?
        """, (limit,)).fetchall()

    for row in rows:
        cid, name, ticker, mcap, desc = row
        if not ticker:
            continue

        # Try cache first
        finnhub_name = None
        finnhub_mcap = None
        if use_cache:
            cached = conn.execute(
                "SELECT name, market_cap, raw_json FROM finnhub_profiles WHERE symbol = ?",
                (ticker,),
            ).fetchone()
            if cached:
                finnhub_name = cached[0]
                finnhub_mcap = cached[1]

        # Fall back to live API
        if not finnhub_name:
            try:
                profile = finnhub_api("stock/profile2", symbol=ticker)
                if profile and profile.get("name"):
                    finnhub_name = profile["name"]
                    finnhub_mcap = profile.get("marketCapitalization")
                    # Cache it
                    conn.execute("""
                        INSERT OR REPLACE INTO finnhub_profiles
                        (symbol, name, market_cap, finnhub_industry, sector, ipo_date, country, exchange, raw_json)
                        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)
                    """, (ticker, finnhub_name, finnhub_mcap,
                          profile.get("finnhubIndustry"), profile.get("sector"),
                          profile.get("ipo"), profile.get("country"),
                          profile.get("exchange"), json.dumps(profile)))
                    conn.commit()
            except FinnhubError as e:
                logger.warning(f"Finnhub error for {ticker}: {e}")
                continue

        if not finnhub_name:
            continue

        # Check name match
        result = check_ticker_name_match(name, finnhub_name, ticker)
        if result:
            result.company_id = cid
            result.ground_truth["finnhub_market_cap"] = finnhub_mcap
            issues.append(result)

        # Check missing description on high-value companies
        if (not desc or desc.strip() == "") and mcap and mcap > 10000:
            issues.append(VerificationResult(
                company_id=cid,
                company_name=name,
                ticker=ticker,
                issue_type="missing_description",
                severity=Severity.MEDIUM,
                detail=f"Market cap ${mcap:,.0f}M but no description",
                ground_truth={"finnhub_name": finnhub_name, "finnhub_market_cap": finnhub_mcap},
            ))

        # Check market cap divergence (>50% off)
        if mcap and finnhub_mcap and finnhub_mcap > 0:
            ratio = mcap / finnhub_mcap
            if ratio < 0.5 or ratio > 2.0:
                issues.append(VerificationResult(
                    company_id=cid,
                    company_name=name,
                    ticker=ticker,
                    issue_type="market_cap_divergence",
                    severity=Severity.LOW,
                    detail=f"DB: ${mcap:,.0f}M vs Finnhub: ${finnhub_mcap:,.0f}M ({ratio:.1f}x)",
                    ground_truth={"finnhub_market_cap": finnhub_mcap},
                ))

    return issues


# ── Duplicate detection ───────────────────────────────────────

def find_duplicate_tickers(conn: sqlite3.Connection) -> list[VerificationResult]:
    """Find cases where multiple companies share the same ticker."""
    issues = []
    for row in conn.execute("""
        SELECT ticker, GROUP_CONCAT(id), GROUP_CONCAT(name, ' | '), COUNT(*) as cnt
        FROM companies
        WHERE ticker IS NOT NULL AND ticker != ''
        GROUP BY ticker
        HAVING cnt > 1
        ORDER BY cnt DESC
        LIMIT 20
    """):
        ticker, ids, names, cnt = row
        issues.append(VerificationResult(
            company_id=int(ids.split(",")[0]),
            company_name=names,
            ticker=ticker,
            issue_type="duplicate_ticker",
            severity=Severity.HIGH,
            detail=f"{cnt} companies share ticker {ticker}: {names}",
        ))
    return issues


# ── CLI ───────────────────────────────────────────────────────

def run_audit(limit: int = 50, fix: bool = False) -> list[VerificationResult]:
    """Run full data quality audit. Returns all issues found."""
    conn = sqlite3.connect(str(DB_PATH))
    conn.row_factory = sqlite3.Row

    all_issues = []

    # Phase 1: Heuristic suspicious companies
    logger.info("Phase 1: Finding suspicious companies via heuristics...")
    suspicious = find_suspicious_companies(conn)
    logger.info(f"Found {len(suspicious)} suspicious companies")

    # Phase 2: Finnhub cross-check on suspicious + top companies
    suspect_ids = [s["id"] for s in suspicious]
    logger.info(f"Phase 2: Cross-checking {len(suspect_ids)} companies against Finnhub...")
    finnhub_issues = verify_against_finnhub(conn, company_ids=suspect_ids if suspect_ids else None, limit=limit)
    all_issues.extend(finnhub_issues)

    # Phase 3: Duplicate tickers
    logger.info("Phase 3: Checking for duplicate tickers...")
    dup_issues = find_duplicate_tickers(conn)
    all_issues.extend(dup_issues)

    # Phase 4: Also check top companies by market cap
    logger.info("Phase 4: Verifying top companies by market cap...")
    top_issues = verify_against_finnhub(conn, limit=limit)
    # Deduplicate
    seen_ids = {(i.company_id, i.issue_type) for i in all_issues}
    for issue in top_issues:
        if (issue.company_id, issue.issue_type) not in seen_ids:
            all_issues.append(issue)
            seen_ids.add((issue.company_id, issue.issue_type))

    # Summary
    by_severity = {}
    for issue in all_issues:
        by_severity.setdefault(issue.severity.value, []).append(issue)

    logger.info(f"Audit complete: {len(all_issues)} issues found")
    for sev in ["critical", "high", "medium", "low"]:
        if sev in by_severity:
            logger.info(f"  {sev}: {len(by_severity[sev])}")

    conn.close()
    return all_issues


if __name__ == "__main__":
    import click

    @click.command()
    @click.option("--limit", default=50, help="Max companies to check against Finnhub")
    @click.option("--fix", is_flag=True, help="Auto-fix name mismatches from Finnhub")
    def main(limit: int, fix: bool):
        """Run company data quality audit."""
        issues = run_audit(limit=limit, fix=fix)
        # Sort by severity rank, not alphabetically (which would put low before medium)
        sev_rank = {"critical": 0, "high": 1, "medium": 2, "low": 3}
        for issue in sorted(issues, key=lambda i: sev_rank.get(i.severity.value, 9)):
            marker = {"critical": "🔴", "high": "🟠", "medium": "🟡", "low": "⚪"}.get(issue.severity.value, "?")
            print(f"{marker} [{issue.issue_type}] {issue.company_name} ({issue.ticker}): {issue.detail}")

    main()
```

**Step 4: Run tests**

```bash
cd /Users/tchklovski/all-code/rivus && python -m pytest intel/companies/tests/test_verify.py -v
```

Expected: All 4 tests pass.

**Step 5: Run the pipeline manually on a few companies**

```bash
cd /Users/tchklovski/all-code/rivus && python -m intel.companies.verify --limit 10
```

Expected: Output showing any issues found.

**Step 6: Commit**

```bash
git add intel/companies/verify.py intel/companies/tests/test_verify.py
git commit -m "feat(intel): add company data verification pipeline with Finnhub cross-check"
```

---

## Task 3: Run Sample Audit (20+ Companies)

**Files:**
- Run: `intel/companies/verify.py`

**Step 1: Run full audit with limit 50**

```bash
cd /Users/tchklovski/all-code/rivus && python -m intel.companies.verify --limit 50 2>&1 | tee /tmp/audit-results.txt
```

Capture the output — it will be used for the HTML report.

**Step 2: Also check specific high-profile companies**

```python
import sqlite3
from intel.companies.verify import verify_against_finnhub

conn = sqlite3.connect('intel/companies/data/companies.db')
conn.row_factory = sqlite3.Row

# Find anchor companies by name
anchors = ['TSMC', 'NVIDIA', 'ASML', 'Intel', 'Samsung', 'Apple', 'Microsoft', 'AMD', 'Qualcomm', 'Broadcom']
anchor_ids = []
for name in anchors:
    row = conn.execute(
        "SELECT id FROM companies WHERE name LIKE ? OR ticker = ? LIMIT 1",
        (f"%{name}%", name)
    ).fetchone()
    if row:
        anchor_ids.append(row[0])

issues = verify_against_finnhub(conn, company_ids=anchor_ids)
for i in issues:
    print(f"[{i.severity.value}] {i.company_name} ({i.ticker}): {i.detail}")
conn.close()
```

---

## Task 4: Generate HTML Report

**Files:**
- Create: `intel/companies/report_verify.py` — report generator
- Create: `reports/tmp/.share` — opt-in sharing
- Output: `reports/tmp/intel-data-quality-audit-20260309.html`

**Step 1: Create reports directory with .share**

```bash
mkdir -p reports/tmp
echo "Intel data quality audit reports" > reports/tmp/.share
```

**Step 2: Write report generator**

Create `intel/companies/report_verify.py` that:

1. Imports `run_audit()` from verify.py (or accepts pre-computed results)
2. Generates an HTML report with:
   - Executive summary (counts by severity)
   - Micron case study (before/after)
   - Table of all flagged companies (sortable)
   - Pipeline description
   - Recommendations
3. Uses the `report-writing` skill conventions (inline CSS, no external deps)
4. Writes to `reports/tmp/intel-data-quality-audit-YYYYMMDD.html`

The report should be self-contained HTML with embedded CSS. Use the standard rivus report style:
- Dark header, clean tables, severity color coding
- Responsive, works in browser

```python
#!/usr/bin/env python
"""Generate HTML report from company verification audit results."""

import sqlite3
from datetime import datetime
from pathlib import Path

from intel.companies.common import DB_PATH
from intel.companies.verify import run_audit, VerificationResult, Severity

REPORT_DIR = Path(__file__).resolve().parent.parent.parent / "reports" / "tmp"


def generate_report(issues: list[VerificationResult], output_path: Path | None = None) -> Path:
    """Generate HTML data quality report."""
    ...  # Full implementation — build self-contained HTML per the conventions above


if __name__ == "__main__":
    issues = run_audit(limit=50)
    path = generate_report(issues)
    print(f"Report: https://static.localhost/reports/tmp/{path.name}")
```

**Step 3: Run report generation**

```bash
cd /Users/tchklovski/all-code/rivus && python -m intel.companies.report_verify
```

Expected: HTML file written, URL printed.

**Step 4: Verify report is accessible**

```bash
curl -s -o /dev/null -w "%{http_code}" https://static.localhost/reports/tmp/intel-data-quality-audit-20260309.html
```

Expected: 200

**Step 5: Commit**

```bash
git add intel/companies/report_verify.py reports/tmp/.share
git commit -m "feat(intel): add HTML report generator for data quality audit"
```

---

## Task 5: Final Integration & Cleanup

**Step 1: Add verify command to intel CLI**

In `intel/cli.py`, add a `verify` subcommand under `companies`:

```python
@companies.command()
@click.option("--limit", default=50)
@click.option("--report", is_flag=True, help="Generate HTML report")
def verify(limit, report):
    """Run data quality verification on company records."""
    from intel.companies.verify import run_audit
    issues = run_audit(limit=limit)
    if report:
        from intel.companies.report_verify import generate_report
        path = generate_report(issues)
        click.echo(f"Report: https://static.localhost/reports/tmp/{path.name}")
```

**Step 2: Run end-to-end**

```bash
intel companies verify --limit 30 --report
```

**Step 3: Commit all**

```bash
git add intel/cli.py
git commit -m "feat(intel): add 'intel companies verify' CLI command"
```
