People Discovery Pipeline

Find someone's profiles across platforms, focused on the right person

The Problem

You have a person's name and want to find their profiles across LinkedIn, Twitter/X, Instagram, Facebook, TikTok, YouTube, and GitHub. The challenge is disambiguation: search "Gil Elbaz" and you'll find the Factual founder, a chess player, a musician, and several others. The pipeline needs to figure out which results belong to the person you mean.

How It Works

discover.py runs 10 concurrent searches for a given name:

6 platform site: searches (Serper/Google) — LinkedIn, Twitter/X, Instagram, Facebook, TikTok, YouTube
Google web search — general results for the quoted name
Google image search — photos associated with the name
Google video search — video appearances
GitHub API user search — developer profiles matching the name

All 10 run in parallel (async), typically completing in 1–2 seconds.
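
The fan-out described above can be sketched with `asyncio.gather`. This is a minimal illustration, not the code in discover.py: `fake_search`, `run_all`, and the `SOURCES` labels are hypothetical names, and each source sleeps briefly instead of calling Serper or the GitHub API.

```python
import asyncio
import time

# Hypothetical stand-in for one search source. The real pipeline would make an
# HTTP request here; this sketch only sleeps to simulate network latency.
async def fake_search(source: str, name: str, delay: float = 0.05) -> dict:
    await asyncio.sleep(delay)
    return {"source": source, "query": name, "results": []}

# Assumed labels for the 10 sources listed above.
SOURCES = [
    "linkedin", "twitter", "instagram", "facebook", "tiktok", "youtube",
    "web", "images", "videos",
    "github",
]

async def run_all(name: str) -> list:
    tasks = [fake_search(source, name) for source in SOURCES]
    # return_exceptions=True puts a failed source's exception in the result
    # list instead of raising it, so the other results are still returned.
    return await asyncio.gather(*tasks, return_exceptions=True)

start = time.perf_counter()
results = asyncio.run(run_all("Mayukh Sukhatme"))
elapsed = time.perf_counter() - start

ok = sum(1 for r in results if not isinstance(r, Exception))
print(f"{len(SOURCES)} sources: {ok} ok, {len(results) - ok} error, {elapsed:.1f}s wall")
```

Because the coroutines wait concurrently, the wall time is close to that of the slowest single source rather than the sum of all ten.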

Results are scored by name match: each result's title and snippet are checked for the parts of the person's name. The best-scoring URL per platform is marked as confirmed (✓) when its score reaches the --min-score threshold (default 0.3). The output is a candidates table showing, for every platform, the number of results found, the top score, and the best URL.
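
Picking the best URL per platform might look like the sketch below. It assumes results arrive already scored as dictionaries with `url` and `score` keys, and that "confirmed" means the top score is at or above the threshold; `pick_confirmed` and the example.com URLs are hypothetical.

```python
def pick_confirmed(scored: dict, min_score: float = 0.3) -> dict:
    """Map each platform to its best-scoring URL, or None if nothing qualifies."""
    confirmed = {}
    for platform, results in scored.items():
        # default=None covers platforms that returned no results at all
        best = max(results, key=lambda r: r["score"], default=None)
        if best is not None and best["score"] >= min_score:
            confirmed[platform] = best["url"]
        else:
            confirmed[platform] = None
    return confirmed

# Hypothetical, already-scored input
scored = {
    "youtube": [
        {"url": "https://example.com/video-a", "score": 1.0},
        {"url": "https://example.com/video-b", "score": 0.7},
    ],
    "instagram": [{"url": "https://example.com/profile", "score": 0.2}],
    "linkedin": [],
}
print(pick_confirmed(scored))
```

Here youtube confirms its top result, instagram falls below the default threshold, and linkedin has nothing to confirm.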

Example output

$ python discover.py "Mayukh Sukhatme"

══════════════════════════════════════════════════════════════════
  DISCOVERY: Mayukh Sukhatme
  Hint: (none)
══════════════════════════════════════════════════════════════════

  ── OPS ──────────────────────────────────────────────────────
  10 sources: 10 ok, 0 error, 0 cached — 1.4s wall

  ── CANDIDATES ───────────────────────────────────────────────
  Platform       #  Score  URL
  ──────────── ──── ──────  ────────────────────────────────────
  twitter         4   1.00  https://x.com/search?q=Mayukh%20Sukhatme ✓
  youtube         1   1.00  https://www.youtube.com/watch?v=t8IJtUBjVWM ✓
  instagram       1   0.70  https://www.instagram.com/hopkinsbiotechpodcast/ ✓
  facebook        4   0.70  https://www.facebook.com/photo.php?fbid=... ✓
  linkedin        0      —  (no results)
  tiktok          0      —  (no results)
  github          0      —  (no results)

  ── WEB PRESENCE ─────────────────────────────────────────────
  Web: ~9 results | Images: 10 | Videos: 1

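The # column shows how many results the search returned for that platform (Serper for the six site: searches, the GitHub API for github). A plain number such as "4" means exactly 4 results were found; a trailing plus such as "5+" means the request limit was hit and more may exist.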
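
As a small illustration of that convention, the helper below formats a count for the # column. The function name and the per-request limit of 5 are assumptions inferred from the "5+" example, not values taken from discover.py.

```python
def format_count(n_results: int, limit: int = 5) -> str:
    """Render a result count, adding '+' when the request limit was reached."""
    if n_results >= limit:
        # Hitting the limit means more results may exist than were returned.
        return f"{n_results}+"
    return str(n_results)

for n in (0, 4, 5):
    print(n, "->", format_count(n))
```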

Refining with `--hint`

For common names, name matching alone isn't enough. The --hint flag provides a free-text description of who you're looking for:

$ python discover.py "Gil Elbaz" --hint "founder Factual, Applied Semantics"

The hint does two things:

Sharpens searches: keywords from the hint are appended to every site: query, so Google returns more relevant results.
Improves scoring: results are scored on both name match (40%) and hint keyword match (60%). A result that contains both name parts and every hint keyword scores 1.0; one about "Gil Elbaz" playing chess matches the name but none of the hint keywords, so it scores 0.4.
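
A sketch of how hint keywords could be appended to a site: query. The tokenization (split on commas and whitespace), the quoting of the name, and the helper names are assumptions; the real query builder may filter or order keywords differently.

```python
import re

def hint_keywords(hint: str) -> list:
    """Split a free-text hint into keywords on commas and whitespace."""
    return [word for word in re.split(r"[,\s]+", hint.strip()) if word]

def build_site_query(name: str, domain: str, hint: str = "") -> str:
    """Build a site-restricted query with the quoted name plus hint keywords."""
    parts = [f"site:{domain}", f'"{name}"'] + hint_keywords(hint)
    return " ".join(parts)

print(build_site_query("Gil Elbaz", "linkedin.com", "founder Factual, Applied Semantics"))
# prints: site:linkedin.com "Gil Elbaz" founder Factual Applied Semantics
```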

Scoring formula

score = name_match × 0.4 + hint_match × 0.6

name_match = fraction of name parts found in title+snippet
hint_match = fraction of hint keywords found in title+snippet
no hint    → hint_match defaults to 0.5 (neutral)
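
The formula translates fairly directly into Python. The sketch below uses case-insensitive substring matching, which is a simplification (it would count "Gil" inside "Gilbert"); the function names are hypothetical and the matching rules in discover.py may be stricter.

```python
from typing import Optional, Sequence

def fraction_found(terms: Sequence[str], text: str) -> float:
    """Fraction of terms that occur (case-insensitively) in text."""
    if not terms:
        return 0.0
    haystack = text.lower()
    return sum(1 for term in terms if term.lower() in haystack) / len(terms)

def score(name: str, title: str, snippet: str,
          keywords: Optional[Sequence[str]] = None) -> float:
    text = f"{title} {snippet}"
    name_match = fraction_found(name.split(), text)
    # With no hint, hint_match takes the neutral default of 0.5.
    hint_match = fraction_found(keywords, text) if keywords else 0.5
    return name_match * 0.4 + hint_match * 0.6

keywords = ["founder", "Factual", "Applied", "Semantics"]
full = score("Gil Elbaz", "Gil Elbaz, founder of Factual",
             "Previously co-founded Applied Semantics.", keywords)
chess = score("Gil Elbaz", "Gil Elbaz chess profile", "Rated games and history.", keywords)
no_hint = score("Gil Elbaz", "Gil Elbaz chess profile", "Rated games and history.")
print(round(full, 2), round(chess, 2), round(no_hint, 2))
# prints: 1.0 0.4 0.7
```

Under this formula a full name match with no hint lands at 0.7, and a full name match that misses every hint keyword lands at 0.4.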

The hint is persistent: the first time you pass --hint, it is saved to data/{slug}/hint.txt, and subsequent runs load it automatically.
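
Hint persistence could be implemented roughly as follows. The `slugify` rule (lowercase, hyphen-separated) and the choice to overwrite the saved hint whenever a new one is passed are assumptions; only the data/{slug}/hint.txt layout comes from the description above.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

def slugify(name: str) -> str:
    """Assumed slug rule: lowercase, runs of other characters become hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def resolve_hint(name: str, cli_hint: Optional[str], data_dir: str) -> Optional[str]:
    """Save a hint passed on the command line, else load a previously saved one."""
    hint_file = Path(data_dir) / slugify(name) / "hint.txt"
    if cli_hint:
        hint_file.parent.mkdir(parents=True, exist_ok=True)
        hint_file.write_text(cli_hint, encoding="utf-8")
        return cli_hint
    if hint_file.exists():
        return hint_file.read_text(encoding="utf-8").strip() or None
    return None

# A temporary directory stands in for data/ so the example leaves no files behind.
with tempfile.TemporaryDirectory() as data_dir:
    first = resolve_hint("Gil Elbaz", "founder Factual, Applied Semantics", data_dir)
    second = resolve_hint("Gil Elbaz", None, data_dir)  # no --hint on this run
    other = resolve_hint("Someone Else", None, data_dir)
    print(second)
```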

Useful Flags

| Flag | What it does |
| --- | --- |
| `--verbose` / `-v` | Shows full titles and snippets for all results (platform candidates + web search organics), not just the summary table. |
| `--cluster` | Uses an LLM (Gemini Flash) to group all results into 3–6 conceptual themes, showing how the person appears across different contexts. |
| `--min-score` | Threshold for confirming a candidate (default 0.3). Raise to 0.5–0.6 for common names. |
| `--refresh` | Bypass the SQLite cache and re-run all searches. |
| `--json` | Output raw JSON instead of the formatted table. |
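
To make the `--refresh` behavior concrete, here is a minimal sketch of a SQLite-backed cache with a bypass switch. The table layout, the cache key format, and the `cached_search` helper are hypothetical; only the use of SQLite and the refresh semantics come from the table above.

```python
import json
import sqlite3

def cached_search(conn, key, fetch, refresh=False):
    """Return (result, was_cached). fetch() runs only on a miss or when refresh=True."""
    conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, payload TEXT)")
    if not refresh:
        row = conn.execute("SELECT payload FROM cache WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return json.loads(row[0]), True
    result = fetch()
    # INSERT OR REPLACE overwrites any stale entry stored under the same key.
    conn.execute("INSERT OR REPLACE INTO cache (key, payload) VALUES (?, ?)",
                 (key, json.dumps(result)))
    conn.commit()
    return result, False

calls = []

def fake_fetch():
    # Stand-in for a real search request; records how often it is invoked.
    calls.append(1)
    return {"results": ["https://example.com/profile"]}

conn = sqlite3.connect(":memory:")
_, hit1 = cached_search(conn, "linkedin:Name", fake_fetch)                 # miss, fetches
_, hit2 = cached_search(conn, "linkedin:Name", fake_fetch)                 # served from cache
_, hit3 = cached_search(conn, "linkedin:Name", fake_fetch, refresh=True)   # bypasses cache
print(hit1, hit2, hit3, len(calls))
# prints: False True False 2
```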

Face Search (`face_search.py`)

face_search.py is a related tool that approaches person discovery visually rather than textually. Given a name, it:

Searches Google Images for the person's name (via Serper), including name variants and nickname expansion (e.g. "Tim" / "Timothy" / "T.")
Downloads the images and detects faces using dlib + Gemini
Computes 128-d face embeddings and clusters them by identity using DBSCAN
Identifies the largest cluster as the target person
Generates an HTML report with face crops, context thumbnails, and source links
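
The "largest cluster" step can be sketched without any face libraries, given the integer labels a DBSCAN run produces. The sketch assumes the scikit-learn convention that noise points are labeled -1; whether face_search.py uses scikit-learn, and how it breaks ties, is not stated above.

```python
from collections import Counter
from typing import Optional, Sequence

def largest_cluster(labels: Sequence[int]) -> Optional[int]:
    """Return the label of the biggest cluster, ignoring noise (-1)."""
    counts = Counter(label for label in labels if label != -1)
    if not counts:
        return None  # every face was classified as noise
    return counts.most_common(1)[0][0]

# Hypothetical labels for eight detected faces
labels = [0, 0, 1, -1, -1, -1, 0, 1]
target = largest_cluster(labels)
print(target)
# prints: 0
```

Note that noise is excluded before counting: in the example three faces are noise, which ties with cluster 0 in size, yet cluster 0 is still chosen as the target.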

Useful for verifying that a discovered profile belongs to the right person, or for finding photos and appearances that text searches miss.

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Wrong person confirmed | Hint too vague or missing | Add a more specific `--hint`; raise `--min-score` to 0.5 |
| Right person not confirmed | Hint keywords don't appear in the snippet | Try different hint wording; lower `--min-score` to 0.2 |
| Platform shows "0" results but the person exists there | Google index missed it, or the name differs on the platform | Search manually and pass the URL directly to fetch |
| Fetch errors / timeouts | Bright Data scraper timeout | Retry with `--refresh` |
| All results cached, want fresh data | SQLite cache hit | `--refresh` bypasses the cache |

Quick Reference

# Basic discovery
python discover.py "Name"

# Discovery with identity hint
python discover.py "Name" --hint "context about the person"

# Strict matching for common names
python discover.py "John Smith" --hint "CTO Acme Corp" --min-score 0.6

# Show full titles and snippets
python discover.py "Name" --verbose

# Group results into themes
python discover.py "Name" --cluster

# Machine-readable output
python discover.py "Name" --json 2>/dev/null | jq '.confirmed'

# Fresh data, no cache
python discover.py "Name" --refresh

# Face search with HTML report
python face_search.py "Name"

projects/people/ — 2026-02-06