People Discovery Pipeline

Find someone's profiles across platforms, focused on the right person

The Problem

You have a person's name and want to find their profiles across LinkedIn, Twitter/X, Instagram, Facebook, TikTok, YouTube, and GitHub. The challenge is disambiguation: search "Gil Elbaz" and you'll find the Factual founder, a chess player, a musician, and several others. The pipeline needs to figure out which results belong to the person you mean.

How It Works

discover.py runs 10 concurrent searches for a given name:

  1. 6 platform site: searches (Serper/Google) — LinkedIn, Twitter/X, Instagram, Facebook, TikTok, YouTube
  2. Google web search — general results for the quoted name
  3. Google image search — photos associated with the name
  4. Google video search — video appearances
  5. GitHub API user search — developer profiles matching the name

All 10 run in parallel (async), typically completing in 1–2 seconds.
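
The fan-out above can be sketched with asyncio. This is a minimal illustration, not the code in discover.py: the site: domains, the build_queries and discover helpers, and the fake_search stub (which sleeps instead of calling Serper or the GitHub API) are all assumptions made for the example.

```python
import asyncio

# Assumed site: domains; the real pipeline may target different ones.
PLATFORM_SITES = {
    "linkedin": "linkedin.com",
    "twitter": "x.com",
    "instagram": "instagram.com",
    "facebook": "facebook.com",
    "tiktok": "tiktok.com",
    "youtube": "youtube.com",
}


def build_queries(name: str) -> dict[str, str]:
    """Return one query per source: 6 site: searches plus 4 general ones."""
    quoted = f'"{name}"'
    queries = {p: f"site:{domain} {quoted}" for p, domain in PLATFORM_SITES.items()}
    for source in ("web", "images", "videos"):
        queries[source] = quoted
    queries["github"] = name  # user search by plain name (assumption)
    return queries


async def fake_search(source: str, query: str) -> dict:
    """Hypothetical stand-in for a real HTTP request."""
    await asyncio.sleep(0.05)  # simulate network latency
    return {"source": source, "query": query, "results": []}


async def discover(name: str) -> list[dict]:
    queries = build_queries(name)
    tasks = [fake_search(source, query) for source, query in queries.items()]
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    responses = asyncio.run(discover("Mayukh Sukhatme"))
    print(len(responses), "sources queried")
```

Because the requests overlap, total wall time is roughly that of the slowest single request rather than the sum of all ten.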

Results are scored by name match — each result's title and snippet are checked for the person's name parts. The best-scoring URL per platform is marked as confirmed. Output is a candidates table showing every platform, the number of results found, the top score, and the best URL.
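
A rough sketch of the confirmation step, assuming each result has already been scored. The Candidate class, the confirm_best helper, and the example.com URLs are hypothetical; the 0.3 default mirrors the documented --min-score default.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    platform: str
    url: str
    score: float


def confirm_best(candidates: list[Candidate], min_score: float = 0.3) -> dict[str, Candidate]:
    """Keep the highest-scoring candidate per platform if it clears min_score."""
    best: dict[str, Candidate] = {}
    for cand in candidates:
        current = best.get(cand.platform)
        if current is None or cand.score > current.score:
            best[cand.platform] = cand
    return {p: c for p, c in best.items() if c.score >= min_score}


if __name__ == "__main__":
    found = [
        Candidate("youtube", "https://example.com/yt/a", 1.0),
        Candidate("youtube", "https://example.com/yt/b", 0.7),
        Candidate("facebook", "https://example.com/fb/a", 0.2),
    ]
    for platform, cand in confirm_best(found).items():
        print(platform, cand.score, cand.url)
```

In this sketch, a platform whose best candidate falls below the threshold is simply left unconfirmed.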

Example output

$ python discover.py "Mayukh Sukhatme"

══════════════════════════════════════════════════════════════════
  DISCOVERY: Mayukh Sukhatme
  Hint: (none)
══════════════════════════════════════════════════════════════════

  ── OPS ──────────────────────────────────────────────────────
  10 sources: 10 ok, 0 error, 0 cached — 1.4s wall

  ── CANDIDATES ───────────────────────────────────────────────
  Platform       #  Score  URL
  ──────────── ──── ──────  ────────────────────────────────────
  twitter         4   1.00  https://x.com/search?q=Mayukh%20Sukhatme ✓
  youtube         1   1.00  https://www.youtube.com/watch?v=t8IJtUBjVWM ✓
  instagram       1   0.70  https://www.instagram.com/hopkinsbiotechpodcast/ ✓
  facebook        4   0.70  https://www.facebook.com/photo.php?fbid=... ✓
  linkedin        0      —  (no results)
  tiktok          0      —  (no results)
  github          0      —  (no results)

  ── WEB PRESENCE ─────────────────────────────────────────────
  Web: ~9 results | Images: 10 | Videos: 1

The # column shows how many results Serper returned for that platform. "4" means 4 results found; "5+" means the request limit was hit and more may exist.
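
One way the # column could be rendered, as a small sketch. The format_count name and the limit of 5 are assumptions inferred from the "5+" notation above.

```python
def format_count(n_results: int, limit: int = 5) -> str:
    """Show the raw count, or "<limit>+" when the request limit was reached."""
    if n_results >= limit:
        return f"{limit}+"
    return str(n_results)


if __name__ == "__main__":
    for n in (0, 4, 5):
        print(n, "->", format_count(n))
```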

Refining with --hint

For common names, name matching alone isn't enough. The --hint flag provides a free-text description of who you're looking for:

$ python discover.py "Gil Elbaz" --hint "founder Factual, Applied Semantics"

The hint does two things:

  1. It sharpens searches. Keywords from the hint are appended to every site: query, so Google returns more relevant results.
  2. It improves scoring. Results are scored on both name match (40%) and hint keyword match (60%). A result that mentions "Gil Elbaz" and every hint keyword scores 1.0; one about "Gil Elbaz" playing chess matches the name but no hint keywords, so it scores only 0.4.

Scoring formula

score = name_match × 0.4 + hint_match × 0.6

name_match = fraction of name parts found in title+snippet
hint_match = fraction of hint keywords found in title+snippet
no hint    → hint_match defaults to 0.5 (neutral)
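
Both effects of the hint can be sketched in a few lines. The weights and the 0.5 neutral default come from the formula above; the tokenization, the whole-word matching, and the helper names (keywords, fraction_found, build_site_query, score_result) are assumptions for illustration only.

```python
import re
from typing import Optional

NAME_WEIGHT = 0.4
HINT_WEIGHT = 0.6
NEUTRAL_HINT_MATCH = 0.5  # used when no hint is given


def keywords(text: str) -> list[str]:
    """Lowercased alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fraction_found(terms: list[str], text: str) -> float:
    """Fraction of terms that appear as whole words in text."""
    if not terms:
        return 0.0
    words = set(keywords(text))
    return sum(term in words for term in terms) / len(terms)


def build_site_query(site: str, name: str, hint: Optional[str] = None) -> str:
    """Quoted-name site: query with the hint keywords appended."""
    query = f'site:{site} "{name}"'
    if hint:
        query += " " + " ".join(re.findall(r"[A-Za-z0-9]+", hint))
    return query


def score_result(name: str, hint: Optional[str], title: str, snippet: str) -> float:
    text = f"{title} {snippet}"
    name_match = fraction_found(keywords(name), text)
    hint_terms = keywords(hint or "")
    hint_match = fraction_found(hint_terms, text) if hint_terms else NEUTRAL_HINT_MATCH
    return name_match * NAME_WEIGHT + hint_match * HINT_WEIGHT


if __name__ == "__main__":
    hint = "founder Factual, Applied Semantics"
    print(build_site_query("linkedin.com", "Gil Elbaz", hint))
    founder = score_result("Gil Elbaz", hint, "Gil Elbaz", "Founder of Factual and Applied Semantics")
    chess = score_result("Gil Elbaz", hint, "Gil Elbaz wins chess open", "Tournament report")
    print(f"{founder:.2f} {chess:.2f}")  # prints "1.00 0.40"
```

Under this formula, a full name match with no hint yields 0.4 + 0.5 × 0.6 = 0.7.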

The hint is persistent: the first time you pass --hint, it is saved to data/{slug}/hint.txt, and subsequent runs load it automatically.
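
Hint persistence might look something like the sketch below. The slugify rule, the resolve_hint helper, and the choice to overwrite the saved hint whenever a new --hint is passed are assumptions; only the data/{slug}/hint.txt location comes from the description above.

```python
import re
from pathlib import Path
from typing import Optional


def slugify(name: str) -> str:
    """Assumed slug rule: lowercase, non-alphanumerics collapsed to hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def resolve_hint(name: str, cli_hint: Optional[str], data_dir: Path = Path("data")) -> Optional[str]:
    """Save a hint passed on the command line; otherwise load a saved one."""
    hint_file = data_dir / slugify(name) / "hint.txt"
    if cli_hint:
        hint_file.parent.mkdir(parents=True, exist_ok=True)
        hint_file.write_text(cli_hint, encoding="utf-8")
        return cli_hint
    if hint_file.exists():
        return hint_file.read_text(encoding="utf-8").strip() or None
    return None


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        resolve_hint("Gil Elbaz", "founder Factual, Applied Semantics", root)
        print(resolve_hint("Gil Elbaz", None, root))  # loaded from hint.txt
```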

Useful Flags

Flag            What it does
--verbose / -v  Shows full titles and snippets for all results (platform candidates + web search organics), not just the summary table.
--cluster       Uses an LLM (Gemini Flash) to group all results into 3–6 conceptual themes, showing how the person appears across different contexts.
--min-score     Threshold for confirming a candidate (default 0.3). Raise to 0.5–0.6 for common names.
--refresh       Bypasses the SQLite cache and re-runs all searches.
--json          Outputs raw JSON instead of the formatted table.
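
For orientation, the flags above map naturally onto an argparse parser. This is only a sketch of how such a command line could be declared; whether discover.py actually uses argparse, and the build_parser name, are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="discover.py")
    parser.add_argument("name", help="person's name to search for")
    parser.add_argument("--hint", help="free-text description of the target person")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="show full titles and snippets")
    parser.add_argument("--cluster", action="store_true",
                        help="group results into themes with an LLM")
    parser.add_argument("--min-score", type=float, default=0.3,
                        help="threshold for confirming a candidate")
    parser.add_argument("--refresh", action="store_true",
                        help="bypass the cache and re-run all searches")
    parser.add_argument("--json", action="store_true",
                        help="output raw JSON instead of the table")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["John Smith", "--hint", "CTO Acme Corp", "--min-score", "0.6"]
    )
    print(args.name, args.hint, args.min_score)
```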

Face Search (face_search.py)

face_search.py is a related tool that approaches person discovery visually rather than textually. Given a name, it:

  1. Searches Google Images for the person's name (via Serper), including name variants and nickname expansion (e.g. "Tim" / "Timothy" / "T.")
  2. Downloads the images and detects faces using dlib + Gemini
  3. Computes 128-d face embeddings and clusters them by identity using DBSCAN
  4. Identifies the largest cluster as the target person
  5. Generates an HTML report with face crops, context thumbnails, and source links
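
Steps 3 and 4 can be illustrated with a small pure-Python DBSCAN. This is a toy sketch under stated assumptions: it uses 2-d points in place of 128-d face embeddings, the eps and min_samples values are arbitrary, and a real pipeline would more likely call a library implementation such as scikit-learn's DBSCAN.

```python
import math
from collections import Counter

NOISE = -1


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks outliers."""
    n = len(points)
    # Precompute eps-neighborhoods (each point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
        cluster += 1
    return labels


def largest_cluster(labels):
    """Label of the most populous cluster, ignoring noise; None if there is none."""
    counts = Counter(label for label in labels if label != NOISE)
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    # Toy stand-ins for embeddings: two identities and one outlier.
    embeddings = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5), (5.1, 5), (10, 0)]
    labels = dbscan(embeddings, eps=0.3, min_samples=2)
    print(labels, "target cluster:", largest_cluster(labels))
```

Treating the largest cluster as the target assumes the searched person appears in more images than anyone else in the results, which can fail for uncommon names that are shared with a more photographed person.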

Face search is useful for verifying that a discovered profile belongs to the right person, or for finding photos and appearances that text searches miss.

Troubleshooting

Symptom                                             Cause                                                Fix
Wrong person confirmed                              Hint too vague or missing                            Add a more specific --hint, raise --min-score 0.5
Right person not confirmed                          Hint keywords don't appear in snippet                Try different hint wording, lower --min-score 0.2
Platform shows "0" results but person exists there  Google index missed it, or name differs on platform  Search manually, pass URL directly to fetch
Fetch errors / timeouts                             Bright Data scraper timeout                          --refresh to retry
All results cached, want fresh data                 SQLite cache hit                                     --refresh bypasses cache

Quick Reference

# Basic discovery
python discover.py "Name"

# Discovery with identity hint
python discover.py "Name" --hint "context about the person"

# Strict matching for common names
python discover.py "John Smith" --hint "CTO Acme Corp" --min-score 0.6

# Show full titles and snippets
python discover.py "Name" --verbose

# Group results into themes
python discover.py "Name" --cluster

# Machine-readable output
python discover.py "Name" --json 2>/dev/null | jq '.confirmed'

# Fresh data, no cache
python discover.py "Name" --refresh

# Face search with HTML report
python face_search.py "Name"

projects/people/ — 2026-02-06