Building India's Most Comprehensive Political Data Platform

The Problem

India has 543 Members of Parliament in the Lok Sabha alone. Citizens vote every five years, but between elections, there is almost no accessible, structured way to track how representatives actually perform. Parliamentary attendance data exists on government portals. Criminal records sit in election affidavits. Asset declarations are buried in PDFs. But nobody had stitched it all together into a single, evidence-backed accountability score.

We set out to build Politia: a platform that treats every claim as a data point that must trace back to an official source. No AI-generated summaries, no media opinions. Just records.

Every score on Politia traces back to an official parliamentary record or election affidavit. If we cannot source it, we do not display it.

Architecture Decisions

We chose a hexagonal architecture with strict SOLID principles for the backend. This was not academic perfectionism -- it was survival. When you are normalizing data from six different sources with different schemas, field names, and levels of completeness, you need clear boundaries.

  • FastAPI for the backend -- async by default, Pydantic models as the contract layer between ingestion and API
  • PostgreSQL on Neon -- free tier, serverless scaling, branching for safe migrations
  • Next.js 16 on Vercel -- React 19 server components for the frontend, zero client JS where possible
  • Cloudflare R2 -- object storage for raw affidavit PDFs and bulk data exports
  • GitHub Actions -- cron jobs for periodic data refresh, zero infrastructure cost

The entire stack runs on free tiers. Total infrastructure cost: $0/month.
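Those boundaries can be sketched as a port-and-adapter pair. The names, fields, and in-memory adapter below are illustrative, with a plain dataclass standing in for the Pydantic contract model:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol

@dataclass(frozen=True)
class MPRecord:
    # The contract between ingestion and API (a Pydantic model in the real
    # stack; a plain dataclass keeps this sketch dependency-free).
    tcpd_id: str
    name: str
    constituency: str
    attendance_pct: Optional[float] = None  # None when a source lacks the field

class MPRepository(Protocol):
    # Port: API handlers depend on this interface, never on Postgres directly.
    def get_by_tcpd_id(self, tcpd_id: str) -> Optional[MPRecord]: ...

class InMemoryMPRepository:
    # Adapter: stands in for the Postgres-backed implementation in tests.
    def __init__(self) -> None:
        self._by_id: Dict[str, MPRecord] = {}

    def add(self, rec: MPRecord) -> None:
        self._by_id[rec.tcpd_id] = rec

    def get_by_tcpd_id(self, tcpd_id: str) -> Optional[MPRecord]:
        return self._by_id.get(tcpd_id)
```

Because the API layer type-checks against the port rather than a concrete database client, the Postgres adapter (or a test double) can be swapped without touching handlers.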

The Data Pipeline

The key insight that saved us weeks of work: most of the data we needed already existed as structured downloads. We did not need to build scrapers for 80% of our requirements.

  • Vonter/india-representatives-activity -- MP performance data from PRS Legislative Research, already in JSON/CSV
  • datameet/india-election-data -- All Lok Sabha election results since independence, clean CSV
  • TCPD-IED via LokDhaba -- Election data from 1962 onward with unique politician IDs (critical for entity resolution)
  • Parliamentary candidates affidavit data -- Affidavits from 2004 to 2019, structured
  • OpenSanctions in_sansad -- All current MPs with biographical data in JSON
  • data.gov.in -- Attendance records, committee memberships, bills participation in CSV

Scraping was only needed for two sources: MyNeta 2024 election data and the latest PRS session data. Everything else was download-normalize-ingest.
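The normalize step can be sketched for one hypothetical CSV source. The column names and canonical schema here are assumptions for illustration, not the actual Politia mapping; the important detail is that empty fields become explicit nulls rather than zeros:

```python
import csv
import io

# Hypothetical column map for one source; each source gets its own.
FIELD_MAP = {"mp_name": "name", "pc_name": "constituency", "attendance": "attendance_pct"}
NUMERIC_FIELDS = {"attendance_pct"}

def normalize_row(row: dict) -> dict:
    # Map source columns onto the canonical schema; empty strings become
    # explicit nulls so missing data is never mistaken for a real value.
    out = {}
    for src_key, canon_key in FIELD_MAP.items():
        value = (row.get(src_key) or "").strip()
        out[canon_key] = value or None
    for field in NUMERIC_FIELDS:
        if out[field] is not None:
            out[field] = float(out[field])
    return out

raw = "mp_name,pc_name,attendance\nJyotiraditya Scindia,Guna,87.5\n"
rows = [normalize_row(r) for r in csv.DictReader(io.StringIO(raw))]
```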

The Scoring Formula

The scoring formula went through two major versions. Version 1 was intuitive but deeply flawed. It attempted to score MPs on a 0-100 scale across five dimensions:

  • Parliamentary Activity (30%) -- attendance, questions asked, debates participated in
  • Financial Integrity (25%) -- asset disclosure completeness, wealth growth rate
  • Criminal Record (20%) -- number and severity of declared criminal cases
  • Education & Background (15%) -- educational qualifications, professional background
  • Constituency Work (10%) -- MPLADS fund utilization

This looked reasonable on paper. It was not.
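As a sketch, the V1 weighted sum looked roughly like this (component names follow the list above; the exact aggregation is an assumption):

```python
# V1 weights, matching the five dimensions listed above.
WEIGHTS = {
    "parliamentary_activity": 0.30,
    "financial_integrity": 0.25,
    "criminal_record": 0.20,
    "education_background": 0.15,
    "constituency_work": 0.10,
}

def score_v1(components: dict) -> float:
    # The flaw: a missing component silently defaults to 0.0, so an MP is
    # penalized for data that was never collected in the first place.
    return sum(weight * components.get(name, 0.0)
               for name, weight in WEIGHTS.items())
```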

The P0 Audit

We ran a comprehensive audit of every score in the database, and the results were devastating: 72.5% of all computed scores were effectively meaningless because they relied on fields that had no data.

The audit uncovered five critical problems:

  • Ghost fields: The formula referenced data columns that did not exist in any of our six data sources. Education level, MPLADS utilization, and constituency development metrics were scored as zeros, dragging down every MP
  • Entity resolution failures: 147 MPs could not be matched across datasets because name variations ("JYOTIRADITYA M. SCINDIA" vs "Jyotiraditya Madhavrao Scindia" vs "Shri Jyotiraditya M Scindia") broke our naive string matching
  • Integrity-rewards-ignorance bug: MPs who never filed an affidavit scored higher on "financial integrity" than those who filed and declared assets, because missing data defaulted to a neutral score while actual declarations got penalized for wealth
  • Criminal severity not weighted: An MP with one murder charge scored the same as one with a traffic violation
  • 2019 data missing entirely: The 2019 election -- the most recent completed election at the time -- had zero records ingested

We killed the scoring formula, fixed entity resolution using TCPD unique IDs as the canonical key, ingested 2014 and 2024 election data, and rebuilt the formula from scratch with data sufficiency gates: if a score component has less than 50% of its required data points, it is marked as "insufficient data" rather than computed from garbage.

A score computed from missing data is worse than no score at all. It creates false confidence.
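A sufficiency gate of that shape fits in a few lines. The 50% threshold comes from the rule above; the mean aggregation is an illustrative assumption, not the actual Politia formula:

```python
from typing import List, Optional

MIN_COVERAGE = 0.5  # the 50% data-sufficiency gate

def gated_component(values: List[Optional[float]]) -> Optional[float]:
    # Return None ("insufficient data") when fewer than half of the
    # required data points are present, instead of scoring from garbage.
    if not values:
        return None
    present = [v for v in values if v is not None]
    if len(present) < MIN_COVERAGE * len(values):
        return None
    return sum(present) / len(present)
```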

Entity Resolution

Matching the same politician across six different datasets is harder than it sounds. Consider Jyotiraditya Scindia:

  • Election Commission: "JYOTIRADITYA M. SCINDIA"
  • PRS Legislative: "Jyotiraditya Madhavrao Scindia"
  • MyNeta: "Shri Jyotiraditya M Scindia"
  • Parliament records: "SCINDIA, SHRI JYOTIRADITYA M."

Our solution: use the TCPD-IED unique politician ID as the canonical identifier. TCPD has already done the painstaking work of assigning stable IDs to politicians across elections since 1962. We match each source to TCPD first, then join everything through that ID.

For sources without TCPD IDs, we use a multi-pass fuzzy matching pipeline: exact name + constituency match first, then normalized name + state + party, then Levenshtein distance with manual review for ambiguous cases.
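The three passes can be sketched as follows. Here `difflib.SequenceMatcher` stands in for Levenshtein distance, and the 0.9 review threshold, honorific list, and record shape are illustrative assumptions:

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple

HONORIFICS = {"shri", "smt", "dr", "kumari"}
FUZZY_THRESHOLD = 0.9  # illustrative; below this, a human reviews the match

def normalize_name(name: str) -> str:
    # Lowercase, drop honorifics and punctuation, then sort tokens so that
    # "SCINDIA, SHRI JYOTIRADITYA M." and "Shri Jyotiraditya M Scindia" agree.
    tokens = [t for t in re.split(r"[^a-z]+", name.lower())
              if t and t not in HONORIFICS]
    return " ".join(sorted(tokens))

def match_to_tcpd(rec: dict, tcpd: list) -> Tuple[Optional[str], str]:
    # Pass 1: exact name + constituency.
    for t in tcpd:
        if rec["name"] == t["name"] and rec["constituency"] == t["constituency"]:
            return t["tcpd_id"], "exact"
    # Pass 2: normalized name + state + party.
    key = normalize_name(rec["name"])
    for t in tcpd:
        if (key == normalize_name(t["name"])
                and rec["state"] == t["state"] and rec["party"] == t["party"]):
            return t["tcpd_id"], "normalized"
    # Pass 3: fuzzy similarity; ambiguous cases are queued for manual review.
    best = max(tcpd, key=lambda t: SequenceMatcher(
        None, key, normalize_name(t["name"])).ratio())
    ratio = SequenceMatcher(None, key, normalize_name(best["name"])).ratio()
    return (best["tcpd_id"], "fuzzy") if ratio >= FUZZY_THRESHOLD \
        else (None, "manual_review")
```

Sorting the normalized tokens is what makes the inverted "SURNAME, FIRST" form from parliament records comparable with the other sources' orderings.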

What We Learned

  • Download before you scrape. 80% of Indian political data already exists in structured form on GitHub and academic portals. The instinct to build scrapers first wasted time we did not have
  • Audit your scores before you ship. The P0 audit was painful but it saved us from shipping a dashboard that would have given voters provably wrong information
  • Entity resolution is the hardest problem. Not the ML, not the UI, not the infrastructure. Matching "JYOTIRADITYA M. SCINDIA" to "Shri Jyotiraditya M Scindia" across six datasets is where the real engineering challenge lives
  • Free tiers are production-ready. Neon Postgres, Vercel, Cloudflare R2, GitHub Actions -- zero cost, zero compromises for our scale
  • Data sufficiency gates are non-negotiable. Never compute a score from incomplete data. Show "insufficient data" instead. Users trust you more when you admit what you do not know
