Where We Are
Politia today runs on a PostgreSQL database that is approximately 50 MB. It contains election results, MP biographical data, parliamentary activity records, asset declarations, and criminal case disclosures from six official data sources. The backend is a FastAPI application with a hexagonal architecture. The frontend is Next.js 16 with React 19 server components.
It works. It is fast. It is free to run. But it is a fraction of what Indian political data could be.
The Vision
We want to build India's largest open political data lakehouse -- a single, queryable repository that contains every piece of public information about every elected representative, from independence to today.
The full dataset, when assembled, would be approximately 16 TB:
- Structured data (~2 GB): Election results, affidavits, attendance records, committee memberships, voting records from 1952 to present
- Documents (~500 GB): Scanned affidavits, FIR copies, property declarations, committee reports in PDF
- Parliamentary audio (~15 TB): Approximately 17,000 hours of Lok Sabha and Rajya Sabha proceedings available from Sansad TV archives
- News corpus (~500 GB): Timestamped news articles about MPs from major outlets, for cross-referencing claims against records
The goal is not to hoard data. It is to make every public record about every public servant genuinely, usably public.
Phase 1: Foundation (Current)
This is where we are today: the core platform with six data sources, the scoring formula v2 with data sufficiency gates, and the accountability dashboard.
- Done: Election data ingestion (2004-2024), entity resolution via TCPD IDs, scoring formula v2, P0 audit complete
- In progress: 2019 affidavit data normalization, criminal case severity weighting, MyNeta 2024 scraper
- Stack: FastAPI + PostgreSQL + Next.js 16 on free tiers
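To make the idea of a data sufficiency gate concrete, here is a minimal sketch. The field names, the two-of-three rule, and the thresholds are illustrative assumptions, not Politia's actual formula:

```python
# Hypothetical sketch of a data-sufficiency gate: a score is only
# published when enough underlying data sources are populated, so MPs
# are not penalized for gaps in the data rather than gaps in their record.
# All field names and thresholds here are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MPRecord:
    attendance_sessions: int   # sessions with attendance data on file
    affidavits_on_file: int    # election affidavits successfully parsed
    questions_recorded: int    # parliamentary questions in the record

def score_if_sufficient(mp: MPRecord, raw_score: float) -> Optional[float]:
    """Publish a score only when at least two data sources are populated."""
    sources_present = sum([
        mp.attendance_sessions > 0,
        mp.affidavits_on_file > 0,
        mp.questions_recorded > 0,
    ])
    return raw_score if sources_present >= 2 else None
```

Returning None rather than a low score is the point of the gate: missing data is surfaced as "insufficient data", not silently folded into a bad rating.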
Phase 2: The Lakehouse
PostgreSQL is excellent for transactional queries -- "give me this MP's score" -- but it is the wrong tool for analytical queries like "show me wealth growth percentiles by party across all elections since 2004."
Phase 2 introduces DuckDB as an analytical layer sitting on Parquet files stored in Cloudflare R2:
- Parquet export pipeline: Nightly export from PostgreSQL to columnar Parquet files, optimized for analytical queries
- DuckDB for analytics: In-process OLAP engine that can query Parquet files directly from R2 -- no server needed
- Public data API: Anyone can download the Parquet files and run their own analysis locally with DuckDB
- Bulk data downloads: CSV and Parquet exports available for researchers, journalists, and civic tech projects
DuckDB can process analytical queries over millions of rows in milliseconds on a single machine. No Spark cluster needed. No EMR bills. Just files and a binary.
Phase 3: Audio & NLP
This is the ambitious phase. Indian Parliament has approximately 17,000 hours of recorded proceedings available through Sansad TV. This audio is currently unsearchable -- you cannot find what your MP said about a specific topic without manually scrubbing through hours of footage.
- Speech-to-text pipeline: Whisper large-v3 for transcription, handling the Hindi-English code-switching that is ubiquitous in parliamentary proceedings
- Speaker diarization: Identify which MP is speaking at any given moment, linked to our entity database
- Topic extraction: Automated tagging of parliamentary discussions by topic (agriculture, defense, education, etc.)
- Searchable transcripts: Full-text search across all parliamentary proceedings, linked to specific MPs and dates
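As a starting point for the topic extraction step, a keyword-based tagger is enough to make the pipeline concrete. The keyword lists below are illustrative; a production tagger would more likely use a trained classifier:

```python
# Illustrative keyword-based topic tagger for transcript segments.
# The keyword lists are stand-ins, not a curated taxonomy.
TOPIC_KEYWORDS = {
    "agriculture": {"farmer", "crop", "msp", "irrigation", "subsidy"},
    "defense": {"army", "border", "procurement", "defence"},
    "education": {"school", "university", "teacher", "scholarship"},
}

def tag_topics(segment: str) -> list[str]:
    """Return every topic whose keywords appear in a transcript segment."""
    words = set(segment.lower().split())
    return sorted(t for t, kws in TOPIC_KEYWORDS.items() if words & kws)
```

A segment can carry multiple tags, which is deliberate: a single intervention often touches several policy areas.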
Processing 17,000 hours of audio is expensive at cloud rates. Our plan: community-distributed processing where contributors donate compute cycles, similar to how Folding@Home distributes protein folding calculations.
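One way the distribution could be organized (entirely a sketch; the 30-minute chunk size and the manifest shape are assumptions): split the archive into fixed-length work units that volunteers claim, process, and return.

```python
# Sketch of splitting the audio archive into claimable work units for
# community-distributed transcription. Chunk length and manifest fields
# are assumptions, not a spec.
import math

def make_work_units(total_hours: float, chunk_minutes: int = 30) -> list[dict]:
    """Split the archive into fixed-length chunks volunteers can claim."""
    total_minutes = total_hours * 60
    n = math.ceil(total_minutes / chunk_minutes)
    return [
        {
            "unit_id": i,
            "start_min": i * chunk_minutes,
            "end_min": min((i + 1) * chunk_minutes, total_minutes),
            "status": "unclaimed",
        }
        for i in range(n)
    ]
```

At 30-minute chunks, the full 17,000-hour archive becomes 34,000 units, small enough that a volunteer with a consumer GPU can finish one in a sitting.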
Phase 4: Semantic Search
Once we have structured data, documents, and transcripts, the next step is making it all semantically searchable.
- Vector embeddings for all text content -- affidavits, transcripts, committee reports
- Natural language queries: "Which MPs spoke about farmer subsidies in the 2023 monsoon session?" returns ranked results with source links
- Cross-reference engine: Automatically flag when an MP's statements in Parliament contradict their affidavit declarations or voting record
- Claim verification: Link political claims to underlying data points -- "MP X says they raised 50 questions" can be verified against the actual parliamentary record
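The retrieval flow above can be sketched end to end. The embed() function here is a stand-in: a real deployment would use a sentence-embedding model, but a toy bag-of-words vector keeps the sketch runnable:

```python
# Minimal sketch of semantic search over transcript snippets.
# embed() is a toy bag-of-words stand-in for a real embedding model;
# only the ranking flow (embed, score, sort) mirrors the real design.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
```

Swapping the toy embed() for a multilingual sentence-embedding model is what turns this from keyword overlap into genuinely semantic search, without changing the ranking flow.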
Phase 5: Community Scale
The final phase is about making Politia a platform that the civic tech community owns and extends:
- State legislature expansion: Apply the same framework to MLAs across all 28 states and 8 union territories
- API marketplace: Third-party developers can build applications on top of the Politia data layer
- OCR pipeline: Community-driven digitization of older paper records -- affidavits, property declarations, FIRs -- that exist only as scanned images
- Journalist toolkit: Pre-built queries and data exports designed for investigative journalism workflows
- Academic access: Structured datasets with DOIs for citation in political science research
How to Contribute
Politia is open source. Here is how you can help:
- Data normalization: We have raw datasets that need cleaning. If you know Python and pandas, there is always a CSV that needs wrangling
- Frontend development: The dashboard is Next.js 16 with React 19. We need data visualization components, comparison tools, and mobile UX improvements
- Entity resolution: The hardest problem. Matching politician names across datasets requires both algorithmic approaches and domain knowledge about Indian politics
- OCR and document processing: Older affidavits are scanned PDFs. We need OCR pipelines that handle Hindi, English, and mixed-script documents
- Research and validation: Cross-check our data against known facts. File issues when numbers don't match. The integrity of the platform depends on community verification
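To illustrate the kind of algorithmic matching the entity resolution work involves, here is a deliberately simple fuzzy matcher using the standard library. Real resolution also needs transliteration handling and anchoring on stable identifiers like TCPD IDs; the 0.85 threshold is an arbitrary illustration:

```python
# Simple fuzzy name matcher as a starting point for entity resolution.
# Real matching must also handle transliteration variants (e.g.
# "Chaudhary" vs "Choudhury") and should anchor on stable IDs such as
# TCPD's where they exist. The 0.85 threshold is illustrative.
from difflib import SequenceMatcher
from typing import Optional

def best_match(name: str, candidates: list[str],
               threshold: float = 0.85) -> Optional[str]:
    """Return the closest candidate name above the similarity threshold."""
    def sim(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    best = max(candidates, key=lambda c: sim(name, c), default=None)
    if best is not None and sim(name, best) >= threshold:
        return best
    return None
```

Returning None below the threshold matters: an unmatched name should go to a human review queue, because a silent wrong merge corrupts every downstream score.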
Start by checking the GitHub issues tagged good-first-issue. Read the methodology page for context on how scores are computed. And if you find a bug in the data, please report it -- the P0 audit taught us that data bugs are the most dangerous kind.