Where We Are
Politia today runs on a PostgreSQL database that is approximately 50 MB. It contains election results, MP biographical data, parliamentary activity records, asset declarations, and criminal case disclosures from six official data sources. The backend is a FastAPI application with a hexagonal architecture. The frontend is Next.js 16 with React 19 server components.
It works. It is fast. It is free to run. But it is a fraction of what Indian political data could be.
The Vision
We want to build India's largest open political data lakehouse -- a single, queryable repository that contains every piece of public information about every elected representative, from independence to today.
The full dataset, when assembled, would be approximately 16 TB:
- Structured data (~2 GB): Election results, affidavits, attendance records, committee memberships, voting records from 1952 to present
- Documents (~500 GB): Scanned affidavits, FIR copies, property declarations, committee reports in PDF
- Parliamentary audio (~15 TB): Approximately 17,000 hours of Lok Sabha and Rajya Sabha proceedings available from Sansad TV archives
- News corpus (~500 GB): Timestamped news articles about MPs from major outlets, for cross-referencing claims against records
The goal is not to hoard data. It is to make every public record about every public servant genuinely, usably public.
Phase 1: Foundation (Current)
This is where we are today: the core platform with six data sources, the scoring formula v2 with data sufficiency gates, and the accountability dashboard.
- Done: Election data ingestion (2004-2024), entity resolution via TCPD IDs, scoring formula v2, P0 audit complete
- In progress: 2019 affidavit data normalization, criminal case severity weighting, MyNeta 2024 scraper
- Stack: FastAPI + PostgreSQL + Next.js 16 on free tiers
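To make the idea of a data sufficiency gate concrete, here is a minimal sketch. The field names, the two-of-three rule, and the thresholds are illustrative assumptions, not Politia's actual formula:

```python
# Hypothetical sketch of a data-sufficiency gate: a score is only
# published when enough underlying data sources are populated, so MPs
# are not penalized for gaps in the data rather than gaps in their record.
# All field names and thresholds here are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MPRecord:
    attendance_sessions: int   # sessions with attendance data on file
    affidavits_on_file: int    # election affidavits successfully parsed
    questions_recorded: int    # parliamentary questions in the record

def score_if_sufficient(mp: MPRecord, raw_score: float) -> Optional[float]:
    """Publish a score only when at least two data sources are populated."""
    sources_present = sum([
        mp.attendance_sessions > 0,
        mp.affidavits_on_file > 0,
        mp.questions_recorded > 0,
    ])
    return raw_score if sources_present >= 2 else None
```

Returning None rather than a low score is the point of the gate: missing data is surfaced as "insufficient data", not silently folded into a bad rating.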
Phase 2: The Lakehouse
PostgreSQL is excellent for transactional queries -- "give me this MP's score" -- but it is the wrong tool for analytical queries like "show me wealth growth percentiles by party across all elections since 2004."
Phase 2 introduces DuckDB as an analytical layer sitting on Parquet files stored in Cloudflare R2:
- Parquet export pipeline: Nightly export from PostgreSQL to columnar Parquet files, optimized for analytical queries
- DuckDB for analytics: In-process OLAP engine that can query Parquet files directly from R2 -- no server needed
- Public data API: Anyone can download the Parquet files and run their own analysis locally with DuckDB
- Bulk data downloads: CSV and Parquet exports available for researchers, journalists, and civic tech projects
DuckDB can process analytical queries over millions of rows in milliseconds on a single machine. No Spark cluster needed. No EMR bills. Just files and a binary.
Phase 3: Audio & NLP
This is the ambitious phase. Indian Parliament has approximately 17,000 hours of recorded proceedings available through Sansad TV. This audio is currently unsearchable -- you cannot find what your MP said about a specific topic without manually scrubbing through hours of footage.
- Speech-to-text pipeline: Whisper large-v3 for transcription, handling the Hindi-English code-switching that is ubiquitous in parliamentary proceedings
- Speaker diarization: Identify which MP is speaking at any given moment, linked to our entity database
- Topic extraction: Automated tagging of parliamentary discussions by topic (agriculture, defense, education, etc.)
- Searchable transcripts: Full-text search across all parliamentary proceedings, linked to specific MPs and dates
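As a starting point for the topic extraction step, a keyword-based tagger is enough to make the pipeline concrete. The keyword lists below are illustrative; a production tagger would more likely use a trained classifier:

```python
# Illustrative keyword-based topic tagger for transcript segments.
# The keyword lists are stand-ins, not a curated taxonomy.
TOPIC_KEYWORDS = {
    "agriculture": {"farmer", "crop", "msp", "irrigation", "subsidy"},
    "defense": {"army", "border", "procurement", "defence"},
    "education": {"school", "university", "teacher", "scholarship"},
}

def tag_topics(segment: str) -> list[str]:
    """Return every topic whose keywords appear in a transcript segment."""
    words = set(segment.lower().split())
    return sorted(t for t, kws in TOPIC_KEYWORDS.items() if words & kws)
```

A segment can carry multiple tags, which is deliberate: a single intervention often touches several policy areas.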
Processing 17,000 hours of audio is expensive at cloud rates. Our plan: community-distributed processing where contributors donate compute cycles, similar to how Folding@Home distributes protein folding calculations.
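One way the distribution could be organized (entirely a sketch; the 30-minute chunk size and the manifest shape are assumptions): split the archive into fixed-length work units that volunteers claim, process, and return.

```python
# Sketch of splitting the audio archive into claimable work units for
# community-distributed transcription. Chunk length and manifest fields
# are assumptions, not a spec.
import math

def make_work_units(total_hours: float, chunk_minutes: int = 30) -> list[dict]:
    """Split the archive into fixed-length chunks volunteers can claim."""
    total_minutes = total_hours * 60
    n = math.ceil(total_minutes / chunk_minutes)
    return [
        {
            "unit_id": i,
            "start_min": i * chunk_minutes,
            "end_min": min((i + 1) * chunk_minutes, total_minutes),
            "status": "unclaimed",
        }
        for i in range(n)
    ]
```

At 30-minute chunks, the full 17,000-hour archive becomes 34,000 units, small enough that a volunteer with a consumer GPU can finish one in a sitting.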
Phase 4: Semantic Search
Once we have structured data, documents, and transcripts, the next step is making it all semantically searchable.
- Vector embeddings for all text content -- affidavits, transcripts, committee reports
- Natural language queries: "Which MPs spoke about farmer subsidies in the 2023 monsoon session?" returns ranked results with source links
- Cross-reference engine: Automatically flag when an MP's statements in Parliament contradict their affidavit declarations or voting record
- Claim verification: Link political claims to underlying data points -- "MP X says they raised 50 questions" can be verified against the actual parliamentary record
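The retrieval flow above can be sketched end to end. The embed() function here is a stand-in: a real deployment would use a sentence-embedding model, but a toy bag-of-words vector keeps the sketch runnable:

```python
# Minimal sketch of semantic search over transcript snippets.
# embed() is a toy bag-of-words stand-in for a real embedding model;
# only the ranking flow (embed, score, sort) mirrors the real design.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
```

Swapping the toy embed() for a multilingual sentence-embedding model is what turns this from keyword overlap into genuinely semantic search, without changing the ranking flow.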
Phase 5: Community Scale
The final phase is about making Politia a platform that the civic tech community owns and extends:
- State legislature expansion: Apply the same framework to MLAs across all 28 states and 8 union territories
- API marketplace: Third-party developers can build applications on top of the Politia data layer
- OCR pipeline: Community-driven digitization of older paper records -- affidavits, property declarations, FIRs -- that exist only as scanned images
- Journalist toolkit: Pre-built queries and data exports designed for investigative journalism workflows
- Academic access: Structured datasets with DOIs for citation in political science research
How to Contribute
Politia is open source. Here is how you can help:
- Data normalization: We have raw datasets that need cleaning. If you know Python and pandas, there is always a CSV that needs wrangling
- Frontend development: The dashboard is Next.js 16 with React 19. We need data visualization components, comparison tools, and mobile UX improvements
- Entity resolution: The hardest problem. Matching politician names across datasets requires both algorithmic approaches and domain knowledge about Indian politics
- OCR and document processing: Older affidavits are scanned PDFs. We need OCR pipelines that handle Hindi, English, and mixed-script documents
- Research and validation: Cross-check our data against known facts. File issues when numbers don't match. The integrity of the platform depends on community verification
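To illustrate the kind of algorithmic matching the entity resolution work involves, here is a deliberately simple fuzzy matcher using the standard library. Real resolution also needs transliteration handling and anchoring on stable identifiers like TCPD IDs; the 0.85 threshold is an arbitrary illustration:

```python
# Simple fuzzy name matcher as a starting point for entity resolution.
# Real matching must also handle transliteration variants (e.g.
# "Chaudhary" vs "Choudhury") and should anchor on stable IDs such as
# TCPD's where they exist. The 0.85 threshold is illustrative.
from difflib import SequenceMatcher
from typing import Optional

def best_match(name: str, candidates: list[str],
               threshold: float = 0.85) -> Optional[str]:
    """Return the closest candidate name above the similarity threshold."""
    def sim(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    best = max(candidates, key=lambda c: sim(name, c), default=None)
    if best is not None and sim(name, best) >= threshold:
        return best
    return None
```

Returning None below the threshold matters: an unmatched name should go to a human review queue, because a silent wrong merge corrupts every downstream score.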
Start by checking the GitHub issues tagged good-first-issue. Read the methodology page for context on how scores are computed. And if you find a bug in the data, please report it -- the P0 audit taught us that data bugs are the most dangerous kind.