A public-interest web application that lets Indian citizens explore and audit 16.6 million Central Public Procurement Portal (CPPP) tender and award records via pre-built investigation queries or custom SQL, filtered by state / year / sector.
The project's credibility rests on three pillars:
- Reproducible pipeline.
make reproduceregenerates byte-identical Parquet output from the source SQLite snapshots (SHA256 match). - Per-file provenance. Every Parquet file carries 13 key/value metadata entries — conversion script SHA, source URL, scrape date, DuckDB version, ZSTD level, source-table name — so any downloaded partition is independently verifiable.
- Zero-loss conversion. Row counts in Parquet match the source SQLite exactly for all 4 tables; JSON round-trip checks on the two detail tables return 0 mismatches on 100/100 sampled rows.
The source SQLite database (12.5 GB, 16.6M records across 4 tables) was
scraped and published by tender.sarthaksidhant.com.
This repository does not redistribute the SQLite source — fetch it from
the upstream URL if you want to run make reproduce end-to-end.
| Table | Rows | Description |
|---|---|---|
aoc_tenders |
4,921,960 | Authority of Contract (AoC) award notices |
aoc_details |
4,540,739 | AoC line-item details (JSON blob + flattened columns) |
tenders |
3,952,191 | Live tender notices |
tender_details |
3,178,484 | Tender line-item details (JSON blob + flattened columns) |
| Total | 16,593,374 |
The converted dataset lives in parquet/ (excluded from git; see below for
download). Structure:
parquet/
aoc_tenders/year=2017/data.parquet
aoc_tenders/year=2018/data.parquet
...
tender_details/year=unparsed/data.parquet
- 84 partition files, year-partitioned, ZSTD-compressed, 3.0 GB total (largest single file: 759 MB)
- Hybrid JSON storage: flattened typed columns for query performance +
raw_jsonescape-hatch column for forward compatibility statecolumn derived fromorg_name/Locationvia a 28-states + 8-UTs dictionary with 78 synonyms (convert/states.json)- Each file carries 13 metadata entries via DuckDB
kv_metadata:conversion_script_sha,conversion_script_path,conversion_timestamp,duckdb_version,row_count,scraped_at_min,scraped_at_max,source_db,source_db_sha256,source_table,source_url,year_partition,zstd_level
manifest.json lists every partition with its SHA256, row count, size, and
the full kv_metadata blob, so integrity can be verified with a single
HEAD request per file.
git clone https://github.com/dhrubasumatary/cpp-tender-audit.git
cd cpp-tender-audit
# install duckdb python binding
make install
# reproduce the parquet tree from source SQLite (requires the 12.5 GB DB
# files placed at repo root: aoc_tenders.db, tenders_vps.db)
make reproduce
# sanity-check: assert partition count in [50,120] and total size in [3G,6G]
make verifyTo validate without re-running conversion:
make validate # row-count match + 100-row JSON round-trip sample
make manifest # rebuild manifest.json with fresh SHA256sdata_quality.json is the canonical validation report. Summary as of the
last run (2026-06-27):
| Table | Row count | PK unique | State populated | JSON round-trip |
|---|---|---|---|---|
aoc_tenders |
4,921,960 ✅ | ✅ | 63.2% | n/a |
aoc_details |
4,540,739 ✅ | ✅ | 12.8% | 100/100 clean |
tenders |
3,952,191 ✅ | ✅ | 6.3% | n/a |
tender_details |
3,178,484 ✅ | ✅ | 19.4% | 100/100 clean |
The low state-population fractions on the *_details tables are honest and
intentional: many award line items do not repeat the issuing organisation's
name in their JSON blob, so state stays NULL. This is a real data
characteristic, not a conversion bug — do not try to "fix" it.
.
├── convert/
│ ├── convert.py # SQLite → year-partition Parquet pipeline
│ ├── validate.py # row-count + JSON round-trip validator
│ ├── manifest.py # manifest.json builder (SHA256 + kv_metadata)
│ ├── states.json # 28 states + 8 UTs + 78 synonyms
│ └── recompress.py # utility: re-ZSTD a parquet tree
├── Makefile # install / convert / validate / reproduce / verify
├── manifest.json # per-partition SHA256 + provenance metadata
├── data_quality.json # canonical validation report
├── LICENSE # MIT
└── README.md # this file
The 84 Parquet files (3.0 GB total) are not committed to git — they exceed GitHub's per-file size limits and the architecture is designed for HTTP range requests against an R2 bucket (the project's next phase).
A read-only mirror is published as GitHub Release assets on the
Releases page. Each partition is a separate asset so you
can download only the years/tables you need. After download, place them
under parquet/<table>/year=<YYYY>/data.parquet and run make verify.
See data_quality.json for the exact validation checks. The conversion
script's git blob SHA is embedded in every Parquet file's kv_metadata under
the key conversion_script_sha; the canonical script hash is also listed in
manifest.json.
MIT — see LICENSE. The Parquet dataset is a transformed view of publicly scraped CPPP records; the upstream source is tender.sarthaksidhant.com.
- Phase 1 — SQLite → Parquet pipeline (this repository)
- Phase 2 — R2 deployment (public bucket, CORS, HTTP range support)
- Phase 3 — Static site (state picker, flagship audit query cards, SQL editor)
- Phase 4 — Launch + performance tuning