CPP Tender Audit Tool

A public-interest web application that lets Indian citizens explore and audit 16.6 million Central Public Procurement Portal (CPPP) tender and award records via pre-built investigation queries or custom SQL, filtered by state / year / sector.

The project's credibility rests on three pillars:

Reproducible pipeline. make reproduce regenerates byte-identical Parquet output from the source SQLite snapshots (SHA256 match).
Per-file provenance. Every Parquet file carries 13 key/value metadata entries — conversion script SHA, source URL, scrape date, DuckDB version, ZSTD level, source-table name — so any downloaded partition is independently verifiable.
Zero-loss conversion. Row counts in Parquet match the source SQLite exactly for all 4 tables; JSON round-trip checks on the two detail tables return 0 mismatches on 100/100 sampled rows.

Data source

The source SQLite database (12.5 GB, 16.6M records across 4 tables) was scraped and published by tender.sarthaksidhant.com. This repository does not redistribute the SQLite source — fetch it from the upstream URL if you want to run make reproduce end-to-end.

Table	Rows	Description
`aoc_tenders`	4,921,960	Authority of Contract (AoC) award notices
`aoc_details`	4,540,739	AoC line-item details (JSON blob + flattened columns)
`tenders`	3,952,191	Live tender notices
`tender_details`	3,178,484	Tender line-item details (JSON blob + flattened columns)
Total	16,593,374

Parquet dataset

The converted dataset lives in parquet/ (excluded from git; see below for download). Structure:

parquet/
  aoc_tenders/year=2017/data.parquet
  aoc_tenders/year=2018/data.parquet
  ...
  tender_details/year=unparsed/data.parquet

84 partition files, year-partitioned, ZSTD-compressed, 3.0 GB total (largest single file: 759 MB)
Hybrid JSON storage: flattened typed columns for query performance + raw_json escape-hatch column for forward compatibility
state column derived from org_name / Location via a 28-states + 8-UTs dictionary with 78 synonyms (convert/states.json)
Each file carries 13 metadata entries via DuckDB kv_metadata: conversion_script_sha, conversion_script_path, conversion_timestamp, duckdb_version, row_count, scraped_at_min, scraped_at_max, source_db, source_db_sha256, source_table, source_url, year_partition, zstd_level

manifest.json lists every partition with its SHA256, row count, size, and the full kv_metadata blob, so integrity can be verified with a single HEAD request per file.

Quickstart

git clone https://github.com/dhrubasumatary/cpp-tender-audit.git
cd cpp-tender-audit

# install duckdb python binding
make install

# reproduce the parquet tree from source SQLite (requires the 12.5 GB DB
# files placed at repo root: aoc_tenders.db, tenders_vps.db)
make reproduce

# sanity-check: assert partition count in [50,120] and total size in [3G,6G]
make verify

To validate without re-running conversion:

make validate   # row-count match + 100-row JSON round-trip sample
make manifest   # rebuild manifest.json with fresh SHA256s

Data quality

data_quality.json is the canonical validation report. Summary as of the last run (2026-06-27):

Table	Row count	PK unique	State populated	JSON round-trip
`aoc_tenders`	4,921,960 ✅	✅	63.2%	n/a
`aoc_details`	4,540,739 ✅	✅	12.8%	100/100 clean
`tenders`	3,952,191 ✅	✅	6.3%	n/a
`tender_details`	3,178,484 ✅	✅	19.4%	100/100 clean

The low state-population fractions on the *_details tables are honest and intentional: many award line items do not repeat the issuing organisation's name in their JSON blob, so state stays NULL. This is a real data characteristic, not a conversion bug — do not try to "fix" it.

Repository layout

.
├── convert/
│   ├── convert.py      # SQLite → year-partition Parquet pipeline
│   ├── validate.py      # row-count + JSON round-trip validator
│   ├── manifest.py      # manifest.json builder (SHA256 + kv_metadata)
│   ├── states.json      # 28 states + 8 UTs + 78 synonyms
│   └── recompress.py    # utility: re-ZSTD a parquet tree
├── Makefile             # install / convert / validate / reproduce / verify
├── manifest.json        # per-partition SHA256 + provenance metadata
├── data_quality.json    # canonical validation report
├── LICENSE              # MIT
└── README.md            # this file

Parquet dataset download

The 84 Parquet files (3.0 GB total) are not committed to git — they exceed GitHub's per-file size limits and the architecture is designed for HTTP range requests against an R2 bucket (the project's next phase).

A read-only mirror is published as GitHub Release assets on the Releases page. Each partition is a separate asset so you can download only the years/tables you need. After download, place them under parquet/<table>/year=<YYYY>/data.parquet and run make verify.

Methodology

See data_quality.json for the exact validation checks. The conversion script's git blob SHA is embedded in every Parquet file's kv_metadata under the key conversion_script_sha; the canonical script hash is also listed in manifest.json.

License

MIT — see LICENSE. The Parquet dataset is a transformed view of publicly scraped CPPP records; the upstream source is tender.sarthaksidhant.com.

Roadmap

Phase 1 — SQLite → Parquet pipeline (this repository)
Phase 2 — R2 deployment (public bucket, CORS, HTTP range support)
Phase 3 — Static site (state picker, flagship audit query cards, SQL editor)
Phase 4 — Launch + performance tuning

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CPP Tender Audit Tool

Data source

Parquet dataset

Quickstart

Data quality

Repository layout

Parquet dataset download

Methodology

License

Roadmap

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
convert		convert
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SHA256SUMS		SHA256SUMS
data_quality.json		data_quality.json
manifest.json		manifest.json

Folders and files

Latest commit

History

Repository files navigation

CPP Tender Audit Tool

Data source

Parquet dataset

Quickstart

Data quality

Repository layout

Parquet dataset download

Methodology

License

Roadmap

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages