Commit 87bf34a

dfa1 and claude committed
docs(readme): add compression benchmark vs Rust JNI
Java produces 40.7 MB from the NYC taxi parquet (vs 47 MB from Rust), 13% smaller. Explains the global-dict advantage on low-cardinality F64 columns and references TaxiParquetOracleVsJavaIntegrationTest as the data-integrity proof.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 1a1a676 commit 87bf34a

1 file changed

Lines changed: 16 additions & 0 deletions

README.md

@@ -31,6 +31,22 @@ ops/s = complete file scans per second; higher is better.

Measured 2026-06-16, commit `74ec207b`. See [docs/explanation.md](docs/explanation.md#benchmarks) for full tables and methodology.

**Compression** — NYC Yellow Taxi 2024-01, 2,964,624 rows × 19 columns, imported from the
same Parquet file (47.6 MB), cascading depth 3, Apple M5:

| Implementation | Output size | vs Parquet |
|----------------|-------------|------------|
| Rust JNI       | 47.0 MB     | −1.3%      |
| **Java**       | **40.7 MB** | **−14.5%** |

Java produces a 13% smaller file than the Rust reference from identical input.
The gap comes from the global dictionary encoder that catches low-cardinality `F64`
columns (`mta_tax`, `Airport_fee`, `congestion_surcharge`) that Rust's compressor
leaves as plain ALP. Data integrity is verified by
`TaxiParquetOracleVsJavaIntegrationTest`: hardwood reads the Parquet file directly
to a CSV (oracle); `ParquetImporter` → `CsvExporter` produces a second CSV (SUT);
line-by-line diff is zero.
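Why a dictionary helps on a low-cardinality `F64` column can be seen in a minimal sketch. This is not the project's encoder: the class `DictSketch`, the one-byte code width, and the sample values are assumptions made for illustration only. Each distinct double is stored once, and every row then carries a one-byte code instead of eight bytes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictSketch {
    /** Distinct values in first-seen order, plus one single-byte code per row. */
    record Encoded(double[] dictionary, byte[] codes) {}

    static Encoded encode(double[] column) {
        Map<Long, Integer> index = new LinkedHashMap<>();
        byte[] codes = new byte[column.length];
        for (int i = 0; i < column.length; i++) {
            // Key on the raw bit pattern so that -0.0 stays distinct from 0.0.
            long bits = Double.doubleToLongBits(column[i]);
            Integer code = index.get(bits);
            if (code == null) {
                if (index.size() == 256) {
                    throw new IllegalArgumentException("too many distinct values for one-byte codes");
                }
                code = index.size();
                index.put(bits, code);
            }
            codes[i] = (byte) code.intValue();
        }
        double[] dictionary = new double[index.size()];
        int k = 0;
        for (long bits : index.keySet()) {
            dictionary[k++] = Double.longBitsToDouble(bits);
        }
        return new Encoded(dictionary, codes);
    }

    static double[] decode(Encoded e) {
        double[] out = new double[e.codes().length];
        for (int i = 0; i < out.length; i++) {
            out[i] = e.dictionary()[e.codes()[i] & 0xFF]; // mask: codes are unsigned
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample values; a real column has millions of rows.
        double[] mtaTax = {0.5, 0.5, 0.0, 0.5, -0.5, 0.5, 0.5, 0.0};
        Encoded e = encode(mtaTax);
        int rawBytes = mtaTax.length * Double.BYTES;
        int encodedBytes = e.dictionary().length * Double.BYTES + e.codes().length;
        System.out.println("raw=" + rawBytes + " encoded=" + encodedBytes); // raw=64 encoded=32
    }
}
```

With only three distinct values the dictionary cost is fixed, so the saving grows with the row count; a real encoder would likely also bit-pack the codes below one byte.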
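The oracle-versus-SUT comparison amounts to a line-by-line equality check over two CSV streams. The sketch below shows one way such a check could look. It is not the code of `TaxiParquetOracleVsJavaIntegrationTest`; the class `CsvLineDiff`, its method names, and the sample rows are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class CsvLineDiff {
    /** Returns the 1-based number of the first differing line, or -1 if both inputs match. */
    static long firstMismatch(BufferedReader oracle, BufferedReader sut) {
        try {
            long lineNo = 0;
            while (true) {
                String expected = oracle.readLine();
                String actual = sut.readLine();
                lineNo++;
                if (expected == null && actual == null) {
                    return -1; // both ended together with no difference
                }
                if (expected == null || !expected.equals(actual)) {
                    return lineNo; // content differs, or one input ended early
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static BufferedReader reader(String text) {
        return new BufferedReader(new StringReader(text));
    }

    public static void main(String[] args) {
        // Made-up rows standing in for the two exported CSV files.
        String oracle = "VendorID,mta_tax\n1,0.5\n2,0.0\n";
        String changed = "VendorID,mta_tax\n1,0.5\n2,0.5\n";
        System.out.println(firstMismatch(reader(oracle), reader(oracle)));  // -1
        System.out.println(firstMismatch(reader(oracle), reader(changed))); // 3
    }
}
```

Streaming both inputs keeps memory flat for a file of roughly three million rows; for real files the readers would come from `Files.newBufferedReader` rather than in-memory strings.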
## Who is this for

- JVM analytics engines and OLAP systems
