Cranot/mzip


mzip

Detection-based compression that finds patterns zstd, brotli, and bzip2 miss.

Store the formula, not the data.

Why mzip?

mzip detects mathematical structure in your data and picks an encoding matched to that structure. Where other compressors see bytes, mzip sees patterns.

| Pattern Detected | mzip | Best Alternative | Advantage |
| --- | --- | --- | --- |
| Sequential IDs (1, 2, 3, ...) | 32 bytes | bzip2: 3.4KB | 106x better |
| Repeating templates (JSON APIs) | 10KB | brotli: 49KB | 4.9x better |
| Audio PCM waveforms | 2.1KB | bzip2: 4KB | 2x better |
| Image gradients | 124 bytes | brotli: 397B | 3.2x better |

Result: 75% win rate across 250 tests (50 data types × 5 sizes) against 8 compressors including brotli, bzip2, xz, 7z, and rar.


Key Strengths

Where mzip Dominates

| Data Category | Win Rate | Why |
| --- | --- | --- |
| Numeric sequences | 100% | Formula compression: v[i] = a + b*i beats any LZ77 |
| Structured JSON/XML | 90% | Template extraction captures repeating structure |
| Audio/sensor data | 100% | Delta encoding exploits temporal correlation |
| Log files | 80% | Columnar separation + BWT on each column |
| Large files (>256KB) | 86% | More data = more patterns to detect |
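
The delta-encoding row can be illustrated with a minimal sketch. This is not mzip's NUMERIC implementation: the names `delta_encode`, `delta_decode`, and `check_roundtrip` are hypothetical, and 16-bit PCM samples are an assumption. Smooth signals produce small deltas clustered near zero, which a later entropy-coding stage can store more cheaply than the raw samples.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: replace each sample with its difference from the previous one.
// The subtraction is done in uint16_t so wraparound is well-defined and reversible.
std::vector<uint16_t> delta_encode(const std::vector<int16_t>& samples) {
    std::vector<uint16_t> out;
    out.reserve(samples.size());
    uint16_t prev = 0;
    for (int16_t s : samples) {
        const uint16_t cur = static_cast<uint16_t>(s);
        out.push_back(static_cast<uint16_t>(cur - prev));
        prev = cur;
    }
    return out;
}

// Inverse transform: a running sum of the deltas restores the samples.
std::vector<int16_t> delta_decode(const std::vector<uint16_t>& deltas) {
    std::vector<int16_t> out;
    out.reserve(deltas.size());
    uint16_t prev = 0;
    for (uint16_t d : deltas) {
        prev = static_cast<uint16_t>(prev + d);
        out.push_back(static_cast<int16_t>(prev));  // two's-complement reinterpretation
    }
    return out;
}

// Hypothetical self-check: the transform must be lossless.
void check_roundtrip(const std::vector<int16_t>& samples) {
    assert(delta_decode(delta_encode(samples)) == samples);
}
```

For the samples 1000, 1004, 1009, 1011, 1008 the first delta carries the starting value and the rest are 4, 5, 2, and -3 (stored as 65533 after wraparound).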

Where Others Win

| Scenario | Winner | Why |
| --- | --- | --- |
| Text/code (small margins) | bzip2 | BWT tuning differences (typically <100 bytes) |
| Random/encrypted data | zstd | No patterns to detect, just raw entropy coding |

How It Works

Most general-purpose compressors treat their input as an undifferentiated stream of bytes. But real data has structure:

| Pattern | Example | What mzip does |
| --- | --- | --- |
| Sequential values | 1, 2, 3, 4, ... | Store formula v[i] = start + i × step |
| Repeating templates | Same function 100x with different IDs | Store template once + variable list |
| Columnar data | Log files with fixed columns | Separate columns, compress each optimally |
| Audio samples | Smooth waveforms | Delta encoding exploits sample-to-sample correlation |

zstd-19 compresses 1MB of sequential IDs to 8KB. mzip compresses it to 32 bytes.
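
The formula idea for sequential values can be sketched as follows. This is an illustration under stated assumptions, not mzip's LINEAR_GEN code: `LinearFormula`, `detect_linear`, and `expand` are hypothetical names, values are assumed to be 64-bit integers, and differences between neighbours are assumed not to overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical description of a sequence: v[i] = start + i * step.
struct LinearFormula {
    int64_t start;
    int64_t step;
    uint64_t count;
};

// Returns the formula if every value fits it exactly, otherwise std::nullopt.
std::optional<LinearFormula> detect_linear(const std::vector<int64_t>& v) {
    if (v.size() < 2) return std::nullopt;  // too short to infer a step
    const int64_t step = v[1] - v[0];
    for (std::size_t i = 2; i < v.size(); ++i) {
        if (v[i] - v[i - 1] != step) return std::nullopt;
    }
    return LinearFormula{v[0], step, v.size()};
}

// Regenerates the original values from the formula alone.
std::vector<int64_t> expand(const LinearFormula& f) {
    assert(f.count >= 2);  // detect_linear never reports shorter sequences
    std::vector<int64_t> out;
    out.reserve(f.count);
    for (uint64_t i = 0; i < f.count; ++i) {
        out.push_back(f.start + static_cast<int64_t>(i) * f.step);
    }
    return out;
}
```

The stored description has a fixed size no matter how many values it covers, which is why the ratio for this kind of data grows with input size. A single value that breaks the pattern makes `detect_linear` return nothing, so a real compressor needs a fallback strategy.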


Benchmark Results

All benchmarks run on synthetic data generated by generators.hpp. The sample links in each table point to the exact input/output files.

Overall Compressor Scoreboard (250 tests: 50 types × 5 sizes)

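| Compressor | Avg Ratio | Range | MB/s | Wins | Win% | Score | Rank |
| --- | --- | --- | --- | --- | --- | --- | --- |
| mzip | 8.16x | 1.0-32768x | 0.6 | 188 | 75.2% | 153.6 | 1 |
| bzip2:9 | 5.66x | 1.0-1001x | 0.6 | 63 | 25.2% | 39.5 | 2 |
| zstd:19 | 5.14x | 1.0-2641x | 1.4 | 30 | 12.0% | 21.3 | 3 |
| rar:m5 | 5.97x | 1.0-1014x | 2.6 | 0 | 0.0% | 6.6 | 4 |
| xz:9 | 5.89x | 1.0-997x | 2.3 | 0 | 0.0% | 6.4 | 5 |
| 7z:mx9 | 5.88x | 1.0-922x | 2.3 | 0 | 0.0% | 6.4 | 6 |
| gzip:9 | 4.78x | 1.0-240x | 0.8 | 0 | 0.0% | 4.7 | 7 |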

Score = ratio × speed^0.1 × (1 + 0.1×wins). Total: 66.60 MB. lz4/snappy excluded (speed-focused).
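
The score formula is easy to reproduce, which helps when reading the ranking. The function below is a direct transcription of the stated formula; the name `score` and its parameter names are illustrative, and the inputs are the rounded values from the scoreboard, so results can differ from the table in the last digit.

```cpp
#include <cassert>
#include <cmath>

// Score = ratio × speed^0.1 × (1 + 0.1 × wins), as stated above.
double score(double avg_ratio, double speed_mb_s, int wins) {
    assert(speed_mb_s > 0.0);  // a fractional power needs a positive base
    return avg_ratio * std::pow(speed_mb_s, 0.1) * (1.0 + 0.1 * wins);
}
```

With the mzip row (8.16, 0.6, 188) this gives roughly 153.5. Note how the terms weigh in: the exponent 0.1 makes speed nearly irrelevant (0.6 MB/s versus 2.6 MB/s changes the score by about 16%), while 188 wins multiply the score by 19.8. Compressors with zero wins are therefore ranked almost purely by average ratio.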

Decompression Speed

| Compressor | Time (ms) | Speed (MB/s) |
| --- | --- | --- |
| zstd | 96.4 | 690.8 |
| mzip | 3285.2 | 20.3 |

zstd decompresses 34.1x faster than mzip

Win Rate by Size

| Size | Wins | Total | Win% |
| --- | --- | --- | --- |
| 4KB | 44 | 50 | 88.0% |
| 16KB | 27 | 50 | 54.0% |
| 64KB | 30 | 50 | 60.0% |
| 256KB | 38 | 50 | 76.0% |
| 1MB | 49 | 50 | 98.0% |

Top 10 mzip Wins

| Type | mzip | 2nd Best | Advantage |
| --- | --- | --- | --- |
| Database IDs (1MB) | 32B (32768x) | 3.4KB | 106.8x better |
| Timestamps (1MB) | 32B (32768x) | 2.7KB | 84.2x better |
| Database IDs (256KB) | 32B (8192x) | 937B | 29.3x better |
| Timestamps (256KB) | 32B (8192x) | 772B | 24.1x better |
| Database IDs (64KB) | 32B (2048x) | 301B | 9.4x better |
| Timestamps (64KB) | 32B (2048x) | 287B | 9.0x better |
| Image gradient (256KB) | 53B (4946x) | 323B | 6.1x better |
| Image gradient (64KB) | 39B (1680x) | 212B | 5.4x better |
| Timestamps (16KB) | 32B (512x) | 160B | 5.0x better |
| JSON API (1MB) | 10KB (104x) | 49KB | 4.9x better |

Where bzip2 Wins (Top 10 Gaps)

bzip2's BWT implementation occasionally beats mzip by small margins on text/code files.

| Type | mzip | Best | Gap |
| --- | --- | --- | --- |
| Metrics (1MB) | 121KB | bzip2: 120KB | +1KB |
| Nginx log (256KB) | 23.7KB | bzip2: 23.5KB | +250B |
| .env file (256KB) | 73KB | bzip2: 73KB | +184B |
| CSS (64KB) | 4.3KB | bzip2: 4.2KB | +106B |
| TOML config (4KB) | 1064B | bzip2: 969B | +95B |
| Natural text (4KB) | 741B | bzip2: 653B | +88B |
| INI config (64KB) | 10.4KB | bzip2: 10.3KB | +87B |
| Unicode text (256KB) | 7.8KB | bzip2: 7.7KB | +83B |
| Unicode text (16KB) | 1181B | bzip2: 1101B | +80B |
| Python (16KB) | 2585B | bzip2: 2505B | +80B |

NUMERIC

| Type | 64KB | 256KB | 1MB | Samples |
| --- | --- | --- | --- | --- |
| Timestamps | 32B vs 287B | 32B vs 772B | 32B vs 2.6KB | 64k 256k 1m |
| Database IDs | 32B vs 301B | 32B vs 937B | 32B vs 3.3KB | 64k 256k 1m |
| Integer array | 3.3KB vs 4.4KB | 12KB vs 17KB | 51KB vs 67KB | 64k 256k 1m |
| GPS coordinates | 9.7KB vs 11KB | 38KB vs 44KB | 154KB vs 179KB | 64k 256k 1m |
| Float temperature | 11KB vs 22KB | 40KB vs 87KB | 151KB vs 331KB | 64k 256k 1m |
| Sensor 16-bit | 26KB vs 27KB | 107KB vs 111KB | 430KB vs 445KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

STRUCTURED

| Type | 64KB | 256KB | 1MB | Samples |
| --- | --- | --- | --- | --- |
| GraphQL queries | 2.8KB vs 2.8KB | 7.8KB vs 7.8KB | 26KB vs 28KB | 64k 256k 1m |
| SQL dump | 4.8KB vs 4.7KB | 15KB vs 15KB | 54KB vs 56KB | 64k 256k 1m |
| JSON API | 1016B vs 3.7KB | 2.8KB vs 12KB | 9.8KB vs 48KB | 64k 256k 1m |
| XML document | 1020B vs 2.2KB | 2.9KB vs 8.0KB | 10KB vs 29KB | 64k 256k 1m |
| CSV data | 7.1KB vs 9.8KB | 23KB vs 33KB | 88KB vs 122KB | 64k 256k 1m |
| Base64 data | 47KB vs 48KB | 189KB vs 192KB | 758KB vs 771KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

CODE

| Type | 64KB | 256KB | 1MB | Samples |
| --- | --- | --- | --- | --- |
| JavaScript | 4.4KB vs 5.2KB | 11KB vs 12KB | 37KB vs 43KB | 64k 256k 1m |
| Python | 5.4KB vs 5.3KB | 12KB vs 12KB | 35KB vs 40KB | 64k 256k 1m |
| TypeScript | 4.4KB vs 4.4KB | 12KB vs 12KB | 43KB vs 45KB | 64k 256k 1m |
| HTML | 5.7KB vs 5.7KB | 17KB vs 18KB | 65KB vs 68KB | 64k 256k 1m |
| CSS | 4.2KB vs 4.1KB | 11KB vs 11KB | 41KB vs 43KB | 64k 256k 1m |
| Go | 3.4KB vs 3.4KB | 8.4KB vs 8.5KB | 26KB vs 28KB | 64k 256k 1m |
| Rust | 3.5KB vs 3.5KB | 9.1KB vs 9.1KB | 29KB vs 31KB | 64k 256k 1m |
| Java | 3.9KB vs 3.9KB | 10KB vs 10KB | 32KB vs 36KB | 64k 256k 1m |
| C | 5.2KB vs 5.2KB | 15KB vs 15KB | 50KB vs 54KB | 64k 256k 1m |
| Bash | 3.7KB vs 3.7KB | 10KB vs 10KB | 35KB vs 37KB | 64k 256k 1m |
| PHP | 3.3KB vs 3.3KB | 8.4KB vs 8.6KB | 25KB vs 27KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

CONFIG

| Type | 64KB | 256KB | 1MB | Samples |
| --- | --- | --- | --- | --- |
| Docker Compose | 2.2KB vs 2.1KB | 5.5KB vs 5.6KB | 18KB vs 19KB | 64k 256k 1m |
| Terraform | 3.4KB vs 3.0KB | 11KB vs 10KB | 42KB vs 40KB | 64k 256k 1m |
| K8s manifests | 3.3KB vs 3.3KB | 7.6KB vs 7.6KB | 21KB vs 24KB | 64k 256k 1m |
| YAML config | 3.8KB vs 3.8KB | 11KB vs 11KB | 38KB vs 41KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

LOG

| Type | 64KB | 256KB | 1MB | Samples |
| --- | --- | --- | --- | --- |
| Access log | 6.3KB vs 6.8KB | 21KB vs 24KB | 85KB vs 94KB | 64k 256k 1m |
| Nginx access log | 6.7KB vs 6.8KB | 22KB vs 22KB | 83KB vs 87KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

BINARY

| Type | 64KB | 256KB | 1MB | Samples |
| --- | --- | --- | --- | --- |
| Image gradient | 39B vs 212B | 53B vs 323B | 124B vs 397B | 64k 256k 1m |
| Audio PCM | 1.7KB vs 4.0KB | 1.7KB vs 4.0KB | 1.7KB vs 4.0KB | 64k 256k 1m |
| Sparse bitmap | 689B vs 880B | 2.6KB vs 3.0KB | 10KB vs 11KB | 64k 256k 1m |
| Protobuf-like | 40KB vs 41KB | 160KB vs 163KB | 640KB vs 650KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

ADDITIONAL

| Type | 64KB | 256KB | 1MB | Samples |
| --- | --- | --- | --- | --- |
| Natural text | 7.3KB vs 7.2KB | 28KB vs 28KB | 111KB vs 111KB | 64k 256k 1m |
| Markdown docs | 3.8KB vs 3.8KB | 11KB vs 11KB | 39KB vs 41KB | 64k 256k 1m |
| Email headers | 6.8KB vs 6.8KB | 21KB vs 21KB | 76KB vs 79KB | 64k 256k 1m |
| Unicode text | 2.5KB vs 2.4KB | 7.6KB vs 7.5KB | 27KB vs 28KB | 64k 256k 1m |
| Syslog | 8.8KB vs 9.4KB | 32KB vs 34KB | 126KB vs 133KB | 64k 256k 1m |
| Metrics | 7.8KB vs 7.8KB | 29KB vs 29KB | 118KB vs 117KB | 64k 256k 1m |
| JSON log | 7.2KB vs 8.0KB | 29KB vs 30KB | 116KB vs 122KB | 64k 256k 1m |
| Timestamps (jitter) | 14KB vs 15KB | 56KB vs 61KB | 224KB vs 244KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.

BUILD

| Type | 64KB | 256KB | 1MB | Samples |
| --- | --- | --- | --- | --- |
| Makefile | 3.8KB vs 4.6KB | 12KB vs 16KB | 46KB vs 61KB | 64k 256k 1m |
| package.json | 4.8KB vs 4.8KB | 15KB vs 15KB | 55KB vs 57KB | 64k 256k 1m |
| Cargo.toml | 3.5KB vs 3.5KB | 10KB vs 10KB | 36KB vs 38KB | 64k 256k 1m |

Format: mzip vs 2nd-best. Bold = winner.


Real-World File Benchmark

Benchmarked on 47 real files (7.2MB) from GitHub: source code from React, the Linux kernel, Django, Bootstrap, and 20+ programming languages.

Where mzip WINS or TIES

| File | Size | mzip | Best | Result |
| --- | --- | --- | --- | --- |
| events.csv | 592KB | 7.07x | bzip2 7.05x | mzip +0.3% |
| lodash.js | 545KB | 7.69x | bzip2 7.69x | tie |
| app.log | 475KB | 7.72x | bzip2 7.70x | mzip +0.3% |

Where brotli WINS

Brotli's 120KB static dictionary is optimized for common web/code patterns.

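| File | Size | mzip | brotli | Gap |
| --- | --- | --- | --- | --- |
| apache_log_sample.log | 2.3MB | 18.71x | 19.99x | -7% |
| bootstrap.css | 280KB | 10.83x | 11.45x | -6% |
| k8s_deployments.yaml | 22KB | 17.55x | 20.85x | -19% |
| terraform_main.tf | 6KB | 3.11x | 3.56x | -14% |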
terraform_main.tf 6KB 3.11x 3.56x -14%

Summary

mzip excels on structured data with patterns (logs, CSV, JSON with templates, sequences). On general source code, brotli's pre-built dictionary gives it an edge.

| Category | Note |
| --- | --- |
| Numeric sequences | mzip wins 100% (formula compression) |
| Structured logs/CSV | mzip wins or ties (BWT competitive) |
| Small code (<30KB) | brotli wins 10-20% (dictionary) |
| Config files (K8s, TF) | brotli wins 15-20% (domain keywords) |

Key insight: closing the gap on small source files would require mzip to ship its own pre-built dictionary, as brotli does.


Compression Strategies

mzip automatically detects the best strategy for your data:

| Strategy | What it does | Best for | Example ratio |
| --- | --- | --- | --- |
| LINEAR_GEN | Stores v[i] = a + b×i formula | Sequential IDs, timestamps, counters | 32768x |
| NUMERIC | Delta/strided encoding | Audio PCM, sensor data, floats | 485x |
| COLUMNAR | Separates fixed-width columns | Access logs, nginx logs, CSV | 12x |
| SECTION_TEMPLATE | Extracts multi-line template + variables | Repeated code blocks with IDs | 100x |
| BWT_TEXT | Burrows-Wheeler Transform | General text, source code | 20x |
| RAW | Falls back to zstd-19 | Random/encrypted data | 1x |
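
To make the template idea concrete, here is a deliberately simplified sketch. It is not mzip's SECTION_TEMPLATE code: it handles one record at a time rather than multi-line sections, it treats only runs of digits as variables, and `kSlot`, `split_record`, and `render` are hypothetical names. The placeholder byte is assumed not to occur in the input.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr char kSlot = '\x01';  // hypothetical placeholder byte, assumed absent from input

// Splits a record into a template (digit runs replaced by kSlot) and the digit runs.
std::pair<std::string, std::vector<std::string>> split_record(const std::string& rec) {
    std::string tmpl;
    std::vector<std::string> vars;
    for (std::size_t i = 0; i < rec.size();) {
        if (std::isdigit(static_cast<unsigned char>(rec[i]))) {
            std::size_t j = i;
            while (j < rec.size() && std::isdigit(static_cast<unsigned char>(rec[j]))) ++j;
            vars.push_back(rec.substr(i, j - i));  // the variable part
            tmpl.push_back(kSlot);                 // the slot it came from
            i = j;
        } else {
            tmpl.push_back(rec[i++]);
        }
    }
    return {tmpl, vars};
}

// Rebuilds a record by substituting the variables back into the template.
std::string render(const std::string& tmpl, const std::vector<std::string>& vars) {
    std::string out;
    std::size_t next = 0;
    for (char c : tmpl) {
        if (c == kSlot) {
            assert(next < vars.size());  // one variable per slot
            out += vars[next++];
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

When many records yield the same template string, the template needs to be stored only once and each record shrinks to its variable list. The records `{"id": 17, "ok": true}` and `{"id": 204, "ok": true}` share one template and differ only in the variables `17` and `204`.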

Quick Start

Option 1: Single-Header (Recommended)

// In ONE .cpp file:
#include <zstd.h>
#define MZIP_IMPLEMENTATION
#include "mzip_amalgamated.hpp"

// In other files:
#include "mzip_amalgamated.hpp"

// Usage
auto compressed = mzip::compress(data.data(), data.size());
auto decompressed = mzip::decompress(compressed.data(), compressed.size());

Option 2: Separate Headers

#include <cstdint>     // uint8_t
#include <vector>      // std::vector
#include <zstd.h>      // Required: include zstd first
#include "mzip.hpp"

// Compress
std::vector<uint8_t> data = /* your data */;
auto compressed = mzip::compress(data.data(), data.size());

// Decompress
auto decompressed = mzip::decompress(compressed.data(), compressed.size());

Build

Requires C++17 and zstd:

# Single-header (no libsais.c needed - it's bundled)
g++ -O3 -march=native -I/path/to/zstd/include \
    -L/path/to/zstd/lib -o mzip_cli mzip_cli.cpp -lzstd

# Separate headers
g++ -O3 -march=native -I/path/to/zstd/include \
    -L/path/to/zstd/lib -o mzip_cli mzip_cli.cpp libsais.c -lzstd

CLI Usage

# Compress
./mzip_cli compress input.bin output.mzip

# Decompress
./mzip_cli decompress output.mzip restored.bin

Run Benchmarks

# Build benchmark tool
g++ -O3 -march=native -I./zstd/include -L./zstd/lib \
    -o mzip_bench mzip_bench.cpp libsais.c -lzstd

# Run all benchmarks (46 types × 3 sizes)
./mzip_bench

# Quick test (64KB only)
./mzip_bench --quick

# Test specific type
./mzip_bench --type graphql

Files

| File | Description |
| --- | --- |
| mzip.hpp | Main library (include this) |
| bwt_compress_*.hpp | BWT implementations |
| generators.hpp | Test data generators |
| libsais.c/h | BWT suffix array (Apache 2.0) |
| mzip_bench.cpp | Benchmark tool |
| mzip_cli.cpp | Command-line interface |
| samples/ | Sample files at 64KB/256KB/1MB |

License

Dual Licensed: AGPL-3.0 OR Commercial

  • AGPL-3.0: Free for open source. Service deployment requires source release.
  • Commercial: Contact for proprietary use.

Third-party: libsais (Apache 2.0), stb_image (Public Domain), zstd (BSD, external)
