Can you beat 5,996,236 bytes?
A compression challenge: encode 1,000,000 GitHub events into the smallest possible binary format.

There are two leaderboards for this challenge: one scored on the training dataset and one scored on a hidden evaluation dataset revealed when the challenge ends.

The training leaderboard below uses the data.json.gz dataset included in the repo. Use it to develop and test your codec.

| Rank | Who | Size (Bytes) |
|---|---|---|
| 1 | natebrennand | 5,996,236 |
| 2 | jakedgy | 6,402,499 |
| 3 | hachikuji | 6,524,516 |
| 4 | XiangpengHao | 6,847,283 |
| 5 | agavra | 7,273,680 |
| 6 | fabinout | 7,283,778 |
| 7 | samsond | 7,564,554 |
| 8 | Zstd(22) | 11,917,798 |
| 9 | Zstd(9) | 17,869,403 |
| – | Naive (baseline) | 210,727,389 |
To prevent overfitting to the training data, a separate evaluation dataset will be announced on March 1st, 2026, when the challenge ends. All submitted codecs will be run against this hidden dataset.
Two winners will be announced:
- Best compression on the training dataset
- Best compression on the evaluation dataset
Submit a PR to claim your spot!
Your codec must:
- Implement the `EventCodec` trait
- Perfectly reconstruct the original data (lossless)
- Beat the Naive codec (210,727,389 bytes)
```bash
git clone https://github.com/agavra/compression-golf
cd compression-golf
gunzip -k data.json.gz   # decompress the dataset
cargo run --release
```

The dataset is distributed as `data.json.gz` to keep the repo size manageable.

To run only your codec:
```bash
cargo run --release -- --codec yourname
```

To test against a different dataset:

```bash
cargo run --release -- path/to/your/data.json
```

To submit your codec:

- Fork this repo
- Create `src/<your-github-username>.rs` implementing `EventCodec`
- Add it to `main.rs` (see Adding Your Codec below)
- Run `cargo run --release` to verify it beats the current best
- Submit a PR with only your single codec file to claim your spot on the leaderboard

Important: Your PR should add only one file: `src/<your-github-username>.rs`. Do not modify other files (except the necessary `main.rs` imports). This keeps submissions clean and easy to review.
Each of the 11,351 events contains:

```rust
pub struct EventKey {
    pub id: String,         // numeric string, e.g., "2489651045"
    pub event_type: String, // 14 unique types (e.g., "PushEvent", "WatchEvent")
}

pub struct EventValue {
    pub repo: Repo,
    pub created_at: String, // ISO 8601, e.g., "2015-01-01T15:00:00Z"
}

pub struct Repo {
    pub id: u64,      // 6,181 unique repos
    pub name: String, // e.g., "owner/repo"
    pub url: String,  // e.g., "https://api.github.com/repos/owner/repo"
}
```

Codecs implement the `EventCodec` trait:

```rust
pub trait EventCodec {
    fn name(&self) -> &str;
    fn encode(&self, events: &[(EventKey, EventValue)]) -> Result<Bytes, Box<dyn Error>>;
    fn decode(&self, bytes: &[u8]) -> Result<Vec<(EventKey, EventValue)>, Box<dyn Error>>;
}
```
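The field comments above already hint at redundancy a codec can exploit: in the sample values, `url` is just a fixed prefix plus `name`, `event_type` takes one of only 14 values, and `id` is a numeric string. A rough, hypothetical sketch of two of these ideas (nothing here is the repo's own code, and a real codec must handle events where the patterns do not hold):

```rust
use crate::{EventKey, EventValue};

/// In the sample data, url is "https://api.github.com/repos/" + name, so a codec can skip
/// storing the url and rebuild it on decode. A real submission should check this per event
/// and fall back to storing the raw url whenever the pattern breaks.
fn url_from_name(name: &str) -> String {
    format!("https://api.github.com/repos/{name}")
}

/// With only 14 distinct event types, each event's type can be replaced by a one-byte index
/// into a small dictionary written once in the header.
fn event_type_dictionary(events: &[(EventKey, EventValue)]) -> Vec<String> {
    let mut types: Vec<String> = events.iter().map(|(key, _)| key.event_type.clone()).collect();
    types.sort();
    types.dedup();
    types // the position in this Vec is the code written per event
}
```

Similar tricks apply to the other fields: `created_at` timestamps can be stored as integers and delta-encoded, and repeated repos can share a single dictionary entry.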

Adding Your Codec:

- Create `src/yourname.rs` (one way to fill in the two `todo!()`s below is sketched after this list):

```rust
use bytes::Bytes;
use std::error::Error;
use crate::codec::EventCodec;
use crate::{EventKey, EventValue};
pub struct YournameCodec;

impl YournameCodec {
    pub fn new() -> Self {
        Self
    }
}

impl EventCodec for YournameCodec {
    fn name(&self) -> &str {
        "yourname"
    }

    fn encode(&self, events: &[(EventKey, EventValue)]) -> Result<Bytes, Box<dyn Error>> {
        todo!()
    }

    fn decode(&self, bytes: &[u8]) -> Result<Vec<(EventKey, EventValue)>, Box<dyn Error>> {
        todo!()
    }
}
```

- Add to `src/main.rs`:

```rust
mod yourname;
use yourname::YournameCodec;
```

- Add your codec to the `codecs` vec in `main()`:

```rust
let codecs: Vec<(Box<dyn EventCodec>, &[(EventKey, EventValue)])> = vec![
    // ... existing codecs ...
    (Box::new(YournameCodec::new()), &sorted_events),
];
```
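If you want a compiling starting point before designing a real format, the two `todo!()`s from step 1 can be filled in with something as plain as a length-prefixed layout. This is only a hedged sketch (not the repo's Naive codec, and nowhere near competitive); the imports are repeated so the snippet is self-contained, and the `Repo` import path is an assumption to adjust to wherever the repo defines the type:

```rust
use bytes::Bytes;
use std::error::Error;
use crate::{EventKey, EventValue, Repo}; // Repo path is assumed here

// Write a u32 length prefix followed by the raw UTF-8 bytes of the string.
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

// Read a length-prefixed string back; panics on truncated input (fine for a sketch).
fn get_str(buf: &[u8], pos: &mut usize) -> Result<String, Box<dyn Error>> {
    let len = u32::from_le_bytes(buf[*pos..*pos + 4].try_into()?) as usize;
    *pos += 4;
    let s = std::str::from_utf8(&buf[*pos..*pos + len])?.to_owned();
    *pos += len;
    Ok(s)
}

fn get_u64(buf: &[u8], pos: &mut usize) -> Result<u64, Box<dyn Error>> {
    let v = u64::from_le_bytes(buf[*pos..*pos + 8].try_into()?);
    *pos += 8;
    Ok(v)
}

// Possible body for EventCodec::encode: event count, then every field in order.
fn encode_events(events: &[(EventKey, EventValue)]) -> Result<Bytes, Box<dyn Error>> {
    let mut out = Vec::new();
    out.extend_from_slice(&(events.len() as u64).to_le_bytes());
    for (key, value) in events {
        put_str(&mut out, &key.id);
        put_str(&mut out, &key.event_type);
        out.extend_from_slice(&value.repo.id.to_le_bytes());
        put_str(&mut out, &value.repo.name);
        put_str(&mut out, &value.repo.url);
        put_str(&mut out, &value.created_at);
    }
    Ok(Bytes::from(out))
}

// Possible body for EventCodec::decode: read the fields back in the same order.
fn decode_events(bytes: &[u8]) -> Result<Vec<(EventKey, EventValue)>, Box<dyn Error>> {
    let mut pos = 0;
    let count = get_u64(bytes, &mut pos)? as usize;
    let mut events = Vec::with_capacity(count);
    for _ in 0..count {
        let id = get_str(bytes, &mut pos)?;
        let event_type = get_str(bytes, &mut pos)?;
        let repo_id = get_u64(bytes, &mut pos)?;
        let name = get_str(bytes, &mut pos)?;
        let url = get_str(bytes, &mut pos)?;
        let created_at = get_str(bytes, &mut pos)?;
        events.push((
            EventKey { id, event_type },
            EventValue { repo: Repo { id: repo_id, name, url }, created_at },
        ));
    }
    Ok(events)
}
```

Your trait methods would simply delegate to `encode_events`/`decode_events`; the real work is replacing this layout with something smaller (field dictionaries, integer timestamps, an entropy coder, and so on).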

Rules:

- Codec must be deterministic
- No external data or pretrained models
- Must compile with stable Rust
- Decode must produce byte-identical output to sorted input (see the round-trip sketch after this list)
- PRs must add a single file: `src/<your-github-username>.rs`
- Submission deadline: March 1st, 2026, when the evaluation dataset is revealed and winners are announced
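
The lossless and determinism rules can be checked locally before opening a PR. A hypothetical test to drop at the bottom of `src/yourname.rs`; it assumes the event structs derive `PartialEq` and `Debug` and that `Repo` is importable like the other types, both assumptions about the repo:

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use crate::Repo; // path assumed

    #[test]
    fn round_trip_is_lossless_and_deterministic() {
        let codec = YournameCodec::new();
        // One hand-written event using the sample values from the struct comments.
        let events = vec![(
            EventKey { id: "2489651045".to_string(), event_type: "PushEvent".to_string() },
            EventValue {
                repo: Repo {
                    id: 12345678,
                    name: "owner/repo".to_string(),
                    url: "https://api.github.com/repos/owner/repo".to_string(),
                },
                created_at: "2015-01-01T15:00:00Z".to_string(),
            },
        )];

        let first = codec.encode(&events).unwrap();
        let second = codec.encode(&events).unwrap();
        assert_eq!(first, second, "encoding must be deterministic");
        assert_eq!(codec.decode(&first).unwrap(), events, "decode must reproduce the input");
    }
}
```

Run it with `cargo test`.
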
Want to test your codec against different data? You can generate your own dataset from GitHub Archive, which provides hourly dumps of all public GitHub events.
GitHub Archive files are available at https://data.gharchive.org/{YYYY-MM-DD-H}.json.gz:
```bash
# Download a single hour
curl -O https://data.gharchive.org/2024-01-15-12.json.gz
# Download a full day (24 files)
for hour in {0..23}; do
curl -O "https://data.gharchive.org/2024-01-15-${hour}.json.gz"
doneThe raw GitHub Archive data contains many fields, but this challenge only uses a subset. Use jq to extract the required fields:
# Extract fields and combine into a single file
gunzip -c 2024-01-15-*.json.gz | jq -c '{
id,
type,
repo: {id: .repo.id, name: .repo.name, url: .repo.url},
created_at
}' > my_data.json
```

```bash
# Take the first N events
head -n 100000 my_data.json > my_data_100k.json
```

Then run your codec against it:

```bash
cargo run --release -- my_data_100k.json
```
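To sanity-check that a generated file has the shape the challenge expects, one option is to parse a line into local mirror structs. This sketch assumes `serde` (with the derive feature) and `serde_json` are added as dependencies; note that the JSON field is `type` while the Rust field is `event_type`:

```rust
use serde::Deserialize;

// Local mirrors of the challenge's event shape, used only for validation.
#[derive(Debug, Deserialize)]
struct RawEvent {
    id: String,
    #[serde(rename = "type")]
    event_type: String,
    repo: RawRepo,
    created_at: String,
}

#[derive(Debug, Deserialize)]
struct RawRepo {
    id: u64,
    name: String,
    url: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse just the first line; each line of the generated file is one JSON event.
    let contents = std::fs::read_to_string("my_data_100k.json")?;
    let first_line = contents.lines().next().ok_or("empty file")?;
    let event: RawEvent = serde_json::from_str(first_line)?;
    println!("parsed: {event:?}");
    Ok(())
}
```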

Resources:

- Strategies for beating the current best (blog post coming soon)
- GitHub Archive – source of the dataset

License: MIT