Can you beat 5,996,236 bytes?
A compression challenge: encode 1,000,000 GitHub events into the smallest possible binary format.

There are two leaderboards for this challenge: one scored on the training dataset and one scored on a hidden evaluation dataset revealed when the challenge ends.

The training leaderboard below uses the data.json.gz dataset included in the repo. Use it to develop and test your codec.

| Rank | Who | Size (Bytes) |
|---|---|---|
| 1 | natebrennand | 5,996,236 |
| 2 | jakedgy | 6,402,499 |
| 3 | hachikuji | 6,524,516 |
| 4 | XiangpengHao | 6,847,283 |
| 5 | agavra | 7,273,680 |
| 6 | fabinout | 7,283,778 |
| 7 | samsond | 7,564,554 |
| 8 | Zstd(22) | 11,917,798 |
| 9 | Zstd(9) | 17,869,403 |
| – | Naive (baseline) | 210,727,389 |
To prevent overfitting to the training data, a separate evaluation dataset will be announced on March 1st, 2026, when the challenge ends. All submitted codecs will be run against this hidden dataset.
Two winners will be announced:
- Best compression on the training dataset
- Best compression on the evaluation dataset
Submit a PR to claim your spot!
Your codec must:
- Implement the `EventCodec` trait
- Perfectly reconstruct the original data (lossless)
- Beat the Naive codec (210,727,389 bytes)
```bash
git clone https://github.com/agavra/compression-golf
cd compression-golf
gunzip -k data.json.gz   # decompress the dataset
cargo run --release
```

The dataset is distributed as `data.json.gz` to keep the repo size manageable.

To run only your codec:
```bash
cargo run --release -- --codec yourname
```

To test against a different dataset:

```bash
cargo run --release -- path/to/your/data.json
```

To submit your codec:

- Fork this repo
- Create `src/<your-github-username>.rs` implementing `EventCodec`
- Add it to `main.rs` (see Adding Your Codec below)
- Run `cargo run --release` to verify it beats the current best
- Submit a PR with only your single codec file to claim your spot on the leaderboard

Important: Your PR should add only one file: `src/<your-github-username>.rs`. Do not modify other files (except the necessary `main.rs` imports). This keeps submissions clean and easy to review.
Each of the 11,351 events contains:

```rust
pub struct EventKey {
    pub id: String,         // numeric string, e.g., "2489651045"
    pub event_type: String, // 14 unique types (e.g., "PushEvent", "WatchEvent")
}

pub struct EventValue {
    pub repo: Repo,
    pub created_at: String, // ISO 8601, e.g., "2015-01-01T15:00:00Z"
}

pub struct Repo {
    pub id: u64,      // 6,181 unique repos
    pub name: String, // e.g., "owner/repo"
    pub url: String,  // e.g., "https://api.github.com/repos/owner/repo"
}
```

Codecs implement the `EventCodec` trait:

```rust
pub trait EventCodec {
    fn name(&self) -> &str;
    fn encode(&self, events: &[(EventKey, EventValue)]) -> Result<Bytes, Box<dyn Error>>;
    fn decode(&self, bytes: &[u8]) -> Result<Vec<(EventKey, EventValue)>, Box<dyn Error>>;
}
```
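The field comments above already hint at redundancy a codec can exploit: in the sample values, `url` is just a fixed prefix plus `name`, `event_type` takes one of only 14 values, and `id` is a numeric string. A rough, hypothetical sketch of two of these ideas (nothing here is the repo's own code, and a real codec must handle events where the patterns do not hold):

```rust
use crate::{EventKey, EventValue};

/// In the sample data, url is "https://api.github.com/repos/" + name, so a codec can skip
/// storing the url and rebuild it on decode. A real submission should check this per event
/// and fall back to storing the raw url whenever the pattern breaks.
fn url_from_name(name: &str) -> String {
    format!("https://api.github.com/repos/{name}")
}

/// With only 14 distinct event types, each event's type can be replaced by a one-byte index
/// into a small dictionary written once in the header.
fn event_type_dictionary(events: &[(EventKey, EventValue)]) -> Vec<String> {
    let mut types: Vec<String> = events.iter().map(|(key, _)| key.event_type.clone()).collect();
    types.sort();
    types.dedup();
    types // the position in this Vec is the code written per event
}
```

Similar tricks apply to the other fields: `created_at` timestamps can be stored as integers and delta-encoded, and repeated repos can share a single dictionary entry.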

Adding Your Codec:

- Create `src/yourname.rs` (one way to fill in the two `todo!()`s below is sketched after this list):

```rust
use bytes::Bytes;
use std::error::Error;
use crate::codec::EventCodec;
use crate::{EventKey, EventValue};
pub struct YournameCodec;

impl YournameCodec {
    pub fn new() -> Self {
        Self
    }
}

impl EventCodec for YournameCodec {
    fn name(&self) -> &str {
        "yourname"
    }

    fn encode(&self, events: &[(EventKey, EventValue)]) -> Result<Bytes, Box<dyn Error>> {
        todo!()
    }

    fn decode(&self, bytes: &[u8]) -> Result<Vec<(EventKey, EventValue)>, Box<dyn Error>> {
        todo!()
    }
}
```

- Add to `src/main.rs`:

```rust
mod yourname;
use yourname::YournameCodec;
```

- Add your codec to the `codecs` vec in `main()`:

```rust
let codecs: Vec<(Box<dyn EventCodec>, &[(EventKey, EventValue)])> = vec![
    // ... existing codecs ...
    (Box::new(YournameCodec::new()), &sorted_events),
];
```
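If you want a compiling starting point before designing a real format, the two `todo!()`s from step 1 can be filled in with something as plain as a length-prefixed layout. This is only a hedged sketch (not the repo's Naive codec, and nowhere near competitive); the imports are repeated so the snippet is self-contained, and the `Repo` import path is an assumption to adjust to wherever the repo defines the type:

```rust
use bytes::Bytes;
use std::error::Error;
use crate::{EventKey, EventValue, Repo}; // Repo path is assumed here

// Write a u32 length prefix followed by the raw UTF-8 bytes of the string.
fn put_str(out: &mut Vec<u8>, s: &str) {
    out.extend_from_slice(&(s.len() as u32).to_le_bytes());
    out.extend_from_slice(s.as_bytes());
}

// Read a length-prefixed string back; panics on truncated input (fine for a sketch).
fn get_str(buf: &[u8], pos: &mut usize) -> Result<String, Box<dyn Error>> {
    let len = u32::from_le_bytes(buf[*pos..*pos + 4].try_into()?) as usize;
    *pos += 4;
    let s = std::str::from_utf8(&buf[*pos..*pos + len])?.to_owned();
    *pos += len;
    Ok(s)
}

fn get_u64(buf: &[u8], pos: &mut usize) -> Result<u64, Box<dyn Error>> {
    let v = u64::from_le_bytes(buf[*pos..*pos + 8].try_into()?);
    *pos += 8;
    Ok(v)
}

// Possible body for EventCodec::encode: event count, then every field in order.
fn encode_events(events: &[(EventKey, EventValue)]) -> Result<Bytes, Box<dyn Error>> {
    let mut out = Vec::new();
    out.extend_from_slice(&(events.len() as u64).to_le_bytes());
    for (key, value) in events {
        put_str(&mut out, &key.id);
        put_str(&mut out, &key.event_type);
        out.extend_from_slice(&value.repo.id.to_le_bytes());
        put_str(&mut out, &value.repo.name);
        put_str(&mut out, &value.repo.url);
        put_str(&mut out, &value.created_at);
    }
    Ok(Bytes::from(out))
}

// Possible body for EventCodec::decode: read the fields back in the same order.
fn decode_events(bytes: &[u8]) -> Result<Vec<(EventKey, EventValue)>, Box<dyn Error>> {
    let mut pos = 0;
    let count = get_u64(bytes, &mut pos)? as usize;
    let mut events = Vec::with_capacity(count);
    for _ in 0..count {
        let id = get_str(bytes, &mut pos)?;
        let event_type = get_str(bytes, &mut pos)?;
        let repo_id = get_u64(bytes, &mut pos)?;
        let name = get_str(bytes, &mut pos)?;
        let url = get_str(bytes, &mut pos)?;
        let created_at = get_str(bytes, &mut pos)?;
        events.push((
            EventKey { id, event_type },
            EventValue { repo: Repo { id: repo_id, name, url }, created_at },
        ));
    }
    Ok(events)
}
```

Your trait methods would simply delegate to `encode_events`/`decode_events`; the real work is replacing this layout with something smaller (field dictionaries, integer timestamps, an entropy coder, and so on).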

Rules:

- Codec must be deterministic
- No external data or pretrained models
- Must compile with stable Rust
- Decode must produce byte-identical output to sorted input (see the round-trip sketch after this list)
- PRs must add a single file: `src/<your-github-username>.rs`
- Submission deadline: March 1st, 2026, when the evaluation dataset is revealed and winners are announced
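
The lossless and determinism rules can be checked locally before opening a PR. A hypothetical test to drop at the bottom of `src/yourname.rs`; it assumes the event structs derive `PartialEq` and `Debug` and that `Repo` is importable like the other types, both assumptions about the repo:

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use crate::Repo; // path assumed

    #[test]
    fn round_trip_is_lossless_and_deterministic() {
        let codec = YournameCodec::new();
        // One hand-written event using the sample values from the struct comments.
        let events = vec![(
            EventKey { id: "2489651045".to_string(), event_type: "PushEvent".to_string() },
            EventValue {
                repo: Repo {
                    id: 12345678,
                    name: "owner/repo".to_string(),
                    url: "https://api.github.com/repos/owner/repo".to_string(),
                },
                created_at: "2015-01-01T15:00:00Z".to_string(),
            },
        )];

        let first = codec.encode(&events).unwrap();
        let second = codec.encode(&events).unwrap();
        assert_eq!(first, second, "encoding must be deterministic");
        assert_eq!(codec.decode(&first).unwrap(), events, "decode must reproduce the input");
    }
}
```

Run it with `cargo test`.
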
Want to test your codec against different data? You can generate your own dataset from GitHub Archive, which provides hourly dumps of all public GitHub events.
GitHub Archive files are available at https://data.gharchive.org/{YYYY-MM-DD-H}.json.gz:
```bash
# Download a single hour
curl -O https://data.gharchive.org/2024-01-15-12.json.gz
# Download a full day (24 files)
for hour in {0..23}; do
curl -O "https://data.gharchive.org/2024-01-15-${hour}.json.gz"
doneThe raw GitHub Archive data contains many fields, but this challenge only uses a subset. Use jq to extract the required fields:
# Extract fields and combine into a single file
gunzip -c 2024-01-15-*.json.gz | jq -c '{
id,
type,
repo: {id: .repo.id, name: .repo.name, url: .repo.url},
created_at
}' > my_data.json
```

```bash
# Take the first N events
head -n 100000 my_data.json > my_data_100k.json
```

Then run your codec against it:

```bash
cargo run --release -- my_data_100k.json
```
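To sanity-check that a generated file has the shape the challenge expects, one option is to parse a line into local mirror structs. This sketch assumes `serde` (with the derive feature) and `serde_json` are added as dependencies; note that the JSON field is `type` while the Rust field is `event_type`:

```rust
use serde::Deserialize;

// Local mirrors of the challenge's event shape, used only for validation.
#[derive(Debug, Deserialize)]
struct RawEvent {
    id: String,
    #[serde(rename = "type")]
    event_type: String,
    repo: RawRepo,
    created_at: String,
}

#[derive(Debug, Deserialize)]
struct RawRepo {
    id: u64,
    name: String,
    url: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse just the first line; each line of the generated file is one JSON event.
    let contents = std::fs::read_to_string("my_data_100k.json")?;
    let first_line = contents.lines().next().ok_or("empty file")?;
    let event: RawEvent = serde_json::from_str(first_line)?;
    println!("parsed: {event:?}");
    Ok(())
}
```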

Resources:

- Strategies for beating the current best (blog post coming soon)
- GitHub Archive – source of the dataset

License: MIT