1.39 million OCR'd documents from the DOJ Jeffrey Epstein document release, with extracted entities, text embeddings, a knowledge graph, and full pipeline provenance.
This is the data behind epstein.academy.
| Layer | Rows | Download Size | Description |
|---|---|---|---|
| `documents` | 1,413,765 | ~800 MB | Full text of every document with metadata |
| `entities` | 8,542,849 | ~200 MB | Extracted people, organizations, locations, dates |
| `chunks` | 2,039,205 | ~1.5 GB | ~800-token text chunks for RAG |
| `embeddings_chunk` | 1,956,803 | ~5 GB | 768-dim Gemini embeddings per chunk |
| `provenance` | 4.9M | ~400 MB | Full pipeline audit trail |
| `persons` | 1,614 | <1 MB | Curated person registry with aliases |
| `kg_entities` | 467 | <1 MB | Knowledge graph entities |
| `kg_relationships` | 4,190 | <1 MB | Knowledge graph relationships |
| `recovered_redactions` | 39,588 | ~3 MB | Recovered text from redacted pages |
Total: ~8 GB compressed (Parquet with zstd). Each layer is independent -- download only what you need.
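If you only need one layer as local Parquet files, it can be pulled on its own with `huggingface_hub`. A minimal sketch, assuming the `data/<layer>/*.parquet` layout used in the DuckDB examples below:

```python
from huggingface_hub import snapshot_download

# Download only the documents layer; the path pattern is an assumption
# based on the hf:// paths shown in the SQL examples below.
local_dir = snapshot_download(
    repo_id="kabasshouse/epstein-data",
    repo_type="dataset",
    allow_patterns=["data/documents/*.parquet"],
)
print("Parquet files downloaded to:", local_dir)
```

Alternatively, the `datasets` library can stream or load each configuration directly: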
```python
from datasets import load_dataset

# Stream documents (no full download needed)
ds = load_dataset("kabasshouse/epstein-data", "documents", split="train", streaming=True)
for doc in ds:
    print(doc["file_key"], doc["dataset"], doc["document_type"])
    print(doc["full_text"][:200])
    break

# Load entities into memory
entities = load_dataset("kabasshouse/epstein-data", "entities", split="train")
print(f"{len(entities):,} entities loaded")

# Filter to a specific dataset
ds10 = load_dataset("kabasshouse/epstein-data", "documents", split="train")
ds10 = ds10.filter(lambda x: x["dataset"] == "DataSet10")
```

The same data can be queried in place with DuckDB:

```sql
-- Query directly from HuggingFace (auto-downloads Parquet)
SELECT file_key, dataset, document_type, date
FROM 'hf://datasets/kabasshouse/epstein-data/data/documents/*.parquet'
WHERE dataset = 'DataSet10' AND document_type = 'Email'
LIMIT 10;
-- Count entities by type
SELECT entity_type, COUNT(*) as cnt
FROM 'hf://datasets/kabasshouse/epstein-data/data/entities/*.parquet'
GROUP BY entity_type
ORDER BY cnt DESC;
```

Individual Parquet shards can also be read directly with pandas:

```python
import pandas as pd

# Read a specific shard
df = pd.read_parquet("hf://datasets/kabasshouse/epstein-data/data/documents/documents-00000.parquet")
print(df.shape)
print(df.columns.tolist())
```

To build a local SQLite database with assemble_db.py, install the dependencies and pick a layer set:

```bash
pip install pyarrow numpy tqdm huggingface_hub

# Core tables only (documents + entities, ~1 GB download)
python assemble_db.py --layers core --output epstein.db
# With text chunks (~2.5 GB download)
python assemble_db.py --layers text --output epstein.db
# Full database with embeddings (~7.5 GB download)
python assemble_db.py --layers full --output epstein.db
# Everything including provenance (~8 GB download)
python assemble_db.py --layers all --output epstein.db
# From a local Parquet export
python assemble_db.py --local ./hf_export/ --layers core --output epstein.db
```

The documents come from 12 DOJ FOIA dataset releases plus two community-sourced collections:
| Dataset | Files | Source |
|---|---|---|
| DataSet 1 | 3,158 | DOJ FOIA |
| DataSet 2 | 574 | DOJ FOIA |
| DataSet 3 | 67 | DOJ FOIA |
| DataSet 4 | 152 | DOJ FOIA |
| DataSet 5 | 120 | DOJ FOIA |
| DataSet 6 | 13 | DOJ FOIA |
| DataSet 7 | 17 | DOJ FOIA |
| DataSet 8 | 10,595 | DOJ FOIA |
| DataSet 9 | 531,279 | DOJ FOIA |
| DataSet 10 | 503,154 | DOJ FOIA |
| DataSet 11 | 331,655 | DOJ FOIA |
| DataSet 12 | 152 | DOJ FOIA |
| FBIVault | 22 | FBI Vault FOIA |
| HouseOversightEstate | 4,892 | House Oversight Committee |
Total: 1,385,850 successful + 472 unrecoverable failures (documented in release/epstein_problems.json).
Two OCR sources were used:

- **Gemini 2.5 Flash Lite** (848,228 files): Primary OCR engine. These documents have `ocr_source = NULL`.
- **Tesseract (community)** (537,622 files): Gap-fill from community repositories. These have `ocr_source = "tesseract-community"`.
See PROVENANCE.md for per-table source documentation.
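To restrict analysis to one OCR source, you can filter on `ocr_source`. A minimal sketch using the streaming API, assuming NULL values surface as Python `None`:

```python
from datasets import load_dataset

# Stream documents and keep only Gemini-OCR'd ones; per the notes above,
# ocr_source is NULL (None) for Gemini and "tesseract-community" otherwise.
docs = load_dataset("kabasshouse/epstein-data", "documents", split="train", streaming=True)
gemini_docs = docs.filter(lambda d: d["ocr_source"] is None)

for doc in gemini_docs:
    print(doc["file_key"], doc["document_type"])
    break
```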
Every document has a unique `file_key` (e.g., `EFTA00000001`) that serves as the primary identifier across all tables. The Parquet files use `file_key` everywhere -- no opaque integer IDs.
Key fields on `documents`:

- `file_key` -- unique identifier (EFTA number)
- `dataset` -- source dataset (e.g., "DataSet10")
- `full_text` -- complete OCR text
- `document_type` -- classified type (Email, Form, Letter, Photo, etc.)
- `date` -- extracted date if available
- `is_photo` -- whether the document is a photograph
- `ocr_source` -- NULL for Gemini, "tesseract-community" for community OCR
See schema.sql for the full SQLite schema used by assemble_db.py.
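Because every layer carries `file_key`, tables can be joined directly without any ID mapping. A minimal sketch using DuckDB from Python; the `hf://` paths mirror the SQL examples above, and only `file_key`, `entity_type`, and `document_type` are taken from the documented schema:

```python
import duckdb

# Count extracted entity types per document type by joining the entities
# and documents layers on the shared file_key identifier.
con = duckdb.connect()
rows = con.sql("""
    SELECT e.entity_type, d.document_type, COUNT(*) AS cnt
    FROM 'hf://datasets/kabasshouse/epstein-data/data/entities/*.parquet'  AS e
    JOIN 'hf://datasets/kabasshouse/epstein-data/data/documents/*.parquet' AS d
      USING (file_key)
    GROUP BY e.entity_type, d.document_type
    ORDER BY cnt DESC
    LIMIT 20
""").fetchall()

for entity_type, document_type, cnt in rows:
    print(entity_type, document_type, cnt)
```

Note that this streams both layers from HuggingFace, so for repeated analysis it is faster to download the Parquet files once and query them locally.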
- 472 source PDFs could not be processed (corrupt, empty, or unavailable). These are cataloged in `release/epstein_problems.json` with DOJ download URLs.
- DataSet 9 (531K files) was entirely community-processed with Tesseract OCR, which has lower quality than Gemini.
- Some documents are heavily redacted. `recovered_redactions` contains ML-recovered text from 39,588 redacted pages.
- Embedding coverage is ~96% for chunks (1,249 malformed embeddings excluded; see the retrieval sketch after this list). Summary embeddings were removed as redundant -- 92% of documents have a single chunk, making summary and chunk embeddings identical.
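To use the chunk embeddings for retrieval, a single `embeddings_chunk` shard can be loaded and searched by cosine similarity. A minimal sketch; the shard name mirrors the `documents-00000.parquet` pattern above and the `embedding` column name is an assumption, so check the schema first:

```python
import numpy as np
import pandas as pd

# One shard of chunk embeddings. The shard filename and the "embedding"
# column name are assumptions -- print df.columns to confirm the schema.
df = pd.read_parquet(
    "hf://datasets/kabasshouse/epstein-data/data/embeddings_chunk/embeddings_chunk-00000.parquet"
)
print(df.columns.tolist())

# Stack the 768-dim vectors and L2-normalize so a dot product equals cosine similarity.
vecs = np.vstack(df["embedding"].to_numpy()).astype(np.float32)
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Use the first chunk as a stand-in query; in practice, embed your query text
# with the same Gemini embedding model used to build the chunk embeddings.
query = vecs[0]
top = np.argsort(-(vecs @ query))[:5]
print(df.iloc[top])
```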
Small reference files are included directly in this repo under release/:
| File | Size | Description |
|---|---|---|
| `epstein_problems.json` | 280 KB | 472 processing failures with DOJ URLs |
| `efta_dataset_mapping.json` | 4 KB | EFTA file key to DOJ URL mapping |
| `persons_registry.json` | 436 KB | 1,614 curated person records |
| `knowledge_graph_entities.json` | 172 KB | 467 KG entities |
| `knowledge_graph_relationships.json` | 932 KB | 4,190 KG relationships |
| `extracted_entities_filtered.json` | 1.9 MB | Filtered entity export |
| `redacted_text_recovered.json.gz` | 2.5 MB | 39,588 recovered redacted pages |
| `document_summary.csv.gz` | 1.8 MB | Document metadata summary |
| `image_catalog.csv.gz` | 15 MB | Photo/image catalog |
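The reference files can be fetched individually with `huggingface_hub`. A minimal sketch, assuming they are published in the HuggingFace dataset repo under the `release/` paths listed above:

```python
import json
from huggingface_hub import hf_hub_download

# Fetch one reference file from the dataset repo; the release/ path is an
# assumption based on the file list above.
path = hf_hub_download(
    repo_id="kabasshouse/epstein-data",
    repo_type="dataset",
    filename="release/knowledge_graph_relationships.json",
)

with open(path) as f:
    relationships = json.load(f)
print(f"{len(relationships):,} knowledge graph relationships loaded")
```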
This dataset is released under CC-BY-4.0. The underlying documents are U.S. government records released under FOIA.
```bibtex
@dataset{epstein_data_2026,
  title={Epstein Document Archive},
  author={Kevin Bass},
  year={2026},
  url={https://huggingface.co/datasets/kabasshouse/epstein-data},
  note={1.39M OCR'd DOJ documents with entities, embeddings, and knowledge graph}
}
```