1.39 million OCR'd documents from the DOJ Jeffrey Epstein document release, with extracted entities, text embeddings, a knowledge graph, and full pipeline provenance.
This is the data behind epstein.academy.
| Layer | Rows | Download Size | Description |
|---|---|---|---|
| `documents` | 1,413,765 | ~800 MB | Full text of every document with metadata |
| `entities` | 8,542,849 | ~200 MB | Extracted people, organizations, locations, dates |
| `chunks` | 2,039,205 | ~1.5 GB | ~800-token text chunks for RAG |
| `embeddings_chunk` | 1,956,803 | ~5 GB | 768-dim Gemini embeddings per chunk |
| `provenance` | 4.9M | ~400 MB | Full pipeline audit trail |
| `persons` | 1,614 | <1 MB | Curated person registry with aliases |
| `kg_entities` | 467 | <1 MB | Knowledge graph entities |
| `kg_relationships` | 4,190 | <1 MB | Knowledge graph relationships |
| `recovered_redactions` | 39,588 | ~3 MB | Recovered text from redacted pages |
Total: ~8 GB compressed (Parquet with zstd). Each layer is independent -- download only what you need.
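If you only need one layer as local Parquet files, it can be pulled on its own with `huggingface_hub`. A minimal sketch, assuming the `data/<layer>/*.parquet` layout used in the DuckDB examples below:

```python
from huggingface_hub import snapshot_download

# Download only the documents layer; the path pattern is an assumption
# based on the hf:// paths shown in the SQL examples below.
local_dir = snapshot_download(
    repo_id="kabasshouse/epstein-data",
    repo_type="dataset",
    allow_patterns=["data/documents/*.parquet"],
)
print("Parquet files downloaded to:", local_dir)
```

Alternatively, the `datasets` library can stream or load each configuration directly: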
```python
from datasets import load_dataset

# Stream documents (no full download needed)
ds = load_dataset("kabasshouse/epstein-data", "documents", split="train", streaming=True)
for doc in ds:
    print(doc["file_key"], doc["dataset"], doc["document_type"])
    print(doc["full_text"][:200])
    break

# Load entities into memory
entities = load_dataset("kabasshouse/epstein-data", "entities", split="train")
print(f"{len(entities):,} entities loaded")

# Filter to a specific dataset
ds10 = load_dataset("kabasshouse/epstein-data", "documents", split="train")
ds10 = ds10.filter(lambda x: x["dataset"] == "DataSet10")
```

The same data can be queried in place with DuckDB:

```sql
-- Query directly from HuggingFace (auto-downloads Parquet)
SELECT file_key, dataset, document_type, date
FROM 'hf://datasets/kabasshouse/epstein-data/data/documents/*.parquet'
WHERE dataset = 'DataSet10' AND document_type = 'Email'
LIMIT 10;
-- Count entities by type
SELECT entity_type, COUNT(*) as cnt
FROM 'hf://datasets/kabasshouse/epstein-data/data/entities/*.parquet'
GROUP BY entity_type
ORDER BY cnt DESC;
```

Individual Parquet shards can also be read directly with pandas:

```python
import pandas as pd

# Read a specific shard
df = pd.read_parquet("hf://datasets/kabasshouse/epstein-data/data/documents/documents-00000.parquet")
print(df.shape)
print(df.columns.tolist())
```

To build a local SQLite database with assemble_db.py, install the dependencies and pick a layer set:

```bash
pip install pyarrow numpy tqdm huggingface_hub

# Core tables only (documents + entities, ~1 GB download)
python assemble_db.py --layers core --output epstein.db
# With text chunks (~2.5 GB download)
python assemble_db.py --layers text --output epstein.db
# Full database with embeddings (~7.5 GB download)
python assemble_db.py --layers full --output epstein.db
# Everything including provenance (~8 GB download)
python assemble_db.py --layers all --output epstein.db
# From a local Parquet export
python assemble_db.py --local ./hf_export/ --layers core --output epstein.db
```

The documents come from 12 DOJ FOIA dataset releases plus two community-sourced collections:
| Dataset | Files | Source |
|---|---|---|
| DataSet 1 | 3,158 | DOJ FOIA |
| DataSet 2 | 574 | DOJ FOIA |
| DataSet 3 | 67 | DOJ FOIA |
| DataSet 4 | 152 | DOJ FOIA |
| DataSet 5 | 120 | DOJ FOIA |
| DataSet 6 | 13 | DOJ FOIA |
| DataSet 7 | 17 | DOJ FOIA |
| DataSet 8 | 10,595 | DOJ FOIA |
| DataSet 9 | 531,279 | DOJ FOIA |
| DataSet 10 | 503,154 | DOJ FOIA |
| DataSet 11 | 331,655 | DOJ FOIA |
| DataSet 12 | 152 | DOJ FOIA |
| FBIVault | 22 | FBI Vault FOIA |
| HouseOversightEstate | 4,892 | House Oversight Committee |
Total: 1,385,850 successful + 472 unrecoverable failures (documented in release/epstein_problems.json).
Two OCR sources were used:

- **Gemini 2.5 Flash Lite** (848,228 files): Primary OCR engine. These documents have `ocr_source = NULL`.
- **Tesseract (community)** (537,622 files): Gap-fill from community repositories. These have `ocr_source = "tesseract-community"`.
See PROVENANCE.md for per-table source documentation.
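To restrict analysis to one OCR source, you can filter on `ocr_source`. A minimal sketch using the streaming API, assuming NULL values surface as Python `None`:

```python
from datasets import load_dataset

# Stream documents and keep only Gemini-OCR'd ones; per the notes above,
# ocr_source is NULL (None) for Gemini and "tesseract-community" otherwise.
docs = load_dataset("kabasshouse/epstein-data", "documents", split="train", streaming=True)
gemini_docs = docs.filter(lambda d: d["ocr_source"] is None)

for doc in gemini_docs:
    print(doc["file_key"], doc["document_type"])
    break
```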
Every document has a unique `file_key` (e.g., `EFTA00000001`) that serves as the primary identifier across all tables. The Parquet files use `file_key` everywhere -- no opaque integer IDs.
Key fields on `documents`:

- `file_key` -- unique identifier (EFTA number)
- `dataset` -- source dataset (e.g., "DataSet10")
- `full_text` -- complete OCR text
- `document_type` -- classified type (Email, Form, Letter, Photo, etc.)
- `date` -- extracted date if available
- `is_photo` -- whether the document is a photograph
- `ocr_source` -- NULL for Gemini, "tesseract-community" for community OCR
See schema.sql for the full SQLite schema used by assemble_db.py.
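Because every layer carries `file_key`, tables can be joined directly without any ID mapping. A minimal sketch using DuckDB from Python; the `hf://` paths mirror the SQL examples above, and only `file_key`, `entity_type`, and `document_type` are taken from the documented schema:

```python
import duckdb

# Count extracted entity types per document type by joining the entities
# and documents layers on the shared file_key identifier.
con = duckdb.connect()
rows = con.sql("""
    SELECT e.entity_type, d.document_type, COUNT(*) AS cnt
    FROM 'hf://datasets/kabasshouse/epstein-data/data/entities/*.parquet'  AS e
    JOIN 'hf://datasets/kabasshouse/epstein-data/data/documents/*.parquet' AS d
      USING (file_key)
    GROUP BY e.entity_type, d.document_type
    ORDER BY cnt DESC
    LIMIT 20
""").fetchall()

for entity_type, document_type, cnt in rows:
    print(entity_type, document_type, cnt)
```

Note that this streams both layers from HuggingFace, so for repeated analysis it is faster to download the Parquet files once and query them locally.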
- 472 source PDFs could not be processed (corrupt, empty, or unavailable). These are cataloged in `release/epstein_problems.json` with DOJ download URLs.
- DataSet 9 (531K files) was entirely community-processed with Tesseract OCR, which has lower quality than Gemini.
- Some documents are heavily redacted. `recovered_redactions` contains ML-recovered text from 39,588 redacted pages.
- Embedding coverage is ~96% for chunks (1,249 malformed embeddings excluded; see the retrieval sketch after this list). Summary embeddings were removed as redundant -- 92% of documents have a single chunk, making summary and chunk embeddings identical.
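To use the chunk embeddings for retrieval, a single `embeddings_chunk` shard can be loaded and searched by cosine similarity. A minimal sketch; the shard name mirrors the `documents-00000.parquet` pattern above and the `embedding` column name is an assumption, so check the schema first:

```python
import numpy as np
import pandas as pd

# One shard of chunk embeddings. The shard filename and the "embedding"
# column name are assumptions -- print df.columns to confirm the schema.
df = pd.read_parquet(
    "hf://datasets/kabasshouse/epstein-data/data/embeddings_chunk/embeddings_chunk-00000.parquet"
)
print(df.columns.tolist())

# Stack the 768-dim vectors and L2-normalize so a dot product equals cosine similarity.
vecs = np.vstack(df["embedding"].to_numpy()).astype(np.float32)
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Use the first chunk as a stand-in query; in practice, embed your query text
# with the same Gemini embedding model used to build the chunk embeddings.
query = vecs[0]
top = np.argsort(-(vecs @ query))[:5]
print(df.iloc[top])
```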
Small reference files are included directly in this repo under release/:
| File | Size | Description |
|---|---|---|
| `epstein_problems.json` | 280 KB | 472 processing failures with DOJ URLs |
| `efta_dataset_mapping.json` | 4 KB | EFTA file key to DOJ URL mapping |
| `persons_registry.json` | 436 KB | 1,614 curated person records |
| `knowledge_graph_entities.json` | 172 KB | 467 KG entities |
| `knowledge_graph_relationships.json` | 932 KB | 4,190 KG relationships |
| `extracted_entities_filtered.json` | 1.9 MB | Filtered entity export |
| `redacted_text_recovered.json.gz` | 2.5 MB | 39,588 recovered redacted pages |
| `document_summary.csv.gz` | 1.8 MB | Document metadata summary |
| `image_catalog.csv.gz` | 15 MB | Photo/image catalog |
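The reference files can be fetched individually with `huggingface_hub`. A minimal sketch, assuming they are published in the HuggingFace dataset repo under the `release/` paths listed above:

```python
import json
from huggingface_hub import hf_hub_download

# Fetch one reference file from the dataset repo; the release/ path is an
# assumption based on the file list above.
path = hf_hub_download(
    repo_id="kabasshouse/epstein-data",
    repo_type="dataset",
    filename="release/knowledge_graph_relationships.json",
)

with open(path) as f:
    relationships = json.load(f)
print(f"{len(relationships):,} knowledge graph relationships loaded")
```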
This dataset is released under CC-BY-4.0. The underlying documents are U.S. government records released under FOIA.
```bibtex
@dataset{epstein_data_2026,
  title={Epstein Document Archive},
  author={Kevin Bass},
  year={2026},
  url={https://huggingface.co/datasets/kabasshouse/epstein-data},
  note={1.39M OCR'd DOJ documents with entities, embeddings, and knowledge graph}
}
```