kevinnbass/epstein-data

# Epstein Document Archive

1.39 million OCR'd documents from the DOJ Jeffrey Epstein document release, with extracted entities, text embeddings, a knowledge graph, and full pipeline provenance.

This is the data behind epstein.academy.

## What's in the dataset

| Layer | Rows | Download size | Description |
|---|---|---|---|
| `documents` | 1,413,765 | ~800 MB | Full text of every document with metadata |
| `entities` | 8,542,849 | ~200 MB | Extracted people, organizations, locations, dates |
| `chunks` | 2,039,205 | ~1.5 GB | ~800-token text chunks for RAG |
| `embeddings_chunk` | 1,956,803 | ~5 GB | 768-dim Gemini embeddings per chunk |
| `provenance` | ~4.9M | ~400 MB | Full pipeline audit trail |
| `persons` | 1,614 | <1 MB | Curated person registry with aliases |
| `kg_entities` | 467 | <1 MB | Knowledge graph entities |
| `kg_relationships` | 4,190 | <1 MB | Knowledge graph relationships |
| `recovered_redactions` | 39,588 | ~3 MB | Recovered text from redacted pages |

Total: ~8 GB compressed (Parquet with zstd). Each layer is independent -- download only what you need.
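The `embeddings_chunk` layer stores plain 768-dim float vectors, so retrieval is ordinary cosine similarity. A minimal sketch with synthetic stand-in vectors (the real ones would be loaded from the `embeddings_chunk` Parquet files; no column names are assumed here):

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Rank chunk embeddings by cosine similarity to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q
    idx = np.argsort(sims)[::-1][:k]  # indices of the k most similar chunks
    return idx, sims[idx]

# Synthetic stand-ins for the 768-dim Gemini vectors
rng = np.random.default_rng(0)
chunks = rng.normal(size=(1000, 768))
query = chunks[42] + 0.01 * rng.normal(size=768)  # near-duplicate of chunk 42

idx, sims = top_k_chunks(query, chunks)
print(idx[0])  # chunk 42 ranks first
```

At this scale (~2M vectors) a brute-force matrix product still works on a single machine, but an ANN index (e.g. FAISS) would be the usual next step.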

## Quick start

### With HuggingFace `datasets`

```python
from datasets import load_dataset

# Stream documents (no full download needed)
ds = load_dataset("kabasshouse/epstein-data", "documents", split="train", streaming=True)
for doc in ds:
    print(doc["file_key"], doc["dataset"], doc["document_type"])
    print(doc["full_text"][:200])
    break

# Load entities into memory
entities = load_dataset("kabasshouse/epstein-data", "entities", split="train")
print(f"{len(entities):,} entities loaded")

# Filter to a specific dataset
ds10 = load_dataset("kabasshouse/epstein-data", "documents", split="train")
ds10 = ds10.filter(lambda x: x["dataset"] == "DataSet10")
```

### With DuckDB (no local download)

```sql
-- Query directly from HuggingFace (auto-downloads Parquet)
SELECT file_key, dataset, document_type, date
FROM 'hf://datasets/kabasshouse/epstein-data/data/documents/*.parquet'
WHERE dataset = 'DataSet10' AND document_type = 'Email'
LIMIT 10;

-- Count entities by type
SELECT entity_type, COUNT(*) AS cnt
FROM 'hf://datasets/kabasshouse/epstein-data/data/entities/*.parquet'
GROUP BY entity_type
ORDER BY cnt DESC;
```

### With pandas

```python
import pandas as pd

# Read a specific shard
df = pd.read_parquet("hf://datasets/kabasshouse/epstein-data/data/documents/documents-00000.parquet")
print(df.shape)
print(df.columns.tolist())
```

### Assemble a local SQLite database

```bash
pip install pyarrow numpy tqdm huggingface_hub

# Core tables only (documents + entities, ~1 GB download)
python assemble_db.py --layers core --output epstein.db

# With text chunks (~2.5 GB download)
python assemble_db.py --layers text --output epstein.db

# Full database with embeddings (~7.5 GB download)
python assemble_db.py --layers full --output epstein.db

# Everything including provenance (~8 GB download)
python assemble_db.py --layers all --output epstein.db

# From a local Parquet export
python assemble_db.py --local ./hf_export/ --layers core --output epstein.db
```

## Source documents

The documents come from 12 DOJ FOIA dataset releases plus two community-sourced collections:

| Dataset | Files | Source |
|---|---|---|
| DataSet 1 | 3,158 | DOJ FOIA |
| DataSet 2 | 574 | DOJ FOIA |
| DataSet 3 | 67 | DOJ FOIA |
| DataSet 4 | 152 | DOJ FOIA |
| DataSet 5 | 120 | DOJ FOIA |
| DataSet 6 | 13 | DOJ FOIA |
| DataSet 7 | 17 | DOJ FOIA |
| DataSet 8 | 10,595 | DOJ FOIA |
| DataSet 9 | 531,279 | DOJ FOIA |
| DataSet 10 | 503,154 | DOJ FOIA |
| DataSet 11 | 331,655 | DOJ FOIA |
| DataSet 12 | 152 | DOJ FOIA |
| FBIVault | 22 | FBI Vault FOIA |
| HouseOversightEstate | 4,892 | House Oversight Committee |

Total: 1,385,850 successfully processed files, plus 472 unrecoverable failures (documented in `release/epstein_problems.json`).

## OCR provenance

Two OCR sources were used:

- **Gemini 2.5 Flash Lite** (848,228 files): primary OCR engine. These documents have `ocr_source = NULL`.
- **Tesseract (community)** (537,622 files): gap-fill from community repositories. These have `ocr_source = "tesseract-community"`.
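Because Gemini rows carry `NULL` rather than a named value, filtering by OCR source means testing for missingness, not equality. A toy pandas sketch of the convention (synthetic rows, not real data):

```python
import pandas as pd

# Toy frame mirroring the ocr_source convention:
# NULL/None = Gemini, "tesseract-community" = community OCR
df = pd.DataFrame({
    "file_key": ["EFTA00000001", "EFTA00000002", "EFTA00000003"],
    "ocr_source": [None, "tesseract-community", None],
})

gemini = df[df["ocr_source"].isna()]            # NULL check, not == "gemini"
tesseract = df[df["ocr_source"] == "tesseract-community"]
print(len(gemini), len(tesseract))  # 2 1
```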

See PROVENANCE.md for per-table source documentation.

## Schema

Every document has a unique file_key (e.g., EFTA00000001) that serves as the primary identifier across all tables. The Parquet files use file_key everywhere -- no opaque integer IDs.

Key fields on documents:

- `file_key` -- unique identifier (EFTA number)
- `dataset` -- source dataset (e.g., `"DataSet10"`)
- `full_text` -- complete OCR text
- `document_type` -- classified type (Email, Form, Letter, Photo, etc.)
- `date` -- extracted date, if available
- `is_photo` -- whether the document is a photograph
- `ocr_source` -- `NULL` for Gemini, `"tesseract-community"` for community OCR

See schema.sql for the full SQLite schema used by assemble_db.py.
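Since `file_key` is the join key everywhere, cross-table lookups in the assembled database are plain equi-joins. A toy in-memory sketch -- the table and column names below are simplified placeholders, not necessarily those in `schema.sql`:

```python
import sqlite3

# In-memory stand-in for an assembled epstein.db with simplified columns
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE documents (file_key TEXT PRIMARY KEY, dataset TEXT,
                        document_type TEXT, full_text TEXT);
CREATE TABLE entities  (file_key TEXT, entity_type TEXT, entity_text TEXT);
""")
con.execute("INSERT INTO documents VALUES ('EFTA00000001','DataSet1','Email','...')")
con.executemany("INSERT INTO entities VALUES (?,?,?)",
                [("EFTA00000001", "PERSON", "J. Doe"),
                 ("EFTA00000001", "LOCATION", "New York")])

# Join entities back to their source document via the shared file_key
rows = con.execute("""
    SELECT d.file_key, d.document_type, e.entity_type, e.entity_text
    FROM documents d JOIN entities e USING (file_key)
""").fetchall()
print(len(rows))  # 2
```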

## Known issues

- 472 source PDFs could not be processed (corrupt, empty, or unavailable). These are cataloged in `release/epstein_problems.json` with DOJ download URLs.
- DataSet 9 (531K files) was entirely community-processed with Tesseract OCR, which is lower quality than Gemini.
- Some documents are heavily redacted. `recovered_redactions` contains ML-recovered text from 39,588 redacted pages.
- Embedding coverage is ~96% of chunks (1,249 malformed embeddings excluded). Summary embeddings were removed as redundant: 92% of documents have a single chunk, making summary and chunk embeddings identical.

## Release artifacts

Small reference files are included directly in this repo under release/:

| File | Size | Description |
|---|---|---|
| `epstein_problems.json` | 280 KB | 472 processing failures with DOJ URLs |
| `efta_dataset_mapping.json` | 4 KB | EFTA file key to DOJ URL mapping |
| `persons_registry.json` | 436 KB | 1,614 curated person records |
| `knowledge_graph_entities.json` | 172 KB | 467 KG entities |
| `knowledge_graph_relationships.json` | 932 KB | 4,190 KG relationships |
| `extracted_entities_filtered.json` | 1.9 MB | Filtered entity export |
| `redacted_text_recovered.json.gz` | 2.5 MB | 39,588 recovered redacted pages |
| `document_summary.csv.gz` | 1.8 MB | Document metadata summary |
| `image_catalog.csv.gz` | 15 MB | Photo/image catalog |

## License

This dataset is released under CC-BY-4.0. The underlying documents are U.S. government records released under FOIA.

## Citation

```bibtex
@dataset{epstein_data_2026,
  title={Epstein Document Archive},
  author={Kevin Bass},
  year={2026},
  url={https://huggingface.co/datasets/kabasshouse/epstein-data},
  note={1.39M OCR'd DOJ documents with entities, embeddings, and knowledge graph}
}
```
