ML Research Monorepo

A monorepo for machine learning research and original research projects.

Directory Structure

~/research/
├── pyproject.toml              # Dependency management for entire monorepo
├── .gitignore                  # Comprehensive ignore rules
│
├── assets/                     # Centralized storage (gitignored)
│   ├── datasets/               # Raw & processed datasets
│   ├── models/                 # Pretrained models & checkpoints
│   └── outputs/                # Experiment outputs, plots, results
│
├── common/                     # Shared codebase (Python package)
│   ├── models/                 # Reusable model architectures
│   ├── training/               # Training loops, optimizers, schedulers
│   ├── data/                   # Dataset loaders, transforms, augmentation
│   ├── utils/                  # Logging, metrics, checkpointing, tracking
│   └── visualization/          # Plotting, analysis, figure generation
│
├── projects/                   # Research projects
│   └── [project-name]/
│       ├── README.md
│       ├── experiments/
│       ├── notebooks/
│       └── docker/
│
├── tools/                      # Standalone utilities
│   ├── download_hf_dataset.py  # CLI for downloading HF datasets
│   ├── download_hf_model.py    # CLI for downloading HF models
│   ├── benchmark.py            # (to be added)
│   └── visualize_results.py    # (to be added)
│
└── exploratory/                # Exploratory analysis & one-off experiments
    └── [date-or-topic]/
        └── *.ipynb

Current Projects

Project	Description
custom_transformer	Decoder-only transformer with manual backpropagation (no autograd) for educational purposes
embedded_attention	Chunk-based conversational memory system using RAG with DuckDB vector storage
continual_learning	Experiments to build intuition for catastrophic forgetting in transformers
research_agent	Research assistant that reads papers, maintains hypotheses, and synthesizes findings
research_repository_agent	Agent for navigating and managing this ML research monorepo

Setup

1. Install uv (if not already installed)

uv is a fast Python package installer and resolver, written in Rust. It's much faster than pip.

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Verify installation
uv --version

2. Install Dependencies

cd ~/research

# Create virtual environment with uv
uv venv

# Activate virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install core dependencies (much faster than pip!)
uv pip install -e .

# Install with dev tools
uv pip install -e ".[dev]"

# Install with all extras (CV, NLP, dev tools)
uv pip install -e ".[all]"

3. Test Your Environment

Run the environment test script to verify all dependencies are properly installed:

# Make sure venv is activated
source .venv/bin/activate

# Run tests
python tools/test_env.py

This will test:

Python version
Core ML packages (NumPy, Pandas, SciPy, scikit-learn)
PyTorch & CUDA availability
HuggingFace ecosystem
Visualization tools
Experiment tracking
Development tools
Your common package imports

4. Install Pre-commit Hooks (Optional)

source .venv/bin/activate
pre-commit install

Usage

Importing from Common Package

The common/ directory is a Python package that can be imported from anywhere in the monorepo:

# Import models
from common.models import ResNet, Transformer, AttentionLayer

# Import training utilities
from common.training import Trainer, get_optimizer, get_scheduler

# Import data utilities
from common.data import ImageDataset, TextDataset, create_dataloader

# Import utilities
from common.utils import save_checkpoint, load_checkpoint, log_metrics
from common.utils import setup_logger, get_device

# Import visualization
from common.visualization import plot_training_curves, plot_confusion_matrix

Adding a New Research Project

Create project directory:

mkdir -p projects/my_research_project
cd projects/my_research_project

Structure:

my_research_project/
├── README.md              # Project overview & results
├── experiments/
│   ├── baseline.yaml
│   ├── ablation_1.yaml
│   └── final_model.yaml
├── notebooks/
│   └── analysis.ipynb
├── train.py
├── evaluate.py
└── docker/                # If needed
    ├── Dockerfile
    └── docker-compose.yml

Managing Assets

Assets (datasets, models, outputs) are centralized and gitignored.

Downloading from HuggingFace

Use the built-in HuggingFace utilities (see HUGGINGFACE_GUIDE.md for full details):

# Download dataset via CLI
python tools/download_hf_dataset.py --name squad
python tools/download_hf_dataset.py --name wmt14 --split train

# Download model via CLI
python tools/download_hf_model.py --repo-id bert-base-uncased
python tools/download_hf_model.py --repo-id openai/clip-vit-base-patch32

# Or use programmatically in your code

from common.data import download_dataset, download_model

# Download in training scripts
dataset = download_dataset('squad')
model_path = download_model('bert-base-uncased')

Managing Downloaded Assets

# Symlink from a project
cd projects/my_project
ln -s ../../assets/datasets/wmt14 data

# Download from other sources
cd ~/research/assets/models
wget https://download.pytorch.org/models/resnet50-0676ba61.pth

Dataset Preparation

This section describes the full workflow for preparing domain-specific datasets from large corpora like FineWeb. The example below creates automotive and food corpora for continual learning experiments.

1. Build a Domain Index

For large datasets, first create an index to enable fast domain filtering:

# Build index from FineWeb (extracts domain, token counts, file locations)
python common/data/build_domain_index.py

This creates a Parquet index at assets/datasets/.../domain_index.parquet with columns:

etld_plus_one: Domain (e.g., "bmw.com", "allrecipes.com")
token_count: Number of tokens in document
shard_file: Source Arrow file
row_idx: Row index within shard

2. Query the Index to Find Categories

Use SQL queries to explore available domains and estimate corpus sizes:

# Interactive exploration
python common/data/query_domain_index.py

# Example queries:
# - Find automotive domains: WHERE etld_plus_one LIKE '%auto%'
# - Estimate size: SELECT SUM(token_count) FROM idx WHERE ...
# - List top domains: SELECT etld_plus_one, COUNT(*) GROUP BY 1 ORDER BY 2 DESC

3. Extract Domain-Specific Corpora

Once you've identified target domains, extract them to JSONL files:

# Preview what will be extracted
python common/data/extract_corpus.py --preview

# Extract both corpora (default ~125M tokens each, ~500MB)
python common/data/extract_corpus.py

# Extract specific corpus with custom size
python common/data/extract_corpus.py --corpus automotive --target-tokens 50000000

Output structure:

data/
├── corpus_automotive/
│   ├── train.jsonl      # {"text": "..."}
│   ├── val.jsonl
│   └── metadata.json
└── corpus_food/
    ├── train.jsonl
    ├── val.jsonl
    └── metadata.json

4. Analyze Token Distributions

Understand vocabulary usage to choose optimal tokenizer size:

# Analyze a single corpus
python tools/analyze_tokens.py --jsonl data/corpus_automotive/train.jsonl

# Compare multiple corpora
python tools/analyze_tokens.py \
  --jsonl data/corpus_automotive/train.jsonl data/corpus_food/train.jsonl \
  --compare

# Get tokenizer size recommendations
python tools/analyze_tokens.py --jsonl data/corpus_automotive/train.jsonl --recommendations

Key metrics to look for:

Coverage milestones: Tokens needed for 90%/95%/99% coverage
Vocabulary utilization: What fraction of GPT-2's 50k tokens are actually used
Edge cases: Single chars, punctuation, numbers, rare tokens

5. Train a Custom Tokenizer

Train a BPE tokenizer optimized for your corpus:

# Train on single corpus
python tools/analyze_tokens.py \
  --jsonl data/corpus_automotive/train.jsonl \
  --train-tokenizer 16384

# Train on combined corpora (recommended for multi-domain experiments)
python tools/analyze_tokens.py \
  --jsonl data/corpus_automotive/train.jsonl data/corpus_food/train.jsonl \
  --train-tokenizer 32768

Tokenizer saved to: assets/models/tokenizers/combined_bpe_32768/

Choosing vocab size:

Use coverage analysis to find where diminishing returns start (~99% coverage)
Powers of 2 are optimal for GPU efficiency (4096, 8192, 16384, 32768)
Smaller vocab = faster training, larger embeddings proportion

6. Pre-tokenize for Fast Training

Cache tokenized data to eliminate startup overhead:

# Pre-tokenize with your custom tokenizer
python tools/pretokenize_dataset.py \
  --jsonl data/corpus_automotive/train.jsonl \
  --val-jsonl data/corpus_automotive/val.jsonl \
  --tokenizer combined_bpe_32768 \
  --max-length 1024

# Repeat for other corpora
python tools/pretokenize_dataset.py \
  --jsonl data/corpus_food/train.jsonl \
  --val-jsonl data/corpus_food/val.jsonl \
  --tokenizer combined_bpe_32768 \
  --max-length 1024

Cache location: data/corpus_automotive/train_tokenized/combined_bpe_32768_v32768_len1024/

Benefits:

First run: ~70s to tokenize 180k documents
Subsequent runs: ~1s to load from cache
Cache is tokenizer-specific (different tokenizer = different cache)

Complete Example: FineWeb → Training-Ready Data

# 1. Build index (one-time, ~30 min for 100B token sample)
python common/data/build_domain_index.py

# 2. Explore domains
python common/data/query_domain_index.py
# → Identified: automotive (~800M tokens), food (~350M tokens)

# 3. Extract corpora (~500MB each)
python common/data/extract_corpus.py

# 4. Analyze and choose tokenizer size
python tools/analyze_tokens.py \
  --jsonl data/corpus_automotive/train.jsonl data/corpus_food/train.jsonl \
  --compare --recommendations
# → 99% coverage needs ~34k tokens, chose 32768

# 5. Train combined tokenizer
python tools/analyze_tokens.py \
  --jsonl data/corpus_automotive/train.jsonl data/corpus_food/train.jsonl \
  --train-tokenizer 32768

# 6. Pre-tokenize both corpora
for corpus in automotive food; do
  python tools/pretokenize_dataset.py \
    --jsonl data/corpus_${corpus}/train.jsonl \
    --val-jsonl data/corpus_${corpus}/val.jsonl \
    --tokenizer combined_bpe_32768 \
    --max-length 1024
done

# Ready for training!

Running Experiments

# Activate environment
source .venv/bin/activate

# Run from project directory
cd projects/custom_transformer
python train.py --config configs/tinystories.yaml

# Or run from root with module syntax
cd ~/research
python -m projects.custom_transformer.train --config projects/custom_transformer/configs/tinystories.yaml

Using Docker (Per-Project)

Each project can have its own Docker setup:

# projects/my_project/docker/Dockerfile
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime

WORKDIR /research
COPY . .

RUN pip install -e .
CMD ["python", "projects/my_project/train.py"]

# projects/my_project/docker/docker-compose.yml
version: '3.8'
services:
  training:
    build:
      context: ../..
      dockerfile: projects/my_project/docker/Dockerfile
    volumes:
      - ../../assets:/research/assets
      - ../../common:/research/common
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Workflow Examples

Typical Research Workflow

Exploratory phase: Create notebook in exploratory/2025-01-15-idea/
Extract reusable code: Move common functionality to common/
Create project: Set up in projects/my_idea/
Run experiments: Use common/ utilities for training/eval
Analyze results: Use common/visualization for plots

Best Practices

Never commit large files: Use .gitignore, store in assets/
Share code via common/: Don't duplicate utility functions
Use config files: YAML configs for reproducible experiments
Document everything: READMEs for each paper/project
Symlink datasets: Avoid duplicating large files across projects
Version control code only: Models, datasets, outputs are gitignored

Development

Code Quality

# Format code
black .

# Lint code
ruff check .

# Type check
mypy common/

# Run tests (when you add them)
pytest

Adding to Common Package

When you find reusable code in a paper/project:

Move it to appropriate common/ subdirectory
Add to common/[submodule]/__init__.py if needed
Update imports in your paper/project
Document in docstrings

Tips

Import path: Always import from root (e.g., from common.models import ...)
Relative imports: Avoid them; use absolute imports from common/
Config management: Use omegaconf or yaml for experiment configs
Experiment tracking: Use wandb or tensorboard (already in dependencies)
GPU management: Use common/utils for device handling

Next Steps

Add your first paper implementation or project
Build out common/ with reusable utilities as you go
Create tools/ scripts for common tasks (dataset downloads, benchmarking)
Set up experiment tracking (wandb, mlflow, or tensorboard)

Name		Name	Last commit message	Last commit date
Latest commit History 86 Commits
assets		assets
common		common
projects		projects
setups		setups
tests		tests
tools		tools
.gitignore		.gitignore
=0.40.0		=0.40.0
CLAUDE.md		CLAUDE.md
HUGGINGFACE_GUIDE.md		HUGGINGFACE_GUIDE.md
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

ML Research Monorepo

Directory Structure

Current Projects

Setup

1. Install uv (if not already installed)

2. Install Dependencies

3. Test Your Environment

4. Install Pre-commit Hooks (Optional)

Usage

Importing from Common Package

Adding a New Research Project

Managing Assets

Downloading from HuggingFace

Managing Downloaded Assets

Dataset Preparation

1. Build a Domain Index

2. Query the Index to Find Categories

3. Extract Domain-Specific Corpora

4. Analyze Token Distributions

5. Train a Custom Tokenizer

6. Pre-tokenize for Fast Training

Complete Example: FineWeb → Training-Ready Data

Running Experiments

Using Docker (Per-Project)

Workflow Examples

Typical Research Workflow

Best Practices

Development

Code Quality

Adding to Common Package

Tips

Next Steps

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages