Deepdraw: genetic circuit design with genomic foundation model

Deepdraw is an active learning algorithm for genetic circuit design. It uses genomic foundation model (GFM) embeddings to make accurate predictions from very few experimental observations. At each iteration, Deepdraw combines measurements from previous rounds with sequence-level circuit embeddings and proposes informative candidate designs in practical batches of 12, a 100-fold reduction in batch size relative to prior active learning approaches for circuit design.

This README is for users who want to apply Deepdraw to their own design pool. Retrospective benchmarking, Hydra sweeps, and Slurm array jobs are documented separately in job_sub/README.md.

Quick Start

1. Install

Prerequisites: Git, Python, and uv.

git clone https://github.com/cellethology/deepdraw.git
cd deepdraw

uv sync --python 3.10
uv run deepdraw --help

Plain uv sync installs the user-facing Deepdraw runtime. If you are developing the package and want test, lint, notebook, and plotting tools, include the dev group:

uv sync --python 3.10 --group dev

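Optional retrospective analysis and cluster-job tooling can be added only when needed:

uv sync --extra analysis
uv sync --extra cluster

uv sync makes the environment match exactly what you request, so each command above replaces the previous selection of extras. To keep both installed at once, pass both flags in a single command:

uv sync --extra analysis --extra cluster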

If you are testing the current development branch before it is merged:

git checkout codex/deepdraw-user-workflow

2. Run The Dummy Example

The fastest way to verify the workflow is to run the bundled dummy example. It includes a 60-sequence design pool, a matching embeddings file, and fake first-round measurements.

uv run deepdraw init \
  --pool-csv examples/deepdraw_dummy/design_pool.csv \
  --embeddings examples/deepdraw_dummy/embeddings.npz \
  --sequence-column sequence \
  --id-column variant_id

Deepdraw writes the first batch to:

deepdraw_run/round_000_to_measure.csv

Now simulate receiving measurements from the first experimental round. In a real project, this should be a cumulative file, for example measurements.csv, that you keep appending to after each round. The dummy example provides its starter version as measurements.csv.

uv run deepdraw suggest \
  --run-dir deepdraw_run \
  --measurements examples/deepdraw_dummy/measurements.csv \
  --label-column Expression

Deepdraw trains on all rows in the measurement file and writes the next batch to:

deepdraw_run/round_001_to_measure.csv

The dummy example uses the same defaults as a real run: first batch size 12, later batch size 12, seed 0, ProbCover-euclidean initial selection, BoTorch GP prediction, and MES acquisition. See Useful Flags for ways to change output location, batch size, model, acquisition strategy, and preprocessing.

Use Deepdraw On Your Own Project

Deepdraw expects you to start with an unlabeled design pool. You do not need any experimental measurements for the first round. The full workflow is:

| Step | What you provide or run                                    | What Deepdraw writes               |
|------|------------------------------------------------------------|------------------------------------|
| 1    | Input design pool: designs.csv                             | -                                  |
| 2    | Run deepdraw embed-alphagenome                             | embeddings_alphagenome_raw.npz     |
| 3    | Run deepdraw pca                                           | embeddings_alphagenome_kneedle.npz |
| 4    | Run deepdraw init                                          | round_000_to_measure.csv           |
| 5    | Measure round_000_to_measure.csv designs and append labels | measurements.csv                   |
| 6    | Run deepdraw suggest with measurements.csv                 | round_001_to_measure.csv           |
| 7    | Repeat measurement and deepdraw suggest rounds             | round_002_to_measure.csv, ...      |

All workflow code needed for this path lives in this repository. Deepdraw provides the AlphaGenome embedding command, PCA command, and active-learning commands. The AlphaGenome runtime is kept in envs/alphagenome so its JAX/Hugging Face dependencies do not affect the normal Deepdraw environment. Model weights are still downloaded or read from your Hugging Face cache according to AlphaGenome's access terms.

1. Prepare A Design Pool

Create a CSV with one row per candidate design. Include a sequence column and, preferably, a stable design ID column.

variant_id,sequence
variant_001,ATGCGTACGTTAGCGA
variant_002,ATGCGTACGATAGCAA
variant_003,ATGCGTACGCTAGCTA

The stable ID column is recommended because it makes measurement files easier to merge across rounds.
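Before generating embeddings, it can help to check the pool for the problems that make later merging painful. The sketch below is not part of Deepdraw; the function name check_design_pool and the choice to accept only A, C, G, and T are assumptions made for illustration, and Deepdraw's own validation may differ.

```python
import csv
import io

VALID_BASES = set("ACGT")  # assumption: plain DNA alphabet, no IUPAC ambiguity codes


def check_design_pool(rows, id_column="variant_id", sequence_column="sequence"):
    """Return a list of human-readable problems found in design-pool rows."""
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(rows, start=2):  # line 1 is the CSV header
        design_id = (row.get(id_column) or "").strip()
        sequence = (row.get(sequence_column) or "").strip().upper()
        if not design_id:
            problems.append(f"line {line_no}: missing {id_column}")
        elif design_id in seen_ids:
            problems.append(f"line {line_no}: duplicate {id_column} {design_id!r}")
        seen_ids.add(design_id)
        if not sequence:
            problems.append(f"line {line_no}: empty {sequence_column}")
        elif set(sequence) - VALID_BASES:
            bad = "".join(sorted(set(sequence) - VALID_BASES))
            problems.append(f"line {line_no}: unexpected characters {bad!r}")
    return problems


# Made-up pool with one duplicated ID and one ambiguous base.
example_csv = """variant_id,sequence
variant_001,ATGCGTACGTTAGCGA
variant_002,ATGCGTACGATAGCAA
variant_002,ATGCGTACGNTAGCTA
"""
problems = check_design_pool(csv.DictReader(io.StringIO(example_csv)))
for problem in problems:
    print(problem)
```

For a real pool, pass csv.DictReader over the opened designs.csv instead of the in-memory example.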

2. Generate GFM Embeddings

Generate one embedding vector per design using your chosen genomic foundation model. Deepdraw includes an AlphaGenome embedding command and a PCA reduction command; you can also provide an NPZ generated by another model.

Required NPZ structure:

{
    "embeddings": np.ndarray,  # shape: (num_designs, embedding_dim)
    "ids": np.ndarray,         # variant IDs or row indices aligned to the pool CSV
}

sample_ids is also accepted instead of ids. If you pass --id-column variant_id, the NPZ IDs should match that CSV column. If you do not pass an ID column, use row indices 0, 1, 2, ....
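If your embeddings come from another model, a file with this structure can be written with NumPy. The following is a minimal sketch, assuming you already hold one vector per design in pool order; the random vectors, the 8-dimensional size, and the file name embeddings_custom.npz are placeholders, not Deepdraw requirements.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins: replace with your model's output and your pool's IDs.
variant_ids = ["variant_001", "variant_002", "variant_003"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(variant_ids), 8)).astype(np.float32)

# Basic checks before saving: one finite row per design, IDs unique.
assert vectors.ndim == 2 and vectors.shape[0] == len(variant_ids)
assert np.isfinite(vectors).all()
assert len(set(variant_ids)) == len(variant_ids)

# Written to a temporary directory here; use your project path in practice.
npz_path = os.path.join(tempfile.mkdtemp(), "embeddings_custom.npz")
np.savez(npz_path, embeddings=vectors, ids=np.array(variant_ids))

# Read the file back to confirm the two expected keys and their alignment.
with np.load(npz_path) as data:
    loaded_shape = data["embeddings"].shape
    loaded_ids = data["ids"].tolist()
print(loaded_shape, loaded_ids)
```

Storing the IDs as a NumPy string array (rather than Python objects) keeps the file loadable without pickle support.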

AlphaGenome has a much heavier JAX/Hugging Face runtime than the normal Deepdraw loop, so keep it in the separate environment bundled under envs/alphagenome:

uv sync --project envs/alphagenome --python 3.11

Recommended Hardware For Embeddings

Embedding is usually the most computationally expensive Deepdraw setup step. For production-size design pools, use a Linux machine with an NVIDIA CUDA GPU; the isolated AlphaGenome environment installs CUDA-enabled JAX on Linux. CPU inference is supported and useful for smoke tests or small pools, but it is slower than GPU inference. Keep --batch-size 1 unless you have already checked memory use on your machine.

On Apple Silicon, this environment uses JAX CPU for AlphaGenome. The M-series GPU is visible through JAX Metal on some machines, but the Metal plugin is still experimental and is not the recommended Deepdraw path for AlphaGenome embedding. For GPU throughput, run the same Deepdraw command on Linux with a CUDA GPU.

Before the first embedding run, accept access to the gated AlphaGenome model on Hugging Face and authenticate the AlphaGenome environment:

uv run --project envs/alphagenome hf auth login

You can also provide a token with HF_TOKEN. Without model access and authentication, Hugging Face will return 401 Unauthorized when Deepdraw tries to download google/alphagenome-all-folds.

Generate AlphaGenome embeddings from the same design pool you will pass to deepdraw init:

uv run --project envs/alphagenome deepdraw embed-alphagenome \
  --pool-csv designs.csv \
  --sequence-column sequence \
  --id-column variant_id \
  --output embeddings_alphagenome_raw.npz \
  --resolution 128 \
  --pooling mean \
  --batch-size 1

Then run kneedle PCA in the regular Deepdraw environment:

uv run deepdraw pca \
  --input embeddings_alphagenome_raw.npz \
  --output embeddings_alphagenome_kneedle.npz \
  --selection-method kneedle

The PCA output, embeddings_alphagenome_kneedle.npz, is the embeddings file you pass to deepdraw init in the next step.
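The README does not describe how the kneedle selection method chooses the number of components. As a rough illustration of the general idea only, the sketch below finds the knee of a cumulative explained-variance curve with a simplified Kneedle-style heuristic (no smoothing or sensitivity threshold). The function name knee_component_count and the example ratios are hypothetical, and the deepdraw pca implementation may differ.

```python
def knee_component_count(explained_variance_ratio):
    """Pick a component count at the knee of the cumulative-variance curve.

    Simplified Kneedle-style heuristic: normalize both axes to [0, 1] and take
    the point where the curve rises farthest above the diagonal.
    """
    n = len(explained_variance_ratio)
    if n < 3:
        return n
    cumulative = []
    total = 0.0
    for ratio in explained_variance_ratio:
        total += ratio
        cumulative.append(total)
    y_min, y_max = cumulative[0], cumulative[-1]
    if y_max == y_min:
        return 1  # the first component already explains everything
    best_index, best_gap = 0, float("-inf")
    for i, value in enumerate(cumulative):
        x_norm = i / (n - 1)
        y_norm = (value - y_min) / (y_max - y_min)
        gap = y_norm - x_norm  # height of the curve above the diagonal
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index + 1  # convert 0-based index to a component count


# Made-up per-component explained-variance ratios.
ratios = [0.50, 0.25, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01]
print(knee_component_count(ratios))  # prints 3
```

Components beyond the knee add little explained variance per dimension, which is why a knee rule tends to give compact embeddings for small-data modeling.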

3. Select The First Batch

Run deepdraw init to choose the first experimental batch from embeddings only.

uv run deepdraw init \
  --pool-csv designs.csv \
  --embeddings embeddings_alphagenome_kneedle.npz \
  --sequence-column sequence \
  --id-column variant_id \
  --output-dir runs/my_deepdraw_run \
  --starting-batch-size 12 \
  --batch-size 12

Outputs:

runs/my_deepdraw_run/
├── deepdraw_state.json
├── latest_recommendations.csv
├── round_000_to_measure.csv
└── selection_history.csv

Send round_000_to_measure.csv to the wet lab. Recommendation CSVs keep the original design-pool columns; selection_history.csv adds only deepdraw_round so you can see which batch each design came from.
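To see at a glance which designs were proposed in which round, the history file can be grouped by deepdraw_round. This is a small sketch with made-up rows; it assumes the round is stored as an integer and that the pool uses variant_id as its ID column.

```python
import csv
import io
from collections import defaultdict


def designs_by_round(rows, id_column="variant_id", round_column="deepdraw_round"):
    """Group design IDs by the round in which they were recommended."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[int(row[round_column])].append(row[id_column])
    return dict(sorted(grouped.items()))


# Made-up rows shaped like selection_history.csv (pool columns + deepdraw_round).
history_csv = """variant_id,sequence,deepdraw_round
variant_001,ATGCGTACGTTAGCGA,0
variant_005,ATGCGTACGAAAGCGA,0
variant_009,ATGCGTACGCAAGTTA,1
"""
history = designs_by_round(csv.DictReader(io.StringIO(history_csv)))
for round_number, ids in history.items():
    print(round_number, ids)
```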

If you generated embeddings with a different model, replace embeddings_alphagenome_kneedle.npz with your own Deepdraw-compatible NPZ.
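For intuition about how a first batch can be chosen from embeddings alone, the sketch below shows greedy ball coverage, the idea behind coverage-based selection such as ProbCover: each design covers its neighbors within a fixed Euclidean radius, and the next pick is the design that covers the most still-uncovered neighbors. This is an illustration, not Deepdraw's probcover_euclidean code; the function name greedy_ball_cover, the radius, and the toy points are assumptions.

```python
import math


def greedy_ball_cover(points, radius, budget):
    """Greedy coverage selection in the spirit of ProbCover.

    Each point "covers" every point within `radius` (Euclidean distance).
    Repeatedly pick the point that covers the most still-uncovered points.
    """
    n = len(points)
    neighbors = [
        {j for j in range(n) if math.dist(points[i], points[j]) <= radius}
        for i in range(n)
    ]
    uncovered = set(range(n))
    selected = []
    for _ in range(min(budget, n)):
        candidates = [i for i in range(n) if i not in selected]
        # Ties are broken toward the lowest index so results are deterministic.
        best = max(candidates, key=lambda i: (len(neighbors[i] & uncovered), -i))
        selected.append(best)
        uncovered -= neighbors[best]
    return selected


# Toy 2-D "embeddings": two tight clusters plus one outlier.
toy_points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0),
]
picked = greedy_ball_cover(toy_points, radius=0.5, budget=3)
print(picked)  # prints [3, 0, 7]
```

With a budget of three, the sketch takes one design from each cluster and then the outlier, so the first batch spans the pool instead of sampling one dense region repeatedly.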

4. Add Measurements

After the first experiment, create one cumulative measurements CSV, such as measurements.csv. The easiest approach is to copy round_000_to_measure.csv and add a measured label column.

variant_id,sequence,Expression
variant_001,ATGCGTACGTTAGCGA,1.42
variant_005,ATGCGTACGAAAGCGA,3.87
variant_009,ATGCGTACGCAAGTTA,5.11

Keep the stable ID column, such as variant_id, in every measurement update. Deepdraw uses that column to map measurements back to the original design pool.

You can include extra measured designs that were not recommended by Deepdraw, as long as they are present in the original design pool. deepdraw suggest trains on every measured design in measurements.csv, then excludes all measured designs from the next recommendation batch.

Keep updating this same file over time. After you measure round_001_to_measure.csv, append those new rows to measurements.csv; after round_002_to_measure.csv, append those rows too. Each deepdraw suggest call expects the measurements file to contain every previously recommended design with a measured label.
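Appending by hand is easy to get wrong once a design is re-measured or pasted twice. The sketch below merges a new round into the cumulative table keyed on the ID column. Keeping the most recent row for a repeated ID is an assumed policy for this example; the README does not say how Deepdraw treats duplicate IDs, so adjust (for instance, average replicates) to fit your lab's convention.

```python
import csv
import io


def merge_measurements(cumulative_rows, new_rows, id_column="variant_id"):
    """Append newly measured rows; a later row replaces an earlier one with the same ID."""
    merged = {}
    for row in list(cumulative_rows) + list(new_rows):
        merged[row[id_column]] = row
    return list(merged.values())


def to_csv_text(rows, fieldnames):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up measurements: the cumulative file so far, plus a newly measured round.
cumulative = [
    {"variant_id": "variant_001", "sequence": "ATGCGTACGTTAGCGA", "Expression": "1.42"},
    {"variant_id": "variant_005", "sequence": "ATGCGTACGAAAGCGA", "Expression": "3.87"},
]
round_001 = [
    {"variant_id": "variant_009", "sequence": "ATGCGTACGCAAGTTA", "Expression": "5.11"},
    {"variant_id": "variant_005", "sequence": "ATGCGTACGAAAGCGA", "Expression": "3.90"},  # re-measured
]
merged = merge_measurements(cumulative, round_001)
print(to_csv_text(merged, ["variant_id", "sequence", "Expression"]))
```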

5. Select The Next Batch

Run deepdraw suggest with the cumulative measurement table:

uv run deepdraw suggest \
  --run-dir runs/my_deepdraw_run \
  --measurements measurements.csv \
  --label-column Expression

This writes:

runs/my_deepdraw_run/round_001_to_measure.csv

Measure that batch, append the new labels to the same measurements.csv, and run deepdraw suggest again:

uv run deepdraw suggest \
  --run-dir runs/my_deepdraw_run \
  --measurements measurements.csv \
  --label-column Expression

The next output will be:

runs/my_deepdraw_run/round_002_to_measure.csv

Continue the same measure, append, and deepdraw suggest loop for each later round.
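Because each deepdraw suggest call expects labels for every previously recommended design, a quick check before each call can catch gaps. The helper below is a hypothetical pre-flight check, not a Deepdraw command; it treats blank, non-numeric, and non-finite labels as missing, which is an assumption about what counts as a usable measurement.

```python
import math


def missing_labels(recommended_ids, measurement_rows,
                   id_column="variant_id", label_column="Expression"):
    """Return recommended IDs that have no usable numeric label yet."""
    labeled = set()
    for row in measurement_rows:
        value = (row.get(label_column) or "").strip()
        try:
            number = float(value)
        except ValueError:
            continue  # blank or non-numeric label counts as missing
        if math.isfinite(number):
            labeled.add(row[id_column])
    return [design_id for design_id in recommended_ids if design_id not in labeled]


# Made-up data: three recommended designs, only one with a usable label.
recommended = ["variant_001", "variant_005", "variant_009"]
measurements = [
    {"variant_id": "variant_001", "Expression": "1.42"},
    {"variant_id": "variant_005", "Expression": ""},
]
still_missing = missing_labels(recommended, measurements)
print(still_missing)  # prints ['variant_005', 'variant_009']
```

In practice, recommended would be the IDs read from every round_NNN_to_measure.csv written so far, and measurements the rows of your cumulative measurements.csv.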

Recommended Defaults

For a first real campaign, use the defaults unless you have a reason to compare strategies:

uv run deepdraw init \
  --pool-csv designs.csv \
  --embeddings embeddings_alphagenome_kneedle.npz \
  --sequence-column sequence \
  --id-column variant_id \
  --output-dir runs/my_deepdraw_run \
  --starting-batch-size 12 \
  --batch-size 12

The default workflow uses:

  • initial selection: probcover_euclidean
  • predictor: gp
  • query strategy: mes
  • feature transforms: standardize
  • target transforms: log_standardize

For very small smoke tests, you can use faster settings: random initial selection, the ridge_regressor predictor, the topk query strategy, and no feature or target transforms.

Useful Flags

The examples above use production defaults. You can override pieces of the workflow when you need a different experimental setup or a faster local smoke test.

Change the number of designs per round:

uv run deepdraw init \
  --pool-csv designs.csv \
  --embeddings embeddings_alphagenome_kneedle.npz \
  --sequence-column sequence \
  --id-column variant_id \
  --output-dir runs/my_deepdraw_run \
  --starting-batch-size 24 \
  --batch-size 12

Choose where outputs are written:

uv run deepdraw init \
  --pool-csv designs.csv \
  --embeddings embeddings_alphagenome_kneedle.npz \
  --sequence-column sequence \
  --id-column variant_id \
  --output-dir runs/my_deepdraw_run

If the run directory already exists, deepdraw init stops rather than overwriting deepdraw_state.json. Add --force only when you intentionally want to overwrite an existing run directory.

Use a faster local smoke-test configuration by swapping in lighter model and transform settings:

uv run deepdraw init \
  --pool-csv examples/deepdraw_dummy/design_pool.csv \
  --embeddings examples/deepdraw_dummy/embeddings.npz \
  --sequence-column sequence \
  --id-column variant_id \
  --output-dir /tmp/deepdraw_dummy_fast_run \
  --starting-batch-size 4 \
  --batch-size 3 \
  --seed 11 \
  --initial-selection-strategy random \
  --predictor ridge_regressor \
  --query-strategy topk \
  --feature-transforms none \
  --target-transforms none \
  --force

Common override flags:

  • --starting-batch-size: number of designs in the first batch.
  • --batch-size: number of designs in each later batch.
  • --seed: random seed for reproducible initial selection and stochastic model components.
  • --initial-selection-strategy: first-round strategy, such as probcover_euclidean, core_set, or random.
  • --predictor: model used after measurements arrive, such as gp or ridge_regressor. Existing botorch_* names are also accepted.
  • --query-strategy: acquisition strategy for later rounds, such as mes, qlog_nei, or topk. Existing botorch_* names are also accepted.
  • --feature-transforms: feature preprocessing config, such as standardize or none.
  • --target-transforms: label preprocessing config, such as log_standardize or none.
  • --log-level: progress output verbosity; default is INFO, use WARNING for quieter runs.

CLI Reference

Create a run and select the first batch:

uv run deepdraw init --help

Train on measured labels and select the next batch:

uv run deepdraw suggest --help

Generate AlphaGenome embeddings:

uv run --project envs/alphagenome deepdraw embed-alphagenome --help

Reduce embeddings with PCA/kneedle:

uv run deepdraw pca --help

Required/common arguments:

  • --pool-csv: CSV containing candidate designs.
  • --embeddings: NPZ containing GFM embeddings aligned to the design pool.
  • --sequence-column: sequence column in the design pool CSV.
  • --id-column: optional stable design ID column in the pool CSV.
  • --output-dir: run directory where deepdraw init writes state and recommendations. Defaults to deepdraw_run. Pass the same directory to deepdraw suggest as --run-dir.
  • --measurements: cumulative CSV containing measured labels for all previous Deepdraw recommendations.
  • --label-column: measured target column, such as Expression or Fold Change.

Repository Layout

├── deepdraw/                  # User-facing Deepdraw CLI and workflow
├── envs/alphagenome/          # Optional isolated AlphaGenome embedding runtime
├── examples/deepdraw_dummy/   # Tiny runnable example
├── core/                      # Active learning models, strategies, and trainers
├── job_sub/                   # Retrospective benchmark and Slurm/Hydra tooling
├── test/                      # Unit and workflow tests
└── utils/                     # Supporting utilities

Testing

uv run pytest test/test_deepdraw_workflow.py
uv run pytest

Citation

Citation information will be added with the manuscript/release.
