Deepdraw is an active learning algorithm for genetic circuit design. It uses genomic foundation model (GFM) embeddings to make accurate predictions from very few experimental observations. At each iteration, Deepdraw integrates measurements from previous rounds with sequence-level circuit embeddings and proposes informative candidate designs in practical batches of 12, a 100-fold reduction relative to prior active learning approaches for circuit design.
This README is for users who want to apply Deepdraw to their own design pool. Retrospective benchmarking, Hydra sweeps, and Slurm array jobs are documented separately in job_sub/README.md.
Prerequisites: Git, Python, and uv.
git clone https://github.com/cellethology/deepdraw.git
cd deepdraw
uv sync --python 3.10
uv run deepdraw --help
Plain uv sync installs the user-facing Deepdraw runtime. If you are developing the package and want test, lint, notebook, and plotting tools, include the dev group:
uv sync --python 3.10 --group dev
Optional retrospective analysis and cluster-job tooling can be added only when needed:
uv sync --extra analysis
uv sync --extra cluster
If you are testing the current development branch before it is merged:
git checkout codex/deepdraw-user-workflow
The fastest way to verify the workflow is to run the bundled dummy example. It includes a 60-sequence design pool, a matching embeddings file, and fake first-round measurements.
uv run deepdraw init \
--pool-csv examples/deepdraw_dummy/design_pool.csv \
--embeddings examples/deepdraw_dummy/embeddings.npz \
--sequence-column sequence \
--id-column variant_id
Deepdraw writes the first batch to:
deepdraw_run/round_000_to_measure.csv
Now simulate receiving measurements from the first experimental round. In a real project, this should be a cumulative file, for example measurements.csv, that you keep appending to after each round. The dummy example provides its starter version as measurements.csv.
uv run deepdraw suggest \
--run-dir deepdraw_run \
--measurements examples/deepdraw_dummy/measurements.csv \
--label-column Expression
Deepdraw trains on all rows in the measurement file and writes the next batch to:
deepdraw_run/round_001_to_measure.csv
The dummy example uses the same defaults as a real run: first batch size 12, later batch size 12, seed 0, ProbCover-euclidean initial selection, BoTorch GP prediction, and MES acquisition. See Useful Flags for ways to change output location, batch size, model, acquisition strategy, and preprocessing.
Deepdraw expects you to start with an unlabeled design pool. You do not need any experimental measurements for the first round. The full workflow is:
| Step | What you provide or run | What Deepdraw writes |
|---|---|---|
| 1 | Input design pool: designs.csv | - |
| 2 | Run deepdraw embed-alphagenome | embeddings_alphagenome_raw.npz |
| 3 | Run deepdraw pca | embeddings_alphagenome_kneedle.npz |
| 4 | Run deepdraw init | round_000_to_measure.csv |
| 5 | Measure round_000_to_measure.csv designs and append labels | measurements.csv |
| 6 | Run deepdraw suggest with measurements.csv | round_001_to_measure.csv |
| 7 | Repeat measurement and deepdraw suggest rounds | round_002_to_measure.csv, ... |
All workflow code needed for this path lives in this repository. Deepdraw provides the AlphaGenome embedding command, PCA command, and active-learning commands. The AlphaGenome runtime is kept in envs/alphagenome so its JAX/Hugging Face dependencies do not affect the normal Deepdraw environment. Model weights are still downloaded or read from your Hugging Face cache according to AlphaGenome's access terms.
Create a CSV with one row per candidate design. Include a sequence column and, preferably, a stable design ID column.
variant_id,sequence
variant_001,ATGCGTACGTTAGCGA
variant_002,ATGCGTACGATAGCAA
variant_003,ATGCGTACGCTAGCTA
The stable ID column is recommended because it makes measurement files easier to merge across rounds.
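Before embedding a large pool, it can help to sanity-check the CSV. The following is a minimal sketch, not part of Deepdraw: the function name `check_design_pool` is hypothetical, and the allowed alphabet (A, C, G, T, plus N) is an assumption about your sequences rather than a documented Deepdraw requirement.

```python
import csv
from collections import Counter

# Assumption: designs are DNA sequences over A/C/G/T, with N allowed for unknown bases.
ALLOWED_BASES = set("ACGTN")


def check_design_pool(path, id_column="variant_id", sequence_column="sequence"):
    """Return a list of human-readable problems found in a design pool CSV."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames or []
        rows = list(reader)

    # Both columns must exist before row-level checks make sense.
    missing = [c for c in (id_column, sequence_column) if c not in fieldnames]
    if missing:
        return [f"missing column: {c}" for c in missing]

    problems = []
    # Stable IDs are only useful for merging if each one appears once.
    for design_id, count in Counter(row[id_column] for row in rows).items():
        if count > 1:
            problems.append(f"duplicate id: {design_id} ({count} rows)")

    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        sequence = (row[sequence_column] or "").strip().upper()
        if not sequence:
            problems.append(f"line {line_no}: empty sequence")
        elif set(sequence) - ALLOWED_BASES:
            problems.append(f"line {line_no}: unexpected characters in sequence")
    return problems
```

An empty list means none of these checks failed; it does not guarantee that Deepdraw will accept the file.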
Generate one embedding vector per design using your chosen genomic foundation model. Deepdraw includes an AlphaGenome embedding command and a PCA reduction command; you can also provide an NPZ generated by another model.
Required NPZ structure:
{
"embeddings": np.ndarray, # shape: (num_designs, embedding_dim)
"ids": np.ndarray, # variant IDs or row indices aligned to the pool CSV
}
sample_ids is also accepted instead of ids. If you pass --id-column variant_id, the NPZ IDs should match that CSV column. If you do not pass an ID column, use row indices 0, 1, 2, ....
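If your embeddings come from another model, an NPZ with this structure can be written with NumPy. The sketch below is illustrative only: `save_deepdraw_npz` is a hypothetical helper, and the shape and uniqueness checks are assumptions about what a well-formed file should satisfy, not validation that Deepdraw is known to perform.

```python
import numpy as np


def save_deepdraw_npz(path, embeddings, ids):
    """Write embeddings and aligned IDs under the 'embeddings' and 'ids' keys."""
    embeddings = np.asarray(embeddings, dtype=float)
    ids = np.asarray(ids)
    if embeddings.ndim != 2:
        raise ValueError("embeddings must have shape (num_designs, embedding_dim)")
    if ids.shape != (embeddings.shape[0],):
        raise ValueError("ids must contain exactly one entry per embedding row")
    if len(set(ids.tolist())) != len(ids):
        raise ValueError("ids must be unique")
    # Row i of 'embeddings' belongs to ids[i]; keep the two arrays in the same order.
    np.savez(path, embeddings=embeddings, ids=ids)
```

For example, `save_deepdraw_npz("my_embeddings.npz", vectors, variant_ids)` writes one row per design, where `vectors` and `variant_ids` are placeholders for your own model output and the values of the variant_id column.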
AlphaGenome has a much heavier JAX/Hugging Face runtime than the normal Deepdraw loop, so keep it in the separate environment bundled under envs/alphagenome:
uv sync --project envs/alphagenome --python 3.11
Embedding is usually the most computationally expensive Deepdraw setup step. For production-size design pools, use a Linux machine with an NVIDIA CUDA GPU; the isolated AlphaGenome environment installs CUDA-enabled JAX on Linux. CPU inference is supported and useful for smoke tests or small pools, but it will be slower, so keep --batch-size 1 unless you have already checked memory use on your machine.
On Apple Silicon, this environment uses JAX CPU for AlphaGenome. The M-series GPU is visible through JAX Metal on some machines, but the Metal plugin is still experimental and is not the recommended Deepdraw path for AlphaGenome embedding. For GPU throughput, run the same Deepdraw command on Linux with a CUDA GPU.
Before the first embedding run, accept access to the gated AlphaGenome model on Hugging Face and authenticate the AlphaGenome environment:
uv run --project envs/alphagenome hf auth login
You can also provide a token with HF_TOKEN. Without model access and authentication, Hugging Face will return 401 Unauthorized when Deepdraw tries to download google/alphagenome-all-folds.
Generate AlphaGenome embeddings from the same design pool you will pass to deepdraw init:
uv run --project envs/alphagenome deepdraw embed-alphagenome \
--pool-csv designs.csv \
--sequence-column sequence \
--id-column variant_id \
--output embeddings_alphagenome_raw.npz \
--resolution 128 \
--pooling mean \
--batch-size 1
Then run kneedle PCA in the regular Deepdraw environment:
uv run deepdraw pca \
--input embeddings_alphagenome_raw.npz \
--output embeddings_alphagenome_kneedle.npz \
--selection-method kneedle
The PCA output, embeddings_alphagenome_kneedle.npz, is the embeddings file you pass to deepdraw init in the next step.
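For intuition about knee-based component selection, the sketch below picks a component count from a list of per-component explained-variance ratios. It is a simplified illustration of the Kneedle idea (normalize the cumulative curve to the unit square and take the point with the largest gap above the diagonal), without the smoothing and sensitivity parameter of the full algorithm. It is not Deepdraw's implementation, and `knee_component_count` is a hypothetical name.

```python
def knee_component_count(explained_variance_ratio):
    """Pick a component count at the knee of the cumulative variance curve."""
    n = len(explained_variance_ratio)
    if n == 0:
        raise ValueError("need at least one component")
    if n < 3:
        return n  # too few points for a knee to be meaningful

    # Cumulative explained variance after 1, 2, ..., n components.
    cumulative = []
    total = 0.0
    for ratio in explained_variance_ratio:
        total += ratio
        cumulative.append(total)

    y_min, y_max = cumulative[0], cumulative[-1]
    if y_max == y_min:
        return 1  # later components add nothing

    best_index, best_gap = 0, float("-inf")
    for i, value in enumerate(cumulative):
        x_norm = i / (n - 1)                        # position along the x-axis in [0, 1]
        y_norm = (value - y_min) / (y_max - y_min)  # curve height in [0, 1]
        gap = y_norm - x_norm                       # distance above the diagonal
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index + 1  # convert a 0-based index to a component count


# The first two components carry most of the variance, so the knee is at 2.
print(knee_component_count([0.6, 0.25, 0.05, 0.04, 0.03, 0.03]))  # prints 2
```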
Run deepdraw init to choose the first experimental batch from embeddings only.
uv run deepdraw init \
--pool-csv designs.csv \
--embeddings embeddings_alphagenome_kneedle.npz \
--sequence-column sequence \
--id-column variant_id \
--output-dir runs/my_deepdraw_run \
--starting-batch-size 12 \
--batch-size 12
Outputs:
runs/my_deepdraw_run/
├── deepdraw_state.json
├── latest_recommendations.csv
├── round_000_to_measure.csv
└── selection_history.csv
Send round_000_to_measure.csv to the wet lab.
Recommendation CSVs keep the original design-pool columns; selection_history.csv
adds only deepdraw_round so you can see which batch each design came from.
If you generated embeddings with a different model, replace embeddings_alphagenome_kneedle.npz with your own Deepdraw-compatible NPZ.
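When you bring your own NPZ, a quick consistency check against the design pool can catch ID mismatches before deepdraw init runs. This is a sketch under stated assumptions: `check_npz_against_pool` is a hypothetical helper, IDs are compared as strings, and only set membership is checked, since the source does not say whether Deepdraw also requires matching row order.

```python
import csv

import numpy as np


def check_npz_against_pool(npz_path, pool_csv, id_column="variant_id"):
    """Return a list of problems; an empty list means these checks passed."""
    with np.load(npz_path) as data:
        keys = set(data.files)
        if "embeddings" not in keys:
            return ["NPZ has no 'embeddings' array"]
        # Either key name is accepted for the ID array.
        if "ids" in keys:
            id_key = "ids"
        elif "sample_ids" in keys:
            id_key = "sample_ids"
        else:
            return ["NPZ has neither 'ids' nor 'sample_ids'"]
        embeddings = data["embeddings"]
        npz_ids = [str(value) for value in data[id_key].tolist()]

    with open(pool_csv, newline="") as handle:
        pool_ids = [row[id_column] for row in csv.DictReader(handle)]

    problems = []
    if embeddings.ndim != 2:
        problems.append("embeddings is not a 2-D array")
    elif embeddings.shape[0] != len(npz_ids):
        problems.append("number of embedding rows differs from number of IDs")

    missing = sorted(set(pool_ids) - set(npz_ids))
    if missing:
        problems.append(f"pool IDs without an embedding: {missing[:5]}")
    extra = sorted(set(npz_ids) - set(pool_ids))
    if extra:
        problems.append(f"NPZ IDs not in the pool: {extra[:5]}")
    return problems
```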
After the first experiment, create one cumulative measurements CSV, such as measurements.csv. The easiest approach is to copy round_000_to_measure.csv and add a measured label column.
variant_id,sequence,Expression
variant_001,ATGCGTACGTTAGCGA,1.42
variant_005,ATGCGTACGAAAGCGA,3.87
variant_009,ATGCGTACGCAAGTTA,5.11
Keep the stable ID column, such as variant_id, in every measurement update. Deepdraw uses that column to map measurements back to the original design pool.
You can include extra measured designs that were not recommended by Deepdraw, as long as they are present in the original design pool. deepdraw suggest trains on every measured design in measurements.csv, then excludes all measured designs from the next recommendation batch.
Keep updating this same file over time. After you measure round_001_to_measure.csv, append those new rows to measurements.csv; after round_002_to_measure.csv, append those rows too. Each deepdraw suggest call expects the measurements file to contain every previously recommended design with a measured label.
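The append step can be scripted so the cumulative file stays consistent. The sketch below assumes that each round's labeled file has the same columns as measurements.csv, and that a re-measured design should replace its earlier row. That replacement policy is an assumption made for this example, not documented Deepdraw behavior, and `append_round` is a hypothetical helper.

```python
import csv


def append_round(cumulative_path, round_path,
                 id_column="variant_id", label_column="Expression"):
    """Merge one round of labeled rows into the cumulative file; return the row count."""
    def read_rows(path):
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle)
            return list(reader.fieldnames or []), list(reader)

    fieldnames, existing = read_rows(cumulative_path)
    new_fields, new_rows = read_rows(round_path)
    if set(new_fields) != set(fieldnames):
        raise ValueError("round file columns do not match the cumulative file")

    merged = {row[id_column]: row for row in existing}
    for row in new_rows:
        if not (row[label_column] or "").strip():
            raise ValueError(f"missing label for {row[id_column]}")
        # Assumption: a repeated ID means a re-measurement that supersedes the old row.
        merged[row[id_column]] = row

    with open(cumulative_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(merged.values())
    return len(merged)
```

Because the function rewrites measurements.csv in place, keep a backup or version the file if you need an audit trail of earlier labels.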
Run deepdraw suggest with the cumulative measurement table:
uv run deepdraw suggest \
--run-dir runs/my_deepdraw_run \
--measurements measurements.csv \
--label-column Expression
This writes:
runs/my_deepdraw_run/round_001_to_measure.csv
Measure that batch, append the new labels to the same measurements.csv, and run deepdraw suggest again:
uv run deepdraw suggest \
--run-dir runs/my_deepdraw_run \
--measurements measurements.csv \
--label-column Expression
The next output will be:
runs/my_deepdraw_run/round_002_to_measure.csv
Continue the same measure, append, and deepdraw suggest loop for each later round.
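Because each deepdraw suggest call expects labels for every previously recommended design, a pre-flight check can list recommendations that are still unmeasured. This sketch relies only on the round_NNN_to_measure.csv naming shown above. The helper name `unmeasured_recommendations` is hypothetical, and treating a blank label cell as unmeasured is an assumption.

```python
import csv
import glob
import os


def unmeasured_recommendations(run_dir, measurements_path,
                               id_column="variant_id", label_column="Expression"):
    """Return IDs that appear in round_*_to_measure.csv but have no label yet."""
    with open(measurements_path, newline="") as handle:
        labeled = {
            row[id_column]
            for row in csv.DictReader(handle)
            if (row.get(label_column) or "").strip()
        }

    missing = []
    pattern = os.path.join(run_dir, "round_*_to_measure.csv")
    for round_file in sorted(glob.glob(pattern)):  # zero-padded names sort by round
        with open(round_file, newline="") as handle:
            for row in csv.DictReader(handle):
                if row[id_column] not in labeled:
                    missing.append(row[id_column])
    return missing
```

Run it just before the next deepdraw suggest call; an empty list means every design recommended so far has a label in the measurements file.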
For a first real campaign, use the defaults unless you have a reason to compare strategies:
uv run deepdraw init \
--pool-csv designs.csv \
--embeddings embeddings_alphagenome_kneedle.npz \
--sequence-column sequence \
--id-column variant_id \
--output-dir runs/my_deepdraw_run \
--starting-batch-size 12 \
--batch-size 12
The default workflow uses:
- initial selection: probcover_euclidean
- predictor: gp
- query strategy: mes
- feature transforms: standardize
- target transforms: log_standardize
For very small smoke tests, you can use faster settings: random, ridge_regressor, topk, and no transforms.
The examples above use production defaults. You can override pieces of the workflow when you need a different experimental setup or a faster local smoke test.
Change the number of designs per round:
uv run deepdraw init \
--pool-csv designs.csv \
--embeddings embeddings_alphagenome_kneedle.npz \
--sequence-column sequence \
--id-column variant_id \
--output-dir runs/my_deepdraw_run \
--starting-batch-size 24 \
--batch-size 12
Choose where outputs are written:
uv run deepdraw init \
--pool-csv designs.csv \
--embeddings embeddings_alphagenome_kneedle.npz \
--sequence-column sequence \
--id-column variant_id \
--output-dir runs/my_deepdraw_run
If the run directory already exists, deepdraw init stops rather than overwriting deepdraw_state.json. Add --force only when you intentionally want to overwrite an existing run directory.
Use a faster local smoke-test configuration by swapping in lighter model and transform settings:
uv run deepdraw init \
--pool-csv examples/deepdraw_dummy/design_pool.csv \
--embeddings examples/deepdraw_dummy/embeddings.npz \
--sequence-column sequence \
--id-column variant_id \
--output-dir /tmp/deepdraw_dummy_fast_run \
--starting-batch-size 4 \
--batch-size 3 \
--seed 11 \
--initial-selection-strategy random \
--predictor ridge_regressor \
--query-strategy topk \
--feature-transforms none \
--target-transforms none \
--force
Common override flags:
- --starting-batch-size: number of designs in the first batch.
- --batch-size: number of designs in each later batch.
- --seed: random seed for reproducible initial selection and stochastic model components.
- --initial-selection-strategy: first-round strategy, such as probcover_euclidean, core_set, or random.
- --predictor: model used after measurements arrive, such as gp or ridge_regressor. Existing botorch_* names are also accepted.
- --query-strategy: acquisition strategy for later rounds, such as mes, qlog_nei, or topk. Existing botorch_* names are also accepted.
- --feature-transforms: feature preprocessing config, such as standardize or none.
- --target-transforms: label preprocessing config, such as log_standardize or none.
- --log-level: progress output verbosity; default is INFO, use WARNING for quieter runs.
Create a run and select the first batch:
uv run deepdraw init --help
Train on measured labels and select the next batch:
uv run deepdraw suggest --help
Generate AlphaGenome embeddings:
uv run --project envs/alphagenome deepdraw embed-alphagenome --help
Reduce embeddings with PCA/kneedle:
uv run deepdraw pca --help
Required/common arguments:
- --pool-csv: CSV containing candidate designs.
- --embeddings: NPZ containing GFM embeddings aligned to the design pool.
- --sequence-column: sequence column in the design pool CSV.
- --id-column: optional stable design ID column in the pool CSV.
- --output-dir: run directory where Deepdraw writes state and recommendations. Defaults to deepdraw_run.
- --measurements: cumulative CSV containing measured labels for all previous Deepdraw recommendations.
- --label-column: measured target column, such as Expression or Fold Change.
├── deepdraw/ # User-facing Deepdraw CLI and workflow
├── envs/alphagenome/ # Optional isolated AlphaGenome embedding runtime
├── examples/deepdraw_dummy/ # Tiny runnable example
├── core/ # Active learning models, strategies, and trainers
├── job_sub/ # Retrospective benchmark and Slurm/Hydra tooling
├── test/ # Unit and workflow tests
└── utils/ # Supporting utilities
uv run pytest test/test_deepdraw_workflow.py
uv run pytest
Citation information will be added with the manuscript/release.