Merged
110 changes: 78 additions & 32 deletions README.md
@@ -77,7 +77,19 @@ The dummy example uses the same defaults as a real run: first batch size 12, lat

## Use Deepdraw On Your Own Project

-Deepdraw expects you to start with an unlabeled design pool. You do not need any experimental measurements for the first round.
+Deepdraw expects you to start with an unlabeled design pool. You do not need any experimental measurements for the first round. The full workflow is:

| Step | What you provide or run | What Deepdraw writes |
| --- | --- | --- |
| 1 | Input design pool: `designs.csv` | - |
| 2 | Run `deepdraw embed-alphagenome` | `embeddings_alphagenome_raw.npz` |
| 3 | Run `deepdraw pca` | `embeddings_alphagenome_kneedle.npz` |
| 4 | Run `deepdraw init` | `round_000_to_measure.csv` |
| 5 | Measure `round_000_to_measure.csv` designs and append labels | `measurements.csv` |
| 6 | Run `deepdraw suggest` with `measurements.csv` | `round_001_to_measure.csv` |
| 7 | Repeat measurement and `deepdraw suggest` rounds | `round_002_to_measure.csv`, ... |

All workflow code needed for this path lives in this repository. Deepdraw provides the AlphaGenome embedding command, PCA command, and active-learning commands. The AlphaGenome runtime is kept in `envs/alphagenome` so its JAX/Hugging Face dependencies do not affect the normal Deepdraw environment. Model weights are still downloaded or read from your Hugging Face cache according to AlphaGenome's access terms.
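
As a rough sketch, the command-line part of the table (steps 2 to 4) could be assembled from a small Python script. The flags are copied from the commands shown in this README; the helper name `setup_commands` is hypothetical, and the sketch only builds the argument lists rather than running them.

```python
def setup_commands(pool_csv: str, run_dir: str) -> list[list[str]]:
    """Build argv lists for steps 2-4 (hypothetical helper, not part of Deepdraw)."""
    raw = "embeddings_alphagenome_raw.npz"
    reduced = "embeddings_alphagenome_kneedle.npz"
    # Step 2: AlphaGenome embeddings, run inside the isolated environment.
    embed = [
        "uv", "run", "--project", "envs/alphagenome",
        "deepdraw", "embed-alphagenome",
        "--pool-csv", pool_csv,
        "--sequence-column", "sequence",
        "--id-column", "variant_id",
        "--output", raw,
    ]
    # Step 3: PCA reduction in the regular Deepdraw environment.
    pca = [
        "uv", "run", "deepdraw", "pca",
        "--input", raw,
        "--output", reduced,
        "--selection-method", "kneedle",
    ]
    # Step 4: first batch selection from the reduced embeddings.
    init = [
        "uv", "run", "deepdraw", "init",
        "--pool-csv", pool_csv,
        "--embeddings", reduced,
        "--sequence-column", "sequence",
        "--id-column", "variant_id",
        "--output-dir", run_dir,
    ]
    return [embed, pca, init]


if __name__ == "__main__":
    for argv in setup_commands("designs.csv", "runs/my_deepdraw_run"):
        print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` to execute the step; building the lists first makes it easy to confirm that the output of one step is the input of the next.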

### 1. Prepare A Design Pool

@@ -94,7 +106,7 @@ The stable ID column is recommended because it makes measurement files easier to

### 2. Generate GFM Embeddings

-Generate one embedding vector per design using your chosen genomic foundation model. Deepdraw currently expects embeddings to be provided as an NPZ file; embedding generation itself is outside the `deepdraw` CLI.
+Generate one embedding vector per design using your chosen genomic foundation model. Deepdraw includes an AlphaGenome embedding command and a PCA reduction command; you can also provide an NPZ generated by another model.

Required NPZ structure:

@@ -107,14 +119,58 @@ Required NPZ structure:

`sample_ids` is also accepted instead of `ids`. If you pass `--id-column variant_id`, the NPZ IDs should match that CSV column. If you do not pass an ID column, use row indices `0, 1, 2, ...`.
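
Before running `deepdraw init`, it can help to confirm that the NPZ IDs line up with the design pool. The sketch below illustrates the rule above using only the standard library. The helper names `expected_ids` and `missing_ids` are hypothetical, the NPZ IDs are assumed to have already been read into a list (for example with `numpy.load`), and comparing IDs as strings is an assumption rather than documented Deepdraw behaviour.

```python
import csv
import io
from typing import Optional


def expected_ids(pool_csv_text: str, id_column: Optional[str] = None) -> list[str]:
    """IDs the NPZ should contain: the ID column if given, else row indices."""
    rows = list(csv.DictReader(io.StringIO(pool_csv_text)))
    if id_column is None:
        return [str(i) for i in range(len(rows))]
    return [row[id_column] for row in rows]


def missing_ids(expected: list[str], npz_ids: list) -> list[str]:
    """Expected IDs that have no embedding, reported in pool order."""
    present = {str(i) for i in npz_ids}
    return [i for i in expected if i not in present]


pool = "variant_id,sequence\nv1,ACGT\nv2,TTGA\nv3,GGCC\n"
print(missing_ids(expected_ids(pool, "variant_id"), ["v1", "v3"]))  # ['v2']
print(missing_ids(expected_ids(pool), [0, 1, 2]))  # []
```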

AlphaGenome has a much heavier JAX/Hugging Face runtime than the normal Deepdraw loop, so keep it in the separate environment bundled under `envs/alphagenome`:

```bash
uv sync --project envs/alphagenome --python 3.11
```

#### Recommended Hardware For Embeddings

Embedding is usually the most computationally expensive Deepdraw setup step. For production-size design pools, use a Linux machine with an NVIDIA CUDA GPU; the isolated AlphaGenome environment installs CUDA-enabled JAX on Linux. CPU inference is supported and is useful for smoke tests or small pools, but it is slower. On CPU, keep `--batch-size 1` unless you have already checked memory use on your machine.

On Apple Silicon, this environment uses JAX CPU for AlphaGenome. The M-series GPU is visible through JAX Metal on some machines, but the Metal plugin is still experimental and is not the recommended Deepdraw path for AlphaGenome embedding. For GPU throughput, run the same Deepdraw command on Linux with a CUDA GPU.

Before the first embedding run, accept access to the gated AlphaGenome model on Hugging Face and authenticate the AlphaGenome environment:

```bash
uv run --project envs/alphagenome hf auth login
```

You can also provide a token with `HF_TOKEN`. Without model access and authentication, Hugging Face will return `401 Unauthorized` when Deepdraw tries to download `google/alphagenome-all-folds`.

Generate AlphaGenome embeddings from the same design pool you will pass to `deepdraw init`:

```bash
uv run --project envs/alphagenome deepdraw embed-alphagenome \
    --pool-csv designs.csv \
    --sequence-column sequence \
    --id-column variant_id \
    --output embeddings_alphagenome_raw.npz \
    --resolution 128 \
    --pooling mean \
    --batch-size 1
```

Then run kneedle PCA in the regular Deepdraw environment:

```bash
uv run deepdraw pca \
    --input embeddings_alphagenome_raw.npz \
    --output embeddings_alphagenome_kneedle.npz \
    --selection-method kneedle
```

The PCA output, `embeddings_alphagenome_kneedle.npz`, is the embeddings file you pass to `deepdraw init` in the next step.
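
For intuition about what `--selection-method kneedle` is choosing, here is a simplified, standard-library sketch of the knee-finding idea applied to a cumulative explained-variance curve. It is not Deepdraw's implementation: the function name `knee_index` is hypothetical, and the sketch leaves out the smoothing and sensitivity threshold of the full Kneedle algorithm.

```python
def knee_index(cumulative_variance: list[float]) -> int:
    """Return the number of components at the knee of a concave, increasing curve.

    Both axes are rescaled to [0, 1]; the knee is taken as the point that lies
    farthest above the diagonal y = x.
    """
    n = len(cumulative_variance)
    if n < 3:
        return n  # too few points to define a knee
    lo, hi = cumulative_variance[0], cumulative_variance[-1]
    if hi == lo:
        return 1  # flat curve: one component already explains everything kept
    best_k, best_gap = 1, float("-inf")
    for i, value in enumerate(cumulative_variance):
        x = i / (n - 1)
        y = (value - lo) / (hi - lo)
        gap = y - x
        if gap > best_gap:
            best_k, best_gap = i + 1, gap  # component counts are 1-based
    return best_k


curve = [0.50, 0.75, 0.88, 0.93, 0.96, 0.98, 0.99, 1.00]
print(knee_index(curve))  # 3
```

In this toy curve the gain per extra component drops sharply after the third component, which is where the sketch places the knee.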

### 3. Select The First Batch

Run `deepdraw init` to choose the first experimental batch from embeddings only.

```bash
uv run deepdraw init \
    --pool-csv designs.csv \
-   --embeddings embeddings.npz \
+   --embeddings embeddings_alphagenome_kneedle.npz \
    --sequence-column sequence \
    --id-column variant_id \
    --output-dir runs/my_deepdraw_run \
@@ -136,6 +192,8 @@ Send `round_000_to_measure.csv` to the wet lab.
Recommendation CSVs keep the original design-pool columns; `selection_history.csv`
adds only `deepdraw_round` so you can see which batch each design came from.

If you generated embeddings with a different model, replace `embeddings_alphagenome_kneedle.npz` with your own Deepdraw-compatible NPZ.

### 4. Add Measurements

After the first experiment, create one cumulative measurements CSV, such as `measurements.csv`. The easiest approach is to copy `round_000_to_measure.csv` and add a measured label column.
@@ -185,32 +243,7 @@ The next output will be:
runs/my_deepdraw_run/round_002_to_measure.csv
```

The loop is:

```text
design pool + embeddings
|
v
deepdraw init
|
v
measure round_000
|
v
deepdraw suggest
|
v
measure round_001
|
v
append round_001 labels to measurements.csv
|
v
deepdraw suggest
|
v
repeat
```
Continue the same measure, append, and `deepdraw suggest` loop for each later round.
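
The bookkeeping for the append step might look like the sketch below. It assumes the cumulative file has the same columns as each round file plus one label column. The column name `label` and the helper `append_round` are hypothetical, so adapt them to your own files.

```python
import csv
from pathlib import Path


def append_round(
    round_csv: Path,
    labels: dict[str, float],
    measurements_csv: Path,
    id_column: str = "variant_id",
    label_column: str = "label",
) -> int:
    """Append one measured round to the cumulative measurements CSV.

    Returns the number of rows written. A design without a label raises
    KeyError, so a partially measured round is not appended silently.
    """
    with round_csv.open(newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = list(reader.fieldnames or []) + [label_column]
        rows = [{**row, label_column: labels[row[id_column]]} for row in reader]

    # Write the header only when the cumulative file is created.
    is_new = not measurements_csv.exists()
    with measurements_csv.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For example, `append_round(Path("runs/my_deepdraw_run/round_001_to_measure.csv"), {"v3": 0.9}, Path("measurements.csv"))` would add the round 1 labels before the next `deepdraw suggest` call, keeping a single cumulative file across rounds.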

## Recommended Defaults

@@ -219,7 +252,7 @@ For a first real campaign, use the defaults unless you have a reason to compare
```bash
uv run deepdraw init \
    --pool-csv designs.csv \
-   --embeddings embeddings.npz \
+   --embeddings embeddings_alphagenome_kneedle.npz \
    --sequence-column sequence \
    --id-column variant_id \
    --output-dir runs/my_deepdraw_run \
@@ -246,7 +279,7 @@ Change the number of designs per round:
```bash
uv run deepdraw init \
    --pool-csv designs.csv \
-   --embeddings embeddings.npz \
+   --embeddings embeddings_alphagenome_kneedle.npz \
    --sequence-column sequence \
    --id-column variant_id \
    --output-dir runs/my_deepdraw_run \
@@ -259,7 +292,7 @@ Choose where outputs are written:
```bash
uv run deepdraw init \
    --pool-csv designs.csv \
-   --embeddings embeddings.npz \
+   --embeddings embeddings_alphagenome_kneedle.npz \
    --sequence-column sequence \
    --id-column variant_id \
    --output-dir runs/my_deepdraw_run
@@ -313,6 +346,18 @@ Train on measured labels and select the next batch:
uv run deepdraw suggest --help
```

Generate AlphaGenome embeddings:

```bash
uv run --project envs/alphagenome deepdraw embed-alphagenome --help
```

Reduce embeddings with PCA/kneedle:

```bash
uv run deepdraw pca --help
```
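
The `pca` subcommand added in `deepdraw/cli.py` exposes `--exact-n-components` and `--select-components` as two flags that write to one destination, which gives three possible states. The standalone sketch below reproduces only that parsing pattern with `argparse` so the states are easy to see. It reuses the flag names from this PR but is not the Deepdraw parser (`build_demo_parser` is a hypothetical name), and it does not show how `reduce_embeddings_pca` interprets each state.

```python
import argparse


def build_demo_parser() -> argparse.ArgumentParser:
    """Tri-state flag pattern: None (unset), True, or False on one dest."""
    parser = argparse.ArgumentParser(prog="pca-demo")
    parser.add_argument("--n-components", type=int)
    # Set the shared default before adding the flags so "unset" stays None
    # instead of the store_true/store_false defaults.
    parser.set_defaults(exact_n_components=None)
    parser.add_argument(
        "--exact-n-components", dest="exact_n_components", action="store_true"
    )
    parser.add_argument(
        "--select-components", dest="exact_n_components", action="store_false"
    )
    return parser


demo = build_demo_parser()
for argv in ([], ["--exact-n-components"], ["--n-components", "64", "--select-components"]):
    args = demo.parse_args(argv)
    print(argv, "->", args.n_components, args.exact_n_components)
```

Keeping `None` as a distinct state lets the downstream function tell "the user did not choose" apart from an explicit choice of either flag.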

Required/common arguments:

- `--pool-csv`: CSV containing candidate designs.
@@ -327,6 +372,7 @@ Required/common arguments:

```text
├── deepdraw/ # User-facing Deepdraw CLI and workflow
├── envs/alphagenome/ # Optional isolated AlphaGenome embedding runtime
├── examples/deepdraw_dummy/ # Tiny runnable example
├── core/ # Active learning models, strategies, and trainers
├── job_sub/ # Retrospective benchmark and Slurm/Hydra tooling
136 changes: 135 additions & 1 deletion deepdraw/cli.py
@@ -6,6 +6,8 @@
import logging
from pathlib import Path

from deepdraw.embeddings.alphagenome import embed_design_pool, embed_fasta
from deepdraw.embeddings.pca import reduce_embeddings_pca
from deepdraw.workflow import initialize_run, suggest_next_batch

_LOG_LEVEL_CHOICES = ("DEBUG", "INFO", "WARNING", "ERROR")
@@ -88,6 +90,88 @@ def build_parser() -> argparse.ArgumentParser:
    suggest_parser.add_argument("--measurements", required=True, type=Path)
    suggest_parser.add_argument("--label-column")
    suggest_parser.add_argument("--measurement-id-column")

    embed_parser = subparsers.add_parser(
        "embed-alphagenome",
        help="Generate AlphaGenome embeddings for a design pool or FASTA file.",
    )
    _add_log_level_argument(embed_parser)
    input_group = embed_parser.add_mutually_exclusive_group(required=True)
    input_group.add_argument("--pool-csv", type=Path)
    input_group.add_argument("--fasta", type=Path)
    embed_parser.add_argument("--output", required=True, type=Path)
    embed_parser.add_argument("--sequence-column")
    embed_parser.add_argument("--id-column")
    embed_parser.add_argument("--model-version", default="all_folds")
    embed_parser.add_argument("--batch-size", type=int, default=1)
    embed_parser.add_argument("--pooling", choices=["mean", "none"], default="mean")
    embed_parser.add_argument("--resolution", type=int, choices=[1, 128], default=128)
    embed_parser.add_argument(
        "--species",
        choices=["human", "mouse"],
        default="human",
    )
    embed_parser.add_argument(
        "--no-pad-to-multiple",
        action="store_true",
        help="Disable AlphaGenome input padding to multiples of 2048 bp.",
    )
    embed_parser.add_argument(
        "--no-validate",
        action="store_true",
        help="Disable DNA sequence validation before embedding.",
    )
    embed_parser.add_argument(
        "--device",
        choices=["cpu", "gpu", "tpu"],
        help="Force an AlphaGenome runtime device. Defaults to GPU/TPU if available.",
    )

    pca_parser = subparsers.add_parser(
        "pca",
        help="Reduce an embedding NPZ with PCA and kneedle component selection.",
    )
    _add_log_level_argument(pca_parser)
    pca_parser.add_argument("--input", dest="input_file", required=True, type=Path)
    pca_parser.add_argument("--output", required=True, type=Path)
    pca_parser.add_argument(
        "--n-components",
        type=int,
        help=(
            "Number of PCA components to fit. By default this also keeps exactly "
            "that many PCs unless --select-components is passed."
        ),
    )
    pca_parser.set_defaults(exact_n_components=None)
    pca_parser.add_argument(
        "--exact-n-components",
        dest="exact_n_components",
        action="store_true",
        help="Keep exactly --n-components.",
    )
    pca_parser.add_argument(
        "--select-components",
        dest="exact_n_components",
        action="store_false",
        help="Use --selection-method after fitting --n-components as an upper cap.",
    )
    pca_parser.add_argument("--target-variance", type=float, default=0.95)
    pca_parser.add_argument(
        "--selection-method",
        choices=["target-variance", "elbow", "kneedle", "l-method"],
        default="kneedle",
    )
    pca_parser.add_argument(
        "--power-of-two",
        action=argparse.BooleanOptionalAction,
        default=True,
        help="Round target-variance selection to a power of two.",
    )
    pca_parser.add_argument(
        "--use-mean-pooling",
        action="store_true",
        help="Mean-pool 3D or ragged embeddings over sequence positions before PCA.",
    )
    return parser


@@ -102,7 +186,7 @@ def main(argv: list[str] | None = None) -> None:

    try:
        _run_command(args, parser)
-   except (OSError, ValueError) as exc:
+   except (OSError, RuntimeError, ValueError) as exc:
        if log_level == "DEBUG":
            raise
        parser.exit(1, f"Error: {exc}\n")
@@ -142,6 +226,56 @@ def _run_command(args: argparse.Namespace, parser: argparse.ArgumentParser) -> N
        print(f"Wrote Deepdraw round {latest_round}: {round_path}")
        return

    if args.command == "embed-alphagenome":
        if args.pool_csv is not None:
            embed_design_pool(
                pool_csv=args.pool_csv,
                output_path=args.output,
                sequence_column=args.sequence_column,
                id_column=args.id_column,
                model_version=args.model_version,
                batch_size=args.batch_size,
                pooling=args.pooling,
                resolution=args.resolution,
                species=args.species,
                pad_to_multiple=not args.no_pad_to_multiple,
                validate=not args.no_validate,
                device=args.device,
            )
        else:
            embed_fasta(
                fasta_path=args.fasta,
                output_path=args.output,
                model_version=args.model_version,
                batch_size=args.batch_size,
                pooling=args.pooling,
                resolution=args.resolution,
                species=args.species,
                pad_to_multiple=not args.no_pad_to_multiple,
                validate=not args.no_validate,
                device=args.device,
            )
        print(f"Wrote AlphaGenome embeddings: {args.output}")
        return

    if args.command == "pca":
        summary = reduce_embeddings_pca(
            input_file=args.input_file,
            output_file=args.output,
            n_components=args.n_components,
            target_variance=args.target_variance,
            use_mean_pooling=args.use_mean_pooling,
            exact_n_components=args.exact_n_components,
            power_of_two=args.power_of_two,
            selection_method=args.selection_method.replace("-", "_"),
        )
        print(
            f"Wrote PCA embeddings: {args.output} "
            f"({summary.n_components} PCs, "
            f"{summary.cumulative_explained_variance:.2%} variance)"
        )
        return

    parser.error(f"Unknown command {args.command}")


5 changes: 5 additions & 0 deletions deepdraw/embeddings/__init__.py
@@ -0,0 +1,5 @@
"""Embedding preparation helpers for Deepdraw."""

from deepdraw.embeddings.pca import reduce_embeddings_pca

__all__ = ["reduce_embeddings_pca"]