cellethology · neonine2 · May 23, 2026 · May 23, 2026
diff --git a/README.md b/README.md
@@ -77,7 +77,19 @@ The dummy example uses the same defaults as a real run: first batch size 12, lat
 
 ## Use Deepdraw On Your Own Project
 
-Deepdraw expects you to start with an unlabeled design pool. You do not need any experimental measurements for the first round.
+Deepdraw expects you to start with an unlabeled design pool. You do not need any experimental measurements for the first round. The full workflow is:
+
+| Step | What you provide or run | What Deepdraw writes |
+| --- | --- | --- |
+| 1 | Input design pool: `designs.csv` | - |
+| 2 | Run `deepdraw embed-alphagenome` | `embeddings_alphagenome_raw.npz` |
+| 3 | Run `deepdraw pca` | `embeddings_alphagenome_kneedle.npz` |
+| 4 | Run `deepdraw init` | `round_000_to_measure.csv` |
+| 5 | Measure the designs in `round_000_to_measure.csv` and record their labels in `measurements.csv` | - (you create `measurements.csv`) |
+| 6 | Run `deepdraw suggest` with `measurements.csv` | `round_001_to_measure.csv` |
+| 7 | Repeat measurement and `deepdraw suggest` rounds | `round_002_to_measure.csv`, ... |
+
+All workflow code needed for this path lives in this repository. Deepdraw provides the AlphaGenome embedding command, the PCA command, and the active-learning commands. The AlphaGenome runtime is kept in `envs/alphagenome` so its JAX/Hugging Face dependencies do not affect the normal Deepdraw environment. Model weights are still downloaded from Hugging Face, or read from your local Hugging Face cache, subject to AlphaGenome's access terms.
 
 ### 1. Prepare A Design Pool
 
@@ -94,7 +106,7 @@ The stable ID column is recommended because it makes measurement files easier to
 
 ### 2. Generate GFM Embeddings
 
-Generate one embedding vector per design using your chosen genomic foundation model. Deepdraw currently expects embeddings to be provided as an NPZ file; embedding generation itself is outside the `deepdraw` CLI.
+Generate one embedding vector per design using your chosen genomic foundation model. Deepdraw includes an AlphaGenome embedding command and a PCA reduction command; you can also provide an NPZ generated by another model.
 
 Required NPZ structure:
 
@@ -107,14 +119,58 @@ Required NPZ structure:
 
 `sample_ids` is also accepted instead of `ids`. If you pass `--id-column variant_id`, the NPZ IDs should match that CSV column. If you do not pass an ID column, use row indices `0, 1, 2, ...`.
 
+AlphaGenome has a much heavier JAX/Hugging Face runtime than the normal Deepdraw loop, so keep it in the separate environment bundled under `envs/alphagenome`:
+
+```bash
+uv sync --project envs/alphagenome --python 3.11
+```
+
+#### Recommended Hardware For Embeddings
+
+Embedding is usually the most computationally expensive Deepdraw setup step. For production-size design pools, use a Linux machine with an NVIDIA CUDA GPU; the isolated AlphaGenome environment installs CUDA-enabled JAX on Linux. CPU inference is supported and useful for smoke tests or small pools, but it is slower. On CPU, keep `--batch-size 1` unless you have already checked memory use on your machine.
+
+On Apple Silicon, this environment uses JAX CPU for AlphaGenome. The M-series GPU is visible through JAX Metal on some machines, but the Metal plugin is still experimental and is not the recommended Deepdraw path for AlphaGenome embedding. For GPU throughput, run the same Deepdraw command on Linux with a CUDA GPU.
+
+Before the first embedding run, accept access to the gated AlphaGenome model on Hugging Face and authenticate the AlphaGenome environment:
+
+```bash
+uv run --project envs/alphagenome hf auth login
+```
+
+You can also provide a token with `HF_TOKEN`. Without model access and authentication, Hugging Face will return `401 Unauthorized` when Deepdraw tries to download `google/alphagenome-all-folds`.
+
+Generate AlphaGenome embeddings from the same design pool you will pass to `deepdraw init`:
+
+```bash
+uv run --project envs/alphagenome deepdraw embed-alphagenome \
+  --pool-csv designs.csv \
+  --sequence-column sequence \
+  --id-column variant_id \
+  --output embeddings_alphagenome_raw.npz \
+  --resolution 128 \
+  --pooling mean \
+  --batch-size 1
+```
+
+Then run kneedle PCA in the regular Deepdraw environment:
+
+```bash
+uv run deepdraw pca \
+  --input embeddings_alphagenome_raw.npz \
+  --output embeddings_alphagenome_kneedle.npz \
+  --selection-method kneedle
+```
+
+The PCA output, `embeddings_alphagenome_kneedle.npz`, is the embeddings file you pass to `deepdraw init` in the next step.
+
 ### 3. Select The First Batch
 
 Run `deepdraw init` to choose the first experimental batch from embeddings only.
 
 ```bash
 uv run deepdraw init \
   --pool-csv designs.csv \
-  --embeddings embeddings.npz \
+  --embeddings embeddings_alphagenome_kneedle.npz \
   --sequence-column sequence \
   --id-column variant_id \
   --output-dir runs/my_deepdraw_run \
@@ -136,6 +192,8 @@ Send `round_000_to_measure.csv` to the wet lab.
 Recommendation CSVs keep the original design-pool columns; `selection_history.csv`
 adds only `deepdraw_round` so you can see which batch each design came from.
 
+If you generated embeddings with a different model, replace `embeddings_alphagenome_kneedle.npz` with your own Deepdraw-compatible NPZ.
+
 ### 4. Add Measurements
 
 After the first experiment, create one cumulative measurements CSV, such as `measurements.csv`. The easiest approach is to copy `round_000_to_measure.csv` and add a measured label column.
@@ -185,32 +243,7 @@ The next output will be:
 runs/my_deepdraw_run/round_002_to_measure.csv
 ```
 
-The loop is:
-
-```text
-design pool + embeddings
-        |
-        v
-deepdraw init
-        |
-        v
-measure round_000
-        |
-        v
-deepdraw suggest
-        |
-        v
-measure round_001
-        |
-        v
-append round_001 labels to measurements.csv
-        |
-        v
-deepdraw suggest
-        |
-        v
-repeat
-```
+Continue the same measure, append, and `deepdraw suggest` loop for each later round.
 
 ## Recommended Defaults
 
@@ -219,7 +252,7 @@ For a first real campaign, use the defaults unless you have a reason to compare
 ```bash
 uv run deepdraw init \
   --pool-csv designs.csv \
-  --embeddings embeddings.npz \
+  --embeddings embeddings_alphagenome_kneedle.npz \
   --sequence-column sequence \
   --id-column variant_id \
   --output-dir runs/my_deepdraw_run \
@@ -246,7 +279,7 @@ Change the number of designs per round:
 ```bash
 uv run deepdraw init \
   --pool-csv designs.csv \
-  --embeddings embeddings.npz \
+  --embeddings embeddings_alphagenome_kneedle.npz \
   --sequence-column sequence \
   --id-column variant_id \
   --output-dir runs/my_deepdraw_run \
@@ -259,7 +292,7 @@ Choose where outputs are written:
 ```bash
 uv run deepdraw init \
   --pool-csv designs.csv \
-  --embeddings embeddings.npz \
+  --embeddings embeddings_alphagenome_kneedle.npz \
   --sequence-column sequence \
   --id-column variant_id \
   --output-dir runs/my_deepdraw_run
@@ -313,6 +346,18 @@ Train on measured labels and select the next batch:
 uv run deepdraw suggest --help
 ```
 
+Generate AlphaGenome embeddings:
+
+```bash
+uv run --project envs/alphagenome deepdraw embed-alphagenome --help
+```
+
+Reduce embeddings with PCA/kneedle:
+
+```bash
+uv run deepdraw pca --help
+```
+
 Required/common arguments:
 
 - `--pool-csv`: CSV containing candidate designs.
@@ -327,6 +372,7 @@ Required/common arguments:
 
 ```text
 ├── deepdraw/                  # User-facing Deepdraw CLI and workflow
+├── envs/alphagenome/          # Optional isolated AlphaGenome embedding runtime
 ├── examples/deepdraw_dummy/   # Tiny runnable example
 ├── core/                      # Active learning models, strategies, and trainers
 ├── job_sub/                   # Retrospective benchmark and Slurm/Hydra tooling
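
Review note on the README change above: it says the NPZ IDs should match the `--id-column` values, or the row indices `0, 1, 2, ...` when no ID column is passed. A minimal standard-library sketch of that alignment check follows. It assumes IDs are compared as strings and that only membership matters, not order; `expected_ids` and `check_alignment` are hypothetical helper names, not Deepdraw functions, and reading the IDs out of the NPZ is left out.

```python
import csv
import io


def expected_ids(pool_csv_text, id_column=None):
    """Return the IDs the README says the NPZ should carry (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(pool_csv_text)))
    if id_column is None:
        # No ID column: fall back to row indices 0, 1, 2, ...
        return [str(index) for index in range(len(rows))]
    return [row[id_column] for row in rows]


def check_alignment(npz_ids, pool_ids):
    """Return (IDs missing from the NPZ, NPZ IDs absent from the pool)."""
    npz_set = {str(value) for value in npz_ids}
    pool_set = set(pool_ids)
    return sorted(pool_set - npz_set), sorted(npz_set - pool_set)


# Tiny in-memory design pool; a real run would read designs.csv from disk.
POOL = "variant_id,sequence\nv1,ACGT\nv2,TTGA\nv3,GGCC\n"

missing, extra = check_alignment(["v1", "v3", "v9"], expected_ids(POOL, "variant_id"))
print("missing from NPZ:", missing)
print("not in pool:", extra)
```

Running a check like this before `deepdraw init` would surface a mismatched embeddings file early, before any batch is selected.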

diff --git a/deepdraw/cli.py b/deepdraw/cli.py
@@ -6,6 +6,8 @@
 import logging
 from pathlib import Path
 
+from deepdraw.embeddings.alphagenome import embed_design_pool, embed_fasta
+from deepdraw.embeddings.pca import reduce_embeddings_pca
 from deepdraw.workflow import initialize_run, suggest_next_batch
 
 _LOG_LEVEL_CHOICES = ("DEBUG", "INFO", "WARNING", "ERROR")
@@ -88,6 +90,88 @@ def build_parser() -> argparse.ArgumentParser:
     suggest_parser.add_argument("--measurements", required=True, type=Path)
     suggest_parser.add_argument("--label-column")
     suggest_parser.add_argument("--measurement-id-column")
+
+    embed_parser = subparsers.add_parser(
+        "embed-alphagenome",
+        help="Generate AlphaGenome embeddings for a design pool or FASTA file.",
+    )
+    _add_log_level_argument(embed_parser)
+    input_group = embed_parser.add_mutually_exclusive_group(required=True)
+    input_group.add_argument("--pool-csv", type=Path)
+    input_group.add_argument("--fasta", type=Path)
+    embed_parser.add_argument("--output", required=True, type=Path)
+    embed_parser.add_argument("--sequence-column")
+    embed_parser.add_argument("--id-column")
+    embed_parser.add_argument("--model-version", default="all_folds")
+    embed_parser.add_argument("--batch-size", type=int, default=1)
+    embed_parser.add_argument("--pooling", choices=["mean", "none"], default="mean")
+    embed_parser.add_argument("--resolution", type=int, choices=[1, 128], default=128)
+    embed_parser.add_argument(
+        "--species",
+        choices=["human", "mouse"],
+        default="human",
+    )
+    embed_parser.add_argument(
+        "--no-pad-to-multiple",
+        action="store_true",
+        help="Disable AlphaGenome input padding to multiples of 2048 bp.",
+    )
+    embed_parser.add_argument(
+        "--no-validate",
+        action="store_true",
+        help="Disable DNA sequence validation before embedding.",
+    )
+    embed_parser.add_argument(
+        "--device",
+        choices=["cpu", "gpu", "tpu"],
+        help="Force an AlphaGenome runtime device. Defaults to GPU/TPU if available.",
+    )
+
+    pca_parser = subparsers.add_parser(
+        "pca",
+        help="Reduce an embedding NPZ with PCA (kneedle selection by default).",
+    )
+    _add_log_level_argument(pca_parser)
+    pca_parser.add_argument("--input", dest="input_file", required=True, type=Path)
+    pca_parser.add_argument("--output", required=True, type=Path)
+    pca_parser.add_argument(
+        "--n-components",
+        type=int,
+        help=(
+            "Number of PCA components to fit. By default this also keeps exactly "
+            "that many PCs unless --select-components is passed."
+        ),
+    )
+    pca_parser.set_defaults(exact_n_components=None)
+    pca_parser.add_argument(
+        "--exact-n-components",
+        dest="exact_n_components",
+        action="store_true",
+        help="Keep exactly --n-components.",
+    )
+    pca_parser.add_argument(
+        "--select-components",
+        dest="exact_n_components",
+        action="store_false",
+        help="Use --selection-method after fitting --n-components as an upper cap.",
+    )
+    pca_parser.add_argument("--target-variance", type=float, default=0.95)
+    pca_parser.add_argument(
+        "--selection-method",
+        choices=["target-variance", "elbow", "kneedle", "l-method"],
+        default="kneedle",
+    )
+    pca_parser.add_argument(
+        "--power-of-two",
+        action=argparse.BooleanOptionalAction,
+        default=True,
+        help="Round target-variance selection to a power of two.",
+    )
+    pca_parser.add_argument(
+        "--use-mean-pooling",
+        action="store_true",
+        help="Mean-pool 3D or ragged embeddings over sequence positions before PCA.",
+    )
     return parser
 
 
@@ -102,7 +186,7 @@ def main(argv: list[str] | None = None) -> None:
 
     try:
         _run_command(args, parser)
-    except (OSError, ValueError) as exc:
+    except (OSError, RuntimeError, ValueError) as exc:
         if log_level == "DEBUG":
             raise
         parser.exit(1, f"Error: {exc}\n")
@@ -142,6 +226,56 @@ def _run_command(args: argparse.Namespace, parser: argparse.ArgumentParser) -> N
         print(f"Wrote Deepdraw round {latest_round}: {round_path}")
         return
 
+    if args.command == "embed-alphagenome":
+        if args.pool_csv is not None:
+            embed_design_pool(
+                pool_csv=args.pool_csv,
+                output_path=args.output,
+                sequence_column=args.sequence_column,
+                id_column=args.id_column,
+                model_version=args.model_version,
+                batch_size=args.batch_size,
+                pooling=args.pooling,
+                resolution=args.resolution,
+                species=args.species,
+                pad_to_multiple=not args.no_pad_to_multiple,
+                validate=not args.no_validate,
+                device=args.device,
+            )
+        else:
+            embed_fasta(
+                fasta_path=args.fasta,
+                output_path=args.output,
+                model_version=args.model_version,
+                batch_size=args.batch_size,
+                pooling=args.pooling,
+                resolution=args.resolution,
+                species=args.species,
+                pad_to_multiple=not args.no_pad_to_multiple,
+                validate=not args.no_validate,
+                device=args.device,
+            )
+        print(f"Wrote AlphaGenome embeddings: {args.output}")
+        return
+
+    if args.command == "pca":
+        summary = reduce_embeddings_pca(
+            input_file=args.input_file,
+            output_file=args.output,
+            n_components=args.n_components,
+            target_variance=args.target_variance,
+            use_mean_pooling=args.use_mean_pooling,
+            exact_n_components=args.exact_n_components,
+            power_of_two=args.power_of_two,
+            selection_method=args.selection_method.replace("-", "_"),
+        )
+        print(
+            f"Wrote PCA embeddings: {args.output} "
+            f"({summary.n_components} PCs, "
+            f"{summary.cumulative_explained_variance:.2%} variance)"
+        )
+        return
+
     parser.error(f"Unknown command {args.command}")
 
 

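
Review note on the `pca` flags in `cli.py`: `--exact-n-components` and `--select-components` share one `dest` whose parser-level default is `None`, which gives a three-state value (unset, `True`, `False`). The sketch below reproduces only that argparse pattern in isolation. The `resolve_exact` helper is an assumption inferred from the `--n-components` help text; how `reduce_embeddings_pca` actually interprets `None` is not shown in this diff.

```python
import argparse


def build_pca_parser():
    """Standalone copy of the tri-state flag pattern from the pca subparser."""
    parser = argparse.ArgumentParser(prog="pca-sketch")
    parser.add_argument("--n-components", type=int)
    # Set before add_argument so both flags inherit None as their default.
    parser.set_defaults(exact_n_components=None)
    parser.add_argument(
        "--exact-n-components", dest="exact_n_components", action="store_true"
    )
    parser.add_argument(
        "--select-components", dest="exact_n_components", action="store_false"
    )
    return parser


def resolve_exact(exact_flag, n_components):
    """Hypothetical resolution rule: an explicit flag wins, else exact iff a count was given."""
    if exact_flag is not None:
        return exact_flag
    return n_components is not None


parser = build_pca_parser()
for argv in ([], ["--n-components", "16"], ["--n-components", "16", "--select-components"]):
    args = parser.parse_args(argv)
    print(argv, args.exact_n_components, resolve_exact(args.exact_n_components, args.n_components))
```

Keeping `None` distinct from `False` lets the callee tell "the user did not choose" apart from "the user asked for selection", which a plain `store_true` flag cannot express.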
diff --git a/deepdraw/embeddings/__init__.py b/deepdraw/embeddings/__init__.py
@@ -0,0 +1,5 @@
+"""Embedding preparation helpers for Deepdraw."""
+
+from deepdraw.embeddings.pca import reduce_embeddings_pca
+
+__all__ = ["reduce_embeddings_pca"]
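
Review note on `reduce_embeddings_pca`: its body is not part of this diff, so the following is only a simplified illustration of the idea behind knee-based component selection. It normalizes the cumulative explained-variance curve to the unit square and picks the point furthest above the diagonal. The published Kneedle method adds smoothing and threshold-based knee detection, which this sketch omits; `knee_component_count` is a hypothetical name.

```python
def knee_component_count(explained_variance_ratio):
    """Pick a component count at the knee of the cumulative variance curve (simplified)."""
    n = len(explained_variance_ratio)
    if n < 3:
        return n  # Too few points for a meaningful knee.

    # Cumulative explained variance after 1, 2, ..., n components.
    cumulative = []
    total = 0.0
    for ratio in explained_variance_ratio:
        total += ratio
        cumulative.append(total)

    low, high = cumulative[0], cumulative[-1]
    span = high - low
    if span <= 0:
        return 1  # Flat curve: the first component already carries everything.

    best_index, best_gap = 0, float("-inf")
    for index, value in enumerate(cumulative):
        x = index / (n - 1)        # normalized component index
        y = (value - low) / span   # normalized cumulative variance
        gap = y - x                # height above the diagonal
        if gap > best_gap:
            best_index, best_gap = index, gap
    return best_index + 1          # convert 0-based index to a count


ratios = [0.5, 0.25, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01]
print(knee_component_count(ratios))  # prints 3
```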