This repository implements an iterative generative design pipeline that combines ProFam (protein sequence generation) and BAGEL (structure prediction + energies), driven by simple YAML configuration files. ProFam and BAGEL are installed as external pip packages from their GitHub repositories — they are not part of this codebase.
The main entrypoint is `run_profam_bagel_pipeline.py` at the repository root; convenience scripts are provided for running on a PBS cluster or on a local Mac.
For each cycle (1 … `max_cycles`), the pipeline:
- ProFam generation
  - Reads an input FASTA containing the initial sequences and, from cycle 2 onward, the subset selected in the previous cycle.
  - Calls ProFam's sampling API directly to generate `profam_num_samples` new sequences.
- BAGEL folding + energy evaluation
  - For each generated sequence:
    - Builds a `bagel.Chain` (chain ID `A`). If any energy term specifies a `"target"` sequence, builds additional target chains (chain IDs `B`, `C`, …) and folds the multi-chain complex.
    - Uses an `ESMFold` folding oracle to predict the 3D structure (only when at least one energy term requires it).
    - Computes a weighted total energy as the sum of user-configured BAGEL energy terms.
  - Computes the average sequence similarity between generated sequences and the initial input sequences.
- Probability assignment
  - Converts energies into probabilities via softmax over `-E / T` (where T = `softmax_temperature`).
- Subset selection and logging
  - Samples a subset of size `floor(f_inject * N_output)` (at least 1), with or without replacement depending on `sample_with_reinsertion`.
  - Writes a JSON log per cycle with energies, sequence similarity, selected indices, sequences, and per-term breakdowns.
  - Saves CIF structures for all generated sequences and for the selected subset (when folding was performed).
- Injection into the next cycle
  - When `reinject_initial` is `true` (default), the selected subset is merged with the initial sequences to form the input FASTA for the next ProFam call.
  - When `reinject_initial` is `false`, only the selected subset is used from cycle 2 onward.
- Summary plot
  - After the final cycle, generates a plot of average and minimum energy vs cycle index, with average sequence similarity on a secondary y-axis.
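The probability-assignment and subset-selection steps can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the function names `energies_to_probabilities` and `select_subset` are hypothetical, and the without-replacement branch (sequential weighted draws) is an assumed approach.

```python
import math
import random


def energies_to_probabilities(energies, temperature=1.0):
    """Softmax over -E / T: lower energy gives higher probability."""
    logits = [-e / temperature for e in energies]
    peak = max(logits)
    if peak == float("-inf"):
        raise ValueError("all energies are infinite")
    # Subtract the largest logit for numerical stability.
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def select_subset(probabilities, f_inject, with_replacement=True, rng=None):
    """Sample floor(f_inject * N) indices (at least 1) according to probabilities."""
    rng = rng or random.Random()
    n_select = max(1, math.floor(f_inject * len(probabilities)))
    indices = list(range(len(probabilities)))
    if with_replacement:
        return rng.choices(indices, weights=probabilities, k=n_select)
    # Assumed scheme: draw one index at a time and remove it from the pool.
    # Requires the remaining weights to have a positive total.
    weights = list(probabilities)
    chosen = []
    for _ in range(min(n_select, len(indices))):
        pick = rng.choices(range(len(indices)), weights=weights, k=1)[0]
        chosen.append(indices.pop(pick))
        weights.pop(pick)
    return chosen
```

A higher `temperature` flattens the distribution (more exploration); a lower one concentrates selection on the lowest-energy sequences.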
| Requirement | Details |
|---|---|
| Python | 3.11 (BAGEL requires >=3.11,<3.14) |
| conda | Recommended for environment management |
| GPU | Needed for ESMFold folding (either locally, on Modal, or on a cluster) |
ProFam and BAGEL have overlapping but conflicting dependency pins:
| Package | BAGEL requires | ProFam freeze |
|---|---|---|
| numpy | >=2.2.0 | ==1.26.4 |
| matplotlib | >=3.10.0 | ==3.9.4 |
| transformers | >=4.49.0 | ==4.48.3 |
BAGEL also needs boileroom==0.2.2, pydantic, modal, and biotite, which are not in ProFam's dependency list. The provided setup script resolves all of these conflicts automatically.
Additionally, boileroom==0.2.2 constrains the torch version it installs (currently torch==2.6.0), and torchvision/torchaudio must match exactly (e.g. torchvision==0.21.0 for torch==2.6.0). The setup script handles this automatically by detecting the installed torch version and installing matching companion packages.
# From the repository root:
chmod +x setup_environment.sh
./setup_environment.sh

This script will:
- Create a `profam_bagel` conda environment with Python 3.11.
- Install BAGEL (`biobagel`) from GitHub, which pulls `boileroom`, `biotite`, `numpy`, `pydantic`, etc.
- Detect the `torch` version that `boileroom` installed and install matching `torchvision`/`torchaudio`.
- Install ProFam from GitHub, which pulls `transformers`, `lightning`, `hydra-core`, etc.
- Install pipeline utilities (`pyyaml`, `modal`).
- Verify that all key imports work.
After the script completes:
conda activate profam_bagel
python run_profam_bagel_pipeline.py --config example_pipeline_config.yaml

If you prefer to set up the environment manually:
conda create -n profam_bagel python=3.11 -y
conda activate profam_bagel
# 1. Install BAGEL from GitHub (sets the torch version via boileroom)
pip install "biobagel[local] @ git+https://github.com/softnanolab/bagel.git"
# 2. Install matching torchvision/torchaudio for the torch version boileroom pulled
# Check: python -c "import torch; print(torch.__version__)"
# For torch 2.6.x:
pip install torchvision==0.21.0 torchaudio==2.6.0
# 3. Install ProFam from GitHub
pip install "git+https://github.com/alex-hh/profam.git"
# 4. Install additional ProFam runtime dependencies not in its setup.py
pip install rootutils safetensors huggingface-hub biopython scipy scikit-learn
# 5. Install pipeline utilities
pip install pyyaml modal

Before running the pipeline you also need:
- ProFam model checkpoint: download it into `model_checkpoints/` with `python -c "from huggingface_hub import snapshot_download; snapshot_download('alex-hh/profam-1', local_dir='model_checkpoints/profam-1')"`
- Template structure (if using `TemplateMatchEnergy`): place the `.cif` file at the path specified in your energy config (e.g. `template.cif` in the repo root), or use a `pdb_code` to download it automatically.
- Modal token (if using `run_on_modal: true`): authenticate with `modal token new`.
The pipeline is configured with two YAML files:
- A pipeline YAML file for high-level pipeline settings.
- An energy YAML file specifying the BAGEL folding oracle and energy terms.
- CLI flags can override any value from the pipeline YAML.
Minimal example (see example_pipeline_config.yaml for the full version):
# Input sequences
initial_fasta: initial_sequences.fasta
# ProFam
profam_checkpoint_dir: model_checkpoints/profam-1
profam_sampler: single
profam_num_samples: 64
profam_temperature: 0.8
profam_top_p: 0.95
# BAGEL energy configuration
energy_config: example_energy_template_match.yaml
# Pipeline control
f_inject: 0.25
max_cycles: 10
output_dir: outputs/pipeline_run1
softmax_temperature: 1.0
random_seed: 42
# Subset selection
sample_with_reinsertion: true # false = sample without replacement
reinject_initial: true # false = don't prepend initial seqs from cycle 2+
n_memory: 0 # pool previous cycles' sequences for selection
# Run entire pipeline on Modal (GPU in the cloud)
run_on_modal: true
output_frequency: 1 # sync results every N cycles
enforce_template: true # force template residue identity during generation

Required keys (must be provided either in YAML or via CLI):
- `initial_fasta`
- `profam_checkpoint_dir`
- `energy_config`
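The precedence rule (CLI flags override YAML values, and the three required keys must come from one of the two sources) can be illustrated as follows. This is a sketch under assumptions: `merge_config` is a hypothetical helper, plain dictionaries stand in for the parsed YAML and the parsed CLI arguments, and a CLI value of `None` is taken to mean "flag not given".

```python
REQUIRED_KEYS = ("initial_fasta", "profam_checkpoint_dir", "energy_config")


def merge_config(yaml_values, cli_overrides):
    """Overlay CLI overrides on YAML values and check the required keys."""
    merged = dict(yaml_values)
    # Only flags that were actually supplied (non-None) replace YAML values.
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    missing = [key for key in REQUIRED_KEYS if merged.get(key) is None]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))
    return merged


config = merge_config(
    {
        "initial_fasta": "initial_sequences.fasta",
        "profam_checkpoint_dir": "model_checkpoints/profam-1",
        "energy_config": "example_energy_template_match.yaml",
        "max_cycles": 10,
    },
    {"max_cycles": 5, "f_inject": None},
)
print(config["max_cycles"])
```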
The energy YAML defines the folding oracle(s) and a list of energy terms. Two schemas are supported: single-oracle (legacy) and multi-oracle (for running several predictors side-by-side).
Single-oracle schema:
folding_oracle:
  type: Boltz
  kwargs:
    diffusion_samples: 5 # Boltz averages N samples in one subprocess call
    recycling_steps: 3
energies:
  - type: ipSAEEnergy
    kwargs:
      weight: 1.0
      pae_cutoff: 10.0
      target: <TARGET_SEQUENCE>
      residues:
        GEN: "all"
        B: "all"
  - type: iPTMEnergy
    kwargs:
      weight: 0.05

Supported oracle types (via `folding_oracle.type` or each entry under `folding_oracles:`): `ESMFold`, `BatchedESMFold`, `Boltz` (Boltz2), `Chai1`, `AF2BindCraft`, `AlphaFast`, `ColabFold`. `kwargs` are passed verbatim to the oracle class's constructor; see the BAGEL source for each oracle's full parameter list. Notable per-oracle ensembling kwargs:
- `Boltz.diffusion_samples` (int, default 1): Boltz averages N diffusion samples inside a single `boltz predict --diffusion_samples N` subprocess call. The oracle returns a single `BoltzResult` whose scalar (pTM, iPTM) and tensor (pLDDT, PAE) metrics are averaged across samples; the structure comes from sample 0.
- `Chai1.num_diffn_samples` (int, default 5): Chai-1's native ensemble.
- `AF2BindCraft.prediction_models` (list[int], default [0, 1]): which AF2 models to average over.
Multi-oracle schema (runs every listed oracle on every sequence and extracts per-oracle metrics):
folding_oracles:
  esmfold:
    type: BatchedESMFold
    kwargs: {use_modal: false}
  boltz2:
    type: Boltz
    kwargs: {diffusion_samples: 5}
  chai1:
    type: Chai1
    kwargs: {num_diffn_samples: 3}
  af2:
    type: AF2BindCraft
    kwargs:
      target_pdb: /abs/path/to/target.pdb
      target_chain: A
      conda_env: BindCraft
energies:
  - type: iPTMEnergy
    oracle: boltz2 # per-energy oracle reference
    kwargs: {weight: 0.05, name: b2}
  - type: iPTMEnergy
    oracle: chai1
    kwargs: {weight: 0.05, name: chai}
  - type: PLDDTEnergy
    oracle: esmfold
    kwargs: {weight: 0.1, name: esm, residues: {GEN: "all"}}
  # ... mix and match to get the same metric from multiple predictors

In multi-oracle mode each sequence is folded by every referenced oracle (with a `torch.cuda.empty_cache()` between calls so in-process models don't OOM each other). Energy terms without an explicit `oracle:` key fall back to the first oracle in `folding_oracles`. Use the `name:` kwarg on each energy term to keep metrics from different oracles distinct in the cycle stats (e.g. `iPTM_b2` vs `iPTM_chai`).
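The fallback rule for the `oracle:` key can be sketched as below. `resolve_oracles` is a hypothetical helper that works on plain dictionaries shaped like the parsed energy YAML; rejecting unknown oracle names is an assumption, not documented behaviour.

```python
def resolve_oracles(oracle_names, energy_entries):
    """Return, per energy entry, the name of the oracle it binds to.

    oracle_names: oracle keys in the order they appear under folding_oracles.
    energy_entries: list of dicts, each optionally carrying an "oracle" key.
    """
    if not oracle_names:
        raise ValueError("at least one folding oracle is required")
    default = oracle_names[0]  # entries without "oracle" use the first oracle
    resolved = []
    for entry in energy_entries:
        name = entry.get("oracle", default)
        if name not in oracle_names:
            # Assumed behaviour: reject references to undefined oracles.
            raise KeyError(f"unknown oracle {name!r} in energy term {entry.get('type')}")
        resolved.append(name)
    return resolved


bound = resolve_oracles(
    ["esmfold", "boltz2", "chai1"],
    [{"type": "iPTMEnergy", "oracle": "chai1"}, {"type": "PLDDTEnergy"}],
)
print(bound)
```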
`energies`: Each entry is a BAGEL energy term class from `bagel.energies`:
- `type`: e.g. `ipSAEEnergy`, `iPTMEnergy`, `PLDDTEnergy`, `LISEnergy`, `SolMPNNPerplexityEnergy`, `TemplateMatchEnergy`, `SeparationEnergy`, etc.
- `oracle` (optional, multi-oracle mode only): name of the oracle to bind this term to. Defaults to the first oracle in `folding_oracles`.
- `kwargs`: Passed to the energy term's `__init__`, with `oracle` set automatically to the resolved oracle instance.
- If a `residues` field is present, it must be a dictionary mapping chain names to residue specifications. The generated chain always uses the key `GEN`. For multi-chain energy terms, additional keys identify target chains (see sections 3.5 and 3.8). Values can be compact range strings (e.g. `"0-43"`), integer lists, or `"all"`.
- If a `target` field is present, the pipeline builds a multi-chain system for that energy term (see section 3.5).
- Structure references (e.g. `template_structure_path`) can be replaced with `pdb_code` to download directly from the RCSB PDB (see section 3.9).
TemplateMatchEnergy computes the RMSD between a subset of the generated structure and the corresponding subset of a reference template structure. The key concept is that the residues.GEN indices specify the same 0-based positions in both structures:
- `residues`: A dictionary with a single key `GEN` whose value lists 0-based residue positions extracted from both the generated (folded) structure and the template chain. Values can be a compact range string (e.g. `"0-43"`, see section 3.8) or a list of integers. Because the same positions are used on both sides, the atom counts always match.
- Template structure loading: Provide either:
  - `template_structure_path` + optional `template_chain_id`: a local CIF/PDB file and chain to filter by.
  - `pdb_code` + optional `template_chain_id`: a PDB identifier to download from RCSB (see section 3.9).
- `backbone_only`: When `true`, only backbone atoms (CA, N, C; 3 per residue) are compared on both sides. When `false`, all atoms are compared; this requires that the amino-acid identity at each compared position is the same in both structures (otherwise the per-residue atom counts differ).
Example with local file (see example_energy_template_match.yaml):
- type: TemplateMatchEnergy
  kwargs:
    weight: 1.0
    backbone_only: true
    template_structure_path: template.cif
    template_chain_id: A
    residues:
      GEN: "0-43"

Example with PDB download:

- type: TemplateMatchEnergy
  kwargs:
    weight: 1.0
    backbone_only: true
    pdb_code: "1ubq"
    template_chain_id: A
    residues:
      GEN: "0-43"

Here the first 44 residues (0-based) are compared between the generated structure and the template chain.
When TemplateMatchEnergy is used, the generated sequence must have the correct amino-acid identity at the template-matching positions, otherwise the atom counts will differ and evaluation fails. The enforce_template YAML flag controls how this is handled:
| `enforce_template` | Behaviour |
|---|---|
| `true` (default) | During ProFam generation, the amino acids at the `residues` positions are forced to match the template sequence using a logits processor. This guarantees the correct identity at constrained positions while letting the model freely generate the remaining positions. |
| `false` | ProFam generates freely. If a generated sequence has a different amino acid at a template position, the atom count mismatch causes a `ValueError` during energy evaluation. The pipeline catches this error and assigns infinity energy to that sequence. If all sequences in a cycle receive infinity energy, the cycle is retried (up to 5 attempts). |
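The `enforce_template: false` behaviour (catch the `ValueError`, assign infinite energy, retry the cycle if nothing is usable) can be sketched as follows. The helper names and the `generate`/`evaluate` callables are hypothetical, and raising after the final attempt is an assumption; the text above only states the limit of 5 attempts.

```python
import math

MAX_ATTEMPTS = 5  # retry limit stated in the table above


def safe_energy(evaluate, sequence):
    """Evaluate one sequence; an atom-count mismatch becomes infinite energy."""
    try:
        return evaluate(sequence)
    except ValueError:
        return math.inf


def run_cycle_with_retries(generate, evaluate, max_attempts=MAX_ATTEMPTS):
    """Regenerate until at least one sequence has a finite energy."""
    for attempt in range(1, max_attempts + 1):
        sequences = generate()
        energies = [safe_energy(evaluate, seq) for seq in sequences]
        if any(math.isfinite(e) for e in energies):
            return sequences, energies, attempt
    # Assumed behaviour once the retry budget is exhausted.
    raise RuntimeError(f"no finite energies after {max_attempts} attempts")
```

Sequences with infinite energy receive zero probability under the softmax over `-E / T`, so they are never selected for injection.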
Energy terms that operate on two groups of residues (e.g. PAEEnergy, SeparationEnergy, LISEnergy) can be used to evaluate interactions between the ProFam-generated chain and a fixed target chain. This is useful for designing sequences that bind to or interact with a known protein.
To enable this, specify the target chain in one of two ways:
Option A — inline sequence:
- type: LISEnergy
  kwargs:
    weight: 1.0
    target: MKTAYIAKQRQISFVKSH...
    residues:
      GEN: "0-19"
      B: "0-9"

The non-`GEN` key (`B` here) becomes the target chain's identifier in the output CIF files.
Option B — download from PDB:
- type: LISEnergy
  kwargs:
    weight: 1.0
    target_pdb_code: "1ubq"
    target_chain_id: A
    residues:
      GEN: "0-19"
      A: "0-9"

The non-`GEN` key must match the `target_chain_id` value.
When a target is present:
- The pipeline builds a target chain from the provided sequence (or downloaded from PDB). Its chain ID is taken from the non-`GEN` key in the `residues` dict.
- The generated sequence uses chain ID `GEN`, the target uses the key from `residues`.
- ESMFold folds both chains together as a multi-chain complex (using its native `":"` separator).
- The `"residues"` dict maps each chain key to its residue indices. `GEN` = residues on the generated chain, the other key = residues on the target chain. Compact range strings are supported (see section 3.8).
- The energy term (e.g. `LISEnergy`) computes the metric across the two chains using the predicted complex structure.
- Output CIF files use these chain IDs (`GEN` for the generated chain, the target key for the target chain).
A full example is provided in example_energy_lis_binding.yaml.
Multiple energy entries can each have their own target — if two entries share the same target sequence and chain ID, the pipeline de-duplicates them into a single target chain.
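The de-duplication of targets can be illustrated with a small sketch. It assumes, for illustration only, that each parsed energy entry keeps `target` and `residues` inside its `kwargs` dict; `collect_targets` is a hypothetical name.

```python
def collect_targets(energy_entries):
    """Collect unique (chain_id, target_sequence) pairs across energy entries."""
    unique = {}
    for entry in energy_entries:
        kwargs = entry.get("kwargs", {})
        sequence = kwargs.get("target")
        if sequence is None:
            continue  # single-chain term, no target to build
        for chain_id in kwargs.get("residues", {}):
            if chain_id != "GEN":
                # Dict keys preserve insertion order and drop duplicates.
                unique.setdefault((chain_id, sequence), None)
    return list(unique)


entries = [
    {"type": "LISEnergy", "kwargs": {"target": "MKTAY", "residues": {"GEN": "0-19", "B": "0-9"}}},
    {"type": "PAEEnergy", "kwargs": {"target": "MKTAY", "residues": {"GEN": "all", "B": "all"}}},
    {"type": "PLDDTEnergy", "kwargs": {"residues": {"GEN": "all"}}},
]
print(collect_targets(entries))
```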
SolMPNNPerplexityEnergy scores how well a sequence fits its predicted backbone using SolubleMPNN's autoregressive perplexity. Lower perplexity means the sequence is "well-justified" by the backbone. It works with any folding oracle (ESMFold, Boltz, Chai-1, AF2BindCraft) because it only reads the predicted structure from the oracle result.
Two backends (select via use_modal):
- `use_modal: true`: routes scoring through a deployed Modal app. Kept for backwards compatibility; requires `modal deploy modal_proteinmpnn_score.py`.
- `use_modal: false` (default, recommended): runs locally by invoking a bundled BAGEL script (`bagel.scripts.proteinmpnn_scorer`) inside a separate conda env (to isolate ProteinMPNN's older dependency pins from the main BAGEL/Boltz stack). Requires:
  - A clone of ProteinMPNN.
  - A conda env (e.g. `proteinmpnn`) with torch + numpy that can import it.
Example config:
- type: SolMPNNPerplexityEnergy
  kwargs:
    weight: 0.1
    use_modal: false
    proteinmpnn_env: proteinmpnn
    proteinmpnn_path: /mnt/disk2/ProteinMPNN
    backbone_noise: 0.1 # augment_eps: Gaussian noise on backbone coords
    ensemble_n: 10 # forward passes per call (each with independent noise + decoding order)
    decoding_order: random # or "fixed:<seed>" for determinism
    residues:
      GEN: "all"

Ensemble semantics:
Each of the ensemble_n passes re-featurises the structure (applying fresh Gaussian backbone noise when backbone_noise > 0) and samples a new decoding order. The returned perplexity is exp(mean NLL) across all passes. Setting backbone_noise > 0 together with ensemble_n > 1 propagates structural uncertainty into the score.
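As a numerical illustration of "exp(mean NLL) across all passes", the sketch below pools the per-residue negative log-likelihoods from every pass before exponentiating. Whether the real scorer pools residues and passes in exactly this way is an assumption; `ensemble_perplexity` is a hypothetical name.

```python
import math


def ensemble_perplexity(nll_per_pass):
    """Perplexity from per-residue NLLs, one inner list per ensemble pass."""
    pooled = [nll for one_pass in nll_per_pass for nll in one_pass]
    if not pooled:
        raise ValueError("no NLL values supplied")
    return math.exp(sum(pooled) / len(pooled))


# Two passes over a 3-residue chain, each residue assigned probability 1/4.
uniform_quarter = [[math.log(4.0)] * 3, [math.log(4.0)] * 3]
print(ensemble_perplexity(uniform_quarter))
```

A perplexity of 4 here means the model is, on average, as uncertain as a uniform choice among four amino acids at each scored position.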
Complex context (important):
The energy is always computed on whatever structure the folding oracle produced. In standard binder campaigns the oracle folds binder + target together, so SolMPNN sees the full complex — and even though only residues on the GEN chain contribute to the loss, the encoder sees the target as "binding context". This is what you want for binder design: a good interface residue is one that is justified given the target, not one that is stable in isolation.
Validation (see test_mpnn_context_significance.py): On 1YCR (p53 peptide + MDM2), the p53 peptide's perplexity drops from 16.57 ± 0.19 (monomer only) to 4.47 ± 0.07 (in MDM2 complex) — a 73% drop, t-statistic ~-188. This confirms the complex context meaningfully informs the score, well beyond the natural ensemble variance. Always pass the complex to SolMPNN for binder scoring.
At each cycle, the pipeline automatically computes the average sequence similarity between the ProFam-generated sequences and the initial input sequences (from initial_fasta). For each generated sequence, similarity to each initial sequence is computed via global pairwise alignment (Needleman–Wunsch, using Biopython's PairwiseAligner) as the fraction of identical residues over the alignment length. This correctly handles insertions and deletions — a single indel no longer shifts all downstream positions. The best (maximum) similarity across all initial sequences is kept, and the mean of those best-match values is:
- Printed to the console during the run.
- Stored in `cycle_stats.json` as `"all_avg_similarity"` per cycle.
- Plotted on a secondary y-axis in `energy_summary.png`.
This metric helps monitor how much the generated sequences drift from the starting point over successive cycles.
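The similarity metric can be sketched without Biopython using a small Needleman-Wunsch implementation. This is a stand-in for illustration, not the pipeline's code: the scoring scheme (match +1, mismatch -1, gap -1) is an assumption and may differ from the `PairwiseAligner` settings actually used, and both function names are hypothetical.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Fraction of identical columns in one optimal global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, counting alignment columns.
    i, j = len(a), len(b)
    identical = aligned = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                identical += a[i - 1] == b[j - 1]
                aligned += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        aligned += 1
    return identical / aligned if aligned else 0.0


def mean_best_similarity(generated, initial):
    """Mean over generated sequences of the best identity to any initial sequence."""
    best = [max(global_identity(g, ref) for ref in initial) for g in generated]
    return sum(best) / len(best)
```

With these settings a single deletion costs one alignment column rather than shifting every downstream position, so `global_identity("ACDEFG", "ACEFG")` evaluates to 5/6.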
The residues field is a dictionary mapping chain names to residue specifications. Each value can be a compact range string instead of a full list of integers:
| Format | Expands to |
|---|---|
| `"5"` | `[5]` |
| `"1,2,5"` | `[1, 2, 5]` |
| `"0-43"` | `[0, 1, 2, ..., 43]` |
| `"0-5,10,20-25"` | `[0, 1, 2, 3, 4, 5, 10, 20, 21, 22, 23, 24, 25]` |
Single-chain energy terms (e.g. TemplateMatchEnergy) use only the GEN key:
residues:
  GEN: "0-43"

Multi-chain energy terms (e.g. `PAEEnergy`, `LISEnergy`, `SeparationEnergy`) use `GEN` for the generated chain and the target's chain ID as the second key:

residues:
  GEN: "0-19"
  A: "0-9"

This maps residues 0–19 on the generated chain (`GEN`) and residues 0–9 on the target chain (`A`). The `GEN` key is always group 0, the target key is group 1.
The explicit integer list format (e.g. [0,1,2,3]) and the "all" shorthand continue to work as values within the dict.
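The expansion rules for residue specifications can be sketched as a small parser. `parse_residue_spec` is a hypothetical name, and treating `"all"` as every 0-based position of a chain of known length is an assumption about how the shorthand is resolved.

```python
def parse_residue_spec(spec, chain_length=None):
    """Expand a residue specification into a list of 0-based indices."""
    if isinstance(spec, list):
        return [int(x) for x in spec]  # explicit integer list
    if spec == "all":
        if chain_length is None:
            raise ValueError('"all" needs the chain length')
        return list(range(chain_length))
    indices = []
    for part in str(spec).split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if end < start:
                raise ValueError(f"descending range: {part}")
            indices.extend(range(start, end + 1))  # ranges are inclusive
        else:
            indices.append(int(part))
    return indices


print(parse_residue_spec("0-5,10,20-25"))
# [0, 1, 2, 3, 4, 5, 10, 20, 21, 22, 23, 24, 25]
```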
Instead of providing a local CIF/PDB file, you can specify a PDB code and the pipeline will download the structure from the RCSB PDB:
pdb_code: "1ubq"
template_chain_id: A

This replaces `template_structure_path`. Downloaded files are cached in `~/.cache/profam_bagel/pdb/` so they are not re-downloaded on subsequent runs.

For multi-chain energy targets, use `target_pdb_code` and `target_chain_id` instead of an inline `target` sequence:

target_pdb_code: "1ubq"
target_chain_id: A

The pipeline downloads the CIF, extracts the specified chain, and uses its sequence as the target.
The utility functions download_pdb_cif() and extract_chain_from_cif() are also available for standalone use from run_profam_bagel_pipeline.
The pipeline supports two strategies for choosing which sequences to condition ProFam on in the next cycle:
| `selection_strategy` | Behaviour |
|---|---|
| `greedy` (default) | Softmax over energies → sample injection set → elitism/conditional swap. The classic path described in sections 1–5 above. |
| `thompson` | Thompson sampling with Beta posteriors. Treats each sequence as a bandit arm and learns which conditioning sequences produce the best progeny over time. |
When selection_strategy: thompson, the pipeline uses a multi-armed bandit approach instead of greedy softmax selection:
- Arm = a protein sequence with a Beta(α, β) posterior representing the distribution of rewards from its progeny.
- Reward = `clamp(-ipSAE, 0, 1)`. Since ipSAE is negative (more negative = better binding), negating and clamping gives a reward in [0, 1].
- Bootstrap: When a sequence is first observed, its own ipSAE provides the first observation r, which gives Beta(1 + r, 2 - r), i.e. a uniform Beta(1, 1) prior updated once with the rule used each cycle.
Each cycle:
- Sample θᵢ ~ Beta(αᵢ, βᵢ) for every arm.
- Pick the arm with the highest θᵢ → use that sequence to condition ProFam.
- Generate progeny, evaluate their ipSAE.
- Update the chosen arm's posterior: α += reward, β += (1 - reward).
- Register all progeny (with finite ipSAE) as new arms.
Max-seeking variant: Setting thompson_m_samples > 1 samples m times from each arm's Beta posterior and takes the maximum. This biases selection toward high-variance (under-explored) arms — useful for encouraging exploration.
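The bandit mechanics described above can be sketched with the standard library. The class and function names are hypothetical, and the sketch updates the chosen arm with a single reward value; how the rewards of several progeny are combined into that value is not specified here and is left to the caller.

```python
import random


def reward_from_ipsae(ipsae_energy):
    """clamp(-ipSAE, 0, 1): more negative ipSAE gives a larger reward."""
    return min(1.0, max(0.0, -ipsae_energy))


class Arm:
    """One conditioning sequence with a Beta(alpha, beta) posterior."""

    def __init__(self, sequence, own_ipsae):
        r = reward_from_ipsae(own_ipsae)
        self.sequence = sequence
        self.alpha = 1.0 + r  # bootstrap: Beta(1 + r, 2 - r)
        self.beta = 2.0 - r

    def sample(self, rng, m_samples=1):
        # m_samples > 1 gives the max-seeking variant.
        return max(rng.betavariate(self.alpha, self.beta) for _ in range(m_samples))

    def update(self, reward):
        self.alpha += reward
        self.beta += 1.0 - reward


def choose_arm(arms, rng, m_samples=1):
    """Draw from every posterior and return the arm with the highest draw."""
    return max(arms, key=lambda arm: arm.sample(rng, m_samples))


rng = random.Random(42)
arms = [Arm("MKTAYIAK", -0.2), Arm("MKSAYIAR", -0.7)]
chosen = choose_arm(arms, rng, m_samples=3)
chosen.update(reward_from_ipsae(-0.55))  # reward observed from its progeny
```

Taking the maximum of several draws favours arms whose posterior is wide, which is why a larger `thompson_m_samples` pushes selection toward under-explored sequences.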
Configuration:
selection_strategy: thompson
thompson_m_samples: 1 # 1 = standard Thompson, >1 = max-seeking
thompson_reward_term: ipSAE # energy term name used as reward signal

The `thompson_reward_term` must match a key in the `energy_terms` dict produced by BAGEL evaluation (e.g. `ipSAE` from `ipSAEEnergy`).

Thompson mode outputs an additional file:
- `thompson_arms.json`: full state of all arms (α, β, sequence, parent lineage, selection count), updated each cycle.
- `cycle_stats.json` gains `thompson_selected_arm_id`, `thompson_progeny_reward`, and `thompson_num_arms` fields per cycle.
An example config is at configs/pipelines/pipeline_thompson_example.yaml.
| Scenario | Recommended |
|---|---|
| Short runs (< 20 cycles), well-understood scaffold | greedy with elitism |
| Long exploratory runs, multiple scaffolds, unknown which sequences are good generators | thompson |
| Need to balance exploration vs exploitation across a growing pool of candidates | thompson with thompson_m_samples: 3 |
By default, when selecting which sequences to inject into the next ProFam cycle, only the sequences generated in the current cycle are considered. The n_memory parameter widens this selection pool to include sequences from previous cycles:
n_memory: 3 # include sequences from the last 3 cycles in the pool

| `n_memory` | Pool contents |
|---|---|
| `0` (default) | Current cycle only |
| `N > 0` | Current cycle + up to the last N cycles |
The number of sequences selected for injection stays the same (floor(f_inject * profam_num_samples)); only the candidate pool grows. Probabilities are computed via softmax over the entire pool, so good sequences from earlier cycles can survive even if they were not selected at the time.
Every generated sequence receives a global unique ID that increments monotonically across cycles. With 10 sequences per cycle: cycle 1 produces IDs 0–9, cycle 2 produces IDs 10–19, and so on. These IDs appear in cycle_stats.json as "selected_ids" and in each sequence entry's "id" field, making it unambiguous which sequence is which regardless of cycle of origin.
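The ID numbering and the `n_memory` pool can be sketched as follows. Both helpers are hypothetical, and the ID formula assumes the same number of sequences is generated in every cycle, as in the example above.

```python
def global_ids(cycle, n_per_cycle):
    """IDs assigned in a 1-based cycle when every cycle yields n_per_cycle sequences."""
    start = (cycle - 1) * n_per_cycle
    return list(range(start, start + n_per_cycle))


def selection_pool(history, cycle, n_memory):
    """Candidates from the current cycle plus up to n_memory previous cycles.

    history maps a 1-based cycle number to that cycle's list of candidates.
    """
    first = max(1, cycle - n_memory)
    pool = []
    for c in range(first, cycle + 1):
        pool.extend(history.get(c, []))
    return pool


print(global_ids(2, 10))  # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```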
From the repository root:
conda activate profam_bagel
python run_profam_bagel_pipeline.py --config example_pipeline_config.yaml

Override any YAML key via CLI:
python run_profam_bagel_pipeline.py \
--config example_pipeline_config.yaml \
--max_cycles 5 \
--f_inject 0.1 \
--run_on_modal false

Run without a YAML file by supplying all required flags:
python run_profam_bagel_pipeline.py \
--initial_fasta initial_sequences.fasta \
--profam_checkpoint_dir model_checkpoints/profam-1 \
--energy_config example_energy_template_match.yaml \
--profam_num_samples 64 \
--f_inject 0.25 \
--max_cycles 10 \
--output_dir outputs/pipeline_run1

Note: Running locally with `run_on_modal: false` requires a GPU with enough memory for ESMFold, and the `MODEL_DIR` environment variable must point to the folder containing the ESMFold model weights (e.g. `export MODEL_DIR=~/.cache/bagel/models`).
A convenience wrapper for local runs is also available:
./run_pipeline_mac.sh example_pipeline_config.yaml [extra CLI args...]

Modal lets you run the entire pipeline (ProFam generation, ESMFold folding, and energy evaluation) on cloud GPUs without managing any infrastructure.
- Install and authenticate with Modal:
  pip install modal # already included by setup_environment.sh
  modal token new # opens a browser to authenticate
- Set `run_on_modal: true` in your YAML config (this is the default in `example_pipeline_config.yaml`).
When run_on_modal: true:
- `run_profam_bagel_pipeline.py` serialises the pipeline configuration and dispatches it to a remote Modal job defined in `run_profam_bagel_modal_app.py`.
- The Modal container uses a Docker image with all dependencies pre-installed (PyTorch, transformers, boileroom, BAGEL, etc.).
- Your local repository is uploaded to `/workspace` inside the container.
- If cached ESMFold model weights exist locally (at `MODEL_DIR` or `~/.cache/bagel/models`), they are uploaded to `/models/bagel` to avoid re-downloading.
- ESMFold's `use_modal` is forced to `true` inside the container, regardless of the energy config setting.
- The job runs on an NVIDIA A10G GPU with a 24-hour timeout.
conda activate profam_bagel
python run_profam_bagel_pipeline.py --config example_pipeline_config.yaml

That's it: the script handles the rest. You'll see Modal's output stream in your terminal.
For institutional HPC clusters that use the PBS job scheduler, a batch script is provided: run_pipeline_pbs.sh.
The same setup_environment.sh script works on Linux HPC nodes. The only difference is that on Linux with NVIDIA GPUs, the script installs CUDA-enabled PyTorch packages automatically.
# On the cluster login node:
git clone <this-repo-url> profam_bagel
cd profam_bagel
# Load conda (adapt to your cluster's module system)
module load anaconda3 # or: module load miniconda
# or: source ~/miniconda3/etc/profile.d/conda.sh
# Run the setup script
chmod +x setup_environment.sh
./setup_environment.sh
# Download the ProFam checkpoint
conda activate profam_bagel
python -c "from huggingface_hub import snapshot_download; snapshot_download('alex-hh/profam-1', local_dir='model_checkpoints/profam-1')"

If the cluster does not have internet access on compute nodes, run the setup and model download on the login node (which typically does have internet access).
You have two options for ESMFold folding when running on a cluster:
| Option | YAML setting | Energy config `use_modal` | Requirements |
|---|---|---|---|
| A. Local GPU folding | `run_on_modal: false` | `false` | GPU node, `MODEL_DIR` set to ESMFold weights |
| B. Modal folding | `run_on_modal: false` | `true` | Internet access from compute nodes, Modal token configured |
Option A (recommended for GPU clusters): The pipeline runs entirely on the cluster node. Set run_on_modal: false in the pipeline YAML and use_modal: false in the energy YAML. You must set the MODEL_DIR environment variable to the directory containing ESMFold model weights:
export MODEL_DIR=/path/to/esmfold/models

Option B: The pipeline runs on the cluster but ESMFold folding is offloaded to Modal. Set `run_on_modal: false` in the pipeline YAML but `use_modal: true` in the energy YAML. This requires internet access from compute nodes and a configured Modal token (`modal token new`).
Note: Setting `run_on_modal: true` on a cluster would send the entire pipeline (including ProFam) to Modal, which is usually not what you want on an HPC system; use Option A or B instead.
Edit run_pipeline_pbs.sh to match your cluster's resource configuration and add the conda activation command:
# Inside run_pipeline_pbs.sh, uncomment and adapt:
module load anaconda3
source activate profam_bagel
# If using local folding (Option A), also add:
export MODEL_DIR=/path/to/esmfold/models

Then submit (qsub options must come before the script name):

qsub -v CONFIG=example_pipeline_config.yaml run_pipeline_pbs.sh

The default resource request is:
- 1 node, 8 CPUs, 1 GPU, 64 GB RAM, 24-hour walltime
Adjust the #PBS -l directives in the script as needed for your cluster.
For cloud deployment on AWS EC2, GCP Compute Engine, Azure VMs, or similar infrastructure.
| Use Case | Instance Type | GPU | vCPU | Memory | Hourly Cost* |
|---|---|---|---|---|---|
| Production runs | `g5.xlarge` | A10G (24GB) | 4 | 16 GB | ~$1.00 |
| Large batches | `g5.2xlarge` | A10G (24GB) | 8 | 32 GB | ~$1.50 |
| Modal offload | `t3.medium` | None | 2 | 4 GB | ~$0.04 |
| Budget GPU | `g4dn.xlarge` | T4 (16GB) | 4 | 16 GB | ~$0.50 |
*Prices approximate, on-demand, us-east-1. Use spot instances for ~70% savings.
Recommended AMI: "Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04)" — comes with CUDA drivers pre-installed.
For a fresh cloud instance with the repository already cloned:
cd ThinkingPLM
chmod +x setup_cloud.sh && ./setup_cloud.sh
source ~/.bashrc && conda activate profam_bagel

The `setup_cloud.sh` script:
- Installs Miniconda if conda is not available
- Creates the `profam_bagel` conda environment with all dependencies
- Downloads the ProFam model checkpoint (~3GB)
- Verifies GPU availability and all imports
If the one-command setup fails, follow these steps:
# 1. Install Miniconda (if conda not available)
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p $HOME/miniconda3
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
conda init bash
source ~/.bashrc
# 2. Clone and enter repo
git clone https://github.com/JudeWells/ThinkingPLM.git
cd ThinkingPLM
# 3. Run environment setup
chmod +x setup_environment.sh
./setup_environment.sh
# 4. Activate and download model
conda activate profam_bagel
python -c "from huggingface_hub import snapshot_download; snapshot_download('alex-hh/profam-1', local_dir='.profam_repo/model_checkpoints/profam-1')"
# 5. Verify GPU
nvidia-smi
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

Option A: Local GPU mode (requires GPU instance like `g5.xlarge`)

Set `run_on_modal: false` in your pipeline config:

run_on_modal: false

Run with:

export MODEL_DIR=~/.cache/bagel/models
python run_profam_bagel_pipeline.py --config configs/pipelines/your_config.yaml

Option B: Modal mode (works on any instance, including CPU-only)
Set run_on_modal: true in your pipeline config (this is the default). All GPU work happens on Modal's cloud infrastructure.
First configure Modal:
modal token new # Authenticate
modal secret create huggingface-secret HF_TOKEN=hf_xxxxx # For model access

Then run:

python run_profam_bagel_pipeline.py --config configs/pipelines/your_config.yaml

| Problem | Solution |
|---|---|
| `conda: command not found` | Run the Miniconda installation commands above |
| `CUDA out of memory` | Reduce `profam_num_samples`, or use a larger GPU instance |
| `No CUDA GPUs available` | Check `nvidia-smi`; may need driver install or instance restart |
| `libcudnn.so not found` | `conda install cudnn -c conda-forge` |
| SSH disconnects during long runs | Use tmux or screen: `tmux new -s pipeline` |
| Slow first Modal run | Normal: the container image builds on first run (~5-10 min) |
# Using tmux (recommended)
tmux new -s pipeline
conda activate profam_bagel
python run_profam_bagel_pipeline.py --config configs/pipelines/your_config.yaml
# Press Ctrl+B then D to detach; `tmux attach -t pipeline` to reconnect
# Using nohup
nohup python run_profam_bagel_pipeline.py --config configs/pipelines/your_config.yaml > run.log 2>&1 &
tail -f run.log
# Monitor progress
tail -f outputs/your_run/cycle_stats.json

For a run with `output_dir: outputs/pipeline_run1`, the pipeline creates:
- `outputs/pipeline_run1/cycle_stats.json`
  - Dictionary keyed by cycle number (as a string), e.g. `"1"`, `"2"`, ...
  - Each entry contains:
    - `cycle`: integer cycle index
    - `all_avg_energy`, `all_min_energy`: statistics over all generated sequences
    - `all_avg_similarity`: average sequence similarity to initial sequences
    - `best_sequence`: dict with the lowest-energy sequence's details
    - `selected_avg_energy`, `selected_min_energy`: statistics over the selected subset
    - `num_selected`, `selected_indices`: subset selection info
    - `selected_sequences`: list of dicts with `index`, `sequence`, `energy`, and `energy_terms`
- Structures (when folding was performed):
  - `outputs/pipeline_run1/cycle_XXX/sequences_cycle_all_XXX/sequence_XXXX.cif`: all sequences folded in that cycle
  - `outputs/pipeline_run1/sequences_cycle_XXX/sequence_XXXX.cif`: only the selected subset
- Plot:
  - `outputs/pipeline_run1/energy_summary.png`: line plot of average and minimum energies vs cycle index (left y-axis), with average sequence similarity on the right y-axis
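For post-run analysis, `cycle_stats.json` can be read back with a few lines of Python. This sketch relies only on the fields listed above; `summarise_run` is a hypothetical helper, and the cycle keys are sorted numerically because they are stored as strings.

```python
import json
from pathlib import Path


def summarise_run(stats_path):
    """Return (cycle, avg energy, min energy, avg similarity) per cycle, in order."""
    stats = json.loads(Path(stats_path).read_text())
    rows = []
    # Keys are strings such as "1", "2", "10"; sort them as integers.
    for key in sorted(stats, key=int):
        entry = stats[key]
        rows.append(
            (
                entry["cycle"],
                entry["all_avg_energy"],
                entry["all_min_energy"],
                entry["all_avg_similarity"],
            )
        )
    return rows
```

For example, `summarise_run("outputs/pipeline_run1/cycle_stats.json")` returns one tuple per completed cycle, which is convenient for custom plots or for checking convergence of the minimum energy.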
| Problem | Solution |
|---|---|
| `modal.exception.AuthError: Token missing` | Run `modal token new` to authenticate |
| `AssertionError: MODEL_DIR must be set` | Set `export MODEL_DIR=~/.cache/bagel/models` (or wherever your ESMFold weights are) |
| `ImportError: lightning` or `torchvision` errors | `torch`/`torchvision` version mismatch; re-run `setup_environment.sh` or manually install matching versions (see Section 2) |
| `boileroom` not found | Ensure BAGEL was installed: `pip install "biobagel[local] @ git+https://github.com/JudeWells/bagel.git"` |
| ProFam checkpoint not found | Download with `python -c "from huggingface_hub import snapshot_download; snapshot_download('alex-hh/profam-1', local_dir='.profam_repo/model_checkpoints/profam-1')"` |
| Slow first Modal run | First run builds the container image; subsequent runs reuse the cached image |
| `enforce_template` has no effect | The `fixed_positions` feature used by constrained generation is only available in newer (unreleased) versions of ProFam. The current GitHub version (`main` branch) does not support it. If your local ProFam has `fixed_positions` (e.g. from a development branch), the pipeline will use it; otherwise it prints a warning and generates freely. Template mismatches then receive infinity energy and the cycle is retried. |
| Problem | Solution |
|---|---|
| `conda: command not found` | Install Miniconda: `wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh -b` then `source ~/.bashrc` |
| `CUDA out of memory` | Reduce `profam_num_samples` in config, use smaller batch size, or upgrade to larger GPU instance |
| `No CUDA GPUs available` | Check `nvidia-smi`; ensure you're on a GPU instance; may need `sudo nvidia-smi -pm 1` |
| `libcudnn.so not found` | Install cuDNN: `conda install cudnn -c conda-forge` |
| `libcuda.so.1: cannot open` | NVIDIA drivers not installed; use Deep Learning AMI or install drivers manually |
| SSH disconnects during long runs | Use `tmux` or `screen` to keep session alive |
| Disk space errors | Clear HuggingFace cache: `rm -rf ~/.cache/huggingface` |
| `Permission denied` on HF download | Set `HF_TOKEN` env var or run `huggingface-cli login` |
| Instance terminated mid-run | Use spot instance interruption handling or switch to on-demand |
profam_bagel/
├── run_profam_bagel_pipeline.py # Main pipeline entrypoint
├── run_profam_bagel_modal_app.py # Modal app for cloud execution
├── setup_environment.sh # Environment setup script (local/HPC)
├── setup_cloud.sh # Cloud VM setup (AWS/GCP/Azure)
├── example_pipeline_config.yaml # Example YAML config
├── example_energy_template_match.yaml # Example energy config (template matching)
├── example_energy_lis_binding.yaml # Example energy config (LIS binding with target chain)
├── run_pipeline_pbs.sh # PBS cluster batch script
├── run_pipeline_mac.sh # Local convenience wrapper
├── initial_sequences.fasta # Example initial sequences
├── configs/ # Configuration files
│ ├── pipelines/ # Pipeline YAML configs
│ ├── energy/ # Energy YAML configs
│ └── sequences/ # Initial FASTA sequences
└── model_checkpoints/ # ProFam model weights (user-downloaded)
External dependencies (installed via pip from GitHub):
- BAGEL (`biobagel`): `pip install "biobagel[local] @ git+https://github.com/softnanolab/bagel.git"` (structure prediction + energy terms)
- ProFam: `pip install "git+https://github.com/alex-hh/profam.git"` (protein sequence generation)