GitHub - JudeWells/ThinkingPLM

ProFam + BAGEL Generative Pipeline

This repository implements an iterative generative design pipeline that combines ProFam (protein sequence generation) and BAGEL (structure prediction + energies), driven by simple YAML configuration files. ProFam and BAGEL are installed as external pip packages from their GitHub repositories — they are not part of this codebase.

The main entrypoint is run_profam_bagel_pipeline.py at the repository root. Convenience scripts are also provided for running on a PBS cluster or on a local Mac.

1. High-level pipeline overview

For each cycle (1 … max_cycles), the pipeline:

ProFam generation
- Reads an input FASTA containing the initial sequences and, from cycle 2 onward, the subset selected in the previous cycle.
- Calls ProFam's sampling API directly to generate profam_num_samples new sequences.
BAGEL folding + energy evaluation
- For each generated sequence:
  - Builds a bagel.Chain for the generated sequence (chain ID GEN). If any energy term specifies a target, builds additional target chains (chain IDs taken from the non-GEN keys of that term's residues dict, e.g. A or B) and folds the multi-chain complex.
  - Uses the configured folding oracle (e.g. ESMFold) to predict the 3D structure (only when at least one energy term requires it).
  - Computes a weighted total energy as the sum of user-configured BAGEL energy terms.
- Computes the average sequence similarity between generated sequences and the initial input sequences.
Probability assignment
- Converts energies into probabilities via softmax over -E / T (where T = softmax_temperature).
Subset selection and logging
- Samples a subset of size floor(f_inject * N_output) (at least 1), with or without replacement depending on sample_with_reinsertion.
- Writes a JSON log per cycle with energies, sequence similarity, selected indices, sequences, and per-term breakdowns.
- Saves CIF structures for all generated sequences and for the selected subset (when folding was performed).
Injection into the next cycle
- When reinject_initial is true (default), the selected subset is merged with the initial sequences to form the input FASTA for the next ProFam call.
- When reinject_initial is false, only the selected subset is used from cycle 2 onward.
Summary plot
- After the final cycle, generates a plot of average and minimum energy vs cycle index, with average sequence similarity on a secondary y-axis.
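
The probability assignment and subset selection steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the repository's implementation: the function names are hypothetical, and treating infinite energies as probability zero is an assumption made here for the sketch.

```python
import math
import random


def energies_to_probabilities(energies, temperature=1.0):
    """Softmax over -E / T; infinite energies get probability 0 (assumption)."""
    finite = [e for e in energies if math.isfinite(e)]
    if not finite:
        raise ValueError("all energies are infinite; nothing to select from")
    e_min = min(finite)  # shift by the minimum for numerical stability
    weights = [
        math.exp(-(e - e_min) / temperature) if math.isfinite(e) else 0.0
        for e in energies
    ]
    total = sum(weights)
    return [w / total for w in weights]


def select_subset(energies, f_inject, temperature=1.0, with_reinsertion=True, rng=None):
    """Sample floor(f_inject * N) indices (at least 1) according to the softmax."""
    rng = rng or random.Random()
    probs = energies_to_probabilities(energies, temperature)
    k = max(1, math.floor(f_inject * len(energies)))
    indices = list(range(len(energies)))
    if with_reinsertion:
        return rng.choices(indices, weights=probs, k=k)
    # Without replacement: draw one index at a time and remove it from the pool.
    chosen = []
    pool = list(zip(indices, probs))
    for _ in range(min(k, sum(p > 0 for p in probs))):
        idxs, weights = zip(*pool)
        pick = rng.choices(idxs, weights=weights, k=1)[0]
        chosen.append(pick)
        pool = [(i, w) for i, w in pool if i != pick]
    return chosen
```

A lower softmax_temperature concentrates the probability mass on the lowest-energy sequences, while a higher value makes the selection closer to uniform.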

2. Requirements and environment setup

Requirement	Details
Python	3.11 (BAGEL requires `>=3.11,<3.14`)
conda	Recommended for environment management
GPU	Needed for ESMFold folding (either locally, on Modal, or on a cluster)

Dependency notes

ProFam and BAGEL have overlapping but conflicting dependency pins:

Package	BAGEL requires	ProFam freeze
numpy	`>=2.2.0`	`==1.26.4`
matplotlib	`>=3.10.0`	`==3.9.4`
transformers	`>=4.49.0`	`==4.48.3`

BAGEL also needs boileroom==0.2.2, pydantic, modal, and biotite, which are not in ProFam's dependency list. The provided setup script resolves all of these conflicts automatically.

Additionally, boileroom==0.2.2 constrains the torch version it installs (currently torch==2.6.0), and torchvision/torchaudio must match exactly (e.g. torchvision==0.21.0 for torch==2.6.0). The setup script handles this automatically by detecting the installed torch version and installing matching companion packages.
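
As an illustration of the version matching described above, the lookup below maps an installed torch version to companion pins. It is a sketch only: the function and table names are hypothetical, the 2.6.0 row comes from this README, and the other rows are PyTorch release pairings added here as examples. The setup script may detect versions differently.

```python
# Known torch -> (torchvision, torchaudio) release pairs; extend as needed.
COMPANIONS = {
    "2.4.1": ("0.19.1", "2.4.1"),
    "2.5.1": ("0.20.1", "2.5.1"),
    "2.6.0": ("0.21.0", "2.6.0"),
}


def companion_pins(torch_version):
    """Return pip requirement strings matching an installed torch version."""
    base = torch_version.split("+")[0]  # drop local tags such as "+cu124"
    try:
        vision, audio = COMPANIONS[base]
    except KeyError:
        raise ValueError(f"no known companion versions for torch {base}") from None
    return [f"torchvision=={vision}", f"torchaudio=={audio}"]
```

In practice the input would come from `torch.__version__`, and the returned strings would be passed to pip.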

Quick setup (recommended)

# From the repository root:
chmod +x setup_environment.sh
./setup_environment.sh

This script will:

Create a profam_bagel conda environment with Python 3.11.
Install BAGEL (biobagel) from GitHub — pulls boileroom, biotite, numpy, pydantic, etc.
Detect the torch version that boileroom installed and install matching torchvision/torchaudio.
Install ProFam from GitHub — pulls transformers, lightning, hydra-core, etc.
Install pipeline utilities (pyyaml, modal).
Verify that all key imports work.

After the script completes:

conda activate profam_bagel
python run_profam_bagel_pipeline.py --config example_pipeline_config.yaml

Manual setup

If you prefer to set up the environment manually:

conda create -n profam_bagel python=3.11 -y
conda activate profam_bagel

# 1. Install BAGEL from GitHub (sets the torch version via boileroom)
pip install "biobagel[local] @ git+https://github.com/softnanolab/bagel.git"

# 2. Install matching torchvision/torchaudio for the torch version boileroom pulled
#    Check: python -c "import torch; print(torch.__version__)"
#    For torch 2.6.x:
pip install torchvision==0.21.0 torchaudio==2.6.0

# 3. Install ProFam from GitHub
pip install "git+https://github.com/alex-hh/profam.git"

# 4. Install additional ProFam runtime dependencies not in its setup.py
pip install rootutils safetensors huggingface-hub biopython scipy scikit-learn

# 5. Install pipeline utilities
pip install pyyaml modal

Prerequisites

Before running the pipeline you also need:

ProFam model checkpoint — download it into model_checkpoints/:

python -c "from huggingface_hub import snapshot_download; snapshot_download('alex-hh/profam-1', local_dir='model_checkpoints/profam-1')"

Template structure (if using TemplateMatchEnergy) — place the .cif file at the path specified in your energy config (e.g. template.cif in the repo root), or use a pdb_code to download it automatically.
Modal token (if using run_on_modal: true) — authenticate with:
```
modal token new
```

3. Pipeline configuration

The pipeline is configured with two YAML files:

A pipeline YAML file for high-level pipeline settings.
An energy YAML file specifying the BAGEL folding oracle and energy terms.
CLI flags can override any value from the pipeline YAML.

3.1 YAML configuration (pipeline)

Minimal example (see example_pipeline_config.yaml for the full version):

# Input sequences
initial_fasta: initial_sequences.fasta

# ProFam
profam_checkpoint_dir: model_checkpoints/profam-1
profam_sampler: single
profam_num_samples: 64
profam_temperature: 0.8
profam_top_p: 0.95

# BAGEL energy configuration
energy_config: example_energy_template_match.yaml

# Pipeline control
f_inject: 0.25
max_cycles: 10
output_dir: outputs/pipeline_run1
softmax_temperature: 1.0
random_seed: 42

# Subset selection
sample_with_reinsertion: true   # false = sample without replacement
reinject_initial: true          # false = don't prepend initial seqs from cycle 2+
n_memory: 0                     # pool previous cycles' sequences for selection

# Run entire pipeline on Modal (GPU in the cloud)
run_on_modal: true
output_frequency: 1             # sync results every N cycles

# Constrained generation
enforce_template: true          # force template residue identity during generation

Required keys (must be provided either in YAML or via CLI):

initial_fasta
profam_checkpoint_dir
energy_config
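
The override and required-key rules can be sketched as a small merge function. This is an assumed shape for illustration (plain dicts standing in for the parsed YAML and the parsed CLI arguments, with unset CLI flags represented as None); the pipeline's own argument handling may differ.

```python
REQUIRED_KEYS = ("initial_fasta", "profam_checkpoint_dir", "energy_config")


def merge_config(yaml_values, cli_values):
    """Overlay CLI values (ignoring unset ones) on top of the YAML values."""
    merged = dict(yaml_values)
    merged.update({k: v for k, v in cli_values.items() if v is not None})
    missing = [k for k in REQUIRED_KEYS if merged.get(k) is None]
    if missing:
        raise ValueError(f"missing required config keys: {', '.join(missing)}")
    return merged
```

For example, a YAML file with max_cycles: 10 combined with a CLI value of 5 yields max_cycles 5, and a configuration that lacks energy_config in both places is rejected.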

3.2 Energy YAML configuration

The energy YAML defines the folding oracle(s) and a list of energy terms. Two schemas are supported: single-oracle (legacy) and multi-oracle (for running several predictors side-by-side).

Single-oracle schema:

folding_oracle:
  type: Boltz
  kwargs:
    diffusion_samples: 5   # Boltz averages N samples in one subprocess call
    recycling_steps: 3

energies:
  - type: ipSAEEnergy
    kwargs:
      weight: 1.0
      pae_cutoff: 10.0
      target: <TARGET_SEQUENCE>
      residues:
        GEN: "all"
        B: "all"
  - type: iPTMEnergy
    kwargs:
      weight: 0.05

Supported oracle types (via folding_oracle.type or each entry under folding_oracles:): ESMFold, BatchedESMFold, Boltz (Boltz2), Chai1, AF2BindCraft, AlphaFast, ColabFold. kwargs are passed verbatim to the oracle class's constructor — see the BAGEL source for each oracle's full parameter list. Notable per-oracle ensembling kwargs:

Boltz.diffusion_samples (int, default 1) — Boltz averages N diffusion samples inside a single boltz predict --diffusion_samples N subprocess call. The oracle returns a single BoltzResult whose scalar (pTM, iPTM) and tensor (pLDDT, PAE) metrics are averaged across samples; the structure comes from sample 0.
Chai1.num_diffn_samples (int, default 5) — Chai-1's native ensemble.
AF2BindCraft.prediction_models (list[int], default [0, 1]) — which AF2 models to average over.

Multi-oracle schema (runs every listed oracle on every sequence and extracts per-oracle metrics):

folding_oracles:
  esmfold:
    type: BatchedESMFold
    kwargs: {use_modal: false}
  boltz2:
    type: Boltz
    kwargs: {diffusion_samples: 5}
  chai1:
    type: Chai1
    kwargs: {num_diffn_samples: 3}
  af2:
    type: AF2BindCraft
    kwargs:
      target_pdb: /abs/path/to/target.pdb
      target_chain: A
      conda_env: BindCraft

energies:
  - type: iPTMEnergy
    oracle: boltz2        # per-energy oracle reference
    kwargs: {weight: 0.05, name: b2}
  - type: iPTMEnergy
    oracle: chai1
    kwargs: {weight: 0.05, name: chai}
  - type: PLDDTEnergy
    oracle: esmfold
    kwargs: {weight: 0.1, name: esm, residues: {GEN: "all"}}
  # ... mix and match to get the same metric from multiple predictors

In multi-oracle mode each sequence is folded by every referenced oracle (with a torch.cuda.empty_cache() between calls so in-process models don't OOM each other). Energy terms without an explicit oracle: key fall back to the first oracle in folding_oracles. Use the name: kwarg on each energy term to keep metrics from different oracles distinct in the cycle stats (e.g. iPTM_b2 vs iPTM_chai).
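
The fallback rule for energy terms without an oracle: key can be illustrated as follows. The function name and the use of plain dicts (as they would look after loading the energy YAML) are assumptions for this sketch, not the pipeline's internal API.

```python
def resolve_oracles(energy_entries, folding_oracles):
    """Map each energy entry to an oracle name, defaulting to the first oracle."""
    if not folding_oracles:
        raise ValueError("folding_oracles must define at least one oracle")
    default_name = next(iter(folding_oracles))  # dicts keep insertion order
    resolved = []
    for entry in energy_entries:
        name = entry.get("oracle", default_name)
        if name not in folding_oracles:
            raise KeyError(f"energy term {entry['type']} references unknown oracle {name!r}")
        resolved.append((entry["type"], name))
    return resolved
```

Because the default is the first entry under folding_oracles, reordering that mapping changes which predictor unbound energy terms use.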

energies: Each entry is a BAGEL energy term class from bagel.energies:
- type: e.g. ipSAEEnergy, iPTMEnergy, PLDDTEnergy, LISEnergy, SolMPNNPerplexityEnergy, TemplateMatchEnergy, SeparationEnergy, etc.
- oracle (optional, multi-oracle mode only): name of the oracle to bind this term to. Defaults to the first oracle in folding_oracles.
- kwargs: Passed to the energy term's __init__, with oracle set automatically to the resolved oracle instance.
- If a residues field is present, it must be a dictionary mapping chain names to residue specifications. The generated chain always uses the key GEN. For multi-chain energy terms, additional keys identify target chains (see sections 3.5 and 3.8). Values can be compact range strings (e.g. "0-43"), integer lists, or "all".
- If a target field is present, the pipeline builds a multi-chain system for that energy term (see section 3.5).
- Structure references (e.g. template_structure_path) can be replaced with pdb_code to download directly from the RCSB PDB (see section 3.9).

3.3 TemplateMatchEnergy

TemplateMatchEnergy computes the RMSD between a subset of the generated structure and the corresponding subset of a reference template structure. The key concept is that the residues.GEN indices specify the same 0-based positions in both structures:

residues: A dictionary with a single key GEN whose value lists 0-based residue positions extracted from both the generated (folded) structure and the template chain. Values can be a compact range string (e.g. "0-43", see section 3.8) or a list of integers. Because the same positions are used on both sides, the atom counts always match.
Template structure loading: Provide either:
- template_structure_path + optional template_chain_id: a local CIF/PDB file and chain to filter by.
- pdb_code + optional template_chain_id: a PDB identifier to download from RCSB (see section 3.9).
backbone_only: When true, only backbone atoms (CA, N, C — 3 per residue) are compared on both sides. When false, all atoms are compared — this requires that the amino-acid identity at each compared position is the same in both structures (otherwise the per-residue atom counts differ).

Example with local file (see example_energy_template_match.yaml):

- type: TemplateMatchEnergy
  kwargs:
    weight: 1.0
    backbone_only: true
    template_structure_path: template.cif
    template_chain_id: A
    residues:
      GEN: "0-43"

Example with PDB download:

- type: TemplateMatchEnergy
  kwargs:
    weight: 1.0
    backbone_only: true
    pdb_code: "1ubq"
    template_chain_id: A
    residues:
      GEN: "0-43"

Here the first 44 residues (0-based) are compared between the generated structure and the template chain.

3.4 Constrained generation (`enforce_template`)

When TemplateMatchEnergy is used, the generated sequence must have the correct amino-acid identity at the template-matching positions; otherwise the atom counts differ and evaluation fails. The enforce_template YAML flag controls how this is handled:

`enforce_template`	Behaviour
`true` (default)	During ProFam generation, the amino acids at the `residues` positions are forced to match the template sequence using a logits processor. This guarantees the correct identity at constrained positions while letting the model freely generate the remaining positions.
`false`	ProFam generates freely. If a generated sequence has a different amino acid at a template position, the atom count mismatch causes a `ValueError` during energy evaluation. The pipeline catches this error and assigns infinite energy to that sequence. If all sequences in a cycle receive infinite energy, the cycle is retried (up to 5 attempts).
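
The idea behind forcing residues with a logits processor can be shown on plain Python lists. This is a conceptual sketch under stated assumptions: the function names and the position-to-token mapping are hypothetical, and the real processor operates on ProFam's tensors and tokenizer rather than on lists.

```python
import math


def force_token(logits, required_index):
    """Return logits where only `required_index` keeps a finite score."""
    return [
        score if i == required_index else -math.inf
        for i, score in enumerate(logits)
    ]


def constrain_step(logits, position, template_tokens):
    """Apply the constraint only at positions listed in `template_tokens`.

    `template_tokens` maps a 0-based sequence position to the token index that
    the template requires there (a hypothetical representation).
    """
    if position in template_tokens:
        return force_token(logits, template_tokens[position])
    return list(logits)
```

After the softmax, a position masked this way assigns probability 1 to the required token, while unconstrained positions are sampled as usual.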

3.5 Multi-chain energy terms (`"target"`)

Energy terms that operate on two groups of residues (e.g. PAEEnergy, SeparationEnergy, LISEnergy) can be used to evaluate interactions between the ProFam-generated chain and a fixed target chain. This is useful for designing sequences that bind to or interact with a known protein.

To enable this, specify the target chain in one of two ways:

Option A — inline sequence:

- type: LISEnergy
  kwargs:
    weight: 1.0
    target: MKTAYIAKQRQISFVKSH...
    residues:
      GEN: "0-19"
      B: "0-9"

The non-GEN key (B here) becomes the target chain's identifier in the output CIF files.

Option B — download from PDB:

- type: LISEnergy
  kwargs:
    weight: 1.0
    target_pdb_code: "1ubq"
    target_chain_id: A
    residues:
      GEN: "0-19"
      A: "0-9"

The non-GEN key must match the target_chain_id value.

When a target is present:

The pipeline builds a target chain from the provided sequence (or downloaded from PDB). Its chain ID is taken from the non-GEN key in the residues dict.
The generated sequence uses chain ID GEN, the target uses the key from residues.
ESMFold folds both chains together as a multi-chain complex (using its native ":" separator).
The "residues" dict maps each chain key to its residue indices. GEN = residues on the generated chain, the other key = residues on the target chain. Compact range strings are supported (see section 3.8).
The energy term (e.g. LISEnergy) computes the metric across the two chains using the predicted complex structure.
Output CIF files use these chain IDs (GEN for the generated chain, the target key for the target chain).

A full example is provided in example_energy_lis_binding.yaml.

Multiple energy entries can each have their own target — if two entries share the same target sequence and chain ID, the pipeline de-duplicates them into a single target chain.

3.6 SolubleMPNN perplexity (designability) energy

SolMPNNPerplexityEnergy scores how well a sequence fits its predicted backbone using SolubleMPNN's autoregressive perplexity. Lower perplexity means the sequence is "well-justified" by the backbone. It works with any folding oracle (ESMFold, Boltz, Chai-1, AF2BindCraft) because it only reads the predicted structure from the oracle result.

Two backends (select via use_modal):

use_modal: true — routes scoring through a deployed Modal app. Kept for backwards compatibility; requires modal deploy modal_proteinmpnn_score.py.
use_modal: false (default, recommended) — runs locally by invoking a bundled BAGEL script (bagel.scripts.proteinmpnn_scorer) inside a separate conda env (to isolate ProteinMPNN's older dependency pins from the main BAGEL/Boltz stack). Requires:
1. A clone of ProteinMPNN.
2. A conda env (e.g. proteinmpnn) with torch + numpy that can import it.

Example config:

- type: SolMPNNPerplexityEnergy
  kwargs:
    weight: 0.1
    use_modal: false
    proteinmpnn_env: proteinmpnn
    proteinmpnn_path: /mnt/disk2/ProteinMPNN
    backbone_noise: 0.1      # augment_eps — Gaussian noise on backbone coords
    ensemble_n: 10           # forward passes per call (each with independent noise + decoding order)
    decoding_order: random   # or "fixed:<seed>" for determinism
    residues:
      GEN: "all"

Ensemble semantics:

Each of the ensemble_n passes re-featurises the structure (applying fresh Gaussian backbone noise when backbone_noise > 0) and samples a new decoding order. The returned perplexity is exp(mean NLL) across all passes. Setting backbone_noise > 0 together with ensemble_n > 1 propagates structural uncertainty into the score.
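
The exp(mean NLL) aggregation can be written out explicitly. The sketch below assumes that each pass yields one negative log-likelihood per scored residue and that the mean is pooled over all residues and passes; the function name is hypothetical and the bundled scorer may aggregate differently.

```python
import math


def ensemble_perplexity(nll_per_pass):
    """exp(mean NLL) pooled over every scored residue in every pass."""
    values = [nll for single_pass in nll_per_pass for nll in single_pass]
    if not values:
        raise ValueError("need at least one per-residue NLL value")
    return math.exp(sum(values) / len(values))
```

A perplexity of 1 corresponds to a model that assigns probability 1 to every observed residue; a uniform guess over 20 amino acids gives a perplexity of 20.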

Complex context (important):

The energy is always computed on whatever structure the folding oracle produced. In standard binder campaigns the oracle folds binder + target together, so SolMPNN sees the full complex — and even though only residues on the GEN chain contribute to the loss, the encoder sees the target as "binding context". This is what you want for binder design: a good interface residue is one that is justified given the target, not one that is stable in isolation.

Validation (see test_mpnn_context_significance.py): On 1YCR (p53 peptide + MDM2), the p53 peptide's perplexity drops from 16.57 ± 0.19 (monomer only) to 4.47 ± 0.07 (in MDM2 complex) — a 73% drop, t-statistic ~-188. This confirms the complex context meaningfully informs the score, well beyond the natural ensemble variance. Always pass the complex to SolMPNN for binder scoring.

3.7 Sequence similarity tracking

At each cycle, the pipeline automatically computes the average sequence similarity between the ProFam-generated sequences and the initial input sequences (from initial_fasta). For each generated sequence, similarity to each initial sequence is computed via global pairwise alignment (Needleman–Wunsch, using Biopython's PairwiseAligner) as the fraction of identical residues over the alignment length. This correctly handles insertions and deletions: a single indel does not shift all downstream positions. The best (maximum) similarity across all initial sequences is kept for each generated sequence, and the mean of those best-match values is:

Printed to the console during the run.
Stored in cycle_stats.json as "all_avg_similarity" per cycle.
Plotted on a secondary y-axis in energy_summary.png.

This metric helps monitor how much the generated sequences drift from the starting point over successive cycles.
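
To make the metric concrete, here is a dependency-free sketch of identity over alignment length followed by the best-match average. It is not the pipeline's code: the pipeline uses Biopython's PairwiseAligner, whereas this version implements a plain Needleman-Wunsch recursion with assumed scores (match +1, mismatch -1, gap -1), so the exact values can differ from the pipeline's.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Fraction of identical columns in one optimal global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the corner, counting identical columns and total columns.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            if score[i][j] == diag:
                identical += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return identical / columns if columns else 0.0


def average_best_similarity(generated, initial):
    """Mean over generated sequences of the best identity to any initial sequence."""
    best = [max(global_identity(g, ref) for ref in initial) for g in generated]
    return sum(best) / len(best)
```

With these scores, a sequence that differs from a reference by one deleted residue (for example ACEFG against ACDEFG) aligns with a single gap column, giving 5 identical columns out of 6.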

3.8 Compact residue notation

The residues field is a dictionary mapping chain names to residue specifications. Each value can be a compact range string instead of a full list of integers:

Format	Expands to
`"5"`	`[5]`
`"1,2,5"`	`[1, 2, 5]`
`"0-43"`	`[0, 1, 2, ..., 43]`
`"0-5,10,20-25"`	`[0, 1, 2, 3, 4, 5, 10, 20, 21, 22, 23, 24, 25]`

Single-chain energy terms (e.g. TemplateMatchEnergy) use only the GEN key:

residues:
  GEN: "0-43"

Multi-chain energy terms (e.g. PAEEnergy, LISEnergy, SeparationEnergy) use GEN for the generated chain and the target's chain ID as the second key:

residues:
  GEN: "0-19"
  A: "0-9"

This maps residues 0–19 on the generated chain (GEN) and residues 0–9 on the target chain (A). The GEN key is always group 0, the target key is group 1.

The explicit integer list format (e.g. [0,1,2,3]) and the "all" shorthand continue to work as values within the dict.
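
A small parser shows how the three accepted value forms expand. This is a sketch with a hypothetical function name; how the pipeline resolves "all" (here, via an explicit chain length argument) is an assumption.

```python
def parse_residues(spec, chain_length=None):
    """Expand a residue spec ("all", "0-5,10", or a list of ints) into a list."""
    if isinstance(spec, (list, tuple)):
        return [int(x) for x in spec]
    text = str(spec).strip()
    if text == "all":
        if chain_length is None:
            raise ValueError('"all" needs the chain length to expand')
        return list(range(chain_length))
    indices = []
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            indices.extend(range(start, end + 1))  # ranges are inclusive
        else:
            indices.append(int(part))
    return indices
```

Applied to the table above, parse_residues("0-5,10,20-25") returns the 13 indices listed in the last row.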

3.9 PDB structure download

Instead of providing a local CIF/PDB file, you can specify a PDB code and the pipeline will download the structure from the RCSB PDB:

pdb_code: "1ubq"
template_chain_id: A

This replaces template_structure_path. Downloaded files are cached in ~/.cache/profam_bagel/pdb/ so they are not re-downloaded on subsequent runs.

For multi-chain energy targets, use target_pdb_code and target_chain_id instead of an inline target sequence:

target_pdb_code: "1ubq"
target_chain_id: A

The pipeline downloads the CIF, extracts the specified chain, and uses its sequence as the target.

The utility functions download_pdb_cif() and extract_chain_from_cif() are also available for standalone use from run_profam_bagel_pipeline.
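
For readers who want the same download-and-cache behaviour outside the pipeline, the following standalone sketch shows one way to do it. The helper names and the cache file naming are hypothetical and do not describe download_pdb_cif() itself; the URL pattern is the public RCSB file download endpoint.

```python
from pathlib import Path
from urllib.request import urlretrieve

CACHE_DIR = Path.home() / ".cache" / "profam_bagel" / "pdb"


def rcsb_cif_url(pdb_code):
    """URL of the mmCIF file for a 4-character PDB code on files.rcsb.org."""
    return f"https://files.rcsb.org/download/{pdb_code.upper()}.cif"


def cached_cif_path(pdb_code, cache_dir=CACHE_DIR):
    """Location where the downloaded file is kept between runs (assumed naming)."""
    return Path(cache_dir) / f"{pdb_code.lower()}.cif"


def fetch_cif(pdb_code, cache_dir=CACHE_DIR):
    """Download the structure once; later calls reuse the cached copy."""
    path = cached_cif_path(pdb_code, cache_dir)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(rcsb_cif_url(pdb_code), path)
    return path
```

Calling fetch_cif("1ubq") requires network access the first time and returns the cached path afterwards.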

3.10 Selection strategy

The pipeline supports two strategies for choosing which sequences to condition ProFam on in the next cycle:

`selection_strategy`	Behaviour
`greedy` (default)	Softmax over energies → sample injection set → elitism/conditional swap. The classic path described in sections 1–5 above.
`thompson`	Thompson sampling with Beta posteriors. Treats each sequence as a bandit arm and learns which conditioning sequences produce the best progeny over time.

Thompson sampling

When selection_strategy: thompson, the pipeline uses a multi-armed bandit approach instead of greedy softmax selection:

Arm = a protein sequence with a Beta(α, β) posterior representing the distribution of rewards from its progeny.
Reward = clamp(-ipSAE, 0, 1). Since ipSAE is negative (more negative = better binding), negating and clamping gives a reward in [0, 1].
Bootstrap: When a sequence is first observed, its own ipSAE provides the first observation → Beta(1 + r, 2 - r).

Each cycle:

Sample θᵢ ~ Beta(αᵢ, βᵢ) for every arm.
Pick the arm with the highest θᵢ → use that sequence to condition ProFam.
Generate progeny, evaluate their ipSAE.
Update the chosen arm's posterior: α += reward, β += (1 - reward).
Register all progeny (with finite ipSAE) as new arms.

Max-seeking variant: Setting thompson_m_samples > 1 samples m times from each arm's Beta posterior and takes the maximum. This biases selection toward high-variance (under-explored) arms — useful for encouraging exploration.
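
The arm bookkeeping described above fits in a few functions. This is a minimal sketch using the standard library's Beta sampler; the dict layout and function names are hypothetical, and how the pipeline aggregates progeny rewards into a single update value is not specified here.

```python
import random


def reward_from_ipsae(ipsae):
    """clamp(-ipSAE, 0, 1): more negative ipSAE gives a larger reward."""
    return min(1.0, max(0.0, -ipsae))


def new_arm(ipsae):
    """Bootstrap a Beta(1 + r, 2 - r) posterior from the arm's own ipSAE."""
    r = reward_from_ipsae(ipsae)
    return {"alpha": 1.0 + r, "beta": 2.0 - r}


def pick_arm(arms, m_samples=1, rng=random):
    """Draw m samples per arm, keep each arm's maximum, return the best arm id."""
    def draw(arm):
        return max(rng.betavariate(arm["alpha"], arm["beta"]) for _ in range(m_samples))
    return max(arms, key=lambda arm_id: draw(arms[arm_id]))


def update_arm(arm, reward):
    """Posterior update for the chosen arm: alpha += r, beta += 1 - r."""
    arm["alpha"] += reward
    arm["beta"] += 1.0 - reward
```

With m_samples=1 this is standard Thompson sampling. Larger values take the maximum of several draws per arm, which favours arms whose posteriors are still wide.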

Configuration:

selection_strategy: thompson
thompson_m_samples: 1        # 1 = standard Thompson, >1 = max-seeking
thompson_reward_term: ipSAE  # energy term name used as reward signal

The thompson_reward_term must match a key in the energy_terms dict produced by BAGEL evaluation (e.g. ipSAE from ipSAEEnergy).

Thompson mode produces additional outputs:

thompson_arms.json — full state of all arms (α, β, sequence, parent lineage, selection count), updated each cycle.
cycle_stats.json gains thompson_selected_arm_id, thompson_progeny_reward, and thompson_num_arms fields per cycle.

An example config is at configs/pipelines/pipeline_thompson_example.yaml.

When to use Thompson vs greedy

Scenario	Recommended
Short runs (< 20 cycles), well-understood scaffold	`greedy` with elitism
Long exploratory runs, multiple scaffolds, unknown which sequences are good generators	`thompson`
Need to balance exploration vs exploitation across a growing pool of candidates	`thompson` with `thompson_m_samples: 3`

3.11 Selection memory (`n_memory`)

By default, when selecting which sequences to inject into the next ProFam cycle, only the sequences generated in the current cycle are considered. The n_memory parameter widens this selection pool to include sequences from previous cycles:

n_memory: 3   # include sequences from the last 3 cycles in the pool

`n_memory`	Pool contents
`0` (default)	Current cycle only
`N > 0`	Current cycle + up to the last N cycles

The number of sequences selected for injection stays the same (floor(f_inject * profam_num_samples)); only the candidate pool grows. Probabilities are computed via softmax over the entire pool, so good sequences from earlier cycles can survive even if they were not selected at the time.

Every generated sequence receives a global unique ID that increments monotonically across cycles. With 10 sequences per cycle: cycle 1 produces IDs 0–9, cycle 2 produces IDs 10–19, and so on. These IDs appear in cycle_stats.json as "selected_ids" and in each sequence entry's "id" field, making it unambiguous which sequence is which regardless of cycle of origin.
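
The interaction between n_memory and the global IDs can be illustrated with a toy history of three cycles. The data layout and the build_pool name are assumptions for this sketch; only the pooling rule and the monotonically increasing IDs are taken from the description above.

```python
from itertools import count


def build_pool(history, n_memory):
    """Candidates = current cycle plus up to the last n_memory earlier cycles."""
    recent = history[-(n_memory + 1):]
    return [seq for cycle in recent for seq in cycle]


# Global IDs keep increasing across cycles (two toy sequences per cycle).
next_id = count()
history = []
for cycle_sequences in (["AAA", "CCC"], ["DDD", "EEE"], ["FFF", "GGG"]):
    history.append([{"id": next(next_id), "sequence": s} for s in cycle_sequences])
```

With n_memory 0 the pool holds only the third cycle's sequences (IDs 4 and 5); with n_memory 1 it also includes the second cycle's (IDs 2 to 5).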

4. Running the pipeline locally

From the repository root:

conda activate profam_bagel
python run_profam_bagel_pipeline.py --config example_pipeline_config.yaml

Override any YAML key via CLI:

python run_profam_bagel_pipeline.py \
  --config example_pipeline_config.yaml \
  --max_cycles 5 \
  --f_inject 0.1 \
  --run_on_modal false

Run without a YAML file by supplying all required flags:

python run_profam_bagel_pipeline.py \
  --initial_fasta initial_sequences.fasta \
  --profam_checkpoint_dir model_checkpoints/profam-1 \
  --energy_config example_energy_template_match.yaml \
  --profam_num_samples 64 \
  --f_inject 0.25 \
  --max_cycles 10 \
  --output_dir outputs/pipeline_run1

Note: Running locally with run_on_modal: false requires a GPU with enough memory for ESMFold, and the MODEL_DIR environment variable must point to the folder containing the ESMFold model weights (e.g. export MODEL_DIR=~/.cache/bagel/models).

A convenience wrapper for local runs is also available:

./run_pipeline_mac.sh example_pipeline_config.yaml [extra CLI args...]

5. Running the full pipeline on Modal (cloud GPU)

Modal lets you run the entire pipeline — ProFam generation, ESMFold folding, and energy evaluation — on cloud GPUs without managing any infrastructure.

Setup

Install and authenticate with Modal:

pip install modal    # already included by setup_environment.sh
modal token new      # opens a browser to authenticate

Set run_on_modal: true in your YAML config (this is the default in example_pipeline_config.yaml).

How it works

When run_on_modal: true:

run_profam_bagel_pipeline.py serialises the pipeline configuration and dispatches it to a remote Modal job defined in run_profam_bagel_modal_app.py.
The Modal container is built from a Docker image with all dependencies pre-installed (PyTorch, transformers, boileroom, BAGEL, etc.).
Your local repository is uploaded to /workspace inside the container.
If cached ESMFold model weights exist locally (at MODEL_DIR or ~/.cache/bagel/models), they are uploaded to /models/bagel to avoid re-downloading.
ESMFold's use_modal is forced to false inside the container, regardless of the energy config setting, so folding runs on the container's own GPU.
The job runs on an NVIDIA A10G GPU with a 24-hour timeout.

Launch

conda activate profam_bagel
python run_profam_bagel_pipeline.py --config example_pipeline_config.yaml

That's it — the script handles the rest. You'll see Modal's output stream in your terminal.

6. Running on a PBS/HPC cluster

For institutional HPC clusters that use the PBS job scheduler, a batch script is provided: run_pipeline_pbs.sh.

6.1 Setting up the environment on the cluster

The same setup_environment.sh script works on Linux HPC nodes. The only difference is that on Linux with NVIDIA GPUs, the script installs CUDA-enabled PyTorch packages automatically.

# On the cluster login node:
git clone <this-repo-url> profam_bagel
cd profam_bagel

# Load conda (adapt to your cluster's module system)
module load anaconda3            # or: module load miniconda
# or: source ~/miniconda3/etc/profile.d/conda.sh

# Run the setup script
chmod +x setup_environment.sh
./setup_environment.sh

# Download the ProFam checkpoint
conda activate profam_bagel
python -c "from huggingface_hub import snapshot_download; snapshot_download('alex-hh/profam-1', local_dir='model_checkpoints/profam-1')"

If the cluster does not have internet access on compute nodes, run the setup and model download on the login node (which typically does have internet access).

6.2 Choosing between local folding and Modal folding on the cluster

You have two options for ESMFold folding when running on a cluster:

Option	YAML setting	Energy config `use_modal`	Requirements
A. Local GPU folding	`run_on_modal: false`	`false`	GPU node, `MODEL_DIR` set to ESMFold weights
B. Modal folding	`run_on_modal: false`	`true`	Internet access from compute nodes, Modal token configured

Option A (recommended for GPU clusters): The pipeline runs entirely on the cluster node. Set run_on_modal: false in the pipeline YAML and use_modal: false in the energy YAML. You must set the MODEL_DIR environment variable to the directory containing ESMFold model weights:

export MODEL_DIR=/path/to/esmfold/models

Option B: The pipeline runs on the cluster but ESMFold folding is offloaded to Modal. Set run_on_modal: false in the pipeline YAML but use_modal: true in the energy YAML. This requires internet access from compute nodes and a configured Modal token (modal token new).

Note: Setting run_on_modal: true on a cluster would send the entire pipeline (including ProFam) to Modal, which is usually not what you want on an HPC system — use Option A or B instead.

6.3 Submitting the PBS job

Edit run_pipeline_pbs.sh to match your cluster's resource configuration and add the conda activation command:

# Inside run_pipeline_pbs.sh, uncomment and adapt:
module load anaconda3
source activate profam_bagel

# If using local folding (Option A), also add:
export MODEL_DIR=/path/to/esmfold/models

Then submit:

qsub -v CONFIG=example_pipeline_config.yaml run_pipeline_pbs.sh

The default resource request is:

1 node, 8 CPUs, 1 GPU, 64 GB RAM, 24-hour walltime

Adjust the #PBS -l directives in the script as needed for your cluster.

7. Running on AWS / Cloud VMs

For cloud deployment on AWS EC2, GCP Compute Engine, Azure VMs, or similar infrastructure.

7.1 Recommended instance types (AWS)

Use Case	Instance Type	GPU	vCPU	Memory	Hourly Cost*
Production runs	`g5.xlarge`	A10G (24GB)	4	16 GB	~$1.00
Large batches	`g5.2xlarge`	A10G (24GB)	8	32 GB	~$1.50
Modal offload	`t3.medium`	None	2	4 GB	~$0.04
Budget GPU	`g4dn.xlarge`	T4 (16GB)	4	16 GB	~$0.50

*Prices approximate, on-demand, us-east-1. Use spot instances for ~70% savings.

Recommended AMI: "Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04)" — comes with CUDA drivers pre-installed.

7.2 Quick setup (one command)

For a fresh cloud instance with the repository already cloned:

cd ThinkingPLM
chmod +x setup_cloud.sh && ./setup_cloud.sh
source ~/.bashrc && conda activate profam_bagel

The setup_cloud.sh script:

Installs Miniconda if conda is not available
Creates the profam_bagel conda environment with all dependencies
Downloads the ProFam model checkpoint (~3GB)
Verifies GPU availability and all imports

7.3 Step-by-step setup

If the one-command setup fails, follow these steps:

# 1. Install Miniconda (if conda not available)
wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p $HOME/miniconda3
eval "$($HOME/miniconda3/bin/conda shell.bash hook)"
conda init bash
source ~/.bashrc

# 2. Clone and enter repo
git clone https://github.com/JudeWells/ThinkingPLM.git
cd ThinkingPLM

# 3. Run environment setup
chmod +x setup_environment.sh
./setup_environment.sh

# 4. Activate and download model
conda activate profam_bagel
python -c "from huggingface_hub import snapshot_download; snapshot_download('alex-hh/profam-1', local_dir='model_checkpoints/profam-1')"

# 5. Verify GPU
nvidia-smi
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

7.4 GPU vs Modal mode

Option A: Local GPU mode (requires GPU instance like g5.xlarge)

Set run_on_modal: false in your pipeline config:

run_on_modal: false

Run with:

export MODEL_DIR=~/.cache/bagel/models
python run_profam_bagel_pipeline.py --config configs/pipelines/your_config.yaml

Option B: Modal mode (works on any instance, including CPU-only)

Set run_on_modal: true in your pipeline config (this is the default). All GPU work happens on Modal's cloud infrastructure.

First configure Modal:

modal token new                                          # Authenticate
modal secret create huggingface-secret HF_TOKEN=hf_xxxxx # For model access

Then run:

python run_profam_bagel_pipeline.py --config configs/pipelines/your_config.yaml

7.5 Cloud-specific troubleshooting

Problem	Solution
`conda: command not found`	Run the Miniconda installation commands above
`CUDA out of memory`	Reduce `profam_num_samples`, or use a larger GPU instance
`No CUDA GPUs available`	Check `nvidia-smi`; may need driver install or instance restart
`libcudnn.so not found`	`conda install cudnn -c conda-forge`
SSH disconnects during long runs	Use `tmux` or `screen`: `tmux new -s pipeline`
Slow first Modal run	Normal — container image builds on first run (~5-10 min)

7.6 Running in background (recommended for long runs)

# Using tmux (recommended)
tmux new -s pipeline
conda activate profam_bagel
python run_profam_bagel_pipeline.py --config configs/pipelines/your_config.yaml
# Press Ctrl+B then D to detach; `tmux attach -t pipeline` to reconnect

# Using nohup
nohup python run_profam_bagel_pipeline.py --config configs/pipelines/your_config.yaml > run.log 2>&1 &
tail -f run.log

# Monitor progress
tail -f outputs/your_run/cycle_stats.json
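
As an alternative to tailing the raw JSON, a small script can report only the most recent cycle. This is a sketch under two assumptions: `latest_cycle` is a hypothetical helper, and the file may be caught mid-write, so a parse failure is treated as "no data yet" rather than an error.

```python
import json
from pathlib import Path
from typing import Optional


def latest_cycle(stats_path: str) -> Optional[dict]:
    """Return the entry for the highest-numbered cycle, or None if unavailable."""
    try:
        stats = json.loads(Path(stats_path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        # File not created yet, or read while it was being rewritten.
        return None
    if not stats:
        return None
    # Keys are cycle numbers stored as strings, so compare them as integers.
    last_key = max(stats, key=int)
    return stats[last_key]
```

For example, `latest_cycle("outputs/your_run/cycle_stats.json")` can be called in a loop with a sleep between polls.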

8. Outputs

For a run with output_dir: outputs/pipeline_run1, the pipeline creates:

outputs/pipeline_run1/cycle_stats.json
- Dictionary keyed by cycle number (as a string), e.g. "1", "2", ...
- Each entry contains:
  - cycle: integer cycle index
  - all_avg_energy, all_min_energy: statistics over all generated sequences
  - all_avg_similarity: average sequence similarity to initial sequences
  - best_sequence: dict with the lowest-energy sequence's details
  - selected_avg_energy, selected_min_energy: statistics over the selected subset
  - num_selected, selected_indices: subset selection info
  - selected_sequences: list of dicts with index, sequence, energy, and energy_terms
Structures (when folding was performed):
- outputs/pipeline_run1/cycle_XXX/sequences_cycle_all_XXX/sequence_XXXX.cif — all sequences folded in that cycle
- outputs/pipeline_run1/sequences_cycle_XXX/sequence_XXXX.cif — only the selected subset
Plot:
- outputs/pipeline_run1/energy_summary.png — line plot of average and minimum energies vs cycle index (left y-axis), with average sequence similarity on the right y-axis
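
The structure of cycle_stats.json described above can be read back with a few lines of Python. The following is a minimal sketch that uses only the field names listed in this section; `summarize` and `best_cycle` are hypothetical helper names, and the exact layout of nested fields such as best_sequence is not assumed.

```python
import json
from pathlib import Path


def summarize(stats_path: str) -> list[dict]:
    """Return one summary row per cycle, ordered by cycle number."""
    stats = json.loads(Path(stats_path).read_text())
    rows = []
    # Keys are cycle numbers stored as strings, so sort them numerically.
    for key in sorted(stats, key=int):
        entry = stats[key]
        rows.append({
            "cycle": entry["cycle"],
            "avg_energy": entry["all_avg_energy"],
            "min_energy": entry["all_min_energy"],
            "avg_similarity": entry["all_avg_similarity"],
            "num_selected": entry["num_selected"],
        })
    return rows


def best_cycle(rows: list[dict]) -> dict:
    """Return the row whose minimum energy is lowest across all cycles."""
    return min(rows, key=lambda row: row["min_energy"])
```

Running `best_cycle(summarize("outputs/pipeline_run1/cycle_stats.json"))` gives the cycle that produced the lowest-energy sequence, which is a quick way to decide which cycle's CIF directory to inspect first.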

9. Troubleshooting

General issues

Problem	Solution
`modal.exception.AuthError: Token missing`	Run `modal token new` to authenticate
`AssertionError: MODEL_DIR must be set`	Set `export MODEL_DIR=~/.cache/bagel/models` (or wherever your ESMFold weights are)
`ImportError: lightning` or `torchvision` errors	torch/torchvision version mismatch — re-run `setup_environment.sh` or manually install matching versions (see Section 2)
`boileroom` not found	Ensure BAGEL was installed: `pip install "biobagel[local] @ git+https://github.com/JudeWells/bagel.git"`
ProFam checkpoint not found	Download with `python -c "from huggingface_hub import snapshot_download; snapshot_download('alex-hh/profam-1', local_dir='.profam_repo/model_checkpoints/profam-1')"`
Slow first Modal run	First run builds the container image; subsequent runs reuse the cached image
`enforce_template` has no effect	Constrained generation relies on ProFam's `fixed_positions` feature, which is only available in newer (unreleased) versions; the current GitHub version (`main` branch) does not support it. If your local ProFam has `fixed_positions` (e.g. from a development branch), the pipeline uses it; otherwise it prints a warning and generates without constraints. In that case, sequences that do not match the template receive infinite energy and the cycle is retried.

Cloud/AWS-specific issues

Problem	Solution
`conda: command not found`	Install Miniconda: `wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3 && $HOME/miniconda3/bin/conda init bash`, then `source ~/.bashrc`
`CUDA out of memory`	Reduce `profam_num_samples` in the config, use a smaller batch size, or upgrade to a larger GPU instance
`No CUDA GPUs available`	Check `nvidia-smi`; ensure you're on a GPU instance; may need `sudo nvidia-smi -pm 1`
`libcudnn.so not found`	Install cuDNN: `conda install cudnn -c conda-forge`
`libcuda.so.1: cannot open`	NVIDIA drivers are not installed; use a Deep Learning AMI or install the drivers manually
SSH disconnects during long runs	Use `tmux` or `screen` to keep the session alive
Disk space errors	Clear the HuggingFace cache: `rm -rf ~/.cache/huggingface` (cached models are re-downloaded on next use)
`Permission denied` on HF download	Set the `HF_TOKEN` env var or run `huggingface-cli login`
Instance terminated mid-run	Use spot instance interruption handling or switch to an on-demand instance

10. Project structure

profam_bagel/
├── run_profam_bagel_pipeline.py       # Main pipeline entrypoint
├── run_profam_bagel_modal_app.py      # Modal app for cloud execution
├── setup_environment.sh               # Environment setup script (local/HPC)
├── setup_cloud.sh                     # Cloud VM setup (AWS/GCP/Azure)
├── example_pipeline_config.yaml       # Example YAML config
├── example_energy_template_match.yaml # Example energy config (template matching)
├── example_energy_lis_binding.yaml    # Example energy config (LIS binding with target chain)
├── run_pipeline_pbs.sh                # PBS cluster batch script
├── run_pipeline_mac.sh                # Local convenience wrapper
├── initial_sequences.fasta            # Example initial sequences
├── configs/                           # Configuration files
│   ├── pipelines/                     # Pipeline YAML configs
│   ├── energy/                        # Energy YAML configs
│   └── sequences/                     # Initial FASTA sequences
└── model_checkpoints/                 # ProFam model weights (user-downloaded)

External dependencies (installed via pip from GitHub):

BAGEL (biobagel): pip install "biobagel[local] @ git+https://github.com/softnanolab/bagel.git" — structure prediction + energy terms
ProFam: pip install "git+https://github.com/alex-hh/profam.git" — protein sequence generation
