Merged
14 changes: 7 additions & 7 deletions README.md
@@ -9,7 +9,7 @@

We are releasing a world model for protein biology: a scientific engine for prediction, design, and discovery. Built on the latest generation of Evolutionary Scale Modeling (ESM), this system learns from the protein sequences produced by evolution and uses that knowledge to represent, map, predict, and design proteins across scales — from atomic interactions to evolutionary relationships spanning billions of years. The system includes three artifacts: ESMC, ESMFold2, and ESM Atlas.

**[ESMC](https://biohub.ai/esm/protein)** is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC defines a new scaling frontier relative to ESM2, achieving stronger performance in emergent long-range structural understanding as model scale increases
**[ESMC](https://biohub.ai/esm/protein)** is a state-of-the-art protein language model that has learned the rules of protein biology from training on billions of protein sequences. ESMC defines a new scaling frontier relative to ESM2, achieving stronger performance in emergent long-range structural understanding as model scale increases.


<div align="center">
@@ -21,7 +21,7 @@ We are releasing a world model for protein biology: a scientific engine for pred
**[ESMFold2](https://huggingface.co/Biohub/ESMFold2)**, built on the ESMC 6B model, is a state-of-the-art structure prediction model that has been validated for the design of protein-protein interactions. ESMFold2 surpasses other models in DockQ pass-rate on Foldbench protein-protein and antibody-antigen complexes, and can be used in single-sequence mode for an order of magnitude speedup in folding.

<div align="center">
<img src="_assets/esmfold2_folding.png" width="40%"/>
<img src="_assets/esmfold2_folding.png" width="60%"/>
</div>


@@ -50,7 +50,7 @@ For information on using ESM3, see the [ESM3 README](https://github.com/Biohub/e

[ESMC](https://biohub.ai/esm/protein) is a state-of-the-art protein language model that has learned representations of protein biology from training on billions of protein sequences.

Codebase, model weights, and model variants for ESMC are available through [Hugging Face](https://huggingface.co/collections/Biohub/esmc-model-family).
Codebase, model weights, and model variants for ESMC are available through [Hugging Face](https://huggingface.co/collections/biohub/esmc-model-family).

There are two primary ways of running the ESM models: through the [**Biohub Platform**](https://biohub.ai/) or locally with Hugging Face. The Biohub Platform enables users to easily run inference with ESM models with minimal setup. Users interested in customizing or fine-tuning ESM models can use the models from Hugging Face.

@@ -60,7 +60,7 @@ There are two primary ways of running the ESM models: through the [**Biohub Plat
Install `esm` from GitHub (a PyPI release is coming soon):

```
pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d
pip install esm@git+https://github.com/Biohub/esm.git@main
```

The following code demonstrates how to run ESMC locally
@@ -103,7 +103,7 @@ Note that our API migrated from forge.evolutionaryscale.ai to [biohub.ai](https:
To get started with ESM, install the python library using `pip`:

```
pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d
pip install esm@git+https://github.com/Biohub/esm.git@main
```

Then import the necessary libraries and instantiate your desired model.
@@ -180,7 +180,7 @@ For tutorials on how to use ESMC SAEs, see our [tutorials](https://github.com/Bi

The model predicts high-resolution, all-atom 3D protein structures directly from amino acid sequences, with optional multiple sequence alignment (MSA) input for enhanced accuracy on challenging targets. ESMFold2 achieves state-of-the-art performance matching or exceeding AlphaFold3 across diverse evaluation datasets, while offering improved computational efficiency through optimized diffusion sampling and architectural innovations.

Codebase, model weights, and model variants for ESMFold2 are available through [Hugging Face](https://huggingface.co/collections/biohub/esmfold2-model-family)
Codebase, model weights, and model variants for ESMFold2 are available through [Hugging Face](https://huggingface.co/Biohub/ESMFold2)

### Running ESMFold2 Locally

@@ -237,7 +237,7 @@ with open("1mht_pred.cif", "w") as f:
Install the `esm` Python package

```
pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d
pip install esm@git+https://github.com/Biohub/esm.git@main
```

Import the necessary libraries.
4 changes: 2 additions & 2 deletions _assets/ESM3_README.md
@@ -34,7 +34,7 @@ The code for ESM3 is available from Github and weights for esm3-sm-open-v1 is av
First install the python library using `pip`:

```
pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d
pip install esm@git+https://github.com/Biohub/esm.git@main
```

Then import the necessary libraries and instantiate your model. Use your token from the [Biohub platform](https://biohub.ai")
@@ -54,7 +54,7 @@ The following code demonstrates how to run ESM3 locally and generate a simple se
First install the python library using `pip`:

```
pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d
pip install esm@git+https://github.com/Biohub/esm.git@main
```

Then import the necessary libraries for your model.
2 changes: 1 addition & 1 deletion cookbook/local/open_generate.ipynb
@@ -32,7 +32,7 @@
"outputs": [],
"source": [
"%set_env TOKENIZERS_PARALLELISM=false\n",
"!pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"!pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"import numpy as np\n",
"import torch\n",
"\n",
58 changes: 28 additions & 30 deletions cookbook/tutorials/embed.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion cookbook/tutorials/esm3_generate.ipynb
@@ -31,7 +31,7 @@
"source": [
"%set_env TOKENIZERS_PARALLELISM=false\n",
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# !pip install py3Dmol\n",
"\n",
"import numpy as np\n",
2 changes: 1 addition & 1 deletion cookbook/tutorials/esm3_guided_generation.ipynb
@@ -38,7 +38,7 @@
"outputs": [],
"source": [
"# # If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# !pip install py3dmol"
]
},
2 changes: 1 addition & 1 deletion cookbook/tutorials/esmc_layer_sweep.ipynb
@@ -73,7 +73,7 @@
"outputs": [],
"source": [
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d"
"# !pip install esm@git+https://github.com/Biohub/esm.git@main"
]
},
{
2 changes: 1 addition & 1 deletion cookbook/tutorials/esmc_mutation_scoring.ipynb
@@ -77,7 +77,7 @@
"outputs": [],
"source": [
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# !pip install py3dmol"
]
},
20 changes: 9 additions & 11 deletions cookbook/tutorials/esmc_sae_feature_interpretation.ipynb
@@ -45,26 +45,24 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"id": "cell-3",
"metadata": {
"id": "cell-3"
},
"outputs": [],
"source": [
"# Install dependencies\n",
"# If you are working in colab, uncomment this line to install dependencies\n",
"# If you are working in colab, uncomment these lines to install dependencies\n",
"\n",
"#!pip install -q py3Dmol matplotlib requests numpy"
"!pip install -q \"esm @ git+https://github.com/Biohub/esm.git@main\"\n",
"!pip install -q py3Dmol matplotlib requests numpy "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-4",
"metadata": {
"id": "cell-4"
},
"execution_count": 3,
"id": "4eb60dfb",
"metadata": {},
"outputs": [],
"source": [
"from functools import lru_cache\n",
@@ -237,7 +235,7 @@
"source": [
"Now we'll extract SAE features using the Biohub API. We use:\n",
"- **Model**: `esmc-6b-2024-12` (6 billion parameter ESMC model)\n",
"- **SAE**: `esmc-6b-2024-12-sae-sweep-layer60-k64-codebook16384` (k=64 means top 64 features per position)\n",
"- **SAE**: `esmc-6b-2024-12-sae-layer60-k64-codebook16384` (k=64 means top 64 features per position)\n",
"- **Normalization**: `normalize_features=True` applies TF-IDF normalization to down-weight common features"
]
},
@@ -255,7 +253,7 @@
"protein_tensor = model.encode(protein)\n",
"\n",
"# Get SAE features\n",
"sae_model_name = \"esmc-6b-2024-12-sae-sweep-layer60-k64-codebook16384\"\n",
"sae_model_name = \"esmc-6b-2024-12-sae-layer60-k64-codebook16384\"\n",
"output = model.logits(\n",
" protein_tensor,\n",
" config=LogitsConfig(\n",
2 changes: 1 addition & 1 deletion cookbook/tutorials/esmfold2.ipynb
@@ -57,7 +57,7 @@
"outputs": [],
"source": [
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# !pip install py3dmol"
]
},
2 changes: 1 addition & 1 deletion cookbook/tutorials/esmprotein.ipynb
@@ -52,7 +52,7 @@
"outputs": [],
"source": [
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# ! pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# ! pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# ! pip install py3Dmol\n",
"# ! pip install matplotlib\n",
"# ! pip install dna-features-viewer"
2 changes: 1 addition & 1 deletion cookbook/tutorials/gfp_design.ipynb
@@ -41,7 +41,7 @@
"outputs": [],
"source": [
"# If you are working in colab, uncomment these lines to install dependencies\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@c94ed8d\n",
"# !pip install esm@git+https://github.com/Biohub/esm.git@main\n",
"# !pip install py3Dmol\n",
"\n",
"from IPython.display import clear_output\n",
35 changes: 28 additions & 7 deletions esm/models/esmfold2/prepare_input.py
@@ -100,6 +100,8 @@ class ChainInfo:
sym_id: int
mol_type: int
tokens: list[TokenInfo] = field(default_factory=list)
# (atom_name1, atom_name2) bonds for SMILES ligands, which have no CCD entry.
ligand_bonds: list[tuple[str, str]] = field(default_factory=list)


# =============================================================================
@@ -566,8 +568,11 @@ def tokenize_ligand_smiles(
atom_offset: int,
space_uid_offset: int,
seed: int | None = None,
) -> tuple[list[TokenInfo], list[AtomInfo]]:
"""Tokenize a ligand from SMILES (1 token per heavy atom)."""
) -> tuple[list[TokenInfo], list[AtomInfo], list[tuple[str, str]]]:
"""Tokenize a ligand from SMILES (1 token per heavy atom).

Returns tokens, atoms, and heavy-atom bonds as (name1, name2) pairs.
"""
from rdkit import Chem
from rdkit.Chem import AllChem

@@ -651,7 +656,13 @@ def tokenize_ligand_smiles(
token_idx += 1
atom_idx += 1

return tokens, atoms_list
bonds: list[tuple[str, str]] = []
for bond in mol_no_h.GetBonds():
n1 = bond.GetBeginAtom().GetProp("name")
n2 = bond.GetEndAtom().GetProp("name")
bonds.append((n1, n2))

return tokens, atoms_list, bonds


# =============================================================================
@@ -753,6 +764,7 @@ def build_chains_from_input(

elif isinstance(item, LigandInput):
has_cov = chain_id_str in covalent_chain_ids
ligand_bonds: list[tuple[str, str]] = []
if item.ccd is not None:
if item.smiles is not None:
warnings.warn("Both ccd and smiles provided, using ccd")
@@ -767,7 +779,7 @@ def build_chains_from_input(
has_covalent_bond=has_cov,
)
elif item.smiles is not None:
new_tokens, new_atoms = tokenize_ligand_smiles(
new_tokens, new_atoms, ligand_bonds = tokenize_ligand_smiles(
smiles=item.smiles,
entity_id=entity_id,
asym_id=asym_id,
@@ -789,6 +801,7 @@ def build_chains_from_input(
sym_id=sym_id,
mol_type=new_tokens[0].mol_type if new_tokens else MOL_TYPE_PROTEIN,
tokens=new_tokens,
ligand_bonds=ligand_bonds if isinstance(item, LigandInput) else [],
)
chains.append(chain)
all_tokens.extend(new_tokens)
@@ -990,16 +1003,24 @@ def add_bond(i: int, j: int) -> None:
(atom.name, atom.token_index)
)

# SMILES ligand bonds keyed by (asym_id, residue_index 0).
explicit_bonds: dict[tuple[int, int], list[tuple[str, str]]] = {
(c.asym_id, 0): c.ligand_bonds for c in chains if c.ligand_bonds
}

# Add intra-residue bonds from CCD
for (asym_id_val, res_idx), atom_list in residue_tokens.items():
if not atom_list:
continue
res_name = tokens[atom_list[0][1]].residue_name
ccd_bonds = get_ligand_ccd_bonds(res_name)
atom_to_tok = {name: ti for name, ti in atom_list}

if ccd_bonds:
for a1, a2 in ccd_bonds:
bonds = explicit_bonds.get((asym_id_val, res_idx))
if bonds is None:
bonds = get_ligand_ccd_bonds(res_name)

if bonds:
for a1, a2 in bonds:
if a1 in atom_to_tok and a2 in atom_to_tok:
add_bond(atom_to_tok[a1], atom_to_tok[a2])
else:
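
The hunk above makes bonds recorded on a SMILES ligand chain take precedence over the CCD lookup for the same `(asym_id, residue_index)` key. The sketch below illustrates only that precedence rule in plain Python. `CCD_BONDS`, `resolve_bonds`, and `bond_token_pairs` are hypothetical names introduced here, and the table entries are illustrative rather than real CCD data; the repository's own logic lives in `compute_token_bonds`.

```python
# Hypothetical stand-in for get_ligand_ccd_bonds: residue name -> bond list.
# The entries are illustrative only, not real CCD data.
CCD_BONDS: dict[str, list[tuple[str, str]]] = {
    "AAA": [("C1", "C2"), ("C2", "O1"), ("C2", "O2")],
}


def resolve_bonds(
    key: tuple[int, int],
    res_name: str,
    explicit_bonds: dict[tuple[int, int], list[tuple[str, str]]],
) -> list[tuple[str, str]]:
    """Prefer bonds recorded for a SMILES ligand; otherwise use the lookup table."""
    bonds = explicit_bonds.get(key)
    if bonds is None:
        bonds = CCD_BONDS.get(res_name, [])
    return bonds


def bond_token_pairs(
    bonds: list[tuple[str, str]], atom_to_tok: dict[str, int]
) -> list[tuple[int, int]]:
    """Map atom-name bonds to token-index pairs, skipping names with no token."""
    return [
        (atom_to_tok[a1], atom_to_tok[a2])
        for a1, a2 in bonds
        if a1 in atom_to_tok and a2 in atom_to_tok
    ]


# A SMILES ligand on chain 1 carries its own bonds; chain 2 falls back to the table.
explicit = {(1, 0): [("C1", "C2"), ("C2", "O1")]}
smiles_bonds = resolve_bonds((1, 0), "AAA", explicit)
table_bonds = resolve_bonds((2, 0), "AAA", explicit)
pairs = bond_token_pairs(smiles_bonds, {"C1": 0, "C2": 1, "O1": 2})
```

Keying the explicit bonds by `(asym_id, 0)` follows the comment in the diff, which places a SMILES ligand at residue index 0 of its chain.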
31 changes: 31 additions & 0 deletions esm/models/esmfold2/prepare_input_test.py
@@ -0,0 +1,31 @@
"""Tests for ESMFold2 input preparation (prepare_input)."""

import pytest
from rdkit import Chem

from esm.models.esmfold2.prepare_input import (
build_chains_from_input,
compute_token_bonds,
)
from esm.models.esmfold2.types import LigandInput, StructurePredictionInput


@pytest.mark.parametrize(
"smiles",
[
"c1ccccc1", # benzene: 6 atoms, 6 bonds
# The drug-like ligand from the SMILES-vs-CCD issue.
"COC1=CC=C(N2C3=C(C(C(N)=O)=N2)CCN(C4=CC=C(N5CCCCC5=O)C=C4)C3=O)C=C1",
],
)
def test_smiles_ligand_bonds_match_molecular_graph(smiles: str):
"""SMILES ligand bonds must match the molecular graph, not a clique (#313)."""
spi = StructurePredictionInput(sequences=[LigandInput(id="B", smiles=smiles)])
chains, tokens, atoms = build_chains_from_input(spi, seed=0)
token_bonds = compute_token_bonds(tokens, atoms, spi, chains)

mol = Chem.MolFromSmiles(smiles)
assert len(tokens) == mol.GetNumAtoms()
n_edges = int(token_bonds.sum().item()) // 2 # symmetric matrix
assert n_edges == mol.GetNumBonds()
assert n_edges < len(tokens) * (len(tokens) - 1) // 2 # not a clique
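
The assertions in this test rest on simple counting: a symmetric bond matrix stores each bond twice, so its sum divided by two is the edge count, and a clique over n tokens would have n(n-1)/2 edges. The sketch below works through that arithmetic for a six-membered ring without RDKit or tensors. `symmetric_bond_matrix` and `count_edges` are hypothetical helpers written for illustration, not functions from the repository.

```python
def symmetric_bond_matrix(n_tokens: int, bonds: list[tuple[int, int]]) -> list[list[int]]:
    """Build an n x n 0/1 matrix with an entry in both directions for every bond."""
    matrix = [[0] * n_tokens for _ in range(n_tokens)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


def count_edges(matrix: list[list[int]]) -> int:
    """Each undirected bond appears twice in a symmetric matrix, so halve the sum."""
    return sum(sum(row) for row in matrix) // 2


# Six-membered ring (the benzene case in the test): atom i is bonded to atom i+1 mod 6.
n_atoms = 6
ring_bonds = [(i, (i + 1) % n_atoms) for i in range(n_atoms)]
matrix = symmetric_bond_matrix(n_atoms, ring_bonds)

n_edges = count_edges(matrix)
clique_edges = n_atoms * (n_atoms - 1) // 2
```

For the ring, `n_edges` is 6 while `clique_edges` is 15, which is the gap the final assertion in the test relies on to distinguish a molecular graph from a fully connected one.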
12 changes: 6 additions & 6 deletions pixi.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -22,7 +22,7 @@ classifiers = [

dependencies = [
"torch>=2.2.0",
"transformers @ git+https://github.com/Biohub/transformers.git@3a8956fb4d4ea16b0ec8e71deef2c2909b6a5cbf",
"transformers @ git+https://github.com/Biohub/transformers.git@main",
"ipython",
"einops",
"biotite>=1.0.0",