Merged
32 commits
a217b14
Add Evo 2 directory
mattwoodx Feb 20, 2025
48bf7f7
Add skeleton files
mattwoodx Feb 20, 2025
e992958
Run black on entire repo (#196)
mattwoodx Feb 20, 2025
8431eed
Merge branch 'main' into evo-2
mattwoodx Feb 20, 2025
094fd4c
Add initial Evo 2 model
mattwoodx Feb 24, 2025
a6ec615
Add testing for Evo 2
mattwoodx Feb 24, 2025
83f3725
changed config name default and added dockerfile
Feb 24, 2025
3855567
dockerfile to run EVO added
Feb 24, 2025
49c98e5
Fix type in EVO2 Dockerfile
Feb 24, 2025
af89de6
Fix typo in EVO2 config.
Feb 24, 2025
ddcd645
Add readme for Evo 2
mattwoodx Feb 25, 2025
49f7b44
Add 1B base model and 7B base model
mattwoodx Feb 25, 2025
6cd244a
Clear installation instructions for Evo 2
mattwoodx Feb 25, 2025
431dc66
Update model to incorporate original lengths
mattwoodx Feb 25, 2025
4265f7f
Add symbolic link for evo 2 model card
mattwoodx Feb 25, 2025
5f92cdf
Add docs navigation for Evo 2
mattwoodx Feb 25, 2025
d8a5aa8
Add notebook fro Evo 2
mattwoodx Feb 25, 2025
812f318
Remove data file
mattwoodx Feb 25, 2025
d33df9a
Skip above imports
mattwoodx Feb 25, 2025
f4882a0
Fix Caduceus Test
mattwoodx Feb 25, 2025
f014b81
Add requirement
mattwoodx Feb 25, 2025
5de6427
Fix tests
mattwoodx Feb 25, 2025
ced0565
removed additional cell
Feb 25, 2025
5b8daa7
Merge branch 'evo-2' of https://github.com/helicalAI/helical into evo-2
Feb 25, 2025
695730d
adjusted readme
Feb 25, 2025
fcd6cd2
Downgrade torch/torchvision versions.
Feb 25, 2025
90ca62f
Trigger CI/CD.
Feb 25, 2025
3fe08d0
Add biopython install in CI/CD.
Feb 25, 2025
1e8bde8
Remove Evo-2 notebook during CI/CD execution
Feb 25, 2025
011cc69
Merge pull request #197 from helicalAI/evo-2
maxiallard Feb 25, 2025
f0647ba
Update version and use default branch to install helical
bputzeys Feb 26, 2025
1c043fc
Merge pull request #199 from helicalAI/evo-2
bputzeys Feb 26, 2025
1 change: 1 addition & 0 deletions .github/workflows/main.yml
@@ -152,6 +152,7 @@ jobs:
sed -i 's/get_anndata_from_hf_dataset(ds\[\\"test\\"\])/get_anndata_from_hf_dataset(ds\[\\"test\\"\])[:10]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
sed -i 's/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))[:100]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
sed -i 's/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))[:10]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
+rm ./examples/notebooks/Evo-2.ipynb

- name: Run Notebooks
run: |
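The `sed` substitutions in this step shrink the notebook inputs before execution by appending a slice to selected expressions in the raw `.ipynb` JSON, where double quotes are stored escaped as `\"`. A minimal Python sketch of the same idea, assuming plain string replacement on the notebook text; `append_slice` is a hypothetical helper for illustration and is not part of the workflow or of helical:

```python
import json


def append_slice(raw_notebook: str, expression: str, limit: int) -> str:
    """Append `[:limit]` to every occurrence of `expression` in raw .ipynb text."""
    # Notebook source lines are JSON string values, so each double quote of
    # the Python expression appears as \" in the file on disk.
    escaped = expression.replace('"', '\\"')
    return raw_notebook.replace(escaped, f"{escaped}[:{limit}]")


# Tiny stand-in for a notebook file: one code cell with one source line.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "source": ['adata = get_anndata_from_hf_dataset(ds["test"])\n'],
        }
    ]
}
raw = json.dumps(notebook)

patched = json.loads(
    append_slice(raw, 'get_anndata_from_hf_dataset(ds["test"])', 10)
)
print(patched["cells"][0]["source"][0])
```

As with `sed ... /g`, the replacement is not idempotent: applying it twice appends the slice twice, so it should run once per checkout.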
1 change: 1 addition & 0 deletions .github/workflows/release.yml
@@ -170,6 +170,7 @@ jobs:
sed -i 's/get_anndata_from_hf_dataset(ds\[\\"test\\"\])/get_anndata_from_hf_dataset(ds\[\\"test\\"\])[:10]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
sed -i 's/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))[:100]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
sed -i 's/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))[:10]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
+rm ./examples/notebooks/Evo-2.ipynb

- name: Run Notebooks
run: |
1 change: 1 addition & 0 deletions .gitignore
@@ -28,6 +28,7 @@ results*

# Log
debug.log
*debug.log

# Mac Files
*.DS_Store
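This hunk lists `debug.log` next to `*debug.log`. For a pattern without a slash, Git matches the file name at any directory depth, and `*` matches any run of characters other than `/`. A rough Python approximation using `fnmatch` on the basename; this is an assumption made for illustration, not how Git itself is implemented, and `ignored_by` is a hypothetical helper:

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def ignored_by(pattern: str, path: str) -> bool:
    """Approximate a slash-free .gitignore pattern by matching the basename."""
    return fnmatchcase(PurePosixPath(path).name, pattern)


paths = ["debug.log", "evo2_debug.log", "logs/run1_debug.log", "debug.log.txt"]
for path in paths:
    # Columns: path, matched by "debug.log", matched by "*debug.log"
    print(path, ignored_by("debug.log", path), ignored_by("*debug.log", path))
```

Because `*` can also match the empty string, the wildcard form covers the plain `debug.log` name as well as prefixed log files.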
6 changes: 6 additions & 0 deletions README.md
@@ -30,6 +30,10 @@ Let’s build the most exciting AI-for-Bio community together!
</div>

## What's new?
+### Evo2
+We have integrated [Evo2](https://github.com/ArcInstitute/evo2) into our helical package and have made a model card for it in our [Evo2 model folder](helical/helical/models/evo_2/README.md). If you would like to test the model, take a look at our [example notebook](helical/examples/notebooks/Evo-2.ipynb)!
+Let us know what you think and we are happy to help you with the larger model (40B parameters!) if needed!

### 🧬 Introducing Helix-mRNA-v0: Unlocking new frontiers & use cases in mRNA therapy 🧬
We’re thrilled to announce the release of our first-ever mRNA Bio Foundation Model, designed to:

@@ -151,6 +155,7 @@ A lot of our models have been published by talend authors developing these excit
- [scikit-learn](https://github.com/scikit-learn/scikit-learn)
- [GenePT](https://github.com/yiqunchen/GenePT)
- [Caduceus](https://github.com/kuleshov-group/caduceus)
+- [Evo2](https://github.com/ArcInstitute/evo2)

### Licenses

@@ -162,6 +167,7 @@ You can find the Licenses for each model implementation in the model repositorie
- [Geneformer](https://github.com/helicalAI/helical/blob/release/helical/models/geneformer/LICENSE)
- [UCE](https://github.com/helicalAI/helical/blob/release/helical/models/uce/LICENSE)
- [HyenaDNA](https://github.com/helicalAI/helical/blob/release/helical/models/hyena_dna/LICENSE)
+- [Evo2](https://github.com/helicalAI/helical/blob/release/helical/models/evo_2/LICENSE)

## Citation

117 changes: 80 additions & 37 deletions ci/download_all.py
@@ -1,18 +1,20 @@
 from helical.utils.downloader import Downloader
 from pathlib import Path
 import logging

 LOGGER = logging.getLogger(__name__)


 def download_geneformer_models():
     downloader = Downloader()
-    versions = ['v1', 'v2']
+    versions = ["v1", "v2"]

     # We can decide to download more models by simply adding the model names from the full list as reported in geneformer_config.py
     version_models_dict = {
         "v1": ["gf-12L-30M-i2048", "gf-6L-30M-i2048"],
         "v2": ["gf-12L-95M-i4096", "gf-12L-95M-i4096-CLcancer", "gf-20L-95M-i4096"],
     }

     for version in versions:
         # Download common files for each version
         common_files = [
@@ -22,31 +24,36 @@ def download_geneformer_models():
         ]
         for file in common_files:
             downloader.download_via_name(file)

         # Get all model directories
         model_dirs = version_models_dict[version]

         for model_name in model_dirs:
             model_files = [
                 f"geneformer/{version}/{model_name}/config.json",
                 f"geneformer/{version}/{model_name}/training_args.bin",
             ]

             # Add version-specific files
-            if version == 'v2':
-                model_files.extend([
-                    f"geneformer/{version}/{model_name}/generation_config.json",
-                    f"geneformer/{version}/{model_name}/model.safetensors",
-                ])
+            if version == "v2":
+                model_files.extend(
+                    [
+                        f"geneformer/{version}/{model_name}/generation_config.json",
+                        f"geneformer/{version}/{model_name}/model.safetensors",
+                    ]
+                )
             else:
-                model_files.append(f"geneformer/{version}/{model_name}/pytorch_model.bin")

+                model_files.append(
+                    f"geneformer/{version}/{model_name}/pytorch_model.bin"
+                )

             # Download all files for the current model
             for file in model_files:
                 downloader.download_via_name(file)

     LOGGER.info("All Geneformer models and files have been downloaded.")


 def main():
     downloader = Downloader()
     downloader.display = False
@@ -55,14 +62,30 @@ def main():
     downloader.download_via_name("uce/all_tokens.torch")
     downloader.download_via_name("uce/species_chrom.csv")
     downloader.download_via_name("uce/species_offsets.pkl")
-    downloader.download_via_name("uce/protein_embeddings/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt")
-    downloader.download_via_name("uce/protein_embeddings/Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt")
-    downloader.download_via_name("uce/protein_embeddings/Danio_rerio.GRCz11.gene_symbol_to_embedding_ESM2.pt")
-    downloader.download_via_name("uce/protein_embeddings/Macaca_mulatta.Mmul_10.gene_symbol_to_embedding_ESM2.pt")
-    downloader.download_via_name("uce/protein_embeddings/Microcebus_murinus.Mmur_3.0.gene_symbol_to_embedding_ESM2.pt")
-    downloader.download_via_name("uce/protein_embeddings/Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM2.pt")
-    downloader.download_via_name("uce/protein_embeddings/Sus_scrofa.Sscrofa11.1.gene_symbol_to_embedding_ESM2.pt")
-    downloader.download_via_name("uce/protein_embeddings/Xenopus_tropicalis.Xenopus_tropicalis_v9.1.gene_symbol_to_embedding_ESM2.pt")
+    downloader.download_via_name(
+        "uce/protein_embeddings/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt"
+    )
+    downloader.download_via_name(
+        "uce/protein_embeddings/Macaca_fascicularis.Macaca_fascicularis_6.0.gene_symbol_to_embedding_ESM2.pt"
+    )
+    downloader.download_via_name(
+        "uce/protein_embeddings/Danio_rerio.GRCz11.gene_symbol_to_embedding_ESM2.pt"
+    )
+    downloader.download_via_name(
+        "uce/protein_embeddings/Macaca_mulatta.Mmul_10.gene_symbol_to_embedding_ESM2.pt"
+    )
+    downloader.download_via_name(
+        "uce/protein_embeddings/Microcebus_murinus.Mmur_3.0.gene_symbol_to_embedding_ESM2.pt"
+    )
+    downloader.download_via_name(
+        "uce/protein_embeddings/Mus_musculus.GRCm39.gene_symbol_to_embedding_ESM2.pt"
+    )
+    downloader.download_via_name(
+        "uce/protein_embeddings/Sus_scrofa.Sscrofa11.1.gene_symbol_to_embedding_ESM2.pt"
+    )
+    downloader.download_via_name(
+        "uce/protein_embeddings/Xenopus_tropicalis.Xenopus_tropicalis_v9.1.gene_symbol_to_embedding_ESM2.pt"
+    )

     downloader.download_via_name("scgpt/scGPT_CP/vocab.json")
     downloader.download_via_name("scgpt/scGPT_CP/best_model.pt")
@@ -75,23 +98,43 @@ def main():
     downloader.download_via_name("hyena_dna/hyenadna-medium-450k-seqlen.ckpt")
     downloader.download_via_name("hyena_dna/hyenadna-large-1m-seqlen.ckpt")

-    downloader.download_via_name('caduceus/caduceus-ph-16L-seqlen-131k-d256/model.safetensors')
-    downloader.download_via_name('caduceus/caduceus-ph-16L-seqlen-131k-d256/config.json')
-    downloader.download_via_name('caduceus/caduceus-ph-4L-seqlen-1k-d118/model.safetensors')
-    downloader.download_via_name('caduceus/caduceus-ph-4L-seqlen-1k-d118/config.json')
-    downloader.download_via_name('caduceus/caduceus-ph-4L-seqlen-1k-d256/model.safetensors')
-    downloader.download_via_name('caduceus/caduceus-ph-4L-seqlen-1k-d256/config.json')
-    downloader.download_via_name('caduceus/caduceus-ps-16L-seqlen-131k-d256/model.safetensors')
-    downloader.download_via_name('caduceus/caduceus-ps-16L-seqlen-131k-d256/config.json')
-    downloader.download_via_name('caduceus/caduceus-ps-4L-seqlen-1k-d118/model.safetensors')
-    downloader.download_via_name('caduceus/caduceus-ps-4L-seqlen-1k-d118/config.json')
-    downloader.download_via_name('caduceus/caduceus-ps-4L-seqlen-1k-d256/model.safetensors')
-    downloader.download_via_name('caduceus/caduceus-ps-4L-seqlen-1k-d256/config.json')

-    downloader.download_via_name('genept/genept_embeddings/genept_embeddings.json')

-    downloader.download_via_link(Path("yolksac_human.h5ad"), "https://huggingface.co/datasets/helical-ai/yolksac_human/resolve/main/data/17_04_24_YolkSacRaw_F158_WE_annots.h5ad?download=true")
+    downloader.download_via_name(
+        "caduceus/caduceus-ph-16L-seqlen-131k-d256/model.safetensors"
+    )
+    downloader.download_via_name(
+        "caduceus/caduceus-ph-16L-seqlen-131k-d256/config.json"
+    )
+    downloader.download_via_name(
+        "caduceus/caduceus-ph-4L-seqlen-1k-d118/model.safetensors"
+    )
+    downloader.download_via_name("caduceus/caduceus-ph-4L-seqlen-1k-d118/config.json")
+    downloader.download_via_name(
+        "caduceus/caduceus-ph-4L-seqlen-1k-d256/model.safetensors"
+    )
+    downloader.download_via_name("caduceus/caduceus-ph-4L-seqlen-1k-d256/config.json")
+    downloader.download_via_name(
+        "caduceus/caduceus-ps-16L-seqlen-131k-d256/model.safetensors"
+    )
+    downloader.download_via_name(
+        "caduceus/caduceus-ps-16L-seqlen-131k-d256/config.json"
+    )
+    downloader.download_via_name(
+        "caduceus/caduceus-ps-4L-seqlen-1k-d118/model.safetensors"
+    )
+    downloader.download_via_name("caduceus/caduceus-ps-4L-seqlen-1k-d118/config.json")
+    downloader.download_via_name(
+        "caduceus/caduceus-ps-4L-seqlen-1k-d256/model.safetensors"
+    )
+    downloader.download_via_name("caduceus/caduceus-ps-4L-seqlen-1k-d256/config.json")

+    downloader.download_via_name("genept/genept_embeddings/genept_embeddings.json")

+    downloader.download_via_link(
+        Path("yolksac_human.h5ad"),
+        "https://huggingface.co/datasets/helical-ai/yolksac_human/resolve/main/data/17_04_24_YolkSacRaw_F158_WE_annots.h5ad?download=true",
+    )
     return True


 if __name__ == "__main__":
     main()
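The download script derives the Geneformer file names from the version and model name, while the Caduceus names are written out one by one. A minimal sketch of that name-building logic as pure functions, so it can be checked without network access. `geneformer_model_files` and `caduceus_files` are hypothetical helpers that do not exist in helical, and the `Downloader` class is deliberately not called here:

```python
from itertools import product


def geneformer_model_files(version: str, model_name: str) -> list[str]:
    """Return the per-model file names the script requests for one model."""
    prefix = f"geneformer/{version}/{model_name}"
    files = [f"{prefix}/config.json", f"{prefix}/training_args.bin"]
    if version == "v2":
        # For v2 models the script requests these two extra files.
        files += [
            f"{prefix}/generation_config.json",
            f"{prefix}/model.safetensors",
        ]
    else:
        # For every other version (v1 in the script) it requests pytorch_model.bin.
        files.append(f"{prefix}/pytorch_model.bin")
    return files


def caduceus_files() -> list[str]:
    """Rebuild the twelve Caduceus names listed in the script, in the same order."""
    flavours = ["ph", "ps"]
    sizes = ["16L-seqlen-131k-d256", "4L-seqlen-1k-d118", "4L-seqlen-1k-d256"]
    parts = ["model.safetensors", "config.json"]
    return [
        f"caduceus/caduceus-{flavour}-{size}/{part}"
        for flavour, size, part in product(flavours, sizes, parts)
    ]


print(geneformer_model_files("v1", "gf-6L-30M-i2048"))
print(len(caduceus_files()))
```

Each generated name could then be passed to `downloader.download_via_name(...)` exactly as the script does today. Whether the loop form is preferable to the explicit list is a design choice: the explicit list is easier to grep, the generated one is harder to get out of sync.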