Merged

Main #258

23 commits
0a3d870
Update typo to include regression as an option for HyenaDNA fine tuni…
mattwoodx Jul 7, 2025
1815c7c
Update the helix model files in the case that all hidden states shoul…
mattwoodx Jul 15, 2025
66b9a02
Merge pull request #245 from helicalAI/emb-layer-helix
maxiallard Jul 15, 2025
1444916
updated downloader
maxiallard Jul 17, 2025
3a58ea4
`anndata` 0.12 allowed in `pyproject.toml`.
TitouanCh Jul 17, 2025
b42e814
Merge pull request #248 from helicalAI/anndata-up
TitouanCh Jul 17, 2025
3895f82
Remove flash attention and its error message as we never used it
bputzeys Jul 28, 2025
e12eb30
Merge pull request #250 from helicalAI/remove-flash-attn-for-scgpt
bputzeys Jul 29, 2025
4962d85
Update pyproject.toml
bputzeys Jul 29, 2025
cb3a559
Add new large Geneformer models (#251)
mattwoodx Jul 29, 2025
55e055a
fix for issue 243
izumiando Jul 30, 2025
8b95bf1
fix for issue 243
izumiando Jul 30, 2025
16c3b3c
Address Different Size Attention Maps and Genes In Context (#252)
mattwoodx Jul 31, 2025
ed48d8e
CZI Fine Tuned Geneformer Integration (#257)
mattwoodx Jul 31, 2025
6c0fd21
removed commented out code
izumiando Jul 31, 2025
4cbecfc
Add contributions section in readme (#256)
giogix2 Aug 1, 2025
26f915a
Update python to 3.11.13 dependency
bputzeys Aug 1, 2025
c2371f8
Update helical/models/uce/uce_dataset.py
bputzeys Aug 1, 2025
ddd3aec
Merge pull request #255 from izumiando/release
bputzeys Aug 1, 2025
c3bcb15
removed hash values
maxiallard Aug 4, 2025
8da312d
Pull using link download instead of boto3 to ensure aws environments …
mattwoodx Aug 4, 2025
ada9266
Merge pull request #246 from helicalAI/new_downloader
maxiallard Aug 6, 2025
a02138f
Remove GF comparison notebook on release action
bputzeys Aug 6, 2025
9 changes: 5 additions & 4 deletions .github/workflows/main.yml
@@ -63,19 +63,19 @@ jobs:

- name: Execute Geneformer v1
run: |
python examples/run_models/run_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
python examples/run_models/run_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"

- name: Fine-tune Geneformer v1
run: |
python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"

- name: Execute Geneformer v2
run: |
python examples/run_models/run_geneformer.py ++model_name="gf-12L-95M-i4096" ++device="cuda"
python examples/run_models/run_geneformer.py ++model_name="gf-12L-38M-i4096" ++device="cuda"

- name: Fine-tune Geneformer v2
run: |
python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"

- name: Execute scGPT
run: |
@@ -157,6 +157,7 @@ jobs:
sed -i 's/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))[:100]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
sed -i 's/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))[:10]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
rm ./examples/notebooks/Evo-2.ipynb
rm ./examples/notebooks/Geneformer-Series-Comparison.ipynb

- name: Run Notebooks
run: |
9 changes: 5 additions & 4 deletions .github/workflows/release.yml
@@ -81,19 +81,19 @@ jobs:

- name: Execute Geneformer v1
run: |
python examples/run_models/run_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
python examples/run_models/run_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"

- name: Fine-tune Geneformer v1
run: |
python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"

- name: Execute Geneformer v2
run: |
python examples/run_models/run_geneformer.py ++model_name="gf-12L-95M-i4096" ++device="cuda"
python examples/run_models/run_geneformer.py ++model_name="gf-12L-38M-i4096" ++device="cuda"

- name: Fine-tune Geneformer v2
run: |
python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"

- name: Execute scGPT
run: |
@@ -175,6 +175,7 @@ jobs:
sed -i 's/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))[:100]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
sed -i 's/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))[:10]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
rm ./examples/notebooks/Evo-2.ipynb
rm ./examples/notebooks/Geneformer-Series-Comparison.ipynb

- name: Run Notebooks
run: |
31 changes: 31 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,31 @@
# Contributing to Helical

We welcome all kinds of contributions, including code, documentation, bug reports, and feature suggestions. Please read the following guidelines to help us keep the project organized and collaborative.

## Support expectations

The Helical team aims to be responsive and engaged with the community. While we do our best to reply promptly to issues, pull requests, and questions, there may be times when responses take some time, as we balance open source contributions with other work. We appreciate your patience and understanding, and we always welcome community support and collaboration.

## New features

If you'd like to see a new feature added, the best way to move it forward is to implement it and open a pull request. For more substantial changes, especially those that affect core functionality or involve design decisions, please start by opening an issue to discuss the idea first. This helps ensure alignment with the project's direction before significant work is done.

While the Helical team actively works to support and integrate new foundation models as quickly as possible, decisions about which models to add and when will be made at the team's discretion, based on technical fit, demand, and project priorities.

## Submitting a Pull Request

1. Ensure your code builds and passes existing tests.
2. Link related issue(s) in your PR description, if applicable.
3. Be ready for constructive feedback and revision requests.
4. Squash commits if needed before final merge.
5. Make sure to open your pull request against the `main` branch. By default, GitHub may select the `release` branch as the base, so you'll need to manually switch it to `main`. The `release` branch is updated periodically by the team and should not be used for contributions.

## Reporting Bugs

When reporting a bug, please provide a clear description of the issue, the steps to reproduce it, and any relevant environment details (e.g., OS, version, browser). If the problem can't be easily reproduced, the team may ask for a minimal code snippet or example that demonstrates the issue end to end.

## Before You Start

1. Look through existing issues to see if your idea or bug is already being addressed.
2. If you want to propose a major change or fix a bug, open an issue first to discuss.
3. Write clear, descriptive commit messages.
10 changes: 9 additions & 1 deletion README.md
@@ -31,6 +31,10 @@ Let’s build the most exciting AI-for-Bio community together!

## What's new?

### New Larger Geneformer Models
We have integrated the new Geneformer models, which are larger and have been trained on more data. Find out which models have been integrated into the Geneformer suite in the [model card](./helical/models/geneformer/README.md). Check out our notebook on drug perturbation prediction using different Geneformer model sizes [here](./examples/notebooks/Geneformer-Series-Comparison.ipynb).


### TranscriptFormer
We have integrated [TranscriptFormer](https://github.com/czi-ai/transcriptformer) into our helical package and have made a model card for it in our [Transcriptformer model folder](helical/models/transcriptformer/README.md). If you would like to test the model, take a look at our [example notebook](examples/notebooks/Geneformer-vs-TranscriptFormer.ipynb)!

@@ -47,7 +51,7 @@ Check out our <a href="https://www.helical-ai.com/blog/helix-mrna-v0" target="_b

We recommend installing Helical within a conda environment with the commands below (run them in your terminal) - this step is optional:
```
conda create --name helical-package python=3.11.8
conda create --name helical-package python=3.11.13
conda activate helical-package
```

@@ -132,6 +136,7 @@ Within the `examples/notebooks` folder, open the notebook of your choice. We rec
|[Cell-Type-Classification-Fine-Tuning.ipynb](./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb)|An example of how to fine-tune different models on classification tasks.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb) |
|[HyenaDNA-Fine-Tuning.ipynb](./examples/notebooks/HyenaDNA-Fine-Tuning.ipynb)|An example of how to fine-tune the HyenaDNA model on downstream benchmarks.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/HyenaDNA-Fine-Tuning.ipynb) |
|[Cell-Gene-Cls-embedding-generation.ipynb](./examples/notebooks/Cell-Gene-Cls-embedding-generation.ipynb)|A notebook explaining the different embedding modes of single cell RNA models.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Cell-Gene-Cls-embedding-generation.ipynb) |
|[Geneformer-Series-Comparison.ipynb](./examples/notebooks/Geneformer-Series-Comparison.ipynb)|A zero-shot comparison of Geneformer model sizes on drug perturbation prediction.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Geneformer-Series-Comparison.ipynb) |

## Stuck somewhere ? Other ideas ?
We are eager to help you and interact with you:
@@ -148,6 +153,9 @@ If you are (or plan to) working with bio foundation models s.a. Geneformer or UC

We will continuously upload the latest model, publish benchmarks and make our code more efficient.

## Contributing

We welcome all kinds of contributions, including code, documentation, bug reports, and feature suggestions. Please read our [Contributing Guidelines](CONTRIBUTING.md) to help us keep the project organized and collaborative.

## Acknowledgements

4 changes: 2 additions & 2 deletions ci/download_all.py
@@ -11,8 +11,8 @@ def download_geneformer_models():

# We can decide to download more models by simply adding the model names from the full list as reported in geneformer_config.py
version_models_dict = {
"v1": ["gf-12L-30M-i2048", "gf-6L-30M-i2048"],
"v2": ["gf-12L-95M-i4096", "gf-12L-95M-i4096-CLcancer", "gf-20L-95M-i4096"],
"v1": ["gf-12L-40M-i2048", "gf-6L-10M-i2048"],
"v2": ["gf-12L-38M-i4096", "gf-12L-38M-i4096-CLcancer", "gf-20L-151M-i4096"],
}

for version in versions:
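
Note on the identifier changes: throughout this diff each old Geneformer model name is replaced, position by position, by a new one. The mapping below is only inferred from those paired old/new lines; the diff does not show whether the old names remain valid, and `RENAMED_MODELS` and `migrate_model_name` are hypothetical helpers, not part of the helical package. A minimal sketch for updating downstream scripts under that assumption:

```python
# Old -> new model names, inferred from the paired lines in this diff.
# This table is an assumption; it is not taken from geneformer_config.py.
RENAMED_MODELS = {
    "gf-12L-30M-i2048": "gf-12L-40M-i2048",
    "gf-6L-30M-i2048": "gf-6L-10M-i2048",
    "gf-12L-95M-i4096": "gf-12L-38M-i4096",
    "gf-12L-95M-i4096-CLcancer": "gf-12L-38M-i4096-CLcancer",
    "gf-20L-95M-i4096": "gf-20L-151M-i4096",
}


def migrate_model_name(name: str) -> str:
    """Return the new identifier for a known old name, else the name unchanged."""
    return RENAMED_MODELS.get(name, name)


if __name__ == "__main__":
    for old_name in RENAMED_MODELS:
        print(f"{old_name} -> {migrate_model_name(old_name)}")
```

Names that are not in the table pass through unchanged, so the helper can be applied to configs that already use the new identifiers.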
48 changes: 35 additions & 13 deletions ci/tests/test_geneformer/test_geneformer_model.py
@@ -51,7 +51,7 @@ def mock_embeddings_v2(self, mocker):
).repeat(12, 1, 1, 1)
return embs

@pytest.fixture(params=["gf-12L-30M-i2048", "gf-12L-95M-i4096"])
@pytest.fixture(params=["gf-12L-40M-i2048", "gf-12L-38M-i4096"])
def geneformer(self, request):
config = GeneformerConfig(model_name=request.param, batch_size=5)
geneformer = Geneformer(config)
@@ -132,14 +132,14 @@ def test_cls_mode_with_v1_model_config(self, geneformer):
"This test is only for v1 models and should thus be only executed once."
)
with pytest.raises(ValueError):
GeneformerConfig(model_name="gf-12L-30M-i2048", emb_mode="cls")
GeneformerConfig(model_name="gf-12L-40M-i2048", emb_mode="cls")

@pytest.mark.parametrize("emb_mode", ["cell", "gene"])
def test_get_embeddings_of_different_modes_v1(
self, emb_mode, mock_data, mock_embeddings_v1, mocker
):
config = GeneformerConfig(
model_name="gf-12L-30M-i2048", batch_size=5, emb_mode=emb_mode
model_name="gf-12L-40M-i2048", batch_size=5, emb_mode=emb_mode
)
geneformer = Geneformer(config)
mocker.patch.object(
@@ -163,12 +163,34 @@ def test_get_embeddings_of_different_modes_v1(
expected = np.array([[4, 4.333333, 4.666667, 4.333333, 4]])
np.testing.assert_allclose(embeddings, expected, rtol=1e-4, atol=1e-4)

def test_get_embeddings_with_output_genes(
self, mock_data, mock_embeddings_v1, mocker
):
config = GeneformerConfig(
model_name="gf-12L-40M-i2048", batch_size=5, emb_mode="cell"
)
geneformer = Geneformer(config)
mocker.patch.object(
geneformer.model, "forward", return_value=mock_embeddings_v1
)

dataset = geneformer.process_data(mock_data, gene_names="gene_symbols")
embeddings, genes = geneformer.get_embeddings(dataset, output_genes=True)

expected = np.array([[4, 4.333333, 4.666667, 4.333333, 4]])
np.testing.assert_allclose(embeddings, expected, rtol=1e-4, atol=1e-4)
for gene_list in genes:
assert len(gene_list) == 3
assert "ENSG00000187583" in gene_list
assert "ENSG00000187634" in gene_list
assert "ENSG00000188290" in gene_list

@pytest.mark.parametrize("emb_mode", ["cell", "gene", "cls"])
def test_get_embeddings_of_different_modes_v2(
self, emb_mode, mock_data, mock_embeddings_v2, mocker
):
config = GeneformerConfig(
model_name="gf-12L-95M-i4096", batch_size=5, emb_mode=emb_mode
model_name="gf-12L-38M-i4096", batch_size=5, emb_mode=emb_mode
)
geneformer = Geneformer(config)
mocker.patch.object(
@@ -213,7 +235,7 @@ def test_fine_tune_classifier_returns_correct_shape(self, emb_mode, mock_data):

def test_fine_tune_classifier_cls_returns_correct_shape(self, mock_data):
fine_tuned_model = GeneformerFineTuningModel(
GeneformerConfig(model_name="gf-12L-95M-i4096", emb_mode="cls"),
GeneformerConfig(model_name="gf-12L-38M-i4096", emb_mode="cls"),
fine_tuning_head="classification",
output_size=1,
)
@@ -230,10 +252,10 @@ def test_fine_tune_classifier_cls_returns_correct_shape(self, mock_data):
@pytest.mark.parametrize(
"model_name,emb_layer,expected_error",
[
("gf-6L-30M-i2048", -1, "No Error"),
("gf-6L-30M-i2048", 7, "Error"),
("gf-12L-30M-i2048", 6, "No Error"),
("gf-20L-95M-i4096", 23, "Error"),
("gf-6L-10M-i2048", -1, "No Error"),
("gf-6L-10M-i2048", 7, "Error"),
("gf-12L-40M-i2048", 6, "No Error"),
("gf-20L-151M-i4096", 23, "Error"),
],
)
def test_embedding_layer_error(self, model_name, emb_layer, expected_error):
@@ -253,10 +275,10 @@ def test_embedding_layer_error(self, model_name, emb_layer, expected_error):
@pytest.mark.parametrize(
"model_name,emb_layer",
[
("gf-6L-30M-i2048", -1),
("gf-6L-30M-i2048", 3),
("gf-12L-30M-i2048", 6),
("gf-20L-95M-i4096", -1),
("gf-6L-10M-i2048", -1),
("gf-6L-10M-i2048", 3),
("gf-12L-40M-i2048", 6),
("gf-20L-151M-i4096", -1),
],
)
def test_layer_to_quant(self, model_name, emb_layer):
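
Note on the `emb_layer` cases: the parametrized values in `test_embedding_layer_error` are consistent with `emb_layer` being checked against the layer count encoded in the model name. The sketch below is an illustration only. It assumes the `gf-<N>L-` prefix gives the number of layers, that `i<K>` is the input size, and that valid indices follow Python-style indexing over N layers; the meaning of the `<M>M` tag is not stated in this diff. `parse_model_name` and `is_valid_emb_layer` are hypothetical helpers and the check in helical may differ.

```python
import re

# Hypothetical pattern for names such as "gf-12L-40M-i2048" or
# "gf-12L-38M-i4096-CLcancer"; not taken from the helical source.
_NAME_RE = re.compile(r"^gf-(\d+)L-(\d+)M-i(\d+)(?:-(.+))?$")


def parse_model_name(name: str) -> dict:
    """Split a model name into its apparent components."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognized model name: {name}")
    layers, size_tag, input_size, suffix = match.groups()
    return {
        "layers": int(layers),             # assumed: number of transformer layers
        "size_tag_millions": int(size_tag),  # meaning not stated in the diff
        "input_size": int(input_size),     # assumed: maximum input length
        "suffix": suffix,                  # e.g. "CLcancer", or None
    }


def is_valid_emb_layer(name: str, emb_layer: int) -> bool:
    """Assumed rule: emb_layer is a Python-style index into N layers."""
    n_layers = parse_model_name(name)["layers"]
    return -n_layers <= emb_layer < n_layers
```

Under these assumptions the four cases in the test come out as expected: `("gf-6L-10M-i2048", -1)` and `("gf-12L-40M-i2048", 6)` are accepted, while `("gf-6L-10M-i2048", 7)` and `("gf-20L-151M-i4096", 23)` are rejected.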
46 changes: 46 additions & 0 deletions ci/tests/test_scgpt/test_scgpt_model.py
@@ -170,6 +170,52 @@ def test_get_embeddings_of_different_modes(self, mocker, emb_mode):
atol=1e-4, # absolute tolerance
)

@pytest.mark.parametrize("emb_mode", ["cell", "cls"])
def test_get_embeddings_with_gene_outputs(self, mocker, emb_mode):
self.scgpt.config["emb_mode"] = emb_mode
self.scgpt.config["embsize"] = 5

# Mock the method directly on the instance
mocked_embeddings = torch.tensor(
[
[
[1.0, 1.0, 1.0, 1.0, 1.0],
[5.0, 5.0, 5.0, 5.0, 5.0],
[1.0, 2.0, 3.0, 2.0, 1.0],
[6.0, 6.0, 6.0, 6.0, 6.0],
],
]
)
mocker.patch.object(self.scgpt.model, "_encode", return_value=mocked_embeddings)

# mocking the normalization of embeddings makes it easier to test the output
mocker.patch.object(
self.scgpt, "_normalize_embeddings", side_effect=lambda x: x
)

dataset = self.scgpt.process_data(self.data, gene_names="gene_names")
embeddings, genes = self.scgpt.get_embeddings(dataset, output_genes=True)

if emb_mode == "cls":
assert (embeddings == np.array([1.0, 1.0, 1.0, 1.0, 1.0])).all()
assert len(genes) == len(embeddings)
for gene_list in genes:
for gene in gene_list:
assert gene in ["SAMD11", "PLEKHN1", "HES4"]
if emb_mode == "cell":
# average column wise excluding first row
expected = np.array([[4.0, 4.3333335, 4.6666665, 4.3333335, 4.0]])
np.testing.assert_allclose(
embeddings,
expected,
rtol=1e-4, # relative tolerance
atol=1e-4, # absolute tolerance
)
assert len(genes) == len(embeddings)
for gene_list in genes:
for gene in gene_list:
assert gene in ["SAMD11", "PLEKHN1", "HES4"]

@pytest.mark.parametrize("emb_mode", ["cls", "cell"])
def test_normalization_cell_and_cls(self, emb_mode):
mocked_embeddings = np.array(
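
Note on the expected values in `test_get_embeddings_with_gene_outputs`: they can be checked by hand. In `cell` mode the expectation is the column-wise mean of the mocked token embeddings with the first row left out, and in `cls` mode it is the first row itself. A minimal NumPy sketch of that arithmetic follows; it does not call scGPT, and the variable names are illustrative.

```python
import numpy as np

# Same values as `mocked_embeddings` in the test, for a single cell.
tokens = np.array(
    [
        [1.0, 1.0, 1.0, 1.0, 1.0],  # first position, used as the cls embedding
        [5.0, 5.0, 5.0, 5.0, 5.0],
        [1.0, 2.0, 3.0, 2.0, 1.0],
        [6.0, 6.0, 6.0, 6.0, 6.0],
    ]
)

# "cls" mode: take the first row as is.
cls_embedding = tokens[0]

# "cell" mode: average column-wise over every row except the first,
# e.g. the first column is (5 + 1 + 6) / 3 = 4.
cell_embedding = tokens[1:].mean(axis=0)

if __name__ == "__main__":
    print(cls_embedding)
    print(cell_embedding)
```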
6 changes: 5 additions & 1 deletion docs/index.md
@@ -11,8 +11,11 @@ We will update this repo on a regular basis with new models, benchmarks, modalit
Let’s build the most exciting AI-for-Bio community together!

## What's new?
### New Larger Geneformer Models
We have integrated the new Geneformer models, which are larger and have been trained on more data. Find out which models have been integrated into the Geneformer suite in the [model card](./model_cards/geneformer.md). Check out our notebook on drug perturbation prediction using different Geneformer model sizes [here](./notebooks/Geneformer-Series-Comparison.ipynb).

### TranscriptFormer
We have integrated [TranscriptFormer](https://github.com/czi-ai/transcriptformer) into our helical package and have made a model card for it in our [Transcriptformer model folder](helical/models/transcriptformer/README.md). If you would like to test the model, take a look at our [example notebook](examples/notebooks/Geneformer-vs-TranscriptFormer.ipynb)!
We have integrated [TranscriptFormer](https://github.com/czi-ai/transcriptformer) into our helical package and have made a model card for it in our [Transcriptformer model folder](./model_cards/transcriptformer.md). If you would like to test the model, take a look at our [example notebook](./notebooks/Geneformer-vs-TranscriptFormer.ipynb)!

### 🧬 Introducing Helix-mRNA-v0: Unlocking new frontiers & use cases in mRNA therapy 🧬
We’re thrilled to announce the release of our first-ever mRNA Bio Foundation Model, designed to:
@@ -107,6 +110,7 @@ Within the `example/notebooks` folder, open the notebook of your choice. We reco
|[Cell-Type-Classification-Fine-Tuning.ipynb](./notebooks/Cell-Type-Classification-Fine-Tuning.ipynb)|An example of how to fine-tune different models on classification tasks.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb) |
|[HyenaDNA-Fine-Tuning.ipynb](./notebooks/HyenaDNA-Fine-Tuning.ipynb)|An example of how to fine-tune the HyenaDNA model on downstream benchmarks.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/HyenaDNA-Fine-Tuning.ipynb) |
|[Cell-Gene-Cls-embedding-generation.ipynb](./examples/notebooks/Cell-Gene-Cls-embedding-generation.ipynb)|A notebook explaining the different embedding modes of single cell RNA models.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Cell-Gene-Cls-embedding-generation.ipynb) |
|[Geneformer-Series-Comparison.ipynb](./notebooks/Geneformer-Series-Comparison.ipynb)|A zero-shot comparison of Geneformer model sizes on drug perturbation prediction.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Geneformer-Series-Comparison.ipynb) |

## Stuck somewhere ? Other ideas ?
We are eager to help you and interact with you: