helicalAI · bputzeys · Aug 8, 2025 · Jul 7, 2025 · Jul 15, 2025 · Jul 15, 2025
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -63,19 +63,19 @@ jobs:
 
       - name: Execute Geneformer v1
         run: |
-          python examples/run_models/run_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
+          python examples/run_models/run_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"
 
       - name: Fine-tune Geneformer v1
         run: |
-          python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
+          python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"
 
       - name: Execute Geneformer v2
         run: |
-          python examples/run_models/run_geneformer.py ++model_name="gf-12L-95M-i4096" ++device="cuda"
+          python examples/run_models/run_geneformer.py ++model_name="gf-12L-38M-i4096" ++device="cuda"
 
       - name: Fine-tune Geneformer v2
         run: |
-          python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
+          python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"
 
       - name: Execute scGPT
         run: |
@@ -157,6 +157,7 @@ jobs:
           sed -i 's/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))[:100]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
           sed -i 's/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))[:10]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
           rm ./examples/notebooks/Evo-2.ipynb
+          rm ./examples/notebooks/Geneformer-Series-Comparison.ipynb
 
       - name: Run Notebooks
         run: |

diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -81,19 +81,19 @@ jobs:
 
       - name: Execute Geneformer v1
         run: |
-          python examples/run_models/run_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
+          python examples/run_models/run_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"
 
       - name: Fine-tune Geneformer v1
         run: |
-          python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
+          python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"
 
       - name: Execute Geneformer v2
         run: |
-          python examples/run_models/run_geneformer.py ++model_name="gf-12L-95M-i4096" ++device="cuda"
+          python examples/run_models/run_geneformer.py ++model_name="gf-12L-38M-i4096" ++device="cuda"
 
       - name: Fine-tune Geneformer v2
         run: |
-          python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-30M-i2048" ++device="cuda"
+          python examples/fine_tune_models/fine_tune_geneformer.py ++model_name="gf-12L-40M-i2048" ++device="cuda"
 
       - name: Execute scGPT
         run: |
@@ -175,6 +175,7 @@ jobs:
           sed -i 's/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(train_dataset.obs\[\\"LVL1\\"].tolist()))[:100]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
           sed -i 's/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))/list(np.array(test_dataset.obs\[\\"LVL1\\"].tolist()))[:10]/g' ./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb
           rm ./examples/notebooks/Evo-2.ipynb
+          rm ./examples/notebooks/Geneformer-Series-Comparison.ipynb
 
       - name: Run Notebooks
         run: |

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -0,0 +1,31 @@
+# Contributing to Helical
+
+We welcome all kinds of contributions, including code, documentation, bug reports, and feature suggestions. Please read the following guidelines to help us keep the project organized and collaborative.
+
+## Support expectations
+
+The Helical team aims to be responsive and engaged with the community. While we do our best to reply promptly to issues, pull requests, and questions, there may be times when responses take some time, as we balance open source contributions with other work. We appreciate your patience and understanding, and we always welcome community support and collaboration.
+
+## New features
+
+If you'd like to see a new feature added, the best way to move it forward is to implement it and open a pull request. For more substantial changes, especially those that affect core functionality or involve design decisions, please start by opening an issue to discuss the idea first. This helps ensure alignment with the project's direction before significant work is done.
+
+While the Helical team actively works to support and integrate new foundation models as quickly as possible, decisions about which models to add and when will be made at the team's discretion, based on technical fit, demand, and project priorities.
+
+## Submitting a Pull Request
+
+1. Ensure your code builds and passes existing tests.
+2. Link related issue(s) in your PR description, if applicable.
+3. Be ready for constructive feedback and revision requests.
+4. Squash commits if needed before final merge.
+5. Make sure to open your pull request against the `main` branch. By default, GitHub may select the `release` branch as the base, so you'll need to manually switch it to `main`. The `release` branch is updated periodically by the team and should not be used for contributions.
+
+## Reporting Bugs
+
+When reporting a bug, please provide a clear description of the issue, the steps to reproduce it, and any relevant environment details (e.g., OS, version, browser). If the problem can't be easily reproduced, the team may ask for a minimal code snippet or example that demonstrates the issue end to end.
+
+## Before You Start
+
+1. Look through existing issues to see if your idea or bug is already being addressed.
+2. If you want to propose a major change or fix a bug, open an issue first to discuss it.
+3. Write clear, descriptive commit messages.
diff --git a/README.md b/README.md
@@ -31,6 +31,10 @@ Let’s build the most exciting AI-for-Bio community together!
 
 ## What's new?
 
+### New Larger Geneformer Models
+We have integrated the new Geneformer models, which are larger and have been trained on more data. The [model card](./helical/models/geneformer/README.md) lists the models now available in the Geneformer suite. Check out our [notebook](./examples/notebooks/Geneformer-Series-Comparison.ipynb) on drug perturbation prediction across different Geneformer model sizes.
+
+
 ### TranscriptFormer
 We have integrated [TranscriptFormer](https://github.com/czi-ai/transcriptformer) into our helical package and have made a model card for it in our [Transcriptformer model folder](helical/models/transcriptformer/README.md). If you would like to test the model, take a look at our [example notebook](examples/notebooks/Geneformer-vs-TranscriptFormer.ipynb)!
 
@@ -47,7 +51,7 @@ Check out our <a href="https://www.helical-ai.com/blog/helix-mrna-v0" target="_b
 
 We recommend installing Helical within a conda environment with the commands below (run them in your terminal) - this step is optional:
 ```
-conda create --name helical-package python=3.11.8
+conda create --name helical-package python=3.11.13
 conda activate helical-package
 ```
 
@@ -132,6 +136,7 @@ Within the `examples/notebooks` folder, open the notebook of your choice. We rec
 |[Cell-Type-Classification-Fine-Tuning.ipynb](./examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb)|An example how to fine-tune different models on classification tasks.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb) |
 |[HyenaDNA-Fine-Tuning.ipynb](./examples/notebooks/HyenaDNA-Fine-Tuning.ipynb)|An example of how to fine-tune the HyenaDNA model on downstream benchmarks.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/HyenaDNA-Fine-Tuning.ipynb) |
 |[Cell-Gene-Cls-embedding-generation.ipynb](./examples/notebooks/Cell-Gene-Cls-embedding-generation.ipynb)|A notebook explaining the different embedding modes of single cell RNA models.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Cell-Gene-Cls-embedding-generation.ipynb) |
+|[Geneformer-Series-Comparison.ipynb](./examples/notebooks/Geneformer-Series-Comparison.ipynb)|A zero-shot comparison of Geneformer model sizes on drug perturbation prediction.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Geneformer-Series-Comparison.ipynb) |
 
 ## Stuck somewhere ? Other ideas ?
 We are eager to help you and interact with you:
@@ -148,6 +153,9 @@ If you are (or plan to) working with bio foundation models s.a. Geneformer or UC
 
 We will continuously upload the latest model, publish benchmarks and make our code more efficient.
 
+## Contributing
+
+We welcome all kinds of contributions, including code, documentation, bug reports, and feature suggestions. Please read our [Contributing Guidelines](CONTRIBUTING.md) to help us keep the project organized and collaborative.
 
 ## Acknowledgements
 

diff --git a/ci/download_all.py b/ci/download_all.py
@@ -11,8 +11,8 @@ def download_geneformer_models():
 
     # We can decide to download more models by simply adding the model names from the full list as reported in geneformer_config.py
     version_models_dict = {
-        "v1": ["gf-12L-30M-i2048", "gf-6L-30M-i2048"],
-        "v2": ["gf-12L-95M-i4096", "gf-12L-95M-i4096-CLcancer", "gf-20L-95M-i4096"],
+        "v1": ["gf-12L-40M-i2048", "gf-6L-10M-i2048"],
+        "v2": ["gf-12L-38M-i4096", "gf-12L-38M-i4096-CLcancer", "gf-20L-151M-i4096"],
     }
 
     for version in versions:
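
Note on the renames above: the removed and added model names in this diff line up position by position, so existing configs could be migrated with a simple lookup. The sketch below is a minimal illustration, not part of the helical package. `RENAMED_GENEFORMER_MODELS` and `migrate_model_name` are hypothetical names, and the pairing assumes that each added name replaces the removed name in the same position.

```python
# Old -> new Geneformer model names, paired position-wise from the removed and
# added lines of this diff. Assumption: each pair is a rename of the same model.
RENAMED_GENEFORMER_MODELS = {
    "gf-12L-30M-i2048": "gf-12L-40M-i2048",
    "gf-6L-30M-i2048": "gf-6L-10M-i2048",
    "gf-12L-95M-i4096": "gf-12L-38M-i4096",
    "gf-12L-95M-i4096-CLcancer": "gf-12L-38M-i4096-CLcancer",
    "gf-20L-95M-i4096": "gf-20L-151M-i4096",
}


def migrate_model_name(name: str) -> str:
    """Return the post-PR name for an old name; leave other names unchanged."""
    return RENAMED_GENEFORMER_MODELS.get(name, name)
```

Names that are not in the table, including names that are already up to date, pass through unchanged, so the helper can be applied to a config more than once without harm.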

diff --git a/ci/tests/test_geneformer/test_geneformer_model.py b/ci/tests/test_geneformer/test_geneformer_model.py
@@ -51,7 +51,7 @@ def mock_embeddings_v2(self, mocker):
         ).repeat(12, 1, 1, 1)
         return embs
 
-    @pytest.fixture(params=["gf-12L-30M-i2048", "gf-12L-95M-i4096"])
+    @pytest.fixture(params=["gf-12L-40M-i2048", "gf-12L-38M-i4096"])
     def geneformer(self, request):
         config = GeneformerConfig(model_name=request.param, batch_size=5)
         geneformer = Geneformer(config)
@@ -132,14 +132,14 @@ def test_cls_mode_with_v1_model_config(self, geneformer):
                 "This test is only for v1 models and should thus be only executed once."
             )
         with pytest.raises(ValueError):
-            GeneformerConfig(model_name="gf-12L-30M-i2048", emb_mode="cls")
+            GeneformerConfig(model_name="gf-12L-40M-i2048", emb_mode="cls")
 
     @pytest.mark.parametrize("emb_mode", ["cell", "gene"])
     def test_get_embeddings_of_different_modes_v1(
         self, emb_mode, mock_data, mock_embeddings_v1, mocker
     ):
         config = GeneformerConfig(
-            model_name="gf-12L-30M-i2048", batch_size=5, emb_mode=emb_mode
+            model_name="gf-12L-40M-i2048", batch_size=5, emb_mode=emb_mode
         )
         geneformer = Geneformer(config)
         mocker.patch.object(
@@ -163,12 +163,34 @@ def test_get_embeddings_of_different_modes_v1(
             expected = np.array([[4, 4.333333, 4.666667, 4.333333, 4]])
             np.testing.assert_allclose(embeddings, expected, rtol=1e-4, atol=1e-4)
 
+    def test_get_embeddings_with_output_genes(
+        self, mock_data, mock_embeddings_v1, mocker
+    ):
+        config = GeneformerConfig(
+            model_name="gf-12L-40M-i2048", batch_size=5, emb_mode="cell"
+        )
+        geneformer = Geneformer(config)
+        mocker.patch.object(
+            geneformer.model, "forward", return_value=mock_embeddings_v1
+        )
+
+        dataset = geneformer.process_data(mock_data, gene_names="gene_symbols")
+        embeddings, genes = geneformer.get_embeddings(dataset, output_genes=True)
+
+        expected = np.array([[4, 4.333333, 4.666667, 4.333333, 4]])
+        np.testing.assert_allclose(embeddings, expected, rtol=1e-4, atol=1e-4)
+        for gene_list in genes:
+            assert len(gene_list) == 3
+            assert "ENSG00000187583" in gene_list
+            assert "ENSG00000187634" in gene_list
+            assert "ENSG00000188290" in gene_list
+
     @pytest.mark.parametrize("emb_mode", ["cell", "gene", "cls"])
     def test_get_embeddings_of_different_modes_v2(
         self, emb_mode, mock_data, mock_embeddings_v2, mocker
     ):
         config = GeneformerConfig(
-            model_name="gf-12L-95M-i4096", batch_size=5, emb_mode=emb_mode
+            model_name="gf-12L-38M-i4096", batch_size=5, emb_mode=emb_mode
         )
         geneformer = Geneformer(config)
         mocker.patch.object(
@@ -213,7 +235,7 @@ def test_fine_tune_classifier_returns_correct_shape(self, emb_mode, mock_data):
 
     def test_fine_tune_classifier_cls_returns_correct_shape(self, mock_data):
         fine_tuned_model = GeneformerFineTuningModel(
-            GeneformerConfig(model_name="gf-12L-95M-i4096", emb_mode="cls"),
+            GeneformerConfig(model_name="gf-12L-38M-i4096", emb_mode="cls"),
             fine_tuning_head="classification",
             output_size=1,
         )
@@ -230,10 +252,10 @@ def test_fine_tune_classifier_cls_returns_correct_shape(self, mock_data):
     @pytest.mark.parametrize(
         "model_name,emb_layer,expected_error",
         [
-            ("gf-6L-30M-i2048", -1, "No Error"),
-            ("gf-6L-30M-i2048", 7, "Error"),
-            ("gf-12L-30M-i2048", 6, "No Error"),
-            ("gf-20L-95M-i4096", 23, "Error"),
+            ("gf-6L-10M-i2048", -1, "No Error"),
+            ("gf-6L-10M-i2048", 7, "Error"),
+            ("gf-12L-40M-i2048", 6, "No Error"),
+            ("gf-20L-151M-i4096", 23, "Error"),
         ],
     )
     def test_embedding_layer_error(self, model_name, emb_layer, expected_error):
@@ -253,10 +275,10 @@ def test_embedding_layer_error(self, model_name, emb_layer, expected_error):
     @pytest.mark.parametrize(
         "model_name,emb_layer",
         [
-            ("gf-6L-30M-i2048", -1),
-            ("gf-6L-30M-i2048", 3),
-            ("gf-12L-30M-i2048", 6),
-            ("gf-20L-95M-i4096", -1),
+            ("gf-6L-10M-i2048", -1),
+            ("gf-6L-10M-i2048", 3),
+            ("gf-12L-40M-i2048", 6),
+            ("gf-20L-151M-i4096", -1),
         ],
     )
     def test_layer_to_quant(self, model_name, emb_layer):
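
Note on the parametrized cases above: the model names seem to follow a `gf-<layers>L-<size>M-i<input size>` pattern, and the four `test_embedding_layer_error` cases are consistent with rejecting an `emb_layer` larger than the layer count. The sketch below only illustrates that reading. `parse_model_name` and `emb_layer_is_plausible` are hypothetical helpers, the naming pattern is inferred from the names in this diff, and the actual validation in `GeneformerConfig` is not shown here and may differ.

```python
import re
from typing import NamedTuple


class ModelNameParts(NamedTuple):
    layers: int
    size_tag: str
    input_size: int


# Assumed pattern: gf-<layers>L-<size>M-i<input size>, with an optional suffix
# such as "-CLcancer". Inferred from the names in this diff only.
_NAME_RE = re.compile(r"^gf-(\d+)L-(\d+)M-i(\d+)(?:-\w+)?$")


def parse_model_name(name: str) -> ModelNameParts:
    """Split a Geneformer model name into its numeric parts."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognized model name: {name!r}")
    layers, size, input_size = match.groups()
    return ModelNameParts(int(layers), f"{size}M", int(input_size))


def emb_layer_is_plausible(name: str, emb_layer: int) -> bool:
    """Hypothetical rule: reject layer indices above the model's layer count."""
    return emb_layer <= parse_model_name(name).layers
```

Under this rule, `("gf-6L-10M-i2048", 7)` and `("gf-20L-151M-i4096", 23)` are rejected, while `("gf-6L-10M-i2048", -1)` and `("gf-12L-40M-i2048", 6)` are accepted, which matches the expected outcomes listed in the test.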

diff --git a/ci/tests/test_scgpt/test_scgpt_model.py b/ci/tests/test_scgpt/test_scgpt_model.py
@@ -170,6 +170,52 @@ def test_get_embeddings_of_different_modes(self, mocker, emb_mode):
                 atol=1e-4,  # absolute tolerance
             )
 
+    @pytest.mark.parametrize("emb_mode", ["cell", "cls"])
+    def test_get_embeddings_with_gene_outputs(self, mocker, emb_mode):
+        self.scgpt.config["emb_mode"] = emb_mode
+        self.scgpt.config["embsize"] = 5
+
+        # Mock the method directly on the instance
+        mocked_embeddings = torch.tensor(
+            [
+                [
+                    [1.0, 1.0, 1.0, 1.0, 1.0],
+                    [5.0, 5.0, 5.0, 5.0, 5.0],
+                    [1.0, 2.0, 3.0, 2.0, 1.0],
+                    [6.0, 6.0, 6.0, 6.0, 6.0],
+                ],
+            ]
+        )
+        mocker.patch.object(self.scgpt.model, "_encode", return_value=mocked_embeddings)
+
+        # mocking the normalization of embeddings makes it easier to test the output
+        mocker.patch.object(
+            self.scgpt, "_normalize_embeddings", side_effect=lambda x: x
+        )
+
+        dataset = self.scgpt.process_data(self.data, gene_names="gene_names")
+        embeddings, genes = self.scgpt.get_embeddings(dataset, output_genes=True)
+
+        if emb_mode == "cls":
+            assert (embeddings == np.array([1.0, 1.0, 1.0, 1.0, 1.0])).all()
+            assert len(genes) == len(embeddings)
+            for gene_list in genes:
+                for gene in gene_list:
+                    assert gene in ["SAMD11", "PLEKHN1", "HES4"]
+        if emb_mode == "cell":
+            # average column wise excluding first row
+            expected = np.array([[4.0, 4.3333335, 4.6666665, 4.3333335, 4.0]])
+            np.testing.assert_allclose(
+                embeddings,
+                expected,
+                rtol=1e-4,  # relative tolerance
+                atol=1e-4,  # absolute tolerance
+            )
+            assert len(genes) == len(embeddings)
+            for gene_list in genes:
+                for gene in gene_list:
+                    assert gene in ["SAMD11", "PLEKHN1", "HES4"]
+
     @pytest.mark.parametrize("emb_mode", ["cls", "cell"])
     def test_normalization_cell_and_cls(self, emb_mode):
         mocked_embeddings = np.array(
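
Note on the expected values in `test_get_embeddings_with_gene_outputs`: the `cls` case expects the first row of the mocked tensor, and the `cell` case expects the column-wise mean of the remaining rows. The arithmetic can be checked by hand with the plain-Python sketch below, which uses the same mocked values as the test and deliberately avoids the scGPT code path.

```python
# Token embeddings mocked in the test: one cell, four tokens, embedding size 5.
mocked = [
    [1.0, 1.0, 1.0, 1.0, 1.0],  # first token, used as the "cls" embedding
    [5.0, 5.0, 5.0, 5.0, 5.0],
    [1.0, 2.0, 3.0, 2.0, 1.0],
    [6.0, 6.0, 6.0, 6.0, 6.0],
]

# "cls" mode: take the first row as is.
cls_embedding = mocked[0]

# "cell" mode: average column-wise over every row except the first.
rest = mocked[1:]
cell_embedding = [sum(column) / len(rest) for column in zip(*rest)]

print([round(value, 4) for value in cell_embedding])
# [4.0, 4.3333, 4.6667, 4.3333, 4.0]
```

Each column sums to 12, 13, 14, 13, and 12 over three rows, which gives the `[4.0, 4.3333335, 4.6666665, 4.3333335, 4.0]` array asserted in the test up to float32 rounding.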

diff --git a/docs/index.md b/docs/index.md
@@ -11,8 +11,11 @@ We will update this repo on a regular basis with new models, benchmarks, modalit
 Let’s build the most exciting AI-for-Bio community together!
 
 ## What's new?
+### New Larger Geneformer Models
+We have integrated the new Geneformer models, which are larger and have been trained on more data. The [model card](./model_cards/geneformer.md) lists the models now available in the Geneformer suite. Check out our [notebook](./notebooks/Geneformer-Series-Comparison.ipynb) on drug perturbation prediction across different Geneformer model sizes.
+
 ### TranscriptFormer
-We have integrated [TranscriptFormer](https://github.com/czi-ai/transcriptformer) into our helical package and have made a model card for it in our [Transcriptformer model folder](helical/models/transcriptformer/README.md). If you would like to test the model, take a look at our [example notebook](examples/notebooks/Geneformer-vs-TranscriptFormer.ipynb)!
+We have integrated [TranscriptFormer](https://github.com/czi-ai/transcriptformer) into our helical package and have made a model card for it in our [Transcriptformer model folder](./model_cards/transcriptformer.md). If you would like to test the model, take a look at our [example notebook](./notebooks/Geneformer-vs-TranscriptFormer.ipynb)!
 
 ### 🧬 Introducing Helix-mRNA-v0: Unlocking new frontiers & use cases in mRNA therapy 🧬
 We’re thrilled to announce the release of our first-ever mRNA Bio Foundation Model, designed to:
@@ -107,6 +110,7 @@ Within the `example/notebooks` folder, open the notebook of your choice. We reco
 |[Cell-Type-Classification-Fine-Tuning.ipynb](./notebooks/Cell-Type-Classification-Fine-Tuning.ipynb)|An example how to fine-tune different models on classification tasks.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Cell-Type-Classification-Fine-Tuning.ipynb) |
 |[HyenaDNA-Fine-Tuning.ipynb](./notebooks/HyenaDNA-Fine-Tuning.ipynb)|An example of how to fine-tune the HyenaDNA model on downstream benchmarks.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/HyenaDNA-Fine-Tuning.ipynb) |
 |[Cell-Gene-Cls-embedding-generation.ipynb](./examples/notebooks/Cell-Gene-Cls-embedding-generation.ipynb)|A notebook explaining the different embedding modes of single cell RNA models.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Cell-Gene-Cls-embedding-generation.ipynb) |
+|[Geneformer-Series-Comparison.ipynb](./notebooks/Geneformer-Series-Comparison.ipynb)|A zero-shot comparison of Geneformer model sizes on drug perturbation prediction.|[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/helicalAI/helical/blob/main/examples/notebooks/Geneformer-Series-Comparison.ipynb) |
 
 ## Stuck somewhere ? Other ideas ?
 We are eager to help you and interact with you: