
scGPT #61

Open

HesamAsad wants to merge 15 commits into main from scGPT

Conversation

@HesamAsad
Collaborator

Added scgptDataModule, scgptWrapper, and a mode to run with both geneformer and scgpt.

  • Resolved merge conflicts
  • scGPT DataCollator modified

@HesamAsad HesamAsad requested a review from mariemoullet April 12, 2024 16:16
@HesamAsad HesamAsad self-assigned this Apr 12, 2024
Comment thread gplearner/Datamodules/datamodule.py Outdated
return output_dict

class scgptDataset(Dataset):
def __init__(self, adata, gene_ids, vocab, model_configs, adata_global, batch_ids=None):

A very minor comment, but could we call adata_global something like adata_for_recon? (global was confusing to me, but really not a big deal.)

self.train_dataset,
collate_fn=self.collator,
batch_size=self.batch_size,
# shuffle=True,

I usually set shuffle=True for train and shuffle=False for val/test; is there a reason it's commented out here?
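
For reference, the pattern I mean is roughly this (a sketch only; the class name and the val attributes are assumptions based on the snippet above):

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class ScgptDataModule(pl.LightningDataModule):  # hypothetical name, for illustration
    def train_dataloader(self):
        # reshuffle the training split every epoch
        return DataLoader(
            self.train_dataset,
            collate_fn=self.collator,
            batch_size=self.batch_size,
            shuffle=True,
        )

    def val_dataloader(self):
        # keep validation/test order deterministic
        return DataLoader(
            self.val_dataset,
            collate_fn=self.collator,
            batch_size=self.batch_size,
            shuffle=False,
        )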

seed=0,
dataset_name="ms",
do_train=False,
load_model=f"/lustre/scratch126/cellgen/team205/ha11/scGPT/{scgpt_mod}/",

For anything that has farm-specific paths, is it possible to have it as a function argument? I need to discuss with Amir what the best practice is around this (if you have suggestions let me know, it's not something I know about!), but it might make our lives easier later.
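
Something like this is what I have in mind (just a sketch; the argument name is made up):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--scgpt-model-dir",  # hypothetical flag; any name works
    type=str,
    required=True,
    help="Directory containing the pretrained scGPT checkpoint",
)
args = parser.parse_args()

config = dict(
    seed=0,
    dataset_name="ms",
    do_train=False,
    load_model=args.scgpt_model_dir,  # no farm-specific default baked in
)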

ecs_thres=0.0, # Elastic cell similarity objective, 0.0 to 1.0, 0.0 to disable
dab_weight=0.0,
lr=1e-4,
batch_size=32,

If the batch size is hardcoded here, does that cause issues in training?


# settings for training
MLM = False # whether to use masked language modeling, currently it is always on.
CLS = True # celltype classification objective

Does this add a cell classification head? Do we need this?

cell_embedding_mode = 'cls'
max_length = 9585
batch_size = 4
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Do you know where this comes into the code? I think this might cause issues for multi-GPU training. (I normally let PyTorch Lightning handle this unless it's an evaluation task that I'm sure is on a single GPU.)
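
A sketch of what I mean (assuming the wrapper is a LightningModule; names are illustrative):

import torch
import pytorch_lightning as pl

class ScgptWrapper(pl.LightningModule):  # hypothetical name
    def training_step(self, batch, batch_idx):
        # Lightning keeps self.device up to date for each process, so new
        # tensors can be created on self.device; constructing a
        # torch.device(...) manually and moving things yourself can conflict
        # with how Lightning places the model and data across multiple GPUs.
        padding_mask = torch.zeros(
            batch["gene_ids"].shape, dtype=torch.bool, device=self.device
        )
        ...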

loss_mode: str = 'mse',
n_genes: int = 25426,
d_model: int = 256,
d_model: int = 128,

In the geneformer version this corresponds to the size of the embeddings; why does it need to be 128? (It can be specified elsewhere in the code if needed, just trying to understand.)

Comment thread gplearner/Models/gp_model.py Outdated
self.do_ensembl_conversion = do_ensembl_conversion
self.n_blocks = n_blocks
self.attn_dropout = attn_dropout
self.use_flash = use_flash

It's quite helpful for me to have this as an option for testing speed; is it possible to leave it here?

gp_inputs=gp_inputs,
add_remaining_var=add_remaining_var,
use_flash=self.use_flash,
use_flash=True,

And here, for the geneformer version at least, it's not obvious that flash attention provides an improvement (the sequence may not be long enough for us to see benefits); is it possible to leave it as optional?
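
For example, keeping it as a pass-through flag (a sketch; the class and attribute names are illustrative):

import torch.nn as nn

class GPModel(nn.Module):  # hypothetical name, for illustration
    def __init__(self, d_model: int = 256, use_flash: bool = False):
        super().__init__()
        # keep flash attention opt-in so it can be benchmarked on the
        # geneformer path; forward the flag instead of hardcoding True
        self.use_flash = use_flash
        self.block_kwargs = dict(dim=d_model, use_flash=self.use_flash)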

drop_path_rate=0.0, # no effect if only 1 block
norm_layer=nn.LayerNorm,
use_pos_emb=True,
use_pos_emb=False,

For the geneformer version I need the positional embeddings; is it possible to pass this as an argument somewhere else in the script? (If not, I can write it, but we would essentially need use_pos_emb=True when mode=geneformer and False when mode=scgpt.)
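
A sketch of the conditional I mean (mode is assumed to be the existing geneformer/scgpt switch, and build_model stands in for whatever constructor is called here):

import torch.nn as nn

# derive the flag from the model mode instead of hardcoding it
use_pos_emb = (mode == "geneformer")  # True for geneformer, False for scgpt

model = build_model(  # hypothetical constructor, for illustration
    drop_path_rate=0.0,
    norm_layer=nn.LayerNorm,
    use_pos_emb=use_pos_emb,
)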

Comment thread gplearner/Train/training.py Outdated
precision='bf16-mixed',
profiler='simple',
strategy=strategy,
# strategy=strategy,

I will be experimenting with this this week, so if we can leave it in that would be great!

Comment thread scgpt_gp/data_collator.py

If we end up using scgpt as a model, it may be worth organizing the scgpt files into the relevant existing folders; what do you think?

Comment thread scgpt_gp/tasks/grn.py

I'm thinking it might be worth only keeping the scgpt files we need; what do you think?
