
scGPT #61

Open

HesamAsad wants to merge 15 commits into main from scGPT

Conversation

@HesamAsad
Collaborator

Added scgptDataModule, scgptWrapper, and a mode to run with both geneformer and scgpt.

  • Resolved merge conflicts
  • scGPT DataCollator modified

@HesamAsad HesamAsad requested a review from mariemoullet April 12, 2024 16:16
@HesamAsad HesamAsad self-assigned this Apr 12, 2024
Comment thread gplearner/Datamodules/datamodule.py Outdated
return output_dict

class scgptDataset(Dataset):
def __init__(self, adata, gene_ids, vocab, model_configs, adata_global, batch_ids=None):

A very minor comment, but could we call adata_global something like adata_for_recon? (global was confusing to me, but really not a big deal.)

self.train_dataset,
collate_fn=self.collator,
batch_size=self.batch_size,
# shuffle=True,

I usually set shuffle=True for train and shuffle=False for val/test; is there a reason it's commented out here?
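
For reference, the pattern I mean is roughly this (a sketch only; the class name and the val attributes are assumptions based on the snippet above):

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class ScgptDataModule(pl.LightningDataModule):  # hypothetical name, for illustration
    def train_dataloader(self):
        # reshuffle the training split every epoch
        return DataLoader(
            self.train_dataset,
            collate_fn=self.collator,
            batch_size=self.batch_size,
            shuffle=True,
        )

    def val_dataloader(self):
        # keep validation/test order deterministic
        return DataLoader(
            self.val_dataset,
            collate_fn=self.collator,
            batch_size=self.batch_size,
            shuffle=False,
        )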

seed=0,
dataset_name="ms",
do_train=False,
load_model=f"/lustre/scratch126/cellgen/team205/ha11/scGPT/{scgpt_mod}/",

For anything that has farm-specific paths, is it possible to have it as a function argument? I need to discuss with Amir what the best practice is around this (if you have suggestions let me know, it's not something I know about!), but it might make our lives easier later.
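
Something like this is what I have in mind (just a sketch; the argument name is made up):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--scgpt-model-dir",  # hypothetical flag; any name works
    type=str,
    required=True,
    help="Directory containing the pretrained scGPT checkpoint",
)
args = parser.parse_args()

config = dict(
    seed=0,
    dataset_name="ms",
    do_train=False,
    load_model=args.scgpt_model_dir,  # no farm-specific default baked in
)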

ecs_thres=0.0, # Elastic cell similarity objective, 0.0 to 1.0, 0.0 to disable
dab_weight=0.0,
lr=1e-4,
batch_size=32,

If the batch size is hardcoded here, does that cause issues in training?


# settings for training
MLM = False # whether to use masked language modeling, currently it is always on.
CLS = True # celltype classification objective

Does this add a cell classification head? Do we need this?

cell_embedding_mode = 'cls'
max_length = 9585
batch_size = 4
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Do you know where this comes into the code? I think this might cause issues for multi-GPU training. (I normally let PyTorch Lightning handle this unless it's an evaluation task that I'm sure is on a single GPU.)
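
A sketch of what I mean (assuming the wrapper is a LightningModule; names are illustrative):

import torch
import pytorch_lightning as pl

class ScgptWrapper(pl.LightningModule):  # hypothetical name
    def training_step(self, batch, batch_idx):
        # Lightning keeps self.device up to date for each process, so new
        # tensors can be created on self.device; constructing a
        # torch.device(...) manually and moving things yourself can conflict
        # with how Lightning places the model and data across multiple GPUs.
        padding_mask = torch.zeros(
            batch["gene_ids"].shape, dtype=torch.bool, device=self.device
        )
        ...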

loss_mode: str = 'mse',
n_genes: int = 25426,
d_model: int = 256,
d_model: int = 128,

In the geneformer version this corresponds to the size of the embeddings; why does it need to be 128? (It can be specified elsewhere in the code if needed, just trying to understand.)

Comment thread gplearner/Models/gp_model.py Outdated
self.do_ensembl_conversion = do_ensembl_conversion
self.n_blocks = n_blocks
self.attn_dropout = attn_dropout
self.use_flash = use_flash

It's quite helpful for me to have this as an option for testing speed; is it possible to leave it here?

gp_inputs=gp_inputs,
add_remaining_var=add_remaining_var,
use_flash=self.use_flash,
use_flash=True,

And here, for the geneformer version at least, it's not obvious that flash attention provides an improvement (the sequence may not be long enough for us to see benefits); is it possible to leave it as optional?
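
For example, keeping it as a pass-through flag (a sketch; the class and attribute names are illustrative):

import torch.nn as nn

class GPModel(nn.Module):  # hypothetical name, for illustration
    def __init__(self, d_model: int = 256, use_flash: bool = False):
        super().__init__()
        # keep flash attention opt-in so it can be benchmarked on the
        # geneformer path; forward the flag instead of hardcoding True
        self.use_flash = use_flash
        self.block_kwargs = dict(dim=d_model, use_flash=self.use_flash)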

drop_path_rate=0.0, # no effect if only 1 block
norm_layer=nn.LayerNorm,
use_pos_emb=True,
use_pos_emb=False,

For the geneformer version I need the positional embeddings; is it possible to pass this as an argument somewhere else in the script? (If not, I can write it, but we would essentially need use_pos_emb=True when mode=geneformer and False when mode=scgpt.)
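
A sketch of the conditional I mean (mode is assumed to be the existing geneformer/scgpt switch, and build_model stands in for whatever constructor is called here):

import torch.nn as nn

# derive the flag from the model mode instead of hardcoding it
use_pos_emb = (mode == "geneformer")  # True for geneformer, False for scgpt

model = build_model(  # hypothetical constructor, for illustration
    drop_path_rate=0.0,
    norm_layer=nn.LayerNorm,
    use_pos_emb=use_pos_emb,
)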

Comment thread gplearner/Train/training.py Outdated
precision='bf16-mixed',
profiler='simple',
strategy=strategy,
# strategy=strategy,

I will be experimenting with this this week, so if we can leave it in that would be great!

Comment thread scgpt_gp/data_collator.py

If we end up using scgpt as a model, it may be worth organizing the scgpt files into the relevant existing folders; what do you think?

Comment thread scgpt_gp/tasks/grn.py

I'm thinking it might be worth only keeping the scgpt files we need; what do you think?
