Conversation
Removed a mistake introduced during merge conflict resolution.
```python
return output_dict


class scgptDataset(Dataset):
    def __init__(self, adata, gene_ids, vocab, model_configs, adata_global, batch_ids=None):
```
The most minor comment, but could we call `adata_global` something like `adata_for_recon`? ("global" was confusing to me, but it's really not a big deal.)
```python
    self.train_dataset,
    collate_fn=self.collator,
    batch_size=self.batch_size,
    # shuffle=True,
```
I usually set `shuffle=True` for train and `shuffle=False` for val/test; is there a reason it's commented out here?
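For reference, the pattern I usually follow (just a sketch; `train_dataset`, `val_dataset`, and `collator` stand in for whatever the DataModule builds):

```python
from torch.utils.data import DataLoader

# Shuffle only the training split so each epoch sees a new ordering;
# keep validation/test deterministic so metrics are comparable across runs.
train_loader = DataLoader(
    train_dataset,
    collate_fn=collator,
    batch_size=32,
    shuffle=True,
)
val_loader = DataLoader(
    val_dataset,
    collate_fn=collator,
    batch_size=32,
    shuffle=False,
)
```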
```python
seed=0,
dataset_name="ms",
do_train=False,
load_model=f"/lustre/scratch126/cellgen/team205/ha11/scGPT/{scgpt_mod}/",
```
For anything that has farm-specific paths, is it possible to have it as a function argument? (I need to discuss with Amir what best practice is around this; if you have suggestions let me know, it's not something I know much about! But it might make our lives easier later.)
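Something along these lines is what I mean (a hypothetical sketch only; the function and argument names are made up, not the actual API):

```python
from pathlib import Path

def load_scgpt_settings(
    model_dir: Path,  # caller supplies the farm path, e.g. the /lustre/... checkpoint dir
    dataset_name: str = "ms",
    do_train: bool = False,
    seed: int = 0,
) -> dict:
    # No cluster-specific path is baked in; it arrives as an argument
    # (or later from a config file / CLI flag).
    return {
        "seed": seed,
        "dataset_name": dataset_name,
        "do_train": do_train,
        "load_model": str(model_dir),
    }
```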
```python
ecs_thres=0.0,  # Elastic cell similarity objective, 0.0 to 1.0, 0.0 to disable
dab_weight=0.0,
lr=1e-4,
batch_size=32,
```
If the batch size is hardcoded here, does that cause issues in training?
```python
# settings for training
MLM = False  # whether to use masked language modeling, currently it is always on.
CLS = True  # celltype classification objective
```
Does this add a cell-type classification head, and do we need it?
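(For context, what I'd expect a `CLS` objective to imply is roughly the following; this is a sketch with assumed names, not scGPT's actual implementation:)

```python
import torch.nn as nn

# A cell-type classification objective typically means a small linear head
# on top of the cell embedding, trained with cross-entropy against labels.
class CellTypeHead(nn.Module):
    def __init__(self, d_model: int, n_celltypes: int):
        super().__init__()
        self.classifier = nn.Linear(d_model, n_celltypes)

    def forward(self, cell_emb):  # cell_emb: (batch, d_model)
        return self.classifier(cell_emb)  # logits: (batch, n_celltypes)
```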
```python
cell_embedding_mode = 'cls'
max_length = 9585
batch_size = 4
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```
Do you know where this comes into the code? I think this might cause issues for multi-GPU training. (I normally let PyTorch Lightning handle device placement, unless it's an evaluation task that I'm sure runs on a single GPU.)
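What I mean is something like this (a sketch assuming Lightning 2.x; `Wrapper` and the batch key are placeholders):

```python
import torch
import lightning.pytorch as pl

class Wrapper(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        # Lightning sets self.device per process, so new tensors land on the
        # right GPU under DDP without a hardcoded torch.device("cuda").
        pad_mask = torch.zeros(batch["x"].shape[0], device=self.device)
        ...
```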
```diff
 loss_mode: str = 'mse',
 n_genes: int = 25426,
-d_model: int = 256,
+d_model: int = 128,
```
In the geneformer version this corresponds to the size of the embeddings; why does it need to be 128? (It can be specified elsewhere in the code if needed, I'd just like to understand.)
```diff
 self.do_ensembl_conversion = do_ensembl_conversion
 self.n_blocks = n_blocks
 self.attn_dropout = attn_dropout
-self.use_flash = use_flash
```
It's quite helpful for me to have this as an option for testing speed; is it possible to leave it here?
```diff
 gp_inputs=gp_inputs,
 add_remaining_var=add_remaining_var,
-use_flash=self.use_flash,
+use_flash=True,
```
And here: for the geneformer version at least, it's not obvious that flash attention provides an improvement (our sequences may not be long enough for us to see the benefits), so is it possible to leave it as optional? A sketch of what I mean follows.
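Keeping it behind a flag could be as simple as this (a sketch assuming PyTorch >= 2.0, where `F.scaled_dot_product_attention` dispatches to flash kernels when the backend supports them):

```python
import torch.nn.functional as F

def attention(q, k, v, use_flash: bool = False):
    if use_flash:
        # Fused kernel path (flash attention when available).
        return F.scaled_dot_product_attention(q, k, v)
    # Plain softmax attention, useful as the baseline for speed comparisons.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return scores.softmax(dim=-1) @ v
```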
```diff
 drop_path_rate=0.0,  # no effect if only 1 block
 norm_layer=nn.LayerNorm,
-use_pos_emb=True,
+use_pos_emb=False,
```
For the geneformer version I need the positional embeddings; is it possible to pass this as an argument somewhere else in the script? (If not, I can write it, but we would essentially need `use_pos_emb=True` when `mode="geneformer"` and `False` when `mode="scgpt"`, as sketched below.)
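Something like this small helper would cover it (a sketch; the mode strings match the PR's `mode` option, but the function name is made up):

```python
def resolve_use_pos_emb(mode: str) -> bool:
    # geneformer relies on positional embeddings; scgpt does not use them.
    if mode == "geneformer":
        return True
    if mode == "scgpt":
        return False
    raise ValueError(f"unknown mode: {mode!r}")
```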
```diff
 precision='bf16-mixed',
 profiler='simple',
-strategy=strategy,
+# strategy=strategy,
```
I will be experimenting with this this week, so if we can leave it in, that would be great!
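One way to keep it without affecting single-device runs (a sketch assuming Lightning 2.x, where `strategy` defaults to `"auto"`):

```python
import lightning.pytorch as pl

def build_trainer(strategy: str = "auto") -> pl.Trainer:
    # "auto" preserves the current single-device behaviour, while callers
    # can still pass e.g. strategy="ddp" for multi-GPU experiments.
    return pl.Trainer(
        precision="bf16-mixed",
        profiler="simple",
        strategy=strategy,
    )
```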
If we end up using scGPT as a model, it may be worth organizing the scGPT files into the relevant existing folders. What do you think?
I'm thinking it might be worth only keeping the scGPT files we need. What do you think?
Added `scgptDataModule`, `scgptWrapper`, and a `mode` option to run with both geneformer and scgpt.