how to re-train on a different dataset?

Hi @scopello, thanks for the really cool work on this mixed-modality gLM. It has got us very curious to see if it can be useful for our research context.

We are preparing a new release of GlobDB (https://globdb.org/home) with more than 300,000 species-representative microbial genomes. We think it should be feasible to re-train your gLM2 on this dataset. We see some advantages in at least trying this, not least because GlobDB collects a lot of microbial diversity from different sources. Hopefully it might also interest you as a way of testing the wider applicability or limitations of the model.

From the clear descriptions you gave in the data pre-processing section of the paper, we think we can take care of the multi-modal data setup. However, after looking through your repo, I am not clear on how one would go about the training process, even though a lot of the functions and classes are there. Could you also provide the code or scripts that you used when initially training gLM2?

Thank you for your time.

how to re-train on a different dataset? #10
