I want to generate embeddings for amplicon sequence variants (ASVs) and was wondering whether this model would work well for highly similar sequences of around 150 bp, in particular the V3-V4 region of prokaryotic genomes.
If so, I had a few questions:
- Are there any parameters or commands you would recommend when using only DNA input rather than multimodal input?
- What is the dimensionality of the resulting embeddings?
- Are there any transformations you recommend applying to the outputs? For example, DNABERT-S recommends averaging the token embeddings.
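
To make the last question concrete, the averaging I have in mind is mean pooling over the per-token embeddings, ignoring padding. The sketch below is only an illustration: `mean_pool`, the toy vectors, and the mask are hypothetical and are not taken from this model's API or from DNABERT-S.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average per-token vectors, skipping padded positions (mask == 0).

    Hypothetical helper for illustration; not part of any library.
    """
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("embeddings and mask must have the same length")
    # Keep only vectors whose mask entry is non-zero (real tokens).
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    # Column-wise mean over the kept token vectors.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three token vectors of dimension 2, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
```

This gives one fixed-size vector per sequence, with the same dimensionality as the individual token embeddings.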