A new contrastive learning method for analyzing genomic data is presented, focusing on short nucleotide sequences known as reads.
The method trains an encoder model to produce embeddings in which sequences from the same genomic region cluster together, preserving the sequential nature of genomic regions in the embedding space.
The model provides a general representation of $k$-mer sequences, suitable for various downstream tasks such as read data analysis, ancient DNA read mapping, identification of structural variations, and metagenomic species identification.
The approach demonstrates favorable scaling properties and shows promising results both for metagenomic applications and for mapping to genomes comparable in size to the human genome.
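
To make the training objective described above more concrete, the following is a minimal sketch of how $k$-mers might be extracted from a read and how a contrastive loss could pull together embeddings of sequences from the same genomic region. It is an illustration under stated assumptions, not the method's actual implementation: the InfoNCE-style loss, the temperature value, and the function names `kmers` and `info_nce` are all hypothetical choices, and the encoder itself is omitted (the embeddings are taken as given arrays).

```python
import numpy as np


def kmers(read: str, k: int) -> list[str]:
    """Return all overlapping k-mers of a read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def info_nce(anchors: np.ndarray, positives: np.ndarray,
             temperature: float = 0.1) -> float:
    """InfoNCE-style contrastive loss (an assumed choice of objective).

    Row i of `positives` is treated as the embedding of a sequence from the
    same genomic region as row i of `anchors`; all other rows in the batch
    act as negatives.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # Pairwise similarities, scaled by the temperature.
    logits = a @ p.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The matching pair for row i sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))
```

Under this sketch, the loss is small when each anchor is most similar to its own positive and large when the pairing is scrambled, which is the behaviour one would expect from an objective that clusters sequences by genomic region.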