<ul><li>This work presents a contrastive learning method for analyzing genomic data, focusing on short nucleotide sequences known as reads.</li><li>An encoder model is trained to produce embeddings in which sequences from the same genomic region cluster together, so that the sequential structure of genomic regions is preserved in the embedding space.</li><li>The model provides a general representation of $k$-mer sequences that is suitable for downstream tasks such as read data analysis, ancient DNA read mapping, identification of structural variations, and metagenomic species identification.</li><li>The approach shows favorable scaling properties, with promising results for metagenomic applications and for mapping to genomes comparable in size to the human genome.</li></ul>
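
The training idea summarized above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the summary does not state the loss, the sampling scheme, or the encoder architecture, so the InfoNCE loss, the `max_offset` window used to define positive pairs, and the `kmer_profile` stand-in for a learned encoder are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter
from itertools import product


def kmer_profile(read, k=3):
    """Hypothetical stand-in for a learned encoder: L2-normalised k-mer counts."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    vec = [float(counts[m]) for m in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sample_pairs(genome, read_len, max_offset, n_pairs, rng):
    """Draw (anchor, positive) reads whose start positions differ by at most max_offset.

    Assumption: reads from nearby positions count as the "same genomic region".
    """
    pairs = []
    last_start = len(genome) - read_len
    for _ in range(n_pairs):
        a = rng.randint(0, last_start)
        b = min(max(a + rng.randint(-max_offset, max_offset), 0), last_start)
        pairs.append((genome[a:a + read_len], genome[b:b + read_len]))
    return pairs


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] is the positive for anchors[i];
    every other entry of positives serves as a negative."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy data: a random "genome" and 16 pairs of overlapping reads.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
pairs = sample_pairs(genome, read_len=100, max_offset=20, n_pairs=16, rng=rng)
anchors = [kmer_profile(a) for a, _ in pairs]
positives = [kmer_profile(p) for _, p in pairs]
loss = info_nce(anchors, positives)
```

In an actual training loop, `kmer_profile` would be replaced by a trainable encoder and `loss` would be minimized by gradient descent, pulling embeddings of reads from nearby positions together and pushing apart reads from unrelated positions.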

Learning Genomic Structure from $k$-mers
