# gLM-collection

## Overview

Deciphering how DNA determines an organism's development, phenotype, genetic traits, and disease predisposition remains a significant challenge, and critical applications in human genetics depend on better solutions. Motivated by the recent release of the Logan dataset by Chikhi et al. (50 petabases of preassembled yet unlabeled biological sequences spanning hundreds of thousands of species) and the success of large language models (LLMs) on human language, we aim to train genomic language models (gLMs) that implicitly capture biological functional elements and their organization.

## Impact

This work aims to produce the first large-scale, publicly available gLM trained on over 50 petabases of data from all sequenced organisms, capturing the full diversity of the DNA language and enhancing our understanding of genetic mechanisms.

## Upcoming

- Code and benchmarks are coming soon: we will release the codebase and performance benchmarks for community use.

## Contributions and Contact

We welcome contributions and collaborations! Open an issue or submit a pull request to get involved.

## Authors

*Equal contribution