# gLM-collection

## Overview

Deciphering how DNA determines an organism's development, phenotype, genetic traits, and disease predisposition remains a significant challenge, and critical applications in human genetics depend on better solutions. Motivated by the recent release of the Logan dataset by Chikhi et al. (50 petabases of preassembled yet unlabeled biological sequences spanning hundreds of thousands of species) and the success of large language models (LLMs) on human language, we aim to train genomic language models (gLMs) that implicitly capture biological functional elements and their organization.

## Impact

This work aims to produce the first large-scale, publicly available gLM trained on over 50 petabases of data from all sequenced organisms, capturing the full diversity of the DNA language and enhancing our understanding of genetic mechanisms.

## Upcoming

- Code and benchmarks are coming soon: we will release the codebase and performance benchmarks for community use.

## Contributions and Contact

We welcome contributions and collaborations! Open an issue or submit a pull request to get involved.

## Authors

*Equal contribution