Unpublished

The human genome's vocabulary as proposed by the DNA language model GROVER

(July 2023)

Abstract

Large Language Models (LLMs) on natural language have achieved a level of performance that allows the generation of coherent and syntactically correct text. The DNA sequence of genomes follows rules similar to those of natural language, but a distinguishing factor is the absence of a concept analogous to words. We established byte-pair tokenization on the human genome and trained a foundation language model called GROVER ("Genome Rules Obtained Via Extracted Representations"), selecting the optimal vocabulary with a custom fine-tuning task of next-k-mer prediction. We thus defined a dictionary of words/tokens in the human genome that best carries the information content for DNA language models. Analyzing GROVER's learned representations, we observed that token embeddings primarily encode information related to their frequency, sequence content, and length. Some tokens are almost exclusively localized in repeats, while the vast majority is widely distributed across the genome. The model also learns context and lexical ambiguity. Average embeddings of genomic regions relate to functional genomics annotation and thus indicate that GROVER has learned these structures purely from the contextual relationships of tokens. That we can extract functional annotations purely from the trained model's sequence representations highlights the extent of the information content encoded in the sequence. This is supported by fine-tuning tasks on genome biology, addressing questions of promoter identity and protein-DNA binding. GROVER learns sequence context and develops a sense of the grammatical structures and language rules of the genome. This knowledge can be extracted and used to compose a grammar book for the code of life.
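For readers unfamiliar with byte-pair tokenization, the following minimal Python sketch illustrates the general idea of building a vocabulary from raw nucleotides by repeatedly merging the most frequent adjacent token pair. The function name, the toy sequence, and the fixed merge count are hypothetical choices for illustration; GROVER's actual tokenizer operates at genome scale and selects its vocabulary size via the next-k-mer prediction task, which is not reproduced here.

from collections import Counter

def byte_pair_tokenize(sequence, num_merges):
    """Greedily merge the most frequent adjacent token pair, num_merges times."""
    tokens = list(sequence)  # start from single nucleotides: A, C, G, T
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats; further merges add no compression
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # fuse the winning pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(byte_pair_tokenize("ATGATGCGTACGTACG", num_merges=5))

Each merge cycle enlarges the vocabulary by one token, so the number of cycles determines the vocabulary size; in the paper's framing, the cycle count that yields the best next-k-mer prediction performance defines the genome's "optimal" vocabulary.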
