Abstract
Large Language Models (LLMs) trained on natural language have achieved a
level of performance that allows the generation of coherent and
syntactically correct text. The DNA sequence of a genome follows rules
similar to those of natural language, but a distinguishing factor is the
absence of a concept analogous to words. We established byte-pair
tokenization on the human genome and trained a foundation language
model called GROVER (``Genome Rules Obtained Via Extracted
Representations''), selecting the optimal vocabulary with a custom
fine-tuning task of next-k-mer prediction. We thus defined a
dictionary of words/tokens in the human genome that best carries the
information content for DNA language models.
Analyzing GROVER's learned representations, we observed that
token embeddings primarily encode information related to their
frequency, sequence content, and length. Some tokens are almost
exclusively localized in repeats, while the vast majority is widely
distributed across the genome. The model also learns context and
lexical ambiguity. Average embeddings of genomic regions relate
to functional genomics annotation and thus indicate that GROVER
has learned these structures purely from the contextual
relationships of tokens. That we can extract functional
annotations from the genome, purely based on the sequence
representations of the trained model, highlights the extent of
information content encoded by the sequence. This is supported by
fine-tuning tasks on genome biology that address questions of promoter
identity and protein-DNA binding. GROVER learns sequence context and
develops a sense for grammatical structures and language rules in the
genome. This knowledge can be extracted and used to compose a
grammar book for the code of life.
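
To illustrate the tokenization principle described above, the following
minimal Python sketch applies byte-pair encoding to a short DNA string,
starting from single nucleotides and merging the most frequent adjacent
token pair in each cycle. The function byte_pair_tokenize and the example
sequence are hypothetical illustrations, not GROVER's implementation; the
actual vocabulary is built from the full human genome and selected via
next-k-mer prediction.

    from collections import Counter

    def byte_pair_tokenize(sequence, n_merges):
        # Toy byte-pair tokenization of a DNA string: start from single
        # nucleotides and, in each cycle, merge the most frequent adjacent
        # token pair into a new vocabulary entry.
        tokens = list(sequence)
        vocab = set(tokens)
        for _ in range(n_merges):
            pairs = Counter(zip(tokens, tokens[1:]))
            if not pairs:
                break
            (left, right), _ = pairs.most_common(1)[0]
            merged = left + right
            vocab.add(merged)
            # Greedily replace occurrences of the winning pair left to right.
            new_tokens, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (left, right):
                    new_tokens.append(merged)
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            tokens = new_tokens
        return tokens, vocab

    # Example: five merge cycles on a short synthetic sequence.
    tokens, vocab = byte_pair_tokenize("ATATATGCGCGCATATGC", n_merges=5)
    print(tokens)
    print(sorted(vocab, key=len))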