Abstract

Large Language Models (LLMs) for natural language have achieved a level of performance that allows the generation of coherent and syntactically correct text. The DNA sequence of a genome follows rules similar to those of natural language, but a distinguishing factor is the absence of a concept analogous to words. We established byte-pair tokenization on the human genome and trained a foundation language model called GROVER ("Genome Rules Obtained Via Extracted Representations"), selecting the optimal vocabulary with a custom fine-tuning task of n…
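Byte-pair tokenization, as mentioned in the abstract, builds a vocabulary by repeatedly merging the most frequent adjacent token pair. The sketch below illustrates the general idea on a toy DNA string; it is a minimal illustration of standard BPE, not the authors' GROVER implementation, and the example sequence and function name are invented for demonstration.

```python
from collections import Counter

def bpe_vocab(sequence, num_merges):
    """Learn a toy byte-pair-encoding vocabulary from a DNA sequence.

    Starts from single nucleotides and repeatedly merges the most
    frequent adjacent token pair into a new, longer token.
    """
    tokens = list(sequence)          # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Replace every non-overlapping occurrence of the winning pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = bpe_vocab("ATGATGATGCGC", num_merges=3)
# merges → ['AT', 'ATG', 'ATGATG']; tokens → ['ATGATG', 'ATG', 'C', 'G', 'C']
```

Each merge creates a candidate "word" from frequent nucleotide combinations, which is how a BPE vocabulary can stand in for the word boundaries that DNA lacks.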
