Genome: Synthesizing DNA Sequences with LLM Techniques

This methodology is not focused on genome data alone. The purpose is to design a generic solution that may also work in other contexts, such as synthesizing molecules. The problem involves dealing with a large amount of “text”. Indeed, the sequences discussed here consist of letter arrangements, from an alphabet that has 5 symbols: A, C, G, T and N. The first four symbols stand for the types of bases found in a DNA molecule: adenine (A), cytosine (C), guanine (G), and thymine (T). The last one (N) represents missing data. No prior knowledge of genome sequencing is required.

Summary

The data consists of DNA sequences from a number of individuals, categorized according to the type of genetic patterns found in each sequence. The goal is to synthesize realistic DNA sequences, evaluate the quality of the synthesized sequences, and compare the results with random sequences. The idea is to look at a DNA string S1 consisting of n1 consecutive symbols and identify potential candidates for the next string S2, consisting of n2 symbols. Then assign a probability to each string S2 conditionally on S1, use these transition probabilities to sample S2 given S1, move to the right by n2 symbols, and repeat. Eventually you build a synthetic sequence of arbitrary length. The approach is loosely analogous to Markov chains.
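The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the GitHub repository: the function names `build_transitions` and `synthesize`, the toy input string, and the fallback used when a context S1 was never observed are all assumptions made for the example.

```python
import random
from collections import defaultdict

def build_transitions(dna, n1, n2):
    """Count how often each n2-symbol string S2 follows each n1-symbol string S1."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[S1][S2] = occurrences
    for i in range(len(dna) - n1 - n2 + 1):
        s1 = dna[i:i + n1]
        s2 = dna[i + n1:i + n1 + n2]
        counts[s1][s2] += 1
    return counts

def synthesize(counts, seed, length, n1, rng):
    """Grow a sequence by repeatedly sampling S2 given the last n1 symbols."""
    out = seed
    while len(out) < length:
        followers = counts.get(out[-n1:])
        if not followers:
            # Unseen context: borrow the followers of a random observed S1.
            # This fallback is an assumption of the sketch, not part of the method.
            followers = counts[rng.choice(sorted(counts))]
        candidates = list(followers)
        weights = [followers[s2] for s2 in candidates]  # proportional to P(S2 | S1)
        out += rng.choices(candidates, weights=weights, k=1)[0]
    return out[:length]

# Toy usage on a made-up string (real data would come from the dataset).
real_dna = "ACGTACGTTGCAACGT" * 4
transitions = build_transitions(real_dna, n1=3, n2=2)
synthetic = synthesize(transitions, seed=real_dna[:3], length=40, n1=3,
                       rng=random.Random(0))
print(synthetic)
```

Sampling with weights equal to the raw counts is equivalent to sampling from the conditional probabilities, since normalizing the counts for a given S1 does not change their proportions.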

What you will learn

The implementation has different steps, each one with its own method, and an opportunity to learn new techniques. In particular:

  • Building the keyword architecture with an efficient use of hash tables (key-value pairs), including a hash table whose key is itself a hash table. The keys are the strings, or pairs of strings. The values are occurrence counts.
  • Measuring associations between strings, using the pointwise mutual information (PMI). A low PMI may be an indicator of a rare genetic condition.
  • Evaluating the quality of the synthetic DNA using the Hellinger distance and PDF scatterplots. In the figure below, each blue dot is the frequency vector for a specific string, computed on the real and synthetic DNA (on the X-axis and Y-axis, respectively). For the orange dots, the synthetic DNA is replaced by a random sequence.
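To make the last two items concrete, here is a small sketch of both measures using their standard textbook definitions: PMI as log(P(S1 S2) / (P(S1) P(S2))), and the Hellinger distance between two discrete distributions p and q as sqrt(0.5 · Σ (√p − √q)²). The helper names, the use of overlapping windows to estimate frequencies, and the toy strings are assumptions for illustration; the exact variants used in the technical document may differ.

```python
import math
from collections import Counter

def string_frequencies(dna, n):
    """Relative frequency of every n-symbol string in dna (overlapping windows)."""
    counts = Counter(dna[i:i + n] for i in range(len(dna) - n + 1))
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def pmi(dna, s1, s2):
    """PMI of s2 immediately following s1: log(P(s1 s2) / (P(s1) * P(s2)))."""
    p1 = string_frequencies(dna, len(s1)).get(s1, 0.0)
    p2 = string_frequencies(dna, len(s2)).get(s2, 0.0)
    p12 = string_frequencies(dna, len(s1) + len(s2)).get(s1 + s2, 0.0)
    if p1 == 0 or p2 == 0 or p12 == 0:
        return float("-inf")  # the pair (or one of its parts) never occurs
    return math.log(p12 / (p1 * p2))

def hellinger(p, q):
    """Hellinger distance between two frequency dictionaries; ranges from 0 to 1."""
    keys = set(p) | set(q)
    total = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
                for k in keys)
    return math.sqrt(total / 2)

# Toy usage: compare 2-symbol string frequencies of two made-up sequences.
real_dna = "ACGTACGTTGCAACGT" * 4
other_dna = "AACCGGTTACGTNACG" * 4
print(pmi(real_dna, "ACG", "TA"))
print(hellinger(string_frequencies(real_dna, 2), string_frequencies(other_dna, 2)))
```

A distance near 0 means the two sequences have nearly identical string frequencies. Computing the same distance against a random sequence gives a baseline for judging how much closer the synthetic DNA is to the real data.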

Accessing the material

The Python code and dataset are on GitHub, here. The corresponding article with technical documentation (7 pages including the code) is also on GitHub, here. Note that the tech document is an extract from my upcoming book “Practical AI & Machine Learning Projects and Datasets”, offered to participants in my GenAI certification program (see here). The relevant material starts at page 86. Links are not clickable in this extract, but they are in the full version of the textbook.

To avoid missing future articles and to access members-only content, sign up for my free newsletter, here.

Author

Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com, former VC-funded executive, author, and patent owner (one patent related to LLMs). Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET.

Vincent is also a former post-doc at Cambridge University and the National Institute of Statistical Sciences (NISS). He has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is the author of multiple books, including “Synthetic Data and Generative AI” (Elsevier, 2024). Vincent lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory. He recently launched a GenAI certification program, offering state-of-the-art, enterprise-grade projects to participants.