Genome: Synthesizing DNA Sequences with LLM Techniques

This methodology is not limited to genome data. The purpose is to design a generic solution that may also work in other contexts, such as synthesizing molecules. The problem involves dealing with a large amount of “text”: the sequences discussed here are arrangements of letters from an alphabet of 5 symbols, namely A, C, G, T and N. The first four symbols stand for the types of bases found in a DNA molecule: adenine (A), cytosine (C), guanine (G), and thymine (T). The last one (N) represents missing data. No prior knowledge of genome sequencing is required.

Summary

The data consists of DNA sequences from a number of individuals, categorized according to the type of genetic patterns found in each sequence. The goal is to synthesize realistic DNA sequences, evaluate the quality of the synthesized sequences, and compare the results with random sequences. The idea is to look at a DNA string S₁ consisting of n₁ consecutive symbols and identify potential candidates for the next string S₂, consisting of n₂ symbols. Then assign a probability to each string S₂ conditionally on S₁, use these transition probabilities to sample S₂ given S₁, move to the right by n₂ symbols, and repeat. Eventually you build a synthetic sequence of arbitrary length. The approach is somewhat analogous to Markov chains.
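The sampling loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the GitHub repository: the function names `build_transitions` and `synthesize`, the nested-dictionary layout, and the fallback used when a context S₁ was never observed (drawing n₂ symbols uniformly from A, C, G, T) are all assumptions made for this example.

```python
import random
from collections import defaultdict

def build_transitions(seq, n1, n2):
    """Count how often each n2-symbol string S2 follows each n1-symbol string S1."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(seq) - n1 - n2 + 1):
        s1 = seq[i:i + n1]
        s2 = seq[i + n1:i + n1 + n2]
        counts[s1][s2] += 1
    return counts

def synthesize(counts, seed, length, n1, n2, rng=None):
    """Grow a sequence from `seed`, sampling S2 given the last n1 symbols (S1)."""
    rng = rng or random.Random()
    out = seed
    while len(out) < length:
        s1 = out[-n1:]
        followers = counts.get(s1)
        if followers:
            # Sample S2 with probability proportional to its count after S1.
            candidates = list(followers)
            weights = [followers[s2] for s2 in candidates]
            s2 = rng.choices(candidates, weights=weights, k=1)[0]
        else:
            # Unseen context: uniform fallback (an assumption of this sketch).
            s2 = "".join(rng.choice("ACGT") for _ in range(n2))
        out += s2
    return out[:length]

# Toy usage on a made-up sequence (not real genome data).
real = "ACGTACGTACGT" * 5
table = build_transitions(real, n1=2, n2=1)
synthetic = synthesize(table, seed="AC", length=20, n1=2, n2=1, rng=random.Random(0))
```

Dividing each count by the total for its S₁ gives the transition probabilities; `rng.choices` does this normalization implicitly through its `weights` argument.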

What you will learn

The implementation consists of several steps, each with its own method and an opportunity to learn new techniques. In particular:

Building the keyword architecture with an efficient use of hash tables (key-value pairs), including a hash table whose key is itself a hash table. The keys are the strings, or pairs of strings. The values are occurrence counts.
Measuring associations between strings, using the pointwise mutual information (PMI). A low PMI may be an indicator of a rare genetic condition.
Evaluating the quality of the synthetic DNA using the Hellinger distance, and PDF scatterplots such as the one below. In the figure below, each blue dot corresponds to a specific string, with its frequency in the real DNA on the X-axis and its frequency in the synthetic DNA on the Y-axis. For the orange dots, the synthetic DNA is replaced by a random sequence.
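To make the last two items concrete, here is a minimal sketch of both measures computed from string counts. It assumes the standard definitions: PMI(S₁, S₂) = log[P(S₁S₂) / (P(S₁)P(S₂))], and the Hellinger distance between two discrete frequency distributions p and q, equal to the square root of ½ Σ (√pᵢ − √qᵢ)². The helper names (`string_counts`, `pmi`, `hellinger`) are hypothetical, and the original code may estimate the probabilities differently.

```python
import math
from collections import Counter

def string_counts(seq, n):
    """Occurrence counts of every n-symbol string in seq (overlapping windows)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def pmi(seq, s1, s2):
    """Pointwise mutual information of s2 occurring immediately after s1."""
    c1 = string_counts(seq, len(s1))
    c2 = string_counts(seq, len(s2))
    c12 = string_counts(seq, len(s1) + len(s2))
    p12 = c12[s1 + s2] / sum(c12.values())
    if p12 == 0:
        return float("-inf")  # the pair never occurs in seq
    p1 = c1[s1] / sum(c1.values())
    p2 = c2[s2] / sum(c2.values())
    return math.log(p12 / (p1 * p2))

def hellinger(seq_a, seq_b, n):
    """Hellinger distance between the n-symbol string frequencies of two sequences."""
    ca, cb = string_counts(seq_a, n), string_counts(seq_b, n)
    ta, tb = sum(ca.values()), sum(cb.values())
    keys = set(ca) | set(cb)
    total = sum((math.sqrt(ca[k] / ta) - math.sqrt(cb[k] / tb)) ** 2 for k in keys)
    return math.sqrt(total / 2)
```

The distance ranges from 0 (identical frequency distributions) to 1 (no string in common), so a synthetic sequence of good quality should sit much closer to the real DNA than a random sequence does. Both sequences must contain at least n symbols, otherwise the totals are zero.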

Accessing the material

The Python code and dataset are on GitHub, here. The corresponding article with technical documentation (7 pages including the code) is also on GitHub, here. Note that the technical document is an extract from my upcoming book “Practical AI & Machine Learning Projects and Datasets”, offered to participants in my GenAI certification program (see here). The relevant material starts on page 86. Links are not clickable in this extract, but they are in the full version of the textbook.

To avoid missing future articles and to access members-only content, sign up for my free newsletter, here.

Author

Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com, former VC-funded executive, author, and patent owner (one patent relates to LLMs). Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET.

Vincent is also a former post-doc at Cambridge University and at the National Institute of Statistical Sciences (NISS). He has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is the author of multiple books, including “Synthetic Data and Generative AI” (Elsevier, 2024). Vincent lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory. He recently launched a GenAI certification program, offering state-of-the-art, enterprise-grade projects to participants.

