Unraveling the secrets of non-coding genes with AI

 

Artificial Intelligence (AI) is permeating every aspect of our lives, from intelligent chatbots to programs capable of writing whole articles. Research associate Michael Schon of Wageningen University & Research is developing an AI tool that analyzes and compares non-coding RNA in plant genomes. The tool is expected to streamline and speed up the breeding of the next generation of plant varieties that are more resistant to diseases or drought, for example.

In organisms, proteins serve as the building blocks of cells. RNA produced from genes carries the instructions for making these proteins. In addition to these coding RNAs, certain genes also produce non-coding RNAs: RNA that does not carry the instructions needed to make a protein.

According to Michael Schon, this kind of RNA is crucial for the growth and development of organisms. Non-coding RNAs can, for example, switch genes on or off, which affects a plant’s appearance and characteristics. A few important non-coding RNAs also influence when a plant reaches maturity.

Relatives within the same family

Non-coding RNA may also explain why a plant species differs from others in its family despite sharing some of the same traits. In earlier studies, Schon identified non-coding RNAs in Arabidopsis thaliana (thale cress), a plant that scientists use as a model organism.

Arabidopsis belongs to the Brassicaceae family, which also includes major crops like broccoli, cauliflower, and kohlrabi. This family is sometimes referred to as the crucifer or mustard family. Comparing Arabidopsis’ non-coding RNAs to those of other mustard family plants is challenging, though, as prior research in these species has mostly concentrated on protein-coding genes.

Limited non-coding RNA annotation

This means that, in order to compare plants, the non-coding RNA genes of each crop first need their own annotation. Through his Veni project, Schon is searching for novel approaches to finding non-coding RNAs that draw on insights from related species.

For plants in the mustard family, more than 200 genome sequences are available. Each genome is stored as a massive text file made up of millions of letters that represent the bases of a DNA molecule (A, C, T, and G). Because the non-coding portions of these genomes are poorly cataloged (annotated), it is not feasible to compare all of the non-coding genes scattered throughout this mountain of data. That requires new strategies and tools, which Schon is working to develop.

A small part of every genome

Finding the right place to search in the genome is the first challenge. Schon is creating several tools, one of which he calls GeneSketch. To find the corresponding regions of different genomes, he is using a technique known as the Minimizer Sketch.

According to Schon, the idea behind the Minimizer Sketch is that you only need to examine a small portion of the DNA, a sketch, rather than the complete sequence. Instead of having to look at millions of characters per genome, you only need to look at a few thousand.

The Minimizer Sketch has previously been used to construct an evolutionary tree of primates, which include humans and our closest relatives. It turned out that sketches covering less than 1% of the total genomes were enough to build a remarkably precise family tree. A minimizer sketch is therefore an extremely effective way to determine how similar two pieces of DNA are, and as such it should be helpful for comparing genomes within the mustard family.
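
To make the idea concrete, here is a minimal Python illustration of a window-minimizer sketch and a similarity score between two sketches. It is not GeneSketch's actual code: the function names, the k-mer length k, the window size w, the choice of hash, and the use of Jaccard similarity as the score are all assumptions made for this example.

```python
import hashlib


def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer (Python's built-in hash() changes per run)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_sketch(seq, k=5, w=4):
    """Keep, for every window of w consecutive k-mers, the k-mer with the smallest hash.

    Hypothetical parameters: k and w are tiny here so the example is readable.
    Sequences too short to fill one window give an empty sketch.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    sketch = set()
    for start in range(len(kmers) - w + 1):
        sketch.add(min(kmers[start:start + w], key=kmer_hash))
    return sketch


def jaccard(a, b):
    """Share of sketch entries that two sketches have in common (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two made-up sequences that differ only in their last letter.
seq_a = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGCCGATATC"
seq_b = seq_a[:-1] + "G"
similarity = jaccard(minimizer_sketch(seq_a), minimizer_sketch(seq_b))
print(f"sketch similarity: {similarity:.2f}")
```

Because neighboring windows often pick the same k-mer, the sketch is much smaller than the full list of k-mers, and two related sequences still share most of their sketch entries. On real genomes, k and w would be considerably larger than in this toy example.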

The same technology as ChatGPT

Understanding what you are looking at is the next step after knowing where to look. For GeneSketch, Schon has in mind the same technology that underlies ChatGPT and other AI applications.

According to Schon, the technology is known as “transformer” technology.

For example, you can ask a transformer to complete a sentence in which a word is missing. An untrained transformer has never seen any words before, so at first it gives you a random word. But if it is trained on millions of example sentences, it gradually learns to recognize patterns in the text and can predict the correct words.
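
The fill-in-the-missing-word task can be illustrated with a few lines of Python. Note that this is deliberately not a transformer: a simple counter of neighboring words stands in for the model so that the example stays small and runnable. The names train and fill_mask, the [MASK] marker, and the example sentences are all invented for this illustration.

```python
from collections import Counter, defaultdict

MASK = "[MASK]"


def train(sentences):
    """Count which word was seen between each pair of (left, right) neighbors."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        padded = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(padded) - 1):
            counts[(padded[i - 1], padded[i + 1])][padded[i]] += 1
    return counts


def fill_mask(counts, sentence):
    """Guess the masked word from its neighbors; None if the context was never seen."""
    words = [w if w == MASK else w.lower() for w in sentence.split()]
    padded = ["<s>"] + words + ["</s>"]
    i = padded.index(MASK)
    seen = counts.get((padded[i - 1], padded[i + 1]))
    if not seen:
        return None  # an untrained model has nothing to go on
    return seen.most_common(1)[0][0]


examples = [
    "the plant grows fast",
    "the plant grows slowly",
    "the crop grows fast",
]
model = train(examples)
print(fill_mask(model, "the [MASK] grows fast"))
```

With no training data the function has no basis for an answer; after seeing the three example sentences it picks the word it saw most often in that position. A transformer learns far richer patterns than neighboring words, but the training task has the same shape.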

After training, a large language model such as ChatGPT excels at specific tasks such as question answering and language translation. A transformer can be trained to understand not only human languages but also the language of DNA, which has patterns of its own. Schon is developing a model to identify patterns in the DNA of several species and convert those patterns into a language that is comprehensible to humans.
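
Before a language model can read DNA, the letters have to be turned into "words" it can work with. One simple way to do that, shown here purely as an assumption and not as GeneSketch's method, is to cut the sequence into overlapping k-letter pieces (k-mers) and give each possible piece a number. The names build_vocab and tokenize and the choice k=3 are invented for this example.

```python
from itertools import product


def build_vocab(k=3):
    """Map every possible DNA k-mer to an integer id (4**k entries)."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def tokenize(seq, vocab, k=3):
    """Split seq into overlapping k-mers and look up their ids.

    K-mers containing other letters (for example N for an unknown base)
    are mapped to -1 in this sketch.
    """
    seq = seq.upper()
    return [vocab.get(seq[i:i + k], -1) for i in range(len(seq) - k + 1)]


vocab = build_vocab(3)
print(len(vocab))                 # 64 possible 3-letter words
print(tokenize("ACGT", vocab))    # ids for "ACG" and "CGT"
```

The resulting list of numbers plays the same role for a DNA model that a list of word ids plays for a text model.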

The model has to be trained

Schon will train GeneSketch’s transformer to recognize how genes, particularly non-coding genes, vary between species. However, he anticipates several difficulties along the way.

Reliability is a crucial concern. Because the technology is relatively new, transformers still make mistakes. ChatGPT, for example, was trained on a variety of text sources, but if you ask it about a subject it was not exposed to during training, it has to invent an answer. You hope that it will come up with something plausible based on the patterns it has observed, but this is never a given. Of course, you want to avoid nonsensical output. The more you train a transformer, the less nonsense it produces, but training can be expensive and time-consuming. Is it better to train a model from scratch, or to build on an existing model? Schon is experimenting with both strategies.

Opportunities for GeneSketch

By the end of the first year of the project, which began in October 2023, Schon aims to have a prototype of GeneSketch. He intends to use it to annotate every gene in the mustard family.

Schon says the tool could be useful for both the agricultural industry and the research sector. For example, it might give seed developers a rapid means of understanding the genetic makeup of a crop and its wild relatives. With greater knowledge of how crops have evolved their distinctive features over time, breeders could make better choices for enhancing traits, such as increasing crop resilience to climate change. The potential impact could therefore be enormous.
