A team of scientists from Carnegie Mellon University’s Computational Biology Department (CBD) has developed new methods for pinpointing regions of the genome that are key to understanding how particular species traits evolved.
The research, led by School of Computer Science Assistant Professor Andreas Pfenning and published in Science, is a contribution to the Zoonomia Project, an initiative to sequence the entire genomes of 240 mammals in order to reveal fundamental features of genes and traits that have significant implications for safeguarding human health and preserving biodiversity. The most recent advances in machine learning (ML) and artificial intelligence (AI) technologies are needed to make sense of these new, massive data sets.
The instructions for making proteins, the essential controllers of cell activity, are found in specific regions of the genome known as coding DNA. Small changes in those instructions accumulate over time and are one of the forces that drive evolution.
However, coding DNA accounts for just 1% of the three billion nucleotide pairs that make up the human genome. Other, non-coding DNA sequences called enhancers control when and where particular genes are activated.
The CMU team developed the Tissue-Aware Conservation Inference Toolkit (TACIT), a machine learning method, to better understand how enhancers function. A conventional model of evolution can show how mutations in a set of genes might change a species’ brain size, but enhancers may produce the same effect simply by turning genes on or off.
Most research on the evolution of mammals focuses on regions of the genome that have changed very little over millions of years. These conserved regions, especially genes, shed light on the essential components of mammalian DNA that underlie the distinctive characteristics of individual species.
The problem for Pfenning and his team is that an enhancer’s DNA sequence may change over time while its function stays the same. For instance, despite more than 700 million years of evolution, a well-studied Islet enhancer regulates gene levels in a comparable way in humans, mice, zebrafish, and sponges. This makes enhancers far harder to detect and track with conventional techniques that compare individual nucleotides.
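To see why nucleotide-by-nucleotide comparison can miss a functionally conserved enhancer, consider the minimal Python sketch below. The two sequences and the 8-base motif are hypothetical values chosen for illustration, not data from the study: both sequences contain the same motif at different offsets, so they share a functional element while agreeing at few aligned positions.

```python
def column_identity(a: str, b: str) -> float:
    """Fraction of aligned positions where the two sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def contains_motif(seq: str, motif: str) -> bool:
    """True if the motif occurs anywhere in the sequence."""
    return motif in seq


# Hypothetical toy data: the same motif sits at a different offset in each species.
MOTIF = "TGACGTCA"
species_a = "AAAA" + MOTIF + "CCCCCCCC"
species_b = "GGGGGGGGGG" + MOTIF + "TT"

# Position-by-position comparison sees mostly mismatches...
identity = column_identity(species_a, species_b)
# ...even though both sequences carry the same motif.
both_have_motif = contains_motif(species_a, MOTIF) and contains_motif(species_b, MOTIF)

print(f"column identity: {identity:.2f}, motif in both: {both_have_motif}")
```

A method that scores what a sequence is likely to do, rather than how closely its bases line up, is not thrown off by this kind of rearrangement.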
TACIT addresses this issue by accurately predicting whether an enhancer will be active in a specific cell type or tissue. Because it lets researchers locate these crucial enhancer regions in a newly sequenced genome without running a new lab experiment, it opens up potential applications in conservation biology. For endangered or threatened species, where controlled laboratory experiments are not feasible, the toolkit can predict how enhancers work.
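The general idea of predicting enhancer activity from sequence alone can be sketched in a few lines of Python. This is not TACIT’s model, which the article does not describe; it is a deliberately simple stand-in that scores 3-base subsequences (k-mers) by how much more often they appear in sequences labeled active than in sequences labeled inactive. All sequences, labels, and function names here are hypothetical.

```python
import math
from collections import Counter


def kmers(seq: str, k: int = 3) -> list[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train_log_odds(active: list[str], inactive: list[str], k: int = 3) -> dict[str, float]:
    """Per-k-mer log-odds of appearing in active vs. inactive sequences (add-one smoothing)."""
    pos = Counter(km for s in active for km in kmers(s, k))
    neg = Counter(km for s in inactive for km in kmers(s, k))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        km: math.log((pos[km] + 1) / pos_total) - math.log((neg[km] + 1) / neg_total)
        for km in vocab
    }


def predict_active(seq: str, weights: dict[str, float], k: int = 3) -> bool:
    """Predict activity by summing k-mer weights; unseen k-mers contribute 0."""
    return sum(weights.get(km, 0.0) for km in kmers(seq, k)) > 0


# Hypothetical training data from species where lab measurements exist.
active_seqs = ["GCGCACGTGCGC", "CCACGTGGCGCG", "GGCACGTGCCGC"]
inactive_seqs = ["ATATTTAAATAT", "TTAATATATTAA", "AATTTATAATTA"]
weights = train_log_odds(active_seqs, inactive_seqs)

# Hypothetical sequences from a newly sequenced genome with no lab data.
new_seq_1 = "CGCACGTGGCCG"
new_seq_2 = "TATAATTTATAT"
prediction_1 = predict_active(new_seq_1, weights)
prediction_2 = predict_active(new_seq_2, weights)
print(prediction_1, prediction_2)
```

The point of the sketch is the workflow: a model trained where experimental labels exist is applied to sequences from a species where no experiment has been run.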
Irene Kaplow, a lead author on the paper and a postdoctoral associate and Lane Fellow in CBD, said TACIT offers an unprecedented opportunity to predict the function of genome regions other than genes in species for which primary tissue samples cannot be obtained, such as the bottlenose dolphin and the critically endangered black rhinoceros. She added that as ML techniques and strategies for locating enhancers in specific cell types advance, she believes TACIT’s capabilities can be expanded to offer fresh perspectives on the evolution of mammals.
After predicting the function of genomic sequences across 240 mammals, the researchers used TACIT to identify parts of the genome associated with the evolution of larger brains in mammals, and they found that these regions tended to lie near genes whose mutations have been linked to human brain-size disorders. They also discovered an enhancer linked to social behaviour in animals that is specific to a subtype of neuron called the parvalbumin-positive inhibitory interneuron.
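One simple way to picture how predicted enhancer activity might be related to a trait is to correlate the two across species. The Python sketch below does that with a plain Pearson correlation on hypothetical numbers. It is an illustration only: the article does not describe the statistical method the researchers used, and because related species are not independent data points, a real analysis would need to account for their shared ancestry.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five species: a relative brain-size measure and the
# predicted activity of one enhancer in a brain tissue (0 = inactive, 1 = active).
brain_size = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_activity = [0.1, 0.3, 0.4, 0.7, 0.9]

r = pearson(brain_size, predicted_activity)
print(f"correlation between trait and predicted activity: {r:.2f}")
```

An enhancer whose predicted activity tracks a trait across many species becomes a candidate for follow-up, which is the kind of lead the brain-size results describe.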
“We think this is just the tip of the iceberg,” said Pfenning, the study’s principal author. “We applied TACIT to a small number of tissues and small number of traits and found interesting relationships, but there is still much to learn.”