Biomedical engineers at Duke University have developed a technique that greatly improves the effectiveness of machine learning models searching for new molecular therapeutics while using only a fraction of the available data. By using an algorithm that actively identifies gaps in datasets, researchers can in some cases more than double their accuracy.
The approach could make it easier for researchers to identify and classify molecules with properties that could be useful in developing new drug candidates and other materials.
The research was published on June 23 in Digital Discovery, a journal of the Royal Society of Chemistry.
Machine learning algorithms are increasingly used to identify and predict the properties of small molecules, including drug candidates and other compounds. Despite considerable advances in computing power and machine learning techniques, their capabilities are still limited by the imperfect datasets used to train them.
Data bias is one of the key problems. It arises when a dataset contains many more data points representing one property than another, such as a molecule’s ability to inhibit a certain protein or its structural characteristics.
Daniel Reker, an assistant professor of biomedical engineering at Duke University, compared it to training an algorithm to distinguish between images of dogs and cats while giving it one billion images of dogs and just 100 images of cats. The algorithm will become so good at recognizing dogs that everything starts to look like a dog, and it will lose its grasp of everything else in the world.
This is a particularly challenging problem in drug discovery and development, where researchers frequently work with datasets in which more than 99% of the tested compounds are deemed “ineffective” and only a small fraction are classified as potentially useful.
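To see why this kind of imbalance is a problem, consider a minimal Python sketch with made-up numbers: a hypothetical screening set in which 99% of compounds are inactive. The labels and the trivial "model" are assumptions for illustration only and are not data from the study.

```python
# Hypothetical screening set: 990 inactive compounds (0) and 10 active ones (1).
labels = [0] * 990 + [1] * 10

# A "model" that has only learned the majority class calls everything inactive.
predictions = [0] * len(labels)

# Accuracy: fraction of all compounds labeled correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: fraction of the truly active compounds that the model found.
actives_found = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = actives_found / sum(labels)

print(f"accuracy: {accuracy:.2f}")        # accuracy: 0.99
print(f"recall on actives: {recall:.2f}")  # recall on actives: 0.00
```

The model looks 99% accurate yet fails to find a single useful compound, which is the drug-discovery version of seeing dogs everywhere.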
To address this problem, researchers use a technique called data subsampling, in which the algorithm learns from a small but (ideally) representative subset of the data. By giving the model an equal number of examples from each class, subsampling can remove bias, but it can also discard important data points and reduce the algorithm’s overall accuracy. To compensate for the lost information, researchers have developed hundreds of subsampling techniques.
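The simplest form of subsampling, random undersampling, can be sketched in a few lines of Python. This is only an illustration of the general idea, not one of the specific techniques evaluated in the study; the function name `random_undersample` and the toy compound IDs are hypothetical.

```python
import random

def random_undersample(samples, labels, seed=0):
    """Keep an equal number of randomly chosen examples from every class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    # The rarest class sets how many examples each class may keep.
    n_keep = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        balanced.extend((x, y) for x in rng.sample(xs, n_keep))
    rng.shuffle(balanced)
    return balanced

# Hypothetical compound IDs 0..999: 990 inactive (0) and 10 active (1).
compound_ids = list(range(1000))
labels = [0] * 990 + [1] * 10

balanced = random_undersample(compound_ids, labels)
n_active = sum(y for _, y in balanced)
print(len(balanced), n_active)  # 20 10
print(len(compound_ids) - len(balanced), "compounds discarded")  # 980 compounds discarded
```

The balanced set no longer favors the inactive class, but 980 compounds were thrown away at random, and some of them may have carried information the model needed.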
Reker and his colleagues, however, wanted to see whether active machine learning could solve this long-standing problem.
According to Reker, rather than passively sorting through the data, an active learning algorithm can ask for clarification or request more information when it is confused or notices a gap. This makes active-learning models very efficient at predicting performance.
Active learning algorithms are typically used to generate new data, for example to identify a new drug. Reker and his team wanted to explore what would happen if the algorithm were instead let loose on existing datasets. Although this subsampling application of active machine learning had been explored in earlier studies, Reker and his team were the first to test the approach on molecular biology and drug development.
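The idea of running active learning over an existing dataset can be sketched as follows. This is a toy illustration under stated assumptions, not the team's implementation: the model is a simple nearest-centroid classifier on one made-up numeric feature, the selection rule is margin-based uncertainty sampling, the data contain at least two classes, and the names `centroids`, `margin`, `active_subsample` and `budget` are hypothetical. Because every label already exists, "asking a question" simply means revealing the stored label of the compound the model is least sure about.

```python
import random

def centroids(train):
    """Mean feature value per class for a list of (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def margin(x, cents):
    """Gap between the two nearest class centroids; small means uncertain."""
    d = sorted(abs(x - c) for c in cents.values())
    return d[1] - d[0]

def active_subsample(data, budget, seed=0):
    """Grow a training subset by repeatedly adding the most uncertain example."""
    rng = random.Random(seed)
    pool = list(data)
    rng.shuffle(pool)
    # Start with one example per class so that every centroid is defined.
    train = []
    for cls in sorted({y for _, y in pool}):
        idx = next(i for i, (_, y) in enumerate(pool) if y == cls)
        train.append(pool.pop(idx))
    while len(train) < budget and pool:
        cents = centroids(train)
        # Query the pooled example the current model is least sure about.
        idx = min(range(len(pool)), key=lambda i: margin(pool[i][0], cents))
        train.append(pool.pop(idx))
    return train

# Made-up imbalanced data: 200 inactive compounds near 0, 10 active near 3.
rng = random.Random(1)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(200)]
data += [(rng.gauss(3.0, 1.0), 1) for _ in range(10)]

subset = active_subsample(data, budget=21)  # 10% of the 210 compounds
print(len(subset), "selected,", sum(y for _, y in subset), "of them active")
```

Here the budget is fixed in advance. In practice the subset size would be tuned by tracking performance on held-out data as the subset grows.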
To test the effectiveness of their active subsampling approach, the team assembled datasets of molecules with various properties, including molecules that can cross the blood-brain barrier, molecules that can inhibit a protein linked to Alzheimer’s disease, and compounds shown to inhibit HIV replication. They then compared their active-learning algorithm against models trained on the full dataset and against 16 state-of-the-art subsampling strategies.
The team showed that active subsampling identified and predicted molecular characteristics more accurately than each of the standard subsampling strategies and, in some cases, was up to 139 percent more effective than the algorithm trained on the full dataset. The model was also able to accurately adjust for errors in the data, suggesting that it could be especially useful for low-quality datasets.
Most surprising to the team, the optimal amount of data was far smaller than expected, in some cases only 10% of the total data.
According to Reker, once the active-subsampling model has collected all the information it needs, adding more data actually hurts performance. The team found this especially intriguing because it suggests there is a tipping point beyond which additional information, even within a subsample, is no longer useful.
Reker and his team plan to explore this inflection point in future work, in addition to using the approach to identify new molecules for potential therapeutic targets. Because active machine learning is becoming increasingly popular across many research fields, the team hopes its findings will contribute to a better understanding of the technique and its robustness to errors in data.
According to Reker, the approach not only improves machine learning performance but can also reduce data storage needs and costs because it works with a more refined dataset. That makes machine learning more powerful, accessible, and reproducible for everyone.