Computer scientists at Rice University have found bias in widely used machine learning methods applied to immunotherapy research.
Ph.D. students Anja Conev, Romanos Fasoulis, and Sarah Hall-Swan, working with computer science faculty members Rodrigo Ferreira and Lydia Kavraki, reviewed publicly available peptide-HLA (pHLA) binding prediction data and found it skewed toward higher-income communities. Their study examines how biased input data affects the algorithmic recommendations used in important immunotherapy research.
Peptide-HLA binding prediction, machine learning and immunotherapy
HLA genes, which all humans carry, encode proteins that are central to the immune response. These proteins bind to peptides, small protein fragments, inside our cells and mark infected cells for the immune system so it can respond and, ideally, eliminate the threat.
Individuals carry slightly different variants of these genes, called alleles. Much current immunotherapy research focuses on identifying peptides that bind strongly to a patient's particular HLA alleles.
The eventual payoff could be highly personalized, effective immunotherapies. A critical step toward that goal is determining precisely which peptides will bind to which alleles; the more accurate the predictions, the more effective the therapy is likely to be.
But measuring a peptide's binding affinity for an HLA allele experimentally is laborious, so machine learning models are used to predict binding instead (a minimal sketch of that setup follows below). That is where the Rice researchers found a problem: the data used to train those models appears to be geographically skewed toward higher-income communities.
Why is that a problem? If genetic data from lower-income communities is not taken into account, immunotherapies built on these predictions may be less effective for people in those communities.
Each person expresses a particular set of HLAs, and those sets vary across populations, Fasoulis explained. Because machine learning is used to identify candidate peptides for immunotherapies, biased models mean the resulting treatments will not work for every patient in every population.
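To make the prediction task concrete, here is a minimal sketch of pHLA binding prediction framed as supervised learning: peptide sequences and an allele identifier are encoded as features, and a classifier is trained to predict whether a pair binds. The peptides, alleles, and labels are invented toy examples, not real measurements or the models studied by the Rice team; production predictors train far richer models on large curated datasets.

```python
# Toy sketch of peptide-HLA binding prediction as supervised learning.
# All sequences, alleles, and labels below are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_peptide(peptide: str) -> np.ndarray:
    """One-hot encode a 9-mer peptide into a flat 9 x 20 feature vector."""
    vec = np.zeros((len(peptide), len(AMINO_ACIDS)))
    for pos, aa in enumerate(peptide):
        vec[pos, AA_INDEX[aa]] = 1.0
    return vec.ravel()

# Hypothetical training records: (peptide, HLA allele, binds?).
records = [
    ("SIINFEKLA", "HLA-A*02:01", 1),
    ("GILGFVFTL", "HLA-A*02:01", 1),
    ("AAAAAAAAA", "HLA-A*02:01", 0),
    ("KVAELVHFL", "HLA-B*07:02", 1),
    ("QQQQQQQQQ", "HLA-B*07:02", 0),
    ("LLDVPTAAV", "HLA-A*02:01", 1),
    ("PPPPPPPPP", "HLA-B*07:02", 0),
    ("GGGGGGGGG", "HLA-A*02:01", 0),
]

alleles = sorted({allele for _, allele, _ in records})
allele_index = {a: i for i, a in enumerate(alleles)}

def featurize(peptide: str, allele: str) -> np.ndarray:
    """Concatenate the peptide encoding with a one-hot allele indicator."""
    allele_vec = np.zeros(len(alleles))
    allele_vec[allele_index[allele]] = 1.0
    return np.concatenate([one_hot_peptide(peptide), allele_vec])

X = np.array([featurize(p, a) for p, a, _ in records])
y = np.array([label for _, _, label in records])

model = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted binding probability for a new peptide-allele pair.
pair = featurize("GILGFVFTL", "HLA-B*07:02").reshape(1, -1)
print(model.predict_proba(pair)[0, 1])
```

The key point for the bias question is that such a model only ever sees the alleles present in its training records; alleles common in under-sampled populations contribute little or nothing to what it learns.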
Redefining ‘pan-allele’ binding predictors
Whatever the application, machine learning models are only as good as the data used to train them. A bias in that data, even an unintentional one, can skew the conclusions the algorithm reaches.
The machine learning models currently used for pHLA binding prediction are described as "pan-allele" or "all-allele," meaning they claim to generalize to alleles that were absent from their training data. The Rice team's findings cast doubt on that claim.
Conev said the team set out to test, and potentially challenge, the notion of "pan-allele" machine learning predictors: do they actually work for data from lower-income populations, the very data that is missing from the databases?
When Fasoulis, Conev, and their colleagues tested the publicly available pHLA binding prediction data, the results supported their hypothesis: bias in the data translates into bias in the algorithms. By drawing the scientific community's attention to this disparity, the team hopes to help establish a truly pan-allele approach to predicting pHLA binding.
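One simple way to see how such a disparity can be surfaced is to check how many binding measurements a training set contains for alleles that are common in different populations. The sketch below illustrates that kind of coverage check; the allele lists and counts are invented placeholders and are not the actual data or methodology of the Rice study.

```python
# Hedged sketch of a dataset coverage check for potential allele bias.
# Allele frequencies and measurement counts are hypothetical.
from collections import Counter

# Hypothetical training records: one HLA allele per measured peptide-HLA pair.
training_alleles = (
    ["HLA-A*02:01"] * 5000 + ["HLA-B*07:02"] * 3000 + ["HLA-A*24:02"] * 40
)
counts = Counter(training_alleles)

# Hypothetical sets of alleles that are frequent in two different populations.
population_alleles = {
    "population_1": ["HLA-A*02:01", "HLA-B*07:02"],
    "population_2": ["HLA-A*24:02", "HLA-B*53:01"],
}

for population, allele_list in population_alleles.items():
    measurements = [counts.get(a, 0) for a in allele_list]
    covered = sum(1 for n in measurements if n > 0)
    print(
        f"{population}: {covered}/{len(allele_list)} alleles covered, "
        f"{sum(measurements)} total measurements"
    )
```

A large gap in coverage between populations is exactly the kind of imbalance that can leave a "pan-allele" model extrapolating blindly for the under-represented group.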
Ferreira, a faculty advisor and co-author of the paper, explained that bias in machine learning cannot be addressed unless researchers consider their data in its social context. A dataset may look merely "incomplete" at first glance; identifying bias means connecting the underrepresentation of particular groups in the data to the historical and economic conditions that have affected those communities.
"Researchers using machine learning models sometimes innocently assume that these models may appropriately represent a global population," Ferreira said. "Our research points to the significance of when this is not the case." The databases the team examined, he added, are not universal, even though the data they contain comes from people all over the world, and the study found a relationship between a population's socioeconomic status and how well it is represented in the databases.
Kavraki echoed that view, stressing that tools used in clinical work must be accurate and forthright about any limitations they may have.
The work on pHLA binding, Kavraki noted, is part of a project with MD Anderson focused on personalized cancer immunotherapies. The tools developed eventually make their way into clinical workflows, she said, so it is essential to understand their potential biases. The team also hopes to raise awareness among researchers of how difficult it is to obtain unbiased datasets.
Conev pointed out that, despite its bias, the data being publicly available at all was a good start, since it allowed her team to examine it. By sharing its findings, the team hopes to steer future research in a more positive direction, one that serves and includes people from a wider range of demographic backgrounds.