Could ML contribute to a scientific reproducibility crisis?

Machine learning (ML) is widely used in fields ranging from biomedicine to political science, where researchers apply it to find patterns in data and make predictions.

According to a pair of researchers at Princeton University in New Jersey, the claims made in many of these studies are exaggerated.

They want to raise awareness of a looming reproducibility crisis in the sciences that rely on machine learning.

According to Sayash Kapoor, a machine-learning researcher at Princeton, machine learning is marketed as a tool that researchers can pick up in a few hours and then use on their own, and many follow that advice.

However, no one would expect a chemist to learn how to run a laboratory by taking an online course, says Kapoor.

Moreover, few scientists realize that the problems they encounter when applying artificial-intelligence algorithms are common to other fields as well, says Kapoor, who has co-authored a preprint on the crisis.

Because peer reviewers do not have the time to scrutinize these models, he says, academia currently lacks mechanisms to root out irreproducible papers.

Kapoor and his co-author, Arvind Narayanan, have devised guidelines for scientists to avoid such pitfalls, including an explicit checklist to submit with each paper.

What exactly is reproducibility?

The definition of reproducibility proposed by Kapoor and Narayanan is broad. It states that given complete details on data, code, and conditions, other teams should be able to replicate the results of a model — a concept known as computational reproducibility, which is already a concern for machine-learning scientists.

The pair also defines a model as irreproducible when researchers make mistakes in data analysis, resulting in the model being less predictive than claimed.

Judging such errors is subjective and often requires deep knowledge of the field in which machine learning is being applied.

Some researchers whose work the team assessed contend that their papers contain no errors, or that Kapoor’s claims are overstated.

In the social sciences, for example, researchers have developed machine-learning models intended to predict when a country is likely to slide into civil war. Kapoor and Narayanan claim that, once errors are corrected, these models perform no better than traditional statistical techniques.

However, David Muchlinski, a political scientist at the Georgia Institute of Technology in Atlanta, whose paper the pair scrutinized, says that the field of conflict prediction has been unfairly maligned and that his findings are backed up by follow-up studies.

Nonetheless, the team’s rallying cry has struck a chord. More than 1,200 people have signed up for what was initially a small online workshop on reproducibility on July 28th, organized by Kapoor and colleagues with the goal of devising and disseminating solutions. Unless something of this sort is done, he says, every field will continue to run into these problems over and over.

Overoptimism about the capabilities of machine-learning models could be harmful when algorithms are applied in areas such as health and criminal justice, says Momin Malik, a data scientist at the Mayo Clinic in Rochester, Minnesota, who is due to give a talk at the workshop.

He believes that unless the crisis is addressed, machine learning’s reputation will suffer. He is surprised there hasn’t already been a crash in the legitimacy of machine learning, but thinks one could come very soon.

Problems with machine learning

According to Kapoor and Narayanan, similar pitfalls occur when applying machine learning to multiple sciences. The researchers examined 20 reviews in 17 research fields and identified 329 research papers whose findings could not be entirely replicated due to issues with how machine learning was applied.

Narayanan himself has not been immune: a 2015 paper on computer security that he co-authored is among the 329. It is a problem that the entire community needs to address together, says Kapoor.

He adds that no individual researcher is to blame for such failures. Instead, a combination of hype around AI and inadequate checks and balances is at fault.

The most prominent issue that Kapoor and Narayanan raise is ‘data leakage’, which occurs when the data a model learns from overlaps with the data on which it is later evaluated. If the two are not kept entirely separate, the model has effectively already seen the answers, and its predictions appear much better than they really are. The team has identified eight major types of data leakage that researchers should be aware of.
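A minimal sketch of what this can look like in practice, assuming scikit-learn and purely synthetic data (none of the names or numbers come from Kapoor and Narayanan’s paper): choosing features on the full data set before splitting lets information from the test set leak into training, so even pure noise can appear predictive.

```python
# Minimal sketch of feature-selection leakage on purely synthetic data:
# the features and labels are random noise, so honest accuracy should be ~50%.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))   # 2,000 noise features
y = rng.integers(0, 2, size=200)   # random labels: no real signal

# Leaky: pick the "best" features using *all* the data, then split.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_leaky, y, random_state=0)
leaky = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Correct: split first, and keep all preprocessing inside the training pipeline.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
pipeline = make_pipeline(SelectKBest(f_classif, k=20),
                         LogisticRegression(max_iter=1000))
clean = pipeline.fit(X_tr, y_tr).score(X_te, y_te)

print(f"leaky accuracy:   {leaky:.2f}")   # typically well above chance
print(f"correct accuracy: {clean:.2f}")   # close to 0.5
```

The only difference between the two runs is whether the test rows were visible when the features were chosen.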

Some data leaks are more subtle. Temporal leakage, for instance, occurs when training data include points from later in time than the test data, which is a problem because the future depends on the past. Malik cites as an example a 2011 paper claiming that a model based on the moods of Twitter users could predict the stock market’s closing value with an accuracy of 87.6 percent. However, because the team tested the model’s predictive power on data from a period earlier than some of its training set, the algorithm was effectively allowed to see into the future, he says.
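The sketch below, which uses synthetic daily data and scikit-learn rather than anything from the 2011 study, shows the difference between a random split, which can put training rows later in time than test rows, and a time-aware split that keeps evaluation strictly in the future of the training set.

```python
# Minimal sketch (synthetic data, not from the 2011 study) of how a
# random split causes temporal leakage, and how a time-aware split avoids it.
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, train_test_split

dates = pd.date_range("2011-01-01", periods=300, freq="D")
df = pd.DataFrame({
    "date": dates,
    "mood_score": np.random.default_rng(1).normal(size=300),  # stand-in predictor
})

# Leaky: a random split mixes past and future, so some training rows
# come *after* the rows the model is tested on.
train_bad, test_bad = train_test_split(df, test_size=0.2, random_state=0)
print("latest training date:", train_bad["date"].max().date())
print("earliest test date:  ", test_bad["date"].min().date())  # earlier than the line above

# Time-aware: every test fold lies strictly after its training fold.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(df):
    assert df.iloc[train_idx]["date"].max() < df.iloc[test_idx]["date"].min()
```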

According to Malik, bigger problems involve training models on data sets that are narrower than the populations they are ultimately intended to represent. For instance, an AI trained only on older people to detect pneumonia in chest X-rays may be less accurate when applied to younger people.
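A toy illustration of this sampling problem, assuming scikit-learn and entirely synthetic ‘older’ and ‘younger’ cohorts (the feature shift between them is invented for the example), might look like this:

```python
# Illustrative sketch (synthetic data, hypothetical cohorts) of a model
# trained on a sample narrower than the population it is meant to serve:
# it looks accurate on the cohort it was trained on but transfers poorly
# to a cohort that was absent from the training data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# "Older" cohort: in this toy data the label is driven by feature 0.
X_old = rng.normal(size=(500, 5))
y_old = (X_old[:, 0] > 0).astype(int)

# "Younger" cohort: the same label depends on feature 1 instead,
# mimicking a shift between the training sample and the wider population.
X_young = rng.normal(size=(500, 5))
y_young = (X_young[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_old, y_old)
print("accuracy on the cohort it was trained on:", round(model.score(X_old, y_old), 2))
print("accuracy on the unseen cohort:           ", round(model.score(X_young, y_young), 2))
```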

Another issue is that algorithms frequently rely on shortcuts that don’t always hold, according to Jessica Hullman, a computer scientist at Northwestern University in Evanston, Illinois, and a workshop speaker. A computer-vision algorithm, for example, might learn to recognize a cow from the grassy background present in most cow images, so it would fail when shown an image of the animal on a mountain or a beach.

According to her, the high accuracy of predictions in tests frequently misleads people into thinking the models are picking up on the “true structure of the problem” similar to humans. She compares the situation to the replication crisis in psychology, in which people place too much trust in statistical methods.

According to Kapoor, the hype surrounding machine learning’s capabilities has contributed to researchers accepting their results too quickly. According to Malik, the term “prediction” is problematic because most prediction is tested retrospectively and has nothing to do with forecasting the future.

Resolving data leakage

According to Kapoor and Narayanan, researchers can tackle data leakage by including with their manuscripts evidence that their models are free of each of the eight types of leakage.

The authors suggest a template for this kind of documentation, which they call ‘model info’ sheets.

Biomedicine has come a long way with a similar approach over the past three years, says Xiao Liu, a clinical ophthalmologist at the University of Birmingham, UK, who has helped to create reporting guidelines for studies involving AI, for example in screening or diagnosis.

In 2019, Liu and her colleagues found that only 5% of more than 20,000 papers using AI for medical imaging described their methods in enough detail to tell whether the models would work in a clinical setting.

Guidelines do not directly improve anyone’s models, but they make it clear who has done the work well, and perhaps who has not, she says, which gives regulators a resource to draw on.

Collaboration can also be beneficial, according to Malik. He suggests that studies include both subject matter experts and researchers in machine learning, statistics, and survey sampling.

According to Kapoor, fields in which machine learning is used to find leads for follow-up, such as drug discovery, stand to benefit greatly from the technology. But other areas will need more work to show that it is useful, he adds. Although machine learning is still in its early days in many fields, he believes researchers must act now to avert the kind of crisis of confidence that followed the replication crisis in psychology a decade ago. The longer they wait, the bigger the problem will become.