Large datasets are needed to test brain-behavior machine learning models

Researchers train machine learning models to identify patterns in data, then evaluate how well the algorithms perform. But according to a recent Yale study, models may appear less capable than they actually are if the datasets used to train and test them aren’t large enough.

For brain-behavior models, the scientists say, these findings could shape future studies, worsen the replication problem that plagues psychological research, and hamper our understanding of the human brain.

The findings were published July 31 in the journal Nature Human Behaviour.

Researchers are increasingly turning to machine learning models to find links between, for example, brain function or structure and cognitive traits such as attention or depressive symptoms. By establishing these links, scientists could predict who may be at risk for particular cognitive difficulties from brain imaging alone, and gain a better understanding of how the brain influences these traits, and vice versa.

However, models are only useful if they are accurate not just for the individuals in the training data but for the broader population as well.

Because collecting two separate datasets requires more resources, researchers often split a single dataset, using the larger portion to train the model and the smaller portion to test it. A growing number of studies, however, have put machine learning models through a more rigorous test of generalizability by evaluating them on an entirely new dataset made publicly available by other researchers.
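
As an illustration of these two evaluation strategies, the sketch below trains a model on synthetic data standing in for brain features and a behavioral score, then scores it both on a held-out portion of the same dataset and on a separate, simulated “external” dataset. It uses scikit-learn and is a minimal sketch of the general practice, not the study’s actual analysis pipeline.

```python
# Sketch: internal train/test split vs. evaluation on an external dataset.
# Synthetic data stand in for brain features (X) and a behavioral score (y);
# this illustrates the two strategies, not the study's actual pipeline.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_features = 100
w_true = rng.standard_normal(n_features)  # shared "true" brain-behavior weights

def make_dataset(n, effect=0.3):
    """Simulate features weakly related to a behavioral score via w_true."""
    X = rng.standard_normal((n, n_features))
    y = effect * X @ w_true / np.sqrt(n_features) + rng.standard_normal(n)
    return X, y

# Strategy 1: split one dataset into a larger training and a smaller testing portion.
X, y = make_dataset(300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0).fit(X_train, y_train)
r_internal, p_internal = pearsonr(model.predict(X_test), y_test)

# Strategy 2: evaluate the same trained model on an entirely separate dataset.
X_ext, y_ext = make_dataset(200)
r_external, p_external = pearsonr(model.predict(X_ext), y_ext)

print(f"internal test: r = {r_internal:.2f} (p = {p_internal:.3f})")
print(f"external test: r = {r_external:.2f} (p = {p_external:.3f})")
```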

That is a good thing, said Matthew Rosenblatt, the study’s lead author and a graduate student in the lab of Dustin Scheinost, an associate professor of radiology and biomedical imaging at Yale School of Medicine. If you can show that a finding holds in an entirely separate dataset, he said, a strong brain-behavior association is most likely present.

Bringing in an additional dataset, however, complicates questions about a study’s “power.” Statistical power is a study’s likelihood of detecting an effect if one exists. Consider the strong relationship between an infant’s height and age: a study with sufficient power will detect that association, while a “low-powered” study has a greater chance of missing it.

Two key components of statistical power are the effect size and the size of the dataset, commonly referred to as the sample size. The smaller one of these is, the larger the other must be. The strong correlation between height and age is a large effect size, so it can be detected even in a tiny dataset. But detecting a more subtle association between two variables, such as the relationship between age and one’s ability to perceive touch, would require collecting data from many more participants.

There are formulas for calculating how large a single dataset must be to achieve sufficient power, but none that make it simple to determine the sizes of two datasets, one for training and the other for testing.
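
As an example of such a single-dataset calculation, the sketch below approximates the power to detect a Pearson correlation of a given size at a given sample size using the standard Fisher z transformation. The effect sizes and sample sizes in the loop are illustrative choices, not values taken from the study.

```python
# Sketch: approximate power to detect a Pearson correlation of a given size
# at a given sample size, via the standard Fisher z approximation.
# The effect sizes and sample sizes below are illustrative, not from the study.
import numpy as np
from scipy.stats import norm

def correlation_power(r, n, alpha=0.05):
    """Approximate power of a two-sided test for correlation r with n participants."""
    if n <= 3:
        return 0.0
    z_effect = np.arctanh(r) * np.sqrt(n - 3)   # Fisher z, divided by its standard error
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - z_effect) + norm.cdf(-z_crit - z_effect)

for r in (0.1, 0.3, 0.5):        # small, medium, and large effects
    for n in (100, 300, 1000):
        print(f"r = {r:.1f}, n = {n:4d}: power ~ {correlation_power(r, n):.2f}")
```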

To better understand how training and testing dataset sizes affect a study’s power, the researchers resampled data from six neuroimaging studies, varying the sizes of the datasets and observing the effects on statistical power.
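
A simplified version of that resampling logic can be sketched as follows: repeatedly draw training and testing subsets of various sizes from a larger pool, fit a model on the training subset, and count how often its predictions are significantly correlated with the outcome in the testing subset. The synthetic data, model choice, and thresholds below are assumptions for illustration, not the authors’ code.

```python
# Sketch: estimate power empirically by resampling training and testing subsets
# of different sizes. This is a simplified simulation in the spirit of the
# analysis, not the authors' code; the synthetic data and model are assumptions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_total, n_features, effect = 5000, 100, 0.3
w_true = rng.standard_normal(n_features)
X = rng.standard_normal((n_total, n_features))
y = effect * X @ w_true / np.sqrt(n_features) + rng.standard_normal(n_total)

def empirical_power(n_train, n_test, n_repeats=200, alpha=0.05):
    """Fraction of resampled splits where predictions correlate significantly with outcomes."""
    hits = 0
    for _ in range(n_repeats):
        idx = rng.permutation(n_total)
        train_idx, test_idx = idx[:n_train], idx[n_train:n_train + n_test]
        model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
        r, p = pearsonr(model.predict(X[test_idx]), y[test_idx])
        hits += (r > 0) and (p < alpha)
    return hits / n_repeats

for n_train in (50, 150, 400):
    for n_test in (50, 150, 400):
        print(f"train = {n_train:3d}, test = {n_test:3d}: power ~ {empirical_power(n_train, n_test):.2f}")
```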

They showed that adequate statistical power requires comparatively large samples for both the training dataset and the external testing dataset, according to Rosenblatt. They also found that most published studies in the field that used this strategy of testing models on a second dataset relied on datasets that were too small, which undermined their findings.

Across previously published studies, the researchers found that the median sizes of the training and testing datasets were 129 and 108 participants, respectively. Datasets of those sizes were large enough to achieve sufficient power for measures with large effect sizes, such as age. But for measures with medium effect sizes, such as working memory, datasets of those sizes gave a study a 51% chance of failing to find a relationship between brain structure and the measure; for measures with small effect sizes, such as attention problems, that chance rose to 91%.

For measures with smaller effect sizes, researchers may need datasets of hundreds to thousands of participants, according to Rosenblatt.

Rosenblatt and his colleagues predict that as new neuroimaging datasets become available, more researchers will choose to test their models on external datasets.

That is a positive step, according to Scheinost. Validating a model on an additional external dataset is one way to address the reproducibility problem, he said, but researchers should also consider the sizes of their datasets. While researchers must make the best use of the data at their disposal, as more data become available, the field should strive to do external testing and to ensure that the test datasets are sizable.
