Figure 1 | Multi-fidelity data can improve predictive models developed using machine learning. Accurate (high-fidelity) data about the properties of materials can be difficult or expensive to acquire, and so less-accurate (low-fidelity) data are often obtained instead. Low-fidelity data sets are therefore usually larger than high-fidelity ones, and represent a greater diversity of materials. Machine-learning systems typically use individual data sets to generate predictive models of materials’ properties. Diverse low-fidelity data produce general but approximate models, whereas high-fidelity data produce accurate but less-general ones. Here, machine learning based on individual data sets A to D produces four predictive models, the generality and accuracy of which are shown. Chen et al. report a machine-learning architecture that can process materials data from multiple sets that have different fidelities, and thereby generates predictive models that are more general and accurate than are those produced using the individual data sets; the red spot indicates the generality and accuracy of a model trained using the combined data sets A to D. Examples are illustrative, and do not depict actual data.
Chen et al. now report an adaptation of an artificial neural network (a brain-inspired computer system) that they call a multi-fidelity graph network. This can learn about materials’ properties using data acquired from different modelling and experimental techniques. As a proof of principle, the authors trained their graph network to learn about bandgaps — a property that controls several electrical and optical properties of solid materials, such as their conductivity and colour. They used bandgap data from five sources: four data sets were the results of different types of quantum-mechanical calculation, and the fifth source was experimental data. The data set that had the lowest fidelity level contained about 50,000 data points, roughly 100 times more than the number in each of the other data sets; this is typical of the heterogeneity of available data in materials science.
The authors’ graph network takes as its input a materials graph: a mathematical representation of the structure of a material, consisting of nodes that represent atoms and edges that represent bonds. It then performs a series of mathematical (convolution) operations to exchange information between the nodes and edges of that graph. This produces an output vector known as a latent representation, which is passed on to, and further manipulated by, another artificial neural network to predict the property of interest (in this case, the bandgap). One or more historical data sets are first used to train the graph network and the second artificial neural network simultaneously, priming them to make predictions.
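The pipeline described above (graph in, information exchange between nodes and edges, latent representation out, then a second network for the prediction) can be illustrated with a toy sketch in plain Python. This is not the authors’ implementation: the update rules, the fixed feature length of two, the untrained weights and all function names are assumptions chosen only to make the data flow concrete.

```python
import math

def message_passing_step(node_feats, edges, edge_feats):
    """One simplified convolution step (assumed form, not the published one).

    Edges are updated from their two end nodes, then nodes are updated
    from the edges that touch them. Node and edge vectors are assumed to
    have the same length.
    """
    new_edge_feats = []
    for (i, j), e in zip(edges, edge_feats):
        new_edge_feats.append([
            math.tanh(ek + 0.5 * (ni + nj))
            for ek, ni, nj in zip(e, node_feats[i], node_feats[j])
        ])

    new_node_feats = []
    for idx, n in enumerate(node_feats):
        incident = [e for (i, j), e in zip(edges, new_edge_feats) if idx in (i, j)]
        if incident:
            mean_e = [sum(col) / len(incident) for col in zip(*incident)]
        else:
            mean_e = [0.0] * len(n)
        new_node_feats.append([math.tanh(nk + mk) for nk, mk in zip(n, mean_e)])
    return new_node_feats, new_edge_feats

def readout(node_feats):
    """Average all node vectors into one latent representation."""
    return [sum(col) / len(node_feats) for col in zip(*node_feats)]

def predict(latent, weights, bias):
    """Stand-in for the second neural network: a single linear layer."""
    return sum(w * x for w, x in zip(weights, latent)) + bias

# Toy materials graph: three atoms in a chain, two bonds (illustrative values).
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
edges = [(0, 1), (1, 2)]
edge_feats = [[0.5, 0.5], [0.5, 0.5]]

for _ in range(2):  # two rounds of information exchange
    nodes, edge_feats = message_passing_step(nodes, edges, edge_feats)

latent = readout(nodes)
# Weights and bias are arbitrary placeholders; in practice they are learned.
bandgap_estimate = predict(latent, weights=[0.3, -0.2], bias=1.0)
```

In a real system the fixed arithmetic in each update would be replaced by small trainable networks, and training would adjust those together with the final predictor.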
Machine-learning techniques based on graph networks are among the top-performing methods for single-fidelity learning of materials properties, and do not require a feature-engineering step (in which a material’s composition and/or atomic structure is converted into a string of numbers in a machine-readable format), as is necessary for other machine-learning algorithms. To adapt their graph network for multi-fidelity learning, the authors introduced a new variable, in addition to those used to represent graph nodes and edges, that accounts for the fidelity level of a data point. The authors’ graph network therefore exchanges information between the atom nodes, the bond edges and the data-fidelity level represented by the new variable. This means that their approach is applicable to any number of fidelity levels.
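One way to picture the extra fidelity variable is as a small vector, looked up by fidelity level, that takes part in the information exchange alongside the atoms and bonds. The sketch below is a hedged illustration only: the lookup table `FIDELITY_EMBEDDINGS`, its values and the update rule are hypothetical, and in a trained model such vectors would be learned rather than written by hand.

```python
import math

# Hypothetical table: one vector per fidelity level (values are placeholders).
FIDELITY_EMBEDDINGS = {
    0: [0.1, -0.2],   # for example, the lowest-fidelity calculation
    1: [0.4, 0.3],    # for example, a higher-fidelity calculation
    2: [-0.5, 0.6],   # for example, experiment
}

def update_with_fidelity(node_feats, fidelity_level):
    """Mix a fidelity vector into every node, then update that vector.

    Returns the updated node vectors and the updated fidelity state.
    The specific arithmetic is an assumption made for illustration.
    """
    state = FIDELITY_EMBEDDINGS[fidelity_level]
    new_nodes = [
        [math.tanh(x + s) for x, s in zip(node, state)]
        for node in node_feats
    ]
    mean_node = [sum(col) / len(new_nodes) for col in zip(*new_nodes)]
    new_state = [math.tanh(s + m) for s, m in zip(state, mean_node)]
    return new_nodes, new_state

# The same material (same node vectors) seen at two fidelity levels.
material = [[1.0, 0.0], [0.0, 1.0]]
low_nodes, low_state = update_with_fidelity(material, fidelity_level=0)
high_nodes, high_state = update_with_fidelity(material, fidelity_level=2)
```

Because the fidelity level only selects a row in a table, supporting another level in this sketch means adding one more entry, which mirrors why the approach is not tied to a fixed number of fidelity levels.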
A comparison of prediction errors clearly demonstrates the benefit of the multi-fidelity approach. For example, models that had four levels of fidelity reduced errors in predictions of bandgaps by 22–45%, compared with single-fidelity models. Similarly, multi-fidelity models involving two, three or five levels of fidelity performed better than did single-fidelity models.
This improvement can be attributed to two key factors. First, the large volume of low-fidelity data represents a more chemically diverse collection of materials than does a single high-fidelity data set; exposure of the graph network to this diversity results in a better and more-general latent representation. Second, there is a high correlation between the low- and high-fidelity bandgap data — many of the bandgaps in the low-fidelity data set are close in value to the equivalent data points in the high-fidelity data sets. This second factor is evident from the higher prediction accuracy that is achieved when using high-fidelity data sets that correlate more closely with the low-fidelity data set.
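The second factor can be checked on any pair of data sets by computing the correlation between low- and high-fidelity values for the materials they share. The following minimal sketch uses the Pearson coefficient; the four pairs of bandgap values are invented for illustration and are not taken from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative bandgaps (eV) for the same four materials at two fidelities.
low_fidelity = [0.6, 1.1, 1.9, 3.2]
high_fidelity = [1.0, 1.8, 2.9, 4.5]

r = pearson(low_fidelity, high_fidelity)
```

A value of `r` close to 1 indicates that the low-fidelity values track the high-fidelity ones closely, even if they are systematically offset.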
Chen and colleagues’ approach overcomes the limitations of other multi-fidelity approaches [9,10], which are either not easily scalable to large data sets, or cannot handle heterogeneous data or more than two levels of fidelity. The authors’ multi-fidelity graph network is therefore a powerful new system for capturing complex relationships between data sets of multiple fidelities. It should be noted, however, that Chen et al. did not explore what happens if low- and high-fidelity data points are weighted differently. Such weighting might become necessary when the number of low-fidelity points is so large that it over-represents the full set of multi-fidelity data.
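To make the idea of weighting concrete, the sketch below shows one possible form: a mean absolute error in which each data point is scaled by a weight chosen for its fidelity level. This was not part of the authors’ study; the function name, the weights and the numbers are all hypothetical.

```python
def weighted_mae(predictions, targets, fidelities, weights):
    """Mean absolute error with a per-fidelity weight on each data point."""
    total = 0.0
    weight_sum = 0.0
    for pred, target, fidelity in zip(predictions, targets, fidelities):
        w = weights[fidelity]
        total += w * abs(pred - target)
        weight_sum += w
    return total / weight_sum

# Illustrative values: two low-fidelity points and one high-fidelity point.
predictions = [1.0, 2.0, 3.0]
targets = [1.5, 2.0, 2.0]
fidelities = ["low", "low", "high"]

# Every point counts equally.
equal = weighted_mae(predictions, targets, fidelities, {"low": 1.0, "high": 1.0})

# Low-fidelity points are down-weighted so that they do not dominate.
reweighted = weighted_mae(predictions, targets, fidelities, {"low": 0.1, "high": 1.0})
```

With the down-weighting, the error on the single high-fidelity point contributes more to the total, so a model trained against this measure would be pushed harder to fit the scarce, accurate data.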
The authors’ system is not restricted to materials science, but is generalizable to any problem that can be described using graph structures, such as social networks and knowledge graphs (digital frameworks that represent knowledge as concepts connected by relationships). Furthermore, this approach could, in principle, be used to learn about multiple properties simultaneously (multi-task learning), rather than learning about just one property for which data are available at multiple levels of fidelity.
However, some fundamental questions remain. Are multi-fidelity approaches guaranteed to perform better than single-fidelity models, even when the quality of the low-fidelity data is extremely poor? And what happens when low- and high-fidelity data points are poorly correlated? More research is needed to understand the scenarios for which multi-fidelity learning is most beneficial, balancing prediction accuracy with the cost of acquiring data. In the meantime, the popularity of multi-fidelity methods will surely increase, because they directly exploit the underlying widespread heterogeneity of data in the materials and chemical sciences.