Researchers from Facebook AI Research (FAIR) at Meta AI describe a dataset of 617 million predicted protein structures produced by machine learning. Although its reported accuracy is lower, the ESMFold language model predicted the structures up to 60 times faster than DeepMind's AlphaFold2.
The fold predictions were completed in about two weeks on a cluster of around 2,000 GPUs. The input sequences ranged from 20 to 1,024 amino acids in length. About 365 million predictions were rated good confidence, and 225 million of those were rated high confidence.
In the study, “Evolutionary-scale prediction of atomic-level protein structure with a language model,” a random sample of 1 million high-confidence predictions showed that 767,580 proteins have a sequence identity below 90% to any sequence in UniRef90, a database of known protein sequences. According to the researchers, this shows that these proteins are distinct from the sequences already in UniRef90.
The Meta AI team then compared the sample of predicted structures to known structures in the Protein Data Bank, a database of three-dimensional protein structures. At a TM-score threshold of 0.5, 12.6% of the sample (125,765 proteins) did not match any known structure. The scientists therefore hypothesize that roughly 28 million proteins with high-confidence predictions (12.6% of 225 million) could describe regions of protein structure that are remote from our current understanding.
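The extrapolation from the sample to the full set of high-confidence predictions is simple proportional scaling, and can be checked with a short sketch. The inputs are the figures quoted in this article; the variable and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the article; all names here are illustrative assumptions.
SAMPLE_SIZE = 1_000_000              # randomly sampled high-confidence predictions
NO_STRUCTURAL_MATCH = 125_765        # sample proteins with no PDB match at TM-score 0.5
HIGH_CONFIDENCE_TOTAL = 225_000_000  # all high-confidence predictions


def extrapolate(sample_hits: int, sample_size: int, population: int) -> float:
    """Scale a fraction observed in a random sample up to the full population."""
    return sample_hits / sample_size * population


fraction = NO_STRUCTURAL_MATCH / SAMPLE_SIZE
estimate = extrapolate(NO_STRUCTURAL_MATCH, SAMPLE_SIZE, HIGH_CONFIDENCE_TOTAL)

print(f"Fraction without a structural match: {fraction:.1%}")
print(f"Extrapolated count: {estimate / 1e6:.1f} million")
```

The result, a little over 28 million, agrees with the rounded figure reported above. The estimate assumes the 1 million sampled proteins are representative of all 225 million high-confidence predictions.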
Sequence-based predictions
A protein begins as a messenger RNA (mRNA), a kind of ingredient list for the protein it will become, which is created by transcribing a linear sequence of nucleotides from DNA. The mRNA nucleotides are then translated into amino acids (the raw ingredients). This chain of amino acids subsequently undergoes an astonishing transformation into a complicated three-dimensional folded shape, and that shape determines which sophisticated cellular functions the protein carries out.
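The transcription and translation steps described above can be sketched in a few lines of Python. This is a toy illustration only: the codon table is a small subset of the standard genetic code, and the DNA sequence and function names are hypothetical. It stops at the amino acid sequence, which is the input a model such as ESMFold starts from; it says nothing about how that sequence folds.

```python
# Toy illustration: DNA -> mRNA -> amino acid sequence.
# The codon table is a small subset of the standard genetic code,
# just enough for the hypothetical example sequence below.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "UGG": "W",  # tryptophan
    "UAA": "*",  # stop codon
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three nucleotides at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTAAATGGTAA"  # hypothetical 15-nucleotide coding sequence
mrna = transcribe(dna)
print(mrna)             # AUGGCUAAAUGGUAA
print(translate(mrna))  # MAKW
```

The one-letter codes in the output (M, A, K, W) are the kind of linear amino acid sequence from which sequence-based models predict a three-dimensional structure.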
The way a protein or enzyme folds partly determines its activity, because the fold restricts and optimizes what the molecule can interact with. The structure provides a “lock” that only the correct molecular “key” can open. People have long used these lock-and-key enzymes without a thorough grasp of how the proteins fold, in everything from the food sector and beer brewing to textiles and biofuel.
Cellulases, which break down plant material, are frequently found in laundry detergents along with other kinds of enzymes. When the cellulase enzyme comes into contact with a grass stain, the cellulose in the stain acts as the key to the enzyme's lock, and the enzyme triggers a chemical reaction that breaks the bonds in the stain. The same enzyme has no effect on a lipstick or grease stain; a different enzyme is needed for that task.
Enzymes are a valuable technology because a single enzyme can carry out its task dozens or even millions of times per second without failing, giving industry a low-energy powerhouse of a catalyst.
Every system in our body depends on proteins to carry out biological functions. When determining the causes of disease, it is essential to understand how proteins function, because their folded shape governs the activity they can engage in.
If medical researchers could predict how a protein will fold from its primary amino acid sequence (the raw ingredients), they would be better able to understand how protein-metabolite interactions and other biological processes take place throughout the body. This higher-resolution understanding could bring a small revolution in modern medicine, revealing previously unnoticed features of illness and speeding up the search for better or more effective treatments. By properly understanding how structure follows from the sequence of the initial raw materials (translated mRNA), researchers would also be able to create custom proteins to carry out specific functions in industry and healthcare.
In the decades prior to AI prediction models, scientists modelled the structures of around 190,000 proteins of interest. Machine learning has now produced hundreds of millions of predictions, but they still need to be verified and investigated further before they can be put to use. AI is still in its infancy and is not yet trustworthy enough to replace slower, more methodical X-ray crystallography or a controlled assay experiment for determining structure or function. The information obtained in the decades to come will likely overshadow everything that came before.