Shedding light on the darkness (of the proteome) with artificial intelligence

The dark proteome comprises proteins whose structure and function are unknown. Now, a study involving the IBE (CSIC-UPF) reveals that it is possible to determine their functions thanks to deep learning.


Sequences from mice, yeast, and fruit flies have been the starting material for AI, through deep learning, to decipher the function of previously unknown proteins. Image adapted from Trim via Wellcome Collection.

All living organisms produce proteins from their DNA, but not all of these proteins have been characterized. For some, the structure or function is unknown even though the DNA sequence that encodes them is known. This sequence has been the basis of a collaboration between the Andalusian Center for Developmental Biology (CABD) and the Institute of Evolutionary Biology (IBE, CSIC-UPF) to analyze proteins using artificial intelligence.

The study analyzed sequences from model organisms (yeast, mouse, and fruit fly) through deep learning and was able to determine and classify in great detail the functions of proteins about which there was previously no information.

Through deep learning, research groups have been able to determine the function of proteins for which only the DNA sequence was available.

The authors also found that, of the two deep learning methods used, language models (transformers) are more efficient than convolutional networks. The latter are based on image processing, while transformers process sequences and language. This makes transformer models more informative and precise, and also able to retrieve information from RNA sequences.
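To make the contrast between the two approaches more concrete, here is a minimal sketch of how the same protein sequence might be presented to each kind of model: as an image-like grid for a convolutional network, and as a list of tokens (like words in a sentence) for a transformer. This is only an illustration. The function names, the example peptide, and the encodings themselves are assumptions, not the ones used in the study.

```python
# Hypothetical illustration only; the study's actual encodings are not described here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one letter each


def one_hot_grid(sequence):
    """Image-like view: one row per residue, one column per amino acid.

    A convolutional network can scan a grid like this much as it would
    scan the pixels of an image.
    """
    return [[1 if aa == residue else 0 for aa in AMINO_ACIDS] for residue in sequence]


def token_ids(sequence):
    """Language-like view: each residue becomes an integer token.

    A transformer reads such a list the way a language model reads
    the words of a sentence.
    """
    vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    return [vocab[residue] for residue in sequence]


peptide = "MKTAY"  # made-up example sequence
grid = one_hot_grid(peptide)
tokens = token_ids(peptide)

print(len(grid), "rows x", len(grid[0]), "columns")
print(tokens)
```

In both cases the input is the same string of letters; what changes is whether the model treats it as a picture to scan or as a sentence to read.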

This research is vital to addressing the issue of the dark proteome, which includes all proteins for which no information is available. In this way, proteins can be analyzed, and functions of genes with biomedical and biotechnological potential, especially in little-studied organisms, can be identified, says Rosa Fernández, co-leader of the study at IBE (CSIC-UPF). This is especially relevant now that unknown organisms are being sequenced in large quantities, leading to millions of sequences whose function cannot be predicted using traditional methods.
