In a major scientific breakthrough, researchers have developed a new artificial intelligence tool, called CANYA, that can decode the “language” proteins use to form harmful clumps—a process involved in diseases like Alzheimer’s and dozens of other disorders. CANYA allows scientists to see exactly which combinations of amino acids (the building blocks of proteins) encourage or prevent the sticky clumping known as amyloid aggregation, which can disrupt normal cell function.
The researchers were led by Benedetta Bolognesi of the Institute for Bioengineering of Catalonia (IBEC) and Ben Lehner of the Centre for Genomic Regulation (CRG) and the Wellcome Sanger Institute, in collaboration with scientists at Cold Spring Harbor Laboratory (CSHL).
To build CANYA, the team created the largest-ever dataset on protein aggregation by generating over 100,000 entirely random protein fragments – including many versions not found in nature – and testing them in yeast cells. This innovative approach gave them a much broader view of potential protein behaviors than studies that only looked at natural or small sets of sequences.
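The library-screening idea can be illustrated with a short sketch. The fragment length, alphabet handling, and function names below are illustrative assumptions, not the actual library design described in the paper:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def random_fragments(n, length, seed=0):
    """Generate n random peptide fragments of a fixed length.

    Most of these sequences will not occur in nature, which is the
    point: the library samples sequence space broadly rather than
    only around known proteins.
    """
    rng = random.Random(seed)
    return ["".join(rng.choices(AMINO_ACIDS, k=length)) for _ in range(n)]

# A library on the scale used in the study.
library = random_fragments(n=100_000, length=20)
```

In the actual experiment, each such fragment was expressed in yeast cells and its aggregation behavior measured, yielding a labeled sequence-to-aggregation dataset.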
The data was then used to train CANYA using a combination of AI methods borrowed from image and language recognition, enabling the system to both zoom in on tiny details in protein chains and understand their importance in a larger context. As a result, CANYA not only predicted whether a protein would form aggregates but also explained why—revealing new rules about how proteins behave.
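The "zoom in on details, weigh them in context" combination can be sketched as a convolution step (image-style local motif detection) followed by attention pooling (language-style weighting). This is a toy NumPy illustration of that hybrid idea with untrained, randomly initialized weights; it is not the published CANYA model:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Encode a peptide as a (length, 20) one-hot matrix."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x

def conv_attention_score(seq, filters, w_att, w_out, width=3):
    """Toy hybrid model: convolutional filters detect local motifs
    ('zoom in'), attention pools them by importance ('larger context'),
    and a logistic output turns the result into a probability."""
    x = one_hot(seq)
    n_pos = len(seq) - width + 1
    # Convolution: each filter scans windows of `width` residues.
    acts = np.array([[np.sum(x[p:p + width] * f) for f in filters]
                     for p in range(n_pos)])        # (n_pos, n_filters)
    acts = np.maximum(acts, 0.0)                    # ReLU
    # Attention: softmax weights over positions, then weighted pooling.
    att = np.exp(acts @ w_att)
    att /= att.sum()
    pooled = (acts * att[:, None]).sum(axis=0)      # (n_filters,)
    # Logistic output: probability the fragment aggregates.
    return 1.0 / (1.0 + np.exp(-(pooled @ w_out)))

# Example with random (untrained) weights -- illustration only.
rng = np.random.default_rng(0)
filters = rng.normal(size=(4, 3, 20))
score = conv_attention_score("ACDEFGHIKLMNPQRST", filters,
                             rng.normal(size=4), rng.normal(size=4))
```

Because the learned filters correspond to short amino-acid motifs and the attention weights show where the model looks, this kind of architecture lends itself to the interpretability discussed below.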
Application in drug development
The significance of this work extends beyond disease research. Protein clumping is a major challenge in biotechnology, particularly in the manufacturing of protein-based drugs, which can become unusable if they aggregate. CANYA’s ability to identify aggregation-prone sequences could help engineers design more stable proteins, saving time and money.
While the tool currently classifies proteins into clumping or non-clumping types, researchers aim to expand it to predict how quickly proteins aggregate, a key factor in disease progression.
The importance of transparency in AI
Unlike typical AI systems, which produce results without explaining them (the famous ‘black box’), CANYA is ‘explainable’: it was built specifically to reveal the chemical rules behind its decisions, making them transparent and understandable to humans. “We want to trust that the model is making its predictions for reasons that make sense, and not based on something that just happens to correlate with the result but actually is not at all related to it, as it might happen in a ‘black box’ AI”, explains Mike Thompson, first author of the paper.
“As more and more researchers turn to AI to analyze and model their data, it’s critical to understand the conclusions and predictions made by the model”
Mike Thompson (CRG), first author of the paper
When asked about the challenges of making AI explainable, Thompson says: “It isn’t that difficult for biological contexts, but it generally comes at the risk of a loss of predictive power. This is because the more complex the model architecture is – the more parameters it uses – the better it performs, but also the less we understand about how it works”.
Although making CANYA explainable meant sacrificing some predictive power, the trade-off was worthwhile for the gain in trustworthiness. Even so, the tool proved to be 15% more accurate than existing models.
The authors hope this work serves as an example of how to select and interpret model architectures, and they are currently exploring additional scenarios and guidelines to develop these practices.
In essence, this study demonstrates how combining large-scale lab experiments with explainable AI can make biology more predictable—an essential step for both healthcare and synthetic biology innovations.
Mike Thompson, Mariano Martín, Trinidad Sanmartín Olmo, Chandana Rajesh, Peter K. Koo, Benedetta Bolognesi, Ben Lehner. “Massive experimental quantification allows interpretable deep learning of protein aggregation”. Science Advances, 30 Apr 2025, Vol 11, Issue 18. DOI: 10.1126/sciadv.adt5111