Our lab focuses on the projects below.

We set up collaborations with scientists across the world with the goal of advancing the state of the art in biological discovery.

Protein Transformers

In ProtTrans, we trained large AI models to read protein sequences like a language, similar to how modern AI reads human text. These models learned fundamental principles about how proteins work without being explicitly taught biology, chemistry or physics. Most remarkably, our best model (ProtT5) achieved state-of-the-art performance in predicting protein properties while being dramatically faster than traditional methods.

This work demonstrated how combining high-performance computing, natural language processing, and biology can lead to scientific breakthroughs. We're particularly interested in how self-supervised learning - where AI systems learn directly from raw data without human labels - can reveal new patterns and principles in biology.
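The masked-language-modeling idea behind this kind of self-supervision can be illustrated with a small sketch: residues are hidden at random, and the labels the model trains on come from the sequence itself rather than from human annotation. The `mask_sequence` helper below is hypothetical, for illustration only, and is not the ProtTrans training code.

```python
import random

def mask_sequence(seq, mask_rate=0.15, mask_token="<mask>", seed=None):
    """Randomly hide residues so a model must reconstruct them from
    context alone -- the core self-supervised training objective."""
    rng = random.Random(seed)
    tokens, targets = [], {}
    for i, residue in enumerate(seq):
        if rng.random() < mask_rate:
            tokens.append(mask_token)
            targets[i] = residue  # the label comes from the data itself
        else:
            tokens.append(residue)
    return tokens, targets

# A fragment of a protein sequence; roughly 15% of residues get masked.
tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", seed=0)
```

Given only `tokens`, the model's job during pre-training is to recover every entry of `targets` from the surrounding sequence context.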


Find out more: https://doi.org/10.1109/TPAMI.2021.3095381

Nucleotide Transformers

We contributed the Nucleotide Transformer (NT), a deep learning system trained on multiple genomes from humans and other species. NT learned the language of DNA by unmasking millions of genetic sequences. Through this process, it discovered patterns in DNA that control how genes work, where genes begin and end, and how DNA variations might affect human health.

NT demonstrates how a single AI model can learn to perform many different genomics tasks - from predicting gene regulation to identifying disease-causing mutations - without requiring specialized training for each task. The project also provided insights into what the model learns about DNA structure and function, helping advance our understanding of how genetic information is encoded and interpreted in living systems.
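Before a genomic language model can read DNA, the raw sequence has to be turned into tokens; one common scheme splits it into fixed-length k-mers. The sketch below is a simplified illustration of that idea (`kmer_tokenize` is a hypothetical helper, not NT's actual tokenizer).

```python
def kmer_tokenize(dna, k=6):
    """Split a DNA string into non-overlapping k-mer tokens,
    falling back to single bases for any trailing remainder."""
    dna = dna.upper()
    tokens = []
    i = 0
    while i + k <= len(dna):
        tokens.append(dna[i:i + k])
        i += k
    tokens.extend(dna[i:])  # leftover bases become single-nucleotide tokens
    return tokens

# 14 bases -> two 6-mers plus two single-base tokens
print(kmer_tokenize("ATGCGTACGTTAGC"))  # -> ['ATGCGT', 'ACGTTA', 'G', 'C']
```

Each token then gets an embedding, and masked-token prediction over such sequences is what lets the model pick up regulatory and structural patterns in DNA.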


Find out more: https://doi.org/10.1038/s41592-024-02523-z


Also check out GenSLMs, a generative model for genome sequences: https://doi.org/10.1177/10943420231201154

Efficient homology search

In this project, we tackled fundamental computational bottlenecks in protein sequence comparison by redesigning core algorithms to efficiently utilize modern GPU architectures. The work combines elements of parallel algorithm design, high-performance computing, and computational biology. With our collaborators, we proposed novel GPU-optimized algorithms for both gapless sequence filtering and gapped alignment using protein profiles, achieving speeds of up to 100 trillion cell updates per second. Beyond raw performance gains, we focused on making these advanced search capabilities accessible even on low-power GPUs, enabling broader adoption in the research community.
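The "cell updates per second" metric counts entries filled in the dynamic-programming matrix of a local alignment. As a point of reference for what one cell update is, here is a minimal, unoptimized Smith-Waterman sketch in plain Python (the GPU work uses profiles and far more sophisticated kernels; this is only the underlying recurrence):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment DP; filling each (i, j) entry is one 'cell
    update', the unit behind cell-updates-per-second (CUPS)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, cells = 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score,  # match / mismatch
                          H[i - 1][j] + gap,        # gap in b
                          H[i][j - 1] + gap)        # gap in a
            best = max(best, H[i][j])
            cells += 1
    return best, cells

score, cells = smith_waterman("HEAGAWGHEE", "PAWHEAE")
```

A naive loop like this performs len(a) × len(b) cell updates per query-target pair, which is why reaching trillions of updates per second requires both gapless pre-filtering and heavily parallelized GPU kernels.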


Find out more: https://doi.org/10.1101/2024.11.13.623350

Probing Protein Engineering

FLIP (Fitness Landscape Inference for Proteins) is a benchmark suite for evaluating how well machine learning models predict protein function from sequence data. The benchmark addresses a critical need in protein engineering - the ability to accurately predict whether protein sequences will perform desired functions, which is essential for developing new therapeutics and industrial enzymes.

What makes FLIP particularly valuable is its focus on testing how well models make predictions in challenging scenarios that matter for real protein engineering applications - like predicting the effects of multiple simultaneous mutations or working with limited training data. The benchmark revealed that while current machine learning methods show promise, there are still significant opportunities to develop better approaches, especially for complex mutation patterns. By providing standardized ways to evaluate new methods, FLIP helps accelerate research toward more reliable protein engineering tools that could impact fields from medicine to industrial biotechnology.
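A concrete way to probe extrapolation to higher-order mutants is to split data by mutation count: train only on variants close to the wild type, then test on variants carrying more simultaneous mutations. The snippet below is a hypothetical illustration of that kind of split, not FLIP's actual dataset code.

```python
def hamming(a, b):
    """Number of positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def split_by_mutation_count(wild_type, variants, max_train_mutations=1):
    """Train on low-order mutants, test on higher-order ones -- the
    kind of extrapolation split FLIP-style benchmarks probe."""
    train = [v for v in variants
             if hamming(wild_type, v) <= max_train_mutations]
    test = [v for v in variants
            if hamming(wild_type, v) > max_train_mutations]
    return train, test

wt = "MKVL"  # toy wild-type sequence
variants = ["MKVL", "AKVL", "MKVA", "AAVL", "AAAA"]
train, test = split_by_mutation_count(wt, variants)
```

A model that scores well on random splits can still fail badly on this one, which is exactly the gap such benchmarks are designed to expose.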


Find out more: https://openreview.net/forum?id=p2dMLEwL8tF

Predictions at your fingertips

PredictProtein represents a pioneering effort in making protein analysis accessible to scientists worldwide. Since 1992, it has helped researchers understand proteins by predicting their structure and function from amino acid sequences. We contributed the latest update to PredictProtein, enhancing it with predictions from protein language models - the first significant departure from the explicit use of evolutionary information.

We continuously integrate our deep learning approaches into usable software, freely available to the scientific community. This work is particularly meaningful as it helps bridge the growing gap between the vast number of protein sequences we discover and our ability to experimentally characterize them, contributing to fundamental biological research and potential applications in medicine and biotechnology.


Find out more: https://doi.org/10.1093/nar/gkab354