Welcome to this week’s issue of Robots in Lab Coats, the newsletter dedicated to giving you all the latest breakthroughs in AI in Medicine. Today, we’ll look at AlphaFold 2, DeepMind’s protein structure prediction model. We’ll cover why it’s important, what it is, and how it works.
Disclaimer: This newsletter assumes a basic understanding of proteins and how they fold. For a more in-depth / beginner’s guide, see here.
Why is it important?
Determining the structure of a new protein from scratch (“de novo”) can cost anywhere from 7,500 to 300,000 CAD and take years of research. That makes it tough for small startups in the medical field to get going, given the large amounts of time and money required and the fact that roughly 90% of pharmaceuticals fail in clinical trials. Think about that for a second: new medicines have a higher chance of failing than New Year’s resolutions. This is why it’s so important in drug development to fail faster, so you don’t waste years and millions of dollars researching a drug that doesn’t work, is toxic, or has some other problem.
What is AlphaFold 2?
AlphaFold 2 was developed by the DeepMind team at Google, one of the most impactful groups in AI today. From beating top professional StarCraft II players to beating Stockfish, one of the strongest chess engines ever built, they are truly committed to the honourable pursuit of beating up nerds. Unlike traditional chess engines, DeepMind’s AlphaZero wasn’t handed any human strategies or hand-crafted rules: everything it knows came from playing roughly 44 million games against itself.

AlphaFold 2, however, was created to tackle another big challenge: protein folding. Since 1994, the Critical Assessment of protein Structure Prediction (CASP) has been held every other year with the goal of reducing the time, effort, and money needed to predict the structure of proteins. The organisers release ~100 amino acid sequences whose structures have been determined in the lab but not yet released to the public. Competing teams then try to predict those structures and are graded using the global distance test (GDT), which measures on a scale from 0 to 100 how close a predicted structure is to the shape found in the lab. In 2020, AlphaFold 2 flabbergasted the world by predicting the structure of over two-thirds of the target proteins with a score above 90, i.e. accurate to within roughly the width of an atom. At that level, the remaining differences between the predicted and actual structures could just as easily come down to problems with the experiment or natural variation in the protein itself. Its creators at DeepMind, Demis Hassabis and John Jumper, also recently received a share of the 2024 Nobel Prize in Chemistry for its ability to predict protein structures.
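To make the GDT idea a bit more concrete, here’s a rough, simplified sketch in Python (not CASP’s official scorer, which also searches for the best superposition of the two structures first). It computes a toy GDT_TS-style score: the average percentage of residues whose predicted position lands within 1, 2, 4, and 8 ångströms of the experimentally measured one.

```python
import numpy as np

def gdt_ts(predicted, experimental, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Toy GDT_TS: average fraction of residues within each distance cutoff (in angstroms).

    `predicted` and `experimental` are (N, 3) arrays of C-alpha coordinates and are
    assumed to already be superimposed (the real GDT searches for the best superposition).
    """
    distances = np.linalg.norm(predicted - experimental, axis=1)
    fractions = [np.mean(distances <= c) for c in cutoffs]
    return 100 * np.mean(fractions)

# Toy example: a 5-residue "protein" whose predicted positions are off by 0.2-3 angstroms
experimental = np.zeros((5, 3))
predicted = experimental + np.array(
    [[0.5, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [0.2, 0, 0], [0.8, 0, 0]]
)
print(f"GDT_TS ≈ {gdt_ts(predicted, experimental):.1f}")  # scores near 100 = near-perfect prediction
```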
How does it work?
AlphaFold 2 was unveiled in 2020 (the original AlphaFold first competed at CASP back in 2018). It was trained on data from almost 170,000 publicly available protein structures and used 16 TPUv3s, roughly equivalent to ~100–200 GPUs, or in layman’s terms: a whole freaking lot of computing power.

When you enter a protein sequence, it first builds something called a multiple sequence alignment (MSA), which lines the query up against similar protein sequences found in large databases. It then feeds this data into the neural network (a program loosely modelled on the way the human brain operates: identifying patterns, weighing options, and arriving at conclusions), which highlights differences and similarities between these related proteins. It also creates a “pair representation”, with one entry for every pair of amino acids in the protein. This allows the neural network to encode the co-evolutionary relationships between them (what effect changing one amino acid has on another) based on the MSA. AlphaFold 2 uses a neural network module called the Evoformer that looks at and updates both the MSA and the pair representation at the same time, which allows it to reason about evolutionary relationships (which amino acids stay the same over time and which ones change).

After all of this, it builds the structure by taking the updated pair representation and the original sequence out of the Evoformer and turning them into a 3D backbone. It then places the amino acid side chains and refines their positions. Finally, it performs an iterative process called “recycling”, where it feeds the MSA, the pair representation, and the predicted 3D structure back into the neural network and generates a new 3D structure. It repeats this 3 times, which improves the final structure’s accuracy immensely.
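If you like seeing data flow as code, here’s a very loose sketch of that pipeline in Python. To be clear, this is not DeepMind’s implementation (their real model is open-sourced in JAX in the `alphafold` repo); every function here is a tiny stand-in I made up purely to show the shapes of the MSA representation, the pair representation, and the recycling loop.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_msa(sequence, n_homologs=8):
    """Stand-in for the MSA search: real pipelines query big sequence databases
    (e.g. with JackHMMER) for evolutionarily related sequences."""
    rng = np.random.default_rng(0)
    msa = [sequence]
    for _ in range(n_homologs):
        seq = list(sequence)
        for i in rng.choice(len(seq), size=2, replace=False):  # fake "evolution": 2 random mutations
            seq[i] = rng.choice(list(AMINO_ACIDS))
        msa.append("".join(seq))
    return msa

def evoformer(msa_repr, pair_repr, n_blocks=4):
    """Stand-in for the Evoformer: the real one uses attention, but the key idea is that
    the MSA representation and the (L x L) pair representation update each other."""
    for _ in range(n_blocks):
        pair_repr = pair_repr + 0.01 * msa_repr.mean(axis=0)[:, None, :]
        msa_repr = msa_repr + 0.01 * pair_repr.mean(axis=1)[None, :, :]
    return msa_repr, pair_repr

def structure_module(sequence, pair_repr):
    """Stand-in for the structure module: returns fake (L, 3) backbone coordinates."""
    return pair_repr.mean(axis=(1, 2))[:, None] * np.ones((len(sequence), 3))

def predict(sequence, n_recycles=3, channels=16):
    L = len(sequence)
    msa = build_msa(sequence)
    rng = np.random.default_rng(1)
    msa_repr = rng.normal(size=(len(msa), L, channels))  # one row per aligned sequence
    pair_repr = rng.normal(size=(L, L, channels))        # one entry per pair of amino acids
    coords = np.zeros((L, 3))
    for _ in range(1 + n_recycles):                      # "recycling": run the whole thing again,
        msa_repr, pair_repr = evoformer(msa_repr, pair_repr)
        pair_repr = pair_repr + 0.01 * coords.mean()     # ...feeding the previous structure back in
        coords = structure_module(sequence, pair_repr)
    return coords

print(predict("MKTAYIAKQR").shape)  # (10, 3): one x/y/z position per residue
```

The real Evoformer stacks 48 attention-based blocks, and the real structure module predicts a rotation and translation for every residue rather than raw coordinates, but the overall loop (MSA and pair representation in, 3D structure out, fed back in a few times) has the same shape.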
TL;DR
Protein Folding: The process by which a protein’s amino acid sequence determines its 3D structure. Conventional methods like X-ray crystallography can measure that structure, but finding it can take years and hundreds of thousands of dollars.
Machine Learning: A type of artificial intelligence that uses statistical models to learn from past data and make predictions on new data.
AlphaFold 2: Developed by Google’s DeepMind team. Able to predict a protein’s structure to within roughly the width of an atom in a few days (there’s a short sketch after this list showing how to grab one of its precomputed predictions).
Other Technologies: RoseTTAFold can predict a protein’s structure in around 10 minutes with about a third of the memory, but is less accurate. Meta AI’s ESMFold takes a different approach: it uses a protein language model to predict the structure.
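By the way, if you just want to poke at AlphaFold’s output rather than run the model yourself, DeepMind and EMBL-EBI host precomputed predictions for over 200 million proteins in the AlphaFold Protein Structure Database. Here’s a minimal sketch that downloads one by its UniProt ID; it assumes the v4 file-naming scheme the database uses at the time of writing, so double-check the site if the URL stops working.

```python
import urllib.request

def fetch_alphafold_pdb(uniprot_id, out_path=None):
    """Download a precomputed AlphaFold prediction from the AlphaFold DB.

    Assumes the current (v4) file-naming convention; check
    https://alphafold.ebi.ac.uk if this stops working.
    """
    url = f"https://alphafold.ebi.ac.uk/files/AF-{uniprot_id}-F1-model_v4.pdb"
    out_path = out_path or f"{uniprot_id}.pdb"
    urllib.request.urlretrieve(url, out_path)
    return out_path

# Example: P69905 is the UniProt ID for human hemoglobin subunit alpha
print(fetch_alphafold_pdb("P69905"))
```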
Top Reads
Christian Anfinsen proved that a protein's structure is determined by its amino acid sequence.
An overview of AlphaFold by the European Bioinformatics Institute
What I’m Working On
Last week, I finished an article on using DeepFace, Python, and OpenCV to recognise faces. This week, my brother challenged me to write an article, share it with 10 people, and get each of them to share it with another 3 people by next Tuesday, so naturally I’m procrastinating on releasing that by working on this.
Conclusion
And that’s a wrap for this week’s newsletter. Always remember, curiosity is your best friend — and probably your only friend if you keep signing up for mediocre mailing lists instead of going outside, you goober. If you have any ideas for how I can improve this newsletter (shouldn’t be too hard), you can email me here or schedule a meeting with me here. Until next time, remember: life is eternal suffering, and the average wait time for a ride at Disney World is 38 minutes.
Smell you later!
— Amitav Krishna