Why proteins fold and how GPUs help us fold
Posted by diginova 1 day ago
Comments
Comment by fabian2k 1 day ago
I'm quite impressed by the amino acid overview graphic. I'm sure all images are AI-generated, and this one is something I didn't expect AI to be able to do yet. There are mistakes in there (e.g. Threenine instead of Threonine, charged amino groups for some amino acids), but it doesn't look immediately wrong. Though I haven't needed to know the chemical formulas for all the amino acids in a long time, so there are probably more errors in there I didn't immediately notice. The angles and lengths of the bonds are not entirely consistent, but that also happens without AI sometimes if someone doesn't know the drawing tools well. The labels are probably the clearest indicator, because they are partly wrong and they are not consistent as they also include the non-side-chain parts sometimes, which doesn't make sense.
The biology part of the text looks somewhat reasonable overall; I didn't notice any completely outrageous statements at a quick glance. Though I don't like the "folding is reproducible" statement, as that is a huge oversimplification. Proteins do misfold, and there is an entire apparatus in the cells to handle those cases and clean them up.
Comment by D-Machine 1 day ago
AlphaFold models also used TPUs: https://github.com/google-deepmind/alphafold/issues/31#issue...
EDIT: Also annoying is the usual bullshit about "attention" being some kind of magic. It isn't even clear AlphaFold uses the same kind of attention as typical LLM transformers, because it uses custom "Evoformer" layers instead: https://www.nature.com/articles/s41586-021-03819-2_reference...
Comment by lukah 1 day ago
Comment by Agingcoder 1 day ago
Comment by augment_me 1 day ago
Comment by coolness 1 day ago
Comment by lm28469 1 day ago
There are many similar things where people just take shortcuts because they don't understand that the interesting part is the process/skill, not the final result. It probably has to do with external validation. Reddit is full of "art" subs being polluted by these people, and generative AI is even leaking into leather work, wood carving, and lino cut. It's a cancer.
Comment by IAmBroom 23 hours ago
Comment by robbie-c 1 day ago
Comment by fabian2k 1 day ago
Comment by D-Machine 1 day ago
Comment by fabian2k 1 day ago
Comment by D-Machine 1 day ago
EDIT: Ugh, it is late. I mean, if you used e.g. ChatGPT-5.X with extended thinking and search, it would not make these grievous errors. However, ChatGPT without search and in the default style produces junk basically indistinguishable from this kind of post. So, for me, the smoking gun is that not even the most basic due diligence (reading Wikipedia or looking at the actual papers) has been done; given the length and style of the post, this effectively points to (cheap, free-version) AI use.
But, more importantly, it is indistinguishable in quality from AI slop, and so garbage regardless.
Comment by atomlib 1 day ago
Comment by topaz0 1 day ago
Comment by D-Machine 1 day ago
This is all covered cursorily even by Wikipedia - https://en.wikipedia.org/wiki/AlphaFold#AlphaFold_2_(2020).
Comment by terhechte 1 day ago
Comment by eesmith 1 day ago
> as you're reading this, there are approximately 20,000 different types of proteins working inside your body.
From https://biologyinsights.com/how-many-human-proteins-are-ther...
"The human genome contains approximately 19,000 to 20,000 protein-coding genes. While each gene can initiate the production of at least one protein, the total count of distinct proteins is significantly higher. Estimates suggest the human body contains 80,000 to 400,000 different protein types, with some projections reaching up to a million, depending on how a “distinct protein” is defined."
Plus, that's just in the human DNA. In your body are a whole bunch of bacteria, adding even more types of protein.
> The actual number of protein molecules? Billions. Trillions if we're counting across all your cells.
There are on average 10 trillion proteins in a single cell. https://nigms.nih.gov/biobeat/2025/01/proteins-by-the-number... There are over 30 trillion human cells in an adult. https://pmc.ncbi.nlm.nih.gov/articles/PMC4991899/ . That's about 300 septillion proteins in the body. While yes, that's "trillions" in some mathematical sense, in that case it's also "tens" of proteins.
(The linked-to piece later says "every single one of your 37 trillion cells", showing that "trillions" is far from the correct characterization. "trillions of trillions" would get the point across better.)
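For anyone who wants to check the arithmetic, here is a minimal sketch using only the two averages quoted above (about 10 trillion proteins per cell, about 30 trillion human cells). The constant and function names are my own, and the inputs are order-of-magnitude estimates, so treat the result as a rough scale rather than a measurement.

```python
# Order-of-magnitude figures quoted in the comment above (assumed averages).
PROTEINS_PER_CELL = 10 * 10**12   # ~10 trillion proteins in a single cell
HUMAN_CELLS = 30 * 10**12         # ~30 trillion human cells in an adult

TRILLION = 10**12
SEPTILLION = 10**24               # short-scale septillion


def total_proteins(per_cell: int = PROTEINS_PER_CELL, cells: int = HUMAN_CELLS) -> int:
    """Rough total protein count: proteins per cell times number of cells."""
    return per_cell * cells


total = total_proteins()
print(f"about {total // SEPTILLION} septillion proteins")
print(f"that is {total // (TRILLION * TRILLION)} trillion trillion")
```

Both lines report 300, which is why "trillions of trillions" is the closer description.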
> Each one has a specific job.
Proteins can do multiple jobs, unless you define "job" as "whatever the protein does."
E.g., from https://pmc.ncbi.nlm.nih.gov/articles/PMC3022353/
"many of the proteins or protein domains encoded by viruses are multifunctional. The transmembrane (TM) domains of Hepatitis C Virus envelope glycoprotein are extreme examples of such multifunctionality. Indeed, these TM domains bear ER retention signals, demonstrate signal function and are involved in E1:E2 heterodimerization (Cocquerel et al. 1999; Cocquerel et al. 1998; Cocquerel et al. 2000). All these functions are partially overlapped and present in the sequence of <30 amino acids"
> And if even ONE type folds wrong, one could get ... sickle cell anemia
Sickle cell anemia is due to a mutation in the hemoglobin gene causing a hydrophobic patch to appear on the surface, which causes the hemoglobins to stick to each other.
It isn't caused by misfolding. https://en.wikipedia.org/wiki/Sickle_cell_disease
(I haven't researched the others to see if they are due to misfolding.)
> Your body makes these proteins perfectly
No, it doesn't. The error rate is quite low, but not perfect. Quoting https://pmc.ncbi.nlm.nih.gov/articles/PMC3866648/
"Errors are more frequent during protein synthesis, resulting either from misacylation of tRNAs or from tRNA selection errors that cause insertion of an incorrect amino acid (misreading) shifting out of the normal reading frame (frameshifting), or spontaneous release of the peptidyl-tRNA (drop-off) (Kurland et al. 1996). Misreading errors are arguably the most common translational errors (Kramer and Farabaugh 2007; Kramer et al. 2010; Yadavalli and Ibba 2012)."
> Then AI companies showed up in 2020 and said "we got this" and solved it in an afternoon.
They didn't simply "show up" in 2020. Google DeepMind had been working on it since 2016 or so. https://www.quantamagazine.org/how-ai-revolutionized-protein...
> we're DESIGNING entirely new proteins that have never existed in nature
We've been designing new proteins that have never existed in nature for decades. From https://en.wikipedia.org/wiki/Protein_design
"The first protein successfully designed completely de novo was done by Stephen Mayo and coworkers in 1997 ... Later, in 2008, Baker's group computationally designed enzymes for two different reactions.[7] In 2010, one of the most powerful broadly neutralizing antibodies was isolated from patient serum using a computationally designed protein probe.[8] In 2024, Baker received one half of the Nobel Prize in Chemistry for his advancement of computational protein design, with the other half being shared by Demis Hassabis and John Jumper of Deepmind for protein structure prediction."
> These are called secondary structures, local patterns in the protein backbone
The corresponding figure is really messed up. The sequence of atoms in the amino acids is wrong, and the pairs of atoms which are hydrogen bonded are wrong. For example, it shows a hydrogen bond between two double-bonded oxygens, which don't have a hydrogen, and a hydrogen bond between two hydrogens, which would both have partial positive charge. The hydrogen bonds are supposed to go from the N-H to the O=C. See https://en.wikipedia.org/wiki/Beta_sheet#Hydrogen_bonding_pa...
> Given the same sequence, you get the same structure.
The structure may depend on environmental factors. For example, https://en.wikipedia.org/wiki/%CE%91-Lactalbumin "α-lactalbumin is a protein that regulates the production of lactose in the milk of almost all mammalian species ... A folding variant of human α-lactalbumin that may form in acidic environments such as the stomach, called HAMLET, probably induces apoptosis in tumor and immature cells."
There can also be post-translational modifications.
> The sequence contains all the instructions needed to fold into the correct shape.
Assuming you know the folding environment.
> Change the shape even slightly, and the protein stops working.
I don't know how to interpret this. Some proteins require changing their shape to work. Myosin - a muscle protein - changes its shape during its power stroke.
> Prions are misfolded proteins that can convert normal proteins into the misfolded form, spreading like an infection
Earlier the author wrote "It's deterministic (mostly, there are exceptions called intrinsically disordered proteins, but let's not go there)."
https://en.wikipedia.org/wiki/Prion says "Prions are a type of intrinsically disordered protein that continuously changes conformation unless bound to a specific partner, such as another protein."
So the author went there. :)
Either accept that proteins aren't always deterministically folded based on their sequence, or don't use prions as an example of misfolding.
Comment by D-Machine 1 day ago
====
Physical insights are built into the network structure, not just a process around it
- End-to-end system directly producing a structure instead of inter-residue distances
- Inductive biases reflect our knowledge of protein physics and geometry
- The positions of residues in the sequence are de-emphasized
- Instead residues that are close in the folded protein need to communicate
- The network iteratively learns a graph of which residues are close, while reasoning over this implicit graph as it is being built
What went badly:
- Manual work required to get a very high-quality Orf8 prediction
- Genetics search works much better on full sequences than individual domains
- Final relaxation required to remove stereochemical violations
What went well
- Building the full pipeline as a single end-to-end deep learning system
- Building physical and geometric notions into the architecture instead of a search process
- Models that predict their own accuracy can be used for model-ranking
- Using model uncertainty as a signal to improve our methods (e.g. training new models to eliminate problems with long chains)
====
Also you can read the papers, e.g. https://www.nature.com/articles/s41586-019-1923-7 (available if you search the title on Google Scholar; also https://www.nature.com/articles/s41586-021-03819-2_reference...). There is actual, real good science, physics, and engineering going on here, as compared to e.g. LLMs or computer vision models that are just trained on the internet, and where all the engineering is focused on managing finicky training and compute costs. AlphaFold requires all this and more.
EDIT: Basically, the article makes it sound like deep models just allowed scientists to sidestep all the complicated physics etc. and just magically solve the problem, and while this is arguably somewhat correct for computer vision and much of NLP, this is the exact opposite of the truth for AlphaFold.
Comment by penetrarthur 1 day ago
On a sidenote, what is this new style of writing using small sentences where each sentence is supposed to be a punchline?
"And most of those sequences? They don't fold into anything useful. They're junk. They aggregate into clumps. They get degraded by cellular quality control. Only a TINY fraction of possible sequences fold into stable, functional proteins."
Comment by prof-dr-ir 1 day ago
Congratulations, you are now able to recognize an AI-generated text.
(As of December 2025 at least, who knows what they will look like next month.)
Comment by tim333 23 hours ago
Comment by cassianoleal 1 day ago
Comment by lm28469 1 day ago
Comment by emptybits 1 day ago
And I do love the optimism.
But then you must admit this reads like a B-movie intro:
Then AI companies showed up in 2020 and said "we got this" and
solved it in an afternoon. ... We're playing God with molecules
and it's working.
Comment by ursAxZA 1 day ago
How many H100s do you need to simulate one human cell? Probably more than the universe can power.
Comment by tim333 23 hours ago
Comment by zkmon 1 day ago
Anytime someone talks about large numbers - some galaxy is billions of kilometers away, there are trillions of atoms in the universe, trillions of possible combinations for a problem etc - it appears to me that you are talking about some problem that doesn't fall into your job description.
Comment by IAmBroom 23 hours ago
I'm sorry that you're also so numerophobic, but real people use numbers of those magnitudes every day. Your own computer, in fact, has billions of storage slots in its disk space - although perhaps that's something that doesn't fall into your job description.
Comment by zkmon 23 hours ago
Most of these numbers are hierarchical. I do count the memory modules, but not bits. I count apples, but not molecules in them. I try to count a few bright stars in the night sky, but not all stars in the galaxy. I try to stick to traditional non-GM food, which my ancestors ate, instead of counting protein molecules. I try to have children and grandkids, instead of trying to live eternally through great advances in science.
Comment by VirusNewbie 1 day ago
Comment by D-Machine 1 day ago
This article is garbage and makes many incorrect claims, and it is clearly AI-generated. E.g. the claim that "AlphaFold doesn't simulate physics. It recognizes patterns learned from 170,000+ known protein structures" couldn't be farther from the truth. Physical models are baked right into AlphaFold models and development at multiple steps; it is a highly unique architecture and approach.