Why proteins fold and how GPUs help us fold

Posted by diginova 1 day ago

Comments

Comment by fabian2k 1 day ago

The secondary structure graphic is entirely wrong. It's full of bad chemical formulas, and I would assume is AI-generated.

I'm quite impressed by the amino acid overview graphic. I'm sure all images are AI-generated, and this one is something I didn't expect AI to be able to do yet. There are mistakes in there (e.g. Threenine instead of Threonine, charged amino groups for some amino acids), but it doesn't look immediately wrong. Though I haven't needed to know the chemical formula for all the amino acids in a long time, so there are probably more errors in there I didn't immediately notice. The angles and lengths of the bonds are not entirely consistent, but that also happens without AI sometimes if someone doesn't know the drawing tools well. The labels are probably the clearest indicator, because they are partly wrong and they are not consistent as they also include the non-side-chain parts sometimes, which doesn't make sense.

The biology part of the text looks somewhat reasonable overall; I didn't notice any completely outrageous statements at a quick glance. Though I don't like the "folding is reproducible" statement as that is a huge oversimplification. Proteins do misfold, and there is an entire apparatus in the cells to handle those cases and clean them up.

Comment by D-Machine 1 day ago

This article is garbage and makes many incorrect claims, and it is clearly AI-generated. E.g. the claim that "AlphaFold doesn't simulate physics. It recognizes patterns learned from 170,000+ known protein structures" couldn't be farther from the truth. Physical models are baked right into AlphaFold models and development at multiple steps; it is a highly unique architecture and approach.

AlphaFold models also used TPUs: https://github.com/google-deepmind/alphafold/issues/31#issue...

EDIT: Also annoying is the usual bullshit about "attention" being some kind of magic. It isn't even clear AlphaFold uses the same kind of attention as typical LLM transformers, because it uses custom "Evoformer" layers instead: https://www.nature.com/articles/s41586-021-03819-2_reference...

Comment by lukah 1 day ago

I interpreted that section as AlphaFold not learning physics, but rather correlations within a constrained setting that a priori correspond to physically sound inferences. It has a specific architecture that allows the model to make inferences that are more physically plausible than not, but that doesn’t mean it’s discovering actual, causally verifiable laws of nature (like what I’d assume are encoded into another non-ML approach to the folding problem for example).

Comment by Agingcoder 1 day ago

It’s also not a solved problem, contrary to what the article claims, unless ‘solved’ doesn’t mean ‘works all the time’.

Comment by augment_me 1 day ago

The text structure screams GPT5 sadly, so I would not be surprised if not only the text but the images were wrong.

Comment by coolness 1 day ago

Yeah, I don't really understand why someone would make a blog and use AI to write the articles. Isn't having a blog more about the joy of writing and the learning you do while writing it?

Comment by lm28469 1 day ago

Because it's what cool people do, so if you want to be cool you do it. They didn't realise the cool part was actually having the knowledge and actually writing the text.

There are many similar things where people just take shortcuts because they don't understand the interesting part is the process/skill, not the final result. It probably has to do with external validation. Reddit is full of "art" subs being polluted by these people, and generative AI is even leaking into leather work, wood carving, and lino cut. It's a cancer.

Comment by IAmBroom 23 hours ago

Also, resume padding.

Comment by robbie-c 1 day ago

I think it's just an AI-generated simplification, sucks that it made it to the front page. The subject matter is interesting, I would have loved to have read something written by an expert!

Comment by fabian2k 1 day ago

I would assume so, but I didn't see any smoking guns in the text itself. But I'm also not familiar with the newest models here and their quirks.

Comment by D-Machine 1 day ago

See my point above (https://news.ycombinator.com/item?id=46271980) for smoking guns. There are some pretty basic and grievous factual errors re: GPUs being used when in fact TPUs are used, and completely false claims about physical models not being huge parts of AlphaFold development and even architecture.

Comment by fabian2k 1 day ago

Those errors don't seem AI-specific to me, they could easily be made by humans.

Comment by D-Machine 1 day ago

True, it is the style of the post that reveals obvious overuse of AI. The errors could well be made by a human, especially since a trivial visit to Wikipedia or one of the original papers will show most of what is being said here re: the actual deep models to be wrong. This is more likely the error of a human than an AI.

EDIT: Ugh, it is late. I mean, if you used e.g. ChatGPT-5.X with extended thinking and search, it would not make these grievous errors. However, ChatGPT without search and with the default style produces junk basically indistinguishable from this kind of post. So, for me, the smoking gun is that not even the most basic due diligence (reading Wikipedia or looking at the actual papers) has been done, and, given the length and style of the post, this is effectively a smoking gun for (cheap, free-version) AI use.

But, more importantly, it is indistinguishable in quality from AI slop, and so garbage regardless.

Comment by atomlib 1 day ago

Was this text AI-generated?

Comment by topaz0 1 day ago

I got about a page in before finding out this is drivel. The final straw was "AI companies showed up and solved it in an afternoon". No faster way to show you don't know what you're talking about.

Comment by D-Machine 1 day ago

Yeah this article is garbage. The real problem with protein-folding is not compute, or training on known configurations only, but figuring out a differentiable loss that is related to the energy configuration of generated new sequences / molecules, and iterative folding and all sorts of other things. It is very much NOT just a "throw lots of data at GPUs" problem.

This is all covered cursorily even by Wikipedia - https://en.wikipedia.org/wiki/AlphaFold#AlphaFold_2_(2020).

Comment by terhechte 1 day ago

I don't know the space, so I found the article interesting. Please explain, what's wrong with it?

Comment by eesmith 1 day ago

From the text:

> as you're reading this, there are approximately 20,000 different types of proteins working inside your body.

From https://biologyinsights.com/how-many-human-proteins-are-ther...

"The human genome contains approximately 19,000 to 20,000 protein-coding genes. While each gene can initiate the production of at least one protein, the total count of distinct proteins is significantly higher. Estimates suggest the human body contains 80,000 to 400,000 different protein types, with some projections reaching up to a million, depending on how a “distinct protein” is defined."

Plus, that's just in the human DNA. In your body are a whole bunch of bacteria, adding even more types of protein.

> The actual number of protein molecules? Billions. Trillions if we're counting across all your cells.

There are on average 10 trillion proteins in a single cell. https://nigms.nih.gov/biobeat/2025/01/proteins-by-the-number... There are over 30 trillion human cells in an adult. https://pmc.ncbi.nlm.nih.gov/articles/PMC4991899/ . That's about 300 septillion proteins in the body. While yes, that's "trillions" in some mathematical sense, in that case it's also "tens" of proteins.

(The linked-to piece later says "every single one of your 37 trillion cells", showing that "trillions" is far from the correct characterization. "trillions of trillions" would get the point across better.)
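
A quick back-of-the-envelope check of the multiplication above, as a minimal Python sketch. The constant names are made up for illustration, and the two inputs are just the rounded figures quoted in this comment, not independent estimates:

```python
# Rounded figures quoted in the comment above (assumed inputs, not measurements).
PROTEINS_PER_CELL = 10 * 10**12  # "10 trillion proteins in a single cell"
CELLS_PER_ADULT = 30 * 10**12    # "over 30 trillion human cells in an adult"

# Short-scale septillion, i.e. a trillion trillions.
SEPTILLION = 10**24

# Total protein molecules = proteins per cell * number of cells.
total_proteins = PROTEINS_PER_CELL * CELLS_PER_ADULT

print(f"total protein molecules: {total_proteins:.1e}")
print(f"in septillions: {total_proteins // SEPTILLION}")
```

With these inputs the product is 3 x 10^26, which is where "about 300 septillion" comes from. Using 37 trillion cells instead would raise the result a little but leave the order of magnitude unchanged.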

> Each one has a specific job.

Proteins can do multiple jobs, unless you define "job" as "whatever the protein does."

Eg, from https://pmc.ncbi.nlm.nih.gov/articles/PMC3022353/

"many of the proteins or protein domains encoded by viruses are multifunctional. The transmembrane (TM) domains of Hepatitis C Virus envelope glycoprotein are extreme examples of such multifunctionality. Indeed, these TM domains bear ER retention signals, demonstrate signal function and are involved in E1:E2 heterodimerization (Cocquerel et al. 1999; Cocquerel et al. 1998; Cocquerel et al. 2000). All these functions are partially overlapped and present in the sequence of <30 amino acids"

> And if even ONE type folds wrong, one could get ... sickle cell anemia

Sickle cell anemia is due to a mutation in the hemoglobin gene causing a hydrophobic patch to appear on the surface, which causes the hemoglobins to stick to each other.

It isn't caused by misfolding. https://en.wikipedia.org/wiki/Sickle_cell_disease

(I haven't researched the others to see if they are due to misfolding.)

> Your body makes these proteins perfectly

No, it doesn't. The error rate is quite low, but not perfect. Quoting https://pmc.ncbi.nlm.nih.gov/articles/PMC3866648/

"Errors are more frequent during protein synthesis, resulting either from misacylation of tRNAs or from tRNA selection errors that cause insertion of an incorrect amino acid (misreading) shifting out of the normal reading frame (frameshifting), or spontaneous release of the peptidyl-tRNA (drop-off) (Kurland et al. 1996). Misreading errors are arguably the most common translational errors (Kramer and Farabaugh 2007; Kramer et al. 2010; Yadavalli and Ibba 2012)."

> Then AI companies showed up in 2020 and said "we got this" and solved it in an afternoon.

They didn't simply "show up" in 2020. Google DeepMind had been working on it since 2016 or so. https://www.quantamagazine.org/how-ai-revolutionized-protein...

> we're DESIGNING entirely new proteins that have never existed in nature

We've been designing new proteins that have never existed in nature for decades. From https://en.wikipedia.org/wiki/Protein_design

"The first protein successfully designed completely de novo was done by Stephen Mayo and coworkers in 1997 ... Later, in 2008, Baker's group computationally designed enzymes for two different reactions.[7] In 2010, one of the most powerful broadly neutralizing antibodies was isolated from patient serum using a computationally designed protein probe.[8] In 2024, Baker received one half of the Nobel Prize in Chemistry for his advancement of computational protein design, with the other half being shared by Demis Hassabis and John Jumper of Deepmind for protein structure prediction."

> These are called secondary structures, local patterns in the protein backbone

The corresponding figure is really messed up. The sequence of atoms in the amino acids is wrong, and the pairs of atoms which are hydrogen bonded are wrong. For example, it shows a hydrogen bond between two double-bonded oxygens, which don't have a hydrogen, and a hydrogen bond between two hydrogens, which would both have partial positive charge. The hydrogen bonds are supposed to go from the N-H to the O=C. See https://en.wikipedia.org/wiki/Beta_sheet#Hydrogen_bonding_pa...

> Given the same sequence, you get the same structure.

The structure may depend on environmental factors. For example, https://en.wikipedia.org/wiki/%CE%91-Lactalbumin "α-lactalbumin is a protein that regulates the production of lactose in the milk of almost all mammalian species ... A folding variant of human α-lactalbumin that may form in acidic environments such as the stomach, called HAMLET, probably induces apoptosis in tumor and immature cells."

There can also be post-translational modifications.

> The sequence contains all the instructions needed to fold into the correct shape.

Assuming you know the folding environment.

> Change the shape even slightly, and the protein stops working.

I don't know how to interpret this. Some proteins require changing their shape to work. Myosin - a muscle protein - changes its shape during its power stroke.

> Prions are misfolded proteins that can convert normal proteins into the misfolded form, spreading like an infection

Earlier the author wrote "It's deterministic (mostly, there are exceptions called intrinsically disordered proteins, but let's not go there)."

https://en.wikipedia.org/wiki/Prion says "Prions are a type of intrinsically disordered protein that continuously changes conformation unless bound to a specific partner, such as another protein."

So the author went there. :)

Either accept that proteins aren't always deterministically folded based on their sequence, or don't use prions as an example of misfolding.

Comment by D-Machine 1 day ago

See for example the AlphaFold2 presentation linked here: https://predictioncenter.org/casp14/doc/presentations/2020_1.... Some samples that point out where most of the innovations are NOT just "huck a transformer at it":

====

Physical insights are built into the network structure, not just a process around it

- End-to-end system directly producing a structure instead of inter-residue distances

- Inductive biases reflect our knowledge of protein physics and geometry

- The positions of residues in the sequence are de-emphasized

- Instead residues that are close in the folded protein need to communicate

- The network iteratively learns a graph of which residues are close, while reasoning over this implicit graph as it is being built

What went badly:

- Manual work required to get a very high-quality Orf8 prediction

- Genetics search works much better on full sequences than individual domains

- Final relaxation required to remove stereochemical violations

What went well:

- Building the full pipeline as a single end-to-end deep learning system

- Building physical and geometric notions into the architecture instead of a search process

- Models that predict their own accuracy can be used for model-ranking

- Using model uncertainty as a signal to improve our methods (e.g. training new models to eliminate problems with long chains)

====

Also you can read the papers, e.g. https://www.nature.com/articles/s41586-019-1923-7 (available if you search the title on Google Scholar; also https://www.nature.com/articles/s41586-021-03819-2_reference...). There is actual, real good science, physics, and engineering going on here, as compared to e.g. LLMs or computer vision models that are just trained on the internet, and where all the engineering is focused on managing finicky training and compute costs. AlphaFold requires all this and more.

EDIT: Basically, the article makes it sound like deep models just allowed scientists to sidestep all the complicated physics, etc., and just magically solve the problem, and while this is arguably somewhat correct for computer vision and much of NLP, this is the exact opposite of the truth for AlphaFold.

Comment by penetrarthur 1 day ago

Great article!

On a sidenote, what is this new style of writing using small sentences where each sentence is supposed to be a punchline?

"And most of those sequences? They don't fold into anything useful. They're junk. They aggregate into clumps. They get degraded by cellular quality control. Only a TINY fraction of possible sequences fold into stable, functional proteins."

Comment by prof-dr-ir 1 day ago

> what is this new style of writing

Congratulations, you are now able to recognize an AI-generated text.

(As of December 2025 at least, who knows what they will look like next month.)

Comment by tim333 23 hours ago

I don't mind the style but the factual errors are not good. Like "How NVIDIA..." when it was done by DeepMind with TPUs.

Comment by cassianoleal 1 day ago

Sounds like TEDspeak, only in writing.

Comment by lm28469 1 day ago

Short sentences are good. Especially when you interact with low attention individuals. Make sure they stay engaged. It's not just a style. It's a game changer for your blog.

Comment by emptybits 1 day ago

I really appreciated the explanation of what proteins are, in simple terms. I assume (?) it's accurate enough for a layperson.

And I do love the optimism.

But then you must admit this reads like a B-movie intro:

    Then AI companies showed up in 2020 and said "we got this" and
    solved it in an afternoon. ... We're playing God with molecules
    and it's working.

Comment by ursAxZA 1 day ago

One protein fold is cute.

How many H100s do you need to simulate one human cell? Probably more than the universe can power.

Comment by tim333 23 hours ago

Depends on how accurate you want the simulation to be. DeepMind, who did the protein folding, are working towards a cell simulation.

Comment by zkmon 1 day ago

If nature did so well for billions of years, why are we taking over its job now? Did it ask for your help?

Anytime someone talks about large numbers - some galaxy is billions of kilometers away, there are trillions of atoms in the universe, trillions of possible combinations for a problem, etc. - it appears to me that you are talking about some problem that doesn't fall into your job description.

Comment by IAmBroom 23 hours ago

So, you're anti-vaccine? And anti-antibiotic? And pro-dying of disease and cancer, in general?

I'm sorry that you're also so numerophobic, but real people use numbers of those magnitudes every day. Your own computer, in fact, has billions of storage slots in its disk space - although perhaps that's something that doesn't fall into your job description.

Comment by zkmon 23 hours ago

Yes, counting memory bits in my PC doesn't fall into my job. But my question is a bit more fundamental.

Most of these numbers are hierarchical. I do count the memory modules, but not bits. I count apples, but not molecules in them. I try to count a few bright stars in the night sky, but not all stars in the galaxy. I try to stick to traditional non-GM food, which my ancestors ate, instead of counting protein molecules. I try to have children and grandkids, instead of trying to live eternally through great advances in science.