Auto-grading decade-old Hacker News discussions with hindsight

Posted by __rito__ 4 days ago

Related from yesterday: Show HN: Gemini Pro 3 imagines the HN front page 10 years from now - https://news.ycombinator.com/item?id=46205632

Comments

Comment by jasonthorsness 4 days ago

It's fun to read some of these historic comments! A while back I wrote a replay system to better capture how discussions evolved at the time of these historic threads. Here's Karpathy's list from his graded articles, in the replay visualizer:

Swift is Open Source https://hn.unlurker.com/replay?item=10669891

Launch of Figma, a collaborative interface design tool https://hn.unlurker.com/replay?item=10685407

Introducing OpenAI https://hn.unlurker.com/replay?item=10720176

The first person to hack the iPhone is building a self-driving car https://hn.unlurker.com/replay?item=10744206

SpaceX launch webcast: Orbcomm-2 Mission [video] https://hn.unlurker.com/replay?item=10774865

At Theranos, Many Strategies and Snags https://hn.unlurker.com/replay?item=10799261

Comment by SauntSolaire 4 days ago

I'd love to see sentiment analysis done based on time of day. I'm sure it's largely time zone differences, but I see a large variance in the types of opinions posted to hn in the morning versus the evening and I'd be curious to see it quantified.

Comment by embedding-shape 4 days ago

Yeah, I see this constantly any time Europe is mentioned in a submission. Early European morning/day, regular discussions, but as the European afternoon/evening comes around, you start noticing a lot anti-union sentiment, discussions start to shift into over-regulation, and the typical boring anti-Europe/EU talking points.

Comment by nostrebored 4 days ago

“Regular” to who? Pro EU sentiment almost only comes from the EU, which is what you’re observing. Pro-US sentiment is relatively mixed (as is anti-US sentiment) in distribution.

Comment by gilrain 4 days ago

> Pro EU sentiment almost only comes from the EU

Says who? But also, it doesn’t suggest what you imply. I could as easily conclude: “Oh wow, the people who actually experience the system like it that much? Awesome!”

Comment by TimedToasts 2 days ago

Or one could conclude that the bots were posting at a time of day intending you as the reading target. As long as they post things that you are inclined to agree with, you'll feel positive reinforcement about an issue regardless of the actual popularity or even viability.

Comment by red-iron-pine 3 days ago

e.g. how many are cali tech bros vs nyc fintec vs 10am moscow shillbot time

Comment by arowthway 4 days ago

Comment dates on hn frontend are sometimes altered when submissions are merged, do you handle this case properly?

Comment by jasonthorsness 3 days ago

It is handled on the Unlurker front page (you will see a little note that says “time adjusted for second chance”). The replay doesn’t do any adjustment for it, but I think that makes it reflect the reality of when the comments came in since the adjustments are like a temporary bump

Comment by matsemann 4 days ago

I like the "past" functionality here, maybe wished there was one for week/month I could scroll back as well.

Miss it for reddit as well. Top day/week/month/alltime makes it hard to find top a month in 2018.

Comment by HanClinto 4 days ago

Okay, your site is a ton of fun. Thank you! :)

Comment by modeless 4 days ago

This is a cool idea. I would install a Chrome extension that shows a score by every username on this site grading how well their expressed opinions match what subsequently happened in reality, or the accuracy of any specific predictions they've made. Some people's opinions are closer to reality than others and it's not always correlated with upvotes.

An extension of this would be to grade people on the accuracy of the comments they upvote, and use that to weight their upvotes more in ranking. I would love to read a version of HN where the only upvotes that matter are from people who agree with opinions that turn out to be correct. Of course, only HN could implement this since upvotes are private.

Comment by cootsnuck 4 days ago

The RES (Reddit Enhancement Suite) browser extension indirectly does this for me since it tracks the lifetime number of upvotes I give other users. So when I stumble upon a thread with a user with like +40 I know "This is someone whom I've repeatedly found to have good takes" (depending on the context).

It's subjective of course but at least it's transparently so.

I just think it's neat that it's kinda sorta a loose proxy for what you're talking about but done in arguably the simplest way possible.

Comment by nickff 4 days ago

I am not a Redditor, but RES sounds like it would increase the ‘echo-chamber’ effect, rather than improving one’s understanding of contributors’ calibration.

Comment by baq 4 days ago

Echo chamber of rational, thoughtful and truthful speakers is what I’m looking for in Internet forums.

Comment by jrmg 3 days ago

That’s what everyone living in an echo chamber (and especially one of their own creation) thinks they’re in.

Comment by XorNot 3 days ago

"you're in an echo chamber" is one of the most frightfully overused opinions.

Comment by ssl-3 3 days ago

The expression is an echo chamber in and of itself; it is self-fulfilling prophecy.

Comment by baq 3 days ago

I don't think I'm in any is my problem (HN is better than most, doesn't mean it's good in absolute terms...)

Comment by red-iron-pine 3 days ago

flat earth creationists would describe their colleagues the same way.

a group of them certainly is an echo chamber; why isn't your view?

Comment by ahf8Aithaex7Nai 3 days ago

He doesn't deny that his point of view forms an echo chamber.

Comment by lukan 3 days ago

"flat earth creationists would describe their colleagues the same way."

Actually they mostly don't. Lots of infighting over the real true answer .. (infinite flat earth, finite but with impassable ice walls, ..)

Comment by xmprt 3 days ago

An echo chamber is a product of your own creation. If you're willing to upvote people who disagree with your and actively seek out opposite takes to be genuinely curious about, then you're probably not in an echo chamber.

The tools for controlling your feed are reducing on social media like Instagram, TikTok, Youtube, etc., but simply saying that you follow and respect the opinions of a select group doesn't necessarily mean you're forming an echo chamber.

This is different from something like flat earth/other conspiracy theories where when confronted with opposite evidence, they aren't likely to engage with it in good faith.

Comment by mistercheph 4 days ago

it depends on if you vote based on the quality of contribution to the discussion or based on how much you agree/disagree.

Comment by miki123211 4 days ago

I don't think you can change user behavior like this.

You can give them a "venting sink" though. Instead of having a downvote button that just downvotes, have it pop up a little menu asking for a downvote reason, with "spam" and "disagree" as options. You could then weigh downvotes by which option was selected, along with an algorithm to discover "user honesty" based on whether their downvotes correlate with others or just with the people on their end of the political spectrum, a la Birdwatch.

Comment by morshu9001 3 days ago

You can't change it for other users, only for yourself, which is what the original comment about the extension said.

Comment by intended 4 days ago

Echo chambers will always result on social media. I don't think you can come up with a format that will not result in consolidated blocs.

Comment by modeless 4 days ago

Reddit's current structure very much produces an echo chamber with only one main prevailing view. If everyone used an extension like this I would expect it to increase overall diversity of opinion on the site, as things that conflict with the main echo chamber view could still thrive in their own communities rather than getting downvoted with the actual spam.

Comment by XorNot 3 days ago

Hacker News structure is identical though. Topics invite different demographics and downvotes suppress unpopular opinions. The front page shows most up voted stories. It's the same system.

Comment by modeless 3 days ago

HN's moderation and ranking is better. But there's definitely an echo chamber effect here too.

Comment by morshu9001 3 days ago

HN has some built-in ways to reduce this, like not allowing everyone to downvote everything.

Comment by PunchyHamster 4 days ago

More than having exact same system but with any random reader voting ? I'd say as long as you don't do "I disagree therefore I downvote" it would probably be more accurate than having essentially same voting system driven by randoms like reddit/HN already does

Comment by janalsncm 4 days ago

That assumes your upvotes in the past were a good proxy for being correct today. You could have both been wrong.

Comment by potato3732842 4 days ago

>This is a cool idea. I would install a Chrome extension that shows a score by every username on this site grading how well their expressed opinions match what subsequently happened in reality, or the accuracy of any specific predictions they've made.

Why stop there?

If you can do that you can score them on all sorts of things. You could make a "this person has no moral convictions and says whatever makes the number go up" score. Or some other kind of score.

Stuff like this makes the community "smaller" in a way. Like back in the old days on forums and IRC you knew who the jerks were.

Comment by leobg 4 days ago

That’s what Elon’s vision was before he ended up buying Twitter. Keep a digital track record for journalists. He wanted to call it Pravda.

(And we do have that in real life. Just as, among friends, we do keep track of who is in whose debt, we also keep a mental map of whose voice we listen to. Old school journalism still had that, where people would be reading someone’s column over the course of decades. On the internet, we don’t have that, or we have it rarely.)

Comment by TrainedMonkey 4 days ago

I long had a similar idea for stocks. Analyze posts of people giving stock tips on WSB, Twitter, etc and rank by accuracy. I would be very surprised if this had not been done a thousand times by various trading firms and enterprising individuals.

Of course in the above example of stocks there are clear predictions (HNWS will go up) and an oracle who resolves it (stock market). This seems to be a way harder problem for generic free form comments. Who resolves what prediction a particular comment has made and whether it actually happened?

Comment by miki123211 4 days ago

> Analyze posts of people giving stock tips on WSB, Twitter, etc and rank by accuracy.

Didn't somebody make an ETF once that went against the prediction of some famous CNBC stock picker, showing that it would have given you alpha in the past.

> seems to be a way harder problem for generic free form comments.

That's what prediction markets are for. People for whom truth and accuracy matters (often concentrated around the rationalist community) will often very explicitly make annual lists of concrete and quantifiable predictions, and then self-grade on them later.

Comment by Natsu 3 days ago

You probably mean Inverse Cramer:

https://finbold.com/inverse-cramer-leaves-sp-nasdaq-and-dow-...

Comment by red-iron-pine 3 days ago

Cramer is the stock picker guy. There is a well known "Cramer Effect" or "Cramer Bounce" where the stock peaks then drops hard.

Makes for great pump n dump if you're day trading and willing to ride

https://www.investopedia.com/terms/c/cramerbounce.asp

long-term his choices don't do well, so the Inverse Cramer basically says "do the opposite of this goober" and has solid returns (sorta; depends a lot on methodology, and the sole hedgefund playing that strategy shutdown)

Comment by Karrot_Kream 4 days ago

I ran across Sybil [1] the other day which tries to offer a reputation score based on correct predictions in prediction markets.

[1]: https://sybilpredicttrust.info/

Comment by mvkel 4 days ago

Out of curiosity, I built this. I extended karpathy's code and widened the date range to see what stocks these users would pick given their sentiments.

What came back were the usual suspects: GLP-1 companies and AI.

Back to the "boring but right" thesis. Not much alpha to be found

Comment by emaro 4 days ago

I like the idea and certainly would try it. Although I feel in a way this would be an anti-thesis to HN. HN tries to foster curiosity, but if you're (only) ranked by the accuracy of your predictions, this could give the incentive to always fall back to a save and boring position.

Comment by handoflixue 1 day ago

I think the most interesting predictions are the ones that sound bold and even a little bit insane at the time. I think a lot more of the people who were willing to say saying "ASI will kill us all" 20+ years ago, because they were taking a risk (and routinely ridiculed for it).

Even today, "ASI will kill us all" can be a pretty divisive declaration - hardly safe and boring.

From the couple of threads I clicked, it seemed like this LLM-driven analysis was picking up on that, too: the top comments were usually bold, and some of the worst-rated comments was the "safe and boring" declaration that nothing interesting ever really happens.

Comment by 8organicbits 4 days ago

The problem seems underspecified; what does it mean for a comment to be accurate? It would seem that comments like "the sun will rise tomorrow" would rank highest, but they aren't surprising.

Comment by smeeger 4 days ago

just because an idea is qualitative doesn't mean its invalid

Comment by prawn 4 days ago

Didn't Slashdot have something like the second point with their meta-moderation, many many years ago?

Comment by ssl-3 3 days ago

Sorta.

IIRC, when comment moderation and scoring came to Slashdot, only a random (and changing) selection of users were able to moderate.

Meta-moderation came a bit later. It allowed people to review prior moderation actions and evaluate the worth of those actions.

Those users who made good moderations were more likely to become a mod again in the future than those who made bad moderations.

The meta-mods had no idea whose actions they were evaluating, and previous/potential mods had no idea what their score was. That anonymity helped keep it honest and harder to game.

Comment by handoflixue 1 day ago

It's still that way today: if you're active, you'll be randomly given 5 moderation points occasionally, and they expire after a few days. So you have to decide which threads and comments are worth spending a "moderation point" on

Comment by ssl-3 1 day ago

How does meta-moderation work these days? I remember it being called Chips and Dip instead of /., but it's been many years since I've hung out there.

Comment by tptacek 4 days ago

'pcwalton, I'm coming for you. You're going down.

Kidding aside, the comments it picks out for us are a little random. For instance, this was an A+ predictive thread (it appears to be rating threads and not individual comments):

https://news.ycombinator.com/item?id=10703512

But there's just 11 comments, only 1 for me, and it's like a 1-sentence comment.

I do love that my unaccredited-access-to-startup-shares take is on that leaderboard, though.

Comment by kbenson 4 days ago

I noticed from reviewing my own entry (which honestly I'm surprised exists) that the idea of what it thinks constitutes a "prediction" is fairly open to interpretation, or at least that adding some nuance to a small aspect in a thread to someone else prediction counts quite heavily. I don't really view how I've participated here over the years in any way as making predictions. I actually thought I had done a fairly good job at not making predictions, by design.

Comment by n4r9 4 days ago

Yeah, I'm having to pinch myself a little here. Another slightly odd example it picked out from your history: https://news.ycombinator.com/item?id=10735398

It's a good comment, but "prescient" isn't a word I'd apply to it. This is more like a list of solid takes. To be fair there probably aren't even that many explicit, correct predictions in one month of comments in 2015.

Comment by mvkel 4 days ago

Hilariously, it seems you anticipated this happening and copyrighted your comments. Is karpathy's tool in violation of your copyright?!

Comment by tptacek 3 days ago

Karpathy, I'm coming for you next.

Comment by btbuildem 4 days ago

I've spent a weekend making something similar for my gmail account (which google keeps nagging me about being 90% full). It's fascinating to be able to classify 65k+ of emails (surprise: more than half are garbage), as well as summarize and trace the nature of communication between specific senders/recipients. It took about 50 hours on a dual RTX 3090 running Qwen 3.

My original goal was to prune the account deleting all the useless things and keeping just the unique, personal, valuable communications -- but the other day, an insight has me convinced that the safer / smarter thing to do in the current landscape is the opposite: remove any personal, valuable, memorable items, and leave google (and whomever else is scraping these repositories) with useless flotsam of newsletters, updates, subscription receipts, etc.

Comment by subscriptzero 3 days ago

I would love to do something like this, and weirdly I even have a dual 3090 home setup.

Any chance you can outline the steps/prompts/tools you used to run this?

I've been building a 2nd brain type project, that plugs into all my work places and a custom classifier has been on that list that would enhance that.

Comment by red-iron-pine 3 days ago

so then what do you do with the useful stuff?

Comment by btbuildem 2 days ago

Local archive + client for search

Comment by Rperry2174 4 days ago

One thing this really highlights to me is how often the "boring" takes end up being the most accurate. The provocative, high-energy threads are usually the ones that age the worst.

If an LLM were acting as a kind of historian revisiting today’s debates with future context, I’d bet it would see the same pattern again and again: the sober, incremental claims quietly hold up, while the hyperconfident ones collapse.

Something like "Lithium-ion battery pack prices fall to $108/kWh" is classic cost-curve progress. Boring, steady, and historically extremely reliable over long horizons. Probably one of the most likely headlines today to age correctly, even if it gets little attention.

On the flip side, stuff like "New benchmark shows top LLMs struggle in real mental health care" feels like high-risk framing. Benchmarks rotate constantly, and “struggle” headlines almost always age badly as models jump whole generations.

I bet theres many "boring but right" takes we overlook today and I wondr if there's a practical way to surface them before hindsight does

Comment by yunwal 4 days ago

"Boring but right" generally means that this prediction is already priced in to our current understanding of the world though. Anyone can reliably predict "the sun will rise tomorrow", but I'm not giving them high marks for that.

Comment by onraglanroad 4 days ago

I'm giving them higher marks than the people who say it won't.

LLMs have seen huge improvements over the last 3 years. Are you going to make the bet that they will continue to make similarly huge improvements, taking them well past human ability, or do you think they'll plateau?

The former is the boring, linear prediction.

Comment by bryanrasmussen 4 days ago

>The former is the boring, linear prediction.

right, because if there is one thing that history shows us again and again is that things that have a period of huge improvements never plateau but instead continue improving to infinity.

Improvement to infinity, that is the sober and wise bet!

Comment by p-e-w 4 days ago

The prediction that a new technology that is being heavily researched plateaus after just 5 years of development is certainly a daring one. I can’t think of an example from history where that happened.

Comment by gitremote 4 days ago

Neural network research and development existed since the 1980s at least, so at least 40 years. One of the bottlenecks before was not enough compute.

Comment by OccamsMirror 4 days ago

Perhaps the fact that you think this field is only 5 years old means you're probably not enough of an authority to comment confidently on it?

Comment by p-e-w 4 days ago

Claiming that AI in anything resembling its current form is older than 5 years is like claiming the history of the combustion engine started when an ape picked up a burning stick.

Comment by OccamsMirror 4 days ago

Your analogy fails because picking up a burning stick isn’t a combustion engine, whereas decades of neural-net and sequence-model work directly enabled modern LLMs. LLMs aren’t “five years old”; the scaling-transformer regime is. The components are old, the emergent-capability configuration is new.

Treating the age of the lineage as evidence of future growth is equivocation across paradigms. Technologies plateau when their governing paradigm saturates, not when the calendar says they should continue. Supersonic flight stalled immediately, fusion has stalled for seventy years, and neither cared about “time invested.”

Early exponential curves routinely flatten: solar cells, battery density, CPU clocks, hard-disk areal density. The only question that matters is whether this paradigm shows signs of saturation, not how long it has existed.

Comment by bryanrasmussen 4 days ago

I think this is the first time I have ever posted one of these but thank you for making the argument so well.

Comment by pixl97 4 days ago

Tiger: humans will never beat tigers because tigers are purpose built killing machines and they are just generalist --40,000BC

Comment by OccamsMirror 4 days ago

You don't think humans hunted tigers in 40,000BC?

Comment by pxc 2 days ago

I don't think it would make much sense to hunt large predators prior to the invention of agriculture, even though early humans were probably plenty smart enough to build traps capable of holding animals like tigers. But after that (less than 40k years ago, more than 10k years ago), I'd bet it was a common-ish thing for humans to try to hunt predators that preyed upon their livestock.

Tigers are terrifying, though. I think it takes extreme or perverse circumstances to make hunting a tiger make any sense at all. And even then, traps and poisons make more sense than stalking a tiger to kill it!

Comment by bigiain 4 days ago

LaunchHN: Announcing Twoday, our new YC backed startup coming out of stealth mode.

We’re launching a breakthrough platform that leverages frontier scale artificial intelligence to model, predict, and dynamically orchestrate solar luminance cycles, unlocking the world’s first synthetic second sunrise by Q2 2026. By combining physics informed multimodal models with real time atmospheric optimisation, we’re redefining what’s possible in climate scale AI and opening a new era of programmable daylight.

Comment by rznicolet 4 days ago

You joke, but, alas, there is a _real_ company kinda trying to do this. Reflect Orbital[1] wants to set up space mirrors, so you can have daytime at night for your solar panels! (Various issues, like around light pollution and the fact that looking up at the proposed satellites with binoculars could cause eye damage... don't seem to be on their roadmap.) This is one idea that's going to age badly whether or not they actually launch anything, I suspect.

Battery tech is too boring, but seems more likely to manage long-term effectiveness.

[1] https://www.reflectorbital.com

Comment by mananaysiempre 4 days ago

Reflecting sunlight from orbit is an idea that had been talked about for a couple of decades even before Znamya-2[1] launched in 1992. The materials science needed to unfurl large surfaces in space seems to be very difficult, whether mirrors or sails.

[1] https://en.wikipedia.org/wiki/Znamya_(satellite)

Comment by yunwal 4 days ago

> Are you going to make the bet that they will continue to make similarly huge improvements

Sure yeah why not

> taking them well past human ability,

At what? They're already better than me at reciting historical facts. You'd need some actual prediction here for me to give you "prescience".

Comment by janalsncm 4 days ago

“At what?” is really the key question here.

A lot of the press likes to paint “AI” as a uniform field that continues to improve together. But really it’s a bunch of related subfields. Once in a blue moon a technique from one subfield crosses over into another.

“AI” can play chess at superhuman skill. “AI” can also drive a car. That doesn’t mean Waymo gets safer when we increase Stockfish’s elo by 10 points.

Comment by Terr_ 4 days ago

I imagine "better" in this case depends on how one scores "I don't know" or confident-sounding falsehoods.

Failures aren't just a ratio, they're a multi-dimensional shape.

Comment by onraglanroad 4 days ago

At every intellectual task.

They're already better than you at reciting historical facts. I'd guess they're probably better at composing poems (they're not great but far better than the average person).

Or you agree with me? I'm not looking for prescience marks, I'm just less convinced that people really make the more boring and obvious predictions.

Comment by yunwal 4 days ago

What is an intellectual task? Once again, there's tons of stuff LLMs won't be trained on in the next 3 years. So it would be trivial to just find one of those things and say voila! LLMs aren't better than me at that.

I'll make one prediction that I think will hold up. No LLM-based system will be able to take a generic ask like "hack the nytimes website and retrieve emails and password hashes of all user accounts" and do better than the best hackers and penetration testers in the world, despite having plenty of training data to go off of. It requires out-of-band thinking that they just don't possess.

Comment by hathawsh 4 days ago

I'll take a stab at this: LLMs currently seem to be rather good at details, but they seem to struggle greatly with the overall picture, in every subject.

- If I want Claude Code to write some specific code, it often handles the task admirably, but if I'm not sure what should be written, consulting Claude takes a lot of time and doesn't yield much insight, where as 2 minutes with a human is 100x more valuable.

- I asked ChatGPT about some political event. It mirrored the mainstream press. After I reminded it of some obvious facts that revealed a mainstream bias, it agreed with me that its initial answer was wrong.

These experiences and others serve to remind me that current LLMs are mostly just advanced search engines. They work especially well on code because there is a lot of reasonably good code (and tutorials) out there to train on. LLMs are a lot less effective on intellectual tasks that humans haven't already written and published about.

Comment by medler 4 days ago

> it agreed with me that its initial answer was wrong.

Most likely that was just its sycophancy programming taking over and telling you what you wanted to hear

Comment by blibble 4 days ago

> They're already better than you at reciting historical facts.

so is a textbook, but no-one argues that's intelligent

Comment by janalsncm 4 days ago

To be clear, you are suggesting “huge improvements” in “every intellectual task”?

This is unlikely for the trivial reason that some tasks are roughly saturated. Modest improvements in chess playing ability are likely. Huge improvements probably not. Even more so for arithmetic. We pretty much have that handled.

But the more substantive issue is that intellectual tasks are not all interconnected. Getting significantly better at drawing hands doesn’t usually translate to executive planning or information retrieval.

Comment by yunwal 4 days ago

There’s plenty of room to grow for LLMs in terms of chess playing ability considering chess engines have them beat by around 1500 ELO

Comment by janalsncm 4 days ago

Sorry, I now realize this thread is about whether LLMs can improve on tasks and not whether AI can. Agreed there’s a lot of headroom for LLMs, less so for AI as a whole.

Comment by autoexec 4 days ago

> They're already better than you at reciting historical facts.

They're better at regurgitating historical facts than me because they were trained on historical facts written by many humans other than me who knew a lot more historical facts. None of those facts came from an LLM. Every historical fact that isn't entirely LLM generated nonsense came from a human. It's the humans that were intelligent, not the fancy autocomplete.

Now that LLMs have consumed the bulk of humanity's written knowledge on history what's left for it to suck up will be mainly its own slop. Exactly because LLMs are not even a little bit intelligent they will regurgitate that slop with exactly as much ignorance as to what any of it means as when it was human generated facts, and they'll still spew it back out with all the confidence they've been programed to emulate. I predict that the resulting output will increasingly shatter the illusion of intelligence you've so thoroughly fallen for so far.

Comment by irishcoffee 4 days ago

> At what? They're already better than me at reciting historical facts.

I wonder what happens if you ask deepseek about Tiananmen Square…

Edit: my “subtle” point was, we already know LLMs censor history. Trusting them to honestly recite historical facts is how history dies. “The victor writes history” has never been more true. Terrifying.

Comment by Dylan16807 4 days ago

> Edit: my “subtle” point was, we already know LLMs censor history. Trusting them to honestly recite historical facts is how history dies.

I mean, that's true but not very relevant. You can't trust a human to honestly recite historical facts either. Or a book.

> “The victor writes history” has never been more true.

I don't see how.

Comment by OrderlyTiamat 3 days ago

> The former is the boring, linear prediction.

Surely you meant the latter? The boring option follows previous experience. No technology has ever not reached a plateau, except for evolution itself I suppose, till we nuke the planet.

Comment by Dylan16807 4 days ago

LLMs aren't getting better that fast. I think a linear prediction says they'd need quite a while to maybe get "well past human ability", and if you incorporate the increases in training difficulty the timescale stretches wide.

Comment by SubiculumCode 4 days ago

Perhaps a new category, 'highest risk guess but right the most often'. Those is the high impact predictions.

Comment by arjie 4 days ago

Prediction markets have pretty much obviated the need for these things. Rather than rely on "was that really a hot take?" you have a market system that rewards those with accurate hot takes. The massive fees and lock-up period discourage low-return bets.

Comment by Karrot_Kream 4 days ago

FWIW Polymarket (which is one of the big markets) has no lock-up period and, for now while they're burning VC coins, no fees. Otherwise agree with your point though.

Comment by gammarator 4 days ago

Can’t wait for the brave new world of individuals “match fixing” outcomes on Polymarket.

Comment by Karrot_Kream 4 days ago

As opposed to the current world of brigading social media threads to make consensus look like it goes your way and then getting journalists scraping by on covering clickbait to cover your brigading as fact?

Comment by Gravityloss 4 days ago

something like correctness^2 x novel information content rank?

Comment by Gravityloss 3 days ago

Actually now thinking about it, incorrect information has negative value so the metric should probably reflect that.

Comment by jimbokun 4 days ago

The one about LLMs and mental health is not a prediction but a current news report, the way you phrased it.

Also, the boring consistent progress case for AI plays out in the end of humans as viable economic agents requiring a complete reordering of our economic and political systems in the near future. So the “boring but right” prediction today is completely terrifying.

Comment by p-e-w 4 days ago

“Boring” predictions usually state that things will continue to work the way they do right now. Which is trivially correct, except in cases where it catastrophically isn’t.

So the correctness of boring predictions is unsurprising, but also quite useless, because predicting the future is precisely about predicting those events which don’t follow that pattern.

Comment by adam1996TL 4 days ago

[dead]

Comment by simianparrot 4 days ago

Instead of "LLM's will put developers out of jobs" the boring reality is going to be "LLM's are a useful tool with limited use".

Comment by jimbokun 4 days ago

That is at odds with predicting based on recent rates of progress.

Comment by schoen 4 days ago

I predict that, in 2035, 1+1=2. I also predict that, in 2045, 2+2=4. I also predict that, in 2055, 3+3=6.

By 2065, we should be in possession of a proof that 0+0=0. Hopefully by the following year we will also be able to confirm that 0*0=0.

(All arithmetic here is over the natural numbers.)

Comment by johnfn 4 days ago

This suggests that the best way to grade predictions is some sort of weighting of how unlikely they were at the time. Like, if you were to open a prediction market for statement X, some sort of grade of the delta between your confidence of the event and the “expected” value, summed over all your predictions.

Comment by jacquesm 4 days ago

Exactly, that's the element that is missing. If there are 50 comments against and one pro and that pro has it in the longer term then that is worth noticing, not when there are 50 comments pro and you were one of the 'pros'.

Going against the grain and turning out right is far more valuable than being right consistently when the crowd is with you already.

Comment by mcmoor 4 days ago

Yeah a simple of total points of pro comments vs total points of con comments may be simple and exact enough to simulate a prediction market. I don't know if it can be included in the prompt or better to be vibecoded in directly.

Comment by 0manrho 4 days ago

It's because algorithmic feeds based on "user engagement" rewards antagonism. If your goal is to get eyes on content, being boring, predictable and nuanced is a sure way to get lost in the ever increasing noise.

Comment by xpe 4 days ago

> One thing this really highlights to me is how often the "boring" takes end up being the most accurate.

Would the commenter above mind sharing the method behind of their generalization? Many people would spot check maybe five items -- which is enough for our brains to start to guess at potential patterns -- and stop there.

On HN, when I see a generalization, one of my mental checklist items is to ask "what is this generalization based on?" and "If I were to look at the problem with fresh eyes, what would I conclude?".

Comment by copperx 4 days ago

Is this why depressed people often end up making the best predictions?

In personal situations there's clearly a self fulfilling prophecy going on, but when it comes to the external world, the predictions come out pretty accurate.

Comment by mistercheph 4 days ago

A majority don't seem to be predictions about the future, and it seems to mostly like comments that give extended air to what was then and now the consensus viewpoint, e.g. the top comment from pcwalton the highest scored user: https://news.ycombinator.com/item?id=10657401

> (Copying my comment here from Reddit /r/rust:) Just to repeat, because this was somewhat buried in the article: Servo is now a multiprocess browser, using the gaol crate for sandboxing. This adds (a) an extra layer of defense against remote code execution vulnerabilities beyond that which the Rust safety features provide; (b) a safety net in case Servo code is tricked into performing insecure actions. There are still plenty of bugs to shake out, but this is a major milestone in the project.

Comment by hackthemack 4 days ago

I noticed the Hall of Fame grading of predictive comments has a quirk? It grades some comments about if they came true or not, but in the grading of comment to the article

https://news.ycombinator.com/item?id=10654216

The Cannons on the B-29 Bomber "accurate account of LeMay stripping turrets and shifting to incendiary area bombing; matches mainstream history"

It gave a good grade to user cstross but to my reading of the comment, cstross just recounted a bit of old history. The evaluation gave cstross for just giving a history lesson or no?

Comment by karpathy 4 days ago

Yes I noticed a few of these around. The LLM is a little too willing to give out grades for comments that were good/bad in a bit more general sense, even if they weren't making strong predictions specifically. Another thing I noticed is that the LLM has a very impressive recognition of the various usernames and who they belong to, and I think shows a little bit of a bias in its evaluations based on the identity of the person. I tuned the prompt a little bit based on some low-hanging fruit mistakes but I think one can most likely iterate it quite a bit further.

Comment by patcon 4 days ago

I think you were getting at this, but in case others didn't know: cstross is a famous sci-fi author and futurist :)

Comment by pierrec 4 days ago

"the distributed “trillions of Tamagotchi” vision never materialized"

I begrudgingly accept my poor grade.

Comment by LeroyRaz 4 days ago

I am surprised the author thought the project passed quality control. The LLM reviews seem mostly false.

Looking at the comment reviews on the actual website, the LLM seems to have mostly judged whether it agreed with the takes, not whether they came true, and it seems to have an incredibly poor grasp of it's actual task of accessing whether the comments were predictive or not.

The LLM's comment reviews are of often statements like "correctly characterized [program language] as [opinion]."

This dynamic means the website mostly grades people on having the most confirmist take (the take most likely to dominate the training data, and be selected for in the LLM RL tuning process of pleasing the average user).

Comment by LeroyRaz 4 days ago

Examples: tptacek gets an 'A' for his comment on DF which the LLM claiming that the user "captured DF's unforgiving nature, where 'can't do x or it crashes is just another feature to learn' which remained true until it was fixed on ..."

Link to LLM review: https://karpathy.ai/hncapsule/2015-12-02/index.html#article-....

So the LLM is praising a comment as describing DF as unforgiving (a characterization of the present then, not a statement about the future). And worse, it seems like tptacek may in fact be implying the opposite of the future (e.g., x will continue to crash when it was eventually fixed.)

Here is the original comment: " tptacek on Dec 2, 2015 | root | parent | next [–]

If you're not the kind of person who can take flaws like crashes or game-stopping frame-rate issues and work them into your gameplay, DF is not the game for you. It isn't a friendly game. It can take hours just to figure out how to do core game tasks. "Don't do this thing that crashes the game" is just another task to learn."

Note: I am paraphrasing the LLM review, as the website is also poorly designed, with one unable to select the text of the LLM review!

N.b., this choice of comment review is not overly cherry picked. I just scanned the "best commentators" and tptacek was number two, with this particular egregiously unrelated-to-prediction LLM summary given as justifying his #2 rating.

Comment by hathawsh 4 days ago

Are you sure? The third section of each review lists the “Most prescient” and “Most wrong” comments. That sounds exactly like what you're looking for. For example, on the "Kickstarter is Debt" article, here is the LLM's analysis of the most prescient comment. The analysis seems accurate and helpful to me.

https://karpathy.ai/hncapsule/2015-12-03/index.html#article-...

  phire

  > “Oculus might end up being the most successful product/company to be kickstarted… > Product wise, Pebble is the most successful so far… Right now they are up to major version 4 of their product. Long term, I don't think they will be more successful than Oculus.”

  With hindsight:

  Oculus became the backbone of Meta’s VR push, spawning the Rift/Quest series and a multi‑billion‑dollar strategic bet.
  Pebble, despite early success, was shut down and absorbed by Fitbit barely a year after this thread.

  That’s an excellent call on the relative trajectories of the two flagship Kickstarter hardware companies.

Comment by xpe 4 days ago

Until someone publishes a systematic quality assessment, we're grasping at anecdotes.

It is unfortunate that the questions of "how well did the LLM do?" and "how does 'grading' work in this app?" seem to have gone out the window when HN readers see something shiny.

Comment by voidhorse 4 days ago

Yes. And the article is a perfect example of the dangerous sort of automation bias that people will increasingly slide into when it comes to LLMs. I realize Karpathy is sort of incentivized toward this bias given his career, but he doesn't even spend a single sentence even so much as suggesting that the results would need further inspection, or that they might be inaccurate.

The LLM is consulted like a perfect oracle, flawless in its ability to perform a task, and it's left at that. Its results are presented totally uncritically.

For this project, of course, the stakes are nil. But how long until this unfounded trust in LLMs works its way into high stakes problems? The reign of deterministic machines for the past few centuries has ingrained a trust in the reliability of machines in us that should be suspended when dealing with an inherently stochastic device like an LLM.

Comment by karmickoala 4 days ago

I get what you're saying, but looking at some examples, they look kinda of right, but there are a lot of misleading facts sprinkled, making his grading wrong. It is useful, but I'd suggest to be careful to use this to make decisions.

Some of the issues could be resolved with better prompting (it was biased to always interpret every comment through the lens of predictions) and LLM-as-a-judge, but still. For example, Anthropic's Deep Research prompts sub-agents to pass original quotes instead of paraphrasing, because it can deteriorate the original message.

Some examples:

  Swift is Open Source (2015)
  ===========================

sebastiank123 got a C-, and was quoted by the LLM as saying:

  > “It could become a serious Javascript competitor due to its elegant syntax, the type safety and speed.”

Now, let's read his full comment:

  > Great news! Coding in Swift is fantastic and I would love to see it coming to more platforms, maybe even on servers. It could become a serious Javascript competitor due to its elegant syntax, the type safety and speed.

I don't interpret it as a prediction, but a desire. The user is praising Swift. If it went the server way, perhaps it could replace JS, to the user's wishes. To make it even clearer, if someone asked the commenter right after: "Is that a prediction? Are you saying Swift is going to become a serious Javascript competitor?" I don't think its answer would be 'yes' in this context.

  How to be like Steve Ballmer (2015)
  ===================================
  
  Most wrong
  ----------
  
  >     corford (grade: D) (defending Ballmer’s iPhone prediction):
  >         Cited an IDC snapshot (Android 79%, iOS 14%) and suggested Ballmer was “kind of right” that the iPhone wouldn’t gain significant share.
  >         In 2025, iOS is one half of a global duopoly, dominates profits and premium segments, and is often majority share in key markets. Any reasonable definition of “significant” is satisfied, so Ballmer’s original claim—and this defense of it—did not age well.

Full quote:

  > And in a funny sort of way he was kind of right :) http://www.forbes.com/sites/dougolenick/2015/05/27/apple-ios...
  > Android: 79% versus iOS: 14%

"Any reasonable definition of 'significant' is satisfied"? That's not how I would interpret this. We see it clearly as a duopoly in North America. It's not wrong per se, but I'd say misleading. I know we could take this argument and see other slices of the data (premium phones worldwide, for instance), I'm just saying it's not as clear cut as it made it out to be.

  > volandovengo (grade: C+) (ill-equipped to deal with Apple/Google):
  >  
  >     Wrote that Ballmer’s fast-follower strategy “worked great” when competitors were weak but left Microsoft ill-equipped for “good ones like Apple and Google.”
  >     This is half-true: in smartphones, yes. But in cloud, office suites, collaboration, and enterprise SaaS, Microsoft became a primary, often leading competitor to both Apple and Google. The blanket claim underestimates Microsoft’s ability to adapt outside of mobile OS.

That's not what the user was saying:

  > Despite his public perception, he's incredibly intelligent. He has an IQ of 150.
  > 
  > His strategy of being a fast follower worked great for Microsoft when it had crappy competitors - it was ill equipped to deal with good ones like Apple and Google.

He was praising him and he did miss opportunities at first. OC did not make predictions of his later days.

  [Let's Encrypt] Entering Public Beta (2015)
  ===========================================

  - niutech: F "(endorsed StartSSL and WoSign as free options; both were later distrusted and effectively removed from the trusted ecosystem)"

Full quote:

  > There are also StartSSL and WoSign, which provide the A+ certificates for free (see example WoSign domain audit: https://www.ssllabs.com/ssltest/analyze.html?d=checkmyping.c...)
  > 
  > pjbrunet: F (dismissed HTTPS-by-default arguments as paranoid, incorrectly asserted ISPs had stopped injection, and underestimated exactly the use cases that later moved to HTTPS)

Full quote:

  > "We want to see HTTPS become the default."
  > 
  > Sounds fine for shopping, online banking, user authorizations. But for every website? If I'm a blogger/publisher or have a brochure type of website, I don't see point of the extra overhead.
  > 
  > Update: Thanks to those who answered my question. You pointed out some things I hadn't considered. Blocking the injection of invisible trackers and javascripts and ads, if that's what this is about for websites without user logins, then it would help to explicitly spell that out in marketing communications to promote adoption of this technology. The free speech angle argument is not as compelling to me though, but that's just my opinion.

I thought the debate was useful and so did pjbrunet, per his update.

I mean, we could go on, there are many others like these.

Comment by andy99 4 days ago

I haven’t looked at the output yet, but came here to say,LLM grading is crap. They miss things, they ignore instructions, bring in their own views, have no calibration and in general are extremely poorly suited to this task. “Good” LLM as a judge type products (and none are great) use LLMs to make binary decisions - “do these atomic facts match yes / no” type stuff - and aggregate them to get a score.

I understand this is just a fun exercise so it’s basically what LLMs are good at - generating plausible sounding stuff without regard for correctness. I would not extrapolate this to their utility on real evaluation tasks.

Comment by jacquesm 4 days ago

Predictions are only valuable when they're actually made ahead of the knowledge becoming available. A man will walk on mars by 2030 is falsifiable, a man will walk on mars is not. A lot of these entries have very low to no predictive value or were already known at the time, but just related. Would be nice if future 'judges' put in more work to ensure quality judgments.

I would grade this article B-, but then again, nobody wrote it... ;)

Comment by MBCook 4 days ago

#272, I got a B+! Neat.

It would be very interesting to see this applied year after year to see if people get better or worse over time in the accuracy of their judgments.

It would also be interesting to correlate accuracy to scores, but I kind of doubt that can be done. Between just expressing popular sentiment and the first to the post people getting more votes for the same comment than people who come later it probably wouldn’t be very useful data.

Comment by pjc50 4 days ago

#250, but then I wasn't trying to make predictions for a future AI. Or anyone else, really. Got a high score mostly for status quo bias, e.g. visual languages going nowhere and FPGAs remain niche.

Comment by embedding-shape 4 days ago

Yeah, it be much more interesting to see the people who made (at the time) outrageous claims, but they came to be true, rather than a list of people who could state that the status quo most likely would stay as it is.

Comment by nixpulvis 4 days ago

Quick give everyone colors to indicate their rank here and ban anyone with a grade less than C-.

Seriously, while I find this cool and interesting, I also fear how these sorts of things will work out for us all.

Comment by DonHopkins 4 days ago

I'd love to see an "Annie Hall" analysis of hn posts, for incidents where somebody says something about some piece of software or whatever, and the person who created it replies, like Marshall McLuhan stepping out from behind a sign in Annie Hall.

https://www.youtube.com/watch?v=vTSmbMm7MDg

Comment by Sophira 4 days ago

It somehow feels right to see what GPT-5 thinks of the article titled "Machine learning works spectacularly well, but mathematicians aren’t sure why" and its discussion: https://karpathy.ai/hncapsule/2015-12-04/index.html#article-...

Comment by xpe 3 days ago

Many people are impressed by this, and I can see why. Still, this much isn't surprising: the Karpathy + LLM combo can deliver quickly. But there are downsides of blazing speed.

If you dig in, there are substantial flaws in the project's analysis and framing, such as the definition of a prediction, assessing comments, data quality overall, and more. Go spelunking through the comments here and notice people asking about methodology and checking the results.

Social science research isn't easy; it requires training, effort, and patience. I would be very happy if Karpathy added a Big Flashing Red Sign to this effect. It would raise awareness and focus community attention on what I think are the hardest and most important aspects of this kind of project: methodology, rigor, criticism, feedback, and correction.

Comment by Tossrock 4 days ago

So where do I collect my prize for this 2015 comment? https://news.ycombinator.com/item?id=9882217

Comment by johncolanduoni 4 days ago

Never call a man happy until he is dead. Also I don’t think your argument generalizes well - there are plenty of private research investment bubbles that have popped and not reached their original peaks (e.g. VR).

Comment by Tossrock 4 days ago

It wasn't a generalized argument, though, it was a specific one, about AI.

Comment by johncolanduoni 4 days ago

Okay, but the only part that’s specific to AI (that the companies investing the money are capturing more value than they’re putting into it) is now false. Even the hyperscalers are not capturing nearly the value they’re investing, though they’re not using debt to finance it. OpenAI and Anthropic are of course blowing through cash like it’s going out of style, and if investor interest drops drastically they’ll likely need to look to get acquired.

Comment by xpe 3 days ago

Here is one sentence from the referenced prediction:

> I don't think there will be any more AI winters.

This isn't enough to qualify as a testable prediction, in the eyes of people that care about such things, because there is no good way to formulate a resolution criteria for a claim that extends indefinitely into the future. See [1] for a great introduction.

[1]: https://www.astralcodexten.com/p/prediction-market-faq

Comment by moultano 4 days ago

Notable how this is only possible because the website is a good "web citizen." It has urls that maintain their state over a decade. They contain a whole conversation. You don't have to log in to see anything. The value of old proper websites increases with our ability to process them.

Comment by chrisweekly 4 days ago

Yes! See "Cool URIs Don't Change"^1 by Sir TBL himself.

1. https://www.w3.org/Provider/Style/URI

Comment by dietr1ch 4 days ago

> because the website is a good "web citizen." It has urls that maintain their state over a decade.

It's a shame that maintaining the web is so hard that only a few websites are "good citizens". I wish the web was a -bit- way more like git. It should be easier to crawl the web and serve it.

Say, you browse and get things cached and shared, but only your "local bookmarks" persist. I guess it's like pinning in IPFS.

Comment by moultano 4 days ago

Yes, I wish we could serve static content more like bittorent, where your uri has an associate hash, and any intermediate router or cache could be an equivalent source of truth, with the final server only needing to play a role if nothing else has it.

It is not possible right now to make hosting democratized/distributed/robust because there's no way for people to donate their own resources in a seamless way to keeping things published. In an ideal world, the internet archive seamlessly drops in to serve any content that goes down in a fashion transparent to the user.

Comment by oncallthrow 4 days ago

This is IPFS

Comment by shpx 4 days ago

In my experience from the couple of times I clicked an IPFS link years ago, it loaded for a long time and never actually loaded anything, failing the first "I wish we could serve static content" part.

If you make it possible for people to donate bandwidth you might just discover no one wants to.

Comment by dietr1ch 3 days ago

I think that many are able to toss a almost permanently online raspberry pi in their homes and that's probably enough for sustaining a decently good distributed CAS network that shares small text files.

The wanting to is in my mind harder. How do you convince people that having the network is valuable enough? It's easy to compare it with the web backed by few feuds that offer for the most part really good performance, availability and somewhat good discovery.

Comment by 4 days ago

Comment by drdec 4 days ago

> It's a shame that maintaining the web is so hard that only a few websites are "good citizens"

It's not hard actually. There is a lack of will and forethought on the part of most maintainers. I suspect that monetization also plays a role.

Comment by DANmode 4 days ago

Let Reddit and friends continue to out themselves for who they are.

Keeps the spotlight on carefully protected communities like this one.

Comment by jeffbee 4 days ago

There are things that you have to log in to see, and the mods sometimes move conversations from one place to another, and also, for some reason, whole conversations get reset to a single timestamp.

Comment by embedding-shape 4 days ago

> and the mods sometimes move conversations from one place to another

This only manipulates the children references though, never the item ID itself. So if you have the item ID of an item (submission, comment, poll, pollItem), it'll be available there as long as moderators don't remove it, which happens very seldom.

Comment by latexr 4 days ago

> for some reason, whole conversations get reset to a single timestamp.

What do you mean?

Comment by embedding-shape 4 days ago

Submissions put in the second-chance pool briefly appear (sometimes "again") on the frontpage, and the conversation timestamps are reset so it appears like they were written after the second-chance submission, not before.

Comment by Y_Y 4 days ago

I never noticed that. What a weird lie!

I suppose they want to make the comments seem "fresh" but it's a deliberate misrepresentation. You could probably even contrive a situation where it could be damaging, e.g. somebody says something before some relevant incident, but the website claims they said it afterwards.

Comment by embedding-shape 4 days ago

I think the reason is much simpler than that. Resetting the timestamp lets them easily resurface things on the frontpage, because the current time - posting time delta becomes a lot smaller, so it's again ranked higher. And avoiding adding a special case, lets the rest of the codebase work exactly like it was before, basically just need to add a "set submission time to now" function and you get the rest for free.

But, I'm just guessing here based on my own refactoring experience through the years, may be a completely different reason, or even by mistake? Who knows? :)

Comment by jeffbee 4 days ago

There is some action that moderators can take that throws one of yesterday's articles back on the front page and when that happens all the comments have the same timestamp.

Comment by consumer451 4 days ago

I believe that this is called "the second chance pool." It is a bit strange when it unexpectedly happens to one's own post.

Comment by scosman 4 days ago

Anyone have a branch that I can run to target my own comments? I'd love to see where I was right and where I was off base. Seems like a genuinely great way to learn about my own biases.

Comment by xpe 4 days ago

I appreciate your intent, but this tool needs a lot of work -- maybe an entire redesign -- before it would be suitable for the purpose you seek. See discussion at [1].

Besides, in my experience, only a tiny fraction of HN comments can be interpreted as falsifiable predictions.

Instead I would recommend learning about calibration [2] and ways to improve one's calibration, which will likely lead you into literature reviews of cognitive biases and what we can do about them. Also, jumping into some prediction markets (as long as they don't become too much of a distraction) is good practice.

[1]: https://news.ycombinator.com/item?id=46223959

[2]: https://www.lesswrong.com/w/calibration

Comment by GaggiX 4 days ago

I think the most fun thing is to go to: https://karpathy.ai/hncapsule/hall-of-fame.html

And scroll down to the bottom.

Comment by 4 days ago

Comment by MBCook 4 days ago

It’s interesting, if you go down near the bottom you see some people with both A’s and D’s.

According to the ratings for example, one person both had extremely racist ideas but also made a couple of accurate points about how some tech concepts would evolve.

Comment by brian_spiering 4 days ago

That is interesting because of the Halo effect. There is a cognitive bias that if a person is right in one area, they will be right in another unrelated area.

I try to temper my tendency to believe the Halo effect with Warren Buffett's notion of the Circle of Competence; there is often a very narrow domain where any person can be significantly knowledgeable.

Comment by xpe 1 day ago

> A circle of competence is the subject area which matches a person's skills or expertise. The concept was developed by Warren Buffett and Charlie Munger as what they call a mental model, a codified form of business acumen, concerning the investment strategy of limiting one's financial investments in areas where an individual may have limited understanding or experience, while concentrating in areas where one has the greatest familiarity. -Wikipedia

> I try to temper my tendency to believe the Halo effect with Warren Buffett's notion of the Circle of Competence; there is often a very narrow domain where any person can be significantly knowledgeable. (commenter above)

Putting aside Buffett in particular, I'm wary of claims like "there is often a very narrow domain where any person can be significantly knowledgeable". How often? How narrow of a domain? Doesn't it depend on arbitrary definitions of what qualifies as a category? Is this a testable theory? Is it a predictive theory? What does empirical research and careful analysis show?

Putting that aside, there are useful mathematical ways to get an idea of some of the backing concepts without making assumptions about people, culture, education, etc. I'll cook one up now...

Start with 70K balls split evenly across seven colors: red, orange, yellow, green, blue, indigo, and violet. 1,000 show up demanding balls. So we mix them up and randomly distribute 10 balls to every person. What does the distribution tend to look like? What particulars would you tune and/or definitions would you choose to make this problem "sort of" map to something sort of like assessing the diversity of human competence across different areas?

Note the colored balls example assumes independence between colors (subjects or skills or something). But in real life, there are often causally significant links between skills. For example, general reasoning ability improves performance in a lot of other subjects.

Then a goat exploded, because I don't know how to end this comment gracefully.

Comment by 1 day ago

Comment by bgwalter 4 days ago

"If LLMs are watching, humans will be on their best behavior". Karpathy, paraphrasing Larry Ellison.

The EU may give LLM surveillance an F at some point.

Comment by 4 days ago

Comment by gen6acd60af 4 days ago

Commenters of HN:

Your past thoughts have been dredged up and judged.

For each $TOPIC, you have been awarded a grade by GPT-5.1 Thinking.

Your grade is based on OpenAI's aligned worldview and what OpenAI's blob of weights considers Truth in 2025.

Did you think well, netizen?

Are you an Alpha or a Delta-Minus?

Where will the dragnet grading of your online history happen next?

Comment by HighGoldstein 3 days ago

Of all the people on the entire internet, I would hope HN posters understand best that anything and everything posted online already has and also will at some point be used in such ways.

Comment by popinman322 4 days ago

It doesn't look like the code anonymizes usernames when sending the thread for grading. This likely induces bias in the grades based on past/current prevailing opinions of certain users. It would be interesting to see the whole thing done again but this time randomly re-assigning usernames, to assess bias, and also with procedurally generated pseudonyms, to see whether the bias can be removed that way.

I'd expect de-biasing would deflate grades for well known users.

It might also be interesting to use a search-grounded model that provides citations for its grading claims. Gemini models have access to this via their API, for example.

Comment by ProllyInfamous 4 days ago

What a human-like critizicism of human-like behavior.

I [as a human] also do the same thing when observing others in IRL and forum interactions. Reputation matters™

----

A further question is whether a bespoke username could influence the bias of a particular comment (e.g. A username of something like HatesPython might influence the interpretation of that commenter's particular perception of the Python coding language, which might actually be expressing positivity — the username's irony lost to the AI?).

Comment by khafra 4 days ago

You can't anonymize comments from well-known users, to an LLM: https://gwern.net/doc/statistics/stylometry/truesight/index

Comment by WithinReason 4 days ago

That's an overly strong claim, an LLM could also be used to normalise style

Comment by wetpaws 4 days ago

How would you possibly grade comments if you change them?

Comment by strken 4 days ago

Extract the concrete predictions, evaluate them as true/false/indeterminate, and grade the user on the number of true vs false?

Comment by Natsu 3 days ago

This doesn't even seem to look at "predictions" if you dig into what it actually did. Looking at my own example (#210 on https://karpathy.ai/hncapsule/hall-of-fame.html with 4 comments), very little of what I said could be construed as "predictions" at all.

I got an A for commenting on DF saying that I had not personally seen save corruption and listing weird bugs. It's true that weird bugs have long been a defining feature of DF, but I didn't predict it would remain that way or say that save corruption would never be a big thing, just that I hadn't personally seen it.

Another A for a comment on Google wallet just pointing out that users are already bad at knowing what links to trust. Sure, that's still true (and probably will remain true until something fundamental changes), but it was at best half a prediction as it wasn't forward looking.

Then something on hospital airships from the 1930s. I pointed out that one could escape pollution, I never said I thought it would be a big thing. Airships haven't really ever been much of a thing, except in fiction. Maybe that could change someday, but I kinda doubt it.

Then lastly there was the design patent famously referred to as the "rounded corner" patent. It dings me for simplifying it to that label, despite my actual statements being that yes, there's more, but just minor details like that can be sufficient for infringement. But the LLM says I'm right about ties to the Samsung case and still oversimplifying it. Either way, none of this was really a prediction to begin with.

Comment by koakuma-chan 4 days ago

You don’t need comments, just facts in them to see if they’re accurate.

Comment by karmickoala 4 days ago

I understand the exercise, but I think it should have a disclaimer, some of the LLM reviews are showing a bias and when I read the comments they turned out not to be as bad as the LLM made them. As this hits the front page, some people will only read the title and not the accompanying blog post, losing all of the nuance.

That said, I understand the concept and love what you did here. By this being exposed to the best disinfectant, I hope it will raise awareness and show how people and corporations should be careful about its usage. Now this tech is accessible to anyone, not only big techs, in a couple of hours.

It also shows how we should take with a grain of salt the result of any analysis of such scale by a LLM. Our private channels now and messages on software like Teams and Slack can be analyzed to hell by our AI overlords. I'm probably going to remove a lot of things from cloud drives just in case. Perhaps online discourse will deteriorate to more inane / LinkedIn style content.

Also, I like that your prompt itself has some purposefully leaked bias, which shows other risks—¹for instance, "fsflover: F", which may align the LLM to grade worse the handles that are related to free software and open source).

As a meta concept of this, I wonder how I'll be graded by our AI overlords in the future now that I have posted something dismissive of it.

¹Alt+0151

Comment by ComputerGuru 4 days ago

Looking at the results and the prompt, I would tweak the prompt to

* ignore comments that do not speculate on something that was unknown or had not achieved consensus as of the date of yyyy-mm-dd

* at the same time, exclude speculations for which there still isn’t a definitive answer or consensus today

* ignore comments that speculate on minor details or are stating a preference/opinion on a subjective matter

* it is ok to generate an empty list of users for a thread if there are no comments meeting the speculation requirements laid out above

* etc

Comment by losvedir 4 days ago

Agreed. I feel like it's more just a collection of good comments. It doesn't surprise me to see tptacek, patio11, etc there. I think the "prediction" aspect is under weighted.

But it reminds me that I miss Manishearth's comments! What ever happened to him? I recall him being a big rust contributor. I'd think he'd be all over the place, with rust's adoption since then. I also liked tokenadult. interesting blast from the past.

Comment by xpe 4 days ago

Good points. To summarize: for a given comment, one presumably must downselect to the ones that can reasonably be interpreted as forecasts. I see some indicators that the creator of the project (despite his amazing reputation) skated over this part.

Comment by janalsncm 4 days ago

You would also need to exclude “predictions” for things which already happened at the time they were predicted.

Comment by abhinav_sk 3 days ago

What's interesting is that the hindsight it has now is not going to be what it has in 10 years either. Some of the most wrong and most prescient comments could switch as stuff unfolds. In a way some could both still be wrong and right just at different points in time.

Comment by alister 4 days ago

> https://karpathy.ai/hncapsule/2015-12-24/index.html#article-...

I wonder why ChatGPT refused to analyze it?

The HN article was "Brazil declares emergency after 2,400 babies are born with brain damage" but the page says "No analysis available".

Comment by bspammer 4 days ago

My guess is that it’s because there’s a lot of very negative comments about Brazil in that article. Trying to grade people for their opinions on a topic like that gets into dangerous territory.

Comment by 4 days ago

Comment by bretpiatt 3 days ago

10 Years Ago, December 11, 2015 - Introducing Open AI -- very meta: https://karpathy.ai/hncapsule/2015-12-11/index.html#article-...

The company has changed and it seems the mission has as well.

Comment by bspammer 3 days ago

Yes very funny to see their own model betray them like this:

> The original “non‑profit, open, patents shared” promise now reads almost like an alternate timeline. Today OpenAI is a capped‑profit entity with a massive corporate partner, closed frontier models, and an aggressive product roadmap.

Comment by intheitmines 4 days ago

Interesting that for the "December 16 2015 geohot is building Comma" it graded geohot's comments on the thread as only B

Comment by snowwrestler 4 days ago

Presumably because of how things went with Comma since then.

Comment by jeffnappi 4 days ago

The analysis of the 2015 article about Triplebyte is fascinating [1]. Particularly the Awards section.

1. https://karpathy.ai/hncapsule/2015-12-08/index.html#article-...

Comment by 4 days ago

Comment by lapcat 4 days ago

Does anyone else think that HN engages in far too much navel-gazing? Nothing gets upvotes faster than a HN submission about HN.

Comment by dang 4 days ago

It's true that meta is the crack of internet forums, so we, er, crack down on it quite a bit. That's a longstanding view: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

Alternate metaphor: evil catnip - https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

But yesterday's thread and this one are clearly exceptions—far above the median. https://news.ycombinator.com/item?id=46212180 was particularly incredible I think!

Comment by latexr 4 days ago

I love it when you share some insight about HN or internet communication for which you have relevant searches at the ready to explanations of the concept.

A personal favourite is “the contrarian dynamic”.

Do you have a list of those at the ready or do you just remember them? If you feel like sharing, what’s your process and is there a list of those you’d make public?

I imagine having one would be useful, e.g. for onboarding someone like tomhow, though that doesn’t really happen often.

Comment by dang 4 days ago

I just remember them. Or forget them!

The process is simply that moderation is super repetitive, so eventually certain pathways get engraved in one's memory. A lot of the time, though, I can't quite remember one of these patterns and I'm unable to dig up my past comments about it. That's annoying, in that particular way when your brain can feel something's there but is unable to retrieve it.

Comment by Terretta 4 days ago

Well, you're #24 in this article's hall of fame, and the LLM thinks your moderation views stood the test of time. Perhaps it can already retrieve them for you.

Comment by dang 4 days ago

There are so many interesting points and patterns that I've just lost track of over the years.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

Comment by DonHopkins 4 days ago

Dang, posting links to searches for your own comments is so meta, no matter the topic, but even more meta when about meta crack. I love how the first hit of meta crack is this, your own message about meta crack.

Comment by dang 3 days ago

I'm higher than my supplier!

Comment by DonHopkins 2 days ago

You never meta crack you wouldn't hit.

Comment by yellow_lead 4 days ago

It's weird that HN viewers are interested in HN

Comment by CamperBob2 4 days ago

As moultano suggests, this is likely because most other websites make it completely impossible to navel-gaze. We can't possibly give the HN admins too much praise and credit for their commitment to open and stable availability of legacy data.

Comment by 4 days ago

Comment by swalsh 4 days ago

I have never felt less confident in the future than I do in 2025... and it's such a stark contrast. I guess if you split things down the middle, AI probably continues to change the world in dramatic ways but not in the all or nothing way people expect.

A non trivial amount of people get laid off, likely due to a finanical crisis which is used as an excuse for companies scale up use of AI. Good chance the financial crisis was partly caused by AI companies, which ironically makes AI cheaper as infra is bought up on the cheap (so there is a consolidation, but the bountiful infra keeps things cheap). That results in increased usage (over a longer period of time). and even when the economy starts coming back the jobs numbers stay abismal.

Politics are divided into 2 main groups, those who are employed, and those who are retired. The retired group is VERY large, and has alot of power. They mostly care about entitlements. The employed age people focus on AI which is making the job market quite tough. There are 3 large political forces (but 2 parties). The Left, the Right, and the Tech Elite. The left and the right both hate AI, but the tech elite though a minority has outsized power in their tie breaker role. The age distributions would surprise most. Most older people are now on the left, and most younger people are split by gender. The right focuses on limiting entitlements, and the left focuses on growing them by taxing the tech elite. The right maintains power by not threatening the tech elite.

Unlike the 20th century America is a more focused global agenda. We're not policing everyone, just those core trading powers. We have not gone to war with China, China has not taken over Taiwan.

Physical robotics is becoming a pretty big thing, space travel is becoming cheaper. We have at least one robot on an astroid mining it. The yield is trivial, but we all thought it was neat.

Energy is much much greener, and you wouln't have guessed it... but it was the data centers that got us there. The Tech elite needed it quickly, and used the political connections to cut red tape and build really quickly.

Comment by 1121redblackgo 4 days ago

We do not currently have the political apparatus in place to stop the dystopian nightmares depicted in movies and media. They were supposed to be cautionary tales. Maybe they still can be, but there are basically zero guardrails in non-progressive forms of government to prevent massive accumulations of power being wielded in ways most of the population disapproves of.

Comment by samdoesnothing 4 days ago

Thats the whole point of democracy, to prevent the ruling parties from doing wildly unpopular things. Unlike a dictatorship, where they can do anything (including good things, that otherwise wouldn't happen in a democracy).

I know that "X is destroying democracy, vote for Y" has been a prevalent narrative lately, but is there any evidence that it's true? I get that it's death by a thousand cuts, or "one step at a time" as they say.

Comment by xpe 3 days ago

> I know that "X is destroying democracy, vote for Y" has been a prevalent narrative lately, but is there any evidence that it's true? I get that it's death by a thousand cuts, or "one step at a time" as they say.

I suggest reading [1], [2], and [3]. From there, you'll probably have lots of background to pose your own research questions. According to [4], until you write about something, your thinking will be incomplete, and I tend to agree nearly all of the time.

[1]: https://en.wikipedia.org/wiki/Democratic_backsliding

[2]: https://hub.jhu.edu/2024/08/12/anne-applebaum-autocracy-inc/

[3]: https://carnegieendowment.org/research/2025/08/us-democratic...

[4]: "Neuroscientists, psychologists and other experts on thinking have very different ideas about how our brains work, but, as Levy writes: “no matter how internal processes are implemented, (you) need to understand the extent to which the mind is reliant upon external scaffolding.” (2011, 270) If there is one thing the experts agree on, then it is this: You have to externalise your ideas, you have to write. Richard Feynman stresses it as much as Benjamin Franklin. If we write, it is more likely that we understand what we read, remember what we learn and that our thoughts make sense." - Sönke Ahrens. How to Take Smart Notes_ - Sonke Ahrens (p. 30)

Comment by Karrot_Kream 4 days ago

Are you in the wrong thread?

Comment by smugma 4 days ago

I believe that the GPA calculation is off, maybe just for F's.

I scrolled to the bottom of the hall of fame/shame and saw that entry #1505 and 3 F's and a D, with an average grade of D+ (1.46).

No grade better than a D shouldn't average to a D+, I'd expect it to be closer to a 0.25.

Comment by dschnurr 4 days ago

Nice! Something must be in the air – last week I built a very similar project using the historical archive of all-in podcast episodes: https://allin-predictions.pages.dev/

Comment by sanex 4 days ago

I'll use this as evidence supporting my continued demand for a Friedberg only spinoff.

Comment by 4 days ago

Comment by DeathArrow 4 days ago

>I believe it is quite possible and desirable to train your forward future predictor given training and effort.

That's interesting. I wouldn't have thought that a decent generic forward future predictor would be possible.

Comment by GaggiX 4 days ago

I was reading the Anki article on 2015-12-13, and the best prediction was by markm248 saying: "Remember that you read it here first, there will be a unicorn built on the concept of SRS"

They were right, Duolingo.

Comment by mtlynch 4 days ago

Duolingo existed for a while at that point and was already valued at $500M by end of 2015.

Comment by GaggiX 4 days ago

It became a unicorn in December 2019 tho, 4 years later.

Comment by 3 days ago

Comment by sigmar 4 days ago

Gotta auto grade every HN comment for how good it is at predicting stock market movement then check what the "most frequently correct" user is saying about the next 6 months.

Comment by Rychard 4 days ago

As the saying goes, "past performance is not indicative of future results"

Comment by xpe 4 days ago

I hope this is a joke.

Forecasting and the meta-analysis of forecasters is fairly well studied. [1] is a good place to start.

[1]: https://en.wikipedia.org/wiki/Superforecaster

Comment by sigmar 4 days ago

> The conclusion was that superforecasters' ability to filter out "noise" played a more significant role in improving accuracy than bias reduction or the efficient extraction of information.

>In February 2023, Superforecasters made better forecasts than readers of the Financial Times on eight out of nine questions that were resolved at the end of the year.[19] In July 2024, the Financial Times reported that Superforecasters "have consistently outperformed financial markets in predicting the Fed's next move"

>In particular, a 2015 study found that key predictors of forecasting accuracy were "cognitive ability [IQ], political knowledge, and open-mindedness".[23] Superforecasters "were better at inductive reasoning, pattern detection, cognitive flexibility, and open-mindedness".

I'm really not sure what you want me to take from this article? Do you contend that everyone has the same competency at forecasting stock movements?

Comment by xpe 3 days ago

> I'm really not sure what you want me to take from this article?

I linked to the Wikipedia page as a way of pointing to the book Superforecasters by Tetlock and Gardner. If forecasting interests you, I recommend using it as a jumping off point.

> Do you contend that everyone has the same competency at forecasting stock movements?

No, and I'm not sure why you are asking me this. Superforecasters does not make that claim.

> I'm really not sure what you want me to take from this article?

If you read the book and process and internalize its lessons properly, I predict you will view what you wrote above in a different different light:

> Gotta auto grade every HN comment for how good it is at predicting stock market movement then check what the "most frequently correct" user is saying about the next 6 months.

Namely, you would have many reasons to doubt such a project from the outset and would pursue other more fruitful directions.

Comment by nomel 3 days ago

> I realized that this task is actually a really good fit for LLMs

I've found the opposite, since these models still fail pretty wildly at nuance. I think it's a conceptual "needle in the haystack sort of problem.

A good test is to find some thread where there's a disagreement and have it try to analyze the discussion. It will usually strongly misrepresent what was being said, by each side, and strongly align with one user, missing the actual divide that's causing the disagreement (a needle).

Comment by gowld 3 days ago

As always, which model versions did you use in your test?

Comment by nomel 3 days ago

Claude Opus 4.5, Gemini 3 Pro, ChatGPT 5.1. Haven't tried ChatGPT 5.2.

It requires that the discussion has nuance, to see the failure. Gemini is, by far the, worst at this (which fits my suspicion that they heavily weighted reddit posts).

I don't think this is all that strange though. The human, on one side of the argument, is also missing the nuance, which is the source of the conflict. Is there a belief that AI has surpassed the average human, with conversational nuance!?

Comment by neilv 4 days ago

> I spent a few hours browsing around and found it to be very interesting.

This seems to be the result of the exercise? No evaluation?

My concern is that, even if the exercise is only an amusing curiosity, many people will take the results more seriously than they should, and be inspired to apply the same methods to products and initiatives that adversely affect people's lives in real ways.

Comment by cootsnuck 4 days ago

> My concern is that, even if the exercise is only an amusing curiosity, many people will take the results more seriously than they should, and be inspired to apply the same methods to products and initiatives that adversely affect people's lives in real ways.

That will most definitely happen. We already have known for awhile that algorithmic methods have been applied "to products and initiatives that adversely affect people's lives in real ways", for awhile: https://www.scientificamerican.com/blog/roots-of-unity/revie...

I guess the question is if LLMs for some reason will reinvigorate public sentiment / pressure for governing bodies to sincerely take up the ongoing responsibility of trying to lessen the unique harms that can be amplified by reckless implementation of algorithms.

Comment by SequoiaHope 4 days ago

This is great! Now I want to run this to analyze my own comments and see how I score and whether my rhetoric has improved in quality/accuracy over time!

Comment by godelski 4 days ago

  > I was reminded again of my tweets that said "Be good, future LLMs are watching". You can take that in many directions, but here I want to focus on the idea that future LLMs are watching. Everything we do today might be scrutinized in great detail in the future because doing so will be "free". A lot of the ways people behave currently I think make an implicit "security by obscurity" assumption. But if intelligence really does become too cheap to meter, it will become possible to do a perfect reconstruction and synthesis of everything. LLMs are watching (or humans using them might be). Best to be good.

Can we take a second and talk about how dystopian this is? Such an outcome is not inevitable, it relies on us making it. The future is not deterministic, the future is determined by us. Moreso, Karpathy has significantly more influence on that future than your average HN user.

We are doing something very *very* wrong if we are operating under the belief that this future is unavoidable. That future is simply unacceptable.

Comment by jacquesm 4 days ago

Given the quality of the judgment I'm not worried, there is no value here.

To properly execute this idea rather than to just toss it off without putting in the work to make it valuable is exactly what irritates me about a lot of AI work. You can be 900 times as productive at producing mental popcorn, but if there was value to be had here we're not getting it, just a whiff of it. Sure, fun project. But I don't feel particularly judged here. The funniest bit is the judgment on things that clearly could not yet have come to pass (for instance because there is an exact date mentioned that we have not yet reached). QA could be better.

Comment by godelski 4 days ago

I think you're missing the actual problem.

I'm not worried about this project but instead harvesting, analyzing all that data and deanonymizing people.

That's exactly what Karparthy is saying. He's not being shy about it. He said "behave because the future panopticon can look into the past". Which makes the panopticon effectively exist now.

  Be good, future LLMs are watching
  ...
  or humans using them might be

That's the problem. Not the accuracy of this toy project, but the idea of monitoring everyone and their entire history.

The idea that we have to behave as if we're being actively watched by the government is literally the setting of 1984 lol. The idea that we have to behave that way now because a future government will use the Panopticon to look into the past is absolutely unhinged. You don't even know what the rules of that world will be!

Did we forget how unhinged the NSA's "harvest now, decrypt later" strategy is? Did we forget those giant data centers that were all the news talked about for a few weeks?

That's not the future I want to create, is it the one you want?

To act as if that future is unavoidable is a failure of *us*

Comment by jacquesm 4 days ago

Yes, you are right, this is a real problem. But it really is just a variation on 'the internet never forgets', for instance in relation to teen behavior online. But AI allows for weaponization of such information. I wish the wannabe politicians of 2050 much good luck with their careers, they are going to be the most boring people available.

Comment by godelski 4 days ago

The internet never forgets but you could be anonymous. Or at least somewhat. But that's getting harder and harder

If such a thing isn't already possible (it is to a certain extent), we are headed towards a point where your words alone will be enough to fingerprint you.

Comment by jacquesm 4 days ago

Stylometry killed that a long time ago. There was a website, stylometry.net that coupled HN accounts based on text comparison and ranked the 10 best candidates. It was incredibly accurate and allowed id'ing a bunch of people that had gotten banned but that came back again. Based on that I would expect that anybody that has written more than a few KB of text to be id'able in the future.

Comment by godelski 4 days ago

You need a person's text with their actual identity to pull that off. Normally that's pretty hard, especially since you'll get different formats. Like I don't write the same way on Twitter as HN. But yeah, this stuff has been advancing and I don't think it is okay.

Comment by jacquesm 4 days ago

The AOL scandal pretty much proved that anonymity is a mirage. You may think you are anonymous but it just takes combining a few unrelated databases to de-anonymize you. HN users think they are anonymous but they're not, they drop factoids all over the place about who they are. 33 bits... it is one of my recurring favorite themes and anybody in the business of managing other people's data should be well aware of the risks.

Comment by godelski 3 days ago

I think you're being too conspiracy theorist here by making everything black and white.

Besides, the main problem of how difficult it is to deanonymize, not if possible.

Privacy and security both have to perfect defense. For example, there's no passwords that are unhackable. There are only passwords that cannot be hacked with our current technology, budgets, and lifetime. But you could brute force my HN password, it would just take billions of years.

The same distinction it's important here. My threat model on HN doesn't care if you need to spend millions of dollars nor thousands of hours to deanonymize me. My handle is here to discourage that and to allow me to speak more freely about certain topics. I'm not trying to hide from nation states, I'm trying to hide from my peers in AI and tech. So I can freely discuss my opinions, which includes criticizing my own community (something I think everyone should do! Be critical of the communities we associate with). And moreso I want people to consider my points on their merit alone, not on my identity nor status.

If I was trying to hide from nation states I'd do things very very differently, such as not posting on HN.

I'm not afraid of my handle being deanonymized, but I still think we should recognize the dangers of the future we are creating.

By oversimplifying you've created the position that this is a lost cause, as if we already lost and that because we lost we can't change. There are multiple fallacies here. The future has yet to be written.

If you really believe it is deterministic then what is the point to anything? To have desires it opinions? Are were just waiting to see which algorithm wins out? Or are we the algorithms playing themselves out? If it's deterministic wouldn't you be happy if the freedom algorithm won and this moment is an inflection in your programming? I guess that's impossible to say in an objective manner but I'd hope that's how it plays out

Comment by jacquesm 2 days ago

I have enough industry insights to prove that your data is floating out there, unprotected, in plain text and that those that are not bound by the law are making very good use of it. Every breach leaks more bits about you.

This is the main driver behind the targeted scams that ordinary people now have to deal with. It is why people get voice calls from loved ones in distress, why they get 'tech support' calls that aim to take over their devices and why lots of people have lost lots of money.

If you think I am too conspiracy theorist by making everything black and white that is maybe simply because we live different lives and have different experience.

Comment by acyou 4 days ago

I call this the "judgement day" scenario. I would be interested if there is some science fiction based on this premise.

If you believe in God of a certain kind, you don't think that being judged for your sins is unacceptable or even good or bad in itself, you consider it inevitable. We have already talked it over for 2000 years, people like the idea.

Comment by godelski 4 days ago

You'll be interested in Clarke's "The Light of Other Days". Basically a wormhole where people can look back at any point in time, ending all notion of privacy.

God is different though. People like God because they believe God is fair and infallible. That is not true for machines nor men. Similarly I do not think people will like this idea. I'm sure there will be some but look at people today and their religious fever. Or look in the past. They'll want it, but it is fleeting. Cults don't last forever, even when they're governments. Sounds like a great way to start wars. Every one will be easily justified

https://en.wikipedia.org/wiki/The_Light_of_Other_Days

Comment by JetSetWilly 3 days ago

It would be great to run this on a collection of interesting threads over different periods and not just one snapshot. For example, the thread from the day Trump got elected in 2016, the thread from the day of brexit and so on. Those are the times when people make many passionate predictions about how the future will play out, be good to see them retroactively scored.

Comment by rkuykendall-com 3 days ago

Assuming this keeps running, I suppose we just have to wait about a year.

Comment by anshulbhide 4 days ago

I often summarise HN comments (which are sometimes more insightful than the original article) using an LLM. Total game-changer.

Comment by NooneAtAll3 4 days ago

UX feedback: I wish clicking on a new thread scrolled right side to the top again

reading from the end isn't really useful, y'know :)

Comment by dw_arthur 4 days ago

Reading this I feel the same sense of dread I get watching those highly choreographed Chinese holiday drone shows.

Comment by jeffbee 4 days ago

I'm delighted to see that one of the users who makes the same negative comments on every Google-related post gets a "D" for saying Waymo was smoke and mirrors. Never change, I guess.

Comment by bediger4000 4 days ago

LLMs are watching (or humans using them might be). Best to be good.

Shades of Roko's Basilisk!

Comment by ambicapter 4 days ago

More like a Panopticon. As the parenthesis notes, this is just as bad when humans are the final link in the eyeball chain.

Comment by Bjartr 4 days ago

Neat, I got a shout-out. Always happy to share the random stuff I remember exists!

Comment by apparent 4 days ago

> And then when you navigate over to the Hall of Fame, you can find the top commenters of Hacker News in December 2015, sorted by imdb-style score of their grade point average.

Now let's make a Chrome extension that subtly highlights these users' comments when browsing HN.

Comment by bbcisking 4 days ago

Why not rank ESP for each HN user, with evidence?

Comment by exasperaited 4 days ago

> Everything we do today might be scrutinized in great detail in the future because it will be "free".

s/"free"/stolen/

The bit about college courses for future prediction was just silly, I'm afraid: reminds me of how Conan Doyle has Sherlock not knowing Earth revolves around the Sun. Almost all serious study concerns itself with predicting, modelling and influence over the future behaviour of some system; the problem is only that people don't fucking listen to the predictions of experts. They aren't going to value refined, academic general-purpose futurology any more than they have in the past; it's not even a new area of study.

Comment by pnt12 4 days ago

On the site itself:

it's great that this was produced in 1h with 60$. This is amazing to create small utilities, explore your curiosity, etc.

But the site is also quite confusing and messy. OK for a vibe coded experiment, sure, but wouldn't be for a final product. But I fear we're gonna see more and more of this. Big companies downsizing their tech departments and embracing vibe coded. Comparing to inflation, shrinkflation and skimpflation/ enshittification , will we soon adopt some word for this? AIflation? LLMflation?

And how will this comment score in a couple of years? :)

Comment by slg 4 days ago

This is a perfect example of the power and problems with LLMs.

I took the narcissistic approach of searching for myself. Here's a grade of one of my comments[1]:

>slg: B- (accurate characterization of PH’s “networking & facade” feel, but implicitly underestimates how long that model can persist)

And here's the actual comment I made[2]:

>And maybe it is the cynical contrarian in me, but I think the "real world" aspect of Product Hunt it what turned me off of the site before these issues even came to the forefront. It always seemed like an echo chamber were everyone was putting up a facade. Users seemed more concerned with the people behind products and networking with them than actually offering opinions of what was posted.

>I find the more internet-like communities more natural. Sure, the top comment on a Show HN is often a critique. However I find that more interesting than the usual "Wow, another great product from John Developer. Signing up now." or the "Wow, great product. Here is why you should use the competing product that I work on." that you usually see on Product Hunt.

I did not say nor imply anything about "how long that model can persist", I just said I personally don't like using the site. It's a total hallucination to claim I was implying doom for "that model" and you would only know that if you actually took the time to dig into the details of what was actually said, but the summary seems plausible enough that most people never would.

The LLM processed and analyzed a huge amount of data in a way that no human could, but the single in-depth look I took at that analysis was somewhere between misleading and flat out wrong. As I said, a perfect example of what LLMs do.

And yes, I do recognize the funny coincidence that I'm now doing the exact thing I described as the typical HN comment a decade ago. I guess there is a reason old me said "I find that more interesting".

[1] - https://karpathy.ai/hncapsule/2015-12-18/index.html#article-...

[2] - https://news.ycombinator.com/item?id=10761980

Comment by npunt 3 days ago

I'm not so sure; that may not have been what you meant, but that doesn't mean it's not what others read into it. The broader context is HN is a startup forum and one of the most common discussion patterns is 'I don't like it' that is often a stand-in for 'I don't think it's viable as-is'. Startups are default dead, after all.

With that context, if someone were to read your comment and be asked 'does this person think the product's model is viable in the long run' I think a lot of people would respond 'no'.

Comment by slg 2 days ago

And this is a perfect example of how some people respond to LLMs, bending over backwards to justify the output like we are some kids around a Ouija board.

The LLM isn't misinterpreting the text, it's just representing people who misinterpreted the text isn't the defense you seem to think it is.

Comment by npunt 2 days ago

And your response here is a perfect example of confidently jumping to conclusions on what someone's intent is... which is exactly what you're saying the LLM did to you.

I scoped my comment specifically around what a reasonable human answer would be if one were asked the particular question it was asked with the available information it had. That's all.

Btw I agree with your comment that it hallucinated/assumed your intent! Sorry I did not specify that. This was a bit of a 'play stupid games win stupid prizes' prompt by the OP. If one asks an imprecise question one should not expect a precise answer. The negative externality here is reader's takeaways are based on false precision. So is it the fault of the question asker, the readers, the tool, or some mix? The tool is the easiest to change, so probably deserves the most blame.

I think we'd both agree LLMs are notoriously overly-helpful and provide low confidence responses to things they should just not comment on. That to me is the underlying issue - at the very least they should respond like humans do not only in content but in confidence. It should have said it wasn't confident about its response to your post, and OP should have thus thrown its response out.

Rarely do we have perfect info, in regular communications we're always making assumptions which affect our confidence in our answers. The question is what's the confidence threshold we should use? This is the question to ask before the question of 'is it actually right?', which is also an important question to ask, but one I think they're a lot better at than the former.

Fwiw you can tell most LLMs to update its memory to always give you a confidence score 0.0-1.0. This helps tremendously, it's pretty darn accurate, it's something you can program thresholds around, and I think it should be built in to every LLM response.

The way I see it, LLMs have lots and lots of negative externalities that we shouldn't bring into this world (I'm particularly sensitive to the effects on creative industries), and I detest how they're being used so haphazardly, but they do have some uses we also shouldn't discount and figure out how to improve on. The question is where are we today in that process?

The framework I use to think about how LLMs are evolving is that of transitioning mediums. Like movies started as a copy/paste of stage plays before they settled into the medium and understand how to work along the grain of its strengths & weaknesses to create new conventions. Speech & text are now transitioning into LLMs. What is the grain we need to go along?

My best answer is the convention LLMs need to settle into is explicit confidence, and each question asked of them should first be a question of what the acceptable confidence threshold is for such a question. I think every question and domain will have different answers for that, and we should debate and discuss that alongside any particular answer.

Comment by 0xWTF 4 days ago

Now: compared to what? Is there a better source than HN? How's it compare to Reddit or lobsters?

Compared to what happens next? Does tptacek's commentary become market signal equivalent to the Fed Chair or the BLS labor and inflation reports?

Comment by tptacek 4 days ago

What makes you think it already isn't?

Comment by jacquesm 4 days ago

You've made me billions by now! Thank you...

Comment by tgtweak 4 days ago

Cool - now make it analyze all of those and come up with the 10 commandments of commenting factually and insightfully on HN posts...

Comment by npunt 4 days ago

One of the few use cases for LLMs that I have high hopes for and feel is still under appreciated is grading qualitative things. LLMs are the first tech (afaik) that can do top-down analysis of phenomena in a manner similar to humans, which means a lot of important human use cases that are judgement-oriented can become more standardized, faster, and more readily available.

For instance, one of the unfortunate aspects of social media that has become so unsustainable and destructive to modern society is how it exposes us to so many more people and hot takes than we have ability to adequately judge. We're overwhelmed. This has led to conversation being dominated by really shitty takes and really shitty people, who rarely if ever suffer reputational consequence.

If we build our mediums of discourse with more reputational awareness using approaches like this, we can better explore the frontier of sustainable positive-sum conversation at scale.

Implementation-wise, the key question is how do we grade the grader and ensure it is predictable and accurate?

Comment by Arodex 3 days ago

This is wrong, just look at this comment here:

https://news.ycombinator.com/item?id=46222523

LLM can't grade reliably human text. It doesn't understand it.

Comment by mvdtnz 4 days ago

Do we need more AI slop on the front page?

Comment by gaigalas 4 days ago

I am not sure if we need a karma precog analogue.

It does seem better than just upvotes and downvotes though.

Comment by collinmcnulty 4 days ago

> But if intelligence really does become too cheap to meter, it will become possible to do a perfect reconstruction and synthesis of everything. LLMs are watching (or humans using them might be). Best to be good.

I cannot believe this is just put out there unexamined of any level of "maybe we shouldn't help this happen". This is complete moral abdication. And to be clear, being "good" is no defense. Being good often means being unaligned with the powerful, so being good is often the very thing that puts you in danger.

Comment by doctoboggan 4 days ago

I've had the same though as Karpathy over the past couple of months/years. I don't think it's good, exciting, or something to celebrate, but I also have no idea how to prevent it.

I would read his "Best to be good." as a warning or reminder that everything you do or say online will be collected and analyzed by an "intelligence". You can't count on hiding amongst the mass of online noise. Imagine if someone were to collect everything you've written or uploaded to the internet and compiled it into a long document. What sort of story would that tell about who you are? What would a clever person (or LLM) be able to do with that document?

If you have any ideas on how to stop everyone from building the torment nexus, I am willing to listen.

Comment by karpathy 4 days ago

Thank you

Comment by collinmcnulty 4 days ago

This is my plan at least

1. Don't build the Torment Nexus yourself. Don't work for them and don't give them your money.

2. When people you know say they're taking a new job to work at Torment Nexus, act like that's super weird, like they said they're going to work for the Sinaloa cartel. Treat rich people working on the Torment Nexus like it's cringe to quote them.

3. Get hostile to bots. Poison the data. Use AdNauseum and Anubis.

4. Give your non-tech friends the vague sense that this stuff is bad. Some might want to listen more, but most just take their sense of what's cool and good from people they trust in the area.

Comment by Teever 4 days ago

Do you have any suggestions on how to interact online with people who work at Torment Nexus?

Comment by 4 days ago

Comment by magic_hamster 4 days ago

This seems to me like a form of social engineering, or to some extent, being a bit insufferable. And, rest assured it will not result in anything useful. The only result of this is that you will alienate your friends and colleagues if they work for an employer you don't like.

Comment by tensor 4 days ago

I think we need to stop focusing only on the AI aspect of this. Yes, it's an important component to the sort of mass surveillance system you're describing, but it's not the only component. The internet, advertising, privacy, all of these are integral to this outcome.

While I don't have a general solution, I do believe that the solution will need to be multi-faceted and address multiple aspects of the technologies enabling this. My first step would be for society to re-evaluate and shift its views towards information, both locally and internationally.

For example, if you proposed to get rid of all physical borders between countries, everyone would likely be aghast. Obviously there are too many disagreements and conflicting value sets between countries for this to happen. Yet in the west we think nothing have having no digital information borders, despite the fact that the lack of them in part enables this data collection and other issues such as election interference. Yes, erecting firewalls is extremely unpalatable to people in the west, but is almost certainly part of the solution on the national level. Countries like China long ago realized this, though they also use firewalls as a means of control, not just protection (it doesn't have to be this way).

But within countries we also need to shift away from a default position of "I have the right to say whatever I want so therefore I should" and into one of "I'm not putting anything online unless I'm willing to have my employer, parents, literally everyone, read it." Also, we need to systematically attack and dismantle the advertising industry. That industry is one of the single biggest driving factors behind the extreme systematic collection and correlation of data on people. Advertising needs to switch to a "you come to me" approach not a "I'm coming to you" approach.

Comment by flir 4 days ago

That's not my department, says Wernher von Braun.

Don't know why that just popped into my head.

Comment by consumer451 4 days ago

It's nice that the LLM-enabled panopticon still cannot find this very recent related media, [0] but my silly mind can. It is actually an interesting commentary from a non-tech point of view. This is how the rest of the world feels:

Anyway, back to work trying to make my millions using Opus and such.

[0] https://old.reddit.com/r/funny/comments/1pj5bg9/al_companies...

Comment by thatguy0900 4 days ago

Well the companies that facilitate this have found themselves in a position where if they go down they take the US economy with them, so the maybe this shouldn't happen thing is a moot point. At least we know this stuff is in stable, secure hands though, like how the palantir ceo does recorded interviews while obviously blasted out of his mind on drugs.

Comment by cootsnuck 4 days ago

To be clear...prior to this recent explosive interest in LLMs, this was already true. Snowden was over 10 years ago.

We can't start clutching our pearls now as if programmatic mass surveillance hasn't been running on all cylinders for over 20 years.

Don't get me wrong, we should absolutely care about this, everyone should. I'm just saying any vague gestures at imminent privacy-doom thanks to LLMs is liable to be doing some big favors of inadvertently sanitizing the history of prior (and still) egregious privacy offenders.

I'm just suggesting more "Yes and" and less "pearl clutching" is all.

Comment by panarky 4 days ago

Who, exactly, is the "we" who you see "pearl clutching" instead of "yes and-ing"?

Comment by Teever 4 days ago

The time for discussion and action on this was over a 15 years ago when Snowden and the NSA with their Utah data centre was a big story.

Governments around the world have profiles on people and spiders that quietly amass the data that continuously updates those profiles.

It's just a matter of time before hardware improves and we see another holocaust scale purge facilitated by robots.

Surveillance capitalism won.

Comment by Uptrenda 4 days ago

dude, please do this for every year until today. This idea is actually amazing. If you need more money for API credits im sure people here could help donate.

Comment by siliconc0w 4 days ago

Random Bets for 2035:

* Nvidia GPUs will see heavy competition and most chat-like use-cases switching to cheaper models and inference-specific-silicon but will be still used on the high end for critical applications and frontier science

* Most Software and UIs will be primarily AI-generated. There will be no 'App Stores' as we know them.

* ICE Cars will become niche and will be largely been replaced with EVs, Solar will be widely deployed and will be the dominate source of power

* Climate Change will be widely recognized due to escalating consequences and there will be lots of efforts in mitigations (e.g, Climate Engineering, Climate-resistant crops, etc).

Comment by pu_pe 4 days ago

The infamous Dropbox comment might turn out to be right in 10 more years, when LLMs might just build an entire application from scratch for you.

Comment by 4 days ago

Comment by rafaelmn 4 days ago

I'd take the other side for most of these - Nvidia one is too vague (some could argue it's already seeing "heavy competition" from Google and other players in the space) but something more concrete - I doubt they will fall below 50% market share.

Comment by xattt 4 days ago

You’re about 20 days short or 345 days late for this HN tradition. ;)

Comment by huflungdung 4 days ago

[dead]

Comment by throwaway984393 4 days ago

[dead]

Comment by artur44 4 days ago

Interesting experiment. Using modern LLMs to retroactively grade decade-old HN discussions is a clever way to measure how well our collective predictions age. It’s impressive how little time and compute it now takes to analyze something that would’ve required days of manual reading. My only caution is that hindsight grading can overvalue outcomes instead of reasoning — good reasoning can still lead to wrong predictions. But as a tool for calibrating forecasting and identifying real signal in discussions, this is a very cool direction.