Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7
Posted by simonw 21 hours ago
Comments
Comment by ericpauley 20 hours ago
I'd say the example actually does (vaguely) suggest that Qwen might be overfitting to the Pelican.
Comment by wongarsu 18 hours ago
But in terms of making something physically plausible, Opus certainly got a lot closer
Comment by kmacdough 18 hours ago
Comment by BobbyJo 14 hours ago
Comment by userbinator 14 hours ago
Comment by doobiedowner 16 hours ago
Comment by itake 11 hours ago
I think getting the models to generate realistic and proportional objects is a much harder and more important challenge (remember when the models would generate 6 fingers?).
Comment by tpm 8 hours ago
Comment by kube-system 13 hours ago
Comment by tecoholic 17 hours ago
Comment by mejutoco 17 hours ago
Comment by irthomasthomas 16 hours ago
Comment by monocasa 14 hours ago
Comment by jbellis 19 hours ago
Comment by kristianp 17 hours ago
Comment by yorwba 16 hours ago
This doesn't hold if some models trained on the benchmark and some didn't, but you can fix this by deliberately fine-tuning all models for the benchmark before comparing them. For more in-depth discussion of this, see https://mlbenchmarks.org/11-evaluating-language-models.html#...
Comment by __natty__ 18 hours ago
Comment by javawizard 18 hours ago
Comment by ericd 18 hours ago
Comment by spwa4 5 hours ago
Qwen 3.6 35b a3b: 34 tok/sec
Qwen 3.5 27b: 10 tok/sec
Qwen 3.5 35b a3b: doesn't support image input
Comment by mentalgear 19 hours ago
Comment by simonw 19 hours ago
For a delightful moment this morning I thought I might have finally caught a model provider cheating by training for the pelican, but the flamingo convinced me that wasn't the case.
Comment by furyofantares 18 hours ago
Comment by simonw 18 hours ago
Comment by furyofantares 18 hours ago
The Qwen one looks like a 3-tailed, broken-winged, beakless (I guess? Is that offset white thing a beak? Or is it chewing on a pelican feather like it's a piece of straw?) monstrosity not sitting on the seat, with its one foot off the pedal (the other chopped off at the knee) of a malmanufactured wheel that has bonus spokes that are longer than the wheel.
But yeah, it does have a bowtie and sunglasses that you didn't ask for! Plus it says "<3 Flamingo on a Unicycle <3", which perhaps resolves all ambiguity.
Comment by bigyabai 16 hours ago
Comment by monksy 17 hours ago
Comment by akavel 18 hours ago
Comment by solarkraft 12 hours ago
Comment by simonw 9 hours ago
Comment by prodigycorp 18 hours ago
Comment by dude250711 18 hours ago
Comment by gistscience 7 hours ago
Comment by luyu_wu 15 hours ago
It's directly stated in the post that the entire test is meant to be humorous, not taken seriously; only that it has vaguely tracked model performance to date. The author also writes that this new result shows that trend has broken.
Comment by stephbook 17 hours ago
https://x.com/JeffDean/status/2024525132266688757
If anything, the disastrous Opus 4.7 pelican shows us they don't pelicanmaxx
Comment by bitwize 17 hours ago
Comment by BoorishBears 17 hours ago
Comment by wood_spirit 17 hours ago
Comment by big-chungus4 7 hours ago
Comment by 999900000999 1 hour ago
God bless these open models. Claude can’t subsidize its users forever, and no one can afford $1,200 a month for LLM credits.
Comment by bdangubic 1 hour ago
you'd be surprised....
Comment by 999900000999 1 hour ago
Will Claude consistently be able to deliver more value than rolling your own?
I think the future is a bunch of just-good-enough models, which is what most people need. Not top-of-the-line models that require millions in hardware to run.
Comment by bdangubic 23 minutes ago
Comment by ralph84 12 hours ago
Comment by henry2023 12 hours ago
I’m not sure you’re a bot, but this is the stereotypical comment: overly critical of anything where OpenAI is not superior, or overly supportive (see comments on the Codex post today), while clearly not understanding the discussed topic at all.
Comment by SJMG 11 hours ago
This is not a refutation of astroturfing on HN, but in this case, I doubt it.
Comment by sailingcode 17 hours ago
Comment by VHRanger 18 hours ago
Comment by ineedasername 14 hours ago
The amount of money you have in the bank may often "increase" or "decrease", but it also goes "up" and "down", which is spatial. Concepts can be "adjacent" to each other, or "orthogonal". Plenty more.
So, as models utilize their weights more densely, with more complex strategies learned during training, the patterns and structure of these metaphors might also be deepened. Hmmm... another thing to add to the heap of future projects: trace the geometry of activations in older/newer models of similar size with the same prompts containing such metaphors, or these pelican prompts, and test the idea so it isn't just armchair speculation.
Comment by f33d5173 16 hours ago
I guess initially it would have been a silly way to demonstrate the effect of model size. But the size of the largest models stopped increasing a while ago; recent improvements are driven principally by optimizing for specific tasks. If you had some secret task that you knew they weren't training for, then you could use it as a benchmark for how much the models are improving versus overfitting to their training set, but this is not that.
Comment by simonw 16 hours ago
Comment by Quarrelsome 14 hours ago
Comment by atonse 12 hours ago
Oh maybe it might continue to iterate on the existing drawing?
Comment by quux 13 hours ago
Comment by henry2023 12 hours ago
Comment by aliljet 18 hours ago
Comment by comandillos 19 hours ago
Comment by hopinhopout 7 hours ago
Comment by wongarsu 7 hours ago
And so far, the ability to make SVGs of $animal on $vehicle seems to correlate surprisingly well with model 'intelligence'
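For what it's worth, that "$animal on $vehicle" pattern is trivially templatable. Here's a hypothetical sketch of how you might generate the prompt variations; the animal and vehicle lists and the `svg_prompt` helper are illustrative assumptions, not anyone's actual benchmark harness.

```python
# Hypothetical prompt generator for the "$animal on $vehicle" pattern.
# The word lists and function name are illustrative, not from the post.
from itertools import product

ANIMALS = ["pelican", "flamingo", "otter"]
VEHICLES = ["bicycle", "unicycle", "skateboard"]

def svg_prompt(animal: str, vehicle: str) -> str:
    """Build one benchmark-style prompt string."""
    return f"Generate an SVG of a {animal} riding a {vehicle}"

# Cross every animal with every vehicle to get the prompt set.
prompts = [svg_prompt(a, v) for a, v in product(ANIMALS, VEHICLES)]
print(prompts[0])  # Generate an SVG of a pelican riding a bicycle
```

Swapping in combinations a model is unlikely to have been tuned for (the flamingo check mentioned upthread) is the whole point: overfitting to one pair shouldn't transfer to the others.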
Comment by bottlepalm 17 hours ago
Comment by Havoc 14 hours ago
Comment by JaggerFoo 17 hours ago
Comment by lofaszvanitt 18 hours ago
Comment by justinbaker84 17 hours ago
Comment by kburman 15 hours ago
Comment by yieldcrv 16 hours ago
That’s so wild
Comment by refulgentis 17 hours ago
Pelican: saturated!
Comment by jedisct1 18 hours ago
It's pretty good at finding bugs, but not so good at writing patches to fix them.
Comment by nba456_ 16 hours ago
Comment by aimadetools 4 hours ago
Comment by tmatsuzaki 13 hours ago
Comment by whywhywhywhy 16 hours ago
Comment by simonw 16 hours ago
Comment by simon_is_genius 17 hours ago
Comment by 19qUq 19 hours ago
Comment by mvanbaak 18 hours ago
Comment by throwuxiytayq 18 hours ago
Comment by sharkjacobs 17 hours ago
Comment by stephbook 17 hours ago
But that Opus pelican?
Comment by cedws 16 hours ago
Comment by throwuxiytayq 7 hours ago
Comment by recursive 16 hours ago
Comment by throwuxiytayq 7 hours ago
Comment by bschwindHN 12 hours ago
Comment by segmondy 17 hours ago
Comment by Marciplan 14 hours ago
Comment by smcl 15 hours ago