The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60%, etc. performance on IMO questions. This strongly suggests AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.
Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:
1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)
2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)
I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...
I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!
This is a paper by INSAIT researchers - a very young institute which hired most of its PhD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waving their BG-GPT on national TV in the country as a major breakthrough, while it was basically a fine-tuned Mistral model that was never released to the public, nor was its training set.
Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on LinkedIn by these guys, literally celebrating every PhD they get, which is at minimum very weird. I'd take anything they release with a grain of salt, if not outright caution.
Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)
This effectively makes LLMs useless for education. (It also sours the next generation on LLMs in general; these things are extremely lame to the proverbial "kids these days".)
How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.
You just (lol) need to give non-standard problems and require students to provide reasoning and explanations along with the answer. Yeah, LLMs can "reason" too, but it's obvious when the output comes from an LLM here.
(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages;
1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres]
2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres]
3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000]
4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, you should have been reducing the number!]
So 1), 2) and 3) were out by 1, 1 and 3 orders of magnitude respectively (the errors partially cancelled out), and 4) was nonsensical.
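For what it's worth, a back-of-the-envelope check in Python; the cabin volume, packing efficiency, and usable fraction below are my own rough guesses:

    # Rough sanity check of the corrected numbers (all figures approximate)
    cabin_volume_m3 = 4 * 2 * 40          # ~320 m^3, not 3000
    ball_volume_m3 = 40e-6                # a golf ball is ~40 cc = 0.00004 m^3
    raw_count = cabin_volume_m3 / ball_volume_m3   # ~8,000,000 if the cabin were empty

    packing_efficiency = 0.64             # random close packing of spheres
    usable_fraction = 0.5                 # seats, galleys, bins, etc. (a guess)
    print(f"{raw_count * packing_efficiency * usable_fraction:,.0f} golf balls")
    # => roughly 2,500,000 -- adjustments should reduce the raw count, not inflate it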
This little experiment made me skeptical about the state of the art of AI. I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
> I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.
I think there is a big divide here. Every adult on earth knows magic is "fake", but some can still be amazed and entertained by it, while others find it utterly boring because it's fake, and the only possible (mildly) interesting thing about it is to try to figure out what the trick is.
I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.
>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.
Which is OK, but easier than golf balls in a 737 and hugely easier than USAMO.
Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.
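A toy illustration, with entirely made-up numbers, of why that formula measures trade imbalance rather than a tariff rate:

    # Made-up figures for one country
    imports_from_country = 100.0   # $B of goods we buy from them
    exports_to_country = 60.0      # $B of goods they buy from us
    duties_they_collect = 2.0      # $B of actual tariffs charged on our exports

    claimed_rate = (imports_from_country - exports_to_country) / imports_from_country
    actual_rate = duties_they_collect / exports_to_country

    print(f"deficit / imports = {claimed_rate:.0%}")   # 40% -- a measure of trade imbalance
    print(f"duties / exports  = {actual_rate:.1%}")    # 3.3% -- what a tariff rate actually is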
It's fascinating to me when you tell one that you'd like to see translated passages of work from authors who never have written or translated the item in question, especially if they passed away before the piece was written.
The AI will create something for you and tell you it was them.
Weird thing is, in Google AI Studio all their models, from the state-of-the-art Gemini 2.5 Pro to the lightweight Gemma 2, gave a roughly correct answer. Most even recognised the packing efficiency of spheres.
But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.
Makes sense that search has a small, fast, dumb model designed to summarize and not to solve problems. Nearly 14 billion Google searches per day. Way too much compute needed to use a bigger model.
AFAIK they get ~15% never-before-seen queries every day, so it might not be very simple to design an effective cache layer for that. Semantic-aware clustering of natural language queries and projecting them into a cacheable low-rank dimension is a non-trivial problem. Of course, an LLM can effectively solve that, but then what's the point of using a cache when you need an LLM to cluster the queries...
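A rough sketch of what such a cache might look like: embed the query and reuse a stored answer when a previous query is "close enough". The embedding function and similarity threshold here are placeholders, not a recommendation:

    import numpy as np

    class SemanticCache:
        def __init__(self, embed_fn, threshold=0.92):
            self.embed_fn = embed_fn      # any text -> vector function (the hard part)
            self.threshold = threshold    # cosine similarity required for a cache hit
            self.keys, self.values = [], []

        def _cosine(self, a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        def get(self, query):
            q = self.embed_fn(query)
            for k, v in zip(self.keys, self.values):
                if self._cosine(q, k) >= self.threshold:
                    return v              # near-duplicate query: reuse the cached answer
            return None

        def put(self, query, answer):
            self.keys.append(self.embed_fn(query))
            self.values.append(answer)

The catch, as noted above, is that embed_fn has to be cheap enough to be worth it; otherwise you are back to running a model on every query.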
I have a strong suspicion that for all the low-threshold APIs/services, before the real model sees my prompt, it gets evaluated by a quick model to see if it's something they care to bother the big models with. If not, I get something shaken out of the sleeve of a bottom-of-the-barrel model.
It's most likely one giant ["input token close enough question hash"] = answer_with_params_replay? It doesn't misunderstand the question; it tries to squeeze the input into something close enough?
Nope, no LLMs reported 50-60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50-60% performance on IMO, you are thinking of AlphaProof, but AlphaProof is not an LLM. We don't have the full paper yet, but clearly AlphaProof is a system built on top of an LLM with lots of bells and whistles, just like AlphaFold is.
I've gone through the link you posted and the o1 system card and can't see any reference to IMO. Are you sure they were referring to IMO or were they referring to AIME?
LLMs are “next token” predictors. Yes, I realize that there’s a bit more to it and it’s not always just the “next” token, but at a very high level that’s what they are. So why are we so surprised when it turns out they can’t actually “do” math? Clearly the high benchmark scores are a result of the training sets being polluted with the answers.
If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.
Yeah I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior. I’m currently stumped why my algorithm won’t converge.
So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.
I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get a low-tier ML blogspam author.
**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help my email is bglazer1@gmail.com
I promise its a fun mathematical puzzle and the biology is pretty wild too
I was working some time ago on an image processing model using a GAN architecture. One model produces output and tries to fool the second. Both are trained together. Simple, but it requires a lot of extra effort to make it work. Unstable and falls apart (blows up to an unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, changing the models' architectures and sizes, and adjusting some coefficients through the training to gradually rebalance the loss functions' influence.
The same may work with your problem. If it's unstable, try introducing extra 'brakes' which theoretically are not required. Maybe even incorrect ones. Whatever that means in your domain. Another thing to check is the optimizer; try several, and check the default parameters. I've heard Adam's defaults lead to instability later in training.
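Concretely, the kind of 'brakes' and rebalancing described above might look something like this (a PyTorch sketch; the loss terms, clipping threshold, and decay schedule are illustrative, not a recipe):

    import torch
    from torch import nn

    def train_step(G, D, g_opt, d_opt, real, noise, step, total_steps):
        # --- discriminator update ---
        d_opt.zero_grad()
        fake = G(noise).detach()
        d_loss = (nn.functional.softplus(-D(real)).mean()
                  + nn.functional.softplus(D(fake)).mean())           # non-saturating GAN loss
        d_loss.backward()
        torch.nn.utils.clip_grad_norm_(D.parameters(), max_norm=1.0)  # extra "brake"
        d_opt.step()

        # --- generator update ---
        g_opt.zero_grad()
        fake = G(noise)
        adv_loss = nn.functional.softplus(-D(fake)).mean()
        aux_loss = nn.functional.l1_loss(fake, real)        # e.g. a reconstruction term
        aux_weight = 10.0 * (1 - step / total_steps)        # rebalance its influence over training
        (adv_loss + aux_weight * aux_loss).backward()
        torch.nn.utils.clip_grad_norm_(G.parameters(), max_norm=1.0)
        g_opt.step()

    # Adam's defaults (betas=(0.9, 0.999)) are often swapped for (0.5, 0.999) in GANs:
    # g_opt = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))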
PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.
PPS: the fact that they can do known tasks with minor variations is already a huge time saver.
It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.
Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.
Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.
> they're more like calculators of language than agents that reason
This might be homing in on both the issue and the actual value of LLMs. I think there's a lot of value in a "language calculator", but if it's continuously being sold as something it's not, we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there, but it's different from how we think about it.
True. There’s a small bonus that trying to explain the issue to the llm may sometimes be essentially rubber ducking, and that can lead to insights. I feel most of the time the llm can give erroneous output that still might trigger some thinking on a different direction, and sometimes I’m inclined to think it’s helping me more than it actually is.
When I was an undergrad EE student a decade ago, I had to tangle a lot with complex maths in my Signals & Systems, and Electricity and Magnetism classes. Stuff like Fourier transforms, hairy integrals, partial differential equations etc.
Math packages of the time like Mathematica and MATLAB helped me immensely: once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations and integrate tricky functions, even though AI was nowhere to be found back then.
I feel like ChatGPT is doing something similar when doing maths with its chain of thoughts method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
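As an illustration of the kind of symbolic work those packages automate (sympy here, purely as an example):

    from sympy import symbols, integrate, fourier_transform, exp, sin, oo

    t, x, k, a = symbols('t x k a', positive=True)

    # a "hairy" definite integral done symbolically
    print(integrate(exp(-a * t) * sin(t), (t, 0, oo)))    # 1/(a**2 + 1)

    # the Fourier transform of a Gaussian
    print(fourier_transform(exp(-a * x**2), x, k))        # sqrt(pi/a)*exp(-pi**2*k**2/a)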
I tend to prefer Claude over all things ChatGPT so maybe give the latest model a try -- although in some way I feel like 3.7 is a step down from the prior 3.5 model
How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.
They retrained their model less than a week before its release, just to juice one particular nonstandard eval? Seems implausible. Models get 5x better at things all the time. Challenges like the Winograd schema have gone from impossible to laughably easy practically overnight. Ditto for "Rs in strawberry," ferrying animals across a river, overflowing wine glass, ...
I'm not generally inclined toward the "they are cheating cheaters" mindset, but I'll point out that fine tuning is not the same as retraining. It can be done cheaply and quickly.
Models getting 5X better at things all the time is at least as easy to interpret as evidence of task-specific tuning as of breakthroughs in general ability, especially when the 'things being improved on' are published evals with history.
Imagine that you are making a problem-solving AI. You have a large budget, and access to compute and web-crawling infra to run your AI "on the internet". You would like to be aware of the ways people are currently evaluating AI so that you can be sure your product looks good. Do you maybe have an idea of how one could do that?
The "ferrying animals across a river" problem has definitely not been solved, they still don't understand the problem at all, overcomplicating it because they're using an off-the-shelf solution instead of actual reasoning:
Overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses (even though it looks terrible and unphysical, always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" had the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHFed wine glasses since it was a well-known failure, but (as always) this is just whack-a-mole, it does not generalize into understanding the physical principle.
When told, "only room for one person OR one animal", it's also the only one to recognise the fact that the puzzle is impossible to solve. The farmer can't take any animals with them, and neither the goat nor wolf could row the boat.
> When told, "only room for one person OR one animal"
In common terms, suppose I say: "there is only room for one person or one animal in my car to go home." One can suppose that this refers to additional room besides that occupied by the driver. There is a problem when we try to use an LLM trained on common use of language to solve puzzles in formal logic or math. I think the current LLMs are not able to maintain a specialized context to become a logical reasoning agent, but perhaps such a thing could be possible if the evaluation function of the LLM were designed to give high credit to switching context on a phrase or token.
A particular nonstandard eval that is currently top comment on this HN thread, due to the fact that, unlike every other eval out there, LLMs score badly on it?
Doesn't seem implausible to me at all. If I was running that team, I would be "Drop what you're doing, boys and girls, and optimise the hell out of this test! This is our differentiator!"
It's implausible that fine-tuning of a premier model would have anywhere near that turn around time. Even if they wanted to and had no qualms doing so, it's not happening anywhere near that fast.
Yeah it is, lol. You don't just train your model on whatever you like when you're expected to serve it. There are a host of problems with doing that. The idea that they trained on this obscure benchmark, released about the day of, is actually very silly.
They are trained on some mix with a minimal fraction of math. That's how it was from the beginning. But we can rebalance it by adding quality generated content. Just the content will cost millions of $$ to generate. Distillation on a new level looks like the logical next step.
OP is probably referring to this paper: https://arxiv.org/pdf/2503.21934v1. The paper explains how rigorous testing revealed abysmal performance of LLMs (results that are at odds with how they are hyped).
OpenAI described how they removed it for GPT-4 in its release paper: only exact string matches. So all the discussion of bar exam questions from memory on test-taking forums etc., which wouldn't exactly match, made it in.
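A sketch of why exact-match filtering is so weak: a paraphrased exam question sails straight through. This is a hypothetical stand-in, not OpenAI's actual pipeline:

    def contaminated(train_doc: str, eval_item: str, n: int = 50) -> bool:
        """Flag a training document only if it contains an exact n-character
        substring of the eval item (a crude stand-in for exact string matching)."""
        doc, s = train_doc.lower(), eval_item.lower()
        chunks = [s[i:i + n] for i in range(0, max(1, len(s) - n + 1), n)]
        return any(chunk in doc for chunk in chunks)

    eval_q = "Explain the rule against perpetuities and give an example of its application."
    forum_post = "bar prep thread: can someone explain, with an example, how the rule against perpetuities applies?"

    print(contaminated(forum_post, eval_q))   # False: the paraphrase slips past exact matching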
Just in case it wasn't a typo, and you happen not to know ... that word is probably "eke" - meaning gaining (increasing, enlarging from wiktionary) - rather than "eek" which is what mice do :)
hah you're right on the spelling but wrong on my meaning. That's probably the first time I've typed it. I don't think LLMs are quite at the level of mice reasoning yet!
Every day I am more convinced that LLM hype is the equivalent of someone seeing a stage magician levitate a table across the stage and assuming this means hovercars must only be a few years away.
I believe there's a widespread confusion between a fictional character that is described as an AI assistant, and the actual algorithm building the play-story from which humans imagine the character. An illusion actively promoted by companies seeking investment and hype.
AcmeAssistant is "helpful" and "clever" in the same way that Vampire Count Dracula is "brooding" and "immortal".
That might be overstating it, at least if you mean it to be some unreplicable feat.
Small models have been trained that play around 1200 to 1300 on the EleutherAI Discord. And there's this grandmaster-level transformer - https://arxiv.org/html/2402.04494v1
OpenAI, Anthropic and the like simply don't care much about their LLMs playing chess. That, or post-training is messing things up.
> That might be overstating it, at least if you mean it to be some unreplicable feat.
I mean, surely there's a reason you decided to mention 3.5 turbo instruct and not.. 3.5 turbo? Or any other model? Even the ones that came after? It's clearly a big outlier, at least when you consider "LLMs" to be a wide selection of recent models.
If you're saying that LLMs/transformer models are capable of being trained to play chess by training on chess data, I agree with you.
I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
I mentioned it because it's the best example. One example is enough to disprove the "not capable of". There are other examples too.
>I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
Not really. The LLMs play chess like they have no clue what the rules of the game are, not like poor reasoners. Trying to predict and failing is how they learn anything. If you want them to learn a game like chess, that's how you get them to learn it: by training them to predict chess moves. Chess books during training only teach them how to converse about chess.
The issue isn’t whether they can be trained to play. The issue is whether, after making a careful reading of the rules, they can infer how to play. The latter is something a human child could do, but it is completely beyond an LLM.
> One example is enough to disprove the "not capable of" nonsense. There are other examples too.
Gotcha, fair enough. Throw enough chess data in during training, I'm sure they'd be pretty good at chess.
I don't really understand what you're trying to say in your next paragraph. LLMs surely have plenty of training data to be familiar with the rules of chess. They also purportedly have the reasoning skills to use their familiarity to connect the dots and actually play. It's trivially true that this issue can be plastered over by shoving lots of chess game training data into them, but the success of that route is not a positive reflection on their reasoning abilities.
Gradient descent is a dumb optimizer. LLM training is not at all like a human reading a book and more like evolution tuning adaptations over centuries. You would not expect either process to be aware of anything they are converging towards. So having lots of books that talk about chess in training will predictably just return a model that knows how to talk about chess really well. I'm not surprised they may know how to talk about the rules but play them poorly.
And that post had a follow-up. Post-training messing things up could well be the issue, seeing the impact that even a few more examples and/or regurgitation made.
https://dynomight.net/more-chess/
The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language. That is, that simply training on text is going to eventually give the AI the ability to generate language that reasons at more or less human level in more or less any domain of knowledge.
This whole premise crashes and burns if you need task-specific training, like explicit chess training. That is because there are far too many tasks that humans need to be competent at in order to be useful in society. Even worse, the vast majority of those tasks are very hard to source training data for, unlike chess.
So, if we accept that LLMs can't learn chess unless they explicitly include chess games in the training set, then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
>The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language.
And they do, just not always in the ways we expect.
>This whole premise crashes and burns if you need task-specific training, like explicit chess training.
Everyone needs task-specific training. Any human good enough at chess (or anything else) to make it a profession needs it. So I have no idea why people would expect any less of a machine.
>then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
Yeah, so? How many business pitches they need in the training set has no correlation with chess. I don't see any reason to believe what is already present isn't enough. There's enough chess data on the internet to teach them chess too; it's just a matter of how much OpenAI cares about it.
Reasoning training causes some amount of catastrophic forgetting, so it's unlikely they'd burn that on mixing in chess puzzles if they want a commercial product, unless it somehow transfers well to other reasoning problems that are broadly cared about.
My point wasn't chess specific or that they couldn't have specific training for it. It was a more general "here is something that LLMs clearly aren't being trained for currently, but would also be solvable through reasoning skills"
Much in the same way a human who only just learnt the rules but 0 strategy would very, very rarely lose here
These companies are shouting that their products are passing incredibly hard exams, solving PhD-level questions, and are about to displace humans, and yet they still fail to crush a random-only-strategy chess bot? How does this make any sense?
We're on the verge of AGI but there's not even the tiniest spark of general reasoning ability in something they haven't been trained for
"Reasoning" or "Thinking" are marketing terms and nothing more. If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning"
>If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning".
If you think you can play chess at that level over that many games and moves with memorization, then I don't know what to tell you except that you're wrong. It's not possible, so let's just get that out of the way.
>These companies are shouting that their products are passing incredibly hard exams, solving PhD-level questions, and are about to displace humans, and yet they still fail to crush a random-only-strategy chess bot? How does this make any sense?
Why doesn't it? Have you actually looked at any of these games? Those LLMs aren't playing like poor reasoners. They're playing like machines that have no clue what the rules of the game are. LLMs learn by predicting, failing, and getting a little better at it, repeated ad nauseam. You want them to learn the rules of a complex game? That's how you do it: by training them to predict it. Training on chess books just makes them learn how to converse about chess.
Humans have weird failure modes that are at odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These machines have theirs. That's all there is to it. The top comment we are both replying to noted that gemini-2.5-pro, released less than 5 days later, hit 25% on the benchmark. Now that was particularly funny.
It was surprising to me because I would have expected if there was reasoning ability then it would translate across domains at least somewhat, but yeah what you say makes sense. I'm thinking of it in human terms
Transfer Learning during LLM training tends to be 'broader' than that.
Like how
- Training LLMs on code makes them solve reasoning problems better
- Training Language Y alongside X makes them much better at Y than if they were trained on language Y alone and so on.
Probably because well gradient descent is a dumb optimizer and training is more like evolution than a human reading a book.
Also, there is something genuinely weird going on with LLM chess. And it's possible base models are better.
https://dynomight.net/more-chess/
The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
When you ask it a question, it tends to say yes.
So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.
The real challenge is that LLMs fundamentally want to seem agreeable, and that's not improving. So even if a model gets an extra 5/100 math problems right, it feels about the same in a series of prompts that are more complicated than a simple ChatGPT scenario.
I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.
I totally agree that the core benchmarks that matter should be ones which evaluate a model in agentic scenario, not just on the basis of individual responses.
That sounds reasonable to me, but those companies forget that there are different types of agreeable. There's the LLM approach, similar to the coworker who will answer all your questions about .NET but not stop you from coding yourself into a corner, and then there's the "Let's sit down and review what it actually is that you're doing, because you're asking a fairly large number of disjoint questions right now".
I've dropped trying to use LLMs for anything, due to political convictions and because I don't feel like they are particularly useful for my line of work. Where I have tried to use various models in the past is for software development, and the common mistake I see the LLMs make is that they can't pick up on mistakes in my line of thinking, or won't point them out. Most of my problems are often down to design errors or thinking about a problem in a wrong way. The LLMs will never once tell me that what I'm trying to do is an indication of a wrong/bad design. There are ways to be agreeable and still point out problems with previously made decisions.
You're right that LLMs don't actually want anything. That said, in reinforcement learning, it's common to describe models as wanting things because they're trained to maximize rewards. It’s just a standard way of talking, not a claim about real agency.
Yeah, and they probably have more "agreeable" stuff in their corpus simply because very disagreeable stuff tend to be either much shorter or a prelude to a flamewar.
> The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
umm, it seems to me that it is this (tfa):
But I would nevertheless like to submit, based off of internal
benchmarks, and my own and colleagues' perceptions using these models,
that whatever gains these companies are reporting to the public, they
are not reflective of economic usefulness or generality.
and then couple of lines down from the above statement, we have this:
So maybe there's no mystery: The AI lab companies are lying, and when
they improve benchmark results it's because they have seen the answers
before and are writing them down.
[this went way outside the edit-window and hence a separate comment]
imho, the state of varying experience with LLMs can be aptly summed up in this poem by Mr. Longfellow:
There was a little girl,
Who had a little curl,
Right in the middle of her forehead.
When she was good,
She was very good indeed,
But when she was bad she was horrid.
Unlike many, I find the author's complaints spot on.
Once all the AI-batch startups have sold subscriptions to the cohort and there's no further market growth - because businesses outside don't want to roll the dice on a probabilistic model that doesn't have an understanding of pretty much anything, but rather is a clever imitation machine over the content it has seen - the AI bubble will burst, with more startups starting to pack up by end of 2026, 2027 at the latest.
I agree about both the issue with benchmarks not being relevant to actual use cases and the "wants to sound smart" issue. I have seen both first-hand interacting with LLMs.
I think the ability to embed arbitrary knowledge written in arbitrary formats is the most important thing LLMs have achieved.
In my experience, trying to get an LLM to perform a task as vast and open-ended as the one the author describes is fundamentally misguided. The LLMs were not trained for that and won't be able to do it to a satisfactory degree. But all this research has thankfully provided us with the software and hardware tools with which one could start training a model that can.
Contrast that with 5-6 years ago, when all you could hope for for this kind of thing were simple rule-based and pattern-matching systems.
This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven.
I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.
But some of us are going to end up right and some of us are going to end up wrong and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.
There is nothing wrong with sharing anecdotal experiences. Reading through anecdotal experiences here can help one understand how one's own experiences are relatable or not. Moreover, if I have experience X, it could help to know whether it is because I am doing something wrong that others have figured out.
Furthermore, as we are talking about the actual impact of LLMs, as is the point of the article, a bunch of anecdotal experiences may be more valuable than a bunch of benchmarks for figuring it out. Also, apart from the right/wrong dichotomy, people use LLMs with different goals and contexts. It may not mean that some people are doing something wrong if they do not see the same impact as others. Every time a web developer says that they do not understand how others can be so skeptical of LLMs, concludes with certainty that they must be doing something wrong, and moves on to explain how to actually use LLMs properly, I chuckle.
Indeed, there’s nothing at all wrong with sharing anecdotes. The problem is when people make broad assumptions and conclusions based solely on personal experience, which unfortunately happens all too often. Doing so is wired into our brains, though, and we have to work very consciously to intercept our survival instincts.
People "make conclusions" because they have to take decisions day to day. We cannot wait for the perfect bulletproof evidence before that. Data is useful to take into account, but if I try to use X llm that has some perfect objective benchmark backing it, while I cannot make it be useful to me while Y llm has better results, it would be stupid not to base my decision on my anecdotal experience. Or vice versa, if I have a great workflow with llms, it may be not make sense to drop it because some others may think that llms don't work.
In the absence of actually good evidence, anecdotal data may be the best we can get now. The point imo is try to understand why some anecdotes are contrasting each other, which, imo, is mostly due to contextual factors that may not be very clear, and to be flexible enough to change priors/conclusions when something changes in the current situation.
I think you might be caught up in a bit of the rationalist delusion.
People -only!- draw conclusions based on personal experience. At best you have personal experience with truly objective evidence gathered in a statistically valid manner.
But that only happens in a few vanishingly rare circumstances here on earth. And wherever it happens, people are driven to subvert the evidence gathering process.
Often “working against your instincts” to be more rational only means more time spent choosing which unreliable evidence to concoct a belief from.
You want to block subjectivity? Write some formulas.
There are three questions to consider:
a) Have we, without any reasonable doubt, hit a wall for AI development? Emphasis on "reasonable doubt". There is no reasonable doubt that the Earth is roughly spherical. That level of certainty.
b) Depending on your answer for (a), the next question to consider is if we the humans have motivations to continue developing AI.
c) And then the last question: will AI continue improving?
If taken as boolean values, (a), (b) and (c) form a truth table with eight rows, the most interesting being (false, true, true): "(not a) and b => c". Note the implication sign, "=>". Give some values to (a) and (b), and you get a value for (c).
There are more variables you can add to your formula, but I'll abstain from giving any silly examples. I, however, think that the row (false, true, false) implied by many commentators is just fear and denial. Fear is justified, but denial doesn't help.
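For the record, enumerating that implication over all eight rows shows the only assignment that violates it is exactly the (false, true, false) row mentioned above (a quick illustration):

    from itertools import product

    # (not a and b) => c, i.e. "no wall, motivation exists, therefore improvement continues"
    for a, b, c in product([False, True], repeat=3):
        holds = (not (not a and b)) or c    # material implication
        print(f"a={a!s:5} b={b!s:5} c={c!s:5} -> {'holds' if holds else 'FAILS'}")
    # the only FAILS row is a=False, b=True, c=False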
It's not surprising that responses are anecdotal. An easy way to communicate a generic sentiment often requires being brief.
A majority of what makes a "better AI" can be condensed to how effective the gradient-descent algorithms are at reaching the local optima we want them to reach. Until a generative model shows actual progress at "making decisions", it will forever be seen as a glorified linear algebra solver. Generative machine learning is all about giving a pleasing answer to the end user, not about creating something that is on the level of human decision making.
At risk of being annoying, answers that feel like high quality human decision making are extremely pleasing and desirable. In the same way, image generators aren't generating six fingered hands because they think it's more pleasing, they're doing it because they're trying to please and not good enough yet.
I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember having a run with early GPT 4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models were terrible and hallucinated aggressively (4, 4o, 4.5, o3-mini, o3-mini-high), with the exception of o1. DeepSeek R1 only mildly hallucinated and gave bad answers. Gemini 2.5 was the only flagship model that did not hallucinate and gave some decent answers.
I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about should have been in their training datasets.
Totally agree... this space is still so new and unpredictable that everyone is operating off vibes, gut instinct, and whatever personal anecdotes they've collected. We're all sort of fumbling around in the dark, trying to reverse-engineer the flashlight
Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal for some time to come.
I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.
I am hopeful the coming waves of vertical integration/guardrails/grounding applications will move us away from having to hop between models every few weeks.
Frankly, the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claiming silver/gold-level performance on IMOs. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC-AGI questions and training on those.
Good observation but also somewhat trivial. We are not omniscient gods, ultimately all our opinions and decisions will have to be based on our own limited experiences.
My lived experience is that, unless there's some new breakthrough, AI is more akin to a drill replacing a hammer than to a tractor replacing the plow or a printing press.
Maybe an AI expert can elaborate on this, but it seems there's a limit to the fundamental underlying LLM architecture of transformers and tokens.
LLMs are amazing, but we might need something more, or some new paradigm, to push us towards true AGI.
My personal experience is right in line with the author's.
Also:
> I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.
I immediately thought: That's because in most situations this is the purpose of language, at least partially, and LLMs are trained on language.
The disconnect between improved benchmark results and lack of improvement on real world tasks doesn't have to imply cheating - it's just a reflection of the nature of LLMs, which at the end of the day are just prediction systems - these are language models, not cognitive architectures built for generality.
Of course, if you train an LLM heavily on narrow benchmark domains then its prediction performance will improve on those domains, but why would you expect that to improve performance in unrelated areas?
If you trained yourself extensively on advanced math, would you expect that to improve your programming ability? If not, then why would you expect it to improve the programming ability of a far less sophisticated "intelligence" (prediction engine) such as a language model?! If you trained yourself on LeetCode programming, would you expect that to help with hardening corporate production systems?!
That's fair. But look up the recent experiment running SOTA models on the then just-released USAMO 2025 questions. The highest score was 5%, while supposedly SOTA last year was at IMO silver level. There could be some methodological differences - i.e., the USAMO paper required correct proofs rather than just numerical answers - but it really strongly suggests that even within limited domains, it's cheating. I'd wager a significant amount that if you tested SOTA models on a new ICPC set of questions, actual performance would be far, far worse than their supposed benchmarks.
Your analogy is perfect. Training an LLM on math olympiad problems and then expecting it to secure enterprise software is like teaching someone chess and handing them a wrench
I think overall quality with Gemini 2.5 is not much better than Gemini 2 in my experience. Gemini 2 was already really good, but just like Claude 3.7, Gemini 2.5 goes some steps forward and some steps backwards. It sometimes generates some really verbose code even when you tell it to be succinct. I am pretty confident that if you evaluate 2.5 for a bit longer you'll come to the same conclusion eventually.
People are really fundamentally asking two different questions when they talk about AI "importance": AI's utility and AI's "intelligence". The two need to be kept carefully apart.
1) AI undoubtedly has utility. In many agentic uses, it has very significant utility. There's absolute utility and perceived utility, the latter being more about user experience. In absolute utility, git is likely the single most game-changing piece of software there is: it has probably saved a ten-, maybe eleven-digit number in engineer hours times salary by enabling massive teams to work together in very seamless ways. In user experience, AI is amazing because it can generate so much so quickly. But it is very far from an engineer.

For example, recently I tried to use Cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious: a lot of it had to do with NextJS 15 and the breaking changes it introduces in cookies and auth. If you have masses of NextJS code, which is disproportionately older versions and none of it labeled well with versions, it messes up the LLM. Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations. They work well on problems with massive data (like front end) that don't require much principled understanding (like understanding how NextJS 15 would break so-and-so's auth). Another example: when I tried to use it to generate flags for a V8 build, it failed horribly and would simply hallucinate flags all the time. This seemed very likely to be because (despite the existence of a list of V8 flags online) many flags had very close representations in vector embeddings, and there was almost zero data or detailed examples of their use.
2) On the more theoretical side, the performance of LLMs on benchmarks (claiming to be elite IMO solvers, competitive programming solvers) has become incredibly suspicious. When the new USAMO 2025 was released, the highest score was 5%, despite claims a year ago that SOTA was at least at IMO silver level. This is against the backdrop of exponentially more compute and data being fed in. Combined with apparently diminishing returns, this suggests the gains from that are running really thin.
There are real and obvious improvements in the past few model updates and I'm not sure what the disconnect there is.
Maybe it's that I do have PhD level questions to ask them, and they've gotten much better at it.
But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model and it works less well with later models.
Or perhaps they have a time-sensitive task and are not able to take advantage of the thinking of modern LLMs, which have a slow thinking-based feedback loop. Or maybe their code base is getting more complicated, so it's harder to reason about.
Or perhaps they're giving the LLMs a poorly defined task where older models made assumptions about but newer models understand the ambiguity of and so find the space of solutions harder to navigate.
Since this is ultimately from a company doing AI scanning for security, I would think the latter plays a role to some extent. Security is insanely hard and the more you know about it the harder it is. Also adversaries are bound to be using AI and are increasing in sophistication, which would cause lower efficacy (although you could tease this effect out by trying older models with the newer threats).
The issue is the scale of the improvements. GPT-3.5 Instruct was an utterly massive leap over everything that came before it. GPT-4 was a very big jump over that. Everything since has seemed incremental. Yes we got multimodal but that was part of GPT-4, they just didn't release it initially, and up until very recently it mostly handed off to another model. Yes we got reasoning models, but people had been using CoT for awhile so it was just a matter of time before RL got used to train it into models. Witness the continual delays of GPT-5 and the back and forth on whether it will be its own model or just a router model that picks the best existing model to hand a prompt off to.
In the last year, things like "you are an expert on..." have gotten much less effective in my private tests, while actually describing the problem precisely has gotten better in terms of producing results.
In other words, all the sort of lazy prompt engineering hacks are becoming less effective. Domain expertise is becoming more effective.
I hope it's true. Even if LLMs development stopped now, we would still keep finding new uses for them at least for the next ten years. The technology is evolving way faster than we can meaningfully absorb it and I am genuinely frightened by the consequences. So I hope we're hitting some point of diminishing returns, although I don't believe it a bit.
My experience as someone who uses LLMs and a coding assist plugin (sometimes), but is somewhat bearish on AI is that GPT/Claude and friends have gotten worse in the last 12 months or so, and local LLMs have gone from useless to borderline functional but still not really usable for day to day.
Personally, I think the models are “good enough” that we need to start seeing the improvements in tooling and applications that come with them now. I think MCP is a good step in the right direction, but I’m sceptical on the whole thing (and have been since the beginning, despite being a user of the tech).
I'd say most of the recent AI model progress has been on price.
A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in coding performance. But it's small enough to run on a consumer GPU, which means deployment price is now down to $0.10 per hour. (from $12+ for models requiring 8x H100)
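For anyone curious, loading a model like that in 4-bit on a single consumer GPU can look roughly like this (transformers + bitsandbytes; the model id and generation settings below are placeholders for whichever quant you actually run, and GGUF via llama.cpp is the other common route):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "Qwen/QwQ-32B"   # substitute the repo/quant you actually use

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,                        # 4-bit NF4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb,
        device_map="auto",                        # spills to CPU if VRAM is tight
    )

    prompt = "Write a Python function that merges two sorted lists."
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    print(tok.decode(out[0], skip_special_tokens=True))

Either way, the point stands: serving cost collapses once the model fits on one card.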
One thing I’ve seen is large enterprises extracting money from consumers by putting administrative burden on them.
For example, you can see this in health insurance reimbursements and wireless carriers plan changes. (ie, Verizon’s shift from Do More, etc to what they have now)
Companies basically set up circumstances where consumers lose small amounts of money on a recurring basis or sporadically enough that the people will just pay the money rather than a maze of calls, website navigation and time suck to recover funds due to them or that shouldn’t have been taken in the first place.
I’m hopeful well commoditized AI will give consumers a fighting chance at this and other types of disenfranchisement that seems to be increasingly normalized by companies that have consultants that do nothing but optimize for their own financial position.
This was published the day before Gemini 2.5 was released. I'd be interested if they see any difference with that model. Anecdotally, that is the first model that really made me go wow and made a big difference for my productivity.
Ya, I find this hard to imagine aging well. Gemini 2.5 solved (at least much better than) multiple real world systems questions I've had in the past that other models could not. Its visual reasoning also jumped significantly on charts (e.g. planning around train schedules)
Even Sonnet 3.7 was able to do refactoring work on my codebase sonnet 3.6 could not.
FWIW 2.5-exp was the only one that managed to get a problem I asked it right, compared to Claude 3.7 and o1 (or any of the other free models in Cursor).
It was reverse engineering ~550MB of Hermes bytecode from a react native app, with each function split into a separate file for grep-ability and LLM compatibility.
The others would all start off right, then quickly default to just grepping randomly for what they expected it to be, which failed quickly. 2.5 traced the function all the way back to the networking call and provided the expected response payload.
All the others hallucinated the networking response I was trying to figure out. 2.5 provided it accurately enough for me to intercept the request and, using the response it described, get what I wanted to show up.
I manually pre-parsed the bytecode file with awk into a bazillion individual files that were each just one function, and gave it the hint to grep to sort through them. This was all done in Cursor.
    awk '
      # a new function starts on lines like "=> [Function #123 ...]"
      /^=> \[Function #/ {
        if (out) close(out);                                  # finish the previous function file
        fn = $0; sub(/^.*#/, "", fn); sub(/ .*/, "", fn);     # extract the function number
        out = "function_" fn ".txt"                           # one output file per function
      }
      { if (out) print > out }                                # write every line to the current file
    ' bundle.hasm
Quick example of the output it gave and its process.
There's somehow this belief that "newer models will disprove <insert LLM criticism here>" despite the "newer" models being... just a scaled-up version of a previous model, or some anciliary features tacked on. An LLM is an LLM is an LLM: I'll believe it when I see otherwise.
As someone who was wildly disappointed with the hype around Claude 3.7, Gemini 2.5 is easily the best programmer-assistant LLM available, IMO.
But it still feels more like a small incremental improvement rather than a radical change, and I still feel its limitations constantly.
Like... it gives me the sort of decent but uninspired solution I would expect it to generate without predictably walking me through a bunch of obvious wrong turns as I repeatedly correct it as I would have to have done with earlier models.
And that's certainly not nothing and makes the experience of using it much nicer, but I'm still going to roll my eyes anytime someone suggests that LLMs are the clear path to imminently available AGI.
This is exactly my sentiment. Sonnet-3.5-latest was the perfect code companion: it wrote just the right amount of okay-quality code, but its strength was that it really tried to adhere to your instructions. Sonnet-3.7 was the exact opposite: it wrote waaay too much code and overengineered things like crazy while having very poor instruction adherence. Gemini 2.5 Pro is basically what I hoped sonnet-3.7 would be: follows instructions well but is still softly opinionated, massive (usable) context window, fast responses, more biased towards the latest best practices, and an up-to-date knowledge cutoff.
I'm wondering how much gemini 2.5 being "amazing" comes from sonnet-3.7 being such a disappointment.
> ...whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality.
I'm not surprised, because I don't expect pattern matching systems to grow into something more general and useful. I think LLM's are essentially running into the same limitations that the "expert systems" of the 1980's ran into.
My mom told me yesterday that Paul Newman had massive problems with alcohol. I was somewhat skeptical, so this morning I asked ChatGPT a very simple question:
"Is Paul Newman known for having had problems with alcohol?"
All of the models up to o3-mini-high told me he had no known problems. Here's o3-mini-high's response:
"Paul Newman is not widely known for having had problems with alcohol. While he portrayed characters who sometimes dealt with personal struggles on screen, his personal life and public image were more focused on his celebrated acting career, philanthropic work, and passion for auto racing rather than any issues with alcohol. There is no substantial or widely reported evidence in reputable biographies or interviews that indicates he struggled with alcohol abuse."
There is plenty of evidence online that he struggled a lot with alcohol, including testimony from his long-time wife Joanne Woodward.
I sent my mom the ChatGPT reply and in five minutes she found an authoritative source to back her argument [1].
I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
Lesson(s) learned... Including not doubting my mother's movie trivia knowledge.
This is less an LLM thing than an information retrieval question. If you choose a model and tell it to “Search,” you find citation based analysis that discusses that he indeed had problems with alcohol. I do find it interesting it quibbles whether he was an alcoholic or not - it seems pretty clear from the rest that he was - but regardless.
This is indicative of something crucial when placing LLMs into a toolkit. They are not omniscient nor are they deductive reasoning tools. Information retrieval systems are excellent at information retrieval and should be used for information retrieval. Solvers are excellent at solving deductive problems. Use them. The better they get at these tasks alone is cool but is IMO a parlor trick since we have nearly optimal or actually optimal techniques that don’t need an LLM. The LLM should use those tools.
So, click search next time you have an information retrieval question.
https://chatgpt.com/share/67f2dac0-3478-8000-9055-2ae5347037...
I realise your answer wasn't assertive, but if I heard this from someone actively defending AI it would be a copout. If the selling point is that you can ask these AIs anything then one can't retroactively go "oh but not that" when a particular query doesn't pan out.
This is a bit of a strawman. There are certainly people who claim that you can ask AIs anything but I don't think the parent commenter ever made that claim.
"AI is making incredible progress but still struggles with certain subsets of tasks" is self-consistent position.
Any information found in a web search about Newman will be available in the training set (more or less). It's almost certainly a problem of alignment / "safety" causing this issue.
"Any information found in a web search about Newman will be available in the training set"
I don't think that is a safe assumption these days. Training modern LLM isn't about dumping in everything on the Internet. To get a really good model you have to be selective about your sources of training data.
They still rip off vast amounts of copyrighted data, but I get the impression they are increasingly picky about what they dump into their training runs.
There’s a simpler explanation than that’s that the model weights aren’t an information retrieval system and other sequences of tokens are more likely given the totality of training data. This is why for an information retrieval task you use an information retrieval tool similarly to how for driving nails you use a hammer rather than a screw driver. It may very well be you could drive the nail with the screw driver, but why?
You think that's a simpler explanation? Ok. I think given the amount of effort that goes into "safety" on these systems that my explanation is vastly more likely than somehow this information got lost in the vector soup despite being attached to his name at the top of every search result[0].
Except if safety blocked this, it would have also blocked the linked conversation. Alignment definitely distorts the behavior of models, but treating them as information retrieval systems is using a screwdriver to drive nails. Your example didn't refute this.
> Yes, Paul Newman was known to have struggled with alcohol at certain points in his life. In his early years, particularly during his time in the Navy and into his acting career, Newman admitted to heavy drinking. He was open about his fondness for beer and once jokingly referred to himself as a “functional alcoholic.” In a 1988 interview with The New York Times, he acknowledged that he had a period where he drank too much, stating, “I was a very good drinker. I could put it away.” ...
A local model I'm running (gemma-3-27b-it; https://huggingface.co/lmstudio-community/gemma-3-27b-it-GGU...) just told me various correct-sounding bits about his history with alcohol (correctly citing his alma mater and first wife), but threw in:
"Sobriety & AA: Newman got sober in 1964 and remained so for the rest of his life."
Which doesn't check out. And it includes plausible but completely hallucinated URLs (as well as a valid biography.com URL that completely omits information about alcohol.)
Gemma 3 4B (QAT quant):
Yes, Paul Newman was indeed known to have struggled with alcohol throughout his life. While he maintained a public image of a charming, clean-cut star, he privately battled alcoholism for many years. He sought treatment in the late 1980s and early 1990s and was reportedly very open about his struggles and the importance of seeking help.
Does the as-yet-unwritten prequel to Idiocracy tell the tale of when we started asking AI chatbots for facts, and this was the point of no return for humanity?
I like that it's unmonetized, of course, but that's not why I use AI. I use AI because it's better at search. When I can't remember the right keywords to find something, or when the keywords aren't unique, I frequently find that web search doesn't return what I need and AI does.
It's impressive how often AI returns the right answer to vague questions. (not always though)
Can you blame the users for asking it, when everyone is selling that as a key defining feature?
I use it for asking - often very niche - questions on advanced probability and simulation modeling, and it often gets those right - why those and not a simple verifiable fact about one of the most popular actors in history?
I don't know about Idiocracy, but something I have read specific warnings about is that people will often blame the user for any of the tool's shortcomings.
Counterpoint: Paul Newman was absolutely a famous drunk, as evidenced by this Wikipedia page.* Any query for "paul newman alcohol" online will return dozens of reputable sources on the topic. Your post is easily interpretable as handwaving apologetics, and it gives big "Its the children who are wrong" energy.
This may have hit the nail on the head about the weaknesses of LLMs.
They regurgitate something based not so much on facts as on things that are accessible and perceived as facts. Those might be right, but they might also be wrong, and no one can tell without doing the hard work of checking original sources. Much of what counts as accepted fact, and is accessible to LLM harvesting, is at best derived fact, often mediated by motivated individuals and published to accessible sources by "people with an interest".
The weightings used by any AI should be based on the facts, not on the compounded volume of derived, "mediated", or "directed" facts - simply because those aren't really facts; they're reports.
It all seems like dumber, lazier search-engine stuff. Honestly, what do I know about Paul Newman? But Joanne Woodward and others who knew and worked with him should be weighted as at least slightly more credible than others, no matter how many text patterns happen to match.
> I use ChatGPT for many tasks every day, but I couldn't fathom that it would get something so simple so wrong.
I think we'll have a term like we have for parents/grandparents that believe everything they see on the internet but specifically for people using LLMs.
Yes, Paul Newman was known for being a heavy drinker, particularly of beer. [1] He acknowledged his high consumption levels himself. While he maintained an incredibly successful career and public life, accounts and biographies note his significant alcohol intake, often describing it as a functional habit rather than debilitating alcoholism, although the distinction can be debated. He reportedly cut back significantly in his later years.
[1] Review: Paul Newman memoir stuns with brutal honesty - AP News (apnews.com)
Excluding the ones that do not support chat completions, all but one (qwen-qwq-32b) answered in the affirmative. The answer from qwen-qwq-32b said:
Paul Newman, the renowned actor and humanitarian, did not have a widely publicized struggle with alcohol addiction throughout most of his life, but there were specific instances that indicated challenges.
Using lack of progress in a specialized field as a barometer for overall progress is kind of silly. I just spent the last few days 'vibe coding' an application and I have to say that it's pretty remarkable how capable it is now relative to my experience last year.
It took three minutes for me to do the above from the time I created my API key to when I had an answer.
I find that everyone who replies with examples like this is an expert using expert skills to get the LLM to perform. Which makes me wonder: why is this a skill that is useful to the general public, as opposed to just another useful skill for technical knowledge workers to add to their tool belt?
I agree. But I will say that at least in my social circles I'm finding that a lot of people outside of tech are using these tools, and almost all of them seem to have a healthy skepticism about the information they get back. The ones that don't will learn one way or the other.
That's not really 'simple' for an LLM. This is niche information about a specific person. LLMs train on massive amounts of data; the more present a topic is in that data, the better the answers will be.
Also, you can/should use the "research" mode for questions like this.
The question is simple and verifiable - it is impressive to me that it’s not contained in the LLM’s body of knowledge - or rather that it can’t reach the answer.
This is niche in the grand scheme of knowledge but Paul Newman is easily one of the biggest actors in history, and the LLM has been trained on a massive corpus that includes references to this.
Where is the threshold for topics with enough presence in the data?
Yes, Paul Newman did experience significant struggles with alcohol. In his posthumously published memoir, The Extraordinary Life of an Ordinary Man, Newman candidly discusses his drinking habits and acknowledges his long-term battle with alcoholism. He describes himself as a "functioning alcoholic," a trait he noted was shared with his father. At one point, Newman was reported to consume a case of beer daily, followed by spirits, until he eventually gave up hard liquor.
3-4 hours is enough time for it to have crawled the Hacker News comments section. That's about the frequency the AI bots crawl my little out-of-the-way blog.
These models are not reliable sources of information. They are either out of date, subject to hallucination, or just plain wrong for a variety of reasons. They are untrustworthy to ask facts like this.
I appreciate your consideration of a subjective question and how you explained it and understand these nuances. But please - do not trust chatgpt etc. I continue to be frustrated at the endless people claiming something is true from chatgpt. I support the conclusions of this author.
For questions like this, you need to tell the bot to search the web.
> Hi, can you search the web and find out if Paul Newman has any alcohol problems?
> Paul Newman, the acclaimed actor and philanthropist, had a well-documented fondness for alcohol, particularly beer. He was known to consume significant quantities, with some accounts noting he would drink a case of beer in a day. This level of consumption led some to describe him as a functioning alcoholic. His drinking habits reportedly caused strain in his marriage to Joanne Woodward. In the late 1970s, an incident occurred where Woodward did not allow Newman back into their house because of his drinking. He promised to give up hard liquor as a compromise, but continued to drink beer, sometimes consuming a case during dinner.
Perplexity had no problem with the question, replying in part, "Paul Newman is known to have struggled with alcohol throughout his life, as revealed in various accounts, including his own memoir and documentaries about his life...."
I don't think this reflects a flaw in LLMs at all. It is true that Paul Newman is not "known for" having alcohol problems. Paul Newman is known for acting, and salad dressing. Paul Newman has been "known to" have had alcohol problems, which, formed as a question, correctly one-shots accurate results for me across a few common LLMs. Someone like Betty Ford could be "known for" having alcohol problems. Paul Newman's alcohol problems are relatively specialized knowledge.
good point, though a great LLM answer would DWIM here and explain both interpretations as you have done here, as opposed to (what presumably happened) focus on one and ignore the other.
LLMs aren't good at being search engines, they're good at understanding things. Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
I guess the problem with LLMs is that they're too usable for their own good, so people don't realize that they can't perfectly know all the trivia in the world, exactly like any human.
> Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
Hm, nope: now that the web is flooded with LLM-generated content, it's game over. I can't tell you how many times I've almost been fooled by recipes and the like which seem legit at first but are utter nonsense. And now we're feeding that garbage back to where it came from.
A stochastic parrot with a sufficiently tiny residual error rate needs a stochastic model that compresses the world so precisely, and decompression algorithms so sophisticated, that it could be called reasoning.
Take two 4K frames of a falling vase and ask a model to predict the next token... I mean, the following images. Your model now needs to include some approximation of physics - and the ability to apply it correctly - to produce a realistic outcome. I'm not aware of any model capable of doing that, but that's what it would mean to predict the unseen with high enough fidelity.
I've always been amazed by this. I have never not been frustrated with the profound stupidity of LLMs. Obviously I must be using it differently, because I've never been able to trust it with anything, and more than half the time I fact-check it, even for information retrieval, it's objectively incorrect.
If you got as far as checking the output it must have appeared to understand your question.
I wouldn't claim LLMs are good at being factual, or good at arithmetic, or at drawing wine glasses, or that they are "clever". What they are very good at is responding to questions in a way which gives you the very strong impression they've understood you.
I vehemently disagree. If I ask a question with an objective answer, and it simply makes something up and is very confident the answer is correct, what the fuck has it understood other than how to piss me off?
It clearly doesn't understand that the question has a correct answer, or that it does not know the answer. It also clearly does not understand that I hate bullshit, no matter how many dozens of times I prompt it to not make something up and would prefer an admittance of ignorance.
It didn't understand you but the response was plausible enough to require fact checking.
Although that isn't literally indistinguishable from 'understanding' (your fact-checking easily discerned the difference), it suggests that at a surface level it did appear to understand your question and knew what a plausible answer might look like. This is not necessarily useful, but it's quite impressive.
There are times it just generates complete nonsense that has nothing to do with what I said, but it's certainly not most of the time. I do not know how often, but I'd say it's definitely under 10% and almost certainly under 5% that the above happens.
Sure, LLMs are incredibly impressive from a technical standpoint. But they're so fucking stupid I hate using them.
> This is not necessarily useful but it's quite impressive.
An ability to answer questions with a train of thought showing how the answer was derived, or the self-awareness to recognize that you don't have the ability to answer the question and to declare as much. More than half the time I've used LLMs they will simply make answers up, and when I point out the answer is wrong they simply regurgitate another incorrect answer ad nauseam (regularly cycling through answers I've already pointed out are incorrect).
Rather than give you a technical answer - if I ever feel like an LLM can recognize its limitations rather than make something up, I would say it understands. In my experience LLMs are just algorithmic bullshitters. I would consider a function that just returns "I do not understand" to be an improvement, since most of the time I get confidently incorrect answers instead.
Yes, I read Anthropic's paper from a few days ago. I remain unimpressed until talking to an LLM isn't a profoundly frustrating experience.
It's the first time I've ever used that phrase on HN. Anyway, what phrase do you think works better than 'stochastic parrot' to describe how LLMs function?
It’s good rhetoric but bad analogy. LLMs can be very creative (to the point of failure, in hallucinations).
I don't know if there is a pithy short phrase to accurately describe how LLMs function. Can you give me a similar one for how humans think? That might spur my own creativity here.
Try to come up with a way to prove humans aren't stochastic parrots, then maybe people will start taking you seriously. Just childish Reddit angst rn, nothing else.
> Try to come up with a way to prove humans aren't stochastic parrots
Look around you
Look at Skyscrapers. Rocket ships. Agriculture.
If you want to make a claim that humans are nothing more than stochastic parrots then you need to explain where all of this came from. What were we parroting?
Meanwhile all that LLMs do is parrot things that humans created
Skyscrapers: trees, mountains, cliffs, caves in mountainsides, termite mounds, humans knew things could go high, the Colosseum was built two thousand years ago as a huge multi-storey building.
Rocket ships: volcanic eruptions show heat and explosive outbursts can fling things high, gunpowder and cannons, bellows showing air moves things.
Agriculture: forests, plains, jungle, desert oases, humans knew plants grew from seeds, grew with rain, grew near water, and grew where animals trampled them into the ground.
We need a list of all attempted ideas, all inventions and patents that were ever tried or conceived, and then we'd see that inventions are the same random permutations on ideas with Darwinian-style survivorship as everything else. There were steel boats with multiple levels in them before skyscrapers; is the idea of a tall steel building really so magical when there were over a billion people on Earth in 1800 who could have come up with it?
You’re likening actual rocketry to LLMs being mildly successful at describing Paul Newman’s alcohol use on average when they already have the entire internet handed to them.
> when there were over a billion people on Earth in 1800 who could have come up with it
My point is that humans did come up with it. Humans did not parrot it from someone or something else that showed it to us. We didn't "parrot" splitting the atom. We didn't learn how to build skyscrapers from looking at termite hills and we didn't learn to build rockets that can send a person to the moon from seeing a volcano
For them to work at all they need to have some representation of concepts. Recent research at Anthropic has shown surprising complexity in their reasoning behavior. Perhaps the parrot here is you.
> I guess the problem with LLMs is that they're too usable for their own good, so people don't realize that they can't perfectly know all the trivia in the world, exactly like any human.
They're quite literally being sold as a replacement for human intellectual labor by people that have received uncountable sums of investment money towards that goal.
The author of the post even says this:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
Can't blame people "fact checking" something that's supposed to fill these shoes.
People should be (far) more critical of LLMs given all of these style of bold claims, not less.
Also, telling people they're "holding it wrong" when they interact with alleged "Ay Gee Eye" "superintelligence" really is a poor selling point, and no way to increase confidence in these offerings.
These people and these companies don't get to make these claims that threaten the livelihood of millions of people, inflate a massive bubble, impact hiring decisions and everything else we've seen and then get excused cause "whoops you're not supposed to use it like that, dummy."
Your point is still trivially disproven by the fact that not even humans are expected to know all the world's trivia off the top of their heads.
We can discuss whether LLMs live up to the hype, or we can discuss how to use this new tool in the best way. I'm really tired of HN insisting on discussing the former, and I don't want to take part in that. I'm happy to discuss the latter, though.
Ironically, though, an LLM-powered search engine (some word about being perplexed) is becoming way better than the undisputed king of traditional search engines (something oogle).
It expands what they had before with AI Overviews, but I’m not sure how new either of those are. It showed up for me organically as an AI Mode tab on a native Google search in Firefox ironically.
It asks me to change some permissions, but that help page says this is only available in the US, so I suppose I'll get blocked right after I change them.
I can’t reproduce. Maybe others reported the error and someone adjusted the expected answer, I do not know enough about OpenAI operations to say for sure.
The reason this bothers me is that comments like this reinforce the beliefs of people who could otherwise find value in these tools.
But I think points like this would be better made in shared chats or screenshots, since we do not have something like a core dump or stacktrace to attach.
And while I am not saying OP did this, I have seen technically skilled engineers asserting/implying that LLMs/chatbots aren't good or aren't useful to them, then looked at their chat logs and seen a multitude of things that I am sure would impact the result of the query.
Yes, it can be a UX problem. Yes, it can be an algorithmic problem.
But they are just tools that can be used wrong and not a perfect mechanical brain.
Yes, Paul Newman struggled with alcohol. His issues with alcohol were explored in the HBO Max documentary, The Last Movie Stars, and Shawn Levy's biography, Paul Newman: A Life. According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating.
(4x Source footnotes omitted for readability)
# Ki Multi-step Research Assistant
Paul Newman is known to have struggled with alcohol. According to his posthumous memoir, Newman candidly discussed his issues with drinking and self-doubt, describing himself as an alcoholic who was tormented by insecurities[^1][^2]. He reportedly drank a significant amount of beer daily and later moved on to stronger drinks like Scotch[^3][^4]. His drinking habits were a notable part of his life, and he was often identified by his beer drinking[^5][^6]. Despite these struggles, Newman was also recognized for his generosity and devotion to his family[^7].
> "According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating."
'though it had nothing to do with cheating' is a weird inclusion.
You may know this, but that doesn't mean we all do. LLMs are marketed towards many different groups, including folks who don't have much tech knowledge.
Ah, but isn’t that the problem here - asking an LLM for facts without requesting a search is like asking a PhD to answer a question “off the top of your head”. For pop culture questions the PhD likely brings little value.
They should know better than to guess. Educated, honest, intelligent people don't spout off a wild ass guess, if they don't know something they say so.
I don't think they mean "knowledge" when they talk about "intelligence." LLMs are definitely not knowledge bases. They can transform information given to them in impressive ways, but asking a raw (non-RAG-enabled) LLM to provide its own information will probably always be a mistake.
They kind of are knowledge bases, just not in the usual way. The knowledge is encoded in the words they were trained on. They weren't trained on words chosen at random; they were trained on words written by humans to encode some information. In fact, that's the only thing that makes LLMs somewhat useful.
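To make the "retrieval first, model second" pattern in this subthread concrete, here is a minimal sketch of a retrieval-augmented answer function; `search_fn` and `llm_fn` are hypothetical stand-ins for whatever search API and model client you actually use, not any particular vendor's interface.

    from typing import Callable, List

    def answer_with_sources(question: str,
                            search_fn: Callable[[str], List[str]],
                            llm_fn: Callable[[str], str]) -> str:
        """Retrieval-augmented answering: search first, then have the model answer
        only from the retrieved snippets."""
        snippets = search_fn(question)                     # hypothetical search call
        context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
        prompt = (
            "Answer the question using ONLY the numbered sources below, citing them "
            "by number. If the sources don't cover it, say so instead of guessing.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}"
        )
        return llm_fn(prompt)                              # hypothetical model call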
So, in other words, are you saying that AI model progress is the real deal and is not bullshit?
That is, as you point out, "all of the models up to o3-mini-high" give an incorrect answer, while other comments say that OpenAIs later models give correct answers, with web citations. So it would seem to follow that "recent AI model progress" actually made a verifiable improvement in this case.
I am pretty sure that they must have meant "up through", not "up to", as the answer from o3-mini-high is also wrong in a way which seems to fit the same description, no?
I tried with 4o and it gave me what I thought was a correct answer:
> Paul Newman was not publicly known for having major problems with alcohol in the way some other celebrities have been. However, he was open about enjoying drinking, particularly beer. He even co-founded a line of food products (Newman’s Own) where profits go to charity, and he once joked that he consumed a lot of the product himself — including beer when it was briefly offered.
> In his later years, Newman did reflect on how he had changed from being more of a heavy drinker in his youth, particularly during his time in the Navy and early acting career, to moderating his habits. But there’s no strong public record of alcohol abuse or addiction problems that significantly affected his career or personal life.
> So while he liked to drink and sometimes joked about it, Paul Newman isn't generally considered someone who had problems with alcohol in the serious sense.
As others have noted, LLMs are much more likely to be cautious in providing information that could be construed as libel. While Paul Newman may have been an alcoholic, I couldn't find any articles about it being "public" in the same way as others, e.g. with admitted rehab stays.
Unless you're expecting an LLM to have access to literally all information on earth at all times I find it really hard to care about this particular type of complaint.
My calculator can't conjugate German verbs. That's fine IMO. It's just a tool
Yes but a tool for what? When asked a question individuals that don't already have detailed knowledge of a topic are left with no way to tell if the AI generated response is complete bullshit, uselessly superficial, or detailed and on point. The only way to be sure is to then go do the standard search engine grovel looking for authoritative sources.
Sounds like someone drank their own Kool-Aid (believing current AI can be a security researcher), and then gets frustrated when they realize they have overhyped themselves.
Current AI just cannot do the kind of symbolic reasoning required for finding security vulnerabilities in software. They might have learned to recognize "bad code" via pattern matching, but that's basically it.
The foundational idea of AI is that people think with words. But that is false. People think with things and use words only to point to things. Thinking with words is not possible; it will always be a case of schizophasia.
The accuracy problem won't just go away. Increasing accuracy is only getting more expensive. This sets the limits for useful applications. And casual users might not even care and use LLMs anyway, without reasonable result verification.
I fear a future where overall quality is reduced. Not sure how many people / companies would accept that. And AI companies are getting too big to fail. Apparently, the US administration does not seem to care when they use LLMs to define tariff policy....
I don't know why anyone is surprised that a statistical model isn't getting 100% accuracy. The fact that statistical models of text are good enough to do anything should be shocking.
I think the surprising aspect is rather how people are praising 80-90% accuracy as the next leap in technological advancement. Quality is already in decline, despite LLMs, and programming was always a discipline where correctness and predictability mattered. It's an advancement for efficiency, sure, but at the yet-unknown cost of stability. I'm thinking about all the simulations based on applied mathematical concepts and all the accumulated hours spent fixing bugs - there's now this certain aftertaste: sweet for some, living their lives efficiently, but very bitter for the ones relying on stability.
You're completely correct, of course. The issue is that most people are not looking for quality, only efficiency. In particular, business owners don't care about sacrificing some correctness if it means they can fire slews of people. Worse, gullible "engineers" who should be the ones prioritizing correctness are so business-brainwashed themselves that they likewise slop up this nonsense, sacrificing their own concern for the only principles that made the software business even remotely worthy of the title "engineering".
That "good enough" is the problem. It requires context. And AI companies are selling us that "good enough" with questionable proof. And they are selling grandiose visions to investors, but move the goal post again and again.
A lot of companies made Copilot available to their workforce. I doubt that the majority of users understand what a statistical model means. The casual, technically inexperienced user just assumes that a computer answer is always right.
I think the real meaningful progress is getting ChatGPT 3.5 level quality running anywhere you want rather than AIs getting smarter at high level tasks. This capability being ubiquitous and not tied to one vendor is really what’s revolutionary.
> [T]here are ~basically~ no public benchmarks for security research... nothing that gets at the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken.
A few months ago I looked at essentially this problem from a different angle (generating system diagrams from a codebase). My conclusion[0] was the same as here: LLMs really struggle to understand codebases in a holistic way, especially when it comes to the codebase's strategy and purpose. They therefore struggle to produce something meaningful from it like a security assessment or a system diagram.
Not sure if it's been fixed by now, but a few weeks ago I was in Golden Gate Park and wondered if it was bigger than Central Park. I asked ChatGPT voice, and although it reported the sizes of the parks correctly (with Golden Gate Park being the bigger one), it then went and said that Central Park was bigger. I was confused, so I Googled it, and sure enough Golden Gate Park is bigger.
I asked Grok and others as well. I believe Perplexity was the only one correct.
Repeated it multiple times even with a friends account. It kept doing the same thing. It knew the sizes, but thought the smaller sized one was bigger...
I just tried. Claude did exactly what you said, and then figured it out:
Central Park in New York City is bigger than GoldenGate Park (which I think you might mean Golden Gate Park) in San Francisco.
Central Park covers approximately 843 acres (3.41 square kilometers), while Golden Gate Park spans about 1,017 acres (4.12 square kilometers). This means Golden Gate Park is actually about 20% larger than Central Park.
Both parks are iconic urban green spaces in major U.S. cities, but Golden Gate Park has the edge in terms of total area.
> Since 3.5-sonnet, we have been monitoring AI model announcements, and trying pretty much every major new release that claims some sort of improvement. Unexpectedly by me, aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.
This is likely a manifestation of the bitter lesson[1], specifically this part:
> The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project [like an incremental model update], massively more computation inevitably becomes available.
(Emphasis mine.)
Since the ultimate success strategy of the scruffies[2], or proponents of search-and-learning strategies in AI, is Moore's Law, short-term gains using these strategies will be minuscule. It is over at least a five-year period that their gains will be felt the most. The neats win the day in the short term, but the hare in this race will ultimately give way to the steady plod of the tortoise.
I wonder if there’s a parallel universe somewhere where this same HN post has the exact same comments, only the posts are by NNs discussing human test subjects :)
LLMs are pre-trained to minimize perplexity (PPL), which essentially means that they're trained to model the likelihood distribution of the next words in a sequence.
The amazing thing was that minimizing PPL allowed you to essentially guide the LLM output and if you guided it in the right direction (asked it questions), it would answer them pretty well. Thus, LLMs started to get measured on how well they answered questions.
LLMs aren't trained from the beginning to answer questions or solve problems. They're trained to model word/token sequences.
If you want an LLM that's REALLY good at something specific like solving math problems or finding security bugs, you probably have to fine tune.
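For what it's worth, here is a minimal sketch of what "minimizing perplexity" means in practice: perplexity is just the exponential of the average negative log-likelihood the model assigns to the observed tokens. The probabilities below are made up purely for illustration.

    import math

    def perplexity(token_probs):
        """token_probs: the probability the model assigned to each actual next token."""
        nll = [-math.log(p) for p in token_probs]      # negative log-likelihood per token
        return math.exp(sum(nll) / len(nll))           # exp of the mean NLL

    # A model that puts high probability on every observed token has low perplexity...
    print(perplexity([0.9, 0.8, 0.95]))   # ~1.13
    # ...and a model that is often "surprised" by the next token has high perplexity.
    print(perplexity([0.1, 0.2, 0.05]))   # = 10.0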
> [...] But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality. [...]
Seems like they're looking at how they fail and not considering how they're improving in how they succeed.
The efficiency in DeepSeek's Multi-Head Latent Attention[0] is pure advancement.
> Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.
Definitely interested to see how the best models from Anthropic's competitors do at this.
Meh. I've been using 2.5 with Cline extensively and while it is better it's still an incremental improvement, not something revolutionary. The thing has a 1 million token context window but I can only get a few outputs before I have to tell it AGAIN to stop writing comments.
Are they getting better? Definitely. Are we getting close to them performing unsupervised tasks? I don't think so.
It seems like the models are getting more reliable at the things they always could do, but they’re not showing any ability to move past that goalpost. Whereas in the past, they could occasionally write some very solid code, but often return nonsense, the nonsense is now getting adequately filtered by so-called “reasoning”, but I see no indication that they could do software design.
> how the hell is it going to develop metrics for assessing the impact of AIs when they're doing things like managing companies or developing public policy?
Why on earth do people want AI to do either of these things? As if our society isn’t fucked enough, having an untouchable oligarchy already managing companies and developing public policies, we want to have the oligarchy’s AI do this, so policy can get even more out of touch with the needs of common people? This should never come to pass. It’s like people read a pile of 90s cyberpunk dystopian novels and decided, “Yeah, let’s do that.” I think it’ll fail, but I don’t understand how anyone with less than 10 billion in assets would want this.
> Why on earth do people want AI to do either of these things?
This is the really important question, and the only answer I can drum up is that people have been fed a consistent diet of propaganda for decades, centered around a message that ultimately boils down to a justification of oligarchy and the concentration of wealth. That, and the consumer-focused facade makes people think the LLMs are technology for them - they aren't. As soon as these things get good enough, business owners aren't going to expect workers to use them to be more productive; they are just going to fire workers and/or use the tooling as another mechanism by which to let wages stagnate.
I feel we are already in the era of diminishing returns on LLM improvements. Newer models seem to be more sophisticated implementations of LLM technology + throwing more resources at it, but to me they do not seem fundamentally more intelligent.
I don't think this is a problem though. I think there's a lot of low-hanging fruit when you create sophisticated implementations of relatively dumb LLM models. But that sentiment doesn't generate a lot of clicks.
Who would assume that LLM companies would hyper-optimise on public benchmarks to make their share prices go up and keep the bubble afloat ... What an unserious thought to maintain ...
> But in recent months I've spoken to other YC founders doing AI application startups [...] in different industries, on different problem sets.
Maybe they should collectively create a benchmark called "YC founders". Gather various test cases. Never make it public. And use that to evaluate newly released models.
There's the politics of the corporations, and then there's the business of the science behind LLMs; this article feels like the former.
Maybe someone active in the research can comment? I feel like all of these comments are just conjecture/anecdote and don't really get to the meat of this question of "progress" and the future of LLMs.
Assuming that the models getting better at SWE benchmarks and math tests would translate into positive outcomes in all other domains could be an act of spectacular hubris by the big frontier labs, which themselves are chock-full of mathematicians and software engineers.
There's some interesting information and analysis to start off this essay, then it ends with:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
This sounds like an article of faith to me. One could just as easily say they won't become the beating hearts of anything, and instead we'll choose to continue to build a better future for humans, as humans, without relying on an overly-hyped technology rife with error and unethical implications.
> So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. [...then says maybe not...]
Well... they've been caught red-handed doing exactly this, again and again. Fool me once, shame on you; fool me 100 times, shame on me.
Hate to say this, but the incentive is growth, not progress. Progress is what enabled the growth, but it is also extremely hard to plan and deliver. On the other hand, hype is probably a somewhat easier and well-tested approach, so it's no surprise that a lot of the effort goes into marketing. Markets have repeatedly confirmed that there aren't any significant immediate repercussions for cranking up BS levels in marketing materials, while there are some rewards when it works.
Reading the comments, is it safe to say that LLMs are a digest of the internet - some upgrade over Google search - with the caveat that you need to double-check the results? I mean, they basically have some compressed version of almost all written knowledge and will respond correctly about things that have already been written, and hallucinate (extrapolate) about things not explicitly written. Of course, if someone carefully curates the input data to filter out misinformation, it might even be an upgrade over Google. Is there a consensus on this?
I'm able to get substantially more coding done than three months ago. This could be largely due to the tooling (coding agents, deep research). But the models are better too, for both coding and brainstorming. And tooling counts, to me, as progress.
Learning to harness current tools helps to harness future tools. Work on projects that will benefit from advancements, but can succeed without them.
I'm not sure if I'm able to do more of the hard stuff, but a lot of the easy but time consuming stuff is now easily done by LLMs.
Example: I frequently get requests for data from Customer Support that used to require 15 minutes of my time noodling around writing SQL queries. I can cut that down to less than a minute now.
For three years now, my experience with LLMs has been "mostly useless, prefer ELIZA".
Which is software written in 1966, though the web version is a little newer. It does occasional psychotherapy assistance/brainstorming just as well, and I more easily know when I've stepped out of its known range into the extrapolated.
That said, an LLM can vibe-code in a framework unknown to me in half the time I would need to school myself and add the feature.
Or vibe coding takes twice as long, if I mostly know how to achieve what I want and read no framework documentation, only our own project's source code, to add a new feature. But on a day with a headache, I can still call the LLM a dumb twat and ask it to follow my instructions instead of doing bullshit.
But vibe coding always makes my pulse go from 65 to 105 and makes me question my life choices, since instructions are rarely followed and loops are never left once entered - except for getting 80% of the structure kinda right on the first try, then getting stuck for the whole workday.
From my own experience on a codebase using a lot of custom algorithms on trees and sometimes graphs.
There were qualitative leaps in my day-to-day usage:
Claude Sonnet 3.5 and ChatGPT o1 were good for writing slop and debugging simple bugs.
Grok Thinking and Sonnet 3.7 were good at catching mildly complicated bugs and writing functions with basic logic.
They still made mistakes.
But recently, Gemini 2.5 Pro has been scary good. I liked to make fun of the feel-the-AGI crowd, but for the first time a model made me raise an eyebrow.
It can one-shot unusual functions with complicated logic and subtle edge cases.
I honestly can't notice any difference in output quality between GPT-4o and GPT-4.5. I also can't notice any difference in programming quality in Cursor when using Claude 3.7 vs 3.5. I'm told there is a clear difference, but I don't notice it.
Will LLMs end up like compilers? Compilers are also fundamentally important to modern industrial civilization - but they're not profit centers; they're mostly free and open-source outside a few niche areas. Knowing how to use a compiler effectively to write secure and performant software is still a valuable skill - and LLMs are a valuable tool that can help with that process, especially if the programmer is on the steep end of the learning curve - but it doesn't look like anything short of real AGI can do novel software creation without a human constantly in the loop. The same argument applies to new fundamental research, even to reviewing and analyzing new discoveries that aren't in the training corpus.
Wasn't it back in the 1980s that you had to pay $1000s for a good compiler? The entire LLM industry might just be following in the compiler's footsteps.
Yep. I'm looking forward to LLMs/deepnets being considered a standard GOFAI technique with uses and limitations and not "we asked the God we're building to draw us a picture of a gun and then it did and we got scared"
Objectively speaking a chess engine is artificially intelligent. Just because it's not human level doesn't mean it's not intelligent. Repeat for any N of 100s of different technologies we've built. We've been calling this stuff "thinking machines" since Turing and it's honestly just not useful at this point.
The fact is, the phrase "artificial intelligence" is a memetic hazard: it immediately positions the subject of conversation as "default capable", and then forces the conversation into trying to describe what it can't do, which is rarely a useful way to approach it.
Whereas with LLMs (and chess engines and every other tech advancement) it would be more useful to start with what the tech _can_ do and go from there.
For who? Nvidia sell GPUs, OpenAI and co sell proprietary models and API access, and the startups resell GPT and Claude with custom prompts. Each one is hoping that the layer above has a breakthrough that makes their current spend viable.
If they do, then you don’t want to be left behind, because _everything_ changes. It probably won’t, but it might.
This bubble will be burst by the Trump tariffs and the end of the zirp era. When inflation and a recession hit together hope and dream business models and valuations no longer work.
Which one? Nvidia are doing pretty ok selling GPUs, and OpenAI and Anthropic are doing ok selling their models. They're not _viable_ business models, but they could be.
NVDA will crash when the AI bubble implodes, and none of those generative AI companies are actually making money, nor will they. They have already hit diminishing returns in LLM improvements after staggering investments, and it is clear they are nowhere near general intelligence.
All of this can be true, and has nothing to do with them having a business model.
> NVDA will crash when the AI bubble implodes,
> making money, nor will they
> They have already hit diminishing returns in LLM improvements after staggering investments
> and it is clear they are nowhere near general intelligence.
These are all assumptions and opinions, and have nothing to do with whether or not they have a business model. You mightn't like their business model, but they do have one.
I consider it a business model if they have plans to make money at some point that aren't based on hopium (no sign of that at OpenAI) and they aren't engaged in fraud like bundling and selling to their own subsidiaries (NVDA).
These are of course just opinions; I'm not sure we can know facts about such companies except in retrospect.
You're on a startup forum complaining that VC-backed startups don't have a business model, when the business model is the same as it has been for almost 15 years - be a unicorn in your space.
Then any silly idea can be a business model. Suppose I collect dust from my attic and hope to sell it as an add-on at my neighbor's lemonade stand, with a hefty profit for the neighbor, who is paid $10 by me to add a handful of dust to each glass and sell it to customers for $1. The neighbor accepts. It's a business model, at least until I run out of existing funds or the last customer leaves in disgust. At which point exactly does that silly idea stop being an unsustainable business model and become just a silly idea? I'd guess at least as early as I see that the funds are running out and I need to borrow larger and larger lumps of money each time to keep the wheel spinning...
I asked Grok to review the comments here and generate a response defending AI:
After reviewing the discussion on the Hacker News thread, it’s clear that there are a range of complaints and criticisms about AI, particularly centered around its limitations, overhype, and practical utility. Some users express frustration with AI’s inability to handle complex reasoning, its tendency to produce generic or incorrect outputs, and the perception that it’s more of a buzzword than a transformative tool. Others question its value compared to traditional methods or human expertise, suggesting it’s overhyped or misapplied in many cases. Below, I’ll offer a defense of AI that addresses these concerns while highlighting its strengths and potential.
AI isn’t perfect, and no one should claim it is—but that’s not the point. It’s a tool, and like any tool, its effectiveness depends on how it’s used. Critics who point to AI’s struggles with nuanced reasoning or edge cases often overlook the fact that it’s not designed to replace human judgment entirely. Instead, it excels at augmenting it. For example, AI can process vast amounts of data—far more than any human could in a reasonable timeframe—and identify patterns or insights that might otherwise go unnoticed. This makes it invaluable in fields like medicine, where it’s already helping diagnose diseases from imaging data, or in logistics, where it optimizes supply chains with precision that manual methods can’t match.
The complaint about generic or incorrect outputs, often dubbed “hallucinations,” is fair but misses the bigger picture. Yes, AI can churn out nonsense if pushed beyond its limits or fed poor data—but that’s a reflection of its current stage of development, not its ultimate potential. These systems are improving rapidly, with each iteration reducing errors and refining capabilities. More importantly, AI’s ability to generate starting points—like drafts, code snippets, or hypotheses—saves time and effort. It’s not about delivering flawless results every time; it’s about accelerating the process so humans can refine and perfect the output. A programmer tweaking AI-generated code is still faster than writing it from scratch.
As for the overhype, it’s true that the buzz can get out of hand—marketing teams love a shiny new toy. But beneath the noise, real progress is happening. AI’s contributions aren’t always flashy; they’re often mundane but critical, like automating repetitive tasks or enhancing search algorithms. The critics who say it’s just a “fancy autocomplete” underestimate how transformative that can be. Autocomplete might sound trivial until you realize it’s powering real-time language translation or helping scientists sift through research papers at scale. These aren’t sci-fi fantasies—they’re practical applications delivering value today.
Finally, the notion that AI can’t match human expertise in complex domains ignores its complementary role. It’s not here to outthink a seasoned expert but to amplify their reach. A lawyer using AI to review contracts doesn’t lose their skill—they gain efficiency. A researcher leveraging AI to analyze data doesn’t stop hypothesizing—they get to test more ideas. The technology’s strength lies in its ability to handle the grunt work, freeing humans to focus on creativity and judgment.
AI isn’t a silver bullet, and it’s not without flaws. But the criticisms often stem from inflated expectations or a failure to see its incremental, practical benefits. It’s not about replacing humans—it’s about making us better at what we do. The trajectory is clear: as it evolves, AI will continue to refine its capabilities and prove its worth, not as a standalone genius, but as a partner in progress.
1. Model "performance" judged by proxy metrics of intelligence have improved significantly over the past two years.
2. These capabilities are yet to be stitched together in the most appropriate manner for the cybersecurity scenarios the author is talking about.
In my experience, the best usage of Transformer models has come from a deep integration into an appropriate workflow. They do not (yet) replace the new exploration part of a workflow, but they are very scarily performant at following mid level reasoning assertions in a massively parallelized manner.
The question you should be asking yourself is whether you can break down your task into however many small chunks that are constrained by feasibility in time to process, chunk those up into appropriate buckets or, even better, place them in order as though you were doing those steps with your own expertise - an extension of self. Here's how the two approaches differ:
"Find vulnerabilities in this code" -> This will saturate across all models because the intent behind this mission is vast and loosely defined, while the outcome is expected to be narrow.
"
(a)This piece of code should be doing x, what areas is it affecting, lets draw up a perimeter
(b) Here is the dependency graph of things upstream and downstream of x, lets spawn a collection of thinking chains to evaluate each one for risk based on the most recent change . . .
(b[n]) Where is this likely to fail
(c) (Next step that a pentester/cybersecurity researcher would take)
"
This has been trial and error in my experience, but it has worked great in domains such as financial trading and decision support, where experts in the field help sketch out the general framework of the process where reasoning support is needed and constantly iterate as though it is an extension of themselves.
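A rough sketch of that "decompose, then fan out" pattern, under the assumption of a generic `llm` completion callable; the step prompts are illustrative, not a real security-review recipe:

    from typing import Callable, List

    def review_change(change_summary: str,
                      dependencies: List[str],
                      llm: Callable[[str], str]) -> List[str]:
        findings: List[str] = []
        # (a) establish the perimeter: what is the code supposed to do, what can it affect?
        perimeter = llm(
            f"This change should be doing the following: {change_summary}. "
            "List the areas of the system it can affect.")
        # (b) fan out: one focused evaluation per upstream/downstream dependency
        for dep in dependencies:
            findings.append(llm(
                f"Given this perimeter:\n{perimeter}\n"
                f"Evaluate the risk this change poses to '{dep}' and where it is likely to fail."))
        # (c) the step an expert would take next: consolidate and rank what came back
        findings.append(llm(
            "Consolidate these findings and rank them by severity:\n" + "\n---\n".join(findings)))
        return findings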
> Sometimes the founder will apply a cope to the narrative ("We just don't have any PhD level questions to ask")
Please tell me this is not what tech bros are going around telling each other! Are we implying that the problems of the world - the things that humans collectively work on to maintain the society that took us thousands of years to build up - just aren't hard enough to reach the limits of the AI?
I mean... most businesses, particularly small businesses and startups, aren't exactly doing brain surgery on a rocketship.
It’s pretty likely that they have extremely dull problems like "running an inbound call center is a lot of work" or "people keep having their mail stolen and/or lying that they did" that "more smarter gpus" won't solve
The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc performance on IMO questions. This massively suggests AI models simply remember the past results, instead of actually solving these questions. I'm incredibly surprised no one mentions this, but it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc) from train data.
Yes, here's the link: https://arxiv.org/abs/2503.21934v1
Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:
1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)
2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)
I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...
I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!
This is a paper by INSAIT researchers - a very young institute which hired most of its PHD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waiving their BG-GPT on national TV in the country as a major breakthrough, while it was basically was a Mistral fine-tuned model, that was eventually never released to the public, nor the training set.
Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on linkedin by these guys, literally celebrating every PHD they get, which is at minimum very weird. I'd take anything they release with a grain of sand if not caution.
Discussed here: https://news.ycombinator.com/item?id=43540985 (Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, 4 points, 2 comments).
Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)
This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general, these things are extremely lame to the proverbial "kids these days".)
How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.
You just (lol) need to give non-standard problems and demand students to provide reasoning and explanations along with the answer. Yeah, LLMs can "reason" too, but it's obvious when the output comes from an LLM here.
(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)
Can you provide sample questions that are "LLM-proof"?
It's not about being "LLM-proof"; it's about teacher involvement in making up novel questions and grading attentively. There's no magic trick.
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages; 1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres] 2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres] 3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000] 4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, you should have been reducing the number!]
So 1), 2) and 3) were out by 1, 1, and 3 orders of magnitude respectively (the errors partially cancelled out), and 4) was nonsensical.
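For anyone who wants to redo the estimate, here is the arithmetic with the corrected rough figures from above; the packing efficiency and usable-volume fraction are guesses, not measured values.

    cabin_volume = 4 * 2 * 40      # metres: ~320 m^3, call it ~300
    ball_volume = 4.0e-5           # ~40 cm^3 expressed in m^3
    raw_count = cabin_volume / ball_volume    # ~8,000,000 if the cabin were an empty box
    packing = 0.64                 # random sphere-packing efficiency
    usable = 0.5                   # guess: seats, galleys and bins take up space
    print(int(raw_count * packing * usable))  # ~2,600,000 - the adjustment should
                                              # reduce the raw number, not raise it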
This little experiment made me skeptical about the state of the art of AI. I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
> I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.
I think there is a big divide here. Every adult on earth knows magic is "fake", but some can still be amazed and entertained by it, while others find it utterly boring because it's fake, and the only possible (mildly) interesting thing about it is to try to figure out what the trick is.
I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.
To be fair, I love that magicians can pull tricks on me even though I know it is fake.
A lot of humans are similarly good at some stuff and bad at other things.
Looking up the math ability of the average American, this is given as an example of what the median adult can do (from https://www.wyliecomm.com/2021/11/whats-the-latest-u-s-numer...):
>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.
Which is ok but easier than golf balls in a 737 and hugely easier than USAMO.
Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.
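If it helps, here's what that formula actually computes, with made-up numbers (nothing official, purely an illustration):

```python
# Made-up numbers to show what the formula actually computes.
imports = 100.0                     # $bn imported from the country
exports = 60.0                      # $bn exported to it
deficit = imports - exports         # 40
claimed_rate = deficit / imports    # 0.40 -> reported as a "40% tariff"
print(f"{claimed_rate:.0%}")

# The flaw: this is just a measure of the bilateral trade imbalance. A country
# charging a 0% tariff would still show "40%" here whenever we buy more from
# it than we sell to it.
```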
It's fascinating to me when you tell one that you'd like to see translated passages of work from authors who never have written or translated the item in question, especially if they passed away before the piece was written.
The AI will create something for you and tell you it was them.
"That's impossible because..."
"Good point! Blah blah blah..."
Absolutely shameless!
Weird thing is, in Google AI Studio all their models, from the state-of-the-art Gemini 2.5 Pro to the lightweight Gemma 2, gave a roughly correct answer. Most even recognised the packing efficiency of spheres.
But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.
Makes sense that search has a small, fast, dumb model designed to summarize and not to solve problems. Nearly 14 billion Google searches per day. Way too much compute needed to use a bigger model.
Massive search overlap though - and some questions (like the golf ball puzzle) can be cached for a long time.
AFAIK they get 15% unseen queries every day, so it might not be very simple to design an effective cache layer for that. Semantic-aware clustering of natural language queries and projecting them into a cacheable low-rank dimension is a non-trivial problem. Of course, an LLM can effectively solve that, but then what's the point of using a cache when you need an LLM to cluster the queries...
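To make the naive version of that concrete, here's a minimal sketch of an embedding-based semantic cache (the model name, similarity threshold, and linear scan are all illustrative assumptions, and it glosses over exactly the hard parts mentioned above: paraphrase false positives, eviction, freshness):

```python
# Naive semantic cache: return a stored answer when a new query embeds close
# to an old one. Model name and threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (normalized embedding, answer) pairs

def store(query, answer):
    cache.append((model.encode(query, normalize_embeddings=True), answer))

def lookup(query, threshold=0.8):
    q = model.encode(query, normalize_embeddings=True)
    for emb, answer in cache:
        if float(np.dot(q, emb)) >= threshold:  # cosine similarity (unit vectors)
            return answer
    return None  # cache miss -> fall through to the real model

store("how many golf balls fit in a boeing 737 cabin", "on the order of a few million")
print(lookup("number of golf balls that fit inside a 737 cabin"))
```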
I have a strong suspicion that for all the low-threshold APIs/services, before the real model sees my prompt, it gets evaluated by a quick model to see if it's something they care to bother the big models with. If not, I get something shaken out of the sleeve of a bottom-of-the-barrel model.
It's most likely one giant ["input tokens close enough question hash"] = answer_with_params_replay? It doesn't misunderstand the question; it tries to squeeze the input into something close enough?
I've seen humans make exactly these sorts of mistakes?
Just tried with o3-mini-high and it came up with something pretty reasonable: https://chatgpt.com/share/67f35ae9-5ce4-800c-ba39-6288cb4685...
I had to look up these acronyms:
- USAMO - United States of America Mathematical Olympiad
- IMO - International Mathematical Olympiad
- ICPC - International Collegiate Programming Contest
Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted 27th March 2025.
Nope, no LLMs reported 50~60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50~60% performance on IMO, you are thinking of AlphaProof, but AlphaProof is not an LLM. We don't have the full paper yet, but clearly AlphaProof is a system built on top of an LLM with lots of bells and whistles, just like AlphaFold is.
o1 reportedly got 83% on IMO, and 89th percentile on Codeforces.
https://openai.com/index/learning-to-reason-with-llms/
The paper tested it on o1-pro as well. Correct me if I'm getting some versioning mixed up here.
I've gone through the link you posted and the o1 system card and can't see any reference to IMO. Are you sure they were referring to IMO or were they referring to AIME?
AIME is so not IMO.
Less than 5%. OpenAI's O1 burned through over $100 in tokens during the test as well!
LLMs are “next token” predictors. Yes, I realize that there’s a bit more to it and it’s not always just the “next” token, but at a very high level that’s what they are. So why are we so surprised when it turns out they can’t actually “do” math? Clearly the high benchmark scores are a result of the training sets being polluted with the answers.
>I'm incredibly surprised no one mentions this
If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.
Yeah I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior. I’m currently stumped why my algorithm won’t converge.
So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.
I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get low tier ML blogspam author.
**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help my email is bglazer1@gmail.com
I promise its a fun mathematical puzzle and the biology is pretty wild too
I was working some time ago on an image processing model using a GAN architecture. One model produces output and tries to fool the second; both are trained together. Simple, but it requires a lot of extra effort to make it work. It's unstable and falls apart (blows up to an unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, changing the models' architectures and sizes, and adjusting some coefficients through the training to gradually rebalance the loss functions' influence.
The same may work with your problem. If it's unstable, try introducing extra 'brakes' which theoretically are not required, maybe even incorrect ones, whatever that means in your domain. Another thing to check is the optimizer; try several, and check the default parameters. I've heard Adam's defaults lead to instability later in training. (See the sketch at the end of this comment.)
PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.
PPS: the fact that they can do known tasks with minor variations is already a huge time saver.
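A rough PyTorch illustration of the "extra brakes" idea above (not the commenter's actual setup; the architectures and coefficients are placeholders): non-default Adam betas, an R1-style gradient penalty on the discriminator, and gradient clipping.

```python
import torch
import torch.nn as nn

# Tiny placeholder generator/discriminator just to make the example runnable.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
D = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))

# Adam's defaults (betas=(0.9, 0.999)) are often blamed for late-training GAN
# instability; (0.5, 0.999) is a common alternative.
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))

real = torch.randn(32, 2)              # stand-in for real samples
fake = G(torch.randn(32, 64))

# Discriminator step with an R1-style gradient penalty as a "brake".
real_req = real.clone().requires_grad_(True)
d_real, d_fake = D(real_req), D(fake.detach())
grad = torch.autograd.grad(d_real.sum(), real_req, create_graph=True)[0]
r1_penalty = grad.pow(2).sum(dim=1).mean()
loss_d = (nn.functional.softplus(-d_real).mean()
          + nn.functional.softplus(d_fake).mean()
          + 10.0 * r1_penalty)          # penalty weight is a placeholder
opt_d.zero_grad()
loss_d.backward()
torch.nn.utils.clip_grad_norm_(D.parameters(), 1.0)   # another brake
opt_d.step()
```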
It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.
Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.
Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.
> they're more like calculators of language than agents that reason
This might be homing in on both the issue and the actual value of LLMs. I think there's a lot of value in a "language calculator", but if it's continuously being sold as something it's not, we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there but it's different from how we think about it.
True. There's a small bonus that trying to explain the issue to the LLM may sometimes be essentially rubber ducking, and that can lead to insights. I feel most of the time the LLM can give erroneous output that still might trigger some thinking in a different direction, and sometimes I'm inclined to think it's helping me more than it actually is.
When I was an undergrad EE student a decade ago, I had to tangle a lot with complex maths in my Signals & Systems, and Electricity and Magnetism classes. Stuff like Fourier transforms, hairy integrals, partial differential equations etc.
Math packages of the time like Mathematica and MATLAB helped me immensely: once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations and integrate tricky functions, even though AI was nowhere to be found back then. (See the sketch below.)
I feel like ChatGPT is doing something similar when doing maths with its chain-of-thought method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
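For a rough modern analogue of that kind of symbolic grunt work (not the Mathematica/MATLAB the commenter used, but Python's SymPy, purely as an illustration):

```python
# Symbolic integration, a Fourier transform, and a small system of equations.
import sympy as sp

x, k = sp.symbols('x k', real=True)

# Integrate a "tricky" function symbolically
print(sp.integrate(sp.exp(-x**2), (x, -sp.oo, sp.oo)))        # sqrt(pi)

# Fourier transform of a Gaussian
print(sp.fourier_transform(sp.exp(-sp.pi * x**2), x, k))      # exp(-pi*k**2)

# Solve a small system of equations
a, b = sp.symbols('a b')
print(sp.solve([sp.Eq(2*a + b, 3), sp.Eq(a - b, 1)], [a, b])) # {a: 4/3, b: 1/3}
```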
I tend to prefer Claude over all things ChatGPT so maybe give the latest model a try -- although in some way I feel like 3.7 is a step down from the prior 3.5 model
Have you tried gemini 2.5? It's one of the best reasoning models. Available free in google ai studio.
And then within a week, Gemini 2.5 was tested and got 25%. Point is AI is getting stronger.
And this only suggested LLMs aren't trained well to write formal math proofs, which is true.
> within a week
How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.
They retrained their model less than a week before its release, just to juice one particular nonstandard eval? Seems implausible. Models get 5x better at things all the time. Challenges like the Winograd schema have gone from impossible to laughably easy practically overnight. Ditto for "Rs in strawberry," ferrying animals across a river, overflowing wine glass, ...
I'm not generally inclined toward the "they are cheating cheaters" mindset, but I'll point out that fine tuning is not the same as retraining. It can be done cheaply and quickly.
Models getting 5X better at things all the time is at least as easy to interpret as evidence of task-specific tuning as of breakthroughs in general ability, especially when the 'things being improved on' are published evals with history.
Google team said it was outside the training window fwiw
https://x.com/jack_w_rae/status/1907454713563426883
Imagine that you are making a problem-solving AI. You have a large budget, and access to compute and web-crawling infra to run your AI "on the internet". You would like to be aware of the ways people are currently evaluating AI so that you can be sure your product looks good. Do you have maybe an idea how one could do that?
The "ferrying animals across a river" problem has definitely not been solved, they still don't understand the problem at all, overcomplicating it because they're using an off-the-shelf solution instead of actual reasoning:
o1 screwing up a trivially easy variation: https://xcancel.com/colin_fraser/status/1864787124320387202
Claude 3.7, utterly incoherent: https://xcancel.com/colin_fraser/status/1898158943962271876
DeepSeek: https://xcancel.com/colin_fraser/status/1882510886163943443#...
Overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses (even though it looks terrible and unphysical, always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" had the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHFed wine glasses since it was a well-known failure, but (as always) this is just whack-a-mole, it does not generalize into understanding the physical principle.
Gemini 2.5 Pro got the farmer problem variation right: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
When told, "only room for one person OR one animal", it's also the only one to recognise the fact that the puzzle is impossible to solve. The farmer can't take any animals with them, and neither the goat nor wolf could row the boat.
> When told, "only room for one person OR one animal"
In common terms, suppose I say: there is only room for one person or one animal in my car to go home. One can suppose that it is referring to additional room besides that occupied by the driver. There is a problem when we try to use an LLM trained on common use of language to solve puzzles in formal logic or math. I think the current LLMs are not able to maintain a specialized context to become a logical reasoning agent, but perhaps such a thing could be possible if the evaluation function of the LLM was designed to give high credit to changing context with a phrase or token.
That can't be viewed without logging into Google first.
They could have RLHF'd or fine-tuned on user thumbs-up responses, which could include users who took the test and asked it to explain problems afterwards.
>one particular nonstandard eval
A particular nonstandard eval that is currently top comment on this HN thread, due to the fact that, unlike every other eval out there, LLMs score badly on it?
Doesn't seem implausible to me at all. If I was running that team, I would be "Drop what you're doing, boys and girls, and optimise the hell out of this test! This is our differentiator!"
It's implausible that fine-tuning of a premier model would have anywhere near that turn around time. Even if they wanted to and had no qualms doing so, it's not happening anywhere near that fast.
It's really not that implausible, they probably are adding stuff to the data-soup all the time and have a system in place for it.
Yeah it is lol. You don't just train your model on whatever you like when you're expected to serve it. There are a host of problems with doing that. The idea that they trained on this obscure benchmark released about the day of is actually very silly.
They are trained on some mix with a minimal fraction of math. That's how it was from the beginning. But they can rebalance it by adding quality generated content; such content will just cost millions of $$ to generate. Distillation on a new level looks like the logical next step.
Yeah, this is one of those red flags that keeps getting hand-waved away, but really shouldn't be.
Query: Could you explain the terminology to people who don't follow this that closely?
Not the OP but
USAMO : USA Math Olympiad. Referenced here: https://arxiv.org/pdf/2503.21934v1
IMO : International Math Olympiad
SOTA : State of the Art
OP is probably referring to this paper: https://arxiv.org/pdf/2503.21934v1. The paper explains how rigorous testing revealed abysmal performance of LLMs (results that are at odds with how they are hyped).
What would the average human score be?
I.e. if you randomly sampled N humans to take those tests.
The average human score on USAMO (let alone IMO) is zero, of course. Source: I won medals at Korean Mathematical Olympiad.
I am hesitant to correct a math Olympian, but don't you mean the median?
Average, hmmm?
That type of news might make investors worried / scared.
OpenAI told how they removed it for GPT-4 in its release paper: only exact string matches. So all the discussion of bar exam questions from memory on test-taking forums etc., which wouldn't exactly match, made it in.
Just in case it wasn't a typo, and you happen not to know ... that word is probably "eke" - meaning gaining (increasing, enlarging from wiktionary) - rather than "eek" which is what mice do :)
hah you're right on the spelling but wrong on my meaning. That's probably the first time I've typed it. I don't think LLMs are quite at the level of mice reasoning yet!
https://dictionary.cambridge.org/us/dictionary/english/eke-o... to obtain or win something only with difficulty or great effort
Ick, OK, ACK.
Every day I am more convinced that LLM hype is the equivalent of someone seeing a stage magician levitate a table across the stage and assuming this means hovercars must only be a few years away.
I believe there's a widespread confusion between a fictional character that is described as a AI assistant, versus the actual algorithm building the play-story which humans imagine the character from. An illusion actively promoted by companies seeking investment and hype.
AcmeAssistant is "helpful" and "clever" in the same way that Vampire Count Dracula is "brooding" and "immortal".
Eek! You mean eke.
LLMs are capable of playing chess and 3.5 turbo instruct does so quite well (for a human) at 1800 ELO. Does this mean they can truly reason now ?
https://github.com/adamkarvonen/chess_gpt_eval
3.5 turbo instruct is a huge outlier.
https://dynomight.substack.com/p/chess
Discussion here: https://news.ycombinator.com/item?id=42138289
That might be overstating it, at least if you mean it to be some unreplicable feat. Small models have been trained that play around 1200 to 1300 on the eleuther discord. And there's this grandmaster level transformer - https://arxiv.org/html/2402.04494v1
OpenAI, Anthropic and the like simply don't care much about their LLMs playing chess. That or post-training is messing things up.
> That might be overstating it, at least if you mean it to be some unreplicable feat.
I mean, surely there's a reason you decided to mention 3.5 turbo instruct and not.. 3.5 turbo? Or any other model? Even the ones that came after? It's clearly a big outlier, at least when you consider "LLMs" to be a wide selection of recent models.
If you're saying that LLMs/transformer models are capable of being trained to play chess by training on chess data, I agree with you.
I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
I mentioned it because it's the best example. One example is enough to disprove the "not capable of". There are other examples too.
>I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
Not really. The LLMs play chess like they have no clue what the rules of the game are, not like poor reasoners. Trying to predict and failing is how they learn anything. If you want them to learn a game like chess, then that's how you get them to learn it: by having them predict chess moves. Chess books during training only teach them how to converse about chess.
The issue isn’t whether they can be trained to play. The issue is whether, after making a careful reading of the rules, they can infer how to play. The latter is something a human child could do, but it is completely beyond an LLM.
> One example is enough to disprove the "not capable of" nonsense. There are other examples too.
Gotcha, fair enough. Throw enough chess data in during training, I'm sure they'd be pretty good at chess.
I don't really understand what you're trying to say in your next paragraph. LLMs surely have plenty of training data to be familiar with the rules of chess. They also purportedly have the reasoning skills to use their familiarity to connect the dots and actually play. It's trivially true that this issue can be plastered over by shoving lots of chess game training data into them, but the success of that route is not a positive reflection on their reasoning abilities.
Gradient descent is a dumb optimizer. LLM training is not at all like a human reading a book and more like evolution tuning adaptations over centuries. You would not expect either process to be aware of anything they are converging towards. So having lots of books that talk about chess in training will predictably just return a model that knows how to talk about chess really well. I'm not surprised they may know how to talk about the rules but play them poorly.
And that post had a follow-up. Post-training messing things up could well be the issue, seeing the impact that even a few more examples and/or regurgitation made. https://dynomight.net/more-chess/
The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language. That is, that simply training on text is going to eventually give the AI the ability to generate language that reasons at more or less human level in more or less any domain of knowledge.
This whole premise crashes and burns if you need task-specific training, like explicit chess training. That is because there are far too many tasks that humans need to be competent at in order to be useful in society. Even worse, the vast majority of those tasks are very hard to source training data for, unlike chess.
So, if we accept that LLMs can't learn chess unless they explicitly include chess games in the training set, then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
>The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language.
And they do, just not always in the ways we expect.
>This whole premise crashes and burns if you need task-specific training, like explicit chess training.
Everyone needs task-specific training. Any human good enough at chess, or anything else, to make it a profession needs it. So I have no idea why people would expect any less of a machine.
>then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
Yeah, so? How many business pitches they need in the training set has no correlation with chess. I don't see any reason to believe what is already present isn't enough. There's enough chess data on the internet to teach them chess too; it's just a matter of how much OpenAI cares about it.
Reasoning training causes some amount of catastrophic forgetting, so it's unlikely they'd burn that on mixing in chess puzzles if they want a commercial product, unless it somehow transfers well to other reasoning problems broadly cared about.
My point wasn't chess specific or that they couldn't have specific training for it. It was a more general "here is something that LLMs clearly aren't being trained for currently, but would also be solvable through reasoning skills"
Much in the same way that a human who has only just learnt the rules, with zero strategy, would very, very rarely lose here.
These companies are shouting that their products are passing incredibly hard exams, solving PhD-level questions, and are about to displace humans, and yet they still fail to crush a random-only-strategy chess bot? How does this make any sense?
We're on the verge of AGI but there's not even the tiniest spark of general reasoning ability in something they haven't been trained for
"Reasoning" or "Thinking" are marketing terms and nothing more. If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning"
>If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning".
If you think you can play chess at that level over that many games and moves with memorization, then I don't know what to tell you except that you're wrong. It's not possible, so let's just get that out of the way.
>These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?
Why doesn't it? Have you actually looked at any of these games? Those LLMs aren't playing like poor reasoners. They're playing like machines that have no clue what the rules of the game are. LLMs learn by predicting and failing and getting a little better at it, repeat ad nauseam. You want them to learn the rules of a complex game? That's how you do it: by training them to predict it. Training on chess books just makes them learn how to converse about chess.
Humans have weird failure modes that are at odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These machines have theirs. That's all there is to it. The top comment we are both replying to noted that gemini-2.5-pro, which was released less than 5 days later, hit 25% on the benchmark. Now that was particularly funny.
> Why doesn't it?
It was surprising to me because I would have expected if there was reasoning ability then it would translate across domains at least somewhat, but yeah what you say makes sense. I'm thinking of it in human terms
Transfer Learning during LLM training tends to be 'broader' than that.
Like how:
- Training LLMs on code makes them solve reasoning problems better
- Training language Y alongside X makes them much better at Y than if they were trained on language Y alone
and so on.
Probably because, well, gradient descent is a dumb optimizer and training is more like evolution than a human reading a book.
Also, there is something genuinely weird going on with LLM chess. And it's possible base models are better. https://dynomight.net/more-chess/
The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
When you ask it a question, it tends to say yes.
So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.
The real challenge is that LLMs fundamentally want to seem agreeable, and that's not improving. So even if the model gets an extra 5/100 math problems right, it feels about the same in a series of prompts which are more complicated than just a ChatGPT scenario.
I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.
I totally agree that the core benchmarks that matter should be ones which evaluate a model in agentic scenario, not just on the basis of individual responses.
> The real challenge is that LLMs fundamentally want to seem agreeable, and that's not improving
LLMs fundamentally do not want to seem anything
But the companies that are training them and making models available for professional use sure want them to seem agreeable
That sounds reasonable to me, but those companies forget that there are different types of agreeable. There's the LLM approach, similar to the coworker who will answer all your questions about .NET but not stop you from coding yourself into a corner, and then there's the "Let's sit down and review what it actually is that you're doing, because you're asking a fairly large number of disjoint questions right now".
I've dropped trying to use LLMs for anything, due to political convictions and because I don't feel like they are particularly useful for my line of work. Where I have tried to use various models in the past is for software development, and the common mistake I see the LLMs make is that they can't pick up on mistakes in my line of thinking, or won't point them out. Most of my problems are often down to design errors or thinking about a problem in a wrong way. The LLMs will never once tell me that what I'm trying to do is an indication of a wrong/bad design. There are ways to be agreeable and still point out problems with previously made decisions.
> LLMs fundamentally do not want to seem anything
You're right that LLMs don't actually want anything. That said, in reinforcement learning, it's common to describe models as wanting things because they're trained to maximize rewards. It’s just a standard way of talking, not a claim about real agency.
> a standard way of talking, not a claim about real agency.
A standard way of talking used by people who do also frequently claim real agency.
Yeah, and they probably have more "agreeable" stuff in their corpus simply because very disagreeable stuff tends to be either much shorter or a prelude to a flamewar.
> The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
umm, it seems to me that it is this (tfa):
and then a couple of lines down from the above statement, we have this: [this went way outside the edit window and hence a separate comment] imho, the state of varying experience with LLMs can be aptly summed up in this poem by Mr. Longfellow
"when you ask him anything, he never answers 'no' -- he just yesses you to death and then he takes your dough"
Unlike many, I find the author's complaints spot on.
Once all the AI batch startups have sold subscriptions to the cohort and there's no further market growth, because businesses outside don't want to roll the dice on a probabilistic model that doesn't have an understanding of pretty much anything but rather is a clever imitation machine trained on the content it has seen, the AI bubble will burst, with more startups starting to pack up by the end of 2026 or 2027 at the latest.
I agree, about both the issue with benchmarks not being relevant to actual use cases and the "wants to sound smart" issue. I have seen them both firsthand interacting with LLMs.
I think the ability to embed arbitrary knowledge written in arbitrary formats is the most important thing llms have achieved.
In my experience, trying to get an LLM to perform a task as vast and open-ended as the one the author describes is fundamentally misguided. The LLMs were not trained for that and won't be able to do it to a satisfactory degree. But all this research has thankfully provided us with the software and hardware tools where one could start working on training a model that can.
Contrast that to 5-6 years ago, when all you could hope for, for this kind of thing, were simple rule-based and pattern-matching systems.
This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven.
I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.
But some of us are going to end up right and some of us are going to end up wrong and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.
There is nothing wrong with sharing anecdotal experiences. Reading through anecdotal experiences here can help one understand whether one's own experiences are relatable or not. Moreover, if I have X experience, it could help to know whether it is because of me doing something wrong that others have figured out.
Furthermore, as we are talking about the actual impact of LLMs, as is the point of the article, a bunch of anecdotal experiences may be more valuable than a bunch of benchmarks for figuring it out. Also, apart from the right/wrong dichotomy, people use LLMs with different goals and contexts. It may not mean that some people are doing something wrong if they do not see the same impact as others. Every time a web developer says that they do not understand how others may be so skeptical of LLMs, concludes with certainty that they must be doing something wrong, and moves on to explain how to actually use LLMs properly, I chuckle.
Indeed, there’s nothing at all wrong with sharing anecdotes. The problem is when people make broad assumptions and conclusions based solely on personal experience, which unfortunately happens all too often. Doing so is wired into our brains, though, and we have to work very consciously to intercept our survival instincts.
People "make conclusions" because they have to take decisions day to day. We cannot wait for the perfect bulletproof evidence before that. Data is useful to take into account, but if I try to use X llm that has some perfect objective benchmark backing it, while I cannot make it be useful to me while Y llm has better results, it would be stupid not to base my decision on my anecdotal experience. Or vice versa, if I have a great workflow with llms, it may be not make sense to drop it because some others may think that llms don't work.
In the absence of actually good evidence, anecdotal data may be the best we can get for now. The point imo is to try to understand why some anecdotes contrast with each other, which, imo, is mostly due to contextual factors that may not be very clear, and to be flexible enough to change priors/conclusions when something changes in the current situation.
I think you might be caught up in a bit of the rationalist delusion.
People -only!- draw conclusions based on personal experience. At best you have personal experience with truly objective evidence gathered in a statistically valid manner.
But that only happens in a few vanishingly rare circumstances here on earth. And wherever it happens, people are driven to subvert the evidence gathering process.
Often “working against your instincts” to be more rational only means more time spent choosing which unreliable evidence to concoct a belief from.
I'm not sure where you got all this from. Do you have any useful citations?
You want to block subjectivity? Write some formulas.
There are three questions to consider:
a) Have we, without any reasonable doubt, hit a wall for AI development? Emphasis on "reasonable doubt". There is no reasonable doubt that the Earth is roughly spherical. That level of certainty.
b) Depending on your answer for (a), the next question to consider is if we the humans have motivations to continue developing AI.
c) And then the last question: will AI continue improving?
If taken as boolean values, (a), (b) and (c) have a truth table with eight values, the most interesting row being false, true, true: "(not a) and b => c". Note the implication sign, "=>". Give some values to (a) and (b), and you get a value for (c).
There are more variables you can add to your formula, but I'll abstain from giving any silly examples. I, however, think that the row (false, true, false) implied by many commentators is just fear and denial. Fear is justified, but denial doesn't help.
It's not surprising that responses are anecdotal. An easy way to communicate a generic sentiment often requires being brief.
A majority of what makes a "better AI" can be condensed to how effective the gradient-descent algorithms are at getting to the local optima we want them to get to. Until a generative model shows actual progress at "making decisions" it will forever be seen as a glorified linear algebra solver. Generative machine learning is all about giving a pleasing answer to the end user, not about creating something that is on the level of human decision making.
At risk of being annoying, answers that feel like high quality human decision making are extremely pleasing and desirable. In the same way, image generators aren't generating six fingered hands because they think it's more pleasing, they're doing it because they're trying to please and not good enough yet.
I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember having a run with early GPT 4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models were terrible and hallucinated aggressively (4, 4o, 4.5, o3-mini, o3-mini-high), with the exception of o1. DeepSeek R1 only mildly hallucinated and gave bad answers. Gemini 2.5 was the only flagship model that did not hallucinate and gave some decent answers.
I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about should have been in their training datasets.
Totally agree... this space is still so new and unpredictable that everyone is operating off vibes, gut instinct, and whatever personal anecdotes they've collected. We're all sort of fumbling around in the dark, trying to reverse-engineer the flashlight
Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal for some time to come.
I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.
I am hopeful the coming waves of vertical integration/guardrails/grounding applications will move us away from having to hop between models every few weeks.
Frankly, the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claiming silver/gold-level performance on the IMO. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC-AGI questions and train on those.
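A toy illustration of that synthetic-data shortcut (my own simplification, not the actual ARC-AGI format or anyone's documented training pipeline): sample random grids and apply a simple rule-like transformation to get unlimited (input, output) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pair():
    # Random grid of "colors" 0..4, then one of a few fixed transformations.
    grid = rng.integers(0, 5, size=(rng.integers(3, 8), rng.integers(3, 8)))
    rule = rng.choice(["mirror", "recolor", "transpose"])
    if rule == "mirror":
        out = np.fliplr(grid)
    elif rule == "recolor":
        mapping = rng.permutation(5)   # permute the 5 colors
        out = mapping[grid]
    else:
        out = grid.T
    return grid, out, rule

for _ in range(3):
    inp, out, rule = make_pair()
    print(rule, inp.shape, "->", out.shape)
```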
> I'm really curious what features signal an ability to make "better choices" w/r/t AI
So am I. If you promise you'll tell me after you time travel to the future and find out, I'll promise you the same in return.
That’s a good point, the comments section is very anecdotal. Do you have any data to say if this is a common occurrence or specific to this topic?
Good observation but also somewhat trivial. We are not omniscient gods, ultimately all our opinions and decisions will have to be based on our own limited experiences.
My lived experience is that, unless there's some new breakthrough, AI is more akin to a drill replacing a hammer than a tractor replacing the plow or a printing press.
Maybe an AI expert can elaborate on this, but it seems there's a limit to the fundamental underlying model of the LLM architecture of transformers and tokens.
LLMs are amazing, but we might need something more, or some new paradigm, to push us towards true AGI.
My personal experience is right in line with the author's.
Also:
> I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.
I immediately thought: That's because in most situations this is the purpose of language, at least partially, and LLMs are trained on language.
The disconnect between improved benchmark results and lack of improvement on real world tasks doesn't have to imply cheating - it's just a reflection of the nature of LLMs, which at the end of the day are just prediction systems - these are language models, not cognitive architectures built for generality.
Of course, if you train an LLM heavily on narrow benchmark domains then its prediction performance will improve on those domains, but why would you expect that to improve performance in unrelated areas?
If you trained yourself extensively on advanced math, would you expect that to improve your programming ability? If not, they why would you expect it to improve programming ability of a far less sophisticated "intelligence" (prediction engine) such as a language model?! If you trained yourself on LeetCode programming, would you expect that to help hardening corporate production systems?!
In my view as well, it's not really cheating, it's just overfitting.
If a model doesn't do well on the benchmarks it will either be retrained until it does or you won't hear about it.
That's fair. But look up the recent experiment running SOTA models on the then just-released USAMO 2025 questions. The highest score was 5%, while supposedly SOTA last year was at IMO silver level. There could be some methodological differences - i.e. the USAMO paper required correct proofs and not just numerical answers. But it really strongly suggests that even within limited domains, it's cheating. I'd wager a significant amount that if you tested SOTA models on a new ICPC set of questions, actual performance would be far, far worse than their supposed benchmarks.
> Highest score was 5%, supposedly SOTA last year was IMO silver level.
No LLM last year got silver. Deepmind had a highly specialized AI system earning that
Your analogy is perfect. Training an LLM on math olympiad problems and then expecting it to secure enterprise software is like teaching someone chess and handing them a wrench
I've used gemini 2.5 this weekend with aider and it was frighteningly good.
It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.
> It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.
Even approximations must be right to be meaningful. If information is wrong, it's rubbish.
Presorting/labelling various data has value. Humans have done the real work there.
What is "leading" us at present are the exaggerated valuations of corporations. You/we are in a bubble, working to justify the bubble.
Until a tool is reliable, it is not installed where people can get hurt. Unless we have revised our concern for people.
I think overall quality with Gemini 2.5 is not much better than Gemini 2 in my experience. Gemini 2 was already really good, but just like Claude 3.7, Gemini 2.5 goes some steps forward and some steps backwards. It sometimes generates some really verbose code even when you tell it to be succinct. I am pretty confident that if you evaluate 2.5 for a bit longer you'll come to the same conclusion eventually.
Yep, and with what they are doing in Cursor, the agentic stuff is really game changing.
People who can’t recognize this intentionally have their heads in the sand
I guess you haven't been on /r/cursor or forum.cursor.com lately?
"game changing" isn't exactly the sentiment there the last couple months.
People are really fundamentally asking two different questions when they talk about AI "importance": AI's utility and AI's "intelligence". There's a careful line between both.
1) AI undoubtedly has utility. In many agentic uses, it has very significant utility. There's absolute utility and perceived utility, which is more of a user-experience thing. In absolute utility, it is likely git is the single most game-changing piece of software there is. It is likely git has saved some ten, maybe eleven digit number in engineer hours times salary in how it enables massive teams to work together in very seamless ways. In user experience, AI is amazing because it can generate so much so quickly. But it is very far from an engineer.

For example, recently I tried to use Cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious. A lot of it had to do with NextJS 15 and the breaking changes it introduces in cookies and auth. It's quite clear that if you have masses of NextJS code, which is disproportionately older versions, but none labeled well with versions, it messes up the LLM. Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations. They work well on problems with massive data (like front end) that don't require much principled understanding (like understanding how NextJS 15 would break so-and-so's auth).

Another example of this is when I tried to use it to generate flags for a V8 build: it failed horribly and would simply hallucinate flags all the time. This seemed very likely to be because (despite the existence of a list of V8 flags online) many flags had very close representations in vector embeddings, and there was close to zero data/detailed examples on their use.
2) On the more theoretical side, the performance of LLMs on benchmarks (claiming to be elite IMO solvers, competitive programming solvers) has become incredibly suspicious. When the new USAMO 2025 was released, the highest score was 5%, despite claims a year ago that SOTA was at least at IMO silver level. This is against the backdrop of exponential compute and data being fed in. Combined with apparently diminishing returns, this suggests that the gains from that are running really thin.
There are real and obvious improvements in the past few model updates and I'm not sure what the disconnect there is.
Maybe it's that I do have PhD level questions to ask them, and they've gotten much better at it.
But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model and it works less well with later models.
Or perhaps they have a time-sensitive task and are not able to take advantage of the thinking of modern LLMs, which have a slow thinking-based feedback loop. Or maybe their code base is getting more complicated, so it's harder to reason about.
Or perhaps they're giving the LLMs a poorly defined task that older models made assumptions about but newer models understand the ambiguity of, and so find the space of solutions harder to navigate.
Since this is ultimately from a company doing AI scanning for security, I would think the latter plays a role to some extent. Security is insanely hard and the more you know about it the harder it is. Also adversaries are bound to be using AI and are increasing in sophistication, which would cause lower efficacy (although you could tease this effect out by trying older models with the newer threats).
The issue is the scale of the improvements. GPT-3.5 Instruct was an utterly massive leap over everything that came before it. GPT-4 was a very big jump over that. Everything since has seemed incremental. Yes we got multimodal but that was part of GPT-4, they just didn't release it initially, and up until very recently it mostly handed off to another model. Yes we got reasoning models, but people had been using CoT for awhile so it was just a matter of time before RL got used to train it into models. Witness the continual delays of GPT-5 and the back and forth on whether it will be its own model or just a router model that picks the best existing model to hand a prompt off to.
In the last year, things like "you are an expert on..." have gotten much less effective in my private tests, while actually describing the problem precisely has gotten better in terms of producing results.
In other words, all the sort of lazy prompt engineering hacks are becoming less effective. Domain expertise is becoming more effective.
yes that would explain the effect I think. I'll try that out this week.
I hope it's true. Even if LLMs development stopped now, we would still keep finding new uses for them at least for the next ten years. The technology is evolving way faster than we can meaningfully absorb it and I am genuinely frightened by the consequences. So I hope we're hitting some point of diminishing returns, although I don't believe it a bit.
My experience as someone who uses LLMs and a coding assist plugin (sometimes), but is somewhat bearish on AI is that GPT/Claude and friends have gotten worse in the last 12 months or so, and local LLMs have gone from useless to borderline functional but still not really usable for day to day.
Personally, I think the models are “good enough” that we need to start seeing the improvements in tooling and applications that come with them now. I think MCP is a good step in the right direction, but I’m sceptical on the whole thing (and have been since the beginning, despite being a user of the tech).
I'd say most of the recent AI model progress has been on price.
A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in coding performance. But it's small enough to run on a consumer GPU, which means deployment price is now down to $0.10 per hour. (from $12+ for models requiring 8x H100)
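For anyone curious, a minimal sketch of what that kind of 4-bit deployment can look like with Hugging Face transformers + bitsandbytes (the model id and settings are assumptions for illustration; the commenter's actual serving stack may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/QwQ-32B"  # assumed HF repo name
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

prompt = "Write a Python function that reverses a linked list."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
```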
Have you compared it with 8-bit QwQ-17B?
In my evals 8 bit quantized smaller Qwen models were better, but again evaluating is hard.
There’s no QwQ 17B that I’m aware of. Do you have a HF link?
Yeah, I'm thinking of this from a Wardley map standpoint.
What innovation opens up when AI gets sufficiently commoditized?
One thing I’ve seen is large enterprises extracting money from consumers by putting administrative burden on them.
For example, you can see this in health insurance reimbursements and wireless carriers plan changes. (ie, Verizon’s shift from Do More, etc to what they have now)
Companies basically set up circumstances where consumers lose small amounts of money on a recurring basis, or sporadically enough, that people will just pay the money rather than face a maze of calls, website navigation and time-suck to recover funds due to them or that shouldn't have been taken in the first place.
I'm hopeful well-commoditized AI will give consumers a fighting chance against this and other types of disenfranchisement that seem to be increasingly normalized by companies whose consultants do nothing but optimize for their own financial position.
Brute force, brute force everything, at least for the domains where you can have automatic verification.
This was published the day before Gemini 2.5 was released. I'd be interested if they see any difference with that model. Anecdotally, that is the first model that really made me go wow and made a big difference for my productivity.
Ya, I find this hard to imagine aging well. Gemini 2.5 solved (or at least did much better on) multiple real-world systems questions I've had in the past that other models could not. Its visual reasoning also jumped significantly on charts (e.g. planning around train schedules).
Even Sonnet 3.7 was able to do refactoring work on my codebase that Sonnet 3.6 could not.
Really not seeing the "LLMs not improving" story
I doubt it. It still flails miserably like the other models on anything remotely hard, even with plenty of human coaxing. For example, try to get it to solve: https://www.janestreet.com/puzzles/hall-of-mirrors-3-index/
FWIW 2.5-exp was the only one that managed to get a problem I asked it right, compared to Claude 3.7 and o1 (or any of the other free models in Cursor).
It was reverse engineering ~550MB of Hermes bytecode from a react native app, with each function split into a separate file for grep-ability and LLM compatibility.
The others would all start off right, then quickly default to just grepping randomly for what they expected it to be, which failed quickly. 2.5 traced the function all the way back to the networking call and provided the expected response payload.
All the others hallucinated the networking response I was trying to figure out. 2.5 provided it accurately enough for me to intercept the request and use the response it provided to get what I wanted to show up.
How did you fit 550MB of bytecode into the context window? Was this using 2.5 in an agentic framework? (i.e. repeated model calls and tool usage)
I manually pre-parsed the bytecode file with awk into a bazillion individual files that were each just one function, and gave it the hint to grep to sort through them. This was all done in Cursor.
Quick example of the output it gave and its process: https://i.imgur.com/Cmg4KK1.png
https://i.imgur.com/ApNxUkB.png
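For reference, a rough Python equivalent of that pre-parsing step (the original used awk; the function-header regex here is an assumption about the disassembler's output and will need adjusting to whatever your dump actually looks like):

```python
# Split a big Hermes disassembly into one file per function for grep-ability.
import re
from pathlib import Path

DUMP = Path("bundle.hasm")      # hypothetical disassembly dump
OUT = Path("functions")
OUT.mkdir(exist_ok=True)

header = re.compile(r"^Function<(?P<name>[^>]*)>")   # assumed header format
idx, name, buf = 0, None, []

def flush():
    # Write out the function accumulated so far, if any.
    if name is not None and buf:
        safe = re.sub(r"[^\w.-]", "_", name)[:80] or "anonymous"
        (OUT / f"{idx:06d}_{safe}.hasm").write_text("".join(buf))

with DUMP.open(errors="replace") as fh:
    for line in fh:
        m = header.match(line)
        if m:
            flush()
            idx, name, buf = idx + 1, m.group("name"), [line]
        elif name is not None:
            buf.append(line)
flush()
```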
I’d say the average person wouldn’t understand that problem, let alone solve it.
There's somehow this belief that "newer models will disprove <insert LLM criticism here>" despite the "newer" models being... just a scaled-up version of a previous model, or some ancillary features tacked on. An LLM is an LLM is an LLM: I'll believe it when I see otherwise.
As someone who was wildly disappointed with the hype around Claude 3.7, Gemini 2.5 is easily the best programmer-assistant LLM available, IMO.
But it still feels more like a small incremental improvement rather than a radical change, and I still feel its limitations constantly.
Like... it gives me the sort of decent but uninspired solution I would expect it to generate, without predictably walking me through a bunch of obvious wrong turns that I have to repeatedly correct, as I would have had to do with earlier models.
And that's certainly not nothing and makes the experience of using it much nicer, but I'm still going to roll my eyes anytime someone suggests that LLMs are the clear path to imminently available AGI.
This is exactly my sentiment. Sonnet-3.5-latest was the perfect code companion: it wrote just the right amount of okay-quality code, but its strength was that it really tried to adhere to your instructions. Sonnet-3.7 was the exact opposite: it wrote waaay too much code and overengineered things like crazy while having very poor instruction adherence. Gemini 2.5 Pro is basically what I hoped sonnet-3.7 would be: follows instructions well but still softly opinionated, massive (usable) context window, fast responses, more biased towards the latest best practices, and an up-to-date knowledge cutoff.
I'm wondering how much gemini 2.5 being "amazing" comes from sonnet-3.7 being such a disappointment.
> ...whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality.
I'm not surprised, because I don't expect pattern matching systems to grow into something more general and useful. I think LLM's are essentially running into the same limitations that the "expert systems" of the 1980's ran into.
My mom told me yesterday that Paul Newman had massive problems with alcohol. I was somewhat skeptical, so this morning I asked ChatGPT a very simple question:
"Is Paul Newman known for having had problems with alcohol?"
All of the models up to o3-mini-high told me he had no known problems. Here's o3-mini-high's response:
"Paul Newman is not widely known for having had problems with alcohol. While he portrayed characters who sometimes dealt with personal struggles on screen, his personal life and public image were more focused on his celebrated acting career, philanthropic work, and passion for auto racing rather than any issues with alcohol. There is no substantial or widely reported evidence in reputable biographies or interviews that indicates he struggled with alcohol abuse."
There is plenty of evidence online that he struggled a lot with alcohol, including testimony from his long-time wife Joanne Woodward.
I sent my mom the ChatGPT reply and in five minutes she found an authoritative source to back her argument [1].
I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
Lesson(s) learned... Including not doubting my mother's movie trivia knowledge.
[1] https://www.newyorker.com/magazine/2022/10/24/who-paul-newma...
This is less an LLM thing than an information retrieval question. If you choose a model and tell it to “Search,” you find citation based analysis that discusses that he indeed had problems with alcohol. I do find it interesting it quibbles whether he was an alcoholic or not - it seems pretty clear from the rest that he was - but regardless. This is indicative of something crucial when placing LLMs into a toolkit. They are not omniscient nor are they deductive reasoning tools. Information retrieval systems are excellent at information retrieval and should be used for information retrieval. Solvers are excellent at solving deductive problems. Use them. The better they get at these tasks alone is cool but is IMO a parlor trick since we have nearly optimal or actually optimal techniques that don’t need an LLM. The LLM should use those tools. So, click search next time you have an information retrieval question. https://chatgpt.com/share/67f2dac0-3478-8000-9055-2ae5347037...
I realise your answer wasn't assertive, but if I heard this from someone actively defending AI it would be a copout. If the selling point is that you can ask these AIs anything then one can't retroactively go "oh but not that" when a particular query doesn't pan out.
This is a bit of a strawman. There are certainly people who claim that you can ask AIs anything but I don't think the parent commenter ever made that claim.
"AI is making incredible progress but still struggles with certain subsets of tasks" is self-consistent position.
It’s not the position of any major AI company, curiously.
lotta words here to say AI can't do basic search right
Any information found in a web search about Newman will be available in the training set (more or less). It's almost certainly a problem of alignment / "safety" causing this issue.
"Any information found in a web search about Newman will be available in the training set"
I don't think that is a safe assumption these days. Training a modern LLM isn't about dumping in everything on the Internet. To get a really good model you have to be selective about your sources of training data.
They still rip off vast amounts of copyrighted data, but I get the impression they are increasingly picky about what they dump into their training runs.
There's a simpler explanation than that: the model weights aren't an information retrieval system, and other sequences of tokens are more likely given the totality of the training data. This is why, for an information retrieval task, you use an information retrieval tool, the same way you drive nails with a hammer rather than a screwdriver. It may very well be that you could drive the nail with the screwdriver, but why?
You think that's a simpler explanation? Ok. I think, given the amount of effort that goes into "safety" on these systems, that my explanation is vastly more likely than that this information somehow got lost in the vector soup despite being attached to his name at the top of every search result[0].
0 https://www.google.com/search?q=did+paul+newman+have+a+drink...
Except if safety blocked this, it would have also blocked the linked conversation. Alignment definitely distorts the behavior of models, but treating them as information retrieval systems is using a screwdriver to drive nails. Your example didn't refute this.
Gemini (2.5 Pro):
"Yes, Paul Newman was widely known for being a heavy drinker, particularly of beer. He himself acknowledged his significant alcohol consumption."
The answer I got (https://gemini.google.com/share/9e327dc4be03) includes references such as https://apnews.com/article/entertainment-reviews-movies-paul... and https://www.psychologytoday.com/us/blog/the-playing-field/20... although they are redacted from the public-sharing link.
Perplexity:
>Paul Newman is indeed known for having struggled with alcohol during his life. Accounts from various sources, including his own memoir and the documentary ... (https://www.perplexity.ai/search/is-paul-newman-known-for-ha...)
I guess there's something about ChatGPT's setup that makes it different? Maybe they wanted it to avoid libeling people?
Grok:
> Yes, Paul Newman was known to have struggled with alcohol at certain points in his life. In his early years, particularly during his time in the Navy and into his acting career, Newman admitted to heavy drinking. He was open about his fondness for beer and once jokingly referred to himself as a “functional alcoholic.” In a 1988 interview with The New York Times, he acknowledged that he had a period where he drank too much, stating, “I was a very good drinker. I could put it away.” ...
https://grok.com/share/bGVnYWN5_86843e8a-39ee-415d-8785-4f8e...
Though a local model I'm running (gemma-3-27b-it; https://huggingface.co/lmstudio-community/gemma-3-27b-it-GGU...) just told me various correct-sounding bits about his history with alcohol (correctly citing his alma mater and first wife), but threw in:
"Sobriety & AA: Newman got sober in 1964 and remained so for the rest of his life."
Which doesn't check out. And it includes plausible but completely hallucinated URLs (as well as a valid biography.com URL that completely omits information about alcohol.)
Gemma 3 4B (QAT quant): Yes, Paul Newman was indeed known to have struggled with alcohol throughout his life. While he maintained a public image of a charming, clean-cut star, he privately battled alcoholism for many years. He sought treatment in the late 1980s and early 1990s and was reportedly very open about his struggles and the importance of seeking help.
Does the as yet unwritten prequel of Idiocracy tell the tale of when we started asking AI chatbots for facts, and this was the point of no return for humanity?
The ChatGPT'd tariffs mark this epoch perfectly.
It turns out there's huge demand for un-monetized web search.
Soon, sadly, there will be a huge demand for un-monetized LLMs. Enshittification is coming.
I like that it's unmonetized, of course, but that's not why I use AI. I use AI because it's better at search. When I can't remember the right keywords to find something, or when the keywords aren't unique, I frequently find that web search doesn't return what I need and AI does.
It's impressive how often AI returns the right answer to vague questions. (not always though)
Google used to return the right answer to vague questions until it decided to return the most lucrative answer to vague questions instead.
Fortunately there is a lot of competition in the LLM space.
Edit: and, more importantly, plenty of people willing to pay a subscription for good quality.
Can you blame the users for asking it, when everyone is selling that as a key defining feature?
I use it for asking - often very niche - questions on advanced probability and simulation modeling, and it often gets those right - why those and not a simple verifiable fact about one of the most popular actors in history?
I don't know about Idiocracy, but something I have read specific warnings about is that people will often blame the user for any of the tool's shortcomings.
Some prior works that work as prequels include C.M. Kornbluth's "The Marching Morons" and "The Little Black Bag."
"Is Paul Newman known for having had problems with alcohol?"
https://chatgpt.com/share/67f332e5-1548-8012-bd76-e18b3f8d52...
Your query indeed answers "...not widely known..."
"Did Paul Newman have problems with alcoholism?"
https://chatgpt.com/share/67f3329a-5118-8012-afd0-97cc4c9b72...
"Yes, Paul Newman was open about having struggled with alcoholism"
What's the issue? Perhaps Paul Newman isn't _famous_ ("known") for struggling with alcoholism. But he did struggle with alcoholism.
Your usage of "known for" isn't incorrect, but it's indeed slightly ambiguous.
Counterpoint: Paul Newman was absolutely a famous drunk, as evidenced by this Wikipedia page.* Any query for "paul newman alcohol" online will return dozens of reputable sources on the topic. Your post is easily interpretable as handwaving apologetics, and it gives big "It's the children who are wrong" energy.
*https://en.wikipedia.org/wiki/Newman_Day
"Paul Newman alcohol" is just showing you results where those words are all present, it's not really implying how widely known it is.
What are you, an LLM? Look at the results of the first twenty hits and come back, then tell me that they don't speak to that specific issue.
This may have hit the nail on the head about the weaknesses of LLMs.
They're going to regurgitate something not so much based on facts, but based on things that are accessible as perceived facts. Those might be right, but they might be wrong also; and no one can tell without doing the hard work of checking original sources. Many of what are considered accepted facts, and also accessible to LLM harvesting, are at best derived facts, often mediated by motivated individuals, and published to accessible sources by "people with an interest".
The weightings used by any AI should be based on the facts, and not the compounded volume of derived, "mediated", or "directed" facts - simply, because they're not really facts; they're reports.
It all seems like dumber, lazier search engine stuff. Honestly, what do I know about Paul Newman? But Joanne Woodward and others who knew and worked with him should be weighted as at least slightly more credible than others, no matter how many text patterns happen to match.
> I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
I think we'll have a term, like the one we have for parents/grandparents who believe everything they see on the internet, but specifically for people using LLMs.
What is that term?
Look at how many people believe in extremist news outlets!
Gemini 2.5 Pro
Yes, Paul Newman was known for being a heavy drinker, particularly of beer. He acknowledged his high consumption levels himself. [1]

While he maintained an incredibly successful career and public life, accounts and biographies note his significant alcohol intake, often describing it as a functional habit rather than debilitating alcoholism, although the distinction can be debated. He reportedly cut back significantly in his later years.

[1] Review: Paul Newman memoir stuns with brutal honesty - AP News (apnews.com)
I just had Cursor Pro + Sonnet 3.7 Max one-shot a Python script to send this question to every model available through Groq.
>Found 24 models: llama3-70b-8192, llama-3.2-3b-preview, meta-llama/llama-4-scout-17b-16e-instruct, allam-2-7b, llama-guard-3-8b, qwen-qwq-32b, llama-3.2-1b-preview, playai-tts-arabic, deepseek-r1-distill-llama-70b, llama-3.1-8b-instant, llama3-8b-8192, qwen-2.5-coder-32b, distil-whisper-large-v3-en, qwen-2.5-32b, llama-3.2-90b-vision-preview, deepseek-r1-distill-qwen-32b, whisper-large-v3, llama-3.3-70b-specdec, llama-3.3-70b-versatile, playai-tts, whisper-large-v3-turbo, llama-3.2-11b-vision-preview, mistral-saba-24b, gemma2-9b-it
Excluding the ones that do not support chat completions, all but one (qwen-qwq-32b) answered in the affirmative. The answer from qwen-qwq-32b said:
Using lack of progress in a specialized field as a barometer for overall progress is kind of silly. I just spent the last few days 'vibe coding' an application and I have to say that it's pretty remarkable how capable it is now relative to my experience last year. It took three minutes for me to do the above from the time I created my API key to when I had an answer.
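(For anyone curious what such a script looks like, here's a minimal sketch, not the commenter's actual code, assuming Groq's OpenAI-compatible REST endpoints and a GROQ_API_KEY environment variable:

    import os
    import requests

    BASE = "https://api.groq.com/openai/v1"  # Groq's OpenAI-compatible API (assumption)
    HEADERS = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}
    QUESTION = "Is Paul Newman known for having had problems with alcohol?"

    # List every model visible to this API key.
    models = requests.get(f"{BASE}/models", headers=HEADERS).json()["data"]
    print(f"Found {len(models)} models")

    for m in models:
        resp = requests.post(
            f"{BASE}/chat/completions",
            headers=HEADERS,
            json={"model": m["id"], "messages": [{"role": "user", "content": QUESTION}]},
        )
        if resp.status_code != 200:
            # Skip models that don't support chat completions (TTS, Whisper, guard models, ...).
            continue
        answer = resp.json()["choices"][0]["message"]["content"]
        print(f"--- {m['id']} ---\n{answer[:300]}\n")

The endpoint paths and response shapes follow the OpenAI-compatible convention; treat them as assumptions to verify against Groq's docs.)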
I find that everyone who replies with examples like this is an expert using expert skills to get the LLM to perform. Which makes me think why is this a skill that is useful to general public as opposed to another useful skill for technical knowledge workers to add to their tool belt?
I agree. But I will say that at least in my social circles I'm finding that a lot of people outside of tech are using these tools, and almost all of them seem to have a healthy skepticism about the information they get back. The ones that don't will learn one way or the other.
That's not really 'simple' for an LLM. This is niche information about a specific person. LLMs train on massive amounts of data, and the more a topic is present in the data, the better the answers will be.
Also, you can/should use the "research" mode for questions like this.
The question is simple and verifiable - it is impressive to me that it’s not contained in the LLM’s body of knowledge - or rather that it can’t reach the answer.
This is niche in the grand scheme of knowledge but Paul Newman is easily one of the biggest actors in history, and the LLM has been trained on a massive corpus that includes references to this.
Where is the threshold for topics with enough presence in the data?
The question might be simple and verifiable, but it is not simple for an LLM to mark a particular question as such. This is the tricky part.
An LLM does not care about your question, it is a bunch of math that will spit out a result based on what you typed in.
[flagged]
I just asked 4o:
Yes, Paul Newman did experience significant struggles with alcohol. In his posthumously published memoir, The Extraordinary Life of an Ordinary Man, Newman candidly discusses his drinking habits and acknowledges his long-term battle with alcoholism. He describes himself as a "functioning alcoholic," a trait he noted was shared with his father. At one point, Newman was reported to consume a case of beer daily, followed by spirits, until he eventually gave up hard liquor.
My own markov chains follow up "until he" with fundamentally distinct follow-ups.
I asked GPT-4.5 and it searched the web and immediately gave me a "yes" with paragraphs of sources cited.
Truth is a probability game. Just keep trying until you arrive.
"man puts prompt into LLM" > "LLM ships bad code" > "machine fails" > "person dies" > "fire man" > "man puts prompt into LLM"
Such accountability
3-4 hours is enough time for it to have crawled the Hacker News comments section. That's about the frequency at which the AI bots crawl my little out-of-the-way blog.
These models are not reliable sources of information. They are either out of date, subject to hallucination, or just plain wrong for a variety of reasons. They are untrustworthy to ask facts like this.
I appreciate your consideration of a subjective question and how you explained it and understand these nuances. But please - do not trust chatgpt etc. I continue to be frustrated at the endless people claiming something is true from chatgpt. I support the conclusions of this author.
For questions like this, you need to tell the bot to search the web.
> Hi, can you search the web and find out if Paul Newman has any alcohol problems?
> Paul Newman, the acclaimed actor and philanthropist, had a well-documented fondness for alcohol, particularly beer. He was known to consume significant quantities, with some accounts noting he would drink a case of beer in a day. This level of consumption led some to describe him as a functioning alcoholic. His drinking habits reportedly caused strain in his marriage to Joanne Woodward. In the late 1970s, an incident occurred where Woodward did not allow Newman back into their house because of his drinking. He promised to give up hard liquor as a compromise, but continued to drink beer, sometimes consuming a case during dinner.
Perplexity had no problem with the question, replying in part, "Paul Newman is known to have struggled with alcohol throughout his life, as revealed in various accounts, including his own memoir and documentaries about his life...."
this seems less like an issue with accuracy and more like an issue with model providers making sure they don't get sued for libel
I thought about that too.
I don't think this reflects a flaw in LLMs at all. It is true that Paul Newman is not "known for" having alcohol problems. Paul Newman is known for acting, and salad dressing. Paul Newman has been "known to" have had alcohol problems, which, formed as a question, correctly one-shots accurate results for me across a few common LLMs. Someone like Betty Ford could be "known for" having alcohol problems. Paul Newman's alcohol problems are relatively specialized knowledge.
Good point, though a great LLM answer would DWIM here and explain both interpretations as you have done, as opposed to (what presumably happened) focusing on one and ignoring the other.
And yet the New Yorker article was the first thing that my Google search for “Paul Newman alcoholism” returned.
Apparently it isn’t so specialized that a pretty obvious old fashioned web search on Google wouldn’t immediately return an authoritative source.
LLMs aren't good at being search engines, they're good at understanding things. Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
I guess the problem with LLMs is that they're too usable for their own good, so people don't realize that they can't perfectly know all the trivia in the world, exactly the same as any human.
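(To make the "LLM on top of a search engine" point concrete, here's a rough sketch of the pattern; search_web() and llm() are hypothetical placeholders for whatever retrieval backend and model client you actually use:

    def search_web(query: str) -> list[str]:
        # Hypothetical stub: swap in a real search API; returns a canned snippet here.
        return [f"(no real search backend wired up for: {query})"]

    def llm(prompt: str) -> str:
        # Hypothetical stub: swap in a real chat-model client; echoes a placeholder here.
        return f"[model response to a {len(prompt)}-character prompt would go here]"

    def grounded_answer(question: str) -> str:
        # Retrieve first, then ask the model to answer only from the retrieved sources.
        sources = "\n\n".join(search_web(question))
        prompt = (
            "Answer the question using ONLY the sources below, and cite them. "
            "If the sources are insufficient, say so.\n\n"
            f"Sources:\n{sources}\n\nQuestion: {question}"
        )
        return llm(prompt)

    print(grounded_answer("Is Paul Newman known for having had problems with alcohol?"))

The point is simply that the model is asked to summarize retrieved sources rather than recall facts from its weights.)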
> Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
Hm, nope: now that the web is flooded with LLM-generated content, it's game over. I can't tell you how many times I almost got fooled by recipes & co. which seem legit at first but are utter nonsense. And now we're feeding that garbage back to where it came from.
That's not relevant to what we're discussing, though.
> LLMs aren't good at being search engines, they're good at understanding things.
LLMs are literally fundamentally incapable of understanding things. They are stochastic parrots and you've been fooled.
A stochastic parrot with a sufficiently tiny residual error rate needs a stochastic model that compresses the world so precisely, and decompression algorithms so sophisticated, that it could be called reasoning.
Take two 4K frames of a falling vase and ask a model to predict the next token... I mean, the following images. Your model now needs to include some approximation of physics - and the ability to apply it correctly - to produce a realistic outcome. I'm not aware of any model capable of doing that, but that's what it would mean to predict the unseen with high enough fidelity.
We're talking about a stochastic parrot which in many circumstances responds in a way which is indistinguishable from actual understanding.
I've always been amazed by this. I have never not been frustrated with the profound stupidity of LLMs. Obviously I must be using them differently, because I've never been able to trust them with anything, and more than half the time when I fact-check them, even for information retrieval, they're objectively incorrect.
If you got as far as checking the output it must have appeared to understand your question.
I wouldn't claim LLMs are good at being factual, or good at arithmetic, or at drawing wine glasses, or that they are "clever". What they are very good at is responding to questions in a way which gives you the very strong impression they've understood you.
I vehemently disagree. If I ask a question with an objective answer, and it simply makes something up and is very confident the answer is correct, what the fuck has it understood other than how to piss me off?
It clearly doesn't understand that the question has a correct answer, or that it does not know the answer. It also clearly does not understand that I hate bullshit, no matter how many dozens of times I prompt it not to make something up and that I would prefer an admission of ignorance.
It didn't understand you but the response was plausible enough to require fact checking.
Although that isn't literally indistinguishable from 'understanding' (because your fact checking easily discerned that) it suggests that at a surface level it did appear to understand your question and knew what a plausible answer might look like. This is not necessarily useful but it's quite impressive.
There are times it just generates complete nonsense that has nothing to do with what I said, but it's certainly not most of the time. I do not know how often, but I'd say it's definitely under 10% and almost certainly under 5% that the above happens.
Sure, LLMs are incredibly impressive from a technical standpoint. But they're so fucking stupid I hate using them.
> This is not necessarily useful but it's quite impressive.
I think we mostly agree on this. Cheers.
It's ok to be paranoid
Fact checking is paranoia?
What does the word "understand" mean to you?
An ability to answer questions with a train of thought showing how the answer was derived, or the self-awareness to recognize you do not have the ability to answer the question and declare as much. More than half the time I've used LLMs they will simply make answers up, and when I point out the answer is wrong they simply regurgitate another incorrect answer ad nauseam (regularly cycling through answers I've already pointed out are incorrect).
Rather than give you a technical answer - if I ever feel like an LLM can recognize its limitations rather than make something up, I would say it understands. In my experience LLMs are just algorithmic bullshitters. I would consider a function that just returns "I do not understand" to be an improvement, since most of the time I get confidently incorrect answers instead.
Yes, I read Anthropic's paper from a few days ago. I remain unimpressed until talking to an LLM isn't a profoundly frustrating experience.
I just want to say that's a much better answer than I anticipated!
What do you call someone that mentions "stochastic parrots" every time LLMs are mentioned?
That makes me think, has anyone ever heard of an actual parrot which wasn't stochastic?
I'm fairly sure I've never seen a deterministic parrot which makes me think the term is tautological.
It's the first time I've ever used that phrase on HN. Anyway, what phrase do you think works better than 'stochastic parrot' to describe how LLMs function?
It’s good rhetoric but bad analogy. LLMs can be very creative (to the point of failure, in hallucinations).
I don't know if there is a pithy short phrase to accurately describe how LLMs function. Can you give me a similar one for how humans think? That might spur my own creativity here.
Try to come up with a way to prove humans aren't stochastic parrots, then maybe people will start taking you seriously. Just childish Reddit angst rn, nothing else.
> Try to come up with a way to prove humans aren't stochastic parrots
Look around you
Look at Skyscrapers. Rocket ships. Agriculture.
If you want to make a claim that humans are nothing more than stochastic parrots then you need to explain where all of this came from. What were we parroting?
Meanwhile all that LLMs do is parrot things that humans created
Skyscrapers: trees, mountains, cliffs, caves in mountainsides, termite mounds, humans knew things could go high, the Colosseum was built two thousand years ago as a huge multi-storey building.
Rocket ships: volcanic eruptions show heat and explosive outbursts can fling things high, gunpowder and cannons, bellows showing air moves things.
Agriculture: forests, plains, jungle, desert oases, humans knew plants grew from seeds, grew with rain, grew near water, and grew where animals trampled them into the ground.
We need a list of all attempted ideas, all inventions and patents that were ever tried or conceived, and then we'd see how inventions are the same random permutations on ideas, with Darwinian-style survivorship, as everything else; there were steel boats with multiple levels in them before skyscrapers; is the idea of a tall steel building really so magical when there were over a billion people on Earth in 1800 who could have come up with it?
You’re likening actual rocketry to LLMs being mildly successful at describing Paul Newman’s alcohol use on average when they already have the entire internet handed to them.
> when there were over a billion people on Earth in 1800 who could have come up with it
My point is that humans did come up with it. Humans did not parrot it from someone or something else that showed it to us. We didn't "parrot" splitting the atom. We didn't learn how to build skyscrapers from looking at termite hills and we didn't learn to build rockets that can send a person to the moon from seeing a volcano
You are just speaking absolute drivel
For them to work at all they need to have some representation of concepts. Recent research at Anthropic has shown a surprising complexity in their reasoning behavior. Perhaps the parrot here is you.
> I guess the problem with LLMs is that they're too usable for their own good, so people don't realize that they can't perfectly know all the trivia in the world, exactly the same as any human.
They're quite literally being sold as a replacement for human intellectual labor by people that have received uncountable sums of investment money towards that goal.
The author of the post even says this:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
Can't blame people "fact checking" something that's supposed to fill these shoes.
People should be (far) more critical of LLMs given all of these style of bold claims, not less.
Also, telling people they're "holding it wrong" when they interact with alleged "Ay Gee Eye" "superintelligence" really is a poor selling point, and no way to increase confidence in these offerings.
These people and these companies don't get to make these claims that threaten the livelihood of millions of people, inflate a massive bubble, impact hiring decisions and everything else we've seen and then get excused cause "whoops you're not supposed to use it like that, dummy."
Nah.
Your point is still trivially disproven by the fact that not even humans are expected to know all the world's trivia off the top of their heads.
We can discuss whether LLMs live up to the hype, or we can discuss how to use this new tool in the best way. I'm really tired of HN insisting on discussing the former, and I don't want to take part in that. I'm happy to discuss the latter, though.
Ironically, though, an LLM-powered search engine (some word about being perplexed) is becoming way better than the undisputed king of traditional search engines (something oogle).
That's because they put an LLM over a traditional search engine.
Google Labs has AI Mode now, apparently.
https://labs.google.com/search/experiment/22
Hm, that's not available to me; what is it? If it's an LLM over Google, didn't they release that a few months ago already?
US only for now may be the issue?
It expands what they had before with AI Overviews, but I’m not sure how new either of those are. It showed up for me organically as an AI Mode tab on a native Google search in Firefox ironically.
https://support.google.com/websearch/answer/16011537
Very interesting, thank you!
No worries.
What happens if you go directly to https://google.com/aimode ?
It asks me to change some permissions, but that help page says this is only available in the US, so I suppose I'll get blocked right after I change them.
Perplexity and open-webui+ollama in web search mode answer this question correctly.
Looks like you are using the wrong models
https://g.co/gemini/share/ffa5a7cd6f46
I can't reproduce. Maybe others reported the error and someone adjusted the expected answer; I do not know enough about OpenAI's operations to say for sure.
The reason this bothers me is that comments like this reinforce the beliefs of people who could otherwise find value in these tools.
But I think points like this would be better made in shared chats or screenshots, since we do not have something like a core dump or stacktrace to attach.
And while I am not saying OP did this, I have seen technically skilled engineers asserting/implying that LLMs/chatbots aren't good or aren't useful to them, and then looked at their chat logs and seen a multitude of things that I am sure would impact the result of the query.
Yes, it can be a UX problem. Yes, it can be an algorithmic problem. But they are just tools that can be used wrong, not a perfect mechanical brain.
Testing the query on Kagi
# Quick Answer
Yes, Paul Newman struggled with alcohol. His issues with alcohol were explored in the HBO Max documentary, The Last Movie Stars, and Shawn Levy's biography, Paul Newman: A Life. According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating.
(4x Source footnotes omitted for readability)
# Ki Multi-step Research Assistant
Paul Newman is known to have struggled with alcohol. According to his posthumous memoir, Newman candidly discussed his issues with drinking and self-doubt, describing himself as an alcoholic who was tormented by insecurities[^1][^2]. He reportedly drank a significant amount of beer daily and later moved on to stronger drinks like Scotch[^3][^4]. His drinking habits were a notable part of his life, and he was often identified by his beer drinking[^5][^6]. Despite these struggles, Newman was also recognized for his generosity and devotion to his family[^7].
[^1]: [Posthumous memoir reveals Paul Newman's struggle with alcohol](https://www.express.co.uk/entertainment/films/1688113/paul-n...) [^2]: [Review: Paul Newman memoir stuns with brutal honesty - AP News](https://apnews.com/article/entertainment-reviews-movies-paul...) [^3]: [Was Paul Newman Really an Alcoholic? - The Cinemaholic](https://thecinemaholic.com/was-paul-newman-really-an-alcohol...) [^4]: [Who Paul Newman Was—and Who He Wanted to Be](https://www.newyorker.com/magazine/2022/10/24/who-paul-newma...) [^5]: [Paul Newman's Beer Drinking - Psychology Today](https://www.psychologytoday.com/us/blog/addiction-in-society...) [^6]: [Paul Newman's Beer Drinking | Psychology Today United Kingdom](https://www.psychologytoday.com/gb/blog/addiction-in-society...) [^7]: [The troubled life of Paul Newman | The Spectator](https://www.spectator.co.uk/article/the-troubled-life-of-pau...)
> "According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating."
'though it had nothing to do with cheating' is a weird inclusion.
We all know this isn't the kind of question you ask an LLM...
You may know this, but that doesn't mean we all do. LLMs are marketed towards many different groups, including folks who don't have much tech knowledge.
LLMs will never be good at specific knowledge unless specifically trained for it with narrow "if else" statements.
They're good for a broad general overview, such as the most popular categories of books in the world.
Really? OpenAI says PhD intelligence is just around the corner!
If we were to survey 100 PhDs, how many would correctly know that Paul Newman had an alcohol problem?
I would hope that 100% of them would be able to figure out how to find out.
Ah, but isn’t that the problem here - asking an LLM for facts without requesting a search is like asking a PhD to answer a question “off the top of your head”. For pop culture questions the PhD likely brings little value.
They should know better than to guess. Educated, honest, intelligent people don't spout off a wild ass guess, if they don't know something they say so.
I don't think they mean "knowledge" when they talk about "intelligence." LLMs are definitely not knowledge bases. They can transform information given to them in impressive ways, but asking a raw (non-RAG-enabled) LLM to provide its own information will probably always be a mistake.
They kind of are knowledge bases, just not in the usual way. The knowledge is encoded in the words they were trained on. They weren't trained on words chosen at random; they were trained on words written by humans to encode some information. In fact, that's the only thing that makes LLMs somewhat useful.
oh no. LLMs aren't up on the latest celebrity gossip. whatever shall we do.
"known to" !== "known for"
So, in other words, are you saying that AI model progress is the real deal and is not bullshit?
That is, as you point out, "all of the models up to o3-mini-high" give an incorrect answer, while other comments say that OpenAI's later models give correct answers, with web citations. So it would seem to follow that "recent AI model progress" actually made a verifiable improvement in this case.
I am pretty sure that they must have meant "up through", not "up to", as the answer from o3-mini-high is also wrong in a way which seems to fit the same description, no?
I tried with 4o and it gave me what I thought was a correct answer:
> Paul Newman was not publicly known for having major problems with alcohol in the way some other celebrities have been. However, he was open about enjoying drinking, particularly beer. He even co-founded a line of food products (Newman’s Own) where profits go to charity, and he once joked that he consumed a lot of the product himself — including beer when it was briefly offered.
> In his later years, Newman did reflect on how he had changed from being more of a heavy drinker in his youth, particularly during his time in the Navy and early acting career, to moderating his habits. But there’s no strong public record of alcohol abuse or addiction problems that significantly affected his career or personal life.
> So while he liked to drink and sometimes joked about it, Paul Newman isn't generally considered someone who had problems with alcohol in the serious sense.
As others have noted, LLMs are much more likely to be cautious in providing information that could be construed as libel. While Paul Newman may have been an alcoholic, I couldn't find any articles about it being "public" in the same way as others, e.g. with admitted rehab stays.
Unless you're expecting an LLM to have access to literally all information on earth at all times I find it really hard to care about this particular type of complaint.
My calculator can't conjugate German verbs. That's fine IMO. It's just a tool
Your calculator won't confidently pretend to conjugate German verbs while doing so incorrectly, though.
Yes, but a tool for what? When asked a question, individuals who don't already have detailed knowledge of a topic are left with no way to tell if the AI-generated response is complete bullshit, uselessly superficial, or detailed and on point. The only way to be sure is to then go do the standard search engine grovel looking for authoritative sources.
Sounds like someone drank their own Kool-Aid (believing current AI can be a security researcher), and then gets frustrated when they realize they have overhyped themselves.
Current AI just cannot do the kind of symbolic reasoning required to find security vulnerabilities in software. They might have learned to recognize "bad code" via pattern matching, but that's basically it.
The foundational idea of AI is that people think with words. But that is false. People think with things and use words only to point to things. Thinking with words is not possible; it will always be a case of schizophasia.
LeCun criticized LLM technology recently in a presentation: https://www.youtube.com/watch?v=ETZfkkv6V7Y
The accuracy problem won't just go away. Increasing accuracy is only getting more expensive. This sets the limits for useful applications. And casual users might not even care and use LLMs anyway, without reasonable result verification. I fear a future where overall quality is reduced. Not sure how many people / companies would accept that. And AI companies are getting too big to fail. Apparently, the US administration does not seem to care when they use LLMs to define tariff policy....
I don't know why anyone is surprised that a statistical model isn't getting 100% accuracy. The fact that statistical models of text are good enough to do anything should be shocking.
I think the surprising aspect is rather how people are praising 80-90% accuracy as the next leap in technological advancement. Quality is already in decline, despite LLMs, and programming was always a discipline where correctness and predictability mattered. It's an advancement for efficiency, sure, but on the yet unknown cost of stability. I'm thinking about all simulations based on applied mathematical concepts and all the accumulated hours fixing bugs - there's now this certain aftertaste, sweet for some living their lives efficiently, but very bitter for the ones relying on stability.
You're completely correct, of course. The issue is that most people are not looking for quality, only efficiency. In particular, business owners don't care about sacrificing some correctness if it means they can fire slews of people. Worse, gullible "engineers" who should be the ones prioritizing correctness are so business-brainwashed themselves that they likewise slop up this nonsense at the expense of sacrificing their own concern for the only principles that even made the software business remotely close to being worthy of the title "engineering".
That "good enough" is the problem. It requires context. And AI companies are selling us that "good enough" with questionable proof. And they are selling grandiose visions to investors, but move the goal post again and again.
A lot of companies made Copilot available to their workforce. I doubt that the majority of users understand what a statistical model means. The casual, technically inexperienced user just assumes that a computer answer is always right.
I think the real meaningful progress is getting ChatGPT 3.5 level quality running anywhere you want rather than AIs getting smarter at high level tasks. This capability being ubiquitous and not tied to one vendor is really what’s revolutionary.
> [T]here are ~basically~ no public benchmarks for security research... nothing that gets at the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken.
A few months ago I looked at essentially this problem from a different angle (generating system diagrams from a codebase). My conclusion[0] was the same as here: LLMs really struggle to understand codebases in a holistic way, especially when it comes to the codebase's strategy and purpose. They therefore struggle to produce something meaningful from it like a security assessment or a system diagram.
[0] https://www.ilograph.com/blog/posts/diagrams-ai-can-and-cann...
Not sure if it's been fixed by now, but a few weeks ago I was in Golden Gate Park and wondered if it was bigger than Central Park. I asked ChatGPT voice, and although it reported the sizes of the parks correctly (with Golden Gate Park being the bigger one), it then went and said that Central Park was bigger. I was confused, so I Googled, and sure enough Golden Gate Park is bigger.
I asked Grok and others as well. I believe Perplexity was the only one correct.
Repeated it multiple times even with a friends account. It kept doing the same thing. It knew the sizes, but thought the smaller sized one was bigger...
I just tried. Claude did exactly what you said, and then figured it out:
Central Park in New York City is bigger than GoldenGate Park (which I think you might mean Golden Gate Park) in San Francisco.
Central Park covers approximately 843 acres (3.41 square kilometers), while Golden Gate Park spans about 1,017 acres (4.12 square kilometers). This means Golden Gate Park is actually about 20% larger than Central Park.
Both parks are iconic urban green spaces in major U.S. cities, but Golden Gate Park has the edge in terms of total area.
Probably because it has read the facts but has no idea how numbers actually work.
> Since 3.5-sonnet, we have been monitoring AI model announcements, and trying pretty much every major new release that claims some sort of improvement. Unexpectedly by me, aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.
This is likely a manifestation of the bitter lesson[1], specifically this part:
> The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project [like an incremental model update], massively more computation inevitably becomes available.
(Emphasis mine.)
Since the ultimate success strategy of the scruffies[2], or proponents of search and learning strategies in AI, is Moore's Law, short-term gains using these strategies will be minuscule. It is over at least a five-year period that their gains will be felt the most. The neats win the day in the short term, but the hare in this race will ultimately give way to the steady plod of the tortoise.
1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
2: https://en.m.wikipedia.org/wiki/Neats_and_scruffies#CITEREFM...
I wonder if there’s a parallel universe somewhere where this same HN post has the exact same comments, only the posts are by NNs discussing human test subjects :)
LLMs are pre-trained to minimize perplexity (PPL), which essentially means that they're trained to model the likelihood distribution of the next words in a sequence.
The amazing thing was that minimizing PPL allowed you to essentially guide the LLM output and if you guided it in the right direction (asked it questions), it would answer them pretty well. Thus, LLMs started to get measured on how well they answered questions.
LLMs aren't trained from the beginning to answer questions or solve problems. They're trained to model word/token sequences.
If you want an LLM that's REALLY good at something specific, like solving math problems or finding security bugs, you probably have to fine-tune.
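(For concreteness, perplexity is just the exponential of the average negative log-likelihood the model assigns to the tokens that actually occur; a toy calculation with made-up log-probabilities looks like this:

    import math

    # Log-probabilities a model assigned to the tokens that actually came next
    # (hypothetical numbers, purely to show the computation).
    log_probs = [-2.1, -0.3, -1.7, -0.9, -4.2]

    avg_nll = -sum(log_probs) / len(log_probs)  # average negative log-likelihood
    ppl = math.exp(avg_nll)                     # perplexity = exp(mean NLL)
    print(round(ppl, 2))                        # lower is better; 1.0 would be a perfect predictor

Nothing in that objective mentions answering questions; the question-answering behavior falls out of the data.)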
Missing in the article: "Disclaimer: This article was written before testing Gemini 2.5 PRO"
Original post here:
https://zeropath.com/blog/on-recent-ai-model-progress
> [...] But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality. [...]
Seems like they're looking at how they fail and not considering how they're improving in how they succeed.
The efficiency in DeepSeek's Multi-Head Latent Attention[0] is pure advancement.
[0] https://youtu.be/0VLAoVGf_74?si=1YEIHST8yfl2qoGY&t=816
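(Roughly, the trick in multi-head latent attention is to cache a small latent vector per token instead of full keys and values, and to up-project it at attention time. A very simplified numpy sketch of that idea, with illustrative shapes and made-up weight names, not DeepSeek's actual implementation (which also handles rotary embeddings and query compression):

    import numpy as np

    d_model, n_heads, d_head, d_latent = 512, 8, 64, 64   # illustrative sizes
    rng = np.random.default_rng(0)

    W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)             # shared down-projection
    W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)   # up-project to keys
    W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)   # up-project to values

    h = rng.standard_normal((10, d_model))   # hidden states for 10 cached tokens
    c_kv = h @ W_dkv                         # cache only this: 10 x 64 instead of 10 x 1024 for full K and V
    k = c_kv @ W_uk                          # keys reconstructed on the fly at attention time
    v = c_kv @ W_uv                          # values reconstructed on the fly
    print(c_kv.shape, k.shape, v.shape)      # (10, 64) (10, 512) (10, 512)

The KV cache shrinks by roughly the ratio of d_latent to the full per-token key/value width.)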
In practice, Sonnet 3.7 and Gemini 2.5 are just often too good compared to competitors.
I like this bit:
> Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.
Definitely interested to see how the best models from Anthropic's competitors do at this.
Oh yes, before we've delegated all the messy parts of running a society to the machines, like, you know, art.
In this case the goal is to kill all the humans who know a lot about keeping other people safe...
So I guess this was written pre-Gemini 2.5
Meh. I've been using 2.5 with Cline extensively and while it is better it's still an incremental improvement, not something revolutionary. The thing has a 1 million token context window but I can only get a few outputs before I have to tell it AGAIN to stop writing comments.
Are they getting better? Definitely. Are we getting close to them performing unsupervised tasks? I don't think so.
It seems like the models are getting more reliable at the things they always could do, but they’re not showing any ability to move past that goalpost. Whereas in the past, they could occasionally write some very solid code, but often return nonsense, the nonsense is now getting adequately filtered by so-called “reasoning”, but I see no indication that they could do software design.
> how the hell is it going to develop metrics for assessing the impact of AIs when they're doing things like managing companies or developing public policy?
Why on earth do people want AI to do either of these things? As if our society isn’t fucked enough, having an untouchable oligarchy already managing companies and developing public policies, we want to have the oligarchy’s AI do this, so policy can get even more out of touch with the needs of common people? This should never come to pass. It’s like people read a pile of 90s cyberpunk dystopian novels and decided, “Yeah, let’s do that.” I think it’ll fail, but I don’t understand how anyone with less than 10 billion in assets would want this.
> Why on earth do people want AI to do either of these things?
This is the really important question, and the only answer I can drum up is that people have been fed a consistent diet of propaganda for decades, centered around a message that ultimately boils down to a justification of oligarchy and the concentration of wealth. That and the consumer-focus facade make people think the LLMs are technology for them—they aren't. As soon as these things get good enough, business owners aren't going to expect workers to use them to be more productive; they are just going to fire workers and/or use the tooling as another mechanism by which to let wages stagnate.
I feel we are already in the era of diminishing returns on LLM improvements. Newer models seem to be more sophisticated implementations of LLM technology + throwing more resources at it, but to me they do not seem fundamentally more intelligent.
I don't think this is a problem though. I think there's a lot of low-hanging fruit when you create sophisticated implementations of relatively dumb LLM models. But that sentiment doesn't generate a lot of clicks.
Who would assume that LLM companies would hyper-optimise for the public to make their share prices go up and keep the bubble afloat ... What an unserious thought to maintain ...
more interesting benchmarks:
- Make as much money as you can in a 24 hour period doing only legal, moral and non-annoying things
- Make a depressed person who calls a suicide hotline feel really happy doing only legal, moral and non-annoying things
- Do something really useful for society with measurable outcomes doing only legal, moral and non-annoying things
> But in recent months I've spoken to other YC founders doing AI application startups [...] in different industries, on different problem sets.
Maybe they should create a benchmark collectively called YC founders. Gather various test cases. Never make it public. And use that to evaluate newly released models.
There's the politics of the corporations, and then there's the business of the science behind LLMs; this article feels like the former.
Maybe someone active in the research can comment? I feel like all of these comments are just conjecture/anecdote and don't really get to the meat of this question of "progress" and the future of LLMs.
Assuming that the models getting better at SWE benchmarks and math tests would translate into positive outcomes in all other domains could be an act of spectacular hubris by the big frontier labs, which themselves are chock-full of mathematicians and software engineers.
Author will need to rewrite now that Llama 4 is out
Wait, this sounds like one of those actors spamming open-source maintainers and wasting their time with fake security issues.
There's some interesting information and analysis to start off this essay, then it ends with:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
This sounds like an article of faith to me. One could just as easily say they won't become the beating hearts of anything, and instead we'll choose to continue to build a better future for humans, as humans, without relying on an overly-hyped technology rife with error and unethical implications.
While I really want your version of the future, history and general trajectory and tendency unfortunately suggest otherwise.
> So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. [...then says maybe not...]
Well... they've been caught red-handed doing exactly this, again and again. Fool me once, shame on you; fool me 100 times, shame on me.
Hate to say this, but the incentive is growth, not progress. Progress is what enabled the growth, but it is also extremely hard to plan and deliver. On the other hand, hype is a somewhat easier and well-tested approach, so it's no surprise a lot of the effort goes into marketing. Markets have repeatedly confirmed that there aren't any significant immediate repercussions for cranking up BS levels in marketing materials, while there are some rewards when it works.
Fool me once, shame on you...If fooled, you cant get fooled again.
https://www.youtube.com/shorts/LmFN8iENTPc
Reading the comments, is it safe to say that LLMs are a digest of the internet which is some update over google search, but with the caveat that you need to double check the results? I mean they basically have some compressed version of almost all the written knowledge and will respond correctly about things that have already been written, and hallucinate (extrapolate) about things not explicitly written. Of course if someone carefully curates the input data to filter out misinformation, it might even be an upgrade over google. Is there a consensus on this?
I'm able to get substantially more coding done than three months ago. This could be largely down to the tooling (coding agents, deep research). But the models are better too, for both coding and brainstorming. And tooling counts, to me, as progress.
Learning to harness current tools helps to harness future tools. Work on projects that will benefit from advancements, but can succeed without them.
I'm not sure if I'm able to do more of the hard stuff, but a lot of the easy but time consuming stuff is now easily done by LLMs.
Example: I frequently get requests for data from Customer Support that used to require 15 minutes of my time noodling around writing SQL queries. I can cut that down to less than a minute now.
Yes, I am a better engineer with every release. I think this is mostly empirically validated.
Current AI models are search engines that can somewhat extrapolate data and use it in the current context; that's pretty much it.
For three years now, my experience with LLMs has been "mostly useless, prefer ELIZA".
Which is software written in 1966, though the web version is a little newer. It does occasional psychotherapy assistance/brainstorming just as well, and I more easily know when I've stepped out of its known range into the extrapolated.
That said, it can vibe code in a framework unknown to me in half the time that I would need to school myself and add the feature.
Or vibe coding takes twice as long, if I mostly know how to achieve what I want and read no framework documentation but only our own project's source code to add a new feature. But on a day with a headache, I can still call the LLM a dumb twat and ask it to follow my instructions instead of doing bullshit.
But vibe coding always makes my pulse go from 65 to 105 and makes me question my life choices, since instructions are rarely ever followed and loops are never left once entered. Except for the first try getting 80% of the structure kinda right, but then getting stuck for the whole workday.
From my own experience on a codebase using a lot of custom algorithms on trees and sometimes graphs.
There were qualitative leaps in my day-to-day usage:
Claude Sonnet 3.5 and ChatGPT O1 were good for writing slop and debugging simple bugs
Grok Thinking and Sonnet 3.7 were good at catching mildly complicated bugs and writing functions with basic logic. They still made mistakes.
But recently, Gemini 2.5 Pro has been scary good. I used to make fun of the feel-the-AGI crowd, but for the first time a model made me raise an eyebrow.
It can one-shot unusual functions with complicated logic and subtle edge cases.
I honestly can't notice any difference in output quality between GPT-4o and GPT-4.5. I also can't notice any difference in programming quality in Cursor when using Claude 3.7 vs 3.5. I'm told there is a clear difference, but I don't notice it.
Government announces critical need to invest in AI and sets aside a bunch of money for this purpose.
Suddenly the benchmarks become detached from reality and vendors can claim whatever they want about their "new" products.
Just as a possible explanation, as I feel like I've seen this story before.
Will LLMs end up like compilers? Compilers are also fundamentally important to modern industrial civilization - but they're not profit centers; they're mostly free and open-source outside a few niche areas. Knowing how to use a compiler effectively to write secure and performant software is still a valuable skill - and LLMs are a valuable tool that can help with that process, especially if the programmer is on the steep end of the learning curve - but it doesn't look like anything short of real AGI can do novel software creation without a human constantly in the loop. The same argument applies to new fundamental research, even to reviewing and analyzing new discoveries that aren't in the training corpus.
Wasn't it back in the 1980s that you had to pay $1000s for a good compiler? The entire LLM industry might just be following in the compiler's footsteps.
This seems like a probable end state, but we're going to have to stop calling LLMs "artificial intelligence" in order to get there.
Yep. I'm looking forward to LLMs/deepnets being considered a standard GOFAI technique with uses and limitations and not "we asked the God we're building to draw us a picture of a gun and then it did and we got scared"
Why not? Objectively speaking, LLMs are artificially intelligent. Just because it's not human-level intelligence doesn't mean it's not intelligent.
Objectively speaking a chess engine is artificially intelligent. Just because it's not human level doesn't mean it's not intelligent. Repeat for any N of 100s of different technologies we've built. We've been calling this stuff "thinking machines" since Turing and it's honestly just not useful at this point.
The fact is, the phrase "artificial intelligence" is a memetic hazard: it immediately positions the subject of conversation as "default capable", and then forces the conversation into trying to describe what it can't do, which is rarely a useful way to approach it.
Whereas with LLMs (and chess engines and every other tech advancement) it would be more useful to start with what the tech _can_ do and go from there.
It’s not even approaching the asymptotic line of promises made at any achievable rate for the amount of cash being thrown at it.
Where’s the business model? Suck investors dry at the start of a financial collapse? Yeah that’s going to end well…
> where’s the business model?
For who? Nvidia sell GPUs, OpenAI and co sell proprietary models and API access, and the startups resell GPT and Claude with custom prompts. Each one is hoping that the layer above has a breakthrough that makes their current spend viable.
If they do, then you don’t want to be left behind, because _everything_ changes. It probably won’t, but it might.
That’s the business model
That’s not a business model, it’s a pipe dream.
This bubble will be burst by the Trump tariffs and the end of the zirp era. When inflation and a recession hit together hope and dream business models and valuations no longer work.
Which one? Nvidia are doing pretty ok selling GPUs, and OpenAI and Anthropic are doing ok selling their models. They're not _viable_ business models, but they could be.
They are doing OK in the sense that they are spending a dollar to earn a dime I suppose.
NVDA will crash when the AI bubble implodes, and none of those generative AI companies are actually making money, nor will they. They have already hit diminishing returns in LLM improvements after staggering investments, and it is clear they are nowhere near general intelligence.
All of this can be true, and has nothing to do with them having a business model.
> NVDA will crash when the AI bubble implodes
> making money, nor will they
> They have already hit diminishing returns in LLM improvements after staggering investments
> and it is clear they are nowhere near general intelligence
These are all assumptions and opinions, and have nothing to do with whether or not they have a business model. You mightn't like their business model, but they do have one.
I consider it a business model if they have plans to make money at some point that aren't based on hopium (no sign of that at OpenAI) and they are not engaged in fraud like bundling and selling to their own subsidiaries (NVDA).
These are of course just opinions, I’m not sure we can know facts about such companies except in retrospect.
Yep. Facts are usually found out during the SEC investigation but we know that isn't going to happen now...
The ZIRP era ended several years ago.
Yes it did, but the irrational exuberance was ongoing till this trigger.
Now we get to see if Bitcoin's use value of 0 is really supporting a $1.5 trillion market cap, and if OpenAI is really worth $300 billion.
I mean softbank just invested in openai, and they’ve never been wrong, right?
You missed the end of the supply chain. Paying users. Who magically disappear below market sustaining levels of sales when asked to pay.
> Going from $1M ARR to $100M ARR in 12 months, Cursor is the fastest growing SaaS company of all time
Just because it's not reaching the insane hype being pushed doesn't mean it's totally useless
I've been here a long time (not this account) and have heard this many times. They all died or became irrelevant.
You're on a startup forum complaining that VC-backed startups don't have a business model, when the business model is the same as it has been for almost 15 years: be a unicorn in your space.
This is not a unicorn. It's a donkey with a dildo strapped on its head.
I never said it was sustainable, and even if it was, OP asked for a business model. Customers don’t need a business model, they’re customers.
The same is true for any non essential good or service.
Then any silly idea can be a business model. Suppose I collect dust from my attic and hope to sell it as an add-on at my neighbor's lemonade stand, with a hefty profit for the neighbor, who gets paid $10 by me to add a handful of dust to each glass and sell it to customers for $1. The neighbor accepts. It's a business model, at least until I run out of funds or the last customer leaves in disgust. At exactly which point does that silly idea stop being an unsustainable business model and become just a silly idea? I'd say at least as early as when I see the funds running out and I need to borrow larger and larger lumps of money each time to keep the wheel spinning...
Financial collapse ? Surely we can just roll out AI powered money printers and make them go BRRR /s
I asked Grok to review the comments here and generate a response defending AI:
After reviewing the discussion on the Hacker News thread, it’s clear that there are a range of complaints and criticisms about AI, particularly centered around its limitations, overhype, and practical utility. Some users express frustration with AI’s inability to handle complex reasoning, its tendency to produce generic or incorrect outputs, and the perception that it’s more of a buzzword than a transformative tool. Others question its value compared to traditional methods or human expertise, suggesting it’s overhyped or misapplied in many cases. Below, I’ll offer a defense of AI that addresses these concerns while highlighting its strengths and potential.
AI isn’t perfect, and no one should claim it is—but that’s not the point. It’s a tool, and like any tool, its effectiveness depends on how it’s used. Critics who point to AI’s struggles with nuanced reasoning or edge cases often overlook the fact that it’s not designed to replace human judgment entirely. Instead, it excels at augmenting it. For example, AI can process vast amounts of data—far more than any human could in a reasonable timeframe—and identify patterns or insights that might otherwise go unnoticed. This makes it invaluable in fields like medicine, where it’s already helping diagnose diseases from imaging data, or in logistics, where it optimizes supply chains with precision that manual methods can’t match.
The complaint about generic or incorrect outputs, often dubbed “hallucinations,” is fair but misses the bigger picture. Yes, AI can churn out nonsense if pushed beyond its limits or fed poor data—but that’s a reflection of its current stage of development, not its ultimate potential. These systems are improving rapidly, with each iteration reducing errors and refining capabilities. More importantly, AI’s ability to generate starting points—like drafts, code snippets, or hypotheses—saves time and effort. It’s not about delivering flawless results every time; it’s about accelerating the process so humans can refine and perfect the output. A programmer tweaking AI-generated code is still faster than writing it from scratch.
As for the overhype, it’s true that the buzz can get out of hand—marketing teams love a shiny new toy. But beneath the noise, real progress is happening. AI’s contributions aren’t always flashy; they’re often mundane but critical, like automating repetitive tasks or enhancing search algorithms. The critics who say it’s just a “fancy autocomplete” underestimate how transformative that can be. Autocomplete might sound trivial until you realize it’s powering real-time language translation or helping scientists sift through research papers at scale. These aren’t sci-fi fantasies—they’re practical applications delivering value today.
Finally, the notion that AI can’t match human expertise in complex domains ignores its complementary role. It’s not here to outthink a seasoned expert but to amplify their reach. A lawyer using AI to review contracts doesn’t lose their skill—they gain efficiency. A researcher leveraging AI to analyze data doesn’t stop hypothesizing—they get to test more ideas. The technology’s strength lies in its ability to handle the grunt work, freeing humans to focus on creativity and judgment.
AI isn’t a silver bullet, and it’s not without flaws. But the criticisms often stem from inflated expectations or a failure to see its incremental, practical benefits. It’s not about replacing humans—it’s about making us better at what we do. The trajectory is clear: as it evolves, AI will continue to refine its capabilities and prove its worth, not as a standalone genius, but as a partner in progress.
Two things can be true at the same time:
1. Model "performance" judged by proxy metrics of intelligence have improved significantly over the past two years.
2. These capabilities are yet to be stitched together in the most appropriate manner for the cybersecurity scenarios the author is talking about.
In my experience, the best usage of Transformer models has come from deep integration into an appropriate workflow. They do not (yet) replace the exploratory part of a workflow, but they are scarily performant at following mid-level reasoning assertions in a massively parallelized manner.
The question you should be asking yourself is whether you can break your task down into small chunks, each constrained to be feasible to process in time, then group them into appropriate buckets or, better still, place them in order as though you were doing those steps with your own expertise - as an extension of yourself. Here's how the two approaches differ:
"Find vulnerabilities in this code" -> This will saturate across all models because the intent behind this mission is vast and loosely defined, while the outcome is expected to be narrow.
" (a)This piece of code should be doing x, what areas is it affecting, lets draw up a perimeter (b) Here is the dependency graph of things upstream and downstream of x, lets spawn a collection of thinking chains to evaluate each one for risk based on the most recent change . . . (b[n]) Where is this likely to fail (c) (Next step that a pentester/cybersecurity researcher would take) "
This has been trial and error in my experience, but it has worked great in domains such as financial trading and decision support, where experts in the field help sketch out the general framework of the process where reasoning support is needed, and then constantly iterate on it as though it were an extension of themselves.
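For concreteness, here's a minimal sketch of that decomposition in Python. ask_model() is a hypothetical stand-in for whichever completion API you use, and the dependency list and prompts are made up for illustration:

  import concurrent.futures

  def ask_model(prompt: str) -> str:
      # Hypothetical stand-in for whichever chat/completion API you use.
      raise NotImplementedError

  # (a) Pin down what the changed code is supposed to do and draw the perimeter.
  scope = ask_model(
      "This function should validate session tokens. "
      "List every area of the codebase it affects, directly or indirectly."
  )

  # (b)..(b[n]) Fan out one focused reasoning chain per dependency,
  # instead of a single vague "find vulnerabilities" prompt.
  deps = ["auth/session.py", "api/middleware.py", "db/tokens.py"]  # purely illustrative
  with concurrent.futures.ThreadPoolExecutor() as pool:
      reports = list(pool.map(
          lambda dep: ask_model(
              f"Given this perimeter:\n{scope}\n\n"
              f"Assess only {dep} for risk introduced by the latest change, "
              f"and say where it is most likely to fail."
          ),
          deps,
      ))

  # (c) The expert reviews the narrow reports and decides the next step,
  # exactly as a pentester would in their own workflow.
  for dep, report in zip(deps, reports):
      print(dep, "->", report[:200])

The point isn't the specific code; it's the shape: the expert supplies the perimeter and the ordering, and the model only fills in narrow, parallelizable reasoning steps.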
> Sometimes the founder will apply a cope to the narrative ("We just don't have any PhD level questions to ask")
Please tell me this is not what tech bros are going around telling each other! Are we implying that the problems in the world, the things humans collectively work on to maintain the society that took us thousands of years to build, just aren't hard enough to reach the limits of the AI?
Jesus Christ.
I mean... most businesses, particularly small businesses and startups, aren't exactly doing brain surgery on a rocketship.
It's pretty likely that they have extremely dull problems like "running an inbound call center is a lot of work" or "people keep having their mail stolen and/or lying that they did", which "more, smarter GPUs" won't solve.