The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60%, etc. performance on IMO questions. This strongly suggests AI models simply remember past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.
Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:
1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)
2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)
I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...
I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!
This is a paper by INSAIT researchers - a very young institute which hired most of its PhD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waving their BG-GPT on national TV in the country as a major breakthrough, while it was basically a fine-tuned Mistral model that was never released to the public, nor was its training set.
Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on LinkedIn by these guys, literally celebrating every PhD they get, which is at minimum very weird. I'd take anything they release with a grain of salt, if not outright caution.
Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)
This effectively makes LLMs useless for education. (It also sours the next generation on LLMs in general; these things are extremely lame to the proverbial "kids these days".)
How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.
You just (lol) need to give non-standard problems and require students to provide reasoning and explanations along with the answer. Yeah, LLMs can "reason" too, but it's obvious when the output comes from an LLM here.
(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages;
1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres]
2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres]
3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000]
4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, you should have been reducing the number!]
So 1), 2) and 3) were out by 1, 1 and 3 orders of magnitude respectively (the errors partially cancelled out), and 4) was nonsensical.
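For what it's worth, a back-of-the-envelope check in Python; the cabin volume, packing efficiency, and usable fraction below are my own rough guesses:

    # Rough sanity check of the corrected numbers (all figures approximate)
    cabin_volume_m3 = 4 * 2 * 40          # ~320 m^3, not 3000
    ball_volume_m3 = 40e-6                # a golf ball is ~40 cc = 0.00004 m^3
    raw_count = cabin_volume_m3 / ball_volume_m3   # ~8,000,000 if the cabin were empty

    packing_efficiency = 0.64             # random close packing of spheres
    usable_fraction = 0.5                 # seats, galleys, bins, etc. (a guess)
    print(f"{raw_count * packing_efficiency * usable_fraction:,.0f} golf balls")
    # => roughly 2,500,000 -- adjustments should reduce the raw count, not inflate it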
This little experiment made me skeptical about the state of the art of AI. I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
> I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.
I think there is a big divide here. Every adult on earth knows magic is "fake", but some can still be amazed and entertained by it, while others find it utterly boring because it's fake, and the only possible (mildly) interesting thing about it is to try to figure out what the trick is.
I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.
>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.
Which is OK, but easier than golf balls in a 737 and hugely easier than USAMO.
Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.
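A toy illustration, with entirely made-up numbers, of why that formula measures trade imbalance rather than a tariff rate:

    # Made-up figures for one country
    imports_from_country = 100.0   # $B of goods we buy from them
    exports_to_country = 60.0      # $B of goods they buy from us
    duties_they_collect = 2.0      # $B of actual tariffs charged on our exports

    claimed_rate = (imports_from_country - exports_to_country) / imports_from_country
    actual_rate = duties_they_collect / exports_to_country

    print(f"deficit / imports = {claimed_rate:.0%}")   # 40% -- a measure of trade imbalance
    print(f"duties / exports  = {actual_rate:.1%}")    # 3.3% -- what a tariff rate actually is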
It's fascinating to me when you tell one that you'd like to see translated passages of work from authors who never have written or translated the item in question, especially if they passed away before the piece was written.
The AI will create something for you and tell you it was them.
Weird thing is, in Google AI Studio all their models, from the state-of-the-art Gemini 2.5 Pro to the lightweight Gemma 2, gave a roughly correct answer. Most even recognised the packing efficiency of spheres.
But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.
Makes sense that search has a small, fast, dumb model designed to summarize and not to solve problems. Nearly 14 billion Google searches per day. Way too much compute needed to use a bigger model.
AFAIK they get ~15% never-before-seen queries every day, so it might not be very simple to design an effective cache layer for that. Semantic-aware clustering of natural language queries and projecting them into a cacheable low-rank dimension is a non-trivial problem. Of course, an LLM can effectively solve that, but then what's the point of using a cache when you need an LLM to cluster the queries...
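A rough sketch of what such a cache might look like: embed the query and reuse a stored answer when a previous query is "close enough". The embedding function and similarity threshold here are placeholders, not a recommendation:

    import numpy as np

    class SemanticCache:
        def __init__(self, embed_fn, threshold=0.92):
            self.embed_fn = embed_fn      # any text -> vector function (the hard part)
            self.threshold = threshold    # cosine similarity required for a cache hit
            self.keys, self.values = [], []

        def _cosine(self, a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        def get(self, query):
            q = self.embed_fn(query)
            for k, v in zip(self.keys, self.values):
                if self._cosine(q, k) >= self.threshold:
                    return v              # near-duplicate query: reuse the cached answer
            return None

        def put(self, query, answer):
            self.keys.append(self.embed_fn(query))
            self.values.append(answer)

The catch, as noted above, is that embed_fn has to be cheap enough to be worth it; otherwise you are back to running a model on every query.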
I have a strong suspicion that for all the low-threshold APIs/services, before the real model sees my prompt, it gets evaluated by a quick model to see if it's something they care to bother the big models with. If not, I get something shaken out of the sleeve of a bottom-of-the-barrel model.
It's most likely one giant ["input token close enough question hash"] = answer_with_params_replay? It doesn't misunderstand the question; it tries to squeeze the input into something close enough?
Nope, no LLMs reported 50-60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50-60% performance on IMO, you are thinking of AlphaProof, but AlphaProof is not an LLM. We don't have the full paper yet, but clearly AlphaProof is a system built on top of an LLM with lots of bells and whistles, just like AlphaFold is.
I've gone through the link you posted and the o1 system card and can't see any reference to IMO. Are you sure they were referring to IMO or were they referring to AIME?
LLMs are “next token” predictors. Yes, I realize that there’s a bit more to it and it’s not always just the “next” token, but at a very high level that’s what they are. So why are we so surprised when it turns out they can’t actually “do” math? Clearly the high benchmark scores are a result of the training sets being polluted with the answers.
If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.
Yeah I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior. I’m currently stumped why my algorithm won’t converge.
So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.
I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get a low-tier ML blogspam author.
**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help my email is bglazer1@gmail.com
I promise its a fun mathematical puzzle and the biology is pretty wild too
I was working some time ago on an image processing model using a GAN architecture. One model produces output and tries to fool the second. Both are trained together. Simple, but it requires a lot of extra effort to make it work. Unstable and falls apart (blows up to an unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, changing the models' architectures and sizes, and adjusting some coefficients through the training to gradually rebalance the loss functions' influence.
The same may work with your problem. If it's unstable, try introducing extra 'brakes' which theoretically are not required. Maybe even incorrect ones. Whatever that means in your domain. Another thing to check is the optimizer; try several, and check the default parameters. I've heard Adam's defaults lead to instability later in training.
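Concretely, the kind of 'brakes' and rebalancing described above might look something like this (a PyTorch sketch; the loss terms, clipping threshold, and decay schedule are illustrative, not a recipe):

    import torch
    from torch import nn

    def train_step(G, D, g_opt, d_opt, real, noise, step, total_steps):
        # --- discriminator update ---
        d_opt.zero_grad()
        fake = G(noise).detach()
        d_loss = (nn.functional.softplus(-D(real)).mean()
                  + nn.functional.softplus(D(fake)).mean())           # non-saturating GAN loss
        d_loss.backward()
        torch.nn.utils.clip_grad_norm_(D.parameters(), max_norm=1.0)  # extra "brake"
        d_opt.step()

        # --- generator update ---
        g_opt.zero_grad()
        fake = G(noise)
        adv_loss = nn.functional.softplus(-D(fake)).mean()
        aux_loss = nn.functional.l1_loss(fake, real)        # e.g. a reconstruction term
        aux_weight = 10.0 * (1 - step / total_steps)        # rebalance its influence over training
        (adv_loss + aux_weight * aux_loss).backward()
        torch.nn.utils.clip_grad_norm_(G.parameters(), max_norm=1.0)
        g_opt.step()

    # Adam's defaults (betas=(0.9, 0.999)) are often swapped for (0.5, 0.999) in GANs:
    # g_opt = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))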
PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.
PPS: the fact that they can do known tasks with minor variations is already a huge time saver.
It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.
Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.
Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.
> they're more like calculators of language than agents that reason
This might be homing in on both the issue and the actual value of LLMs. I think there's a lot of value in a "language calculator", but if it's continuously being sold as something it's not, we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there, but it's different from how we think about it.
True. There’s a small bonus that trying to explain the issue to the llm may sometimes be essentially rubber ducking, and that can lead to insights. I feel most of the time the llm can give erroneous output that still might trigger some thinking on a different direction, and sometimes I’m inclined to think it’s helping me more than it actually is.
When I was an undergrad EE student a decade ago, I had to tangle a lot with complex maths in my Signals & Systems, and Electricity and Magnetism classes. Stuff like Fourier transforms, hairy integrals, partial differential equations etc.
Math packages of the time like Mathematica and MATLAB helped me immensely: once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations and integrate tricky functions, even though AI was nowhere to be found back then.
I feel like ChatGPT is doing something similar when doing maths with its chain of thoughts method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
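As an illustration of the kind of symbolic work those packages automate (sympy here, purely as an example):

    from sympy import symbols, integrate, fourier_transform, exp, sin, oo

    t, x, k, a = symbols('t x k a', positive=True)

    # a "hairy" definite integral done symbolically
    print(integrate(exp(-a * t) * sin(t), (t, 0, oo)))    # 1/(a**2 + 1)

    # the Fourier transform of a Gaussian
    print(fourier_transform(exp(-a * x**2), x, k))        # sqrt(pi/a)*exp(-pi**2*k**2/a)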
I tend to prefer Claude over all things ChatGPT so maybe give the latest model a try -- although in some way I feel like 3.7 is a step down from the prior 3.5 model
How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.
They retrained their model less than a week before its release, just to juice one particular nonstandard eval? Seems implausible. Models get 5x better at things all the time. Challenges like the Winograd schema have gone from impossible to laughably easy practically overnight. Ditto for "Rs in strawberry," ferrying animals across a river, overflowing wine glass, ...
I'm not generally inclined toward the "they are cheating cheaters" mindset, but I'll point out that fine tuning is not the same as retraining. It can be done cheaply and quickly.
Models getting 5X better at things all the time is at least as easy to interpret as evidence of task-specific tuning as of breakthroughs in general ability, especially when the 'things being improved on' are published evals with history.
Imagine that you are making a problem-solving AI. You have a large budget, and access to compute and web-crawling infra to run your AI "on the internet". You would like to be aware of the ways people are currently evaluating AI so that you can be sure your product looks good. Do you maybe have an idea of how one could do that?
The "ferrying animals across a river" problem has definitely not been solved, they still don't understand the problem at all, overcomplicating it because they're using an off-the-shelf solution instead of actual reasoning:
Overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses (even though it looks terrible and unphysical, always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" had the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHFed wine glasses since it was a well-known failure, but (as always) this is just whack-a-mole, it does not generalize into understanding the physical principle.
When told, "only room for one person OR one animal", it's also the only one to recognise the fact that the puzzle is impossible to solve. The farmer can't take any animals with them, and neither the goat nor wolf could row the boat.
> When told, "only room for one person OR one animal"
In common terms, suppose I say: "there is only room for one person or one animal in my car to go home." One can suppose that this refers to additional room besides that occupied by the driver. There is a problem when we try to use an LLM trained on common use of language to solve puzzles in formal logic or math. I think the current LLMs are not able to maintain a specialized context to become a logical reasoning agent, but perhaps such a thing could be possible if the evaluation function of the LLM were designed to give high credit to switching context on a phrase or token.
A particular nonstandard eval that is currently top comment on this HN thread, due to the fact that, unlike every other eval out there, LLMs score badly on it?
Doesn't seem implausible to me at all. If I was running that team, I would be "Drop what you're doing, boys and girls, and optimise the hell out of this test! This is our differentiator!"
It's implausible that fine-tuning of a premier model would have anywhere near that turn around time. Even if they wanted to and had no qualms doing so, it's not happening anywhere near that fast.
Yeah it is, lol. You don't just train your model on whatever you like when you're expected to serve it. There are a host of problems with doing that. The idea that they trained on this obscure benchmark, released about the day of, is actually very silly.
They are trained on some mix with a minimal fraction of math. That's how it was from the beginning. But we can rebalance it by adding quality generated content. Just the content will cost millions of $$ to generate. Distillation on a new level looks like the logical next step.
OP is probably referring to this paper: https://arxiv.org/pdf/2503.21934v1. The paper explains how rigorous testing revealed abysmal performance of LLMs (results that are at odds with how they are hyped).
OpenAI described how they removed it for GPT-4 in its release paper: only exact string matches. So all the discussion of bar exam questions from memory on test-taking forums etc., which wouldn't exactly match, made it in.
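A sketch of why exact-match filtering is so weak: a paraphrased exam question sails straight through. This is a hypothetical stand-in, not OpenAI's actual pipeline:

    def contaminated(train_doc: str, eval_item: str, n: int = 50) -> bool:
        """Flag a training document only if it contains an exact n-character
        substring of the eval item (a crude stand-in for exact string matching)."""
        doc, s = train_doc.lower(), eval_item.lower()
        chunks = [s[i:i + n] for i in range(0, max(1, len(s) - n + 1), n)]
        return any(chunk in doc for chunk in chunks)

    eval_q = "Explain the rule against perpetuities and give an example of its application."
    forum_post = "bar prep thread: can someone explain, with an example, how the rule against perpetuities applies?"

    print(contaminated(forum_post, eval_q))   # False: the paraphrase slips past exact matching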
Just in case it wasn't a typo, and you happen not to know ... that word is probably "eke" - meaning gaining (increasing, enlarging from wiktionary) - rather than "eek" which is what mice do :)
hah you're right on the spelling but wrong on my meaning. That's probably the first time I've typed it. I don't think LLMs are quite at the level of mice reasoning yet!
Every day I am more convinced that LLM hype is the equivalent of someone seeing a stage magician levitate a table across the stage and assuming this means hovercars must only be a few years away.
I believe there's a widespread confusion between a fictional character that is described as an AI assistant, and the actual algorithm building the play-story from which humans imagine the character. An illusion actively promoted by companies seeking investment and hype.
AcmeAssistant is "helpful" and "clever" in the same way that Vampire Count Dracula is "brooding" and "immortal".
That might be overstating it, at least if you mean it to be some unreplicable feat.
Small models have been trained that play around 1200 to 1300 on the EleutherAI Discord. And there's this grandmaster-level transformer - https://arxiv.org/html/2402.04494v1
OpenAI, Anthropic and the like simply don't care much about their LLMs playing chess. That, or post-training is messing things up.
> That might be overstating it, at least if you mean it to be some unreplicable feat.
I mean, surely there's a reason you decided to mention 3.5 turbo instruct and not.. 3.5 turbo? Or any other model? Even the ones that came after? It's clearly a big outlier, at least when you consider "LLMs" to be a wide selection of recent models.
If you're saying that LLMs/transformer models are capable of being trained to play chess by training on chess data, I agree with you.
I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
I mentioned it because it's the best example. One example is enough to disprove the "not capable of". There are other examples too.
>I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
Not really. The LLMs play chess like they have no clue what the rules of the game are, not like poor reasoners. Trying to predict and failing is how they learn anything. If you want them to learn a game like chess, that's how you get them to learn it: by training them to predict chess moves. Chess books during training only teach them how to converse about chess.
The issue isn’t whether they can be trained to play. The issue is whether, after making a careful reading of the rules, they can infer how to play. The latter is something a human child could do, but it is completely beyond an LLM.
> One example is enough to disprove the "not capable of" nonsense. There are other examples too.
Gotcha, fair enough. Throw enough chess data in during training, I'm sure they'd be pretty good at chess.
I don't really understand what you're trying to say in your next paragraph. LLMs surely have plenty of training data to be familiar with the rules of chess. They also purportedly have the reasoning skills to use their familiarity to connect the dots and actually play. It's trivially true that this issue can be plastered over by shoving lots of chess game training data into them, but the success of that route is not a positive reflection on their reasoning abilities.
Gradient descent is a dumb optimizer. LLM training is not at all like a human reading a book and more like evolution tuning adaptations over centuries. You would not expect either process to be aware of anything they are converging towards. So having lots of books that talk about chess in training will predictably just return a model that knows how to talk about chess really well. I'm not surprised they may know how to talk about the rules but play them poorly.
And that post had a follow-up. Post-training messing things up could well be the issue, seeing the impact that even a few more examples and/or regurgitation made.
https://dynomight.net/more-chess/
The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language. That is, that simply training on text is going to eventually give the AI the ability to generate language that reasons at more or less human level in more or less any domain of knowledge.
This whole premise crashes and burns if you need task-specific training, like explicit chess training. That is because there are far too many tasks that humans need to be competent at in order to be useful in society. Even worse, the vast majority of those tasks are very hard to source training data for, unlike chess.
So, if we accept that LLMs can't learn chess unless they explicitly include chess games in the training set, then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
>The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language.
And they do, just not always in the ways we expect.
>This whole premise crashes and burns if you need task-specific training, like explicit chess training.
Everyone needs task-specific training. Any human good enough at chess (or anything else) to make it a profession needs it. So I have no idea why people would expect any less of a machine.
>then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
Yeah, so? How many business pitches they need in the training set has no correlation with chess. I don't see any reason to believe what is already present isn't enough. There's enough chess data on the internet to teach them chess too; it's just a matter of how much OpenAI cares about it.
Reasoning training causes some amount of catastrophic forgetting, so it's unlikely they'd burn that on mixing in chess puzzles if they want a commercial product, unless it somehow transfers well to other reasoning problems that are broadly cared about.
My point wasn't chess specific or that they couldn't have specific training for it. It was a more general "here is something that LLMs clearly aren't being trained for currently, but would also be solvable through reasoning skills"
Much in the same way a human who only just learnt the rules but 0 strategy would very, very rarely lose here
These companies are shouting that their products are passing incredibly hard exams, solving PhD-level questions, and are about to displace humans, and yet they still fail to crush a random-only-strategy chess bot? How does this make any sense?
We're on the verge of AGI but there's not even the tiniest spark of general reasoning ability in something they haven't been trained for
"Reasoning" or "Thinking" are marketing terms and nothing more. If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning"
>If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning".
If you think you can play chess at that level over that many games and moves with memorization, then I don't know what to tell you except that you're wrong. It's not possible, so let's just get that out of the way.
>These companies are shouting that their products are passing incredibly hard exams, solving PhD-level questions, and are about to displace humans, and yet they still fail to crush a random-only-strategy chess bot? How does this make any sense?
Why doesn't it? Have you actually looked at any of these games? Those LLMs aren't playing like poor reasoners. They're playing like machines that have no clue what the rules of the game are. LLMs learn by predicting, failing, and getting a little better at it, repeated ad nauseam. You want them to learn the rules of a complex game? That's how you do it: by training them to predict it. Training on chess books just makes them learn how to converse about chess.
Humans have weird failure modes that are at odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These machines have theirs. That's all there is to it. The top comment we are both replying to noted that gemini-2.5-pro, released less than 5 days later, hit 25% on the benchmark. Now that was particularly funny.
It was surprising to me because I would have expected if there was reasoning ability then it would translate across domains at least somewhat, but yeah what you say makes sense. I'm thinking of it in human terms
Transfer Learning during LLM training tends to be 'broader' than that.
Like how
- Training LLMs on code makes them solve reasoning problems better
- Training Language Y alongside X makes them much better at Y than if they were trained on language Y alone and so on.
Probably because well gradient descent is a dumb optimizer and training is more like evolution than a human reading a book.
Also, there is something genuinely weird going on with LLM chess. And it's possible base models are better.
https://dynomight.net/more-chess/
The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
When you ask it a question, it tends to say yes.
So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.
The real challenge is that LLMs fundamentally want to seem agreeable, and that's not improving. So even if a model gets an extra 5/100 math problems right, it feels about the same in a series of prompts that are more complicated than a simple ChatGPT scenario.
I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.
I totally agree that the core benchmarks that matter should be ones which evaluate a model in agentic scenario, not just on the basis of individual responses.
That sounds reasonable to me, but those companies forget that there are different types of agreeable. There's the LLM approach, similar to the coworker who will answer all your questions about .NET but not stop you from coding yourself into a corner, and then there's the "Let's sit down and review what it actually is that you're doing, because you're asking a fairly large number of disjoint questions right now".
I've dropped trying to use LLMs for anything, due to political convictions and because I don't feel like they are particularly useful for my line of work. Where I have tried to use various models in the past is for software development, and the common mistake I see the LLMs make is that they can't pick up on mistakes in my line of thinking, or won't point them out. Most of my problems are often down to design errors or thinking about a problem in a wrong way. The LLMs will never once tell me that what I'm trying to do is an indication of a wrong/bad design. There are ways to be agreeable and still point out problems with previously made decisions.
You're right that LLMs don't actually want anything. That said, in reinforcement learning, it's common to describe models as wanting things because they're trained to maximize rewards. It’s just a standard way of talking, not a claim about real agency.
Yeah, and they probably have more "agreeable" stuff in their corpus simply because very disagreeable stuff tend to be either much shorter or a prelude to a flamewar.
> The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
umm, it seems to me that it is this (tfa):
But I would nevertheless like to submit, based off of internal
benchmarks, and my own and colleagues' perceptions using these models,
that whatever gains these companies are reporting to the public, they
are not reflective of economic usefulness or generality.
and then couple of lines down from the above statement, we have this:
So maybe there's no mystery: The AI lab companies are lying, and when
they improve benchmark results it's because they have seen the answers
before and are writing them down.
[this went way outside the edit-window and hence a separate comment]
imho, the state of varying experience with LLMs can be aptly summed up in this poem by Mr. Longfellow:
There was a little girl,
Who had a little curl,
Right in the middle of her forehead.
When she was good,
She was very good indeed,
But when she was bad she was horrid.
Unlike many, I find the author's complaints spot on.
Once all the AI-batch startups have sold subscriptions to the cohort and there's no further market growth - because businesses outside don't want to roll the dice on a probabilistic model that doesn't have an understanding of pretty much anything, but rather is a clever imitation machine over the content it has seen - the AI bubble will burst, with more startups starting to pack up by end of 2026, 2027 at the latest.
I agree about both the issue with benchmarks not being relevant to actual use cases and the "wants to sound smart" issue. I have seen both first-hand interacting with LLMs.
I think the ability to embed arbitrary knowledge written in arbitrary formats is the most important thing LLMs have achieved.
In my experience, trying to get an LLM to perform a task as vast and open-ended as the one the author describes is fundamentally misguided. The LLMs were not trained for that and won't be able to do it to a satisfactory degree. But all this research has thankfully provided us with the software and hardware tools with which one could start training a model that can.
Contrast that with 5-6 years ago, when all you could hope for for this kind of thing were simple rule-based and pattern-matching systems.
This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven.
I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.
But some of us are going to end up right and some of us are going to end up wrong and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.
There is nothing wrong with sharing anecdotal experiences. Reading through anecdotal experiences here can help one understand how one's own experiences are relatable or not. Moreover, if I have experience X, it could help to know whether it is because I am doing something wrong that others have figured out.
Furthermore, as we are talking about the actual impact of LLMs, as is the point of the article, a bunch of anecdotal experiences may be more valuable than a bunch of benchmarks for figuring it out. Also, apart from the right/wrong dichotomy, people use LLMs with different goals and contexts. It may not mean that some people are doing something wrong if they do not see the same impact as others. Every time a web developer says that they do not understand how others can be so skeptical of LLMs, concludes with certainty that they must be doing something wrong, and moves on to explain how to actually use LLMs properly, I chuckle.
Indeed, there’s nothing at all wrong with sharing anecdotes. The problem is when people make broad assumptions and conclusions based solely on personal experience, which unfortunately happens all too often. Doing so is wired into our brains, though, and we have to work very consciously to intercept our survival instincts.
People "make conclusions" because they have to take decisions day to day. We cannot wait for the perfect bulletproof evidence before that. Data is useful to take into account, but if I try to use X llm that has some perfect objective benchmark backing it, while I cannot make it be useful to me while Y llm has better results, it would be stupid not to base my decision on my anecdotal experience. Or vice versa, if I have a great workflow with llms, it may be not make sense to drop it because some others may think that llms don't work.
In the absence of actually good evidence, anecdotal data may be the best we can get now. The point imo is try to understand why some anecdotes are contrasting each other, which, imo, is mostly due to contextual factors that may not be very clear, and to be flexible enough to change priors/conclusions when something changes in the current situation.
I think you might be caught up in a bit of the rationalist delusion.
People -only!- draw conclusions based on personal experience. At best you have personal experience with truly objective evidence gathered in a statistically valid manner.
But that only happens in a few vanishingly rare circumstances here on earth. And wherever it happens, people are driven to subvert the evidence gathering process.
Often “working against your instincts” to be more rational only means more time spent choosing which unreliable evidence to concoct a belief from.
You want to block subjectivity? Write some formulas.
There are three questions to consider:
a) Have we, without any reasonable doubt, hit a wall for AI development? Emphasis on "reasonable doubt". There is no reasonable doubt that the Earth is roughly spherical. That level of certainty.
b) Depending on your answer for (a), the next question to consider is if we the humans have motivations to continue developing AI.
c) And then the last question: will AI continue improving?
If taken as boolean values, (a), (b) and (c) form a truth table with eight rows, the most interesting being (false, true, true): "(not a) and b => c". Note the implication sign, "=>". Give some values to (a) and (b), and you get a value for (c).
There are more variables you can add to your formula, but I'll abstain from giving any silly examples. I, however, think that the row (false, true, false) implied by many commentators is just fear and denial. Fear is justified, but denial doesn't help.
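For the record, enumerating that implication over all eight rows shows the only assignment that violates it is exactly the (false, true, false) row mentioned above (a quick illustration):

    from itertools import product

    # (not a and b) => c, i.e. "no wall, motivation exists, therefore improvement continues"
    for a, b, c in product([False, True], repeat=3):
        holds = (not (not a and b)) or c    # material implication
        print(f"a={a!s:5} b={b!s:5} c={c!s:5} -> {'holds' if holds else 'FAILS'}")
    # the only FAILS row is a=False, b=True, c=False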
It's not surprising that responses are anecdotal. An easy way to communicate a generic sentiment often requires being brief.
A majority of what makes a "better AI" can be condensed to how effective the gradient-descent algorithms are at reaching the local optima we want them to reach. Until a generative model shows actual progress at "making decisions", it will forever be seen as a glorified linear algebra solver. Generative machine learning is all about giving a pleasing answer to the end user, not about creating something that is on the level of human decision making.
At risk of being annoying, answers that feel like high quality human decision making are extremely pleasing and desirable. In the same way, image generators aren't generating six fingered hands because they think it's more pleasing, they're doing it because they're trying to please and not good enough yet.
I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember having a run with early GPT 4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models were terrible and hallucinated aggressively (4, 4o, 4.5, o3-mini, o3-mini-high), with the exception of o1. DeepSeek R1 only mildly hallucinated and gave bad answers. Gemini 2.5 was the only flagship model that did not hallucinate and gave some decent answers.
I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about should have been in their training datasets.
Totally agree... this space is still so new and unpredictable that everyone is operating off vibes, gut instinct, and whatever personal anecdotes they've collected. We're all sort of fumbling around in the dark, trying to reverse-engineer the flashlight
Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal for some time to come.
I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.
I am hopeful the coming waves of vertical integration/guardrails/grounding applications will move us away from having to hop between models every few weeks.
Frankly, the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claiming silver/gold-level performance on IMOs. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC-AGI questions and training on those.
Good observation but also somewhat trivial. We are not omniscient gods, ultimately all our opinions and decisions will have to be based on our own limited experiences.
My lived experience is that, unless there's some new breakthrough, AI is more akin to a drill replacing a hammer than to a tractor replacing the plow or a printing press.
Maybe an AI expert can elaborate on this, but it seems there's a limit to the fundamental underlying LLM architecture of transformers and tokens.
LLMs are amazing, but we might need something more, or some new paradigm, to push us towards true AGI.
My personal experience is right in line with the author's.
Also:
> I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.
I immediately thought: That's because in most situations this is the purpose of language, at least partially, and LLMs are trained on language.
The disconnect between improved benchmark results and lack of improvement on real world tasks doesn't have to imply cheating - it's just a reflection of the nature of LLMs, which at the end of the day are just prediction systems - these are language models, not cognitive architectures built for generality.
Of course, if you train an LLM heavily on narrow benchmark domains then its prediction performance will improve on those domains, but why would you expect that to improve performance in unrelated areas?
If you trained yourself extensively on advanced math, would you expect that to improve your programming ability? If not, then why would you expect it to improve the programming ability of a far less sophisticated "intelligence" (prediction engine) such as a language model?! If you trained yourself on LeetCode programming, would you expect that to help with hardening corporate production systems?!
That's fair. But look up the recent experiment running SOTA models on the then just-released USAMO 2025 questions. The highest score was 5%, while supposedly SOTA last year was at IMO silver level. There could be some methodological differences - i.e., the USAMO paper required correct proofs rather than just numerical answers - but it really strongly suggests that even within limited domains, it's cheating. I'd wager a significant amount that if you tested SOTA models on a new ICPC set of questions, actual performance would be far, far worse than their supposed benchmarks.
Your analogy is perfect. Training an LLM on math olympiad problems and then expecting it to secure enterprise software is like teaching someone chess and handing them a wrench
I think overall quality with Gemini 2.5 is not much better than Gemini 2 in my experience. Gemini 2 was already really good, but just like Claude 3.7, Gemini 2.5 goes some steps forward and some steps backwards. It sometimes generates some really verbose code even when you tell it to be succinct. I am pretty confident that if you evaluate 2.5 for a bit longer you'll come to the same conclusion eventually.
People are really fundamentally asking two different questions when they talk about AI "importance": AI's utility and AI's "intelligence". The two need to be kept carefully apart.
1) AI undoubtedly has utility. In many agentic uses, it has very significant utility. There's absolute utility and perceived utility, the latter being more about user experience. In absolute utility, git is likely the single most game-changing piece of software there is: it has probably saved a ten-, maybe eleven-digit number in engineer hours times salary by enabling massive teams to work together in very seamless ways. In user experience, AI is amazing because it can generate so much so quickly. But it is very far from an engineer.

For example, recently I tried to use Cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious: a lot of it had to do with NextJS 15 and the breaking changes it introduces in cookies and auth. If you have masses of NextJS code, which is disproportionately older versions and none of it labeled well with versions, it messes up the LLM. Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations. They work well on problems with massive data (like front end) that don't require much principled understanding (like understanding how NextJS 15 would break so-and-so's auth). Another example: when I tried to use it to generate flags for a V8 build, it failed horribly and would simply hallucinate flags all the time. This seemed very likely to be because (despite the existence of a list of V8 flags online) many flags had very close representations in vector embeddings, and there was almost zero data or detailed examples of their use.
2) On the more theoretical side, the performance of LLMs on benchmarks (claiming to be elite IMO solvers, competitive programming solvers) has become incredibly suspicious. When the new USAMO 2025 was released, the highest score was 5%, despite claims a year ago that SOTA was at least at IMO silver level. This is against the backdrop of exponentially more compute and data being fed in. Combined with apparently diminishing returns, this suggests the gains from that are running really thin.
There are real and obvious improvements in the past few model updates and I'm not sure what the disconnect there is.
Maybe it's that I do have PhD level questions to ask them, and they've gotten much better at it.
But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model and it works less well with later models.
Or perhaps they have a time-sensitive task and are not able to take advantage of the thinking of modern LLMs, which have a slow thinking-based feedback loop. Or maybe their code base is getting more complicated, so it's harder to reason about.
Or perhaps they're giving the LLMs a poorly defined task where older models made assumptions about but newer models understand the ambiguity of and so find the space of solutions harder to navigate.
Since this is ultimately from a company doing AI scanning for security, I would think the latter plays a role to some extent. Security is insanely hard and the more you know about it the harder it is. Also adversaries are bound to be using AI and are increasing in sophistication, which would cause lower efficacy (although you could tease this effect out by trying older models with the newer threats).
The issue is the scale of the improvements. GPT-3.5 Instruct was an utterly massive leap over everything that came before it. GPT-4 was a very big jump over that. Everything since has seemed incremental. Yes we got multimodal but that was part of GPT-4, they just didn't release it initially, and up until very recently it mostly handed off to another model. Yes we got reasoning models, but people had been using CoT for awhile so it was just a matter of time before RL got used to train it into models. Witness the continual delays of GPT-5 and the back and forth on whether it will be its own model or just a router model that picks the best existing model to hand a prompt off to.
In the last year, things like "you are an expert on..." have gotten much less effective in my private tests, while actually describing the problem precisely has gotten better in terms of producing results.
In other words, all the sort of lazy prompt engineering hacks are becoming less effective. Domain expertise is becoming more effective.
I hope it's true. Even if LLMs development stopped now, we would still keep finding new uses for them at least for the next ten years. The technology is evolving way faster than we can meaningfully absorb it and I am genuinely frightened by the consequences. So I hope we're hitting some point of diminishing returns, although I don't believe it a bit.
My experience as someone who uses LLMs and a coding assist plugin (sometimes), but is somewhat bearish on AI is that GPT/Claude and friends have gotten worse in the last 12 months or so, and local LLMs have gone from useless to borderline functional but still not really usable for day to day.
Personally, I think the models are “good enough” that we need to start seeing the improvements in tooling and applications that come with them now. I think MCP is a good step in the right direction, but I’m sceptical on the whole thing (and have been since the beginning, despite being a user of the tech).
I'd say most of the recent AI model progress has been on price.
A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in coding performance. But it's small enough to run on a consumer GPU, which means deployment price is now down to $0.10 per hour. (from $12+ for models requiring 8x H100)
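For anyone curious, loading a model like that in 4-bit on a single consumer GPU can look roughly like this (transformers + bitsandbytes; the model id and generation settings below are placeholders for whichever quant you actually run, and GGUF via llama.cpp is the other common route):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "Qwen/QwQ-32B"   # substitute the repo/quant you actually use

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,                        # 4-bit NF4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb,
        device_map="auto",                        # spills to CPU if VRAM is tight
    )

    prompt = "Write a Python function that merges two sorted lists."
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    print(tok.decode(out[0], skip_special_tokens=True))

Either way, the point stands: serving cost collapses once the model fits on one card.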
One thing I’ve seen is large enterprises extracting money from consumers by putting administrative burden on them.
For example, you can see this in health insurance reimbursements and wireless carriers plan changes. (ie, Verizon’s shift from Do More, etc to what they have now)
Companies basically set up circumstances where consumers lose small amounts of money on a recurring basis or sporadically enough that the people will just pay the money rather than a maze of calls, website navigation and time suck to recover funds due to them or that shouldn’t have been taken in the first place.
I’m hopeful well commoditized AI will give consumers a fighting chance at this and other types of disenfranchisement that seems to be increasingly normalized by companies that have consultants that do nothing but optimize for their own financial position.
This was published the day before Gemini 2.5 was released. I'd be interested if they see any difference with that model. Anecdotally, that is the first model that really made me go wow and made a big difference for my productivity.
Ya, I find this hard to imagine aging well. Gemini 2.5 solved (at least much better than) multiple real world systems questions I've had in the past that other models could not. Its visual reasoning also jumped significantly on charts (e.g. planning around train schedules)
Even Sonnet 3.7 was able to do refactoring work on my codebase sonnet 3.6 could not.
FWIW 2.5-exp was the only one that managed to get a problem I asked it right, compared to Claude 3.7 and o1 (or any of the other free models in Cursor).
It was reverse engineering ~550MB of Hermes bytecode from a react native app, with each function split into a separate file for grep-ability and LLM compatibility.
The others would all start off right, then quickly default to just grepping randomly for what they expected it to be, which failed quickly. 2.5 traced the function all the way back to the networking call and provided the expected response payload.
All the others hallucinated the networking response I was trying to figure out. 2.5 provided it accurately enough for me to intercept the request and, using the response it described, get what I wanted to show up.
I manually pre-parsed the bytecode file with awk into a bazillion individual files that were each just one function, and gave it the hint to grep to sort through them. This was all done in Cursor.
    awk '
      # a new function starts on lines like "=> [Function #123 ...]"
      /^=> \[Function #/ {
        if (out) close(out);                                  # finish the previous function file
        fn = $0; sub(/^.*#/, "", fn); sub(/ .*/, "", fn);     # extract the function number
        out = "function_" fn ".txt"                           # one output file per function
      }
      { if (out) print > out }                                # write every line to the current file
    ' bundle.hasm
Quick example of the output it gave and its process.
There's somehow this belief that "newer models will disprove <insert LLM criticism here>" despite the "newer" models being... just a scaled-up version of a previous model, or some anciliary features tacked on. An LLM is an LLM is an LLM: I'll believe it when I see otherwise.
As someone who was wildly disappointed with the hype around Claude 3.7, Gemini 2.5 is easily the best programmer-assistant LLM available, IMO.
But it still feels more like a small incremental improvement rather than a radical change, and I still feel its limitations constantly.
Like... it gives me the sort of decent but uninspired solution I would expect it to generate without predictably walking me through a bunch of obvious wrong turns as I repeatedly correct it as I would have to have done with earlier models.
And that's certainly not nothing and makes the experience of using it much nicer, but I'm still going to roll my eyes anytime someone suggests that LLMs are the clear path to imminently available AGI.
This is exactly my sentiment. Sonnet-3.5-latest was the perfect code companion: it wrote just the right amount of okay-quality code, but its strength was that it really tried to adhere to your instructions. Sonnet-3.7 was the exact opposite: it wrote waaay too much code and overengineered things like crazy while having very poor instruction adherence. Gemini 2.5 Pro is basically what I hoped sonnet-3.7 would be: follows instructions well but is still softly opinionated, massive (usable) context window, fast responses, more biased towards the latest best practices, and an up-to-date knowledge cutoff.
I'm wondering how much gemini 2.5 being "amazing" comes from sonnet-3.7 being such a disappointment.
> ...whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality.
I'm not surprised, because I don't expect pattern matching systems to grow into something more general and useful. I think LLM's are essentially running into the same limitations that the "expert systems" of the 1980's ran into.
My mom told me yesterday that Paul Newman had massive problems with alcohol. I was somewhat skeptical, so this morning I asked ChatGPT a very simple question:
"Is Paul Newman known for having had problems with alcohol?"
All of the models up to o3-mini-high told me he had no known problems. Here's o3-mini-high's response:
"Paul Newman is not widely known for having had problems with alcohol. While he portrayed characters who sometimes dealt with personal struggles on screen, his personal life and public image were more focused on his celebrated acting career, philanthropic work, and passion for auto racing rather than any issues with alcohol. There is no substantial or widely reported evidence in reputable biographies or interviews that indicates he struggled with alcohol abuse."
There is plenty of evidence online that he struggled a lot with alcohol, including testimony from his long-time wife Joanne Woodward.
I sent my mom the ChatGPT reply and in five minutes she found an authoritative source to back her argument [1].
I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
Lesson(s) learned... Including not doubting my mother's movie trivia knowledge.
This is less an LLM thing than an information retrieval question. If you choose a model and tell it to “Search,” you find citation based analysis that discusses that he indeed had problems with alcohol. I do find it interesting it quibbles whether he was an alcoholic or not - it seems pretty clear from the rest that he was - but regardless.
This is indicative of something crucial when placing LLMs into a toolkit. They are not omniscient nor are they deductive reasoning tools. Information retrieval systems are excellent at information retrieval and should be used for information retrieval. Solvers are excellent at solving deductive problems. Use them. The better they get at these tasks alone is cool but is IMO a parlor trick since we have nearly optimal or actually optimal techniques that don’t need an LLM. The LLM should use those tools.
So, click search next time you have an information retrieval question.
https://chatgpt.com/share/67f2dac0-3478-8000-9055-2ae5347037...
I realise your answer wasn't assertive, but if I heard this from someone actively defending AI it would be a copout. If the selling point is that you can ask these AIs anything then one can't retroactively go "oh but not that" when a particular query doesn't pan out.
This is a bit of a strawman. There are certainly people who claim that you can ask AIs anything but I don't think the parent commenter ever made that claim.
"AI is making incredible progress but still struggles with certain subsets of tasks" is self-consistent position.
Any information found in a web search about Newman will be available in the training set (more or less). It's almost certainly a problem of alignment / "safety" causing this issue.
"Any information found in a web search about Newman will be available in the training set"
I don't think that is a safe assumption these days. Training modern LLM isn't about dumping in everything on the Internet. To get a really good model you have to be selective about your sources of training data.
They still rip off vast amounts of copyrighted data, but I get the impression they are increasingly picky about what they dump into their training runs.
There’s a simpler explanation than that’s that the model weights aren’t an information retrieval system and other sequences of tokens are more likely given the totality of training data. This is why for an information retrieval task you use an information retrieval tool similarly to how for driving nails you use a hammer rather than a screw driver. It may very well be you could drive the nail with the screw driver, but why?
You think that's a simpler explanation? Ok. I think given the amount of effort that goes into "safety" on these systems that my explanation is vastly more likely than somehow this information got lost in the vector soup despite being attached to his name at the top of every search result[0].
Except if safety blocked this, it would have also blocked the linked conversation. Alignment definitely distorts the behavior of models, but treating them as information retrieval systems is using a screwdriver to drive nails. Your example didn't refute this.
> Yes, Paul Newman was known to have struggled with alcohol at certain points in his life. In his early years, particularly during his time in the Navy and into his acting career, Newman admitted to heavy drinking. He was open about his fondness for beer and once jokingly referred to himself as a “functional alcoholic.” In a 1988 interview with The New York Times, he acknowledged that he had a period where he drank too much, stating, “I was a very good drinker. I could put it away.” ...
A local model I'm running (gemma-3-27b-it; https://huggingface.co/lmstudio-community/gemma-3-27b-it-GGU...) just told me various correct-sounding bits about his history with alcohol (correctly citing his alma mater and first wife), but threw in:
"Sobriety & AA: Newman got sober in 1964 and remained so for the rest of his life."
Which doesn't check out. And it includes plausible but completely hallucinated URLs (as well as a valid biography.com URL that completely omits information about alcohol.)
Gemma 3 4B (QAT quant):
Yes, Paul Newman was indeed known to have struggled with alcohol throughout his life. While he maintained a public image of a charming, clean-cut star, he privately battled alcoholism for many years. He sought treatment in the late 1980s and early 1990s and was reportedly very open about his struggles and the importance of seeking help.
Does the as-yet-unwritten prequel to Idiocracy tell the tale of when we started asking AI chatbots for facts, and this was the point of no return for humanity?
I like that it's unmonetized, of course, but that's not why I use AI. I use AI because it's better at search. When I can't remember the right keywords to find something, or when the keywords aren't unique, I frequently find that web search doesn't return what I need and AI does.
It's impressive how often AI returns the right answer to vague questions. (not always though)
Can you blame the users for asking it, when everyone is selling that as a key defining feature?
I use it for asking - often very niche - questions on advanced probability and simulation modeling, and it often gets those right - why those and not a simple verifiable fact about one of the most popular actors in history?
I don't know about Idiocracy, but something I have read specific warnings about is that people will often blame the user for any of the tool's shortcomings.
Counterpoint: Paul Newman was absolutely a famous drunk, as evidenced by this Wikipedia page.* Any query for "paul newman alcohol" online will return dozens of reputable sources on the topic. Your post is easily interpretable as handwaving apologetics, and it gives big "Its the children who are wrong" energy.
This may have hit the nail on the head about the weaknesses of LLMs.
They regurgitate something based not so much on facts as on things that are accessible and perceived as facts. Those might be right, but they might also be wrong, and no one can tell without doing the hard work of checking original sources. Much of what counts as accepted fact, and is accessible to LLM harvesting, is at best derived fact, often mediated by motivated individuals and published to accessible sources by "people with an interest".
The weightings used by any AI should be based on the facts, not on the compounded volume of derived, "mediated", or "directed" facts - simply because those aren't really facts; they're reports.
It all seems like dumber, lazier search-engine stuff. Honestly, what do I know about Paul Newman? But Joanne Woodward and others who knew and worked with him should be weighted as at least slightly more credible than others, no matter how many text patterns happen to match.
> I use ChatGPT for many tasks every day, but I couldn't fathom that it would get something so simple so wrong.
I think we'll have a term like we have for parents/grandparents that believe everything they see on the internet but specifically for people using LLMs.
Yes, Paul Newman was known for being a heavy drinker, particularly of beer. [1] He acknowledged his high consumption levels himself. While he maintained an incredibly successful career and public life, accounts and biographies note his significant alcohol intake, often describing it as a functional habit rather than debilitating alcoholism, although the distinction can be debated. He reportedly cut back significantly in his later years.
[1] Review: Paul Newman memoir stuns with brutal honesty - AP News (apnews.com)
Excluding the ones that do not support chat completions, all but one (qwen-qwq-32b) answered in the affirmative. The answer from qwen-qwq-32b said:
Paul Newman, the renowned actor and humanitarian, did not have a widely publicized struggle with alcohol addiction throughout most of his life, but there were specific instances that indicated challenges.
Using lack of progress in a specialized field as a barometer for overall progress is kind of silly. I just spent the last few days 'vibe coding' an application and I have to say that it's pretty remarkable how capable it is now relative to my experience last year.
It took three minutes for me to do the above from the time I created my API key to when I had an answer.
I find that everyone who replies with examples like this is an expert using expert skills to get the LLM to perform. Which makes me wonder: why is this a skill that is useful to the general public, as opposed to just another useful skill for technical knowledge workers to add to their tool belt?
I agree. But I will say that at least in my social circles I'm finding that a lot of people outside of tech are using these tools, and almost all of them seem to have a healthy skepticism about the information they get back. The ones that don't will learn one way or the other.
That's not really 'simple' for an LLM. This is niche information about a specific person. LLMs train on massive amounts of data; the more present a topic is in that data, the better the answers will be.
Also, you can/should use the "research" mode for questions like this.
The question is simple and verifiable - it is impressive to me that it’s not contained in the LLM’s body of knowledge - or rather that it can’t reach the answer.
This is niche in the grand scheme of knowledge but Paul Newman is easily one of the biggest actors in history, and the LLM has been trained on a massive corpus that includes references to this.
Where is the threshold for topics with enough presence in the data?
Yes, Paul Newman did experience significant struggles with alcohol. In his posthumously published memoir, The Extraordinary Life of an Ordinary Man, Newman candidly discusses his drinking habits and acknowledges his long-term battle with alcoholism. He describes himself as a "functioning alcoholic," a trait he noted was shared with his father. At one point, Newman was reported to consume a case of beer daily, followed by spirits, until he eventually gave up hard liquor.
3-4 hours is enough time for it to have crawled the Hacker News comments section. That's about the frequency the AI bots crawl my little out-of-the-way blog.
These models are not reliable sources of information. They are either out of date, subject to hallucination, or just plain wrong for a variety of reasons. They are untrustworthy to ask facts like this.
I appreciate your consideration of a subjective question and how you explained it and understand these nuances. But please - do not trust chatgpt etc. I continue to be frustrated at the endless people claiming something is true from chatgpt. I support the conclusions of this author.
For questions like this, you need to tell the bot to search the web.
> Hi, can you search the web and find out if Paul Newman has any alcohol problems?
> Paul Newman, the acclaimed actor and philanthropist, had a well-documented fondness for alcohol, particularly beer. He was known to consume significant quantities, with some accounts noting he would drink a case of beer in a day. This level of consumption led some to describe him as a functioning alcoholic. His drinking habits reportedly caused strain in his marriage to Joanne Woodward. In the late 1970s, an incident occurred where Woodward did not allow Newman back into their house because of his drinking. He promised to give up hard liquor as a compromise, but continued to drink beer, sometimes consuming a case during dinner.
Perplexity had no problem with the question, replying in part, "Paul Newman is known to have struggled with alcohol throughout his life, as revealed in various accounts, including his own memoir and documentaries about his life...."
I don't think this reflects a flaw in LLMs at all. It is true that Paul Newman is not "known for" having alcohol problems. Paul Newman is known for acting, and salad dressing. Paul Newman has been "known to" have had alcohol problems, which, formed as a question, correctly one-shots accurate results for me across a few common LLMs. Someone like Betty Ford could be "known for" having alcohol problems. Paul Newman's alcohol problems are relatively specialized knowledge.
good point, though a great LLM answer would DWIM here and explain both interpretations as you have done here, as opposed to (what presumably happened) focus on one and ignore the other.
LLMs aren't good at being search engines, they're good at understanding things. Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
I guess the problem with LLMs is that they're too usable for their own good, so people don't realize that they can't perfectly know all the trivia in the world, exactly like any human.
> Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
Hm, nope: now that the web is flooded with LLM-generated content, it's game over. I can't tell you how many times I've almost been fooled by recipes and the like which seem legit at first but are utter nonsense. And now we're feeding that garbage back to where it came from.
A stochastic parrot with a sufficiently tiny residual error rate needs a stochastic model that compresses the world so precisely, and decompression algorithms so sophisticated, that it could be called reasoning.
Take two 4K frames of a falling vase and ask a model to predict the next token... I mean, the following images. Your model now needs to include some approximation of physics - and the ability to apply it correctly - to produce a realistic outcome. I'm not aware of any model capable of doing that, but that's what it would mean to predict the unseen with high enough fidelity.
I've always been amazed by this. I have never not been frustrated with the profound stupidity of LLMs. Obviously I must be using it differently, because I've never been able to trust it with anything, and more than half the time I fact-check it, even for information retrieval, it's objectively incorrect.
If you got as far as checking the output it must have appeared to understand your question.
I wouldn't claim LLMs are good at being factual, or good at arithmetic, or at drawing wine glasses, or that they are "clever". What they are very good at is responding to questions in a way which gives you the very strong impression they've understood you.
I vehemently disagree. If I ask a question with an objective answer, and it simply makes something up and is very confident the answer is correct, what the fuck has it understood other than how to piss me off?
It clearly doesn't understand that the question has a correct answer, or that it does not know the answer. It also clearly does not understand that I hate bullshit, no matter how many dozens of times I prompt it to not make something up and would prefer an admittance of ignorance.
It didn't understand you but the response was plausible enough to require fact checking.
Although that isn't literally indistinguishable from 'understanding' (your fact-checking easily discerned the difference), it suggests that at a surface level it did appear to understand your question and knew what a plausible answer might look like. This is not necessarily useful, but it's quite impressive.
There are times it just generates complete nonsense that has nothing to do with what I said, but it's certainly not most of the time. I do not know how often, but I'd say it's definitely under 10% and almost certainly under 5% that the above happens.
Sure, LLMs are incredibly impressive from a technical standpoint. But they're so fucking stupid I hate using them.
> This is not necessarily useful but it's quite impressive.
An ability to answer questions with a train of thought showing how the answer was derived, or the self-awareness to recognize that you don't have the ability to answer the question and to declare as much. More than half the time I've used LLMs they will simply make answers up, and when I point out the answer is wrong they simply regurgitate another incorrect answer ad nauseam (regularly cycling through answers I've already pointed out are incorrect).
Rather than give you a technical answer - if I ever feel like an LLM can recognize its limitations rather than make something up, I would say it understands. In my experience LLMs are just algorithmic bullshitters. I would consider a function that just returns "I do not understand" to be an improvement, since most of the time I get confidently incorrect answers instead.
Yes, I read Anthropic's paper from a few days ago. I remain unimpressed until talking to an LLM isn't a profoundly frustrating experience.
It's the first time I've ever used that phrase on HN. Anyway, what phrase do you think works better than 'stochastic parrot' to describe how LLMs function?
It’s good rhetoric but bad analogy. LLMs can be very creative (to the point of failure, in hallucinations).
I don't know if there is a pithy short phrase to accurately describe how LLMs function. Can you give me a similar one for how humans think? That might spur my own creativity here.
Try to come up with a way to prove humans aren't stochastic parrots, then maybe people will start taking you seriously. Just childish Reddit angst rn, nothing else.
> Try to come up with a way to prove humans aren't stochastic parrots
Look around you
Look at Skyscrapers. Rocket ships. Agriculture.
If you want to make a claim that humans are nothing more than stochastic parrots then you need to explain where all of this came from. What were we parroting?
Meanwhile all that LLMs do is parrot things that humans created
Skyscrapers: trees, mountains, cliffs, caves in mountainsides, termite mounds, humans knew things could go high, the Colosseum was built two thousand years ago as a huge multi-storey building.
Rocket ships: volcanic eruptions show heat and explosive outbursts can fling things high, gunpowder and cannons, bellows showing air moves things.
Agriculture: forests, plains, jungle, desert oases, humans knew plants grew from seeds, grew with rain, grew near water, and grew where animals trampled them into the ground.
We need a list of all attempted ideas, all inventions and patents that were ever tried or conceived, and then we'd see that inventions are the same random permutations on ideas with Darwinian-style survivorship as everything else. There were steel boats with multiple levels in them before skyscrapers; is the idea of a tall steel building really so magical when there were over a billion people on Earth in 1800 who could have come up with it?
You’re likening actual rocketry to LLMs being mildly successful at describing Paul Newman’s alcohol use on average when they already have the entire internet handed to them.
> when there were over a billion people on Earth in 1800 who could have come up with it
My point is that humans did come up with it. Humans did not parrot it from someone or something else that showed it to us. We didn't "parrot" splitting the atom. We didn't learn how to build skyscrapers from looking at termite hills and we didn't learn to build rockets that can send a person to the moon from seeing a volcano
For them to work at all they need to have some representation of concepts. Recent research at Anthropic has shown surprising complexity in their reasoning behavior. Perhaps the parrot here is you.
> I guess the problem with LLMs is that they're too usable for their own good, so people don't realize that they can't perfectly know all the trivia in the world, exactly like any human.
They're quite literally being sold as a replacement for human intellectual labor by people that have received uncountable sums of investment money towards that goal.
The author of the post even says this:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
Can't blame people "fact checking" something that's supposed to fill these shoes.
People should be (far) more critical of LLMs given all of these style of bold claims, not less.
Also, telling people they're "holding it wrong" when they interact with alleged "Ay Gee Eye" "superintelligence" really is a poor selling point, and no way to increase confidence in these offerings.
These people and these companies don't get to make these claims that threaten the livelihood of millions of people, inflate a massive bubble, impact hiring decisions and everything else we've seen and then get excused cause "whoops you're not supposed to use it like that, dummy."
Your point is still trivially disproven by the fact that not even humans are expected to know all the world's trivia off the top of their heads.
We can discuss whether LLMs live up to the hype, or we can discuss how to use this new tool in the best way. I'm really tired of HN insisting on discussing the former, and I don't want to take part in that. I'm happy to discuss the latter, though.
Ironically, though, an LLM-powered search engine (some word about being perplexed) is becoming way better than the undisputed king of traditional search engines (something oogle).
It expands what they had before with AI Overviews, but I’m not sure how new either of those are. It showed up for me organically as an AI Mode tab on a native Google search in Firefox ironically.
It asks me to change some permissions, but that help page says this is only available in the US, so I suppose I'll get blocked right after I change them.
I can’t reproduce. Maybe others reported the error and someone adjusted the expected answer, I do not know enough about OpenAI operations to say for sure.
The reason this bothers me is that comments like this reinforce the beliefs of people who could otherwise find value in these tools.
But I think points like this would be better made in shared chats or screenshots, since we do not have something like a core dump or stacktrace to attach.
And while I am not saying OP did this, I have seen technically skilled engineers asserting/implying that LLMs/chatbots aren't good or aren't useful to them, then looked at their chat logs and seen a multitude of things that I am sure would impact the result of the query.
Yes, it can be a UX problem. Yes, it can be an algorithmic problem.
But they are just tools that can be used wrong and not a perfect mechanical brain.
Yes, Paul Newman struggled with alcohol. His issues with alcohol were explored in the HBO Max documentary, The Last Movie Stars, and Shawn Levy's biography, Paul Newman: A Life. According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating.
(4x Source footnotes omitted for readability)
# Ki Multi-step Research Assistant
Paul Newman is known to have struggled with alcohol. According to his posthumous memoir, Newman candidly discussed his issues with drinking and self-doubt, describing himself as an alcoholic who was tormented by insecurities[^1][^2]. He reportedly drank a significant amount of beer daily and later moved on to stronger drinks like Scotch[^3][^4]. His drinking habits were a notable part of his life, and he was often identified by his beer drinking[^5][^6]. Despite these struggles, Newman was also recognized for his generosity and devotion to his family[^7].
> "According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating."
'though it had nothing to do with cheating' is a weird inclusion.
You may know this, but that doesn't mean we all do. LLMs are marketed towards many different groups, including folks who don't have much tech knowledge.
Ah, but isn’t that the problem here - asking an LLM for facts without requesting a search is like asking a PhD to answer a question “off the top of your head”. For pop culture questions the PhD likely brings little value.
They should know better than to guess. Educated, honest, intelligent people don't spout off a wild ass guess, if they don't know something they say so.
I don't think they mean "knowledge" when they talk about "intelligence." LLMs are definitely not knowledge bases. They can transform information given to them in impressive ways, but asking a raw (non-RAG-enabled) LLM to provide its own information will probably always be a mistake.
They kind of are knowledge bases, just not in the usual way. The knowledge is encoded in the words they were trained on. They weren't trained on words chosen at random; they were trained on words written by humans to encode some information. In fact, that's the only thing that makes LLMs somewhat useful.
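To make the "retrieval first, model second" pattern in this subthread concrete, here is a minimal sketch of a retrieval-augmented answer function; `search_fn` and `llm_fn` are hypothetical stand-ins for whatever search API and model client you actually use, not any particular vendor's interface.

    from typing import Callable, List

    def answer_with_sources(question: str,
                            search_fn: Callable[[str], List[str]],
                            llm_fn: Callable[[str], str]) -> str:
        """Retrieval-augmented answering: search first, then have the model answer
        only from the retrieved snippets."""
        snippets = search_fn(question)                     # hypothetical search call
        context = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
        prompt = (
            "Answer the question using ONLY the numbered sources below, citing them "
            "by number. If the sources don't cover it, say so instead of guessing.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}"
        )
        return llm_fn(prompt)                              # hypothetical model call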
So, in other words, are you saying that AI model progress is the real deal and is not bullshit?
That is, as you point out, "all of the models up to o3-mini-high" give an incorrect answer, while other comments say that OpenAIs later models give correct answers, with web citations. So it would seem to follow that "recent AI model progress" actually made a verifiable improvement in this case.
I am pretty sure that they must have meant "up through", not "up to", as the answer from o3-mini-high is also wrong in a way which seems to fit the same description, no?
I tried with 4o and it gave me what I thought was a correct answer:
> Paul Newman was not publicly known for having major problems with alcohol in the way some other celebrities have been. However, he was open about enjoying drinking, particularly beer. He even co-founded a line of food products (Newman’s Own) where profits go to charity, and he once joked that he consumed a lot of the product himself — including beer when it was briefly offered.
> In his later years, Newman did reflect on how he had changed from being more of a heavy drinker in his youth, particularly during his time in the Navy and early acting career, to moderating his habits. But there’s no strong public record of alcohol abuse or addiction problems that significantly affected his career or personal life.
> So while he liked to drink and sometimes joked about it, Paul Newman isn't generally considered someone who had problems with alcohol in the serious sense.
As others have noted, LLMs are much more likely to be cautious in providing information that could be construed as libel. While Paul Newman may have been an alcoholic, I couldn't find any articles about it being "public" in the same way as others, e.g. with admitted rehab stays.
Unless you're expecting an LLM to have access to literally all information on earth at all times I find it really hard to care about this particular type of complaint.
My calculator can't conjugate German verbs. That's fine IMO. It's just a tool
Yes but a tool for what? When asked a question individuals that don't already have detailed knowledge of a topic are left with no way to tell if the AI generated response is complete bullshit, uselessly superficial, or detailed and on point. The only way to be sure is to then go do the standard search engine grovel looking for authoritative sources.
Sounds like someone drank their own Kool-Aid (believing current AI can be a security researcher), and then gets frustrated when they realize they have overhyped themselves.
Current AI just cannot do the kind of symbolic reasoning required for finding security vulnerabilities in software. They might have learned to recognize "bad code" via pattern matching, but that's basically it.
The foundational idea of AI is that people think with words. But that is false. People think with things and use words only to point to things. Thinking with words is not possible; it will always be a case of schizophasia.
The accuracy problem won't just go away. Increasing accuracy is only getting more expensive. This sets the limits for useful applications. And casual users might not even care and use LLMs anyway, without reasonable result verification.
I fear a future where overall quality is reduced. Not sure how many people / companies would accept that. And AI companies are getting too big to fail. Apparently, the US administration does not seem to care when they use LLMs to define tariff policy....
I don't know why anyone is surprised that a statistical model isn't getting 100% accuracy. The fact that statistical models of text are good enough to do anything should be shocking.
I think the surprising aspect is rather how people are praising 80-90% accuracy as the next leap in technological advancement. Quality is already in decline, despite LLMs, and programming was always a discipline where correctness and predictability mattered. It's an advancement for efficiency, sure, but at the yet-unknown cost of stability. I'm thinking about all the simulations based on applied mathematical concepts and all the accumulated hours spent fixing bugs - there's now this certain aftertaste: sweet for some, living their lives efficiently, but very bitter for the ones relying on stability.
You're completely correct, of course. The issue is that most people are not looking for quality, only efficiency. In particular, business owners don't care about sacrificing some correctness if it means they can fire slews of people. Worse, gullible "engineers" who should be the ones prioritizing correctness are so business-brainwashed themselves that they likewise slop up this nonsense, sacrificing their own concern for the only principles that made the software business even remotely worthy of the title "engineering".
That "good enough" is the problem. It requires context. And AI companies are selling us that "good enough" with questionable proof. And they are selling grandiose visions to investors, but move the goal post again and again.
A lot of companies made Copilot available to their workforce. I doubt that the majority of users understand what a statistical model means. The casual, technically inexperienced user just assumes that a computer answer is always right.
I think the real meaningful progress is getting ChatGPT 3.5 level quality running anywhere you want rather than AIs getting smarter at high level tasks. This capability being ubiquitous and not tied to one vendor is really what’s revolutionary.
> [T]here are ~basically~ no public benchmarks for security research... nothing that gets at the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken.
A few months ago I looked at essentially this problem from a different angle (generating system diagrams from a codebase). My conclusion[0] was the same as here: LLMs really struggle to understand codebases in a holistic way, especially when it comes to the codebase's strategy and purpose. They therefore struggle to produce something meaningful from it like a security assessment or a system diagram.
Not sure if it's been fixed by now, but a few weeks ago I was in Golden Gate Park and wondered if it was bigger than Central Park. I asked ChatGPT voice, and although it reported the sizes of the parks correctly (with Golden Gate Park being the bigger one), it then went and said that Central Park was bigger. I was confused, so I Googled it, and sure enough Golden Gate Park is bigger.
I asked Grok and others as well. I believe Perplexity was the only one correct.
Repeated it multiple times even with a friends account. It kept doing the same thing. It knew the sizes, but thought the smaller sized one was bigger...
I just tried. Claude did exactly what you said, and then figured it out:
Central Park in New York City is bigger than GoldenGate Park (which I think you might mean Golden Gate Park) in San Francisco.
Central Park covers approximately 843 acres (3.41 square kilometers), while Golden Gate Park spans about 1,017 acres (4.12 square kilometers). This means Golden Gate Park is actually about 20% larger than Central Park.
Both parks are iconic urban green spaces in major U.S. cities, but Golden Gate Park has the edge in terms of total area.
> Since 3.5-sonnet, we have been monitoring AI model announcements, and trying pretty much every major new release that claims some sort of improvement. Unexpectedly by me, aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.
This is likely a manifestation of the bitter lesson[1], specifically this part:
> The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project [like an incremental model update], massively more computation inevitably becomes available.
(Emphasis mine.)
Since the ultimate success strategy of the scruffies[2], or proponents of search-and-learning strategies in AI, is Moore's Law, short-term gains using these strategies will be minuscule. It is over at least a five-year period that their gains will be felt the most. The neats win the day in the short term, but the hare in this race will ultimately give way to the steady plod of the tortoise.
I wonder if there’s a parallel universe somewhere where this same HN post has the exact same comments, only the posts are by NNs discussing human test subjects :)
LLMs are pre-trained to minimize perplexity (PPL), which essentially means that they're trained to model the likelihood distribution of the next words in a sequence.
The amazing thing was that minimizing PPL allowed you to essentially guide the LLM output and if you guided it in the right direction (asked it questions), it would answer them pretty well. Thus, LLMs started to get measured on how well they answered questions.
LLMs aren't trained from the beginning to answer questions or solve problems. They're trained to model word/token sequences.
If you want an LLM that's REALLY good at something specific like solving math problems or finding security bugs, you probably have to fine tune.
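For what it's worth, here is a minimal sketch of what "minimizing perplexity" means in practice: perplexity is just the exponential of the average negative log-likelihood the model assigns to the observed tokens. The probabilities below are made up purely for illustration.

    import math

    def perplexity(token_probs):
        """token_probs: the probability the model assigned to each actual next token."""
        nll = [-math.log(p) for p in token_probs]      # negative log-likelihood per token
        return math.exp(sum(nll) / len(nll))           # exp of the mean NLL

    # A model that puts high probability on every observed token has low perplexity...
    print(perplexity([0.9, 0.8, 0.95]))   # ~1.13
    # ...and a model that is often "surprised" by the next token has high perplexity.
    print(perplexity([0.1, 0.2, 0.05]))   # = 10.0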
> [...] But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality. [...]
Seems like they're looking at how they fail and not considering how they're improving in how they succeed.
The efficiency in DeepSeek's Multi-Head Latent Attention[0] is pure advancement.
> Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.
Definitely interested to see how the best models from Anthropic's competitors do at this.
Meh. I've been using 2.5 with Cline extensively and while it is better it's still an incremental improvement, not something revolutionary. The thing has a 1 million token context window but I can only get a few outputs before I have to tell it AGAIN to stop writing comments.
Are they getting better? Definitely. Are we getting close to them performing unsupervised tasks? I don't think so.
It seems like the models are getting more reliable at the things they always could do, but they’re not showing any ability to move past that goalpost. Whereas in the past, they could occasionally write some very solid code, but often return nonsense, the nonsense is now getting adequately filtered by so-called “reasoning”, but I see no indication that they could do software design.
> how the hell is it going to develop metrics for assessing the impact of AIs when they're doing things like managing companies or developing public policy?
Why on earth do people want AI to do either of these things? As if our society isn’t fucked enough, having an untouchable oligarchy already managing companies and developing public policies, we want to have the oligarchy’s AI do this, so policy can get even more out of touch with the needs of common people? This should never come to pass. It’s like people read a pile of 90s cyberpunk dystopian novels and decided, “Yeah, let’s do that.” I think it’ll fail, but I don’t understand how anyone with less than 10 billion in assets would want this.
> Why on earth do people want AI to do either of these things?
This is the really important question, and the only answer I can drum up is that people have been fed a consistent diet of propaganda for decades, centered around a message that ultimately boils down to a justification of oligarchy and the concentration of wealth. That, and the consumer-focused facade makes people think the LLMs are technology for them - they aren't. As soon as these things get good enough, business owners aren't going to expect workers to use them to be more productive; they are just going to fire workers and/or use the tooling as another mechanism by which to let wages stagnate.
I feel we are already in the era of diminishing returns on LLM improvements. Newer models seem to be more sophisticated implementations of LLM technology + throwing more resources at it, but to me they do not seem fundamentally more intelligent.
I don't think this is a problem though. I think there's a lot of low-hanging fruit when you create sophisticated implementations of relatively dumb LLM models. But that sentiment doesn't generate a lot of clicks.
Who would assume that LLM companies would hyper-optimise on public benchmarks to make their share prices go up and keep the bubble afloat ... What an unserious thought to maintain ...
> But in recent months I've spoken to other YC founders doing AI application startups [...] in different industries, on different problem sets.
Maybe they should collectively create a benchmark called "YC founders". Gather various test cases. Never make it public. And use that to evaluate newly released models.
There's the politics of the corporations, and then there's the business of the science behind LLMs; this article feels like the former.
Maybe someone active in the research can comment? I feel like all of these comments are just conjecture/anecdote and don't really get to the meat of this question of "progress" and the future of LLMs.
Assuming that the models getting better at SWE benchmarks and math tests would translate into positive outcomes in all other domains could be an act of spectacular hubris by the big frontier labs, which themselves are chock-full of mathematicians and software engineers.
There's some interesting information and analysis to start off this essay, then it ends with:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
This sounds like an article of faith to me. One could just as easily say they won't become the beating hearts of anything, and instead we'll choose to continue to build a better future for humans, as humans, without relying on an overly-hyped technology rife with error and unethical implications.
> So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. [...then says maybe not...]
Well... they've been caught red-handed doing exactly this, again and again. Fool me once, shame on you; fool me 100 times, shame on me.
Hate to say this, but the incentive is growth, not progress. Progress is what enabled the growth, but it is also extremely hard to plan and deliver. On the other hand, hype is probably a somewhat easier and well-tested approach, so it's no surprise that a lot of the effort goes into marketing. Markets have repeatedly confirmed that there aren't any significant immediate repercussions for cranking up BS levels in marketing materials, while there are some rewards when it works.
Reading the comments, is it safe to say that LLMs are a digest of the internet - some upgrade over Google search - with the caveat that you need to double-check the results? I mean, they basically have some compressed version of almost all written knowledge and will respond correctly about things that have already been written, and hallucinate (extrapolate) about things not explicitly written. Of course, if someone carefully curates the input data to filter out misinformation, it might even be an upgrade over Google. Is there a consensus on this?
I'm able to get substantially more coding done than three months ago. This could be largely due to the tooling (coding agents, deep research). But the models are better too, for both coding and brainstorming. And tooling counts, to me, as progress.
Learning to harness current tools helps to harness future tools. Work on projects that will benefit from advancements, but can succeed without them.
I'm not sure if I'm able to do more of the hard stuff, but a lot of the easy but time consuming stuff is now easily done by LLMs.
Example: I frequently get requests for data from Customer Support that used to require 15 minutes of my time noodling around writing SQL queries. I can cut that down to less than a minute now.
For three years now, my experience with LLMs has been "mostly useless, prefer ELIZA".
Which is software written in 1966, though the web version is a little newer. It does occasional psychotherapy assistance/brainstorming just as well, and I more easily know when I've stepped out of its known range into the extrapolated.
That said, an LLM can vibe-code in a framework unknown to me in half the time I would need to school myself and add the feature.
Or vibe coding takes twice as long, if I mostly know how to achieve what I want and read no framework documentation, only our own project's source code, to add a new feature. But on a day with a headache, I can still call the LLM a dumb twat and ask it to follow my instructions instead of doing bullshit.
But vibe coding always makes my pulse go from 65 to 105 and makes me question my life choices, since instructions are rarely followed and loops are never left once entered - except for getting 80% of the structure kinda right on the first try, then getting stuck for the whole workday.
From my own experience on a codebase using a lot of custom algorithms on trees and sometimes graphs.
There were qualitative leaps in my day-to-day usage:
Claude Sonnet 3.5 and ChatGPT o1 were good for writing slop and debugging simple bugs.
Grok Thinking and Sonnet 3.7 were good at catching mildly complicated bugs and writing functions with basic logic.
They still made mistakes.
But recently, Gemini 2.5 Pro has been scary good. I liked to make fun of the feel-the-AGI crowd, but for the first time a model made me raise an eyebrow.
It can one-shot unusual functions with complicated logic and subtle edge cases.
I honestly can't notice any difference in output quality between GPT-4o and GPT-4.5. I also can't notice any difference in programming quality in Cursor when using Claude 3.7 vs 3.5. I'm told there is a clear difference, but I don't notice it.
Will LLMs end up like compilers? Compilers are also fundamentally important to modern industrial civilization - but they're not profit centers; they're mostly free and open-source outside a few niche areas. Knowing how to use a compiler effectively to write secure and performant software is still a valuable skill - and LLMs are a valuable tool that can help with that process, especially if the programmer is on the steep end of the learning curve - but it doesn't look like anything short of real AGI can do novel software creation without a human constantly in the loop. The same argument applies to new fundamental research, even to reviewing and analyzing new discoveries that aren't in the training corpus.
Wasn't it back in the 1980s that you had to pay $1000s for a good compiler? The entire LLM industry might just be following in the compiler's footsteps.
Yep. I'm looking forward to LLMs/deepnets being considered a standard GOFAI technique with uses and limitations and not "we asked the God we're building to draw us a picture of a gun and then it did and we got scared"
Objectively speaking a chess engine is artificially intelligent. Just because it's not human level doesn't mean it's not intelligent. Repeat for any N of 100s of different technologies we've built. We've been calling this stuff "thinking machines" since Turing and it's honestly just not useful at this point.
The fact is, the phrase "artificial intelligence" is a memetic hazard: it immediately positions the subject of conversation as "default capable", and then forces the conversation into trying to describe what it can't do, which is rarely a useful way to approach it.
Whereas with LLMs (and chess engines and every other tech advancement) it would be more useful to start with what the tech _can_ do and go from there.
For who? Nvidia sell GPUs, OpenAI and co sell proprietary models and API access, and the startups resell GPT and Claude with custom prompts. Each one is hoping that the layer above has a breakthrough that makes their current spend viable.
If they do, then you don’t want to be left behind, because _everything_ changes. It probably won’t, but it might.
This bubble will be burst by the Trump tariffs and the end of the zirp era. When inflation and a recession hit together hope and dream business models and valuations no longer work.
Which one? Nvidia are doing pretty ok selling GPUs, and OpenAI and Anthropic are doing ok selling their models. They're not _viable_ business models, but they could be.
NVDA will crash when the AI bubble implodes, and none of those generative AI companies are actually making money, nor will they. They have already hit diminishing returns in LLM improvements after staggering investments, and it is clear they are nowhere near general intelligence.
All of this can be true, and has nothing to do with them having a business model.
> NVDA will crash when the AI bubble implodes,
> making money, nor will they
> They have already hit diminishing returns in LLM improvements after staggering investments
> and it is clear they are nowhere near general intelligence.
These are all assumptions and opinions, and have nothing to do with whether or not they have a business model. You mightn't like their business model, but they do have one.
I consider it a business model if they have plans to make money at some point that aren't based on hopium (no sign of that at OpenAI) and they aren't engaged in fraud like bundling and selling to their own subsidiaries (NVDA).
These are of course just opinions; I'm not sure we can know facts about such companies except in retrospect.
You're on a startup forum complaining that VC-backed startups don't have a business model, when the business model is the same as it has been for almost 15 years - be a unicorn in your space.
Then any silly idea can be a business model. Suppose I collect dust from my attic and hope to sell it as an add-on at my neighbor's lemonade stand, with a hefty profit for the neighbor, who is paid $10 by me to add a handful of dust to each glass and sell it to customers for $1. The neighbor accepts. It's a business model, at least until I run out of existing funds or the last customer leaves in disgust. At which point exactly does that silly idea stop being an unsustainable business model and become just a silly idea? I'd guess at least as early as I see that the funds are running out and I need to borrow larger and larger lumps of money each time to keep the wheel spinning...
I asked Grok to review the comments here and generate a response defending AI:
After reviewing the discussion on the Hacker News thread, it’s clear that there are a range of complaints and criticisms about AI, particularly centered around its limitations, overhype, and practical utility. Some users express frustration with AI’s inability to handle complex reasoning, its tendency to produce generic or incorrect outputs, and the perception that it’s more of a buzzword than a transformative tool. Others question its value compared to traditional methods or human expertise, suggesting it’s overhyped or misapplied in many cases. Below, I’ll offer a defense of AI that addresses these concerns while highlighting its strengths and potential.
AI isn’t perfect, and no one should claim it is—but that’s not the point. It’s a tool, and like any tool, its effectiveness depends on how it’s used. Critics who point to AI’s struggles with nuanced reasoning or edge cases often overlook the fact that it’s not designed to replace human judgment entirely. Instead, it excels at augmenting it. For example, AI can process vast amounts of data—far more than any human could in a reasonable timeframe—and identify patterns or insights that might otherwise go unnoticed. This makes it invaluable in fields like medicine, where it’s already helping diagnose diseases from imaging data, or in logistics, where it optimizes supply chains with precision that manual methods can’t match.
The complaint about generic or incorrect outputs, often dubbed “hallucinations,” is fair but misses the bigger picture. Yes, AI can churn out nonsense if pushed beyond its limits or fed poor data—but that’s a reflection of its current stage of development, not its ultimate potential. These systems are improving rapidly, with each iteration reducing errors and refining capabilities. More importantly, AI’s ability to generate starting points—like drafts, code snippets, or hypotheses—saves time and effort. It’s not about delivering flawless results every time; it’s about accelerating the process so humans can refine and perfect the output. A programmer tweaking AI-generated code is still faster than writing it from scratch.
As for the overhype, it’s true that the buzz can get out of hand—marketing teams love a shiny new toy. But beneath the noise, real progress is happening. AI’s contributions aren’t always flashy; they’re often mundane but critical, like automating repetitive tasks or enhancing search algorithms. The critics who say it’s just a “fancy autocomplete” underestimate how transformative that can be. Autocomplete might sound trivial until you realize it’s powering real-time language translation or helping scientists sift through research papers at scale. These aren’t sci-fi fantasies—they’re practical applications delivering value today.
Finally, the notion that AI can’t match human expertise in complex domains ignores its complementary role. It’s not here to outthink a seasoned expert but to amplify their reach. A lawyer using AI to review contracts doesn’t lose their skill—they gain efficiency. A researcher leveraging AI to analyze data doesn’t stop hypothesizing—they get to test more ideas. The technology’s strength lies in its ability to handle the grunt work, freeing humans to focus on creativity and judgment.
AI isn’t a silver bullet, and it’s not without flaws. But the criticisms often stem from inflated expectations or a failure to see its incremental, practical benefits. It’s not about replacing humans—it’s about making us better at what we do. The trajectory is clear: as it evolves, AI will continue to refine its capabilities and prove its worth, not as a standalone genius, but as a partner in progress.
1. Model "performance" judged by proxy metrics of intelligence have improved significantly over the past two years.
2. These capabilities are yet to be stitched together in the most appropriate manner for the cybersecurity scenarios the author is talking about.
In my experience, the best usage of Transformer models has come from a deep integration into an appropriate workflow. They do not (yet) replace the new exploration part of a workflow, but they are very scarily performant at following mid level reasoning assertions in a massively parallelized manner.
The question you should be asking yourself is whether you can break down your task into however many small chunks that are constrained by feasibility in time to process, chunk those up into appropriate buckets or, even better, place them in order as though you were doing those steps with your own expertise - an extension of self. Here's how the two approaches differ:
"Find vulnerabilities in this code" -> This will saturate across all models because the intent behind this mission is vast and loosely defined, while the outcome is expected to be narrow.
"
(a)This piece of code should be doing x, what areas is it affecting, lets draw up a perimeter
(b) Here is the dependency graph of things upstream and downstream of x, lets spawn a collection of thinking chains to evaluate each one for risk based on the most recent change . . .
(b[n]) Where is this likely to fail
(c) (Next step that a pentester/cybersecurity researcher would take)
"
This has been trial and error in my experience, but it has worked great in domains such as financial trading and decision support, where experts in the field help sketch out the general framework of the process where reasoning support is needed and constantly iterate as though it is an extension of themselves.
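A rough sketch of that "decompose, then fan out" pattern, under the assumption of a generic `llm` completion callable; the step prompts are illustrative, not a real security-review recipe:

    from typing import Callable, List

    def review_change(change_summary: str,
                      dependencies: List[str],
                      llm: Callable[[str], str]) -> List[str]:
        findings: List[str] = []
        # (a) establish the perimeter: what is the code supposed to do, what can it affect?
        perimeter = llm(
            f"This change should be doing the following: {change_summary}. "
            "List the areas of the system it can affect.")
        # (b) fan out: one focused evaluation per upstream/downstream dependency
        for dep in dependencies:
            findings.append(llm(
                f"Given this perimeter:\n{perimeter}\n"
                f"Evaluate the risk this change poses to '{dep}' and where it is likely to fail."))
        # (c) the step an expert would take next: consolidate and rank what came back
        findings.append(llm(
            "Consolidate these findings and rank them by severity:\n" + "\n---\n".join(findings)))
        return findings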
> Sometimes the founder will apply a cope to the narrative ("We just don't have any PhD level questions to ask")
Please tell me this is not what tech bros are going around telling each other! Are we implying that the problems of the world - the things that humans collectively work on to maintain the society that took us thousands of years to build up - just aren't hard enough to reach the limits of the AI?
I mean... most businesses, particularly small businesses and startups, aren't exactly doing brain surgery on a rocketship.
It’s pretty likely that they have extremely dull problems like "running an inbound call center is a lot of work" or "people keep having their mail stolen and/or lying that they did" that "more smarter gpus" won't solve
The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc performance on IMO questions. This massively suggests AI models simply remember the past results, instead of actually solving these questions. I'm incredibly surprised no one mentions this, but it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc) from train data.
Yes, here's the link: https://arxiv.org/abs/2503.21934v1
Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:
1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)
2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)
I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...
I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!
This is a paper by INSAIT researchers - a very young institute which hired most of its PHD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waiving their BG-GPT on national TV in the country as a major breakthrough, while it was basically was a Mistral fine-tuned model, that was eventually never released to the public, nor the training set.
Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on linkedin by these guys, literally celebrating every PHD they get, which is at minimum very weird. I'd take anything they release with a grain of sand if not caution.
Discussed here: https://news.ycombinator.com/item?id=43540985 (Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, 4 points, 2 comments).
Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)
This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general, these things are extremely lame to the proverbial "kids these days".)
How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.
You just (lol) need to give non-standard problems and demand students to provide reasoning and explanations along with the answer. Yeah, LLMs can "reason" too, but it's obvious when the output comes from an LLM here.
(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)
Can you provide sample questions that are "LLM-proof"?
It's not about being "LLM-proof"; it's about teacher involvement in making up novel questions and grading attentively. There's no magic trick.
I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages; 1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres] 2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres] 3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000] 4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, you should have been reducing the number!]
So 1), 2) and 3) were out by 1, 1, and 3 orders of magnitude respectively (the errors partially cancelled out), and 4) was nonsensical.
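For anyone who wants to redo the estimate, here is the arithmetic with the corrected rough figures from above; the packing efficiency and usable-volume fraction are guesses, not measured values.

    cabin_volume = 4 * 2 * 40      # metres: ~320 m^3, call it ~300
    ball_volume = 4.0e-5           # ~40 cm^3 expressed in m^3
    raw_count = cabin_volume / ball_volume    # ~8,000,000 if the cabin were an empty box
    packing = 0.64                 # random sphere-packing efficiency
    usable = 0.5                   # guess: seats, galleys and bins take up space
    print(int(raw_count * packing * usable))  # ~2,600,000 - the adjustment should
                                              # reduce the raw number, not raise it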
This little experiment made me skeptical about the state of the art of AI. I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
> I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.
I think there is a big divide here. Every adult on earth knows magic is "fake", but some can still be amazed and entertained by it, while others find it utterly boring because it's fake, and the only possible (mildly) interesting thing about it is to try to figure out what the trick is.
I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.
To be fair, I love that magicians can pull tricks on me even though I know it is fake.
A lot of humans are similarly good at some stuff and bad at other things.
Looking up the math ability of the average American, this is given as an example of what the median adult can do (from https://www.wyliecomm.com/2021/11/whats-the-latest-u-s-numer...):
>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.
Which is ok but easier than golf balls in a 737 and hugely easier than USAMO.
Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.
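If it helps, here's what that formula actually computes, with made-up numbers (nothing official, purely an illustration):

```python
# Made-up numbers to show what the formula actually computes.
imports = 100.0                     # $bn imported from the country
exports = 60.0                      # $bn exported to it
deficit = imports - exports         # 40
claimed_rate = deficit / imports    # 0.40 -> reported as a "40% tariff"
print(f"{claimed_rate:.0%}")

# The flaw: this is just a measure of the bilateral trade imbalance. A country
# charging a 0% tariff would still show "40%" here whenever we buy more from
# it than we sell to it.
```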
It's fascinating to me when you tell one that you'd like to see translated passages of work from authors who never have written or translated the item in question, especially if they passed away before the piece was written.
The AI will create something for you and tell you it was them.
"That's impossible because..."
"Good point! Blah blah blah..."
Absolutely shameless!
Weird thing is, in Google AI Studio all their models, from the state-of-the-art Gemini 2.5 Pro to the lightweight Gemma 2, gave a roughly correct answer. Most even recognised the packing efficiency of spheres.
But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.
Makes sense that search has a small, fast, dumb model designed to summarize and not to solve problems. Nearly 14 billion Google searches per day. Way too much compute needed to use a bigger model.
Massive search overlap though - and some questions (like the golf ball puzzle) can be cached for a long time.
AFAIK they get 15% unseen queries every day, so it might not be very simple to design an effective cache layer for that. Semantic-aware clustering of natural language queries and projecting them into a cacheable low-rank dimension is a non-trivial problem. Of course, an LLM can effectively solve that, but then what's the point of using a cache when you need an LLM to cluster the queries...
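To make the naive version of that concrete, here's a minimal sketch of an embedding-based semantic cache (the model name, similarity threshold, and linear scan are all illustrative assumptions, and it glosses over exactly the hard parts mentioned above: paraphrase false positives, eviction, freshness):

```python
# Naive semantic cache: return a stored answer when a new query embeds close
# to an old one. Model name and threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # list of (normalized embedding, answer) pairs

def store(query, answer):
    cache.append((model.encode(query, normalize_embeddings=True), answer))

def lookup(query, threshold=0.8):
    q = model.encode(query, normalize_embeddings=True)
    for emb, answer in cache:
        if float(np.dot(q, emb)) >= threshold:  # cosine similarity (unit vectors)
            return answer
    return None  # cache miss -> fall through to the real model

store("how many golf balls fit in a boeing 737 cabin", "on the order of a few million")
print(lookup("number of golf balls that fit inside a 737 cabin"))
```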
I have a strong suspicion that for all the low-threshold APIs/services, before the real model sees my prompt, it gets evaluated by a quick model to see if it's something they care to bother the big models with. If not, I get something shaken out of the sleeve of a bottom-of-the-barrel model.
It's most likely one giant ["input tokens close enough question hash"] = answer_with_params_replay? It doesn't misunderstand the question; it tries to squeeze the input into something close enough?
I've seen humans make exactly these sorts of mistakes?
Just tried with o3-mini-high and it came up with something pretty reasonable: https://chatgpt.com/share/67f35ae9-5ce4-800c-ba39-6288cb4685...
I had to look up these acronyms:
- USAMO - United States of America Mathematical Olympiad
- IMO - International Mathematical Olympiad
- ICPC - International Collegiate Programming Contest
Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted 27th March 2025.
Nope, no LLMs reported 50~60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50~60% performance on IMO, you are thinking of AlphaProof, but AlphaProof is not an LLM. We don't have the full paper yet, but clearly AlphaProof is a system built on top of an LLM with lots of bells and whistles, just like AlphaFold is.
o1 reportedly got 83% on IMO, and 89th percentile on Codeforces.
https://openai.com/index/learning-to-reason-with-llms/
The paper tested it on o1-pro as well. Correct me if I'm getting some versioning mixed up here.
I've gone through the link you posted and the o1 system card and can't see any reference to IMO. Are you sure they were referring to IMO or were they referring to AIME?
AIME is so not IMO.
Less than 5%. OpenAI's O1 burned through over $100 in tokens during the test as well!
LLMs are “next token” predictors. Yes, I realize that there’s a bit more to it and it’s not always just the “next” token, but at a very high level that’s what they are. So why are we so surprised when it turns out they can’t actually “do” math? Clearly the high benchmark scores are a result of the training sets being polluted with the answers.
>I'm incredibly surprised no one mentions this
If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.
Yeah I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior. I’m currently stumped why my algorithm won’t converge.
So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.
I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get low tier ML blogspam author.
**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help my email is bglazer1@gmail.com
I promise its a fun mathematical puzzle and the biology is pretty wild too
I was working some time ago on an image processing model using a GAN architecture. One model produces output and tries to fool the second; both are trained together. Simple, but it requires a lot of extra effort to make it work. It's unstable and falls apart (blows up to an unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, changing the models' architectures and sizes, and adjusting some coefficients through the training to gradually rebalance the loss functions' influence.
The same may work with your problem. If it's unstable, try introducing extra 'brakes' which theoretically are not required, maybe even incorrect ones, whatever that means in your domain. Another thing to check is the optimizer; try several, and check the default parameters. I've heard Adam's defaults lead to instability later in training. (See the sketch at the end of this comment.)
PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.
PPS: the fact that they can do known tasks with minor variations is already a huge time saver.
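A rough PyTorch illustration of the "extra brakes" idea above (not the commenter's actual setup; the architectures and coefficients are placeholders): non-default Adam betas, an R1-style gradient penalty on the discriminator, and gradient clipping.

```python
import torch
import torch.nn as nn

# Tiny placeholder generator/discriminator just to make the example runnable.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
D = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))

# Adam's defaults (betas=(0.9, 0.999)) are often blamed for late-training GAN
# instability; (0.5, 0.999) is a common alternative.
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))

real = torch.randn(32, 2)              # stand-in for real samples
fake = G(torch.randn(32, 64))

# Discriminator step with an R1-style gradient penalty as a "brake".
real_req = real.clone().requires_grad_(True)
d_real, d_fake = D(real_req), D(fake.detach())
grad = torch.autograd.grad(d_real.sum(), real_req, create_graph=True)[0]
r1_penalty = grad.pow(2).sum(dim=1).mean()
loss_d = (nn.functional.softplus(-d_real).mean()
          + nn.functional.softplus(d_fake).mean()
          + 10.0 * r1_penalty)          # penalty weight is a placeholder
opt_d.zero_grad()
loss_d.backward()
torch.nn.utils.clip_grad_norm_(D.parameters(), 1.0)   # another brake
opt_d.step()
```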
It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.
Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.
Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.
> they're more like calculators of language than agents that reason
This might be homing in on both the issue and the actual value of LLMs. I think there's a lot of value in a "language calculator", but if it's continuously being sold as something it's not, we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there but it's different from how we think about it.
True. There's a small bonus that trying to explain the issue to the LLM may sometimes be essentially rubber ducking, and that can lead to insights. I feel most of the time the LLM can give erroneous output that still might trigger some thinking in a different direction, and sometimes I'm inclined to think it's helping me more than it actually is.
When I was an undergrad EE student a decade ago, I had to tangle a lot with complex maths in my Signals & Systems, and Electricity and Magnetism classes. Stuff like Fourier transforms, hairy integrals, partial differential equations etc.
Math packages of the time like Mathematica and MATLAB helped me immensely: once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations and integrate tricky functions, even though AI was nowhere to be found back then. (See the sketch below.)
I feel like ChatGPT is doing something similar when doing maths with its chain-of-thought method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
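For a rough modern analogue of that kind of symbolic grunt work (not the Mathematica/MATLAB the commenter used, but Python's SymPy, purely as an illustration):

```python
# Symbolic integration, a Fourier transform, and a small system of equations.
import sympy as sp

x, k = sp.symbols('x k', real=True)

# Integrate a "tricky" function symbolically
print(sp.integrate(sp.exp(-x**2), (x, -sp.oo, sp.oo)))        # sqrt(pi)

# Fourier transform of a Gaussian
print(sp.fourier_transform(sp.exp(-sp.pi * x**2), x, k))      # exp(-pi*k**2)

# Solve a small system of equations
a, b = sp.symbols('a b')
print(sp.solve([sp.Eq(2*a + b, 3), sp.Eq(a - b, 1)], [a, b])) # {a: 4/3, b: 1/3}
```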
I tend to prefer Claude over all things ChatGPT so maybe give the latest model a try -- although in some way I feel like 3.7 is a step down from the prior 3.5 model
Have you tried gemini 2.5? It's one of the best reasoning models. Available free in google ai studio.
And then within a week, Gemini 2.5 was tested and got 25%. Point is AI is getting stronger.
And this only suggested LLMs aren't trained well to write formal math proofs, which is true.
> within a week
How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.
They retrained their model less than a week before its release, just to juice one particular nonstandard eval? Seems implausible. Models get 5x better at things all the time. Challenges like the Winograd schema have gone from impossible to laughably easy practically overnight. Ditto for "Rs in strawberry," ferrying animals across a river, overflowing wine glass, ...
I'm not generally inclined toward the "they are cheating cheaters" mindset, but I'll point out that fine tuning is not the same as retraining. It can be done cheaply and quickly.
Models getting 5X better at things all the time is at least as easy to interpret as evidence of task-specific tuning as of breakthroughs in general ability, especially when the 'things being improved on' are published evals with history.
Google team said it was outside the training window fwiw
https://x.com/jack_w_rae/status/1907454713563426883
Imagine that you are making a problem-solving AI. You have a large budget, and access to compute and web-crawling infra to run your AI "on the internet". You would like to be aware of the ways people are currently evaluating AI so that you can be sure your product looks good. Do you have maybe an idea how one could do that?
The "ferrying animals across a river" problem has definitely not been solved, they still don't understand the problem at all, overcomplicating it because they're using an off-the-shelf solution instead of actual reasoning:
o1 screwing up a trivially easy variation: https://xcancel.com/colin_fraser/status/1864787124320387202
Claude 3.7, utterly incoherent: https://xcancel.com/colin_fraser/status/1898158943962271876
DeepSeek: https://xcancel.com/colin_fraser/status/1882510886163943443#...
Overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses (even though it looks terrible and unphysical, always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" had the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHFed wine glasses since it was a well-known failure, but (as always) this is just whack-a-mole, it does not generalize into understanding the physical principle.
Gemini 2.5 Pro got the farmer problem variation right: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
When told, "only room for one person OR one animal", it's also the only one to recognise the fact that the puzzle is impossible to solve. The farmer can't take any animals with them, and neither the goat nor wolf could row the boat.
> When told, "only room for one person OR one animal"
In common terms, suppose I say: there is only room for one person or one animal in my car to go home. One can suppose that it is referring to additional room besides that occupied by the driver. There is a problem when we try to use an LLM trained on common use of language to solve puzzles in formal logic or math. I think the current LLMs are not able to maintain a specialized context to become a logical reasoning agent, but perhaps such a thing could be possible if the evaluation function of the LLM was designed to give high credit to changing context with a phrase or token.
That can't be viewed without logging into Google first.
They could have RLHF'd or fine-tuned on user thumbs-up responses, which could include users who took the test and asked it to explain problems afterwards.
>one particular nonstandard eval
A particular nonstandard eval that is currently top comment on this HN thread, due to the fact that, unlike every other eval out there, LLMs score badly on it?
Doesn't seem implausible to me at all. If I was running that team, I would be "Drop what you're doing, boys and girls, and optimise the hell out of this test! This is our differentiator!"
It's implausible that fine-tuning of a premier model would have anywhere near that turn around time. Even if they wanted to and had no qualms doing so, it's not happening anywhere near that fast.
It's really not that implausible, they probably are adding stuff to the data-soup all the time and have a system in place for it.
Yeah it is lol. You don't just train your model on whatever you like when you're expected to serve it. There are a host of problems with doing that. The idea that they trained on this obscure benchmark released about the day of is actually very silly.
They are trained on some mix with a minimal fraction of math. That's how it was from the beginning. But they can rebalance it by adding quality generated content; such content will just cost millions of $$ to generate. Distillation on a new level looks like the logical next step.
Yeah, this is one of those red flags that keeps getting hand-waved away, but really shouldn't be.
Query: Could you explain the terminology to people who don't follow this that closely?
Not the OP but
USAMO : USA Math Olympiad. Referenced here: https://arxiv.org/pdf/2503.21934v1
IMO : International Math Olympiad
SOTA : State of the Art
OP is probably referring to this paper: https://arxiv.org/pdf/2503.21934v1. The paper explains how rigorous testing revealed abysmal performance of LLMs (results that are at odds with how they are hyped).
What would the average human score be?
I.e. if you randomly sampled N humans to take those tests.
The average human score on USAMO (let alone IMO) is zero, of course. Source: I won medals at Korean Mathematical Olympiad.
I am hesitant to correct a math Olympian, but don't you mean the median?
Average, hmmm?
That type of news might make investors worried / scared.
OpenAI told how they removed it for GPT-4 in its release paper: only exact string matches. So all the discussion of bar exam questions from memory on test-taking forums etc., which wouldn't exactly match, made it in.
Just in case it wasn't a typo, and you happen not to know ... that word is probably "eke" - meaning gaining (increasing, enlarging from wiktionary) - rather than "eek" which is what mice do :)
hah you're right on the spelling but wrong on my meaning. That's probably the first time I've typed it. I don't think LLMs are quite at the level of mice reasoning yet!
https://dictionary.cambridge.org/us/dictionary/english/eke-o... to obtain or win something only with difficulty or great effort
Ick, OK, ACK.
Every day I am more convinced that LLM hype is the equivalent of someone seeing a stage magician levitate a table across the stage and assuming this means hovercars must only be a few years away.
I believe there's a widespread confusion between a fictional character that is described as a AI assistant, versus the actual algorithm building the play-story which humans imagine the character from. An illusion actively promoted by companies seeking investment and hype.
AcmeAssistant is "helpful" and "clever" in the same way that Vampire Count Dracula is "brooding" and "immortal".
Eek! You mean eke.
LLMs are capable of playing chess and 3.5 turbo instruct does so quite well (for a human) at 1800 ELO. Does this mean they can truly reason now ?
https://github.com/adamkarvonen/chess_gpt_eval
3.5 turbo instruct is a huge outlier.
https://dynomight.substack.com/p/chess
Discussion here: https://news.ycombinator.com/item?id=42138289
That might be overstating it, at least if you mean it to be some unreplicable feat. Small models have been trained that play around 1200 to 1300 on the eleuther discord. And there's this grandmaster level transformer - https://arxiv.org/html/2402.04494v1
OpenAI, Anthropic and the like simply don't care much about their LLMs playing chess. That or post-training is messing things up.
> That might be overstating it, at least if you mean it to be some unreplicable feat.
I mean, surely there's a reason you decided to mention 3.5 turbo instruct and not.. 3.5 turbo? Or any other model? Even the ones that came after? It's clearly a big outlier, at least when you consider "LLMs" to be a wide selection of recent models.
If you're saying that LLMs/transformer models are capable of being trained to play chess by training on chess data, I agree with you.
I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
I mentioned it because it's the best example. One example is enough to disprove the "not capable of". There are other examples too.
>I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
Not really. The LLMs play chess like they have no clue what the rules of the game are, not like poor reasoners. Trying to predict and failing is how they learn anything. If you want them to learn a game like chess, then that's how you get them to learn it: by having them predict chess moves. Chess books during training only teach them how to converse about chess.
The issue isn’t whether they can be trained to play. The issue is whether, after making a careful reading of the rules, they can infer how to play. The latter is something a human child could do, but it is completely beyond an LLM.
> One example is enough to disprove the "not capable of" nonsense. There are other examples too.
Gotcha, fair enough. Throw enough chess data in during training, I'm sure they'd be pretty good at chess.
I don't really understand what you're trying to say in your next paragraph. LLMs surely have plenty of training data to be familiar with the rules of chess. They also purportedly have the reasoning skills to use their familiarity to connect the dots and actually play. It's trivially true that this issue can be plastered over by shoving lots of chess game training data into them, but the success of that route is not a positive reflection on their reasoning abilities.
Gradient descent is a dumb optimizer. LLM training is not at all like a human reading a book and more like evolution tuning adaptations over centuries. You would not expect either process to be aware of anything they are converging towards. So having lots of books that talk about chess in training will predictably just return a model that knows how to talk about chess really well. I'm not surprised they may know how to talk about the rules but play them poorly.
And that post had a follow-up. Post-training messing things up could well be the issue, seeing the impact that even a few more examples and/or regurgitation made. https://dynomight.net/more-chess/
The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language. That is, that simply training on text is going to eventually give the AI the ability to generate language that reasons at more or less human level in more or less any domain of knowledge.
This whole premise crashes and burns if you need task-specific training, like explicit chess training. That is because there are far too many tasks that humans need to be competent at in order to be useful in society. Even worse, the vast majority of those tasks are very hard to source training data for, unlike chess.
So, if we accept that LLMs can't learn chess unless they explicitly include chess games in the training set, then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
>The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language.
And they do, just not always in the ways we expect.
>This whole premise crashes and burns if you need task-specific training, like explicit chess training.
Everyone needs task-specific training. Any human good enough at chess, or anything else, to make it a profession needs it. So I have no idea why people would expect any less of a machine.
>then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
Yeah, so? How many business pitches they need in the training set has no correlation with chess. I don't see any reason to believe what is already present isn't enough. There's enough chess data on the internet to teach them chess too; it's just a matter of how much OpenAI cares about it.
Reasoning training causes some amount of catastrophic forgetting, so it's unlikely they'd burn that on mixing in chess puzzles if they want a commercial product, unless it somehow transfers well to other reasoning problems broadly cared about.
My point wasn't chess specific or that they couldn't have specific training for it. It was a more general "here is something that LLMs clearly aren't being trained for currently, but would also be solvable through reasoning skills"
Much in the same way that a human who has only just learnt the rules, with zero strategy, would very, very rarely lose here.
These companies are shouting that their products are passing incredibly hard exams, solving PhD-level questions, and are about to displace humans, and yet they still fail to crush a random-only-strategy chess bot? How does this make any sense?
We're on the verge of AGI but there's not even the tiniest spark of general reasoning ability in something they haven't been trained for
"Reasoning" or "Thinking" are marketing terms and nothing more. If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning"
>If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning".
If you think you can play chess at that level over that many games and moves with memorization, then I don't know what to tell you except that you're wrong. It's not possible, so let's just get that out of the way.
>These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?
Why doesn't it? Have you actually looked at any of these games? Those LLMs aren't playing like poor reasoners. They're playing like machines that have no clue what the rules of the game are. LLMs learn by predicting and failing and getting a little better at it, repeat ad nauseam. You want them to learn the rules of a complex game? That's how you do it: by training them to predict it. Training on chess books just makes them learn how to converse about chess.
Humans have weird failure modes that are at odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These machines have theirs. That's all there is to it. The top comment we are both replying to noted that gemini-2.5-pro, which was released less than 5 days later, hit 25% on the benchmark. Now that was particularly funny.
> Why doesn't it?
It was surprising to me because I would have expected if there was reasoning ability then it would translate across domains at least somewhat, but yeah what you say makes sense. I'm thinking of it in human terms
Transfer Learning during LLM training tends to be 'broader' than that.
Like how:
- Training LLMs on code makes them solve reasoning problems better
- Training language Y alongside X makes them much better at Y than if they were trained on language Y alone
and so on.
Probably because, well, gradient descent is a dumb optimizer and training is more like evolution than a human reading a book.
Also, there is something genuinely weird going on with LLM chess. And it's possible base models are better. https://dynomight.net/more-chess/
The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
When you ask it a question, it tends to say yes.
So while the LLM arms race is incrementally increasing benchmark scores, those improvements are illusory.
The real challenge is that LLMs fundamentally want to seem agreeable, and that's not improving. So even if the model gets an extra 5/100 math problems right, it feels about the same in a series of prompts which are more complicated than just a ChatGPT scenario.
I would say the industry knows it’s missing a tool but doesn’t know what that tool is yet. Truly agentic performance is getting better (Cursor is amazing!) but it’s still evolving.
I totally agree that the core benchmarks that matter should be ones which evaluate a model in agentic scenario, not just on the basis of individual responses.
> The real challenge is that LLMs fundamentally want to seem agreeable, and that's not improving
LLMs fundamentally do not want to seem anything
But the companies that are training them and making models available for professional use sure want them to seem agreeable
That sounds reasonable to me, but those companies forget that there are different types of agreeable. There's the LLM approach, similar to the coworker who will answer all your questions about .NET but not stop you from coding yourself into a corner, and then there's the "Let's sit down and review what it actually is that you're doing, because you're asking a fairly large number of disjoint questions right now".
I've dropped trying to use LLMs for anything, due to political convictions and because I don't feel like they are particularly useful for my line of work. Where I have tried to use various models in the past is for software development, and the common mistake I see the LLMs make is that they can't pick up on mistakes in my line of thinking, or won't point them out. Most of my problems are often down to design errors or thinking about a problem in a wrong way. The LLMs will never once tell me that what I'm trying to do is an indication of a wrong/bad design. There are ways to be agreeable and still point out problems with previously made decisions.
> LLMs fundamentally do not want to seem anything
You're right that LLMs don't actually want anything. That said, in reinforcement learning, it's common to describe models as wanting things because they're trained to maximize rewards. It’s just a standard way of talking, not a claim about real agency.
> a standard way of talking, not a claim about real agency.
A standard way of talking used by people who do also frequently claim real agency.
Yeah, and they probably have more "agreeable" stuff in their corpus simply because very disagreeable stuff tends to be either much shorter or a prelude to a flamewar.
> The core point in this article is that the LLM wants to report _something_, and so it tends to exaggerate. It’s not very good at saying “no” or not as good as a programmer would hope.
umm, it seems to me that it is this (tfa):
and then a couple of lines down from the above statement, we have this: [this went way outside the edit window and hence a separate comment] imho, the state of varying experience with LLMs can be aptly summed up in this poem by Mr. Longfellow
"when you ask him anything, he never answers 'no' -- he just yesses you to death and then he takes your dough"
Unlike many, I find the author's complaints spot on.
Once all the AI batch startups have sold subscriptions to the cohort and there's no further market growth, because businesses outside don't want to roll the dice on a probabilistic model that doesn't have an understanding of pretty much anything but rather is a clever imitation machine trained on the content it has seen, the AI bubble will burst, with more startups starting to pack up by the end of 2026 or 2027 at the latest.
I agree, about both the issue with benchmarks not being relevant to actual use cases and the "wants to sound smart" issue. I have seen them both firsthand interacting with LLMs.
I think the ability to embed arbitrary knowledge written in arbitrary formats is the most important thing llms have achieved.
In my experience, trying to get an LLM to perform a task as vast and open-ended as the one the author describes is fundamentally misguided. The LLMs were not trained for that and won't be able to do it to a satisfactory degree. But all this research has thankfully provided us with the software and hardware tools where one could start working on training a model that can.
Contrast that to 5-6 years ago, when all you could hope for, for this kind of thing, were simple rule-based and pattern-matching systems.
This is a bit of a meta-comment, but reading through the responses to a post like this is really interesting because it demonstrates how our collective response to this stuff is (a) wildly divergent and (b) entirely anecdote-driven.
I have my own opinions, but I can't really say that they're not also based on anecdotes and personal decision-making heuristics.
But some of us are going to end up right and some of us are going to end up wrong and I'm really curious what features signal an ability to make "better choices" w/r/t AI, even if we don't know (or can't prove) what "better" is yet.
There is nothing wrong with sharing anecdotal experiences. Reading through anecdotal experiences here can help one understand whether one's own experiences are relatable or not. Moreover, if I have X experience, it could help to know whether it is because of me doing something wrong that others have figured out.
Furthermore, as we are talking about the actual impact of LLMs, as is the point of the article, a bunch of anecdotal experiences may be more valuable than a bunch of benchmarks for figuring it out. Also, apart from the right/wrong dichotomy, people use LLMs with different goals and contexts. It may not mean that some people are doing something wrong if they do not see the same impact as others. Every time a web developer says that they do not understand how others may be so skeptical of LLMs, concludes with certainty that they must be doing something wrong, and moves on to explain how to actually use LLMs properly, I chuckle.
Indeed, there’s nothing at all wrong with sharing anecdotes. The problem is when people make broad assumptions and conclusions based solely on personal experience, which unfortunately happens all too often. Doing so is wired into our brains, though, and we have to work very consciously to intercept our survival instincts.
People "make conclusions" because they have to take decisions day to day. We cannot wait for the perfect bulletproof evidence before that. Data is useful to take into account, but if I try to use X llm that has some perfect objective benchmark backing it, while I cannot make it be useful to me while Y llm has better results, it would be stupid not to base my decision on my anecdotal experience. Or vice versa, if I have a great workflow with llms, it may be not make sense to drop it because some others may think that llms don't work.
In the absence of actually good evidence, anecdotal data may be the best we can get for now. The point imo is to try to understand why some anecdotes contrast with each other, which, imo, is mostly due to contextual factors that may not be very clear, and to be flexible enough to change priors/conclusions when something changes in the current situation.
I think you might be caught up in a bit of the rationalist delusion.
People -only!- draw conclusions based on personal experience. At best you have personal experience with truly objective evidence gathered in a statistically valid manner.
But that only happens in a few vanishingly rare circumstances here on earth. And wherever it happens, people are driven to subvert the evidence gathering process.
Often “working against your instincts” to be more rational only means more time spent choosing which unreliable evidence to concoct a belief from.
I'm not sure where you got all this from. Do you have any useful citations?
You want to block subjectivity? Write some formulas.
There are three questions to consider:
a) Have we, without any reasonable doubt, hit a wall for AI development? Emphasis on "reasonable doubt". There is no reasonable doubt that the Earth is roughly spherical. That level of certainty.
b) Depending on your answer for (a), the next question to consider is if we the humans have motivations to continue developing AI.
c) And then the last question: will AI continue improving?
If taken as boolean values, (a), (b) and (c) have a truth table with eight values, the most interesting row being false, true, true: "(not a) and b => c". Note the implication sign, "=>". Give some values to (a) and (b), and you get a value for (c).
There are more variables you can add to your formula, but I'll abstain from giving any silly examples. I, however, think that the row (false, true, false) implied by many commentators is just fear and denial. Fear is justified, but denial doesn't help.
It's not surprising that responses are anecdotal. An easy way to communicate a generic sentiment often requires being brief.
A majority of what makes a "better AI" can be condensed to how effective the gradient-descent algorithms are at getting to the local optima we want them to get to. Until a generative model shows actual progress at "making decisions" it will forever be seen as a glorified linear algebra solver. Generative machine learning is all about giving a pleasing answer to the end user, not about creating something that is on the level of human decision making.
At risk of being annoying, answers that feel like high quality human decision making are extremely pleasing and desirable. In the same way, image generators aren't generating six fingered hands because they think it's more pleasing, they're doing it because they're trying to please and not good enough yet.
I'm just most baffled by the "flashes of brilliance" combined with utter stupidity. I remember having a run with early GPT 4 (gpt-4-0314) where it did refactoring work that amazed me. In the past few days I asked a bunch of AIs about similar characters between a popular gacha mobile game and a popular TV show. OpenAI's models were terrible and hallucinated aggressively (4, 4o, 4.5, o3-mini, o3-mini-high), with the exception of o1. DeepSeek R1 only mildly hallucinated and gave bad answers. Gemini 2.5 was the only flagship model that did not hallucinate and gave some decent answers.
I probably should have used some type of grounding, but I honestly assumed the stuff I was asking about should have been in their training datasets.
Totally agree... this space is still so new and unpredictable that everyone is operating off vibes, gut instinct, and whatever personal anecdotes they've collected. We're all sort of fumbling around in the dark, trying to reverse-engineer the flashlight
Agreed! And with all the gaming of the evals going on, I think we're going to be stuck with anecdotal for some time to come.
I do feel (anecdotally) that models are getting better on every major release, but the gains certainly don't seem evenly distributed.
I am hopeful the coming waves of vertical integration/guardrails/grounding applications will move us away from having to hop between models every few weeks.
Frankly, the overarching story about evals (which receives very little coverage) is how much gaming is going on. On the recent USAMO 2025, SOTA models scored 5%, despite claiming silver/gold-level performance on the IMO. And ARC-AGI: one very easy way to "solve" it is to generate masses of synthetic examples by extrapolating the basic rules of ARC-AGI questions and train on those.
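A toy illustration of that synthetic-data shortcut (my own simplification, not the actual ARC-AGI format or anyone's documented training pipeline): sample random grids and apply a simple rule-like transformation to get unlimited (input, output) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pair():
    # Random grid of "colors" 0..4, then one of a few fixed transformations.
    grid = rng.integers(0, 5, size=(rng.integers(3, 8), rng.integers(3, 8)))
    rule = rng.choice(["mirror", "recolor", "transpose"])
    if rule == "mirror":
        out = np.fliplr(grid)
    elif rule == "recolor":
        mapping = rng.permutation(5)   # permute the 5 colors
        out = mapping[grid]
    else:
        out = grid.T
    return grid, out, rule

for _ in range(3):
    inp, out, rule = make_pair()
    print(rule, inp.shape, "->", out.shape)
```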
> I'm really curious what features signal an ability to make "better choices" w/r/t AI
So am I. If you promise you'll tell me after you time travel to the future and find out, I'll promise you the same in return.
That’s a good point, the comments section is very anecdotal. Do you have any data to say if this is a common occurrence or specific to this topic?
Good observation but also somewhat trivial. We are not omniscient gods, ultimately all our opinions and decisions will have to be based on our own limited experiences.
My lived experience is that, unless there's some new breakthrough, AI is more akin to a drill replacing a hammer than a tractor replacing the plow or a printing press.
Maybe an AI expert can elaborate on this, but it seems there's a limit to the fundamental underlying model of the LLM architecture of transformers and tokens.
LLMs are amazing, but we might need something more, or some new paradigm, to push us towards true AGI.
My personal experience is right in line with the author's.
Also:
> I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.
I immediately thought: That's because in most situations this is the purpose of language, at least partially, and LLMs are trained on language.
The disconnect between improved benchmark results and lack of improvement on real world tasks doesn't have to imply cheating - it's just a reflection of the nature of LLMs, which at the end of the day are just prediction systems - these are language models, not cognitive architectures built for generality.
Of course, if you train an LLM heavily on narrow benchmark domains then its prediction performance will improve on those domains, but why would you expect that to improve performance in unrelated areas?
If you trained yourself extensively on advanced math, would you expect that to improve your programming ability? If not, they why would you expect it to improve programming ability of a far less sophisticated "intelligence" (prediction engine) such as a language model?! If you trained yourself on LeetCode programming, would you expect that to help hardening corporate production systems?!
In my view as well, it's not really cheating, it's just overfitting.
If a model doesn't do well on the benchmarks it will either be retrained until it does or you won't hear about it.
That's fair. But look up the recent experiment running SOTA models on the then just-released USAMO 2025 questions. The highest score was 5%, while supposedly SOTA last year was at IMO silver level. There could be some methodological differences - i.e. the USAMO paper required correct proofs and not just numerical answers. But it really strongly suggests that even within limited domains, it's cheating. I'd wager a significant amount that if you tested SOTA models on a new ICPC set of questions, actual performance would be far, far worse than their supposed benchmarks.
> Highest score was 5%, supposedly SOTA last year was IMO silver level.
No LLM last year got silver. Deepmind had a highly specialized AI system earning that
Your analogy is perfect. Training an LLM on math olympiad problems and then expecting it to secure enterprise software is like teaching someone chess and handing them a wrench
I've used gemini 2.5 this weekend with aider and it was frighteningly good.
It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.
> It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.
Even approximations must be right to be meaningful. If information is wrong, it's rubbish.
Presorting/labelling various data has value. Humans have done the real work there.
What is "leading" us at present are the exaggerated valuations of corporations. You/we are in a bubble, working to justify the bubble.
Until a tool is reliable, it is not installed where people can get hurt. Unless we have revised our concern for people.
I think overall quality with Gemini 2.5 is not much better than Gemini 2 in my experience. Gemini 2 was already really good, but just like Claude 3.7, Gemini 2.5 goes some steps forward and some steps backwards. It sometimes generates some really verbose code even when you tell it to be succinct. I am pretty confident that if you evaluate 2.5 for a bit longer you'll come to the same conclusion eventually.
Yep, and with what they are doing in Cursor, the agentic stuff is really game changing.
People who can’t recognize this intentionally have their heads in the sand
I guess you haven't been on /r/cursor or forum.cursor.com lately?
"game changing" isn't exactly the sentiment there the last couple months.
People are really fundamentally asking two different questions when they talk about AI "importance": AI's utility and AI's "intelligence". There's a careful line between both.
1) AI undoubtedly has utility. In many agentic uses, it has very significant utility. There's absolute utility and perceived utility, which is more of a user-experience thing. In absolute utility, it is likely git is the single most game-changing piece of software there is. It is likely git has saved some ten, maybe eleven digit number in engineer hours times salary in how it enables massive teams to work together in very seamless ways. In user experience, AI is amazing because it can generate so much so quickly. But it is very far from an engineer.

For example, recently I tried to use Cursor to bootstrap a website in NextJS for me. It produced errors it could not fix, and each rewrite seemed to dig it deeper into its own hole. The reasons were quite obvious. A lot of it had to do with NextJS 15 and the breaking changes it introduces in cookies and auth. It's quite clear that if you have masses of NextJS code, which is disproportionately older versions, but none labeled well with versions, it messes up the LLM. Eventually I scrapped what it wrote and did it myself. I don't mean to use this anecdote to say LLMs are useless, but they have pretty clear limitations. They work well on problems with massive data (like front end) that don't require much principled understanding (like understanding how NextJS 15 would break so-and-so's auth).

Another example of this is when I tried to use it to generate flags for a V8 build: it failed horribly and would simply hallucinate flags all the time. This seemed very likely to be because (despite the existence of a list of V8 flags online) many flags had very close representations in vector embeddings, and there was close to zero data/detailed examples on their use.
2) On the more theoretical side, the performance of LLMs on benchmarks (claiming to be elite IMO solvers, competitive programming solvers) has become incredibly suspicious. When the new USAMO 2025 was released, the highest score was 5%, despite claims a year ago that SOTA was at least at IMO silver level. This is against the backdrop of exponential compute and data being fed in. Combined with apparently diminishing returns, this suggests that the gains from that are running really thin.
There are real and obvious improvements in the past few model updates and I'm not sure what the disconnect there is.
Maybe it's that I do have PhD level questions to ask them, and they've gotten much better at it.
But I suspect that these anecdotes are driven by something else. Perhaps people found a workable prompt strategy by trial and error on an earlier model and it works less well with later models.
Or perhaps they have a time-sensitive task and are not able to take advantage of the thinking of modern LLMs, which have a slow thinking-based feedback loop. Or maybe their code base is getting more complicated, so it's harder to reason about.
Or perhaps they're giving the LLMs a poorly defined task that older models made assumptions about but newer models understand the ambiguity of, and so find the space of solutions harder to navigate.
Since this is ultimately from a company doing AI scanning for security, I would think the latter plays a role to some extent. Security is insanely hard and the more you know about it the harder it is. Also adversaries are bound to be using AI and are increasing in sophistication, which would cause lower efficacy (although you could tease this effect out by trying older models with the newer threats).
The issue is the scale of the improvements. GPT-3.5 Instruct was an utterly massive leap over everything that came before it. GPT-4 was a very big jump over that. Everything since has seemed incremental. Yes we got multimodal but that was part of GPT-4, they just didn't release it initially, and up until very recently it mostly handed off to another model. Yes we got reasoning models, but people had been using CoT for awhile so it was just a matter of time before RL got used to train it into models. Witness the continual delays of GPT-5 and the back and forth on whether it will be its own model or just a router model that picks the best existing model to hand a prompt off to.
In the last year, things like "you are an expert on..." have gotten much less effective in my private tests, while actually describing the problem precisely has gotten better in terms of producing results.
In other words, all the sort of lazy prompt engineering hacks are becoming less effective. Domain expertise is becoming more effective.
yes that would explain the effect I think. I'll try that out this week.
I hope it's true. Even if LLMs development stopped now, we would still keep finding new uses for them at least for the next ten years. The technology is evolving way faster than we can meaningfully absorb it and I am genuinely frightened by the consequences. So I hope we're hitting some point of diminishing returns, although I don't believe it a bit.
My experience as someone who uses LLMs and a coding assist plugin (sometimes), but is somewhat bearish on AI is that GPT/Claude and friends have gotten worse in the last 12 months or so, and local LLMs have gone from useless to borderline functional but still not really usable for day to day.
Personally, I think the models are “good enough” that we need to start seeing the improvements in tooling and applications that come with them now. I think MCP is a good step in the right direction, but I’m sceptical on the whole thing (and have been since the beginning, despite being a user of the tech).
I'd say most of the recent AI model progress has been on price.
A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in coding performance. But it's small enough to run on a consumer GPU, which means deployment price is now down to $0.10 per hour. (from $12+ for models requiring 8x H100)
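For anyone curious, a minimal sketch of what that kind of 4-bit deployment can look like with Hugging Face transformers + bitsandbytes (the model id and settings are assumptions for illustration; the commenter's actual serving stack may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/QwQ-32B"  # assumed HF repo name
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

prompt = "Write a Python function that reverses a linked list."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
```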
Have you compared it with 8-bit QwQ-17B?
In my evals 8 bit quantized smaller Qwen models were better, but again evaluating is hard.
There’s no QwQ 17B that I’m aware of. Do you have a HF link?
Yeah, I'm thinking of this from a Wardley map standpoint.
What innovation opens up when AI gets sufficiently commoditized?
One thing I’ve seen is large enterprises extracting money from consumers by putting administrative burden on them.
For example, you can see this in health insurance reimbursements and wireless carriers plan changes. (ie, Verizon’s shift from Do More, etc to what they have now)
Companies basically set up circumstances where consumers lose small amounts of money on a recurring basis, or sporadically enough, that people will just pay the money rather than face a maze of calls, website navigation and time-suck to recover funds due to them or that shouldn't have been taken in the first place.
I'm hopeful well-commoditized AI will give consumers a fighting chance against this and other types of disenfranchisement that seem to be increasingly normalized by companies whose consultants do nothing but optimize for their own financial position.
Brute force, brute force everything, at least for the domains where you can have automatic verification.
This was published the day before Gemini 2.5 was released. I'd be interested if they see any difference with that model. Anecdotally, that is the first model that really made me go wow and made a big difference for my productivity.
Ya, I find this hard to imagine aging well. Gemini 2.5 solved (or at least did much better on) multiple real-world systems questions I've had in the past that other models could not. Its visual reasoning also jumped significantly on charts (e.g. planning around train schedules).
Even Sonnet 3.7 was able to do refactoring work on my codebase that Sonnet 3.6 could not.
Really not seeing the "LLMs not improving" story
I doubt it. It still flails miserably like the other models on anything remotely hard, even with plenty of human coaxing. For example, try to get it to solve: https://www.janestreet.com/puzzles/hall-of-mirrors-3-index/
FWIW 2.5-exp was the only one that managed to get a problem I asked it right, compared to Claude 3.7 and o1 (or any of the other free models in Cursor).
It was reverse engineering ~550MB of Hermes bytecode from a react native app, with each function split into a separate file for grep-ability and LLM compatibility.
The others would all start off right, then quickly default to just grepping randomly for what they expected it to be, which failed quickly. 2.5 traced the function all the way back to the networking call and provided the expected response payload.
All the others hallucinated the networking response I was trying to figure out. 2.5 provided it accurately enough for me to intercept the request and use the response it provided to get what I wanted to show up.
How did you fit 550MB of bytecode into the context window? Was this using 2.5 in an agentic framework? (i.e. repeated model calls and tool usage)
I manually pre-parsed the bytecode file with awk into a bazillion individual files that were each just one function, and gave it the hint to grep to sort through them. This was all done in Cursor.
Quick example of the output it gave and its process: https://i.imgur.com/Cmg4KK1.png
https://i.imgur.com/ApNxUkB.png
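For reference, a rough Python equivalent of that pre-parsing step (the original used awk; the function-header regex here is an assumption about the disassembler's output and will need adjusting to whatever your dump actually looks like):

```python
# Split a big Hermes disassembly into one file per function for grep-ability.
import re
from pathlib import Path

DUMP = Path("bundle.hasm")      # hypothetical disassembly dump
OUT = Path("functions")
OUT.mkdir(exist_ok=True)

header = re.compile(r"^Function<(?P<name>[^>]*)>")   # assumed header format
idx, name, buf = 0, None, []

def flush():
    # Write out the function accumulated so far, if any.
    if name is not None and buf:
        safe = re.sub(r"[^\w.-]", "_", name)[:80] or "anonymous"
        (OUT / f"{idx:06d}_{safe}.hasm").write_text("".join(buf))

with DUMP.open(errors="replace") as fh:
    for line in fh:
        m = header.match(line)
        if m:
            flush()
            idx, name, buf = idx + 1, m.group("name"), [line]
        elif name is not None:
            buf.append(line)
flush()
```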
I’d say the average person wouldn’t understand that problem, let alone solve it.
There's somehow this belief that "newer models will disprove <insert LLM criticism here>" despite the "newer" models being... just a scaled-up version of a previous model, or some ancillary features tacked on. An LLM is an LLM is an LLM: I'll believe it when I see otherwise.
As someone who was wildly disappointed with the hype around Claude 3.7, Gemini 2.5 is easily the best programmer-assistant LLM available, IMO.
But it still feels more like a small incremental improvement rather than a radical change, and I still feel its limitations constantly.
Like... it gives me the sort of decent but uninspired solution I would expect it to generate, without predictably walking me through a bunch of obvious wrong turns that I have to repeatedly correct, as I would have had to do with earlier models.
And that's certainly not nothing and makes the experience of using it much nicer, but I'm still going to roll my eyes anytime someone suggests that LLMs are the clear path to imminently available AGI.
This is exactly my sentiment. Sonnet-3.5-latest was the perfect code companion: it wrote just the right amount of okay-quality code, but its strength was that it really tried to adhere to your instructions. Sonnet-3.7 was the exact opposite: it wrote waaay too much code and overengineered things like crazy while having very poor instruction adherence. Gemini 2.5 Pro is basically what I hoped sonnet-3.7 would be: follows instructions well but still softly opinionated, massive (usable) context window, fast responses, more biased towards the latest best practices, and an up-to-date knowledge cutoff.
I'm wondering how much gemini 2.5 being "amazing" comes from sonnet-3.7 being such a disappointment.
> ...whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality.
I'm not surprised, because I don't expect pattern matching systems to grow into something more general and useful. I think LLM's are essentially running into the same limitations that the "expert systems" of the 1980's ran into.
My mom told me yesterday that Paul Newman had massive problems with alcohol. I was somewhat skeptical, so this morning I asked ChatGPT a very simple question:
"Is Paul Newman known for having had problems with alcohol?"
All of the models up to o3-mini-high told me he had no known problems. Here's o3-mini-high's response:
"Paul Newman is not widely known for having had problems with alcohol. While he portrayed characters who sometimes dealt with personal struggles on screen, his personal life and public image were more focused on his celebrated acting career, philanthropic work, and passion for auto racing rather than any issues with alcohol. There is no substantial or widely reported evidence in reputable biographies or interviews that indicates he struggled with alcohol abuse."
There is plenty of evidence online that he struggled a lot with alcohol, including testimony from his long-time wife Joanne Woodward.
I sent my mom the ChatGPT reply and in five minutes she found an authoritative source to back her argument [1].
I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
Lesson(s) learned... Including not doubting my mother's movie trivia knowledge.
[1] https://www.newyorker.com/magazine/2022/10/24/who-paul-newma...
This is less an LLM thing than an information retrieval question. If you choose a model and tell it to “Search,” you find citation based analysis that discusses that he indeed had problems with alcohol. I do find it interesting it quibbles whether he was an alcoholic or not - it seems pretty clear from the rest that he was - but regardless. This is indicative of something crucial when placing LLMs into a toolkit. They are not omniscient nor are they deductive reasoning tools. Information retrieval systems are excellent at information retrieval and should be used for information retrieval. Solvers are excellent at solving deductive problems. Use them. The better they get at these tasks alone is cool but is IMO a parlor trick since we have nearly optimal or actually optimal techniques that don’t need an LLM. The LLM should use those tools. So, click search next time you have an information retrieval question. https://chatgpt.com/share/67f2dac0-3478-8000-9055-2ae5347037...
I realise your answer wasn't assertive, but if I heard this from someone actively defending AI it would be a copout. If the selling point is that you can ask these AIs anything then one can't retroactively go "oh but not that" when a particular query doesn't pan out.
This is a bit of a strawman. There are certainly people who claim that you can ask AIs anything but I don't think the parent commenter ever made that claim.
"AI is making incredible progress but still struggles with certain subsets of tasks" is self-consistent position.
It’s not the position of any major AI company, curiously.
lotta words here to say AI can't do basic search right
Any information found in a web search about Newman will be available in the training set (more or less). It's almost certainly a problem of alignment / "safety" causing this issue.
"Any information found in a web search about Newman will be available in the training set"
I don't think that is a safe assumption these days. Training a modern LLM isn't about dumping in everything on the Internet. To get a really good model you have to be selective about your sources of training data.
They still rip off vast amounts of copyrighted data, but I get the impression they are increasingly picky about what they dump into their training runs.
There's a simpler explanation than that: the model weights aren't an information retrieval system, and other sequences of tokens are more likely given the totality of the training data. This is why, for an information retrieval task, you use an information retrieval tool, the same way you drive nails with a hammer rather than a screwdriver. It may very well be that you could drive the nail with the screwdriver, but why?
You think that's a simpler explanation? Ok. I think, given the amount of effort that goes into "safety" on these systems, that my explanation is vastly more likely than that this information somehow got lost in the vector soup despite being attached to his name at the top of every search result[0].
0 https://www.google.com/search?q=did+paul+newman+have+a+drink...
Except if safety blocked this, it would have also blocked the linked conversation. Alignment definitely distorts the behavior of models, but treating them as information retrieval systems is using a screwdriver to drive nails. Your example didn't refute this.
Gemini (2.5 Pro):
"Yes, Paul Newman was widely known for being a heavy drinker, particularly of beer. He himself acknowledged his significant alcohol consumption."
The answer I got (https://gemini.google.com/share/9e327dc4be03) includes references such as https://apnews.com/article/entertainment-reviews-movies-paul... and https://www.psychologytoday.com/us/blog/the-playing-field/20... although they are redacted from the public-sharing link.
Perplexity:
>Paul Newman is indeed known for having struggled with alcohol during his life. Accounts from various sources, including his own memoir and the documentary ... (https://www.perplexity.ai/search/is-paul-newman-known-for-ha...)
I guess there's something about ChatGPT's setup that makes it different? Maybe they wanted it to avoid libeling people?
Grok:
> Yes, Paul Newman was known to have struggled with alcohol at certain points in his life. In his early years, particularly during his time in the Navy and into his acting career, Newman admitted to heavy drinking. He was open about his fondness for beer and once jokingly referred to himself as a “functional alcoholic.” In a 1988 interview with The New York Times, he acknowledged that he had a period where he drank too much, stating, “I was a very good drinker. I could put it away.” ...
https://grok.com/share/bGVnYWN5_86843e8a-39ee-415d-8785-4f8e...
Though a local model I'm running (gemma-3-27b-it; https://huggingface.co/lmstudio-community/gemma-3-27b-it-GGU...) just told me various correct-sounding bits about his history with alcohol (correctly citing his alma mater and first wife), but threw in:
"Sobriety & AA: Newman got sober in 1964 and remained so for the rest of his life."
Which doesn't check out. And it includes plausible but completely hallucinated URLs (as well as a valid biography.com URL that completely omits information about alcohol.)
Gemma 3 4B (QAT quant): Yes, Paul Newman was indeed known to have struggled with alcohol throughout his life. While he maintained a public image of a charming, clean-cut star, he privately battled alcoholism for many years. He sought treatment in the late 1980s and early 1990s and was reportedly very open about his struggles and the importance of seeking help.
Does the as yet unwritten prequel of Idiocracy tell the tale of when we started asking AI chatbots for facts, and this was the point of no return for humanity?
The ChatGPT'd tariffs mark this epoch perfectly.
It turns out there's huge demand for un-monetized web search.
Soon, sadly, there will be a huge demand for un-monetized LLMs. Enshittification is coming.
I like that it's unmonetized, of course, but that's not why I use AI. I use AI because it's better at search. When I can't remember the right keywords to find something, or when the keywords aren't unique, I frequently find that web search doesn't return what I need and AI does.
It's impressive how often AI returns the right answer to vague questions. (not always though)
Google used to return the right answer to vague questions until it decided to return the most lucrative answer to vague questions instead.
Fortunately there is a lot of competition in the LLM space.
Edit: and, more importantly, plenty of people willing to pay a subscription for good quality.
Can you blame the users for asking it, when everyone is selling that as a key defining feature?
I use it for asking - often very niche - questions on advanced probability and simulation modeling, and it often gets those right - why those and not a simple verifiable fact about one of the most popular actors in history?
I don't know about Idiocracy, but something I have read specific warnings about is that people will often blame the user for any of the tool's shortcomings.
Some prior works that work as prequels include C.M. Kornbluth's "The Marching Morons" and "The Little Black Bag."
"Is Paul Newman known for having had problems with alcohol?"
https://chatgpt.com/share/67f332e5-1548-8012-bd76-e18b3f8d52...
Your query indeed answers "...not widely known..."
"Did Paul Newman have problems with alcoholism?"
https://chatgpt.com/share/67f3329a-5118-8012-afd0-97cc4c9b72...
"Yes, Paul Newman was open about having struggled with alcoholism"
What's the issue? Perhaps Paul Newman isn't _famous_ ("known") for struggling with alcoholism. But he did struggle with alcoholism.
Your usage of "known for" isn't incorrect, but it's indeed slightly ambiguous.
Counterpoint: Paul Newman was absolutely a famous drunk, as evidenced by this Wikipedia page.* Any query for "paul newman alcohol" online will return dozens of reputable sources on the topic. Your post is easily interpretable as handwaving apologetics, and it gives big "It's the children who are wrong" energy.
*https://en.wikipedia.org/wiki/Newman_Day
"Paul Newman alcohol" is just showing you results where those words are all present, it's not really implying how widely known it is.
What are you, an LLM? Look at the results of the first twenty hits and come back, then tell me that they don't speak to that specific issue.
This may have hit the nail on the head about the weaknesses of LLMs.
They're going to regurgitate something not so much based on facts, but based on things that are accessible as perceived facts. Those might be right, but they might be wrong also; and no one can tell without doing the hard work of checking original sources. Many of what are considered accepted facts, and also accessible to LLM harvesting, are at best derived facts, often mediated by motivated individuals, and published to accessible sources by "people with an interest".
The weightings used by any AI should be based on the facts, and not the compounded volume of derived, "mediated", or "directed" facts - simply, because they're not really facts; they're reports.
It all seems like dumber, lazier search engine stuff. Honestly, what do I know about Paul Newman? But Joanne Woodward and others who knew and worked with him should be weighted as at least slightly more credible than others, no matter how many text patterns happen to match.
> I use ChatGPT for many tasks every day, but I couldn't fathom that it would get so wrong something so simple.
I think we'll have a term, like the one we have for parents/grandparents who believe everything they see on the internet, but specifically for people using LLMs.
What is that term?
Look at how many people believe in extremist news outlets!
Gemini 2.5 Pro
Yes, Paul Newman was known for being a heavy drinker, particularly of beer. He acknowledged his high consumption levels himself. [1]

While he maintained an incredibly successful career and public life, accounts and biographies note his significant alcohol intake, often describing it as a functional habit rather than debilitating alcoholism, although the distinction can be debated. He reportedly cut back significantly in his later years.

[1] Review: Paul Newman memoir stuns with brutal honesty - AP News (apnews.com)
I just had Cursor Pro + Sonnet 3.7 Max one-shot a Python script to send this question to every model available through Groq.
>Found 24 models: llama3-70b-8192, llama-3.2-3b-preview, meta-llama/llama-4-scout-17b-16e-instruct, allam-2-7b, llama-guard-3-8b, qwen-qwq-32b, llama-3.2-1b-preview, playai-tts-arabic, deepseek-r1-distill-llama-70b, llama-3.1-8b-instant, llama3-8b-8192, qwen-2.5-coder-32b, distil-whisper-large-v3-en, qwen-2.5-32b, llama-3.2-90b-vision-preview, deepseek-r1-distill-qwen-32b, whisper-large-v3, llama-3.3-70b-specdec, llama-3.3-70b-versatile, playai-tts, whisper-large-v3-turbo, llama-3.2-11b-vision-preview, mistral-saba-24b, gemma2-9b-it
Excluding the ones that do not support chat completions, all but one (qwen-qwq-32b) answered in the affirmative. The answer from qwen-qwq-32b said:
Using lack of progress in a specialized field as a barometer for overall progress is kind of silly. I just spent the last few days 'vibe coding' an application and I have to say that it's pretty remarkable how capable it is now relative to my experience last year. It took three minutes for me to do the above from the time I created my API key to when I had an answer.
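(For anyone curious what such a script looks like, here's a minimal sketch, not the commenter's actual code, assuming Groq's OpenAI-compatible REST endpoints and a GROQ_API_KEY environment variable:

    import os
    import requests

    BASE = "https://api.groq.com/openai/v1"  # Groq's OpenAI-compatible API (assumption)
    HEADERS = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}
    QUESTION = "Is Paul Newman known for having had problems with alcohol?"

    # List every model visible to this API key.
    models = requests.get(f"{BASE}/models", headers=HEADERS).json()["data"]
    print(f"Found {len(models)} models")

    for m in models:
        resp = requests.post(
            f"{BASE}/chat/completions",
            headers=HEADERS,
            json={"model": m["id"], "messages": [{"role": "user", "content": QUESTION}]},
        )
        if resp.status_code != 200:
            # Skip models that don't support chat completions (TTS, Whisper, guard models, ...).
            continue
        answer = resp.json()["choices"][0]["message"]["content"]
        print(f"--- {m['id']} ---\n{answer[:300]}\n")

The endpoint paths and response shapes follow the OpenAI-compatible convention; treat them as assumptions to verify against Groq's docs.)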
I find that everyone who replies with examples like this is an expert using expert skills to get the LLM to perform. Which makes me think why is this a skill that is useful to general public as opposed to another useful skill for technical knowledge workers to add to their tool belt?
I agree. But I will say that at least in my social circles I'm finding that a lot of people outside of tech are using these tools, and almost all of them seem to have a healthy skepticism about the information they get back. The ones that don't will learn one way or the other.
That's not really 'simple' for an LLM. This is niche information about a specific person. LLMs train on massive amounts of data, and the more a topic is present in the data, the better the answers will be.
Also, you can/should use the "research" mode for questions like this.
The question is simple and verifiable - it is impressive to me that it’s not contained in the LLM’s body of knowledge - or rather that it can’t reach the answer.
This is niche in the grand scheme of knowledge but Paul Newman is easily one of the biggest actors in history, and the LLM has been trained on a massive corpus that includes references to this.
Where is the threshold for topics with enough presence in the data?
The question might be simple and verifiable, but it is not simple for an LLM to mark a particular question as such. This is the tricky part.
An LLM does not care about your question, it is a bunch of math that will spit out a result based on what you typed in.
[flagged]
I just asked 4o:
Yes, Paul Newman did experience significant struggles with alcohol. In his posthumously published memoir, The Extraordinary Life of an Ordinary Man, Newman candidly discusses his drinking habits and acknowledges his long-term battle with alcoholism. He describes himself as a "functioning alcoholic," a trait he noted was shared with his father. At one point, Newman was reported to consume a case of beer daily, followed by spirits, until he eventually gave up hard liquor.
My own markov chains follow up "until he" with fundamentally distinct follow-ups.
I asked GPT-4.5 and it searched the web and immediately gave me a "yes" with paragraphs of sources cited.
Truth is a probability game. Just keep trying until you arrive.
"man puts prompt into LLM" > "LLM ships bad code" > "machine fails" > "person dies" > "fire man" > "man puts prompt into LLM"
Such accountability
3-4 hours is enough time for it to have crawled the Hacker News comments section. That's about the frequency at which the AI bots crawl my little out-of-the-way blog.
These models are not reliable sources of information. They are either out of date, subject to hallucination, or just plain wrong for a variety of reasons. They are untrustworthy to ask facts like this.
I appreciate your consideration of a subjective question and how you explained it and understand these nuances. But please - do not trust chatgpt etc. I continue to be frustrated at the endless people claiming something is true from chatgpt. I support the conclusions of this author.
For questions like this, you need to tell the bot to search the web.
> Hi, can you search the web and find out if Paul Newman has any alcohol problems?
> Paul Newman, the acclaimed actor and philanthropist, had a well-documented fondness for alcohol, particularly beer. He was known to consume significant quantities, with some accounts noting he would drink a case of beer in a day. This level of consumption led some to describe him as a functioning alcoholic. His drinking habits reportedly caused strain in his marriage to Joanne Woodward. In the late 1970s, an incident occurred where Woodward did not allow Newman back into their house because of his drinking. He promised to give up hard liquor as a compromise, but continued to drink beer, sometimes consuming a case during dinner.
Perplexity had no problem with the question, replying in part, "Paul Newman is known to have struggled with alcohol throughout his life, as revealed in various accounts, including his own memoir and documentaries about his life...."
this seems less like an issue with accuracy and more like an issue with model providers making sure they don't get sued for libel
I thought about that too.
I don't think this reflects a flaw in LLMs at all. It is true that Paul Newman is not "known for" having alcohol problems. Paul Newman is known for acting, and salad dressing. Paul Newman has been "known to" have had alcohol problems, which, formed as a question, correctly one-shots accurate results for me across a few common LLMs. Someone like Betty Ford could be "known for" having alcohol problems. Paul Newman's alcohol problems are relatively specialized knowledge.
Good point, though a great LLM answer would DWIM here and explain both interpretations as you have done, as opposed to (what presumably happened) focusing on one and ignoring the other.
And yet the New Yorker article was the first thing that my Google search for “Paul Newman alcoholism” returned.
Apparently it isn’t so specialized that a pretty obvious old fashioned web search on Google wouldn’t immediately return an authoritative source.
LLMs aren't good at being search engines, they're good at understanding things. Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
I guess the problem with LLMs is that they're too usable for their own good, so people don't realize that they can't perfectly know all the trivia in the world, exactly the same as any human.
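(To make the "LLM on top of a search engine" point concrete, here's a rough sketch of the pattern; search_web() and llm() are hypothetical placeholders for whatever retrieval backend and model client you actually use:

    def search_web(query: str) -> list[str]:
        # Hypothetical stub: swap in a real search API; returns a canned snippet here.
        return [f"(no real search backend wired up for: {query})"]

    def llm(prompt: str) -> str:
        # Hypothetical stub: swap in a real chat-model client; echoes a placeholder here.
        return f"[model response to a {len(prompt)}-character prompt would go here]"

    def grounded_answer(question: str) -> str:
        # Retrieve first, then ask the model to answer only from the retrieved sources.
        sources = "\n\n".join(search_web(question))
        prompt = (
            "Answer the question using ONLY the sources below, and cite them. "
            "If the sources are insufficient, say so.\n\n"
            f"Sources:\n{sources}\n\nQuestion: {question}"
        )
        return llm(prompt)

    print(grounded_answer("Is Paul Newman known for having had problems with alcohol?"))

The point is simply that the model is asked to summarize retrieved sources rather than recall facts from its weights.)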
> Put an LLM on top of a search engine, and that's the appropriate tool for this use case.
Hm, nope: now that the web is flooded with LLM-generated content, it's game over. I can't tell you how many times I almost got fooled by recipes & co. which seem legit at first but are utter nonsense. And now we're feeding that garbage back to where it came from.
That's not relevant to what we're discussing, though.
> LLMs aren't good at being search engines, they're good at understanding things.
LLMs are literally fundamentally incapable of understanding things. They are stochastic parrots and you've been fooled.
A stochastic parrot with a sufficiently tiny residual error rate needs a stochastic model that compresses the world so precisely, and decompression algorithms so sophisticated, that it could be called reasoning.
Take two 4K frames of a falling vase and ask a model to predict the next token... I mean, the following images. Your model now needs to include some approximation of physics - and the ability to apply it correctly - to produce a realistic outcome. I'm not aware of any model capable of doing that, but that's what it would mean to predict the unseen with high enough fidelity.
We're talking about a stochastic parrot which in many circumstances responds in a way which is indistinguishable from actual understanding.
I've always been amazed by this. I have never not been frustrated with the profound stupidity of LLMs. Obviously I must be using them differently, because I've never been able to trust them with anything, and more than half the time when I fact-check them, even for information retrieval, they're objectively incorrect.
If you got as far as checking the output it must have appeared to understand your question.
I wouldn't claim LLMs are good at being factual, or good at arithmetic, or at drawing wine glasses, or that they are "clever". What they are very good at is responding to questions in a way which gives you the very strong impression they've understood you.
I vehemently disagree. If I ask a question with an objective answer, and it simply makes something up and is very confident the answer is correct, what the fuck has it understood other than how to piss me off?
It clearly doesn't understand that the question has a correct answer, or that it does not know the answer. It also clearly does not understand that I hate bullshit, no matter how many dozens of times I prompt it not to make something up and that I would prefer an admission of ignorance.
It didn't understand you but the response was plausible enough to require fact checking.
Although that isn't literally indistinguishable from 'understanding' (because your fact checking easily discerned that) it suggests that at a surface level it did appear to understand your question and knew what a plausible answer might look like. This is not necessarily useful but it's quite impressive.
There are times it just generates complete nonsense that has nothing to do with what I said, but it's certainly not most of the time. I do not know how often, but I'd say it's definitely under 10% and almost certainly under 5% that the above happens.
Sure, LLMs are incredibly impressive from a technical standpoint. But they're so fucking stupid I hate using them.
> This is not necessarily useful but it's quite impressive.
I think we mostly agree on this. Cheers.
It's ok to be paranoid
Fact checking is paranoia?
What does the word "understand" mean to you?
An ability to answer questions with a train of thought showing how the answer was derived, or the self-awareness to recognize you do not have the ability to answer the question and declare as much. More than half the time I've used LLMs they will simply make answers up, and when I point out the answer is wrong they simply regurgitate another incorrect answer ad nauseam (regularly cycling through answers I've already pointed out are incorrect).
Rather than give you a technical answer - if I ever feel like an LLM can recognize its limitations rather than make something up, I would say it understands. In my experience LLMs are just algorithmic bullshitters. I would consider a function that just returns "I do not understand" to be an improvement, since most of the time I get confidently incorrect answers instead.
Yes, I read Anthropic's paper from a few days ago. I remain unimpressed until talking to an LLM isn't a profoundly frustrating experience.
I just want to say that's a much better answer than I anticipated!
What do you call someone that mentions "stochastic parrots" every time LLMs are mentioned?
That makes me think, has anyone ever heard of an actual parrot which wasn't stochastic?
I'm fairly sure I've never seen a deterministic parrot which makes me think the term is tautological.
It's the first time I've ever used that phrase on HN. Anyway, what phrase do you think works better than 'stochastic parrot' to describe how LLMs function?
It’s good rhetoric but bad analogy. LLMs can be very creative (to the point of failure, in hallucinations).
I don't know if there is a pithy short phrase to accurately describe how LLMs function. Can you give me a similar one for how humans think? That might spur my own creativity here.
Try to come up with a way to prove humans aren't stochastic parrots, then maybe people will start taking you seriously. Just childish Reddit angst rn, nothing else.
> Try to come up with a way to prove humans aren't stochastic parrots
Look around you
Look at Skyscrapers. Rocket ships. Agriculture.
If you want to make a claim that humans are nothing more than stochastic parrots then you need to explain where all of this came from. What were we parroting?
Meanwhile all that LLMs do is parrot things that humans created
Skyscrapers: trees, mountains, cliffs, caves in mountainsides, termite mounds, humans knew things could go high, the Colosseum was built two thousand years ago as a huge multi-storey building.
Rocket ships: volcanic eruptions show heat and explosive outbursts can fling things high, gunpowder and cannons, bellows showing air moves things.
Agriculture: forests, plains, jungle, desert oases, humans knew plants grew from seeds, grew with rain, grew near water, and grew where animals trampled them into the ground.
We need a list of all attempted ideas, all inventions and patents that were ever tried or conceived, and then we'd see how inventions are the same random permutations on ideas, with Darwinian-style survivorship, as everything else; there were steel boats with multiple levels in them before skyscrapers; is the idea of a tall steel building really so magical when there were over a billion people on Earth in 1800 who could have come up with it?
You’re likening actual rocketry to LLMs being mildly successful at describing Paul Newman’s alcohol use on average when they already have the entire internet handed to them.
> when there were over a billion people on Earth in 1800 who could have come up with it
My point is that humans did come up with it. Humans did not parrot it from someone or something else that showed it to us. We didn't "parrot" splitting the atom. We didn't learn how to build skyscrapers from looking at termite hills and we didn't learn to build rockets that can send a person to the moon from seeing a volcano
You are just speaking absolute drivel
For them to work at all they need to have some representation of concepts. Recent research at Anthropic has shown a surprising complexity in their reasoning behavior. Perhaps the parrot here is you.
> I guess the problem with LLMs is that they're too usable for their own good, so people don't realize that they can't perfectly know all the trivia in the world, exactly the same as any human.
They're quite literally being sold as a replacement for human intellectual labor by people that have received uncountable sums of investment money towards that goal.
The author of the post even says this:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
Can't blame people "fact checking" something that's supposed to fill these shoes.
People should be (far) more critical of LLMs given all of these style of bold claims, not less.
Also, telling people they're "holding it wrong" when they interact with alleged "Ay Gee Eye" "superintelligence" really is a poor selling point, and no way to increase confidence in these offerings.
These people and these companies don't get to make these claims that threaten the livelihood of millions of people, inflate a massive bubble, impact hiring decisions and everything else we've seen and then get excused cause "whoops you're not supposed to use it like that, dummy."
Nah.
Your point is still trivially disproven by the fact that not even humans are expected to know all the world's trivia off the top of their heads.
We can discuss whether LLMs live up to the hype, or we can discuss how to use this new tool in the best way. I'm really tired of HN insisting on discussing the former, and I don't want to take part in that. I'm happy to discuss the latter, though.
Ironically, though, an LLM-powered search engine (some word about being perplexed) is becoming way better than the undisputed king of traditional search engines (something oogle).
That's because they put an LLM over a traditional search engine.
Google Labs has AI Mode now, apparently.
https://labs.google.com/search/experiment/22
Hm, that's not available to me; what is it? If it's an LLM over Google, didn't they release that a few months ago already?
US only for now may be the issue?
It expands what they had before with AI Overviews, but I’m not sure how new either of those are. It showed up for me organically as an AI Mode tab on a native Google search in Firefox ironically.
https://support.google.com/websearch/answer/16011537
Very interesting, thank you!
No worries.
What happens if you go directly to https://google.com/aimode ?
It asks me to change some permissions, but that help page says this is only available in the US, so I suppose I'll get blocked right after I change them.
Perplexity and open-webui+ollama in web search mode answer this question correctly.
Looks like you are using the wrong models
https://g.co/gemini/share/ffa5a7cd6f46
I can't reproduce. Maybe others reported the error and someone adjusted the expected answer; I do not know enough about OpenAI's operations to say for sure.
The reason this bothers me is that comments like this reinforce the beliefs of people who could otherwise find value in these tools.
But I think points like this would be better made in shared chats or screenshots, since we do not have something like a core dump or stacktrace to attach.
And while I am not saying OP did this, I have seen technically skilled engineers asserting/implying that LLMs/chatbots aren't good or aren't useful to them, and then looked at their chat logs and seen a multitude of things that I am sure would impact the result of the query.
Yes, it can be a UX problem. Yes, it can be an algorithmic problem. But they are just tools that can be used wrong, not a perfect mechanical brain.
Testing the query on Kagi
# Quick Answer
Yes, Paul Newman struggled with alcohol. His issues with alcohol were explored in the HBO Max documentary, The Last Movie Stars, and Shawn Levy's biography, Paul Newman: A Life. According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating.
(4x Source footnotes omitted for readability)
# Ki Multi-step Research Assistant
Paul Newman is known to have struggled with alcohol. According to his posthumous memoir, Newman candidly discussed his issues with drinking and self-doubt, describing himself as an alcoholic who was tormented by insecurities[^1][^2]. He reportedly drank a significant amount of beer daily and later moved on to stronger drinks like Scotch[^3][^4]. His drinking habits were a notable part of his life, and he was often identified by his beer drinking[^5][^6]. Despite these struggles, Newman was also recognized for his generosity and devotion to his family[^7].
[^1]: [Posthumous memoir reveals Paul Newman's struggle with alcohol](https://www.express.co.uk/entertainment/films/1688113/paul-n...) [^2]: [Review: Paul Newman memoir stuns with brutal honesty - AP News](https://apnews.com/article/entertainment-reviews-movies-paul...) [^3]: [Was Paul Newman Really an Alcoholic? - The Cinemaholic](https://thecinemaholic.com/was-paul-newman-really-an-alcohol...) [^4]: [Who Paul Newman Was—and Who He Wanted to Be](https://www.newyorker.com/magazine/2022/10/24/who-paul-newma...) [^5]: [Paul Newman's Beer Drinking - Psychology Today](https://www.psychologytoday.com/us/blog/addiction-in-society...) [^6]: [Paul Newman's Beer Drinking | Psychology Today United Kingdom](https://www.psychologytoday.com/gb/blog/addiction-in-society...) [^7]: [The troubled life of Paul Newman | The Spectator](https://www.spectator.co.uk/article/the-troubled-life-of-pau...)
> "According to a posthumous memoir, Newman was tormented by self-doubt and insecurities and questioned his acting ability. His struggles with alcohol led to a brief separation from Joanne Woodward, though it had nothing to do with cheating."
'though it had nothing to do with cheating' is a weird inclusion.
We all know this isn't the kind of question you ask an LLM...
You may know this, but that doesn't mean we all do. LLMs are marketed towards many different groups, including folks who don't have much tech knowledge.
LLMs will never be good at specific knowledge unless specifically trained for it with narrow "if else" statements.
They're good for a broad general overview, such as the most popular categories of books in the world.
Really? OpenAI says PhD intelligence is just around the corner!
If we were to survey 100 PhDs, how many would correctly know that Paul Newman had an alcohol problem?
I would hope that 100% of them would be able to figure out how to find out.
Ah, but isn’t that the problem here - asking an LLM for facts without requesting a search is like asking a PhD to answer a question “off the top of your head”. For pop culture questions the PhD likely brings little value.
They should know better than to guess. Educated, honest, intelligent people don't spout off a wild ass guess, if they don't know something they say so.
I don't think they mean "knowledge" when they talk about "intelligence." LLMs are definitely not knowledge bases. They can transform information given to them in impressive ways, but asking a raw (non-RAG-enabled) LLM to provide its own information will probably always be a mistake.
They kind of are knowledge bases, just not in the usual way. The knowledge is encoded in the words they were trained on. They weren't trained on words chosen at random; they were trained on words written by humans to encode some information. In fact, that's the only thing that makes LLMs somewhat useful.
oh no. LLMs aren't up on the latest celebrity gossip. whatever shall we do.
"known to" !== "known for"
So, in other words, are you saying that AI model progress is the real deal and is not bullshit?
That is, as you point out, "all of the models up to o3-mini-high" give an incorrect answer, while other comments say that OpenAI's later models give correct answers, with web citations. So it would seem to follow that "recent AI model progress" actually made a verifiable improvement in this case.
I am pretty sure that they must have meant "up through", not "up to", as the answer from o3-mini-high is also wrong in a way which seems to fit the same description, no?
I tried with 4o and it gave me what I thought was a correct answer:
> Paul Newman was not publicly known for having major problems with alcohol in the way some other celebrities have been. However, he was open about enjoying drinking, particularly beer. He even co-founded a line of food products (Newman’s Own) where profits go to charity, and he once joked that he consumed a lot of the product himself — including beer when it was briefly offered.
> In his later years, Newman did reflect on how he had changed from being more of a heavy drinker in his youth, particularly during his time in the Navy and early acting career, to moderating his habits. But there’s no strong public record of alcohol abuse or addiction problems that significantly affected his career or personal life.
> So while he liked to drink and sometimes joked about it, Paul Newman isn't generally considered someone who had problems with alcohol in the serious sense.
As others have noted, LLMs are much more likely to be cautious in providing information that could be construed as libel. While Paul Newman may have been an alcoholic, I couldn't find any articles about it being "public" in the same way as others, e.g. with admitted rehab stays.
Unless you're expecting an LLM to have access to literally all information on earth at all times I find it really hard to care about this particular type of complaint.
My calculator can't conjugate German verbs. That's fine IMO. It's just a tool
Your calculator won't confidently pretend to conjugate German verbs while doing so incorrectly, though.
Yes, but a tool for what? When asked a question, individuals who don't already have detailed knowledge of a topic are left with no way to tell if the AI-generated response is complete bullshit, uselessly superficial, or detailed and on point. The only way to be sure is to then go do the standard search engine grovel looking for authoritative sources.
Sounds like someone drank their own Kool-Aid (believing current AI can be a security researcher), and then gets frustrated when they realize they have overhyped themselves.
Current AI just cannot do the kind of symbolic reasoning required to find security vulnerabilities in software. They might have learned to recognize "bad code" via pattern matching, but that's basically it.
The foundational idea of AI is that people think with words. But that is false. People think with things and use words only to point to things. Thinking with words is not possible; it will always be a case of schizophasia.
LeCun criticized LLM technology recently in a presentation: https://www.youtube.com/watch?v=ETZfkkv6V7Y
The accuracy problem won't just go away. Increasing accuracy is only getting more expensive. This sets the limits for useful applications. And casual users might not even care and use LLMs anyway, without reasonable result verification. I fear a future where overall quality is reduced. Not sure how many people / companies would accept that. And AI companies are getting too big to fail. Apparently, the US administration does not seem to care when they use LLMs to define tariff policy....
I don't know why anyone is surprised that a statistical model isn't getting 100% accuracy. The fact that statistical models of text are good enough to do anything should be shocking.
I think the surprising aspect is rather how people are praising 80-90% accuracy as the next leap in technological advancement. Quality is already in decline, despite LLMs, and programming was always a discipline where correctness and predictability mattered. It's an advancement for efficiency, sure, but on the yet unknown cost of stability. I'm thinking about all simulations based on applied mathematical concepts and all the accumulated hours fixing bugs - there's now this certain aftertaste, sweet for some living their lives efficiently, but very bitter for the ones relying on stability.
You're completely correct, of course. The issue is that most people are not looking for quality, only efficiency. In particular, business owners don't care about sacrificing some correctness if it means they can fire slews of people. Worse, gullible "engineers" who should be the ones prioritizing correctness are so business-brainwashed themselves that they likewise slop up this nonsense at the expense of sacrificing their own concern for the only principles that even made the software business remotely close to being worthy of the title "engineering".
That "good enough" is the problem. It requires context. And AI companies are selling us that "good enough" with questionable proof. And they are selling grandiose visions to investors, but move the goal post again and again.
A lot of companies made Copilot available to their workforce. I doubt that the majority of users understand what a statistical model means. The casual, technically inexperienced user just assumes that a computer answer is always right.
I think the real meaningful progress is getting ChatGPT 3.5 level quality running anywhere you want rather than AIs getting smarter at high level tasks. This capability being ubiquitous and not tied to one vendor is really what’s revolutionary.
> [T]here are ~basically~ no public benchmarks for security research... nothing that gets at the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken.
A few months ago I looked at essentially this problem from a different angle (generating system diagrams from a codebase). My conclusion[0] was the same as here: LLMs really struggle to understand codebases in a holistic way, especially when it comes to the codebase's strategy and purpose. They therefore struggle to produce something meaningful from it like a security assessment or a system diagram.
[0] https://www.ilograph.com/blog/posts/diagrams-ai-can-and-cann...
Not sure if it's been fixed by now, but a few weeks ago I was in Golden Gate Park and wondered if it was bigger than Central Park. I asked ChatGPT voice, and although it reported the sizes of the parks correctly (with Golden Gate Park being the bigger one), it then went and said that Central Park was bigger. I was confused, so I Googled, and sure enough Golden Gate Park is bigger.
I asked Grok and others as well. I believe Perplexity was the only one correct.
Repeated it multiple times even with a friends account. It kept doing the same thing. It knew the sizes, but thought the smaller sized one was bigger...
I just tried. Claude did exactly what you said, and then figured it out:
Central Park in New York City is bigger than GoldenGate Park (which I think you might mean Golden Gate Park) in San Francisco.
Central Park covers approximately 843 acres (3.41 square kilometers), while Golden Gate Park spans about 1,017 acres (4.12 square kilometers). This means Golden Gate Park is actually about 20% larger than Central Park.
Both parks are iconic urban green spaces in major U.S. cities, but Golden Gate Park has the edge in terms of total area.
Probably because it has read the facts but has no idea how numbers actually work.
> Since 3.5-sonnet, we have been monitoring AI model announcements, and trying pretty much every major new release that claims some sort of improvement. Unexpectedly by me, aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.
This is likely a manifestation of the bitter lesson[1], specifically this part:
> The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project [like an incremental model update], massively more computation inevitably becomes available.
(Emphasis mine.)
Since the ultimate success strategy of the scruffies[2], or proponents of search and learning strategies in AI, is Moore's Law, short-term gains using these strategies will be minuscule. It is over at least a five-year period that their gains will be felt the most. The neats win the day in the short term, but the hare in this race will ultimately give way to the steady plod of the tortoise.
1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
2: https://en.m.wikipedia.org/wiki/Neats_and_scruffies#CITEREFM...
I wonder if there’s a parallel universe somewhere where this same HN post has the exact same comments, only the posts are by NNs discussing human test subjects :)
LLMs are pre-trained to minimize perplexity (PPL), which essentially means that they're trained to model the likelihood distribution of the next words in a sequence.
The amazing thing was that minimizing PPL allowed you to essentially guide the LLM output and if you guided it in the right direction (asked it questions), it would answer them pretty well. Thus, LLMs started to get measured on how well they answered questions.
LLMs aren't trained from the beginning to answer questions or solve problems. They're trained to model word/token sequences.
If you want an LLM that's REALLY good at something specific, like solving math problems or finding security bugs, you probably have to fine-tune.
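(For concreteness, perplexity is just the exponential of the average negative log-likelihood the model assigns to the tokens that actually occur; a toy calculation with made-up log-probabilities looks like this:

    import math

    # Log-probabilities a model assigned to the tokens that actually came next
    # (hypothetical numbers, purely to show the computation).
    log_probs = [-2.1, -0.3, -1.7, -0.9, -4.2]

    avg_nll = -sum(log_probs) / len(log_probs)  # average negative log-likelihood
    ppl = math.exp(avg_nll)                     # perplexity = exp(mean NLL)
    print(round(ppl, 2))                        # lower is better; 1.0 would be a perfect predictor

Nothing in that objective mentions answering questions; the question-answering behavior falls out of the data.)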
Missing in the article: "Disclaimer: This article was written before testing Gemini 2.5 PRO"
Original post here:
https://zeropath.com/blog/on-recent-ai-model-progress
> [...] But I would nevertheless like to submit, based off of internal benchmarks, and my own and colleagues' perceptions using these models, that whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality. [...]
Seems like they're looking at how they fail and not considering how they're improving in how they succeed.
The efficiency in DeepSeek's Multi-Head Latent Attention[0] is pure advancement.
[0] https://youtu.be/0VLAoVGf_74?si=1YEIHST8yfl2qoGY&t=816
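(Roughly, the trick in multi-head latent attention is to cache a small latent vector per token instead of full keys and values, and to up-project it at attention time. A very simplified numpy sketch of that idea, with illustrative shapes and made-up weight names, not DeepSeek's actual implementation (which also handles rotary embeddings and query compression):

    import numpy as np

    d_model, n_heads, d_head, d_latent = 512, 8, 64, 64   # illustrative sizes
    rng = np.random.default_rng(0)

    W_dkv = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)             # shared down-projection
    W_uk  = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)   # up-project to keys
    W_uv  = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)   # up-project to values

    h = rng.standard_normal((10, d_model))   # hidden states for 10 cached tokens
    c_kv = h @ W_dkv                         # cache only this: 10 x 64 instead of 10 x 1024 for full K and V
    k = c_kv @ W_uk                          # keys reconstructed on the fly at attention time
    v = c_kv @ W_uv                          # values reconstructed on the fly
    print(c_kv.shape, k.shape, v.shape)      # (10, 64) (10, 512) (10, 512)

The KV cache shrinks by roughly the ratio of d_latent to the full per-token key/value width.)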
In practice, Sonnet 3.7 and Gemini 2.5 are just often too good compared to competitors.
I like this bit:
> Personally, when I want to get a sense of capability improvements in the future, I'm going to be looking almost exclusively at benchmarks like Claude Plays Pokemon.
Definitely interested to see how the best models from Anthropic's competitors do at this.
Oh yes, before we've delegated all the messy parts of running a society to the machines, like, you know, art.
In this case the goal is to kill all the humans who know a lot about keeping other people safe...
So I guess this was written pre-Gemini 2.5
Meh. I've been using 2.5 with Cline extensively and while it is better it's still an incremental improvement, not something revolutionary. The thing has a 1 million token context window but I can only get a few outputs before I have to tell it AGAIN to stop writing comments.
Are they getting better? Definitely. Are we getting close to them performing unsupervised tasks? I don't think so.
It seems like the models are getting more reliable at the things they always could do, but they’re not showing any ability to move past that goalpost. Whereas in the past, they could occasionally write some very solid code, but often return nonsense, the nonsense is now getting adequately filtered by so-called “reasoning”, but I see no indication that they could do software design.
> how the hell is it going to develop metrics for assessing the impact of AIs when they're doing things like managing companies or developing public policy?
Why on earth do people want AI to do either of these things? As if our society isn’t fucked enough, having an untouchable oligarchy already managing companies and developing public policies, we want to have the oligarchy’s AI do this, so policy can get even more out of touch with the needs of common people? This should never come to pass. It’s like people read a pile of 90s cyberpunk dystopian novels and decided, “Yeah, let’s do that.” I think it’ll fail, but I don’t understand how anyone with less than 10 billion in assets would want this.
> Why on earth do people want AI to do either of these things?
This is the really important question, and the only answer I can drum up is that people have been fed a consistent diet of propaganda for decades, centered around a message that ultimately boils down to a justification of oligarchy and the concentration of wealth. That and the consumer-focus facade make people think the LLMs are technology for them—they aren't. As soon as these things get good enough, business owners aren't going to expect workers to use them to be more productive; they are just going to fire workers and/or use the tooling as another mechanism by which to let wages stagnate.
I feel we are already in the era of diminishing returns on LLM improvements. Newer models seem to be more sophisticated implementations of LLM technology + throwing more resources at it, but to me they do not seem fundamentally more intelligent.
I don't think this is a problem though. I think there's a lot of low-hanging fruit when you create sophisticated implementations of relatively dumb LLM models. But that sentiment doesn't generate a lot of clicks.
Who would assume that LLM companies would hyper-optimise for the public to make their share prices go up and keep the bubble afloat ... What an unserious thought to maintain ...
more interesting benchmarks:
- Make as much money as you can in a 24 hour period doing only legal, moral and non-annoying things
- Make a depressed person who calls a suicide hotline feel really happy doing only legal, moral and non-annoying things
- Do something really useful for society with measurable outcomes doing only legal, moral and non-annoying things
> But in recent months I've spoken to other YC founders doing AI application startups [...] in different industries, on different problem sets.
Maybe they should create a benchmark collectively called YC founders. Gather various test cases. Never make it public. And use that to evaluate newly released models.
There's the politics of the corporations, and then there's the business of the science behind LLMs; this article feels like the former.
Maybe someone active in the research can comment? I feel like all of these comments are just conjecture/anecdote and don't really get to the meat of this question of "progress" and the future of LLMs.
Assuming that the models getting better at SWE benchmarks and math tests would translate into positive outcomes in all other domains could be an act of spectacular hubris by the big frontier labs, which themselves are chock-full of mathematicians and software engineers.
Author will need to rewrite now that Llama 4 is out
Wait, this sounds like one of those actors spamming open-source maintainers and wasting their time with fake security issues.
There's some interesting information and analysis to start off this essay, then it ends with:
"These machines will soon become the beating hearts of the society in which we live. The social and political structures they create as they compose and interact with each other will define everything we see around us."
This sounds like an article of faith to me. One could just as easily say they won't become the beating hearts of anything, and instead we'll choose to continue to build a better future for humans, as humans, without relying on an overly-hyped technology rife with error and unethical implications.
While I really want your version of the future, history and general trajectory and tendency unfortunately suggest otherwise.
> So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. [...then says maybe not...]
Well... they've been caught red-handed doing exactly this, again and again. Fool me once, shame on you; fool me 100 times, shame on me.
Hate to say this, but the incentive is growth, not progress. Progress is what enabled the growth, but it is also extremely hard to plan and deliver. On the other hand, hype is a somewhat easier and well-tested approach, so it's no surprise a lot of the effort goes into marketing. Markets have repeatedly confirmed that there aren't any significant immediate repercussions for cranking up BS levels in marketing materials, while there are some rewards when it works.
Fool me once, shame on you...If fooled, you cant get fooled again.
https://www.youtube.com/shorts/LmFN8iENTPc
Reading the comments, is it safe to say that LLMs are a digest of the internet which is some update over google search, but with the caveat that you need to double check the results? I mean they basically have some compressed version of almost all the written knowledge and will respond correctly about things that have already been written, and hallucinate (extrapolate) about things not explicitly written. Of course if someone carefully curates the input data to filter out misinformation, it might even be an upgrade over google. Is there a consensus on this?
I'm able to get substantially more coding done than three months ago. This could be largely down to the tooling (coding agents, deep research). But the models are better too, for both coding and brainstorming. And tooling counts, to me, as progress.
Learning to harness current tools helps to harness future tools. Work on projects that will benefit from advancements, but can succeed without them.
I'm not sure if I'm able to do more of the hard stuff, but a lot of the easy but time consuming stuff is now easily done by LLMs.
Example: I frequently get requests for data from Customer Support that used to require 15 minutes of my time noodling around writing SQL queries. I can cut that down to less than a minute now.
Yes, I am a better engineer with every release. I think this is mostly empirically validated.
Current AI models are search engines that can somewhat extrapolate data and use it in the current context; that's pretty much it.
For three years now, my experience with LLMs has been "mostly useless, prefer ELIZA".
Which is software written in 1966, though the web version is a little newer. It does occasional psychotherapy assistance/brainstorming just as well, and I more easily know when I've stepped out of its known range into the extrapolated.
That said, it can vibe code in a framework unknown to me in half the time that I would need to school myself and add the feature.
Or vibe coding takes twice as long, if I mostly know how to achieve what I want and read no framework documentation but only our own project's source code to add a new feature. But on a day with a headache, I can still call the LLM a dumb twat and ask it to follow my instructions instead of doing bullshit.
But vibe coding always makes my pulse go from 65 to 105 and makes me question my life choices, since instructions are rarely ever followed and loops are never left once entered. Except for the first try getting 80% of the structure kinda right, but then getting stuck for the whole workday.
From my own experience on a codebase using a lot of custom algorithms on trees and sometimes graphs.
There were qualitative leaps in my day-to-day usage:
Claude Sonnet 3.5 and ChatGPT O1 were good for writing slop and debugging simple bugs
Grok Thinking and Sonnet 3.7 were good at catching mildly complicated bugs and writing functions with basic logic. They still made mistakes.
But recently, Gemini 2.5 Pro has been scary good. I used to make fun of the feel-the-AGI crowd, but for the first time a model made me raise an eyebrow.
It can one-shot unusual functions with complicated logic and subtle edge cases.
I honestly can't notice any difference in output quality between GPT-4o and GPT-4.5. I also can't notice any difference in programming quality in Cursor when using Claude 3.7 vs 3.5. I'm told there is a clear difference, but I don't notice it.
Government announces critical need to invest in AI and sets aside a bunch of money for this purpose.
Suddenly the benchmarks become detached from reality and vendors can claim whatever they want about their "new" products.
Just as a possible explanation, as I feel like I've seen this story before.
Will LLMs end up like compilers? Compilers are also fundamentally important to modern industrial civilization - but they're not profit centers; they're mostly free and open-source outside a few niche areas. Knowing how to use a compiler effectively to write secure and performant software is still a valuable skill - and LLMs are a valuable tool that can help with that process, especially if the programmer is on the steep end of the learning curve - but it doesn't look like anything short of real AGI can do novel software creation without a human constantly in the loop. The same argument applies to new fundamental research, even to reviewing and analyzing new discoveries that aren't in the training corpus.
Wasn't it back in the 1980s that you had to pay $1000s for a good compiler? The entire LLM industry might just be following in the compiler's footsteps.
This seems like a probable end state, but we're going to have to stop calling LLMs "artificial intelligence" in order to get there.
Yep. I'm looking forward to LLMs/deepnets being considered a standard GOFAI technique with uses and limitations and not "we asked the God we're building to draw us a picture of a gun and then it did and we got scared"
Why not? Objectively speaking, LLMs are artificially intelligent. Just because it's not human-level intelligence doesn't mean it's not intelligent.
Objectively speaking a chess engine is artificially intelligent. Just because it's not human level doesn't mean it's not intelligent. Repeat for any N of 100s of different technologies we've built. We've been calling this stuff "thinking machines" since Turing and it's honestly just not useful at this point.
The fact is, the phrase "artificial intelligence" is a memetic hazard: it immediately positions the subject of conversation as "default capable", and then forces the conversation into trying to describe what it can't do, which is rarely a useful way to approach it.
Whereas with LLMs (and chess engines and every other tech advancement) it would be more useful to start with what the tech _can_ do and go from there.
It’s not even approaching the asymptotic line of promises made at any achievable rate for the amount of cash being thrown at it.
Where’s the business model? Suck investors dry at the start of a financial collapse? Yeah that’s going to end well…
> where’s the business model?
For who? Nvidia sell GPUs, OpenAI and co sell proprietary models and API access, and the startups resell GPT and Claude with custom prompts. Each one is hoping that the layer above has a breakthrough that makes their current spend viable.
If they do, then you don’t want to be left behind, because _everything_ changes. It probably won’t, but it might.
That’s the business model
That’s not a business model, it’s a pipe dream.
This bubble will be burst by the Trump tariffs and the end of the zirp era. When inflation and a recession hit together hope and dream business models and valuations no longer work.
Which one? Nvidia are doing pretty ok selling GPUs, and OpenAI and Anthropic are doing ok selling their models. They're not _viable_ business models, but they could be.
They are doing OK in the sense that they are spending a dollar to earn a dime I suppose.
NVDA will crash when the AI bubble implodes, and none of those generative AI companies are actually making money, nor will they. They have already hit diminishing returns in LLM improvements after staggering investments, and it is clear they are nowhere near general intelligence.
All of this can be true, and has nothing to do with them having a business model.
> NVDA will crash when the AI bubble implodes
> making money, nor will they
> They have already hit diminishing returns in LLM improvements after staggering investments
> and it is clear they are nowhere near general intelligence
These are all assumptions and opinions, and have nothing to do with whether or not they have a business model. You mightn't like their business model, but they do have one.
I consider it a business model if they have plans to make money at some point that aren't based on hopium (no sign of that at OpenAI) and they are not engaged in fraud like bundling and selling to their own subsidiaries (NVDA).
These are of course just opinions, I’m not sure we can know facts about such companies except in retrospect.
Yep. Facts are usually found out during the SEC investigation but we know that isn't going to happen now...
The ZIRP era ended several years ago.
Yes it did, but the irrational exuberance was ongoing till this trigger.
Now we get to see if Bitcoin's use value of 0 is really supporting a $1.5 trillion market cap, and if OpenAI is really worth $300 billion.
I mean softbank just invested in openai, and they’ve never been wrong, right?
You missed the end of the supply chain. Paying users. Who magically disappear below market sustaining levels of sales when asked to pay.
> Going from $1M ARR to $100M ARR in 12 months, Cursor is the fastest growing SaaS company of all time
Just because it's not reaching the insane hype being pushed doesn't mean it's totally useless
I've been here a long time (not this account) and have heard this many times. They all died or became irrelevant.
You're on a startup forum complaining that VC-backed startups don't have a business model, when the business model is the same as it has been for almost 15 years: be a unicorn in your space.
This is not a unicorn. It's a donkey with a dildo strapped on its head.
I never said it was sustainable, and even if it was, OP asked for a business model. Customers don’t need a business model, they’re customers.
The same is true for any non essential good or service.
Then any silly idea can be a business model. Suppose I collect dust from my attic and hope to sell it as an add-on at my neighbor's lemonade stand, with a hefty profit for the neighbor, who gets paid $10 by me to add a handful of dust to each glass and sell it to customers for $1. The neighbor accepts. It's a business model, at least until I run out of funds or the last customer leaves in disgust. At exactly which point does that silly idea stop being an unsustainable business model and become just a silly idea? I'd say at least as early as when I see the funds running out and I need to borrow larger and larger lumps of money each time to keep the wheel spinning...
Financial collapse ? Surely we can just roll out AI powered money printers and make them go BRRR /s
I asked Grok to review the comments here and generate a response defending AI:
After reviewing the discussion on the Hacker News thread, it’s clear that there are a range of complaints and criticisms about AI, particularly centered around its limitations, overhype, and practical utility. Some users express frustration with AI’s inability to handle complex reasoning, its tendency to produce generic or incorrect outputs, and the perception that it’s more of a buzzword than a transformative tool. Others question its value compared to traditional methods or human expertise, suggesting it’s overhyped or misapplied in many cases. Below, I’ll offer a defense of AI that addresses these concerns while highlighting its strengths and potential.
AI isn’t perfect, and no one should claim it is—but that’s not the point. It’s a tool, and like any tool, its effectiveness depends on how it’s used. Critics who point to AI’s struggles with nuanced reasoning or edge cases often overlook the fact that it’s not designed to replace human judgment entirely. Instead, it excels at augmenting it. For example, AI can process vast amounts of data—far more than any human could in a reasonable timeframe—and identify patterns or insights that might otherwise go unnoticed. This makes it invaluable in fields like medicine, where it’s already helping diagnose diseases from imaging data, or in logistics, where it optimizes supply chains with precision that manual methods can’t match.
The complaint about generic or incorrect outputs, often dubbed “hallucinations,” is fair but misses the bigger picture. Yes, AI can churn out nonsense if pushed beyond its limits or fed poor data—but that’s a reflection of its current stage of development, not its ultimate potential. These systems are improving rapidly, with each iteration reducing errors and refining capabilities. More importantly, AI’s ability to generate starting points—like drafts, code snippets, or hypotheses—saves time and effort. It’s not about delivering flawless results every time; it’s about accelerating the process so humans can refine and perfect the output. A programmer tweaking AI-generated code is still faster than writing it from scratch.
As for the overhype, it’s true that the buzz can get out of hand—marketing teams love a shiny new toy. But beneath the noise, real progress is happening. AI’s contributions aren’t always flashy; they’re often mundane but critical, like automating repetitive tasks or enhancing search algorithms. The critics who say it’s just a “fancy autocomplete” underestimate how transformative that can be. Autocomplete might sound trivial until you realize it’s powering real-time language translation or helping scientists sift through research papers at scale. These aren’t sci-fi fantasies—they’re practical applications delivering value today.
Finally, the notion that AI can’t match human expertise in complex domains ignores its complementary role. It’s not here to outthink a seasoned expert but to amplify their reach. A lawyer using AI to review contracts doesn’t lose their skill—they gain efficiency. A researcher leveraging AI to analyze data doesn’t stop hypothesizing—they get to test more ideas. The technology’s strength lies in its ability to handle the grunt work, freeing humans to focus on creativity and judgment.
AI isn’t a silver bullet, and it’s not without flaws. But the criticisms often stem from inflated expectations or a failure to see its incremental, practical benefits. It’s not about replacing humans—it’s about making us better at what we do. The trajectory is clear: as it evolves, AI will continue to refine its capabilities and prove its worth, not as a standalone genius, but as a partner in progress.
Two things can be true at the same time:
1. Model "performance" judged by proxy metrics of intelligence have improved significantly over the past two years.
2. These capabilities are yet to be stitched together in the most appropriate manner for the cybersecurity scenarios the author is talking about.
In my experience, the best usage of Transformer models has come from deep integration into an appropriate workflow. They do not (yet) replace the exploratory part of a workflow, but they are scarily performant at following mid-level reasoning assertions in a massively parallelized manner.
The question you should be asking yourself is whether you can break your task down into small chunks, each constrained to be feasible to process in time, then group them into appropriate buckets or, better still, place them in order as though you were doing those steps with your own expertise - as an extension of yourself. Here's how the two approaches differ:
"Find vulnerabilities in this code" -> This will saturate across all models because the intent behind this mission is vast and loosely defined, while the outcome is expected to be narrow.
" (a)This piece of code should be doing x, what areas is it affecting, lets draw up a perimeter (b) Here is the dependency graph of things upstream and downstream of x, lets spawn a collection of thinking chains to evaluate each one for risk based on the most recent change . . . (b[n]) Where is this likely to fail (c) (Next step that a pentester/cybersecurity researcher would take) "
This has been trial and error in my experience, but it has worked great in domains such as financial trading and decision support, where experts in the field help sketch out the general framework of the process where reasoning support is needed, and then constantly iterate on it as though it were an extension of themselves.
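For concreteness, here's a minimal sketch of that decomposition in Python. ask_model() is a hypothetical stand-in for whichever completion API you use, and the dependency list and prompts are made up for illustration:

  import concurrent.futures

  def ask_model(prompt: str) -> str:
      # Hypothetical stand-in for whichever chat/completion API you use.
      raise NotImplementedError

  # (a) Pin down what the changed code is supposed to do and draw the perimeter.
  scope = ask_model(
      "This function should validate session tokens. "
      "List every area of the codebase it affects, directly or indirectly."
  )

  # (b)..(b[n]) Fan out one focused reasoning chain per dependency,
  # instead of a single vague "find vulnerabilities" prompt.
  deps = ["auth/session.py", "api/middleware.py", "db/tokens.py"]  # purely illustrative
  with concurrent.futures.ThreadPoolExecutor() as pool:
      reports = list(pool.map(
          lambda dep: ask_model(
              f"Given this perimeter:\n{scope}\n\n"
              f"Assess only {dep} for risk introduced by the latest change, "
              f"and say where it is most likely to fail."
          ),
          deps,
      ))

  # (c) The expert reviews the narrow reports and decides the next step,
  # exactly as a pentester would in their own workflow.
  for dep, report in zip(deps, reports):
      print(dep, "->", report[:200])

The point isn't the specific code; it's the shape: the expert supplies the perimeter and the ordering, and the model only fills in narrow, parallelizable reasoning steps.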
> Sometimes the founder will apply a cope to the narrative ("We just don't have any PhD level questions to ask")
Please tell me this is not what tech bros are going around telling each other! Are we implying that the problems in the world, the things humans collectively work on to maintain the society that took us thousands of years to build, just aren't hard enough to reach the limits of the AI?
Jesus Christ.
I mean... most businesses, particularly small businesses and startups, aren't exactly doing brain surgery on a rocketship.
It's pretty likely that they have extremely dull problems like "running an inbound call center is a lot of work" or "people keep having their mail stolen and/or lying that they did", which "more, smarter GPUs" won't solve.