With the right knowledge and a few web searches, one can answer this question in a matter of minutes. The model fumbled around modding forums and other sites and did manage to find some good information, but then started to hallucinate some details and used them in its further research. The end result it gave me was incorrect, and the steps it described to get the value were totally fabricated.
What’s even worse, in the thinking trace it looks like it is aware that it does not have an answer and that 399 is just an estimate. But in the answer itself it confidently states it found the correct value.
Essentially, it hid the fact that it doesn’t really know and handed me an estimate without telling me.
Now, I’m perfectly aware that this is a very niche topic, but at this point I expect the AI to either find me a good answer or tell me it couldn’t. Not to lie to my face.
> Formulate Conclusion: Since a definitive list or count isn't readily available through standard web searches, the best approach is to: state that an exact count is difficult to ascertain from readily available online sources without direct analysis of game files ... avoid giving a specific number, as none was reliably found across multiple sources.
I like this answer. It does mention the correct, definitive way of getting the information I want (extracting the char.lgp data file), so even though it gave up, it pushes you in the right direction, whereas o3/o4 just make stuff up.
I've used AI with "niche" programming questions and it's always a total let down. I truly don't understand this "vibe coding" movement unless everyone is building todo apps.
LLMs are great at interpolating and extrapolating based on context. Interpolating is far less error-prone. The problem with interpolating is that you need to start with accurate points so that interpolating between them leads to expected and relatively accurate estimates.
What we are seeing is the result of developers being oblivious to higher-level aspects of coding, such as software architecture, proper naming conventions, disciplined choice of dependencies and dependency management, and even best practices. Even basic requirements-gathering.
Their own personal experience is limited to diving into existing code bases and patching them here and there. They often screw up the existing software architecture because their lack of insight and awareness leads them to post PRs that get the job done at the expense of turning the whole codebase into an unmanageable mess.
So these developers crack open an LLM and prompt it to generate code. They use their insights and personal experience to guide their prompts. Their experience reflects what they do on a daily basis. The LLMs of course generate code from their prompts, and the result is underwhelming. Garbage-in, garbage-out.
It's the LLM's fault, right? All the vibe coders out there showcasing good results must be frauds.
The telltale sign of how poor these developers are is how they pin their failure to get LLMs to generate acceptable results on the models not being good enough. The same models that have proven effective at creating whole projects from scratch are, in their hands, incapable of the smallest changes. It sounds weird, right? If only the models were better... Better at what? At navigating your input to achieve things that others already achieve? That's certainly the model's fault, isn't it?
Yes, with a bit of work around prompting and focusing on closed context, or as you put it, interpolating, you can get further. But the problem is that this is not how LLMs were sold. If you blame someone for trying to use them with fairly high-level prompts - well, isn't that exactly how this technology has been advertised the whole time? The problem is not the bad workman; the problem is that the tool is not doing what it is advertised as doing.
Yeah, this is a great summary of what I do as well, and I find it very effective. I think of hands-off AI coding like directing a movie. You have a rough image of what "good" looks like in your head, and you're trying to articulate it with enough detail to all the stagehands and actors that they can realize the vision. The models can always get there with enough coaching; traditionally the question is whether that's worth the trouble versus just doing it yourself.
Increasingly I find that AI at this point is good enough I am rarely stepping in to "do it myself".
It's incredible when I ask Claude 3.7 a question about Typescript/Python and it can generate hundreds of lines of code that are pretty on point (it's usually not exactly correct on first prompt, but it's coherent).
I've recently been asking questions about Dafny and Lean -- it's frustrating that it will completely make up syntax and features that don't exist, but still speak to me with the same confidence as when it's talking about Typescript. It's possible that shoving lots of documentation or a book about the language into the context would help (I haven't tried), but I'm not sure if it would make up for the model's lack of "intuition" about the subject.
Recommend using RAG for this. Make the Haskell or OCaml documentation your knowledge base and index it for RAG. Then it makes a heck of a lot more sense!
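To make that concrete, here's a minimal sketch of the idea in Python (no vector DB, naive keyword-overlap retrieval, and the docs file path is made up; a real setup would use embeddings and a proper store):
```
# Minimal RAG sketch: index language docs, pull the most relevant chunks
# into the prompt. Retrieval here is naive keyword overlap, just to show the shape.

def load_chunks(path="ocaml_docs.txt"):  # hypothetical docs dump
    text = open(path, encoding="utf-8").read()
    return [c.strip() for c in text.split("\n\n") if c.strip()]

def score(chunk, question):
    q_terms = set(question.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms)

def build_prompt(question, chunks, k=3):
    top = sorted(chunks, key=lambda c: score(c, question), reverse=True)[:k]
    context = "\n---\n".join(top)
    return f"Answer using only this documentation:\n{context}\n\nQuestion: {question}"

chunks = load_chunks()
prompt = build_prompt("How do I pattern match on a variant type?", chunks)
# prompt is then sent to whatever model/API you're using
```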
You can get pretty good results by copying the output from Firefox's Reader View into your project, for example:
about:reader?url=https://learnxinyminutes.com/ocaml/
Well, all LLMs are fairly bad at React Native as soon as you look at more than hello-world type things.
I got stuck with different LLMs until I checked the official documentation - spouting nonsense about features removed 2+ years ago, I suppose, or just making things up.
I'm trialing co-pilot in VSCode and it's a mixed bag. Certain things it pops out great, but a lot of times I'll be like woohoo! <tab> <tab> <tab> and then end up immediately realising wait a sec, none of this is actually needed, or it's just explicitly calling for things that are already default values, or whatever.
(This is particularly in the context of metadata-type stuff, things like pyproject files, ansible playbooks, Dockerfiles, etc)
I recently exclaimed that “vibe coding is BS” to one of my coworkers before explaining that I’ve actually been using GPT, Claude, llama (for airplanes), Cline, Cursor, Windsurf, and more for coding for as long as they’ve been available (more recently playing with Gemini). Cline + Sonnet 3.7 has been giving me great results on smaller projects with popular languages, and I feel truly fortunate to have AWS Bedrock on tap to drive this stuff (no effective throttling/availability limits for an individual dev). Even llama + Continue has proven workable (though it will absolutely hallucinate language features and APIs).
That said, 100% pure vibe coding is, as far as I can tell, still very much BS. The subtle ugliness that can come out of purely prompt-coded projects is truly a rat hole of hate, and results can get truly explosive when context windows saturate. Thoughtful, well-crafted architectural boundaries and protocols call for forethought and presence of mind that isn’t yet emerging from generative systems. So spend your time on that stuff and let the robots fill in the boilerplate. The edges of capability are going to keep moving/growing, but it’s already a force multiplier if you can figure out ways to operate.
For reference, I’ve used various degrees of assistance for color transforms, computer vision, CNN network training for novel data, and several hundred smaller problems. Even if I know how to solve a problem, I generally run it through 2-3 models to see how they’ll perform. Sometimes they teach me something. Sometimes they violently implode, which teaches me something else.
> That said, 100% pure vibe coding is, as far as I can tell, still very much BS.
I don't really agree. There's certainly a showboating factor, not to mention there is currently a goldrush to tap this movement and capitalize on it. However, I personally managed to create a fully functioning web app from scratch with Copilot + VS Code using a mix of GPT-4 and o1-mini. I'm talking about both backend and frontend, with basic auth in place. I am by no means an expert, but I did it in an afternoon. Call it BS, but the truth of the matter is that the app exists.
People were making a front and backend web app in half a day using Ruby on Rails way before LLMs were ever a thing, and their code quality was still much better than yours!
So vibe coding, sure you can create some shitty thing which WORKS, but once it becomes bigger than a small shitty thing, it becomes harder and harder to work with because the code is so terrible when you're pure vibe coding.
> People were making a front and backend web app in half a day using Ruby on Rails way before LLMs were ever a thing, and their code quality was still much better than yours!
A few people were doing that.
With LLMs, anyone can do that. And more.
It's important to frame the scenario correctly. I repeat: I created everything in an afternoon just for giggles, and I challenged myself to write zero lines of code.
> So vibe coding, sure you can create some shitty thing which WORKS (...)
You're somehow blindly labelling a hypothetical output as "shitty", which only serves to show your bias. In the meantime, anyone who is able to churn out a half-functioning MVP in an afternoon is praised as a 10x developer. There's a contrast in there, where the same output is described as shitty or outstanding depending on who does it.
People who embrace vibe coding are probably the same people who were already pseudo-vibe coding to begin with, using found fragments of code they could piece together to make things sort of work for simple tasks.
I mean, I don't think you need to do cutting edge programming to make something personal to you. See here from Canva's product. Check this out: https://youtu.be/LupwvXsOQqs?t=2366
> I've used AI with "niche" programming questions and it's always a total let down.
That's perfectly fine. It just means you tried without putting in any effort and failed to get results that were aligned with your expectations.
I'm also disappointed when I can't dunk or hit >50% of my 3pt shots, but then again I never played basketball competitively
> I truly don't understand this "vibe coding" movement unless everyone is building todo apps.
Yeah, I also don't understand the NBA. Every single one of those players show themselves dunking and jumping over cars and having almost perfect percentages in 3pt shots during practice, whereas I can barely get off my chair. The problem is certainly basketball.
I imagine that after GPT-4 / o1, improvements on benchmarks have increasingly been a result of overfitting: those breakthrough models already used most of the high-quality training data available on the internet, there haven't been any dramatic architectural changes, we are already melting the world's GPUs, and there simply isn't enough new, high-quality data being generated (orders of magnitude more than what was already used on older models) to enable breakthrough improvements.
What I'd really like to see is the model development companies improving their guardrails so that they are less concerned about doing something offensive or controversial and more concerned about conveying their level of confidence in an answer, i.e. saying I don't know every once in a while. Once we get a couple years of relative stagnation in AI models, I suspect this will become a huge selling point and you will start getting "defense grade", B2B type models where accuracy is king.
Have you asked this same question to various other models out there in the wild? I am just curious if you have found some that performed better. I would ask some models myself, but I do not know the proper answer, so I would probably be gullible enough to believe whatever the various answers have in common.
AIs in general are definitely hallucinating a lot more when it comes to niche topics. It is funny how they are unable to say "I don't know" and just make up things to answer your questions
How would it ever know the answer it found is true and correct though? It could as well just repeat some existing false answer that you didn't yet find on your own. That's not much better than hallucinating it, since you can't verify its truth without finding it independently anyway.
I would be ok with having an answer and an explanation of how it got the answer with a list of sources. And it does just that - the only problem is that both the answer and the explanation are fabrications after you double check the sources.
> Are you saying that, it deliberately lied to you?
Pretty much, yeah. Now, “deliberately” does imply some kind of agency or even consciousness, which I don’t believe these models have; it’s probably the result of overfitting, reward hacking, or some other issue from training. But the end result is that the model straight up misleads you knowingly (as in, the thinking trace is aware of the fact that it doesn’t know the answer, but it provides it anyway).
Interesting... I asked o3 for help writing a flake so I could install the latest Webstorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the Webstorm package, wrote the Flake, calculated the SHA hash that NixOS needs, and wrote a test suite. The test suite indicates that it even did GUI testing- not sure whether that is a hallucination or not though. Nevertheless, it one-shotted the installation instructions for me, and I don't see how it could have calculated the package hash without downloading, so I think this indicates some very interesting new capabilities. Highly impressive.
Are you sure about all of this? You acknowledged it might be a hallucination, but you seem to mostly believe it? o3 doesn't have the ability to spin up a VM.
That's so different from my experience. I tried to have it switch a working flake for a yarn package over to npm, and after 3 tries, with all the hints I could give it, it couldn't do it.
I find that so incredibly unlikely. Granted I haven't been keeping up to date with the latest LLM developments - but has there even been any actual confirmation from OpenAI that these models have the ability to do such things in the background?
If it can write a nixos flake it's significantly smarter than the average programmer. Certainly smarter than me, one-shotting a flake is not something I'll ever be able to do — usually takes me about thirty shots and a few minutes to cool off from how mad I am at whoever designed this fucking idiotic language. That's awesome.
I use flakes a lot and I think both flakes and the Nix language are beyond comprehension. Try searching duckduckgo or google for “what is nix flakes” or “nix flake schema” and take an honest read at the results. Insanely complicated and confusing answers, multiple different seemingly-canonical sources of information. Then go look at some flakes for common projects; the almost necessary usage of things like flake-compat and flake-util, the many-valid-approaches to devshell and package definitions, the concepts of “apps” in addition to packages. All very complicated and crazy!
Thank you for your service, I use your work with great anger (check my github I really do!)
I forgive you as I hope you forgive me. Flakes are certainly much better than Nix without them, and they’ve saved me much more time than they’ve cost me.
FWIW, they said the language was bad, not specifically flakes.
IMHO, nix is super easy if you already know Haskell (possibly others in that family). If you don't, it's extremely unintuitive.
I mean, a smart programmer still has to learn what NixOs and Flakes are, and based on your description and some cursory searching, a smart programmer would just go do literally anything else. Perfect thing to delegate to a machine that doesn't have to worry about motivation.
Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]
Incredible how resilient the Claude models have been at staying best in class for coding.
[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).
Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but they don't have an SWE-bench score. Shows that looking at one such benchmark isn't very telling. Main advantage over Sonnet being that it's better at using a large amount of context, which is enormously helpful during coding tasks.
Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.
What language / framework are you using? I ask because in a Node / Typescript / React project I experience the opposite- Claude 3.7 usually solves my query on the first try, and seems to understand the project's context, ie the file structure, packages, coding guidelines, tests, etc, while Gemini 2.5 seems to install packages willy-nilly, duplicate existing tests, create duplicate components, etc.
This was incredibly irritating at first, though over time I've learned to appreciate this "extra credit" work. It can be fun to see what Claude thinks I can do better, or should add in addition to whatever feature I just asked for. Especially when it comes to UI work, Claude actually has some pretty cool ideas.
If I'm using Claude through Copilot where it's "free" I'll let it do its thing and just roll back to the last commit if it gets too ambitious. If I really want it to stay on track I'll explicitly tell it in the prompt to focus only on what I've asked, and that seems to work.
And just today, I found myself leaving a comment like this:
//Note to Claude: Do not refactor the below. It's ugly, but it's supposed to be that way.
Never thought I'd see the day I was leaving comments for my AI agent coworker.
Claude is almost comically good outside of copilot. When using through copilot it’s like working with a lobotomized idiot (that complains it generated public code about half the time).
Oh, that must’ve been in the last few days. Weird that it’s only in 2.5 Pro preview but at least they’re headed in the right direction.
Now they just need a decent usage dashboard that doesn’t take a day to populate or require additional GCP monitoring services to break out the model usage.
I do find it likes to subtly reformat every single line thereby nuking my diff and making its changes unusable since I can’t verify them that way, which Sonnet doesn’t do.
I keep seeing this sentiment so often here and on X that I have to wonder if I'm somehow using a different Gemini 2.5 Pro. I've been trying to use it for a couple of weeks already and without exaggeration it has yet to solve a single programming task successfully. It is constantly wrong, constantly misunderstands my requests, ignores constraints, ignores existing coding conventions, breaks my code and then tells me to fix it myself.
Eh, I wouldn't say that's accurate, I think it's situational. I code all day using AI tools and Sonnet 3.7 is still the king. Maybe it's language dependent or something, but all the engineers I know are full on Claude-Code at this point.
The image generation improvement with o4-mini is incredible. Testing it out today, this is a step change in editing specificity even from the ChatGPT 4o LLM image integration just a few weeks ago (which was already a step change). I'm able to ask for surgical edits, and they are done correctly.
There isn't a numerical benchmark for this that people seem to be tracking but this opens up production-ready image use cases. This was worth a new release.
Thanks for sharing that; it was more interesting than their demo. I tried it and it was pretty good! I have felt that the inability to iterate on images blocked this from any real production use I had. This may be good enough now.
Also, another addition: I previously tried to upload an image for ChatGPT to edit, and it was incapable under the previous model I tried. Now it's able to change uploaded images using o4-mini.
Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)").[0]
OpenAI said they got 69.1% in their blog post.
Yes, however Claude advertised 70.3%[1] on SWE bench verified when using the following scaffolding:
> For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results.
I think you may have misread the footnote. That simpler setup results in the 62.3%/63.7% score. The 70.3% score results from a high-compute parallel setup with rejection sampling and ranking:
> For our “high compute” number we adopt additional complexity and parallel test-time compute as follows:
> We sample multiple parallel attempts with the scaffold above
> We discard patches that break the visible regression tests in the repository, similar to the rejection sampling approach adopted by Agentless; note no hidden test information is used.
> We then rank the remaining attempts with a scoring model similar to our results on GPQA and AIME described in our research post and choose the best one for the submission.
> This results in a score of 70.3% on the subset of n=489 verified tasks which work on our infrastructure. Without this scaffold, Claude 3.7 Sonnet achieves 63.7% on SWE-bench Verified using this same subset.
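If I'm reading that right, the high-compute scaffold amounts to something like the following sketch (all function names here are hypothetical placeholders, not anything from Anthropic's actual harness):
```
# Sketch of the parallel test-time-compute scaffold described above:
# sample many candidate patches, drop those failing the visible regression
# tests (rejection sampling), rank the rest with a scoring model, submit the best.
def solve_task(task, n_attempts=32):
    candidates = [generate_patch(task) for _ in range(n_attempts)]        # parallel attempts
    survivors = [p for p in candidates if passes_visible_tests(task, p)]  # rejection step
    if not survivors:
        survivors = candidates  # fall back rather than submit nothing
    return max(survivors, key=lambda p: score_patch(task, p))             # scoring-model ranking
```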
I think reading this makes it even clearer that the 70.3% score should just be discarded from the benchmarks. "I got a 7%-8% higher SWE benchmark score by doing a bunch of extra work and sampling a ton of answers" is not something a typical user is going to have already set up when logging onto Claude and asking it a SWE style question.
Personally, it seems like an illegitimate way to juice the numbers to me (though Claude was transparent with what they did so it's all good, and it's not uninteresting to know you can boost your score by 8% with the right tooling).
I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks
The benchmark is something you can optimize for; that doesn't mean it generalizes well. Yesterday I tried for 2 hours to get Claude to create a program that would extract data from a weird Adobe file. $10 later, the best I had was a program that was doing something like:
```
switch (testFile) {
    case "test1.ase": // run this because it's a particular case
    case "test2.ase": // run this because it's a particular case
    default:          // run something that's not working but that's ok because
                      // the previous cases should give the right output for all the test files ...
}
```
Also, if you're using Cursor AI, it seems to have much better integration with Claude where it can reflect on its own things and go off and run commands. I don't see it doing that with Gemini or the O1 models.
I have a very basic / stupid "Turing test" which is just to write a base 62 converter in C#. I would think this exact thing would be in github somewhere (thus in the weights) but has always failed for me in the past (non-scientific / didn't try every single model).
Using o4-mini-high, it actually did produce a working implementation after a bit of prompting. So yeah, today, this test passed which is cool.
As I mentioned, this is not a scientific test but rather just something that I have tried from time to time and that has always (shockingly, in my opinion) failed, but today worked. It takes a minute or two of prompting, is boring to verify, and I don't remember exactly which models I have used. It is purely a personal anecdote, nothing more.
However, looking at the code that Gemini wrote in the link, it does the same thing that other LLMs often do, which is to assume that we are encoding individual long values. I assume there must be a github repo or stackoverflow question in the weights somewhere that is pushing it in this direction, but it is a little odd. Naturally, this isn't the kind of encoder that someone would normally want. Typically it should encode a byte array and return a string (or maybe encode / decode UTF8 strings directly). Having the interface use a long is very weird and not very useful.
In any case, I suspect with a bit more prompting you might be able to get gemini to do the right thing.
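For what it's worth, the byte-array-in, string-out shape I mean looks roughly like this (a quick Python sketch rather than C#, just to show the interface; not production code):
```
# Base62 over arbitrary bytes: treat the input as one big integer and
# repeatedly divmod by 62. Leading zero bytes are preserved as leading '0's.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def b62_encode(data: bytes) -> str:
    n = int.from_bytes(data, "big")
    out = ""
    while n:
        n, r = divmod(n, 62)
        out = ALPHABET[r] + out
    pad = len(data) - len(data.lstrip(b"\x00"))  # keep leading zero bytes
    return ALPHABET[0] * pad + out

def b62_decode(s: str) -> bytes:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    pad = len(s) - len(s.lstrip(ALPHABET[0]))
    return b"\x00" * pad + body

assert b62_decode(b62_encode(b"hello base62")) == b"hello base62"
```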
I think it's because the question is rather ambiguous - "convert the number to base-N" is a very common API, e.g. in C# you have Convert.ToString(long value, int base), in JavaScript you have Number.toString(base) etc. It seems that it just follows this pattern. If you were to ask me the same question, I'd probably do the same thing without any further context.
OTOH if you tell it to write a Base62 encoder in C#, it does consistently produce an API that can be called with byte arrays: https://g.co/gemini/share/6076f67abde2
Similarly, many of my informal tests have started passing with Gemini 2.5 that never worked before, which makes the 2025 era of AI models feel like a step change to me.
I’ve been using Gemini 2.5 pro side by side with o1-pro and Grok lately. My experience is they each randomly offer significant insight the other two didn’t.
But generally, o1-pro listens to my profile instructions WAY better, and it seems to be better at actually solving problems the first time. More reliable.
But they are all quite similar and so far these new models are similar but faster IMO.
I asked o3 to build and test a maximum parsimony phylogenetic tree builder in Python (my standard test for new models) and it's been thinking for 10 minutes. Still not clear if anything is happening, I have barely seen any code since I asked to test what it produced in the first answer. The thought summary is totally useless compared to Gemini's. Underwhelming so far.
The CoT summary is full of references to Jupyter notebook cells. The variable names are too abbreviated, nbr for neighbor, the code becomes fairly cryptic as a result, not nice to read. Maybe optimized too much for speed.
Also I've noticed ChatGPT seems to abort thinking when I switch away from the app. That's stupid, I don't want to look at a spinner for 5 minutes.
And the CoT summary keeps mentioning my name which is irritating.
It's maddening that you can't switch away from the app while it generates output. To use the Deep Research feature on mobile, you have to give up your phone for ten minutes.
Deep Research will run in the background on mobile, and I think it gives a notification when done. It's not like normal chats, which need the app to be in the foreground.
To plan a visit to a dark sky place, I used duck.ai (Duckduckgo's experimental AI chat feature) to ask five different AIs on what date the new moon will happen in August 2025.
GPT-4o mini: The new moon in August 2025 will occur on August 12.
Llama 3.3 70B: The new moon in August 2025 is expected to occur on August 16, 2025.
Claude 3 Haiku: The new moon in August 2025 will occur on August 23, 2025.
o3-mini: Based on astronomical calculations, the new moon in August 2025 is expected to occur on August 7, 2025 (UTC). [...]
Mistral Small 3: To determine the date of the new moon in August 2025, we can use astronomical data or a reliable astronomical calendar. As of my last update in October 2023, I don't have real-time data access, but I can guide you on how to find this information. [...]
I got different answers, mostly wrong. My calendars (both paper and app versions) show 23 August as the date.
And btw, when I asked those AIs which entries in a robots.txt file would block most Chinese search engines, one of them (Claude) told me that it couldn't say, because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."
"On what date will the new moon occur on in August 2025. Use a tool to verify the date if needed"
It correctly reasoned it did not have exact dates due to its cutoff and did a lookup.
"The new moon in August 2025 falls on Friday, August 22, 2025"
Now, I did not specify the timezone I was in, so our difference between the 22nd and the 23rd appears to be just a time zone difference, as it had marked a time of 23:06 PDT per its source.
```
Based on the search results, the new moon in August 2025 will occur late on Friday, August 22nd, 2025 in the Pacific Time Zone (PDT), specifically around 11:06 PM.
In other time zones, like the Eastern Time Zone (ET), this event falls early on Saturday, August 23rd, 2025 (around 2:06 AM).
```
"Use a tool to verify the date if needed" that's a good idea, yes. And the answers I got are based on UTC, so 23:06 PDT should match the 23. for Europe.
My reasoning for the plain question was: as people start to replace search engines by AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be would be worth it.
Heh, I've always been neurodivergent enough that I've never been great at 'normal human' questions. I commonly add a lot of verbosity. This said it's worked out well talking to computer based things like search engines.
LLMs on the other hand are weird in ways we don't expect computers to be. Based upon the previous prompting, training datasets, and biases in the model a response to something like "What time is dinner" can all have the response "Just a bit after 5", "Quarter after 5" or "Dinner is at 17:15 CDT". Setting ones priors can be important to performance of the model, much in the same way we do this visually and contextually with other humans.
All that said, people will find AI problematic for the foreseeable future because it behaves somewhat human like in responses and does so with confidence.
My reasoning for the plain question was: as people start to replace search engines by AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be, would be a good test. Because plain folks will ask plain questions and won't think about the subtle details. They would not expect a "precise number" either, i.e. not 23:06 PDT, but would like to know if this weekend would be fine for a trip or the previous or next weekend would be better to book a "dark sky" tour.
And, BTW, I thought that LLMs are computers too ;-0
When is the long list of 'enterprise' coworkers who have glibly and overconfidently answered questions without doing the math or looking things up going to be fired?
> one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."
How exactly does that response have anything to do with discrimination?
Surprisingly, they didn't provide a comparison to Sonnet 3.7 or Gemini Pro 2.5—probably because, while both are impressive, they're only slightly better by comparison.
> we’ve observed that large-scale reinforcement learning exhibits the same “more compute = better performance” trend observed in GPT‑series pretraining.
Didn’t the pivot to RL from pretraining happen because the scaling “law” didn’t deliver the expected gains? (Or at least because O(log) increases in model performance became unreasonably costly?) I see they’ve finally resigned themselves to calling these trends, not laws, but trends are often fleeting. Why should we expect this one to hold for much longer?
This isn't exactly the case. The trend is a log scale, so a 10x in pretraining should yield a 10% increase in performance. That's not proving to be false per se; rather, they are encountering practical limitations around 10x'ing data volume and 10x'ing available compute.
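To put rough numbers on the "log scale" point, a toy illustration (the constants are made up; only the shape matters): if performance grows with log10 of compute, every 10x of compute buys the same fixed bump while the cost of each bump grows tenfold.
```
import math

# Toy scaling curve: each 10x of training compute adds the same fixed amount
# of "performance", so gains are linear while costs grow exponentially.
def toy_performance(compute_flops):
    return 10 * math.log10(compute_flops)

for c in (1e21, 1e22, 1e23, 1e24):
    print(f"{c:.0e} FLOPs -> score {toy_performance(c):.0f}")
# 1e+21 -> 210, 1e+22 -> 220, 1e+23 -> 230, 1e+24 -> 240
```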
> (Or at least because O(log) increases in model performance became unreasonably costly?)
But, yes, I left implicit in my comment that the trend might be “fleeting” because of its impracticality. RL is only a trend so long as it is fashionable, and only fashionable (i.e., practical) so long as OpenAI is fed an exponential amount of VC money to ensure linear improvements under O(log) conditions.
OpenAI is selling to VCs the idea that some hitherto unspecified amount of linear model improvement will kick off productivity gains greater than their exponentially increasing investment. These productivity gains would be no less than a sizeable percentage of American GDP, which Altman has publicly set as his target. But as the capital required increases exponentially, the gap between linearly increasing model capability (i.e., its productivity) and the breakeven ROI target widens. The bigger model would need to deliver a non-linear increase in productivity to justify the exponential price tag.
It doesn't need to hold forever or even 'much longer' depending on your definition of that duration. It just needs to hold on long enough to realize certain capabilities.
Will it? Who knows. But seeing as this is something you can't predict ahead of time, it makes little sense not to try, insofar as the whole thing is still feasible.
I think it can be confusing if you're just reading the news. If you use ChatGPT, the model selector has good, brief explanations and teaches you about newly available options even if you don't normally visit the dropdown. Anthropic does similarly.
Yes, this one is addictive for its speed and I like how Google was clever and also offered it in a powerful reasoning edition. This helps offset deficiencies from being smaller while still being cheap. I also find it quite sufficient for my kind of coding. I only pull out 2.5 Pro on larger and complex code bases that I think might need deeper domain specific knowledge beyond the coding itself.
Mad tangent, but as an old timey MtG player it’s always jarring when someone uses “the meta” not to refer to the particular dynamics of their competitive ecosystem but to a single strategy within it. Impoverishes the concept, I feel, even in this case where I don’t actually think a single model is best at everything.
I'm a World of Warcraft & Dota 2 player, using "the meta" in that way is pretty common in gaming these days I think. The "meta" is still the 'metagame' in the competitive ecosystem sense, but it also refers to strategies that are considered flavor of the month (FOTM) or just generally safe bets.
So there's "the meta", and there's "that strategy is meta", or "that strategy is the meta."
This one seems to make it easier — if the promises here hold true, the multi-modal support probably makes o4-mini-high OpenAI's best model for most tasks unless you have time and money, in which case it's o3-pro.
It confers to the speaker confirmation that they're absolutely right - names are arbitrary.
While also politely, implicitly, pointing out that the core issue is that it doesn't matter to you --- which is fine! --- but it may just be contributing to dull conversation to be the 10th person to say as much.
I asked OpenAI how to choose the right USB cable for my device. Now the objects around me are shimmering and winking out of existence, one by one. Help
"good at advanced reasoning", "fast at advanced reasoning", "slower at advanced reasoning but more advanced than the good one but not as fast but cant search the internet", "great at code and logic", "good for everyday tasks but awful at everything else", "faster for most questions but answers them incorrectly", "can draw but cant search", "can search but cant draw", "good for writing and doing creative things"
Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for what models to use, in particular for development? Not just openAI, but across the main cloud offerings and feasible local models?
I know there are the benchmarks, and directories like huggingface, and you can get a 'feel' for things by scanning threads here or other forums.
I'm thinking more of something that provides use-case tailored "top 3" choices by collecting and summarizing different data points. For example:
* agent & tool based dev (cloud) - [top 3 models]
* agent & tool based dev (local) - m1, m2, m3
* code review / high level analysis - ...
* general tech questions - ...
* technical writing (ADRs, needs assessments, etc) - ...
Part of the problem is how quickly the landscape changes everyday, and also just relying on benchmarks isn't enough: it ignores cost, and more importantly ignores actual user experience (which I realize is incredibly hard to aggregate & quantify).
LMArena might have some of the information you are looking for. It offers rankings of LLM models across main cloud offerings, and I feel that its evaluation method, human prompting and voting, is closer to real-world use case and less prone to data contamination than benchmarks.
In the "Leaderboard">"Language" tab, it lists the top models in various categories such as overall, coding, math, and creative writing.
In the "Leaderboard">"Price Analysis" tab, it shows a chart comparing models by cost per million tokens.
In the "Prompt-to-Leaderboard" tab, there is even an LLM to help you find LLMs -- you enter a prompt, and it will find the top models for your particular prompt.
> Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for what models to use, in particular for development?
Below is a spreadsheet I bookmarked from a previous HN discussion. It's information-dense, but you can just look at the composite scores to get a quick idea of how things compare.
I have been using this site: https://artificialanalysis.ai/ . It's still about benchmarks, and it doesn't do deep dives into specific use cases, but it's helpful to compare models for intelligence vs cost vs latency and other characteristics.
They're supposed to be released today for everyone, and o3-pro for Pro users in a few weeks:
"ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high."
They are all now available on the Pro plan. Y'all really ought to have a little bit more grace to wait 30 minutes after the announcement for the rollout.
They'd probably want their announcement to be the one the press picks up instead of a tweet or reddit post saying "Did anyone else notice the new ChatGPT model?"
Some sources mention that o3 scores 63.8 on SWE-bench, while Gemini 2.5 Pro scores 69.1.
On most other benchmarks, they seem to perform about the same, which is bad news for o3 because it's much more expensive and slower than Gemini 2.5 Pro, and it also hides its reasoning while Gemini shows everything.
We can probably just stick with Gemini 2.5 Pro, since it offers the best combination of price, quality, and speed. No need to worry about finding a replacement (for now).
If you download GIMP, Blender, etc., every user would report pretty much exactly the same experience, given recent hardware.
In this thread, however, there are varying experiences, from amazing to awful. I'm not saying anyone is wrong; all I'm saying is that this wide range of operational accuracy is what will eventually pop the AI bubble, in that these systems can't be reliably deployed almost anywhere with any certainty or guarantees of any sort.
The pace of notable releases across the industry right now is unlike any time I remember since I started doing this in the early 2000's. And it feels like it's accelerating
How is this a notable release? It's strictly worse than Gemini 2.5 on coding &c, and only an iterative improvement over their own models. The only thing that struck me as particularly interesting was the native visual reasoning.
Love Sonnet but 3.7 is not obviously an improvement over 3.5 in my real world usage. Gemini 2.5 pro is great, has replaced most others for me (Grok I use for things that require realtime answers)
Integration is accelerating rapidly. Even if model development froze today, we would still probably have ~5 years of adoption and integration before it started to level off.
You are both correct. It feels like the tech itself is kinda plateauing but it's still massively under-used. It will take a decade or more before the deployment starts slowing down.
It's more like GPT-3 is the Manchester Baby, and we're somewhere around IBM 700 series right now. Still a long way to go to iPhone, as much as the industry likes to pretend otherwise.
In the examples they demonstrate tool use in the reasoning loop. The models pretty impressively recognize they need some external data, and either complete a web search, or write and execute python to solve intermediate steps.
To the extent that reasoning is noisy and models can go astray during it, this helps inject truth back into the reasoning loop.
Is there some well-known equivalent to Moore's Law for token use? We're headed in a direction where LLM control loops can run 24/7, generating tokens to reason about live sensor data and calling tools to act on it.
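For anyone who hasn't seen what that loop looks like in practice, a rough sketch (model_step, run_search, and run_python are placeholders for whatever API and sandbox you actually have):
```
# Reasoning loop with tool calls: the model either asks for a tool
# (web search, Python execution) or emits a final answer; tool results
# are appended to the context and the loop continues.
def answer(question, max_steps=10):
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = model_step(history)                 # placeholder LLM call
        if step["type"] == "final":
            return step["content"]
        elif step["type"] == "search":
            result = run_search(step["query"])     # placeholder web-search tool
        elif step["type"] == "python":
            result = run_python(step["code"])      # placeholder sandboxed interpreter
        else:
            result = f"unknown tool: {step['type']}"
        history.append({"role": "tool", "content": result})
    return "gave up after max_steps"
```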
I personally like being able to choose because I understand the tradeoffs and want to choose the best one for what I’m asking. So I hope this doesn’t go away.
But I agree that they probably need some kind of basic mode to make things easier for the average person. The basic mode should decide automatically what model to use and hide this from the user.
No, Mixture of Experts is a really confusing term.
It sounds like it means "have a bunch of models, one that's an expert in physics, one that's an expert in health etc and then pick the one that's a best fit for the user's query".
It's not that. The "experts" are each another giant opaque blob of weights. The model is trained to select one of those blobs, but they don't have any form of human-understandable "expertise". It's an optimization that lets you avoid using ALL of the weights for every run through the model, which helps with performance.
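A stripped-down sketch of the idea (real MoE layers route per token and per layer, typically to the top 1-2 of many experts; this is just to show that the "experts" are interchangeable weight blobs picked by a learned router):
```
import numpy as np

# Minimal mixture-of-experts layer: a learned router scores every expert for
# an input, only the top-k expert weight blobs are run, and their outputs are
# mixed by the router's gates. No expert corresponds to a human topic.
rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # opaque blobs
router = rng.standard_normal((d, n_experts))

def moe_forward(x):
    scores = x @ router
    top = np.argsort(scores)[-k:]                             # indices of top-k experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over chosen experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

y = moe_forward(rng.standard_normal(d))  # only 2 of the 8 experts did any work
```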
So at this point OpenAI has 6 reasoning models, 4 flagship chat models, and 7 cost optimized models. So that's 17 models in total and that's not even counting their older models and more specialized ones. Compare this with Anthropic that has 7 models in total and 2 main ones that they promote.
This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model rather than updating the existing ones. In fact based on some of the other comments here it sounds like these are just updates to their existing model, but they release them as new models to create more media buzz.
I'm old enough to remember the mystery and hype before o*/o1/Strawberry, which was supposed to be essentially AGI. We had serious news outlets writing about senior people at OpenAI quitting because o1 was SkyNet.
Now we're up to o4, AGI is still not even in near sight (depending on your definition, I know), and OpenAI is up to about 5000 employees. I'd think even before AGI, a new model would be able to cover for at least 4500 of those employees being fired; is that not the case?
I don't think these comparisons are useful. Every time you look at companies like LinkedIn or Docusign, yeah - they have a lot of staff, but a significant proportion of this are functions like sales, customer support, and regulatory compliance across a bazillion different markets; along with all the internal tooling and processes you need to support that.
OpenAI is at a much earlier stage in their adventures and probably doesn't have that much baggage. Given their age and revenue streams, their headcount is quite substantial.
If we're making comparisons, it's more like someone selling a $10,000 course on how to be a millionaire.
Not directly from OpenAI - but people in the industry are advertising how these advanced models can replace employees, yet they keep going on hiring tears (including OpenAI). Let's see the first company stand behind its models and replace 50% of its existing headcount with agents. That to me would be a sign these things are going to replace people's jobs. Until I see that - if OpenAI can't figure out how to replace humans with models, then no one will.
I mean, could you imagine if today's announcement had been: the chatgpt.com webdev team has been laid off, and all new features and fixes will be completed by Codex CLI + o4-mini? That would mean they believe in the product they're advertising. Until they do something like that, they'll keep on trusting those human engineers and trying to sell other people on the dream.
I'm also a skeptic on AI replacing many human jobs anytime soon. It's mostly going to assist, accelerate or amplify humans in completing work better or faster. That's the typical historical technology cycle where better tech makes work more efficient. Eventually that does allow the same work to be done with less people, like a better IP telephony system enabling a 90 person call center to handle the same call volume that previously required 100 people. But designing, manufacturing, selling, installing and supporting the new IP phone system also creates at least 10 new jobs.
So far the only significant human replacement I'm seeing AI enable is in low-end, entry level work. For example, fulfilling "gig work" for Fiverr like spending an hour or two whipping up a relatively low-quality graphic logo or other basic design work for $20. This is largely done at home by entry-level graphic design students in second-world locales like the Philippines or rural India. A good graphical AI can (and is) taking some of this work from the humans doing it. Although it's not even a big impact yet, primarily because for non-technical customers, the Fiverr workflow can still be easier or more comfortable than figuring out which AI tool to use and how to get what they really want from it.
The point is that this Fiverr piece-meal gig work is the lowest paying, least desirable work in graphic design. No one doing it wants to still be doing it a year or two from now. It's the Mcdonald's counter of their industry. They all aspire to higher skill, higher paying design jobs. They're only doing Fiverr gig work because they don't yet have a degree, enough resume credits or decent portfolio examples. Much like steam-powered bulldozers and pile drivers displaced pick axe swinging humans digging railroad tunnels in the 1800s, the new technology is displacing some of the least-desirable, lowest-paying jobs first. I don't yet see any clear reason this well-established 200+ year trend will be fundamentally different this time. And history is littered with those who predicted "but this time it'll be different."
I've read the scenarios which predict that AI will eventually be able to fundamentally and repeatedly self-improve autonomously, at scale and without limit. I do think AI will continue to improve but, like many others, I find the "self-improve" step to be a huge and unevidenced leap of faith. So, I don't think it's likely, for reasons I won't enumerate here because domain experts far smarter than I am have already written extensively about them.
I hope I don't have to link this adjacent reply of mine too many more times: https://news.ycombinator.com/item?id=43709056 Specifically "The venue is a matter of convenience, nothing more," and if you prefer another, that would work about as well. Perhaps Merano; I hear it's a lovely little town.
The closest Elon ever came to anything Hague-worthy is allowing Starlink to be used in Ukrainian attacks on Russian civilian infrastructure. I don't think the Hague would be interested in anything like that. And if his life is worthless, then what would you say about your own? Nonetheless, I commend you on your complete lack of hinges. /s
Oh, I'm thinking more in the sense of the special one-off kinds of trials, the sort Gustave Gilbert so ably observed. The venue is a matter of convenience, nothing more. To the rest I would say the worth of my life is no more mine to judge than anyone else is competent to do the same for themselves, or indeed other than foolish to pursue the attempt.
Deep learning models will continue to improve as we feed them more data and use more compute, but they will still fail at even very simple tasks as long as the input data are outside their training distribution. The numerous examples of ChatGPT (even the latest, most powerful versions) failing at basic questions or tasks illustrate this well.
Learning from data is not enough; there is a need for the kind of system-two thinking we humans develop as we grow. It is difficult to see how deep learning and backpropagation alone will help us model that.
https://medium.com/thoughts-on-machine-learning/why-sam-altm...
Honestly, sometimes I wonder if most people these days kinda aren't at least that age, you know? Or less inhibited about acting it than I believe I recall people being last decade. Even compared to just a few years back, people seem more often to struggle to carry a thought, and resort much more quickly to emotional belligerence.
Oh, not that I haven't been as knocked about in the interim, of course. I'm not really claiming I'm better, and these are frightening times; I hope I'm neither projecting nor judging too harshly. But even trying to discount for the possibility, there still seems something new left to explain.
> Even compared to just a few years back, people seem more often to struggle to carry a thought, and resort much more quickly to emotional belligerence.
We're living in extremely uncertain times, with multiple global crises taking place at the same time, each of which could develop into a turning point for humankind.
At the same time, predatory algorithms do whatever it takes to make people addicted to media, while mental health care remains inaccessible for many.
I feel like throwing a tantrum almost every single day.
I feel perhaps I've been unkind to many people in my thoughts, but I'm conflicted. I don't understand myself to be particularly fearless, but what times call more for courage than times like these? How do people afraid even to try to practice courage expect to find it, when there isn't time for practice any more?
You have only so many spoons available per crisis. Even picking your battle can become a problem.
I've been out in the streets, protesting and raising awareness of climate change. I no longer do. It's a pointless waste of time. Today, the climate change deniers are in charge.
I don't assume I'm going to be given the luxury of picking my battles, and - though I've been aware of "spoon theory" since I watched it getting invented at Shakesville back in the day - I've never held to it all that strongly, even as I acknowledge I've also never been quite the same since a nasty bout of wild-type covid in early 2020. Now as before, I do what needs doing as best I can, then count the cost. Some day that will surely prove too high, and my forward planning efforts will be put to the test. Till then I'm happy not to borrow trouble.
I've lived in this neighborhood a long time, and there are a couple of old folks' homes a block or so from here. Both have excellent views, on one frontage each, of an extremely historic cemetery, which I have always found a wonderfully piquant example of my adopted hometown's occasionally wire-brush sense of humor. But I bring it up to mention that the old folks don't seem to have much concern for spoons other than to eat with, and they are protesting the present situation regularly and at considerable volume, and every time I pass about my errands I make a point of raising a fist and hollering "hell yeah!" just like most of the people who drive past honk in support.
I think people expected reasoning to be more than just trained chain of thought (which was known already at the time). On the other hand, it is impressive that CoT can achieve so much.
Do you have a 200k context window? I don't. Most humans can only keep 6 or 7 things in short-term memory. Beyond those 6 or 7, you are pulling data from your latent space, or replacing some of the short-term slots with new content.
Not really, because humans can form long term memories from conversations, but LLM users aren’t finetuning models after every chat so the model remembers.
He's right, but most people don't have the resources, nor indeed the weights themselves, to keep training the models. But the weights are very much long term memory.
If you want the entire model to remember everything it talked about with every user, sure. But ideally, I would want the model to remember what I told it a few million tokens ago, but not what you told it (because to me, the model should look like my private copy that only talks to me).
> ideally, I would want the model to remember what I told it a few million tokens ago
Yes, you can keep finetuning your model on every chat you have with it. You can definitely make it remember everything you have ever said. LLMs are excellent at remembering their training data.
I'm not quite AGI, but I work quite adequately with a much, much smaller memory. Maybe AGI just needs to know how to use other computers and work with storage a bit better.
I’m not an AI researcher but I’m not convinced these contemporary artificial neural networks will get us to AGI, even assuming an acceleration to current scaling pace. Maybe my definition of AGI is off but I’m thinking what that means is a machine that can think, learn and behave in the world in ways very close to human. I think we need a fundamentally different paradigm for that. Not something that is just trained and deployed like current models, but something that is constantly observing, constantly learning and constantly interacting with the real world like we do. AHI, not AGI. True AGI may not exist because there are always compromises of some kind.
But, we don’t need AGI/AHI to transform large parts of our civilization. And I’m not seeing this happen either.
I don't think we shift the goalposts for AGI. I'm not getting the sense that people are redefining what AGI is when a new model is released. I'm getting the sense that some people are thinking like me when a new model is released: we got a better transformer, and a more useful model trained on more or better data, but we didn't get closer to AGI. And people are saying this not because they've pushed out what AGI really means, they're saying this because the models still have the same basic use cases, the same flaws and the same limitations. They're just better at what they already do. Also, the better these models get at what they already do, the more starkly they contrast with human capabilities, for better or worse.
Meanwhile even the highest ranked models can’t do simple logic tasks. GothamChess on YouTube did some tests where he played against a bunch of the best models and every single one of them failed spectacularly.
They’d happily lose a queen to take a pawn. They failed to understand how pieces are even allowed to move, hallucinated the existence of new pieces, repeatedly declared checkmate when it wasn’t, etc.
I tried it last night with Gemini 2.5 Pro, and it made it 6 turns before it started making illegal moves, and 8 turns before it got so confused about the state of the board that it refused to play with me any longer.
I was in the chess club in 3rd grade. One of the top ranked LLMs in the world is vastly dumber than I was in 3rd grade. But we’re going to pour hundreds of billions into this in the hope that it can end my career? Good luck with that, guys.
Chess is not exactly a simple logic task. It requires you to keep track of 32 things in a 2d space.
I remember being extremely surprised when I could ask GPT-3 to rotate a 3D model of a car in its head and ask it about what I would see when sitting inside, or which doors would refuse to open because they're in contact with the ground.
It really depends on how much you want to shift the goalposts on what constitutes "simple".
Compare to what a software engineer is able to do, it is very much a simple logic task. Or the average person having a non-trivial job. Or a beehive organizing its existence, from its amino acids up to hive organization. All those things are magnitudes harder than chess.
> I remember being extremely surprised when I could ask GPT3 to rotate a 3d model of a car in its head and ask it about what I would see when sitting inside, or which doors would refuse to open because they're in contact with the ground.
It's not reasoning its way there. Somebody asked something similar somewhere in the corpus, and the corpus also contained the answers. That's why it can answer. After a quite small number of moves, the chess position is unique and you can't fake it. You need to think ahead. A task which computers are traditionally very good at. Even trained chess players are. That LLMs are not goes to show that they are very far from AGI.
I'm not sure why people are expecting a language model to be great at chess. Remember they are trained on text, which is not the best medium for representing things like a chess board. They are also "general models", with limited training on pretty much everything apart from human language.
An Alpha Star type model would wipe the floor at chess.
This misses the point. LLMs will do things like move a knight by a single square as if it were a pawn. Chess is an extremely well understood game, and the rules about how pieces move are almost certainly well represented in the training data.
These models cannot even make legal chess moves. That’s incredibly basic logic, and it shows how LLMs are still completely incapable of reasoning or understanding. Many kinds of task are never going to be possible for LLMs unless that changes. Programming is one of those tasks.
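That claim is easy to test mechanically, for what it's worth; a minimal sketch using the python-chess library, with the hard-coded move list standing in for whatever moves a model proposes:

    import chess  # pip install python-chess

    def is_legal(board: chess.Board, uci: str) -> bool:
        # A move is legal only if it parses and is in the current position's legal set.
        try:
            return chess.Move.from_uci(uci) in board.legal_moves
        except ValueError:
            return False

    board = chess.Board()
    for uci in ["e2e4", "e7e5", "g1f3", "b8c6", "f1b5"]:  # replace with model output
        if not is_legal(board, uci):
            print("illegal move proposed:", uci)
            break
        board.push_uci(uci)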
>These models cannot even make legal chess moves. That’s incredibly basic logic, and it shows how LLMs are still completely incapable of reasoning or understanding.
Yeah they can. There's a link I shared to prove it which you've conveniently ignored.
LLMs learn by predicting, failing and getting a little better, rinse and repeat. Pre-training is not like reading a book. LLMs trained on chess games play chess just fine. They don't make the silly mistakes you're talking about and they very rarely make illegal moves.
There's gpt-3.5-turbo-instruct, which I already shared and which plays at around 1800 Elo. Then there's this grandmaster level chess transformer - https://arxiv.org/abs/2402.04494.
There are also a couple of models that were trained in the Eleuther AI discord that reached about 1100-1300 Elo.
I don't know what the peak of LLM chess playing looks like, but this is clearly less of an 'LLMs can't do this' problem and more of an 'OpenAI/Anthropic/Google etc don't care if their models can play chess or not' problem.
So are they capable of reasoning now, or would you like to shift the goalposts?
There isn't any general intelligence that isn't receiving pre-training. People spend 14 to 18+ years in school to have any sort of career.
You don't have to pretrain it for every little thing but it should come as no surprise that a complex non-trivial game would require it.
Even if you explained all the rules of chess clearly to someone brand new to it, it will be a while and lots of practice before they internalize it.
And like I said, LLM pre-training is less like a machine reading text and more like Evolution. If you gave it a corpus of chess rules, you're only training a model that knows how to converse about chess rules.
Do humans require less 'pre-training' ? Sure, but then again, that's on the back of millions of years of evolution. Modern NNs initialize random weights and have relatively very little inductive bias.
People are focussing on chess, which is complicated, but LLMs fail at even simple games like tic-tac-toe, where you'd think, if they were capable of "reasoning", they would be able to understand where they went wrong. That doesn't seem to be the case.
What it can do is write and execute code to generate the correct output, but isn't that cheating?
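To be fair, the code side is trivial; a winner check like the sketch below is exactly the sort of helper these models can write on demand, even while losing the actual game:

    # The eight winning lines on a 3x3 board, indexed 0..8.
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

    def winner(board):
        # board is a list of 9 cells containing 'X', 'O' or ' '.
        for a, b, c in LINES:
            if board[a] != ' ' and board[a] == board[b] == board[c]:
                return board[a]
        return None

    print(winner(list("XXX OO   ")))  # -> 'X'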
Saying programming is a task that is "never going to be possible" for an LLM is a big claim, given how many people have derived huge value from having LLMs write code for them over the past two years.
(Unless you're arguing against the idea that LLMs are making programmers obsolete, in which case I fully agree with you.)
I think "useful as an assistant for coding" and "being able to program" are two different things.
When I was trying to understand what is happening with hallucination, GPT gave me this:
> It's called hallucinating when LLMs get things wrong because the model generates content that sounds plausible but is factually incorrect or made-up—similar to how a person might "see" or "experience" things that aren't real during a hallucination.
From that we can see that they fundamentally don't know what is correct. While they can get better at predicting correct answers, no-one has explained how they are expected to cross the boundary from "sounding plausible" to "knowing they are factually correct". All the attempts so far seem to be about reducing the likelihood of hallucination, not fixing the problem that they fundamentally don't understand what they are saying.
Until/unless they are able to understand the output enough to verify the truth then there's a knowledge gap that seems dangerous given how much code we are allowing "AI" to write.
Code is one of the few applications of LLMs where they DO have a mechanism for verifying if what they produced is correct: they can write code, run that code, look at the output and iterate in a loop until it does what it's supposed to do.
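A minimal sketch of that loop (ask_llm is a hypothetical stand-in for whatever model call you use; the only feedback signal is the test runner's output):

    import subprocess

    def generate_until_tests_pass(prompt, max_rounds=5):
        feedback = ""
        for _ in range(max_rounds):
            code = ask_llm(prompt + feedback)  # hypothetical model call
            with open("candidate.py", "w") as f:
                f.write(code)
            result = subprocess.run(["python", "-m", "pytest", "tests/"],
                                    capture_output=True, text=True)
            if result.returncode == 0:
                return code  # tests pass, stop iterating
            feedback = "\n\nThe tests failed with:\n" + result.stdout + result.stderr
        return None  # give up after max_rounds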
> We had serious news outlets write about senior people at OpenAI quitting because o1 was SkyNet
I wonder if any of the people that quit regret doing so.
Seems a lot like Chicken Little behavior - "Oh no, the sky is falling!"
How anyone with technical acumen thinks current AI models are conscious, let alone capable of writing new features and expanding their abilities is beyond me. Might as well be afraid of calculators revolting and taking over the world.
Relative to the hype they've been spinning to attract investment, casting the launch and commercialization of ChatGPT as their greatest achievement really is a quite significant downgrade, especially given that they really only got there first because they were the first entity reckless enough to deploy such a tool to the public.
It's easy to forget what smart, connected people were saying about how AI would evolve by <current date> ~a year ago, when in fact what we've gotten since then is a whole bunch of diminishing returns and increasingly sketchy benchmark shenanigans. I have no idea when a real AGI breakthrough will happen, but if you're a person who wants it to happen (I am not), you have to admit to yourself that the last year or so has been disappointing---even if you won't admit it to anybody else.
ChatGPT was released two and a half years ago though. Pretty sure that at some point Sam Altman had promised us AGI by now.
The person you're responding to is correct that OpenAI feels a lot more stagnant than other players (like Google, which was nowhere to be seen even a year and a half ago and now has the leading model on pretty much every metric, but also DeepSeek, who built a competitive model in a year that runs for much cheaper).
o3-mini wasn't even in second place for non-STEM tasks, and in today's announcement they don't even publish benchmarks for those. What's impressive about Gemini 2.5 Pro (and was also really impressive with R1) is how good the model is for a very broad range of tasks, not just benchmaxing on AIME.
I had a philosophical discussion with o3 model earlier today. It was much better than 2.5 pro. In fact it was pretty much what I would expect from a professional philosopher.
Writing philosophy that looks convincing has been something LLMs do well since the first release of ChatGPT back in 2022 (in my country, back in early 2023, TV featured a kind of competition between ChatGPT and a philosopher turned media personality, with university professors blindly reviewing both essays and attempting to determine which was which).
To get an idea of how good a model is on non-STEM tasks, you need to challenge it on stuff that is harder than this for LLMs, like summarization without hallucination or creative writing. OpenAI's non-thinking models are usually very good at these, but not their thinking models, whereas other players (be it Google, Anthropic or DeepSeek) manage to make models that can be very good at both.
I've been discussing a philosophical topic (brain uploading) with all major models in the last two years. This is a topic I've read and thought about for a long time. Until o3, the responses I got from all other models (Gemini 2.5 pro most recently) have been underwhelming - generic, high level, not interesting to an expert. They struggled to understand the points I was making, and ideas I wanted to explore. o3 was the first model that could keep up, and provide interesting insights. It was communicating on a level of a professional in the field, though not an expert on this particular topic - this is a significant improvement over all existing models.
The play now seems to be less AGI, more "too big to fail" / use all the capital to morph into a FAANG bigtech.
My bet is that they'll develop a suite of office tools that leverage their model, chat/communication tools, a browser, and perhaps a device.
They're going to try to turn into Google (with maybe a bit of Apple and Meta) before Google turns into them.
Near-term, I don't see late stage investors as recouping their investment. But in time, this may work out well for them. There's a tremendous amount of inefficiency and lack of competition amongst the big tech players. They've been so large that nobody else could effectively challenge them. Now there's a "startup" with enough capital to start eating into big tech's more profitable business lines.
I don't know how anyone could look at any of this and say ponderously: it's basically the same as Nov 2022 ChatGPT. Thus strategically they're pivoting to social to become too big to fail.
I mean, it's not fucking AGI/ASI. No amount of LLM flip floppery is going to get us terminators.
If this starts looking differently and the pace picks up, I won't be giving analysis on OpenAI anymore. I'll start packing for the hills.
But to OpenAI's credit, I also don't see how minting another FAANG isn't an incredible achievement. Like - wow - this tech giant was willed into existence. Can't we marvel at that a little bit without worrying about LLMs doing our taxes?
I'm bullish on the models, and my first quiet 5 minutes after the announcement was spent thinking about how many of the people I walked past would have a different day if the computer Just Did It(tm) (I don't think their day would be different, so I'm not bullish on ASI-even-if-achieved, I guess?)
I think binary analysis that flips between "this is a propped up failure, like when banks get bailouts" and "I'd run away from civilization" isn't really worth much.
Most people don't care about techies or tech drama. They just use the platforms their friends do.
ChatGPT images are the biggest thing on social media right now. My wife is turning photos of our dogs into people. There's a new GPT4o meme trending on TikTok every day. Using GPT4o as the basis of a social media network could be just the kickstart a new social media platform needs.
ChatGPT should be built into my iMessage threads with friends. @chatGPT "Is there an evening train on Thursdays from Brussels to Berlin?" That's something a friend and I were discussing, but we had to exit out of iMessage, use GPT, and then go back to iMessage.
For UX, the GPT info in the thread would be collapsed by default, and both users would have the discretion to click to expand the info.
Research by METR suggests that frontier LLMs can complete software tasks that would take human engineers exponentially longer time horizons, with the horizon doubling roughly every 7 months. o3 is above the trend line.
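As back-of-the-envelope arithmetic (the starting horizon below is a placeholder, not METR's actual figure), the claimed trend compounds like this:

    h0 = 1.0  # task horizon in hours at some reference date (illustrative placeholder)
    doubling_months = 7.0
    for months in (0, 7, 14, 21, 28):
        print(f"{months:2d} months out: {h0 * 2 ** (months / doubling_months):.0f} hours")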
The AlexNet paper which kickstarted the deep learning era in 2012 was ahead of the 2nd-best entry by 11%. Many published AI papers then advanced SOTA by just a couple percentage points.
o3 high is about 9% ahead of o1 high on livebench.ai and there are also quite a few testimonials of their differences.
Yes, AlexNet made major strides in other aspects as well but it’s been just 7 months since o1-preview, the first publicly available reasoning model, which is a seminal advance beyond previous LLMs.
It seems some people have become desensitized to how rapidly things are moving in AI, despite its largely unprecedented pace of progress.
OpenAI isn't selling GPT-4 or o1 or o4-mini or turbo or whatever else to the general public. These announcements may as well be them releasing GPT v12.582.599385. No one outside of a small group of nerds cares. The end consumer is going to chatgpt.com and typing things in the box.
> This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much
Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about. Subjectively, customers get locked in by feeling they have the inside track, and these small tweaks prove that. Objectively, the small change might make a real difference to the customer's use case.
Similarly, it's important to force development teams to actually ship, and shipping more frequently reduces risk, so this could reflect internal discipline.
As for media buzz, OpenAI is probably trying to tamp that down; they have plenty of first-mover advantage. More puffery just makes their competitors seem more important, and the risk to their reputation of a flop is a lot larger than the reward of the next increment.
As for "a bit much", before 2023 I was thinking I could meaningfully track progress and trade-off's in selecting tech, but now the cat is not only out of the bag, it's had more litters than I can count. So, yeah - a bit much!
Not only that: filling search lists on eBay with your products is an old sellers' tactic. Try to search for a used Dell workstation or server and you will see pages and pages from the same seller.
To use that criticism for this release ain't really fair, as these will replace the old models (o3 will replace o1, o4-mini will replace o3-mini).
On a more general level - sure, but they aren't planning to use this release to add a larger number of models, it's just that deprecating/killing the old models can't be done overnight.
As someone who doesn't use anything OpenAI (for all the reasons), I have to agree with the GP. It's all baffling. Why is there an o3-mini and an o4-mini? Why on earth are there so many models?
Once you get to this point you're putting the paradox of choice on the user - I used to use a particular brand of toothpaste for years until it got to the point where I'd be in the supermarket looking at a wall of toothpaste all by the same brand with no discernible difference between the products. Why is one of them called "whitening"? Do the others not do that? Why is this one called "complete" and that one called "complete ultra"? That would suggest that the "complete" one wasn't actually complete. I stopped using that brand of toothpaste as it became impossible to know which was the right product within the brand.
If I was assessing the AI landscape today, where the leading models are largely indistinguishable in day to day use, I'd look at OpenAI's wall of toothpaste and immediately discount them.
In ChatGPT, o4-mini is replacing o3-mini. It's a straight 1-to-1 upgrade.
In the API, o4-mini is a new model option. We continue to support o3-mini so that anyone who built a product atop o3-mini can continue to get stable behavior. By offering both, developers can test both and switch when they like. The alternative would be to risk breaking production apps whenever we launch a new model and shut off developers without warning.
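Concretely, both IDs stay valid request parameters side by side; a minimal sketch, assuming the current openai Python client and that both models are exposed through chat completions:

    from openai import OpenAI

    client = OpenAI()
    for model in ("o3-mini", "o4-mini"):  # existing model and its successor, side by side
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "Summarize RFC 2119 in one sentence."}],
        )
        print(model, "->", resp.choices[0].message.content)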
I don't think it's too different from what other companies do. Like, consider Apple. They support dozens of iPhone models with their software updates and developer docs. And if you're an app developer, you probably want to be aware of all those models and docs as you develop your app (not an exact analogy). But if you're a regular person and you go into an Apple store, you only see a few options, which you can personalize to what you want.
If you have concrete suggestions on how we can improve our naming or our product offering, happy to consider them. Genuinely trying to do the best we can, and we'll clean some things up later this year.
Fun fact: before GPT-4, we had a unified naming scheme for models that went {modality}-{size}-{version}, which resulted in names like text-davinci-002. We considered launching GPT-4 as something like text-earhart-001, but since everyone was calling it GPT-4 anyway, we abandoned that system to use the name GPT-4 that everyone had already latched onto. Kind of funny how our unified naming scheme originally made room for 999 versions, but we didn't make it past 3.
Have any of the models been deprecated? It seems like a deprecation plan and definition of timelines would be extraordinarily helpful.
I have not seen any sort of "If you're using X.122, upgrade to X.123, before 202X. If you're using X.120, upgrade to anything before April 2026, because the model will no longer be available on that date." ... Like all operating systems and hardware manufacturers have been doing for decades.
Side note, it's amusing that stable behavior is only available on a particular model with a sufficiently low temperature setting. As near-AGI shouldn't these models be smart enough to maintain consistency or improvement from version to version?
It's got all deprecations, ordered by date of announcement, alongside shutdown dates and recommended replacements.
Note that we use the term deprecated to mean slated for shutdown, and shutdown to mean when it's actually shut down.
In general, we try to minimize developer pain by supporting models for as long as we reasonably can, and we'll give a long heads up before any shutdown. (GPT-4.5-preview was a bit of an odd case because it was launched as a potentially temporary preview, so we only gave a 3-month notice. But generally we aim for much longer notice.)
Our hypothesis is that o4-mini is a much better model, but we'll wait to hear feedback from developers. Evals only tell part of the story, and we wouldn't want to prematurely deprecate a model that developers continue to find value in. Model behavior is extremely high dimensional, and it's impossible to prevent regression on 100% use cases/prompts, especially if those prompts were originally tuned to the quirks of the older model. But if the majority of developers migrate happily, then it may make sense to deprecate at some future point.
We generally want to give developers as stable an experience as possible, and not force them to swap models every few months whether they want to or not. Personally, I want developers to spend >99% of their time thinking about their business and <1% of their time thinking about what the OpenAI API is requiring of them.
Any idea when v1/models will be updated? As of right now, https://api.openai.com/v1/models has "id": "o3-mini-2025-01-31" and "id": "o3-mini", but no plain 'o3'.
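In the meantime, an easy way to keep checking is to poll the models endpoint; a minimal sketch with the current Python client (the prefix filter is just for convenience):

    from openai import OpenAI

    client = OpenAI()
    ids = sorted(m.id for m in client.models.list())
    print([i for i in ids if i.startswith(("o3", "o4"))])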
I recognize that verification is annoying, but we eventually had to resort to this as otherwise bad actors will create zillions of accounts to violate our policies and/or avoid paying via credit card fraud/etc.
> Why is there an o3-mini and an o4-mini? Why on earth are there so many models?
Because if they removed access to o3-mini — which I have tested, costed, and built around — I would be very angry. I will probably switch to o4-mini when the time is right.
I think you're trying to re-contextualize the old Standards joke, but I actually think you're right -- if a front end model could dispatch as appropriate to the best backend model for a given prompt, and turn everything into a high level sort of mixture of models, I think that would be great, and a great simplifying step. Then they can specialize and optimize all they want, CPU goes down, responses get better and we only see one interface.
Well, in fairness, Anthropic has fewer models because 1) they started later, 2) could learn from competitors' mistakes, 3) focused on enterprise and not consumer, 4) have fewer resources.
The point is taken — and OpenAI agrees. They have said they are actively working on simplifying the offering. I just think it's a bit unfair. We have perfect hindsight today here on HackerNews and also did zero of the work to produce the product.
Model fatigue is a real thing - particularly with their billing model, which is wildly different from model to model and gives you more headroom as you spend more. We spend a lot of time and effort running tests across many models to balance for that cost/performance ratio. When you can run 300k tokens per min on a shittier model, or 10k tokens per min on a better model - you want to use the cheaper model, but if the performance isn't there then you gotta pivot. Can I use tools here? Can I use function calling here? Do I use the chat API, the chat completions API, or the responses API? Do any of those work with the model I want to use, or only with other models?
I almost wonder if this is intentional ... because when you create a quagmire of insane inter-dependent billing scenarios you end up with a product like AWS that can generate substantial amounts of revenue from sheer ignorance or confusion. Then you can hire special consultants to come in and offer solutions to your customers in order to wade through the muck on your behalf.
Dealing with OpenAI's API's is a straight up nightmare.
Most industries or categories go through cycles of fragmentation and consolidation.
AI is currently in a high-growth expansion phase. This leads to rapid iteration and fragmentation, because getting things released is the most important thing.
When the models start to plateau or the demands on the industry are for profit you will see consolidation start.
I'd attribute that more to first mover advantage than a benefit from poor naming choices, though I do think they are likely to misattribute that to a causal relationship so that they keep doing the latter
They do this because people like to have predictability. A new model may behave quite differently on something that’s important for a use case.
Also, there are a lot of cases where very small models are just fine and others where they are not. It would always make sense to have the smallest highest performing models available.
I can not believe that we feel that this is what's most worth talking about here (by visibility). At this point I truly wonder if AI is what will make HN side with the luddites.
This seems like a perfect use case for "agentic" AI. OpenAI can enrich the context window with the strengths and weakness of each model, and when a user prompts for something the model can say "Hey, I'm gonna switch to another model that is better at answering this sort of question." and the user can accept or reject.
> This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model
OpenAI's progress lately:
2024 December - first reasoning model (official release)
2025 February - deep research
2025 March - true multi-modal image generation
2025 April - reasoning model with tools
I'm not sure why people say they haven't done much. We couldn't even dream of stuff like this five years ago, and now releasing groundbreaking/novel features every month is considered "meh"... I think we're spoiled and can't appreciate anything anymore :)
If there are incremental gains in each release, why would they hold them back? The amount of exhaust coming off of each release is gold for the internal teams. The naming convention is bad, and the CPO just admitted as much on Lenny's podcast, but I am not sure why incremental releases is a bad thing.
There are 9 models in the ChatGPT model picker and they have stated that it's their goal to get rid of the model picker because everyone finds it annoying.
Think for 30 seconds about why they might in good faith do what they do.
Do you use any of them? Are you a developer? Just because a model is non-deterministic it doesn't mean developers don't want some level of consistency, whether it be about capabilities, cost, latency, call structure etc.
you'd think they could use AI to interpret the best model for your use case so you don't even have to think about it. Run the first few API calls in parallel, grade the result, and then send the rest to whatever works best
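A minimal sketch of that idea, assuming a hypothetical grade() function (an eval set, a rubric, or an LLM judge) and an arbitrary candidate list:

    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    client = OpenAI()
    CANDIDATES = ["o4-mini", "o3", "gpt-4.1"]

    def answer(model, prompt):
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content

    def pick_model(calibration_prompts):
        # Run the first few prompts against every candidate in parallel, score the
        # answers with the hypothetical grader, and route the rest to the winner.
        scores = {m: 0.0 for m in CANDIDATES}
        with ThreadPoolExecutor() as pool:
            for prompt in calibration_prompts:
                futures = {m: pool.submit(answer, m, prompt) for m in CANDIDATES}
                for m, fut in futures.items():
                    scores[m] += grade(prompt, fut.result())  # hypothetical grader
        return max(scores, key=scores.get)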
> All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model rather than updating the existing ones.
That's not a problem in and of itself. It's only a problem if the models aren't good enough.
Judging by ChatGPT's adoption, people seem to think they're doing just fine.
Here's a summary of this conversation so far, generated using o3 after 306 comments. This time I ran it like so:
llm install llm-openai-plugin
llm install llm-hacker-news
llm -m openai/o3 -f hn:43707719 -s 'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct "quotations" (with author attribution) where appropriate.
You MUST quote directly from users when crediting them, with double quotes.
Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
Then again with o4-mini using the exact same content (hence the hash ID for -f):
llm -m openai/o4-mini \
-f f16158f09f76ab5cb80febad60a6e9d5b96050bfcf97e972a8898c4006cbd544 \
-s 'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct "quotations" (with author attribution) where appropriate.
You MUST quote directly from users when crediting them, with double quotes.
Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
The breathtaking thing is not the model itself, but that someone as smart as Cowen (and he's not the only one) is uttering "AGI" in the same sentence as any of these models. Now, I'm not a hater, and for many tasks they are amazing, but they are, as of now, not even close to AGI, by any reasonable definition.
We're pretty bad at model naming and communicating capabilities (in our defense, it's hard!), but o4-mini is actually a _considerably_ better vision model than o3, despite the benchmarks. Similar to how o3-mini-high was a much better coding model than o1. I would recommend using o4-mini-high over o3 for any task involving vision.
Thanks for the reply. I am not sure the vision is the failing point here, but logic. I routinely try to get these models to solve difficult puzzles or coding challenges (the kind that a good undergrad math major could probably solve, but that most would struggle with). They fail almost always. Even with help.
For example, JaneStreet monthly puzzles. Surprisingly, the new o3 was able to solve this month's (previous models were not), which was an easier one. Believe me, I am not trying to minimize the overall achievement -- what it can do is incredible -- but I don't believe the phrase AGI should even be mentioned until we are seeing solutions to problems that most professional mathematicians would struggle with, including solutions to unsolved problems.
That might not be enough even, but that should be the minimum bar for even having the conversation.
>what it can do is incredible -- but I don't believe the phrase AGI should even be mentioned until we are seeing solutions to problems that most professional mathematicians would struggle with, including solutions to unsolved problems.
But Why ? Why should Artificial General Intelligence preclude things a good chunk of humans wouldn't be able to do ? Are those guys no longer General Intelligences ?
I'm not saying this definition is 'wrong' but you have to realize at this point, the individual words of that acronym no longer mean anything.
Sure, there's no authority who stamps the official definition.
I'll make my case. To me, if you look at how the phrase is usually used -- "when humans have achieved AGI...", etc -- it evokes a science fiction turning point that implies superhuman performance in more or less every intellectual task. It's general, after all. I think of Hal or the movie Her. It's not "Artifical General Just-Like-Most-People-You-Know Intelligence". Though we are not there yet, either, if you consider the full spectrum of human abilities.
Few things would demonstrate general superhuman reasoning ability more definitively than machines producing new, useful, influential math results at a faster rate than people. With that achieved, you would expect it could start writing fiction and screenplays and comedy as well as people too (it's still very far imo), but maybe not, maybe those skills develop at different paces, and I still wouldn't want to call it AGI. But I think truly conquering mathematics would get me there.
A standard term people use for what you describe is superintelligence, not AGI.
Current frontier models are better than average humans in many skills but worse in others. Ethan Mollick calls it “jagged frontier” which sounds about right.
I think it is AGI, seriously. Try asking it lots of questions, and then ask yourself: just how much smarter was I expecting AGI to be?
That's his whole argument!!!! This is so frustrating coming from a public intellectual. "You don't need rigorous reasoning to answer these questions, baybeee, just go with your vibes." Complete and total disregard for scientific thinking, in favor of confirmation bias and ideology.
Tyler Cowen is someone I take seriously. I think he is one of the most rational thought leaders.
But I have to say, his views on LLMs seem a little premature. He definitely has a unique viewpoint of what "general intelligence" is, which might not apply broadly to most jobs. I think he "interviews" them like they were guests on his podcast and bases his judgement on how they compare to his other extremely smart guests.
>I'm obsessed with o3. It's way better than the previous models. It just helped me resolve a psychological/emotional problem I've been dealing with for years in like 3 back-and-forths (one that wasn't socially acceptable to share, and those I shared it with didn't/couldn't help)
Genuinely intrigued by what kind of “psychological/emotional problem I've been dealing with for years” could an AI solve in a matter of hours after its release.
The most striking difference to me is that o3 and o4 know when the web search tool is unavailable, and will tell you they can't answer a question that requires it. While 4o and (sadly) 4.1 will just make up a bunch of nonsense.
I'm simultaneously impressed that they can do that, and also wondering why the heck that's so impressive (isn't "is this tool in this list?" something GPT-3 was able to handle?) and why 4.1 still fails at it too—especially considering it's hyped as the agentic coder model!
That's pretty damning for the general intelligence aspect of it, that they apparently had to special-case something so trivial... and I say that as someone who's really optimistic about this stuff!
That being said, the new "enhanced" web search seems great so far, and means I can finally delete another stupid 10 line Python script from 2023 that I shouldn't have needed in the first place ;)
(...Now if they'd just put 4.1 in the Chat... why the hell do I need to use a 3rd party UI for their best model!)
This is even more incomprehensible to users who don't understand what this naming scheme is supposed to mean. Right now, most power users are keeping track of all the models and know what they are like, so this naming wouldn't help them. Normal consumers don't really know the difference between the models, but this wouldn't help them either - all those letters and numbers aren't super inviting and friendly. They could try just having a linear slider for amount of intelligence and another one for speed.
After refreshing the browser I see that the old o3-mini-high has gone now so I continued my coding task conversation with o4-mini-high. In two separate conversations it butchered things in a way that I never saw o3-mini-high do. In one case it rewrote working code without reason, breaking it, in the other it took a function I asked it to apply a code fix to and it instead refactored it with a different and unrelated function that was part of an earlier bit of chat history.
I notice too that it employs a different style of code where it often puts assignment on a different line, which looks like it's trying to maintain an ~80 character line limit, but does so in places where the entire line of code is only about 40 characters.
Not saying it's for sure the case, but it might be that the model gets confused by out-of-distribution text from the other model, whereas it expects the text in its context to have come from itself (particularly if the CoT is used as context for later conversations).
I’m having very mixed feelings about it. I’m using o3 to help me parse and understand a book about statistics and ML, it’s very dense in math.
On one hand the answers became a lot more comprehensive and deep. It’s now able to give me very advanced explanations.
On the other hand, it started overloading the answers with information. Entire concepts became single sentence summaries. Complex topics and theorems became acronyms. In a way I’m feeling overwhelmed by the information it’s now throwing at me. I can’t tell if it’s actually smarter or just too complicated for me to understand.
Surprise, the machine that interpolates from a database of maths books confuses a human who wants to learn about the contents of the books in that database.
The demo video is very impressive, and it shows what AI could be. Our current models are unreliable in research, but if they were reliable, then what's shown alone would be better than AGI.
There are 8 billion+ instances of general intelligence on the planet; there isn't a shortage. I'd rather see AI do data science and applied math at computer speeds. Those are the hard problems, a lot of the AGI problems (to human brains) are easy.
So what are they selling with the 200 dollar subscription? Only a model that has now caught up with their competitor who sells for 1/10 of their price?
o4-mini is available on vs code. I've been playing with it for the last couple of hours. It's quite fast for a thinking model.
It's also super concise with code. Where claude 3.7 and gemini 2.5 will write a ton, o4-mini will write a tiny portion of it accomplishing the same task.
On the flip side, in its conciseness, it's lazier with implementation than the other leading models, missing features.
For fixing very complex typescript types, I've previously found that o1 outperformed the others. o4-mini seems to understand things well here.
I still think gemini will continue to be my favorite model for code. It's more consistent and follows instructions better.
However, openAI's more advanced models have a better shot at providing a solution when gemini and claude are stuck.
Maybe there's a win here in having o4-mini or o3 do a first draft for conciseness, revise with gemini to fill in what's missed (but with a base that is not overdone), and then run fixes with o4-mini.
The user experience needs to be massively improved when it comes to model choice. How are average users supposed to know which model to pick? Why shouldn't I just always pick the newest or most powerful one? Why should I have to choose at all? I say this from the perspective of a ChatGPT user - I understand the different pricing on the API side helps people make decisions.
Interesting that using tools to zoom around the image is useful for the model. I was kind of assuming that these models were beyond such things and could attend to all aspects of the image simultaneously anyway, but perhaps their input is still limited in resolution? Very cool, in any case, spooky progress as always.
On the vision side of things: I ran my torture test through it, and while it performed "well", about the same level as 4o and o1, it still fails to handle spatial relationships well, and did hallucinate some details. OCR is a little better it seems, but a more thorough OCR focused test would be needed to know for sure. My torture tests are more focused on accurately describing the content of images.
Both seem to be better at prompt following and have more up to date knowledge.
But honestly, if o3 was only at the same level as o1, it'd still be an upgrade since it's cheaper. o1 is difficult to justify in the API due to cost.
FWIW, o4-mini-high does not feel better than o3-mini-high for working on fairly simple econ theory proofs. It does feel faster. And both make elementary mistakes.
So far with my random / coding design question that I asked with o1 last week, it did substantially better with o3. It's more like a mid-level engineer and less like an intern.
I have been using o4-mini-high today. Most of the time for a file longer than 100 lines it stops generating randomly and won't complete a file unless I re-prompt it with the end of the missing file.
As usual, it's a frustrating experience for anything more complex than the usual problems everyone else does.
ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high.
I subscribe to pro but don't yet see the new models (either in the Android app or on the web version).
The big step function here seems to be RL on tool calling.
Claude 3.7/3.5 are the only models that seem to be able to handle "pure agent" usecases well (agent in a loop, not in an agentic workflow scaffold[0]).
OpenAI has made a bet on reasoning models as the core to a purely agentic loop, but it hasn't worked particularly well yet (in my own tests, though folks have hacked a Claude Code workaround[1]).
o3-mini has been better at some technical problems than 3.7/3.5 (particularly refactoring, in my experience), but still struggles with long chains of tool calling.
My hunch is that these models were tuned _with_ OpenAI Codex[2], which is presumably what Anthropic was doing internally with Claude Code on 3.5/3.7
tl;dr - GPT-3 launched with completions (predict the next token), then OpenAI fine-tuned that model on "chat completions", which led to GPT-3.5/GPT-4 and ultimately the success of ChatGPT. This new agent paradigm requires fine-tuning on the LLM interacting with itself (thinking) and with the outside world (tools), sans any human input.
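A minimal sketch of the "agent in a loop" shape being described (not OpenAI's or Anthropic's actual harness; tools is a list of tool schemas, and run_tool is a hypothetical dispatcher that executes a named tool and returns its output as a string):

    import json
    from openai import OpenAI

    client = OpenAI()

    def agent_loop(task, tools, max_steps=20):
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            resp = client.chat.completions.create(
                model="o4-mini", messages=messages, tools=tools)
            msg = resp.choices[0].message
            messages.append(msg)  # keep the model's turn in context
            if not msg.tool_calls:  # no more tool use: the model is done
                return msg.content
            for call in msg.tool_calls:
                output = run_tool(call.function.name,  # hypothetical dispatcher
                                  json.loads(call.function.arguments))
                messages.append({"role": "tool",
                                 "tool_call_id": call.id,
                                 "content": output})
        return None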
I find o4-mini very bad at coding. I tried to improve a script created by o3-mini-high with o4-mini-high, and it doesn't return nearly as good results as what I used to get from o3-mini-high.
Doesn't achieving AGI mean the beginning of the end of humanity's current economic model? I'm not sure I understand the presumption by many that achieving AGI is just another step in some company's offering.
o3 failed the first test I gave it. I wanted it to create a bar chart using Python of the first 10 Fibonacci numbers (did this easily), and then use that image as input to generate an info-graphic of the chart with an animal theme. It failed in two ways. It didn't have access to the visual output from python and, when I gave it a screenshot of that output, it failed in standard GenAI fashion by having poor / incomplete text and not adhering exactly to bar heights, which were critical in this case.
So one failure that could be resolved with better integration on the back end and then an open problem with image generation in general.
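For reference, the scripted half of the task is only a few lines of matplotlib, something like the sketch below; it's the image-to-infographic step that falls apart:

    import matplotlib.pyplot as plt

    fibs = [1, 1]
    while len(fibs) < 10:
        fibs.append(fibs[-1] + fibs[-2])

    plt.bar(range(1, 11), fibs)
    plt.xlabel("n")
    plt.ylabel("Fibonacci(n)")
    plt.title("First 10 Fibonacci numbers")
    plt.savefig("fib.png")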
I’m not sure I fully understand the rationale of having newer mini versions (eg o3-mini, o4-mini) when previous thinking models (eg o1) and smart non-thinking models (eg gpt-4.1) exist. Does anyone here use these for anything?
o1 is a much larger model that is more expensive to operate on OpenAI's end. Having a smaller "newer" (roughly equating newer to more capable) model means that you can match the performance of larger, older models while reducing inference and API costs.
At this point, it's like comparing the iPhone 5s vs the iPhone 6. The upgrades are still noticeable, but it's nowhere the huge jump between GPT 3.5 and GPT 4.
It seems to be getting better. I used to use my custom "Turbo Chad" GPT based on 4o and now the default models are similar. Is it learning from my previous annoyances?
> Downloaded an untouched char.lgp from the current Steam build (1.0.9) to make sure the count reflects the shipping game rather than a modded archive.
Meanwhile we have people elsewhere in the thread complaining about too many models.
Assuming OpenAI are correct that o3 is strictly an improvement over o1 then I don't see why they'd keep o1 around. When they upgrade gpt-4o they don't let you use the old version, after all.
>Assuming OpenAI are correct that o3 is strictly an improvement over o1 then I don't see why they'd keep o1 around.
Imagine if every time your favorite SaaS had an update, they renamed the product. Yesterday you were using Slack S7, and today you're suddenly using Slack 9S-o. That was fine in the desktop era, when new releases happened once a year - not every few weeks. You just can't keep up with all the versions.
I think they should just stick with one brand and announce new releases as just incremental updates to that same brand/product (even if the underlying models are different): "the DeepSearch Update" or "The April 2025 Reasoning Update" etc.
The model picker should be replaced entirely with a router that automatically detects which underlying model to use. Power users could have optional checkboxes like "Think harder" or "Code mode" as settings, if they want to guide the router toward more specialized models.
I've taken to pasting in the latest OpenAI API docs for their python library to each prompt (via API, I'm not pasting each time manually in ChatGPT) so that the AI can write code that uses itself! Like, I get it, the training data thing is hard, but - OpenAI changed their python library with breaking changes and their models largely still do not know about it! I haven't tried 4.1- series yet with their newer cutoff, but, the rest of the models like o3-mini (and I presume these new ones today) still write openai python library code in the old, broken style. Argh.
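For anyone who hasn't hit it: the breaking change is the move from module-level calls to a client object, so the models keep emitting the pre-1.0 style, which no longer runs, instead of the current client-object style (sketch below; the model name is just an example):

    # Pre-1.0 style the models keep producing (raises an error on openai>=1.0):
    #   import openai
    #   openai.ChatCompletion.create(model="gpt-4", messages=[...])

    # Current 1.x style:
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": "hello"}],
    )
    print(resp.choices[0].message.content)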
I want to be excited about this but after chatting with 4.1 about a simple app screenshot and it continuously forgetting and hallucinating, I am increasingly sceptical of Open AI's announcements. (No coding involved, so the context window was likely < 10% full.)
I have barely found time to gauge 4.1's capabilities, so at this stage, I'd rather focus on the ever-worsening names these companies bestow upon their models. To say that the USB-IF have found their match would be an understatement.
Here are some notes I made to understand each of these models and when to use them.
# OpenAI Models
## Reasoning Models (o-series)
- All `oX` (o-series) models are reasoning models.
- Use these for complex, multi-step, reasoning tasks.
## Flagship/Core Models
- All `x.x` and `Xo` models are the core models.
- Use these for one-shot results
- Examples: 4o, 4.1
## Cost Optimized
- All `-mini`, `-nano` are cheaper, faster models.
- Use these for high-volume, low effort tasks.
## Flagship vs Reasoning (o-series) Models
- Latest flagship model = 4.1
- Latest reasoning model = o3
- The flagship models are general purpose, typically with larger context windows. These rely mostly on pattern matching.
- The reasoning models are trained with extended chain-of-thought and reinforcement learning. They work best with tools, code and other multi-step workflows. Because tools are used, the accuracy will be higher.
# List of Models
## 4o (omni)
- 128K context window
- Use: complex multimodal tasks, applications requiring the top level of reliability and nuance
## 4o-mini
- 128K context window
- Use: multimodal reasoning for math, coding, and structured outputs
- Use: Cheaper than `4o`. Use when you can trade off accuracy vs speed/cost.
- Dont Use: When high accuracy is needed
## 4.1
- 1M context window
- Use: For large context ingest, such as full codebases
- Use: For reliable instruction following, comprehension
- Dont Use: For high volume/faster tasks
## 4.1-mini
- 1M context window
- Use: For large context ingest
- Use: When a tradeoff can be made with accuracy vs speed
## 4.1-nano
- 1M context window
- Use: For high-volume, near-instant responses
- Dont Use: When accuracy is required
- Examples: classification, autocompletion, short-answers
## o3
- 200K context window
- Use: for the most challenging reasoning tasks in coding, STEM, and vision that demand deep chain‑of‑thought and tool use
- Use: Agentic workflows leveraging web search, Python execution, and image analysis in one coherent loop
- Dont Use: For simple tasks, where lighter model will be faster and cheaper.
## o4-mini
- 200K context window
- Use: High-volume needs where reasoning and cost should be balanced
- Use: For high throughput applications
- Dont Use: When accuracy is critical
## o4-mini-high
- 200K context window
- Use: When o4-mini results are not satisfactory, but before moving to o3.
- Use: Complex tool-driven reasoning, where o4-mini results are not satisfactory
- Dont Use: When accuracy is critical
## o1-pro-mode
- 200K context window
- Use: Highly specialized science, coding, or reasoning jobs that benefit from extra compute for consistency
- Dont Use: For simple tasks
## Models Sorted for Complex Coding Tasks (my opinion)
1. o3
2. Gemini 2.5 Pro
3. Claude 3.7
4. o1-pro-mode
5. o4-mini-high
6. 4.1
7. o4-mini
It took me reading your comment to realize that they were different and this wasn’t deja vu. Maybe that says more about me than OpenAI, but my gut agrees with you.
They jokingly admitted that they’re bad at naming in the 4.1 reveal video, so they’re certainly aware of the problem. They’re probably hoping to make the model lineup clearer after some of the older models get retired, but the current mess was certainly entirely foreseeable.
Gave Codex a go with o4-mini and it's disappointing...
Here you can see my tries. It fully fails on something a mid engineer can do after getting used to the tools:
https://xcancel.com/Tymscar/status/1912578655378628847
What is wrong with OpenAI? The naming of their models seems like it is intentionally confusing - maybe to distract from lack of progress? Honestly, I have no idea which model to use for simply everyday tasks anymore.
It really is bizarre. If you had asked me 2 days ago I would have said unequivocally that these models already existed. Surely given the rate of change a date-based numbering system would be more helpful?
I suspect that "ChatGPT-4o" is the most confusing part. Absolutely baffling to go with that and then later "oN", but surely they will avoid any "No" models moving forward
But we have both 4o and 4.1 for non-reasoning. And it's still not clear to me which is better (the comparison on their page was from an older version of 4o).
I have doubts whether the live stream was really live.
During the live-stream the subtitles are shown line by line.
When subtitles are auto-generated, they pop up word by word, which I assume would need to happen during a real live stream.
Line-by-line subtitles are shown when the uploader provides captions themselves for an existing video; the only way OpenAI could provide captions ahead of time is if the "live-stream" isn't actually live.
Ok, I’m a bit underwhelmed. I’ve asked it a fairly technical question, about a very niche topic (Final Fantasy VII reverse engineering): https://chatgpt.com/share/68001766-92c8-8004-908f-fb185b7549...
With right knowledge and web searches one can answer this question in a matter of minutes at most. The model fumbled around modding forums and other sites and did manage to find some good information but then started to hallucinate some details and used them in the further research. The end result it gave me was incorrect, and the steps it described to get the value were totally fabricated.
What’s even worse in the thinking trace it looks like it is aware it does not have an answer and that the 399 is just an estimate. But in the answer itself it confidently states it found the correct value.
Essentially, it lied to me that it doesn’t really know and provided me with an estimate without telling me.
Now, I’m perfectly aware that this is a very niche topic, but at this point I expect the AI to either find me a good answer or tell me it couldn’t do it. Not to lie me in the face.
Edit: Turns out it’s not just me: https://x.com/transluceai/status/1912552046269771985?s=46
Compare to Gemini Pro 2.5:
https://g.co/gemini/share/c8fb1c9795e4
Of note, the final step in the CoT is:
> Formulate Conclusion: Since a definitive list or count isn't readily available through standard web searches, the best approach is to: state that an exact count is difficult to ascertain from readily available online sources without direct analysis of game files ... avoid giving a specific number, as none was reliably found across multiple sources.
and then the response is in line with that.
I like this answer. It does mention the correct, definitive way of getting the information I want (extracting the char.lgp data file) and so even though it gave up it pushes you in the right direction, whereas o3/o4 just make up stuff.
I've used AI with "niche" programming questions and it's always a total let down. I truly don't understand this "vibe coding" movement unless everyone is building todo apps.
There's a bit of a skill to it.
Good architecture plans help. Telling it where in an existing code base it can find things to pattern match against is also fantastic.
I'll often end up with a task that looks something like this:
* Implement Foo with a relation to FooBar.
* Foo should have X, Y, Z features
* We have an existing pattern for Fidget in BigFidget. Look at that for implementation
* Make sure you account for A, B, C. Check Widget for something similar.
It works surprisingly well.
> Good architecture plans help.
This is the key answer right here.
LLMs are great at interpolating and extrapolating based on context. Interpolating is far less error-prone. The problem with interpolating is that you need to start with accurate points so that interpolating between them leads to expected and relatively accurate estimates.
What we are seeing is the result of developers being oblivious to higher-level aspects of coding, such as software architecture, proper naming conventions, disciplined choice of dependencies and dependency management, and even best practices. Even basic requirements-gathering.
Their own personal experience is limited to diving into existing code bases and patching them here and there. They often screw up the existing software architecture because their lack of insight and awareness leads them to post PRs that get the job done at the expense of polluting the whole codebase into an unmanageable mess.
So these developers crack open an LLM and prompt it to generate code. They use their insights and personal experience to guide their prompts. Their experience reflects what they do on a daily basis. The LLMs of course generate code from their prompts, and the result is underwhelming. Garbage-in, garbage-out.
It's the LLMs fault, right? All the vibe coders out there showcasing good results must be frauds.
The telltale sign of how poor these developers are is how they dump the responsibility of they failing to get LLMs to generate acceptable results on the models not being good enough. The same models that are proven effective at creating whole projects from scratch at their hands are incapable of the smallest changes. It's weird how that sounds, right? If only the models were better... Better at what? At navigating through your input to achieve things that others already achieve? That's certainly the model's fault, isn't it?
A bad workman always blames his tools.
Yes, with a bit of work around prompting and focusing on closed context, or as you put it, interpolating, you can get further. But the problems is that, this is not how the LLMs were sold. If you blame someone for trying to use it by specifying fairly high level prompts - well isn´t that exactly how this technology was being advertised the whole time? The problem is not the bad workman, the problem is that the tool is not doing what it is advertised as doing.
> But the problems is that, this is not how the LLMs were sold.
No one cares about promises. The only thing that matters are the tangibles we have right now.
Right now we have a class of tools that help us write multidisciplinary apps with a few well-crafted prompts and zero code involved.
Yeah this is a great summary of what I do as well and I find it very effective. I think of hands-off AI coding like you're directing a movie. You have a rough image of what "good" looks like in your head, and you're trying to articulate it with enough detail to all the stagehands and actors such that they can realize the vision. The models can always get there with enough coaching, traditionally the question is if that's worth the trouble versus just doing it yourself.
Increasingly I find that AI at this point is good enough I am rarely stepping in to "do it myself".
It's incredible when I ask Claude 3.7 a question about Typescript/Python and it can generate hundreds of lines of code that are pretty on point (it's usually not exactly correct on first prompt, but it's coherent).
I've recently been asking questions about Dafny and Lean -- it's frustrating that it will completely make up syntax and features that don't exist, but still speak to me with the same confidence as when it's talking about Typescript. It's possible that shoving lots of documentation or a book about the language into the context would help (I haven't tried), but I'm not sure if it would make up for the model's lack of "intuition" about the subject.
Don't need to go that esoteric. I've seen them make stuff up pretty often for more common functional programming languages like Haskell and OCaml.
Recommend using RAG for this. Make the Haskell or OCaml documentation your knowledge base and index it for RAG. Then it makes a heck of a lot more sense!
How does one do that? As far as I can tell neither Claude or chatgpt web clients support this. Is there a third party tool that people are using?
You could try using the built-in "projects" feature of Claude and ChatGPT: https://support.anthropic.com/en/articles/9517075-what-are-p...
You can get pretty good results by copying the output from Firefox's Reader View into your project, for example: about:reader?url=https://learnxinyminutes.com/ocaml/
They can't
You can use openwebui with deepseek v3 0324 via API, with for example deepinfra as the provider for your embeddings and text generation models.
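If you'd rather wire it up yourself, the core of it is small; a minimal sketch using OpenAI embeddings plus cosine similarity, where load_ocaml_docs() is a hypothetical stand-in for however you chunk the documentation:

    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    doc_chunks = load_ocaml_docs()  # hypothetical: a list of documentation strings
    doc_vectors = embed(doc_chunks)

    def answer(question, k=5):
        # Retrieve the k most similar chunks and stuff them into the prompt.
        q = embed([question])[0]
        sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
        context = "\n\n".join(doc_chunks[i] for i in np.argsort(sims)[-k:])
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user",
                       "content": f"Answer using this documentation:\n{context}\n\nQ: {question}"}])
        return resp.choices[0].message.content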
Well, all LLMs are fairly bad for React Native as soon as you look at more than hello-world type things.
I got stuck with different LLMs until I checked the official documentation; they were spouting nonsense about features removed 2+ years ago, I suppose, or just making stuff up.
I'm trialing co-pilot in VSCode and it's a mixed bag. Certain things it pops out great, but a lot of times I'll be like woohoo! <tab> <tab> <tab> and then end up immediately realising wait a sec, none of this is actually needed, or it's just explicitly calling for things that are already default values, or whatever.
(This is particularly in the context of metadata-type stuff, things like pyproject files, ansible playbooks, Dockerfiles, etc)
I recently exclaimed that “vibe coding is BS” to one of my coworkers before explaining that I’ve actually been using GPT, Claude, llama (for airplanes), Cline, Cursor, Windsurf, and more for coding for as long as they’ve been available (more recently playing with Gemini). Cline + Sonnet 3.7 has been giving me great results on smaller projects with popular languages, and I feel truly fortunate to have AWS Bedrock on tap to drive this stuff (no effective throttling/availability limits for an individual dev). Even llama + Continue has proven workable (though it will absolutely hallucinate language features and APIs).
That said, 100% pure vibe coding is, as far as I can tell, still very much BS. The subtle ugliness that can come out of purely prompt-coded projects is truly a rat hole of hate, and results can get truly explosive when context windows saturate. Thoughtful, well-crafted architectural boundaries and protocols call for forethought and presence of mind that isn’t yet emerging from generative systems. So spend your time on that stuff and let the robots fill in the boilerplate. The edges of capability are going to keep moving/growing, but it’s already a force multiplier if you can figure out ways to operate.
For reference, I’ve used various degrees of assistance for color transforms, computer vision, CNN network training for novel data, and several hundred smaller problems. Even if I know how to solve a problem, I generally run it through 2-3 models to see how they’ll perform. Sometimes they teach me something. Sometimes they violently implode, which teaches me something else.
> That said, 100% pure vibe coding is, as far as I can tell, still very much BS.
I don't really agree. There's certainly a showboating factor, not to mention there is currently a goldrush to tap this movement and capitalize on it. However, I personally managed to create a fully functioning web app from scratch with Copilot + VS Code using a mix of GPT-4 and o1-mini. I'm talking about both backend and frontend, with basic auth in place. I am by no means an expert, but I did it in an afternoon. Call it BS, but the truth of the matter is that the app exists.
People were making a front and backend web app in half a day using Ruby on Rails way before LLMs were ever a thing, and their code quality was still much better than yours!
So vibe coding, sure you can create some shitty thing which WORKS, but once it becomes bigger than a small shitty thing, it becomes harder and harder to work with because the code is so terrible when you're pure vibe coding.
> People were making a front and backend web app in half a day using Ruby on Rails way before LLMs were ever a thing, and their code quality was still much better than yours!
A few people were doing that.
With LLMs, anyone can do that. And more.
It's important to frame the scenario correctly. I repeat: I created everything in an afternoon just for giggles, and I challenged myself to write zero lines of code.
> So vibe coding, sure you can create some shitty thing which WORKS (...)
You're somehow blindly labelling a hypothetical output as "shitty", which only serves to show your bias. In the meantime, anyone who is able to churn out a half-functioning MVP in an afternoon is praised as a 10x developer. There's a contrast in there, where the same output is described as shitty or outstanding depending on who does it.
People who embrace vibe coding are probably the same people who were already pseudo-vibe coding to begin with, using found fragments of code they could piece together to make things sort of work for simple tasks.
I mean, I don't think you need to do cutting edge programming to make something personal to you. See here from Canva's product. Check this out: https://youtu.be/LupwvXsOQqs?t=2366
> I've used AI with "niche" programming questions and it's always a total let down.
That's perfectly fine. It just means you tried without putting in any effort and failed to get results that were aligned with your expectations.
I'm also disappointed when I can't dunk or hit >50% of my 3pt shots, but then again I never played basketball competitively
> I truly don't understand this "vibe coding" movement unless everyone is building todo apps.
Yeah, I also don't understand the NBA. Every single one of those players show themselves dunking and jumping over cars and having almost perfect percentages in 3pt shots during practice, whereas I can barely get off my chair. The problem is certainly basketball.
I imagine that after GPT-4 / o1, improvements on benchmarks have increasingly been a result of overfitting: those breakthrough models already used most of the high-quality training data available on the internet, there haven't been any dramatic architectural changes, we are already melting the world's GPUs, and there simply isn't enough new, high-quality data being generated (orders of magnitude more than what the older models already used) to enable breakthrough improvements.
What I'd really like to see is the model development companies improving their guardrails so that they are less concerned about doing something offensive or controversial and more concerned about conveying their level of confidence in an answer, i.e. saying I don't know every once in a while. Once we get a couple years of relative stagnation in AI models, I suspect this will become a huge selling point and you will start getting "defense grade", B2B type models where accuracy is king.
It can imitate its creator. We reached AGI.
I wanted to believe. But not now.
Have you asked this same question to various other models out there in the wild? I am just curious if you have found some that performed better. I would ask some models myself, but I do not know the proper answer, so I would probably be gullible enough to believe whatever the various answers have in common.
AIs in general are definitely hallucinating a lot more when it comes to niche topics. It is funny how they are unable to say "I don't know" and just make up things to answer your questions
LLMs made me a lot more aware of leading questions.
Tiny changes in how you frame the same query can generate predictably different answers as the LLM tries to guess at your underlying expectations.
Btw, I've also asked this question using Deep Research mode in ChatGPT and got the correct answer: https://chatgpt.com/share/68009a09-2778-8004-af40-4a8e7e812b...
So maybe this is just too hard for a “non-research” mode. I’m still disappointed it lied to me instead of saying it couldn’t find an answer.
What's the correct answer? Curious if it got it right the second time: https://chatgpt.com/share/68009f36-a068-800e-987e-e6aaf190ec...
Same thing happened when asking it a fairly simple question about dracut on Linux.
If I went through with the changes it suggested, I wouldn't have a bootable machine.
How would it ever know the answer it found is true and correct though? It could as well just repeat some existing false answer that you didn't yet find on your own. That's not much better than hallucinating it, since you can't verify its truth without finding it independently anyway.
I would be ok with having an answer and an explanation of how it got the answer with a list of sources. And it does just that - the only problem is that both the answer and the explanation are fabrications after you double check the sources.
Underwhelmed compared with Gemini 2.5 Pro--however it would've been impressive a month ago I think.
> Not to lie me in the face.
Are you saying that it deliberately lied to you?
> With right knowledge and web searches one can answer this question in a matter of minutes at most.
Reminded me of the Dunning-Kruger curve: the AI model at the first peak, and you further along it.
> Are you saying that, it deliberately lied to you?
Pretty much, yeah. Now, “deliberately” does imply some kind of agency or even consciousness, which I don't believe these models have; it's probably the result of overfitting, reward hacking or some other issue from training, but the end result is that the model straight up misleads you knowingly (as in, the thinking trace is aware of the fact it doesn't know the answer, but it provides it anyway).
Oh boy, here comes the “it didn’t work for this one specific thing I tried” posts
Interesting... I asked o3 for help writing a flake so I could install the latest Webstorm on NixOS (since the one in the package repo is several months old), and it looks like it actually spun up a NixOS VM, downloaded the Webstorm package, wrote the Flake, calculated the SHA hash that NixOS needs, and wrote a test suite. The test suite indicates that it even did GUI testing- not sure whether that is a hallucination or not though. Nevertheless, it one-shotted the installation instructions for me, and I don't see how it could have calculated the package hash without downloading, so I think this indicates some very interesting new capabilities. Highly impressive.
Are you sure about all of this? You acknowledged it might be a hallucination, but you seem to mostly believe it? o3 doesn't have the ability to spin up a VM.
https://xcancel.com/TransluceAI/status/1912552046269771985 / https://news.ycombinator.com/item?id=43713502 is a discussion of these hallucinations.
As for the hash, could it have simply found a listing for the package with hashes provided and used that hash?
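As I understand it, for a plain fetchurl-style download the hash Nix wants is just an SRI-style sha256 of the file's bytes, so the model either downloaded the file somewhere or copied the hash from a listing. A rough sketch of the computation (the WebStorm URL is a placeholder; unpacked sources like fetchFromGitHub are hashed over the NAR serialization instead, which this doesn't cover):

```python
# Minimal sketch: the SRI hash Nix expects for a flat-file fetch is just
# base64(sha256(bytes)) with a "sha256-" prefix, so it can't be known without
# either downloading the file or finding the hash published somewhere.
import base64
import hashlib
import urllib.request

def sri_sha256(url: str) -> str:
    data = urllib.request.urlopen(url).read()
    return "sha256-" + base64.b64encode(hashlib.sha256(data).digest()).decode()

# Placeholder URL, just to show the shape of the result ("sha256-...="):
# print(sri_sha256("https://download.jetbrains.com/webstorm/WebStorm-<version>.tar.gz"))
```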
That's so different from my experience. I tried to have it switch a working flake for a yarn package over to npm, and after 3 tries with all the hints I could give it, it couldn't do it.
I find that so incredibly unlikely. Granted I haven't been keeping up to date with the latest LLM developments - but has there even been any actual confirmation from OpenAI that these models have the ability to do such things in the background?
If it can write a nixos flake it's significantly smarter than the average programmer. Certainly smarter than me, one-shotting a flake is not something I'll ever be able to do — usually takes me about thirty shots and a few minutes to cool off from how mad I am at whoever designed this fucking idiotic language. That's awesome.
I was a major contributor to Flakes. What in particular is so idiotic, in your opinion?
I use flakes a lot and I think both flakes and the Nix language are beyond comprehension. Try searching duckduckgo or google for “what is nix flakes” or “nix flake schema” and take an honest read at the results. Insanely complicated and confusing answers, multiple different seemingly-canonical sources of information. Then go look at some flakes for common projects; the almost necessary usage of things like flake-compat and flake-util, the many-valid-approaches to devshell and package definitions, the concepts of “apps” in addition to packages. All very complicated and crazy!
Thank you for your service, I use your work with great anger (check my github I really do!)
I apologize. It was my Haskell life period.
I forgive you as I hope you forgive me. Flakes are certainly much better than Nix without them, and they’ve saved me much more time than they’ve cost me.
Man ... Classic HN.
But yes, unfortunately, even if you're across the whole functional paradigm, Nix is surely complicated. And the promise of bringing a whole system up from one single file is rarely true.
FWIW, they said the language was bad, not specifically flakes. IMHO, nix is super easy if you already know Haskell (possibly others in that family). If you don't, it's extremely unintuitive.
I mean, a smart programmer still has to learn what NixOs and Flakes are, and based on your description and some cursory searching, a smart programmer would just go do literally anything else. Perfect thing to delegate to a machine that doesn't have to worry about motivation.
Just jokes, idk anything about either.
\s
> Interesting... I asked o3 for help writing...
What tool were you using for this?
Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]
Incredible how resilient Claude models have been for best-in-coding class.
[1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume almost no one uses?). The new OpenAI models might still be effectively best in class now (or likely beating Claude with similar augmentation?).
Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but they don't have an SWE-bench score. Shows that looking at one such benchmark isn't very telling. Main advantage over Sonnet being that it's better at using a large amount of context, which is enormously helpful during coding tasks.
Sonnet is still an incredibly impressive model as it held the crown for 6 months, which may as well be a decade with the current pace of LLM improvement.
Main advantage over Sonnet is Gemini 2.5 doesn't try to make a bunch of unrelated changes like it's rewriting my project from scratch.
I find Gemini 2.5 truly remarkable and overall better than Claude, which I was a big fan of
Still doesn't work well in Cursor unfortunately.
Works well in RA.Aid --in fact I'd recommend it as the default model in terms of overall cost and capability.
Working fine here. What problems do you see?
Not the OP, but I believe they could be referring to the fact that it's not supported in edit mode yet, only agent mode.
So far for me that’s not been too much of a roadblock. Though I still find overall Gemini struggles with more obscure issues such as SQL errors in dbt
Cline/Roo Code work fine with it
What language / framework are you using? I ask because in a Node / Typescript / React project I experience the opposite- Claude 3.7 usually solves my query on the first try, and seems to understand the project's context, ie the file structure, packages, coding guidelines, tests, etc, while Gemini 2.5 seems to install packages willy-nilly, duplicate existing tests, create duplicate components, etc.
Node / Vue
This was incredibly irritating at first, though over time I've learned to appreciate this "extra credit" work. It can be fun to see what Claude thinks I can do better, or should add in addition to whatever feature I just asked for. Especially when it comes to UI work, Claude actually has some pretty cool ideas.
If I'm using Claude through Copilot where it's "free" I'll let it do its thing and just roll back to the last commit if it gets too ambitious. If I really want it to stay on track I'll explicitly tell it in the prompt to focus only on what I've asked, and that seems to work.
And just today, I found myself leaving a comment like this: //Note to Claude: Do not refactor the below. It's ugly, but it's supposed to be that way.
Never thought I'd see the day I was leaving comments for my AI agent coworker.
> If I'm using Claude through Copilot where it's "free"
Too bad Microsoft is widely limiting this -- have you seen their pricing changes?
I also feel like they nerfed their models, or reduced context window again.
Claude is almost comically good outside of copilot. When using through copilot it’s like working with a lobotomized idiot (that complains it generated public code about half the time).
Also that Gemini 2.5 still doesn’t support prompt caching, which is huge for tools like Cline.
2.5 Pro supports prompt caching now: https://cloud.google.com/vertex-ai/generative-ai/docs/models...
Oh, that must’ve been in the last few days. Weird that it’s only in 2.5 Pro preview but at least they’re headed in the right direction.
Now they just need a decent usage dashboard that doesn’t take a day to populate or require additional GCP monitoring services to break out the model usage.
Its viable context (the context length where it doesn't fall apart) is also much longer.
I do find it likes to subtly reformat every single line thereby nuking my diff and making its changes unusable since I can’t verify them that way, which Sonnet doesn’t do.
I don't understand this assertion, but maybe I'm missing something?
Google included a SWE-bench score of 63.8% in their announcement for Gemini 2.5 Pro: https://blog.google/technology/google-deepmind/gemini-model-...
I keep seeing this sentiment so often here and on X that I have to wonder if I'm somehow using a different Gemini 2.5 Pro. I've been trying to use it for a couple of weeks already and without exaggeration it has yet to solve a single programming task successfully. It is constantly wrong, constantly misunderstands my requests, ignores constraints, ignores existing coding conventions, breaks my code and then tells me to fix it myself.
I feel that Claude 3.7 is smarter, but does way too much and has poor prompt adherence
2.5 Pro is very buggy with cursor. It often stops before generating any code. It's likely a cursor problem, but I use 3.7 because of that.
Eh, I wouldn't say that's accurate, I think it's situational. I code all day using AI tools and Sonnet 3.7 is still the king. Maybe it's language dependent or something, but all the engineers I know are full on Claude-Code at this point.
The image generation improvement with o4-mini is incredible. Testing it out today, this is a step change in editing specificity even from the ChatGPT 4o LLM image integration just a few weeks ago (which was already a step change). I'm able to ask for surgical edits, and they are done correctly.
There isn't a numerical benchmark for this that people seem to be tracking but this opens up production-ready image use cases. This was worth a new release.
Thanks for sharing that. That was more interesting than their demo. I tried it and it was pretty good! I had felt that not being able to iterate on images blocked this from any real production use for me. This may be good enough now.
Example of edits (not quite surgical but good): https://chatgpt.com/share/68001b02-9b4c-8012-a339-73525b8246...
I don’t know if they let you share the actual images when sharing a chat. For me, they are blank.
wait, o4-mini outputs images? What I thought I saw was the ability to do a tool call to zoom in on an image.
Are you sure that's not 4o?
I’m generating logo designs for merch via o4-mini-high and they are pretty good. Good text and comprehending my instructions.
It's using the new gpt-4o, a version that's not in the API
in the api or on the website?
Also, another addition: I previously tried to upload an image for ChatGPT to edit, and it was incapable under the previous model I tried. Now it's able to change uploaded images using o4-mini.
Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)).[0] OpenAI said they got 69.1% in their blog post.
[0] swebench.com/#verified
Yes, however Claude advertised 70.3%[1] on SWE bench verified when using the following scaffolding:
> For Claude 3.7 Sonnet and Claude 3.5 Sonnet (new), we use a much simpler approach with minimal scaffolding, where the model decides which commands to run and files to edit in a single session. Our main “no extended thinking” pass@1 result simply equips the model with the two tools described here—a bash tool, and a file editing tool that operates via string replacements—as well as the “planning tool” mentioned above in our TAU-bench results.
Arguably this shouldn't be counted though?
[1] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...
I think you may have misread the footnote. That simpler setup results in the 62.3%/63.7% score. The 70.3% score results from a high-compute parallel setup with rejection sampling and ranking:
> For our “high compute” number we adopt additional complexity and parallel test-time compute as follows:
> We sample multiple parallel attempts with the scaffold above
> We discard patches that break the visible regression tests in the repository, similar to the rejection sampling approach adopted by Agentless; note no hidden test information is used.
> We then rank the remaining attempts with a scoring model similar to our results on GPQA and AIME described in our research post and choose the best one for the submission.
> This results in a score of 70.3% on the subset of n=489 verified tasks which work on our infrastructure. Without this scaffold, Claude 3.7 Sonnet achieves 63.7% on SWE-bench Verified using this same subset.
Somehow completely missed that, thanks!
I think reading this makes it even clearer that the 70.3% score should just be discarded from the benchmarks. "I got a 7%-8% higher SWE benchmark score by doing a bunch of extra work and sampling a ton of answers" is not something a typical user is going to have already set up when logging onto Claude and asking it a SWE style question.
Personally, it seems like an illegitimate way to juice the numbers to me (though Claude was transparent with what they did so it's all good, and it's not uninteresting to know you can boost your score by 8% with the right tooling).
It isn't on the benchmark https://www.swebench.com/#verified
The one on the official leaderboard is the 63% score. Presumably because of all the extra work they had to do for the 70% score.
OpenAI have not shown themselves to be trustworthy, I'd take their claims with a few solar masses of salt
they also gave more detail on their SWEBench scaffolding here https://www.latent.space/p/claude-sonnet
I haven't been following them that closely, but are people finding these benchmarks relevant? It seems like these companies could just tune their models to do well on particular benchmarks
The benchmark is something you can optimize for; that doesn't mean it generalizes well. Yesterday I tried for 2 hours to get Claude to create a program that would extract data from a weird Adobe file. $10 later, the best I had was a program that was doing something like:
That’s exactly what’s happening. I’m not convinced there’s any real progress occurring here.
Right now the SWE-bench leader, Augment Agent, still uses Claude 3.7 in combo with o1. https://www.augmentcode.com/blog/1-open-source-agent-on-swe-...
The findings are open sourced on a repo too https://github.com/augmentcode/augment-swebench-agent
Also, if you're using Cursor AI, it seems to have much better integration with Claude where it can reflect on its own things and go off and run commands. I don't see it doing that with Gemini or the O1 models.
I often wonder if we could expect that to reach 80% - 90% within next 5 years.
I have a very basic / stupid "Turing test" which is just to write a base 62 converter in C#. I would think this exact thing would be in github somewhere (thus in the weights) but has always failed for me in the past (non-scientific / didn't try every single model).
Using o4-mini-high, it actually did produce a working implementation after a bit of prompting. So yeah, today, this test passed which is cool.
Unless I'm misunderstanding what you are asking the model to do, Gemini 2.5 pro just passed this easily. https://g.co/gemini/share/e2876d310914
As I mentioned, this is not a scientific test but rather just something that I have tried from time to time and that has always (shockingly, in my opinion) failed but today worked. It takes a minute or two of prompting, is boring to verify, and I don't remember exactly which models I have used. It is purely a personal anecdote, nothing more.
However, looking at the code that Gemini wrote in the link, it does the same thing that other LLMs often do, which is to assume that we are encoding individual long values. I assume there must be a GitHub repo or Stack Overflow question in the weights somewhere that is pushing it in this direction, but it is a little odd. Naturally, this isn't the kind of encoder that someone would normally want. Typically it should encode a byte array and return a string (or maybe encode/decode UTF-8 strings directly). Having the interface use a long is very weird and not very useful.
In any case, I suspect with a bit more prompting you might be able to get gemini to do the right thing.
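To make the distinction concrete, this is roughly the byte-array flavour I have in mind, sketched in Python rather than C# just to show the interface shape (illustrative only, not production code):

```python
# Illustrative byte-array base62 codec: treat the bytes as one big integer,
# repeatedly divide by 62, and keep leading zero bytes explicitly.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def base62_encode(data: bytes) -> str:
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    num = int.from_bytes(data, "big")
    chars = []
    while num > 0:
        num, rem = divmod(num, 62)
        chars.append(ALPHABET[rem])
    return ALPHABET[0] * leading_zeros + "".join(reversed(chars))

def base62_decode(text: str) -> bytes:
    leading_zeros = len(text) - len(text.lstrip(ALPHABET[0]))
    num = 0
    for ch in text.lstrip(ALPHABET[0]):
        num = num * 62 + ALPHABET.index(ch)
    body = num.to_bytes((num.bit_length() + 7) // 8, "big") if num else b""
    return b"\x00" * leading_zeros + body

assert base62_decode(base62_encode(b"\x00\x01hello world")) == b"\x00\x01hello world"
```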
I think it's because the question is rather ambiguous - "convert the number to base-N" is a very common API, e.g. in C# you have Convert.ToString(long value, int base), in JavaScript you have Number.toString(base) etc. It seems that it just follows this pattern. If you were to ask me the same question, I'd probably do the same thing without any further context.
OTOH if you tell it to write a Base62 encoder in C#, it does consistently produce an API that can be called with byte arrays: https://g.co/gemini/share/6076f67abde2
There is Convert.ToBase64String so I don't think encode is necessarily universal (though probably more precise).
Similarly, many of my informal tests have started passing with Gemini 2.5 that never worked before, which makes the 2025 era of AI models feel like a step change to me.
I’ve been using Gemini 2.5 pro side by side with o1-pro and Grok lately. My experience is they each randomly offer significant insight the other two didn’t.
But generally, o1-pro listens to my profile instructions WAY better, and it seems to be better at actually solving problems the first time. More reliable.
But they are all quite similar and so far these new models are similar but faster IMO.
I asked o3 to build and test a maximum parsimony phylogenetic tree builder in Python (my standard test for new models) and it's been thinking for 10 minutes. Still not clear if anything is happening, I have barely seen any code since I asked to test what it produced in the first answer. The thought summary is totally useless compared to Gemini's. Underwhelming so far.
The CoT summary is full of references to Jupyter notebook cells. The variable names are too abbreviated (nbr for neighbor), and the code becomes fairly cryptic as a result, not nice to read. Maybe optimized too much for speed.
Also I've noticed ChatGPT seems to abort thinking when I switch away from the app. That's stupid, I don't want to look at a spinner for 5 minutes.
And the CoT summary keeps mentioning my name which is irritating.
It's maddening that you can't switch away from the app while it generates output. To use the Deep Research feature on mobile, you have to give up your phone for ten minutes.
Deep Research will run in the background on mobile, and I think it gives a notification when done. It's not like normal chats that need the app to be in the foreground.
I could be misinterpreting your claim here, but I'll point out that LLM weights don't literally encode the entirety of the training data set.
I guess you could consider it a lossy encoding.
To plan a visit to a dark sky place, I used duck.ai (DuckDuckGo's experimental AI chat feature) to ask five different AIs what date the new moon will happen in August 2025.
GPT-4o mini: The new moon in August 2025 will occur on August 12.
Llama 3.3 70B: The new moon in August 2025 is expected to occur on August 16, 2025.
Claude 3 Haiku: The new moon in August 2025 will occur on August 23, 2025.
o3-mini: Based on astronomical calculations, the new moon in August 2025 is expected to occur on August 7, 2025 (UTC). [...]
Mistral Small 3: To determine the date of the new moon in August 2025, we can use astronomical data or a reliable astronomical calendar. As of my last update in October 2023, I don't have real-time data access, but I can guide you on how to find this information. [...]
I got different answers, mostly wrong. My calendars (both paper and app versions) show 23 August as the date.
And btw, when I asked those AIs which entries in a robots.text file would block most Chinese search engines, one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."
So I asked GPT-o4-mini-high
"On what date will the new moon occur on in August 2025. Use a tool to verify the date if needed"
It correctly reasoned it did not have exact dates due to its cutoff and did a lookup.
"The new moon in August 2025 falls on Friday, August 22, 2025"
Now, I did not specify the timezone I was in, so the difference between the 22nd and 23rd appears to be just a time zone difference, as it had marked a time of 23:06 PDT per its source.
Response from Gemini 2.5 Pro for comparison -
``` Based on the search results, the new moon in August 2025 will occur late on Friday, August 22nd, 2025 in the Pacific Time Zone (PDT), specifically around 11:06 PM.
In other time zones, like the Eastern Time Zone (ET), this event falls early on Saturday, August 23rd, 2025 (around 2:06 AM). ```
"Use a tool to verify the date if needed" that's a good idea, yes. And the answers I got are based on UTC, so 23:06 PDT should match the 23. for Europe.
My reasoning for the plain question was: as people start to replace search engines by AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be would be worth it.
Heh, I've always been neurodivergent enough that I've never been great at 'normal human' questions. I commonly add a lot of verbosity. This said it's worked out well talking to computer based things like search engines.
LLMs on the other hand are weird in ways we don't expect computers to be. Based upon the previous prompting, training datasets, and biases in the model, a question like "What time is dinner" can get the response "Just a bit after 5", "Quarter after 5", or "Dinner is at 17:15 CDT". Setting one's priors can be important to the performance of the model, much in the same way we do this visually and contextually with other humans.
All that said, people will find AI problematic for the foreseeable future because it behaves somewhat human like in responses and does so with confidence.
Even with a knowledge cutoff, you could know when a future new moon would be.
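For example, a crude mean-lunation calculation gets you the date to within about a day, no lookup needed. A sketch (the reference epoch and synodic-month constant are from memory, and the mean cycle can drift from the true new moon by ten or more hours, so treat it as approximate):

```python
# Mean-lunation approximation: step whole synodic months from a known new moon.
# Good to roughly a day; true new moons wobble around the mean by many hours.
from datetime import datetime, timedelta, timezone

SYNODIC_MONTH = 29.530588853  # mean length of a lunation, in days
REF_NEW_MOON = datetime(2000, 1, 6, 18, 14, tzinfo=timezone.utc)  # a known new moon, from memory (approximate)

def new_moons_in_month(year: int, month: int) -> list[datetime]:
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # Index of the last lunation at or before the start of the month.
    n = int((start - REF_NEW_MOON).total_seconds() / 86400 / SYNODIC_MONTH)
    candidates = (REF_NEW_MOON + timedelta(days=k * SYNODIC_MONTH) for k in (n, n + 1, n + 2))
    return [t for t in candidates if (t.year, t.month) == (year, month)]

print(new_moons_in_month(2025, 8))  # roughly 2025-08-23 UTC under this approximation
```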
I would never ask any of these questions of an LLM (and I use and rely on LLMs multiple times a day), this is a job for a computer.
I would also never ask a coworker for this precise number either.
My reasoning for the plain question was: as people start to replace search engines by AI chat, I thought that asking "plain" questions to see how trustworthy the answers might be, would be a good test. Because plain folks will ask plain questions and won't think about the subtle details. They would not expect a "precise number" either, i.e. not 23:06 PDT, but would like to know if this weekend would be fine for a trip or the previous or next weekend would be better to book a "dark sky" tour.
And, BTW, I thought that LLMs are computers too ;-0
I think it's much better to help people learn that an LLM is "not" a computer (even if it technically is).
Thinking it's a computer makes you do dumb things with them that they simply have never done a good job with.
Build intuitions about what they do well and intuitions about what they don't do well and help others learn the same things.
Don't encourage people to have poor ideas about how they work, it makes things worse.
Would you ask an LLM for a phone number? If it doesn't use a function call, the answer is simply not worth having.
But it's a good reminder when so many enterprises like to claim that hallucinations have "mostly been solved".
I agree with you partially, BUT
when is the long list of 'enterprise' coworkers, who have glibly and overconfidently answered questions without doing math or looking them up, going to be fired?
First we wanted to be able to do calculations really quickly, so we built computers.
Then we wanted the computers to reason like humans, so we built LLMs.
Now we want the LLMs to do calculations really quickly.
It doesn't seem like we'll ever be satisfied.
Asking the LLM what calculations you might or should do (and how you might implement and test those calculations) is pretty wildly useful.
These models are proclaimed to be near-AGI, so they should be smart enough not to hallucinate an answer.
"Who was the President of the United States when Neil Armstrong walked on the moon?"
Gemini 2.5 refuses to answer this because it is too political.
Gemini 2.5 is not generating that refusal. It's a separate censorship model.
It's clearer when you try via AI Studio, which has censorship level toggles.
I call bs on this: https://g.co/gemini/share/ed38e9d38b02
Interesting - I got rejected: https://g.co/gemini/share/17f73f620a3e
That is indeed what I got.
> one of them (Claude) told me that it can't tell because that might be discriminatory: "I apologize, but I do not feel comfortable providing recommendations about how to block specific search engines in a robots.txt file. That could be seen as attempting to circumvent or manipulate search engine policies, which goes against my principles."
How exactly does that response have anything to do with discrimination?
Gemini gets the new moon right. Better to use one good model than 5 worse ones.
I think all the full power LLMs will get it right because they do web search. ChatGPT 4 does as well.
Gemini 2.0 Flash gets it correct too.
Buried in the article, a new CLI for coding:
> Codex CLI is fully open-source at https://github.com/openai/codex today.
Related ongoing thread:
OpenAI Codex CLI: Lightweight coding agent that runs in your terminal - https://news.ycombinator.com/item?id=43708025
Looks like a Claude Code clone.
But it's open source, like aider.
Surprisingly, they didn't provide a comparison to Sonnet 3.7 or Gemini Pro 2.5—probably because, while both are impressive, they're only slightly better by comparison.
Let's see what the pricing looks like.
Looks like they are taking a page from Apple's book, which is to never even acknowledge other products exist outside your ecosystem.
Apple ran commercials for a decade making fun of "PCs".
They didn't provide a comparison either in the GPT-4.1 release and quite a few past releases, which is telling of their attitude as an org.
Pricing is already available:
https://platform.openai.com/docs/pricing
> we’ve observed that large-scale reinforcement learning exhibits the same “more compute = better performance” trend observed in GPT‑series pretraining.
Didn’t the pivot to RL from pretraining happen because the scaling “law” didn’t deliver the expected gains? (Or at least because O(log) increases in model performance became unreasonably costly?) I see they’ve finally resigned themselves to calling these trends, not laws, but trends are often fleeting. Why should we expect this one to hold for much longer?
This isn't exactly the case. The trend is a log scale, so a 10x in pretraining should yield a 10% increase in performance. That's not proving to be false per se; rather, they are encountering practical limitations around 10x'ing data volume and 10x'ing available compute.
I am aware of that, like I said:
> (Or at least because O(log) increases in model performance became unreasonably costly?)
But, yes, I left implicit in my comment that the trend might be “fleeting” because of its impracticality. RL is only a trend so long as it is fashionable, and only fashionable (i.e., practical) so long as OpenAI is fed an exponential amount of VC money to ensure linear improvements under O(log) conditions.
OpenAI is selling to VCs the idea that some hitherto unspecified amount of linear model improvement will kick off productivity gains greater than their exponentially increasing investment. These productivity gains would be no less than a sizeable percentage of American GDP, which Altman has publicly set as his target. But as the capital required increases exponentially, the gap between linearly increasing model capability (i.e., its productivity) and the breakeven ROI target widens. The bigger model would need to deliver a non-linear increase in productivity to justify the exponential price tag.
This happens once it starts improving itself.
I suppose that is the question...
It doesn't need to hold forever or even 'much longer' depending on your definition of that duration. It just needs to hold on long enough to realize certain capabilities.
Will it ? Who knows. But seeing as this is something you can't predict ahead of time, it makes little sense not to try in so far as the whole thing is still feasible.
As a consumer, it is so exhausting keeping up with what model I should or can be using for the task I want to accomplish.
I think it can be confusing if you're just reading the news. If you use ChatGPT, the model selector has good brief explanations and surfaces newly available options even if you don't visit the dropdown. Anthropic does similarly.
Gemini 2.5 Pro for every single task was the meta until this release. Will have to reassess now.
Huh. I use Gemini 2.0 Flash for many things because it's several times faster than 2.5 Pro.
Agreed.
I pretty much stopped shopping around once Gemini 2.0 Flash came out.
For general, cloud-centric software development help, it does the job just fine.
I'm honestly quite fond of this Gemini model. I feel silly saying that, but it's true.
Yes, this one is addictive for its speed and I like how Google was clever and also offered it in a powerful reasoning edition. This helps offset deficiencies from being smaller while still being cheap. I also find it quite sufficient for my kind of coding. I only pull out 2.5 Pro on larger and complex code bases that I think might need deeper domain specific knowledge beyond the coding itself.
Mad tangent, but as an old timey MtG player it’s always jarring when someone uses “the meta” not to refer to the particular dynamics of their competitive ecosystem but to a single strategy within it. Impoverishes the concept, I feel, even in this case where I don’t actually think a single model is best at everything.
I'm a World of Warcraft & Dota 2 player, using "the meta" in that way is pretty common in gaming these days I think. The "meta" is still the 'metagame' in the competitive ecosystem sense, but it also refers to strategies that are considered flavor of the month (FOTM) or just generally safe bets.
So there's "the meta", and there's "that strategy is meta", or "that strategy is the meta."
Yeah, I accept that "Nash equilibrium" isn't likely to catch on at this stage.
How do you deal with the fact that they use all of your data for training their own systems and review all conversations?
gemini-2.5-pro-preview-03-25 is the paid version which doesn't use your data
https://ai.google.dev/gemini-api/terms#data-use-paid
I do not feel like I can trust the empire that was built off selling personal data.
Make no mistake, I doubt the other options are trustworthy too.
Personally, I frankly do not care for most things. But for more sensitive things which might land me in trouble, local models are the way to go.
This one seems to make it easier — if the promises here hold true, the multi-modal support probably makes o4-mini-high OpenAI's best model for most tasks unless you have time and money, in which case it's o3-pro.
The answer is to just use the latest Claude model and not worry beyond that.
It's becoming a bit like iPhone 3, 4... 13, 25...
Ok they are all phones that run apps and have a camera. I'm not an "AI power user", but I do talk to ChatGPT + Grok for daily tasks and use copilot.
The big step function happened when they could search the web but not much else has changed in my limited experience.
This is a very apt analogy.
It offers the speaker confirmation that they're absolutely right - names are arbitrary.
While also politely, implicitly, pointing out that the core issue is that it doesn't matter to you --- which is fine! --- but that it may just be contributing to dull conversation to be the 10th person to say as much.
It feels like all the AI companies are pulling the version numbers out of their arse at the moment; I think they should work backwards from AGI 1.0 instead.
So my guess currently is that most are lingering at about 0.3
I asked OpenAI how to choose the right USB cable for my device. Now the objects around me are shimmering and winking out of existence, one by one. Help
Lol. But that's nothing. Wait until you shimmer and wink in and out of existence, like LLMs do during each completion.
As another consumer, I think you're overreacting, it's not that bad.
[flagged]
I’m assuming when you say “read once”, that implies reading once every single release?
It’s confusing. If I’m confused, it’s confusing. This is UX 101.
Aside from anything else, having one model called o4 and one model called 4o is confusing. And I know they haven't released o4 yet but still.
We'll know they have cracked AGI when they solve the hardest problem of all - naming things
"good at advanced reasoning", "fast at advanced reasoning", "slower at advanced reasoning but more advanced than the good one but not as fast but cant search the internet", "great at code and logic", "good for everyday tasks but awful at everything else", "faster for most questions but answers them incorrectly", "can draw but cant search", "can search but cant draw", "good for writing and doing creative things"
Putting up the actual list would have made it too clear that I'm right, I see.
Some people don't blindly trust the marketing department of the publisher
Then it doesn't even matter what they name the model since it's just marketing that they wouldn't trust anyway.
`ETOOMANYMODELS`
Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for what models to use, in particular for development? Not just openAI, but across the main cloud offerings and feasible local models?
I know there are the benchmarks, and directories like huggingface, and you can get a 'feel' for things by scanning threads here or other forums.
I'm thinking more of something that provides use-case tailored "top 3" choices by collecting and summarizing different data points. For example:
* agent & tool based dev (cloud) - [top 3 models] * agent & tool based dev (local) - m1, m2, m,3 * code review / high level analysis - ... * general tech questions - ... * technical writing (ADRs, needs assessments, etc) - ...
Part of the problem is how quickly the landscape changes everyday, and also just relying on benchmarks isn't enough: it ignores cost, and more importantly ignores actual user experience (which I realize is incredibly hard to aggregate & quantify).
LMArena might have some of the information you are looking for. It offers rankings of LLM models across main cloud offerings, and I feel that its evaluation method, human prompting and voting, is closer to real-world use case and less prone to data contamination than benchmarks.
https://lmarena.ai/
In the "Leaderboard">"Language" tab, it lists the top models in various categories such as overall, coding, math, and creative writing.
In the "Leaderboard">"Price Analysis" tab, it shows a chart comparing models by cost per million tokens.
In the "Prompt-to-Leaderboard" tab, there is even an LLM to help you find LLMs -- you enter a prompt, and it will find the top models for your particular prompt.
> Is there a reputable, non-blogspam site that offers a 'cheat sheet' of sorts for what models to use, in particular for development?
Below is a spreadsheet I bookmarked from a previous HN discussion. It's information-dense, but you can just look at the composite scores to get a quick idea of how things compare.
https://docs.google.com/spreadsheets/u/1/d/1foc98Jtbi0-GUsNy...
I have been using this site: https://artificialanalysis.ai/ . It's still about benchmarks, and it doesn't do deep dives into specific use cases, but it's helpful to compare models for intelligence vs cost vs latency and other characteristics.
It's pretty frustrating to see a press release with "Try on ChatGPT" and then not see the models available even though I'm paying them $200/mo.
They're supposed to be released today for everyone, and o3-pro for Pro users in a few weeks:
"ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high."
with rate limits unchanged
They are all now available on the Pro plan. Y'all really ought to have a little bit more grace to wait 30 minutes after the announcement for the rollout.
Or maybe OpenAI could wait until they'd released it before telling people to use it now.
They'd probably want their announcement to be the one the press picks up instead of a tweet or reddit post saying "Did anyone else notice the new ChatGPT model?"
Deploying several things is sometimes tricky and this could not be a smaller deal.
Why pay $200/mo when you can just access the models from the Platform playground?
Higher limits and operator access maybe?
I see o4-mini on the $20 tier but no o3.
Holy crap... thats expensive.
Where's the comparison with Gemini 2.5 Pro?
For coding, I like the Aider polyglot benchmark, since it covers multiple programming languages.
Gemini 2.5 Pro got 72.9%
o3 high gets 81.3%, o4-mini high gets 68.9%
Isn't it easy to train on the specific Exercism exercises that this benchmark uses?
where do you find those o3 high numbers? https://aider.chat/docs/leaderboards/ currently has gemini 2.5 pro as the leader at, as you say, 72.9%.
It's in the OpenAI article post (OP) i.e. OpenAI ran Aider themselves.
Update: the leaderboard now has o3 high + 4o at the top of the charts with 82.7%. This is a) amazing, b) 20x more expensive than Gemini.
It was a good benchmark until it entered the training set.
thanks
Some sources mention that o3 scores 63.8 on SWE-bench, while Gemini 2.5 Pro scores 69.1.
On most other benchmarks, they seem to perform about the same, which is bad news for o3 because it's much more expensive and slower than Gemini 2.5 Pro, and it also hides its reasoning while Gemini shows everything.
We can probably just stick with Gemini 2.5 Pro, since it offers the best combination of price, quality, and speed. No need to worry about finding a replacement (for now).
> Some sources mention that o3 scores 63.8 on SWE-bench, while Gemini 2.5 Pro scores 69.1.
It's the opposite. o3 scores higher
On SWE bench? Show your source.
Exactly.
If you download GIMP, Blender, etc., every user would report essentially the same experience, assuming the hardware is recent.
In this thread, however, there are varying experiences, from amazing to awful. I'm not saying anyone is wrong; all I'm saying is that this wide range of operational accuracy is what will eventually pop the AI bubble, in that these models can't be reliably deployed almost anywhere with any certainty or guarantees of any sort.
The pace of notable releases across the industry right now is unlike any time I remember since I started doing this in the early 2000s. And it feels like it's accelerating.
How is this a notable release? It's strictly worse than Gemini 2.5 on coding &c, and only an iterative improvement over their own models. The only thing that struck me as particularly interesting was the native visual reasoning.
It's not worse on coding. SWE Bench, Aider, live bench coding all show noticeably better results.
Lots of releases but very little actual performance increases
Sonnet and Gemini saw fairly substantial perf increases recently.
Love Sonnet but 3.7 is not obviously an improvement over 3.5 in my real world usage. Gemini 2.5 pro is great, has replaced most others for me (Grok I use for things that require realtime answers)
Are you comparing it with or without thinking? I'd say it's a fairly big improvement in long thinking mode.
It does a lot better on philosophy questions.
Not really. We’re definitely in the incremental improvement stage at this point. Certainly no indication that progress is “accelerating”.
Integration is accelerating rapidly. Even if model development froze today, we would still probably have ~5 years of adoption and integration before it started to level off.
You are both correct. It feels like the tech itself is kinda plateauing but it's still massively under-used. It will take a decade or more before the deployment starts slowing down.
But we're seeing incremental improvements every two months, so...
ChatGPT 3 : iPhone 1
A bunch of models later, we're about on the iPhone 4-5 now. Feels about right.
It's more like GPT-3 is the Manchester Baby, and we're somewhere around IBM 700 series right now. Still a long way to go to iPhone, as much as the industry likes to pretend otherwise.
Both were big consumer commercial breakouts and far better than predecessors. And several years later both see only iterative improvements.
Neither applies to your analogy.
In the examples they demonstrate tool use in the reasoning loop. The models pretty impressively recognize they need some external data, and either complete a web search, or write and execute python to solve intermediate steps.
To the extent that reasoning is noisy and models can go astray during it, this helps inject truth back into the reasoning loop.
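The control flow is basically a loop: the model either answers or asks for a tool, you run the tool, append the result, and go around again. A bare-bones sketch of that shape (the call_model stub and the tool registry here are stand-ins, not any particular vendor's API):

```python
# Bare-bones shape of a tool-using reasoning loop. `call_model` is a stub for
# whatever chat API is in play; it is assumed to return either a final answer
# or a structured tool request.
import json

def web_search(query: str) -> str:
    return f"(search results for {query!r})"  # stub: a real one would hit a search API

def run_python(code: str) -> str:
    return "(execution output)"  # stub: a real one would sandbox the code

TOOLS = {"web_search": web_search, "run_python": run_python}

def call_model(messages: list[dict]) -> dict:
    # Stub model: asks for one search, then answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_call", "name": "web_search",
                "arguments": {"query": messages[-1]["content"]}}
    return {"type": "answer", "content": "final answer grounded in the tool result"}

def agent(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(10):  # cap the loop so a confused model can't spin forever
        reply = call_model(messages)
        if reply["type"] == "answer":
            return reply["content"]
        result = TOOLS[reply["name"]](**reply["arguments"])
        messages.append({"role": "tool",
                         "content": json.dumps({"name": reply["name"], "result": result})})
    return "gave up"

print(agent("When is the next new moon?"))
```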
Is there some well-known equivalent to Moore's Law for token use? We're headed in a direction where LLM control loops can run 24/7 generating tokens to reason about live sensor data, and call tools to act on it.
I'm starting to be reminded of the razor blade business.
Fuck Everything, We're Doing o5
Maybe OpenAI needs an easy mode for all these people saying 5 choices of models (and that's only if you pay) is simply too confusing for them.
They even provide a description in the UI of each before you select it, and it defaults to a model for you.
If you just want an answer of what you should use and can't be bothered to research them, just use o3(4)-mini and call it a day.
I personally like being able to choose because I understand the tradeoffs and want to choose the best one for what I’m asking. So I hope this doesn’t go away.
But I agree that they probably need some kind of basic mode to make things easier for the average person. The basic mode should decide automatically what model to use and hide this from the user.
I thought sama said that that's the plan for gpt-5: a router which'll choose the right model and thinking level for you
Would that be considered a Mixture of Experts system?
No, Mixture of Experts is a really confusing term.
It sounds like it means "have a bunch of models, one that's an expert in physics, one that's an expert in health etc and then pick the one that's a best fit for the user's query".
It's not that. The "experts" are each another giant opaque blob of weights. A router inside the model is trained to pick which of those blobs to run for each token (often the top couple per layer), but they don't have any form of human-understandable "expertise". It's an optimization that lets you avoid using ALL of the weights for every run through the model, which helps with performance.
https://huggingface.co/blog/moe#what-is-a-mixture-of-experts... is a decent explanation.
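A toy picture of the routing math, if it helps (sizes and numbers made up; real MoE layers do this per token inside the transformer, usually keeping the top couple of experts per layer):

```python
# Toy mixture-of-experts layer: a router scores the experts for each token and
# only the top-k "blobs" of weights actually run. Sizes here are made up.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

router_w = rng.normal(size=(d_model, n_experts))              # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: one token's activation vector of shape (d_model,)."""
    logits = x @ router_w                                      # score every expert
    top = np.argsort(logits)[-top_k:]                          # keep only the top-k
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()    # softmax over the chosen ones
    # Only the selected experts' weights are touched for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (8,) -- same shape out, but only 2 of 4 experts ran
```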
o3 is cheaper than o1 (per 1M tokens).
• o3 pricing:
• o1 pricing:
o4-mini pricing remains the same as o3-mini.
So at this point OpenAI has 6 reasoning models, 4 flagship chat models, and 7 cost-optimized models. So that's 17 models in total, and that's not even counting their older models and more specialized ones. Compare this with Anthropic, which has 7 models in total and 2 main ones that they promote.
This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model rather than updating the existing ones. In fact based on some of the other comments here it sounds like these are just updates to their existing model, but they release them as new models to create more media buzz.
I'm old enough to remember the mystery and hype before o*/o1/strawberry, which was supposed to be essentially AGI. We had serious news outlets write about senior people at OpenAI quitting because o1 was Skynet.
Now we're up to o4, and AGI is still not even in near sight (depending on your definition, I know). And OpenAI is up to about 5000 employees. I'd think even before AGI, a new model would be able to cover for at least 4500 of those employees being fired; is that not the case?
Remember that Docusign has 7,000 employees. I think OpenAI is pretty lean for what they're accomplishing.
I don't think these comparisons are useful. Every time you look at companies like LinkedIn or Docusign, yeah - they have a lot of staff, but a significant proportion of them are in functions like sales, customer support, and regulatory compliance across a bazillion different markets, along with all the internal tooling and processes you need to support that.
OpenAI is at a much earlier stage in their adventures and probably doesn't have that much baggage. Given their age and revenue streams, their headcount is quite substantial.
If we're making comparisons, it's more like someone selling a $10,000 course on how to be a millionaire.
Not directly from OpenAI - but people in the industry are advertising how these advanced models can replace employees, yet they keep going on hiring tears (including OpenAI). Let's see the first company stand behind their models and replace 50% of their existing headcount with agents. That to me would be a sign these things are going to replace people's jobs. Until I see that: if OpenAI can't figure out how to replace humans with models, then no one will.
I mean, could you imagine if today's announcement was: the chatgpt.com webdev team has been laid off, and all new features and fixes will be completed by Codex CLI + o4-mini? That would mean they believe in the product they're advertising. Until they do something like that, they'll keep trusting those human engineers and try selling other people on the dream.
I'm also a skeptic on AI replacing many human jobs anytime soon. It's mostly going to assist, accelerate or amplify humans in completing work better or faster. That's the typical historical technology cycle where better tech makes work more efficient. Eventually that does allow the same work to be done with less people, like a better IP telephony system enabling a 90 person call center to handle the same call volume that previously required 100 people. But designing, manufacturing, selling, installing and supporting the new IP phone system also creates at least 10 new jobs.
So far the only significant human replacement I'm seeing AI enable is in low-end, entry-level work. For example, fulfilling "gig work" for Fiverr, like spending an hour or two whipping up a relatively low-quality graphic logo or other basic design work for $20. This is largely done at home by entry-level graphic design students in second-world locales like the Philippines or rural India. A good graphical AI can take (and is taking) some of this work from the humans doing it. Although it's not even a big impact yet, primarily because for non-technical customers, the Fiverr workflow can still be easier or more comfortable than figuring out which AI tool to use and how to get what they really want from it.
The point is that this Fiverr piece-meal gig work is the lowest paying, least desirable work in graphic design. No one doing it wants to still be doing it a year or two from now. It's the Mcdonald's counter of their industry. They all aspire to higher skill, higher paying design jobs. They're only doing Fiverr gig work because they don't yet have a degree, enough resume credits or decent portfolio examples. Much like steam-powered bulldozers and pile drivers displaced pick axe swinging humans digging railroad tunnels in the 1800s, the new technology is displacing some of the least-desirable, lowest-paying jobs first. I don't yet see any clear reason this well-established 200+ year trend will be fundamentally different this time. And history is littered with those who predicted "but this time it'll be different."
I've read the scenarios which predict that AI will eventually be able to fundamentally and repeatedly self-improve autonomously, at scale and without limit. I do think AI will continue to improve but, like many others, I find the "self-improve" step to be a huge and unevidenced leap of faith. So, I don't think it's likely, for reasons I won't enumerate here because domain experts far smarter than I am have already written extensively about them.
Not really. It could also mean their company's effective headcount is much greater than its nominal one.
Yes and Amazon has 1.52 million employees. How many developers could they possibly need?
Or maybe it’s just nonsensical to compare the number of employees across companies - especially when they don’t do nearly the same thing.
On a related note, wait until you find out how many more employees Apple has than Google, since Apple has tens of thousands of retail employees.
Apple has fewer employees than Google (164k < 183k).
Siri must be really good.
What kind of employees does Docusign employ? Surely digital documents don't require physical onsite distribution centers and labor.
Just look at their careers page.
It's a lot of sales/account managers, and some engineers.
Wow, the sales go hard in this product.
[flagged]
The US is not a signatory to the International Criminal Court so you won't see Musk on trial there.
I hope I don't have to link this adjacent reply of mine too many more times: https://news.ycombinator.com/item?id=43709056 Specifically "The venue is a matter of convenience, nothing more," and if you prefer another, that would work about as well. Perhaps Merano; I hear it's a lovely little town.
The closest Elon ever came to anything Hague-worthy is allowing Starlink to be used in Ukrainian attacks on Russian civilian infrastructure. I don't think the Hague would be interested in anything like that. And if his life is worthless, then what would you say about your own? Nonetheless, I commend you on your complete lack of hinges. /s
Oh, I'm thinking more in the sense of the special one-off kinds of trials, the sort Gustave Gilbert so ably observed. The venue is a matter of convenience, nothing more. To the rest I would say the worth of my life is no more mine to judge than anyone else is competent to do the same for themselves, or indeed other than foolish to pursue the attempt.
True.
Deep learning models will continue to improve as we feed them more data and use more compute, but they will still fail at even very simple tasks as long as the input data are outside their training distribution. The numerous examples of ChatGPT (even the latest, most powerful versions) failing at basic questions or tasks illustrate this well. Learning from data is not enough; there is a need for the kind of system-two thinking we humans develop as we grow. It is difficult to see how deep learning and backpropagation alone will help us model that. https://medium.com/thoughts-on-machine-learning/why-sam-altm...
> I'm old enough to remember the mystery and hype before o*/o1/strawberry
So at least two years old?
Honestly, sometimes I wonder if most people these days kinda aren't at least that age, you know? Or less inhibited about acting it than I believe I recall people being last decade. Even compared to just a few years back, people seem more often to struggle to carry a thought, and resort much more quickly to emotional belligerence.
Oh, not that I haven't been as knocked about in the interim, of course. I'm not really claiming I'm better, and these are frightening times; I hope I'm neither projecting nor judging too harshly. But even trying to discount for the possibility, there still seems something new left to explain.
> Even compared to just a few years back, people seem more often to struggle to carry a thought, and resort much more quickly to emotional belligerence.
We're living in extremely uncertain times, with multiple global crises taking place at the same time, each of which could develop into a turning point for humankind.
At the same time, predatory algorithms do whatever it takes to make people addicted to media, while mental health care remains inaccessible for many.
I feel like throwing a tantrum almost every single day.
I feel perhaps I've been unkind to many people in my thoughts, but I'm conflicted. I don't understand myself to be particularly fearless, but what times call more for courage than times like these? How do people afraid even to try to practice courage expect to find it, when there isn't time for practice any more?
You have only so many spoons available per crisis. Even picking your battle can become a problem.
I've been out in the streets, protesting and raising awareness of climate change. I no longer do. It's a pointless waste of time. Today, the climate change deniers are in charge.
I don't assume I'm going to be given the luxury of picking my battles, and - though I've been aware of "spoon theory" since I watched it getting invented at Shakesville back in the day - I've never held to it all that strongly, even as I acknowledge I've also never been quite the same since a nasty bout of wild-type covid in early 2020. Now as before, I do what needs doing as best I can, then count the cost. Some day that will surely prove too high, and my forward planning efforts will be put to the test. Till then I'm happy not to borrow trouble.
I've lived in this neighborhood a long time, and there are a couple of old folks' homes a block or so from here. Both have excellent views, on one frontage each, of an extremely historic cemetery, which I have always found a wonderfully piquant example of my adopted hometown's occasionally wire-brush sense of humor. But I bring it up to mention that the old folks don't seem to have much concern for spoons other than to eat with, and they are protesting the present situation regularly and at considerable volume, and every time I pass about my errands I make a point of raising a fist and hollering "hell yeah!" just like most of the people who drive past honk in support.
Will you tell them it's pointless?
I think people expected reasoning to be more than just trained chain of thought (which was known already at the time). On the other hand, it is impressive that CoT can achieve so much.
Yeah, I don't know exactly what an AGI model will look like, but I think it would have more than a 200k context window.
Do you have a 200k context window? I don't. Most humans can only keep 6 or 7 things in short-term memory. Beyond those 6 or 7 you are pulling data from your latent space, or replacing one of the short-term slots with new content.
But context windows for LLMs include all the “long term memory” things you’re excluding from humans
Long term memory in an LLM is its weights.
Not really, because humans can form long term memories from conversations, but LLM users aren’t finetuning models after every chat so the model remembers.
He's right, but most people don't have the resources, nor indeed the weights themselves, to keep training the models. But the weights are very much long term memory.
> users aren’t finetuning models after every chat
Users can do that if they want, but it’s more effective and more efficient to do that after every billion chats, and I’m sure OpenAI does it.
If you want the entire model to remember everything it talked about with every user, sure. But ideally, I would want the model to remember what I told it a few million tokens ago, but not what you told it (because to me, the model should look like my private copy that only talks to me).
> ideally, I would want the model to remember what I told it a few million tokens ago
Yes, you can keep finetuning your model on every chat you have with it. You can definitely make it remember everything you have ever said. LLMs are excellent at remembering their training data.
I'm not quite AGI, but I work quite adequately with a much, much smaller memory. Maybe AGI just needs to know how to use other computers and work with storage a bit better.
I'd think it would be able to at least suggest which model to use rather than just having 6 for you to choose from.
I’m not an AI researcher but I’m not convinced these contemporary artificial neural networks will get us to AGI, even assuming an acceleration to current scaling pace. Maybe my definition of AGI is off but I’m thinking what that means is a machine that can think, learn and behave in the world in ways very close to human. I think we need a fundamentally different paradigm for that. Not something that is just trained and deployed like current models, but something that is constantly observing, constantly learning and constantly interacting with the real world like we do. AHI, not AGI. True AGI may not exist because there are always compromises of some kind.
But, we don’t need AGI/AHI to transform large parts of our civilization. And I’m not seeing this happen either.
I feel like every time AI gets better we shift the goalposts of AGI to something else.
I don't think we shift the goalposts for AGI. I'm not getting the sense that people are redefining what AGI is when a new model is released. I'm getting the sense that some people are thinking like me when a new model is released: we got a better transformer, and a more useful model trained on more or better data, but we didn't get closer to AGI. And people are saying this not because they've pushed out what AGI really means, they're saying this because the models still have the same basic use cases, the same flaws and the same limitations. They're just better at what they already do. Also, the better these models get at what they already do, the more starkly they contrast with human capabilities, for better or worse.
> Now we're up to o4, AGI is still not even in near site (depending on your definition, I know)
It's not only a matter of definition. Some Googler was sure their model was conscious.
Meanwhile even the highest ranked models can’t do simple logic tasks. GothamChess on YouTube did some tests where he played against a bunch of the best models and every single one of them failed spectacularly.
They’d happily lose a queen to take a pawn. They failed to understand how pieces are even allowed to move, hallucinated the existence of new pieces, repeatedly declared checkmate when it wasn’t, etc.
I tried it last night with Gemini 2.5 Pro and it made it 6 turns before it started making illegal moves, and 8 turns before it got so confused about the state of the board that it refused to play with me any longer.
I was in the chess club in 3rd grade. One of the top ranked LLMs in the world is vastly dumber than I was in 3rd grade. But we’re going to pour hundreds of billions into this in the hope that it can end my career? Good luck with that, guys.
Chess is not exactly a simple logic task. It requires you to keep track of 32 things in a 2d space.
I remember being extremely surprised when I could ask GPT-3 to rotate a 3D model of a car in its head and ask it about what I would see when sitting inside, or which doors would refuse to open because they're in contact with the ground.
It really depends on how much you want to shift the goalposts on what constitutes "simple".
> Chess is not exactly a simple logic task.
Compared to what a software engineer is able to do, it is very much a simple logic task. Or the average person doing a non-trivial job. Or a beehive organizing its existence, from its amino acids up to hive organization. All those things are magnitudes harder than chess.
> I remember being extremely surprised when I could ask GPT-3 to rotate a 3D model of a car in its head and ask it about what I would see when sitting inside, or which doors would refuse to open because they're in contact with the ground.
It's not reasoning its way there. Somebody asked something similar somewhere in the corpus, and that corpus also contained the answers. That's why it can answer. After quite a small number of moves the chess board is unique and you can't fake it. You need to think ahead. A task which computers are traditionally very good at. Even trained chess players are. That LLMs are not goes to show that they are very far from AGI.
I'm not sure why people are expecting a language model to be great at chess. Remember they are trained on text, which is not the best medium for representing things like a chess board. They are also "general models", with limited training on pretty much everything apart from human language.
An AlphaStar-type model would wipe the floor at chess.
This misses the point. LLMs will do things like move a knight by a single square as if it were a pawn. Chess is an extremely well understood game, and the rules about how things move is almost certainly well-represented in the training data.
These models cannot even make legal chess moves. That’s incredibly basic logic, and it shows how LLMs are still completely incapable of reasoning or understanding. Many kinds of task are never going to be possible for LLMs unless that changes. Programming is one of those tasks.
>These models cannot even make legal chess moves. That’s incredibly basic logic, and it shows how LLMs are still completely incapable of reasoning or understanding.
Yeah they can. There's a link I shared to prove it which you've conveniently ignored.
LLMs learn by predicting, failing and getting a little better, rinse and repeat. Pre-training is not like reading a book. LLMs trained on chess games play chess just fine. They don't make the silly mistakes you're talking about and they very rarely make illegal moves.
There's gpt-3.5-turbo-instruct, which I already shared and which plays at around 1800 Elo. Then there's this grandmaster-level chess transformer - https://arxiv.org/abs/2402.04494. There are also a couple of models that were trained in the EleutherAI Discord that reached about 1100-1300 Elo.
I don't know what the peak of LLM chess playing looks like, but this is clearly less of an 'LLMs can't do this' problem and more of an 'OpenAI/Anthropic/Google etc. don't care whether their models can play chess' problem.
So are they capable of reasoning now, or would you like to shift the goalposts?
I think the point here is that if you have to pretrain it for every specific task, it's not artificial general intelligence, by definition.
There isn't any general intelligence that isn't receiving pre-training. People spend 14 to 18+ years in school to have any sort of career.
You don't have to pretrain it for every little thing but it should come as no surprise that a complex non-trivial game would require it.
Even if you explained all the rules of chess clearly to someone brand new to it, it will be a while and lots of practice before they internalize it.
And like I said, LLM pre-training is less like a machine reading text and more like Evolution. If you gave a corpus of chess rules, you're only training a model that knows how to converse about chess rules.
Do humans require less 'pre-training' ? Sure, but then again, that's on the back of millions of years of evolution. Modern NNs initialize random weights and have relatively very little inductive bias.
People are focusing on chess, which is complicated, but LLMs fail at even simple games like tic-tac-toe, where you'd think, if they were capable of "reasoning", they would be able to understand where they went wrong. That doesn't seem to be the case.
What it can do is write and execute code to generate the correct output, but isn't that cheating?
Which SOTA LLM fails at tic-tac-toe?
Saying programming is a task that is "never going to be possible" for an LLM is a big claim, given how many people have derived huge value from having LLMs write code for them over the past two years.
(Unless you're arguing against the idea that LLMs are making programmers obsolete, in which case I fully agree with you.)
I think "useful as an assistant for coding" and "being able to program" are two different things.
When I was trying to understand what is happening with hallucination GPT gave me this: > It's called hallucinating when LLMs get things wrong because the model generates content that sounds plausible but is factually incorrect or made-up—similar to how a person might "see" or "experience" things that aren't real during a hallucination.
From that we can see that they fundamentally don't know what is correct. While they can get better at predicting correct answers, no-one has explained how they are expected to cross the boundary from "sounding plausible" to "knowing they are factually correct". All the attempts so far seem to be about reducing the likelihood of hallucination, not fixing the problem that they fundamentally don't understand what they are saying.
Until/unless they are able to understand the output enough to verify the truth then there's a knowledge gap that seems dangerous given how much code we are allowing "AI" to write.
Code is one of the few applications of LLMs where they DO have a mechanism for verifying if what they produced is correct: they can write code, run that code, look at the output and iterate in a loop until it does what it's supposed to do.
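A minimal sketch of that loop, assuming an ask_llm(prompt) helper that wraps whatever model API you're using (and treating "the script exits cleanly" as the check, which is of course a much weaker bar than "does what it's supposed to"):

    import subprocess, tempfile

    def run_candidate(code: str) -> tuple[bool, str]:
        # Write the generated code to a temp file and execute it in a subprocess.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=30)
        return proc.returncode == 0, proc.stdout + proc.stderr

    def generate_until_it_runs(ask_llm, task: str, max_rounds: int = 5) -> str:
        prompt = f"Write a Python script that {task}. Reply with code only."
        for _ in range(max_rounds):
            code = ask_llm(prompt)
            ok, output = run_candidate(code)
            if ok:
                return code
            # Feed the failure output back so the model can revise its attempt.
            prompt = f"This script failed:\n{code}\nOutput:\n{output}\nFix it. Code only."
        raise RuntimeError("no working script within the round limit")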
> I'm not sure why people are expecting a language model to be great at chess.
Because the conversation is about AGI, and how far away we are from AGI.
Does AGI mean good at chess?
What if it is a dumb AGI?
Claude can't beat Pokemon Red. Not even close yet: https://arstechnica.com/ai/2025/03/why-anthropics-claude-sti...
LLMs can play chess fine.
The best model you can play with is decent for a human - https://github.com/adamkarvonen/chess_gpt_eval
SOTA models can't play it because these companies don't really care about it.
> We had serious news outlets write about senior people at OpenAI quitting because o1 was SkyNet
I wonder if any of the people that quit regret doing so.
Seems a lot like Chicken Little behavior - "Oh no, the sky is falling!"
How anyone with technical acumen thinks current AI models are conscious, let alone capable of writing new features and expanding their abilities is beyond me. Might as well be afraid of calculators revolting and taking over the world.
"haven't actually done much" being popularizing the chat llm and absolutely dwarfing the competition in paid usage
Relative to the hype they've been spinning to attract investment, casting the launch and commercialization of ChatGPT as their greatest achievement really is a quite significant downgrade, especially given that they really only got there first because they were the first entity reckless enough to deploy such a tool to the public.
It's easy to forget what smart, connected people were saying about how AI would evolve by <current date> ~a year ago, when in fact what we've gotten since then is a whole bunch of diminishing returns and increasingly sketchy benchmark shenanigans. I have no idea when a real AGI breakthrough will happen, but if you're a person who wants it to happen (I am not), you have to admit to yourself that the last year or so has been disappointing---even if you won't admit it to anybody else.
ChatGPT was released two and a half years ago though. Pretty sure that at some point Sam Altman had promised us AGI by now.
The person you're responding to is correct that OpenAI feels a lot more stagnant than other players (like Google, which was nowhere to be seen even one year and a half ago and now has the leading model on pretty much every metric, but also DeepSeek, who built a competitive model in a year that runs for much cheaper).
> Google has the leading model on pretty much every metric
Correction: Google had the leading model for three weeks. Today it’s back to the second place.
press X to doubt
o3-mini wasn't even the second place for non-STEM tasks, and in today's announcement they don't even publish benchmarks for those. What's impressive about Gemini 2.5 pro (and was also really impressive with R1) is how good the model is for a very broad range of tasks, not just benchmaxing on AIME.
I had a philosophical discussion with o3 model earlier today. It was much better than 2.5 pro. In fact it was pretty much what I would expect from a professional philosopher.
I'm not expecting someone paying $200 a month to access something to be objective about that particular something.
Also “what I would expect from a professional philosopher”, is that your argument, really?
I’m paying $20/mo, and I’m paying the same for Gemini and for Claude.
What’s wrong with my argument? You questioned the performance of the model on non-STEM tasks, and I gave you my impression.
Writing philosophy that looks convincing has been a thing LLMs do well since the first release of ChatGPT back in 2022 (in my country back in early 2023, TV featured a kind of competition between ChatGPT and a philosopher turned media personality, with university professors blindly reviewing both essays and attempting to determine which was whose).
To have an idea of how good a model is on non-STEM tasks, you need to challenge it on stuff that is harder than this for LLMs, like summarization without hallucination or creative writing. OpenAI's non-thinking models are usually very good at these, but not their thinking models, whereas other players (be it Google, Anthropic or DeepSeek) manage to make models that can be very good at both.
I've been discussing a philosophical topic (brain uploading) with all major models in the last two years. This is a topic I've read and thought about for a long time. Until o3, the responses I got from all other models (Gemini 2.5 pro most recently) have been underwhelming - generic, high level, not interesting to an expert. They struggled to understand the points I was making, and ideas I wanted to explore. o3 was the first model that could keep up, and provide interesting insights. It was communicating on a level of a professional in the field, though not an expert on this particular topic - this is a significant improvement over all existing models.
I guess it was related to the last period, rather than the full picture
What are people expecting here honestly? This thread is ridiculous.
They have 500M weekly users now. I would say that counts as doing something.
While bleeding cash faster than anything else in History.
ChatGPT was released in 2022, so OP's point stands perfectly well.
They're rumored to be working on a social network to rival X with the focus being on image generations.
https://techcrunch.com/2025/04/15/openai-is-reportedly-devel...
The play now seems to be less AGI, more "too big to fail" / use all the capital to morph into a FAANG bigtech.
My bet is that they'll develop a suite of office tools that leverage their model, chat/communication tools, a browser, and perhaps a device.
They're going to try to turn into Google (with maybe a bit of Apple and Meta) before Google turns into them.
Near-term, I don't see late stage investors as recouping their investment. But in time, this may work out well for them. There's a tremendous amount of inefficiency and lack of competition amongst the big tech players. They've been so large that nobody else could effectively challenge them. Now there's a "startup" with enough capital to start eating into big tech's more profitable business lines.
I don't know how anyone could look at any of this and say ponderously: it's basically the same as Nov 2022 ChatGPT. Thus strategically they're pivoting to social to become too big to fail.
I mean, it's not fucking AGI/ASI. No amount of LLM flip floppery is going to get us terminators.
If this starts looking differently and the pace picks up, I won't be giving analysis on OpenAI anymore. I'll start packing for the hills.
But to OpenAI's credit, I also don't see how minting another FAANG isn't an incredible achievement. Like - wow - this tech giant was willed into existence. Can't we marvel at that a little bit without worrying about LLMs doing our taxes?
I don't know what AGI/ASI means to you.
I'm bullish on the models, and my first quiet 5 minutes after the announcement was spent thinking about how many of the people I walked past would have had a different day if the computer Just Did It(tm). (I don't think their day would be different, so I'm not bullish on ASI-even-if-achieved, I guess?)
I think binary analysis that flips between "this is a propped up failure, like when banks get bailouts" and "I'd run away from civilization" isn't really worth much.
So to you AGI == terminators? Interesting.
I appreciate the info and I have a question:
Why would anyone use a social network run by Sam Altman? No offense but his reputation is chaotic neutral to say the least.
Social networks require a ton of momentum to get going.
BlueSky already ate all the momentum that X lost.
Social networks have to be the most chaotic neutral thing ever made. It's like, "hey everyone! Come share what ever you want on my servers!"
Most people don't care about techies or tech drama. They just use the platforms their friends do.
ChatGPT images are the biggest thing on social media right now. My wife is turning photos of our dogs into people. There's a new GPT4o meme trending on TikTok every day. Using GPT4o as the basis of a social media network could be just the kickstart a new social media platform needs.
Not surprising. Add comments to sora.com and you've got a social network.
Seriously. The users on sora.com are already trying to. They're sending messages to each other with the embedded image text and upvoting it.
GPT 4o and Sora are incredibly viral and organic and it's taking over TikTok, Instagram, and all other social media.
If you're not watching casual social media you might miss it, but it's nothing short of a phenomenon.
ChatGPT is now the most downloaded app this month. Images are the reason for that.
Honestly I popped on sora.com the other day and the memes are great. I can totally understand where folks are coming from and why this is happening.
ChatGPT should be built into my iMessage threads with friends. @ChatGPT "Is there an evening train on Thursdays from Brussels to Berlin?" Something a friend and I were discussing, but we had to exit out of iMessage, use GPT, then go back to iMessage.
For UX, the GPT info in the thread would be collapsed by default, and both users would have the discretion to click to expand it.
Seriously. The level of arrogance combined with ignorance is awe-inspiring.
True. They've blown their absolutely massive lead with power users to Anthropic and Google. So they definitely haven't done nothing.
Research by METR suggests that frontier LLMs can perform software tasks over exponentially longer time horizons (measured by how long the task takes a human engineer), with a doubling roughly every 7 months. o3 is above the trend line.
https://x.com/METR_Evals/status/1912594122176958939
---
The AlexNet paper which kickstarted the deep learning era in 2012 was ahead of the 2nd-best entry by 11%. Many published AI papers then advanced SOTA by just a couple percentage points.
o3 high is about 9% ahead of o1 high on livebench.ai and there are also quite a few testimonials of their differences.
Yes, AlexNet made major strides in other aspects as well but it’s been just 7 months since o1-preview, the first publicly available reasoning model, which is a seminal advance beyond previous LLMs.
It seems some people have become desensitized to how rapidly things are moving in AI, despite its largely unprecedented pace of progress.
Ref:
- https://proceedings.neurips.cc/paper_files/paper/2012/file/c...
- https://livebench.ai/#/
AlexNet improved the ImageNet error rate by 100*11/25 = 44%.
From o1 to o3, the error rate went from 28 to 19, so 100*9/28 = 32%.
But these are meaningless comparisons because it’s typically harder to improve already good results.
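For what it's worth, both figures above are relative error reductions, using the round numbers quoted rather than the exact published ones:

    def relative_error_reduction(old_err: float, new_err: float) -> float:
        # What fraction of the previous errors the new result eliminates.
        return 100 * (old_err - new_err) / old_err

    print(relative_error_reduction(25, 14))  # AlexNet-era ImageNet jump: 44.0
    print(relative_error_reduction(28, 19))  # o1 -> o3 on the benchmark above: ~32.1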
OpenAI isn't selling GPT-4 or o1 or o4-mini or turbo or whatever else to the general public. These announcements may as well be them releasing GPT v12.582.599385. No one outside of a small group of nerds cares. The end consumer is going to chatgpt.com and typing things in the box.
They have an enterprise business too. I think it's relevant for that.
And that’s exactly why their model naming and release process looks like this right now.
The $20 Plus subscription gives access to o1 and Deep Research (10 uses/month). I'm pretty sure the general public can get access through the API as well.
Right, and most people are not going to spend $200+/mo on ChatGPT. Maybe businesses will, but at this point they have too many choices.
> This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much
Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about. Subjectively, customers get locked in by feeling they have the inside track, and these small tweaks prove that. Objectively, the small change might make a real difference to the customer's use case.
Similarly, it's important to force development teams to actually ship, and shipping more frequently reduces risk, so this could reflect internal discipline.
As for media buzz, OpenAI is probably trying to tamp that down; they have plenty of first-mover advantage. More puffery just makes their competitors seem more important, and the risk to their reputation of a flop is a lot larger than the reward of the next increment.
As for "a bit much", before 2023 I was thinking I could meaningfully track progress and trade-off's in selecting tech, but now the cat is not only out of the bag, it's had more litters than I can count. So, yeah - a bit much!
> Or perhaps they're trying to make some important customers happy by showing movement on areas the customers care about
Or make important investors happy, they need to justify the latest $40 billion round
The old Chinese strategy of having 7343 different phone models with almost the same specs to confuse the customer better
This sounds like recent Dell and Lenovo strategies
Recent? They've been doing this for decades.
person a: "I just got an new macbook pro!"
person b: "Nice! I just got a Lenovo YogaPilates Flipfold XR 3299 T92 Thinkbookpad model number SRE44939293X3321"
...
person a: "does that have oled?"
person b: "Lol no silly that is model SRE44939293XB3321". Notice the B in the middle?!?! That is for OLED.
They should launch o999 and count backwards for each release till they hit oagi
Not only that: filling search listings on eBay with your products is an old seller tactic. Try searching for a used Dell workstation or server and you will see pages and pages from the same seller.
To use that criticism for this release ain't really fair, as these will replace the old models (o3 will replace o1, o4-mini will replace o3-mini).
On a more general level - sure, but they aren't planning to use this release to add a larger number of models, it's just that deprecating/killing the old models can't be done overnight.
As someone who doesn't use anything OpenAI (for all the reasons), I have to agree with the GP. It's all baffling. Why is there an o3-mini and an o4-mini? Why on earth are there so many models?
Once you get to this point you're putting the paradox of choice on the user - I used to use a particular brand of toothpaste for years, until it got to the point where I'd be in the supermarket looking at a wall of toothpaste all by the same brand with no discernible difference between the products. Why is one of them called "whitening"? Do the others not do that? Why is this one called "complete" and that one called "complete ultra"? That would suggest that the "complete" one wasn't actually complete. I stopped using that brand of toothpaste as it became impossible to know which was the right product within the brand.
If I was assessing the AI landscape today, where the leading models are largely indistinguishable in day to day use, I'd look at OpenAI's wall of toothpaste and immediately discount them.
(I work at OpenAI.)
In ChatGPT, o4-mini is replacing o3-mini. It's a straight 1-to-1 upgrade.
In the API, o4-mini is a new model option. We continue to support o3-mini so that anyone who built a product atop o3-mini can continue to get stable behavior. By offering both, developers can test both and switch when they like. The alternative would be to risk breaking production apps whenever we launch a new model and shut off developers without warning.
I don't think it's too different from what other companies do. Like, consider Apple. They support dozens of iPhone models with their software updates and developer docs. And if you're an app developer, you probably want to be aware of all those models and docs as you develop your app (not an exact analogy). But if you're a regular person and you go into an Apple store, you only see a few options, which you can personalize to what you want.
If you have concrete suggestions on how we can improve our naming or our product offering, happy to consider them. Genuinely trying to do the best we can, and we'll clean some things up later this year.
Fun fact: before GPT-4, we had a unified naming scheme for models that went {modality}-{size}-{version}, which resulted in names like text-davinci-002. We considered launching GPT-4 as something like text-earhart-001, but since everyone was calling it GPT-4 anyway, we abandoned that system to use the name GPT-4 that everyone had already latched onto. Kind of funny how our unified naming scheme originally made room for 999 versions, but we didn't make it past 3.
Have any of the models been deprecated? It seems like a deprecation plan and definition of timelines would be extraordinarily helpful.
I have not seen any sort of "If you're using X.122, upgrade to X.123, before 202X. If you're using X.120, upgrade to anything before April 2026, because the model will no longer be available on that date." ... Like all operating systems and hardware manufacturers have been doing for decades.
Side note, it's amusing that stable behavior is only available on a particular model with a sufficiently low temperature setting. As near-AGI shouldn't these models be smart enough to maintain consistency or improvement from version to version?
Yep, we have a page of announced API deprecations here: https://platform.openai.com/docs/deprecations
It's got all deprecations, ordered by date of announcement, alongside shutdown dates and recommended replacements.
Note that we use the term deprecated to mean slated for shutdown, and shutdown to mean when it's actually shut down.
In general, we try to minimize developer pain by supporting models for as long as we reasonably can, and we'll give a long heads up before any shutdown. (GPT-4.5-preview was a bit of an odd case because it was launched as a potentially temporary preview, so we only gave a 3-month notice. But generally we aim for much longer notice.)
On that page I don't see any mention of o3-mini. Is o3-mini a legacy model now which is slated to be deprecated later on?
Nothing announced yet.
Our hypothesis is that o4-mini is a much better model, but we'll wait to hear feedback from developers. Evals only tell part of the story, and we wouldn't want to prematurely deprecate a model that developers continue to find value in. Model behavior is extremely high dimensional, and it's impossible to prevent regression on 100% use cases/prompts, especially if those prompts were originally tuned to the quirks of the older model. But if the majority of developers migrate happily, then it may make sense to deprecate at some future point.
We generally want to give developers as stable an experience as possible, and not force them to swap models every few months whether they want to or not. Personally, I want developers to spend >99% of their time thinking about their business and <1% of their time thinking about what the OpenAI API is requiring of them.
Any idea when v1/models will be updated? As of right now, https://api.openai.com/v1/models has "id": "o3-mini-2025-01-31" and "id": "o3-mini", but no just 'o3'.
Ah, I know this is a pain, but by default o3 is only available to developers on tiers 4–5.
If you're in tiers 1–3, you can still get access - you just need to verify your org with us here:
https://help.openai.com/en/articles/10910291-api-organizatio...
I recognize that verification is annoying, but we eventually had to resort to this as otherwise bad actors will create zillions of accounts to violate our policies and/or avoid paying via credit card fraud/etc.
Aha! Verified and now I see o3. Thanks.
> Why is there an o3-mini and an o4-mini? Why on earth are there so many models?
Because if they removed access to o3-mini — which I have tested, costed, and built around — I would be very angry. I will probably switch to o4-mini when the time is right.
They just did that, at least for chat
It seems clear to me I would have built an app around the API, not the chat window.
They keep a lot of models around for backward compatibility for API users. This is confusing, but not inherently a bad idea.
You could develop an AI model to help pick the correct AI model.
Now you’ve got 18 problems.
I think you're trying to re-contextualize the old Standards joke, but I actually think you're right -- if a front end model could dispatch as appropriate to the best backend model for a given prompt, and turn everything into a high level sort of mixture of models, I think that would be great, and a great simplifying step. Then they can specialize and optimize all they want, CPU goes down, responses get better and we only see one interface.
> I think you're trying to re-contextualize the old Standards joke
Regex joke [1], but the standards joke will do just fine also :)
[1] Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Isn't this basically the idea of agents?
Well, in fairness, Anthropic has fewer because 1) they started later, 2) they could learn from competitors' mistakes, 3) they focused on enterprise rather than consumer, and 4) they have fewer resources.
The point is taken — and OpenAI agrees. They have said they are actively working on simplifying the offering. I just think it's a bit unfair. We have perfect hindsight today here on HackerNews and also did zero of the work to produce the product.
Model fatigue is a real thing - particularly with their billing model, which is wildly different from model to model and gives you more headroom as you spend more. We spend a lot of time and effort running tests across many models to balance that cost/performance ratio. When you can run 300k tokens per min on a shittier model, or 10k tokens per min on a better model, you want to use the cheaper model, but if the performance isn't there then you gotta pivot. Can I use tools here? Can I use function calling here? Do I use the chat API, the chat completions API, or the responses API? Do either of those work with the model I want to use, or only with other models?
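One back-of-the-envelope way to frame that juggling act; every model name, price and rate limit below is a made-up placeholder, not OpenAI's actual numbers:

    from dataclasses import dataclass

    @dataclass
    class ModelOption:
        name: str
        usd_per_1m_tokens: float   # blended input/output price (placeholder)
        tokens_per_min_limit: int  # your tier's rate limit (placeholder)
        passes_eval: bool          # cleared your quality bar on a test set?

    def pick(options, tokens_needed_per_min: int):
        # Cheapest model that both meets the quality bar and fits the rate limit.
        viable = [m for m in options
                  if m.passes_eval and m.tokens_per_min_limit >= tokens_needed_per_min]
        return min(viable, key=lambda m: m.usd_per_1m_tokens, default=None)

    print(pick([ModelOption("cheap-fast", 0.5, 300_000, False),
                ModelOption("big-smart", 5.0, 10_000, True)],
               tokens_needed_per_min=50_000))  # None: neither option actually works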
I almost wonder if this is intentional ... because when you create a quagmire of insane inter-dependent billing scenarios you end up with a product like AWS that can generate substantial amounts of revenue from sheer ignorance or confusion. Then you can hire special consultants to come in and offer solutions to your customers in order to wade through the muck on your behalf.
Dealing with OpenAI's API's is a straight up nightmare.
Most industries, or categories go through cycles of fragmentation and consolidation.
AI is currently in a high growth expansion phase. The leads to rapid iteration and fragmentation because getting things released is the most important thing.
When the models start to plateau or the demands on the industry are for profit you will see consolidation start.
having many models from the same company in some haphazard strategy doesn't equate to "industry fragmentation". it's just confusion
OpenAI's continued growth and press coverage relative to their peers leads to me to believe it isn't *just* confusion, even if it is confusing.
I'd attribute that more to first mover advantage than a benefit from poor naming choices, though I do think they are likely to misattribute that to a causal relationship so that they keep doing the latter
They do this because people like to have predictability. A new model may behave quite differently on something that’s important for a use case.
Also, there are a lot of cases where very small models are just fine and others where they are not. It would always make sense to have the smallest highest performing models available.
I have *no idea* why you're being downvoted on this.
If I want to take advantage of a new model, I must validate that the structured queries I've made to the older models still work on the new models.
The last time I did a validation and update? Their. Responses. Had. Changed.
API users need dependability, which means they need older models to keep being usable.
> I have no idea why you're being downvoted on this.
I probably offended someone at YC and my account is being punished.
I cannot believe that this is what we feel is most worth talking about here (by visibility). At this point I truly wonder if AI is what will make HN side with the Luddites.
Is there some new HN with more insightful discussions?
It's giving "they took our jerbs"
This seems like a perfect use case for "agentic" AI. OpenAI can enrich the context window with the strengths and weakness of each model, and when a user prompts for something the model can say "Hey, I'm gonna switch to another model that is better at answering this sort of question." and the user can accept or reject.
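As far as I know nothing in the product does this today, but a crude version is easy to sketch against the existing API: let a small, cheap model pick from a routing table, then surface the proposed switch for the user to accept or reject. The route names and descriptions here are placeholders:

    from openai import OpenAI

    client = OpenAI()

    ROUTES = {
        "cheap-general": "everyday questions, summaries, casual writing",
        "reasoning": "multi-step math, debugging, planning, tricky logic",
        "vision": "anything involving an attached image",
    }

    def route(user_prompt: str) -> str:
        table = "\n".join(f"- {name}: {desc}" for name, desc in ROUTES.items())
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any small, cheap model can act as the dispatcher
            messages=[
                {"role": "system",
                 "content": "Pick the best route for the user's request.\n"
                            f"Routes:\n{table}\nReply with the route name only."},
                {"role": "user", "content": user_prompt},
            ],
        )
        choice = resp.choices[0].message.content.strip()
        return choice if choice in ROUTES else "cheap-general"

    print(route("Why does my recursive SQL CTE never terminate?"))  # likely "reasoning"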
> This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model
OpenAI's progress lately:
I'm not sure why people say they haven't done much. We couldn't even dream of stuff like this five years ago, and now releasing groundbreaking/novel features every month is considered "meh"... I think we're spoiled and can't appreciate anything anymore :)
If there are incremental gains in each release, why would they hold them back? The amount of exhaust coming off of each release is gold for the internal teams. The naming convention is bad, and the CPO just admitted as much on Lenny's podcast, but I am not sure why incremental releases are a bad thing.
> This is just getting to be a bit much, seems like they are trying to cover for the fact that they haven't actually done much.
Did you miss the 4o image generation announcement from roughly three weeks ago?
https://news.ycombinator.com/item?id=43474112
Combining a multimodal LLM+ImageGen puts them pretty significantly ahead of the curve at least in that domain.
Demonstration of the capabilities:
https://mordenstar.com/blog/chatgpt-4o-images
That would explain why they all have a knowledge cutoff (likely training date) of ~August 2023.
There are 9 models in the ChatGPT model picker and they have stated that it's their goal to get rid of the model picker because everyone finds it annoying.
Think for 30 seconds about why they might in good faith do what they do.
Do you use any of them? Are you a developer? Just because a model is non-deterministic it doesn't mean developers don't want some level of consistency, whether it be about capabilities, cost, latency, call structure etc.
You'd think they could use AI to infer the best model for your use case so you don't even have to think about it. Run the first few API calls in parallel, grade the results, and then send the rest to whatever works best.
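Roughly, that could look like the sketch below: send a handful of sample prompts to each candidate, score the answers (here with an LLM judge, though a unit-test style eval is better), and route the rest of the traffic to the winner. The candidate list and judging prompt are assumptions, and the calls run serially here for brevity rather than in parallel:

    import statistics
    from openai import OpenAI

    client = OpenAI()
    CANDIDATES = ["o4-mini", "o3", "gpt-4.1"]  # whatever you're choosing between

    def answer(model: str, prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content

    def score(prompt: str, reply: str) -> float:
        # Placeholder judge; swap in your own eval set or tests for real decisions.
        resp = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user",
                       "content": "Rate 0-10 how well the answer addresses the question. "
                                  f"Reply with a number only.\nQ: {prompt}\nA: {reply}"}])
        return float(resp.choices[0].message.content.strip())

    def pick_model(sample_prompts: list[str]) -> str:
        means = {m: statistics.mean(score(p, answer(m, p)) for p in sample_prompts)
                 for m in CANDIDATES}
        return max(means, key=means.get)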
> All these models feel like they took the exact same base model, tweaked a few things and released it as an entirely new model rather than updating the existing ones.
That's not a problem in and of itself. It's only a problem if the models aren't good enough.
Judging by ChatGPT's adoption, people seem to think they're doing just fine.
Here's a summary of this conversation so far, generated using o3 after 306 comments, run via my llm CLI.
Output: https://gist.github.com/simonw/a35f39b070978e703d9eb8b1aa7c0... - cost: 2,684 input tokens, 2,452 output tokens (of which 896 were reasoning tokens), which is 12.492 cents.
Then I ran it again with o4-mini using the exact same content (hence the hash ID for -f).
Output: https://gist.github.com/simonw/b11ba0b11e71eea0292fb6adaf9cd... - cost: 2,684 input tokens, 2,681 output tokens (of which 1,088 were reasoning tokens) = 1.4749 cents.
The above uses these two plugins: https://github.com/simonw/llm-openai-plugin and https://github.com/simonw/llm-hacker-news - taking advantage of the new -f "fragments" feature I released last week: https://simonwillison.net/2025/Apr/7/long-context-llm/
Tyler Cowen seems convinced: https://marginalrevolution.com/marginalrevolution/2025/04/o3...
It can't solve this puzzle: https://i.imgur.com/AJqbqHJ.png
The breathtaking thing is not the model itself, but that someone as smart as Cowen (and he's not the only one) is uttering "AGI" in the same sentence as any of these models. Now, I'm not a hater, and for many tasks they are amazing, but they are, as of now, not even close to AGI, by any reasonable definition.
I work for OpenAI. o4-mini gets much closer (but I'm pretty sure it fumbles at the last moment): https://chatgpt.com/share/680031fb-2bd0-8013-87ac-941fa91cea...
We're pretty bad at model naming and communicating capabilities (in our defense, it's hard!), but o4-mini is actually a _considerably_ better vision model than o3, despite the benchmarks. Similar to how o3-mini-high was a much better coding model than o1. I would recommend using o4-mini-high over o3 for any task involving vision.
Thanks for the reply. I am not sure the vision is the failing point here, but logic. I routinely try to get these models to solve difficult puzzles or coding challenges (the kind that a good undergrad math major could probably solve, but that most would struggle with). They fail almost always. Even with help.
For example, Jane Street monthly puzzles. Surprisingly, the new o3 was able to solve this month's (previous models were not), which was an easier one. Believe me, I am not trying to minimize the overall achievement -- what it can do is incredible -- but I don't believe the phrase AGI should even be mentioned until we are seeing solutions to problems that most professional mathematicians would struggle with, including solutions to unsolved problems.
That might not be enough even, but that should be the minimum bar for even having the conversation.
> what it can do is incredible -- but I don't believe the phrase AGI should even be mentioned until we are seeing solutions to problems that most professional mathematicians would struggle with, including solutions to unsolved problems.
But why? Why should tasks that a good chunk of humans couldn't do be the bar for Artificial General Intelligence? Are those people no longer general intelligences?
I'm not saying this definition is 'wrong' but you have to realize at this point, the individual words of that acronym no longer mean anything.
Sure, there's no authority who stamps the official definition.
I'll make my case. To me, if you look at how the phrase is usually used -- "when humans have achieved AGI...", etc -- it evokes a science fiction turning point that implies superhuman performance in more or less every intellectual task. It's general, after all. I think of Hal or the movie Her. It's not "Artifical General Just-Like-Most-People-You-Know Intelligence". Though we are not there yet, either, if you consider the full spectrum of human abilities.
Few things would demonstrate general superhuman reasoning ability more definitively than machines producing new, useful, influential math results at a faster rate than people. With that achieved, you would expect it could start writing fiction and screenplays and comedy as well as people too (it's still very far imo), but maybe not, maybe those skills develop at different paces, and I still wouldn't want to call it AGI. But I think truly conquering mathematics would get me there.
A standard term people use for what you describe is superintelligence, not AGI.
Current frontier models are better than average humans in many skills but worse in others. Ethan Mollick calls it “jagged frontier” which sounds about right.
Tyler Cowen is someone I take seriously. I think he is one of the most rational thought leaders.
But I have to say, his views on LLMs seem a little premature. He definitely has a unique viewpoint on what "general intelligence" is, which might not apply broadly to most jobs. I think he "interviews" them as if they were a guest on his podcast and bases his judgement on how they compare to his other extremely smart guests.
This post[1] is highlighted by Techmeme:
>I'm obsessed with o3. It's way better than the previous models. It just helped me resolve a psychological/emotional problem I've been dealing with for years in like 3 back-and-forths (one that wasn't socially acceptable to share, and those I shared it with didn't/couldn't help)
Genuinely intrigued by what kind of "psychological/emotional problem I've been dealing with for years" an AI could solve in a matter of hours after its release.
[1] https://x.com/carmenleelau/status/1912645771955962300
The most striking difference to me is that o3 and o4 know when the web search tool is unavailable, and will tell you they can't answer a question that requires it. While 4o and (sadly) 4.1 will just make up a bunch of nonsense.
I'm simultaneously impressed that they can do that, and also wondering why the heck that's so impressive (isn't "is this tool in this list?" something GPT-3 was able to handle?) and why 4.1 still fails at it too—especially considering it's hyped as the agentic coder model!
That's pretty damning for the general intelligence aspect of it, that they apparently had to special-case something so trivial... and I say that as someone who's really optimistic about this stuff!
That being said, the new "enhanced" web search seems great so far, and means I can finally delete another stupid 10 line Python script from 2023 that I shouldn't have needed in the first place ;)
(...Now if they'd just put 4.1 in the Chat... why the hell do I need to use a 3rd party UI for their best model!)
A suggestion for OpenAI to create more meaningful model names:
{Size}-{Quarter/Year}-{Speed/Accuracy}-{Specialty}
Where:
* Size is XS/S/M/L/XL/XXL to indicate overall capability level
* Quarter/Year like Q2-25
* Speed/Accuracy indicated as Fast/Balanced/Precise
* Optional specialty tag like Code/Vision/Science/etc
Example model names:
* L-Q2-25-Fast-Code (Large model from Q2 2025, optimized for speed, specializes in coding)
* M-Q4-24-Balanced (Medium model from Q4 2024, balanced speed/accuracy)
This is even more incomprehensible to users who don't understand what this naming scheme is supposed to mean. Right now, most power users are keeping track of all the models and know what they are like, so this naming wouldn't help them. Normal consumers don't really know the difference between the models, but this wouldn't help them either - all those letters and numbers aren't super inviting and friendly. They could try just having a linear slider for amount of intelligence and another one for speed.
I think they should name them after fictional characters. Bonus points if they're trademarked characters.
"You gotta try Mickey, it beats the crap out of Gandalf in coding."
Thank god we don’t usually let engineers name stuff in the west.
While this is entirely logical in theory, this is how you get LG-style naming like "THE ALL NEW LG-CFT563-X2".
I mean, it makes total sense, it tells you exactly the model, region, series and edition! Right??
What about using Marvel superhero names (with permission, of course)? The studio keeps giving us stronger and stronger examples...
After refreshing the browser I see that the old o3-mini-high has gone now so I continued my coding task conversation with o4-mini-high. In two separate conversations it butchered things in a way that I never saw o3-mini-high do. In one case it rewrote working code without reason, breaking it, in the other it took a function I asked it to apply a code fix to and it instead refactored it with a different and unrelated function that was part of an earlier bit of chat history.
I notice too that it employs a different style of code where it often puts assignment on a different line, which looks like it's trying to maintain an ~80 character line limit, but does so in places where the entire line of code is only about 40 characters.
Not saying it’s for sure the case, but it might be that the model gets confused by out-of-distribution text from the other model, whereas it expects the text in its context to have come from itself (particularly if the CoT is used as context for later conversations).
Maybe they should ask the new models to generate a better name for themselves. It's getting quite confusing.
I’m having very mixed feelings about it. I’m using o3 to help me parse and understand a book about statistics and ML, it’s very dense in math.
On one hand the answers became a lot more comprehensive and deep. It’s now able to give me very advanced explanations.
On the other hand, it started overloading the answers with information. Entire concepts became single sentence summaries. Complex topics and theorems became acronyms. In a way I’m feeling overwhelmed by the information it’s now throwing at me. I can’t tell if it’s actually smarter or just too complicated for me to understand.
Pretty wild that we’re at the point that the human is the limitation
Surprise, the machine that interpolates from a database of maths books confuses a human who wants to learn about the contents of the books in that database.
The demo video is very impressive, and it shows what AI could be. Our current models are unreliable in research, but if they were reliable, then what's shown alone would be better than AGI.
There are 8 billion+ instances of general intelligence on the planet; there isn't a shortage. I'd rather see AI do data science and applied math at computer speeds. Those are the hard problems, a lot of the AGI problems (to human brains) are easy.
So what are they selling with the 200 dollar subscription? Only a model that has now caught up with their competitor who sells for 1/10 of their price?
o4-mini is available on vs code. I've been playing with it for the last couple of hours. It's quite fast for a thinking model.
It's also super concise with code. Where Claude 3.7 and Gemini 2.5 will write a ton, o4-mini will write a tiny portion of it while accomplishing the same task.
On the flip side, in its conciseness it's lazier with implementation than the other leading models, missing features.
For fixing very complex typescript types, I've previously found that o1 outperformed the others. o4-mini seems to understand things well here.
I still think gemini will continue to be my favorite model for code. It's more consistent and follows instructions better.
However, openAI's more advanced models have a better shot at providing a solution when gemini and claude are stuck.
Maybe there's a win here in having o4-mini or o3 do a first draft for conciseness, revise with gemini to fill in what's missed (but with a base that is not overdone), and then run fixes with o4-mini.
Things are still changing quite quickly.
The user experience needs to be massively improved when it comes to model choice. How are average users supposed to know which model to pick? Why shouldn't I just always pick the newest or most powerful one? Why should I have to choose at all? I say this from the perspective of a ChatGPT user - I understand the different pricing on the API side helps people make decisions.
Interesting that using tools to zoom around the image is useful for the model. I was kind of assuming that these models were beyond such things and could attend to all aspects of the image simultaneously anyway, but perhaps their input is still limited in resolution? Very cool, in any case; spooky progress as always.
There's just a certain amount of things the image encoder can process at once. It's pretty apparent when you give the models a big table in an image.
But isn't this basically what the conv layer does...?
On the vision side of things: I ran my torture test through it, and while it performed "well", about the same level as 4o and o1, it still fails to handle spatial relationships well, and did hallucinate some details. OCR is a little better it seems, but a more thorough OCR focused test would be needed to know for sure. My torture tests are more focused on accurately describing the content of images.
Both seem to be better at prompt following and have more up to date knowledge.
But honestly, if o3 was only at the same level as o1, it'd still be an upgrade since it's cheaper. o1 is difficult to justify in the API due to cost.
FWIW, o4-mini-high does not feel better than o3-mini-high for working on fairly simple econ theory proofs. It does feel faster. And both make elementary mistakes.
So far, with the random coding/design question that I asked o1 last week, it did substantially better with o3. It's more like a mid-level engineer and less like an intern.
I can't even get ChatGPT to tell me which chatgpt to use.
So it looks like no increase in context window size since it’s not mentioned anywhere.
I assume this announcement is all 256k, while the base model 4.1 just shot up this week to a million.
I thought they weren't going to release o3 and it would just be bundled into "GPT-5".
I have been using o4-mini-high today. Most of the time for a file longer than 100 lines it stops generating randomly and won't complete a file unless I re-prompt it with the end of the missing file.
As usual, it's a frustrating experience for anything more complex than the usual problems everyone else does.
I noticed that OpenAI doesn't compare their models to third-party models in their announcement posts, unlike Google, Meta and the others.
Same...
It's there now in the web app for me.
I see them in the Android app now.
The big step function here seems to be RL on tool calling.
Claude 3.7/3.5 are the only models that seem to be able to handle "pure agent" usecases well (agent in a loop, not in an agentic workflow scaffold[0]).
OpenAI has made a bet on reasoning models as the core to a purely agentic loop, but it hasn't worked particularly well yet (in my own tests, though folks have hacked a Claude Code workaround[1]).
o3-mini has been better at some technical problems than 3.7/3.5 (particularly refactoring, in my experience), but still struggles with long chains of tool calling.
My hunch is that these models were tuned _with_ OpenAI Codex[2], which is presumably what Anthropic was doing internally with Claude Code on 3.5/3.7
tl;dr - GPT-3 launched with completions (predict the next token), then OpenAI fine-tuned that model on "chat completions", which led to GPT-3.5/GPT-4 and ultimately the success of ChatGPT. This new agent paradigm requires fine-tuning on the LLM interacting with itself (thinking) and with the outside world (tools), sans any human input.
[0]https://www.anthropic.com/engineering/building-effective-age...
[1]https://github.com/1rgs/claude-code-proxy
[2]https://openai.com/index/openai-codex/
I find o4-mini very bad at coding. I tried to improve a script created by o3-mini-high using o4-mini-high, and it doesn't return nearly as good results as what I used to get from o3-mini-high.
This reminds me of keeping up with all the latest JavaScript framework trivia circa the ~2010s
JS framework thing is still ongoing
https://krausest.github.io/js-framework-benchmark/2025/table...
Doesn't achieving AGI mean the beginning of the end of humanity's current economic model? I'm not sure I understand the presumption by many that achieving AGI is just another step in some company's offering.
No, you see, because everyone will become AGI engineers. Actually, that makes sense and is going to happen.
Most days I feel the same.
Other days I remember that humans like "handmade" furniture, and live performances, and unique styles, and human contact.
Perhaps there's life in us still?
o3 failed the first test I gave it. I wanted it to create a bar chart using Python of the first 10 Fibonacci numbers (did this easily), and then use that image as input to generate an info-graphic of the chart with an animal theme. It failed in two ways. It didn't have access to the visual output from python and, when I gave it a screenshot of that output, it failed in standard GenAI fashion by having poor / incomplete text and not adhering exactly to bar heights, which were critical in this case.
So one failure that could be resolved with better integration on the back end and then an open problem with image generation in general.
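For reference, the first step is the easy part; a minimal sketch of the kind of script the model should produce for it, assuming matplotlib, looks like this. The themed info-graphic step, with exact bar heights preserved, is where it breaks down.

```python
# Bar chart of the first 10 Fibonacci numbers (assumes matplotlib is installed).
import matplotlib.pyplot as plt

def fibonacci(n: int) -> list[int]:
    seq = [0, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

values = fibonacci(10)
plt.bar(range(1, 11), values)
plt.xlabel("n")
plt.ylabel("Fibonacci(n)")
plt.title("First 10 Fibonacci numbers")
plt.savefig("fib_bars.png")  # this image is what would feed the info-graphic step
```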
A very subtle mention of o3-pro, which I'd imagine is now the most capable programming model. Excited to see when I get access to that.
Good thing I stopped working a few hours ago
EDIT: Altman tweeted o3-pro is coming out in a few weeks, looks like that guy misspoke :(
https://transluce.org/investigating-o3-truthfulness
Some interesting hallucinations going on here!
I’m not sure I fully understand the rationale of having newer mini versions (eg o3-mini, o4-mini) when previous thinking models (eg o1) and smart non-thinking models (eg gpt-4.1) exist. Does anyone here use these for anything?
I use o3-mini-high in Aider, where I want a model to employ reasoning but not put up with the latency of the non-mini o1.
o1 is a much larger model that is more expensive to operate on OpenAI's end. Having a smaller, newer (roughly equating newer with more capable) model means you can match the performance of larger, older models while reducing inference and API costs.
At this point, it's like comparing the iPhone 5s vs the iPhone 6. The upgrades are still noticeable, but it's nowhere near the huge jump between GPT-3.5 and GPT-4.
It seems to be getting better. I used to use my custom "Turbo Chad" GPT based on 4o and now the default models are similar. Is it learning from my previous annoyances?
It has been getting better IMO.
Underwhelming. Cancelled my subscription in favor of Gemini Pro 2.5
I find it worse than Gemini 2.5 Pro at math research.
o4 is doing a better job than o3 on my current project, and while this isn’t really a priority, its personality is somehow far more engaging now.
> Downloaded an untouched char.lgp from the current Steam build (1.0.9) to make sure the count reflects the shipping game rather than a modded archive.
How?
o3 joins gemini-2.5-pro as the only other model that can pace long form creative writing properly when details about the story are provided.
I'm confused. I typically use o1 for all of my questions. Now it's disappeared. Is o3 a better model?
Yes, in almost all aspects if you do not use the o1-pro. o3-pro is not available yet.
Any quick impressions of o3 vs o1? We've got one inference in our product that only o1 has seemed to handle well, wondering if o3 can replace it.
They are replacing o1 with o3 in the UI, at least for me, so they must be pretty confident it is a strict improvement.
The most annoying part of all this is they replaced o1 with o3 without any notices or warnings. This is why I hate proprietary models.
Meanwhile we have people elsewhere in the thread complaining about too many models.
Assuming OpenAI are correct that o3 is strictly an improvement over o1 then I don't see why they'd keep o1 around. When they upgrade GPT-4o they don't let you use the old version, after all.
>Assuming OpenAI are correct that o3 is strictly an improvement over o1 then I don't see why they'd keep o1 around.
Imagine if every time your favorite SaaS had an update, they renamed the product. Yesterday you were using Slack S7, and today you're suddenly using Slack 9S-o. That was fine in the desktop era, when new releases happened once a year - not every few weeks. You just can't keep up with all the versions.
I think they should just stick with one brand and announce new releases as incremental updates to that same brand/product (even if the underlying models are different): "the DeepSearch Update" or "The April 2025 Reasoning Update", etc.
The model picker should be replaced entirely with a router that automatically detects which underlying model to use. Power users could have optional checkboxes like "Think harder" or "Code mode" as settings, if they want to guide the router toward more specialized models.
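Such a router could be a thin dispatch layer in front of the API. A toy sketch follows; the heuristics and model names here are illustrative assumptions on my part, not anything OpenAI has described:

```python
# Toy request router: pick a model from crude heuristics plus optional user toggles.
def pick_model(prompt: str, think_harder: bool = False, code_mode: bool = False) -> str:
    if think_harder:
        return "o3"          # heaviest reasoning model (illustrative choice)
    if code_mode or "```" in prompt or "refactor" in prompt.lower():
        return "o4-mini"     # cheaper reasoning model leaned on for code/tool work
    if len(prompt) > 20_000:
        return "gpt-4.1"     # long-context non-reasoning model
    return "gpt-4o"          # default conversational model

print(pick_model("Refactor this function to avoid the N+1 query", code_mode=True))
```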
When are they going to release o3-high? I don't think it's in the API, and I certainly don't see it in the web app (Pro).
> We expect to release OpenAI o3‑pro in a few weeks with full tool support. For now, Pro users can still access o1‑pro.
https://openai.com/index/introducing-o3-and-o4-mini/
Is there a non-obvious reason why something like Python wasn't used with LLMs from day one for queries requiring calculations?
Because it's not a feature of the LLM but of the product that is built around it (like ChatGPT).
It's true that the product provides the tools, but the model still needs to be trained to use tools, or it won't use them well or at the right times.
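Concretely, "the product provides the tools" means the caller declares tool schemas and executes any requested calls itself; the model only emits a structured request. A minimal sketch with the current openai Python client, where the `run_python` tool is a made-up example and model availability on a given API key is a separate question:

```python
from openai import OpenAI

client = OpenAI()

# The caller (the "product") declares which tools exist; the model can only ask for them.
tools = [{
    "type": "function",
    "function": {
        "name": "run_python",  # hypothetical tool the product would implement
        "description": "Execute a short Python expression and return the result.",
        "parameters": {
            "type": "object",
            "properties": {"code": {"type": "string"}},
            "required": ["code"],
        },
    },
}]

resp = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "What is 37 * 211? Use the tool."}],
    tools=tools,
)

# If the model chose to call the tool, the product must actually execute it
# and send the result back in a follow-up message before the model can answer.
print(resp.choices[0].message.tool_calls)
```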
LLMs could not use tools on day one.
Finally, a new SOTA model on SWE-bench. Love to see this progress, and nice to see OpenAI finally catching up in the coding domain.
Not sure what the goal is with Codex CLI. It's not running a local LLM, right? Just a CLI to make API calls from the terminal?
This might be their answer to Claude Code more than anything else.
Yes, that's exactly what I thought as well. An attempt to get more share in the developer tooling space for the long term.
Looks more like a direct competitor to Aider.
Where do I find out more about Aider ?
Just wait a few seconds and there will be a post here with Aider benchmarks for the new model, or https://aider.chat
https://aider.chat
This is a mess. I do follow AI news, and I do not know if this is "better/faster/cheaper" than 4.1.
Why are they doing this?
Oh god. I'm Brazilian and can't get the "Verification" using my passport or ID. This is a very frightening future.
The Codex CLI looks nice, but it's a shame I have to bring my own API key when I already subscribe to ChatGPT Plus
I feel like the only reason o3 is better than o1 is the tool usage. With tool use, o1 could be similar to o3.
I wish companies would adhere to a consistent naming scheme, like <name>-<params>-<cut-off-month>.
Still a knowledge cutoff of August 2023. That is a significant bottleneck to devs using it for AI stuff.
I've taken to pasting the latest OpenAI API docs for their Python library into each prompt (via the API; I'm not pasting manually in ChatGPT each time) so that the AI can write code that uses itself! Like, I get it, the training-data thing is hard, but OpenAI shipped breaking changes to their Python library and their models largely still do not know about them. I haven't tried the 4.1 series yet with its newer cutoff, but the rest of the models like o3-mini (and I presume these new ones today) still write openai Python library code in the old, broken style. Argh.
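A minimal sketch of that workaround, assuming you keep a local text copy of the current library docs and call the API directly; the file name and model choice are placeholders:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()
# Local copy of the up-to-date openai Python docs; the path is a placeholder.
API_DOCS = Path("openai_python_docs.md").read_text()

def ask(question: str) -> str:
    # Prepend the current docs so the model stops emitting pre-1.0 client code.
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "Current openai Python library reference:\n" + API_DOCS},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```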
I wonder where o3 and o4-mini will land on the LMarena leaderboard. When might we see them there?
Anyone got codex working? After installing and setting up the API key I get this error:
╭─ (error output truncated) ─╮
Are these available via the API? I'm getting back 'model_not_found' when testing.
The benchmarks reference o3-low, medium and high. What is plain “o3”? Is that medium?
Sooo... are any of these (or their distils) getting open-sourced/open-weighted?
I want to be excited about this, but after chatting with 4.1 about a simple app screenshot and watching it continuously forget and hallucinate, I am increasingly sceptical of OpenAI's announcements. (No coding involved, so the context window was likely < 10% full.)
Any good leaderboard where all the very latest models are compared?
Please just give me a best value and a highest performance model.
If the AI is smart, why not have it choose the model for the user?
That's what GPT-5 was supposed to be (instead of a new base or reasoning model), the last time Sam updated his plans, I thought. Did those change again?
I have barely found time to gauge 4.1's capabilities, so at this stage I'd rather focus on the ever-worsening names these companies bestow upon their models. To say that the USB-IF have found their match would be an understatement.
What is again the advantage of pro over plus subscriptions?
> We expect to release OpenAI o3‑pro in a few weeks with full tool support. For now, Pro users can still access o1‑pro.
Ok, so currently they pay for nothing (or is o1-pro superior to o3?).
Babe wake up a new LLM just dropped.
Here are some notes I made to understand each of these models and when to use them.
# OpenAI Models

## Reasoning Models (o-series)
- All `oX` (o-series) models are reasoning models.
- Use these for complex, multi-step reasoning tasks.

## Flagship/Core Models
- All `x.x` and `Xo` models are the core models.
- Use these for one-shot results.
- Examples: 4o, 4.1

## Cost Optimized
- All `-mini` and `-nano` models are cheaper, faster models.
- Use these for high-volume, low-effort tasks.

## Flagship vs Reasoning (o-series) Models
- Latest flagship model = 4.1
- Latest reasoning model = o3
- The flagship models are general purpose, typically with larger context windows. They rely mostly on pattern matching.
- The reasoning models are trained with extended chain-of-thought and reinforcement learning. They work best with tools, code, and other multi-step workflows. Because tools are used, accuracy tends to be higher.

# List of Models

## 4o (omni)
- 128K context window
- Use: complex multimodal applications requiring the top level of reliability and nuance

## 4o-mini
- 128K context window
- Use: multimodal reasoning for math, coding, and structured outputs
- Use: cheaper than `4o`; use when you can trade off accuracy for speed/cost
- Don't use: when high accuracy is needed

## 4.1
- 1M context window
- Use: for large context ingest, such as full codebases
- Use: for reliable instruction following and comprehension
- Don't use: for high-volume/faster tasks

## 4.1-mini
- 1M context window
- Use: for large context ingest
- Use: when a tradeoff can be made between accuracy and speed

## 4.1-nano
- 1M context window
- Use: for high-volume, near-instant responses
- Don't use: when accuracy is required
- Examples: classification, autocompletion, short answers

## o3
- 200K context window
- Use: for the most challenging reasoning tasks in coding, STEM, and vision that demand deep chain-of-thought and tool use
- Use: agentic workflows leveraging web search, Python execution, and image analysis in one coherent loop
- Don't use: for simple tasks, where a lighter model will be faster and cheaper

## o4-mini
- 200K context window
- Use: high-volume needs where reasoning and cost should be balanced
- Use: for high-throughput applications
- Don't use: when accuracy is critical

## o4-mini-high
- 200K context window
- Use: when o4-mini results are not satisfactory, but before moving to o3
- Use: complex tool-driven reasoning where o4-mini results are not satisfactory
- Don't use: when accuracy is critical

## o1-pro-mode
- 200K context window
- Use: highly specialized science, coding, or reasoning jobs that benefit from extra compute for consistency
- Don't use: for simple tasks

## Models Sorted for Complex Coding Tasks (my opinion)

1. o3
2. Gemini 2.5 Pro
3. Claude 3.7
4. o1-pro-mode
5. o4-mini-high
6. 4.1
7. o4-mini
4o and o4 at the same time. Excellent work on the product naming, whoever did that.
Oh, that was Altman Sam.
Am Saltman
Hi, I’m an OpenAI recruiter. Are you interested in a position with us?
Enter.
It took me reading your comment to realize that they were different and this wasn’t deja vu. Maybe that says more about me than OpenAI, but my gut agrees with you.
Just wait until they announce oA and A0.
They jokingly admitted that they’re bad at naming in the 4.1 reveal video, so they’re certainly aware of the problem. They’re probably hoping to make the model lineup clearer after some of the older models get retired, but the current mess was certainly entirely foreseeable.
Energy Intensive Exceptional Intelligence (Omni-domain), AKA E-I-E-I-O.
Gave Codex a go with o4-mini and it's disappointing... Here you can see my tries. It fully fails on something a mid-level engineer can do after getting used to the tools: https://xcancel.com/Tymscar/status/1912578655378628847
Thanks for the information!
o4-mini, not to be confused with 4o-mini.
What is wrong with OpenAI? The naming of their models seems like it is intentionally confusing, maybe to distract from a lack of progress? Honestly, I have no idea which model to use for simple everyday tasks anymore.
It really is bizarre. If you had asked me 2 days ago I would have said unequivocally that these models already existed. Surely, given the rate of change, a date-based numbering system would be more helpful?
Fix coming this summer, hopefully.
https://twitter.com/sama/status/1911906570835022319
I tend to look at the lmarena leaderboard to see what to use (or the aider polyglot leaderboard for coding)
Seems to me like they're somewhat trying to simplify now.
GPT-N.m -> Non-reasoning
oN -> Reasoning
oN+1-mini -> Reasoning but speedy; cut-down version of an upcoming oN+1 model (unclear if true or marketing)
It would be nice if they actually stick to this pattern.
I suspect that "ChatGPT-4o" is the most confusing part. Absolutely baffling to go with that and then later "oN", but surely they will avoid any "No" models moving forward
But we have both 4o and 4.1 for non-reasoning. And it's still not clear to me which is better (the comparison on their page was from an older version of 4o).
Are the oN models built on top of GPT-N.m models? It would be nice to know the lineage there.
OpenAI be like:
I have doubts whether the live stream was really live.
During the live-stream the subtitles are shown line by line.
When subtitles are auto-generated, they pop up word by word, which I assume would need to happen during a real live stream.
Line-by-line subtitles are shown when the uploader provides their own captions for an existing video; the only way OpenAI could provide captions ahead of time is if the "live stream" isn't actually live.
All YouTube live streams are like this.
I think this is just a quirk of how Google does live captions.