GPT-5. Better?

My current take on last week's OpenAI model launches:

The GPT-5 series performs worse than Sourcetable's existing solution on most of our internal evals and testing.

This was a big surprise to me, but speaks well of the existing multi-model approach we have taken. I had expected more, especially after seeing Simon Willison sing its praises.

The GPT-OSS models are giving us a ~20% boost over other open-source alternatives (Llama, DeepSeek). This mostly affects in-line, command-bar-driven operations (CMD+K), but not our analysis or autopilot features. We’re also finding it’s a decent prompt-router alternative. Groq remains the workhorse here for fast inference.

As an industry, I do think we’re over-rotating slightly on public evals. Internally, we’re finding that Claude models are still outperforming GPT-5 on most spreadsheet tasks (!), including autopilot, yet on published benchmarks this should not be the case. There's a disconnect here. “Qualia” is the word that comes to mind when describing it.

Ultimately: who is your user? It used to be the case that 85% of Anthropic’s revenue came from developers, while 77% of OpenAI’s revenue came from consumers (i.e. not developers). I haven’t seen an updated statistic, but I do think these revenue sources matter when it comes to the quality and usability of models for application developers.

Qualia.

I’m seeing wildly contradictory reports on GPT-5’s model coherence over time. This is not the decisive victory we are used to seeing when “big-co” launches new flagship models.

Many of the reported wins from GPT-5 appear to come from their use of a prompt-router / multi-model architecture that selects the best model for the task at hand. We launched a similar solution in January (see: Forbes), and have found this to be an effective approach.

Incidentally, multi-model architecture not only supports better app quality (intelligence), it also helps reduce latency and LLM cost overhead, and creates the redundancy needed to ensure uptime across service provider disruptions. I’m bullish on NotDiamond, LiteLLM and other multi-model toolchain solutions.

I also expect more multi-model routing as a default behavior from these “bundled” models from all service providers. This might not be a good thing from the developer’s perspective. It’s turtles all the way down, with many models, and “models” with many models, but all with added complexity in non-deterministic LLM chains.

Multi-model is *very* interesting to observe. It speaks to a tension between explore/exploit over different time horizons. Rich Sutton’s The Bitter Lesson might be paraphrased as ~“in the long run, general intelligence outcompetes local intelligence”, but in the short run we’re all seeing that there are temporary gains to be had exploiting advantages on various axes. Models might improve by a step function every 6 months, but at the current pace 6 months is a lifetime!

So who are the winners here? Application companies, of course! Windsurf, Cursor, Flux, Sourcetable, Shortwave, etc. all get an immediate intelligence boost with every new model improvement. Cost savings and larger context windows are a nice win too, especially for data-heavy applications like ours.

The other win category for GPT-5 is a better awareness of what it can’t do. I definitely notice this improvement when I’m using ChatGPT, but this feature tends to be less important for Sourcetable since most user work is analysis-driven, code-verified, and hallucination-free. People make financial models, not sparkle ponies.

The biggest winner, overall, is consumers. ChatGPT is the breakout success that drove OpenAI to public awareness, and it just got a big upgrade.

I do expect we’ll find places for GPT-5 models in Sourcetable’s LLM ecosystem. Usually we deploy new clear-win models same-day or next-day, but on this occasion we’re going to slow roll and be more surgical around when and how we use GPT-5 inside Sourcetable.

Finally, I just do not understand the hate spewed forth over the Internet towards OpenAI over the course of last week. The no.1 consumer company in the space, the same company that ushered in the platform shift and AI boom that we are all beneficiaries of, had a timely model release but not a conclusive victory over the entire ecosystem. It wasn’t a fait accompli. So what?

Haters gonna hate. Shake it off!

Expectations on OpenAI have gone supernova, and this release wasn’t the developer win that I wanted. Personally though, I’m also a ChatGPT user feeling the immediate benefits and app quality improvement. ChatGPT is still clearly the best chat-based AI product on the market. It’s fantastic, well made, available cheaply/freely to all, and another step forward on the arc of enlightenment.

Well done.