Vibe Check

GPT-5

Our hands-on review of OpenAI's newest model based on weeks of testing

The Verdicts

Dan Shipper
The multi-threaded CEO
Kieran Klaassen
The Rails-pilled master of Claude Code
Danny Aziz
The multi-model polyglot
Alex Duffy
He who makes AI agents fight each other
Naveen Naidu
Graduate of IIT Bombay (the MIT of India 💅)
Katie Parrott
AI-pilled writer by day, vibe coder by night
Yash Poojary
The Swift-native deep researcher
Dan's mom
The voice of the average AI user
Legend

🥇 Paradigm shift
Psyched about this model
It's okay, but I wouldn't use it every day
Trash model
In ChatGPT, GPT-5 is the best model that 99 percent of the world will have ever used: It's fast, simple, and powerful enough to be a daily driver.
In the API, it's priced to make rivals sweat: GPT-5-mini undercuts Google's Gemini 2.5 Flash, and GPT-5 Standard comes in at 1/12th the cost of Claude 4 Opus.
But if you're vibe coding apps with four Claude Code instances running at once, using GPT-5 in its current form doesn't feel like a revelation. Instead, it feels like a significant upgrade to an old paradigm.

GPT-5 in ChatGPT

The first thing you'll notice about GPT-5 is that it's fast. It's really, really fast. The second thing you'll notice is that it makes using AI much simpler: OpenAI has collapsed the model picker—the top-left drop-down menu in ChatGPT—so you don't have to make any choices. You talk to GPT-5, and it decides how long to think. It's an Apple-y move—and it works.

OpenAI can accomplish this because GPT-5 isn't one model, it's a system of models. In ChatGPT, an "auto-switcher" determines the intent of your query, and decides whether to route it to the chat (for easy queries) or reasoning (for more challenging questions) version of the model.

For questions that it can answer off the top of its head—like, "What is the definition of AGI?"—it will respond lightning-fast. For more complex queries—like, "Code me a beautiful new social network"—it will think for a while or do web research before returning a more considered answer.
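To make the idea concrete, here's a minimal sketch of what an intent-based router could look like. OpenAI hasn't published how the auto-switcher actually works, so the heuristic, the model names, and the threshold below are all hypothetical:

```python
# Hypothetical sketch of an intent-based router. OpenAI's actual
# auto-switcher is internal and undisclosed; every name, signal, and
# threshold here is invented for illustration.

def estimate_difficulty(query: str) -> float:
    """Crude stand-in for an intent classifier: longer, code-heavy,
    or multi-step queries score as more difficult."""
    signals = ["code", "build", "prove", "analyze", "step by step"]
    score = min(len(query) / 500, 1.0)
    score += 0.25 * sum(word in query.lower() for word in signals)
    return min(score, 1.0)

def route(query: str) -> str:
    """Send easy queries to the fast chat model, hard ones to reasoning."""
    return "gpt-5-reasoning" if estimate_difficulty(query) > 0.5 else "gpt-5-chat"

print(route("What is the definition of AGI?"))  # -> gpt-5-chat
print(route("Code me a beautiful new social network, step by step"))  # -> gpt-5-reasoning
```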

In ChatGPT, GPT-5 returns comprehensive and readable answers with logical subsections, judicious use of white space, and bolded text to help you find what you need quickly. (It will remind you of answers you get from o3, but without the obsession with tables.)

However, results are less consistent with the non-reasoning version of the model. Sometimes it's great, but it often hallucinates on questions that should have been routed to the reasoning model. For example, if I take a picture of a passage in a novel and ask it to explain what's happening, GPT-5 will sometimes confidently make things up. If I ask it to "think longer," it will deliver an accurate answer.

In ChatGPT's collaborative workspace Canvas, GPT-5 can quickly one-shot front-end apps, introducing vibe coding to millions of people who would never pay for a Claude subscription or try the coding app Lovable. But it's not yet a game changer if you're already a vibe coding veteran: Its work is about on par with Opus 4.1 in Claude Artifacts. Canvas also has a number of quirks that make it hard to work with; for example, it is limited to fewer than 1,000 lines of code.

The bottom line: This will be the first time most of the world has ever used a reasoning model instead of a simple chat model. GPT-5 is available for free, for everyone, today—that's a big deal.

GPT-5 in the API

OpenAI just walked into Google's house and said, "Anything you can do, I can do cheaper." GPT-5 is just. so. cheap.

For comparison, Google's cheap and fast model, Gemini 2.5 Flash, costs $0.30 per million input tokens. As a result, it's one of our favorites to use at Every. But OpenAI now has a direct answer: GPT-5-mini, which clocks in at $0.25 per million input tokens, undercutting Flash.

At the flagship tier, GPT-5's Standard pricing is $1.25 per 1 million input tokens—exactly matching Google's Gemini 2.5 Pro. If you're already paying for Pro-tier Gemini, you can switch to GPT-5 without changing your unit economics.

On the Anthropic side, the comparison is almost comical. Claude 4 Opus is pegged at $15 per million input tokens. GPT-5 Standard is $1.25 per million. That's 12 times cheaper.

Even if Opus 4.1 is better for some use cases than GPT-5, it's difficult to be 12 times better. GPT-5 forces a hard look at the math.
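Here's the back-of-the-envelope math, using the input-token list prices quoted above. Output tokens are billed separately and not shown, and the 500-million-token monthly workload is just an illustrative assumption:

```python
# Back-of-the-envelope comparison using the input-token list prices
# quoted above. Output tokens are billed separately and not shown;
# the 500M-token monthly workload is an arbitrary example.
PRICE_PER_M_INPUT = {
    "gpt-5": 1.25,
    "gpt-5-mini": 0.25,
    "gemini-2.5-flash": 0.30,
    "gemini-2.5-pro": 1.25,
    "claude-4-opus": 15.00,
}

monthly_input_tokens = 500_000_000

for model, price in sorted(PRICE_PER_M_INPUT.items(), key=lambda kv: kv[1]):
    cost = monthly_input_tokens / 1_000_000 * price
    print(f"{model:18} ${cost:>9,.2f}/month")

# claude-4-opus vs. gpt-5 on input tokens: 15.00 / 1.25 = 12x
```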

GPT-5 for agentic engineering

If you're an agent-slinging senior engineer or a vibe coder, GPT-5 is not going to become your daily driver for frontier AI coding. It's more like a skilled utility player than your new MVP.

Don't get me wrong: GPT-5 is a very good programmer. It's incredibly useful as a pair programmer, especially in AI-powered integrated development environments (IDEs) like Cursor. It's great for engineers from traditional backgrounds who want an AI to help collaborate on code. And it excels at research and debugging complex issues.

But the discipline of programming has fundamentally changed this summer. The benchmarks don't show it, but if you know how to YOLO four agents at once in Claude Code, GPT-5 feels like a step backward. That's partially because of the model's current personality: It's more cautious than Opus 4.1 and isn't as comfortable working independently for long periods in our testing. But it's also due to the app you use to interact with it: Neither Cursor nor OpenAI's command-line interface tool, Codex CLI, is on the same level as Claude Code. Both were built for programmer-AI pair programming, not true delegation.

I bet this will change. The model is extremely smart, just not yet built for this use case. But for now, OpenAI seems to have missed the paradigm shift in programming caused by Claude Code over the last two months.
Dan Shipper
Cofounder and CEO
01

The reach test

For day-to-day tasks

Yes

It's a daily driver in ChatGPT. We never use the model picker anymore and almost never need to go back to older models. It's extremely fast for day-to-day queries and gives comprehensive answers to questions that require research. It also disagrees more frequently instead of hallucinating.

For pair programming

Yes

Use it to fix a specific bug or build a new feature step-by-step with a helpful companion in Cursor. It's great at researching and understanding large codebases. It's extremely detail-oriented and fast.

For writing

Yes

GPT-5 has a good voice—nuanced and expressive. It's less likely to output obvious AI idioms, so it's the first thing we turn to when we have a sentence we need to polish or a paragraph we need to draft. We sometimes return to GPT-4.5 for questions that require more thought.

For agentic engineering

No

GPT-5 in Codex and Cursor is too cautious to be a good agentic programmer. It stops too often, and its output on front-end and back-end tasks is lower quality than that of Opus 4 and 4.1 in Claude Code. On big tasks, it gets lost in the details, and its output tends to be too verbose to read easily.

For editing

No

GPT-5 cannot determine whether writing is good. We have benchmarks (below) to test AI's ability to judge writing, and GPT-5 consistently fails on tasks that Opus 4 passes.

02

The team roundtable

Kieran Klaassen (general manager of Cora)
The Rails-pilled master of agents
GPT-5 does what you tell it to do. It takes measured, small steps and wouldn't dream of straying off course—and that's my problem with it. It's good at coding, especially doing precise back-end tasks, but it isn't optimized to be agentic. If you work in a more old-fashioned iterative process, and you give it direction and say, "Yes, this looks good; can you now do that?" it's very easy to work with. The only issue is that's more how you'd work with AI in 2024.

GPT-5 is a Sonnet 3.5 killer, not a leap into the future.
Danny Aziz (general manager of Spiral)
The multi-model polyglot
My magic moment with GPT-5 was when I used it to merge two complex codebases.

I was working on a feature inside of Spiral that used an open-source framework that couldn't quite do what I needed. I used GPT-5 to merge code from another open-source framework into the one I was using. It didn't manage it in one shot or with just one example, but something about the process felt viscerally collaborative. I felt this growing sense of confidence that we were getting there together.

GPT-5 has become my go-to for discrete, well-defined coding tasks. I still use Claude Code for longer-running, more agentic work like code reviews, but if I'm blocked or too lazy to fully think something through, working with GPT-5 gets me where I need to go.
Alex Duffy (head of AI training)
He who makes AI agents fight each other
For consumers, GPT-5 will be a noticeable step up from GPT-4o, especially for users on the free tier, where the upgrade will feel substantial. Power users may stick with specialized tools like o3 for research and Opus for writing, but developers get something valuable: a reliably good, highly steerable model.

For developers, at $1.25 input and $10 output per 1 million tokens, GPT-5's sweet spot is when I've crafted a solid prompt and need to process lots of information into a concise and high-quality output, such as documentation or a new coding function. It's dramatically cheaper than Opus but pricier on outputs than o4-mini, so you're paying for steerability—its ability to adhere to your prompts—not raw reasoning (where o3 may still win). GPT-5-mini might be the real surprise, though, undercutting Gemini's Flash on price with similar performance, assuming it can match the speed.
Naveen Naidu (entrepreneur in residence)
Graduate of IIT Bombay (the MIT of India 💅)
I used GPT-5 to debug a gnarly app freeze bug I encountered in Monologue, the AI dictation app I'm building. I'd been chasing it for four days, including during a marathon four-hour session on a Sunday with Claude Code, with no luck. GPT-5 and I solved it together as collaborators. It helped me identify which area of the code was likely causing the freeze and nail the exact bug.
Katie Parrott (writer and AI operations lead)
AI-pilled writer by day, vibe coder by night
For writing
I used GPT-5 to turn an outline for a piece into a first draft, and I liked it! It took a few prompts for it to catch on to Every's style, but once I gave it "Atlantic article crossed with a viral Hacker News post," the results were strong. It doesn't do as many of the things I normally associate with AI writing, like, "It's not just X, but Y." I also used it to interview me for a new piece I'm working on. It had a better sense than Claude of what the spine of the piece should be and, therefore, what questions to ask.

I've been way happier overall with GPT-5 than with Opus as a first-draft writer.
For vibe coding
I don't like using GPT-5 in Codex as a vibe coder. It won't take on large chunks of work at once, which is tedious. I don't have a fine-grained enough understanding of the individual steps to do much besides hit "Continue" every three seconds. It doesn't explain what it's doing next the way Claude does, even though I told it that I'm a "baby coder."
Yash Poojary (general manager of Sparkle)
The Swift-native researcher
Swift (Apple's open-source programming language) is all that matters to me—Sparkle runs entirely on it—and GPT-5 didn't wow me out of the box. It needed specific configuration prompts before it would work properly, and even after that extra setup, it still wasn't good enough to replace Claude for my Swift work. But when I took coding out of the equation and used GPT-5 purely for research, it crushed.

For the challenging research task of finding duplicate files on a Mac, it gave me the most technically rigorous writeup I've ever seen from an AI. It was like talking to a 140-IQ systems architect who's already built the network three times and learned from each failure.

For pure implementation, I'll still reach for Claude. But when I need deep context, tradeoff analysis, and "why" answers that change the way I build features, GPT-5 is unmatched. I wouldn't go back to Claude for research.
Dan's mom
The voice of the average AI user
🥇
I really think this model is amazing. This is way more comprehensive than the answers I usually get from ChatGPT. The information it gives me is readable and flows really well. This model is gold.
03

The benchmarks

Good writing

We have our own benchmarks for evaluating AI's ability to judge writing quality, such as, "Is the writing engaging?" and, "Do sentences or sections flow naturally into each other?"

GPT-5 produces inconsistent results, sometimes passing and other times failing the same piece of writing. It was inconsistent enough that Danny doesn't fully trust its evaluations, especially compared to Claude Opus, which reliably gives the same results every time.

We ran it on a series of writing samples, from tweets to essays, and it consistently returned "false," judging the writing as not engaging even when it was:

Danny also ran a blind "taste test" between Opus 4 and GPT-5. He gave both models the same set of prompts and asked the Every team on Discord to vote on the outputs, without revealing which model wrote what. Opus 4 (and later, 4.1, which Anthropic dropped earlier this week) came out on top.
Good writing benchmark results

One-shot a game

Kieran invented what we have dubbed the "cozy ecosystem" benchmark. He asked the model to one-shot a 3D weather game, providing the example of the simulation game RollerCoaster Tycoon but for managing a natural ecosystem. The benchmark measures how good the model is at the full spectrum of game creation tasks: coding, designing, planning, and iterating.

These are screenshots from the game that GPT-5 made:
GPT-5 game screenshot 1 GPT-5 game screenshot 2
Kieran thought it was "OK but boring." It didn't crash when I ran it (many of these one-shot game experiments do), but the game didn't look very fun to play. However, he did think it was impressive that GPT-5 didn't make many mistakes, since models like Google's Gemini and o3 struggled at the same task.

For comparison, this is the output of Opus 4.1 in one shot:
Opus 4.1 game output
Kieran prefers 4.1's take on the game. The actual gameplay that 4.1 generated was better than GPT-5's.

AI Diplomacy

Alex tested GPT-5 in AI Diplomacy, where each country is run by an LLM. Playing GPT-5 as France against other models, he tried two prompts: a neutral baseline and an "aggressive" mode. (See for yourself as GPT-5 tries to dominate Europe live on Twitch.)

An early GPT-5 variant that does minimal reasoning ranked near the bottom with the baseline prompt but jumped to second place when told to be aggressive, cutting "hold" moves from 49 percent to 9 percent—even better steerability (read: prompt following) than o3. The public GPT-5 is a slower, more reasoning-heavy version that performed worse in this benchmark.

This chart shows the results from 20 games where various models acting as France faced weaker opponents. The red bars represent optimized prompts, the gray basic prompts, and the yellow the average (lines show range). o3 leads overall, but GPT-5 variants compete well with quality prompts—note GPT-5-mini matching Flash in the orange outline. The red/gray gap demonstrates how much prompt engineering matters for performance.

Alex's takeaway: It's great for consumers and very steerable, so great prompts give great results. But it's not a frontier-pushing release, at least on this benchmark.
AI Diplomacy results

Impossible puzzle

At the end of a meal at an omakase sushi restaurant, Danny received a business card with a puzzle on it. It contains nine sets of numbers, under which it says, "Et tu, Brute?"

He's tried to solve it with every new model. GPT-5 solved it in 1 minute and 10 seconds. The only other models that have been able to do so are o3 and o3 Pro, which took 8 minutes and 19 minutes, respectively.
Impossible puzzle

One-shot a music production app

I asked GPT-5 to one-shot an app that makes music, a classic benchmark for Kieran, who's a trained composer. It made a working prototype quickly:
GPT-5 music production app
For a one-shot vibe code, there's a lot of detail baked into this app—it's essentially a GarageBand clone. But UI-wise, it's a little too spare and minimal for my taste. For comparison, here's Claude Opus 4's attempt:
Claude Opus 4 music production app
I like Opus 4's design better, but GPT-5's app was actually functional.

Pelican on a bicycle

Ladies and gentlemen, we have finally saturated the pelican on a bicycle benchmark:

Someone call Microsoft: OpenAI has achieved AGI internally.
Pelican on a bicycle

'Thup'

We ran OpenAI researcher Aidan McLaughlin's "thup" benchmark, where you repeatedly type "thup" to a model to see how it reacts. GPT-5 offers a programmer productivity "micro-tip" with each successive "thup." Below, it tells me that I can load JavaScript faster with the "defer" and "async" script attributes:
GPT-5 thup benchmark
In contrast to GPT-5's Dilbert-esque presentation, Claude Opus 4.1 responds with all-caps, emoji-studded mathematical facts about the current number of "thups":
Claude Opus 4.1 thup benchmark
This should tell you a lot about the difference in their personalities.

How to try it yourself

In ChatGPT:

It should show up in your model picker now.

In the API:

Use the model "gpt-5," "gpt-5-mini," or "gpt-5-nano" (in descending order of size and power). It will be available starting at 11 a.m. Pacific today.
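For example, here's a minimal call using OpenAI's official Python SDK; it assumes your OPENAI_API_KEY is set in your environment and that GPT-5 is served through the standard Chat Completions interface:

```python
# Minimal sketch using OpenAI's official Python SDK (pip install openai).
# Assumes OPENAI_API_KEY is set in your environment and that GPT-5 is
# served through the standard Chat Completions interface.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5",  # or "gpt-5-mini" / "gpt-5-nano"
    messages=[{"role": "user", "content": "What is the definition of AGI?"}],
)

print(response.choices[0].message.content)
```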

Taste-test new models with us

Subscribe to join our Discord