Technology deep-dive

GPT-4o Performance vs Cost: What Numbers Show

Author: Sophie Laurent | Research: Ryan Mitchell | Edit: Kevin Brooks | Visual: Lisa Johansson

OpenAI's GPT-4o launched in May 2024 with plenty of fanfare, promising a faster, cheaper frontier model. But strip away the marketing and look at independent numbers, and the picture gets more complicated. The cost-to-intelligence ratio tells a story that many early adopters might not expect.

The General Intelligence Picture

Artificial Analysis, an independent benchmarking platform, scores GPT-4o at 14 on its Intelligence Index. That number alone does not mean much until you see the context. The average score for comparable non-reasoning models sits at 22, according to Artificial Analysis. So GPT-4o lands well below average by this particular measure.

To be fair, Artificial Analysis notes that its intelligence score is an estimate, with a full independent evaluation still forthcoming. Even as an estimate, though, a score of 14 versus an average of 22 is a noticeable gap.

What makes this more striking is the pricing. GPT-4o charges $5.00 per 1 million input tokens, while the comparable-model average is $2.00. On the output side, GPT-4o costs $15.00 per 1 million tokens versus an average of $8.00. That works out to 2.5 times the average price on input and nearly double on output, for a model that scores below average on intelligence. That is a tough ratio to defend on paper.
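To make the ratio concrete, here is a minimal sketch of the per-request arithmetic using the rates cited above. The request size (2,000 input tokens, 500 output tokens) is a hypothetical example chosen only to show the math, not a measured workload.

```python
# Per-request cost estimate from per-million-token rates.
# Rates are the figures cited above; the request shape is hypothetical.

def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Cost in USD, with rates quoted per 1 million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

gpt4o = request_cost(2_000, 500, input_rate=5.00, output_rate=15.00)
average = request_cost(2_000, 500, input_rate=2.00, output_rate=8.00)
print(f"GPT-4o:  ${gpt4o:.4f} per request")   # $0.0175
print(f"Average: ${average:.4f} per request")  # $0.0080
print(f"Premium: {gpt4o / average:.2f}x")      # ~2.19x
```

At that request shape the blended premium lands around 2.2x, between the 2.5x input gap and the 1.9x output gap; output-heavy workloads will sit closer to the lower figure.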

Where GPT-4o Does Win: Speed

The one area where GPT-4o clearly beats the pack is raw output speed. It generates 88.6 tokens per second, roughly 58% faster than the comparable-model average of 56. For real-time applications like voice assistants or live transcription, that advantage matters a lot. It also supports a 128k-token context window with a knowledge cutoff of October 2023, which is competitive for the current generation of large language models.
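Throughput translates directly into wait time once generation starts. A rough sketch, assuming the generation phase dominates and ignoring time to first token (for which no independent data is cited here):

```python
# Rough wall-clock estimate for the generation phase of a response,
# using the throughput figures cited above. Ignores time to first token
# and network overhead. The 500-token reply size is a hypothetical example.

def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    return output_tokens / tokens_per_second

for label, tps in [("GPT-4o", 88.6), ("comparable-model average", 56.0)]:
    print(f"{label}: {generation_seconds(500, tps):.1f}s for a 500-token reply")
```

For a 500-token reply that is roughly 5.6 seconds versus 8.9, about a 37% cut in generation time.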

So the tradeoff starts to look clearer. You pay more, you get lower general intelligence scores, but you get significantly faster responses. Whether that tradeoff makes sense depends entirely on your use case.

The Medical Exam Wildcard

Here is where the narrative shifts. A study published in Scientific Reports tested GPT-4o on the Chinese National Medical Licensing Examination using 600 original questions from the 2020 and 2021 exams. The results were impressive.

GPT-4o achieved 84.2% accuracy on the 2020 exam and 88.2% on the 2021 exam. It outperformed both GPT-4 and GPT-3.5 by a statistically significant margin (P < 0.001). On digestive system questions specifically, it hit 94.75% accuracy.
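For readers curious what backs a claim like P < 0.001, here is a minimal sketch of a two-proportion z-test. The question counts and the comparison model's 70% accuracy are hypothetical placeholders, not the study's raw data, and the study's actual statistical method may differ.

```python
from statistics import NormalDist

def two_proportion_p(correct_a: int, n_a: int, correct_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both models have equal accuracy."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical counts: ~84.2% vs an assumed 70% on 300 questions each.
print(two_proportion_p(253, 300, 210, 300))  # ~3e-5, well under 0.001
```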

This creates an odd tension. How does a model that scores 14 on a general intelligence index ace a rigorous medical licensing exam? The answer likely lies in what different benchmarks measure. The Artificial Analysis index aggregates across diverse tasks, while the Chinese NMLE is a specialized domain test. GPT-4o might be genuinely stronger in focused knowledge areas even if it loses ground on broad reasoning tasks.

What We Still Do Not Know

The biggest problem right now is the data gaps. There are no independent, direct benchmark comparisons between GPT-4o and GPT-4 Turbo on standard tests like MMLU or HumanEval. We also lack token throughput and time-to-first-token figures for GPT-4 Turbo and Gemini 1.5 Pro, which leaves speed comparisons incomplete. Pricing data for those competing models is similarly absent from the independent evaluations consulted here. Without these numbers, any conclusion about GPT-4o's relative value remains partial at best.

The Bigger Question

GPT-4o forces an uncomfortable conversation about how we evaluate AI models. A single aggregate intelligence score can obscure real strengths in specific domains. At the same time, paying premium prices for below-average general benchmark performance is hard to swallow for cost-conscious developers. The real test will come when independent evaluators fill in the missing comparisons against GPT-4 Turbo and Gemini 1.5 Pro. Until then, the numbers tell a split story: fast and brilliant in narrow domains, expensive and underwhelming in aggregate. What does your workload actually prioritize: raw speed and domain depth, or broad reasoning at a lower price?
