Summary: A new research paper argues that overtraining large language models, long considered wasteful, becomes compute-optimal when paired with test-time scaling. This shifts how AI labs should think about balancing pretraining investment and inference-time compute.
Overtraining an AI model has generally been seen as wasteful. The conventional logic was straightforward: train until loss stops dropping, then stop. Going further just burned GPU hours with little to show for it. But a paper published in April 2026 flips that assumption, and it could reshape how labs allocate their compute budgets.
The Old Approach to Pretraining
Pretraining scaling laws, such as the well-known Chinchilla laws, have long guided how researchers allocate compute across model size and training tokens. These frameworks recommend a specific ratio between parameters and data for a given compute budget.
Overtraining sat outside that recommended ratio. It meant feeding a model more tokens than the scaling laws prescribed for its size. The extra training seemed to produce diminishing returns, so most runs were tuned to hit the compute-optimal frontier and then stopped.
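As a rough illustration of what "the recommended ratio" means in practice, here is a minimal sketch using the widely cited Chinchilla rule of thumb of roughly 20 training tokens per parameter and the standard C ≈ 6·N·D approximation of training compute. The constants and the 5x overtraining multiplier are illustrative, not taken from the paper discussed here.

```python
# Sketch of the Chinchilla-style compute-optimal rule of thumb:
# roughly 20 training tokens per parameter, with training FLOPs
# approximated as C ~ 6 * N * D. All numbers are illustrative.

TOKENS_PER_PARAM = 20  # common Chinchilla rule-of-thumb ratio

def optimal_tokens(n_params: float) -> float:
    """Compute-optimal token count for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard C ~ 6*N*D approximation of training compute."""
    return 6 * n_params * n_tokens

n = 70e9                    # a 70B-parameter model
d_opt = optimal_tokens(n)   # compute-optimal token budget
d_over = 5 * d_opt          # a 5x "overtrained" run, past the frontier
print(f"compute-optimal tokens: {d_opt:.2e}")
print(f"overtrained run cost:   {train_flops(n, d_over):.2e} FLOPs")
```

Under this accounting, any tokens beyond `d_opt` were traditionally treated as compute spent past the point of best return.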
Test-Time Scaling Changes the Equation
The new paper, called Train-to-Test (T²T²), introduces a variable that older scaling laws overlooked: test-time compute. This refers to the extra processing an AI model uses when generating each answer, such as through repeated sampling, where multiple candidate outputs are generated and the best one is selected.
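The repeated-sampling idea can be sketched as a best-of-N loop: generate several candidates, score each, keep the best. The sketch below uses placeholder `sample_answer` and `score` functions standing in for a real model and verifier; both are hypothetical, not part of the paper's method.

```python
import random

# Minimal best-of-N repeated sampling sketch. sample_answer() stands in
# for one stochastic model generation; score() stands in for a verifier
# or reward model. Both are toy placeholders.

def sample_answer(prompt: str, rng: random.Random) -> str:
    """Stand-in for one stochastic generation from a model."""
    return f"{prompt} -> candidate {rng.randint(0, 9)}"

def score(answer: str) -> float:
    """Stand-in for a verifier/reward model scoring a candidate."""
    return float(answer.rsplit(" ", 1)[-1])

def best_of_n(prompt: str, n: int, seed: int = 0) -> str:
    """Spend n generations of test-time compute, return the best candidate."""
    rng = random.Random(seed)
    candidates = [sample_answer(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("2+2?", n=8))
```

Raising `n` is exactly the test-time-compute knob the paper treats as a variable alongside pretraining compute: each extra sample costs inference FLOPs and buys another chance at a better answer.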
The researchers found that overtraining and test-time scaling are not independent: they interact in a way that fundamentally reshapes the compute-optimal frontier. When you plan to spend compute at inference, overtraining during pretraining becomes efficient.
A model trained longer than standard scaling laws recommend will perform better per unit of inference compute than a model trained to the old optimal point. The extra pretraining investment pays off at inference, not during training itself.
Why the Interaction Matters
The paper shows that performance gains from test-time scaling depend heavily on how well-pretrained the model is. A model that has not been trained enough benefits less from inference compute than an overtrained one. So if your system will use test-time compute, that expectation should shape your pretraining strategy from the start.
T²T² frames this as a unified optimization problem. Rather than optimizing pretraining in isolation, you optimize the full pipeline, pretraining plus inference, and the optimal solution to that combined problem falls squarely in the overtraining regime, well outside the range that standard pretraining scaling laws recommend.
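The joint problem can be illustrated with a toy grid search over how a fixed total budget is split between pretraining and inference. The `perf` function below is a made-up stylized model, not the paper's fitted scaling law: it assumes more pretraining both raises base quality and amplifies the payoff of inference compute, which is the qualitative interaction the paper describes.

```python
import math

# Toy joint optimization of a fixed compute budget split between
# pretraining and inference. perf() is an invented stand-in, NOT the
# paper's fitted law: the second term models the claimed interaction,
# where pretraining investment amplifies test-time gains.

def perf(train_frac: float, total_budget: float = 1.0) -> float:
    train_c = train_frac * total_budget          # pretraining share
    infer_c = (1 - train_frac) * total_budget    # inference share
    base = math.log1p(10 * train_c)              # base model quality
    boost = train_c * math.log1p(10 * infer_c)   # interaction term
    return base + boost

# Grid search over pretraining fractions from 1% to 99% of the budget.
best = max((f / 100 for f in range(1, 100)), key=perf)
print(f"best pretraining share of budget: {best:.2f}")
```

In this toy, the optimum lands well above a 50/50 split, mirroring the article's point that planning for test-time compute pushes the optimal allocation toward heavier pretraining.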
What This Means for AI Labs
The researchers validated their findings across eight downstream tasks. They pretrained heavily overtrained models in the regime their scaling laws predicted would be optimal, and those models substantially outperformed models trained under standard scaling rules alone. Crucially, the effect survived post-training stages like fine-tuning, which means these results matter for real-world deployments, not just theoretical exercises.
The broader takeaway is that AI scaling is no longer a single-variable problem. Pretraining and inference are deeply connected, and optimizing them in isolation leads to suboptimal outcomes. For labs planning systems that use test-time compute, shifting resources toward longer pretraining could deliver better results for the same total compute budget.
The old rule said stop when the model stops improving. The new rule says that depends entirely on what you plan to do after training ends. If test-time scaling is part of your pipeline, overtraining might be the smartest move you can make.