5 Surprising Truths About AI That Signal the End of the 'Bigger is Better' Era

Introduction: The Gospel of Scale #

For the better part of a decade, the dominant belief in our field has been that progress follows a simple, if expensive, formula: scale up the model size and the training data. This idea was so powerful it was formalized into what computer scientist Rich Sutton called the "bitter lesson," which posits that brute-force compute has consistently and decisively outperformed more nuanced attempts to build in human domain knowledge. More data and more processing power always seemed to win.

But that era of predictable gains from scaling is coming to an end. A critical and far more interesting shift is underway, challenging the core assumption that bigger is guaranteed to be better. We're seeing a fundamental shift from brute force to intelligent design, where algorithmic efficiency, data quality, and clever inference-time strategies are overtaking raw parameter counts as the true drivers of progress. Here are five surprising truths that show why.

1. Small Models Are Now Outperforming Giants #

The most direct challenge to the "bigger is better" mantra is the surprising trend of smaller, more compact models consistently outperforming their much larger predecessors. This isn't just a minor improvement; it's a fundamental disruption.

Consider these recent examples:

Llama-3 8B, a model with 8 billion parameters, now outperforms Falcon 180B, a model more than 22 times its size.
Aya 23 8B and Aya Expanse 8B both outperform the massive BLOOM 176B model, despite having only 4.5% of its parameters.

This isn't an isolated phenomenon. Data from the Open LLM Leaderboard reveals a striking trend: since early 2023, there has been a surge in high-performing small models (under 13 billion parameters) that achieve better scores than much larger models. This isn't just about small models getting better; it's about the diminishing returns of compute and the rising impact of optimization breakthroughs that define the rate of return for every dollar spent on training. This directly undermines the foundational assumption that has driven AI investment and research for years—that a model's capability is primarily a function of its parameter count. But the story gets stranger: it turns out that even in the largest models, most of those parameters are surprisingly disposable.

2. Our Giant Models Are Full of Redundant 'Junk' #

Here's a counter-intuitive fact about large AI models: after they are fully trained, the vast majority of their learned weights can be removed—a process called pruning—with minimal loss in performance. This raises a central puzzle: If these weights can be removed at the end of training without consequence, why are they necessary during the training process in the first place?

Current evidence suggests that starting training with a smaller network from the outset fails to achieve the same performance, indicating that our learning techniques are highly inefficient. Think of it like needing a massive, multi-gigabyte compiler toolchain just to produce a 10-kilobyte binary. The final product is efficient, but the process to get there is incredibly wasteful. It tells us our core training algorithms, not the models themselves, are the problem. Research by Denil et al. has shown that a small subset of a model's weights can be used to predict 95% of the other weights, highlighting a massive degree of correlation and redundancy. This suggests that massive model size isn't a pure measure of a model's necessary complexity, but rather a costly crutch for inefficient and unstable optimization methods.

3. Smart Algorithms Are the New Brute Force #

If giant models are bloated with redundant weights (Truth #2), it stands to reason that progress must come from somewhere else. That "somewhere else" is algorithmic innovation. For years, our field was guided by the principle that scale could solve almost any problem.

"When in doubt, use brute force.” This was formalized as the “bitter lesson” by Rich Sutton who posited that computer science history tells us that throwing more compute at a problem has consistently outperformed all attempts to leverage human knowledge of a domain to teach a model.

Today, that tide is turning. A host of new techniques have emerged that dramatically improve model performance without requiring larger models or more expensive training runs. These algorithmic breakthroughs include:

Instruction finetuning: Teaching models to follow instructions more effectively after initial pre-training.
Model distillation: Using large, powerful "teacher" models to train smaller, highly capable "student" models on synthetic data.
Chain-of-thought reasoning: Prompting models to break down complex problems into a series of intermediate steps.
Increased context length: Expanding the amount of information a model can consider at once.
Retrieval-Augmented Generation (RAG): Allowing models to pull in information from external knowledge bases to answer questions.
Preference training: Aligning models with human feedback to produce more helpful and harmless responses.

These techniques can deliver significant performance gains for the same level of compute, effectively compensating for the need for ever-larger models. Leveraging human insight through smarter algorithms is becoming the new brute force.

4. "Scaling Laws" Aren't Really Laws #

For years, "scaling laws" have been used to justify massive capital investments in AI. These "laws" are attempts to create a formula that predicts how a model's performance will improve with more compute, data, or parameters. However, their predictive power has proven to be surprisingly limited.

A critical limitation is that while scaling laws can reasonably predict a model's pre-training test loss (its ability to guess the next word in a sentence), the results are often "murky or inconsistent" when predicting performance on actual downstream tasks that users care about. The very term "emergent properties"—used to describe sudden new capabilities that appear at scale—is an implicit admission that our scaling laws fail to predict what's coming next.

Furthermore, many of these so-called laws are based on analyses of fewer than 100 data points (i.e., trained models), a sample size that often lacks strong statistical support. Chasing scaling laws is a recipe for getting outmaneuvered. It leads labs to pour concrete down a single path while nimbler teams are already exploring more fertile ground.

5. The Frontier Is Moving Beyond Training #

The most exciting work in AI is shifting from a singular focus on training-time compute to new, more efficient levers for progress that happen outside of the traditional training loop. This opens up entirely new optimization spaces for us as developers.

Two of the most important new frontiers are:

Gradient-Free Exploration: Performance can be significantly improved by spending more compute at inference time—the moment the model is actually being used. Techniques like enabling tool use, running multiple models as agents, and using advanced search methods can boost capabilities without altering the model's weights at all.
A Malleable Data Space: The low cost of high-quality synthetic data means we can now treat the training dataset itself as something to be optimized. Instead of relying on static, pre-existing datasets, we can generate and steer data toward desirable properties, fill in knowledge gaps, and make previously underrepresented concepts more visible to the model.

For those of us building in this space, the message is clear: the most valuable work is no longer just about training bigger models, but about architecting smarter systems around them.

Conclusion: It’s an Interesting Time to Build Again #

The era of compute as a silver bullet is over. We now know that small models can beat giants (1), our largest models are full of junk (2), and that smart algorithms (3) are creating more value than brute force. With unreliable "scaling laws" (4) and a new frontier opening up beyond training (5), the field is more dynamic and interesting than ever.

This makes the landscape of AI more unpredictable and more open to genuine innovation. We can finally stray from the beaten path of simply throwing more compute at every problem. The focus is shifting from building bigger to building better, opening the door for a new generation of breakthroughs.

“We can only see a short distance ahead, but we can see plenty there that needs to be done.”

last updated: 2026-01-07