In last week’s post, “The botterfly effect”, I used the phrase “the slave becomes the master”. This subtle nod to Metallica’s “The End of the Line” highlighted how tools meant to help us end up directing us instead.
I toyed with another line for last week’s post: “The student becomes the master.” But I dropped it. It didn’t fit because my message was not about training generative AI systems. Plus, as recently as last week, I believed that generative AI systems can only be as good—on average—as the average of the expertise in their training dataset. The student could become the master only if the master were average.
Or was I wrong about it?
“Median human” performance of Generative AI. What is it?
Sam Altman (love him or hate him, people listen to him) used the term “median human” to describe the expected quality of OpenAI’s systems: better than half the population, worse than the other half. Perfectly average1.
This makes sense. Generative AI models are trained on human-created content and learn the conditional probability distribution of that data. According to their internal representation of the data that trained them, whatever they create is the most probable outcome (hence the “median” behaviour). Give them enough data (how about the entire Internet?) and they will learn to create perfectly average Internet content.
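To make the “most probable outcome” point concrete, here is a toy sketch in Python (the vocabulary and probabilities are invented, and this is not how any particular model is implemented): the model has learned a conditional distribution over the next word, and decoding simply favours the mode of that distribution.

```python
import numpy as np

# Toy conditional distribution P(next word | "The capital of France is"),
# standing in for what a model has learned from its training data.
# Vocabulary and probabilities are invented for illustration only.
vocab = ["Paris", "Lyon", "London", "purple"]
probs = np.array([0.85, 0.08, 0.05, 0.02])

# Greedy decoding: emit the single most probable continuation.
# On familiar prompts, this is why the output is the "most average" answer.
print(vocab[int(np.argmax(probs))])   # -> Paris

# Sampling re-introduces the less probable options, which is where the
# occasional odd output comes from.
rng = np.random.default_rng(0)
print(rng.choice(vocab, p=probs))
```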
If these very same models encounter a situation they’ve not seen in their training set, they will generate random (less probable) outputs that look “okay” to a non-expert but are utterly crazy to experts.
Not sure what I mean? Look at this chess game between GPT-3 and Stockfish, a proper chess engine (not a large language model). This year-old example starts as a perfectly average game until, several moves in, the language model encounters something it has never seen before and starts behaving very randomly (pay attention to one of the black knights).
My point? Generative AI systems are not only median (a synonym here is “mediocre”) but also start behaving unpredictably precisely when we have built enough trust in them (by watching their past performance).
The game of chess demonstrates it well. To a non-player, it looks like a perfectly decent game. To a beginner, the first several moves might even look smart. To an expert, the system starts as average, only to descend into madness.
This is also a good example of how, for specific tasks, there are perfectly capable classical algorithms that outperform generative AI (Stockfish was released 15 years ago). Before using generative AI to solve a problem, ask if there’s a better way.
So, will generative AI models remain “average”? Or will they ever be able to outperform the human data that trained them?
The smartest person in the model is the model
I came across a paper titled “Transcendence: Generative Models Can Outperform The Experts That Train Them” by Edwin Zhang and colleagues from Harvard, UC Santa Barbara, and Princeton. It was released earlier this week. The study explores how generative models, when trained under specific conditions, can achieve capabilities surpassing the expertise they are trained on.
The authors built a system called ChessFormer (a “chess transformer”—the same transformer technology that hides behind the last “T” in “ChatGPT”), trained on human chess game transcripts. When evaluated under specific conditions such as low-temperature sampling (a trick to make model outcomes less random—typically bad for creativity but good for precision-demanding tasks), the model outperformed the highest-rated human players in the training dataset. In other words, the model became smarter than its teachers.
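If you are curious what “low-temperature sampling” means mechanically, here is a minimal sketch of the general technique (the logits are made up; this is not the paper’s code): the model’s raw scores are divided by a temperature before being turned into probabilities, and a temperature below 1 pushes the choice towards the single most probable move.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Temperature-scaled sampling: T < 1 sharpens the distribution
    (more deterministic), T > 1 flattens it (more random)."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()                  # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(logits), p=probs)

# Invented logits for three candidate chess moves.
logits = np.array([2.0, 1.5, 0.5])
rng = np.random.default_rng(42)

for T in (1.0, 0.1):
    picks = [sample_with_temperature(logits, T, rng) for _ in range(1000)]
    print(T, np.bincount(picks, minlength=3) / 1000)
# At T=1.0 the second-best move is still played roughly a third of the time;
# at T=0.1 nearly every pick collapses onto the top move.
```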
This phenomenon of outperforming one’s teachers, which the authors termed “transcendence,” suggests that generative AI systems can surpass individual expert performance by leveraging collective human expertise and minimising errors through the AI equivalent of majority voting.
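The majority-voting intuition can be shown with a toy calculation (the numbers are invented; this only illustrates the idea, not the paper’s experiment): each expert favours a different mistake, the model trained on all of them learns roughly the average of their move distributions, and low-temperature sampling then concentrates on the consensus move that no single expert reliably played.

```python
import numpy as np

# Three hypothetical experts. The best move is index 0, yet experts A and B
# each favour a different mistake, and C finds the best move only 40% of the time.
experts = np.array([
    [0.40, 0.50, 0.10],   # expert A: most often plays mistake #1
    [0.40, 0.10, 0.50],   # expert B: most often plays mistake #2
    [0.40, 0.30, 0.30],   # expert C: best move, but only 40% of the time
])

# A model trained on all of their games learns roughly the average distribution.
mixture = experts.mean(axis=0)        # -> [0.40, 0.30, 0.30]

# Low-temperature sampling (T = 0.1) sharpens that mixture towards its mode.
sharpened = mixture ** (1 / 0.1)
sharpened /= sharpened.sum()
print(np.round(sharpened, 3))         # -> [0.899, 0.051, 0.051]
# The model now plays the best move ~90% of the time, although none of the
# experts it learned from played it more than 40% of the time.
```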
This research underscores the potential for generative AI to not just mimic but exceed human expertise in specific domains. For now, it’s chess, but I am certain studies of other domains will follow. I acknowledge that this is just one paper so far, and it’s not peer-reviewed yet. Still, perhaps we should start questioning our perception of the role of generative AI in our creative and decision-making processes.
Let me spell it out: we’re starting to see early evidence of the ability of generative AI models to exceed the expertise from their training data. This is new in generative AI. And it’s eerily similar to what David Weinberger wrote twelve years ago about humans: “The smartest person in the room is the room”. Collective intelligence exceeds the intelligence of any individual.
In 2024, “The Smartest Person in the Model is the Model”.
But wait, there’s a catch.
The research uncovers an interesting insight into human expertise. Only models trained on diverse datasets, encompassing a wide range of player ratings and styles, performed significantly better than human experts. Such diversity allows the model to generalise and improve upon the individual performance of its trainers by minimising biases and errors.
Without enough human diversity, the model’s ability to outperform its training data dramatically diminishes.
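The flip side is easy to see with the same toy calculation (again, invented numbers): if the experts all share the same blind spot, their errors are correlated, the average keeps the mistake, and low-temperature sampling only entrenches it.

```python
import numpy as np

# Three hypothetical experts who all favour the same wrong move (index 1).
experts = np.array([
    [0.30, 0.60, 0.10],
    [0.30, 0.55, 0.15],
    [0.30, 0.65, 0.05],
])
mixture = experts.mean(axis=0)        # -> [0.30, 0.60, 0.10]
sharpened = mixture ** (1 / 0.1)      # low-temperature sharpening, T = 0.1
sharpened /= sharpened.sum()
print(np.round(sharpened, 3))         # -> [0.001, 0.999, 0.0]
# The shared mistake dominates: with correlated errors there is no majority
# to out-vote it, so the model cannot transcend its teachers.
```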
What does it mean for business?
Do not expect generic models, such as ChatGPT, to outperform experts any time soon. These systems will continue to provide “median human” quality. My chess video will continue to be relevant. But that’s ok—not every model needs to transcend. Sometimes, all you need is the equivalent of the first few moves in chess, even if they’re not world-class.
To replicate your current organisational expertise, you might not need generative AI at all. Stockfish (an old-school algorithm that no one would call AI these days) reminded us of that; it will take a long time for generative AI systems to beat it. There’s also a whole category of AI called expert systems, which focuses on replicating human expertise directly. An expert system might be all you need.
Train your own models if you need them to reach higher expertise levels than your human team members. These models will not be generic, though. A model trained on legal documents cannot write poems—leave that task to ChatGPT.
Transcendence seems possible only if you train the models on diverse datasets. If your human team is not diverse enough, don’t expect a generative AI system trained on its work to outperform it.
Wait, what if we trained future models on the outputs of these new, transcendent models?
I might need to give Ray Kurzweil a call.
Yup, I know the difference between median and average; thanks for checking.