The end of 2024 saw a reckoning for artificial intelligence, with industry players fearing a slowdown in progress towards even smarter AI. But OpenAI’s o3 model, announced last week, has sparked a new wave of enthusiasm and debate, and suggests that big improvements are still to come in 2025 and beyond.
This model, announced for safety testing with external researchers but not yet released publicly, achieved an impressive score on the important ARC-AGI benchmark. The benchmark was created by François Chollet, a renowned AI researcher and creator of the deep learning framework Keras, and is specifically designed to measure a model’s ability to handle novel tasks it has not been trained on. As such, it provides a meaningful measure of progress toward truly intelligent AI systems.
Notably, o3 scored 75.7% on the ARC-AGI benchmark under standard compute conditions and 87.5% using high compute, significantly outperforming previous state-of-the-art results, such as the 53% achieved with Claude 3.5.
This achievement by o3 represents a surprising advance, according to Chollet, who had been a critic of the capacity of large language models (LLMs) to achieve this type of intelligence. It highlights innovations that could accelerate progress toward higher intelligence, whether or not we call it artificial general intelligence (AGI).
AGI is a buzzword and poorly defined, but it signals a goal: intelligence capable of adapting to new challenges or questions in ways that exceed human capabilities.
OpenAI’s o3 addresses specific reasoning and adaptability hurdles that have long stymied large language models. At the same time, it reveals challenges, including the high costs and efficiency bottlenecks inherent in pushing these systems to their limits. This article will explore five key innovations behind the o3 model, many of which are underpinned by advances in reinforcement learning (RL). It will draw on ideas from industry leaders, OpenAI’s claims and, above all, Chollet’s important analysis, to explain what this advance means for the future of AI in 2025.
The five fundamental innovations of o3
1. “Program synthesis” for task adaptation
OpenAI’s o3 model introduces a new capability called “program synthesis,” which allows it to dynamically combine elements learned during pre-training (specific patterns, algorithms or methods) into new configurations. These elements may include mathematical operations, code snippets, or logical procedures that the model has encountered and generalized during its extensive training on diverse datasets. Most importantly, program synthesis allows o3 to tackle tasks it has never seen directly in training, such as solving advanced coding challenges or novel logic puzzles that require reasoning beyond the rote application of learned information. François Chollet describes program synthesis as the ability of a system to recombine known tools in innovative ways, much like a chef preparing a unique dish from familiar ingredients. This capability marks a departure from previous models, which primarily retrieved and applied pre-learned knowledge without reconfiguration, and it is also one that Chollet advocated months ago as the only viable path to better intelligence.
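OpenAI has not published details of how o3 implements program synthesis, but the core idea of recombining known building blocks to fit a novel task can be illustrated with a toy search over hand-written primitives. Everything below (the primitive names, the grid task) is invented for illustration, not drawn from o3 itself:

```python
from itertools import product

# Hypothetical "building blocks" a model might have generalized during training.
PRIMITIVES = {
    "reverse":   lambda g: g[::-1],                          # flip rows
    "transpose": lambda g: [list(r) for r in zip(*g)],       # swap rows/columns
    "increment": lambda g: [[x + 1 for x in row] for row in g],
}

def synthesize(examples, max_depth=3):
    """Brute-force search over compositions of primitives that map every
    input grid to its output grid: a toy stand-in for program synthesis
    over learned building blocks."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(g, names=names):
                for n in names:
                    g = PRIMITIVES[n](g)
                return g
            if all(program(i) == o for i, o in examples):
                return names  # the recombined "program" that fits the task
    return None

# A "novel" task the primitives were never written for:
# each output is the input transposed, then row-reversed.
examples = [([[1, 2], [3, 4]], [[2, 4], [1, 3]])]
```

Calling `synthesize(examples)` finds the composition `("transpose", "reverse")`, a configuration of known tools that was never stored as a unit, which is the spirit of the recombination Chollet describes.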
2. Searching for programs in natural language
At the heart of o3’s adaptability is its use of Chains of Thought (CoT) and a sophisticated search process that takes place during inference, when the model actively generates answers in a live or deployed environment. These CoTs are step-by-step natural language instructions that the model generates to explore solutions. Guided by an evaluation model, o3 actively generates multiple solution paths and evaluates them to determine the most promising option. This approach reflects human problem solving, where we think about different methods before choosing the best solution. For example, in mathematical reasoning tasks, o3 generates and evaluates alternative strategies to arrive at precise solutions. Competitors like Anthropic and Google have experimented with similar approaches, but OpenAI’s implementation sets a new standard.
3. Evaluator model: a new type of reasoning
During inference, o3 generates multiple solution paths and evaluates each one using a built-in evaluator model to determine the most promising option. By training the evaluator on expert-labeled data, OpenAI ensures that o3 develops a strong ability to reason through complex, multi-step problems. This feature allows the model to act as a judge of its own reasoning, bringing large language models closer to the ability to “think” rather than simply respond.
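OpenAI has not disclosed the mechanics of o3’s search-and-evaluation loop, but the general pattern described in the last two sections (sample several candidate chains of thought, then let an evaluator pick the best) can be sketched as a best-of-n loop. Here `generate_cot` and `evaluate` are illustrative stand-ins, not real APIs:

```python
import random

# Toy "reasoning steps": arithmetic operations on a running value.
OPS = {"+3": lambda v: v + 3, "*2": lambda v: v * 2, "-1": lambda v: v - 1}

def generate_cot(problem, rng):
    """Stand-in for an LLM sampling one chain of thought: a random
    sequence of steps applied to the starting value."""
    steps = [rng.choice(list(OPS)) for _ in range(4)]
    value = problem["start"]
    for step in steps:
        value = OPS[step](value)
    return {"steps": steps, "answer": value}

def evaluate(problem, cot):
    """Stand-in for a learned evaluator model: score a candidate chain
    by how close its final answer lands to the target."""
    return -abs(cot["answer"] - problem["target"])

def best_of_n(problem, n=32, seed=0):
    """Sample n chains of thought and keep the one the evaluator ranks
    highest: the search-plus-evaluation loop described above."""
    rng = random.Random(seed)
    candidates = [generate_cot(problem, rng) for _ in range(n)]
    return max(candidates, key=lambda c: evaluate(problem, c))

problem = {"start": 2, "target": 13}
best = best_of_n(problem)
```

The design choice the sketch highlights is that quality comes from spending inference-time compute on many candidates plus a judge, rather than from a single forward pass, which is also why this approach is expensive.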
4. Run your own programs
One of the most revolutionary features of o3 is its ability to run its own chains of thought as adaptive problem-solving tools. Traditionally, CoTs are used as step-by-step reasoning frameworks for solving specific problems. OpenAI’s o3 extends this concept by leveraging CoTs as reusable building blocks, allowing the model to approach new challenges with greater adaptability. Over time, these CoTs become structured records of problem-solving strategies, much like how humans document and refine their learning through experience. This ability demonstrates how o3 pushes the limits of adaptive reasoning. According to OpenAI engineer Nat McAleese, o3’s performance on never-before-seen programming challenges, such as achieving a Codeforces rating above 2700, demonstrates its innovative use of CoTs to compete with top competitive programmers. A rating of 2,700 places the model at “Grandmaster” level, among the best competitive programmers in the world.
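How o3 stores and reuses chains of thought internally is not public. The sketch below only illustrates the general idea of a retrievable library of past reasoning strategies; the `CoTLibrary` class, its feature tags, and the stored steps are all invented for illustration:

```python
class CoTLibrary:
    """Toy sketch of treating past chains of thought as reusable building
    blocks: store each solved chain under descriptive feature tags, then
    retrieve the closest match to seed an attempt on a new problem."""

    def __init__(self):
        self.records = []  # list of (feature_set, reasoning_steps) pairs

    def add(self, features, steps):
        self.records.append((set(features), steps))

    def retrieve(self, features):
        """Return the stored chain whose feature tags overlap most with
        the new problem's features (None if the library is empty)."""
        features = set(features)
        if not self.records:
            return None
        return max(self.records, key=lambda r: len(r[0] & features))[1]

lib = CoTLibrary()
lib.add({"sorting", "lists"}, ["split input", "sort halves", "merge"])
lib.add({"graphs", "paths"}, ["build adjacency", "run BFS"])
```

A new problem tagged `{"graphs", "cycles"}` would retrieve the BFS chain as a starting strategy even though that exact problem was never solved before, mirroring the "structured records of problem-solving strategies" idea above.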
5. Program search guided by Deep Learning
During inference, o3 leverages a deep learning-based approach to evaluate and refine potential solutions to complex problems. This process involves generating multiple solution paths and using patterns learned during training to assess their viability. François Chollet and other experts have noted that this reliance on “indirect evaluations,” where solutions are judged based on internal metrics rather than tested in real-world scenarios, can limit the model’s robustness when applied to unpredictable or enterprise-specific contexts.
Additionally, o3’s reliance on expert-qualified datasets to train its evaluator model raises concerns about scalability. Although these datasets improve accuracy, they also require significant human oversight, which can limit the system’s adaptability and cost-effectiveness. Chollet points out that these tradeoffs illustrate the challenges of scaling reasoning systems beyond controlled benchmarks like ARC-AGI.
Ultimately, this approach demonstrates both the potential and limitations of integrating deep learning techniques with programmatic problem solving. While o3’s innovations demonstrate progress, they also highlight the complexity of creating truly generalizable AI systems.
The big challenge for o3
OpenAI’s o3 model achieves impressive results, but at a significant computational cost, consuming millions of tokens per task – and this expensive approach poses the model’s biggest challenge. François Chollet, Nat McAleese and others highlight concerns about the economic feasibility of such models, emphasizing the need for innovations that balance performance and affordability.
The o3 release has attracted attention across the AI community. Competitors such as Google, with Gemini 2.0, and Chinese companies like DeepSeek, with DeepSeek-V3, are also making progress, making direct comparisons difficult until these models are tested more widely.
Opinions on o3 are divided: some praise its technical advancements, while others cite high costs and a lack of transparency, suggesting that its real value will only become clear with broader testing. One of the most prominent critiques came from Google DeepMind’s Denny Zhou, who implicitly criticized the model’s reliance on reinforcement learning (RL) scaling and search mechanisms as a potential “dead end,” arguing instead that a model should be able to learn to reason through a simpler fine-tuning process.
What this means for enterprise AI
Whether or not it represents the ideal direction for further innovation, o3’s new adaptability shows that AI will, one way or another, continue to transform industries, from customer service to scientific research.
Industry players will need some time to digest what o3 has delivered. For companies concerned about o3’s high computational costs, OpenAI’s upcoming scaled-down “o3-mini” version of the model offers a potential alternative. Although it sacrifices some of the full model’s capabilities, o3-mini promises a more affordable option for businesses, retaining much of the core innovation while significantly reducing test-time compute requirements.
It might be a while before businesses can get their hands on the o3 model itself. OpenAI says o3-mini is expected to launch by the end of January, with the full o3 release to follow, although timelines depend on feedback gained during the current phase of safety testing. Companies would be well advised to test it when it arrives: they will want to ground the model in their own data and use cases and see how it actually performs.
But in the meantime, they can already use the many other capable models that are available and well tested, including OpenAI’s flagship GPT-4o and competing models, many of which are already robust enough to build intelligent, personalized applications that deliver practical value.
Indeed, next year we will operate at two speeds. The first is deriving practical value from AI applications, clarifying what today’s models can do with AI agents and other innovations already in hand. The second is sitting back with some popcorn to watch how the intelligence race plays out; any progress there will just be icing on an already delivered cake.
To learn more about o3’s innovations, watch the full discussion on YouTube between me and Sam Witteveen below, and follow VentureBeat for continued coverage of AI advancements.