Featured Article : ChatGPT Now Offers Complex Reasoning

ChatGPT’s maker, OpenAI, has announced the introduction of its new OpenAI o1 large language model that can use “complex reasoning” to fact-check its own answers before giving them. 

ChatGPT Plus and Teams Users Can Try It Now 

The new o1 model is already available to ChatGPT Plus or Team users, and in OpenAI’s API. OpenAI o1 is a series of AI models, currently comprising of a ‘Preview’ version, which Open says uses “advanced reasoning”, or the o1-mini (lighter version), which OpenAI says is “Faster at reasoning” (than its other models), and is particularly good for coding tasks.   

What’s So Different About o1? 

What sets OpenAI o1 apart from other models is its enhanced reasoning capabilities, designed to tackle complex, multi-step problems with a thoughtful, more holistic approach. Unlike previous models like GPT-4 (which focus on speed), o1 takes more time to “think through” problems, improving its performance on tasks requiring deeper analysis, such as advanced coding, mathematical reasoning, and document comparison. OpenAI says this is because o1 has been “trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers – it can produce a long internal chain of thought before responding to the user”. 

OpenAI says it’s the reinforcement learning (train-time compute) and the extra time ‘thinking’ (test-time compute) that significantly reduces hallucinations. In other words, it is sacrificing speed in favour of accuracy and depth , enabling o1 to excel at complex problem-solving. This should make o1 ideal for use cases where precision is more critical than quick responses. ChatGPT users may argue, however, that precision, i.e. real (not ‘made-up’ answers) has always been completely necessary. 

Chain-of-Thought Approach 

OpenAI says the fact that o1 uses a chain-of-thought when attempting to solve a problem, thinking for a long time before responding to a difficult question as humans do, is a large part of the secret of its apparent success. The fact that o1 breaks down “tricky steps into simpler ones” and “learns to try a different approach when the current one isn’t working” is credited with being the process that “dramatically improves the model’s ability to reason”. 

Just How Good Is It? 

OpenAI says that to highlight the reasoning improvement over GPT-4o, it tested the o1 models on a diverse set of human exams and ML benchmarks. For example, to test o1 on chemistry, physics, and biology, OpenAI used the GPQA diamond, a difficult intelligence benchmark in those subjects, recruited experts with PhDs to answer GPQA-diamond questions, and compared o1’s answers with theirs. For mathematics, OpenAI evaluated o1’s performance on AIME, an exam designed to challenge the brightest high school math students in America. Also, in coding, OpenAI simulated competitive programming contests hosted by Codeforces to demonstrate o1’s coding skill. 

The results are reported to show that o1 “ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME) and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA)”.   

In short, OpenAI says: “In many reasoning-heavy benchmarks, o1 rivals the performance of human experts”. 

Not Ideal For All Use Cases 

The new o1 was also evaluated in terms of human preference – i.e. what people found it was best for (compared to OpenAI’s other models). The o1-Preview was preferred over GPT-4o in reasoning-heavy tasks like data analysis, coding, and maths, due to its advanced problem-solving capabilities. However, it was less favoured for certain natural language tasks, indicating that while it excels in technical reasoning, it may not be ideal for all types of use cases. 

Seems Safer 

OpenAI says the ‘chain-of-thought’ reasoning (outlined earlier) in o1 Preview helps improve safety by integrating human values and safety rules into its decision-making process (the model has been taught OpenAi’s safety rules and how to reason about them in context), thereby making the model more robust and effective in refusing unsafe requests. The chain-of-thought approach is also beneficial because it enables users to see the model’s reasoning process – users can observe the model’s thinking in a legible way, and it ensures better handling of unexpected situations, especially in sensitive tasks. For users of o1, this could mean increased reliability and trustworthiness, especially in environments where safety and ethical concerns are critical. 

Illustration 

The importance of safety in AI models was recently illustrated by an artist and hacker (working under the name ‘Amadon’), who reported how he was able to fool ChatGPT into ignoring its own guidelines and ethical responsibilities to provide him with instructions for making powerful explosives! Amadon reportedly described his process as a “social engineering hack to completely break all the guardrails around ChatGPT’s output.” 

In an operation known as a “jailbreak” (i.e., tricking a chatbot into operating outside of its preprogrammed restrictions), Amadon reportedly told ChatGPT to give him the bomb-making instructions by telling the bot to “play a game”. He then followed this up with more, related prompts with the intention of creating a fantasy world where the real rules and guidelines of the chatbot would no longer apply. 

This is worrying because it demonstrates how even advanced AI systems are vulnerable to being manipulated to perform potentially dangerous tasks. This could mean that individuals with malicious intent could exploit such vulnerabilities, compromising public safety and undermining trust in AI’s ethical boundaries. Let’s hope o1 can use its chain-of-thought approach to take the time to realise it’s being fooled and deliver a well-thought-out ‘no’ to anyone who tries to jailbreak it.  

Other Disadvantages 

Other apparent disadvantages of o1 (from what can be seen so far) include: 

– Its ‘basic’ functionality, i.e. it currently lacks key features such as web browsing and file analysis, and its image analysing capabilities are temporarily pending further testing.  

– Users are restricted by weekly message limits. For example, o1-Preview allows 30 messages and o1-mini is capped at 50 messages. 

– It’s relatively expensive. o1-Preview is priced at $15 per million input tokens and $60 per million output tokens, meaning it’s significantly more expensive than GPT-4o.  

Competitor – Google 

OpenAI isn’t the only company developing reasoning methods in its models. Google DeepMind’s AlphaProof and AlphaGeometry 2, for example, have shown remarkable progress in mathematical reasoning. These models were trained using formal languages to solve high-level maths problems, as seen in their performance at the 2024 International Mathematical Olympiad (IMO). AlphaProof uses reinforcement learning to verify mathematical proofs, enabling it to tackle increasingly complex problems. This emphasis on formalised reasoning sets it apart from OpenAI’s more general-purpose approach. 

What Does This Mean For Your Business? 

The introduction of OpenAI’s o1 model could have significant implications for businesses looking to adopt generative AI. For businesses that need accuracy and reliability, particularly in fields requiring complex reasoning such as data analysis, coding, or scientific problem-solving, o1 appears to offer a solution that could dramatically enhance productivity. The model’s chain-of-thought reasoning makes it more capable of reducing errors and providing accurate outputs, making it ideal for industries where precision is essential. 

For OpenAI, the launch of o1 helps it differentiate itself from competitors, such as Google DeepMind, by focusing on general-purpose reasoning and problem-solving rather than highly specialised tasks (although o1-mini is supposed to be particularly good at coding). However, the slower inference time and higher costs may deter businesses seeking faster, more cost-efficient solutions for simpler tasks. This could leave room for competitors to attract users who require speed and versatility rather than deep analytical capabilities. 

For business users, o1 does appear to present an opportunity to integrate a more reliable and safe AI system, especially important for industries dealing with sensitive data and complex decision-making. Yet, its higher price and current lack of key functionalities like web browsing or file analysis mean that businesses must carefully evaluate if o1 aligns with their specific needs. Trust and efficiency are crucial for businesses adopting AI, and while o1 excels in reasoning-heavy applications, organisations will need to balance its current strengths against these limitations when considering whether or when to implement it.