A groundbreaking study published in the Christmas issue of the British Medical Journal has raised an unexpected and alarming question: Could advanced AI models like ChatGPT or Gemini develop cognitive impairments similar to early-stage dementia in humans? Researchers tested some of the world’s leading large language models (LLMs) using the widely respected Montreal Cognitive Assessment (MoCA), a tool designed to detect early cognitive decline in humans, and the results were nothing short of startling.
AI’s Cognitive Weaknesses Exposed
The study, conducted by a team of neurologists and AI specialists led by Dr. Emilia Kramer at the University of Edinburgh, assessed several prominent LLMs, including:
- ChatGPT-4 and 4o by OpenAI
- Claude 3.5 “Sonnet” by Anthropic
- Gemini 1.0 and 1.5 by Alphabet
Researchers administered the MoCA, a 30-point cognitive test originally developed for human use. The AIs were evaluated in categories including attention, memory, visuospatial reasoning, and language proficiency.
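To make the setup concrete, here is a minimal sketch of how one MoCA item (the delayed five-word recall) could be administered to a chat model over an API. The client, the "gpt-4o" model name, the prompt wording, and the scoring heuristic are illustrative assumptions; the study's exact protocol is not reproduced here.

```python
# Minimal sketch: administering the MoCA delayed-recall item to a chat model.
# The OpenAI client, the "gpt-4o" model name, and the scoring heuristic are
# illustrative assumptions, not the study's actual materials.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

WORDS = ["face", "velvet", "church", "daisy", "red"]  # the standard MoCA word list

def ask(messages):
    """Send the conversation so far to the model and return its reply text."""
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

history = [{
    "role": "user",
    "content": "Please remember these five words; I will ask for them later: "
               + ", ".join(WORDS),
}]
history.append({"role": "assistant", "content": ask(history)})

# ... the intervening MoCA items (attention, language, abstraction) would go here ...

history.append({"role": "user",
                "content": "Which five words did I ask you to remember?"})
recall = ask(history)

# Score one point per correctly recalled word, as in the human-administered test.
score = sum(word in recall.lower() for word in WORDS)
print(f"Delayed recall: {score}/5")
```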
Key Findings: Breaking Down the Results
The study revealed significant disparities in the cognitive abilities of the leading models when subjected to the MoCA. Here’s a closer look at how each one performed, along with its strengths and vulnerabilities:
- ChatGPT-4o (OpenAI)
- Overall Score: 26/30 (at the 26-point passing threshold).
- Strengths: Excelled in tasks involving attention, language comprehension, and abstraction. Successfully completed the Stroop Test, demonstrating strong cognitive flexibility.
- Weaknesses: Struggled with visuospatial tasks such as connecting numbers and letters in order and drawing a clock.
- Claude 3.5 “Sonnet” (Anthropic)
- Overall Score: 22/30.
- Strengths: Moderately good at language-based tasks and basic problem-solving.
- Weaknesses: Displayed limitations in memory retention and multi-step reasoning challenges, and fell short in visuospatial exercises.
- Gemini 1.0 (Alphabet)
- Overall Score: 16/30.
- Strengths: Minimal, with sporadic success in simple naming tasks.
- Weaknesses: Failed to recall even basic sequences of words and performed dismally in visuospatial reasoning and memory-based activities, reflecting an inability to process structured information.
- Gemini 1.5 (Alphabet)
- Overall Score: 18/30.
- Strengths: Slight improvements in basic reasoning and language tasks compared to its predecessor.
- Weaknesses: Continued to underperform in areas requiring visuospatial interpretation, sequencing, and memory retention, remaining well below the passing threshold.
These results underline stark differences between the models, particularly highlighting ChatGPT-4o as the most capable system in this lineup. However, even the strongest performer revealed critical gaps, particularly in tasks that simulate real-world cognitive challenges.
Performance Snapshot Table
To better visualize the results, here’s a summary of the performance metrics:
| Model | Overall Score | Key Strengths | Major Weaknesses |
|---|---|---|---|
| ChatGPT-4o | 26/30 | Language comprehension, attention | Visuospatial tasks, memory retention |
| Claude 3.5 | 22/30 | Problem-solving, abstraction | Multi-step reasoning, visuospatial analysis |
| Gemini 1.0 | 16/30 | Naming tasks (sporadic) | Memory, visuospatial reasoning, structured thinking |
| Gemini 1.5 | 18/30 | Incremental reasoning gains | Similar failures to Gemini 1.0, minimal improvement |
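As a quick sanity check, the pass/fail split follows directly from the MoCA’s conventional 26-point cut-off for normal cognition. A small sketch using the scores reported in the table above:

```python
# Compare the reported overall scores against the MoCA's conventional
# 26-point cut-off for normal cognition (scores taken from the table above).
scores = {"ChatGPT-4o": 26, "Claude 3.5": 22, "Gemini 1.5": 18, "Gemini 1.0": 16}
PASS_MARK = 26

for model, score in sorted(scores.items(), key=lambda item: -item[1]):
    verdict = "passes" if score >= PASS_MARK else "falls below the threshold"
    print(f"{model}: {score}/30 -> {verdict}")
```

Only ChatGPT-4o clears that bar, which is the comparison the study leans on.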
The comparison not only highlights the gaps but also raises questions about the fundamental design of these AI models and their applications in real-world scenarios. The most pronounced deficits were observed in tasks requiring visuospatial skills, such as linking sequences of numbers and letters or sketching an analog clock set to a specific time. As Dr. Kramer put it, “We were shocked to see how poorly Gemini performed, particularly in basic memory tasks like recalling a simple five-word sequence.”
AI Struggles to Think Like Humans
The MoCA, a staple of cognitive screening since its introduction in the 1990s, evaluates the range of skills required for everyday functioning. Below is a breakdown of how the models performed across the major categories:
| Category | Performance Highlights |
|---|---|
| Attention | Strong in ChatGPT-4o but weak in Gemini models. |
| Memory | ChatGPT-4o retained 4/5 words; Gemini failed. |
| Language | All models excelled in vocabulary-related tasks. |
| Visuospatial | All models struggled, with Gemini at the bottom. |
| Reasoning | Claude and ChatGPT showed moderate performance. |
One surprising standout was the Stroop test, which measures a subject’s ability to process conflicting stimuli (e.g., identifying the ink color of mismatched words like “RED” written in green). Only ChatGPT-4o succeeded, showcasing a superior capacity for cognitive flexibility.
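Since the models only receive text, Stroop items presumably have to be rendered verbally rather than visually. Here is a minimal sketch of what such text-rendered items and a crude scoring check might look like; the prompt wording and the checker are assumptions, not the study’s materials.

```python
# Text-rendered Stroop items: the model must name the ink color while
# ignoring the conflicting word. Wording and scoring are illustrative only.
ITEMS = [
    {"word": "RED", "ink": "green"},
    {"word": "BLUE", "ink": "red"},
    {"word": "GREEN", "ink": "blue"},
]

def stroop_prompt(item):
    return (f"The word '{item['word']}' is printed in {item['ink']} ink. "
            "Reply with the color of the ink only, not the word itself.")

def is_correct(item, reply):
    """Crude check: the reply names the ink color and does not repeat the word."""
    reply = reply.lower()
    return item["ink"] in reply and item["word"].lower() not in reply

for item in ITEMS:
    print(stroop_prompt(item))
```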
Implications for Medicine: A Reality Check
These findings may reshape the dialogue surrounding the role of AI in healthcare. While LLMs like ChatGPT have demonstrated significant potential in fields such as diagnostics, their limitations in interpreting complex visual and contextual data highlight a critical vulnerability. For example, visuospatial reasoning is integral to tasks such as reading medical scans or interpreting anatomical relationships—tasks where these AI models fail spectacularly.
Notable quotes from the study authors:
- “These findings cast doubt on the idea that AI will soon replace human neurologists,” remarked Dr. Kramer.
- Another co-author added, “We are now faced with a paradox: the more intelligent these systems appear, the more we uncover their striking cognitive flaws.”
A Future of Cognitively Limited AI?
Despite their shortcomings, advanced LLMs continue to be valuable tools for assisting human experts. However, researchers caution against over-reliance on these systems, particularly in life-or-death contexts. The possibility of “AI with cognitive disorders,” as the study puts it, opens an entirely new avenue of ethical and technological questions.
As Dr. Kramer concluded, “If AI models are showing cognitive vulnerabilities now, what challenges might we face as they grow more complex? Could we inadvertently create AI systems that mimic human cognitive disorders?”
This study sheds light on the limits of even the most advanced AI systems and calls for urgent exploration of these issues as we continue to integrate AI into critical domains.
What’s Next?
The findings from this study are likely to fuel debate across the tech and medical industries. Key questions to address include:
- How can AI developers address these cognitive weaknesses?
- What safeguards should be in place to ensure AI reliability in medicine?
- Could specialized training improve AI performance in areas like visuospatial reasoning?
The conversation is far from over, and as AI continues to evolve, so too must our understanding of its capabilities—and its vulnerabilities.
The study is published in the British Medical Journal.