Detecting Hallucinations in LLMs Is Impossible, New Research Says

If an AI system is trained only on correct data, automated hallucination detection becomes fundamentally impossible

A new study from Yale University researchers raises critical questions about the feasibility of detecting hallucinations—false or fabricated outputs—in large language models (LLMs) using automated methods. Titled "(Im)possibility of Automated Hallucination Detection in Large Language Models", the paper introduces a theoretical framework to assess whether LLMs can reliably identify their own inaccuracies without human intervention.

The researchers demonstrate a surprising equivalence between hallucination detection and the complex task of language identification, a long-studied challenge in computer science.

Their findings reveal that if an AI system is trained only on correct data (positive examples), automated hallucination detection becomes fundamentally impossible for most collections of languages in their framework.

However, the study offers a silver lining: when models are trained using expert-labeled feedback—including both accurate and explicitly incorrect examples—automated detection becomes theoretically achievable. This highlights the crucial role of human input in training safe and reliable AI systems.

"Specifically, we showed that hallucination detection is typically unattainable if detectors are trained solely on positive examples from the target language (i.e., factually correct statements). In stark contrast, when detectors have access to explicitly labeled negative examples—factually incorrect statements—hallucination detection becomes tractable for all countable collections," the paper reads.

The findings provide strong theoretical support for reinforcement learning from human feedback (RLHF), a method already central to developing more trustworthy LLMs.

As generative AI continues to evolve, this research underscores the importance of integrating human oversight to ensure responsible deployment and minimise misinformation risks.

Recently, OpenAI admitted that its newer models, o3 and o4-mini, hallucinate more often than older reasoning models like o1, o1-mini, and o3-mini, as well as traditional models such as GPT-4.

On OpenAI’s PersonQA benchmark, o3 hallucinated on 33% of queries — more than double the rate of o1 and o3-mini. O4-mini performed even worse, hallucinating 48% of the time.

Adding to the concern, OpenAI acknowledges it doesn’t fully understand the cause. In a technical report, the company said, "We also observed some performance differences comparing o1 and o3. Specifically, o3 tends to make more claims overall, leading to more accurate and more inaccurate/hallucinated claims."

It adds, “more research is needed” to explain why hallucinations increase as reasoning capabilities scale.

AI researchers like Gary Marcus have long warned about the hallucinatory behavior of large language models, arguing that such tendencies undermine the reliability of these systems and call into question the multi-billion-dollar valuations and hype surrounding companies developing generative AI.