Hallucinations Undermine Trust; Metacognition is a Way Forward
Gal Yona, Mor Geva, Yossi Matias
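Abstract
Despite substantial progress in factual reliability, errors known as hallucinations remain a major concern for generative AI, all the more so as large language models (LLMs) are increasingly expected to be helpful in complex or nuanced settings. Yet even frontier models, when used without external tools, still hallucinate in the simplest setting: factoid question answering with a clear ground truth.
We argue that most factuality gains in this area have come from expanding the model's knowledge boundary (encoding more facts) rather than from improving awareness of that boundary (distinguishing what is known from what is not). We conjecture that the latter is inherently difficult: models lack the discriminative power to perfectly separate truths from errors, which can create an unavoidable tradeoff between eliminating hallucinations and preserving utility.
This tradeoff dissolves, however, once the problem is reframed. If hallucinations are understood as confident errors, that is, incorrect information delivered without appropriate qualification, a third path emerges beyond the answer-or-abstain dichotomy: expressing uncertainty. We propose faithful uncertainty, the alignment of linguistic uncertainty with intrinsic uncertainty. It is one facet of metacognition, the ability to recognize one's own uncertainty and act on it. In direct interaction, acting on uncertainty means communicating it honestly; in agentic systems, it serves as a control layer that governs when to search and what to trust.
Metacognition is therefore essential for LLMs to be both trustworthy and capable. We conclude by highlighting the open challenges that remain on the way to this goal.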
One-sentence Summary
The authors propose a metacognitive framework that mitigates large language model hallucinations by aligning linguistic uncertainty with intrinsic uncertainty, shifting focus from knowledge expansion to faithful uncertainty expression that enhances trustworthiness and utility in both direct conversational interfaces and autonomous agentic systems.
Key Contributions
- This work defines faithful uncertainty as a metacognitive objective that aligns linguistic expressions of doubt with a model’s intrinsic confidence state, ensuring accurate communication of limitations at the edge of model competence.
- The approach reframes hallucination mitigation by transcending the answer-or-abstain dichotomy, positioning calibrated uncertainty as a control layer that governs tool selection and verification in agentic systems.
- The study introduces process-based evaluation frameworks for AI agents that prioritize metacognitive reliability over end-to-end correctness, incorporating specific validation mechanisms to penalize inefficient searching and sycophantic source trust.
Introduction
Large language models are increasingly deployed in high-stakes applications where factual reliability directly impacts user trust and safety, yet they continue to generate confident errors known as hallucinations. Prior mitigation strategies have primarily focused on expanding internal knowledge or improving aggregate confidence calibration, but these approaches fail because models fundamentally lack the discriminative power to reliably separate correct answers from incorrect ones at the instance level. This limitation forces a strict trade-off between factual accuracy and practical utility, as eliminating errors typically requires aggressive answer refusal that suppresses valid information. The authors reframe hallucinations as confident errors and introduce faithful uncertainty as a metacognitive solution. By aligning linguistic hedging with a model's intrinsic confidence state, they enable systems to preserve useful information while honestly communicating doubt, ultimately establishing a reliable control layer for both direct interaction and agentic tool use.
Dataset
- Dataset Composition and Sources: The authors constructed a synthetic dataset of 25,000 samples designed to replicate the empirical confidence profiles documented by Nakkiran et al. (2025).
- Subset Details and Generation Rules: The dataset maintains a fixed 25% base hallucination rate. Confidence scores are drawn from two overlapping Beta distributions that mimic the confidence profiles of LLMs: Beta(1.8, 1.0) for correct answers and Beta(1.0, 1.3) for incorrect answers.
- Processing and Calibration: To ensure the observed trade-offs arise solely from distribution overlap rather than probability miscalibration, the team applied isotonic regression to the raw scores. This step achieved near-perfect calibration (smECE ≈ 0.014) and isolated discriminative power as the limiting factor.
- Usage and Evaluation Strategy: Rather than using the data for model training, the authors employ it to analyze utility-error curves. They sweep a rejection threshold across the calibrated scores to calculate the proportion of correctly answered versus incorrectly answered samples at each decision boundary.
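The generation and threshold sweep described in the list above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names, the seed, and the number of thresholds are assumptions, and the isotonic-regression calibration step is omitted, so the thresholds here apply to raw scores.

```python
import random


def simulate(n=25_000, halluc_rate=0.25, seed=0):
    """Draw (confidence, is_correct) pairs from the two Beta distributions."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    samples = []
    for _ in range(n):
        is_correct = rng.random() >= halluc_rate
        if is_correct:
            conf = rng.betavariate(1.8, 1.0)  # correct answers skew high
        else:
            conf = rng.betavariate(1.0, 1.3)  # incorrect answers skew low
        samples.append((conf, is_correct))
    return samples


def utility_error_curve(samples, num_thresholds=101):
    """For each rejection threshold t, answer only when confidence >= t.

    Returns (t, correct_fraction, incorrect_fraction) tuples, where both
    fractions are taken over all samples, answered or not.
    """
    n = len(samples)
    curve = []
    for i in range(num_thresholds):
        t = i / (num_thresholds - 1)
        answered = [ok for conf, ok in samples if conf >= t]
        num_correct = sum(answered)
        curve.append((t, num_correct / n, (len(answered) - num_correct) / n))
    return curve


if __name__ == "__main__":
    for t, correct, wrong in utility_error_curve(simulate())[::20]:
        print(f"t={t:.2f}  answered correctly={correct:.3f}  answered incorrectly={wrong:.3f}")
```

Because the two distributions overlap, raising the threshold removes correct answers along with incorrect ones, which is the trade-off the curves are meant to expose.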
Method
The authors propose a framework centered on faithful uncertainty, which reorients the objective of large language models (LLMs) from purely factual accuracy to the alignment between the model's internal confidence and the uncertainty it expresses in its responses. This approach directly addresses the utility–factuality tradeoff, in which models are forced either to abstain from uncertain claims, sacrificing useful information, or to generate confident but potentially incorrect answers. Faithful uncertainty lets models preserve utility by answering while hedging uncertain claims appropriately, maintaining trust without sacrificing informativeness. The core idea is that models should not only be knowledgeable but also communicate their uncertainty reliably: the goal is not to eliminate hallucinations entirely but to ensure that any uncertainty in the output accurately reflects the model's intrinsic confidence.
The framework diagram illustrates the inherent tradeoff between utility and factuality under strict factuality constraints, where models must either abstain from uncertain claims or predict, leading to potential errors that reduce trust. In contrast, the faithful uncertainty paradigm enables models to preserve utility by expressing uncertainty linguistically while still providing answers, thereby mitigating harm through appropriate hedging. This framework mirrors how human experts communicate, where confidence in a statement is explicitly reflected in language, enhancing trust without sacrificing the provision of useful information.
The operationalization of faithful uncertainty relies on two key measures: intrinsic uncertainty and linguistic uncertainty. Intrinsic uncertainty is defined as the model's internal confidence in a given assertion, quantified by the likelihood of generating conflicting answers under repeated sampling. For a query $Q$ and assertion $A$, intrinsic confidence is computed as $\mathrm{conf}_M(A) = 1 - \frac{1}{k}\sum_{i=1}^{k} \mathbf{1}[A \text{ contradicts } A_i]$, where $A_1, \dots, A_k$ are the assertions generated by the model. Linguistic uncertainty, on the other hand, is measured by decisiveness: the degree to which an assertion is conveyed confidently or hedged in the output. For a response $R$, this is captured as $\mathrm{dec}(A; R, Q) = \Pr[A \text{ is True} \mid R, Q]$, estimated using an LLM-as-a-judge. Faithful uncertainty is then defined as the alignment between these two measures, with a faithfulness score computed as $1 - \frac{1}{|\mathcal{A}(R)|}\sum_{A \in \mathcal{A}(R)} \left|\mathrm{dec}(A; R, Q) - \mathrm{conf}_M(A)\right|$, where $\mathcal{A}(R)$ is the set of assertions in $R$. A score of 1 indicates perfect alignment, while lower values indicate systematic over- or under-hedging.
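As a rough illustration of the two measures and the faithfulness score defined above, the sketch below computes them for a toy example. It is not the authors' implementation: `naive_contradicts` is a hypothetical stand-in for a real contradiction check, and the decisiveness value is supplied by hand where the paper uses an LLM-as-a-judge.

```python
from typing import Callable, Mapping, Sequence


def intrinsic_confidence(
    assertion: str,
    sampled: Sequence[str],
    contradicts: Callable[[str, str], bool],
) -> float:
    """1 minus the fraction of sampled assertions that contradict `assertion`."""
    if not sampled:
        raise ValueError("need at least one sampled assertion")
    conflicts = sum(1 for other in sampled if contradicts(assertion, other))
    return 1.0 - conflicts / len(sampled)


def faithfulness(
    decisiveness: Mapping[str, float],
    confidence: Mapping[str, float],
) -> float:
    """1 minus the mean absolute gap between decisiveness and confidence."""
    if not decisiveness:
        raise ValueError("response contains no assertions")
    gaps = [abs(dec - confidence[a]) for a, dec in decisiveness.items()]
    return 1.0 - sum(gaps) / len(gaps)


def naive_contradicts(a: str, b: str) -> bool:
    # Hypothetical placeholder: treats any differing short answer as a
    # contradiction. A real system would need a semantic check.
    return a.strip().lower() != b.strip().lower()


# Four sampled answers to the same query; one of them disagrees.
samples = ["Paris", "Paris", "Lyon", "Paris"]
conf = intrinsic_confidence("Paris", samples, naive_contradicts)

# Assumed judge output: the response states "Paris" almost without hedging.
score = faithfulness({"Paris": 0.95}, {"Paris": conf})
```

Here the intrinsic confidence is 0.75 while the response is phrased with decisiveness 0.95, so the score comes out at about 0.8, reflecting an answer stated more confidently than the sampling behavior supports.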
As shown in the architecture figure, the proposed architecture integrates a metacognitive control layer that acts as a harness for the LLM. This layer supervises and modulates the model's behavior by accessing internal states such as reasoning, language, world knowledge, and instruction following. It routes outputs to external components like retrieval, orchestration, verification, and memory, enabling the model to dynamically assess and express uncertainty. This modular design allows the system to separate factual reasoning from metacognitive evaluation, facilitating the extraction and utilization of confidence signals.
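One way to picture such a control layer is as a small policy that maps a confidence signal to an action. The sketch below is purely illustrative and not taken from the paper: the `Action` names, the `ControlPolicy` class, and both thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"      # state the claim plainly
    HEDGE = "hedge"        # answer, but express uncertainty in the wording
    RETRIEVE = "retrieve"  # consult an external tool before answering


@dataclass(frozen=True)
class ControlPolicy:
    # Both thresholds are hypothetical values, not taken from the source.
    answer_above: float = 0.9
    retrieve_below: float = 0.5

    def decide(self, confidence: float, tools_available: bool) -> Action:
        """Choose an action from the model's intrinsic confidence in [0, 1]."""
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        if confidence >= self.answer_above:
            return Action.ANSWER
        if confidence < self.retrieve_below and tools_available:
            return Action.RETRIEVE
        # Mid confidence, or low confidence without tools: answer with hedging.
        return Action.HEDGE


policy = ControlPolicy()
decision = policy.decide(confidence=0.3, tools_available=True)
```

In this sketch the same low confidence leads to retrieval when tools exist and to a hedged answer when they do not, which mirrors the distinction the text draws between agentic systems and direct interaction.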
The authors argue that faithful uncertainty is a feasible and practical objective because it bypasses the limitations of mapping finite parameters to an infinite external world. Instead, it treats the problem as a closed-loop system where the ground truth for faithfulness is internal—specifically, the model's own confidence signals. These signals are inherently computable and can be exploited through architectural improvements, data modifications, and better training recipes. The framework further supports the idea that intrinsic uncertainty signals are already present and accessible, as evidenced by recent work in mechanistic interpretability and the use of such signals in reinforcement learning to enhance diversity and reasoning.
The framework diagram outlines recommended practices and key challenges in implementing faithful uncertainty. Recommended practices include visualizing tradeoffs through full curves to expose the true cost of factuality, showing frontier gains at various error rates, and measuring spillovers to test interventions against general capabilities. Key challenges include the bootstrapping paradox, where SFT data is static but uncertainty depends on evolving knowledge; signal preservation, where alignment may induce overconfidence; confidence attribution, requiring models to distinguish why they are uncertain; causal evaluation, ensuring models sense internal states rather than mimic hedging patterns; and agent evaluation, which penalizes metacognitive failures rather than just incorrect answers. These considerations highlight the complexity of aligning internal confidence with linguistic expression and the need for robust evaluation methods.
Experiment
Evaluation on the SimpleQA Verified dataset reveals a sharp split: models either offer broad coverage with high hallucination rates or give up utility to ensure factuality. The analysis points to an inherent capability gap: current language models cannot reliably distinguish their own errors from accurate knowledge, which leaves the ideal performance region unpopulated. The authors conclude that standard metrics obscure the unavoidable utility tax of hallucination mitigation, and that evaluation frameworks should track full utility-error tradeoffs, verify genuine capability improvements, and assess collateral impacts on general model helpfulness.