Existing hallucination detectors rely on global, image-level statistics: they sum or average a token's attention across all visual patches to decide whether it is grounded. But a hallucinated token can spread weak attention across many patches, and these small contributions aggregate into a deceptively high global score that fools such methods.
Figure 1. Left: SVAR sums attention globally, missing hallucinations whose attention is scattered yet high in total. Right: our method analyzes the spatial distribution; high entropy indicates hallucination.
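To make the failure concrete, here is a toy illustration (ours, not from the paper): two attention maps with identical total mass. A global sum cannot tell them apart, but the spatial entropy of the distribution separates them immediately.

```python
import numpy as np

# 64 patches; both maps carry the same total attention mass of 0.6.
compact = np.zeros(64)
compact[:4] = 0.15                      # grounded token: mass on 4 adjacent patches
scattered = np.full(64, 0.6 / 64)       # hallucinated token: mass spread thinly everywhere

print(compact.sum(), scattered.sum())   # 0.6 and 0.6 -- a global sum sees no difference

def spatial_entropy(p):
    p = p / p.sum()
    return -(p * np.log(p + 1e-12)).sum()

print(spatial_entropy(compact))         # ~1.39 (low: concentrated)
print(spatial_entropy(scattered))       # ~4.16 (high: dispersed)
```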
Figure 2. Overview of our token-level hallucination detection framework.
We propose two complementary metrics from LVLM internals — no external models, no LVLM retraining:
Our first metric, ADS, measures the spatial compactness of cross-modal attention. We threshold the map at its top 10% of patches, group the foreground into connected components to suppress attention sinks, and then combine the foreground blob mass with the entropy of the background attention (see the code sketch below).
Figure 3. ADS pipeline: extract attention → threshold → connected components → entropy.
Low ADS → compact focus → grounded
High ADS → scattered attention → hallucinated
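As a concrete reference for the pipeline above, a minimal sketch, assuming scipy's connected-component labeling and an illustrative combination rule (a simple average of one minus the largest-blob mass and the normalized background entropy); the paper's exact weighting may differ.

```python
import numpy as np
from scipy import ndimage

def attention_dispersion_score(attn_map: np.ndarray, top_frac: float = 0.10) -> float:
    """Sketch of ADS for one token's (H, W) cross-modal attention map."""
    attn = attn_map / (attn_map.sum() + 1e-12)   # normalize into a spatial distribution
    thresh = np.quantile(attn, 1.0 - top_frac)   # threshold at the top-10% of patches
    mask = attn >= thresh
    labels, n_blobs = ndimage.label(mask)        # group foreground into connected components
    # Largest-blob mass: isolated attention sinks form small blobs and contribute little.
    blob_mass = max(attn[labels == i].sum() for i in range(1, n_blobs + 1)) if n_blobs else 0.0
    bg = attn[~mask]                             # background attention outside the foreground
    bg = bg / (bg.sum() + 1e-12)
    bg_entropy = -(bg * np.log(bg + 1e-12)).sum() / np.log(max(bg.size, 2))
    # Hypothetical combination: compact focus -> low score; scattered attention -> high score.
    return 0.5 * (1.0 - blob_mass) + 0.5 * bg_entropy
```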
We evaluate the feature similarity between the generated token and the image patches, and coin this metric Cross-modal Grounding Consistency (CGC). At layer \(n\), let \(\mathbf{z}^{(n)}_t, \mathbf{v}^{(n)}_p \in \mathbb{R}^d\) be the token and patch embeddings, and define the cosine similarity

\[
S^{(n)}_{t,p} = \frac{\mathbf{z}^{(n)\top}_t \mathbf{v}^{(n)}_p}{\lVert \mathbf{z}^{(n)}_t \rVert \, \lVert \mathbf{v}^{(n)}_p \rVert},
\]
which reflects local structural alignment. The per-token map is \(\mathbf{S}^{(n)}_t = [S^{(n)}_{t,p}]_{p \in \mathcal{P}}\). To emphasize localized evidence, we obtain the token grounding score \(C^{(n)}_t\) by averaging over the top-\(k\) patches \(\mathcal{T}^{(n)}_t\):

\[
C^{(n)}_t = \frac{1}{k} \sum_{p \in \mathcal{T}^{(n)}_t} S^{(n)}_{t,p}.
\]
High CGC → token aligns with specific visual content → grounded
Low CGC → no patch matches semantically → hallucinated
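A minimal PyTorch sketch of the per-layer CGC score, assuming the top-\(k\) aggregation is the plain mean defined above; the choice \(k = 8\) is illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def cgc_score(token_emb: torch.Tensor, patch_embs: torch.Tensor, k: int = 8) -> float:
    """C^{(n)}_t for one token at one layer.

    token_emb:  (d,)   hidden state z^{(n)}_t of the generated token
    patch_embs: (P, d) visual patch embeddings v^{(n)}_p at the same layer
    """
    sims = F.cosine_similarity(token_emb.unsqueeze(0), patch_embs, dim=-1)  # S^{(n)}_{t,p}
    topk = sims.topk(min(k, sims.numel())).values   # localized evidence: top-k patches only
    return topk.mean().item()
```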
Per-layer scores are concatenated into a single per-token feature vector for a lightweight classifier:

\[
\mathbf{x}_t = \big[\, A^{(1)}_t, \dots, A^{(N)}_t,\; C^{(1)}_t, \dots, C^{(N)}_t \,\big],
\]

where \(A^{(n)}_t\) denotes the ADS of token \(t\) at layer \(n\) and \(N\) is the number of layers.
We train XGBoost, Random Forest, and MLP classifiers. Detection adds <1 second per token.
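A minimal training sketch, assuming the per-layer scores have already been collected into `(num_tokens, num_layers)` arrays; the XGBoost hyperparameters are illustrative.

```python
import numpy as np
from xgboost import XGBClassifier

def train_detector(ads_scores: np.ndarray, cgc_scores: np.ndarray, labels: np.ndarray):
    """ads_scores, cgc_scores: (num_tokens, num_layers); labels: 1 = hallucinated."""
    X = np.concatenate([ads_scores, cgc_scores], axis=1)  # the per-token feature vector x_t
    clf = XGBClassifier(n_estimators=200, max_depth=4)    # illustrative hyperparameters
    clf.fit(X, labels)
    return clf                                            # predict_proba gives token-level scores
```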
Figure 4. Attention maps: true token "camera" (top) vs. hallucinated "wristwatch" (bottom).
Figure 9. CGC heatmaps: true token "television" (top) vs. hallucinated "table" (bottom).
Both signatures peak in early-to-mid layers (10–25), where cross-modal alignment is strongest before language priors dominate.
Figure 5. Layer-wise ADS: hallucinated tokens show higher dispersion in early/mid layers.
Figure 8. Layer-wise CGC: true tokens maintain higher similarity; the gap narrows at depth.
Our method consistently outperforms all baselines by 5–15% across three LVLMs.
Under heavy class imbalance (6% hallucination rate), we outperform the best baseline by +3% F1 and +5% AUC.
ADS captures where the model looks; CGC captures what it sees. Combining both yields +3–8% AUC.