iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

Paradigms of visually grounded reasoning.

Abstract

Visually grounded chain-of-thought has emerged as a promising paradigm to enhance fine-grained perception in MLLMs. However, we empirically find that mandating explicit object boxes during inference often degrades performance compared to standard textual CoT. We hypothesize that visual localization capability can be internalized into the textual CoT, and that mandatory explicit grounding introduces unnecessary task interference. We propose iVGR, a reinforcement learning framework that transfers localization capability into the textual reasoning process through a dual-stream training strategy: a textual stream is aligned with a high-quality grounded stream via a novel consistency reward. Extensive experiments show iVGR significantly outperforms existing baselines on fine-grained benchmarks, while remaining compatible with tool-assisted inference workflows.

Key Insight: Textual CoT can outperform Grounded CoT

Takeaway: Explicit grounding at inference is not necessary, and can even hurt. iVGR is designed to internalize localization capability into textual reasoning.

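Off-the-shelf models trained with visually grounded CoT (DeepEyes, TreeVGR) perform better on average when we simply switch to textual CoT at inference time, without any retraining. The eight-benchmark average rises from 74.1 to 75.1 for DeepEyes-7B and from 74.7 to 75.7 for TreeVGR-7B, although grounded CoT remains slightly ahead on a few individual benchmarks (e.g. HRBench4K).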

T = textual CoT · G = visually grounded CoT.

Benchmarks	Qwen2.5-VL-7B (T)	DeepEyes-7B (G)	DeepEyes-7B (T)	TreeVGR-7B (G)	TreeVGR-7B (T)
V*	78.5	82.7	81.7	83.8	84.3
HRBench4K	69.0	75.1	74.9	77.1	76.9
HRBench8K	65.1	72.6	73.1	73.1	74.7
MME-RealWorld-Lite	44.5	53.2	53.5	54.9	54.7
POPE	86.3	87.7	89.2	87.3	88.4
RealWorldQA	68.1	69.4	69.7	67.3	69.5
CV-Bench-2D	75.7	75.0	77.9	76.6	77.7
CV-Bench-3D	73.6	77.3	80.8	77.2	79.3
Avg.	70.1	74.1	75.1	74.7	75.7
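
The gap between the two inference modes can be checked directly from the table above. The short Python sketch below recomputes the eight-benchmark averages for the grounded (G) and textual (T) columns of DeepEyes-7B and TreeVGR-7B and lists the benchmarks where textual CoT scores higher. The scores are copied from the table; the names `SCORES` and `summarize` are illustrative only.

```python
# Scores copied from the table above, in the same benchmark order.
BENCHMARKS = [
    "V*", "HRBench4K", "HRBench8K", "MME-RealWorld-Lite",
    "POPE", "RealWorldQA", "CV-Bench-2D", "CV-Bench-3D",
]

SCORES = {
    "DeepEyes-7B": {
        "G": [82.7, 75.1, 72.6, 53.2, 87.7, 69.4, 75.0, 77.3],
        "T": [81.7, 74.9, 73.1, 53.5, 89.2, 69.7, 77.9, 80.8],
    },
    "TreeVGR-7B": {
        "G": [83.8, 77.1, 73.1, 54.9, 87.3, 67.3, 76.6, 77.2],
        "T": [84.3, 76.9, 74.7, 54.7, 88.4, 69.5, 77.7, 79.3],
    },
}


def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def summarize(model):
    """Return average G/T scores, their gap, and where textual CoT wins."""
    g, t = SCORES[model]["G"], SCORES[model]["T"]
    wins = [name for name, gi, ti in zip(BENCHMARKS, g, t) if ti > gi]
    return {
        "avg_G": round(mean(g), 1),
        "avg_T": round(mean(t), 1),
        "gap": round(mean(t) - mean(g), 1),
        "textual_wins": wins,
    }


if __name__ == "__main__":
    for model in SCORES:
        print(model, summarize(model))
```

For both models the textual average is 1.0 point higher, and textual CoT wins on six of the eight benchmarks.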

Method: Dual-Stream Training with Consistency Reward

Dual-stream training. For each query, the policy MLLM samples a grounded stream (explicit boxes, rewarded by format / accuracy / IoU) and a textual stream (plain reasoning, rewarded by format / accuracy / consistency). The consistency reward is computed by an LLM judge against the best grounded rollout in a Rollout Archive, transferring localization from the grounded stream into the textual stream without exposing coordinates at inference.
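
A minimal sketch of how the two reward signals described above could be assembled. Only the reward component names (format, accuracy, IoU, consistency) and the idea of an archive holding the best grounded rollout come from the description. Everything else is an assumption made for illustration: the unweighted sums, scoring IoU against reference boxes, exact-match accuracy, and the hypothetical `Rollout`, `RolloutArchive`, and `judge` names (the latter standing in for the LLM judge).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Rollout:  # hypothetical container for one sampled response
    text: str
    answer: str
    format_ok: bool
    boxes: List[Box] = field(default_factory=list)  # empty for textual stream


def grounded_reward(r: Rollout, gold_answer: str, gold_boxes: List[Box]) -> float:
    """Format + accuracy + IoU (assumed: unweighted sum, best-match IoU)."""
    fmt = 1.0 if r.format_ok else 0.0
    acc = 1.0 if r.answer == gold_answer else 0.0
    if gold_boxes and r.boxes:
        loc = sum(max(iou(g, p) for p in r.boxes) for g in gold_boxes) / len(gold_boxes)
    else:
        loc = 0.0
    return fmt + acc + loc


class RolloutArchive:  # hypothetical: keeps the best grounded rollout per query
    def __init__(self) -> None:
        self._best: Dict[str, Tuple[float, Rollout]] = {}

    def update(self, query_id: str, rollout: Rollout, reward: float) -> None:
        current = self._best.get(query_id)
        if current is None or reward > current[0]:
            self._best[query_id] = (reward, rollout)

    def best(self, query_id: str) -> Optional[Rollout]:
        entry = self._best.get(query_id)
        return entry[1] if entry else None


def textual_reward(
    r: Rollout,
    gold_answer: str,
    reference: Optional[Rollout],
    judge: Callable[[str, str], float],
) -> float:
    """Format + accuracy + consistency with the archived grounded rollout."""
    fmt = 1.0 if r.format_ok else 0.0
    acc = 1.0 if r.answer == gold_answer else 0.0
    # The judge is a placeholder for an LLM call returning a score in [0, 1].
    cons = judge(r.text, reference.text) if reference is not None else 0.0
    return fmt + acc + cons
```

In this sketch the textual rollout never emits coordinates; it is only compared, through the judge, against the text of the best grounded rollout, which is how localization would be transferred without exposing boxes at inference.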

Qualitative Results

Takeaway: Grounded CoT suffers from two failure modes: (a) localization errors that lead to incorrect answers, and (b) accurate localization paired with recognition failures.

Grounded CoT vs. textual CoT within iVGR. Left: the grounded CoT misses objects and undercounts, while the textual CoT enumerates correctly. Right: the grounded CoT localizes well but misreads the label, while the textual CoT, freed from emitting coordinates, correctly recognizes the hazard numbers.

Effect of the consistency reward. Without it, the textual stream mis-localizes the trailer and reports the wrong color. iVGR attends to the correct region and recovers the right answer.

Main Results

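Takeaway: (a) State-of-the-art performance among the compared open-source models on fine-grained VQA benchmarks, with average gains of +6.5, +4.4, and +3.3 points over the respective base models across fine-grained and general VQA. (b) Compatible with crop tools; tool-assisted test-time scaling further raises the fine-grained average (e.g. from 80.1 to 82.7 for iVGR-7B).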

HR4K / HR8K = HRBench4K / HRBench8K · MME-RW-L = MME-RealWorld-Lite · RWQA = RealWorldQA · CV-2D / CV-3D = CV-Bench-2D / CV-Bench-3D.

Model	Tool	Fine-grained VQA			General VQA					Avg.
Model	Tool	V*	HR4K	HR8K	MME-RW-L	POPE	RWQA	CV-2D	CV-3D	Avg.
Proprietary Models
Gemini-3.1-Pro-Preview	—	87.4	88.9	88.1	55.8	88.0	83.5	85.0	94.6	83.9
GPT-5.4	—	88.0	87.4	80.6	63.4	87.9	83.0	82.4	91.9	83.1
Open-source General Models
LLaVA-OneVision-7B	✗	72.8	64.6	57.9	48.2	88.3	69.5	72.9	76.9	68.9
InternVL3-8B	✗	70.2	70.0	69.3	48.6	90.3	71.0	80.6	86.1	73.3
Qwen2.5-VL-7B	✗	78.5	69.0	65.1	44.5	86.3	68.1	75.7	73.6	70.1
Qwen2.5-VL-32B	✗	80.1	73.0	69.5	46.3	86.5	70.1	76.7	84.5	73.3
Qwen2.5-VL-72B	✗	85.9	79.9	76.8	45.2	86.3	76.1	78.4	87.2	77.0
Qwen3-VL-4B	✗	78.5	77.8	71.1	48.3	89.3	71.2	78.7	91.7	75.8
Qwen3-VL-8B	✗	82.7	76.5	70.4	49.0	88.1	70.5	78.6	93.5	76.2
Qwen3-VL-32B	✗	83.8	80.0	78.1	52.1	89.4	79.3	81.2	92.8	79.6
Visually Grounded Reasoning Models
GRIT-3B	✗	54.5	48.4	43.5	33.8	80.8	58.0	72.5	68.2	57.5
Pixel-Reasoner-7B	✓	—	72.9	66.9	49.7	—	—	—	—	—
DeepEyes-7B	✓	82.7	75.1	72.6	53.2	87.7	69.4	75.0	77.3	74.1
DeepEyesV2-7B	✓	81.8	77.9	73.8	—	—	—	—	—	—
Mini-o3-7B	✓	—	77.5	73.3	—	—	—	—	—	—
Thyme-7B	✓	82.2	77.0	72.0	55.2	86.8	70.2	78.0	75.1	74.6
TreeVGR-7B	✗	83.8	77.1	73.1	54.9	87.3	67.3	76.6	77.2	74.7
iVGR-Qwen2.5-VL-7B (ours)	✗	86.4	78.3	75.5	55.6	88.9	68.6	78.4	81.1	76.6
Δ vs. Qwen2.5-VL-7B		+7.9	+9.3	+10.4	+11.1	+2.6	+0.5	+2.7	+7.5	+6.5
iVGR-Qwen3-VL-8B (ours)	✗	90.1	82.0	80.1	60.7	89.4	71.0	80.8	91.0	80.6
Δ vs. Qwen3-VL-8B		+7.4	+5.5	+9.7	+11.7	+1.3	+0.5	+2.2	-2.5	+4.4
iVGR-Qwen3-VL-32B (ours)	✗	93.2	82.9	82.9	61.2	88.8	76.3	83.9	93.8	82.9
Δ vs. Qwen3-VL-32B		+9.4	+2.9	+4.8	+9.1	-0.6	-3.0	+2.7	+1.0	+3.3

Chart Understanding & Multidisciplinary Reasoning

Model	Chart Understanding		Multidisciplinary Reasoning				Avg.
Model	ChartQA	AI2D	WeMath	MMStar	MMMU	MMK12	Avg.
Qwen2.5-VL-7B	86.4	83.6	35.3	63.9	54.4	53.6	62.9
iVGR-Qwen2.5-VL-7B	88.5	85.0	41.1	66.3	55.2	56.3	65.4 (+2.5)
Qwen3-VL-8B	83.2	80.4	49.7	67.9	58.0	60.4	66.6
iVGR-Qwen3-VL-8B	87.6	85.5	55.1	69.7	59.8	61.6	69.9 (+3.3)
Qwen3-VL-32B	85.0	84.5	60.0	72.3	67.7	73.9	73.9
iVGR-Qwen3-VL-32B	90.4	88.7	61.6	75.1	67.7	75.2	76.5 (+2.6)

Tool-Assisted Test-Time Scaling

Model	V*	HR4K	HR8K	Avg.
Qwen2.5-VL-7B	78.5	69.0	65.1	70.9
iVGR-7B	86.4	78.3	75.5	80.1
iVGR-7B + crops	89.0	79.4	76.3	81.6
iVGR-7B + union crop	89.0	79.9	75.8	81.6
iVGR-7B + crops + union crop	90.1	81.8	76.3	82.7
Qwen3-VL-8B	82.7	76.5	70.4	76.5
Qwen3-VL-8B + tool	90.1	82.3	78.0	83.5
iVGR-8B	90.1	82.0	80.1	84.1
iVGR-8B + crops	89.5	83.5	78.0	83.7
iVGR-8B + union crop	92.7	84.5	78.8	85.3
iVGR-8B + crops + union crop	93.2	84.3	79.3	85.6
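
The table distinguishes per-region "crops" from a single "union crop" but this page does not define either. As a rough illustration only, the sketch below assumes that a union crop is the smallest axis-aligned rectangle enclosing all candidate regions, optionally padded and clamped to the image. The function name `union_box`, the `pad` parameter, and the pixel box format `(x1, y1, x2, y2)` are all hypothetical, and where the candidate regions come from is left open.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def union_box(boxes: List[Box], image_size: Tuple[int, int], pad: float = 0.0) -> Box:
    """Smallest box enclosing all input boxes, padded and clamped to the image.

    Assumption: this is one plausible reading of "union crop", not a
    definition taken from the source.
    """
    if not boxes:
        raise ValueError("at least one box is required")
    width, height = image_size
    x1 = min(b[0] for b in boxes) - pad
    y1 = min(b[1] for b in boxes) - pad
    x2 = max(b[2] for b in boxes) + pad
    y2 = max(b[3] for b in boxes) + pad
    # Clamp to the image bounds so the crop stays valid.
    return (max(0.0, x1), max(0.0, y1), min(float(width), x2), min(float(height), y2))


if __name__ == "__main__":
    regions = [(10, 20, 50, 60), (40, 10, 90, 30)]
    print(union_box(regions, image_size=(100, 100)))
    print(union_box(regions, image_size=(100, 100), pad=15))
```

Under this reading, per-region crops would feed each box to the model separately, while the union crop keeps the spatial relationship between regions in one view.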
