ICML 2026

iVGR: Internalizing Visually Grounded Reasoning
for MLLMs with Reinforcement Learning

1 Visual AI Lab, The University of Hong Kong   2 Independent Researcher   3 University of Science and Technology of China
Corresponding author
Paradigms of visually grounded reasoning.

Abstract

Visually grounded chain-of-thought has emerged as a promising paradigm to enhance fine-grained perception in MLLMs. However, we empirically find that mandating explicit object boxes during inference often degrades performance compared to standard textual CoT. We hypothesize that visual localization capability can be internalized into the textual CoT, and that mandatory explicit grounding introduces unnecessary task interference. We propose iVGR, a reinforcement learning framework that transfers localization capability into the textual reasoning process through a dual-stream training strategy: a textual stream is aligned with a high-quality grounded stream via a novel consistency reward. Extensive experiments show iVGR significantly outperforms existing baselines on fine-grained benchmarks, while remaining compatible with tool-assisted inference workflows.

Key Insight: Textual CoT can outperform Grounded CoT

Takeaway: Explicit grounding at inference is not necessary, and can even hurt. iVGR is designed to internalize localization capability into textual reasoning.

Off-the-shelf models trained with visually grounded CoT (DeepEyes, TreeVGR) score higher on average when we simply switch them to textual CoT at inference time, without any retraining: the average rises from 74.1 to 75.1 for DeepEyes-7B and from 74.7 to 75.7 for TreeVGR-7B.

T = textual CoT   ·   G = visually grounded CoT.

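| Benchmarks | Qwen2.5-VL-7B (T) | DeepEyes-7B (G) | DeepEyes-7B (T) | TreeVGR-7B (G) | TreeVGR-7B (T) |
|---|---|---|---|---|---|
| V* | 78.5 | 82.7 | 81.7 | 83.8 | 84.3 |
| HRBench4K | 69.0 | 75.1 | 74.9 | 77.1 | 76.9 |
| HRBench8K | 65.1 | 72.6 | 73.1 | 73.1 | 74.7 |
| MME-RealWorld-Lite | 44.5 | 53.2 | 53.5 | 54.9 | 54.7 |
| POPE | 86.3 | 87.7 | 89.2 | 87.3 | 88.4 |
| RealWorldQA | 68.1 | 69.4 | 69.7 | 67.3 | 69.5 |
| CV-Bench-2D | 75.7 | 75.0 | 77.9 | 76.6 | 77.7 |
| CV-Bench-3D | 73.6 | 77.3 | 80.8 | 77.2 | 79.3 |
| Avg. | 70.1 | 74.1 | 75.1 | 74.7 | 75.7 |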
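The averages in the table can be re-derived with a few lines of Python. The sketch below copies the (G, T) scores for the two grounded-CoT models and also counts on how many benchmarks textual CoT is strictly higher; the helper name `summarize` is ours, not part of any released code.

```python
# (grounded, textual) scores per benchmark, copied from the table above.
deepeyes = {
    "V*": (82.7, 81.7),
    "HRBench4K": (75.1, 74.9),
    "HRBench8K": (72.6, 73.1),
    "MME-RealWorld-Lite": (53.2, 53.5),
    "POPE": (87.7, 89.2),
    "RealWorldQA": (69.4, 69.7),
    "CV-Bench-2D": (75.0, 77.9),
    "CV-Bench-3D": (77.3, 80.8),
}
treevgr = {
    "V*": (83.8, 84.3),
    "HRBench4K": (77.1, 76.9),
    "HRBench8K": (73.1, 74.7),
    "MME-RealWorld-Lite": (54.9, 54.7),
    "POPE": (87.3, 88.4),
    "RealWorldQA": (67.3, 69.5),
    "CV-Bench-2D": (76.6, 77.7),
    "CV-Bench-3D": (77.2, 79.3),
}


def summarize(scores):
    """Return (grounded avg, textual avg, number of benchmarks where T > G)."""
    n = len(scores)
    g_avg = sum(g for g, _ in scores.values()) / n
    t_avg = sum(t for _, t in scores.values()) / n
    t_wins = sum(1 for g, t in scores.values() if t > g)
    return round(g_avg, 1), round(t_avg, 1), t_wins


for name, scores in [("DeepEyes-7B", deepeyes), ("TreeVGR-7B", treevgr)]:
    g_avg, t_avg, t_wins = summarize(scores)
    print(f"{name}: G={g_avg} T={t_avg} (T higher on {t_wins}/{len(scores)})")
```

For both models the gain is an average effect: textual CoT is higher on six of the eight benchmarks, not on all of them.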

Method: Dual-Stream Training with Consistency Reward

iVGR method overview

Dual-stream training. For each query, the policy MLLM samples a grounded stream (explicit boxes, rewarded by format / accuracy / IoU) and a textual stream (plain reasoning, rewarded by format / accuracy / consistency). The consistency reward is computed by an LLM judge against the best grounded rollout in a Rollout Archive, transferring localization from the grounded stream into the textual stream without exposing coordinates at inference.
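To make the reward structure concrete, here is a minimal Python sketch of the two streams. It is an illustration under several assumptions, not the authors' implementation: the three reward terms are summed with equal weight, each grounded rollout carries a single box compared against a single reference box, the archive keeps only the highest-reward grounded rollout per query, and the LLM judge is a hypothetical callable returning a score in [0, 1]. The names `Rollout`, `RolloutArchive`, `grounded_reward`, and `textual_reward` are ours.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Rollout:
    text: str                  # the sampled reasoning trace
    format_ok: bool            # output follows the required format
    correct: bool              # final answer matches the reference
    box: Optional[Box] = None  # only set for grounded rollouts


def grounded_reward(r: Rollout, ref_box: Box) -> float:
    """Grounded stream: format + accuracy + IoU (equal weights assumed)."""
    iou = box_iou(r.box, ref_box) if r.box is not None else 0.0
    return float(r.format_ok) + float(r.correct) + iou


class RolloutArchive:
    """Keeps the best grounded rollout seen so far for each query (assumed policy)."""

    def __init__(self) -> None:
        self._best: Dict[str, Tuple[Rollout, float]] = {}

    def update(self, query_id: str, rollout: Rollout, reward: float) -> None:
        entry = self._best.get(query_id)
        if entry is None or reward > entry[1]:
            self._best[query_id] = (rollout, reward)

    def best(self, query_id: str) -> Optional[Rollout]:
        entry = self._best.get(query_id)
        return entry[0] if entry is not None else None


# Hypothetical judge: (textual trace, grounded trace) -> consistency in [0, 1].
Judge = Callable[[str, str], float]


def textual_reward(r: Rollout, reference: Optional[Rollout], judge: Judge) -> float:
    """Textual stream: format + accuracy + consistency with the archived rollout."""
    consistency = judge(r.text, reference.text) if reference is not None else 0.0
    return float(r.format_ok) + float(r.correct) + consistency
```

Note that boxes appear only in `grounded_reward`; the textual stream is scored on plain text, which is what lets the trained model reason without emitting coordinates at inference.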

Qualitative Results

Takeaway: Grounded CoT suffers from two failure modes: (a) localization errors that lead to incorrect answers, and (b) accurate localization paired with recognition failures.

Grounded CoT vs Textual CoT in iVGR

Grounded CoT vs. textual CoT within iVGR. Left: the grounded CoT misses objects and undercounts, while the textual CoT enumerates correctly. Right: the grounded CoT localizes well but misreads the label, while the textual CoT, freed from emitting coordinates, correctly recognizes the hazard numbers.

With vs. without consistency reward

Effect of the consistency reward. Without it, the textual stream mis-localizes the trailer and reports the wrong color. iVGR attends to the correct region and recovers the right answer.

Main Results

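Takeaway: (a) State-of-the-art performance on fine-grained and general VQA benchmarks. (b) Compatible with crop tools; test-time scaling further boosts performance on fine-grained benchmarks.

Fine-grained VQA: V*, HR4K, HR8K. General VQA: MME-RW-L, POPE, RWQA, CV-2D, CV-3D.

| Model | Tool | V* | HR4K | HR8K | MME-RW-L | POPE | RWQA | CV-2D | CV-3D | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models | | | | | | | | | | |
| Gemini-3.1-Pro-Preview | | 87.4 | 88.9 | 88.1 | 55.8 | 88.0 | 83.5 | 85.0 | 94.6 | 83.9 |
| GPT-5.4 | | 88.0 | 87.4 | 80.6 | 63.4 | 87.9 | 83.0 | 82.4 | 91.9 | 83.1 |
| Open-source General Models | | | | | | | | | | |
| LLaVA-OneVision-7B | | 72.8 | 64.6 | 57.9 | 48.2 | 88.3 | 69.5 | 72.9 | 76.9 | 68.9 |
| InternVL3-8B | | 70.2 | 70.0 | 69.3 | 48.6 | 90.3 | 71.0 | 80.6 | 86.1 | 73.3 |
| Qwen2.5-VL-7B | | 78.5 | 69.0 | 65.1 | 44.5 | 86.3 | 68.1 | 75.7 | 73.6 | 70.1 |
| Qwen2.5-VL-32B | | 80.1 | 73.0 | 69.5 | 46.3 | 86.5 | 70.1 | 76.7 | 84.5 | 73.3 |
| Qwen2.5-VL-72B | | 85.9 | 79.9 | 76.8 | 45.2 | 86.3 | 76.1 | 78.4 | 87.2 | 77.0 |
| Qwen3-VL-4B | | 78.5 | 77.8 | 71.1 | 48.3 | 89.3 | 71.2 | 78.7 | 91.7 | 75.8 |
| Qwen3-VL-8B | | 82.7 | 76.5 | 70.4 | 49.0 | 88.1 | 70.5 | 78.6 | 93.5 | 76.2 |
| Qwen3-VL-32B | | 83.8 | 80.0 | 78.1 | 52.1 | 89.4 | 79.3 | 81.2 | 92.8 | 79.6 |
| Visually Grounded Reasoning Models | | | | | | | | | | |
| GRIT-3B | | 54.5 | 48.4 | 43.5 | 33.8 | 80.8 | 58.0 | 72.5 | 68.2 | 57.5 |
| DeepEyes-7B | | 82.7 | 75.1 | 72.6 | 53.2 | 87.7 | 69.4 | 75.0 | 77.3 | 74.1 |
| Thyme-7B | | 82.2 | 77.0 | 72.0 | 55.2 | 86.8 | 70.2 | 78.0 | 75.1 | 74.6 |
| TreeVGR-7B | | 83.8 | 77.1 | 73.1 | 54.9 | 87.3 | 67.3 | 76.6 | 77.2 | 74.7 |
| iVGR-Qwen2.5-VL-7B (ours) | | 86.4 | 78.3 | 75.5 | 55.6 | 88.9 | 68.6 | 78.4 | 81.1 | 76.6 |
| Δ vs. Qwen2.5-VL-7B | | +7.9 | +9.3 | +10.4 | +11.1 | +2.6 | +0.5 | +2.7 | +7.5 | +6.5 |
| iVGR-Qwen3-VL-8B (ours) | | 90.1 | 82.0 | 80.1 | 60.7 | 89.4 | 71.0 | 80.8 | 91.0 | 80.6 |
| Δ vs. Qwen3-VL-8B | | +7.4 | +5.5 | +9.7 | +11.7 | +1.3 | +0.5 | +2.2 | -2.5 | +4.4 |
| iVGR-Qwen3-VL-32B (ours) | | 93.2 | 82.9 | 82.9 | 61.2 | 88.8 | 76.3 | 83.9 | 93.8 | 82.9 |
| Δ vs. Qwen3-VL-32B | | +9.4 | +2.9 | +4.8 | +9.1 | -0.6 | -3.0 | +2.7 | +1.0 | +3.3 |

Three Visually Grounded Reasoning Models are reported on a subset of benchmarks only. Their values are listed in source order; the Tool entries and the column positions of these partial rows did not survive extraction.

Pixel-Reasoner-7B: 72.9, 66.9, 49.7
DeepEyesV2-7B: 81.8, 77.9, 73.8
Mini-o3-7B: 77.5, 73.3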

Chart Understanding & Multidisciplinary Reasoning

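Chart Understanding: ChartQA, AI2D. Multidisciplinary Reasoning: WeMath, MMStar, MMMU, MMK12.

| Model | ChartQA | AI2D | WeMath | MMStar | MMMU | MMK12 | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 86.4 | 83.6 | 35.3 | 63.9 | 54.4 | 53.6 | 62.9 |
| iVGR-Qwen2.5-VL-7B | 88.5 | 85.0 | 41.1 | 66.3 | 55.2 | 56.3 | 65.4 (+2.5) |
| Qwen3-VL-8B | 83.2 | 80.4 | 49.7 | 67.9 | 58.0 | 60.4 | 66.6 |
| iVGR-Qwen3-VL-8B | 87.6 | 85.5 | 55.1 | 69.7 | 59.8 | 61.6 | 69.9 (+3.3) |
| Qwen3-VL-32B | 85.0 | 84.5 | 60.0 | 72.3 | 67.7 | 73.9 | 73.9 |
| iVGR-Qwen3-VL-32B | 90.4 | 88.7 | 61.6 | 75.1 | 67.7 | 75.2 | 76.5 (+2.6) |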

Tool-Assisted Test-Time Scaling

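| Model | V* | HR4K | HR8K | Avg. |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 78.5 | 69.0 | 65.1 | 70.9 |
| iVGR-7B | 86.4 | 78.3 | 75.5 | 80.1 |
| iVGR-7B + crops | 89.0 | 79.4 | 76.3 | 81.6 |
| iVGR-7B + union crop | 89.0 | 79.9 | 75.8 | 81.6 |
| iVGR-7B + crops + union crop | 90.1 | 81.8 | 76.3 | 82.7 |
| Qwen3-VL-8B | 82.7 | 76.5 | 70.4 | 76.5 |
| Qwen3-VL-8B + tool | 90.1 | 82.3 | 78.0 | 83.5 |
| iVGR-8B | 90.1 | 82.0 | 80.1 | 84.1 |
| iVGR-8B + crops | 89.5 | 83.5 | 78.0 | 83.7 |
| iVGR-8B + union crop | 92.7 | 84.5 | 78.8 | 85.3 |
| iVGR-8B + crops + union crop | 93.2 | 84.3 | 79.3 | 85.6 |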
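This page does not define "crops" or "union crop". One plausible reading, which is purely our assumption, is that "crops" feeds each proposed region back to the model separately, while "union crop" feeds the smallest axis-aligned rectangle that encloses all proposed regions. Under that assumption, the union region could be computed as in the sketch below; the function names and the optional `margin` parameter are hypothetical.

```python
from typing import Iterable, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def union_box(boxes: Iterable[Box]) -> Box:
    """Smallest axis-aligned box enclosing every input box (assumed 'union crop')."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("union_box needs at least one box")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def pad_and_clip(box: Box, width: int, height: int, margin: int = 0) -> Box:
    """Grow a box by `margin` pixels per side, then clip it to the image bounds."""
    x1, y1, x2, y2 = box
    return (
        max(0, x1 - margin),
        max(0, y1 - margin),
        min(width, x2 + margin),
        min(height, y2 + margin),
    )


regions = [(10, 20, 50, 60), (40, 10, 90, 30)]
print(union_box(regions))                                   # (10, 10, 90, 60)
print(pad_and_clip(union_box(regions), 100, 64, margin=16))  # (0, 0, 100, 64)
```

A union region keeps the spatial relationship between objects in one view, whereas separate crops maximize resolution per object; that trade-off may be one reason combining both gives the best average in the table, though the page itself does not state the cause.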

BibTeX

BibTeX coming soon.