Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs

1TeleAI, 2Shanghai Jiao Tong University, 3Northwestern Polytechnical University

*Corresponding Author

(2/28) NOTE: Our server is currently unavailable. We will complete the code organization and upload sample images as soon as possible once the server is restored.

Introduction to SAYO

SAYO is a model trained solely with a visual attention-based reward.


Most MLLMs possess extensive knowledge reserves, but their limited ability to localize objects within images restricts their full reasoning potential.
SAYO employs reward-based training that relies on key visual-information tokens. In the example shown above, the trained model accurately focuses on the critical visual information and achieves higher reasoning accuracy.

Abstract

While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more reliable attention behaviors. Extensive experiments across multiple multimodal benchmarks demonstrate that SAYO consistently improves performance on diverse reasoning and perception tasks.

More Examples

Results

[Figures: main results · ablation results]

SAYO achieves better performance across a variety of understanding and reasoning tasks without refocusing or visual prompts. Moreover, models trained with only the visual attention reward achieve performance comparable to those trained with the combined reward, whereas models trained with the accuracy reward alone exhibit only marginal gains. These findings suggest that deficiencies in current MLLMs stem less from limited reasoning capacity and more from insufficient visual perception and localization.
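To make the idea concrete, here is a minimal sketch of how a region-level visual attention reward could be computed. This is an illustration only, not the paper's exact formulation: the function names, the normalization (fraction of attention mass inside the annotated region), and the combination weight `alpha` are all our assumptions.

```python
def visual_attention_reward(attn_weights, region_mask):
    """Fraction of the model's attention mass over image tokens that
    falls inside the annotated key region.

    attn_weights: per-image-token attention mass (non-negative floats),
        e.g. aggregated from the answer tokens' cross-attention.
    region_mask:  booleans, True for image tokens inside the key region.
    """
    total = sum(attn_weights)
    if total <= 0:
        return 0.0  # no attention signal: no reward
    inside = sum(w for w, in_region in zip(attn_weights, region_mask) if in_region)
    return inside / total


def combined_reward(correct, attn_weights, region_mask, alpha=0.5):
    """Hypothetical combined reward: task accuracy plus a weighted
    attention term, mirroring the ablation's 'combined reward' setup."""
    return float(correct) + alpha * visual_attention_reward(attn_weights, region_mask)


# Example: 3 image tokens, the first two lie inside the key region.
r = visual_attention_reward([0.2, 0.3, 0.5], [True, True, False])  # -> 0.5
```

In an actual RL pipeline this scalar would be fed to the policy-gradient update (e.g. PPO/GRPO) in place of, or alongside, the accuracy reward, which is the comparison the ablation above describes.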

BibTeX

@article{domllmsreallyseeit,
      title={Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs}, 
      author={Siqu Ou and Tianrui Wan and Zhiyuan Zhao and Junyu Gao and Xuelong Li},
      year={2026},
      eprint={2602.08241},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.08241}, 
}