Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier, termed Foreground Gradient Dilution (FGD), where sparse 3D signals lead to weak supervision. To resolve this, we introduce Per-view No-target Suppression Optimization (PVSO), which provides stronger and more balanced gradients across views, enabling stable and efficient learning. To support consistent evaluation, we build MVRefer, a benchmark that defines standardized settings and metrics for MV-3DRES. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives.
Real-world agents (robots, mobile phones) often operate with only a few sparse RGB views under strict latency constraints, whereas traditional 3DRES methods assume access to pre-built dense point clouds.
We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES). The goal is to segment a 3D object described by natural language directly from a few sparse images, without any ground-truth 3D input at inference.
Challenges: two-stage reconstruct-then-segment pipelines yield low-quality geometry and coarse or degraded target regions, and are too slow for latency-constrained agents; in addition, the sparse 3D supervision signal makes end-to-end optimization difficult (see FGD below).
Figure 1: Comparison of the proposed MV-3DRES task (bottom) against the traditional two-stage reconstruction-then-segmentation pipeline (top) and direct reconstruction failures (middle).
To standardize evaluation, we built MVRefer based on ScanNet and ScanRefer. It emulates embodied agents by sampling N=8 sparse frames per scene. We provide metrics that decouple grounding accuracy from reconstruction quality, including mIoU_global (3D) and mIoU_view (2D projection).
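The exact metric definitions are given in the paper; as a rough sketch, assuming mIoU_global is point-level IoU in the reconstructed 3D scene and mIoU_view is the IoU of the mask projected into each of the N input views and averaged over views, a minimal NumPy version might look like this (function names are illustrative):

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of any shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

def miou_global(pred_points: np.ndarray, gt_points: np.ndarray) -> float:
    """3D metric: IoU over per-point labels of the reconstructed scene."""
    return iou(pred_points.astype(bool), gt_points.astype(bool))

def miou_view(pred_masks: np.ndarray, gt_masks: np.ndarray) -> float:
    """2D metric: mean IoU of the mask projected into each of the N views.
    pred_masks, gt_masks: (N, H, W) boolean arrays."""
    return float(np.mean([iou(p, g) for p, g in zip(pred_masks, gt_masks)]))
```

Both scores are then averaged over all evaluation samples to obtain the reported mIoU values.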
We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an end-to-end framework designed for efficiency and robustness.
Figure 2: Architecture of MVGGT. It features a Frozen Reconstruction Branch (top) that provides a stable geometric scaffold, and a Trainable Multimodal Branch (bottom) that progressively injects language cues into visual features to predict the final 3D segmentation.
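As a rough illustration of this dual-branch design (not the released implementation; module names, dimensions, and the reconstruction backbone's interface are assumptions), a PyTorch-style forward pass could be sketched as follows:

```python
import torch
import torch.nn as nn

class MVGGTSketch(nn.Module):
    """Illustrative dual-branch forward pass (names and shapes are assumptions)."""

    def __init__(self, recon_backbone: nn.Module, d_model: int = 512, n_layers: int = 4):
        super().__init__()
        # Frozen reconstruction branch: supplies a stable geometric scaffold.
        self.recon = recon_backbone.eval()
        for p in self.recon.parameters():
            p.requires_grad_(False)
        # Trainable multimodal branch: cross-attention injects language cues
        # into the visual features at every layer.
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.mask_head = nn.Linear(d_model, 1)

    def forward(self, images: torch.Tensor, text_tokens: torch.Tensor):
        # images: (B, N_views, 3, H, W); text_tokens: (B, L, d_model)
        with torch.no_grad():
            # Assumed interface: geometric tokens (B, T, d_model) plus a 3D point map.
            vis_tokens, point_map = self.recon(images)
        x = vis_tokens
        for layer in self.layers:            # progressive language injection
            x = layer(x, text_tokens)
        logits = self.mask_head(x).squeeze(-1)  # per-token segmentation logits
        return logits, point_map
```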
The Problem (FGD): In sparse 3D space, the target object occupies an extremely small fraction of points (< 2%), so the background dominates the gradients. We call this Foreground Gradient Dilution.
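To see the dilution concretely, the toy example below (all values other than the <2% foreground ratio are made up) shows that under a standard per-point BCE loss with untrained logits, foreground points contribute only about 2% of the total gradient mass:

```python
import torch
import torch.nn.functional as F

# Toy scene: 100k points, ~2% belong to the referred object.
n_points, fg_ratio = 100_000, 0.02
labels = (torch.rand(n_points) < fg_ratio).float()

# Untrained logits near zero -> sigmoid ~0.5 everywhere.
logits = torch.zeros(n_points, requires_grad=True)
loss = F.binary_cross_entropy_with_logits(logits, labels)
loss.backward()

grad = logits.grad.abs()
fg_share = (grad[labels.bool()].sum() / grad.sum()).item()
print(f"foreground share of total gradient mass: {fg_share:.3f}")  # ~0.02
```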
Our Solution (PVSO): We introduce Per-view No-Target Suppression Optimization. Instead of relying solely on weak 3D signals, we enforce supervision in the dense 2D image domain.
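The precise PVSO formulation is given in the paper; the sketch below only illustrates the general idea of dense per-view supervision with explicit suppression of views in which the target is absent. The loss form and weighting here are assumptions, not the actual objective:

```python
import torch
import torch.nn.functional as F

def per_view_loss(pred_masks: torch.Tensor, gt_masks: torch.Tensor) -> torch.Tensor:
    """Per-view 2D supervision with no-target suppression (illustrative sketch).

    pred_masks: (N, H, W) logits, one dense mask per input view.
    gt_masks:   (N, H, W) binary ground truth; an all-zero mask means the
                referred object is not visible in that view.
    """
    losses = []
    for pred, gt in zip(pred_masks, gt_masks):
        if gt.any():
            # Target visible: standard dense BCE on this view.
            losses.append(F.binary_cross_entropy_with_logits(pred, gt.float()))
        else:
            # No-target view: push all logits down to suppress false positives.
            losses.append(F.binary_cross_entropy_with_logits(pred, torch.zeros_like(pred)))
    # Average over views so every view contributes a comparably strong gradient.
    return torch.stack(losses).mean()
```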
Figure 3: Visualization of the Foreground Gradient Dilution (FGD) problem and how PVSO resolves it.
Reconstruction and Segmentation of ScanNet Photos/Videos with MVGGT.
Example referring expression: "the cabinet is in the northwest corner of the room. the cabinet is a white rectangular box."
MVGGT significantly outperforms all other methods; please refer to our paper for quantitative results. Here we also provide a qualitative comparison with 2D-Lift and Two-Stage methods.
Example referring expression: "the fridge is tall, rectangular, and white. it is located to the right of the stove."
@misc{wu2026mvggt,
Author = {Changli Wu and Haodong Wang and Jiayi Ji and Yutian Yao and Chunsai Du and Jihua Kang and Yanwei Fu and Liujuan Cao},
Title = {MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation},
Year = {2026}
}