Medical Referring Image Segmentation via Next-Token Mask Prediction

Published in arXiv preprint arXiv:2511.05044, 2025

This work addresses medical referring image segmentation: identifying target regions in medical images from natural language descriptions. Recent approaches achieve promising results but typically rely on complex multimodal fusion designs or multi-stage decoders. This paper proposes a next-token mask prediction approach that simplifies the multimodal fusion pipeline, achieving competitive performance on medical referring image segmentation benchmarks while reducing architectural complexity.

Download Paper