We propose a new task, video referring matting, which obtains the alpha matte of a specified instance given a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely tied to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that simultaneously contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code will be made publicly available to facilitate further development in this field.
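The exact form of the instance-separating loss is defined in the paper; purely as a hedged illustration of the general idea, the sketch below implements a generic InfoNCE-style contrastive loss over per-instance latent embeddings. The function name, the `temperature` parameter, and the input layout are all our assumptions, not the paper's specification.

```python
import numpy as np

def latent_contrastive_loss(latents, instance_ids, temperature=0.1):
    """Hypothetical contrastive loss over instance latents.

    latents:      (N, D) array, one latent embedding per sample
    instance_ids: (N,) array; rows with equal ids belong to the same instance
    Pulls same-instance latents together and pushes different instances apart.
    """
    # Normalize so pairwise similarities are cosine similarities.
    z = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)              # exclude self-similarity
    same = instance_ids[:, None] == instance_ids[None, :]
    np.fill_diagonal(same, False)               # positives: same id, not self
    # Row-wise log-softmax via a numerically stable log-sum-exp.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - (row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True)))
    # Average negative log-likelihood of positives; skip anchors with no positive.
    has_pos = same.any(axis=1)
    pos_log_prob = np.where(same, log_prob, 0.0)
    per_anchor = -pos_log_prob[has_pos].sum(axis=1) / same[has_pos].sum(axis=1)
    return per_anchor.mean()
```

In a diffusion setting this kind of term would be applied to the latents of different instances so that the generated alpha mattes stay attributable to the correct referred instance.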
A great deal of excellent work was introduced around the same time as ours. Video Instance Matting and MaGGIe introduce mask-guided video instance matting. VideoMatte240K and BG-20K are the excellent source datasets we used to synthesize our video referring matting dataset.
@article{yang2025vrmdifftextguidedvideoreferring,
  title={VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion},
  author={Lehan Yang and Jincen Song and Tianlong Wang and Daiqing Qi and Weili Shi and Yuheng Liu and Sheng Li},
  year={2025},
  eprint={2503.10678},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.10678},
}