VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion

1University of Virginia, 2Columbia University, 3Texas A&M University

We treat video matting as a generation task with referring capabilities: given a caption that describes an instance, our model outputs that instance's alpha matte.

Abstract

We propose a new task, video referring matting, which produces the alpha matte of a specified instance given a referring caption. We cast the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely tied to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. We also introduce a large-scale video referring matting dataset of 10,000 videos. To the best of our knowledge, this is the first dataset that jointly contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code will be made publicly available to facilitate further developments in this field.

Video

Coming soon

Related Links

There is a lot of excellent work that was introduced around the same time as ours.

Video Instance Matting and MaGGIe introduce mask-guided video instance matting.

VideoMatte240K and BG-20K are the excellent datasets we used to synthesize our video referring matting dataset.

BibTeX

@article{yang2025vrmdifftextguidedvideoreferring,
      title={VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion}, 
      author={Lehan Yang and Jincen Song and Tianlong Wang and Daiqing Qi and Weili Shi and Yuheng Liu and Sheng Li},
      year={2025},
      eprint={2503.10678},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.10678}, 
}