Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval

May 14, 2026 · Grace Period · 🏛 International Journal of Machine Learning and Cybernetics 16, 4509-4524 (2025)

โณ Grace Period
This paper is less than 90 days old. We give authors time to release their code before passing judgment.
Authors: Bolin Zhang, Chao Yang, Bin Jiang, Takahiro Komamizu, Ichiro Ide
arXiv ID: 2605.14838
Category: cs.CV: Computer Vision
Cross-listed: cs.MM
Citations: 0
Venue: International Journal of Machine Learning and Cybernetics 16, 4509-4524 (2025)
Abstract
This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create a high-quality positive sample mask, highlighting video clips most relevant to the query. Concurrently, we classify other clips in the same video as the easy negative sample and the entire video as the hard negative sample. During training, we introduce forward and inverse masked query reconstruction tasks to impose more substantial constraints on the network, promoting more robust and stable retrieval performance. Extensive experiments on two standard benchmarks affirm the effectiveness of the proposed method in VMR.
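The mask construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how proposals are parameterized or how the masks are combined, so the normalized (center, width) parameterization, the `sigma_scale` constant, the element-wise mean in `combine_masks`, and all function names below are assumptions made for illustration. In the actual method the masks are learnable (the proposal parameters would be predicted by the network); here they are fixed numbers.

```python
import math

def gaussian_mask(center, width, num_clips, sigma_scale=9.0):
    """Per-clip Gaussian weights for one proposal.

    center and width are normalized to [0, 1]; sigma_scale is a
    hypothetical constant that maps the width to a standard deviation.
    Requires num_clips >= 2.
    """
    sigma = max(width / sigma_scale, 1e-6)
    positions = [i / (num_clips - 1) for i in range(num_clips)]
    return [math.exp(-((p - center) ** 2) / (2 * sigma ** 2)) for p in positions]

def combine_masks(masks):
    """Combine several proposal masks by element-wise mean (assumed rule)."""
    n = len(masks)
    return [sum(vals) / n for vals in zip(*masks)]

def negative_masks(positive):
    """Easy negative: the other clips of the same video (complement of the
    positive mask). Hard negative: the entire video (uniform mask)."""
    easy = [1.0 - w for w in positive]
    hard = [1.0] * len(positive)
    return easy, hard

# Three hypothetical proposals as (center, width) pairs.
proposals = [(0.50, 0.20), (0.55, 0.30), (0.45, 0.25)]
masks = [gaussian_mask(c, w, num_clips=11) for c, w in proposals]
positive = combine_masks(masks)
easy_neg, hard_neg = negative_masks(positive)
```

In a reconstruction-based setup, each of the three masks would weight the clip features before the masked query is reconstructed, and the training loss would favor reconstruction from the positive mask over the two negatives. That step is omitted here because the abstract gives no details on it.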

📜 Similar Papers

In the same crypt: Computer Vision

🌅 Old Age

Fast R-CNN

Ross Girshick

cs.CV · 🏛 ICCV · 📚 27.7K cites · 11 years ago