TeMTG: Text-Enhanced Multi-Hop Temporal Graph Modeling for Audio-Visual Video Parsing
May 04, 2025 Β· Declared Dead Β· π International Conference on Multimedia Retrieval
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Yaru Chen, Peiliang Zhang, Fei Li, Faegheh Sardari, Ruohao Guo, Zhenbo Li, Wenwu Wang
arXiv ID
2505.02096
Category
cs.MM: Multimedia
Citations
2
Venue
International Conference on Multimedia Retrieval
Last Checked
3 months ago
Abstract
Audio-Visual Video Parsing (AVVP) task aims to parse the event categories and occurrence times from audio and visual modalities in a given video. Existing methods usually focus on implicitly modeling audio and visual features through weak labels, without mining semantic relationships for different modalities and explicit modeling of event temporal dependencies. This makes it difficult for the model to accurately parse event information for each segment under weak supervision, especially when high similarity between segmental modal features leads to ambiguous event boundaries. Hence, we propose a multimodal optimization framework, TeMTG, that combines text enhancement and multi-hop temporal graph modeling. Specifically, we leverage pre-trained multimodal models to generate modality-specific text embeddings, and fuse them with audio-visual features to enhance the semantic representation of these features. In addition, we introduce a multi-hop temporal graph neural network, which explicitly models the local temporal relationships between segments, capturing the temporal continuity of both short-term and long-range events. Experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance in multiple key indicators in the LLP dataset.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Multimedia
π
π
Old Age
R.I.P.
π»
Ghosted
Viewport-Adaptive Navigable 360-Degree Video Delivery
π
π
The Cartographer
A Comprehensive Survey on Cross-modal Retrieval
π
π
The Cartographer
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
R.I.P.
π»
Ghosted
A Convolutional Neural Network Approach for Post-Processing in HEVC Intra Coding
R.I.P.
π»
Ghosted
Video Generation From Text
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted