R.I.P.
👻
Ghosted
ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization
July 14, 2025 · Declared Dead · 🏛 arXiv.org
Authors
Huilai Li, Yonghao Dang, Ying Xing, Yiming Wang, Jianqin Yin
arXiv ID
2507.09945
Category
cs.MM: Multimedia
Cross-listed
cs.CV
Citations
0
Venue
arXiv.org
Repository
https://github.com/uchiha99999/ESG-Net
Last Checked
2 months ago
Abstract
Dense audio-visual event localization (DAVE) aims to identify event categories and locate their temporal boundaries in untrimmed videos. Most studies only employ event-related semantic constraints on the final outputs, lacking cross-modal semantic bridging in intermediate layers. This causes a modality semantic gap for further fusion, making it difficult to distinguish between event-related content and irrelevant background content. Moreover, they rarely consider the correlations between events, which limits the model's ability to infer concurrent events in complex scenarios. In this paper, we incorporate multi-stage semantic guidance and multi-event relationship modeling, which respectively enable hierarchical semantic understanding of audio-visual events and adaptive extraction of event dependencies, thereby better focusing on event-related information. Specifically, our event-aware semantic guided network (ESG-Net) includes an early semantics interaction (ESI) module and a mixture of dependency experts (MoDE) module. ESI applies multi-stage semantic guidance to explicitly constrain the model in learning semantic information through multi-modal early fusion and several classification loss functions, ensuring hierarchical understanding of event-related content. MoDE promotes the extraction of multi-event dependencies through multiple serial mixtures of experts with adaptive weight allocation. Extensive experiments demonstrate that our method significantly surpasses the state-of-the-art methods, while greatly reducing parameters and computational load. Our code will be released at https://github.com/uchiha99999/ESG-Net.
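The abstract describes MoDE only as "multiple serial mixture of experts with adaptive weight allocation". As a rough illustration of that general idea, and not the authors' implementation, here is a minimal pure-Python sketch of stacked mixture-of-experts layers with a softmax gate. Everything in it is an assumption made for illustration: the names `MixtureOfExperts` and `serial_moe`, the linear experts, the random weights standing in for learned parameters, and the layer and expert counts.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class MixtureOfExperts:
    """Hypothetical MoE layer: a gate assigns input-dependent weights to linear experts."""

    def __init__(self, dim, num_experts, rng):
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]
        self.gate = random_matrix(num_experts, dim, rng)

    def gate_weights(self, x):
        # Adaptive weight allocation: one softmax weight per expert, computed from x.
        return softmax(matvec(self.gate, x))

    def forward(self, x):
        weights = self.gate_weights(x)
        outputs = [matvec(expert, x) for expert in self.experts]
        # Weighted sum of expert outputs, one output dimension at a time.
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(x))
        ]


def serial_moe(x, layers):
    """Apply MoE layers one after another (the 'serial' part)."""
    for layer in layers:
        x = layer.forward(x)
    return x


rng = random.Random(0)
layers = [MixtureOfExperts(dim=4, num_experts=3, rng=rng) for _ in range(2)]
feature = [0.2, -0.1, 0.4, 0.0]  # made-up fused audio-visual feature
output = serial_moe(feature, layers)
weights = layers[0].gate_weights(feature)
```

Here `weights` holds the first layer's gate output for this input: three positive values that sum to 1, so each expert's contribution depends on the feature rather than being fixed. How the actual MoDE module defines its experts and gating is not stated in the abstract.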
📜 Similar Papers
In the same crypt — Multimedia
🌅
🌅
Old Age
Quality Assessment of In-the-Wild Videos
R.I.P.
👻
Ghosted
Viewport-Adaptive Navigable 360-Degree Video Delivery
R.I.P.
👻
Ghosted
A Comprehensive Survey on Cross-modal Retrieval
R.I.P.
👻
Ghosted
An Overview of Cross-media Retrieval: Concepts, Methodologies, Benchmarks and Challenges
R.I.P.
👻
Ghosted
A Convolutional Neural Network Approach for Post-Processing in HEVC Intra Coding
Died the same way — ⚰️ The Empty Tomb
R.I.P.
⚰️
The Empty Tomb
DSFD: Dual Shot Face Detector
R.I.P.
⚰️
The Empty Tomb
InstanceCut: from Edges to Instances with MultiCut
R.I.P.
⚰️
The Empty Tomb
FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis
R.I.P.
⚰️
The Empty Tomb