M3-CVC: Controllable Video Compression with Multimodal Generative Models
November 24, 2024 · Declared Dead · IEEE International Conference on Acoustics, Speech, and Signal Processing
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Rui Wan, Qi Zheng, Yibo Fan
arXiv ID
2411.15798
Category
eess.IV: Image & Video Processing
Cross-listed
cs.CV
Citations
6
Venue
IEEE International Conference on Acoustics, Speech, and Signal Processing
Last Checked
4 months ago
Abstract
Traditional and neural video codecs commonly encounter limitations in controllability and generality under ultra-low-bitrate coding scenarios. To overcome these challenges, we propose M3-CVC, a controllable video compression framework incorporating multimodal generative models. The framework utilizes a semantic-motion composite strategy for keyframe selection to retain critical information. For each keyframe and its corresponding video clip, a dialogue-based large multimodal model (LMM) approach extracts hierarchical spatiotemporal details, enabling both inter-frame and intra-frame representations for improved video fidelity while enhancing encoding interpretability. M3-CVC further employs a conditional diffusion-based, text-guided keyframe compression method, achieving high fidelity in frame reconstruction. During decoding, textual descriptions derived from LMMs guide the diffusion process to restore the original video's content accurately. Experimental results demonstrate that M3-CVC significantly outperforms the state-of-the-art VVC standard in ultra-low bitrate scenarios, particularly in preserving semantic and perceptual fidelity.
Similar Papers
In the same crypt: Image & Video Processing
R.I.P.
Ghosted
The Cartographer
Deep Learning for Hyperspectral Image Classification: An Overview
R.I.P.
Ghosted
U-Net and its variants for medical image segmentation: theory and applications
R.I.P.
Ghosted
Algorithm Unrolling: Interpretable, Efficient Deep Learning for Signal and Image Processing
R.I.P.
404 Not Found
Lightweight Image Super-Resolution with Information Multi-distillation Network
R.I.P.
Ghosted
Deep Learning on Image Denoising: An overview
Died the same way: Ghosted
R.I.P.
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
Ghosted