MAG: Multi-Modal Aligned Autoregressive Co-Speech Gesture Generation without Vector Quantization
March 18, 2025 · Declared Dead · 🏛 arXiv.org
"Paper promises code 'coming soon'"
Evidence collected by the PWNC Scanner
Authors
Binjie Liu, Lina Liu, Sanyi Zhang, Songen Gu, Yihao Zhi, Tianyi Zhu, Lei Yang, Long Ye
arXiv ID
2503.14040
Category
cs.GR: Graphics
Cross-listed
cs.CV, cs.SD
Citations
0
Venue
arXiv.org
Last Checked
1 month ago
Abstract
This work focuses on full-body co-speech gesture generation. Existing methods typically employ an autoregressive model accompanied by vector-quantized tokens for gesture generation, which results in information loss and compromises the realism of the generated gestures. To address this, inspired by the natural continuity of real-world human motion, we propose MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis without relying on discrete tokenization. Specifically, (1) we introduce a motion-text-audio-aligned variational autoencoder (MTA-VAE), which leverages pre-trained WavCaps' text and audio embeddings to enhance both semantic and rhythmic alignment with motion, ultimately producing more realistic gestures. (2) Building on this, we propose a multimodal masked autoregressive model (MMAG) that enables autoregressive modeling in continuous motion embeddings through diffusion without vector quantization. To further ensure multi-modal consistency, MMAG incorporates a hybrid granularity audio-text fusion block, which serves as conditioning for the diffusion process. Extensive experiments on two benchmark datasets demonstrate that MAG achieves state-of-the-art performance both quantitatively and qualitatively, producing highly realistic and diverse co-speech gestures. The code will be released to facilitate future research.
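The abstract's claim that vector-quantized tokens lose information can be illustrated with a toy example. The sketch below is not the authors' code (none has been released); the codebook, the 2-D latent, and the `quantize` helper are all hypothetical and chosen only to show the effect. It snaps a continuous latent to its nearest codebook entry and measures the resulting error, which a continuous representation would not incur.

```python
import math


def quantize(latent, codebook):
    """Return the codebook entry closest (Euclidean distance) to `latent`."""
    return min(codebook, key=lambda code: math.dist(latent, code))


# Hypothetical 2-D codebook and motion latent; the values are illustrative only.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latent = (0.4, 0.9)

# Vector quantization replaces the latent with its nearest code.
quantized = quantize(latent, codebook)

# Distance between the original latent and its code: the detail that is lost.
vq_error = math.dist(latent, quantized)

# A continuous pipeline keeps the latent unchanged, so this step adds no error.
continuous_error = math.dist(latent, latent)

print("quantized latent:", quantized)
print("quantization error:", vq_error)
print("continuous error:", continuous_error)
```

A larger codebook shrinks this error but does not remove it, which is the motivation the abstract gives for modeling continuous motion embeddings directly.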
📜 Similar Papers
In the same crypt — Graphics
R.I.P. · 👻 Ghosted · Everybody Dance Now
R.I.P. · 👻 Ghosted · Deep Bilateral Learning for Real-Time Image Enhancement
R.I.P. · 👻 Ghosted · Animating Human Athletics
R.I.P. · 👻 Ghosted · BundleFusion: Real-time Globally Consistent 3D Reconstruction using On-the-fly Surface Re-integration
R.I.P. · 👻 Ghosted · Shape Transformation Using Variational Implicit Functions
Died the same way — ⏳ Coming Soon™
R.I.P. · ⏳ Coming Soon™ · Exploring Simple Siamese Representation Learning
R.I.P. · ⏳ Coming Soon™ · An Analysis of Scale Invariance in Object Detection - SNIP
R.I.P. · ⏳ Coming Soon™ · Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection
R.I.P.
⏳
Coming Soon™