Follow-Your-MultiPose: Tuning-Free Multi-Character Text-to-Video Generation via Pose Guidance

December 21, 2024 · Declared Dead · 🏛 IEEE International Conference on Acoustics, Speech, and Signal Processing

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Beiyuan Zhang, Yue Ma, Chunlei Fu, Xinyang Song, Zhenan Sun, Ziqiang Li arXiv ID 2412.16495 Category cs.CV: Computer Vision Cross-listed cs.MM Citations 8 Venue IEEE International Conference on Acoustics, Speech, and Signal Processing Last Checked 4 months ago

Abstract

Text-editable and pose-controllable character video generation is a challenging but prevailing topic with practical applications. However, existing approaches mainly focus on single-object video generation with pose guidance, ignoring the realistic situation that multi-character appear concurrently in a scenario. To tackle this, we propose a novel multi-character video generation framework in a tuning-free manner, which is based on the separated text and pose guidance. Specifically, we first extract character masks from the pose sequence to identify the spatial position for each generating character, and then single prompts for each character are obtained with LLMs for precise text guidance. Moreover, the spatial-aligned cross attention and multi-branch control module are proposed to generate fine grained controllable multi-character video. The visualized results of generating video demonstrate the precise controllability of our method for multi-character generation. We also verify the generality of our method by applying it to various personalized T2I models. Moreover, the quantitative results show that our approach achieves superior performance compared with previous works.