TIP and Polish: Text-Image-Prototype Guided Multi-Modal Generation via Commonality-Discrepancy Modeling and Refinement

November 12, 2025 · Declared Dead · 🏛 arXiv.org

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Zhiyong Ma, Jiahao Chen, Qingyuan Chuai, Zhengping Li
arXiv ID: 2511.21698
Category: cs.MM: Multimedia
Cross-listed: cs.AI
Citations: 0
Venue: arXiv.org
Last Checked: 4 months ago

Abstract
Multi-modal generation struggles to ensure thematic coherence and style consistency. Semantically, existing methods suffer from cross-modal mismatch and lack explicit modeling of commonality and discrepancy. Methods that rely on fine-grained training fail to balance semantic precision with writing style consistency. These shortcomings lead to suboptimal generation quality. To tackle these issues, we propose TIPPo, a simple yet effective framework with explicit input modeling and comprehensive optimization objectives. It extracts the input text and images via multi-modal encoder and adapters, then measures the visual prototype. Textual, Image, and Prototype signals are then fed to our proposed Dual Alignment Attention and Difference Operator modules before language model decoding. The proposed PolishPPO reinforces the style consistency, while the unsupervised contrastive learning during SFT mitigates inter-sample representation collapse. Experimental results demonstrate the promising performance of TIPPo in automatic evaluation and LLM-based criteria for creativity and semantic consistency.

📜 Similar Papers

In the same crypt: Multimedia

R.I.P. 👻 Ghosted

Video Generation From Text

Yitong Li, Martin Renqiang Min, ... (+3 more)

cs.MM · 🏛 AAAI · 📚 300 cites · 8 years ago

Died the same way: 👻 Ghosted