PianoBind: A Multimodal Joint Embedding Model for Pop-piano Music

September 04, 2025 · Declared Dead · 🏛 International Society for Music Information Retrieval Conference

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Hayeon Bang, Eunjin Choi, Seungheon Doh, Juhan Nam
arXiv ID: 2509.04215
Category: cs.SD (Sound)
Cross-listed: cs.IR, cs.MM
Citations: 0
Venue: International Society for Music Information Retrieval Conference
Last Checked: 4 months ago

Abstract

Solo piano music, despite being a single-instrument medium, possesses significant expressive capabilities, conveying rich semantic information across genres, moods, and styles. However, current general-purpose music representation models, predominantly trained on large-scale datasets, often struggle to capture subtle semantic distinctions within homogeneous solo piano music. Furthermore, existing piano-specific representation models are typically unimodal, failing to capture the inherently multimodal nature of piano music, expressed through audio, symbolic, and textual modalities. To address these limitations, we propose PianoBind, a piano-specific multimodal joint embedding model. We systematically investigate strategies for multi-source training and modality utilization within a joint embedding framework optimized for capturing fine-grained semantic distinctions in (1) small-scale and (2) homogeneous piano datasets. Our experimental results demonstrate that PianoBind learns multimodal representations that effectively capture subtle nuances of piano music, achieving superior text-to-music retrieval performance on in-domain and out-of-domain piano datasets compared to general-purpose music joint embedding models. Moreover, our design choices offer reusable insights for multimodal representation learning with homogeneous datasets beyond piano music.