Towards Best Practices for Training Multilingual Dense Retrieval Models

April 05, 2022 · Declared Dead · 🏛 ACM Trans. Inf. Syst.

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, Jimmy Lin
arXiv ID: 2204.02363
Category: cs.IR: Information Retrieval
Cross-listed: cs.CL
Citations: 45
Venue: ACM Trans. Inf. Syst.
Last Checked: 4 months ago
Abstract
Dense retrieval models using a transformer-based bi-encoder design have emerged as an active area of research. In this work, we focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design. Although recent work with multilingual transformers demonstrates that they exhibit strong cross-lingual generalization capabilities, there remain many open research questions, which we tackle here. Our study is organized as a "best practices" guide for training multilingual dense retrieval models, broken down into three main scenarios: where a multilingual transformer is available, but relevance judgments are not available in the language of interest; where both models and training data are available; and where training data are available but models are not. In considering these scenarios, we gain a better understanding of the role of multi-stage fine-tuning, the strength of cross-lingual transfer under various conditions, the usefulness of out-of-language data, and the advantages of multilingual vs. monolingual transformers. Our recommendations offer a guide for practitioners building search applications, particularly for low-resource languages, and while our work leaves open a number of research questions, we provide a solid foundation for future work.
