Enhancing Vision-Language Model Pre-training with Image-text Pair Pruning Based on Word Frequency

October 09, 2024 · Declared Dead · 🏛 International Conference on Content-Based Multimedia Indexing

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Mingliang Liang, Martha Larson
arXiv ID: 2410.10879
Category: cs.LG (Machine Learning)
Cross-listed: cs.AI, cs.CL, cs.CV
Citations: 0
Venue: International Conference on Content-Based Multimedia Indexing
Last Checked: 4 months ago

Abstract

We propose Word-Frequency-based Image-Text Pair Pruning (WFPP), a novel data pruning method that improves the efficiency of vision-language models (VLMs). Unlike MetaCLIP, our method does not need metadata for pruning, but selects image-text pairs to prune based on the content of the text. Specifically, WFPP prunes image-text pairs whose text contains words that occur with high frequency across the entire training dataset. The effect of WFPP is to reduce the dominance of frequent words. The result is a better-balanced word-frequency distribution in the dataset, which is known to improve the training of word embedding models. After pre-training on the pruned subset, we fine-tune the model on the entire dataset for one additional epoch to achieve better performance. Our experiments demonstrate that applying WFPP when training a CLIP model improves performance on a wide range of downstream tasks. WFPP also speeds up pre-training because it uses fewer samples. Additionally, we analyze the training data before and after pruning to visualize how WFPP changes the balance of word frequencies. We hope our work encourages researchers to consider the distribution of words in the training data when pre-training VLMs, not limited to CLIP.
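
The pruning idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring rule, so everything below is an assumption made for illustration: the whitespace tokenizer, the score (mean relative frequency of a caption's words), the deterministic "drop the highest-scoring captions" rule, and the names `tokenize`, `prune_by_word_frequency`, and `keep_ratio` are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def tokenize(caption):
    # Assumption: lowercase whitespace tokenization, chosen only for brevity.
    return caption.lower().split()


def prune_by_word_frequency(captions, keep_ratio=0.5):
    """Return sorted indices of the image-text pairs to keep.

    Hypothetical scoring rule (not the paper's): a caption's score is the
    mean relative corpus frequency of its words. Captions dominated by
    frequent words get high scores and are pruned first.
    """
    token_lists = [tokenize(c) for c in captions]

    # Word counts over the entire set of captions.
    counts = Counter(tok for toks in token_lists for tok in toks)
    total = sum(counts.values())

    def score(toks):
        if not toks:
            return 0.0  # empty captions carry no frequent words
        return sum(counts[t] / total for t in toks) / len(toks)

    scores = [score(toks) for toks in token_lists]

    # Keep the lowest-scoring fraction of the pairs.
    n_keep = max(1, round(keep_ratio * len(captions)))
    ranked = sorted(range(len(captions)), key=lambda i: scores[i])
    return sorted(ranked[:n_keep])


captions = [
    "a photo of a dog",
    "a photo of a cat",
    "a photo of a dog",
    "rare zebra grazing",
]
kept = prune_by_word_frequency(captions, keep_ratio=0.5)
print([captions[i] for i in kept])
```

In this toy example the captions built mostly from the most common words ("a", "photo", "of", "dog") are the ones removed, which shifts the remaining word distribution toward rarer words. A real pipeline would return the matching image paths alongside the captions, and might sample pairs probabilistically rather than applying a hard cutoff.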