Compression via Pre-trained Transformers: A Study on Byte-Level Multimodal Data

October 07, 2024 · Declared Dead · 🏛 International Conference on Machine Learning

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors David Heurtel-Depeiges, Anian Ruoss, Joel Veness, Tim Genewein arXiv ID 2410.05078 Category cs.LG: Machine Learning Cross-listed cs.AI, cs.IT Citations 9 Venue International Conference on Machine Learning Last Checked 4 months ago

Abstract

Foundation models are strong data compressors, but when accounting for their parameter size, their compression ratios are inferior to standard compression algorithms. Naively reducing the parameter count does not necessarily help as it deteriorates predictions and, accordingly, compression. We conduct a large-scale empirical study to find a sweet spot where pre-trained vanilla transformers can achieve competitive compression ratios. To this end, we train models on 165GB of raw byte sequences of either text, image, or audio data (and all possible combinations of the three) and then compress 1GB of out-of-distribution (OOD) data from each modality. We find that relatively small models (millions of parameters) can outperform standard general-purpose compression algorithms (gzip, LZMA2) and even domain-specific compressors (PNG, JPEG-XL, FLAC) $\unicode{x2013}$ even when accounting for parameter size. We achieve, e.g., the lowest compression ratio of 0.49 on OOD audio data (vs. 0.54 for FLAC). We conduct extensive ablations and hyperparameter sweeps to study the impact of model- and dataset scale, and we investigate the effect of unimodal versus multimodal training. We find that even small models can be trained to perform well on multiple modalities, but unlike large-scale foundation models, transfer to unseen modalities is generally weak.