SONIQ: System-Optimized Noise-Injected Ultra-Low-Precision Quantization with Full-Precision Parity

November 23, 2023 · Declared Dead · + Add venue

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Cyrus Zhou, Pedro Savarese, Zack Hassman, Vaughn Richard, Michael DiBrino, Michael Maire, Yanjing Li arXiv ID 2311.14114 Category cs.AR: Hardware Architecture Cross-listed cs.LG, cs.PF Citations 1 Last Checked 3 months ago

Abstract

Ultra-low-precision inference can sharply reduce memory and latency but often degrades accuracy and relies on specialized hardware. We present SONIQ, a system-optimized, noise-injected quantization framework that learns per-channel mixed precision for both weights and activations while training under the same rules used at inference. By injecting hardware-calibrated quantization noise during training, SONIQ steers models toward the discrete arithmetic used at deployment -- without bespoke runtimes. Across CNNs and Transformers, SONIQ achieves up to 16x and 7x compression, respectively, while matching or exceeding full-precision accuracy. Measured end-to-end, SONIQ delivers up to 7.3x CPU speedup over strong INT8 baselines and up to 6.3x (vector units) / 2.8x (tensor cores) GPU speedup relative to FP16. A practical outcome is that two per-channel precision levels -- one in the 1--4-bit range and one in the 4--8-bit range -- suffice in practice; at inference, each channel selects one of the two, keeping kernels simple and fast. To our knowledge, SONIQ is the first framework to reach or surpass full-precision accuracy under ultra-low (1--4 bits per parameter) regimes while remaining deployable on commodity hardware, narrowing the gap between quantization theory and practical, high-throughput inference.