Decoupled Weight Decay for Any $p$ Norm

April 16, 2024 · arXiv.org

Repo contents: .gitignore, README.md, python

Authors: Nadav Joseph Outmezguine, Noam Levi
arXiv ID: 2404.10824
Category: cs.LG (Machine Learning)
Cross-listed: cs.AI, cs.NE, math.OC
Citations: 4
Venue: arXiv.org
Repository: https://github.com/Nadav-out/PAdam
Last checked: 3 months ago

Abstract

With the success of deep neural networks (NNs) in a variety of domains, the computational and storage requirements for training and deploying large NNs have become a bottleneck for further improvements. Sparsification has consequently emerged as a leading approach to tackle these issues. In this work, we consider a simple yet effective approach to sparsification, based on Bridge, or $L_p$, regularization applied during training. We introduce a novel weight decay scheme, which generalizes the standard $L_2$ weight decay to any $p$ norm. We show that this scheme is compatible with adaptive optimizers, and that it avoids the gradient divergence associated with $L_p$ norms for $0<p<1$. We empirically demonstrate that it leads to highly sparse networks, while maintaining generalization performance comparable to standard $L_2$ regularization.
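To make the gradient divergence mentioned in the abstract concrete, the following is a minimal NumPy sketch. It is not taken from the paper or its repository. The function names (`lp_grad`, `decoupled_lp_decay`), the hyperparameter values, and in particular the multiplicative update form $w \to w / (1 + \eta \lambda p |w|^{p-2})$ are assumptions chosen for illustration; the scheme the authors actually use may differ. The sketch contrasts an explicit gradient step on $\lambda \sum_i |w_i|^p$, which overshoots for small weights when $p<1$, with a decoupled shrinkage step that can only reduce each weight's magnitude.

```python
import numpy as np

def lp_grad(w, p):
    """Gradient of sum(|w|**p) with respect to w (blows up as w -> 0 when p < 1)."""
    return p * np.sign(w) * np.abs(w) ** (p - 1)

def decoupled_lp_decay(w, p, lr, lam, eps=1e-12):
    """Hypothetical decoupled decay step: w / (1 + lr * lam * p * |w|**(p - 2)).

    The divisor is always >= 1, so the step shrinks |w| and never flips its sign.
    eps is an assumed small constant that keeps the power finite at w == 0.
    """
    scale = 1.0 + lr * lam * p * (np.abs(w) + eps) ** (p - 2)
    return w / scale

w = np.array([1.0, 1e-3, 1e-6])
p, lr, lam = 0.5, 1e-3, 1e-2

# Explicit gradient step on the L_p penalty: the gradient grows as |w| shrinks,
# so the smallest weight is thrown far past zero.
g = lp_grad(w, p)
w_explicit = w - lr * lam * g

# Decoupled shrinkage step: every weight moves toward zero without overshooting.
w_decayed = decoupled_lp_decay(w, p, lr, lam)

print("gradient:     ", g)
print("explicit step:", w_explicit)
print("decoupled:    ", w_decayed)
```

Under these assumptions, setting `p=2` makes the divisor a constant, so the step reduces to a uniform rescaling of the weights in the spirit of standard decoupled $L_2$ weight decay.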