The Harder Path: Last Iterate Convergence for Uncoupled Learning in Zero-Sum Games with Bandit Feedback

April 17, 2026 ยท Grace Period ยท ๐Ÿ› ICML 2025

โณ Grace Period
This paper is less than 90 days old. We give authors time to release their code before passing judgment.
Authors Cรดme Fiegel, Pierre Mรฉnard, Tadashi Kozuno, Michal Valko, Vianney Perchet arXiv ID 2604.16087 Category cs.LG: Machine Learning Cross-listed stat.ML Citations 0 Venue ICML 2025
Abstract
We study the problem of learning in zero-sum matrix games with repeated play and bandit feedback. Specifically, we focus on developing uncoupled algorithms that guarantee, without communication between players, the convergence of the last-iterate to a Nash equilibrium. Although the non-bandit case has been studied extensively, this setting has only been explored recently, with a bound of $\mathcal{O}(T^{-1/8})$ on the exploitability gap. We show that, for uncoupled algorithms, guaranteeing convergence of the policy profiles to a Nash equilibrium is detrimental to the performance, with the best attainable rate being $ฮฉ(T^{-1/4})$ in contrast to the usual $ฮฉ(T^{-1/2})$ rate for convergence of the average iterates. We then propose two algorithms that achieve this optimal rate up to constant and logarithmic factors. The first algorithm leverages a straightforward trade-off between exploration and exploitation, while the second employs a regularization technique based on a two-step mirror descent approach.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

๐Ÿ“œ Similar Papers

In the same crypt โ€” Machine Learning