Article

Regret bound for Narendra-Shapiro bandit algorithms

Fabien Panloup, Sofiane Saadane, and Sébastien Gadat

Abstract

Narendra-Shapiro (NS) algorithms are bandit-type algorithms developed in the 1960s which have been deeply studied in infinite horizon but for which scarce non-asymptotic results exist. In this paper, we focus on a non-asymptotic study of the regret and address the following question: are Narendra-Shapiro bandit algorithms competitive from this point of view? In our main result, we obtain some uniform explicit bounds for the regret of (over)-penalized-NS algorithms. We also extend to the multi-armed case some convergence properties of penalized-NS algorithms towards a stationary Piecewise Deterministic Markov Process (PDMP). Finally, we establish some new sharp mixing bounds for these processes.

Keywords

Regret; Stochastic Bandit Algorithms; Piecewise Deterministic Markov Processes;

Replaces

Sébastien Gadat, Fabien Panloup, and Sofiane Saadane, Regret bound for Narendra-Shapiro bandit algorithms, TSE Working Paper, n. 15-556, February 2015, revised May 2016.

Reference

Fabien Panloup, Sofiane Saadane, and Sébastien Gadat, Regret bound for Narendra-Shapiro bandit algorithms, Stochastics, May 2018, pp. 886–926.

Published in

Stochastics, May 2018, pp. 886–926