Abstract
Narendra-Shapiro (NS) algorithms are bandit-type algorithms developed in the 1960s that have been extensively studied in the infinite-horizon setting, but for which few non-asymptotic results exist. In this paper, we focus on a non-asymptotic study of the regret and address the following question: are Narendra-Shapiro bandit algorithms competitive from this point of view? In our main result, we obtain explicit uniform bounds on the regret of (over-)penalized NS algorithms. We also extend to the multi-armed case some convergence properties of penalized NS algorithms towards a stationary Piecewise Deterministic Markov Process (PDMP). Finally, we establish new sharp mixing bounds for these processes.
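To fix ideas, the sketch below shows a classical two-armed NS-type (linear reward-inaction) update and tracks its empirical regret; it is a minimal illustration only, not the paper's exact (over-)penalized scheme. The Bernoulli arms, the step-size choice gamma(n) = (n+1)^(-1/2) and the function name ns_two_armed are assumptions made for the example.

import numpy as np

def ns_two_armed(p_success, n_steps, gamma=lambda n: 1.0 / (n + 1) ** 0.5, rng=None):
    """Simplified two-armed NS-type (reward-inaction) update -- illustrative sketch.

    p_success : (p1, p2) Bernoulli success probabilities of the two arms (assumed model).
    gamma     : step-size sequence gamma_{n+1} (hypothetical choice here).
    Returns the trajectory of X_n, the probability of playing arm 1, and the
    cumulative regret against the best arm.
    """
    rng = np.random.default_rng() if rng is None else rng
    p1, p2 = p_success
    best = max(p1, p2)
    x = 0.5                          # start with both arms equally likely
    regret, xs = 0.0, [x]
    for n in range(n_steps):
        g = gamma(n)
        play_first = rng.random() < x
        reward = rng.random() < (p1 if play_first else p2)
        if reward:                   # reward-inaction: move only after a success
            x = x + g * (1 - x) if play_first else x - g * x
        regret += best - (p1 if play_first else p2)
        xs.append(x)
    return np.array(xs), regret

For instance, ns_two_armed((0.7, 0.4), 10_000) returns the trajectory of X_n together with the realized regret, which can be averaged over independent runs to inspect the non-asymptotic behaviour empirically.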
Keywords
Regret; Stochastic Bandit Algorithms; Piecewise Deterministic Markov Processes
Replaces
Sébastien Gadat, Fabien Panloup and Sofiane Saadane, "Regret bound for Narendra-Shapiro bandit algorithms", TSE Working Paper, no. 15-556, February 2015, revised May 2016.
Reference
Fabien Panloup, Sofiane Saadane and Sébastien Gadat, "Regret bound for Narendra-Shapiro bandit algorithms", Stochastics, May 2018, pp. 886–926.
Published in
Stochastics, May 2018, pp. 886–926