Finite-time Analysis of the Multiarmed Bandit Problem

被引：54

作者：

Peter Auer

Nicolò Cesa-Bianchi

Paul Fischer

机构：

[1] University of Technology Graz,DTI

[2] University of Milan,Lehrstuhl Informatik II

[3] Universität Dortmund,undefined

来源：

Machine Learning | 2002年 / 47卷

关键词：

bandit problems; adaptive allocation rules; finite horizon regret;

D O I：

暂无

中图分类号：

学科分类号：

摘要：

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.

引用

页码：235 / 256

页数：21

共 9 条

[1]

Agrawal R.(1995)Sample mean based index policies with Advances in Applied Probability 27 1054-1078

[2]

Burnetas A.(1996)log Advances in Applied Mathematics 17 122-142

[3]

Katehakis M.(1994) regret for the multi-armed bandit problem Journal of Optimization Theory and Applications 83 113-154

[4]

Ishikida T.(1985)Optimal adaptive policies for sequential allocation problems Advances in Applied Mathematics 6 4-22

[5]

Varaiya P.(1991)Multi-armed bandit problem revisited Annals of Operations Research 28 297-312

[6]

Lai T.(undefined)Asymptotically efficient adaptive allocation rules undefined undefined undefined-undefined

[7]

Robbins H.(undefined)Nonparametric bandit methods undefined undefined undefined-undefined

[8]

Yakowitz S.(undefined)undefined undefined undefined undefined-undefined

[9]

Lowe W.(undefined)undefined undefined undefined undefined-undefined

← 1 →