The Multi-Armed Bandit Problem: Decomposition and Computation

Michael N. Katehakis and Arthur F. Veinott, Jr.
Mathematics of Operations Research
Vol. 12, No. 2 (May, 1987), pp. 262-268
Published by: INFORMS
Stable URL: http://www.jstor.org/stable/3689689
Page Count: 7
Abstract

The multi-armed bandit problem arises in sequentially allocating effort to one of N projects and sequentially assigning patients to one of N treatments in clinical trials. Gittins and Jones (1974) have shown that one optimal policy for the N-project problem, an N-dimensional discounted Markov decision chain, is determined by the following largest-index rule. There is an index for each state of each given project that depends only on the data of that project. In each period one allocates effort to a project with largest current index. The purpose of this paper is to give a short proof of this result and a new characterization of the index of a project in state i, viz., as the maximum expected present value in state i for the restart-in-i problem in which, in each state and period, one either continues allocating effort to the project or immediately restarts the project in state i. Moreover, it is shown that an approximate largest-index rule yields an approximately optimal policy. These results lead to more efficient methods of computing the indices on-line and/or for sparse transition matrices in large state spaces than have been suggested heretofore. By using a suitable implementation of successive approximations, a policy whose expected present value is within ε% of the maximum possible range of values of the indices can be found on-line with at most (N + T - 1)TM operations, where M is the number of operations required to calculate one approximation, T is the least integer majorizing the ratio ln ε / ln a, and 0 < a < 1 is the discount factor.
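
The restart-in-i characterization described in the abstract can be illustrated with a minimal sketch of successive approximations for a single project. This is not the paper's implementation: the function name `restart_index`, the inputs `P` (transition matrix as nested lists), `r` (one-period reward per state), and the stopping tolerance `tol` are all assumptions made for illustration. The sketch returns the maximum expected present value in state i of the restart-in-i problem; some presentations scale this value by (1 - a).

```python
def restart_index(P, r, i, a, tol=1e-10, max_iter=100_000):
    """Value in state i of the restart-in-i problem (hypothetical helper).

    P   : transition matrix of one project, P[j][k] = Pr(j -> k)  (assumed input)
    r   : one-period reward in each state                          (assumed input)
    i   : state whose index is wanted
    a   : discount factor, 0 < a < 1
    """
    n = len(r)
    V = [0.0] * n  # initial approximation
    for _ in range(max_iter):
        # Value of continuing to allocate effort from each state j.
        cont = [r[j] + a * sum(P[j][k] * V[k] for k in range(n)) for j in range(n)]
        # Restarting in state i earns exactly what continuing from i earns,
        # so in every state one takes the better of "continue" and "restart".
        V_new = [max(c, cont[i]) for c in cont]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new[i]
        V = V_new
    return V[i]


def largest_index_choice(projects, states, a):
    """Pick a project with the largest current index (illustrative only).

    projects : list of (P, r) pairs, one per project
    states   : current state of each project
    """
    indices = [restart_index(P, r, s, a) for (P, r), s in zip(projects, states)]
    return max(range(len(projects)), key=lambda n: indices[n])
```

Because each index depends only on the data of its own project, `largest_index_choice` solves N small one-project problems rather than one N-dimensional problem, which is the decomposition the abstract refers to.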
