You are not currently logged in.
Access JSTOR through your library or other institution:
The Multi-Armed Bandit Problem: Decomposition and Computation
Michael N. Katehakis and Arthur F. Veinott, Jr.
Mathematics of Operations Research
Vol. 12, No. 2 (May, 1987), pp. 262-268
Published by: INFORMS
Stable URL: http://www.jstor.org/stable/3689689
Page Count: 7
Preview not available
The multi-armed bandit problem arises in sequentially allocating effort to one of N projects and sequentially assigning patients to one of N treatments in clinical trials. Gittins and Jones (1974) have shown that one optimal policy for the N-project problem, an N-dimensional discounted Markov decision chain, is determined by the following largest-index rule. There is an index for each state of each given project that depends only on the data of that project. In each period one allocates effort to a project with largest current index. The purpose of this paper is to give a short proof of this result and a new characterization of the index of a project in state i, viz., as the maximum expected present value in state i for the restart-in-i problem in which, in each state and period, one either continues allocating effort to the project or immediately restarts the project in state i. Moreover, it is shown that an approximate largest-index rule yields an approximately optimal policy. These results lead to more efficient methods of computing the indices on-line and/or for sparse transition matrices in large state spaces than have been suggested heretofore. By using a suitable implementation of successive approximations, a policy whose expected present value is within ε % of the maximum possible range of values of the indices can be found on-line with at most (N + T - 1)TM operations where M is the number of operations required to calculate one approximation, T is the least integer majorizing the ratio ln ε /ln a and 0 < a < 1 is the discount factor.
Mathematics of Operations Research © 1987 INFORMS