The DP formalism encodes information in the form of a "reward-to-go" function (see Puterman, 1994, for details) and chooses an action that maximizes the sum of the immediate reward and the expected "reward-to-go". Thus, to compute the optimal action in any given state, the "reward-to-go" function for all future states must be known. In many applications of DP, the number of states and the number of actions available in each state are large; consequently, the computational effort required to compute the optimal policy for a DP can be overwhelming, a difficulty Bellman termed the "curse of dimensionality". For this reason, considerable recent research effort has focused on developing algorithms that compute an approximately optimal policy efficiently (Bertsekas and Tsitsiklis, 1996; de Farias and Van Roy, 2002).
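As a concrete sketch of this action-selection step (the notation here is illustrative and not taken from the text): writing $J^*(s')$ for the optimal "reward-to-go" of a future state $s'$, $r(s,a)$ for the immediate reward of taking action $a$ in state $s$, $P(s' \mid s, a)$ for the transition probabilities, and $\alpha \in (0,1)$ for a discount factor, the optimal action in state $s$ satisfies
\[
  a^*(s) \in \arg\max_{a} \Big\{ r(s,a) + \alpha \sum_{s'} P(s' \mid s, a)\, J^*(s') \Big\}.
\]
Evaluating this maximization requires $J^*(s')$ for every state $s'$ reachable from $s$, which is precisely why the computation becomes intractable when the state space is large.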