where j is the state subsequent to i, and E{· | i, u} denotes the expected value with
respect to j, given i and u. Generally, at each state i, it is optimal to use a
control u that attains the minimum above. Thus, decisions are ranked based on
the sum of the expected cost of the present period, and the optimal expected
cost of all subsequent periods.
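As a concrete illustration (not from the original text), the following sketch selects, at a given state i, a control attaining the minimum of the expected present-period cost plus the cost-to-go, for a hypothetical finite-state problem; the arrays P (transition probabilities) and g (stage costs), the set controls, and the function name greedy_control are all illustrative assumptions.

```python
import numpy as np

def greedy_control(i, J, P, g, controls):
    """Pick a control u attaining min over u of E{ g(i,u,j) + J(j) | i, u }.

    Assumed (illustrative) data layout:
      P[u][i][j] : probability of moving from state i to state j under control u
      g[u][i][j] : one-stage cost of that transition
      J          : array of cost-to-go values, one entry per state
    """
    best_u, best_q = None, np.inf
    for u in controls:
        # expected cost of the present period plus expected cost of subsequent periods
        q = np.dot(P[u][i], g[u][i] + J)
        if q < best_q:
            best_u, best_q = u, q
    return best_u, best_q
```

With J taken to be the optimal cost function, this minimization ranks the controls exactly as described above.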
The objective of DP is to calculate numerically the optimal cost function $J^*$. This computation can be done off-line, i.e., before the real system starts operating. An optimal policy, that is, an optimal choice of u for each i, is computed either simultaneously with $J^*$, or in real time by minimizing in the
right-hand side of Bellman’s equation. It is well known, however, that for many
important problems the computational requirements of DP are overwhelming,
mainly because of a very large number of states and controls (Bellman’s “curse
of dimensionality”). In such situations a suboptimal solution is required.
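For reference, a minimal off-line value iteration sketch is given below; it repeatedly applies the state-by-state minimization until the cost estimates settle, which is one standard way to compute $J^*$ numerically. Every sweep visits all states, controls, and successor states, which is precisely what becomes overwhelming when the state space is very large. The data layout matches the earlier sketch, and the discount factor alpha is an added assumption for the sake of a convergent example, not part of the text's formulation.

```python
import numpy as np

def value_iteration(P, g, controls, n_states, alpha=0.95, tol=1e-6):
    """Off-line computation of (an approximation to) J*, assuming for
    illustration a discounted problem; each sweep re-evaluates every state,
    so the work grows with the number of state-control-successor triples."""
    J = np.zeros(n_states)
    while True:
        J_new = np.array([
            min(np.dot(P[u][i], g[u][i] + alpha * J) for u in controls)
            for i in range(n_states)
        ])
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new
```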
Cost Approximations in Dynamic Programming
NDP methods are suboptimal methods that center around the approximate evaluation of the optimal cost function $J^*$, possibly through the use of neural networks and/or simulation. In particular, we replace the optimal cost $J^*(j)$ with a suitable approximation $\tilde{J}(j, r)$, where r is a vector of parameters, and we use at state i the (suboptimal) control $\tilde{\mu}(i)$ that attains the minimum in the
(approximate) right-hand side of Bellman’s equation