Simulation and Training
Some of the most successful applications of neural networks are in the areas
of pattern recognition, nonlinear regression, and nonlinear system identification.
In these applications the neural network is used as a universal approximator:
the input-output mapping of the neural network is matched to an unknown
nonlinear mapping F of interest using a least-squares optimization. This optimization
is known as training the network. To perform training, one must have
some training data, that is, a set of pairs (i, F(i)) that is representative of
the mapping F to be approximated.
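For concreteness, such a least-squares fit can be sketched as follows. Everything in the sketch is illustrative: the unknown mapping F is stood in by np.sin, and the "network" is a one-hidden-layer tanh network whose hidden weights are fixed at random, so that training the output weights reduces to a linear least-squares problem.

```python
import numpy as np

# Illustrative stand-in for the unknown nonlinear mapping F of interest.
rng = np.random.default_rng(0)
F = np.sin

# Training set: pairs (i, F(i)) representative of the mapping F.
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = F(x)

# Hidden layer: 20 tanh units with randomly fixed weights and biases.
n_hidden = 20
W1 = rng.normal(scale=0.5, size=(1, n_hidden))
b1 = rng.normal(scale=1.0, size=n_hidden)
h = np.tanh(x @ W1 + b1)

# Output layer: solve the least-squares problem min ||[h, 1] w - y||^2,
# matching the network's input-output mapping to the training pairs.
h_aug = np.hstack([h, np.ones((len(x), 1))])
w, *_ = np.linalg.lstsq(h_aug, y, rcond=None)

# Training error of the fitted approximator.
mse = float(np.mean((h_aug @ w - y) ** 2))
```

Fixing the hidden weights keeps the sketch to a single linear solve; in practice all weights would be adjusted, e.g. by gradient methods, but the objective is the same least-squares criterion.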
It is important to note that in contrast with these neural network applications,
in the DP context there is no readily available training set of input-output
pairs (i, J∗(i)), which can be used to approximate J∗ with a least squares fit.
The only possibility is to evaluate (exactly or approximately) by simulation the
cost functions of given (suboptimal) policies, and to try to iteratively improve
these policies based on the simulation outcomes. This creates analytical and
computational difficulties that do not arise in classical neural network training
contexts. Indeed, the use of simulation to approximately evaluate the optimal
cost function is a key new idea that distinguishes the methodology of this
article from earlier approximation methods in DP.
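The following sketch illustrates what "evaluating the cost of a given policy by simulation" amounts to. The 3-state chain, its transition probabilities P, and the per-stage costs g are all hypothetical, and gamma is a discount factor; the cost of the fixed policy from a state is estimated by averaging discounted costs over many simulated trajectories, using nothing beyond the ability to sample transitions.

```python
import random

random.seed(0)

# Hypothetical Markov chain under a fixed policy:
# state -> list of (probability, next state) pairs.
P = {0: [(0.5, 1), (0.5, 2)],
     1: [(1.0, 2)],
     2: [(0.3, 0), (0.7, 2)]}
g = {0: 1.0, 1: 2.0, 2: 0.0}   # per-stage cost under the policy
gamma = 0.9                    # discount factor

def step(i):
    """Sample the next state from state i."""
    r, acc = random.random(), 0.0
    for p, j in P[i]:
        acc += p
        if r < acc:
            return j
    return P[i][-1][1]

def simulate_cost(i, horizon=200):
    """Discounted cost of one simulated trajectory starting at i."""
    total, disc = 0.0, 1.0
    for _ in range(horizon):
        total += disc * g[i]
        disc *= gamma
        i = step(i)
    return total

def estimate_J(i, n_traj=2000):
    """Monte Carlo estimate of the policy's cost-to-go from state i."""
    return sum(simulate_cost(i) for _ in range(n_traj)) / n_traj

J_hat = estimate_J(0)
```

For this small chain the exact cost from state 0, obtained by solving the 3x3 linear system J = g + gamma*P*J, is about 5.05, so the simulation estimate can be checked; for large systems no such linear solve is available and the simulation-based estimate is the only recourse.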
Using simulation offers another major advantage: it allows the methods of
this article to be used for systems that are hard to model but easy to simulate;
that is, in problems where an explicit model is not available, and the system can
only be observed, either as it operates in real time or through a software simulator.
For such problems, the traditional DP techniques are inapplicable, and
estimation of the transition probabilities to construct a detailed mathematical
model is often cumbersome or impossible.
There is a third potential advantage of simulation: it can implicitly identify
the “most important” or “most representative” states of the system. It appears
plausible that if these states are the ones visited most often during the
simulation, the scoring function will tend to approximate the optimal cost
better at these states, and the suboptimal policy obtained will perform better.
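This implicit weighting can be seen in a hypothetical one-dimensional example: take the cost over states to be s**2, take the scoring function to be a straight line fit by ordinary least squares, and let the simulated visits concentrate near s = 2. Because the fitted samples are exactly the visited states, the least-squares criterion is weighted by visit frequency, and the approximation is accurate near the frequently visited states and poor elsewhere. All quantities here are made up for illustration.

```python
import random

random.seed(0)

# Stand-in cost function over states in [0, 10].
f = lambda s: s * s

# States visited during a hypothetical simulation: clustered near s = 2.
visits = [min(max(random.gauss(2.0, 1.0), 0.0), 10.0) for _ in range(5000)]

# Ordinary least-squares line a*s + b over the visited states; since the
# samples are the visits themselves, the fit is implicitly weighted by
# the visitation frequencies.
n = len(visits)
sx = sum(visits)
sy = sum(f(s) for s in visits)
sxx = sum(s * s for s in visits)
sxy = sum(s * f(s) for s in visits)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

err_near = abs(a * 2.0 + b - f(2.0))  # error at an often-visited state
err_far = abs(a * 9.0 + b - f(9.0))   # error at a rarely visited state
```

The fitted line tracks the cost closely near s = 2 but misses badly at s = 9, mirroring the observation that the scoring function tends to be most accurate at the states the simulation visits most.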