Approximation Architectures
An important issue in function approximation is the selection of architecture,
that is, the choice of a parametric class of functions J̃(·, r) or Q̃(·, ·, r) that suits
the problem at hand. One possibility is to use a neural network architecture
of some type. We should emphasize here that in this article we use the term
“neural network” in a very broad sense, essentially as a synonym to “approximating
architecture.” In particular, we do not restrict ourselves to the classical
multilayer perceptron structure with sigmoidal nonlinearities. Any type of universal
approximator of nonlinear mappings could be used in our context. The
nature of the approximating structure is left open in our discussion, and it could
involve, for example, radial basis functions, wavelets, polynomials, splines, etc.
Cost approximation can often be significantly enhanced through the use of
feature extraction, a process that maps the state i into some vector f(i), called
the feature vector associated with the state i. Feature vectors summarize, in
a heuristic sense, what are considered to be important characteristics of the
state, and they are very useful in incorporating the designer’s prior knowledge
or intuition about the problem and about the structure of the optimal controller.
For example, in a queueing system involving several queues, a feature vector may
include, for each queue, a three-valued indicator that specifies whether the queue
is “nearly empty”, “moderately busy”, or “nearly full”. In many cases, analysis
can complement intuition to suggest the right features for the problem at hand.
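As a small illustration of this idea, the sketch below maps a queueing state (the vector of queue lengths) to the three-valued indicators just described. The queue capacities and the utilization thresholds are arbitrary assumptions introduced for the example, not values prescribed by the discussion above.

```python
# A minimal sketch of feature extraction for the queueing example above.
# The capacities and the "nearly empty"/"moderately busy"/"nearly full"
# thresholds (0.2 and 0.8) are illustrative assumptions.

def queue_features(queue_lengths, capacities):
    """Map a state (vector of queue lengths) to a feature vector holding a
    three-valued indicator per queue:
    0 = nearly empty, 1 = moderately busy, 2 = nearly full."""
    features = []
    for length, capacity in zip(queue_lengths, capacities):
        utilization = length / capacity
        if utilization < 0.2:
            features.append(0)      # nearly empty
        elif utilization < 0.8:
            features.append(1)      # moderately busy
        else:
            features.append(2)      # nearly full
    return features

# Example: a state with three queues, each of capacity 10
print(queue_features([1, 5, 9], [10, 10, 10]))   # -> [0, 1, 2]
```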
Feature vectors are particularly useful when they can capture the “dominant
nonlinearities” in the optimal cost function J∗. By this we mean that J∗(i) can
be approximated well by a “relatively smooth” function Ĵ(f(i)); this happens,
for example, if through a change of variables from states to features, the function
J∗ becomes a (nearly) linear or low-order polynomial function of the features.
When a feature vector can be chosen to have this property, one may consider
approximation architectures where both features and (relatively simple) neural
networks are used together. In particular, the state is mapped to a feature
vector, which is then used as input to a neural network that produces the score
of the state. More generally, it is possible that both the state and the feature
vector are provided as inputs to the neural network.
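The following sketch illustrates one such architecture in the sense described above: the state is first mapped to a feature vector, which is then fed to a small two-layer network whose weights play the role of the parameter vector r. The layer sizes and the tanh nonlinearity are illustrative choices, not part of the original discussion.

```python
import numpy as np

# A minimal sketch of a feature-based architecture: the state is mapped to a
# feature vector f(i) (e.g., by queue_features above), and the feature vector
# is the input of a small neural network that produces the score of the state.
# The network shape and nonlinearity are assumptions made for illustration.

def cost_approximation(features, r):
    """Compute the score (approximate cost) of a state from its features.
    r is a dict holding the weights and biases of a two-layer network."""
    f = np.asarray(features, dtype=float)
    hidden = np.tanh(r["W1"] @ f + r["b1"])   # hidden layer
    return float(r["W2"] @ hidden + r["b2"])  # scalar score of the state

# Example with 3 features and 5 hidden units (randomly initialized parameters)
rng = np.random.default_rng(0)
r = {"W1": rng.normal(size=(5, 3)), "b1": np.zeros(5),
     "W2": rng.normal(size=5),      "b2": 0.0}
print(cost_approximation([0, 1, 2], r))
```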
A simple method to obtain more sophisticated approximations is to partition
the state space into several subsets and construct a separate cost function
approximation in each subset. For example, by using a linear or quadratic polynomial
approximation in each subset of the partition, one can construct piecewise
linear or piecewise quadratic approximations over the entire state space.
An important issue here is the choice of the method for partitioning the state
space. Regular partitions (e.g., grid partitions) may be used, but they often lead
to a large number of subsets and very time-consuming computations. Generally
speaking, each subset of the partition should contain “similar” states so that the
variation of the optimal cost over the states of the subset is relatively smooth
and can be approximated with smooth functions. An interesting possibility is
to use features as the basis for partition. In particular, one may use a more
or less regular discretization of the space of features, which induces a possibly
irregular partition of the original state space. In this way, each subset of the
irregular partition contains states with “similar features.”
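The sketch below gives a hedged illustration of this feature-based partitioning, combined with the piecewise linear approximation mentioned earlier: a regular discretization of the feature space (here, simple rounding of each feature) identifies the subset a state belongs to, and each subset carries its own linear approximation. The feature map and the per-subset parameters are hypothetical and would in practice be chosen and fitted for the problem at hand.

```python
import numpy as np

# A minimal sketch of a feature-induced partition with a separate linear cost
# approximation on each subset. Rounding each feature gives a regular
# discretization of feature space, which induces a possibly irregular
# partition of the state space; states with "similar features" share a cell.
# The identity feature map and the per-cell parameters are assumptions.

def cell_of(features):
    """Identify the partition subset (cell) of a state via its features."""
    return tuple(int(round(f)) for f in features)

def piecewise_linear_cost(state, feature_map, params):
    """Evaluate a piecewise linear approximation: each cell of the
    feature-induced partition carries its own weight vector and bias."""
    f = np.asarray(feature_map(state), dtype=float)
    w, b = params[cell_of(f)]          # parameters of this subset
    return float(w @ f + b)            # linear approximation on the subset

# Example: two cells with (weights, bias) assumed to be fitted beforehand
params = {(0, 1, 2): (np.array([1.0, 2.0, 3.0]), 0.5),
          (1, 1, 1): (np.array([0.5, 0.5, 0.5]), 1.0)}
feature_map = lambda state: state      # identity features for the example
print(piecewise_linear_cost([0, 1, 2], feature_map, params))
```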