7.3.1. FALCON Bot by imitative learning
The FALCON Bot created using imitative learning (IL) is called FALCON-IL Bot. FALCON-IL Bot is trained using 8000 training samples recorded from the Hunter Bot. We then examine whether FALCON-IL Bot can learn the behavior patterns and play against the same Hunter Bot.
FALCON-IL Bot adopts the following parameter setting: choice parameters αc1 = αc2 = αc3 = 0.1; learning rate parameters βc1 = βc2 = βc3 = 1 for fast learning; and contribution parameters γc1 = 1 and γc2 = γc3 = 0. As in supervised learning, TD-FALCON selects a category node based on the input activities in the state field. The vigilance parameters are set to ρc1 = ρc2 = 1 and ρc3 = 0, imposing a strict match criterion on the state and action fields and no match requirement on the reward field.
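As an illustration, the following Python sketch shows one way this parameter setting and the offline imitative training loop could be organized. FALCONParams, falcon.learn, and the samples iterable are hypothetical names introduced for illustration only; they are not part of the system described in this paper.

from dataclasses import dataclass

@dataclass
class FALCONParams:
    alpha: tuple  # choice parameters (alpha_c1, alpha_c2, alpha_c3)
    beta: tuple   # learning rate parameters (beta_c1, beta_c2, beta_c3)
    gamma: tuple  # contribution parameters (gamma_c1, gamma_c2, gamma_c3)
    rho: tuple    # vigilance parameters (rho_c1, rho_c2, rho_c3)

# Setting reported for FALCON-IL Bot (Section 7.3.1).
il_params = FALCONParams(
    alpha=(0.1, 0.1, 0.1),  # choice parameters
    beta=(1.0, 1.0, 1.0),   # fast learning
    gamma=(1.0, 0.0, 0.0),  # category choice driven by the state field
    rho=(1.0, 1.0, 0.0),    # strict match on state/action, none on reward
)

def train_imitative(falcon, samples, params):
    # Offline imitative learning over (state, action) pairs
    # recorded from the Hunter Bot (8000 samples in Section 7.3.1).
    for state, action in samples:
        falcon.learn(state, action, params)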
7.3.2. FALCON Bot by online imitative learning
The FALCON Bot created using online imitative learning (OIL) is called FALCON-OIL Bot. In the experiments conducted in Unreal Tournament, FALCON-OIL Bot is trained in real time by sensing sample data from its opponent, the Hunter Bot. FALCON-OIL Bot adopts the same setting of choice parameters, learning rate parameters, contribution parameters, and vigilance parameters as FALCON-IL Bot, as well as the same learning rate α and discount factor γ for the Temporal Difference rule. (For details of FALCON-IL Bot and FALCON-OIL Bot, please refer to our previous work, Feng and Tan, 2010.)
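A minimal sketch of the online variant is given below, under the assumption that the game exposes per-tick observations of the opponent; game.sense_opponent and game.tick are hypothetical helpers, not part of the Unreal Tournament interface described here.

def train_online_imitative(falcon, game, params, max_ticks=10000):
    # Online imitative learning: one training sample per game tick,
    # sensed in real time from the opponent (the Hunter Bot).
    for _ in range(max_ticks):
        state, action = game.sense_opponent()  # observe opponent behavior
        falcon.learn(state, action, params)    # same IL setting as FALCON-IL Bot
        game.tick()                            # advance the game one step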
7.3.3. FALCON Bot by reinforcement learning
The FALCON Bot using reinforcement learning only is called FALCON-RL Bot. To examine whether reinforcement learning alone is effective in enhancing behavior learning, a series of experiments is conducted in Unreal Tournament in which FALCON-RL Bot plays against the Hunter Bot.
Under the pure reinforcement learning mode, we adopt the following parameter setting: choice parameters αc1 = αc2 = αc3 = 0.1; learning rate parameters βc1 = βc2 = 0.5 and βc3 = 0.3 to achieve a moderate learning speed; and contribution parameters γc1 = γc2 = 0.3 and γc3 = 0.4. During reinforcement learning, a slower learning rate produces a smaller set of better-quality category nodes in the FALCON network and thus better predictive performance, although it may slow down the learning process. The vigilance parameter ρc1 is set to 0.8 for a stricter match criterion, ρc2 is set to 0, and ρc3 is set to 0.3 for a marginal level of match criterion on the reward space, so as to encourage the generation of category nodes. In learning the value function with the Temporal Difference rule, the learning rate α is fixed at 0.7 and the discount factor γ is set to 0.9.
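The TD value update described above can be written in the standard Q-learning form, as in the sketch below; whether the implementation additionally bounds the value estimates is a detail not given in this section.

ALPHA = 0.7  # TD learning rate, as fixed above
GAMMA = 0.9  # discount factor, as set above

def td_update(q, reward, q_next_max):
    # Update the value estimate q of the chosen (state, action) pair:
    # q <- q + alpha * (r + gamma * max_a' Q(s', a') - q)
    td_error = reward + GAMMA * q_next_max - q
    return q + ALPHA * td_error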
7.3.4. FALCON Bot by dual stage learning
The FALCON Bot learned using the DSL strategy is called FALCON-DSL Bot. In the imitative learning stage, FALCON-DSL Bot adopts the same parameter setting as FALCON-IL Bot. For the reinforcement learning stage, FALCON-DSL Bot adopts the same setting of choice parameters, learning rate parameters, and contribution parameters as FALCON-RL Bot.
The vigilance parameter ρc1 is set to 0.9, slightly higher than that of FALCON-RL Bot, for a stricter match criterion; ρc2 is set to 0 and ρc3 to 0.3 for a marginal level of match criterion. For the Temporal Difference rule, FALCON-DSL Bot adopts the same learning rate α and discount factor γ as FALCON-RL Bot.
For the knowledge transferred from imitative learning, each of the embedded cognitive codes Cj is initialized with a reward value qj for j = 1, ..., N. At the beginning of learning, we assume the embedded codes have a standard reward value of 0.75, treating them as reasonably good rules.
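A minimal sketch of this knowledge transfer step is shown below; codes and the q attribute are hypothetical names standing in for the embedded cognitive codes Cj and their reward values qj.

INITIAL_Q = 0.75  # transferred rules are assumed to be reasonably good

def init_transferred_codes(codes):
    # Initialize the reward value q_j of every cognitive code C_j
    # carried over from the imitative learning stage (j = 1, ..., N).
    for code in codes:
        code.q = INITIAL_Q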
Fig. 8. The score difference between the learning Bots and the Hunter Bot.
7.3.5. FALCON Bot by mixed model learning
The FALCON Bot learned with the MML strategy is called FALCON-MML Bot. When imitative learning is activated, the same parameter setting as that of FALCON-IL Bot is applied.
When the reinforcement learning mode is activated, FALCON-MML Bot adopts the same setting of choice parameters, learning rate parameters, and contribution parameters as FALCON-RL Bot.
For the vigilance parameters, ρc1 is set to 0.9, slightly higher than that of FALCON-RL Bot, for a stricter match criterion; ρc2 is set to 0 and ρc3 to 0.3 for a marginal level of match criterion. For the Temporal Difference rule, FALCON-MML Bot also adopts the same learning rate α and discount factor γ as FALCON-RL Bot.
7.3.6. Bot by standard Q-learning
For the purpose of comparison, a Bot created by standard Q-learning (called QL Bot) is also implemented in Unreal Tournament. QL Bot works by learning the value function of each chosen action in a given state. We conduct a series of experiments to examine how QL Bot performs when it plays against the Hunter Bot. In learning the value function with the Temporal Difference rule, the learning rate α is fixed at 0.7 and the discount factor γ is set to 0.9.
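A tabular sketch of this baseline is given below, assuming discrete state and action encodings; the epsilon-greedy action selection is an assumption for illustration and is not specified in the text.

import random
from collections import defaultdict

ALPHA = 0.7    # learning rate, as fixed above
GAMMA = 0.9    # discount factor
EPSILON = 0.1  # exploration rate (assumed)

Q = defaultdict(float)  # maps (state, action) -> value estimate

def choose_action(state, actions):
    # Epsilon-greedy selection over the Q-table (assumed policy).
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, actions):
    # Standard Q-learning update of the value function.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])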