Each step of the Q-learning algorithm applies the following update rule:
$$Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (1)$$
- where s_t is the current state;
- a_t is the action taken in state s_t;
- r_t is the reward received for taking action a_t in state s_t;
- s_{t+1} is the next state;
- γ (gamma) is the discount factor (0 < γ