When the state-transition and reward functions are known, dynamic programming can be applied to find an optimal policy. In practice, however, RL agents rarely have complete knowledge of their environment's model. In such circumstances, temporal-difference (TD) and Monte Carlo (MC) RL algorithms are more suitable, since they learn directly from sampled experience rather than from an explicit model.
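As a minimal sketch of this model-free idea, the tabular TD(0) policy-evaluation loop below estimates state values from sampled transitions alone, without ever consulting transition or reward functions. The environment interface (`env.reset()`, `env.step()` returning a `(next_state, reward, done)` triple), the `policy` callable, and the hyperparameter values are illustrative assumptions, not part of the original text.

```python
from collections import defaultdict

def td0_evaluate(env, policy, episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation: learns V(s) from sampled
    transitions (s, r, s') only; no model of the environment is used.
    Assumed env API: reset() -> state, step(a) -> (next_state, reward, done)."""
    V = defaultdict(float)  # state-value estimates, default 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(s) toward the bootstrapped
            # target r + gamma * V(s'); terminal states contribute 0.
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

An MC evaluator would differ only in waiting until the episode ends and updating each visited state toward the full observed return, whereas TD(0) updates after every step using its own current estimate of the next state's value.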