where b is a discount factor.
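As an illustration of how the discount factor b enters the iteration, the following is a minimal sketch of value iteration on a tabular MDP. It assumes the standard Bellman optimality backup, which may differ in detail from the update used here. The function name `value_iteration`, the array layouts `P` and `R`, the tolerance `tol`, and the two-state toy problem are hypothetical choices made only for this example.

```python
import numpy as np

def value_iteration(P, R, b, tol=1e-8, max_iters=10_000):
    """Sketch of value iteration (assumed standard Bellman backup).

    P: array of shape (A, S, S), P[a, s, t] = probability of moving s -> t under action a
    R: array of shape (S, A), expected immediate reward for action a in state s
    b: discount factor, assumed to satisfy 0 <= b < 1
    """
    V = np.zeros(R.shape[0])
    for _ in range(max_iters):
        # Q[s, a] = R[s, a] + b * sum_t P[a, s, t] * V[t]
        Q = R + b * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        # Stop when successive value estimates differ by less than tol.
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical toy problem: action 0 stays in place, action 1 switches state;
# any action taken in state 1 earns reward 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
V = value_iteration(P, R, b=0.9)
```

With b = 0.9, the value of state 1 in this toy problem approaches 1 / (1 - b) = 10 and the value of state 0 approaches b * 10 = 9. A smaller b weights future rewards less and lowers both values.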
Once the VI approach has converged, the optimal policy can be derived from the following formula (see [33] for how and why this equation yields the optimal policy):