The agent needs to learn how to assign credit and blame for the actions it takes. A common way of learning to assign
credit is through a technique called Q-learning. Formally, the Q-value of a state-action pair (s, a) under a policy π, written Q^π(s, a), is the expected cumulative reward resulting from taking action a in state s and following π thereafter. A Q-learning-based RL agent learns the optimal policy π indirectly, by learning Q^π(s, a) for every state-action pair (s, a) (the Q-value matrix).
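
To make the idea of a Q-value matrix concrete, the following is a minimal sketch of tabular Q-learning. It is illustrative only and rests on several assumptions that are not part of the text above: a hypothetical five-state chain environment with a reward of 1 for reaching the last state, the hyperparameters alpha, gamma and epsilon, and the helper names q_learning and greedy_policy. The update used is the standard tabular rule, Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
               epsilon=0.1, seed=0):
    """Learn a Q-value matrix for a toy chain (assumed environment).

    States are 0..n_states-1. Action 0 moves left, action 1 moves right.
    Reaching the last state yields reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    n_actions = 2
    goal = n_states - 1
    # The Q-value matrix: one row per state, one column per action.
    Q = [[0.0] * n_actions for _ in range(n_states)]

    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy behaviour policy with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(Q[s])
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])

            # Toy transition and reward (assumptions for illustration).
            s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == goal else 0.0

            # Q-learning update: bootstrap from the best next action.
            # The goal row is never updated, so its values stay at zero.
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


def greedy_policy(Q):
    """Read the policy off the Q-value matrix: best action per state."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in Q]


if __name__ == "__main__":
    Q = q_learning()
    for state, row in enumerate(Q):
        print(state, [round(q, 3) for q in row])
    print("greedy policy:", greedy_policy(Q))
```

The policy is never stored explicitly in this sketch. It is recovered from the learned matrix by picking the highest-valued action in each state, which is the sense in which the policy is learned indirectly.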