3.1.2 Reward Structure
To explore the environment, the scheduler uses ε-greedy action selection: every DRAM cycle, with a small probability ε, the scheduler picks a random (legal) action; otherwise, it picks the (legal) action with the highest Q-value. This guarantees a non-zero probability of visiting every entry in the Q-value matrix.
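As a rough illustration, the Python sketch below shows one way ε-greedy selection over a Q-value table could be expressed. The q_values dictionary, the state encoding, and the legal_actions list are illustrative placeholders, not the controller's actual hardware structures.

```python
import random

def select_action(q_values, state, legal_actions, epsilon):
    """Pick a command for the current DRAM cycle via epsilon-greedy selection.

    q_values      -- dict mapping (state, action) pairs to learned Q-values
    state         -- current system state (placeholder encoding)
    legal_actions -- actions permitted by DRAM timing constraints this cycle
    epsilon       -- small exploration probability (e.g., 0.05; illustrative)
    """
    if random.random() < epsilon:
        # Explore: a random legal action, so every Q-value entry retains
        # a non-zero probability of being visited.
        return random.choice(legal_actions)
    # Exploit: the legal action with the highest current Q-value estimate.
    return max(legal_actions, key=lambda a: q_values.get((state, a), 0.0))
```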
Each action is associated with an immediate reward. Once action $a_t$ is picked and the immediate reward is determined, the Q-value prediction associated with the state-action pair $(s_{t-1}, a_{t-1})$ that was picked in the previous cycle $t-1$ can be updated using SARSA [32] as follows:

$$Q(s_{t-1}, a_{t-1}) \leftarrow (1 - \alpha)\,Q(s_{t-1}, a_{t-1}) + \alpha\left[r_t + \gamma\, Q(s_t, a_t)\right]$$

where $\alpha$ is the learning rate, empirically determined;² $r_t$ is the immediate reward collected for the action taken; and $0 \leq \gamma < 1$ is a discount factor that causes future rewards to be incorporated in the form of a geometric series.³
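A minimal sketch of this update, assuming Q-values are stored in a dictionary keyed by (state, action) pairs; the key encoding and the alpha and gamma values are placeholders chosen for illustration.

```python
def sarsa_update(q_values, prev_sa, curr_sa, reward, alpha, gamma):
    """Apply one SARSA update to the previous cycle's state-action pair.

    prev_sa -- (s_{t-1}, a_{t-1}), the pair whose Q-value is being corrected
    curr_sa -- (s_t, a_t), the pair just selected in the current cycle
    reward  -- r_t, the immediate reward collected for the action taken
    alpha   -- learning rate (empirically determined; placeholder here)
    gamma   -- discount factor, 0 <= gamma < 1
    """
    old_estimate = q_values.get(prev_sa, 0.0)
    # Target blends the immediate reward with the discounted Q-value of
    # the action actually chosen next, so future rewards enter as a
    # geometric series weighted by gamma.
    target = reward + gamma * q_values.get(curr_sa, 0.0)
    q_values[prev_sa] = (1.0 - alpha) * old_estimate + alpha * target
```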