To explore the environment, the scheduler uses ɛ-greedy action selection: every DRAM cycle, with a small probability ɛ, the scheduler picks a random legal action; at all other times, it picks the legal action with the highest Q-value. This guarantees that there is a non-zero probability of visiting every entry in the Q-value matrix.
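The selection rule described above can be illustrated with a minimal Python sketch. It is not the scheduler's actual implementation: the function name `epsilon_greedy`, the dictionary layout of the Q-values, and the action names in the usage example are all assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, legal_actions, epsilon, rng=random):
    """Choose one action for the current cycle.

    q_values:      dict mapping action -> Q-value in the current state
                   (hypothetical layout, assumed for this sketch)
    legal_actions: non-empty list of actions that are legal this cycle
    epsilon:       exploration probability, between 0 and 1
    rng:           source of randomness; defaults to the random module
    """
    if not legal_actions:
        raise ValueError("at least one legal action is required")

    # Explore: with probability epsilon, pick uniformly among legal actions.
    if rng.random() < epsilon:
        return rng.choice(legal_actions)

    # Exploit: otherwise pick the legal action with the highest Q-value.
    return max(legal_actions, key=lambda action: q_values[action])


if __name__ == "__main__":
    # Hypothetical Q-values for one state; "write" is illegal this cycle.
    q = {"precharge": 0.1, "activate": 0.4, "read": 0.7, "write": 0.9}
    legal = ["precharge", "activate", "read"]
    print(epsilon_greedy(q, legal, epsilon=0.05))
```

Restricting both branches to `legal_actions` means an illegal action is never issued, even when it has the highest Q-value. Because the random branch gives every legal action a non-zero chance of being selected, no legal action in a visited state is permanently starved.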