Dynamic power of Q-value computation: Computing a Q-value consists of three basic steps: (a) generating the array indices, (b) reading the Q-values, and (c) adding the Q-values and determining the maximum Q-value. To generate the array indices, we first read the six selected state attributes and concatenate their higher-order bits. The result is then XOR-ed with a random number and passed through a hash function. Reading the state attributes and indexing into the hash function can each be approximated as a dynamic SRAM read, consuming 1.4 pJ per read (from CACTI 6.5 [1]). The XOR function takes 0.23 pJ (an XOR function is conservatively approximated to consume the same power as an adder implemented in an older 70 nm technology [18]). We use CACTI to estimate the energy expended in reading the Q-values from the SRAM arrays. Each SRAM read consumes 0.78 pJ, so the total dynamic SRAM energy for reading the Q-values from the 32 matrices is 24.96 pJ per command. We estimate the power consumed by each of the adders that sum the 32 Q-values to be 1.0 mW [18], and accordingly calculate the energy consumed by the 16 adders used in each RL pipeline to be 3.75 pJ (adding the 32 Q-values takes two pipeline cycles). We conservatively assume that the comparator consumes the same power as an adder. Since a maximum of 24 commands can be analyzed every DRAM cycle, a maximum of 24 final Q-values need to be compared each cycle, and the energy estimated to do so is 3 pJ. The total RL pipeline energy consumed is then 32 pJ.
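
As a quick arithmetic check, the Q-value read, adder, and comparator terms quoted above can be tallied. The following is a minimal Python sketch, not the authors' model: all names are hypothetical, and it assumes the 32 pJ total is the rounded sum of these three terms. The index-generation costs (1.4 pJ per SRAM-like read, 0.23 pJ for the XOR) are left out because the text does not state how many such operations occur per command.

```python
# Dynamic energy figures quoted in the text, in picojoules (pJ).
SRAM_READ_PJ = 0.78          # one Q-value read from an SRAM array
NUM_Q_MATRICES = 32          # Q-value matrices read per command
ADDER_ENERGY_PJ = 3.75       # 16 adders over two pipeline cycles
COMPARATOR_ENERGY_PJ = 3.0   # comparing up to 24 final Q-values per DRAM cycle


def q_read_energy_pj() -> float:
    """Energy to read one Q-value from each of the 32 matrices."""
    return NUM_Q_MATRICES * SRAM_READ_PJ


def pipeline_energy_pj() -> float:
    """Sum of the three terms (assumption: these make up the stated total)."""
    return q_read_energy_pj() + ADDER_ENERGY_PJ + COMPARATOR_ENERGY_PJ


if __name__ == "__main__":
    print(f"Q-value reads: {q_read_energy_pj():.2f} pJ")
    print(f"Pipeline sum:  {pipeline_energy_pj():.2f} pJ")
```

Under this assumption the three terms sum to 31.71 pJ, which rounds to the 32 pJ reported in the text.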