Automatic Derivation of Reward Structures
In this paper, we propose an automated approach to this problem. Specifically, we devise a genetic algorithm (GA) [23] to explore the search space of possible reward functions for a given objective function.4

Genetic algorithms (GAs) [23] are a heuristic search technique based on evolutionary processes. A GA starts by randomly generating a population of individuals, where each individual is a candidate solution to the problem being solved. Individuals are typically encoded in binary, as strings of 0's and 1's, but other encodings are also possible. Evolution proceeds in generations, starting with an evaluation of the initial population's fitness according to an objective function. (In our case, evaluation means simulating the candidate DRAM schedulers.) Based on the fitness of the individuals in the current population, the next generation of individuals is determined stochastically using some form of fitness-based selection. These new individuals are then further evolved through operations such as crossover and mutation, which yields the population for the next generation. This process repeats until a certain number of generations has been evolved or a certain fitness level has been reached, after which the search terminates. While many individuals in the initial population may not do anything useful, the evolutionary nature of GAs allows some of them to evolve into meaningful, high-performing solutions, shedding the rest in the process.

In our GA, each individual in the population stores one reward value for each of the eight actions that the scheduler can perform. Initially, these rewards are randomly generated.
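To make the encoding concrete, here is a minimal sketch in Python. The reward range and population size are our own assumptions (the text does not specify them), and all names are hypothetical.

```python
import random

NUM_ACTIONS = 8                    # the eight actions the scheduler can perform
REWARD_MIN, REWARD_MAX = -16, 16   # assumed reward range; not specified in the text
POP_SIZE = 32                      # assumed population size; also unspecified

def random_individual():
    """One candidate reward function: a small table mapping each of the
    eight scheduler actions to a reward value."""
    return [random.randint(REWARD_MIN, REWARD_MAX) for _ in range(NUM_ACTIONS)]

population = [random_individual() for _ in range(POP_SIZE)]
```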
We evaluate our initial population by conducting execution-driven simulations with each individual's memory scheduler configuration, using a small subset of our application set,5 and determining the fitness of each individual. The fitness-based selection criterion that we use is tournament selection combined with elitist selection [23, 19]. To perform crossover, we randomly pick two individuals and swap the reward values of an action. Mutation is performed by randomly replacing the reward of an action with another value. Multiple-point crossover and mutation are performed in our experiments, which means that reward values can be swapped or replaced multiple times within a given individual. Once we have the population for the next generation, we evaluate it against the fitness criterion, and this iterative evolutionary search continues until we reach 50 generations, at the end of which we are left with a set of rewards, one per possible action, that together constitute our reward function.

---------------------------------
4 We did try “simpler” search techniques, such as manual trial-and-error or automatic hill-climbing with random restarts and momentum, but the end result was significantly inferior. We believe GAs offer a good trade-off between simplicity and effectiveness in this context.
5 The applications that we use for training are fft, mg, and radix (Section 5.2). We picked these because they are the fastest to simulate among the parallel applications that we evaluate. By using a small subset, picked not based on behavior but simply on execution time, we speed up training and at the same time minimize the chance of overfitting the final solution to our application set.
------------------------------------
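As an illustration, this evolutionary loop could be sketched as follows, continuing the encoding sketch above. This is not our exact implementation: the tournament size, number of crossover points, mutation rate, and elite count are assumptions, and `evaluate` stands in for the execution-driven simulations described above.

```python
import random

REWARD_MIN, REWARD_MAX = -16, 16   # assumed reward range, as in the earlier sketch

def tournament_select(population, fitness, k=4):
    """Tournament selection: the fittest of k randomly drawn individuals wins.
    The tournament size k is an assumption."""
    contenders = random.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitness[i])]

def crossover(a, b, points=2):
    """Multiple-point crossover: swap the reward values of randomly chosen
    actions between the two parents."""
    a, b = a[:], b[:]
    for action in random.sample(range(len(a)), points):
        a[action], b[action] = b[action], a[action]
    return a, b

def mutate(individual, rate=0.125):
    """Mutation: replace the reward of an action with a new random value."""
    return [random.randint(REWARD_MIN, REWARD_MAX) if random.random() < rate else r
            for r in individual]

def evolve(population, evaluate, generations=50, elites=2):
    """evaluate(individual) is assumed to simulate the scheduler configured
    with that reward table and return its fitness."""
    for _ in range(generations):
        fitness = [evaluate(ind) for ind in population]
        # Elitist selection: carry the best individuals over unchanged.
        ranked = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
        next_gen = [population[i] for i in ranked[:elites]]
        # Fill out the generation with tournament winners, crossed over and mutated.
        while len(next_gen) < len(population):
            a, b = crossover(tournament_select(population, fitness),
                             tournament_select(population, fitness))
            next_gen.extend([mutate(a), mutate(b)])
        population = next_gen[:len(population)]
    fitness = [evaluate(ind) for ind in population]
    return population[max(range(len(population)), key=lambda i: fitness[i])]
```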
In theory, it should be possible to periodically re-calibrate the reward values as the application goes through different execution phases; these rewards would need to be re-learned on the fly by the hardware. We are currently investigating this aspect, but in this paper we confine our solution to static rewards learned offline, which still yields good results and simplifies the design of the scheduler (as we will see, the reward structure is just a small table).
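To illustrate the last point: the end product of the offline search is just eight numbers. A hypothetical sketch (the values shown are placeholders, not the rewards our GA actually found):

```python
# The reward structure learned offline: one static value per scheduler action.
# Indexing and values are placeholders for illustration only.
REWARD_TABLE = [3, -1, 8, 6, 2, 0, 4, 1]   # eight actions -> eight rewards

def reward(action_id: int) -> int:
    """Look up the static reward for the action the scheduler just performed."""
    return REWARD_TABLE[action_id]
```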