Ultimately, most of the code, with the exception of a few calculations such as the one described in the previous paragraph, was fully parallelized across multiple dimensions. The first was the grid box dimension; all calculations were parallelized across this dimension. The second was the horizontal layer dimension, which was set to 101. The third was the wavenumber dimension, which was set to 360. Finally, a data region directive was added for the PGI Accelerator, specifying which data must be copied in to the GPU, copied out from the GPU, and allocated locally on the GPU. Specifying this explicitly is important because otherwise the code will spend considerable time on unnecessary memory copy operations. With these changes in place, the PGI Accelerator compiler generates the GPU kernel code. The compiler also reports information about the kernels it has compiled, which should be checked to make sure that each section of loops was fully parallelized across as many dimensions as possible. Once the desired results are obtained from the compiler, timing tests comparing the CPU code to the GPU code can begin. A flow chart illustrating the major steps in this porting process is shown in Fig. 2.
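The pattern described above can be illustrated with a minimal sketch. It is written in C with OpenACC directives, which is an assumption made for illustration: it is not the PGI Accelerator directive syntax used in this work, and it is not the actual model code. The function names `scale_field` and `check_scale_field`, the arrays `in`, `weight`, `out`, and `tmp`, and the arithmetic are all hypothetical. Only the layer count (101) and the wavenumber count (360) are taken from the text; the number of grid boxes is left as a runtime argument because it is not stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NLAY 101 /* layers per grid box (value from the text) */
#define NWAV 360 /* wavenumbers (value from the text) */

/* Hypothetical kernel. in[] and weight[] are read only on the device,
 * out[] is written only on the device, and tmp[] is device-side scratch.
 * Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int scale_field(int nbox, const double *in, const double *weight, double *out)
{
    assert(nbox > 0);
    size_t n = (size_t)nbox * NLAY * NWAV;
    double *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    /* Data region: copy inputs in once, copy the result out once, and
     * allocate tmp on the device without any host/device transfer. */
    #pragma acc data copyin(in[0:n], weight[0:NWAV]) copyout(out[0:n]) create(tmp[0:n])
    {
        /* Parallelize across grid box, layer, and wavenumber together. */
        #pragma acc parallel loop collapse(3)
        for (int b = 0; b < nbox; b++)
            for (int k = 0; k < NLAY; k++)
                for (int w = 0; w < NWAV; w++) {
                    size_t i = ((size_t)b * NLAY + k) * NWAV + w;
                    tmp[i] = in[i] * weight[w];
                }

        /* Second loop nest reuses tmp while it is still resident on the device. */
        #pragma acc parallel loop collapse(3)
        for (int b = 0; b < nbox; b++)
            for (int k = 0; k < NLAY; k++)
                for (int w = 0; w < NWAV; w++) {
                    size_t i = ((size_t)b * NLAY + k) * NWAV + w;
                    out[i] = tmp[i] + (double)k;
                }
    }

    free(tmp);
    return 0;
}

/* Fills hypothetical inputs, runs the kernel, and returns the number of
 * elements that differ from the value computed directly on the host
 * (or -1 on allocation failure). */
int check_scale_field(int nbox)
{
    size_t n = (size_t)nbox * NLAY * NWAV;
    double *in = malloc(n * sizeof *in);
    double *out = malloc(n * sizeof *out);
    double weight[NWAV];
    int bad = -1;

    if (in != NULL && out != NULL) {
        for (size_t i = 0; i < n; i++)
            in[i] = (double)(i % 7);
        for (int w = 0; w < NWAV; w++)
            weight[w] = 2.0;

        if (scale_field(nbox, in, weight, out) == 0) {
            bad = 0;
            for (size_t i = 0; i < n; i++) {
                int k = (int)((i / NWAV) % NLAY);
                if (out[i] != 2.0 * (double)(i % 7) + (double)k)
                    bad++;
            }
        }
    }
    free(in);
    free(out);
    return bad;
}
```

Without an OpenACC-capable compiler the directives are ignored and the loops run serially on the CPU, which gives a convenient reference for checking the GPU results. With the PGI compilers, the `-Minfo=accel` flag prints the kind of per-loop kernel feedback mentioned above; the exact messages depend on the compiler version.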