First, the total run time of the two programs is compared as a function of the total number of vertical columns (the loop number). This run time is the total wall time of the execution and includes the non-accelerated code portions, memory-copying operations, and GPU initialization time. The run time was measured for both versions of the code at loop numbers that are multiples of 256 (the number of CUDA cores), up to 4096; at the next power of two (8192), the GPU had insufficient memory to store all the variables. The results are shown in Fig. 3. From the run times, it is clear that parallelization was successful, as the slope of the run-time curve versus loop number is much smaller for the GPU code than for the CPU code.
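A minimal sketch of how such a wall-time measurement can be taken is given below. The kernel name (columnKernel) and its per-column work are hypothetical placeholders standing in for the actual accelerated code, which is not shown here; the key point is that the timer is started before the first CUDA call, so that GPU/context initialization and both host-device memory copies are included in the measured time, matching the definition of run time above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical stand-in for one vertical-column computation; the
// paper's actual physics kernel is not reproduced here.
__global__ void columnKernel(const float *in, float *out, int nColumns)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nColumns)
        out[i] = 2.0f * in[i];  // placeholder per-column work
}

int main(void)
{
    const int nColumns = 4096;  // loop number (a multiple of 256)
    const size_t bytes = nColumns * sizeof(float);

    float *hIn  = (float *)malloc(bytes);
    float *hOut = (float *)malloc(bytes);
    for (int i = 0; i < nColumns; ++i) hIn[i] = (float)i;

    // Start the wall clock before any CUDA call, so the measurement
    // includes GPU initialization, allocation, and both memory copies.
    auto t0 = std::chrono::steady_clock::now();

    float *dIn, *dOut;
    cudaMalloc(&dIn,  bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;  // matches the core count above
    int blocks = (nColumns + threadsPerBlock - 1) / threadsPerBlock;
    columnKernel<<<blocks, threadsPerBlock>>>(dIn, dOut, nColumns);

    // The device-to-host copy synchronizes with the kernel, so the
    // stop time is taken only after all GPU work has completed.
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);

    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    printf("wall time for %d columns: %.6f s\n", nColumns, seconds);

    cudaFree(dIn); cudaFree(dOut);
    free(hIn); free(hOut);
    return 0;
}
```

Sweeping nColumns over multiples of 256 and recording the printed time for each value reproduces the kind of run-time-versus-loop-number curve plotted in Fig. 3.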