The SAI preconditioner is computed in parallel on GPUs by assigning the computation of each column of M to one warp. Accelerating the SAI preconditioner involves local (per-warp) parallelization of several computing kernels, such as QR decomposition, dot products, sorting vector values, and finding the maximum value in a vector. A major challenge in computing SAI preconditioners on GPUs is the limited size of global and shared memory combined with the large data structures the computation generates. Techniques that reuse memory space and minimize the memory allocated to data structures in the kernel are therefore key to producing SAI preconditioners for large problems on GPUs. In the following, we present implementation details that overcome these constraints and parallelize the computing kernels involved in solving Ax = b using SAI preconditioners.
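To illustrate the per-warp parallelization pattern, the following CUDA sketch shows how one warp can reduce a small vector to its maximum value using warp shuffle intrinsics. This is a minimal sketch, not the paper's actual code: the kernel name, vector layout, and launch configuration are assumptions made for illustration.

```cuda
#include <cfloat>

// Hypothetical sketch: each warp scans one small vector and reduces it
// to a single maximum. Names and sizes are illustrative assumptions.
__global__ void warpMaxKernel(const double *vecs, int len, double *maxOut)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32; // one warp per vector
    int lane   = threadIdx.x % 32;                             // position within the warp
    const double *v = vecs + (size_t)warpId * len;

    // Each lane strides over the vector, keeping a local running maximum.
    double m = -DBL_MAX;
    for (int i = lane; i < len; i += 32)
        m = fmax(m, v[i]);

    // Tree reduction across the 32 lanes with shuffle intrinsics.
    for (int offset = 16; offset > 0; offset >>= 1)
        m = fmax(m, __shfl_down_sync(0xffffffffu, m, offset));

    if (lane == 0)               // lane 0 now holds the warp-wide maximum
        maxOut[warpId] = m;
}
```

With a block size of 32, each thread block is a single warp, so a launch such as warpMaxKernel<<<numVectors, 32>>>(d_vecs, len, d_max) reduces one vector per warp; the same shuffle-based pattern extends to the dot products and other per-warp reductions needed when computing each column of M.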
Computing the SAI preconditioner in parallel on GPUs involves implementing the steps introduced in Fig. 1, which we carry out in a stage called Compute-GSAI (see Fig. 2). In this stage, every 32 threads (one warp) on the GPU compute one column of M (mk) by executing the steps in Fig. 1. Each warp first finds the dimensions of its corresponding Â matrix (4) and assembles it. The local Â matrices, which are very small compared to A, are then decomposed (local decompositions per warp for each Â) using the Gram-Schmidt method [1], and mk is computed. SAI preconditioning on GPUs requires two additional stages (Pre-GSAI and Post-GSAI), which handle GPU memory allocation, define the required data structures, gather results, and determine the required number of kernel calls (hereafter, kernel refers to a CUDA kernel) based on the problem size and the available GPU memory. Thus, solving the linear system Ax = b on the GPU using SAI preconditioners consists of four major steps (see Fig. 2):