To compute the SAI preconditioner on the GPU, the steps of
the Compute-GSAI stage in Fig. 2 are implemented in parallel
in a kernel called compute preconditioner. Each column of the
preconditioner M
is computed via one warp (32 threads in a block) and every
block is assigned 256 threads (eight warps) to compute eight
columns in parallel. The number of columns computed
simultaneously on one SM depends on the shared memory
allocated per block and on the resources available per SM.
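
The warp-to-column assignment and the resulting per-SM concurrency can be illustrated with a small host-side model. This is only a sketch under stated assumptions, not the kernel itself: the warp size (32) and block size (256) come from the text, while the names thread_role, blocks_for_columns, SmLimits and concurrent_columns_per_sm, the linear block-to-column indexing, and the choice of limits considered are hypothetical. Register pressure, which also bounds residency on real hardware, is ignored here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Values taken from the text: one warp per column, eight warps per block.
constexpr int kWarpSize = 32;
constexpr int kThreadsPerBlock = 256;
constexpr int kColumnsPerBlock = kThreadsPerBlock / kWarpSize;  // 8

// Role of a single thread (hypothetical indexing scheme, assumed linear).
struct ThreadRole {
    int column;  // column of M that this thread's warp works on
    int lane;    // position of the thread within its warp (0..31)
};

// Map (block index, thread index within block) to a column and a lane.
ThreadRole thread_role(int block_idx, int thread_idx) {
    int warp_in_block = thread_idx / kWarpSize;
    return {block_idx * kColumnsPerBlock + warp_in_block,
            thread_idx % kWarpSize};
}

// Number of blocks needed so that each of the n columns gets one warp.
int blocks_for_columns(int n) {
    return (n + kColumnsPerBlock - 1) / kColumnsPerBlock;
}

// Hypothetical per-SM limits; real values depend on the GPU architecture.
struct SmLimits {
    std::size_t shared_mem_bytes;  // shared memory available per SM
    int max_threads;               // maximum resident threads per SM
    int max_blocks;                // maximum resident blocks per SM
};

// Estimate how many columns one SM can process at the same time:
// resident blocks are bounded by threads, block slots and shared memory.
int concurrent_columns_per_sm(const SmLimits& sm,
                              std::size_t shared_mem_per_block) {
    int resident = std::min(sm.max_threads / kThreadsPerBlock, sm.max_blocks);
    if (shared_mem_per_block > 0) {
        int by_smem =
            static_cast<int>(sm.shared_mem_bytes / shared_mem_per_block);
        resident = std::min(resident, by_smem);
    }
    return resident * kColumnsPerBlock;
}
```

As an example with assumed limits of 48 KiB shared memory, 2048 threads and 16 blocks per SM, a block that allocates 12 KiB of shared memory leaves four resident blocks per SM, i.e. 32 columns in flight; without any shared memory usage the thread limit allows eight blocks, i.e. 64 columns.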