For all versions of our GPU-to-GPU CUDA code, we set
maxL = 17, T = 64, and Sblock = 14,592. Consequently,
Sthread = Sblock/T = 228 and tWord = Sthread/4 = 57. Note
that since tWord is odd, we will not have shared-memory
bank conflicts (Theorem 1). Because our code uses a 1D grid
of blocks and a grid dimension is required to be less than
65,536 [10], our GPU-to-GPU code can handle at most
65,535 blocks. With the chosen block size, n must therefore
be less than 65,536 x 14,592 bytes = 912 MB. For larger n, we
can rewrite the code using a 2D indexing scheme for blocks.