One way of achieving that is by reducing the number of registers utilized per thread as more registers
means fewer threads in the kernel. However, AMD GPUs
have four times as many registers as equivalent NVIDIA
GPUs, and hence, one may use them more freely on AMD
GPUs. We achieved superior performance on the AMD GPU
by explicitly using extra registers in our GEM kernel. GEM
involves accumulating the potential at the vertex due to each
atom in the molecule. Rather than updating the intermediate
result in global memory for each atom, we make use of
a register accumulator and obtain a 1.3-fold speedup, as
shown in Figure 5. Using registers to preload data from
global memory is also useful. Preloading provides up to
a 1.6-fold speedup, also shown in Figure 5. The kernel
uses a small set of data repeatedly throughout the execution
of the kernel; preloading this data into a register rather
than reading from global memory delivers a substantial
performance benefit.
One way of achieving that is by reducing the number of registers utilized per thread as more registers
means fewer threads in the kernel. However, AMD GPUs
have four times as many registers as equivalent NVIDIA
GPUs, and hence, one may use them more freely on AMD
GPUs. We achieved superior performance on the AMD GPU
by explicitly using extra registers in our GEM kernel. GEM
involves accumulating the potential at the vertex due to each
atom in the molecule. Rather than updating the intermediate
result in global memory for each atom, we make use of
a register accumulator and obtain a 1.3-fold speedup, as
shown in Figure 5. Using registers to preload data from
global memory is also useful. Preloading provides up to
a 1.6-fold speedup, also shown in Figure 5. The kernel
uses a small set of data repeatedly throughout the execution
of the kernel; preloading this data into a register rather
than reading from global memory delivers a substantial
performance benefit.
การแปล กรุณารอสักครู่..
