In general, the execution of a CUDA program is divided into the
following four steps[14].
1) Copy of the input data from the main memory to the global
memory on the GPU
2) Transfer of instructions from the CPU to the GPU by a kernel
call
3) Data operations on the GPU
4) Copy of the output data from the global memory on the GPU to
the main memory
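
The four steps above can be sketched with the CUDA runtime API as
follows. This is a minimal illustration, not the decoder described
in this paper: the kernel scale_samples, the host function
scale_on_gpu, and the gain parameter are hypothetical names chosen
for the example, and the block size of 256 threads is an assumed
value.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the paper): each thread scales one sample.
__global__ void scale_samples(const float* in, float* out, int n, float gain)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * gain;
    }
}

// Runs the four steps for one buffer and returns the first CUDA error seen.
cudaError_t scale_on_gpu(const float* h_in, float* h_out, int n, float gain)
{
    if (n <= 0 || h_in == nullptr || h_out == nullptr) {
        return cudaErrorInvalidValue;
    }

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;

    // Global memory on the GPU must be allocated before any copy.
    cudaError_t err = cudaMalloc(&d_in, bytes);
    if (err != cudaSuccess) {
        return err;
    }
    err = cudaMalloc(&d_out, bytes);
    if (err != cudaSuccess) {
        cudaFree(d_in);
        return err;
    }

    // Step 1: copy the input from main memory to GPU global memory.
    err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        // Steps 2 and 3: the kernel call hands the work to the GPU,
        // which then operates on the data in its own memory.
        const int threads = 256;  // assumed block size
        const int blocks = (n + threads - 1) / threads;
        scale_samples<<<blocks, threads>>>(d_in, d_out, n, gain);
        err = cudaGetLastError();  // reports launch failures
    }

    if (err == cudaSuccess) {
        // Step 4: copy the result back to main memory. A blocking
        // cudaMemcpy on the default stream waits for the kernel first.
        err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

A caller passes two host buffers of n floats each; on success the
function returns cudaSuccess and the output buffer holds the scaled
samples.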
Our MP3 decoding program is also implemented in four steps.
First, memory for the input data is allocated on the GPU. Then,
the input data is copied from the main memory to the global
memory on the GPU. Next, the data is processed using the
resources of the GPU. Finally, the output data is copied from the
GPU back to the main memory. Because the CPU and the GPU have
separate memory spaces, data must be copied between them before
and after any operation on the GPU. For this reason, the data
copy is one of the major constraints that degrade GPU
utilization[15].
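
One way to observe the cost of the data copy is to time each stage
separately with CUDA events. The sketch below is an illustration
under stated assumptions, not a measurement from this paper: the
names StageTimes, time_stages, and scale_in_place are hypothetical,
the kernel is a trivial placeholder, and the host buffer is ordinary
pageable memory, so the reported times are only approximate.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: scales the buffer in place.
__global__ void scale_in_place(float* data, int n, float gain)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= gain;
    }
}

// Elapsed time of each stage in milliseconds (hypothetical helper type).
struct StageTimes {
    float h2d_ms;     // main memory -> GPU global memory
    float kernel_ms;  // data operations on the GPU
    float d2h_ms;     // GPU global memory -> main memory
};

cudaError_t time_stages(int n, StageTimes* t)
{
    if (n <= 0 || t == nullptr) {
        return cudaErrorInvalidValue;
    }

    std::vector<float> host(static_cast<size_t>(n), 1.0f);
    const size_t bytes = host.size() * sizeof(float);

    float* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, bytes);
    if (err != cudaSuccess) {
        return err;
    }

    // Four events mark the boundaries of the three stages.
    cudaEvent_t ev[4];
    for (int i = 0; i < 4; ++i) {
        cudaEventCreate(&ev[i]);
    }

    const int threads = 256;  // assumed block size
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(ev[0]);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1]);
    scale_in_place<<<blocks, threads>>>(dev, n, 0.5f);
    cudaEventRecord(ev[2]);
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[3]);
    cudaEventSynchronize(ev[3]);  // wait until the last event has completed

    cudaEventElapsedTime(&t->h2d_ms, ev[0], ev[1]);
    cudaEventElapsedTime(&t->kernel_ms, ev[1], ev[2]);
    cudaEventElapsedTime(&t->d2h_ms, ev[2], ev[3]);

    err = cudaGetLastError();  // first error from the calls above, if any

    for (int i = 0; i < 4; ++i) {
        cudaEventDestroy(ev[i]);
    }
    cudaFree(dev);
    return err;
}
```

Comparing h2d_ms and d2h_ms with kernel_ms for a given buffer size
indicates how much of the total time is spent on the data copy
rather than on computation.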