A. CUDA Program Structure
A CUDA program is a unified source file that contains both host (CPU) and device (GPU) code. The portions that exhibit little or no data parallelism are implemented in host code, while the portions that exhibit a rich amount of data parallelism are implemented in device code. The NVIDIA C compiler (nvcc) separates the two during compilation. The host code is written in standard C or C++; the device code is written in C or C++ extended with keywords for labeling data-parallel functions, called kernels, and their associated data. A kernel launch typically generates a large number of threads to exploit data parallelism. CUDA threads are much lighter weight than CPU threads: they take only a few clock cycles to generate and schedule, whereas CPU threads typically take thousands of clock cycles.
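The host/device split described above can be illustrated with a minimal sketch of element-wise vector addition. The names `vecAddKernel` and `vecAdd`, and the choice of 256 threads per block, are assumptions made for this illustration and do not come from the text. The runtime calls (`cudaMalloc`, `cudaMemcpy`, `cudaGetLastError`, `cudaFree`) are standard CUDA runtime API functions.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Device code: the __global__ keyword labels a kernel, a data-parallel
// function that runs on the GPU. Each thread adds one pair of elements.
__global__ void vecAddKernel(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // guard: the last block may have spare threads
        c[i] = a[i] + b[i];
    }
}

// Host code: ordinary C++ that manages device memory and launches the kernel.
// Hypothetical helper; returns true if every CUDA call succeeded.
bool vecAdd(const std::vector<float>& a, const std::vector<float>& b,
            std::vector<float>& c)
{
    if (a.size() != b.size()) return false;
    const int n = static_cast<int>(a.size());
    c.assign(n, 0.0f);
    if (n == 0) return true;
    const size_t bytes = n * sizeof(float);

    // Allocate device buffers and copy the inputs to the GPU.
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess
           && cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // Launch enough blocks to cover all n elements, one thread per element.
        const int threadsPerBlock = 256;  // assumed value for this sketch
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        vecAddKernel<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);

        // The blocking copy back to the host also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // cudaFree accepts a null pointer, so cleanup is safe on every path.
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}
```

A caller would pass two equally sized input vectors and an output vector to `vecAdd` from its own `main`, then compile the single source file with nvcc, which routes the kernel to the device compiler and the remaining code to the host compiler. For an input of a million elements, the launch creates roughly a million threads, which is practical only because CUDA threads are cheap to create and schedule.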