Parallel execution in CUDA is achieved by launching a kernel over many threads at once. These threads are organized into a grid of 1D, 2D, or 3D blocks of threads, and the topology of the grid is specified at kernel launch. Each thread has access to its own position within the grid, so it can extract its own portion of the common computational task.
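As a minimal sketch of this model, the vector-addition kernel below (the names `vecAdd`, `a`, `b`, `c` are illustrative) derives each thread's global index from the built-in variables `blockIdx`, `blockDim`, and `threadIdx`, and uses it to select the one element that thread is responsible for:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element of c = a + b.
// Its position in the grid determines which element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global position in the 1D grid
    if (i < n)          // guard: the grid may be slightly larger than n
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // unified memory, accessible from host and device
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // The grid topology is chosen at launch:
    // enough 256-thread blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();        // wait for the kernel to finish

    printf("c[0] = %f\n", c[0]);    // each element should be 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Because the grid here is one-dimensional, only the `.x` components are used; for 2D or 3D grids the `.y` and `.z` components combine in the same way.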