This is because the load can become unevenly distributed among the threads, which is particularly harmful to performance on GPUs: under their SIMD execution model, threads that finish early sit idle until the slowest thread in their group has completed.
• Memory constraints (GPUs tend to have less memory than CPUs) and the transfer latency between the CPU and GPU make the implementation more challenging.
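
To make the load-imbalance point concrete, the following is a minimal sketch of a toy lockstep model, not a measurement of real hardware. It assumes threads are grouped into fixed-size groups (a group size of 32 is assumed here for illustration) and that each group occupies all of its lanes until its slowest thread finishes. The function name `lockstep_utilization` and the workloads are hypothetical.

```python
def lockstep_utilization(work_per_thread, group_size=32):
    """Fraction of lane-cycles that do useful work in a toy lockstep model.

    Assumption: each group of `group_size` threads runs until its slowest
    thread is done, so faster threads in the group idle in the meantime.
    """
    useful = 0  # cycles in which a lane actually had work to do
    issued = 0  # cycles the group's lanes were occupied in total
    for start in range(0, len(work_per_thread), group_size):
        group = work_per_thread[start:start + group_size]
        issued += max(group) * group_size
        useful += sum(group)
    return useful / issued if issued else 1.0


# Hypothetical workloads: same thread count, different distribution.
balanced = [10] * 64
imbalanced = ([1] * 31 + [100]) * 2  # one heavy thread per group

print(lockstep_utilization(balanced))
print(lockstep_utilization(imbalanced))
```

In this model a single heavy thread per group keeps the other 31 lanes idle for most of the group's runtime, which is why an uneven distribution of work costs far more under lockstep execution than it would with independently scheduled CPU threads.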