Figure 4 shows that, for every thousand cycles, the benchmarks average:
602 total memory lane instructions,
268 of which are global memory lane instructions (with the other 334 to scratchpad memory), and
only 39 global memory accesses, which is what remains of the 268 global memory lane instructions after coalescing.
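
The relationships among these averages can be checked with a little arithmetic. The following is a minimal sketch that uses only the four numbers quoted above. The function name `summarize` and the dictionary keys are hypothetical, chosen for illustration, and the derived ratios are computed here rather than reported in the figure.

```python
# Per-thousand-cycle averages quoted in the text for Figure 4.
TOTAL_LANE_INSTR = 602        # all memory lane instructions
GLOBAL_LANE_INSTR = 268       # lane instructions to global memory
SCRATCHPAD_LANE_INSTR = 334   # lane instructions to scratchpad memory
GLOBAL_ACCESSES = 39          # global memory accesses after coalescing


def summarize(total, global_lanes, scratchpad_lanes, accesses):
    """Derive ratios from the averages (hypothetical helper, not from the source)."""
    # The global and scratchpad counts should partition the total.
    if global_lanes + scratchpad_lanes != total:
        raise ValueError("global and scratchpad counts do not sum to the total")
    return {
        # Fraction of memory lane instructions that target global memory.
        "global_share": global_lanes / total,
        # Fraction that target scratchpad memory.
        "scratchpad_share": scratchpad_lanes / total,
        # Average number of lane instructions merged into one global access.
        "lanes_per_access": global_lanes / accesses,
        # Fractional reduction from lane instructions to accesses.
        "coalescing_reduction": 1 - accesses / global_lanes,
    }


summary = summarize(
    TOTAL_LANE_INSTR, GLOBAL_LANE_INSTR, SCRATCHPAD_LANE_INSTR, GLOBAL_ACCESSES
)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Under these numbers, each global memory access serves on average roughly 6.9 lane instructions, and a little over half of all memory lane instructions go to scratchpad memory rather than global memory.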