the fine-grained parallelism within each
search does not require O(m + n) data storage per task.
Exposing the fine-grained parallelism permits pipelining
of memory accesses for latency tolerance, but exploiting
the parallelism requires a few programming environment
and hardware features. The irregular memory access
patterns also dictate a globally addressable memory space.