However, because managing GPU on-chip resources through high-level programming
languages is complex and data-intensive applications have complicated memory access patterns,
it often takes tremendous effort to optimize these applications for high performance