Obtaining peak performance on modern computer architectures is difficult due
to the complexity of the memory hierarchy, and the need for efficient parallelism on
multiple levels ranging from instruction level parallelism (ILP) and vector parallelism
to multi-core parallelism. Tuning the performance of code to fully utilize all of these
components requires tailoring the implementation to the particular architecture and
experimenting extensively to determine the best choices of algorithm and implementation
strategies. Moreover, this time-consuming process must be repeated whenever
the architecture changes, as the particular choices for one machine are not optimal
for another, even when relatively minor changes are made in the architecture.