All code is implemented in C++, and the experiments were
performed on a workstation equipped with an Intel Core i5-2500
processor (four cores running at 3.3 GHz, a 6 MB L3 cache, four
256 kB L2 caches, and four 32 kB L1 instruction/data caches) and 16 GB of
RAM. The GNU C++ compiler (version 4.6.1) was used with flags
-O3 -funroll-loops. The code makes use of a template constant
DIMENSIONS in order to statically unroll loops over dimensions