Application performance on multicore processors is seldom constrained
by the speed of floating-point or integer units. Much more
often, limitations are caused by the memory subsystem, particularly
shared resources such as last-level caches or memory controllers.
Measuring, predicting and modeling memory performance
becomes a steeper challenge with each new processor generation
due to growing complexity and core counts. We address the
important problem of measuring and interpreting undocumented
memory performance characteristics in order to gain insight
into microprocessor details. For this, we build upon a set of sophisticated
benchmarks that support latency and bandwidth measurements
to arbitrary locations in the memory subsystem. These
benchmarks are extended to support AVX instructions for bandwidth
measurements and to integrate the coherence states (O)wned
and (F)orward. We then use these benchmarks to perform an in-depth
analysis of current ccNUMA multiprocessor systems with
Intel (Sandy Bridge-EP) and AMD (Bulldozer) processors. Using
our benchmarks, we present fundamental memory performance data
and illustrate performance-relevant architectural properties of both
designs.