Chip-multiprocessor (CMP) architectures present a challenge for efficient simulation,
combining the requirements of a detailed microprocessor simulator with those of a tightly coupled parallel
system.
This paper presents a distributed simulator for target CMPs, based on the Message
Passing Interface (MPI) and designed to run on a host cluster of workstations.
A microbenchmark-based evaluation narrows the parallelization design space by comparing the
performance of distributed vs. centralized target L2 simulation, blocking vs. non-blocking remote
cache accesses, null-message vs. barrier techniques for clock synchronization, and different
network interconnects.
The best combination is shown to yield speedups of up to 16 on a 9-node cluster of dual-CPU
workstations, partially due to cache effects.