We use CMP$im [10], a Pin-based x86 simulator. Our baseline processor is a 4-wide out-of-order core with a 128-entry reorder buffer and a three-level cache hierarchy, similar to the baseline of the Cache Replacement Championship [11]. Each core has split L1 instruction and data caches and a unified L2 cache; all cores in our 2-core and 4-core processors share the L3 cache.
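
To make the topology concrete, the following is a minimal sketch of the baseline described above, written as a plain Python data model. It is illustrative only and is not CMP$im's configuration format: the names `CoreConfig`, `Cache`, and `build_hierarchy` are hypothetical, and cache sizes, associativities, and latencies are deliberately left out because they are not stated here. Treating the L3 as unified is also an assumption; the text only says it is shared.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CoreConfig:
    """Per-core pipeline parameters stated in the text."""
    issue_width: int = 4       # 4-wide
    rob_entries: int = 128     # 128-entry reorder buffer
    out_of_order: bool = True  # out-of-order execution


@dataclass(frozen=True)
class Cache:
    """One cache in the hierarchy (hypothetical model, no sizes or latencies)."""
    name: str
    level: int
    kind: str     # "instruction", "data", or "unified"
    shared: bool  # True if shared by all cores


def build_hierarchy(num_cores: int) -> List[Cache]:
    """List every cache in a num_cores system with the described topology."""
    caches = []
    for core in range(num_cores):
        # Each core has split L1 instruction and data caches and a unified L2.
        caches.append(Cache(f"core{core}.L1I", 1, "instruction", False))
        caches.append(Cache(f"core{core}.L1D", 1, "data", False))
        caches.append(Cache(f"core{core}.L2", 2, "unified", False))
    # A single L3 is shared by all cores (assumed unified, see lead-in).
    caches.append(Cache("L3", 3, "unified", True))
    return caches


if __name__ == "__main__":
    for n in (2, 4):  # the two multi-core configurations mentioned in the text
        names = [c.name for c in build_hierarchy(n)]
        print(f"{n}-core: {len(names)} caches -> {', '.join(names)}")
```

Under this model, each core contributes three per-core caches and the system adds one shared L3, so the cache count grows as 3n + 1 with the number of cores n.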