While set-associative caches incur fewer misses than direct-mapped caches, they typically have slower hit times and higher power consumption because multiple tag and data banks are probed in parallel. This paper presents the location cache structure, which significantly reduces the power consumption of large set-associative caches.
We propose to use a small cache, called the location cache, to store the locations of future cache references. If there is a hit in the location cache, the supported cache is accessed as a direct-mapped cache.
Otherwise, the supported cache is referenced as a conventional set-associative cache.
The worst case access latency of the location cache system is the same as that of a conventional cache.
The location cache is virtually indexed so that operations on it can be performed in parallel with the TLB address translation.
These advantages make it ideal for L2 cache systems, where traditional way-prediction strategies perform poorly.
We used the CACTI cache model to evaluate the power consumption and access latency of the proposed cache architecture.
The SimpleScalar CPU simulator was used to produce the final results.
It is shown that the proposed location cache architecture is power efficient.
In the simulated cache configurations, up to 47% of the cache access energy and 25% of the average cache access latency can be reduced.
1. INTRODUCTION
To achieve low miss rates, modern processors employ set-associative caches.
In a RAM-tagged n-way set-associative cache, n tag and data ways are accessed concurrently.
This wastes energy because at least n − 1 data reads are useless for each cache access.
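To make the energy argument concrete, the following Python sketch (illustrative only; the per-array energy numbers are arbitrary assumptions, not CACTI figures) counts the tag and data reads performed by a conventional parallel-lookup set-associative access.

```python
# Sketch of a conventional parallel-lookup set-associative access.
# The per-array energy numbers are arbitrary illustrative assumptions,
# not figures from the paper or from CACTI.

TAG_READ_ENERGY = 1.0    # arbitrary units per tag-array read
DATA_READ_ENERGY = 4.0   # arbitrary units per data-array read

def parallel_access(tag_arrays, set_index, tag, n_ways):
    """Probe all n ways of one set in parallel; return (hit_way, energy)."""
    energy = n_ways * (TAG_READ_ENERGY + DATA_READ_ENERGY)
    hit_way = None
    for way in range(n_ways):
        if tag_arrays[set_index][way] == tag:
            hit_way = way                 # 1 useful data read, n - 1 wasted
            break
    return hit_way, energy                # on a miss, all n data reads are wasted

if __name__ == "__main__":
    n_ways, n_sets = 8, 4
    tags = [[None] * n_ways for _ in range(n_sets)]
    tags[2][5] = 0xBEEF                   # pretend this tag resides in way 5 of set 2
    print(parallel_access(tags, 2, 0xBEEF, n_ways))   # (5, 40.0): 8 tag + 8 data reads
```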
Methods to save energy for set-associative caches have been actively researched.
1.1 Structural Approaches
The structural techniques typically segment the word-lines or the bit-lines.
The subbanking technique (also known as column multiplexing) divides the data arrays into subbanks.
Only those subbanks that contain the desired data are accessed.
The bit-line segmentation scheme partitions the bit-lines. When the memory cells are sampled, only the required bit-line segments are discharged.
The MDM (multi-divided module) cache consists of small modules, each of which operates as a stand-alone cache.
Only the small module designated by the reference presented to the cache is accessed.
Albonesi proposed the re-sizable selective ways cache.
The cache set associativity can be reconfigured by software.
Another type of structural method is to add a small cache to capture the most recently referenced data or to hold prefetched data.
Line buffer designs were proposed to cache the recently accessed cache lines.
The filter cache is a small cache that sits between the CPU and the L1 caches.
It reduces L1 cache power consumption by filtering out references to the L1 caches. Many of the structural approaches have proven efficient.
They can be used together with the other strategies described in the following sections.
1.2 Alternative Cache Organizations
Phased caches first access the tag arrays and then the data arrays.
Only the hit data way is accessed in the second phase, resulting in less data way access energy at the expense of longer access time.
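A minimal sketch of this phased access order, under the same kind of illustrative energy accounting as before, is given below; the tag arrays are read in phase one, and at most one data way is read in phase two.

```python
# Sketch of a phased cache access: tags first, then at most one data way.
# Energy constants and the two-phase latency split are illustrative assumptions.

TAG_READ_ENERGY = 1.0
DATA_READ_ENERGY = 4.0

def phased_access(tag_arrays, data_arrays, set_index, tag, n_ways):
    """Return (data, energy, phases) for a two-phase lookup."""
    # Phase 1: read and compare all tags of the set.
    energy = n_ways * TAG_READ_ENERGY
    hit_way = None
    for way in range(n_ways):
        if tag_arrays[set_index][way] == tag:
            hit_way = way
            break
    if hit_way is None:
        return None, energy, 1            # miss: no data array is touched
    # Phase 2: read only the hit data way.
    energy += DATA_READ_ENERGY
    return data_arrays[set_index][hit_way], energy, 2
```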
Researchers recently proposed the way concatenation technique for reducing dynamic cache power in application-specific systems.
The cache can be configured by software to be a direct-mapped, two-way, or four-way set-associative cache so as to save power.
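The sketch below (our own simplification, not the way-concatenation circuit itself) shows how, for a fixed capacity and a hypothetical 32 KB / 32-byte-line configuration, the software-selected associativity changes how an address is split into tag, index, and offset bits.

```python
# Illustrative address split for a configurable-associativity cache.
# Capacity, line size, and the bit layout are assumptions for the example.

CAPACITY_BYTES = 32 * 1024
LINE_BYTES = 32

def split_address(addr, n_ways):
    """Return (tag, set_index, offset) for the configured associativity."""
    n_sets = CAPACITY_BYTES // (LINE_BYTES * n_ways)
    offset_bits = LINE_BYTES.bit_length() - 1       # log2(line size)
    index_bits = n_sets.bit_length() - 1            # log2(number of sets)
    offset = addr & (LINE_BYTES - 1)
    set_index = (addr >> offset_bits) & (n_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, set_index, offset

# Configured as 1-way (direct-mapped), 2-way, or 4-way: as the way count
# drops, more address bits go to the set index and fewer to the tag.
for ways in (1, 2, 4):
    print(ways, split_address(0x12345678, ways))
```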
The MNM mechanism was proposed to detect cache misses early so that cache power consumption can be reduced.
CAM-tagged caches are often used in low-power systems.
A CAM-based cache puts one set of the cache in a small sub-bank and uses a CAM for the tag lookup of that set.
A set may have 32 or even 64 ways. However, the CAM tags must be searched before the data can be retrieved, which increases the cache latency.
The area overhead brought by CAM cells is also not negligible.
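A rough functional sketch of such a CAM-tagged sub-bank is given below (circuit-level details omitted; the 32-way set size is just the example value mentioned above): the tag search is fully associative within the sub-bank, and the single matching data entry is read only after the match completes.

```python
# Functional sketch of a CAM-tagged sub-bank: the tag search is fully
# associative within the sub-bank, and the data read waits for the match.
# The class layout is an illustration, not a circuit model.

class CamSubBank:
    def __init__(self, n_ways=32):
        self.tags = [None] * n_ways       # CAM cells hold the tags
        self.data = [None] * n_ways

    def lookup(self, tag):
        # The CAM compares 'tag' against every stored tag in parallel;
        # here we model only the result of that match, not its circuitry.
        for way, stored in enumerate(self.tags):
            if stored == tag:
                return self.data[way]     # only the matching data entry is read
        return None                       # miss
```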
1.3 Speculative Way Selection
The basic idea of speculative way activation is to predict the way in which the required data is located.
If the prediction is correct, the cache access latency and power consumption are similar to those of a direct-mapped cache of the same size.
If the prediction is wrong, the cache is accessed again to retrieve the desired data; in effect, the cache is accessed as a direct-mapped cache twice.
Because of their high prediction accuracy, the proposed designs save both time and power.
Some designs have also been adopted in industry.
Prior work can be categorized by the way the cache is probed.
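These schemes share a common predict-then-verify structure, summarized in the sketch below (our own abstraction, with a pluggable predictor and illustrative probe accounting): the predicted way is probed first, as if the cache were direct-mapped, and the remaining ways are probed only on a misprediction.

```python
# Generic speculative way selection: probe the predicted way as if the
# cache were direct-mapped, and re-probe the other ways on a mispredict.
# The predictor interface and the probe counting are illustrative.

def speculative_access(tag_arrays, data_arrays, set_index, tag, n_ways, predict_way):
    probes = 1
    guess = predict_way(set_index)
    if tag_arrays[set_index][guess] == tag:
        return data_arrays[set_index][guess], probes      # fast, cheap hit
    # Misprediction: probe the remaining ways (shown serially here).
    for way in range(n_ways):
        if way == guess:
            continue
        probes += 1
        if tag_arrays[set_index][way] == tag:
            return data_arrays[set_index][way], probes    # slower hit
    return None, probes                                   # cache miss
```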
1.3.1 Statically Ordered Cache Probes
The Hash-Rehash cache design and the Pseudo-associative cache design were originally proposed to reduce the miss rates of direct-mapped caches.
When a memory reference is presented to the cache, the direct-mapped location is checked.
If there is a miss, a hash function is used to index the next cache entry.
In both designs, the most-recently-accessed cache line will be moved to the direct-mapped location.
However, exchanging large cache lines consumes a large amount of power as well as bus bandwidth.
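A simplified sketch of this statically ordered probe sequence, over a direct-mapped array and with a trivial rehash function chosen purely for illustration, is given below; the final swap is the step noted above as expensive for large lines.

```python
# Statically ordered probes over a direct-mapped array: check the
# direct-mapped entry, then a rehashed entry, then swap on a rehash hit.
# The rehash function (flip the top index bit) is an assumption, not the
# published hash.

def rehash(index, n_entries):
    return index ^ (n_entries >> 1)       # example rehash for the sketch

def hash_rehash_access(tags, data, index, tag, n_entries):
    if tags[index] == tag:                # first probe: direct-mapped location
        return data[index]
    alt = rehash(index, n_entries)
    if tags[alt] == tag:                  # second probe: rehashed location
        # Move the most-recently-accessed line to the direct-mapped location;
        # for a large L2 line this exchange costs power and bus bandwidth.
        tags[index], tags[alt] = tags[alt], tags[index]
        data[index], data[alt] = data[alt], data[index]
        return data[index]
    return None                           # miss after both probes
```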
1.3.2 Dynamically Ordered Cache Probes
In contrast to the static schemes, researchers have developed schemes which redirect the first probe to a predicted location.
The MRU cache design keeps MRU information associated with each set.
When searching for data, the block indicated by the MRU bits is probed first.
However, the MRU bits must be fetched prior to accessing the cache.
The PSA (Predictive Sequential Associative) cache design moves the prediction procedure to earlier pipeline stages so that the MRU information is presented to the cache simultaneously with the memory reference.
The reactive-associative cache design, based on the PSA cache design, moves the most active blocks to direct-mapped positions and reactively displaces only conflicting blocks.
It reduces cache access latency at the cost of higher miss rates and larger power consumption.
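For concreteness, a minimal MRU predictor compatible with the generic sketch above might look like the following; the table layout and update policy are our assumptions rather than the published designs, and the predict method could be passed as the predict_way argument of the earlier sketch.

```python
# MRU-based way prediction: one MRU way number per set, read before the
# cache probe and updated after every hit. The table layout is illustrative.

class MruPredictor:
    def __init__(self, n_sets):
        self.mru_way = [0] * n_sets       # MRU information kept per set

    def predict(self, set_index):
        return self.mru_way[set_index]    # must be available before the probe

    def update(self, set_index, hit_way):
        self.mru_way[set_index] = hit_way
```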
Dropsho discussed an accounting cache architecture.
The accounting cache first accesses a subset of the ways of a set-associative cache, known as a primary access.
If there is a miss, then the cache accesses the other ways, known as a secondary access.
A swap between the primary and secondary ways is needed when there is a miss in the primary access and a hit in the secondary access.
Energy is saved on a hit during the primary access.
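A rough functional sketch of this primary/secondary probe order is shown below, assuming (for illustration only) that the first half of the ways forms the primary group and that an arbitrary primary entry serves as the swap victim.

```python
# Accounting-cache style access: probe a primary group of ways first,
# then the secondary group, and promote a secondary hit into the primary
# group. The half/half grouping and victim choice are sketch assumptions.

def accounting_access(tags, data, set_index, tag, n_ways):
    primary = range(n_ways // 2)
    secondary = range(n_ways // 2, n_ways)
    for way in primary:                       # primary access (cheap hit)
        if tags[set_index][way] == tag:
            return data[set_index][way]
    for way in secondary:                     # secondary access
        if tags[set_index][way] == tag:
            victim = 0                        # arbitrary primary victim for the sketch
            tags[set_index][victim], tags[set_index][way] = \
                tags[set_index][way], tags[set_index][victim]
            data[set_index][victim], data[set_index][way] = \
                data[set_index][way], data[set_index][victim]
            return data[set_index][victim]
    return None                               # miss in both groups
```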
Way-prediction was first proposed to reduce the cache access latency.
The power efficiency of way-prediction techniques was discussed later.
1.3.3 Limitations of Way-Prediction Schemes
Way-prediction designs have been proposed for fast L1 caches.
There are several reasons why the original way-prediction idea cannot be applied directly to large L2 caches.
First, in way-prediction designs, the predicted way number must be made available before the actual data address is generated.
We call this an out-cache feature of way-prediction designs. As large L2 caches are typically physically indexed, a virtual-to-physical address translation must be performed before the address can be presented to the way-prediction hardware.
The way-prediction mechanism sitting between the TLB and the L2 cache will add extra delay to the critical path.
Second, L2 caches are unified caches, where most of the references come from L1 data cache misses.
MRU-based prediction does not always work well with data references.
Third, the cache line size of the L2 cache is large.
In Intel P4 processors, the L2 cache line size is 128 bytes.
This means exchanging the locations of cache lines is prohibitively expensive.
Finally, way-prediction introduces non-uniform cache access latency.
The processor must be redesigned to take advantage of the non-uniform L2 cache latency.
This paper examines the popular MRU information used by existing way-prediction mechanisms.
We show that it is difficult to apply existing way-prediction techniques directly to L2 caches.
We propose to use another kind of information, namely address affinity, to provide accurate location information for L2 cache references.
The proposed cache design reduces cache access power while improving the performance, compared with a conventional set-associative L2 cache.
The rest of this paper is organized as follows: Section 2 introduces the architecture of the location cache system.
Section 3 presents simulation results on access delay and power consumption of the proposed hardware.
Section 4 studies the performance and power efficiency of the proposed system.
We conclude the paper in Section 5.
2. OUR SOLUTION
We propose a new cache architecture called the location cache.
Figure 1 illustrates its structure.
The location cache is a small virtually-indexed direct-mapped cache.
It caches location information, namely the way number within the set into which a memory reference falls.
This cache works in parallel with the TLB and the L1 cache.
On an L1 cache miss, the physical address translated by the TLB and the way information of the reference are both presented to the L2 cache.
The L2 cache is then accessed as a direct-mapped cache.
If there is a miss in the location cache, the L2 cache is accessed as a conventional set-associative cache.
As opposed to way-prediction information, the cached location is not a prediction.
Thus when there is a hit, both time and power will be saved.
Even if there is a miss, there is no extra delay penalty of the kind seen in way-prediction caches.
Caching the position, unlike caching the data itself, will not cause coherence problems in multi-processor systems.
Although the snooping mechanism may modify the data stored in the L2 cache, the location will not change.
Also, even if a cache line is replaced in the L2 cache, the way information stored in the location cache will not generate a fault.
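As a hedged illustration of the access flow described above, the Python sketch below shows how a location-cache entry might steer an L2 access. The structure sizes, the indexing arithmetic, the fill-on-set-associative-probe policy, and the l2 object's probe_one_way and probe_all_ways methods are all hypothetical simplifications for the sketch, not the exact design.

```python
# Sketch of how a location-cache entry steers the L2 access, following the
# description above. Field widths, indexing, and the update policy are
# simplified assumptions; probe_one_way/probe_all_ways are hypothetical
# placeholders for the two L2 access modes.

class LocationCache:
    def __init__(self, n_entries):
        self.n_entries = n_entries
        self.tag = [None] * n_entries
        self.way = [None] * n_entries

    def lookup(self, vaddr, line_bytes=128):
        idx = (vaddr // line_bytes) % self.n_entries   # virtually indexed
        if self.tag[idx] == vaddr // line_bytes:
            return self.way[idx]                       # an exact way number, not a guess
        return None

    def update(self, vaddr, way, line_bytes=128):
        idx = (vaddr // line_bytes) % self.n_entries
        self.tag[idx] = vaddr // line_bytes
        self.way[idx] = way


def l2_access(l2, loc_cache, vaddr, paddr):
    """On an L1 miss: use the cached way if present, else probe all ways."""
    way = loc_cache.lookup(vaddr)                      # done in parallel with the TLB
    if way is not None:
        return l2.probe_one_way(paddr, way)            # direct-mapped style access
    data, hit_way = l2.probe_all_ways(paddr)           # conventional set-associative access
    if hit_way is not None:
        loc_cache.update(vaddr, hit_way)               # remember the way for future references
    return data
```

Note that even a stale way number is harmless in this sketch: the tag check inside the single-way probe still detects a miss, matching the observation above that replaced L2 lines do not cause faults.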