Because most L2 cache references are generated by L1 data
cache misses, we propose to access the location cache only when
the L1 data cache is accessed. Due to the page limit, we present
only average results over all 26 benchmark applications.
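To make this access flow concrete, the following minimal C sketch models the lookup path. It is an illustration under our own naming, not the authors' implementation; the 4-cycle and 6-cycle latencies are those quoted later in this section.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only: entry layout and names are assumptions. */
typedef struct {
    bool     valid;
    uint32_t tag;  /* identifies the L2 block whose way is remembered */
    unsigned way;  /* predicted L2 way for that block */
} lc_entry_t;

enum { LAT_SINGLE_WAY = 4, LAT_SET_ASSOC = 6 };

/* The location cache is probed in parallel with every L1 data access,
 * so its answer is already available if the L1 misses. On an L1 miss,
 * a location cache hit enables only the predicted L2 way (a
 * direct-mapped-style access); a location cache miss falls back to a
 * conventional access that reads all ways in parallel. */
static unsigned l2_access_cycles(bool l1_hit, const lc_entry_t *e, uint32_t tag)
{
    if (l1_hit)
        return 0;                 /* the L2 cache is not accessed */
    if (e->valid && e->tag == tag)
        return LAT_SINGLE_WAY;    /* one way enabled: 4 cycles */
    return LAT_SET_ASSOC;         /* all ways enabled: 6 cycles */
}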
Figure 4 presents the prediction accuracy as the number of
location cache entries varies. We also provide the accuracy of the
MRU-based prediction scheme, whose hardware complexity is similar
to that of a location cache with 512 entries. It can be observed
that prediction based on location information achieves a better
prediction rate. When the number of location cache entries grows
beyond 256, the prediction rate improves drastically, indicating
that the coverage of the location cache is very important.
Figure 5 presents the power savings when different numbers of
location cache entries are used. The power saving is closely related
to the prediction rate and the L1 data cache miss rate. When the L1
data cache has a very high hit rate, the location cache itself consumes
considerable power, sometimes even more than the power saved in the
L2 cache system. In our experiments, the average L1 data cache
miss rate is 4.9%. It can be observed that, on average, as much
as 47.5% of the L2 access power can be saved by the location cache
design.
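One way to make this trade-off explicit is a first-order energy model; the symbols below are ours, introduced only for illustration:

\[
\Delta E \;\approx\; \underbrace{N_{L2}\,p\,(E_{sa}-E_{dm})}_{\text{saved on predicted L2 accesses}} \;-\; \underbrace{N_{L1}\,E_{loc}}_{\text{location cache lookup overhead}},
\]

where \(N_{L1}\) and \(N_{L2}\) are the numbers of L1 data and L2 references, \(p\) is the prediction rate, \(E_{sa}\) and \(E_{dm}\) are the energies of a full set-associative and a single-way L2 access, and \(E_{loc}\) is the energy of one location cache lookup. Since \(N_{L2}\approx 0.049\,N_{L1}\) at the 4.9% L1 miss rate above, the savings term shrinks with the miss rate while the lookup overhead does not, which is why a very high L1 hit rate can turn the net savings negative.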
Although our primary concern is saving power, the location
cache can also improve performance. If the processor supports a
non-uniform L2 cache access latency, the location cache design
also improves the performance of the L2 cache. We use the average
cache access latency, i.e., the time the cache is busy for each
reference, to evaluate performance; in general, a smaller latency
is preferred. The access latency of a conventional set-associative
cache is 6 cycles, and that of a direct-mapped cache is 4 cycles.
We summarize the simulation results in Figure 6. It can be observed
that the location cache design achieves an average access latency
of 4.5 cycles, which is 25% lower than that of the original
set-associative cache.
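Under the simplifying assumption that every correctly predicted L2 reference completes in 4 cycles and every remaining reference in 6 cycles, the average latency decomposes as

\[
\bar{t} \;=\; 4p + 6(1-p)\ \text{cycles},
\]

so the observed 4.5-cycle average corresponds to roughly \(p\approx 0.75\) of L2 references being served by the fast single-way access.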
We also provide the performance of the MRU-based prediction
scheme. Although the MRU scheme has good prediction rates, its
performance is not as good as that of the location cache, because
its prediction performs poorly on applications such as ammp, art,
and galgel. The long worst-case latency of MRU way prediction
introduces significant performance degradation. The location cache
design, in contrast, never hurts performance in any situation.
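One plausible accounting for this asymmetry, using the cycle counts above and assuming that a mispredicted MRU access must re-probe the remaining ways only after the first probe fails, is

\[
t^{\,\mathrm{MRU}}_{\mathrm{worst}} \;\approx\; 4 + 6 \;=\; 10\ \text{cycles},
\qquad
t^{\,\mathrm{LC}}_{\mathrm{worst}} \;=\; 6\ \text{cycles},
\]

since a location cache miss simply falls back to a conventional parallel access and therefore never exceeds the baseline set-associative latency.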