Performance-Energy Considerations for Shared Cache Management in a Heterogeneous Multicore Processor
ANUP HOLEY† and VINEETH MEKKAT†, Intel Corporation
PEN-CHUNG YEW and ANTONIA ZHAI, University of Minnesota–Twin Cities
Heterogeneous multicore processors that integrate CPU cores and data-parallel accelerators such as graphics processing unit (GPU) cores onto the same die raise several new issues for sharing various on-chip resources. The shared last-level cache (LLC) is one of the most important shared resources due to its impact on performance. Accesses to the shared LLC in heterogeneous multicore processors can be dominated by the GPU due to the significantly higher number of concurrent threads supported by the architecture. Under current cache management policies, the CPU applications' share of the LLC can be significantly reduced in the presence of competing GPU applications. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. In contrast, GPU applications can tolerate an increase in memory access latency when there is sufficient thread-level parallelism (TLP). In addition to the performance challenge, the introduction of diverse cores onto the same die changes the energy consumption profile and, in turn, affects the energy efficiency of the processor.
In this work, we propose heterogeneous LLC management (HeLM), a novel shared LLC management policy that takes advantage of the GPU's tolerance for memory access latency. HeLM is able to throttle GPU LLC accesses and yield LLC space to cache-sensitive CPU applications. This throttling is achieved by allowing GPU accesses to bypass the LLC when an increase in memory access latency can be tolerated. The latency tolerance of a GPU application is determined by the availability of TLP, which is measured at runtime as the average number of threads that are available for issuing. For a baseline configuration with two CPU cores and four GPU cores, modeled after existing heterogeneous processor designs, HeLM outperforms the least recently used (LRU) policy by 10.4%. HeLM also outperforms competing policies. Our evaluations show that HeLM is able to sustain performance with a varying core mix.
In addition to the performance benefit, bypassing also reduces total accesses to the LLC, leading to a reduction in the energy consumption of the LLC module. However, LLC bypassing has the potential to increase off-chip bandwidth utilization and DRAM energy consumption. Our experiments show that HeLM exhibits better energy efficiency, reducing ED² by 18% over LRU while incurring only a 7% increase in off-chip bandwidth utilization.
Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles—Cache memories; C.1.3 [Computer Systems Organization]: Processor Architectures—Heterogeneous (Hybrid) systems
General Terms: Architecture, Experimentation, Performance
Additional Key Words and Phrases: Heterogeneous multicore, cache management policy, last-level cache, bypassing
†Authors were affiliated with the University of Minnesota when this work was done.
This work is supported in part by National Science Foundation grants CCF-0916583 and CPS-0931931.
Authors' addresses: A. Holey, Intel Corporation, 1900 Prairie City Road, Folsom, CA 95630; email: anup.holey@intel.com; V. Mekkat, Intel Corporation, 3600 Juliette Lane, Santa Clara, CA 95054; email: vineeth.mekkat@intel.com; P.-C. Yew and A. Zhai, Department of Computer Science and Engineering, University of Minnesota, 200 Union Street, Keller Hall 4-192, Minneapolis, MN 55455; emails: {yew, zhai}@cs.umn.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2015 ACM 1544-3566/2015/03-ART3 $15.00
DOI: http://dx.doi.org/10.1145/2710019
ACM Reference Format:
Anup Holey, Vineeth Mekkat, Pen-Chung Yew, and Antonia Zhai. 2015. Performance-energy considerations for shared cache management in a heterogeneous multicore processor. ACM Trans. Architec. Code Optim. 12, 1, Article 3 (March 2015), 29 pages.
DOI: http://dx.doi.org/10.1145/2710019
1. INTRODUCTION
Advances in semiconductor technology and the urgent need for energy-efficient computation have facilitated the integration of computational cores that are heterogeneous in nature onto the same die. Data-parallel accelerators, such as graphics processing units (GPUs), are among the most popular accelerator cores used in such designs. With easy-to-adopt programming models, such as Nvidia CUDA [NVIDIA Corporation 2007] and OpenCL [Khronos Group 2009], these data-parallel cores are now being employed to accelerate diverse workloads. Availability of heterogeneous multicore systems like AMD Fusion [Brookwood 2010] and Intel Sandy Bridge [Intel Corporation 2009] suggests that multicore designs with heterogeneous processing elements are becoming part of mainstream computing. Diversity in the performance characteristics of these computational cores presents a unique set of challenges in designing these heterogeneous multicore processors.
In heterogeneous multicore systems, the efficient sharing of on-chip resources, such as the last-level cache (LLC), is key to performance. However, the integration of CPU and GPU cores onto the same die leads to competition in the LLC that does not exist in homogeneous systems. First, the difference in cache sensitivity among diverse cores implies a difference in the performance benefits obtained from owning the same amount of cache space. Second, GPU cores with a large number of threads can potentially dominate accesses to the LLC and consequently skew existing cache-sharing policies in favor of the GPU cores. As a result, GPU cores occupy an unfair share of the LLC with existing policies.
Figure 1 shows the performance of various cache replacement policies in a heterogeneous execution environment where 401.bzip2 (from the SPEC CPU2006 benchmark suite [Spradling 2007]) executing on a single CPU core shares a 2MB LLC with a GPU benchmark (from the AMD APP benchmark suite [Advanced Micro Devices Incorporated 2011]) executing on the GPU. The applications are listed later in Table III, and the details of the experiment and processor configurations are provided in Section 4. Figure 1(a) shows the average LLC occupancy and Figure 1(b) shows the normalized instructions per cycle (IPC) of the CPU application across all GPU benchmarks. Occupancy refers to the distribution of the LLC space between applications. Since 401.bzip2 is cache sensitive, whereas most GPU applications are not, it is desirable to allocate a larger share of the LLC to the CPU application. However, for the basic least recently used (LRU) policy, we observe that a major portion of the LLC is occupied by the GPU application. This leads to significant performance degradation for the CPU application under the LRU policy, as shown in Figure 1(b).
Fig. 1. The performance impact on a cache-sensitive CPU application sharing the LLC with a graphics processing unit (GPU) application under various cache replacement policies. The cache-sensitive SPEC [Spradling 2007] application 401.bzip2 executes on the CPU core. The performance impact is measured across the set of GPU benchmarks shown later in Table III. Three configurations with varying GPU core counts are evaluated. TAP-RRIP [Lee and Kim 2012] results are shown only for the four-GPU configuration, as TAP-RRIP needs more than two GPU cores for full functioning.

Prior works have shown that judicious sharing of the LLC can improve the overall performance when diverse workloads share homogeneous multicore systems [Suh et al. 2004; Moreto et al. 2008; Kim et al. 2004; Qureshi and Patt 2006; Xie and Loh 2009, 2010; Qureshi et al. 2007; Jaleel et al. 2010]. To evaluate whether these techniques can be adopted by heterogeneous multicore processors, we study several recently proposed policies. Dynamic re-reference interval prediction (DRRIP) [Jaleel et al. 2010] is a cache management policy developed primarily for homogeneous multicore processors. DRRIP predicts whether the re-reference (reuse) interval of cache lines is intermediate or distant and inserts lines at the non-most recently used (MRU) position based on the prediction. If a line is reused after insertion into the LLC, it is promoted by increasing its age to improve its lifetime in the cache. Non-MRU insertion of cache lines performs better than MRU insertion because most of the lines do not exhibit immediate re-reference. Figure 1(a) indicates that DRRIP provides little improvement in LLC occupancy in a heterogeneous environment, as the policy is overwhelmed by an order-of-magnitude difference between the memory access rates of the CPU and the GPU cores. The performance impact of the unbalanced LLC occupancy is shown in Figure 1(b).
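For concreteness, the sketch below illustrates the RRIP mechanism on which DRRIP builds: each cache line carries a small re-reference prediction value (RRPV), new lines are inserted with a long predicted re-reference interval rather than at the MRU position, a hit promotes the line to the MRU-equivalent position, and the victim is a line whose RRPV has reached the distant value. This is a simplified, single-set illustration rather than the evaluated implementation; the 2-bit RRPV width and all identifiers are illustrative assumptions, and DRRIP's set dueling between two insertion values is omitted.

#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal single-set RRIP sketch. DRRIP additionally set-duels between two
// insertion values (SRRIP- vs. BRRIP-style); that selection logic is omitted.
struct Line {
  uint64_t tag = 0;
  bool valid = false;
  uint8_t rrpv = 3;  // re-reference prediction value; 3 = distant (2-bit)
};

class RripSet {
 public:
  explicit RripSet(std::size_t numWays) : ways_(numWays) {}

  // Returns true on a hit. A hit promotes the line (RRPV -> 0, the
  // MRU-equivalent position); a miss fills the victim at a non-MRU position.
  bool access(uint64_t tag) {
    for (Line& l : ways_) {
      if (l.valid && l.tag == tag) {
        l.rrpv = 0;
        return true;
      }
    }
    Line& victim = findVictim();
    victim.tag = tag;
    victim.valid = true;
    victim.rrpv = kLongInterval;  // non-MRU insertion: most lines see no reuse
    return false;
  }

 private:
  static constexpr uint8_t kDistant = 3;       // eviction candidates
  static constexpr uint8_t kLongInterval = 2;  // insertion point

  Line& findVictim() {
    for (;;) {
      for (Line& l : ways_) {
        if (!l.valid || l.rrpv == kDistant) return l;
      }
      for (Line& l : ways_) ++l.rrpv;  // age all lines and retry
    }
  }

  std::vector<Line> ways_;
};

Because insertions start away from the MRU-equivalent position, lines that are never reused reach the distant state quickly and are evicted before they can crowd out reused lines.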
We are aware of only one existing work, TLP-aware cache management policy (TAP) [Lee and Kim 2012], that addresses the diversity of on-chip cores while designing the LLC sharing policy. TAP identifies the cache sensitivity of the GPU application, as well as the difference in LLC access rates between the CPU and GPU cores. This information is used to influence the decisions made by the underlying cache management policy. When these metrics indicate a cache-sensitive GPU application, both CPU and GPU cores are given equal priority. On the other hand, if the GPU application is cache insensitive, the GPU core is given a lower priority by the underlying policy.
TAP, although designed for heterogeneous multicore processors, still allocates a large portion of the cache to the cache-insensitive GPU application. Consequently, the performance degradation due to LLC sharing is still significant for the cache-sensitive CPU application, as shown in Figure 1. Several reasons prohibit TAP from achieving the desired performance. First, the core sampling technique used in TAP to measure the cache sensitivity of the GPU application leaves a significant number of GPU dead blocks in the LLC. Second, TAP makes the same decision for all GPU memory accesses in a sampling period and is slow to adapt to the runtime variations in the application's behavior. A more fine-grained control over the GPU LLC share could potentially improve the utilization of the shared LLC. We discuss TAP in detail in Section 5.3.3.
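To make the contrast with such coarse-grained, sampling-based control concrete, the following sketch shows the kind of per-access, TLP-driven bypass decision that HeLM relies on, which is described fully in the later sections: if the GPU core issuing a request currently has enough ready threads, on average, to hide additional memory latency, the corresponding fill bypasses the LLC instead of being inserted. The counter names, the sampling scheme, and the threshold value here are illustrative assumptions, not the exact hardware mechanism.

#include <cstdint>

// Illustrative per-access bypass filter for GPU fills into the shared LLC.
// Idea: when measured thread-level parallelism (average number of ready
// wavefronts) is high, the GPU can hide the longer latency of going straight
// to DRAM, so its fill bypasses the LLC and the space is left for CPU lines.
struct GpuTlpSample {
  uint32_t ready_wavefronts_sum = 0;  // accumulated each cycle of the window
  uint32_t cycles_sampled = 0;        // length of the sampling window so far
};

inline double averageReadyWavefronts(const GpuTlpSample& s) {
  return s.cycles_sampled == 0
             ? 0.0
             : static_cast<double>(s.ready_wavefronts_sum) / s.cycles_sampled;
}

// Hypothetical threshold: enough ready wavefronts to cover added DRAM latency.
constexpr double kTlpBypassThreshold = 4.0;

// Evaluated per GPU memory request: bypass only when latency can be tolerated.
inline bool shouldBypassLlc(const GpuTlpSample& coreSample) {
  return averageReadyWavefronts(coreSample) >= kTlpBypassThreshold;
}

Because the decision is re-evaluated against a continuously updated TLP estimate, such a filter can react to phase changes within an application faster than a policy that fixes one decision per sampling period.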
In addition to the performance aspect, the presence of diverse cores could change the energy consumption profile, both on-chip as well as off-chip, for the heterogeneous multicore processor under existing policies. This could result in a sign