An Improved Analytical Superscalar Microprocessor Memory Model
Xi Chen and Tor Aamodt
Electrical and Computer Engineering, University of British Columbia
June 22nd, 2008 (MoBS-2008)
Introduction
Background
Modeling Long Latency Memory Systems
  Pending Hits
  Accurate Hidden Miss Latency Estimation
  A Limited Number of Outstanding Cache Misses
  SWAM & SWAM-MLP
Methodology
Results
Future Work and Conclusions
Creating and debugging a simulator takes time
Running detailed simulations is slow
Over 8 months
Time to simulate 1 second of LCMP (32-core, 4 threads per core) workload execution (Li Zhao et al., Exploring Large-Scale CMP Architectures Using ManySim, IEEE Micro, Issue 4, 2007)
[Diagram: program characteristics and microarchitectural parameters feed the Analytical Model, which produces performance]
Analytical Modeling vs. Performance Simulations
Fast (orders of magnitude faster)
Providing more insights for chip designers
Less accurate than performance simulations
Covering only major microarchitectural parameters
Tejas Karkhanis and James Smith. A First-order
Stable performance is disrupted by different types of
[Figure: IPC over time, with miss-event #1, miss-event #2, and miss-event #3 marked]
First-order Model
The overall performance (CPI) is modeled by the sum of the CPI in the absence of miss-events and the CPI contributions of branch mispredictions, instruction cache misses, and data cache misses
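The sum described here can be sketched in a few lines of Python. The function and parameter names are illustrative assumptions rather than names taken from the paper, and the component values in the usage line are made up solely to show the arithmetic.

```python
def first_order_cpi(cpi_base, cpi_branch_mispredict, cpi_icache_miss, cpi_dcache_miss):
    """Total CPI: the miss-event-free CPI plus one additive term per miss-event type."""
    return cpi_base + cpi_branch_mispredict + cpi_icache_miss + cpi_dcache_miss


# Hypothetical component values, chosen only to illustrate the sum.
total = first_order_cpi(0.5, 0.25, 0.125, 1.0)
print(total)  # 1.875
```

Keeping each miss-event type as its own additive term is what lets the data cache term be refined on its own, without touching the other components.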
The baseline in our paper is our careful re-
[Figure: CPI_D$miss (mcf) versus memory access latency (cycles), plotted at latencies of 20, 50, 100, 200, and 500]
Pending Hits
Accurate Hidden Miss Latency Estimation
Profile Window Selection
[Figure: instruction traces i1 to i8 and i1 to i16; error is reduced 43.5% -> 27.5%; other labeled values: 15.5%, 29.2%, 32.4%]
Assuming the instruction window size is eight
[Figure: instruction trace i1 to i19, analyzed one profile window at a time]
First profile step: num_serialized_D$miss: 0 -> 1
Second profile step: num_serialized_D$miss: 1 -> 3
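A minimal Python sketch of this kind of window-by-window profiling follows. It assumes that each profile step looks at one window of instructions and adds the longest chain of dependent data cache misses in that window to num_serialized_D$miss; the slides do not spell this rule out, so treat it as an assumption. The trace, its dependences, and the helper names are hypothetical, and the example is built so that the counter goes 0 -> 1 -> 3.

```python
WINDOW_SIZE = 8  # instruction window size assumed on the slide


def longest_miss_chain(window, is_miss, producers):
    """Largest number of misses on any dependence chain inside one window."""
    in_window = set(window)
    depth = {}
    for i in window:  # program order, so producers are visited before consumers
        d = max((depth[p] for p in producers.get(i, ()) if p in in_window), default=0)
        depth[i] = d + (1 if i in is_miss else 0)
    return max(depth.values(), default=0)


def plain_profile(num_insts, is_miss, producers, window_size=WINDOW_SIZE):
    """Profile back-to-back windows; return the counter value after each step."""
    num_serialized = 0
    history = []
    for start in range(1, num_insts + 1, window_size):
        window = list(range(start, min(start + window_size, num_insts + 1)))
        num_serialized += longest_miss_chain(window, is_miss, producers)
        history.append(num_serialized)
    return history


# Hypothetical 16-instruction trace (instructions numbered from 1).
misses = {1, 3, 9, 12}
deps = {12: (9,), 14: (12,)}  # i12 consumes i9's result, i14 consumes i12's
print(plain_profile(16, misses, deps))  # [1, 3]
```

In the first window the two misses are independent and overlap, so they count once; in the second window i12 depends on i9, so the two misses serialize and count twice.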
A (traditional) cache simulator cannot identify
i1: ld r2, (0)r1 (miss)
i2: ld r3, (4)r1 (hit)
i3: add r4, r2, r5
i4: add r6, r3, r5
i5: ld r7, (0)r6 (miss)
i6: add r9, r5, r7
...
[Figure: dependence graph over instructions i1 to i9]
Not considering pending hits: num_serialized_D$miss += 1 (wrong value)
i1: ld r2, (0)r1 (miss)
i2: ld r3, (4)r1 (pending hit on i1)
i3: add r4, r2, r5
i4: add r6, r3, r5
i5: ld r7, (0)r6 (miss)
i6: add r9, r5, r7
...
[Figure: dependence graph over instructions i1 to i9]
Considering pending hits: num_serialized_D$miss += 2 (correct value)
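The difference between the two counts can be reproduced with a small Python sketch. It assumes that a pending hit (a hit to a cache block whose miss is still outstanding) has to wait for that earlier miss, which is modeled here as an extra dependence edge. The `Inst` record, its field names, and the `serialized_misses` helper are hypothetical; the instruction stream is the one from the slide.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    name: str
    srcs: tuple = ()      # producer instructions (register dependences)
    kind: str = "alu"     # "alu", "miss", or "pending_hit"
    pending_on: str = ""  # for a pending hit: the earlier miss to the same block


def serialized_misses(window, model_pending_hits):
    """Longest chain of dependent misses, optionally routing through pending hits."""
    depth = {}
    for inst in window:  # program order
        producers = list(inst.srcs)
        if model_pending_hits and inst.kind == "pending_hit":
            producers.append(inst.pending_on)  # data arrives with the earlier miss
        d = max((depth[p] for p in producers if p in depth), default=0)
        depth[inst.name] = d + (1 if inst.kind == "miss" else 0)
    return max(depth.values(), default=0)


WINDOW = [
    Inst("i1", kind="miss"),                            # ld  r2, (0)r1
    Inst("i2", kind="pending_hit", pending_on="i1"),    # ld  r3, (4)r1
    Inst("i3", srcs=("i1",)),                           # add r4, r2, r5
    Inst("i4", srcs=("i2",)),                           # add r6, r3, r5
    Inst("i5", srcs=("i4",), kind="miss"),              # ld  r7, (0)r6
    Inst("i6", srcs=("i5",)),                           # add r9, r5, r7
]

print(serialized_misses(WINDOW, model_pending_hits=False))  # 1
print(serialized_misses(WINDOW, model_pending_hits=True))   # 2
```

Without the extra edge, i5 looks independent of i1 and the two misses appear to overlap; with it, i5 is reached through i2 and i4 only after i1's miss returns, so the misses serialize.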
Error is reduced from 43.5% to 27.5% with best fixed cycle compensation for miss latency
latency modeled by CPI_base
Part of the latency of a miss may be hidden
When a miss issues, it can be at (or near) the commit side of the ROB
[Figure: ROB, with fetch at one end and commit at the other]
When a miss issues, it can be at (or near) the middle of the ROB
When a miss issues, it can be at (or near) the fetch side of the ROB
[Figure: ROB holding two load misses (ld miss)]
to hide miss latency in pointer chasing
dist: average distance between two consecutive misses (the distance between two misses is saturated by the size of the ROB)
Error is reduced from 15.5% (best fixed cycle compensation) to 10.3%
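As a rough illustration of the dist statistic only, the sketch below averages the gaps between consecutive misses and caps each gap at the ROB size. It assumes distances are measured in instructions and that misses are given by their positions in the dynamic instruction stream; the function name and the sample positions are hypothetical. How dist is turned into a number of hidden cycles is not shown here.

```python
def average_miss_distance(miss_positions, rob_size):
    """Mean gap between consecutive misses, each gap saturated at rob_size."""
    gaps = [min(b - a, rob_size) for a, b in zip(miss_positions, miss_positions[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Hypothetical miss positions: gaps are 4, 296 (capped at 256), and 10.
print(average_miss_distance([0, 4, 300, 310], rob_size=256))  # 90.0
```

The cap reflects the fact that two misses further apart than the ROB size cannot be in the window at the same time, so any larger gap carries no extra information.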
The model thus far has assumed that the
We propose a technique to model a limited
We stop a profile step and update num_serialized_D$miss
[Figure: instruction trace i1 to i16]
With plain profiling, error reduces from 32.4% to 23.9% for 8 MSHRs
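One way to picture a profile step that stops early is sketched below in Python. The stopping rule is an assumption, since the slide text is cut off before stating it: here a window ends either when it holds `window_size` instructions or as soon as it contains `num_mshrs` misses. The function name, parameters, and sample trace are all hypothetical.

```python
def mshr_limited_windows(is_miss, window_size, num_mshrs):
    """Split a trace into windows that also end once num_mshrs misses are seen.

    ASSUMPTION: the early-stop condition is "misses in window == num_mshrs".
    Returns half-open (start, end) index pairs.
    """
    windows = []
    start = 0
    misses_in_window = 0
    for i, miss in enumerate(is_miss):
        if miss:
            misses_in_window += 1
        window_full = (i - start + 1) == window_size
        mshrs_full = misses_in_window == num_mshrs
        if window_full or mshrs_full:
            windows.append((start, i + 1))
            start = i + 1
            misses_in_window = 0
    if start < len(is_miss):
        windows.append((start, len(is_miss)))  # trailing partial window
    return windows


# Hypothetical 12-instruction trace with misses at positions 0, 1, 2, 3, and 10.
trace = [True, True, True, True, False, False, False, False, False, False, True, False]
print(mshr_limited_windows(trace, window_size=8, num_mshrs=2))    # [(0, 2), (2, 4), (4, 12)]
print(mshr_limited_windows(trace, window_size=8, num_mshrs=100))  # [(0, 8), (8, 12)]
```

With only two MSHRs the four early misses are split across two profile steps instead of being treated as fully overlapped in one.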
[Figure: instruction trace i1 to i16 under plain profiling (baseline) and sliding profiling]
[Figure: instruction trace i1 to i16 under sliding profiling and SWAM profiling]
29.2% (error of plain)
(does not improve the accuracy much but slows down the model significantly)
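The effect of window placement can be illustrated with a short Python sketch. It assumes that plain profiling cuts the trace into fixed back-to-back windows, and that SWAM profiling starts each window at a cache miss; the slides do not define SWAM in this text, so the second rule is an assumption. Function names and the sample trace are hypothetical.

```python
def plain_windows(num_insts, size):
    """Fixed back-to-back windows, as half-open (start, end) index pairs."""
    return [(s, min(s + size, num_insts)) for s in range(0, num_insts, size)]


def swam_windows(is_miss, size):
    """Windows that each begin at a miss (ASSUMED meaning of SWAM profiling)."""
    n = len(is_miss)
    windows = []
    i = 0
    while i < n:
        if is_miss[i]:
            windows.append((i, min(i + size, n)))
            i += size  # the next window starts at the first miss after this one
        else:
            i += 1
    return windows


# Hypothetical 16-instruction trace with misses at positions 6 and 9.
trace = [False] * 16
trace[6] = trace[9] = True
print(plain_windows(len(trace), 8))  # [(0, 8), (8, 16)]
print(swam_windows(trace, 8))        # [(6, 14)]
```

Here the fixed windows place the two misses in different profile steps, while the miss-aligned window sees both at once, the way the hardware window would once the first miss reaches it.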
[Figure: dependence graph over instructions i1 to i8]
num_serialized_miss += 4 (wrong value)
[Figure: dependence graph over instructions i1 to i8]
num_serialized_miss += 2 (correct value)
12.8% (error of SWAM) to 9.2% (error of SWAM-MLP) for 8 MSHRs
23.2% (error of SWAM) to 9.9% (error of SWAM-MLP) for 4 MSHRs
We modified SimpleScalar and used a set of memory
Unlimited number of outstanding cache misses supported
[Figure: error (%) per benchmark (app, art, eqk, luc, swm, mcf, em, hth, prm, lbm, mean) for Plain w/o PH, Plain w/ PH, and SWAM w/ PH; one value is labeled 39.7%]
Modeling a limited number of MSHRs
SWAM-MLP further decreases the error of SWAM from 12.8% to 9.2% (8 MSHRs, e.g., Prescott) and from 23.2% to 9.9% (4 MSHRs)
[Figure: error (%) per benchmark (app, art, eqk, luc, swm, mcf, em, hth, prm, lbm, mean) for Plain w/o MSHR, Plain w/ MSHR, SWAM, and SWAM-MLP]
Our improvements do not slow down the first-order model
Analytically modeling the performance impact of hardware
Analytically modeling the throughput of fine-grain
Extending the analytical model for CMPs with superscalar,
Analytically modeling the performance of SMT
Modeling Long Latency Memory System
  Pending Data Cache Hit
  Accurate Hidden Miss Latency Estimation
  Modeling MSHRs
  SWAM & SWAM-MLP
Overall our improvements reduce the error of