An Improved Analytical Superscalar Microprocessor Memory Model


Xi Chen and Tor Aamodt, Electrical and Computer Engineering, University of British Columbia. June 22nd, 2008 (MoBS-2008)


SLIDE 1

An Improved Analytical Superscalar Microprocessor Memory Model

Xi Chen and Tor Aamodt

Electrical and Computer Engineering, University of British Columbia
June 22nd, 2008 (MoBS-2008)

SLIDE 2

Outline

  • Introduction
  • Background
  • Modeling Long Latency Memory Systems
      • Pending Hits
      • Accurate Hidden Miss Latency Estimation
      • A Limited Number of Outstanding Cache Misses
      • SWAM & SWAM-MLP
  • Methodology
  • Results
  • Future Work and Conclusions

SLIDE 3

Introduction

  • Processor design is a complicated task
  • Cycle-accurate performance simulators are extensively used to explore the design space
  • Creating and debugging a simulator takes time
  • Running detailed simulations is slow

SLIDE 4

Introduction

Simulating future large-scale CMPs?

Over 8 months: the time to simulate 1 second of LCMP (32-core, 4 threads per core) workload execution (Li Zhao et al., Exploring Large-Scale CMP Architectures Using ManySim, IEEE Micro, Issue 4, 2007)

SLIDE 5

Introduction

Analytical modeling: an alternative to cycle-accurate simulations

[Figure: program characteristics and microarchitectural parameters -> Analytical Model -> performance]

SLIDE 6

Introduction

Analytical Modeling vs. Performance Simulations

Pros
  • Fast (orders of magnitude faster)
  • Provides more insight for chip designers

Cons
  • Less accurate than performance simulations
  • Covers only major microarchitectural parameters

SLIDE 7

Background

First-order Model
  • Tejas Karkhanis and James Smith. A First-order Superscalar Processor Model. ISCA’04.
  • Stable performance is disrupted by different types of miss-events

[Figure: IPC versus time, with drops at miss-event #1, miss-event #2, and miss-event #3]

SLIDE 8

Background

First-order Model
  • The overall performance (CPI) is modeled by the sum of isolated CPI components due to each type of miss-event.

CPI_modeled = CPI_base + CPI_bmsp + CPI_D$miss + CPI_I$miss

  • CPI_base: CPI in the absence of miss-events
  • CPI_bmsp: branch mispredictions
  • CPI_D$miss: data cache misses
  • CPI_I$miss: instruction cache misses

The baseline in our paper is our careful re-implementation of the first-order model, based upon the available details described in the ISCA’04 paper and its follow-up work.
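
As a minimal sketch of the additive form on this slide, the snippet below sums the four CPI components. The function name and the numeric inputs are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def modeled_cpi(cpi_base, cpi_bmsp, cpi_dmiss, cpi_imiss):
    """First-order model: total CPI is the sum of isolated components."""
    return cpi_base + cpi_bmsp + cpi_dmiss + cpi_imiss


# Hypothetical component values, for illustration only.
total = modeled_cpi(cpi_base=0.5, cpi_bmsp=0.25, cpi_dmiss=1.0, cpi_imiss=0.25)
print(total)
```

Because the components are treated as isolated, improving the estimate of one term (here the data cache term) leaves the other three untouched.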

SLIDE 9

Pending Data Cache Hits

[Figure: CPI_D$miss for mcf versus memory access latency (20, 50, 100, 200, 500 cycles), comparing actual against the model without pending hits (w/o PH)]

SLIDE 10

Contributions

  • Pending Hits: error is reduced 43.5% -> 27.5%
  • Accurate Hidden Miss Latency Estimation: error is reduced 15.5% -> 10.3%
  • Profile Window Selection: error is reduced 29.2% -> 10.3% and 32.4% -> 9.2%

[Figure: instruction stream i1 to i16 divided into profile windows, baseline versus contribution]


SLIDE 11

Modeling CPI_D$miss: Baseline

Assuming the instruction window size is eight

[Figure: instruction stream i1 to i19; the first profile window (i1 to i8) is drawn as a dependence graph]

num_serialized_D$miss: 0 -> 1

SLIDE 12

Modeling CPI_D$miss: Baseline

Assuming the instruction window size is eight

[Figure: instruction stream i1 to i19; the second profile window (i9 to i16) is drawn as a dependence graph]

num_serialized_D$miss: 1 -> 3

SLIDE 13

Modeling CPI_D$miss: Baseline

  • mem_lat: main memory latency
  • total_num_instructions: total number of instructions committed
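
A minimal sketch of how the two quantities defined on this slide could combine with num_serialized_D$miss, assuming each serialized miss costs one full memory latency and that cost is amortized over all committed instructions. That form is an assumption here, as are the function name and the sample numbers.

```python
def cpi_dmiss(num_serialized_dmiss, mem_lat, total_num_instructions):
    """Assumed form: every serialized miss pays mem_lat cycles."""
    if total_num_instructions <= 0:
        raise ValueError("total_num_instructions must be positive")
    return num_serialized_dmiss * mem_lat / total_num_instructions


# Hypothetical numbers: 3 serialized misses, 200-cycle memory, 16 instructions.
print(cpi_dmiss(3, 200, 16))
```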

SLIDE 14

Pending Data Cache Hits

  • A pending data cache hit results from a memory reference to a cache block for which a request has already been initiated by another instruction
  • Pending data cache hits are common due to spatial locality in applications

SLIDE 15

Pending Data Cache Hits

  • A (traditional) cache simulator cannot identify pending cache hits due to the lack of timing information

ld r2, (0)r1
ld r3, (4)r1
add r4, r2, r5
add r6, r3, r5
ld r7, (0)r6
……

[Figure: the loads looked up in the L1 cache and the L2 cache, with outcomes labeled miss and hit]
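
To illustrate why timing information matters, here is a small sketch of a cache lookup that records when each missing block will arrive, so that a later access to a block still in flight can be labeled a pending hit. The class name, the 64-byte block size, the fixed fill latency, and the access times are all assumptions for illustration; a real cache also has finite capacity and a replacement policy, which this sketch ignores.

```python
class TimedCache:
    """Toy unbounded cache that tracks when each block's fill completes."""

    def __init__(self, block_size=64, fill_latency=200):
        self.block_size = block_size
        self.fill_latency = fill_latency
        self.ready_at = {}  # block number -> cycle at which the data arrives

    def access(self, cycle, address):
        block = address // self.block_size
        if block not in self.ready_at:
            # First touch: start a fill that completes fill_latency later.
            self.ready_at[block] = cycle + self.fill_latency
            return "miss"
        if cycle < self.ready_at[block]:
            # The block was requested earlier but has not arrived yet.
            return "pending hit"
        return "hit"


cache = TimedCache()
# Hypothetical (cycle, address) pairs: two loads to one block, then a late reuse.
for cycle, address in [(0, 0x1000), (1, 0x1004), (500, 0x1000)]:
    print(cycle, hex(address), cache.access(cycle, address))
```

Without the cycle argument the second access would be indistinguishable from an ordinary hit, which is the limitation the slide describes.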

SLIDE 16

Pending Data Cache Hits

i1: ld r2, (0)r1
i2: ld r3, (4)r1
i3: add r4, r2, r5
i4: add r6, r3, r5
i5: ld r7, (0)r6
i6: or r8, r4, r5
i7: add r9, r5, r7
i8: or r10, r8, r9
……

[Figure: dependence graph of i1 to i8]

Not considering pending hits: num_serialized_D$miss += 1 (wrong value)

SLIDE 17

Modeling Pending Hits

i1: ld r2, (0)r1 (miss)
i2: ld r3, (4)r1 (pending hit on i1)
i3: add r4, r2, r5
i4: add r6, r3, r5
i5: ld r7, (0)r6 (miss)
i6: or r8, r4, r5
i7: add r9, r5, r7
i8: or r10, r8, r9
……

[Figure: dependence graph of i1 to i8, annotated with the latency modeled by CPI_base]

Considering pending hits: num_serialized_D$miss += 2 (correct value)

Error is reduced from 43.5% to 27.5% with the best fixed-cycle compensation for miss latency
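
The difference between the two counts can be reproduced with a simplified dependence-chain walk over the eight instructions of the example. This is a sketch under stated assumptions, not the authors' implementation: the tuple layout, the function name, and the rule that a pending hit inherits the chain depth of the miss it waits on are all choices made here for illustration.

```python
# (name, destination, sources, outcome, pending_on). pending_on names the
# earlier instruction whose in-flight miss a pending hit waits for.
WINDOW = [
    ("i1", "r2", ["r1"], "miss", None),
    ("i2", "r3", ["r1"], "pending_hit", "i1"),
    ("i3", "r4", ["r2", "r5"], None, None),
    ("i4", "r6", ["r3", "r5"], None, None),
    ("i5", "r7", ["r6"], "miss", None),
    ("i6", "r8", ["r4", "r5"], None, None),
    ("i7", "r9", ["r5", "r7"], None, None),
    ("i8", "r10", ["r8", "r9"], None, None),
]


def serialized_misses(window, model_pending_hits):
    """Largest number of misses found on any dependence chain in the window."""
    depth = {}     # instruction name -> misses on its longest chain so far
    producer = {}  # register -> name of the latest instruction writing it
    for name, dest, sources, outcome, pending_on in window:
        d = max((depth[producer[r]] for r in sources if r in producer), default=0)
        if outcome == "miss":
            d += 1
        elif outcome == "pending_hit" and model_pending_hits:
            # Data arrives only when the earlier miss completes.
            d = max(d, depth[pending_on])
        depth[name] = d
        producer[dest] = name
    return max(depth.values(), default=0)


print(serialized_misses(WINDOW, model_pending_hits=False))
print(serialized_misses(WINDOW, model_pending_hits=True))
```

With these inputs the first call returns 1 and the second returns 2, the two values contrasted on the slides: once i2 is tied to i1's outstanding miss, the miss at i5 becomes serialized behind it.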

SLIDE 18

Compensating Overestimate
[K&S ISCA’04, Karkhanis PhD Thesis]

  • Part of the latency of a miss may be hidden
  • When a miss issues, it can be at (or near) the commit side of the ROB

[Figure: ROB, from fetch side to commit side]

SLIDE 19

Compensating Overestimate
[K&S ISCA’04, Karkhanis PhD Thesis]

  • Part of the latency of a miss may be hidden
  • When a miss issues, it can be at (or near) the middle of the ROB

[Figure: ROB, from fetch side to commit side]

SLIDE 20

Compensating Overestimate
[K&S ISCA’04, Karkhanis PhD Thesis]

  • Part of the latency of a miss may be hidden
  • When a miss issues, it can be at (or near) the fetch side of the ROB

[Figure: ROB, from fetch side to commit side]

SLIDE 21

Compensating Overestimate

  • Fixed-cycle compensation used by prior work is not accurate for all the benchmarks we study.
  • We propose to compensate the overestimate using the average distance between consecutive misses.

SLIDE 22

Compensating Overestimate

[Figure: ROB, from fetch side to commit side, holding two ld misses]

  • Over 70% of instructions can be used to hide miss latency in pointer chasing
  • dist: average distance between two consecutive misses (the distance between two misses is saturated by the size of the ROB)
  • Error is reduced from 15.5% (best fixed-cycle compensation) to 10.3%
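
The definition of dist can be sketched as follows, assuming misses are given as instruction positions in program order. Only the average and its saturation at the ROB size come from the slide; the function name, the choice to return None when fewer than two misses exist, and the sample positions are assumptions. How dist is then turned into a cycle compensation is not shown here.

```python
def average_miss_distance(miss_positions, rob_size):
    """Mean gap between consecutive misses, with each gap capped at rob_size."""
    gaps = [
        min(later - earlier, rob_size)
        for earlier, later in zip(miss_positions, miss_positions[1:])
    ]
    if not gaps:
        return None  # Fewer than two misses: no distance is defined.
    return sum(gaps) / len(gaps)


# Hypothetical miss positions; the gap of 290 saturates at the ROB size of 256.
print(average_miss_distance([3, 10, 300, 304], rob_size=256))
```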

SLIDE 23

Modeling a limited number of MSHRs

  • The model thus far has assumed that the number of outstanding cache misses supported is unlimited.
  • We propose a technique to model a limited number of outstanding cache misses supported.

SLIDE 24

Modeling a limited number of MSHRs

  • We stop a profile step and update num_serialized_D$miss when the number of misses analyzed is equal to the number of Miss Status Holding Registers (MSHRs)

[Figure: instruction stream i1 to i16, with ROB size = 8 and N_MSHR = 4]

  • With plain profiling, error reduces from 32.4% to 23.9% for 8 MSHRs
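
The stopping rule can be sketched as a trace splitter that ends a profile step either when the window is full or when as many misses as there are MSHRs have been seen. The function name, the 0/1 miss flags, and the choice to begin the next step at the instruction right after the last counted miss are assumptions made for this illustration.

```python
def plain_profile_steps(is_miss, rob_size, num_mshrs):
    """Split a trace into (start, end) steps of at most rob_size instructions,
    ending a step early once num_mshrs misses have been analyzed."""
    if rob_size < 1 or num_mshrs < 1:
        raise ValueError("rob_size and num_mshrs must be at least 1")
    steps = []
    start = 0
    n = len(is_miss)
    while start < n:
        end = start
        misses = 0
        while end < n and end - start < rob_size:
            misses += is_miss[end]
            end += 1
            if misses == num_mshrs:
                break  # All MSHRs are occupied: close this profile step.
        steps.append((start, end))
        start = end
    return steps


# Hypothetical 16-instruction trace (1 = miss), ROB size 8, 4 MSHRs.
trace = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
print(plain_profile_steps(trace, rob_size=8, num_mshrs=4))
```

In this trace the first step closes after six instructions because the fourth miss arrives before the window fills, while the second step runs the full eight.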

SLIDE 25

Start-with-a-miss (SWAM) Profiling

  • Each profile step starts with a cache miss

[Figure: instruction stream i1 to i16 partitioned three ways: plain profiling (baseline), sliding profiling, and SWAM profiling]

  • Sliding profiling does not improve the accuracy much but slows down the model significantly
  • 29.2% (error of plain) -> 10.3% (error of SWAM)
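
A short sketch contrasting plain and SWAM window selection. Plain profiling cuts the trace at fixed multiples of the ROB size; SWAM skips ahead so that every window begins at a miss. The function names, the 0/1 miss flags, and the assumption that a SWAM window spans exactly ROB-size instructions from its starting miss are illustrative choices, not details taken from the slides.

```python
def plain_windows(is_miss, rob_size):
    """Fixed windows starting at every multiple of rob_size."""
    n = len(is_miss)
    return [(s, min(s + rob_size, n)) for s in range(0, n, rob_size)]


def swam_windows(is_miss, rob_size):
    """Windows of up to rob_size instructions, each starting at a miss."""
    windows = []
    pos = 0
    n = len(is_miss)
    while pos < n:
        # Skip ahead to the next miss; stop if none remain.
        while pos < n and not is_miss[pos]:
            pos += 1
        if pos == n:
            break
        end = min(pos + rob_size, n)
        windows.append((pos, end))
        pos = end
    return windows


# Hypothetical trace (1 = miss) with misses at positions 2, 7, 9, and 12.
trace = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]
print(plain_windows(trace, rob_size=8))
print(swam_windows(trace, rob_size=8))
```

Here plain profiling places the misses at positions 7 and 9 in different windows, while the SWAM window that starts at position 2 analyzes them together.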

SLIDE 26

SWAM-MLP

[Figure: profile window i1 to i8 and its dependence graph, with ROB size = 8 and N_MSHR = 4]

  • SWAM: num_serialized_miss += 4 (wrong value)
  • SWAM-MLP: num_serialized_miss += 2 (correct value)
  • 12.8% (error of SWAM) to 9.2% (error of SWAM-MLP) for 8 MSHRs
  • 23.2% (error of SWAM) to 9.9% (error of SWAM-MLP) for 4 MSHRs

SLIDE 27

Methodology

  • We modified SimpleScalar and used a set of memory-intensive benchmarks from the SPEC 2000 and OLDEN suites whose cache misses per thousand instructions (MPKI) is higher than 10 for our cache configurations
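
The selection criterion is simple arithmetic, sketched below. The function names and the miss and instruction counts are hypothetical; only the threshold of 10 MPKI comes from the slide.

```python
def mpki(cache_misses, instructions):
    """Cache misses per thousand instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cache_misses * 1000 / instructions


def is_memory_intensive(cache_misses, instructions, threshold=10):
    """True when MPKI is higher than the threshold."""
    return mpki(cache_misses, instructions) > threshold


# Hypothetical counts: 2.5 million misses over 100 million instructions.
print(mpki(2_500_000, 100_000_000))
print(is_memory_intensive(2_500_000, 100_000_000))
```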

SLIDE 28

Results

Unlimited number of outstanding cache misses supported

[Figure: error (%) per benchmark (app, art, eqk, luc, swm, mcf, em, hth, prm, lbm) and mean, for Plain w/o PH, Plain w/ PH, and SWAM w/ PH]

Mean error: 39.7% -> 29.2% -> 10.3%


SLIDE 29

Results

Modeling a limited number of MSHRs

  • SWAM-MLP further decreases the error of SWAM from 12.8% to 9.2% (8 MSHRs, e.g., Prescott) and from 23.2% to 9.9% (4 MSHRs)
  • Our improvements do not slow down the first-order model

[Figure: error (%) per benchmark (app, art, eqk, luc, swm, mcf, em, hth, prm, lbm) and mean, for Plain w/o MSHR, Plain w/ MSHR, SWAM, and SWAM-MLP]

SLIDE 30

Current/Future Work

  • Analytically modeling the performance impact of hardware data prefetching (by the pending hits caused by data prefetches)
  • Analytically modeling the throughput of fine-grain multithreaded microprocessors (e.g., Sun’s Niagara)
  • Extending the analytical model for CMPs with superscalar, out-of-order execution cores
  • Analytically modeling the performance of SMT superscalar cores

SLIDE 31

Conclusions

  • Modeling Long Latency Memory System
      • Pending Data Cache Hit
      • Accurate Hidden Miss Latency Estimation
      • Modeling MSHRs
      • SWAM & SWAM-MLP
  • Overall, our improvements reduce the error of our baseline from 39.7% to 10.3%. The error is less than 10% when modeling MSHRs. Our improvements do not slow down the first-order model.