An Improved Analytical Superscalar Microprocessor Memory Model


Xi Chen and Tor Aamodt, Electrical and Computer Engineering, University of British Columbia. June 22nd, 2008 (MoBS-2008)


  1. An Improved Analytical Superscalar Microprocessor Memory Model. Xi Chen and Tor Aamodt, Electrical and Computer Engineering, University of British Columbia. June 22nd, 2008 (MoBS-2008)

  2. Outline
     - Introduction
     - Background
     - Modeling Long Latency Memory Systems
       - Pending Hits
       - Accurate Hidden Miss Latency Estimation
       - A Limited Number of Outstanding Cache Misses
       - SWAM & SWAM-MLP
     - Methodology
     - Results
     - Future Work and Conclusions

  3. Introduction
     - Processor design is a complicated task
     - Cycle-accurate performance simulators are extensively used to explore the design space
     - Creating and debugging a simulator takes time
     - Running detailed simulations is slow

  4. Introduction
     - Simulating future large-scale CMPs?
     - Over 8 months: the time to simulate 1 second of LCMP (32-core, 4 threads per core) workload execution (Li Zhao et al., "Exploring Large-Scale CMP Architectures Using ManySim", IEEE Micro, Issue 4, 2007)

  5. Introduction
     - Analytical modeling: an alternative to cycle-accurate simulation
     - [Figure: program characteristics and microarchitectural parameters -> analytical model -> performance]

  6. Introduction
     - Analytical modeling vs. performance simulation
       - Pros
         - Fast (orders of magnitude faster)
         - Provides more insight for chip designers
       - Cons
         - Less accurate than performance simulation
         - Covers only the major microarchitectural parameters

  7. Background
     - First-order model: Tejas Karkhanis and James Smith, "A First-order Superscalar Processor Model", ISCA'04
     - Stable performance is disrupted by different types of miss-events
     - [Figure: IPC over time, with dips at miss-event #1, miss-event #2 and miss-event #3]

  8. Background
     - First-order model: the overall performance (CPI) is modeled as the sum of isolated CPI components, one for each type of miss-event:
       $CPI_{\text{modeled}} = CPI_{\text{base}} + CPI_{\text{bmsp}} + CPI_{\text{D\$miss}} + CPI_{\text{I\$miss}}$
       - CPI_base: CPI in the absence of miss-events
       - CPI_bmsp: branch mispredictions
       - CPI_D$miss: data cache misses
       - CPI_I$miss: instruction cache misses
     - The baseline in our paper is our careful re-implementation of the first-order model, based upon the available details described in the ISCA'04 paper and its follow-up work.
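The additive structure of the first-order model can be sketched in a few lines. This is only an illustration: the function name and the component values are hypothetical and do not come from the slides.

```python
def modeled_cpi(cpi_base: float, cpi_bmsp: float,
                cpi_dmiss: float, cpi_imiss: float) -> float:
    """Sum the isolated CPI components of the first-order model."""
    return cpi_base + cpi_bmsp + cpi_dmiss + cpi_imiss


# Hypothetical component values, chosen only to show the arithmetic.
example = modeled_cpi(cpi_base=0.5, cpi_bmsp=0.25, cpi_dmiss=1.0, cpi_imiss=0.125)
print(example)
```

Because each component is estimated in isolation, improving one term (here, the data cache miss term) does not require changing how the others are computed.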

  9. Pending Data Cache Hits
     - [Figure: CPI_D$miss for mcf versus memory access latency (20, 50, 100, 200 and 500 cycles), comparing the actual value with the model without pending hits (w/o PH)]

  10. Contributions
      - Pending Hits: error is reduced, 43.5% -> 27.5%
      - Accurate Hidden Miss Latency Estimation: 15.5% -> 10.3%
      - Profile Window Selection (baseline vs. contribution): 29.2% -> 10.3%; 32.4% -> 9.2%
      - [Figures: dependence graph of i1 to i8; profile windows over the instruction sequence i1 to i16 for the baseline and for the contribution]

  11. Baseline: Modeling CPI_D$miss
      - Assuming the instruction window size is eight
      - Instruction sequence i1 to i19; the profile window covers i1 to i8
      - num_serialized_D$miss: 0 -> 1
      - [Figure: dependence graph of i1 to i8]

  12. Baseline: Modeling CPI_D$miss
      - Assuming the instruction window size is eight
      - Instruction sequence i1 to i19; the profile window covers i9 to i16
      - num_serialized_D$miss: 1 -> 3
      - [Figure: dependence graph of i9 to i16]

  13. Baseline: Modeling CPI_D$miss
      - mem_lat: main memory latency
      - total_num_instructions: total number of instructions committed
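The baseline walk-through above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the trace, the dependence map and the set of misses are hypothetical, and the final formula (num_serialized_D$miss times mem_lat, divided by total_num_instructions) is assumed from the variable names on the slides, since the slide's own equation is not in the transcript.

```python
from typing import Dict, List, Set


def serialized_misses_in_window(window: List[int],
                                deps: Dict[int, List[int]],
                                misses: Set[int]) -> int:
    """Largest number of misses on any dependence chain inside one window."""
    depth: Dict[int, int] = {}
    best = 0
    for instr in window:  # program order, so producers are visited first
        # Producers outside this window are ignored (they are not in depth).
        preds = [depth[p] for p in deps.get(instr, []) if p in depth]
        depth[instr] = max(preds, default=0) + (1 if instr in misses else 0)
        best = max(best, depth[instr])
    return best


def baseline_cpi_dmiss(trace: List[int], deps: Dict[int, List[int]],
                       misses: Set[int], window_size: int,
                       mem_lat: int) -> float:
    """Assumed form: num_serialized_D$miss * mem_lat / total_num_instructions."""
    num_serialized = 0
    for start in range(0, len(trace), window_size):
        window = trace[start:start + window_size]
        num_serialized += serialized_misses_in_window(window, deps, misses)
    return num_serialized * mem_lat / len(trace)


# Hypothetical 16-instruction trace: i1 and i2 are independent misses
# (window 1 adds 1); i11 misses and depends on the miss i9 (window 2 adds 2).
trace = list(range(1, 17))
deps = {11: [9]}
misses = {1, 2, 9, 11}
cpi = baseline_cpi_dmiss(trace, deps, misses, window_size=8, mem_lat=200)
print(cpi)
```

With these made-up inputs the counter goes 0 -> 1 -> 3, mirroring the progression shown on the slides, although the actual dependence graphs there are not recoverable from the transcript.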

  14. Pending Data Cache Hits
      - A pending data cache hit results from a memory reference to a cache block for which a request has already been initiated by another instruction
      - Pending data cache hits are common due to spatial locality in applications

  15. Pending Data Cache Hits
      - A (traditional) cache simulator cannot identify pending cache hits due to the lack of timing information
      - Example instruction sequence:
          ld  r2, (0)r1
          ld  r3, (4)r1
          add r4, r2, r5
          add r6, r3, r5
          ...
          ld  r7, (0)r6
      - [Figure: the loads access the L1 cache and the L2 cache; the outcomes are labeled miss, miss and hit]
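To illustrate why timing information matters, here is a minimal sketch of a classifier that can tell a pending hit from an ordinary hit. It is not the method from the talk: the function name, the fixed miss latency, the issue cycles and the no-eviction assumption are all hypothetical simplifications.

```python
from typing import Dict, List, Tuple


def classify_accesses(accesses: List[Tuple[int, int]],
                      block_size: int, miss_latency: int) -> List[str]:
    """Label each (issue_cycle, address) access as miss, pending hit or hit.

    Assumes an unbounded cache (no evictions) and a fixed miss latency.
    """
    fill_done: Dict[int, int] = {}  # block number -> cycle its fill completes
    labels = []
    for cycle, addr in accesses:
        block = addr // block_size
        if block not in fill_done:
            fill_done[block] = cycle + miss_latency  # first request to the block
            labels.append("miss")
        elif cycle < fill_done[block]:
            labels.append("pending hit")  # block requested but not yet filled
        else:
            labels.append("hit")
    return labels


# Two back-to-back loads to the same 64-byte block, then a much later one.
labels = classify_accesses([(0, 0x100), (1, 0x104), (300, 0x108)],
                           block_size=64, miss_latency=200)
print(labels)
```

A simulator that records only which blocks are present, without the cycle at which each fill completes, would report the second access as a plain hit.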

  16. Pending Data Cache Hits
      - Not considering pending hits (num_serialized_D$miss += 1): wrong value
          i1  ld  r2, (0)r1
          i2  ld  r3, (4)r1
          i3  add r4, r2, r5
          i4  add r6, r3, r5
          i5  ld  r7, (0)r6
          i6  or  r8, r4, r5
          i7  add r9, r5, r7
          i8  or  r10, r8, r9
          i9  ......
      - [Figure: dependence graph of i1 to i8 (top row i1, i3, i6, i8; bottom row i2, i4, i5, i7)]

  17. Modeling Pending Hits
      - Considering pending hits (num_serialized_D$miss += 2): correct value
          i1  ld  r2, (0)r1     miss
          i2  ld  r3, (4)r1     hit (i1)
          i3  add r4, r2, r5
          i4  add r6, r3, r5
          i5  ld  r7, (0)r6     miss
          i6  or  r8, r4, r5
          i7  add r9, r5, r7
          i8  or  r10, r8, r9
          i9  ......
      - [Figure: dependence graph of i1 to i8 (top row i1, i3, i6, i8; bottom row i2, i4, i5, i7), annotated "latency modeled by CPI_base"]
      - Error is reduced from 43.5% to 27.5% with the best fixed cycle compensation for miss latency
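The difference between the two counts can be reproduced with a small sketch. The register dependences and the miss/hit outcomes are read off the instruction listing on the slides; the function itself and the idea of treating a pending hit as an extra dependence edge on the miss that fetched its block are an illustrative assumption, not the authors' code.

```python
from typing import Dict, List, Optional, Set

# Register dependences among i1..i8, derived from the listing (consumer -> producers).
REG_DEPS: Dict[int, List[int]] = {3: [1], 4: [2], 5: [4], 6: [3], 7: [5], 8: [6, 7]}
MISSES: Set[int] = {1, 5}            # i1 and i5 miss
PENDING_HITS: Dict[int, int] = {2: 1}  # i2 hits on the block being fetched by i1


def serialized_misses(num_instrs: int, deps: Dict[int, List[int]],
                      misses: Set[int],
                      pending_hits: Optional[Dict[int, int]] = None) -> int:
    """Largest number of misses on any dependence chain in the window."""
    pending_hits = pending_hits or {}
    depth: Dict[int, int] = {}
    for i in range(1, num_instrs + 1):
        preds = list(deps.get(i, []))
        if i in pending_hits:
            # A pending hit must wait for the miss that requested its block.
            preds.append(pending_hits[i])
        depth[i] = max((depth[p] for p in preds), default=0) + (1 if i in misses else 0)
    return max(depth.values())


without_ph = serialized_misses(8, REG_DEPS, MISSES)
with_ph = serialized_misses(8, REG_DEPS, MISSES, PENDING_HITS)
print(without_ph, with_ph)
```

Without the extra edge, i5 looks independent of i1 and the two misses appear to overlap; with it, the chain i1, i2, i4, i5 exposes both miss latencies in series.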

  18. Compensating Overestimate [K&S ISCA'04, Karkhanis PhD Thesis]
      - Part of the latency of a miss may be hidden
      - When a miss issues, it can be at (or near) the commit side of the ROB
      - [Figure: fetch -> ROB -> commit]

  19. Compensating Overestimate [K&S ISCA'04, Karkhanis PhD Thesis]
      - Part of the latency of a miss may be hidden
      - When a miss issues, it can be at (or near) the middle of the ROB
      - [Figure: fetch -> ROB -> commit]

  20. Compensating Overestimate [K&S ISCA'04, Karkhanis PhD Thesis]
      - Part of the latency of a miss may be hidden
      - When a miss issues, it can be at (or near) the fetch side of the ROB
      - [Figure: fetch -> ROB -> commit]
