A tour of a microprocessor museum


SLIDE 1

BAFL: Bottleneck Analysis of Fine-grain Parallelism

  • Prof. Rastislav Bodík
with Brian Fields; in part with Shai Rubin, Prof. Mark Hill, Prof. Mary Vernon

University of Wisconsin

The computer system

  • Many levels of granularity, each with unique performance problems:
  • internet WANs
  • servers
  • microprocessors
  • Our goal:
  • a quantitative approach for modern (out-of-order) processors

Who cares?

  • Architects:
  • circuit complexity
  • power consumption
  • Software engineers:
  • performance-critical software
  • Students:
  • intuition for how processors work
  • Processors:
  • understand themselves

A tour of a microprocessor museum

Tour theme: µ-architectural parallelism complicates performance understanding. Tour game: “Bottleneck Hunt.” Which instruction slowed down the execution, and by how much? More specifically, why does the following model fail?

execution time = cost(instruction_1) + … + cost(instruction_n) [cycles]
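A toy illustration (mine, not from the talk) of why the additive model fails: on a pipelined machine, independent instructions overlap, so execution time is far less than the sum of per-instruction costs.

```python
def additive_model(costs):
    """Predicted time if per-instruction costs simply added up."""
    return sum(costs)

def ideal_pipeline_time(n_instructions, n_stages):
    """Classic fill-plus-drain formula for an ideal scalar pipeline:
    the first instruction takes n_stages cycles; each later one adds 1."""
    return n_stages + (n_instructions - 1)

costs = [5] * 10                     # ten independent instructions, 5 cycles each
print(additive_model(costs))         # additive model predicts 50 cycles
print(ideal_pipeline_time(10, 5))    # overlap yields 14 cycles
```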

A tour of a microprocessor museum (0)

no parallelism

Intel 80386

  • fetch
  • decode
  • execute
  • write

A tour of a microprocessor museum (1)

scalar pipeline parallelism

Intel 80486: Fetch, Decd, Read, Exe, Mem, Write

SLIDE 2

A tour of a microprocessor museum (2)

in-order superscalar pipeline

Intel Pentium: two parallel pipelines (Fetch, Decd, Read, Exe, Mem, Write). The Bill Cosby Rule:* “You’re not a parent if you only have one child.”

*rule named by Amir Roth

A tour of a microprocessor museum (3)

out-of-order superscalar

Intel Pentium 4

A tour of a microprocessor museum (end)

out-of-order superscalar

typical buffers, queues, windows

decode buffer, reservation stations, reorder buffer (ROB), store buffer, missed loads

Processors are good at tolerating latency, but poor at deciding what to tolerate.

Microprocessors are fine-grain parallel systems, like wide-area networks:

  • queues are like routers, pipelines are like communication links
  • many (bad) events go on in parallel, their latency tolerated

Why critical path?

Outline

The model of micro-execution

  • capture both program and processor constraints

Four metrics:

  • criticality
  • slack
  • execution modes
  • cost
  • Critical path of a microexecution

Critical path misconceptions:

  • “Every ‘bad event’ is critical.”
  • branch misprediction
  • reorder-buffer stall
  • L1 cache miss
  • L2 cache miss
  • “Critical path is obvious … it contains instructions providing data for ‘bad events’.”

SLIDE 3

Modeling: why hard?

Critical path consists of:

1. instructions and data dependences
  • as in a traditional “compiler” view
2. microarchitectural resource constraints
  • branch mispredictions, finite fetch b/w, etc.

Together they describe the microexecution of a given program executing on a given machine.

How to model in a uniform way?

  • Resource dependencies

Resources constrain the dataflow execution
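Both kinds of constraints can be folded into one dependence graph. A minimal sketch (my own illustration of the F/E/C node model, not the talk's exact construction): each dynamic instruction i gets three nodes (i, F), (i, E), (i, C), and edges encode both program and machine constraints.

```python
def build_graph(n, rob_size, data_deps, mispredicted):
    """Build the microexecution dependence graph as an edge list.
    data_deps: list of (producer, consumer) instruction index pairs.
    mispredicted: set of instruction indices that are mispredicted branches."""
    edges = []
    for i in range(n):
        edges.append(((i, "F"), (i, "E")))          # fetch before execute
        edges.append(((i, "E"), (i, "C")))          # execute before commit
        if i > 0:
            edges.append(((i - 1, "F"), (i, "F")))  # in-order fetch
            edges.append(((i - 1, "C"), (i, "C")))  # in-order commit
        if i >= rob_size:
            edges.append(((i - rob_size, "C"), (i, "F")))  # finite ROB
    for p, c in data_deps:
        edges.append(((p, "E"), (c, "E")))          # program data dependence
    for b in mispredicted:
        if b + 1 < n:
            edges.append(((b, "E"), (b + 1, "F")))  # redirect fetch after misp.
    return edges
```

This is why the model is uniform: program and machine constraints are just different edge kinds in the same graph.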

Critical Path Models (1)

First, for a simple in-order machine

  • Data dependencies

(figure: dynamic instructions i1…i5, oldest to newest, linked by data-dependence edges)

Critical Path Models (2)

For an out-of-order machine

(figure: F, E, and C nodes for instructions i1…i5, oldest to newest: fetch in order, execute out of order, commit in order)

Critical Path Models (3)

OOO + finite re-order buffer

(figure: F/E/C graph with added ROB-size edges constraining fetch)

Critical Path Models (4)

OOO + finite ROB + branch misp

(figure: F/E/C graph with an edge from the mispredicted branch’s execute node to the next fetch node)

Example

(figure: F/E/C dependence graph from first to last instruction, with edge latencies in cycles)

SLIDE 4

Example

(figure: the F/E/C graph with edge latencies)

CP Length = 16 cycles ⇒ Exe Time = 16 cycles

Example

(figure: the same graph)

CP Length = 16 cycles ⇒ Exe Time = 16 cycles

What if this load is an L1 miss? (3 cycles → 12 cycles)
Example

F E C

1 1 1 1 3 2 1 1 2 1 1 4 2 1 1 1 2 1 2 1 1 4 1 1 2 1 1 2 12 1

CP Lengt h = 19 cycles ⇒ Exe Time = 19 cycles

what if t his load is an L1 miss?

(3 cycles 12 cycles)
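On the weighted graph, the CP length is just the longest path in the DAG. A sketch (the edge-list and weight-dict representation is my assumption for illustration, not the talk's data structure):

```python
from collections import defaultdict

def critical_path_length(edges, weights):
    """edges: list of (u, v); weights: dict (u, v) -> latency in cycles.
    Returns the longest-path length in the DAG = predicted execution time."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm for a topological order
    order, frontier = [], [n for n in nodes if indeg[n] == 0]
    while frontier:
        u = frontier.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    # longest path by relaxing edges in topological order
    dist = {n: 0 for n in nodes}
    for u in order:
        for v in succ[u]:
            dist[v] = max(dist[v], dist[u] + weights[(u, v)])
    return max(dist.values())

# Tiny example: A->B->C (3+2) beats A->C (4), so the CP is 5 cycles.
edges = [("A", "B"), ("B", "C"), ("A", "C")]
weights = {("A", "B"): 3, ("B", "C"): 2, ("A", "C"): 4}
print(critical_path_length(edges, weights))  # 5
```

Bumping one weight (e.g. a load going from 3 to 12 cycles) and recomputing reproduces the 16-cycle → 19-cycle jump shown above.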

Execution Modes

Three modes of execution:

  • fetch limited (F-mode)
  • execute limited (E-mode)
  • commit limited (C-mode)

(figure: F/E/C graph segments labeled F-mode, E-mode, C-mode)

Execution Modes

Entering F-mode:

  • start of program (1st instruction in program)
  • branch misprediction
  • ROB stall

Entering E-mode:

  • fetch catches up
  • ROB stall

Entering C-mode

(figure: F/E/C graph segments illustrating each transition)

Validation: can we trust our model?

(chart: execution-time reduction in cycles per cycle of latency reduced, 0.1 to 1, for crafty, eon, gcc, gzip, parser, perl, twolf, vortex, ammp, art, galgel, mesa; reducing CP latencies vs. reducing non-CP latencies)

SLIDE 5

Outline

The model of micro-execution

  • capture both program and processor constraints

Four met rics:

  • criticality
  • slack
  • execution modes
  • cost

Current policies are egalitarian: all “bad” events are equally harmful.

mechanism                   question                    current policy      better policy
prediction and speculation  when to speculate?          on each prediction  only critical
prefetch                    what to prefetch?           all misses          critical pre-fetch, pre-execute
non-blocking caches         how to serve mem requests?  FIFO                critical first
OOO execution               how to schedule?            oldest first        critical first


Why a criticality predictor? Policies!

Prediction: why hard?

Three steps:

1. observe the microexecution ⇒ hard!
  • measuring edge latencies is intrusive
2. analyze to find critical path ⇒ hard!
  • graph too large to buffer
  • and topological sort too complex
3. store prediction for later use ⇒ easy!
  • store in table indexed by PC

Step 1. Observing

R1 ← R2 + R3: if the dependence into R2 is on the critical path, then the value of R2 arrived last.

critical ⇒ arrives last, but arrives last ⇏ critical (the dependence may have been resolved early).

Last-arrive edges: a CPU stethoscope

(figure: a CPU probed by last-arrive edges among F, E, and C nodes)

Implementing last-arrive edges

Observe events within the machine

  • F→E: last-arrive if data ready on fetch
  • E→E: observe arrival order of operands
  • E→C: if commit pointer is delayed; C→C otherwise
  • E→F: if branch mispredicted; C→F if ROB stall; F→F otherwise

SLIDE 6

Last-arrive edges

(figure: the example F/E/C graph with edge latencies)

Remove latencies

(figure: the same graph without weights; we do not need explicit weights)

Prune the graph

Only last-arrive edges are needed (other edges must be non-critical).

… and we’ve found the critical path!

Backward propagate along last-arrive edges.

Found CP by observing only last-arrive edges, but this still requires constructing the entire graph.
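The backward propagation step can be sketched as follows, assuming a hypothetical `last_arrive` map that records, for each node, the predecessor whose signal arrived last (None for the very first node). No edge weights are needed.

```python
def critical_path(last_arrive, last_node):
    """Walk last-arrive edges backward from the final commit node.
    last_arrive: dict node -> last-arrive predecessor (or None).
    Returns the critical path in execution order."""
    path = [last_node]
    while last_arrive.get(path[-1]) is not None:
        path.append(last_arrive[path[-1]])
    path.reverse()
    return path

# Hypothetical recorded last-arrive predecessors for a 3-instruction run:
la = {"C3": "E3", "E3": "E1", "E1": "F1", "F1": None}
print(critical_path(la, "C3"))  # ['F1', 'E1', 'E3', 'C3']
```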

Prediction: why hard?

Three steps:

1. observe the microexecution ⇒ solved!
  • measuring edge latencies is intrusive
2. analyze to find critical path
  • graph too large to buffer ⇒ hard!
  • and topological sort too complex ⇒ solved!
3. store prediction for later use ⇒ easy!
  • store in table indexed by PC

Step 2. Efficient analysis (predictor training)

CP is a “long” chain of last-arrive edges ⇒ the longer a given chain of last-arrive edges, the more likely it is part of the CP.

Algorithm: find sufficiently long last-arrive chains.

1. Plant token into a node n.
2. Propagate forward, only along last-arrive edges.
3. Check for token after several hundred cycles.
4. If token alive, n is assumed critical.
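The four steps above can be sketched in software (the hardware version propagates one step per cycle; `last_arrive_succs`, an adjacency map of last-arrive edges, is a hypothetical name for this illustration):

```python
def token_alive(last_arrive_succs, plant_node, horizon):
    """Plant a token at plant_node and propagate it forward only along
    last-arrive edges for `horizon` steps. If some copy of the token
    survives, the planted node is trained as critical."""
    frontier = {plant_node}
    for _ in range(horizon):
        frontier = {v for u in frontier
                      for v in last_arrive_succs.get(u, [])}
        if not frontier:
            return False   # token died: chain too short, assume non-critical
    return True            # token alive after the horizon: train as critical

# A long chain keeps the token alive; a short one kills it.
succs = {1: [2], 2: [3], 3: [4], 4: [5], 5: [6]}
print(token_alive(succs, 1, 4))  # True
print(token_alive(succs, 4, 4))  # False
```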

SLIDE 7

Token-passing example

  • 1. plant token
  • 2. propagate token
  • 3. is token alive?
  • 4. yes, train critical

Found CP without constructing the entire graph.

Implementation: a small SRAM array

(figure: token queue indexed by the last-arrive producer node (inst id, F/E/C), read on last-arrive, written on commit (inst id, F/E/C))

Size of SRAM: 3 bits × ROB size < 200 bytes; can simply replicate for additional tokens.

Putting it all together

(figure: the OOO core feeds last-arrive edges (producer of each retired instruction) to the token-passing analyzer on the training path; the CP prediction table, indexed by PC, answers “E-critical?” on the prediction path)

Steps to exploiting critical path: modeling, predicting, applying.

1. resource arbitration
  • case study: cluster scheduling
2. speculation control
  • case study: value prediction

Experiment Setup

Aggressive Core

  • 8-way issue, 256-entry window
  • three configurations: core split into 1, 2, or 4 clusters:
  • unclustered: 8-way, 256-entry
  • 2 clusters: each 4-way, 128-entry
  • 4 clusters: each 2-way, 64-entry

CP Predictor

  • 8 tokens (1.5 KB token-passing array)
  • 16K-entry array for storing predictions (12 KB)
  • 6-bit hysteresis

Case Study #1: Clustered architectures

(figure: steering into per-cluster issue windows, then scheduling)

  • 1. current state of the art (Base)
  • 2. Base + CP Scheduling
  • 3. Base + CP Scheduling + CP Steering
SLIDE 8

(chart: normalized IPC, 0.60 to 1.10, for eon, crafty, gcc, gzip, perl, vortex, galgel, mesa; unclustered vs. 2-cluster vs. 4-cluster)

Current State of the Art

  • Avg. clustering penalty for 4 clusters: 19%

Constant issue width, clock frequency.

(chart: normalized IPC, same benchmarks and configurations)

CP Optimizations: Base + CP Scheduling

(chart: normalized IPC, same benchmarks and configurations)

CP Optimizations: Base + CP Scheduling + CP Steering

  • Avg. clustering penalty reduced from 19% to 6%

Local vs. Global Analysis

Previous CP predictors: local stall-based predictions (HPCA ’01, ISCA ’01, by others)

(chart: speedup, -5.0% to 25.0%, for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; oldest-uncommitted vs. oldest-unissued vs. token-passing)

CP exploitation seems to require global analysis.

  • 3. One predictor, many optimizations:
  • schedule critical instructions first: up to 20% speedup
  • value predict only critical: up to 5% speedup

Criticality: contributions

  • 1. Critical path of a microexecution consists of:
  • program-induced data dependences
  • machine-induced resource dependences
  • 2. CP prediction = global run-time analysis:
  • observe last-arrive edges, analyze via token-passing, apply predictions

Outline

The model of micro-execution

  • capture both program and processor constraints

Four met rics:

  • criticality
  • slack
  • execution modes
  • cost
SLIDE 9

Beyond criticality

  • Slack (definition):

the number of cycles an instruction can be slowed down before it becomes critical.

  • Slack is prevalent:

75% of dynamic instructions can be delayed by at least 5 cycles with no impact on performance (no slowdown).

  • How to compute slack?
  • in a simulator
  • in hardware

Why is slack useful?

  • Non-uniform machines:
  • resources at multiple levels of quality
  • to deal with technological constraints
  • to save power: slow/fast clusters of ALUs
  • wire delay: some caches further away
  • The problem boils down to controlling non-uniform machines:
  • goal: hide the (longer) latency of low-quality resources
  • slack lets us do this

How to compute slack?

  • On the graph:
  • two-pass topological sort
  • In the processor:
  • delay and observe: by reduction to criticality analysis
  • delay instruction i by n cycles
  • if i is not critical, then i had at least n cycles of slack
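The delay-and-observe reduction can be sketched as a search loop, assuming a hypothetical `run_with_delay(instr, n)` hook that reruns the microexecution with instruction `instr` delayed by n cycles and returns total execution time:

```python
def apparent_slack(baseline_time, run_with_delay, instr, max_delay):
    """Largest delay n that leaves total execution time unchanged;
    this is a lower bound on instr's slack."""
    slack = 0
    for n in range(1, max_delay + 1):
        if run_with_delay(instr, n) > baseline_time:
            break          # instr became critical: delay n exceeded its slack
        slack = n
    return slack

# Stand-in for a simulator: this instruction has exactly 5 cycles of slack.
fake_run = lambda instr, n: 100 if n <= 5 else 100 + (n - 5)
print(apparent_slack(100, fake_run, "i7", 10))  # 5
```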

Experiments

  • Power-aware machine:
  • two clusters:
  • fast: full frequency
  • slow: half frequency (consumes ¼ power)
  • three non-uniformities: 1. 2. 3.
  • results:
  • within 3% of the performance of two fast clusters
  • existing techniques: 10% slowdown

Outline

The model of micro-execution

  • capture both program and processor constraints

Four met rics:

  • criticality
  • slack
  • execution modes
  • cost

Reconfigurable machines

  • Imagine that, to save power, you can dynamically:

1. turn on/off some ALUs
2. change their frequency

Problem: how to adapt the machine configuration to the program’s needs?

SLIDE 10

Outline

The model of micro-execution

  • capture both program and processor constraints

Four met rics:

  • criticality
  • slack
  • execution modes
  • cost

Finally, the quantitative approach

  • All boils down to computing the cost of an instruction:
  • can easily compute it from the graph, if the graph is available (in the simulator)
  • can we compute it in hardware?

A new version of the randomized algorithm?

The future

Superscalar complexity haunts:

  • not only circuit designers
  • and verification engineers
  • but also performance engineers
  • and hence also architects themselves

Critical-path instruction processing helps:

  • understand performance complexity
  • and hence also:
  • exploit existing designs better
  • lead to simpler designs

Effect of CP scheduling on future designs?

(chart: 70% to 110%, 1/2/4 clusters, No CP vs. With CP)

1) Cluster the machine:
  • 8-way machine split into 1, 2, 4 clusters
  • as in the previous experiment

2) Enlarge scheduling window:
  • 4-cluster × 2-way machine
  • vary the window size in each cluster

(chart: 80% to 180%, window sizes 32/64/128/256, No CP vs. With CP)

3) Add clusters:
  • each cluster is 2-way, 64-entry window

(chart: 80% to 240%, 1/2/4 clusters, No CP vs. With CP)

This talk is about:

Making processors smarter

  • a modern processor: strong body, weak mind
  • example: can execute instructions out of order, but does so without considering instruction cost

Making them smarter = teaching them how to find bottlenecks

  • instructions whose latency hurts
  • resources whose contention hurts

I will show how to

  • find bottlenecks (at run-time, with simple hardware), and
  • alleviate them (using existing resources, retrofitting)

Our solution:

Critical-Path Instruction Processing

  • critical-path analysis of µ-execution performance
  • critical-path prediction
  • critical-path hardware optimizations
SLIDE 11

Critical-path instruction processing tames the µ-architectural complexity behind superscalar performance:

  • find execution bottlenecks, and
  • alleviate them.

Why critical path?

CP has been used to understand bottlenecks in large-scale systems:

  • message passing and locking in shared-memory systems (Hollingsworth [IEEE PDS ’98])
  • TCP transactions (Barford and Crovella [SIGCOMM ’00])

(figure: call, arrive at barrier, leave barrier, with edge latencies 1, 3, 1)

Speculation Control: Value prediction

Optimization: value predict only critical instructions

  • removes speculations that:
  • have no benefit, but
  • may have high misspeculation recovery cost

Are critical instructions value predictable?

(chart: 0% to 50% value-prediction rate, on CP vs. off CP, for compress, gcc, m88ksim, gzip, applu, swim, wave5)

0.0% 5.0% 10.0% 15.0% 20.0% 25.0% 30.0% 35.0% 40.0% 45.0% 50.0% 55.0% gcc gzip parser perl twolf ammp art Speedup over no value prediction No CP
  • ldest -uncom m it t ed
  • ldest -unissued
token -p assin g

Focused Value Prediction (2)

  • 30%
  • 20%
  • 10%

0% 10% 20% 30% 40% 50% 60%

ammp gcc gzip parser perl twolf

No CP

  • ldest-uncommitted
  • ldest-unissued

token-passing

less confident value predictor more value mispredictions: