

Slide 1

Major Challenges to Achieve Exascale Performance

Shekhar Borkar, Intel Corp., April 29, 2009

Acknowledgment: Exascale WG sponsored by Dr. Bill Harrod, DARPA (IPTO)

Slide 2

Outline

  • Exascale performance goals
  • Major challenges
  • Potential solutions
  • Paradigm shift
  • Summary

Slide 3

Performance Roadmap

[Chart: peak system performance (GFLOPs, log scale) vs. year, 1960-2020, marking the MFLOP, GFLOP, TFLOP, PFLOP, and projected EFLOP milestones, with each 1000x step arriving in roughly 10-12 years.]

Slide 4

From Giga to Exa, via Tera & Peta

[Four charts, 1986-2016, comparing the Giga, Tera, Peta, and projected Exa machines: relative transistor performance, relative energy per operation (the 5V era, then Vcc scaling), concurrency, and power. The annotations show transistor performance and energy per operation improving far more slowly than in past decades, so the step to Exa demands on the order of 4,000X more concurrency.]

Slide 5

Building with Today’s Technology

A TFLOP machine today:

  • Compute: 200 pJ per FLOP -> 200 W
  • Memory: 0.1 B/FLOP @ 1.5 nJ per Byte -> 150 W
  • Communication: 100 pJ per FLOP -> 100 W
  • Disk: 10 TB @ 1 TB/disk @ 10 W -> 100 W
  • Decode and control, translations, power supply losses, cooling, etc. -> 4,450 W
  • Total: ~5 KW

KW Tera, MW Peta, GW Exa?
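The arithmetic behind that total, scaled up to an exaflop, is worth seeing end to end. A minimal sketch (Python), using only the per-operation energies quoted on this slide; the 4,450 W of overheads is taken from the slide rather than modeled:

    # Back-of-envelope power for a 1 TFLOP machine built from these
    # per-operation energies (all numbers from the slide).
    FLOPS = 1e12                        # 1 TFLOP sustained

    compute_w  = 200e-12 * FLOPS        # 200 pJ/FLOP            -> 200 W
    memory_w   = 0.1 * 1.5e-9 * FLOPS   # 0.1 B/FLOP @ 1.5 nJ/B  -> 150 W
    comm_w     = 100e-12 * FLOPS        # 100 pJ of com per FLOP -> 100 W
    disk_w     = 10 * 10                # 10 disks @ 10 W        -> 100 W
    overhead_w = 4450                   # decode/control, PSU losses, cooling

    total_w = compute_w + memory_w + comm_w + disk_w + overhead_w
    print(f"TFLOP machine: {total_w / 1e3:.1f} kW")                  # ~5 kW
    print(f"Naively scaled to EFLOP: {total_w * 1e6 / 1e9:.0f} GW")  # ~5 GW

Multiplying a 5 kW TFLOP machine by a million gives the "GW Exa?" on the slide, which is why the rest of the deck is about driving the per-operation energies down.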

Slide 6

The Power & Energy Challenge

A TFLOP machine today: Compute 200 W, Memory 150 W, Com 100 W, Disk 100 W, plus roughly 4.5 KW of overheads, about 5 KW in all.

The same TFLOP machine built with Exa technology: roughly 5 W, 2 W, ~5 W, ~3 W, and 5 W for the same categories, about 20 W total.

Slide 7

80 Core TFLOP Chip (Starting Point: Optimistic yet Realistic)

[Die plot: a 21.72 mm x 12.64 mm die tiled with 1.5 mm x 2.0 mm tiles, plus I/O areas, PLL, and TAP. Each tile holds two floating-point MACs (FPMAC0, FPMAC1), a router with MSINT interfaces, instruction and data memories (IMEM, DMEM), a register file (RF), the RIB, and clocking (CLK) fed from a global clock spine with clock buffers.]

  • Process technology: 65nm CMOS
  • Interconnect: 1 poly, 8 metal (Cu)
  • Transistors: 100 million
  • Die area: 275 mm2; tile area: 3 mm2
  • Package: 1248-pin LGA, 14 layers, 343 signal pins

Slide 8

Scaling Assumptions

  Technology (high volume)    45nm    32nm    22nm    16nm    11nm    8nm     5nm
                              (2008)  (2010)  (2012)  (2014)  (2016)  (2018)  (2020)
  Transistor density          1.75x per generation
  Frequency scaling           15%     10%     8%      5%      4%      3%      2%
  Vdd scaling                 10%     7.5%    5%      2.5%    1.5%    1%      0.5%
  Dimension & capacitance     0.75x per generation
  SD leakage scaling/micron   1x (optimistic) to 1.43x (pessimistic)

65nm core + local memory: 0.35 MB memory in 5 mm2 (50%); DP FP add and multiply, integer core, RF, and router in 5 mm2 (50%). Total: 10 mm2, 3 GHz, 6 GF, 1.8 W.

8nm core + local memory: 0.35 MB memory in 0.17 mm2 (50%); DP FP add and multiply, integer core, RF, and router in 0.17 mm2 (50%). Total: 0.34 mm2 (~0.6 mm on a side), 4.6 GHz, 9.2 GF, 0.24 to 0.46 W.
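Compounding those per-generation factors is how the 8nm figures follow from the 65nm starting point. A minimal sketch (Python) of that projection; it reproduces the frequency, performance, and area numbers, while the quoted 0.24-0.46 W power range additionally folds in the slide's 1x-1.43x leakage assumption, which is not modeled here:

    # Project the 65nm core (10 mm2, 3 GHz, 6 GF) to 8nm using the slide's
    # per-generation scaling assumptions (six generations: 45nm ... 8nm).
    freq_gain    = [1.15, 1.10, 1.08, 1.05, 1.04, 1.03]  # frequency gain per node
    density_gain = 1.75                                   # transistor density per node

    area_mm2, freq_ghz, gflops = 10.0, 3.0, 6.0           # 65nm starting point
    for g in freq_gain:
        freq_ghz *= g                 # frequency improves a little each node
        gflops   *= g                 # same datapath, so performance tracks frequency
        area_mm2 /= density_gain      # same transistor count packed more densely

    print(f"8nm core: {area_mm2:.2f} mm2, {freq_ghz:.2f} GHz, {gflops:.1f} GF")
    # -> about 0.35 mm2, 4.61 GHz, 9.2 GF, matching the slide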

Slide 9

Processor Chip

2018, 8nm technology node

A 20 mm x 20 mm (400 mm2) die:

  • Cores/module: 1150
  • Total local memory: 400 MB
  • Frequency: 4.61 GHz
  • Peak performance: 10.6 TF
  • Power: 300-600 W
  • Energy efficiency: 34-18 GF/Watt

30-60 MW for Exascale
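A minimal sketch (Python) of the arithmetic that turns the chip above into the 30-60 MW headline; it counts compute chips only, with no DRAM, fabric, or facility overheads yet:

    # An exaflop built from the 8nm chip above: how many chips, how much power.
    cores_per_chip = 1150
    gf_per_core    = 9.2
    chip_power_w   = (300, 600)     # the slide's optimistic/pessimistic range

    chip_tf = cores_per_chip * gf_per_core / 1e3           # ~10.6 TF per chip
    n_chips = 1e18 / (chip_tf * 1e12)                      # chips for 1 EFLOP peak
    for p_w in chip_power_w:
        print(f"{chip_tf:.1f} TF/chip, {n_chips:,.0f} chips, "
              f"{n_chips * p_w / 1e6:.0f} MW")
    # -> roughly 95,000 chips and 28-57 MW of compute power alone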

[Charts: projected single-chip performance (GFLOPs) and chip power (W) across the 65nm through 5nm technology nodes.]

Slide 10

Processor Node

[Node diagram: the processor chip with four 128 GB DRAM modules, each attached by a 64b, 256 GB/s link.]

  • Peak performance: 10.6 TF
  • Total DRAM capacity: 512 GB
  • Total DRAM BW: 1 TB/s (0.1 B/FLOP)
  • DRAM power: 800 W*
  • Total power: 1100-1400 W
  • Energy efficiency: 9.5-8 GF/Watt

*Assumes 5% Vdd scaling each technology generation and 140 pJ consumed per accessed bit.

110-140 MW for Exascale
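A minimal sketch (Python) of the node-level arithmetic behind 110-140 MW; the 800 W DRAM figure is taken as given from the slide rather than re-derived from the 140 pJ/bit footnote:

    # Compute chip + DRAM = node; nodes for an exaflop = system power.
    node_tf      = 10.6
    dram_power_w = 800                      # slide's DRAM estimate at 1 TB/s
    n_nodes      = 1e18 / (node_tf * 1e12)  # ~95,000 nodes for 1 EFLOP peak

    for compute_w in (300, 600):            # chip power range from the previous slide
        node_w = compute_w + dram_power_w
        print(f"node {node_w} W, {node_tf * 1e3 / node_w:.1f} GF/W, "
              f"system {n_nodes * node_w / 1e6:.0f} MW")
    # -> 9.6 to 7.6 GF/W per node and roughly 104-132 MW at exascale

DRAM alone contributes roughly 75 MW of that total, which is why later slides revise the DRAM architecture and stress data locality.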

Slide 11

Node Power Breakdown

[Pie chart: node power split among Compute, DRAM, and the communication Fabric.]

A 10 TF node at ~1 KW. Levers for reducing it:

  • Aggressive voltage scaling
  • Hierarchical, heterogeneous topologies
  • Efficient signaling
  • Repartitioning

Slide 12

Voltage Scaling

[Chart: normalized frequency, total power, leakage, and energy efficiency plotted against normalized Vdd, from about 0.3 to 1.0.]

These gains hold when the design is built to voltage scale.
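The shape of those curves can be illustrated with a first-order model. This is only an assumed sketch (alpha-power delay, C*V^2*f dynamic power, a simple leakage term), not the silicon data behind the slide's plot; the threshold voltage and leakage coefficient below are made-up illustrative values:

    # Why energy efficiency improves as Vdd scales down toward threshold.
    VT = 0.3                          # assumed threshold voltage (normalized)

    def freq(v, alpha=1.5):
        # alpha-power delay model: f ~ (V - Vt)^alpha / V
        return (v - VT) ** alpha / v

    def power(v, leak=0.1):
        # dynamic power C*V^2*f plus a crude leakage term proportional to V
        return v * v * freq(v) + leak * v

    for v in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4):
        eff = freq(v) / power(v)      # throughput per Watt, arbitrary units
        print(f"Vdd={v:.1f}  freq={freq(v):.2f}  power={power(v):.3f}  eff={eff:.2f}")
    # Frequency falls roughly linearly, power roughly cubically, so efficiency
    # keeps improving until leakage and slow transistors take over near Vt.

The catch, taken up on later slides, is that the lost frequency has to be made up with many more cores.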

Slide 13

Energy Efficiency with Vdd Scaling

[Chart: compute energy efficiency (GF/W) across the 65nm through 5nm nodes at nominal Vdd, 0.7x Vdd, and 0.5x Vdd: roughly a 3X gain in compute energy efficiency from Vdd scaling.]

Slide 14

On-die Mesh Interconnect

On-die network (mesh) power is high, and it gets worse if the link width scales up each generation.

[Charts: a fixed 20 mm die across 45nm (70 cores), 32nm (123 cores), 22nm (214 cores), and 16nm (375 cores), plus a stacked chip-power breakdown (network vs. compute) from 65nm to 5nm.]

Slide 15

Mesh—Retrospective

Bus: Good at board level, does not extend well

  • Transmission line issues: loss and signal integrity, limited frequency
  • Width is limited by pins and board area
  • Broadcast, simple to implement

Point to point busses: fast signaling over longer distance

  • Board level, between boards, and racks
  • High frequency, narrow links
  • 1D Ring, 2D Mesh and Torus to reduce latency
  • Higher complexity and latency in each node

Hence the emergence of packet-switched networks. But does a point-to-point packet-switched network belong on a chip?

Slide 16

Interconnect Delay & Energy

[Chart, 65nm at 3 GHz: interconnect delay (ps, log scale) and energy (pJ/bit) vs. wire length (5-20 mm), with router delay marked for comparison.]
slide-17
SLIDE 17

17

Bus—The Other Extreme…

Issues:
  • Slow, < 300 MHz
  • Shared, limited scalability?

Solutions:
  • Repeaters to increase frequency
  • Wide busses for bandwidth
  • Multiple busses for scalability

Benefits:
  • Power?
  • Simpler cache coherency

Move away from frequency, embrace parallelism.

Slide 18

Hierarchical & Heterogeneous

[Diagram: clusters of cores (C) sharing a local bus to connect over short distances, with the clusters joined either by a 2nd-level bus or by routers (R).]

A hierarchy of busses, or hierarchical circuit- and packet-switched networks.

Slide 19

Revise DRAM Architecture

Traditional DRAM (RAS/CAS, page-organized):
  • Activates many pages
  • Lots of reads and writes (refresh)
  • Only a small amount of the read data is used
  • Requires a small number of pins

New DRAM architecture (directly addressed):
  • Activates few pages
  • Reads and writes (refreshes) only what is needed
  • All read data is used
  • Requires a large number of IOs (3D)

Energy cost today: ~175 pJ/bit [breakdown: signaling, DRAM array, control].

Slide 20

Data Locality

Core-to-core communication (on the chip): ~10 pJ per Byte
Chip-to-chip communication: ~100 pJ per Byte
Chip-to-memory communication: ~1.5 nJ per Byte (~150 pJ per Byte with a revised memory path)

Data movement is expensive: keep it local. In order of preference: (1) core to core, (2) chip-to-chip, (3) memory.
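To make the "keep it local" point concrete, here is a minimal sketch (Python) that converts those per-byte energies into the power an exaflop machine would spend purely on data movement, assuming 0.1 B of traffic per FLOP (the bytes-per-FLOP ratio used elsewhere in the deck); mapping the two chip-to-memory figures to "today" vs. "revised" is an assumption:

    # Power spent on data movement alone at exascale, for each locality level.
    EFLOPS       = 1e18
    bytes_per_op = 0.1                        # assumed traffic per FLOP (0.1 B/FLOP)

    energy_per_byte = {                       # per-byte energies from the slide
        "core-to-core (on chip)":    10e-12,
        "chip-to-chip":             100e-12,
        "chip-to-memory (today)":   1.5e-9,
        "chip-to-memory (revised)": 150e-12,
    }

    for level, joules in energy_per_byte.items():
        megawatts = EFLOPS * bytes_per_op * joules / 1e6
        print(f"{level:26s} {joules * 1e12:6.0f} pJ/B -> {megawatts:6.0f} MW")
    # -> ~1 MW if traffic stays on chip, ~10 MW chip-to-chip, and ~150 MW if
    #    every operand came from today's DRAM: locality is a first-order knob.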

Slide 21

Impact of Exploding Parallelism

[Chart: millions of cores per EFLOP across the 65nm through 5nm nodes, at 1x, 0.7x, and 0.5x Vdd (roughly 100 to 450 million cores).]

The curves are almost flat across nodes because Vdd is already close to Vt. Scaling Vdd down means roughly a 4X increase in the number of cores (parallelism), increased communication and the energy that goes with it, and more hardware, hence more unreliability (a rough core-count calculation follows the list below).

  • 1. Strike a balance between communication and computation
  • 2. Resiliency (gradual, intermittent, permanent faults)
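A minimal sketch (Python) of where core counts in that range come from: an exaflop divided by per-core throughput. The 9.2 GF figure is the 8nm core at full Vdd from earlier slides; the relative-frequency factors assumed below for 0.7x and 0.5x Vdd are illustrative guesses, not the slide's model:

    # Cores needed for a peak exaflop as per-core throughput drops with Vdd.
    full_vdd_gf = 9.2                                     # 8nm core at nominal Vdd
    rel_freq    = {"1.0x Vdd": 1.0,                       # assumed relative frequency
                   "0.7x Vdd": 0.5,
                   "0.5x Vdd": 0.25}

    for label, f in rel_freq.items():
        cores = 1e18 / (full_vdd_gf * 1e9 * f)
        print(f"{label}: ~{cores / 1e6:.0f} million cores per EFLOP")
    # -> roughly 110M, 220M, and 430M cores: slower, lower-voltage cores mean
    #    several times more of them, and more communication between them.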
Slide 22

Road to Unreliability?

From Peta to Exa, the reliability issues compound:

  • 1,000X parallelism: more hardware for something to go wrong; >1,000X intermittent faults due to soft errors
  • Aggressive Vcc scaling to reduce power/energy: gradual faults due to increased variations; more susceptibility to Vcc droops (noise) and to dynamic temperature variations; exacerbates intermittent faults (soft errors)
  • Deeply scaled technologies: aging-related faults; lack of burn-in?; variability increases dramatically

Resiliency will be the cornerstone.

Slide 23

Resiliency

Faults and examples:

  Permanent faults      Stuck-at 0 & 1
  Gradual faults        Variability, temperature
  Intermittent faults   Soft errors, voltage droops
  Aging faults          Degradation

Faults cause errors (data & control):

  • Datapath errors: detected by parity/ECC; silent data corruption needs HW hooks
  • Control errors: control is lost (blue screen)

Minimal overhead for resiliency

[Diagram: resiliency responsibilities (error detection, fault isolation, fault confinement, reconfiguration, recovery & adaptation) layered across circuit & design, microarchitecture, microcode/platform, system software, the programming system, and applications.]

Slide 24

Needs a Paradigm Shift

Evaluate each (old) architecture feature with new priorities

Past and present priorities:
  • Single-thread performance
  • Frequency
  • Programming productivity
  • Legacy, compatibility
  • Architecture features for productivity
  • Constraints: (1) cost, (2) reasonable power/energy

Future priorities:
  • Throughput performance
  • Parallelism
  • Power/energy
  • Architecture features for energy
  • Simplicity
  • Constraints: (1) programming productivity, (2) cost

Slide 25

Summary

Von Neumann computing and CMOS technology (nothing else in sight). Voltage scaling to reduce power and energy:

  • Explodes parallelism
  • Cost of communication vs computation—critical balance
  • Resiliency to combat side-effects and unreliability

A programming system for extreme parallelism.
System software to harmonize all of the above.