

Slide 1

Major Challenges to Achieve Exascale Performance

Shekhar Borkar, Intel Corp., April 29, 2009

Acknowledgment: Exascale WG sponsored by Dr. Bill Harrod, DARPA (IPTO)

Slide 2

Outline

  • Exascale performance goals
  • Major challenges
  • Potential solutions
  • Paradigm shift
  • Summary

Slide 3

Performance Roadmap

[Chart: peak system performance (GFLOPs, log scale) vs. year, 1960-2020, marking the MFLOP, GFLOP, TFLOP, PFLOP, and projected EFLOP milestones, with each 1000x step arriving in roughly 10-12 years.]

Slide 4

From Giga to Exa, via Tera & Peta

[Four charts, 1986-2016, comparing the Giga, Tera, Peta, and projected Exa machines: relative transistor performance, relative energy per operation (the 5V era, then Vcc scaling), concurrency, and power. The annotations show transistor performance and energy per operation improving far more slowly than in past decades, so the step to Exa demands on the order of 4,000X more concurrency.]

Slide 5

Building with Today’s Technology

A TFLOP machine today:

  • Compute: 200 pJ per FLOP -> 200 W
  • Memory: 0.1 B/FLOP @ 1.5 nJ per Byte -> 150 W
  • Communication: 100 pJ per FLOP -> 100 W
  • Disk: 10 TB @ 1 TB/disk @ 10 W -> 100 W
  • Decode and control, translations, power supply losses, cooling, etc. -> 4,450 W
  • Total: ~5 KW

KW Tera, MW Peta, GW Exa?
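The arithmetic behind that total, scaled up to an exaflop, is worth seeing end to end. A minimal sketch (Python), using only the per-operation energies quoted on this slide; the 4,450 W of overheads is taken from the slide rather than modeled:

    # Back-of-envelope power for a 1 TFLOP machine built from these
    # per-operation energies (all numbers from the slide).
    FLOPS = 1e12                        # 1 TFLOP sustained

    compute_w  = 200e-12 * FLOPS        # 200 pJ/FLOP            -> 200 W
    memory_w   = 0.1 * 1.5e-9 * FLOPS   # 0.1 B/FLOP @ 1.5 nJ/B  -> 150 W
    comm_w     = 100e-12 * FLOPS        # 100 pJ of com per FLOP -> 100 W
    disk_w     = 10 * 10                # 10 disks @ 10 W        -> 100 W
    overhead_w = 4450                   # decode/control, PSU losses, cooling

    total_w = compute_w + memory_w + comm_w + disk_w + overhead_w
    print(f"TFLOP machine: {total_w / 1e3:.1f} kW")                  # ~5 kW
    print(f"Naively scaled to EFLOP: {total_w * 1e6 / 1e9:.0f} GW")  # ~5 GW

Multiplying a 5 kW TFLOP machine by a million gives the "GW Exa?" on the slide, which is why the rest of the deck is about driving the per-operation energies down.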

Slide 6

The Power & Energy Challenge

A TFLOP machine today: Compute 200 W, Memory 150 W, Com 100 W, Disk 100 W, plus roughly 4.5 KW of overheads, about 5 KW in all.

The same TFLOP machine built with Exa technology: roughly 5 W, 2 W, ~5 W, ~3 W, and 5 W for the same categories, about 20 W total.

Slide 7

80 Core TFLOP Chip (Starting Point: Optimistic yet Realistic)

[Die plot: a 21.72 mm x 12.64 mm die tiled with 1.5 mm x 2.0 mm tiles, plus I/O areas, PLL, and TAP. Each tile holds two floating-point MACs (FPMAC0, FPMAC1), a router with MSINT interfaces, instruction and data memories (IMEM, DMEM), a register file (RF), the RIB, and clocking (CLK) fed from a global clock spine with clock buffers.]

  • Process technology: 65nm CMOS
  • Interconnect: 1 poly, 8 metal (Cu)
  • Transistors: 100 million
  • Die area: 275 mm2; tile area: 3 mm2
  • Package: 1248-pin LGA, 14 layers, 343 signal pins

Slide 8

Scaling Assumptions

  Technology (high volume)    45nm    32nm    22nm    16nm    11nm    8nm     5nm
                              (2008)  (2010)  (2012)  (2014)  (2016)  (2018)  (2020)
  Transistor density          1.75x per generation
  Frequency scaling           15%     10%     8%      5%      4%      3%      2%
  Vdd scaling                 10%     7.5%    5%      2.5%    1.5%    1%      0.5%
  Dimension & capacitance     0.75x per generation
  SD leakage scaling/micron   1x (optimistic) to 1.43x (pessimistic)

65nm core + local memory: 0.35 MB memory in 5 mm2 (50%); DP FP add and multiply, integer core, RF, and router in 5 mm2 (50%). Total: 10 mm2, 3 GHz, 6 GF, 1.8 W.

8nm core + local memory: 0.35 MB memory in 0.17 mm2 (50%); DP FP add and multiply, integer core, RF, and router in 0.17 mm2 (50%). Total: 0.34 mm2 (~0.6 mm on a side), 4.6 GHz, 9.2 GF, 0.24 to 0.46 W.
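Compounding those per-generation factors is how the 8nm figures follow from the 65nm starting point. A minimal sketch (Python) of that projection; it reproduces the frequency, performance, and area numbers, while the quoted 0.24-0.46 W power range additionally folds in the slide's 1x-1.43x leakage assumption, which is not modeled here:

    # Project the 65nm core (10 mm2, 3 GHz, 6 GF) to 8nm using the slide's
    # per-generation scaling assumptions (six generations: 45nm ... 8nm).
    freq_gain    = [1.15, 1.10, 1.08, 1.05, 1.04, 1.03]  # frequency gain per node
    density_gain = 1.75                                   # transistor density per node

    area_mm2, freq_ghz, gflops = 10.0, 3.0, 6.0           # 65nm starting point
    for g in freq_gain:
        freq_ghz *= g                 # frequency improves a little each node
        gflops   *= g                 # same datapath, so performance tracks frequency
        area_mm2 /= density_gain      # same transistor count packed more densely

    print(f"8nm core: {area_mm2:.2f} mm2, {freq_ghz:.2f} GHz, {gflops:.1f} GF")
    # -> about 0.35 mm2, 4.61 GHz, 9.2 GF, matching the slide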

Slide 9

Processor Chip

2018, 8nm technology node

A 20 mm x 20 mm (400 mm2) die:

  • Cores/module: 1150
  • Total local memory: 400 MB
  • Frequency: 4.61 GHz
  • Peak performance: 10.6 TF
  • Power: 300-600 W
  • Energy efficiency: 34-18 GF/Watt

30-60 MW for Exascale
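A minimal sketch (Python) of the arithmetic that turns the chip above into the 30-60 MW headline; it counts compute chips only, with no DRAM, fabric, or facility overheads yet:

    # An exaflop built from the 8nm chip above: how many chips, how much power.
    cores_per_chip = 1150
    gf_per_core    = 9.2
    chip_power_w   = (300, 600)     # the slide's optimistic/pessimistic range

    chip_tf = cores_per_chip * gf_per_core / 1e3           # ~10.6 TF per chip
    n_chips = 1e18 / (chip_tf * 1e12)                      # chips for 1 EFLOP peak
    for p_w in chip_power_w:
        print(f"{chip_tf:.1f} TF/chip, {n_chips:,.0f} chips, "
              f"{n_chips * p_w / 1e6:.0f} MW")
    # -> roughly 95,000 chips and 28-57 MW of compute power alone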

[Charts: projected single-chip performance (GFLOPs) and chip power (W) across the 65nm through 5nm technology nodes.]

Slide 10

Processor Node

[Node diagram: the processor chip with four 128 GB DRAM modules, each attached by a 64b, 256 GB/s link.]

  • Peak performance: 10.6 TF
  • Total DRAM capacity: 512 GB
  • Total DRAM BW: 1 TB/s (0.1 B/FLOP)
  • DRAM power: 800 W*
  • Total power: 1100-1400 W
  • Energy efficiency: 9.5-8 GF/Watt

*Assumes 5% Vdd scaling each technology generation and 140 pJ consumed per accessed bit.

110-140 MW for Exascale
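A minimal sketch (Python) of the node-level arithmetic behind 110-140 MW; the 800 W DRAM figure is taken as given from the slide rather than re-derived from the 140 pJ/bit footnote:

    # Compute chip + DRAM = node; nodes for an exaflop = system power.
    node_tf      = 10.6
    dram_power_w = 800                      # slide's DRAM estimate at 1 TB/s
    n_nodes      = 1e18 / (node_tf * 1e12)  # ~95,000 nodes for 1 EFLOP peak

    for compute_w in (300, 600):            # chip power range from the previous slide
        node_w = compute_w + dram_power_w
        print(f"node {node_w} W, {node_tf * 1e3 / node_w:.1f} GF/W, "
              f"system {n_nodes * node_w / 1e6:.0f} MW")
    # -> 9.6 to 7.6 GF/W per node and roughly 104-132 MW at exascale

DRAM alone contributes roughly 75 MW of that total, which is why later slides revise the DRAM architecture and stress data locality.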

Slide 11

Node Power Breakdown

[Pie chart: node power split among Compute, DRAM, and the communication Fabric.]

A 10 TF node at ~1 KW. Levers for reducing it:

  • Aggressive voltage scaling
  • Hierarchical, heterogeneous topologies
  • Efficient signaling
  • Repartitioning

Slide 12

Voltage Scaling

[Chart: normalized frequency, total power, leakage, and energy efficiency plotted against normalized Vdd, from about 0.3 to 1.0.]

These gains hold when the design is built to voltage scale.
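The shape of those curves can be illustrated with a first-order model. This is only an assumed sketch (alpha-power delay, C*V^2*f dynamic power, a simple leakage term), not the silicon data behind the slide's plot; the threshold voltage and leakage coefficient below are made-up illustrative values:

    # Why energy efficiency improves as Vdd scales down toward threshold.
    VT = 0.3                          # assumed threshold voltage (normalized)

    def freq(v, alpha=1.5):
        # alpha-power delay model: f ~ (V - Vt)^alpha / V
        return (v - VT) ** alpha / v

    def power(v, leak=0.1):
        # dynamic power C*V^2*f plus a crude leakage term proportional to V
        return v * v * freq(v) + leak * v

    for v in (1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4):
        eff = freq(v) / power(v)      # throughput per Watt, arbitrary units
        print(f"Vdd={v:.1f}  freq={freq(v):.2f}  power={power(v):.3f}  eff={eff:.2f}")
    # Frequency falls roughly linearly, power roughly cubically, so efficiency
    # keeps improving until leakage and slow transistors take over near Vt.

The catch, taken up on later slides, is that the lost frequency has to be made up with many more cores.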

Slide 13

Energy Efficiency with Vdd Scaling

[Chart: compute energy efficiency (GF/W) across the 65nm through 5nm nodes at nominal Vdd, 0.7x Vdd, and 0.5x Vdd: roughly a 3X gain in compute energy efficiency from Vdd scaling.]

Slide 14

On-die Mesh Interconnect

On-die network (mesh) power is high, and it gets worse if the link width scales up each generation.

[Charts: a fixed 20 mm die across 45nm (70 cores), 32nm (123 cores), 22nm (214 cores), and 16nm (375 cores), plus a stacked chip-power breakdown (network vs. compute) from 65nm to 5nm.]

Slide 15

Mesh—Retrospective

Bus: Good at board level, does not extend well

  • Transmission line issues: loss and signal integrity, limited frequency
  • Width is limited by pins and board area
  • Broadcast, simple to implement

Point to point busses: fast signaling over longer distance

  • Board level, between boards, and racks
  • High frequency, narrow links
  • 1D Ring, 2D Mesh and Torus to reduce latency
  • Higher complexity and latency in each node

Hence the emergence of packet-switched networks. But does a point-to-point packet-switched network belong on a chip?

Slide 16

Interconnect Delay & Energy

[Chart, 65nm at 3 GHz: interconnect delay (ps, log scale) and energy (pJ/bit) vs. wire length (5-20 mm), with router delay marked for comparison.]
slide-17
SLIDE 17

17

Bus—The Other Extreme…

Issues:
  • Slow, < 300 MHz
  • Shared, limited scalability?

Solutions:
  • Repeaters to increase frequency
  • Wide busses for bandwidth
  • Multiple busses for scalability

Benefits:
  • Power?
  • Simpler cache coherency

Move away from frequency, embrace parallelism.

Slide 18

Hierarchical & Heterogeneous

[Diagram: clusters of cores (C) sharing a local bus to connect over short distances, with the clusters joined either by a 2nd-level bus or by routers (R).]

A hierarchy of busses, or hierarchical circuit- and packet-switched networks.

Slide 19

Revise DRAM Architecture

Traditional DRAM (RAS/CAS, page-organized):
  • Activates many pages
  • Lots of reads and writes (refresh)
  • Only a small amount of the read data is used
  • Requires a small number of pins

New DRAM architecture (directly addressed):
  • Activates few pages
  • Reads and writes (refreshes) only what is needed
  • All read data is used
  • Requires a large number of IOs (3D)

Energy cost today: ~175 pJ/bit [breakdown: signaling, DRAM array, control].

Slide 20

Data Locality

Core-to-core communication (on the chip): ~10 pJ per Byte
Chip-to-chip communication: ~100 pJ per Byte
Chip-to-memory communication: ~1.5 nJ per Byte (~150 pJ per Byte with a revised memory path)

Data movement is expensive: keep it local. In order of preference: (1) core to core, (2) chip-to-chip, (3) memory.
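To make the "keep it local" point concrete, here is a minimal sketch (Python) that converts those per-byte energies into the power an exaflop machine would spend purely on data movement, assuming 0.1 B of traffic per FLOP (the bytes-per-FLOP ratio used elsewhere in the deck); mapping the two chip-to-memory figures to "today" vs. "revised" is an assumption:

    # Power spent on data movement alone at exascale, for each locality level.
    EFLOPS       = 1e18
    bytes_per_op = 0.1                        # assumed traffic per FLOP (0.1 B/FLOP)

    energy_per_byte = {                       # per-byte energies from the slide
        "core-to-core (on chip)":    10e-12,
        "chip-to-chip":             100e-12,
        "chip-to-memory (today)":   1.5e-9,
        "chip-to-memory (revised)": 150e-12,
    }

    for level, joules in energy_per_byte.items():
        megawatts = EFLOPS * bytes_per_op * joules / 1e6
        print(f"{level:26s} {joules * 1e12:6.0f} pJ/B -> {megawatts:6.0f} MW")
    # -> ~1 MW if traffic stays on chip, ~10 MW chip-to-chip, and ~150 MW if
    #    every operand came from today's DRAM: locality is a first-order knob.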

Slide 21

Impact of Exploding Parallelism

[Chart: millions of cores per EFLOP across the 65nm through 5nm nodes, at 1x, 0.7x, and 0.5x Vdd (roughly 100 to 450 million cores).]

The curves are almost flat across nodes because Vdd is already close to Vt. Scaling Vdd down means roughly a 4X increase in the number of cores (parallelism), increased communication and the energy that goes with it, and more hardware, hence more unreliability (a rough core-count calculation follows the list below).

  • 1. Strike a balance between communication and computation
  • 2. Resiliency (gradual, intermittent, permanent faults)
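A minimal sketch (Python) of where core counts in that range come from: an exaflop divided by per-core throughput. The 9.2 GF figure is the 8nm core at full Vdd from earlier slides; the relative-frequency factors assumed below for 0.7x and 0.5x Vdd are illustrative guesses, not the slide's model:

    # Cores needed for a peak exaflop as per-core throughput drops with Vdd.
    full_vdd_gf = 9.2                                     # 8nm core at nominal Vdd
    rel_freq    = {"1.0x Vdd": 1.0,                       # assumed relative frequency
                   "0.7x Vdd": 0.5,
                   "0.5x Vdd": 0.25}

    for label, f in rel_freq.items():
        cores = 1e18 / (full_vdd_gf * 1e9 * f)
        print(f"{label}: ~{cores / 1e6:.0f} million cores per EFLOP")
    # -> roughly 110M, 220M, and 430M cores: slower, lower-voltage cores mean
    #    several times more of them, and more communication between them.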
Slide 22

Road to Unreliability?

From Peta to Exa, the reliability issues compound:

  • 1,000X parallelism: more hardware for something to go wrong; >1,000X intermittent faults due to soft errors
  • Aggressive Vcc scaling to reduce power/energy: gradual faults due to increased variations; more susceptibility to Vcc droops (noise) and to dynamic temperature variations; exacerbates intermittent faults (soft errors)
  • Deeply scaled technologies: aging-related faults; lack of burn-in?; variability increases dramatically

Resiliency will be the cornerstone.

Slide 23

Resiliency

Faults and examples:

  Permanent faults      Stuck-at 0 & 1
  Gradual faults        Variability, temperature
  Intermittent faults   Soft errors, voltage droops
  Aging faults          Degradation

Faults cause errors (data & control):

  • Datapath errors: detected by parity/ECC; silent data corruption needs HW hooks
  • Control errors: control is lost (blue screen)

Minimal overhead for resiliency

[Diagram: resiliency responsibilities (error detection, fault isolation, fault confinement, reconfiguration, recovery & adaptation) layered across circuit & design, microarchitecture, microcode/platform, system software, the programming system, and applications.]

Slide 24

Needs a Paradigm Shift

Evaluate each (old) architecture feature with new priorities

Past and present priorities:
  • Single-thread performance
  • Frequency
  • Programming productivity
  • Legacy, compatibility
  • Architecture features for productivity
  • Constraints: (1) cost, (2) reasonable power/energy

Future priorities:
  • Throughput performance
  • Parallelism
  • Power/energy
  • Architecture features for energy
  • Simplicity
  • Constraints: (1) programming productivity, (2) cost

Slide 25

Summary

Von Neumann computing and CMOS technology (nothing else in sight). Voltage scaling to reduce power and energy:

  • Explodes parallelism
  • Cost of communication vs computation—critical balance
  • Resiliency to combat side-effects and unreliability

A programming system for extreme parallelism.
System software to harmonize all of the above.