Major Challenges to Achieve Exascale Performance
Shekhar Borkar Intel Corp. April 29, 2009
Acknowledgment: Exascale WG sponsored by Dr. Bill Harrod, DARPA (IPTO)
Outline
- Exascale performance goals
- Major challenges
- Potential solutions
[Chart: Peak performance milestones, 1960-2020 — MFLOP, GFLOP, TFLOP, PFLOP, EFLOP, with 12, 11, and 10 years between successive 1000X steps]
[Chart: Relative transistor performance, 1986-2016 (Giga, Tera, Peta), annotated 30X and 250X]
[Chart: Relative energy/op, 1986-2016 (Giga, Tera, Peta), from 5V supply through Vcc scaling]
[Chart: 1986-2016 (Giga, Tera, Peta, Exa), annotated 36X, 4,000X concurrency, and 2.5M X transistor performance]
[Chart: Power, 1986-2016 (Giga, Tera, Peta, Exa), annotated 80X, 4,000X, and 1M X]
[Diagram: Compute, Memory, Communication, Disk]
[Die plot: 80-core chip, 21.72mm x 12.64mm with I/O area; a single 1.5mm x 2.0mm tile contains FPMAC0, FPMAC1, Router, IMEM, DMEM, RF, RIB, CLK, MSINT, and the global clock spine with clock buffers]
80 Core TFLOP Chip
- Process technology: 65nm CMOS
- Transistors: 100 million
- Die area: 275mm2 (3mm2 per tile)
- Package: 1248-pin LGA, 14 layers, 343 signal pins
- Interconnect: 1 poly, 8 metal (Cu)
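The quoted die and tile figures are internally consistent; a quick arithmetic check (all numbers taken from the slide, not new data):

```python
# Floorplan arithmetic for the 80-core TFLOP chip (slide numbers).
die_area = 21.72 * 12.64        # ~274.5 mm^2, quoted as 275mm2
tile_area = 1.5 * 2.0           # 3 mm^2 per tile, as quoted
tiles_total = 80 * tile_area    # 240 mm^2 for all 80 tiles
# The remaining ~35 mm^2 holds the I/O area and global clocking.
print(round(die_area), tile_area, tiles_total)  # 275 3.0 240.0
```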
Technology scaling assumptions:
- Technology (high volume): 45nm (2008), 32nm (2010), 22nm (2012), 16nm (2014), 11nm (2016), 8nm (2018), 5nm (2020)
- Transistor density: 1.75X per generation
- Frequency scaling: 15%, 10%, 8%, 5%, 4%, 3%, 2% per generation
- Vdd scaling
- Dimension & capacitance: 0.75X per generation
- SD leakage scaling/micron: 1X (optimistic) to 1.43X (pessimistic)

65nm Core + Local Memory
- Memory: 0.35MB, 5mm2 (50%)
- DP FP add & multiply, integer core, RF, router: 5mm2 (50%)
- Total: 10mm2, 3GHz, 6GF, 1.8W

8nm Core + Local Memory
- Memory: 0.35MB, 0.17mm2 (50%)
- DP FP add & multiply, integer core, RF, router: 0.17mm2 (50%)
- Total: 0.34mm2 (~0.6mm on a side), 4.6GHz, 9.2GF, 0.24 to 0.46W
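As a sanity check, compounding the per-generation frequency gains from the scaling table (15%, 10%, 8%, 5%, 4%, 3% for the 45nm through 8nm steps) onto the 65nm core reproduces the projected 8nm figures; a minimal sketch:

```python
# Compound the per-generation frequency gains from the scaling table
# (45nm through 8nm: 15%, 10%, 8%, 5%, 4%, 3%) onto the 65nm core.
freq_ghz, perf_gf = 3.0, 6.0   # 65nm starting point from the slide
for gain in (0.15, 0.10, 0.08, 0.05, 0.04, 0.03):
    freq_ghz *= 1 + gain
    perf_gf *= 1 + gain
print(round(freq_ghz, 2), round(perf_gf, 1))  # 4.61 9.2
```

This matches the 4.61 GHz and 9.2 GF quoted for the 8nm core.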
2018, 8nm technology node: 20mm x 20mm (400mm2) module
- Cores/module: 1150
- Total local memory: 400 MB
- Frequency: 4.61 GHz
- Peak performance: 10.6 TF
- Power: 300 - 600W
- Energy efficiency: 34 - 18 GF/Watt
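The module figures follow from the per-core numbers; a back-of-the-envelope check (the slide's 34 GF/Watt at the low end suggests it rounds slightly differently):

```python
# Module-level arithmetic for the 2018 8nm projection (slide numbers).
cores, per_core_gf = 1150, 9.2
peak_tf = cores * per_core_gf / 1000          # ~10.6 TF, as quoted
for watts in (300, 600):
    print(watts, round(peak_tf * 1000 / watts), "GF/W")
# 300W gives ~35 GF/W and 600W gives ~18 GF/W, in line with the
# quoted 34 - 18 GF/Watt range.
```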
[Chart: Chip performance (GFLOPs) vs technology node, 65nm to 5nm]
[Chart: Chip power (W) vs technology node, 65nm to 5nm]
Memory for the 8nm module (256GB/s, 64b):
- Peak performance: 10.6 TF
- Total DRAM capacity: 512GB
- Total DRAM BW: 1TB/s (0.1B/FLOP)
- DRAM power: 800 W*
- Total power: 1100 - 1400W
- Energy efficiency: 9.5 - 8 GF/Watt

*Assumes 5% Vdd scaling each technology generation, 140 pJ energy consumed per accessed bit
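At the footnote's 140 pJ per accessed bit, 1 TB/s of DRAM bandwidth works out to roughly 1.1 kW; the slide's 800 W figure is lower, presumably because the assumed 5% per-generation Vdd scaling reduces the per-bit energy by 2018. A sketch of the raw arithmetic:

```python
# DRAM power at the quoted bandwidth and per-bit energy (slide numbers).
bw_bytes_per_s = 1e12          # 1 TB/s total DRAM bandwidth
pj_per_bit = 140               # energy per accessed bit (footnote)
power_w = bw_bytes_per_s * 8 * pj_per_bit * 1e-12
print(round(power_w))  # 1120 W before any Vdd-scaling benefit
```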
20mm x 20mm die across technology nodes: 45nm - 70 cores, 32nm - 123 cores, 22nm - 214 cores, 16nm - 375 cores
[Chart: Chip power (W) vs technology node, 65nm to 5nm, split into Network and Compute]
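The core counts track the table's 1.75X transistor-density scaling per generation; a minimal sketch starting from the 45nm point:

```python
# Cores on a 20mm x 20mm die, compounding 1.75X density per generation.
cores = 70.0   # 45nm starting point from the slide
for node in ("32nm", "22nm", "16nm"):
    cores *= 1.75
    print(node, int(cores + 0.5))   # 123, 214, 375 - matching the slide
# Two more generations (11nm, 8nm) give ~1149 cores, in line with the
# 1150 cores/module quoted for the 2018 8nm module.
```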
[Chart: Router delay (65nm, 3GHz)]
[Diagram: 2nd level bus]
Conventional DRAM (RAS/CAS, page-based):
- Activates many pages
- Lots of reads and writes (refresh)
- Small amount of read data is used
- Requires small number of pins

Proposed DRAM (address-based):
- Activates few pages
- Read and write (refresh) only what is needed
- All read data is used
- Requires large number of IO's (3D)

Energy cost today: ~175 pJ/bit
[Diagram: signaling, DRAM array, control]
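The ~175 pJ/bit figure is the crux: scaled to an exaflop machine at the 0.1 B/FLOP bandwidth ratio used earlier in the deck, today's DRAM access energy alone would consume on the order of 100 MW. A sketch of that arithmetic:

```python
# System-level DRAM access power at today's energy cost (slide numbers).
flops = 1e18               # one exaflop
bytes_per_flop = 0.1       # bandwidth ratio quoted earlier (0.1B/FLOP)
pj_per_bit = 175           # today's DRAM energy cost per bit
power_mw = flops * bytes_per_flop * 8 * pj_per_bit * 1e-12 / 1e6
print(round(power_mw))  # ~140 MW for memory bandwidth alone
```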
[Chart: Million cores/EFLOP vs technology node, 65nm to 5nm, at 1x, 0.7x, and 0.5x Vdd]
- Almost flat because Vdd is close to Vt
- 4X increase in the number of cores (parallelism)
- Increased communication and related energy
- Increased hardware, and unreliability
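At full Vdd the ~100M-core floor of the chart follows directly from per-core throughput; as Vdd drops toward Vt, per-core performance falls and the count climbs toward the chart's upper curves. A minimal sketch, assuming the 8nm core's 9.2 GF as the full-Vdd operating point:

```python
# Cores needed for one EFLOP at full Vdd (9.2 GF/core from the slide).
eflop = 1e18
per_core_flops = 9.2e9
millions_of_cores = eflop / per_core_flops / 1e6
print(round(millions_of_cores))  # ~109 million cores at 1x Vdd
```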
From Peta to Exa: Reliability Issues
- 1,000X parallelism: more hardware for something to go wrong; >1,000X intermittent faults due to soft errors
- Aggressive Vcc scaling to reduce power/energy: gradual faults due to increased variations; more susceptible to Vcc droops (noise) and dynamic temperature variations; exacerbates intermittent faults (soft errors)
- Deeply scaled technologies: aging-related faults; lack of burn-in?; variability increases dramatically
Faults Example
- Permanent faults: stuck-at 0 & 1
- Gradual faults: variability, temperature
- Intermittent faults: soft errors, voltage droops
- Aging faults: degradation

Faults cause errors (data & control):
- Datapath errors: detected by parity/ECC; silent data corruption needs HW hooks
- Control errors: control lost (blue screen)
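As a toy illustration of the datapath-error detection mentioned above, a single parity bit catches any single-bit flip in a stored word (real designs use ECC such as SECDED to also correct it); a minimal sketch:

```python
def parity(word: int) -> int:
    """Even-parity bit over a word: 1 if the popcount is odd."""
    return bin(word).count("1") & 1

stored = 0xDEADBEEF
check = parity(stored)              # parity bit kept alongside the data
corrupted = stored ^ (1 << 17)      # a soft error flips one bit
assert parity(corrupted) != check   # the single-bit error is detected
# A double-bit flip would escape parity - hence ECC and HW hooks
# against silent data corruption.
```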
Minimal overhead for resiliency, spanning the whole stack:
- Circuit & design
- Microarchitecture
- Microcode, platform
- Programming system
- Applications
- System software
Single-thread performance focus:
- Frequency
- Programming productivity: legacy, compatibility; architecture features for productivity
- Constraints: (1) cost, (2) reasonable power/energy

Throughput performance focus:
- Parallelism
- Power/energy: architecture features for energy; simplicity
- Constraints: (1) programming productivity, (2) cost