A look ahead: Echelon Talk contents [13 slides] 1. The Echelon - PowerPoint PPT Presentation

A look ahead: Echelon

Talk contents [13 slides]

1. The Echelon system [4].
2. The challenge of power consumption in Echelon [9].

I. Introduction

Companies involved in the project

System sketch

Thread count estimation

| | 2010: 4640 GPUs (32*145 Fermi GPUs) | 2018: 90K GPUs (based on an Echelon system) |
|---|---|---|
| Threads/SM | 1 536 | ~ 1 000 |
| Threads/GPU (16x) | 24 576 | ~ 100 000 |
| Threads/Cabinet (32x) | 786 432 | ~ 10 000 000 |
| Threads/Machine (145x) | ~ 100 000 000 | ~ 10 000 000 000 |
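The 2010 figures follow from multiplying the per-SM thread count by the factors shown on the slide (16x, 32x, 145x). A minimal Python sketch of that arithmetic; the variable names are illustrative and not part of the talk:

```python
# Thread count for the 2010 machine, using the multipliers from the slide.
THREADS_PER_SM = 1536     # threads per Fermi SM (slide figure)
SMS_PER_GPU = 16          # the "16x" factor
GPUS_PER_CABINET = 32     # the "32x" factor
CABINETS = 145            # the "145x" factor

threads_per_gpu = THREADS_PER_SM * SMS_PER_GPU
threads_per_cabinet = threads_per_gpu * GPUS_PER_CABINET
threads_per_machine = threads_per_cabinet * CABINETS
total_gpus = GPUS_PER_CABINET * CABINETS

print(threads_per_gpu)      # 24576
print(threads_per_cabinet)  # 786432
print(threads_per_machine)  # 114032640, rounded on the slide to ~ 100 000 000
print(total_gpus)           # 4640
```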

How to get to billions of threads

Programming System:
- Programmer expresses all of the concurrency.
- Programming system decides how much to deploy in space and how much to iterate in time.

Architecture:
- Fast, low overhead thread array creation and management.
- Fast, low overhead communication and synchronization.
- Message-driven computing (active messages).
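The split between deploying in space and iterating in time can be pictured with a small sketch. This is a sequential Python stand-in written for illustration, not the Echelon programming system; `plan_execution`, `run` and the lane count are hypothetical names and values:

```python
import math

def plan_execution(total_work_items, hardware_lanes):
    """Split the expressed concurrency into lanes used at once (space)
    and sequential passes over those lanes (time)."""
    lanes_used = min(total_work_items, hardware_lanes)
    passes = math.ceil(total_work_items / lanes_used) if lanes_used else 0
    return lanes_used, passes

def run(work, items, hardware_lanes):
    """Apply `work` to every item following the space/time plan."""
    lanes_used, passes = plan_execution(len(items), hardware_lanes)
    results = [None] * len(items)
    for p in range(passes):              # iterate in time
        for lane in range(lanes_used):   # deploy in space (simulated sequentially)
            i = p * lanes_used + lane
            if i < len(items):
                results[i] = work(items[i])
    return results

print(plan_execution(10, 4))                      # (4, 3)
print(run(lambda x: x * x, list(range(10)), 4))
```

The programmer only supplies `work` and `items`; the number of lanes and passes is decided by the planning step.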

II. The challenge of power consumption in Echelon

Power consumption in typical computers

The high cost of data movement: Fetching operands costs more than computing on them

As of 2012, for chips in silicon, we have:
- Manufacturing process: 28 nm.
- Square chip: 20 x 20 mm.
- 64-bit DP operation: 20 pJ.
- 256-bit access on an 8 KB SRAM cache: 50 pJ.
- 256-bit buses: 26 pJ, 256 pJ and 1 nJ, growing with the on-chip distance covered.
- Efficient off-chip link: 500 pJ.
- DRAM read/write: 16 nJ (16,000 pJ).
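To make the comparison concrete, the slide's figures can be expressed as multiples of one 64-bit double-precision operation. A short Python sketch using only the numbers above; the dictionary keys are descriptive labels chosen here:

```python
# Energy figures from the slide, in picojoules (28 nm, 2012).
DP_OP_PJ = 20.0

costs_pj = {
    "256-bit access to 8 KB SRAM": 50.0,
    "efficient off-chip link": 500.0,
    "DRAM read/write": 16_000.0,
}

# How many double-precision operations cost the same energy as one transfer.
ratios = {name: pj / DP_OP_PJ for name, pj in costs_pj.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:g}x the energy of one 64-bit DP operation")
```

With these numbers, a single DRAM access costs as much energy as 800 double-precision operations.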

Addressing the power challenge

Locality and its role in power consumption:
- Bulk of data must be accessed from register file (2 pJ), not across the chip (integrated cache, 150 pJ), off-chip (external cache, 300 pJ), or across the system (DRAM memory, 1000 pJ).
- Application, programming system and architecture must work together to exploit locality.

Overhead:
- Bulk of execution energy must go to carrying out the operation, not scheduling instructions (where 100x is consumed today).

Optimizations:
- At all levels of the memory hierarchy to operate efficiently.
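The effect of locality can be sketched as a weighted average of the per-level energies quoted above. The access fractions below are invented for illustration only (they are not measurements from the talk), and `average_access_energy_pj` is a hypothetical helper:

```python
# Energy per access at each level, in picojoules (figures from the slide).
ENERGY_PJ = {
    "register file": 2.0,
    "on-chip cache": 150.0,
    "off-chip cache": 300.0,
    "DRAM": 1000.0,
}

def average_access_energy_pj(fractions):
    """Average energy per operand, given the fraction of accesses served by each level."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(ENERGY_PJ[level] * share for level, share in fractions.items())

# Hypothetical access mixes, chosen only to contrast good and poor locality.
good_locality = {"register file": 0.90, "on-chip cache": 0.07,
                 "off-chip cache": 0.02, "DRAM": 0.01}
poor_locality = {"register file": 0.50, "on-chip cache": 0.30,
                 "off-chip cache": 0.10, "DRAM": 0.10}

print(average_access_energy_pj(good_locality))
print(average_access_energy_pj(poor_locality))
```

Under these assumed mixes the average cost per operand differs by roughly a factor of six, even though both mixes serve at least half of the accesses from registers.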

Power consumption within a GPU

| Manufacturing process (and year) | 40 nm (2010) | 10 nm (estim. 2017) | 10 nm (estim. 2017) |
|---|---|---|---|
| User platform | Desktop | Desktop | Laptop |
| Vdd (nominal) | 0.9 V | 0.75 V | 0.65 V |
| Target frequency | 1.6 GHz | 2.5 GHz | 2 GHz |
| Energy for a madd in double precision | 50 pJ | 8.7 pJ | 6.5 pJ |
| Energy for an add with integer data | 0.5 pJ | 0.07 pJ | 0.05 pJ |
| 64-bit read from 8 KB SRAM | 14 pJ | 2.4 pJ | 1.8 pJ |
| Wire energy (per transition) | 240 fJ/bit/mm | 150 fJ/bit/mm | 115 fJ/bit/mm |
| Wire energy (256 bits, distance of 10 mm) | 310 pJ | 200 pJ | 150 pJ |

Communications take the bulk of power consumption. And instruction scheduling in an out-of-order CPU is even worse, spending 2000 pJ for each instruction (either integer or floating-point).
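The two wire-energy rows can be related by a simple product: energy per transition, times bits, times distance, times the fraction of bits that toggle. The 0.5 activity factor below is an assumption made here because it brings the result close to the quoted 256-bit figures; the slides do not state it. `wire_energy_pj` is a hypothetical helper:

```python
def wire_energy_pj(fj_per_bit_mm, bits=256, distance_mm=10, activity=0.5):
    """Energy in pJ to move `bits` over `distance_mm`.

    `activity` is the assumed fraction of bits that make a transition.
    """
    femtojoules = fj_per_bit_mm * bits * distance_mm * activity
    return femtojoules / 1000.0  # 1 pJ = 1000 fJ

for label, per_transition in [("40 nm desktop", 240),
                              ("10 nm desktop", 150),
                              ("10 nm laptop", 115)]:
    print(label, wire_energy_pj(per_transition), "pJ")
```

This gives about 307, 192 and 147 pJ, close to the 310, 200 and 150 pJ quoted for the three configurations.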

Scaling makes locality even more important: Power consumption within VRAM

| Manufacturing process (and year) | 45 nm (2010) | 16 nm (estimated for 2017) |
|---|---|---|
| DRAM interface pin bandwidth | 4 Gbps | 50 Gbps |
| DRAM interface energy | 20-30 pJ/bit | 2 pJ/bit |
| DRAM access energy | 8-15 pJ/bit | 2.5 pJ/bit |
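Multiplying pin bandwidth by energy per bit gives the interface power drawn per pin when the pin runs at full rate. A short Python sketch of that unit conversion; `pin_power_mw` is a hypothetical helper, and sustained full-rate operation is an assumption:

```python
def pin_power_mw(bandwidth_gbps, energy_pj_per_bit):
    """Power per pin in milliwatts: (1e9 bit/s) * (1e-12 J/bit) = 1e-3 W."""
    return bandwidth_gbps * energy_pj_per_bit

# 45 nm (2010): 4 Gbps at 20-30 pJ/bit
low_2010 = pin_power_mw(4, 20)
high_2010 = pin_power_mw(4, 30)
# 16 nm (estimated for 2017): 50 Gbps at 2 pJ/bit
est_2017 = pin_power_mw(50, 2)

print(low_2010, high_2010, est_2017)  # 80 120 100
```

Under that assumption the per-pin interface power stays in the same range (about 100 mW) while the bandwidth grows more than tenfold.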

Projections for power consumption in CPUs and GPUs (in picojoules)

| | CPU in 2010 | CPU in 2015 | GPU in 2015 | Echelon's goal by maximizing locality |
|---|---|---|---|---|
| Instruction scheduling | 2000 | 560 | 3 | 3 |
| Access to on-chip cache | 75 | 37.5 | 37.5 | 10.5 |
| Access to off-chip cache | 100 | 15 | 15 | 9 |
| Arithmetic operation (average cost) | 25 | 3 | 3 | 3 |
| Local access to register file | 14 | 2.1 | 2.1 | 2.7 |
| TOTAL | 2214 | 617.6 | 60.6 | 28.2 |
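The totals can be reproduced by summing each column, which also shows how much of each budget goes to instruction scheduling. A minimal Python sketch using the figures from the slide; the data layout is chosen here for illustration:

```python
columns = ["CPU in 2010", "CPU in 2015", "GPU in 2015", "Echelon's goal"]

# Energy per component in picojoules, one value per column.
rows = {
    "Instruction scheduling":      [2000, 560, 3, 3],
    "Access to on-chip cache":     [75, 37.5, 37.5, 10.5],
    "Access to off-chip cache":    [100, 15, 15, 9],
    "Arithmetic operation":        [25, 3, 3, 3],
    "Local access to register file": [14, 2.1, 2.1, 2.7],
}

# Sum each column; rounding hides floating-point noise in the last digit.
totals = [round(sum(col), 1) for col in zip(*rows.values())]

# Share of each total spent on instruction scheduling, in percent.
scheduling_share = [round(100 * s / t, 1)
                    for s, t in zip(rows["Instruction scheduling"], totals)]

for name, total, share in zip(columns, totals, scheduling_share):
    print(f"{name}: {total} pJ total, {share}% on instruction scheduling")
```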

Basic power guidelines at different levels

The bulk of the power is consumed by data movement rather than operations.

Therefore, algorithms should be designed to perform more work per unit data movement:
- Performing more operations as long as they save transfers.
- Recomputing values instead of fetching them.

Programming systems should further optimize this data movement:
- Using techniques such as blocking and tiling.
- Being aware of the energy cost for each instruction.

Architectures should provide:
- A memory hierarchy exposed to the programmer.
- Efficient mechanisms for communication.
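As an illustration of blocking and tiling, here is a minimal pure-Python sketch of a tiled matrix multiplication. It is not code from the talk; `matmul_tiled` and the tile size are illustrative, and on real hardware the tile would be sized so that the working set fits in registers or a near cache:

```python
def matmul_tiled(a, b, tile=2):
    """Multiply matrices a (n x k) and b (k x m) one tile at a time.

    Each (i0, j0, k0) step touches only a tile-by-tile block of a, b and c,
    so the same operands are reused while they are still close by.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

def matmul_naive(a, b):
    """Reference implementation used to check the tiled version."""
    return [[sum(a[i][kk] * b[kk][j] for kk in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(matmul_tiled(a, b) == matmul_naive(a, b))  # True
```

The arithmetic is identical to the naive version; only the order of the accesses changes.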

A basic idea to optimize power consumption in GPUs: Temporal SIMT

Existing SIMT (Single Instruction Multiple Thread) amortizes instruction fetch across multiple threads, but:
- Performs poorly (and energy inefficiently) when threads diverge.
- Executes redundant instructions that are common across threads.

Solution: Temporal SIMT.
- Execute threads in thread block in sequence on a single lane, which amortizes fetch.
- Shared registers for common values, which amortizes execution.
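The divergence problem can be pictured with a toy count of lane slots. This is a deliberately simplified model written for illustration, not a description of the actual hardware: it assumes that a spatial SIMT warp issues every path taken by any of its threads across all lanes (with inactive lanes masked but still occupied), while in temporal SIMT each thread runs only its own path on a single lane. Function names and path lengths are hypothetical:

```python
def spatial_simt_lane_slots(paths, path_lengths):
    """Lane slots occupied by one warp when lanes run in lockstep.

    Every distinct path is issued once across all lanes of the warp.
    """
    warp_width = len(paths)
    return warp_width * sum(path_lengths[p] for p in set(paths))

def temporal_simt_lane_slots(paths, path_lengths):
    """Lane slots when threads run one after another on a single lane."""
    return sum(path_lengths[p] for p in paths)

# Hypothetical warp of four threads; one thread takes the "else" branch.
path_lengths = {"then": 10, "else": 6}   # instructions per path (made up)
paths = ["then", "else", "then", "then"]

spatial = spatial_simt_lane_slots(paths, path_lengths)
temporal = temporal_simt_lane_slots(paths, path_lengths)
print(spatial, temporal)  # 64 36
```

In this model the two counts are equal when all threads follow the same path, and the gap grows with divergence.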

Power consumption on Nvidia's roadmap

[Chart: GFLOPS in double precision for each watt consumed (axis from 2 to 16) versus year (2008 to 2014), for the Tesla, Fermi, Kepler and Maxwell generations.]
