Memory Hierarchies in the Era of Specialization - Sarita Adve (PowerPoint Presentation)


  1. Coherence, Consistency, and Déjà vu: Memory Hierarchies in the Era of Specialization
Sarita Adve, University of Illinois at Urbana-Champaign, sadve@Illinois.edu
w/ Johnathan Alsop, Rakesh Komuravelli, Gio Salvador, Matt Sinclair, Hyojin Sung, and numerous colleagues and students over > 25 years
This work was supported in part by the NSF and by C-FAR, one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.

  2. Silver Bullets for the End of Moore's Law?
• Parallelism
• Specialization
• BUT both impact software, hardware, and the hardware-software interface
• This talk: memory hierarchy for specialized, parallel systems
  – Global address space, coherence, consistency
• But first …

  3. My Story (1988 → 2017)
• 1988 to 1989: What is a memory consistency model?
  – Simplest model: sequential consistency (SC) [Lamport79]
    • Memory operations execute one at a time in program order
    • Simple, but inefficient
  – Implementation/performance-centric view: order in which memory operations execute
    • Different vendors w/ different models (orderings): Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray, …
    • Many ambiguities due to complexity, by design(?), …
[Diagram: program-order pairs (LD/LD, ST/ST, LD/ST, ST/LD) that may be reordered unless separated by a Fence]
• Memory model = What value can a read return?
• HW/SW interface: affects performance, programmability, portability
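
To make "what value can a read return?" concrete, here is a minimal C++ store-buffering litmus test (my illustration, not from the talk): under sequential consistency at least one thread must observe the other's store, but with relaxed orderings, as on real hardware, both loads may return 0.

```cpp
// Store-buffering litmus test: what value can a read return?
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

int main() {
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);   // ST x
        r1 = y.load(std::memory_order_relaxed);  // LD y
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);   // ST y
        r2 = x.load(std::memory_order_relaxed);  // LD x
    });
    t1.join();
    t2.join();
    // SC forbids (r1, r2) == (0, 0): some store must execute first.
    // Relaxed orderings (or hardware store buffers) can produce (0, 0);
    // replacing memory_order_relaxed with memory_order_seq_cst rules it out.
    printf("r1=%d r2=%d\n", r1, r2);
}
```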

  4. My Story (1988 → 2017)
• 1988 to 1989: What is a memory model?
  – What value can a read return?
• 1990s: Software-centric view: data-race-free (DRF) model [ISCA90, …]
  – Sequential consistency for data-race-free programs
• 2000-08: Java, C++, … memory models [POPL05, PLDI08, CACM10]
  – DRF model + big mess (but after 20 years, convergence at last)
• 2008-14: Software-centric view for coherence: DeNovo protocol
  – More performance-, energy-, and complexity-efficient than MESI [PACT12, ASPLOS14, ASPLOS15]
• 2014-: Déjà vu: heterogeneous systems [ISCA15, Micro15, ISCA17]
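
The DRF guarantee can be seen in a short C++ sketch (my illustration, not from the talk): once every synchronization access is explicitly distinguished (here via std::atomic), the program has no data races, and the language guarantees sequentially consistent behavior.

```cpp
// DRF in practice: label synchronization explicitly, get SC semantics.
#include <atomic>
#include <cstdio>
#include <thread>

int data = 0;                   // ordinary data: races on it are forbidden
std::atomic<bool> flag{false};  // explicitly distinguished synchronization

int main() {
    std::thread producer([] {
        data = 42;              // ordinary write
        flag.store(true);       // synchronization: publish the data
    });
    std::thread consumer([] {
        while (!flag.load()) {} // synchronization: wait for the data
        printf("%d\n", data);   // race-free, guaranteed to print 42
    });
    producer.join();
    consumer.join();
}
```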

  5. Traditional Heterogeneous SoC Memory Hierarchies
• Loosely coupled memory hierarchies
  – Local memories don't communicate with each other
  – Unnecessary data movement
[Diagram: SoC with CPUs (L1s, shared L2), GPU, modem, vector units, A/V HW accelerators, GPS, multimedia, and DSPs, each with local caches/memories, connected to main memory over an interconnect]
A tightly coupled memory hierarchy is needed

  6. Tightly Coupled SoC Memory Hierarchies
• Tightly coupled memory hierarchies: unified address space
  – Entering the mainstream, especially for CPU-GPU
  – Accelerator can access the CPU's data using the same address
[Diagram: CPU and accelerator, each with a private cache, sharing banked L2 over an interconnection network]
Inefficient coherence and consistency
Specialized private memories still used for efficiency

  7. Memory Hierarchies for Heterogeneous SoC: Programmability + Efficiency
• DeNovoA: Efficient coherence, simple DRF consistency [MICRO '15, Top Picks '16 Honorable Mention]
• DRFrlx: SC-centric semantics for relaxed atomics [ISCA '17]
• Stash: Integrate specialized memories in global address space [ISCA '15, Top Picks '16 Honorable Mention]
• Spandex: Integrate accelerators/CPUs w/ different protocols
• Dynamic load balancing and work stealing w/ accelerators

  8. [Outline slide repeated]

  9. [Outline slide repeated] Focus: CPU-GPU systems with caches and scratchpads

  10. CPU Coherence: MSI
[Diagram: two CPUs with private L1 caches over banked L2 with directory, connected by an interconnection network]
• Single writer, multiple reader
  – On write miss, get ownership + invalidate all sharers
  – On read miss, add to sharer list
⇒ Directory to store sharer list
⇒ Many transient states
⇒ Excessive traffic, indirection
Complex + inefficient
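
To see where the directory state and invalidation traffic come from, here is a minimal C++ sketch of the directory's actions on read and write misses (my illustration, not from the talk; all names are hypothetical):

```cpp
// Toy directory for an MSI protocol: tracks sharers per line and
// invalidates them when a writer requests ownership.
#include <cstddef>
#include <set>
#include <vector>

enum class State { Invalid, Shared, Modified };

struct DirEntry {
    State state = State::Invalid;
    std::set<int> sharers;  // cores holding a copy (the directory's cost)
};

struct Directory {
    std::vector<DirEntry> lines;
    explicit Directory(std::size_t n) : lines(n) {}

    // Read miss: record the new sharer.
    void readMiss(std::size_t line, int core) {
        lines[line].sharers.insert(core);
        if (lines[line].state == State::Invalid)
            lines[line].state = State::Shared;
        // If another core holds the line Modified, the real protocol first
        // forwards the request to that owner (indirection) and downgrades it.
    }

    // Write miss: invalidate all other sharers, then grant ownership.
    void writeMiss(std::size_t line, int core) {
        for (int s : lines[line].sharers)
            if (s != core)
                sendInvalidation(s, line);  // invalidation + ack traffic
        lines[line].sharers = {core};
        lines[line].state = State::Modified;
    }

    void sendInvalidation(int core, std::size_t line) { /* network message */ }
};

int main() {
    Directory dir(64);
    dir.readMiss(0, 1);   // core 1 caches line 0 in Shared
    dir.writeMiss(0, 2);  // core 2's write invalidates core 1's copy
}
```

(A full protocol also needs many transient states to cover in-flight requests and acks; they are omitted here, and they are exactly the complexity the slide points to.)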

  11. GPU Coherence with DRF
[Diagram: GPU and CPU caches over banked L2; at synchronization, the GPU cache flushes dirty data and invalidates all valid data]
• With data-race-free (DRF) memory model
  – No data races; synchs must be explicitly distinguished
  – At all synch points
    • Flush all dirty data: unnecessary writethroughs
    • Invalidate all data: can't reuse data across synch points
  – Synchronization accesses must go to the last-level cache (LLC)
Simple, but inefficient at synchronization
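
A minimal C++ sketch of these per-synch actions (my illustration, not from the talk; names are hypothetical):

```cpp
// Toy model of conventional GPU coherence under DRF: every synch point
// writes through all dirty lines and invalidates the whole L1, so no
// data survives across synchronization.
#include <cstdint>
#include <unordered_map>

enum class LineState { Valid, Dirty };

struct GpuL1 {
    std::unordered_map<std::uint64_t, LineState> lines;

    void syncPoint() {
        for (auto& [addr, st] : lines)
            if (st == LineState::Dirty)
                writeThroughToLLC(addr);  // unnecessary writethroughs
        lines.clear();                    // invalidate ALL data
    }

    void writeThroughToLLC(std::uint64_t addr) { /* send to L2/LLC */ }
};

int main() {
    GpuL1 l1;
    l1.lines[0x40] = LineState::Dirty;  // written by this compute unit
    l1.lines[0x80] = LineState::Valid;  // read-only copy
    l1.syncPoint();                     // both lines are gone afterwards
}
```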

  12. [Slide 11 repeated without the diagram]

  13. GPU Coherence with HRF
• With heterogeneous-race-free (HRF) memory model [ASPLOS '14]
  – No heterogeneous races; synchs and their scopes must be explicitly distinguished
  – At all globally scoped synch points
    • Flush all dirty data: unnecessary writethroughs
    • Invalidate all data: can't reuse data across synch points
  – Globally scoped synchronization accesses must go to the last-level cache (LLC)
  – No overhead for locally scoped synchs
• But higher programming complexity
• Adopted for HSA, OpenCL, CUDA
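
Extending the previous sketch with HRF-style scopes (my illustration, not from the talk; the scope names are hypothetical): a locally scoped synch is visible only within the compute unit, so it can skip the flush and invalidation entirely, while a globally scoped synch still pays the full cost.

```cpp
// Toy HRF model: the programmer must pick a scope for every synch;
// choosing Local is cheap but wrong if another CU needs the data.
#include <cstdint>
#include <unordered_map>

enum class Scope { Local, Global };
enum class LineState { Valid, Dirty };

struct GpuL1 {
    std::unordered_map<std::uint64_t, LineState> lines;

    void syncPoint(Scope scope) {
        if (scope == Scope::Local)
            return;                       // no overhead for local synchs
        for (auto& [addr, st] : lines)
            if (st == LineState::Dirty)
                writeThroughToLLC(addr);  // global synch: flush dirty data
        lines.clear();                    // ... and invalidate everything
    }

    void writeThroughToLLC(std::uint64_t addr) { /* send to L2/LLC */ }
};

int main() {
    GpuL1 l1;
    l1.lines[0x40] = LineState::Dirty;
    l1.syncPoint(Scope::Local);   // cheap: cache untouched
    l1.syncPoint(Scope::Global);  // expensive: flush + invalidate all
}
```

The efficiency comes at the price the slide names: correctness now depends on the programmer choosing the right scope for each synchronization, which is the added programming complexity.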

  14. Modern GPU Coherence & Consistency

             Consistency                            Coherence
  De facto   Data-race-free (DRF): simple           High overhead on synchs: inefficient
  Recent     Heterogeneous-race-free (HRF),         No overhead for local synchs:
             scoped synchronization: complex        efficient for local synch

Do GPU models (HRF) need to be more complex than CPU models (DRF)?
NO! Not if coherence is done right!
DeNovoA+DRF: efficient AND simpler memory model

  15. A Classification of Coherence Protocols
• Read hit: don't return stale data
• Read miss: find one up-to-date copy

                              Invalidator
  Track up-to-date copy       Writer        Reader
  Ownership                   MESI          DeNovo
  Writethrough                –             GPU

• Reader-initiated invalidations
  – No invalidation or ack traffic, directories, transient states
• Obtaining ownership for written data
  – Reuse owned data across synchs (not flushed at synch points)

  16. DeNovo Coherence with DRF
[Diagram: GPU and CPU caches over banked L2; at synchronization, the GPU cache obtains ownership for dirty data and invalidates only non-owned data]
• With data-race-free (DRF) memory model
  – No data races; synchs must be explicitly distinguished
  – At all synch points
    • Obtain ownership for dirty data instead of flushing it: can reuse owned data
    • Invalidate all non-owned data instead of all data
  – Synchronization accesses can be at L1 instead of the LLC
• 3% state overhead vs. GPU coherence + HRF
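
Contrasting with the GPU+DRF sketch above, here is a minimal C++ sketch of DeNovo-style actions at a synch point (my illustration, not from the talk; names are hypothetical): dirty lines are registered as owned rather than written through, and only non-owned lines are invalidated, so owned data survives across synchronization.

```cpp
// Toy DeNovo model: ownership replaces flushing, and invalidation is
// reader-initiated and selective (non-owned lines only).
#include <cstdint>
#include <unordered_map>

enum class LineState { Valid, Dirty, Owned };

struct DeNovoL1 {
    std::unordered_map<std::uint64_t, LineState> lines;

    void syncPoint() {
        for (auto it = lines.begin(); it != lines.end(); ) {
            switch (it->second) {
            case LineState::Dirty:
                registerOwnership(it->first);   // one message to the LLC
                it->second = LineState::Owned;  // data stays in L1
                ++it;
                break;
            case LineState::Owned:
                ++it;                           // reusable across synchs
                break;
            case LineState::Valid:
                it = lines.erase(it);           // invalidate non-owned only
                break;
            }
        }
    }

    void registerOwnership(std::uint64_t addr) { /* tell LLC we own it */ }
};

int main() {
    DeNovoL1 l1;
    l1.lines[0x40] = LineState::Dirty;  // written by this compute unit
    l1.lines[0x80] = LineState::Valid;  // read-only copy
    l1.syncPoint();  // 0x40 stays (Owned); only 0x80 is invalidated
}
```

Because readers invalidate their own stale copies, no sharer lists, invalidation acks, or transient states are needed; the only extra state tracks ownership, consistent with the small overhead the slide reports.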

  17. Evaluation Methodology
• 1 CPU core + 15 GPU compute units (CUs)
  – Each node has a private L1, a scratchpad, and a tile of the shared L2
• Simulation environment: GEMS, Simics, GPGPU-Sim, GPUWattch, McPAT
• Workloads
  – Rodinia, Parboil: no fine-grained synch
  – UC Davis microbenchmarks + UTS from the HRF paper
  – Graph analytics workloads

  18. Key Evaluation Questions
1. Streaming apps without fine-grained synch
   – How does DeNovo+DRF compare to GPU+DRF?
   ⇒ DeNovo does not hurt performance for streaming apps
2. Apps with locally scoped fine-grained synch
   – How does DeNovo+DRF compare to GPU+HRF?
   ⇒ HRF's complexity not needed
3. Apps with globally scoped fine-grained synch
   – How does DeNovo+DRF compare to GPU+HRF?

  19. Global Synch – Execution Time
[Bar chart: normalized execution time for GPU+HRF and DeNovo+DRF configurations across FAM, SLM, SPM, SPMBO, and the average]
DeNovo has 28% lower execution time than GPU with global synch

  20. Global Synch – Energy
[Bar chart: normalized energy, broken down into network, L2 $, L1 D$, scratchpad, and GPU core+, for GPU+HRF and DeNovo+DRF configurations across FAM, SLM, SPM, SPMBO, and the average]
DeNovo has 51% lower energy than GPU with global synch

  21. Key Evaluation Questions
1. Streaming apps without fine-grained synch
   – How does DeNovo+DRF compare to GPU+DRF?
   ⇒ DeNovo does not hurt performance for streaming apps
2. Apps with locally scoped fine-grained synch
   – How does DeNovo+DRF compare to GPU+HRF?
   ⇒ HRF's complexity not needed for local scope
3. Apps with globally scoped fine-grained synch
   – How does DeNovo+DRF compare to GPU+HRF?
   ⇒ DeNovo+DRF provides much better performance and energy
DeNovo+DRF = efficient and simple
Impact: AMD/HRF authors agree, refined DeNovo [Micro16]
But … déjà vu!

  22. [Outline slide repeated]
