Memory Hierarchies in the Era of Specialization – Sarita Adve – PowerPoint PPT Presentation



SLIDE 1

Coherence, Consistency, and Déjà vu: Memory Hierarchies in the Era of Specialization

Sarita Adve

University of Illinois at Urbana-Champaign sadve@Illinois.edu w/ Johnathan Alsop, Rakesh Komuravelli, Gio Salvador, Matt Sinclair, Hyojin Sung and numerous colleagues and students over > 25 years

This work was supported in part by the NSF and by C-FAR, one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.

SLIDE 2

Silver Bullets for End of Moore’s Law?

Parallelism, Specialization

BUT they impact software, hardware, and the hardware-software interface

This talk: Memory hierarchy for specialized, parallel systems
– Global address space, coherence, consistency

But first …

SLIDE 3
  • 1988 to 1989: What is a memory consistency model?

– Simplest model: sequential consistency (SC) [Lamport79]

  • Memory operations execute one at a time in program order
  • Simple, but inefficient

– Implementation/performance-centric view

  • Order in which memory operations execute
  • Different vendors w/ different models (orderings)

– Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray, …

  • Many ambiguities due to complexity, by design(?), …

Memory model = What value can a read return?
HW/SW interface: affects performance, programmability, portability

My Story (1988 → 2017)

[Figure: loads and stores reordered around a fence]
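The "what value can a read return" question can be made concrete with a litmus test. The sketch below (an illustrative example, not from the talk) enumerates every interleaving allowed by sequential consistency for the classic store-buffering test and shows why SC forbids the outcome r1 = r2 = 0:

```python
from itertools import permutations

# Store-buffering litmus test (illustrative, not from the talk):
#   Thread 1: x = 1; r1 = y        Thread 2: y = 1; r2 = x
# Under sequential consistency (SC), memory operations execute one at a
# time in each thread's program order, so r1 == r2 == 0 is impossible.

def sc_outcomes():
    # Op 0 must precede op 1 (thread 1); op 2 must precede op 3 (thread 2).
    ops = [("st", "x"), ("ld", "y", "r1"), ("st", "y"), ("ld", "x", "r2")]
    outcomes = set()
    for seq in permutations(range(4)):
        if seq.index(0) > seq.index(1) or seq.index(2) > seq.index(3):
            continue  # violates some thread's program order
        mem = {"x": 0, "y": 0}
        regs = {}
        for i in seq:
            if ops[i][0] == "st":
                mem[ops[i][1]] = 1
            else:
                regs[ops[i][2]] = mem[ops[i][1]]
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes
```

The outcome (0, 0) would require each load to execute before the other thread's store while following its own thread's store, which no single interleaving permits; relaxed vendor models of the era did allow it.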

SLIDE 4
  • 1988 to 1989: What is a memory model?

– What value can a read return?

  • 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

  • 2000-08: Java, C++, … memory model [POPL05, PLDI08, CACM10]

– DRF model + big mess (but after 20 years, convergence at last)

  • 2008-14: Software-centric view for coherence: DeNovo protocol

– More performance-, energy-, and complexity-efficient than MESI [PACT12, ASPLOS14, ASPLOS15]

  • 2014-: Déjà vu: Heterogeneous systems [ISCA15, Micro15, ISCA17]

My Story (1988 → 2017)
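The DRF guarantee above ("sequential consistency for data-race-free programs") can be sketched concretely. In the illustrative Python below (`counter`, `NITERS`, and `worker` are hypothetical names, not from the talk), every pair of conflicting accesses is ordered by a lock, so the program is race-free and its result is deterministic:

```python
import threading

# Data-race-free (DRF) sketch: all conflicting accesses to `counter` are
# ordered by a lock, so a DRF memory model guarantees sequentially
# consistent behavior and the final value is the same on every run.

NITERS = 10_000
counter = 0
lock = threading.Lock()

def worker():
    global counter
    for _ in range(NITERS):
        with lock:          # synchronization orders the conflicting accesses
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 2 * NITERS on every run
```

Remove the lock and the program has a data race; a DRF model then makes no guarantee at all, which is exactly why synchronization must be explicitly distinguished.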

SLIDE 5

Traditional Heterogeneous SoC Memory Hierarchies

  • Loosely coupled memory hierarchies

– Local memories don’t communicate with each other
– Unnecessary data movement

[Figure: traditional SoC – CPUs with L1/L2 caches, GPU, DSPs, multimedia, A/V hardware accelerators, modem, GPS, and vector units on a shared interconnect to main memory]

A tightly coupled memory hierarchy is needed

SLIDE 6

Tightly Coupled SoC Memory Hierarchies

  • Tightly coupled memory hierarchies: unified address space

– Entering mainstream, especially CPU-GPU
– Accelerator can access CPU’s data using the same address

[Figure: CPU and accelerator caches connected through an interconnection network to shared L2 cache banks]

Inefficient coherence and consistency
Specialized private memories still used for efficiency

SLIDE 7

Memory Hierarchies for Heterogeneous SoC


  • DeNovoA: Efficient coherence, simple DRF consistency

[MICRO ’15, Top Picks ’16 Honorable Mention]

  • DRFrlx: SC-centric semantics for relaxed atomics [ISCA’17]
  • Stash: Integrate specialized memories in global address space

[ISCA ’15, Top Picks ’16 Honorable Mention]

  • Spandex: Integrate accelerators/CPUs w/ different protocols
  • Dynamic load balancing and work stealing w/ accelerators

Efficiency Programmability

SLIDE 8

(build slide – same content as SLIDE 7)

SLIDE 9

Memory Hierarchies for Heterogeneous SoC


  • DeNovoA: Efficient coherence, simple DRF consistency

[MICRO ’15, Top Picks ’16 Honorable Mention]

  • DRFrlx: SC-centric semantics for relaxed atomics [ISCA’17]
  • Stash: Integrate specialized memories in global address space

[ISCA ’15, Top Picks ’16 Honorable Mention]

  • Spandex: Integrate accelerators/CPUs w/ different protocols
  • Dynamic load balancing and work stealing w/ accelerators

Focus: CPU-GPU systems with caches and scratchpads

Efficiency Programmability

SLIDE 10

CPU Coherence: MSI

  • Single writer, multiple reader

– On write miss, get ownership + invalidate all sharers
– On read miss, add to sharer list

⇒ Directory to store sharer list

Many transient states
Excessive traffic, indirection

[Figure: two CPUs with L1 caches connected through an interconnection network to L2 cache banks with directories]

Complex + inefficient
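The single-writer/multiple-reader invariant and the directory’s sharer list can be sketched as follows. This is a minimal illustration (`MsiDirectory` is a hypothetical name): real MSI additionally needs the many transient states and ack messages the slide criticizes.

```python
class MsiDirectory:
    """Minimal sketch of MSI directory bookkeeping (illustrative only)."""

    def __init__(self):
        self.sharers = {}  # addr -> set of cores holding a readable (S) copy
        self.owner = {}    # addr -> core holding the modified (M) copy

    def read_miss(self, core, addr):
        # Single writer: downgrade the owner (if any) to a sharer,
        # then add the reader to the sharer list.
        s = self.sharers.setdefault(addr, set())
        if addr in self.owner:
            s.add(self.owner.pop(addr))
        s.add(core)

    def write_miss(self, core, addr):
        # On a write miss: get ownership + invalidate all other sharers.
        victims = self.sharers.pop(addr, set()) - {core}
        prev = self.owner.get(addr)
        if prev is not None and prev != core:
            victims.add(prev)
        self.owner[addr] = core
        return victims  # invalidation messages the directory must send
```

For example, two read misses to the same address followed by a third core’s write miss force the directory to send two invalidations and collect acks, which is the traffic and indirection the slide calls out.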

SLIDE 11

GPU Coherence with DRF

  • With data-race-free (DRF) memory model

– No data races; synchs must be explicitly distinguished
– At all synch points:

  • Flush all dirty data: Unnecessary writethroughs
  • Invalidate all data: Can’t reuse data across synch points

– Synchronization accesses must go to last level cache (LLC)

Simple, but inefficient at synchronization

[Figure: CPU and GPU caches with valid/dirty data and shared L2 cache banks – at a synch, flush dirty data and invalidate all data]
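The flush-and-invalidate behavior at synch points can be sketched as follows (illustrative Python; `GpuL1` and the dict-based LLC are hypothetical stand-ins, not the talk’s implementation):

```python
class GpuL1:
    """Sketch of GPU-style coherence under DRF (illustrative only)."""

    def __init__(self, llc):
        self.llc = llc      # shared last-level cache, modeled as a dict
        self.data = {}      # addr -> value (valid lines)
        self.dirty = set()  # addresses written since the last synch

    def store(self, addr, value):
        self.data[addr] = value
        self.dirty.add(addr)

    def load(self, addr):
        if addr not in self.data:        # miss: fetch from the LLC
            self.data[addr] = self.llc.get(addr, 0)
        return self.data[addr]

    def synch(self):
        # At every synch point: flush all dirty data to the LLC, then
        # invalidate everything -- no reuse of data across synch points.
        for addr in self.dirty:
            self.llc[addr] = self.data[addr]
        self.dirty.clear()
        self.data.clear()
```

The `synch` method is the whole inefficiency in miniature: every dirty line is written through even if no one reads it, and every valid line is dropped even if the same core needs it again.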

SLIDE 12

(build slide – same content as SLIDE 11)

SLIDE 13
GPU Coherence with HRF

  • With heterogeneous-race-free (HRF) memory model [ASPLOS ’14]

– No data races; synchs and their scopes must be explicitly distinguished
– At all globally scoped synch points:

  • Flush all dirty data: Unnecessary writethroughs
  • Invalidate all data: Can’t reuse data across synch points

– Global synchronization accesses must go to last level cache (LLC)
– No overhead for locally scoped synchs

  • But higher programming complexity
  • Adopted for HSA, OpenCL, CUDA

SLIDE 14

Modern GPU Coherence & Consistency

            Consistency                        Coherence
De facto    Data-race-free (DRF): simple       High overhead on synchs: inefficient
Recent      Heterogeneous-race-free (HRF):     No overhead for local synchs:
            scoped synchronization, complex    efficient for local synch

Do GPU models (HRF) need to be more complex than CPU models (DRF)?
NO! Not if coherence is done right: DeNovoA+DRF is efficient AND has the simpler memory model

SLIDE 15
A Classification of Coherence Protocols

  • Read hit: Don’t return stale data
  • Read miss: Find one up-to-date copy

                Track up-to-date copy
Invalidator     Ownership       Writethrough
Writer          MESI            –
Reader          DeNovo          GPU

  • Reader-initiated invalidations

– No invalidation or ack traffic, directories, transient states

  • Obtaining ownership for written data

– Reuse owned data across synchs (not flushed at synch points)
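The two design points above (reader-initiated invalidation, ownership for written data) can be contrasted with the GPU protocol in a small sketch (illustrative Python; `DeNovoL1` and the dict-based LLC registry are hypothetical stand-ins):

```python
class DeNovoL1:
    """Sketch of DeNovo-style coherence under DRF (illustrative only)."""

    def __init__(self, llc):
        self.llc = llc      # shared LLC / registry, modeled as a dict
        self.data = {}      # addr -> value (valid lines)
        self.owned = set()  # addresses this cache has registered/owns

    def store(self, addr, value):
        # Obtain ownership (register at the LLC) instead of writing through.
        self.llc[("owner", addr)] = id(self)
        self.owned.add(addr)
        self.data[addr] = value

    def load(self, addr):
        if addr not in self.data:        # miss: fetch from the LLC
            self.data[addr] = self.llc.get(addr, 0)
        return self.data[addr]

    def synch(self):
        # Reader-initiated invalidation: drop only non-owned data.
        # Owned data is up to date by construction, so it is neither
        # flushed nor invalidated and can be reused across synchs.
        self.data = {a: v for a, v in self.data.items() if a in self.owned}
```

Because the reader invalidates its own stale lines at synch points, no invalidation or ack traffic, directory sharer lists, or transient states are needed; ownership replaces the GPU protocol’s writethrough flood.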

SLIDE 16

DeNovo Coherence with DRF

  • With data-race-free (DRF) memory model

– No data races; synchs must be explicitly distinguished
– At all synch points:

  • Obtain ownership for dirty data (instead of flushing it)
  • Invalidate all non-owned data (owned data can be reused)

– Synchronization accesses can be at L1 (instead of the LLC)

  • 3% state overhead vs. GPU coherence + HRF

[Figure: CPU and GPU caches (dirty/valid/owned lines) connected to shared L2 cache banks – ownership is obtained for dirty data, all non-owned data is invalidated at synchs, and owned data can be reused across synchs]

SLIDE 17

Evaluation Methodology

  • 1 CPU core + 15 GPU compute units (CU)

– Each node has private L1, scratchpad, tile of shared L2

  • Simulation Environment:

– GEMS, Simics, GPGPU-Sim, GPUWattch, McPAT

  • Workloads

– Rodinia, Parboil: no fine-grained synch
– UC-Davis microbenchmarks + UTS from HRF paper
– Graph analytics workloads

SLIDE 18

Key Evaluation Questions

  • 1. Streaming apps without fine-grained synch

– How does DeNovo+DRF compare to GPU+DRF?

  • 2. Apps with locally scoped fine-grained synch

– How does DeNovo+DRF compare to GPU+HRF?

  • 3. Apps with globally scoped fine-grained synch

– How does DeNovo+DRF compare to GPU+HRF?

DeNovo does not hurt performance for streaming apps
HRF’s complexity not needed

SLIDE 19

Global Synch – Execution Time

[Chart: execution time, normalized to 100%, for GPU+HRF (GH) vs. DeNovo+DRF (DD) on FAM, SLM, SPM, SPMBO, and their average]

DeNovo has 28% lower execution time than GPU with global synch

SLIDE 20

Global Synch – Energy

[Chart: energy breakdown (GPU core+, scratchpad, L1 D$, L2 $, network) for GPU+HRF (GH) vs. DeNovo+DRF (DD) on FAM, SLM, SPM, SPMBO, and their average]

DeNovo has 51% lower energy than GPU with global synch

SLIDE 21

Key Evaluation Questions

  • 1. Streaming apps without fine-grained synch

– How does DeNovo+DRF compare to GPU+DRF? DeNovo does not hurt performance for streaming apps

  • 2. Apps with locally scoped fine-grained synch

– How does DeNovo+DRF compare to GPU+HRF? HRF’s complexity not needed for local scope

  • 3. Apps with globally scoped fine-grained synch

– How does DeNovo+DRF compare to GPU+HRF? DeNovo+DRF provides much better performance and energy

DeNovo+DRF = Efficient and simple
Impact: AMD/HRF authors agree, refined DeNovo [Micro16]
But … Déjà vu!

SLIDE 22

Memory Hierarchies for Heterogeneous SoC


  • DeNovoA: Efficient coherence, simple DRF consistency

[MICRO ’15, Top Picks ’16 Honorable Mention]

  • DRFrlx: SC-centric semantics for relaxed atomics [ISCA’17]
  • Stash: Integrate specialized memories in global address space

[ISCA ’15, Top Picks ’16 Honorable Mention]

  • Spandex: Integrate accelerators/CPUs w/ different protocols
  • Dynamic load balancing and work stealing w/ accelerators

Efficiency Programmability

SLIDE 23
  • 1988 to 1989: What is a memory model?

– What value can a read return?

  • 1990s: Software-centric view: Data-race-free (DRF) model [ISCA90, …]

– Sequential consistency for data-race-free programs

  • 2000-08: Java, C++, … memory model [POPL05, PLDI08, CACM10]

– DRF model + big mess (but after 20 years, convergence at last)

  • 2008-14: Software-centric view for coherence: DeNovo protocol

– More performance-, energy-, and complexity-efficient than MESI [PACT12, ASPLOS14, ASPLOS15]

  • 2014-17: Déjà vu: Heterogeneous systems [ISCA15, Micro15]

– Coherence, consistency, global addressability

My Story (1988 → 2017)