Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models



slide-1
SLIDE 1

Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models

Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve
University of Illinois @ Urbana-Champaign
hetero@cs.illinois.edu

slide-2
SLIDE 2

Motivation

  • Heterogeneous systems now used for a wide variety of applications
  • Emerging applications have fine-grained synchronization
  • BUT current GPUs have sub-optimal consistency and coherence

  • De facto

– Consistency: data-race-free (DRF); simple
– Coherence: high overhead on synchs; inefficient

  • Recent

– Consistency: heterogeneous-race-free (HRF); scoped synchronization; complex
– Coherence: no overhead for local synchs; efficient for local synch

This work: simple consistency + efficient coherence

slide-3
SLIDE 3

Motivation (Cont.)

Do GPU models (HRF) need to be more complex than CPU models (DRF)? NO! Not if coherence is done right!

DeNovo+DRF: Efficient AND simpler memory model

– Comparable or better results vs. GPU+DRF and GPU+HRF

Saying no to complex consistency models

slide-4
SLIDE 4

Outline

  • Motivation
  • Coherence Protocols and Consistency Models

– Classification
– GPU Coherence
– DeNovo Coherence
– Coherence and Consistency Summary

  • Results
  • Conclusion

slide-5
SLIDE 5
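A Classification of Coherence Protocols

  • Read hit: Don’t return stale data
  • Read miss: Find one up-to-date copy

Protocols classified by who invalidates stale copies (writer or reader) and how the up-to-date copy is tracked (ownership or writethrough):

  • MESI: writer-initiated invalidation; ownership
  • GPU: reader-initiated invalidation; writethrough
  • DeNovo: reader-initiated invalidation; ownership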

  • Reader-initiated invalidations

– No invalidation or ack traffic, directories, transient states

  • Obtaining ownership for written data

– Reuse owned data across synchs (not flushed at synch points)

slide-6
SLIDE 6

GPU Coherence with DRF

  • With data-race-free (DRF) memory model

– No data races; synchs must be explicitly distinguished
– At all synch points

  • Flush all dirty data: Unnecessary writethroughs
  • Invalidate all data: Can’t reuse data across synch points

– Synchronization accesses must go to last level cache (LLC)

[Figure: CPU cache and GPU cache connected to L2 cache banks through an interconnection network; GPU cache lines are Valid or Dirty. At a synch point the GPU cache flushes dirty data and invalidates all data.]
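The synch-point actions on this slide can be sketched with a toy cache model. This is a minimal illustration, not the simulated hardware: the class `GpuL1`, its fields, and the use of a Python dict to stand in for the L2 are all assumptions made for the example.

```python
class GpuL1:
    """Toy private L1 using GPU-style coherence under DRF (illustrative only)."""

    def __init__(self, l2):
        self.l2 = l2            # shared dict standing in for the L2/LLC: address -> value
        self.lines = {}         # address -> value for lines currently valid in this L1
        self.dirty = set()      # addresses written since the last synch point
        self.writethroughs = 0  # number of lines flushed to L2 so far

    def read(self, addr):
        if addr not in self.lines:               # read miss: fetch from L2
            self.lines[addr] = self.l2.get(addr, 0)
        return self.lines[addr]                  # read hit

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def synch(self):
        # Flush all dirty data to L2 ...
        for addr in self.dirty:
            self.l2[addr] = self.lines[addr]
            self.writethroughs += 1
        self.dirty.clear()
        # ... and invalidate all data, so nothing is reused after the synch.
        self.lines.clear()
```

In this sketch every `synch()` empties the L1, which is the cost the slide points to: data that was still useful must be fetched again after each synch point.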

slide-7
SLIDE 7
GPU Coherence with HRF

  • With heterogeneous-race-free (HRF) memory model [ASPLOS ’14]

– No data races; synchs and their scopes must be explicitly distinguished
– At all global synch points

  • Flush all dirty data: Unnecessary writethroughs
  • Invalidate all data: Can’t reuse data across synch points

– Global synchronization accesses must go to last level cache (LLC)
– No overhead for locally scoped synchs

  • But higher programming complexity
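To illustrate why scopes remove overhead for local synchs but raise programming complexity, here is a toy sketch in which a locally scoped synch skips the flush and invalidate. The class `ScopedL1` and the scope strings `"local"` and `"global"` are hypothetical names for this example only; real scoped models define scopes in terms of the thread hierarchy.

```python
class ScopedL1:
    """Toy private L1 with scoped synchronization (illustrative only)."""

    def __init__(self, l2):
        self.l2 = l2        # shared dict standing in for the L2/LLC
        self.lines = {}     # address -> value cached in this L1
        self.dirty = set()  # addresses written but not yet flushed

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.l2.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def synch(self, scope):
        if scope == "local":
            return          # local scope: no flush, no invalidation
        if scope != "global":
            raise ValueError(f"unknown scope: {scope}")
        for addr in self.dirty:          # global scope: flush dirty data
            self.l2[addr] = self.lines[addr]
        self.dirty.clear()
        self.lines.clear()               # and invalidate all data


l2 = {"x": 0}
cu0, cu1 = ScopedL1(l2), ScopedL1(l2)   # two compute units, one L1 each
cu1.read("x")                            # cu1 caches x == 0
cu0.write("x", 1)
cu0.synch("local")
cu1.synch("local")
stale = cu1.read("x")                    # still the old value
cu0.synch("global")
cu1.synch("global")
fresh = cu1.read("x")                    # now sees cu0's write
```

In this sketch `stale` is 0 and `fresh` is 1: the local synch is free, but it is only correct when the communicating threads share an L1. Choosing the scope correctly is the programmer's burden.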

slide-8
SLIDE 8

DeNovo Coherence with DRF

  • With data-race-free (DRF) memory model

– No data races; synchs must be explicitly distinguished
– At all synch points

  • Obtain ownership for dirty data (instead of flushing all dirty data): can reuse owned data
  • Invalidate all non-owned data

– Synchronization accesses can be performed at L1 (need not go to the last level cache (LLC))

  • 3% state overhead vs. GPU coherence + HRF

[Figure: CPU cache and GPU cache connected to L2 cache banks through an interconnection network; GPU cache lines are Dirty, Valid, or Own. At a synch point the GPU cache obtains ownership for dirty data and invalidates non-owned data.]
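The DeNovo synch-point behavior (obtain ownership for dirty data, invalidate only non-owned data) can be sketched as follows. This is a simplified toy, not the protocol as implemented: the class `DeNovoL1`, the shared `owners` dict standing in for ownership tracking at the L2, and the choice to register ownership at the synch point are assumptions made for illustration.

```python
class DeNovoL1:
    """Toy private L1 using DeNovo-style coherence under DRF (illustrative only)."""

    def __init__(self, l2, owners):
        self.l2 = l2          # shared dict standing in for L2 data: address -> value
        self.owners = owners  # shared dict standing in for L2 tracking: address -> owning L1
        self.valid = {}       # non-owned copies, invalidated at every synch
        self.dirty = {}       # written data awaiting ownership
        self.owned = {}       # owned data, kept across synchs

    def read(self, addr):
        for store in (self.owned, self.dirty, self.valid):
            if addr in store:
                return store[addr]
        owner = self.owners.get(addr)            # miss: find the up-to-date copy
        value = owner.owned[addr] if owner is not None else self.l2.get(addr, 0)
        self.valid[addr] = value
        return value

    def write(self, addr, value):
        if addr in self.owned:
            self.owned[addr] = value             # already owned: update in place
        else:
            self.valid.pop(addr, None)
            self.dirty[addr] = value

    def synch(self):
        for addr, value in self.dirty.items():   # obtain ownership for dirty data
            previous = self.owners.get(addr)
            if previous is not None:
                previous.owned.pop(addr, None)   # ownership moves to this L1
            self.owners[addr] = self
            self.owned[addr] = value
        self.dirty.clear()
        self.valid.clear()                       # invalidate non-owned data only
```

For example, if one `DeNovoL1` writes an address and synchs, a second instance sharing the same `l2` and `owners` reads the new value after its own synch, while the writer keeps the line in `owned` and nothing is written through to `l2`.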

slide-9
SLIDE 9

DeNovo Configurations Studied


  • DeNovo+DRF:

– Invalidate all non-owned data at synch points

  • DeNovo-RO+DRF:

– Avoids invalidating read-only data at synch points

  • DeNovo+HRF:

– Reuse valid data if synch is locally scoped

slide-10
SLIDE 10

Coherence & Consistency Summary

Coherence + Consistency   Reuse Owned Data  Reuse Valid Data  Do Synchs at L1
GPU + DRF (GD)            No                No                No
GPU + HRF (GH)            Local only        Local only        Local only
DeNovo + DRF (DD)         Yes               No                Yes
DeNovo-RO + DRF (DD+RO)   Yes               Read-only data    Yes
DeNovo + HRF (DH)         Yes               Local only        Yes
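The data-reuse rules summarized on this slide can be written as a small decision function. This is a sketch under the assumption that each L1 line is either "owned" (written data) or "valid" (a read copy); the function name `survives_synch` and its parameters are hypothetical and chosen only for this example.

```python
def survives_synch(config, state, read_only=False, scope="global"):
    """Return True if an L1 line may be reused after a synch point.

    config:    "GD", "GH", "DD", "DD+RO", or "DH"
    state:     "owned" (written data) or "valid" (read copy)
    read_only: whether the line holds read-only data (used by DD+RO)
    scope:     "local" or "global"; ignored by the DRF configurations
    """
    if config not in ("GD", "GH", "DD", "DD+RO", "DH"):
        raise ValueError(f"unknown configuration: {config}")
    if state not in ("owned", "valid"):
        raise ValueError(f"unknown line state: {state}")
    local = scope == "local"
    if config == "GD":
        return False              # flush and invalidate everything
    if config == "GH":
        return local              # reuse only across locally scoped synchs
    if state == "owned":
        return True               # every DeNovo configuration keeps owned data
    if config == "DD":
        return False              # all non-owned data is invalidated
    if config == "DD+RO":
        return read_only          # read-only data is not invalidated
    return local                  # DH: valid data reused if synch is local
```

For instance, `survives_synch("DD", "owned")` is true while `survives_synch("DD", "valid")` is false, which is the gap that DeNovo-RO narrows for read-only data.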

slide-11
SLIDE 11

Outline

  • Motivation
  • Coherence Protocols and Consistency Models
  • Results
  • Conclusion


slide-12
SLIDE 12

Evaluation Methodology


  • 1 CPU core + 15 GPU compute units (CU)

– Each node has private L1, scratchpad, tile of shared L2

  • Simulation Environment

– GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT

  • Workloads

– 10 apps from Rodinia, Parboil: no fine-grained synch

  • DeNovo and GPU coherence perform comparably

– UC-Davis microbenchmarks + UTS from HRF paper:

  • Mutex, semaphore, barrier, work sharing
  • Shows potential for future apps
  • Created two versions of each: globally, locally/hybrid scoped synch
slide-13
SLIDE 13

Global Synch – Execution Time

[Figure: bar chart, y-axis 0% to 100%; G* and D* bars for each of FAM, SLM, SPM, SPMBO, and AVG]

DeNovo has 28% lower execution time than GPU with global synch

slide-14
SLIDE 14

Global Synch – Energy

[Figure: stacked bar chart, y-axis 0% to 100%; G* and D* bars for each of FAM, SLM, SPM, SPMBO, and AVG; energy components: N/W, L2 $, L1 D$, Scratch, GPU Core+]

DeNovo has 51% lower energy than GPU with global synch

slide-15
SLIDE 15

Local Synch – Execution Time

[Figure: bar chart, y-axis 0% to 100%; GD, GH, DD, DD+RO, and DH bars for each of FAM, SLM, SPM, SPMBO, SS, SSBO, TBEX, TB, UTS, and AVG]

  • GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]

slide-16
SLIDE 16

Local Synch – Execution Time

[Figure: bar chart, y-axis 0% to 100%; GD, GH, DD, DD+RO, and DH bars for each of FAM, SLM, SPM, SPMBO, SS, SSBO, TBEX, TB, UTS, and AVG]

  • GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
  • DeNovo+DRF comparable to GPU+HRF, but simpler consistency model

slide-17
SLIDE 17

Local Synch – Execution Time

[Figure: bar chart, y-axis 0% to 100%; GD, GH, DD, DD+RO, and DH bars for each of FAM, SLM, SPM, SPMBO, SS, SSBO, TBEX, TB, UTS, and AVG]

  • GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
  • DeNovo+DRF comparable to GPU+HRF, but simpler consistency model
  • DeNovo-RO+DRF reduces gap by not invalidating read-only data

slide-18
SLIDE 18

Local Synch – Execution Time

[Figure: bar chart, y-axis 0% to 100%; GD, GH, DD, DD+RO, and DH bars for each of FAM, SLM, SPM, SPMBO, SS, SSBO, TBEX, TB, UTS, and AVG]

  • GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
  • DeNovo+DRF comparable to GPU+HRF, but simpler consistency model
  • DeNovo-RO+DRF reduces gap by not invalidating read-only data
  • DeNovo+HRF is best, if consistency complexity acceptable

slide-19
SLIDE 19

Local Synch – Energy

[Figure: stacked bar chart, y-axis 0% to 100%; GD, GH, DD, DD+RO, and DH bars for each of FAM, SLM, SPM, SPMBO, SS, SSBO, TBEX, TB, UTS, and AVG; energy components: N/W, L2 $, L1 D$, Scratch, GPU Core+]

Energy trends similar to execution time

slide-20
SLIDE 20
Conclusions

  • Emerging heterogeneous apps use fine-grained synch

– GPU coherence + DRF: inefficient, but simple memory model
– GPU coherence + HRF: efficient, but complex memory model
– DeNovo + DRF: efficient AND simple memory model

Do GPU models (HRF) need to be more complex than CPU models (DRF)?

Saying no to complex consistency models!