Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models



  1. Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models
     Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve
     University of Illinois @ Urbana-Champaign
     hetero@cs.illinois.edu

  2. Motivation
     • Heterogeneous systems now used for a wide variety of applications
     • Emerging applications have fine-grained synchronization
     • BUT current GPUs have sub-optimal consistency and coherence
       – De facto: data-race-free (DRF) consistency is simple, but coherence has high overhead on synchs (inefficient)
       – Recent: heterogeneous-race-free (HRF) consistency with scoped synchronization is complex, but coherence has no overhead for local synchs (efficient for local synch)
     • This work: simple consistency + efficient coherence

  3. Motivation (Cont.)
     • Do GPU models (HRF) need to be more complex than CPU models (DRF)?
     • NO! Not if coherence is done right!
     • DeNovo+DRF: efficient AND simpler memory model
       – Comparable or better results vs. GPU+DRF and GPU+HRF
     • No to complex consistency models

  4. Outline
     • Motivation
     • Coherence Protocols and Consistency Models
       – Classification
       – GPU Coherence
       – DeNovo Coherence
       – Coherence and Consistency Summary
     • Results
     • Conclusion

  5. A Classification of Coherence Protocols
     • Read hit: Don’t return stale data
     • Read miss: Find one up-to-date copy

     Track up-to-date copy   Invalidator: Writer   Invalidator: Reader
     Ownership               MESI                  DeNovo
     Writethrough            -                     GPU

     • Reader-initiated invalidations
       – No invalidation or ack traffic, directories, transient states
     • Obtaining ownership for written data
       – Reuse owned data across synchs (not flushed at synch points)

  6. GPU Coherence with DRF
     [Diagram: GPU and CPU caches holding dirty and valid data, connected to L2 cache banks over the interconnection n/w; the GPU cache flushes dirty data and invalidates all data]
     • With data-race-free (DRF) memory model
       – No data races; synchs must be explicitly distinguished
       – At all synch points
         • Flush all dirty data: Unnecessary writethroughs
         • Invalidate all data: Can’t reuse data across synch points
       – Synchronization accesses must go to last level cache (LLC)
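
The synch-point behavior on this slide can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the authors' simulator: the class `GpuL1`, its counters, and the dictionary standing in for the L2 are all hypothetical names introduced for illustration.

```python
# Toy model of GPU coherence under DRF (illustrative only; all names are hypothetical).
class GpuL1:
    def __init__(self, l2):
        self.l2 = l2            # shared last-level cache, modeled as a dict addr -> value
        self.lines = {}         # addr -> (value, dirty)
        self.misses = 0         # reads that had to go to L2
        self.writethroughs = 0  # dirty lines flushed to L2

    def read(self, addr):
        if addr not in self.lines:                      # miss: fetch from L2
            self.misses += 1
            self.lines[addr] = (self.l2.get(addr, 0), False)
        return self.lines[addr][0]

    def write(self, addr, value):
        self.lines[addr] = (value, True)                # kept dirty until the next synch

    def synch(self):
        # At every synch point: flush all dirty data, then invalidate all data.
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.l2[addr] = value
                self.writethroughs += 1
        self.lines.clear()


l2 = {0: 10, 1: 20}
l1 = GpuL1(l2)
l1.read(0)        # miss, line 0 is now cached
l1.write(1, 21)   # line 1 is dirty in L1
l1.synch()        # line 1 written through, both lines invalidated
l1.read(0)        # misses again: the valid copy was not reusable across the synch
```

The second read of address 0 misses even though nobody wrote it, which is the lost reuse the slide points at.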

  7. GPU Coherence with HRF
     • With heterogeneous-race-free (HRF) memory model [ASPLOS ’14]
       – No data races; synchs and their scopes must be explicitly distinguished
       – At all global synch points
         • Flush all dirty data: Unnecessary writethroughs
         • Invalidate all data: Can’t reuse data across synch points
       – Global synchronization accesses must go to last level cache (LLC)
       – No overhead for locally scoped synchs
         • But higher programming complexity
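
A small sketch can show both sides of scoped synchronization: the local synch is free, but the programmer must pick the right scope. It is a toy model under simplifying assumptions; `ScopedGpuL1` and the scope strings `"local"` and `"global"` are hypothetical names, not an API from the HRF paper.

```python
# Toy model of GPU coherence with scoped synchs (illustrative only; names are hypothetical).
class ScopedGpuL1:
    """L1 cache of one compute unit (CU); threads on the same CU share it."""

    def __init__(self, l2):
        self.l2 = l2      # shared L2, modeled as a dict addr -> value
        self.lines = {}   # addr -> (value, dirty)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = (self.l2.get(addr, 0), False)
        return self.lines[addr][0]

    def write(self, addr, value):
        self.lines[addr] = (value, True)

    def synch(self, scope):
        if scope == "local":
            return  # same-CU threads already see this L1: no flush, no invalidate
        if scope != "global":
            raise ValueError(f"unknown scope: {scope}")
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.l2[addr] = value   # flush dirty data
        self.lines.clear()              # invalidate all data


l2 = {0: 0}
cu0, cu1 = ScopedGpuL1(l2), ScopedGpuL1(l2)

cu0.write(0, 42)
cu0.synch("local")               # cheap, but only orders accesses within cu0
seen_after_local = cu1.read(0)   # cu1 still observes the old value

cu0.synch("global")              # flush so other CUs can see the write
cu1.synch("global")              # invalidate cu1's stale copy
seen_after_global = cu1.read(0)
```

Choosing `"local"` where `"global"` was needed leaves `cu1` reading stale data, which is one way to read the slide's "higher programming complexity".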

  8. DeNovo Coherence with DRF
     [Diagram: GPU and CPU caches connected to L2 cache banks over the interconnection n/w; the GPU cache holds owned, dirty, and valid data, obtains ownership for dirty data, and invalidates non-owned data]
     • With data-race-free (DRF) memory model
       – No data races; synchs must be explicitly distinguished
       – At all synch points
         • Obtain ownership for dirty data (instead of flushing all dirty data): Can reuse owned data
         • Invalidate all non-owned data
       – Synchronization accesses can be performed at L1 (instead of going to the last level cache (LLC))
     • 3% state overhead vs. GPU coherence + HRF
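
The ownership idea can be sketched with a toy model as well. This is a simplified reading of the slide, not the DeNovo protocol itself: `DeNovoL2`, `DeNovoL1`, and the state names are hypothetical, ownership is registered only at synch points, and a read of an owned line is simply forwarded from the owner.

```python
# Toy model of DeNovo-style ownership under DRF (illustrative only; names are hypothetical).
class DeNovoL2:
    def __init__(self, data):
        self.data = dict(data)  # addr -> value for lines nobody owns
        self.owner = {}         # addr -> L1 holding the up-to-date copy


class DeNovoL1:
    def __init__(self, l2):
        self.l2 = l2
        self.lines = {}         # addr -> (value, state); state is "valid", "dirty", or "owned"

    def read(self, addr):
        if addr not in self.lines:
            owner = self.l2.owner.get(addr)
            if owner is not None:
                value = owner.lines[addr][0]        # forwarded from the owning L1
            else:
                value = self.l2.data.get(addr, 0)
            self.lines[addr] = (value, "valid")
        return self.lines[addr][0]

    def write(self, addr, value):
        already_owned = addr in self.lines and self.lines[addr][1] == "owned"
        self.lines[addr] = (value, "owned" if already_owned else "dirty")

    def synch(self):
        kept = {}
        for addr, (value, state) in self.lines.items():
            if state == "dirty":
                prev = self.l2.owner.get(addr)
                if prev is not None and prev is not self:
                    prev.lines.pop(addr, None)      # previous owner gives up the line
                self.l2.owner[addr] = self          # obtain ownership instead of flushing
                state = "owned"
            if state == "owned":
                kept[addr] = (value, state)         # owned data survives the synch
        self.lines = kept                           # valid, non-owned data is invalidated


l2 = DeNovoL2({0: 0, 1: 7})
a, b = DeNovoL1(l2), DeNovoL1(l2)
a.write(0, 5)
a.read(1)
a.synch()              # a now owns line 0; its valid copy of line 1 is dropped
first = b.read(0)      # served from a's owned copy
a.write(0, 6)          # hits the owned line, no new ownership request
b.synch()              # b self-invalidates its non-owned copy
second = b.read(0)
```

Unlike the writethrough model, the writer keeps its line across the synch and no data is pushed to L2.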

  9. DeNovo Configurations Studied
     • DeNovo+DRF:
       – Invalidate all non-owned data at synch points
     • DeNovo-RO+DRF:
       – Avoids invalidating read-only data at synch points
     • DeNovo+HRF:
       – Reuse valid data if synch is locally scoped

  10. Coherence & Consistency Summary

      Coherence + Consistency     Reuse Owned Data   Reuse Valid Data   Do Synchs at L1
      GPU + DRF (GD)              X                  X                  X
      GPU + HRF (GH)              local              local              local
      DeNovo + DRF (DD)           ✓                  X                  ✓
      DeNovo-RO + DRF (DD+RO)     ✓                  read-only          ✓
      DeNovo + HRF (DH)           ✓                  local              ✓
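
One way to make the summary concrete is to encode it as a lookup. The sketch below reflects one reading of the slide's table (in particular, that DeNovo-RO reuses valid data only when it is read-only); the names `SUMMARY` and `allowed` and the column keys are hypothetical.

```python
# One reading of the summary table as data (illustrative only; names are hypothetical).
SUMMARY = {
    #          reuse owned data   reuse valid data      do synchs at L1
    "GD":    {"owned": "no",    "valid": "no",        "l1_synch": "no"},
    "GH":    {"owned": "local", "valid": "local",     "l1_synch": "local"},
    "DD":    {"owned": "yes",   "valid": "no",        "l1_synch": "yes"},
    "DD+RO": {"owned": "yes",   "valid": "read-only", "l1_synch": "yes"},
    "DH":    {"owned": "yes",   "valid": "local",     "l1_synch": "yes"},
}


def allowed(config, column, scope="global", read_only=False):
    """Return True if `config` permits the optimization in `column` for this synch."""
    rule = SUMMARY[config][column]
    if rule == "yes":
        return True
    if rule == "no":
        return False
    if rule == "local":
        return scope == "local"      # only for locally scoped synchs
    if rule == "read-only":
        return read_only             # only for data known to be read-only
    raise ValueError(f"unknown rule: {rule}")


dd_keeps_owned = allowed("DD", "owned")
gh_keeps_valid_global = allowed("GH", "valid", scope="global")
gh_keeps_valid_local = allowed("GH", "valid", scope="local")
ro_keeps_read_only = allowed("DD+RO", "valid", read_only=True)
```

Only the GH and DH rows depend on `scope`, which matches the point that the DRF configurations need no scope annotations.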

  11. Outline
      • Motivation
      • Coherence Protocols and Consistency Models
      • Results
      • Conclusion

  12. Evaluation Methodology
      • 1 CPU core + 15 GPU compute units (CU)
        – Each node has private L1, scratchpad, tile of shared L2
      • Simulation Environment
        – GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT
      • Workloads
        – 10 apps from Rodinia, Parboil: no fine-grained synch
          • DeNovo and GPU coherence perform comparably
        – UC-Davis microbenchmarks + UTS from HRF paper:
          • Mutex, semaphore, barrier, work sharing
          • Shows potential for future apps
          • Created two versions of each: globally, locally/hybrid scoped synch

  13. Global Synch – Execution Time
      [Chart: execution time, 0% to 100%, for FAM, SLM, SPM, SPMBO, and AVG; bars G* and D*]
      DeNovo has 28% lower execution time than GPU with global synch

  14. Global Synch – Energy
      [Chart: energy, 0% to 100%, for FAM, SLM, SPM, SPMBO, and AVG; bars G* and D*, broken down into N/W, L2 $, L1 D$, Scratch, GPU Core+]
      DeNovo has 51% lower energy than GPU with global synch

  15. Local Synch – Execution Time
      [Chart: execution time, 0% to 100%, for FAM, SLM, SPM, SPMBO, SS, SSBO, TBEX, TB, UTS, and AVG; bars GD, GH, DD, DD+RO, DH]
      • GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]

  16. Local Synch – Execution Time
      [Chart: same as slide 15]
      • GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
      • DeNovo+DRF comparable to GPU+HRF, but simpler consistency model

  17. Local Synch – Execution Time
      [Chart: same as slide 15]
      • GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
      • DeNovo+DRF comparable to GPU+HRF, but simpler consistency model
      • DeNovo-RO+DRF reduces gap by not invalidating read-only data

  18. Local Synch – Execution Time
      [Chart: same as slide 15]
      • GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
      • DeNovo+DRF comparable to GPU+HRF, but simpler consistency model
      • DeNovo-RO+DRF reduces gap by not invalidating read-only data
      • DeNovo+HRF is best, if consistency complexity acceptable

  19. Local Synch – Energy
      [Chart: energy, 0% to 100%, for FAM, SLM, SPM, SPMBO, SS, SSBO, TBEX, TB, UTS, and AVG; bars GD, GH, DD, DD+RO, DH, broken down into N/W, L2 $, L1 D$, Scratch, GPU Core+]
      Energy trends similar to execution time

  20. Conclusions
      • Emerging heterogeneous apps use fine-grained synch
        – GPU coherence + DRF: inefficient, but simple memory model
        – GPU coherence + HRF: efficient, but complex memory model
      • Do GPU models (HRF) need to be more complex than CPU models (DRF)?
        – DeNovo + DRF: efficient AND simple memory model
      • No to complex consistency models!
