Efficient GPU Synchronization without Scopes: Saying No to Complex Consistency Models

Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve
University of Illinois @ Urbana-Champaign
hetero@cs.illinois.edu

Motivation
• Heterogeneous systems are now used for a wide variety of applications
• Emerging applications have fine-grained synchronization
• BUT current GPUs have sub-optimal consistency and coherence
  – De facto: data-race-free (DRF) consistency is simple, but coherence has high overhead on synchs (inefficient)
  – Recent: heterogeneous-race-free (HRF) consistency with scoped synchronization is complex, but coherence has no overhead for local synchs (efficient for local synch)
• This work: simple consistency + efficient coherence

Motivation (Cont.)
• Do GPU models (HRF) need to be more complex than CPU models (DRF)?
  – NO! Not if coherence is done right!
• DeNovo+DRF: efficient AND simpler memory model
  – Comparable or better results vs. GPU+DRF and GPU+HRF
• Saying no to complex consistency models

Outline
• Motivation
• Coherence Protocols and Consistency Models
  – Classification
  – GPU Coherence
  – DeNovo Coherence
  – Coherence and Consistency Summary
• Results
• Conclusion

A Classification of Coherence Protocols
• Read hit: Don’t return stale data (who invalidates: writer or reader?)
• Read miss: Find one up-to-date copy (how it is tracked: ownership or writethrough?)

Track up-to-date copy    Invalidator: Writer    Invalidator: Reader
Ownership                MESI                   DeNovo
Writethrough                                    GPU

• Reader-initiated invalidations
  – No invalidation or ack traffic, directories, transient states
• Obtaining ownership for written data
  – Reuse owned data across synchs (not flushed at synch points)

GPU Coherence with DRF
[Diagram: GPU and CPU caches connected through an interconnection n/w to L2 cache banks. The GPU cache holds dirty and valid data; at a synch point it flushes dirty data and invalidates all data.]
• With data-race-free (DRF) memory model
  – No data races; synchs must be explicitly distinguished
  – At all synch points
    • Flush all dirty data: unnecessary writethroughs
    • Invalidate all data: can’t reuse data across synch points
  – Synchronization accesses must go to last level cache (LLC)
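The synch-point behavior on this slide can be sketched with a toy cache model. This is only an illustration under simplifying assumptions, not the simulator used in the talk: the class `GpuL1`, its counters, and the dict-based L2 are hypothetical names, and a single `synch_point()` call lumps the flush and the invalidation together as the slide does.

```python
class GpuL1:
    """Toy L1 with GPU-style coherence (hypothetical model for illustration)."""

    def __init__(self, l2):
        self.l2 = l2          # shared L2, modeled as a dict: address -> value
        self.lines = {}       # address -> (value, dirty flag)
        self.l2_reads = 0     # L1 misses served by the L2
        self.l2_writes = 0    # writethroughs performed at synch points

    def load(self, addr):
        if addr not in self.lines:                 # miss: fetch from L2
            self.l2_reads += 1
            self.lines[addr] = (self.l2.get(addr, 0), False)
        return self.lines[addr][0]

    def store(self, addr, value):
        self.lines[addr] = (value, True)           # buffer the write in L1

    def synch_point(self):
        # Flush all dirty data to the L2 ...
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.l2[addr] = value
                self.l2_writes += 1
        # ... and invalidate all data, so nothing is reused afterwards.
        self.lines.clear()


l2 = {0: 10, 1: 20}
l1 = GpuL1(l2)
l1.load(0)            # miss
l1.store(1, 99)       # dirty in L1 only
l1.synch_point()      # writethrough of address 1, then full invalidation
l1.load(0)            # misses again: the valid copy was thrown away
print(l1.l2_reads, l1.l2_writes, l2[1])
```

The second `load(0)` misses even though no other core touched address 0, which is the lost reuse the slide points out.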

GPU Coherence with HRF
• With heterogeneous-race-free (HRF) memory model [ASPLOS ’14]
  – No heterogeneous races; synchs must be explicitly distinguished, along with their scopes
  – At all global synch points
    • Flush all dirty data: unnecessary writethroughs
    • Invalidate all data: can’t reuse data across synch points
  – Global synchronization accesses must go to last level cache (LLC)
  – No overhead for locally scoped synchs
    • But higher programming complexity
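A minimal sketch of how a scope changes the cost of a synch point, assuming a toy model in which a "local" synch involves only threads that share the same L1. The class `ScopedGpuL1` and the scope strings are hypothetical names chosen for this example.

```python
class ScopedGpuL1:
    """Toy L1 with GPU coherence and scoped synchs (hypothetical model)."""

    def __init__(self, l2):
        self.l2 = l2          # shared L2: address -> value
        self.lines = {}       # address -> (value, dirty flag)
        self.l2_reads = 0

    def load(self, addr):
        if addr not in self.lines:
            self.l2_reads += 1
            self.lines[addr] = (self.l2.get(addr, 0), False)
        return self.lines[addr][0]

    def store(self, addr, value):
        self.lines[addr] = (value, True)

    def synch_point(self, scope):
        if scope == "local":
            return                    # same L1 on both sides: no flush, no invalidate
        if scope != "global":
            raise ValueError(f"unknown scope: {scope}")
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.l2[addr] = value  # flush dirty data
        self.lines.clear()             # invalidate all data


l2 = {0: 10, 1: 20}
l1 = ScopedGpuL1(l2)
l1.load(0)
l1.store(1, 99)
l1.synch_point("local")
l1.load(0)                             # hit: data survived the local synch
reads_after_local = l1.l2_reads
stale_l2_value = l2[1]                 # L2 has not seen the write yet
l1.synch_point("global")               # now the write becomes globally visible
```

The efficiency comes from the caller promising that the scope is right. If a synch is labeled local while the consumer runs on another compute unit, the consumer reads the stale L2 value, which is the programming complexity the slide refers to.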

DeNovo Coherence with DRF
[Diagram: GPU and CPU caches connected through an interconnection n/w to L2 cache banks. The GPU cache holds owned, dirty, and valid data; at a synch point it obtains ownership for dirty data and invalidates non-owned data.]
• With data-race-free (DRF) memory model
  – No data races; synchs must be explicitly distinguished
  – At all synch points
    • Obtain ownership for dirty data (instead of flushing all dirty data)
    • Invalidate all non-owned data: can reuse owned data
  – Synchronization accesses can be performed at L1 (instead of going to the last level cache)
• 3% state overhead vs. GPU coherence + HRF
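The same toy style can illustrate the DeNovo behavior described above. This is a sketch under stated assumptions rather than the actual protocol: `DeNovoL1`, the shared `owners` registry, and the three state names are hypothetical, ownership is requested only at the synch point as the slide describes, and transient states and evictions are ignored.

```python
class DeNovoL1:
    """Toy L1 with DeNovo-style coherence (hypothetical model for illustration)."""

    def __init__(self, l2, owners):
        self.l2 = l2            # shared L2 data: address -> value
        self.owners = owners    # registry kept at the L2: address -> owning DeNovoL1
        self.lines = {}         # address -> (value, state): "valid", "dirty", "owned"
        self.misses = 0
        self.registrations = 0  # ownership requests sent to the L2

    def load(self, addr):
        if addr not in self.lines:
            self.misses += 1
            owner = self.owners.get(addr)
            # The registry locates the one up-to-date copy: a remote L1 or the L2.
            value = owner.lines[addr][0] if owner is not None else self.l2.get(addr, 0)
            self.lines[addr] = (value, "valid")
        return self.lines[addr][0]

    def store(self, addr, value):
        state = "owned" if self.owners.get(addr) is self else "dirty"
        self.lines[addr] = (value, state)

    def synch_point(self):
        for addr, (value, state) in list(self.lines.items()):
            if state == "dirty":
                # Obtain ownership instead of writing the data through.
                previous = self.owners.get(addr)
                if previous is not None:
                    previous.lines.pop(addr, None)   # old owner gives up its copy
                self.owners[addr] = self
                self.registrations += 1
                self.lines[addr] = (value, "owned")
            elif state == "valid":
                del self.lines[addr]                 # self-invalidate non-owned data
            # "owned" lines are kept and reused across the synch point


l2, owners = {0: 10, 1: 20}, {}
a, b = DeNovoL1(l2, owners), DeNovoL1(l2, owners)
a.load(0)               # miss, valid copy
a.store(1, 99)          # dirty
a.synch_point()         # registers address 1, invalidates address 0
a.load(1)               # hit: owned data reused across the synch
a.load(0)               # miss: valid data was invalidated
seen_by_b = b.load(1)   # served from a's owned copy, not from the stale L2
```

Because the registry always points at the up-to-date copy, no writethrough is needed for correctness, and only non-owned data pays the invalidation cost.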

DeNovo Configurations Studied
• DeNovo+DRF:
  – Invalidate all non-owned data at synch points
• DeNovo-RO+DRF:
  – Avoids invalidating read-only data at synch points
• DeNovo+HRF:
  – Reuse valid data if synch is locally scoped

Coherence & Consistency Summary

Coherence + Consistency     Reuse owned data   Reuse valid data   Do synchs at L1
GPU + DRF (GD)              X                  X                  X
GPU + HRF (GH)              local              local              local
DeNovo + DRF (DD)           ✓                  X                  ✓
DeNovo-RO + DRF (DD+RO)     ✓                  read-only          ✓
DeNovo + HRF (DH)           ✓                  local              ✓
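The data-reuse columns of the summary above can be restated as a small decision function. This is just one way to encode the table; the function name, argument names, and state strings are hypothetical, and the configuration abbreviations follow the slide.

```python
CONFIGS = ("GD", "GH", "DD", "DD+RO", "DH")


def reusable_after_synch(config, line_state, read_only=False, scope="global"):
    """Return True if a cached line may be reused after a synch point.

    config:     one of CONFIGS (abbreviations from the summary slide)
    line_state: "owned" or "valid"
    read_only:  whether the line holds read-only data
    scope:      "local" or "global"; only the HRF configurations look at it
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config}")
    if config == "GD":
        return False                  # everything is flushed or invalidated
    if config == "GH":
        return scope == "local"       # reuse only across locally scoped synchs
    if line_state == "owned":
        return True                   # every DeNovo configuration keeps owned data
    if config == "DD+RO":
        return read_only              # valid data survives only if read-only
    if config == "DH":
        return scope == "local"       # valid data survives local synchs
    return False                      # DD: all non-owned data is invalidated


for config in CONFIGS:
    print(config,
          reusable_after_synch(config, "owned"),
          reusable_after_synch(config, "valid", read_only=True),
          reusable_after_synch(config, "valid", scope="local"))
```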

Outline
• Motivation
• Coherence Protocols and Consistency Models
• Results
• Conclusion

Evaluation Methodology
• 1 CPU core + 15 GPU compute units (CUs)
  – Each node has private L1, scratchpad, tile of shared L2
• Simulation environment
  – GEMS, Simics, Garnet, GPGPU-Sim, GPUWattch, McPAT
• Workloads
  – 10 apps from Rodinia, Parboil: no fine-grained synch
    • DeNovo and GPU coherence perform comparably
  – UC-Davis microbenchmarks + UTS from HRF paper
    • Mutex, semaphore, barrier, work sharing
    • Shows potential for future apps
    • Created two versions of each: globally scoped and locally/hybrid scoped synch

Global Synch – Execution Time
[Bar chart: execution time (axis 0% to 100%) of G* and D* on FAM, SLM, SPM, SPMBO, and AVG]
• DeNovo has 28% lower execution time than GPU with global synch

Global Synch – Energy
[Stacked bar chart: energy (axis 0% to 100%) of G* and D* on FAM, SLM, SPM, SPMBO, and AVG, broken down into N/W, L2 $, L1 D$, Scratch, and GPU Core+]
• DeNovo has 51% lower energy than GPU with global synch

Local Synch – Execution Time
[Bar chart: execution time (axis 0% to 100%) of GD, GH, DD, DD+RO, and DH on FAM, SLM, SPM, SPMBO, SS, SSBO, TBEX, TB, UTS, and AVG]
• GPU+HRF is much better than GPU+DRF with local synch [ASPLOS ’14]
• DeNovo+DRF is comparable to GPU+HRF, but with a simpler consistency model
• DeNovo-RO+DRF reduces the gap by not invalidating read-only data
• DeNovo+HRF is best, if the consistency complexity is acceptable

Local Synch – Energy
[Stacked bar chart: energy (axis 0% to 100%) of GD, GH, DD, DD+RO, and DH on FAM, SLM, SPM, SPMBO, SS, SSBO, TBEX, TB, UTS, and AVG, broken down into N/W, L2 $, L1 D$, Scratch, and GPU Core+]
• Energy trends similar to execution time

Conclusions
• Emerging heterogeneous apps use fine-grained synch
  – GPU coherence + DRF: inefficient, but simple memory model
  – GPU coherence + HRF: efficient, but complex memory model
• Do GPU models (HRF) need to be more complex than CPU models (DRF)?
  – DeNovo + DRF: efficient AND simple memory model
• Saying no to complex consistency models!
