Johnathan Alsop*, Matthew D. Sinclair*†‡, Sarita V. Adve*
*Illinois, †AMD, ‡Wisconsin
Sponsors: NSF, C-FAR, ADA (JUMP center by SRC, DARPA)

Specialized architectures are increasingly important in all compute domains

Specialization Requires Better Memory Systems
Traditional heterogeneity (separate CPU and accelerator memory spaces, data copied in and out):
  ✗ No fine-grain synchronization
  ✗ No irregular access patterns
  ✗ Wasteful data movement
Shared coherent memory (CPU and accelerators share one coherent memory space):
  ✓ Fine-grain synchronization
  ✓ Irregular access
  ✓ Implicit data reuse
Existing solutions: complex and inflexible

Heterogeneous devices have diverse memory demands
[Figure axes: fine-grain synch, latency sensitivity, throughput sensitivity, temporal locality, spatial locality]
• Typical CPU workloads: fine-grain synch, latency sensitive
• Typical GPU workloads: spatial locality, throughput sensitive

MESI protocol fits CPU workloads
Properties     MESI                GPU coherence                DeNovo
Granularity    Line                Reads: Line, Writes: Word    Reads: Flexible, Writes: Word
Invalidation   Writer-invalidate   Self-invalidate              Self-invalidate
Updates        Ownership           Write-through                Ownership
• Granularity (line): ✓ Spatial locality, ✗ False sharing
• Invalidation (writer-invalidate): ✓ Temporal locality for reads, ✗ Overheads limit throughput
• Updates (ownership): ✓ Temporal locality for writes, ✗ Indirection if low locality

GPUs prefer simpler protocols
• Granularity (reads: line, writes: word): ✓ No false sharing, ✗ Synch limits spatial locality
• Invalidation (self-invalidate): ✓ Simple, scalable, ✗ Synch limits read reuse
• Updates (write-through): ✓ Simple, low overhead, ✗ Synch limits write reuse

DeNovo is a good fit for CPU and GPU
• Good for: MESI: CPU; GPU coherence: GPU; DeNovo: CPU or GPU

Existing Solutions: Inflexible and Inefficient
[Diagram: CPU, GPU, and accelerator L1 caches (MESI L1, GPU coh. L1, DeNovo L1) attached to a MESI/GPU coh. hybrid L2 and a MESI LLC]
• Examples: ARM ACE, IBM CAPI, AMD APU
• If the glove doesn’t fit… there’s limited benefit!

Spandex: Flexible Heterogeneous Coherence Interface
[Diagram: CPU, GPU, and accelerator L1 caches (MESI L1, GPU coh. L1, DeNovo L1) attached to Spandex]
• Adapts to exploit individual device’s workload attributes
• Better performance, lower complexity
⇒ Fits like a glove for any heterogeneous system!

Spandex Overview
Key Components
• Flexible device request interface
• DeNovo-based LLC
• External request interface
Device may need a translation unit (TU)
[Diagram: MESI L1, GPU coh. L1, and DeNovo L1 caches (CPU, GPU, accelerator), each with a TU, connect to the Spandex LLC through the device request interface and the external request interface]

Device Request Interface
Action        Request       Indicates
Read          ReqV          Self-invalidation
Read          ReqS          Writer-invalidation
Write         ReqWT         Write-through
Write         ReqO          Ownership only
Read+Write    ReqWT+data    Atomic write-through
Read+Write    ReqO+data     Ownership + Data
Writeback     ReqWB         Owned data eviction
Requests also specify granularity and (optionally) a bitmask
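The request table above can be written down as a small lookup, which may make the action-to-request mapping easier to scan. This is only an illustrative sketch: the request names and their meanings come from the slide, while the `DeviceRequest` fields, the 16-word default mask, and the `choose_store_request` helper are hypothetical and not part of Spandex itself.

```python
from dataclasses import dataclass
from enum import Enum


class Request(Enum):
    """Device request types as named on the slide."""
    REQ_V = "ReqV"
    REQ_S = "ReqS"
    REQ_WT = "ReqWT"
    REQ_O = "ReqO"
    REQ_WT_DATA = "ReqWT+data"
    REQ_O_DATA = "ReqO+data"
    REQ_WB = "ReqWB"


# (action, what the request indicates), copied from the slide's table.
SEMANTICS = {
    Request.REQ_V: ("read", "self-invalidation"),
    Request.REQ_S: ("read", "writer-invalidation"),
    Request.REQ_WT: ("write", "write-through"),
    Request.REQ_O: ("write", "ownership only"),
    Request.REQ_WT_DATA: ("read+write", "atomic write-through"),
    Request.REQ_O_DATA: ("read+write", "ownership + data"),
    Request.REQ_WB: ("writeback", "owned data eviction"),
}


@dataclass(frozen=True)
class DeviceRequest:
    """Hypothetical message layout: a request type plus granularity info."""
    kind: Request
    line_addr: int
    word_mask: int = 0xFFFF  # assumption: 16 words per line, all selected


def choose_store_request(wants_ownership: bool, needs_current_data: bool) -> Request:
    """Hypothetical helper: pick a write-side request from the table.

    Requests that also read the current value use the "+data" variants.
    """
    if needs_current_data:
        return Request.REQ_O_DATA if wants_ownership else Request.REQ_WT_DATA
    return Request.REQ_O if wants_ownership else Request.REQ_WT
```

For example, under these assumptions a cache that writes through without reading the old value would issue `choose_store_request(False, False)`, which selects `ReqWT`.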

Spandex LLC
• States: I, V, O, S
• Allocation at line granularity
• Ownership at word granularity
• Data field tracks owner ID
• May generate requests to owner/sharer
✓ No false sharing
✓ Non-blocking ownership transfer
[Diagram: LLC entry with Tag, State, O Mask, and Data fields; messages ReqWT, ReqO, RspWT, RspO exchanged with the L1 caches]
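To make word-granularity ownership concrete, here is a minimal sketch of an LLC entry in which a per-word mask marks owned words and the data slot of an owned word holds the owner's ID, as the slide describes. The class name, the 16-word line size, and the method names are assumptions for illustration, not the actual Spandex implementation.

```python
from dataclasses import dataclass, field

WORDS_PER_LINE = 16  # assumption: 64-byte line with 4-byte words


@dataclass
class LLCLine:
    """Hypothetical LLC entry: tag, line state, per-word O mask, data."""
    tag: int
    state: str = "V"    # line-level state, e.g. "I", "V", or "S"
    owned_mask: int = 0  # bit i set => word i is owned by some device
    data: list = field(default_factory=lambda: [0] * WORDS_PER_LINE)

    def grant_ownership(self, word_mask: int, owner_id: int) -> dict:
        """Record owner_id as the owner of every word selected by word_mask.

        Returns {previous_owner_id: mask_of_words_taken_from_it}, i.e. the
        owners that would have to be told they lost those words.
        """
        previous = {}
        for i in range(WORDS_PER_LINE):
            bit = 1 << i
            if not word_mask & bit:
                continue
            if self.owned_mask & bit and self.data[i] != owner_id:
                old = self.data[i]
                previous[old] = previous.get(old, 0) | bit
            self.owned_mask |= bit
            self.data[i] = owner_id  # data field tracks the owner ID
        return previous

    def owner_of(self, word: int):
        """Return the owner ID of a word, or None if the LLC holds the data."""
        return self.data[word] if self.owned_mask & (1 << word) else None
```

In this sketch two devices can own different words of the same line at once, which is the sense in which word-granularity ownership avoids false sharing.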

External Request Interface
External Request    Must handle if device supports state
ReqV                O
ReqO                O
ReqO+data           O
RvkO                O
Inv                 S
ReqS                S and O
Device states: GPU coh. L1: I, V; DeNovo L1: I, V, O; MESI L1: I, S, O
• Translation unit may implement functionality if not supported by the device
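The table above can be read as a rule for deriving which external requests a device (or its translation unit) has to handle from the states it supports. The sketch below encodes that rule; it assumes that "S and O" means both states must be supported, and the function name is hypothetical.

```python
# External request -> device states that make handling it mandatory
# (from the slide; "S and O" is interpreted as requiring both states).
REQUIRED_IF = {
    "ReqV": {"O"},
    "ReqO": {"O"},
    "ReqO+data": {"O"},
    "RvkO": {"O"},
    "Inv": {"S"},
    "ReqS": {"S", "O"},
}

# Device-side states as listed on the slide.
DEVICE_STATES = {
    "GPU coh. L1": {"I", "V"},
    "DeNovo L1": {"I", "V", "O"},
    "MESI L1": {"I", "S", "O"},
}


def must_handle(device_states: set) -> list:
    """List the external requests a device with these states must handle."""
    return [
        request
        for request, needed in REQUIRED_IF.items()
        if needed <= device_states  # every needed state is supported
    ]


if __name__ == "__main__":
    for name, states in DEVICE_STATES.items():
        print(name, must_handle(states))
```

Under this reading, a GPU coherence L1 with only I and V states handles no external requests, which is consistent with the slide's point that simple caches need little extra support.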

Evaluation: Configurations
Configuration    LLC protocol         CPU protocol    GPU protocol
HMG              Hierarchical MESI    MESI            GPU coherence
HMD              Hierarchical MESI    MESI            DeNovo
SMG              Spandex              MESI            GPU coherence
SMD              Spandex              MESI            DeNovo
SDG              Spandex              DeNovo          GPU coherence
SDD              Spandex              DeNovo          DeNovo
[Diagrams: hierarchical MESI (CPU L1s and GPU L1s, GPU L2, MESI LLC) and Spandex (CPU L1s and GPU L1s attached to the Spandex LLC)]
CPU-GPU workloads from Pannotia and Chai benchmark suites

Evaluation: CPU-GPU Applications
[Figure: execution time (cycles), normalized to 100%, for HMG, HMD, SMG, SMD, SDG, and SDD on BC, PR, HSTI, TRNS, RSCT, and TQH, plus the average with Hbest and Sbest]
• Different workloads prefer different protocols
• Spandex flexibility ⇒ consistently better execution time (avg 16% lower)

Evaluation: CPU-GPU Applications
[Figure: network traffic (flits), normalized to 100%, broken down into Probe, ReqWT+data, ReqWB/WT, ReqO[+data], and ReqV/S, for HMG, HMD, SMG, SMD, SDG, and SDD on BC, PR, HSTI, TRNS, RSCT, and TQH, plus the average with Hbest and Sbest]
• Spandex flexibility ⇒ consistently better network traffic (avg 27% lower)

Conclusion and Future Work
Spandex ⇒ Simple, Flexible, Efficient
Future Work: exploit SW or HW hints about data access patterns
• Dynamic Spandex request selection
• Producer-consumer forwarding
• Extended granularity flexibility
