Johnathan Alsop *, Matthew D. Sinclair* , Sarita V. Adve* *Illinois, - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Johnathan Alsop*, Matthew D. Sinclair*†‡, Sarita V. Adve* *Illinois, †AMD, ‡Wisconsin

Sponsors: NSF, C-FAR, ADA (JUMP center by SRC, DARPA)

slide-2
SLIDE 2

Specialized architectures are increasingly important in all compute domains


slide-3
SLIDE 3

Specialization Requires Better Memory Systems

Traditional heterogeneity: the CPU and the accelerator each have their own memory space, with explicit data in and data out.

  ✗ No fine-grain synchronization
  ✗ No irregular access patterns
  ✗ Wasteful data movement

Shared coherent memory: the CPU and the accelerator share one memory space holding coherent data.

  ✓ Fine-grain synchronization
  ✓ Irregular access
  ✓ Implicit data reuse

Existing solutions: complex and inflexible

slide-4
SLIDE 4

Heterogeneous devices have diverse memory demands

[Figure: memory-demand axes: spatial locality, temporal locality, throughput sensitivity, latency sensitivity, fine-grain synchronization]

slide-5
SLIDE 5

Heterogeneous devices have diverse memory demands

Typical CPU workloads: fine-grain synchronization, latency sensitive

[Figure: memory-demand axes: spatial locality, temporal locality, throughput sensitivity, latency sensitivity, fine-grain synchronization]

slide-6
SLIDE 6

Heterogeneous devices have diverse memory demands

Typical GPU workloads: spatial locality, throughput sensitive

[Figure: memory-demand axes: spatial locality, temporal locality, throughput sensitivity, latency sensitivity, fine-grain synchronization]

slide-7
SLIDE 7

MESI protocol fits CPU workloads

Properties     MESI                GPU coherence                DeNovo
Granularity    Line                Reads: Line, Writes: Word    Reads: Flexible, Writes: Word
Invalidation   Writer-invalidate   Self-invalidate              Self-invalidate
Updates        Ownership           Write-through                Ownership

MESI (good for: CPU):

  ✓ Spatial locality                ✗ False sharing
  ✓ Temporal locality for reads     ✗ Overheads limit throughput
  ✓ Temporal locality for writes    ✗ Indirection if low locality

slide-8
SLIDE 8

GPUs prefer simpler protocols

Good for:

  • MESI: CPU
  • GPU coherence: GPU

GPU coherence:

  ✓ No false sharing         ✗ Synch limits spatial locality
  ✓ Simple, scalable         ✗ Synch limits read reuse
  ✓ Simple, low overhead     ✗ Synch limits write reuse

slide-9
SLIDE 9

DeNovo is a good fit for CPU and GPU

Good for:

  • MESI: CPU
  • GPU coherence: GPU
  • DeNovo: CPU or GPU
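
The two invalidation styles in the properties table can be contrasted with a toy model. The sketch below is only an illustration under strong simplifying assumptions: one value per address, no ownership or dirty state, and memory updated on every write. All class and method names are hypothetical and do not come from the slides.

```python
class SharedMemory:
    """Backing store plus a sharer list (used only by writer-invalidation)."""

    def __init__(self):
        self.data = {}      # address -> value
        self.sharers = {}   # address -> set of caches holding a copy


class WriterInvalidateCache:
    """The writer removes stale copies from other caches when it writes."""

    def __init__(self, mem):
        self.mem = mem
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.mem.data.get(addr, 0)
            self.mem.sharers.setdefault(addr, set()).add(self)
        return self.lines[addr]

    def write(self, addr, value):
        for other in self.mem.sharers.get(addr, set()) - {self}:
            other.lines.pop(addr, None)   # invalidate remote copies
        self.mem.sharers[addr] = {self}
        self.lines[addr] = value
        self.mem.data[addr] = value       # simplification: no dirty state


class SelfInvalidateCache:
    """The reader drops its own possibly stale copies at synchronization."""

    def __init__(self, mem):
        self.mem = mem
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.mem.data.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.mem.data[addr] = value       # write-through to memory

    def acquire(self):
        self.lines.clear()                # self-invalidate at a sync point
```

In this model a self-invalidating reader keeps returning its cached value until it calls `acquire()`, which is why synchronization limits reuse; a writer-invalidating reader sees the new value on its next read, at the cost of sharer tracking and invalidation messages.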

slide-10
SLIDE 10

Existing Solutions: Inflexible and Inefficient

[Figure: block diagram of an existing heterogeneous memory hierarchy. Labels: CPU, GPU, Accel 1, Accel 2 (?), MESI L1, GPU coh. L1, DeNovo L1, MESI/GPU coh. hybrid L2, MESI LLC]

Examples: ARM ACE, IBM CAPI, AMD APU

slide-11
SLIDE 11

Existing Solutions: Inflexible and Inefficient

[Figure: block diagram of an existing heterogeneous memory hierarchy. Labels: CPU, GPU, Accel 1, Accel 2 (?), MESI L1, GPU coh. L1, MESI/GPU coh. hybrid L2, MESI LLC]

If the glove doesn’t fit… There’s limited benefit!

Examples: ARM ACE, IBM CAPI, AMD APU

slide-13
SLIDE 13

Spandex: Flexible Heterogeneous Coherence Interface

  • Adapts to exploit individual device’s workload attributes
  • Better performance, lower complexity

[Figure: block diagram of devices attached to Spandex. Labels: CPU, GPU, Accel 1, Accel 2 (?), MESI L1, GPU coh. L1, DeNovo L1, Spandex]

⇒ Fits like a glove for any heterogeneous system!

slide-14
SLIDE 14

Spandex Overview

Key Components

  • Flexible device request interface
  • DeNovo-based LLC
  • External request interface

Device may need a translation unit (TU)

[Figure: CPU (MESI L1), GPU (GPU coh. L1), and Accel ? (DeNovo L1), each shown with a TU, connect to the Spandex LLC through the device request interface and the external request interface]

slide-16
SLIDE 16

Device Request Interface

Action        Request       Indicates
Read          ReqV          Self-invalidation
Read          ReqS          Writer-invalidation
Write         ReqWT         Write-through
Write         ReqO          Ownership only
Read+Write    ReqWT+data    Atomic write-through
Read+Write    ReqO+data     Ownership + data
Writeback     ReqWB         Owned data eviction

Requests also specify granularity and (optionally) a bitmask
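
One way to picture the request table is as a lookup from (action, policy) to a request type, with a word bitmask attached. This is a minimal sketch, not the actual Spandex message format: the field names, the `make_request` helper, and the 16-words-per-line constant are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Req(Enum):
    REQ_V = "ReqV"
    REQ_S = "ReqS"
    REQ_WT = "ReqWT"
    REQ_O = "ReqO"
    REQ_WT_DATA = "ReqWT+data"
    REQ_O_DATA = "ReqO+data"
    REQ_WB = "ReqWB"


# (action, what the request indicates) -> request type, as in the table.
REQUEST_TABLE = {
    ("read", "self-invalidation"): Req.REQ_V,
    ("read", "writer-invalidation"): Req.REQ_S,
    ("write", "write-through"): Req.REQ_WT,
    ("write", "ownership only"): Req.REQ_O,
    ("read+write", "atomic write-through"): Req.REQ_WT_DATA,
    ("read+write", "ownership + data"): Req.REQ_O_DATA,
    ("writeback", "owned data eviction"): Req.REQ_WB,
}

WORDS_PER_LINE = 16  # assumption: e.g. 64-byte lines of 4-byte words


@dataclass(frozen=True)
class DeviceRequest:
    req: Req
    line_addr: int
    word_mask: int  # bit i set means word i of the line is targeted


def make_request(action, policy, line_addr, words=None):
    """Build a request; words=None targets the whole line."""
    req = REQUEST_TABLE[(action, policy)]
    if words is None:
        mask = (1 << WORDS_PER_LINE) - 1
    else:
        mask = 0
        for w in words:
            if not 0 <= w < WORDS_PER_LINE:
                raise ValueError(f"word index {w} out of range")
            mask |= 1 << w
    return DeviceRequest(req, line_addr, mask)
```

For example, `make_request("write", "write-through", 0x40, words=[0, 3])` yields a ReqWT whose mask selects only words 0 and 3, while omitting `words` gives a line-granularity request.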

slide-18
SLIDE 18

Spandex LLC

  • States: I, V, O, S
  • Allocation at line granularity
  • Ownership at word granularity
  • Data field tracks owner ID
  • May generate requests to owner/sharer

✓ No false sharing
✓ Non-blocking ownership transfer

[Figure: LLC entry with fields Tag, State, Data, and O Mask, where owned words hold an owner ID in the data field. Example flows between the device L1s (MESI L1, GPU coh. L1, DeNovo L1) and the LLC: ST → ReqWT → RspWT, and ST → ReqO → RspO]
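
The word-granularity ownership described above can be sketched as a per-word owner array on each LLC line. This is a simplified model under stated assumptions, not the Spandex implementation: the message tuples, the 4-word line, and the choice to forward a ReqO to the previous owner are illustrative, and write-through to a word that is currently owned is left out of scope.

```python
from typing import NamedTuple, Optional

WORDS_PER_LINE = 4  # assumption: kept small for readability


class Msg(NamedTuple):
    kind: str                        # e.g. "RspWT", "RspO", "ReqO"
    dest: str                        # device the LLC sends the message to
    words: list                      # word indices the message covers
    requester: Optional[str] = None  # set on forwarded requests


class LLCLine:
    def __init__(self):
        self.data = [0] * WORDS_PER_LINE
        self.owner = [None] * WORDS_PER_LINE  # None: the LLC copy is valid

    def req_wt(self, dev, updates):
        """Write-through: update the LLC copy of the given words."""
        for w in updates:
            if self.owner[w] is not None:
                raise NotImplementedError("owned word: out of scope here")
        for w, value in updates.items():
            self.data[w] = value
        return [Msg("RspWT", dev, sorted(updates))]

    def req_o(self, dev, words):
        """Ownership request: record the new owner per word, immediately."""
        msgs, granted = [], []
        for w in words:
            prev = self.owner[w]
            self.owner[w] = dev  # no blocking state while the transfer completes
            if prev is None or prev == dev:
                granted.append(w)
            else:
                msgs.append(Msg("ReqO", prev, [w], dev))  # forward to old owner
        if granted:
            msgs.append(Msg("RspO", dev, granted))
        return msgs
```

Because ownership is tracked per word, two devices can own different words of the same line without exchanging any messages, which is the "no false sharing" property listed on the slide.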

slide-20
SLIDE 20

External Request Interface

L1 states by device:

  • MESI L1: I, S, O
  • GPU coh. L1: I, V
  • DeNovo L1: I, V, O

External Request    Must handle if device supports state
ReqV                O
ReqO                O
ReqO+data           O
RvkO                O
Inv                 S
ReqS                S and O

  • Translation Unit may implement functionality if not supported by device

[Figure: example flow through the Spandex LLC with labels ReqV, ReqV, RspV, and O, apparently a ReqV forwarded to the owner of the data, which replies with RspV]
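
The external request table can be read as a rule: a device only has to handle an external request if its L1 supports the state that request targets. A minimal sketch of that rule follows. It assumes the state lists on the slide pair up as MESI L1: I, S, O; GPU coh. L1: I, V; DeNovo L1: I, V, O, and it reads "S and O" as requiring both states; both readings, and all names in the code, are assumptions.

```python
# External request -> states a device must support before it has to handle it.
EXTERNAL_REQUESTS = {
    "ReqV": {"O"},
    "ReqO": {"O"},
    "ReqO+data": {"O"},
    "RvkO": {"O"},
    "Inv": {"S"},
    "ReqS": {"S", "O"},  # assumption: "S and O" means both are required
}

# Assumed pairing of L1 protocols with their supported states.
L1_STATES = {
    "MESI": {"I", "S", "O"},
    "GPU coherence": {"I", "V"},
    "DeNovo": {"I", "V", "O"},
}


def must_handle(states):
    """Return the external requests a device with these states must handle."""
    supported = set(states)
    return [req for req, needed in EXTERNAL_REQUESTS.items()
            if needed <= supported]
```

Under these assumptions a GPU coherence L1 handles no external requests, a DeNovo L1 handles the four that target owned data, and a MESI L1 handles all six. Anything a device cannot handle natively is a candidate for its translation unit.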

slide-21
SLIDE 21

Evaluation: Configurations

Configuration   LLC protocol        CPU protocol   GPU protocol
HMG             Hierarchical MESI   MESI           GPU coherence
HMD             Hierarchical MESI   MESI           DeNovo
SMG             Spandex             MESI           GPU coherence
SMD             Spandex             MESI           DeNovo
SDG             Spandex             DeNovo         GPU coherence
SDD             Spandex             DeNovo         DeNovo

[Figure: two hierarchies. Hierarchical MESI: CPU L1s, GPU L1s, a GPU L2, and a MESI LLC. Spandex: CPU L1s and GPU L1s with a Spandex LLC]

CPU-GPU workloads from Pannotia and Chai benchmark suites
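
The three-letter configuration names follow a positional scheme: LLC protocol, then CPU protocol, then GPU protocol. The helper below is a hypothetical decoder that spells this out; it is only a reading aid for the table, not part of the evaluation setup.

```python
LLC = {"H": "Hierarchical MESI", "S": "Spandex"}
CPU = {"M": "MESI", "D": "DeNovo"}
GPU = {"G": "GPU coherence", "D": "DeNovo"}

# The six configurations evaluated in the slides.
CONFIGS = ["HMG", "HMD", "SMG", "SMD", "SDG", "SDD"]


def decode(name):
    """Expand a configuration name such as 'SMD' into its three protocols."""
    if len(name) != 3:
        raise ValueError("expected a three-letter configuration name")
    llc, cpu, gpu = name
    return {"llc": LLC[llc], "cpu": CPU[cpu], "gpu": GPU[gpu]}
```

For instance, `decode("SDG")` gives a Spandex LLC with DeNovo CPU caches and GPU coherence GPU caches.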

slide-22
SLIDE 22

Evaluation: CPU-GPU Applications

[Figure: bar chart of execution time (cycles), y-axis 0% to 120%, for configurations HMG, HMD, SMG, SMD, SDG, and SDD on workloads BC, PR, HSTI, TRNS, RSCT, and TQH, plus an average with Hbest and Sbest]

  • Different workloads prefer different protocols
  • Spandex flexibility ⇒ consistently better execution time (avg 16% lower)

slide-23
SLIDE 23

Evaluation: CPU-GPU Applications

[Figure: stacked bar chart of network traffic (flits), y-axis 0% to 100%, for configurations HMG, HMD, SMG, SMD, SDG, and SDD on workloads BC, PR, HSTI, TRNS, RSCT, and TQH, plus an average with Hbest and Sbest. Traffic categories: Probe, ReqWT+data, ReqWB/WT, ReqO[+data], ReqV/S]

  • Spandex flexibility ⇒ consistently better network traffic (avg 27% lower)

slide-24
SLIDE 24

Conclusion and Future Work

⇒ Simple, Flexible, Efficient

Future Work: exploit SW or HW hints about data access patterns

  • Dynamic Spandex request selection
  • Producer-consumer forwarding
  • Extended granularity flexibility

[Figure: labels Producer, Consumer, MESI LLC]
