
SLIDE 1

COHESION: A Hybrid Memory Model for Accelerators

John H. Kelm, Daniel R. Johnson, William Tuohy, Steven S. Lumetta and Sanjay J. Patel University of Illinois at Urbana-Champaign

SLIDE 2

Chip Multiprocessors Today

  • General-purpose + accelerators (e.g., GPUs)
  • General-purpose CMP Challenges:
    1. Programmability
    2. Power/perf density of ILP-centric cores
    3. Scalability of HW coherence, strict memory models
  • Accelerator Challenges:
    1. Inflexible programming/execution models
    2. Hard to scale irregular parallel apps
    3. Lack of conventional memory model

2 John H. Kelm

SLIDE 3

Our Proposal: Hybrid Memory Model

Chip Multiprocessors Tomorrow

  • Industry Trend: Integration over time
  • Hybrids: Accelerators + CPUs together on die
  • More core/compute heterogeneity but…

…more homogeneity in memory model

[Figure: timeline from past to future. Past and present: discrete CPU and GPU, each with its own memory (CPU+MEM, GPU+MEM). Future: CPU, accelerator, and GPU integrated on a single die.]

SLIDE 4

CMP Memory Model Choices

Contemporary Accelerator (e.g., NVIDIA GPU, IBM Cell)

  • Optimized for:
    – Maximum throughput
    – Loosely coupled sharing
    – Coarse-grained synchronization
    – Short silicon design cycle
  • Provides:
    – Multiple address spaces
    – Scratchpad memories
    – Relaxed ordering
    – SW-managed coherence

Conventional Multicore CPU (e.g., Intel i7, Sun Niagara)

  • Optimized for:
    – Minimal latency
    – Tightly coupled sharing
    – Fine-grained synchronization
    – Minimal programmer effort
  • Provides:
    – Single address space
    – Hardware caching
    – Strong ordering
    – HW-managed coherence

SLIDE 5

Roadmap

  • Motivation and context
  • Problem statement
  • COHESION design
  • Use cases and programming examples

Addressed in this talk:

  • 1. Opportunity: Is combining protocols worthwhile?
  • 2. Feasibility: How does one implement hybrid memory models?
  • 3. Tradeoffs: What are the tradeoffs in HWcc v. SWcc?
  • 4. Benefit: What does hybrid coherence get you?


SLIDE 6

Problem: Scalable Coherence

  • Available architectures:
    – Accelerators: 100s of cores, TFLOPS, no coherence
    – CMPs: <10s of cores, GFLOPS, HW coherence
    – Multiple memory models on-die
  • What devs want in heterogeneous CMPs:
    – Hardware caches (locality)
    – Single address space (marshalling)
    – Minimal changes to current practices
  • Accelerator scalability w/ CMP memory model

SLIDE 7

Baseline Architecture

  • Variant of the Rigel Architecture [Kelm et al. ISCA’09]
  • 1024-core CMP, HW caches, single address space, MIMD

[Figure: 1024-core processor organization. 128 Rigel clusters (Cluster0–Cluster127), each with eight cores (C0–C7) sharing an L2 cache, connect through an unordered multistage interconnect to eight L3 cache banks (L3$0–L3$7), each paired with a directory and directory controller and backed by DRAM banks 0–7.]

SLIDE 8

Opportunity: HWcc v. SWcc Shootout

  • Note: Lower bars are better
  • Question: Can we leverage both HW+SW protocols?

* SWcc based on the Task-Centric Memory Model [Kelm et al. PACT’09][Kelm et al. IEEE Micro ’10]

[Bar chart: runtime normalized to Ideal SWcc (0.0x–1.4x) for cg, dmm, gjk, sobel, kmeans, mri, march, heat, and stencil. Series: Ideal SWcc, Best SWcc (of 4 policies), and Full Directory (ideal on-die); benchmarks group into SWcc Wins, HWcc Wins, and Parity regions.]

SLIDE 9

Opportunity: Network Traffic Reduction

  • SWcc w/ baseline arch (left), HWcc w/ DirFULL (right)
  • SWcc: Fewer L2 messages in network, some flush overhead
  • HWcc: Extraneous messages for unshared data (WrRequest, RdRelease)

[Bar chart: relative number of messages (0.0–2.5) for SWcc and HWcc on cg, dmm, gjk, heat, kmeans, mri, sobel, and stencil, broken down by message type: read requests, write requests, instruction requests, uncached/atomic operations, cache evictions, software flushes, probe responses, and read releases.]

SLIDE 10

Opportunity: Reduce Directory Utilization

  • Not all entries used → wasted die area
  • For many benchmarks, the 256K maximum is never reached (red line)
  • Observations:
    1. Use SWcc when possible to reduce network traffic
    2. Build a smaller sparse directory for the common case

[Bar chart: average number of directory entries allocated (0K–250K) per benchmark, broken down by stack, heap/global, and code, with lines marking the maximum allocated and the 256K maximum possible entries.]
SLIDE 11

COHESION: Toward a Hybrid Memory Model

  • Support for coherence domain transitions
  • 1. Protocol for safe migration SWcc ↔ HWcc
  • 2. Minor architecture extensions
  • Motivation:

↑ HWcc: Supports arbitrary sharing, no SW overhead
↓ HWcc: Area + message overhead
↑ SWcc: Removes HW overheads + design complexity
↓ SWcc: Flush overhead + coherence burden on SW

SLIDE 12

Protocol Synthesis

  • Create a bridge between SWcc and HWcc
  • Leverage existing HWcc and SWcc techniques
  • Synchronize SW-to-HW transitions

[State diagram: SWcc states (Clean SWCL, Immutable SWIM, Private Clean SWPC, Private Dirty SWPD) and HWcc states (Invalid HWI, Shared HWS, Modified HWM), tracked per line and per word. Loads, stores, writebacks, and invalidations (LD/ST/WB/INV) move data among the SWcc states; directory messages (RdReq, WrReq, RdRel, WrRel) move lines among the HWcc states; SW-to-HW transitions are synchronized at the boundary.]
SLIDE 13

COHESION Architecture

  • Extension to baseline directory protocol:
    – Addition 1: Region table/bit vector in memory
    – Addition 2: One bit/line in the L2 cache (not shown)
  • SW writes table → COHESION controller executes the transition

[Figure: a sparse directory (one per L3$ bank; entries hold I/M/S state, tag, and sharers), a coarse-grain region table (strided across L3 banks; entries hold start_addr, size, valid), and a fine-grain region table (a global table of coherence bit vectors, 1 bit per line in memory; a 16 MB table covers 4 GB of memory spanning the code, stack, and global data segments).]
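A minimal sketch of the fine-grain table's indexing, assuming 32-byte cache lines, which makes the slide's sizes consistent: one bit per line of a 4 GB memory is 2^27 bits, i.e., a 16 MB table. The helper names are hypothetical; on the real design the table is written by software and consulted by the COHESION controller.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 32u  /* assumed cache line size */

/* One coherence bit per line: set => HWcc, clear => SWcc.
   16 MB bit vector covering a 4 GB physical address space. */
static uint8_t cc_table[1u << 24];

void set_cc(uint32_t addr, int hwcc) {
    uint32_t line = addr / LINE_BYTES;
    uint8_t  mask = (uint8_t)(1u << (line % 8));
    if (hwcc) cc_table[line / 8] |=  mask;
    else      cc_table[line / 8] &= (uint8_t)~mask;
}

int line_is_hwcc(uint32_t addr) {
    uint32_t line = addr / LINE_BYTES;
    return (cc_table[line / 8] >> (line % 8)) & 1u;
}
```

Note that all addresses within one 32-byte line share a single bit, so a domain transition always applies to the whole line.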

SLIDE 14

Example Software → Hardware Transitions

  • App. initiates transitions between SWcc and HWcc
  • COHESION controller probes L2’s to reconstruct state
  • See paper for other cases and HWcc → SWcc

[Figure: three cases showing the contents of CACHE0, CACHE1, and MEMORY for a line holding words A and B (including a dirty copy A’), and the directory state (I/S/M plus sharer bits) that the controller reconstructs in each case.]

SLIDE 15

Static COHESION Example (1 of 3)

  • COHESION provides static partitioning of data
  • (Large) read-only/private regions → SWcc
  • (Small) shared regions → HWcc

[Figure: data regions for two grid blocks from a 2D stencil computation; each block’s private interior is SWcc, while the boundary regions written by one block and read by its neighbor are HWcc (writer/reader).]
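The static partitioning above can be sketched as a one-time marking pass over a grid block, assuming a one-cell halo; `cell_hwcc` models the per-region coherence decision and all names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

#define N 8  /* toy block size */

/* true => cell's region is HWcc; false => SWcc (private) */
static bool cell_hwcc[N][N];

/* Interior cells are touched only by the owning core -> SWcc;
   the one-cell boundary is read by neighboring blocks -> HWcc. */
void partition_block(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            cell_hwcc[i][j] = (i == 0 || i == N - 1 ||
                               j == 0 || j == N - 1);
}
```

Because the partitioning is static, no domain transitions occur during the computation; the large interior never pays directory overhead and the small halo never pays software flushes.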

SLIDE 16

Dynamic COHESION Example (2 of 3)

Parallel Sort (on four cores)

  • 1. Parallel Quicksort
  • 2. Sequential Selection Sort (Phase 0)
  • 3. Sequential Selection Sort (Phases 1–N)
  • 4. Result Visible to All

[Figure: the array, split across cores P0–P3 as sorting proceeds, with regions labeled SWcc Data, HWcc Data, and SWcc → HWcc transitions as the data moves from unsorted to sorted.]
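The dynamic use case can be sketched as domain transitions at phase boundaries: each core's private partition moves to SWcc for its local sort, then back to HWcc to publish the result. `region_set` and the flush counter are illustrative stand-ins for region-table writes and writeback traffic.

```c
#include <assert.h>

typedef enum { HWCC, SWCC } cc_t;
typedef struct { int lo, hi; cc_t cc; } region_t;

static int flushes;  /* SWcc -> HWcc requires flushing dirty lines */

/* Model of a region-table write at a phase boundary. */
void region_set(region_t *r, cc_t cc) {
    if (r->cc == SWCC && cc == HWCC) flushes++;
    r->cc = cc;
}

void sort_phases(region_t *parts, int n) {
    for (int i = 0; i < n; i++) region_set(&parts[i], SWCC); /* local sort */
    for (int i = 0; i < n; i++) region_set(&parts[i], HWCC); /* publish    */
}
```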

SLIDE 17

System SW COHESION Example (3 of 3)

  • Problem: Supporting multitasking w/ SWcc
  • OS process creation workflow:
  • 1. Runtime allocates proc’s memory → HWcc
  • 2. Start new process
  • 3. Process runs, migrates, makes SWcc ↔ HWcc transitions
  • 4. Exit process
  • 5. Runtime makes allocated memory HWcc
  • COHESION enables: Migration, isolation, cleanup
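The cleanup step of this workflow can be sketched as runtime bookkeeping: track which of a process's regions were moved to SWcc so that exit-time cleanup returns every allocation to HWcc before the memory is reused. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { size_t base, len; bool swcc; } region_t;

/* Step 5 of the workflow: on process exit, return every region
   the process moved to SWcc back to HWcc (modeled as a flag flip;
   the real transition also flushes dirty lines). */
void proc_exit_cleanup(region_t *regions, int n) {
    for (int i = 0; i < n; i++)
        if (regions[i].swcc)
            regions[i].swcc = false;
}
```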

SLIDE 18

Network Message Reductions

  • HWccReal: HWcc-only w/ sparse directory used by COHESION
  • HWccIdeal: Full on-die directory
  • Benefit: lessens constraints on network design

[Bar chart: relative number of messages (0.0–5.0) for SWcc, Cohesion, HWccIdeal, and HWccReal on cg, dmm, gjk, heat, kmeans, mri, sobel, and stencil, broken down by message type: read requests, write requests, instruction requests, uncached/atomics, cache evictions, software flushes, read releases, and probe responses.]

SLIDE 19

Directory Size Sensitivity

  • Reduces perf. cliffs in sparse directory designs
  • Benefit: Smaller on-die coherence structures

[Line charts: normalized runtime (vs. DirFULL, 0.0x–8.0x) as directory entries per L3 cache bank vary from 256 to 16384, with COHESION (left) and without COHESION (right), for cg, dmm, gjk, heat, kmeans, mri, sobel, and stencil.]

SLIDE 20

Runtime: COHESION, SWcc, HWcc

  • Perf. close to SWcc and full-directory HWcc
  • Reduces network/directory overhead w/o perf. loss
  • Further HWcc → SWcc optimizations possible

[Bar chart: runtime normalized to Cohesion (0.0x–2.0x, with clipped bars labeled 7.09x, 9.19x, 9.21x, and 3.88x) for cg, dmm, gjk, heat, kmeans, mri, sobel, and stencil. Series: Cohesion, SWcc, HWccOpt, HWcc.]

SLIDE 21

Conclusions

  • Why COHESION? CMPs w/ multiple memory models
  • Usage scenarios identified:
    – System software/migratory tasks w/ SWcc
    – App uses: Static, dynamic, and host+accel
    – Optimization path: Piecemeal HWcc → SWcc
  • Hybrid memory model has potential:
    – Reduces strain on HWcc implementation
    – Reduces network constraints
    – Competitive performance