COHESION: A Hybrid Memory Model for Accelerators
John H. Kelm, Daniel R. Johnson, William Tuohy, Steven S. Lumetta and Sanjay J. Patel University of Illinois at Urbana-Champaign
Chip Multiprocessors Today: General-purpose + accelerators
1. Programmability
2. Power/perf density of ILP-centric cores
3. Scalability of HW coherence, strict memory models

1. Inflexible programming/execution models
2. Hard to scale irregular parallel apps
3. Lack of conventional memory model
Our Proposal: Hybrid Memory Model
[Figure: block diagram showing CPU, GPU, and accelerator components with their memories (MEM).]
Conventional Multicore CPU
– Minimal latency
– Tightly coupled sharing
– Fine-grained synchronization
– Minimal programmer effort
– Single address space
– Hardware caching
– Strong ordering
– HW-managed coherence

Contemporary Accelerator
– Maximum throughput
– Loosely coupled sharing
– Coarse-grained synchronization
– Short silicon design cycle
– Multiple address spaces
– Scratchpad memories
– Relaxed ordering
– SW-managed coherence
1024-core Processor Organization
[Figure: Each Rigel cluster holds eight cores (C0–C7) that share an L2 cache. Clusters (numbered up to Cluster127) connect through an unordered multistage interconnect to the L3 cache banks (L3$0–L3$7 shown), each with a directory controller and directory, in front of DRAM banks 0–7.]
* SWcc is based on the Task-centric Memory Model [Kelm et al. PACT’09][Kelm et al. IEEE Micro’10]
[Figure: Runtime normalized to Ideal SWcc for cg, dmm, gjk, sobel, kmeans, mri, march, heat, and stencil. Series: Ideal SWcc; Best SWcc (of 4 policies); Full Directory (ideal on-die).]
[Figure: Relative number of messages under SWcc and HWcc for cg, dmm, gjk, heat, kmeans, mri, sobel, and stencil. Message categories: read requests, write requests, instruction requests, uncached/atomic operations, cache evictions, software flushes, probe responses, and read releases.]
[Figure: Average number of directory entries allocated (axis 0K to 250K). Legend: Stack; Heap/Global; Code; Maximum Allocated.]
[Figure: Protocol state diagram.
Software-managed states: Clean (SWCL), Immutable (SWIM), Private Dirty (SWPD), Private Clean (SWPC).
Hardware-managed states: Invalid (HWI), Shared (HWS), Modified (HWM).
Edge labels: LD, ST, WB, INV, RdReq, WrReq, RdRel, WrRel.
Annotations: "(Per Line)", "(Per Word)", "Synchronize SW-to-HW Transitions".]
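[Figure: Coherence metadata structures.
Panel titles: Sparse Directory; Coarse-grain Region Table; Fine-grain Region Table.
Panel captions, in extracted order: (One per L3$ bank); (Strided across L3 banks); (Global Table).
Sparse directory: sets set0 … setn-1, ways w0 … wm-1; each entry holds tag, I/M/S state, and sharers.
Region table: entries of start_addr, size, and valid, covering the code segment, stack segment, and global data.
Coherence bit vectors: 1 bit/line in memory, located at base_addr; a 16 MB table covers 4 GB of memory (address labels 0x00000000 and 0xFFFFE000).]

– Addition 1: Region table/bit vector in memory
– Addition 2: One bit/line in the L2 cache (not shown)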
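
The figure's numbers (1 bit per line, a 16 MB table for 4 GB of memory) are consistent with 32-byte cache lines: 4 GB / 32 B = 2^27 lines, and 2^27 bits = 16 MB. The sketch below checks that arithmetic and shows one way a per-line coherence bit could be located and updated. It is an illustration under stated assumptions, not the authors' implementation: the 32-byte line size is inferred from the table size, the convention that a set bit means "software-managed" is an arbitrary choice here, and the names `table_bytes`, `bit_location`, and `CoherenceBitVector` are hypothetical.

```python
LINE_BYTES = 32            # assumption: inferred from 16 MB table / 4 GB memory
MEM_BYTES = 4 * 2**30      # 4 GB memory (from the slide)


def table_bytes(mem_bytes=MEM_BYTES, line_bytes=LINE_BYTES):
    """Size in bytes of a bit vector holding one bit per cache line."""
    return mem_bytes // line_bytes // 8


def bit_location(addr, base_addr=0, line_bytes=LINE_BYTES):
    """Map a data address to (table byte address, bit offset within that byte)."""
    line = addr // line_bytes
    return base_addr + line // 8, line % 8


class CoherenceBitVector:
    """Hypothetical model of the in-memory table: bit = 1 means SWcc (assumed)."""

    def __init__(self, mem_bytes, line_bytes=LINE_BYTES):
        self.line_bytes = line_bytes
        self.bits = bytearray(table_bytes(mem_bytes, line_bytes))

    def set_swcc(self, start, size, swcc=True):
        """Mark every line overlapping [start, start + size) as SWcc or HWcc."""
        first = start // self.line_bytes
        last = (start + size - 1) // self.line_bytes
        for line in range(first, last + 1):
            byte, bit = divmod(line, 8)
            if swcc:
                self.bits[byte] |= 1 << bit
            else:
                self.bits[byte] &= ~(1 << bit) & 0xFF

    def is_swcc(self, addr):
        """Return True if the line containing addr is marked SWcc."""
        byte, bit = divmod(addr // self.line_bytes, 8)
        return bool((self.bits[byte] >> bit) & 1)


if __name__ == "__main__":
    print(table_bytes() // 2**20, "MB table for 4 GB of memory")
    bv = CoherenceBitVector(mem_bytes=2**20)   # small memory for the demo
    bv.set_swcc(0x100, 64)                     # two 32-byte lines become SWcc
    print(bv.is_swcc(0x100), bv.is_swcc(0x140))
```

A flat bit vector keeps the lookup to one shift and one mask per access, which is presumably why a one-bit-per-line layout is attractive; the cost is the fixed table footprint computed by `table_bytes`.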
[Figure: Three cases (Case 1, Case 2, Case 3) showing the contents of CACHE0, CACHE1, and MEMORY for lines A and B (with I and A’ entries), alongside the corresponding DIRECTORY STATE (I, S, M) and sharer bits.]
[Figure: Data regions for two grid blocks from a 2D stencil computation. Regions are labeled HWcc (writer), HWcc (reader), and SWcc (private).]
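
One way to read the stencil figure is that cells on a block's boundary, which a neighboring block reads, sit in the HWcc domain, while interior cells touched only by the owning task stay SWcc. The sketch below labels the cells of one grid block on that reading. It is a hedged illustration only: the boundary-versus-interior split, the `halo` width parameter, and the function name `classify_block` are assumptions, not details taken from the slides.

```python
def classify_block(rows, cols, halo=1):
    """Label each cell of one grid block as 'HWcc' (edge) or 'SWcc' (interior).

    Assumption: cells within `halo` cells of the block edge are shared with
    a neighboring block and are therefore hardware-coherent; all remaining
    cells are private to the block and software-managed.
    """
    labels = []
    for r in range(rows):
        row = []
        for c in range(cols):
            on_edge = (
                r < halo or r >= rows - halo or
                c < halo or c >= cols - halo
            )
            row.append("HWcc" if on_edge else "SWcc")
        labels.append(row)
    return labels


if __name__ == "__main__":
    for row in classify_block(4, 6):
        print(" ".join(row))
```

Under this split the SWcc share grows with block size, since the interior scales with the block's area while the boundary scales with its perimeter.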
[Figure: Sort (Phase 0) and Sort (Phases 1-N), with work assigned to P0 through P3. Unsorted and sorted data are each shown split into SWcc data and HWcc data.]
[Figure: Relative number of messages under SWcc, Cohesion, HWccIdeal, and HWccReal for cg, dmm, gjk, heat, kmeans, mri, sobel, and stencil. Message categories: read requests, write requests, instruction requests, uncached/atomics, cache evictions, software flushes, read releases, and probe responses.]
[Figure: Normalized runtime (v. DirFULL) versus directory entries per L3 cache bank (256 to 16384) for cg, dmm, gjk, heat, kmeans, mri, sobel, and stencil. Two panels: with COHESION and without COHESION. Runtime axis spans 0.0x to 8.0x.]
[Figure: Runtime normalized to Cohesion for cg, dmm, gjk, heat, kmeans, mri, sobel, and stencil. Series: Cohesion, SWcc, HWccOpt, HWcc. Axis spans 0.0x to 2.0x; bars exceeding the axis are labeled 7.09x, 9.19x, 9.21x, and 3.88x.]