Exploiting Commutativity to Reduce the Cost of Updates to Shared Data in Cache-Coherent Systems
Guowei Zhang, Webb Horn, Daniel Sanchez
MICRO 2015
Executive summary
Updates to shared data limit parallelism in current systems
Insight: many updates are commutative
Coup extends cache coherence protocols to make commutative updates as cheap as reads
Maintains coherence and consistency
Accelerates update-heavy applications by up to 2.4x
Updates are expensive
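[Figure: timeline of two cores with private caches (Core/$ 0, Core/$ 1) over a shared cache that initially holds A: 20. Instructions shown, split across Core 0 and Core 1: add(A, 1); add(A, 1); add(A, 1); read(A); add(A, 2); add(A, 2); add(A, 2). The line moves between the private caches as each core applies its update (+1 gives A: 21, +2 gives A: 23).]
Costs: traffic, serialization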
Updates are expensive, even with RMOs
[Figure: same two-core timeline and instruction sequence, with an ALU at the shared cache. The cores send their updates (+1, +2) to the shared cache, which applies them (A: 21, A: 23).]
Costs: traffic, serialization, complicates consistency
Coup: exploiting commutativity
[Figure: same two-core timeline and instruction sequence. The private caches hold partial updates instead of the value (A: +0 and A: +0, then A: +1 and A: +2), and an ALU at the shared cache combines them with A: 20 (A: 23, then A: 29).]
Low traffic
Concurrent updates
Simple consistency
Less general than RMOs
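The slide's example can be replayed in a few lines. This is a toy sketch, not the Coup hardware: the class name ToyCoupLine, its fields, and the assignment of add(A, 1) to Core 0 and add(A, 2) to Core 1 are assumptions made for illustration, and only addition is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCoupLine:
    """One cache line under a toy update-only scheme (addition only)."""
    shared_value: int                           # value held by the shared cache
    deltas: dict = field(default_factory=dict)  # core id -> buffered partial update
    reductions: int = 0                         # number of reductions performed

    def comm_add(self, core: int, v: int) -> None:
        # A core's buffer starts at the identity element (+0) and then
        # accumulates locally, without contacting any other core.
        self.deltas[core] = self.deltas.get(core, 0) + v

    def read(self) -> int:
        # A read needs the full value, so every buffered partial update
        # is folded into the shared copy and the buffers are dropped.
        if self.deltas:
            self.shared_value += sum(self.deltas.values())
            self.deltas.clear()
            self.reductions += 1
        return self.shared_value


line = ToyCoupLine(shared_value=20)
for _ in range(3):
    line.comm_add(core=0, v=1)  # assumed: Core 0 issues add(A, 1) three times
    line.comm_add(core=1, v=2)  # assumed: Core 1 issues add(A, 2) three times
result = line.read()
print(result)  # 29
```

Six updates cost one reduction here, and the final value matches the A: 29 shown on the slide.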
Commutative updates are common
Applications: reduction variables, iterative algorithms, graph traversal, reference counting
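What makes an update commutative can be checked directly: the final value must not depend on the order in which updates are applied. A minimal sketch, assuming addition and bitwise OR as the update operations; the helper apply_updates and the sample values are hypothetical.

```python
from itertools import permutations


def apply_updates(initial, updates, op):
    """Apply updates one by one, in the given order."""
    value = initial
    for u in updates:
        value = op(value, u)
    return value


# Addition: every ordering of the updates yields the same final value.
add_results = {
    apply_updates(20, order, lambda a, b: a + b)
    for order in permutations([1, 2, 1, 2])
}

# Bitwise OR: also order-independent.
or_results = {
    apply_updates(0b0001, order, lambda a, b: a | b)
    for order in permutations([0b0010, 0b0100, 0b0010])
}

# A plain write overwrites the old value, so the last writer wins
# and the result depends on the order.
write_results = {
    apply_updates(20, order, lambda a, b: b)
    for order in permutations([1, 2])
}

print(add_results, or_results, write_results)
```

Addition and OR each produce a single possible result, while plain writes produce one result per ordering, which is why they cannot be buffered and merged the same way.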
Software privatization vs. Coup
[Figure: privatization replaces the one read-only copy X with multiple thread-private, update-only copies X.0, X.1, …, X.N; a reduction merges them back into a single copy.]
Software privatization
Needs to amortize privatization/reduction costs
Wastes shared cache & memory capacity
Must apply selectively
Coup
No overheads
No wasted capacity
Apply to any update that might commute
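For comparison, software privatization of a histogram might look like the sketch below. The thread count, bin count, and input data are made-up values, and the code illustrates the general technique rather than any benchmark from the talk.

```python
import threading

NUM_THREADS = 4
NUM_BINS = 8
data = [i % NUM_BINS for i in range(10_000)]

# Privatization: one update-only copy of the histogram per thread (X.0 ... X.N).
# This costs NUM_THREADS times the memory of the single shared copy.
private = [[0] * NUM_BINS for _ in range(NUM_THREADS)]


def worker(tid: int) -> None:
    local = private[tid]
    # Each thread handles its own slice and touches only its own copy,
    # so no synchronization is needed during the update phase.
    for x in data[tid::NUM_THREADS]:
        local[x] += 1


threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Reduction: fold the private copies back into the single copy X.
histogram = [sum(column) for column in zip(*private)]
print(histogram)
```

The extra copies and the final reduction pass are the overheads the slide refers to; they pay off only when each thread performs enough updates to amortize them.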
Outline
Introduction
Coup
Evaluation
Structural changes
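[Figure: Core 0 … Core N-1, each with a private cache (Private Cache 0 … Private Cache N-1), connected to a shared cache/directory with an added reduction unit.]
ISA: load(&x), store(&x, v), …, plus new instructions comm_add(&x, v), comm_or(&x, v), …
Coherence states: M, S, I, plus a new state U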
Example: extending MSI
[Figure: state-transition diagrams for MSI (states M, S, I) and MUSI (states M, U, S, I), with edges labeled by request type.]
Legend
States: M = Modified, S = Shared (read-only), I = Invalid, U = Update-only
Requests: R = Read, W = Write, C = Commutative update
Transitions: initiated by own core (gain permissions) or initiated by others (lose permissions)
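One plausible reading of the legend, for transitions initiated by a core's own requests, is sketched below. The permission sets and the target states are assumptions inferred from the state names, not a transcription of the diagram, and transitions initiated by other cores are left out.

```python
# ASSUMPTION: permissions each stable state grants to the local core.
PERMISSIONS = {
    "M": {"R", "W", "C"},  # Modified: exclusive copy, any request hits
    "U": {"C"},            # Update-only: commutative updates only
    "S": {"R"},            # Shared: read-only
    "I": set(),            # Invalid: no permissions
}

# ASSUMPTION: state the requester ends up in after a miss on each request.
STATE_FOR_REQUEST = {"R": "S", "W": "M", "C": "U"}


def next_state(state: str, request: str) -> str:
    """State of the requester's copy after its own request (toy model)."""
    if request in PERMISSIONS[state]:
        return state                   # hit: no coherence action needed
    return STATE_FOR_REQUEST[request]  # miss: gain the needed permission


for s in "MUSI":
    print(s, {r: next_state(s, r) for r in "RWC"})
```

In this reading, a read that finds the line in U must leave U, which is the point where the buffered partial updates would have to be reduced.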
Coherence and consistency
Coherence is maintained
Consistency is not affected
See paper for proofs
Implementation and verification
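No extra stable states
Easy to verify
[Figure: detailed protocol state diagrams with stable and transient states. One diagram has stable states M, E, S, I and transient states IS, SM, IM, WB, ISI, xMI, xMS, WBI; the other has stable states M, E, N, I and transient states IN, NM, IM, WB, xMI, xMN, WBI, xNI, NN.]
Legend
States: stable; transient (split, e.g. IM; race, e.g. xMI)
Transitions initiated by: own request (R, W, C, wback); response to own request; inval/downgrade request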
Evaluation Methodology
[Figure: simulated system with 1-8 processor chips and 1-8 L4 cache & global directory chips.]
[Figure: processor chip organization. Core 0 … Core 15, each with private L1I, L1D, and L2 caches (L2 0 … L2 15); a shared L3 and chip directory; links to the L4 chips.]
8 sockets × 16 cores/socket = 128 cores
Coup vs. Atomic Operations
[Figure: speedup vs. number of cores (1 to 128) under MESI and COUP, one panel each for histogram, spmv, pagerank, bfs, and fluidanimate.]
[Figure: normalized AMAT under MESI and COUP for histogram, spmv, pagerank, bfs, and fluidanimate.]
Benchmarks: histogram, spmv, pagerank, bfs, fluidanimate
Fraction of commutative instructions: 1.0%, 2.4%, 4.9%, 0.40%, 0.96%
Modifying algorithms to exploit Coup
Delayed deallocation reference counting
[Figure: performance of Refcache vs. Coup (axis range 0.5 to 2.5).]
Scheme                 Data structure
Refcache [1]           Hash table
Coup implementation    Hierarchical bit vectors + comm add/or
[1] Clements et al., EuroSys 2013
Conclusions
Coup allows concurrent commutative updates
Maintains coherence and consistency
Coup implementation accelerates single-word updates
Minor hardware overhead
Accelerates update-heavy applications by up to 2.4x
Coup opens exciting research avenues
Commutativity-aware hardware transactional memory
Support for arbitrary update functions and semantic commutativity