EXPLOITING COMMUTATIVITY TO REDUCE THE COST OF UPDATES TO SHARED DATA IN CACHE-COHERENT SYSTEMS

SLIDE 1

EXPLOITING COMMUTATIVITY TO REDUCE THE COST OF UPDATES TO SHARED DATA IN CACHE-COHERENT SYSTEMS

GUOWEI ZHANG, WEBB HORN, DANIEL SANCHEZ

MICRO 2015

SLIDE 2

Executive summary

- Updates to shared data limit parallelism in current systems
- Insight: Many updates are commutative
- Coup extends cache coherence protocols to make commutative updates as cheap as reads
- Maintains coherence and consistency
- Accelerates update-heavy applications significantly

SLIDE 3

Updates are expensive

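[Figure: timeline of two cores, each with a private cache (Core/$ 0, Core/$ 1), updating a value A that starts at 20 in the shared cache. Instruction streams for Core 0 and Core 1, as extracted: add(A, 1); add(A, 1); add(A, 1); read(A); add(A, 2); add(A, 2); add(A, 2). Annotations: A: 21, A: 23, +1, +2.]

Costs: traffic, serialization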

SLIDE 4

Updates are expensive, even with RMOs

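[Figure: the same two-core example with RMOs (remote memory operations). A starts at 20 in the shared cache, and the figure adds an ALU. Instruction streams for Core 0 and Core 1, as extracted: add(A, 1); add(A, 1); add(A, 1); read(A); add(A, 2); add(A, 2); add(A, 2). Annotations: A: 21, A: 23, +1, +2.]

Costs: traffic, serialization, complicates consistency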

SLIDE 5

Coup: exploiting commutativity

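[Figure: the same two-core example under Coup. A starts at 20 in the shared cache. Instruction streams for Core 0 and Core 1, as extracted: add(A, 1); add(A, 1); add(A, 1); read(A); add(A, 2); add(A, 2); add(A, 2). Private-cache annotations: A: +0, A: +0, A: +1, A: +2. Other annotations: ALU, +1, +2, A: 23, A: 29.]

- Low traffic
- Concurrent updates
- Simple consistency
- Less general than RMOs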
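
The Coup timeline on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the hardware design: the `SharedLine` class and its `add`/`read` methods are hypothetical names, each core's update-only copy is modeled as a buffered delta that starts at +0, and a read is assumed to trigger the reduction of all buffered deltas.

```python
from dataclasses import dataclass, field


@dataclass
class SharedLine:
    """Toy model of one cache line with per-core update-only copies (illustrative only)."""
    value: int                                   # value held at the shared cache
    partial: dict = field(default_factory=dict)  # core id -> locally buffered delta

    def add(self, core: int, delta: int) -> None:
        # A core's copy starts at the identity of addition (+0) and then
        # accumulates updates locally, without involving other cores.
        self.partial[core] = self.partial.get(core, 0) + delta

    def read(self) -> int:
        # A read forces a reduction: fold every buffered delta into the value.
        self.value += sum(self.partial.values())
        self.partial.clear()
        return self.value


line = SharedLine(value=20)      # A: 20 in the shared cache
for _ in range(3):
    line.add(core=0, delta=1)    # Core 0: add(A, 1) three times
    line.add(core=1, delta=2)    # Core 1: add(A, 2) three times
result = line.read()             # read(A) sees every update
print(result)
```

The final value is 20 + 3×1 + 3×2 = 29, matching the A: 29 annotation, and it does not depend on how the two cores' updates interleave.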

SLIDE 6

Commutative updates are common

- Operations
- Applications
  - Reduction variables
  - Iterative algorithms
  - Graph traversal
  - Reference counting

SLIDE 7

Software privatization vs. Coup

[Figure: privatization turns one read-only copy, X, into multiple thread-private, update-only copies, X.0, X.1, …, X.N; a reduction merges them back.]

Software privatization
- Needs to amortize privatization/reduction costs
- Wastes shared cache & memory capacity
- Must apply selectively

Coup
- No overheads
- No wasted capacity
- Apply to any update that might commute
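
For comparison, software privatization can be sketched as follows. This is a minimal illustration, assuming a histogram workload; the function name `privatized_histogram` and its parameters are hypothetical. Each worker updates its own private copy, and a final reduction merges the copies, which makes the extra memory (one copy per thread) and the reduction cost visible.

```python
from concurrent.futures import ThreadPoolExecutor


def privatized_histogram(data, num_bins, num_threads):
    """Histogram with one private copy per worker, merged by a final reduction."""
    # Split the input so that each worker only touches its own chunk.
    chunks = [data[i::num_threads] for i in range(num_threads)]

    def count(chunk):
        private = [0] * num_bins          # thread-private, update-only copy
        for v in chunk:
            private[v % num_bins] += 1    # no sharing, so no synchronization
        return private

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        copies = list(pool.map(count, chunks))

    # Reduction: the work here grows with num_bins * num_threads, which is
    # the overhead that has to be amortized over enough updates.
    return [sum(column) for column in zip(*copies)]


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
hist = privatized_histogram(data, num_bins=4, num_threads=3)
print(hist)
```

With few updates per bin, the allocation and reduction of `num_threads` copies dominates, which is why the slide says privatization must be applied selectively.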

SLIDE 8

Outline

- Introduction
- Coup
- Evaluation

SLIDE 9

Structural changes

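[Figure: Core 0 … Core N-1, each with a private cache (Private Cache 0 … Private Cache N-1), connected to a shared cache/directory. A reduction unit is added to the system.]

ISA
- Existing: load(&x), store(&x, v), ...
- Added: comm_add(&x, v), comm_or(&x, v), …

Coherence states
- Existing: M, S, I
- Added: U

Added hardware: reduction unit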
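
What the new instructions and the reduction unit need from an operation can be sketched in software. The table below is an assumption-laden illustration: only the names `comm_add` and `comm_or` come from the slide, while the 64-bit word size, the `COMM_OPS` table, and both helper functions are hypothetical. Each operation supplies an identity element (the starting value of an update-only copy) and a combine function that is commutative and associative.

```python
import random

MASK = (1 << 64) - 1  # assumption: 64-bit words

# Operation name -> (identity element, combine function).
COMM_OPS = {
    "comm_add": (0, lambda a, b: (a + b) & MASK),
    "comm_or": (0, lambda a, b: a | b),
}


def run_serialized(op, base, updates):
    """Apply every update directly to the single shared value, one at a time."""
    _, combine = COMM_OPS[op]
    for v in updates:
        base = combine(base, v)
    return base


def run_with_private_copies(op, base, updates_per_cache):
    """Accumulate updates per cache from the identity, then reduce into base."""
    identity, combine = COMM_OPS[op]
    partials = []
    for updates in updates_per_cache:
        local = identity                  # update-only copy starts at identity
        for v in updates:
            local = combine(local, v)
        partials.append(local)
    for p in partials:                    # the job of a reduction unit
        base = combine(base, p)
    return base


per_cache = [[1, 1, 1], [2, 2, 2]]
flat = [v for updates in per_cache for v in updates]
random.Random(0).shuffle(flat)            # any interleaving gives the same sum
total = run_with_private_copies("comm_add", 20, per_cache)
print(total, run_serialized("comm_add", 20, flat))
```

Because the combine function is commutative and associative, the privately accumulated result equals the serialized one for any interleaving of the updates.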

SLIDE 10

Example: extending MSI

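[Figure: state-transition diagrams for MSI (states M, S, I) and MUSI (states M, U, S, I). Each arrow is labeled with the request that causes it (R, W, C) and is drawn according to who initiates the transition.]

Legend
- States: M = Modified, S = Shared (read-only), I = Invalid, U = Update-only
- Requests: R = Read, W = Write, C = Commutative update
- Transitions: initiated by own core (gain permissions), initiated by others (lose permissions)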
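
The MUSI diagram can be approximated as a pair of lookup tables, one per transition type in the legend. This is one reading of the slide, not the authors' specification: the arrows cannot be fully recovered from the extracted labels, so the individual entries (for example, that a read by the own core in U leads to M) are assumptions, as are the names `OWN`, `OTHER`, `next_state`, and `step`. Transient states and the directory are omitted.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    U = "Update-only"
    S = "Shared (read-only)"
    I = "Invalid"


# Requests: R = read, W = write, C = commutative update.
# Transitions initiated by the own core (gain permissions).
OWN = {
    State.I: {"R": State.S, "W": State.M, "C": State.U},
    State.S: {"R": State.S, "W": State.M, "C": State.U},
    State.U: {"R": State.M, "W": State.M, "C": State.U},
    State.M: {"R": State.M, "W": State.M, "C": State.M},
}

# Transitions initiated by another core's request (lose permissions).
OTHER = {
    State.M: {"R": State.S, "W": State.I, "C": State.U},
    State.S: {"R": State.S, "W": State.I, "C": State.I},
    State.U: {"R": State.I, "W": State.I, "C": State.U},
    State.I: {"R": State.I, "W": State.I, "C": State.I},
}


def next_state(state, request, own):
    """State of one private cache after a request by its own core or another."""
    return (OWN if own else OTHER)[state][request]


def step(states, requester, request):
    """Apply one request to every private cache's copy of the same line."""
    return [next_state(s, request, own=(i == requester))
            for i, s in enumerate(states)]


states = [State.I, State.I]
states = step(states, 0, "C")   # core 0 updates: [U, I]
states = step(states, 1, "C")   # core 1 updates concurrently: [U, U]
states = step(states, 0, "R")   # core 0 reads: [M, I]
print([s.name for s in states])
```

The property to look for is that several caches can sit in U at once, just as several can sit in S, while M stays exclusive.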

SLIDE 11

Coherence and consistency

- Coherence is maintained
- Consistency is not affected
- See paper for proofs

SLIDE 12

Implementation and verification

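- No extra stable states
- Easy to verify

[Figure: two protocol state diagrams with stable and transient states. State names as extracted, first diagram: stable S, M, I, E; transient IS, SM, IM, WB, ISI, xMI, xMS, WBI. Second diagram: stable M, I, E, N; transient IN, NM, IM, WB, xMI, xMN, WBI, xNI, NN.]

Legend
- States: stable, transient (split, race)
- Transitions initiated by: own request (R, W, C, wback), response to own request, inval/downgrade request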

SLIDE 13

Evaluation Methodology

[Figure: simulated system with 1-8 processor chips and L4 chips; each L4 chip holds an L4 cache and global directory. Processor chip organization: Core 0 … Core 15, each with L1I, L1D, and its own L2 (L2 0 … L2 15), plus a shared L3 and chip directory that connects to the L4 chips.]

8 sockets × 16 cores/socket = 128 cores

SLIDE 14

Coup vs. Atomic Operations

[Figure: speedup vs. number of cores (1 to 128) for histogram, spmv, pagerank, bfs, and fluidanimate, one row of panels for MESI and one for COUP.]

[Figure: normalized AMAT for histogram, spmv, pagerank, bfs, and fluidanimate, MESI vs. COUP.]

Fraction of commutative instructions: 1.0%, 2.4%, 4.9%, 0.40%, 0.96%

SLIDE 15

Modifying algorithms to exploit Coup

Delayed deallocation reference counting

[Figure: bar chart of performance, Refcache vs. Coup; axis ticks from 0.5 to 2.5.]

Scheme                 Data structure
Refcache [1]           Hash table
Coup implementation    Hierarchical bit vectors + comm add/or

[1] Clements et al., EuroSys 2013

SLIDE 16

Conclusions

- Coup allows concurrent commutative updates
  - Maintains coherence and consistency
- Coup implementation accelerates single-word updates
  - Minor hardware overhead
  - Accelerates update-heavy applications by up to 2.4x
- Coup opens exciting research avenues
  - Commutativity-aware hardware transactional memory
  - Support arbitrary update functions, semantic commutativity

SLIDE 17

THANKS FOR YOUR ATTENTION! QUESTIONS ARE WELCOME!