

SLIDE 1

EXPLOITING SEMANTIC COMMUTATIVITY IN HARDWARE SPECULATION

GUOWEI ZHANG, VIRGINIA CHIU, DANIEL SANCHEZ

MICRO 2016

SLIDE 2

Executive summary

- Exploiting commutativity benefits update-heavy apps
  - Software techniques that exploit commutativity incur high run-time overheads (STM is 2-6x slower than HTM)
  - Prior hardware exploits only single-instruction commutative operations (e.g., addition)
- CommTM exploits multi-instruction commutativity
  - Extends the coherence protocol to perform commutative operations locally and concurrently
  - Leverages HTM to support multi-instruction updates
  - Benefits speculative execution by reducing conflicts
  - Accelerates full applications by up to 3.4x at 128 cores

SLIDE 3

Commutativity

Single-instruction commutativity: ADD, MIN, OR (handled by Coup [Zhang et al., MICRO 2015])
Multi-instruction commutativity: set insertion, top-K insertion, ordered put (handled by CommTM)

- Commutative operations produce equivalent results when reordered
  - No true data dependence → no need for communication
  - Software exploits commutativity but incurs high run-time overheads

SLIDE 4

Commutativity

- Commutative operations produce equivalent results when reordered
  - No true data dependence → no need for communication
  - Software exploits commutativity but incurs high run-time overheads
  - Multi-instruction example: set (linked-list) insertion

insert(a); insert(b);  gives  head → b → a → null
insert(b); insert(a);  gives  head → a → b → null

Different but semantically equivalent states
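The equivalence above can be checked in plain C. This is a small sketch (names like `set_insert` and `orders_equivalent` are ours, not from the paper): the two insertion orders produce lists with different heads, yet both represent the same abstract set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal set built on a singly linked list, mirroring the slide's
 * insert-at-head example. All names here are illustrative. */
typedef struct Node { int val; struct Node* next; } Node;
typedef struct Set { Node* head; } Set;

static void set_insert(Set* s, Node* n) {
    n->next = s->head;   /* insert at head */
    s->head = n;
}

static bool set_contains(const Set* s, int v) {
    for (Node* n = s->head; n != NULL; n = n->next)
        if (n->val == v) return true;
    return false;
}

/* Both insertion orders yield different layouts but the same set. */
static bool orders_equivalent(void) {
    Node a1 = {1, NULL}, b1 = {2, NULL};
    Node a2 = {1, NULL}, b2 = {2, NULL};
    Set s1 = {NULL}, s2 = {NULL};
    set_insert(&s1, &a1); set_insert(&s1, &b1);  /* insert(a); insert(b); */
    set_insert(&s2, &b2); set_insert(&s2, &a2);  /* insert(b); insert(a); */
    /* Physical states differ: the heads hold different values... */
    bool layouts_differ = (s1.head->val != s2.head->val);
    /* ...but the abstract states agree: both sets contain {1, 2}. */
    bool same_set = set_contains(&s1, 1) && set_contains(&s1, 2) &&
                    set_contains(&s2, 1) && set_contains(&s2, 2);
    return layouts_differ && same_set;
}
```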

SLIDE 5

Example: addition in conventional HTM

void add(int* counter, int delta) {
    tx_begin();
    int v = load(counter);
    int nv = v + delta;
    store(counter, nv);
    tx_end();
}


SLIDE 7

Example: addition in conventional HTM

Core 0 runs add(A, 1); add(A, 1); and Core 1 runs add(A, 1); add(A, 1); using the add() code above. Initially the shared cache holds A: 20.

SLIDE 8

Example: addition in conventional HTM

Core 0 starts add(A, 1) as Txn 0 and executes load A; Core 1 starts add(A, 1) as Txn 1 and executes load A. Both cores read A (A: 20).

SLIDE 9

Example: addition in conventional HTM

Txn 0 (Core 0): load A, store A. Txn 1 (Core 1): load A. Txn 0's write to A collides with Txn 1's read. Conflict!

SLIDE 10

Example: addition in conventional HTM

Txn 0 (Core 0): load A, store A. Txn 1 (Core 1): load A, then aborts on the conflicting write.

SLIDE 11

Example: addition in conventional HTM

The four add(A, 1) transactions (Txn 0-3, two per core) repeatedly conflict on A: each conflict aborts and restarts one transaction, so the load A / store A / commit sequences end up fully serialized. Final value: A: 24.

SLIDE 12

Example: addition in conventional HTM

Same serialized trace as the previous slide, ending with A: 24. The costs: coherence traffic, serialization, and wasted transactional work.

SLIDE 13

Example: addition in CommTM

void add(int* counter, int delta) {
    tx_begin();
    int v = load(counter);
    int nv = v + delta;
    store(counter, nv);
    tx_end();
}

Core 0 runs add(A, 1); add(A, 1); and Core 1 runs add(A, 1); add(A, 1);. Initially the shared cache holds A: 20.

SLIDE 14

Example: addition in CommTM

The loads and stores are now labeled with the ADD operation:

void add(int* counter, int delta) {
    tx_begin();
    int v = load[ADD](counter);
    int nv = v + delta;
    store[ADD](counter, nv);
    tx_end();
}

SLIDE 15

Example: addition in CommTM

The labeled accesses put A into the reducible ADD state in both L1s: each core gets a local copy initialized to ADD's identity (A: 0), while the shared cache keeps the architectural value (A: 20).

SLIDE 16

Example: addition in CommTM

Txn 0 (Core 0): load[ADD] A. Txn 1 (Core 1): load[ADD] A. Both reads are satisfied locally from the per-core copies (A: 0); the shared cache still holds A: 20.

SLIDE 17

Example: addition in CommTM

Txn 0: load[ADD] A, store[ADD] A. Txn 1: load[ADD] A, store[ADD] A. The labeled writes update the per-core copies concurrently; no conflict is signaled.

SLIDE 18

Example: addition in CommTM

Both transactions commit. Each core's reducible copy of A now holds its local partial sum (A: 1), with no coherence traffic between the cores.

SLIDE 19

Example: addition in CommTM

Core 0 runs Txn 2 and Core 1 runs Txn 3, again committing load[ADD] A / store[ADD] A locally. Each core's copy of A now holds its partial sum of 2; all four transactions committed without a single conflict.

SLIDE 20

Example: addition in CommTM

An unlabeled load A now arrives; it cannot be satisfied from the partial, per-core reducible copies.

SLIDE 21

Example: addition in CommTM

The unlabeled load A triggers a user-defined reduction: the partial values are folded into the shared-cache copy, yielding A: 24 (20 plus the four increments).

SLIDE 22

Example: addition in CommTM

The full run performs all four add(A, 1) transactions concurrently and a single user-defined reduction on the unlabeled load A, ending with A: 24. Benefits: less traffic, concurrent updates, less wasted transactional work, and lower run-time/memory overheads than STM.
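The semantics CommTM gives ADD can be emulated in ordinary C: each "core" applies its updates to a private copy initialized to the operation's identity, and an unlabeled read reduces all copies. This sketch (names like `add_local` and `read_reduced` are illustrative, not the hardware interface) reproduces the slide's 20 → 24 run.

```c
#include <assert.h>

/* Emulation of CommTM's reducible (U) state for ADD, in plain C.
 * This models the semantics, not the hardware mechanism. */
enum { NCORES = 2 };

static int shared_A = 20;      /* copy held by the shared cache       */
static int local_A[NCORES];    /* per-core reducible copies, start at 0
                                  (the identity of ADD)               */

static void add_local(int core, int delta) {
    local_A[core] += delta;    /* commutative update, no communication */
}

/* An unlabeled load forces a reduction of all partial values. */
static int read_reduced(void) {
    int v = shared_A;
    for (int c = 0; c < NCORES; c++) {
        v += local_A[c];       /* fold in each core's partial sum */
        local_A[c] = 0;        /* copy leaves the reducible state */
    }
    shared_A = v;
    return v;
}
```

Running the slide's schedule (two add(A, 1) per core, then one plain read) yields 24 regardless of how the four updates interleave, which is exactly why no conflicts are needed.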

SLIDE 23

CommTM

SLIDE 24

Programming interface

Transactional update (labeled loads/stores) plus a non-transactional reduction handler:

void add(int* counter, int delta) {
    tx_begin();
    int v = load[ADD](counter);
    int nv = v + delta;
    store[ADD](counter, nv);
    tx_end();
}

void reduce[ADD](int* counter, int delta) {
    int v = load[ADD](counter);
    int nv = v + delta;
    store[ADD](counter, nv);
}

Example: reduce[ADD] folds a local delta of 20 into a counter holding 16, leaving 36.

SLIDE 25

Handling arbitrary object sizes

- For objects smaller than a cache line, assume lines are full of aligned elements and reduce all of them:

void reduce[ADD](int* counterLine, int deltas[]) {
    for (int i = 0; i < intsPerCacheLine; i++) {
        int v = load[ADD](&counterLine[i]);
        int nv = v + deltas[i];
        store[ADD](&counterLine[i], nv);
    }
}


SLIDE 27

Handling arbitrary object sizes

Example operands for the reduce[ADD] loop: counterLine holds 5, 16, 2, 4 and deltas carries 20 for the second slot.

SLIDE 28

Handling arbitrary object sizes

- For objects smaller than a cache line, assume lines are full of aligned elements and reduce all of them
- For objects larger than a cache line, use a level of indirection

Example: an element-wise reduce[ADD] of deltas (20 in the second slot) into counterLine (5, 16, 2, 4) yields 5, 36, 2, 4.
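The per-line reduction can be restated as ordinary C (`reduce_add_line` is our name; `intsPerCacheLine = 16` assumes a 64-byte line of 4-byte ints):

```c
#include <assert.h>

/* Plain-C restatement of the slide's line-granularity reduce[ADD]:
 * for sub-line objects, treat the cache line as an array of aligned
 * ints and reduce every slot element-wise. */
enum { intsPerCacheLine = 16 };   /* 64-byte line / 4-byte int */

static void reduce_add_line(int* counterLine, const int* deltas) {
    for (int i = 0; i < intsPerCacheLine; i++)
        counterLine[i] += deltas[i];   /* element-wise ADD reduction */
}
```

Applied to the slide's example line (5, 16, 2, 4) with a delta of 20 in the second slot, this leaves 5, 36, 2, 4.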

SLIDE 29

Example: set (linked-list) insertion

struct SetDesc { Node* head; Node* tail; };
struct Node { ... Node* next; };

void insert(SetDesc* s, Node* n) {
    tx_begin();
    Node* head = load[INSERT](s);
    n->next = head;
    store[INSERT](s, n);
    if (head == nullptr)
        store[INSERT](s + sizeof(Node*), n);   // first element: also set tail
    tx_end();
}

void reduce[INSERT](SetDesc* s0, SetDesc* s1) {
    if (s1->head == nullptr) return;
    Node* head0 = load[INSERT](s0);
    if (head0 == nullptr) {
        store[INSERT](s0, s1->head);
    } else {
        Node* tail0 = load[INSERT](s0 + sizeof(Node*));
        tail0->next = s1->head;
    }
    store[INSERT](s0 + sizeof(Node*), s1->tail);
}
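The splice that reduce[INSERT] performs can be emulated in ordinary C without the labeled loads/stores. In this sketch (`reduce_insert` and the field layout are ours, mirroring the slide's SetDesc/Node), merging a non-empty partial list into another is O(1) thanks to the cached tail pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C emulation of the reduce[INSERT] splice. Types mirror the
 * slide's SetDesc (head + tail) and Node (next pointer). */
typedef struct Node { struct Node* next; int val; } Node;
typedef struct SetDesc { Node* head; Node* tail; } SetDesc;

static void reduce_insert(SetDesc* s0, const SetDesc* s1) {
    if (s1->head == NULL) return;           /* nothing to merge */
    if (s0->head == NULL)
        s0->head = s1->head;                /* s0 empty: adopt s1 */
    else
        s0->tail->next = s1->head;          /* O(1) splice at the tail */
    s0->tail = s1->tail;                    /* tail is now s1's tail */
}
```

The tail pointer is what makes the operation commutative and cheap: without it, each reduction would walk one of the lists.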

SLIDE 30

Implementation

- Baseline HTM
  - MESI coherence protocol
  - Eager conflict detection
  - Timestamp-based conflict resolution
  - Lazy version management (buffer speculative data in L1s)
- Tiled multicore: per-core L1I/L1D and L2 caches, shared L3 / directory

CommTM can be applied to other HTMs and hardware speculation techniques.

SLIDE 31

Coherence protocol

MSI has states M (Modified), S (Shared, read-only), and I (Invalid), with transitions driven by read (R) and write (W) requests. CommTM-MSI adds a U (user-defined reducible) state and labeled load/store (L) requests: labeled accesses move a line into U, where further labeled accesses complete locally, while unlabeled reads and writes force the line out of U. As in MSI, transitions are initiated either by the owning core (gaining permissions) or by other cores (losing permissions).

SLIDE 32

Reducible-state transitions

Legend: M = Modified, U = user-defined reducible. Entering the U state is triggered by a labeled load/store; leaving it happens via non-transactional reductions.

Starting point: the shared cache holds A: 20, and Core 1 already caches A. Core 0 then issues load[ADD](A).

SLIDE 33

Reducible-state transitions

Core 0's load[ADD](A) sends a GETU A [ADD] request; the directory downgrades the other sharer with DOWNU A [ADD], and both caches ACK. Each core's copy enters the U state initialized to ADD's identity (A: 0), while the shared cache keeps the architectural value (A: 20). Labeled updates then accumulate partial sums (e.g., A: 3) locally.

SLIDE 34

Reducible-state transitions

Core 2 now issues an unlabeled load(A). The partial values held in U state (e.g., A: 2 in Core 1's L1 and L2) cannot satisfy it directly.

SLIDE 35

Reducible-state transitions

Core 2's unlabeled load (GETS A) forces the lines out of U: the directory invalidates the reducible copies (INV A), the partial sums are folded into the shared-cache copy by the add_reduce handler, and the requester receives the fully reduced value (A: 24). Reductions run on a hardware shadow thread/interrupt and cannot access other reducible data.

SLIDE 36

Transactional execution

- Speculative value management for the U state is analogous to the M state. Example: a core's L1 holds A: 3 in U, backed by A: 3 in its L2.

SLIDE 37

Transactional execution

- Speculative value management for the U state is analogous to the M state: each L1 line carries speculatively-read and speculatively-written bits, both initially 0.

SLIDE 38

Transactional execution

- Speculative value management for the U state is analogous to the M state. Example: a transaction (ts: 0) executes tx_begin(); ld[ADD](A); st[ADD](A), setting the speculatively-read and speculatively-written bits and updating the L1 copy from A: 3 to A: 4 while the L2 keeps A: 3. tx_end() commits, clearing the bits and making A: 4 non-speculative. A second transaction (ts: 1) repeats this, moving A to 5.
- An invalidation of speculatively accessed data in the U state triggers a conflict.

SLIDE 39

Gather requests allow more concurrency

- Motivation
  - Conditional commutativity: operations commute only when the reducible data meets some condition
  - Frequent reductions triggered by condition checks limit concurrency
- Gather requests allow partial updates to move across caches without leaving the reducible state
  - Achieves higher concurrency (e.g., for reference counting)
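One way to picture gather requests, using the reference-counting case: decrements commute as long as the count stays above zero, so each core can hold a reserve of counts and consume them locally; a core that runs out gathers part of a peer's reserve instead of forcing a full reduction. This plain-C sketch (`dec_ref`, `reserve`, and the half-stealing policy are all illustrative, not the paper's protocol) captures the idea.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of conditional commutativity with gathers:
 * the reference count is split into per-core reserves. */
enum { NCORES = 2 };
static int reserve[NCORES];   /* per-core slices of the total count */

/* Returns true if the decrement succeeded without a full reduction. */
static bool dec_ref(int core) {
    if (reserve[core] > 0) {
        reserve[core]--;      /* common case: purely local */
        return true;
    }
    /* Gather: move half of a peer's reserve over, so both cores stay
     * in the reducible state and keep operating concurrently. */
    for (int p = 0; p < NCORES; p++) {
        if (p != core && reserve[p] > 1) {
            int moved = reserve[p] / 2;
            reserve[p]   -= moved;
            reserve[core] += moved;
            reserve[core]--;
            return true;
        }
    }
    return false;             /* count may be zero: reduce fully */
}
```

Without gathers, every condition check ("did the count hit zero?") would reduce all partial values and serialize the cores; with them, a reduction is needed only when the whole count may really be exhausted.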

SLIDE 40

Evaluation

SLIDE 41

Evaluation on microbenchmarks

Plots: speedup vs. thread count (1 to 128 threads) for Baseline and CommTM on four microbenchmarks: counter, set insertion, ordered put, and top-k insertion.

Up to 128x speedup over baseline TM.

SLIDE 42

Evaluation on full applications

Plots: speedup vs. thread count (1 to 128 threads) for Baseline and CommTM on boruvka, kmeans, ssca2, genome, and vacation; CommTM accelerates full applications by up to 3.4x.

Breakdown of normalized core cycles at 128 threads (lower is better), split into non-transactional, transactional committed, and transactional aborted cycles, for Baseline and CommTM on each application.

SLIDE 43

Conclusions

- Leverages HTM to support multi-instruction operations
- Extends the coherence protocol to allow local and concurrent updates
- Bridges the gap between software and hardware speculation
- Reduces conflicts and serialized transactions significantly
- Accelerates challenging workloads by up to 3.4x at 128 cores

SLIDE 44

THANKS FOR YOUR ATTENTION! QUESTIONS ARE WELCOME!