SLIDE 1

VIRTUES AND LIMITATIONS OF COMMODITY HARDWARE TRANSACTIONAL MEMORY

Nuno Diegues, Paolo Romano and Luís Rodrigues

PACT 2014

SLIDE 9

Virtues and Limitations of HTM PACT 2014

The multi-core (r)evolution

Multi-cores are now ubiquitous

[Figure: CPU1, CPU2, CPU3 and CPU4 attached to a Shared Memory]

Concurrent programming is complex

Classic approach: Locking

Hard to get right:

  • fine-grained locks
  • deadlocks
  • correctness

Transactional Memory abstraction

  • Programmer identifies atomic blocks
  • Runtime implements synchronization

  atomic {
    withdraw(acc1,val);
    deposit(acc2,val);
  }

Transactional Memory System

SLIDE 15

TM is now available in commodity processors

  • Intel: Haswell in desktops, laptops, tablets, servers…
  • IBM: BG/Q, zEC12, Power8

Over 10 years of:

  • Software implementations (STMs)
  • Simulations of HTMs and HybridTMs

Where does commodity HTM stand in the big picture?

Our contribution: largest TM study to date

  • Framework with 4 STMs, Intel HTM, 2 HyTMs and locking strategies
  • Metrics for performance and power consumption
  • 10 benchmarks

SLIDE 18

HTM: Intel Transactional Synchronization Extensions (TSX)

  • Widely available in millions of machines
  • Similar in nature to IBM’s HTMs

[Figure: CPU1 and CPU2, each with an L1 cache (64KB) and an L2 cache (256KB), plus an L3 cache and the memory bus]

  • L1 modified to be transactional
  • Cache coherence detects conflicts eagerly
  • Strong atomicity

SLIDE 29

HTM: Intel Transactional Synchronization Extensions (TSX)

[Figure: CPU1 and CPU2, each with an L1 cache and an L2 cache, plus an L3 cache and the memory bus]

Timeline of operations across CPU 1 and CPU 2, with the cache-line state shown on the slide after each step:

  xbegin          // TSX: on
  read x: 0       // Set read bit on x's cache line         x: 0 -- r
  write y = 1     // Buffer write in L1 cache               y: 1 -- w
  xend            // Atomically clear bits and publish      x: 0 y: 1
  xbegin
  read y: 1                                                 y: 1 -- r
  write y = 2                                               x: 0 y: 2
  … …
  xabort          // invalidation: snooped write invalidates tx read
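
The walkthrough above can be mimicked with a toy model in C. This is only a sketch under stated assumptions: the `toy_*` names, the fixed number of tracked lines and the "requester always wins" rule are hypothetical choices made for illustration, not a description of how TSX is built. Real hardware keeps the read and write bits in the L1 cache and reacts to coherence traffic.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_CPUS  2
#define TOY_LINES 8   /* hypothetical number of tracked cache lines */

typedef struct {
    bool in_tx;            /* between xbegin and xend */
    bool aborted;          /* set when a conflicting access killed the tx */
    bool rd[TOY_LINES];    /* per-line transactional read bit */
    bool wr[TOY_LINES];    /* per-line transactional write bit */
} toy_cpu;

static toy_cpu toy[TOY_CPUS];

/* Drop all transactional state of one CPU. */
static void toy_clear(toy_cpu *c) { memset(c, 0, sizeof *c); }

void toy_xbegin(int cpu) {
    assert(cpu >= 0 && cpu < TOY_CPUS);
    toy_clear(&toy[cpu]);
    toy[cpu].in_tx = true;
}

/* Eager conflict detection: every access is snooped by the other CPUs.
 * A remote write kills transactions that read or wrote the line; a remote
 * read kills transactions that wrote it. Assumption: the requester wins. */
static void toy_snoop(int cpu, int line, bool is_write) {
    for (int other = 0; other < TOY_CPUS; other++) {
        toy_cpu *o = &toy[other];
        if (other == cpu || !o->in_tx) continue;
        if (o->wr[line] || (is_write && o->rd[line])) {
            toy_clear(o);          /* buffered writes are discarded */
            o->aborted = true;
        }
    }
}

/* Accesses outside a transaction are snooped too (strong atomicity). */
void toy_read(int cpu, int line) {
    toy_snoop(cpu, line, false);
    if (toy[cpu].in_tx) toy[cpu].rd[line] = true;
}

void toy_write(int cpu, int line) {
    toy_snoop(cpu, line, true);
    if (toy[cpu].in_tx) toy[cpu].wr[line] = true;
}

/* Returns true on commit, false if the transaction was aborted meanwhile. */
bool toy_xend(int cpu) {
    bool ok = toy[cpu].in_tx && !toy[cpu].aborted;
    toy_clear(&toy[cpu]);
    return ok;
}
```

With line 0 standing for x and line 1 for y, a transaction that reads y on one CPU is aborted as soon as the other CPU writes y, which is the situation on the last step of the timeline.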

SLIDE 32

In an ideal world…

  xbegin
  withdraw(acc1,val)
  deposit(acc2,val)
  xend

Transactions may abort:

  • because of contention on the same memory locations

Transactions restart

…and every transaction shall eventually succeed

SLIDE 35

…in practice: Best-Effort Nature

No progress guarantees:

  • A transaction may always abort

…due to a number of reasons:

  • Forbidden instructions
  • Capacity of caches
  • Faults and signals
  • Contending transactions, aborting each other

SLIDE 40

Restrictions of TSX

  • Writes:
      • size of L1 cache: 32KB
      • non-negligible aborts for >8KB
      • cache associativity
  • Reads:
      • up to 4MB
      • overflow structure in L2 cache
      • presumed to be a Bloom-Filter [Eurosys14]
  • Interrupts:
      • up to 1M cycles
      • Roughly 0.5 ms on a Haswell Xeon

TSX alone is not enough
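
The "cache associativity" restriction on writes can be illustrated with a small C check. This is a sketch, not a model of the real abort logic: it assumes the commonly documented Haswell L1 data cache geometry (32KB, 8 ways, 64-byte lines, hence 64 sets), which the slide does not spell out, and the function name `write_set_fits_l1` is hypothetical. It ignores hyper-threading, prefetching and any other source of evictions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed L1 data cache geometry: 64 sets * 8 ways * 64 B = 32KB. */
enum { LINE_BYTES = 64, WAYS = 8, SETS = 64 };

/* Returns true if every distinct cache line written by a transaction can be
 * held in the L1 at once, i.e. no set receives more than WAYS lines. */
bool write_set_fits_l1(const uintptr_t *addrs, size_t n) {
    assert(addrs != NULL || n == 0);
    uintptr_t seen[SETS][WAYS];      /* distinct lines recorded per set */
    unsigned used[SETS] = {0};       /* number of ways occupied per set */

    for (size_t i = 0; i < n; i++) {
        uintptr_t line = addrs[i] / LINE_BYTES;
        unsigned set = (unsigned)(line % SETS);
        bool dup = false;
        for (unsigned w = 0; w < used[set]; w++) {
            if (seen[set][w] == line) { dup = true; break; }
        }
        if (dup) continue;                    /* line already counted */
        if (used[set] == WAYS) return false;  /* set overflow */
        seen[set][used[set]++] = line;
    }
    return true;
}
```

Under these assumptions, nine writes spaced 4096 bytes apart all fall in the same set and already overflow it, although they touch only 576 bytes of data. This is one way a write-set far below 32KB can still trigger capacity aborts.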

SLIDE 41

TSX with a fall-back

  start:
    int status = xbegin
    if (status == ok)              // != ok when aborted
      if (fallback-in-use())
        xabort                     // fall-back in use
      else
        goto code                  // fast-path

    if (shouldRetry())             // retry policy
      goto start
    else
      use-fallback()               // use fall-back

  code:
    application logic

    if (inFastPath)
      xend                         // fast-path
    else
      quit-fallback()              // fall-back

SLIDE 42

TSX with a fall-back: a single lock

  start:
    int status = xbegin
    if (status == ok)              // != ok when aborted
      if (isTaken(lock))
        xabort                     // fall-back in use
      else
        goto code                  // fast-path

    if (shouldRetry())             // retry policy
      goto start
    else
      acquire(lock)                // use fall-back

  code:
    application logic

    if (inFastPath)                // fast-path
      xend
    else                           // fall-back
      release(lock)
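
The single-lock fall-back above could look roughly as follows in C. This is a hedged sketch rather than the authors' code: `run_atomic`, `increment`, `fallback_lock` and the retry budget `MAX_RETRIES` are hypothetical names and values. When the compiler targets RTM (for example with `-mrtm`), the `htm_*` macros map to the `_xbegin`, `_xend` and `_xabort` intrinsics from `<immintrin.h>`; otherwise a stub makes every attempt look aborted, so the code still builds and simply takes the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__RTM__)
#include <immintrin.h>            /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#define HTM_STARTED  _XBEGIN_STARTED
#define htm_begin()  _xbegin()
#define htm_end()    _xend()
#define htm_abort()  _xabort(0xff)
#else
/* Stub for builds without RTM: every attempt reports an abort. */
#define HTM_STARTED  (~0u)
#define htm_begin()  0u
#define htm_end()    ((void)0)
#define htm_abort()  ((void)0)
#endif

#define MAX_RETRIES 5             /* hypothetical retry budget */

static atomic_bool fallback_lock; /* false = free */

/* Runs body(arg) atomically. Returns true if it committed in hardware,
 * false if it ran under the fall-back lock. */
bool run_atomic(void (*body)(void *), void *arg) {
    assert(body != NULL);
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        unsigned status = htm_begin();
        if (status == HTM_STARTED) {
            /* Reading the lock adds it to the read-set, so a later
             * acquire by another thread aborts this transaction. */
            if (atomic_load_explicit(&fallback_lock, memory_order_relaxed))
                htm_abort();                  /* fall-back in use */
            body(arg);                        /* fast-path */
            htm_end();
            return true;
        }
        /* Aborted: the loop condition is the (very simple) retry policy. */
    }
    /* Retry budget exhausted: acquire the single global lock. */
    while (atomic_exchange_explicit(&fallback_lock, true,
                                    memory_order_acquire)) {
        /* spin until the lock is free */
    }
    body(arg);
    atomic_store_explicit(&fallback_lock, false, memory_order_release);
    return false;
}

/* Example body: increments the int that arg points to. */
void increment(void *arg) { ++*(int *)arg; }
```

A call such as `run_atomic(increment, &counter)` increments the counter exactly once on either path. The fixed retry count is the simplest possible policy; the tuning discussed in the talk is about choosing it, and the lock behind it, well.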


SLIDE 45

Lesson #1

Tuning best-effort HTMs is extremely important

The hardware is only a part of the solution.

2x improvement by choosing the best configuration on average across all workloads.

Avoid HLE (Hardware Lock Elision) when possible

  • the fallback is triggered too often
  • cannot be tuned

SLIDE 48

Lesson #1: Tuning TSX

Which fall-back to use?

  • Lock implementation

Overhead (%), average across all benchmarks and thread counts:

  Lock     Performance   Power
  Ticket   1.0           1.1
  MCS      2.4           1.2
  CLH      2.9           2.4
  RW       14.2          17.4
  TTAS     15.2          17.4
  Spin     16.4          17.5

When to take the fall-back?

  • Retry policy
  • Contention management

Avoid lemming effect [ASPLOS12]

  • avalanche aborts that exhaust retry policy

Manage contention with auxiliary lock [PPOPP13]

  • fallback lock creates spurious aborts

Retry policy using literature values [HPC13,HPCA14]

  • give up on HTM after a threshold of aborts

SLIDE 52

Software TM in the picture

Source program:

  int i = 0
  …
  atomic {
    i++
  }

Compiled program (instrumentation to invoke STM):

  int i = 0
  …
  TM.begin-tx()
  int tmp = TM.read(&i)
  tmp++
  TM.write(&i, tmp)
  TM.end-tx()

Over 10 years of research on STM ➔ many prototypes and designs. We considered four state-of-the-art implementations:

  • TL2: commit-time locking, used in Intel paper for comparison with TSX
  • NOrec: single commit lock, least instrumentation overhead
  • TinySTM: encounter-time locking
  • SwissTM: lazy val for r/w and eager for w/w; novel contention manager
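
To make the instrumented calls concrete, here is a deliberately tiny, single-threaded sketch in C of what `TM.read` and `TM.write` might do in a write-buffering design. The `tm_*` names mirror the slide but are hypothetical, as is the `TM_LOG_MAX` capacity. The sketch has no locking, validation or contention management, so it is not how TL2, NOrec, TinySTM or SwissTM are implemented; it only shows why every shared access inside an atomic block costs extra work.

```c
#include <assert.h>
#include <stddef.h>

#define TM_LOG_MAX 16            /* hypothetical write-set capacity */

typedef struct {
    int   *addr[TM_LOG_MAX];     /* locations written by the transaction */
    int    val[TM_LOG_MAX];      /* buffered values for those locations */
    size_t n;                    /* number of entries in use */
} tm_tx;

void tm_begin(tm_tx *tx) { tx->n = 0; }

/* Instrumented read: return the transaction's own pending write if there
 * is one, otherwise the value currently in memory. */
int tm_read(const tm_tx *tx, int *addr) {
    for (size_t i = 0; i < tx->n; i++) {
        if (tx->addr[i] == addr) return tx->val[i];
    }
    return *addr;
}

/* Instrumented write: buffer the value; memory is untouched until commit. */
void tm_write(tm_tx *tx, int *addr, int value) {
    for (size_t i = 0; i < tx->n; i++) {
        if (tx->addr[i] == addr) { tx->val[i] = value; return; }
    }
    assert(tx->n < TM_LOG_MAX);
    tx->addr[tx->n] = addr;
    tx->val[tx->n] = value;
    tx->n++;
}

/* Commit: publish the buffered writes. A real STM would first validate
 * its reads and synchronize with concurrent commits. */
void tm_end(tm_tx *tx) {
    for (size_t i = 0; i < tx->n; i++) *tx->addr[i] = tx->val[i];
    tx->n = 0;
}

/* Instrumented form of: atomic { i++; } */
void atomic_increment(int *i) {
    tm_tx tx;
    tm_begin(&tx);
    int tmp = tm_read(&tx, i);
    tmp++;
    tm_write(&tx, i, tmp);
    tm_end(&tx);
}
```

Even in this toy, a one-instruction increment turns into two function calls and a log search, which hints at where the single-thread overhead of STMs comes from.
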

SLIDE 56

Fall-backs for best-effort HTM

Single global lock

  • as seen before

Fine-grained locks

  • possibly check more than one lock
  • requires programmer to define which and how many locks
  • or automatic lock inference techniques [Transact06,LCPC13]

STMs

  • separate code paths
      • uninstrumented for fast path in HTM
      • instrumented reads and writes for STM
  • use HTM to boost STM commit [SPAA13]
  • NOrec and TL2 for HybridTMs
  • tomorrow: Invyswell

SLIDE 57

Experimental settings

Synchronization techniques under comparison:

  Locking   GL, FL
  HTM       TSX-GL, TSX-FL
  STM       TL2, NOrec, SwissTM, TinySTM
  HyTM      TSX-TL2, TSX-NOrec

SLIDE 58

Experimental settings

Target machine:

  • Intel Haswell Xeon E3-1275v3 3.5GHz (3.9GHz Turbo)
  • 4 cores, 8 hardware threads (via hyper-threading)

Standard metrics for evaluation:

  • Time to complete benchmarks
      • presented as speedup
  • Energy consumed (collected via Intel RAPL)
      • presented as relative energy
  • The baseline for comparison is a sequential, non-synchronized execution

Benchmarks:

  • 7 STAMP benchmarks
  • Memcached [ASPLOS14]
  • Concurrent data-structures
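
The two metrics can be written down as follows. This is a sketch under assumptions: the symbols are notation introduced here, not taken from the slides, and the orientation of the energy ratio (lower is better) is the usual convention rather than something the slide states. $T_{\text{seq}}$ and $E_{\text{seq}}$ denote the time and energy of the sequential, non-synchronized baseline; $T(n)$ and $E(n)$ denote the time and energy of a synchronized run with $n$ threads.

```latex
\[
  \text{speedup}(n) = \frac{T_{\text{seq}}}{T(n)},
  \qquad
  \text{relative energy}(n) = \frac{E(n)}{E_{\text{seq}}}
\]
```

As a purely illustrative example with made-up numbers, a baseline of 10 s and a 4-thread run of 4 s would give a speedup of $10/4 = 2.5$.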

SLIDE 66

Lesson #2: HTM is not a silver bullet

Tested 900 scenarios (STAMP+data structures)

Identified three categories of applications:

  1. Uncontested winner
      • concurrent data structures
      • STAMP: small, occasional transactions
      • plotted example: Kmeans
  2. Only better without hyper-threading
      • more competitive energy-wise
      • spurious aborts due to hw restrictions
      • plotted example: Intruder
  3. Always worse than others
      • except for single-threaded execution
      • capacity aborts and quantum exhaustion
      • plotted example: Yada

[Figure: one plot per category comparing TSX-GL, TSX-NOrec and TinySTM]

SLIDE 72

Lesson #3: STM is still very competitive

Most robust all-around solution

  • albeit more power hungry than HTM-based approaches

Considering the best STM (SwissTM would be similar)

  • Worst at 1 thread
  • Turning point at 3 threads
  • 84% more performance, 46% more energy-efficiency (over TSX-GL)

[Figure: Relative time and Relative energy of TSX-GL, TSX-NOrec and TinySTM over all STAMP benchmarks]

SLIDE 75

Lesson #4: Fine-grained locking is not worth it

[Figure: Speedup and Relative Energy of TSX-GL, FL, TSX-FL and TinySTM on the sub-set of benchmarks with fine-grained locks]

HTM is able to achieve the same degree of parallelism

  • TSX-FL checks more locks in each transaction on average
  • tends to be worse than TSX-GL

Benchmarks include concurrent data-structures

  • small transactions benefit HTM or FL approaches

SLIDE 80

Research Directions: HybridTMs

None of the evaluated HybridTMs is ever the best approach

[Figure: Speedup and Relative Energy of TSX-GL, TSX-NOrec and TinySTM over all STAMP benchmarks]

  • Spurious aborts from fallback of STM with HTM
  • More efficient algorithms exist with non-transactional operations
      • Not available on Intel TSX or IBM BG/Q
      • Is it a requirement for efficient future HybridTMs?

SLIDE 86

Research Directions: Compiler Instrumentation

STMs (and the HybridTMs’ software path) used manual instrumentation.

What changes if we rely on the compiler (GCC 4.8)?

  • Read- and write-sets can increase up to 3x (in SSCA2)
  • Conservative compiler instruments accesses that are clearly not shared
  • Even simple static analyses (such as those in Clang) would improve this

Also important for HTM:

  • If non-transactional operations are available
  • May reduce capacity aborts

SLIDE 91

Research Directions: Automatic HTM tuning

We used the best tuning of TSX on average. But what about the optimal for each case?

Time improvement of optimal vs average (speedup %):

              Kmeans  SSCA2  Intruder  Vacation  Genome  Yada  Labyrinth
  4 threads   12      7      20        36        12      13    2
  8 threads   5       8      80        21        2       55    39

Technique for optimal tuning, focus only on performance:

  • Self-Tuning Intel TSX (best paper at USENIX ICAC’14)


SLIDE 95

Summary

HTM is not a silver bullet

  • Shines with short infrequent transactions and concurrent data structures
  • Great in energy efficiency and at low thread count
  • Hyper-threading amplifies HTM’s inherent limitations
  • HTM requires careful tuning of parameters governing the fallback:
      • automatic tuning is highly desirable to preserve ease of usage

STM performs best on average

  • …and with applications with complex transactions
  • Its energy efficiency tends to be worse than HTM
  • Compiler instrumentation has room for improvement

HybridTMs are not there yet

  • Need better support from hardware
  • Can we do better without it?