SLIDE 1

The moment of truth: are we done with STM?

Nuno Diegues, Paolo Romano, Luís Rodrigues

ndiegues@gsd.inesc-id.pt

Nuno Diegues 1/27

SLIDE 4

Over 20 years of Transactional Memory

Commodity processors with hardware support

  ◮ Processors by IBM (BG/Q and zEC12) and Intel (Haswell)

SLIDE 8

The question

Raise the question: are we done with STM?

  + Hardware ought to be faster
  + Transparency and ease of use

  • Research in STMs has evolved into a mature state
  • Limited nature of hardware

What else is there to find?

SLIDE 9

Outline

  1 (Quick) Motivation
  2 Study Description
  3 Compared Techniques
  4 Results and Insights
  5 Summary of Conclusions

SLIDE 15

Study

Commodity hardware in Intel TSX

  ◮ IBM processors target high performance computing
  ◮ Intel Haswell Xeon E3-1275v3 3.5GHz (3.9GHz Turbo)
  ◮ 4 cores, 8 hardware threads (via hyper-threading)
  ◮ 4x32KB L1 caches, 4x256KB L2 caches, 8MB L3 cache

Standard metrics for evaluation

  ◮ Time to complete benchmarks
  ◮ Power consumed (collected via Intel RAPL)
  ◮ Relative to sequential, non-instrumented executions
  ◮ Combined metric: Speedup / KJoules

STAMP benchmarks (excluded Bayes) with standard parameters
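
The combined metric listed above can be sketched in a few lines of C. This is only an illustration, not the authors' measurement harness: the function names and parameters are assumptions, the energy reading is taken as an input (in the study it was collected via Intel RAPL), and the sketch assumes the energy is the absolute consumption of the measured run, which the slides do not spell out.

```c
#include <assert.h>

/* Speedup of a parallel run relative to the sequential,
 * non-instrumented run (both times in seconds). */
static double speedup(double seq_seconds, double par_seconds)
{
    assert(seq_seconds > 0.0 && par_seconds > 0.0);
    return seq_seconds / par_seconds;
}

/* Combined metric: speedup divided by the energy of the measured run,
 * with the energy converted from joules to kilojoules. */
static double speedup_per_kjoule(double seq_seconds, double par_seconds,
                                 double par_energy_joules)
{
    assert(par_energy_joules > 0.0);
    return speedup(seq_seconds, par_seconds) / (par_energy_joules / 1000.0);
}
```

For instance, a run that takes 2.5 s against a 10 s sequential baseline and consumes 500 J has a speedup of 4 and scores 4 / 0.5 = 8 on the combined metric.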

SLIDE 17

Compared Techniques

  • Locks
  • STM
  • HTM
  • Hybrid TM

SLIDE 18

Compared Techniques - Locks

All benchmarks used an interface with the atomic construct:

  • GL: single global lock
  • FL: fine-grained locks (per-application effort)
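
As a rough illustration of the GL variant, the atomic construct can be mapped onto one process-wide mutex. The macro names `ATOMIC_BEGIN` and `ATOMIC_END`, the shared counter and the helper `run_workers` are hypothetical and chosen for this sketch; they are not the macros used by the benchmarks.

```c
#include <assert.h>
#include <pthread.h>

/* GL: a single global lock guards every atomic block. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

#define ATOMIC_BEGIN() pthread_mutex_lock(&global_lock)
#define ATOMIC_END()   pthread_mutex_unlock(&global_lock)

static long counter = 0;

/* Worker: increments the shared counter, one atomic block per increment. */
static void *worker(void *arg)
{
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        ATOMIC_BEGIN();
        counter++;
        ATOMIC_END();
    }
    return NULL;
}

/* Runs `nthreads` workers to completion and returns the final counter. */
static long run_workers(int nthreads, long iters)
{
    pthread_t tids[16];
    assert(nthreads > 0 && nthreads <= 16);
    counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, &iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    return counter;
}
```

Calling `run_workers(4, 10000)` should return 40000, since every increment is serialized by the global lock. The FL variant would replace the single mutex with locks chosen per data structure, which is where the per-application effort comes from.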

SLIDE 20

Compared Techniques - STM

  • TL2: commit-time locking
  • NOrec: aimed at low thread count (single commit lock)
  • TinySTM: encounter-time locking
  • SwissTM: mixed encounter-time and commit-time locking
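
The difference between encounter-time and commit-time locking can be shown with a deliberately simplified toy. Everything below is an assumption made for illustration: the types `tvar` and `tx`, the function names, and the design itself, which buffers writes in both modes, has no read set or validation, and uses plain fields instead of atomics, so it is single-threaded and not how TL2 or TinySTM are actually implemented. It only shows when the per-variable lock is taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WRITES 8

typedef struct tx tx;
typedef struct { int value; tx *owner; } tvar;      /* owner == NULL: unlocked */
typedef struct { tvar *var; int new_value; } write_entry;

struct tx {
    write_entry writes[MAX_WRITES];
    int n_writes;
    bool encounter_time;   /* true: lock on write; false: lock on commit */
};

static void tx_begin(tx *t, bool encounter_time)
{
    t->n_writes = 0;
    t->encounter_time = encounter_time;
}

/* Take the variable's lock unless another transaction holds it. */
static bool try_lock(tvar *v, tx *t)
{
    if (v->owner != NULL && v->owner != t)
        return false;
    v->owner = t;
    return true;
}

/* Drop every lock held by `t` and forget its buffered writes. */
static void release_all(tx *t)
{
    for (int i = 0; i < t->n_writes; i++)
        if (t->writes[i].var->owner == t)
            t->writes[i].var->owner = NULL;
    t->n_writes = 0;
}

/* Buffer a write; in encounter-time mode the lock is acquired right here. */
static bool tx_write(tx *t, tvar *v, int value)
{
    assert(t->n_writes < MAX_WRITES);
    if (t->encounter_time && !try_lock(v, t)) {
        release_all(t);                 /* conflict detected early: abort */
        return false;
    }
    t->writes[t->n_writes++] = (write_entry){ v, value };
    return true;
}

/* Publish buffered writes; in commit-time mode the locks are acquired now. */
static bool tx_commit(tx *t)
{
    if (!t->encounter_time)
        for (int i = 0; i < t->n_writes; i++)
            if (!try_lock(t->writes[i].var, t)) {
                release_all(t);         /* conflict detected late: abort */
                return false;
            }
    for (int i = 0; i < t->n_writes; i++)
        t->writes[i].var->value = t->writes[i].new_value;
    release_all(t);
    return true;
}
```

In this toy, an encounter-time transaction owns a variable from its first write to it, so a competing writer learns about the conflict immediately, while a commit-time transaction leaves the variable unlocked until `tx_commit` and only discovers the conflict there.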

SLIDE 24

Compared Techniques - HTM

Intel TSX is single version, ensures strong isolation and allows nesting. Most importantly, it is best-effort:

  • No transaction is guaranteed to commit
  • Exhausting cache lines with transactional footprint
  • Architectural states, instructions, traps
  • Fallback path must be provided in software
    ◮ address to routine provided on XBEGIN

TSX-GL and TSX-FL
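
The retry-then-fallback control flow behind a scheme like TSX-GL can be sketched as follows. To keep the sketch runnable on machines without TSX, the hardware attempt is replaced by a caller-supplied function; `hw_attempt_fn`, `run_atomic`, `tx_stats`, the two stub attempts and the retry budget are all hypothetical names and choices made for this illustration. On real hardware the attempt would be written with the RTM intrinsics `_xbegin()` and `_xend()` from `<immintrin.h>`, and the hardware transaction would also have to read the fallback lock so that it aborts while another thread runs the fallback path.

```c
#include <assert.h>
#include <stdbool.h>
#include <pthread.h>

/* Stand-in for a hardware attempt: returns true if `body` ran and committed,
 * false if the attempt aborted (in which case `body` had no effect). */
typedef bool (*hw_attempt_fn)(void (*body)(void *), void *arg);

typedef struct { int hw_commits; int fallbacks; } tx_stats;

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Try the hardware path up to `max_retries` times, then fall back to
 * executing the body under a single global lock. */
static void run_atomic(void (*body)(void *), void *arg,
                       hw_attempt_fn hw_attempt, int max_retries,
                       tx_stats *stats)
{
    assert(max_retries >= 0);
    for (int attempt = 0; attempt < max_retries; attempt++) {
        if (hw_attempt(body, arg)) {
            stats->hw_commits++;
            return;
        }
    }
    /* Best-effort hardware never guarantees a commit: software path. */
    pthread_mutex_lock(&fallback_lock);
    body(arg);
    pthread_mutex_unlock(&fallback_lock);
    stats->fallbacks++;
}

/* Stub attempts for demonstration only. */
static bool always_abort(void (*body)(void *), void *arg)
{
    (void)body; (void)arg;
    return false;
}

static bool always_commit(void (*body)(void *), void *arg)
{
    body(arg);
    return true;
}

/* Example transaction body: increments an int. */
static void increment(void *arg)
{
    (*(int *)arg)++;
}
```

With `always_abort` and a budget of three retries, `run_atomic` ends up on the lock-based path exactly once; with `always_commit` it never touches the lock. How many retries to allow, and which abort causes should skip straight to the fallback, are tuning decisions this sketch does not address.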

SLIDE 25

Compared Techniques - HyTM

Use an STM in the fallback path of TSX:

  • TSX-TL2: with reduced hardware transactions
  • TSX-NOrec: simpler, since NOrec has a single lock

SLIDE 28

STAMP results

Workload Characterization:

  Intensity  Benchmark  Time in Tx (%)  Contention
  L          kmeans     low (7)         low
  L          ssca2      low (17)        low
  M          intruder   medium (33)     high
  M          vacation   high (89)       low
  H          genome     high (97)       low
  H          yada       high (99)       medium
  H          labyrinth  high (100)      high

SLIDE 32

Plot labels

All techniques: GL, TSX-GL, TL2, TSX-TL2, NOrec, TSX-NOrec, SwissTM, TinySTM

Shown in the plots: TSX-GL, TSX-NOrec, TinySTM

Speedup / KJoule along increasing threads

SLIDE 33

kmeans - low intensity

[Plot: Speedup/Joule (y-axis, 20 to 120) against threads (x-axis, 1 to 8) for TSX-GL, TSX-NOrec and TinySTM]

  • Sequential overhead is noticeable
  • GL allows some concurrency due to L workload
  • HyTMs lag behind due to the STMs' poor performance

SLIDE 34

STAMP results - low intensity of transactions

[Plot: % of transactions aborted (axis ticks at 40 and 80) for TSX-GL, TSX-TL2 and TSX-NOrec at 1, 4 and 8 threads, broken down by abort cause: capacity, architectural, conflict, interaction]

  • 1 thread has negligible aborts
  • STMs have 15% abort rate

SLIDE 35

STAMP results - low intensity of transactions

Characterization of the Techniques

  Intensity  Benchmark  Most Performant  Least Power Consumption
  L          kmeans     TSX-GL           TSX-GL
  L          ssca2      TSX-GL           TSX-GL

SLIDE 36

intruder - medium intensity

[Plot: Speedup/Joule (y-axis, 2 to 14) against threads (x-axis, 1 to 8) for TSX-GL, TSX-NOrec and TinySTM]

  • Binding threads round-robin: > 4t uses hyper-threading
  • TSX-based approaches suffer from pressure on caches
  • Best STMs (not TL2) scale regardless

SLIDE 37

STAMP results - medium intensity of transactions

  Intensity  Benchmark  Most Performant            Least Power Consumption
  L          kmeans     TSX-GL                     TSX-GL
  L          ssca2      TSX-GL                     TSX-GL
  M          intruder   TSX-GL ≤ 4t; TinySTM ≥ 5t  TSX-GL ≤ 5t; TinySTM ≥ 6t
  M          vacation   TSX-GL ≤ 2t; TinySTM ≥ 3t  TSX-GL ≤ 4t; TinySTM ≥ 5t

SLIDE 38

yada - high intensity

[Plot: Speedup/Joule (y-axis, 0.5 to 6) against threads (x-axis, 1 to 8) for TSX-GL, TSX-NOrec and TinySTM]

  • TSX-GL does not scale
  • HyTMs follow the trend of their STM counterparts
  • When time to complete stagnates, power consumption stagnates
    ◮ Logical cores of hyper-threading
    ◮ Allow for additional hardware parallelism
    ◮ Do not consume as much additional power

SLIDE 39

STAMP results - high intensity of transactions

[Plot: % of transactions aborted (axis ticks at 40 and 80) for TSX-GL, TSX-TL2 and TSX-NOrec, broken down by abort cause: capacity, architectural, conflict, interaction]

Most conflicts are not due to data accesses

SLIDE 40

STAMP results

  Intensity  Benchmark  Most Performant            Least Power Consumption
  L          kmeans     TSX-GL                     TSX-GL
  L          ssca2      TSX-GL                     TSX-GL
  M          intruder   TSX-GL ≤ 4t; TinySTM ≥ 5t  TSX-GL ≤ 5t; TinySTM ≥ 6t
  M          vacation   TSX-GL ≤ 2t; TinySTM ≥ 3t  TSX-GL ≤ 4t; TinySTM ≥ 5t
  H          genome     TinySTM                    TinySTM
  H          yada       SwissTM                    TinySTM
  H          labyrinth  STMs (except TL2)          STMs (except TL2)
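
The "Most Performant" column of this table can be read as a lookup keyed by benchmark and thread count. The sketch below encodes only what the table states; the type `perf_row`, the function `best_performer` and the use of 8 as the upper thread bound (the number of hardware threads on the test machine) are assumptions of the illustration.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    const char *benchmark;
    int tsx_gl_max_threads;  /* TSX-GL is best up to this many threads (0: never) */
    const char *otherwise;   /* best technique beyond that thread count */
} perf_row;

/* "Most Performant" column of the STAMP results table. */
static const perf_row most_performant[] = {
    { "kmeans",    8, "TSX-GL" },
    { "ssca2",     8, "TSX-GL" },
    { "intruder",  4, "TinySTM" },
    { "vacation",  2, "TinySTM" },
    { "genome",    0, "TinySTM" },
    { "yada",      0, "SwissTM" },
    { "labyrinth", 0, "STMs (except TL2)" },
};

/* Returns the best-performing technique reported for `benchmark` at
 * `threads` threads, or NULL if the benchmark is not in the table. */
static const char *best_performer(const char *benchmark, int threads)
{
    assert(threads >= 1 && threads <= 8);
    size_t rows = sizeof most_performant / sizeof most_performant[0];
    for (size_t i = 0; i < rows; i++) {
        const perf_row *r = &most_performant[i];
        if (strcmp(r->benchmark, benchmark) == 0)
            return threads <= r->tsx_gl_max_threads ? "TSX-GL" : r->otherwise;
    }
    return NULL;
}
```

For example, `best_performer("intruder", 4)` yields TSX-GL while `best_performer("intruder", 5)` yields TinySTM, matching the crossover shown in the table. The "Least Power Consumption" column could be encoded the same way with its own thresholds.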

SLIDE 43

STAMP - fine-grained locking

  • Requires a per-application effort
  • Reasoning with transactions is meant to simplify programming
  • Does not change the landscape of performance and power consumption
    ◮ Additional lock acquisitions are noticeable in L workloads
    ◮ An efficient fine-grained lock scheme was not found
    ◮ TinySTM was competitive in H workloads

SLIDE 47

Lessons Learnt

  • TSX is only worthwhile in workloads with low transaction intensity.
  • STMs are the all-around champion, consistently performing well.
  • The choice of fallback, and when and how to take it, can impact performance and power consumption by 75%.
  • Existing HyTMs do not justify the complexity of using them.

SLIDE 48

Thank You

Questions?
