Low-Overhead Software Transactional Memory with Progress Guarantees



slide-1
SLIDE 1

Low-Overhead Software Transactional Memory with Progress Guarantees and Strong Semantics

Minjia Zhang, Jipeng Huang, Man Cao, Michael D. Bond

slide-2
SLIDE 2

Do We Need Efficient STM?


slide-3
SLIDE 3

Problem Solved!

Blue Gene/Q

slide-4
SLIDE 4

Problem Solved?

HTM is limited…

slide-5
SLIDE 5

Problem Solved?

  • Best-effort HTM: no completion guarantee [1]
  • Performance penalty: short transactions [2]
  • Language-level support for atomic blocks: STM fallback

A transaction, written as an atomic block:

atomic {
  from.balance -= amount;
  to.balance += amount;
}

[1] I. Calciu et al. Invyswell: A Hybrid Transactional Memory for Haswell’s Restricted Transactional Memory. In PACT, 2014.
[2] R. M. Yoo et al. Performance Evaluation of Intel Transactional Synchronization Extensions for High-Performance Computing. In SC, 2013.
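The semantics a programmer expects from the atomic block on this slide can be sketched in Python. This is only an illustration of the intended behavior: the `atomic()` helper, the `Account` class, and the use of one global lock are assumptions made for the example, not how an STM (or LarkTM) implements transactions.

```python
import threading
from contextlib import contextmanager
from dataclasses import dataclass

# Assumption: a single global lock stands in for the TM runtime.
_global_lock = threading.Lock()


@contextmanager
def atomic():
    """Run the enclosed block as if it were one indivisible step."""
    with _global_lock:
        yield


@dataclass
class Account:
    balance: int


def transfer(src: Account, dst: Account, amount: int) -> None:
    # Mirrors: atomic { from.balance -= amount; to.balance += amount; }
    with atomic():
        src.balance -= amount
        dst.balance += amount
```

A global lock gives the right result but serializes every transaction, which is the cost that hardware and software TM try to avoid.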

slide-6
SLIDE 6

Software Transactional Memory Is Slow

Existing STMs add high overhead [1,2,3]

[1] C. Cascaval et al. Software Transactional Memory: Why Is It Only a Research Toy? In CACM, 2008.
[2] A. Dragojević et al. Why STM Can Be More than a Research Toy. In CACM, 2011.
[3] R. M. Yoo et al. Kicking the Tires of Software Transactional Memory: Why the Going Gets Tough. In SPAA, 2008.

slide-7
SLIDE 7

Software Transactional Memory Is Slow

  • Existing STMs add high overhead [1,2,3]
  • Related challenges: scalability, progress guarantees, strong semantics

[1] C. Cascaval et al. Software Transactional Memory: Why Is It Only a Research Toy? In CACM, 2008.
[2] A. Dragojević et al. Why STM Can Be More than a Research Toy. In CACM, 2011.
[3] R. M. Yoo et al. Kicking the Tires of Software Transactional Memory: Why the Going Gets Tough. In SPAA, 2008.

slide-8
SLIDE 8

Challenge

Expensive to detect conflicts

T1:
atomic {
  …
  … = o.f;
  … = p.g;
  …
  o.f = …;
  p.g = …;
  …
}

T2:
o.f = …

slide-9
SLIDE 9

Challenge

Expensive to detect conflicts

T1:
atomic {
  …
  … = o.f;
  … = p.g;
  …
  o.f = …;
  p.g = …;
  …
}

T2:
p.g = …

slide-10
SLIDE 10

Challenge

Expensive to detect conflicts

T1:
atomic {
  …
  … = o.f;
  … = p.g;
  …
  o.f = …;
  p.g = …;
  …
}

T2:
t.k = …

slide-11
SLIDE 11

Challenge

Expensive to detect conflicts

T1:
atomic {
  …
  … = o.f;
  … = p.g;
  …
  o.f = …;
  p.g = …;
  …
}

T2:
instrumentation?

slide-12
SLIDE 12

12

slide-13
SLIDE 13

LarkTM Contributions

  • Adds very low overhead
  • Achieves good scalability by using a hybrid approach
  • Provides strong progress guarantees
  • Provides strong atomicity

slide-14
SLIDE 14

Key Insight

Avoid high overhead by minimizing instrumentation costs for non-conflicting accesses

slide-15
SLIDE 15

LarkTM Design

  • Per-object biased reader-writer locks [1,2]
  • Eager concurrency control
  • Piggybacking conflict detection and conflict resolution on lock transfers

[1] M. D. Bond et al. Octet: Capturing and Controlling Cross-Thread Dependences Efficiently. In OOPSLA, 2013.
[2] B. Hindman and D. Grossman. Atomicity via Source-to-Source Translation. In MSPC, 2006.
slide-16
SLIDE 16

LarkTM Design

  • Per-object biased reader-writer locks [1,2]
  • Eager concurrency control
  • Piggybacking conflict detection and conflict resolution on lock transfers

  • Minimal instrumentation and synchronization for both transactional and non-transactional non-conflicting accesses
  • Does not release locks even if transactions commit

[1] M. D. Bond et al. Octet: Capturing and Controlling Cross-Thread Dependences Efficiently. In OOPSLA, 2013.
[2] B. Hindman and D. Grossman. Atomicity via Source-to-Source Translation. In MSPC, 2006.
slide-17
SLIDE 17

Biased Locks

Object o: field f; lock state
slide-18
SLIDE 18

Biased Locks

Object o: field f; lock state ∈ {WrExT, RdExT, RdSh}
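The three lock states can be modeled as a small data type, with a check for when an access needs no state change (the fast path the design aims to keep cheap). This is a simplified Python sketch: the names `Kind`, `LockState`, and `is_fast_path` are hypothetical, the subscript T is read as the owning thread, and details of the real protocol (for example, how read-shared state is tracked) are left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    WR_EX = "WrEx"  # write-exclusive for one thread
    RD_EX = "RdEx"  # read-exclusive for one thread
    RD_SH = "RdSh"  # read-shared, no single owner


@dataclass(frozen=True)
class LockState:
    kind: Kind
    owner: Optional[str] = None  # thread name for WrEx/RdEx, None for RdSh


def is_fast_path(state: LockState, thread: str, is_write: bool) -> bool:
    """Return True if the access is allowed without changing the lock state."""
    if state.kind is Kind.WR_EX:
        # The owner may read and write freely.
        return state.owner == thread
    if state.kind is Kind.RD_EX:
        # The owner may read; a write needs a state change.
        return state.owner == thread and not is_write
    # RdSh: any thread may read; a write needs a state change.
    return not is_write
```

Any access for which `is_fast_path` returns False would have to change the lock state first, possibly by coordinating with the current owner.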
slide-19
SLIDE 19

Multi-thread Execution

Threads (over time): T1, T2
Object o: field f; lock state WrExT1
slide-20
SLIDE 20

Multi-thread Execution

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1
Object o: field f; lock state WrExT1; last txn (no value shown)
slide-21
SLIDE 21

Multi-thread Execution

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1
Object o: field f; lock state WrExT1; last txn 42 (update)
slide-22
SLIDE 22

Multi-thread Execution

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1
Undo log: add o.f
Object o: field f; lock state WrExT1; last txn 42
slide-23
SLIDE 23

Multi-thread Execution

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1
Object o: f = 1 (update); lock state WrExT1; last txn 42
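The write instrumentation built up over these slides (update last txn, add o.f to the undo log, then update f) can be sketched in Python. The classes `Obj` and `Txn` are hypothetical names for illustration, the lock-state check that precedes the write is omitted, and the rollback in `abort` is an assumption about how an undo log is typically used.

```python
class Obj:
    """An object with one field f and a record of the last transaction to touch it."""

    def __init__(self, f=0):
        self.f = f
        self.last_txn = None


class Txn:
    def __init__(self, txn_id):
        self.txn_id = txn_id
        self.undo_log = []  # entries: (object, field name, old value)

    def write(self, obj, field, value):
        obj.last_txn = self.txn_id                              # update last txn
        self.undo_log.append((obj, field, getattr(obj, field)))  # add old value to undo log
        setattr(obj, field, value)                              # eager in-place update

    def abort(self):
        # Restore old values in reverse order of the writes.
        while self.undo_log:
            obj, field, old = self.undo_log.pop()
            setattr(obj, field, old)
```

Because the update is made in place, other threads must be kept from seeing it until commit, which is what holding the object's lock provides.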
slide-24
SLIDE 24

Multi-thread Execution

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1
T2: o.f = 2
Object o: f = 1; lock state WrExT1; last txn 42
slide-25
SLIDE 25

Multi-thread Execution

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: o.f = 2
Object o: f = 1; lock state WrExT1; last txn 42

Problem! No synchronization on T1’s accesses to o
slide-26
SLIDE 26

Multi-thread Execution

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: o.f = 2 (T2 starts coordination)
Object o: f = 1; lock state WrExT1; last txn 42
slide-27
SLIDE 27

Coordination

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: o.f = 2
Object o: f = 1; lock state IntT2 (update); last txn 42
slide-28
SLIDE 28

Coordination

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: o.f = 2; request
Object o: f = 1; lock state IntT2; last txn 42
slide-29
SLIDE 29

Coordination

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: o.f = 2; request
Other diagram labels: safe point (twice); … = o.f
Object o: f = 1; lock state IntT2; last txn 42
slide-30
SLIDE 30

Coordination

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: o.f = 2; request
Other diagram labels: safe point (twice); … = o.f; Detecting Conflicts
Object o: f = 1; lock state IntT2; last txn 42
slide-31
SLIDE 31

A Transactional Conflict

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: o.f = 2; request
Other diagram labels: safe point (twice); … = o.f; Detecting Conflicts; detected conflicts; Contention Management; Resolving Conflicts
Object o: f = 1; lock state IntT2; last txn 42
slide-32
SLIDE 32

Not A Transactional Conflict

Threads (over time): T1, T2
T1: transaction start (txn id: 43); …
T2: o.f = 2; request
Other diagram labels: safe point (twice); Detecting Conflicts; no conflict
Object o: f = 1; lock state IntT2; last txn 42
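The difference between the conflict and no-conflict cases in these slides comes down to whether the object's last txn is the transaction the responding thread is still running. A minimal Python sketch, assuming that is the whole check (the function name and the use of `None` for "not in a transaction" are hypothetical):

```python
from typing import Optional


def is_transactional_conflict(obj_last_txn: Optional[int],
                              responder_current_txn: Optional[int]) -> bool:
    """Return True if the responder's active transaction last accessed the object."""
    if responder_current_txn is None:
        # The responder is not inside a transaction, so nothing can conflict.
        return False
    return obj_last_txn == responder_current_txn
```

With last txn 42 on object o, a responder still running txn 42 reports a conflict, while a responder that has moved on to txn 43 does not.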
slide-33
SLIDE 33

Coordination

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: o.f = 2; request
Other diagram labels: safe point; … = o.f; Detecting Conflicts
Object o: f = 1; lock state IntT2; last txn 42
slide-34
SLIDE 34

Coordination

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: o.f = 2; request; waiting
Other diagram labels: safe point; … = o.f; Detecting Conflicts; response
Object o: f = 1; lock state IntT2; last txn 42
slide-35
SLIDE 35

Strong Progress Guarantees

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: o.f = 2; request; waiting
Other diagram labels: safe point; … = o.f; Detecting Conflicts; response; may abort (twice)
Object o: f = 1; lock state IntT2; last txn 42
slide-36
SLIDE 36

Strong Progress Guarantees

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: o.f = 2; request; waiting
Other diagram labels: safe point; … = o.f; Detecting Conflicts; response; may abort (twice)
Object o: f = 1; lock state IntT2; last txn 42

Starvation and livelock freedom
slide-37
SLIDE 37

Strong Atomicity Semantics

Transactional vs. Transactional Conflict

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: transaction start; transactional access o.f = 2; request; waiting
Other diagram labels: safe point; … = o.f; Detecting Conflicts; response; abort
Object o: f = 1; lock state IntT2; last txn 42
slide-38
SLIDE 38

Strong Atomicity Semantics

Transactional vs. Transactional Conflict

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: transaction start; transactional access o.f = 2; request; waiting
Other diagram labels: safe point; … = o.f; Detecting Conflicts; response; abort; retry
Object o: f = 1; lock state IntT2; last txn 42
slide-39
SLIDE 39

Strong Atomicity Semantics

Transactional vs. Non-transactional Conflict

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: non-transactional access o.f = 2; request; waiting
Other diagram labels: safe point (twice); … = o.f; Detecting Conflicts; response; abort
Object o: f = 1; lock state IntT2; last txn 42
slide-40
SLIDE 40

Strong Atomicity Semantics

Transactional vs. Non-transactional Conflict

Threads (over time): T1, T2
T1: transaction start (txn id: 42); o.f = 1; …
T2: non-transactional access o.f = 2; request; waiting
Other diagram labels: safe point; … = o.f; Detecting Conflicts; response; abort; retry
Object o: f = 1; lock state IntT2; last txn 42
slide-41
SLIDE 41

Strong Atomicity Semantics

Threads (over time): T1, T2
T2: non-transactional access o.f = 2; request
Other diagram labels: response; transaction end; safe point; … = o.f; o.f = …

Non-transactional accesses → short transactions (no setting up/tearing down cost)

slide-42
SLIDE 42

No Transactional Conflict

Threads (over time): T1, T2
T1: transaction end
T2: transaction start (txn id: 51); o.f = 2; request; waiting
Other diagram labels: safe point; Detecting Conflicts; response
Object o: f = 1; lock state IntT2; last txn 42
slide-43
SLIDE 43

No Transactional Conflict

Threads (over time): T1, T2
T1: transaction end
T2: transaction start (txn id: 51); o.f = 2; request; waiting; acquire lock
Other diagram labels: safe point; Detecting Conflicts; response
Object o: f = 1; lock state WrExT2; last txn 42
slide-44
SLIDE 44

No Transactional Conflict

Threads (over time): T1, T2
T1: transaction end
T2: transaction start (txn id: 51); o.f = 2; request; waiting
Other diagram labels: safe point; Detecting Conflicts; response; update
Undo log: add o.f
Object o: f = 2; lock state WrExT2; last txn 51
slide-45
SLIDE 45

No Transactional Conflict

Threads (over time): T1, T2
T1: transaction end
T2: transaction start (txn id: 51); o.f = 2; request; waiting
Other diagram labels: safe point; Detecting Conflicts; response
Undo log: o.f
Object o: f = 2; lock state WrExT2; last txn 51

Two versions of coordination protocol
slide-46
SLIDE 46

LarkTM-O


Adds very low overhead and scales well for low-contention cases

slide-47
SLIDE 47

High-Contention Applications

Threads (over time): T1, T2
Transactions shown: txn: 42, txn: 43, txn: 51, txn: 52
Transaction bodies read and write o.f (… = o.f; …; o.f = …)
slide-48
SLIDE 48

High-Contention Applications

Threads (over time): T1, T2
Transactions shown: txn: 42, txn: 43, txn: 51, txn: 52
Transaction bodies read and write o.f (… = o.f; …; o.f = …)
Other diagram labels: request and response (repeated); safe point (twice)

slide-49
SLIDE 49

LarkTM-S


Handling High Contention

slide-50
SLIDE 50

LarkTM-S: Hybrid with Traditional Locking

Threads (over time): T1, T2
Transactions shown: txn: 42, txn: 43, txn: 51, txn: 52
Transaction bodies read and write o.f (… = o.f; …; o.f = …)
Other diagram labels: o.f = 1

o causes high contention
slide-51
SLIDE 51

LarkTM-S: Hybrid with Traditional Locking

Threads (over time): T1, T2
Transactions shown: txn: 42, txn: 43, txn: 51, txn: 52
Transaction bodies read and write o.f (… = o.f; …; o.f = …)
Other diagram labels: o.f = 1
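One way to picture the hybrid is a per-object switch from biased locking to traditional locking once an object proves contended. The slides do not state the switching policy, so everything below is an assumption for illustration: the `ContentionProfiler` class, counting conflicting lock transfers, and the threshold value are all hypothetical.

```python
from collections import Counter

CONFLICT_THRESHOLD = 3  # hypothetical cutoff, not taken from the talk


class ContentionProfiler:
    """Tracks conflicting lock transfers per object and picks a locking mode."""

    def __init__(self, threshold=CONFLICT_THRESHOLD):
        self.threshold = threshold
        self.conflicting_transfers = Counter()
        self.traditional = set()  # objects switched to traditional locking

    def record_conflicting_transfer(self, obj_id):
        self.conflicting_transfers[obj_id] += 1
        if self.conflicting_transfers[obj_id] >= self.threshold:
            self.traditional.add(obj_id)

    def mode(self, obj_id):
        return "traditional" if obj_id in self.traditional else "biased"
```

Under this sketch, an object like o that bounces between T1 and T2 ends up in traditional mode, while objects touched by a single thread keep the cheap biased path.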

slide-52
SLIDE 52

Comparison Of Concurrency Control

                Write concurrency control                   Read concurrency control
LarkTM-O        Eager per-object biased reader–writer lock  Eager per-object biased reader–writer lock
LarkTM-S        IntelSTM–LarkTM-O hybrid                    IntelSTM–LarkTM-O hybrid
IntelSTM [1,2]  Eager per-object lock                       Lazy version validation
NOrec [3]       Lazy global seqlock                         Lazy value validation

[1] B. Saha et al. McRT-STM: A High Performance Software Transactional Memory System for a Multi-Core Runtime. In PPoPP, 2006.
[2] T. Shpeisman et al. Enforcing Isolation and Ordering in STM. In PLDI, 2007.
[3] L. Dalessandro et al. NOrec: Streamlining STM by Abolishing Ownership Records. In PPoPP, 2010.

slide-53
SLIDE 53

Comparison Of Instrumentation

            Instrumented accesses
LarkTM-O    All accesses
LarkTM-S    All accesses
IntelSTM    All accesses
NOrec       All transactional accesses

Slide annotation: except redundant accesses

slide-54
SLIDE 54

Comparison Of Progress Guarantees

            Progress Guarantee
LarkTM-O    Livelock and starvation free
LarkTM-S    Livelock and starvation free
IntelSTM    None
NOrec       Livelock free

slide-55
SLIDE 55

Comparison Of Semantics

            Semantics
LarkTM-O    Strong Atomicity
LarkTM-S    Strong Atomicity
IntelSTM    Strong Atomicity
NOrec       Single Global Lock Atomicity (SLA)

slide-56
SLIDE 56
Implementation

  • LarkTM-O, LarkTM-S, IntelSTM (McRT), and NOrec
  • Developed in Jikes RVM 3.1.3
  • All STMs share features as much as possible (e.g., inlining decisions, redundant barrier analysis, name-mangling)
  • Source code publicly available on the Jikes RVM Research Archive

slide-57
SLIDE 57

Evaluation Methodology

  • TM programs
      • STAMP benchmarks
  • STM comparison
      • NOrec
      • IntelSTM
      • LarkTM-O
      • LarkTM-S
  • Platform
      • Eight 8-core processors (AMD Opteron 6272)
      • Four 8-core processors (Intel Xeon E5-4620)

slide-58
SLIDE 58

Single-Thread Performance

[Bar chart: overhead (%); no bars shown yet]

slide-59
SLIDE 59

Single-Thread Performance

[Bar chart: overhead (%), axis ticks 50–300; bars shown: NOrec; off-scale value label: 610]

slide-60
SLIDE 60

Single-Thread Performance

[Bar chart: overhead (%), axis ticks 50–300; bars shown: NOrec, IntelSTM; off-scale value labels: 610, 2870]

slide-61
SLIDE 61

Single-Thread Performance

[Bar chart: overhead (%), axis ticks 50–300; bars shown: NOrec, IntelSTM, LarkTM-O; off-scale value labels: 610, 2870]

slide-62
SLIDE 62

Single-Thread Performance

[Bar chart: overhead (%), axis ticks 50–300; bars shown: NOrec, IntelSTM, LarkTM-O, LarkTM-S; off-scale value labels: 610, 2870]

slide-63
SLIDE 63

Single-Thread Performance

[Bar chart: overhead (%), axis ticks 50–300; bars shown: NOrec, IntelSTM, LarkTM-O, LarkTM-S; off-scale value labels: 610, 2870; highlighted values: 40%, 73%]

slide-64
SLIDE 64

Speedup Geomean

[Line chart: speedup (axis ticks 0.2–2) vs. threads (1, 2, 4, 8); legend: NOrec, IntelSTM, LarkTM-O, LarkTM-S; series drawn: NOrec]

slide-65
SLIDE 65

Speedup Geomean

[Line chart: speedup (axis ticks 0.2–2) vs. threads (1, 2, 4, 8); legend: NOrec, IntelSTM, LarkTM-O, LarkTM-S; series drawn: NOrec, IntelSTM]

slide-66
SLIDE 66

Speedup Geomean

[Line chart: speedup (axis ticks 0.2–2) vs. threads (1, 2, 4, 8); legend: NOrec, IntelSTM, LarkTM-O, LarkTM-S; series drawn: NOrec, IntelSTM, LarkTM-O]

slide-67
SLIDE 67

Speedup Geomean

[Line chart: speedup (axis ticks 0.2–2) vs. threads (1, 2, 4, 8); legend: NOrec, IntelSTM, LarkTM-O, LarkTM-S; series drawn: NOrec, IntelSTM, LarkTM-O, LarkTM-S]
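A geomean speedup is the geometric mean of per-benchmark speedups at a given thread count. A short Python sketch of that computation follows; the function name is hypothetical, and the chart's actual data is not recoverable from this transcript, so no measured values are reproduced here.

```python
import math


def geomean(values):
    """Geometric mean of positive values, computed via logarithms."""
    if not values:
        raise ValueError("geomean of an empty sequence is undefined")
    if any(v <= 0 for v in values):
        raise ValueError("geomean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

The geometric mean is the usual choice for averaging ratios such as speedups, since a 2x gain on one benchmark and a 2x loss on another average out to 1.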

slide-68
SLIDE 68

Toward Practical STM

[Line chart: speedup (axis ticks 0.2–2) vs. threads (1, 2, 4, 8); series: NOrec, IntelSTM, LarkTM-O, LarkTM-S]

  • Low instrumentation overhead
slide-69
SLIDE 69

Toward Practical STM

[Line chart: speedup (axis ticks 0.2–2) vs. threads (1, 2, 4, 8); series: NOrec, IntelSTM, LarkTM-O, LarkTM-S]

  • Low instrumentation overhead
  • Scales well
slide-70
SLIDE 70

Toward Practical STM

[Line chart: speedup (axis ticks 0.2–2) vs. threads (1, 2, 4, 8); series: NOrec, IntelSTM, LarkTM-O, LarkTM-S]

  • Low instrumentation overhead
  • Scales well
  • Strong progress guarantees

slide-71
SLIDE 71

Toward Practical STM

[Line chart: speedup (axis ticks 0.2–2) vs. threads (1, 2, 4, 8); series: NOrec, IntelSTM, LarkTM-O, LarkTM-S]

  • Low instrumentation overhead
  • Scales well
  • Strong progress guarantees
  • Strong semantics

slide-72
SLIDE 72

Toward Practical STM

[Line chart: speedup (axis ticks 0.2–2) vs. threads (1, 2, 4, 8); series: NOrec, IntelSTM, LarkTM-O, LarkTM-S]

  • Low instrumentation overhead
  • Scales well
  • Strong progress guarantees
  • Strong semantics

Thank you