Extending Hardware Transactional Memory Capacity via Rollback-Only Transactions and Suspend/Resume



slide-2
SLIDE 2

Extending Hardware Transactional Memory Capacity via Rollback-Only Transactions and Suspend/Resume

POWER8-TM

Shady Issa, Pascal Felber, Alexander Matveev, Paolo Romano

slide-3
SLIDE 3

Transactional Memory

  • alternative paradigm for parallel programming
  • easy to use
  • potential to match the performance of fine-grained locking

withdraw(account, value) {
    __transaction {
        if (account.balance > value) {
            account.balance -= value;
            return account.balance;
        } else {
            return -1;
        }
    }
}

The __transaction block is executed atomically by the transactional memory implementation.

slide-4
SLIDE 4

Hardware Transactional Memory

  • Intel and IBM processors
  • implemented in the cache coherence protocol
  • cache line granularity
  • best effort
  • a software (S/W) fallback path is needed
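A minimal sketch of the best-effort pattern listed above: attempt the transaction in hardware a few times, then fall back to a single global lock (the HTM-SGL scheme). The hardware path is simulated; `try_htm`, `CapacityAbort`, `ConflictAbort` and `run_with_fallback` are hypothetical names for illustration, not a real HTM API. Real implementations typically also make the hardware path check the lock, which is omitted here.

```python
import threading


class ConflictAbort(Exception):
    """Simulated transient abort (for example, a conflict with another thread)."""


class CapacityAbort(Exception):
    """Simulated abort raised when the transaction does not fit in hardware."""


def run_with_fallback(tx_body, try_htm, lock, max_retries=5):
    """Run tx_body via the simulated hardware path, else under a global lock."""
    for _ in range(max_retries):
        try:
            return "HTM", try_htm(tx_body)
        except CapacityAbort:
            break        # retrying in hardware cannot help: footprint too large
        except ConflictAbort:
            continue     # transient abort: retry in hardware
    with lock:           # software fallback: single global lock (SGL)
        return "SGL", tx_body()


def overflowing_htm(tx_body):
    # Hypothetical hardware path that always overflows.
    raise CapacityAbort()


lock = threading.Lock()
small = run_with_fallback(lambda: 42, lambda body: body(), lock)
large = run_with_fallback(lambda: 42, overflowing_htm, lock)
print(small, large)  # ('HTM', 42) ('SGL', 42)
```

Once a transaction takes the lock, it runs alone, which is why frequent capacity aborts hurt throughput.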

slide-7
SLIDE 7

Capacity Limitations

[Figure: throughput (10^6 tx/s) and abort rate (%) of HTM-SGL as a function of transaction size. Abort categories: HTM tx, HTM non-tx, HTM capacity, lock aborts, ROT conflicts, ROT capacity. Annotations: capacity aborts; activation of the fallback path.]

slide-8
SLIDE 8

POWER8-TM

  • hardware/software co-design
  • utilises specific features available in POWER8:
      • suspend/resume
      • rollback-only transactions (ROTs)
  • to support execution of larger transactions

slide-9
SLIDE 9

Rollback-only Transaction

  • lightweight transaction type
  • updates are applied atomically
  • does not track reads
  • theoretically unbounded read-set
  • not serialisable

slide-11
SLIDE 11

ROTs

Initially X = 0.

Thread 1                  Thread 2
Begin ROT                 Begin ROT
read X (returns 0)
                          X = 1
                          End ROT
read X (returns 1)

Inconsistent value: Thread 2's write to X follows Thread 1's first read (WAR), and the ROT does not detect it.
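The anomaly above can be reproduced with a toy, single-threaded model of a ROT: writes are buffered until the end, and reads are not tracked. `ToyROT` is a hypothetical class for illustration only and does not model real hardware.

```python
class ToyROT:
    """Toy rollback-only transaction: buffered writes, untracked reads."""

    def __init__(self, memory):
        self.memory = memory
        self.writes = {}

    def read(self, addr):
        # Sees its own buffered writes, otherwise shared memory.
        # Nothing is recorded, so later overwrites by others go unnoticed.
        return self.writes.get(addr, self.memory[addr])

    def write(self, addr, value):
        self.writes[addr] = value

    def end(self):
        # Commit: publish all buffered writes at once.
        self.memory.update(self.writes)


memory = {"X": 0}
t1, t2 = ToyROT(memory), ToyROT(memory)

first = t1.read("X")    # Thread 1 reads X
t2.write("X", 1)        # Thread 2 writes X (buffered)
t2.end()                # Thread 2 commits
second = t1.read("X")   # Thread 1 reads X again
t1.end()

print(first, second)  # 0 1
```

Thread 1 observes two different values of X inside one transaction, which a serialisable execution would not allow.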

slide-14
SLIDE 14

ROTs

Initially X = 0.

Thread 1: Begin ROT, read X (returns 0), read X (returns 0)
Thread 2: Begin ROT, X = 1, End ROT

RAW: the new value can only appear at End ROT, so the reads before that point return a consistent value.

slide-15
SLIDE 15

ROTs

Initially X = 0.

Thread 1: Begin ROT, read X, read X
Thread 2: Begin ROT, X = 1, End ROT

Annotation: wait for concurrent ROTs non-transactionally.

slide-16
SLIDE 16

ROTs

Initially X = 0, Y = 0.

Thread 1: Begin ROT, read X, Y = 1, End ROT
Thread 2: Begin ROT, X = 1, read Y, End ROT

Two WAR conflicts.

slide-17
SLIDE 17

ROTs

Thread 1: Begin ROT, read X, Y = 1, End ROT
Thread 2: Begin ROT, X = 1, read Y, End ROT

Initially X = 0, Y = 0 in both threads' views. Afterwards: X = 0, Y = 1 in one view and X = 1, Y = 0 in the other.

slide-18
SLIDE 18

Touch-to-Validate

  • core algorithm of P8TM
  • makes concurrent execution of ROTs safe and serialisable
  • basic intuition: convert write-after-read (WAR) conflicts into read-after-write (RAW) conflicts

slide-23
SLIDE 23

T2V

Initially X = 0, Y = 0.

Thread 1: Begin ROT, read X, write Y, re-read X, End ROT
Thread 2: Begin ROT, write X, read Y, re-read Y, End ROT
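A toy, single-threaded sketch of the re-read ("touch") step only. It rests on a loudly stated assumption: loading a location that another uncommitted ROT has written aborts that writer, which is how this model represents a RAW conflict. The waiting between concurrent ROTs is not modelled, and `ToyROT` and its methods are hypothetical names, not P8TM's code.

```python
class ToyROT:
    """Toy ROT with a software log of read addresses (no values)."""

    def __init__(self, memory, active):
        self.memory, self.active = memory, active
        self.writes = {}     # buffered updates
        self.read_log = []   # addresses only, tracked in software
        self.aborted = False
        active.append(self)

    def _load(self, addr):
        # ASSUMPTION: a load of a location written by another uncommitted
        # ROT aborts that writer (RAW conflict).
        for other in self.active:
            if other is not self and addr in other.writes:
                other.aborted = True
                other.writes.clear()
        return self.writes.get(addr, self.memory[addr])

    def read(self, addr):
        self.read_log.append(addr)
        return self._load(addr)

    def write(self, addr, value):
        self.writes[addr] = value

    def end(self):
        if not self.aborted:
            for addr in self.read_log:   # touch: re-read every logged address
                self._load(addr)
            self.memory.update(self.writes)
        self.active.remove(self)
        return not self.aborted


memory, active = {"X": 0, "Y": 0}, []
t1, t2 = ToyROT(memory, active), ToyROT(memory, active)

t1.read("X")
t2.read("Y")
t2.write("X", 1)
t1.write("Y", 1)

ok1 = t1.end()   # touching X hits Thread 2's buffered write: Thread 2 aborts
ok2 = t2.end()
print(ok1, ok2, memory)  # True False {'X': 0, 'Y': 1}
```

In this model the two WAR conflicts no longer let both ROTs commit: the re-read turns one of them into a RAW conflict, so only one transaction survives.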

slide-24
SLIDE 24

T2V

  • needs to track only the addresses
  • this must be done in software
  • how can software outperform hardware?


slide-26
SLIDE 26

TMCAM

[Figure: the TMCAM drawn as a table with 64 entries]

Begin HTM, read A, read B, read C, read D, write E, End HTM

TMCAM entries used: &A, &B, &C, &D, &E

slide-31
SLIDE 31

Read-set Tracking

[Figure: the TMCAM drawn as a table with 64 entries of 128 bytes each]

Begin ROT, read A, store &A, read B, store &B, read C, store &C, read D, store &D, write E, End ROT

TMCAM entries used: one entry holding &A, &B, &C, &D (8 bytes per address) and one entry for &E

up to 16x larger read-set
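The "up to 16x" figure follows from the sizes on the slide: one 128-byte line holds sixteen 8-byte addresses. A small worked calculation, assuming every HTM read touches a distinct cache line and the ROT address log is densely packed (function and constant names are illustrative):

```python
CACHE_LINE_BYTES = 128   # size of one tracked line, as on the slide
ADDRESS_BYTES = 8        # size of one logged read address, as on the slide
TMCAM_ENTRIES = 64       # number of entries drawn on the slide


def htm_read_capacity(lines_written=0):
    """Distinct cache lines a regular HTM transaction can still read."""
    return TMCAM_ENTRIES - lines_written


def rot_read_capacity(lines_written=0):
    """Read addresses a ROT can log if each remaining entry holds a full line."""
    addresses_per_line = CACHE_LINE_BYTES // ADDRESS_BYTES
    return (TMCAM_ENTRIES - lines_written) * addresses_per_line


print(htm_read_capacity(), rot_read_capacity())  # 64 1024
print(rot_read_capacity() // htm_read_capacity())  # 16
```

The ratio is an upper bound: reads that share a cache line already cost HTM a single entry, so the gain is smaller for workloads with good spatial locality.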

slide-32
SLIDE 32

HTM

  • some transactions may fit in HTM
  • for those, we need to avoid the extra overhead of using ROTs
  • try HTM first; if it overflows, fall back to ROT
  • how can HTMs and ROTs run concurrently?

slide-34
SLIDE 34

HTM + ROT

Initially X = 0, Y = 0.

Thread 1: Begin HTM, read X, Y = 1, End HTM
Thread 2: Begin ROT, X = 1, End ROT

HTM is protected by H/W

slide-38
SLIDE 38

HTM + ROT

Initially X = 0, Y = 0.

Thread 1: Begin HTM, read X, Y = 1, End HTM
Thread 2: Begin ROT, read Y (returns 0), read Y (returns 1), End ROT

HTM is protected by H/W

Inconsistent value in the ROT. Annotation: T2V.

slide-39
SLIDE 39

HTM + ROT

Initially X = 0, Y = 0.

Thread 1: Begin HTM, read X, Y = 1, End HTM
Thread 2: Begin ROT, read Y (returns 0), read Y (returns 0), End ROT

HTM is protected by H/W

Consistent value using S/R (suspend/resume). Annotation: T2V.

slide-40
SLIDE 40

Uninstrumented Read-only (URO)

  • read-only transactions without any instrumentation
  • run outside the context of HTM or ROT
  • no bounds on Tx size
  • HTMs and ROTs must wait for UROs

slide-42
SLIDE 42

POWER8-TM

[Diagram: Transaction; update Tx: HTM, ROT, GL; read-only Tx: w/o instrumentation]

Annotations: small; overkill.

slide-43
SLIDE 43

POWER8-TM

[Diagram: Transaction; update Tx: HTM, ROT, GL; read-only Tx: w/o instrumentation]

Annotations: large; useless.

slide-44
SLIDE 44

Self-tuning

  • lightweight, online reinforcement learning
  • determine execution path:
      • HTM -> GL: small Txs
      • ROT -> GL: large Txs
      • HTM -> ROT -> GL: mixed workload
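One way such online path selection could look is a UCB1 multi-armed bandit over the three configurations above. This is a sketch under assumptions: the P8TMUCB label in the evaluation suggests a UCB-style policy, but the reward values and the `UCB1` class here are hypothetical and not taken from the talk.

```python
import math


class UCB1:
    """Standard UCB1 bandit over a fixed set of arms."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {arm: 0 for arm in self.arms}
        self.totals = {arm: 0.0 for arm in self.arms}

    def select(self):
        for arm in self.arms:            # play every arm once first
            if self.counts[arm] == 0:
                return arm
        n = sum(self.counts.values())

        def score(arm):
            mean = self.totals[arm] / self.counts[arm]
            return mean + math.sqrt(2 * math.log(n) / self.counts[arm])

        return max(self.arms, key=score)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


PATHS = ["HTM->GL", "ROT->GL", "HTM->ROT->GL"]
# Hypothetical normalised throughput per path for a large-transaction workload.
REWARD = {"HTM->GL": 0.2, "ROT->GL": 0.9, "HTM->ROT->GL": 0.6}

tuner = UCB1(PATHS)
for _ in range(500):
    path = tuner.select()
    tuner.update(path, REWARD[path])

best = max(PATHS, key=lambda p: tuner.counts[p])
print(best)  # ROT->GL
```

With rewards measured at run time instead of the fixed table, the same loop would shift towards HTM -> GL when transactions are small.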

slide-47
SLIDE 47

Evaluation: Vacation

[Figure: throughput (10^5 tx/s) versus number of threads (2 to 64) for P8TM, P8TMUCB, HTM-SGL and HyNoRec, with a breakdown of commits (%) by path (HTM, ROT, GL/STM, URO) for each system. Machine with 10 physical cores. Annotations: >3x; committing in h/w.]

slide-50
SLIDE 50

Evaluation: SSCA2

[Figure: throughput (10^6 tx/s) versus number of threads (2 to 64) for P8TM, P8TMUCB, HTM-SGL and HyNoRec, with a breakdown of commits (%) by path (HTM, ROT, GL/STM, URO) for each system. Annotations: small Txs; UCB disables ROTs.]

slide-51
SLIDE 51

Conclusion

  • POWER8-TM exploits ROTs and suspend/resume to overcome HTM capacity limitations
  • TMCAM-aware read-set tracking was necessary
  • self-tuning was effective in adapting to different workloads
  • POWER8-TM shows the value of such hardware features, which enable innovative techniques to mitigate hardware limitations

slide-52
SLIDE 52

Results: read-set tracking

[Figure: abort rate (%) by cause (HTM non-tx, lock aborts, HTM capacity, ROT conflicts, ROT capacity) for SE++, SE and TE at bucket lengths 20, 50, 100, 266, 800, 1333 and 2666; and speedup w.r.t. HTM-SGL for SE, SE++, TE and HTM-SGL with almost no contention.]

slide-55
SLIDE 55

Evaluation: Vacation

[Figure: throughput (10^5 tx/s) versus number of threads (2 to 64) for P8TM, HERWL, P8TMUCB, HTM-SGL, HyNoRec and NoRec, with a breakdown of commits (%) by path (HTM, ROT, GL/STM, URO) for HyNoRec, HTM-SGL, P8TMUCB and P8TM. Machine with 10 physical cores. Annotations: >3x; committing in h/w.]

slide-58
SLIDE 58

Evaluation: SSCA2

[Figure: throughput (10^6 tx/s) versus number of threads (2 to 64) for P8TM, HERWL, P8TMUCB, HTM-SGL, HyNoRec and NoRec, with a breakdown of commits (%) by path (HTM, ROT, GL/STM, URO) for HyNoRec, HTM-SGL, P8TMUCB and P8TM. Annotations: small Txs; UCB disables ROTs.]