SLIDE 1

On the Performance of Window-Based Contention Managers for Transactional Memory

Gokarna Sharma and Costas Busch Louisiana State University

SLIDE 2

Agenda

  • Introduction and Motivation
  • Previous Studies and Limitations
  • Execution Window Model

➢ Theoretical Results
➢ Experimental Results

  • Conclusions and Future Directions
SLIDE 3

Retrospective

  • 1993

➢ A seminal paper by Maurice Herlihy and J. Eliot B. Moss: “Transactional Memory: Architectural Support for Lock-Free Data Structures”

  • Today

➢ Several STM/HTM implementation efforts by Intel, Sun, IBM; growing attention

  • Why TM?

➢ Traditional approaches using locks and monitors have many drawbacks: error-prone, hard to use, poor composability, …

Lock: only one thread can execute

  lock data; modify/use data; unlock data

TM: many threads can execute

  atomic { modify/use data }

SLIDE 4

Transactional Memory

  • Transactions perform a sequence of read and write operations on shared resources and appear to execute atomically
  • TM may allow transactions to run concurrently, but the results must be equivalent to some sequential execution

Example (initially x == 1, y == 2):

  T1: atomic { x = 2; y = x + 1; }
  T2: atomic { r1 = x; r2 = y; }

  T1 then T2: r1 == 2, r2 == 3
  T2 then T1: r1 == 1, r2 == 2
  Interleaved (incorrect): T2 reads x before T1's writes and y after them, giving r1 == 1, r2 == 3, which matches no serial order

  • ACI(D) properties to ensure correctness
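The serializability check above can be sketched in Python: enumerate both serial orders of T1 and T2 and test whether an observed outcome matches one of them. This is a minimal illustration; the helper names are hypothetical, not part of any TM system.

```python
from itertools import permutations

# Hypothetical helpers modelling the two atomic blocks from the slide.
def t1(state):
    # T1: atomic { x = 2; y = x + 1; }
    state["x"] = 2
    state["y"] = state["x"] + 1

def t2(state, result):
    # T2: atomic { r1 = x; r2 = y; }
    result["r1"] = state["x"]
    result["r2"] = state["y"]

def serial_outcomes():
    """(r1, r2) outcomes of every serial order of T1 and T2."""
    outcomes = set()
    for order in permutations(("T1", "T2")):
        state, result = {"x": 1, "y": 2}, {}
        for name in order:
            t1(state) if name == "T1" else t2(state, result)
        outcomes.add((result["r1"], result["r2"]))
    return outcomes

# T1-then-T2 gives (2, 3); T2-then-T1 gives (1, 2).
# The interleaved outcome (1, 3) matches neither serial order,
# so that execution is not serializable.
assert (1, 3) not in serial_outcomes()
```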

SLIDE 5

Software TM Systems

Conflicts:

➢ A contention manager decides how to resolve a conflict
➢ It aborts or delays one of the conflicting transactions

Centralized or Distributed:

➢ Each thread may have its own CM

Example (initially x == 1, y == 1):

  T1: atomic { … x = 2; }
  T2: atomic { y = 2; … x = 3; }

  T1 and T2 conflict on x. The CM can:

  ➢ Abort T1: undo its changes (set x == 1) and restart it, OR
  ➢ Abort T2: undo its changes (set y == 1) and restart it, OR
  ➢ Make one transaction wait and retry
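A minimal sketch of one way a CM can make this decision, assuming a Greedy-style timestamp rule (the older transaction wins). The class and function names are hypothetical, not DSTM2's API.

```python
class Transaction:
    def __init__(self, tid, start_time):
        self.tid = tid
        self.start_time = start_time  # priority: older (smaller) wins

def resolve_conflict(attacker, victim):
    """Timestamp rule sketch: the transaction with the older start
    time proceeds; the younger one is aborted (its writes undone, then
    it restarts) or made to wait and retry."""
    if attacker.start_time < victim.start_time:
        return ("abort", victim)  # attacker is older: victim aborts
    return ("wait", attacker)     # attacker is younger: it waits/retries

t1 = Transaction("T1", start_time=5)
t2 = Transaction("T2", start_time=9)
action, loser = resolve_conflict(t1, t2)  # T1 is older, so T2 loses
```

Restarting an aborted transaction with its original timestamp guarantees it eventually becomes the oldest and wins, which avoids starvation.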

SLIDE 6

Transaction Scheduling

The most common model:

➢ m concurrent transactions on m cores that share s objects
➢ Each transaction is a sequence of operations, and an operation takes one time unit
➢ Transaction duration is fixed

Throughput Guarantees:

➢ Makespan: the time needed to commit all m transactions
➢ Competitive ratio: makespan of my CM / makespan of optimal CM

Problem Complexity:

➢ NP-hard (related to vertex coloring)

Challenge:

➢ How to schedule transactions so that makespan is minimized?
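The connection to vertex coloring can be sketched as follows: color the conflict graph greedily, run each color class as one concurrent round, and the makespan is the number of colors times the transaction duration τ. This is an illustrative heuristic only, not an optimal scheduler (the problem is NP-hard).

```python
def greedy_schedule(conflicts, n_txns, tau=1):
    """Greedy vertex coloring of the conflict graph: transactions with
    the same color do not conflict and can run concurrently; each
    color class takes tau time, so makespan = (#colors) * tau."""
    color = {}
    for v in range(n_txns):
        used = {color[u] for u in conflicts.get(v, ()) if u in color}
        c = 0
        while c in used:  # smallest color unused by colored neighbors
            c += 1
        color[v] = c
    makespan = (max(color.values()) + 1) * tau
    return color, makespan

# Hypothetical instance: transactions 0-1 and 1-2 share objects
# (conflict edges); transaction 3 is independent.
conflicts = {0: [1], 1: [0, 2], 2: [1], 3: []}
_, makespan = greedy_schedule(conflicts, 4)  # two rounds suffice
```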

SLIDE 7

Literature

  • Lots of proposals

➢ Polka, Priority, Karma, SizeMatters, …

  • Drawbacks

➢ Some need globally shared data (e.g., a global clock)
➢ Workload dependent
➢ Many have no provable theoretical properties
  ✓ e.g., Polka – but overall good empirical performance

  • Mostly empirical evaluation using different benchmarks

➢ The choice of contention manager significantly affects performance
➢ Many do not perform well in the worst case (i.e., as contention, system size, and number of threads increase)

SLIDE 8

Literature on Theoretical Bounds

Guerraoui et al. [PODC’05]: First contention manager GREEDY with O(s²) competitive bound

Attiya et al. [PODC’06]: Bound of GREEDY improved to O(s)

Schneider and Wattenhofer [ISAAC’09]: RandomizedRounds with O(C · log m) (C is the maximum degree of a transaction in the conflict graph)

Attiya et al. [OPODIS’09]: Bimodal scheduler with O(s) bound for read-dominated workloads

Sharma and Busch [OPODIS’10]: Two algorithms with O(√s) and O(√s · log n) bounds for balanced workloads

SLIDE 9

Objectives

Scalable transactional memory scheduling:

➢ Design contention managers that exhibit both good theoretical and good empirical performance guarantees
➢ Design contention managers that scale well with system size and complexity

SLIDE 10

Execution Window Model

  • Collection of n sets of m concurrent transactions that share s objects

[Figure: an m × n execution window; m threads, each executing a sequence of n transactions]

Assuming maximum degree C in the conflict graph and execution time duration τ:

➢ Serialization upper bound: τ · min(Cn, mn)
➢ One-shot bound: O(sn) [Attiya et al., PODC’06]
➢ Using RandomizedRounds: O(τ · Cn log m)

SLIDE 11

Theoretical Results

  • Offline Algorithm: (maximal independent sets)

➢ For scheduling-with-conflicts environments, e.g., traffic intersection control, the dining philosophers problem
➢ Makespan: O(τ · (C + n log(mn))) (C is the conflict measure)
➢ Competitive ratio: O(s + log(mn)) whp

  • Online Algorithm: (random priorities)

➢ For online scheduling environments
➢ Makespan: O(τ · (C log(mn) + n log²(mn)))
➢ Competitive ratio: O(s log(mn) + log²(mn)) whp

  • Adaptive Algorithm

➢ Conflict graph and maximum degree C are both unknown
➢ Adaptively guesses C, starting from 1
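A sketch of the guessing scheme: start with a guess of 1 for the unknown degree C and double it whenever the current attempt fails, reaching the true C within O(log C) rounds. The success criterion below is a toy assumption for illustration only.

```python
def adaptive_guess(true_degree, attempt_succeeds):
    """Doubling sketch of the adaptive algorithm: C is unknown, so
    guess 1 and double the guess each time the attempt under the
    current guess fails to commit."""
    guess, rounds = 1, 0
    while True:
        rounds += 1
        if attempt_succeeds(guess, true_degree):
            return guess, rounds
        guess *= 2

# Toy success criterion (an assumption for illustration only):
# an attempt succeeds once the guess is at least the true degree.
ok = lambda guess, c: guess >= c
guess, rounds = adaptive_guess(true_degree=9, attempt_succeeds=ok)
# guesses 1, 2, 4, 8, 16: succeeds with guess 16 after 5 rounds
```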

SLIDE 12

Intuition (1)

  • Introduce random delays at the beginning of the execution window

[Figure: each thread waits a random interval before starting its window, stretching the window length from n to n']

  • Random delays help conflicting transactions shift in time, avoiding many conflicts
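The effect of random delays can be illustrated with a toy simulation: with no delay every transaction starts at time 0 and every pair collides, while spreading starts over a random interval makes simultaneous starts rare. All parameters here are hypothetical.

```python
import random

def start_collisions(m, max_delay, seed=0):
    """Count pairs of transactions that start at the same time step.
    With max_delay == 1 every transaction starts at t = 0, so all
    m*(m-1)/2 pairs collide; a larger random interval spreads the
    starts and leaves few simultaneous (hence conflicting) starts."""
    rng = random.Random(seed)
    starts = [rng.randrange(max_delay) for _ in range(m)]
    return sum(starts[i] == starts[j]
               for i in range(m) for j in range(i + 1, m))

m = 32
no_delay = start_collisions(m, max_delay=1)        # all 496 pairs collide
with_delay = start_collisions(m, max_delay=4 * m)  # far fewer collisions
```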

SLIDE 13

Intuition (2)

  • Frame-based execution to handle conflicts

[Figure: each of the m threads i waits a random delay qᵢ, then executes its transactions in fixed-size frames Fᵢ₁ … Fᵢₙ]

Makespan ≤ max{qᵢ} + (number of frames) × (frame size)
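The bound above can be evaluated directly; the numbers below are hypothetical.

```python
def frame_makespan(q, n_frames, frame_size):
    """Upper bound from the slide: every thread i first waits its
    random delay q[i], then executes its window in n_frames frames of
    frame_size steps each, so all threads finish by
    max(q) + n_frames * frame_size."""
    return max(q) + n_frames * frame_size

# Hypothetical instance: 4 threads with random delays q_i, a window of
# 3 frames, each frame 5 time steps long.
bound = frame_makespan(q=[2, 7, 4, 1], n_frames=3, frame_size=5)
# max(q) = 7, so the bound is 7 + 3 * 5 = 22
```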

SLIDE 14

Experimental Results (1)

  • Platform used

➢ Intel i7 (4-core processor) with 8GB RAM and hyperthreading on

  • Implemented window algorithms in DSTM2, an eager conflict management STM implementation

  • Benchmarks used

➢ List, RBTree, SkipList, and Vacation (from the STAMP suite)

  • Experiments were run for 10 seconds and the data plotted are averages of 6 runs

  • Contention managers used for comparison

➢ Polka – best published CM, but with no provable theoretical properties
➢ Greedy – first CM with both theoretical and empirical properties
➢ Priority – simple priority-based CM

SLIDE 15

Experimental Results (2)

Performance throughput:

➢ No of txns committed per second
➢ Measures the useful work done by a CM each time step

[Figure: committed transactions/sec vs. no of threads (5–35) for the List and SkipList benchmarks; curves: Polka, Greedy, Priority, Online, Adaptive]

SLIDE 16

Experimental Results (3)

[Figure: committed transactions/sec vs. no of threads (5–35) for the RBTree and Vacation benchmarks; curves: Polka, Greedy, Priority, Online, Adaptive]

Performance throughput:

Conclusion #1: Window CMs always improve throughput over Greedy and Priority
Conclusion #2: Throughput is comparable to Polka (and outperforms it in Vacation)

SLIDE 17

Experimental Results (4)

Aborts per commit ratio:

➢ No of txns aborted per txn committed
➢ Measures efficiency of a CM in utilizing computing resources

[Figure: no of aborts/commit vs. no of threads (5–35) for the List and SkipList benchmarks; curves: Polka, Greedy, Priority, Online, Adaptive]

SLIDE 18

Experimental Results (5)

Aborts per commit ratio:

[Figure: no of aborts/commit vs. no of threads (5–35) for the Vacation and RBTree benchmarks; curves: Polka, Greedy, Priority, Online, Adaptive]

Conclusion #3: Window CMs always reduce the no of aborts over Greedy and Priority
Conclusion #4: No of aborts is comparable to Polka (and lower in Vacation)

SLIDE 19

Experimental Results (6)

Execution time overhead:

➢ Total time needed to commit all transactions
➢ Measures scalability of a CM in different contention scenarios

[Figure: total execution time (seconds) vs. amount of contention (Low/Medium/High) for the List and SkipList benchmarks; curves: Polka, Greedy, Priority, Online, Adaptive]

SLIDE 20

Experimental Results (7)

Execution time overhead:

[Figure: total execution time (seconds) vs. amount of contention (Low/Medium/High) for the RBTree and Vacation benchmarks; curves: Polka, Greedy, Priority, Online, Adaptive]

Conclusion #5: Window CMs generally reduce execution time over Greedy and Priority (except on SkipList)
Conclusion #6: Window CMs are especially good at high contention, where the randomization overhead pays off

SLIDE 21

Future Directions

  • Encouraging theoretical and practical results
  • Plan to explore (experimental)

➢ Wasted work
➢ Repeat conflicts
➢ Average response time
➢ Average duration of committed transactions

  • Plan to do experiments using more complex benchmarks

➢ E.g., STAMP, STMBench7, and other STM implementations

  • Plan to explore (theoretical)

➢ Other contention managers with both theoretical and empirical guarantees

SLIDE 22

Conclusions

  • TM contention management is an important online scheduling problem
  • Contention managers should scale with the size and complexity of the system
  • Theoretical as well as practical performance guarantees are essential for design decisions
  • Need to explore mechanisms that scale well in other multi-core architectures:

➢ ccNUMA and hierarchical multilevel cache architectures
➢ Large-scale distributed systems

SLIDE 23

Thank you for your attention!!!