SLIDE 1

Analyzing the Performance of Lock-Free Data Structures: A Conflict-based Model

Aras Atalar, Paul Renaud-Goud and Philippas Tsigas
Chalmers University of Technology

SLIDE 3

Motivation

◮ Lock-free data structures:
    ◮ Literature and industrial applications (Intel’s Threading Building Blocks framework, Java concurrency package)
    ◮ Limitations of their lock-based counterparts: deadlocks, convoying and reduced programming flexibility
    ◮ Provide high scalability
◮ Framework to characterize the scalability:
    ◮ Facilitate lock-free designs
    ◮ Rank implementations within a fair framework

SLIDE 9

Settings

Output: data structure throughput, i.e. the number of successful operations per unit of time

Procedure AbstractAlgorithm

    Initialization();
    while !done do
        Parallel_Work();    /* Application-specific code, conflict-free */
        while !success do
            current ← Read(AP);
            new ← Critical_Work(current);
            success ← CAS(AP, current, new);

Inputs of the analysis:

◮ Platform parameters: CAS and Read latencies, in clock cycles
◮ Algorithm parameters:
    ◮ Critical Work and Parallel Work latencies, in clock cycles
    ◮ Total number of threads
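The AbstractAlgorithm procedure above can be sketched in Python as follows. This is only an illustration of the loop structure, not the authors' benchmark: the `AccessPoint` class, the counter-style `critical_work`, and the `run` helper are hypothetical names, and the CAS is emulated with a lock because Python has no hardware compare-and-swap, so the sketch is not itself lock-free.

```python
import threading


class AccessPoint:
    """Shared word AP. ASSUMPTION: CAS is emulated with a lock, for illustration only."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def read(self):
        return self._value

    def cas(self, expected, new):
        # Atomically replace the value only if nobody changed it since the read.
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def parallel_work(amount):
    # Conflict-free, application-specific work: touches no shared state.
    total = 0
    for i in range(amount):
        total += i
    return total


def critical_work(current):
    # Hypothetical critical work: derive the new value from the one just read.
    return current + 1


def worker(ap, operations, pw, stats, index):
    successes = failures = 0
    for _ in range(operations):
        parallel_work(pw)
        while True:  # retry loop
            current = ap.read()
            new = critical_work(current)
            if ap.cas(current, new):
                successes += 1
                break
            failures += 1  # another thread succeeded in between: retry
    stats[index] = (successes, failures)


def run(threads=4, operations=1000, pw=50):
    ap = AccessPoint(0)
    stats = [None] * threads
    pool = [
        threading.Thread(target=worker, args=(ap, operations, pw, stats, i))
        for i in range(threads)
    ]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return ap.read(), stats
```

Calling `run(threads=4, operations=200, pw=10)` returns the final value of the access point and a per-thread list of (successes, failures). Throughput in the sense of the slide would be the total number of successes divided by the elapsed time, which this sketch does not measure.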

SLIDE 10

Overview

[Figure: throughput (ops/msec) against parallel work (cycles) for cw = 50, threads = 8; cases: Constant, Exponential, Poisson.]

SLIDE 12

Executions Under Contention Levels

[Figure: throughput against parallel work, with an execution timeline for threads T0 to T3 and the system; legend: parallel work, successful retry, failed retry. Regime shown: low contention.]

SLIDE 13

Executions Under Contention Levels

[Figure: throughput against parallel work, with an execution timeline for threads T0 to T3 and the system; legend: parallel work, successful retry, failed retry. Regime shown: peak performance.]

SLIDE 14

Executions Under Contention Levels

[Figure: throughput against parallel work, with an execution timeline for threads T0 to T3 and the system; legend: parallel work, successful retry, failed retry. Regime shown: high contention.]
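The three regimes on these slides (low contention, peak performance, high contention) can be illustrated with a back-of-the-envelope estimate. This is not the authors' model: it assumes, as a simplification, that without conflicts each thread completes one operation every `parallel_work + retry` cycles, and that successful retries on the same access point are serialized, so at most about one success fits in each `retry` duration. The function names are hypothetical, and hardware conflicts are ignored entirely.

```python
def rough_throughput(parallel_work, retry, threads):
    """Rough successes per cycle. ASSUMPTION: toy estimate, ignores hardware conflicts."""
    # Low contention: every thread succeeds on its first attempt.
    uncontended = threads / (parallel_work + retry)
    # Contended: successful retries cannot overlap, so they are serialized.
    serialized = 1.0 / retry
    return min(uncontended, serialized)


def crossover_parallel_work(retry, threads):
    """Parallel work at which the two estimates meet (the peak in this toy picture)."""
    return (threads - 1) * retry
```

With `retry = 100` cycles and 8 threads, the two estimates meet at 700 cycles of parallel work. Above that value the estimate grows as parallel work shrinks, and below it the estimate stays flat. The flat part is where this toy picture is most optimistic, since it leaves out the hardware conflicts that the model on the later slides accounts for.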

SLIDE 15

Impacting Factors

◮ Logical Conflicts
◮ Hardware Conflicts: CAS Expansion

SLIDE 16

Logical Conflicts: (f)-Cyclic Executions

◮ Periodic: every thread is in the same state as one period before
◮ Shortest period contains exactly 1 successful attempt and exactly f fails per thread

[Figure: execution timeline for threads T0 to T3 and the system across past, present and future; legend: parallel work, successful retry, failed retry, idle thread.]
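The definition of an (f)-cyclic execution fixes the mix of attempts in each period, which a short calculation makes concrete. The helper below is a hypothetical illustration that only counts attempts; it says nothing about how long a period lasts.

```python
def cyclic_period_counts(threads, f):
    """Attempts in one shortest period of an (f)-cyclic execution.

    Each thread contributes exactly 1 successful attempt and f failed ones.
    Returns (successes, failures, fraction of attempts that succeed).
    """
    successes = threads
    failures = threads * f
    return successes, failures, successes / (successes + failures)
```

For 4 threads and f = 2, one period holds 4 successes and 8 failures, so one attempt in three succeeds. In general the fraction is 1 / (f + 1), independent of the number of threads.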

SLIDE 17

Inevitable and Wasted Failures

[Figure: two execution timelines for threads T0 to T3 and the system, set against each other ("vs.").]

SLIDE 18

Hardware Conflicts: CAS Expansion

[Figure: timeline of one retry; segments: Read & Critical Work, previously expanded CAS, expansion, CAS.]

◮ Input: Prl threads already in the retry loop
◮ A new thread attempts to CAS during the retry (Read + Critical_Work + e(Prl) + CAS), with probability h:

$$ e(\mathrm{Prl} + h) = e(\mathrm{Prl}) + h \times \int_{0}^{\mathrm{retry}} \frac{\mathrm{cost}(t)}{\mathrm{retry}} \, dt $$

SLIDE 20

Throughput: Combining Impacting Factors

◮ Input: Prl (average number of threads inside the retry loop)
  1. Calculate expansion: e(Prl)
  2. Compute amount of work in a retry: Retry = Read + Critical_Work + e(Prl) + CAS
  3. Estimate number of logical conflicts: LogicalConflicts(Retry, Parallel_Work, Threads)
◮ Result: average number of threads inside the retry loop, fed back as the input Prl
◮ Convergence via fixed point iteration
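The three steps and the fixed point iteration can be sketched generically. The slides do not give closed forms for e(Prl) or for the logical-conflict estimate, so both are passed in as callables. Every name below (`solve_prl`, `expansion`, `threads_in_retry`) is hypothetical, and the sketch shows only the iteration scheme, not the authors' formulas.

```python
def solve_prl(expansion, threads_in_retry, read, critical_work, cas,
              parallel_work, threads, prl0=1.0, tol=1e-9, max_iter=10_000):
    """Iterate Prl -> e(Prl) -> Retry -> new Prl until the value stops changing.

    expansion(prl)                  -> extra CAS latency e(Prl), in cycles
    threads_in_retry(retry, pw, p)  -> average number of threads in the retry loop
    Both callables are ASSUMPTIONS supplied by the caller.
    """
    prl = prl0
    for _ in range(max_iter):
        retry = read + critical_work + expansion(prl) + cas   # step 2
        new_prl = threads_in_retry(retry, parallel_work, threads)  # step 3
        if abs(new_prl - prl) < tol:
            return new_prl, retry
        prl = new_prl
    raise RuntimeError("fixed point iteration did not converge")


def toy_expansion(prl):
    # Placeholder: expansion grows linearly with the threads already retrying.
    return 2.0 * prl


def toy_threads_in_retry(retry, parallel_work, threads):
    # Placeholder: each thread spends a retry / (retry + parallel_work) share
    # of its time inside the retry loop.
    return threads * retry / (retry + parallel_work)
```

For example, `solve_prl(toy_expansion, toy_threads_in_retry, read=20, critical_work=50, cas=30, parallel_work=1000, threads=8)` returns a pair (Prl, Retry) at which another pass through steps 1 to 3 reproduces the same Prl. The toy callables only make the loop runnable; a real analysis would substitute the expansion and conflict estimates from the model.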

SLIDE 21

Results: Synthetic Tests

[Figure: four panels (cw = 50, threads = 4; cw = 50, threads = 8; cw = 1600, threads = 4; cw = 1600, threads = 8) plotting throughput (ops/msec) against parallel work (cycles); cases: Low, High, Average, Real.]

SLIDE 22

Back-off Optimization: Michael-Scott Queue

[Figure: throughput (ops/msec) against parallel work (cycles) for cw = 225, threads = 8; legend "Type": Exponential, Linear, New, None; legend "Value": 1, 2, 4, 8, 16, 32.]

SLIDE 23

Conclusion

◮ Focus on the cases where parallel work is constant
◮ An approach based on the estimation of logical and hardware conflicts
◮ Validate our model using synthetic tests and several reference data structures
◮ Linear combination of retry loops

SLIDE 24

Results: Treiber’s Stack

[Figure: four panels (cw = 50, threads = 6; cw = 1500, threads = 6; cw = 50, threads = 8; cw = 1500, threads = 8) plotting throughput (ops/msec) against parallel work (cycles); cases: Low, High, Average, Real.]

SLIDE 25

Discussion

[Figure: cw = 4000, threads = 6; horizontal axis: parallel work (cycles); labels: Consecutive Fail Frequency, Av. Fails per Success, Model Average, Normalized Throughput.]