Analyzing the Performance of Lock-Free Data Structures: A Conflict-based Model
Aras Atalar, Paul Renaud-Goud and Philippas Tsigas
Chalmers University of Technology
Motivation
◮ Lock-free Data Structures:
  ◮ Literature and industrial applications (Intel’s Threading Building Blocks Framework, Java concurrency package)
  ◮ Limitations of their lock-based counterparts: deadlocks, convoying and programming flexibility
  ◮ Provide high scalability
◮ Framework to characterize the scalability:
  ◮ Facilitate the lock-free designs
  ◮ Rank implementations within a fair framework
Settings
Output: Data structure throughput, i.e. number of successful operations per unit of time
Procedure AbstractAlgorithm
    Initialization();
    while !done do
        Parallel_Work();    /* Application specific code, conflict-free */
        while !success do
            current ← Read(AP);
            new ← Critical_Work(current);
            success ← CAS(AP, current, new);
Inputs of the analysis:
◮ Platform parameters: CAS and Read latencies, in clock cycles
◮ Algorithm parameters:
  ◮ Critical Work and Parallel Work latencies, in clock cycles
  ◮ Total number of threads
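
The control flow of AbstractAlgorithm can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' benchmark code: `AtomicCell`, `worker` and `run` are hypothetical names, and the CAS is emulated with a lock, so the sketch itself is not lock-free. It shows the Read / Critical_Work / CAS retry loop and counts failed retries.

```python
import threading


class AtomicCell:
    """Emulated access point (AP). Assumption: a lock stands in for hardware CAS."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        # Atomically replace the value only if it still equals `expected`.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def worker(ap, operations, critical_work, failures):
    failed = 0
    for _ in range(operations):
        # Parallel_Work() would run here: application code, conflict-free.
        success = False
        while not success:
            current = ap.read()
            new = critical_work(current)
            success = ap.cas(current, new)
            if not success:
                failed += 1  # another thread changed AP in between: retry
    failures.append(failed)


def run(threads=4, operations=1000):
    """Return (final AP value, total number of failed retries)."""
    ap = AtomicCell(0)
    failures = []
    pool = [
        threading.Thread(target=worker,
                         args=(ap, operations, lambda v: v + 1, failures))
        for _ in range(threads)
    ]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return ap.read(), sum(failures)
```

Every successful CAS increments the counter once, so the final value equals threads × operations regardless of how many retries failed; throughput is the number of successes divided by elapsed time.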
Overview
[Figure: throughput (ops/msec) versus parallel work (cycles) for cw = 50, threads = 8; cases: Constant, Exponential, Poisson]
Executions Under Contention Levels
[Figure: throughput versus parallel work, with execution timelines of threads T0 to T3 and the system (legend: parallel work, successful retry, failed retry), shown for three regimes: low contention, peak performance, high contention]
Impacting Factors
◮ Logical Conflicts
◮ Hardware Conflicts
  ◮ CAS Expansion
Logical Conflicts: (f)-Cyclic Executions
◮ Periodic: every thread is in the same state as one period before
◮ Shortest period contains exactly 1 successful attempt and exactly f fails per thread
[Figure: execution timelines of threads T0 to T3 and the system across past, present and future; legend: parallel work, successful retry, failed retry, idle thread]
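
As a worked example of the definition above: if each thread alternates one parallel-work section with f failed retries and one successful retry, all of constant duration, then one period lasts Parallel_Work + (f + 1) × Retry and the system completes one operation per thread per period. The function below is a sketch of that arithmetic only; `cyclic_throughput` and its parameter names are hypothetical, and the formula is derived here from the slide's definition rather than quoted from it.

```python
def cyclic_throughput(threads, parallel_work, retry, fails):
    """Successful operations per clock cycle in an (f)-cyclic execution.

    Assumptions: constant parallel work and retry durations (in cycles),
    and every thread performs exactly `fails` failed retries followed by
    one successful retry per period.
    """
    period = parallel_work + (fails + 1) * retry
    return threads / period
```

For instance, 4 threads with 100 cycles of parallel work, a 50-cycle retry and f = 1 give a period of 200 cycles, i.e. 0.02 operations per cycle.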
Inevitable and Wasted Failures
[Figure: two execution timelines of threads T0 to T3 and the system, compared side by side]
Hardware Conflicts: CAS Expansion
[Figure: timeline of one retry divided into Read & Critical Work, previously expanded CAS, expansion, and CAS]
◮ Input: Prl threads already in the retry loop
◮ A new thread attempts to CAS during the retry (Retry = Read + Critical_Work + e(Prl) + CAS), with probability h:
$$e(P_{rl} + h) = e(P_{rl}) + h \times \int_0^{\mathit{Retry}} \frac{\mathit{cost}(t)}{\mathit{Retry}} \, dt$$
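
The expansion recurrence can be stepped numerically, as in the sketch below. Several things here are assumptions and not taken from the slides: the starting value e(0) = 0, the step size `h`, the midpoint-rule integration, and the signature `cost(t, retry)`; the actual cost function is not given on the slide and must be supplied by the caller.

```python
def mean_cost(cost, retry, samples=1000):
    """Midpoint-rule estimate of (1 / retry) * integral of cost(t) over [0, retry]."""
    dt = retry / samples
    total = sum(cost((i + 0.5) * dt) for i in range(samples)) * dt
    return total / retry


def expansion(prl, read, critical_work, cas, cost, h=0.01):
    """Apply e(Prl + h) = e(Prl) + h * mean cost repeatedly, from 0 up to prl.

    Assumption: e(0) = 0, i.e. no expansion when the retry loop is empty.
    `cost(t, retry)` is a hypothetical caller-supplied function giving the
    extra delay caused by a CAS attempt arriving at time t within the retry.
    """
    e = 0.0
    steps = int(round(prl / h))
    for _ in range(steps):
        retry = read + critical_work + e + cas
        e += h * mean_cost(lambda t: cost(t, retry), retry)
    return e
```

With a constant cost c the mean cost is c at every step, so the sketch returns approximately c × Prl, which is a quick sanity check on the integration.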
Throughput: Combining Impacting Factors
◮ Input: Prl (average number of threads inside the retry loop)
  1. Calculate expansion: e(Prl)
  2. Compute amount of work in a retry: Retry = Read + Critical_Work + e(Prl) + CAS
  3. Estimate number of logical conflicts: LogicalConflicts(Retry, Parallel_Work, Threads)
◮ Output: average number of threads inside the retry loop
◮ Convergence via fixed point iteration
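
The three steps and the fixed point iteration can be sketched as below. The solver takes the expansion and the retry-loop estimate as caller-supplied functions, because the slides do not spell them out; `solve_prl`, `toy_expansion` and `toy_threads_in_retry` are hypothetical names, and the two toy functions are placeholders chosen only to make the loop runnable, not the model's actual formulas.

```python
def solve_prl(threads, read, critical_work, cas, parallel_work,
              expansion, threads_in_retry, tol=1e-9, max_iter=1000):
    """Iterate Prl -> e(Prl) -> Retry -> new Prl until Prl stops changing."""
    prl = 0.0
    for _ in range(max_iter):
        e = expansion(prl)                                   # step 1
        retry = read + critical_work + e + cas               # step 2
        new_prl = threads_in_retry(retry, parallel_work, threads)  # step 3
        if abs(new_prl - prl) < tol:
            return new_prl, retry
        prl = new_prl
    raise RuntimeError("fixed-point iteration did not converge")


def toy_expansion(prl):
    # Placeholder assumption: expansion grows linearly with Prl.
    return 10.0 * prl


def toy_threads_in_retry(retry, parallel_work, threads):
    # Placeholder assumption: each thread spends the fraction
    # retry / (retry + parallel_work) of its time in the retry loop,
    # ignoring failed retries.
    return threads * retry / (retry + parallel_work)
```

A call such as `solve_prl(8, 20, 50, 30, 4000, toy_expansion, toy_threads_in_retry)` returns the converged Prl and the corresponding retry length; with the real expansion and logical-conflict estimates plugged in, the same loop structure applies.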
Results: Synthetic Tests
[Figure: throughput (ops/msec) versus parallel work (cycles); panels: cw = 50, threads = 4; cw = 50, threads = 8; cw = 1600, threads = 4; cw = 1600, threads = 8; cases: Low, High, Average, Real]
Back-off Optimization: Michael-Scott Queue
[Figure: throughput (ops/msec) versus parallel work (cycles) for cw = 225, threads = 8; back-off type: Exponential, Linear, New, None; value: 1, 2, 4, 8, 16, 32]
Conclusion
◮ Focus on the cases where parallel work is constant
◮ An approach based on the estimation of logical and hardware conflicts
◮ Validate our model using synthetic tests and several reference data structures
◮ Linear combination of retry loops
Results: Treiber’s Stack
[Figure: throughput (ops/msec) versus parallel work (cycles); panels: cw = 50, threads = 6; cw = 1500, threads = 6; cw = 50, threads = 8; cw = 1500, threads = 8; cases: Low, High, Average, Real]
Discussion
[Figure: cw = 4000, threads = 6; x-axis: parallel work (cycles); labels: Consecutive Fail Frequency, Av. Fails per Success, Model Average, Normalized Throughput]