Lock-Free Search Data Structures: Throughput Modeling with Poisson Processes
Aras Atalar, Paul Renaud-Goud, Philippas Tsigas
Chalmers University of Technology
Concurrent Data Structures
◮ Concurrency:
∗ Concurrency is the overlapped execution of processes
∗ Interleaving of steps of processes
∗ Synchronization to avoid interleavings that lead to unintended states
◮ Lock-based concurrent data structures:
∗ Rely on mutual exclusion to work in isolation
∗ Limitations: deadlocks, priority inversion and limited programming flexibility (difficult to compose)
◮ Lock-free concurrent data structures:
∗ Guarantee system-wide progress
∗ Employ optimistic conflict control
∗ Limitations: harder to design and implement
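Optimistic conflict control usually takes the form of a retry loop around a compare-and-swap (CAS). The sketch below is illustrative only: Python has no hardware CAS, so the hypothetical `AtomicRef` class emulates one with an internal lock, and `lockfree_increment` is an assumed example operation, not code from the talk.

```python
import threading


class AtomicRef:
    """Hypothetical cell emulating a hardware CAS (the lock only stands in for atomicity)."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        # Atomically: write `new` only if the cell still holds `expected`.
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


def lockfree_increment(cell):
    """Optimistic retry loop: read, compute, try to publish, retry on conflict."""
    retries = 0
    while True:
        seen = cell.load()
        if cell.compare_and_swap(seen, seen + 1):
            return retries
        retries += 1  # another thread modified the cell in between


def run_demo(num_threads=4, increments=1000):
    cell = AtomicRef(0)

    def worker():
        for _ in range(increments):
            lockfree_increment(cell)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.load()
```

No increment is lost even though no thread ever waits for another to finish a whole operation; a failed CAS only costs the losing thread one retry.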
Aras Atalar Throughput of Lock-Free Search Data Structures 2 18
Related Work
◮ Theoretical results:
∗ Focus on retry loop conflicts and hardware conflicts (which exist when operations overlap in time and memory location)
∗ Amortized analyses parameterized with a measure of contention
∗ Model asynchrony with an adversarial scheduler
∗ Target worst-case execution times
◮ Empirical results:
∗ Compare the performance of different implementations
∗ Help to grasp the hardware-software interaction
◮ In this work:
∗ Study the throughput performance of lock-free search data structures
∗ Propose analytical tools that provide estimates close to what we observe in practice
Lock-free Search Data Structures
◮ A search data structure is a collection of (key, value) pairs that are stored in an organized way to allow efficient search, delete and insert operations (e.g. hash table, binary tree, skip list, linked list)
◮ Formed of basic blocks (Nodes)
◮ Accessed with Read and Modify (CAS) events
◮ Retry loop conflicts are very improbable (Nodes ≫ Threads)
Algorithm Skeleton
Output of the analysis: data structure throughput (T), i.e. the number of successful data structure operations per unit of time
Procedure AbstractAlgorithm
    while !done do
        key ← SelectKey(keyPMF);
        operation ← SelectOperation(operationPMF);
        result ← SearchDataStructure(key, operation);
◮ Key ∈ [1, Range] and Operation ∈ {Search, Insert, Delete}
◮ Memoryless and stationary key and operation selection process
◮ Inputs of the analysis:
∗ Platform parameters: data cache and TLB hit latencies and CAS latency, in clock cycles
∗ Algorithm parameters: PMFs for the key and operation selection, key range (R), total number of threads (P), expected latency of key and operation selection
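A minimal single-threaded sketch of the AbstractAlgorithm loop, assuming the key PMF is given as a list of probabilities over keys 1..Range and the operation PMF as a dictionary. The Python `set` standing in for the search data structure and all function names are illustrative assumptions, not the implementation used in the talk.

```python
import random

OPERATIONS = ("search", "insert", "delete")


def select_key(key_pmf, rng):
    """Draw a key in [1, Range] from a PMF given as a list of probabilities."""
    keys = list(range(1, len(key_pmf) + 1))
    return rng.choices(keys, weights=key_pmf, k=1)[0]


def select_operation(operation_pmf, rng):
    """Draw an operation type from a PMF given as {operation: probability}."""
    weights = [operation_pmf[op] for op in OPERATIONS]
    return rng.choices(OPERATIONS, weights=weights, k=1)[0]


def search_data_structure(store, key, operation):
    """Stand-in for the real structure: a plain set of keys (assumption)."""
    if operation == "search":
        return key in store
    if operation == "insert":
        if key in store:
            return False
        store.add(key)
        return True
    # Remaining case: delete.
    if key in store:
        store.remove(key)
        return True
    return False


def abstract_algorithm(key_pmf, operation_pmf, num_operations, seed=0):
    rng = random.Random(seed)
    store = set()
    counts = {op: 0 for op in OPERATIONS}
    for _ in range(num_operations):  # bounded replacement for "while !done"
        key = select_key(key_pmf, rng)
        operation = select_operation(operation_pmf, rng)
        search_data_structure(store, key, operation)
        counts[operation] += 1
    return counts, store
```

Each iteration draws the key and the operation independently of all previous iterations and from fixed PMFs, which is what makes the selection process memoryless and stationary.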
Impacting Factors
◮ An operation triggers a number of node accesses (Which nodes?)
◮ Latency of the operation: sum of the latencies of its accesses
[Figure: binary search tree with internal and external nodes, illustrating Search (key=3)]
Impacting Factors
◮ Identify the factors that impact the latency of an access:
∗ Capacity misses in data and TLB caches (both in sequential and concurrent executions)
∗ Coherence misses (only in concurrent executions)
∗ Execution time of CAS and stall time due to others’ CAS (only in concurrent executions)
◮ Define access latency of node $N_i$:
$$
Access_i = t_{cmp} + CAS^{exe}_i + CAS^{stall}_i + CAS^{reco}_i + \sum_{\ell} Hit^{cache_\ell}_i + \sum_{\ell} Hit^{tlb_\ell}_i \tag{1}
$$
Impacting Factors
Over a sequence of operations: Coherence Miss
◮ Step 1: P0 reads $IntNode_{key=3}$ (brings a valid copy to P0). Thread 0: Search (key=3), Read 3
◮ Step 2: P1 modifies $IntNode_{key=3}$ (invalidates the copy of P0). Thread 1: Delete (key=4), Modify 3
◮ Step 3: P0 reads $IntNode_{key=3}$ (coherence miss of P0). Thread 0: Search (key=4), Read 3
[Figure: binary search tree with internal and external nodes, shown at each of the three steps]
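The three steps can be replayed with a toy invalidation model. This is a sketch under strong assumptions: caches are unbounded (so there are no capacity misses), a Modify invalidates every other thread's copy, and the class and method names are hypothetical.

```python
class CoherenceTracker:
    """Tracks, per node, which threads hold a valid cached copy (toy model)."""

    def __init__(self):
        self.valid_copies = {}  # node -> set of threads with a valid copy
        self.ever_cached = {}   # node -> set of threads that cached it before

    def read(self, thread, node):
        valid = self.valid_copies.setdefault(node, set())
        seen = self.ever_cached.setdefault(node, set())
        if thread in valid:
            outcome = "hit"
        elif thread in seen:
            outcome = "coherence miss"  # copy was invalidated by another thread
        else:
            outcome = "cold miss"       # first access by this thread
        valid.add(thread)
        seen.add(thread)
        return outcome

    def modify(self, thread, node):
        # A successful CAS leaves the modifying thread as the only valid holder.
        self.valid_copies[node] = {thread}
        self.ever_cached.setdefault(node, set()).add(thread)


def replay_slide_example():
    tracker = CoherenceTracker()
    node = "IntNode(key=3)"
    step1 = tracker.read(0, node)   # Step 1: P0 reads the node
    tracker.modify(1, node)         # Step 2: P1 modifies the node
    step3 = tracker.read(0, node)   # Step 3: P0 reads the node again
    return step1, step3
```

In this model the second read by P0 is classified as a coherence miss only because P1's Modify happened between P0's two reads, which is the interleaving dependence the analysis has to capture.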
Approach
Observation: the latency of a node access depends on the interleaving of accesses
To estimate the latency of an access on node $N_i$:
◮ Follow the sequence of events (Read and Modify separately) on $N_i$ by a thread, when $N_i \in DS$
◮ Slice the execution into consecutive intervals, where an interval begins with a call to an operation by the thread
◮ Each interval potentially includes a Read event (resp. Modify) at $N_i$
◮ Think of a static structure: stationary and memoryless access pattern ⇒ Bernoulli process
Approach
◮ Poisson process approximation is well-conditioned if the success probability is small
◮ Dynamicity: the data structure changes state with insertions and deletions
◮ Bernoulli trials with different success probabilities ⇒ Poisson process (if the $p_j$ are small)
◮ Key characteristic: the set of nodes accessed in an operation is small compared to the set of all nodes
[Figure: distance to a Poisson process over time, for success probabilities p=0.1, p=0.2 and p=0.8]
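To see why a small success probability matters, one can compare the inter-arrival distribution of a Bernoulli process (geometric) with that of a Poisson process of the same rate (exponential). The sketch below is an illustration, not the distance measure used in the talk: the function name is hypothetical and the gap is evaluated only at integer interval counts.

```python
import math


def interarrival_gap(p, max_intervals=200):
    """Largest gap between geometric and exponential inter-arrival CDFs.

    Bernoulli process: P[X <= n] = 1 - (1 - p)**n
    Poisson process with rate p per interval: P[X <= n] = 1 - exp(-p * n)
    The gap is evaluated at integer interval counts n = 1..max_intervals.
    """
    return max(
        abs((1.0 - p) ** n - math.exp(-p * n))
        for n in range(1, max_intervals + 1)
    )


if __name__ == "__main__":
    for p in (0.1, 0.2, 0.8):
        print(f"p={p}: max CDF gap = {interarrival_gap(p):.4f}")
```

The gap shrinks as p decreases, which matches the condition stated above that the approximation behaves well when the per-interval success probability is small.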
Statistical Test: Kolmogorov–Smirnov
[Figure: empirical CDFs P[X < t] of the inter-arrival time t of Read events for four tracked keys (key0 to key3); Range: 16384, threads=4, Ins−Del: 25−25. Panels: (a) Read Events for Skiplist, (b) Read Events for Hash Table, (c) Read Events for Binary Tree, (d) Read Events for Linked List]
[Figure: empirical CDFs P[X < t] of the inter-arrival time t of CAS events for four tracked keys (key0 to key3); Range: 256, threads=4, Ins−Del: 25−25. Panels: (a) CAS Events for Skip list, (b) CAS Events for Hash Table, (c) CAS Events for Binary Tree, (d) CAS Events for Linked List]
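A Kolmogorov–Smirnov check of this kind compares the empirical CDF of measured inter-arrival times with the exponential CDF expected under a Poisson process. The sketch below computes only the KS distance, with the rate estimated from the samples; the synthetic inputs and function names are assumptions, and the measurements from the talk are not reproduced here.

```python
import math
import random


def ks_statistic_exponential(samples):
    """KS distance between the samples' empirical CDF and an exponential CDF.

    The rate is estimated from the data as 1 / mean (an assumption of this sketch).
    """
    xs = sorted(samples)
    n = len(xs)
    rate = n / sum(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = 1.0 - math.exp(-rate * x)
        # Compare against the empirical CDF just after and just before x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


def demo(seed=1, n=2000):
    rng = random.Random(seed)
    # Exponential gaps, as produced by a Poisson process (mean 1e6 time units).
    poisson_like = [rng.expovariate(1e-6) for _ in range(n)]
    # Nearly constant gaps, as produced by a periodic access pattern.
    periodic_like = [rng.uniform(0.9e6, 1.1e6) for _ in range(n)]
    return (
        ks_statistic_exponential(poisson_like),
        ks_statistic_exponential(periodic_like),
    )
```

A small distance for the first sample set and a large one for the second is the behaviour the test relies on: only inter-arrival times that are close to exponential support the Poisson assumption.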
Poisson Rates
◮ Extract the rate of events on $N_i$ (by a thread) based on a random operation at a random time, as a function of throughput ($T$):
$$
\forall e \in \{cas, read\}: \quad \lambda^{e}_i = \frac{T}{P} \times \sum_{o \in \{ins, del, src\}} \sum_{k=1}^{R} P[Op = op^{o}_k] \times P[op^{o}_k \rightsquigarrow e(N_i) \mid N_i \in D]
$$
◮ Throughput per thread: $T/P$
◮ Probability of an operation of type $o$ on key $k$: $P[Op = op^{o}_k]$
◮ Instantiate $P[op^{o}_k \rightsquigarrow e(N_i) \mid N_i \in D]$ (the probability that the operation triggers event $e$ on $N_i$) based on the particularities of the data structure
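The rate formula is a weighted sum that is easy to evaluate once the two probability tables are known. The sketch below assumes both tables are supplied as dictionaries keyed by (operation type, key); the function name and the table layout are hypothetical, and the trigger probabilities must come from a model of the specific data structure.

```python
def poisson_rate(throughput, num_threads, op_pmf, trigger_prob):
    """Rate of one event type (cas or read) on one node, per thread.

    op_pmf[(o, k)]       : P[Op = op_k^o], probability of operation type o on key k
    trigger_prob[(o, k)] : P[op_k^o triggers the event on N_i | N_i in D]
    Missing entries in trigger_prob are treated as probability 0.
    """
    per_thread_throughput = throughput / num_threads
    weight = sum(
        prob * trigger_prob.get(op_key, 0.0)
        for op_key, prob in op_pmf.items()
    )
    return per_thread_throughput * weight
```

For instance, if every operation reads the node (all trigger probabilities equal to 1), the read rate reduces to the per-thread throughput T/P.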
Poisson Rates
$P[op^{src}_{k'} \rightsquigarrow e(N_k) \mid N_k \in D]$ for the skip list:
[Figure: skip list spanning key=−∞ to key=∞ with data and routing nodes, showing Search (key=k') relative to the node with key=k; annotation: ht>2]
◮ $N_j$ is in the structure if the latest operation on $N_j$ is an insert
◮ Obtain the probability of a node being in $D$ (thanks to the memoryless and stationary access pattern)
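As a sketch of how the memoryless and stationary assumptions can yield this probability: let $p^{ins}_k$ and $p^{del}_k$ denote the probabilities that a randomly selected operation is an insert or a delete on key $k$ (notation introduced here for illustration, not taken from the slides). Searches do not change membership, so only the most recent insert or delete on key $k$ matters. Looking back through an independent, identically distributed operation sequence in steady state, and assuming $p^{ins}_k + p^{del}_k > 0$, gives:

```latex
\[
P[N_k \in D] = \frac{p^{ins}_k}{p^{ins}_k + p^{del}_k}
\]
```

For example, with equal insert and delete probabilities on a key, that key is present with probability 1/2.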
Access Latency
◮ Applying expectation to the access latency of $N_i$:
$$
E[Access_i] = t_{cmp} + E[CAS^{exe}_i] + E[CAS^{stall}_i] + E[CAS^{reco}_i] + E\Big[\sum_{\ell} Hit^{cache_\ell}_i\Big] + E\Big[\sum_{\ell} Hit^{tlb_\ell}_i\Big]
$$
◮ Express each term according to the rates at every node, $\lambda^{cas}_\star$ and $\lambda^{read}_\star$
◮ Useful properties of Poisson processes: superposition and thinning
Access Latency
Estimate the expected access latency $E[Access_i]$ for $N_i$:
◮ A thread encounters a coherence miss while accessing $N_i$ if the previous event of the thread on $N_i$ is followed by a CAS of another thread:
(i) Events from the given thread: rate $\lambda^{cas}_i + \lambda^{read}_i$
(ii) Superpose (merge) CAS events from any other thread: rate $\lambda^{cas}_i (P - 1)$
$$
P[\text{Coherence Miss on } N_i] = \frac{\lambda^{cas}_i (P - 1)}{\lambda^{cas}_i P + \lambda^{read}_i}
$$
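The coherence miss probability is the chance that, among two competing Poisson streams, the merged CAS stream of the other threads fires before the given thread's own next event. The sketch below evaluates the closed form and checks it with a Monte Carlo race of exponential waiting times; the function names and the sample rates in the usage note are assumptions.

```python
import random


def coherence_miss_probability(lam_cas, lam_read, num_threads):
    """Closed form from the slide: lam_cas (P - 1) / (lam_cas P + lam_read)."""
    return lam_cas * (num_threads - 1) / (lam_cas * num_threads + lam_read)


def simulate_coherence_miss(lam_cas, lam_read, num_threads, trials=200_000, seed=0):
    """Monte Carlo check: race the thread's next event against others' next CAS.

    Requires num_threads >= 2 and lam_cas > 0 so that both rates are positive.
    """
    rng = random.Random(seed)
    own_rate = lam_cas + lam_read              # (i) events from the given thread
    others_rate = lam_cas * (num_threads - 1)  # (ii) merged CAS of the other threads
    misses = 0
    for _ in range(trials):
        own_next = rng.expovariate(own_rate)
        others_next = rng.expovariate(others_rate)
        if others_next < own_next:
            misses += 1
    return misses / trials
```

With hypothetical rates `lam_cas = 1.0`, `lam_read = 9.0` and four threads, the closed form gives 3/13, and the simulated frequency should land close to that value.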
Access Latency vs. Throughput
Link the access latencies and rates with throughput:
◮ Little’s Law states that the expected number of threads accessing a node is the product of access rate and access latency
◮ Link latencies to throughput using Little’s Law by summing over all nodes and the application latency:
$$
P = \sum_{i=0}^{N} \big( p_i \lambda^{acc}_i \, E[Access_i] \big)
$$
$P$: total number of threads
$p_i \lambda^{acc}_i$: average arrival rate to $N_i$
$E[Access_i]$: expected latency to access $N_i$
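One possible way to turn this relation into a throughput estimate is sketched below. It rests on assumptions that are not stated on the slide: the arrival rate to each node is taken to scale linearly with throughput, $\lambda^{acc}_i = T \cdot a_i$, where $a_i$ is a hypothetical expected number of accesses to $N_i$ per operation, and the expected latencies are held fixed even though they may themselves depend on the rates.

```python
def throughput_from_littles_law(num_threads, nodes):
    """Solve P = sum_i p_i * (T * a_i) * E[Access_i] for T.

    nodes: iterable of (p_i, a_i, latency_i) where
      p_i       : weight p_i from the slide's formula
      a_i       : hypothetical expected accesses to N_i per operation,
                  so that the arrival rate is lambda_i^acc = T * a_i
      latency_i : E[Access_i] in clock cycles, held fixed here
    Returns T in operations per clock cycle.
    """
    latency_per_operation = sum(p * a * latency for p, a, latency in nodes)
    return num_threads / latency_per_operation
```

Under these assumptions the sum is the expected latency of one operation, so the estimate reduces to the number of threads divided by the per-operation latency.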
Results
[Figure: throughput (ops/sec) versus number of threads (4, 8, 12, 16) for Range: 65536 and Ins − Del settings 0 − 0, 0.5 − 0.5, 5 − 5, 10 − 10, 15 − 15, 25 − 25, 40 − 40 and 50 − 50. Panels: (a) Skip List, (b) Hash Table, (c) Binary Tree, (d) Linked List]
Conclusion
◮ Analytical tools for the throughput of lock-free search data structures
◮ Validated with: hash tables, skip lists, linked lists, binary trees
◮ Could be useful to:
∗ Compare different lock-free designs
∗ Facilitate design decisions
∗ Drive the tuning process (e.g. memory alignment strategies)