Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
Viktor Leis, Peter Boncz*, Alfons Kemper, Thomas Neumann
Technische Universität München *CWI
with some modifications by: S. Sudarshan
Viktor Leis 1 / 22
[Figure: NUMA server with four sockets (socket 0 to socket 3), each with 8 cores, 24MB L3, and its own DRAM. Bandwidth labels: 25.6GB/s and 12.8GB/s (bidirectional).]
◮ Morsel-driven query execution (work is distributed between
◮ Set of fast parallel algorithms for the most important relational
◮ Systematic approach to integrating NUMA-awareness into
◮ Focus on storage, and on individual operations (hash join,
◮ NUMA has been addressed by quite a few papers
◮ Focus of this paper is on efficiently evaluating a full query, and
◮ optimizer statically determines at query compile time how
◮ instantiates one query operator plan for each thread
◮ connects these with exchange operators, which encapsulate
◮ Thread oversubscription causes context switching
◮ Hash re-partitioning often does not pay off
◮ Exchange operators create additional copies of the tuples
[Figure: morsel-wise processing of the probe pipeline. Each morsel of R (columns Z, A) is probed against the hash tables HT(S) (columns A, B) and HT(T) (columns B, C), e.g. probe(16) then probe(8), and probe(27) then probe(10). Matching tuples are stored in the Result, e.g. (Z=a, A=16, B=8, C=v) and (Z=b, A=27, B=10, C=y).]
[Figure, shown in three steps, each pipeline executed by several threads in parallel:
Pipe 1: Scan T, Build HT(T)
Pipe 2: Scan S, Build HT(S)
Pipe 3: Scan R, Probe HT(S), Probe HT(T)]
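The three pipelines can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the morsel size, and the use of a thread pool are assumptions made for illustration, and the hash tables are built sequentially for brevity. The values come from the join figure, with columns paired by position (an assumption); under that pairing, R joins S on A and the result joins T on B.

```python
from concurrent.futures import ThreadPoolExecutor

MORSEL_SIZE = 3  # tiny for illustration only (assumed value)

def morsels(rows, size=MORSEL_SIZE):
    """Cut a relation into consecutive fixed-size morsels."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def build(rows, key):
    """Pipelines 1 and 2: scan a relation morsel by morsel, build a hash table on `key`."""
    table = {}
    for morsel in morsels(rows):
        for row in morsel:
            table.setdefault(row[key], []).append(row)
    return table

def probe_morsel(morsel, ht_s, ht_t):
    """Pipeline 3 for one morsel of R: probe HT(S) on A, then HT(T) on B."""
    out = []
    for r in morsel:
        for s in ht_s.get(r["A"], []):
            for t in ht_t.get(s["B"], []):
                out.append({**r, **s, **t})
    return out

# Values from the figure; pairing the columns by position is an assumption.
S = [{"A": a, "B": b} for a, b in zip([16, 18, 27, 5, 7], [8, 33, 10, 5, 23])]
T = [{"B": b, "C": c} for b, c in zip([8, 33, 10, 5, 23], "vxyzu")]
R = [{"Z": z, "A": a} for z, a in zip("acibejdf", [16, 7, 10, 27, 18, 5, 7, 5])]

ht_t = build(T, "B")  # Pipe 1
ht_s = build(S, "A")  # Pipe 2
with ThreadPoolExecutor(max_workers=3) as pool:  # Pipe 3: one task per morsel of R
    parts = list(pool.map(lambda m: probe_morsel(m, ht_s, ht_t), morsels(R)))
result = [row for part in parts for row in part]
```

Both hash tables are complete before any probe starts, so the probe tasks only read shared state and need no locking in this sketch.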
Insert the pointer into HT
[Figure: storage areas of the red core, the green core, and the blue core; each core processes its next morsel of R and writes to its own storage area.]
[Figure: the dispatcher. A call dispatch(Core0) returns (J1, Mr1): pipeline job J1 on morsel Mr1 is assigned to Core0. The dispatcher keeps a list of pending pipeline jobs (J1, J2, J3, possibly of different queries) and (virtual) lists of morsels to be processed (Mr1 to Mr3, Mg1 to Mg3, Mb1 to Mb3; colors indicate on which socket/core the morsel is located). Example NUMA multi-core server with 4 sockets and 32 cores; each socket has 8 cores and its own DRAM, and the sockets are linked by an interconnect.]
[Figure: execution timeline (x-axis: time) with the events "q13 arrives", "q14 arrives", "q14 finishes", and "q13 finishes".]
◮ Relations are partitioned over NUMA nodes
◮ Worker threads ask for NUMA-local morsels
◮ May steal morsels from other sockets to avoid idle workers
[Figure: two NUMA servers with four sockets each (socket 0 to socket 3, each with its own DRAM). One has 8 cores and 24MB L3 per socket, with bandwidth labels 25.6GB/s and 12.8GB/s (bidirectional). The other has 8 cores and 20MB L3 per socket, with bandwidth labels 51.2GB/s and 16.0GB/s (bidirectional).]
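The locality rule and the stealing fallback can be sketched as a toy dispatcher in Python. Everything here is a hypothetical illustration: the `Dispatcher` class, the mapping of a core to a socket by integer division, the mapping of morsel colors to sockets, and the policy of stealing from the socket with the most pending morsels are assumptions, not the paper's design. The sketch is also not thread-safe; a real dispatcher would need synchronization.

```python
from collections import deque

class Dispatcher:
    """Toy dispatcher for one pipeline job; prefers NUMA-local morsels."""

    def __init__(self, job, morsels_by_socket, cores_per_socket=8):
        self.job = job
        self.cores_per_socket = cores_per_socket
        # One queue of pending morsels per socket (hypothetical layout).
        self.queues = {s: deque(ms) for s, ms in morsels_by_socket.items()}

    def dispatch(self, core):
        """Return (job, morsel) for the requesting core, or None if no morsel is left."""
        local = core // self.cores_per_socket  # assumed core-to-socket mapping
        queue = self.queues.get(local)
        if queue:
            return self.job, queue.popleft()  # NUMA-local morsel
        # No local work: steal from the socket with the most pending morsels
        # (an assumed policy, chosen only to keep the sketch short).
        victim = max(self.queues, key=lambda s: len(self.queues[s]), default=None)
        if victim is not None and self.queues[victim]:
            return self.job, self.queues[victim].popleft()
        return None

# Red, green, and blue morsels placed on sockets 0, 1, and 2 (assumed placement).
d = Dispatcher("J1", {0: ["Mr1", "Mr2", "Mr3"],
                      1: ["Mg1", "Mg2", "Mg3"],
                      2: ["Mb1", "Mb2", "Mb3"]})
first = d.dispatch(0)    # Core0 sits on socket 0 and gets a local morsel
second = d.dispatch(8)   # Core8 sits on socket 1 and gets a local morsel
stolen = d.dispatch(24)  # socket 3 holds no morsels, so this core steals
```

Because a worker only steals once its local queue is empty, remote memory accesses occur only when the alternative would be an idle core.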
[Figure: parallel aggregation in two phases.
Phase 1: local pre-aggregation. Each thread groups the (K, V) tuples of its morsels in a local hash table (ht) and spills the entries into partitions (Partition 0 ... Partition 3) when ht becomes full, then continues with the next morsel.
Phase 2: aggregate partition-wise. Each partition is grouped in a hash table (HT), producing Result ptn 0, Result ptn 1, ...]
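The two phases can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the aggregate is a sum of V per K (which matches the ht values in the figure), `k % NUM_PARTITIONS` stands in for a real hash function, the capacity is set to 2 only to force spilling, the function names are hypothetical, and the two workers run one after the other rather than in parallel.

```python
NUM_PARTITIONS = 4  # the figure shows Partition 0 ... Partition 3
HT_CAPACITY = 2     # assumed; tiny on purpose so that spilling happens

def pre_aggregate(rows):
    """Phase 1: one worker sums V per K in a bounded local table, spilling when full."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    ht = {}

    def spill():
        # Move every partial aggregate to the partition chosen by its key.
        for k, v in ht.items():
            partitions[k % NUM_PARTITIONS].append((k, v))
        ht.clear()

    for k, v in rows:
        if k not in ht and len(ht) == HT_CAPACITY:
            spill()
        ht[k] = ht.get(k, 0) + v
    spill()  # flush what is left at the end of the input
    return partitions

def aggregate_partition(p, per_worker):
    """Phase 2: sum all partial aggregates that any worker spilled to partition p."""
    out = {}
    for partitions in per_worker:
        for k, v in partitions[p]:
            out[k] = out.get(k, 0) + v
    return out

# (K, V) pairs from the figure, split between two hypothetical workers.
rows = [(8, 9), (3, 2), (13, 7), (3, 8), (3, 4),
        (10, 7), (33, 22), (4, 17), (33, 4), (8, 7)]
per_worker = [pre_aggregate(rows[:5]), pre_aggregate(rows[5:])]

result = {}
for p in range(NUM_PARTITIONS):
    # Partitions hold disjoint key sets, so their results never collide.
    result.update(aggregate_partition(p, per_worker))
```

Since every key maps to exactly one partition, the partitions in phase 2 can be processed independently of each other.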
[Figure: 22 panels, numbered 1 to 22. x-axis: threads (1 to 64); y-axis: speedup over HyPer. Systems compared: HyPer (full-fledged), HyPer (not NUMA aware), HyPer (non-adaptive), Vectorwise.]