Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework



slide-1
SLIDE 1

Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age

Viktor Leis, Peter Boncz*, Alfons Kemper, Thomas Neumann

Technische Universität München *CWI

with some modifications by: S. Sudarshan

Viktor Leis 1 / 22

slide-2
SLIDE 2

Introduction

◮ Number of CPU cores keeps growing: 4-socket Ivy Bridge EX with 60 cores, 120 threads, 1TB RAM ($50,000)
◮ These systems support terabytes of NUMA RAM: disk is not a bottleneck
◮ For analytic workloads intra-query parallelization is necessary to utilize such systems

[Figure: 4-socket NUMA topology (sockets 0 to 3); each socket has 8 cores, 24MB L3, and its own DRAM. Bandwidth labels: 25.6GB/s and 12.8GB/s (bidirectional).]


slide-3
SLIDE 3

Contributions

◮ We present an architectural blueprint for a query engine incorporating the following:
  ◮ Morsel-driven query execution (work is distributed between threads dynamically using work stealing)
  ◮ Set of fast parallel algorithms for the most important relational operators
  ◮ Systematic approach to integrating NUMA-awareness into database systems
◮ Lots of prior work on algorithms for main-memory databases
  ◮ Focus on storage, and on individual operations (hash join, merge join, aggregation, ...)
  ◮ NUMA has been addressed by quite a few papers
◮ Focus of this paper is on efficiently evaluating a full query, and on algorithms that support pipelined evaluation

slide-4
SLIDE 4

Related Work: Volcano-Style Parallelism (1)

◮ Encapsulation of Parallelism in the Volcano Query Processing System, Goetz Graefe, SIGMOD 1990 (SIGMOD Test of Time Award 2000)
◮ Plan-driven approach:
  ◮ optimizer statically determines at query compile time how many threads should run
  ◮ instantiates one query operator plan for each thread
  ◮ connects these with exchange operators, which encapsulate parallelism and manage threads
◮ Elegant model which is used by many systems

[Figure: Volcano-style plan in which three plan instances over the partitions R1, R2, R3 of relation R are connected by the exchange operators XchgHashSplit(3:3) and Xchg(3:1).]

slide-5
SLIDE 5

Volcano-Style Parallelism (2)

+ Operators are largely oblivious to parallelism
+ Great for shared-nothing parallel systems
− But can do better for shared-memory parallel systems with all data in memory
− Static work partitioning can cause load imbalances
− Degree of parallelism cannot easily be changed mid-query
− Not NUMA-aware
− Overhead:
  ◮ Thread oversubscription causes context switching
  ◮ Hash re-partitioning often does not pay off
  ◮ Exchange operators create additional copies of the tuples
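The load-imbalance point can be made concrete with a small simulation. This is a sketch, not code from the slides: the cost values and the function names `staticMakespan` and `dynamicMakespan` are hypothetical. It compares a static split, where every thread receives a fixed contiguous range up front, with handing each small work unit to whichever thread becomes idle first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Finish time when the input is cut into `threads` contiguous ranges up front.
// cost[i] is the (hypothetical) time needed to process work unit i.
long staticMakespan(const std::vector<long>& cost, std::size_t threads) {
    assert(threads > 0);
    std::size_t per = (cost.size() + threads - 1) / threads;  // units per thread
    long worst = 0;
    for (std::size_t t = 0; t < threads; ++t) {
        long sum = 0;
        for (std::size_t i = t * per; i < std::min(cost.size(), (t + 1) * per); ++i)
            sum += cost[i];
        worst = std::max(worst, sum);  // the slowest thread decides the finish time
    }
    return worst;
}

// Finish time when each unit goes to the thread that becomes idle first.
long dynamicMakespan(const std::vector<long>& cost, std::size_t threads) {
    assert(threads > 0);
    std::vector<long> busyUntil(threads, 0);
    for (long c : cost) {
        auto next = std::min_element(busyUntil.begin(), busyUntil.end());
        *next += c;  // the earliest-free thread takes the next unit
    }
    return *std::max_element(busyUntil.begin(), busyUntil.end());
}
```

With the skewed costs `{8, 8, 8, 8, 1, 1, 1, 1}` and two threads, the static split leaves one thread with all the expensive units (finish time 32), while dynamic assignment balances them (finish time 18).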

slide-6
SLIDE 6

Morsel-Driven Query Execution (1)

◮ Break input into constant-sized work units (“morsels”)
◮ Dispatcher assigns morsels to worker threads
◮ # worker threads = # hardware threads
◮ Operators are designed for parallel execution

[Figure: the Dispatcher hands morsels of R to worker threads; each worker probes the hash tables HT(S) and HT(T) for the tuples of its morsel, e.g. probe(16), probe(10), probe(8), probe(27), and stores matches with columns Z, A, B, C in the Result.]
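One way to picture how morsels are handed out is an atomic cursor over the input. The slides do not show dispatcher code, so everything here is an assumption for illustration: the names `Morsel`, `MorselSource`, and `sumColumn` are hypothetical, and NUMA placement is ignored.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// A morsel is a half-open range [begin, end) of tuple indexes.
struct Morsel { std::size_t begin, end; };

// Hands out constant-sized morsels; next() may be called from many worker threads.
class MorselSource {
public:
    MorselSource(std::size_t tuples, std::size_t morselSize)
        : tuples_(tuples), morselSize_(morselSize) { assert(morselSize > 0); }

    // Returns false once the input is exhausted.
    bool next(Morsel& m) {
        std::size_t begin = cursor_.fetch_add(morselSize_);  // claim the next range
        if (begin >= tuples_) return false;
        m = {begin, std::min(begin + morselSize_, tuples_)};
        return true;
    }

private:
    std::size_t tuples_, morselSize_;
    std::atomic<std::size_t> cursor_{0};
};

// What one worker does: keep asking for morsels until none are left.
long sumColumn(const std::vector<int>& column, MorselSource& source) {
    long sum = 0;
    Morsel m;
    while (source.next(m))
        for (std::size_t i = m.begin; i < m.end; ++i) sum += column[i];
    return sum;
}
```

Because a worker only asks for the next morsel after finishing the previous one, fast workers automatically process more morsels than slow ones.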

slide-7
SLIDE 7

Morsel-Driven Query Execution (2)

◮ Each pipeline is parallelized individually using all threads

[Figure, built up over four animation steps: a query plan that joins R with S and T is split into three pipelines, each executed by all worker threads. Pipe 1: Scan T, Build HT(T). Pipe 2: Scan S, Build HT(S). Pipe 3: Scan R, Probe HT(S), Probe HT(T).]

slide-11
SLIDE 11

Parallel In-Memory Hash Join

1. Several algorithms have been proposed earlier for parallel in-memory hash join
2. Option 1: partition the relation and process each partition in parallel
3. Option 2: build a global hash table on the build relation, but parallelize both building and probing
4. Earlier work shows Option 2 is better
5. Key issues: maximize locality, minimize synchronization

slide-12
SLIDE 12

NUMA-aware Processing of Build Phase

◮ Phase 1: process T morsel-wise and store NUMA-locally
◮ Phase 2: scan NUMA-local storage area and insert pointers into HT

[Figure: morsels of T are written to the storage areas of the red, green, and blue cores; each storage area is then scanned and a pointer to each tuple is inserted into the global hash table.]
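The two build phases can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the engine's code: `Tuple`, `materialize`, `buildTable`, and `lookup` are hypothetical names, morsels are assigned round-robin instead of by a dispatcher, the key itself stands in for a hash value, and pointers are linked without synchronization.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tuple { std::uint64_t key; int payload; Tuple* next; };

// Phase 1: each worker copies the tuples of its morsels into its own storage area.
// Morsel m goes to worker m % workers here; a real dispatcher decides dynamically.
std::vector<std::vector<Tuple>> materialize(const std::vector<Tuple>& t,
                                            std::size_t workers, std::size_t morselSize) {
    assert(workers > 0 && morselSize > 0);
    std::vector<std::vector<Tuple>> areas(workers);
    for (std::size_t begin = 0, m = 0; begin < t.size(); begin += morselSize, ++m) {
        std::size_t end = std::min(begin + morselSize, t.size());
        std::vector<Tuple>& area = areas[m % workers];
        area.insert(area.end(), t.begin() + begin, t.begin() + end);
    }
    return areas;
}

// Phase 2: the tuple count is known after phase 1, so the table is allocated once;
// every storage area is scanned and a pointer to each tuple is linked into the table.
std::vector<Tuple*> buildTable(std::vector<std::vector<Tuple>>& areas) {
    std::size_t total = 0;
    for (const auto& area : areas) total += area.size();
    std::size_t slots = 1;
    while (slots < 2 * total) slots *= 2;  // power of two, at most half full
    std::vector<Tuple*> table(slots, nullptr);
    for (auto& area : areas)
        for (Tuple& tup : area) {
            std::size_t slot = tup.key & (slots - 1);  // stand-in for a real hash
            tup.next = table[slot];                    // chain to the previous head
            table[slot] = &tup;                        // tuples stay where they are
        }
    return table;
}

// Follows the chain of one slot; returns nullptr when the key is absent.
const Tuple* lookup(const std::vector<Tuple*>& table, std::uint64_t key) {
    for (const Tuple* t = table[key & (table.size() - 1)]; t != nullptr; t = t->next)
        if (t->key == key) return t;
    return nullptr;
}
```

Only pointers move in phase 2: the tuples themselves remain in the storage area where phase 1 wrote them.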

slide-13
SLIDE 13

Morsel-Wise Processing of Probe Phase

[Figure: each worker takes the next morsel of R and probes HT(T) and HT(S); the figure also shows the storage areas of the red, green, and blue cores.]
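A probe pipeline over one morsel might look like the sketch below. The schema (R(Z, A), S(A, B), T(B, C)), the names `RTuple`, `Result`, and `probeMorsel`, and the use of `std::unordered_map` with unique keys in place of the engine's hash table are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

struct RTuple { char z; int a; };
struct Result { char z; int a; int b; char c; };

// Probe pipeline for one morsel [begin, end) of R. Each tuple passes through both
// hash tables before the next tuple is touched, so no intermediate join result
// is materialized between the two probes.
void probeMorsel(const std::vector<RTuple>& r, std::size_t begin, std::size_t end,
                 const std::unordered_map<int, int>& htS,   // A -> B
                 const std::unordered_map<int, char>& htT,  // B -> C
                 std::vector<Result>& out) {                // worker-local output area
    for (std::size_t i = begin; i < end; ++i) {
        auto s = htS.find(r[i].a);
        if (s == htS.end()) continue;  // no join partner in S
        auto t = htT.find(s->second);
        if (t == htT.end()) continue;  // no join partner in T
        out.push_back({r[i].z, r[i].a, s->second, t->second});
    }
}
```

Since the hash tables are only read during probing and each worker appends to its own output vector, several workers could run this function on different morsels without locking.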

slide-14
SLIDE 14

Dispatcher

◮ List of pending pipeline-jobs (possibly of different queries): Pipeline-Job J1, J2, J3
◮ (Virtual) lists of morsels to be processed: Mr1 Mr2 Mr3, Mg1 Mg2 Mg3, Mb1 Mb2 Mb3 (colors indicate on what socket/core the morsel is located)
◮ dispatch(Core0): (J1, Mr1), i.e. assign Pipeline-Job J1 on morsel Mr1 to Core0
◮ Scheduler (beyond the scope of this paper) prioritizes pipeline jobs according to Quality of Service constraints

[Figure: example NUMA multi-core server with 4 sockets and 32 cores; each socket has 8 cores and its own DRAM, and the sockets are linked by an interconnect.]
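The dispatching rule (prefer morsels on the caller's socket, otherwise steal from another socket) can be sketched like this. It is a simplification under stated assumptions: only one pipeline job is modeled, a mutex replaces whatever synchronization the real dispatcher uses, and the names `Morsel` and `Dispatcher` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

struct Morsel { int socket; std::size_t begin, end; };

class Dispatcher {
public:
    explicit Dispatcher(int sockets) : queues_(static_cast<std::size_t>(sockets)) {
        assert(sockets > 0);
    }

    // Registers a morsel in the list of the socket where its data is located.
    void add(const Morsel& m) {
        std::lock_guard<std::mutex> guard(lock_);
        queues_[static_cast<std::size_t>(m.socket)].push_back(m);
    }

    // Called by a worker running on `socket` whenever it needs more work.
    // Returns false when the pipeline job has no morsels left.
    bool dispatch(int socket, Morsel& out) {
        std::lock_guard<std::mutex> guard(lock_);
        int n = static_cast<int>(queues_.size());
        for (int i = 0; i < n; ++i) {  // i == 0: local socket, i > 0: steal remotely
            std::deque<Morsel>& q = queues_[static_cast<std::size_t>((socket + i) % n)];
            if (q.empty()) continue;
            out = q.front();
            q.pop_front();
            return true;
        }
        return false;
    }

private:
    std::mutex lock_;
    std::vector<std::deque<Morsel>> queues_;  // one morsel list per socket
};
```

A worker on socket 0 first drains the socket-0 list and only then receives morsels that live on socket 1, which keeps most accesses NUMA-local while avoiding idle workers.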

slide-15
SLIDE 15

Hash Table

[Figure: each hashTable slot holds a 16 bit tag for early filtering and a 48 bit pointer to a chain of entries (d, e, f); example tags: 00000100 and 10000010.]

◮ Unused bits in pointers act as a cheap Bloom filter
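The tag idea can be sketched with plain 64-bit words. The split into 16 tag bits and 48 pointer bits is from the slide; which hash bits select the tag bit, and the names `kPointerMask`, `kTagMask`, `tag`, `addTag`, `mayContain`, and `removeTag`, are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kPointerMask = (std::uint64_t{1} << 48) - 1;  // low 48 bits
constexpr std::uint64_t kTagMask = ~kPointerMask;                     // high 16 bits

// Maps a hash to one of the 16 tag bits (the choice of hash bits is an assumption).
inline std::uint64_t tag(std::uint64_t hash) {
    return std::uint64_t{1} << (48 + (hash & 15));
}

// Adds the tag of `hash` to a slot word without touching the pointer bits.
inline std::uint64_t addTag(std::uint64_t slotWord, std::uint64_t hash) {
    return slotWord | tag(hash);
}

// false: no entry with this tag is in the chain, so the chain need not be followed.
// true: "maybe", the chain must still be checked (false positives are possible).
inline bool mayContain(std::uint64_t slotWord, std::uint64_t hash) {
    return (slotWord & tag(hash)) != 0;
}

// Strips the tag bits to recover the pointer part of the slot word.
inline std::uint64_t removeTag(std::uint64_t slotWord) {
    return slotWord & kPointerMask;
}
```

A probe that sees `mayContain(...) == false` can skip the pointer dereference entirely, which is where the filtering saves a likely cache miss.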

slide-16
SLIDE 16

Lock-Free Insertion into Hash Table

insert(entry) {
  // determine slot in hash table
  slot = entry->hash >> hashTableShift
  do {
    old = hashTable[slot]
    // set next to old entry without tag
    entry->next = removeTag(old)
    // add old and new tag
    new = entry | (old&tagMask) | tag(entry->hash)
    // try to set new value, repeat on failure
  } while (!CAS(hashTable[slot], old, new))
}
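The pseudocode above could be written in C++ roughly as follows. This is a sketch under assumptions: CAS is mapped to `std::atomic::compare_exchange_weak`, `new` is renamed `desired` because it is a C++ keyword, pointers are assumed to fit into 48 bits, and the choice of hash bits for the tag as well as the `Entry` and `HashTable` types are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Entry { std::uint64_t hash; Entry* next; };

constexpr std::uint64_t kPointerMask = (std::uint64_t{1} << 48) - 1;
constexpr std::uint64_t kTagMask = ~kPointerMask;

inline std::uint64_t tag(std::uint64_t hash) {
    return std::uint64_t{1} << (48 + (hash & 15));
}
inline Entry* removeTag(std::uint64_t word) {
    return reinterpret_cast<Entry*>(word & kPointerMask);
}

struct HashTable {
    std::vector<std::atomic<std::uint64_t>> slots;  // zero means "empty chain"
    unsigned shift;                                 // hashTableShift

    explicit HashTable(unsigned log2Slots)
        : slots(std::size_t{1} << log2Slots), shift(64 - log2Slots) {
        assert(log2Slots >= 1 && log2Slots <= 30);
    }

    void insert(Entry* entry) {
        // determine slot in hash table from the high hash bits
        std::atomic<std::uint64_t>& slot = slots[entry->hash >> shift];
        std::uint64_t old = slot.load();
        std::uint64_t desired;
        do {
            entry->next = removeTag(old);                     // old head without tag
            desired = reinterpret_cast<std::uint64_t>(entry)  // new head pointer
                      | (old & kTagMask) | tag(entry->hash);  // old tags plus new tag
            // on failure compare_exchange_weak reloads `old`, so the loop retries
        } while (!slot.compare_exchange_weak(old, desired));
    }
};
```

The retry loop only repeats when another thread changed the same slot between the load and the exchange, so uncontended inserts take a single atomic operation.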

slide-17
SLIDE 17

Storage Implementation

1. Use large virtual memory pages (2MB) both for the hash table and the tuple storage areas.
   1.1 The number of TLB misses is reduced, the page table is guaranteed to fit into L1 cache, and scalability problems from too many kernel page faults during the build phase are avoided.
2. Allocate the hash table using the Unix mmap system call, if available.
   2.1 Pages get allocated on first write, initialized to 0s
   2.2 Pages are located on the same NUMA node as the thread that first writes the page, ensuring locality if only a single NUMA node is used.
3. May be a good idea to partition tables using primary/foreign key
   3.1 e.g. order and lineitem on orderkey
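A minimal sketch of such an allocation on a POSIX system is shown below. The function names `allocateTable` and `freeTable` are hypothetical, and the `madvise` call with `MADV_HUGEPAGE` is a Linux-specific hint assumed here as one way to request large pages; it does not guarantee that 2MB pages are actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Reserves `bytes` of zero-filled anonymous memory. Physical pages are only
// assigned when a page is first written, so the call itself is cheap.
inline void* allocateTable(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
#ifdef MADV_HUGEPAGE
    // Linux-only hint asking for transparent huge pages. A failure is ignored
    // because the mapping still works with regular pages.
    madvise(p, bytes, MADV_HUGEPAGE);
#endif
    return p;
}

// Releases a region obtained from allocateTable.
inline void freeTable(void* p, std::size_t bytes) {
    munmap(p, bytes);
}
```

Because the mapping starts out zero-filled, a hash table whose empty slots are represented by 0 needs no explicit initialization pass.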

slide-18
SLIDE 18

Morsels

◮ No load imbalances: all workers finish very close in time
◮ Morsels make it possible to react to workload changes: priority-based scheduling of dynamic workloads is possible

[Figure: execution trace of worker 0 to worker 3 over time; marked events: q13 arrives, q14 arrives, q14 finishes, q13 finishes.]

slide-19
SLIDE 19

NUMA Awareness

◮ NUMA awareness at the morsel level
◮ E.g., table scan:
  ◮ Relations are partitioned over NUMA nodes
  ◮ Worker threads ask for NUMA-local morsels
  ◮ May steal morsels from other sockets to avoid idle workers

[Figure: two 4-socket topologies (sockets 0 to 3), each socket with 8 cores and its own DRAM. Nehalem EX: 24MB L3 per socket, bandwidth labels 25.6GB/s and 12.8GB/s (bidirectional). Sandy Bridge EP: 20MB L3 per socket, bandwidth labels 51.2GB/s and 16.0GB/s (bidirectional).]

slide-20
SLIDE 20

Parallel Aggregation

◮ Aggregation: partitioning-based with cheap pre-aggregation
◮ Stage 1: Fixed size hash table per thread, overflow to partitions
◮ Stage 2: Final aggregation: thread per partition

[Figure: Phase 1 (local pre-aggregation): each worker groups the (K, V) tuples of its morsels in a small hash table ht and spills to partitions (Partition 0 ... Partition 3) when ht becomes full. Phase 2 (aggregate partition-wise): each partition is grouped into a hash table HT, producing Result ptn 0, Result ptn 1, and so on.]
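The two stages can be sketched for a SUM aggregate as follows. This is an illustration under assumptions, not the engine's implementation: `preAggregate` and `aggregatePartition` are hypothetical names, `std::unordered_map` with a size cap stands in for the fixed-size hash table, the partition is chosen by key modulo the partition count, and the workers run one after another instead of in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Key = std::uint32_t;
using Partial = std::pair<Key, long>;                  // (group key, partial sum)
using Partitions = std::vector<std::vector<Partial>>;  // one worker's spill partitions

// Stage 1 for one worker: pre-aggregate in a table of at most `capacity` groups.
// When a new group does not fit, the whole table is spilled to the partitions.
void preAggregate(const std::vector<Partial>& input, std::size_t capacity, Partitions& out) {
    assert(capacity > 0 && !out.empty());
    std::unordered_map<Key, long> ht;
    auto spill = [&] {
        for (const auto& kv : ht) out[kv.first % out.size()].emplace_back(kv.first, kv.second);
        ht.clear();
    };
    for (const Partial& t : input) {
        auto it = ht.find(t.first);
        if (it != ht.end()) { it->second += t.second; continue; }  // cheap local hit
        if (ht.size() == capacity) spill();
        ht.emplace(t.first, t.second);
    }
    spill();  // flush what is left at the end of the input
}

// Stage 2 for one partition: combine the partial sums spilled by all workers.
// Different partitions share no keys, so each can be handled by its own thread.
std::unordered_map<Key, long> aggregatePartition(const std::vector<Partitions>& perWorker,
                                                 std::size_t p) {
    std::unordered_map<Key, long> result;
    for (const Partitions& w : perWorker)
        for (const Partial& t : w[p]) result[t.first] += t.second;
    return result;
}
```

The same key may be spilled several times in stage 1 (for example before and after a table flush); stage 2 adds these partial sums together, so the final result is still exact.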

slide-21
SLIDE 21

Parallel Merge Sort

◮ Sorting is used for order by and top-K only; sorting for merge join is not efficient
◮ Local sort in parallel, followed by parallel merge
◮ Key issue: finding exact separators (median-of-medians algorithm)
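The role of separators in the merge phase can be sketched as below. Note the substitution: instead of the exact separators mentioned on the slide, this sketch picks approximate separator values from a small sample of each run, so the output pieces may be uneven in size. The function name `mergeRuns` is hypothetical, the runs are assumed to be sorted already, and the pieces are processed in a loop where a real system would use one thread per piece.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Merges locally sorted `runs` into `parts` independent output pieces. Piece p
// only holds values below the separator of piece p + 1, so the pieces can be
// merged independently and simply concatenated.
std::vector<int> mergeRuns(const std::vector<std::vector<int>>& runs, std::size_t parts) {
    assert(parts > 0);
    // Sample evenly spaced values from every run; the sorted sample supplies separators.
    std::vector<int> sample;
    for (const auto& run : runs)
        for (std::size_t i = 1; i < parts && !run.empty(); ++i)
            sample.push_back(run[i * run.size() / parts]);
    std::sort(sample.begin(), sample.end());

    std::vector<int> out;
    std::vector<std::size_t> from(runs.size(), 0);  // start of the unmerged rest per run
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t pieceBegin = out.size();
        for (std::size_t r = 0; r < runs.size(); ++r) {
            const std::vector<int>& run = runs[r];
            std::size_t to = run.size();  // the last piece takes the rest of the run
            if (p + 1 < parts && !sample.empty()) {
                int separator = sample[(p + 1) * sample.size() / parts];
                to = static_cast<std::size_t>(
                    std::lower_bound(run.begin(), run.end(), separator) - run.begin());
            }
            const std::size_t mid = out.size();
            out.insert(out.end(), run.begin() + from[r], run.begin() + to);
            // merge the newly appended sorted range into the piece built so far
            std::inplace_merge(out.begin() + pieceBegin, out.begin() + mid, out.end());
            from[r] = to;
        }
    }
    return out;
}
```

Each run contributes one contiguous range to each piece, located by binary search, which is why the pieces need no coordination once the separators are fixed.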

slide-22
SLIDE 22

Evaluation: TPC-H (SF 100), Nehalem EX (32 cores)

TPC-H #  time [s]  speedup     TPC-H #  time [s]  speedup
 1       0.28      32.4        12       0.22      42.0
 2       0.08      22.3        13       1.95      40.0
 3       0.66      24.7        14       0.19      24.8
 4       0.38      21.6        15       0.44      19.8
 5       0.97      21.3        16       0.78      17.3
 6       0.17      27.5        17       0.44      30.5
 7       0.53      32.4        18       2.78      24.0
 8       0.35      31.2        19       0.88      29.5
 9       2.14      32.0        20       0.18      33.4
10       0.60      20.0        21       0.91      28.0
11       0.09      37.1        22       0.30      25.7

◮ single-threaded: 30x faster than PostgreSQL, 10x faster than a commercial column store, similar speed to Vectorwise
◮ multi-threaded: 5x faster than Vectorwise, 50x faster than Cloudera Impala on a 20-node cluster

slide-23
SLIDE 23

Scalability

[Figure: one panel per TPC-H query (1 to 22); x-axis: threads (1 to 64); y-axis: speedup over HyPer (10 to 40); systems compared: HyPer (full-fledged), HyPer (not NUMA aware), HyPer (non-adaptive), Vectorwise.]

slide-24
SLIDE 24

Conclusions

◮ Getting good scalability and performance on many-core systems is challenging but possible
◮ However, it is not possible to bolt on parallelism to an existing query engine; one must redesign it with modern hardware in mind
◮ With morsel-driven parallelism HyPer can finish ad hoc queries on hundreds of GBs in seconds

www.hyper-db.com