
Engineering Aggregation Operators for Relational In-Memory Database Systems. Ingo Müller, PhD defense, February 11, 2016. Institute of Theoretical Informatics, Algorithmics II, Department of Informatics. In cooperation with SAP SE.


SLIDE 1

KIT – The Research University in the Helmholtz Association

Institute of Theoretical Informatics, Algorithmics II, Department of Informatics

www.kit.edu

In cooperation with SAP SE

Engineering Aggregation Operators for Relational In-Memory Database Systems

Ingo Müller – PhD defense – February 11, 2016

SLIDE 2

Introduction – The Race of Database Systems

[Figure: Hardware evolution [Bui12]. Relative performance over time (log scale, 1 to 1000); two curves growing at +60%/yr and +9%/yr, with the gap between them marked. Label: Database systems.]

[Figure: Data growth [RG12]. Size of the Digital Universe in EiB over time (log scale, 100 to 1000000); +80%/yr.]

Trend 1: data volumes increase exponentially (or faster)
Trend 2: compute power increases exponentially
But also more and more complex, for example memory access

Database systems are in a continuous race to translate Moore's law.

SLIDE 3

Introduction – Grouping with Aggregation


What is the sum of the prices of all sold items per store?

Input (Sales):

Store    Item    Price
Berlin   pen     1.00€
Berlin   paper   3.00€
Paris    ruler   2.00€
Berlin   pen     1.00€
Paris    pen     1.00€
Vienna   paper   3.00€

SELECT Store, SUM(Price) AS Sum FROM Sales GROUP BY Store

Grouped rows:

Berlin   1.00€  3.00€  1.00€
Paris    2.00€  1.00€
Vienna   3.00€

Output:

Store    Sum
Paris    3.00€
Vienna   3.00€
Berlin   5.00€
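The query on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `hash_aggregate` and the tuple layout of the rows are assumptions made for the example, not part of the talk.

```python
def hash_aggregate(rows):
    """Return {store: sum of prices} for rows of (store, item, price)."""
    sums = {}
    for store, _item, price in rows:
        # Insert the group on first sight, otherwise add to its running sum.
        sums[store] = sums.get(store, 0.0) + price
    return sums


# The Sales table from the slide.
sales = [
    ("Berlin", "pen", 1.00),
    ("Berlin", "paper", 3.00),
    ("Paris", "ruler", 2.00),
    ("Berlin", "pen", 1.00),
    ("Paris", "pen", 1.00),
    ("Vienna", "paper", 3.00),
]

result = hash_aggregate(sales)
```

The dictionary plays the role of the hash map keyed by the grouping attribute; the order of the output groups is not significant.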

SLIDE 4

Challenges and Overview

  • Cache efficiency → lower bound + (optimal) recursive algorithm
  • Optimizer independence → adaptive execution strategy
  • Memory constraint → intra-operator pipelining
  • CPU friendliness → low-level tuning of inner loops
  • Parallelism → work stealing
  • Skewed data distribution → robust algorithm design
  • Communication efficiency → adaptive pre-aggregation
  • System integration → compatible with major DB architectures

Result: up to 3.7x faster and robust enough for use in production.

*[SIGMOD15]

SLIDE 5

Challenge: Cache Efficiency – Motivation

Two textbook algorithms:

Hash-Aggregation
  • Insert every row into hash map with grouping attributes as key
  • Aggregate to existing intermediate result

Sort-Aggregation
  • Sort input by grouping attributes
  • Aggregate consecutive rows in a single pass

Can we do better? Long standing conjecture: no!

M = cache size, B = block size, N = input size, K = output size
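To make the second textbook algorithm concrete, here is a minimal sketch of Sort-Aggregation for a SUM aggregate. The name `sort_aggregate` and the (key, value) row layout are assumptions for illustration; a real operator would work on columns and tuned sort routines rather than Python lists.

```python
from operator import itemgetter


def sort_aggregate(rows):
    """Sum values per key: sort by key, then merge consecutive equal keys."""
    ordered = sorted(rows, key=itemgetter(0))  # sort by grouping attribute
    result = []
    for key, value in ordered:
        if result and result[-1][0] == key:
            # Same group as the previous row: fold into the running sum.
            result[-1] = (key, result[-1][1] + value)
        else:
            # A new group starts here.
            result.append((key, value))
    return result


rows = [("Berlin", 1.0), ("Berlin", 3.0), ("Paris", 2.0),
        ("Berlin", 1.0), ("Paris", 1.0), ("Vienna", 3.0)]
groups = sort_aggregate(rows)
```

After sorting, all rows of one group are adjacent, so a single pass with one running sum per group is enough.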

SLIDE 6

External Memory Model – Proof Techniques

Known lower bounds for Aggregation
  • Based on comparisons [MR91,AK+93] → Do not hold for Hashing!

Proof technique [AV88,Gre12]
  • Count the number of possible permutations after t transfers
  • Compare with possible number of input permutations

Modifications for Aggregation
  • Allow semi-group operation in cache
  • Count “permutations” as before

Model: “external” memory, cache of M records, block of B records, N input records, K output records

SLIDE 7

External Memory Model – Result

Lower bound* for Aggregation: $\frac{N}{B} \log_{M/B} \frac{K}{B}$ block transfers

Same bound as for Sorting Multisets [AK+93]

Second expression on the slide: $\frac{N}{PB} \log_{M/B} \frac{K}{B}$ (P is not defined in the legend)

*simplified asymptotic worst case

M = cache size, B = block size, N = input size, K = output size

We confirm: Aggregation is as hard as Sorting! → Use as guideline.
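As a rough plausibility check, the simplified bound can be evaluated numerically. The sketch below assumes the bound reads (N/B) · log_{M/B}(K/B) with the symbols from the legend; the function name and the clamp to at least one pass over the input are assumptions of this example, and constant factors are ignored.

```python
import math


def aggregation_lower_bound(n, k, m, b):
    """Evaluate (N/B) * log_{M/B}(K/B), counted in block transfers.

    n = input size, k = output size, m = cache size, b = block size,
    all measured in records. Assumption: at least one pass over the
    input is always charged, even when K/B <= 1.
    """
    passes = math.log2(k / b) / math.log2(m / b)  # log to base M/B
    return (n / b) * max(1.0, passes)


# Example with small powers of two (hypothetical parameters):
# N/B = 2**16 blocks, M/B = 2**6, K/B = 2**12, so two passes are needed.
transfers = aggregation_lower_bound(n=2**20, k=2**16, m=2**10, b=2**4)
```

With these parameters the expression evaluates to 2 · 2^16 block transfers; growing K while keeping M fixed increases the number of passes, which is the regime where recursive processing matters.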

SLIDE 8

Outline

  • Cache efficiency → lower bound + (optimal) recursive algorithm
  • Optimizer independence → adaptive execution strategy
  • Memory constraint → intra-operator pipelining

SLIDE 9

Challenge: Adaptivity – Motivation

Traditional approach [Gra93]
  • Implement HashAggregation and SortAggregation
  • Optimizer selects implementation based on statistics beforehand

Problem
  • Wrong statistics may lead to suboptimal performance

M = cache size, B = block size, N = input size, K = output size

Our goal: adaptively switch between Hashing and Sorting during execution.

SLIDE 10

Adaptivity – Mixing Hashing and Sorting

Recursive algorithm:
  • In each level of recursion: mix Hashing and Sorting adaptively
  • Partitioning recurses when necessary
  • Hashing ends recursion when possible efficiently
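The recursion described on this slide can be sketched in a strongly simplified form: try to aggregate in a hash table of bounded size, and fall back to partitioning by a hash digit plus recursion when the table overflows. This is not the implementation from the talk. The name `aggregate`, the parameters `cache_limit`, `fanout` and `max_level`, and the all-or-nothing switch per level are assumptions of this sketch; the original mixes both routines adaptively within a level.

```python
from collections import defaultdict


def aggregate(rows, cache_limit=2, fanout=4, level=0, max_level=8):
    """Sum values per key for rows of (key, value) pairs."""
    # Hashing step: ends the recursion if all groups fit into the table.
    table = {}
    overflow = False
    for key, value in rows:
        if key in table:
            table[key] += value
        elif len(table) < cache_limit or level >= max_level:
            # Beyond max_level the size limit is lifted so that the
            # recursion is guaranteed to terminate.
            table[key] = value
        else:
            overflow = True
            break
    if not overflow:
        return table

    # Partitioning step: split by one digit of the hash value and recurse.
    partitions = defaultdict(list)
    for key, value in rows:
        digit = (hash(key) // fanout ** level) % fanout
        partitions[digit].append((key, value))

    result = {}
    for part in partitions.values():
        # Equal keys share a partition, so the partial results are disjoint.
        result.update(aggregate(part, cache_limit, fanout, level + 1, max_level))
    return result


rows = [("Berlin", 1.0), ("Berlin", 3.0), ("Paris", 2.0),
        ("Berlin", 1.0), ("Paris", 1.0), ("Vienna", 3.0)]
totals = aggregate(rows, cache_limit=2)
```

With `cache_limit=2` the three stores do not fit into the table at the top level, so the sketch partitions once or more before hashing succeeds; with a large enough limit it never partitions at all.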

SLIDE 11

Adaptivity – Mixing Hashing and Sorting


Our mechanism achieves the best of Hashing and Sorting.

SLIDE 12

Evaluation – Comparison with Prior Work


Setup: 2 Xeon E7-8870 CPUs (10 cores each); N = 2^32, uniform distribution; original implementation of [CR07,YR+11]

Efficient recursive processing is crucial for large outputs. (Figure annotation: 3.7x)

SLIDE 13

Outline

  • Cache efficiency → lower bound + (optimal) recursive algorithm
  • Optimizer independence → adaptive execution strategy
  • Memory constraint → intra-operator pipelining

SLIDE 14

Memory Constraint – Intra-Operator Pipelining

  • Split work into blocks
  • Limit number of blocks
  • Recycle free blocks
  • Interleave/Overlap processing levels

Pipelining makes it possible to limit the amount of intermediate memory.
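The block-based memory limit can be pictured as a small pool that hands out a bounded number of fixed-size blocks and reuses returned ones. The class below is a hypothetical sketch, not code from the talk; `BlockPool`, `acquire` and `release` are names chosen for this example.

```python
class BlockPool:
    """Hands out at most max_blocks fixed-size blocks and recycles them."""

    def __init__(self, max_blocks, block_size):
        self.max_blocks = max_blocks
        self.block_size = block_size
        self.allocated = 0   # blocks created so far
        self.free = []       # returned blocks waiting for reuse

    def acquire(self):
        """Return a block, or None if the limit is reached."""
        if self.free:
            return self.free.pop()           # recycle a free block
        if self.allocated < self.max_blocks:
            self.allocated += 1
            return bytearray(self.block_size)
        return None  # caller has to process and release a block first

    def release(self, block):
        """Give a fully processed block back to the pool."""
        self.free.append(block)


pool = BlockPool(max_blocks=2, block_size=16)
first = pool.acquire()
second = pool.acquire()
third = pool.acquire()      # limit reached
pool.release(first)
recycled = pool.acquire()   # the released block is handed out again
```

When `acquire` returns None, a producing level cannot continue and a consuming level has to run first, which is what interleaving the processing levels amounts to in this simplified picture.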

SLIDE 15

Memory Constraint – Intra-Operator Scheduling

  • In which level to work? → Heuristic: target 50% memory usage
  • On which partition to work? → Priority queue on partition length

[Figure labels: Balance, PQ, PQ]
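The two scheduling questions can be illustrated with a small sketch. Two points are assumptions of this example, since the slide does not state them: that exceeding the 50% target means switching to the level that consumes blocks, and that the priority queue prefers the longest partition. The names `pick_level` and `PartitionQueue` are hypothetical.

```python
import heapq


def pick_level(used_blocks, max_blocks, target=0.5):
    """Choose which level to work on, aiming at the target memory usage.

    Assumption: above the target, work that frees blocks ("consume")
    is preferred; otherwise work that fills blocks ("produce").
    """
    return "consume" if used_blocks / max_blocks > target else "produce"


class PartitionQueue:
    """Priority queue of partitions, ordered by length (longest first)."""

    def __init__(self):
        self._heap = []

    def push(self, partition_id, length):
        # heapq is a min-heap, so the length is negated.
        heapq.heappush(self._heap, (-length, partition_id))

    def pop(self):
        neg_length, partition_id = heapq.heappop(self._heap)
        return partition_id, -neg_length


queue = PartitionQueue()
queue.push("p0", 3)
queue.push("p1", 10)
queue.push("p2", 5)
next_partition = queue.pop()
```

Swapping the sign in `push` would give a shortest-first policy instead, so the sketch works for either reading of the slide.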

SLIDE 16

Memory Constraint – Evaluation

  • Performance basically preserved (for moderate result sizes)
  • Trade-off between memory usage and performance

Cache efficiency can be achieved under memory constraint.

[Figure panels: input size = 16 GiB, memory constraint = 256 MiB; input size = 16 GiB, K = 2^23. Annotations: 1.2% of unconstrained; 2x]

SLIDE 17

Summary

  • Cache efficiency → lower bound + (optimal) recursive algorithm
  • Optimizer independence → adaptive execution strategy
  • Memory constraint → intra-operator pipelining
  • CPU friendliness → low-level tuning of inner loops
  • Parallelism → work stealing
  • Skewed data distribution → robust algorithm design
  • Communication efficiency → adaptive pre-aggregation
  • System integration → compatible with major DB architectures

Thank you! Questions?

*[SIGMOD15]

SLIDE 18

References

[RG12] D. Reinsel and J. Gantz. “The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East.” In: IDC. 2012.

[Bui12] Peter Bui. “Computer Organization and Design – Lecture 8: Memory Hierarchy.” URL: http://cs.uwec.edu/~buipj/teaching/cs.352.f12/lectures/lecture_08.html

[SIGMOD15] I. Müller, P. Sanders, A. Lacurie, W. Lehner, and F. Färber. “Cache-Efficient Aggregation: Hashing Is Sorting.” In: SIGMOD. 2015.

[MR91] I. Munro and V. Raman. “Sorting Multisets and Vectors In-Place.” In: WADS. 1991.

[AK+93] L. Arge, M. Knudsen, K. Larsen, F. Dehne, J. Sack, N. Santoro, and S. Whitesides. “A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms.” In: WADS. 1993.

[AV88] A. Aggarwal and J. S. Vitter. “The Input/Output Complexity of Sorting and Related Problems.” In: Commun. ACM 31.9 (1988), pp. 1116–1127.

[Gre12] G. Greiner. “Sparse Matrix Computations and their I/O Complexity.” PhD thesis. Technische Universität München, 2012.

[Gra93] G. Graefe. “Query evaluation techniques for large databases.” In: ACM Computing Surveys 25.2 (1993), pp. 73–169.

[CR07] J. Cieslewicz and K.A. Ross. “Adaptive Aggregation on Chip Multiprocessors.” In: PVLDB. 2007.

[YR+11] Y. Ye, K.A. Ross, and N. Vesdapunt. “Scalable Aggregation on Multicore Processors.” In: DaMoN. 2011.

The photo of the title page has been provided by Vladislav Bezrukov under the CC BY 2.0 licence. It has not been modified.

SLIDE 19

Backup – List of Publications

  • F. Färber, N. May, W. Lehner, P. Große, I. Müller, H. Rauhe, and J. Dees. “The SAP HANA Database – An Architecture Overview.” In: IEEE Data Eng. Bull. 35.1 (2012), pp. 28–33
  • P. Sanders, S. Schlag, and I. Müller. “Communication-efficient algorithms for fundamental big data problems.” In: IEEE Big Data Conf. 2013
  • T. Willhalm, I. Oukid, I. Müller, and F. Faerber. “Vectorizing Database Column Scans with Complex Predicates.” In: ADMS. 2013
  • I. Müller, P. Sanders, R. Schulze, and W. Zhou. “Retrieval and Perfect Hashing using Fingerprinting.” In: SEA. 2014
  • I. Müller, C. Ratsch, and F. Faerber. “Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems.” In: EDBT. 2014
  • L. Hübschle-Schneider, P. Sanders, and I. Müller. “Communication Efficient Algorithms for Top-k Selection Problems.” In: CoRR abs/1502.0 (2015)
  • I. Müller, P. Sanders, A. Lacurie, W. Lehner, and F. Färber. “Cache-Efficient Aggregation: Hashing Is Sorting.” In: SIGMOD. 2015

SLIDE 20

Backup – Columnwise Processing Scheme

SLIDE 21

Backup – Scaling with the Number of Columns

SLIDE 22

Backup – Scaling with the Number of Cores

SLIDE 23

Backup – Workload Study

SLIDE 24

Backup – Skewed Data Distributions
