An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar - - PowerPoint PPT Presentation

an in depth analysis of data aggregation cost factors in
SMART_READER_LITE
LIVE PREVIEW

An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar - - PowerPoint PPT Presentation

An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database Stephan Mller, Hasso Plattner Enterprise Platform and Integration Concepts Hasso Plattner Institute, Potsdam (Germany) DOLAP 2012 November 2nd Agenda


slide-1
SLIDE 1

An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

Stephan Müller, Hasso Plattner

Enterprise Platform and Integration Concepts Hasso Plattner Institute, Potsdam (Germany) DOLAP 2012 – November 2nd

slide-2
SLIDE 2

Agenda

■ Data aggregation ■ Aggregations in Hyrise ■ Cost factors and benchmarks ■ Summary and future work

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-3
SLIDE 3

Motivation

■ Aggregation □ Abstractions for conceptualizing the real world □ Resource-intensive operation in OLAP workloads ■ Columnar in-memory databases □ Optimized for set processing □ Enables reunification of OLTP and OLAP [1]

[1] H. Plattner. A common database approach for OLTP and OLAP using an in-memory column database. In SIGMOD, 2009. DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-4
SLIDE 4

Data Aggregation

■ SELECT product, SUM (amount) AS revenue FROM sales WHERE year=2011 GROUP BY product ■ Two phases □ Grouping (hash-based, sorting, nested loops) □ Calculation

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-5
SLIDE 5

Data Aggregation in Hyrise

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

1 1 2 1 1 3 4 1 amount 1 2 3 1 3 5 2 4 fruit "fruit" dictionary 1 apples 2 bananas 3 cherries 4 grapes 5 melons 1 2 3 4 5 6 7 8 id apples bananas cherries apples cherries melons bananas grapes fruit 10 10 20 10 30 50 20 40 amount 1 2 3 4 5 6 7 8 id

■ Columnar Storage ■ Dictionary compression

slide-6
SLIDE 6

Data Aggregation – Grouping

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

1 1 2 1 1 3 4 1 amount 1 2 3 4 5 6 7 8 id 1 2 3 1 3 5 2 4 fruit dictionary 1 apples 2 bananas 3 cherries 4 grapes 5 melons position on "fruit" 1 2 3 4 5 0, 3 table 1, 6 2, 4 7 5

slide-7
SLIDE 7

Data Aggregation - Calculation

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

1 1 2 1 1 3 4 1 amount 1 2 3 4 5 6 7 8 id 1 2 3 1 3 5 2 4 fruit position on "fruit" 1 2 3 4 5 0, 3 table 1, 6 2, 4 7 5 "amount" dictionary 1 10 2 20 3 30 4 40 5 50 20 50 30 10 40 fruit sum(amount) apples bananas cherries grapes melons

slide-8
SLIDE 8

Cost Modeling in IMDBs

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

■ Cost model based on cache misses [2] □ Basic patterns for each cache level □ Combine patterns to model complex operations

[2] Manegold, S. et al. 2002. Generic database cost models for hierarchical memory

  • systems. VLDB. (2002).
slide-9
SLIDE 9

Cost Modeling of the Aggregation

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-10
SLIDE 10

Cost Factors

■ Data-dependent □ Dataset size □ Distinct grouping values □ Distinct grouping values distribution □ Sorting of grouping values ■ Data-independent □ Aggregation function □ Hash implementation

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-11
SLIDE 11

Hash Implementations

■ Ordering □ Sorted tree (std::map) □ Unsorted hash table (std::unordered_map) ■ Pre-allocation strategies □ Naïve implementation (automatic growth) □ Distinct value setup (scaffold of hash map) □ Histogram-based setup (scaffold and position list)

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-12
SLIDE 12

Data Set Size

■ Number of bytes processed during query □ Input table data □ Intermediate result data □ Final result data ■ Influencing factors □ Relation size □ Relational expression □ Grouping attributes □ Aggregation attributes □ Aggregation functions

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-13
SLIDE 13

Distinct Values – L1 Cache Misses

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-14
SLIDE 14

Distinct Values – CPU Cycles

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-15
SLIDE 15

Distinct Values Distribution

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

  • 1

10 100 1000 10000 100000 0e+00 2e+06 4e+06 6e+06 8e+06

Number of distinct values Number of cache misses

  • Uniform − L1 cache misses

Exponential − L1 cache misses Uniform − L2 cache misses Exponential − L2 cache misses

slide-16
SLIDE 16

Impact of Sorted Attributes – L1 Misses

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-17
SLIDE 17

Hash Implementations – L1 Misses

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-18
SLIDE 18

Hash Implementations – CPU Cycles

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-19
SLIDE 19

Summary

■ High impact of aggregation operation in a mixed OLTP and OLAP workload ■ Data characteristics influence aggregation performance ■ Hash implementation choice matters ■ Pre-allocation can reduce cache misses ■ Future work □ Multithreaded aggregations □ Extension of existing cost models

DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database

slide-20
SLIDE 20

Thank you!

stephan.mueller@hpi.uni-potsdam.de http://epic.hpi.uni-potsdam.de

20 DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database