An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar - - PowerPoint PPT Presentation
An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar - - PowerPoint PPT Presentation
An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database Stephan Mller, Hasso Plattner Enterprise Platform and Integration Concepts Hasso Plattner Institute, Potsdam (Germany) DOLAP 2012 November 2nd Agenda
Agenda
■ Data aggregation ■ Aggregations in Hyrise ■ Cost factors and benchmarks ■ Summary and future work
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Motivation
■ Aggregation □ Abstractions for conceptualizing the real world □ Resource-intensive operation in OLAP workloads ■ Columnar in-memory databases □ Optimized for set processing □ Enables reunification of OLTP and OLAP [1]
[1] H. Plattner. A common database approach for OLTP and OLAP using an in-memory column database. In SIGMOD, 2009. DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Data Aggregation
■ SELECT product, SUM (amount) AS revenue FROM sales WHERE year=2011 GROUP BY product ■ Two phases □ Grouping (hash-based, sorting, nested loops) □ Calculation
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Data Aggregation in Hyrise
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
1 1 2 1 1 3 4 1 amount 1 2 3 1 3 5 2 4 fruit "fruit" dictionary 1 apples 2 bananas 3 cherries 4 grapes 5 melons 1 2 3 4 5 6 7 8 id apples bananas cherries apples cherries melons bananas grapes fruit 10 10 20 10 30 50 20 40 amount 1 2 3 4 5 6 7 8 id
■ Columnar Storage ■ Dictionary compression
Data Aggregation – Grouping
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
1 1 2 1 1 3 4 1 amount 1 2 3 4 5 6 7 8 id 1 2 3 1 3 5 2 4 fruit dictionary 1 apples 2 bananas 3 cherries 4 grapes 5 melons position on "fruit" 1 2 3 4 5 0, 3 table 1, 6 2, 4 7 5
Data Aggregation - Calculation
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
1 1 2 1 1 3 4 1 amount 1 2 3 4 5 6 7 8 id 1 2 3 1 3 5 2 4 fruit position on "fruit" 1 2 3 4 5 0, 3 table 1, 6 2, 4 7 5 "amount" dictionary 1 10 2 20 3 30 4 40 5 50 20 50 30 10 40 fruit sum(amount) apples bananas cherries grapes melons
Cost Modeling in IMDBs
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
■ Cost model based on cache misses [2] □ Basic patterns for each cache level □ Combine patterns to model complex operations
[2] Manegold, S. et al. 2002. Generic database cost models for hierarchical memory
- systems. VLDB. (2002).
Cost Modeling of the Aggregation
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Cost Factors
■ Data-dependent □ Dataset size □ Distinct grouping values □ Distinct grouping values distribution □ Sorting of grouping values ■ Data-independent □ Aggregation function □ Hash implementation
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Hash Implementations
■ Ordering □ Sorted tree (std::map) □ Unsorted hash table (std::unordered_map) ■ Pre-allocation strategies □ Naïve implementation (automatic growth) □ Distinct value setup (scaffold of hash map) □ Histogram-based setup (scaffold and position list)
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Data Set Size
■ Number of bytes processed during query □ Input table data □ Intermediate result data □ Final result data ■ Influencing factors □ Relation size □ Relational expression □ Grouping attributes □ Aggregation attributes □ Aggregation functions
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Distinct Values – L1 Cache Misses
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Distinct Values – CPU Cycles
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
Distinct Values Distribution
DOLAP 2012 | An In-Depth Analysis of Data Aggregation Cost Factors in a Columnar In-Memory Database
- 1
10 100 1000 10000 100000 0e+00 2e+06 4e+06 6e+06 8e+06
Number of distinct values Number of cache misses
- Uniform − L1 cache misses
Exponential − L1 cache misses Uniform − L2 cache misses Exponential − L2 cache misses