7 Things To Know When Buying for an !
Alekh Jindal, Jorge Quiané, Jens Dittrich
What Shoes? Why Shoes?
Generating MR Jobs (PigLatin, Hive)
Analyzing MR Jobs (HadoopToSQL, Manimal)
Executing MR Jobs (Hadoop++, epiC)
8 KB 1 GB
001 alex bsc
002 tim msc
003 mat bsc
004 joel bsc
005 phil msc
006 ron msc
007 neo bsc
008 jack msc
009 jens bsc
010 tom msc
(default)
* A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. PVLDB, April 2011.
Non-required Reads
Network Costs
Data Block Placement
Tuple Reconstruction
[Figure: Data Access Cost [sec] vs. Number of Referenced Attributes (out of 30); series: Row Layout, Column Layout, PAX Layout, Trojan Layout, Optimal Layout]
Replica 2 Replica 1 Replica 3
Columns -> Column groups -> [Filter] -> Interesting column groups -> [Pack] -> Complete & disjoint column groups
Filter: Novel Column Group Interestingness
Pack: Column Group Packing as 0-1 Knapsack
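The filter-and-pack steps can be illustrated with a small sketch. It is not the authors' algorithm: the `interestingness` score (fraction of touching queries that reference the whole group), the item value (score times joins saved), the exhaustive search, and the four-column table are all assumptions made for illustration. It only shows the shape of the idea: keep interesting column groups, then make a 0-1 choice per group so that the chosen groups are disjoint, and fill the remaining columns in as singletons so the result is complete.

```python
from itertools import combinations


def interestingness(group, queries):
    """Hypothetical co-access score: of the queries touching the group,
    the fraction that reference every column in it."""
    touching = [q for q in queries if group & q]
    if not touching:
        return 0.0
    return sum(1 for q in touching if group <= q) / len(touching)


def interesting_groups(columns, queries, threshold):
    """Filter step: enumerate multi-column groups, keep the interesting ones."""
    kept = []
    for size in range(2, len(columns) + 1):
        for combo in combinations(sorted(columns), size):
            group = frozenset(combo)
            score = interestingness(group, queries)
            if score >= threshold:
                # Assumed item value: score weighted by the joins the group saves.
                kept.append((group, score * (size - 1)))
    return kept


def pack(items, columns):
    """Pack step: pick disjoint groups with maximum total value (exhaustive
    0-1 search), then add leftover columns as singletons for completeness."""
    best_value, best_choice = 0.0, []

    def search(start, used, value, chosen):
        nonlocal best_value, best_choice
        if value > best_value:
            best_value, best_choice = value, list(chosen)
        for i in range(start, len(items)):
            group, item_value = items[i]
            if not group & used:  # 0-1 choice: take the item only if disjoint
                chosen.append(group)
                search(i + 1, used | group, value + item_value, chosen)
                chosen.pop()

    search(0, frozenset(), 0.0, [])
    covered = frozenset().union(*best_choice)
    return best_choice + [frozenset([c]) for c in sorted(columns - covered)]


def column_grouping(columns, queries, threshold):
    return pack(interesting_groups(columns, queries, threshold), columns)


# Hypothetical four-column table and workload (sets of referenced columns).
columns = frozenset("abcd")
queries = [frozenset("ab"), frozenset("ab"), frozenset("c"), frozenset("cd")]
layout = column_grouping(columns, queries, threshold=0.6)
print([sorted(g) for g in layout])
```

Because the chosen groups must be disjoint, their total width can never exceed the number of columns, so the knapsack capacity is implicit here. The exhaustive search is exponential in the number of candidate groups and is only meant for toy inputs.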
Queries -> Query groups -> [Filter] -> Interesting query groups -> [Pack] -> Complete & disjoint query groups
Per replica (Replica 1, Replica 2, Replica 3): Columns -> Column groups -> [Filter] -> Interesting column groups -> [Pack] -> Complete & disjoint column groups
Replica 1, Replica 2, Replica 3
Queries: Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8
Query groups: {Q2, Q3, Q4} {Q5} {Q1, Q6, Q7, Q8}
Column groups:
{Name, Address, Phone, AcctBal, Mktsegment, Comment} {Custkey, Nationkey}
{Custkey, Name, Address, Nationkey, Phone, AcctBal, Comment} {Mktsegment}
{Phone, AcctBal} {Address, Nationkey, Comment} {Mktsegment} {Custkey} {Name}
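The column groups above use the eight attributes of the TPC-H Customer table, and they fall into three sets that each cover every attribute exactly once. The sketch below checks that property. Splitting the nine groups into these three layouts is an inference from the "complete & disjoint" requirement, not something the slide states, and the helper name `is_complete_and_disjoint` is hypothetical.

```python
# The eight TPC-H Customer attributes, named as on the slide.
CUSTOMER = {
    "Custkey", "Name", "Address", "Nationkey",
    "Phone", "AcctBal", "Mktsegment", "Comment",
}

# Assumed split of the slide's nine column groups into three layouts.
LAYOUTS = [
    [{"Name", "Address", "Phone", "AcctBal", "Mktsegment", "Comment"},
     {"Custkey", "Nationkey"}],
    [{"Custkey", "Name", "Address", "Nationkey", "Phone", "AcctBal", "Comment"},
     {"Mktsegment"}],
    [{"Phone", "AcctBal"}, {"Address", "Nationkey", "Comment"},
     {"Mktsegment"}, {"Custkey"}, {"Name"}],
]


def is_complete_and_disjoint(groups, columns):
    """True if every column appears in exactly one group."""
    seen = set()
    for group in groups:
        if seen & group:      # a column appears twice: not disjoint
            return False
        seen |= group
    return seen == columns    # every column is covered: complete


for number, groups in enumerate(LAYOUTS, start=1):
    print(number, is_complete_and_disjoint(groups, CUSTOMER))
```

Which layout belongs to which replica, and which query group it serves, is not recoverable from the slide text, so the sketch numbers the layouts only by their order of appearance.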
Load
Query
Schedule ?
dataset layout-1 layout-2 layout-3
itemize UDF to transparently read the referenced attributes
Datasets: TPC-H Lineitem, TPC-H Customer, SSB LineOrder, SDSS PhotoObj
Workload: first 8 queries from the respective benchmark for each table
Focus: scan and projection operators, i.e. map-phase-only jobs
Improvement: record reader time (I/O and tuple reconstruction)
Cluster: 50 virtual nodes in a 10-node cluster
[Figure: Improvement Factor per TPC-H query, Q1 to Q8]
Layout            #Non-required Attributes Read   #Joins in Tuple Reconstruction
HADOOP-ROW        525                             -
HADOOP-PAX        -                               139
HYRISE* Layout    2                               64
Trojan Layout     14                              20

* M. Grund et al. HYRISE - A Main Memory Hybrid Storage Engine. PVLDB, November 2010.
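The two metrics trade off against each other: wide groups cause non-required reads, narrow groups cause tuple-reconstruction joins. The sketch below shows one plausible way to count both for a single query. The counting rule (every touched group is read in full, and k touched groups need k - 1 joins), the function name `access_stats`, and the example query are assumptions; the slide does not say how its totals were computed.

```python
def access_stats(layout, referenced):
    """Return (non-required attributes read, joins) for one query.

    Assumes a touched column group is always read in full and that
    stitching k groups back into tuples costs k - 1 joins.
    """
    groups_read = [group for group in layout if group & referenced]
    non_required = sum(len(group - referenced) for group in groups_read)
    joins = max(len(groups_read) - 1, 0)
    return non_required, joins


CUSTOMER = ["Custkey", "Name", "Address", "Nationkey",
            "Phone", "AcctBal", "Mktsegment", "Comment"]

row_layout = [set(CUSTOMER)]                   # one group holding all columns
column_layout = [{name} for name in CUSTOMER]  # one group per column
grouped_layout = [{"Phone", "AcctBal"}, {"Address", "Nationkey", "Comment"},
                  {"Mktsegment"}, {"Custkey"}, {"Name"}]

# Hypothetical query referencing three attributes.
referenced = {"Phone", "AcctBal", "Mktsegment"}

for name, layout in [("row", row_layout), ("column", column_layout),
                     ("grouped", grouped_layout)]:
    print(name, access_stats(layout, referenced))
```

For this query the row layout pays only in non-required reads, the column layout pays only in joins, and the grouped layout sits in between, which matches the direction of the table above.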
[Figure: Scheduling Penalty per query, Q1 to Q8; series: Best-Layout, Locality (default), Best-Layout & Locality]