7 Things To Know When Buying Shoes for an Elephant!



slide-1
SLIDE 1

7 Things To Know When Buying Shoes for an Elephant!

Alekh Jindal, Jorge Quiané, Jens Dittrich

slide-2
SLIDE 2

1. What Shoes? Why Shoes?

slide-3
SLIDE 3

  • Analyzing MR Jobs (HadoopToSQL, Manimal)
  • Generating MR Jobs (PigLatin, Hive)
  • Executing MR Jobs (Hadoop++, epiC)
  • Data Layouts & Access Paths!!

slide-4
SLIDE 4

2. Why Does the Elephant Need Different Shoes?

slide-5
SLIDE 5

DBMS MapReduce

Very Large Scale Storage & Execution


slide-6
SLIDE 6

DBMS MapReduce

Large Data Block Sizes

DBMS: 8 KB
MapReduce: 1 GB

slide-7
SLIDE 7

DBMS MapReduce

Block Level Data Replication

[Figure: the same data block shown twice, once per replica]

001 alex bsc
002 tim msc
003 mat bsc
004 joel bsc
005 phil msc
006 ron msc
007 neo bsc
008 jack msc
009 jens bsc
010 tom msc

slide-8
SLIDE 8

3. What’s Wrong with Old Shoes?

slide-9
SLIDE 9

Current Data Layouts in Hadoop

Sample data block:

001 alex bsc
002 tim msc
003 mat bsc
004 joel bsc
005 phil msc
006 ron msc
007 neo bsc
008 jack msc
009 jens bsc
010 tom msc

Layouts: Row (default), Column*, PAX**
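The difference between the three layouts can be illustrated on the sample records from this slide. The sketch below is a simplified in-memory illustration, not Hadoop's actual file formats; the function names and the `rows_per_block` parameter are assumptions made for the example.

```python
# Sample records from the slide. The field meaning (id, name, degree) is assumed.
records = [
    ("001", "alex", "bsc"), ("002", "tim", "msc"), ("003", "mat", "bsc"),
    ("004", "joel", "bsc"), ("005", "phil", "msc"), ("006", "ron", "msc"),
    ("007", "neo", "bsc"), ("008", "jack", "msc"), ("009", "jens", "bsc"),
    ("010", "tom", "msc"),
]


def row_layout(rows):
    """Row layout: all attributes of one record are stored next to each other."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Column layout: one list (conceptually, one file) per attribute."""
    return [list(column) for column in zip(*rows)]


def pax_layout(rows, rows_per_block):
    """PAX-style layout: rows are split into blocks, and each block
    stores its rows column by column."""
    blocks = []
    for start in range(0, len(rows), rows_per_block):
        chunk = rows[start:start + rows_per_block]
        blocks.append([list(column) for column in zip(*chunk)])
    return blocks
```

Reading only the names touches every value in `row_layout(records)`, a single list in `column_layout(records)`, and one list per block in `pax_layout(records, 5)`.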

* A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. PVLDB, April 2011.
** Y. He et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. ICDE, 2011.
slide-10
SLIDE 10

Current Data Layouts in Hadoop

Layouts: Row, Column, PAX

Criteria:
  • Non-required Reads
  • Network Costs
  • Data Block Placement
  • Tuple Reconstruction

slide-11
SLIDE 11

Current Data Layouts in Hadoop

[Figure: Data Access Cost [sec] vs. Number of Referenced Attributes (Out of 30), comparing Trojan Layout, Row Layout, Column Layout, PAX Layout and Optimal Layout]

slide-12
SLIDE 12

4. What Shoes Do We Propose?

slide-13
SLIDE 13

Trojan Data Layouts

[Figure: Replica 1, Replica 2 and Replica 3 of a data block]

slide-14
SLIDE 14

Trojan Data Layouts

Layouts: Row, Column, PAX, Trojan

Criteria:
  • Non-required Reads
  • Network Costs
  • Data Block Placement
  • Tuple Reconstruction

slide-15
SLIDE 15

Challenges in Trojan Data Layouts

  • How do we design a shoe for one leg?
  • How do we design shoes for all legs?
  • How do we make the shoes from the design?

slide-16
SLIDE 16

5. How Do We Design the Shoes?

slide-17
SLIDE 17

Single Replica

Columns → Column groups → [Filter] → Interesting column groups → [Pack] → Complete & disjoint column groups

  • Filter: Novel Column Group Interestingness
  • Pack: Column Group Packing as 0-1 Knapsack
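The filter-and-pack pipeline for a single replica can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' algorithm: the interestingness score (fraction of queries that reference every column of a group), the group value (score times the joins saved by storing the group together) and all function names are hypothetical, and the packing step is solved by exhaustive search with memoization. Each candidate group is either taken or not, which is the 0-1 choice the slide refers to.

```python
from functools import lru_cache
from itertools import combinations


def interestingness(group, workload):
    """Hypothetical score: fraction of queries referencing every column of the group."""
    hits = sum(1 for query in workload if group <= query)
    return hits / len(workload)


def candidate_groups(columns, workload, threshold):
    """Filter step: keep groups scoring at least `threshold`.
    Singletons are always kept so that a complete cover exists."""
    kept = {}
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            group = frozenset(combo)
            score = interestingness(group, workload)
            if size == 1 or score >= threshold:
                # Assumed value: expected joins avoided by storing the group together.
                kept[group] = score * (size - 1)
    return kept


def pack(columns, groups):
    """Pack step: pick disjoint groups that cover all columns with maximum total value."""
    ordered = sorted(columns)
    index = {column: i for i, column in enumerate(ordered)}
    full = (1 << len(ordered)) - 1
    masks = [(sum(1 << index[c] for c in g), g, v) for g, v in groups.items()]

    @lru_cache(maxsize=None)
    def best(covered):
        if covered == full:
            return 0.0, ()
        uncovered = ~covered & full
        lowest = uncovered & -uncovered  # next column that must be covered
        result = None
        for mask, group, value in masks:
            if (mask & lowest) and not (mask & covered):
                sub_value, sub_groups = best(covered | mask)
                total = value + sub_value
                if result is None or total > result[0]:
                    result = (total, (group,) + sub_groups)
        return result

    return best(0)
```

Enumerating all subsets is exponential in the number of columns, so the sketch is only meant for a handful of attributes.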

slide-18
SLIDE 18

Multiple Replicas

Queries → Query groups → [Filter] → Interesting query groups → [Pack] → Complete & disjoint query groups

slide-19
SLIDE 19

Multiple Replicas

Query grouping: Filter → Pack

Replica 1: Columns → Column groups → [Filter] → Interesting column groups → [Pack] → Complete & disjoint column groups
Replica 2: Columns → Column groups → [Filter] → Interesting column groups → [Pack] → Complete & disjoint column groups
Replica 3: Columns → Column groups → [Filter] → Interesting column groups → [Pack] → Complete & disjoint column groups

slide-20
SLIDE 20

Multiple Replicas: TPC-H Customer

Queries: Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8

Query groups (Filter → Pack):
  • Q2, Q3, Q4
  • Q5
  • Q1, Q6, Q7, Q8

Column groups (Filter → Pack), one layout per replica:
  • (Name, Address, Phone, AcctBal, Mktsegment, Comment) (Custkey, Nationkey)
  • (Custkey, Name, Address, Nationkey, Phone, AcctBal, Comment) (Mktsegment)
  • (Phone, AcctBal) (Address, Nationkey, Comment) (Mktsegment) (Custkey) (Name)
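Reading the column lists on this slide as three per-replica layouts (an interpretation of the flattened figure text), each one should be complete and disjoint with respect to the eight TPC-H Customer attributes. A minimal check, with a hypothetical helper name, might look like this:

```python
# The eight attributes named on the slide for TPC-H Customer.
CUSTOMER = {
    "Custkey", "Name", "Address", "Nationkey",
    "Phone", "AcctBal", "Mktsegment", "Comment",
}

# Column groups as read from the slide, one list of groups per replica.
layouts = [
    [{"Name", "Address", "Phone", "AcctBal", "Mktsegment", "Comment"},
     {"Custkey", "Nationkey"}],
    [{"Custkey", "Name", "Address", "Nationkey", "Phone", "AcctBal", "Comment"},
     {"Mktsegment"}],
    [{"Phone", "AcctBal"}, {"Address", "Nationkey", "Comment"},
     {"Mktsegment"}, {"Custkey"}, {"Name"}],
]


def is_complete_and_disjoint(groups, schema):
    """True if no column appears in two groups and every schema column is covered."""
    seen = set()
    for group in groups:
        if seen & group:  # a column was already placed in an earlier group
            return False
        seen |= group
    return seen == schema
```

Calling `is_complete_and_disjoint(layout, CUSTOMER)` on each entry of `layouts` confirms that every replica still stores the full table, only grouped differently.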

slide-21
SLIDE 21

Trojan Layout Advantages

  • Multiple layouts for a given workload
  • Default row layout still available
  • Specialized replicas for different query sub-classes
  • Divide and conquer layout computation


slide-22
SLIDE 22

6. How Do We Ride the Elephant?

slide-23
SLIDE 23

Putting It All Together

Load
  • Create trojan layout configuration file in HDFS (dataset: layout-1, layout-2, layout-3)

Query
  • Supply referenced attributes in JobConf
  • itemize UDF to transparently read the referenced attributes

Schedule

Three Optimization Options:

  • data locality (default)
  • best layout
  • best layout & locality
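The slides only name the three scheduling options, so the sketch below shows one plausible reading of them rather than the system's actual scheduler. The `Replica` class, the per-replica `cost` estimate, the policy strings and the `schedule` function are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    node: str      # node that stores this replica of the block
    layout: str    # label of the layout used by this replica
    cost: float    # assumed estimate of the query's read cost on this layout


def schedule(replicas, requesting_node, policy="locality"):
    """Return (node that runs the map task, replica that it reads)."""
    best = min(replicas, key=lambda r: r.cost)
    local = [r for r in replicas if r.node == requesting_node]
    if policy == "locality":
        # Assumed default: read the local replica whatever its layout.
        chosen = min(local, key=lambda r: r.cost) if local else best
        return requesting_node, chosen
    if policy == "best_layout":
        # Assumed: read the cheapest layout, possibly over the network.
        return requesting_node, best
    if policy == "best_layout_and_locality":
        # Assumed: move the task to the node that holds the cheapest layout.
        return best.node, best
    raise ValueError(f"unknown policy: {policy}")
```

Under this reading, the second option trades network cost for a better layout, while the third trades scheduling flexibility for both a good layout and a local read.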
slide-24
SLIDE 24

7. How Were the Field Trials?

slide-25
SLIDE 25
Setup

  • Datasets: TPC-H Lineitem, TPC-H Customer, SSB LineOrder, SDSS PhotoObj
  • Queries: first 8 queries from the respective benchmark for each table
  • Methodology: focus on scan and projection operators, i.e. map-phase-only jobs; improvement reported for record reader time (I/O and tuple reconstruction)
  • Hardware: 50 virtual nodes in a 10 node cluster

slide-26
SLIDE 26

Per-replica Trojan Layout Performance

[Figure: Improvement Factor (axis 1 to 5) of Trojan Layout over Hadoop-Row and over Hadoop-PAX for TPC-H Queries Q1 to Q8, on TPC-H Lineitem]

slide-27
SLIDE 27

Layout Quality

Layout           #Non-required Attributes Read   #Joins in Tuple Reconstruction
HADOOP-ROW       525                             -
HADOOP-PAX       -                               139
HYRISE* Layout   2                               64
Trojan Layout    14                              20

>14% improvement over HYRISE

* M. Grund et al. HYRISE - A Main Memory Hybrid Storage Engine. PVLDB, November 2010.
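The two quality metrics on this slide can be computed for any layout and workload. The sketch below uses assumed definitions, since the slide does not spell them out: a query reads every column group that contains at least one attribute it references, each unreferenced attribute in such a group counts as a non-required read, and stitching k groups back together costs k - 1 joins. The function name is hypothetical.

```python
def layout_quality(groups, workload):
    """Return (#non-required attributes read, #joins in tuple reconstruction),
    summed over all queries in the workload."""
    non_required = 0
    joins = 0
    for query in workload:
        # Groups that must be read because they hold a referenced attribute.
        touched = [group for group in groups if group & query]
        non_required += sum(len(group - query) for group in touched)
        joins += max(len(touched) - 1, 0)
    return non_required, joins


# Toy example with four attributes and two queries.
workload = [{"a", "b"}, {"c"}]
row = [{"a", "b", "c", "d"}]
column = [{"a"}, {"b"}, {"c"}, {"d"}]
grouped = [{"a", "b"}, {"c", "d"}]
```

With these definitions, a pure row layout never needs a join but reads unneeded attributes, a pure column layout does the opposite, and a grouped layout sits in between.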

slide-28
SLIDE 28

Scheduling Decisions

[Figure: Scheduling Penalty (axis 1 to 5) for TPC-H queries Q1 to Q8 on TPC-H Lineitem, comparing Best-Layout, Locality (default) and Best-Layout & Locality]

slide-29
SLIDE 29

Summary

  • Data layouts crucial to MR job performance
  • Exploit default data block replication in MR
  • Novel algorithm to compute per-replica layouts
  • Improvement: 4.8x over Row, 3.5x over PAX
  • Better than HYRISE; 14% improvement
