Stubby: A Transformation-based Optimizer for MapReduce Workflows
Harold Lim, Herodotos Herodotou, Shivnath Babu (Duke University)
MapReduce Workflow
[Figure: a workflow of MapReduce Jobs (J1 to J7) over Datasets (base datasets D01 and D02, derived datasets D1 to D7). In the optimized version the jobs are packed into J1,2 and J3,4,5,6,7, and only D01, D02, D1, D2, D6, and D7 remain.]
MapReduce Workflow Optimization Challenges
Many Interfaces, Information Spectrum, Large Plan Space
Transformations
[Figure: Speedup of Stubby over Baseline for workflows IR, SN, LA, WG, BA, BR, PJ, US]
Optimization Challenges: Many Interfaces
[Figure: Pig, Hive, Jaql, and FlumeJava each produce a MapReduce Workflow for the MapReduce Execution Engine, each with its own optimizer (OPT). Stubby is a transformation-based optimizer placed between the interfaces and the execution engine; it takes a MapReduce Workflow and emits an Optimized MapReduce Workflow.]
Optimization Challenges: Information Spectrum
[Figure: the input to Stubby is the MapReduce Workflow plus Annotations, such as dataset Schema, Filters, and Partitions.]
Optimization Challenges: Large Plan Space
customization of functionality
large plan space
Annotations: information needed for workflow optimization
K1={C} filter={C<100} K2={O} K3={O} map_cost={50} reduce_cost={20}
dataset = {schema=<C,O,I,N,SH>, partition=<hash(C)>}
Annotation types highlighted: Dataset Annotations, Schema and Filter Annotations, Profile Annotations
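A minimal sketch of how the annotations on this slide could be held in memory. The class names, field names, and the choice to make every field optional (so that an interface can supply only what it knows) are assumptions for illustration; the slides do not show Stubby's actual data structures, and the meaning of K1, K2, and K3 is left as on the slide.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class DatasetAnnotation:
    # Hypothetical container for: dataset = {schema=<...>, partition=<...>}
    schema: Optional[Tuple[str, ...]] = None
    partition: Optional[str] = None


@dataclass
class JobAnnotation:
    # Hypothetical container for the per-job annotations on the slide.
    k1: Optional[Tuple[str, ...]] = None
    k2: Optional[Tuple[str, ...]] = None
    k3: Optional[Tuple[str, ...]] = None
    filter: Optional[Callable[[dict], bool]] = None
    map_cost: Optional[float] = None
    reduce_cost: Optional[float] = None


def missing_annotations(job: JobAnnotation) -> list:
    """Names of annotations that were not supplied for this job."""
    return [name for name, value in vars(job).items() if value is None]


# Values copied from the slide.
dataset = DatasetAnnotation(schema=("C", "O", "I", "N", "SH"), partition="hash(C)")
job = JobAnnotation(
    k1=("C",),
    k2=("O",),
    k3=("O",),
    filter=lambda record: record["C"] < 100,  # filter={C<100}
    map_cost=50,
    reduce_cost=20,
)

# A job for which only K2 is known.
partial_job = JobAnnotation(k2=("O",))
```

Calling `missing_annotations(partial_job)` lists the annotations an optimizer could not rely on for that job, which is one way to picture the "information spectrum" named in the outline.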
& filter annotations
VLDB 2011]
different interfaces by being External to any interface
[Figure: Transformation. Before: job J1 (M1, R1) and job J2 (M2, R2) over datasets D01, D02, D1, and D2. After: a single job J1-2 containing M1, M2, R1, and R2 over the same datasets.]
[Figure: Transformation of partition and sort functions, shown with example records <50,1>, <51,1>, <51,2>. Before: the job with J.K2={O} uses hash(O) and sort(O), and the job with J.K2={O,Z} uses hash(O,Z) and sort(O,Z). After: hash(O) and sort(O,Z).]
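One reading of the figure is that partitioning on O while sorting on (O,Z) produces an ordering that is usable for grouping on O and for grouping on (O,Z). The sketch below illustrates that property with the three records from the slide; the modulo stand-in for hash(O), the number of reducers, and the helper names are assumptions, not Stubby's implementation.

```python
from itertools import groupby

# (O, Z) records as they appear on the slide.
records = [(51, 2), (50, 1), (51, 1)]
NUM_REDUCERS = 2  # assumed for illustration


def partition(o: int) -> int:
    """Stand-in for hash(O): route a record by its O value only."""
    return o % NUM_REDUCERS


# Shuffle: partition on O, then sort each partition on (O, Z).
reducers = {r: [] for r in range(NUM_REDUCERS)}
for o, z in records:
    reducers[partition(o)].append((o, z))
for rows in reducers.values():
    rows.sort()  # tuple order is exactly sort(O, Z)


def groups(rows, key):
    """Group consecutive rows by key, as a reduce-side iterator would."""
    return [(k, list(g)) for k, g in groupby(rows, key=key)]


by_o = groups(reducers[1], key=lambda row: row[0])
by_o_z = groups(reducers[1], key=lambda row: row)
```

Because all records with the same O land in one partition and arrive sorted on (O,Z), both `by_o` and `by_o_z` see each group as one contiguous run.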
cost-based approach
[Figure: Speedup, axis from 0.5 to 3, with regions labeled Performance Degradation and Performance Improvement]
(more functions in a task)
+ Eliminates inter-task data transfer
+ Eliminates sorting overhead
+ Eliminates writing output to disk
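The benefits listed above come from running more functions inside one task. As a rough illustration only, the sketch below chains two hypothetical map functions in-process and compares that with materializing the intermediate records between them; the function names and data are invented, and real tasks would involve disk and network I/O that this toy omits.

```python
def map_a(record):
    # Hypothetical first function: double the value.
    yield record * 2


def map_b(record):
    # Hypothetical second function: keep multiples of 3.
    if record % 3 == 0:
        yield record


def run_separately(records):
    """Two steps with a materialized intermediate dataset between them."""
    intermediate = [out for r in records for out in map_a(r)]
    result = [out for r in intermediate for out in map_b(r)]
    return result, len(intermediate)


def run_packed(records):
    """Both functions in one pass; no intermediate dataset is built."""
    return [out for r in records for mid in map_a(r) for out in map_b(mid)]


separate_result, intermediate_size = run_separately(range(6))
packed_result = run_packed(range(6))
```

Both paths return the same records, but only `run_separately` builds the intermediate list. The speedup figure on the slide shows that such packing can also degrade performance, which is why the choice is made by a cost-based approach.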
[Figure: Transformation over map (M) and reduce (R) functions within a MapReduce job]
[Figure: Transformation. Before: three jobs, each with one map and one reduce function (M R, M R, M R). After: one job containing M M M and R R R.]
[Figure: Transformation. Before: data is partitioned with hash(O) and a later function applies filter={0<=O<100}. After: data is partitioned with range(O) using split-points(100,200,…), with the same filter={0<=O<100}.]
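A small sketch of why range partitioning on O can help a filter such as 0<=O<100: with split points at 100, 200, and so on, all matching records fall into the first partition, whereas hash partitioning scatters them. The concrete split points below (the slide elides them after 200), the integer-valued O, and the helper names are assumptions for illustration.

```python
from bisect import bisect_right

# Hypothetical split points; the slide shows split-points(100,200,…).
split_points = [100, 200, 300]


def range_partition(o: int) -> int:
    """Partition index for O: 0 holds O < 100, 1 holds 100 <= O < 200, etc."""
    return bisect_right(split_points, o)


def partitions_for_filter(lo: int, hi: int) -> list:
    """Partitions that can contain integer values with lo <= O < hi."""
    first = range_partition(lo)
    last = range_partition(hi - 1)
    return list(range(first, last + 1))


# filter={0<=O<100} from the slide touches a single range partition.
needed = partitions_for_filter(0, 100)
```

Here `needed` contains only partition 0, so a consumer applying the filter could skip the other partitions; under hash(O) every partition might hold matching records.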
[Figure: Transformation of configuration. One version of the job is drawn with two map and two reduce tasks and labeled Memory Buffer 512MB, 2 Reduce Tasks; the alternative is drawn with two map and four reduce tasks and labeled Memory Buffer 128MB.]
buffer, compression, combiner, reduce tasks, etc)
transformations (interaction)
[Figure: the workflow of jobs J1 to J7 with their map and reduce functions over datasets D01, D02, and D1 to D7, with optimization unit U(1) marked.]
Optimization unit localizes interactions among plan space choices
[Figure: after J1 and J2 are packed into J1-2, optimization unit U(2) is marked.]
Dynamically generated because previous
transforms workflow
Top-Down because producer jobs affect the input datasets
[Figure: at U(4), the workflow consists of J1-2 and J3-7 over D01, D02, D1, D2, D6, and D7.]
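A top-down traversal visits producer jobs before the jobs that consume their output. The sketch below shows one way to obtain such an order with Python's standard `graphlib`. The dependency edges are hypothetical, chosen only to resemble the seven-job example (the slides confirm only that J7 reads D5), and this is not a description of how Stubby builds its optimization units.

```python
from graphlib import TopologicalSorter

# Hypothetical job dependencies: each job maps to the producer jobs
# whose output datasets it reads.
producers = {
    "J1": [],
    "J2": [],
    "J3": ["J1", "J2"],
    "J4": ["J3"],
    "J5": ["J4"],
    "J6": ["J4"],
    "J7": ["J5"],
}

# static_order() yields every job after all of its producers.
order = list(TopologicalSorter(producers).static_order())


def producers_first(order, producers) -> bool:
    """Check that each producer appears before every one of its consumers."""
    position = {job: i for i, job in enumerate(order)}
    return all(
        position[p] < position[job]
        for job, deps in producers.items()
        for p in deps
    )
```

Several valid orders exist for this graph, so the sketch checks the producer-before-consumer property with `producers_first` rather than relying on one specific sequence.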
spaces
[Figure: jobs J5 (M5, R5), J6 (M6, R6), and J7 (M7, R7) with datasets D5, D6, and D7]
consumer jobs
compressing D5 forces J7 to decompress
[Figure: four combinations p1 to p4 for the part of the workflow containing M3, R3, and M4. In p1 and p2 the datasets are D1, D2, D3, and D4; in p3 and p4 the datasets are D1, D2, and D4, with D3 no longer present.]
configurations with best cost for each combination pi
Best Cost: p1 = 20, p2 = 18, p3 = 15, p4 = 16
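The final selection among the combinations can be pictured as taking the minimum over their best costs. This sketch assumes the four costs on the slide are listed in the order p1 to p4 and that a lower cost is better; the variable names are illustrative.

```python
# Best cost found for each combination, in slide order p1..p4 (assumed).
best_cost = {"p1": 20, "p2": 18, "p3": 15, "p4": 16}

# Choose the combination whose best configuration is cheapest.
chosen = min(best_cost, key=best_cost.get)
chosen_cost = best_cost[chosen]
```

Under these assumptions the chosen combination is p3, with cost 15.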
MapReduce classes to run multiple functions in map/reduce tasks)
application domains (ranges from 2 to 7 jobs)
Pig and manually-tuned configurations using rules-of-thumb

Abbr  Workflow                     Dataset Size
IR    Information Retrieval        264 GB
SN    Social Network Analysis      267 GB
LA    Log Analysis                 500 GB
WG    Web Graph Analysis           255 GB
BA    Business Analytics Query     550 GB
BR    Business Report Generation   530 GB
PJ    Post-processing Jobs         10 GB
US    User-defined Logical Splits  530 GB

[Figure: Speedup over Baseline, axis from 1 to 5, for Stubby, Vertical, and Horizontal on workflows IR, SN, LA, WG, BA, BR, PJ, US]
configuration parameters
transformation
[Figure: Speedup over Baseline, axis from 1 to 5, for Stubby, Starfish, YSmart, and MRShare on workflows IR, SN, LA, WG, BA, BR, PJ, US]
[Figure: Optimization Time (s), axis from 50 to 250, and Optimization Overhead (%), axis from 0% to 10%, for workflows IR, SN, LA, WG, BA, BR, PJ, US]
2011], Manimal [VLDB 2011], Jaql [VLDB 2011]
2011]
Zhou et al. [ICDE 2010]
et al. [VLDB 1996]