Stubby: A Transformation-based Optimizer for MapReduce Workflows



slide-1
SLIDE 1

Stubby: A Transformation-based Optimizer for MapReduce Workflows

Harold Lim, Herodotos Herodotou, Shivnath Babu Duke University

slide-2
SLIDE 2

MapReduce Workflow

30

slide-3
SLIDE 3

[Figure: workflow graph with source datasets D01 and D02, jobs J1 to J7, and datasets D1 to D7]

MapReduce Workflow

31

slide-4
SLIDE 4

[Figure: workflow graph with source datasets D01 and D02, jobs J1 to J7, and datasets D1 to D7]

MapReduce Workflow

MapReduce Jobs

32

Datasets

slide-5
SLIDE 5

[Figure: workflow graph with source datasets D01 and D02, jobs J1 to J7, and datasets D1 to D7]

33

Automatic MapReduce Workflow Optimizer

slide-6
SLIDE 6

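[Figure: the original workflow (jobs J1 to J7, datasets D01, D02, D1 to D7) next to the optimized workflow with two packed jobs, J1,2 and J3,4,5,6,7, and datasets D01, D02, D1, D2, D6, D7]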

34

Automatic MapReduce Workflow Optimizer 7 Jobs to 2 Jobs!

slide-7
SLIDE 7
  • Stubby [ˈstʌbɪ] adj - short and broad

35

Automatic MapReduce Workflow Optimizer

slide-11
SLIDE 11

Outline

39

MapReduce Workflow Optimization Challenges

slide-13
SLIDE 13

Outline

Many Interfaces Information Spectrum Large Plan Space

41

MapReduce Workflow Optimization Challenges

slide-14
SLIDE 14

Outline

Many Interfaces Information Spectrum Large Plan Space

42

Transformations

Annotations Interactions

MapReduce Workflow Optimization Challenges

slide-15
SLIDE 15

Outline

Many Interfaces Information Spectrum Large Plan Space

43

Transformations

Annotations Interactions

[Chart preview: speedup (axis ticks 2 and 4) of Stubby vs. Baseline for workflows IR, SN, LA, WG, BA, BR, PJ, US]

MapReduce Workflow Optimization Challenges

slide-16
SLIDE 16

MapReduce Ecosystem

44

slide-17
SLIDE 17

MapReduce Ecosystem

45

MapReduce Execution Engine MapReduce Workflow Pig Hive Jaql FlumeJava

slide-18
SLIDE 18

MapReduce Ecosystem

Optimization Challenges Many Interfaces

46

MapReduce Execution Engine MapReduce Workflow Pig Hive Jaql FlumeJava

slide-19
SLIDE 19

MapReduce Ecosystem

Optimization Challenges Many Interfaces

47

OPT OPT OPT OPT MapReduce Execution Engine MapReduce Workflow Pig Hive Jaql FlumeJava

slide-20
SLIDE 20

MapReduce Ecosystem

Optimization Challenges Many Interfaces

48

Optimized MapReduce Workflow Transformation-based Pig Hive Jaql FlumeJava MapReduce Execution Engine MapReduce Workflow Stubby

slide-21
SLIDE 21

MapReduce Ecosystem

Optimization Challenges Many Interfaces

49

Optimized MapReduce Workflow Transformation-based Pig Hive Jaql FlumeJava MapReduce Execution Engine MapReduce Workflow Stubby Schema Filters Partitions Dataset

slide-22
SLIDE 22

MapReduce Ecosystem

Optimization Challenges Many Interfaces Information Spectrum

50

Optimized MapReduce Workflow Transformation-based Pig Hive Jaql FlumeJava MapReduce Execution Engine MapReduce Workflow Stubby Schema Filters Partitions Dataset

slide-23
SLIDE 23

Optimized MapReduce Workflow Transformation-based Pig Hive Jaql FlumeJava MapReduce Execution Engine Annotations, MapReduce Workflow Stubby Schema Filters Partitions Dataset

MapReduce Ecosystem

Optimization Challenges Many Interfaces Information Spectrum

51

slide-24
SLIDE 24

MapReduce Ecosystem

Optimization Challenges Many Interfaces Information Spectrum

52

Optimized MapReduce Workflow Transformation-based Pig Hive Jaql FlumeJava MapReduce Execution Engine Annotations, MapReduce Workflow Stubby

slide-25
SLIDE 25

MapReduce Ecosystem

Optimization Challenges Many Interfaces Information Spectrum Large Plan Space

53

Optimized MapReduce Workflow Transformation-based Pig Hive Jaql FlumeJava MapReduce Execution Engine Annotations, MapReduce Workflow Stubby

slide-27
SLIDE 27

Design Principles of Stubby

55

slide-28
SLIDE 28

Design Principles of Stubby

  • Optional Annotations convey information to Stubby
  • Transformations allow for easy extension and customization of functionality
  • Identification of non-interacting subspaces to deal with large plan space

56

slide-31
SLIDE 31

Next

Many Interfaces Information Spectrum Large Plan Space

59

Transformations

Annotations Interactions

MapReduce Workflow Optimization Challenges

slide-33
SLIDE 33

Annotations

  • Mechanism for higher levels to communicate information

needed for workflow optimization

  • 3 Types of Annotations

K1={C} filter={C<100} K2={O} K3={O} map_cost={50} reduce_cost={20}

M R

dataset = {schema=<C,O,I,N,SH>, partition=<hash(C)>}

61

slide-34
SLIDE 34

Annotations

  • Mechanism for higher levels to communicate information

needed for workflow optimization

  • 3 Types of Annotations

K1={C} filter={C<100} K2={O} K3={O} map_cost={50} reduce_cost={20}

M R

dataset = {schema=<C,O,I,N,SH>, partition=<hash(C)>}

62

Dataset Annotations

slide-35
SLIDE 35

Annotations

  • Mechanism for higher levels to communicate information

needed for workflow optimization

  • 3 Types of Annotations

K1={C} filter={C<100} K2={O} K3={O} map_cost={50} reduce_cost={20}

M R

dataset = {schema=<C,O,I,N,SH>, partition=<hash(C)>}

63

Schema and Filter Annotations

slide-36
SLIDE 36

Annotations

  • Mechanism for higher levels to communicate information

needed for workflow optimization

  • 3 Types of Annotations

K1={C} filter={C<100} K2={O} K3={O} map_cost={50} reduce_cost={20}

M R

dataset = {schema=<C,O,I,N,SH>, partition=<hash(C)>}

64

Profile Annotations
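
The annotation types on this slide can be pictured as plain records. The sketch below is illustrative only: the class and field names (`DatasetAnnotation`, `JobAnnotation`, `supplied_kinds`) are assumptions, not Stubby's actual annotation format, and reading K1, K2, K3 as the key fields of map input, map output, and reduce output is an assumption based on the usual MapReduce signature. The values are the ones shown on the slide.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetAnnotation:
    schema: Tuple[str, ...]   # field names of the dataset
    partition: str            # how the stored dataset is partitioned


@dataclass(frozen=True)
class JobAnnotation:
    # Schema annotations (assumed meaning: key fields at each stage).
    k1: Tuple[str, ...]                  # map input key
    k2: Tuple[str, ...]                  # map output / reduce input key
    k3: Tuple[str, ...]                  # reduce output key
    filter: Optional[str] = None         # filter annotation, optional
    map_cost: Optional[float] = None     # profile annotation, optional
    reduce_cost: Optional[float] = None  # profile annotation, optional


def supplied_kinds(job: JobAnnotation) -> list:
    """List which optional annotation kinds accompany the schema annotation."""
    kinds = ["schema"]
    if job.filter is not None:
        kinds.append("filter")
    if job.map_cost is not None and job.reduce_cost is not None:
        kinds.append("profile")
    return kinds


# Values copied from the slide.
dataset = DatasetAnnotation(schema=("C", "O", "I", "N", "SH"), partition="hash(C)")
job = JobAnnotation(k1=("C",), k2=("O",), k3=("O",),
                    filter="C<100", map_cost=50, reduce_cost=20)
unprofiled = JobAnnotation(k1=("C",), k2=("O",), k3=("O",))

print(supplied_kinds(job))         # ['schema', 'filter', 'profile']
print(supplied_kinds(unprofiled))  # ['schema']
```

Because every annotation other than the schema is optional in this sketch, an optimizer built on it can check what is present and only consider the transformations that the supplied information supports.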

slide-37
SLIDE 37

Who Creates the Annotations?

65

slide-38
SLIDE 38

Who Creates the Annotations?

  • Interfaces have all the information. Just propagate it
  • E.g., Pig Latin statement: A = LOAD 'data' AS (A,B,C);
  • Modified Pig to automatically generate dataset, schema, & filter annotations
  • Only ~570 lines of code! (Pig is ~80000 lines of code)
  • Profile Annotations generated using Starfish [Herodotou VLDB 2011]
  • Stubby considers optimizations based on what is given
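
To make "just propagate it" concrete, here is a toy sketch of how a schema annotation could be read off a LOAD statement like the one on the slide (written with straight quotes, as Pig requires). It is not the ~570-line Pig modification: the regular expression, the function name `annotation_from_load`, and the dictionary layout are all assumptions made for illustration, and only this one statement shape is handled.

```python
import re

# Matches statements of the form:  alias = LOAD 'path' AS (f1, f2, ...);
LOAD_RE = re.compile(
    r"^\s*(\w+)\s*=\s*LOAD\s+'([^']+)'\s+AS\s+\(([^)]*)\)\s*;\s*$",
    re.IGNORECASE,
)


def annotation_from_load(statement: str) -> dict:
    """Derive a dataset/schema annotation from one LOAD ... AS (...) statement."""
    match = LOAD_RE.match(statement)
    if match is None:
        raise ValueError(f"not a supported LOAD statement: {statement!r}")
    alias, path, field_list = match.groups()
    # Keep only the field name if a type is given, e.g. "a:int" -> "a".
    fields = [f.split(":")[0].strip() for f in field_list.split(",") if f.strip()]
    return {"alias": alias, "dataset": path, "schema": fields}


print(annotation_from_load("A = LOAD 'data' AS (A,B,C);"))
# {'alias': 'A', 'dataset': 'data', 'schema': ['A', 'B', 'C']}
```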

66

slide-42
SLIDE 42

Next

Many Interfaces Information Spectrum Large Plan Space

70

Transformations

Annotations Interactions

MapReduce Workflow Optimization Challenges

slide-44
SLIDE 44

Transformations

  • Transformations + Annotations allow Stubby to support

different interfaces by being External to any interface

72

slide-45
SLIDE 45

Transformations

  • Transformations + Annotations allow Stubby to support

different interfaces by being External to any interface

M1 R1 M2 R2 D1 D2 D01 D02 J1 J2

73

slide-46
SLIDE 46

Transformations

  • Transformations + Annotations allow Stubby to support

different interfaces by being External to any interface

M1 R1 M2 R2 D1 D2 D01 D02 J1 J2

Transformation 74

slide-47
SLIDE 47

Transformations

  • Transformations + Annotations allow Stubby to support

different interfaces by being External to any interface

M1 R1 M2 R2 D1 D2 D01 D02 J1 J2 M1 M2 R1 R2 D1 D2 D01 D02 J1-2

Transformation 75

slide-48
SLIDE 48

Transformations

  • Transformations + Annotations allow Stubby to support

different interfaces by being External to any interface

M1 R1 M2 R2 D1 D2 D01 D02 J1 J2 M1 M2 R1 R2 D1 D2 D01 D02 J1-2

Transformation

  • Annotations ensure only valid transformations are considered

76

slide-49
SLIDE 49

Transformations

  • Transformations + Annotations allow Stubby to support different interfaces by being external to any interface

[Figure: jobs J1 (M1, R1) and J2 (M2, R2) over datasets D01, D02, D1, D2, before and after being packed into a single job J1-2 (M1, M2, R1, R2)]

Transformation

  • Annotations ensure only valid transformations are considered
  • Transformations can be combined (whole >> sum of parts!)
  • Stubby considers 5 types of transformations (more to come)

77

slide-51
SLIDE 51

Intra-Job Vertical Packing

  • Transforms a MapReduce job into a Map-only job

79

slide-52
SLIDE 52

Intra-Job Vertical Packing

  • Transforms a MapReduce job into a Map-only job

80

[Figure: two consecutive jobs. The first has J.K2={O} and shuffles with hash(O), sort(O); the second has J.K2={O,Z} and shuffles with hash(O,Z), sort(O,Z). Example records: <50,1>, <51,1>, <51,2>]

slide-53
SLIDE 53

Intra-Job Vertical Packing

  • Transforms a MapReduce job into a Map-only job

Transformation

81

[Figure: before the transformation, the first job (J.K2={O}) shuffles with hash(O), sort(O) and the second job (J.K2={O,Z}) shuffles with hash(O,Z), sort(O,Z). After it, a single shuffle with hash(O), sort(O,Z) serves both. Example records: <50,1>, <51,1>, <51,2>]

slide-55
SLIDE 55

Intra-Job Vertical Packing

  • Transforms a MapReduce job into a Map-only job

Transformation

  • Group/Partition requirements of both jobs are now enforced at the same time

83

[Figure: before the transformation, the first job (J.K2={O}) shuffles with hash(O), sort(O) and the second job (J.K2={O,Z}) shuffles with hash(O,Z), sort(O,Z). After it, a single shuffle with hash(O), sort(O,Z) serves both. Example records: <50,1>, <51,1>, <51,2>]
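
Why one shuffle can serve both jobs can be checked with a small simulation. The sketch below is a toy model, not Stubby code: `shuffle` and `groups_are_complete` are hypothetical helpers, and records are (O, Z) pairs using the values from the figure. Partitioning on O while sorting on (O, Z) keeps every O group and every (O, Z) group contiguous inside one partition, which is the grouping requirement of both jobs.

```python
from itertools import groupby


def shuffle(records, num_partitions, partition_key, sort_key):
    """Simulate one MapReduce shuffle: partition, then sort each partition."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[hash(partition_key(rec)) % num_partitions].append(rec)
    return [sorted(part, key=sort_key) for part in partitions]


def groups_are_complete(partitions, group_key):
    """True if every group is contiguous and confined to a single partition."""
    seen = set()
    for part in partitions:
        for key, _ in groupby(part, key=group_key):
            if key in seen:  # group split across partitions or runs
                return False
            seen.add(key)
    return True


records = [(51, 2), (50, 1), (51, 1), (50, 1), (51, 2), (51, 1)]  # (O, Z)

# Packed plan: partition by hash(O), sort by (O, Z).
packed = shuffle(records, 2, partition_key=lambda r: r[0], sort_key=lambda r: r)
print(groups_are_complete(packed, lambda r: r[0]))  # True: groups for K2={O}
print(groups_are_complete(packed, lambda r: r))     # True: groups for K2={O,Z}

# Counterexample: partitioning on Z alone splits the O=51 group.
by_z = shuffle(records, 2, partition_key=lambda r: r[1], sort_key=lambda r: r)
print(groups_are_complete(by_z, lambda r: r[0]))    # False
```

The counterexample is there to show that the choice of partition key is what makes the packing valid; the precise validity conditions Stubby checks are not stated on the slides.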

slide-56
SLIDE 56

Intra-job Vertical Packing (2)

  • Can have positive / negative effect on performance -> Need cost-based approach

84

[Chart: speedup on a scale from 0.5 to 3; below 1 is performance degradation, above 1 is performance improvement]

slide-57
SLIDE 57

Intra-job Vertical Packing (2)

  • Can have positive / negative effect on performance -> Need cost-based approach

85

[Chart: speedup on a scale from 0.5 to 3; below 1 is performance degradation, above 1 is performance improvement]

  • - Forces dependencies of configurations (e.g., parallelism)
  • - Resource contention (more functions in a task)

slide-58
SLIDE 58

Intra-job Vertical Packing (2)

  • Can have positive / negative effect on performance -> Need cost-based approach

86

[Chart: speedup on a scale from 0.5 to 3; below 1 is performance degradation, above 1 is performance improvement]

  • - Forces dependencies of configurations (e.g., parallelism)
  • - Resource contention (more functions in a task)
  • + Eliminates inter-task data transfer
  • + Eliminates sorting overhead
  • + Eliminates writing output to disk

slide-59
SLIDE 59

Inter-job Vertical Packing

  • Merges a map-only job with another job

87

slide-60
SLIDE 60

Inter-job Vertical Packing

  • Merges a map-only job with another job

M M R

88

slide-61
SLIDE 61

Inter-job Vertical Packing

  • Merges a map-only job with another job

M M R

Transformation 89

slide-62
SLIDE 62

Inter-job Vertical Packing

  • Merges a map-only job with another job

M M R

Transformation

M

90

R M

slide-63
SLIDE 63

Inter-job Vertical Packing

  • Merges a map-only job with another job

M M R

Transformation

M

91

R M

  • Combining intra-job + inter-job vertical packing turns 2 MapReduce jobs into 1 MapReduce job

slide-64
SLIDE 64

Inter-job Vertical Packing

  • Merges a map-only job with another job

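[Figure: a map-only job (M) and a MapReduce job (M, R) before the transformation; a single packed job after it]

Transformation

  • Again, not always a good thing
  • + Eliminates writing to disk
  • - Forces dependencies

92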
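
At the function level, merging a map-only job into another job amounts to piping one function's output into the next inside the same task, so the intermediate dataset is never written to disk. The sketch below assumes the map-only job is the producer and is merged into the map phase of its consumer; `pack_maps`, `parse_and_filter`, and `key_by_user` are hypothetical names, not part of Stubby or Hadoop.

```python
def pack_maps(first_map, second_map):
    """Return one map function that feeds each output of first_map to second_map."""
    def packed(record):
        for intermediate in first_map(record):
            yield from second_map(intermediate)
    return packed


def parse_and_filter(line):
    """Hypothetical map-only job: parse 'user,amount' and drop non-positive amounts."""
    user, amount = line.split(",")
    if int(amount) > 0:
        yield (user, int(amount))


def key_by_user(pair):
    """Hypothetical map function of the consumer job."""
    user, amount = pair
    yield (user.upper(), amount)


packed = pack_maps(parse_and_filter, key_by_user)
print(list(packed("ann,5")))  # [('ANN', 5)]
print(list(packed("bob,0")))  # []
```

The "forces dependencies" drawback is visible here as well: both functions now run in the same task, so they share its parallelism and memory settings.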

slide-65
SLIDE 65

Horizontal Packing

  • Combine concurrent running jobs into a single job

93

slide-66
SLIDE 66

Horizontal Packing

  • Combine concurrent running jobs into a single job

94

M R M R M R

slide-67
SLIDE 67

Horizontal Packing

  • Combine concurrent running jobs into a single job

Transformation 95

M R M R M R

slide-68
SLIDE 68

Horizontal Packing

  • Combine concurrent running jobs into a single job

Transformation 96

M R M R M R M M M R R R

slide-69
SLIDE 69

Horizontal Packing

  • Combine concurrent running jobs into a single job

Transformation

  • + Read dataset once
  • + Share overhead of launching jobs
  • - Extra overhead of sorting/partitioning combined map output
  • - Share limited memory resources per task (can spill more)

97

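[Figure: three MapReduce jobs (M, R) packed into a single job that runs all three map functions and all three reduce functions]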
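
A minimal in-memory sketch of the idea, under the assumption that the packed jobs read the same input: each record is read once, passed to every job's map function, and the map output is tagged with the job it belongs to so that the right reduce function is applied later. The function `run_packed` and the two example jobs are hypothetical.

```python
from collections import defaultdict


def run_packed(records, jobs):
    """Run several jobs in one pass. jobs maps job_id -> (map_fn, reduce_fn)."""
    grouped = defaultdict(list)
    for rec in records:  # the dataset is read once
        for job_id, (map_fn, _) in jobs.items():
            for key, value in map_fn(rec):
                grouped[(job_id, key)].append(value)  # tag map output by job
    results = defaultdict(dict)
    for (job_id, key), values in sorted(grouped.items()):
        reduce_fn = jobs[job_id][1]
        results[job_id][key] = reduce_fn(key, values)
    return dict(results)


records = [("a", 3), ("b", 5), ("a", 4)]
jobs = {
    "sum": (lambda r: [(r[0], r[1])], lambda key, values: sum(values)),
    "count": (lambda r: [(r[0], 1)], lambda key, values: len(values)),
}
print(run_packed(records, jobs))
# {'count': {'a': 2, 'b': 1}, 'sum': {'a': 7, 'b': 5}}
```

The combined `grouped` structure also hints at the cost listed on the slide: all jobs' map outputs are sorted and partitioned together and share one task's memory.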

slide-70
SLIDE 70

Partition Function

  • Change how map outputs are partitioned and sorted

98

slide-71
SLIDE 71

Partition Function

  • Change how map outputs are partitioned and sorted

R filter={0<=O<100} hash(O) M R M

99

slide-72
SLIDE 72

Partition Function

  • Change how map outputs are partitioned and sorted

R filter={0<=O<100} hash(O) M R M

Transformation 100

slide-73
SLIDE 73

Partition Function

  • Change how map outputs are partitioned and sorted

R filter={0<=O<100} hash(O) M R M R filter={0<=O<100} range(O) split-points(100,200,…) M R M

Transformation 101

slide-74
SLIDE 74

Partition Function

  • Change how map outputs are partitioned and sorted

[Figure: the producer's partition function changes from hash(O) to range(O) with split-points(100,200,…); the consumer applies filter={0<=O<100}]

Transformation

  • Enables partition pruning
  • Enables vertical packing transformation
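
Partition pruning can be illustrated with a few lines of code. This is a sketch under assumptions: keys are integers, partitions are half-open ranges delimited by the split points, and the third split point (300) is made up because the slide elides the rest of the list. With hash partitioning, a consumer with filter 0<=O<100 must read every partition; with range partitioning it can skip all but one.

```python
from bisect import bisect_right


def range_partition(value, split_points):
    """Index of the range partition that holds value."""
    return bisect_right(split_points, value)


def partitions_to_read(lo, hi, split_points):
    """Partitions that can contain integer keys in the half-open range [lo, hi)."""
    first = range_partition(lo, split_points)
    last = range_partition(hi - 1, split_points)
    return list(range(first, last + 1))


split_points = [100, 200, 300]  # 300 is a hypothetical continuation
print(partitions_to_read(0, 100, split_points))    # [0]
print(partitions_to_read(150, 250, split_points))  # [1, 2]
```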

102

slide-75
SLIDE 75

Configuration Transformation

  • Changes the configuration of a MapReduce job

103

slide-76
SLIDE 76

Configuration Transformation

  • Changes the configuration of a MapReduce job

104

M M R R

Memory Buffer 512MB 2 Reduce Tasks

slide-77
SLIDE 77

Configuration Transformation

  • Changes the configuration of a MapReduce job

Transformation 105

M M R R M M R R R R

Memory Buffer 512MB 2 Reduce Tasks

  • vs. 4 Reduce Tasks

vs. Memory Buffer 128MB

slide-78
SLIDE 78

Configuration Transformation

  • Changes the configuration of a MapReduce job

Transformation

  • Many configurations that affect performance (e.g., sort buffer, compression, combiner, reduce tasks, etc.)
  • Impact of configuration depends on other transformations (interaction)

106

[Figure: the same job under two configurations: a 512MB memory buffer with 2 reduce tasks vs. a 128MB memory buffer with 4 reduce tasks]

slide-80
SLIDE 80

Next

Many Interfaces Information Spectrum Large Plan Space

108

Transformations

Annotations Interactions

MapReduce Workflow Optimization Challenges

slide-82
SLIDE 82

Optimization Process

110

slide-83
SLIDE 83

Optimization Process

M5 R5 M7 R7 M4 M6 R6 M3 R3 M1 R1 M2 R2 D1 D2 D3 D4 D6 D5 D7 D01 D02 J1 J2 J3 J4 J5 J6 J7

U(1)

111

Optimization unit localizes interactions among plan space choices

slide-84
SLIDE 84

Optimization Process

M5 R5 M7 R7 M4 M6 R6 M3 R3 M1 M2 R1 R2 D1 D2 D3 D4 D6 D5 D7 D01 D02 J1-2 J3 J4 J5 J6 J7

U(1)

112

slide-85
SLIDE 85

Optimization Process

M5 R5 M7 R7 M4 M6 R6 M3 R3 M1 M2 R1 R2 D1 D2 D3 D4 D6 D5 D7 D01 D02 J1-2 J3 J4 J5 J6 J7

U(2)

113

slide-86
SLIDE 86

Optimization Process

M5 R5 M7 R7 M4 M6 R6 M3 R3 M1 M2 R1 R2 D1 D2 D3 D4 D6 D5 D7 D01 D02 J1-2 J3 J4 J5 J6 J7

U(2)

Dynamically generated because previous optimization unit transforms workflow

114

Top-Down because producer jobs affect the input datasets of consumer jobs
slide-87
SLIDE 87

Optimization Process

R5 R6 M7 R7 M3 R3 M4 M5 M6 M1 M2 R1 R2 D1 D2 D6 D7 D01 D02 J1-2 J3-7

U(4)

115

slide-88
SLIDE 88

Divide and Conquer

116

slide-89
SLIDE 89

Divide and Conquer

  • Divide workflow into Optimization Units to have smaller plan

spaces

  • Issue: Interactions among plan space choices
  • Insight: Based on Dataset and Resource dependencies

117

slide-91
SLIDE 91

Divide and Conquer

  • Divide workflow into Optimization Units to have smaller plan

spaces

  • Issue: Interactions among plan space choices
  • Insight: Based on Dataset and Resource dependencies

M5 R5 M7 R7 M6 R6 D6 D5 D7 J5 J6 J7

119

slide-92
SLIDE 92
  • Divide into producer-consumer relationships
  • Transformations on producer jobs affect transformations on consumer jobs
  • E.g., partition function on J5 -> vertical packing on J7, compressing D5 forces J7 to decompress

Divide and Conquer

  • Divide workflow into Optimization Units to have smaller plan

spaces

  • Issue: Interactions among plan space choices
  • Insight: Based on Dataset and Resource dependencies

M5 R5 M7 R7 M6 R6 D6 D5 D7 J5 J6 J7

120

slide-94
SLIDE 94

Divide and Conquer

  • Divide workflow into Optimization Units to have smaller plan

spaces

  • Issue: Interactions among plan space choices
  • Insight: Based on Dataset and Resource dependencies

M5 R5 M7 R7 M6 R6 D6 D5 D7 J5 J6 J7

  • Concurrent jobs use the same cluster resources
  • E.g., affects configuration and horizontal packing transformations
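
The two kinds of dependency can be derived mechanically from each job's input and output datasets. The sketch below is not Stubby's unit-construction algorithm (which, per the earlier slide, generates units dynamically); it only shows dataset dependencies as producer-consumer edges and resource dependencies as jobs that land on the same top-down level. The inputs of J5 and J6 (D3, D4) are assumed for the example, and the function names are hypothetical.

```python
def producer_consumer_edges(jobs):
    """jobs maps job -> (inputs, outputs). Returns a set of (producer, consumer)."""
    producer_of = {d: job for job, (_, outs) in jobs.items() for d in outs}
    return {
        (producer_of[d], job)
        for job, (ins, _) in jobs.items()
        for d in ins
        if d in producer_of
    }


def top_down_levels(jobs):
    """Group jobs into levels so that every producer precedes its consumers."""
    edges = producer_consumer_edges(jobs)
    remaining = set(jobs)
    levels = []
    while remaining:
        ready = sorted(
            job for job in remaining
            if all(p not in remaining for p, c in edges if c == job)
        )
        if not ready:
            raise ValueError("cycle in workflow")
        levels.append(ready)
        remaining -= set(ready)
    return levels


jobs = {
    "J5": (["D3"], ["D5"]),        # input D3 is an assumption
    "J6": (["D4"], ["D6"]),        # input D4 is an assumption
    "J7": (["D5", "D6"], ["D7"]),
}
print(sorted(producer_consumer_edges(jobs)))  # [('J5', 'J7'), ('J6', 'J7')]
print(top_down_levels(jobs))                  # [['J5', 'J6'], ['J7']]
```

J5 and J6 share a level, so they can run concurrently and compete for cluster resources; J7 depends on both through D5 and D6.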

122

slide-95
SLIDE 95

Divide and Conquer

  • Divide workflow into Optimization Units to have smaller plan

spaces

  • Issue: Interactions among plan space choices
  • Insight: Based on Dataset and Resource dependencies

M5 R5 M7 R7 M6 R6 D6 D5 D7 J5 J6 J7

123

slide-96
SLIDE 96

Within an Optimization Unit

  • Enumerate all valid combinations of packing transformations

124

slide-97
SLIDE 97

Within an Optimization Unit

  • Enumerate all valid combinations of packing transformations

M4 M3 R3 D1 D2 D3 D4

p1

M4 M3 R3 D1 D2 D3 D4

p2

M3 R3 M4 D1 D2 D4

p3

M3 R3 M4 D1 D2 D4

p4

M4 M3 R3 D1 D2 D3 D4 125

slide-98
SLIDE 98

Within an Optimization Unit

  • Enumerate all valid combinations of packing transformations

M4 M3 R3 D1 D2 D3 D4

p1

M4 M3 R3 D1 D2 D3 D4

p2

M3 R3 M4 D1 D2 D4

p3

M3 R3 M4 D1 D2 D4

p4

126

slide-99
SLIDE 99

Within an Optimization Unit

  • Enumerate all valid combinations of packing transformations

M4 M3 R3 D1 D2 D3 D4

p1

M4 M3 R3 D1 D2 D3 D4

p2

M3 R3 M4 D1 D2 D4

p3

M3 R3 M4 D1 D2 D4

p4

  • Use Starfish’s What-If Engine [Herodotou VLDB 2011] for costing
  • Use Recursive Random Search [Ye SIGMETRICS 03] to find configurations with best cost for each combination pi
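
The search step can be sketched as follows. This is a simplified variant in the spirit of Recursive Random Search (sample the whole space, then repeatedly shrink the sampling box around the best point); it omits the restarts of the published algorithm. The cost function is a made-up stand-in for a What-If Engine estimate, and the parameter names `reduce_tasks` and `sort_buffer_mb` are hypothetical.

```python
import random


def recursive_random_search(cost, bounds, rng, samples=50, rounds=6, shrink=0.5):
    """Simplified Recursive Random Search over a box of numeric parameters."""

    def sample(box):
        return {name: rng.uniform(lo, hi) for name, (lo, hi) in box.items()}

    # Exploration: sample the whole space and keep the cheapest point.
    best = min((sample(bounds) for _ in range(samples)), key=cost)
    box = dict(bounds)
    for _ in range(rounds):
        # Exploitation: shrink the box around the best point and resample.
        box = {
            name: (max(bounds[name][0], best[name] - shrink * (hi - lo) / 2),
                   min(bounds[name][1], best[name] + shrink * (hi - lo) / 2))
            for name, (lo, hi) in box.items()
        }
        candidate = min((sample(box) for _ in range(samples)), key=cost)
        if cost(candidate) < cost(best):
            best = candidate
    return best


def toy_cost(cfg):
    """Hypothetical cost estimate with its minimum at 4 reduce tasks, 128 MB."""
    return (cfg["reduce_tasks"] - 4) ** 2 + ((cfg["sort_buffer_mb"] - 128) / 64) ** 2


bounds = {"reduce_tasks": (1, 16), "sort_buffer_mb": (32, 512)}
best = recursive_random_search(toy_cost, bounds, random.Random(0))
print(best, toy_cost(best))
```

A real search would round integer parameters such as the number of reduce tasks before costing them; the sketch keeps everything continuous for brevity.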

127

slide-101
SLIDE 101

Within an Optimization Unit

  • Enumerate all valid combinations of packing transformations

M4 M3 R3 D1 D2 D3 D4

p1

M4 M3 R3 D1 D2 D3 D4

p2

M3 R3 M4 D1 D2 D4

p3

M3 R3 M4 D1 D2 D4

p4

  • Use Starfish’s What-If Engine [Herodotou VLDB 2011] for costing
  • Use Recursive Random Search [Ye SIGMETRICS 03] to find

configurations with best cost for each combination pi

Best Cost: 20 18 15 16

129

slide-102
SLIDE 102

Within an Optimization Unit

  • Enumerate all valid combinations of packing transformations

M4 M3 R3 D1 D2 D3 D4

p1

M4 M3 R3 D1 D2 D3 D4

p2

M3 R3 M4 D1 D2 D4

p3

M3 R3 M4 D1 D2 D4

p4

  • Use Starfish’s What-If Engine [Herodotou VLDB 2011] for costing
  • Use Recursive Random Search [Ye SIGMETRICS 03] to find

configurations with best cost for each combination pi

Best Cost: 20 18 15 16

  • Pick combination with lowest cost
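
The final selection is a minimum over the per-plan best costs. The snippet assumes the four costs on the slide are listed in the order p1 to p4; `pick_plan` is a hypothetical helper.

```python
def pick_plan(best_cost_per_plan: dict) -> str:
    """Return the plan whose best configuration has the lowest estimated cost."""
    return min(best_cost_per_plan, key=best_cost_per_plan.get)


best_cost = {"p1": 20, "p2": 18, "p3": 15, "p4": 16}
print(pick_plan(best_cost))  # p3
```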

130

slide-103
SLIDE 103

Next

Many Interfaces Information Spectrum Large Plan Space

131

Transformations

Annotations Interactions

[Chart preview: speedup (axis ticks 2 and 4) of Stubby vs. Baseline for workflows IR, SN, LA, WG, BA, BR, PJ, US]

MapReduce Workflow Optimization Challenges

slide-105
SLIDE 105

Implementation

133

slide-106
SLIDE 106

Implementation

  • Minimal code changes to Apache Pig
  • ~570 lines to generate annotations
  • ~65 lines to import/export workflows
  • ~800 lines for runtime support of optimized workflows (e.g., wrapper MapReduce classes to run multiple functions in map/reduce tasks)
  • Similar effort expected for Stubby to support other interfaces

134

slide-108
SLIDE 108

Experimental Evaluation

  • 51 Amazon EC2 m1.large nodes
  • Representative MapReduce workflows from several application domains (ranging from 2 to 7 jobs)
  • Baseline: all rule-based optimizations supported in Pig enabled, plus configurations manually tuned using rules of thumb

Abbr  Workflow                     Dataset Size
IR    Information Retrieval        264 GB
SN    Social Network Analysis      267 GB
LA    Log Analysis                 500 GB
WG    Web Graph Analysis           255 GB
BA    Business Analytics Query     550 GB
BR    Business Report Generation   530 GB
PJ    Post-processing Jobs         10 GB
US    User-defined Logical Splits  530 GB

136

slide-109
SLIDE 109

Performance Improvements

  • Different workflows present different transformation opportunities
  • 2X to 4.5X speedup over Baseline

[Chart: speedup (axis 1 to 5) per workflow (IR, SN, LA, WG, BA, BR, PJ, US); legend: Baseline, Stubby, Vertical, Horizontal]

137

slide-110
SLIDE 110

Comparison against State-of-the-Art

  • Starfish [Herodotou VLDB 2011]: cost-based selection of configuration parameters
  • YSmart [Lee ICDCS 2011]: rule-based approach to transformations
  • MRShare [Nykiel VLDB 2010]: cost-based horizontal packing transformation

[Chart: speedup (axis 1 to 5) per workflow (IR, SN, LA, WG, BA, BR, PJ, US); legend: Baseline, Stubby, Starfish, YSmart, MRShare]

138

slide-111
SLIDE 111

Optimization Efficiency

  • Average case: < 2 minutes optimization time, 3% overhead
  • Worst case: 5 minutes optimization time, 10.5% overhead

139

[Charts: optimization time in seconds (axis 50 to 250) and optimization overhead (axis 0% to 10%) per workflow (IR, SN, LA, WG, BA, BR, PJ, US)]

slide-112
SLIDE 112

Related Work

  • Optimizing data-parallel workflows
  • Rule-based: FlumeJava [PLDI 2010], YSmart [ICDCS 2011], Manimal [VLDB 2011], Jaql [VLDB 2011]
  • Cost-based: MRShare [VLDB 2010], Starfish [VLDB 2011]
  • Other transformations
  • Multi-way joins: Wu et al. [SOCC 2011]
  • Transformation-based optimizer for SCOPE system: Zhou et al. [ICDE 2010]
  • Fault-tolerance: FTOpt [SIGMOD 2011]
  • Computation of multiple aggregates over the same or similar sets of grouping attributes: Chatziantoniou et al. [VLDB 1996]
  • ETL workflows: Simitsis et al. [ICDE 2005]

140

slide-113
SLIDE 113

Conclusions

  • Extensible transformation-based optimizer
  • Annotations as medium for information
  • Identify non-interacting subspaces
  • Speedups of up to 4.5X over the baseline
  • http://www.cs.duke.edu/starfish

141