

SLIDE 1

RESTORE: REUSING RESULTS OF MAPREDUCE JOBS

Junjie Hu

SLIDE 2

Introduction

• Current practice deletes the intermediate results of MapReduce jobs
• These results are not useless
• ReStore: a system that reuses the outputs of MapReduce jobs and sub-jobs

SLIDE 3

Example

[Figure: two Pig dataflow plans reading Data1. Both begin with Load → Project; one then applies Group before its Store, the other Stores directly. The shared Load → Project prefix is a candidate for reuse.]

SLIDE 4

ReStore system architecture

[Figure: ReStore system architecture diagram.]

SLIDE 5

Plan Matcher and Rewriter

• Before a job J is matched, all jobs that J depends on must have been matched and rewritten to use the jobs stored in the repository
• A physical plan in the repository is considered matched if it is contained within the input MapReduce job

SLIDE 6

Example

Script 1:
    A = load 'page_review' as (user, timestamp, page_info);
    store A into 'out1';

Script 2:
    A = load 'page_review' as (user, timestamp, page_info);
    B = foreach A generate user, page_info;
    store B into 'out2';

Since script 1 already materializes A at 'out1', ReStore can rewrite script 2 to read 'out1' instead of recomputing A from 'page_review'.

SLIDE 7

Match Algorithm

• Uses DFS over the plan
• ReStore takes the first match (greedy)
• Rules to order the candidate physical plans:
  1) A is preferred to B if all the operators in B have equivalent operators in A (A subsumes B)
  2) Otherwise, order by the ratio between I/O size and execution time

A minimal sketch of this greedy matching appears below.
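As an illustration of the greedy, ordered matching (a minimal sketch, not ReStore's actual code): plans are represented here as dicts with an operator list, and subsumes/io_time_ratio are hypothetical helpers standing in for the paper's plan-equivalence and cost logic.

    # Hedged sketch of ReStore's greedy plan matching. The plan
    # representation, subsumes(), and io_time_ratio() are illustrative
    # stand-ins, not the paper's implementation.

    def subsumes(a, b):
        # A subsumes B if every operator in B has an equivalent in A.
        return all(op in a["operators"] for op in b["operators"])

    def io_time_ratio(plan):
        # Rule 2 tie-breaker: prefer plans that save more I/O per unit time.
        return plan["output_size"] / plan["exec_time"]

    def order_candidates(repository):
        # Rule 1: plans that subsume more other candidates come first;
        # Rule 2: ties are broken by the I/O-to-execution-time ratio.
        def key(p):
            rank = sum(subsumes(p, q) for q in repository if q is not p)
            return (rank, io_time_ratio(p))
        return sorted(repository, key=key, reverse=True)

    def match(job, repository):
        # Greedy: return the first stored plan contained in the job.
        for stored in order_candidates(repository):
            if subsumes(job, stored):
                return stored
        return None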

SLIDE 8

Two types of reuse

• Whole job
  pros: 1) easy to reuse  2) already stored
  cons: 1) not always reusable
• Sub-jobs (how to generate them?)
  pros: 1) more opportunities to be reused
  cons: discussed later

SLIDE 9

Discussion

• Why not always reuse whole jobs?
• What is the challenge in reusing sub-jobs?
• What are the disadvantages of reusing sub-jobs?

SLIDE 10

How to generate sub-jobs

• Inject 'store' after each operator
• Or use heuristics: inject 'store' only after 'good' candidates

[Figure: an operator chain OP1, OP2, … with a Store injected after each operator.]

SLIDE 11

Heuristics for choosing sub-jobs

• Conservative heuristic: store after operators that reduce the input size, e.g. project, filter
• Aggressive heuristic: store after operators that reduce the input size and after operators whose outputs are known to be expensive to compute, e.g. join, group, project, filter

A toy sketch of both heuristics follows.
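As a toy illustration of store injection under the two heuristics (the operator names and plan representation are assumptions, not ReStore's API):

    # Toy sketch of sub-job generation by store injection. Operator names
    # and the plan representation are assumptions for illustration.

    CONSERVATIVE = {"project", "filter"}            # reduce input size
    AGGRESSIVE = CONSERVATIVE | {"join", "group"}   # also expensive outputs

    def inject_stores(plan, heuristic):
        # Insert a 'store' after every operator the heuristic marks as a
        # good reuse candidate.
        new_plan = []
        for i, op in enumerate(plan):
            new_plan.append(op)
            if op in heuristic:
                new_plan.append(("store", "subjob_out_%d" % i))
        return new_plan

    plan = ["load", "filter", "join", "group"]
    print(inject_stores(plan, AGGRESSIVE))
    # stores are injected after filter, join, and group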

SLIDE 12

Which jobs should be kept in the ReStore repository?

• Property 1: the job/sub-job can reduce the execution time of a workflow that contains it
• Property 2: the job/sub-job can be reused in future workflows
• These properties are checked using statistics from the MapReduce system

SLIDE 13

Experiment

• Uses PigMix, a set of queries for testing Pig performance, e.g. L3 (join), L11 (distinct + union)
• Two instance sizes tested: 15GB and 150GB (more details in the paper)
• Speedup: original execution time / execution time with reuse
• Overhead: extra execution time caused by injecting store operators / original execution time

SLIDE 14

Overall: effect of reusing whole jobs

Speedup: 9.8. L3: group and aggregate; L11: union of two data sets.

SLIDE 15

Effect of reusing sub-job outputs (data size 150GB)

Speedup: 24.4; Overhead: 1.6

SLIDE 16

Execution time when reusing sub-jobs chosen by different heuristics

L7: nested split. Why is the aggressive heuristic much worse than no heuristic here?

SLIDE 17

Overall: Reusing whole jobs and sub-jobs

SLIDE 18

Performance on 15GB and 150GB

• Data size 150GB: Speedup 24.4, Overhead 1.6
• Data size 15GB: Speedup 3, Overhead 2.4

Win!

SLIDE 19

Effect of Data Reduction

• As the amount of data eliminated by the Filter or Project operator increases, the overhead decreases and the speedup increases.

SLIDE 20

Conclusion

• MapReduce jobs can be reused
• Intermediate results of MapReduce jobs can be useful
• There is a trade-off between the workload added by injecting extra store operators and the workload saved by reusing results
• The type of command

SLIDE 21

ONLY AGGRESSIVE ELEPHANTS ARE FAST ELEPHANTS

Xueman Mou

SLIDE 22

Background

• Hadoop + HDFS
  - Each different filter condition triggers a new MapReduce job
  - "Going shopping without a shopping list"
  - "Let's see what I am going to encounter on the way"

SLIDE 23

What is HAIL?

• Hadoop Aggressive Indexing Library
• HAIL:
  - Keeps the existing replicas in different sort orders, each with a different clustered index
  - Makes it faster to find a suitable index
  - Lowers the runtime of a workload

SLIDE 24

Why HAIL

• Each MapReduce job has to scan the whole disk
  - slow query time
• Trojan indexes
  - expensive index creation
  - unclear how to serve tasks that filter on attributes other than the indexed one
• HDFS keeps replicas that all have the same physical data layout

SLIDE 25

HAIL

• The client analyzes the input data for each HDFS block
• Converts each HDFS block to binary PAX
• Sorts the data in parallel in different sort orders
• The datanode creates a clustered index
• The MapReduce job exploits the indexes
• Failover: standard Hadoop scanning

SLIDE 26

What is PAX?

• Partition Attributes Across
• A data organization model
• Significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects the layout inside pages, it incurs no storage penalty and does not affect I/O behavior.

http://www.pdl.cmu.edu/ftp/Database/pax.pdf

A small sketch of the layout idea follows.
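To make the layout idea concrete, here is a minimal Python sketch that regroups row-oriented records into per-attribute minipages within one page; this illustrates the PAX idea only, not HAIL's actual binary format.

    # Sketch of the PAX idea: within a page, values are regrouped by
    # attribute ("minipages"). Illustrative only, not HAIL's binary format.

    def to_pax(rows, schema):
        # rows: list of tuples; schema: attribute names, in tuple order.
        # Returns one PAX "page": attribute -> contiguous list of values.
        return {attr: [row[i] for row in rows]
                for i, attr in enumerate(schema)}

    rows = [("1999-03-04", 7.5, "172.101.11.46"),
            ("2000-11-30", 2.0, "10.0.0.1")]
    page = to_pax(rows, ["visitDate", "adRevenue", "sourceIP"])
    print(page["visitDate"])   # a scan of visitDate touches one list only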

SLIDE 27

Use case

• Bob: a representative analyst
• A large web log has three fields, each of which may serve as a filter condition:
  - visitDate
  - adRevenue
  - sourceIP

SLIDE 28

Upload Process

Reuse as much of the existing HDFS pipeline as possible (step numbers as in the paper's figure):
1. Parse the input into rows based on end-of-line
2. Parse each row according to the specified schema
3. HDFS gets the list of datanodes for the block
4. The PAX data is cut into packets
6. Each datanode assembles the block in main memory
7. Sorts the data, creates the indexes, and forms the HAIL block
8. DN1 and DN2 immediately forward each packet
9. DN3 verifies the checksums
10. DN3 acknowledges the packet back to DN2

(PCK: data packet; ACK: acknowledgement)

SLIDE 29

HDFS Namenode Extension

• The namenode keeps track of the different sort orders
• HAIL needs to schedule map tasks close to replicas that have suitable indexes
• The central namenode keeps a Dir_Block mapping:
    blockID → set of datanodes
  and a Dir_rep mapping:
    (blockID, datanode) → HAILBlockReplicaInfo

A toy sketch of these two mappings appears below.
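As a toy sketch of the two catalog structures (plain Python dicts with made-up names and values, not HDFS's actual code):

    # Toy sketch of the namenode's two extra mappings.

    dir_block = {          # blockID -> set of datanodes holding a replica
        "blk_001": {"dn1", "dn2", "dn3"},
    }
    dir_rep = {            # (blockID, datanode) -> replica metadata
        ("blk_001", "dn1"): {"sort_order": "visitDate"},
        ("blk_001", "dn2"): {"sort_order": "adRevenue"},
        ("blk_001", "dn3"): {"sort_order": "sourceIP"},
    }

    def nodes_with_index(block_id, attribute):
        # Datanodes whose replica of block_id is sorted/indexed on
        # `attribute`, so a map task can be scheduled near a useful index.
        return {dn for dn in dir_block[block_id]
                if dir_rep[(block_id, dn)]["sort_order"] == attribute}

    print(nodes_with_index("blk_001", "visitDate"))   # {'dn1'}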

SLIDE 30

Indexing Pipeline

[Figure 2: HAIL data column index]

• Why clustered indexing?
  - Cheap to create in main memory
  - Cheap to write to disk
  - Cheap to query from disk
• Divides the data of attribute sourceIP into partitions of 1024 values each
• Child pointers give the start offsets of the leaves
• Only the first child pointer is explicit:
  - all leaves are contiguous on disk
  - a leaf can be reached by simply multiplying the leaf size by the leaf ID
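The implicit-pointer arithmetic is simple enough to sketch; the value size below is an assumption for illustration.

    # Sketch of the implicit child pointers: leaves are contiguous on
    # disk, so only the first offset is stored and the rest are computed.

    LEAF_VALUES = 1024                  # values per partition (per slide)
    VALUE_SIZE = 4                      # assumed bytes per value
    LEAF_SIZE = LEAF_VALUES * VALUE_SIZE

    def leaf_offset(first_leaf_offset, leaf_id):
        # Disk offset of leaf `leaf_id`, without any per-leaf pointer.
        return first_leaf_offset + leaf_id * LEAF_SIZE

    print(leaf_offset(8192, 3))         # 8192 + 3 * 4096 = 20480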

SLIDE 31

Query

    SELECT sourceIP
    FROM UserVisits
    WHERE visitDate BETWEEN '1999-01-01' AND '2000-01-01';

SLIDE 32

Query Pipeline

1. Bob annotates his map function to specify the selection predicate and the projected attributes required by his MapReduce job.
2. The JobClient logically breaks the input into smaller pieces called input splits; an input split defines the input data of a map task.
3. For each map task, the JobTracker decides on which computing node to schedule it, using the split locations.
4. The map task uses a RecordReader UDF to read its input block from the closest datanode.

SLIDE 33

Query Pipeline – System Perspective

• It is crucial to be non-intrusive to the standard Hadoop execution pipeline, so that users run MapReduce jobs exactly as before
• HailInputFormat
  - uses a more elaborate splitting policy, called HailSplitting
• HailRecordReader
  - responsible for retrieving the records that satisfy the selection predicate of the MapReduce job

SLIDE 34

Experiment

• Six different clusters
  - One physical cluster with 10 nodes
  - Three EC2 clusters using different data types, each with 10 nodes
  - Two EC2 clusters: one with 50 nodes, the other with 100 nodes
• Two datasets:
  - UserVisits table: 20GB of data per node
  - Synthetic dataset: 13GB of data per node, consisting of 19 integer attributes, to study the effects of selectivity

SLIDE 35

Queries

• Bob-Q1 (selectivity: 3.1 × 10⁻²)
    SELECT sourceIP FROM UserVisits
    WHERE visitDate BETWEEN '1999-01-01' AND '2000-01-01';
• Bob-Q2 (selectivity: 3.2 × 10⁻⁸)
    SELECT searchWord, duration, adRevenue FROM UserVisits
    WHERE sourceIP='172.101.11.46';
• Bob-Q3 (selectivity: 6 × 10⁻⁹)
    SELECT searchWord, duration, adRevenue FROM UserVisits
    WHERE sourceIP='172.101.11.46' AND visitDate='1992-12-22';
• Bob-Q4 (selectivity: 1.7 × 10⁻²)
    SELECT searchWord, duration, adRevenue FROM UserVisits
    WHERE adRevenue>=1 AND adRevenue<=10;
• Additionally, a variation of Bob-Q4 is used to see how well HAIL performs on queries with low selectivity:
  Bob-Q5 (selectivity: 2.04 × 10⁻¹)
    SELECT searchWord, duration, adRevenue FROM UserVisits
    WHERE adRevenue>=1 AND adRevenue<=100;

SLIDE 36

Experiment Result (1)

• HAIL has a negligible upload overhead of ∼2% over standard Hadoop
• When HAIL creates one index per replica, the overhead still remains very low (at most ∼14%)
• HAIL outperforms Hadoop by a factor of 1.6 even when creating three indexes
• A baseline in the plot marks the time Hadoop takes to upload with the default replication factor of three; HAIL significantly outperforms Hadoop for any replication factor

SLIDE 37

Experiment Result (2)

[Figure: UV = upload times for UserVisits when scaling up; Syn = upload times for Synthetic when scaling up, in seconds]

• HAIL achieves roughly the same upload times for the Synthetic dataset
• HAIL improves its upload times on larger clusters for the UserVisits dataset
• More interestingly, HAIL does not suffer from high performance variability

SLIDE 38

Experiment Result (3)

SLIDE 39

Fault tolerance

• HAIL preserves the failover property of Hadoop, showing almost the same slowdown
• When HAIL creates the same index on all replicas (HAIL-1Idx), it has a lower slowdown, since failed map tasks can still perform an index scan after failure

HAIL: creates indexes on three different attributes, one per replica. HAIL-1Idx: creates an index on the same attribute for all three replicas.

SLIDE 40

Wrap Up

• HAIL can:
  - Improve both upload and query times
  - Keep the failover properties of Hadoop
  - Work with existing MapReduce jobs, requiring only minimal changes to them

SLIDE 41

Questions to Ponder

• Why does HAIL perform differently on different data types?
• Would it be possible to use HAIL for data that is already stored in the cluster?
• Is HAIL open source?
• Can HAIL integrate with other systems, such as Pig or Hive? (API)
• Why do/don't the authors build on Hadoop++?
• How does HAIL behave in other use cases?
• How useful would it be for the needs of different users/queries?
• …

SLIDE 42

Backup Slides

• HAIL vs. Hadoop++
  - HAIL creates Trojan indexes per physical replica instead of per logical HDFS replica
  - HAIL's index creation is less expensive
• Twitter full-text indexing
  - Only suitable for highly selective queries
• CoHadoop
  - Does not improve on the indexing features of Hadoop++

SLIDE 43

BUILDING WAVELET HISTOGRAMS ON LARGE DATA IN MAPREDUCE

Junjie Hu

SLIDE 44

Introduction

• Histograms are important for summarizing data
• The wavelet histogram is one of the most widely used
• A straightforward adaptation of wavelet histogram construction to MapReduce is inefficient
• New algorithms are required

SLIDE 45

How to build wavelet histograms

• Suppose the dataset has keys drawn from the domain [u] = {1, 2, …, u}
• Define the frequency vector v = (v(1), v(2), …, v(u)), where v(x) is the number of occurrences of key x in the dataset
• Take the wavelet basis vectors ψ(i), each of the same length as v; the basis does not depend on v
• The coefficients are w(i) = <v, ψ(i)> (dot products), for i = 1, …, u
• Compute the best k-term wavelet representation using a centralized algorithm

A small sketch of the centralized step follows.
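As a toy sketch of the centralized step (the standard orthonormal Haar transform, assuming u is a power of two; not the paper's optimized algorithm):

    # Toy sketch: compute all Haar wavelet coefficients of a frequency
    # vector and keep the k largest in magnitude. Assumes u is a power
    # of two.

    import math

    def haar_coefficients(v):
        coeffs = []
        s = list(v)
        while len(s) > 1:
            averages, details = [], []
            for a, b in zip(s[0::2], s[1::2]):
                averages.append((a + b) / math.sqrt(2))
                details.append((a - b) / math.sqrt(2))
            coeffs.extend(details)     # detail coefficients of this level
            s = averages
        coeffs.append(s[0])            # overall average term
        return coeffs

    def best_k_terms(v, k):
        # Indices and values of the k coefficients largest in magnitude.
        w = haar_coefficients(v)
        return sorted(enumerate(w), key=lambda t: abs(t[1]), reverse=True)[:k]

    v = [5, 3, 0, 0, 8, 8, 2, 0]       # toy frequency vector, u = 8
    print(best_k_terms(v, 3))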

SLIDE 46

Baseline solution (Send-V)

• m mappers (nodes) and a single reducer (coordinator)
• Mapper: each mapper emits (x, v_local(x)) for every key x in its split, where v_local(x) is the local frequency of x
• Reducer: aggregates the local frequencies into (x, v(x)), computes w(i) = <v, ψ(i)>, then selects the best k terms

A minimal sketch of Send-V appears below.
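A minimal in-memory sketch of Send-V; the shuffle is simulated with a list, best_k_terms is reused from the Slide 45 sketch, and none of this is tied to a real Hadoop API.

    # Minimal sketch of Send-V. Reuses best_k_terms from the Slide 45
    # sketch; the MapReduce shuffle is simulated in memory.

    from collections import Counter

    def send_v_mapper(split):
        # Emit (key, local frequency) for every key in this split.
        return Counter(split).items()

    def send_v_reducer(pairs, u, k):
        # Aggregate global frequencies v, then run the centralized step.
        v = [0] * u
        for x, count in pairs:
            v[x - 1] += count          # keys come from {1, ..., u}
        return best_k_terms(v, k)

    splits = [[1, 1, 2, 5], [5, 5, 8, 8], [1, 3, 3, 8]]   # m = 3 nodes
    pairs = [p for s in splits for p in send_v_mapper(s)]
    print(send_v_reducer(pairs, u=8, k=3))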

SLIDE 47

Alternative baseline (Send-Coefficient)

• Because the dot product distributes over the sum of the local frequency vectors, w(i) = <v, ψ(i)> = Σ_j <v_j, ψ(i)>, so each mapper can compute its local coefficients and the reducer simply aggregates them

SLIDE 48

Discussion

• Drawbacks?
• Any improvement? (Note that coefficients can be negative.)
• How many intermediate files? Suppose the number of splits is m and the domain size is u.

SLIDE 49

Hadoop Wavelet Top-k (H-WTopk)

• For an item (key) x, r(x) denotes its aggregated score (coefficient) and r_j(x) its score at node j
• τ(x) is a lower bound on the magnitude of x's score: τ(x) ≤ |r(x)|
• A global threshold τ: the k-th largest τ(x)
• If an item's local score stays below τ/m at every node, it can be discarded

SLIDE 50

Three rounds of H-WTopk: Round 1

• Each node emits its k highest and k lowest scores. If the coordinator receives x's score from a node, it adds that r_j(x) to both the upper bound τ+(x) and the lower bound τ−(x). Otherwise, it adds that node's k-th highest sent score to τ+(x), and its k-th lowest to τ−(x).
• Set τ(x) = 0 if τ+(x) and τ−(x) have different signs; otherwise τ(x) = min(|τ+(x)|, |τ−(x)|)
• Pick the k-th largest τ(x), denoted T1; it is a threshold for the magnitude of the top-k items

A hedged sketch of the Round 1 bookkeeping follows.
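A hedged sketch of the coordinator's Round 1 bookkeeping; the report format and names are assumptions for illustration, not the paper's code.

    # Sketch of Round 1 on the coordinator. Each node's report holds the
    # scores it sent (its k highest and k lowest items) plus its k-th
    # highest/lowest sent scores.

    def round1_bounds(reports, items):
        tau = {}
        for x in items:
            hi = lo = 0.0
            for rep in reports.values():
                if x in rep["scores"]:         # exact local score known
                    hi += rep["scores"][x]
                    lo += rep["scores"][x]
                else:                          # unseen: bound by k-th sent
                    hi += rep["kth_high"]
                    lo += rep["kth_low"]
            # Bounds straddling zero mean the magnitude may be 0.
            tau[x] = 0.0 if hi * lo < 0 else min(abs(hi), abs(lo))
        return tau

    def threshold_T1(tau, k):
        # T1 is the k-th largest lower bound tau(x).
        return sorted(tau.values(), reverse=True)[k - 1]

    reports = {
        "n1": {"scores": {"a": 9.0, "b": -4.0}, "kth_high": 9.0, "kth_low": -4.0},
        "n2": {"scores": {"a": 5.0, "c": -7.0}, "kth_high": 5.0, "kth_low": -7.0},
    }
    print(threshold_T1(round1_bounds(reports, {"a", "b", "c"}), k=1))  # 14.0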

SLIDE 51

Three rounds of H-WTopk: Round 2

• Each node j emits item x if |r_j(x)| > T1/m
• Define R as the set of items the coordinator received. Refine τ+(x) and τ−(x) for every x in R; if x's score has not been received from some node, use T1/m and −T1/m to update the upper and lower bounds for that node.
• Calculate a better threshold T2. For each x in R, compute τ(x) = max(|τ+(x)|, |τ−(x)|) (now an upper bound on the magnitude), and delete x from R if τ(x) < T2.

SLIDE 52

Three rounds of H-WTopk: Round 3

• Ask each node for the scores of all items in R
• Compute the aggregated scores exactly for those items, and pick the k items of largest magnitude

SLIDE 53

H-WTopk

• Uses lower/upper bounds to estimate each item's score, and a threshold to prune items
• Communication cost is better than the baseline solutions
• Needs 3 rounds
• Drawbacks?

SLIDE 54

Drawbacks of H-WTopk

• Three rounds of MapReduce jobs incur a lot of overhead
• On node j, every split needs to be fully scanned to compute the local frequency vector v_j and the local wavelet coefficients w_{i,j}, i = 1, …, u
• Any solution?

SLIDE 55

Sampling

• Let n be the number of records in the dataset. To approximate each frequency v(x) with a standard deviation of εn, a sample of size Θ(1/ε²) is required, i.e. a sampling probability p = 1/(ε²n).
• If n is very large, we need a very small ε to keep the accuracy. For ε = 10⁻⁶, even with one-byte keys, we still need to emit 1TB of data.
• The communication cost is O(1/ε²).

SLIDE 56

Two-level sampling (TwoLevel-S)

• For each split j, extract a sample from the input and compute (x, s_j(x)) from the sample, where s_j(x) is the count of x
• Perform a second-level sample: for any item x with s_j(x) ≥ 1/(ε√m), emit the pair (x, s_j(x)); otherwise, with probability proportional to s_j(x), namely ε√m × s_j(x), emit the pair (x, NULL)

A hedged sketch of the second-level step follows.
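A hedged sketch of the second-level sampling step on one split; eps (the error parameter) and m (the number of splits) are given, the first-level counts come from a Counter, and this is an illustration rather than the paper's code.

    # Sketch of second-level sampling on one split j.

    import math, random
    from collections import Counter

    def second_level_sample(sample, eps, m):
        # Emit (x, s_j(x)) for heavy items; for light items, emit
        # (x, None) with probability eps * sqrt(m) * s_j(x).
        s_j = Counter(sample)
        threshold = 1.0 / (eps * math.sqrt(m))
        out = []
        for x, count in s_j.items():
            if count >= threshold:
                out.append((x, count))         # keep the exact count
            elif random.random() < eps * math.sqrt(m) * count:
                out.append((x, None))          # sampled; count elided
        return out

    random.seed(0)
    print(second_level_sample([1, 1, 1, 2, 3, 3], eps=0.2, m=4))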

SLIDE 57

TwoLevel-S

• Communication cost is reduced to O(√m/ε)
• It provides an unbiased estimate of each coefficient w(i) (see the proof in the paper)
• Both H-WTopk and TwoLevel-S work well in practice (see the experimental results in the paper)

SLIDE 58

Wrap-up

• Two approaches for computing the coefficients of wavelet histograms: one exact and one approximate
• When designing algorithms as MapReduce jobs, one should consider the communication cost (i.e., the number of intermediate results); it is one of the factors that determines the efficiency of an algorithm. (Anyone who took CS425 last semester and worked on MP4 should have a good feel for this.)

SLIDE 59

Thanks for watching.