RESTORE: REUSING RESULTS OF MAPREDUCE JOBS
Junjie Hu
Introduction
- Current practice deletes the intermediate results of MapReduce jobs
- These results are not useless
- ReStore: a system that reuses the outputs of MapReduce jobs / sub-jobs
[Figure: two example workflow plans over the same input Data1, built from Load, Project, Group, and Store operators]
- Before a job J is matched, all other jobs that produce J's inputs have already been matched and rewritten
- A physical plan in the repository is considered a match if it computes the same operators over the same input data as the job (or a sub-plan of it)
-- Script 1
A = load 'input';            -- input path not shown on the slide
Store A into 'out1';

-- Script 2
A = load 'input';
B = foreach A generate ...;  -- projection not shown on the slide
Store B into 'out2';
- Use DFS to traverse the plan when matching
- ReStore uses the first match (greedy)
- Rules order the candidate physical plans (a minimal matching sketch follows)
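A minimal sketch of this greedy matching, assuming plans are flattened into operator lists; the repository layout and the "longer matches first" ordering are assumptions for illustration, not ReStore's actual data structures:

def first_match(job_input, job_ops, repository):
    """Greedily reuse the first repository plan that is a prefix of the job's plan.

    repository: list of (input, ops, stored_output) tuples, pre-sorted by the
    ordering rules (here assumed to prefer longer matches first).
    Returns (stored_output_to_reuse, remaining_ops_to_execute).
    """
    for rep_input, rep_ops, stored_output in repository:
        if rep_input == job_input and job_ops[:len(rep_ops)] == list(rep_ops):
            return stored_output, job_ops[len(rep_ops):]  # first (greedy) match wins
    return None, job_ops  # no reuse possible: run the whole job

# Example: the repository holds the Load->Project prefix of an earlier job
repo = [("Data1", ["Load", "Project"], "out1")]
print(first_match("Data1", ["Load", "Project", "Group", "Store"], repo))
# -> ('out1', ['Group', 'Store'])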
- Jobs
- Sub-jobs (how to generate them?)
- Why not always reuse whole jobs?
- What is the challenge in reusing sub-jobs?
- What are the disadvantages of reusing sub-jobs?
- Inject 'store' after each physical operator
- Or use heuristics to inject 'store' commands only where they are likely to pay off
[Figure: a chain of operators OP1, OP2, ... with a Store injected after each operator's output]
- Conservative heuristic
- Aggressive heuristic
- Property 1: the stored output can reduce the execution time of a future workflow
- Property 2: the stored output can be reused in future workflows
- Check these properties based on statistics of previously executed jobs (a toy check is sketched below)
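A toy version of such a statistics-based check; the field names and the exact conditions are assumptions for illustration:

def worth_storing(op_stats):
    """Decide whether to keep an operator's output, given collected statistics.

    op_stats: assumed per-operator statistics from previously executed jobs.
    """
    # Property 1: reusing the output should save more time than writing it costs
    saves_time = op_stats["est_time_saved"] > op_stats["est_store_cost"]
    # Property 2: outputs that shrink the data (e.g., after a filter or
    # group-by) are cheaper to keep and more likely to be reused later
    likely_reused = op_stats["output_bytes"] < op_stats["input_bytes"]
    return saves_time and likely_reused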
- Use PigMix: a set of queries used to test Pig performance
- Two instances to test: 15GB and 150GB (more details below)
- Speedup: improvement in execution time relative to the original execution time
- Overhead: extra execution time incurred by injecting store commands (written out as formulas below)
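As formulas (a reconstruction of the slide's wording, not necessarily the paper's exact definitions):

\[
\text{speedup} = \frac{T_{\text{original}}}{T_{\text{with reuse}}},
\qquad
\text{overhead} = \frac{T_{\text{with injected stores}} - T_{\text{original}}}{T_{\text{original}}}
\]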
Speedup: 9.8 (L3: group and aggregate; L11: union of two data sets)
Speedup: 24.4; Overhead: 1.6
L7: nested split. Why is the aggressive heuristic so much worse than using no heuristic?
- Data size: 150GB
- Data size: 15GB
Win!
- As the amount of data grows, the benefit of reusing stored results increases
- Outputs of MapReduce jobs can be reused
- Intermediate results of MapReduce jobs can be reused as well
- There is a trade-off between the extra work introduced by injecting store commands and the benefit of reuse
- The type of command affects how reusable a result is
- Hadoop + HDFS
  - Each different filter condition triggers a new scan of the full data
  - "Going shopping without a shopping list"
  - "Let's see what I am going to encounter on the way"
- HAIL: Hadoop Aggressive Indexing Library
  - Keeps existing replicas in different sort orders, each with its own clustered index
  - Makes it faster to find a suitable index for a query, avoiding longer runtimes across a workload
- Each MapReduce job has to scan the whole data set on disk
  - Slow query times
- Trojan indexes
  - Expensive index creation
  - How to choose general attributes useful for other tasks?
- HDFS keeps replicas that all have the same physical layout
- Client analyzes the input data for each HDFS block
- Converts each HDFS block to binary PAX
- Sorts the data in parallel, in different sort orders
- Datanodes create clustered indexes
- MapReduce jobs exploit the indexes
- Failover: standard Hadoop scanning
- Partition Attributes Across (PAX)
- A data organization model
- Significantly improves cache performance by grouping the values of each attribute into minipages within a page (toy illustration below)
http://www.pdl.cmu.edu/ftp/Database/pax.pdf
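A toy illustration of the layout idea in Python (not HAIL's actual on-disk format):

# Row layout (NSM): attribute values of a record are stored together,
# so scanning one attribute drags all the others through the cache.
rows = [(1, "a", 0.5), (2, "b", 0.7), (3, "c", 0.9)]

# PAX: within one page, values of the SAME attribute are grouped into
# minipages, so a scan of one attribute touches contiguous memory.
pax_page = {
    "id":    [1, 2, 3],
    "name":  ["a", "b", "c"],
    "score": [0.5, 0.7, 0.9],
}
ids = pax_page["id"]  # cache-friendly single-attribute scan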
- Bob: a representative analyst
- A large web log has three fields, each of which may serve as a filter attribute:
  - visitDate
  - adRevenue
  - sourceIP
Reuse as much of the existing HDFS pipeline as possible:
1. Parse the input into rows based on end of line
2. Parse each row by the specified schema
3. HDFS gets the list of datanodes for the block
4. The PAX data is cut into packets (PCK: data packet, ACK: acknowledgement)
6. Each datanode assembles the block in main memory
7. Sorts the data, creates indexes, and forms the HAIL block
8. DN1 and DN2 immediately forward each packet
9. DN3 verifies the checksums
10. DN3 acknowledges each packet back to DN2
- HAIL keeps track of the different sort orders
- HAIL needs to schedule map tasks close to replicas whose sort order suits the query
- The central namenode keeps the directory-to-block (Dir_Block) mapping
Figure 2: HAIL data column index
- Cheap to create in main memory
- Cheap to write to disk
- Cheap to query from disk
Each leaf consists of 1024 values:
- All leaves are contiguous on disk
- A leaf can be reached by simply multiplying the leaf size by the leaf ID (see the arithmetic sketched below)
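That offset arithmetic, spelled out; the 1024 values per leaf comes from the slide, while the 4-byte value size is an assumption:

VALUES_PER_LEAF = 1024           # from the slide
VALUE_SIZE_BYTES = 4             # assumed: fixed-size 4-byte values
LEAF_SIZE = VALUES_PER_LEAF * VALUE_SIZE_BYTES   # 4096 bytes per leaf

def leaf_offset(leaf_id: int) -> int:
    """Leaves are contiguous on disk, so reaching a leaf is one multiply."""
    return leaf_id * LEAF_SIZE

print(leaf_offset(3))  # 12288: the fourth leaf starts 3 * 4096 bytes in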
Bob's example query:
SELECT sourceIP FROM UserVisits
WHERE visitDate BETWEEN '1999-01-01' AND '2000-01-01';
Bob annotates his map function to specify the selection predicate and the projected attributes required by his MapReduce job. The JobClient logically breaks the input into smaller pieces called input splits; an input split defines the input data of a map task. For each map task, the JobTracker uses the split locations to decide which computing node to schedule it on. The map task then uses a RecordReader UDF to read its input data block from the closest datanode.
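Conceptually, the annotation lets the record reader filter and project at the storage layer before the map function ever runs. A schematic sketch of that step, with hypothetical names rather than HAIL's actual Java API:

def read_block(records, predicate, projection):
    """Yield only the records (and attributes) the annotated map function needs."""
    for record in records:
        if predicate(record):
            yield {attr: record[attr] for attr in projection}

# Bob's annotation for his query, expressed as plain callables (hypothetical)
predicate = lambda r: "1999-01-01" <= r["visitDate"] <= "2000-01-01"
projection = ["sourceIP"]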
- It is crucial to be non-intrusive to standard Hadoop
- HailInputFormat
  - Uses a more elaborate splitting policy, called HailSplitting
- HailRecordReader
  - Responsible for retrieving the records that satisfy the query predicate
- Six different clusters
  - One physical cluster with 10 nodes
  - Three EC2 clusters, each with 10 nodes, using different data types
  - Two EC2 clusters: one with 50 nodes, the other with 100 nodes
- Two datasets
  - UserVisits table: 20GB of data per node
  - Synthetic dataset: 13GB of data per node, consisting of 19 integer attributes, to understand the effects of data types
Bob-Q1 (selectivity: 3.1 × 10⁻²):
SELECT sourceIP FROM UserVisits
WHERE visitDate BETWEEN '1999-01-01' AND '2000-01-01';

Bob-Q2 (selectivity: 3.2 × 10⁻⁸):
SELECT searchWord, duration, adRevenue FROM UserVisits
WHERE sourceIP='172.101.11.46';

Bob-Q3 (selectivity: 6 × 10⁻⁹):
SELECT searchWord, duration, adRevenue FROM UserVisits
WHERE sourceIP='172.101.11.46' AND visitDate='1992-12-22';

Bob-Q4 (selectivity: 1.7 × 10⁻²):
SELECT searchWord, duration, adRevenue FROM UserVisits
WHERE adRevenue>=1 AND adRevenue<=10;

Additionally, a variation of Bob-Q4 tests how well HAIL performs on queries with low selectivity:

Bob-Q5 (selectivity: 2.04 × 10⁻¹):
SELECT searchWord, duration, adRevenue FROM UserVisits
WHERE adRevenue>=1 AND adRevenue<=100;
HAIL has a negligible upload overhead compared to standard Hadoop. When HAIL creates one index per replica, the overhead still remains very low (at most ~14%), and HAIL outperforms Hadoop by a factor of 1.6 even when creating three indexes. [Chart: a reference line marks the time Hadoop takes to upload with the default replication factor of three; HAIL significantly outperforms Hadoop for any replication factor.]
[Charts: "UV" shows upload times for the UserVisits table when scaling up; "Syn" shows upload times for the Synthetic dataset when scaling up, in seconds]
EC2 exhibits high performance variability.
- HAIL: creates indexes on three different attributes, one for each replica
- HAIL-1Idx: creates an index on the same attribute for all three replicas
Under node failure, Hadoop and HAIL suffer the same slowdown; HAIL-1Idx has a lower slowdown, since failed map tasks can still perform an index scan on another replica even after the failure.
- HAIL can:
  - Improve both upload and query times
  - Keep the failover properties of Hadoop
  - Work with existing MapReduce jobs, incurring only minimal changes
- Why does HAIL have different performance on different data types?
- Would it be possible to use HAIL for data that is already stored in the cluster?
- Is HAIL open source?
- Can HAIL integrate with other systems, such as Pig or Hive (via an API)?
- Why do/don't the authors use Hadoop++?
- How does HAIL behave in other use cases?
- How useful will it be for the needs of different users and queries?
- ...
- HAIL vs. Hadoop++
  - HAIL creates Trojan indexes per physical replica rather than per logical block
  - HAIL makes index creation much less expensive
- Twitter full-text indexing
  - Only suitable for highly selective queries
- CoHadoop
  - Does not improve on the indexing features of Hadoop++
- Histograms are important for summarizing data
- The wavelet histogram is one of the most widely used
- A straightforward adaptation of existing methods for building wavelet histograms in MapReduce is inefficient
- New algorithms are required
- Suppose each record in the dataset has a key drawn from the domain u = {1, 2, ..., u}
- Define the frequency vector v = (v(1), v(2), ..., v(u)), where v(x) is the frequency of key x
- Compute the wavelet basis vectors ψ(i), each of the same length as v; ψ is the Haar wavelet basis
- The coefficients are w(i) = ⟨v, ψ(i)⟩ (dot products), for i = 1, ..., u
- Compute the best k-term wavelet representation using the k coefficients of largest magnitude (a runnable sketch follows)
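A compact sketch of these definitions for the Haar basis, assuming u is a power of two (numpy is used here for the dot products):

import numpy as np

def haar_basis(u):
    """Return the u normalized Haar basis vectors, each of length u."""
    basis = [np.ones(u) / np.sqrt(u)]          # overall-average vector
    n_vectors = 1                              # vectors at the current level
    while n_vectors < u:
        width = u // n_vectors                 # support width at this level
        for s in range(n_vectors):
            psi = np.zeros(u)
            psi[s * width : s * width + width // 2] = 1.0
            psi[s * width + width // 2 : (s + 1) * width] = -1.0
            basis.append(psi / np.linalg.norm(psi))
        n_vectors *= 2
    return basis

def best_k_terms(v, k):
    """w(i) = <v, psi(i)>; keep the k coefficients of largest magnitude."""
    w = np.array([np.dot(v, psi) for psi in haar_basis(len(v))])
    top = np.argsort(-np.abs(w))[:k]
    return {int(i) + 1: float(w[i]) for i in top}  # 1-indexed like the slide

v = np.array([9, 7, 3, 5, 45, 1, 2, 0])  # a frequency vector over u = 8
print(best_k_terms(v, 3))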
- m mappers (one per node) and a single reducer (the coordinator)
- Mapper: each mapper emits (x, v_local(x)) for every key x in its split
- Reducer: aggregates the pairs into (x, v(x)) and computes the coefficients
- Due to the distributive property of the dot product, w(i) = ⟨v, ψ(i)⟩ = Σ_j ⟨v_j, ψ(i)⟩: coefficients computed over local frequency vectors can simply be summed
- Drawbacks?
- Any improvements? (note that coefficients can be computed per split and then combined; see the sketch below)
- How many intermediate files are there? Suppose the splits ...
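The improvement the distributive property enables, reusing haar_basis from the earlier snippet: mappers send local coefficients instead of raw frequencies, and the reducer just sums them (a sketch; the file-count bookkeeping is omitted):

import numpy as np
from collections import defaultdict

def map_local_coefficients(v_local, basis):
    # Each mapper computes the coefficients of its LOCAL frequency vector
    # and emits only the nonzero ones as (i, w_j(i)) pairs.
    for i, psi in enumerate(basis):
        w_ji = float(np.dot(v_local, psi))
        if w_ji != 0.0:
            yield i + 1, w_ji

def reduce_sum(emitted_pairs):
    # Since <v, psi(i)> = sum_j <v_j, psi(i)>, summing the local
    # coefficients yields the exact global coefficients.
    w = defaultdict(float)
    for i, w_ji in emitted_pairs:
        w[i] += w_ji
    return dict(w)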
- For an item (key) x, r(x) denotes its aggregated score over all nodes
- τ(x) denotes a lower bound on the magnitude of item x's score
- The global threshold τ is the k-th largest τ(x)
- If an item's local score is always below τ/m, then its aggregated score stays below τ, so it cannot be in the top-k
- Each node emits its k highest and k lowest scored items. If the coordinator has not seen x's score from some node, it bounds that score by the node's k-th highest and k-th lowest emitted scores, yielding an upper bound τ+(x) and a lower bound τ−(x) on r(x)
- Set τ(x) = 0 if τ+(x) and τ−(x) have different signs; otherwise τ(x) = min(|τ+(x)|, |τ−(x)|)
- Pick the k-th largest τ(x), denoted T1; it is a threshold that candidate top-k items must be able to exceed
- Each node j emits item x if |r_j(x)| > T1/m
- Define R as the set of items the coordinator has received; refine the bounds τ+(x) and τ−(x) for every x in R
- Calculate a better threshold T2; any x in R whose upper-bound magnitude falls below T2 can be discarded
- Ask each node for the scores of all items remaining in R
- Compute the aggregated scores exactly for those items and return the top-k (the three rounds are simulated below)
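A coordinator-side simulation of the three rounds, as a sketch: it assumes every node holds at least k items and simplifies the paper's tie and sign handling:

import heapq

def h_wtopk(nodes, k):
    """nodes: list of {item: signed local score}. Returns the exact top-k by |score|."""
    m = len(nodes)

    # Round 1: every node ships its k highest and k lowest scored items.
    sent, lo_cap, hi_cap = [], [], []
    for scores in nodes:
        ranked = sorted(scores.items(), key=lambda kv: kv[1])
        sent.append(dict(ranked[:k] + ranked[-k:]))
        lo_cap.append(ranked[k - 1][1])    # k-th lowest sent: lower bound for unseen scores
        hi_cap.append(ranked[-k][1])       # k-th highest sent: upper bound for unseen scores

    def bounds(x):  # tau-(x), tau+(x): bounds on the aggregated score r(x)
        lo = sum(s.get(x, lc) for s, lc in zip(sent, lo_cap))
        hi = sum(s.get(x, hc) for s, hc in zip(sent, hi_cap))
        return lo, hi

    def tau(lo, hi):  # lower bound on |r(x)|; 0 when the bounds straddle zero
        return 0.0 if lo < 0.0 < hi else min(abs(lo), abs(hi))

    candidates = {x for s in sent for x in s}
    T1 = heapq.nlargest(k, (tau(*bounds(x)) for x in candidates))[-1]

    # Round 2: node j additionally ships every x with |r_j(x)| > T1 / m.
    for j, scores in enumerate(nodes):
        sent[j].update({x: v for x, v in scores.items() if abs(v) > T1 / m})
    R = {x for s in sent for x in s}
    T2 = heapq.nlargest(k, (tau(*bounds(x)) for x in R))[-1]
    R = {x for x in R if max(abs(b) for b in bounds(x)) >= T2}  # prune by upper bound

    # Round 3: fetch exact scores for the surviving candidates only.
    exact = {x: sum(s.get(x, 0.0) for s in nodes) for x in R}
    return heapq.nlargest(k, exact.items(), key=lambda kv: abs(kv[1]))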
- Uses lower/upper bounds to estimate the scores the coordinator has not seen
- Its communication cost is better than the baseline's
- Needs 3 rounds
- Drawbacks?
- Three rounds of MapReduce jobs incur a lot of overhead
- On each node j, every split must be fully scanned to compute the local scores
- Any solutions?
- Assume n is the number of records in the dataset; to estimate each coefficient within an additive error of εn, a random sample of size O(1/ε²) suffices
- If n is very large, we need a very small ε to keep the error acceptable
- The cost of communication is O(1/ε²)
- For each split j, extract a first-level sample from the input
- Perform a second-level sample over the sampled items: items with large sampled counts are reported exactly, rarer ones with some probability (sketched below)
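A rough sketch of the two-level idea; the sampling rates p1 and p2 and the heavy/light cutoff are illustrative assumptions, and the paper's exact scheme and probabilities differ:

import random
from collections import Counter

def two_level_sample(split, p1=0.01, cutoff=10, p2=0.5):
    """First level: Bernoulli-sample the split's records.
    Second level: report frequent sampled items exactly, rare ones with
    probability p2 and scaled counts, keeping the estimates unbiased."""
    sampled = Counter(x for x in split if random.random() < p1)
    out = []
    for x, count in sampled.items():
        if count >= cutoff:
            out.append((x, count))          # heavy item: send the exact sampled count
        elif random.random() < p2:
            out.append((x, count / p2))     # light item: scale to stay unbiased
    return out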
- The communication cost is reduced to O(√m/ε)
- It provides an unbiased estimate of each coefficient w(i)
- Both H-WTopk and TwoLevel-S work well in the experiments
- Provides two approaches to computing the coefficients used in wavelet histograms
- When designing algorithms as MapReduce jobs, one should consider both the communication cost and the number of rounds
Thanks for watching.