RESTORE: REUSING RESULTS OF MAPREDUCE JOBS
Junjie Hu
Introduction
- Current practice deletes the intermediate results of MapReduce jobs
- These results are not useless
- ReStore: a system that reuses the outputs of MapReduce jobs / sub-jobs
[Figure: two example workflow plans over the same input Data1, built from Load, Project, Group, and Store operators]
- Before a job J is matched, all other jobs that produce J's inputs have already been matched and rewritten
- A physical plan in the repository is considered a match if it computes the same operators over the same input data as the job (or a sub-plan of it)
-- Script 1
A = load 'input';            -- input path not shown on the slide
Store A into 'out1';

-- Script 2
A = load 'input';
B = foreach A generate ...;  -- projection not shown on the slide
Store B into 'out2';
- Use DFS to traverse the plan when matching
- ReStore uses the first match (greedy)
- Rules order the candidate physical plans (a minimal matching sketch follows)
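A minimal sketch of this greedy matching, assuming plans are flattened into operator lists; the repository layout and the "longer matches first" ordering are assumptions for illustration, not ReStore's actual data structures:

def first_match(job_input, job_ops, repository):
    """Greedily reuse the first repository plan that is a prefix of the job's plan.

    repository: list of (input, ops, stored_output) tuples, pre-sorted by the
    ordering rules (here assumed to prefer longer matches first).
    Returns (stored_output_to_reuse, remaining_ops_to_execute).
    """
    for rep_input, rep_ops, stored_output in repository:
        if rep_input == job_input and job_ops[:len(rep_ops)] == list(rep_ops):
            return stored_output, job_ops[len(rep_ops):]  # first (greedy) match wins
    return None, job_ops  # no reuse possible: run the whole job

# Example: the repository holds the Load->Project prefix of an earlier job
repo = [("Data1", ["Load", "Project"], "out1")]
print(first_match("Data1", ["Load", "Project", "Group", "Store"], repo))
# -> ('out1', ['Group', 'Store'])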
- Jobs
- Sub-jobs (how to generate them?)
- Why not always reuse whole jobs?
- What is the challenge in reusing sub-jobs?
- What are the disadvantages of reusing sub-jobs?
- Inject 'store' after each physical operator
- Or use heuristics to inject 'store' commands only where they are likely to pay off
[Figure: a chain of operators OP1, OP2, ... with a Store injected after each operator's output]
- Conservative heuristic
- Aggressive heuristic
- Property 1: the stored output can reduce the execution time of a future workflow
- Property 2: the stored output can be reused in future workflows
- Check these properties based on statistics of previously executed jobs (a toy check is sketched below)
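A toy version of such a statistics-based check; the field names and the exact conditions are assumptions for illustration:

def worth_storing(op_stats):
    """Decide whether to keep an operator's output, given collected statistics.

    op_stats: assumed per-operator statistics from previously executed jobs.
    """
    # Property 1: reusing the output should save more time than writing it costs
    saves_time = op_stats["est_time_saved"] > op_stats["est_store_cost"]
    # Property 2: outputs that shrink the data (e.g., after a filter or
    # group-by) are cheaper to keep and more likely to be reused later
    likely_reused = op_stats["output_bytes"] < op_stats["input_bytes"]
    return saves_time and likely_reused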
- Use PigMix: a set of queries used to test Pig performance
- Two instances to test: 15GB and 150GB (more details below)
- Speedup: improvement in execution time relative to the original execution time
- Overhead: extra execution time incurred by injecting store commands (written out as formulas below)
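As formulas (a reconstruction of the slide's wording, not necessarily the paper's exact definitions):

\[
\text{speedup} = \frac{T_{\text{original}}}{T_{\text{with reuse}}},
\qquad
\text{overhead} = \frac{T_{\text{with injected stores}} - T_{\text{original}}}{T_{\text{original}}}
\]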
Speedup: 9.8 (L3: group and aggregate; L11: union of two data sets)
Speedup: 24.4; Overhead: 1.6
L7: nested split. Why is the aggressive heuristic so much worse than using no heuristic?
- Data size: 150GB
- Data size: 15GB
Win!
- As the amount of data grows, the benefit of reusing stored results increases
- Outputs of MapReduce jobs can be reused
- Intermediate results of MapReduce jobs can be reused as well
- There is a trade-off between the extra work introduced by injecting store commands and the benefit of reuse
- The type of command affects how reusable a result is
- Hadoop + HDFS
  - Each different filter condition triggers a new scan of the full data
  - "Going shopping without a shopping list"
  - "Let's see what I am going to encounter on the way"
- HAIL: Hadoop Aggressive Indexing Library
  - Keeps existing replicas in different sort orders, each with its own clustered index
  - Makes it faster to find a suitable index for a query, avoiding longer runtimes across a workload
- Each MapReduce job has to scan the whole data set on disk
  - Slow query times
- Trojan indexes
  - Expensive index creation
  - How to choose general attributes useful for other tasks?
- HDFS keeps replicas that all have the same physical layout
- Client analyzes the input data for each HDFS block
- Converts each HDFS block to binary PAX
- Sorts the data in parallel, in different sort orders
- Datanodes create clustered indexes
- MapReduce jobs exploit the indexes
- Failover: standard Hadoop scanning
- Partition Attributes Across (PAX)
- A data organization model
- Significantly improves cache performance by grouping the values of each attribute into minipages within a page (toy illustration below)
http://www.pdl.cmu.edu/ftp/Database/pax.pdf
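A toy illustration of the layout idea in Python (not HAIL's actual on-disk format):

# Row layout (NSM): attribute values of a record are stored together,
# so scanning one attribute drags all the others through the cache.
rows = [(1, "a", 0.5), (2, "b", 0.7), (3, "c", 0.9)]

# PAX: within one page, values of the SAME attribute are grouped into
# minipages, so a scan of one attribute touches contiguous memory.
pax_page = {
    "id":    [1, 2, 3],
    "name":  ["a", "b", "c"],
    "score": [0.5, 0.7, 0.9],
}
ids = pax_page["id"]  # cache-friendly single-attribute scan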
- Bob: a representative analyst
- A large web log has three fields, each of which may serve as a filter attribute:
  - visitDate
  - adRevenue
  - sourceIP
Reuse as much of the existing HDFS pipeline as possible:
1. Parse the input into rows based on end of line
2. Parse each row by the specified schema
3. HDFS gets the list of datanodes for the block
4. The PAX data is cut into packets (PCK: data packet, ACK: acknowledgement)
6. Each datanode assembles the block in main memory
7. Sorts the data, creates indexes, and forms the HAIL block
8. DN1 and DN2 immediately forward each packet
9. DN3 verifies the checksums
10. DN3 acknowledges each packet back to DN2
- HAIL keeps track of the different sort orders
- HAIL needs to schedule map tasks close to replicas whose sort order suits the query
- The central namenode keeps the directory-to-block (Dir_Block) mapping
Figure 2: HAIL data column index
- Cheap to create in main memory
- Cheap to write to disk
- Cheap to query from disk
Each leaf consists of 1024 values:
- All leaves are contiguous on disk
- A leaf can be reached by simply multiplying the leaf size by the leaf ID (see the arithmetic sketched below)
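That offset arithmetic, spelled out; the 1024 values per leaf comes from the slide, while the 4-byte value size is an assumption:

VALUES_PER_LEAF = 1024           # from the slide
VALUE_SIZE_BYTES = 4             # assumed: fixed-size 4-byte values
LEAF_SIZE = VALUES_PER_LEAF * VALUE_SIZE_BYTES   # 4096 bytes per leaf

def leaf_offset(leaf_id: int) -> int:
    """Leaves are contiguous on disk, so reaching a leaf is one multiply."""
    return leaf_id * LEAF_SIZE

print(leaf_offset(3))  # 12288: the fourth leaf starts 3 * 4096 bytes in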
Bob's example query:
SELECT sourceIP FROM UserVisits
WHERE visitDate BETWEEN '1999-01-01' AND '2000-01-01';
Bob annotates his map function to specify the selection predicate and the projected attributes required by his MapReduce job. The JobClient logically breaks the input into smaller pieces called input splits; an input split defines the input data of a map task. For each map task, the JobTracker uses the split locations to decide which computing node to schedule it on. The map task then uses a RecordReader UDF to read its input data block from the closest datanode.
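Conceptually, the annotation lets the record reader filter and project at the storage layer before the map function ever runs. A schematic sketch of that step, with hypothetical names rather than HAIL's actual Java API:

def read_block(records, predicate, projection):
    """Yield only the records (and attributes) the annotated map function needs."""
    for record in records:
        if predicate(record):
            yield {attr: record[attr] for attr in projection}

# Bob's annotation for his query, expressed as plain callables (hypothetical)
predicate = lambda r: "1999-01-01" <= r["visitDate"] <= "2000-01-01"
projection = ["sourceIP"]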
- It is crucial to be non-intrusive to standard Hadoop
- HailInputFormat
  - Uses a more elaborate splitting policy, called HailSplitting
- HailRecordReader
  - Responsible for retrieving the records that satisfy the query predicate
- Six different clusters
  - One physical cluster with 10 nodes
  - Three EC2 clusters, each with 10 nodes, using different data types
  - Two EC2 clusters: one with 50 nodes, the other with 100 nodes
- Two datasets
  - UserVisits table: 20GB of data per node
  - Synthetic dataset: 13GB of data per node, consisting of 19 integer attributes, to understand the effects of data types
Bob-Q1 (selectivity: 3.1 × 10⁻²):
SELECT sourceIP FROM UserVisits
WHERE visitDate BETWEEN '1999-01-01' AND '2000-01-01';

Bob-Q2 (selectivity: 3.2 × 10⁻⁸):
SELECT searchWord, duration, adRevenue FROM UserVisits
WHERE sourceIP='172.101.11.46';

Bob-Q3 (selectivity: 6 × 10⁻⁹):
SELECT searchWord, duration, adRevenue FROM UserVisits
WHERE sourceIP='172.101.11.46' AND visitDate='1992-12-22';

Bob-Q4 (selectivity: 1.7 × 10⁻²):
SELECT searchWord, duration, adRevenue FROM UserVisits
WHERE adRevenue>=1 AND adRevenue<=10;

Additionally, a variation of Bob-Q4 tests how well HAIL performs on queries with low selectivity:

Bob-Q5 (selectivity: 2.04 × 10⁻¹):
SELECT searchWord, duration, adRevenue FROM UserVisits
WHERE adRevenue>=1 AND adRevenue<=100;
HAIL has a negligible upload overhead compared to standard Hadoop. When HAIL creates one index per replica, the overhead still remains very low (at most ~14%), and HAIL outperforms Hadoop by a factor of 1.6 even when creating three indexes. [Chart: a reference line marks the time Hadoop takes to upload with the default replication factor of three; HAIL significantly outperforms Hadoop for any replication factor.]
[Charts: "UV" shows upload times for the UserVisits table when scaling up; "Syn" shows upload times for the Synthetic dataset when scaling up, in seconds]
EC2 exhibits high performance variability.
- HAIL: creates indexes on three different attributes, one for each replica
- HAIL-1Idx: creates an index on the same attribute for all three replicas
Under node failure, Hadoop and HAIL suffer the same slowdown; HAIL-1Idx has a lower slowdown, since failed map tasks can still perform an index scan on another replica even after the failure.
- HAIL can:
  - Improve both upload and query times
  - Keep the failover properties of Hadoop
  - Work with existing MapReduce jobs, incurring only minimal changes
- Why does HAIL have different performance on different data types?
- Would it be possible to use HAIL for data that is already stored in the cluster?
- Is HAIL open source?
- Can HAIL integrate with other systems, such as Pig or Hive (via an API)?
- Why do/don't the authors use Hadoop++?
- How does HAIL behave in other use cases?
- How useful will it be for the needs of different users and queries?
- ...
- HAIL vs. Hadoop++
  - HAIL creates Trojan indexes per physical replica rather than per logical block
  - HAIL makes index creation much less expensive
- Twitter full-text indexing
  - Only suitable for highly selective queries
- CoHadoop
  - Does not improve on the indexing features of Hadoop++
- Histograms are important for summarizing data
- The wavelet histogram is one of the most widely used
- A straightforward adaptation of existing methods for building wavelet histograms in MapReduce is inefficient
- New algorithms are required
- Suppose each record in the dataset has a key drawn from the domain u = {1, 2, ..., u}
- Define the frequency vector v = (v(1), v(2), ..., v(u)), where v(x) is the frequency of key x
- Compute the wavelet basis vectors ψ(i), each of the same length as v; ψ is the Haar wavelet basis
- The coefficients are w(i) = ⟨v, ψ(i)⟩ (dot products), for i = 1, ..., u
- Compute the best k-term wavelet representation using the k coefficients of largest magnitude (a runnable sketch follows)
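A compact sketch of these definitions for the Haar basis, assuming u is a power of two (numpy is used here for the dot products):

import numpy as np

def haar_basis(u):
    """Return the u normalized Haar basis vectors, each of length u."""
    basis = [np.ones(u) / np.sqrt(u)]          # overall-average vector
    n_vectors = 1                              # vectors at the current level
    while n_vectors < u:
        width = u // n_vectors                 # support width at this level
        for s in range(n_vectors):
            psi = np.zeros(u)
            psi[s * width : s * width + width // 2] = 1.0
            psi[s * width + width // 2 : (s + 1) * width] = -1.0
            basis.append(psi / np.linalg.norm(psi))
        n_vectors *= 2
    return basis

def best_k_terms(v, k):
    """w(i) = <v, psi(i)>; keep the k coefficients of largest magnitude."""
    w = np.array([np.dot(v, psi) for psi in haar_basis(len(v))])
    top = np.argsort(-np.abs(w))[:k]
    return {int(i) + 1: float(w[i]) for i in top}  # 1-indexed like the slide

v = np.array([9, 7, 3, 5, 45, 1, 2, 0])  # a frequency vector over u = 8
print(best_k_terms(v, 3))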
- m mappers (one per node) and a single reducer (the coordinator)
- Mapper: each mapper emits (x, v_local(x)) for every key x in its split
- Reducer: aggregates the pairs into (x, v(x)) and computes the coefficients
- Due to the distributive property of the dot product, w(i) = ⟨v, ψ(i)⟩ = Σ_j ⟨v_j, ψ(i)⟩: coefficients computed over local frequency vectors can simply be summed
- Drawbacks?
- Any improvements? (note that coefficients can be computed per split and then combined; see the sketch below)
- How many intermediate files are there? Suppose the splits ...
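The improvement the distributive property enables, reusing haar_basis from the earlier snippet: mappers send local coefficients instead of raw frequencies, and the reducer just sums them (a sketch; the file-count bookkeeping is omitted):

import numpy as np
from collections import defaultdict

def map_local_coefficients(v_local, basis):
    # Each mapper computes the coefficients of its LOCAL frequency vector
    # and emits only the nonzero ones as (i, w_j(i)) pairs.
    for i, psi in enumerate(basis):
        w_ji = float(np.dot(v_local, psi))
        if w_ji != 0.0:
            yield i + 1, w_ji

def reduce_sum(emitted_pairs):
    # Since <v, psi(i)> = sum_j <v_j, psi(i)>, summing the local
    # coefficients yields the exact global coefficients.
    w = defaultdict(float)
    for i, w_ji in emitted_pairs:
        w[i] += w_ji
    return dict(w)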
- For an item (key) x, r(x) denotes its aggregated score over all nodes
- τ(x) denotes a lower bound on the magnitude of item x's score
- The global threshold τ is the k-th largest τ(x)
- If an item's local score is always below τ/m, then its aggregated score stays below τ, so it cannot be in the top-k
- Each node emits its k highest and k lowest scored items. If the coordinator has not seen x's score from some node, it bounds that score by the node's k-th highest and k-th lowest emitted scores, yielding an upper bound τ+(x) and a lower bound τ−(x) on r(x)
- Set τ(x) = 0 if τ+(x) and τ−(x) have different signs; otherwise τ(x) = min(|τ+(x)|, |τ−(x)|)
- Pick the k-th largest τ(x), denoted T1; it is a threshold that candidate top-k items must be able to exceed
- Each node j emits item x if |r_j(x)| > T1/m
- Define R as the set of items the coordinator has received; refine the bounds τ+(x) and τ−(x) for every x in R
- Calculate a better threshold T2; any x in R whose upper-bound magnitude falls below T2 can be discarded
- Ask each node for the scores of all items remaining in R
- Compute the aggregated scores exactly for those items and return the top-k (the three rounds are simulated below)
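A coordinator-side simulation of the three rounds, as a sketch: it assumes every node holds at least k items and simplifies the paper's tie and sign handling:

import heapq

def h_wtopk(nodes, k):
    """nodes: list of {item: signed local score}. Returns the exact top-k by |score|."""
    m = len(nodes)

    # Round 1: every node ships its k highest and k lowest scored items.
    sent, lo_cap, hi_cap = [], [], []
    for scores in nodes:
        ranked = sorted(scores.items(), key=lambda kv: kv[1])
        sent.append(dict(ranked[:k] + ranked[-k:]))
        lo_cap.append(ranked[k - 1][1])    # k-th lowest sent: lower bound for unseen scores
        hi_cap.append(ranked[-k][1])       # k-th highest sent: upper bound for unseen scores

    def bounds(x):  # tau-(x), tau+(x): bounds on the aggregated score r(x)
        lo = sum(s.get(x, lc) for s, lc in zip(sent, lo_cap))
        hi = sum(s.get(x, hc) for s, hc in zip(sent, hi_cap))
        return lo, hi

    def tau(lo, hi):  # lower bound on |r(x)|; 0 when the bounds straddle zero
        return 0.0 if lo < 0.0 < hi else min(abs(lo), abs(hi))

    candidates = {x for s in sent for x in s}
    T1 = heapq.nlargest(k, (tau(*bounds(x)) for x in candidates))[-1]

    # Round 2: node j additionally ships every x with |r_j(x)| > T1 / m.
    for j, scores in enumerate(nodes):
        sent[j].update({x: v for x, v in scores.items() if abs(v) > T1 / m})
    R = {x for s in sent for x in s}
    T2 = heapq.nlargest(k, (tau(*bounds(x)) for x in R))[-1]
    R = {x for x in R if max(abs(b) for b in bounds(x)) >= T2}  # prune by upper bound

    # Round 3: fetch exact scores for the surviving candidates only.
    exact = {x: sum(s.get(x, 0.0) for s in nodes) for x in R}
    return heapq.nlargest(k, exact.items(), key=lambda kv: abs(kv[1]))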
- Uses lower/upper bounds to estimate the scores the coordinator has not seen
- Its communication cost is better than the baseline's
- Needs 3 rounds
- Drawbacks?
- Three rounds of MapReduce jobs incur a lot of overhead
- On each node j, every split must be fully scanned to compute the local scores
- Any solutions?
- Assume n is the number of records in the dataset; to estimate each coefficient within an additive error of εn, a random sample of size O(1/ε²) suffices
- If n is very large, we need a very small ε to keep the error acceptable
- The cost of communication is O(1/ε²)
- For each split j, extract a first-level sample from the input
- Perform a second-level sample over the sampled items: items with large sampled counts are reported exactly, rarer ones with some probability (sketched below)
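A rough sketch of the two-level idea; the sampling rates p1 and p2 and the heavy/light cutoff are illustrative assumptions, and the paper's exact scheme and probabilities differ:

import random
from collections import Counter

def two_level_sample(split, p1=0.01, cutoff=10, p2=0.5):
    """First level: Bernoulli-sample the split's records.
    Second level: report frequent sampled items exactly, rare ones with
    probability p2 and scaled counts, keeping the estimates unbiased."""
    sampled = Counter(x for x in split if random.random() < p1)
    out = []
    for x, count in sampled.items():
        if count >= cutoff:
            out.append((x, count))          # heavy item: send the exact sampled count
        elif random.random() < p2:
            out.append((x, count / p2))     # light item: scale to stay unbiased
    return out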
- The communication cost is reduced to O(√m/ε)
- It provides an unbiased estimate of each coefficient w(i)
- Both H-WTopk and TwoLevel-S work well in the experiments
- Provides two approaches to computing the coefficients used in wavelet histograms
- When designing algorithms as MapReduce jobs, one should consider both the communication cost and the number of rounds
Thanks for watching.