Declarative MapReduce (PowerPoint PPT Presentation)


SLIDE 1

Declarative MapReduce


SLIDE 2

Declarative Languages

Describe what you want to do, not how to do it. The most popular example is SQL. Can we compile SQL queries into MapReduce program(s)?


SLIDE 3

Relational Operators

Projection - π

SELECT revenue - expenses AS profit FROM …

Selection (Filter) – σ

SELECT … WHERE cost > 5000

Aggregate - Σ

SELECT SUM(cost)

Grouped Aggregate

SELECT SUM(cost) GROUP BY product_id

Join - ⋈

SELECT … FROM Employee, Department WHERE Employee.dept_id = Department.id


SLIDE 4

Example: Log file

host logname time method url response bytes referer useragent

pppa006.compuserve.com - 807256800 GET /images/launch-logo.gif 200 1713
vcc7.langara.bc.ca - 807256804 GET /shuttle/missions/missions.html 200 8677
pppa006.compuserve.com - 807256806 GET /history/apollo/images/apollo-logo1.gif 200 1173
bettong.client.uq.oz.au - 807256900 GET /history/skylab/skylab.html 304
bettong.client.uq.oz.au - 807256913 GET /images/ksclogosmall.gif 304
202.32.48.43 - 807259091 GET /shuttle/resources/orbiters/atlantis.gif 404
bettong.client.uq.oz.au - 807256913 GET /history/apollo/images/apollo-logo.gif 200 3047
ad03-053.compuserve.com - 807257487 GET /cgi-bin/imagemap/countdown70?284,288 302 85
hella.stm.it - 807256914 GET /shuttle/missions/sts-70/images/DSC-95EC-0001.jpg 200 513911

We will model a tuple as a map [String → Value]. This can be implemented as a hash table, for example. E.g., tuple.host = “…”
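The tuple-as-map model can be sketched in Python, where a dict plays the role of the [String → Value] map (the field values below come from the sample log file; the variable names are illustrative):

```python
# A tuple modeled as a map from attribute name to value.
# The schema mirrors the log-file example; values are from its first record.
log_tuple = {
    "host": "pppa006.compuserve.com",
    "logname": "-",
    "time": 807256800,
    "method": "GET",
    "url": "/images/launch-logo.gif",
    "response": 200,
    "bytes": 1713,
}

# The slide's tuple.host notation becomes a key lookup:
host = log_tuple["host"]
```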

SLIDE 5

Projection

Input: A tuple with a set of attributes
Output: A tuple with another set of attributes
Can be modeled as a map-only job
Example: Add day of week based on the time


map(tuple, context) {
  date = new Date(tuple.time)
  tuple.day_of_week = date.getDayOfWeek()
  context.write(tuple)
}
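A runnable sketch of the same map-only projection, in Python rather than Hadoop (`project_map` and the in-memory `out` list are illustrative stand-ins for the Mapper and `context.write`):

```python
from datetime import datetime, timezone

def project_map(tup, output):
    # Derive day_of_week from the time attribute, as in the slide's map().
    date = datetime.fromtimestamp(tup["time"], tz=timezone.utc)
    tup = dict(tup)                      # copy: do not mutate the input record
    tup["day_of_week"] = date.strftime("%A")
    output.append(tup)                   # stands in for context.write(tuple)

# Map-only: each tuple is processed independently; no shuffle, no reducer.
records = [{"time": 807256800, "url": "/images/launch-logo.gif"}]
out = []
for rec in records:
    project_map(rec, out)
```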

SLIDE 6

Selection (Filter)

Input: A tuple with a set of attributes
Output: Either the tuple, if it matches the predicate, or nothing if it does not
Can be modeled as a map-only job
Example: Find records with response code 200


map(tuple, context) {
  response_code = tuple.response
  if (response_code == 200)
    context.write(tuple)
}

SLIDE 7

Aggregation

Input: A relation with a set of tuples
Output: One value that aggregates an entire column
Can be modeled as one map-reduce job
Example: Find the sum of bytes


// Configure with one reducer
map(tuple, context) {
  context.write(1, tuple.bytes)
}
reduce(key, values[], context) {
  context.write(key, sum(values))
}
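The whole job can be simulated in a few lines of Python. This is a hedged sketch, not Hadoop: `run_job` is a tiny in-memory runner standing in for the framework's map, shuffle (group by key), and reduce phases; the constant key 1 sends every value to the single reducer, as in the slide.

```python
from collections import defaultdict

def agg_map(tup, emit):
    emit(1, tup["bytes"])          # constant key: all values reach one reducer

def agg_reduce(key, values, emit):
    emit(key, sum(values))

def run_job(records, map_fn, reduce_fn):
    """In-memory MapReduce runner: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for rec in records:
        map_fn(rec, lambda k, v: groups[k].append(v))
    out = []
    for k, vs in groups.items():
        reduce_fn(k, vs, lambda k, v: out.append((k, v)))
    return out

records = [{"bytes": 1713}, {"bytes": 8677}, {"bytes": 1173}]
result = run_job(records, agg_map, agg_reduce)
```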

SLIDE 8

Aggregation

A combiner can be used to speed up the processing. Note: Hadoop provides a special key called NullWritable for this scenario.


// Configure with one reducer
map(tuple, context) {
  context.write(1, tuple.bytes)
}
combine/reduce(key, values[], context) {
  context.write(key, sum(values))
}

SLIDE 9

Other Aggregate Functions

The same technique can be used for any function that is associative and commutative. This includes min, max, sum, and count. It also includes all functions that can be derived from these, e.g., average and standard deviation.
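Average is the standard example of a derived function: AVG itself is not associative (the average of two averages is generally wrong), but it can be computed from SUM and COUNT, which are. A sketch of the idea, with each partial aggregate carried as a (sum, count) pair (the names `combine` and `average` are illustrative):

```python
# AVG is derived from SUM and COUNT: each partial aggregate is a
# (sum, count) pair; pairs combine associatively and commutatively,
# and the division happens only once, at the very end.

def combine(a, b):
    return (a[0] + b[0], a[1] + b[1])   # pairs add component-wise

def average(pairs):
    s, c = 0, 0
    for p in pairs:
        s, c = combine((s, c), p)
    return s / c

# Two combiner outputs from two different map tasks:
partial1 = (10, 4)                      # sum=10 over 4 values
partial2 = (20, 6)                      # sum=20 over 6 values
avg = average([partial1, partial2])     # (30, 10) -> 3.0
```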


SLIDE 10

Grouped Aggregation

Input: A relation with a set of tuples
Output: One value that aggregates an entire column for each value of the group key
Can be modeled as one map-reduce job
Example: Find the sum of bytes for each response code


map(tuple, context) {
  context.write(tuple.response, tuple.bytes)
}
combine/reduce(key, values[], context) {
  context.write(key, sum(values))
}
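In Python, the only change from the plain aggregation sketch is the map key: it is now the response code, so the shuffle groups values per code. This is again a hedged in-memory simulation, with `run_job` standing in for the framework:

```python
from collections import defaultdict

def grouped_map(tup, emit):
    emit(tup["response"], tup["bytes"])  # group key = response code

def sum_reduce(key, values, emit):
    # sum is associative, so this function serves as both combiner and reducer
    emit(key, sum(values))

def run_job(records, map_fn, reduce_fn):
    groups = defaultdict(list)           # shuffle: group values by key
    for rec in records:
        map_fn(rec, lambda k, v: groups[k].append(v))
    out = {}
    for k, vs in groups.items():
        reduce_fn(k, vs, lambda key, val: out.update({key: val}))
    return out

records = [
    {"response": 200, "bytes": 1713},
    {"response": 200, "bytes": 8677},
    {"response": 404, "bytes": 0},
]
result = run_job(records, grouped_map, sum_reduce)
```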

SLIDE 11

Equi-join

Input: Two relations and a join column
Output: A tuple that combines each pair of tuples where the join column is equal
Can be modeled as one map-reduce job
Special case: Self-join, where both inputs are the same
Example (Self-join): Given a log file, find log entries which originate from the same host and request the same URL


SLIDE 12

Self Equi-join

Given a log file, find log entries which originate from the same host and request the same URL


map(tuple, context) {
  join_key = tuple.host + "|" + tuple.url
  context.write(join_key, tuple)
}
reduce(key, values[], context) {
  for (int i = 0 to values.length) {
    for (int j = i + 1 to values.length) {
      merged_tuple = values[i] ∪ values[j] // union
      context.write(key, merged_tuple)
    }
  }
}
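A hedged Python simulation of the self equi-join: the map keys each tuple by host|url, and the reducer pairs up every two tuples sharing a key (`itertools.combinations` replaces the nested i/j loops). Note that merging with a dict union overwrites colliding attributes such as `time`; a real implementation would rename them.

```python
from collections import defaultdict
from itertools import combinations

def join_map(tup, emit):
    emit(tup["host"] + "|" + tup["url"], tup)

def join_reduce(key, values, emit):
    for a, b in combinations(values, 2):  # the slide's nested i/j loops
        emit(key, {**a, **b})             # merged tuple; b's fields win on collision

groups = defaultdict(list)                # shuffle: group tuples by join key
records = [
    {"host": "a.com", "url": "/x", "time": 1},
    {"host": "a.com", "url": "/x", "time": 2},
    {"host": "b.com", "url": "/x", "time": 3},
]
for rec in records:
    join_map(rec, lambda k, v: groups[k].append(v))

pairs = []
for k, vs in groups.items():
    join_reduce(k, vs, lambda k, v: pairs.append((k, v)))
# only the two a.com|/x entries match each other
```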

SLIDE 13

Binary Equi-join

Given two log files, find log entries which originate from the same host and request the same URL


map(tuple, context, order) {
  join_key = tuple.host + "|" + tuple.url
  tuple.input_order = order
  context.write(join_key, tuple)
}
map(tuple, context) {
  // Use MapContext#getInputSplit()
  if (context.inputPath == inputFile1)
    map(tuple, context, 1)
  else
    map(tuple, context, 2)
}

SLIDE 14

Binary Equi-join (cont’d)


reduce(key, values[], context) {
  for (int i = 0 to values.length) {
    for (int j = i + 1 to values.length) {
      if (values[i].input_order != values[j].input_order) {
        merged_tuple = values[i] ∪ values[j] // union
        context.write(key, merged_tuple)
      }
    }
  }
}
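The tagging trick can be sketched in Python as well: the map adds an `input_order` tag recording which input a tuple came from, and the reducer merges only pairs with different tags, so same-relation pairs are skipped. A hedged in-memory simulation:

```python
from collections import defaultdict

def tagged_map(tup, order, emit):
    tup = dict(tup, input_order=order)    # tag with the originating input
    emit(tup["host"] + "|" + tup["url"], tup)

def join_reduce(key, values, emit):
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i]["input_order"] != values[j]["input_order"]:
                emit(key, {**values[i], **values[j]})

log1 = [{"host": "a.com", "url": "/x", "t1": 1}]
log2 = [{"host": "a.com", "url": "/x", "t2": 2},
        {"host": "a.com", "url": "/x", "t2": 3}]

groups = defaultdict(list)                # shuffle: group tuples by join key
for rec in log1:
    tagged_map(rec, 1, lambda k, v: groups[k].append(v))
for rec in log2:
    tagged_map(rec, 2, lambda k, v: groups[k].append(v))

out = []
for k, vs in groups.items():
    join_reduce(k, vs, lambda k, v: out.append(v))
# 1 tuple from log1 x 2 from log2 -> 2 joined records; the log2-log2 pair is skipped
```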

SLIDE 15

Chaining of MapReduce Jobs

Hadoop is designed so that the output of one MapReduce job can be fed as input to another MapReduce job


SELECT day_of_week(time) AS dow, SUM(bytes) FROM logfile WHERE response = 200 GROUP BY dow;

Input → Select → Project → Grouped Aggregate
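The whole chain can be sketched end to end in Python, with each stage consuming the previous stage's output just as one MapReduce job feeds the next (a hedged simulation; the function names `select`, `project`, and `grouped_sum` are illustrative):

```python
from collections import defaultdict
from datetime import datetime, timezone

def select(records):
    # Selection: keep only response = 200 (map-only job)
    return [r for r in records if r["response"] == 200]

def project(records):
    # Projection: derive day of week from the time attribute (map-only job)
    out = []
    for r in records:
        dow = datetime.fromtimestamp(r["time"], tz=timezone.utc).strftime("%A")
        out.append(dict(r, dow=dow))
    return out

def grouped_sum(records):
    # Grouped aggregate: sum of bytes per day of week (map-reduce job)
    groups = defaultdict(int)
    for r in records:
        groups[r["dow"]] += r["bytes"]
    return dict(groups)

records = [
    {"time": 807256800, "response": 200, "bytes": 1713},
    {"time": 807256800, "response": 404, "bytes": 0},
    {"time": 807256800, "response": 200, "bytes": 8677},
]
result = grouped_sum(project(select(records)))
```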

SLIDE 16

Pig


A system built on top of Hadoop (now supports Spark as well)
Provides a SQL/ETL-like query language termed Pig Latin
Compiles Pig Latin programs into MapReduce programs

SLIDE 17

Examples

Filter: Return all the lines that have a user-specified response code, e.g., 200.


log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes);
ok_lines = FILTER log BY response == '200';
STORE ok_lines INTO 'filtered_output';

Map

SLIDE 18

Examples

Grouped aggregate Find the total number of bytes per response code


log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
grouped = GROUP log BY response;
grouped_aggregate = FOREACH grouped GENERATE group, SUM(log.bytes);
STORE grouped_aggregate INTO 'grouped_output';

Map Reduce

SLIDE 19

Examples

Grouped aggregate Find the average number of bytes per response code


log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
grouped = GROUP log BY response;
grouped_aggregate = FOREACH grouped GENERATE group, AVG(log.bytes);
STORE grouped_aggregate INTO 'grouped_output';

SLIDE 20

Examples

Join: Find pairs of requests that ask for the same URL, coming from the same source


log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
joined = JOIN log1 BY (url, host), log2 BY (url, host);

SLIDE 21

Examples

Join: Find pairs of requests that ask for the same URL, coming from the same source and happened within an hour of each other


log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
joined = JOIN log1 BY (url, host), log2 BY (url, host);
filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000;

SLIDE 22

How it works

LOAD operation

Determines the input path and InputFormat

STORE operation

Determines the output path and OutputFormat

FILTER and FOREACH

Translated into map-only jobs

AGGREGATE and JOIN

Translated into map-reduce jobs

All are compiled into one or more MapReduce jobs


SLIDE 23

Additional Features

Lazy execution

Nothing actually gets executed until the STORE command is reached

Consolidation of map-only jobs

Map-only jobs (FILTER and FOREACH) can be consolidated into the next job's map function or the previous job's reduce function


SLIDE 24

A Complex Example


log1 = LOAD 'logs.csv' USING PigStorage() AS (…);
log2 = LOAD 'logs.csv' USING PigStorage() AS (…);
joined = JOIN log1 BY (url, host), log2 BY (url, host);
filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000;
grouped = GROUP filtered BY log1::host;
agg_groups = FOREACH grouped GENERATE group, COUNT(filtered);
STORE agg_groups INTO 'final_result';

SLIDE 25

Further Readings

Pig home page: https://pig.apache.org
Detailed documentation: http://pig.apache.org/docs/r0.17.0/
The original Pig Latin paper:

Olston, Christopher, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. "Pig latin: a not-so-foreign language for data processing." In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1099-1110. ACM, 2008.
