Declarative MapReduce (PowerPoint PPT Presentation)


SLIDE 1

Declarative MapReduce


SLIDE 2

Declarative Languages

Describe what you want to do, not how to do it. The most popular example is SQL. Can we compile SQL queries into MapReduce program(s)?


SLIDE 3

Relational Operators

Projection - π

SELECT revenue - expenses AS profit FROM …

Selection (Filter) – σ

SELECT … WHERE cost > 5000

Aggregate - Σ

SELECT SUM(cost)

Grouped Aggregate

SELECT SUM(cost) GROUP BY product_id

Join - ⋈

SELECT … FROM Employee, Department WHERE Employee.dept_id = Department.id


SLIDE 4

Example: Log file

host logname time method url response bytes referer useragent

pppa006.compuserve.com - 807256800 GET /images/launch-logo.gif 200 1713
vcc7.langara.bc.ca - 807256804 GET /shuttle/missions/missions.html 200 8677
pppa006.compuserve.com - 807256806 GET /history/apollo/images/apollo-logo1.gif 200 1173
bettong.client.uq.oz.au - 807256900 GET /history/skylab/skylab.html 304
bettong.client.uq.oz.au - 807256913 GET /images/ksclogosmall.gif 304
202.32.48.43 - 807259091 GET /shuttle/resources/orbiters/atlantis.gif 404
bettong.client.uq.oz.au - 807256913 GET /history/apollo/images/apollo-logo.gif 200 3047
ad03-053.compuserve.com - 807257487 GET /cgi-bin/imagemap/countdown70?284,288 302 85
hella.stm.it - 807256914 GET /shuttle/missions/sts-70/images/DSC-95EC-0001.jpg 200 513911

We will model a tuple as a map [String → Value]. This can be implemented as a hash table, for example. E.g., tuple.host = “…”
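The tuple-as-map model can be sketched in Python, where a dict plays the role of the [String → Value] map (the field values below come from the sample log file; the variable names are illustrative):

```python
# A tuple modeled as a map from attribute name to value.
# The schema mirrors the log-file example; values are from its first record.
log_tuple = {
    "host": "pppa006.compuserve.com",
    "logname": "-",
    "time": 807256800,
    "method": "GET",
    "url": "/images/launch-logo.gif",
    "response": 200,
    "bytes": 1713,
}

# The slide's tuple.host notation becomes a key lookup:
host = log_tuple["host"]
```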

SLIDE 5

Projection

Input: A tuple with a set of attributes
Output: A tuple with another set of attributes
Can be modeled as a map-only job
Example: Add day of week based on the time


map(tuple, context) {
  date = new Date(tuple.time)
  tuple.day_of_week = date.getDayOfWeek()
  context.write(tuple)
}
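A runnable sketch of the same map-only projection, in Python rather than Hadoop (`project_map` and the in-memory `out` list are illustrative stand-ins for the Mapper and `context.write`):

```python
from datetime import datetime, timezone

def project_map(tup, output):
    # Derive day_of_week from the time attribute, as in the slide's map().
    date = datetime.fromtimestamp(tup["time"], tz=timezone.utc)
    tup = dict(tup)                      # copy: do not mutate the input record
    tup["day_of_week"] = date.strftime("%A")
    output.append(tup)                   # stands in for context.write(tuple)

# Map-only: each tuple is processed independently; no shuffle, no reducer.
records = [{"time": 807256800, "url": "/images/launch-logo.gif"}]
out = []
for rec in records:
    project_map(rec, out)
```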

SLIDE 6

Selection (Filter)

Input: A tuple with a set of attributes
Output: Either the tuple, if it matches the predicate, or nothing if it does not
Can be modeled as a map-only job
Example: Find records with response code 200


map(tuple, context) {
  response_code = tuple.response
  if (response_code == 200)
    context.write(tuple)
}

SLIDE 7

Aggregation

Input: A relation with a set of tuples
Output: One value that aggregates an entire column
Can be modeled as one map-reduce job
Example: Find the sum of bytes


// Configure with one reducer
map(tuple, context) {
  context.write(1, tuple.bytes)
}
reduce(key, values[], context) {
  context.write(key, sum(values))
}
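The whole job can be simulated in a few lines of Python. This is a hedged sketch, not Hadoop: `run_job` is a tiny in-memory runner standing in for the framework's map, shuffle (group by key), and reduce phases; the constant key 1 sends every value to the single reducer, as in the slide.

```python
from collections import defaultdict

def agg_map(tup, emit):
    emit(1, tup["bytes"])          # constant key: all values reach one reducer

def agg_reduce(key, values, emit):
    emit(key, sum(values))

def run_job(records, map_fn, reduce_fn):
    """In-memory MapReduce runner: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for rec in records:
        map_fn(rec, lambda k, v: groups[k].append(v))
    out = []
    for k, vs in groups.items():
        reduce_fn(k, vs, lambda k, v: out.append((k, v)))
    return out

records = [{"bytes": 1713}, {"bytes": 8677}, {"bytes": 1173}]
result = run_job(records, agg_map, agg_reduce)
```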

SLIDE 8

Aggregation

A combiner can be used to speed up the processing. Note: Hadoop provides a special key called NullWritable for this scenario.


// Configure with one reducer
map(tuple, context) {
  context.write(1, tuple.bytes)
}
combine/reduce(key, values[], context) {
  context.write(key, sum(values))
}

SLIDE 9

Other Aggregate Functions

The same technique can be used for any function that is associative and commutative. This includes min, max, sum, and count. It also includes all functions that can be derived from these, e.g., average and standard deviation.
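Average is the standard example of a derived function: AVG itself is not associative (the average of two averages is generally wrong), but it can be computed from SUM and COUNT, which are. A sketch of the idea, with each partial aggregate carried as a (sum, count) pair (the names `combine` and `average` are illustrative):

```python
# AVG is derived from SUM and COUNT: each partial aggregate is a
# (sum, count) pair; pairs combine associatively and commutatively,
# and the division happens only once, at the very end.

def combine(a, b):
    return (a[0] + b[0], a[1] + b[1])   # pairs add component-wise

def average(pairs):
    s, c = 0, 0
    for p in pairs:
        s, c = combine((s, c), p)
    return s / c

# Two combiner outputs from two different map tasks:
partial1 = (10, 4)                      # sum=10 over 4 values
partial2 = (20, 6)                      # sum=20 over 6 values
avg = average([partial1, partial2])     # (30, 10) -> 3.0
```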


SLIDE 10

Grouped Aggregation

Input: A relation with a set of tuples
Output: One value that aggregates an entire column for each value of the group key
Can be modeled as one map-reduce job
Example: Find the sum of bytes for each response code


map(tuple, context) {
  context.write(tuple.response, tuple.bytes)
}
combine/reduce(key, values[], context) {
  context.write(key, sum(values))
}
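In Python, the only change from the plain aggregation sketch is the map key: it is now the response code, so the shuffle groups values per code. This is again a hedged in-memory simulation, with `run_job` standing in for the framework:

```python
from collections import defaultdict

def grouped_map(tup, emit):
    emit(tup["response"], tup["bytes"])  # group key = response code

def sum_reduce(key, values, emit):
    # sum is associative, so this function serves as both combiner and reducer
    emit(key, sum(values))

def run_job(records, map_fn, reduce_fn):
    groups = defaultdict(list)           # shuffle: group values by key
    for rec in records:
        map_fn(rec, lambda k, v: groups[k].append(v))
    out = {}
    for k, vs in groups.items():
        reduce_fn(k, vs, lambda key, val: out.update({key: val}))
    return out

records = [
    {"response": 200, "bytes": 1713},
    {"response": 200, "bytes": 8677},
    {"response": 404, "bytes": 0},
]
result = run_job(records, grouped_map, sum_reduce)
```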

SLIDE 11

Equi-join

Input: Two relations and a join column
Output: A tuple that combines each pair of tuples where the join column is equal
Can be modeled as one map-reduce job
Special case: Self-join, where both inputs are the same
Example (Self-join): Given a log file, find log entries which originate from the same host and request the same URL


SLIDE 12

Self Equi-join

Given a log file, find log entries which originate from the same host and request the same URL


map(tuple, context) {
  join_key = tuple.host + "|" + tuple.url
  context.write(join_key, tuple)
}
reduce(key, values[], context) {
  for (int i = 0 to values.length) {
    for (int j = i + 1 to values.length) {
      merged_tuple = values[i] ∪ values[j] // union
      context.write(key, merged_tuple)
    }
  }
}
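A hedged Python simulation of the self equi-join: the map keys each tuple by host|url, and the reducer pairs up every two tuples sharing a key (`itertools.combinations` replaces the nested i/j loops). Note that merging with a dict union overwrites colliding attributes such as `time`; a real implementation would rename them.

```python
from collections import defaultdict
from itertools import combinations

def join_map(tup, emit):
    emit(tup["host"] + "|" + tup["url"], tup)

def join_reduce(key, values, emit):
    for a, b in combinations(values, 2):  # the slide's nested i/j loops
        emit(key, {**a, **b})             # merged tuple; b's fields win on collision

groups = defaultdict(list)                # shuffle: group tuples by join key
records = [
    {"host": "a.com", "url": "/x", "time": 1},
    {"host": "a.com", "url": "/x", "time": 2},
    {"host": "b.com", "url": "/x", "time": 3},
]
for rec in records:
    join_map(rec, lambda k, v: groups[k].append(v))

pairs = []
for k, vs in groups.items():
    join_reduce(k, vs, lambda k, v: pairs.append((k, v)))
# only the two a.com|/x entries match each other
```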

SLIDE 13

Binary Equi-join

Given two log files, find log entries which originate from the same host and request the same URL


map(tuple, context, order) {
  join_key = tuple.host + "|" + tuple.url
  tuple.input_order = order
  context.write(join_key, tuple)
}
map(tuple, context) {
  // Use MapContext#getInputSplit()
  if (context.inputPath == inputFile1)
    map(tuple, context, 1)
  else
    map(tuple, context, 2)
}

SLIDE 14

Binary Equi-join (cont’d)


reduce(key, values[], context) {
  for (int i = 0 to values.length) {
    for (int j = i + 1 to values.length) {
      if (values[i].input_order != values[j].input_order) {
        merged_tuple = values[i] ∪ values[j] // union
        context.write(key, merged_tuple)
      }
    }
  }
}
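The tagging trick can be sketched in Python as well: the map adds an `input_order` tag recording which input a tuple came from, and the reducer merges only pairs with different tags, so same-relation pairs are skipped. A hedged in-memory simulation:

```python
from collections import defaultdict

def tagged_map(tup, order, emit):
    tup = dict(tup, input_order=order)    # tag with the originating input
    emit(tup["host"] + "|" + tup["url"], tup)

def join_reduce(key, values, emit):
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i]["input_order"] != values[j]["input_order"]:
                emit(key, {**values[i], **values[j]})

log1 = [{"host": "a.com", "url": "/x", "t1": 1}]
log2 = [{"host": "a.com", "url": "/x", "t2": 2},
        {"host": "a.com", "url": "/x", "t2": 3}]

groups = defaultdict(list)                # shuffle: group tuples by join key
for rec in log1:
    tagged_map(rec, 1, lambda k, v: groups[k].append(v))
for rec in log2:
    tagged_map(rec, 2, lambda k, v: groups[k].append(v))

out = []
for k, vs in groups.items():
    join_reduce(k, vs, lambda k, v: out.append(v))
# 1 tuple from log1 x 2 from log2 -> 2 joined records; the log2-log2 pair is skipped
```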

SLIDE 15

Chaining of MapReduce Jobs

Hadoop is designed so that the output of one MapReduce job can be fed as input to another MapReduce job


SELECT day_of_week(time) AS dow, SUM(bytes) FROM logfile WHERE response = 200 GROUP BY dow;

Input → Select → Project → Grouped Aggregate
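The whole chain can be sketched end to end in Python, with each stage consuming the previous stage's output just as one MapReduce job feeds the next (a hedged simulation; the function names `select`, `project`, and `grouped_sum` are illustrative):

```python
from collections import defaultdict
from datetime import datetime, timezone

def select(records):
    # Selection: keep only response = 200 (map-only job)
    return [r for r in records if r["response"] == 200]

def project(records):
    # Projection: derive day of week from the time attribute (map-only job)
    out = []
    for r in records:
        dow = datetime.fromtimestamp(r["time"], tz=timezone.utc).strftime("%A")
        out.append(dict(r, dow=dow))
    return out

def grouped_sum(records):
    # Grouped aggregate: sum of bytes per day of week (map-reduce job)
    groups = defaultdict(int)
    for r in records:
        groups[r["dow"]] += r["bytes"]
    return dict(groups)

records = [
    {"time": 807256800, "response": 200, "bytes": 1713},
    {"time": 807256800, "response": 404, "bytes": 0},
    {"time": 807256800, "response": 200, "bytes": 8677},
]
result = grouped_sum(project(select(records)))
```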

SLIDE 16

Pig


A system built on top of Hadoop (now supports Spark as well)
Provides a SQL/ETL-like query language termed Pig Latin
Compiles Pig Latin programs into MapReduce programs

SLIDE 17

Examples

Filter: Return all the lines that have a user-specified response code, e.g., 200.


log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes);
ok_lines = FILTER log BY response == '200';
STORE ok_lines INTO 'filtered_output';

Map

SLIDE 18

Examples

Grouped aggregate Find the total number of bytes per response code


log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
grouped = GROUP log BY response;
grouped_aggregate = FOREACH grouped GENERATE group, SUM(log.bytes);
STORE grouped_aggregate INTO 'grouped_output';

Map Reduce

SLIDE 19

Examples

Grouped aggregate Find the average number of bytes per response code


log = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
grouped = GROUP log BY response;
grouped_aggregate = FOREACH grouped GENERATE group, AVG(log.bytes);
STORE grouped_aggregate INTO 'grouped_output';

SLIDE 20

Examples

Join: Find pairs of requests that ask for the same URL, coming from the same source


log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
joined = JOIN log1 BY (url, host), log2 BY (url, host);

SLIDE 21

Examples

Join: Find pairs of requests that ask for the same URL, coming from the same source and happened within an hour of each other


log1 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
log2 = LOAD 'logs.csv' USING PigStorage() AS (host, time, method, url, response, bytes: int);
joined = JOIN log1 BY (url, host), log2 BY (url, host);
filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000;

SLIDE 22

How it works

LOAD operation

Determines the input path and InputFormat

STORE operation

Determines the output path and OutputFormat

FILTER and FOREACH

Translated into map-only jobs

AGGREGATE and JOIN

Translated into map-reduce jobs

All are compiled into one or more MapReduce jobs


SLIDE 23

Additional Features

Lazy execution

Nothing actually gets executed until the STORE command is reached

Consolidation of map-only jobs

Map-only jobs (FILTER and FOREACH) can be consolidated into the next job's map function or the previous job's reduce function


SLIDE 24

A Complex Example


log1 = LOAD 'logs.csv' USING PigStorage() AS (…);
log2 = LOAD 'logs.csv' USING PigStorage() AS (…);
joined = JOIN log1 BY (url, host), log2 BY (url, host);
filtered = FILTER joined BY ABS(log1::time - log2::time) < 3600000;
grouped = GROUP filtered BY log1::host;
agg_groups = FOREACH grouped GENERATE group, COUNT(filtered);
STORE agg_groups INTO 'final_result';

SLIDE 25

Further Readings

Pig home page: https://pig.apache.org
Detailed documentation: http://pig.apache.org/docs/r0.17.0/
The original Pig Latin paper:

Olston, Christopher, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. "Pig latin: a not-so-foreign language for data processing." In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1099-1110. ACM, 2008.
