
CS535 BIG DATA

PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING

Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535
Spring 2020, Week 2-B (1/29/2020)

FAQs

  • TP0
    • There may be adjustments to your team composition
  • PA1
    • Hadoop and Spark installation video clips are posted

Topics of Today's Class

  • Overview of Programming Assignment 1
  • 3. Distributed Computing Models for Scalable Batch Computing
    • MapReduce


Programming Assignment 1: Hyperlink-Induced Topic Search (HITS)

This material is built based on:

  • Kleinberg, Jon. "Authoritative sources in a hyperlinked environment." Journal of the ACM 46 (5): 604–632.

Types of Web queries

  • Yes/No queries
    • Does Chrome support the .ogv video format?
  • Broad-topic queries
    • Find information about "Coronavirus"
  • Similarity queries
    • Find a person similar to "Justin Bieber"

Image credit: https://www.cnn.com/2020/01/22/world/wuhan-coronavirus-visual-guide-intl/index.html

Image credit: Google search results for "similar to Justin Bieber"


Challenge of content-based ranking for topic search

  • Assume that you are looking for "computer"
    • Does the keyword "computer" appear on the APPLE page?
    • How about IBM's web page?
    • OK... now, Google?
  • The most useful pages often do not include the keyword that users are looking for
    • Pages are not sufficiently descriptive!
  • Semantic mismatch
    • Search keys vs. descriptions

Image credit: https://e360.yale.edu/features/could-massive-storm-surge-barriers-end-the-hudson-rivers-revival

Ranking algorithm to find the most "authoritative" pages for a given topic

  • Goal: find the small set of the most authoritative pages that are relevant to the query
  • Examples of authoritative pages:
    • For the topic "python": https://www.python.org/
    • For information about "Colorado State University": https://www.colostate.edu/

HITS (Hypertext-Induced Topic Search)

  • PageRank captures a simplistic view of a network
  • Authority
    • A Web page with good, authoritative content on a specific topic
    • A Web page that is linked to by many hubs
  • Hub
    • A Web page pointing to many authoritative Web pages
    • e.g., portal pages (Yahoo)

HITS (Hypertext-Induced Topic Search)

  • A.K.A. Hubs and Authorities
  • Jon Kleinberg, 1997
  • Topic search
    • Automatically determines hubs/authorities
  • In practice
    • Performed only on the result set (PageRank is applied to the complete set of documents)
  • Developed for the IBM Clever project
  • Used by Teoma (later Ask.com)

Understanding Authorities and Hubs [1/2]

  • Intuitive idea for finding authoritative results using link analysis:
    • Not all hyperlinks are related to the conferral of authority
  • Pattern that authoritative pages share:
    • Authoritative pages share considerable overlap in the sets of pages that point to them

[Figure: hub pages (left) pointing to authority pages (right)]


Understanding Authorities and Hubs [2/2]

  • A good hub page points to many good authoritative pages
  • A good authoritative page is pointed to by many good hub pages
  • Authorities and hubs have a mutual reinforcement relationship

Calculating Authority/Hub scores [1/3]

Let there be n Web pages. Define the n x n adjacency matrix A such that A_{uv} = 1 if there is a link from page u to page v, and A_{uv} = 0 otherwise.

For the 4-page example graph:

         P1 P2 P3 P4
    P1 [  0  1  1  1 ]
    P2 [  0  0  1  1 ]
    P3 [  1  0  0  1 ]
    P4 [  0  0  0  1 ]

[Figure: the link graph over pages P1-P4]

Calculating Authority/Hub scores [2/3]

Each Web page i has an authority score a_i and a hub score h_i. We define the authority score of page i by summing up the hub scores of the pages that point to it:

    a_i = \sum_{j=1}^{n} h_j A_{ji}

(j: row index in the matrix; i: column index.) This can be written concisely as

    a = A^T h

[Figure: the same link graph and adjacency matrix]

Calculating Authority/Hub scores [3/3]

Similarly, we define the hub score h_i of Web page i by summing up the authority scores a_j of the pages that i points to:

    h_i = \sum_{j=1}^{n} A_{ij} a_j

(i: row index in the matrix; j: column index.) This can be written concisely as

    h = A a

[Figure: the same link graph and adjacency matrix]
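The two update rules translate directly into code. Below is a minimal Java sketch (illustrative, not assignment starter code; class and method names are mine) that runs one authority update a = A^T h and one hub update h = A a on the 4-page example, normalizing so the scores sum to 1. It reproduces the values derived by hand on the following slides.

// Minimal sketch: one HITS update step on the 4-page example matrix.
public class HitsStep {
    // Adjacency matrix: A[u][v] = 1 iff page u links to page v.
    static final double[][] A = {
        {0, 1, 1, 1},
        {0, 0, 1, 1},
        {1, 0, 0, 1},
        {0, 0, 0, 1}
    };

    // a_i = sum_j h_j * A[j][i]   (a = A^T h)
    static double[] updateAuthorities(double[] h) {
        double[] a = new double[A.length];
        for (int i = 0; i < A.length; i++)
            for (int j = 0; j < A.length; j++)
                a[i] += h[j] * A[j][i];
        return a;
    }

    // h_i = sum_j A[i][j] * a_j   (h = A a)
    static double[] updateHubs(double[] a) {
        double[] h = new double[A.length];
        for (int i = 0; i < A.length; i++)
            for (int j = 0; j < A.length; j++)
                h[i] += A[i][j] * a[j];
        return h;
    }

    // Normalize so that the entries sum to 1 (the simpler of the two options).
    static double[] normalize(double[] v) {
        double sum = 0;
        for (double x : v) sum += x;
        for (int i = 0; i < v.length; i++) v[i] /= sum;
        return v;
    }

    public static void main(String[] args) {
        double[] h = {1, 1, 1, 1};                      // h_0: all-one vector
        double[] a = normalize(updateAuthorities(h));   // a_1 = (1/8, 1/8, 1/4, 1/2)
        h = normalize(updateHubs(a));                   // h_1 = (7/22, 6/22, 5/22, 4/22)
        System.out.println(java.util.Arrays.toString(a));
        System.out.println(java.util.Arrays.toString(h));
    }
}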


Hubs and Authorities

Start arbitrarily from a_0 = 1 and h_0 = 1, where 1 is the all-one vector:

    a_0 = (1, 1, 1, 1),  h_0 = (1, 1, 1, 1)

Repeating the updates, the sequences a_0, a_1, a_2, ... and h_0, h_1, h_2, ... converge (to limits x* and y*).

The first authority update, a_1 = A^T h_0:

    a_1 = (0+0+1+0, 1+0+0+0, 1+1+0+0, 1+1+1+1) = (1, 1, 2, 4)

Normalize it so the values sum to 1:

    a_1 = (1/(1+1+2+4), 1/(1+1+2+4), 2/(1+1+2+4), 4/(1+1+2+4)) = (1/8, 1/8, 1/4, 1/2)   <- authority values after the first iteration

[Figure: the link graph and adjacency matrix]

Hubs and Authorities

Next, the hub update, h_1 = A a_1, using the normalized a_1 = (1/8, 1/8, 1/4, 1/2):

    h_1 = (1/8 + 1/4 + 1/2, 1/4 + 1/2, 1/8 + 1/2, 1/2) = (7/8, 6/8, 5/8, 4/8)

After normalization:

    h_1 = (7/22, 6/22, 5/22, 4/22)   <- hub values after the first iteration

[Figure: the link graph and adjacency matrix]

Implementing Topic Search using HITS

  • Step 1: Construct a focused subgraph based on the query
  • Step 2: Iteratively calculate the authority value and hub value of each page in the subgraph

Step 1. Constructing a focused subgraph (root set)

  • Generate a root set R from a text-based search engine
    • e.g., pages containing the query words

[Figure: the root set within the Web graph]

Step 2. Constructing a focused subgraph (base set)

  • For each page p ∈ R (see the sketch below)
    • Add the set of all pages p points to
    • Add the set of all pages pointing to p

[Figure: the base set expanded around the root set]
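A minimal Java sketch of this expansion (illustrative; it assumes outLinks/inLinks maps built from a crawl are available, which the slides do not specify):

import java.util.*;

// Sketch: expand a root set R into the base set.
// outLinks.get(p): pages p points to; inLinks.get(p): pages pointing to p.
public class BaseSetBuilder {
    public static Set<String> buildBaseSet(Set<String> rootSet,
                                           Map<String, Set<String>> outLinks,
                                           Map<String, Set<String>> inLinks) {
        Set<String> base = new HashSet<>(rootSet);
        for (String p : rootSet) {
            base.addAll(outLinks.getOrDefault(p, Collections.emptySet()));
            base.addAll(inLinks.getOrDefault(p, Collections.emptySet()));
        }
        return base;
    }
}

Note that Kleinberg's original paper caps the number of in-linking pages added per root page to keep the base set small; that refinement is omitted in this sketch.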

Step 3. Initial values

Nodes | Hub | Authority
P1    |  1  |  1
P2    |  1  |  1
P3    |  1  |  1
P4    |  1  |  1

Ranks — Hub: P1 = P2 = P3 = P4; Authority: P1 = P2 = P3 = P4

[Figure: the link graph over P1-P4]


Step 4. After the first iteration

Nodes | Hub  | Authority
P1    | 7/22 | 1/8
P2    | 6/22 | 1/8
P3    | 5/22 | 2/8
P4    | 4/22 | 4/8

Ranks — Hub: P1 > P2 > P3 > P4; Authority: P1 = P2 < P3 < P4

Normalization

  • The original paper normalizes so that the squares of the values sum to 1
  • You can instead normalize so that the values sum to 1:
    • value = value / (sum of all values)

Step N. Convergence of scores

  • Repeat the calculation (Step 4) until the scores converge
  • You should specify your stopping criterion (a convergence threshold and/or a maximum number of iterations N)

Do we need to perform the matrix multiplication?

  • Yes and no
    • It will be considered a valid answer; however, I do not recommend it
    • It will take much longer to compute the values
  • A random-walk-style implementation will be more straightforward for this problem (see the sketch after this list)
    • Why? Your dataset is a sparse graph
  • Please see the examples of the PageRank algorithm provided by Apache Spark:
    • https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/graphx/lib/PageRank.html
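As a point of reference, here is a sparse, edge-list version of the HITS iteration in Java (a sketch under the assumption that pages are numbered 0..n-1; the edge list encodes the 4-page example, and the iteration cap and tolerance are illustrative, not assignment requirements):

import java.util.Arrays;

// Sketch: HITS over a sparse edge list, avoiding the dense n x n matrix.
public class SparseHits {
    public static void main(String[] args) {
        int n = 4;
        // {u, v} means page u links to page v (the 4-page example graph).
        int[][] edges = {{0,1},{0,2},{0,3},{1,2},{1,3},{2,0},{2,3},{3,3}};
        double[] auth = new double[n], hub = new double[n];
        Arrays.fill(auth, 1.0);
        Arrays.fill(hub, 1.0);
        int maxIters = 50;        // illustrative cap on N
        double epsilon = 1e-6;    // illustrative convergence tolerance
        for (int iter = 0; iter < maxIters; iter++) {
            double[] newAuth = new double[n], newHub = new double[n];
            for (int[] e : edges) newAuth[e[1]] += hub[e[0]];    // a_v += h_u
            normalize(newAuth);
            for (int[] e : edges) newHub[e[0]] += newAuth[e[1]]; // h_u += a_v
            normalize(newHub);
            double delta = l1Diff(auth, newAuth) + l1Diff(hub, newHub);
            auth = newAuth;
            hub = newHub;
            if (delta < epsilon) break;                          // converged
        }
        System.out.println("authority: " + Arrays.toString(auth));
        System.out.println("hub:       " + Arrays.toString(hub));
    }

    static void normalize(double[] v) {
        double sum = 0;
        for (double x : v) sum += x;
        if (sum > 0) for (int i = 0; i < v.length; i++) v[i] /= sum;
    }

    static double l1Diff(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
        return d;
    }
}

Each pass touches only the existing edges, so the cost is O(|E|) per iteration instead of the O(n^2) of a dense matrix-vector multiply.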

3. Distributed Computing Models for Scalable Batch Computing

  • Section 1. MapReduce
  • Section 2. Apache Spark

3. Distributed Computing Models for Scalable Batch Computing
Section 1. MapReduce

  • a. Introduction to MapReduce

This material is developed based on:

  • Anand Rajaraman, Jure Leskovec, and Jeffrey Ullman, "Mining of Massive Datasets," Cambridge University Press, 2012, Chapter 2
    • Download this chapter from the CS435 schedule page
  • Tom White, "Hadoop: The Definitive Guide," 3rd Edition, O'Reilly, 2014
  • Donald Miner and Adam Shook, "MapReduce Design Patterns," O'Reilly, 2013

What is MapReduce?

MapReduce [1/2]

  • MapReduce is inspired by the concepts of map and reduce in Lisp
  • "Modern" MapReduce
    • Developed within Google as a mechanism for processing large amounts of raw data
      • e.g., crawled documents or web request logs
    • Distributes the data across thousands of machines
    • The same computation is performed on each CPU, with a different portion of the dataset

MapReduce [2/2]

  • MapReduce provides an abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing, and fault tolerance

Mapper

  • A Mapper maps input key/value pairs to a set of intermediate key/value pairs
  • Maps are the individual tasks that transform input records into intermediate records
    • The transformed intermediate records do not need to be of the same type as the input records
    • A given input pair may map to zero or many output pairs
  • The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job

Reducer

  • A Reducer reduces a set of intermediate values that share a key to a smaller set of values
  • A Reducer has 3 primary phases: shuffle, sort, and reduce
    • Shuffle
      • The input to the reducer is the sorted output of the mappers
      • The framework fetches the relevant partition of the output of all the mappers via HTTP
    • Sort
      • The framework groups the reducer input by key


MapReduce Example 1


Example 1: NCDC data example

  • A National Climatic Data Center (NCDC) record
  • Find the maximum temperature of each year (1900–1999)

Sample record (one field per line, annotated):

0057
332130     # USAF weather station identifier
99999      # WBAN weather station identifier
19500101   # observation date
0300       # observation time
4
+51317     # latitude (degrees x 1000)
+028783    # longitude (degrees x 1000)
FM-12
+0171      # elevation (meters)
99999
V020
320        # wind direction (degrees)
1          # quality code
N

The first entries for 1990

% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz

Analyzing the data with Unix Tools (1/2)

  • A program for finding the maximum recorded temperature by year from NCDC weather records

#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done

Analyzing the data with Unix Tools (2/2)

  • The script loops through the compressed year files
    • Printing the year
    • Processing each file using awk
  • awk extracts two fields: the air temperature and the quality code
  • It checks whether the reading is valid and greater than the maximum value seen so far

% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...

Results?

  • The complete run for the century took 42 minutes
  • To speed up the processing
    • We need to run parts of the program in parallel
    • Process different years in different processes
  • What will be the problems?

Challenges

  • Dividing the work into equal-size pieces
    • Data size per year?
  • Combining the results from independent processes
    • Combining results and sorting by year?
  • You are still limited by the processing capacity of a single machine (the slowest one)!

Map and Reduce

  • MapReduce works by breaking the processing into two phases
    • The map phase
    • The reduce phase
  • Each phase has key-value pairs as input and output
  • Programmers should specify:
    1. The types of the input/output key-values
    2. The map function
    3. The reduce function
  • Optional components (wired up in the driver sketch below):
    1. Combiner
    2. Partitioner
    3. InputFormat/OutputFormat
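To make the division of responsibilities concrete, here is a minimal Hadoop driver sketch (not from the slides; the MaxTemperatureMapper/MaxTemperatureReducer class names are placeholders for the map and reduce functions sketched a few slides below, and the input/output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperatureDriver.class);

        // 1. Types of the output key-value pairs
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 2-3. The map and reduce functions (placeholder class names)
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        // Optional: a combiner (safe here because max is associative and commutative)
        job.setCombinerClass(MaxTemperatureReducer.class);

        // Optional: InputFormat/OutputFormat default to text; paths come from args
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}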

Visualizing the way MapReduce works (1/3)

Sample lines of input data:

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

These lines are presented to the map function as key-value pairs. The keys are the line offsets within the file (which the map function can ignore):

(0,   0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)

Visualizing the way MapReduce works (2/3)

The map function extracts the year and the air temperature and emits them as its output:

(1950, 0)
(1950, 22)
(1950, -11)
(1949, 111)
(1949, 78)

These output key-value pairs are sorted and grouped by key (the values passed to each reducer are NOT sorted). Our reduce function will see the following input:

(1949, [111, 78])
(1950, [0, 22, -11])

Visualizing the way MapReduce works (3/3)

The reduce function iterates through each list and picks the maximum reading. This is the final output:

(1949, 111)
(1950, 22)

[Figure: the end-to-end flow — map input (offset, record) pairs -> map output -> shuffle/sort -> reduce -> final output]
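A minimal mapper/reducer pair that produces exactly this flow is sketched below (illustrative, not provided course code; the field offsets follow the awk script shown earlier, with the year in columns 16-19 of the record):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);   // year field of the record
        // Temperature: columns 88-92 (1-based), sign included; quality: column 93
        int airTemperature;
        if (line.charAt(87) == '+') {
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}

class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));   // (year, max temperature)
    }
}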

MapReduce Example 2

Example 2: WordCount [1/5]

  • For text files stored under /usr/joe/wordcount/input, count the number of occurrences of each word
  • How do the files and directories look?

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World, Bye World!
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop, Goodbye to hadoop.


Example 2: WordCount [2/5]

  • Run the MapReduce application

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop, 1
Hello 2
World! 1
World, 1
hadoop. 1
to 1

Example 2: WordCount [3/5]

Mappers
  1. Read a line
  2. Tokenize the string
  3. Pass the <key, value> output to the reducer

Reducers
  1. Collect the <key, value> pairs sharing the same key
  2. Aggregate the total number of occurrences

What do you have to pass from the Mappers?

Example 2: WordCount [4/5]

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

Example 2: WordCount [5/5]

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

MapReduce Data Flow

MapReduce data flow with a single reducer

[Figure: splits 0/1/2 -> map -> sort -> copy to the reducer -> merge -> reduce -> one part file written to HDFS with replication]


MapReduce data flow with multiple reducers

[Figure: each map's sorted output is partitioned, one partition per reducer; each reducer copies and merges its partition and writes its own part file to HDFS with replication]

Exercise

Design your map and reduce functions to perform the following data processing task.

Instagram currently has 1 billion users. Find the 10 Instagram users who have the highest number of followers for each age group. The range of an age group is one year (e.g., 25, 26, 27, 28, ...).

The data is formatted as follows: {InstagramUserID, TAB, date_of_birth, TAB, number_of_followers, LINEFEED}. Assume that each line will be used as the input to a map function.

Question 1: What are the input/output/functionality of your mapper?
Question 2: What are the input/output/functionality of your reducer?

Answer

  • Assume that all the InstagramUserIDs are unique.

(1) Mapper
  • Input: <dummy key (e.g., file offset), a line of the input file>
  • Functionality: Tokenize the string and calculate the age; generate an output
  • Output: <age, [InstagramUserID, number_of_followers]>

(2) Reducer
  • Input: <age, a list of [InstagramUserID, number_of_followers]>
  • Functionality: Scan the list of values and identify the top 10 Instagram users with the highest number of followers
  • Output: <age, a list of users>

Better Answer: Top-N design pattern

  • Assume that all the InstagramUserIDs are unique.

(1) Mapper
  • Input: <dummy key (e.g., file offset), a line of the input file>
  • Functionality:
    • Create data structures (e.g., HashMaps local_top10_25, local_top10_26, ...) to store the local top-10 information (user ID and number of followers) per age
    • Tokenize the string and calculate the age
    • If this user ranks among the local top 10 seen so far for its age group, update local_top10_x
    • After the input split is completely scanned, generate the output from local_top10_x
  • Output: <age-x, local_top10_x>

(2) Reducer
  • Input: <age-x, a list of [local_top10_x]>
  • Functionality: Scan the list of values and identify the top 10 users with the highest number of followers
  • Output: <age-x, a list of users>

  • This approach significantly reduces the communication within your MR cluster

Better Answer: Top-N design pattern: More Info

  • Structure of the Top-N pattern

[Figure: each input split feeds a filter mapper that emits a local top 10; a single TopTen reducer merges the local lists into the final top 10]

Better Answer: Top-N design pattern: More Info

public static class TopTenMapper extends Mapper<Object, Text, IntWritable, Text> {
    // Create TreeMap(s) for each age. You can maintain a HashMap with the age as
    // the key and a TreeMap as the value. This example serves only one age; to
    // serve multiple ages, retrieve the TreeMap corresponding to the record's age.
    private TreeMap<Integer, Text> localTop10 = new TreeMap<Integer, Text>();
    private IntWritable age = new IntWritable();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Your code to extract the age and other attributes
        // Your code to evaluate the current number of followers
        localTop10.put(number_of_followers, new Text(your_value));
        if (localTop10.size() > 10) {
            localTop10.remove(localTop10.firstKey()); // evict the smallest count
        }
    }

    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Output our ten records to the reducers with the age as the key
        for (Text t : localTop10.values()) {
            context.write(age, t);
        }
    }
}
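The slides show only the mapper; for completeness, a matching reducer sketch is given below (illustrative, not course-provided code; extractFollowers() is a hypothetical helper that parses the follower count back out of the record text, and the imports match the earlier WordCount example plus java.util.TreeMap):

public static class TopTenReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable age, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Merge all local top-10 records for this age, keyed by follower count,
        // and keep only the 10 largest.
        TreeMap<Integer, Text> globalTop10 = new TreeMap<Integer, Text>();
        for (Text value : values) {
            int followers = extractFollowers(value.toString()); // hypothetical helper
            globalTop10.put(followers, new Text(value));
            if (globalTop10.size() > 10) {
                globalTop10.remove(globalTop10.firstKey());     // evict the smallest
            }
        }
        for (Text record : globalTop10.descendingMap().values()) {
            context.write(age, record);                         // global top 10 for this age
        }
    }
}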


Better Answer: Top-N design pattern: More Info

  • A map function can generate 0 or more outputs.
  • setup() and cleanup() are called "only once" for each Mapper and Reducer. So, if there are 20 mappers running (10,000 inputs each), setup/cleanup will each be called only 20 times.
  • Example (the framework's run() method):

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
    } finally {
        cleanup(context);
    }
}

Comparison with other systems

  • MPI vs. MapReduce
    • MapReduce tries to collocate the data with the compute node
    • Data access is fast because the data is local!
  • Volunteer computing vs. MapReduce
    • SETI@home uses donated CPU time
    • What are the differences between MapReduce and SETI@home?

[Figure: MapReduce execution overview — the user program forks a master and workers; the input files are split into shards, map workers write intermediate files to their local disks, and reduce workers produce the output files]

  1. The library shards the input files into M pieces
  2. It starts up many copies of the program (one master, many workers)
  3. The master assigns work to the workers
  4. Each map worker reads the contents of its corresponding input shard, then parses and passes the key-value pairs to the Map function
  5. Buffered pairs are written to the local disk; their locations are reported to the Master, which forwards them to the appropriate reducers
  6. Each reduce worker accesses the locations notified by the Master and performs the reduce function
  7. Local write (output files)
  8. The master wakes up the user program

Data locality optimization

  • Hadoop tries to run each map task on a node where the input data resides in HDFS
    • This minimizes the use of cluster bandwidth
  • If all of the nodes holding the block's replicas are running other map tasks
    • The job scheduler will look for a free map slot on a node in the same rack

Data movement in Map tasks

Shuffle

  • The process by which the system performs the sort and transfers the map outputs to the reducers as inputs
  • MapReduce guarantees that the input to every reducer is sorted by key

Combiner functions

  • Minimize the data transferred between the map and reduce tasks
  • Users can specify a combiner function
    • To be run on the map output
    • To replace the map output with the combiner output

Combiner example

  • Example (from the previous max-temperature example)
  • The first map produced: (1950, 0), (1950, 20), (1950, 10)
  • The second map produced: (1950, 25), (1950, 15)
  • The reduce function is called with a list of all the values: (1950, [0, 20, 10, 25, 15])
  • The output will be: (1950, 25)
  • We may express the function as:

    max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

Combiner function

  • Runs a local reduce over the map output
  • Reduces the amount of data shuffled between the mappers and the reducers
  • A combiner cannot replace the reduce function
    • Why? A combiner sees only the output of a single map task, so some function must still aggregate across all map tasks; moreover, the framework may run the combiner zero or more times, so the final result cannot depend on it


Questions?