Hadoop MapReduce

MapReduce 2-in-1
- A programming paradigm
- A query execution engine
- A kind of functional programming
- We focus on the MapReduce execution engine of Hadoop through YARN

Logical View of MapReduce
- During MapReduce, the input and output are each treated as a set of key-value pairs $\langle k, v \rangle$
- Data flow: Input $\{\langle k_1, v_1 \rangle\}$ → Map → Intermediate $\{\langle k_2, v_2 \rangle\}$ → Reduce → Output $\{\langle k_3, v_3 \rangle\}$

Map and Reduce Functions
- Map function
  - Maps a single input record to a set (possibly empty) of intermediate records
  - Map: $\langle k_1, v_1 \rangle \rightarrow \{\langle k_2, v_2 \rangle\}$
- Reduce function
  - Reduces a set of intermediate records with the same key to a set (possibly empty) of output records
  - Reduce: $\langle k_2, \{v_2\} \rangle \rightarrow \{\langle k_3, v_3 \rangle\}$
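To make the two signatures concrete, here is a minimal word-count sketch in plain Python. It is an in-memory illustration only, not Hadoop's Java API; the names `map_fn`, `reduce_fn`, and `run_job` are made up for this example.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: one input record -> a (possibly empty) list of intermediate records."""
    # key is the line offset (unused here); value is the line text.
    return [(word.lower(), 1) for word in value.split()]

def reduce_fn(key, values):
    """Reduce: all values sharing one key -> a (possibly empty) list of output records."""
    return [(key, sum(values))]

def run_job(records):
    # Group the intermediate records by key, then reduce each group.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

records = [(0, "to be or not"), (13, "to be")]
result = run_job(records)
print(result)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

An empty input line produces no intermediate records, which is the "possibly empty" case in the map signature.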

Functional Programming
- MapReduce is functional programming
- Both map and reduce functions are memoryless/stateless:
  - They cannot keep an internal state
  - They cannot remember previous records
  - They cannot be randomized
- Why? To allow Hadoop to parallelize the execution:
  - Execute them out-of-order
  - Rerun failing tasks
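The following sketch illustrates why statelessness matters. A stateless map function gives the same per-record results whether records are processed in order, out of order, or twice (as after a task rerun), while a function that keeps a counter does not. All names here are hypothetical, and the "scheduler" is just a Python loop.

```python
def stateless_map(record):
    # The output depends only on the record passed in.
    return [(len(record), record)]

class StatefulMap:
    """A map function that (incorrectly) remembers how many records it has seen."""
    def __init__(self):
        self.seen = 0

    def __call__(self, record):
        self.seen += 1
        return [(self.seen, record)]

def run_maps(map_fn, records, order):
    # Process records in the given order; a rerun overwrites the earlier result.
    results = {}
    for i in order:
        results[i] = map_fn(records[i])
    return [results[i] for i in range(len(records))]

records = ["alpha", "be", "gamma", "pi"]

in_order = run_maps(stateless_map, records, [0, 1, 2, 3])
out_of_order = run_maps(stateless_map, records, [2, 0, 3, 1])
with_rerun = run_maps(stateless_map, records, [0, 1, 1, 2, 3])
print(in_order == out_of_order == with_rerun)  # True

stateful_in_order = run_maps(StatefulMap(), records, [0, 1, 2, 3])
stateful_out_of_order = run_maps(StatefulMap(), records, [2, 0, 3, 1])
print(stateful_in_order == stateful_out_of_order)  # False
```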

Overview
[Figure labels: Developer, MR Program, Driver, MR Job, Master node, Slave nodes]

Job Execution Overview
[Figure: Driver → Job submission → Job preparation → Map → Shuffle → Reduce → Cleanup]

Job Submission
- Execution location: Driver node
- A driver machine should have the following:
  - Compatible Hadoop binaries
  - Cluster configuration files
  - Network access to the master node
- Collects job information from the user:
  - Input and output paths
  - Map, reduce, and any other functions
  - Any additional user configuration
- Packages all this in a Hadoop Configuration

Hadoop Configuration

Key (String)    Value (String)
Input           hdfs://user/eldawy/README.txt
Output          hdfs://user/eldawy/wordcount
Mapper          edu.ucr.cs.cs167.eldawy.WordCount …
Reducer         …
JAR File
User-defined    User-defined

The configuration is serialized over the network to the master node.
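As a rough illustration of a string-to-string configuration that can be shipped over the network, the sketch below uses a Python dict and JSON. This is an assumption made for the example: JSON is only a stand-in for whatever wire format Hadoop uses, the key `example.num.reducers` is invented, and `get_int` is a hypothetical helper showing how a typed value can be recovered from a string.

```python
import json

# All keys and values are strings, as in the table above.
config = {
    "Input": "hdfs://user/eldawy/README.txt",
    "Output": "hdfs://user/eldawy/wordcount",
    "example.num.reducers": str(2),  # hypothetical user-defined entry
}

# Driver side: serialize the configuration to bytes.
wire = json.dumps(config).encode("utf-8")

# Master side: deserialize the bytes back into the same mapping.
received = json.loads(wire.decode("utf-8"))

def get_int(conf, key, default):
    """Parse a string-valued entry as an integer, falling back to a default."""
    return int(conf.get(key, default))

print(received == config)
print(get_int(received, "example.num.reducers", 1))
```

Keeping every value a string means the mapping can carry arbitrary user-defined settings without the framework knowing their types in advance.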

Job Preparation
- Runs on the master node
- Gets the job ready for parallel execution
- Collects the JAR file that contains the user-defined functions, e.g., Map and Reduce
- Writes the JAR and configuration to HDFS to be accessible by the executors
- Looks at the input file(s) to decide how many map tasks are needed
- Makes some sanity checks
- Finally, it pushes the BRB (Big Red Button) to start the job

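Job Preparation
[Figure: On the master node, the configuration and the JAR file are written to HDFS. InputFormat#getSplits() produces Split 1, Split 2, ..., Split M, and each split is assigned to its own mapper (Mapper 1, Mapper 2, ..., Mapper M). A FileInputSplit consists of a Path, a Start, and an End.]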
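The idea of cutting an input file into (Path, Start, End) splits can be sketched as follows. This is a simplified assumption, not Hadoop's `getSplits()` logic, which takes more into account (for example, HDFS block boundaries). The function name `get_splits` and the sizes are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileInputSplit:
    path: str
    start: int  # first byte offset of the split (inclusive)
    end: int    # byte offset just past the split (exclusive)

def get_splits(path, file_size, split_size):
    """Cut a file of file_size bytes into consecutive splits of at most split_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    for start in range(0, file_size, split_size):
        end = min(start + split_size, file_size)
        splits.append(FileInputSplit(path, start, end))
    return splits

splits = get_splits("hdfs://user/eldawy/README.txt", file_size=300, split_size=128)
for i, split in enumerate(splits, start=1):
    print(f"Split {i} -> Mapper {i}: bytes [{split.start}, {split.end})")
# Split 1 -> Mapper 1: bytes [0, 128)
# Split 2 -> Mapper 2: bytes [128, 256)
# Split 3 -> Mapper 3: bytes [256, 300)
```

The number of splits returned is the number of map tasks M in this simplified model.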

Map Phase
- Runs in parallel on worker nodes
- M mappers:
  - Read the input
  - Apply the map function
  - Apply the combine function (if configured)
  - Store the map output
- There is no guaranteed ordering for processing the input splits
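A combine function can be thought of as a local reduce applied to one mapper's output before it leaves the node. The sketch below is a plain-Python approximation under that assumption; `run_map_task` and `combine_fn` are hypothetical names, and word count is used because summation can safely be applied in stages.

```python
from collections import defaultdict

def map_fn(line):
    return [(word, 1) for word in line.split()]

def combine_fn(key, values):
    # Same logic as the word-count reducer, applied locally on the map side.
    return [(key, sum(values))]

def run_map_task(split_lines, combiner=None):
    # Read the input and apply the map function.
    output = []
    for line in split_lines:
        output.extend(map_fn(line))
    if combiner is None:
        return output
    # Apply the combine function to each local group of equal keys.
    groups = defaultdict(list)
    for key, value in output:
        groups[key].append(value)
    combined = []
    for key in sorted(groups):
        combined.extend(combiner(key, groups[key]))
    return combined

split = ["a b a", "b a"]
raw = run_map_task(split)
combined = run_map_task(split, combine_fn)
print(len(raw), raw)            # 5 [('a', 1), ('b', 1), ('a', 1), ('b', 1), ('a', 1)]
print(len(combined), combined)  # 2 [('a', 3), ('b', 2)]
```

In this example the combiner shrinks the map output from five records to two, which is the data that would otherwise have to be transferred to the reducers.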

Map Phase
[Figure: The master node assigns the input splits IS 1, IS 2, ..., IS M to map tasks]

Map Task
- Reads the job configuration and task information (mainly, InputSplit)
- Instantiates an object of the Mapper class
- Instantiates a record reader for the assigned input split
- Calls Mapper#setup(Context)
- Reads records one-by-one from the record reader and passes them to the map function
- The map function writes the output to the context
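The steps of a map task can be sketched as a small driver loop. This is a plain-Python imitation of the lifecycle described above, not the Hadoop classes themselves: `Context`, `UpperCaseMapper`, `record_reader`, and the `prefix` setting are all hypothetical stand-ins.

```python
class Context:
    """Holds the job configuration and collects the map output."""
    def __init__(self, config):
        self.config = config
        self.output = []

    def write(self, key, value):
        self.output.append((key, value))

class UpperCaseMapper:
    def setup(self, context):
        # Read job-wide settings once, before any record is processed.
        self.prefix = context.config.get("prefix", "")

    def map(self, key, value, context):
        context.write(key, self.prefix + value.upper())

def record_reader(lines):
    # Yield (byte offset, line) records, assuming one newline byte per line.
    offset = 0
    for line in lines:
        yield offset, line
        offset += len(line) + 1

def run_map_task(mapper_class, split_lines, config):
    context = Context(config)            # job configuration and task information
    mapper = mapper_class()              # instantiate the Mapper class
    reader = record_reader(split_lines)  # record reader for the assigned split
    mapper.setup(context)                # setup is called once per task
    for key, value in reader:            # records are passed one by one
        mapper.map(key, value, context)  # map writes its output to the context
    return context.output

out = run_map_task(UpperCaseMapper, ["ab", "cde"], {"prefix": ">"})
print(out)  # [(0, '>AB'), (3, '>CDE')]
```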

MapContext
- Keeps track of which input split is being read and which records are being processed
- Holds all the job configuration and some additional information about the map task
- Materializes the map output

Map Output
- What really happens to the map output? It depends on the number of reducers:
  - 0 reducers: Map output is written directly to HDFS as the final answer
  - 1+ reducers: Map output is passed to the shuffle phase

Shuffle Phase
- Executed only in the case of one or more reducers
- Transfers data between the mappers and reducers
- Groups records by their keys to ensure local processing in the reduce phase

Shuffle Phase
[Figure: Each of Map 1, Map 2, Map 3, ..., Map M sends data to each of Reduce 1, Reduce 2, ..., Reduce N]

Shuffle Phase (Map-side)
[Figure: Map i reads its input split, applies the map function, and partitions the resulting key-value pairs into partitions 0, 1, ..., N-1, one for each of Reduce 1, Reduce 2, ..., Reduce N]
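One common way to assign map output records to partitions 0 to N-1 is to hash the key and take the result modulo the number of reducers. The sketch below assumes that approach and uses CRC32 as a deterministic stand-in hash; the function names are hypothetical and this is not presented as Hadoop's partitioner.

```python
import zlib

def partition(key, num_reducers):
    """Map a string key to a partition index in [0, num_reducers)."""
    # CRC32 is deterministic across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_map_output(records, num_reducers):
    # One bucket per reducer; every record lands in exactly one bucket.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in records:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions

map_output = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
partitions = partition_map_output(map_output, num_reducers=3)
for index, bucket in enumerate(partitions):
    print(index, bucket)
```

Because the partition index depends only on the key, all records with the same key end up at the same reducer, no matter which mapper produced them.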

Shuffle Phase (Reduce-side)
[Figure: Reduce j copies its partition (part 1, part 2, part 3, ..., part M) from each of Map 1, Map 2, Map 3, ..., Map M, sorts the copied key-value pairs by key, and then applies the reduce function]
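The copy, sort, and reduce steps on the reduce side can be sketched with a merge of per-mapper parts followed by grouping on the key. This is a small in-memory approximation; the data layout (`map_outputs[m][j]` is the part that mapper m produced for reducer j) and the function names are assumptions made for the example.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def reduce_side_shuffle(map_outputs, j):
    # Copy: fetch part j from every mapper and make sure each part is sorted by key.
    parts = [sorted(mapper_parts[j], key=itemgetter(0)) for mapper_parts in map_outputs]
    # Sort: merge the sorted parts into one key-ordered stream.
    merged = heapq.merge(*parts, key=itemgetter(0))
    # Group: collect the values of each run of equal keys.
    return [(key, [value for _, value in group])
            for key, group in groupby(merged, key=itemgetter(0))]

def reduce_fn(key, values):
    return [(key, sum(values))]

map_outputs = [
    [[("a", 1), ("c", 1)], [("b", 1)]],   # Map 1: part 0, part 1
    [[("a", 1)], [("b", 1), ("d", 1)]],   # Map 2: part 0, part 1
]

groups = reduce_side_shuffle(map_outputs, j=0)
print(groups)  # [('a', [1, 1]), ('c', [1])]

output = [record for key, values in groups for record in reduce_fn(key, values)]
print(output)  # [('a', 2), ('c', 1)]
```

Sorting before grouping is what lets the reduce function see all values of one key together in a single pass over the merged stream.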

Reduce Phase
- Apply the reduce function to each group of records that share the same key
[Figure: The sorted key-value pairs are grouped by key (k 1, k 2, k 3, ..., k N); reduce is called once per group and its results are written to the output]

Output Writing
- Materializes the final output to disk
- All results from one process (mapper/reducer) are stored in a subdirectory
- An OutputFormat is used to:
  - Create any files in the output directory
  - Write the output records one-by-one to the output
  - Merge the results from all the tasks (if needed)
- While the output writing runs in parallel, the final commit step runs on a single machine
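The write-then-commit pattern can be sketched on a local file system. In this illustration each task writes into its own subdirectory, and a single commit step then moves the files into the output directory. The directory and file names (`_temporary`, `part-00000`, `_SUCCESS`) are loosely borrowed from Hadoop conventions but should be read as illustrative assumptions, as should the function names.

```python
import os
import shutil
import tempfile

def write_task_output(output_dir, task_id, records):
    """Run by each task, possibly in parallel: write records to a private subdirectory."""
    task_dir = os.path.join(output_dir, "_temporary", f"task_{task_id}")
    os.makedirs(task_dir, exist_ok=True)
    path = os.path.join(task_dir, f"part-{task_id:05d}")
    with open(path, "w", encoding="utf-8") as f:
        for key, value in records:
            f.write(f"{key}\t{value}\n")
    return path

def commit_job(output_dir):
    """Run once, on a single machine: publish all task outputs and mark the job done."""
    temp_root = os.path.join(output_dir, "_temporary")
    for task_dir in sorted(os.listdir(temp_root)):
        full_task_dir = os.path.join(temp_root, task_dir)
        for name in os.listdir(full_task_dir):
            os.replace(os.path.join(full_task_dir, name), os.path.join(output_dir, name))
    shutil.rmtree(temp_root)
    # An empty marker file signals that the output is complete.
    open(os.path.join(output_dir, "_SUCCESS"), "w").close()
    return sorted(os.listdir(output_dir))

output_dir = tempfile.mkdtemp()
write_task_output(output_dir, 0, [("a", 2)])
write_task_output(output_dir, 1, [("b", 1)])
files = commit_job(output_dir)
print(files)  # ['_SUCCESS', 'part-00000', 'part-00001']
```

Writing to a private subdirectory first means a failed or rerun task never leaves partial files in the final output location.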

Advanced Issues
- Map failures
- Reduce failures
- Straggler problem
- Custom keys and values
- Efficient sorting on serialized data
- Pipeline MapReduce jobs
