SLIDE 1

Large-Scale Data Management (Gestion de Données à Grande Échelle)

MapReduce and Hadoop

Francieli ZANON BOITO

francieli.zanon-boito@inria.fr November 2018

SLIDE 2

References

  • Slides by Thomas Ropars
  • Coursera – Big Data, University of California San Diego
  • The lecture notes of V. Leroy
  • Designing Data-Intensive Applications by Martin Kleppmann
  • Mining of Massive Datasets by Leskovec et al.


SLIDE 3

In today's class

  • The MapReduce paradigm for big data processing, and its most popular implementation (Apache Hadoop)
  • Main ideas and how it works
  • In the lab session (TP): put it into practice

SLIDE 4

History

  • First publications

○ "The Google File System", S. Ghemawat et al., 2003
○ "MapReduce: Simplified Data Processing on Large Clusters", J. Dean and S. Ghemawat, 2004

  • Used to implement several tasks:

○ Building the indexing system for Google Search
○ Extracting properties of web pages
○ Graph processing, etc.

  • Google does not use MapReduce anymore*

○ The amount of data they handle increased too much
○ They moved on to more efficient technologies

* https://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system

SLIDE 5

History

  • Apache Hadoop: open-source MapReduce framework

○ Implemented by people working at Yahoo!, released in 2006

  • Now it is a full ecosystem, used by many companies

○ Notably, Facebook*
○ HDFS @ Yahoo!: 600PB on 35K servers**
○ Criteo: 42K cores, 150PB, 300K jobs per day***

* https://dzone.com/articles/how-is-facebook-deploying-big-data
** http://yahoohadoop.tumblr.com/post/138739227316/hadoop-turns-10
*** http://labs.criteo.com/about-us/

SLIDE 6

Main elements

  • A distributed computing execution framework
  • Data represented as key-value pairs
  • A distributed file system
  • Two main operations on data: Map and Reduce


SLIDE 7

Map and Reduce

  • The Map operation

○ Transformation operation
○ A function is applied to each element of the input set
○ map( f )[ x0, ..., xn ] = [ f (x0), ..., f (xn) ]
○ map(∗2)[2, 3, 6] = [4, 6, 12]

  • The Reduce operation

○ Aggregation operation (fold)
○ reduce( f )[ x0, ..., xn ] = f ( x0, f ( x1, ..., f ( xn-1, xn ) ... ) )
○ reduce(+)[2, 3, 6] = (2 + (3 + 6)) = 11
○ In MapReduce, Reduce is applied to all the elements with the same key
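Both operations have direct counterparts as ordinary higher-order functions. A quick illustration in plain Python (not Hadoop code), reproducing the slide's two examples:

```python
from functools import reduce

# map: apply a function to each element of the input set
doubled = list(map(lambda x: x * 2, [2, 3, 6]))
print(doubled)  # [4, 6, 12]

# reduce: fold all the elements into a single value
total = reduce(lambda a, b: a + b, [2, 3, 6])
print(total)  # 11
```

Note that `functools.reduce` folds from the left, while the slide's notation folds from the right; for an associative operation like + the result is the same.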

SLIDE 8

Why is it popular?

  • “Simple” to program and execute

○ Handles the distribution of data and of the computation
○ Detects failures and automatically takes corrective actions

  • Scales to a large number of nodes

○ Data parallelism (as opposed to task parallelism): running the same task on different pieces of data in parallel
○ Move the computation instead of the data

■ The distributed file system is central
■ Execute tasks on the nodes where their data is

Code is small (just text), so it is cheaper to move the code than the data!

SLIDE 9

Why is it popular?

  • Fault tolerance

○ Data replication by the distributed file system
○ Intermediate results are written to disk
○ Failed tasks are re-executed on other nodes
○ Tasks can be executed multiple times in parallel to deal with stragglers (slow nodes)

SLIDE 10

Agenda

  • Introduction
  • A first MapReduce program
  • Apache Hadoop

○ MapReduce
○ HDFS
○ Yarn

  • Combiners

SLIDE 11

A first MapReduce program: word counter

  • We want to count the occurrences of words in a text
  • Input: a set of lines, each line is a pair < line number, line content >
  • Output: a set of pairs < word, number of occurrences >

Input:
< 1, "aaa bb ccc" >
< 2, "aaa bb" >

Output:
< "aaa", 2 >
< "bb", 2 >
< "ccc", 1 >

SLIDE 12

map(key, value):
    for each word in value:
        output pair(word, 1)

Input:
1, "aaa bb ccc"
2, "bb bb d"
3, "d aaa bb"
4, "d"

SLIDE 13

map(key, value):
    for each word in value:
        output pair(word, 1)

1, "aaa bb ccc"  →  "aaa", 1   "bb", 1   "ccc", 1

SLIDE 14

map(key, value):
    for each word in value:
        output pair(word, 1)

1, "aaa bb ccc"  →  "aaa", 1   "bb", 1   "ccc", 1
2, "bb bb d"     →  "bb", 1   "bb", 1   "d", 1

SLIDE 15

map(key, value):
    for each word in value:
        output pair(word, 1)

1, "aaa bb ccc"  →  "aaa", 1   "bb", 1   "ccc", 1
2, "bb bb d"     →  "bb", 1   "bb", 1   "d", 1
3, "d aaa bb"    →  "d", 1   "aaa", 1   "bb", 1
4, "d"           →  "d", 1

SLIDE 16

map(key, value):
    for each word in value:
        output pair(word, 1)

1, "aaa bb ccc"  →  "aaa", 1   "bb", 1   "ccc", 1
2, "bb bb d"     →  "bb", 1   "bb", 1   "d", 1
3, "d aaa bb"    →  "d", 1   "aaa", 1   "bb", 1
4, "d"           →  "d", 1

reduce(key, values):
    result = 0
    for value in values:
        result += value
    output pair(key, result)

SLIDE 17

map(key, value):
    for each word in value:
        output pair(word, 1)

reduce(key, values):
    result = 0
    for value in values:
        result += value
    output pair(key, result)

All the "aaa" pairs are grouped:  "aaa", [1,1]  →  reduce  →  "aaa", 2

SLIDE 18

map(key, value):
    for each word in value:
        output pair(word, 1)

reduce(key, values):
    result = 0
    for value in values:
        result += value
    output pair(key, result)

"aaa", [1,1]      →  "aaa", 2
"bb", [1,1,1,1]   →  "bb", 4

SLIDE 19

map(key, value):
    for each word in value:
        output pair(word, 1)

1, "aaa bb ccc"  →  "aaa", 1   "bb", 1   "ccc", 1
2, "bb bb d"     →  "bb", 1   "bb", 1   "d", 1
3, "d aaa bb"    →  "d", 1   "aaa", 1   "bb", 1
4, "d"           →  "d", 1

reduce(key, values):
    result = 0
    for value in values:
        result += value
    output pair(key, result)

Final output:  "aaa", 2   "bb", 4   "ccc", 1   "d", 3
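The whole word-counter pipeline can be simulated in a few lines of plain Python. The names below (map_fn, reduce_fn, the explicit grouping dictionary) are illustrative stand-ins for what the framework does across many nodes:

```python
from collections import defaultdict

def map_fn(key, value):
    # one (word, 1) pair per word in the line
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # sum all the 1s emitted for this word
    return (key, sum(values))

lines = {1: "aaa bb ccc", 2: "bb bb d", 3: "d aaa bb", 4: "d"}

# Map phase: run map_fn once per input pair
pairs = [p for k, v in lines.items() for p in map_fn(k, v)]

# Shuffle & sort: group all values by key
groups = defaultdict(list)
for word, one in pairs:
    groups[word].append(one)

# Reduce phase: run reduce_fn once per key
result = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(result)  # {'aaa': 2, 'bb': 4, 'ccc': 1, 'd': 3}
```

In the real framework the three phases run on different machines; here the grouping dictionary plays the role of the shuffle.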

SLIDE 20

Input:          1, "aaa bb ccc"   2, "bb bb d"   3, "d aaa bb"   4, "d"
Map output:     "aaa", 1   "bb", 1   "ccc", 1   "bb", 1   "bb", 1   "d", 1   "d", 1   "aaa", 1   "bb", 1   "d", 1
Reduce output:  "aaa", 2   "bb", 4   "ccc", 1   "d", 3

But we generate a lot of intermediate data! Why not keep a centralized counter per word? That's the price we pay for scalability! Let's see how it works.


SLIDE 21

Agenda

  • Introduction
  • A first MapReduce program
  • Apache Hadoop

○ MapReduce
○ HDFS
○ Yarn

  • Combiners

SLIDE 22

MapReduce

  • The developer defines:

○ map and reduce functions to manipulate key-value pairs
○ key and value types (the map output needs to match the reduce input)

  • The map function will be executed once per input pair
  • The reduce function will be executed once per existing key (with all the values associated with that key)

SLIDE 23

Figure from https://www.supinfo.com/articles/single/2807-introduction-to-the-mapreduce-life-cycle

We start with the input separated into blocks and distributed over the nodes

SLIDE 24

Figure from https://www.supinfo.com/articles/single/2807-introduction-to-the-mapreduce-life-cycle

We have one map task per input block (each task executes the map function multiple times), running on the same node as its block to avoid data movement!

SLIDE 25

Figure from https://www.supinfo.com/articles/single/2807-introduction-to-the-mapreduce-life-cycle

Now comes the Shuffle & Sort phase: first, sort each map task's output by key

SLIDE 26

Figure from https://www.supinfo.com/articles/single/2807-introduction-to-the-mapreduce-life-cycle

Send each pair to the adequate reduce task (chosen by hashing the key). The number of reduce tasks is configurable.
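The "adequate" reduce task is typically chosen by hashing the key modulo the number of reduce tasks, so every occurrence of a word lands on the same reducer. A small sketch (the partition function below is illustrative, not Hadoop's actual partitioner):

```python
import hashlib

NUM_REDUCE_TASKS = 3  # the number of reduce tasks is configurable

def partition(key, num_tasks=NUM_REDUCE_TASKS):
    # Use a stable hash so a given key always maps to the same reduce
    # task (Python's built-in hash() is randomized between runs).
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest, "big") % num_tasks

# all pairs sharing a key go to one reduce task:
for word in ["aaa", "bb", "ccc", "d"]:
    print(word, "→ reduce task", partition(word))
```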

SLIDE 27

Figure from https://www.supinfo.com/articles/single/2807-introduction-to-the-mapreduce-life-cycle

Combine the pairs that have the same key

SLIDE 28

Figure from https://www.supinfo.com/articles/single/2807-introduction-to-the-mapreduce-life-cycle

Run the reduce tasks

SLIDE 29

Figure from https://www.supinfo.com/articles/single/2807-introduction-to-the-mapreduce-life-cycle

Now we have (unsorted) output that is distributed over some nodes

SLIDE 30

HDFS

  • Distributed file system for shared-nothing infrastructures
  • Main goals: scalability and fault tolerance, optimized for throughput
  • It is not POSIX-compliant

○ Sequential reads and writes only
○ Write-once-read-many file access (supports append and truncate)

SLIDE 31

HDFS

  • Files are partitioned into blocks and distributed over the nodes

○ Recently: 128MB blocks

  • Replicas are topology-aware (rack awareness)

○ Default replication factor is 3

  • Architecture:

○ NameNode: the clients' entry point
○ One DataNode per node
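Block count and storage footprint follow directly from these two parameters; for instance, for a hypothetical 1GB file:

```python
import math

BLOCK_SIZE = 128 * 2**20      # 128MB blocks
REPLICATION_FACTOR = 3        # default replication factor

file_size = 2**30             # a hypothetical 1GB file

blocks = math.ceil(file_size / BLOCK_SIZE)
copies = blocks * REPLICATION_FACTOR
print(blocks, copies)  # 8 blocks, hence 24 block replicas spread over the DataNodes
```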

SLIDE 32

[Diagram: the set of processing nodes, connected by a switch]

SLIDE 33

[Diagram: nodes organized in racks (communication is faster within the same rack)]

SLIDE 34

[Diagram: the NameNode (NN) and one DataNode (DN) per node, connected by switches]

SLIDE 35

Writing a file

  • The client asks the NameNode

○ The NameNode checks permissions, etc.

  • The NameNode allows the client to proceed
  • The client breaks the data into blocks

○ For each block, the client asks the NameNode for a list of destination DataNodes
○ The NameNode returns a list sorted by distance to the client

  • Each block is written:

○ The client sends it to the first (closest) DataNode
○ Each DataNode forwards it to the next DataNode in the list (to create the replicas)

  • When it is all written, the client acknowledges the file creation to the NameNode
  • The NameNode saves the information about the file (metadata) to disk
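The replication pipeline in the last step can be sketched as a tiny simulation (write_block and the in-memory storage dictionary are illustrative, not HDFS code):

```python
def write_block(block, datanodes, storage):
    """The client sends the block to the first (closest) DataNode;
    each DataNode stores its copy and forwards the block to the
    next one in the list, creating the replicas."""
    if not datanodes:
        return
    storage.setdefault(datanodes[0], []).append(block)  # store locally
    write_block(block, datanodes[1:], storage)          # forward downstream

storage = {}
write_block("block-0", ["D0", "D5", "D9"], storage)
print(storage)  # {'D0': ['block-0'], 'D5': ['block-0'], 'D9': ['block-0']}
```

The point of the pipeline is that the client only uploads each block once; the DataNodes create the remaining replicas among themselves.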

SLIDE 36

[Diagram: the NameNode (NN), the DataNodes (DN), and a client]

SLIDE 37

[Diagram: Client → NameNode: "Create file A"]

SLIDE 38

[Diagram: NameNode → Client: Ack]

SLIDE 39

[Diagram: Client → NameNode: "List of DataNodes?" — for each block!]

SLIDE 40

[Diagram: NameNode → Client: "D0, D5, and D9" — for each block!]

SLIDE 41

[Diagram: Client → first DataNode: data — for each block!]

SLIDE 42

[Diagram: each DataNode forwards the data to the next one in the list — for each block!]

SLIDE 43

[Diagram: Ack back to the client — for each block!]

SLIDE 44

[Diagram: Client → NameNode: "Done with file A"]

SLIDE 45

Reading a file

  • The client asks the NameNode for information about the file
  • The NameNode gives the client a list of blocks

○ For each block, a list of the DataNodes that have that block, sorted by distance to the client

  • The client reads the blocks sequentially

○ It tries to read each block from the closest DataNode; if that one is not available, it tries the others
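The fallback logic can be sketched as follows (read_block and the in-memory cluster dictionary are illustrative, not HDFS code):

```python
def read_block(block_id, replicas, cluster):
    """Try the DataNodes in the order given by the NameNode
    (closest first); fall back to the next replica on failure."""
    for dn in replicas:
        data = cluster.get(dn, {}).get(block_id)
        if data is not None:
            return data
    raise IOError(f"no available replica for {block_id}")

# D0 is down (absent from the cluster), so the client falls back to D5.
cluster = {"D5": {"block-0": "aaa bb ccc"}, "D9": {"block-0": "aaa bb ccc"}}
print(read_block("block-0", ["D0", "D5", "D9"], cluster))  # aaa bb ccc
```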

SLIDE 46

[Diagram: Client → NameNode: "Read file A"]

SLIDE 47

[Diagram: NameNode → Client: "Block 0: D0, D5, and D9; Block 1: D6, D10, and D11"]

SLIDE 48

[Diagram: Client → closest DataNode holding it: "Block 0?"]

SLIDE 49

[Diagram: DataNode → Client: data]

SLIDE 50

[Diagram: Client → closest DataNode holding it: "Block 1?"]

SLIDE 51

[Diagram: DataNode → Client: data]

SLIDE 52

Yarn

  • The cluster resource manager: dynamically allocates resources to jobs
  • Multiple engines (in addition to MapReduce) run in parallel on the cluster
  • Hierarchical architecture for scalability

SLIDE 53

Yarn architecture

Source: https://www.ibm.com/developerworks/library/bd-yarn-intro/index.html

SLIDE 54

Agenda

  • Introduction
  • A first MapReduce program
  • Apache Hadoop

○ MapReduce
○ HDFS
○ Yarn

  • Combiners

SLIDE 55

Figure from https://www.supinfo.com/articles/single/2807-introduction-to-the-mapreduce-life-cycle

That is costly!

SLIDE 56

map(key, value):
    for each word in value:
        output pair(word, 1)

1, "aaa bb ccc"  →  "aaa", 1   "bb", 1   "ccc", 1
2, "bb bb d"     →  "bb", 1   "bb", 1   "d", 1
3, "d aaa bb"    →  "d", 1   "aaa", 1   "bb", 1
4, "d"           →  "d", 1

reduce(key, values):
    result = 0
    for value in values:
        result += value
    output pair(key, result)

Final output:  "aaa", 2   "bb", 4   "ccc", 1   "d", 3

SLIDE 57

Combiner

User-defined function for local aggregation, applied on the map tasks' output (here!)

SLIDE 58

Map task 1:  1, "aaa bb ccc"   2, "bb bb d"
  map output: "aaa", 1   "bb", 1   "ccc", 1   "bb", 1   "bb", 1   "d", 1

Map task 2:  3, "d aaa bb"   4, "d"
  map output: "d", 1   "aaa", 1   "bb", 1   "d", 1

SLIDE 59

Map task 1:  1, "aaa bb ccc"   2, "bb bb d"
  map output:      "aaa", 1   "bb", 1   "ccc", 1   "bb", 1   "bb", 1   "d", 1
  combiner output: "aaa", 1   "bb", 3   "ccc", 1   "d", 1

Map task 2:  3, "d aaa bb"   4, "d"
  map output: "d", 1   "aaa", 1   "bb", 1   "d", 1

SLIDE 60

Map task 1:  1, "aaa bb ccc"   2, "bb bb d"
  map output:      "aaa", 1   "bb", 1   "ccc", 1   "bb", 1   "bb", 1   "d", 1
  combiner output: "aaa", 1   "bb", 3   "ccc", 1   "d", 1

Map task 2:  3, "d aaa bb"   4, "d"
  map output:      "d", 1   "aaa", 1   "bb", 1   "d", 1
  combiner output: "d", 2   "aaa", 1   "bb", 1
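The effect of the combiner on the two map tasks above can be reproduced in plain Python (map_task and combiner below are illustrative sketches, not the Hadoop API):

```python
from collections import Counter

def map_task(lines):
    # the raw map output of one task: a flat list of (word, 1) pairs
    return [(word, 1) for line in lines for word in line.split()]

def combiner(pairs):
    # local aggregation inside the map task, before the shuffle;
    # here it reuses the same summing logic as the reducer
    return list(Counter(word for word, _ in pairs).items())

task1 = combiner(map_task(["aaa bb ccc", "bb bb d"]))
task2 = combiner(map_task(["d aaa bb", "d"]))
print(task1)  # [('aaa', 1), ('bb', 3), ('ccc', 1), ('d', 1)]
print(task2)  # [('d', 2), ('aaa', 1), ('bb', 1)]
```

Task 1 now ships 4 pairs instead of 6 across the network; with realistic inputs the savings are much larger. This only works because the aggregation (a sum) is associative and commutative.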

SLIDE 61

Additional references

  • "MapReduce: Simplified Data Processing on Large Clusters", by J. Dean and S. Ghemawat
  • Suggested reading

○ Chapter 10 of Designing Data-Intensive Applications by Martin Kleppmann
○ HDFS cartoon: https://wiki.scc.kit.edu/gridkaschool/upload/1/18/Hdfs-cartoon.pdf
○ MapReduce illustration: https://words.sdsc.edu/words-data-science/mapreduce