
Extreme Computing Introduction to MapReduce Cluster Outline Map Reduce 1

Cluster
We have 12 servers: scutter01, scutter02, ..., scutter12
If working outside Informatics, first:
    ssh student.ssh.inf.ed.ac.uk
Then log into a random server:
    ssh scutter$(printf "%02i" $((RANDOM%12+1)))
Please load balance! Two years ago the cluster crashed.


Cluster Software
The cluster runs Hadoop on DICE (the Informatics Linux Environment).
⇒ No need to install software yourself.
You can run your own cluster, but:
- We won’t help you install it
- Copy your output to the cluster
- Code should run on the cluster
⇒ Make sure your DICE account works!
We don’t have root, so only computing support can help.
Do this before the labs starting 2 October.

Companies I Take Money From
- Likely guest lecture
- Currently no guest lecture

MapReduce Incremental Approach
- Build MapReduce from problems.
- Assemble picture at the end.
- Assignment 1 is pure MapReduce problems.


Distributed grep
    grep extreme
Find every line containing “extreme” in a text file.

Input                      Output
extreme students           extreme students
this is slow               pay extremely high
pay extremely high         method extremely useful
up to there
method extremely useful
take TTDS

Split input into pieces, run grep on each.

Interlude: Pieces of a Text File
Goal: assign a piece of the text file to each machine.
- Non-overlapping
- Break at line boundaries
- Fast (don’t read more than you have to)
- Balanced (roughly equal sizes)

Seeking
seek allows one to skip to a particular byte in a file.
There is no seek for line offsets: you’d have to read the file from the beginning and count newlines.
But we can seek to a byte offset, then round up to the next line.

Rounding bytes to lines
Split a 300-byte text file:

Task   Byte Assignment   Line Rounding
0      0–99              0–102
1      100–199           103–207
2      200–299           208–299

Each task can read until it sees a newline, then round up to that.
→ Work is divided at line boundaries.
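The rounding rule can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop's actual input-splitting code: the function names read_split and byte_ranges are made up here, and the ranges are half-open [start, end) rather than the inclusive ranges in the table. A task skips the line it lands in the middle of (the previous task finishes it) and keeps every line whose first byte falls inside its own range.

```python
import os
import tempfile


def byte_ranges(size, num_tasks):
    """Cut [0, size) into num_tasks roughly equal half-open byte ranges."""
    step = -(-size // num_tasks)  # ceiling division
    return [(i * step, min((i + 1) * step, size)) for i in range(num_tasks)]


def read_split(path, start, end):
    """Return the lines whose first byte lies in [start, end).

    Hypothetical helper: seek to a byte offset, then round up to the
    next line boundary before reading.
    """
    lines = []
    with open(path, "rb") as f:
        if start > 0:
            # Step back one byte so that a line beginning exactly at
            # `start` is kept; otherwise this discards the partial line.
            f.seek(start - 1)
            f.readline()
        # A line that begins before `end` is ours, even if it runs past it.
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line)
    return lines


if __name__ == "__main__":
    text = b"".join(b"this is line number %d\n" % i for i in range(12))
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(text)
    try:
        for task, (start, end) in enumerate(byte_ranges(len(text), 3)):
            lines = read_split(tmp.name, start, end)
            print(task, (start, end), len(lines), "lines")
    finally:
        os.unlink(tmp.name)
```

Each task only seeks once and never reads more than one line beyond its byte range, which is what makes the split both fast and non-overlapping.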

Hadoop is an implementation of MapReduce.
This just shows how Hadoop splits input:

    hadoop jar hadoop-streaming-2.7.3.jar \
        -input /data/assignments/ex1/webSmall.txt \
        -output /user/$USER/catted \
        -mapper "cat" \
        -reducer NONE

hadoop jar ...    Run Hadoop
-input            Read big text file
-output           Write here
-mapper "cat"     Just copy the input
-reducer NONE     Ignore this for now

Don’t worry, you’ll get too much practice in the labs.

Distributed grep

    hadoop jar hadoop-streaming-2.7.3.jar \
        -input /data/assignments/ex1/webSmall.txt \
        -output /user/$USER/grepped \
        -mapper "grep extreme" \
        -reducer NONE

hadoop jar ...            Run Hadoop
-input                    Read big text file
-output                   Write here
-mapper "grep extreme"    Scan for “extreme”
-reducer NONE             Ignore this for now

Summarizing
File: webSmall.txt

Machine 0           Machine 1           Machine 2
mapper: grep        mapper: grep        mapper: grep
File: part-00000    File: part-00001    File: part-00002

Hadoop takes care of:
- Shared file system
- Splitting input at line boundaries
- Launching tasks on multiple machines
We can specify any command (“a mapper”) to run.
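Since a mapper is just a command that reads lines on standard input and writes lines on standard output, a script can stand in for grep. The sketch below is a hypothetical Python equivalent of the "grep extreme" mapper; the function name matching_lines is invented for illustration, and it does plain substring matching rather than grep's regular expressions.

```python
import sys


def matching_lines(lines, needle="extreme"):
    """Yield every line that contains `needle` as a substring."""
    for line in lines:
        if needle in line:
            yield line


def main():
    # Stream stdin to stdout, keeping only the matching lines.
    sys.stdout.writelines(matching_lines(sys.stdin))


if __name__ == "__main__":
    main()
```

Run locally as a filter over a text file, it should keep the same lines as grep extreme in the earlier example, because each machine only ever sees its own piece of the input.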

Word Count
How many times do words appear?

Input            Output
want to use a    a 3
a to             want 1
a decimal        to 2
                 use 1
                 decimal 1

Each mapper counts independently:

Mapper 0    Mapper 1
a 1         a 2
want 1      to 1
to 1        decimal 1
use 1

Problem: Need to collate/sum counts


Each mapper counts independently:

Mapper 0    Mapper 1
a 1         a 2
want 1      to 1
to 1        decimal 1
use 1

Reducer 0    Reducer 1
a 3          to 2
decimal 1    use 1
want 1

Reducers sum counts
Mappers hash the word mod 2 to decide which reducer to send to.
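The "hash the word mod 2" step can be sketched as follows. This is an illustration only: it uses CRC32 as a stand-in hash (Hadoop uses its own hash function, so the actual assignment of words to reducers will differ from this sketch and from the table above), and the names reducer_for and partition are hypothetical. Python's built-in hash() is avoided because it is randomized per process for strings, and every mapper must agree on where a word goes.

```python
import zlib


def reducer_for(word, num_reducers=2):
    """Pick a reducer index from a deterministic hash of the word."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def partition(pairs, num_reducers=2):
    """Group (word, count) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for word, count in pairs:
        buckets[reducer_for(word, num_reducers)].append((word, count))
    return buckets


if __name__ == "__main__":
    mapper0 = [("a", 1), ("want", 1), ("to", 1), ("use", 1)]
    mapper1 = [("a", 2), ("to", 1), ("decimal", 1)]
    for index, bucket in enumerate(partition(mapper0 + mapper1)):
        print("reducer", index, sorted(bucket))
```

Because the hash depends only on the word, both mappers send their counts for "a" to the same reducer, which is what lets that reducer compute the full total.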

Examine Reducer Input

    hadoop jar hadoop-streaming-2.7.3.jar \
        -files count_map.py \
        -input /data/assignments/ex1/webSmall.txt \
        -output /user/$USER/reducespy \
        -mapper count_map.py \
        -reducer cat

hadoop jar ...          Run Hadoop
-files                  Copy code to workers
-input                  Read big text file
-output                 Write here
-mapper count_map.py    Count words locally
-reducer cat            Leave as is

cat will copy input to output, so we can see what the input is.
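The slides do not show count_map.py itself, so the following is only a guess at what a "count words locally" mapper might look like. It assumes whitespace-separated words and tab-separated "word<TAB>count" output lines (tab is the usual key/value separator in Hadoop streaming); the function name count_words is hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a count_map.py-style streaming mapper."""
import sys
from collections import Counter


def count_words(lines):
    """Count whitespace-separated words across this mapper's lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def main():
    # Emit one "word<TAB>count" line per distinct word seen locally.
    for word, count in count_words(sys.stdin).items():
        print(f"{word}\t{count}")


if __name__ == "__main__":
    main()
```

Counting inside the mapper keeps memory proportional to the number of distinct words in that mapper's piece of the file, and sends one line per word to the reducers instead of one line per occurrence.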

Sorting
Hadoop sorts reducer input for you:

Unsorted: Annoying    Sorted: Easy
to 1                  to 1
want 1                to 1
use 1                 use 1
to 1                  want 1

Sorting makes it easy to stream in constant memory.
Unsorted would require remembering words in memory.
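Here is a minimal sketch of why sorted input allows constant memory. It is a hypothetical reducer, not the course's own count_reduce.py, and it assumes tab-separated "word<TAB>count" lines. Because equal words arrive next to each other, the reducer only has to remember the current word and its running total.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a streaming reducer that sums sorted counts."""
import sys


def sum_sorted(lines):
    """Yield (word, total) from 'word<TAB>count' lines sorted by word."""
    current, total = None, 0
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            # The word changed, so the previous word's total is final.
            if current is not None:
                yield current, total
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield current, total


def main():
    for word, total in sum_sorted(sys.stdin):
        print(f"{word}\t{total}")


if __name__ == "__main__":
    main()
```

With unsorted input the same word could reappear at any point, so the reducer would have to keep a table of every word seen so far.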

Examine Reducer Input

    hadoop jar hadoop-streaming-2.7.3.jar \
        -files count_map.py,count_reduce.py \
        -input /data/assignments/ex1/webSmall.txt \
        -output /user/$USER/count \
        -mapper count_map.py \
        -reducer count_reduce.py

hadoop jar ...              Run Hadoop
-files                      Copy code to workers
-input                      Read big text file
-output                     Write here
-mapper count_map.py        Count words locally
-reducer count_reduce.py    Sum counts

And we get word count... hopefully
