CSC 369: Distributed Computing. Alex Dekhtyar. May 6. Day 14: Java Hadoop API



slide-1
SLIDE 1

CSC 369: Distributed Computing

Alex Dekhtyar

Day 14: Java Hadoop API

May 6

slide-2
SLIDE 2

CSC 369: Distributed Computing

Alex Dekhtyar

Day 14: Java Hadoop API

HAPPY EQUATOR DAY!

May 6

slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6

Housekeeping

Lab 4 (mini-project): due Sunday night
Lab 5: due tonight (grace period tomorrow)
Lab 6: full lab coming out Friday
Grading: slowly happening…

slide-7
SLIDE 7

Hadoop Java API

slide-8
SLIDE 8

Hadoop API

Current Version is 3.2.1.

Command-line tools: hadoop, hdfs, yarn

We limit ourselves to hadoop jar

slide-9
SLIDE 9

Hadoop Java API

  • org.apache.hadoop

Let’s concentrate on things we absolutely need

slide-10
SLIDE 10

Hadoop Java API

  • org.apache.hadoop
  • org.apache.hadoop.mapreduce: core MapReduce classes
  • org.apache.hadoop.mapreduce.lib.input: Input/Output parsing
  • org.apache.hadoop.mapreduce.lib.output: Input/Output parsing
  • org.apache.hadoop.io: atomic type wrappers
  • org.apache.hadoop.conf: Job configuration
  • org.apache.hadoop.fs: File system classes

slide-11
SLIDE 11
  • org.apache.hadoop.mapreduce
  • org.apache.hadoop.mapreduce.Job: MapReduce Job
  • org.apache.hadoop.mapreduce.Mapper: Extensible Mapper
  • org.apache.hadoop.mapreduce.Reducer: Extensible Reducer
  • org.apache.hadoop.mapreduce.Partitioner: Parent class for Partitioning tasks
  • org.apache.hadoop.mapreduce.InputFormat: Parent class for Input Formats
  • org.apache.hadoop.mapreduce.OutputFormat: Parent class for Output Formats
  • org.apache.hadoop.mapreduce.InputSplit: Parent class for Input Split

slide-12
SLIDE 12

How it works

Input File

slide-13
SLIDE 13

How it works

Input File

InputSplit InputSplit InputSplit

slide-14
SLIDE 14

How it works

Input File

InputSplit InputSplit InputSplit Job

Mapper Reducer Combiner (Reducer)

slide-15
SLIDE 15

How it works

Input File

InputSplit InputSplit InputSplit Job

Mapper Reducer Combiner (Reducer) Compute Node1 Compute Node2 Compute Node3

slide-16
SLIDE 16

How it works

Input File

InputSplit InputSplit InputSplit Job

Mapper Reducer Combiner (Reducer) Compute Node1 Compute Node2 Compute Node3

slide-17
SLIDE 17

How it works

Input File

InputSplit InputSplit InputSplit

Mapper Compute Node1 Compute Node2 Compute Node3 Mapper Mapper Combiner (Reducer) Combiner (Reducer) Combiner (Reducer) Mapper

slide-18
SLIDE 18

Compute Node1 Mapper Combiner (Reducer)

InputSplit

slide-19
SLIDE 19

Compute Node1 Mapper Combiner (Reducer)

InputSplit
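
The Combiner box on these slides can be read as "a Reducer that runs on the mapper's own node". Below is a minimal plain-Java sketch of that idea for a word count. It deliberately uses no Hadoop classes: `CombinerSketch`, `mapSplit`, and `combine` are hypothetical names made up for this illustration, and a real combiner would be a Reducer subclass registered on the Job.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of a combiner; it does not use any Hadoop classes. */
final class CombinerSketch {

    /** "Map" step: emit a (word, 1) pair for every whitespace-separated token. */
    static List<Map.Entry<String, Integer>> mapSplit(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    pairs.add(new SimpleEntry<>(token, 1));
                }
            }
        }
        return pairs;
    }

    /** "Combine" step: sum the counts per key on the node that produced them. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            partial.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = mapSplit(Arrays.asList("a b a", "b a"));
        Map<String, Integer> combined = combine(mapped);
        // Fewer pairs leave the node after combining than the mapper emitted.
        System.out.println(mapped.size() + " pairs mapped, " + combined.size() + " pairs sent on");
    }
}
```

The point of the sketch: the combiner shrinks what has to travel to the reducers, which is only safe when the reduce operation (here, a sum) can be applied to partial results.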

slide-20
SLIDE 20

Compute Node1 Compute Node2 Compute Node3 Compute Node1 Compute Node2 Compute Node3

MAP STAGE Reduce STAGE

Mapper Mapper Combiner Combiner Mapper Combiner Reducer Reducer Reducer

time

slide-21
SLIDE 21

Compute Node1 Compute Node2 Compute Node3 Compute Node1 Compute Node2 Compute Node3

MAP STAGE Reduce STAGE

Mapper Mapper Combiner Combiner Mapper Combiner Partitioner

time

Partitioner Partitioner Reducer Reducer Reducer

slide-22
SLIDE 22

Compute Node1 Compute Node2 Compute Node3 Compute Node1 Compute Node2 Compute Node3

MAP STAGE Reduce STAGE

Mapper Mapper Combiner Combiner Mapper Combiner Partitioner

time

Partitioner Partitioner

Shuffle STAGE

Reducer Reducer Reducer

slide-23
SLIDE 23

Compute Node1 Compute Node2 Compute Node3 Compute Node1 Compute Node2 Compute Node3

MAP STAGE Reduce STAGE

Mapper Mapper Combiner Combiner Mapper Combiner Partitioner

time

Partitioner Partitioner

Shuffle STAGE

Reducer Reducer Reducer
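
The Partitioner step in the diagram decides which reducer receives each key. A minimal plain-Java sketch of hash partitioning follows. It uses no Hadoop classes, and `PartitionerSketch` and `route` are hypothetical names for this illustration. The formula (mask the sign bit, then take the remainder) is the one usually cited for Hadoop's default HashPartitioner, but treat that as an assumption and check the API documentation for the version you use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of hash partitioning; it does not use any Hadoop classes. */
final class PartitionerSketch {

    /** Mask off the sign bit so the result is non-negative, then take the remainder. */
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Route every key to the bucket of the reducer responsible for it. */
    static List<List<String>> route(List<String> keys, int numReduceTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(getPartition(key, numReduceTasks)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Equal keys always land in the same bucket, whichever mapper emitted them.
        System.out.println(route(Arrays.asList("a", "b", "c", "a"), 3));
    }
}
```

Because the partition depends only on the key, every occurrence of a key reaches the same reducer, which is what makes the per-key reduce step possible.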

slide-24
SLIDE 24

Mapper in a nutshell

protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context)

protected void map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context)

protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)

void run(org.apache.hadoop.mapreduce.Mapper.Context context)

slide-25
SLIDE 25

run(InputSplit s, Context c):
    setup(s, c);
    for each record in s do:
        map(record, c);
    end for;
    cleanup(s, c)

Run setup() once
Run map() for each record
Run cleanup() once
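
The setup / map / cleanup pattern above is a template method: run() fixes the order and subclasses fill in the three hooks. Here is a minimal plain-Java sketch of that lifecycle. It uses no Hadoop classes: `SketchMapper`, `Context`, and `RecordCounter` are hypothetical stand-ins invented for this illustration, not the real Mapper API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of the mapper lifecycle; it does not use any Hadoop classes. */
final class MapperLifecycleSketch {

    /** Hypothetical stand-in for a context: it only collects emitted lines. */
    static final class Context {
        final List<String> output = new ArrayList<>();

        void write(String key, int value) {
            output.add(key + "\t" + value);
        }
    }

    /** Hypothetical stand-in for a mapper base class with the three hooks. */
    abstract static class SketchMapper {
        protected void setup(Context context) { }

        protected abstract void map(String record, Context context);

        protected void cleanup(Context context) { }

        /** Template method: setup once, map per record, cleanup once. */
        final void run(List<String> split, Context context) {
            setup(context);
            for (String record : split) {
                map(record, context);
            }
            cleanup(context);
        }
    }

    /** Counts the records of a split: state is reset in setup and emitted in cleanup. */
    static final class RecordCounter extends SketchMapper {
        private int count;

        @Override protected void setup(Context context) { count = 0; }

        @Override protected void map(String record, Context context) { count++; }

        @Override protected void cleanup(Context context) { context.write("records", count); }
    }

    public static void main(String[] args) {
        Context context = new Context();
        new RecordCounter().run(Arrays.asList("r1", "r2", "r3"), context);
        System.out.println(context.output);
    }
}
```

Emitting in cleanup() rather than in map() is a common reason to override the hooks: the mapper writes one aggregated pair per split instead of one pair per record.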

slide-26
SLIDE 26

Reducer in a nutshell

protected void cleanup(org.apache.hadoop.mapreduce.Reducer.Context context)

protected void reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context)

protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)

void run(org.apache.hadoop.mapreduce.Reducer.Context context)

slide-27
SLIDE 27

run(Context c):
    setup(c);
    for each key k with its grouped values vs do:
        reduce(k, vs, c);
    end for;
    cleanup(c)

Run setup() once
Run reduce() for each key
Run cleanup() once

Shuffle, Sort, SecondarySort
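
On the reducer side the same lifecycle applies, except that the loop runs once per key, after shuffle and sort have grouped all values for that key. Below is a minimal plain-Java sketch of that flow. It uses no Hadoop classes: `ReducerLifecycleSketch`, `shuffleAndSort`, and the summing `reduce` are hypothetical names chosen for this illustration, and a TreeMap stands in for the framework's sorting.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of shuffle, sort, and reduce; it does not use any Hadoop classes. */
final class ReducerLifecycleSketch {

    /** Shuffle + sort: group values by key; the TreeMap keeps keys in sorted order. */
    static Map<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** reduce(): called once per key with all of that key's values; here it sums them. */
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** run(): one reduce() call per key, in key order (setup and cleanup are no-ops here). */
    static List<String> run(Map<String, List<Integer>> grouped) {
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            output.add(entry.getKey() + "\t" + reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("b", 1));
        pairs.add(new SimpleEntry<>("a", 2));
        pairs.add(new SimpleEntry<>("b", 3));
        System.out.println(run(shuffleAndSort(pairs)));
    }
}
```

Note that reduce() receives an Iterable of values, matching the signature on the previous slide, rather than a single record.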

slide-28
SLIDE 28

Hadoop Java API

  • org.apache.hadoop
  • org.apache.hadoop.mapreduce: core MapReduce classes
  • org.apache.hadoop.mapreduce.lib.input: Input/Output parsing
  • org.apache.hadoop.mapreduce.lib.output: Input/Output parsing
  • org.apache.hadoop.io: atomic type wrappers
  • org.apache.hadoop.conf: Job configuration
  • org.apache.hadoop.fs: File system classes

slide-29
SLIDE 29

Hadoop Java API

  • org.apache.hadoop
  • org.apache.hadoop.mapreduce: core MapReduce classes
  • org.apache.hadoop.mapreduce.lib.input: Input/Output parsing
  • org.apache.hadoop.mapreduce.lib.output: Input/Output parsing
  • org.apache.hadoop.io: atomic type wrappers
  • org.apache.hadoop.conf: Job configuration
  • org.apache.hadoop.fs: File system classes

slide-30
SLIDE 30
  • org.apache.hadoop.mapreduce.lib.input

Single File Input Format

FileInputFormat: generic input file format (others extend it)
TextInputFormat: text input
KeyValueTextInputFormat: user-defined key-value pairs
FixedLengthInputFormat: fixed-length records in input
NLineInputFormat: controls the size of a split (in number of lines)

slide-31
SLIDE 31
  • org.apache.hadoop.mapreduce.lib.input

Single File Input Format

FileInputFormat: generic input file format (others extend it)
TextInputFormat: text input
KeyValueTextInputFormat: user-defined key-value pairs
FixedLengthInputFormat: fixed-length records in input
NLineInputFormat: controls the size of a split (in number of lines)

Other Important Classes

MultipleInputs: multiple files as inputs to a single Mapper
FileSplit: file partitions
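
To make "controls the size of a split in number of lines" concrete, here is a minimal plain-Java sketch that cuts a list of lines into splits of at most N lines each. It only illustrates the idea behind NLineInputFormat and uses no Hadoop classes: `NLineSplitSketch` and `splitByLines` are hypothetical names, and the real class works on files and byte offsets rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of N-line splitting; it does not use any Hadoop classes. */
final class NLineSplitSketch {

    /** Cut the input into consecutive splits of at most linesPerSplit lines each. */
    static List<List<String>> splitByLines(List<String> lines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerSplit) {
            int end = Math.min(start + linesPerSplit, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("l1", "l2", "l3", "l4", "l5");
        // Each inner list would be handed to its own mapper; the last one may be shorter.
        System.out.println(splitByLines(lines, 2));
    }
}
```

A smaller N means more splits and therefore more mapper tasks, which is the knob this input format exposes.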