Lecture 15.3 Hadoop! Toolchain EN 600.320/420 Instructor: Randal Burns



SLIDE 1

Department of Computer Science, Johns Hopkins University

Lecture 15.3 Hadoop! Toolchain

EN 600.320/420
Instructor: Randal Burns
4 April 2018

SLIDE 2

Lecture 15: Map/Reduce Part 2

The Hadoop Tool Chain

• The command line tool chain
  – Build files into directory
  – Construct java archive (jar)
  – Point Hadoop! at the jar
• Many prefer to use Eclipse instead
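
The three command-line steps can be sketched as the commands one would typically run. This is only an illustration, not the course's build script: the file, class, jar, and directory names (WordCount.java, WordCount, wc.jar, build, input, output) are hypothetical, and the classpath is passed in as a string (on a real install it usually comes from running `hadoop classpath`). The sketch only builds and prints the command lines; it does not execute them.

```python
from typing import List


def toolchain_commands(source: str, main_class: str, jar_name: str,
                       hadoop_classpath: str, input_dir: str,
                       output_dir: str,
                       build_dir: str = "build") -> List[List[str]]:
    """Return the build / jar / run command lines (all names are hypothetical)."""
    return [
        # 1. Build class files into a directory (the directory should already exist).
        ["javac", "-classpath", hadoop_classpath, "-d", build_dir, source],
        # 2. Construct a java archive (jar) from the contents of that directory.
        ["jar", "cf", jar_name, "-C", build_dir, "."],
        # 3. Point Hadoop at the jar, naming the main class and the job's paths.
        ["hadoop", "jar", jar_name, main_class, input_dir, output_dir],
    ]


if __name__ == "__main__":
    commands = toolchain_commands("WordCount.java", "WordCount", "wc.jar",
                                  "$(hadoop classpath)", "input", "output")
    for cmd in commands:
        print(" ".join(cmd))
```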

SLIDE 3

Hadoop! Configurations

• Hadoop! is a heterogeneous, distributed system
  – Many components: namenode, HDFS, reporting
  – Parallelization (mappers, reducers, shuffle, loading)
  – Typically involves managing a cluster
• But can run in several simpler ways
  – Pseudo-distributed (full runtime on one machine)
  – Fully distributed (on a cluster)
• Running on pre-configured clusters
  – Specify size and types of nodes
  – Launch a compiled Java jar file or streaming scripts
  – AWS, Azure, Joyent, IBM, Rackspace
  – Metaservices: Cloudera

SLIDE 4

Hadoop! Streaming

• Give arbitrary string-processing functions to the Hadoop! environment
  – A map script and a reduce script
• Almost equivalent to:
  – cat inputdir/* | mapper.py | sort | reducer.py
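
A minimal sketch of what the map and reduce scripts might do, using word count as the example. The slides do not say what mapper.py and reducer.py contain, so the word-count logic, the tab-separated "key, tab, value" line format, and the function names below are all assumptions. In a real streaming job the two functions would live in separate scripts reading standard input; here they are chained in one process with sorted() standing in for the sort stage of the pipeline.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word (assumed word-count mapper)."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts of each run of equal keys; input must arrive sorted."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines: Iterable[str]) -> List[str]:
    """Mimic: cat inputdir/* | mapper.py | sort | reducer.py"""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run_locally(["a b a", "b a"]):
        print(out)
```

The reducer relies on equal keys arriving next to each other, which is exactly what the sort stage provides.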

SLIDE 5

Streaming and Sorting

• Streaming mode in Hadoop! gives a different sorting guarantee
  – Recall: cat inputdir/* | mapper.py | sort | reducer.py
• Why?
• Same or different semantics?
• Any performance implications?

SLIDE 6

Streaming and Sorting

• Streaming mode in Hadoop! gives a different sorting guarantee
  – Recall: cat inputdir/* | mapper.py | sort | reducer.py
• Why?
  – There is no schema
  – So, it sorts the whole output of mapper.py as a key
  – This is more restrictive than the default sort
  – And, thus, less efficient
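
To see why treating the whole line as the key is the stricter guarantee, the sketch below compares two orderings of the same mapper output. It assumes tab-separated "key, tab, value" lines, which the slide does not specify, and the sample records are made up. Both orderings place equal keys next to each other, which is all a reducer needs; sorting the whole line additionally fixes the order of the values within each key.

```python
from typing import List

# Hypothetical mapper output: key and value separated by a tab.
records = ["b\t2", "a\t9", "a\t1", "b\t1"]


def sort_whole_line(lines: List[str]) -> List[str]:
    """Treat the entire line as the sort key (no schema)."""
    return sorted(lines)


def sort_by_key(lines: List[str]) -> List[str]:
    """Sort on the key field only; values keep their arrival order (stable sort)."""
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


if __name__ == "__main__":
    print(sort_whole_line(records))
    print(sort_by_key(records))
```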

SLIDE 7

Map/Reduce Recast (8-year-old numbers)

• Scanning engine
  – Use massive parallelism to look at large data sets
• Performance on 100 TB data sets
  – 1 node @ 50 MB/s (sustained transfer rate of disk) = 23 days
  – 1000 nodes = 33 minutes
• Batch processing
  – Not real-time/user facing
• Large production environments
  – Not useful on small scales
  – Too much overhead on small jobs
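
The 100 TB scan figures above can be checked with a few lines of arithmetic. This is a sketch under two assumptions not stated on the slide: decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and perfectly linear scaling across nodes with no job overhead.

```python
TB = 10**12  # bytes, decimal units (assumed)
MB = 10**6   # bytes, decimal units (assumed)


def scan_seconds(data_bytes: int, bytes_per_second_per_node: int,
                 nodes: int = 1) -> float:
    """Time to scan the data once, assuming perfect linear scaling."""
    return data_bytes / (bytes_per_second_per_node * nodes)


one_node = scan_seconds(100 * TB, 50 * MB)               # 2,000,000 seconds
thousand_nodes = scan_seconds(100 * TB, 50 * MB, 1000)   # 2,000 seconds

if __name__ == "__main__":
    print(f"1 node:     {one_node / 86400:.1f} days")
    print(f"1000 nodes: {thousand_nodes / 60:.1f} minutes")
```

Under these assumptions one node takes about 23.1 days and 1000 nodes take about 33.3 minutes, matching the slide's rounded figures.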