MapReduce: Introduction and Hadoop Overview – PowerPoint PPT Presentation

13 June 2012 – Lab Course: Databases & Cloud Computing, SS 2012
Martin Przyjaciel-Zablocki, Alexander Schätzle


SLIDE 1

MapReduce: Introduction and Hadoop Overview

Lab Course: Databases & Cloud Computing, SS 2012

Martin Przyjaciel-Zablocki, Alexander Schätzle, Georg Lausen
University of Freiburg – Databases & Information Systems

13 June 2012

SLIDE 2

0. Agenda

1. Why MapReduce?
   a. Data Management Approaches
   b. Distributed Processing of Data
2. Apache Hadoop
   a. HDFS  b. MapReduce  c. Pig  d. Hive  e. HBase
3. Programming MapReduce
   a. MapReduce with Java
   b. Moving into the Cloud

MapReduce Introduction 2

SLIDE 3

1. Why MapReduce? – Data Management

Large datasets
  • The amount of data increases constantly
  • “Five exabytes of data are generated every two days” (corresponds to the whole amount of data generated up to 2003) – Eric Schmidt

Facebook:
  • >800 million active users, interacting with >1 billion objects
  • 2.5 petabytes of user data per day!

How to explore and analyze such large datasets?

SLIDE 4

1. Why MapReduce? – Data Management (2)

Processing a 100 TB dataset
  • On 1 node: scanning @ 50 MB/s ≈ 23 days
  • On a 1000-node cluster: scanning @ 50 MB/s ≈ 33 min

Current development
  • Companies often can't cope with logged user behavior and throw away data after some time → lost opportunities
  • Growing cloud-computing capacities
  • The price/performance advantage of low-end servers increases to about a factor of twelve
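The scan-time numbers on this slide follow from simple arithmetic; a quick sanity check (assuming decimal units, 1 TB = 10^12 bytes, 1 MB = 10^6 bytes):

```python
# Back-of-the-envelope check of the slide's 100 TB scan times.
dataset_bytes = 100 * 10**12      # 100 TB
rate = 50 * 10**6                 # 50 MB/s per node

one_node_days = dataset_bytes / rate / 86400          # single node
cluster_minutes = dataset_bytes / (rate * 1000) / 60  # 1000 nodes in parallel

print(round(one_node_days))       # 23 (days)
print(round(cluster_minutes))     # 33 (minutes)
```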

SLIDE 5

1. Why MapReduce? – Data Management Approaches

High-performance single machines
  • “Scale-up” has limits (hardware, software, costs)
  • Workloads today are beyond the capacity of any single machine
  • I/O bottleneck

Parallel databases
  • Fast and reliable
  • “Scale-out” restricted to some hundreds of machines
  • Maintaining & administrating parallel databases is hard

Specialized clusters of powerful machines
  • “Specialized” = powerful hardware satisfying individual software needs
  • Fast and reliable, but also very expensive
  • For data-intensive applications, scaling “out” is superior to scaling “up” → the performance gap is insufficient to justify the price

SLIDE 6

1. Why MapReduce? – Data Management Approaches (2)

Clusters of commodity servers (with MapReduce)
  • “Commodity servers” = not individually adjusted, e.g. 8 cores, 16 GB of RAM
  • Cost & energy efficiency

MapReduce
  • Designed around clusters of commodity servers
  • Widely used in a broad range of applications by many organizations
  • Scaling “out”, e.g. Yahoo! uses >40,000 machines
  • Easy to maintain & administrate

SLIDE 7

1. Why MapReduce? – Distributed Processing

Distributed Processing of Data

Problem: How to compute the PageRank for a crawled set of websites on a cluster of machines?

Main challenges:
  • How to break up a large problem into smaller tasks that can be executed in parallel?
  • How to assign tasks to machines?
  • How to partition and distribute data?
  • How to share intermediate results?
  • How to coordinate synchronization, scheduling and fault-tolerance?

→ MapReduce!

SLIDE 8

1. Why MapReduce? – Distributed Processing

Big ideas behind MapReduce

Scale “out”, not “up”
  • Large number of commodity servers

Assume failures are common
  • In a cluster of 10,000 servers, expect 10 failures a day

Move processing to the data
  • Take advantage of data locality and avoid transferring large datasets through the network

Process data sequentially and avoid random access
  • Random disk access causes seek times

Hide system-level details from the application developer
  • Developers can focus on their problems instead of dealing with distributed programming issues

Seamless scalability
  • Scaling “out” improves the performance of an algorithm without any modifications

SLIDE 9

1. Why MapReduce? – Distributed Processing

MapReduce
  • Popularized by Google & widely used
  • Algorithms that can be expressed as (or mapped to) a sequence of Map() and Reduce() functions are automatically parallelized by the framework

Distributed File System
  • Data is split into equally sized blocks and stored distributed
  • Clusters of commodity hardware
  • Fault tolerance by replication
  • Very large files / write-once, read-many pattern

Advantages – all done automatically by the framework:
  • Partitioning + distribution of data
  • Parallelization and assignment of tasks
  • Scalability, fault-tolerance, scheduling, …

SLIDE 10

1. Why MapReduce? – Distributed Processing

What is MapReduce used for?

At Google
  • Index construction for Google Search (replaced in 2010 by Caffeine)
  • Article clustering for Google News
  • Statistical machine translation

At Yahoo!
  • “Web map” powering Yahoo! Search
  • Spam detection for Yahoo! Mail

At Facebook
  • Data mining, web log processing
  • SearchBox (with Cassandra)
  • Facebook Messages (with HBase)
  • Ad optimization
  • Spam detection

SLIDE 11

1. Why MapReduce? – Distributed Processing

What is MapReduce used for? (2)

In research
  • Astronomical image analysis (Washington)
  • Bioinformatics (Maryland)
  • Analyzing Wikipedia conflicts (PARC)
  • Natural language processing (CMU)
  • Particle physics (Nebraska)
  • Ocean climate simulation (Washington)
  • Processing of semantic data (Freiburg)
  • <Your application here>

SLIDE 12

0. Agenda

1. Why MapReduce?
   a. Data Management Approaches
   b. Distributed Processing of Data
2. Apache Hadoop
   a. HDFS  b. MapReduce  c. Pig  d. Hive  e. HBase
3. Programming MapReduce
   a. MapReduce with Java
   b. Moving into the Cloud

SLIDE 13

2. Apache Hadoop

“Open-source software for reliable, scalable, distributed computing”

SLIDE 14

2. Hadoop – Apache Hadoop: Why?

Apache Hadoop
  • Well-known open-source implementation of Google's MapReduce & Google File System (GFS) papers
  • Enriched by many subprojects
  • Used by Yahoo, Facebook, Amazon, IBM, Last.fm, eBay, …
  • Cloudera's Distribution with VMWare images, tutorials and further patches

SLIDE 15

2. Hadoop – Hadoop Ecosystem

  • PIG (Data Flow)
  • Hive (SQL)
  • MapReduce (Job Scheduling/Execution System)
  • HBase (NoSQL)
  • HDFS (Hadoop Distributed File System)
  • Hadoop Common (supporting utilities, libraries)
  • ZooKeeper (Coordination)
  • Avro (Serialization)
  • Chukwa (Managing)

SLIDE 16

2. Hadoop

Yahoo's Hadoop Cluster

Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf

SLIDE 17

2. Hadoop – HDFS

Hadoop Distributed File System
  • Files are split into 64 MB blocks
  • Blocks are replicated across several DataNodes (usually 3)
  • A single NameNode stores metadata (file names, block locations, etc.)
  • Optimized for large files and sequential reads
  • Files are append-only

[Diagram: the NameNode tracks File1's blocks 1–4, each replicated across the DataNodes]

SLIDE 18

2. Hadoop – Architecture

Hadoop Architecture
  • Master & slaves architecture
  • The JobTracker schedules and manages jobs
  • A TaskTracker executes individual map() and reduce() tasks on each cluster node
  • JobTracker and NameNode, as well as TaskTrackers and DataNodes, are placed on the same machines

[Diagram: Master = JobTracker + NameNode; Slaves = TaskTracker + DataNode on each worker node]

SLIDE 19

2. Hadoop – MapReduce

MapReduce Workflow

(1) Map phase
  • Raw data is read and converted to key/value pairs
  • The map() function is applied to every pair

(2) Shuffle phase
  • All key/value pairs are sorted and grouped by their keys

(3) Reduce phase
  • All values with the same key are processed within the same reduce() function

SLIDE 20

2. Hadoop – MapReduce

MapReduce Workflow (2)

Steps of a MapReduce execution

Signatures
  • Map():    (in_key, in_value) → list(out_key, intermediate_value)
  • Reduce(): (out_key, list(intermediate_value)) → list(out_value)

[Diagram: input splits 0–5 in HDFS → Map phase → intermediate results (local disk) → Shuffle & Sort → Reduce phase → outputs 0–1 in HDFS]

SLIDE 21

2. Hadoop – MapReduce

MapReduce Programming Model
  • Every MapReduce program must specify a Mapper and typically a Reducer
  • The Mapper has a map() function that transforms input (key, value) pairs into any number of intermediate (out_key, intermediate_value) pairs
  • The Reducer has a reduce() function that transforms intermediate (out_key, list(intermediate_value)) aggregates into any number of output (value') pairs
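The programming model above can be sketched as a toy, single-process simulation. This is plain Python, not the Hadoop API; `run_mapreduce` and the word-count functions are illustrative names, and the shuffle is simulated by an in-memory dictionary:

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    # Map phase: apply map_fn to every input (key, value) pair.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Shuffle phase: group all intermediate values by key.
    groups = defaultdict(list)
    for out_key, inter_value in intermediate:
        groups[out_key].append(inter_value)
    # Reduce phase: apply reduce_fn to each (key, list-of-values) group,
    # in sorted key order as Hadoop's shuffle & sort would deliver them.
    output = []
    for out_key in sorted(groups):
        output.extend(reduce_fn(out_key, groups[out_key]))
    return output

# Word count expressed in this model:
def wc_map(_, line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

result = run_mapreduce(wc_map, wc_reduce,
                       [(0, "the quick brown fox"), (1, "the fox ate the mouse")])
print(result)
# [('ate', 1), ('brown', 1), ('fox', 2), ('mouse', 1), ('quick', 1), ('the', 3)]
```

In real Hadoop the three phases run on many machines and the intermediate data is partitioned across reducers, but the data flow is exactly this shape.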

SLIDE 22

2. Hadoop – MapReduce

MapReduce Execution Details

A single master controls job execution on multiple slaves
  • Master/slave architecture

Mappers are preferentially placed on the same node or the same rack as their input block
  • Utilizes data locality → move computation to the data
  • Minimizes network usage

Mappers save their outputs to local disk before serving them to reducers
  • Allows recovery if a reducer crashes
  • Allows having more reducers than nodes

SLIDE 23

2. Hadoop – MapReduce

Word Count Example

Problem: Given a document, we want to count the occurrences of every word

Input:
  • Document with words (e.g. literature)

Output:
  • List of words and their occurrences, e.g.
    “Infrastructure” 12
    “the” 259
    …

SLIDE 24

2. Hadoop – MapReduce

Word Count Execution

Input splits: “the quick brown fox” | “the fox ate the mouse” | “how now brown cow”

Map: each mapper emits (word, 1) for every word in its split, e.g. (the, 1) (quick, 1) (brown, 1) (fox, 1)

Shuffle & Sort: all pairs are grouped by word and assigned to reducers

Reduce → Output:
  Reducer 1: (brown, 2) (fox, 2) (how, 1) (now, 1)
  Reducer 2: (the, 3) (ate, 1) (cow, 1) (mouse, 1) (quick, 1)

SLIDE 25

2. Hadoop – MapReduce

An Optimization: The Combiner
  • A combiner is a local aggregation function for repeated keys produced by the same Mapper
  • Works for associative functions like sum, count, max
  • Decreases the size of intermediate data
  • Example: map-side aggregation for Word Count

SLIDE 26

2. Hadoop – MapReduce

Word Count with Combiner

Input splits: “the quick brown fox” | “the fox ate the mouse” | “how now brown cow”

Map with combiner: the mapper for “the fox ate the mouse” now emits (the, 2) (fox, 1) (ate, 1) (mouse, 1) instead of one pair per word occurrence

Shuffle & Sort, then Reduce → Output (unchanged):
  Reducer 1: (brown, 2) (fox, 2) (how, 1) (now, 1)
  Reducer 2: (the, 3) (ate, 1) (cow, 1) (mouse, 1) (quick, 1)
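The saving from a combiner can be sketched in a few lines of plain Python (a toy illustration of map-side pre-aggregation, not Hadoop API code; the function names are illustrative):

```python
from collections import Counter

def wc_map(line):
    # One (word, 1) pair per word occurrence, as a plain mapper would emit.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Local aggregation of repeated keys produced by the SAME mapper.
    # Correct for word count because sum is associative.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

mapper_output = wc_map("the fox ate the mouse")
combined = combine(mapper_output)

print(len(mapper_output))  # 5 pairs shipped over the network without a combiner
print(len(combined))       # 4 pairs shipped with a combiner ("the" pre-summed to 2)
```

The reducers still see correct totals, because summing the pre-summed counts gives the same result as summing the raw ones.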

SLIDE 27

2. Hadoop – MapReduce

Fault Tolerance in MapReduce

1. If a task crashes:
  • Retry it on another node
    – OK for a map because it has no dependencies
    – OK for a reduce because map outputs are on disk
  • If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)

Note: for these fault-tolerance features to work, your map and reduce tasks must be side-effect-free

SLIDE 28

2. Hadoop – MapReduce

Fault Tolerance in MapReduce (2)

2. If a node crashes:
  • Re-launch its current tasks on other nodes
  • Re-run any maps the node previously ran
    – Necessary because their output files were lost along with the crashed node

SLIDE 29

2. Hadoop – MapReduce

Fault Tolerance in MapReduce (3)

3. If a task is going slowly (straggler):
  • Launch a second copy of the task on another node (“speculative execution”)
  • Take the output of whichever copy finishes first, and kill the other

Surprisingly important in large clusters
  • Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
  • A single straggler may noticeably slow down a job

SLIDE 30

2. Hadoop – MapReduce

Further handy tools

Partitioners
  • Assign keys to reducers
  • Default: key.hashCode() % num_reducers

Grouping comparators
  • Sort keys within reducers

Combiners
  • Local aggregation

Compression
  • Supported compression types: zlib, LZO, …

Counters (global)
  • Define new countable events
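The default hash partitioning can be illustrated with a toy sketch (plain Python, not Hadoop code; `string_hash` is a deterministic stand-in for Java's `String.hashCode()`, since Python's built-in `hash()` for strings varies between runs):

```python
def string_hash(key):
    # Deterministic stand-in for Java's String.hashCode() (h = 31*h + ch).
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def partition(key, num_reducers):
    # Default partitioner: key.hashCode() % num_reducers.
    return string_hash(key) % num_reducers

num_reducers = 2
buckets = {r: [] for r in range(num_reducers)}
for word in ["the", "quick", "brown", "fox"]:
    buckets[partition(word, num_reducers)].append(word)

print(buckets)
```

The property that matters is determinism: every occurrence of the same key, from every mapper, hashes to the same reducer, so the reducer sees the complete value list for that key.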
SLIDE 31

2. Hadoop – MapReduce

Further handy tools (2)

Zero reduces
  • If no sorting or shuffling is required, set the number of reduces to 0

Distributed file cache
  • For storing read-only copies of data on local computers

SLIDE 32

0. Agenda

1. Why MapReduce?
   a. Data Management Approaches
   b. Distributed Processing of Data
2. Apache Hadoop
   a. HDFS  b. MapReduce  c. Pig  d. Hive  e. HBase
3. Programming MapReduce
   a. MapReduce with Java
   b. Moving into the Cloud

[Ecosystem diagram: PIG (Data Flow), Hive (SQL), MapReduce, HBase (NoSQL), HDFS, Hadoop Common, ZooKeeper, Avro, Chukwa]

SLIDE 33

2. Hadoop – Pig Latin

Pig (Latin): Why?
  • Many parallel algorithms can be expressed by a series of MapReduce jobs
  • But MapReduce is fairly low-level: you must think about keys, values, partitioning, etc.
  • Can we capture common “job building blocks”?

SLIDE 34

2. Hadoop – Pig Latin

Pig (Latin)
  • Started at Yahoo! Research
  • Runs about 30% of Yahoo!'s jobs
  • Features:
    – Expresses sequences of MapReduce jobs
    – Data model: nested “bags” of items
    – Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
    – Easy to plug in Java functions
    – Pig Pen development environment for Eclipse

SLIDE 35

2. Hadoop – Pig Latin

An Example Problem

Suppose you have user data in one file and page view data in another, and you need to find the top 5 most visited pages by users aged 18–25.

Dataflow: Load Users → Load Pages → Filter by age → Join on name → Group on url → Count clicks → Order by clicks → Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt
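The dataflow above can be traced on toy data with a short plain-Python sketch (illustrative data and names, not the example's actual files; in Pig each step below becomes one statement, as the next slides show):

```python
# (name, age) tuples and (user, url) click tuples -- made-up sample data.
users = [("alice", 20), ("bob", 30), ("carol", 22)]
pages = [("alice", "a.com"), ("carol", "a.com"),
         ("alice", "b.com"), ("bob", "a.com")]

# Filter users by age 18-25, join with pages on name, group on url, count clicks.
young = {name for name, age in users if 18 <= age <= 25}
clicks = {}
for user, url in pages:
    if user in young:
        clicks[url] = clicks.get(url, 0) + 1

# Order by clicks (descending) and take the top 5.
top5 = sorted(clicks.items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top5)  # [('a.com', 2), ('b.com', 1)] -- bob's click is filtered out
```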

SLIDE 36

2. Hadoop – Pig Latin

In MapReduce

[Slide shows the same task written directly as MapReduce code – several dense pages of it]

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

SLIDE 37

2. Hadoop – Pig Latin

In Pig Latin

Users    = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages    = load 'pages' as (user, url);
Joined   = join Filtered by name, Pages by user;
Grouped  = group Joined by url;
Summed   = foreach Grouped generate group, count(Joined) as clicks;
Sorted   = order Summed by clicks desc;
Top5     = limit Sorted 5;
store Top5 into 'top5sites';

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

SLIDE 38

2. Hadoop – Pig Latin

Ease of Translation

Notice how naturally the components of the job translate into Pig Latin:

Load Users      → Users = load …
Load Pages      → Pages = load …
Filter by age   → Filtered = filter …
Join on name    → Joined = join …
Group on url    → Grouped = group …
Count clicks    → Summed = … count() …
Order by clicks → Sorted = order …
Take top 5      → Top5 = limit …

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

SLIDE 39

2. Hadoop – Pig Latin

Ease of Translation (2)

The same pipeline again, this time showing how Pig groups the statements into the MapReduce jobs it compiles them to (Job 1, Job 2, Job 3).

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

SLIDE 40

0. Agenda

1. Why MapReduce?
   a. Comparison of Data Management Approaches
   b. Distributed Processing of Data
2. Apache Hadoop
   a. HDFS  b. MapReduce  c. Pig  d. Hive  e. HBase
3. Programming MapReduce
   a. MapReduce with Java
   b. Moving into the Cloud

[Ecosystem diagram: PIG (Data Flow), Hive (SQL), MapReduce, HBase (NoSQL), HDFS, Hadoop Common, ZooKeeper, Avro, Chukwa]

SLIDE 41

2. Hadoop – Hive

Hive
  • Developed at Facebook; used for the majority of Facebook jobs
  • Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop
    – MapReduce for execution
    – HDFS for storage
  • “Relational database” built on Hadoop
    – Maintains a list of table schemas
    – SQL-like query language (HQL)
    – Supports table partitioning, clustering, complex data types, some optimizations

SLIDE 42

2. Hadoop – Hive

Sample Hive Query

Find top 5 pages visited by users aged 18–25:

SELECT p.url, COUNT(1) as clicks
FROM users u JOIN page_views p ON (u.name = p.user)
WHERE u.age >= 18 AND u.age <= 25
GROUP BY p.url
ORDER BY clicks DESC
LIMIT 5;

SLIDE 43

2. Hadoop – HBase

HBase
  • Clone of BigTable (Google)
  • Data is stored column-oriented
  • Distributed over many servers
  • Layered over HDFS
  • Strong consistency (CAP theorem)
  • Scalable up to billions of rows × millions of columns

SLIDE 44

2. Hadoop – Others

And many others …

Sqoop
  • Tool designed to help users import data from existing relational databases into their Hadoop cluster
  • Integrates with Hive

ZooKeeper
  • High-performance coordination service for distributed applications

Avro
  • Data serialization system

Chukwa
  • Data collection system
  • Displaying, monitoring and analyzing log files

SLIDE 45

2. Hadoop

Takeaways
  • By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
    – Automatic division of a job into tasks
    – Automatic partitioning and distribution of data
    – Automatic placement of computation near data
    – Recovery from failures & stragglers
  • Hadoop is an open-source implementation of MapReduce, enriched by many useful subprojects
  • The user focuses on the application, not on the complexity of distributed computing

SLIDE 46

0. Agenda

1. Why MapReduce?
   a. Comparison of Data Management Approaches
   b. Distributed Processing of Data
2. Apache Hadoop
   a. HDFS  b. MapReduce  c. Pig  d. Hive  e. HBase
3. Programming MapReduce
   a. MapReduce with Java
   b. Moving into the Cloud

SLIDE 47

3. Programming MapReduce

First steps with Hadoop

SLIDE 48

3. Programming – Getting started with Hadoop

Hadoop distributions
  • Apache Hadoop
  • Cloudera's Hadoop Distribution (recommended)

Installing Hadoop on Linux
  • Follow the CDH3 Installation Guide

Hadoop within a virtual machine
  • Cloudera's Hadoop Demo VMWare image
  • Ready-to-use Hadoop environment

Hadoop in the cloud
  • Amazon's Elastic MapReduce

SLIDE 49

3. Programming – New MapReduce API

Since Hadoop 0.20
  • OLD: org.apache.hadoop.mapred.*
  • NEW: org.apache.hadoop.mapreduce.*

Note
  • Many available examples are written using the old API
  • One should not mix both
  • Strongly recommended: the new API!

SLIDE 50

3. Programming – Running Jobs

Web Interfaces: HUE

SLIDE 51

3. Programming – Running Jobs

Web Interfaces: JobTracker

http://masterIP:50030/jobtracker.jsp

SLIDE 52

3. Programming – Running Jobs

Web Interfaces: NameNode

http://masterIP:50070/dfshealth.jsp

SLIDE 53

0. Agenda

1. Why MapReduce?
   a. Comparison of Data Management Approaches
   b. Distributed Processing of Data
2. Apache Hadoop
   a. HDFS  b. MapReduce  c. Pig  d. Hive  e. HBase
3. Programming MapReduce
   a. MapReduce with Java
   b. Moving into the Cloud

SLIDE 54

3. Programming – Elastic MapReduce

Amazon Elastic MapReduce
  • Provides a web-based interface and command-line tools for running Hadoop jobs on Amazon EC2
  • Data is stored in Amazon S3
  • Monitors the job and shuts down machines after use
  • Small extra charge on top of EC2 pricing

SLIDES 55–58

3. Programming – Elastic MapReduce

Elastic MapReduce Workflow

[Slides 55–58 show screenshots walking through the Elastic MapReduce job setup workflow]

SLIDE 59

4. Conclusion

Takeaways #2
  • The MapReduce programming model hides the complexity of work distribution and fault tolerance
  • Principal design philosophies:
    – Make it scalable, so you can add hardware easily
    – Make it cheap, lowering hardware, programming and admin costs
  • MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time
  • Cloud computing or Cloudera makes it straightforward to start using Hadoop at scale

SLIDE 60

4. Conclusion

Research@DBIS
  • Hadoop cluster with 10 machines & 30 TB storage
  • Distributed processing of semantic data:
    – Storing strategies for RDF graphs in HDFS, HBase, Cassandra
    – Mapping SPARQL queries to Pig or directly to MapReduce
    – Executing path queries with Pig or directly with MapReduce, for investigating e.g. social networks

SLIDE 61

4. Resources

Resources
  • Hadoop: http://hadoop.apache.org/core/
  • Pig: http://hadoop.apache.org/pig
  • Hive: http://hadoop.apache.org/hive
  • Cloudera's Distribution: http://www.cloudera.com/
  • Video tutorials: http://www.cloudera.com/hadoop-training
  • Amazon Web Services: http://aws.amazon.com/
SLIDE 62

4. Resources

Resources (2)
  • Hadoop: The Definitive Guide, Second Edition – Tom White, O'Reilly Media, 2010
  • Data-Intensive Text Processing with MapReduce – Jimmy Lin, Chris Dyer, Graeme Hirst; Morgan & Claypool Publishers, 2010
  • Cluster Computing and MapReduce Lecture Series – Google, 2007: available on YouTube
  • Verarbeiten großer Datenmengen mit Hadoop (Processing large amounts of data with Hadoop) – Oliver Fischer, Heise Developer, 2010: online

SLIDE 63

X. MapReduce – New API

New MapReduce API (additional slides)

SLIDE 64

X. MapReduce – New API

New API: Top-Level Changes
  • Methods can throw InterruptedException as well as IOException
  • Configuration instead of a JobConf object
  • Library classes are moved to mapreduce.lib.{input, map, output, partition, reduce}.*

SLIDE 65

X. MapReduce – New API

Mapper (old API → new API)

Map function
  • old: map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter)
  • new: map(K1 key, V1 value, Context context)

Close
  • old: close()
  • new: cleanup(Context context)

Output
  • old: output.collect(K, V)
  • new: context.write(K, V)

SLIDE 66

X. MapReduce – New API

MapRunnable

Using mapreduce.Mapper
  • old: void run(RecordReader<K1,V1> input, OutputCollector<K2,V2> output, Reporter reporter)
  • new: void run(Context context)

SLIDE 67

X. MapReduce – New API

Reducer & Combiner

Reduce function
  • old: void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output)
  • new: void reduce(K2 key, Iterable<V2> values, Context context)

Iteration
  • old: while (values.hasNext()) { V2 value = values.next(); … }
  • new: for (V2 value : values) { … }

SLIDE 68

X. MapReduce – New API

Submitting Jobs
  • JobConf + JobClient are replaced with Job

Job constructor
  • old: job = new JobConf(conf, MyMapper.class); job.setJobName("job name");
  • new: job = new Job(conf, "job name"); job.setJarByClass(MyMapper.class);
SLIDE 69

X. MapReduce – New API

Submitting Jobs (2)

Further properties
  • Job has getConfiguration()
  • FileInputFormat is in mapreduce.lib.input
  • FileOutputFormat is in mapreduce.lib.output

Execution
  • old: JobClient.runJob(job)
  • new: System.exit(job.waitForCompletion(true) ? 0 : 1)