Introduction to Hadoop



SLIDE 1

Introduction to Hadoop

SLIDE 2

Distributed Data Processing

The idea of distributed databases is older than you might think

Richard Peebles, Eric G. Manning: A Computer Architecture for Large (Distributed) Data Bases. VLDB 1975: 405-427

Distributed data structures and algorithms have always been around

So, what is new?

SLIDE 3

Distributed Data Processing

A cluster of machines
Big input data
Final output

Data partitioning
Load balancing
Fault tolerance
Synchronization

SLIDE 4

MapReduce

A programming paradigm for expressing distributed algorithms

Introduced by Google in 2004
  Google File System for distributed storage
  Google MapReduce for distributed processing

Hadoop is the open source counterpart, released in 2007, with Yahoo! as the main contributor
  HDFS for distributed storage
  Hadoop MapReduce for distributed processing

SLIDE 5

Hadoop Overview

Master node: Name node, Resource manager
Slave nodes (six shown): each runs a Data node and a Node manager

SLIDE 6

HDFS Loading

Input file (600 MB) is split into HDFS blocks: 128 MB, 128 MB, 128 MB, 128 MB, 88 MB
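The block arithmetic on this slide can be checked with a short sketch. The function name `split_into_blocks` is hypothetical and the 128 MB block size is simply the value shown on the slide; this only reproduces the size calculation, not how HDFS actually writes data.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


blocks = split_into_blocks(600)
print(blocks)  # [128, 128, 128, 128, 88]
```

Only the last block is smaller than the block size; a file that is an exact multiple of 128 MB would have no short block.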

SLIDE 7

HDFS Storage

[Figure: 15 blocks (B) stored across the data nodes]
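The 15 blocks in the figure would be consistent with the five blocks of the 600 MB file each being stored three times, although the slide does not state a replication factor. The sketch below assumes a replication factor of 3 (the HDFS default) and the six data nodes from the overview slide. The round-robin placement and the name `place_replicas` are illustrative assumptions, not the placement policy HDFS actually uses.

```python
def place_replicas(num_blocks, num_nodes, replication=3):
    """Assign each block's replicas to distinct nodes in round-robin order."""
    if replication > num_nodes:
        raise ValueError("need at least as many nodes as replicas")
    placement = {node: [] for node in range(num_nodes)}
    for block in range(num_blocks):
        for r in range(replication):
            # Consecutive offsets guarantee the replicas land on different nodes
            node = (block + r) % num_nodes
            placement[node].append(block)
    return placement


placement = place_replicas(num_blocks=5, num_nodes=6)
total_copies = sum(len(stored) for stored in placement.values())
print(total_copies)  # 15
```

Because every block lives on several nodes, losing one node leaves at least two copies of each block it held.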

SLIDE 8

Hadoop MapReduce

A kind of functional programming

A program is expressed in two functions only, map and reduce

Map: r → {(k,v)}
  Takes as input one record and returns zero or more <key, value> pairs

Reduce: (k,{v}) → {a}
  Takes one key and all its associated values and returns zero or more output values
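The two signatures can be made concrete with a single-process sketch. The driver `run_mapreduce` and the toy job are hypothetical and only illustrate the model: map emits pairs, the pairs are grouped by key, and reduce sees each key with all of its values. None of this is Hadoop's API, and nothing here is distributed.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group the emitted pairs by key, then reduce, in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):  # zero or more pairs per record
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))  # zero or more values per key
    return output


# Illustrative job: total length of the records, grouped by first letter
def map_fn(record):
    if record:  # an empty record emits nothing
        yield record[0], len(record)


def reduce_fn(key, values):
    yield key, sum(values)


result = run_mapreduce(["apple", "avocado", "banana", ""], map_fn, reduce_fn)
print(result)  # [('a', 12), ('b', 6)]
```

The grouping step in the middle is what the programmer does not write in this model; the framework performs it between the two user-supplied functions.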

SLIDE 9

Example: Word Count

Input text file:

if you cannot fly, then run, if you cannot run, then walk, if you cannot walk, then crawl, but whatever you do you have to keep moving forward

Output:

you: 5
cannot: 3
walk: 2
if: 3
…

Map(line) {
  split line into words
  for each word w
    output(w, 1)
}

Reduce(w, c[]) {
  s = Sum(c)
  output(w, s)
}
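A runnable version of the word count pseudocode, as a single-process sketch rather than a Hadoop job. The names `map_words`, `reduce_counts`, and `word_count` are hypothetical. Stripping punctuation and lowercasing are assumptions, since the slide only says "split line into words".

```python
import re
from collections import defaultdict

TEXT = ("if you cannot fly, then run, if you cannot run, then walk, "
        "if you cannot walk, then crawl, but whatever you do "
        "you have to keep moving forward")


def map_words(line):
    # Emit (word, 1) for every word; punctuation is dropped (an assumption)
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def reduce_counts(word, counts):
    # Sum all the 1s emitted for this word
    return word, sum(counts)


def word_count(lines):
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            grouped[word].append(one)  # group values by key before reducing
    return dict(reduce_counts(w, c) for w, c in grouped.items())


counts = word_count([TEXT])
print(counts["you"], counts["cannot"], counts["walk"], counts["if"])  # 5 3 2 3
```

Without the punctuation handling, a plain whitespace split would count "walk," rather than "walk", so the keys would not match the output shown on the slide.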

SLIDE 10

Hadoop Operation Modes

Standalone mode: one JRE instance
Pseudo-distributed mode: Name node, Resource manager, Data node, and Node manager, all on a single machine
Cluster mode: RM (Resource manager), NN (Name node), NM (Node manager), DN (Data node)
