SLIDE 1

Apache Hadoop, Big Data, and You

Philip Zeyliger
philip@cloudera.com
@philz42 @cloudera
November 18, 2009

Wednesday, November 18, 2009

SLIDE 2

Hi there!

Software Engineer Worked at

SLIDE 3

I work on stuff...

SLIDE 4

Outline

Why should you care? (Intro)
Challenging yesteryear’s assumptions
The MapReduce Model
HDFS, Hadoop Map/Reduce
The Hadoop Ecosystem
Questions

SLIDE 5

Data is everywhere. Data is important.

SLIDE 6

SLIDE 7

SLIDE 8

SLIDE 9

“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.”

Hal Varian (Google’s chief economist)

SLIDE 10

So, what’s Hadoop?

The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry

SLIDE 11

Apache Hadoop is an open-source system (written in Java!) to store and process gobs of data across many commodity computers.

The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry

SLIDE 12

Two Big Components

HDFS: Self-healing, high-bandwidth clustered storage.
Map/Reduce: Fault-tolerant distributed computing.

SLIDE 13

Challenging some of yesteryear’s assumptions...

SLIDE 14

Assumption 1: Machines can be reliable...

Image: MadMan the Mighty CC BY-NC-SA

SLIDE 15

Hadoop Goal: Separate distributed system fault-tolerance code from application logic.

Systems Programmers Statisticians

SLIDE 16

Assumption 2: Machines have identities...

Image: Laughing Squid CC BY-NC-SA

SLIDE 17

Hadoop Goal: Users should interact with clusters, not machines.

SLIDE 18

Assumption 3: A data set fits on one machine...

Image: Matthew J. Stinson CC BY-NC

SLIDE 19

Hadoop Goal: System should scale linearly (or better) with data size.

SLIDE 20

The M/R Programming Model

SLIDE 21

You specify map() and reduce() functions. The framework does the rest.

SLIDE 22

map()

map: (K₁, V₁) → list(K₂, V₂)

import java.io.IOException;

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  // Excerpt from Hadoop's Mapper; the nested Context class is omitted.

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // context.write() can be called many times.
    // This is the default "identity mapper" implementation.
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
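As an illustration of the map signature, here is a minimal sketch of a word-count style map function in plain Java. It does not use the Hadoop API: the class name WordCountMapSketch and the choice of a line offset as K₁ are assumptions made for this example only.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a word-count mapper; plain Java, no Hadoop classes.
class WordCountMapSketch {
    // map: (K1 = line offset, V1 = line text) -> list of (K2 = word, V2 = 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                // One input record may produce many output pairs.
                out.add(new SimpleEntry<>(word.toLowerCase(), 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(map(0L, "big data big"));
    }
}
```

The offset key is ignored here, which is typical for word counting: only the line text matters.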

SLIDE 23

(the shuffle)

Map output is assigned to a “reducer”.
Map output is sorted by key.
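The two shuffle steps can be sketched in memory as follows. This is a simplified illustration, not the framework's implementation: it assumes hash partitioning as the way to pick a reducer, and the names ShuffleSketch, partition, and shuffle are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory shuffle; plain Java, no Hadoop classes.
class ShuffleSketch {
    // Pick a reducer for a key by hashing; the mask keeps the result non-negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Route each (key, value) pair to its reducer and group values by key.
    // TreeMap keeps every reducer's keys in sorted order.
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Integer>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> kv : mapOutput) {
            reducers.get(partition(kv.getKey(), numReducers))
                    .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                    .add(kv.getValue());
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<>("is", 1));
        mapOutput.add(new SimpleEntry<>("data", 1));
        mapOutput.add(new SimpleEntry<>("data", 1));
        System.out.println(shuffle(mapOutput, 2));
    }
}
```

Because the partition depends only on the key, every value for a given key lands on the same reducer.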

SLIDE 24

reduce()

reduce: (K₂, iter(V₂)) → list(K₃, V₃)

import java.io.IOException;

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  // Excerpt from Hadoop's Reducer; the nested Context class is omitted.

  /**
   * This method is called once for each key. Most applications will define
   * their reduce class by overriding this method. The default implementation
   * is an identity function.
   */
  @SuppressWarnings("unchecked")
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    for (VALUEIN value : values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }
}
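As an illustration of the reduce signature, here is a minimal sketch of a summing reduce function in plain Java, the kind a word count would use. It does not use the Hadoop API, and the class name SumReduceSketch is an assumption made for this example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a summing reducer; plain Java, no Hadoop classes.
class SumReduceSketch {
    // reduce: (K2 = word, iter(V2) = counts) -> list of (K3 = word, V3 = total)
    static List<Map.Entry<String, Integer>> reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        Map.Entry<String, Integer> result = new SimpleEntry<>(key, sum);
        return Collections.singletonList(result);
    }

    public static void main(String[] args) {
        System.out.println(reduce("data", Arrays.asList(1, 1, 1)));
    }
}
```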

SLIDE 25

Putting it together...

Logical Physical

SLIDE 26

Some samples...

Build an inverted index.
Summarize data grouped by a key.
Build map tiles from geographic data.
OCR many images.
Learn ML models (e.g., Naive Bayes for text classification).
Augment traditional BI/DW technologies (by archiving raw data).
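To make the first sample concrete, here is a single-process sketch of an inverted index written in map/reduce style. It is only an illustration under stated assumptions: the class InvertedIndexSketch, the document IDs, and the tokenization on non-word characters are all hypothetical, and a real job would distribute the map and reduce steps across a cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical single-process inverted index; plain Java, no Hadoop classes.
class InvertedIndexSketch {
    // Input: docId -> text. Output: word -> sorted set of docIds containing it.
    static Map<String, SortedSet<String>> build(Map<String, String> docs) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // "map" step: emit one (word, docId) pair per word in the document.
            for (String word : doc.getValue().toLowerCase().split("\\W+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // "shuffle" and "reduce" steps: group docIds under their word.
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "Data is everywhere.");
        docs.put("d2", "Data is important.");
        System.out.println(build(docs));
    }
}
```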

SLIDE 27

There’s more than the Java API

Streaming: Perl, Python, Ruby, whatever. Uses stdin/stdout/stderr.
Pig: Higher-level dataflow language for easy ad-hoc analysis. Developed at Yahoo!
Hive: SQL interface. Great for analysts. Developed at Facebook.

Friday, @10:10

SLIDE 28

A typical look...

Commodity servers (8-core, 8-16 GB RAM, 4-12 TB, 2x 1 GbE NIC)
2-level network architecture
20-40 nodes per rack

SLIDE 29

The cast...

Starring...
NameNode (metadata server and database)
SecondaryNameNode (assistant to NameNode)
JobTracker (scheduler)

The Chorus…
DataNodes (block storage)
TaskTrackers (task execution)

Thanks to Zak Stone for earmuff image!

SLIDE 30

HDFS

[Diagram: a Namenode and Datanodes spread across one rack and a different rack. Legend: 3x64MB file, 3 rep; 4x64MB file, 3 rep; small file, 7 rep.]
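The storage cost of replication can be worked out with simple arithmetic. The sketch below assumes that "3x64MB file, 3 rep" means a file of three 64 MB blocks stored with three replicas of each block; the class HdfsStorageSketch and its methods are hypothetical helpers, not part of HDFS.

```java
// Hypothetical arithmetic helper; not part of HDFS.
class HdfsStorageSketch {
    static final long BLOCK_MB = 64; // block size taken from the slide

    // Number of blocks needed for a file, rounding up for a partial last block.
    static long blocks(long fileMb) {
        return (fileMb + BLOCK_MB - 1) / BLOCK_MB;
    }

    // Total block replicas the datanodes must hold.
    static long blockReplicas(long fileMb, int replication) {
        return blocks(fileMb) * replication;
    }

    // Raw disk space consumed across the cluster, in MB.
    static long rawMb(long fileMb, int replication) {
        return fileMb * replication;
    }

    public static void main(String[] args) {
        long[] fileSizesMb = {3 * BLOCK_MB, 4 * BLOCK_MB};
        for (long size : fileSizesMb) {
            System.out.println(size + " MB file, 3 rep: "
                    + blockReplicas(size, 3) + " block replicas, "
                    + rawMb(size, 3) + " MB raw");
        }
    }
}
```

Under these assumptions, the three-block file occupies 9 block replicas (576 MB raw) and the four-block file occupies 12 (768 MB raw).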

SLIDE 31

HDFS Write Path

SLIDE 32

HDFS Failures?

Datanode crash?
  Clients read another copy
  Background rebalance
Namenode crash?
  uh-oh

SLIDE 33

M/R

Tasktrackers run on the same machines as datanodes.
[Diagram: one rack and a different rack. Legend: job on stars; different job; idle.]

SLIDE 34

M/R

SLIDE 35

M/R Failures

Task fails
Try again?
Try again somewhere else?
Report failure
Retries possible because of idempotence
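The retry flow can be sketched as a loop that tries a task on one node after another and reports failure only when every attempt has failed. This is an illustration, not how the JobTracker is implemented: the class TaskRetrySketch, the node names, and the use of a Function as the task are assumptions. The loop is only safe because the task is assumed to be idempotent, so running it again has the same effect as running it once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical retry loop; plain Java, no Hadoop classes.
class TaskRetrySketch {
    // Try the task on each node in order; return the first successful result.
    static <T> T runWithRetries(List<String> nodes, Function<String, T> task) {
        RuntimeException lastFailure = null;
        for (String node : nodes) {
            try {
                return task.apply(node); // safe to repeat only if the task is idempotent
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure and try somewhere else
            }
        }
        // Every attempt failed: report the failure to the caller.
        throw new IllegalStateException("task failed on all nodes", lastFailure);
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-a", "node-b");
        String result = runWithRetries(nodes, node -> {
            if (node.equals("node-a")) {
                throw new RuntimeException("simulated crash on " + node);
            }
            return "done on " + node;
        });
        System.out.println(result);
    }
}
```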

SLIDE 36

Hadoop in the Wild

Yahoo! Hadoop Clusters: >82 PB, >25k machines (Eric14, HadoopWorld NYC ’09)
Google: 40 GB/s GFS read/write load (Jeff Dean, LADIS ’09) [~3,500 TB/day]
Facebook: 4 TB new data per day; DW: 4800 cores, 5.5 PB (Dhruba Borthakur, HadoopWorld)
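The bracketed daily figure for Google follows from the per-second rate. A quick check, assuming decimal units (1 TB = 1,000 GB) and a sustained load over a full day; the class ThroughputSketch is a hypothetical helper.

```java
// Hypothetical unit-conversion helper.
class ThroughputSketch {
    static final long SECONDS_PER_DAY = 86_400L;

    // Convert a sustained rate in GB/s to TB/day, using 1 TB = 1,000 GB.
    static long tbPerDay(long gbPerSecond) {
        return gbPerSecond * SECONDS_PER_DAY / 1_000L;
    }

    public static void main(String[] args) {
        System.out.println(tbPerDay(40) + " TB/day");
    }
}
```

40 GB/s works out to 3,456 TB/day, which the slide rounds to about 3,500.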

SLIDE 37

The Hadoop Ecosystem

HDFS (Hadoop Distributed File System)
HBase (Key-Value store)
MapReduce (Job Scheduling/Execution System)
Pig (Data Flow)
Hive (SQL)
BI Reporting
ETL Tools
Avro (Serialization)
ZooKeeper (Coordination)
Sqoop
RDBMS

SLIDE 38

Ok, fine, what next?

Get Hadoop!
  http://hadoop.apache.org/
  Cloudera Distribution for Hadoop
Try it out! (Locally, or on EC2)

SLIDE 39

Just one slide...

Software: Cloudera Distribution for Hadoop, Cloudera Desktop, more…
Training and certification…
Free on-line training materials (including video)
Support & Professional Services
@cloudera, blog, etc.

SLIDE 40

Questions?

philip@cloudera.com
