Hadoop: Scalable Infrastructure for Big Data
QCon London 2012


SLIDE 1

Hadoop: Scalable Infrastructure for Big Data
QCon London 2012

Parand Tony Darugar
Founder and CEO, Xpenser
parand@xpenser.com

SLIDE 2

What is Hadoop?

SLIDE 3

Hadoop is the Linux of Big Data Processing
SLIDE 4

Infrastructure for Large Scale Computation & Data Processing on a network of Commodity Hardware.

SLIDE 5

Why Hadoop?

SLIDE 6

Scale

SLIDE 7

Cost

SLIDE 8

Freedom

SLIDE 9

Does Anyone Use Hadoop?

SLIDE 10

IBM, VISA, Microsoft, Facebook, Yahoo, AOL, ... eHarmony, Zions Bank, NY Times, Twitter, eBay, LinkedIn, ...

SLIDE 11

Alternatives

  • Build your own
  • Get creative with RDBMS architecture

SLIDE 12

What's the idea?

SLIDE 13

  • Commodity Hardware
  • Distributed Operation

SLIDE 14

Wisdom:
  • Embrace Failure (hardware)
  • Be Resilient (software)

SLIDE 15

What's in the box?

SLIDE 16

Hadoop Distributed File System

SLIDE 17

Distributed Computation Framework

SLIDE 18

Map-Reduce Programming Model

SLIDE 19

HDFS

  • Your data in triplicate
  • Built-in resiliency to large-scale failures
  • Intelligent Data Distribution
  • Very large data sizes
SLIDE 20

Distributed Computation

  • Built-in resiliency to large-scale failures
  • Distribute work to workers, collect results from fastest
  • Move computation to data (not data to computation)
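The "collect results from fastest" bullet can be sketched in plain Python: run duplicate attempts of the same task and keep whichever finishes first, as Hadoop's speculative execution does. This is an illustrative sketch, not Hadoop code; all names here are hypothetical.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def task_attempt(attempt_id):
    # Simulate uneven commodity hardware: each attempt takes a random time.
    time.sleep(random.uniform(0.01, 0.05))
    return f"result-from-attempt-{attempt_id}"

# Launch three redundant attempts of the same task.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(task_attempt, i) for i in range(3)]
    # Return as soon as the first attempt completes; slower duplicates
    # would simply be discarded (or killed) by a real scheduler.
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    winner = next(iter(done)).result()
```

Because any attempt produces the same logical result, taking the fastest one masks slow or failing machines without changing the answer.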

SLIDE 21

Map Reduce

Very simple programming model:
  • Map(anything) -> key, value
  • Sort, partition on key
  • Reduce(key, value) -> key, value
  • No parallel processing or message passing semantics
  • Programmable in Java or any other language (streaming)
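The map -> sort/partition -> reduce pipeline above can be sketched as a local word count in plain Python. The function names (`mapper`, `reducer`, `run_job`) are illustrative, not part of any Hadoop API; on a real cluster the framework performs the sort/shuffle step between the two phases.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map(anything) -> (key, value): emit (word, 1) for each word."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key, values):
    """Reduce(key, values) -> (key, value): sum the counts for one key."""
    return (key, sum(values))

def run_job(lines):
    # Map phase: apply the mapper to every input record.
    pairs = [kv for line in lines for kv in mapper(line)]
    # Sort/partition on key (what the framework's shuffle does).
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(k, (v for _, v in group))
                for k, group in groupby(pairs, key=itemgetter(0)))

counts = run_job(["hadoop scales", "hadoop stores big data"])
```

With Hadoop Streaming, the same mapper and reducer logic can run as standalone scripts that read records on stdin and emit tab-separated key/value pairs on stdout, which is what makes the model "programmable in any language".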

SLIDE 22

Ecosystem

  • HBase: NoSQL BigTable clone
  • Hive: Somewhat-SQL data store
  • Pig: SQL-like programming model
  • Chukwa, Scribe, Mahout, Cassandra, Oozie, Sqoop, ...

SLIDE 23

Commercial Support

Cloudera, HortonWorks, IBM, ...

SLIDE 24

How?

  • Try it in non-distributed mode
  • Try it on a few spare machines
  • Try it on EC2
  • Try it! http://hadoop.apache.org/

SLIDE 25

Case Studies

SLIDE 26

eHarmony

SLIDE 27

Biz360 (Attensity)

SLIDE 28

Yahoo!

SLIDE 29

You!

SLIDE 30

Start with ETL

SLIDE 31

Start with batch, non time-critical tasks

SLIDE 32

Start with storing your large data in HDFS
SLIDE 33

  • Move batch processing to Hadoop
  • Serve from RDBMS

SLIDE 34

Embrace. Be One With The Hadoop.

SLIDE 35

Parand Tony Darugar
parand@xpenser.com

Questions?