

slide-1
SLIDE 1

Large-scale data processing [with Apache Hadoop] [and friends] [at BiG Grid]

Evert Lammerts March 27, 2012, EGI Community Forum

slide-2
SLIDE 2

Who's who?

slide-3
SLIDE 3

Who's who?

BiG Grid

  • Dutch NGI
slide-4
SLIDE 4

Who's who?

BiG Grid

  • Dutch NGI

SARA

  • National center for academic computing & eScience
  • Partner in BiG Grid
slide-5
SLIDE 5

Who's who?

BiG Grid

  • Dutch NGI

SARA

  • National center for academic computing & eScience
  • Partner in BiG Grid

Me

  • Consultant eScience & Cloud Services
  • Lead Hadoop infrastructure
  • Tech lead LifeWatch-NL
slide-6
SLIDE 6

In this talk

  • Working on scale ( @ SARA & BiG Grid )
  • An introduction to Hadoop & MapReduce
  • Hadoop @ SARA & BiG Grid
slide-7
SLIDE 7

  • Working on scale ( @ SARA & BiG Grid )
  • An introduction to Hadoop & MapReduce
  • Hadoop @ SARA & BiG Grid

slide-8
SLIDE 8

SARA: the national center for scientific computing

Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization

slide-9
SLIDE 9

Different types of computing

Parallelism

  • Data parallelism
  • Task parallelism

Architectures

  • SIMD: Single Instruction Multiple Data
  • MIMD: Multiple Instruction Multiple Data
  • MISD: Multiple Instruction Single Data
  • SISD: Single Instruction Single Data (Von Neumann)
slide-10
SLIDE 10

Parallelism: Amdahl's law
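
Only the title of this slide survives. As a reminder, Amdahl's law bounds the speedup of a job by the share of it that cannot be parallelized: with a parallelizable fraction p and n workers, the speedup is at most 1 / ((1 - p) + p / n). The sketch below simply evaluates that formula; the function name and the 95% figure are illustrative choices, not taken from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of a job can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative numbers only: a job that is 95% parallelizable.
    for n in (1, 8, 72, 1_000_000):
        print(f"{n:>9} workers -> speedup {amdahl_speedup(0.95, n):.2f}")
```

However many workers are added, the 5% serial part keeps the speedup below 20 in this example.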

slide-11
SLIDE 11

Data parallelism

slide-12
SLIDE 12

Compute @ SARA & BiG Grid

slide-13
SLIDE 13

What's different about Hadoop?

  • No more do-it-yourself parallelism – it's hard!
  • But rather linearly scalable data parallelism
  • Separating the what from the how
  • The Datacenter as your computer

(NYT, 14/06/2006)

slide-14
SLIDE 14

  • Working on scale ( @ SARA & BiG Grid )
  • An introduction to Hadoop & MapReduce
  • Hadoop @ SARA & BiG Grid

slide-15
SLIDE 15

A bit of history

  • 2002: Nutch*
  • 2004: MR/GFS**
  • 2006: Hadoop

* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html, http://labs.google.com/papers/gfs.html

slide-16
SLIDE 16

2010 - 2012: A Hype in Production

http://wiki.apache.org/hadoop/PoweredBy

slide-17
SLIDE 17

Core principles

  • Scale out, not up
  • Move processing to the data
  • Process data sequentially, avoid random reads
  • Seamless scalability

(Jimmy Lin, University of Maryland / Twitter, 2011)

slide-18
SLIDE 18

A typical data-parallel problem in abstraction

  1. Iterate over a large number of records
  2. Extract something of interest
  3. Create an ordering in intermediate results
  4. Aggregate intermediate results
  5. Generate output

MapReduce: functional abstraction of step 2 & step 4

(Jimmy Lin, University of Maryland / Twitter, 2011)
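
The five steps can be sketched on a single machine in a few lines of Python. This is only an illustration of the pattern, not Hadoop code: the input format (a key and a number per line) and the choice of "maximum per key" as the aggregate are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def max_per_key(lines):
    # Steps 1 and 2: iterate over the records and extract a (key, value) pair.
    pairs = []
    for line in lines:
        key, value = line.split()
        pairs.append((key, float(value)))
    # Step 3: order the intermediate results so that equal keys sit together.
    pairs.sort(key=itemgetter(0))
    # Steps 4 and 5: aggregate each group and emit one output entry per key.
    return {
        key: max(value for _, value in group)
        for key, group in groupby(pairs, key=itemgetter(0))
    }


if __name__ == "__main__":
    # Hypothetical records, invented for illustration.
    print(max_per_key(["s1 3.0", "s2 7.5", "s1 4.5", "s3 1.0", "s2 2.5"]))
```

Only the extraction in step 2 and the aggregation in step 4 are specific to the problem; the iteration, ordering and output steps look the same for every job.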

slide-19
SLIDE 19

MapReduce

Programmer specifies two functions

  • map(k, v) → <k', v'>*
  • reduce(k', [v']) → <k', v'>*

All values associated with a single key are sent to the same reducer

The framework handles the rest
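
A toy, single-process version of this contract, using word count as the example. The names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical and this is not the Hadoop API; the runner only stands in for the grouping that the framework would do across a cluster.

```python
from collections import defaultdict


def map_fn(key, value):
    """map(k, v) -> <k', v'>*: emit (word, 1) for every word in a line."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """reduce: receives one key with all of its values, emits the total."""
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):  # keys are handed to the reducer in sorted order
        output.extend(reducer(out_key, groups[out_key]))
    return output


if __name__ == "__main__":
    # Hypothetical input: (line number, line text) pairs.
    lines = [(0, "scale out not up"), (1, "scale out")]
    print(run_mapreduce(lines, map_fn, reduce_fn))
```

The programmer writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is the part a real framework takes over.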

slide-20
SLIDE 20

The rest? Scheduling, data distribution, ordering, synchronization, error handling...
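
One piece of "data distribution" can be illustrated with a hash partitioner: hashing the key decides which reducer receives a pair, so equal keys always end up together. The sketch below is an assumption-laden stand-in, not Hadoop's implementation; it uses `zlib.crc32` as the hash because Python's built-in `hash()` of strings differs between runs.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key; the same key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Distribute (key, value) pairs over one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


if __name__ == "__main__":
    # Hypothetical mapper output, invented for illustration.
    pairs = [("hadoop", 1), ("pig", 1), ("hadoop", 1), ("hive", 1), ("pig", 1)]
    for reducer_id, bucket in enumerate(shuffle(pairs, num_reducers=3)):
        print(reducer_id, bucket)
```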

slide-21
SLIDE 21

An overview of a Hadoop cluster

slide-22
SLIDE 22

The ecosystem

  • Apache Pig: data-flow language
  • HBase: key/value store
  • Giraph: graph processing
  • Hive, HCatalog, Elephantbird, and many, many others...
slide-23
SLIDE 23

Data-processing as a commodity

  • Cheap Clusters
  • Simple programming models
  • Easy-to-learn scripting

Anybody with the know-how can generate insights!

slide-24
SLIDE 24

Note: “the know-how” = Data Science

  • DevOps
  • Programming algorithms
  • Domain knowledge

slide-25
SLIDE 25

  • Working on scale ( @ SARA & BiG Grid )
  • An introduction to Hadoop & MapReduce
  • Hadoop @ SARA & BiG Grid

slide-26
SLIDE 26

Timeline

  • 2009: Piloting Hadoop on Cloud
  • 2010: Test cluster available for scientists
      • 6 machines * 4 cores / 24 TB storage / 16GB RAM
      • Just me!
  • 2011: Funding granted for production service
  • 2012: Production cluster available (~March)
      • 72 machines * (8 cores / 8 TB storage / 64GB RAM)
      • Integration with Kerberos for secure multi-tenancy
      • 3 devops, team of consultants
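
For a sense of scale, the 2012 production figures can be multiplied out. The per-machine numbers come from the slide; the replication factor of 3 is an assumption (it is the HDFS default, but the slide does not say what this cluster uses), so the usable-storage figure is only indicative.

```python
MACHINES = 72
CORES_PER_MACHINE = 8
STORAGE_TB_PER_MACHINE = 8
RAM_GB_PER_MACHINE = 64
REPLICATION = 3  # assumption: HDFS default replication, not stated on the slide


def cluster_totals(machines=MACHINES, replication=REPLICATION):
    """Aggregate per-machine resources into cluster-wide totals."""
    raw_storage_tb = machines * STORAGE_TB_PER_MACHINE
    return {
        "cores": machines * CORES_PER_MACHINE,
        "ram_gb": machines * RAM_GB_PER_MACHINE,
        "raw_storage_tb": raw_storage_tb,
        "usable_storage_tb": raw_storage_tb / replication,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```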

slide-27
SLIDE 27

Architecture

slide-28
SLIDE 28

We already offer... Hadoop, Pig

We will offer... HBase, Hive, HCatalog, Oozie, and probably more...

slide-29
SLIDE 29

What is it being used for?

  • Information Retrieval
  • Natural Language Processing
  • Machine Learning
  • Econometrics
  • Bioinformatics
  • Ecoinformatics
  • Collaboration with industry!
slide-30
SLIDE 30

Machine learning: Infrawatch, Hollandse Brug

slide-31
SLIDE 31

Structural health monitoring

145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data

(Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
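
Worked out, the product on this slide comes to roughly 4.6 x 10^11 samples per year. The sketch below does the multiplication; the 8 bytes per sample is purely an assumption to turn the count into a volume, since the slide does not give a sample size.

```python
SENSORS = 145
SAMPLE_RATE_HZ = 100
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
BYTES_PER_SAMPLE = 8  # assumption: one 64-bit value per sample, not from the slide


def samples_per_year(sensors=SENSORS, rate_hz=SAMPLE_RATE_HZ):
    """Number of sensor readings collected in one (non-leap) year."""
    return sensors * rate_hz * SECONDS_PER_YEAR


def terabytes_per_year(bytes_per_sample=BYTES_PER_SAMPLE):
    """Raw volume in TB (10^12 bytes) under the assumed sample size."""
    return samples_per_year() * bytes_per_sample / 1e12


if __name__ == "__main__":
    print(f"{samples_per_year():,} samples per year")
    print(f"~{terabytes_per_year():.2f} TB per year at {BYTES_PER_SAMPLE} bytes per sample")
```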

slide-32
SLIDE 32

NLP & IR

  • e.g. ClueWeb: a ~13.4 TB webcrawl
  • e.g. Twitter gardenhose data
  • e.g. Wikipedia dumps
  • e.g. del.icio.us & Flickr tags
  • Finding named entities: [person company place] names
  • Creating inverted indexes
  • Piloting real-time search
  • Personalization
  • Semantic web
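
As an illustration of one item in the list above, an inverted index maps each term to the documents that contain it. The sketch below builds one in memory; the function name, the whitespace tokenization and the sample documents are assumptions for the example, and a real job over a web crawl would spread the same emit-and-collect logic over a cluster.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():  # "map" side: emit (term, doc_id)
            postings[term].add(doc_id)     # "reduce" side: collect ids per term
    return {term: sorted(ids) for term, ids in postings.items()}


if __name__ == "__main__":
    # Hypothetical documents, invented for illustration.
    docs = {"d1": "Hadoop at SARA", "d2": "Pig on Hadoop"}
    index = build_inverted_index(docs)
    print(index["hadoop"])
```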
slide-33
SLIDE 33

How do we embrace Hadoop?

  • Parallelism has never been easy… so we teach!
      • December 2010: hackathon (~50 participants - full)
      • April 2011: Workshop for Bioinformaticians
      • November 2011: 2 day PhD course (~60 participants – full)
      • June 2012: 1 day PhD course
  • The data scientist is still in school... so we fill the gap!
      • Devops maintain the system, fix bugs, develop new functionality
      • Technical consultants learn how to efficiently implement algorithms
      • Users bring domain knowledge
  • Methods are developing faster than light (don't quote me :)... so we build the community!
      • Netherlands Hadoop User Group
slide-34
SLIDE 34

Final thoughts

  • Hadoop is the first to provide commodity computing
  • Hadoop is not the only
  • Hadoop is probably not the best
  • Hadoop has momentum
  • What degree of diversification of infrastructure should we embrace?
  • MapReduce fits surprisingly well as a programming model for data-parallelism

  • Where is the data scientist?
  • Teach. Teach. Work together.
slide-35
SLIDE 35

Any questions?

evert.lammerts@sara.nl
@eevrt
@sara_nl