Large-scale data processing [with Apache Hadoop] [and friends] [at BiG Grid]
Evert Lammerts
March 27, 2012, EGI Community Forum

Who's who?
BiG Grid
● Dutch NGI
SARA
● National center for academic computing & eScience
● Partner in BiG Grid
Me
● Consultant eScience & Cloud Services
● Lead Hadoop infrastructure
● Tech lead LifeWatch-NL

In this talk
● Working on scale (@ SARA & BiG Grid)
● An introduction to Hadoop & MapReduce
● Hadoop @ SARA & BiG Grid

SARA: the national center for scientific computing
Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization

Different types of computing
Parallelism
● Data parallelism
● Task parallelism
Architectures
● SIMD: Single Instruction Multiple Data
● MIMD: Multiple Instruction Multiple Data
● MISD: Multiple Instruction Single Data
● SISD: Single Instruction Single Data (Von Neumann)

Parallelism: Amdahl's law
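
Amdahl's law bounds the speedup of a program when only part of it can run in parallel: with a parallelizable fraction p and n processors, the speedup is at most 1 / ((1 - p) + p / n). A minimal Python sketch, assuming nothing beyond that formula; the function name and the sample fraction of 0.95 are illustrative choices, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Upper bound on speedup according to Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    # The serial part does not shrink, however many processors are added.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Illustrative assumption: 95% of the work can be parallelized.
    for n in (1, 8, 64, 512):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

With 5% serial work the speedup can never exceed 20x, no matter how many processors are added.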

Data parallelism

Compute @ SARA & BiG Grid

What's different about Hadoop?
No more do-it-yourself parallelism – it's hard!
But rather linearly scalable data parallelism
Separating the what from the how
The Datacenter as your computer (NYT, 14/06/2006)

A bit of history
Timeline (2002, 2004, 2004, 2006): Nutch*, MR/GFS**, Hadoop
* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html, http://labs.google.com/papers/gfs.html

2010 - 2012: A Hype in Production http://wiki.apache.org/hadoop/PoweredBy

Core principles
● Scale out, not up
● Move processing to the data
● Process data sequentially, avoid random reads
● Seamless scalability
(Jimmy Lin, University of Maryland / Twitter, 2011)

A typical data-parallel problem in abstraction
1. Iterate over a large number of records
2. Extract something of interest
3. Create an ordering in intermediate results
4. Aggregate intermediate results
5. Generate output
MapReduce: functional abstraction of step 2 & step 4
(Jimmy Lin, University of Maryland / Twitter, 2011)

MapReduce
Programmer specifies two functions
● map (k, v) → <k', v'>*
● reduce (k', [v']) → <k', v'>*
All values associated with a single key are sent to the same reducer
The framework handles the rest
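
The map/reduce contract can be illustrated with a single-process Python sketch. This is not Hadoop's API: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, word counting is chosen only as a familiar example, and the in-memory grouping stands in for the shuffle and sort that the framework would perform across machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(_key: int, line: str) -> Iterator[Tuple[str, int]]:
    """map(k, v): emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """reduce(k', values): sum all counts emitted for one word."""
    yield word, sum(counts)


def run_mapreduce(
    records: Iterable[Tuple[int, str]],
    mapper: Callable[[int, str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Iterable[Tuple[str, int]]],
) -> List[Tuple[str, int]]:
    """Toy driver: map every record, group by key, then reduce each group."""
    groups: Dict[str, List[int]] = defaultdict(list)
    # Map phase, followed by grouping all values that share a key.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Reduce phase: each key is handled exactly once, in sorted key order.
    output: List[Tuple[str, int]] = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_mapreduce(enumerate(lines), map_words, reduce_counts))
```

Only `map_words` and `reduce_counts` correspond to what the programmer writes; everything in `run_mapreduce` is the part the framework takes over.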

The rest? Scheduling, data distribution, ordering, synchronization, error handling...

An overview of a Hadoop cluster

The ecosystem
● Apache Pig: data-flow language
● HBase: key/value store
● Giraph: graph processing
● Hive, HCatalog, Elephantbird, and many, many others...

Data-processing as a commodity
● Cheap clusters
● Simple programming models
● Easy-to-learn scripting
Anybody with the know-how can generate insights!

Note: “the know-how” = Data Science: DevOps, programming, algorithms, domain knowledge

Timeline
2009: Piloting Hadoop on Cloud
2010: Test cluster available for scientists
  6 machines * 4 cores / 24 TB storage / 16 GB RAM
  Just me!
2011: Funding granted for production service
2012: Production cluster available (~March)
  72 machines * (8 cores / 8 TB storage / 64 GB RAM)
  Integration with Kerberos for secure multi-tenancy
  3 devops, team of consultants

Architecture

We already offer... Hadoop, Pig
We will offer... HBase, Hive, HCatalog, Oozie, and probably more...

What is it being used for?
● Information Retrieval
● Natural Language Processing
● Machine Learning
● Econometrics
● Bioinformatics
● Ecoinformatics
● Collaboration with industry!

Machine learning: Infrawatch, Hollandse Brug

Structural health monitoring
145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data
(Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
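
The product on this slide can be worked out directly. A short Python sketch, using only the factors stated above; the constant and function names are illustrative, and the result is a count of sensor readings, since the slide gives no size per reading.

```python
SENSORS = 145                    # number of sensors, from the slide
SAMPLE_RATE_HZ = 100             # readings per sensor per second
SECONDS_PER_DAY = 60 * 60 * 24   # seconds x minutes x hours
DAYS_PER_YEAR = 365


def readings(days: int) -> int:
    """Total number of sensor readings collected over the given number of days."""
    return SENSORS * SAMPLE_RATE_HZ * SECONDS_PER_DAY * days


if __name__ == "__main__":
    print(f"per second: {SENSORS * SAMPLE_RATE_HZ:,}")
    print(f"per day:    {readings(1):,}")
    print(f"per year:   {readings(DAYS_PER_YEAR):,}")
```

That comes to 14,500 readings per second and roughly 457 billion readings per year.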

NLP & IR
● e.g. ClueWeb: a ~13.4 TB webcrawl
● e.g. Twitter gardenhose data
● e.g. Wikipedia dumps
● e.g. del.icio.us & Flickr tags
● Finding named entities: [person company place] names
● Creating inverted indexes
● Piloting real-time search
● Personalization
● Semantic web

How do we embrace Hadoop?
● Parallelism has never been easy… so we teach!
  ● December 2010: hackathon (~50 participants, full)
  ● April 2011: Workshop for Bioinformaticians
  ● November 2011: 2 day PhD course (~60 participants, full)
  ● June 2012: 1 day PhD course
● The data scientist is still in school... so we fill the gap!
  ● Devops maintain the system, fix bugs, develop new functionality
  ● Technical consultants learn how to efficiently implement algorithms
  ● Users bring domain knowledge
● Methods are developing faster than light (don't quote me :) )... so we build the community!
  ● Netherlands Hadoop User Group

Final thoughts
● Hadoop is the first to provide commodity computing
● Hadoop is not the only one
● Hadoop is probably not the best
● Hadoop has momentum
● What degree of diversification of infrastructure should we embrace?
● MapReduce fits surprisingly well as a programming model for data-parallelism
● Where is the data scientist?
● Teach. Teach. Work together.

Any questions? evert.lammerts@sara.nl @eevrt @sara_nl
