SLIDE 1 Large-scale data processing [with Apache Hadoop] [and friends] [at BiG Grid]
Evert Lammerts March 27, 2012, EGI Community Forum
SLIDE 5 Who's who?
BiG Grid
SARA
- National center for academic computing & eScience
- Partner in BiG Grid
Me
- Consultant eScience & Cloud Services
- Lead Hadoop infrastructure
- Tech lead LifeWatch-NL
SLIDE 6 In this talk
- Working on scale ( @ SARA & BiG Grid )
- An introduction to Hadoop & MapReduce
- Hadoop @ SARA & BiG Grid
SLIDE 7
- Working on scale ( @ SARA & BiG Grid )
- An introduction to Hadoop & MapReduce
- Hadoop @ SARA & BiG Grid
SLIDE 8
SARA: the national center for scientific computing
Facilitating Science in The Netherlands with Equipment for and Expertise on Large-Scale Computing, Large-Scale Data Storage, High-Performance Networking, eScience, and Visualization
SLIDE 9 Different types of computing
Parallelism
- Data parallelism
- Task parallelism
Architectures
- SIMD: Single Instruction Multiple Data
- MIMD: Multiple Instruction Multiple Data
- MISD: Multiple Instruction Single Data
- SISD: Single Instruction Single Data (Von Neumann)
SLIDE 10
Parallelism: Amdahl's law
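The figure for this slide did not survive extraction. As a reminder, Amdahl's law bounds the speedup of a program whose parallelizable fraction is p, run on n workers, by 1 / ((1 - p) + p / n). A minimal Python sketch follows; the function name `amdahl_speedup` and the example numbers are illustrative and not taken from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup according to Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative only: a job that is 95% parallelizable.
    for n in (1, 8, 72, 1000):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

Even with 95% of the work parallelizable, the speedup can never exceed 20, because the remaining 5% still runs serially.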
SLIDE 11
Data parallelism
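The figure for this slide is missing as well. In data parallelism the same operation is applied independently to different partitions of the data, and the partial results are combined afterwards. A small sketch, assuming Python's standard `concurrent.futures` module; the helper names (`partition`, `sum_of_squares`, `parallel_sum_of_squares`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts roughly equal, contiguous chunks."""
    size, rest = divmod(len(records), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rest else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The same operation, applied independently to every chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(records, n_parts=4):
    chunks = partition(records, n_parts)
    # Threads keep the sketch simple; CPU-bound work in CPython would
    # normally use processes or separate machines instead.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    # Combine the partial results into the final answer.
    return sum(partial_sums)
```

Because no chunk depends on another, adding workers (or machines) only changes how the chunks are scheduled, not the result.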
SLIDE 12
Compute @ SARA & BiG Grid
SLIDE 13 What's different about Hadoop?
- No more do-it-yourself parallelism: it's hard!
- But rather linearly scalable data parallelism
- Separating the what from the how
- The Datacenter as your computer
(NYT, 14/06/2006)
SLIDE 14
- Working on scale ( @ SARA & BiG Grid )
- An introduction to Hadoop & MapReduce
- Hadoop @ SARA & BiG Grid
SLIDE 15 A bit of history
- 2002: Nutch*
- 2004: MR/GFS**
- 2006: Hadoop
* http://nutch.apache.org/
** http://labs.google.com/papers/mapreduce.html
   http://labs.google.com/papers/gfs.html
SLIDE 16
http://wiki.apache.org/hadoop/PoweredBy
2010 - 2012: A Hype in Production
SLIDE 17 Core principles
- Scale out, not up
- Move processing to the data
- Process data sequentially, avoid random reads
- Seamless scalability
(Jimmy Lin, University of Maryland / Twitter, 2011)
SLIDE 18 A typical data-parallel problem in abstraction
1. Iterate over a large number of records
2. Extract something of interest
3. Create an ordering in intermediate results
4. Aggregate intermediate results
5. Generate output
MapReduce: functional abstraction of step 2 & step 4
(Jimmy Lin, University of Maryland / Twitter, 2011)
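The five steps can be sketched on a single machine. The example below is a hypothetical word count in plain Python (not Hadoop code), with each step marked in a comment; the function name `word_count` is an illustration only.

```python
from itertools import groupby
from operator import itemgetter


def word_count(lines):
    intermediate = []
    # Step 1: iterate over a large number of records (here: lines of text).
    for line in lines:
        # Step 2: extract something of interest (here: every word).
        for word in line.lower().split():
            intermediate.append((word, 1))

    # Step 3: create an ordering in the intermediate results,
    # so that equal keys end up next to each other.
    intermediate.sort(key=itemgetter(0))

    # Step 4: aggregate the intermediate results per key.
    totals = {
        word: sum(count for _, count in pairs)
        for word, pairs in groupby(intermediate, key=itemgetter(0))
    }

    # Step 5: generate output.
    return totals
```

Steps 2 and 4 are the only problem-specific parts; steps 1, 3 and 5 look the same for almost any data-parallel job.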
SLIDE 19 MapReduce
Programmer specifies two functions
- map(k, v) → <k', v'>*
- reduce(k', [v']) → <k', v'>*
All values associated with a single key are sent to the same reducer
The framework handles the rest
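To make the contract concrete, here is a toy, single-process stand-in for the framework. It is a sketch only and not the Hadoop API: `run_mapreduce`, `max_temp_mapper` and `max_temp_reducer` are hypothetical names, and the 'year,temperature' input format is assumed for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy, in-memory imitation of map, shuffle and reduce."""
    # Map phase: every (k, v) record may emit any number of (k', v') pairs.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Shuffle: all values for one key end up in the same group,
            # i.e. they would be sent to the same reducer.
            groups[out_key].append(out_value)

    # Reduce phase: keys are handed to the reducer in sorted order.
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output


def max_temp_mapper(offset, line):
    """Input value is a 'year,temperature' line; the input key is ignored."""
    year, temperature = line.split(",")
    yield year, int(temperature)


def max_temp_reducer(year, temperatures):
    """Receives every temperature emitted for one year."""
    yield year, max(temperatures)
```

For example, `run_mapreduce([(0, "2010,12"), (1, "2011,7"), (2, "2010,30")], max_temp_mapper, max_temp_reducer)` returns `[("2010", 30), ("2011", 7)]`. In a real cluster the grouping and sorting happen across machines, which is the part the programmer no longer writes.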
SLIDE 20
The rest? Scheduling, data distribution, ordering, synchronization, error handling...
SLIDE 21
An overview of a Hadoop cluster
SLIDE 22 The ecosystem
- Apache Pig: data-flow language
- HBase: key/value store
- Giraph: graph processing
- Hive, HCatalog, Elephantbird, and many, many more
SLIDE 23 Data-processing as a commodity
- Cheap Clusters
- Simple programming models
- Easy-to-learn scripting
Anybody with the know-how can generate insights!
SLIDE 24 Note: “the know-how” = Data Science
- DevOps
- Programming algorithms
- Domain knowledge
SLIDE 25
- Working on scale ( @ SARA & BiG Grid )
- An introduction to Hadoop & MapReduce
- Hadoop @ SARA & BiG Grid
SLIDE 26
Timeline
- 2009: Piloting Hadoop on Cloud
- 2010: Test cluster available for scientists
  - 6 machines * 4 cores / 24 TB storage / 16GB RAM
  - Just me!
- 2011: Funding granted for production service
- 2012: Production cluster available (~March)
  - 72 machines * (8 cores / 8 TB storage / 64GB RAM)
  - Integration with Kerberos for secure multi-tenancy
  - 3 devops, team of consultants
SLIDE 27
Architecture
SLIDE 28
We already offer... Hadoop, Pig
We will offer... HBase, Hive, HCatalog, Oozie, and probably more...
SLIDE 29 What is it being used for?
- Information Retrieval
- Natural Language Processing
- Machine Learning
- Econometrics
- Bioinformatics
- Ecoinformatics
- Collaboration with industry!
SLIDE 30
Machine learning: Infrawatch, Hollandse Brug
SLIDE 31 Structural health monitoring
145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data
(Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
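Multiplying out the figures on the slide gives the number of measurements per year. A quick check in Python (the variable names are illustrative; the slide gives only the factors, not a size per measurement):

```python
SENSORS = 145
SAMPLE_RATE_HZ = 100
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# One measurement per sensor per sample, for a full year.
measurements_per_year = SENSORS * SAMPLE_RATE_HZ * SECONDS_PER_YEAR
print(f"{measurements_per_year:,} measurements per year")
# prints: 457,272,000,000 measurements per year
```

That is roughly 4.6 x 10^11 measurements per year from a single bridge.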
SLIDE 32 NLP & IR
- e.g. ClueWeb: a ~13.4 TB webcrawl
- e.g. Twitter gardenhose data
- e.g. Wikipedia dumps
- e.g. del.icio.us & Flickr tags
- Finding named entities: [person company place] names
- Creating inverted indexes
- Piloting real-time search
- Personalization
- Semantic web
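As an illustration of one item in the list above, an inverted index maps each term to the documents that contain it, which fits the map/reduce pattern naturally. The sketch below is plain, single-process Python rather than the code used at SARA; all function names and the whitespace tokenization are assumptions.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_postings(term, doc_ids):
    """Collect the sorted list of documents (the postings) for a term."""
    return term, sorted(doc_ids)


def build_inverted_index(documents):
    """documents: dict mapping doc_id to text."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            # Group by term, as the shuffle phase would.
            grouped[term].append(emitted_id)
    return dict(reduce_postings(term, ids) for term, ids in grouped.items())
```

For example, `build_inverted_index({1: "Hadoop scales out", 2: "Pig runs on Hadoop"})["hadoop"]` returns `[1, 2]`.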
SLIDE 33 How do we embrace Hadoop?
- Parallelism has never been easy… so we teach!
  - December 2010: hackathon (~50 participants, full)
  - April 2011: workshop for bioinformaticians
  - November 2011: 2-day PhD course (~60 participants, full)
  - June 2012: 1-day PhD course
- The data scientist is still in school... so we fill the gap!
  - Devops maintain the system, fix bugs, develop new functionality
  - Technical consultants learn how to efficiently implement algorithms
  - Users bring domain knowledge
- Methods are developing faster than light (don't quote me :)... so we build the community!
  - Netherlands Hadoop User Group
SLIDE 34 Final thoughts
- Hadoop is the first to provide commodity computing
- Hadoop is not the only one
- Hadoop is probably not the best
- Hadoop has momentum
- What degree of diversification of infrastructure should we embrace?
- MapReduce fits surprisingly well as a programming model for data-parallelism
- Where is the data scientist?
- Teach. Teach. Work together.
SLIDE 35
Any questions? evert.lammerts@sara.nl @eevrt @sara_nl