Large-scale data processing [with Apache Hadoop] [and friends] [at BiG Grid]



  1. Large-scale data processing [with Apache Hadoop] [and friends] [at BiG Grid] Evert Lammerts March 27, 2012, EGI Community Forum

  2. Who's who?

  3. Who's who? BiG Grid ● Dutch NGI

  4. Who's who? BiG Grid ● Dutch NGI SARA ● National center for academic computing & eScience ● Partner in BiG Grid

  5. Who's who? BiG Grid ● Dutch NGI SARA ● National center for academic computing & eScience ● Partner in BiG Grid Me ● Consultant eScience & Cloud Services ● Lead Hadoop infrastructure ● Tech lead LifeWatch-NL

  6. In this talk ● Working on scale ( @ SARA & BiG Grid ) ● An introduction to Hadoop & MapReduce ● Hadoop @ SARA & BiG Grid

  7. Working on scale ( @ SARA & BiG Grid ) An introduction to Hadoop & MapReduce Hadoop @ SARA & BiG Grid

  8. SARA: the national center for scientific computing. Facilitating science in the Netherlands with equipment for and expertise on large-scale computing, large-scale data storage, high-performance networking, eScience, and visualization

  9. Different types of computing Parallelism ● Data parallelism ● Task parallelism Architectures ● SIMD: Single Instruction Multiple Data ● MIMD: Multiple Instruction Multiple Data ● MISD: Multiple Instruction Single Data ● SISD: Single Instruction Single Data (Von Neumann)

  10. Parallelism: Amdahl's law
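
Amdahl's law bounds the speedup of a program when only part of its runtime can be parallelized: with a parallel fraction p and n workers, the speedup is at most 1 / ((1 - p) + p / n). The sketch below is a minimal illustration; the function name `amdahl_speedup` and the 95% workload are assumptions made for this example, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the runtime
    can be spread over `workers` and the remainder stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical workload in which 95% of the runtime parallelizes.
    for n in (1, 8, 72, 1024):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

However many workers are added, the speedup for this hypothetical workload stays below 1 / 0.05 = 20, because the serial 5% never shrinks.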

  11. Data parallelism
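
Data parallelism means applying the same operation to different pieces of the data at the same time and then combining the partial results. The sketch below shows only that structure; the function names and chunk size are assumptions for illustration, and the thread pool is a stand-in for real workers (threads in CPython will not speed up CPU-bound work like this).

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(records, chunk_size):
    """Split a list into consecutive chunks of at most `chunk_size` items."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def sum_of_squares(chunk):
    """The single operation that every worker applies to its own chunk."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(records, chunk_size=1000, workers=4):
    """Apply the same function to each chunk, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunked(records, chunk_size)))
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), chunk_size=3))
```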

  12. Compute @ SARA & BiG Grid

  13. What's different about Hadoop? No more do-it-yourself parallelism – it's hard! But rather linearly scalable data parallelism Separating the what from the how The Datacenter as your computer (NYT, 14/06/2006)

  14. Working on scale ( @ SARA & BiG Grid ) An introduction to Hadoop & MapReduce Hadoop @ SARA & BiG Grid

  15. A bit of history ● 2002: Nutch* ● 2004: MR/GFS** ● 2006: Hadoop * http://nutch.apache.org/ ** http://labs.google.com/papers/mapreduce.html http://labs.google.com/papers/gfs.html

  16. 2010 - 2012: A Hype in Production http://wiki.apache.org/hadoop/PoweredBy

  17. Core principles ● Scale out, not up ● Move processing to the data ● Process data sequentially, avoid random reads ● Seamless scalability (Jimmy Lin, University of Maryland / Twitter, 2011)

  18. A typical data-parallel problem in abstraction 1. Iterate over a large number of records 2. Extract something of interest 3. Create an ordering in intermediate results 4. Aggregate intermediate results 5. Generate output MapReduce: functional abstraction of step 2 & step 4 (Jimmy Lin, University of Maryland / Twitter, 2011)
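
The five steps can be traced in a few lines of ordinary, single-machine Python. This is a sketch under assumptions: the log-line format, the `LOG_LINES` sample, and the function name `count_by_status` are invented for illustration and do not come from the talk.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: one web-server log line per record.
LOG_LINES = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /about.html 200",
    "GET /gone 404",
]


def count_by_status(lines):
    # Steps 1 and 2: iterate over the records and extract (status, 1).
    extracted = [(line.split()[-1], 1) for line in lines]
    # Step 3: order the intermediate results so equal keys are adjacent.
    extracted.sort(key=itemgetter(0))
    # Step 4: aggregate the values that share a key.
    totals = {
        status: sum(count for _, count in group)
        for status, group in groupby(extracted, key=itemgetter(0))
    }
    # Step 5: generate output.
    return totals


if __name__ == "__main__":
    print(count_by_status(LOG_LINES))
```

Only the extraction in step 2 and the aggregation in step 4 are specific to the problem; iteration, ordering, and output would look the same for any other per-key count.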

  19. MapReduce Programmer specifies two functions ● map (k, v) → <k', v'>* ● reduce (k', v') → <k', v'>* All values associated with a single key are sent to the same reducer The framework handles the rest
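
A word count is the customary illustration of the two functions. The sketch below is a toy, single-process stand-in and not Hadoop's API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the driver only imitates the grouping and ordering that the real framework performs across machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """map(k, v) -> <k', v'>*: emit (word, 1) for every word in a line."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """reduce(k', values) -> <k', v'>*: sum the counts for one word."""
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Toy driver standing in for the work the framework handles."""
    # Shuffle: collect every intermediate value under its key, so that all
    # values for one key reach the same reducer call.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Ordering: reducers are called with keys in sorted order.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


if __name__ == "__main__":
    lines = [(0, "the quick fox"), (1, "the lazy dog")]
    print(run_mapreduce(lines, map_fn, reduce_fn))
```

The programmer writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` corresponds to what the framework takes over.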

  20. The rest? Scheduling, data distribution, ordering, synchronization, error handling...

  21. An overview of a Hadoop cluster

  22. The ecosystem ● Apache Pig: data-flow language ● HBase: key/value store ● Giraph: graph processing ● Hive, HCatalog, Elephantbird, and many, many others...

  23. Data-processing as a commodity ● Cheap Clusters ● Simple programming models ● Easy-to-learn scripting Anybody with the know-how can generate insights!

  24. Note: “the know-how” = Data Science ● DevOps ● Programming algorithms ● Domain knowledge

  25. Working on scale ( @ SARA & BiG Grid ) An introduction to Hadoop & MapReduce Hadoop @ SARA & BiG Grid

  26. Timeline ● 2009: Piloting Hadoop on Cloud ● 2010: Test cluster available for scientists: 6 machines * 4 cores / 24 TB storage / 16 GB RAM; just me! ● 2011: Funding granted for production service ● 2012: Production cluster available (~March): 72 machines * (8 cores / 8 TB storage / 64 GB RAM); integration with Kerberos for secure multi-tenancy; 3 devops, team of consultants

  27. Architecture

  28. We already offer... Hadoop, Pig We will offer... HBase, Hive, HCatalog, Oozie, and probably more...

  29. What is it being used for? ● Information Retrieval ● Natural Language Processing ● Machine Learning ● Econometrics ● Bioinformatics ● Ecoinformatics ● Collaboration with industry!

  30. Machine learning: Infrawatch, Hollandse Brug

  31. Structural health monitoring: 145 sensors x 100 Hz x 60 seconds x 60 minutes x 24 hours x 365 days = large data (Arno Knobbe, LIACS, 2011, http://infrawatch.liacs.nl)
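
Working the product out gives a sense of the scale. The sensor count and sample rate are the slide's figures; the 8 bytes per sample is an assumption added here only to turn the sample count into a rough storage estimate.

```python
SENSORS = 145
SAMPLE_RATE_HZ = 100
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

samples_per_year = SENSORS * SAMPLE_RATE_HZ * SECONDS_PER_YEAR

# Assumption (not from the talk): each sample is stored as an 8-byte float.
BYTES_PER_SAMPLE = 8
bytes_per_year = samples_per_year * BYTES_PER_SAMPLE

print(f"{samples_per_year:,} samples per year")
print(f"{bytes_per_year / 1e12:.2f} TB per year at {BYTES_PER_SAMPLE} bytes per sample")
```

That is roughly 457 billion samples per year, or about 3.66 TB under the 8-byte assumption.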

  32. NLP & IR ● e.g. ClueWeb: a ~13.4 TB web crawl ● e.g. Twitter gardenhose data ● e.g. Wikipedia dumps ● e.g. del.icio.us & flickr tags ● Finding named entities: [person company place] names ● Creating inverted indexes ● Piloting real-time search ● Personalization ● Semantic web
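
An inverted index maps each term to the documents that contain it, which is what makes keyword lookup fast. The sketch below builds one in memory for a toy corpus; the `DOCS` sample and the function name `build_inverted_index` are assumptions for illustration, and a cluster-scale version would emit (term, document id) pairs in a map step and collect the postings per term in a reduce step.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text.
DOCS = {
    "d1": "hadoop stores data in hdfs",
    "d2": "mapreduce processes data in parallel",
    "d3": "hadoop runs mapreduce jobs",
}


def build_inverted_index(docs):
    """Return {term: sorted list of ids of the documents containing it}."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # "Map" side: every term in a document points back to that document.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # "Reduce" side: one sorted postings list per term.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    print(index["hadoop"])
    print(index["data"])
```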

  33. How do we embrace Hadoop? ● Parallelism has never been easy... so we teach! ● December 2010: hackathon (~50 participants - full) ● April 2011: workshop for bioinformaticians ● November 2011: 2-day PhD course (~60 participants - full) ● June 2012: 1-day PhD course ● The data scientist is still in school... so we fill the gap! ● Devops maintain the system, fix bugs, develop new functionality ● Technical consultants learn how to efficiently implement algorithms ● Users bring domain knowledge ● Methods are developing faster than light (don't quote me :)... so we build the community! ● Netherlands Hadoop User Group

  34. Final thoughts ● Hadoop is the first to provide commodity computing ● Hadoop is not the only ● Hadoop is probably not the best ● Hadoop has momentum ● What degree of diversification of infrastructure should we embrace? ● MapReduce fits surprisingly well as a programming model for data-parallelism ● Where is the data scientist? ● Teach. Teach. Work together.

  35. Any questions? evert.lammerts@sara.nl @eevrt @sara_nl
