Apache Hadoop, Big Data, and You - Philip Zeyliger

  1. Apache Hadoop, Big Data, and You. Philip Zeyliger, philip@cloudera.com, @philz42, @cloudera. Wednesday, November 18, 2009

  2. Hi there! Software Engineer. Worked at (company logos on the original slide)

  3. I work on stuff...

  4. Outline
     - Why should you care? (Intro)
     - Challenging yesteryear’s assumptions
     - The MapReduce Model
     - HDFS, Hadoop Map/Reduce
     - The Hadoop Ecosystem
     - Questions

  5. Data is everywhere. Data is important.

  9. “I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” Hal Varian (Google’s chief economist)

  10. So, what’s Hadoop? (The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry)

  11. Apache Hadoop is an open-source system (written in Java!) to store and process gobs of data across many commodity computers. (The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry)

  12. Two Big Components
      - HDFS: Self-healing high-bandwidth clustered storage.
      - Map/Reduce: Fault-tolerant distributed computing.

  13. Challenging some of yesteryear’s assumptions...

  14. Assumption 1: Machines can be reliable... (Image: MadMan the Mighty, CC BY-NC-SA)

  15. Hadoop Goal: Separate distributed system fault-tolerance code from application logic. (Diagram labels: Systems Programmers, Statisticians)

  16. Assumption 2: Machines have identities... (Image: Laughing Squid, CC BY-NC-SA)

  17. Hadoop Goal: Users should interact with clusters, not machines.

  18. Assumption 3: A data set fits on one machine... (Image: Matthew J. Stinson, CC BY-NC)

  19. Hadoop Goal: System should scale linearly (or better) with data size.

  20. The M/R Programming Model

  21. You specify map() and reduce() functions. The framework does the rest.

  22. map(): (K₁, V₁) → list(K₂, V₂)

```java
import java.io.IOException;

// Excerpt of Hadoop's Mapper class; the nested Context type is not shown on the slide.
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // context.write() can be called many times.
    // This is the default "identity mapper" implementation.
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
```

  23. (the shuffle)
      - map output is assigned to a “reducer”
      - map output is sorted by key
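The two shuffle steps can be sketched in a single process. This is only an illustration, not Hadoop's own shuffle code: the class and method names are hypothetical, and assigning keys to reducers by hash is an assumption, since the slide only says that map output is assigned to a reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of the shuffle: partition map output, then sort and group by key. */
final class ShuffleSketch {

    /** Picks a reducer by hashing (an assumption), so equal keys always meet at one reducer. */
    static int partitionFor(String key, int numReducers) {
        // Mask off the sign bit so the remainder is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Groups (key, value) pairs per reducer; TreeMap keeps each reducer's keys sorted. */
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : mapOutput) {
            int reducer = partitionFor(pair.getKey(), numReducers);
            partitions.get(reducer)
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("hadoop", 1), Map.entry("data", 1), Map.entry("hadoop", 1));
        List<TreeMap<String, List<Integer>>> partitions = shuffle(mapOutput, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("reducer " + i + " <- " + partitions.get(i));
        }
    }
}
```

Because the reducer is chosen from the key alone, every value for a given key ends up in the same partition, which is what lets reduce() see all values for a key at once.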

  24. reduce(): (K₂, iter(V₂)) → list(K₃, V₃)

```java
import java.io.IOException;

// Excerpt of Hadoop's Reducer class; the nested Context type is not shown on the slide.
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  /**
   * This method is called once for each key. Most applications will define
   * their reduce class by overriding this method. The default implementation
   * is an identity function.
   */
  @SuppressWarnings("unchecked")
  protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
      throws IOException, InterruptedException {
    // Identity reducer: re-emit every value under its key.
    for (VALUEIN value : values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }
}
```
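To see how a non-identity map() and reduce() fit together, here is a minimal word-count sketch that runs in one JVM without Hadoop. The class name, the tokenizing rule, and the use of a TreeMap to stand in for the shuffle are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process word count that mimics the map -> shuffle -> reduce flow. */
final class WordCountSketch {

    /** map: (offset, line) -> list of (word, 1). */
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    /** reduce: (word, iter(counts)) -> total count for that word. */
    static int reduce(String word, Iterable<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    /** Drives the three phases; the TreeMap plays the role of the sorted shuffle. */
    static Map<String, Integer> run(List<String> lines) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(offset, line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
            offset += line.length() + 1; // +1 for the newline that separated the lines
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Data is everywhere.", "Data is important.")));
    }
}
```

Only map() and reduce() carry application logic here; run() is the part a framework would take over.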

  25. Putting it together... (Diagram panels: Logical, Physical)

  26. Some samples...
      - Build an inverted index.
      - Summarize data grouped by a key.
      - Build map tiles from geographic data.
      - OCR many images.
      - Learn ML models (e.g., Naive Bayes for text classification).
      - Augment traditional BI/DW technologies (by archiving raw data).
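As an illustration of the first sample, an inverted index maps each term to the documents that contain it. The sketch below is a single-process approximation under assumed names (InvertedIndexSketch, build) and an assumed tokenizer; a real job would emit (term, docId) pairs from map() and collect them in reduce().

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory inverted index: term -> sorted set of document ids. */
final class InvertedIndexSketch {

    static TreeMap<String, TreeSet<String>> build(Map<String, String> docsById) {
        TreeMap<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docsById.entrySet()) {
            // "map" step: every term in the document yields a (term, docId) pair.
            for (String term : doc.getValue().toLowerCase().split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                // "shuffle + reduce" step: gather the document ids for each term.
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc1", "Big data");
        docs.put("doc2", "Data is everywhere.");
        System.out.println(build(docs));
    }
}
```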

  27. There’s more than the Java API
      - Streaming: perl, python, ruby, whatever. stdin/stdout/stderr.
      - Pig: Higher-level dataflow language for easy ad-hoc analysis. Developed at Yahoo!
      - Hive: SQL interface. Great for analysts. Developed at Facebook.
      - Friday, @10:10
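Streaming mappers are ordinary programs that read records on stdin and write results on stdout. A minimal sketch, written in Java only to match the other code on these slides: the class name is hypothetical, and the one-pair-per-line, tab-separated key/value output is the convention assumed here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;

/** Word-count mapper in the stdin/stdout style: one "word<TAB>1" line per word. */
final class StreamingMapperSketch {

    static void map(BufferedReader in, PrintWriter out) {
        try {
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.println(word + "\t1");
                    }
                }
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        map(new BufferedReader(new InputStreamReader(System.in)),
                new PrintWriter(System.out));
    }
}
```

Keeping the logic in map(BufferedReader, PrintWriter) rather than in main() makes it possible to exercise the mapper with in-memory strings before running it on a cluster.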

  28. A typical look...
      - Commodity servers (8-core, 8-16 GB RAM, 4-12 TB, 2 x 1 GbE NIC)
      - 2-level network architecture
      - 20-40 nodes per rack

  29. The cast...
      - Starring...
        - NameNode (metadata server and database)
        - SecondaryNameNode (assistant to NameNode)
        - JobTracker (scheduler)
      - The Chorus…
        - TaskTrackers (task execution)
        - DataNodes (block storage)
      - Thanks to Zak Stone for earmuff image!

  30. HDFS
      - Diagram labels: 3x64MB file, 3 rep; 4x64MB file, 3 rep; Small file, 7 rep; Namenode; Datanodes; One Rack; A Different Rack
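The file labels in the diagram translate into block and replica counts with simple arithmetic. A small sketch, assuming a 64 MB block size (taken from the labels), a hypothetical class name, and a 1 MB size for the "small file", which the slide does not specify:

```java
/** Back-of-the-envelope block and replica counts for files stored in blocks. */
final class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    /** Number of blocks a file occupies (ceiling division). */
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    /** Total block replicas stored across the cluster. */
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    /** Raw bytes consumed, assuming a partial last block uses only its actual size. */
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long block = 64 * MB;
        System.out.println("3x64MB, 3 rep: " + replicaCount(192 * MB, block, 3) + " replicas");
        System.out.println("4x64MB, 3 rep: " + replicaCount(256 * MB, block, 3) + " replicas");
        // 1 MB is an assumed size for the diagram's "small file".
        System.out.println("small, 7 rep:  " + replicaCount(1 * MB, block, 7) + " replicas");
    }
}
```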

  31. HDFS Write Path

  32. HDFS Failures?
      - Datanode crash?
        - Clients read another copy
        - Background rebalance
      - Namenode crash?
        - uh-oh

  33. M/R
      - Tasktrackers on the same machines as datanodes
      - Diagram labels: Job on stars; Different job; Idle; One Rack; A Different Rack

  34. M/R

  35. M/R Failures
      - Task fails
        - Try again?
        - Try again somewhere else?
        - Report failure
      - Retries possible because of idempotence
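The retry policy above can be sketched as a loop over candidate machines. This is not the JobTracker's actual scheduling code: the class name, the node list, and the use of a RuntimeException to signal a failed attempt are assumptions. It relies on the task being idempotent, so rerunning it elsewhere gives the same result.

```java
import java.util.List;
import java.util.function.Function;

/** Retries an idempotent task on one node after another until an attempt succeeds. */
final class TaskRetrySketch {

    static <T> T runWithRetries(List<String> nodes, Function<String, T> task) {
        RuntimeException lastFailure = null;
        for (String node : nodes) {
            try {
                return task.apply(node); // success: stop retrying
            } catch (RuntimeException e) {
                lastFailure = e; // attempt failed: try again somewhere else
            }
        }
        // Every attempt failed: report the failure to the caller.
        throw new IllegalStateException("task failed on all nodes", lastFailure);
    }

    public static void main(String[] args) {
        Integer result = runWithRetries(List.of("node-a", "node-b"), node -> {
            if (node.equals("node-a")) {
                throw new RuntimeException("simulated crash on " + node);
            }
            return 42;
        });
        System.out.println("result = " + result);
    }
}
```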

  36. Hadoop in the Wild
      - Yahoo! Hadoop clusters: > 82 PB, > 25k machines (Eric14, HadoopWorld NYC ’09)
      - Google: 40 GB/s GFS read/write load (Jeff Dean, LADIS ’09) [~3,500 TB/day]
      - Facebook: 4 TB new data per day; DW: 4800 cores, 5.5 PB (Dhruba Borthakur, HadoopWorld)
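The bracketed ~3,500 TB/day figure follows from the 40 GB/s rate, assuming the load is sustained for a full day and using decimal units (1 TB = 1,000 GB):

```latex
40\ \tfrac{\text{GB}}{\text{s}} \times 86{,}400\ \tfrac{\text{s}}{\text{day}}
  = 3{,}456{,}000\ \tfrac{\text{GB}}{\text{day}}
  = 3{,}456\ \tfrac{\text{TB}}{\text{day}}
  \approx 3{,}500\ \tfrac{\text{TB}}{\text{day}}
```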

  37. The Hadoop Ecosystem
      - ETL Tools, BI Reporting, RDBMS
      - Hive (SQL)
      - Sqoop
      - Pig (Data Flow)
      - ZooKeeper (Coordination)
      - Avro (Serialization)
      - MapReduce (Job Scheduling/Execution System)
      - HBase (Key-Value store)
      - HDFS (Hadoop Distributed File System)

  38. Ok, fine, what next?
      - Get Hadoop! (http://hadoop.apache.org/, Cloudera Distribution for Hadoop)
      - Try it out! (Locally, or on EC2)

  39. Just one slide...
      - Software: Cloudera Distribution for Hadoop, Cloudera Desktop, more…
      - Training and certification…
      - Free on-line training materials (including video)
      - Support & Professional Services
      - @cloudera, blog, etc.

  40. Questions? philip@cloudera.com
