Big Data Architectures @ Facebook QCon London 2012 Ashish Thusoo


  1. Big Data Architectures @ Facebook QCon London 2012 Ashish Thusoo Thursday, March 8, 2012

  2. Outline • Big Data @ Facebook - Scope & Scale • Evolution of Big Data Architectures @ FB • Past, Present and Future • Questions

  3. Big Data @ FB: Scale • 25 PB of compressed data • equivalent to 300 years of HD-TV video

  4. Big Data @ FB: Scale • 150 PB of uncompressed data • equivalent to 3 x the entire written works of mankind from the beginning of recorded history in all languages

  5. Big Data @ FB: Scale • 400 TB/day (uncompressed) of new data • That is a lot of disks
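
The three scale figures can be related with simple arithmetic. The sketch below uses only the numbers on the slides; the decimal conversion (1 PB = 1000 TB) and the function names are assumptions made for illustration.

```python
# Figures quoted on the slides.
COMPRESSED_PB = 25
UNCOMPRESSED_PB = 150
NEW_DATA_TB_PER_DAY = 400  # uncompressed

TB_PER_PB = 1000  # assumption: decimal units


def compression_ratio(uncompressed_pb, compressed_pb):
    """Uncompressed size divided by compressed size."""
    return uncompressed_pb / compressed_pb


def days_of_ingest(total_pb, tb_per_day):
    """How many days of ingest at tb_per_day add up to total_pb."""
    return total_pb * TB_PER_PB / tb_per_day


if __name__ == "__main__":
    print(compression_ratio(UNCOMPRESSED_PB, COMPRESSED_PB))
    print(days_of_ingest(UNCOMPRESSED_PB, NEW_DATA_TB_PER_DAY))
```

Under these assumptions the warehouse compresses about 6:1, and the whole uncompressed volume equals roughly a year of ingest at the 2012 daily rate.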

  6. Big Data @ FB: Scope • Simple reporting • Model generation • Adhoc analysis + data science • Index generation • Many many others...

  7. A/B Testing Email #1

  8. A/B Testing Email #2

  9. A/B Testing Email #2 is 3x Better

  10. Friend Map By Paul Butler - https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

  11. Big Data @ FB: Scope • one new job every second • ~ 15% of the company uses the clusters

  12. Evolution: 2007-2011 [Chart: DW Size in TB by year, 2007 to 2011; data labels 15, 250, 800, 8000, 25000]
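
The growth chart can be turned into year-over-year factors. This is a rough sketch: the chart text on this slide is flattened, so pairing the data labels 15, 250, 800, 8000 and 25000 TB with the years 2007 through 2011 in ascending order is an assumption, not something the slide states explicitly.

```python
# Data labels read off the flattened chart text. Pairing each label with a
# year (ascending, 2007 through 2011) is an assumption.
DW_SIZE_TB = {2007: 15, 2008: 250, 2009: 800, 2010: 8000, 2011: 25000}


def yearly_growth(sizes):
    """Return {year: size / previous year's size} for consecutive years."""
    years = sorted(sizes)
    return {year: sizes[year] / sizes[prev] for prev, year in zip(years, years[1:])}


if __name__ == "__main__":
    for year, factor in yearly_growth(DW_SIZE_TB).items():
        print(f"{year}: {factor:.1f}x the previous year")
```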

  13. 2007: Traditional EDW

  14. 2007: Traditional EDW [Diagram: Web Clusters; MySQL Clusters]

  15. 2007: Traditional EDW [Diagram: Web Clusters; MySQL Clusters; RDBMS Data Warehouse]

  16. 2007: Traditional EDW [Diagram: Web Clusters; Scribe Mid-Tier; MySQL Clusters; RDBMS Data Warehouse]

  17. 2007: Traditional EDW [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; MySQL Clusters; RDBMS Data Warehouse]

  18. 2007: Traditional EDW [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; Summarization Cluster; MySQL Clusters; RDBMS Data Warehouse]

  19. 2007: Pain Points [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; Summarization Cluster; MySQL Clusters; RDBMS Data Warehouse]

  20. 2007: Pain Points • daily ETL > 24 hours [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; Summarization Cluster; MySQL Clusters; RDBMS Data Warehouse]

  21. 2007: Pain Points • daily ETL > 24 hours • Lots of tuning/indexes etc. [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; Summarization Cluster; MySQL Clusters; RDBMS Data Warehouse]

  22. 2007: Pain Points • daily ETL > 24 hours • Lots of tuning/indexes etc. • Lots of hardware planning [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; Summarization Cluster; MySQL Clusters; RDBMS Data Warehouse]

  23. 2007: Pain Points • compute close to storage (early map/reduce) • daily ETL > 24 hours • Lots of tuning/indexes etc. • Lots of hardware planning [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; Summarization Cluster; MySQL Clusters; RDBMS Data Warehouse]

  24. 2007: Limitations • Most use cases were in business metrics - data science, model building etc. not possible • Only summary data was stored online - details archived away

  25. 2008: Move to Hadoop [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; Summarization Cluster; MySQL Clusters; RDBMS Data Warehouse]

  26. 2008: Move to Hadoop [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; MySQL Clusters; Batch copier/loaders; Hadoop/Hive Data Warehouse; RDBMS Data Mart]

  27. 2008: Immediate Pros • Data science at scale became possible • For the first time all of the instrumented data could be held online • Use cases expanded

  28. 2009: Democratizing Data [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; MySQL Clusters; Hadoop/Hive Data Warehouse; RDBMS Data Mart]

  29. 2009: Democratizing Data • Databee & Chronos: Data Pipeline Framework • Nectar: instrumentation & schema aware data collection • HiPal: Adhoc Queries + Data Discovery • Scrapes: Configuration Driven • Hadoop/Hive Data Warehouse

  30. 2009: Democratizing Data (Nectar) • Typical Nectar Pipeline • Simple schema evolution built in • json encoded short term data • decomposing json for long term storage
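
The idea of decomposing JSON-encoded log lines into columns, with simple schema evolution, can be sketched as follows. This is not Nectar's actual code or API; the function name, the field names and the two schema versions are all hypothetical.

```python
import json

# Hypothetical schemas: version 2 adds a "country" field to version 1.
COLUMNS_V1 = ["user_id", "event"]
COLUMNS_V2 = COLUMNS_V1 + ["country"]


def decompose(json_line, columns):
    """Flatten one JSON-encoded log line into a fixed tuple of columns.

    Fields missing from the record come back as None, so lines written
    before a column was added still load under the newer schema.
    """
    record = json.loads(json_line)
    return tuple(record.get(column) for column in columns)


if __name__ == "__main__":
    old_line = '{"user_id": 1, "event": "click"}'
    new_line = '{"user_id": 2, "event": "view", "country": "GB"}'
    print(decompose(old_line, COLUMNS_V2))
    print(decompose(new_line, COLUMNS_V2))
```

Keeping the short-term data as JSON lets producers add fields freely; the column list is applied only when rows are written to long-term storage.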

  31. 2009: Democratizing Data (Tools) • HiPal - data discovery and query authoring • Charting and dashboard generation tools

  32. 2009: Democratizing Data (Tools) • Databee: Workflow language • Chronos: Scheduling tool

  33. 2009: Cons of Democratization • Isolation to protect against Bad Jobs • Fair sharing of the cluster - what is a high priority job and how to enforce it

  34. 2010: Controlling Chaos • Isolation • Reducing operational overhead • Better resource utilization • Measurement, ownership, accountability

  35. 2010: Isolation [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; MySQL Clusters; Hadoop/Hive Data Warehouse]

  36. 2010: Isolation [Diagram: Web Clusters; Scribe Mid-Tier; NAS Filers; MySQL Clusters; Platinum Warehouse; Hive Replication; Silver Warehouse]

  37. 2010: Ops Efficiency [Diagram: Web Clusters; Scribe HDFS; MySQL Clusters; Platinum Warehouse; Hive Replication; Silver Warehouse]

  38. 2010: Ops Efficiency [Diagram: Web Clusters; Scribe HDFS; near real time data consumers; MySQL Clusters; Platinum Warehouse; Hive Replication; Silver Warehouse]

  39. 2010: Ops Efficiency [Diagram: Web Clusters; Scribe HDFS; ptail: parallel tail on hdfs; near real time data consumers; MySQL Clusters; Platinum Warehouse; Hive Replication; Silver Warehouse]

  40. 2010: Resource Utilization (Disk) • HDFS-RAID: from 3 replicas to 2.2 replicas • RCFile: Row columnar format for compressing Hive tables
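
Both disk techniques on this slide can be illustrated with a small sketch. The parity parameters below (2 data replicas, 2 parity replicas, one parity block per stripe of 10 data blocks) are an assumption chosen because they reproduce the 2.2 figure; the slide does not give them. The row-columnar function only illustrates the layout idea behind RCFile and is not the real file format.

```python
def effective_replication(data_replicas, parity_replicas, stripe_length):
    """Copies stored per data block when each stripe of `stripe_length`
    data blocks shares one parity block (assumed XOR-style parity)."""
    return data_replicas + parity_replicas / stripe_length


def disk_saving(old_replication, new_replication):
    """Fraction of raw disk saved by moving to the new replication factor."""
    return 1 - new_replication / old_replication


def to_row_columnar(rows, group_size):
    """Split rows into groups, then store each group column by column.

    Values of one column sit next to each other inside a group, which is
    what makes them compress well.
    """
    groups = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        groups.append([list(column) for column in zip(*group)])
    return groups


if __name__ == "__main__":
    new_factor = effective_replication(2, 2, 10)
    print(round(new_factor, 2), round(disk_saving(3, new_factor), 3))
    print(to_row_columnar([(1, "a"), (2, "b"), (3, "c")], group_size=2))
```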

  41. 2010: Resource Utilization (CPU) • Continuous copier/loaders • Incremental scrapes • Hive optimizations to save CPU

  42. 2010: Monitoring (SLAs) • Per job statistics rolled up to owner/group/team • Expected time of arrival vs Actual time of arrival of data • Simple data quality metrics
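
A minimal sketch of two of these monitoring ideas: rolling per-job statistics up to an owner, and comparing expected with actual arrival times. The job records, field names and function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical per-job records; field names are illustrative only.
JOBS = [
    {"job": "daily_report", "owner": "growth", "cpu_seconds": 1200,
     "expected": datetime(2012, 3, 8, 6, 0), "actual": datetime(2012, 3, 8, 6, 45)},
    {"job": "ads_model", "owner": "ads", "cpu_seconds": 5400,
     "expected": datetime(2012, 3, 8, 7, 0), "actual": datetime(2012, 3, 8, 6, 50)},
    {"job": "signup_funnel", "owner": "growth", "cpu_seconds": 300,
     "expected": datetime(2012, 3, 8, 8, 0), "actual": datetime(2012, 3, 8, 8, 0)},
]


def cpu_by_owner(jobs):
    """Roll per-job CPU seconds up to the owning team."""
    totals = defaultdict(int)
    for job in jobs:
        totals[job["owner"]] += job["cpu_seconds"]
    return dict(totals)


def minutes_late(jobs):
    """Return {job name: minutes late} for data that arrived after its ETA."""
    late = {}
    for job in jobs:
        delay = (job["actual"] - job["expected"]).total_seconds() / 60
        if delay > 0:
            late[job["job"]] = delay
    return late


if __name__ == "__main__":
    print(cpu_by_owner(JOBS))
    print(minutes_late(JOBS))
```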

  43. 2011: New Requirements • More real time requirements for aggregations • Optimizing resource utilization

  44. 2011: Beyond Hadoop • Puma for real time analytics • Peregrine for simple and fast queries

  45. 2010: Puma [Diagram: Web Clusters; Scribe HDFS; ptail: parallel tail on hdfs; near real time data consumers; MySQL Clusters; Platinum Warehouse; Hive Replication; Silver Warehouse]

  46. 2010: Puma [Diagram: Web Clusters; Scribe HDFS; ptail: parallel tail on hdfs; near real time data consumers; MySQL Clusters; Platinum Warehouse; Hive Replication; Silver Warehouse]

  47. 2010: Puma

  48. 2010: Puma [Diagram: Scribe HDFS; ptail: parallel tail on hdfs]

  49. 2010: Puma [Diagram: Scribe HDFS; ptail: parallel tail on hdfs; Puma Clusters]

  50. 2010: Puma [Diagram: Scribe HDFS; ptail: parallel tail on hdfs; Puma Clusters; Hbase Cluster]
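
The slides describe Puma only as a real-time analytics tier fed by ptail and backed by an HBase cluster. The sketch below shows one common shape for that kind of streaming aggregation: counting events per time bucket as they arrive. It is an illustration under stated assumptions, not Puma's implementation; the event format, the 60-second bucket and the in-memory Counter standing in for the HBase store are all hypothetical.

```python
from collections import Counter


def aggregate(events, bucket_seconds=60):
    """Count events per (time bucket, key).

    `events` is an iterable of (unix_timestamp, key) pairs, as a log tailer
    might emit them. The Counter stands in for a persistent counter store.
    """
    counts = Counter()
    for timestamp, key in events:
        bucket_start = timestamp - timestamp % bucket_seconds
        counts[(bucket_start, key)] += 1
    return counts


if __name__ == "__main__":
    stream = [(0, "like"), (30, "like"), (61, "like"), (65, "share")]
    for (bucket_start, key), count in sorted(aggregate(stream).items()):
        print(bucket_start, key, count)
```

Because each event only increments a counter, results are available as soon as the event is read, without waiting for a batch job.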

  51. Other Challenges Of HyperGrowth • Moving data centers • Moving sustainably fast

  52. HyperGrowth - Moving Data Centers [Chart: DW Size in TB by year, 2007 to 2011; data labels 15, 250, 800, 8000, 25000]

  53. HyperGrowth - Moving Data Centers • Moved 20 PB of data • Leverage replication with fast switch • 2-3 months to accomplish the entire move • Blog Post on FB by Paul Yang: http://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920

  54. Questions • Contact Information: ashish.thusoo@gmail.com • http://www.linkedin.com/pub/ashish-thusoo/0/5a8/50 • https://www.facebook.com/athusoo • https://twitter.com/ashishthusoo
