Big Data Architectures @ Facebook - QCon London 2012 - Ashish Thusoo

SLIDE 1

Big Data Architectures @ Facebook

QCon London 2012 Ashish Thusoo

Thursday, March 8, 12

SLIDE 2

Outline

  • Big Data @ Facebook - Scope & Scale
  • Evolution of Big Data Architectures @ FB
  • Past, Present and Future
  • Questions

SLIDE 3

Big Data @ FB: Scale

  • 25 PB of compressed data
  • equivalent to 300 years of HD-TV video

SLIDE 4

Big Data @ FB: Scale

  • 150 PB of uncompressed data
  • equivalent to 3 x the entire written works of mankind from the beginning of recorded history in all languages

SLIDE 5

Big Data @ FB: Scale

  • 400 TB/day (uncompressed) of new data
  • That is a lot of disks
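
As a rough back-of-the-envelope sketch, the scale figures on these slides imply a compression ratio of about 6x (150 PB uncompressed against 25 PB compressed). The disk size, the replication factor, and the idea that new daily data compresses at the same overall ratio are assumptions made purely for illustration; none of them come from the talk.

```python
import math

# Figures taken from the slides.
COMPRESSED_PB = 25
UNCOMPRESSED_PB = 150
DAILY_UNCOMPRESSED_TB = 400

# Assumptions for illustration only (not stated in the talk).
ASSUMED_DISK_TB = 2
ASSUMED_REPLICAS = 3


def compression_ratio(uncompressed_pb: float, compressed_pb: float) -> float:
    """Ratio of uncompressed to compressed warehouse size."""
    return uncompressed_pb / compressed_pb


def disks_per_day(daily_uncompressed_tb: float, ratio: float,
                  replicas: int, disk_tb: float) -> int:
    """Disks needed per day if daily data compresses at `ratio`
    and every compressed byte is stored `replicas` times."""
    stored_tb = daily_uncompressed_tb * replicas / ratio
    return math.ceil(stored_tb / disk_tb)


ratio = compression_ratio(UNCOMPRESSED_PB, COMPRESSED_PB)
new_disks = disks_per_day(DAILY_UNCOMPRESSED_TB, ratio,
                          ASSUMED_REPLICAS, ASSUMED_DISK_TB)
print(f"compression ratio: {ratio:.1f}x, disks per day: {new_disks}")
```

Changing the assumed disk size or replication factor changes the disk count proportionally, so the number is only an order-of-magnitude guide.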

SLIDE 6

Big Data @ FB: Scope

  • Simple reporting
  • Model generation
  • Ad hoc analysis + data science
  • Index generation
  • Many many others...

SLIDE 7

A/B Testing Email #1

SLIDE 8

A/B Testing Email #2

SLIDE 9

A/B Testing Email #2 is 3x Better
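
The slides do not say which metric "3x Better" refers to. A minimal sketch, assuming it is a response rate per recipient, could compare two email variants as below. The function names and the counts are hypothetical and were picked only so that the result matches the 3x on the slide.

```python
from fractions import Fraction


def response_rate(responses: int, recipients: int) -> Fraction:
    """Fraction of recipients who responded to one email variant."""
    if recipients <= 0:
        raise ValueError("recipients must be positive")
    return Fraction(responses, recipients)


def relative_performance(rate_a: Fraction, rate_b: Fraction) -> Fraction:
    """How many times better variant B performs than variant A."""
    if rate_a == 0:
        raise ValueError("variant A has no responses to compare against")
    return rate_b / rate_a


# Hypothetical counts (not from the talk).
email_1 = response_rate(responses=200, recipients=10_000)
email_2 = response_rate(responses=600, recipients=10_000)
improvement = relative_performance(email_1, email_2)
print(f"Email #2 performs {float(improvement):.1f}x as well as Email #1")
```

`Fraction` keeps the ratio exact; a real analysis would also need a significance test before declaring a winner.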

SLIDE 10

Friend Map

By Paul Butler - https://www.facebook.com/notes/facebook-engineering/visualizing-friendships/469716398919

SLIDE 11

Big Data @ FB: Scope

  • one new job every second
  • ~ 15% of the company uses the clusters

SLIDE 12

Evolution: 2007-2011

[Chart: DW Size in TB by year. 2007: 15, 2008: 250, 2009: 800, 2010: 8000, 2011: 25000]
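
The chart's data labels (15, 250, 800, 8000 and 25000 TB for 2007 through 2011) can be turned into growth factors. The sketch below assumes the labels map to the years in order, which matches the 25 PB figure quoted earlier in the deck; the function names are made up for this example.

```python
# Warehouse size in TB per year, read off the chart's data labels.
DW_SIZE_TB = {2007: 15, 2008: 250, 2009: 800, 2010: 8000, 2011: 25000}


def yearly_growth(sizes: dict[int, float]) -> dict[int, float]:
    """Growth factor of each year relative to the previous year."""
    years = sorted(sizes)
    return {year: sizes[year] / sizes[prev]
            for prev, year in zip(years, years[1:])}


def compound_annual_growth(sizes: dict[int, float]) -> float:
    """Constant yearly factor that would produce the same overall growth."""
    years = sorted(sizes)
    first, last = years[0], years[-1]
    return (sizes[last] / sizes[first]) ** (1 / (last - first))


growth = yearly_growth(DW_SIZE_TB)
cagr = compound_annual_growth(DW_SIZE_TB)
for year, factor in growth.items():
    print(f"{year}: {factor:.1f}x the previous year")
print(f"average yearly growth factor: {cagr:.1f}x")
```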

SLIDE 13

2007: Traditional EDW

SLIDE 14

2007: Traditional EDW

[Diagram: Web Clusters, MySQL Clusters]

SLIDE 15

2007: Traditional EDW

[Diagram: Web Clusters, MySQL Clusters, RDBMS Data Warehouse]

SLIDE 16

2007: Traditional EDW

[Diagram: Web Clusters, Scribe Mid-Tier, MySQL Clusters, RDBMS Data Warehouse]

SLIDE 17

2007: Traditional EDW

[Diagram: Web Clusters, Scribe Mid-Tier, MySQL Clusters, NAS Filers, RDBMS Data Warehouse]

SLIDE 18

2007: Traditional EDW

[Diagram: Web Clusters, Scribe Mid-Tier, MySQL Clusters, Summarization Cluster, NAS Filers, RDBMS Data Warehouse]

SLIDE 19

2007: Pain Points

[Diagram: Summarization Cluster, Web Clusters, Scribe Mid-Tier, MySQL Clusters, NAS Filers, RDBMS Data Warehouse]

SLIDE 20

2007: Pain Points

[Diagram: Summarization Cluster, Web Clusters, Scribe Mid-Tier, MySQL Clusters, NAS Filers, RDBMS Data Warehouse]

  • daily ETL > 24 hours

SLIDE 21

2007: Pain Points

[Diagram: Summarization Cluster, Web Clusters, Scribe Mid-Tier, MySQL Clusters, NAS Filers, RDBMS Data Warehouse]

  • daily ETL > 24 hours
  • Lots of tuning/indexes etc.

SLIDE 22

2007: Pain Points

[Diagram: Summarization Cluster, Web Clusters, Scribe Mid-Tier, MySQL Clusters, NAS Filers, RDBMS Data Warehouse]

  • daily ETL > 24 hours
  • Lots of tuning/indexes etc.
  • Lots of hardware planning

SLIDE 23

2007: Pain Points

[Diagram: Summarization Cluster, Web Clusters, Scribe Mid-Tier, MySQL Clusters, NAS Filers, RDBMS Data Warehouse]

  • daily ETL > 24 hours
  • Lots of tuning/indexes etc.
  • Lots of hardware planning
  • compute close to storage (early map/reduce)

SLIDE 24

2007: Limitations

  • Most use cases were in business metrics - data science, model building etc. not possible
  • Only summary data was stored online - details archived away

SLIDE 25

2008: Move to Hadoop

[Diagram: Web Clusters, Scribe Mid-Tier, MySQL Clusters, NAS Filers, Summarization Cluster, RDBMS Data Warehouse]

SLIDE 26

2008: Move to Hadoop

[Diagram: Web Clusters, Scribe Mid-Tier, MySQL Clusters, NAS Filers, RDBMS Data Mart, Hadoop/Hive Data Warehouse, Batch copier/loaders]

SLIDE 27

2008: Immediate Pros

  • Data science at scale became possible
  • For the first time all of the instrumented data could be held online
  • Use cases expanded

SLIDE 28

2009: Democratizing Data

[Diagram: Web Clusters, Scribe Mid-Tier, MySQL Clusters, NAS Filers, RDBMS Data Mart, Hadoop/Hive Data Warehouse]

SLIDE 29

2009: Democratizing Data

Hadoop/Hive Data Warehouse

  • Databee & Chronos: Data Pipeline Framework
  • HiPal: Ad hoc Queries + Data Discovery
  • Nectar: instrumentation & schema aware data collection
  • Scrapes: Configuration Driven

SLIDE 30

2009: Democratizing Data (Nectar)

  • Typical Nectar Pipeline
  • Simple schema evolution built in
  • json encoded short term data
  • decomposing json for long term storage
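
The slide gives no detail on Nectar's formats, so the following is only a sketch of the general idea: JSON-encoded events are convenient while data is fresh, and can later be decomposed into one list of values per field for long-term storage. The `decompose` function, the field names and the sample events are all hypothetical.

```python
import json


def decompose(json_lines: list[str], columns: list[str]) -> dict[str, list]:
    """Split JSON-encoded events into one list of values per column.

    Events that predate a newly added field simply yield None for it,
    which is one simple way to tolerate schema evolution.
    """
    table: dict[str, list] = {name: [] for name in columns}
    for line in json_lines:
        event = json.loads(line)
        for name in columns:
            table[name].append(event.get(name))
    return table


# Hypothetical events: the second one carries a field added later.
events = [
    '{"user_id": 1, "action": "click"}',
    '{"user_id": 2, "action": "view", "locale": "en_GB"}',
]
table = decompose(events, ["user_id", "action", "locale"])
print(table)
```

Storing values per column rather than per event is also what makes columnar compression effective for the long-term copy.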

SLIDE 31

2009: Democratizing Data (Tools)

  • HiPal - data discovery and query authoring
  • Charting and dashboard generation tools

SLIDE 32

2009: Democratizing Data (Tools)

  • Databee: Workflow language
  • Chronos: Scheduling tool
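
The slides do not show Databee's syntax or how Chronos schedules work, so this is not their API. It is a generic sketch of what a workflow language and a scheduler have to agree on: tasks declare their dependencies, and the scheduler runs them in an order that respects those dependencies. The task names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "load_logs": set(),
    "load_dimensions": set(),
    "join_sessions": {"load_logs", "load_dimensions"},
    "daily_report": {"join_sessions"},
}


def execution_order(pipeline: dict[str, set[str]]) -> list[str]:
    """Return the tasks in an order where every dependency runs first.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(pipeline).static_order())


order = execution_order(PIPELINE)
print(" -> ".join(order))
```

A production scheduler would additionally run independent tasks in parallel and retry failures; the ordering constraint is the part sketched here.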

SLIDE 33

2009: Cons of Democratization

  • Isolation to protect against Bad Jobs
  • Fair sharing of the cluster - what is a high priority job and how to enforce it

SLIDE 34

2010: Controlling Chaos

  • Isolation
  • Reducing operational overhead
  • Better resource utilization
  • Measurement, ownership, accountability

SLIDE 35

2010: Isolation

[Diagram: Web Clusters, Scribe Mid-Tier, MySQL Clusters, NAS Filers, Hadoop/Hive Data Warehouse]

SLIDE 36

2010: Isolation

[Diagram: Web Clusters, Scribe Mid-Tier, MySQL Clusters, NAS Filers, Platinum Warehouse, Silver Warehouse, Hive Replication]

SLIDE 37

2010: Ops Efficiency

[Diagram: Web Clusters, Scribe HDFS, MySQL Clusters, Platinum Warehouse, Silver Warehouse, Hive Replication]

SLIDE 38

2010: Ops Efficiency

[Diagram: Web Clusters, Scribe HDFS, MySQL Clusters, Platinum Warehouse, Silver Warehouse, Hive Replication, near real time data consumers]

SLIDE 39

2010: Ops Efficiency

[Diagram: Web Clusters, Scribe HDFS, MySQL Clusters, Platinum Warehouse, Silver Warehouse, Hive Replication, ptail: parallel tail on hdfs, near real time data consumers]

SLIDE 40

2010: Resource Utilization (Disk)

  • HDFS-RAID: from 3 replicas to 2.2 replicas
  • RCFile: Row columnar format for compressing Hive tables
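
The slide does not say how the 2.2 figure arises. One combination that yields it, assumed here for illustration, is keeping 2 copies of each data block plus 2 copies of one XOR parity block per stripe of 10 data blocks. The sketch below shows that arithmetic and how XOR parity lets a single lost block be rebuilt; it is a toy model, not the HDFS-RAID implementation.

```python
from functools import reduce


def xor_parity(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte into one parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving_blocks: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing block of a stripe from the rest plus parity."""
    return xor_parity(surviving_blocks + [parity])


def effective_replication(data_replicas: int, parity_replicas: int,
                          stripe_length: int) -> float:
    """Stored bytes per logical byte when one parity block covers a stripe."""
    return data_replicas + parity_replicas / stripe_length


# Toy stripe of three 4-byte blocks; the middle block is then "lost".
stripe = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(stripe)
rebuilt = recover([stripe[0], stripe[2]], parity)

# Assumed parameters: 2 data copies, 2 parity copies, stripe of 10 blocks.
overhead = effective_replication(data_replicas=2, parity_replicas=2,
                                 stripe_length=10)
print(rebuilt, overhead)
```

Under these assumptions the raw storage needed per logical byte drops by roughly a quarter compared with plain 3-way replication.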

SLIDE 41

2010: Resource Utilization (CPU)

  • Continuous copier/loaders
  • Incremental scrapes
  • Hive optimizations to save CPU

SLIDE 42

2010: Monitoring (SLAs)

  • Per job statistics rolled up to owner/group/team
  • Expected time of arrival vs Actual time of arrival of data
  • Simple data quality metrics
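
A minimal sketch of the first two points, assuming each job run records an owner plus an expected and an actual arrival time for its data. The `JobRun` record, the job names and the owners are hypothetical; the talk does not describe the actual monitoring schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobRun:
    job: str
    owner: str
    expected_arrival: datetime
    actual_arrival: datetime

    @property
    def delay(self) -> timedelta:
        """Lateness of the data; early arrivals count as zero delay."""
        return max(self.actual_arrival - self.expected_arrival, timedelta(0))


def delay_by_owner(runs: list[JobRun]) -> dict[str, timedelta]:
    """Roll per-job delays up to the owning team."""
    totals: dict[str, timedelta] = defaultdict(timedelta)
    for run in runs:
        totals[run.owner] += run.delay
    return dict(totals)


# Hypothetical runs for one day.
runs = [
    JobRun("daily_metrics", "growth",
           datetime(2012, 3, 8, 6, 0), datetime(2012, 3, 8, 7, 30)),
    JobRun("ad_report", "ads",
           datetime(2012, 3, 8, 6, 0), datetime(2012, 3, 8, 5, 45)),
    JobRun("friend_stats", "growth",
           datetime(2012, 3, 8, 8, 0), datetime(2012, 3, 8, 8, 15)),
]
late = delay_by_owner(runs)
print(late)
```

The same roll-up could be keyed by group or team instead of owner without changing the structure.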

SLIDE 43

2011: New Requirements

  • More real time requirements for aggregations

  • Optimizing resource utilization

SLIDE 44

2011: Beyond Hadoop

  • Puma for real time analytics
  • Peregrine for simple and fast queries

SLIDE 45

2010: Puma

[Diagram: Web Clusters, Scribe HDFS, MySQL Clusters, Platinum Warehouse, Silver Warehouse, Hive Replication, ptail: parallel tail on hdfs, near real time data consumers]

SLIDE 46

2010: Puma

[Diagram: Web Clusters, Scribe HDFS, MySQL Clusters, Platinum Warehouse, Silver Warehouse, Hive Replication, ptail: parallel tail on hdfs, near real time data consumers]

SLIDE 47

2010: Puma

SLIDE 48

2010: Puma

[Diagram: Scribe HDFS, ptail: parallel tail on hdfs]

SLIDE 49

2010: Puma

[Diagram: Scribe HDFS, ptail: parallel tail on hdfs, Puma Clusters]

SLIDE 50

2010: Puma

[Diagram: Scribe HDFS, ptail: parallel tail on hdfs, Puma Clusters, Hbase Cluster]
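
Puma's internals are not described on these slides. As a generic sketch of real-time aggregation over a tailed log stream, the code below counts events per key and per minute as they arrive; the in-memory `Counter` is a stand-in for a counter store such as the HBase cluster in the diagram, and the event shape is an assumption.

```python
from collections import Counter
from typing import Iterable, Optional


def minute_bucket(timestamp_s: int) -> int:
    """Start of the minute (in epoch seconds) containing the timestamp."""
    return timestamp_s - timestamp_s % 60


def aggregate(events: Iterable[tuple[int, str]],
              store: Optional[Counter] = None) -> Counter:
    """Increment a (key, minute) counter for every event in the stream.

    Passing an existing store lets later batches keep adding to the
    same counters, the way a long-running consumer would.
    """
    store = Counter() if store is None else store
    for timestamp_s, key in events:
        store[(key, minute_bucket(timestamp_s))] += 1
    return store


# Hypothetical (timestamp in seconds, event type) pairs.
stream = [(0, "like"), (59, "like"), (60, "like"), (61, "share")]
counts = aggregate(stream)
print(dict(counts))
```

Because each event only increments a counter, results are queryable within seconds of arrival instead of after a batch job finishes.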

SLIDE 51

Other Challenges Of HyperGrowth

  • Moving data centers
  • Moving sustainably fast

SLIDE 52

HyperGrowth - Moving Data Centers

[Chart: DW Size in TB by year. 2007: 15, 2008: 250, 2009: 800, 2010: 8000, 2011: 25000]

SLIDE 53

HyperGrowth - Moving Data Centers

  • Moved 20 PB of data
  • Leverage replication with fast switch
  • 2-3 months to accomplish the entire move

Blog Post on FB by Paul Yang: http://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920

SLIDE 54

Questions

Contact Information: ashish.thusoo@gmail.com

http://www.linkedin.com/pub/ashish-thusoo/0/5a8/50
https://www.facebook.com/athusoo
https://twitter.com/ashishthusoo
