Realtime Data Processing at Facebook (Abhay Venkatesh)



slide-1
SLIDE 1

Abhay Venkatesh

Realtime Data Processing at Facebook

slide-2
SLIDE 2

Why Streaming at Facebook?

  • Actionable reports
  • e.g. Chorus: what is trending right now?
  • Realtime monitoring
  • e.g. dashboard queries
  • Hybrid realtime-batch pipelines
  • e.g. pre-emptive queries over data warehouse
slide-3
SLIDE 3

Workload Assumptions

  • Latency requirements are seconds, not milliseconds, which means
  • the systems can use a persistent message bus called Scribe for data transfer
  • which makes it easier to provide
  • Fault tolerance
  • Scalability
  • Multiple options for correctness
slide-4
SLIDE 4

System Architecture

slide-5
SLIDE 5

The Streaming Triad

  • Puma
  • Swift
  • Stylus
slide-6
SLIDE 6

Puma

  • For apps written in a SQL-like language
  • Quick to write (< 1 hour)
  • But run over long periods (months to years)
  • Two purposes
  • Pre-computed query results for simple aggregation queries
  • Filtering and processing of Scribe streams
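
The first purpose, pre-computed results for simple aggregation queries, can be illustrated with a minimal Python sketch. This is not Puma's SQL-like language: the class `WindowedCounter`, the 5-minute window, and the sample events are all hypothetical. The sketch only shows the general idea of updating aggregates as events arrive, so that a later query becomes a lookup.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # hypothetical 5-minute tumbling window


class WindowedCounter:
    """Keeps per-window, per-key counts up to date as events arrive."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)  # (window_start, key) -> count

    def window_start(self, timestamp):
        # Align the timestamp to the start of its tumbling window.
        return timestamp - (timestamp % self.window_seconds)

    def process(self, timestamp, key):
        # Update the aggregate at ingestion time.
        self.counts[(self.window_start(timestamp), key)] += 1

    def query(self, timestamp, key):
        # Answering a query is a lookup; no scan over raw events is needed.
        return self.counts.get((self.window_start(timestamp), key), 0)


counter = WindowedCounter()
for ts, topic in [(10, "a"), (20, "b"), (290, "a"), (310, "a")]:
    counter.process(ts, topic)
```

Here `counter.query(0, "a")` reads the count for the first window directly from the pre-computed table.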
slide-7
SLIDE 7

A Puma App

slide-8
SLIDE 8

Swift

Very Basic API

  • Can read() from a Scribe Stream
  • Checkpoints every N strings or B bytes
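
A minimal Python sketch of this checkpointing behaviour follows. It is not Swift's actual API: `CheckpointingReader`, its parameters, and the in-memory list standing in for a Scribe stream are hypothetical. The sketch records a checkpoint after every `n_strings` messages or `b_bytes` bytes, whichever comes first, and a restarted reader resumes from the last checkpoint.

```python
class CheckpointingReader:
    """Reads strings from an in-memory stream and records a checkpoint
    every `n_strings` messages or `b_bytes` bytes, whichever comes first."""

    def __init__(self, stream, n_strings, b_bytes, start_offset=0):
        self.stream = stream
        self.n_strings = n_strings
        self.b_bytes = b_bytes
        self.offset = start_offset      # index of the next message to read
        self.checkpoint = start_offset  # last recorded offset (stands in for durable storage)
        self._strings_since = 0
        self._bytes_since = 0

    def read(self):
        """Return the next message, or None at the end of the stream."""
        if self.offset >= len(self.stream):
            return None
        message = self.stream[self.offset]
        self.offset += 1
        self._strings_since += 1
        self._bytes_since += len(message.encode("utf-8"))
        if (self._strings_since >= self.n_strings
                or self._bytes_since >= self.b_bytes):
            self.checkpoint = self.offset
            self._strings_since = 0
            self._bytes_since = 0
        return message


stream = ["a", "bb", "ccc", "dddd", "e"]
reader = CheckpointingReader(stream, n_strings=2, b_bytes=1000)
first_three = [reader.read() for _ in range(3)]

# Simulate a crash: a new reader restarts from the last checkpoint.
restarted = CheckpointingReader(stream, n_strings=2, b_bytes=1000,
                                start_offset=reader.checkpoint)
replayed = restarted.read()
```

In this sketch the third message is read again after the restart, because it arrived after the last checkpoint. Messages are never skipped, but some may be seen more than once.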
slide-9
SLIDE 9

Stylus

  • Low-Level Stream Processing in C++

Scribe Stream -> Stylus Processor(s) -> Scribe Stream or Data Store

slide-10
SLIDE 10

Sample Application

slide-11
SLIDE 11

Design Decisions

  • Language Paradigm
  • Data Transfer
  • Processing Semantics
  • State-saving mechanism
  • Reprocessing
slide-12
SLIDE 12

Design Decisions

  • Language Paradigm
  • Data Transfer
  • Processing Semantics
  • State-saving mechanism
  • Reprocessing
slide-13
SLIDE 13

Processing Semantics

  • At least once, at most once or exactly once
  • State semantics (inputs)
  • Output semantics
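
One common way to obtain the three state semantics is through the order in which a processor saves its state and its input offset at a checkpoint. The Python sketch below illustrates this under that assumption; it is not taken from the slides, and the function `run_with_crash` and its parameters are hypothetical. It counts events, injects a crash between the two checkpoint writes, recovers from what was saved, and returns the final count.

```python
def run_with_crash(events, order, crash_at):
    """Sum `events`, checkpointing after each one. Crash during the
    checkpoint of event number `crash_at` (1-based), then recover.

    order: "state_first"  -> at-least-once state semantics
           "offset_first" -> at-most-once state semantics
           "atomic"       -> exactly-once state semantics
    """
    saved = {"state": 0, "offset": 0}  # stands in for durable storage

    def process(start_state, start_offset, crash_enabled):
        state = start_state
        for i in range(start_offset, len(events)):
            state += events[i]
            new_offset = i + 1
            crash_now = crash_enabled and new_offset == crash_at
            if order == "atomic":
                if crash_now:
                    return False  # crash before the single atomic write
                saved["state"], saved["offset"] = state, new_offset
            elif order == "state_first":
                saved["state"] = state
                if crash_now:
                    return False  # state saved, offset not yet saved
                saved["offset"] = new_offset
            elif order == "offset_first":
                saved["offset"] = new_offset
                if crash_now:
                    return False  # offset saved, state not yet saved
                saved["state"] = state
            else:
                raise ValueError(order)
        return True

    if not process(0, 0, crash_enabled=True):
        # Recover from whatever reached durable storage and finish the stream.
        process(saved["state"], saved["offset"], crash_enabled=False)
    return saved["state"]


events = [1, 1, 1, 1, 1]
at_least_once = run_with_crash(events, "state_first", crash_at=3)
at_most_once = run_with_crash(events, "offset_first", crash_at=3)
exactly_once = run_with_crash(events, "atomic", crash_at=3)
```

With five events and a crash during the third checkpoint, saving state first counts one event twice, saving the offset first drops one event, and the atomic save counts each event once.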
slide-14
SLIDE 14
slide-15
SLIDE 15

State-Saving Mechanisms

slide-16
SLIDE 16

Reprocessing Data

  • Data warehousing with Hive
  • Stream processing in batch environment
  • Puma -> Hive
  • Stylus -> stateless, stateful, and monoid
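
The monoid case can be illustrated with a short Python sketch. A monoid has an identity element and an associative combine operation, so partial results computed over separate chunks of data can be merged in any grouping, which suits reprocessing in a batch environment. The names `CountMonoid` and `aggregate` are hypothetical and this is not Stylus code (Stylus is C++).

```python
from functools import reduce


class CountMonoid:
    """Monoid over per-key counts: identity is {}, combine sums by key."""

    @staticmethod
    def identity():
        return {}

    @staticmethod
    def combine(left, right):
        merged = dict(left)
        for key, value in right.items():
            merged[key] = merged.get(key, 0) + value
        return merged


def aggregate(events):
    """Fold a sequence of events into a single per-key count."""
    singletons = ({event: 1} for event in events)
    return reduce(CountMonoid.combine, singletons, CountMonoid.identity())


events = ["like", "share", "like", "comment", "like"]

# Streaming style: fold over events one at a time.
streaming_result = aggregate(events)

# Batch style: aggregate chunks independently, then merge the partials.
partials = [aggregate(events[:2]), aggregate(events[2:])]
batch_result = reduce(CountMonoid.combine, partials, CountMonoid.identity())
```

Both paths produce the same counts, which is the property that lets the same processor logic run over a stream or over chunks of warehouse data.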
slide-17
SLIDE 17

Closing Thoughts

  • “Move Fast”
  • Ease of debugging
  • Ease of deployment
  • Ease of monitoring and operation
slide-18
SLIDE 18

Comparison with Naiad

Naiad

  • Milliseconds, not seconds
  • Robust solutions to micro-stragglers
  • Sacrifices availability in the event of failure
  • Consumes inputs from a message queue and writes to a key-value store

Facebook Realtime Systems

  • Seconds, not milliseconds
  • Does not handle micro-stragglers
  • Persistent message bus ensures no data loss
  • Flexible and easy to use, deploy, and debug