CS 744: DATAFLOW, Shivaram Venkataraman, Fall 2019



slide-1
SLIDE 1

CS 744: DATAFLOW

Shivaram Venkataraman Fall 2019

slide-2
SLIDE 2

ADMINISTRIVIA

  • Assignment 2 grades up
  • Midterm grading
  • Course project proposal comments
  • AEFIS feedback
  • No Class next Tuesday?
slide-3
SLIDE 3

[Figure: course system stack]

  • Applications
  • Machine Learning, SQL, Streaming, Graph
  • Computational Engines
  • Scalable Storage Systems
  • Resource Management
  • Datacenter Architecture

slide-4
SLIDE 4

DATAFLOW MODEL (?)

slide-5
SLIDE 5

MOTIVATION

Streaming Video Provider

  • How much to bill each advertiser?
  • Need per-user, per-video viewing sessions
  • Handle out-of-order data

Goals

  • Easy to program
  • Balance correctness, latency and cost
slide-6
SLIDE 6

APPROACH

API Design

Separate user-facing model from execution

Decompose queries into

  • What is being computed
  • Where in event time is it computed
  • When in processing time is it materialized
  • How does it relate to earlier results
slide-7
SLIDE 7

TERMINOLOGY

Unbounded/bounded data

Streaming/Batch execution

Timestamps

  • Event time: when the event itself occurred
  • Processing time: when the event is observed by the system during processing
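To make the two timestamps concrete, here is a minimal plain-Java sketch (not Dataflow SDK code). The `Event` type and the `skew` helper are hypothetical names introduced only for illustration; skew is taken here to mean processing time minus event time for a single record.

```java
import java.time.Duration;
import java.time.Instant;

final class TimeDomains {
    /** Hypothetical event type that carries both timestamps explicitly. */
    static final class Event {
        final String key;
        final Instant eventTime;      // when the event occurred at the source
        final Instant processingTime; // when the pipeline observed it

        Event(String key, Instant eventTime, Instant processingTime) {
            this.key = key;
            this.eventTime = eventTime;
            this.processingTime = processingTime;
        }
    }

    /** Skew of one event: processing time minus event time. */
    static Duration skew(Event e) {
        return Duration.between(e.eventTime, e.processingTime);
    }

    public static void main(String[] args) {
        Event e = new Event("user-1",
                Instant.parse("2019-11-05T12:00:00Z"),
                Instant.parse("2019-11-05T12:02:30Z"));
        System.out.println(skew(e).getSeconds() + " seconds of skew");
    }
}
```

The sample event occurred at 12:00:00 and was observed at 12:02:30, so its skew is 150 seconds. Out-of-order data simply means that events with larger skew can arrive after events with smaller skew.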

slide-8
SLIDE 8

WINDOWING

slide-9
SLIDE 9

WATERMARK or SKEW

System has processed all events up to 12:02:30
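One way to picture a watermark is as a heuristic lower bound on the event times still to come. The sketch below is an assumption for illustration, not how any particular system computes watermarks: it tracks the largest event time seen and subtracts a fixed allowed lag. The class name `HeuristicWatermark` and its methods are hypothetical, and times are plain seconds since midnight.

```java
final class HeuristicWatermark {
    private final long allowedLagSeconds;
    private long maxEventTimeSeconds;
    private boolean seenAny = false;

    HeuristicWatermark(long allowedLagSeconds) {
        this.allowedLagSeconds = allowedLagSeconds;
    }

    /**
     * Records an event and reports whether it is late, i.e. whether its
     * event time is behind the watermark as it stood before this event.
     */
    boolean observe(long eventTimeSeconds) {
        boolean late = seenAny && eventTimeSeconds < current();
        if (!seenAny || eventTimeSeconds > maxEventTimeSeconds) {
            maxEventTimeSeconds = eventTimeSeconds;
        }
        seenAny = true;
        return late;
    }

    /** Current watermark: max event time seen minus the allowed lag. */
    long current() {
        if (!seenAny) {
            throw new IllegalStateException("no events observed yet");
        }
        return maxEventTimeSeconds - allowedLagSeconds;
    }

    public static void main(String[] args) {
        HeuristicWatermark wm = new HeuristicWatermark(30);
        long t120300 = 12 * 3600 + 3 * 60;      // 12:03:00
        long t120200 = 12 * 3600 + 2 * 60;      // 12:02:00
        wm.observe(t120300);                    // watermark becomes 12:02:30
        System.out.println("watermark = " + wm.current());
        System.out.println("12:02:00 late? " + wm.observe(t120200));
    }
}
```

With a 30-second lag, seeing an event stamped 12:03:00 puts the watermark at 12:02:30; an event stamped 12:02:00 that arrives afterwards is flagged as late. Because the watermark is a guess, late data can always occur, which is why completeness cannot be relied on.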

slide-10
SLIDE 10

API

ParDo:

GroupByKey:

Windowing

  • AssignWindow
  • MergeWindow
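The two windowing operations can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than the Dataflow SDK: `Window`, `assignFixed`, `assignSession`, and `mergeWindows` are hypothetical names, windows are half-open intervals `[start, end)` over integer timestamps, and merging is shown for the windows of a single key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class WindowSketch {
    /** Half-open event-time interval [start, end). */
    static final class Window {
        final long start;
        final long end;

        Window(long start, long end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /** Window assignment for fixed windows: align the timestamp down to a multiple of size. */
    static Window assignFixed(long timestamp, long size) {
        long start = timestamp - Math.floorMod(timestamp, size);
        return new Window(start, start + size);
    }

    /** Window assignment for sessions: each element starts as its own proto-session. */
    static Window assignSession(long timestamp, long gap) {
        return new Window(timestamp, timestamp + gap);
    }

    /** Window merging: combine overlapping windows (for one key) into larger ones. */
    static List<Window> mergeWindows(List<Window> windows) {
        List<Window> sorted = new ArrayList<>(windows);
        sorted.sort(Comparator.comparingLong((Window w) -> w.start));
        List<Window> merged = new ArrayList<>();
        for (Window w : sorted) {
            if (!merged.isEmpty() && w.start < merged.get(merged.size() - 1).end) {
                Window last = merged.remove(merged.size() - 1);
                merged.add(new Window(last.start, Math.max(last.end, w.end)));
            } else {
                merged.add(w);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(assignFixed(150, 120));
        List<Window> sessions = new ArrayList<>();
        for (long t : new long[] {0, 20, 100}) {
            sessions.add(assignSession(t, 30));
        }
        System.out.println(mergeWindows(sessions));
    }
}
```

Fixed windows need only assignment; sessions need both steps, because the extent of a session is not known until neighbouring elements have been seen and their proto-sessions merged.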

slide-11
SLIDE 11

EXAMPLE

GroupByKey

slide-12
SLIDE 12

TRIGGERS AND INCREMENTAL PROCESSING

Windowing: where in event time data are grouped

Triggering: when in processing time groups are emitted

Strategies

  • Discarding
  • Accumulating
  • Accumulating & Retracting
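The three strategies differ only in what each trigger firing emits. The sketch below is a simplified illustration in plain Java, assuming a single window, integer sums, and that each inner list holds the values that arrived between two firings. The class and method names are hypothetical, and a retraction is modelled as the negated previous result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AccumulationModes {
    /** Discarding: each firing emits only the sum of values since the last firing. */
    static List<Integer> discarding(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> pane : panes) {
            int sum = 0;
            for (int v : pane) {
                sum += v;
            }
            out.add(sum);
        }
        return out;
    }

    /** Accumulating: each firing emits the running total for the window. */
    static List<Integer> accumulating(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        for (List<Integer> pane : panes) {
            for (int v : pane) {
                total += v;
            }
            out.add(total);
        }
        return out;
    }

    /** Accumulating and retracting: retract the previous total, then emit the new one. */
    static List<List<Integer>> accumulatingAndRetracting(List<List<Integer>> panes) {
        List<List<Integer>> out = new ArrayList<>();
        int total = 0;
        boolean fired = false;
        for (List<Integer> pane : panes) {
            int previous = total;
            for (int v : pane) {
                total += v;
            }
            List<Integer> emitted = new ArrayList<>();
            if (fired) {
                emitted.add(-previous); // retraction of the earlier result
            }
            emitted.add(total);
            out.add(emitted);
            fired = true;
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> panes = Arrays.asList(
                Arrays.asList(3, 4), Arrays.asList(5), Arrays.asList(2, 1));
        System.out.println(discarding(panes));
        System.out.println(accumulating(panes));
        System.out.println(accumulatingAndRetracting(panes));
    }
}
```

For the sample panes, discarding emits 7, 5, 3; accumulating emits 7, 12, 15; accumulating and retracting emits 7, then -7 and 12, then -12 and 15. Summing everything a downstream consumer receives gives the correct total of 15 in the discarding and retracting cases, but not in the plain accumulating case.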

slide-13
SLIDE 13

RUNNING EXAMPLE

PCollection<KV<String, Integer>> input = IO.read(...);
PCollection<KV<String, Integer>> output = input.apply(Sum.integersPerKey());
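For a bounded input, the per-key sum in the running example computes the same thing as the plain-Java sketch below. This is not the SDK implementation; `SumPerKey`, `kv`, and `sumPerKey` are hypothetical names, and an in-memory list stands in for the input collection.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SumPerKey {
    /** Convenience constructor for a (key, value) pair. */
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new AbstractMap.SimpleImmutableEntry<String, Integer>(key, value);
    }

    /** Groups the pairs by key and sums the values of each group. */
    static Map<String, Integer> sumPerKey(List<Map.Entry<String, Integer>> input) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> e : input) {
            sums.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> input =
                Arrays.asList(kv("a", 1), kv("b", 2), kv("a", 3));
        System.out.println(sumPerKey(input));
    }
}
```

With the sample input, key "a" sums to 4 and key "b" to 2. On unbounded input there is no point at which this loop finishes, which is what windowing and triggers address.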

slide-14
SLIDE 14

GLOBAL WINDOWS, ACCUMULATE

PCollection<KV<String, Integer>> output = input
    .apply(Window.trigger(Repeat(AtPeriod(1, MINUTE)))
                 .accumulating())
    .apply(Sum.integersPerKey());

slide-15
SLIDE 15

GLOBAL WINDOWS, COUNT, DISCARDING

PCollection<KV<String, Integer>> output = input
    .apply(Window.trigger(Repeat(AtCount(2)))
                 .discarding())
    .apply(Sum.integersPerKey());
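The behaviour of a repeated count trigger with discarding can be sketched in plain Java as below. This is a simplified illustration for a single key in the global window; `CountTrigger` and `fireEvery` are hypothetical names and not part of the SDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CountTrigger {
    /**
     * Fires every `count` elements and emits the sum of the elements seen
     * since the previous firing (discarding mode). Elements left over at the
     * end stay buffered and are not emitted.
     */
    static List<Integer> fireEvery(int count, List<Integer> values) {
        List<Integer> panes = new ArrayList<>();
        int sum = 0;
        int buffered = 0;
        for (int v : values) {
            sum += v;
            buffered++;
            if (buffered == count) {
                panes.add(sum);
                sum = 0;      // discard state after firing
                buffered = 0;
            }
        }
        return panes;
    }

    public static void main(String[] args) {
        System.out.println(fireEvery(2, Arrays.asList(5, 7, 3, 4, 8)));
    }
}
```

For the sample values the trigger fires twice, emitting 12 and then 7; the trailing 8 waits for one more element before the next firing.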

slide-16
SLIDE 16

FIXED WINDOWS, MICRO BATCH

PCollection<KV<String, Integer>> output = input
    .apply(Window.into(FixedWindows.of(2, MINUTES))
                 .trigger(Repeat(AtWatermark()))
                 .accumulating())
    .apply(Sum.integersPerKey());
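A rough plain-Java picture of fixed windows with a watermark trigger is sketched below. It is deliberately simplified and makes several assumptions: a single key, integer timestamps in seconds, each window fires once when the watermark passes its end, and its state is then dropped, so late data is not handled. `FixedWindowSums` and its methods are hypothetical names.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

final class FixedWindowSums {
    private final long windowSize;
    // Running sum per window, keyed by window start (sorted by start time).
    private final TreeMap<Long, Integer> sumsByWindowStart = new TreeMap<>();

    FixedWindowSums(long windowSize) {
        this.windowSize = windowSize;
    }

    /** Assigns the value to its fixed window by event time and adds it to that window's sum. */
    void add(long eventTime, int value) {
        long start = eventTime - Math.floorMod(eventTime, windowSize);
        sumsByWindowStart.merge(start, value, Integer::sum);
    }

    /** Emits and removes every window whose end is at or before the watermark. */
    Map<Long, Integer> advanceWatermark(long watermark) {
        Map<Long, Integer> ready = new TreeMap<>();
        Iterator<Map.Entry<Long, Integer>> it = sumsByWindowStart.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Integer> e = it.next();
            if (e.getKey() + windowSize > watermark) {
                break; // later windows end even later
            }
            ready.put(e.getKey(), e.getValue());
            it.remove();
        }
        return ready;
    }

    public static void main(String[] args) {
        FixedWindowSums sums = new FixedWindowSums(120); // 2-minute windows
        sums.add(10, 5);
        sums.add(100, 7);
        sums.add(130, 3);
        System.out.println(sums.advanceWatermark(120));
        System.out.println(sums.advanceWatermark(240));
    }
}
```

With 2-minute windows, the values at times 10 and 100 land in the window starting at 0 and are emitted as 12 once the watermark reaches 120; the value at time 130 is emitted as 3 once the watermark reaches 240.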

slide-17
SLIDE 17

LESSONS / EXPERIENCES

Don’t rely on completeness

Be flexible, diverse use cases

  • Billing
  • Recommendation
  • Anomaly detection

Support analysis in context of events

slide-18
SLIDE 18

DISCUSSION

https://forms.gle/s7T2r67BDvkGQhmN9

slide-21
SLIDE 21

Suppose you are implementing a micro-batch streaming API on top of Apache Spark. What are some of the bottlenecks/challenges you might have in building such a system?
