CS 744: DATAFLOW Shivaram Venkataraman Fall 2020 - - PowerPoint PPT Presentation

slide-1
SLIDE 1

CS 744: DATAFLOW

Shivaram Venkataraman Fall 2020

Welcome!

slide-2
SLIDE 2

ADMINISTRIVIA

  • Assignment 2 grades are up! (Canvas)
  • Midterm grading in progress
  • Course project proposal comments: instructor feedback, plus peer feedback this week (Thursday)
  • AEFIS feedback (next slide)

slide-3
SLIDE 3

AEFIS FEEDBACK

  • Improve writing on the slides, speak slower
  • Get a better internet connection? Better microphone?
  • More office hour slots
  • Discussion groups: same group each time? Also add prof. input
  • More time for Midterm exam, more guidance on deliverables
  • More homework/hands-on experience vs. too many evaluation components?
  • Better organization

Let me know how this sounds?

slide-4
SLIDE 4

  • Applications: Machine Learning, SQL, Streaming (stream processing), Graph
  • Computational Engines (→ MapReduce, Spark)
  • Resource Management (Mesos, DRF)
  • Scalable Storage Systems (GFS)
  • Datacenter Architecture

slide-5
SLIDE 5

DATAFLOW MODEL (?)

  • Operators, or a DAG of operators
  • Examples: Spark, PyTorch

slide-6
SLIDE 6

MOTIVATION

Streaming Video Provider

  • How much to bill each advertiser ?
  • Need per-user, per-video viewing sessions
  • Handle out of order data

Goals

  • Easy to program
  • Balance correctness, latency and cost

Example: ESPN.com has videos, and each video has some ads.

  • Sessions also capture context (→ which city, etc.)
  • Data is unbounded and out of order (e.g., a phone that was offline)
  • Latency: how much delay until results are available
  • Correctness: how accurate are your results

slide-7
SLIDE 7

APPROACH

API Design

Separate user-facing model from execution

Decompose queries into

  • What is being computed
  • Where in time is it computed
  • When is it materialized
  • How does it relate to earlier results

Developers writing applications → Dataflow Model → framework

  • Streaming: the framework processes data as it arrives (process viewing events as and when they arrive)
  • Batch: the framework can process bounded data, similar to MapReduce (e.g., one day of events at a time)
  • Micro-batch: similar to streaming, with a very small batch

slide-8
SLIDE 8

TERMINOLOGY

Unbounded/bounded data: data is constantly arriving (e.g., ESPN.com)

Streaming/Batch execution: see previous slide

Timestamps

  • Event time: time when the event occurs with respect to the user (input), e.g., the time at which an ad was viewed in a video
  • Processing time: time at which an event is processed by the system, e.g., the time at which an ad-view event is processed to update the dashboard
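The two time domains can be illustrated with a small sketch. This is not from the slides: the `TimeDomains` and `Event` names are hypothetical, and timestamps are plain seconds chosen for illustration. It computes how far processing time lags event time for each event and flags events that arrive out of order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class TimeDomains {
    // One ad-view event; both timestamps are seconds on the same clock.
    static final class Event {
        final long eventTime;       // when the ad was viewed
        final long processingTime;  // when the system processed the event

        Event(long eventTime, long processingTime) {
            this.eventTime = eventTime;
            this.processingTime = processingTime;
        }

        // How far processing time lags event time for this event.
        long skew() {
            return processingTime - eventTime;
        }
    }

    // Returns events whose event time is earlier than the latest event
    // time already seen, scanning in arrival (processing-time) order.
    static List<Event> outOfOrder(List<Event> arrivals) {
        List<Event> late = new ArrayList<>();
        long maxSeen = Long.MIN_VALUE;
        for (Event e : arrivals) {
            if (e.eventTime < maxSeen) {
                late.add(e);
            }
            maxSeen = Math.max(maxSeen, e.eventTime);
        }
        return late;
    }

    public static void main(String[] args) {
        List<Event> arrivals = Arrays.asList(
            new Event(100, 101), new Event(130, 132), new Event(110, 140));
        for (Event e : outOfOrder(arrivals)) {
            System.out.println("out of order, skew = " + e.skew() + "s");
        }
    }
}
```

The third event was viewed at time 110 but processed at time 140, after an event from time 130, so it is the only one flagged.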

slide-9
SLIDE 9

WINDOWING

  • Windows are logical constructs (e.g., a window starting at 10am), defined across keys
  • Fixed windows: do not overlap
  • Sliding windows: overlap with other windows
  • Session windows: per key, based on the gap between consecutive events
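As a rough sketch of how a timestamp maps to windows, the code below assigns an event time to a fixed window and to all sliding windows that contain it. The `WindowAssign` class, its method names, and the use of half-open `[start, start + size)` windows are assumptions made for illustration, not an API from the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class WindowAssign {
    // Start of the fixed window [start, start + size) that contains ts.
    static long fixedWindowStart(long ts, long size) {
        return ts - Math.floorMod(ts, size);
    }

    // Starts of all sliding windows [start, start + size) that contain ts,
    // where a new window begins every `period` time units.
    static List<Long> slidingWindowStarts(long ts, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long latest = ts - Math.floorMod(ts, period); // latest start <= ts
        for (long s = latest; s > ts - size; s -= period) {
            starts.add(s);
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(fixedWindowStart(125, 60));
        System.out.println(slidingWindowStarts(125, 60, 30));
    }
}
```

With a size of 60 and a period of 30, the timestamp 125 falls into one fixed window but two sliding windows, which is why sliding windows overlap while fixed windows do not.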

slide-10
SLIDE 10

WATERMARK or SKEW

System has processed all events up to 12:02:30

  • The watermark is not easy to know, so use heuristics (e.g., after 10 mins, events from most devices have arrived)
  • Ideal case: processing time = event time, i.e., no gap between event time and processing time
  • In practice, processing time lags event time (event time skew), and the system may later catch up
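One possible heuristic, sketched below under stated assumptions, is to take the largest event time seen so far and subtract a fixed allowed lag. The `HeuristicWatermark` class and the choice of a 600-second lag (10 minutes) are illustrative assumptions; real systems derive watermarks from their sources in more involved ways.

```java
// Hypothetical helper, for illustration only.
final class HeuristicWatermark {
    private final long allowedLag;                 // e.g. 600 seconds
    private long maxEventTime = Long.MIN_VALUE;    // largest event time seen

    HeuristicWatermark(long allowedLag) {
        this.allowedLag = allowedLag;
    }

    // Current watermark: the system assumes all events with an
    // event time below this value have already arrived.
    long current() {
        return maxEventTime == Long.MIN_VALUE
            ? Long.MIN_VALUE
            : maxEventTime - allowedLag;
    }

    // Observes one event; returns true if it is late, meaning its
    // event time is already behind the watermark.
    boolean observe(long eventTime) {
        boolean late = eventTime < current();
        maxEventTime = Math.max(maxEventTime, eventTime);
        return late;
    }

    public static void main(String[] args) {
        HeuristicWatermark w = new HeuristicWatermark(600);
        w.observe(1000);
        w.observe(2000);
        System.out.println(w.current());      // watermark after two events
        System.out.println(w.observe(1300));  // is this event late?
    }
}
```

A larger lag means fewer late events but more delay before a window can be considered complete, which is the correctness versus latency trade-off from the motivation.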

slide-11
SLIDE 11

API

  • ParDo: like Map in MapReduce or flatMap in Spark
  • GroupByKey: like Reduce in MapReduce

Windowing

  • AssignWindow: buckets a tuple into a window
  • MergeWindow: merges buckets based on a strategy (e.g., sessions)

slide-12
SLIDE 12

EXAMPLE

GroupByKey with session windows

  • Assign tuples to sessions: each tuple gets a window based on its timestamp
  • Find windows that overlap and merge them
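The assign-then-merge idea for sessions can be sketched for a single key as follows. This is an illustrative sketch, not the slide's code: the `Sessions` class is hypothetical, each timestamp is assumed to get a window `[ts, ts + gap)`, and the gap value in the example is arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only.
final class Sessions {
    // Returns merged session windows as {start, end} pairs for one key.
    static List<long[]> sessions(List<Long> timestamps, long gap) {
        List<Long> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<long[]> merged = new ArrayList<>();
        for (long ts : sorted) {
            // Assign step: each tuple gets its own window [ts, ts + gap).
            long start = ts;
            long end = ts + gap;
            if (!merged.isEmpty() && start < merged.get(merged.size() - 1)[1]) {
                // Merge step: the new window overlaps the last one, so extend it.
                long[] last = merged.get(merged.size() - 1);
                last[1] = Math.max(last[1], end);
            } else {
                merged.add(new long[] {start, end});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        for (long[] w : sessions(Arrays.asList(10L, 0L, 50L), 30)) {
            System.out.println("[" + w[0] + ", " + w[1] + ")");
        }
    }
}
```

Timestamps 0 and 10 produce overlapping windows that merge into one session, while 50 starts a new session because it falls after the first session ends.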

slide-13
SLIDE 13

TRIGGERS AND INCREMENTAL PROCESSING

Windowing: where in event time data are grouped

Triggering: when in processing time groups are emitted

Strategies

  • Discarding
  • Accumulating
  • Accumulating & Retracting

Example: if two trigger firings see sums of 5 and 6, accumulating outputs 5 and then 11; retracting additionally withdraws the earlier output.
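The three strategies can be compared with a short sketch. The `Refinement` class and `Mode` enum are hypothetical names, and representing a retraction of a sum as the negated previous value is an assumption made to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only.
final class Refinement {
    enum Mode { DISCARDING, ACCUMULATING, ACCUMULATING_AND_RETRACTING }

    // paneDeltas: sum of the new values seen before each trigger firing.
    // Returns the values emitted downstream, in order.
    static List<Integer> emit(List<Integer> paneDeltas, Mode mode) {
        List<Integer> out = new ArrayList<>();
        int total = 0;
        boolean first = true;
        for (int delta : paneDeltas) {
            int previous = total;
            total += delta;
            switch (mode) {
                case DISCARDING:
                    out.add(delta);          // only the new data
                    break;
                case ACCUMULATING:
                    out.add(total);          // everything seen so far
                    break;
                case ACCUMULATING_AND_RETRACTING:
                    if (!first) {
                        out.add(-previous);  // retract the earlier output
                    }
                    out.add(total);
                    break;
            }
            first = false;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> panes = Arrays.asList(5, 6);
        for (Mode m : Mode.values()) {
            System.out.println(m + " -> " + emit(panes, m));
        }
    }
}
```

Discarding suits consumers that sum the outputs themselves, accumulating suits consumers that overwrite the previous result, and retracting suits consumers that regroup the output downstream.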

slide-14
SLIDE 14

RUNNING EXAMPLE

PCollection<KV<String, Integer>> input = IO.read(...);
PCollection<KV<String, Integer>> output = input
  .apply(Sum.integersPerKey());

Single sum for each key.

slide-15
SLIDE 15

GLOBAL WINDOWS, ACCUMULATE

PCollection<KV<String, Integer>> output = input
  .apply(Window.trigger(Repeat(AtPeriod(1, MINUTE)))
               .accumulating())
  .apply(Sum.integersPerKey());

Triggered every 1 min in processing time. Each output includes the earlier ones, e.g., 12 + 10 = 22 and 33 + 18 = 51.

slide-16
SLIDE 16

GLOBAL WINDOWS, COUNT, DISCARDING

PCollection<KV<String, Integer>> output = input
  .apply(Window.trigger(Repeat(AtCount(2)))
               .discarding())
  .apply(Sum.integersPerKey());

slide-17
SLIDE 17

FIXED WINDOWS, MICRO BATCH

PCollection<KV<String, Integer>> output = input
  .apply(Window.into(FixedWindows.of(2, MINUTES))
               .trigger(Repeat(AtWatermark()))
               .accumulating())
  .apply(Sum.integersPerKey());

  • 12:00 to 12:02: 5
  • 12:02 to 12:04: 14
  • 12:04 to 12:06: 3

slide-18
SLIDE 18

SUMMARY/LESSONS

Design for unbounded data: don't rely on completeness

Be flexible, diverse use cases

  • Billing
  • Recommendation
  • Anomaly detection

Windowing, Trigger API to simplify programming on unbounded data

slide-19
SLIDE 19

DISCUSSION

https://forms.gle/jwHjTBbR49vyQASq6

slide-20
SLIDE 20

Fixed windows, streaming

  • A window fires every time the watermark passes
  • Assume the watermark is given

Micro batch

  • ⇒ worse latency
  • ⇒ fewer outputs

Streaming ⇒ partial outputs

Pipeline: event time → ingest into the system (Apache Kafka, Pub/Sub) → processing time → update query, persist to disk

slide-21
SLIDE 21

Consider you are implementing a micro-batch streaming API on top of Apache Spark. What are some of the bottlenecks/challenges you might have in building such a system?

slide-22
SLIDE 22

NEXT STEPS

  • Next class: Naiad
  • Course project proposal peer feedback