An Intro to Modern Data Stream Analytics - EIT Summer School 2016



SLIDE 1

An Intro to Modern Data Stream Analytics

EIT Summer School 2016

Paris Carbone, PhD Candidate @ KTH <parisc@kth.se>, Committer @ Apache Flink <senorcarbone@apache.org>


SLIDE 2

Motivation

  • Time-critical problems / Actionable Insights
  • Stock market predictions
  • Fraud detection
  • Network security
  • Fresh customer recommendations


more like First-World Problems..

SLIDE 3

How about Tsunamis?

SLIDE 4

[Diagram: tsunami early warning - deploy sensors, collect data (earth & wave activity), analyse data regularly with a query Q; the output determines the evacuation window]

SLIDE 5

Motivation

[Diagram: the query Q is re-issued repeatedly over the growing collected data]

SLIDE 6

Motivation

[Diagram: a standing query Q is evaluated continuously; its output is the evacuation window]

SLIDE 7

The Data Stream Paradigm

  • Standing queries are evaluated continuously
  • Input data is unbounded
  • Queries operate on the full data stream or on the most recent views of the stream ~ windows


SLIDE 8

Data Stream Basics

  • Events/Tuples : elements of computation - respect a schema
  • Data Streams : unbounded sequences of events
  • Stream Operators/Tasks: consume and produce data streams
  • Events are consumed once - no backtracking!


[Diagram: an operator f consuming streams S1, S2 and producing S'1, S'2 - where are computations stored?]

SLIDE 9

Synopsis-Task State

We cannot store all events seen - the stream is unbounded

  • Synopsis: A summary of an infinite stream
  • It is in principle any streaming operator state
  • Examples: samples, histograms, sketches, state machines…


[Diagram: operator f with synopsis state s - a summary of everything seen so far]

For each input event t:

  • 1. process t against s
  • 2. update s
  • 3. produce output t'
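The three-step loop above can be sketched in plain Python (illustrative only - not Flink or Storm API; class and method names are hypothetical):

```python
class RunningCountOperator:
    """A stream task whose synopsis s is a single running count."""

    def __init__(self):
        self.s = 0  # synopsis: a summary of everything seen so far

    def on_event(self, t):
        # steps 1-2: process t against s, then update s
        self.s += 1
        # step 3: produce the output event t'
        return (t, self.s)

op = RunningCountOperator()
outputs = [op.on_event(x) for x in ["a", "b", "a"]]
```

The synopsis stays constant-size no matter how many events arrive, which is the whole point of the pattern.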

SLIDE 10

Synopses-Aggregations

  • Discussion - Rolling Aggregations
  • Propose a synopsis s = ? when:
  • f = max
  • f = ArithmeticMean
  • f = stDev
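One possible set of answers to the discussion, sketched in plain Python (not Flink code): each aggregate needs only a constant-size synopsis - the current max, a (count, sum) pair, and Welford's (count, mean, M2) triple respectively.

```python
import math

class RollingMax:
    def __init__(self):
        self.s = -math.inf            # synopsis: the max so far
    def update(self, x):
        self.s = max(self.s, x)
        return self.s

class RollingMean:
    def __init__(self):
        self.n, self.total = 0, 0.0   # synopsis: (count, sum)
    def update(self, x):
        self.n += 1
        self.total += x
        return self.total / self.n

class RollingStdev:
    """Welford's online algorithm: synopsis is (count, mean, M2)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return math.sqrt(self.m2 / self.n)  # population stdev so far
```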


SLIDE 11

Synopses-Approximations


  • Discussion - Approximate Results
  • Propose a synopsis s = ? when:
  • f = uniform random sample of k records over the whole stream
  • f = filter distinct records over windows of 1000 records with a 5% error
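A classic synopsis for the first discussion point is reservoir sampling: it maintains a uniform random sample of k records over an unbounded stream in O(k) space. (A Bloom filter would fit the second point: approximate distinct-filtering with a tunable false-positive rate.) A minimal sketch:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)          # fill the reservoir first
        else:
            j = rng.randint(0, i)     # keep x with probability k / (i + 1)
            if j < k:
                sample[j] = x
    return sample
```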

SLIDE 12

Synopses-ML and Graphs


  • Examples of cool synopses to check out
  • Sparsifiers/Spanners - approximating graph properties such as shortest paths
  • Change detectors - detecting concept drift
  • Incremental decision trees - continuous stream training and classification

SLIDE 13

Data Stream Basics

Any other problems?


[Diagram: a single operator f consuming S1, S2 and producing S'1, S'2 - does this scale?]

SLIDE 14

Task Parallelism

  • We need task parallelism:
  • Data might be too large to process
  • State can get too large to fit in memory (e.g. graphs)
  • Data Streams might already be partitioned! (e.g. by key / Kafka partitions)


[Diagram: parallel instances of operator f over partitioned streams - how do streams get partitioned?]

SLIDE 15

Task Partitioning

  • Partitioning defines how we allocate events to each parallel task instance. Typical partitioners are:
  • Broadcast
  • Shuffle
  • Key-based
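The three partitioners can be sketched as functions that map an event to the parallel task instance(s) that should receive it (plain Python, hypothetical helper names):

```python
import random

def broadcast(event, parallelism):
    # every parallel instance receives a copy of the event
    return list(range(parallelism))

def shuffle(event, parallelism, rng=random.Random(0)):
    # the event goes to one instance chosen at random (load balancing)
    return [rng.randrange(parallelism)]

def key_by(event, key_fn, parallelism):
    # events with the same key always land on the same instance,
    # so per-key state can live locally at that instance
    return [hash(key_fn(event)) % parallelism]
```

Key-based partitioning is what makes partitioned (per-key) operator state possible.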

[Diagram: three partitioners P routing events to parallel stateful task instances (f, s), partitioned by color/key]

SLIDE 16

Dataflow Pipelines


[Diagram: a dataflow pipeline - sources (stream1, stream2) feed a query Q, which emits approximations, predictions, alerts, … to sinks]

SLIDE 17

Dataflow Programming with Apache Storm


  • Step 1: Implement input operators (Spouts) and intermediate operators (Bolts)
  • Step 2: Construct a Topology by combining operators

[Diagram: Spout → Bolt → Bolt]

Spouts are the topology sources; they listen to data feeds. Bolts represent all intermediate computation vertices of the topology; they do arbitrary data manipulation. Each operator can emit/subscribe to Streams (computation results).

SLIDE 18

Example: Topology Definition

[Code slide: a topology definition wiring a "numbers" spout to a "new_numbers" bolt, whose stream is written out by a "toFile" bolt]

SLIDE 19

Stream Analytics Systems


Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure
Open Source: Flink, Storm, Samza, Spark, Beam

SLIDE 20

Programming Models


Compositional:
  • Physical representations
  • Offer basic building blocks (Operators / Data Exchange)
  • Custom optimisation / tuning

Declarative:
  • Logical representations
  • Operators are transformations on abstract data types
  • Advanced behaviour such as windowing is supported
  • Self-optimisation
SLIDE 21

Programming Abstraction Levels


DStream, DataStream, PCollection…

Compositional:
  • Direct access to the execution graph / topology
  • Suitable for engineers

Declarative:
  • Transformations abstract away operator details
  • Suitable for engineers and data analysts

SLIDE 22

Introducing Apache Flink

[Chart: #unique contributor ids by git commits, Jul 2009 - May 2016]

  • A Top-level project
  • Community-driven open source software development
  • Publicly open to new contributors

SLIDE 23

Native Workload Support

Apache Flink

  • Stream Pipelines
  • Batch Pipelines
  • Scalable Machine Learning
  • Graph Analytics

SLIDE 24


The Apache Flink Stack

APIs: DataStream, DataSet
Execution: Distributed Dataflow, Deployment

  • DataSet: bounded data sources, staged/pipelined execution
  • DataStream: unbounded data sources, pipelined execution
SLIDE 25

The Big Picture

[Diagram: libraries - Graph-Gelly, Table, ML, CEP, SQL, Hadoop M/R compatibility - layered on top of the DataStream and DataSet APIs, which run on the Distributed Dataflow engine and Deployment layer]

SLIDE 26


Basic API Concept

Source → Data Stream → Operator → Data Stream → Sink

Source → Data Set → Operator → Data Set → Sink

Writing a Flink Program:
1. Bootstrap Sources
2. Apply Operators
3. Output to Sinks

SLIDE 27

Data Streams as Abstract Data Types

  • Tasks are distributed and run in a pipelined fashion.
  • State is kept within tasks.
  • Transformations are applied per-record or window.
  • Transformations: map, flatmap, filter, union…
  • Aggregations: reduce, fold, sum
  • Partitioning: forward, broadcast, shuffle, keyBy
  • Sources/Sinks: custom or Kafka, Twitter, Collections…

DataStream

SLIDE 28

Example


textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .sum(1)
  .print()

Input: "live and let live"
Tokens: "live" "and" "let" "live"
Pairs: (live,1) (and,1) (let,1) (live,1)
Running counts: (live,1) (and,1) (let,1) (live,2)

SLIDE 29

Working with Windows

29

Why windows? We are often interested in fresh data!

Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!

1) Sliding windows (overlapping buckets/panes):

myKeyedStream.timeWindow(
    Time.seconds(60), Time.seconds(20));

2) Tumbling windows (disjoint buckets/panes):

myKeyedStream.timeWindow(
    Time.seconds(60));

[Diagram: 60-second sums over an event timeline; sliding windows emit a new sum every 20 seconds, tumbling windows every 60 seconds]
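How an event's timestamp maps to window panes can be sketched in plain Python (an illustration of the semantics, not Flink internals): a tumbling window assigns each event to exactly one pane, while a 60s window sliding every 20s assigns it to size/slide = 3 overlapping panes.

```python
def tumbling_windows(ts, size):
    """The single [start, end) pane containing timestamp ts."""
    start = (ts // size) * size
    return [(start, start + size)]

def sliding_windows(ts, size, slide):
    """All [start, end) panes of the sliding window containing ts."""
    last_start = (ts // slide) * slide
    return [(s, s + size)
            for s in range(last_start, ts - size, -slide)
            if s >= 0]
```

For example, an event at second 65 falls into one tumbling pane (60, 120) but into three sliding panes (60, 120), (40, 100) and (20, 80).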

SLIDE 30

Example


textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()

Counting words over windows: "live and" arrives at 10:48 and is counted in Window (10:45-10:50) as (live,1) (and,1); "let live" arrives at 11:01 and is counted in Window (11:00-11:05) as (let,1) (live,1).

SLIDE 31

Example


[Diagram: pipeline flatMap → map → window → sum → print; per-key counts are kept in window state]

textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()

SLIDE 32

Example


[Diagram: pipeline flatMap → map → window → sum (parallelism 4) → print]

textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1).setParallelism(4)
  .print()

SLIDE 33

Making State Explicit


  • Explicitly defined state is durable to failures
  • Flink supports two types of explicit states
  • Operator State - full state
  • Key-Value State - partitioned state per key
  • State Backends: In-memory, RocksDB, HDFS
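The distinction between the two state types can be sketched in plain Python (illustrative, not Flink's state API): operator state is one value for the whole task, key-value state is partitioned per key.

```python
class WordCountTask:
    """A task holding both kinds of explicit state."""

    def __init__(self):
        self.operator_state = 0   # full state: total events seen by this task
        self.keyed_state = {}     # partitioned state: one counter per key

    def on_event(self, word):
        self.operator_state += 1
        self.keyed_state[word] = self.keyed_state.get(word, 0) + 1
        return (word, self.keyed_state[word])
```

Because keyed state is partitioned, it can be redistributed across parallel instances along with the keys themselves.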
SLIDE 34

Fault Tolerance


[Diagram: events flow through the pipeline while snapshots snap-t1 and snap-t2 are taken at times t1 and t2]

State is not lost on failures: when a failure occurs we revert computation and state back to the latest snapshot.
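The recovery idea can be sketched in a few lines of plain Python (a simplification - Flink's actual mechanism uses distributed snapshots with barriers): periodically copy the state; on failure, revert to the snapshot and replay the events that arrived after it.

```python
import copy

class RecoverableCounter:
    def __init__(self):
        self.state = {}
        self.snapshot = {}

    def on_event(self, key):
        self.state[key] = self.state.get(key, 0) + 1

    def take_snapshot(self):
        # durably copy the current state
        self.snapshot = copy.deepcopy(self.state)

    def recover(self, replay):
        # revert to the snapshot, then replay post-snapshot events
        self.state = copy.deepcopy(self.snapshot)
        for key in replay:
            self.on_event(key)
```

After recovery the state is the same as if the failure had never happened, provided the post-snapshot events can be replayed (e.g. from a replayable source such as Kafka).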

SLIDE 35

Performance

  • Twitter Hack Week - Flink as an in-memory data store

Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

SLIDE 36

So how is Flink different from Spark?


Two major differences: 1) Stream Execution 2) Mutable State

SLIDE 37

Flink vs Spark


Spark Streaming:
  • dstream.updateStateByKey(…) puts new states in an output RDD
  • leased resources
  • immutable state

Flink:
  • dedicated resources
  • mutable state
SLIDE 38

What about DataSets?


  • Sophisticated SQL-inspired optimiser
  • Efficient Join Strategies
  • Managed Memory bypasses Garbage Collection
  • Fast, in-memory Iterative Bulk Computations
SLIDE 39

Some Interesting Libraries


SLIDE 40

Detecting Patterns


PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream,
    Pattern
        .begin("seismic").where(evt -> evt.motion.equals("ClassB"))
        .next("tidal").where(evt -> evt.elevation > 500));

DataStream<Alert> result = tsunamiPattern.select(
    pattern -> {
        return getEvacuationAlert(pattern);
    });

CEP Library Example (Java)

SLIDE 41

Mining Graphs with Gelly


  • Iterative Graph Processing
  • Scatter-Gather
  • Gather-Sum-Apply
  • Graph Transformations/Properties
  • Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count etc…

Coming up next: dynamic graph processing support

SLIDE 42

Machine Learning Pipelines


  • Scikit-learn inspired pipelining
  • Supervised: SVM, Linear Regression
  • Preprocessing: Polynomial Features, Scalers
  • Recommendation: ALS
SLIDE 43

Relational Queries


// Ingest a DataStream from an external source
DataStream<Tuple3<Long, String, Integer>> ds = env.addSource(...);

// Register the DataStream as table "Orders"
tableEnv.registerDataStream("Orders", ds, "user, product, amount");

// Run a SQL query on the Table and retrieve the result as a new Table
Table result = tableEnv.sql(
    "SELECT STREAM product, amount FROM Orders WHERE product LIKE '%Rubber%'");

Example Stream SQL on Table API

SLIDE 44

Live Monitoring


SLIDE 45

Coming Soon


  • Stream ML
  • Stream Graph Processing (Gelly-Stream)
  • Autoscaling
  • Incremental Snapshots