An Intro to Modern Data Stream Analytics
EIT Summer School 2016
Paris Carbone
PhD Candidate @ KTH <parisc@kth.se>
Committer @ Apache Flink <senorcarbone@apache.org>
Motivation
Time-critical problems / Actionable Insights
more like First-World Problems..
Deploy Sensors
Analyse Data Regularly
Collect Data
evacuation window
earth & wave activity
Standing Query
evacuation window
most recent views of the stream ~ windows
Where are computations stored?
We cannot infinitely store all events seen
machines…
a summary of everything seen so far
whole stream
records with a 5% error
properties such as shortest paths
training and classification
Any other problems?
Does this scale?
kafka partitions)
how do streams get partitioned?
parallel task instance. Typical partitioners are:
by color
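Partitioning "by color" can be pictured as hash partitioning on a key: every record with the same key lands on the same parallel task instance. The following is a minimal plain-Java sketch of that idea, not code from any of the systems discussed here; the class name `KeyPartitionerSketch`, the method `partitionFor`, and the parallelism of 4 are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class KeyPartitionerSketch {

    // Maps a key to one of `parallelism` task instances (hypothetical helper).
    // floorMod keeps the result in [0, parallelism) even for negative hash codes.
    static int partitionFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4; // assumed number of parallel task instances
        List<String> colors = Arrays.asList("red", "blue", "red", "green", "blue");
        for (String color : colors) {
            // Records with the same color always print the same instance index.
            System.out.println(color + " -> instance " + partitionFor(color, parallelism));
        }
    }
}
```

Because the target instance depends only on the key, per-key state can stay local to one instance.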
[Diagram: sources (stream1, stream2) → Q → sinks (approximations, predictions, alerts, ……)]
(Bolts)
Spout Bolt Bolt
Spouts are the topology sources. They listen to data feeds.
Bolts represent all intermediate computation vertices of the topology. They do arbitrary data manipulation.
Each operator can emit/subscribe to Streams (computation results).
numbers
new_numbers
numbers
new_numbers toFile
Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure
Open Source: Flink, Storm, Samza, Spark, Beam
Compositional Declarative
(Operators/Data Exchange)
Tuning
windowing is supported
DStream, DataStream, PCollection…
execution graph / topology
and data analysts
[Chart: #unique contributor ids by git commits, July 2009 to May 2016]
source software development
contributors
Apache Flink
Stream Pipelines Batch Pipelines Scalable Machine Learning Graph Analytics
APIs Execution
DataStream DataSet Distributed Dataflow Deployment
Graph-Gelly Table ML
Hadoop M/R
Table CEP SQL SQL ML Graph-Gelly
Source → Data Stream → Operator → Data Stream → Sink
Source → Data Set → Operator → Data Set → Sink
Writing a Flink Program
1. Bootstrap Sources
2. Apply Operators
3. Output to Sinks
DataStream
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .sum(1)
  .print()
“live and let live”
“live” “and” “let” “live”
(live,1) (and,1) (let,1) (live,1)
(live,1) (and,1) (let,1) (live,2)
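To make the rolling count concrete, here is a minimal plain-Java sketch of what the split, map, keyBy and sum steps compute for one line of text. It does not use Flink; the class `RollingWordCountSketch` and its method `count` are hypothetical names, and the per-key state is simply an in-memory map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RollingWordCountSketch {

    // Emits one "(word,count)" record per input word, where count is the
    // running total for that word so far (the state kept per key).
    static List<String> count(String line) {
        Map<String, Integer> counts = new HashMap<>(); // per-key state
        List<String> output = new ArrayList<>();
        for (String word : line.split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            int updated = counts.merge(word, 1, Integer::sum);
            output.add("(" + word + "," + updated + ")");
        }
        return output;
    }

    public static void main(String[] args) {
        // prints [(live,1), (and,1), (let,1), (live,2)]
        System.out.println(count("live and let live"));
    }
}
```

The second occurrence of "live" yields (live,2) because the count for that key is carried over between records.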
Why windows?
Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!
[Figure: window sums SUM #1, SUM #2 and SUM #3 over a time axis in seconds (20 to 120), with events at 15, 38, 65 and 88]
1) Sliding windows
myKeyedStream.timeWindow(Time.seconds(60), Time.seconds(20));
2) Tumbling windows
myKeyedStream.timeWindow(Time.seconds(60));
window buckets/panes
We are often interested in fresh data!
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()
counting words over windows
10:48 “live and” → Window (10:45-10:50): (live,1) (and,1)
11:01 “let live” → Window (11:00-11:05): (let,1) (live,1)
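The mapping from a timestamp to its window(s) can be sketched with simple arithmetic. This is a plain-Java illustration under the assumption that windows are aligned to multiples of the window size (tumbling) or of the slide (sliding); the class `WindowAssignmentSketch` and its methods are hypothetical names, not Flink API. Timestamps are plain numbers: minutes of the day for the 5-minute example, seconds for the 60/20 sliding example.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowAssignmentSketch {

    // Start of the single tumbling window containing `timestamp`.
    static long tumblingStart(long timestamp, long size) {
        return timestamp - Math.floorMod(timestamp, size);
    }

    // Starts of all sliding windows [start, start + size) containing `timestamp`,
    // latest window first. Early timestamps can yield negative starts.
    static List<Long> slidingStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 10:48 is minute 648 of the day; its 5-minute window starts at 645 (10:45).
        System.out.println(tumblingStart(648, 5));      // prints 645
        // An event at second 65 falls into three 60-second windows sliding by 20.
        System.out.println(slidingStarts(65, 60, 20));  // prints [60, 40, 20]
    }
}
```

With a size of 60 and a slide of 20, each event belongs to size / slide = 3 windows, which is why the same event contributes to several sums.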
flatMap → map → window sum → print
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()
where counts are kept in state
flatMap → map → window sum → print
textStream
  .flatMap {_.split("\\W+")}
  .map {(_, 1)}
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .setParallelism(4)
  .print()
[Figure: snapshotting at times t1 and t2 produces snapshots snap - t1 and snap - t2]
State is not affected by failures.
When failures occur we revert computation and state back to a snapshot.
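The revert-to-snapshot idea can be sketched in a few lines of plain Java. This is a simplified single-task illustration, not Flink's snapshotting mechanism: it assumes the input can be replayed from the position recorded with the snapshot, and the class `SnapshotRecoverySketch`, the method `run`, and the parameters `snapshotEvery` and `failAt` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SnapshotRecoverySketch {

    // Counts events per key. Every `snapshotEvery` events the state and the input
    // position are copied. A single failure is simulated just before the event at
    // index `failAt` (use -1 for no failure): state and position are reverted.
    static Map<String, Integer> run(List<String> events, int snapshotEvery, int failAt) {
        Map<String, Integer> state = new HashMap<>();
        Map<String, Integer> snapshot = new HashMap<>();
        int snapshotPosition = 0;
        boolean failed = false;
        int position = 0;
        while (position < events.size()) {
            if (!failed && position == failAt) {
                failed = true;
                state = new HashMap<>(snapshot); // revert state
                position = snapshotPosition;     // revert computation (replay input)
                continue;
            }
            state.merge(events.get(position), 1, Integer::sum);
            position++;
            if (position % snapshotEvery == 0) {
                snapshot = new HashMap<>(state);
                snapshotPosition = position;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("a", "b", "a", "c", "a");
        // Both runs end with the same counts: a=3, b=1, c=1.
        System.out.println(run(events, 2, -1));
        System.out.println(run(events, 2, 3));
    }
}
```

Events processed after the last snapshot are replayed after the failure, so each event is reflected in the final state exactly once.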
events
Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
Two major differences
1) Stream Execution
2) Mutable State
(Spark Streaming)
put new states in output RDD
dstream.updateStateByKey(…)
In S’ S
PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream,
    Pattern
        .begin("seismic").where(evt -> evt.motion.equals("ClassB"))
        .next("tidal").where(evt -> evt.elevation > 500));

DataStream<Alert> result = tsunamiPattern.select(
    pattern -> {
        return getEvacuationAlert(pattern);
    });
CEP Library Example (Java)
Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count etc…
Coming up next: Dynamic graph processing support
// Ingest a DataStream from an external source
DataStream<Tuple3<Long, String, Integer>> ds = env.addSource(...);
// Register the DataStream as table "Orders"
tableEnv.registerDataStream("Orders", ds, "user, product, amount");
// Run a SQL query on the Table and retrieve the result as a new Table
Table result = tableEnv.sql(
    "SELECT STREAM product, amount FROM Orders WHERE product LIKE '%Rubber%'");
Example Stream SQL on Table API