An Intro to Modern Data Stream Analytics - EIT Summer School 2016



SLIDE 1

An Intro to Modern Data Stream Analytics

EIT Summer School 2016

Paris Carbone, PhD Candidate @ KTH <parisc@kth.se>, Committer @ Apache Flink <senorcarbone@apache.org>


SLIDE 2

Motivation

  • Time-critical problems / Actionable Insights
  • Stock market predictions
  • Fraud detection
  • Network security
  • Fresh customer recommendations


more like First-World Problems..

SLIDE 3

How about Tsunamis?

SLIDE 4

[Diagram: tsunami early warning - deploy sensors, collect data (earth & wave activity), analyse data regularly with a query Q; the output determines the evacuation window]

SLIDE 5

Motivation

[Diagram: the query Q is re-issued repeatedly over the growing collected data]

SLIDE 6

Motivation

[Diagram: a standing query Q is evaluated continuously; its output is the evacuation window]

SLIDE 7

The Data Stream Paradigm

  • Standing queries are evaluated continuously
  • Input data is unbounded
  • Queries operate on the full data stream or on the most recent views of the stream ~ windows


SLIDE 8

Data Stream Basics

  • Events/Tuples : elements of computation - respect a schema
  • Data Streams : unbounded sequences of events
  • Stream Operators/Tasks: consume and produce data streams
  • Events are consumed once - no backtracking!


[Diagram: an operator f consuming streams S1, S2 and producing S'1, S'2 - where are computations stored?]

SLIDE 9

Synopsis-Task State

We cannot store all events seen - the stream is unbounded

  • Synopsis: A summary of an infinite stream
  • It is in principle any streaming operator state
  • Examples: samples, histograms, sketches, state machines…


[Diagram: operator f with synopsis state s - a summary of everything seen so far]

For each input event t:

  • 1. process t against s
  • 2. update s
  • 3. produce output t'
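The three-step loop above can be sketched in plain Python (illustrative only - not Flink or Storm API; class and method names are hypothetical):

```python
class RunningCountOperator:
    """A stream task whose synopsis s is a single running count."""

    def __init__(self):
        self.s = 0  # synopsis: a summary of everything seen so far

    def on_event(self, t):
        # steps 1-2: process t against s, then update s
        self.s += 1
        # step 3: produce the output event t'
        return (t, self.s)

op = RunningCountOperator()
outputs = [op.on_event(x) for x in ["a", "b", "a"]]
```

The synopsis stays constant-size no matter how many events arrive, which is the whole point of the pattern.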

SLIDE 10

Synopses-Aggregations

  • Discussion - Rolling Aggregations
  • Propose a synopsis s = ? when:
  • f = max
  • f = ArithmeticMean
  • f = stDev
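One possible set of answers to the discussion, sketched in plain Python (not Flink code): each aggregate needs only a constant-size synopsis - the current max, a (count, sum) pair, and Welford's (count, mean, M2) triple respectively.

```python
import math

class RollingMax:
    def __init__(self):
        self.s = -math.inf            # synopsis: the max so far
    def update(self, x):
        self.s = max(self.s, x)
        return self.s

class RollingMean:
    def __init__(self):
        self.n, self.total = 0, 0.0   # synopsis: (count, sum)
    def update(self, x):
        self.n += 1
        self.total += x
        return self.total / self.n

class RollingStdev:
    """Welford's online algorithm: synopsis is (count, mean, M2)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return math.sqrt(self.m2 / self.n)  # population stdev so far
```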


SLIDE 11

Synopses-Approximations


  • Discussion - Approximate Results
  • Propose a synopsis s = ? when:
  • f = uniform random sample of k records over the whole stream
  • f = filter distinct records over windows of 1000 records with a 5% error
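A classic synopsis for the first discussion point is reservoir sampling: it maintains a uniform random sample of k records over an unbounded stream in O(k) space. (A Bloom filter would fit the second point: approximate distinct-filtering with a tunable false-positive rate.) A minimal sketch:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)          # fill the reservoir first
        else:
            j = rng.randint(0, i)     # keep x with probability k / (i + 1)
            if j < k:
                sample[j] = x
    return sample
```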

SLIDE 12

Synopses-ML and Graphs


  • Examples of cool synopses to check out
  • Sparsifiers/Spanners - approximating graph properties such as shortest paths
  • Change detectors - detecting concept drift
  • Incremental decision trees - continuous stream training and classification

SLIDE 13

Data Stream Basics

Any other problems?


[Diagram: a single operator f consuming S1, S2 and producing S'1, S'2 - does this scale?]

SLIDE 14

Task Parallelism

  • We need task parallelism:
  • Data might be too large to process
  • State can get too large to fit in memory (e.g. graphs)
  • Data Streams might already be partitioned! (e.g. by key / Kafka partitions)


[Diagram: parallel instances of operator f over partitioned streams - how do streams get partitioned?]

SLIDE 15

Task Partitioning

  • Partitioning defines how we allocate events to each parallel task instance. Typical partitioners are:
  • Broadcast
  • Shuffle
  • Key-based
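The three partitioners can be sketched as functions that map an event to the parallel task instance(s) that should receive it (plain Python, hypothetical helper names):

```python
import random

def broadcast(event, parallelism):
    # every parallel instance receives a copy of the event
    return list(range(parallelism))

def shuffle(event, parallelism, rng=random.Random(0)):
    # the event goes to one instance chosen at random (load balancing)
    return [rng.randrange(parallelism)]

def key_by(event, key_fn, parallelism):
    # events with the same key always land on the same instance,
    # so per-key state can live locally at that instance
    return [hash(key_fn(event)) % parallelism]
```

Key-based partitioning is what makes partitioned (per-key) operator state possible.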

[Diagram: three partitioners P routing events to parallel stateful task instances (f, s), partitioned by color/key]

SLIDE 16

Dataflow Pipelines


[Diagram: a dataflow pipeline - sources (stream1, stream2) feed a query Q, which emits approximations, predictions, alerts, … to sinks]

SLIDE 17

Dataflow Programming with Apache Storm


  • Step 1: Implement input operators (Spouts) and intermediate operators (Bolts)
  • Step 2: Construct a Topology by combining operators

[Diagram: Spout → Bolt → Bolt]

Spouts are the topology sources; they listen to data feeds. Bolts represent all intermediate computation vertices of the topology; they do arbitrary data manipulation. Each operator can emit/subscribe to Streams (computation results).

SLIDE 18

Example: Topology Definition

[Code slide: a topology definition wiring a "numbers" spout to a "new_numbers" bolt, whose stream is written out by a "toFile" bolt]

SLIDE 19

Stream Analytics Systems


Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure
Open Source: Flink, Storm, Samza, Spark, Beam

SLIDE 20

Programming Models


Compositional:
  • Physical representations
  • Offer basic building blocks (Operators / Data Exchange)
  • Custom optimisation / tuning

Declarative:
  • Logical representations
  • Operators are transformations on abstract data types
  • Advanced behaviour such as windowing is supported
  • Self-optimisation
SLIDE 21

Programming Abstraction Levels


DStream, DataStream, PCollection…

Compositional:
  • Direct access to the execution graph / topology
  • Suitable for engineers

Declarative:
  • Transformations abstract away operator details
  • Suitable for engineers and data analysts

SLIDE 22

Introducing Apache Flink

[Chart: #unique contributor ids by git commits, Jul 2009 - May 2016]

  • A Top-level project
  • Community-driven open source software development
  • Publicly open to new contributors

SLIDE 23

Native Workload Support

Apache Flink

  • Stream Pipelines
  • Batch Pipelines
  • Scalable Machine Learning
  • Graph Analytics

SLIDE 24


The Apache Flink Stack

APIs: DataStream, DataSet
Execution: Distributed Dataflow, Deployment

  • DataSet: bounded data sources, staged/pipelined execution
  • DataStream: unbounded data sources, pipelined execution
SLIDE 25

The Big Picture

[Diagram: libraries - Graph-Gelly, Table, ML, CEP, SQL, Hadoop M/R compatibility - layered on top of the DataStream and DataSet APIs, which run on the Distributed Dataflow engine and Deployment layer]

SLIDE 26


Basic API Concept

Source → Data Stream → Operator → Data Stream → Sink

Source → Data Set → Operator → Data Set → Sink

Writing a Flink Program:
1. Bootstrap Sources
2. Apply Operators
3. Output to Sinks

SLIDE 27

Data Streams as Abstract Data Types

  • Tasks are distributed and run in a pipelined fashion.
  • State is kept within tasks.
  • Transformations are applied per-record or window.
  • Transformations: map, flatmap, filter, union…
  • Aggregations: reduce, fold, sum
  • Partitioning: forward, broadcast, shuffle, keyBy
  • Sources/Sinks: custom or Kafka, Twitter, Collections…

DataStream

SLIDE 28

Example


textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .sum(1)
  .print()

Input: "live and let live"
Tokens: "live" "and" "let" "live"
Pairs: (live,1) (and,1) (let,1) (live,1)
Running counts: (live,1) (and,1) (let,1) (live,2)

SLIDE 29

Working with Windows

29

Why windows? We are often interested in fresh data!

Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!

1) Sliding windows (overlapping buckets/panes):

myKeyedStream.timeWindow(
    Time.seconds(60), Time.seconds(20));

2) Tumbling windows (disjoint buckets/panes):

myKeyedStream.timeWindow(
    Time.seconds(60));

[Diagram: 60-second sums over an event timeline; sliding windows emit a new sum every 20 seconds, tumbling windows every 60 seconds]
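How an event's timestamp maps to window panes can be sketched in plain Python (an illustration of the semantics, not Flink internals): a tumbling window assigns each event to exactly one pane, while a 60s window sliding every 20s assigns it to size/slide = 3 overlapping panes.

```python
def tumbling_windows(ts, size):
    """The single [start, end) pane containing timestamp ts."""
    start = (ts // size) * size
    return [(start, start + size)]

def sliding_windows(ts, size, slide):
    """All [start, end) panes of the sliding window containing ts."""
    last_start = (ts // slide) * slide
    return [(s, s + size)
            for s in range(last_start, ts - size, -slide)
            if s >= 0]
```

For example, an event at second 65 falls into one tumbling pane (60, 120) but into three sliding panes (60, 120), (40, 100) and (20, 80).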

SLIDE 30

Example


textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()

Counting words over windows: "live and" arrives at 10:48 and is counted in Window (10:45-10:50) as (live,1) (and,1); "let live" arrives at 11:01 and is counted in Window (11:00-11:05) as (let,1) (live,1).

SLIDE 31

Example


[Diagram: pipeline flatMap → map → window → sum → print; per-key counts are kept in window state]

textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1)
  .print()

SLIDE 32

Example


[Diagram: pipeline flatMap → map → window → sum (parallelism 4) → print]

textStream
  .flatMap { _.split("\\W+") }
  .map { (_, 1) }
  .keyBy(0)
  .timeWindow(Time.minutes(5))
  .sum(1).setParallelism(4)
  .print()

SLIDE 33

Making State Explicit


  • Explicitly defined state is durable to failures
  • Flink supports two types of explicit states
  • Operator State - full state
  • Key-Value State - partitioned state per key
  • State Backends: In-memory, RocksDB, HDFS
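The distinction between the two state types can be sketched in plain Python (illustrative, not Flink's state API): operator state is one value for the whole task, key-value state is partitioned per key.

```python
class WordCountTask:
    """A task holding both kinds of explicit state."""

    def __init__(self):
        self.operator_state = 0   # full state: total events seen by this task
        self.keyed_state = {}     # partitioned state: one counter per key

    def on_event(self, word):
        self.operator_state += 1
        self.keyed_state[word] = self.keyed_state.get(word, 0) + 1
        return (word, self.keyed_state[word])
```

Because keyed state is partitioned, it can be redistributed across parallel instances along with the keys themselves.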
SLIDE 34

Fault Tolerance


[Diagram: events flow through the pipeline while snapshots snap-t1 and snap-t2 are taken at times t1 and t2]

State is not lost on failures: when a failure occurs we revert computation and state back to the latest snapshot.
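The recovery idea can be sketched in a few lines of plain Python (a simplification - Flink's actual mechanism uses distributed snapshots with barriers): periodically copy the state; on failure, revert to the snapshot and replay the events that arrived after it.

```python
import copy

class RecoverableCounter:
    def __init__(self):
        self.state = {}
        self.snapshot = {}

    def on_event(self, key):
        self.state[key] = self.state.get(key, 0) + 1

    def take_snapshot(self):
        # durably copy the current state
        self.snapshot = copy.deepcopy(self.state)

    def recover(self, replay):
        # revert to the snapshot, then replay post-snapshot events
        self.state = copy.deepcopy(self.snapshot)
        for key in replay:
            self.on_event(key)
```

After recovery the state is the same as if the failure had never happened, provided the post-snapshot events can be replayed (e.g. from a replayable source such as Kafka).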

SLIDE 35

Performance

  • Twitter Hack Week - Flink as an in-memory data store

Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

SLIDE 36

So how is Flink different from Spark?


Two major differences: 1) Stream Execution 2) Mutable State

SLIDE 37

Flink vs Spark


Spark Streaming:
  • dstream.updateStateByKey(…) puts new states in an output RDD
  • leased resources
  • immutable state

Flink:
  • dedicated resources
  • mutable state
SLIDE 38

What about DataSets?


  • Sophisticated SQL-inspired optimiser
  • Efficient Join Strategies
  • Managed Memory bypasses Garbage Collection
  • Fast, in-memory Iterative Bulk Computations
SLIDE 39

Some Interesting Libraries


SLIDE 40

Detecting Patterns


PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream,
    Pattern
        .begin("seismic").where(evt -> evt.motion.equals("ClassB"))
        .next("tidal").where(evt -> evt.elevation > 500));

DataStream<Alert> result = tsunamiPattern.select(
    pattern -> {
        return getEvacuationAlert(pattern);
    });

CEP Library Example (Java)

SLIDE 41

Mining Graphs with Gelly


  • Iterative Graph Processing
  • Scatter-Gather
  • Gather-Sum-Apply
  • Graph Transformations/Properties
  • Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count etc…

Coming up next: dynamic graph processing support

SLIDE 42

Machine Learning Pipelines


  • Scikit-learn inspired pipelining
  • Supervised: SVM, Linear Regression
  • Preprocessing: Polynomial Features, Scalers
  • Recommendation: ALS
SLIDE 43

Relational Queries


// Ingest a DataStream from an external source
DataStream<Tuple3<Long, String, Integer>> ds = env.addSource(...);

// Register the DataStream as table "Orders"
tableEnv.registerDataStream("Orders", ds, "user, product, amount");

// Run a SQL query on the Table and retrieve the result as a new Table
Table result = tableEnv.sql(
    "SELECT STREAM product, amount FROM Orders WHERE product LIKE '%Rubber%'");

Example Stream SQL on Table API

SLIDE 44

Live Monitoring


SLIDE 45

Coming Soon


  • Stream ML
  • Stream Graph Processing (Gelly-Stream)
  • Autoscaling
  • Incremental Snapshots