
An Intro to Modern Data Stream Analytics EIT Summer School 2016 - PowerPoint PPT Presentation

An Intro to Modern Data Stream Analytics EIT Summer School 2016 Paris Carbone PhD Candidate @ KTH <parisc@kth.se> Committer @ Apache Flink <senorcarbone@apache.org>


  1. An Intro to Modern Data Stream Analytics • EIT Summer School 2016 • Paris Carbone, PhD Candidate @ KTH <parisc@kth.se>, Committer @ Apache Flink <senorcarbone@apache.org>

  2. Motivation • Time-critical problems / Actionable Insights • Stock market predictions • Fraud detection • Network security • Fresh customer recommendations (more like First-World Problems...)

  3. How about Tsunamis?

  4. Deploy Sensors (earth & wave activity) • Collect Data • Analyse Data Regularly • Q = evacuation window

  5. Motivation [figure: the query Q is evaluated repeatedly]

  6. Motivation • Standing Query: Q = evacuation window

  7. The Data Stream Paradigm • Standing queries are evaluated continuously • Input data is unbounded • Queries operate on the full data stream or on the most recent views of the stream (~ windows)

  8. Data Stream Basics • Events/Tuples: elements of computation that respect a schema • Data Streams: unbounded sequences of events • Stream Operators/Tasks: consume and produce data streams • Events are consumed once, no backtracking! • Where are computations stored? [figure: operator f consumes streams S1, S2 and produces S’1, S’2, So]

  9. Synopsis-Task State • We cannot store all events seen indefinitely • Synopsis: a summary of an infinite stream • It is in principle any streaming operator state • Examples: samples, histograms, sketches, state machines… • [figure: task f keeps a synopsis s, a summary of everything seen so far; for each input t it will 1. process t, s 2. update s 3. produce t’]

  10. Synopses-Aggregations • Discussion: Rolling Aggregations • Propose a synopsis, s = ?, when • f = max • f = ArithmeticMean • f = stDev
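
One possible answer to the discussion above, as a minimal sketch rather than the answer given in the talk: a constant-size synopsis that supports max, arithmetic mean and standard deviation. The class and method names (`RollingStats`, `update`, `stdDev`) are made up for illustration, and the mean/variance update is Welford's online algorithm, a standard technique swapped in here under the assumption that the population standard deviation is wanted.

```java
// Hypothetical sketch: a fixed-size synopsis for rolling max, mean and stDev.
final class RollingStats {
    private long count = 0;                        // events seen so far
    private double max = Double.NEGATIVE_INFINITY; // synopsis for f = max
    private double mean = 0.0;                     // synopsis for f = ArithmeticMean
    private double m2 = 0.0;                       // sum of squared deviations from the mean

    /** Process one event and update the synopsis (Welford's online update). */
    void update(double x) {
        count++;
        max = Math.max(max, x);
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    double max() { return max; }

    double mean() { return mean; }

    /** Population standard deviation of everything seen so far. */
    double stdDev() { return count == 0 ? 0.0 : Math.sqrt(m2 / count); }
}
```

The synopsis is four numbers regardless of how many events have been consumed, which is the property the slide asks for.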

  11. Synopses-Approximations • Discussion: Approximate Results • Propose a synopsis, s = ?, when • f = uniform random sample of k records over the whole stream • f = filter distinct records over windows of 1000 records with a 5% error
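
For the first question above, one candidate synopsis is a reservoir of k records. The sketch below is an assumption about what a reasonable answer looks like, not the talk's own solution; `ReservoirSample` and its methods are hypothetical names, and the seed parameter exists only to make runs repeatable. It does not address the second question (approximate distinct filtering).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: uniform random sample of k records over an unbounded stream.
final class ReservoirSample<T> {
    private final int k;
    private final List<T> sample;
    private final Random rng;
    private long seen = 0; // number of records consumed so far

    ReservoirSample(int k, long seed) {
        if (k <= 0) throw new IllegalArgumentException("k must be positive");
        this.k = k;
        this.sample = new ArrayList<>(k);
        this.rng = new Random(seed);
    }

    /** Consume one record; each record seen so far stays sampled with probability k / seen. */
    void update(T record) {
        seen++;
        if (sample.size() < k) {
            sample.add(record);      // the first k records fill the reservoir
            return;
        }
        long j = (long) (rng.nextDouble() * seen); // uniform index in [0, seen)
        if (j < k) sample.set((int) j, record);    // replace a random slot
    }

    List<T> sample() { return new ArrayList<>(sample); }

    long seen() { return seen; }
}
```

The state is k records plus a counter, independent of the stream length.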

  12. Synopses-ML and Graphs • Examples of cool synopses to check out • Sparsifiers/Spanners: approximating graph properties such as shortest paths • Change detectors: detecting concept drift • Incremental decision trees: continuous stream training and classification

  13. Data Stream Basics • Any other problems? • Does this scale? [figure: operator f consumes streams S1, S2 and produces S’1, S’2, So]

  14. Task Parallelism • We need task parallelism: • Data might be too large to process • State can get too large to fit in memory (e.g. graphs) • Data Streams might already be partitioned! (e.g. by key / Kafka partitions) • How do streams get partitioned? [figure: operator f consumes streams S1, S2 and produces S’1, S’2, So]

  15. Task Partitioning • Partitioning defines how we allocate events to each parallel task instance. Typical partitioners are: • Broadcast • Shuffle • Key-based (e.g. by color) [figure: a partitioner P routes the events of stream s to parallel instances of f]
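
The three partitioners above can be sketched as functions that map an event to the index of one or more parallel task instances. This is an illustration only: `Partitioners` and its method names are hypothetical, and real systems use their own hashing and routing schemes rather than a plain `hashCode` modulo.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the three typical partitioners.
final class Partitioners {
    private Partitioners() {}

    /** Broadcast: every parallel instance receives the event. */
    static List<Integer> broadcast(int parallelism) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) targets.add(i);
        return targets;
    }

    /** Shuffle: one instance chosen uniformly at random. */
    static int shuffle(Random rng, int parallelism) {
        return rng.nextInt(parallelism);
    }

    /** Key-based: events with equal keys always reach the same instance. */
    static int byKey(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }
}
```

Key-based partitioning is what makes per-key state possible: all events for one key meet in the same task instance.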

  16. Dataflow Pipelines [figure: a query Q as a dataflow from sources (stream1, stream2, …) to sinks (approximations, predictions, alerts, …)]

  17. Dataflow Programming with Apache Storm • Step 1: Implement input (Spouts) and intermediate operators (Bolts) • Step 2: Construct a Topology by combining operators • Spouts are the topology sources; they listen to data feeds • Bolts represent all intermediate computation vertices of the topology; they do arbitrary data manipulation • Each operator can emit/subscribe to Streams (computation results) [figure: Spout → Bolt → Bolt]

  18. Example: Topology Definition [figure: operators numbers → new_numbers → toFile, connected by the streams numbers and new_numbers]

  19. Stream Analytics Systems • Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure • Open Source: Flink, Samza, Spark, Storm, Beam

  20. Programming Models • Compositional: Physical Representations • Offer basic building blocks (Operators/Data Exchange) • Custom Optimisation/Tuning • Declarative: Logical Representations • Operators are transformations on abstract data types • Advanced behaviour such as windowing is supported • Self-Optimisation

  21. Programming Abstraction Levels • Transformations abstract operator details (DStream, DataStream, PCollection…); suitable for engineers and data analysts • Direct access to the execution graph / topology; suitable for engineers

  22. Introducing Apache Flink • A top-level project • Community-driven open source software development • Publicly open to new contributors [figure: number of unique contributor ids by git commits, July 2009 to May 2016]

  23. Native Workload Support [figure: Apache Flink supporting Scalable Batch Pipelines, Stream Pipelines, Machine Learning and Graph Analytics]

  24. The Apache Flink Stack • APIs: DataSet (Bounded Data Sources, Staged/Pipelined Execution) and DataStream (Unbounded Data Sources, Pipelined Execution) • Execution: Distributed Dataflow • Deployment

  25. The Big Picture • Libraries on DataSet: Hadoop M/R, Table, SQL, ML, Graph-Gelly • Libraries on DataStream: CEP, Table, SQL, ML, Graph-Gelly • Distributed Dataflow • Deployment

  26. Basic API Concept • Source → DataSet → Operator → DataSet → Sink • Source → DataStream → Operator → DataStream → Sink • Writing a Flink Program: 1. Bootstrap Sources 2. Apply Operators 3. Output to Sinks

  27. Data Streams as Abstract Data Types • DataStream • Transformations: map, flatMap, filter, union… • Aggregations: reduce, fold, sum • Partitioning: forward, broadcast, shuffle, keyBy • Sources/Sinks: custom or Kafka, Twitter, Collections… • Tasks are distributed and run in a pipelined fashion. • State is kept within tasks. • Transformations are applied per-record or window.

  28. Example • Input: “live and let live” • textStream.flatMap {_.split("\\W+")} → “live” “and” “let” “live” • .map {(_, 1)} → (live,1) (and,1) (let,1) (live,1) • .keyBy(0).sum(1).print() → (live,1) (and,1) (let,1) (live,2)
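
The per-record behaviour of this example can be imitated in plain Java without any Flink dependency. The sketch below is only an analogy for the flatMap / map / keyBy / sum chain: `RollingWordCount` is a hypothetical class, the `HashMap` stands in for per-key state, and skipping empty tokens is an addition that the slide's code does not make.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-task imitation of flatMap -> map -> keyBy -> sum.
final class RollingWordCount {
    private final Map<String, Integer> counts = new HashMap<>(); // state kept per key

    /** Processes one line and returns the (word,count) records emitted for it. */
    List<String> process(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.split("\\W+")) {               // flatMap: line -> words
            if (word.isEmpty()) continue;                       // assumption: drop empty tokens
            int updated = counts.merge(word, 1, Integer::sum);  // (word,1) summed per key
            out.add("(" + word + "," + updated + ")");          // one output per input record
        }
        return out;
    }
}
```

Calling `process("live and let live")` on a fresh instance yields the four records listed on the slide, ending with (live,2), because a rolling sum emits an updated count for every incoming record.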

  29. Working with Windows • Why windows? We are often interested in fresh data! • 1) Sliding windows: myKeyedStream.timeWindow(Time.seconds(60), Time.seconds(20)); • 2) Tumbling windows: myKeyedStream.timeWindow(Time.seconds(60)); • [figure: events 15, 38, 65, 88, 110, 120 on a timeline from 0 to 120 seconds, grouped into window buckets/panes with one SUM per window] • Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!
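
To make the difference between the two window types concrete, the sketch below computes which windows a timestamp falls into. It is an illustration under stated assumptions, not Flink's implementation: `WindowAssigner` and its methods are hypothetical, windows are taken to be aligned to time 0, and each window covers `[start, start + size)`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: assigning a timestamp to tumbling and sliding windows.
final class WindowAssigner {
    private WindowAssigner() {}

    /** Start of the single tumbling window of length size that contains ts. */
    static long tumblingStart(long ts, long size) {
        return ts - Math.floorMod(ts, size);
    }

    /** Starts of all sliding windows (length size, step slide) that contain ts, newest first. */
    static List<Long> slidingStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts, slide); // latest window that has begun
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }
}
```

With a 60 second window sliding every 20 seconds, an event at second 65 belongs to three overlapping windows (starting at 60, 40 and 20), whereas a 60 second tumbling window places it in exactly one (starting at 60).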

  30. Example: counting words over windows • Input: “live and” at 10:48, “let live” at 11:01 • textStream.flatMap {_.split("\\W+")}.map {(_, 1)}.keyBy(0).timeWindow(Time.minutes(5)).sum(1).print() • Window (10:45-10:50): (live,1) (and,1) • Window (11:00-11:05): (let,1) (live,1)
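
The windowed result above can be reproduced with a small plain-Java imitation. This is a sketch under simplifying assumptions: `WindowedWordCount` is a hypothetical class, timestamps are minutes since midnight, and counts are simply accumulated per window rather than emitted when a window closes as a stream processor would do.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: word counts per tumbling window of windowMinutes minutes.
final class WindowedWordCount {
    private final long windowMinutes;
    // window start (minutes since midnight) -> word -> count
    private final Map<Long, Map<String, Integer>> windows = new TreeMap<>();

    WindowedWordCount(long windowMinutes) {
        this.windowMinutes = windowMinutes;
    }

    /** Adds the words of one line, timestamped hour:minute, to the window containing it. */
    void process(String line, int hour, int minute) {
        long ts = hour * 60L + minute;
        long start = ts - ts % windowMinutes;   // tumbling window aligned to midnight
        Map<String, Integer> counts =
                windows.computeIfAbsent(start, s -> new LinkedHashMap<>());
        for (String word : line.split("\\W+")) {
            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        }
    }

    Map<Long, Map<String, Integer>> results() {
        return windows;
    }
}
```

Feeding “live and” at 10:48 and “let live” at 11:01 with a 5 minute window produces two windows, starting at minute 645 (10:45) and minute 660 (11:00), each holding a count of 1 for its two words.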

  31. Example • textStream.flatMap {_.split("\\W+")}.map {(_, 1)}.keyBy(0).timeWindow(Time.minutes(5)).sum(1).print() • [figure: dataflow flatMap → map → window sum → print; the window sum task is where counts are kept in state]

  32. Example • textStream.flatMap {_.split("\\W+")}.map {(_, 1)}.keyBy(0).timeWindow(Time.minutes(5)).sum(1).setParallelism(4).print() • [figure: dataflow flatMap → map → window sum → print]

  33. Making State Explicit • Explicitly defined state is durable to failures • Flink supports two types of explicit state • Operator State: full state • Key-Value State: partitioned state per key • State Backends: In-memory, RocksDB, HDFS

  34. Fault Tolerance • State is not affected by failures • When failures occur we revert computation and state back to a snapshot • [figure: snapshots snap-t1 and snap-t2 taken at times t1 and t2 along the event stream]
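
The idea of reverting computation and state to a snapshot can be shown with a single-task toy. This is not Flink's distributed snapshotting mechanism; it is a minimal sketch that assumes a replayable input (a list indexed by offset), and `SnapshottingCounter` with its methods is a hypothetical name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a counting task that can roll back to its last snapshot.
final class SnapshottingCounter {
    private Map<String, Integer> counts = new HashMap<>();
    private int position = 0;                         // next input offset to read
    private Map<String, Integer> snapCounts = new HashMap<>();
    private int snapPosition = 0;

    /** Processes events from the current position up to (excluding) upTo. */
    void run(List<String> events, int upTo) {
        while (position < upTo) {
            counts.merge(events.get(position), 1, Integer::sum);
            position++;
        }
    }

    /** Records the state together with the input position it corresponds to. */
    void snapshot() {
        snapCounts = new HashMap<>(counts);
        snapPosition = position;
    }

    /** After a failure: revert both the state and the input position. */
    void restore() {
        counts = new HashMap<>(snapCounts);
        position = snapPosition;
    }

    int count(String key) { return counts.getOrDefault(key, 0); }

    int position() { return position; }
}
```

Because the state and the input position are saved together, replaying from the restored position gives the same counts as a run without failures.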

  35. Performance • Twitter Hack Week: Flink as an in-memory data store • Jamie Grier: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

  36. So how is Flink different from Spark? • Two major differences: 1) Stream Execution 2) Mutable State

  37. Flink vs Spark • Flink: dedicated resources, mutable state • Spark Streaming: leased resources, immutable state; dstream.updateStateByKey(…) puts new states in the output RDD [figure: state S updated in place vs. input state S turned into a new state S’]

  38. What about DataSets? • Sophisticated SQL-inspired optimiser • Efficient Join Strategies • Managed Memory bypasses Garbage Collection • Fast, in-memory Iterative Bulk Computations

  39. Some Interesting Libraries

  40. Detecting Patterns • CEP Library Example (Java) • PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream, Pattern.begin("seismic").where(evt -> evt.motion.equals("ClassB")).next("tidal").where(evt -> evt.elevation > 500)); • DataStream<Alert> result = tsunamiPattern.select(pattern -> { return getEvacuationAlert(pattern); });
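
What the pattern above expresses (a "seismic" event directly followed by a "tidal" event) can be hand-rolled as a tiny state machine. The sketch below does not use the Flink CEP library; `TsunamiPattern` and its nested `Event` type are hypothetical, with only the two fields the slide's predicates read, and it assumes that `next` means the second event must immediately follow the first.

```java
// Hypothetical sketch: a hand-written matcher for "seismic" directly followed by "tidal".
final class TsunamiPattern {
    /** Assumed event shape: only the two fields used by the slide's predicates. */
    static final class Event {
        final String motion;
        final int elevation;

        Event(String motion, int elevation) {
            this.motion = motion;
            this.elevation = elevation;
        }
    }

    private Event seismic = null; // partial match: the previous event, if it was "seismic"

    /** Feeds one event; returns true when it completes the two-step pattern. */
    boolean onEvent(Event evt) {
        boolean matched = seismic != null && evt.elevation > 500;
        // The current event can itself start a new partial match.
        seismic = "ClassB".equals(evt.motion) ? evt : null;
        return matched;
    }
}
```

The single field `seismic` is the synopsis of this operator: it remembers just enough of the past to decide whether the next event raises an alert.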

  41. Mining Graphs with Gelly • Iterative Graph Processing • Scatter-Gather • Gather-Sum-Apply • Graph Transformations/Properties • Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count etc… • Coming up next: Dynamic graph processing support

  42. Machine Learning Pipelines • Scikit-learn inspired pipelining • Supervised: SVM, Linear Regression • Preprocessing: Polynomial Features, Scalers • Recommendation: ALS
