Big-Data Processing III (Stream Processing)

SLIDE 1

CNV/CC&V MEIC-A/MEIC-T/METI Computação em Nuvem e Virtualização

  • Prof. Luís Veiga

IST / INESC-ID Lisboa

Big-Data Processing III (Stream Processing)

https://fenix.tecnico.ulisboa.pt/disciplinas/AVExe7/2019-2020/2-semestre/

LV, JG 2015-20, sources Spark, Flink

SLIDE 2

Agenda

  • Spark
      • overview, RDDs
      • programming model, examples
      • RDD operations
      • fault tolerance, performance

  • Spark Streaming
      • overview, discretized stream processing
      • windows, sliding windows, micro-batching

  • Flink
      • overview, windowing
      • tumbling windows, sliding windows, custom windows
      • time-based windows, watermarks
      • state management, versioning
      • fault tolerance, distributed snapshots, execution semantics

SLIDE 3

Spark

SLIDE 4

Motivation

Current popular programming models for clusters transform data flowing from stable storage to stable storage, e.g., MapReduce:

[Diagram: Input → Map / Map / Map → Reduce / Reduce → Output]

SLIDE 5

Motivation

Current popular programming models for clusters transform data flowing from stable storage to stable storage, e.g., MapReduce:

[Diagram: Input → Map / Map / Map → Reduce / Reduce → Output]

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures

SLIDE 6

Motivation

  • Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:
      • iterative algorithms (many in machine learning)
      • interactive data mining tools (R, Excel, Python)

  • Spark makes working sets a first-class concept to efficiently support these apps

SLIDE 7

Spark Goal

  • Provide distributed memory abstractions for clusters to support apps with working sets

  • Retain the attractive properties of MapReduce:
      • fault tolerance (for crashes & stragglers)
      • data locality
      • scalability

Solution: augment the data flow model with “resilient distributed datasets” (RDDs)

SLIDE 8

Generality of RDDs

  • Spark’s combination of data flow with RDDs unifies many proposed cluster programming models:
      • general data flow models: MapReduce, Dryad, SQL
      • specialized models for stateful apps: Pregel (Bulk Synchronous Processing), HaLoop (iterative MR), Continuous Bulk Processing

  • Instead of specialized APIs for one type of app, give users first-class control of the distributed datasets

SLIDE 9

Programming Model

  • Resilient distributed datasets (RDDs)
      • immutable collections partitioned across a cluster that can be rebuilt if a partition is lost
      • created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
      • can be cached across parallel operations

  • Parallel operations on RDDs
      • reduce, collect, count, save, …

SLIDE 10

Example: Log Mining

  • Load error messages from a log into memory, then interactively search for various patterns

    lines = spark.textFile("hdfs://...")          // base RDD
    errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
    messages = errors.map(_.split('\t')(2))
    cachedMsgs = messages.cache()                 // cached RDD

    cachedMsgs.filter(_.contains("foo")).count    // parallel operation
    cachedMsgs.filter(_.contains("bar")).count
    . . .

[Diagram: the Driver ships tasks to Workers; each Worker reads its input block (Block 1-3) and keeps a cache (Cache 1-3); results are returned to the Driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

SLIDE 11

RDDs in More Detail

  • An RDD is an immutable, partitioned, logical collection of records
      • it need not be materialized,
      • but rather contains enough information to allow rebuilding the dataset from stable storage

  • Partitioning can be based on a key in each record (using hash or range partitioning)
  • Built using bulk transformations on other RDDs
  • Can be cached for future reuse

SLIDE 12

RDD Operations

Transformations (define a new RDD):
    map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Parallel actions/operations (return a result to the driver):
    reduce, collect, count, countByKey, save, …
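A minimal sketch of how these compose (assuming spark is the SparkContext handle used in these slides and an illustrative HDFS path): transformations only define new RDDs lazily, while actions trigger the actual computation and return a result to the driver.

    val lines  = spark.textFile("hdfs://.../input")     // base RDD (lazy)
    val errors = lines.filter(_.contains("ERROR"))      // transformation: defines a new RDD, nothing runs yet
    val cached = errors.cache()                         // mark the RDD for in-memory reuse
    val total  = cached.count()                         // action: runs the job, returns a number to the driver
    val sample = cached.take(10)                        // another action, reusing the cached partitions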

SLIDE 13

RDD Fault Tolerance

  • RDDs maintain lineage information that can be used to reconstruct lost partitions
      • i.e., track data dependencies in the data flow

  • Ex:
    cachedMsgs = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))
                              .cache()

[Lineage diagram: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD]
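As a hedged illustration (same spark context handle, placeholder path), the lineage sketched above can be printed for any RDD with toDebugString:

    val cachedMsgs = spark.textFile("hdfs://.../logs")
                          .filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()
    println(cachedMsgs.toDebugString)   // prints the chain of parent RDDs, i.e., the lineage used for recovery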

SLIDE 14

Benefits of RDD Model

  • Consistency is easy due to immutability
  • Inexpensive fault tolerance
      • log lineage dependency information rather than replicating/checkpointing data
  • Locality-aware scheduling of tasks on partitions
  • Despite being restricted (not as expressive as queries), the model seems applicable to a broad variety of applications

SLIDE 15

Example: Logistic Regression

  • Goal: find the best line separating two sets of points

[Figure: two sets of points with a random initial line converging towards the target separating line]

SLIDE 16

Logistic Regression Code

    val data = spark.textFile(...).map(readPoint).cache()
    var w = Vector.random(D)
    for (i <- 1 to ITERATIONS) {
      val gradient = data.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final w: " + w)

SLIDE 17

Logistic Regression Performance

[Chart: 127 s / iteration without in-memory reuse; with caching, first iteration 174 s, further iterations 6 s]

SLIDE 18

Example: MapReduce

  • MapReduce data flow can be expressed using RDD transformations

    res = data.flatMap(rec => myMapFunc(rec))
              .groupByKey()
              .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

    res = data.flatMap(rec => myMapFunc(rec))
              .reduceByKey(myCombiner)
              .map((key, val) => myReduceFunc(key, val))

SLIDE 19

Word Count in Spark

    val lines = spark.textFile("hdfs://...")
    val counts = lines.flatMap(_.split("\\s"))
                      .map(word => (word, 1))   // pair each word with 1 so reduceByKey can sum per word
                      .reduceByKey(_ + _)
    counts.save("hdfs://...")

SLIDE 20

Spark Streaming

SLIDE 21

Traditional data processing

[Diagram: Web servers produce logs; periodic (custom) or continuous ingestion into HDFS / S3; batch job(s) for log analysis, run periodically (e.g., every 2 hrs via a job scheduler such as Oozie), update the serving layer]

  • E.g., log analysis using a batch processor
  • Latency from log event to serving layer is usually in the range of hours

SLIDE 22

Log event analysis using stream processor

[Diagram: Web servers forward events immediately to a high-throughput publish/subscribe bus; a stream processor processes events in real time and updates the serving layer]

  • Stream processors allow events to be analyzed with sub-second latency

SLIDE 23

Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs

[Diagram: Spark Streaming chops the live data stream into batches of X seconds; Spark processes each batch and returns the processed results]

  • Chop up the live stream into micro-batches of X seconds
  • Spark treats each batch of data as RDDs and processes them using RDD operations
  • Finally, the processed results of the RDD operations are returned in batches
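A rough sketch of this micro-batching setup in code (the socket source, host/port and the 1-second batch interval are illustrative assumptions, not taken from the slides):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DStreamSketch")
    val ssc  = new StreamingContext(conf, Seconds(1))      // chop the live stream into 1-second batches
    val lines  = ssc.socketTextStream("localhost", 9999)   // live data stream
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)                  // RDD-style operations applied to each batch
    counts.print()                                         // processed results, emitted batch by batch
    ssc.start()
    ssc.awaitTermination()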

SLIDE 24

Discretized Stream Processing

Run a streaming computation as a series of very small, deterministic batch jobs

[Diagram: Spark Streaming chops the live data stream into batches of X seconds; Spark processes each batch and returns the processed results]

  • Batch sizes as low as ½ second, latency ~ 1 second (micro-batching)
  • Potential for limited combination of batch processing and stream processing in the same system

SLIDE 25

Example – Count hashtags

    val tweets    = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags  = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.countByValue()

[Diagram: per batch (@ t, t+1, t+2): tweets → flatMap → hashTags → map / reduceByKey → tagCounts]

    [(#cat, 10), (#dog, 25), ... ]

SLIDE 26

Example – Count hashtags over last 10 mins

    val tweets    = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags  = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

Sliding window operation: window length = Minutes(10), sliding interval = Seconds(1)

SLIDE 27

Example – Count hashtags over last 10 mins

    val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

[Diagram: hashTags batches at t-1, t, t+1, t+2, t+3; the sliding window spans several batches and countByValue counts over all the data in the window, producing tagCounts]

SLIDE 28

Fault-tolerance

  • RDDs remember the sequence (dataflow) of operations that created them from the original fault-tolerant input data
  • Batches of input data are replicated in the memory of multiple worker nodes, and are therefore fault-tolerant
  • Data lost due to worker failure can be recomputed from the input data

[Diagram: tweets RDD (input data replicated in memory) → flatMap → hashTags RDD; lost partitions are recomputed on other workers]

SLIDE 29

Flink

SLIDE 30

Apache Flink

  • Apache Flink is an open-source stream processing framework
      • low latency
      • high throughput
      • stateful
      • distributed

  • Developed at the Apache Software Foundation
  • Used in production

SLIDE 31

Real-world data is produced in a continuous fashion. Systems like Flink embrace the streaming nature of data.

Apache Kafka: reliable message queue / feed broker

[Diagram: Web server → Kafka topic → Stream processor (Apache Flink)]

SLIDE 32

Overview of Flink Architecture

[Figure: overview of the Flink architecture]

SLIDE 33

What do we need for replacing the “batch stack”?

[Diagram: Web servers forward events immediately to a high-throughput publish/subscribe bus (options: Apache Kafka, Amazon Kinesis, MapR Streams, Google Cloud Pub/Sub); a stream processor (options: Apache Flink, Google Cloud Dataflow) processes events in real time and updates the serving layer]

Low latency / high throughput
  • pipelined runtime
  • incremental snapshots

State handling
  • managed operator state
  • external state
  • savepoints

Windowing / out-of-order events
  • windows
  • event time
  • watermarks

Fault tolerance and correctness
  • exactly-once semantics
  • async distributed snapshots

SLIDE 34

Building windows from a stream

  • “Number of visitors in the last 5 minutes per country”

[Diagram: Web server → message topic → Stream processor]

    // create stream from Kafka source
    DataStream<LogEvent> stream = env.addSource(new KafkaConsumer());
    // group by country
    KeyedStream<LogEvent, Tuple> keyedStream = stream.keyBy("country");
    // window of size 5 minutes
    keyedStream.timeWindow(Time.minutes(5))
        // do operations per window
        .apply(new CountPerWindowFunction());

SLIDE 35

Building windows: Execution

    // window of size 5 minutes
    keyedStream.timeWindow(Time.minutes(5));

[Diagram: job plan with a Kafka Source feeding a Window Operator (grouped by country); parallel execution on the cluster with several source (S) and window (W) subtasks over time]

SLIDE 36

Window types in Flink

  • Tumbling windows
  • Sliding windows
  • Count windows: based on number of events
  • Time-based windows: based on time (event, ingestion)
  • Custom windows with window assigners, triggers and evictors
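A hedged sketch of how these window types might be declared with Flink’s Scala API (processing-time assigners keep the example self-contained; the keyed (country, count) stream is an illustrative assumption):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.assigners.{TumblingProcessingTimeWindows, SlidingProcessingTimeWindows}

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val visits: DataStream[(String, Int)] = env.fromElements(("UK", 1), ("PT", 1))

    // tumbling window: fixed-size, non-overlapping
    visits.keyBy(_._1)
          .window(TumblingProcessingTimeWindows.of(Time.minutes(5)))
          .sum(1)

    // sliding window: 10-minute length, evaluated every minute
    visits.keyBy(_._1)
          .window(SlidingProcessingTimeWindows.of(Time.minutes(10), Time.minutes(1)))
          .sum(1)

    // count window: fires based on the number of events per key
    visits.keyBy(_._1)
          .countWindow(100)
          .sum(1)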

SLIDE 37

Time-based windows

[Diagram: a stream of events over time; example event data: { "accessTime": "1457002134", "userId": "1337", "userLocation": "UK" }]

  • Three notions of time considered in Flink:
      • Processing time: wall-clock time when the window is being built and processed
      • Ingestion time: time when the event entered Flink, at the beginning of the dataflow (possible network delays)
      • Event time: out-of-order handling of the time when the event was truly created, e.g., a sensor emitting data

SLIDE 38

Time: Low Watermarks

  • Periodically, low watermarks are sent through the system
      • to indicate the true progression of event time
      • bound the timestamps of upstream out-of-order events
      • allow avoiding waiting forever for delayed events

[Diagram: a stream of event timestamps (33, 11, 28, 21, 15, 9, 8) with a watermark of 5; the system/application can guarantee that no event with time <= 5 will arrive afterwards; the window is evaluated when watermarks arrive]
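A hedged sketch of wiring event time and watermarks with Flink’s (1.x style) Scala API; the LogEvent case class, its accessTime field in seconds, and the 10-second out-of-orderness bound are illustrative assumptions:

    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
    import org.apache.flink.streaming.api.windowing.time.Time

    case class LogEvent(accessTime: Long, userId: String, userLocation: String)

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val events: DataStream[LogEvent] = env.fromElements(LogEvent(1457002134L, "1337", "UK"))

    // timestamps come from the event itself; watermarks trail the largest timestamp
    // seen by 10 s, bounding how long windows wait for out-of-order events
    val withEventTime = events.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[LogEvent](Time.seconds(10)) {
        override def extractTimestamp(e: LogEvent): Long = e.accessTime * 1000  // seconds -> ms
      })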

SLIDE 39

Time: Low Watermarks

[Diagram: an operator with two inputs carrying watermarks 3 and 5 forwards the lower watermark, 3]

  • Operators with multiple inputs always forward the lowest watermark
  • Conservative approach: no events are expected from any input older than the lowest watermark

Earlier work: “MillWheel: Fault-Tolerant Stream Processing at Internet Scale” by T. Akidau et al.

SLIDE 40

State: Managed state in Flink

  • Windows can have internal keyed processing state
      • e.g. counters, max, min of all tuples for each key
  • Flink automatically backs up and restores state
  • State can be larger than the available memory
  • State backends: (embedded) RocksDB, heap memory, custom

[Diagram: Web server → Kafka → operator with windows (large state); a local state backend performs periodic backup to, and recovery from, a distributed file system]
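A hedged sketch of selecting the RocksDB backend with periodic backups (Flink 1.x style APIs; the checkpoint URI and interval are illustrative assumptions):

    // requires the flink-statebackend-rocksdb dependency
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // keep operator/window state in embedded RocksDB (can exceed memory),
    // with periodic backups written to a distributed file system
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))
    env.enableCheckpointing(10000)   // back up state every 10 s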

SLIDE 41

Managing the state

  • How can we operate such a pipeline 24x7?
  • Losing state (by stopping the system) would require a replay of past events
  • We need a way to store the state somewhere!

[Diagram: Web server → Kafka topic → Stream processor]

SLIDE 42

Savepoints: Versioning state

  • Savepoint:
      • create an addressable copy of a job’s current state
      • restart a job from any savepoint

    > flink savepoint <JobID>
    hdfs:///flink-savepoints/2

    > flink run -s hdfs:///flink-savepoints/2 <jar>

(Savepoints are stored in HDFS.)

SLIDE 43

Fault tolerance & correctness

  • How do we ensure results are always correct?
      • e.g., number of visitors
      • e.g., matching advert clicks and page visits
      • wrong results may mean lost revenue or penalties

  • Failures should not lead to data loss or to incorrect results because of duplicates

[Diagram: Web server → Kafka topic → Stream processor]

SLIDE 44

Fault tolerance & correctness

  • Execution semantics:
      • At least once: ensure all operators see all events
          • Storm (another stream processor): replay the stream in the failure case (acking of individual records)
      • Exactly once: ensure that operators do not perform duplicate updates to their state
          • Flink: distributed snapshots
          • Spark: micro-batches of RDDs on a batch runtime
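A hedged sketch of selecting Flink’s exactly-once semantics (the checkpoint interval is an illustrative assumption); the mode governs the barrier-based snapshots described on the next slides:

    import org.apache.flink.streaming.api.CheckpointingMode
    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE)  // distributed snapshots every 10 s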

SLIDE 45

Async Distributed Snapshots

  • Lightweight approach for storing the state of all operators without pausing the execution
  • Implemented using barriers flowing through the DAG topology being executed

[Diagram: a barrier flowing within the data stream; records before the barrier are part of the snapshot, records after the barrier are not in the snapshot]

Reference work: “Lightweight Asynchronous Snapshots for Distributed Dataflows” by Carbone et al.

SLIDE 46

Async Distributed Snapshots

  • A distributed snapshot is a consistent snapshot of:
      • application (window) state
      • the position (cursor) in the input stream(s)
  • Optimized with incremental snapshots
      • only differences between snapshots are recorded

[Diagram: a barrier flowing within the data stream; records before the barrier are part of the snapshot, records after the barrier are not in the snapshot]

Reference work: “Lightweight Asynchronous Snapshots for Distributed Dataflows” by Carbone et al.

SLIDE 47

Summary

  • Spark
      • overview, RDDs
      • programming model, examples
      • RDD operations
      • fault tolerance, performance

  • Spark Streaming
      • overview, discretized stream processing
      • windows, sliding windows, micro-batching

  • Flink
      • overview, windowing
      • tumbling windows, sliding windows, custom windows
      • time-based windows, watermarks
      • state management, versioning
      • fault tolerance, distributed snapshots, execution semantics