Data-Intensive Distributed Computing CS 431/631 451/651 (Fall 2020)
Part 9: Real-Time Data Analytics (1/2)


  1. Data-Intensive Distributed Computing CS 431/631 451/651 (Fall 2020)
     Part 9: Real-Time Data Analytics (1/2)
     Ali Abedi
     These slides are available at https://www.student.cs.uwaterloo.ca/~cs451
     This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

  2. Structure of the Course
     - "Core" framework features and algorithm design for batch processing
     - Analyzing Text
     - Analyzing Relational Data
     - Analyzing Graphs
     - Data Mining and Machine Learning
     - What's beyond batch processing?

  3. Use Cases Across Industries
     - Transportation: dynamic re-routing of traffic; dynamic optimization of the vehicle fleet.
     - Retail: dynamic inventory management; real-time in-store offers and recommendations.
     - Credit: identify fraudulent transactions as soon as they occur.
     - Consumer Internet & Mobile: optimize user engagement based on the user's current behavior.
     - Healthcare: continuously monitor patient vital stats and proactively identify at-risk patients.
     - Manufacturing: identify equipment failures and react instantly; perform proactive maintenance.
     - Surveillance: identify threats and intrusions in real time.
     - Digital Advertising & Marketing: optimize and personalize content based on real-time information.

  4. Canonical Stream Processing Architecture
     Data sources → data ingest (Kafka, Flume) → stream processing engine → Kafka → App 1, App 2, ...
     Results are also written to storage such as HDFS and HBase.

  5. What is a data stream?
     A sequence of items that is:
     - Structured (e.g., tuples)
     - Ordered (implicitly or timestamped)
     - Arriving continuously at high volumes
     - Sometimes not possible to store entirely
     - Sometimes not possible to even examine all items

  6. What exactly do you do?
     "Standard" relational operations:
     - Select
     - Project
     - Transform (i.e., apply a custom UDF)
     - Group by
     - Join
     - Aggregations
     What else do you need to make this "work"?

  7. Issues of Semantics
     - Group by, then aggregate: when do you stop grouping and start aggregating?
     - Joining a stream and a static source: a simple lookup.
     - Joining two streams: how long do you wait for the join key in the other stream?
     - Joining two streams, then group by and aggregation: when do you stop joining?
     What's the solution?
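To make the stream-stream join question concrete, here is a minimal sketch (not from the slides; the function and parameter names are illustrative) of a symmetric hash join over two keyed streams, where each side waits at most `max_wait` time units for the matching key to appear on the other stream:

```python
def windowed_join(events, max_wait):
    """events: time-ordered tuples (t, side, key, value), side in {'L', 'R'}.
    Emits (key, left_value, right_value) when matching keys arrive on both
    streams within max_wait time units of each other."""
    buffers = {"L": {}, "R": {}}   # side -> key -> list of (arrival_time, value)
    out = []
    for t, side, key, value in events:
        other = "R" if side == "L" else "L"
        # Expire buffered items we have waited on for longer than max_wait.
        for buf in buffers.values():
            for k in list(buf):
                buf[k] = [(t0, v) for t0, v in buf[k] if t - t0 <= max_wait]
                if not buf[k]:
                    del buf[k]
        # Probe the other side's buffer for the join key.
        for t0, v in buffers[other].get(key, []):
            out.append((key, v, value) if side == "R" else (key, value, v))
        # Buffer this item so the other stream can match it later.
        buffers[side].setdefault(key, []).append((t, value))
    return out
```

Here the answer to "how long do you wait?" is made explicit as `max_wait`: buffered items past that age are simply dropped, bounding both latency and memory.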

  8. Windows
     Windows restrict processing scope:
     - Windows based on ordering attributes (e.g., time)
     - Windows based on item (record) counts
     - Windows based on explicit markers (e.g., punctuations)

  9. Windows on Ordering Attributes
     Assumes the existence of an attribute that defines the order of stream elements (e.g., time).
     Let T be the window size in units of the ordering attribute.
     - Sliding window: overlapping windows, each spanning t_i to t_i' with t_i' − t_i = T.
     - Tumbling window: consecutive non-overlapping windows with t_{i+1} − t_i = T.
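These definitions can be sketched directly from the ordering attribute (illustrative Python, not from the slides; window bounds are half-open `[start, end)` by assumption):

```python
def tumbling_window(t, T):
    """Return the [start, end) bounds of the tumbling window containing time t."""
    start = (t // T) * T
    return (start, start + T)

def sliding_windows(t, T, slide):
    """Return the [start, end) bounds of every sliding window (length T,
    advancing by `slide`) that contains time t, newest first."""
    windows = []
    start = (t // slide) * slide    # latest window start at or before t
    while start > t - T:            # window [start, start+T) still covers t
        windows.append((start, start + T))
        start -= slide
    return windows
```

Note that with `slide == T` a sliding window degenerates into a tumbling window: each item falls in exactly one window.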

  10. Windows on Counts
     Window of size N elements (sliding or tumbling) over the stream.
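A count-based sliding window is just a bounded buffer over the stream; a minimal sketch (illustrative, sliding by one item per arrival):

```python
from collections import deque

def count_windows(stream, n):
    """Yield each full sliding window of the last n items from the stream."""
    window = deque(maxlen=n)      # deque drops the oldest item automatically
    for item in stream:
        window.append(item)
        if len(window) == n:      # emit only once the window is full
            yield list(window)
```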

  11. Windows from "Punctuations"
     Application-inserted "end-of-processing" markers.
     Example: in a stream of actions, "end of user session".
     Properties:
     - Advantage: application-controlled semantics
     - Disadvantage: unpredictable window size (too large or too small)
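Punctuation-based windowing can be sketched as follows (illustrative Python; the punctuation predicate is application-supplied, which is exactly where the "application-controlled semantics" come from):

```python
def punctuated_windows(stream, is_punctuation):
    """Yield one window per punctuation marker; the marker itself is consumed.
    Window sizes are whatever the application's markers make them."""
    window = []
    for item in stream:
        if is_punctuation(item):
            yield window
            window = []
        else:
            window.append(item)
```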

  12. Stream Processing Challenges
     Inherent challenges:
     - Latency requirements
     - Space bounds
     System challenges:
     - Bursty behavior and load balancing
     - Out-of-order message delivery and non-determinism
     - Consistency semantics (at most once, exactly once, at least once)

  13. Producer/Consumers
     Producer → Consumer
     How do consumers get data from producers?

  14. Producer/Consumers
     Producer pushes to the consumer (e.g., via callback).

  15. Producer/Consumers
     Consumer pulls from the producer (e.g., poll, tail).

  16. Producer/Consumers
     Multiple producers, multiple consumers.

  17. Producer/Consumers
     Producers and consumers communicate through a broker (queue, pub/sub).

  18. Producer/Consumers
     (Broker diagram: producers push to the broker; consumers pull from it.)
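A minimal in-memory sketch of the broker pattern (illustrative only; a real broker such as Kafka persists messages, partitions topics, and tracks consumer offsets). Each subscriber gets its own queue, producers publish to a topic, and consumers pull at their own pace:

```python
class Broker:
    """Toy pub/sub broker: every subscriber receives its own copy of
    each message published to a topic it subscribed to."""

    def __init__(self):
        self.queues = {}   # (topic, consumer) -> list of pending messages

    def subscribe(self, topic, consumer):
        self.queues[(topic, consumer)] = []

    def publish(self, topic, message):
        # Fan the message out to every subscriber of this topic.
        for (t, _), queue in self.queues.items():
            if t == topic:
                queue.append(message)

    def poll(self, topic, consumer):
        # Consumer-driven pull; returns None when nothing is pending.
        queue = self.queues[(topic, consumer)]
        return queue.pop(0) if queue else None
```

The broker decouples the two sides: producers never block on slow consumers, and consumers can pull at their own rate, which is the point of the queue/pub-sub slides above.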

  19. Stream Processing Frameworks • Apache Spark Streaming • Apache Storm • Apache Flink 20

  20. (Slide contains no extractable text.)

  21. Spark Streaming: Discretized Streams
     Run a streaming computation as a series of very small, deterministic batch jobs:
     - Chop up the live data stream into batches of X seconds
     - Process each batch as an RDD
     - Return results in batches
     Source: all following Spark Streaming slides by Tathagata Das.
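The discretization idea can be sketched in plain Python (illustrative, not Spark's implementation; batch boundaries here are by item count, standing in for "every X seconds"):

```python
def discretize(stream, batch_size):
    """Chop a stream into consecutive batches, mimicking how Spark Streaming
    turns a live stream into a sequence of RDDs."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # flush the final partial batch
        yield batch

def process_stream(stream, batch_size, batch_fn):
    """Run the same deterministic batch job on every batch."""
    return [batch_fn(b) for b in discretize(stream, batch_size)]
```

Because every batch is processed by the same deterministic function, replaying a batch after a failure yields the same result, which is the basis of the fault-tolerance story later in the deck.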

  22. (Slide contains no extractable text.)

  23. Example: Get hashtags from Twitter
     val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
     DStream: a sequence of RDDs representing a stream of data.
     Each batch (batch @ t, batch @ t+1, ...) from the Twitter Streaming API is stored in memory as an RDD (immutable, distributed).

  24. Example: Get hashtags from Twitter
     val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
     val hashTags = tweets.flatMap(status => getTags(status))
     Transformation: modify data in one DStream to create another DStream.
     flatMap runs on every batch of the tweets DStream, creating a new RDD (e.g., [#cat, #dog, ...]) in the hashTags DStream for every batch.

  25. Example: Get hashtags from Twitter
     val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
     val hashTags = tweets.flatMap(status => getTags(status))
     hashTags.saveAsHadoopFiles("hdfs://...")
     Output operation: push data to external storage. Every batch is saved to HDFS.

  26. Fault Tolerance
     Bottom line: they're just RDDs!

  27. Fault Tolerance
     Bottom line: they're just RDDs!
     The tweets input data RDD is replicated in memory; lost partitions of the hashTags RDD are recomputed (via flatMap) on other workers.

  28. Key Concepts
     - DStream: sequence of RDDs representing a stream of data (Twitter, HDFS, Kafka, Flume, TCP sockets)
     - Transformations: modify data from one DStream to another
       - Standard RDD operations: map, countByValue, reduce, join, ...
       - Stateful operations: window, countByValueAndWindow, ...
     - Output operations: send data to an external entity
       - saveAsHadoopFiles: saves to HDFS
       - foreach: do anything with each batch of results

  29. Example: Count the hashtags
     val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
     val hashTags = tweets.flatMap(status => getTags(status))
     val tagCounts = hashTags.countByValue()
     For each batch, flatMap produces hashTags, then a map and reduceByKey compute tagCounts such as [(#cat, 10), (#dog, 25), ...].

  30. Example: Count the hashtags over the last 10 mins
     val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
     val hashTags = tweets.flatMap(status => getTags(status))
     val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
     The sliding window operation takes a window length (Minutes(10)) and a sliding interval (Seconds(1)).

  31. Example: Count the hashtags over the last 10 mins
     val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
     At each sliding interval, countByValue runs over all the data in the window (every batch of hashTags from t-1 through t+3 in the diagram).

  32. Smart window-based countByValue
     val tagCounts = hashtags.countByValueAndWindow(Minutes(10), Seconds(1))
     Instead of recounting the whole window each time: add the counts from the new batch entering the window and subtract the counts from the batch leaving the window.

  33. Smart window-based reduce
     Incremental counting generalizes to many reduce operations.
     Need a function to "inverse reduce" ("subtract" for counting).
     val tagCounts = hashtags.countByValueAndWindow(Minutes(10), Seconds(1))
     val tagCounts = hashtags.reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(1))
     In Python:
     tagCounts = hashtags.reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, Minutes(10), Seconds(1))
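The incremental pattern itself can be sketched in plain Python (illustrative, not Spark's implementation): keep a running count, "reduce" in the batch entering the window, and "inverse reduce" out the batch leaving it, so each step costs two batches instead of a full window recount.

```python
from collections import Counter

def incremental_window_counts(batches, window_len):
    """Per-key counts over a sliding window of the last window_len batches,
    maintained incrementally with add (reduce) and subtract (inverse reduce)."""
    totals = Counter()
    results = []
    for i, batch in enumerate(batches):
        totals += Counter(batch)                        # reduce: batch entering
        if i >= window_len:
            totals -= Counter(batches[i - window_len])  # inverse reduce: batch leaving
        results.append(dict(totals))
    return results
```

This only works because subtraction undoes addition; a reduce without an inverse (e.g., max) cannot be windowed this way and must recompute over the full window.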

  34. Performance
     Can process 6 GB/sec (60M records/sec) of data on 100 nodes at sub-second latency.
     Tested with 100 streams of data on 100 EC2 instances with 4 cores each.

  35. Comparison with Storm
     Higher throughput than Storm:
     - Spark Streaming: 670k records/second/node
     - Storm: 115k records/second/node
