Spark Streaming
Summary by Lucy Yu
Motivation
- Most of "big data" happens in a streaming context
– Network monitoring, real-time fraud detection, algorithmic trading, risk management
- Current model: the continuous operator model
– Fault tolerance achieved via replication or upstream backup
Motivation
[Figure: fault tolerance in the continuous operator model, via replication of sources vs. upstream backup]
D-Streams
- "Instead of managing long-lived operators, the idea is to structure a streaming computation as a series of stateless, deterministic batch computations on small time intervals."
- Use a data structure: Resilient Distributed Datasets (RDDs)
– keep data in memory
– can be recovered without replication, by tracking the lineage graph of operations that were used to build them
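A minimal sketch of the lineage idea in plain Spark (Scala): each transformation records its parent RDD, so a lost partition can be recomputed from the lineage graph rather than restored from a replica. The object and app names here are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("lineage-sketch").setMaster("local[2]"))
        val base    = sc.parallelize(1 to 1000)   // source RDD
        val squared = base.map(x => x * x)        // child of `base`
        val evens   = squared.filter(_ % 2 == 0)  // child of `squared`
        println(evens.toDebugString)              // prints the recorded lineage graph
        sc.stop()
      }
    }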
D-Streams
[Figure: D-Stream processing model, from https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html]
D-Streams
- The data received in each interval stored reliable across the
cluster to form an input dataset for that interval
- Do batch operation to get another RDD, which acts as a
state or output
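A minimal sketch of this interval model, assuming an illustrative socket source on localhost:9999: a StreamingContext with a one-second batch interval turns the incoming stream into one input RDD per interval, and each batch operation runs on that interval's RDD.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object IntervalSketch {
      def main(args: Array[String]): Unit = {
        val conf  = new SparkConf().setAppName("interval-sketch").setMaster("local[2]")
        val ssc   = new StreamingContext(conf, Seconds(1))   // one input RDD per second
        val lines = ssc.socketTextStream("localhost", 9999)  // illustrative source
        lines.count().print()  // batch operation over each interval's input RDD
        ssc.start()
        ssc.awaitTermination()
      }
    }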
D-Stream API
- Users register one or more input streams using a functional API
- Transformations create a new D-Stream from parent stream(s)
– Stateless: map, reduce, groupBy, join
– Stateful: window operations
- Output operations let the program write data to external systems (e.g. save)
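A hedged end-to-end sketch of this API (the source, names, and output path are illustrative): register an input stream, apply stateless transformations, and finish with an output operation.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object WordCountSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("wordcount-sketch").setMaster("local[2]"),
          Seconds(1))
        val lines  = ssc.socketTextStream("localhost", 9999)  // registered input stream
        val words  = lines.flatMap(_.split(" "))               // stateless transformation
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // stateless, per interval
        counts.saveAsTextFiles("out/counts")                   // output operation ("save")
        ssc.start()
        ssc.awaitTermination()
      }
    }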
Transformations
map(func): Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
reduce(func): Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
updateStateByKey(func): Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key.
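A hedged sketch of updateStateByKey, assuming the `ssc` context and `words` stream from the word-count sketch above; stateful D-Streams also require a checkpoint directory (the path is illustrative).

    ssc.checkpoint("checkpoint")  // required for stateful streams; path illustrative

    // previous state + new values for a key -> new state (a running count)
    def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] =
      Some(newValues.sum + state.getOrElse(0))

    val pairs         = words.map(w => (w, 1))
    val runningCounts = pairs.updateStateByKey[Int](updateCount _)
    runningCounts.print()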
Window Operations
[Figure: sliding window over a DStream, from https://spark.apache.org/docs/1.2.0/streaming-programming-guide.html]
Window Operations
window(windowLength, slideInterval): Return a new DStream which is computed based on windowed batches of the source DStream.
countByWindow(windowLength, slideInterval): Return a sliding window count of elements in the stream.
reduceByWindow(func, windowLength, slideInterval): Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]): A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window.
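A hedged sketch of the two reduceByKeyAndWindow variants, assuming the `pairs` DStream[(String, Int)] from the earlier sketch; the incremental form supplies an inverse function to "subtract" batches leaving the window and needs checkpointing enabled. The window and slide durations are illustrative.

    import org.apache.spark.streaming.Seconds

    // full recompute: reduce all batches inside each 30s window, every 10s
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    // incremental: add batches entering the window, inverse-reduce those leaving
    val incrementalCounts = pairs.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // func
      (a: Int, b: Int) => a - b,   // invFunc
      Seconds(30), Seconds(10))
    incrementalCounts.print()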
Fault Recovery
- Parallel recovery of a lost node's state
– When a node fails, each node in the cluster works to recompute part of the lost node's RDDs, resulting in significantly faster recovery than upstream backup, without the cost of replication
- In a similar way, D-Streams can recover from stragglers using speculative execution
- Checkpoint state RDDs periodically
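A minimal sketch of periodic checkpointing plus driver recovery, assuming a hypothetical reliable checkpoint directory (e.g. on HDFS): StreamingContext.getOrCreate rebuilds the context from the checkpoint after a failure, or creates it fresh on the first run.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointSketch {
      val checkpointDir = "hdfs://namenode:8020/checkpoints"  // hypothetical path

      def createContext(): StreamingContext = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("recovery-sketch"), Seconds(1))
        ssc.checkpoint(checkpointDir)  // state RDDs checkpointed here periodically
        // ... register streams and operations here ...
        ssc
      }

      def main(args: Array[String]): Unit = {
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }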