
Spark Streaming and GraphX

Amir H. Payberah

amir@sics.se

SICS Swedish ICT

Amir H. Payberah (SICS) Spark Streaming and GraphX June 30, 2016 1 / 1


Spark Streaming


Motivation

◮ Many applications must process large streams of live data and provide results in real time.

  • Wireless sensor networks
  • Traffic management applications
  • Stock market monitoring
  • Environmental monitoring applications
  • Fraud detection tools
  • ...



Stream Processing Systems

◮ Database Management Systems (DBMS): data-at-rest analytics

  • Store and index data before processing it.
  • Process data only when explicitly asked by the users.

◮ Stream Processing Systems (SPS): data-in-motion analytics

  • Process information as it flows, without storing it persistently.


DBMS vs. SPS (1/2)

◮ DBMS: persistent data where updates are relatively infrequent.
◮ SPS: transient data that is continuously updated.


DBMS vs. SPS (2/2)

◮ DBMS: runs queries just once to return a complete answer.
◮ SPS: executes standing queries, which run continuously and provide updated answers as new data arrives.


Core Idea of Spark Streaming

◮ Run a streaming computation as a series of very small and deterministic batch jobs.



Spark Streaming

◮ Run a streaming computation as a series of very small, deterministic batch jobs.

  • Chop up the live stream into batches of X seconds.
  • Spark treats each batch of data as an RDD and processes it using RDD operations.
  • Finally, the processed results of the RDD operations are returned in batches.
  • This model is called Discretized Stream Processing (DStream).
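The chopping step can be illustrated in plain Scala, independent of Spark: grouping timestamped events by a fixed batch interval yields the sequence of batches that a DStream conceptually is. This is only an illustrative sketch; the interval and event data are made up, and inside Spark each group would become one RDD.

```scala
// Illustrative sketch (no Spark): chop timestamped events into
// fixed-interval micro-batches, as Spark Streaming does internally.
val batchIntervalMs = 1000L  // assumed 1-second batch interval

// (timestampMs, event) pairs standing in for a live stream
val events = Seq((100L, "a"), (900L, "b"), (1200L, "c"), (2500L, "d"))

// Each batch is identified by timestamp / interval; within Spark,
// each such group would be materialized as one RDD of the DStream.
val batches: Map[Long, Seq[String]] =
  events.groupBy { case (ts, _) => ts / batchIntervalMs }
        .map { case (k, vs) => (k, vs.map(_._2)) }

// batches == Map(0 -> Seq("a", "b"), 1 -> Seq("c"), 2 -> Seq("d"))
```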


DStream

◮ DStream: sequence of RDDs representing a stream of data.
◮ Any operation applied on a DStream translates to operations on the underlying RDDs.



StreamingContext

◮ StreamingContext: the main entry point of all Spark Streaming functionality.
◮ To initialize a Spark Streaming program, a StreamingContext object has to be created.

val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))


Source of Streaming

◮ Two categories of streaming sources.
◮ Basic sources, directly available in the StreamingContext API, e.g., file systems, socket connections, ...
◮ Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter, ...

ssc.socketTextStream("localhost", 9999)
TwitterUtils.createStream(ssc, None)


DStream Transformations

◮ Transformations: modify data from one DStream to a new DStream.
◮ Standard RDD operations, e.g., map, join, ...
◮ DStream operations, e.g., window operations.


DStream Transformation Example

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
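Note that defining the DStream operations does not start any processing; the computation only begins once the context is started. A minimal completion of the example:

```scala
ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // block until the job is stopped or fails
```

With the job running, text typed into a socket server such as `nc -lk 9999` on the same host is counted and printed once per one-second batch.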


Window Operations

◮ Apply transformations over a sliding window of data: window length and slide interval.

val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream(IP, Port)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val windowedWordCounts = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))
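For large windows, Spark Streaming also offers an incremental variant of reduceByKeyAndWindow that takes an inverse reduce function: instead of recomputing the whole window on each slide, it adds the values entering the window and subtracts the values leaving it. This variant requires checkpointing to be enabled. A sketch based on the example above (`checkpointDir` is a placeholder path):

```scala
ssc.checkpoint(checkpointDir)  // required for the incremental variant

// Each 10-second slide now touches only the data entering and leaving
// the window, instead of re-reducing the full 30 seconds.
val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // reduce: values entering the window
  (a: Int, b: Int) => a - b,   // inverse reduce: values leaving the window
  Seconds(30), Seconds(10))
```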


MapWithState Operation

◮ Maintains state while continuously updating it with new information.
◮ It requires a checkpoint directory.
◮ A newer alternative to updateStateByKey.

val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint(".")
val lines = ssc.socketTextStream(IP, Port)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))

val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

val stateWordCount = pairs.mapWithState(StateSpec.function(mappingFunc))


Transform Operation

◮ Allows arbitrary RDD-to-RDD functions to be applied on a DStream.
◮ Apply any RDD operation that is not exposed in the DStream API, e.g., joining every RDD in a DStream with another RDD.

// RDD containing spam information
val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(...)

val cleanedDStream = wordCounts.transform(rdd => {
  // join data stream with spam information to do data cleaning
  rdd.join(spamInfoRDD).filter(...)
  ...
})


Spark Streaming and DataFrame

val words: DStream[String] = ...

words.foreachRDD { rdd =>
  // Get the singleton instance of SQLContext
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  // Convert RDD[String] to DataFrame
  val wordsDataFrame = rdd.toDF("word")

  // Register as table
  wordsDataFrame.registerTempTable("words")

  // Do word count on DataFrame using SQL and print it
  val wordCountsDataFrame =
    sqlContext.sql("select word, count(*) as total from words group by word")
  wordCountsDataFrame.show()
}


GraphX


Introduction

◮ Graphs provide a flexible abstraction for describing relationships between discrete objects.
◮ Many problems can be modeled by graphs and solved with appropriate graph algorithms.


Large Graph


Can we use platforms like MapReduce or Spark, which are based on the data-parallel model, for large-scale graph processing?


Graph-Parallel Processing

◮ Restricts the types of computation.
◮ New techniques to partition and distribute graphs.
◮ Exploits graph structure.
◮ Executes graph algorithms orders of magnitude faster than more general data-parallel systems.


Data-Parallel vs. Graph-Parallel Computation (1/3)



Data-Parallel vs. Graph-Parallel Computation (2/3)

◮ Graph-parallel computation: restricting the types of computation to achieve performance.
◮ But the same restrictions make it difficult and inefficient to express many stages in a typical graph-analytics pipeline.


Data-Parallel vs. Graph-Parallel Computation (3/3)

◮ Moving between table and graph views of the same physical data.
◮ Inefficient: extensive data movement and duplication across the network and file system.


GraphX

◮ Unifies data-parallel and graph-parallel systems.
◮ Tables and graphs are composable views of the same physical data.
◮ Implemented on top of Spark.


GraphX vs. Data-Parallel/Graph-Parallel Systems


Property Graph

◮ Represented using two Spark RDDs:

  • Vertex collection: VertexRDD
  • Edge collection: EdgeRDD

// VD: the type of the vertex attribute
// ED: the type of the edge attribute
class Graph[VD, ED] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
}


Triplets

◮ The triplet view logically joins the vertex and edge properties, yielding an RDD[EdgeTriplet[VD, ED]].


Example Property Graph (1/3)


Example Property Graph (2/3)

val sc: SparkContext

// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] = sc.parallelize(
  Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
        (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

// Create an RDD for edges
val relationships: RDD[Edge[String]] = sc.parallelize(
  Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
        Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

// Define a default user in case there are relationships with missing users
val defaultUser = ("John Doe", "Missing")

// Build the initial Graph
val userGraph: Graph[(String, String), String] =
  Graph(users, relationships, defaultUser)


Example Property Graph (3/3)

// Constructed from above
val userGraph: Graph[(String, String), String]

// Count all users which are postdocs
userGraph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count

// Count all the edges where src > dst
userGraph.edges.filter(e => e.srcId > e.dstId).count

// Use the triplets view to create an RDD of facts
val facts: RDD[String] = userGraph.triplets.map(triplet =>
  triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)


Property Operators

def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]

val newGraph = graph.mapVertices((id, attr) => mapUdf(id, attr))


Structural Operators

def reverse: Graph[VD, ED]
def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
             vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]

// Run Connected Components
val ccGraph = graph.connectedComponents() // No longer contains missing field

// Remove missing vertices as well as the edges connected to them
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")

// Restrict the answer to the valid subgraph
val validCCGraph = ccGraph.mask(validGraph)


Join Operators

def joinVertices[U](table: RDD[(VertexId, U)])
  (map: (VertexId, VD, U) => VD): Graph[VD, ED]
def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])
  (map: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]

val outDegrees: VertexRDD[Int] = graph.outDegrees
val degreeGraph = graph.outerJoinVertices(outDegrees) { (id, oldAttr, outDegOpt) =>
  outDegOpt match {
    case Some(outDeg) => outDeg
    case None => 0 // No outDegree means zero outDegree
  }
}


Neighborhood Aggregation

def aggregateMessages[Msg: ClassTag](
    sendMsg: EdgeContext[VD, ED, Msg] => Unit, // map
    mergeMsg: (Msg, Msg) => Msg,               // reduce
    tripletFields: TripletFields = TripletFields.All): VertexRDD[Msg]

val graph: Graph[Double, Int] = ...

val olderFollowers: VertexRDD[(Int, Double)] =
  graph.aggregateMessages[(Int, Double)](
    triplet => { // Map Function
      if (triplet.srcAttr > triplet.dstAttr) {
        // Send message to destination vertex containing counter and age
        triplet.sendToDst(1, triplet.srcAttr)
      }
    },
    // Reduce Function
    (a, b) => (a._1 + b._1, a._2 + b._2)
  )

val avgAgeOfOlderFollowers: VertexRDD[Double] = olderFollowers.mapValues(
  (id, value) => value match { case (count, totalAge) => totalAge / count })



Summary

◮ Spark Streaming

  • Mini-batch processing
  • DStream (sequence of RDDs)
  • Transformations, e.g., stateful, window, join, transform, ...

◮ GraphX

  • Unifies graph-parallel and data-parallel models
  • Property graph (VertexRDD and EdgeRDD)


Questions?
