Streaming items through a cluster with Spark Streaming Tathagata TD - PowerPoint PPT Presentation

Streaming items through a cluster with Spark Streaming
Tathagata “TD” Das, @tathadas
CME 323: Distributed Algorithms and Optimization
Stanford, May 6, 2015

Who am I?
> Project Management Committee (PMC) member of Apache Spark
> Lead developer of Spark Streaming
> Formerly in AMPLab, UC Berkeley
> Software developer at Databricks
> Databricks was started by creators of Spark to provide Spark-as-a-service in the cloud

Big Data

Big Streaming Data

Why process Big Streaming Data?
> Fraud detection in bank transactions
> Anomalies in sensor data
> Cat videos in tweets

How to Process Big Streaming Data
[Diagram: Raw Tweets → Ingest data → Process data → Store results]
> Ingest – Receive and buffer the streaming data
> Process – Clean, extract, transform the data
> Store – Store transformed data for consumption

How to Process Big Streaming Data
[Diagram: Raw Tweets → Ingest data → Process data → Store results]
> For big streams, every step requires a cluster
> Every step requires a system that is designed for it

Stream Ingestion Systems
[Diagram: Raw Tweets → Ingest data → Process data → Store results]
> Kafka – popular distributed pub-sub system
> Kinesis – Amazon managed distributed pub-sub
> Flume – like a distributed data pipe

Stream Processing Systems
[Diagram: Raw Tweets → Ingest data → Process data → Store results]
> Spark Streaming – most demanded
> Storm – most widely deployed (as of now ;) )
> Samza – gaining popularity in certain scenarios

Data Storage Systems
[Diagram: Raw Tweets → Ingest data → Process data → Store results]
> File systems – HDFS, Amazon S3, etc.
> Key-value stores – HBase, Cassandra, etc.
> Databases – MongoDB, MemSQL, etc.

Producers and Consumers
[Diagram: Producer 1, Producer 2 → Kafka Cluster (Topic X, Topic Y) → Consumers; messages shown: (topicX, data1), (topicY, data2), (topicX, data3)]
> Producers publish data tagged by “topic”
> Consumers subscribe to data of a particular “topic”

Topics and Partitions
> Topic = category of message, divided into partitions
> Partition = ordered, numbered stream of messages
> Producer decides which (topic, partition) to put each message in
> Consumer decides which (topic, partition) to pull messages from
  - High-level consumer – handles fault-recovery with Zookeeper
  - Simple consumer – low-level API for greater control
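The producer-side choice of (topic, partition) can be sketched in plain Scala without any Kafka dependency. This is only a model: `Message`, `choosePartition` and `route` are hypothetical names, and hashing the key is an assumed policy (one common choice), not something the slides specify.

```scala
object PartitionSketch {
  // Hypothetical message type; these are not Kafka's own classes.
  final case class Message(topic: String, key: String, value: String)

  // Assumed policy: hash the key so that equal keys land in the same partition.
  // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
  def choosePartition(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Append each message to the log of its (topic, partition).
  // The position of a value inside its Vector plays the role of the message number.
  def route(messages: Seq[Message], numPartitions: Int): Map[(String, Int), Vector[String]] =
    messages.foldLeft(Map.empty[(String, Int), Vector[String]]) { (logs, m) =>
      val slot = (m.topic, choosePartition(m.key, numPartitions))
      logs.updated(slot, logs.getOrElse(slot, Vector.empty) :+ m.value)
    }
}
```

Messages with the same topic and key end up in one partition in arrival order, which matches the "ordered, numbered stream" description above; ordering across partitions is not modeled.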

How to process Kafka messages?
[Diagram: Raw Tweets → Ingest data → Process data → Store results]
> Incoming tweets received in distributed manner and buffered in Kafka
> How to process them?

Spark Streaming

What is Spark Streaming?
Scalable, fault-tolerant stream processing system
> High-level API: joins, windows, …; often 5x less code
> Fault-tolerant: exactly-once semantics, even for stateful ops
> Integration: integrate with MLlib, SQL, DataFrames, GraphX
[Diagram: Kafka, Flume, HDFS, Kinesis, Twitter → Spark Streaming → File systems, Databases, Dashboards]

How does Spark Streaming work?
> Receivers chop up data streams into batches of a few seconds
> Spark processing engine processes each batch and pushes out the results to external data stores
[Diagram: data streams → Receivers → batches as RDDs → Spark → results as RDDs]
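The chopping step can be modeled in plain Scala, without Spark. This is a sketch under assumptions: `Record`, `chop` and the millisecond timestamps are hypothetical, and real receivers work on live data rather than on a finished list.

```scala
object BatchSketch {
  // Hypothetical record: an arrival time in milliseconds plus a payload.
  final case class Record(arrivalMs: Long, payload: String)

  // Assign every record to the batch interval in which it arrived.
  // The result is ordered by batch index; intervals with no data are absent in this model.
  def chop(records: Seq[Record], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    records
      .groupBy(r => r.arrivalMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batch, rs) => (batch, rs.sortBy(_.arrivalMs).map(_.payload)) }
}
```

With a 1000 ms interval, records arriving at 100 ms and 950 ms fall into batch 0, and a record arriving at 1000 ms opens batch 1. Each resulting batch is what the processing engine would then handle as one unit.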

Spark Programming Model
> Resilient distributed datasets (RDDs)
  - Distributed, partitioned collection of objects
  - Manipulated through parallel transformations (map, filter, reduceByKey, …)
  - All transformations are lazy, execution forced by actions (count, reduce, take, …)
  - Can be cached in memory across cluster
  - Automatically rebuilt on failure
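The "lazy transformation, forced by an action" behavior can be illustrated by analogy with a lazy view over a local Scala collection. This is not the RDD API and says nothing about distribution, caching or recovery; `LazySketch` and `demo` are hypothetical names.

```scala
object LazySketch {
  // Returns (function calls before the action, function calls after it, result).
  def demo(): (Int, Int, Int) = {
    var calls = 0
    val data = Vector(1, 2, 3, 4)
    // "Transformation": only describes the work; the function is not run yet.
    val doubled = data.view.map { x => calls += 1; x * 2 }
    val callsBeforeAction = calls
    // "Action": asking for a result forces the pipeline to run.
    val total = doubled.sum
    (callsBeforeAction, calls, total)
  }
}
```

The counter stays at zero until `sum` is requested, which mirrors how RDD transformations only record what to do until an action such as count or reduce is called.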

Spark Streaming Programming Model
> Discretized Stream (DStream)
  - Represents a stream of data
  - Implemented as an infinite sequence of RDDs
> DStreams API very similar to RDD API
  - Functional APIs in
  - Create input DStreams from Kafka, Flume, Kinesis, HDFS, …
  - Apply transformations

Example – Get hashtags from Twitter
    val ssc = new StreamingContext(conf, Seconds(1))
> StreamingContext is the starting point of all streaming functionality
> Seconds(1): batch interval, by which streams will be chopped up

Example – Get hashtags from Twitter
    val ssc = new StreamingContext(conf, Seconds(1))
    val tweets = TwitterUtils.createStream(ssc, auth)
> tweets is an input DStream fed by the Twitter Streaming API
[Diagram: tweets DStream as batch @ t, batch @ t+1, batch @ t+2; replicated and stored in memory as RDDs]

Example – Get hashtags from Twitter
    val tweets = TwitterUtils.createStream(ssc, None)
    val hashTags = tweets.flatMap(status => getTags(status))
> transformation: modify data in one DStream to create another DStream
> hashTags is the transformed DStream
[Diagram: tweets DStream (batch @ t, batch @ t+1, batch @ t+2) → flatMap → hashTags DStream, e.g. [#cat, #dog, …]; new RDDs created for every batch]
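The slides use getTags without defining it. One possible version is sketched below in plain Scala, under the assumption that it receives the tweet text as a String (the real stream carries Twitter status objects, so a text accessor would be needed first). The regular expression and the batch-by-batch model of a DStream are illustrative choices, not the presenter's code.

```scala
object HashtagSketch {
  // Assumed definition of a hashtag: '#' followed by one or more word characters.
  private val Tag = "#\\w+".r

  // Hypothetical stand-in for getTags, working on plain tweet text.
  def getTags(text: String): Seq[String] =
    Tag.findAllIn(text).toList

  // A DStream is modeled here as a sequence of batches;
  // flatMap is applied to each batch independently, producing a new batch.
  def hashTagsPerBatch(batches: Seq[Seq[String]]): Seq[Seq[String]] =
    batches.map(batch => batch.flatMap(getTags))
}
```

A batch containing "I love my #cat and #dog!" yields the two tags #cat and #dog, and a batch with no hashtags yields an empty batch rather than being dropped.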

Example – Get hashtags from Twitter
    val tweets = TwitterUtils.createStream(ssc, None)
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsTextFiles("hdfs://...")
> output operation: to push data to external storage
[Diagram: tweets DStream (batch @ t, batch @ t+1, batch @ t+2) → flatMap → hashTags DStream → save; every batch saved to HDFS]

Example – Get hashtags from Twitter
    val tweets = TwitterUtils.createStream(ssc, None)
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.foreachRDD(hashTagRDD => { ... })
> foreachRDD: do whatever you want with the processed data
[Diagram: tweets DStream (batch @ t, batch @ t+1, batch @ t+2) → flatMap → hashTags DStream → foreach; write to a database, update analytics UI, do whatever you want]

Example – Get hashtags from Twitter
    val tweets = TwitterUtils.createStream(ssc, None)
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.foreachRDD(hashTagRDD => { ... })
> All of this was just setup for what to do when streaming data is received
    ssc.start()
> This actually starts the receiving and processing

What’s going on inside?
> Receiver buffers tweets in Executors’ memory
> Spark Streaming Driver launches tasks to process tweets
[Diagram: Twitter → Raw Tweets → Receiver on the Spark Cluster; Buffered Tweets held on Executors; Driver running DStreams launches tasks to process tweets]

What’s going on inside?
[Diagram: Kafka Cluster → several Receivers on the Spark Cluster Executors, which receive data in parallel; Driver running DStreams launches tasks to process data]

Performance
Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency
[Figure: two panels, Grep and WordCount, plotting Cluster Throughput (GB/s) against # Nodes in Cluster (0 to 100), each with series labeled 1 sec and 2 sec]

Window-based Transformations
    val tweets = TwitterUtils.createStream(ssc, auth)
    val hashTags = tweets.flatMap(status => getTags(status))
    val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
> window: sliding window operation
> Minutes(1): window length
> Seconds(5): sliding interval
[Diagram: DStream of data with the window length and sliding interval marked]
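What window length and sliding interval mean for the counts can be modeled in plain Scala, again without Spark. In this sketch both values are expressed in numbers of batches (an assumption made for simplicity), and `windowedCounts` is a hypothetical helper, not a Spark method.

```scala
object WindowSketch {
  // batches(i) holds the hashtags of batch i.
  // windowLength and slide are counted in batches, i.e. duration / batch interval.
  // One result is produced every `slide` batches, covering the last `windowLength` batches.
  def windowedCounts(
      batches: Seq[Seq[String]],
      windowLength: Int,
      slide: Int): Seq[Map[String, Int]] =
    (slide to batches.length by slide).map { end =>
      val start = math.max(0, end - windowLength)
      batches.slice(start, end).flatten
        .groupBy(identity)
        .map { case (tag, occurrences) => (tag, occurrences.size) }
    }
}
```

With the 1-second batch interval used earlier in the example, Minutes(1) and Seconds(5) would correspond to windowLength = 60 and slide = 5 in this model. Because the window is longer than the slide, consecutive results overlap and most batches are counted in several windows.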
