Processing Fast Data with Apache Spark:
The Tale of Two Streaming APIs
Gerard Maas Senior SW Engineer, Lightbend, Inc.
Gerard Maas
Señor SW Engineer
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
Agenda
○ What Is Spark and Why Should We Care?
○ Streaming APIs in Spark
○ Spark Streaming vs. Structured Streaming
Once upon a time...
The Apache Spark stack:
Apache Spark Core
Spark SQL
Spark MLlib
Spark Streaming
Structured Streaming
Datasets/DataFrames
GraphFrames
Data Sources
Structured Streaming

Sources: Kafka, Sockets, HDFS/S3, Custom
  -> Streaming DataFrame
  -> Query
Sinks: Kafka, Files, foreachSink, console, memory
Output Mode
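The output mode decides what each trigger emits to the sink. A minimal sketch, assuming a streaming DataFrame named `sensorData` like the one built later in the demo:

```scala
// Output modes in Structured Streaming:
//   "append"   - emit only new result rows
//   "update"   - emit only rows changed since the last trigger
//   "complete" - emit the entire result table on every trigger
val consoleQuery = sensorData.writeStream
  .format("console")
  .outputMode("append")
  .start()
```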
Demo architecture:
Sensor Data Multiplexer -> Structured Streaming -> Sensor Anomaly Detection
(running in a Spark Notebook as a local process)
val rawData = sparkSession.readStream
  .format("kafka")  // also: csv, json, parquet, socket, rate
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "latest")
  .load()
...
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
val sensorData = jsonValues.select("record.*").as[SensorData]
...
...
val movingAverage = sensorData
  .withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
  .withWatermark("timestamp", "30 seconds")
  .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
  .agg(avg($"temp"))
...
...
val visualizationQuery = sensorData.writeStream
  .queryName("visualization")  // this will be the SQL table name
  .format("memory")
  .outputMode("update")
  .start()
...
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter")
  .format("kafka")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint")
  .start()
Datasets
Structured Streaming
Spark Streaming

Sources: Kafka, Flume, Kinesis, Twitter, Sockets, HDFS/S3, Custom
Engine: Apache Spark, Spark SQL, Spark ML, ...
Sinks: Databases, HDFS, API Server, Streams
A DStream[T] is a sequence of RDD[T], one per batch interval (t0, t1, t2, ..., ti, ti+1).
A transformation T -> U produces a matching RDD[U] for each underlying RDD[T].
Actions are then applied to each resulting RDD.
API: Transformations

map, flatMap, filter
count, reduce, countByValue, reduceByKey
union, join, cogroup
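A minimal sketch of chaining these transformations on a DStream; the socket source on port 9999 is a hypothetical test input, not part of the demo:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("transformations")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines
  .flatMap(_.split("\\s+"))  // split each line into words
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)        // per-batch word counts

counts.print()
ssc.start()
ssc.awaitTermination()
```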
API: Transformations
mapWithState
…
…
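A minimal `mapWithState` sketch, assuming a hypothetical `readings: DStream[(String, Double)]` keyed by sensor id; the state carries a running total per key across batches:

```scala
import org.apache.spark.streaming.{State, StateSpec}

val updateTotal = (id: String, reading: Option[Double], state: State[Double]) => {
  val total = state.getOption.getOrElse(0.0) + reading.getOrElse(0.0)
  state.update(total)  // carry the total over to the next batch
  (id, total)          // element emitted downstream
}

val totals = readings.mapWithState(StateSpec.function(updateTotal))
```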
API: Transformations

transform

val iotDstream = MQTTUtils.createStream(...)
val devicePriority = sparkContext.cassandraTable(...)
val prioritizedDStream = iotDstream.transform { rdd =>
  rdd.map(d => (d.id, d)).join(devicePriority)
}
Actions

saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles
foreachRDD

foreachRDD opens the door to any other Spark API: Spark SQL, DataFrames, GraphFrames, ...
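A sketch of `foreachRDD` bridging into the DataFrame API, assuming a hypothetical `sensorDStream: DStream[SensorData]` (with `SensorData` a case class) and an illustrative output path:

```scala
import org.apache.spark.sql.SparkSession

sensorDStream.foreachRDD { rdd =>
  // obtain (or reuse) a SparkSession inside the output operation
  val spark = SparkSession.builder
    .config(rdd.sparkContext.getConf)
    .getOrCreate()
  import spark.implicits._

  spark.createDataset(rdd)          // RDD[SensorData] -> Dataset[SensorData]
    .write
    .mode("append")
    .parquet("/tmp/sensor-batches") // hypothetical sink
}
```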
import org.apache.spark.streaming.{Seconds, StreamingContext}

val streamingContext = new StreamingContext(sparkContext, Seconds(10))
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> kafkaBootstrapServer,
  "group.id" -> "sensor-tracker-group",
  "auto.offset.reset" -> "largest",
  "enable.auto.commit" -> (false: java.lang.Boolean).toString
)
val topics = Set(topic)

@transient val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)
import spark.implicits._

val sensorDataStream = stream.transform { rdd =>
  val jsonData = rdd.map { case (k, v) => v }
  val ds = sparkSession.createDataset(jsonData)
  val jsonDF = spark.read.json(ds)
  val sensorDataDS = jsonDF.as[SensorData]
  sensorDataDS.rdd
}
val model = new M2Model()
…
model.trainOn(inputData)
…
val scoredDStream = model.predictOnValues(inputData)
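`M2Model` above is shorthand for a streaming model with a train/score API. A concrete sketch using Spark's own `StreamingKMeans` (not necessarily the model used in the talk), assuming hypothetical streams `trainingData: DStream[Vector]` and `labeledData: DStream[(String, Vector)]`:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans

val model = new StreamingKMeans()
  .setK(3)                  // number of clusters
  .setDecayFactor(1.0)      // how quickly older data is forgotten
  .setRandomCenters(2, 0.0) // data dimension, initial center weight

model.trainOn(trainingData)                      // keep learning from one stream
val scored = model.predictOnValues(labeledData)  // score another stream
```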
suspects.foreachRDD { rdd =>
  val sample = rdd.take(20).map(_.toString)
  val total = s"total found: ${rdd.count}"
}
Use cases
○ Learn
○ Score
Demo architecture:
Sensor Data Multiplexer -> Structured Streaming -> Sensor Anomaly Detection (Real-Time Detection)
(local process)
Spark Streaming + Structured Streaming
val parse: Dataset[String] => Dataset[Record] = ???
val process: Dataset[Record] => Dataset[Result] = ???
val serialize: Dataset[Result] => Dataset[String] = ???

Structured Streaming:

val kafkaStream = spark.readStream…
val f = parse andThen process andThen serialize
val result = f(kafkaStream)
result.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", writeTopic)
  .option("checkpointLocation", checkpointLocation)
  .start()

Spark Streaming:

val dstream = KafkaUtils.createDirectStream(...)
dstream.foreachRDD { rdd =>
  val ds = sparkSession.createDataset(rdd)
  val f = parse andThen process andThen serialize
  val result = f(ds)
  result.write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", writeTopic)
    .option("checkpointLocation", checkpointLocation)
    .save()
}
Streaming Pipelines

Structured Streaming: Keyword Extraction -> Keyword Relevance -> Similarity -> DB Storage
Structured Streaming vs. Spark Streaming:

              | Structured Streaming                    | Spark Streaming
Time          | Abstract (Processing Time, Event Time)  | Fixed to micro-batch (streaming interval)
Execution     | Micro-batch, best-effort micro-batch,   | Fixed micro-batch
              | Continuous (NRT)                        |
Abstraction   | DataFrames/Datasets                     | DStream, RDD; access to the scheduler
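The Execution row of the table maps to Structured Streaming's `Trigger` settings; a sketch assuming a streaming DataFrame named `result`:

```scala
import org.apache.spark.sql.streaming.Trigger

// With no trigger set, Spark runs best-effort micro-batches.
val query = result.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds")) // fixed micro-batch interval
  // .trigger(Trigger.Continuous("1 second"))    // continuous (NRT) mode, Spark 2.3+
  .start()
```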
lightbend.com/fast-data-platform
Gerard Maas
Señor SW Engineer
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg