Processing Fast Data with Apache Spark: The Tale of Two Streaming APIs
Gerard Maas, Senior SW Engineer, Lightbend, Inc. (PowerPoint presentation transcript)


SLIDE 1

Processing Fast Data with Apache Spark:

The Tale of Two Streaming APIs

Gerard Maas Senior SW Engineer, Lightbend, Inc.

SLIDE 2

Gerard Maas

Señor SW Engineer
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg

SLIDE 3

Agenda

  • What is Spark and why should we care?
  • Streaming APIs in Spark:
    • Structured Streaming
    • Interactive Session 1
    • Spark Streaming
    • Interactive Session 2
  • Spark Streaming vs. Structured Streaming

SLIDE 4

Streaming | Big Data

SLIDE 5

100Tb 5Mb

SLIDE 6

100Tb 5Mb/s

SLIDE 7

∑ Stream = Dataset
∂ Dataset = Stream

  • Tyler Akidau, Google
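Akidau's duality (a dataset is the integral of a stream of changes; a stream is the derivative of a dataset) can be sketched in plain Scala. This is an illustrative sketch under an assumption not in the slides: a stream of additive (key, delta) updates; the `Duality` object and its method names are hypothetical.

```scala
// Sketch of the stream/table duality, assuming additive (key, delta)
// updates. All names here are illustrative, not from the talk.
object Duality {
  // "∑ Stream = Dataset": fold the change stream into a table
  def integrate(stream: Seq[(String, Int)]): Map[String, Int] =
    stream.foldLeft(Map.empty[String, Int]) {
      case (table, (k, delta)) => table.updated(k, table.getOrElse(k, 0) + delta)
    }

  // "∂ Dataset = Stream": the per-key difference between two table
  // versions is again a stream of updates (deletions ignored for brevity)
  def differentiate(before: Map[String, Int],
                    after: Map[String, Int]): Seq[(String, Int)] =
    after.collect {
      case (k, v) if v != before.getOrElse(k, 0) =>
        (k, v - before.getOrElse(k, 0))
    }.toSeq
}
```

For example, `Duality.integrate(Seq(("a", 1), ("a", 2), ("b", 3)))` yields `Map("a" -> 3, "b" -> 3)`.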
SLIDE 8

Once upon a time...

SLIDE 9

[Diagram: the Apache Spark stack: Spark Core; Spark SQL with Datasets/DataFrames; Spark MLlib; Spark Streaming; Structured Streaming; GraphFrames; Data Sources]

SLIDE 10

[Diagram: the same Apache Spark stack: Spark Core; Spark SQL; Spark MLlib; Spark Streaming; Structured Streaming; GraphFrames; Data Sources; Datasets/DataFrames]

SLIDE 11

Structured Streaming

SLIDE 12

Structured Streaming

SLIDE 13

Structured Streaming

Sources (Kafka, Sockets, HDFS/S3, Custom)
  → Streaming DataFrame

SLIDE 14

Structured Streaming

Sources (Kafka, Sockets, HDFS/S3, Custom)
  → Streaming DataFrame

SLIDE 15

Structured Streaming

Sources (Kafka, Sockets, HDFS/S3, Custom)
  → Streaming DataFrame
  → Query
  → Sinks (Kafka, Files, foreachSink, console, memory) + Output Mode
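The flow on this slide (source → streaming DataFrame → query → sink plus an output mode) can be sketched end to end. A minimal sketch, not the session's demo code, assuming a local Spark session and using the built-in rate source and console sink:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("minimal-structured-streaming")
  .master("local[*]")        // assumption: local demo setup
  .getOrCreate()

// Source: the built-in "rate" source emits (timestamp, value) rows
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

// Query: ordinary DataFrame transformations work on a streaming frame
val evens = stream.filter("value % 2 = 0")

// Sink + output mode: append new rows to the console each micro-batch
val query = evens.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
```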

SLIDE 16

Interactive Session 1: Structured Streaming

SLIDE 17

[Demo setup: Sensor Data Multiplexer (local process) → Structured Streaming in a Spark Notebook → Sensor Anomaly Detection]

SLIDE 18

Live

SLIDE 19

Interactive Session 1: Structured Streaming

QUICK RECAP

SLIDE 20

val rawData = sparkSession.readStream
  .format("kafka")   // csv, json, parquet, socket, rate
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "latest")
  .load()

Sources

SLIDE 21

Operations

...

val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
val sensorData = jsonValues.select("record.*").as[SensorData]

SLIDE 22

Event Time

...

val movingAverage = sensorData
  .withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
  .withWatermark("timestamp", "30 seconds")
  .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
  .agg(avg($"temp"))

...

SLIDE 23

Sinks

...
val visualizationQuery = sensorData.writeStream
  .queryName("visualization")   // this will be the SQL table name
  .format("memory")
  .outputMode("update")
  .start()
...
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter")
  .format("kafka")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint")
  .start()

SLIDE 24

Use Cases

  • Streaming ETL
  • Stream aggregations, windows
  • Event-time oriented analytics
  • Arbitrary stateful stream processing
  • Join streams with other streams and with fixed datasets
  • Apply machine learning models
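One of the use cases above, joining a stream with a fixed dataset, can be sketched as follows. This is a hedged sketch, not the session's code: `sensorData` is assumed to be the streaming Dataset from the earlier slides, and the reference-data path and column name are illustrative.

```scala
// Assumption: sensorData is a streaming Dataset/DataFrame with an "id"
// column, as produced by the earlier readStream examples.
val referenceDF = spark.read
  .parquet("/data/sensor-reference")   // static lookup table; path is illustrative

// Stream-static join: each micro-batch of the stream is joined
// against the fixed dataset
val enriched = sensorData.join(referenceDF, Seq("id"), "left_outer")
```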
SLIDE 25

Structured Streaming

SLIDE 26

Spark Streaming

[Diagram: sources (Kafka, Flume, Kinesis, Twitter, Sockets, HDFS/S3, Custom) → Apache Spark (Spark SQL, Spark ML, ...) → outputs (Databases, HDFS, API Server, Streams)]

SLIDE 27

[Diagram: DStream[T] is a sequence of RDD[T] micro-batches at times t0, t1, t2, t3, ..., ti, ti+1]

SLIDE 28

[Diagram: the same DStream[T] of RDD[T] micro-batches; a transformation T -> U maps each batch to an RDD[U]]

SLIDE 29

[Diagram: the DStream[T] batches, the T -> U transformation producing RDD[U] batches, and actions applied to each resulting batch]

SLIDE 30

API: Transformations

map, flatMap, filter
count, reduce, countByValue, reduceByKey
union, join, cogroup
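As a sketch of how these transformations compose on a DStream (the `ssc` StreamingContext, host, and port are assumptions for illustration, not the session's code):

```scala
// Assumption: ssc is an existing StreamingContext (see the later
// Streaming Context slide); host and port are illustrative.
val lines = ssc.socketTextStream("localhost", 9999)

val wordCounts = lines
  .flatMap(_.split("\\s+"))   // flatMap: line -> words
  .filter(_.nonEmpty)         // drop empty tokens
  .map(word => (word, 1))     // map to (key, value) pairs
  .reduceByKey(_ + _)         // per-batch word counts
```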

SLIDE 31

API: Transformations

mapWithState
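A hedged sketch of mapWithState, keeping a running count per key. The `events` stream and the counting logic are assumptions for illustration; only the `StateSpec`/`State` API shape comes from Spark Streaming itself:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// Assumption: events is a DStream[(String, Int)] of (sensorId, count)
val spec = StateSpec.function(
  (id: String, value: Option[Int], state: State[Int]) => {
    val total = state.getOption.getOrElse(0) + value.getOrElse(0)
    state.update(total)   // state is carried across micro-batches
    (id, total)           // value emitted for this batch
  })

val runningCounts = events.mapWithState(spec)
```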

SLIDE 32

API: Transformations

transform

val iotDstream = MQTTUtils.createStream(...)
val devicePriority = sparkContext.cassandraTable(...)
val prioritizedDStream = iotDstream.transform { rdd =>
  rdd.map(d => (d.id, d)).join(devicePriority)
}

SLIDE 33

Actions

  • print: console output per batch, e.g.
      Time: 1459875469000 ms
      data1
      data2
  • saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles
  • foreachRDD (*)

SLIDE 34

Actions

  • print: console output per batch, e.g.
      Time: 1459875469000 ms
      data1
      data2
  • saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles
  • foreachRDD (*): gives access to Spark SQL, DataFrames, GraphFrames, any API

SLIDE 35

Interactive Session 2: Spark Streaming

SLIDE 36

[Demo setup: Sensor Data Multiplexer (local process) → Structured Streaming in a Spark Notebook → Sensor Anomaly Detection]

SLIDE 37

Live

SLIDE 38

Interactive Session 2: Spark Streaming

QUICK RECAP

SLIDE 39

import org.apache.spark.streaming.{Seconds, StreamingContext}

val streamingContext = new StreamingContext(sparkContext, Seconds(10))

Streaming Context

SLIDE 40

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> kafkaBootstrapServer,
  "group.id" -> "sensor-tracker-group",
  "auto.offset.reset" -> "largest",
  "enable.auto.commit" -> (false: java.lang.Boolean).toString
)
val topics = Set(topic)
@transient val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)

Source

SLIDE 41

import spark.implicits._

val sensorDataStream = stream.transform { rdd =>
  val jsonData = rdd.map { case (k, v) => v }
  val ds = sparkSession.createDataset(jsonData)
  val jsonDF = spark.read.json(ds)
  val sensorDataDS = jsonDF.as[SensorData]
  sensorDataDS.rdd
}

Transformations

SLIDE 42

val model = new M2Model()
…
model.trainOn(inputData)
…
val scoredDStream = model.predictOnValues(inputData)

DIY Custom Model

SLIDE 43

suspects.foreachRDD { rdd =>
  val sample = rdd.take(20).map(_.toString)
  val total = s"total found: ${rdd.count}"
  outputBox(total +: sample)
}

Output

SLIDE 44

Use Cases ([-] marks capabilities Spark Streaming lacks)

  • Complex computing/state management (local + cluster)
  • Streaming Machine Learning
    ○ Learn
    ○ Score
  • Join streams with updatable datasets
  • RDD-based streaming computations
  • [-] Event-time oriented analytics
  • [-] Optimizations: query & data
  • [-] Continuous processing
SLIDE 45

[Demo setup: Sensor Data Multiplexer (local process) → Structured Streaming → Sensor Anomaly Detection (Real Time Detection) → Structured Streaming]

SLIDE 46

Spark Streaming + Structured Streaming

SLIDE 47

Spark Streaming + Structured Streaming


Shared functions:

val parse: Dataset[String] => Dataset[Record] = ???
val process: Dataset[Record] => Dataset[Result] = ???
val serialize: Dataset[Result] => Dataset[String] = ???

Structured Streaming:

val kafkaStream = spark.readStream…
val f = parse andThen process andThen serialize
val result = f(kafkaStream)
result.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", writeTopic)
  .option("checkpointLocation", checkpointLocation)
  .start()

Spark Streaming:

val dstream = KafkaUtils.createDirectStream(...)
dstream.foreachRDD { rdd =>
  val ds = sparkSession.createDataset(rdd)
  val f = parse andThen process andThen serialize
  val result = f(ds)
  result.write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", writeTopic)
    .option("checkpointLocation", checkpointLocation)
    .save()
}

SLIDE 48

Streaming Pipelines

Structured Streaming

[Pipeline: Keyword Extraction → Keyword Relevance → Similarity → DB Storage]

SLIDE 49

              Structured Streaming                            Spark Streaming
Time          Abstract (processing time, event time)          Fixed to micro-batch (streaming interval)
Execution     Micro-batch (best effort), Continuous (NRT)     Fixed micro-batch
Abstraction   DataFrames/Datasets                             DStream, RDD; access to the scheduler

SLIDE 50

Structured Streaming

New project? Structured Streaming for ~80% of cases, Spark Streaming for the remaining ~20%.

SLIDE 51

lightbend.com/fast-data-platform

SLIDE 52

SLIDE 53

Gerard Maas

Señor SW Engineer
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg

SLIDE 54

Thank You!