Processing Fast Data with Apache Spark:
The Tale of Two Streaming APIs
Gerard Maas Senior SW Engineer, Lightbend, Inc.
Gerard Maas
Señor SW Engineer
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
Agenda
○ What Is Spark and Why Should We Care?
○ Streaming APIs in Spark
○ Spark Streaming vs. Structured Streaming
Once upon a time...
The Apache Spark stack:
Apache Spark Core
Spark SQL
Spark MLlib
Spark Streaming
Structured Streaming
Datasets/DataFrames
GraphFrames
Data Sources
Structured Streaming

Sources: Kafka, Sockets, HDFS/S3, Custom
  -> Streaming DataFrame
  -> Query
Sinks: Kafka, Files, foreachSink, console, memory
Output Mode
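The output mode decides what each trigger emits to the sink. A minimal sketch, assuming a streaming DataFrame named `sensorData` like the one built later in the demo:

```scala
// Output modes in Structured Streaming:
//   "append"   - emit only new result rows
//   "update"   - emit only rows changed since the last trigger
//   "complete" - emit the entire result table on every trigger
val consoleQuery = sensorData.writeStream
  .format("console")
  .outputMode("append")
  .start()
```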
Demo architecture:
Sensor Data Multiplexer -> Structured Streaming -> Sensor Anomaly Detection
(running in a Spark Notebook as a local process)
val rawData = sparkSession.readStream
  .format("kafka")  // also: csv, json, parquet, socket, rate
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "latest")
  .load()
...
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
val sensorData = jsonValues.select("record.*").as[SensorData]
...
...
val movingAverage = sensorData
  .withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
  .withWatermark("timestamp", "30 seconds")
  .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
  .agg(avg($"temp"))
...
...
val visualizationQuery = sensorData.writeStream
  .queryName("visualization")  // this will be the SQL table name
  .format("memory")
  .outputMode("update")
  .start()
...
val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter")
  .format("kafka")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint")
  .start()
Datasets
Structured Streaming
Spark Streaming

Sources: Kafka, Flume, Kinesis, Twitter, Sockets, HDFS/S3, Custom
Engine: Apache Spark, Spark SQL, Spark ML, ...
Sinks: Databases, HDFS, API Server, Streams
A DStream[T] is a sequence of RDD[T], one per batch interval (t0, t1, t2, ..., ti, ti+1).
A transformation T -> U produces a matching RDD[U] for each underlying RDD[T].
Actions are then applied to each resulting RDD.
API: Transformations

map, flatMap, filter
count, reduce, countByValue, reduceByKey
union, join, cogroup
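A minimal sketch of chaining these transformations on a DStream; the socket source on port 9999 is a hypothetical test input, not part of the demo:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("transformations")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines
  .flatMap(_.split("\\s+"))  // split each line into words
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)        // per-batch word counts

counts.print()
ssc.start()
ssc.awaitTermination()
```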
API: Transformations
mapWithState
…
…
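A minimal `mapWithState` sketch, assuming a hypothetical `readings: DStream[(String, Double)]` keyed by sensor id; the state carries a running total per key across batches:

```scala
import org.apache.spark.streaming.{State, StateSpec}

val updateTotal = (id: String, reading: Option[Double], state: State[Double]) => {
  val total = state.getOption.getOrElse(0.0) + reading.getOrElse(0.0)
  state.update(total)  // carry the total over to the next batch
  (id, total)          // element emitted downstream
}

val totals = readings.mapWithState(StateSpec.function(updateTotal))
```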
API: Transformations

transform

val iotDstream = MQTTUtils.createStream(...)
val devicePriority = sparkContext.cassandraTable(...)
val prioritizedDStream = iotDstream.transform { rdd =>
  rdd.map(d => (d.id, d)).join(devicePriority)
}
Actions

saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles
foreachRDD

foreachRDD opens the door to any other Spark API: Spark SQL, DataFrames, GraphFrames, ...
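A sketch of `foreachRDD` bridging into the DataFrame API, assuming a hypothetical `sensorDStream: DStream[SensorData]` (with `SensorData` a case class) and an illustrative output path:

```scala
import org.apache.spark.sql.SparkSession

sensorDStream.foreachRDD { rdd =>
  // obtain (or reuse) a SparkSession inside the output operation
  val spark = SparkSession.builder
    .config(rdd.sparkContext.getConf)
    .getOrCreate()
  import spark.implicits._

  spark.createDataset(rdd)          // RDD[SensorData] -> Dataset[SensorData]
    .write
    .mode("append")
    .parquet("/tmp/sensor-batches") // hypothetical sink
}
```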
import org.apache.spark.streaming.{Seconds, StreamingContext}

val streamingContext = new StreamingContext(sparkContext, Seconds(10))
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> kafkaBootstrapServer,
  "group.id" -> "sensor-tracker-group",
  "auto.offset.reset" -> "largest",
  "enable.auto.commit" -> (false: java.lang.Boolean).toString
)
val topics = Set(topic)

@transient val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)
import spark.implicits._

val sensorDataStream = stream.transform { rdd =>
  val jsonData = rdd.map { case (k, v) => v }
  val ds = sparkSession.createDataset(jsonData)
  val jsonDF = spark.read.json(ds)
  val sensorDataDS = jsonDF.as[SensorData]
  sensorDataDS.rdd
}
val model = new M2Model()
…
model.trainOn(inputData)
…
val scoredDStream = model.predictOnValues(inputData)
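`M2Model` above is shorthand for a streaming model with a train/score API. A concrete sketch using Spark's own `StreamingKMeans` (not necessarily the model used in the talk), assuming hypothetical streams `trainingData: DStream[Vector]` and `labeledData: DStream[(String, Vector)]`:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans

val model = new StreamingKMeans()
  .setK(3)                  // number of clusters
  .setDecayFactor(1.0)      // how quickly older data is forgotten
  .setRandomCenters(2, 0.0) // data dimension, initial center weight

model.trainOn(trainingData)                      // keep learning from one stream
val scored = model.predictOnValues(labeledData)  // score another stream
```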
suspects.foreachRDD { rdd =>
  val sample = rdd.take(20).map(_.toString)
  val total = s"total found: ${rdd.count}"
}
Use cases
○ Learn
○ Score
Demo architecture:
Sensor Data Multiplexer -> Structured Streaming -> Sensor Anomaly Detection (Real-Time Detection)
(local process)
Spark Streaming + Structured Streaming
val parse: Dataset[String] => Dataset[Record] = ???
val process: Dataset[Record] => Dataset[Result] = ???
val serialize: Dataset[Result] => Dataset[String] = ???

Structured Streaming:

val kafkaStream = spark.readStream…
val f = parse andThen process andThen serialize
val result = f(kafkaStream)
result.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", writeTopic)
  .option("checkpointLocation", checkpointLocation)
  .start()

Spark Streaming:

val dstream = KafkaUtils.createDirectStream(...)
dstream.foreachRDD { rdd =>
  val ds = sparkSession.createDataset(rdd)
  val f = parse andThen process andThen serialize
  val result = f(ds)
  result.write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", writeTopic)
    .option("checkpointLocation", checkpointLocation)
    .save()
}
Streaming Pipelines

Structured Streaming: Keyword Extraction -> Keyword Relevance -> Similarity -> DB Storage
Structured Streaming vs. Spark Streaming:

              | Structured Streaming                    | Spark Streaming
Time          | Abstract (Processing Time, Event Time)  | Fixed to micro-batch (streaming interval)
Execution     | Micro-batch, best-effort micro-batch,   | Fixed micro-batch
              | Continuous (NRT)                        |
Abstraction   | DataFrames/Datasets                     | DStream, RDD; access to the scheduler
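The Execution row of the table maps to Structured Streaming's `Trigger` settings; a sketch assuming a streaming DataFrame named `result`:

```scala
import org.apache.spark.sql.streaming.Trigger

// With no trigger set, Spark runs best-effort micro-batches.
val query = result.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds")) // fixed micro-batch interval
  // .trigger(Trigger.Continuous("1 second"))    // continuous (NRT) mode, Spark 2.3+
  .start()
```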
lightbend.com/fast-data-platform
Gerard Maas
Señor SW Engineer
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg