Real-time Data Pipelines with Structured Streaming in Apache Spark
DataEngConf 2018 18th April, San Francisco
Tathagata “TD” Das
@tathadas
About Me
Started Spark Streaming project in AMPLab, UC Berkeley
Currently focused on building Structured Streaming
PMC Member of Apache Spark
Engineer on the StreamTeam @ Databricks
"we make all your streams come true"
[Diagram: Apache Spark, a unified processing engine]
Applications: SQL, Streaming, ML, Graph
Environments: YARN, EC2, ...
Data Sources
[Diagram: traditional architectures]
Data Warehouse: dump / ETL into structured data, then Analytics
Data Lake: dump of unstructured data and unstructured data streams
Messy data, not ready for analytics
[Diagram: separate data warehouses (DW1, DW2, DW3) plus DATALAKE1 feeding Incident Response, Alerting, Reports]
Security Infra
IDS/IPS, DLP, antivirus, load balancers, proxy servers
Cloud Infra & Apps
AWS, Azure, Google Cloud
Server Infra
Linux, Unix, Windows
Network Infra
Routers, switches, WAPs, databases, LDAP
Trillions of records
Separate warehouses for each type of analytics
Dump + complex ETL
Hours of delay in accessing data
Very expensive to scale
Proprietary formats
No advanced analytics (ML)
[Diagram: proposed architecture (DATALAKE2) — STRUCTURED STREAMING instead of the Dump + Complex ETL, feeding DELTA, which serves SQL, ML, and STREAMING for Incident Response, Alerting, Reports]
Data usable in minutes/seconds
Easy to scale
Open formats
Enables advanced analytics
data stream = unbounded input table
new data in the data stream = new rows appended to an unbounded table
Example
Read JSON data from Kafka, parse nested JSON, store in a structured Parquet table, and get end-to-end failure guarantees
ETL
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
Source
Specify where to read data from
Built-in support for Files / Kafka / Kinesis*
Can include multiple sources of different types using join() / union() (see the sketch after this block)
*Available only on Databricks Runtime
returns a DataFrame
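As an illustration of combining sources with union(), a hedged sketch (server address, topic name, and directory path are made up): a streaming Kafka source and a streaming file source can be unioned once their schemas line up.

// Sketch: two streaming sources of different types combined into one unbounded DataFrame.
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as raw")

val fileStream = spark.readStream
  .format("text")                      // file source: picks up new text files as they arrive
  .load("/incoming/raw/")
  .toDF("raw")

val combined = kafkaStream.union(fileStream)   // union by position; both have a single string column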
static data = bounded table
streaming data = unbounded table
SQL
spark.sql(" SELECT type, sum(signal) FROM devices GROUP BY type ")
Most familiar to BI Analysts
Supports SQL-2003, HiveQL
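The SQL above assumes a devices table; one way to get there from a streaming DataFrame (a minimal sketch, the parsedJson name is assumed) is to register the stream as a temporary view first.

// Sketch: parsedJson is assumed to be a streaming DataFrame with columns `type` and `signal`.
// Queries against the view are themselves streaming queries.
parsedJson.createOrReplaceTempView("devices")

val typeCounts = spark.sql("SELECT type, sum(signal) AS total_signal FROM devices GROUP BY type")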
val df: DataFrame = spark.table("device-data")
  .groupBy("type")
  .sum("signal")
Great for Data Scientists familiar with Pandas, R Dataframes
DataFrame / Dataset
val ds: Dataset[(String, Double)] = spark.table("device-data")
  .as[DeviceData]
  .groupByKey(_.`type`)
  .mapValues(_.signal)
  .reduceGroups(_ + _)
Great for Data Engineers who want compile-time type safety
Choose your hammer for whatever nail you have!
Same semantics, same performance
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
Kafka DataFrame
key      | value    | topic   | partition | offset | timestamp
[binary] | [binary] | "topic" | …         | 345    | 1486087873
[binary] | [binary] | "topic" | 3         | 2890   | 1486086721
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
Transformations
Cast bytes from Kafka records to a string, parse it as JSON, and generate nested columns
100s of built-in, optimized SQL functions like from_json
User-defined functions, lambdas, function literals with map, flatMap, … (see the sketch below)
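A hedged sketch of mixing a user-defined function into the same transformations (the UDF, the parsedData name, and the column names are illustrative assumptions, not from the deck):

import org.apache.spark.sql.functions.udf
import spark.implicits._

// Hypothetical UDF: normalize a device-type string.
val normalizeType = udf((t: String) => if (t == null) "unknown" else t.trim.toLowerCase)

// Applied like any built-in function, here on the nested column produced by from_json.
val cleaned = parsedData.withColumn("deviceType", normalizeType($"data.type"))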
Sink
Write transformed output to external storage systems
Built-in support for Files / Kafka
Use foreach to execute arbitrary code with the output data (see the sketch below)
Some sinks are transactional and exactly-once (e.g. files)
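The foreach sink runs user code for every output row; in Spark 2.3 this is expressed with a ForeachWriter. A minimal sketch (parsedData, the checkpoint path, and println standing in for a real external client are assumptions):

import org.apache.spark.sql.{ForeachWriter, Row}

// open/process/close are called per partition per trigger.
val writer = new ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = true   // return false to skip this partition
  def process(row: Row): Unit = println(row)                   // send the row to an external system
  def close(errorOrNull: Throwable): Unit = ()                 // release connections here
}

val query = parsedData.writeStream
  .foreach(writer)
  .option("checkpointLocation", "/checkpoints/foreach-demo")
  .start()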
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
Processing Details
Trigger: when to process data
Checkpoint location: for tracking the progress of the query
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()
DataFrames, Datasets, SQL → Logical Plan
[Diagram: Read from Kafka → Project (device, signal) → Filter (signal > 15) → Write to Parquet]
Spark SQL converts the batch-like query into a series of incremental execution plans operating on new batches of data
Optimized Plan
[Diagram: Kafka Source → optimized operators (codegen, off-heap, etc.) → Parquet Sink]
Series of Incremental Execution Plans
[Diagram: at t = 1, t = 2, t = 3, … the query processes the new data that arrived in each trigger]
Checkpointing
Saves processed offset info to stable storage (a write-ahead log)
Saved as JSON for forward-compatibility
Allows recovery from any failure
Can resume after limited changes to your streaming transformations (e.g. adding new filters to drop corrupted data)
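Recovery is driven entirely by the checkpoint directory: restarting the same query with the same checkpointLocation resumes from the last committed offsets. A sketch with made-up paths (parsedData as in the later slides):

// Kill the application, start it again with the same code and the same checkpointLocation,
// and processing resumes from the saved Kafka offsets instead of starting from scratch.
val query = parsedData.writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .option("checkpointLocation", "/checkpoints/kafka-to-parquet")
  .start()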
ETL
Raw data from Kafka available as structured data in seconds, ready for querying
Structured Streaming reuses the Spark SQL Optimizer and Tungsten Engine
40-core throughput (millions of records/s): Kafka Streams ~0.7M, Apache Flink ~22M, Structured Streaming ~65M
More details in our blog post
spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("parquet")
  .option("path", "/parquetTable/")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .option("checkpointLocation", "…")
  .start()
Business logic remains unchanged:
.selectExpr("cast(value as string) as json")
.select(from_json($"json", schema).as("data"))
Peripheral code decides whether it's a batch or a streaming query
spark.read.format("kafka")
  .option("kafka.bootstrap.servers", ...)
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .write
  .format("parquet")
  .option("path", "/parquetTable/")
  .save()
High latency (hours/minutes): execute on-demand, high throughput
Streaming with low latency (seconds): efficient resource allocation, high throughput
Streaming with ultra-low latency (milliseconds)**: static resource allocation
** experimental release in Spark 2.3, read our blog
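The experimental ultra-low-latency mode is presumably continuous processing, which Spark 2.3 exposes through a continuous trigger. A hedged sketch (server, topics, and checkpoint path are made up; only map-like queries and certain sources/sinks are supported):

import org.apache.spark.sql.streaming.Trigger

// Same query shape as before, but the continuous trigger asks for millisecond-scale latency.
// The "1 second" is the checkpoint interval, not a batch interval.
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("topic", "output")
  .option("checkpointLocation", "/checkpoints/continuous-demo")
  .trigger(Trigger.Continuous("1 second"))
  .start()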
Windowing is just another type of grouping in Structured Streaming
Supports UDAFs!

Number of records every hour:
parsedData
  .groupBy(window($"timestamp", "1 hour"))
  .count()

Avg signal strength of each device every 10 mins:
parsedData
  .groupBy($"device", window($"timestamp", "10 mins"))
  .avg("signal")
Aggregates have to be saved as distributed state between triggers (see the output-mode sketch after the diagram below)
Each trigger reads the previous state and writes an updated state
State is stored in memory, backed by a write-ahead log in HDFS
Fault-tolerant, exactly-once guarantee!
[Diagram: at each trigger (t = 1, 2, 3) the query reads new data from the source, reads the previous state, writes the updated state, and writes output to the sink; state updates are written to a write-ahead log for checkpointing]
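Since the windowed counts live in state and get revised as new data arrives, the sink sees them through an output mode. A minimal sketch using update mode and the console sink (sink choice and checkpoint path are illustrative only):

import org.apache.spark.sql.functions.{col, window}

// In "update" mode, each trigger emits only the windows whose counts changed in that trigger.
val query = parsedData
  .groupBy(window(col("timestamp"), "1 hour"))
  .count()
  .writeStream
  .outputMode("update")            // alternatives: "complete", or "append" once a watermark is set
  .format("console")
  .option("checkpointLocation", "/checkpoints/hourly-counts")
  .start()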
[Diagram: windowed counts (12:00-13:00, 13:00-14:00, …) being updated across triggers]
Keeping state allows late data to update counts of old windows
(red = state updated with late data)
But the size of the state increases indefinitely if old windows are not dropped
Watermark: a moving threshold of how late data is expected to be, and when to drop old state
Trails behind the max event time seen by the engine
Watermark delay = trailing gap
[Diagram: event-time axis with the max event time (12:30 PM), the watermark (12:20 PM), and the trailing gap between them; data older than the watermark is not expected]
Data newer than the watermark may be late, but is allowed to aggregate
Data older than the watermark is "too late" and dropped
Windows older than the watermark are automatically deleted to limit the amount of intermediate state
[Diagram: the watermark trails the max event time by the watermark delay (10 mins); late data newer than the watermark is allowed to aggregate, data older than the watermark is too late and dropped]
parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "5 minutes"))
  .count()
Useful only in stateful operations
Ignored in non-stateful streaming queries and batch queries
[Diagram: processing time vs. event time — the system tracks the max event time (12:14) and updates the watermark to 12:14 - 10 min = 12:04 for the next trigger; state older than 12:04 is deleted; data that arrives late but is newer than the watermark is still counted; data older than the watermark is too late, ignored in counts, and its state dropped]
More details in my blog post
Streaming Deduplication
Joins: stream-batch joins, stream-stream joins
Arbitrary Stateful Processing
[map|flatMap]GroupsWithState
stream1.join(stream2, "device")
parsedData.dropDuplicates("eventId")
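For stream-stream joins like the one above, Spark 2.3 typically needs watermarks and an event-time range condition so state for old, unmatched rows can be cleaned up. A hedged sketch (the impressions/clicks streams and their column names are assumptions):

import org.apache.spark.sql.functions.expr

val impressionsWithWatermark = impressions.withWatermark("impressionTime", "1 hour")
val clicksWithWatermark = clicks.withWatermark("clickTime", "2 hours")

// The time-range condition bounds how long unmatched rows must be kept as state.
val joined = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    impressionDeviceId = clickDeviceId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))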
See my previous Spark Summit talk and blog posts (here and here)
ds.groupByKey(_.id)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)
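Spelling out the skeleton above, a hedged sketch of mapGroupsWithState that keeps a running count per key (the Event/RunningCount case classes and the timeout choice are assumptions, not from the deck):

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

// Assumed input and state types, for illustration only.
case class Event(id: String, value: Double)
case class RunningCount(count: Long)

// Called once per key per trigger, with that key's new events and its previously stored state.
def updateCount(id: String, events: Iterator[Event],
                state: GroupState[RunningCount]): (String, Long) = {
  val previous = if (state.exists) state.get.count else 0L
  val updated = RunningCount(previous + events.size)
  state.update(updated)               // persisted in the state store for the next trigger
  (id, updated.count)
}

val counts = ds                        // ds: Dataset[Event], as in the skeleton above
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateCount _)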
[Recap: STRUCTURED STREAMING (instead of Dump + Complex ETL) feeds DELTA, serving SQL, ML, STREAMING for Incident Response, Alerting, Reports]
[Diagram builds: Events flow into a Data Lake, feeding Streaming Analytics and Reporting; keeping this working requires more and more pieces]
(1) λ-arch: separate streaming and batch paths
(2) Validation
(3) Reprocessing (data partitioned)
(4) Compaction: compact small files, scheduled to avoid compaction
The LOW-LATENCY of streaming, the RELIABILITY & PERFORMANCE of data warehouses, the SCALE of data lakes:
THE GOOD OF DATA LAKES + THE GOOD OF DATA WAREHOUSES
Decouple Compute & Storage
ACID Transactions & Data Validation
Data Indexing & Caching (10-100x)
Data stored as Parquet, ORC, etc.
Integrated with Structured Streaming
MASSIVE SCALE · RELIABILITY · PERFORMANCE · LOW-LATENCY · OPEN
[Diagram: with DELTA in place of each of the four pieces — (1) λ-arch, (2) Validation, (3) Reprocessing, (4) Compaction — the same Events feed Streaming Analytics and Reporting]
Easy, as short term and long term data are in one location
Easy and seamless with Delta's transactional guarantees
Not needed, Delta handles both short and long term data
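As an illustration of the Structured Streaming integration, a hedged sketch of streaming writes into a Databricks Delta table (the schema, table path, and checkpoint path are made up; Delta was Databricks-only at the time of this talk):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

// Assumed JSON schema, standing in for the `schema` used in the earlier slides.
val schema = new StructType()
  .add("device", StringType)
  .add("signal", DoubleType)
  .add("timestamp", TimestampType)

// Same Kafka-parsing query as before, but writing to a Delta table instead of plain Parquet;
// Delta's transactional log is what backs the validation / reprocessing / compaction story above.
val query = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "topic")
  .load()
  .selectExpr("cast(value as string) as json")
  .select(from_json($"json", schema).as("data"))
  .writeStream
  .format("delta")
  .option("path", "/delta/events")
  .option("checkpointLocation", "/checkpoints/delta-events")
  .start()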
Unified Analytics Platform
Data sources: DATA WAREHOUSES, CLOUD STORAGE, HADOOP STORAGE, IoT / STREAMING DATA
DATABRICKS COLLABORATIVE NOTEBOOKS: Explore Data, Train Models, Serve Models — increases data science productivity by 5x
DATABRICKS MANAGED SERVICE: Databricks Enterprise Security, Serverless, SLAs — removes DevOps & infrastructure complexity; open, extensible APIs
DATABRICKS DELTA: performance, reliability — higher performance & reliability for your Data Lake
DATABRICKS RUNTIME: I/O performance — improves performance by 10-20X over Apache Spark
[Recap: STRUCTURED STREAMING (instead of Dump + Complex ETL) feeds DELTA, serving SQL, ML, STREAMING for Incident Response, Alerting, Reports]
STRUCTURED STREAMING: fast, scalable, fault-tolerant stream processing with high-level, user-friendly APIs
DELTA: data storage solution with the reliability of data warehouses and the scalability of data lakes
Structured Streaming Programming Guide
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Databricks blog posts for more focused discussions on streaming
https://databricks.com/blog/category/engineering/streaming
Databricks Delta
https://databricks.com/product/databricks-delta