Streaming SQL to Unify Batch and Stream Processing: Theory and Practice with Apache Flink at Uber


SLIDE 1

Streaming SQL to Unify Batch and Stream Processing: Theory and Practice with Apache Flink at Uber

Strata Data Conference, San Jose, March 7, 2018

Fabian Hueske Shuyi Chen

SLIDE 2

What is Apache Flink?

Batch Processing

process static and historic data

Data Stream Processing

realtime results from data streams

Event-driven Applications

data-driven actions and services

Stateful Computations Over Data Streams

SLIDE 3

What is Apache Flink?

Stateful computations over data streams, real-time and historic: fast, scalable, fault-tolerant, in-memory, event time, large state, exactly-once.

[Diagram: historic data and streams flow into a Flink application; results feed queries, applications, devices, a database, a stream, or file/object storage.]

SLIDE 4

Hardened at scale

  • Streaming Platform Service: billions of messages per day, a lot of Stream SQL
  • Streaming Platform as a Service: 3700+ containers running Flink, 1400+ nodes, 22k+ cores, 100s of jobs, fraud detection
  • Streaming Analytics Platform: 100s of jobs, 1000s of nodes, TBs of state; metrics, analytics, real-time ML, Streaming SQL as a platform

SLIDE 5

Powerful Abstractions

SQL / Table API (dynamic tables) — high-level analytics API
DataStream API (streams, windows) — stream & batch data processing
Process Function (events, state, time) — stateful event-driven applications

val stats = stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .reduce((a, b) => a.add(b))

def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = {
  // work with event and state
  (event, state.value) match { … }

  out.collect(…)   // emit events
  state.update(…)  // modify state

  // schedule a timer callback
  ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
}

Layered abstractions to navigate simple to complex use cases

SLIDE 6

Apache Flink’s Relational APIs

Unified APIs for batch & streaming data

A query specifies exactly the same result regardless of whether its input is static batch data or streaming data.


LINQ-style Table API:

tableEnvironment
  .scan("clicks")
  .groupBy('user)
  .select('user, 'url.count as 'cnt)

ANSI SQL:

SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user

SLIDE 7

Query Translation

tableEnvironment
  .scan("clicks")
  .groupBy('user)
  .select('user, 'url.count as 'cnt)

SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user

Batch: input data is bounded
Streaming: input data is unbounded

SLIDE 8

What if “clicks” is a file?

Clicks

SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user

user | cTime    | url
Mary | 12:00:00 | https://…
Bob  | 12:00:00 | https://…
Mary | 12:00:02 | https://…
Liz  | 12:00:03 | https://…

Result:

user | cnt
Mary | 2
Bob  | 1
Liz  | 1

Input data is read at once. The result is produced at once.

SLIDE 9

What if “clicks” is a stream?

SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user

Clicks (arriving continuously):

user | cTime    | url
Mary | 12:00:00 | https://…
Bob  | 12:00:00 | https://…
Mary | 12:00:02 | https://…
Liz  | 12:00:03 | https://…

Result (updated continuously):

user | cnt
Bob  | 1
Liz  | 1
Mary | 1 → 2

Input data is continuously read. The result is continuously produced.

The result is identical!
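The identical-result claim can be simulated outside Flink. A minimal Python sketch (hypothetical data mirroring the clicks table above; not a Flink API) runs the same GROUP BY count once over all input (batch) and once row by row (streaming), and the final result matches:

```python
from collections import Counter

# Hypothetical click log, mirroring the "clicks" table on the slide.
clicks = [
    ("Mary", "12:00:00", "https://a"),
    ("Bob",  "12:00:00", "https://b"),
    ("Mary", "12:00:02", "https://c"),
    ("Liz",  "12:00:03", "https://d"),
]

def batch_count(rows):
    """Read all input at once, produce the result at once (batch)."""
    return Counter(user for user, _, _ in rows)

def stream_count(rows):
    """Consume rows one by one, emitting an updated count after each (streaming)."""
    counts = Counter()
    updates = []
    for user, _, _ in rows:
        counts[user] += 1
        updates.append((user, counts[user]))  # continuous result updates
    return counts, updates

final_stream, updates = stream_count(clicks)
assert batch_count(clicks) == final_stream  # the result is identical
print(updates)  # [('Mary', 1), ('Bob', 1), ('Mary', 2), ('Liz', 1)]
```

The streaming run emits an update per row (Mary's count goes from 1 to 2), but the materialized end state equals the batch result.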

SLIDE 10

Why is stream-batch unification important?

§ Usability

  • ANSI SQL syntax: No custom “StreamSQL” syntax.
  • ANSI SQL semantics: No stream-specific results.

§ Portability

  • Run the same query on bounded and unbounded data
  • Run the same query on recorded and real-time data

§ Do we need to soften SQL semantics for streaming?

[Diagram: a bounded query evaluates a fixed span of the past; an unbounded query runs from a point in the stream (or its start) into the future.]

SLIDE 11

DBMSs Run Queries on Streams

§ Materialized views (MVs) are similar to regular views, but their results are persisted to disk or memory

  • Used to speed up analytical queries
  • MVs need to be updated when the base tables change

§ MV maintenance is very similar to SQL on streams

  • Base table updates form a stream of DML statements
  • The MV definition query is evaluated on that stream
  • The MV is the query result and is continuously updated


SLIDE 12

Continuous Queries in Flink

§ Core concept is a “Dynamic Table”

  • Dynamic tables change over time

§ Queries on dynamic tables

  • produce new dynamic tables (which are updated based on input)
  • do not terminate

§ Stream ↔ Dynamic table conversions


SLIDE 13

Stream ↔ Dynamic Table Conversions

§ Append Conversions

  • Records are only inserted/appended

§ Upsert Conversions

  • Records are inserted/updated/deleted and have a (composite) unique key

§ Changelog Conversions

  • Records are inserted/updated/deleted

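The upsert and changelog conversions can be illustrated with a small sketch (plain Python, not the Flink API; the data and helper names are made up). An upsert stream is keyed, so a new record for an existing key overwrites the old row; a changelog stream carries explicit add (+) and retract (-) records:

```python
# Illustrative sketch (not Flink API): materializing a dynamic table from
# an upsert stream (keyed) vs. a changelog stream (explicit +/- records).

def apply_upserts(events):
    """Upsert conversion: records carry a unique key; a None value means delete."""
    table = {}
    for key, value in events:
        if value is None:
            table.pop(key, None)   # delete by key
        else:
            table[key] = value     # insert or update by key
    return table

def apply_changelog(events):
    """Changelog conversion: '+' adds a row, '-' retracts a previously added row."""
    table = []
    for op, row in events:
        if op == "+":
            table.append(row)
        else:
            table.remove(row)      # retraction of an earlier row
    return table

upserts = [("Mary", 1), ("Bob", 1), ("Mary", 2), ("Liz", None)]
changelog = [("+", ("Mary", 1)), ("+", ("Bob", 1)),
             ("-", ("Mary", 1)), ("+", ("Mary", 2))]

assert apply_upserts(upserts) == {"Mary": 2, "Bob": 1}
assert apply_changelog(changelog) == [("Bob", 1), ("Mary", 2)]
```

Both streams materialize to the same table; the upsert encoding is more compact because an update is one record instead of a retract/add pair.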

SLIDE 14

SQL Feature Set in Flink 1.5.0

§ SELECT FROM WHERE

§ GROUP BY / HAVING

  • Non-windowed, TUMBLE, HOP, SESSION windows

§ JOIN

  • Windowed INNER, LEFT / RIGHT / FULL OUTER JOIN
  • Non-windowed INNER JOIN

§ Scalar, aggregation, and table-valued UDFs

§ SQL CLI Client (beta)

§ [streaming only] OVER / WINDOW

  • UNBOUNDED / BOUNDED PRECEDING

§ [batch only] UNION / INTERSECT / EXCEPT / IN / ORDER BY


SLIDE 15

What can I build with this?

§ Data Pipelines

  • Transform, aggregate, and move events in real-time

§ Low-latency ETL

  • Convert and write streams to file systems, DBMSs, K-V stores, indexes, …
  • Convert newly appearing files into streams

§ Stream & Batch Analytics

  • Run analytical queries over bounded and unbounded data
  • Query and compare historic and real-time data

§ Data Preparation for Live Dashboards

  • Compute and update data to visualize in real-time


SLIDE 16

The New York Taxi Rides Data Set

§ The New York City Taxi & Limousine Commission provides a public data set about taxi rides in New York City

§ We can derive a streaming table from the data

§ Table: TaxiRides

rideId: BIGINT      // ID of the taxi ride
isStart: BOOLEAN    // flag for pick-up (true) or drop-off (false) event
lon: DOUBLE         // longitude of the pick-up or drop-off location
lat: DOUBLE         // latitude of the pick-up or drop-off location
rowtime: TIMESTAMP  // time of the pick-up or drop-off event


SLIDE 17

Identify popular pick-up / drop-off locations

SELECT
  cell, isStart,
  HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd,
  COUNT(*) AS cnt
FROM
  (SELECT rowtime, isStart, toCellId(lon, lat) AS cell
   FROM TaxiRides
   WHERE isInNYC(lon, lat))
GROUP BY
  cell, isStart,
  HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)


§ Compute, every 5 minutes and for each location, the number of departing and arriving taxis of the last 15 minutes.
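The HOP window assignment behind this query can be sketched in plain Python (illustrative only; `hop_windows` is a made-up helper, not a Flink API). With a 5-minute slide and 15-minute size, each event belongs to three overlapping windows:

```python
# Sketch of HOP(rowtime, 5 min, 15 min) semantics: each event is assigned to
# every 15-minute window that slides by 5 minutes and contains its timestamp.

SLIDE, SIZE = 5, 15  # minutes

def hop_windows(ts_min):
    """Return the (start, end) of each hopping window containing ts_min."""
    first_start = (ts_min // SLIDE) * SLIDE - SIZE + SLIDE
    return [(s, s + SIZE) for s in range(first_start, ts_min + 1, SLIDE)
            if s <= ts_min < s + SIZE]

# An event at minute 12 falls into three overlapping 15-minute windows.
print(hop_windows(12))  # [(0, 15), (5, 20), (10, 25)]
```

Because every event contributes to SIZE/SLIDE = 3 windows, each count in the result covers 15 minutes of data but a fresh result row is emitted every 5 minutes.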
SLIDE 18

Average ride duration per pick-up location

SELECT
  pickUpCell,
  AVG(TIMESTAMPDIFF(MINUTE, s.rowtime, e.rowtime)) AS avgDuration
FROM
  (SELECT rideId, rowtime, toCellId(lon, lat) AS pickUpCell
   FROM TaxiRides
   WHERE isStart) s
JOIN
  (SELECT rideId, rowtime
   FROM TaxiRides
   WHERE NOT isStart) e
  ON s.rideId = e.rideId
  AND e.rowtime BETWEEN s.rowtime AND s.rowtime + INTERVAL '1' HOUR
GROUP BY pickUpCell


§ Join start ride and end ride events on rideId and compute average ride duration per pick-up location.
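The join logic can be mimicked in a small Python sketch (hypothetical data and cell names; not the Flink runtime). The time bound keeps state finite in streaming mode: only end events within one hour of a start event are matched:

```python
# Sketch of the time-windowed join above: match each end event to its start
# event by rideId, keep only pairs where the end occurs within one hour of
# the start, then average ride durations per pick-up cell.

from collections import defaultdict

# (rideId, pickUpCell, minute-of-day) for starts; (rideId, minute-of-day) for ends.
starts = [(1, "cellA", 0), (2, "cellA", 10), (3, "cellB", 20)]
ends = [(1, 30), (2, 40), (3, 95)]  # ride 3 ends 75 min after start: outside the bound

durations = defaultdict(list)
start_by_id = {rid: (cell, t) for rid, cell, t in starts}
for rid, end_t in ends:
    cell, start_t = start_by_id[rid]
    if start_t <= end_t <= start_t + 60:  # e.rowtime BETWEEN s.rowtime AND s.rowtime + 1 HOUR
        durations[cell].append(end_t - start_t)

avg = {cell: sum(d) / len(d) for cell, d in durations.items()}
print(avg)  # {'cellA': 30.0}
```

Ride 3 is dropped because its end event falls outside the one-hour interval, just as the streaming join would discard (and eventually stop waiting for) it.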

SLIDE 19

Building a Dashboard


[Diagram: Kafka source → Flink SQL query → Elasticsearch sink → dashboard]

SELECT
  cell, isStart,
  HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd,
  COUNT(*) AS cnt
FROM
  (SELECT rowtime, isStart, toCellId(lon, lat) AS cell
   FROM TaxiRides
   WHERE isInNYC(lon, lat))
GROUP BY
  cell, isStart,
  HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)

SLIDE 20

Flink SQL in Production @ UBER

SLIDE 21

Uber's business is Real-Time


SLIDE 22

Challenges

Infrastructure

  • 100s of billions of messages / day
  • At-least-once processing
  • Exactly-once state processing
  • 99.99% SLA on availability
  • 99.99% SLA on latency

Productivity

  • Target audience: operations people, data scientists, engineers
  • Integrations: logging, backend services, storage systems, data management, monitoring

Operation

  • ~1000 streaming jobs
  • Multiple DCs

SLIDE 23

Stream processing @ Uber

§ Apache Samza (Since Jul. 2015)

  • Scalable
  • At-least-once message processing
  • Managed state
  • Fault tolerance

§ Apache Flink (Since May, 2017)

  • All of the above
  • Exactly-once stateful computation
  • Accurate
  • Unified stream & batch processing with SQL


SLIDE 24

Lifecycle of building a streaming job

SLIDE 25

Writing the job

Business logic · Input · Output · Testing · Debugging

  • Java/Scala
  • Streaming/batch
  • Duplicate code
SLIDE 26

Running the job

Resource estimation · Deployment · Monitoring & alerts · Logging · Maintenance

  • Manual process
  • Hard to scale beyond ~10 jobs
SLIDE 27

Job from idea to production takes days

SLIDE 28

How can we improve efficiency as a platform?

SLIDE 29

Flink SQL to the rescue

SELECT AVG(…) FROM eats_order WHERE …

SLIDE 30

Connectors

[Diagram: connectors such as HTTP and Pinot around the query SELECT AVG(…) FROM eats_order WHERE …]

SLIDE 31

UI & Backend services

§ To make it self-service

  • SQL composition & validation
  • Connectors management


SLIDE 32

UI & Backend services

§ To make it self-service

  • Job compilation and generation
  • Resource estimation


[Diagram: Analyze input (Kafka input rate, Hive metastore data) → Analyze query (SELECT * FROM ...) → Test deployment (YARN containers, CPU, heap memory)]

SLIDE 33

UI & Backend services

§ To make it self-service

  • Job deployment

Sandbox

  • Functional correctness
  • Play around with SQL

Staging

  • System-generated estimates
  • Production-like load

Production

  • Managed

Jobs are promoted from sandbox to staging to production.

SLIDE 34

UI & Backend services

§ To make it self-service

  • Job management
SLIDE 35

UI & Backend services

§ To support Uber scale

  • Monitoring and alert automation
  • Auto-scaling
  • Job recovery
  • DC failover


Watchdog

SLIDE 36

AthenaX

§ Uber's Self-service stream and batch processing platform

  • SQL
  • Connectors
  • UI & backend services


SLIDE 37

Business use cases

SLIDE 38

Real-time Machine Learning - UberEats ETD

SELECT restaurant_id, AVG(etd) AS avg_etd
FROM restaurant_stream
GROUP BY TUMBLE(proctime, INTERVAL '5' MINUTE), restaurant_id
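The TUMBLE semantics behind this query can be sketched in plain Python (illustrative data and field values only; not the Flink runtime). Events fall into fixed, non-overlapping 5-minute buckets, and AVG(etd) is computed per (window, restaurant_id):

```python
# Sketch of TUMBLE(proctime, 5 min): events are bucketed into fixed,
# non-overlapping 5-minute windows, and AVG(etd) is computed per
# (window, restaurant_id). The data below is made up.

from collections import defaultdict

events = [  # (restaurant_id, minute, etd)
    ("r1", 0, 10), ("r1", 3, 14), ("r1", 6, 20), ("r2", 4, 8),
]

sums = defaultdict(lambda: [0, 0])
for rid, minute, etd in events:
    window = minute // 5 * 5          # start of the 5-minute tumbling window
    acc = sums[(window, rid)]
    acc[0] += etd                     # running sum per (window, restaurant)
    acc[1] += 1                       # running count per (window, restaurant)

avg_etd = {k: s / n for k, (s, n) in sums.items()}
print(avg_etd)  # {(0, 'r1'): 12.0, (5, 'r1'): 20.0, (0, 'r2'): 8.0}
```

Unlike HOP, each event contributes to exactly one window, so one result row per restaurant is emitted every 5 minutes.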

SLIDE 39

Powering Restaurant Manager

Better Data -> Better Food -> Better Business = A Winning Recipe

Eats restaurant manager blog

SLIDE 40

AthenaX wins

§ SQL abstraction with Flink

  • Enables non-engineers to use stream processing

§ E2E self-service

§ Job from idea to production takes minutes/hours

§ Centralized place to track streaming dataflows

§ Minimal human intervention, and scales operationally to ~1000 jobs in production


SLIDE 41

AthenaX Open Source

§ Uber engineering blog

§ Open source repository


SLIDE 42

15% discount code: StrataFlink

SLIDE 43

Flink Forward SF 2018 Presenters


SLIDE 44

Thank you!

@fhueske @ApacheFlink

Available on O’Reilly Early Release!