Streaming SQL to Unify Batch and Stream Processing: Theory and Practice with Apache Flink at Uber
Fabian Hueske, Shuyi Chen
Strata Data Conference, San Jose, March 7th 2018
What is Apache Flink?
Batch Processing
process static and historic data
Data Stream Processing
realtime results from data streams
Event-driven Applications
data-driven actions and services
[Diagram: queries, applications, devices, etc. exchange data with a database, a stream, and file/object storage; an application processes both historic data and data streams.]
Streaming platform service: billions of messages per day; a lot of Stream SQL
Streaming Platform as a Service: 3700+ containers running Flink, 1400+ nodes, 22k+ cores, 100s of jobs; fraud detection
Streaming analytics platform: 100s of jobs, 1000s of nodes, TBs of state; metrics, analytics, real-time ML, Streaming SQL as a platform
SQL / Table API (dynamic tables): high-level analytics API
DataStream API (streams, windows): stream & batch data processing
Process Function (events, state, time): stateful event-driven applications
// per-sensor aggregates over 5-second windows
val stats = stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .reduce((a, b) => a.add(b))
def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = {
  // work with event and state
  (event, state.value) match { … }

  state.update(…)  // modify state

  // schedule a timer callback
  ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
}
Table API:

tableEnvironment
  .scan("clicks")
  .groupBy('user)
  .select('user, 'url.count as 'cnt)

SQL:

SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user
The same query can be executed whether the input data is bounded (batch) or unbounded (streaming).
Clicks (input data is read at once):

user | cTime    | url
-----|----------|---------
Mary | 12:00:00 | https://…
Bob  | 12:00:00 | https://…
Mary | 12:00:02 | https://…
Liz  | 12:00:03 | https://…

SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user

Result (produced at once):

user | cnt
-----|----
Mary | 2
Bob  | 1
Liz  | 1
Clicks (input data is continuously read):

user | cTime    | url
-----|----------|---------
Mary | 12:00:00 | https://…
Bob  | 12:00:00 | https://…
Mary | 12:00:02 | https://…
Liz  | 12:00:03 | https://…

SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user

Result (continuously produced, updated with every input row):

user | cnt
-----|----
Mary | 1
Bob  | 1
Mary | 2
Liz  | 1
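The difference can be sketched in a few lines of Python (an illustration of the semantics, not Flink code): the batch query consumes the whole input and emits one result, while the streaming query emits an updated count after every row.

```python
from collections import Counter

clicks = [
    ("Mary", "12:00:00", "https://..."),
    ("Bob",  "12:00:00", "https://..."),
    ("Mary", "12:00:02", "https://..."),
    ("Liz",  "12:00:03", "https://..."),
]

# Batch: read all input at once, produce the result at once.
batch_result = Counter(user for user, _, _ in clicks)

# Streaming: read input row by row, emit an updated result per row.
counts, updates = Counter(), []
for user, _, _ in clicks:
    counts[user] += 1
    updates.append((user, counts[user]))

print(dict(batch_result))  # {'Mary': 2, 'Bob': 1, 'Liz': 1}
print(updates)             # [('Mary', 1), ('Bob', 1), ('Mary', 2), ('Liz', 1)]
```

The final entry per user in the streaming updates equals the batch result, which is exactly the batch/streaming unification the talk argues for.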
[Diagram: a timeline from past to future. A bounded query processes a fixed slice of the past; an unbounded query starts at some point of the stream (possibly its very start) and keeps processing past "now" into the future.]
§ The New York City Taxi & Limousine Commission provides a public data set about taxi rides in New York City
§ We can derive a streaming table from the data
§ Table: TaxiRides
rideId: BIGINT      // ID of the taxi ride
isStart: BOOLEAN    // flag for pick-up (true) or drop-off (false) event
lon: DOUBLE         // longitude of pick-up or drop-off location
lat: DOUBLE         // latitude of pick-up or drop-off location
rowtime: TIMESTAMP  // time of pick-up or drop-off event
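The queries below use two user-defined functions, isInNYC and toCellId. Their definitions are not given in the talk; a minimal Python sketch of the idea might look like the following, where the bounding box and the 0.01-degree cell size are illustrative assumptions.

```python
# Hypothetical sketches of the isInNYC and toCellId UDFs:
# isInNYC filters to a bounding box, toCellId maps a location to a grid cell.
# The coordinates and cell size below are assumptions for illustration,
# not the values used in the talk.

NYC_WEST, NYC_EAST = -74.05, -73.70
NYC_SOUTH, NYC_NORTH = 40.55, 40.92
CELL_DEG = 0.01  # grid resolution in degrees

def isInNYC(lon: float, lat: float) -> bool:
    """True if the location falls inside the (assumed) NYC bounding box."""
    return NYC_WEST <= lon <= NYC_EAST and NYC_SOUTH <= lat <= NYC_NORTH

def toCellId(lon: float, lat: float) -> int:
    """Map a location to the id of the grid cell containing it."""
    col = int((lon - NYC_WEST) / CELL_DEG)
    row = int((lat - NYC_SOUTH) / CELL_DEG)
    return row * 1000 + col  # pack (row, col) into a single id

# A pick-up in Manhattan is inside the box and maps to some cell id:
print(isInNYC(-73.99, 40.75), toCellId(-73.99, 40.75))
```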
SELECT
  cell,
  isStart,
  HOP_END(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) AS hopEnd,
  COUNT(*) AS cnt
FROM (
  SELECT rowtime, isStart, toCellId(lon, lat) AS cell
  FROM TaxiRides
  WHERE isInNYC(lon, lat))
GROUP BY
  cell,
  isStart,
  HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE)
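HOP(rowtime, INTERVAL '5' MINUTE, INTERVAL '15' MINUTE) defines 15-minute windows that slide every 5 minutes, so each event falls into size/slide = 3 overlapping windows. A small Python sketch of this window assignment (an illustration of the semantics, not Flink's implementation):

```python
from datetime import datetime, timedelta

SLIDE = timedelta(minutes=5)
SIZE = timedelta(minutes=15)

def hop_windows(rowtime: datetime, slide=SLIDE, size=SIZE):
    """Return the [start, end) hopping windows that contain rowtime."""
    epoch = datetime(1970, 1, 1)
    slide_s = slide.total_seconds()
    # Align the latest candidate window start to a multiple of the slide.
    aligned = epoch + timedelta(
        seconds=((rowtime - epoch).total_seconds() // slide_s) * slide_s)
    windows = []
    start = aligned
    while start + size > rowtime:      # window still covers the event
        windows.append((start, start + size))
        start -= slide
    return list(reversed(windows))     # in chronological order

event = datetime(2018, 3, 7, 12, 7, 30)
for start, end in hop_windows(event):
    print(start.time(), "->", end.time())
```

An event at 12:07:30 lands in the windows 11:55-12:10, 12:00-12:15, and 12:05-12:20, so each taxi ride is counted in three sliding counts per cell.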
SELECT
  pickUpCell,
  AVG(TIMESTAMPDIFF(MINUTE, s.rowtime, e.rowtime)) AS avgDuration
FROM
  (SELECT rideId, rowtime, toCellId(lon, lat) AS pickUpCell
   FROM TaxiRides
   WHERE isStart) s
JOIN
  (SELECT rideId, rowtime
   FROM TaxiRides
   WHERE NOT isStart) e
ON s.rideId = e.rideId
   AND e.rowtime BETWEEN s.rowtime AND s.rowtime + INTERVAL '1' HOUR
GROUP BY pickUpCell
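The join's logic can be mimicked in plain Python on a few toy events (the ride ids, cell ids, and times below are made up for illustration): match each drop-off to its ride's pick-up, keep only pairs within one hour, and average the duration per pick-up cell.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# (rideId, isStart, cellId, rowtime) -- toy rows standing in for TaxiRides
rides = [
    (1, True,  42, datetime(2018, 3, 7, 12, 0)),
    (2, True,  42, datetime(2018, 3, 7, 12, 5)),
    (1, False, 17, datetime(2018, 3, 7, 12, 20)),
    (2, False, 99, datetime(2018, 3, 7, 12, 35)),
]

# Index pick-up events by rideId (the "s" side of the join).
starts = {rid: (cell, t) for rid, is_start, cell, t in rides if is_start}

durations = defaultdict(list)
for rid, is_start, _, end_t in rides:
    if is_start or rid not in starts:
        continue
    cell, start_t = starts[rid]
    # ON ... AND e.rowtime BETWEEN s.rowtime AND s.rowtime + 1 HOUR
    if start_t <= end_t <= start_t + timedelta(hours=1):
        durations[cell].append((end_t - start_t).total_seconds() / 60)

# GROUP BY pickUpCell with AVG of the ride duration in minutes
avg_duration = {cell: sum(d) / len(d) for cell, d in durations.items()}
print(avg_duration)
```

The one-hour bound on e.rowtime is what makes this a time-windowed join: the streaming engine can discard a pick-up event from state once an hour has passed, keeping the join's state finite.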
[Diagram: the hopping-window count query shown above, reading the TaxiRides stream from Kafka and writing its result to Elasticsearch.]
Uber
q Operations people
q Data scientists
q Engineers
q Logging
q Backend services
q Storage systems
q Data management
q Monitoring
q Business logic
q Input
q Output
q Testing
q Debugging
q Resource estimation
q Deployment
q Monitoring & alerts
q Logging
q Maintenance
SELECT AVG(…) FROM eats_order WHERE …
[Diagram: the same AVG query, connected to HTTP and Pinot.]
Analyze input: Kafka input rate, Hive metastore data
Analyze query: SELECT * FROM ...
Test deployment: YARN containers, CPU, heap memory
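A rough sketch of how an observed Kafka input rate could translate into a container count. The per-core throughput and container size below are invented constants for illustration; the talk does not show the platform's actual estimator.

```python
# Back-of-envelope resource estimate from the observed Kafka input rate.
# msgs_per_core_sec and cores_per_container are illustrative assumptions,
# not Uber's actual sizing numbers.

def estimate_containers(msgs_per_sec: float,
                        msgs_per_core_sec: float = 20_000,
                        cores_per_container: int = 2) -> int:
    """Estimate YARN containers needed to sustain msgs_per_sec."""
    # Ceiling division: cores needed to keep up with the input rate.
    cores = max(1, -(-int(msgs_per_sec) // int(msgs_per_core_sec)))
    # Ceiling division: containers needed to host those cores.
    return -(-cores // cores_per_container)

print(estimate_containers(150_000))  # 150k msg/s -> 8 cores -> 4 containers
```

A test deployment then validates the estimate by checking real CPU and heap usage before the job is promoted.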
Sandbox → Staging → Production (jobs are promoted from one environment to the next)
Watchdog
SELECT
  restaurant_id,
  AVG(etd) AS avg_etd
FROM restaurant_stream
GROUP BY
  TUMBLE(proctime, INTERVAL '5' MINUTE),
  restaurant_id
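TUMBLE(proctime, INTERVAL '5' MINUTE) assigns each record to exactly one non-overlapping 5-minute window, so the query emits one average ETD per restaurant per window. A Python sketch of that grouping (toy data for illustration, not the production job):

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1)

def tumble_start(t: datetime, size=WINDOW) -> datetime:
    """Start of the single tumbling window containing t."""
    secs = (t - EPOCH).total_seconds()
    return EPOCH + timedelta(seconds=secs - secs % size.total_seconds())

# (restaurant_id, etd_minutes, proctime) -- toy rows for restaurant_stream
rows = [
    (7, 10.0, datetime(2018, 3, 7, 12, 1)),
    (7, 20.0, datetime(2018, 3, 7, 12, 4)),
    (7, 40.0, datetime(2018, 3, 7, 12, 6)),  # falls into the next window
]

# GROUP BY TUMBLE(proctime, ...), restaurant_id
groups = defaultdict(list)
for rid, etd, t in rows:
    groups[(tumble_start(t), rid)].append(etd)

avg_etd = {key: sum(v) / len(v) for key, v in groups.items()}
for (win, rid), avg in sorted(avg_etd.items()):
    print(win.time(), rid, avg)
```

Here the 12:01 and 12:04 rows share the 12:00-12:05 window (average 15.0), while the 12:06 row starts a fresh 12:05-12:10 window.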
Better Data -> Better Food -> Better Business = A Winning Recipe
Eats restaurant manager blog