SLIDE 1 STREAM PROCESSING @ UBER
DANNY YUAN @ UBER
SLIDE 2
What is Uber
SLIDE 3
Transportation at your fingertips
SLIDE 4
SLIDE 5
Stream Data Allows Us To Feel The Pulse Of Cities
SLIDE 6
Marketplace Health
SLIDE 7
What’s Going on Now
SLIDE 8
What’s Happened?
SLIDE 9
Status Tracking
SLIDE 10
SLIDE 11
SLIDE 12
SLIDE 13
A Little Background
SLIDE 14
Uber’s Platform Is a Distributed State Machine
Rider States
SLIDE 15 Uber’s Platform Is a Distributed State Machine
Rider States Driver States
SLIDE 16
Applications can’t do everything
SLIDE 17
Instead, Applications Emit Events
SLIDE 18
Events Should Be Available In Seconds
SLIDE 19
Events Should Rarely Get Lost
SLIDE 20
Events Should Be Cheap And Scalable
SLIDE 21
SLIDE 22
Where are the challenges?
SLIDE 23
Many Dimensions
Dozens of fields per event
SLIDE 24
Granular Data
SLIDE 25
Granular Data
SLIDE 26
Granular Data
Over 10,000 hexagons in the city
SLIDE 27
Granular Data
7 vehicle types
SLIDE 28
Granular Data
1440 minutes in a day
SLIDE 29
Granular Data
13 driver states
SLIDE 30
Granular Data
300 cities
SLIDE 31
Granular Data
1 day of data: 300 x 10,000 x 7 x 1440 x 13 = 393 billion possible combinations
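The arithmetic on this slide can be checked directly. A minimal sketch: only the five counts come from the slides, and the dictionary keys are illustrative names.

```python
from math import prod

# Cardinality of each dimension, as quoted on the preceding slides.
dimensions = {
    "cities": 300,
    "hexagons_per_city": 10_000,
    "vehicle_types": 7,
    "minutes_per_day": 1440,
    "driver_states": 13,
}

# One day of data: every combination of the five dimensions is a possible cell.
combinations = prod(dimensions.values())
print(f"{combinations:,}")  # 393,120,000,000
```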
SLIDE 32
Unknown Query Patterns
Any combination of dimensions
SLIDE 33 Variety of Aggregations
- Heatmap
- Top N
- Histogram
- count(), avg(), sum(), percent(), geo
SLIDE 34
Different Geo Aggregation
SLIDE 35 Large Data Volume
- Hundreds of thousands of events per second, or billions of events per day
- At least dozens of fields in each event
SLIDE 36
Tight Schedule
SLIDE 37
Key: Generalization
SLIDE 38 Data Type
Temporal-Spatial Data
Dimension      Value
state          driver_arrived
vehicle type   uberX
timestamp      13244323342
latitude       12.23
longitude      30.00
SLIDE 39 Data Query
- OLAP on single-table temporal-spatial data
SELECT <agg functions>, <dimensions>
FROM <data_source>
WHERE <boolean filter>
GROUP BY <dimensions>
HAVING <boolean filter>
ORDER BY <sorting criteria>
LIMIT <n>
DO <post aggregation>
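To make the query template concrete, here is a minimal in-memory sketch of the same shape (filter, group, aggregate, having, order, limit) over a list of event dictionaries. The function `run_query`, its parameters, and the sample events are all hypothetical and are not taken from the talk; the `DO <post aggregation>` step is left out.

```python
from collections import defaultdict


def run_query(rows, dimensions, aggregations, where=None, having=None,
              order_by=None, limit=None):
    """Evaluate a SELECT ... GROUP BY style query over a list of dict rows."""
    # WHERE + GROUP BY: bucket the rows that pass the filter by dimension values.
    groups = defaultdict(list)
    for row in rows:
        if where is None or where(row):
            key = tuple(row[d] for d in dimensions)
            groups[key].append(row)

    # SELECT + HAVING: compute each aggregate per group, then filter groups.
    results = []
    for key, members in groups.items():
        out = dict(zip(dimensions, key))
        for name, fn in aggregations.items():
            out[name] = fn(members)
        if having is None or having(out):
            results.append(out)

    # ORDER BY + LIMIT.
    if order_by is not None:
        results.sort(key=order_by)
    return results[:limit] if limit is not None else results


# Illustrative events only.
events = [
    {"state": "driver_arrived", "vehicle_type": "uberX", "city": "sf"},
    {"state": "driver_arrived", "vehicle_type": "uberX", "city": "sf"},
    {"state": "open", "vehicle_type": "uberX", "city": "sf"},
    {"state": "driver_arrived", "vehicle_type": "uberBLACK", "city": "nyc"},
]

top = run_query(
    events,
    dimensions=["city", "vehicle_type"],
    aggregations={"count": len},
    where=lambda r: r["state"] == "driver_arrived",
    having=lambda g: g["count"] >= 1,
    order_by=lambda g: -g["count"],
    limit=10,
)
```

A linear scan like this only illustrates the semantics; it says nothing about how a storage system would execute the query quickly.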
SLIDE 40
Finding the Right Storage System
SLIDE 41 Minimum Requirements
- OLAP with geospatial and time series support
- Support large amount of data
- Sub-second response time
- Query of raw data
SLIDE 42
It can’t be a KV store
SLIDE 43
Challenges to KV Store
Pre-computing all keys is O(2^n) for both space and time
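The exponential growth comes from the fact that a query may use any subset of the n dimensions. A small sketch that enumerates those subsets; the dimension names are illustrative, and the count covers only the key families, each of which additionally holds one key per combination of values.

```python
from itertools import combinations


def dimension_subsets(dimensions):
    """Yield every subset of dimensions that a query might group or filter by."""
    for size in range(len(dimensions) + 1):
        yield from combinations(dimensions, size)


dims = ["city", "hexagon", "vehicle_type", "minute", "driver_state"]
subsets = list(dimension_subsets(dims))

# 5 dimensions give 2**5 = 32 subsets; with 24 fields it is already 16,777,216.
print(len(subsets), 2 ** 24)
```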
SLIDE 44
It can’t be a relational database
SLIDE 45 Challenges to Relational DB
- Managing multiple indices is painful
- Scanning is not fast enough
SLIDE 46 A System That Supports
- Fast scan
- Arbitrary boolean queries
- Raw data
- Wide range of aggregations
SLIDE 47
Elasticsearch
SLIDE 48
Highly Efficient Inverted-Index For Boolean Query
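A toy sketch of why an inverted index suits boolean queries: each (field, value) pair maps to the set of matching document ids, so AND, OR and NOT become set operations. This is an illustration only; Elasticsearch (via Lucene) uses compressed, sorted postings lists rather than Python sets, and the class and sample documents here are hypothetical.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each (field, value) pair to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.all_ids = set()

    def add(self, doc_id, doc):
        self.all_ids.add(doc_id)
        for field, value in doc.items():
            self.postings[(field, value)].add(doc_id)

    def term(self, field, value):
        # Return a copy so callers cannot mutate the index.
        return set(self.postings.get((field, value), ()))


index = InvertedIndex()
index.add(1, {"state": "driver_arrived", "vehicle_type": "uberX"})
index.add(2, {"state": "driver_arrived", "vehicle_type": "uberBLACK"})
index.add(3, {"state": "open", "vehicle_type": "uberX"})

# state = driver_arrived AND vehicle_type = uberX
arrived_x = index.term("state", "driver_arrived") & index.term("vehicle_type", "uberX")
# state = open OR vehicle_type = uberBLACK
open_or_black = index.term("state", "open") | index.term("vehicle_type", "uberBLACK")
# NOT vehicle_type = uberX
not_x = index.all_ids - index.term("vehicle_type", "uberX")
```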
SLIDE 49
Built-in Distributed Query
SLIDE 50
Fast Scan with Flexible Aggregations
SLIDE 52
Are We Done?
SLIDE 53
Transformation
e.g. (Lat, Long) -> (zipcode, hexagon)
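One way such a transformation could look, as a sketch under stated assumptions: the talk does not describe its hexagon scheme, so this uses a pointy-top hex grid in axial coordinates with the standard cube-rounding step, and a local equirectangular projection that is only reasonable over city-sized areas. All function names, the origin point, and the cell size are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def _cube_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon."""
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the component with the largest rounding error so q + r + s == 0.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def point_to_hex(x, y, size):
    """Map planar (x, y) to axial (q, r) on a pointy-top grid with the given cell size."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _cube_round(q, r)


def latlon_to_hex(lat, lon, origin_lat, origin_lon, size_m):
    """Project (lat, lon) to metres around an origin, then look up the hexagon."""
    x = math.radians(lon - origin_lon) * EARTH_RADIUS_M * math.cos(math.radians(origin_lat))
    y = math.radians(lat - origin_lat) * EARTH_RADIUS_M
    return point_to_hex(x, y, size_m)


# The origin itself falls in hexagon (0, 0).
print(latlon_to_hex(37.0, -122.0, 37.0, -122.0, 200.0))
```

The zipcode half of the mapping needs a polygon lookup against postal boundary data and is not sketched here.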
SLIDE 54
Dynamic Pricing
SLIDE 55
Trend Prediction
SLIDE 56
Supply and Demand Distribution
SLIDE 57
Technically Speaking: Clustering & Pr(D, S, E)
SLIDE 58 New Use Cases -> New Requirements
SLIDE 59
Pre-aggregation
SLIDE 60
Joining Multiple Streams
SLIDE 61
Sessionization
SLIDE 62
Multi-Staged Processing
SLIDE 63
State Management
SLIDE 64
Apache Samza
SLIDE 65
Why Apache Samza?
SLIDE 66
DAG on Kafka
SLIDE 67
Excellent Integration with Kafka
SLIDE 68
Excellent Integration with Kafka
SLIDE 69
Built-in Checkpointing
SLIDE 70
Built-in State Management
SLIDE 71 Processing Storage
SLIDE 72
What If Storage Is Down?
SLIDE 73
What If Processing Takes Long?
SLIDE 74 Processing Storage
SLIDE 75
Are We Done?
SLIDE 76
SLIDE 77
SLIDE 78
Post Processing
SLIDE 79
Results Transformation and Smoothing
SLIDE 80
Scale of Post Processing
10,000 hexagons in a city
SLIDE 81
Scale of Post Processing
331 neighboring hexagons to look at
SLIDE 82
Scale of Post Processing
331 x 10,000 = 3.31 Million Hexagons to Process for a Single Query
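The figure 331 equals the number of cells within 10 rings of a hexagon on a hex grid (3k(k+1) + 1 for k = 10). That the talk used a radius of 10 is an inference from that number, not something the slides state. The sketch below enumerates such a neighborhood in axial coordinates and applies a plain mean as a stand-in for smoothing; the actual smoothing function is not given in the talk, and the function names are hypothetical.

```python
def hexes_within(center, radius):
    """Return all axial (q, r) cells within `radius` rings of `center`, inclusive."""
    cq, cr = center
    cells = []
    for dq in range(-radius, radius + 1):
        lo = max(-radius, -dq - radius)
        hi = min(radius, -dq + radius)
        for dr in range(lo, hi + 1):
            cells.append((cq + dq, cr + dr))
    return cells


def smooth(values, cell, radius):
    """Mean of `values` over the neighborhood; cells with no value count as 0."""
    neighborhood = hexes_within(cell, radius)
    return sum(values.get(c, 0.0) for c in neighborhood) / len(neighborhood)


neighbors = len(hexes_within((0, 0), 10))
per_query = neighbors * 10_000
print(neighbors, per_query)  # 331 3310000
```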
SLIDE 83
Scale of Post Processing
99%-ile Processing Time: 70ms
SLIDE 84 Post Processing
- Each processor is a pure function
- Processors can be composed by combinators
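A minimal sketch of the idea on this slide: processors as pure functions from input to output, and combinators that build larger processors from smaller ones. The names `pipeline`, `fan_out`, `scale`, and `clamp` are hypothetical; the talk does not show its actual processors or combinators.

```python
from functools import reduce


def pipeline(*processors):
    """Combinator: run processors left to right, feeding each output to the next."""
    return lambda data: reduce(lambda acc, p: p(acc), processors, data)


def fan_out(*processors):
    """Combinator: apply several processors to the same input and collect the results."""
    return lambda data: tuple(p(data) for p in processors)


def scale(factor):
    """Processor: multiply every cell value; returns a new dict, input untouched."""
    return lambda cells: {k: v * factor for k, v in cells.items()}


def clamp(lo, hi):
    """Processor: bound every cell value to [lo, hi]."""
    return lambda cells: {k: max(lo, min(hi, v)) for k, v in cells.items()}


raw = {"a": 1, "b": 4}
post_process = pipeline(scale(2), clamp(0, 5))
result = post_process(raw)
both = fan_out(scale(2), scale(3))(raw)
```

Because no processor mutates its input, independent branches such as those in `fan_out` can be run in parallel without coordination.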
SLIDE 85 Post Processing
- Highly parallelized execution
- Pipelining
SLIDE 86 Post Processing
- Each processor is a pure function
- Processors can be composed by combinators
- Highly parallelized execution
SLIDE 87
Practical Considerations
SLIDE 88
Data Discovery
SLIDE 89
Elasticsearch Query Can Be Complex
SLIDE 90
/driverAcceptanceRate?
  geo_dist(10, [37, 22])&
  time_range(2015-02-04,2015-03-06)&
  aggregate(timeseries(7d))&
  eq(msg.driverId,1)
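For comparison, a hedged sketch of what a roughly equivalent Elasticsearch request body might look like, built as a plain Python dictionary. The field names `location` and `timestamp`, the kilometre unit, and reading `[37, 22]` as latitude then longitude are all assumptions; `fixed_interval` is the parameter name in recent Elasticsearch versions, whereas versions from the time of the talk used `interval`. The acceptance-rate computation itself is not shown.

```python
def build_es_query(lat, lon, distance, start, end, interval, driver_id):
    """Build an Elasticsearch query DSL body: three filters plus a time histogram."""
    return {
        "size": 0,  # only aggregation buckets are needed, not raw hits
        "query": {
            "bool": {
                "filter": [
                    {"geo_distance": {
                        "distance": distance,
                        "location": {"lat": lat, "lon": lon},  # assumed field name
                    }},
                    {"range": {"timestamp": {"gte": start, "lte": end}}},  # assumed field name
                    {"term": {"msg.driverId": driver_id}},
                ]
            }
        },
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": "timestamp", "fixed_interval": interval}
            }
        },
    }


body = build_es_query(37, 22, "10km", "2015-02-04", "2015-03-06", "7d", 1)
```

The gap between the one-line query on the slide and this nested structure is the kind of complexity the compact query format hides from callers.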
SLIDE 91 Elasticsearch Query Can Be Optimized
- Pipelining
- Validation
- Throttling
SLIDE 93
Elasticsearch Can Be Replaced
SLIDE 94 Storage Query Processing
SLIDE 95
There’s one more thing
SLIDE 96
There are always patterns in streams
SLIDE 97
There is always need for quick exploration
SLIDE 98 How many drivers cancel a request 10 times in a row within a 5-minute window?
SLIDE 99 Which riders request a pickup from 100 miles apart within a half hour window?
SLIDE 100
SLIDE 101 Complex Event Processing
FROM driver_canceled#window.time(10 min)
SELECT clientUUID, count(clientUUID) as cancelCount
GROUP BY clientUUID HAVING cancelCount > 10
INSERT INTO hipchat(room);
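What the query above asks of the engine can be sketched as a per-key sliding time window. This is an illustration of the semantics only, not how any particular CEP engine is implemented; the class name is hypothetical, timestamps are assumed to arrive in order per key, and the notification step is reduced to a boolean return value.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key over a trailing time window and flags keys over a threshold."""

    def __init__(self, window_seconds, threshold):
        self.window = window_seconds
        self.threshold = threshold
        self.events = defaultdict(deque)  # key -> timestamps still inside the window

    def on_event(self, key, timestamp):
        """Record one event; return True if the key's count now exceeds the threshold."""
        q = self.events[key]
        q.append(timestamp)
        # Evict timestamps that have fallen out of the trailing window.
        while q and q[0] <= timestamp - self.window:
            q.popleft()
        return len(q) > self.threshold


# 10-minute window, alert when a client has more than 10 cancellations in it.
counter = SlidingWindowCounter(window_seconds=600, threshold=10)
burst = [counter.on_event("client-1", t) for t in range(0, 11 * 30, 30)]
slow = [counter.on_event("client-2", t) for t in range(0, 11 * 120, 120)]
```

Here `burst` ends with an alert because eleven events land within five minutes, while the same number of events spread over twenty minutes in `slow` never triggers one.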
SLIDE 102
Implementation Becomes Easy
SLIDE 103
Thank You!