STREAM PROCESSING @ UBER DANNY YUAN @ UBER What is Uber - - PowerPoint PPT Presentation



slide-1
SLIDE 1

STREAM PROCESSING @ UBER

DANNY YUAN @ UBER

slide-2
SLIDE 2

What is Uber

slide-3
SLIDE 3

Transportation at your fingertips

slide-4
SLIDE 4
slide-5
SLIDE 5

Stream Data Allows Us To Feel The Pulse Of Cities

slide-6
SLIDE 6

Marketplace Health

slide-7
SLIDE 7

What’s Going on Now

slide-8
SLIDE 8

What’s Happened?

slide-9
SLIDE 9

Status Tracking

slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12
slide-13
SLIDE 13

A Little Background

slide-14
SLIDE 14

Uber’s Platform Is a Distributed State Machine

Rider States

slide-15
SLIDE 15

Uber’s Platform Is a Distributed State Machine

Rider States Driver States

slide-16
SLIDE 16

Applications can’t do everything

slide-17
SLIDE 17

Instead, Applications Emit Events

slide-18
SLIDE 18

Events Should Be Available In Seconds

slide-19
SLIDE 19

Events Should Rarely Get Lost

slide-20
SLIDE 20

Events Should Be Cheap And Scalable

slide-21
SLIDE 21
slide-22
SLIDE 22

Where are the challenges?

slide-23
SLIDE 23

Many Dimensions

Dozens of fields per event

slide-24
SLIDE 24

Granular Data

slide-25
SLIDE 25

Granular Data

slide-26
SLIDE 26

Granular Data

Over 10,000 hexagons in the city

slide-27
SLIDE 27

Granular Data

7 vehicle types

slide-28
SLIDE 28

Granular Data

1440 minutes in a day

slide-29
SLIDE 29

Granular Data

13 driver states

slide-30
SLIDE 30

Granular Data

300 cities

slide-31
SLIDE 31

Granular Data

1 day of data: 300 x 10,000 x 7 x 1440 x 13 = 393 billion possible combinations
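The arithmetic on this slide can be checked directly. A minimal sketch; the dictionary keys are illustrative labels, not names from the talk:

```python
from math import prod

# Dimension cardinalities quoted on the slides (key names are illustrative).
cardinalities = {
    "cities": 300,
    "hexagons_per_city": 10_000,
    "vehicle_types": 7,
    "minutes_per_day": 1440,
    "driver_states": 13,
}

combinations = prod(cardinalities.values())
print(f"{combinations:,}")  # 393,120,000,000
```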

slide-32
SLIDE 32

Unknown Query Patterns

Any combination of dimensions

slide-33
SLIDE 33

Variety of Aggregations

  • Heatmap
  • Top N
  • Histogram
  • count(), avg(), sum(), percent(), geo
slide-34
SLIDE 34

Different Geo Aggregation

slide-35
SLIDE 35

Large Data Volume

  • Hundreds of thousands of events per second, or billions of events per day

  • At least dozens of fields in each event
slide-36
SLIDE 36

Tight Schedule

slide-37
SLIDE 37

Key: Generalization

slide-38
SLIDE 38

Data Type

  • Dimensional Temporal Spatial Data

  Dimension      Value
  state          driver_arrived
  vehicle type   uberX
  timestamp      13244323342
  latitude       12.23
  longitude      30.00

slide-39
SLIDE 39

Data Query

  • OLAP on single-table temporal-spatial data


 SELECT <agg functions>, <dimensions>
 FROM <data_source>
 WHERE <boolean filter>
 GROUP BY <dimensions>
 HAVING <boolean filter>
 ORDER BY <sorting criteria>
 LIMIT <n>
 DO <post aggregation>
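To make the query template concrete, here is a minimal in-memory sketch of the same clauses over a list of event dicts. It is not the system described in the talk: `run_query`, its parameters, and the sample events are all hypothetical, and the DO post-aggregation step is left out.

```python
from collections import defaultdict

def run_query(events, where, group_by, agg, having=None, order_by=None, limit=None):
    """Evaluate a single-table GROUP BY query over an iterable of event dicts."""
    groups = defaultdict(list)
    for e in events:
        if where(e):                                 # WHERE
            key = tuple(e[d] for d in group_by)      # GROUP BY
            groups[key].append(e)
    rows = [(key, agg(members)) for key, members in groups.items()]  # SELECT
    if having is not None:
        rows = [row for row in rows if having(row[1])]               # HAVING
    if order_by is not None:
        rows.sort(key=order_by)                                      # ORDER BY
    return rows[:limit]                              # LIMIT (None keeps all rows)

# Made-up sample events; field names and values are assumptions.
events = [
    {"state": "driver_arrived", "vehicle_type": "uberX", "hexagon": 1},
    {"state": "driver_arrived", "vehicle_type": "uberX", "hexagon": 1},
    {"state": "driver_arrived", "vehicle_type": "uberX", "hexagon": 2},
    {"state": "open", "vehicle_type": "uberX", "hexagon": 1},
]

result = run_query(
    events,
    where=lambda e: e["state"] == "driver_arrived",
    group_by=["hexagon"],
    agg=len,                          # count()
    having=lambda n: n >= 1,
    order_by=lambda row: -row[1],     # largest count first
    limit=10,
)
print(result)  # [((1,), 2), ((2,), 1)]
```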

slide-40
SLIDE 40

Finding the Right Storage System

slide-41
SLIDE 41

Minimum Requirements

  • OLAP with geospatial and time series support

  • Support for large amounts of data

  • Sub-second response time

  • Query of raw data

slide-42
SLIDE 42

It can’t be a KV store

slide-43
SLIDE 43

Challenges to KV Store

Pre-computing all keys is O(2^n) for both space and time
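The exponential growth comes from the number of dimension subsets a query may use. A small sketch, with hypothetical dimension names, that enumerates them:

```python
from itertools import combinations

def dimension_subsets(dimensions):
    """Yield every subset of dimensions a query could group or filter by."""
    for size in range(len(dimensions) + 1):
        yield from combinations(dimensions, size)

# Hypothetical dimension names; real events carry dozens of fields.
dims = ["city", "hexagon", "vehicle_type", "minute", "driver_state"]
subsets = list(dimension_subsets(dims))
print(len(subsets))  # 32, i.e. 2 ** 5
```

With n dimensions there are 2^n such subsets, and each one would need its own set of pre-computed keys.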


slide-44
SLIDE 44

It can’t be a relational database

slide-45
SLIDE 45

Challenges to Relational DB

  • Managing multiple indices is painful

  • Scanning is not fast enough

slide-46
SLIDE 46

A System That Supports

  • Fast scan

  • Arbitrary boolean queries

  • Raw data

  • Wide range of aggregations

slide-47
SLIDE 47

Elasticsearch

slide-48
SLIDE 48

Highly Efficient Inverted-Index For Boolean Query
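As a toy illustration of why an inverted index suits boolean queries: each (field, value) pair maps to a set of document ids, so AND and OR become set intersection and union. This is a simplified model, not how Elasticsearch or Lucene store postings, and the sample documents are made up.

```python
from collections import defaultdict

def build_index(docs):
    """Map each (field, value) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            index[(field, value)].add(doc_id)
    return index

# Made-up documents for illustration.
docs = [
    {"state": "driver_arrived", "vehicle_type": "uberX"},
    {"state": "driver_arrived", "vehicle_type": "uberBLACK"},
    {"state": "open", "vehicle_type": "uberX"},
]
index = build_index(docs)

# state == driver_arrived AND vehicle_type == uberX -> set intersection
both = index[("state", "driver_arrived")] & index[("vehicle_type", "uberX")]
# state == open OR vehicle_type == uberBLACK -> set union
either = index[("state", "open")] | index[("vehicle_type", "uberBLACK")]
print(sorted(both), sorted(either))  # [0] [1, 2]
```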

slide-49
SLIDE 49

Built-in Distributed Query

slide-50
SLIDE 50

Fast Scan with Flexible Aggregations

slide-51
SLIDE 51

Storage

slide-52
SLIDE 52

Are We Done?

slide-53
SLIDE 53

Transformation

e.g. (Lat, Long) -> (zipcode, hexagon)
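One way such a transformation could look, sketched under strong assumptions: latitude and longitude are treated as planar coordinates, and the grid is a pointy-top hexagonal grid in axial coordinates. The talk does not say which grid or projection is used; `hex_center`, `to_hexagon`, and `size` are hypothetical names.

```python
import math

def hex_center(q, r, size):
    """Planar center of the pointy-top hexagon at axial coordinates (q, r)."""
    x = size * math.sqrt(3) * (q + r / 2)
    y = size * 1.5 * r
    return x, y

def to_hexagon(x, y, size):
    """Axial (q, r) of the hexagon containing planar point (x, y)."""
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the coordinate with the largest rounding error so that
    # q + r + s == 0 still holds after rounding.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr

# Longitude as x and latitude as y, with a made-up cell size in degrees.
print(to_hexagon(30.00, 12.23, size=0.01))
```

A production system would project coordinates first, since degrees of longitude shrink with latitude.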

slide-54
SLIDE 54

Dynamic Pricing

slide-55
SLIDE 55

Trend Prediction

slide-56
SLIDE 56

Supply and Demand Distribution

slide-57
SLIDE 57

Technically Speaking: Clustering & Pr(D, S, E)

slide-58
SLIDE 58

New Use Cases -> New Requirements

slide-59
SLIDE 59

Pre-aggregation
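A minimal sketch of what pre-aggregation could mean here: counting events per fixed time window and dimension combination before they reach storage. The window size, field names, and `pre_aggregate` are assumptions, not details from the talk.

```python
from collections import Counter

def pre_aggregate(events, window_seconds=60):
    """Count events per (window start, hexagon, state) bucket."""
    counts = Counter()
    for e in events:
        window_start = e["timestamp"] - e["timestamp"] % window_seconds
        counts[(window_start, e["hexagon"], e["state"])] += 1
    return counts

# Made-up events with timestamps in seconds.
events = [
    {"timestamp": 0, "hexagon": 1, "state": "open"},
    {"timestamp": 59, "hexagon": 1, "state": "open"},
    {"timestamp": 60, "hexagon": 1, "state": "open"},
]
counts = pre_aggregate(events)
print(counts[(0, 1, "open")], counts[(60, 1, "open")])  # 2 1
```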

slide-60
SLIDE 60

Joining Multiple Streams

slide-61
SLIDE 61

Sessionization
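The slide does not define sessionization. A common reading, assumed here, is grouping each user's events into sessions separated by a gap of inactivity; `sessionize` and the 30-minute gap are hypothetical choices.

```python
def sessionize(events, gap_seconds=1800):
    """Group (user_id, timestamp) pairs into per-user sessions by inactivity gap."""
    sessions = {}
    for user_id, ts in sorted(events):
        user_sessions = sessions.setdefault(user_id, [])
        if user_sessions and ts - user_sessions[-1][-1] <= gap_seconds:
            user_sessions[-1].append(ts)     # continue the current session
        else:
            user_sessions.append([ts])       # gap exceeded: start a new session
    return sessions

# Made-up (user_id, timestamp) pairs.
events = [("a", 0), ("a", 100), ("b", 50), ("a", 5000)]
print(sessionize(events))  # {'a': [[0, 100], [5000]], 'b': [[50]]}
```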

slide-62
SLIDE 62

Multi-Staged Processing

slide-63
SLIDE 63

State Management

slide-64
SLIDE 64

Apache Samza

slide-65
SLIDE 65

Why Apache Samza?

slide-66
SLIDE 66

DAG on Kafka

slide-67
SLIDE 67

Excellent Integration with Kafka

slide-68
SLIDE 68

Excellent Integration with Kafka

slide-69
SLIDE 69

Built-in Checkpointing

slide-70
SLIDE 70

Built-in State Management

slide-71
SLIDE 71

Processing Storage

slide-72
SLIDE 72

What If Storage Is Down?

slide-73
SLIDE 73

What If Processing Takes Long?

slide-74
SLIDE 74

Processing Storage

slide-75
SLIDE 75

Are We Done?

slide-76
SLIDE 76
slide-77
SLIDE 77
slide-78
SLIDE 78

Post Processing

slide-79
SLIDE 79

Results Transformation and Smoothing

slide-80
SLIDE 80

Scale of Post Processing

10,000 hexagons in a city

slide-81
SLIDE 81

Scale of Post Processing

331 neighboring hexagons to look at

slide-82
SLIDE 82

Scale of Post Processing

331 x 10,000 = 3.3 Million Hexagons to Process for a Single Query
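One plausible reading of 331 is the number of cells within 10 rings of a hexagon, center included, since 1 + 3k(k + 1) = 331 for k = 10. The talk does not state the ring count, so k = 10 is an inference; the helper names are hypothetical.

```python
def hexagons_within(k):
    """Number of cells within k rings of a hexagon, including the center."""
    return 1 + 3 * k * (k + 1)

def neighbors_within(q, r, k):
    """Enumerate axial coordinates of all cells within k rings of (q, r)."""
    return [
        (q + dq, r + dr)
        for dq in range(-k, k + 1)
        for dr in range(max(-k, -dq - k), min(k, -dq + k) + 1)
    ]

print(hexagons_within(10))                # 331
print(len(neighbors_within(0, 0, 10)))    # 331
print(hexagons_within(10) * 10_000)       # 3310000
```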

slide-83
SLIDE 83

Scale of Post Processing

99%-ile Processing Time: 70ms

slide-84
SLIDE 84

Post Processing

  • Each processor is a pure function

  • Processors can be composed by combinators
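A minimal sketch of the idea, assuming a processor is a function from a list of result rows to a new list of rows. `compose`, `scale`, and `top_n` are hypothetical names, not the talk's API.

```python
from functools import reduce

def compose(*processors):
    """Combinator: chain processors left to right into a single processor."""
    return lambda rows: reduce(lambda acc, p: p(acc), processors, rows)

def scale(factor):
    """Processor that multiplies each row's value, returning new rows."""
    return lambda rows: [{**r, "value": r["value"] * factor} for r in rows]

def top_n(n):
    """Processor that keeps the n rows with the largest values."""
    return lambda rows: sorted(rows, key=lambda r: r["value"], reverse=True)[:n]

pipeline = compose(scale(2), top_n(2))
rows = [{"hex": 1, "value": 3}, {"hex": 2, "value": 5}, {"hex": 3, "value": 1}]
print(pipeline(rows))  # [{'hex': 2, 'value': 10}, {'hex': 1, 'value': 6}]
```

Because each processor returns new rows instead of mutating its input, processors can be reordered, reused, and run in parallel without shared state.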
slide-85
SLIDE 85

Post Processing

  • Highly parallelized execution

  • Pipelining
slide-86
SLIDE 86

Post Processing

  • Each processor is a pure function

  • Processors can be composed by combinators

  • Highly parallelized execution
slide-87
SLIDE 87

Practical Considerations

slide-88
SLIDE 88

Data Discovery

slide-89
SLIDE 89

Elasticsearch Query Can Be Complex

slide-90
SLIDE 90

/driverAcceptanceRate?
  geo_dist(10, [37, 22])&
  time_range(2015-02-04,2015-03-06)&
  aggregate(timeseries(7d))&
  eq(msg.driverId,1)

slide-91
SLIDE 91

Elasticsearch Query Can Be Optimized

  • Pipelining

  • Validation

  • Throttling
slide-92
SLIDE 92

Time in seconds

slide-93
SLIDE 93

Elasticsearch Can Be Replaced

slide-94
SLIDE 94

Storage Query Processing

slide-95
SLIDE 95

There’s one more thing

slide-96
SLIDE 96

There are always patterns in streams

slide-97
SLIDE 97

There is always a need for quick exploration

slide-98
SLIDE 98

How many drivers cancel a request 10 times in a row within a 5-minute window?

slide-99
SLIDE 99

Which riders request pickups from locations 100 miles apart within a half-hour window?

slide-100
SLIDE 100
slide-101
SLIDE 101

Complex Event Processing

FROM driver_canceled#window.time(10 min)
SELECT clientUUID, count(clientUUID) as cancelCount
GROUP BY clientUUID HAVING cancelCount > 10
INSERT INTO hipchat(room);
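For comparison, a hand-written sketch of roughly what this query expresses: a sliding time window per client that flags more than 10 cancels within 10 minutes. `CancelWatcher` is hypothetical, it assumes events arrive in timestamp order, and it returns a boolean where the query posts to a chat room.

```python
from collections import defaultdict, deque

class CancelWatcher:
    """Flag a client once it has more than `threshold` cancels within `window` seconds."""

    def __init__(self, window=600, threshold=10):
        self.window = window
        self.threshold = threshold
        self.times = defaultdict(deque)   # client id -> cancel timestamps in window

    def on_cancel(self, client_uuid, ts):
        q = self.times[client_uuid]
        q.append(ts)
        # Evict cancels that have fallen out of the sliding window.
        while q and q[0] <= ts - self.window:
            q.popleft()
        return len(q) > self.threshold

watcher = CancelWatcher()
flags = [watcher.on_cancel("client-1", t) for t in range(11)]
print(flags[-1])  # True: the 11th cancel within 10 minutes crosses the threshold
```

The bookkeeping above (per-key state, eviction, thresholds) is what the declarative query hides.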

slide-102
SLIDE 102

Implementation Becomes Easy

slide-103
SLIDE 103

Thank You!