SLIDE 1

Scaling Uber’s Elasticsearch as a Geo-Temporal Database

Danny Yuan @ Uber

SLIDE 2

Use Cases for a Geo-Temporal Database

SLIDE 3

Real-time Decisions on Global Scale

SLIDES 4-5

Dynamic Pricing: Every Hexagon, Every Minute

SLIDES 6-7

Metrics: how many UberXs were in a trip in the past 10 minutes

SLIDE 8

Market Analysis: Travel Times

SLIDE 9

Forecasting: Granular Forecasting of Rider Demand

SLIDE 10

How Can We Produce Geo-Temporal Data for Ever-Changing Business Needs?

SLIDE 11

Key Question: What Is the Right Abstraction?

SLIDE 12

Abstraction: Single-Table OLAP on Geo-Temporal Data

SLIDES 13-14

Abstraction: Single-Table OLAP on Geo-Temporal Data

SELECT <agg functions>, <dimensions> 
 FROM <data_source>
 WHERE <boolean filter>
 GROUP BY <dimensions>
 HAVING <boolean filter>
 ORDER BY <sorting criteria>
 LIMIT <n>


SLIDE 15

Why Elasticsearch?

  • Arbitrary boolean queries
  • Sub-second response time
  • Built-in distributed aggregation functions
  • High-cardinality queries
  • Idempotent insertion to deduplicate data
  • Second-level data freshness
  • Scales with data volume
  • Operable by a small team
SLIDE 16

Current Scale: An Important Context

  • Ingestion: 850K to 1.3M messages/second
  • Ingestion volume: 12 TB/day
  • Doc scans: 100M to 4B docs/second
  • Data size: 1 PB
  • Cluster size: 700 Elasticsearch machines
  • Ingestion pipeline: 100+ data pipeline jobs
SLIDE 17

Our Story of Scaling Elasticsearch

SLIDE 18

Three Dimensions of Scale: Ingestion, Query, Operation

SLIDE 19

Driving Principles

  • Optimize for fast iteration
  • Optimize for simple operations
  • Optimize for automation and tools
  • Optimize for being reasonably fast
SLIDE 20

The Past: We Started Small

SLIDE 21

Constraints for Being Small

  • Three-person team
  • Two data centers
  • Small set of requirements: common analytics for machines
SLIDE 22

First Order of Business: Take Care of the Basics

SLIDE 23

Get Single-Node Right: Follow the 20-80 Rule

  • One table <—> multiple indices by time range
  • Disable _source field
  • Disable _all field
  • Use doc_values for storage
  • Disable analyzed field
  • Tune JVM parameters
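A minimal sketch of what these single-node basics could look like when creating one of the time-ranged indices, written against the Elasticsearch 2.x-era mapping API of the period; the field names, index name, and shard count are illustrative, not Uber's actual schema:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

index_body = {
    "settings": {"number_of_shards": 8, "refresh_interval": "30s"},
    "mappings": {
        "demand_event": {                      # hypothetical document type
            "_source": {"enabled": False},     # disable _source
            "_all": {"enabled": False},        # disable _all
            "properties": {
                "hexagon_id": {"type": "string", "index": "not_analyzed",
                               "doc_values": True},   # no analysis, doc_values storage
                "product_id": {"type": "string", "index": "not_analyzed",
                               "doc_values": True},
                "event_time": {"type": "date", "doc_values": True},
            },
        }
    },
}

# One logical table maps to many time-ranged indices, e.g. one index per day.
es.indices.create(index="marketplace_demand-2018.02.03", body=index_body)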
SLIDE 24

Make Decisions with Numbers

  • What’s the maximum number of recovery threads?
  • What’s the maximum size of the request queue?
  • What should the refresh rate be?
  • How many shards should an index have?
  • What’s the throttling threshold?
  • Solution: Set up an end-to-end stress testing framework
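Answering these questions with numbers means driving synthetic load end to end and measuring. A minimal sketch of one such measurement, assuming the elasticsearch-py bulk helper and a hypothetical event generator; a real framework would also sweep shard counts, refresh rates, and queue sizes:

import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

def fake_events(n):
    # Synthetic documents shaped roughly like production events.
    for i in range(n):
        yield {"_index": "stress_test",
               "_source": {"hexagon_id": i % 10000,
                           "event_time": int(time.time() * 1000)}}

def bulk_throughput(total_docs=100_000, chunk_size=5_000):
    start = time.time()
    helpers.bulk(es, fake_events(total_docs), chunk_size=chunk_size)
    return total_docs / (time.time() - start)   # docs per second

# Sweep one knob at a time and record the numbers.
for chunk in (1_000, 5_000, 10_000):
    print(chunk, bulk_throughput(chunk_size=chunk))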
SLIDE 25

Deployment in Two Data Centers

  • Each data center has an exclusive set of cities
  • Should tolerate failure of a single data center
  • Ingestion should continue to work
  • Querying any city should return correct results
SLIDES 26-28

Deployment in Two Data Centers: trade space for availability

SLIDE 29

Discretize Geo Locations: H3
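H3 assigns every latitude/longitude pair to a hexagonal cell at a chosen resolution, so a hexagon becomes an ordinary string dimension on each event and "every hexagon, every minute" becomes a plain group-by. A small sketch using the h3 Python bindings (v3 API); the resolution and field names are illustrative:

import h3

# Bucket a pickup location into a resolution-7 hexagon (cells of a few km^2).
hexagon = h3.geo_to_h3(37.7749, -122.4194, 7)   # returns a cell ID string

# Neighboring cells are cheap to enumerate, useful for spatial smoothing.
neighbors = h3.k_ring(hexagon, 1)

# Stored on the event, the cell ID is just another filterable, groupable field.
doc = {"hexagon_id": hexagon, "event_time": "2018-02-03T00:00:00Z"}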

SLIDES 30-31

Optimizations to Ingestion

SLIDE 32

Dealing with Large Volume of Data

  • An event source produces more than 3TB every day
  • Key insight: humans do not need overly granular data
  • Key insight: stream data usually has lots of redundancy
SLIDE 33

Dealing with Large Volume of Data

  • Prune unnecessary fields
  • Devise algorithms to remove redundancy
  • 3 TB —> 42 GB, more than a 70x reduction!
  • Bulk writes
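A rough sketch of the pipeline stage these bullets describe, with hypothetical field names: keep only the fields queries need, drop events that do not change those fields for the same driver, and write in bulk with deterministic IDs so retries stay idempotent:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # placeholder host
KEEP_FIELDS = ("driver_id", "hexagon_id", "status", "event_time")  # prune the rest

_last_seen = {}  # driver_id -> last emitted (hexagon_id, status)

def prune_and_dedupe(raw_events):
    for event in raw_events:
        doc = {k: event[k] for k in KEEP_FIELDS if k in event}
        key = (doc.get("hexagon_id"), doc.get("status"))
        if _last_seen.get(doc["driver_id"]) == key:
            continue  # redundant: nothing a query cares about has changed
        _last_seen[doc["driver_id"]] = key
        yield {"_index": "supply_status-2018.02.03",
               "_id": "{}:{}".format(doc["driver_id"], doc["event_time"]),  # idempotent
               "_source": doc}

def write(raw_events):
    helpers.bulk(es, prune_and_dedupe(raw_events))  # bulk writes instead of per-doc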

SLIDE 34

Data Modeling Matters

SLIDE 35

Example: Efficient and Reliable Join

  • Example: Calculate Completed/Requested ratio with two different event streams
SLIDE 36

Example: Efficient and Reliable Join: Use Elasticsearch

  • Calculate Completed/Requested ratio from two Kafka topics
  • Can we use streaming join?
  • Can we join on the query side?
  • Solution: rendezvous at Elasticsearch on trip ID

TripID | Pickup Time   | Completed
1      | 2018-02-03T…  | TRUE
2      | 2018-02-03T…  | FALSE
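A minimal sketch of the rendezvous idea, assuming elasticsearch-py and two hypothetical Kafka consumers: both streams upsert into the same document keyed by trip ID, so whichever event arrives first creates the row and the other only fills in its fields.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host
INDEX = "trips-2018.02"                         # illustrative index name

def on_requested(event):
    # Consumer for the "requested" Kafka topic.
    es.update(index=INDEX, id=event["trip_id"],
              body={"doc": {"pickup_time": event["pickup_time"], "requested": True},
                    "doc_as_upsert": True})

def on_completed(event):
    # Consumer for the "completed" Kafka topic; same document, same trip ID.
    es.update(index=INDEX, id=event["trip_id"],
              body={"doc": {"completed": True},
                    "doc_as_upsert": True})

With this, the Completed/Requested ratio is a single filtered aggregation over one index: no streaming join, no query-side join.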

SLIDE 37

Example: aggregation on state transitions

SLIDE 38

Optimize Querying Elasticsearch

SLIDE 39

Hide Query Optimization from Users

  • Do we really expect every user to write Elasticsearch queries?
  • What if someone issues a very expensive query?
  • Solution: Isolation with a query layer
SLIDES 40-42

Query Layer with Multiple Clusters

  • Generate efficient Elasticsearch queries
  • Reject expensive queries
  • Route queries (hardcoded routing at first)
SLIDE 43

Efficient Query Generation

  • “GROUP BY a, b”
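The query layer turns a request like this into an Elasticsearch DSL body. One straightforward translation, sketched below, nests a terms aggregation per GROUP BY dimension and puts the metric at the leaf (dimension and field names are placeholders):

def group_by_to_aggs(dimensions, metric_field):
    # GROUP BY a, b -> terms agg on a, nested terms agg on b, metric at the leaf.
    aggs = {"count": {"value_count": {"field": metric_field}}}
    for dim in reversed(dimensions):
        aggs = {dim: {"terms": {"field": dim, "size": 10000}, "aggs": aggs}}
    return aggs

query_body = {
    "size": 0,                         # aggregation-only: return no raw hits
    "query": {"bool": {"filter": [     # the WHERE clause goes here
        {"term": {"city_id": 1}},
        {"range": {"event_time": {"gte": "now-10m"}}},
    ]}},
    "aggs": group_by_to_aggs(["hexagon_id", "product_id"], "trip_id"),
}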
SLIDE 44

Rejecting Expensive Queries

  • 10,000 hexagons / city x 1440 minutes per day x 800 cities
  • Cardinality: 11 Billion (!) buckets —> Out Of Memory Error
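The arithmetic above is the guard: 10,000 hexagons per city x 1,440 minutes x 800 cities is roughly 11.5 billion buckets, far more than any data node can materialize. A sketch of a pre-flight check the query layer could run, assuming it keeps rough per-dimension cardinality estimates (all names and thresholds are illustrative):

# Hypothetical cardinality estimates maintained by the query layer.
CARDINALITY = {"hexagon_id": 10_000 * 800,   # hexagons across all cities
               "minute_of_day": 1_440,
               "city_id": 800,
               "product_id": 10}

MAX_BUCKETS = 100_000  # illustrative threshold

def estimated_buckets(group_by_dims):
    total = 1
    for dim in group_by_dims:
        total *= CARDINALITY.get(dim, 1)
    return total

def check_query(group_by_dims):
    n = estimated_buckets(group_by_dims)
    if n > MAX_BUCKETS:
        raise ValueError("rejected: query would create about {:,} buckets".format(n))

check_query(["product_id", "minute_of_day"])   # fine: 14,400 buckets
check_query(["hexagon_id", "minute_of_day"])   # raises: ~11.5 billion buckets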
SLIDES 45-48

Routing Queries

"DEMAND": { "CLUSTERS": { "TIER0": { "CLUSTERS": ["ES_CLUSTER_TIER0"], }, "TIER2": { "CLUSTERS": ["ES_CLUSTER_TIER2"] } }, "INDEX": "MARKETPLACE_DEMAND-", "SUFFIXFORMAT": “YYYYMM.WW", "ROUTING": “PRODUCT_ID”, }

SLIDE 49

Summary of First Iteration

SLIDE 50

Evolution: Success Breeds Failures

SLIDE 51

Unexpected Surges

SLIDE 52

Applications Went Haywire

SLIDE 53

Solution: Distributed Rate Limiting

SLIDE 54

Solution: Distributed Rate Limiting

Per-Cluster Rate Limit

SLIDE 55

Solution: Distributed Rate Limiting

Per-Instance Rate Limit
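A sketch of how the two limits can relate, with illustrative numbers: the per-cluster budget is divided across the query-layer instances, and each instance enforces its share locally with a token bucket, so the hot path needs no coordination:

import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

CLUSTER_QPS_LIMIT = 2_000        # per-cluster budget (illustrative)
NUM_QUERY_INSTANCES = 20         # in practice discovered from a service registry

instance_limiter = TokenBucket(CLUSTER_QPS_LIMIT / NUM_QUERY_INSTANCES, burst=50)

def handle_query(run_query):
    if not instance_limiter.allow():
        raise RuntimeError("429: per-instance rate limit exceeded")
    return run_query()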

SLIDE 56

Workload Evolved

  • Users query months of data for modeling and complex analytics
  • Key insight: Data can be a little stale for long-range queries
  • Solution: Caching layer and delayed execution
SLIDES 57-63

Time Series Cache

  • Redis as the cache store
  • Cache key is based on normalized query content and time range
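A sketch of the cache key and lookup, assuming Redis and a hypothetical per-bucket query runner: the time range is removed from the query before hashing, and results are cached per aligned time bucket, so a later query over an overlapping range only hits Elasticsearch for the buckets it is missing:

import hashlib, json
import redis

r = redis.Redis(host="localhost", port=6379)   # placeholder Redis
BUCKET_SECONDS = 3600                           # cache at one-hour granularity

def cache_key(query_body, bucket_start):
    normalized = dict(query_body)
    normalized.pop("time_range", None)          # the key excludes the time range
    digest = hashlib.sha1(json.dumps(normalized, sort_keys=True).encode()).hexdigest()
    return "tscache:{}:{}".format(digest, bucket_start)

def fetch(query_body, start_ts, end_ts, run_bucket_query):
    results, bucket = {}, start_ts - (start_ts % BUCKET_SECONDS)
    while bucket < end_ts:
        key = cache_key(query_body, bucket)
        cached = r.get(key)
        if cached is not None:
            results[bucket] = json.loads(cached)          # cache hit
        else:
            results[bucket] = run_bucket_query(query_body, bucket,
                                               bucket + BUCKET_SECONDS)
            r.setex(key, 24 * 3600, json.dumps(results[bucket]))  # one-day TTL
        bucket += BUCKET_SECONDS
    return results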
SLIDE 64

Delayed Execution

  • Allow registering long-running queries
  • Provide cached but stale data for such queries
  • Dedicated cluster and queued executions
  • Rationale: three months of data vs a few hours of staleness
  • Example: [-30d, 0d] —> [-30d, -1d]
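The last bullet is simply a clamp on the query's time window: the still-changing tail of the range is cut off so the registered query can be served from cache or from a queued run on the dedicated cluster. A tiny sketch, with an illustrative staleness budget:

from datetime import datetime, timedelta

ALLOWED_STALENESS = timedelta(days=1)   # illustrative; "a few hours" also works

def clamp_range(start, end, now=None):
    # [-30d, 0d] -> [-30d, -1d]: long-range queries never touch the fresh tail.
    now = now or datetime.utcnow()
    return start, min(end, now - ALLOWED_STALENESS)

start, end = clamp_range(datetime(2018, 1, 4), datetime(2018, 2, 3),
                         now=datetime(2018, 2, 3))
# end becomes 2018-02-02: thirty days of history, one day of staleness.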
SLIDE 65

Scale Operations

SLIDE 66

Driving Principles

  • Make the system transparent
  • Optimize for MTTR - mean time to recover
  • Strive for consistency
  • Automation is the most effective way to get consistency

SLIDE 67

Challenge: Diagnosis

  • Cluster slowed down while all metrics looked normal
  • Requires additional instrumentation
  • Elasticsearch plugin as a solution

SLIDE 68

Challenge: Cluster Size Becomes an Enemy

  • An Elasticsearch cluster becomes harder to operate as its size increases
  • MTTR increases as cluster size increases
  • Multi-tenancy becomes a huge issue
  • Can’t have too many shards

SLIDES 69-75

Federation

  • 3 clusters —> many smaller clusters
  • Dynamic routing
  • Metadata-driven

SLIDE 76

How Can We Trust the Data?

SLIDES 77-80

Self-Serving Trust System

SLIDES 81-82

Too Much Manual Maintenance Work

  • Adjusting queue sizes
  • Restarting machines
  • Relocating shards
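Each of these chores is ultimately a call to Elasticsearch's own management APIs, which is what makes them automatable. A hedged sketch using elasticsearch-py (node, index, and threshold values are placeholders): shard relocation is a cluster reroute command, and relocation candidates can be picked from the allocation stats:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

def relocate_shard(index, shard, from_node, to_node):
    # The same move an operator would make by hand, now callable from automation.
    es.cluster.reroute(body={"commands": [
        {"move": {"index": index, "shard": shard,
                  "from_node": from_node, "to_node": to_node}}
    ]})

def overloaded_nodes(disk_percent_threshold=85):
    # Relocation candidates, read from the _cat/allocation API.
    rows = es.cat.allocation(format="json")
    return [row["node"] for row in rows
            if row.get("disk.percent") and int(row["disk.percent"]) > disk_percent_threshold]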

SLIDES 83-84

Auto Ops

SLIDE 85

Ongoing Work for the Future

SLIDE 86

Future Work

  • Strong reliability
  • Strong consistency among replicas
  • Multi-tenancy

SLIDE 87

Summary

  • Three dimensions of scaling: ingestion, query, and operations
  • Be simple and practical: successful systems emerge from simple ones
  • Abstraction and data modeling matter
  • Invest in thorough instrumentation
  • Invest in automation and tools