Taking Storage for a Ride, by René W. Schmidt, Storage Platform (PowerPoint PPT Presentation)



SLIDE 1

Taking Storage for a Ride

René W. Schmidt, Storage Platform

NOVEMBER 4, 2016

SLIDE 2

Uber for almost 3 years working on scaling out our storage infrastructure across the planet. VMware for 10+ years. Part of the team that released Virtual Center 1.0, and many vSphere releases since then. Sun Microsystems for 4 years. Part of the team that shipped the Java Hotspot Virtual Machine 1.0, and Java Web Start 1.0.

About me

SLIDE 3

Problem Statement

SLIDE 4

Backend Marketplace

[Diagram: Operational Data (all ongoing trips; billing, payouts, user accounts, trip histories, fraud, etc.) alongside the Data Warehouse.]

SLIDE 5

Not so long ago…

Backend Marketplace

SLIDE 6

We were there

SLIDE 7

Latency, Scalability, Availability, Development Agility

Storage bottleneck

[Diagram: a single database holding Misc, Users, and Trips data]

As of early 2014

SLIDE 8

The new world…

Backend Marketplace The Schemaless Storage System

SLIDE 9

Operational Infrastructure

The Schemaless Storage System

[Diagram: Regions contain Zones (US-West, US-East, CN-East, CN-West); each zone runs Schemaless instances; datastores such as Trips, Ratings, and Receipts back services such as the Trips Service, Billing Service, and Ratings Service.]

[Layers: Service Layer (developer abstraction) on top of a self-service, self-healing storage system, on top of datacenters.]

SLIDE 10

More than 80% of Uber’s operational data is in Schemaless. From a single datastore (trips) to 300+ datastores. From 48 MySQL hosts to many thousands of MySQL instances.

Status after 2 years in production

SLIDE 11

Schemaless Architecture

SLIDE 12

Requirements

API & Features

(make developers efficient and happy)

Scalability and Efficiency

(qps, capacity, $/GB, trust in operation)

Availability

(4 9’s, zero-downtime operations, hide failures)

Time to Market

SLIDE 13

Easy to replace Postgres

SQL-like secondary indexes; fast queries

select uuid from trips where user_uuid = ? and status = ? and request_at > ? and request_at < ?

SLIDE 14

Support batch operations

Columns: BASE, ROUTE, FARE, PAYOUT, BILLING

{ trip_uuid: …, rider_uuid: …, driver_uuid: …, gps_points: […], payment_info: …, driver_rating: …, client_rating: …, receipt: …, payout: … }

SLIDE 15

Microservices

Trips Datastore

1000+ services (most are stateless). Each service can request its own storage.

SLIDE 16

Durability

SLIDE 17

SLIDE 18

Scalability

[Scale figures: ~1 PB w/ redundancy, e.g. 128 hosts × 8 TB or 512 hosts × 2 TB]

Scalability & Reliability

SLIDE 19

Ledger-style API

Tracking of real-life interactions

UUID   BASE           ROUTE          FARE                                       RATING
12AB   { json dict }  { json dict }  { json dict }
F4CD   { json dict }  { json dict }  { json dict …, ts: 0 } { json dict …, ts: 2 } …   { json dict }

put_cell(uuid, column_key, ts, data)
get_cell(uuid, column_key, ts)
get_cell_latest(uuid, column_key)

Simple, proven, schemaless data model. Append-only: each cell can only be written once.
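The ledger-style API above can be sketched in a few lines. This is a toy in-memory model (class and internals are illustrative, not Uber's implementation) showing the write-once cell addressed by (row UUID, column key, version ts):

```python
class CellStore:
    """Toy sketch of the ledger-style cell model: a cell is addressed by
    (row uuid, column_key, ts) and can only be written once."""

    def __init__(self):
        self._cells = {}  # (uuid, column_key, ts) -> JSON-like dict

    def put_cell(self, uuid, column_key, ts, data):
        key = (uuid, column_key, ts)
        if key in self._cells:
            # Append-only: rewriting an existing version is an error.
            raise ValueError("cell %r already written" % (key,))
        self._cells[key] = data

    def get_cell(self, uuid, column_key, ts):
        return self._cells.get((uuid, column_key, ts))

    def get_cell_latest(self, uuid, column_key):
        versions = [ts for (u, c, ts) in self._cells if (u, c) == (uuid, column_key)]
        return self._cells[(uuid, column_key, max(versions))] if versions else None
```

For example, writing FARE versions ts=0 and ts=2 for row F4CD makes `get_cell_latest` return the ts=2 dict, and rewriting ts=0 raises.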

SLIDE 20

Physical storage layout

[Diagram: logical model (row A007 with JSON cells) → sharding function (% 4096) → fixed set of shards (…, 492, 493, 494, 495, …) → expandable set of MySQL clusters]
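A minimal sketch of the two-level mapping, assuming a hash of the row UUID mod 4096 (the real hash function and cluster assignment are internal to Schemaless; these stand-ins are illustrative):

```python
import uuid

NUM_SHARDS = 4096  # fixed set of shards, per the slide

def shard_for(row_uuid: uuid.UUID) -> int:
    """Map a row UUID to one of 4096 fixed shards. The UUID's integer
    value mod 4096 stands in for the real hash here."""
    return int(row_uuid) % NUM_SHARDS

def cluster_for(shard: int, num_clusters: int) -> int:
    """Map a shard to a MySQL cluster. Contiguous shard ranges per
    cluster (illustrative only) make a cluster split a matter of
    reassigning half of a range."""
    return shard * num_clusters // NUM_SHARDS
```

Because the shard count is fixed, adding clusters only changes the shard-to-cluster mapping; rows never need to be rehashed.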

SLIDE 21

  • Defined on columns
  • Scalable: partitioned across shards
  • Fast queries: only a single shard needs to be queried
  • Can be added / removed dynamically

Efficient indexes

put_cell(100, 'BASE', {
  client_id: 10,
  driver_id: 437,
  fare: 10
})

put_cell(121, 'BASE', {
  client_id: 10,
  driver_id: 217,
  fare: 15
})

INDEX:
  name: CLIENT_INDEX
  column: BASE
  fields:
  • name: client_id
  • name: fare

[Diagram: each put_cell writes the cell and a corresponding index entry; the index is partitioned across Shards 0–3.]
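The index maintenance above can be sketched as follows. This is an illustrative in-memory model (class and method names are assumptions, not Schemaless internals): writing a cell to the indexed column denormalizes the declared fields into an index entry keyed back to the row UUID:

```python
class IndexedStore:
    """Sketch of a Schemaless-style secondary index: cells written to the
    indexed column also produce an index entry; in reality the entries
    are partitioned across shards by their leading field."""

    def __init__(self, name, column, fields):
        self.name, self.column, self.fields = name, column, fields
        self.cells = {}    # (uuid, column) -> data (latest version only)
        self.entries = []  # (field values..., row uuid)

    def put_cell(self, uuid, column, data):
        self.cells[(uuid, column)] = data
        if column == self.column:
            self.entries.append(tuple(data.get(f) for f in self.fields) + (uuid,))

    def query(self, **eq):
        """Equality lookup on indexed fields, returning row UUIDs."""
        hits = []
        for entry in self.entries:
            named = dict(zip(self.fields, entry[:-1]))
            if all(named.get(k) == v for k, v in eq.items()):
                hits.append(entry[-1])
        return hits
```

With CLIENT_INDEX on BASE(client_id, fare), the two `put_cell` calls from the slide make `query(client_id=10)` return both rows without touching the cell data itself.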

SLIDE 22

  • Internally organized as an ordered log (append-only datastore)
  • B-Tree index for (row, col, ts) lookups
  • Efficient scanning for changes over time

The duality of Schemaless: Log and Key-Value Store

[Diagram: put_cells append recent inserts at the head of each shard’s log (Shards 0–3), ordered by time.]
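The log/key-value duality can be illustrated with a per-shard table. The DDL below is an illustrative sketch, not Uber's actual schema: an auto-increment `added_id` records insertion (log) order, while a unique B-tree index on (row_key, column_key, ts) serves point lookups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cells (
        added_id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- log position
        row_key    TEXT NOT NULL,
        column_key TEXT NOT NULL,
        ts         INTEGER NOT NULL,
        body       TEXT NOT NULL                       -- JSON blob
    );
    CREATE UNIQUE INDEX cell_idx ON cells (row_key, column_key, ts);
""")

with conn:
    conn.executemany(
        "INSERT INTO cells (row_key, column_key, ts, body) VALUES (?, ?, ?, ?)",
        [("12AB", "BASE", 0, "{}"), ("F4CD", "BASE", 0, "{}"),
         ("12AB", "FARE", 0, "{}")])

# Key-value lookup through the unique index:
cell = conn.execute(
    "SELECT body FROM cells WHERE row_key = ? AND column_key = ? AND ts = ?",
    ("12AB", "FARE", 0)).fetchone()

# Log scan: everything written after a known offset, in write order --
# the access pattern triggers and change consumers rely on.
changes = conn.execute(
    "SELECT added_id, row_key, column_key FROM cells "
    "WHERE added_id > ? ORDER BY added_id", (1,)).fetchall()
```

The same rows answer both access patterns: a point read by (row, column, ts) and an ordered scan of everything written since a checkpoint.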

SLIDE 23

Data driven triggers

partition_read(partition, columns, offset_vector): cells

(BASE, ROUTE) —> FARE (BASE, FARE) -> CLIENT_BILLING (BASE, FARE) -> DRIVER_PAYOUT

All backend processing is triggered by data being written:
  • Functional programming paradigm
  • Robust in case of failures
  • Eliminates out-of-band message queues
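A trigger worker built on partition_read can be sketched as a polling loop with a per-shard offset vector (the `FakeStore` and `Cell` types below are illustrative stand-ins; only the `partition_read(partition, columns, offset_vector)` signature comes from the slide):

```python
from collections import namedtuple

Cell = namedtuple("Cell", "shard added_id column body")

class FakeStore:
    """Minimal stand-in for the store: returns a small batch of cells
    newer than the per-shard offsets, so the caller must loop."""
    def __init__(self, cells):
        self._cells = sorted(cells, key=lambda c: (c.shard, c.added_id))

    def partition_read(self, partition, columns, offset_vector):
        fresh = [c for c in self._cells
                 if c.column in columns
                 and c.added_id > offset_vector.get(c.shard, 0)]
        return fresh[:2]  # bounded batch size

def run_trigger(store, partition, columns, handler, offset_vector):
    """Data-driven trigger sketch: hand every newly written cell to the
    handler, then advance the offsets (an at-least-once checkpoint).
    No out-of-band message queue is involved."""
    while True:
        cells = store.partition_read(partition, columns, offset_vector)
        if not cells:
            return offset_vector  # caught up; a real worker sleeps and re-polls
        for cell in cells:
            handler(cell)  # e.g. (BASE, ROUTE) -> compute and write FARE
            offset_vector[cell.shard] = cell.added_id
```

Because the checkpoint only advances after the handler runs, a crashed worker replays from its last offsets, which is why the processing must tolerate at-least-once delivery.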

SLIDE 24
  • Bigtable-style API for storing JSON dictionaries
  • Horizontally-scalable in both IO and capacity
  • Append-only to track real-time interactions
  • Fast secondary indexes
  • Batch processing using triggers / partition_read API
  • No downtime for changing schemas or indexes
  • Built on a solid MySQL foundation

Schemaless features

SLIDE 25

Schemaless Availability

SLIDE 26

“The difference between a high-quality product and a low-quality product is how well it works when stressed”

SLIDE 27

What happens when a database dies?

[Diagram: the distribution layer routes put_cell to the right cluster; each cluster is a master with slaves.]

SLIDE 28

What happens when a database dies?

Slave failure: replace it

[Diagram: one slave is down (X); the distribution layer keeps routing put_cell to the master while the failed slave is replaced.]

SLIDE 29

What happens when a database dies?

Master failure: hinted handoff

[Diagram: a master is down (X); put_cell cannot reach it.]

Two options:
1) Fail the write (fine for batch)
2) Buffer the write and retry later
A slave is promoted to master.

SLIDE 30

Consistency guarantees

The trigger API hides the buffering from the application programmer. Log and retry: commutative operations make it simple.

put_cell(row, column, ts, json, buffered = true)
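The buffered-write path can be sketched like this. `primary_write` and the in-memory queue are illustrative stand-ins (the real buffer is durable); what the sketch shows is why write-once cells make the retry safe:

```python
import queue

def put_cell_buffered(primary_write, buffer_q, row, column, ts, data):
    """'Buffer and retry' option: if the shard master is unreachable,
    park the write locally instead of failing the request."""
    try:
        primary_write(row, column, ts, data)
        return "written"
    except ConnectionError:
        buffer_q.put((row, column, ts, data))
        return "buffered"

def drain_buffer(primary_write, buffer_q):
    """Replay buffered writes once a slave has been promoted to master.
    Cells are write-once, so replaying them is commutative and safe."""
    replayed = 0
    while not buffer_q.empty():
        primary_write(*buffer_q.get())
        replayed += 1
    return replayed
```

A reader of the cell sees it appear later than usual, but never sees two conflicting versions, because the (row, column, ts) address pins each buffered write to exactly one cell.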

SLIDE 31

Handling growth

Database Cluster Splits

[Diagram: a single cluster serving shards 1–4, taking lots of writes.]

SLIDE 32

Handling growth

Database Cluster Splits

[Diagram: shards 1–4 mirrored onto a second cluster.]

Set up a read-only shadow cluster in the background

SLIDE 33

Handling growth

Database Cluster Splits

[Diagram: shards 1 and 3 stay on the original cluster; shards 2 and 4 are served by the new cluster.]

Make the new cluster writable; delete the extra shards on each side
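The final remapping step of a split can be sketched as reassigning alternate shards to the shadow cluster (names and the alternate-shard policy are illustrative; the slide only fixes that shards 1 and 3 stay while 2 and 4 move):

```python
def split_cluster(shard_map, source, target):
    """After the read-only shadow cluster (`target`) has copied all of
    `source`'s data, point every other shard at the new cluster. The
    copies left on the wrong side are then deleted."""
    owned = sorted(s for s, c in shard_map.items() if c == source)
    moved = owned[1::2]  # alternate shards go to the new cluster
    for s in moved:
        shard_map[s] = target
    return moved
```

Since the shadow already holds a full copy, the cutover is just a routing change; no data moves at the moment of the split.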

SLIDE 34

How fast can we add a MySQL slave? How fast can a slave DB be promoted to a master DB?

Key operations

SLIDE 35

Physical restore limits

The network is not infinitely fast

Typical restore SLA is less than 30 minutes

10 Gbps NIC: 3.6 TB in 1 hour; 512 GB in 30 min w/ 25% network saturation
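The 512 GB figure is straightforward arithmetic; a tiny helper (illustrative, not part of any tooling) makes the envelope explicit:

```python
def restore_minutes(data_gb, nic_gbps=10, utilization=0.25):
    """Back-of-the-envelope restore time: bytes to copy divided by the
    share of NIC bandwidth the restore is allowed to use."""
    bytes_per_sec = nic_gbps * utilization * 1e9 / 8  # Gbit/s -> bytes/s
    return data_gb * 1e9 / bytes_per_sec / 60
```

At 25% of 10 Gbps (0.3125 GB/s), 512 GB takes about 27 minutes, which is why 512 GB is roughly the largest unit that fits a 30-minute restore SLA.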

SLIDE 36

Partition data within hosts

Restore chunks in parallel (e.g. an 8 TB host split into per-partition chunks)

SLIDE 37

Operating Storage at Scale

SLIDE 38

Operational Infrastructure

The Schemaless Storage System

[Same architecture diagram as Slide 9: regions, zones, Schemaless instances, datastores, and services, with the Service Layer as developer abstraction over a self-service, self-healing storage system spanning datacenters.]

SLIDE 39

Do more with less

SLIDE 40

[Screenshot of a runbook: a 1236-word setup section and a 1557-word execution section]

SLIDE 41
  • Host and rack failures
  • Upgrading the OS or MySQL across all boxes
  • Somebody ran manual commands on a host (!)
  • Manual steps applied inconsistently
  • Running out of disk space
  • Capacity planning
  • Performance tuning and debugging
  • Creating new instances & indexes

What gets hard with scale?

Drift, Drift, and Drift
SLIDE 42

Pets vs Cattle

“Pets”

  • Unique names (“fluffy”, “biscuit”, etc.)
  • Know address by heart
  • Being nurtured when becoming ill

“Cattle”

  • Enumerated name (cow0235)
  • Arbitrary address
  • Replaceable when becoming ill
SLIDE 43

So what does that mean?

Pets World View                        Cattle World View
The desired state is in your head      Desired state is codified
Driven by you                          Driven by software
Making changes is cool                 Making changes is a non-event
Operations made directly on hosts      Changing the model
Runbooks                               Autonomous
Brittle                                Robust
Operation oriented                     Goal-state oriented

SLIDE 44
  • Less than 80% disk space used
  • Hosts use Linux kernel version X
  • At least one database is backed up in each cluster
  • A cluster has the desired number of slaves
  • Instance has X clusters
  • Instance X exists

What is a goal state…?

SLIDE 45

The goal-state engine

[Diagram: operator input feeds the Goal State; the engine evaluates the goal state against the actual state (the current drift) and takes corrective actions, continuously.]

Idempotent, Robust, Restartable, Continuous, Self-healing, Simple
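One pass of the evaluate/act loop can be sketched in a few lines (`apply_action` is a hypothetical callback; the real engine's actions are host and database operations):

```python
def reconcile(goal_state, actual_state, apply_action):
    """One pass of the goal-state engine: compute the current drift
    (goal entries the actual state does not satisfy) and run a
    corrective action for each. apply_action performs the change and
    returns the resulting state."""
    drift = {key: want for key, want in goal_state.items()
             if actual_state.get(key) != want}
    for key, want in drift.items():
        actual_state[key] = apply_action(key, want)
    return drift
```

The loop is idempotent: once the actual state matches the goal, rerunning it computes zero drift and does nothing, which is what makes it safe to run continuously and to restart after a crash.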

SLIDE 46

Example: Goal-State engine on a host

[Diagram: a host runs one master (M) and two slaves (S) in Docker containers; the Opsless-Agent pulls the goal state and reports the actual state back.]

Goal State vs. Actual State:

DB        Role    Sync_From   Actual Role   Read-only   Issues
foo-db1   master  -           master        no          none
bar-db2   slave   bar-db1     slave         yes         none
baz-db10  slave   baz-db9     slave         yes         none

SLIDE 47

Let’s promote foo-db2 to be the new master

[Diagram: one host runs the slave (S), the other runs the master (M); each host's Opsless-Agent pulls its own goal state.]

Goal State vs. Actual State:

DB       Host    Role    Sync_From   Actual Role   Read-only   Issues
foo-db1  Host A  master  -           master        no          none
foo-db2  Host B  slave   foo-db1     slave         yes         none

Goalstate: { foo-db1: { role: master } }
Goalstate: { foo-db2: { role: slave, … } }

SLIDE 48

We just have to change the goal-state

[Diagram: same two hosts; only the central goal state has changed, the hosts have not acted yet.]

Goal State vs. Actual State:

DB       Host    Role    Actual Role   Read-only   Issues
foo-db1  Host A  idle    master        no          Wrong role
foo-db2  Host B  master  slave         yes         Wrong role

Goalstate: { foo-db1: { role: master } }
Goalstate: { foo-db2: { role: slave, … } }

SLIDE 49

The state is propagated to the hosts…

[Diagram: the new goal state is pushed to the hosts' Opsless-Agents.]

Goal State vs. Actual State:

DB       Host    Role    Actual Role   Read-only   Issues
foo-db1  Host A  idle    master        no          Wrong role
foo-db2  Host B  master  slave         yes         Wrong role

Goalstate: { foo-db1: { role: master } }
Goalstate: { foo-db2: { role: master, … } }

The agent waits for replication lag to reach zero and for the old master to become read-only.
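The promotion sequence the agent drives can be sketched as follows; `FakeDB` and its methods are hypothetical stand-ins for a managed MySQL instance, and a real agent sleeps between lag polls:

```python
class FakeDB:
    """Hypothetical stand-in for a managed MySQL instance; replication
    lag drains by one on every poll."""
    def __init__(self, role, lag=0):
        self.role, self.read_only, self._lag = role, role != "master", lag

    def set_read_only(self, flag):
        self.read_only = flag

    def replication_lag(self):
        lag, self._lag = self._lag, max(0, self._lag - 1)
        return lag

def promote(old_master, new_master):
    """Freeze writes on the old master, wait for the slave to fully
    catch up, then swap the roles."""
    old_master.set_read_only(True)           # no writes can race the switch
    while new_master.replication_lag() > 0:  # wait for lag to reach zero
        pass                                 # a real agent sleeps between polls
    new_master.set_read_only(False)
    new_master.role, old_master.role = "master", "idle"
```

Freezing the old master first is what guarantees no committed write is lost: once it is read-only and lag is zero, the new master has everything.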

SLIDE 50

And that is it!

[Diagram: the promoted master (M) now runs in place of the old slave; the old master is idle.]

Goal State vs. Actual State:

DB       Host    Role    Actual Role   Read-only   Issues
foo-db1  Host A  idle    idle          yes         none
foo-db2  Host B  master  master        no          none

Goalstate: { foo-db1: { role: idle } }
Goalstate: { foo-db2: { role: master, … } }

SLIDE 51

SLIDE 52

Consensus protocol within a cluster

RAFT Consensus Protocol

  • New master promoted in less than 100 ms
  • Hidden by client retries
  • Nothing to do for the operator

SLIDE 53

Operational Infrastructure

[Same architecture diagram as Slide 9: regions, zones, Schemaless instances, datastores, and services, with the Service Layer as developer abstraction over a self-service, self-healing storage system spanning datacenters.]

SLIDE 54

Where to learn more…

eng.uber.com

SLIDE 55