Taking Storage for a Ride, by René W. Schmidt, Storage Platform (PowerPoint PPT Presentation)



SLIDE 1

Taking Storage for a Ride

René W. Schmidt, Storage Platform

NOVEMBER 4, 2016

SLIDE 2

Uber for almost 3 years working on scaling out our storage infrastructure across the planet. VMware for 10+ years. Part of the team that released Virtual Center 1.0, and many vSphere releases since then. Sun Microsystems for 4 years. Part of the team that shipped the Java Hotspot Virtual Machine 1.0, and Java Web Start 1.0.

About me

SLIDE 3

Problem Statement

SLIDE 4

Backend Marketplace

[Diagram: Operational Data (all ongoing trips; billing, payouts, user accounts, trip histories, fraud, etc.) alongside the Data Warehouse.]

SLIDE 5

Not so long ago…

Backend Marketplace

SLIDE 6

We were there

SLIDE 7

Latency, Scalability, Availability, Development Agility

Storage bottleneck

[Diagram: a single database holding Misc, Users, and Trips data]

As of early 2014

SLIDE 8

The new world…

Backend Marketplace The Schemaless Storage System

SLIDE 9

Operational Infrastructure

The Schemaless Storage System

[Diagram: Regions contain Zones (US-West, US-East, CN-East, CN-West); each zone runs Schemaless instances; datastores such as Trips, Ratings, and Receipts back services such as the Trips Service, Billing Service, and Ratings Service.]

[Layers: Service Layer (developer abstraction) on top of a self-service, self-healing storage system, on top of datacenters.]

SLIDE 10

More than 80% of Uber’s operational data is in Schemaless. From a single datastore (trips) to 300+ datastores. From 48 MySQL hosts to many thousands of MySQL instances.

Status after 2 years in production

SLIDE 11

Schemaless Architecture

SLIDE 12

Requirements

API & Features

(make developers efficient and happy)

Scalability and Efficiency

(qps, capacity, $/GB, trust in operation)

Availability

(4 9’s, zero-downtime operations, hide failures)

Time to Market

SLIDE 13

Easy to replace Postgres

SQL-like secondary indexes; fast queries

select uuid from trips where user_uuid = ? and status = ? and request_at > ? and request_at < ?

SLIDE 14

Support batch operations

Columns: BASE, ROUTE, FARE, PAYOUT, BILLING

{ trip_uuid: …, rider_uuid: …, driver_uuid: …, gps_points: […], payment_info: …, driver_rating: …, client_rating: …, receipt: …, payout: … }

SLIDE 15

Microservices

Trips Datastore

1000+ services (most are stateless). Each service can request its own storage.

SLIDE 16

Durability

SLIDE 17

SLIDE 18

Scalability

[Scale figures: ~1 PB w/ redundancy, e.g. 128 hosts × 8 TB or 512 hosts × 2 TB]

Scalability & Reliability

SLIDE 19

Ledger-style API

Tracking of real-life interactions

UUID   BASE           ROUTE          FARE                                       RATING
12AB   { json dict }  { json dict }  { json dict }
F4CD   { json dict }  { json dict }  { json dict …, ts: 0 } { json dict …, ts: 2 } …   { json dict }

put_cell(uuid, column_key, ts, data)
get_cell(uuid, column_key, ts)
get_cell_latest(uuid, column_key)

Simple, proven, schemaless data model. Append-only: each cell can only be written once.
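The ledger-style API above can be sketched in a few lines. This is a toy in-memory model (class and internals are illustrative, not Uber's implementation) showing the write-once cell addressed by (row UUID, column key, version ts):

```python
class CellStore:
    """Toy sketch of the ledger-style cell model: a cell is addressed by
    (row uuid, column_key, ts) and can only be written once."""

    def __init__(self):
        self._cells = {}  # (uuid, column_key, ts) -> JSON-like dict

    def put_cell(self, uuid, column_key, ts, data):
        key = (uuid, column_key, ts)
        if key in self._cells:
            # Append-only: rewriting an existing version is an error.
            raise ValueError("cell %r already written" % (key,))
        self._cells[key] = data

    def get_cell(self, uuid, column_key, ts):
        return self._cells.get((uuid, column_key, ts))

    def get_cell_latest(self, uuid, column_key):
        versions = [ts for (u, c, ts) in self._cells if (u, c) == (uuid, column_key)]
        return self._cells[(uuid, column_key, max(versions))] if versions else None
```

For example, writing FARE versions ts=0 and ts=2 for row F4CD makes `get_cell_latest` return the ts=2 dict, and rewriting ts=0 raises.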

SLIDE 20

Physical storage layout

[Diagram: logical model (row A007 with JSON cells) → sharding function (% 4096) → fixed set of shards (…, 492, 493, 494, 495, …) → expandable set of MySQL clusters]
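A minimal sketch of the two-level mapping, assuming a hash of the row UUID mod 4096 (the real hash function and cluster assignment are internal to Schemaless; these stand-ins are illustrative):

```python
import uuid

NUM_SHARDS = 4096  # fixed set of shards, per the slide

def shard_for(row_uuid: uuid.UUID) -> int:
    """Map a row UUID to one of 4096 fixed shards. The UUID's integer
    value mod 4096 stands in for the real hash here."""
    return int(row_uuid) % NUM_SHARDS

def cluster_for(shard: int, num_clusters: int) -> int:
    """Map a shard to a MySQL cluster. Contiguous shard ranges per
    cluster (illustrative only) make a cluster split a matter of
    reassigning half of a range."""
    return shard * num_clusters // NUM_SHARDS
```

Because the shard count is fixed, adding clusters only changes the shard-to-cluster mapping; rows never need to be rehashed.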

SLIDE 21

  • Defined on columns
  • Scalable: partitioned across shards
  • Fast queries: only a single shard needs to be queried
  • Can be added / removed dynamically

Efficient indexes

put_cell(100, 'BASE', {
  client_id: 10,
  driver_id: 437,
  fare: 10
})

put_cell(121, 'BASE', {
  client_id: 10,
  driver_id: 217,
  fare: 15
})

INDEX:
  name: CLIENT_INDEX
  column: BASE
  fields:
  • name: client_id
  • name: fare

[Diagram: each put_cell writes the cell and a corresponding index entry; the index is partitioned across Shards 0–3.]
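The index maintenance above can be sketched as follows. This is an illustrative in-memory model (class and method names are assumptions, not Schemaless internals): writing a cell to the indexed column denormalizes the declared fields into an index entry keyed back to the row UUID:

```python
class IndexedStore:
    """Sketch of a Schemaless-style secondary index: cells written to the
    indexed column also produce an index entry; in reality the entries
    are partitioned across shards by their leading field."""

    def __init__(self, name, column, fields):
        self.name, self.column, self.fields = name, column, fields
        self.cells = {}    # (uuid, column) -> data (latest version only)
        self.entries = []  # (field values..., row uuid)

    def put_cell(self, uuid, column, data):
        self.cells[(uuid, column)] = data
        if column == self.column:
            self.entries.append(tuple(data.get(f) for f in self.fields) + (uuid,))

    def query(self, **eq):
        """Equality lookup on indexed fields, returning row UUIDs."""
        hits = []
        for entry in self.entries:
            named = dict(zip(self.fields, entry[:-1]))
            if all(named.get(k) == v for k, v in eq.items()):
                hits.append(entry[-1])
        return hits
```

With CLIENT_INDEX on BASE(client_id, fare), the two `put_cell` calls from the slide make `query(client_id=10)` return both rows without touching the cell data itself.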

SLIDE 22

  • Internally organized as an ordered log (append-only datastore)
  • B-Tree index for (row, col, ts) lookups
  • Efficient scanning for changes over time

The duality of Schemaless: Log and Key-Value Store

[Diagram: put_cells append recent inserts at the head of each shard’s log (Shards 0–3), ordered by time.]
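The log/key-value duality can be illustrated with a per-shard table. The DDL below is an illustrative sketch, not Uber's actual schema: an auto-increment `added_id` records insertion (log) order, while a unique B-tree index on (row_key, column_key, ts) serves point lookups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cells (
        added_id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- log position
        row_key    TEXT NOT NULL,
        column_key TEXT NOT NULL,
        ts         INTEGER NOT NULL,
        body       TEXT NOT NULL                       -- JSON blob
    );
    CREATE UNIQUE INDEX cell_idx ON cells (row_key, column_key, ts);
""")

with conn:
    conn.executemany(
        "INSERT INTO cells (row_key, column_key, ts, body) VALUES (?, ?, ?, ?)",
        [("12AB", "BASE", 0, "{}"), ("F4CD", "BASE", 0, "{}"),
         ("12AB", "FARE", 0, "{}")])

# Key-value lookup through the unique index:
cell = conn.execute(
    "SELECT body FROM cells WHERE row_key = ? AND column_key = ? AND ts = ?",
    ("12AB", "FARE", 0)).fetchone()

# Log scan: everything written after a known offset, in write order --
# the access pattern triggers and change consumers rely on.
changes = conn.execute(
    "SELECT added_id, row_key, column_key FROM cells "
    "WHERE added_id > ? ORDER BY added_id", (1,)).fetchall()
```

The same rows answer both access patterns: a point read by (row, column, ts) and an ordered scan of everything written since a checkpoint.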

SLIDE 23

Data driven triggers

partition_read(partition, columns, offset_vector): cells

(BASE, ROUTE) —> FARE (BASE, FARE) -> CLIENT_BILLING (BASE, FARE) -> DRIVER_PAYOUT

All backend processing is triggered by data being written:
  • Functional programming paradigm
  • Robust in case of failures
  • Eliminates out-of-band message queues
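A trigger worker built on partition_read can be sketched as a polling loop with a per-shard offset vector (the `FakeStore` and `Cell` types below are illustrative stand-ins; only the `partition_read(partition, columns, offset_vector)` signature comes from the slide):

```python
from collections import namedtuple

Cell = namedtuple("Cell", "shard added_id column body")

class FakeStore:
    """Minimal stand-in for the store: returns a small batch of cells
    newer than the per-shard offsets, so the caller must loop."""
    def __init__(self, cells):
        self._cells = sorted(cells, key=lambda c: (c.shard, c.added_id))

    def partition_read(self, partition, columns, offset_vector):
        fresh = [c for c in self._cells
                 if c.column in columns
                 and c.added_id > offset_vector.get(c.shard, 0)]
        return fresh[:2]  # bounded batch size

def run_trigger(store, partition, columns, handler, offset_vector):
    """Data-driven trigger sketch: hand every newly written cell to the
    handler, then advance the offsets (an at-least-once checkpoint).
    No out-of-band message queue is involved."""
    while True:
        cells = store.partition_read(partition, columns, offset_vector)
        if not cells:
            return offset_vector  # caught up; a real worker sleeps and re-polls
        for cell in cells:
            handler(cell)  # e.g. (BASE, ROUTE) -> compute and write FARE
            offset_vector[cell.shard] = cell.added_id
```

Because the checkpoint only advances after the handler runs, a crashed worker replays from its last offsets, which is why the processing must tolerate at-least-once delivery.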

SLIDE 24
  • Bigtable-style API for storing JSON dictionaries
  • Horizontally-scalable in both IO and capacity
  • Append-only to track real-time interactions
  • Fast secondary indexes
  • Batch processing using triggers / partition_read API
  • No downtime for changing schemas or indexes
  • Built on a solid MySQL foundation

Schemaless features

SLIDE 25

Schemaless Availability

SLIDE 26

“The difference between a high-quality product and a low-quality product is how well it works when stressed”

SLIDE 27

What happens when a database dies?

[Diagram: the distribution layer routes put_cell to the right cluster; each cluster is a master with slaves.]

SLIDE 28

What happens when a database dies?

Slave failure: replace it

[Diagram: one slave is down (X); the distribution layer keeps routing put_cell to the master while the failed slave is replaced.]

SLIDE 29

What happens when a database dies?

Master failure: hinted handoff

[Diagram: a master is down (X); put_cell cannot reach it.]

Two options:
1) Fail the write (fine for batch)
2) Buffer the write and retry later
A slave is promoted to master.

SLIDE 30

Consistency guarantees

The trigger API hides the buffering from the application programmer. Log and retry: commutative operations make it simple.

put_cell(row, column, ts, json, buffered = true)
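The buffered-write path can be sketched like this. `primary_write` and the in-memory queue are illustrative stand-ins (the real buffer is durable); what the sketch shows is why write-once cells make the retry safe:

```python
import queue

def put_cell_buffered(primary_write, buffer_q, row, column, ts, data):
    """'Buffer and retry' option: if the shard master is unreachable,
    park the write locally instead of failing the request."""
    try:
        primary_write(row, column, ts, data)
        return "written"
    except ConnectionError:
        buffer_q.put((row, column, ts, data))
        return "buffered"

def drain_buffer(primary_write, buffer_q):
    """Replay buffered writes once a slave has been promoted to master.
    Cells are write-once, so replaying them is commutative and safe."""
    replayed = 0
    while not buffer_q.empty():
        primary_write(*buffer_q.get())
        replayed += 1
    return replayed
```

A reader of the cell sees it appear later than usual, but never sees two conflicting versions, because the (row, column, ts) address pins each buffered write to exactly one cell.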

SLIDE 31

Handling growth

Database Cluster Splits

[Diagram: a single cluster serving shards 1–4, taking lots of writes.]

SLIDE 32

Handling growth

Database Cluster Splits

[Diagram: shards 1–4 mirrored onto a second cluster.]

Set up a read-only shadow cluster in the background

SLIDE 33

Handling growth

Database Cluster Splits

[Diagram: shards 1 and 3 stay on the original cluster; shards 2 and 4 are served by the new cluster.]

Make the new cluster writable; delete the extra shards on each side
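The final remapping step of a split can be sketched as reassigning alternate shards to the shadow cluster (names and the alternate-shard policy are illustrative; the slide only fixes that shards 1 and 3 stay while 2 and 4 move):

```python
def split_cluster(shard_map, source, target):
    """After the read-only shadow cluster (`target`) has copied all of
    `source`'s data, point every other shard at the new cluster. The
    copies left on the wrong side are then deleted."""
    owned = sorted(s for s, c in shard_map.items() if c == source)
    moved = owned[1::2]  # alternate shards go to the new cluster
    for s in moved:
        shard_map[s] = target
    return moved
```

Since the shadow already holds a full copy, the cutover is just a routing change; no data moves at the moment of the split.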

SLIDE 34

How fast can we add a MySQL slave? How fast can a slave DB be promoted to a master DB?

Key operations

SLIDE 35

Physical restore limits

The network is not infinitely fast

Typical restore SLA is less than 30 minutes

10 Gbps NIC: 3.6 TB in 1 hour; 512 GB in 30 min w/ 25% network saturation
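The 512 GB figure is straightforward arithmetic; a tiny helper (illustrative, not part of any tooling) makes the envelope explicit:

```python
def restore_minutes(data_gb, nic_gbps=10, utilization=0.25):
    """Back-of-the-envelope restore time: bytes to copy divided by the
    share of NIC bandwidth the restore is allowed to use."""
    bytes_per_sec = nic_gbps * utilization * 1e9 / 8  # Gbit/s -> bytes/s
    return data_gb * 1e9 / bytes_per_sec / 60
```

At 25% of 10 Gbps (0.3125 GB/s), 512 GB takes about 27 minutes, which is why 512 GB is roughly the largest unit that fits a 30-minute restore SLA.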

SLIDE 36

Partition data within hosts

Restore chunks in parallel (e.g. an 8 TB host split into per-partition chunks)

SLIDE 37

Operating Storage at Scale

SLIDE 38

Operational Infrastructure

The Schemaless Storage System

[Same architecture diagram as Slide 9: regions, zones, Schemaless instances, datastores, and services, with the Service Layer as developer abstraction over a self-service, self-healing storage system spanning datacenters.]

SLIDE 39

Do more with less

SLIDE 40

[Screenshot of a runbook: a 1236-word setup section and a 1557-word execution section]

SLIDE 41
  • Host and rack failures
  • Upgrading the OS or MySQL across all boxes
  • Somebody ran manual commands on a host (!)
  • Manual steps applied inconsistently
  • Running out of disk space
  • Capacity planning
  • Performance tuning and debugging
  • Creating new instances & indexes

What gets hard with scale?

Drift, Drift, and Drift
SLIDE 42

Pets vs Cattle

“Pets”

  • Unique names (“fluffy”, “biscuit”, etc.)
  • Know address by heart
  • Being nurtured when becoming ill

“Cattle”

  • Enumerated name (cow0235)
  • Arbitrary address
  • Replaceable when becoming ill
SLIDE 43

So what does that mean?

Pets World View                        Cattle World View
The desired state is in your head      Desired state is codified
Driven by you                          Driven by software
Making changes is cool                 Making changes is a non-event
Operations made directly on hosts      Changing the model
Runbooks                               Autonomous
Brittle                                Robust
Operation oriented                     Goal-state oriented

SLIDE 44
  • Less than 80% disk space used
  • Hosts use Linux kernel version X
  • At least one database is backed up in each cluster
  • A cluster has the desired number of slaves
  • Instance has X clusters
  • Instance X exists

What is a goal state…?

SLIDE 45

The goal-state engine

[Diagram: operator input feeds the Goal State; the engine evaluates the goal state against the actual state (the current drift) and takes corrective actions, continuously.]

Idempotent, Robust, Restartable, Continuous, Self-healing, Simple
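One pass of the evaluate/act loop can be sketched in a few lines (`apply_action` is a hypothetical callback; the real engine's actions are host and database operations):

```python
def reconcile(goal_state, actual_state, apply_action):
    """One pass of the goal-state engine: compute the current drift
    (goal entries the actual state does not satisfy) and run a
    corrective action for each. apply_action performs the change and
    returns the resulting state."""
    drift = {key: want for key, want in goal_state.items()
             if actual_state.get(key) != want}
    for key, want in drift.items():
        actual_state[key] = apply_action(key, want)
    return drift
```

The loop is idempotent: once the actual state matches the goal, rerunning it computes zero drift and does nothing, which is what makes it safe to run continuously and to restart after a crash.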

SLIDE 46

Example: Goal-State engine on a host

[Diagram: a host runs one master (M) and two slaves (S) in Docker containers; the Opsless-Agent pulls the goal state and reports the actual state back.]

Goal State vs. Actual State:

DB        Role    Sync_From   Actual Role   Read-only   Issues
foo-db1   master  -           master        no          none
bar-db2   slave   bar-db1     slave         yes         none
baz-db10  slave   baz-db9     slave         yes         none

SLIDE 47

Let’s promote foo-db2 to be the new master

[Diagram: one host runs the slave (S), the other runs the master (M); each host's Opsless-Agent pulls its own goal state.]

Goal State vs. Actual State:

DB       Host    Role    Sync_From   Actual Role   Read-only   Issues
foo-db1  Host A  master  -           master        no          none
foo-db2  Host B  slave   foo-db1     slave         yes         none

Goalstate: { foo-db1: { role: master } }
Goalstate: { foo-db2: { role: slave, … } }

SLIDE 48

We just have to change the goal-state

[Diagram: same two hosts; only the central goal state has changed, the hosts have not acted yet.]

Goal State vs. Actual State:

DB       Host    Role    Actual Role   Read-only   Issues
foo-db1  Host A  idle    master        no          Wrong role
foo-db2  Host B  master  slave         yes         Wrong role

Goalstate: { foo-db1: { role: master } }
Goalstate: { foo-db2: { role: slave, … } }

SLIDE 49

The state is propagated to the hosts…

[Diagram: the new goal state is pushed to the hosts' Opsless-Agents.]

Goal State vs. Actual State:

DB       Host    Role    Actual Role   Read-only   Issues
foo-db1  Host A  idle    master        no          Wrong role
foo-db2  Host B  master  slave         yes         Wrong role

Goalstate: { foo-db1: { role: master } }
Goalstate: { foo-db2: { role: master, … } }

The agent waits for replication lag to reach zero and for the old master to become read-only.
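The promotion sequence the agent drives can be sketched as follows; `FakeDB` and its methods are hypothetical stand-ins for a managed MySQL instance, and a real agent sleeps between lag polls:

```python
class FakeDB:
    """Hypothetical stand-in for a managed MySQL instance; replication
    lag drains by one on every poll."""
    def __init__(self, role, lag=0):
        self.role, self.read_only, self._lag = role, role != "master", lag

    def set_read_only(self, flag):
        self.read_only = flag

    def replication_lag(self):
        lag, self._lag = self._lag, max(0, self._lag - 1)
        return lag

def promote(old_master, new_master):
    """Freeze writes on the old master, wait for the slave to fully
    catch up, then swap the roles."""
    old_master.set_read_only(True)           # no writes can race the switch
    while new_master.replication_lag() > 0:  # wait for lag to reach zero
        pass                                 # a real agent sleeps between polls
    new_master.set_read_only(False)
    new_master.role, old_master.role = "master", "idle"
```

Freezing the old master first is what guarantees no committed write is lost: once it is read-only and lag is zero, the new master has everything.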

SLIDE 50

And that is it!

[Diagram: the promoted master (M) now runs in place of the old slave; the old master is idle.]

Goal State vs. Actual State:

DB       Host    Role    Actual Role   Read-only   Issues
foo-db1  Host A  idle    idle          yes         none
foo-db2  Host B  master  master        no          none

Goalstate: { foo-db1: { role: idle } }
Goalstate: { foo-db2: { role: master, … } }

SLIDE 51

SLIDE 52

Consensus protocol within a cluster

RAFT Consensus Protocol

  • New master promoted in less than 100 ms
  • Hidden by client retries
  • Nothing to do for the operator

SLIDE 53

Operational Infrastructure

[Same architecture diagram as Slide 9: regions, zones, Schemaless instances, datastores, and services, with the Service Layer as developer abstraction over a self-service, self-healing storage system spanning datacenters.]

SLIDE 54

Where to learn more…

eng.uber.com

SLIDE 55