Multi-Tenancy & Isolation (Bogdan Munteanu - Dropbox)



SLIDE 1

Multi-Tenancy & Isolation

Bogdan Munteanu - Dropbox

SLIDE 2

Overview

  • What is Edgestore?
  • Workloads & API
  • Multi-tenancy & Isolation
  • Lessons Learned
SLIDE 3

What is Edgestore

  • Distributed Metadata Store built on top of MySQL
  • Highly Available, Scalable, Durable
  • Abstract away sharding and caching
  • Reduce operational burden
  • Flexible schemas
  • Multi-Region Setup
SLIDE 4

Architecture

SLIDE 5

Architecture cont’d

  • 2048 Shards
  • 8 Shards per Engine (and MySQL cluster)
  • 1 Master - 2 Slaves (semi-sync)
  • Multi-region setup
SLIDE 6

MySQL vs. Edgestore

MySQL: Team

| Id | Company | Size |
| 1  | Expedia | 5000 |
| 2  | NatGeo  | 500  |
| 3  | Intuit  | 2000 |
| 4  | Spotify | 600  |

MySQL: User

| Id | Email   | Name  | Type |
| 1  | jondoe@ | Jon   | Free |
| 2  | jenny@  | Jenny | Pro  |

Edgestore: Edgedata

| Schema | Id   | Data                       |
| Team   | 10:1 | Company:Expedia; Size:5000 |
| Team   | 20:1 | Company:NatGeo; Size:500   |
| Team   | 30:7 | Company:Intuit; Size:2000  |
| Team   | 35:3 | Company:Spotify; Size:600  |

| Edge type    | Gid  | Data                               |
| Photo Entity | ?    | Name:SF.jpg; Size:64               |
| Photo Entity | ?    | Name:Hawaii; Size:64               |
| Photo Entity | ?    | Name:Tahoe.jpg; Size:128           |
| Photo Entity | ?    | Name:Office.jpg; Size:1024         |
| User         | 15:1 | Email:jondoe@; Name:Jon; Type:Free |
| User         | 20:2 | Email:jenny@; Name:Jenny; Type:Pro |

SLIDE 7

Shard the table

Shard 1, Shard 2, ... Shard n (one table per shard):

| Schema | Id   | Data                       |
| Team   | 10:1 | Company:Expedia; Size:5000 |
| Team   | 20:4 | Company:NatGeo; Size:500   |

| Schema | Id   | Data                               |
| Team   | 30:1 | Company:Intuit; Size:2000          |
| User   | 40:2 | Email:jondoe@; Name:Jon; Type:Free |

| Schema | Id   | Data                               |
| Team   | 50:2 | Company:Spotify; Size:600          |
| User   | 60:1 | Email:jenny@; Name:Jenny; Type:Pro |

SLIDE 8

Restricted API

  • Create/Update/Delete
      • single and batch
      • Compare and Set semantics
  • Reads:
      • Read(Id, )
      • List(Id, *)
      • Count(Id, *)
      • List(Id, condition=[equals, prefix, range])
      • ReadLog(Id)
      • ListLog(Id, *)
  • Acquire Read/Write Lock
  • Commit/Rollback
  • Strong consistency semantics
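
The compare-and-set semantics listed above can be illustrated with a minimal in-memory sketch. The `Store` class, its `read` and `compare_and_set` methods, and the per-row version counter are all hypothetical; the slides do not show Edgestore's actual client API.

```python
import threading


class Store:
    """Toy in-memory store. Hypothetical names, NOT the Edgestore API."""

    def __init__(self):
        self._rows = {}                # row id -> (version, data)
        self._lock = threading.Lock()

    def read(self, row_id):
        """Return (version, data), or (0, None) if the row does not exist."""
        with self._lock:
            return self._rows.get(row_id, (0, None))

    def compare_and_set(self, row_id, expected_version, data):
        """Write only if the row is still at expected_version."""
        with self._lock:
            version, _ = self._rows.get(row_id, (0, None))
            if version != expected_version:
                return False           # a concurrent writer won; caller must re-read
            self._rows[row_id] = (version + 1, data)
            return True


store = Store()
version, _ = store.read("Team/10:1")
# The first writer sees the version it read, so its write is applied.
first = store.compare_and_set("Team/10:1", version, {"Company": "Expedia", "Size": 5000})
# A second writer still holding the old version is rejected.
stale = store.compare_and_set("Team/10:1", version, {"Company": "Expedia", "Size": 1})
```

A rejected writer would normally re-read the row and retry with the new version.
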
SLIDE 9

Workloads

  • 10 million QPS
  • 600k Writes / second
  • 9.4 million Reads / second
  • 90% of Reads are cache hits
  • 1.5 million QPS to Engine fleet
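
The figures above are consistent with each other, as this short check shows. It assumes that every write reaches the Engine fleet and that only cache misses among the reads do; the slide does not state this explicitly.

```python
total_qps = 10_000_000
writes_per_sec = 600_000
reads_per_sec = 9_400_000
cache_hit_percent = 90

# Reads that miss the cache fall through to the Engine fleet.
cache_misses_per_sec = reads_per_sec * (100 - cache_hit_percent) // 100

# Assumption: writes are never absorbed by the cache.
engine_qps = writes_per_sec + cache_misses_per_sec

print(engine_qps)  # 1540000, which the slide rounds to 1.5 million
```
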
SLIDE 10

Workloads cont’d

  • Batch Size 1 to 10000
  • Some read requests can return 1 row
  • Some can return 100000 rows
  • Rows range from a few bytes to several MB
  • 500+ unique Schemas
SLIDE 11

Engine

  • Proto -> SQL Query
  • Query Result -> Proto
  • Control / Reduce load to MySQL
  • Connection Pooling

SLIDE 12

Workloads cont’d

  • High QPS
  • Write / Read
  • Large / expensive requests:
  • Write - large transactions
  • Read - large number of rows, or large rows
  • Multi-Read / Multi-Write
SLIDE 13

[Diagram: Engine with a Request Handler and a Resource Pool]

Single Request - 1 token

SLIDE 14

[Diagram: Engine with a Request Handler and a Resource Pool]

Batch (parallel) Request - n tokens: one goroutine per Id (Id1, Id2, Id3)

SLIDE 15

[Diagram: Engine with a Request Handler and a Resource Pool]

Batch (sequential) Request - n tokens: Ids handled in ranges (Id 1 - Id 10, Id 11 - Id 20, Id 21 - Id 30)
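
The token accounting on the last three slides can be sketched as a fixed-size pool from which a single request takes 1 token and a batch takes n. This is an illustration only: `TokenPool`, `try_acquire`, and `release` are hypothetical names, the pool size of 10 is made up, and the non-blocking behaviour is an assumption (the real Engine runs goroutines and may queue requests instead of rejecting them).

```python
import threading


class TokenPool:
    """Fixed-size pool of tokens; a request holds tokens while it runs."""

    def __init__(self, size):
        self._free = size
        self._lock = threading.Lock()

    def try_acquire(self, n=1):
        """Take n tokens if available; return False instead of blocking."""
        with self._lock:
            if n > self._free:
                return False
            self._free -= n
            return True

    def release(self, n=1):
        """Give n tokens back to the pool."""
        with self._lock:
            self._free += n

    @property
    def free(self):
        with self._lock:
            return self._free


pool = TokenPool(size=10)             # illustrative size
single_ok = pool.try_acquire(1)       # single request: 1 token
batch_ok = pool.try_acquire(3)        # batch of 3 Ids: 3 tokens
too_big = pool.try_acquire(7)         # only 6 tokens are free, so this is refused
free_while_busy = pool.free
pool.release(3)
pool.release(1)
```
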

SLIDE 16

More Isolation breakdowns

  • Type of Traffic:
      • Live traffic: Front Ends - user traffic, sync related traffic
      • Offline traffic: Scripts / Async processing / Offline processing
  • Type of Request:
      • Write (Insert, Delete, Update, Create Ids, Acquire Read/Write Locks)
      • Read (Single read, multi read, list, count, listLog)
SLIDE 17

Layer Resource Pools

Engine: Request Handler with four resource pools:

  • Write Live Resource Pool
  • Read Live Resource Pool
  • Write Offline Resource Pool
  • Read Offline Resource Pool
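
One way to picture the four layers is a function that maps a request to its pool by request type and traffic type. The operation names below are hypothetical spellings of the request types listed on the previous slide, and `layer_for` is an illustrative helper; only the layer name format (`write_live`) appears in the deck itself.

```python
# Hypothetical operation names, grouped as on the "Type of Request" slide.
WRITE_OPS = {"insert", "delete", "update", "create_ids", "acquire_lock"}
READ_OPS = {"read", "multi_read", "list", "count", "list_log"}


def layer_for(op, is_live):
    """Return the layer (resource pool) name for one request."""
    if op in WRITE_OPS:
        kind = "write"
    elif op in READ_OPS:
        kind = "read"
    else:
        raise ValueError(f"unknown operation: {op}")
    traffic = "live" if is_live else "offline"
    return f"{kind}_{traffic}"


print(layer_for("update", is_live=True))    # write_live
print(layer_for("count", is_live=False))    # read_offline
```

Keeping a separate pool per layer means that, for example, an offline backfill exhausting `write_offline` cannot starve live reads.
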

SLIDE 18

Breakdown by tenant

  • What is a tenant?
  • Source Machine Tag (e.g. front-end)
  • Source ServiceName (e.g. FileSync)
  • Source Schema (e.g. Team)
  • Source Handler (e.g. Thumbnail generator)
  • Source Script (e.g. backfill-albums)
SLIDE 19

Examples

  • “frontend:www:TeamEvent”
  • “async-worker:async_task_wrapper:Contacts”
  • “email:emailservice.py:UserEmailEvent”
  • “taskrunner-node-quota:update_team_usage.py:User”
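
The examples above all read as three colon-separated fields. A minimal sketch of building and parsing such a key follows; the field names (`machine_tag`, `source`, `schema`) are assumptions inferred from the previous slide, and tenant keys elsewhere in the deck do not always have exactly three parts, so a real parser would need to be more lenient.

```python
from typing import NamedTuple


class Tenant(NamedTuple):
    machine_tag: str   # e.g. "frontend"
    source: str        # service name, handler, or script (assumed meaning)
    schema: str        # e.g. "TeamEvent"

    def key(self) -> str:
        """Join the fields back into the colon-separated tenant key."""
        return ":".join(self)


def parse_tenant(key: str) -> Tenant:
    """Split a three-part tenant key; reject anything else."""
    parts = key.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected 3 colon-separated fields, got {len(parts)}")
    return Tenant(*parts)


tenant = parse_tenant("taskrunner-node-quota:update_team_usage.py:User")
print(tenant.machine_tag)  # taskrunner-node-quota
print(tenant.schema)       # User
```
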

SLIDE 20

  • Engine: CPU, Memory, Network, Storage, Disk IO
  • MySQL: CPU / Disk IO
  • MySQL: Threads running
  • MySQL: Semi-sync
  • MySQL: Threads connected

SLIDE 21

Resources

  • QPS is not a good metric, as requests vary considerably
  • # Connections used (mapping to token resource pool)
  • connections used * time
  • 200 connections total pool = 200 * 60 = 12000 connection seconds / min:
      • 1 connection held for 1 min = 60 connection seconds / min
      • 60 connections held for 1 second each = 60 connection seconds / min
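
The connection-seconds arithmetic above can be written out as a short calculation. The helper names `conn_seconds` and `used_percent` are illustrative and not taken from the talk.

```python
POOL_CONNECTIONS = 200
WINDOW_SECONDS = 60

# Capacity of the pool over one minute, in connection-seconds.
capacity = POOL_CONNECTIONS * WINDOW_SECONDS


def conn_seconds(connections, seconds_held):
    """Connection-seconds consumed by holding `connections` for `seconds_held`."""
    return connections * seconds_held


def used_percent(consumed):
    """Share of the pool's one-minute capacity, as a percentage."""
    return 100 * consumed / capacity


long_holder = conn_seconds(1, 60)    # 1 connection held for the whole minute
short_burst = conn_seconds(60, 1)    # 60 connections held for 1 second each

print(capacity)                      # 12000
print(long_holder, short_burst)      # 60 60
print(used_percent(long_holder))     # 0.5
```

Both usage patterns cost the same 0.5% of the pool, which is why this metric compares tenants more fairly than raw QPS.
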

SLIDE 22

Write Live - 1 minute snapshot

| Tenant                       | ConnSec Used | Connections | Errors |
| frontend:rpc:User            | 20%          | 5           |        |
| frontend:www:FileId          | 3%           | 90          |        |
| taskrunner:growth:team_quota | 0.5%         | 4           |        |
| email:UserEmail              | 1%           | 1           |        |
| Total                        | 24.5%        | 100         |        |

SLIDE 23

[Chart: Percentage (y-axis, 25 to 100) over Time (x-axis, 10:00 to 10:05)]

SLIDE 24

[Chart: Percentage (y-axis, 25 to 100) over Time (x-axis, 10:00 to 10:05)]

SLIDE 25

[Chart: Percentage (y-axis, 25 to 100) over Time (x-axis, 10:00 to 10:05)]

SLIDE 26

Throttle mechanism

  • Auto-throttle heuristics based on history of resource usage per tenant
  • No predefined quota
  • Steady state usage by tenant varies wildly: 0.001% - 20%
  • Triggering event -> find “bad” tenant -> decide how much to throttle them -> throttle “bad” tenant
  • Disabled the auto-throttling mechanism
  • We have learned a lot
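
The "triggering event -> find bad tenant" steps can be sketched as a comparison of current usage against each tenant's own history, which fits the "no predefined quota" point. This is not the heuristic the talk's authors used (the slides do not describe it); `find_bad_tenant` and the 90% trigger are hypothetical, and the numbers are the "Used" percentages from the deck's own Period 1 and Period 2 output.

```python
def find_bad_tenant(history, current, trigger_percent=90.0):
    """Return the tenant whose usage grew the most, or None if not triggered.

    history, current: dicts of tenant -> percent of pool capacity used.
    trigger_percent: hypothetical threshold on total usage.
    """
    if sum(current.values()) < trigger_percent:
        return None                      # no triggering event
    # Compare each tenant to its own steady state rather than to a fixed quota.
    return max(current, key=lambda t: current[t] - history.get(t, 0.0))


period_1 = {
    "offline:bluemail:Email": 0.79,
    "frontend:rpc:UserEntity": 0.76,
    "cape-sfj:cape_dispatcher:CursorEntity": 0.45,
    "filejournal:fj_server_bin:FileID": 0.42,
    "frontend:www:ActivityEntity": 0.36,
}
period_2 = dict(period_1, **{"offline:bluemail:Email": 93.79})

print(find_bad_tenant(period_1, period_2))  # offline:bluemail:Email
print(find_bad_tenant(period_1, period_1))  # None
```
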
SLIDE 27

[Diagram: a timer (Start, 1 to 9) running while a connection (Conn) in the Engine handles Acquire Lock, Read, Write, Commit]

SLIDE 28

Resources

  • Used Time -> Execution Time
  • Bytes In/Out
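
One reading of "Used Time -> Execution Time" is that the time a connection is held differs from the time it spends executing queries. The sketch below separates the two under that assumption; `ConnUsage` and its fields are hypothetical, and the timestamps are made up to match a 9-tick timer.

```python
from dataclasses import dataclass, field


@dataclass
class ConnUsage:
    """Timing record for one held connection (illustrative only)."""
    acquired_at: float
    released_at: float
    query_spans: list = field(default_factory=list)   # (start, end) pairs

    @property
    def used(self):
        """Total time the connection was held."""
        return self.released_at - self.acquired_at

    @property
    def execution(self):
        """Time spent actually executing queries on the connection."""
        return sum(end - start for start, end in self.query_spans)


# Connection held for 9 ticks, but only 3 of them are spent in queries.
usage = ConnUsage(acquired_at=0, released_at=9,
                  query_spans=[(1, 2), (4, 5), (7, 8)])
print(usage.used)       # 9
print(usage.execution)  # 3
```
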
SLIDE 29

Write Live - 1 minute snapshot

| Tenant                       | Used  | Execution | MB Read | Conns | Errors |
| frontend:rpc:User            | 20%   | 1%        | 1       | 5     |        |
| frontend:www:FileId          | 3%    | 3%        | 30      | 90    |        |
| taskrunner:growth:team_quota | 0.5%  | 0.5%      | 5       | 4     |        |
| email:UserEmail              | 1%    | 0.5%      | 4       | 1     |        |
| Total                        | 24.5% | 5%        | 40      | 100   |        |

SLIDE 30

Layer: write_live, NumTenants: 360
Throttle Controls: State: steady, TokensPrimaryPool: 300, TokensThrottledPool: 0
Throttled Tenants: []

Period 1:
| Used   | Idle   | Execution | Conns | Errors | Size(MB) | Tenants
| 6.48%  | 60.15% | 2.58%     | 34057 | 0      | 19       | Aggregated stats
Top 5 Sources sorted by Used:
| 0.79%  | 94.11% | 0.36%     | 423   | 0      | 0        | offline:bluemail:Email
| 0.76%  | 93.68% | 0.20%     | 437   | 0      | 0        | frontend:rpc:UserEntity
| 0.45%  | 19.92% | 0.08%     | 4922  | 0      | 4        | cape-sfj:cape_dispatcher:CursorEntity
| 0.42%  | 52.50% | 0.07%     | 2783  | 0      | 0        | filejournal:fj_server_bin:FileID
| 0.36%  | 93.80% | 0.02%     | 252   | 0      | 0        | frontend:www:ActivityEntity

Layer: write_live, NumTenants: 360
Throttle Controls: State: steady, TokensPrimaryPool: 300, TokensThrottledPool: 0
Throttled Tenants: []

Period 2:
| Used   | Idle   | Execution | Conns | Errors | Size(MB) | Tenants
| 100%   | 60.15% | 52.58%    | 34057 | 20000  | 19       | Aggregated stats
Top 5 Sources sorted by Used:
| 93.79% | 0.11%  | 50.36%    | 600   | 300    | 0        | offline:bluemail:Email
| 0.76%  | 93.68% | 0.20%     | 437   | 254    | 0        | frontend:rpc:UserEntity
| 0.45%  | 19.92% | 0.08%     | 4922  | 1293   | 4        | cape-sfj:cape_dispatcher:CursorEntity
| 0.42%  | 52.50% | 0.07%     | 2783  | 2913   | 0        | filejournal:fj_server_bin:FileID
| 0.36%  | 93.80% | 0.02%     | 252   | 23     | 0        | frontend:www:ActivityEntity

SLIDE 31

Layer: write_live, NumTenants: 360
Throttle Controls: State: throttled, TokensPrimaryPool: 270, TokensThrottledPool: 30
Throttled Tenants: [offline:bluemail:Email]

Period 3:
| Used   | Idle   | Execution | Conns | Errors | Size(MB) | Tenants
| 16.20% | 60.15% | 7.58%     | 34057 | 1900   | 19       | Aggregated stats
Top 5 Sources sorted by Used:
| 10.79% | 0.11%  | 5.36%     | 600   | 1900   | 0        | offline:bluemail:Email
| 0.76%  | 93.68% | 0.20%     | 437   | 0      | 0        | frontend:rpc:UserEntity
| 0.45%  | 19.92% | 0.08%     | 4922  | 0      | 4        | cape-sfj:cape_dispatcher:CursorEntity
| 0.42%  | 52.50% | 0.07%     | 2783  | 0      | 0        | filejournal:fj_server_bin:FileID
| 0.36%  | 93.80% | 0.02%     | 252   | 0      | 0        | frontend:www:ActivityEntity

edgestore_throttle --tenant=offline:bluemail:Email --tokens=30 --host=abc-de-fg --layer=write_live
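
The change in the Throttle Controls line (300/0 tokens becoming 270/30) suggests that throttling moves tokens from the primary pool into a separate throttled pool. The sketch below models only that bookkeeping. `LayerPools` and its methods are hypothetical, and giving each throttled tenant its own reserved slice is an assumption; the output shows a single throttled pool and only one throttled tenant.

```python
class LayerPools:
    """Token accounting for one layer such as write_live (illustrative)."""

    def __init__(self, total_tokens):
        self.total_tokens = total_tokens
        self.throttled = {}              # tenant -> tokens reserved for it

    @property
    def tokens_throttled_pool(self):
        return sum(self.throttled.values())

    @property
    def tokens_primary_pool(self):
        return self.total_tokens - self.tokens_throttled_pool

    @property
    def state(self):
        return "throttled" if self.throttled else "steady"

    def throttle(self, tenant, tokens):
        """Confine `tenant` to `tokens` tokens taken out of the primary pool."""
        others = self.tokens_throttled_pool - self.throttled.get(tenant, 0)
        if tokens <= 0 or others + tokens >= self.total_tokens:
            raise ValueError("throttled pool must leave tokens in the primary pool")
        self.throttled[tenant] = tokens

    def unthrottle(self, tenant):
        """Return the tenant, and its tokens, to the primary pool."""
        self.throttled.pop(tenant, None)

    def pool_for(self, tenant):
        return "throttled" if tenant in self.throttled else "primary"


pools = LayerPools(total_tokens=300)
pools.throttle("offline:bluemail:Email", tokens=30)
print(pools.state, pools.tokens_primary_pool, pools.tokens_throttled_pool)
# throttled 270 30
```

Calling `pools.unthrottle("offline:bluemail:Email")` restores the steady state, in line with the later point that throttling should be temporary.
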

SLIDE 32

Impact

  • Reduce MTTR
  • Availability event:
  • 1. Detection
  • 2. Investigation
  • 3. Containment
  • 4. Short term fix
  • 5. Long term fix
SLIDE 33

Findings

  • Expensive queries
  • Abusable APIs
  • Query optimizer
  • Inconsistencies
  • Insufficient documentation
  • Bugs
  • Perf optimization

SLIDE 34

Lessons Learned

  • 1 deployment to rule them all works
  • There is such a thing as automating too soon
  • Silently throttling is bad
  • Throttling should be a temporary state
  • Not having pre-defined quotas works
  • Multiple Isolation breakdowns (by user, by table, by tenant, by request type (Read/Write), by traffic type (Live vs Offline))

  • Auto-throttle heuristics
  • Manual throttle using a throttle tool: Query / Throttle / Unthrottle
  • Aggregate tool: queries and filters all engines
  • Isolate the error and limit blast radius while investigating, root-causing, and fixing the underlying problem
  • There was a time when we shut down scripts manually, not knowing who was causing the problem
  • Found issues with the API, bugs, poorly documented clients, best practices
  • Throttle mechanism
  • Future work (in progress)

SLIDE 35

What’s next

  • Control Plane “brain”
      • continuously query all Engines
      • automatically throttle tenants when system is degraded
      • detecting trends
  • Per logical micro-shard (and per Id) granularity for throttling

SLIDE 36

Credits

  • Zviad Metreveli
  • Rati Gelashvili
  • Robert Verkuil
  • Alex Degtiar
  • Jonathan Lee