Multi-Tenancy & Isolation
Bogdan Munteanu - Dropbox
Overview
- What is Edgestore?
- Workloads & API
- Multi-tenancy & Isolation
- Lessons Learned

What is Edgestore
Distributed Metadata Store built on top of MySQL
Team
| Id | Company | Size |
| 1 | Expedia | 5000 |
| 2 | NatGeo | 500 |
| 3 | Intuit | 2000 |
| 4 | Spotify | 600 |

User
| Id | Email | Name | Type |
| 1 | jondoe@ | Jon | Free |
| 2 | jenny@ | Jenny | Pro |

Edgedata
| Edge type | Gid | Data |
| Photo Entity | ? | Name:SF.jpg; Size:64 |
| Photo Entity | ? | Name:Hawaii; Size:64 |
| Photo Entity | ? | Name:Tahoe.jpg; Size:128 |
| Photo Entity | ? | Name:Office.jpg; Size:1024 |
| User | 15:1 | Email:jondoe@; Name:Jon; Type:Free |
| User | 20:2 | Email:jenny@; Name:Jenny; Type:Pro |
| Schema | Id | Data |
| Team | 10:1 | Company:Expedia; Size:5000 |
| Team | 20:1 | Company:NatGeo; Size:500 |
| Team | 30:7 | Company:Intuit; Size:2000 |
| Team | 35:3 | Company:Spotify; Size:600 |
The rows are distributed across shards (Shard 1, Shard 2, through Shard n), one table per shard:

| Schema | Id | Data |
| Team | 10:1 | Company:Expedia; Size:5000 |
| Team | 20:4 | Company:NatGeo; Size:500 |

| Schema | Id | Data |
| Team | 30:1 | Company:Intuit; Size:2000 |
| User | 40:2 | Email:jondoe@; Name:Jon; Type:Free |

| Schema | Id | Data |
| Team | 50:2 | Company:Spotify; Size:600 |
| User | 60:1 | Email:jenny@; Name:Jenny; Type:Pro |
Engine
- Proto -> SQL Query
- Query Result -> Proto
- Control / Reduce load to MySQL
- Connection Pooling
Engine: Single Request - 1 token
[Figure: Request Handler -> Resource Pool]

Engine: Batch (parallel) Request - n tokens
[Figure: Request Handler -> Resource Pool, one goroutine per id: Id1 goroutine, Id2 goroutine, Id3 goroutine]

Engine: Batch (sequential) Request - n tokens
[Figure: Request Handler -> Resource Pool, ids processed in ranges: Id 1 - Id 10, Id 11 - Id 20, Id 21 - Id 30]
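
The three request shapes above can be sketched with a small token pool. This is only an illustration, not Edgestore's implementation: `ResourcePool`, `fetch`, the chunk size of 10 and the use of Python threads in place of goroutines are all assumptions, and the sequential case is read here as taking one token per chunk in turn.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ResourcePool:
    """Hypothetical token pool: each in-flight query holds one token."""

    def __init__(self, tokens):
        self._sem = threading.BoundedSemaphore(tokens)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak = 0

    def run(self, query, *args):
        # Block until a token is free, run the query, then give the token back.
        with self._sem:
            with self._lock:
                self.in_use += 1
                self.peak = max(self.peak, self.in_use)
            try:
                return query(*args)
            finally:
                with self._lock:
                    self.in_use -= 1


def fetch(ids):
    # Stand-in for a MySQL round trip; returns a fake row per id.
    return {i: f"row-{i}" for i in ids}


def single_request(pool, gid):
    # Single request: one query, one token.
    return pool.run(fetch, [gid])


def batch_parallel(pool, gids):
    # Parallel batch: one worker per id, each worker takes its own token.
    with ThreadPoolExecutor(max_workers=len(gids)) as ex:
        parts = list(ex.map(lambda g: pool.run(fetch, [g]), gids))
    return {k: v for part in parts for k, v in part.items()}


def batch_sequential(pool, gids, chunk=10):
    # Sequential batch: ids are processed chunk by chunk, one token at a time.
    out = {}
    for start in range(0, len(gids), chunk):
        out.update(pool.run(fetch, gids[start:start + chunk]))
    return out
```

Because every query goes through `pool.run`, the number of concurrent queries can never exceed the pool's token count, regardless of how many ids a batch carries.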
processing
Write Locks)
Engine
[Figure: Request Handler routing to four resource pools: Write Live, Read Live, Write Offline, Read Offline]
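
A minimal sketch of how a request might be routed to one of the four pools, keyed by request type and traffic type. The type names, the `Pool` record and the per-pool token count are hypothetical; only the four pool labels and the `write_live` layer name come from the slides.

```python
from dataclasses import dataclass
from enum import Enum


class RequestType(Enum):
    READ = "read"
    WRITE = "write"


class TrafficType(Enum):
    LIVE = "live"
    OFFLINE = "offline"


@dataclass
class Pool:
    layer: str   # e.g. "write_live"
    tokens: int  # assumed token budget for this pool


def build_pools(tokens):
    # One independent pool per (request type, traffic type) pair.
    return {
        (r, t): Pool(layer=f"{r.value}_{t.value}", tokens=tokens)
        for r in RequestType
        for t in TrafficType
    }


def pool_for(pools, request_type, traffic_type):
    # A request only ever draws tokens from the pool matching its own kind.
    return pools[(request_type, traffic_type)]
```

Keeping the pools separate means that, in this sketch, a burst of offline writes cannot use up tokens set aside for live reads.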
quota:update_team_usage.py:User”
Engine
- CPU
- Memory
- Network
- Storage
- Disk IO
- MySQL: CPU / Disk IO
- MySQL: Threads running
- MySQL: Semi-sync
- MySQL: Threads connected
connection seconds / min:
seconds / min
seconds / min
Write Live - 1 minute snapshot
| Tenant | ConnSec Used | Connections | Errors |
| frontend:rpc:User | 20% | 5 | |
| frontend:www:FileId | 3% | 90 | |
| taskrunner:growth:team_quota | 0.5% | 4 | |
| email:UserEmail | 1% | 1 | |
| Total | 24.5% | 100 | |
[Three charts: Percentage (y-axis, 25 to 100) over Time (x-axis, 10:00 to 10:05)]
usage per tenant
much to throttle them -> throttle “bad” tenant
Engine
[Figure: timeline of one connection (Conn). Timer Start, then ticks 1 to 9 spanning the phases Acquire Lock, Read, Write, Commit]
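
One way to picture the connection-seconds metric is a timer that starts when a tenant takes a connection and stops when it is released. The sketch below is an assumption-laden illustration: `ConnSecMeter` is a hypothetical name, and defining "Used" as a tenant's connection seconds divided by pool tokens times the window length is my reading, not something the slides state.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class ConnSecMeter:
    """Hypothetical per-tenant accounting of connection seconds per window."""

    def __init__(self, pool_tokens, window_seconds=60.0, clock=time.monotonic):
        # Assumed capacity: every token could be held for the whole window.
        self.capacity = pool_tokens * window_seconds
        self.clock = clock
        self.used = defaultdict(float)
        self.conns = defaultdict(int)

    @contextmanager
    def connection(self, tenant):
        # The timer covers the full hold time: lock wait, reads, writes, commit.
        start = self.clock()
        try:
            yield
        finally:
            self.used[tenant] += self.clock() - start
            self.conns[tenant] += 1

    def used_percent(self, tenant):
        # Share of the pool's capacity consumed by this tenant in the window.
        return 100.0 * self.used[tenant] / self.capacity
```

Measuring hold time rather than request count means a tenant that issues few but slow transactions shows up as a heavy user, which a plain connection counter would miss.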
Write Live - 1 minute snapshot
| Tenant | Used | Execution | MB Read | Conns | Errors |
| frontend:rpc:User | 20% | 1% | 1 | 5 | |
| frontend:www:FileId | 3% | 3% | 30 | 90 | |
| taskrunner:growth:team_quota | 0.5% | 0.5% | 5 | 4 | |
| email:UserEmail | 1% | 0.5% | 4 | 1 | |
| Total | 24.5% | 5% | 40 | 100 | |
Layer: write_live, NumTenants: 360
Throttle Controls: State: steady, TokensPrimaryPool: 300, TokensThrottledPool: 0
Throttled Tenants: []
Period 1:
| Used | Idle |Execution| Conns | Errors | Size(MB) | Tenants
| 6.48% | 60.15% | 2.58% | 34057 | 0 | 19 | Aggregated stats
Top 5 Sources sorted by Used:
| 0.79% | 94.11% | 0.36% | 423 | 0 | 0 | offline:bluemail:Email
| 0.76% | 93.68% | 0.20% | 437 | 0 | 0 | frontend:rpc:UserEntity
| 0.45% | 19.92% | 0.08% | 4922 | 0 | 4 | cape-sfj:cape_dispatcher:CursorEntity
| 0.42% | 52.50% | 0.07% | 2783 | 0 | 0 | filejournal:fj_server_bin:FileID
| 0.36% | 93.80% | 0.02% | 252 | 0 | 0 | frontend:www:ActivityEntity

Layer: write_live, NumTenants: 360
Throttle Controls: State: steady, TokensPrimaryPool: 300, TokensThrottledPool: 0
Throttled Tenants: []
Period 2:
| Used | Idle |Execution| Conns | Errors | Size(MB) | Tenants
| 100% | 60.15% | 52.58% | 34057 | 20000 | 19 | Aggregated stats
Top 5 Sources sorted by Used:
| 93.79% | 0.11% | 50.36% | 600 | 300 | 0 | offline:bluemail:Email
| 0.76% | 93.68% | 0.20% | 437 | 254 | 0 | frontend:rpc:UserEntity
| 0.45% | 19.92% | 0.08% | 4922 | 1293 | 4 | cape-sfj:cape_dispatcher:CursorEntity
| 0.42% | 52.50% | 0.07% | 2783 | 2913 | 0 | filejournal:fj_server_bin:FileID
| 0.36% | 93.80% | 0.02% | 252 | 23 | 0 | frontend:www:ActivityEntity
Layer: write_live, NumTenants: 360
Throttle Controls: State: throttled, TokensPrimaryPool: 270, TokensThrottledPool: 30
Throttled Tenants: [offline:bluemail:Email]
Period 3:
| Used | Idle |Execution| Conns | Errors | Size(MB) | Tenants
| 16.20% | 60.15% | 7.58% | 34057 | 1900 | 19 | Aggregated stats
Top 5 Sources sorted by Used:
| 10.79% | 0.11% | 5.36% | 600 | 1900 | 0 | offline:bluemail:Email
| 0.76% | 93.68% | 0.20% | 437 | 0 | 0 | frontend:rpc:UserEntity
| 0.45% | 19.92% | 0.08% | 4922 | 0 | 4 | cape-sfj:cape_dispatcher:CursorEntity
| 0.42% | 52.50% | 0.07% | 2783 | 0 | 0 | filejournal:fj_server_bin:FileID
| 0.36% | 93.80% | 0.02% | 252 | 0 | 0 | frontend:www:ActivityEntity

edgestore_throttle --tenant=offline:bluemail:Email --tokens=30 --host=abc-de-fg --layer=write_live
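
The transition from Period 2 to Period 3 (300 primary tokens becoming 270 primary plus 30 throttled, with one tenant moved to the throttled pool) can be sketched as a simple heuristic. Everything here is illustrative: `LayerState`, `auto_throttle` and the saturation and dominance thresholds are assumptions, since the slides do not say which rule the real auto-throttler applies.

```python
from dataclasses import dataclass, field


@dataclass
class LayerState:
    total_tokens: int = 300
    throttled_tokens: int = 0
    throttled_tenants: list = field(default_factory=list)

    @property
    def primary_tokens(self):
        # Tokens carved out for throttled tenants are taken from the primary pool.
        return self.total_tokens - self.throttled_tokens

    @property
    def state(self):
        return "throttled" if self.throttled_tenants else "steady"


def auto_throttle(layer, used_by_tenant, saturation=90.0, dominance=50.0, tokens=30):
    """used_by_tenant maps tenant name -> percent of the pool used last period.

    Assumed rule: if the pool is close to saturated and a single tenant is
    responsible for most of the usage, confine that tenant to a small pool.
    """
    if sum(used_by_tenant.values()) < saturation:
        return layer  # pool is healthy, nothing to do
    tenant, used = max(used_by_tenant.items(), key=lambda kv: kv[1])
    if used >= dominance and tenant not in layer.throttled_tenants:
        layer.throttled_tenants.append(tenant)
        layer.throttled_tokens = tokens
    return layer
```

Fed the Period 2 figures, this sketch ends in the same configuration the Period 3 log shows; the other tenants keep the remaining primary tokens while the heavy tenant is capped.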
tenant, by request type (Read/Write), by traffic type (Live vs Offline)
- Auto-throttle heuristics
- Manual throttle using a throttle tool: Query / Throttle / Unthrottle
- Aggregate tool: queries and filters all engines
- to isolate the error and limit blast radius while investigating, root causing and fixing the underlying problem.
- There was a time when we shut down scripts manually, not knowing who was causing the problem
- found issues with API, bugs, poorly documented client, best practices
- Throttle mechanism
- Future work (in progress)
degraded
for throttling