
Multi-Tenancy & Isolation Bogdan Munteanu - Dropbox Overview - PowerPoint PPT Presentation


  1. Multi-Tenancy & Isolation Bogdan Munteanu - Dropbox

  2. Overview • What is Edgestore? • Workloads & API • Multi-tenancy & Isolation • Lessons Learned

  3. What is Edgestore • Distributed Metadata Store built on top of MySQL • Highly Available, Scalable, Durable • Abstract away sharding and caching • Reduce operational burden • Flexible schemas • Multi-Region Setup

  4. Architecture

  5. Architecture cont’d • 2048 Shards • 8 Shards per Engine (and MySQL cluster) • 1 Master - 2 Slaves (semi-sync) • Multi-region setup

  6. MYSQL / EDGESTORE

      MYSQL, table "Team":
      | Id | Company | Size |
      | 1 | Expedia | 5000 |
      | 2 | NatGeo | 500 |
      | 3 | Intuit | 2000 |
      | 4 | Spotify | 600 |

      MYSQL, table "User":
      | Id | Email | Name | Type |
      | 1 | jondoe@ | Jon | Free |
      | 2 | jenny@ | Jenny | Pro |

      EDGESTORE, entities:
      | Schema | Id | Data |
      | Team | 10:1 | Company:Expedia; Size:5000 |
      | Team | 20:1 | Company:NatGeo; Size:500 |
      | Team | 30:7 | Company:Intuit; Size:2000 |
      | Team | 35:3 | Company:Spotify; Size:600 |
      | User | 15:1 | Email:jondoe@, Name:Jon; Type:Free |
      | User | 20:2 | Email:jenny@; Name:Jenny; Type:Pro |

      EDGESTORE, Edgedata:
      | Edge type | Gid | Data |
      | Photo | Entity ? | Name:SF.jpg; Size:64 |
      | Photo | Entity ? | Name:Hawaii; Size:64 |
      | Photo | Entity ? | Name:Tahoe.jpg; Size:128 |
      | Photo | Entity ? | Name:Office.jpg; Size:1024 |

  7. Shard the table

      Shard 1:
      | Schema | Id | Data |
      | Team | 10:1 | Company:Expedia; Size:5000 |
      | Team | 20:4 | Company:NatGeo; Size:500 |

      Shard 2:
      | Schema | Id | Data |
      | Team | 30:1 | Company:Intuit; Size:2000 |
      | User | 40:2 | Email:jondoe@, Name:Jon; Type:Free |

      Shard n:
      | Schema | Id | Data |
      | Team | 50:2 | Company:Spotify; Size:600 |
      | User | 60:1 | Email:jenny@; Name:Jenny; Type:Pro |

  8. Restricted API • Create/Update/Delete • single and batch • Compare and Set semantics • Reads: • Read(Id, ) • List(Id, *) • Count(Id, *) • List(Id, condition=[equals, prefix, range]) • ReadLog(Id) • ListLog(Id, *) • Acquire Read/Write Lock • Commit/Rollback • Strong consistency semantics
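The "Compare and Set semantics" bullet on slide 8 can be illustrated with a minimal sketch. This is not the Edgestore API: `TinyStore`, `compare_and_set`, and the per-row version counter are hypothetical names chosen for illustration, and the store is a plain in-memory dict. The idea shown is only that a write succeeds when the caller's expected version matches the stored one, and is rejected otherwise.

```python
class VersionMismatch(Exception):
    """Raised when the caller's expected version is stale."""


class TinyStore:
    """In-memory stand-in for a metadata store (hypothetical, not Edgestore)."""

    def __init__(self):
        self._rows = {}  # row id -> (version, data)

    def read(self, row_id):
        # Unknown rows are reported as version 0 with no data.
        return self._rows.get(row_id, (0, None))

    def compare_and_set(self, row_id, expected_version, new_data):
        current_version, _ = self.read(row_id)
        if current_version != expected_version:
            raise VersionMismatch(
                f"{row_id}: expected {expected_version}, found {current_version}"
            )
        self._rows[row_id] = (current_version + 1, new_data)
        return current_version + 1


store = TinyStore()
v1 = store.compare_and_set("10:1", 0, {"Company": "Expedia", "Size": 5000})
version, data = store.read("10:1")
v2 = store.compare_and_set("10:1", version, {**data, "Size": 5001})

try:
    store.compare_and_set("10:1", v1, {"Size": 1})  # v1 is stale by now
    stale_rejected = False
except VersionMismatch:
    stale_rejected = True
```

A caller that hits `VersionMismatch` would typically re-read the row and retry with the fresh version.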

  9. Workloads • 10 million QPS • 600k writes / second • 9.4 million reads / second • 90% of reads are cache hits • 1.5 million QPS to Engine fleet

  10. Workloads cont’d • Batch size 1 to 10000 • Some read requests return 1 row • Some return 100000 rows • Rows range from a few bytes to several MB • 500+ unique Schemas

  11. Engine • Proto -> SQL Query • Query Result -> Proto • Connection Pooling • Control / Reduce load to MySQL

  12. Workloads cont’d • High QPS • Write / Read • Large / expensive requests: • Write - large transactions • Read - large number of rows, or large rows • Multi-Read / Multi-Write

  13. Single Request - 1 token [Diagram: Engine with Request Handler and Resource Pool]

  14. Batch (parallel) Request - n tokens [Diagram: Engine with Request Handler and Resource Pool; one goroutine each for Id1, Id2, Id3]

  15. Batch (sequential) Request - n tokens [Diagram: Engine with Request Handler and Resource Pool; sequential chunks Id 1 - Id 10, Id 11 - Id 20, Id 21 - Id 30]
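Slides 13 to 15 describe requests drawing tokens from an Engine resource pool: 1 token for a single request, n tokens for a batch. The sketch below is one way to picture that, not the actual Engine code (which the slides suggest uses goroutines). `TokenPool` and `run_request` are hypothetical names, and charging one token per id is an assumption, since the slides do not say how n is derived.

```python
import threading


class TokenPool:
    """Fixed-size token pool; hypothetical stand-in for an Engine resource pool."""

    def __init__(self, tokens):
        self.capacity = tokens
        self.available = tokens
        self._lock = threading.Lock()

    def try_acquire(self, n=1):
        # Take n tokens atomically, or none at all.
        with self._lock:
            if n > self.available:
                return False
            self.available -= n
            return True

    def release(self, n=1):
        with self._lock:
            self.available = min(self.capacity, self.available + n)


def run_request(pool, ids, execute):
    """Charge one token per id (an assumption), run the work, return the tokens."""
    needed = len(ids)
    if not pool.try_acquire(needed):
        raise RuntimeError("resource pool exhausted")
    try:
        return [execute(i) for i in ids]
    finally:
        pool.release(needed)


pool = TokenPool(3)
single = run_request(pool, ["id1"], str.upper)               # 1 token
batch = run_request(pool, ["id1", "id2", "id3"], str.upper)  # 3 tokens

try:
    run_request(pool, ["id1", "id2", "id3", "id4"], str.upper)  # needs 4 of 3
    rejected = False
except RuntimeError:
    rejected = True
```

Rejecting a request that cannot get its tokens, rather than queueing it, is a simplification made here to keep the sketch short.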

  16. More Isolation breakdowns • Type of Traffic: • Live traffic: Front Ends - user traffic, sync related traffic • Offline traffic: Scripts / Async processing / Offline processing • Type of Request: • Write (Insert, Delete, Update, Create Ids, Acquire Read/Write Locks) • Read (Single read, multi read, list, count, listLog)

  17. Layer Resource Pools [Diagram: Engine with a Request Handler and four resource pools: Write Live, Read Live, Write Offline, Read Offline]
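The two breakdowns from slide 16 (traffic type and request type) combine into the four layers on slide 17. A minimal sketch of that mapping follows. The layer name `write_live` appears in the tool output later in the deck; the other three names, the operation identifiers, and the `layer_for` function are assumptions made for illustration.

```python
# Operation names are illustrative spellings of the request types on slide 16.
WRITE_OPS = {"insert", "delete", "update", "create_ids", "acquire_lock"}
READ_OPS = {"read", "multi_read", "list", "count", "list_log"}


def layer_for(op, live):
    """Return the resource-pool layer for a request type and traffic type."""
    if op in WRITE_OPS:
        kind = "write"
    elif op in READ_OPS:
        kind = "read"
    else:
        raise ValueError(f"unknown operation: {op}")
    return f"{kind}_{'live' if live else 'offline'}"


frontend_update = layer_for("update", live=True)   # user-facing write
backfill_count = layer_for("count", live=False)    # offline script read
```

Keeping the pools separate means an expensive offline read cannot exhaust the tokens that live writes depend on.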

  18. Breakdown by tenant • What is a tenant? • Source Machine Tag (e.g. front-end) • Source ServiceName (e.g. FileSync) • Source Schema (e.g. Team) • Source Handler (e.g. Thumbnail generator) • Source Script (e.g. backfill-albums)

  19. Examples • “frontend:www:TeamEvent” • “async-worker:async_task_wrapper:Contacts” • “email:emailservice.py:UserEmailEvent” • “taskrunner-node-quota:update_team_usage.py:User”
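The examples on slide 19 all look like three colon-separated parts: a machine tag, then a service, handler, or script name, then a schema. Assuming that reading is right (the slides do not spell out the format), a tenant key could be built and parsed as below. `Tenant`, its field names, and `parse_tenant` are hypothetical.

```python
from typing import NamedTuple


class Tenant(NamedTuple):
    machine_tag: str  # e.g. "frontend"
    source: str       # service, handler, or script name, e.g. "www"
    schema: str       # e.g. "TeamEvent"

    def key(self):
        # Join the three parts back into the colon-separated form.
        return ":".join(self)


def parse_tenant(key):
    """Split a tenant key into its three parts; reject anything else."""
    parts = key.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'machine_tag:source:schema', got {key!r}")
    return Tenant(*parts)


tenant = parse_tenant("taskrunner-node-quota:update_team_usage.py:User")
round_trip = tenant.key()
```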

  20. [Diagram: resources] • Engine: CPU, Memory, Network, Storage, Disk IO • MySQL: CPU / Disk IO • MySQL: Threads connected • MySQL: Threads running • MySQL: Semi-sync

  21. Resources • QPS is not a good metric, as requests vary considerably • # Connections used (mapping to token resource pool) • connections used * time • 200 connections total pool = 200 * 60 = 12000 connection seconds / min: • 1 connection per second for 1 min = 60 connection seconds / min • 60 connections for 1 second = 60 connection seconds / min
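The connection-seconds arithmetic on slide 21 can be checked with a few lines. The function names are hypothetical; the numbers (200 connections, a 60 second window) come from the slide.

```python
def pool_capacity(connections, window_seconds=60):
    """Connection-seconds available in one window."""
    return connections * window_seconds


def used_fraction(intervals, connections, window_seconds=60):
    """Share of the window's capacity consumed by a tenant.

    intervals is a list of (connections_held, seconds_held) pairs.
    """
    used = sum(held * seconds for held, seconds in intervals)
    return used / pool_capacity(connections, window_seconds)


capacity = pool_capacity(200)                # 200 * 60 connection-seconds / min
slow_trickle = used_fraction([(1, 60)], 200)  # 1 connection held for a full minute
short_burst = used_fraction([(60, 1)], 200)   # 60 connections held for 1 second
```

Both tenants consume 60 connection-seconds, so they register the same usage even though their request counts would look very different in a QPS chart.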

  22. Write Live - 1 minute snapshot

      | Tenant | ConnSec Used | Connections | Errors |
      | frontend:rpc:User | 20% | 5 | 0 |
      | frontend:www:FileId | 3% | 90 | 0 |
      | taskrunner:growth:team_quota | 0.5% | 4 | 0 |
      | email:UserEmail | 1% | 1 | 0 |
      | Total | 24.5% | 100 | 0 |

  23. [Chart: Percentage (0 to 100) over Time, 10:00 to 10:05]

  24. [Chart: Percentage (0 to 100) over Time, 10:00 to 10:05]

  25. [Chart: Percentage (0 to 100) over Time, 10:00 to 10:05]

  26. Throttle mechanism • Auto-throttle heuristics based on history of resource usage per tenant • No predefined quota • Steady state usage by tenant varies wildly 0.001% - 20% • Triggering event -> find “bad” tenant -> decide how much to throttle them -> throttle “bad” tenant • Disabled the auto-throttling mechanism • We have learned a lot

  27. Timer [Diagram: timed phases of a request on an Engine connection, numbered 1 to 9; labels: Start, Acquire Lock, Commit, Read, Write, Engine Conn]

  28. Resources • Used Time -> Execution Time • Bytes In/Out

  29. Write Live - 1 minute snapshot

      | Tenant | Used | Execution | MB Read | Conns | Errors |
      | frontend:rpc:User | 20% | 1% | 1 | 5 | 0 |
      | frontend:www:FileId | 3% | 3% | 30 | 90 | 0 |
      | taskrunner:growth:team_quota | 0.5% | 0.5% | 5 | 4 | 0 |
      | email:UserEmail | 1% | 0.5% | 4 | 1 | 0 |
      | Total | 24.5% | 5% | 40 | 100 | 0 |

  30. Layer: write_live, NumTenants: 360
      Throttle Controls: State: steady, TokensPrimaryPool: 300, TokensThrottledPool: 0
      Throttled Tenants: []
      Period 1:
      | Used | Idle |Execution| Conns | Errors | Size(MB) | Tenants
      | 6.48% | 60.15% | 2.58% | 34057 | 0 | 19 | Aggregated stats
      Top 5 Sources sorted by Used:
      | 0.79% | 94.11% | 0.36% | 423 | 0 | 0 | offline:bluemail:Email
      | 0.76% | 93.68% | 0.20% | 437 | 0 | 0 | frontend:rpc:UserEntity
      | 0.45% | 19.92% | 0.08% | 4922 | 0 | 4 | cape-sfj:cape_dispatcher:CursorEntity
      | 0.42% | 52.50% | 0.07% | 2783 | 0 | 0 | filejournal:fj_server_bin:FileID
      | 0.36% | 93.80% | 0.02% | 252 | 0 | 0 | frontend:www:ActivityEntity

      Layer: write_live, NumTenants: 360
      Throttle Controls: State: steady, TokensPrimaryPool: 300, TokensThrottledPool: 0
      Throttled Tenants: []
      Period 2:
      | Used | Idle |Execution| Conns | Errors | Size(MB) | Tenants
      | 100% | 60.15% | 52.58% | 34057 | 20000 | 19 | Aggregated stats
      Top 5 Sources sorted by Used:
      | 93.79% | 0.11% | 50.36% | 600 | 300 | 0 | offline:bluemail:Email
      | 0.76% | 93.68% | 0.20% | 437 | 254 | 0 | frontend:rpc:UserEntity
      | 0.45% | 19.92% | 0.08% | 4922 | 1293 | 4 | cape-sfj:cape_dispatcher:CursorEntity
      | 0.42% | 52.50% | 0.07% | 2783 | 2913 | 0 | filejournal:fj_server_bin:FileID
      | 0.36% | 93.80% | 0.02% | 252 | 23 | 0 | frontend:www:ActivityEntity

  31. edgestore_throttle --tenant=offline:bluemail:Email --tokens=30 --host=abc-de-fg --layer=write_live

      Layer: write_live, NumTenants: 360
      Throttle Controls: State: throttled, TokensPrimaryPool: 270, TokensThrottledPool: 30
      Throttled Tenants: [offline:bluemail:Email]
      Period 3:
      | Used | Idle |Execution| Conns | Errors | Size(MB) | Tenants
      | 16.20% | 60.15% | 7.58% | 34057 | 1900 | 19 | Aggregated stats
      Top 5 Sources sorted by Used:
      | 10.79% | 0.11% | 5.36% | 600 | 1900 | 0 | offline:bluemail:Email
      | 0.76% | 93.68% | 0.20% | 437 | 0 | 0 | frontend:rpc:UserEntity
      | 0.45% | 19.92% | 0.08% | 4922 | 0 | 4 | cape-sfj:cape_dispatcher:CursorEntity
      | 0.42% | 52.50% | 0.07% | 2783 | 0 | 0 | filejournal:fj_server_bin:FileID
      | 0.36% | 93.80% | 0.02% | 252 | 0 | 0 | frontend:www:ActivityEntity
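The output on slides 30 and 31 suggests that throttling a tenant carves tokens out of the primary pool (300 becomes 270) into a separate throttled pool (30) that only the throttled tenant draws from. The sketch below models just that bookkeeping. `ThrottleControls` and its methods are hypothetical and are not the internals of the `edgestore_throttle` tool; how the real system sizes or shares the throttled pool across several tenants is not stated in the slides.

```python
class ThrottleControls:
    """Bookkeeping for one layer's primary and throttled token pools (a sketch)."""

    def __init__(self, total_tokens):
        self.total = total_tokens
        self.throttled_tokens = 0
        self.throttled_tenants = set()

    @property
    def primary_tokens(self):
        return self.total - self.throttled_tokens

    @property
    def state(self):
        return "throttled" if self.throttled_tenants else "steady"

    def throttle(self, tenant, tokens):
        # Assumption: all throttled tenants share one pool of the given size.
        if not 0 < tokens < self.total:
            raise ValueError("tokens must be between 1 and total - 1")
        self.throttled_tenants.add(tenant)
        self.throttled_tokens = tokens

    def unthrottle(self, tenant):
        self.throttled_tenants.discard(tenant)
        if not self.throttled_tenants:
            self.throttled_tokens = 0  # hand everything back to the primary pool

    def pool_for(self, tenant):
        return "throttled" if tenant in self.throttled_tenants else "primary"


controls = ThrottleControls(300)
controls.throttle("offline:bluemail:Email", 30)
after_throttle = (controls.state, controls.primary_tokens, controls.throttled_tokens)

controls.unthrottle("offline:bluemail:Email")
after_unthrottle = (controls.state, controls.primary_tokens, controls.throttled_tokens)
```

With the numbers from the slides, the throttled tenant is capped at 30 tokens while every other tenant keeps access to the remaining 270.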

  32. Impact • Reduce MTTR • Availability event: • 1. Detection • 2. Investigation • 3. Containment • 4. Short term fix • 5. Long term fix

  33. Findings • Expensive queries • Abusable APIs • Query optimizer • Inconsistencies • Insufficient documentation • Bugs • Perf optimization

  34. Lessons Learned • 1 deployment to rule them all works • There is such a thing as automating too soon • Silently throttling is bad • Throttling should be a temporary state • Not having pre-defined quotas works • Multiple isolation breakdowns (by user, by table, by tenant, by request type (Read/Write), by traffic type (Live vs Offline))

      Throttle mechanism: • Auto-throttle heuristics • Manual throttle using a throttle tool (Query / Throttle / Unthrottle) • Aggregate tool: queries and filters all Engines to isolate the error and limit blast radius while investigating, root-causing, and fixing the underlying problem • Future work (in progress)

      Notes: There was a time when we shut down scripts manually, not knowing who was causing the problem. Found issues with the API, bugs, a poorly documented client, and best practices.

  35. What’s next • Control Plane “brain” • continuously query all Engines • automatically throttle tenants when the system is degraded • detect trends • Per logical micro-shard (and per Id) granularity for throttling

  36. Credits • Zviad Metreveli • Rati Gelashvili • Robert Verkuil • Alex Degtiar • Jonathan Lee
