Scaling Slack: The Good, The Unexpected, and The Road Ahead



SLIDE 1

Michael Demmer

November 6, 2018

Scaling Slack

The Good, The Unexpected, and The Road Ahead

mdemmer@slack-corp.com | @mjdemmer

SLIDE 2

Me

SLIDE 3

(Not) This Talk

  • 1. 2016: Monolith
  • 2. 2016-2018: Microservices
  • 3. 2016-2018: Best Practices
  • 4. 2018: Lessons Learned
SLIDE 4

This Talk

  • 1. 2016: How Slack Worked
  • 2. 2016-2018: Things Got More Interesting
  • 3. 2016-2018: What We Did About It
  • 4. 2018+: Themes and Road Ahead
SLIDE 5

Slack in 2016

SLIDE 6

Slack

SLIDE 7

Workspaces, Channels, Users, and more

A workspace logically contains all channels and messages, as well as users, emoji, bots, and more. All interactions occur within the workspace boundary.

[Diagram: workspaces such as Acme Corp, Duff Beer, Oceanic Airlines, and Delos hosted in us_east_1; each holds channels (#brainstorming, #proj-roadrunner, #marketing, ...) and users (@alice, @bob, @carol, ...).]

SLIDE 8

Slack Facts (2016)

  • User Base: 4M Daily Active Users
  • Largest Organizations: >10,000 Active Users
  • Connectivity: 2.5M peak simultaneous connected; avg 10 hrs/day
  • Engineering Style: Conservative, pragmatic, minimal; most systems built on >10-year-old technology

SLIDE 9

How Slack Works (2016)

[Diagram: the Client holds a websocket through a Message Proxy to the Message Server (Java) and makes HTTP API calls to the Webapp (PHP) tier, which uses the Job Queue and MySQL; deployed across us_east_1 and us_west_1.]

SLIDE 10

Client / Server Flow

Initial login:

  • Download full workspace model with all channels, users, emoji, etc.
  • Establish real-time websocket

[Diagram: 1: rtm.start call to the Webapp (PHP); 2: response with prefs: {...}, users: {...}, channels: {...}, emoji: {...}, ms: “ms1.slack-msgs.com”; 3: websocket connect to the Message Proxy.]

SLIDE 11

Client / Server Flow

Initial login:

  • Download full workspace model with all channels, users, emoji, etc.
  • Establish real-time websocket

While connected:

  • Push updates via websocket
  • API calls for channel history, message edits, create channels, etc.

[Diagram: the client calls APIs such as reactions.add {message: ...} against the Webapp (PHP) and receives pushes via the Message Proxy.]

SLIDE 12

Sharding And Routing

Workspace Sharding

  • Assign a workspace to a DB and MS shard at creation
  • Metadata table lookup for each API request to route

[Diagram: the Webapp (PHP) queries the Mains (select * from teams where id = 1234 → {id: 1234, domain: demmer, db_shard: 35, ms_shard: 11, ...}) and then routes to the MySQL Shards and Message Servers.]
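The metadata-table routing described on this slide can be sketched as a tiny lookup function. This is a hypothetical illustration: the table contents, host names, and `route_request` helper are invented; in production the lookup is a SQL query against the mains.

```python
# Hypothetical sketch of workspace-shard routing: a metadata table on the
# mains maps each workspace id to its database and message-server shard.

TEAMS_METADATA = {
    1234: {"domain": "demmer", "db_shard": 35, "ms_shard": 11},
    5678: {"domain": "acme", "db_shard": 2, "ms_shard": 7},
}

def route_request(team_id):
    """Resolve the DB and message-server hosts for a workspace."""
    # In production this is a SQL lookup (select * from teams where id = ...).
    meta = TEAMS_METADATA[team_id]
    db_host = f"db-shard-{meta['db_shard']}.example.internal"
    ms_host = f"ms-shard-{meta['ms_shard']}.example.internal"
    return db_host, ms_host
```

Every API request performs this resolution, which is why the mains become a hot spot as request volume grows.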

SLIDE 13

Sharding And Routing

Workspace Sharding

  • Assign a workspace to a DB and MS shard at creation
  • Metadata table lookup for each API request to route

“Herd of Pets”

  • DBs run in active/active pairs with application failover
  • Service hosts are addressed in config and manually replaced

[Diagram: the Webapp (PHP) with the Mains, MySQL Shards, and Message Servers.]

SLIDE 14

Why This Worked

Server Experience

Implementation model is straightforward, easy to reason about and debug.

  • All operations are workspace scoped
  • Horizontally scale by adding servers
  • Few components or dependencies

Client Experience

Data model lends itself to a seamless, rich real-time client experience.

  • Full data model available in memory
  • Updates appear instantly
  • Everything feels real time
SLIDE 15

Things Get More Interesting...

SLIDE 16

Things Get More Interesting

  • Product Model
  • Size and Scale

SLIDE 17

Slack Growth

SLIDE 18

Slack Facts (2018)

  • User Base: >8M Daily Active Users
  • Largest Organizations: >125,000 Active Users
  • Connectivity: >7M peak simultaneous connected; avg 10 hrs/day
  • Engineering Style: Still pragmatic, but embrace complexity where needed to solve the hardest problems

SLIDE 19

Slack Facts (2018)

  • User Base: >8M Daily Active Users (2x)
  • Largest Organizations: >125,000 Active Users (10x)
  • Connectivity: >7M peak simultaneous connected (3x); avg 10 hrs/day
  • Engineering Style: Still pragmatic, but embrace complexity where needed to solve the hardest problems

SLIDE 20

Change the Model

A workspace logically contains all channels and messages, as well as users, emoji, bots, and more. All interactions occur within the workspace boundary.

[Recap diagram: the 2016 workspace model, with independent workspaces such as Acme Corp, Duff Beer, Oceanic Airlines, and Delos in us_east_1.]

SLIDE 21

Change the Model

Enterprise Workspaces

[Diagram: an enterprise organization (Wayne Enterprises) containing multiple workspaces (Wayne Shipping, Wayne Finance, Wayne Security) alongside standalone workspaces such as Acme Corp, Duff Beer, Oceanic Airlines, and Delos.]

SLIDE 22

Change the Model

Shared Channels and Enterprise Workspaces

[Diagram: a shared channel connecting two separate workspaces (Agents of SHIELD and Stark Industries), alongside the enterprise and standalone workspaces.]

SLIDE 23

Challenges

Recurring Issues

  • Large organizations: Boot metadata download is slow and expensive
  • Thundering Herd: Load to connect >> Load in steady state
  • Hot spots: Overwhelm database hosts (mains and shards) and other systems
  • Herd of Pets: Manual operation to replace specific servers
  • Cross Workspace Channels: Need to change assumptions about partitioning
SLIDE 24

So What Did We Do?

SLIDE 25

What Did We Do

  • Message Services: Service Decomposition
  • Vitess: Fine-Grained DB Sharding
  • Thin Client Model: Flannel Cache

SLIDE 26

What Did We Do

  • Thin Client Model: Flannel Cache

SLIDE 27

Challenge: Boot Model Explosion

boot_payload_size ~= (num_users * user_profile_bytes) + (num_channels * (channel_info_size + (num_users_in_channel * user_id_bytes)))

Users     Profiles   Channels   Total
12        6 KB       1 KB       7 KB
530       140 KB     28 KB      168 KB
4,008     5 MB       2 MB       7 MB

SLIDE 28

Challenge: Boot Model Explosion

boot_payload_size ~= (num_users * user_profile_bytes) + (num_channels * (channel_info_size + (num_users_in_channel * user_id_bytes)))

Users     Profiles   Channels   Total
12        6 KB       1 KB       7 KB
530       140 KB     28 KB      168 KB
4,008     5 MB       2 MB       7 MB
44,030    36 MB      25 MB      59 MB
148,170   78 MB      40 MB      118 MB
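The formula above can be turned into a small model to see the shape of the growth. The byte constants below are assumptions chosen only for illustration, not Slack's actual serialization sizes:

```python
# Illustrative model of the boot-payload formula. Constants are invented.

def boot_payload_bytes(num_users, num_channels, avg_users_per_channel,
                       user_profile_bytes=500, channel_info_bytes=200,
                       user_id_bytes=10):
    profiles = num_users * user_profile_bytes
    channels = num_channels * (channel_info_bytes +
                               avg_users_per_channel * user_id_bytes)
    return profiles + channels

small = boot_payload_bytes(12, 5, 8)            # a tiny team
large = boot_payload_bytes(148_170, 20_000, 500)  # a huge organization
# The payload grows with users *and* with channel membership, so the
# largest organizations pay far more than linearly compared to small teams.
```

With these assumed constants the large organization's payload is thousands of times the small team's, which is why the full-model boot download became untenable.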

SLIDE 29

Thin Client Model

[Diagram: the 2016 architecture again (Client, Message Proxy, Message Server, Webapp, Job Queue, MySQL) before the thin-client changes, across us_east_1 and us_west_1.]

SLIDE 30

Thin Client Model

[Diagram: a Flannel Cache tier is added at the edge between the Client and the Webapp / Message Proxy, registered in Consul, in both us_east_1 and us_west_1.]

SLIDE 31

Thin Client Model

Flannel Service: a globally distributed edge cache

  • Minimize Workspace Model: much smaller boot payload
  • Routing: workspace affinity for cache locality
  • Query API: fetch unknown objects from the cache
  • Cache Updates: proxy subscription messages to clients

SLIDE 32

Thin Client Model

Unblock Large Organizations

Adapting clients to a lazy load model was a critical change to enable Slack for large organizations.

  • Huge reduction in payload times on initial connect
  • Flannel efficiently responds to >1 million queries per second
  • Adds challenges of cache coherency and reconciling business logic
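The lazy-load pattern the bullets above describe can be sketched in miniature: rather than downloading every object at boot, the client asks an edge cache (a Flannel-like service, simulated here) only for objects it has not yet seen. All class and field names are illustrative, not Slack's actual API:

```python
# Sketch of a lazy-loading thin client in front of an edge cache.

class EdgeCache:
    def __init__(self, backend):
        self.backend = backend   # stands in for the real data tier
        self.queries = 0

    def fetch(self, object_id):
        self.queries += 1
        return self.backend[object_id]

class ThinClient:
    def __init__(self, service):
        self.service = service
        self.local = {}          # objects resolved so far

    def get(self, object_id):
        if object_id not in self.local:   # lazy: fetch only on first use
            self.local[object_id] = self.service.fetch(object_id)
        return self.local[object_id]

backend = {f"U{i}": {"name": f"user{i}"} for i in range(100_000)}
client = ThinClient(EdgeCache(backend))
client.get("U42"); client.get("U42"); client.get("U7")
# Only two backend queries were issued, despite 100,000 users existing.
```

The cache-coherency challenge noted above arises because `local` copies like these must be invalidated when the underlying objects change.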
SLIDE 33

What Did We Do

  • Vitess: Fine-Grained DB Sharding

SLIDE 34

Challenge: Hot Spots & Manual Repair

SLIDE 35

Vitess

[Diagram: the architecture as of the thin-client work (Client, Flannel Cache, Message Proxy, Message Server, Webapp, Job Queue, MySQL, Consul), before Vitess is introduced.]

SLIDE 36

Vitess

[Diagram: a Vitess tier (a VtGate routing layer in front of VtTablet + MySQL) replaces direct MySQL access from the Webapp; Flannel Cache and Consul remain as before.]

SLIDE 37

Vitess

  • Flexible Sharding: Vitess manages the per-table sharding policy
  • Topology Management: database servers self-register
  • Single Master: using GTID and semi-sync replication
  • Failover: Orchestrator promotes a replica on failover
  • Resharding Workflows: automatically expand the cluster

SLIDE 38

Vitess

Fine-Grained Sharding

Migrating to a channel-sharded / user-sharded data model helps mitigate hot spots for large teams and thundering herds.

  • Retains MySQL at the core for developer / operations continuity
  • More mature topology management and cluster expansion systems
  • Data migrations that change the sharding model take a long time
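The load-spreading effect of fine-grained sharding can be seen in a toy contrast of the two schemes. The hash function and shard count here are illustrative only (Vitess expresses sharding policy through per-table configuration, not this code), but the effect is the same: hashing on channel keys spreads one big workspace across many shards instead of concentrating it on one host.

```python
import hashlib

NUM_SHARDS = 16

def shard_for(key: str) -> int:
    """Map an arbitrary key onto one of NUM_SHARDS shards (illustrative)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Workspace sharding: every object in one workspace shares one shard key,
# so a huge workspace becomes a single hot spot.
workspace_shards = {shard_for("workspace:acme")}

# Channel sharding: each channel gets its own key, spreading one large
# workspace's load across (nearly) all shards.
channel_shards = {shard_for(f"channel:acme:{i}") for i in range(1000)}
```

This is also why the slide warns about scatter/gather: a query that used to touch one workspace shard may now need to visit many channel shards.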
SLIDE 39

What Did We Do

  • Message Services: Service Decomposition

SLIDE 40

Challenge: Shared Channels?

[Diagram: two workspaces (Agents of SHIELD and Stark Industries), each assigned to its own Message Server.]

SLIDE 41

Challenge: Shared Channels?

[Diagram: a channel shared between Agents of SHIELD and Stark Industries spans two different Message Servers, breaking the workspace-partitioning assumption.]

SLIDE 42

Message Server to Services

[Diagram: the pre-decomposition architecture, with the monolithic Message Server alongside Flannel Cache, Vitess (VtGate/VtTablet/MySQL), Webapp, Job Queue, and Consul.]

SLIDE 43

Message Server to Services

[Diagram: the Message Server is decomposed into Channel Server, Gateway Server, Presence Server, Admin Server, and a residual Message Server, alongside the Webapp, Job Queue, Vitess (VtGate/VtTablet/MySQL), Flannel Cache, and Consul across us_east_1 and us_west_1.]

SLIDE 44

Message Server to Services

  • Gateway Server: websocket termination and subscriptions
  • Admin Server: cluster management and routing
  • Presence Server: store and distribute presence state
  • Channel Server: pub/sub fanout with 5-minute buffering
  • (Legacy) Message Server: used for reminders and the Google Calendar integration

SLIDE 45

Message Server to Services

Generic Messaging Services

Everything is a pub/sub “channel”, including message channels as well as workspace / user metadata channels.

  • Clients / Flannel subscribe to updates for all relevant objects
  • Each Message Service has clear, dedicated roles and responsibilities
  • Self-healing cluster orchestration to maintain availability
  • Each user session now depends on many more servers being available
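The "everything is a pub/sub channel" idea can be sketched with one subscribe/publish mechanism serving both message channels and metadata channels. The class and topic names below are illustrative, not the Channel Server's actual interface:

```python
from collections import defaultdict

class ChannelServer:
    """Minimal pub/sub fanout: one mechanism for all topic kinds."""

    def __init__(self):
        self._subs = defaultdict(list)   # topic -> subscriber callbacks

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, event):
        for cb in self._subs[topic]:     # fan out to every subscriber
            cb(event)

server = ChannelServer()
received = []
server.subscribe("channel:C123", received.append)    # a message channel
server.subscribe("user-meta:U42", received.append)   # a metadata channel
server.publish("channel:C123", {"type": "message", "text": "hi"})
server.publish("user-meta:U42", {"type": "profile_changed"})
```

Because metadata updates flow through the same fanout as messages, clients and Flannel can keep their lazily loaded objects fresh with a single subscription mechanism.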
SLIDE 46

What Did We Do

  • Message Services: Service Decomposition
  • Vitess: Fine-Grained DB Sharding
  • Lazy Client: Flannel Cache

SLIDE 47

Some Themes...

SLIDE 48

Herd of Pets to Service Mesh

Topology Management

For each of these projects (and more), the architecture evolved from hand-configured server hostnames to a discovery mesh.

  • Enables self-registration and automatic cluster repair
  • Adds reliance on service discovery infrastructure (Consul)
  • Led to changes in service ownership and on-call rotation

SLIDE 49

Scatter May Be Harmful

Fine-Grained Sharding

Migrating from workspace-scoped to channel- or user-scoped sharding spreads out the load, but sometimes requires scatter/gather queries.

  • Removes artificial couplings on back-end systems
  • Teams are less isolated, so they need extra protection from noisy neighbors
  • When scattering, clients should tolerate partial results and retry
  • Tail latencies can dominate performance when fetching from many shards
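The partial-results recommendation above can be sketched as a scatter/gather read that collects what it can and reports which shards failed so the caller can retry just those. The shard queries and the failure below are simulated; function names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scatter_gather(shard_queries, timeout_s=2.0):
    """Run one query per shard; return results plus the shards that failed."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=len(shard_queries)) as pool:
        futures = {pool.submit(fn): shard for shard, fn in shard_queries.items()}
        for fut in as_completed(futures, timeout=timeout_s):
            shard = futures[fut]
            try:
                results[shard] = fut.result()
            except Exception:
                failed.append(shard)   # caller may retry only these shards
    return results, failed

def shard_down():
    raise RuntimeError("shard down")

queries = {
    0: lambda: ["msg-a"],
    1: lambda: ["msg-b"],
    2: shard_down,               # simulate one unavailable shard
}
results, failed = scatter_gather(queries)
```

Note the tail-latency point: the overall latency of a scatter like this is set by the slowest shard, which is why fanning out to many shards needs timeouts and hedging.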
SLIDE 50

Deploying Is Only The Beginning

Deprecation Challenges

As hard as it is to add new services into production under load, it has proven as hard, if not harder, to remove old ones.

  • With few exceptions, all 2016 services are still in production
  • Need to support legacy clients and integrations
  • Data migrations need application changes and take time

SLIDE 51

Grinding It Out

Performance Short Game

Architectural rework is necessary, but less glamorous performance optimizations pay huge dividends.

  • Simple approaches to caching or refactoring
  • Client-side jitter to spread out load
  • Eliminate unnecessary methods / queries
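The client-side jitter bullet above can be sketched as jittered exponential backoff: instead of every client reconnecting at the same instant after an outage (the thundering herd from earlier slides), each waits its backoff delay plus a random offset. The parameter values are illustrative:

```python
import random

def reconnect_delay(attempt, base_s=1.0, max_s=60.0, jitter_frac=0.5):
    """Exponential backoff with proportional random jitter (illustrative)."""
    delay = min(max_s, base_s * (2 ** attempt))
    jitter = delay * jitter_frac * random.random()   # smear forward up to 50%
    return delay + jitter

delays = [reconnect_delay(a) for a in range(5)]
# Delays grow roughly 1, 2, 4, 8, 16 seconds, each smeared forward so a
# fleet of clients spreads its reconnects over time instead of stampeding.
```

The same jitter idea applies to periodic work (cache refreshes, polling), which is why it is listed alongside caching as a cheap, high-leverage fix.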

SLIDE 52

How Slack Works (2016)

[Diagram: the original 2016 architecture again for contrast: Client, Message Proxy, Message Server (Java), Webapp (PHP), Job Queue, MySQL across us_east_1 and us_west_1.]

SLIDE 53

How Slack Works (2018)

[Diagram: the 2018 architecture: Client over websocket and HTTP API calls to the Gateway, Channel, Presence, Admin, and Message servers, the Webapp tier, Job Queue, Vitess (VtGate/VtTablet/MySQL), Flannel Cache, and Consul across us_east_1 and us_west_1.]

SLIDE 54

We’re Not Done Yet

  • Storage POPs: geographically distributed back end
  • Services: decompose the monolith and improve the service mesh
  • Job Queue: revamp the asynchronous task queue
  • Resiliency: degraded functionality when subsystems are unavailable
  • Eventual Consistency: change API expectations
  • Network Scale: stay ahead of the growth curve

SLIDE 55

Thank You!


SLIDE 56

BACKUP

SLIDE 57

How Slack Works (c. 2018)

[Diagram: identical to the 2018 architecture shown on slide 53.]

SLIDE 58

Message Server

  • Client Connections: websocket termination, user / connection state, and subscriptions
  • Webapp Actions: communication and routing from Webapp → Message Server for channel messages
  • Presence Indications: user presence state, updates, and presence subscriptions (that little green indicator)
  • Subscriptions and Fanout: last 5 minutes of history, plus initial subscription and fanout of messages
  • Scheduled Messages: used for reminders and the Google Calendar integration

SLIDE 59

Team Sharded MySQL

  • Team Sharding: application-defined sharding policy routes all queries to the team shard
  • Manual Topology Management: operator-managed host configuration is injected into application code
  • Active Master / Master: both sides are writable masters; biases for availability with best-effort consistency
  • Application Retry Failover: if the preferred side is unavailable, connect to the backup side and try again
  • Split Shards: manually orchestrated switchover to divide some teams onto a new host
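The application-retry failover described above can be sketched as a try-preferred-then-backup wrapper. The connection objects here are simulated stand-ins, and the names are illustrative:

```python
class SideDown(Exception):
    """Simulated connection failure to one side of the active/active pair."""

class MySQLSide:
    def __init__(self, rows, available=True):
        self.rows = rows
        self.available = available

    def query(self, sql):
        if not self.available:
            raise SideDown("side unavailable")
        return self.rows

def query_with_failover(preferred, backup, sql):
    """Try the preferred master first; fall back to the backup master."""
    try:
        return preferred.query(sql)
    except SideDown:
        # Best-effort consistency: the backup side may be slightly stale.
        return backup.query(sql)

rows = [{"id": 1234, "domain": "demmer"}]
down_side = MySQLSide(rows, available=False)
up_side = MySQLSide(rows)
result = query_with_failover(down_side, up_side,
                             "select * from teams where id = 1234")
```

Because both sides accept writes, this pattern biases for availability at the cost of consistency, which is exactly the trade-off Vitess's single-master model later removed.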

SLIDE 60

QCon 2016 QCon 2017