
Spotify Lessons: Learning to Let Go of Machines



  1. Spotify Lessons: Learning to Let Go of Machines
 James Wen, Site Reliability Engineer at Spotify
 ALF Squad, Infrastructure & Operations (IO) Tribe

  2. Let’s control how feature developers think about what their code is actually running on.

  3. Takeaways
 • Feature developers = happiest with feature work
 • Find out developer machine concerns and mitigate
 • Migrating to cloud or hybrid? Start embracing ephemeral service design and infrastructure

  4. Agenda • Why? • Journey • Hybrid Cloud • Ops in Squads • Future • Learnings

  5. Why? Why don’t we want feature devs to care too much about infrastructure and machines?

  6. Why?
 Time taken on infrastructure tasks = time taken away from feature work
 Feature devs = focused on features

  7. Spotify Scale Stats
 - 140 Million+ Monthly Active Users
 - 50 Million+ Subscribers
 - 30 Million+ Songs
 - 2 Billion+ Playlists
 - Available in 60 markets

  8. Spotify Dev Scale Stats
 - ~900 Devs
 - ~100 Tech Teams
 - ~2000 Services

  9. Spotify Machine Scale Stats
 - ~10,000 Bare Metal Hosts
 - ~13,000 Hosts on GCP
 - 46 Hardware/VM Types

  10. Example: Capacity Planning
 [Figure labels: "Avg # devs on a team", "Capacity Planning"]

  11. Scale doesn’t really matter
 - Smaller companies/teams = developer time is more valuable
 - Larger companies/teams = wasted infra time scales as well

  12. Other Infrastructure Tasks
 - Machine provisioning
 - Failure planning
 - Security updates
 - Machine maintenance

  13. Dedicated Ops?

  14. Dedicated Ops?
 ~2000 Services
 74 Infrastructure and Operations Engineers
 If all IO engineers → dedicated ops: 27:1 service:engineer ratio
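
The 27:1 figure on slide 14 follows directly from the two numbers on that slide. A minimal sketch of the arithmetic, in Python; the function and constant names are illustrative and not from the talk, and the per-team and per-dev comparison is an added back-of-the-envelope calculation using the approximate figures from slide 8 (~900 devs, ~100 tech teams):

```python
def ratio(numerator: float, denominator: float) -> float:
    """Return how many units of numerator fall on one unit of denominator."""
    return numerator / denominator


SERVICES = 2000     # ~2000 services (slides 8 and 14)
IO_ENGINEERS = 74   # Infrastructure and Operations engineers (slide 14)
TECH_TEAMS = 100    # ~100 tech teams (slide 8)
DEVS = 900          # ~900 devs (slide 8)

# Slide 14: every IO engineer becomes dedicated ops.
dedicated_ops = ratio(SERVICES, IO_ENGINEERS)

# Added comparison (not on the slides): the same services spread over all teams and devs.
per_team = ratio(SERVICES, TECH_TEAMS)
per_dev = ratio(SERVICES, DEVS)

print(f"dedicated ops: {dedicated_ops:.0f}:1 services per IO engineer")
print(f"spread over squads: {per_team:.0f} services per team, {per_dev:.1f} per dev")
```

Under these rounded inputs, spreading the same services over all squads comes to roughly 20 per team, or about 2 per developer, compared with 27 per engineer in a dedicated ops group.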

  15. Ops in Squads
 Feature teams handle their own ops and provisioning,
 using the services and tooling the Infrastructure and Operations tribe has written

  16. We control the level of context feature teams need to operate their services.

  17. - Developer Happiness 
 - Developer effectiveness and context

  18. Journey

  19. - Ops in Squads
 - Hybrid Cloud (Ephemerality)

  20. Starting Out

  21. Historical: Feature Developer’s Context for Service’s Capacity
 [Diagram labels: San Jose, Stockholm; Rack 1, Rack 2, Rack 2; lon-1-a, lon-1-b, lon-1-c, lon-1-d, lon-1-e, lon-1-f; keys; updated, updated]

  22. Machine Context
 - Packages: Unbound v1.6.3, OpenSSL v1.0.0f
 - Hostname: ash2-metadata-a.ash2.spotify.net
 - Machine specs (CPU, RAM, disk, etc.): 8 GB RAM, 2 Cores
 - Uptime and service duration: 3 Years
 - Location: In Virginia
 - Local state (files on disk, info in memory): Tarred Logs

  23. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  24. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  25. ServerDB

  26. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  27. ProvGun/ProvCannon

  28. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  29. DNS

  30. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  31. Nameless

  32. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  33. Cortana

  34. Cortana

  35. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  36. Helios and Containers

  37. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  38. Google Cloud Platform (GCP)

  39. ash2-cortana-a1.ash2: Zone, Service, Group, Sequential #
 gew1-cortana-a-l33t.gew1: Zone, Service, Pool, Random 4 Chars
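
Slide 39 contrasts two hostname layouts: an older one ending in a group letter plus a sequential number, and a newer one ending in a pool name plus four random characters. A minimal Python sketch of how the two layouts might be told apart follows. The regular expressions are assumptions inferred only from the two example hostnames on the slide, and the scheme labels `sequential` and `pooled`, the field name `site`, and the function `parse_hostname` are hypothetical; none of this is Spotify tooling.

```python
import re
from typing import Dict, Optional

# Assumed layout: <zone>-<service>-<group><sequential #>.<site>
SEQUENTIAL = re.compile(
    r"^(?P<zone>[a-z0-9]+)-(?P<service>[a-z0-9]+)-"
    r"(?P<group>[a-z]+)(?P<seq>\d+)\.(?P<site>[a-z0-9]+)$"
)

# Assumed layout: <zone>-<service>-<pool>-<random 4 chars>.<site>
POOLED = re.compile(
    r"^(?P<zone>[a-z0-9]+)-(?P<service>[a-z0-9]+)-"
    r"(?P<pool>[a-z0-9]+)-(?P<suffix>[a-z0-9]{4})\.(?P<site>[a-z0-9]+)$"
)


def parse_hostname(name: str) -> Optional[Dict[str, str]]:
    """Split a hostname into its labelled parts, or return None if neither layout fits."""
    for scheme, pattern in (("sequential", SEQUENTIAL), ("pooled", POOLED)):
        match = pattern.match(name)
        if match:
            return {"scheme": scheme, **match.groupdict()}
    return None


for host in ("ash2-cortana-a1.ash2", "gew1-cortana-a-l33t.gew1"):
    print(host, parse_hostname(host))
```

The random suffix in the second layout means a hostname no longer says which instance it is within a pool, which fits the talk's theme of treating machines as interchangeable.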

  40. Cortana Pool Manager

  41. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  42. Regional Managed Instance Groups

  43. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  44. MBMI: Minimal Base Machine Image

  45. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  46. Phoenix

  47. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  48. Current: Feature Developer’s Context for Service’s Capacity
 Stockholm Pool: 4 instances x (High Mem)
 GCP - europe-west-1 Pool: 2 instances x (n1-standard-32)
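
Slide 48 reduces what a feature developer needs to know about capacity to a location, a pool, an instance count, and a machine type. A minimal Python sketch of that reduced context follows. The `Pool` class and `total_instances` helper are hypothetical names for illustration, and pairing Stockholm with the High Mem pool and GCP europe-west-1 with the n1-standard-32 pool is a reading of the flattened slide text, not something the slide states explicitly.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Pool:
    """The capacity context a feature team sees: no racks, no hostnames."""
    location: str
    machine_type: str
    instances: int


def total_instances(pools: Iterable[Pool]) -> int:
    """Sum the instance counts across every pool of a service."""
    return sum(pool.instances for pool in pools)


# Values taken from slide 48; the location-to-pool pairing is assumed.
service_pools = [
    Pool(location="Stockholm", machine_type="High Mem", instances=4),
    Pool(location="GCP - europe-west-1", machine_type="n1-standard-32", instances=2),
]

for pool in service_pools:
    print(f"{pool.location}: {pool.instances} x {pool.machine_type}")
print("total instances:", total_instances(service_pools))
```

Compared with the rack and per-host labels on slide 21, nothing in this structure identifies an individual machine.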

  49. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  50. Future

  51. Gordon (Cloud DNS)

  52. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  53. Autoscaling

  54. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  55. Right Sizing

  56. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  57. Future: Feature Developer’s Context for Service’s Capacity
 GCP - asia-east-1: Service Pool
 GCP - europe-west-1: Service Pool
 GCP - us-central-1: Service Pool

  58. Feature Developer Concerns: How to get? • How to track? • How to talk to it? • What tools on it? • What to put on it? • Maintenance? • Where? • Up to date? • How long? • Available? • How many? • Specs? • Service + Business

  59. Learnings

  60. Why Pets to Cattle was Difficult:
 - Manual/tedious setup
 - Wait times for machine becoming ready (packages, DNS)
 - Non-automatic security updates
 - A fixed, reliable hostname
 - SSH Access
 - Always up/present unless team tears down

  61. Ephemerality Learnings
 - Monitoring
 - Logging
 - Service Design
 - Incidents

  62. Hybrid Learnings
 - Replicate bare metal functionality, then iterate
 - When in doubt, devs provision up and many
 - Migration = great time to influence dev paradigms
 - Don’t need to DIY

  63. DevEx Learnings
 - Feature devs need carrots, sledgehammers, and/or limos to change
 - Edge Cases: REST API + CLI = provide enough for feature teams to handle the edge cases

  64. Recap
 - Decrease necessary infrastructure context
 - Increase reliability
 - Save $$$
 - Increase dev happiness and productivity

  65. Let’s strategically control and limit how feature developers think about infrastructure.

  66. James Wen
 Email: jameswen@spotify.com
 Twitter/GitHub: @rochesterinnyc
 LinkedIn: jamesrwen
 Spotify is hiring! spotifyjobs.com
 IO Tribe
