Spotify Lessons: Learning to Let Go of Machines



slide-1
SLIDE 1

Spotify Lessons:
 Learning to Let Go of Machines

IO Tribe

James Wen, Site Reliability Engineer at Spotify
 ALF Squad, Infrastructure & Operations Tribe

slide-2
SLIDE 2
slide-3
SLIDE 3

Let’s control how feature developers think about what their code is actually running on.

slide-4
SLIDE 4

Takeaways

  • Feature developers = happiest with feature work
  • Find out developer machine concerns and mitigate
  • Migrating to cloud or hybrid? Start embracing ephemeral service design and infrastructure

slide-5
SLIDE 5

Agenda

  • Why?
  • Journey
  • Hybrid Cloud
  • Ops in Squads
  • Future
  • Learnings
slide-6
SLIDE 6

Why?

Why don’t we want feature devs to care too much about infrastructure and machines?

slide-7
SLIDE 7

Why?

Time taken on infrastructure tasks = time taken away from feature work
Feature devs = focused on features

slide-8
SLIDE 8

Spotify Scale Stats

  • 140 Million+ Monthly Active Users
  • 50 Million+ Subscribers
  • 30 Million+ Songs
  • 2 Billion+ Playlists
  • Available in 60 markets
slide-9
SLIDE 9

Spotify Dev Scale Stats

~900 Devs
 ~100 Tech Teams
 ~2000 Services

slide-10
SLIDE 10

Spotify Machine Scale Stats

~10,000 Bare Metal Hosts
 ~13,000 Hosts on GCP
 46 Hardware/VM Types

slide-11
SLIDE 11

Example: Capacity Planning

[Figure: Avg # devs on a team; Capacity Planning]

slide-12
SLIDE 12

Scale doesn’t really matter

  • Smaller companies/teams = developer time is more valuable
  • Larger companies/teams = wasted infra time scales as well

slide-13
SLIDE 13

Other Infrastructure Tasks

  • Machine provisioning

  • Failure planning
  • Security updates
  • Machine maintenance
slide-14
SLIDE 14

Dedicated Ops?

slide-15
SLIDE 15

Dedicated Ops?

~2000 Services
 74 Infrastructure and Operations Engineers
 If all IO engineers → dedicated ops: 27:1 service:engineer ratio
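
The 27:1 figure follows from the two numbers on this slide. A minimal sketch of the arithmetic, with variable names that are illustrative and not from the talk:

```python
# Figures quoted on the slide (the service count is approximate).
services = 2000       # ~2000 services
io_engineers = 74     # Infrastructure and Operations engineers

# Services each engineer would own if every IO engineer did dedicated ops.
services_per_engineer = services / io_engineers
print(f"{services_per_engineer:.0f}:1 service:engineer ratio")
```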

slide-16
SLIDE 16

Ops In Squads

Feature teams handle their own ops and provisioning, using the services and tooling the Infrastructure and Operations tribe has written.

slide-17
SLIDE 17

We control the level of context feature teams need to operate their services.

slide-18
SLIDE 18
  • Developer Happiness
  • Developer effectiveness and context

slide-19
SLIDE 19

Journey

slide-20
SLIDE 20
  • Ops in Squads
  • Hybrid Cloud (Ephemerality)

slide-21
SLIDE 21

Starting Out

slide-22
SLIDE 22

Historical: Feature Developer’s Context for Service’s Capacity

[Diagram: Stockholm and San Jose sites; Rack 1 and Rack 2; hosts lon-1-a, lon-1-b, lon-1-c, lon-1-d, lon-1-e, lon-1-f; annotations "keys updated" and "updated"]

slide-23
SLIDE 23

Machine Context

  • Packages (e.g. Unbound v1.6.3, OpenSSL v1.0.0f)
  • Hostname (e.g. ash2-metadata-a.ash2.spotify.net)
  • Machine specs (CPU, RAM, disk, etc.) (e.g. 2 Cores, 8 GB RAM)
  • Uptime and service duration (e.g. 3 Years)
  • Location (e.g. In Virginia)
  • Local state (files on disk, info in memory) (e.g. Tarred Logs)

slide-24
SLIDE 24

Feature Developer Concerns

How to get? How many? Specs? How long? How to talk to it? Where? Up to date? How to track? What tools on it? Maintenance? What to put on it? Available? Service + Business

slide-25
SLIDE 25
slide-26
SLIDE 26

Feature Developer Concerns

How to get? How many? Specs? How long? How to talk to it? Where? Up to date? How to track? What tools on it? Maintenance? What to put on it? Available? Service + Business

slide-27
SLIDE 27

ServerDB

slide-28
SLIDE 28

Feature Developer Concerns

How to get? How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? Where? Up to date? How many? Specs?

slide-29
SLIDE 29

ProvGun/ProvCannon

slide-30
SLIDE 30

Feature Developer Concerns

How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?

slide-31
SLIDE 31

DNS

slide-32
SLIDE 32

Feature Developer Concerns

How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?

slide-33
SLIDE 33

Nameless

slide-34
SLIDE 34

Feature Developer Concerns

How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?

slide-35
SLIDE 35

Cortana

slide-36
SLIDE 36

Cortana

slide-37
SLIDE 37

Feature Developer Concerns

How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?

slide-38
SLIDE 38

Helios and Containers

slide-39
SLIDE 39

Feature Developer Concerns

How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business Where? Up to date? How many? Specs? How to track? How to get?

slide-40
SLIDE 40

Google Cloud Platform

slide-41
SLIDE 41

ash2-cortana-a1.ash2
 Zone (ash2), Service (cortana), Group (a), Sequential # (1)

gew1-cortana-a-l33t.gew1
 Zone (gew1), Service (cortana), Pool (a), Random 4 Chars (l33t)
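
The two naming schemes on this slide can be told apart mechanically. The sketch below is illustrative only: the regular expressions are inferred from the two example hostnames, and the names `SEQUENTIAL`, `POOLED`, and `parse_hostname` are hypothetical, not part of any Spotify tooling.

```python
import re
from typing import Optional

# Hypothetical patterns inferred from the slide's two example hostnames.
# Older scheme: zone-service-<group letter><sequential number>.zone
SEQUENTIAL = re.compile(
    r"(?P<zone>[a-z0-9]+)-(?P<service>[a-z0-9]+)-"
    r"(?P<group>[a-z])(?P<seq>\d+)\.(?P=zone)"
)
# Pooled scheme: zone-service-<pool letter>-<4 random chars>.zone
POOLED = re.compile(
    r"(?P<zone>[a-z0-9]+)-(?P<service>[a-z0-9]+)-"
    r"(?P<pool>[a-z])-(?P<suffix>[a-z0-9]{4})\.(?P=zone)"
)


def parse_hostname(name: str) -> Optional[dict]:
    """Return the hostname's components, or None if neither scheme matches."""
    for scheme, pattern in (("sequential", SEQUENTIAL), ("pooled", POOLED)):
        match = pattern.fullmatch(name)
        if match:
            return {"scheme": scheme, **match.groupdict()}
    return None


print(parse_hostname("ash2-cortana-a1.ash2"))
print(parse_hostname("gew1-cortana-a-l33t.gew1"))
```

The random suffix in the pooled scheme carries no ordering or identity, which fits the talk's theme: a developer addresses the pool, not an individual machine.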

slide-42
SLIDE 42

Cortana Pool Manager

slide-43
SLIDE 43

Feature Developer Concerns

How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?

slide-44
SLIDE 44

Regional Managed Instance Groups

slide-45
SLIDE 45

Feature Developer Concerns

How long? How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? Up to date? How many? Specs?

slide-46
SLIDE 46

MBMI: Minimal Base Machine Image

slide-47
SLIDE 47

Feature Developer Concerns

How to talk to it? What tools on it? Maintenance? What to put on it? Available? Service + Business How to track? How to get? Where? How long? Up to date? How many? Specs?

slide-48
SLIDE 48

Phoenix

slide-49
SLIDE 49

Feature Developer Concerns

How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?

slide-50
SLIDE 50

Current: Feature Developer’s Context for Service’s Capacity

GCP - europe-west-1 Pool: 2 instances x (n1-standard-32)
 Stockholm Pool: 4 instances x (High Mem)
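
One way to picture this reduced context is as a short list of pools, each described only by a location, an instance count, and a machine type. This is a minimal sketch under that assumption: the `Pool` class and `total_instances` helper are hypothetical names, and the values are copied from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pool:
    """Hypothetical record of what a feature developer sees for one pool."""
    location: str
    instances: int
    machine_type: str


# Values as shown on the slide; no hostnames or racks are involved.
pools = [
    Pool("GCP - europe-west-1", 2, "n1-standard-32"),
    Pool("Stockholm", 4, "High Mem"),
]


def total_instances(pools: List[Pool]) -> int:
    """Sum the instance counts across all pools for a service."""
    return sum(pool.instances for pool in pools)


print(total_instances(pools))
```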

slide-51
SLIDE 51

Feature Developer Concerns

How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?

slide-52
SLIDE 52

Future

slide-53
SLIDE 53

Gordon (Cloud DNS)

slide-54
SLIDE 54

Feature Developer Concerns

How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?

slide-55
SLIDE 55

Autoscaling

slide-56
SLIDE 56

Feature Developer Concerns

How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?

slide-57
SLIDE 57

Right Sizing

slide-58
SLIDE 58

Feature Developer Concerns

How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?

slide-59
SLIDE 59

Future Feature Developer’s Context for Service’s Capacity

GCP - asia-east-1 Service Pool
 GCP - europe-west-1 Service Pool
 GCP - us-central-1 Service Pool

slide-60
SLIDE 60

Feature Developer Concerns

How long? How to talk to it? Maintenance? Available? Service + Business How to track? How to get? Where? Up to date? What tools on it? What to put on it? How many? Specs?

slide-61
SLIDE 61

Learnings

slide-62
SLIDE 62

Why Pets to Cattle was Difficult:

  • Manual/tedious setup
  • Wait times for machine becoming ready (packages, DNS)
  • Non-automatic security updates
  • A fixed, reliable hostname
  • SSH Access
  • Always up/present unless team tears down
slide-63
SLIDE 63

Ephemerality Learnings

  • Monitoring
  • Logging
  • Service Design
  • Incidents

slide-64
SLIDE 64

Hybrid Learnings

  • Replicate bare metal functionality, then iterate
  • When in doubt, devs provision up and many
  • Migration = great time to influence dev paradigms
  • Don’t need to DIY

slide-65
SLIDE 65

DevEx Learnings

  • Feature devs need carrots, sledgehammers, and/or limos to change
  • Edge Cases: REST API + CLI = provide enough for feature teams to handle the edge cases

slide-66
SLIDE 66

Recap

  • Decrease necessary infrastructure context
  • Increase reliability
  • Save $$$
  • Increase dev happiness and productivity

slide-67
SLIDE 67

Let’s strategically control and limit how feature developers think about infrastructure.

slide-68
SLIDE 68

James Wen
 Email: jameswen@spotify.com
 Twitter/Github: @rochesterinnyc
 LinkedIn: jamesrwen
 
 Spotify is hiring! spotifyjobs.com
