How to invest in technical infrastructure (Will Larson, 2019)



slide-1
SLIDE 1

How to invest in technical infrastructure

Will Larson @lethain 2019

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5

Prioritizing infrastructure investment...

slide-6
SLIDE 6

...in a high autonomy environment...

slide-7
SLIDE 7

...within a rapidly scaling business.

slide-8
SLIDE 8

How can infrastructure teams...

slide-9
SLIDE 9

...be surprisingly impactful...

slide-10
SLIDE 10

...without burning out?

slide-11
SLIDE 11

What is technical infrastructure?

slide-12
SLIDE 12

Technical infrastructure: Someone’s biggest problem they dislike.

slide-13
SLIDE 13

Technical infrastructure: Tools used by 3+ teams for business critical workloads.

slide-14
SLIDE 14

Examples of technical infrastructure

  • Developer tools
  • Data infrastructure
  • Core libraries and frameworks
  • Model training and evaluation

slide-15
SLIDE 15

Introduction

  • 1. Fundamentals
  • 2. Escaping the firefight
  • 3. Learning to innovate
  • 4. Navigating breadth
  • 5. Unifying approach

Closing

slide-16
SLIDE 16
slide-17
SLIDE 17
Forced

  • Scale MongoDB
  • Lower AWS costs
  • GDPR

Discretionary

  • Sorbet
  • Monolith -> µservices
  • Deep learning
slide-18
SLIDE 18
slide-19
SLIDE 19
Short-term

  • Critical remediation
  • Scale for holidays
  • Support launch

Long-term

  • QoS strategy
  • “Bend the cost curve”
  • Rewrite monolith
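
The two axes above (forced versus discretionary, short-term versus long-term) can be combined into four quadrants. Here is a minimal sketch of that classification, assuming a hypothetical `WorkItem` record. The slides place each example on only one axis, so the second flag on each backlog entry is an illustrative assumption rather than something the talk states.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    name: str
    forced: bool     # True = forced, False = discretionary
    long_term: bool  # True = long-term, False = short-term


def quadrant(item: WorkItem) -> str:
    """Name the quadrant an item falls into, e.g. 'short-term forced'."""
    horizon = "long-term" if item.long_term else "short-term"
    kind = "forced" if item.forced else "discretionary"
    return f"{horizon} {kind}"


def group_by_quadrant(items):
    """Map each quadrant name to the names of the items in it."""
    groups = defaultdict(list)
    for item in items:
        groups[quadrant(item)].append(item.name)
    return dict(groups)


# Illustrative placements only: the slides give one axis per item,
# so the second flag on each line is an assumption.
backlog = [
    WorkItem("Critical remediation", forced=True, long_term=False),
    WorkItem("GDPR", forced=True, long_term=True),
    WorkItem("Sorbet", forced=False, long_term=True),
]
groups = group_by_quadrant(backlog)
```

Grouping a real backlog this way is one way to answer "Where is your team now?": a backlog that is almost entirely short-term forced work suggests a team that is firefighting.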
slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22
slide-23
SLIDE 23

Where is your team now?

slide-24
SLIDE 24
slide-25
SLIDE 25

Where do you want to be?

slide-26
SLIDE 26
slide-27
SLIDE 27

Introduction

  • 1. Fundamentals
  • 2. Escaping the firefight
  • 3. Learning to innovate
  • 4. Navigating breadth
  • 5. Unifying approach

Closing

slide-28
SLIDE 28
slide-29
SLIDE 29

Even Stripe...

slide-30
SLIDE 30

MongoDB

slide-31
SLIDE 31
slide-32
SLIDE 32

Shared replsets

  • Easy to provision :-)
  • Don’t cost much :-)
  • Shared everything :-\
  • Joint ownership :-/
  • Limited isolation :-(
  • Big blast radius :-(

slide-33
SLIDE 33

More time on incidents

slide-34
SLIDE 34

Incident impact increasing

slide-35
SLIDE 35

When things aren’t getting better, they are getting worse

slide-36
SLIDE 36

How to fix?

slide-37
SLIDE 37
slide-38
SLIDE 38
slide-39
SLIDE 39
slide-40
SLIDE 40

Ok, so what’s the firefighting playbook?

slide-41
SLIDE 41

Finish something

slide-42
SLIDE 42

Reduce concurrent work

slide-43
SLIDE 43

Automate

slide-44
SLIDE 44

Eliminate categories of problems

slide-45
SLIDE 45

Are you seeing signs of progress?

slide-46
SLIDE 46

No? You’ve gotta hire

slide-47
SLIDE 47

Once there’s progress, stay the course!

slide-48
SLIDE 48

btw, don’t fall in love with firefighting

slide-49
SLIDE 49

Introduction

  • 1. Fundamentals
  • 2. Escaping the firefight
  • 3. Learning to innovate
  • 4. Navigating breadth
  • 5. Unifying approach

Closing

slide-50
SLIDE 50
slide-51
SLIDE 51

Rare opportunity in infrastructure

slide-52
SLIDE 52

Rare also means inexperienced

slide-53
SLIDE 53

tl;dr Talk to your users more

slide-54
SLIDE 54

tl;dr Talk to your users more

slide-55
SLIDE 55

tl;dr Listen to your users more

slide-56
SLIDE 56

Ways innovation goes wrong...

slide-57
SLIDE 57

Problem: Making the most intuitive fix

slide-58
SLIDE 58

Problem: AKA fixating on your local maximum

slide-59
SLIDE 59

Discover

slide-60
SLIDE 60

Discover

  • Benchmark with peer companies
  • Coffee chats with users
  • SLOs
  • Surveys

slide-61
SLIDE 61

“Ruby is a terrible language.”

slide-62
SLIDE 62
slide-63
SLIDE 63

Problem: Infinite possibilities, what to pick?

slide-64
SLIDE 64

Prioritization

slide-65
SLIDE 65

Prioritization

  • Order by return on investment
  • Don’t try without users in the room
  • Long-term vision
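
"Order by return on investment" can be made concrete with a small sketch. The `Project` record, the engineer-week units and the example numbers below are all hypothetical. The talk does not say how value or cost should be estimated, only that the ordering should be done with users in the room.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    estimated_value: float  # assumed unit: engineer-weeks saved per year
    estimated_cost: float   # assumed unit: engineer-weeks to deliver

    def __post_init__(self):
        # A non-positive cost would make the ratio meaningless.
        if self.estimated_cost <= 0:
            raise ValueError("estimated_cost must be positive")

    @property
    def roi(self) -> float:
        """Estimated value returned per unit of cost."""
        return self.estimated_value / self.estimated_cost


def order_by_roi(projects):
    """Return projects sorted from highest to lowest estimated ROI."""
    return sorted(projects, key=lambda p: p.roi, reverse=True)


# Hypothetical candidates with made-up estimates.
candidates = [
    Project("A", estimated_value=10, estimated_cost=5),
    Project("B", estimated_value=30, estimated_cost=10),
    Project("C", estimated_value=4, estimated_cost=4),
]
ranked = [p.name for p in order_by_roi(candidates)]
```

The estimates matter far more than the sort. The ranking is only as good as the value numbers, which is why they should come from users rather than from the team's own guesses.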

slide-66
SLIDE 66

“The critical business outcome is me learning Elixir.”

slide-67
SLIDE 67
slide-68
SLIDE 68

Problem: Right opportunity with wrong solution

slide-69
SLIDE 69

Validation

slide-70
SLIDE 70

Validation

  • Cheaply disprove approach
  • Try hardest cases early
  • Embed with owners

slide-71
SLIDE 71

“Monster is too unreliable and slow!”

slide-72
SLIDE 72

“Let’s just rewrite monster.”

slide-73
SLIDE 73

“Let’s just rewrite monster. Again.”

slide-74
SLIDE 74

“Let’s just harden monster.” (“rewrite” struck out)

slide-75
SLIDE 75

“Can we provide a unified interface for task, cronjob and service orchestration?”

slide-76
SLIDE 76

Kubernetes

slide-77
SLIDE 77

Kubernetes

  • Chronos
  • Railyard
  • Services

slide-78
SLIDE 78

tl;dr Listen to your users more

slide-79
SLIDE 79

Be valuable or go back to firefighting

slide-80
SLIDE 80

Introduction

  • 1. Fundamentals
  • 2. Escaping the firefight
  • 3. Learning to innovate
  • 4. Navigating breadth
  • 5. Unifying approach

Closing

slide-81
SLIDE 81
slide-82
SLIDE 82

Fool me once, shame on you

slide-83
SLIDE 83

Fool me twice, shame on me

slide-84
SLIDE 84

Fool me every year on the exact same date?

slide-85
SLIDE 85
slide-86
SLIDE 86
slide-87
SLIDE 87
slide-88
SLIDE 88

“Convert unplanned scalability work into planned scalability work.”

slide-89
SLIDE 89

Schedule manual load tests

slide-90
SLIDE 90

Schedule automated load tests

slide-91
SLIDE 91

Run continuous load tests
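
The progression from manual to automated to continuous load tests depends on having a test that can run unattended. The following is a minimal sketch of such a test, assuming the system under load can be wrapped in a zero-argument callable. The function names and the `time.sleep` stand-in target are hypothetical, and it does not describe the tooling used at Stripe.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(target):
    """Call target once; return (latency_seconds, succeeded)."""
    start = time.perf_counter()
    try:
        target()
        ok = True
    except Exception:
        ok = False
    return time.perf_counter() - start, ok


def run_load_test(target, total_requests=100, concurrency=10):
    """Issue total_requests calls using `concurrency` worker threads."""
    if total_requests < 1:
        raise ValueError("total_requests must be at least 1")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(target), range(total_requests)))
    latencies = sorted(latency for latency, _ in results)
    errors = sum(1 for _, ok in results if not ok)
    p99_index = min(len(latencies) - 1, int(len(latencies) * 0.99))
    return {
        "requests": total_requests,
        "errors": errors,
        "p99_seconds": latencies[p99_index],
    }


# Stand-in target; a real test would call the service under load.
report = run_load_test(lambda: time.sleep(0.001), total_requests=50, concurrency=5)
```

Running a function like this by hand is the manual stage, and triggering it from a scheduler is the automated stage. Running it constantly against production-like traffic and alerting on the report is the continuous stage.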

slide-92
SLIDE 92

Solved out of a job

slide-93
SLIDE 93

Great technology fix, but what’s the organizational fix?

slide-94
SLIDE 94

Infrastructure properties

slide-95
SLIDE 95

Stripe’s infrastructure properties

  • Security
  • Reliability
  • Usability
  • Efficiency
  • Latency

slide-96
SLIDE 96

Lightly ordered but not stack ranked

slide-97
SLIDE 97

More a portfolio: invest in each

slide-98
SLIDE 98

Baselines!

slide-99
SLIDE 99

Invest to maintain your baselines
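
One way to read "invest to maintain your baselines" is to give each infrastructure property a measurable threshold and check it regularly. The sketch below assumes a hypothetical `Baseline` record, and the metric names and thresholds are made up for illustration. The talk does not specify how Stripe defines or measures its baselines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    prop: str               # infrastructure property, e.g. "Reliability"
    metric: str             # hypothetical metric name
    threshold: float
    higher_is_better: bool

    def is_met(self, value: float) -> bool:
        """True when the measured value is on the right side of the threshold."""
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


def breached(baselines, current):
    """Return baselines whose current metric is missing or out of bounds."""
    failing = []
    for baseline in baselines:
        value = current.get(baseline.metric)
        if value is None or not baseline.is_met(value):
            failing.append(baseline)
    return failing


# Made-up thresholds and measurements, for illustration only.
baselines = [
    Baseline("Reliability", "availability_pct", 99.9, higher_is_better=True),
    Baseline("Latency", "p99_ms", 300.0, higher_is_better=False),
]
measured = {"availability_pct": 99.95, "p99_ms": 420.0}
needs_investment = [b.prop for b in breached(baselines, measured)]
```

Under this reading, a breached baseline is forced work that should be scheduled now rather than left to become an incident later.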

slide-100
SLIDE 100

Maintain across timeframes

slide-101
SLIDE 101

Long-term forced work!

slide-102
SLIDE 102
slide-103
SLIDE 103

Do it now or firefight it later

slide-104
SLIDE 104

Introduction

  • 1. Fundamentals
  • 2. Escaping the firefight
  • 3. Learning to innovate
  • 4. Navigating breadth
  • 5. Unifying approach

Closing

slide-105
SLIDE 105

Wait… there’s more than one team?

slide-106
SLIDE 106
slide-107
SLIDE 107

What we actually do today

slide-108
SLIDE 108

Investment strategy

  • 40% user asks
  • 30% platform quality
  • 30% “Key Initiatives”
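
A short worked example shows what the 40/30/30 split means as capacity. The `allocate` helper and the figure of 20 engineers are assumptions for illustration. Only the three percentages come from the slide.

```python
def allocate(capacity, weights):
    """Split capacity across buckets in proportion to their weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {bucket: capacity * weight / total for bucket, weight in weights.items()}


# Percentages from the slide; the bucket keys are just labels.
STRATEGY = {"user asks": 40, "platform quality": 30, "key initiatives": 30}

# Hypothetical organization of 20 engineers.
plan = allocate(20, STRATEGY)
# plan == {"user asks": 8.0, "platform quality": 6.0, "key initiatives": 6.0}
```

Because the helper normalizes by the sum of the weights, a team with different constraints can swap in its own ratio without changing the code.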

slide-109
SLIDE 109

40/30/30?

slide-110
SLIDE 110

Solve from your constraints

slide-111
SLIDE 111

Introduction

  • 1. Fundamentals
  • 2. Escaping the firefight
  • 3. Learning to innovate
  • 4. Navigating breadth
  • 5. Unifying approach

Closing

slide-112
SLIDE 112

Technical infrastructure: Tools used by 3+ teams for business critical workloads.

slide-113
SLIDE 113

Firefighting: Limit work in progress. Finish things. If that’s not enough, hire.

slide-114
SLIDE 114

Innovation: Listen to your users. Listen to your users. Listen to your users.

slide-115
SLIDE 115

Navigating breadth: Identify principles. Set baselines. Plan across timeframes.

slide-116
SLIDE 116

Bring it together: Investment strategy. Users, baselines and timeframes.

slide-117
SLIDE 117

Q&A

@lethain / lethain.com