How to invest in technical infrastructure, Will Larson, 2019

Oct 29, 2023

SLIDE 1

How to invest in technical infrastructure

Will Larson @lethain 2019

SLIDE 2

SLIDE 3

SLIDE 4

SLIDE 5

Prioritizing infrastructure investment...

SLIDE 6

...in a high autonomy environment...

SLIDE 7

...within a rapidly scaling business.

SLIDE 8

How can infrastructure teams...

SLIDE 9

...be surprisingly impactful...

SLIDE 10

...without burning out?

SLIDE 11

What is technical infrastructure?

SLIDE 12

Technical infrastructure: Someone’s biggest problem they dislike.

SLIDE 13

Technical infrastructure: Tools used by 3+ teams for business critical workloads.

SLIDE 14

Examples of technical infrastructure

Developer tools
Data infrastructure
Core libraries and frameworks
Model training and evaluation

SLIDE 15

Introduction

1. Fundamentals
2. Escaping the firefight
3. Learning to innovate
4. Navigating breadth
5. Unifying approach

Closing

SLIDE 16

SLIDE 17

Forced:
Scale MongoDB
Lower AWS costs
GDPR

Discretionary:
Sorbet
Monolith -> µservices
Deep learning

SLIDE 18

SLIDE 19

Short-term:
Critical remediation
Scale for holidays
Support launch

Long-term:
QoS strategy
“Bend the cost curve”
Rewrite monolith

SLIDE 20

SLIDE 21

SLIDE 22

SLIDE 23

Where is your team now?

SLIDE 24

SLIDE 25

Where do you want to be?

SLIDE 26

SLIDE 27

Introduction

1. Fundamentals
2. Escaping the firefight
3. Learning to innovate
4. Navigating breadth
5. Unifying approach

Closing

SLIDE 28

SLIDE 29

Even Stripe...

SLIDE 30

MongoDB

SLIDE 31

SLIDE 32

Shared replsets

Easy to provision :-)
Don’t cost much :-)
Shared everything :-\
Joint ownership :-/
Limited isolation :-(
Big blast radius :-(

SLIDE 33

More time on incidents

SLIDE 34

Incident impact increasing

SLIDE 35

When things aren’t getting better, they are getting worse

SLIDE 36

How to fix?

SLIDE 37

SLIDE 38

SLIDE 39

SLIDE 40

Ok, so what’s the firefighting playbook?

SLIDE 41

Finish something

SLIDE 42

Reduce concurrent work

SLIDE 43

Automate

SLIDE 44

Eliminate categories of problems

SLIDE 45

Are you seeing signs of progress?

SLIDE 46

No? You’ve gotta hire

SLIDE 47

Once there’s progress, stay the course!

SLIDE 48

btw, don’t fall in love with firefighting

SLIDE 49

Introduction

1. Fundamentals
2. Escaping the firefight
3. Learning to innovate
4. Navigating breadth
5. Unifying approach

Closing

SLIDE 50

SLIDE 51

Rare opportunity in infrastructure

SLIDE 52

Rare also means inexperienced

SLIDE 53

tl;dr Talk to your users more

SLIDE 54

tl;dr Talk to your users more

SLIDE 55

tl;dr Listen to your users more

SLIDE 56

Ways innovation goes wrong...

SLIDE 57

Problem Making the most intuitive fix

SLIDE 58

Problem AKA fixating on your local maxima

SLIDE 59

Discover

SLIDE 60

Discover

Benchmark with peer companies
Coffee chats with users
SLOs
Surveys

SLIDE 61

“Ruby is a terrible language.”

SLIDE 62

SLIDE 63

Problem Infinite possibilities, what to pick?

SLIDE 64

Prioritization

SLIDE 65

Prioritization

Order by return on investment
Don’t try without users in the room
Long-term vision
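
Ordering by return on investment can be made concrete with a rough ranking. The sketch below is only an illustration and is not taken from the talk: the project names are borrowed from earlier slides, while the value and cost figures, the `Project` type, and the `rank_by_roi` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    estimated_value: float  # hypothetical units, e.g. engineer-weeks saved per year
    estimated_cost: float   # hypothetical units, e.g. engineer-weeks to deliver

    @property
    def roi(self) -> float:
        # Return on investment as a simple value-to-cost ratio.
        return self.estimated_value / self.estimated_cost


def rank_by_roi(projects: list[Project]) -> list[Project]:
    # Highest ratio first; sorted() is stable, so ties keep their input order.
    return sorted(projects, key=lambda p: p.roi, reverse=True)


# Illustrative numbers only; in practice they would come from talking to users.
candidates = [
    Project("Rewrite monolith", estimated_value=40, estimated_cost=80),
    Project("Lower AWS costs", estimated_value=30, estimated_cost=10),
    Project("Sorbet", estimated_value=60, estimated_cost=30),
]

ranked = rank_by_roi(candidates)
for project in ranked:
    print(f"{project.name}: {project.roi:.1f}")
```

The estimates are the weak point of any such ranking, which is why the slide pairs it with having users in the room.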

SLIDE 66

“The critical business outcome is me learning Elixir.”

SLIDE 67

SLIDE 68

Problem Right opportunity with wrong solution

SLIDE 69

Validation

SLIDE 70

Validation

Cheaply disprove approach
Try hardest cases early
Embed with owners

SLIDE 71

“Monster is too unreliable and slow!”

SLIDE 72

“Let’s just rewrite monster.”

SLIDE 73

“Let’s just rewrite monster. Again.”

SLIDE 74

“Let’s just harden monster.”

SLIDE 75

“Can we provide a unified interface for task, cronjob and service orchestration?”

SLIDE 76

Kubernetes

SLIDE 77

Kubernetes Chronos Railyard Services

SLIDE 78

tl;dr Listen to your users more

SLIDE 79

Be valuable or go back to firefighting

SLIDE 80

Introduction

1. Fundamentals
2. Escaping the firefight
3. Learning to innovate
4. Navigating breadth
5. Unifying approach

Closing

SLIDE 81

SLIDE 82

Fool me once, shame on you

SLIDE 83

Fool me twice, shame on me

SLIDE 84

Fool me every year on exact same date?

SLIDE 85

SLIDE 86

SLIDE 87

SLIDE 88

“Convert unplanned scalability work into planned scalability work.”

SLIDE 89

Schedule manual load tests

SLIDE 90

Schedule automated load tests

SLIDE 91

Run continuous load tests

SLIDE 92

Solved out of a job

SLIDE 93

Great technology fix, but what’s the organizational fix?

SLIDE 94

Infrastructure properties

SLIDE 95

Stripe’s infrastructure properties

Security
Reliability
Usability
Efficiency
Latency

SLIDE 96

Lightly ordered but not stack ranked

SLIDE 97

More a portfolio: invest in each

SLIDE 98

Baselines!

SLIDE 99

Invest to maintain your baselines
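
One way to read "invest to maintain your baselines" is as a periodic check of each infrastructure property against an agreed floor. The following is a sketch under assumptions: the property names come from the Stripe properties slide, while the 0-100 scores, the thresholds, and the `below_baseline` helper are hypothetical.

```python
# Hypothetical baseline scores (0-100) per infrastructure property.
BASELINES = {
    "security": 90,
    "reliability": 95,
    "usability": 70,
    "efficiency": 60,
    "latency": 80,
}


def below_baseline(current: dict[str, float], baselines: dict[str, float]) -> list[str]:
    """Return the properties whose current score is under its baseline.

    A property missing from `current` counts as unmeasured and is flagged too.
    """
    return [
        name
        for name, floor in baselines.items()
        if current.get(name, float("-inf")) < floor
    ]


# Illustrative measurements only.
current_scores = {
    "security": 92,
    "reliability": 91,
    "usability": 75,
    "efficiency": 55,
    "latency": 80,
}

needs_investment = below_baseline(current_scores, BASELINES)
print(needs_investment)
```

Properties that fall below their floor become candidates for planned work before they turn into firefighting.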

SLIDE 100

Maintain across timeframes

SLIDE 101

Long-term forced work!

SLIDE 102

SLIDE 103

Do it now or firefight it later

SLIDE 104

Introduction

1. Fundamentals
2. Escaping the firefight
3. Learning to innovate
4. Navigating breadth
5. Unifying approach

Closing

SLIDE 105

Wait… there’s more than one team?

SLIDE 106

SLIDE 107

What we actually do today

SLIDE 108

Investment strategy

40% user asks
30% platform quality
30% “Key Initiatives”
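
To see what a 40/30/30 split means in practice, it can be applied to a capacity budget. This is a minimal sketch: the percentages are from the slide, but the team size, quarter length, and the `allocate` helper are assumptions made for illustration.

```python
# Split from the slide: 40% user asks, 30% platform quality, 30% key initiatives.
INVESTMENT_SPLIT = {
    "user asks": 0.40,
    "platform quality": 0.30,
    "key initiatives": 0.30,
}


def allocate(capacity_weeks: float, split: dict[str, float]) -> dict[str, float]:
    """Divide a capacity budget across categories according to `split`."""
    if abs(sum(split.values()) - 1.0) > 1e-9:
        raise ValueError("split must sum to 1.0")
    return {category: capacity_weeks * share for category, share in split.items()}


# Hypothetical: 10 engineers for a 12-week quarter = 120 engineer-weeks.
plan = allocate(10 * 12, INVESTMENT_SPLIT)
for category, weeks in plan.items():
    print(f"{category}: {weeks:.0f} engineer-weeks")
```

A different organization would plug in its own split; the later slides suggest deriving it from your own constraints.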

SLIDE 109

40/30/30?

SLIDE 110

Solve from your constraints

SLIDE 111

Introduction

1. Fundamentals
2. Escaping the firefight
3. Learning to innovate
4. Navigating breadth
5. Unifying approach

Closing

SLIDE 112

Technical infrastructure: Tools used by 3+ teams for business critical workloads.

SLIDE 113

Firefighting: Limit work in progress. Finish things. If that’s not enough, hire.

SLIDE 114

Innovation: Listen to your users. Listen to your users. Listen to your users.

SLIDE 115

Navigating breadth: Identify principles. Set baselines. Plan across timeframes.

SLIDE 116

Bring it together: Investment strategy. Users, baselines and timeframes.

SLIDE 117

Q&A

@lethain / lethain.com