SLIDE 1 How to invest in technical infrastructure
Will Larson @lethain 2019
SLIDE 2
SLIDE 3
SLIDE 4
SLIDE 5
Prioritizing infrastructure investment...
SLIDE 6
...in a high autonomy environment...
SLIDE 7
...within a rapidly scaling business.
SLIDE 8
How can infrastructure teams...
SLIDE 9
...be surprisingly impactful...
SLIDE 10
...without burning out?
SLIDE 11
What is technical infrastructure?
SLIDE 12
Technical infrastructure: Someone’s biggest problem they dislike.
SLIDE 13
Technical infrastructure: Tools used by 3+ teams for business critical workloads.
SLIDE 14
Examples of technical infrastructure
- Developer tools
- Data infrastructure
- Core libraries and frameworks
- Model training and evaluation
SLIDE 15 Introduction
- 1. Fundamentals
- 2. Escaping the firefight
- 3. Learning to innovate
- 4. Navigating breadth
- 5. Unifying approach
Closing
SLIDE 16
SLIDE 17
Forced
- Scale MongoDB
- Lower AWS costs
- GDPR
Discretionary
- Sorbet
- Monolith -> µservices
- Deep learning
SLIDE 18
SLIDE 19
Short-term
- Critical remediation
- Scale for holidays
- Support launch
Long-term
- QoS strategy
- “Bend the cost curve”
- Rewrite monolith
SLIDE 20
SLIDE 21
SLIDE 22
SLIDE 23
Where is your team now?
SLIDE 24
SLIDE 25
Where do you want to be?
SLIDE 26
SLIDE 27 Introduction
- 1. Fundamentals
- 2. Escaping the firefight
- 3. Learning to innovate
- 4. Navigating breadth
- 5. Unifying approach
Closing
SLIDE 28
SLIDE 29
Even Stripe...
SLIDE 30
MongoDB
SLIDE 31
SLIDE 32
Shared replsets
- Easy to provision :-)
- Don’t cost much :-)
- Shared everything :-\
- Joint ownership :-/
- Limited isolation :-(
- Big blast radius :-(
SLIDE 33
More time on incidents
SLIDE 34
Incident impact increasing
SLIDE 35
When things aren’t getting better, they are getting worse
SLIDE 36
How to fix?
SLIDE 37
SLIDE 38
SLIDE 39
SLIDE 40
Ok, so what’s the firefighting playbook?
SLIDE 41
Finish something
SLIDE 42
Reduce concurrent work
SLIDE 43
Automate
SLIDE 44
Eliminate categories of problems
SLIDE 45
Are you seeing signs of progress?
SLIDE 46
No? You’ve gotta hire
SLIDE 47
Once there’s progress, stay the course!
SLIDE 48
btw, don’t fall in love with firefighting
SLIDE 49 Introduction
- 1. Fundamentals
- 2. Escaping the firefight
- 3. Learning to innovate
- 4. Navigating breadth
- 5. Unifying approach
Closing
SLIDE 50
SLIDE 51
Rare opportunity in infrastructure
SLIDE 52
Rare also means inexperienced
SLIDE 53
tl;dr Talk to your users more
SLIDE 54
tl;dr Talk to your users more
SLIDE 55
tl;dr Listen to your users more
SLIDE 56
Ways innovation goes wrong...
SLIDE 57
Problem: Making the most intuitive fix
SLIDE 58
Problem: AKA fixating on your local maxima
SLIDE 59
Discover
SLIDE 60
Discover
- Benchmark with peer companies
- Coffee chats with users
- SLOs
- Surveys
SLIDE 61
“Ruby is a terrible language.”
SLIDE 62
SLIDE 63
Problem: Infinite possibilities, what to pick?
SLIDE 64
Prioritization
SLIDE 65
Prioritization
- Order by return on investment
- Don’t try without users in the room
- Long-term vision
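One way to picture "order by return on investment" is the minimal sketch below. It is an illustration, not the method used in the talk: the `Project` type, the `impact` and `cost_weeks` fields, the project names other than "Rewrite monolith", and every number are assumptions made up for the example. In practice the impact estimates would come from the users in the room.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    impact: float      # hypothetical value estimate, in units agreed with users
    cost_weeks: float  # hypothetical engineering effort, in weeks

    @property
    def roi(self) -> float:
        # Return on investment as value delivered per week of effort.
        return self.impact / self.cost_weeks


def order_by_roi(projects):
    """Return projects sorted from highest to lowest ROI."""
    return sorted(projects, key=lambda p: p.roi, reverse=True)


# Made-up candidates and estimates, for illustration only.
projects = [
    Project("Rewrite monolith", impact=40, cost_weeks=80),
    Project("Continuous load tests", impact=30, cost_weeks=10),
    Project("Faster CI", impact=20, cost_weeks=5),
]

ranked = [p.name for p in order_by_roi(projects)]
print(ranked)
```

A ranking like this is only a starting point for discussion. The long-term vision can still justify picking a lower-ROI item.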
SLIDE 66
“The critical business outcome is me learning Elixir.”
SLIDE 67
SLIDE 68
Problem: Right opportunity with wrong solution
SLIDE 69
Validation
SLIDE 70
Validation
- Cheaply disprove approach
- Try hardest cases early
- Embed with owners
SLIDE 71
“Monster is too unreliable and slow!”
SLIDE 72
“Let’s just rewrite monster.”
SLIDE 73
“Let’s just rewrite monster. Again.”
SLIDE 74
“Let’s just harden monster.”
SLIDE 75
“Can we provide a unified interface for task, cronjob and service orchestration?”
SLIDE 76
Kubernetes
SLIDE 77
Kubernetes Chronos Railyard Services
SLIDE 78
tl;dr Listen to your users more
SLIDE 79
Be valuable or go back to firefighting
SLIDE 80 Introduction
- 1. Fundamentals
- 2. Escaping the firefight
- 3. Learning to innovate
- 4. Navigating breadth
- 5. Unifying approach
Closing
SLIDE 81
SLIDE 82
Fool me once, shame on you
SLIDE 83
Fool me twice, shame on me
SLIDE 84
Fool me every year on exact same date?
SLIDE 85
SLIDE 86
SLIDE 87
SLIDE 88
“Convert unplanned scalability work into planned scalability work.”
SLIDE 89
Schedule manual load tests
SLIDE 90
Schedule automated load tests
SLIDE 91
Run continuous load tests
SLIDE 92
Solved out of a job
SLIDE 93
Great technology fix, but what’s the organizational fix?
SLIDE 94
Infrastructure properties
SLIDE 95
Stripe’s infrastructure properties
- Security
- Reliability
- Usability
- Efficiency
- Latency
SLIDE 96
Lightly ordered but not stack ranked
SLIDE 97
More a portfolio: invest in each
SLIDE 98
Baselines!
SLIDE 99
Invest to maintain your baselines
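As a rough sketch of what "invest to maintain your baselines" could look like in practice, the snippet below compares current measurements against per-property baselines and lists the properties that have slipped. The metrics, thresholds, and measurements are all hypothetical and chosen only for illustration. Only the property names (reliability, latency, efficiency) come from the slides.

```python
# Hypothetical baselines: property -> (metric description, threshold, direction).
# "min" means the value must stay at or above the threshold,
# "max" means it must stay at or below it.
BASELINES = {
    "reliability": ("availability %", 99.9, "min"),
    "latency": ("p99 latency, ms", 300.0, "max"),
    "efficiency": ("cost per 1k requests, USD", 0.05, "max"),
}


def breached(measurements):
    """Return the properties whose current value violates its baseline."""
    out = []
    for prop, (_metric, threshold, direction) in BASELINES.items():
        value = measurements[prop]
        ok = value >= threshold if direction == "min" else value <= threshold
        if not ok:
            out.append(prop)
    return out


# Made-up current measurements.
current = {"reliability": 99.95, "latency": 420.0, "efficiency": 0.04}

needs_investment = breached(current)
print(needs_investment)
```

Any property that shows up in the result is a candidate for planned work now, rather than firefighting later.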
SLIDE 100
Maintain across timeframes
SLIDE 101
Long-term forced work!
SLIDE 102
SLIDE 103
Do it now or firefight it later
SLIDE 104 Introduction
- 1. Fundamentals
- 2. Escaping the firefight
- 3. Learning to innovate
- 4. Navigating breadth
- 5. Unifying approach
Closing
SLIDE 105
Wait… there’s more than one team?
SLIDE 106
SLIDE 107
What we actually do today
SLIDE 108
Investment strategy
- 40% user asks
- 30% platform quality
- 30% “Key Initiatives”
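To make the 40/30/30 split concrete, here is a small sketch that turns it into engineer-weeks. The `allocate` helper and the 120 engineer-week budget are assumptions for illustration. The percentages are the ones on the slide, and they are passed in as a parameter because the right split depends on your own constraints.

```python
# Split from the slide, in whole percentages.
DEFAULT_SPLIT = {
    "user asks": 40,
    "platform quality": 30,
    "key initiatives": 30,
}


def allocate(total_engineer_weeks, split=None):
    """Divide a capacity budget across investment buckets by percentage."""
    split = DEFAULT_SPLIT if split is None else split
    if sum(split.values()) != 100:
        raise ValueError("split percentages must sum to 100")
    return {bucket: total_engineer_weeks * pct / 100 for bucket, pct in split.items()}


# Hypothetical budget: 120 engineer-weeks in a planning period.
plan = allocate(120)
print(plan)
```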
SLIDE 109
40/30/30?
SLIDE 110
Solve from your constraints
SLIDE 111 Introduction
- 1. Fundamentals
- 2. Escaping the firefight
- 3. Learning to innovate
- 4. Navigating breadth
- 5. Unifying approach
Closing
SLIDE 112
Technical infrastructure: Tools used by 3+ teams for business critical workloads.
SLIDE 113
Firefighting: Limit work in progress. Finish things. If that’s not enough, hire.
SLIDE 114
Innovation: Listen to your users. Listen to your users. Listen to your users.
SLIDE 115
Navigating breadth: Identify principles. Set baselines. Plan across timeframes.
SLIDE 116
Bring it together: Investment strategy. Users, baselines and timeframes.
SLIDE 117 Q&A
@lethain / lethain.com