Detangling complex systems with compassion & production excellence


SLIDE 1

Detangling complex systems with compassion & production excellence

Liz Fong-Jones
@lizthegrey #VelocityConf San Jose June 13, 2019

w/ illustrations by @emilywithcurls!

SLIDE 2

@lizthegrey at #VelocityConf

Production is increasingly complex.

SLIDE 3

There is no 100% uptime.

SLIDE 4

Our strategies need to evolve.

SLIDE 5

Co "bought" DevOps.

SLIDE 6

Ordering the alphabet soup...

SLIDE 7

Noisy alerts. Grumpy engineers.

SLIDE 8

Walls of meaningless dashboards.

SLIDE 9

Incidents take forever to fix.

SLIDE 10

Everyone bugs the "expert".

SLIDE 11

Deploys are unpredictable.

SLIDE 12

There's no time to do projects...

SLIDE 13

and when there's time, there's no plan.

SLIDE 14

The team is struggling to hold on.

SLIDE 15

What's Co missing?

SLIDE 16

Co forgot who operates systems.

SLIDE 17

Tools aren't magical.

SLIDE 18

Invest in people, culture, & process.

SLIDE 19

Enter the art of Production Excellence.

SLIDE 20

Make systems more reliable & friendly.

SLIDE 21

ProdEx takes planning.

SLIDE 22

Measure and act on what matters.

SLIDE 23

Involve everyone.

SLIDE 24

Build everyone's confidence. Encourage asking questions.

SLIDE 25

How do we get started?

SLIDE 26

Know when it's too broken.

SLIDE 27

& be able to debug together when it is.

SLIDE 28

Eliminate (unnecessary) complexity.

SLIDE 29

Our systems are always failing.

SLIDE 30

What if we measure "too broken"?

SLIDE 31

We need Service Level Indicators

SLIDE 32

Think in terms of events in context.

SLIDE 33

Is this event good or bad?

SLIDE 34

Are users grumpy? Ask your PM.

SLIDE 35

What threshold buckets events?

SLIDE 36

HTTP Code 200? Latency < 300ms?

SLIDE 37

How many eligible events did we see?

SLIDE 38

Availability: Good / Eligible Events
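
The good / eligible ratio can be sketched in a few lines of Python. This is only an illustration, not the speaker's implementation: the `Event` fields and the rule that health checks are not eligible are assumptions made for the example, while the HTTP 200 and 300 ms thresholds come from the slide above.

```python
from dataclasses import dataclass


@dataclass
class Event:
    status: int                    # HTTP status code
    latency_ms: float              # time taken to serve the request
    is_health_check: bool = False  # hypothetical field used to decide eligibility


def is_eligible(event: Event) -> bool:
    # Assumption: synthetic health checks do not represent real users.
    return not event.is_health_check


def is_good(event: Event) -> bool:
    # Thresholds from the slide's example: HTTP 200 and latency under 300 ms.
    return event.status == 200 and event.latency_ms < 300


def availability(events: list[Event]) -> float:
    eligible = [e for e in events if is_eligible(e)]
    if not eligible:
        return 1.0  # no eligible traffic, so nothing failed
    good = sum(1 for e in eligible if is_good(e))
    return good / len(eligible)


events = [
    Event(200, 120.0),
    Event(200, 450.0),                      # too slow, counts as bad
    Event(500, 80.0),                       # server error, counts as bad
    Event(200, 95.0),
    Event(200, 5.0, is_health_check=True),  # not eligible, ignored
]
print(availability(events))  # 0.5 (2 good out of 4 eligible)
```

Which events are eligible and where the thresholds sit are product decisions, which is why the earlier slide suggests asking your PM.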

SLIDE 39

Set a target Service Level Objective.

SLIDE 40

Use a window and target percentage.

SLIDE 41

99.9% of events good in past 30 days.
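
A window plus a target percentage translates into a simple check. The sketch below assumes events arrive as hypothetical `(timestamp, is_good)` pairs that have already been filtered for eligibility; the 30-day window and 99.9% target are the slide's example values, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta


def slo_met(events, now, window=timedelta(days=30), target=0.999):
    """Return True if the share of good events inside the window meets the target.

    events: iterable of (timestamp, is_good) pairs for eligible events.
    """
    cutoff = now - window
    recent = [good for ts, good in events if ts > cutoff]
    if not recent:
        return True  # no traffic in the window, so the objective is not violated
    return sum(recent) / len(recent) >= target


now = datetime(2019, 6, 13)
events = (
    [(now - timedelta(days=1), True)] * 9995
    + [(now - timedelta(days=2), False)] * 5
    + [(now - timedelta(days=45), False)] * 100  # older than the window, ignored
)
print(slo_met(events, now))                 # 99.95% good in the window
print(slo_met(events, now, target=0.9999))  # a stricter target is missed
```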

SLIDE 42

A good SLO barely keeps users happy.

SLIDE 43

Drive alerting with SLOs.

SLIDE 44

Is my service on fire?

SLIDE 45

Error budget: allowed unavailability
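
As a worked example of "allowed unavailability", the budget is whatever the SLO target leaves over. The sketch below reuses the 99.9% over 30 days example; the function names and the traffic figures are assumptions chosen for illustration, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def error_budget(target: Fraction, eligible_events: int) -> Fraction:
    """Number of bad events the SLO tolerates over the window."""
    return (1 - target) * eligible_events


def budget_remaining(target: Fraction, eligible_events: int, bad_events: int) -> Fraction:
    """Share of the budget still unspent (negative once the SLO is missed).

    Assumes eligible_events > 0 and target < 1, so the budget is non-zero.
    """
    budget = error_budget(target, eligible_events)
    return (budget - bad_events) / budget


target = Fraction(999, 1000)  # the 99.9% example

# With one million eligible events in the window, 1000 may be bad.
print(error_budget(target, 1_000_000))

# After 250 bad events, three quarters of the budget is left.
print(float(budget_remaining(target, 1_000_000, 250)))

# Expressed as time: 0.1% of a 30-day window, in minutes.
print(float((1 - target) * 30 * 24 * 60))
```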

SLIDE 46

How long until I run out?

SLIDE 47

Page if it's hours.

Ticket if it's days.
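
One way to turn "how long until I run out?" into an alerting decision is to divide the remaining budget by the current burn rate. The slides say only "hours" and "days", so the 24-hour and 7-day cutoffs below are assumptions, as are the function names; real alerting systems usually evaluate burn rate over several windows.

```python
def hours_until_exhausted(budget_remaining: float, burn_per_hour: float) -> float:
    """Hours left before the error budget is spent at the current burn rate.

    Both arguments use the same unit, for example bad events.
    """
    if burn_per_hour <= 0:
        return float("inf")  # nothing is burning, so the budget never runs out
    return budget_remaining / burn_per_hour


def alert_action(hours_left: float,
                 page_within_hours: float = 24.0,          # assumed cutoff
                 ticket_within_hours: float = 24.0 * 7) -> str:  # assumed cutoff
    if hours_left <= page_within_hours:
        return "page"
    if hours_left <= ticket_within_hours:
        return "ticket"
    return "none"


# 600 bad events of budget left, at three different burn rates.
for burn in (150, 10, 1):
    hours = hours_until_exhausted(600, burn)
    print(burn, hours, alert_action(hours))
```

A budget that is already overspent yields zero or negative hours, which this sketch also treats as a page.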

SLIDE 48

Data-driven business decisions.

SLIDE 49

Is it safe to do this risky experiment?

SLIDE 50

Should we invest in more reliability?

SLIDE 51

Perfect SLO > Good SLO >>> No SLO

SLIDE 52

Measure what you can today.

SLIDE 53

Iterate to meet user needs.

SLIDE 54

Only alert on what matters.

SLIDE 55

SLIs & SLOs are only half the picture...

SLIDE 56

Our outages are never identical.

SLIDE 57

Failure modes can't be predicted.

SLIDE 58

Support debugging novel cases. In production.

SLIDE 59

Allow forming & testing hypotheses.

SLIDE 60

Dive into data to ask new questions.

SLIDE 61

Our services must be observable.

SLIDE 62

Can you examine events in context?

SLIDE 63

Can you explain the variance?
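
A rough sketch of what "explaining the variance" can look like when events carry context: break the failure rate down by each attribute and see which one separates good events from bad. The field names (`build`, `region`, `good`) and the max-minus-min heuristic are assumptions for illustration, not a description of any particular observability tool.

```python
from collections import defaultdict


def failure_rate_by(events, field):
    """Failure rate for each distinct value of one event attribute."""
    totals = defaultdict(int)
    bad = defaultdict(int)
    for event in events:
        key = event[field]
        totals[key] += 1
        if not event["good"]:
            bad[key] += 1
    return {key: bad[key] / totals[key] for key in totals}


def widest_spread(events, fields):
    """Pick the attribute whose values differ most in failure rate (crude heuristic)."""
    def spread(field):
        rates = failure_rate_by(events, field).values()
        return max(rates) - min(rates)
    return max(fields, key=spread)


events = [
    {"build": "v41", "region": "us", "good": True},
    {"build": "v41", "region": "eu", "good": True},
    {"build": "v42", "region": "us", "good": False},
    {"build": "v42", "region": "eu", "good": False},
    {"build": "v42", "region": "us", "good": True},
    {"build": "v42", "region": "eu", "good": False},
]
print(failure_rate_by(events, "build"))             # failures cluster in v42
print(widest_spread(events, ["region", "build"]))   # "build" separates them best
```

This only works if each event records the attributes you might later need to slice by, which is the point of examining events in context.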

SLIDE 64

Can you mitigate impact & debug later?

SLIDE 65

SLOs and Observability go together.

SLIDE 66

But they alone don't create collaboration.

SLIDE 67

Debugging is not a solo activity.

SLIDE 68

Debugging is for everyone.

SLIDE 69

Collaboration is interpersonal.

SLIDE 70

Operations must be sustainable.

SLIDE 71

We learn better when we document.

SLIDE 72

Fix hero culture. Share knowledge.

SLIDE 73

Reward curiosity and teamwork.

SLIDE 74

Learn from the past. Reward your future self.

SLIDE 75

Outages don't repeat, but they rhyme.

SLIDE 76

Risk analysis helps us plan.

SLIDE 77

Quantify risks by frequency & impact.

SLIDE 78

Which risks are most significant?

SLIDE 79

Address risks that threaten the SLO.
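
Quantifying risks by frequency and impact, then comparing them with the error budget, can be sketched as follows. The risk names and every number here are invented for illustration, and the model (incidents per year × minutes per incident × fraction of users affected) is one simple assumption about how to express impact.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60


@dataclass
class Risk:
    name: str
    incidents_per_year: float
    minutes_per_incident: float  # time to detect plus time to repair
    fraction_of_users: float     # 0.0 to 1.0

    @property
    def bad_minutes_per_year(self) -> float:
        return self.incidents_per_year * self.minutes_per_incident * self.fraction_of_users


def rank_risks(risks, slo_target=0.999):
    """Sort risks by expected cost and express each as a share of the yearly budget."""
    budget = (1 - slo_target) * MINUTES_PER_YEAR  # about 525.6 minutes at 99.9%
    ranked = sorted(risks, key=lambda r: r.bad_minutes_per_year, reverse=True)
    return [(r.name, r.bad_minutes_per_year, r.bad_minutes_per_year / budget)
            for r in ranked]


# Hypothetical risks with made-up numbers.
risks = [
    Risk("bad deploy", incidents_per_year=12, minutes_per_incident=30, fraction_of_users=0.5),
    Risk("database failover", incidents_per_year=2, minutes_per_incident=60, fraction_of_users=1.0),
    Risk("expired certificate", incidents_per_year=1, minutes_per_incident=240, fraction_of_users=1.0),
]
for name, minutes, share in rank_risks(risks):
    print(f"{name}: {minutes:.0f} bad minutes/year, {share:.0%} of budget")
```

In this made-up example the three risks together exceed the yearly budget, so at least one of them has to be reduced for the SLO to hold; the ranking shows where to start.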

SLIDE 80

Make the business case to fix them.

SLIDE 81

And prioritize completing the work.

SLIDE 82

Lack of observability is systemic risk.

SLIDE 83

So is lack of collaboration.

SLIDE 84

Season the alphabet soup with ProdEx

SLIDE 85

Production Excellence brings teams closer together.

  • Measure. Debug. Collaborate. Fix.

lizthegrey.com; @lizthegrey
