Automating Operations with Machine Intelligence - Rob Harrop



slide-1
SLIDE 1

Automating Operations with Machine Intelligence

Rob Harrop

slide-2
SLIDE 2

CEO @ Skipjaq
Co-founder @ SpringSource
Automated performance management

slide-3
SLIDE 3

Why automate operations?
Why now?
What does automated operations look like?
How do we build for automation?
Solving a real problem…

slide-4
SLIDE 4

Why automate operations?

slide-5
SLIDE 5

More Complexity

slide-6
SLIDE 6

Monolith -> Microservices
Strong -> Eventual Consistency
Assume reliability -> Assume failure

slide-7
SLIDE 7

More Deployments

slide-8
SLIDE 8

[Chart: "Very end of 2009" vs. "Today"; axis ticks 10, 20, 30, 40]

Credit: Mike Brittain, Engineering Director @ Etsy
slide-9
SLIDE 9

Less time to identify fixes
Rollbacks more likely
Tiny window for human intervention

slide-10
SLIDE 10

Harder Faster

slide-11
SLIDE 11

Why now?

slide-12
SLIDE 12

We have to

slide-13
SLIDE 13

We can

slide-14
SLIDE 14

Trends

Cloud
Containers
Observability
Microservices
ML/AI

slide-15
SLIDE 15

Current trends provide the impetus and tools for automation by AI

slide-16
SLIDE 16

Automated Operations

slide-17
SLIDE 17

Move 37

slide-18
SLIDE 18
slide-19
SLIDE 19

Move 78 - God’s Touch

slide-20
SLIDE 20
slide-21
SLIDE 21

AI Human

slide-22
SLIDE 22

Types of Operation Actions

Wholly performed by human
Wholly performed by AI
Co-operation between human and AI
Actionable insight

slide-23
SLIDE 23

On Metrics

Data is not insight
Gathering metrics is not automating operations
But metrics are critical to automating operations

slide-24
SLIDE 24

Human ≠ Manual

slide-25
SLIDE 25

Actions by Human

Testing
Deployment
Provisioning

slide-26
SLIDE 26

Cooperative Actions

Anomaly alerting
Rollback broken builds
Dependency upgrade

slide-27
SLIDE 27

Actions by AI

Predictive auto scaling
Workload placement
Automatic rollback
Performance optimisation?
Security?

slide-28
SLIDE 28

Actions and Actionable Insights

slide-29
SLIDE 29

Building for Automation

slide-30
SLIDE 30

Requirements for Operations

Visible metrics and logs
Ability to start/stop/restart/move workload
Ability to change configuration
Ability to modify dependencies
Ability to wire/rewire external services

slide-31
SLIDE 31

Self-contained package
Disposable processes
Externally-configurable
Externally-observable
Externalised dependencies
Externalised service wiring

slide-32
SLIDE 32

12+1 Factor

slide-33
SLIDE 33

13th Factor - Observability

Metrics as event streams

Standard metrics

  • CPU usage, memory usage, …

Service-specific metrics

  • Leads received, items sold, …
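
A minimal sketch of what "metrics as event streams" could look like in practice: every observation is a small, timestamped event, and standard and service-specific metrics share one shape. The `MetricEvent` type, its field names and the JSON-lines encoding are illustrative assumptions, not something the talk prescribes.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Iterable, Iterator


@dataclass
class MetricEvent:
    """One observation in a metric stream (hypothetical shape)."""
    name: str                      # e.g. "cpu.usage" or "leads.received"
    value: float
    timestamp: float = field(default_factory=time.time)
    kind: str = "standard"         # "standard" or "service" (assumed labels)


def to_json_lines(events: Iterable[MetricEvent]) -> Iterator[str]:
    """Serialise events one per line so any consumer can tail the stream."""
    for event in events:
        yield json.dumps(asdict(event), sort_keys=True)


events = [
    MetricEvent("cpu.usage", 0.42, timestamp=1.0),
    MetricEvent("leads.received", 3, timestamp=2.0, kind="service"),
]
lines = list(to_json_lines(events))
```

Keeping both kinds of metric in the same event format means a downstream model can consume "items sold" exactly as it consumes CPU usage.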

slide-34
SLIDE 34

Case Study

Detecting Anomalous DB CPU

slide-35
SLIDE 35

Background

Consumer-facing web application running Rails against PostgreSQL on AWS RDS
Mix of transactional and batch workloads running against the same database
Question: when is the DB unusually overloaded?

slide-36
SLIDE 36
slide-37
SLIDE 37

Detecting Anomalies

Policy-based
Statistical model
Predictive model
Classification model

slide-38
SLIDE 38

Policy Based

Fixed threshold alerting
How well does this work?
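
A fixed-threshold policy can be sketched in a few lines. The 80% limit and both CPU traces are made-up values for illustration, not data from the case study. One plausible weakness with a mixed transactional and batch workload: an expected batch peak trips the alert, while load that is unusual for a quiet period stays under the limit.

```python
def threshold_alerts(cpu_percent, limit=80.0):
    """Return the indices of samples that exceed a fixed limit."""
    return [i for i, value in enumerate(cpu_percent) if value > limit]


# Hypothetical traces (percent CPU per sample).
nightly_batch = [35, 40, 92, 95, 38]   # routine batch job: alerts anyway
daytime_spike = [35, 40, 70, 75, 38]   # odd for daytime, but below the limit

batch_alerts = threshold_alerts(nightly_batch)
daytime_alerts = threshold_alerts(daytime_spike)
```

The policy knows nothing about what is normal for a given time of day, which is the gap the statistical and predictive approaches try to close.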

slide-39
SLIDE 39

Not Very

slide-40
SLIDE 40

Statistical Model

Twitter AnomalyDetection package

  • Seasonal Hybrid ESD

Is this point unexpected in our distribution?

  • With seasonal and trend effects removed
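
The idea of asking "is this point unexpected once seasonal effects are removed?" can be sketched without the Twitter package. The code below is a deliberately simplified stand-in, not Seasonal Hybrid ESD: it subtracts a per-phase median to remove seasonality, then flags residuals with a large robust z-score (median and MAD). The period, the 3.5 threshold and the sample series are assumptions; trend removal is omitted.

```python
from statistics import median


def seasonal_residuals(series, period):
    """Subtract the median of each seasonal phase (e.g. hour of day).

    Assumes len(series) >= period so every phase has at least one sample.
    """
    phase_medians = [median(series[p::period]) for p in range(period)]
    return [v - phase_medians[i % period] for i, v in enumerate(series)]


def mad_outliers(residuals, threshold=3.5):
    """Return indices whose robust z-score exceeds the threshold."""
    centre = median(residuals)
    mad = median(abs(r - centre) for r in residuals)
    if mad == 0:
        return []  # no spread to compare against
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    return [i for i, r in enumerate(residuals)
            if abs(r - centre) / scale > threshold]


# Hypothetical series: a repeating 4-step pattern with small jitter,
# plus one injected spike at index 13.
base = [10, 20, 50, 20]
series = [base[i % 4] + (i % 3) for i in range(24)]
series[13] += 40

outliers = mad_outliers(seasonal_residuals(series, period=4))
```

The regular peak of about 50 is not flagged because it is expected for its phase; only the injected spike stands out after the seasonal pattern is removed.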
slide-41
SLIDE 41
slide-42
SLIDE 42
slide-43
SLIDE 43
slide-44
SLIDE 44

Statistical Model

Stream metrics
Sliding window of observations (1 month, 1 year?)
Each new observation: run model (S-H-ESD)
Is the new point an outlier?
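
The streaming loop above can be sketched as a small class that keeps a bounded window and scores each new observation against it. The scoring here is a plain robust z-score used as a placeholder where S-H-ESD would run; `SlidingWindowDetector`, its parameters and the sample values are all hypothetical.

```python
from collections import deque
from statistics import median


class SlidingWindowDetector:
    """Score each new point against a bounded window of past observations."""

    def __init__(self, window_size, threshold=3.5, min_points=10):
        self.window = deque(maxlen=window_size)  # old points fall off the left
        self.threshold = threshold
        self.min_points = min_points

    def observe(self, value):
        """Return True if `value` is an outlier, then add it to the window."""
        is_outlier = False
        if len(self.window) >= self.min_points:
            centre = median(self.window)
            mad = median(abs(v - centre) for v in self.window)
            if mad > 0:
                score = abs(value - centre) / (1.4826 * mad)
                is_outlier = score > self.threshold
        self.window.append(value)
        return is_outlier


detector = SlidingWindowDetector(window_size=50)
stream = [10, 11, 12, 10, 11, 12, 10, 11, 12, 10, 11, 12, 40, 11]
flags = [detector.observe(v) for v in stream]
```

The window length is the main tuning knob: a month reacts faster to changes in normal behaviour, while a year captures longer seasonal cycles at a higher compute cost per observation.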

slide-45
SLIDE 45

Predictive Model

Train a model to predict values in the time series
Prediction error > critical value => outlier
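
The rule "prediction error > critical value => outlier" does not depend on how the prediction is made. As a sketch, the code below swaps in an exponentially weighted moving average as a trivially simple forecaster in place of a trained neural network; `alpha`, the critical value of 10 and the sample series are assumptions chosen for illustration.

```python
def forecast_error_outliers(series, alpha=0.3, critical=10.0):
    """Flag points whose one-step-ahead forecast error exceeds `critical`."""
    if not series:
        return []
    outliers = []
    forecast = series[0]  # seed the forecast with the first observation
    for i in range(1, len(series)):
        error = abs(series[i] - forecast)
        if error > critical:
            outliers.append(i)
            # Do not learn from a point we believe is anomalous.
            continue
        forecast = alpha * series[i] + (1 - alpha) * forecast
    return outliers


cpu = [50, 51, 49, 50, 52, 90, 51, 50]   # hypothetical CPU percentages
flagged = forecast_error_outliers(cpu)
```

Skipping the update on flagged points keeps one spike from polluting later forecasts, at the cost of adapting slowly if the level shifts for good.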

slide-46
SLIDE 46

[Diagram: feed-forward neural network with inputs x1, x2, x3 and bias units (+1); layers L1, L2, L3; hidden activations a1(2), a2(2), a3(2); output hW,b(x)]
slide-47
SLIDE 47

[Diagram: unrolled recurrent network; repeated cell A with inputs x0 to x4 and outputs h0 to h4]

From: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

slide-48
SLIDE 48
slide-49
SLIDE 49
slide-50
SLIDE 50
slide-51
SLIDE 51

Predictive Model

Metrics stream
Training set: ?? last month
Model
Re-train (nightly, weekly?)
Prediction
Is prediction error an outlier?
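
One way to read this pipeline is as a loop: keep a trailing training set, re-train on a schedule, and judge each prediction error against the distribution of recent errors. The sketch below is built on that reading. The "model" is just the training-set mean, standing in for a real forecaster, and `RetrainingDetector`, `retrain_every` and the factor `k` are hypothetical names and values.

```python
from collections import deque
from statistics import mean, pstdev


class RetrainingDetector:
    """Periodically re-fit a model and flag unusually large prediction errors."""

    def __init__(self, train_size, retrain_every, k=4.0):
        self.history = deque(maxlen=train_size)  # trailing training set
        self.errors = deque(maxlen=train_size)   # recent prediction errors
        self.retrain_every = retrain_every
        self.k = k
        self.model = None                        # stand-in: a single mean
        self.seen = 0

    def observe(self, value):
        is_outlier = False
        if self.model is not None:
            error = abs(value - self.model)
            if len(self.errors) >= 2:
                limit = mean(self.errors) + self.k * pstdev(self.errors)
                is_outlier = error > limit
            if not is_outlier:
                self.errors.append(error)
        if not is_outlier:
            self.history.append(value)           # train only on normal points
        self.seen += 1
        if self.seen % self.retrain_every == 0 and self.history:
            self.model = mean(self.history)      # the "re-train" step
        return is_outlier


detector = RetrainingDetector(train_size=20, retrain_every=5)
values = [50, 52, 48, 51, 49] * 4 + [95, 50]     # hypothetical stream
flags = [detector.observe(v) for v in values]
```

In a real deployment the re-train step would run as a nightly or weekly job rather than every few observations; the structure of the check stays the same.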

slide-52
SLIDE 52

Handling Anomalies

Actionable alerts

  • Confidence in predictions

No alerts for pointless things

slide-53
SLIDE 53

Handling Anomalies

Taking action

  • Rewiring services to read-replica?
  • Kill long-running queries?
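
As a sketch of the "kill long-running queries" option, the code below gates the action on the detector's confidence, in line with the point about confidence in predictions. `pg_stat_activity`, `pg_backend_pid()` and `pg_terminate_backend()` are real PostgreSQL features; the confidence score, both thresholds, the `%s` placeholder style and the function name are assumptions. Nothing here connects to a database.

```python
# Queries a caller could run through its own database driver.
LIST_ACTIVE_SQL = """
SELECT pid, EXTRACT(EPOCH FROM now() - query_start) AS runtime_seconds
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid()
"""
TERMINATE_SQL = "SELECT pg_terminate_backend(%s)"  # placeholder style assumed


def pids_to_terminate(active_queries, anomaly_confidence,
                      min_confidence=0.95, max_runtime_seconds=300):
    """Pick backend pids to terminate, only when the anomaly is trusted.

    `active_queries` is a list of dicts shaped like rows of LIST_ACTIVE_SQL.
    """
    if anomaly_confidence < min_confidence:
        return []  # low confidence: leave the decision to a human
    return [row["pid"] for row in active_queries
            if row["runtime_seconds"] > max_runtime_seconds]


rows = [
    {"pid": 101, "runtime_seconds": 12.0},
    {"pid": 202, "runtime_seconds": 900.0},
]
confident = pids_to_terminate(rows, anomaly_confidence=0.99)
unsure = pids_to_terminate(rows, anomaly_confidence=0.50)
```

Separating the decision (which pids) from the execution (running `TERMINATE_SQL`) makes it easy to start in a cooperative mode, where the list is only shown to an operator, before letting the system act on its own.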
slide-54
SLIDE 54

Handling Anomalies

Confidence in the model leads to confidence in automation

slide-55
SLIDE 55

Summary

Increasing complexity and deployment speed make operational automation a must
We must build services that are ready for automation
Simple models can often beat complex ones
Cheap compute and storage make large-scale ML available to everyone

slide-56
SLIDE 56

Thank You