Anomaly Detection. Jia-Bin Huang, Virginia Tech, Spring 2019.



SLIDE 1

Anomaly Detection

Jia-Bin Huang, Virginia Tech

Spring 2019

ECE-5424G / CS-5824

SLIDE 2

Administrative

SLIDE 3

Anomaly Detection

  • Motivation
  • Developing an anomaly detection system
  • Anomaly detection vs. supervised learning
  • Choosing what features to use
  • Multivariate Gaussian distribution
SLIDE 4

Anomaly Detection

  • Motivation
  • Developing an anomaly detection system
  • Anomaly detection vs. supervised learning
  • Choosing what features to use
  • Multivariate Gaussian distribution
SLIDE 5

Anomaly detection example

  • Dataset: x^(1), x^(2), ⋯, x^(m)
  • New engine: x_test
  • Aircraft engine features:
    • x1 = heat generated
    • x2 = vibration intensity

(Figure: scatter plot of the training engines in the x1 (heat) vs. x2 (vibration) plane.)

SLIDE 6

Density estimation

  • Dataset: x^(1), x^(2), ⋯, x^(m)
  • Is x_test anomalous?

Model p(x)

  • p(x_test) < ε → flag anomaly
  • p(x_test) ≥ ε → OK

(Figure: the engines in the x1 (heat) vs. x2 (vibration) plane; examples in the dense region get high p(x), examples far away get low p(x).)

SLIDE 7

Anomaly detection example

  • Fraud detection:
    • x^(i) = features of user i's activities
    • Model p(x) from data
    • Identify unusual users by checking which have p(x) < ε
  • Manufacturing
  • Monitoring computers in a data center:
    • x^(i) = features of machine i
    • x1 = memory use, x2 = number of disk accesses/sec
    • x3 = CPU load, x4 = CPU load / network traffic
SLIDE 8

Anomaly Detection

  • Motivation
  • Developing an anomaly detection system
  • Anomaly detection vs. supervised learning
  • Choosing what features to use
  • Multivariate Gaussian distribution
SLIDE 9

Gaussian (normal) distribution

  • Say x ∈ ℝ. Suppose x follows a Gaussian distribution with mean μ and variance σ^2:
  • x ∼ 𝒩(μ, σ^2)
  • σ is the standard deviation
  • p(x; μ, σ^2) = (1 / (√(2π) σ)) exp(−(x − μ)^2 / (2σ^2))
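As a quick sketch (my own, not from the slides), this density can be evaluated in a few lines; `gaussian_pdf` is a hypothetical helper name:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    # p(x; mu, sigma^2) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)
```

At x = μ with σ^2 = 1 this peaks at 1/√(2π) ≈ 0.399, and it decays as x moves away from μ.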

SLIDE 10

Gaussian distribution examples

SLIDE 11

Parameter estimation

  • Dataset: x^(1), x^(2), ⋯, x^(m)
  • x ∼ 𝒩(μ, σ^2)
  • Maximum likelihood estimation:
  • μ̂ = (1/m) Σ_{i=1}^m x^(i)
  • σ̂^2 = (1/m) Σ_{i=1}^m (x^(i) − μ̂)^2
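These estimators translate directly to code; a minimal sketch with a hypothetical `fit_gaussian` helper:

```python
def fit_gaussian(xs):
    # Maximum-likelihood estimates for a 1-D Gaussian:
    #   mu-hat     = (1/m) * sum of x^(i)
    #   sigma2-hat = (1/m) * sum of (x^(i) - mu-hat)^2
    # Note the 1/m normalizer (not the unbiased 1/(m-1)).
    m = len(xs)
    mu = sum(xs) / m
    sigma2 = sum((x - mu) ** 2 for x in xs) / m
    return mu, sigma2
```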

SLIDE 12

Density estimation

  • Dataset: x^(1), x^(2), ⋯, x^(m)
  • Each example x ∈ ℝ^n
  • p(x) = p(x1; μ1, σ1^2) p(x2; μ2, σ2^2) ⋯ p(xn; μn, σn^2) = Π_j p(xj; μj, σj^2)
  • (This factorization treats the n features as independent.)

SLIDE 13

Anomaly detection algorithm

  • 1. Choose features xj that you think might be indicative of anomalous examples
  • 2. Fit parameters μ1, ⋯, μn, σ1^2, ⋯, σn^2:
    • μj = (1/m) Σ_{i=1}^m xj^(i)
    • σj^2 = (1/m) Σ_{i=1}^m (xj^(i) − μj)^2
  • 3. Given a new example x, compute p(x) = Π_j p(xj; μj, σj^2)
  • Flag an anomaly if p(x) < ε

SLIDE 14

Evaluation

  • Assume we have some labeled data of anomalous and non-anomalous examples (y = 0 if normal, y = 1 if anomalous)
  • Training set: x^(1), x^(2), ⋯, x^(m) (assumed to be normal examples)
  • Cross-validation set: (x_cv^(1), y_cv^(1)), (x_cv^(2), y_cv^(2)), ⋯, (x_cv^(m_cv), y_cv^(m_cv))
  • Test set: (x_test^(1), y_test^(1)), (x_test^(2), y_test^(2)), ⋯, (x_test^(m_test), y_test^(m_test))

SLIDE 15

Aircraft engines motivating example

  • 10000 good (normal) engines
  • 20 flawed engines (anomalous)
  • Training set: 6000 good engines
  • CV: 2000 good engines (y = 0), 10 anomalous (y = 1)
  • Test: 2000 good engines (y = 0), 10 anomalous (y = 1)
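Spelled out as code (the counts come straight from the slide; the variable names and tuple encoding are illustrative only):

```python
# 10000 good (normal) engines, 20 flawed (anomalous) engines
good = [("good", i) for i in range(10000)]
flawed = [("flawed", i) for i in range(20)]

train = good[:6000]                 # training set: 6000 good engines only
cv = good[6000:8000] + flawed[:10]  # CV: 2000 good (y = 0) + 10 anomalous (y = 1)
test = good[8000:] + flawed[10:]    # test: 2000 good (y = 0) + 10 anomalous (y = 1)
```

Note that the anomalous examples go only into the CV and test sets; the model is fit on normal data alone.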
SLIDE 16

Algorithm evaluation

  • Fit model π‘ž(𝑦) on training set {𝑦 1 , β‹― , 𝑦 𝑛 }
  • On a cross-validation/test example 𝑦, predict
  • 𝑧 = α‰Š1

if π‘ž 𝑦 < πœ— (anomaly) if π‘ž 𝑦 β‰₯ πœ— (normal)

  • Possible evaluation metrics:
  • True positive, false positive, false negative, true negative
  • Precision/Recall
  • F1-score
  • Can use cross-validation set to choose parameter πœ—
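Choosing ε by maximizing F1 on the cross-validation set can be sketched as follows (a toy grid scan of my own; `f1_at` and `choose_epsilon` are hypothetical names):

```python
def f1_at(eps, p_cv, y_cv):
    # Predict anomaly (y = 1) when p(x) < eps; score F1 against the CV labels.
    tp = sum(1 for p, y in zip(p_cv, y_cv) if p < eps and y == 1)
    fp = sum(1 for p, y in zip(p_cv, y_cv) if p < eps and y == 0)
    fn = sum(1 for p, y in zip(p_cv, y_cv) if p >= eps and y == 1)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2.0 * prec * rec / (prec + rec)

def choose_epsilon(p_cv, y_cv, steps=1000):
    # Scan candidate thresholds between the min and max CV density values.
    lo, hi = min(p_cv), max(p_cv)
    best = max(
        (lo + (hi - lo) * s / steps for s in range(1, steps + 1)),
        key=lambda eps: f1_at(eps, p_cv, y_cv),
    )
    return best, f1_at(best, p_cv, y_cv)
```

With anomalies sitting at much lower p(x) than the normal examples, the scan finds a threshold between the two groups and reaches F1 = 1 on this toy input.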
SLIDE 17

Evaluation metric

  • How about accuracy?
  • Assume only 0.1% of the engines are anomalous (skewed classes)
  • Declaring every example normal already gives 99.9% accuracy!
SLIDE 18

Precision/Recall

  • Precision P = true positives / (true positives + false positives)
  • Recall R = true positives / (true positives + false negatives)
  • F1 score: 2PR / (P + R)

SLIDE 19

Anomaly Detection

  • Motivation
  • Developing an anomaly detection system
  • Anomaly detection vs. supervised learning
  • Choosing what features to use
  • Multivariate Gaussian distribution
SLIDE 20

Anomaly detection vs. supervised learning

Anomaly detection:

  • Very small number of positive examples (y = 1); 0-20 is common
  • Large number of negative (y = 0) examples
  • Many different types of anomalies; hard for any algorithm to learn from the positive examples what the anomalies look like
  • Future anomalies may look nothing like any of the anomalous examples we have seen so far

Supervised learning:

  • Large number of positive and negative examples
  • Enough positive examples for the algorithm to get a sense of what positives are like; future positive examples are likely to be similar to ones in the training set

SLIDE 21

Anomaly detection vs. supervised learning

Anomaly detection:

  • Fraud detection
  • Manufacturing
  • Monitoring machines in a data center

Supervised learning:

  • Email spam classification
  • Weather prediction
  • Cancer classification
SLIDE 22

Anomaly Detection

  • Motivation
  • Developing an anomaly detection system
  • Anomaly detection vs. supervised learning
  • Choosing what features to use
  • Multivariate Gaussian distribution
SLIDE 23

Non-Gaussian features

  • If a feature x is non-Gaussian, transform it (e.g., replace x with log x) so that its distribution looks more Gaussian.
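A toy illustration of the log transform (the numbers are my own, not from the slides):

```python
import math

# A heavily skewed feature: the raw values are a geometric progression,
# so a few large values dominate the scale.
raw = [0.1, 0.2, 0.4, 0.8, 1.6, 3.2]
transformed = [math.log(v) for v in raw]

# After the transform the values are evenly spaced (the log of a geometric
# progression is an arithmetic progression), i.e. far less skewed.
```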

SLIDE 24

Error analysis for anomaly detection

Want π‘ž(𝑦) large for normal examples 𝑦 π‘ž(𝑦) small for anomalous examples 𝑦 Most common problem: π‘ž(𝑦) is comparable (say both large) for normal and anomalous examples

SLIDE 25

Monitoring computers in a data center

  • Choose features that might take on unusually large or small values in the event of an anomaly
  • x1 = memory use of computer
  • x2 = number of disk accesses/sec
  • x3 = CPU load
  • x4 = network traffic
  • x5 = (CPU load) / (network traffic)
  • or alternatively x5 = (CPU load)^2 / (network traffic)
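The ratio features can be computed directly; a sketch with a hypothetical helper:

```python
def extra_features(cpu_load, network_traffic):
    # x5 is large when the CPU is pegged but the machine serves little traffic
    # (e.g. a process stuck in an infinite loop). The squared variant (the
    # slide's alternative) exaggerates that effect.
    x5 = cpu_load / network_traffic
    x5_alt = cpu_load ** 2 / network_traffic
    return x5, x5_alt
```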

SLIDE 26

Anomaly Detection

  • Motivation
  • Developing an anomaly detection system
  • Anomaly detection vs. supervised learning
  • Choosing what features to use
  • Multivariate Gaussian distribution
SLIDE 27

Motivating example: Monitoring machines in a data center

(Figure: scatter plots of x1 (CPU load) vs. x2 (memory use) for the machines in the data center.)

SLIDE 28

Multivariate Gaussian (normal) distribution

  • 𝑦 ∈ π‘†π‘œ. Don’t model π‘ž 𝑦1 , π‘ž 𝑦2 , β‹― separately
  • Model π‘ž 𝑦 all in one go.
  • Parameters: 𝜈 ∈ π‘†π‘œ, Ξ£ ∈ π‘†π‘œΓ—π‘œ (covariance matrix)
  • π‘ž 𝑦; 𝜈, Ξ£ =

1 2𝜌 π‘œ/2 Ξ£ 1/2 exp βˆ’ 𝑦 βˆ’ 𝜈 βŠ€Ξ£βˆ’1(𝑦 βˆ’ 𝜈)

SLIDE 29

Multivariate Gaussian (normal) examples

(Figure: contour plots over (x1, x2) for Σ = [1 0; 0 1], Σ = [0.6 0; 0 0.6], and Σ = [2 0; 0 2]: shrinking Σ concentrates the density, growing Σ spreads it out.)

SLIDE 30

Multivariate Gaussian (normal) examples

(Figure: contour plots for Σ = [1 0; 0 1], Σ = [0.6 0; 0 1], and Σ = [2 0; 0 1]: changing the variance of x1 alone stretches or squeezes the contours along the x1 axis.)

SLIDE 31

Multivariate Gaussian (normal) examples

(Figure: contour plots for Σ = [1 0; 0 1], Σ = [1 0.5; 0.5 1], and Σ = [1 0.8; 0.8 1]: positive off-diagonal entries tilt the contours, modeling correlation between x1 and x2.)

SLIDE 32

Anomaly detection using the multivariate Gaussian distribution

  • 1. Fit model p(x) by setting:
    • μ = (1/m) Σ_{i=1}^m x^(i)
    • Σ = (1/m) Σ_{i=1}^m (x^(i) − μ)(x^(i) − μ)^⊤
  • 2. Given a new example x, compute:
    • p(x; μ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2)(x − μ)^⊤ Σ^{−1} (x − μ))
  • Flag an anomaly if p(x) < ε

SLIDE 33

Original model

  • p(x) = p(x1; μ1, σ1^2) p(x2; μ2, σ2^2) ⋯ p(xn; μn, σn^2)
  • Manually create features to capture anomalies where x1, x2 take unusual combinations of values
  • Computationally cheaper (alternatively, scales better to large n)
  • OK even if the training set size m is small

Multivariate Gaussian model

  • p(x; μ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−(1/2)(x − μ)^⊤ Σ^{−1} (x − μ))
  • Automatically captures correlations between features
  • Computationally more expensive
  • Must have m > n, or else Σ is non-invertible

SLIDE 34

Things to remember

  • Motivation
  • Developing an anomaly detection system
  • Anomaly detection vs. supervised learning
  • Choosing what features to use
  • Multivariate Gaussian distribution