SLIDE 1

15-388/688 - Practical Data Science: Anomaly detection and mixture of Gaussians

  • J. Zico Kolter

Carnegie Mellon University Spring 2018

SLIDE 2

Outline

  • Anomalies and outliers
  • Multivariate Gaussian
  • Mixture of Gaussians

SLIDE 3

Outline

  • Anomalies and outliers
  • Multivariate Gaussian
  • Mixture of Gaussians

SLIDE 4

What is an "anomaly"?

Two views of anomaly detection:

  • Supervised view: anomalies are whatever some user labels as anomalies
  • Unsupervised view: anomalies are outliers (points of low probability) in the data

In reality, you want a combination of both these viewpoints: not all outliers are anomalies, but all anomalies should be outliers. This lecture focuses on the unsupervised view, but keep in mind that this is only part of the full equation.

SLIDE 5

What is an outlier?

Outliers are points of low probability. Given a collection of data points $x^{(1)}, \ldots, x^{(m)}$, describe the points using some distribution, then find the points with the lowest $p(x^{(i)})$.

Since we are considering points with no labels, this is an unsupervised learning algorithm (we could formulate it in terms of hypothesis, loss, and optimization, but for this lecture we will stick with the probabilistic notation).
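As a concrete (if simplistic) illustration of this recipe, the sketch below fits a one-dimensional Gaussian to unlabeled points and ranks them by density. The data and the choice of distribution are made up purely for illustration; NumPy and SciPy are assumed to be available.

```python
# Unsupervised outlier scoring: fit a simple distribution to unlabeled points,
# then rank points by their density p(x^(i)) and flag the lowest-density ones.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, size=200), [8.0, -7.5]])  # two planted outliers

mu, sigma = x.mean(), x.std()              # fit the distribution (here: a 1-D Gaussian)
density = stats.norm.pdf(x, mu, sigma)     # p(x^(i)) for every point
outlier_idx = np.argsort(density)[:5]      # the 5 lowest-probability points
print(x[outlier_idx])
```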

SLIDE 6

Outline

  • Anomalies and outliers
  • Multivariate Gaussian
  • Mixture of Gaussians

SLIDE 7

Multivariate Gaussian distributions

We have seen Gaussian distributions previously, but mainly focused on distributions over scalar-valued data $x^{(i)} \in \mathbb{R}$:
$$p(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Gaussian distributions generalize nicely to distributions over vector-valued random variables $X$ taking values in $\mathbb{R}^n$:
$$p(x; \mu, \Sigma) = |2\pi\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right) \equiv \mathcal{N}(x; \mu, \Sigma)$$
with parameters $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$, and where $|\cdot|$ denotes the determinant of a matrix (also written $X \sim \mathcal{N}(\mu, \Sigma)$).
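A small numerical sketch (not from the slides) of evaluating this density, once with the formula above and once with scipy.stats.multivariate_normal as a cross-check; the example $\mu$ and $\Sigma$ are arbitrary.

```python
# Evaluating the multivariate Gaussian density N(x; mu, Sigma) two ways.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([3.0, -4.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([2.0, -3.0])

# |2*pi*Sigma|^{-1/2} * exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))
d = x - mu
p_manual = (np.linalg.det(2 * np.pi * Sigma) ** -0.5
            * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)))

p_scipy = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
print(p_manual, p_scipy)   # identical up to floating-point error
```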

SLIDE 8

Properties of multivariate Gaussians

Mean and variance:
$$\mathbf{E}[X] = \int_{\mathbb{R}^n} x\,\mathcal{N}(x; \mu, \Sigma)\,dx = \mu, \qquad \mathbf{Cov}[X] = \int_{\mathbb{R}^n} (x-\mu)(x-\mu)^T\,\mathcal{N}(x; \mu, \Sigma)\,dx = \Sigma$$
(these are not obvious)

Creation from univariate Gaussians: for $x \in \mathbb{R}^n$, if $p(x_i) = \mathcal{N}(x_i; 0, 1)$ (i.e., each element $x_i$ is an independent univariate Gaussian), then $y = Ax + b$ is also normal, with distribution $Y \sim \mathcal{N}(\mu = b,\ \Sigma = AA^T)$
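One quick way to convince yourself of this last property is the numerical check below (my own sketch, with an arbitrary choice of $A$ and $b$): sample many standard-normal vectors, transform them, and compare the empirical mean and covariance to $b$ and $AA^T$.

```python
# Empirically checking that y = A x + b has mean b and covariance A A^T
# when x has i.i.d. N(0, 1) entries.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0],
              [0.7, 0.5]])
b = np.array([3.0, -4.0])

x = rng.standard_normal((100_000, 2))   # rows are i.i.d. standard normal vectors
y = x @ A.T + b                         # y^(i) = A x^(i) + b for every row

print(y.mean(axis=0))                   # ~ b
print(np.cov(y, rowvar=False))          # ~ A @ A.T
print(A @ A.T)
```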

SLIDE 9

Multivariate Gaussians, graphically

$\mu = \begin{bmatrix} 3 \\ -4 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 2.0 & 0.5 \\ 0.5 & 1.0 \end{bmatrix}$

SLIDE 10

Multivariate Gaussians, graphically

$\mu = \begin{bmatrix} 3 \\ -4 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 2.0 & 0 \\ 0 & 1.0 \end{bmatrix}$

SLIDE 11

Multivariate Gaussians, graphically

$\mu = \begin{bmatrix} 3 \\ -4 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 2.0 & 1.0 \\ 1.0 & 1.0 \end{bmatrix}$

SLIDE 12

Multivariate Gaussians, graphically

$\mu = \begin{bmatrix} 3 \\ -4 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 2.0 & 1.4 \\ 1.4 & 1.0 \end{bmatrix}$

SLIDE 13

Multivariate Gaussians, graphically

$\mu = \begin{bmatrix} 3 \\ -4 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 2.0 & -1.0 \\ -1.0 & 1.0 \end{bmatrix}$

SLIDE 14

Maximum likelihood estimation

The maximum likelihood estimates of $\mu, \Sigma$ are what you would "expect", but the derivation is non-obvious. We maximize the log-likelihood:
$$\max_{\mu, \Sigma}\; \ell(\mu, \Sigma) = \sum_{i=1}^m \log p(x^{(i)}; \mu, \Sigma) = \sum_{i=1}^m \left( -\tfrac{1}{2}\log|2\pi\Sigma| - \tfrac{1}{2}(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) \right)$$

Taking gradients with respect to $\mu$ and $\Sigma$ and setting them equal to zero gives the closed-form solutions
$$\mu = \frac{1}{m}\sum_{i=1}^m x^{(i)}, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^m (x^{(i)} - \mu)(x^{(i)} - \mu)^T$$
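In code, these closed-form estimates are a couple of NumPy lines. The sketch below is my own, on synthetic data; it also scores every point by its log-density, which is the kind of scoring used to find the MNIST outliers on the following slides.

```python
# Closed-form Gaussian MLE for a data matrix X of shape (m, n), plus
# log-density scoring of every point to surface the outliers.
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_mle(X):
    mu = X.mean(axis=0)                 # mu = (1/m) sum_i x^(i)
    D = X - mu
    Sigma = (D.T @ D) / X.shape[0]      # Sigma = (1/m) sum_i (x^(i)-mu)(x^(i)-mu)^T
    # (for high-dimensional data such as MNIST you would typically add a small
    #  multiple of the identity to Sigma so that it stays invertible)
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.multivariate_normal([3.0, -4.0], [[2.0, 0.5], [0.5, 1.0]], size=500)

mu, Sigma = fit_gaussian_mle(X)
logp = multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
outliers = np.argsort(logp)[:10]        # the ten lowest-probability points
```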

SLIDE 15

Fitting Gaussian to MNIST

[Figure: the fitted mean $\mu$ and covariance $\Sigma$ visualized as images]

SLIDE 16

MNIST Outliers

SLIDE 17

Outline

  • Anomalies and outliers
  • Multivariate Gaussian
  • Mixture of Gaussians

SLIDE 18

Limits of Gaussians

Though useful, multivariate Gaussians are limited in the types of distributions they can represent

SLIDE 19

Mixture models

A more powerful model to consider is a mixture of Gaussian distributions, a distribution where we first consider a categorical random variable
$$Z \sim \text{Categorical}(\phi), \qquad \phi \in [0,1]^k, \quad \sum_i \phi_i = 1$$
i.e., $z$ takes on values $1, \ldots, k$.

For each potential value of $z$, we consider a separate Gaussian distribution:
$$X \mid Z = z \sim \mathcal{N}(\mu^{(z)}, \Sigma^{(z)}), \qquad \mu^{(z)} \in \mathbb{R}^n,\ \Sigma^{(z)} \in \mathbb{R}^{n \times n}$$

We can write the distribution of $X$ using marginalization:
$$p(x) = \sum_z p(x \mid Z = z)\, p(Z = z) = \sum_z \mathcal{N}(x; \mu^{(z)}, \Sigma^{(z)})\, \phi_z$$

SLIDE 20

Learning mixture models

To estimate the parameters, suppose first that we can observe both $X$ and $Z$, i.e., our data set is of the form $\{(x^{(i)}, z^{(i)})\},\ i = 1, \ldots, m$.

In this case, we can maximize the log-likelihood of the parameters:
$$\ell(\mu, \Sigma, \phi) = \sum_{i=1}^m \log p(x^{(i)}, z^{(i)}; \mu, \Sigma, \phi)$$

Without getting into the full details, it hopefully should not be too surprising that the solutions here are given by:
$$\phi_z = \frac{\sum_{i=1}^m \mathbf{1}\{z^{(i)} = z\}}{m}, \qquad \mu^{(z)} = \frac{\sum_{i=1}^m \mathbf{1}\{z^{(i)} = z\}\, x^{(i)}}{\sum_{i=1}^m \mathbf{1}\{z^{(i)} = z\}}, \qquad \Sigma^{(z)} = \frac{\sum_{i=1}^m \mathbf{1}\{z^{(i)} = z\}\, (x^{(i)} - \mu^{(z)})(x^{(i)} - \mu^{(z)})^T}{\sum_{i=1}^m \mathbf{1}\{z^{(i)} = z\}}$$
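In the fully observed case these estimates are just per-component counts, means, and covariances. A minimal sketch of my own, assuming X is an (m, n) array and z an integer label vector:

```python
# Maximum likelihood estimates when the component labels z^(i) are observed.
import numpy as np

def fit_mixture_observed(X, z, k):
    m, n = X.shape
    phi = np.zeros(k)
    mus = np.zeros((k, n))
    Sigmas = np.zeros((k, n, n))
    for j in range(k):
        mask = (z == j)                  # indicator 1{z^(i) = j}
        phi[j] = mask.mean()             # sum_i 1{z^(i)=j} / m
        Xj = X[mask]
        mus[j] = Xj.mean(axis=0)
        D = Xj - mus[j]
        Sigmas[j] = (D.T @ D) / mask.sum()
    return phi, mus, Sigmas

# usage on synthetic labeled data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(150, 2)), rng.normal(5, 1, size=(50, 2))])
z = np.array([0] * 150 + [1] * 50)
phi, mus, Sigmas = fit_mixture_observed(X, z, k=2)
```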

SLIDE 21

Latent variables and expectation maximization

In the unsupervised setting, the $z^{(i)}$ terms will not be known; these are referred to as hidden or latent random variables. This means that to estimate the parameters, we can't use the indicator function $\mathbf{1}\{z^{(i)} = z\}$ anymore.

Expectation maximization (EM) algorithm (at a high level): replace the indicators $\mathbf{1}\{z^{(i)} = z\}$ with probability estimates $p(z^{(i)} = z \mid x^{(i)}; \mu, \Sigma, \phi)$.

When we re-estimate the parameters, these probabilities change, so repeat:

  • E (expectation) step: compute $p(z^{(i)} = z \mid x^{(i)}; \mu, \Sigma, \phi),\ \forall i, z$
  • M (maximization) step: re-estimate $\mu, \Sigma, \phi$

SLIDE 22

EM for Gaussian mixture models

E step: using Bayes' rule, compute the probabilities
$$\hat{p}_z^{(i)} = p(z^{(i)} = z \mid x^{(i)}; \mu, \Sigma, \phi) = \frac{p(x^{(i)} \mid z^{(i)} = z; \mu, \Sigma)\, p(z^{(i)} = z; \phi)}{\sum_{z'} p(x^{(i)} \mid z^{(i)} = z'; \mu, \Sigma)\, p(z^{(i)} = z'; \phi)} = \frac{\mathcal{N}(x^{(i)}; \mu^{(z)}, \Sigma^{(z)})\, \phi_z}{\sum_{z'} \mathcal{N}(x^{(i)}; \mu^{(z')}, \Sigma^{(z')})\, \phi_{z'}}$$

M step: re-estimate the parameters using these probabilities
$$\phi_z \leftarrow \frac{\sum_{i=1}^m \hat{p}_z^{(i)}}{m}, \qquad \mu^{(z)} \leftarrow \frac{\sum_{i=1}^m \hat{p}_z^{(i)}\, x^{(i)}}{\sum_{i=1}^m \hat{p}_z^{(i)}}, \qquad \Sigma^{(z)} \leftarrow \frac{\sum_{i=1}^m \hat{p}_z^{(i)}\, (x^{(i)} - \mu^{(z)})(x^{(i)} - \mu^{(z)})^T}{\sum_{i=1}^m \hat{p}_z^{(i)}}$$
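Putting the two steps together gives a compact EM loop. The sketch below is my own minimal implementation (random initialization, fixed number of iterations, no convergence check or covariance regularization), not the course's reference code.

```python
# A compact EM loop for a mixture of Gaussians, following the E and M steps above.
# X is an (m, n) data matrix; k is the number of components.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=100, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)
    mus = X[rng.choice(m, size=k, replace=False)]       # initialize means at random data points
    Sigmas = np.array([np.cov(X, rowvar=False) for _ in range(k)])

    for _ in range(iters):
        # E step: p_hat[i, z] = p(z^(i) = z | x^(i); mu, Sigma, phi)
        p_hat = np.column_stack([
            phi[z] * multivariate_normal.pdf(X, mean=mus[z], cov=Sigmas[z])
            for z in range(k)
        ])
        p_hat /= p_hat.sum(axis=1, keepdims=True)

        # M step: re-estimate parameters using the soft assignments
        Nz = p_hat.sum(axis=0)                           # "effective count" of points per component
        phi = Nz / m
        mus = (p_hat.T @ X) / Nz[:, None]
        Sigmas = np.array([
            ((p_hat[:, z, None] * (X - mus[z])).T @ (X - mus[z])) / Nz[z]
            for z in range(k)
        ])
    return phi, mus, Sigmas
```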

SLIDE 23

Local optima

Like k-means, EM is effectively optimizing a non-convex problem, so there is a very real possibility of local optima (seemingly more so than k-means, in practice). The same heuristics work as for k-means (in fact, it is common to initialize EM with the clusters from k-means).
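In practice you would usually reach for an existing implementation; for instance, scikit-learn's GaussianMixture runs this same EM, defaults to k-means initialization (init_params="kmeans"), and supports multiple restarts via n_init, which addresses exactly these local-optima concerns. The data below is synthetic, just to make the snippet runnable.

```python
# Fitting a Gaussian mixture with scikit-learn, using k-means initialization
# and several random restarts to reduce the chance of a bad local optimum.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),   # synthetic two-cluster data
               rng.normal([6, 6], 1.5, size=(100, 2))])

gmm = GaussianMixture(n_components=2, init_params="kmeans", n_init=10, random_state=0)
gmm.fit(X)
scores = gmm.score_samples(X)      # per-point log p(x^(i)); the lowest values are the outliers
```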

SLIDE 24

Illustration of EM algorithm

SLIDE 25

Illustration of EM algorithm

SLIDE 26

Illustration of EM algorithm

SLIDE 27

Illustration of EM algorithm

SLIDE 28

Illustration of EM algorithm

SLIDE 29

Possibility of local optima

SLIDE 30

Possibility of local optima

SLIDE 31

Possibility of local optima

SLIDE 32

Possibility of local optima

SLIDE 33

Poll: outliers in mixture of Gaussians

Consider the following cartoon dataset: if we fit a mixture of two Gaussians to this data via the EM algorithm, which group of points is likely to contain more "outliers" (points with the lowest $p(x)$)?

  • 1. Left group
  • 2. Right group
  • 3. Equal chance of each, depending on initialization

SLIDE 34

EM and k-means

As you may have noticed, EM for mixtures of Gaussians and k-means seem to be doing very similar things. The primary differences: EM computes "distances" based upon the inverse covariance matrix, and it allows for "soft" assignments rather than hard assignments.
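A tiny sketch (with made-up parameters) contrasting the two assignment rules: k-means' hard nearest-mean assignment versus EM's soft responsibilities, where each component's covariance shapes the effective distance.

```python
# Hard assignment (k-means style) vs. soft responsibilities (EM style) for one point.
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([1.0, 1.0])
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), np.array([[4.0, 0.0], [0.0, 0.25]])]
phi = np.array([0.5, 0.5])

hard = np.argmin([np.sum((x - mu) ** 2) for mu in mus])          # k-means: nearest mean

weights = np.array([phi[z] * multivariate_normal.pdf(x, mean=mus[z], cov=Sigmas[z])
                    for z in range(2)])
soft = weights / weights.sum()                                   # EM: posterior probabilities
print(hard, soft)
```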
