SLIDE 1

15-388/688 - Practical Data Science: Anomaly detection and mixture of Gaussians

  • J. Zico Kolter

Carnegie Mellon University Spring 2018

SLIDE 2

Outline

  • Anomalies and outliers
  • Multivariate Gaussian
  • Mixture of Gaussians

SLIDE 3

Outline

  • Anomalies and outliers
  • Multivariate Gaussian
  • Mixture of Gaussians

SLIDE 4

What is an "anomaly"?

Two views of anomaly detection:

  • Supervised view: anomalies are whatever some user labels as anomalies
  • Unsupervised view: anomalies are outliers (points of low probability) in the data

In reality, you want a combination of both these viewpoints: not all outliers are anomalies, but all anomalies should be outliers. This lecture focuses on the unsupervised view, but keep in mind that this is only part of the full equation.

SLIDE 5

What is an outlier?

Outliers are points of low probability. Given a collection of data points $x^{(1)}, \ldots, x^{(m)}$, describe the points using some distribution, then find the points with the lowest $p(x^{(i)})$.

Since we are considering points with no labels, this is an unsupervised learning algorithm (we could formulate it in terms of hypothesis, loss, and optimization, but for this lecture we will stick with the probabilistic notation).
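As a concrete (if simplistic) illustration of this recipe, the sketch below fits a one-dimensional Gaussian to unlabeled points and ranks them by density. The data and the choice of distribution are made up purely for illustration; NumPy and SciPy are assumed to be available.

```python
# Unsupervised outlier scoring: fit a simple distribution to unlabeled points,
# then rank points by their density p(x^(i)) and flag the lowest-density ones.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, size=200), [8.0, -7.5]])  # two planted outliers

mu, sigma = x.mean(), x.std()              # fit the distribution (here: a 1-D Gaussian)
density = stats.norm.pdf(x, mu, sigma)     # p(x^(i)) for every point
outlier_idx = np.argsort(density)[:5]      # the 5 lowest-probability points
print(x[outlier_idx])
```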

SLIDE 6

Outline

  • Anomalies and outliers
  • Multivariate Gaussian
  • Mixture of Gaussians

SLIDE 7

Multivariate Gaussian distributions

We have seen Gaussian distributions previously, but mainly focused on distributions over scalar-valued data $x^{(i)} \in \mathbb{R}$:
$$p(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Gaussian distributions generalize nicely to distributions over vector-valued random variables $X$ taking values in $\mathbb{R}^n$:
$$p(x; \mu, \Sigma) = |2\pi\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right) \equiv \mathcal{N}(x; \mu, \Sigma)$$
with parameters $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$, and where $|\cdot|$ denotes the determinant of a matrix (also written $X \sim \mathcal{N}(\mu, \Sigma)$).
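A small numerical sketch (not from the slides) of evaluating this density, once with the formula above and once with scipy.stats.multivariate_normal as a cross-check; the example $\mu$ and $\Sigma$ are arbitrary.

```python
# Evaluating the multivariate Gaussian density N(x; mu, Sigma) two ways.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([3.0, -4.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([2.0, -3.0])

# |2*pi*Sigma|^{-1/2} * exp(-1/2 (x-mu)^T Sigma^{-1} (x-mu))
d = x - mu
p_manual = (np.linalg.det(2 * np.pi * Sigma) ** -0.5
            * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)))

p_scipy = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
print(p_manual, p_scipy)   # identical up to floating-point error
```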

SLIDE 8

Properties of multivariate Gaussians

Mean and variance:
$$\mathbf{E}[X] = \int_{\mathbb{R}^n} x\,\mathcal{N}(x; \mu, \Sigma)\,dx = \mu, \qquad \mathbf{Cov}[X] = \int_{\mathbb{R}^n} (x-\mu)(x-\mu)^T\,\mathcal{N}(x; \mu, \Sigma)\,dx = \Sigma$$
(these are not obvious)

Creation from univariate Gaussians: for $x \in \mathbb{R}^n$, if $p(x_i) = \mathcal{N}(x_i; 0, 1)$ (i.e., each element $x_i$ is an independent univariate Gaussian), then $y = Ax + b$ is also normal, with distribution $Y \sim \mathcal{N}(\mu = b,\ \Sigma = AA^T)$
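One quick way to convince yourself of this last property is the numerical check below (my own sketch, with an arbitrary choice of $A$ and $b$): sample many standard-normal vectors, transform them, and compare the empirical mean and covariance to $b$ and $AA^T$.

```python
# Empirically checking that y = A x + b has mean b and covariance A A^T
# when x has i.i.d. N(0, 1) entries.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0],
              [0.7, 0.5]])
b = np.array([3.0, -4.0])

x = rng.standard_normal((100_000, 2))   # rows are i.i.d. standard normal vectors
y = x @ A.T + b                         # y^(i) = A x^(i) + b for every row

print(y.mean(axis=0))                   # ~ b
print(np.cov(y, rowvar=False))          # ~ A @ A.T
print(A @ A.T)
```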

SLIDE 9

Multivariate Gaussians, graphically

$\mu = \begin{bmatrix} 3 \\ -4 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 2.0 & 0.5 \\ 0.5 & 1.0 \end{bmatrix}$

SLIDE 10

Multivariate Gaussians, graphically

$\mu = \begin{bmatrix} 3 \\ -4 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 2.0 & 0 \\ 0 & 1.0 \end{bmatrix}$

SLIDE 11

Multivariate Gaussians, graphically

$\mu = \begin{bmatrix} 3 \\ -4 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 2.0 & 1.0 \\ 1.0 & 1.0 \end{bmatrix}$

SLIDE 12

Multivariate Gaussians, graphically

$\mu = \begin{bmatrix} 3 \\ -4 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 2.0 & 1.4 \\ 1.4 & 1.0 \end{bmatrix}$

SLIDE 13

Multivariate Gaussians, graphically

$\mu = \begin{bmatrix} 3 \\ -4 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} 2.0 & -1.0 \\ -1.0 & 1.0 \end{bmatrix}$

SLIDE 14

Maximum likelihood estimation

The maximum likelihood estimates of $\mu, \Sigma$ are what you would "expect", but the derivation is non-obvious. We maximize the log-likelihood:
$$\max_{\mu, \Sigma}\; \ell(\mu, \Sigma) = \sum_{i=1}^m \log p(x^{(i)}; \mu, \Sigma) = \sum_{i=1}^m \left( -\tfrac{1}{2}\log|2\pi\Sigma| - \tfrac{1}{2}(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) \right)$$

Taking gradients with respect to $\mu$ and $\Sigma$ and setting them equal to zero gives the closed-form solutions
$$\mu = \frac{1}{m}\sum_{i=1}^m x^{(i)}, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^m (x^{(i)} - \mu)(x^{(i)} - \mu)^T$$
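In code, these closed-form estimates are a couple of NumPy lines. The sketch below is my own, on synthetic data; it also scores every point by its log-density, which is the kind of scoring used to find the MNIST outliers on the following slides.

```python
# Closed-form Gaussian MLE for a data matrix X of shape (m, n), plus
# log-density scoring of every point to surface the outliers.
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_mle(X):
    mu = X.mean(axis=0)                 # mu = (1/m) sum_i x^(i)
    D = X - mu
    Sigma = (D.T @ D) / X.shape[0]      # Sigma = (1/m) sum_i (x^(i)-mu)(x^(i)-mu)^T
    # (for high-dimensional data such as MNIST you would typically add a small
    #  multiple of the identity to Sigma so that it stays invertible)
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.multivariate_normal([3.0, -4.0], [[2.0, 0.5], [0.5, 1.0]], size=500)

mu, Sigma = fit_gaussian_mle(X)
logp = multivariate_normal.logpdf(X, mean=mu, cov=Sigma)
outliers = np.argsort(logp)[:10]        # the ten lowest-probability points
```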

SLIDE 15

Fitting Gaussian to MNIST

[Figure: the fitted mean $\mu$ and covariance $\Sigma$ visualized as images]

SLIDE 16

MNIST Outliers

SLIDE 17

Outline

  • Anomalies and outliers
  • Multivariate Gaussian
  • Mixture of Gaussians

SLIDE 18

Limits of Gaussians

Though useful, multivariate Gaussians are limited in the types of distributions they can represent

SLIDE 19

Mixture models

A more powerful model to consider is a mixture of Gaussian distributions, a distribution where we first consider a categorical random variable
$$Z \sim \text{Categorical}(\phi), \qquad \phi \in [0,1]^k, \quad \sum_i \phi_i = 1$$
i.e., $z$ takes on values $1, \ldots, k$.

For each potential value of $z$, we consider a separate Gaussian distribution:
$$X \mid Z = z \sim \mathcal{N}(\mu^{(z)}, \Sigma^{(z)}), \qquad \mu^{(z)} \in \mathbb{R}^n,\ \Sigma^{(z)} \in \mathbb{R}^{n \times n}$$

We can write the distribution of $X$ using marginalization:
$$p(x) = \sum_z p(x \mid Z = z)\, p(Z = z) = \sum_z \mathcal{N}(x; \mu^{(z)}, \Sigma^{(z)})\, \phi_z$$

SLIDE 20

Learning mixture models

To estimate the parameters, suppose first that we can observe both $X$ and $Z$, i.e., our data set is of the form $\{(x^{(i)}, z^{(i)})\},\ i = 1, \ldots, m$.

In this case, we can maximize the log-likelihood of the parameters:
$$\ell(\mu, \Sigma, \phi) = \sum_{i=1}^m \log p(x^{(i)}, z^{(i)}; \mu, \Sigma, \phi)$$

Without getting into the full details, it hopefully should not be too surprising that the solutions here are given by:
$$\phi_z = \frac{\sum_{i=1}^m \mathbf{1}\{z^{(i)} = z\}}{m}, \qquad \mu^{(z)} = \frac{\sum_{i=1}^m \mathbf{1}\{z^{(i)} = z\}\, x^{(i)}}{\sum_{i=1}^m \mathbf{1}\{z^{(i)} = z\}}, \qquad \Sigma^{(z)} = \frac{\sum_{i=1}^m \mathbf{1}\{z^{(i)} = z\}\, (x^{(i)} - \mu^{(z)})(x^{(i)} - \mu^{(z)})^T}{\sum_{i=1}^m \mathbf{1}\{z^{(i)} = z\}}$$
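In the fully observed case these estimates are just per-component counts, means, and covariances. A minimal sketch of my own, assuming X is an (m, n) array and z an integer label vector:

```python
# Maximum likelihood estimates when the component labels z^(i) are observed.
import numpy as np

def fit_mixture_observed(X, z, k):
    m, n = X.shape
    phi = np.zeros(k)
    mus = np.zeros((k, n))
    Sigmas = np.zeros((k, n, n))
    for j in range(k):
        mask = (z == j)                  # indicator 1{z^(i) = j}
        phi[j] = mask.mean()             # sum_i 1{z^(i)=j} / m
        Xj = X[mask]
        mus[j] = Xj.mean(axis=0)
        D = Xj - mus[j]
        Sigmas[j] = (D.T @ D) / mask.sum()
    return phi, mus, Sigmas

# usage on synthetic labeled data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(150, 2)), rng.normal(5, 1, size=(50, 2))])
z = np.array([0] * 150 + [1] * 50)
phi, mus, Sigmas = fit_mixture_observed(X, z, k=2)
```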

SLIDE 21

Latent variables and expectation maximization

In the unsupervised setting, the $z^{(i)}$ terms will not be known; these are referred to as hidden or latent random variables. This means that to estimate the parameters, we can't use the indicator function $\mathbf{1}\{z^{(i)} = z\}$ anymore.

Expectation maximization (EM) algorithm (at a high level): replace the indicators $\mathbf{1}\{z^{(i)} = z\}$ with probability estimates $p(z^{(i)} = z \mid x^{(i)}; \mu, \Sigma, \phi)$.

When we re-estimate the parameters, these probabilities change, so repeat:

  • E (expectation) step: compute $p(z^{(i)} = z \mid x^{(i)}; \mu, \Sigma, \phi),\ \forall i, z$
  • M (maximization) step: re-estimate $\mu, \Sigma, \phi$

SLIDE 22

EM for Gaussian mixture models

E step: using Bayes' rule, compute the probabilities
$$\hat{p}_z^{(i)} = p(z^{(i)} = z \mid x^{(i)}; \mu, \Sigma, \phi) = \frac{p(x^{(i)} \mid z^{(i)} = z; \mu, \Sigma)\, p(z^{(i)} = z; \phi)}{\sum_{z'} p(x^{(i)} \mid z^{(i)} = z'; \mu, \Sigma)\, p(z^{(i)} = z'; \phi)} = \frac{\mathcal{N}(x^{(i)}; \mu^{(z)}, \Sigma^{(z)})\, \phi_z}{\sum_{z'} \mathcal{N}(x^{(i)}; \mu^{(z')}, \Sigma^{(z')})\, \phi_{z'}}$$

M step: re-estimate the parameters using these probabilities
$$\phi_z \leftarrow \frac{\sum_{i=1}^m \hat{p}_z^{(i)}}{m}, \qquad \mu^{(z)} \leftarrow \frac{\sum_{i=1}^m \hat{p}_z^{(i)}\, x^{(i)}}{\sum_{i=1}^m \hat{p}_z^{(i)}}, \qquad \Sigma^{(z)} \leftarrow \frac{\sum_{i=1}^m \hat{p}_z^{(i)}\, (x^{(i)} - \mu^{(z)})(x^{(i)} - \mu^{(z)})^T}{\sum_{i=1}^m \hat{p}_z^{(i)}}$$
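Putting the two steps together gives a compact EM loop. The sketch below is my own minimal implementation (random initialization, fixed number of iterations, no convergence check or covariance regularization), not the course's reference code.

```python
# A compact EM loop for a mixture of Gaussians, following the E and M steps above.
# X is an (m, n) data matrix; k is the number of components.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=100, seed=0):
    m, n = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)
    mus = X[rng.choice(m, size=k, replace=False)]       # initialize means at random data points
    Sigmas = np.array([np.cov(X, rowvar=False) for _ in range(k)])

    for _ in range(iters):
        # E step: p_hat[i, z] = p(z^(i) = z | x^(i); mu, Sigma, phi)
        p_hat = np.column_stack([
            phi[z] * multivariate_normal.pdf(X, mean=mus[z], cov=Sigmas[z])
            for z in range(k)
        ])
        p_hat /= p_hat.sum(axis=1, keepdims=True)

        # M step: re-estimate parameters using the soft assignments
        Nz = p_hat.sum(axis=0)                           # "effective count" of points per component
        phi = Nz / m
        mus = (p_hat.T @ X) / Nz[:, None]
        Sigmas = np.array([
            ((p_hat[:, z, None] * (X - mus[z])).T @ (X - mus[z])) / Nz[z]
            for z in range(k)
        ])
    return phi, mus, Sigmas
```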

SLIDE 23

Local optima

Like k-means, EM is effectively optimizing a non-convex problem, so there is a very real possibility of local optima (seemingly more so than k-means, in practice). The same heuristics work as for k-means (in fact, it is common to initialize EM with the clusters from k-means).
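In practice you would usually reach for an existing implementation; for instance, scikit-learn's GaussianMixture runs this same EM, defaults to k-means initialization (init_params="kmeans"), and supports multiple restarts via n_init, which addresses exactly these local-optima concerns. The data below is synthetic, just to make the snippet runnable.

```python
# Fitting a Gaussian mixture with scikit-learn, using k-means initialization
# and several random restarts to reduce the chance of a bad local optimum.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),   # synthetic two-cluster data
               rng.normal([6, 6], 1.5, size=(100, 2))])

gmm = GaussianMixture(n_components=2, init_params="kmeans", n_init=10, random_state=0)
gmm.fit(X)
scores = gmm.score_samples(X)      # per-point log p(x^(i)); the lowest values are the outliers
```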

SLIDE 24

Illustration of EM algorithm

SLIDE 25

Illustration of EM algorithm

SLIDE 26

Illustration of EM algorithm

SLIDE 27

Illustration of EM algorithm

SLIDE 28

Illustration of EM algorithm

SLIDE 29

Possibility of local optima

SLIDE 30

Possibility of local optima

SLIDE 31

Possibility of local optima

SLIDE 32

Possibility of local optima

SLIDE 33

Poll: outliers in mixture of Gaussians

Consider the following cartoon dataset: if we fit a mixture of two Gaussians to this data via the EM algorithm, which group of points is likely to contain more "outliers" (points with the lowest $p(x)$)?

  • 1. Left group
  • 2. Right group
  • 3. Equal chance of each, depending on initialization

SLIDE 34

EM and k-means

As you may have noticed, EM for mixtures of Gaussians and k-means seem to be doing very similar things. The primary differences: EM computes "distances" based upon the inverse covariance matrix, and it allows for "soft" assignments rather than hard assignments.
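A tiny sketch (with made-up parameters) contrasting the two assignment rules: k-means' hard nearest-mean assignment versus EM's soft responsibilities, where each component's covariance shapes the effective distance.

```python
# Hard assignment (k-means style) vs. soft responsibilities (EM style) for one point.
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([1.0, 1.0])
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), np.array([[4.0, 0.0], [0.0, 0.25]])]
phi = np.array([0.5, 0.5])

hard = np.argmin([np.sum((x - mu) ** 2) for mu in mus])          # k-means: nearest mean

weights = np.array([phi[z] * multivariate_normal.pdf(x, mean=mus[z], cov=Sigmas[z])
                    for z in range(2)])
soft = weights / weights.sum()                                   # EM: posterior probabilities
print(hard, soft)
```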
