Anomaly Detection

Qi Liu
University of Science and Technology of China
qiliuql@ustc.edu.cn

Data Mining Tasks …

Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes

Anomaly/Outlier Detection
- What are anomalies/outliers?
  - The set of data points that are considerably different than the remainder of the data
- Natural implication is that anomalies are relatively rare
  - One in a thousand occurs often if you have lots of data
  - Context is important, e.g., freezing temps in July
- Can be important or a nuisance
  - 10-foot-tall 2-year-old
  - Unusually high blood pressure

Importance of Anomaly Detection

Ozone Depletion History
- In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!

Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html

Causes of Anomalies
- Data from different classes
  - Measuring the weights of oranges, but a few grapefruit are mixed in
- Natural variation
  - Unusually tall people
- Data errors
  - 200-pound 2-year-old

Distinction Between Noise and Anomalies
- Noise is erroneous, perhaps random, values or contaminating objects
  - Weight recorded incorrectly
  - Grapefruit mixed in with the oranges
- Noise doesn't necessarily produce unusual values or objects
- Noise is not interesting
- Anomalies may be interesting if they are not a result of noise
- Noise and anomalies are related but distinct concepts

General Issues: Number of Attributes
- Many anomalies are defined in terms of a single attribute
  - Height
  - Shape
  - Color
- Can be hard to find an anomaly using all attributes
  - Noisy or irrelevant attributes
  - Object is only anomalous with respect to some attributes
- However, an object may not be anomalous in any one attribute

General Issues: Anomaly Scoring
- Many anomaly detection techniques provide only a binary categorization
  - An object is an anomaly or it isn't
  - This is especially true of classification-based approaches
- Other approaches assign a score to all points
  - This score measures the degree to which an object is an anomaly
  - This allows objects to be ranked
- In the end, you often need a binary decision
  - Should this credit card transaction be flagged?
  - Still useful to have a score
- How many anomalies are there?

Other Issues for Anomaly Detection
- Find all anomalies at once or one at a time
  - Swamping
  - Masking
- Evaluation
  - How do you measure performance?
  - Supervised vs. unsupervised situations
- Efficiency
- Context
  - Professional basketball team

Variants of Anomaly Detection Problems
- Given a data set D, find all data points x ∈ D with anomaly scores greater than some threshold t
- Given a data set D, find all data points x ∈ D having the top-n largest anomaly scores
- Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
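The first two variants differ only in how the scores are turned into a decision. The following is a minimal sketch, assuming the anomaly scores have already been computed by some method; the function names and the score values are hypothetical and chosen only for illustration.

```python
from heapq import nlargest

def above_threshold(scores, t):
    """Variant 1: indices of points whose anomaly score exceeds threshold t."""
    return [i for i, s in enumerate(scores) if s > t]

def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    return nlargest(n, range(len(scores)), key=lambda i: scores[i])

# Hypothetical anomaly scores for six points (higher means more anomalous).
scores = [0.10, 0.92, 0.35, 0.08, 0.77, 0.41]

flagged = above_threshold(scores, t=0.5)  # depends on the scale of the scores
ranked = top_n(scores, n=3)               # depends only on their ordering
```

A threshold needs scores on a meaningful scale, while top-n only needs a ranking but always returns n points, whether or not that many anomalies exist.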

Model-Based Anomaly Detection
- Build a model for the data and see
  - Unsupervised
    - Anomalies are those points that don't fit well
    - Anomalies are those points that distort the model
    - Examples:
      - Statistical distribution
      - Clusters
      - Regression
      - Geometric
      - Graph
  - Supervised
    - Anomalies are regarded as a rare class
    - Need to have training data

Additional Anomaly Detection Techniques
- Proximity-based
  - Anomalies are points far away from other points
  - Can detect this graphically in some cases
- Density-based
  - Low density points are outliers
- Pattern matching
  - Create profiles or templates of atypical but important events or objects
  - Algorithms to detect these patterns are usually simple and efficient
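One common way to make "far away from other points" concrete is to score each point by the distance to its k-th nearest neighbour. The slides do not prescribe a particular score, so the sketch below is one possible choice; the function name, the value of k, and the sample points are assumptions.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Four points close together and one far away (made-up data).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (5.0, 5.0)]

scores = knn_distance_scores(points, k=2)
most_anomalous = max(range(len(points)), key=lambda i: scores[i])
```

This brute-force version compares every pair of points, so it is only suitable for small data sets.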

Graphical Approaches
- Boxplots or scatter plots
- Limitations
  - Not automatic
  - Subjective

Convex Hull Method
- Extreme points are assumed to be outliers
- Use convex hull method to detect extreme values
- What if the outlier occurs in the middle of the data?
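The limitation raised in the last bullet can be seen on a small example. The sketch below uses Andrew's monotone chain algorithm as one possible way to compute the hull (the slides do not name an algorithm), and the data set is made up: two tight clusters with a single isolated point between them.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

left = [(0, 0), (0, 1), (1, 0), (1, 1)]
right = [(10, 0), (10, 1), (11, 0), (11, 1)]
isolated = (5.5, 0.5)  # far from both clusters, but in the middle of the data

hull = convex_hull(left + right + [isolated])
isolated_flagged = isolated in hull
```

Here the hull consists of four cluster corners, which are ordinary cluster members, while the isolated point lies inside the hull and is never flagged.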

Statistical Approaches

Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.
- Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)
- Apply a statistical test that depends on
  - Data distribution
  - Parameters of distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit)
- Issues
  - Identifying the distribution of a data set
    - Heavy tailed distribution
  - Number of attributes
  - Is the data a mixture of distributions?
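The probabilistic definition can be sketched for a single attribute by fitting a normal distribution and flagging values with a small two-sided tail probability. This assumes the data really are roughly normal; the function name, the 0.05 cutoff, and the height values are illustrative assumptions, not part of the slides.

```python
from statistics import NormalDist, mean, stdev

def tail_probabilities(values):
    """Two-sided tail probability of each value under a normal model fitted to the data."""
    model = NormalDist(mean(values), stdev(values))
    return [2 * min(model.cdf(v), 1 - model.cdf(v)) for v in values]

# Made-up heights in centimetres; the last value is unusually large.
heights_cm = [172, 168, 175, 170, 169, 174, 171, 173, 167, 210]

probs = tail_probabilities(heights_cm)
outliers = [v for v, p in zip(heights_cm, probs) if p < 0.05]
```

Note that the mean and standard deviation are estimated from data that include the extreme value, so the extreme value itself inflates the spread of the fitted model.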

Normal Distributions

[Figure: probability density of a one-dimensional Gaussian (plotted against x) and of a two-dimensional Gaussian (plotted over x and y)]

Grubbs' Test
- Detect outliers in univariate data
- Assume data comes from normal distribution
- Detects one outlier at a time, remove the outlier, and repeat
  - H_0: There is no outlier in data
  - H_A: There is at least one outlier
- Grubbs' test statistic:

  $$G = \frac{\max |X - \bar{X}|}{s}$$

- Reject H_0 if:

  $$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$$
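A single round of the test can be sketched with the standard library alone, provided the t critical value is looked up separately. In the sketch below the function names and data are hypothetical, and `t_crit=3.355` is the upper 0.005 quantile of the t distribution with 8 degrees of freedom from a standard t-table, which corresponds to α = 0.05 and N = 10 in the rejection rule above.

```python
import math
from statistics import mean, stdev

def grubbs_statistic(values):
    """Return (G, suspect), where suspect is the value farthest from the mean."""
    m, s = mean(values), stdev(values)
    suspect = max(values, key=lambda v: abs(v - m))
    return abs(suspect - m) / s, suspect

def grubbs_critical(n, t_crit):
    """Critical value of G, given t_crit = t_(alpha/N, N-2) supplied by the caller."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))

# Made-up measurements; the last one is suspiciously large.
values = [172, 168, 175, 170, 169, 174, 171, 173, 167, 210]

g, suspect = grubbs_statistic(values)
g_crit = grubbs_critical(len(values), t_crit=3.355)  # t-table value, see lead-in
reject_h0 = g > g_crit
```

To follow the one-at-a-time procedure, the suspect value would be removed and the test repeated with N reduced by one and a new t critical value for the smaller sample.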

Statistical-based: Likelihood Approach
- Assume the data set D contains samples from a mixture of two probability distributions:
  - M (majority distribution)
  - A (anomalous distribution)
- General Approach:
  - Initially, assume all the data points belong to M
  - Let L_t(D) be the log likelihood of D at time t
  - For each point x_t that belongs to M, move it to A
    - Let L_{t+1}(D) be the new log likelihood
    - Compute the difference, Δ = L_{t+1}(D) - L_t(D), i.e., the gain in log likelihood from the move
    - If Δ > c (some threshold), then x_t is declared as an anomaly and moved permanently from M to A

Statistical-based: Likelihood Approach
- Data distribution, D = (1 - λ) M + λ A
- M is a probability distribution estimated from data
  - Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
- A is initially assumed to be uniform distribution
- Likelihood at time t:

  $$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$$

  $$LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$
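The procedure can be sketched for one-dimensional data under some stated assumptions: M is modelled as a Gaussian refitted to the points currently in M (the slides allow any modelling method), A is uniform over the observed data range, and Δ is taken as the gain in log likelihood when a point is moved from M to A. The function names, λ = 0.1, c = 1.0, and the data are all illustrative.

```python
import math
from statistics import mean, pstdev

def log_likelihood(m_pts, a_pts, lam, lo, hi):
    """LL(D) for the current split: Gaussian M (assumed), uniform A on [lo, hi]."""
    mu, sigma = mean(m_pts), pstdev(m_pts)
    ll = len(m_pts) * math.log(1 - lam)
    ll += sum(-0.5 * math.log(2 * math.pi * sigma**2)
              - (x - mu) ** 2 / (2 * sigma**2) for x in m_pts)
    if a_pts:
        ll += len(a_pts) * math.log(lam)
        ll += len(a_pts) * math.log(1 / (hi - lo))  # uniform density on [lo, hi]
    return ll

def detect_anomalies(data, lam=0.1, c=1.0):
    """Move a point from M to A permanently when doing so raises LL(D) by more than c."""
    lo, hi = min(data), max(data)
    m_pts, a_pts = list(data), []
    for x in data:
        ll_before = log_likelihood(m_pts, a_pts, lam, lo, hi)
        trial_m = list(m_pts)
        trial_m.remove(x)  # tentatively move x out of M
        ll_after = log_likelihood(trial_m, a_pts + [x], lam, lo, hi)
        if ll_after - ll_before > c:  # Delta: gain in log likelihood
            m_pts, a_pts = trial_m, a_pts + [x]
    return a_pts

# Made-up measurements; the last one is far from the rest.
data = [172, 168, 175, 170, 169, 174, 171, 173, 167, 210]
anomalies = detect_anomalies(data)
```

The sketch does not guard against degenerate cases such as all points in M being identical (zero variance), and the result can depend on the order in which points are visited.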

Strengths/Weaknesses of Statistical Approaches
- Firm mathematical foundation
- Can be very efficient
- Good results if distribution is known
- In many cases, data distribution may not be known
- For high dimensional data, it may be difficult to estimate the true distribution
- Anomalies can distort the parameters of the distribution
