SLIDE 1

www.tugraz.at

Anomalies in Data

SCIENCE PASSION TECHNOLOGY

Maximilian Toller KDDM2

Maximilian Toller, Know-Center KDDM2 1

SLIDE 2

Anomalies in Data

Recall from earlier

SLIDE 3

What are Outliers?

A recap from KDDM1

SLIDE 4

What are Outliers?

Definitions

An observation that appears to deviate markedly from other members of the sample in which it occurs. (Grubbs, 1969)
An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data. (Barnett and Lewis, 1974)
An observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism. (Hawkins, 1980)

SLIDE 5

What are Outliers?

Examples (easy): Inliers; Outliers (Grubbs, Barnett); Outliers (Grubbs, Barnett, Hawkins)

[Scatter plot: X vs Y]

SLIDE 7

What are Outliers?

Examples (more difficult)

[Two scatter plots: x vs y]

SLIDE 9

What are Outliers?

Examples (more difficult)

[Two scatter plots: x vs y]

SLIDE 10

What are Outliers?

Methods: Preview

There are many outlier detection methods:
Local outlier factor
Angle-based outlier degree
Artificial neural networks
...

Why are there so many?

SLIDE 11

What are Anomalies?

SLIDE 12

What are Anomalies?

Difference from Outliers

In the literature, outlier and anomaly are used interchangeably.
For both, only vague and very similar definitions exist.
However, the terms have different origins and different typical use:

Outliers typically...
... are motivated by statistics.
... are unusual data.
... are investigated by traditional researchers and statisticians.

Anomalies typically...
... require context.
... are abnormal events.
... are investigated by data analysts and data scientists.

SLIDE 13

What are Anomalies?

Example: Credit card fraud

Billions of dollars lost every year
Fraudulent transactions often differ significantly from normal ones
Difficult to disguise fraud such that it is not visible on any scale

SLIDE 14

What are Anomalies?

Example: Cancer

One of the most common causes of human death
A disease with abnormal cell growth
Cancer has an abnormal gene expression signature

(Quinn et al., 2019)

SLIDE 15

What are Anomalies?

The role of context

Abnormality is context-dependent.

Discordant data problem (credit card fraud example):
Many normal observations
Rare outlying data

Anomaly class problem (cancer example):
Normal data class
Anomaly classes

Can data define abnormality?

SLIDE 16

Unlikely, Discordant and Contaminated Data

How to interpret suspicious data

SLIDE 17

Unlikely, Discordant and Contaminated Data

The Case of Hadlum vs Hadlum

Mr Hadlum accuses Mrs Hadlum of adultery.
Sole evidence: birth of a child 349 days after Mr Hadlum left the country.
Average human gestation period: 280 days.

(Barnett and Lewis, 1974)

SLIDE 18

Unlikely, Discordant and Contaminated Data

The Case of Hadlum vs Hadlum

Mr Hadlum conjectured a different distribution (red).
The judges did not find Mrs Hadlum guilty, since 349 days is unlikely, but not impossible (blue).
(Modern research showed that more than 340 days is impossible.)

(Zimek and Filzmoser, 2018)

SLIDE 19

Unlikely, Discordant and Contaminated Data

The Antarctic Ozone Hole

The ozone layer protects Earth from solar radiation.
Damaged by human emissions of chlorofluorocarbons.
High depletion (a "hole") above the poles.

https://de.wikipedia.org/wiki/Datei:Ozone_layer.jpg

SLIDE 20

Unlikely, Discordant and Contaminated Data

The (Ant)Arctic Ozone Hole

Farman et al. (1985) discovered the hole in a field study.
The authors were hesitant to publish: Nimbus satellite data showed no drop.
Problem: largely deviating values had been discarded as measurement errors.

NASA/JPL-Caltech

SLIDE 21

Unlikely, Discordant and Contaminated Data

Definition

Unlikely data: the position of the judges ("random drop of ozone not caused by humans"). Data unlikely but still normal. No correction. Action: none.

Discordant data: the position of Mr Hadlum; the ozone field study by Farman et al. (1985). Data too unlikely to be normal. Correction of the model. Action: investigate.

Contamination: "Wrong day of birth?"; the satellite measurement error. Data incorrect or misleading. Correction of the data. Action: remove.

SLIDE 22

Unlikely, Discordant and Contaminated Data

Implications

It is hard to classify data as unlikely, discordant or contaminated:
No universal decision criterion
Domain knowledge as remedy
Ultimately subjective

SLIDE 23

Unlikely, Discordant and Contaminated Data

Strategies

  • 1. Try to ignore anomalies (Not interesting)
  • 2. Find anomalies for investigation or removal (Interesting)

SLIDE 24

Robust Statistics

Data Analysis in the Presence of Anomalies

SLIDE 25

Robust Statistics

Introduction I

Setting:
Potentially contaminated dataset; the majority is uncontaminated
Cannot find or remove the contamination (e.g. inserted by an attacker)
Task: analyze the data in spite of contamination; understand what is normal

SLIDE 26

Robust Statistics

Introduction II

Challenges:
No prior information about the data
Contamination may be arbitrarily "bad" (adversarial)

Question: Which methods are suitable?

SLIDE 27

Robust Statistics

Example: Mean and variance

Two common estimators:
Sample mean: x̄ = (1/n) Σ_{j=1}^n x_j
Sample variance: σ̂²_x = (1/(n−1)) Σ_{j=1}^n (x_j − x̄)²

Mean and variance are influenced by contamination:
Original x = [1, 3, 2, 1, 9, 2, 3, 2, 3, 2, 2, 1]: x̄ ≈ 2.58, σ̂²_x ≈ 4.63
Clean y = [1, 3, 2, 1, 2, 3, 2, 3, 2, 2, 1]: ȳ = 2, σ̂²_y = 0.6
SLIDE 32

Robust Statistics

Example: Mean and variance

What happens when an attacker corrupts the data unfavorably?

Attack #1: a1 = [1, 3, 2, 1, 900, 2, 3, 2, 3, 2, 2, 1]: ā1 ≈ 76.83, σ̂²_a1 ≈ 67200.88
Attack #2: a2 = [1, 3, 2, 1, 900000000, 2, 3, 2, 3, 2, 2, 1]: ā2 ≈ 7.5 × 10⁷, σ̂²_a2 ≈ 6.75 × 10¹⁶
Attack #3: a3 = [1, 3, 2, 1, ∞, 2, 3, 2, 3, 2, 2, 1]: ā3 = ∞, σ̂²_a3 = ∞

→ Mean and variance are not robust.
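These numbers are easy to reproduce; a minimal sketch in Python (standard library only) of how a single corrupted observation drags both estimators:

```python
from statistics import mean, variance  # variance() is the sample variance (n-1 denominator)

x  = [1, 3, 2, 1, 9, 2, 3, 2, 3, 2, 2, 1]    # original data
a1 = [1, 3, 2, 1, 900, 2, 3, 2, 3, 2, 2, 1]  # attack #1: one corrupted value

print(round(mean(x), 2), round(variance(x), 2))    # → 2.58 4.63
print(round(mean(a1), 2), round(variance(a1), 2))  # → 76.83 67200.88
```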

SLIDE 33

Robust Statistics

Example: Median and MAD

Two different estimators:
Median m(X): any real number satisfying P(X ≤ m(X)) ≥ 0.5 and P(X ≥ m(X)) ≥ 0.5.
For finite sorted data x = [x1, . . . , xn]: m(x) = (x_⌊(n+1)/2⌋ + x_⌈(n+1)/2⌉) / 2 (the middle value)

Median Absolute Deviation (MAD): ζ(x) = m(|x − m(x)|)

SLIDE 39

Robust Statistics

Median and MAD are less influenced by contamination:

a1 = [1, 3, 2, 1, 900, 2, 3, 2, 3, 2, 2, 1]: m(a1) = 2, ζ(a1) = 1
a2 = [1, 3, 2, 1, ∞, 2, 3, 2, 3, 2, 2, 1]: m(a2) = 2, ζ(a2) = 1
a3 = [∞, 3, 2, ∞, ∞, 2, ∞, 2, 3, 2, 2, ∞]: m(a3) = 3, ζ(a3) = 1
a4 = [∞, ∞, 2, ∞, ∞, 2, ∞, 2, ∞, 2, 2, ∞]: m(a4) = ∞, ζ(a4) = ∞

→ Median and MAD are robust estimators of central tendency and dispersion
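A quick sanity check of these values, sketched in Python with only the standard library (math.inf stands in for ∞):

```python
import math
from statistics import median

def mad(x):
    """Median absolute deviation: zeta(x) = m(|x - m(x)|)."""
    m = median(x)
    return median(abs(v - m) for v in x)

a1 = [1, 3, 2, 1, 900, 2, 3, 2, 3, 2, 2, 1]
a3 = [math.inf, 3, 2, math.inf, math.inf, 2, math.inf, 2, 3, 2, 2, math.inf]

print(median(a1), mad(a1))  # → 2.0 1.0
print(median(a3), mad(a3))  # → 3.0 1.0
```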

SLIDE 40

Robust Statistics

Definition: A statistic T(·) maps data to a single value, i.e. T : Rⁿ → R. Examples: mean, minimum, χ² tests, ...

Robust Statistics = Robust + T(·)

Definition: A statistic T(·) is robust if it behaves favorably as the data it is computed on increasingly deviates from the assumptions made by T(·).

SLIDE 41

Robust Statistics

About mean and variance I

What is estimated by the sample mean x̄ = μ̂_X = (1/n) Σ_{j=1}^n x_j and the sample variance σ̂²_X = (1/(n−1)) Σ_{j=1}^n (x_j − x̄)²?

By the strong law of large numbers (L.L.N.):
x̄ → μ_X = E[X] almost surely (n → ∞), and σ̂_X → σ_X (n → ∞).

SLIDE 42

Robust Statistics

About mean and variance II

The strong L.L.N. assumes x ~ D(·) i.i.d.
Anomalies typically follow a different distribution.
A single anomaly might break the i.i.d. assumption.
x̄ and σ̂_X become biased towards the anomaly.

SLIDE 43

Robust Statistics

Bias

Mean x̄ and median m(x) are affected differently by contamination
→ different amounts of contamination are needed to bias them.

A single corrupted observation will add bias to x̄.
At least n/2 corrupted observations are needed to bias m(x).

Question: How do we measure the impact of contamination on bias?

SLIDE 44

Robust Statistics

Breakdown point I

Definition: Let Tn(·) be an estimator of θ with Tn(x_n) = θ̂, and let 0 < k < n observations in x_n be contaminated to an arbitrary value. The breakdown point β⋆ of Tn is the smallest contamination fraction for which the bias becomes worst possible:

β⋆_Tn = min { k/n : |E[θ̂] − θ| = sup b(Tn, θ) }

SLIDE 45

Robust Statistics

Breakdown point II

In simple terms: the smallest fraction of corrupted observations that Tn cannot handle.

Assess robustness:

  • 1. Corrupt an observation
  • 2. Check the bias
  • 3. Repeat until the worst possible output is reached

SLIDE 46

Robust Statistics

Breakdown Point: Example

Some breakdown points:
Mean: β⋆_x̄ = 1/n
IQR: β⋆_I ≈ 1/4
Median: β⋆_m ≈ 1/2
Perceptron: β⋆_p = 1/n

Easy to test on a small dataset:

  • 1. Contaminate a few observations
  • 2. See how the statistic/algorithm behaves
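The two-step test above can be sketched directly in Python (standard library; the corrupt helper is a hypothetical name introduced here):

```python
from statistics import mean, median

data = [1, 3, 2, 1, 2, 3, 2, 3, 2, 2, 1, 2]  # small clean dataset, n = 12

def corrupt(x, k, value=10**9):
    """Replace the first k observations with an arbitrarily bad value."""
    return [value] * k + x[k:]

# The mean breaks after a single corrupted observation (breakdown point 1/n) ...
print(mean(corrupt(data, 1)), median(corrupt(data, 1)))
# ... while the median only breaks once about n/2 observations are corrupted.
print(median(corrupt(data, 6)))
```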

SLIDE 47

Robust Statistics

Recap of last few slides I

Robustness is about deviations from assumptions.
Every meaningful statistic/algorithm T(·) assumes something (no-free-lunch theorems).
Robust methods are consistent and become only slowly biased towards contamination.
Robustness can be measured with the (asymptotic) breakdown point.

SLIDE 48

Robust Statistics

Recap of last few slides II Want to test if T (·) is robust?

  • 1. Find dataset X where assumptions of T (·) hold
  • 2. Compute T (X)
  • 3. Contaminate X to X ′ so that assumptions of T (·) are violated
  • 4. Compute T (X ′)

SLIDE 49

Robust Statistics

Final Remark: Efficiency

Robust methods are needed when there are anomalies in the data.
Robustness alone is not enough: T(·) also needs to be good at estimating θ → statistical efficiency.

SLIDE 50

Anomaly Detection

SLIDE 51

Anomaly Detection

Introduction

There are many "anomaly detection" methods:
Density-based techniques
One-class support-vector machines
Artificial neural networks
...

Why are there so many? Performance depends largely on the dataset. (Why?) There are many types of anomalies, and different settings require different methods.

SLIDE 52

Anomaly Detection

Objective

Apparent goal: detect when something unexpected/abnormal happens.
What data is available? The given data might contain very many anomalies ... or none.

→ True goal: Need to learn what is normal

Normality is typically defined by the problem context, not by data

SLIDE 54

Anomaly Detection

A classical pitfall

[Plots omitted; axis ranges 0–8×10⁸ and 10–25]

SLIDE 55

Anomaly Detection

How can we learn what is normal?

Expert-based (traditional)

  • 1. Acquire domain expertise
  • 2. Analyze data and formulate rules
  • 3. Test rules

Model-driven (traditional statistics)

  • 1. Understand problem
  • 2. Make assumptions and model
  • 3. Compare model with data

Data-driven (data science)

  • 1. Analyze data
  • 2. Derive model from data & problem understanding
  • 3. Search deviations from model in data

SLIDE 56

Anomaly Detection

How can we learn from data what is normal? I

  • 1. Labeled data with normal and anomalous records

Goal: Learn to detect labeled anomalies → reduction to a classification problem.
+ Super easy compared to other settings!
– What about new anomalies?

SLIDE 57

Anomaly Detection

How can we learn from data what is normal? II

  • 2. Labeled data with only normal records (and maybe unlabeled data)

Goal: Learn the boundaries of what is normal; no assumptions made about anomalies.
+ Best setting for successful anomaly detection!
– Setting very rare

SLIDE 58

Anomaly Detection

How can we learn from data what is normal? III

  • 3. Unlabeled data

Goal: Find deviating data; hard to learn what is normal.
+ Most common practical setting
– Impossible to truly solve (needs strong assumptions)

SLIDE 59

Anomaly Detection

Overview: Settings and Methods

  • 1. Fully labeled data → Supervised anomaly detection
  • 2. Labeled normal data → Unsupervised anomaly detection
  • 3. Unlabeled data → Method-based anomaly detection

SLIDE 60

Setting: Fully Labeled Data

SLIDE 61

Setting: Fully Labeled Data

Overview I

Setting: labeled training set; learn to classify normal and abnormal data
→ Classification problem

Examples:
Distinguish between normal cell growth and cancer
Recognize attack signatures in normal web traffic

SLIDE 62

Setting: Fully Labeled Data

Overview II

Suggested approach: supervised learning
Statistical regression methods
Support vector machines
Classical neural networks
Deep neural networks
...

SLIDE 63

Setting: Fully Labeled Data

Method 1.1: K-nearest neighbor classification

The class of a query point is the majority class among its k nearest neighbors
→ assumes anomalies are close to each other.

Critical component: the distance function
Euclidean distance
Mahalanobis distance
...
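With fully labeled data this reduces to ordinary k-NN classification; a minimal sketch, assuming scikit-learn is available and using made-up toy values:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled data: 0 = normal, 1 = anomaly
X = np.array([[1.0], [1.2], [0.9], [1.1], [8.0], [8.3], [7.9]])
y = np.array([0, 0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit(X, y)

print(clf.predict([[1.05], [8.1]]))  # → [0 1]
```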

By Antti Ajanki AnAj - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2170282

SLIDE 64

Setting: Fully Labeled Data

Method 1.2: Support Vector Machines

Construct a hyperplane that separates the classes.
To solve nonlinear problems, an extension is needed: kernels
Polynomial
Radial basis function
Hyperbolic tangent
...
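A sketch, again assuming scikit-learn, with an RBF kernel so the decision boundary need not be linear (the data and parameters are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D labeled data: an anomaly cluster away from the normal cluster
rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 0.5, size=(40, 2))
X_anom = rng.normal(0.0, 0.5, size=(10, 2)) + 4.0
X = np.vstack([X_normal, X_anom])
y = np.array([0] * 40 + [1] * 10)

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X, y)

print(clf.predict([[0.1, -0.2], [4.2, 3.8]]))  # → [0 1]
```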

By Larhmam - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/wiki/File:SVM_margin.png

SLIDE 65

Setting: Fully Labeled Data

Problems I

While supervised methods learn to classify data as normal or anomalous...
... they do not learn what is normal.
Only the border between seen anomalies and normal data is learned.
Unseen anomalies are not considered.

SLIDE 66

Setting: Fully Labeled Data

Problems II

Only applicable when all possible types of anomalies are known. Examples:
Detect cheating at simple gambling → always unusually high winnings
Classical (naive) anti-virus approaches → learn attack signatures

SLIDE 67

Setting: Labeled Normal Data

SLIDE 68

Setting: Labeled Normal Data

Overview I

Setting: dataset with only normal data
Learn what is normal
Decide how likely unlabeled data are normal

SLIDE 69

Setting: Labeled Normal Data

Overview II

This is the most promising setting!
Not restricted to certain anomaly types
Ideal for handling new anomalies
But: labeled normal data are rare in practice

Suggested approach: unsupervised learning

SLIDE 70

Setting: Labeled Normal Data

Method 2.1: Multivariate kernel density estimation

Estimate probability density functions; assigns probabilities to the entire space.
Assumption: unlikely = anomalous
Needs a good kernel function

Duong, Tarn. "ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R." Journal of Statistical Software 21.7 (2007): 1-16.
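A sketch with scikit-learn's KernelDensity (the Gaussian kernel, bandwidth, and 1% density threshold are assumptions): fit the density on normal data only, then flag low-density queries.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(200, 2))  # training set: normal data only

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_normal)

queries = np.array([[0.0, 0.0], [6.0, 6.0]])
log_dens = kde.score_samples(queries)  # log-density of each query

# Flag queries less likely than 99% of the normal training data
threshold = np.quantile(kde.score_samples(X_normal), 0.01)
print(list(log_dens < threshold))  # → [False, True]
```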

SLIDE 71

Setting: Labeled Normal Data

Method 2.2: One-class support vector machines

Planar approach: hyperplane between the data and the origin; maximize the distance.
Spherical approach (support vector data descriptors): hypersphere around the data; minimize the volume.
Needs a good kernel function

Muñoz-Marí, Jordi, et al. "Semisupervised one-class support vector machines for classification of remote sensing data." IEEE Transactions on Geoscience and Remote Sensing 48.8 (2010): 3188-3197.
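The planar variant is available as OneClassSVM in scikit-learn; a sketch with illustrative parameter values:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(200, 2))  # only normal data for training

# nu bounds the fraction of training points allowed outside the boundary
oc = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)

print(oc.predict([[0.0, 0.0], [6.0, 6.0]]))  # +1 = normal, -1 = anomaly
```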

SLIDE 72

Setting: Labeled Normal Data

Method 2.3: Autoencoders

Learn to replicate the data; collect the reconstruction error for unlabeled queries.
Low error: normal
High error: anomaly
Important: needs a large training data set!
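The deck does not fix a framework; as a stand-in sketch, scikit-learn's MLPRegressor trained to reproduce its own input acts as a tiny autoencoder (the architecture, data, and error comparison are all illustrative assumptions, not the method as deployed in practice):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Normal data: a tight 4-D cluster around a hypothetical operating point
X_normal = rng.normal(0.0, 0.1, size=(500, 4)) + [1.0, 2.0, 3.0, 4.0]

# A 2-unit bottleneck forces a compressed representation of the 4-D inputs
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=5000, random_state=0)
ae.fit(X_normal, X_normal)  # target = input: learn to replicate the data

def reconstruction_error(x):
    return float(np.mean((ae.predict(x) - x) ** 2))

err_normal = reconstruction_error(np.array([[1.0, 2.0, 3.0, 4.0]]))
err_anomaly = reconstruction_error(np.array([[10.0, -5.0, 0.0, 20.0]]))
print(err_normal < err_anomaly)  # the anomaly reconstructs much worse
```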

By Chervinskii - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45555552

SLIDE 73

Setting: Unlabeled Data

SLIDE 74

Setting: Unlabeled Data

Overview I

Setting: unlabeled dataset, no context information available, limited domain expertise
→ Worst scenario

How to distinguish between normal and anomalous? No method for learning normality.
How can detection results be evaluated?

SLIDE 75

Setting: Unlabeled Data

Overview II

Solution: make assumptions
No learning without assumptions (no-free-lunch theorems)
Assume that outliers according to method Y are anomalies
Important: use simple detection methods!

SLIDE 76

Setting: Unlabeled Data

Method 3.1: Local outlier probability

Local Outlier Factor: estimate the local density; low local density → anomaly. But how to interpret the deviation?
Local Outlier Probability: estimate the local density, then estimate an outlier probability.
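LoOP itself is not in scikit-learn; as a hedged stand-in, the closely related Local Outlier Factor is, and its scores play the same role without the probability calibration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               [[5.0, 5.0]]])  # one obvious anomaly appended

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 = outlier, +1 = inlier

print(labels[-1])  # the appended point is flagged as -1
```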

Kriegel, Hans-Peter, et al. "LoOP: local outlier probabilities." Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, 2009.

SLIDE 77

Setting: Unlabeled Data

Method 3.2: Isolation forest

Isolation tree:

  • 1. Randomly split the data with a hyperplane
  • 2. Repeat until every point is isolated
  • 3. Evaluate the number of partitions

Few partitions to isolate → anomaly
Many partitions to isolate → inlier

Isolation Forest:

  • 1. Grow many isolation trees
  • 2. Compare the trees
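The procedure above can be sketched with scikit-learn's IsolationForest (toy data; parameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               [[6.0, 6.0]]])  # one far-away point

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = forest.predict(X)  # -1 = anomaly (isolated in few splits), +1 = inlier

print(labels[-1])  # → -1
```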

Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation forest." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.

SLIDE 78

Setting: Unlabeled Data

Method 3.3: DBSCAN

Cluster data according to density:

  • 1. Compute k-NN distances
  • 2. Check which data have many neighbors
  • 3. Connect "dense" data
  • 4. Points in no cluster are anomalies

Returns a clustering and anomalies
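A sketch with scikit-learn's DBSCAN, where label -1 marks noise points, i.e. the anomalies (eps and min_samples are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),   # dense cluster A
               rng.normal(5.0, 0.3, size=(50, 2)),   # dense cluster B
               [[2.5, 2.5]]])                        # isolated point between them

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_[-1])  # → -1 (noise = anomaly)
```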

By Chire - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17045963

SLIDE 79

Final Remarks

SLIDE 80

Final Remarks

Tools

Robust statistics:
https://cran.r-project.org/web/views/Robust.html
https://www.iumsp.ch/en/software/robust-statistics
AstroPy

Anomaly detection:
DDoutlier
ELKI
anomaly (R package)
scikit-learn
Tensorflow, Keras

SLIDE 81

Final Remarks

Further Reading

Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly detection: A survey." ACM Computing Surveys (CSUR) 41.3 (2009): 15.
Zimek, Arthur, and Peter Filzmoser. "There and back again: Outlier detection between statistical reasoning and data mining algorithms." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8.6 (2018): e1280.
Campos, Guilherme O., et al. "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study." Data Mining and Knowledge Discovery 30.4 (2016): 891-927.
Görnitz, Nico, et al. "Toward supervised anomaly detection." Journal of Artificial Intelligence Research 46 (2013): 235-262.

SLIDE 82

Final Remarks

The End

Thank you for your attention!

SLIDE 83

Final Remarks

References I

Barnett, V. and Lewis, T. (1974). Outliers in statistical data. Wiley.
Farman, J. C., Gardiner, B. G., and Shanklin, J. D. (1985). Large losses of total ozone in Antarctica reveal seasonal ClOx/NOx interaction. Nature, 315(6016):207.
Grubbs, F. E. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11(1):1-21.
Hawkins, D. M. (1980). Identification of outliers, volume 11. Springer.

SLIDE 84

Final Remarks

References II

Quinn, T. P., Nguyen, T., Lee, S. C., and Venkatesh, S. (2019). Cancer as a tissue anomaly: classifying tumor transcriptomes based only on healthy data. Frontiers in Genetics, 10.
Zimek, A. and Filzmoser, P. (2018). There and back again: Outlier detection between statistical reasoning and data mining algorithms. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(6):e1280.
