
Anomalies in Data (KDDM2), Maximilian Toller



  1. Anomalies in Data. Maximilian Toller, Know-Center. KDDM2. www.tugraz.at (SCIENCE PASSION TECHNOLOGY)

  2. Anomalies in Data: Recall from earlier

  3. What are Outliers? A recap from KDDM1

  4. What are Outliers? Definitions. "An observation that appears to deviate markedly from other members of the sample in which it occurs." (Grubbs, 1969) "An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data." (Barnett and Lewis, 1974) "An observation, which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism." (Hawkins, 1980)

  5. What are Outliers? Examples (easy). [Figure: plot of Y against X with points labeled as inliers, outliers (Grubbs, Barnett), and outliers (Grubbs, Barnett, Hawkins).]

  6. What are Outliers? Examples (more difficult). [Figure: plot of y against x.]

  7. What are Outliers? Examples (more difficult). [Figure: two panels, each plotting y against x.]

  8. What are Outliers? Examples (more difficult). [Figure: plot of y against x.]

  9. What are Outliers? Examples (more difficult). [Figure: two panels, each plotting y against x.]

  10. What are Outliers? Methods: Preview. There are many outlier detection methods: local outlier factor, angle-based outlier degree, artificial neural networks, ... Why are there so many?

  11. What are Anomalies?

  12. What are Anomalies? Difference from Outliers. In the literature, outlier and anomaly are used interchangeably. For both, only vague and very similar definitions exist. However, the terms have different origins and different typical use. Outliers typically are motivated by statistics, are unusual data, and are investigated by traditional researchers and statisticians. Anomalies typically require context, are abnormal events, and are investigated by data analysts and data scientists.

  13. What are Anomalies? Example: Credit card fraud. Billions of dollars lost every year; fraudulent transactions often significantly different; difficult to disguise fraud such that it is not visible on any scale.

  14. What are Anomalies? Example: Cancer. One of the most common causes of human death; disease with abnormal cell growth; cancer has an abnormal gene expression signature (Quinn et al., 2019).

  15. What are Anomalies? The role of context. Abnormality is context-dependent. Discordant data problem (credit card fraud example): many normal observations, rare outlying data. Anomaly class problem (cancer example): normal data class, anomaly classes. Can data define abnormality?

  16. Unlikely, Discordant and Contaminated Data: How to interpret suspicious data

  17. Unlikely, Discordant and Contaminated Data: The Case of Hadlum vs Hadlum. Mr Hadlum accuses Mrs Hadlum of adultery. Sole evidence: birth of a child 349 days after Mr Hadlum left the country. Average human gestation period: 280 days. (Barnett and Lewis, 1974)

  18. Unlikely, Discordant and Contaminated Data: The Case of Hadlum vs Hadlum. Mr Hadlum conjectured a different distribution (red). The judges did not find Mrs Hadlum guilty, since 349 days is unlikely but not impossible (blue). (Modern research showed that more than 340 days is impossible.) (Zimek and Filzmoser, 2018)

  19. Unlikely, Discordant and Contaminated Data: The Antarctic Ozone Hole. Ozone layer protects Earth from solar radiation; damaged by human emissions of chlorofluorocarbons; high depletion (hole) above the poles. Image: https://de.wikipedia.org/wiki/Datei:Ozone_layer.jpg

  20. Unlikely, Discordant and Contaminated Data: The (Ant)Arctic Ozone Hole. Farman et al. (1985) discover the hole in a field study. The authors are hesitant to publish: Nimbus satellite data showed no drop. Problem: largely deviating values were discarded as measurement errors. Image credit: NASA/JPL-Caltech

  21. Unlikely, Discordant and Contaminated Data: Definition

      | Unlikely data | Discordant data | Contamination |
      |---|---|---|
      | Position of judges | Position of Mr Hadlum | "Wrong day of birth?" |
      | "Random drop of ozone not caused by humans" | Ozone field study by Farman et al. (1985) | Satellite measurement error |
      | Data unlikely but still normal | Data too unlikely to be normal | Data incorrect or misleading |
      | No correction | Correction of model | Correction of data |
      | Action: none | Action: investigate | Action: remove |

  22. Unlikely, Discordant and Contaminated Data: Implications. It is hard to classify data as unlikely, discordant or contaminated. No universal decision criterion; domain knowledge as remedy; ultimately subjective.

  23. Unlikely, Discordant and Contaminated Data: Strategies. (1) Try to ignore anomalies (not interesting). (2) Find anomalies for investigation or removal (interesting).

  24. Robust Statistics: Data Analysis in Presence of Anomalies

  25. Robust Statistics: Introduction I. Setting: potentially contaminated dataset; majority uncontaminated; cannot find or remove contamination, e.g. inserted by an attacker. Task: analyze data in spite of contamination, understand what is normal.

  26. Robust Statistics: Introduction II. Challenges: no prior information about the data; contamination may be arbitrarily "bad" (adversarial). Question: Which methods are suitable?

  27. Robust Statistics: Example: Mean and variance. Two common estimators: sample mean $\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j$ and sample variance $\hat{\sigma}_x^2 = \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2$. Mean and variance are influenced by contamination. Original: $x = [1, 3, 2, 1, 9, 2, 3, 2, 3, 2, 2, 1]$, $\bar{x} \approx 2.58$, $\hat{\sigma}_x^2 \approx 4.63$. Clean: $y = [1, 3, 2, 1, 2, 3, 2, 3, 2, 2, 1]$, $\bar{y} = 2$, $\hat{\sigma}_y^2 = 0.6$.

  28. Robust Statistics: Example: Mean and variance. What happens when an attacker corrupts data unfavorably? Attack #1: $a_1 = [1, 3, 2, 1, 900, 2, 3, 2, 3, 2, 2, 1]$, $\bar{a}_1 \approx 76.83$, $\hat{\sigma}_{a_1}^2 \approx 67200.88$. Attack #2: $a_2 = [1, 3, 2, 1, 900000000, 2, 3, 2, 3, 2, 2, 1]$, $\bar{a}_2 \approx 7.5 \times 10^{7}$, $\hat{\sigma}_{a_2}^2 \approx 6.75 \times 10^{16}$. Attack #3: $a_3 = [1, 3, 2, 1, \infty, 2, 3, 2, 3, 2, 2, 1]$, $\bar{a}_3 = \infty$, $\hat{\sigma}_{a_3}^2 = \infty$. → Mean and variance are not robust.
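The mean and variance figures from the robust statistics example can be reproduced with a few lines of Python. This is a minimal sketch: the function names `sample_mean` and `sample_variance` are illustrative choices rather than anything defined in the slides, and the data vectors are copied from the example.

```python
import math


def sample_mean(xs):
    """Sample mean: (1/n) * sum of x_j."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sample variance with the 1/(n-1) normalisation used in the slides."""
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


# Data vectors copied from the slides.
x = [1, 3, 2, 1, 9, 2, 3, 2, 3, 2, 2, 1]           # original
y = [1, 3, 2, 1, 2, 3, 2, 3, 2, 2, 1]              # clean (the 9 removed)
a1 = [1, 3, 2, 1, 900, 2, 3, 2, 3, 2, 2, 1]        # attack #1
a2 = [1, 3, 2, 1, 900000000, 2, 3, 2, 3, 2, 2, 1]  # attack #2
a3 = [1, 3, 2, 1, math.inf, 2, 3, 2, 3, 2, 2, 1]   # attack #3

# A single corrupted entry drags both estimators arbitrarily far.
for name, data in [("x", x), ("y", y), ("a1", a1), ("a2", a2), ("a3", a3)]:
    mean = sample_mean(data)
    var = sample_variance(data)
    print(f"{name}: mean = {mean:.4g}, variance = {var:.4g}")
```

One caveat for attack #3: in floating-point arithmetic this direct formula yields `nan` rather than infinity for the variance, because the term `inf - inf` is undefined. The conclusion is unchanged, since the estimate is unusable in either case.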
