KEEP IT CLEAN: WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT

  1. KEEP IT CLEAN: WHY BAD DATA RUINS PROJECTS AND HOW TO FIX IT (30/04/2019)

  2. HOW BAD DATA AFFECTS RESULTS

  6. The AI system has been put through rigorous testing that took place in collaboration with the U.K.'s Royal College of Physicians, as well as researchers from Stanford University and the Yale New Haven Health System.
     Georgiou, Aristos. 2018. “This Artificial Intelligence Platform Can Provide Health Advice That Is as Accurate as a Real Doctor’s.” Newsweek. June 27, 2018. https://www.newsweek.com/ai-can-provide-health-advice-which-good-real-doctors-998461.

  7. Part of this testing involved the AI taking a medical diagnosis exam that trainee primary care physicians in the U.K. must pass to be able to practice independently. Remarkably, the AI doctor scored 81 percent on its first attempt. The average pass mark over the past five years for real doctors was 72 percent.

  8. further tests that mimic real-life scenarios were also conducted... And when tested only on common conditions, the AI’s accuracy jumped to 98 percent, compared with a range of 52 percent to 99 percent for the real physicians.

  10. https://twitter.com/DrMurphy11/status/1118618977742274560

  11. “Babylon Health Erases AI Test Event for Its Chatbot Doctor.” 2019. AI News (blog). April 12, 2019. https://www.artificialintelligence-news.com/2019/04/12/babylon-health-ai-test-gp-at-hand/.

  13. Google Translate

  15. Swinger, Nathaniel, Maria De-Arteaga, Neil Thomas Heffernan IV, Mark DM Leiserson, and Adam Tauman Kalai. 2018. “What Are the Biases in My Word Embedding?” ArXiv:1812.08769 [Cs], December. http://arxiv.org/abs/1812.08769.

  16. Demontis, Ambra, Marco Melis, Maura Pintor, Matthew Jagielski, Battista Biggio, Alina Oprea, Cristina Nita-Rotaru, and Fabio Roli. 2018. “Why Do Adversarial Attacks Transfer? Explaining Transferability of Evasion and Poisoning Attacks,” September. https://arxiv.org/abs/1809.02861v2.
      Ebrahimi, Javid, Daniel Lowd, and Dejing Dou. 2018. “On Adversarial Examples for Character-Level Neural Machine Translation,” June. https://arxiv.org/abs/1806.09030v1.

  17. https://cloud.google.com/vision/docs/drag-and-drop

  19. https://www.ibmbigdatahub.com/infographic/four-vs-big-data
      https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year
      Gartner, Dirty data is a business problem, not an IT problem, 2007, now removed

  20. Loten, Angus. 2019. “AI Efforts at Large Companies May Be Hindered by Poor Quality Data.” Wall Street Journal, March 4, 2019, sec. C Suite. https://www.wsj.com/articles/ai-efforts-at-large-companies-may-be-hindered-by-poor-quality-data-11551741634.

  21. BAD DATA INTRODUCES AN EXTRAORDINARY AMOUNT OF TECHNICAL DEBT

  22. WHY BAD DATA AFFECTS RESULTS
      Deduction: Newton
      Induction: Sherlock Holmes

  25. GROUP QUESTION: WHAT IS THE DEADLIEST ANIMAL IN AUSTRALIA?

  26. https://www.bbc.co.uk/news/world-australia-38592390

  27. A 21-year-old Australian tradesman has been bitten by a venomous spider on the penis for a second time. Jordan, who preferred not to reveal his surname, said he was bitten on "pretty much the same spot" by the spider. "I'm the most unlucky guy in the country at the moment," he told the BBC.
      https://www.bbc.co.uk/news/world-australia-37481251

  28. VISUALISING DATA
      Always visualise your data. How?
      Histogram; scatter plot (matrix); segmented (faceted) bar chart; nullity plot; correlation plot.
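As a rough illustration of why even the simplest plot helps, here is a minimal text histogram using only the Python standard library. The `ages` list and the `text_histogram` helper are made up for this sketch; in practice a plotting library would be the usual choice.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return one line per non-empty bin: lower bin edge and a bar of '#'."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return [f"{edge:>4} | {'#' * bins[edge]}" for edge in sorted(bins)]


# Hypothetical data: 250 is an obvious data-entry error.
ages = [22, 25, 27, 31, 33, 38, 41, 45, 250]
lines = text_histogram(ages, 10)
print("\n".join(lines))
```

The stray bar far from the rest of the distribution is the kind of problem that summary statistics alone tend to hide.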

  29. DATA AVAILABILITY
      The availability of data defines what you can and can't use (see nullity plots).
      Keep as much detail as possible.
      Preserve versions.
      CR not CRUD! (Create and read only; never update or delete.)
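One way to read "CR not CRUD" is as an append-only store: corrections are written as new versions rather than overwriting old values. The sketch below is only an illustration of that idea; the `AppendOnlyStore` class and the key names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AppendOnlyStore:
    """Keeps every version of every record; supports only create and read."""

    _versions: Dict[str, List[Any]] = field(default_factory=dict)

    def create(self, key: str, value: Any) -> int:
        """Append a new version for `key` and return its version number."""
        history = self._versions.setdefault(key, [])
        history.append(value)
        return len(history) - 1

    def read(self, key: str, version: int = -1) -> Any:
        """Return the requested version (the latest by default)."""
        return self._versions[key][version]

    def history(self, key: str) -> List[Any]:
        """Return a copy of all versions, oldest first."""
        return list(self._versions[key])


store = AppendOnlyStore()
store.create("patient-1/age", None)  # value unknown at first
store.create("patient-1/age", 42)    # a correction is a new version, not an overwrite
```

Because nothing is updated in place, any earlier state of the data can be reconstructed later, for example to reproduce the training set of an old model.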

  30. DATA CONSISTENCY
      Consistent data is stable (over time, space, ...).
      Consistency can improve the quantity and quality of data, and hence improve model performance.
      Use consistent definitions for metrics.

  31. DATA LEAKAGE
      It is very easy to accidentally include future data in the training data, for example by:
      Oversampling
      Running dimensionality reduction on the whole dataset
      Preprocessing over the whole dataset
      Including a feature that is only populated after the label has been applied
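The "preprocessing over the whole dataset" case can be sketched with a simple standardisation step. The numbers and helper names below are invented for illustration; the point is only that the scaling statistics should be fitted on the training rows and then reused on the test rows.

```python
import statistics

# Toy time-ordered feature; the last value is an extreme "future" observation.
values = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 100.0]
train, test = values[:5], values[5:]


def fit_scaler(sample):
    """Learn the mean and standard deviation from the given sample only."""
    return statistics.mean(sample), statistics.pstdev(sample)


def transform(sample, mean, std):
    """Standardise values with previously fitted statistics."""
    return [(x - mean) / std for x in sample]


# Leaky: the statistics are computed over train AND test rows.
leaky_mean, leaky_std = fit_scaler(values)

# Safe: fit on the training rows, then reuse those statistics on the test rows.
train_mean, train_std = fit_scaler(train)
train_scaled = transform(train, train_mean, train_std)
test_scaled = transform(test, train_mean, train_std)
```

In the leaky variant the extreme test value shifts the mean that is applied to the training rows, so information from the "future" has already reached the model before evaluation.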

  32. MISSING DATA
      Missing data doesn't necessarily mean numpy.nan!

      >>> print(titanic.count())
      pclass       1309
      survived     1309
      name         1309
      sex          1309
      age          1046
      sibsp        1309
      parch        1309
      ticket       1309
      fare         1308
      cabin         295
      embarked     1307
      boat          486
      body          121
      home.dest     745
      dtype: int64
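A count like the one above only catches values that are already encoded as null. The following sketch, written with the standard library rather than pandas, shows one possible way to also treat placeholder values as missing. The `SENTINELS` set and the sample rows are assumptions made for this example and would need to be chosen per dataset.

```python
import math

# Hypothetical placeholder values that often stand in for "missing".
SENTINELS = {"", "?", "na", "n/a", "none", "null", "unknown", "-1", "-999"}


def is_missing(value):
    """Return True for None, NaN, or a value whose text matches a sentinel."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in SENTINELS


def count_present(rows, column):
    """Count the rows whose value in `column` is genuinely populated."""
    return sum(not is_missing(row.get(column)) for row in rows)


rows = [
    {"age": 29.0, "cabin": "B5"},
    {"age": float("nan"), "cabin": ""},
    {"age": -1, "cabin": "unknown"},
    {"age": 40.0, "cabin": "C22"},
]
print(count_present(rows, "age"), count_present(rows, "cabin"))
```

A sentinel that marks a gap in one column (such as -1 for an age) may be a perfectly valid value in another, so the set is best kept per column in real use.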

  33. FIXING MISSING DATA
      Remove (rows or columns)
      Impute, simple: natural null, mean, median
      Impute, complex: regression, random sampling, jitter
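Two of the imputation options listed above, median and random sampling, can be sketched in a few lines. This is a minimal illustration with made-up data, using `None` to mark a missing value; it is not a recommendation of one strategy over the other.

```python
import random
import statistics


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def impute_random_sample(values, seed=0):
    """Replace None with values drawn at random from the observed ones."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


ages = [22, None, 35, 58, None, 41]  # hypothetical column with two gaps
median_filled = impute_median(ages)
sampled_filled = impute_random_sample(ages)
```

Median imputation gives every gap the same value and so narrows the distribution, while random sampling roughly preserves its shape at the cost of adding randomness to individual rows.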

  34. NOISE: WHAT IS NOISE?
      Weeds are just flowers that you don't like.
      Noise is data that you don't like.

  35. NOISE: TYPES OF NOISE
      Class; feature (column); observation (row).

  36. NOISE: IMPROVING NOISE
      Aggregation: average (stacking/beamforming/radon transform), median (popcorn noise)
      Simple modelling: smoothing, normalisation
      Complex modelling: regression or fitting
      Dimensionality reduction and restoration: transformations (FFT, wavelet), encoding/embedding (autoencoder, NLP embeddings)
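The pairing of the median with popcorn (impulse) noise can be illustrated by comparing a rolling mean with a rolling median on a toy signal. The signal and the `rolling` helper below are assumptions for this sketch only.

```python
import statistics


def rolling(values, window, reducer):
    """Apply `reducer` over a centred window, shrinking it at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(reducer(values[lo:hi]))
    return out


# A flat signal with two isolated spikes ("popcorn" noise).
signal = [1.0, 1.0, 9.0, 1.0, 1.0, 1.0, -7.0, 1.0, 1.0]

mean_smoothed = rolling(signal, 3, statistics.mean)
median_smoothed = rolling(signal, 3, statistics.median)
```

With a window of three, the median removes each isolated spike entirely, whereas the mean spreads it over the neighbouring samples.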

  37. ANOMALIES (A.K.A. OUTLIERS)
      Data that is not expected (in a statistical sense).

  39. ANOMALY TYPES
      Contextual - possibly good
      Corrupted - usually not good
      Measurement errors or failures; API changes; regulatory changes; shift in behaviour; formatting changes

  40. DETECTING ANOMALIES
      A large field in its own right.
      1. Define what is normal (through a model)
      2. Set a threshold to define "not normal"
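The two steps above can be sketched with a deliberately simple model of "normal": the median as the centre and the median absolute deviation (MAD) as the spread. This choice of model, the 3.5 cut-off (a common convention for the modified z-score), and the sample readings are all assumptions for illustration; the slides do not prescribe a particular method.

```python
import statistics


def fit_normal(values):
    """Step 1: model "normal" as a centre (median) and a spread (MAD)."""
    centre = statistics.median(values)
    mad = statistics.median([abs(v - centre) for v in values])
    return centre, mad


def find_anomalies(values, centre, mad, threshold=3.5):
    """Step 2: flag values whose robust z-score exceeds the threshold."""
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are roughly normally distributed.
    return [v for v in values if 0.6745 * abs(v - centre) / mad > threshold]


readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0, 9.7]  # hypothetical sensor data
centre, mad = fit_normal(readings)
anomalies = find_anomalies(readings, centre, mad)
print(anomalies)
```

The median and MAD are used here because, unlike the mean and standard deviation, they are barely moved by the anomalies they are meant to detect. Note that the MAD is zero when more than half of the values are identical, which this sketch does not handle.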

  41. DETECTING ANOMALIES FOR DATA CLEANING
      1. Visualise your data!
      2. Everything else
         1. Classification task
         2. Clustering
         3. Regression/fitting + thresholds
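As one possible reading of "regression/fitting + thresholds", the sketch below fits a straight line by ordinary least squares and flags points whose residual is unusually large. The data, the cut-off `k`, and the helper names are assumptions made for this example.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_outliers(xs, ys, k=2.0):
    """Indices of points whose residual exceeds k residual standard deviations."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > k * spread]


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0.1, 2.0, 3.9, 6.1, 20.0, 10.0, 12.1, 13.9]  # index 4 is corrupted
flagged = residual_outliers(xs, ys)
print(flagged)
```

The corrupted point also pulls the fitted line towards itself, so with heavier contamination a robust fit or an iterative refit after removing flagged points may be needed.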

  42. EXAMPLE REGRESSION TASK - WINE QUALITY

  45. WHY NORMALITY IS IMPORTANT
