Data Quality Assurance, or How to Get Good Data, by Florian Netzer and Lars Wolf

Data Quality Assurance, or "How to get good data", by Florian Netzer & Lars Wolf
Image sources: stackexchange.com, texwelt.de, stackexchange.com (CC BY-SA)
17.06.2020 | Fachbereich Informatik | Software Engineering for Artificial Intelligence | Florian Netzer, Lars Wolf | 1

Overview
1. Data Collection: What are potential problems? How do you get good data?
2. Data Cleaning: What are potential problems? How do you get clean data?
3. Take-Aways: Tools for your use. Summary.

0. Why is data quality assurance important anyway?

Let's start with an example! We ask five people to rate the job applications of candidates applying as data scientists, on a scale from 1 to 5. What could be potential problems? Tell us on menti.com

Possible Issues

Where does your data come from?
US military in WW2: "We need more armour in the areas that were hit most." But the data came only from planes that returned from missions: Survivorship Bias.
Source: Wikipedia, McGeddon, CC BY-SA 4.0

What is your data not showing?
Is the dataset showing boats … or the sea? It depends on the negative examples!
Negative Set Bias (section 3.2 of [5])
Image sources: ImageNet dataset

Other types of biases
• Selection bias (e.g. camera angle)
• Bias in reality, e.g. searching for "3 black teenagers" vs. "3 white teenagers"

Feedback Loops
[Diagram: "ML system learning the preference of the user" and "ML system selecting products to show", connected in a loop]
Source: Hidden Technical Debt in Machine Learning Systems [6]

How much data do you need?
Depends on…
• Model type
• Number of parameters
• Number of features
About 10 times as many samples as parameters is a good place to start.
Source: Malay Haldar, "How much training data do you need?", medium.com
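The rule of thumb above is simple arithmetic. As a minimal sketch (the function name and the example model are hypothetical, and the factor of 10 is only the starting point the slide suggests, not a guarantee):

```python
def min_samples_rule_of_thumb(n_parameters: int, factor: int = 10) -> int:
    """Starting-point estimate: `factor` samples per trainable parameter."""
    if n_parameters < 0:
        raise ValueError("n_parameters must be non-negative")
    return factor * n_parameters


# Hypothetical linear model: 20 feature weights plus one bias term.
n_params = 20 + 1
print(min_samples_rule_of_thumb(n_params))  # 210
```

Treat the result as a lower bound to begin experiments with; the model type and the number of features can shift the real requirement considerably.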

More data is almost always better
Source: Scaling to Very Very Large Corpora for Natural Language Disambiguation

Value of a Dataset
…but only if the bias matches the test data!
Measuring Dataset's Value (section 4 of [5])

How do you get good data?
• Contains: context, action & outcome
• Avoid feedback loops
• Collected in interactions that users care about
• Best: implicit actions on real usage (avoiding interrater reliability issues)
• Test on other data as well! (cross-dataset generalization) [5]
As in: Building Intelligent Systems: A Guide to Machine Learning Engineering by Hulten [7]
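When explicit ratings are unavoidable, as in the opening example with five people rating job applications, it helps to measure how much the raters agree. The slides do not name a metric; the sketch below uses Fleiss' kappa, a standard agreement statistic for a fixed number of raters, as an assumed choice. The function name and the toy ratings are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    `ratings` is a list of lists; each inner list holds one category label
    per rater. Returns 1.0 for perfect agreement, about 0 for chance level.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement if raters picked categories at the overall rates.
    shares = [sum(cnt[cat] for cnt in counts) / (n_items * n_raters) for cat in categories]
    p_expected = sum(s * s for s in shares)

    if p_expected == 1.0:  # only one category was ever used
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Five raters, scale 1 to 5 (toy data).
unanimous = [[1] * 5, [3] * 5, [5] * 5]
scattered = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
print(fleiss_kappa(unanimous))            # 1.0
print(round(fleiss_kappa(scattered), 2))  # -0.25
```

A low or negative value signals that the labels say more about the raters than about the applications, which is the interrater reliability problem the slide refers to.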

How do you get clean data?
Let us introduce you to 3 examples (arranged along a "level of aggression" axis):
• Data Linting … for simple errors.
• Unit Tests for Data … for automatic constraints.
• HoloClean … for complex cleaning.

How do you get clean data? Data Linting
Data Linting detects potential (simple) errors, e.g.:
1. miscodings of data
2. outliers
3. packaging errors
The Data Linter (paper by Hynes et al.) [3]
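The three kinds of lint listed above can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions, not the Data Linter from the paper: the function names, the median-based outlier rule, and its cutoff of 3.5 are all choices made here for illustration.

```python
import statistics
from collections import Counter


def _is_number(text):
    """True if a string parses as a float, e.g. "42" or "3.5"."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def lint_column(name, values, cutoff=3.5):
    """Return warnings for one column, given as a list of raw cell values."""
    warnings = []

    # Lint 1, miscoding: every string in the column is really a number.
    strings = [v for v in values if isinstance(v, str)]
    if strings and all(_is_number(s) for s in strings):
        warnings.append(f"{name}: numbers encoded as strings")

    # Lint 2, outliers: modified z-score built on the median absolute deviation.
    numbers = [v for v in values if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if len(numbers) >= 3:
        median = statistics.median(numbers)
        mad = statistics.median([abs(v - median) for v in numbers])
        if mad > 0:
            outliers = [v for v in numbers if 0.6745 * abs(v - median) / mad > cutoff]
            if outliers:
                warnings.append(f"{name}: possible outliers {outliers}")
    return warnings


def duplicate_rows(rows):
    """Lint 3, packaging error: rows (as tuples) that appear more than once."""
    counts = Counter(rows)
    return [row for row, n in counts.items() if n > 1]


print(lint_column("age", ["23", "41", "35"]))
print(lint_column("income", [1, 2, 3, 2, 1, 1000]))
print(duplicate_rows([(1, "a"), (2, "b"), (1, "a")]))
```

Each function only reports a suspicion; as with a code linter, a human decides whether the warning points at a real problem.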

How do you get clean data? Data Linting
[Figure: Frequency of Data Lints Across 600 Kaggle Data Sets]
The Data Linter (paper by Hynes et al.) [3]

How do you get clean data? Unit Tests for Data
Unit tests for data test potential constraints in incrementally growing datasets, e.g.:
1. completeness
2. consistency
3. statistics
Automating Large-Scale Data Quality Verification (paper by Schelter et al.) [1]
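The three constraint types can be written as small checks that run on every new batch of data. The sketch below is plain Python and does not use the API of the system described in the paper; the column names, thresholds, and helper functions are hypothetical and reuse the job-application rating example from the beginning of the talk.

```python
def completeness(rows, column):
    """Fraction of rows whose value in `column` is present (not None)."""
    return sum(row.get(column) is not None for row in rows) / len(rows)


def all_in(rows, column, allowed):
    """Consistency: every present value belongs to the allowed set."""
    return all(row[column] in allowed for row in rows if row.get(column) is not None)


def mean(rows, column):
    """Statistic: mean of the present values in `column`."""
    present = [row[column] for row in rows if row.get(column) is not None]
    return sum(present) / len(present)


def failed_checks(rows):
    """Run all declared constraints and return the names of those that fail."""
    checks = {
        "applicant_id is complete": completeness(rows, "applicant_id") == 1.0,
        "rating is at least 90% complete": completeness(rows, "rating") >= 0.9,
        "rating is between 1 and 5": all_in(rows, "rating", {1, 2, 3, 4, 5}),
        "mean rating is between 2 and 4": 2 <= mean(rows, "rating") <= 4,
    }
    return [name for name, passed in checks.items() if not passed]


# A new batch arriving in an incrementally growing dataset (toy data).
batch = [
    {"applicant_id": 1, "rating": 4},
    {"applicant_id": 2, "rating": 5},
    {"applicant_id": 3, "rating": None},
    {"applicant_id": 4, "rating": 2},
]
print(failed_checks(batch))  # ['rating is at least 90% complete']
```

Running the same checks on each incoming batch turns data problems into failing tests instead of silent model degradation.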

How do you get clean data? Unit Tests for Data
Automating Large-Scale Data Quality Verification + Anomaly Detection + Constraint Suggestion (paper by Schelter et al.) [1]

How do you get clean data? HoloClean
HoloClean automatically cleans online data, combining:
1. integrity constraints
2. statistics
3. external data
HoloClean: Holistic Data Repairs with Probabilistic Inference (paper by Rekatsinas et al.) [2]

Take-Aways | Your toolbox
1. Visualize your data, e.g. by using Facets
2. Find mistakes in your data, e.g. by using a Data Linter
3. Automatically clean your data, e.g. by using HoloClean

Take-Aways | Summary
1. Watch out for biases!
2. Don't just use your data, look for fixable errors!
3. In online learning systems: test your data continually!
4. Good data should contain context, action & outcome!
5. If possible, test on other data!
6. Get enough data!

References
[1] Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F., & Grafberger, A. (2018). Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12), 1781–1794.
[2] Rekatsinas, T., Chu, X., Ilyas, I., & Ré, C. (2017). HoloClean: Holistic Data Repairs with Probabilistic Inference.
[3] Hynes, N., Sculley, D., & Terry, M. (2017). The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets. NIPS MLSys Workshop.
[4] Gao, J., Xie, C., & Tao, C. (2016). Big data validation and quality assurance: Issues, challenges, and needs. Proceedings of the 2016 IEEE Symposium on Service-Oriented System Engineering (SOSE 2016), 433–441.
[5] Torralba, A., & Efros, A. (2011). Unbiased look at dataset bias. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1521–1528.
[6] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems, 1–9.
[7] Hulten, G. (2018). Building Intelligent Systems: A Guide to Machine Learning Engineering. Apress.
Icons: Font Awesome by Dave Gandy - http://fontawesome.io

4. Questions & Discussion
