Data Quality Assurance, or „How to get good data“, by Florian Netzer & Lars Wolf




SLIDE 1

17.06.2020 | Fachbereich Informatik | Software Engineering for Artificial Intelligence | Florian Netzer, Lars Wolf | 1

Data Quality Assurance

Or „How to get good data“, by Florian Netzer & Lars Wolf

Image sources: stackexchange.com, texwelt.de, stackexchange.com (CC BY-SA)

SLIDE 2

Overview

  1. Data Collection: What are potential problems? How do you get good data?
  2. Data Cleaning: What are potential problems? How do you get clean data?
  3. Take-Aways: Tools for your use, Summary

SLIDE 3

0. Why is data quality assurance important anyway?

SLIDE 4

Overview

  1. Data Collection: What are potential problems? How do you get good data?
  2. Data Cleaning: What are potential problems? How do you get clean data?
  3. Take-Aways: Tools for your use, Summary

SLIDE 5

What could be potential problems?

Tell us on menti.com

Let‘s start with an example!

We tell five people to rate job applications of people applying as data scientists, on a scale from 1 to 5.

SLIDE 6

Possible Issues

SLIDE 7

Where does your data come from?

US-Military in WW2:

„We need more armour in the areas that were hit most“

Survivorship Bias: the data came only from planes that returned from missions.

Source: Wikipedia, McGeddon, CC BY-SA 4.0
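
The bomber story can be replayed as a small simulation. This is only a sketch under made-up assumptions: the section names, the loss probabilities and the "one hit per plane" rule are hypothetical and not taken from the historical data. It shows that the sections with the fewest recorded hits on returned planes can be exactly the ones where a hit is most dangerous.

```python
import random

SECTIONS = ["engine", "cockpit", "fuselage", "wings"]
# Assumption: probability that a hit in this section brings the plane down.
LOSS_PROBABILITY = {"engine": 0.8, "cockpit": 0.7, "fuselage": 0.1, "wings": 0.1}


def simulate(n_planes=10_000, seed=0):
    """Return (hits on all planes, hits on returned planes) per section."""
    rng = random.Random(seed)
    all_hits = {s: 0 for s in SECTIONS}
    returned_hits = {s: 0 for s in SECTIONS}
    for _ in range(n_planes):
        section = rng.choice(SECTIONS)  # every plane takes one hit, uniformly
        all_hits[section] += 1
        if rng.random() >= LOSS_PROBABILITY[section]:
            returned_hits[section] += 1  # only these planes can be inspected
    return all_hits, returned_hits


all_hits, returned_hits = simulate()
for s in SECTIONS:
    print(f"{s:9s} all={all_hits[s]:5d} returned={returned_hits[s]:5d}")
```

All sections are hit about equally often, but the returned planes mostly show hits on fuselage and wings. Armouring "the areas that were hit most" in that sample would protect the wrong places.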

SLIDE 8

What is your data not showing?

Is the dataset showing boats? …or the sea? It depends on the negative examples!

Negative Set Bias (Section 3.2 of [5])

Image sources: ImageNet dataset

SLIDE 9

Other types of biases

  • Selection bias (e.g. camera angle)
  • Bias in reality (e.g. searching for „3 black teenagers“ vs. „3 white teenagers“)

SLIDE 10

Feedback Loops

  • ML system selecting products to show
  • ML system learning the preference of the user

Source: Hidden Technical Debt in Machine Learning Systems [6]
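
A toy version of such a loop, as a sketch only: the two products, the click rates and the greedy "show whatever has the most clicks" rule are all hypothetical. The system selects what to show, and then learns only from what it has shown.

```python
import random

# Assumption: a hypothetical user who actually prefers product B.
TRUE_CLICK_RATE = {"A": 0.3, "B": 0.9}


def run(rounds=1000, seed=0):
    """Greedy selection that learns only from the products it has shown."""
    rng = random.Random(seed)
    shows = {"A": 0, "B": 0}
    clicks = {"A": 0, "B": 0}
    for _ in range(rounds):
        # Select: show the product with the most recorded clicks (ties go to "A").
        shown = max(("A", "B"), key=lambda p: clicks[p])
        shows[shown] += 1
        # Learn: only the shown product can generate new training data.
        if rng.random() < TRUE_CLICK_RATE[shown]:
            clicks[shown] += 1
    return shows, clicks


shows, clicks = run()
print("shown:", shows)
print("clicks:", clicks)
```

Product B is never shown, so the system never collects the data that would reveal the user's real preference. Its own selections have shaped its training data.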

SLIDE 11

How much data do you need?

Depends on…

  • Model-type
  • Number of parameters
  • Number of features

About 10 times more samples than parameters is a good place to start.

Source: Malay Haldar: How much training data do you need? medium.com
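
The rule of thumb above as simple arithmetic. The helper names are made up for this sketch, and the parameter count assumes a plain linear model (one weight per feature plus a bias). The factor of 10 is only a starting point, not a guarantee.

```python
def linear_model_parameters(n_features):
    """Parameters of a linear model: one weight per feature plus one bias."""
    return n_features + 1


def suggested_sample_count(n_parameters, factor=10):
    """Rule of thumb: about `factor` times more samples than parameters."""
    return factor * n_parameters


n_params = linear_model_parameters(n_features=20)
print(n_params, "parameters, start with about", suggested_sample_count(n_params), "samples")
```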

SLIDE 12

More data is almost always better

Source: Scaling to Very Very Large Corpora for Natural Language Disambiguation

SLIDE 13

Value of a Dataset

Measuring Dataset’s Value (Section 4 of [5])

…but only if the bias matches the test data!

SLIDE 14

How do you get good data?

  • Contains: context, action & outcome
  • Avoid feedback loops
  • Collected in interactions that users care about
  • Best: implicit actions on real usage (avoiding interrater reliability issues)
  • Test on other data as well! (cross dataset generalization) [5]

As in: Building Intelligent Systems: A Guide to Machine Learning Engineering by Hulten [7]
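
To see what an interrater reliability issue looks like in the job application example (five raters, scale from 1 to 5), here is a minimal sketch. The ratings are invented for illustration, and the two measures are deliberately crude; they are not an established statistic such as Cohen's kappa.

```python
from itertools import combinations

# Hypothetical ratings: one row per application, one column per rater (1 to 5).
ratings = [
    [4, 4, 5, 4, 4],
    [2, 5, 1, 3, 4],
    [3, 3, 3, 3, 3],
    [1, 2, 1, 5, 2],
]


def pairwise_agreement(row):
    """Fraction of rater pairs that gave exactly the same score."""
    pairs = list(combinations(row, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def spread(row):
    """Distance between the highest and the lowest score."""
    return max(row) - min(row)


for i, row in enumerate(ratings, start=1):
    print(f"application {i}: agreement={pairwise_agreement(row):.1f} spread={spread(row)}")
```

Applications with low agreement and a large spread are the ones where the label depends more on who rated than on the application itself.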

SLIDE 15

Overview

  1. Data Collection: What are potential problems? How do you get good data?
  2. Data Cleaning: What are potential problems? How do you get clean data?
  3. Take-Aways: Tools for your use, Summary

SLIDE 16

Possible Issues

SLIDE 17

Possible Issues

SLIDE 18

How do you get clean data?

Let us introduce you to 3 examples:

  • HoloClean …for automatic cleaning.
  • Data Linting …for simple errors.
  • Unit Tests for Data …for complex constraints.

(Diagram label: „level of aggression“)

SLIDE 19

How do you get clean data?

Data Linting

detects potential (simple) errors:

  1. miscodings of data
  2. outliers
  3. packaging errors

e.g.: The Data Linter (paper by Hynes et al. [3])
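
The three kinds of lints can be sketched in a few lines of plain Python. This is not the Data Linter tool or its API; the function names, the median-based outlier rule and its threshold are assumptions chosen for this illustration.

```python
from statistics import median


def lint_number_as_string(column):
    """Miscoding: every value is a string that parses as a number."""
    def is_numeric_string(value):
        if not isinstance(value, str):
            return False
        try:
            float(value)
            return True
        except ValueError:
            return False
    return bool(column) and all(is_numeric_string(v) for v in column)


def lint_outliers(column, threshold=5.0):
    """Outliers: values far from the median, in median absolute deviations."""
    med = median(column)
    mad = median(abs(v - med) for v in column)
    if mad == 0:
        return []
    return [v for v in column if abs(v - med) / mad > threshold]


def lint_duplicate_rows(rows):
    """Packaging error: the same row appears more than once."""
    seen, duplicates = set(), []
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates.append(row)
        seen.add(key)
    return duplicates


print(lint_number_as_string(["1", "2.5", "3"]))
print(lint_outliers([10, 11, 9, 10, 12, 500]))
print(lint_duplicate_rows([[1, "a"], [2, "b"], [1, "a"]]))
```

A lint only points at something suspicious. Whether it is really an error is still a decision for the person who knows the data.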

SLIDE 20

How do you get clean data?

Data Linting

The Data Linter (paper by Hynes et al.)

[Figure: Frequency of Data Lints Across 600 Kaggle Data Sets]

SLIDE 21

How do you get clean data?

Unit Tests for Data

tests potential constraints in incrementally growing datasets:

  1. completeness
  2. consistency
  3. statistics

e.g.: Automating Large-Scale Data Quality Verification (paper by Schelter et al. [1])
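
What a "unit test for data" could look like, as a sketch in plain Python. It does not use the API from the paper; the check names, the `rating` column and the expected ranges are assumptions for this example. The idea is to re-run the same checks on every new batch of a growing dataset.

```python
from statistics import mean


def completeness(rows, column):
    """Fraction of rows in which the column is present and not None."""
    return sum(row.get(column) is not None for row in rows) / len(rows)


def is_contained_in(rows, column, allowed):
    """Consistency: every non-missing value comes from the allowed set."""
    return all(row[column] in allowed for row in rows if row.get(column) is not None)


def mean_in_range(rows, column, low, high):
    """Statistics: the column mean stays within an expected range."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return low <= mean(values) <= high


def run_checks(rows):
    """Evaluate all constraints for one batch and report pass/fail per check."""
    return {
        "rating is complete": completeness(rows, "rating") == 1.0,
        "rating is between 1 and 5": is_contained_in(rows, "rating", {1, 2, 3, 4, 5}),
        "mean rating is plausible": mean_in_range(rows, "rating", 2.0, 4.0),
    }


# Hypothetical batch with one missing rating.
batch = [{"id": 1, "rating": 4}, {"id": 2, "rating": 3}, {"id": 3, "rating": None}]
for name, passed in run_checks(batch).items():
    print("PASS" if passed else "FAIL", name)
```

Here only the completeness check fails, which is the signal to stop and look at the batch before training on it.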

SLIDE 22

How do you get clean data?

Unit Tests for Data

Automating Large-Scale Data Quality Verification (paper by Schelter et al.)

+ Constraint Suggestion
+ Anomaly Detection

SLIDE 23

How do you get clean data?

HoloClean

automatically cleans online data, combining:

  1. integrity constraints
  2. statistics
  3. external data

HoloClean: Holistic Data Repairs with Probabilistic Inference (paper by Rekatsinas et al. [2])
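
To give a feeling for how the three signals can work together, here is a heavily simplified toy. It is not HoloClean, neither its API nor its probabilistic inference: the rows, the reference table, the rule "zip determines city" and the scoring are all assumptions made up for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical dirty rows and a hypothetical trusted reference table.
rows = [
    {"zip": "64289", "city": "Darmstadt"},
    {"zip": "64289", "city": "Darmstadt"},
    {"zip": "64289", "city": "Darmstad"},  # typo, violates "zip determines city"
    {"zip": "60311", "city": "Frankfurt"},
]
EXTERNAL = {"64289": "Darmstadt"}


def repair(rows, external, external_weight=1.0):
    """Return repaired copies of the rows; the input is left unchanged."""
    by_zip = defaultdict(Counter)
    for row in rows:
        by_zip[row["zip"]][row["city"]] += 1

    repaired = []
    for row in rows:
        counts = by_zip[row["zip"]]
        if len(counts) == 1:
            repaired.append(dict(row))  # integrity constraint holds, keep the row
            continue
        total = sum(counts.values())

        def score(city):
            s = counts[city] / total  # statistics: how common is this value?
            if external.get(row["zip"]) == city:
                s += external_weight  # external data agrees with this value
            return s

        # Candidates are only the values already observed for this zip.
        repaired.append({**row, "city": max(counts, key=score)})
    return repaired


for row in repair(rows, EXTERNAL):
    print(row)
```

The constraint finds the conflicting cells, and the statistics and the external table decide which value is the most plausible repair.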

SLIDE 24

Overview

  1. Data Collection: What are potential problems? How do you get good data?
  2. Data Cleaning: What are potential problems? How do you get clean data?
  3. Take-Aways: Tools for your use, Summary

SLIDE 25

Take-Aways | Your toolbox

  1. Visualize your Data, e.g. by using Facets
  2. Find mistakes in your Data, e.g. by using a Data Linter
  3. Automatically clean your Data, e.g. by using HoloClean

SLIDE 26

Take-Aways | Summary

  1. Watch out for biases!
  2. If possible, test on other data!
  3. Get enough data!
  4. Good data should contain context, action & outcome!
  5. Don‘t just use your data, look for fixable errors!
  6. In online learning systems: test your data continually!

SLIDE 27

References

[1] Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F., & Grafberger, A. (2018). Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12), 1781–1794.
[2] Rekatsinas, T., Chu, X., Ilyas, I., & Ré, C. (2017). HoloClean: Holistic Data Repairs with Probabilistic Inference.
[3] Hynes, N., Sculley, D., & Terry, M. (2017). The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets. NIPS MLSys Workshop.
[4] Gao, J., Xie, C., & Tao, C. (2016). Big data validation and quality assurance: Issues, challenges, and needs. Proceedings - 2016 IEEE Symposium on Service-Oriented System Engineering, SOSE 2016, 433–441.
[5] Torralba, A., & Efros, A. (2011). Unbiased look at dataset bias. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1521–1528.
[6] Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems, 1–9.
[7] Hulten, G. (2018). Building Intelligent Systems: A Guide to Machine Learning Engineering. Apress.

Icons: Font Awesome by Dave Gandy - http://fontawesome.io

SLIDE 28

4. Questions & Discussion

SLIDE 29

5. Additional Material

SLIDE 30

How do you get clean data?

Data Linting

The Data Linter (paper by Hynes et al.)

[Figure: Histogram of # of Lints per Data Set Across 600 Kaggle Data Sets]