DATA QUALITY AND DATA PROGRAMMING - PowerPoint PPT Presentation



SLIDE 1

DATA QUALITY AND DATA PROGRAMMING

Christian Kaestner

Required reading:
  • Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F. and Grafberger, A., 2018. Automating large-scale data quality verification. Proceedings of the VLDB Endowment, 11(12), pp.1781-1794.
  • Nick Hynes, D. Sculley, Michael Terry. "The Data Linter: Lightweight Automated Sanity Checking for ML Data Sets." NIPS Workshop on ML Systems (2017)

"Data cleaning and repairing account for about 60% of the work of data scientists."

SLIDE 2

LEARNING GOALS

  • Design and implement automated quality assurance steps that check data schema conformance and distributions
  • Devise thresholds for detecting data drift and schema violations
  • Describe common data cleaning steps and their purpose and risks
  • Evaluate the robustness of AI components with regard to noisy or incorrect data
  • Understand the better-models vs. more-data tradeoffs
  • Programmatically collect, manage, and enhance training data

SLIDE 3

DATA-QUALITY CHALLENGES

SLIDE 4

CASE STUDY: INVENTORY MANAGEMENT

SLIDE 5

INVENTORY DATABASE

Product Database: ID | Name | Weight | Description | Size | Vendor | ...
Stock: ProductID | Location | Quantity | ...
Sales history: UserID | ProductId | DateTime | Quantity | Price | ...

SLIDE 6

WHAT MAKES GOOD QUALITY DATA?

  • Accuracy: The data was recorded correctly.
  • Completeness: All relevant data was recorded.
  • Uniqueness: The entries are recorded once.
  • Consistency: The data agrees with itself.
  • Timeliness: The data is kept up to date.

SLIDE 7

DATA IS NOISY

  • Unreliable sensors or data entry
  • Wrong results and computations, crashes
  • Duplicate data, near-duplicate data
  • Out-of-order data
  • Invalid data formats
  • Examples?

SLIDE 8

DATA CHANGES

  • System objective changes over time
  • Software components are upgraded or replaced
  • Prediction models change
  • Quality of supplied data changes
  • User behavior changes
  • Assumptions about the environment no longer hold
  • Examples?

SLIDE 9

USERS MAY DELIBERATELY CHANGE DATA

  • Users react to model output
  • Users try to game/deceive the model
  • Examples?

SLIDE 10

MANY DATA SOURCES

[Diagram: data sources feeding the ML inventory system: Twitter, SalesTrends, AdNetworks, Inventory, VendorSales, ProductData, Marketing, Expired/Lost/Theft, PastSales]

sources of different reliability and quality

SLIDE 11

ACCURACY VS PRECISION

  • Accuracy: Reported values (on average) represent the real value
  • Precision: Repeated measurements yield the same result

  • Accurate, but imprecise: average over multiple measurements
  • Inaccurate, but precise: systematic measurement problem, misleading

[Figure: 2x2 grid of probability density plots against a reference value, showing the four combinations of accuracy (yes/no) and precision (yes/no)]

SLIDE 12

(CC-BY-4.0 by Arbeck)

SLIDE 13

ACCURACY AND PRECISION IN TRAINING DATA?

SLIDE 14

DATA QUALITY AND MACHINE LEARNING

  • More data -> better models (up to a point, diminishing effects)
  • Noisy data (imprecise) -> less confident models, more data needed
      • some ML techniques are more or less robust to noise (more on robustness in a later lecture)
  • Inaccurate data -> misleading models, biased models
  • Need the "right" data
  • Invest in data quality, not just quantity

SLIDE 15

EXPLORATORY DATA ANALYSIS

SLIDE 16

EXPLORATORY DATA ANALYSIS IN DATA SCIENCE

  • Before learning, understand the data: understand types, ranges, distributions
  • Important for understanding data and assessing quality
  • Plot data distributions for features
      • visualizations in a notebook
      • boxplots, histograms, density plots, scatter plots, ...
  • Explore outliers
  • Look for correlations and dependencies
      • association rule mining
      • principal component analysis
  • Examples: https://rpubs.com/ablythe/520912 and https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15
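The first step of such an analysis can be sketched with plain summary statistics. A minimal sketch, using only the standard library; the 'quantity' values are made up for illustration:

```python
import statistics

def summarize(values):
    """Compute basic distribution statistics for one numeric feature."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return {
        "min": min(values), "max": max(values),
        "mean": mean, "stdev": stdev,
        # flag values more than 2 standard deviations from the mean
        "outliers": [v for v in values if abs(v - mean) > 2 * stdev],
    }

# hypothetical 'quantity' column from a sales history
quantities = [1, 2, 2, 3, 1, 2, 3, 2, 1, 50]
stats = summarize(quantities)
print(stats["outliers"])  # [50]
```

In practice one would plot these distributions (histograms, boxplots) rather than only printing numbers; the point is that even crude statistics surface suspicious entries early.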

SLIDE 17

SE PERSPECTIVE: UNDERSTANDING DATA FOR QUALITY ASSURANCE

  • Understand input and output data
  • Understand expected distributions
  • Understand assumptions made on data for modeling; ideally, document those
  • Check assumptions at runtime

SLIDE 18

DATA CLEANING

"Data cleaning and repairing account for about 60% of the work of data scientists."

Quote: Gil Press. "Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says." Forbes Magazine, 2016.

SLIDE 19

Source: Rahm, Erhard, and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23.4 (2000): 3-13.

SLIDE 20

SINGLE-SOURCE PROBLEM EXAMPLES

Schema level:
  • Illegal attribute values: bdate=30.13.70
  • Violated attribute dependencies: age=22, bdate=12.02.70
  • Uniqueness violation: (name="John Smith", SSN="123456"), (name="Peter Miller", SSN="123456")
  • Referential integrity violation: emp=(name="John Smith", deptno=127) if department 127 is not defined

Instance level:
  • Missing values: phone=9999-999999
  • Misspellings: city=Pittsburg
  • Misfielded values: city=USA
  • Duplicate records: name=John Smith, name=J. Smith
  • Wrong references: emp=(name="John Smith", deptno=127) if department 127 is defined but wrong

Further reading: Rahm, Erhard, and Hong Hai Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23.4 (2000): 3-13.

SLIDE 26

DIRTY DATA: EXAMPLE

Problems with the data?

SLIDE 27

DISCUSSION: POTENTIAL DATA QUALITY PROBLEMS?

SLIDE 28

DATA CLEANING OVERVIEW

  • Data analysis / error detection
      • error types: e.g., schema constraints, referential integrity, duplication
      • single-source vs. multi-source problems
      • detection in input data vs. detection in later stages (more context)
  • Error repair
      • repair data vs. repair rules, one at a time or holistic
      • data transformation or mapping
      • automated vs. human-guided

SLIDE 29

ERROR DETECTION

  • Illegal values: min, max, variance, deviations, cardinality
  • Misspellings: sorting + manual inspection, dictionary lookup
  • Missing values: null values, default values
  • Duplication: sorting, edit distance, normalization
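The checks above can be sketched over a handful of records. A minimal illustration, assuming hypothetical field names and the 9999-999999 placeholder from the earlier slide; a real detector would read its constraints from a schema:

```python
def detect_errors(records):
    """Flag illegal values, placeholder 'missing' values, and exact duplicates."""
    errors = []
    seen = set()
    for i, r in enumerate(records):
        if r["quantity"] < 0:                # illegal value: violates a min constraint
            errors.append((i, "illegal quantity"))
        if r["phone"] == "9999-999999":      # default placeholder masquerading as data
            errors.append((i, "missing phone"))
        key = tuple(sorted(r.items()))       # normalize record for duplicate detection
        if key in seen:
            errors.append((i, "duplicate record"))
        seen.add(key)
    return errors

records = [
    {"user_id": 1, "quantity": 2, "phone": "412-268-1000"},
    {"user_id": 2, "quantity": -5, "phone": "9999-999999"},
    {"user_id": 1, "quantity": 2, "phone": "412-268-1000"},
]
print(detect_errors(records))
# [(1, 'illegal quantity'), (1, 'missing phone'), (2, 'duplicate record')]
```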

SLIDE 30

ERROR DETECTION: EXAMPLE

  • Q. Can we (automatically) detect errors? Which errors are problem-dependent?

SLIDE 31

COMMON STRATEGIES

  • Enforce schema constraints: e.g., delete rows with missing data or use defaults
  • Explore sources of errors: e.g., debugging missing values, outliers
  • Remove outliers: e.g., test for normal distribution, remove values > 2σ from the mean
  • Normalization: e.g., range [0, 1], power transform
  • Fill in missing values
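Three of these strategies composed into one pass, as a minimal sketch (mean imputation, a 2σ outlier cut, and min-max normalization are one choice among many; the input list is made up):

```python
import statistics

def clean(values, missing=None):
    """Sketch of common cleaning steps: impute, trim outliers, normalize to [0, 1]."""
    # 1. Fill missing values with the mean of the observed values
    observed = [v for v in values if v is not missing]
    mean = statistics.mean(observed)
    filled = [mean if v is missing else v for v in values]
    # 2. Remove outliers more than 2 standard deviations from the mean
    m2 = statistics.mean(filled)
    sd = statistics.pstdev(filled)
    kept = [v for v in filled if abs(v - m2) <= 2 * sd]
    # 3. Normalize the remaining values into the range [0, 1]
    lo, hi = min(kept), max(kept)
    return [(v - lo) / (hi - lo) for v in kept] if hi > lo else [0.0 for _ in kept]

out = clean([1, 2, 3, None, 2, 1, 2, 3, 2, 1, 1000])
print(len(out))  # 10: the extreme value 1000 was dropped
```

Note the risk the slide alludes to: each of these steps bakes an assumption into the data (that the mean is a reasonable guess, that extremes are errors), which is why cleaning steps should be documented and reviewed.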

SLIDE 32

DATA CLEANING TOOLS

OpenRefine (formerly Google Refine), Trifacta Wrangler, Drake, etc.

SLIDE 33

DIFFERENT CLEANING TOOLS

  • Outlier detection
  • Data deduplication
  • Data transformation
  • Rule-based data cleaning and rule discovery
      • (conditional) functional dependencies and other constraints
  • Probabilistic data cleaning

Further reading: Ilyas, Ihab F., and Xu Chu. Data Cleaning. Morgan & Claypool, 2019.

SLIDE 34

DATA SCHEMA

SLIDE 35

DATA SCHEMA

  • Define the expected format of data
      • expected fields and their types
      • expected ranges for values
      • constraints among values (within and across sources)
  • Data can be automatically checked against the schema
  • Protects against change; explicit interface between components
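The "automatically checked" part can be sketched in a few lines. A minimal hand-rolled validator; the field names and constraints are illustrative assumptions for the inventory stock table, not a real schema language:

```python
# Each field maps to (expected type, range/constraint predicate)
SCHEMA = {
    "product_id": (int, lambda v: v > 0),
    "quantity":   (int, lambda v: v >= 0),
    "location":   (str, lambda v: len(v) > 0),
}

def conforms(record, schema=SCHEMA):
    """True iff the record has exactly the expected fields, types, and ranges."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[f], t) and ok(record[f])
               for f, (t, ok) in schema.items())

print(conforms({"product_id": 7, "quantity": 3, "location": "A-14"}))   # True
print(conforms({"product_id": 7, "quantity": -1, "location": "A-14"}))  # False: range violation
```

Real systems delegate this to a schema technology (SQL DDL, Avro, Protobuf, as on the following slides) so that the check lives at the interface between components instead of inside application code.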

SLIDE 36

SCHEMA IN RELATIONAL DATABASES

CREATE TABLE employees (
    emp_no INT NOT NULL,
    birth_date DATE NOT NULL,
    name VARCHAR(30) NOT NULL,
    PRIMARY KEY (emp_no));

CREATE TABLE departments (
    dept_no CHAR(4) NOT NULL,
    dept_name VARCHAR(40) NOT NULL,
    PRIMARY KEY (dept_no),
    UNIQUE KEY (dept_name));

CREATE TABLE dept_manager (
    dept_no CHAR(4) NOT NULL,
    emp_no INT NOT NULL,
    FOREIGN KEY (emp_no) REFERENCES employees (emp_no),
    FOREIGN KEY (dept_no) REFERENCES departments (dept_no),
    PRIMARY KEY (emp_no, dept_no));

SLIDE 37

SCHEMA-LESS DATA EXCHANGE

  • CSV files
  • Key-value stores (JSON, XML, NoSQL databases)
  • Message brokers
  • REST API calls
  • R/Pandas dataframes

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance

10|53|M|lawyer|90703
11|39|F|other|30329
12|28|F|other|06405
13|47|M|educator|29206

SLIDE 38

EXAMPLE: APACHE AVRO

{
  "type": "record",
  "namespace": "com.example",
  "name": "Customer",
  "fields": [
    { "name": "first_name", "type": "string", "doc": "First Name of Customer" },
    { "name": "age", "type": "int", "doc": "Age at the time of registration" }
  ]
}

SLIDE 39

EXAMPLE: APACHE AVRO

  • Schema specification in JSON format
  • Serialization and deserialization with automated checking
  • Native support in Kafka
  • Benefits
      • serialization in a space-efficient format
      • APIs for most languages (ORM-like)
      • versioning constraints on schemas
  • Drawbacks
      • reading/writing overhead
      • binary data format, extra tools needed for reading
      • requires an external schema and its maintenance
      • learning overhead

SLIDE 40

Speaker notes: Further readings, e.g., https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321, https://www.confluent.io/blog/avro-kafka-data/, https://avro.apache.org/docs/current/

SLIDE 41

MANY SCHEMA FORMATS

Examples: Avro, XML Schema, Protobuf, Thrift, Parquet, ORC

SLIDE 42

DISCUSSION: DATA SCHEMA FOR INVENTORY SYSTEM?

Product Database: ID | Name | Weight | Description | Size | Vendor | ...
Stock: ProductID | Location | Quantity | ...
Sales history: UserID | ProductId | DateTime | Quantity | Price | ...

SLIDE 43

DETECTING INCONSISTENCIES

Image source: Theo Rekatsinas, Ihab Ilyas, and Chris Ré, "HoloClean - Weakly Supervised Data Repairing." Blog, 2017.

SLIDE 45

DATA QUALITY RULES

  • Invariants on data that must hold
  • Typically about relationships of multiple attributes or data sources, e.g.:
      • ZIP code and city name should correspond
      • user ID should refer to an existing user
      • SSN should be unique
      • for two people in the same state, the person with the lower income should not have the higher tax rate
  • Classic integrity constraints in databases, or conditional constraints
  • Rules can be used to reject data or to repair it
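The income/tax example above is a denial constraint: a pattern of values that must never cooccur. Checking it means scanning record pairs, which can be sketched directly (the people data below is invented for illustration):

```python
# Hypothetical tax records; one pair violates the constraint.
people = [
    {"state": "PA", "income": 50_000, "tax_rate": 0.20},
    {"state": "PA", "income": 30_000, "tax_rate": 0.25},  # lower income, higher rate
    {"state": "OH", "income": 40_000, "tax_rate": 0.18},
]

def denial_violations(rows):
    """Index pairs where, within one state, lower income comes with a higher tax rate."""
    return [(i, j) for i, a in enumerate(rows) for j, b in enumerate(rows)
            if a["state"] == b["state"]
            and a["income"] < b["income"] and a["tax_rate"] > b["tax_rate"]]

print(denial_violations(people))  # [(1, 0)]
```

A violating pair tells you that one of the two records is wrong but not which one; that ambiguity is what the probabilistic repair tools on the later slides try to resolve.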

SLIDE 46

DISCOVERY OF DATA QUALITY RULES

  • Rules directly taken from external databases, e.g., zip code directory
  • Given clean data:
      • algorithms that find functional relationships (X ⇒ Y) among columns
      • algorithms that find conditional relationships (if Z then X ⇒ Y)
      • algorithms that find denial constraints (X and Y cannot cooccur in a row)
  • Given mostly clean data (probabilistic view):
      • algorithms to find likely rules (e.g., association rule mining)
      • outlier and anomaly detection
  • Given labeled dirty data or user feedback:
      • supervised and active learning to learn and revise rules
      • supervised learning to learn repairs (e.g., spell checking)

Further reading: Ilyas, Ihab F., and Xu Chu. Data Cleaning. Morgan & Claypool, 2019.

SLIDE 47

ASSOCIATION RULE MINING

Sale 1: Bread, Milk
Sale 2: Bread, Diaper, Beer, Eggs
Sale 3: Milk, Diaper, Beer, Coke
Sale 4: Bread, Milk, Diaper, Beer
Sale 5: Bread, Milk, Diaper, Coke

Rules:
  • {Diaper, Beer} -> Milk (40% support, 66% confidence)
  • Milk -> {Diaper, Beer} (40% support, 50% confidence)
  • {Diaper, Beer} -> Bread (40% support, 66% confidence)

(also a useful tool for exploratory data analysis)
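The support and confidence numbers above can be recomputed directly from the five transactions. A small sketch of the definitions (support = fraction of transactions containing the whole itemset; confidence = support of the union over support of the antecedent):

```python
sales = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= sale for sale in sales) / len(sales)

def confidence(lhs, rhs):
    """Estimated P(rhs | lhs): support of the union over support of the antecedent."""
    return support(lhs | rhs) / support(lhs)

print(support({"Diaper", "Beer", "Milk"}))       # 0.4  -> 40% support
print(confidence({"Diaper", "Beer"}, {"Milk"}))  # ~0.667 -> 66% confidence
print(confidence({"Milk"}, {"Diaper", "Beer"}))  # 0.5  -> 50% confidence
```

Real miners (Apriori, FP-Growth) avoid this brute-force enumeration, but the two measures they rank rules by are exactly these.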

Further readings: Standard algorithms and many variations, see Wikipedia

SLIDE 48

EXCURSION: DAIKON FOR DYNAMIC DETECTION OF LIKELY INVARIANTS

  • Software engineering technique to find invariants
      • e.g., i>0, a==x, this.stack != null, db.query() after db.prepare()
      • pre- and post-conditions of functions, local variables
  • Used for documentation, avoiding bugs, debugging, testing, verification, repair
  • Idea: observe many executions (instrument code), log variable values, look for relationships (test many possible invariants)

SLIDE 49

DAIKON EXAMPLE

Expected: Return value of ABS(x) == (x>0) ? x: -x;

int ABS(int x) {
    if (x > 0) return x;
    else return (x * (-1));
}

int main() {
    int i = 0;
    int abs_i;
    for (i = -5000; i < 5000; i++)
        abs_i = ABS(i);
}

==================
std.ABS(int;):::ENTER
==================
std.ABS(int;):::EXIT1
x == return
==================
std.ABS(int;):::EXIT2
return == - x
==================
std.ABS(int;):::EXIT
x == orig(x)
x <= return
==================

SLIDE 50

Speaker notes: many examples in https://www.cs.cmu.edu/~aldrich/courses/654-sp07/tools/kim-daikon-02.pdf

SLIDE 51

PROBABILISTIC REPAIR

  • Use rules to identify inconsistencies and the more likely fix
  • If confidence is high enough, apply automatically
  • Show suggestions to end users (like spell checkers) or data scientists
  • Many tools in this area

SLIDE 52

HOLOCLEAN

HoloClean: Data Quality Management with Theodoros Rekatsinas, SEDaily Podcast, 2020

SLIDE 53

DISCUSSION: DATA QUALITY RULES IN INVENTORY SYSTEM

SLIDE 54

DATA LINTER

Further reading: Nick Hynes, D. Sculley, Michael Terry. "The Data Linter: Lightweight Automated Sanity Checking for ML Data Sets." NIPS Workshop on ML Systems (2017)

SLIDE 55

EXCURSION: STATIC ANALYSIS AND CODE LINTERS

Automate routine inspection tasks

if (user.jobTitle = "manager") {
    ...
}

function fn() {
    x = 1;
    return x;
    x = 3;  // dead code
}

PrintWriter log = null;
if (anyLogging) log = new PrintWriter(...);
if (detailedLogging) log.println("Log started");

SLIDE 56

STATIC ANALYSIS

  • Analyzes the structure/possible executions of the code, without running it
  • Different levels of sophistication
      • simple heuristics and code patterns (linters)
      • sound reasoning about all possible program executions
  • Tradeoff between false positives and false negatives
  • Often supporting annotations needed (e.g., @Nullable)
  • Tools widely available, open source and commercial

SLIDE 58

A LINTER FOR DATA?

SLIDE 59

DATA LINTER AT GOOGLE

  • Miscoding
      • number, date, time as string
      • enum as real
      • tokenizable string (long strings, all unique)
      • zip code as number
  • Outliers and scaling
      • unnormalized feature (varies widely)
      • tailed distributions
      • uncommon sign
  • Packaging
      • duplicate rows
      • empty/missing data
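Two of these lints are easy to sketch for tabular data. A toy illustration in the spirit of the checks above, not the Google tool; the rows and column names are invented:

```python
def lint(rows):
    """Flag numbers miscoded as strings and duplicate rows."""
    warnings = []
    for name in rows[0]:
        values = [r[name] for r in rows]
        # miscoding lint: every value is a string, yet all parse as integers
        if all(isinstance(v, str) and v.lstrip("-").isdigit() for v in values):
            warnings.append(f"column '{name}' looks numeric but is stored as string")
    # packaging lint: normalized rows collapse, so some rows are exact duplicates
    if len({tuple(sorted(r.items())) for r in rows}) < len(rows):
        warnings.append("duplicate rows")
    return warnings

rows = [
    {"quantity": "3", "price": 10.0},
    {"quantity": "12", "price": 12.5},
    {"quantity": "3", "price": 10.0},
]
print(lint(rows))
```

As the slide's zip-code item shows, these are warnings, not errors: a digit-only string may be deliberate (zip codes should stay strings), which is exactly why a data linter reports suspicions for human review.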

Further reading: Hynes, Nick, D. Sculley, and Michael Terry. The Data Linter: Lightweight, Automated Sanity Checking for ML Data Sets. NIPS MLSys Workshop. 2017.

SLIDE 60

DETECTING DRIFT

SLIDE 61

DRIFT & MODEL DECAY

In all cases, models become less effective over time.

  • Concept drift
      • properties to predict change over time (e.g., what is credit card fraud)
      • over time: different expected outputs for the same inputs
      • model has not learned the relevant concepts
  • Data drift
      • characteristics of input data change (e.g., customers with face masks)
      • input data differs from training data
      • over time: predictions less confident, further from training data
  • Upstream data changes
      • external changes in the data pipeline (e.g., format changes in weather service)
      • model interprets input data incorrectly
      • over time: abrupt changes due to faulty inputs

SLIDE 62

Speaker notes:
  • fix 1: retrain with new training data or relabeled old training data
  • fix 2: retrain with new data
  • fix 3: fix the pipeline, retrain entirely

SLIDE 63

ON TERMINOLOGY

  • Concept drift and data drift are separate concepts
  • In practice and literature they are not always clearly distinguished
  • Colloquially, "drift" encompasses all forms of model degradation and environment changes
  • Define the term for your target audience

SLIDE 64

WATCH FOR DEGRADATION IN PREDICTION ACCURACY

Image source: Joel Thomas and Clemens Mewald. Productionizing Machine Learning: From Deployment to Drift Detection. Databricks Blog, 2019

SLIDE 65

INDICATORS OF CONCEPT DRIFT

How to detect concept drift in production?

SLIDE 66

INDICATORS OF CONCEPT DRIFT

  • Model degradations observed with telemetry
  • Telemetry indicates different outputs over time for similar inputs
  • Relabeling training data changes labels
  • Interpretable ML models indicate rules that no longer fit

(many papers on this topic, typically on statistical detection)

SLIDE 67

DEALING WITH DRIFT

  • Regularly retrain the model on recent data
  • Use evaluation in production to detect decaying model performance
  • Involve humans when increasing inconsistencies are detected
  • Monitoring thresholds, automation
  • Monitoring, monitoring, monitoring!

SLIDE 68

DIFFERENT FORMS OF DATA DRIFT

  • Structural drift
      • data schema changes, sometimes by infrastructure changes
      • e.g., 4124784115 -> 412-478-4115
  • Semantic drift
      • meaning of data changes, same schema
      • e.g., Netflix switches from a 5-star to a +/- rating, but still uses 1 and 5
  • Distribution changes
      • e.g., credit card fraud differs to evade detection
      • e.g., marketing affects sales of certain items
  • Other examples?

SLIDE 69

DETECTING DATA DRIFT

  • Compare distributions over time (e.g., t-test)
  • Detect both sudden jumps and gradual changes
  • Distributions can be manually specified or learned (see invariant detection)
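A minimal sketch of the t-test idea, assuming a made-up numeric feature and a crude fixed threshold instead of a proper p-value (a real pipeline would use a statistics library and calibrated significance levels):

```python
import statistics

def t_statistic(a, b):
    """Welch's two-sample t statistic comparing the means of two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

def drifted(train, live, threshold=3.0):
    """Crude drift alarm: means differ by more than `threshold` standard errors."""
    return abs(t_statistic(train, live)) > threshold

train = [5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 5.1, 5.0]
print(drifted(train, [5.1, 4.9, 5.0, 5.2, 4.9, 5.0]))  # False: same distribution
print(drifted(train, [7.8, 8.1, 8.0, 7.9, 8.2, 8.0]))  # True: the mean has jumped
```

Comparing means only catches one kind of change; tests sensitive to the whole distribution (e.g., Kolmogorov-Smirnov) catch shape changes that leave the mean intact.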

SLIDE 70

DATA DISTRIBUTION ANALYSIS

  • Plot distributions of features (histograms, density plots, kernel density estimation)
      • identify which features drift
  • Define a distance function between inputs and identify the distance to the closest training data (e.g., Wasserstein and energy distance; see also kNN)
  • Formal models for data drift contribution etc. exist
  • Anomaly detection and "out of distribution" detection
  • Observe the distribution of output labels

SLIDE 71

DATA DISTRIBUTION EXAMPLE

https://rpubs.com/ablythe/520912

SLIDE 72

MICROSOFT AZURE DATA DRIFT DASHBOARD

Image source and further readings: Detect data drift (preview) on models deployed to Azure Kubernetes Service (AKS)

SLIDE 73

DISCUSSION: INVENTORY SYSTEM

What kind of drift might be expected? What kind of detection/monitoring?

SLIDE 74

DATA PROGRAMMING & WEAKLY-SUPERVISED LEARNING

Programmatically Build and Manage Training Data

SLIDE 75

WEAK SUPERVISION -- KEY IDEAS

  • Labeled data is expensive; unlabeled data is often widely available
  • Different labelers have different cost and accuracy/precision
      • crowdsourcing vs. med students vs. trained experts in labeling cancer diagnoses
  • Often heuristics can define labels for some data (labeling functions)
      • hard-coded heuristics (e.g., regular expressions)
      • distant supervision with external knowledge bases
      • noisy manual labels with crowdsourcing
      • external models providing some predictions
  • Combine signals from labeling functions to automatically label training data
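The combination step can be illustrated with the simplest possible scheme: a plain majority vote over non-abstaining labeling functions. This is a simplified stand-in, not Snorkel's generative model (which instead learns how much to trust each function); the heuristics below are invented for the spam example:

```python
SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def contains_my(text):
    return SPAM if "my" in text.lower() else ABSTAIN

def very_short(text):
    return NOT_SPAM if len(text.split()) < 3 else ABSTAIN

def contains_subscribe(text):
    return SPAM if "subscribe" in text.lower() else ABSTAIN

def majority_label(text, lfs=(contains_my, very_short, contains_subscribe)):
    """Majority vote over the labeling functions that did not abstain."""
    votes = [lf(text) for lf in lfs if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

print(majority_label("subscribe to my channel"))  # 1 (SPAM)
print(majority_label("nice"))                     # 0 (NOT_SPAM)
```

Majority voting treats all functions as equally reliable; the point of learning the combination (as on the Snorkel slide below in this deck) is to weight correlated or error-prone functions appropriately.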

SLIDE 76

LABELING FUNCTION

For a binary label, vote 1 (spam), 0 (not spam), or -1 (abstain). Can also represent constraints and invariants if known.

from snorkel.labeling import labeling_function

@labeling_function()
def lf_keyword_my(x):
    """Many spam comments talk about 'my channel', 'my video'."""
    return SPAM if "my" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_textblob_polarity(x):
    """We use a third-party sentiment classification model."""
    return NOT_SPAM if TextBlob(x.text).sentiment.polarity > 0.3 else ABSTAIN

SLIDE 77

Speaker notes: More details: https://www.snorkel.org/get-started/

SLIDE 78

SNORKEL

  • Generative model learns which labeling functions to trust, and when (~ from correlations); learns the "expertise" of labeling functions
  • Generative model used to provide probabilistic training labels
  • Discriminative model learned from the labeled training data; generalizes beyond the label functions

https://www.snorkel.org/, https://www.snorkel.org/blog/snorkel-programming; Ratner, Alexander, et al. "Snorkel: rapid training data creation with weak supervision." The VLDB Journal 29.2 (2020): 709-730.

SLIDE 80

Speaker notes: Emphasize the two different models. One could just let all labelers vote, but the generative model identifies common correlations and disagreements and judges which labelers to trust when (also provides feedback to label-function authors), resulting in better labels. The generative model could already make predictions, but it is coupled tightly to the labeling functions. The discriminative model is a traditional model learned on labeled training data and thus (hopefully) generalizes beyond the labeling functions. It may actually pick up on very different signals. Typically this is more general and robust for unseen data.

SLIDE 81

DATA PROGRAMMING BEYOND LABELING TRAINING DATA

Potentially useful in many other scenarios:
  • Data cleaning
  • Data augmentation
  • Identifying important data subsets

SLIDE 82

DATA PROGRAMMING IN INVENTORY SYSTEM?

SLIDE 83

DATA PROGRAMMING FOR DETECTING TOXIC COMMENTS IN YOUTUBE?

SLIDE 84

QUALITY ASSURANCE FOR THE DATA PROCESSING PIPELINES

SLIDE 85

ERROR HANDLING AND TESTING IN PIPELINE

Avoid silent failures!
  • Write testable data acquisition and feature extraction code
  • Test this code (unit tests, positive and negative tests)
  • Test the retry mechanism for acquisition + error reporting
  • Test correct detection and handling of invalid input
  • Catch and report errors in feature extraction
  • Test correct detection of data drift
  • Test correct triggering of the monitoring system
  • Detect stale data, stale models

More in a later lecture.
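The "test correct detection and handling of invalid input" item can be sketched as a negative unit test. The feature extractor and its field names are hypothetical; the point is that bad input raises a clear error instead of silently producing a garbage feature:

```python
def extract_features(record):
    """Toy feature extractor; raises ValueError rather than guessing on bad input."""
    if "quantity" not in record or not isinstance(record["quantity"], int):
        raise ValueError(f"invalid record: {record!r}")
    return {"quantity": record["quantity"], "is_bulk": record["quantity"] >= 10}

def test_rejects_invalid_input():
    """Negative test: the pipeline must surface the error, not swallow it."""
    try:
        extract_features({"quantity": "lots"})
    except ValueError:
        return True
    return False

assert test_rejects_invalid_input()
assert extract_features({"quantity": 12}) == {"quantity": 12, "is_bulk": True}
```

In a real pipeline these would be pytest functions, and the same style of test would cover the retry mechanism and the monitoring triggers listed above.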

SLIDE 86

17-445 Software Engineering for AI-Enabled Systems, Christian Kaestner

SUMMARY

  • Data and data quality are essential
  • Data from many sources is often inaccurate, imprecise, inconsistent, incomplete, ... -- many different forms of data quality problems
  • Understand the data with exploratory data analysis
  • Many mechanisms for enforcing consistency and cleaning
      • data schemas ensure format consistency
      • data quality rules ensure invariants across data points
      • data linters detect common problems
  • Concept and data drift are key challenges -- monitor
  • Data programming creates training labels at scale with weak supervision
  • Quality assurance for the data processing pipelines
