A Machine Learning Perspective on Managing Noisy Data, Theodoros Rekatsinas (PowerPoint PPT presentation)


SLIDE 1

A Machine Learning Perspective on Managing Noisy Data

Theodoros Rekatsinas | UW-Madison @thodrek

SLIDE 2

Data-hungry applications are taking over

SLIDE 3
  • Noisy measurements

  • Sensor failures

Data errors are everywhere

SLIDE 4
  • Uncertain extractions
  • Semantic ambiguity

Data errors are everywhere

SLIDE 5
  • Adversarial examples

Data errors are everywhere

SLIDE 6
  • Human errors
  • Machine failures
  • Code bugs

Data errors are everywhere

SLIDE 7

The Achilles’ Heel of Modern Analytics

is low quality, erroneous data

SLIDE 8

The Achilles’ Heel of Modern Analytics

is low quality, erroneous data

Cleaning and organizing the data comprises 60% of the time spent on an analytics or AI project.

SLIDE 9

The Achilles’ Heel of Modern Analytics

is low quality, erroneous data

Many modern data management systems are being developed to address aspects of this issue:

  • Stanford’s Snorkel: A System for Fast Training Data Creation
  • Google’s TFX: TensorFlow Data Validation
  • Amazon’s SageMaker
  • Amazon’s Deequ: Data Quality Validation for ML Pipelines
  • HoloClean: Weakly-supervised data cleaning

SLIDE 10

Question:

What is an appropriate (formal) framework for managing noisy data?

Things to consider: Simplicity and generality

SLIDE 11

Talk outline

  • Managing Noisy Data (Background)
  • The Probabilistic Unclean Databases (PUDs) Framework
  • From Theory to Systems
SLIDE 12

Managing Noisy Data

SLIDE 13

A simple example of noisy data

c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

      DBAName            AKAName    Address           City     State  Zip
t1    John Veliotis Sr.  Johnnyo’s  3465 S Morgan ST  Chicago  IL     60608
t2    John Veliotis Sr.  Johnnyo’s  3465 S Morgan ST  Chicago  IL     60609
t3    John Veliotis Sr.  Johnnyo’s  3465 S Morgan ST  Chicago  IL     60609
t4    Johnnyo’s          Johnnyo’s  3465 S Morgan ST  Cicago   IL     60608

(Slide annotations mark the conflicting cells; ‘Cicago’ does not obey the data distribution.)

SLIDE 14

A simple example of noisy data

c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

(example table from Slide 13, with the same conflicts)

Computational problems: Detect errors, repair errors, compute “consistent” query answers.
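Detecting FD violations on an instance like this takes only a pairwise scan; a minimal sketch (tuple values follow the reconstruction of the slide's table):

```python
from itertools import combinations

# The four example tuples; t4 misspells "Chicago".
tuples = {
    "t1": {"DBAName": "John Veliotis Sr.", "AKAName": "Johnnyo's",
           "Address": "3465 S Morgan ST", "City": "Chicago", "State": "IL",
           "Zip": "60608"},
    "t2": {"DBAName": "John Veliotis Sr.", "AKAName": "Johnnyo's",
           "Address": "3465 S Morgan ST", "City": "Chicago", "State": "IL",
           "Zip": "60609"},
    "t3": {"DBAName": "John Veliotis Sr.", "AKAName": "Johnnyo's",
           "Address": "3465 S Morgan ST", "City": "Chicago", "State": "IL",
           "Zip": "60609"},
    "t4": {"DBAName": "Johnnyo's", "AKAName": "Johnnyo's",
           "Address": "3465 S Morgan ST", "City": "Cicago", "State": "IL",
           "Zip": "60608"},
}
# Functional dependencies c1-c3 as (left-hand side, right-hand side).
fds = {
    "c1": (["DBAName"], ["Zip"]),
    "c2": (["Zip"], ["City", "State"]),
    "c3": (["City", "State", "Address"], ["Zip"]),
}

def violations(tuples, fds):
    """Pairs of tuples that agree on an FD's LHS but differ on its RHS."""
    out = []
    for name, (lhs, rhs) in fds.items():
        for (i, t), (j, u) in combinations(tuples.items(), 2):
            if all(t[a] == u[a] for a in lhs) and any(t[a] != u[a] for a in rhs):
                out.append((name, i, j))
    return out

print(violations(tuples, fds))
```

Running it surfaces exactly the conflicts marked on the slide: t1 against t2/t3 on c1 and c3, and t1 against t4 on c2.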

SLIDE 15

The case for inconsistent data

An example unclean database J

(example table from Slide 13)

c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

  • Errors correspond to tuples/cells that introduce inconsistencies (violations of integrity constraints).
  • Inconsistencies are typical in data integration, extract-load-transform workloads, etc.
  • Data repairs: A theoretical framework for coping with inconsistent databases [Arenas et al. 1999]
SLIDE 16

Minimal data repairs

Slide by Phokion Kolaitis [SAT 2016]

SLIDE 17

Minimal data repairs

Slide by Phokion Kolaitis [SAT 2016]

Plethora of fundamental results on the tractability of repair-checking and consistent query answering.

SLIDE 18

Minimal data repairs

Limited adoption in practice.

Slide by Phokion Kolaitis [SAT 2016]

Plethora of fundamental results on the tractability of repair-checking and consistent query answering.

SLIDE 19

Minimal data repairs

(example table from Slide 13)

c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

SLIDE 20

Minimal data repairs

(example table from Slide 13)

Minimal subset repair: We remove t1

c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

An example repaired database I

SLIDE 21

Minimal data repairs

(example table from Slide 13)

Minimal subset repair: We remove t1

c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

Errors remain: (1) ‘Cicago’ should clearly be ‘Chicago’; (2) non-obvious errors: 60609 is the wrong Zip.

SLIDE 22

Minimal data repairs

(example table from Slide 13)

Minimal subset repair: We remove t1

c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

Errors remain: (1) ‘Cicago’ should clearly be ‘Chicago’; (2) non-obvious errors: 60609 is the wrong Zip. Several variations of minimal repairs exist, e.g., update the minimum number of cells.

SLIDE 23

Minimal data repairs

(example table from Slide 13)

Minimal subset repair: We remove t1

c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

Errors remain: (1) ‘Cicago’ should clearly be ‘Chicago’; (2) non-obvious errors: 60609 is the wrong Zip. Several variations of minimal repairs exist, e.g., update the minimum number of cells. Minimality can be used as an operational principle to prioritize repairs, but these repairs are not necessarily correct with respect to the ground truth.
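On an instance this small, a minimal subset repair can be found by exhaustive search (a sketch; finding minimum repairs is intractable in general, so practical systems approximate):

```python
from itertools import combinations

# The four example tuples from the slides (t4 misspells "Chicago").
tuples = {
    "t1": {"DBAName": "John Veliotis Sr.", "City": "Chicago", "State": "IL",
           "Address": "3465 S Morgan ST", "Zip": "60608"},
    "t2": {"DBAName": "John Veliotis Sr.", "City": "Chicago", "State": "IL",
           "Address": "3465 S Morgan ST", "Zip": "60609"},
    "t3": {"DBAName": "John Veliotis Sr.", "City": "Chicago", "State": "IL",
           "Address": "3465 S Morgan ST", "Zip": "60609"},
    "t4": {"DBAName": "Johnnyo's", "City": "Cicago", "State": "IL",
           "Address": "3465 S Morgan ST", "Zip": "60608"},
}
fds = [(["DBAName"], ["Zip"]),
       (["Zip"], ["City", "State"]),
       (["City", "State", "Address"], ["Zip"])]

def consistent(rows, fds):
    """True iff no pair of remaining tuples violates any FD."""
    return all(
        not (all(t[a] == u[a] for a in lhs) and any(t[a] != u[a] for a in rhs))
        for lhs, rhs in fds
        for t, u in combinations(rows.values(), 2)
    )

def minimal_subset_repair(tuples, fds):
    """Smallest set of tuples whose removal restores consistency."""
    ids = list(tuples)
    for r in range(len(ids) + 1):            # try removing 0, 1, 2, ... tuples
        for removed in combinations(ids, r):
            keep = {i: tuples[i] for i in ids if i not in removed}
            if consistent(keep, fds):
                return set(removed)

print(minimal_subset_repair(tuples, fds))    # removing t1 suffices
```

This reproduces the slide's repair: removing t1 restores consistency, yet the remaining data still contains ‘Cicago’ and the wrong Zip.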

SLIDE 24

The case for most probable data [Gribkoff et al., 14]

(example table from Slide 13)

c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

Tuple probabilities: p(t1) = 0.4, p(t2) = 0.9, p(t3) = 0.8, p(t4) = 0.4

Most probable world, conditioned on integrity constraint satisfaction

SLIDE 25

(example table from Slide 13)

c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

Tuple probabilities: p(t1) = 0.4, p(t2) = 0.9, p(t3) = 0.8, p(t4) = 0.4. Factor (f): p(t) for an included tuple, 1 − p(t) for an excluded one (here t2 is excluded: 1 − 0.9, 0.4, 0.4, 0.8).

Optimization Objective:

    max_I  ∏_{t ∈ I} p(t) · ∏_{t ∉ I} (1 − p(t))

The case for most probable data [Gribkoff et al., 14]

SLIDE 26

(example table from Slide 13)

c1: DBAName → Zip
c2: Zip → City, State
c3: City, State, Address → Zip

Tuple probabilities: p(t1) = 0.4, p(t2) = 0.9, p(t3) = 0.8, p(t4) = 0.4. Factor (f): p(t) for an included tuple, 1 − p(t) for an excluded one (here only t2 is included: 0.9, 1 − 0.4, 1 − 0.4, 1 − 0.8).

Optimization Objective:

    max_I  ∏_{t ∈ I} p(t) · ∏_{t ∉ I} (1 − p(t))

The case for most probable data [Gribkoff et al., 14]
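With four tuples, the most probable world conditioned on constraint satisfaction can be enumerated directly (a sketch: the conflict list hard-codes the tuple pairs that violate c1-c3 in this example, and the probabilities are the ones shown on the slide):

```python
from itertools import chain, combinations

# Tuple marginals from the slide.
p = {"t1": 0.4, "t2": 0.9, "t3": 0.8, "t4": 0.4}
# Pairs of tuples that cannot co-exist without violating some constraint.
conflicts = [("t1", "t2"), ("t1", "t3"), ("t1", "t4")]

def score(world):
    """P(world) = prod of p(t) over included tuples times (1 - p(t)) over excluded."""
    s = 1.0
    for t, pt in p.items():
        s *= pt if t in world else 1.0 - pt
    return s

# Enumerate all 2^4 worlds, keep the consistent ones, take the argmax.
worlds = chain.from_iterable(combinations(p, r) for r in range(len(p) + 1))
best = max(
    (frozenset(w) for w in worlds
     if not any(a in w and b in w for a, b in conflicts)),
    key=score,
)
print(sorted(best))  # -> ['t2', 't3']
```

Note the most probable consistent world keeps t2 and t3 and drops both low-probability tuples, which already differs from the minimal subset repair.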

SLIDE 27

Most probable repairs

Probabilities offer clearer semantics than minimality. Fundamental question: How do we know p?

(example table from Slide 13, with tuple probabilities and factors as on Slide 25)

Optimization Objective:

    max_I  ∏_{t ∈ I} p(t) · ∏_{t ∉ I} (1 − p(t))

SLIDE 28

Probabilistic Unclean Databases

Christopher De Sa, Ihab Ilyas, Benny Kimelfeld, Christopher Ré, Theodoros Rekatsinas, ICDT 2019

SLIDE 29

The case of a noisy channel for data

Noisy Channel Model:
  1. We see an observation x in the noisy world.
  2. Find the correct world w.
Applications: speech, OCR, spelling correction, part-of-speech tagging, machine translation, etc.

    ŵ = arg max_{w ∈ W} P(w | x)

Clean Source Data → [Noisy Channel] → Observed Data with Errors
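The argmax above is the classic noisy-channel decoder. A toy spelling-correction sketch (the vocabulary, prior, and channel numbers are invented for illustration):

```python
# Toy noisy-channel decoder: w_hat = argmax_w P(w) * P(x | w).

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]

prior = {"chicago": 0.7, "cicero": 0.3}  # P(w): hypothetical source model

def channel(x, w):
    """P(x | w): probability decays geometrically with edit distance."""
    return 0.5 ** edit_distance(x, w)

def decode(x):
    return max(prior, key=lambda w: prior[w] * channel(x, w))

print(decode("cicago"))  # -> "chicago"
```

The same decoding rule, applied to whole database instances instead of words, is exactly what the PUD framework formalizes next.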

SLIDE 30

The Probabilistic Unclean Database Model

Clean Source Data → [Noisy Channel] → Observed Data with Errors

SLIDE 31

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator) → Clean Intended Database I → [Noisy Channel] → Observed Data with Errors

SLIDE 32

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator) → Clean Intended Database I → Realizer (Probabilistic Noise Generator / Noisy Channel) → Observed Unclean Database J

SLIDE 33

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator): a probability distribution.
  • Component 1: a probability over tuple values in I.
  • Component 2: logical constraints that bias towards consistency of the tuples in I.

SLIDE 34

The Probabilistic Unclean Database Model

Realizer (Probabilistic Noise Generator / Noisy Channel): a conditional probability distribution that captures the conditional probability of data edits and transformations. Example: an exponential-family realizer.

    R_I(J) = Pr(J | I)

    R[i, t](t′) = (1 / Z(t)) · exp( Σ_{g ∈ G} w_g · g(t, t′) )

with t ∈ I, t′ ∈ J, where G is a set of features, each g is an arbitrary function over (t, t′), and each weight w_g is a real number. R[i, t](t′) is the probability of the i-th record of I changing from t to t′.
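A minimal numeric instantiation of this realizer (the two features and their weights are hypothetical; in practice G and the weights w_g are learned):

```python
import math

# Two toy features over (t, t'): value preserved vs. value edited.
features = [
    lambda t, tp: 1.0 if tp == t else 0.0,   # g1: no change
    lambda t, tp: 1.0 if tp != t else 0.0,   # g2: some edit
]
weights = [2.0, -1.0]  # made-up weights

def realizer(t, candidates):
    """R[i, t](t'): normalized probability of observing t' given intended t."""
    scores = {tp: math.exp(sum(w * g(t, tp) for w, g in zip(weights, features)))
              for tp in candidates}
    Z = sum(scores.values())                 # the partition function Z(t)
    return {tp: s / Z for tp, s in scores.items()}

dist = realizer("Chicago", ["Chicago", "Cicago", "Chicagoo"])
print(dist)
```

With these weights the channel strongly prefers leaving the value unchanged, and all edited candidates share the same (small) probability, as the symmetric feature set dictates.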

SLIDE 35

Example PUD instances

PUD Example 1: Parfactor/Subset PUD

Models the generation of duplicate data

  • Prob. of extra tuples in J
  • Prob. of no-tuples

SLIDE 36

Example PUD instances

PUD Example 2: Parfactor/Update PUD

Models errors due to transformations (e.g., typos)

  • Prob. of edits present in J

SLIDE 37

Computational Problems in the PUD framework

SLIDE 38

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator) → Clean Intended Database I → Realizer (Probabilistic Noise Generator / Noisy Channel) → Observed Unclean Database J

Input: we only observe J.
SLIDE 39

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator) → Clean Intended Database I → Realizer (Probabilistic Noise Generator / Noisy Channel) → Observed Unclean Database J

Input: we only observe J.

Problem 1: If we knew the Intension and the Realizer, can we recover I? Output: an estimate of the most probable I.

SLIDE 40

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator) → Clean Intended Database I → Realizer (Probabilistic Noise Generator / Noisy Channel) → Observed Unclean Database J

Input: we only observe J.

Problem 1: If we knew the Intension and the Realizer, can we recover I? Output: an estimate of the most probable I.

Problem 2: Given J, can we answer a query on I correctly? Output: Pr(a ∈ Q(I) | J).

SLIDE 41

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator) → Clean Intended Database I → Realizer (Probabilistic Noise Generator / Noisy Channel) → Observed Unclean Database J

Input: we only observe J.

Problem 1: If we knew the Intension and the Realizer, can we recover I? Output: an estimate of the most probable I.

Problem 2: Given J, can we answer a query on I correctly? Output: Pr(a ∈ Q(I) | J).

Problem 3: Can we learn the Intension and the Realizer? Can we do that from J alone (i.e., without any training data)? Output: an estimate of the Intension and the Realizer.

SLIDE 42

Data Cleaning

Problem Statement: Given the observed noisy database instance J, compute the most likely intended database instance I (MLI).

Question: How does data cleaning in PUDs compare to existing frameworks? We show that PUDs generalize them:

  • MLI in parfactor/subset PUDs generalizes cardinality repairs
  • MLI in parfactor/update PUDs generalizes minimum-update repairs

SLIDE 43

Data Cleaning

Problem Statement: Given the observed noisy database instance J, compute the most likely intended database instance I.

Question: Is data cleaning in the PUD framework efficient? In general, no: it is equivalent to probabilistic inference. However:

  • For parfactor/subset PUDs with key constraints (i.e., when errors are limited to duplicates), the MLI can be computed in polynomial time.
  • New result: an approximate inference algorithm with guarantees on the expected Hamming error w.r.t. I under a uniform noise model [Heidari, Ilyas, Rekatsinas, UAI 2019].

SLIDE 44

Approximate Inference in Structured Instances with Noisy Categorical Observations

Setup (with noise):
  • known graph G = (V, E)
  • unknown labeling X: V → {1, 2, …, k}
  • given a noisy parity of each edge, flipped with probability p
  • given a noisy observation for each node, altered with probability q

Goal: (approximately) recover X. Formally, we want an algorithm that finds a labeling X̂ minimizing the worst-case expected Hamming error:

    max_X  E_{L ∼ D(X)} [ error(X̂, X) ]
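The observation model above can be simulated directly (a sketch of the setup only, with a hypothetical helper; this is not the paper's inference algorithm):

```python
import random

def noisy_observations(labels, edges, k, p, q, rng):
    """Sample the setup: edge 'same label?' parities flip with probability p,
    node labels are replaced by a uniformly wrong label with probability q."""
    edge_obs = {}
    for u, v in edges:
        parity = labels[u] == labels[v]
        edge_obs[(u, v)] = parity if rng.random() >= p else not parity
    node_obs = {}
    for v, x in labels.items():
        if rng.random() < q:
            node_obs[v] = rng.choice([c for c in range(1, k + 1) if c != x])
        else:
            node_obs[v] = x
    return edge_obs, node_obs

def hamming_error(est, truth):
    return sum(est[v] != truth[v] for v in truth)

rng = random.Random(0)
labels = {1: 1, 2: 1, 3: 2}          # ground-truth labeling on a 3-node path
tree_edges = [(1, 2), (2, 3)]
edge_obs, node_obs = noisy_observations(labels, tree_edges, k=3, p=0.0, q=0.0, rng=rng)
# With p = q = 0 the observations are exact, so the naive estimate is perfect.
print(hamming_error(node_obs, labels))  # -> 0
```

Increasing p and q degrades both channels, which is exactly the regime the tree-decomposition algorithm on the next slides is designed for.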

SLIDE 45

Approximate Inference in Structured Instances with Noisy Categorical Observations

New Algorithm: a new approximate inference algorithm based on tree decompositions and correlation clustering, with guarantees on the worst-case expected Hamming error:

  • For trees, the Hamming error is upper bounded by Õ(log(k) · p · n).
  • For low-treewidth graphs, the Hamming error is upper bounded by Õ(k · log(k) · p^⌈Δ(G)/2⌉ · n).

SLIDE 46

Approximate Inference in Structured Instances with Noisy Categorical Observations

New Algorithm: a new approximate inference algorithm based on tree decompositions and correlation clustering, with guarantees on the worst-case expected Hamming error:

  • For trees, the Hamming error is upper bounded by Õ(log(k) · p · n).
  • For low-treewidth graphs, the Hamming error is upper bounded by Õ(k · log(k) · p^⌈Δ(G)/2⌉ · n).

It should hold that p < 1 / (k · log k) for the edge side information to be useful for statistical recovery.

SLIDE 47

PUD learning

Problem Statement: Assume a parametric representation of the Intension and the Realizer. We want to find the maximum-likelihood estimates for the parameters of these representations.

Supervised variant: We are given examples of both unclean databases and their clean versions.
Unsupervised variant: We are given only unclean databases.

Question: Can we learn a PUD? Can we do so without any training data?

  • We show standard learnability results for the supervised variant.
  • More interesting result: in the uniform noise model and under tuple independence, we can learn a PUD without any training data when the noise is bounded. A single instance J decomposes into multiple training examples, and under bounded noise the log-likelihood is convex.
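The unsupervised decomposition can be illustrated with a toy majority-vote estimator (a sketch under the uniform-noise, tuple-independence assumptions; grouping cells by an FD's left-hand side is a simplification, and the data is made up):

```python
from collections import Counter

# Cells that a constraint forces to agree act as repeated noisy observations
# of a single intended value, so one unclean instance J yields many training
# examples. Majority vote recovers each intended value; the disagreement rate
# estimates the (bounded, uniform) noise level.
groups = [
    ["Chicago", "Chicago", "Cicago", "Chicago"],  # City values sharing one Zip
    ["Evanston", "Evanston"],
    ["Chicago", "Chicago", "Chicago"],
]

repaired, noisy_cells, total = [], 0, 0
for obs in groups:
    value, count = Counter(obs).most_common(1)[0]
    repaired.append(value)
    noisy_cells += len(obs) - count
    total += len(obs)

noise_rate = noisy_cells / total
print(repaired, round(noise_rate, 3))  # -> ['Chicago', 'Evanston', 'Chicago'] 0.111
```

When the true noise rate is below 1/2 per group, the majority value is the intended one with high probability, which is the intuition behind the bounded-noise condition.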

SLIDE 48

From Theory to Systems

Is the PUDs framework useful in practice?

SLIDE 49

HoloClean: Probabilistic Data Repairs

Reference: HoloClean: Holistic Data Repairs with Probabilistic Inference. Rekatsinas, Chu, Ilyas, Ré, VLDB 2017.

HoloClean is the first practical probabilistic data-repairing engine and a state-of-the-art data-repairing system. HoloClean’s factor-graph model is an instantiation of the PUD Intension model. HoloClean uses clean cells as training data to learn its PUD Intension model and uses the learned model to approximate MLI repairs.

SLIDE 50

Challenge: Inference under constraints is #P-complete

Applying probabilistic inference naively does not scale to data-cleaning instances with millions of tuples.

  • Idea 1: Prune the domain of the random variables.
  • Idea 2: Relax constraints over sets of random variables to features over independent random variables.

HoloClean: Probabilistic Data Repairs
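Idea 1 can be sketched as follows (hypothetical helper: a cell's candidate repairs come from values that co-occur with the rest of its tuple, not from the full attribute domain):

```python
# Domain pruning: the candidate values for a cell are the values seen
# elsewhere in the data together with the tuple's other attribute values.
rows = [
    {"Zip": "60608", "City": "Chicago"},
    {"Zip": "60608", "City": "Cicago"},
    {"Zip": "60609", "City": "Chicago"},
    {"Zip": "60611", "City": "Evanston"},
]

def candidate_domain(row, attr, rows):
    """Values of `attr` co-occurring with any other attribute value of `row`."""
    context = [k for k in row if k != attr]
    return {r[attr] for r in rows if any(r[k] == row[k] for k in context)}

# Candidates for the City cell of the 'Cicago' tuple: only values seen with
# Zip 60608, so 'Evanston' is pruned away.
print(candidate_domain(rows[1], "City", rows))
```

Shrinking each random variable's domain this way directly shrinks the factor graph that inference has to run over.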

SLIDE 51

(Factor-graph figure: random variables t1.City, t1.Zip, t4.City, t4.Zip over the example tuples, connected by factors with weights w1, w2, w3 encoding the constraint “Zip → City” and the correlation “Address = 3465 S Morgan St”.)

Relaxing constraints

SLIDE 52

(After relaxation: independent random variables t1.City and t4.City with per-value factors such as “Assignment Chicago violates Zip → City due to t4” and “Assignment Cicago violates Zip → City due to t1”. We have one relaxed factor for each value in the domain of the random variable.)

Relaxing constraints

SLIDE 53

(Similarly for t1.Zip and t4.Zip: per-value factors such as “Assignment 60608 violates Zip → City due to t4” and “Assignment 60609 violates Zip → City due to t1”; again one relaxed factor for each value in the domain of the random variable.)

Relaxing constraints
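The per-value relaxation on these slides can be sketched as a feature generator (hypothetical helper; HoloClean's actual compilation is richer):

```python
# Instead of one joint factor tying t1.City and t4.City together, each single
# random variable gets one feature per candidate value, firing when that
# assignment would violate Zip -> City against some other (fixed) tuple.
tuples = {
    "t1": {"City": "Chicago", "Zip": "60608"},
    "t4": {"City": "Cicago",  "Zip": "60608"},
}

def relaxed_factors(rv_tid, attr, domain, lhs, tuples):
    feats = []
    for value in domain:
        for tid, row in tuples.items():
            if tid != rv_tid and row[lhs] == tuples[rv_tid][lhs] and row[attr] != value:
                feats.append((value,
                              f"Assignment {value} violates {lhs} -> {attr} due to {tid}"))
    return feats

print(relaxed_factors("t1", "City", ["Chicago", "Cicago"], "Zip", tuples))
print(relaxed_factors("t4", "City", ["Chicago", "Cicago"], "Zip", tuples))
```

Each emitted feature involves only one random variable, so the resulting model factorizes over independent variables and inference scales.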

SLIDE 54

HoloClean in practice

  • HoloClean: our approach, combining all signals and using inference.
  • Holistic [Chu, 2013]: state of the art for constraints & minimality.
  • KATARA [Chu, 2015]: state of the art for external data.
  • SCARE [Yakout, 2013]: state of the art for ML & qualitative statistics.

Competing methods either do not scale or do not perform correct repairs.

SLIDE 55

(Charts: F1-score and runtime for the Full vs. Relaxed model as domain pruning increases; more pruning lowers recall and increases precision.)

Faster compilation, learning, and inference when we prune the random-variable domains.

Relaxing constraints

SLIDE 56

(Charts: F1-score and runtime for the Full vs. Relaxed model as domain pruning increases.)

Increased robustness (more accurate repairs) when the random-variable domain is ill-specified (no heavy pruning used).

Relaxing constraints

SLIDE 57

Data Augmentation for Error Detection

Error Detection with Data Augmentation

(Architecture figure: a Data Augmentation Module performs transformation and policy learning, producing augmentation transformations Φ and a policy Π, and augments data using Π; a Cell Value Representation Module builds cell representations and labels; together they produce an augmented training dataset for the Model Training and Classification Module. Inputs: the dataset D, training examples T, and constraints Σ. Example augmented cells include the transformed values ‘Cicago’ and ‘EVP Cofee’.)

HoloDetect learns a PUD realizer and uses the learned realizer to generate synthetic training data to teach a deep neural network how to detect erroneous values. Reference: HoloDetect: A Few-Shot Learning Framework for Error Detection. Heidari, McGrath, Ilyas, Rekatsinas, SIGMOD 2019.

SLIDE 58

Error Detection:
  • Binary classification: for each cell, decide if it is erroneous or correct.
  • Severe class imbalance, high heterogeneity.
  • Assumption: it is easy for human annotators to provide examples of correct tuples.

Challenge: How can we obtain labeled data while minimizing the input from human annotators?

Data Augmentation for Error Detection

SLIDE 59

Data Augmentation for Error Detection

Approach: Analyze the input dataset and learn how errors are introduced (learn a noisy channel). Use the clean tuples as seeds and introduce artificial erroneous examples that obey the distribution of the noisy channel. Program synthesis: learn a program to introduce errors.
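A toy version of this learned noisy channel (the transformation family here is only single-character deletions and the example pairs are made up; HoloDetect learns a much richer transformation set and policy):

```python
import random

# Learn simple character-deletion transformations from a few (clean, dirty)
# cell pairs, then apply them to clean seed values to synthesize errors.
examples = [("Chicago", "Cicago"), ("Coffee", "Cofee")]

def learn_transformations(pairs):
    """Extract 'delete character c' rules that explain each clean -> dirty pair."""
    rules = set()
    for clean, dirty in pairs:
        for i, ch in enumerate(clean):
            if clean[:i] + clean[i + 1:] == dirty:
                rules.add(("delete", ch))
    return rules

def augment(value, rules, rng):
    """Apply one learned transformation to a clean seed value."""
    op, ch = rng.choice(sorted(rules))
    i = value.find(ch)
    return value if i < 0 else value[:i] + value[i + 1:]

rules = learn_transformations(examples)
print(rules)
print(augment("Shanghai", rules, random.Random(1)))
```

The synthesized (seed, corrupted) pairs then serve as positive training examples for the error classifier, sidestepping manual labeling of errors.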

SLIDE 60

Data Augmentation for Error Detection

SLIDE 61

Data Augmentation for Error Detection

Approach: Train a classifier to identify errors in the input data set

SLIDE 62

HoloDetect requires fewer training examples than competing approaches

Data Augmentation for Error Detection

SLIDE 63

AutoFD: Functional Dependency Discovery via Structure Learning

FD discovery as a structure-learning problem over a linear structured model. A lifted variation of structure learning using sparse regression (L1 regularization). 2x F1 improvement over the state of the art (including non-lifted structure-learning methods). Guarantees on FD discovery under a weak Realizer (bounded noise).

(Input noisy dataset figure: a restaurant table with attributes DBAName, Address, City, State, Zip Code; values include ‘Harry Caray’s’, ‘Mity Nice Bar’, ‘Foodlife’, and the misspelling ‘Cicago’.)

Input Noisy Dataset → Structure Learning:
  • Estimate the inverse covariance matrix of the lifted model.
  • Fit a linear model by decomposing the estimated inverse covariance.
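The two structure-learning steps can be sketched on synthetic numeric data (a toy of the inverse-covariance idea only; AutoFD's lifted model over categorical attributes and its sparse regression are more involved):

```python
import numpy as np

# Synthetic data: attribute B is a noisy function of A, C is independent.
rng = np.random.default_rng(0)
n = 2000
A = rng.integers(0, 5, n).astype(float)
B = A + 0.1 * rng.standard_normal(n)   # B depends on A
C = rng.standard_normal(n)             # C is independent of both

X = np.column_stack([A, B, C])
theta = np.linalg.inv(np.cov(X, rowvar=False))  # estimated precision matrix

# Normalize to partial correlations and threshold: strong off-diagonal
# entries suggest direct dependencies between attribute pairs.
d = np.sqrt(np.diag(theta))
partial = -theta / np.outer(d, d)
np.fill_diagonal(partial, 0.0)
deps = {(i, j) for i in range(3) for j in range(i + 1, 3)
        if abs(partial[i, j]) > 0.3}
print(deps)  # only the (A, B) pair survives the threshold
```

Only the A-B pair shows a strong partial correlation, which is the kind of candidate dependency the subsequent linear-model decomposition refines into FDs.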
SLIDE 64

AutoFD provides insights for downstream data preparation tasks

Effective feature engineering

SLIDE 65

AutoFD provides insights for downstream data preparation tasks

Provides insights on the effectiveness of automated data cleaning.

Ex: increased imputation accuracy for attributes with dependencies (in the output of AutoFD).
SLIDE 66

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator) → Clean Intended Database I → Realizer (Probabilistic Noise Generator / Noisy Channel) → Observed Unclean Database J

A formal noisy channel model that leads to new insights for managing noisy data and has immediate practical applications to data cleaning systems.

  • HoloClean
  • AutoFD
  • HoloDetect
SLIDE 67

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator) → Clean Intended Database I → Realizer (Probabilistic Noise Generator / Noisy Channel) → Observed Unclean Database J

A formal noisy channel model that leads to new insights for managing noisy data and has immediate practical applications to data cleaning systems and exciting connections to robust ML.

  • HoloClean
  • AutoFD
  • HoloDetect
SLIDE 68

The Probabilistic Unclean Database Model

Intension (Probabilistic Data Generator) → Clean Intended Database I → Realizer (Probabilistic Noise Generator / Noisy Channel) → Observed Unclean Database J

  • HoloClean
  • AutoFD
  • HoloDetect

Thank you! thodrek@cs.wisc.edu

A formal noisy channel model that leads to new insights for managing noisy data and has immediate practical applications to data cleaning systems and exciting connections to robust ML.