The Statistics of Dirty Data Sanjay Krishnan coax treasure out of - - PowerPoint PPT Presentation

the statistics of dirty data
SMART_READER_LITE
LIVE PREVIEW

The Statistics of Dirty Data Sanjay Krishnan coax treasure out of - - PowerPoint PPT Presentation

The Statistics of Dirty Data Sanjay Krishnan coax treasure out of messy, unstructured data 204 papers since 2012 in VLDB, ICDE, SIGMOD ( dirty data ) The Database Perspective Dirty data is a violation of constraints on a table. Data


slide-1
SLIDE 1

The Statistics of Dirty Data

Sanjay Krishnan

slide-2
SLIDE 2
slide-3
SLIDE 3

coax treasure out of messy, unstructured data

slide-4
SLIDE 4

204 papers since 2012 in VLDB, ICDE, SIGMOD (dirty data)

slide-5
SLIDE 5

The “Database” Perspective

  • Dirty data is a violation of constraints on a table.
  • Data Cleaning is constraint satisfaction

Name Role Salary 1 Jane Doe Emp 1700 2 John Smith Manager 1500 3 Raj Kumar Emp 1300 4 Maria Lopez Manager 4400

“No Manager Can Earn Less Than an Employee”

slide-6
SLIDE 6

The “Database” Perspective

6

“No Manager Can Earn Less Than an Employee”

Name Role Salary 1 Jane Doe Emp 1700 2 John Smith Manager 1500 Name Role Salary 1 Jane Doe Emp

1500

2 John Smith Manager 1500 Name Role Salary 1 Jane Doe

Manager

1700 2 John Smith Manager 1500

slide-7
SLIDE 7

7

slide-8
SLIDE 8

We Interact With Data in Fundamentally New Ways

slide-9
SLIDE 9

The Statistics of Dirty Data

tl;dr Formalism Good, Theory Needs Updating

  • SampleClean: Linking Data Repair To Statistical

Analysis.

  • AlphaClean: Synthesizing Data Cleaning Programs

With New AI Tools

  • Discussion
slide-10
SLIDE 10

Motivating Example

slide-11
SLIDE 11

Results After Cleaning

Author! Dirty! Clean! Rakesh'Agarwal! 353! 211! Jeffrey'Ullman! 460! 255! Michael'Franklin! 561! 173!

slide-12
SLIDE 12

Results After Cleaning

Author! Dirty! Clean! Rakesh'Agarwal! 353! 211! Jeffrey'Ullman! 460! 255! Michael'Franklin! 561! 173!

Did I need to clean everything?

slide-13
SLIDE 13

Sample-and-Clean

Dirty&& Database&

Sample' Clean& Sample&

SELECT COUNT(1) FROM Pubs GROUP BY Author

Sanjay Krishnan, Jiannan Wang, Michael Franklin, Ken Goldberg, Tim Kraska, Tova Milo, Eugene Wu. A Sample-and-Clean Framework for Fast and Accurate Query Processing on Dirty Data.

slide-14
SLIDE 14

What goes wrong?

Madden, Samuel R., et al. "TinyDB: an acquisitional query processing system for sensor networks." ACM Transactions on database systems (TODS) 30.1 (2005): 122-173. Madden, S. R., Franklin, M. J., Hellerstein, J. M., & Hong, W. "TinyDB: an acquisitional query processing system for sensor networks." ACM Transactions on database systems (TODS) 30.1 (2005): 122-173.

Duplicates!

slide-15
SLIDE 15

Probabilistic Interpretation

  • SUM, COUNT, AVG, VAR can be expressed as a mean.
  • SUM = size * mean
  • COUNT = size * frequency
  • Probabilistic Interpretation: Expected Values
slide-16
SLIDE 16

Transform Dirty Sample to Simulate Clean Sample

Cleaned' Database'

Φ(.)'

Dirty&& Database&

Sample' Clean& Sample&

slide-17
SLIDE 17

Algorithm 1: Direct Estimate

Query! Answer!

Samples( Clean( Samples(

Direct Estimate

Φ(.)%

slide-18
SLIDE 18

Samples( Clean( Samples(

Δ"

How much did the cleaning change the data?

Φ(.)%

Algorithm 2: Corrected Estimate

slide-19
SLIDE 19

Database! Query! Incorrect Answer!

Samples( Clean( Samples(

Δ"

Φ(.)%

Algorithm 2: Corrected Estimate

slide-20
SLIDE 20

Direct vs. Corrections

slide-21
SLIDE 21

MS Academic Results

100 300 500 700 900 1100 1300 1500 0.85 0.9 0.95 1 Number of Samples Cleaned Probability Of Correct Ordering

slide-22
SLIDE 22

Wisteria

Daniel Haas, Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Eugene Wu. Wisteria: Nurturing Scalable Data Cleaning Infrastructure. VLDB 2015.

slide-23
SLIDE 23

Salient Pieces

  • 1. Probability Measure over the database: user

sees the data under some type of observation model.

  • 2. Language for data cleaning with estimable

statistical properties.

  • 3. An aggregate query to estimate after some

adjustment of statistical changes.

slide-24
SLIDE 24

ActiveClean

  • How do we most efficiently clean data for a given

machine learning task?

Clean Data Feature Eng. Model Sel.

Database

Algorithm Sel.

[],[ ,[],[ ,[]

Expected Error

ActiveClean Prioritize Records To Clean Updates Analysis

Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Ken Goldberg, Eugene Wu. ActiveClean: Interactive Data Cleaning For Statistical Modeling. VLDB 2016.

  • Data Cleaning as a form of Stochastic Gradient

Descent.

slide-25
SLIDE 25

Dollars For Docs

  • 250,000 medical contribution

records

  • Manually labeled as

suspicious or not

  • Entity resolution errors in

company and drug names

slide-26
SLIDE 26

Dollars For Docs

10 20 30 40 10000 20000 30000 40000 50000

Model Error % # Records Cleaned (c) DfD Model Error

Sampling In Place Active Learning Active Clean No Cleaning

slide-27
SLIDE 27

There’s a bound for that

For a batch size b and iterations T, the ActiveClean stochastic gradient descent updates converge with rate: For strongly-convex models: For L-Lipschitz loss (e.g., SVM):

slide-28
SLIDE 28

Data Cleaning + Differential Privacy

Randomized DB

Data Cleaning Query Results

Sanjay Krishnan, Jiannan Wang, Michael J. Franklin, Ken Goldberg, Tim Kraska. PrivateClean: Data Cleaning and Differential Privacy. SIGMOD 2016.

  • Not very different from Sample-and-Clean!
slide-29
SLIDE 29

Streaming Systems

Users With Orders > $300 Users From USA Users From China

Database'

Updates

  • Approximate Maintenance as Sample-and-Clean
slide-30
SLIDE 30

Quantifying Incompleteness

Brandie Nonnecke*, Sanjay Krishnan*, et al.. DevCAFE 1.0: A Participatory Platform for Assessing Development Initiatives in the Field. IEEE GHTC. 2015 (Best Paper) Sanjay Krishnan, Jay Patel, Michael J. Franklin, and Ken

  • Goldberg. Social Influence Bias in Recommender Systems: A

Methodology for Learning, Analyzing, and Mitigating Bias in

  • Ratings. RecSys. Foster City, CA, USA. Oct 2014

Yeouhnoh Chung, Sanjay Krishnan , Tim Kraska. A Data Quality Metric (DQM). How to Estimate the Number of Undetected Errors in Data Sets. Under Review VLDB 2017.

  • Similar mechanisms but different estimators!
slide-31
SLIDE 31

The Statistics of Dirty Data

tl;dr Formalism Good, Theory Needs Updating

  • SampleClean: Linking Data Repair To Statistical

Analysis.

  • AlphaClean: Synthesizing Data Cleaning

Programs With New AI Tools

  • Discussion
slide-32
SLIDE 32

Quantifying Incompleteness

Brandie Nonnecke*, Sanjay Krishnan*, et al.. DevCAFE 1.0: A Participatory Platform for Assessing Development Initiatives in the Field. IEEE GHTC. 2015 (Best Paper) Sanjay Krishnan, Jay Patel, Michael J. Franklin, and Ken

  • Goldberg. Social Influence Bias in Recommender Systems: A

Methodology for Learning, Analyzing, and Mitigating Bias in

  • Ratings. RecSys. Foster City, CA, USA. Oct 2014
slide-33
SLIDE 33
slide-34
SLIDE 34

Hard To Disentangle From The Data

^[0-9] 1/2 -> $1.5

Analysis

replaceNames() findr(years,’’)

MODEL

Data Cleaning

slide-35
SLIDE 35

The “Database” Perspective

Name Role Salary 1 Jane Doe Emp 1700 2 John Smith Manager 1500 3 Raj Kumar Emp 1300 4 Maria Lopez Manager 4400

“No Manager Can Earn Less Than an Employee” Say what you want not how you get it

slide-36
SLIDE 36

Making Data Cleaning Declarative

Age.isFloat() {deleteToken(?), textToNumber(?), …}

Optimizer

deleteToken(‘year’) deleteToken(‘1/2’) textToNumber(‘Twenty’)

slide-37
SLIDE 37
  • Input: A formal language of transformations. (Actions)
  • Input: a quality function of the following form where 0 implies clean

(Reward):

  • Output: A composite transformation that optimizes

Data Cleaning is Planning

(State-Transition)

slide-38
SLIDE 38

Model-Free Search Commodity Clusters

slide-39
SLIDE 39

Specification

Library Generator Quality Functions Type Inference

Model-Free Search Execution

Logical Optimizer Literal Generalization

AlphaClean

slide-40
SLIDE 40

Example

df = pd.read_csv(‘uganda.csv', quotechar='\"', index_col=False)

slide-41
SLIDE 41

Specification API

patterns = [] #18 years old to 100, remove under 18 patterns += [Float('Age', [18, 100])] #Only alpha numeric values patterns += [Pattern('Response', “^[a-zA-Z0-9_]*$")] #Parish patterns += [CodeBook(‘Parish’, allowed_parishes)]

Patterns

Allowed Values a Column Can Take

slide-42
SLIDE 42

Specification API

dependencies = [] #Manual Collections Happened on a Specific Day dependencies += [CFD(‘Parish -> Day’, isManual)] #Logical Checks patterns += [DC(‘Age', “< 22”, “Children”, “< 5”)]

Dependencies

Allowed Relationships Between Columns

slide-43
SLIDE 43

Specification API

stats = [] #Expect Pos. Correlation stats += [Correlate(‘Age’, ‘Children’)] #Previous Year’s model stats += [Model]

Statistical Hints

Model the data is expected to follow

slide-44
SLIDE 44
  • Input: A formal language of transformations. (Actions)
  • Input: a quality function of the following form where 0 implies

clean (Reward):

  • Output: A composite transformation that optimizes

Data Cleaning is Planning

(State-Transition)

slide-45
SLIDE 45

The Language

findAndReplace(attribute, value1, value2) clip(attribute, threshold) filterToken(attribute, substring) regex(attribute, reg)

User Defines Templates

slide-46
SLIDE 46
  • Input: A formal language of transformations. (Actions)
  • Input: a quality function of the following form where 0 implies

clean (Reward):

  • Output: A composite transformation that optimizes

Data Cleaning is Planning

(State-Transition)

slide-47
SLIDE 47
  • Input: A formal language of transformations. (Actions)
  • Input: a quality function of the following form where 0 implies

clean (Reward):

  • Output: A composite transformation that optimizes

Data Cleaning is Planning

(State-Transition)

slide-48
SLIDE 48

Data Cleaning is Planning

NOOP() t1() t2() t3() t4() t(t1(..))

  • Typical errors are localized

(greedy fixes are safe)

  • Typical errors are systematic

(previous fixes give information about future fixes)

slide-49
SLIDE 49

Basic Algorithm

  • γ-greedy best first search

8 4 2 6 3 5

x x x

  • Keep a node on the frontier if it is within

γ of the current best result.

slide-50
SLIDE 50

Learning a Search Heuristic

?

4 2 8

slide-51
SLIDE 51

Learning a Search Heuristic

findAndReplace(‘Age’, ‘year’, ‘years’)

Featurize Transformations

deleteToken(‘Age’, ‘years’)

[0,0,1,0,…] [0.244,0.123,-1.293] word2vec 1-hot [0,0,1,0,…] 0.134 word2vec 1-hot

slide-52
SLIDE 52

GTD: Combinations

slide-53
SLIDE 53

Schema Integration

  • Link columns and enforce integrity constraints
  • Stock Dataset: There are 1000 ticker symbols from 55 sources for every

trading day in a month.

slide-54
SLIDE 54

Performance

  • Works well on systematic and localized errors
slide-55
SLIDE 55

Overfitting and Underfitting

Sensitivity Language Size Depth Data

slide-56
SLIDE 56

The Statistics of Dirty Data

tl;dr Formalism Good, Theory Needs Updating

  • SampleClean: Linking Data Repair To Statistical

Analysis.

  • AlphaClean: Synthesizing Data Cleaning Programs

With New AI Tools

  • Discussion
slide-57
SLIDE 57

Conclusion

  • Data Cleaning before Statistical Analytics:

[SIGMOD 14], [IEEE DEB 15], [VLDB 16]

  • Sampling and Approximation with Writes:

[VLDB 15], [SIGMOD 16]

  • Crowdsourcing’s Downstream Impact

[VLDB Demo 15], [RecSys 14], [VLDB review 17] Data Cleaning is a Statistical Problem www.ocf.berkeley.edu/~sanjayk