Improving Data Quality: Consistency and Accuracy Gao Cong, - - PowerPoint PPT Presentation

improving data quality consistency and accuracy
SMART_READER_LITE
LIVE PREVIEW

Improving Data Quality: Consistency and Accuracy Gao Cong, - - PowerPoint PPT Presentation

Improving Data Quality: Consistency and Accuracy Gao Cong, Microsoft Research Asia Wenfei Fan, University of Edinburgh, Bell Laboratories Floris Geerts, Univ. of Edinburgh, Hasselt Univ., transnational Univ. Limburg Xibei Jia , University


slide-1
SLIDE 1

Improving Data Quality: Consistency and Accuracy

Gao Cong, Microsoft Research Asia Wenfei Fan, University of Edinburgh, Bell Laboratories Floris Geerts,

  • Univ. of Edinburgh,

Hasselt Univ., transnational Univ. Limburg Xibei Jia, University of Edinburgh Shuai Ma, University of Edinburgh

25th September 2007

slide-2
SLIDE 2
  • 25 September 2007

Dirty data are costly

Typical data error rate in industry: 1% - 5%, up

to 30%

Poor data cost US companies $600 billion

annually

30%-80% of the development time for data

cleaning in a data warehousing data warehousing project

CIA intelligence on WMD in Iraq!

These dirty data need to be cleaned These dirty data need to be cleaned (semi (semi-

  • )automatically

)automatically ! !

slide-3
SLIDE 3
  • 25 September 2007

Constraint-based data cleaning

Constraint-based data cleaning

Define a set of constraints to model the data Errors in data are captured as violations of these

constraints

These violations are then repaired to improve data

quality

Constraints used in previous data cleaning tools

Functional Dependencies Inclusion Dependencies Denial Constraints …

Are these traditional constraints Are these traditional constraints sufficient for cleaning data? sufficient for cleaning data?

slide-4
SLIDE 4
  • 25 September 2007

Functional Dependencies (FDs)

t4 t3 t2 t1 131 215 215 215

AC

EH8 9LE CHI 44 John 19355 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name [ CC, AC ] [ CC, AC ]

  • [ City ]

[ City ] These data are consistent, but are they clean?

slide-5
SLIDE 5
  • 25 September 2007

FDs CFDs: flashback

[ CC, AC ] [ CC, AC ]

  • [ City ]

[ City ]

  • EDI
  • 131
  • 44

【 【 【 【 【 【 【 【 CC CC, , , , , , , ,AC AC 】 】 】 】 】 】 】 】

【 【 【 【 【 【 【 City City 】 】 】 】 】 】 】 】

Data integration in real-life: source constraints

hold on a subset of sources hold conditionally on the integrated data

They are NOT expressible as traditional FDs

do not hold on the entire relation contain constant data values

for schema design for data cleaning

FDs FDs CFDs CFDs

slide-6
SLIDE 6
  • 25 September 2007

Conditional Functional Dependencies (CFDs)

  • EDI
  • 131
  • 44

【 【 【 【 【 【 【 【 CC CC, , , , , , , , AC AC 】 】 】 】 】 】 】 】

【 【 【 【 【 【 【 City City 】 】 】 】 】 】 】 】

t4 t3 t2 t1 131 215 215 215

AC

EH8 9LE CHI 44 John 19355 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

slide-7
SLIDE 7
  • 25 September 2007

Our data cleaning framework

  • ptional module
slide-8
SLIDE 8
  • 25 September 2007

Automatically find a repair

Input: a relational database DB, and a set Σ of CFDs Output: a repair DB’ of DB such that cost(DB’, DB) is minimal

Cost Model

CFDs

Minimally Differ

⊨ ⊭

Complexity: It is known that finding an optimal repair is NP-complete even for a fixed set of FDs. It remains intractable for CFDs.

repair: DB’ ⊨ Σ “good”: cost(DB’, DB)

DB’ is “close” to the original

data in DB

Minimizing changes to

“accurate” attributes

Find effective heuristics for repairing databases based on Find effective heuristics for repairing databases based on CFDs CFDs. .

slide-9
SLIDE 9
  • 25 September 2007

Equivalence Class

[ CC, AC ] [ CC, AC ]

t4 t3 t2 t1 312 215 215 215

AC

60132 CHI 1 John 60132 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

  • [ City ]

[ City ]

slide-10
SLIDE 10
  • 25 September 2007

Equivalence Class

[ CC, AC ] [ CC, AC ]

t4 t3 t2 t1 312 215 215 215

AC

60132 CHI 1 John 60132 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

  • [ City ]

[ City ]

E1 E1

slide-11
SLIDE 11
  • 25 September 2007

Equivalence Class

Separate

The decision of which attribute values

which attribute values need to be equivalent

The decision of exactly what value

what value an EC should be assigned

Avoid poor local decisions

poor local decisions [ CC, AC ] [ CC, AC ]

  • [ City ]

[ City ]

t4 t3 t2 t1 312 215 215 215

AC

60132 CHI 1 John 60132 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

E1 E1 E2 E2

slide-12
SLIDE 12
  • 25 September 2007

Merge equivalence classes

[ CC, AC ] [ CC, AC ]

  • [ City ]

[ City ]

t4 t3 t2 t1 312 215 215 215

AC

60132 CHI 1 John 60132 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

[ ZIP ] [ ZIP ]

  • [ City ]

[ City ]

E1 E1 E2 E2

slide-13
SLIDE 13
  • 25 September 2007

Merge equivalence classes

[ CC, AC ] [ CC, AC ]

  • [ City ]

[ City ]

t4 t3 t2 t1 312 215 215 215

AC

60132 CHI 1 John 60132 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

[ ZIP ] [ ZIP ]

  • [ City ]

[ City ]

E3 = E1 E3 = E1 ∪ ∪ ∪ ∪ ∪ ∪ ∪ ∪ E2 E2 E1 E1 E2 E2

slide-14
SLIDE 14
  • 25 September 2007

Merge equivalence classes

[ CC, AC ] [ CC, AC ]

  • [ City ]

[ City ]

t4 t3 t2 t1 312 215 215 215

AC

60132 CHI 1 John 60132 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

[ ZIP ] [ ZIP ]

  • [ City ]

[ City ]

E3 = E1 E3 = E1 ∪ ∪ ∪ ∪ ∪ ∪ ∪ ∪ E2 E2 E3 E3

slide-15
SLIDE 15
  • 25 September 2007

FDs CFDs: does it work?

t4 t3 t2 t1 312 215 215 215

AC

60132 CHI 1 John 60132 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

CHI 60132 【 【 【 【 【 【 【 【 ZIP ZIP 】 】 】 】 】 】 】 】 【 【 【 【 【 【 【 【 City City 】 】 】 】 】 】 】 】 AC AC 】 】 】 】 】 】 】 】

【 【 【 【 【 【 【 City City 】 】 】 】 】 】 】 】 PHI 215 1 【 【 【 【 【 【 【 【 CC CC, , , , , , , ,

E3: E3: PHI PHI

slide-16
SLIDE 16
  • 25 September 2007

FDs CFDs: does it work?

t4 t3 t2 t1 312 215 215 215

AC

60132 CHI 1 John 60132 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

CHI 60132 【 【 【 【 ZIP 】 】 】 】 【 【 【 【 City 】 】 】 】 AC 】 】 】 】 【 【 【 【 City 】 】 】 】 PHI 215 1 【 【 【 【 CC, , , ,

E3: E3: CHI CHI

slide-17
SLIDE 17
  • 25 September 2007

FDs CFDs: it doesn’t work

t4 t3 t2 t1 312 215 215 215

AC

60132 CHI 1 John 60132 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

CHI 60132 【 【 【 【 ZIP 】 】 】 】 【 【 【 【 City 】 】 】 】 AC 】 】 】 】 【 【 【 【 City 】 】 】 】 PHI 215 1 【 【 【 【 CC, , , ,

E3: E3: PHI PHI

FD repair FD repair alg alg. . doesn doesn’ ’t t even even terminate terminate for CFD! for CFD!

slide-18
SLIDE 18
  • 25 September 2007

CFD repair

To resolve CFD violations, we allow

merge ECs

  • upgrade EC

upgrade EC (different from repairing FD)

Change both

RHS attributes and LHS attributes

LHS attributes (different from repairing FD)

We do not “invent

invent” values: choose value from active domain active domain

If there is no suitable value from active domain, put “null”

Guarantees termination

termination and correctness correctness (DB' satisfies all constraints)

slide-19
SLIDE 19
  • 25 September 2007

Cost Model: weight and distance

Based on both

  • weight

weight: estimate the accuracy of the attributes values to be modified

Could be obtained by data provenance …

and distance

distance: measure the “closeness” of the new value to the

  • riginal one

Intuitively

the more accurate the original value is

the less reasonable to change the value

the more distant the new value is from the original one

the less reasonable of this change

As will be seen soon

although the cost model incorporate

incorporate the weight information, the cleaning algorithm also works also works in the absence of it

Cost(u,v) = weight(t, A) * distance(u,v) / max(|u|,|v|)

slide-20
SLIDE 20
  • 25 September 2007

CFD: upgrade equivalence classes

t4 t3 t2 t1 312 215 215 215

AC

60132 CHI 1 John 60132 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

E1: PHI Fixed Fixed

CHI 60132 【 【 【 【 【 【 【 【 ZIP ZIP 】 】 】 】 】 】 】 】

【 【 【 【 【 【 【 City City 】 】 】 】 】 】 】 】 AC AC 】 】 】 】 】 】 】 】

【 【 【 【 【 【 【 City City 】 】 】 】 】 】 】 】 PHI

  • 215
  • 1

【 【 【 【 【 【 【 CC CC, , , , , , , , Target value of equivalence class E

targ(E) = not fixed not fixed

  • fixed

fixed : upgrade

E2 Not Fixed Not Fixed

slide-21
SLIDE 21
  • 25 September 2007

Change LHS attribute

[ CC, AC ] [ City ]

t4 t3 t2 t1 312 215 215 215

AC

60132 CHI 1 John 60132 PHI 1 Paul 19132 PHI 1 Joe 19132 PHI 1 Ben

ZIP City CC Name

CHI 60132 【 【 【 【 【 【 【 【 ZIP ZIP 】 】 】 】 】 】 】 】

【 【 【 【 【 【 【 City City 】 】 】 】 】 】 】 】 AC AC 】 】 】 】 】 】 】 】

【 【 【 【 【 【 【 City City 】 】 】 】 】 】 】 】 PHI

  • 215
  • 1

【 【 【 【 【 【 【 CC CC, , , , , , , ,

E1: PHI Fixed Fixed E2 Not Fixed Not Fixed

slide-22
SLIDE 22
  • 25 September 2007

Resolving CFD violations

Terminate

Each step

Either the number of original

  • riginal ECs

ECs is reduced

Or the number of upgraded

upgraded ECs ECs is increased

There are bounds for the number of ECs

ECs and upgraded upgraded ECs ECs

Correct

the output database is guaranteed to satisfy the CFDs

t t t t t t t t t t t t t t t t t t t t t t tt tt t t tt t t t t t t t t t t t t t t tt t t merge upgrade t t t t t t t t t t t t tt t t merge & upgrade

slide-23
SLIDE 23
  • 25 September 2007

Incremental repair

CFDs

⊨ ⊨ ⊨ ⊨

Now we have obtained a clean database:

slide-24
SLIDE 24
  • 25 September 2007

Incremental repair

CFDs

+

When the cleaned database is updated …

⊭ ⊭ ⊭ ⊭

slide-25
SLIDE 25
  • 25 September 2007

Incremental repair

Input: a clean database DB, changes ∆DB to DB, and a set Σ of CFDs Output: a repair DB’ of DB + ∆DB One might think that the incremental repairing problem is simpler than its batch (non-incremental) counterpart …

Cost Model

CFDs

Minimally Differ

+

⊭ ⊨

+

slide-26
SLIDE 26
  • 25 September 2007

Incremental repair

Input: a clean database DB, changes ∆DB to DB, and a set Σ of CFDs Output: a repair DB’ of DB + ∆DB Find effective heuristic algorithms for incrementally repairing databases based on CFDs.

  • Complexity. The local data cleaning problem is also NP-

complete, even if ∆DB consists of a single tuple.

Cost Model

CFDs

Minimally Differ

+

⊭ ⊨

+

slide-27
SLIDE 27
  • 25 September 2007

Repair a tuple: local repair

EH8 9LE EDI 131 44 Peter t2 19112 PHI 215 1 Mark t1

AC ZIP City CC Name

NYC 10112 【 【 【 【 ZIP 】 】 】 】 【 【 【 【 City 】 】 】 】 AC 】 】 】 】 【 【 【 【 City 】 】 】 】

【 【 【 CC, , , ,

Greedily finds the “ “best best” ” set of attributes set of attributes to modify in order to create a repair.

t3 215 10112 CHI 1 Eric

slide-28
SLIDE 28
  • 25 September 2007

Repair a tuple: local repair

EH8 9LE EDI 131 44 Peter t2 19112 PHI 215 1 Mark t1 t3 215

AC

10112 CHI 1 Eric

ZIP City CC Name

NYC 10112 【 【 【 【 ZIP 】 】 】 】 City AC 】 】 】 】 City

【 【 【 CC, , , ,

Since one attribute

  • ne attribute is not enough to fix this violation,

we consider two attributes two attributes …

slide-29
SLIDE 29
  • 25 September 2007

Repair a tuple: local repair

EH8 9LE EDI 131 44 Peter t2 19112 PHI 215 1 Mark t1 t3 215

AC

19112 PHI 1 Eric

ZIP City CC Name

NYC 10112 【 【 【 【 ZIP 】 】 】 】 City AC 】 】 】 】 City

【 【 【 CC, , , ,

Techniques to reduce the search space reduce the search space and using index index to optimize this process

slide-30
SLIDE 30
  • 25 September 2007

Repair a group of tuples: ordering

The order of the tuples to repair

has no impact on the termination

termination

impact repairing accuracy

accuracy and performance performance

Orders used

linear-scan: bad

L-IncRepair

based on weights: good

W-IncRepair: repair tuples with more weights

more weights first

based on violations: good

V-IncRepair: repair tuples with less violations

less violations first

  • Independent of weights

Independent of weights

slide-31
SLIDE 31
  • 25 September 2007

Consistent, but accurate?

CFDs

+

Consistent, but … We can automatically automatically find a repair. We can also incrementally incrementally find a repair in response to database updates. Would the automatically generated repair be what the user wants what the user wants? it is better to involve domain experts involve domain experts to inspect the repairs. To meet the expectation expectation of the user

slide-32
SLIDE 32
  • 25 September 2007

Assess accuracy of repairs

However, it is not realistic

not realistic to manually inspect each editing when dealing with large dataset

How to ensure that the repairs are

accurate enough without excessive user without excessive user interaction interaction?

A statistical method to guarantee the

accuracy of the repairs are above a predefined bound with a high confidence.

slide-33
SLIDE 33
  • 25 September 2007

Sampling

Repair

sampling

CFDs

Involve the user to

inspect small samples edit both the sample data

sample data and input input CFDs CFDs if necessary

invoke automated repairing methods

automated repairing methods to revise repairs

Stratified sampling method

give priority to strata that are more likely to be inaccurate ensure the accuracy

accuracy of the repairs are above a predefined bound with a high confidence.

slide-34
SLIDE 34
  • 25 September 2007

Experimental setting

Prototype system

  • Con

Con2

2Clean

Clean (in Java)

Data

we scraped real-life data from web Generate datasets of various sizes, 10k to 300k tuples

Constraints

Fairly large since each pattern tuple is in fact a constraint

7 CFDs 300---5,000 pattern tuples for each of these CFDs

Clean data

Initial datasets are “correct” data, consistent with all CFDs

Dirty data: error rate 1% to 10%

Randomly add noise to an attribute

New value close to the original one Or an arbitrary existing value taken from another tuple

slide-35
SLIDE 35
  • 25 September 2007

Accuracy of CFDs vs FDs

slide-36
SLIDE 36
  • 25 September 2007

Scalability over Noise Rate

slide-37
SLIDE 37
  • 25 September 2007

Conclusion and future work

A framework for improving data quality: both

consistency and accuracy

  • Automatic

Automatic part: guarantee termination and correctness

Batch repair Incremental repair: optional

  • ptional
  • Semi

Semi-

  • automatic

automatic part

Statistical methods: optional

  • ptional

Guarantee accuracy above a predefined bound without

excessive user interaction

A data cleaning framework using constraints specially designed for improving data quality.

Future

Automated methods for discovering CFDs Repair algorithms for other conditional constraints