Sherlock Rules: Proof Positive and Negative in Data Cleaning



slide-1
SLIDE 1

Sherlock Rules

Proof Positive and Negative in Data Cleaning

Matteo Interlandi, Nan Tang

slide-2
SLIDE 2
Outline

  • Motivation
  • Sherlock Rules
  • Fundamental problems
  • Algorithms

slide-5
SLIDE 5

Roadblocks to Get Value from Data?

High Quality Data

  • Data Mining
  • Machine Learning
  • Rule Discovery

slide-11
SLIDE 11

D

name | nation | capital
Si   | China  | Beijing
Yan  | China  | Shanghai
Ian  | China  | Tokyo

Data repairing with nation -> capital gives a consistent D':

name | nation | capital
Si   | China  | Beijing
Yan  | China  | Beijing
Ian  | China  | Beijing

Proof positive and negative give an annotated D'':

name | nation | capital
Si   | China  | Beijing
Yan  | China  | Shanghai
Ian  | China  | Tokyo

Proof positive and negative help data repairing: Sherlock Rules
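
The motivating example can be made concrete with a small sketch. The Python snippet below finds the groups of tuples in D that violate nation -> capital. The function name `fd_violations` and the column positions are illustrative assumptions, not part of the slides. Note that the dependency only reports a conflict; it does not say which capital is right, which appears to be the gap that proof positive and negative are meant to fill.

```python
from collections import defaultdict

# Table D from the slide: (name, nation, capital).
D = [
    ("Si", "China", "Beijing"),
    ("Yan", "China", "Shanghai"),
    ("Ian", "China", "Tokyo"),
]

def fd_violations(rows, lhs, rhs):
    """Return {lhs value: set of rhs values} for every lhs value that maps
    to more than one rhs value, i.e. every violation of lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}

# Column 1 is nation, column 2 is capital.
violations = fd_violations(D, lhs=1, rhs=2)
print(violations)
```

All three tuples conflict with each other, so a repair that only enforces the dependency has no evidence for preferring Beijing over Shanghai or Tokyo.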

slide-12
SLIDE 12
Outline

  • Motivation
  • Sherlock Rules
  • Fundamental problems
  • Algorithms

slide-17
SLIDE 17

Proof Positive and Negative

   | name | dept | nation | capital  | bornat   | officePhn
t1 | Si   | DA   | China  | Beijing  | ChenYang | 28098001
t2 | Yan  | DA   | China  | Shanghai | Chengdu  | 24038698
t3 | Ian  | ALT  | China  | Beijing  | Hangzhou | 33668323

   | name | officePhn | mobile
r1 | Si   | 28098001  | 66700541
r2 | Yan  | 24038698  | 66706563
r3 | Ian  | 27364928  | 33668323

  • Proof Positive/Negative: t3[name] = Ian is correct, t3[officePhn] is wrong
  • Proof Positive/Negative, Correction: t3[name] = Ian is correct, t3[officePhn] = 27364928

slide-19
SLIDE 19

Proof Positive and Negative

   | name | dept | nation | capital  | bornat   | officePhn
t1 | Si   | DA   | China  | Beijing  | ChenYang | 28098001
t2 | Yan  | DA   | China  | Shanghai | Chengdu  | 24038698
t3 | Ian  | ALT  | China  | Beijing  | Hangzhou | 33668323

   | country | capital
s1 | China   | Beijing
s2 | Japan   | Tokyo
s3 | Chile   | Santiago

  • Proof Positive: t1[nation, capital] is correct, t3[nation, capital] is correct
  • Proof Positive/Negative: t3[name] = Ian is correct, t3[officePhn] is wrong
  • Proof Positive/Negative, Correction: t3[name] = Ian is correct, t3[officePhn] = 27364928
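
The three kinds of annotation on this slide can be reproduced with a short sketch. The Python code below is only an illustration under stated assumptions: the function names `capital_positive` and `phone_negative` are hypothetical, phone numbers are stored as strings, and the "office phone is really the mobile number" test is one plausible reading of why t3[officePhn] is judged wrong.

```python
# Data tuples t1..t3: (name, dept, nation, capital, bornat, officePhn).
D = {
    "t1": ("Si", "DA", "China", "Beijing", "ChenYang", "28098001"),
    "t2": ("Yan", "DA", "China", "Shanghai", "Chengdu", "24038698"),
    "t3": ("Ian", "ALT", "China", "Beijing", "Hangzhou", "33668323"),
}
# Master tuples s1..s3: (country, capital).
CAPITALS = {("China", "Beijing"), ("Japan", "Tokyo"), ("Chile", "Santiago")}
# Master tuples r1..r3: (name, officePhn, mobile).
PHONES = [
    ("Si", "28098001", "66700541"),
    ("Yan", "24038698", "66706563"),
    ("Ian", "27364928", "33668323"),
]

def capital_positive(data, capitals):
    """Tuple ids whose (nation, capital) pair appears in the master data."""
    return [tid for tid, t in data.items() if (t[2], t[3]) in capitals]

def phone_negative(data, phones):
    """Map tuple id -> master officePhn when the stored officePhn is
    actually that person's mobile number in the master data."""
    fixes = {}
    for tid, t in data.items():
        for name, office, mobile in phones:
            if t[0] == name and t[5] == mobile:
                fixes[tid] = office
    return fixes

print(capital_positive(D, CAPITALS))
print(phone_negative(D, PHONES))
```

The first call confirms t1 and t3 on [nation, capital]; the second flags t3[officePhn] and proposes 27364928, matching the annotations listed on the slide.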

slide-20
SLIDE 20

Sherlock Rules

D

   | name | dept | nation | capital  | bornat   | officePhn
t1 | Si   | DA   | China  | Beijing  | ChenYang | 28098001
t2 | Yan  | DA   | China  | Shanghai | Chengdu  | 24038698
t3 | Ian  | ALT  | China  | Beijing  | Hangzhou | 33668323

Dm

   | name | officePhn | mobile
r1 | Si   | 28098001  | 66700541
r2 | Yan  | 24038698  | 66706563
r3 | Ian  | 27364928  | 33668323

   | country | capital
s1 | China   | Beijing
s2 | Japan   | Tokyo
s3 | Chile   | Santiago

evidence, positive, negative

slide-24
SLIDE 24

Point of Innovation

Integrity Constraints

There do not exist t1, t2 with t1[X1] = t2[X2] but t1[B1] <> t2[B2]

Example: (China, Shanghai) and (China, Beijing): China = China, Shanghai <> Beijing

Sherlock Rules

If t1[X1] = t2[X2] and t1[B] = t2[B-], then t1[B] := t2[B+]

Example: data tuple (China, Shanghai), master tuple (China, Beijing, Shanghai)
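
A minimal sketch of how a rule of this shape might be applied is given below. Everything named here is an assumption for illustration: the class `SherlockRule`, the function `apply_rule`, and the master column names `country`, `capital` and `not_capital` (the slide shows only the values China, Beijing, Shanghai). The branch that marks a value positive when it equals B+ is also an assumption; the slide states only the negative-then-correct case.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SherlockRule:
    evidence: tuple   # pairs (data attribute, master attribute), i.e. X1 and X2
    target: str       # B, the data attribute being judged
    negative: str     # B-, master attribute holding a known-wrong value
    positive: str     # B+, master attribute holding the correct value

def apply_rule(rule, t, master):
    """Return (tuple after repair, marks) or None when the rule does not fire.
    marks maps attribute -> "+" (proof positive) or "-" (proof negative),
    describing the values the tuple had before any repair."""
    for m in master:
        if not all(t[a] == m[b] for a, b in rule.evidence):
            continue
        marks = {a: "+" for a, _ in rule.evidence}
        if t[rule.target] == m[rule.positive]:
            marks[rule.target] = "+"          # assumed: value confirmed
            return dict(t), marks
        if t[rule.target] == m[rule.negative]:
            marks[rule.target] = "-"          # value proven wrong
            repaired = dict(t)
            repaired[rule.target] = m[rule.positive]
            return repaired, marks
    return None

rule = SherlockRule(evidence=(("nation", "country"),), target="capital",
                    negative="not_capital", positive="capital")
master = [{"country": "China", "capital": "Beijing", "not_capital": "Shanghai"}]
repaired, marks = apply_rule(rule, {"nation": "China", "capital": "Shanghai"}, master)
print(repaired, marks)
```

Unlike the integrity constraint, which only detects that two tuples disagree, this rule needs a single data tuple plus master evidence, and it names both the wrong value and its replacement.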

slide-27
SLIDE 27

Applying Multiple Rules

Pos(t), Neg(t), Free(t)
slide-29
SLIDE 29

Sherlock Rules in Action

t1 (Si, DA, China, Beijing, ChenYang, 28098001)
t1 (Si+, DA, China, Beijing, ChenYang-, 28098001+)
t1 (Si+, DA, China, Beijing, ShenYang+, 28098001+)

Pos(t1)

slide-30
SLIDE 30

Transformation Rules

slide-31
SLIDE 31
Outline

  • Motivation
  • Sherlock Rules
  • Fundamental problems
  • Algorithms

slide-32
SLIDE 32

Fundamental Problems

  • Termination
  • Determinism
  • Consistency
  • Implication

(coNP-complete) (coNP-complete)

slide-33
SLIDE 33
Algorithms

  • Motivation
  • Sherlock Rules
  • Fundamental problems
  • Algorithms

slide-36
SLIDE 36

Algorithms

Naive Repairing

  • chase-based
  • O(|R| x |Sigma| x |M|)

Fast Repairing

  • Similarity indices to reduce |M| (BK-tree, FastSS, n-gram)
  • Inverted index to reduce |Sigma| (hash map)
  • O(|R| x |Sigma| x com(S))
  • Caching similarity index accesses
  • Rule pruning based on dependency
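
Of the similarity indices named here, the BK-tree is the simplest to sketch. The Python code below is a generic BK-tree, not the authors' implementation. Using Levenshtein distance as the metric is an assumption, and the word list simply reuses city names from the earlier slides as stand-in master values. The point is that a lookup visits only the subtrees that the triangle inequality cannot rule out, rather than comparing the query against every master value.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})   # node = (word, {distance: child node})
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return               # word is already stored
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def search(self, query, radius):
        """All stored words within `radius` edits of `query`."""
        hits, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= radius:
                hits.append(word)
            # Triangle inequality: a match can only sit under an edge
            # whose label lies in [d - radius, d + radius].
            for k, child in children.items():
                if d - radius <= k <= d + radius:
                    stack.append(child)
        return hits

tree = BKTree(["Beijing", "Shanghai", "ShenYang", "Chengdu", "Hangzhou", "Tokyo"])
print(tree.search("ChenYang", 1))
```

Searching for the dirty value ChenYang with radius 1 returns ShenYang, the value that replaces it in the Sherlock Rules in Action slide.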

slide-40
SLIDE 40

Rule Pruning Example

Rules: R1, R2, R3

t3 (Ian, ALT, Chine, Beijing, Hangzhou, 33668323)

iteration 1: {(R1, Yes), (R2, Yes), (R3, No)}
iteration 2: {(R1, Yes), (R2, No), (R3, No)}
iteration 3: {(R1, Yes), (R2, No), (R3, No)}

slide-42
SLIDE 42

Conclusion

  • Sherlock rules for accurately annotating and repairing data
  • Fundamental problems
  • Efficient algorithms

Future Work

  • Let SQL drive the Sherlock workhorse
  • Extend Sherlock rules to more data such as RDF (knowledge bases)