Proof Positive and Negative in Data Cleaning
Matteo Interlandi Nan Tang
Outline
Motivation
Sherlock Rules
Fundamental problems
Algorithms
Roadblocks to Get Value
Data Mining Machine Learning Rule Discovery
name  nation  capital
Si    China   Beijing
Yan   China   Shanghai
Ian   China   Tokyo

nation -> capital

Data repairing: consistent D’

name  nation  capital
Si    China   Beijing
Yan   China   Beijing
Ian   China   Beijing

Proof positive and negative: annotated D”

name  nation  capital
Si    China   Beijing
Yan   China   Shanghai
Ian   China   Tokyo

Proof positive and negative help data repairing.
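The sketch below illustrates why the dependency nation -> capital alone is not enough: it can report that the three tuples conflict, but it cannot say which capital is right. The function name `fd_violations` and the tuple layout are assumptions made for illustration, not part of the original slides.

```python
from collections import defaultdict

# Toy table from the slide: (name, nation, capital).
D = [
    ("Si", "China", "Beijing"),
    ("Yan", "China", "Shanghai"),
    ("Ian", "China", "Tokyo"),
]

def fd_violations(rows, lhs, rhs):
    """Group rows by the lhs column and keep groups with more than one rhs value."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    return {key: values for key, values in groups.items() if len(values) > 1}

# Column 1 is nation, column 2 is capital.
violations = fd_violations(D, lhs=1, rhs=2)
print(violations)
```

The result lists all three capitals for China without ranking them, so a repair that picks Beijing is a heuristic choice rather than something the dependency proves.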
Sherlock Rules
      name  dept  nation  capital   bornat    officePhn
t1    Si    DA    China   Beijing   ChenYang  28098001
t2    Yan   DA    China   Shanghai  Chengdu   24038698
t3    Ian   ALT   China   Beijing   Hangzhou  33668323

      name  officePhn  mobile
r1    Si    28098001   66700541
r2    Yan   24038698   66706563
r3    Ian   27364928   33668323

      country  capital
s1    China    Beijing
s2    Japan    Tokyo
s3    Chile    Santiago

Proof Positive/Negative, Correction: t3[name] (Ian) is correct, t3[officePhn] = 27364928
Proof Positive/Negative: t3[name] (Ian) is correct, t3[officePhn] is wrong
Proof Positive: t1[nation, capital] is correct, t3[nation, capital] is correct

evidence, positive, negative
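As a rough illustration of the annotations above, the following sketch checks t3 against the phone reference table. It assumes a rule in which name is the evidence, the reference mobile column is the negative value, and the reference officePhn column is the positive value; the function `annotate_phone` and the dictionary layout are hypothetical.

```python
# Tuple t3 from the data table and the phone reference table from the slide.
t3 = {"name": "Ian", "dept": "ALT", "nation": "China",
      "capital": "Beijing", "bornat": "Hangzhou", "officePhn": "33668323"}

phones = [
    {"name": "Si", "officePhn": "28098001", "mobile": "66700541"},
    {"name": "Yan", "officePhn": "24038698", "mobile": "66706563"},
    {"name": "Ian", "officePhn": "27364928", "mobile": "33668323"},
]

def annotate_phone(t, reference):
    """Return (annotations, repaired copy of t) for the assumed phone rule."""
    marks = {}
    fixed = dict(t)
    for r in reference:
        if t["name"] != r["name"]:
            continue  # evidence does not match this reference tuple
        if t["officePhn"] == r["mobile"]:
            # Negative match: the mobile number was stored as the office phone.
            marks["name"] = "+"
            marks["officePhn"] = "-"
            fixed["officePhn"] = r["officePhn"]
        elif t["officePhn"] == r["officePhn"]:
            # Positive match: the office phone agrees with the reference.
            marks["name"] = "+"
            marks["officePhn"] = "+"
    return marks, fixed

marks, fixed = annotate_phone(t3, phones)
print(marks)
print(fixed["officePhn"])
```

Here the name is marked positive, the office phone is marked negative, and the repaired copy carries 27364928, matching the first annotation on the slide.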
Integrity Constraints
There do not exist t1, t2 with t1[X1] = t2[X2] but t1[B1] <> t2[B2]
(China, Shanghai) (China, Beijing)
If t1[X1] = t2[X2] and t1[B] = t2[B-], then t1[B] := t2[B+]
(China, Shanghai)
(China, Beijing, Shanghai)
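The rule form above can be sketched generically as follows. This is a minimal illustration under stated assumptions: the `Rule` class, the `apply_rule` function, and the attribute name `wrong_capital` for the B- column are all invented here, and exact equality stands in for whatever matching the rules actually use.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    x: str      # evidence attribute in the data tuple (X1)
    x_ref: str  # matching attribute in the reference tuple (X2)
    b: str      # attribute to check in the data tuple (B)
    b_neg: str  # reference attribute holding a known-wrong value (B-)
    b_pos: str  # reference attribute holding the correct value (B+)

def apply_rule(rule, t, ref):
    """Return a repaired copy of t if the rule fires on ref, else t unchanged."""
    if t[rule.x] == ref[rule.x_ref] and t[rule.b] == ref[rule.b_neg]:
        return {**t, rule.b: ref[rule.b_pos]}
    return t

rule = Rule(x="nation", x_ref="country", b="capital",
            b_neg="wrong_capital", b_pos="capital")
# Reference tuple (China, Beijing, Shanghai) from the slide.
ref = {"country": "China", "capital": "Beijing", "wrong_capital": "Shanghai"}

repaired = apply_rule(rule, {"nation": "China", "capital": "Shanghai"}, ref)
untouched = apply_rule(rule, {"nation": "China", "capital": "Tokyo"}, ref)
print(repaired)
print(untouched)
```

In this sketch the tuple (China, Shanghai) is rewritten to (China, Beijing), while (China, Tokyo) is left alone because Tokyo is not listed as a negative value for China.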
t1 (Si, DA, China, Beijing, ChenYang, 28098001)
t1 (Si+, DA, China, Beijing, ChenYang-, 28098001+)
t1 (Si+, DA, China, Beijing, ShenYang+, 28098001+)
(coNP-complete) (coNP-complete)
O(|R| x |Sigma| x |M|)
Similarity indices to reduce |M| (BK-tree, FastSS, n-gram)
Inverted index to reduce |Sigma| (hash map)
O(|R| x |Sigma| x com(S))
Caching similarity index accesses
Rule pruning based on dependency
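To give a feel for the n-gram option among the similarity indices listed above, here is a small sketch of a bigram index that returns only candidate values sharing enough bigrams with a query, so that a full scan is avoided. It assumes M stands for the reference data being matched against; the function names and the `min_shared` threshold are illustrative choices, not taken from the slides.

```python
from collections import defaultdict

def ngrams(s, n=2):
    """Set of lowercase character n-grams of s."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_index(values, n=2):
    """Map each n-gram to the set of values that contain it."""
    index = defaultdict(set)
    for v in values:
        for g in ngrams(v, n):
            index[g].add(v)
    return index

def candidates(index, query, n=2, min_shared=2):
    """Values sharing at least min_shared n-grams with the query."""
    counts = defaultdict(int)
    for g in ngrams(query, n):
        for v in index.get(g, ()):
            counts[v] += 1
    return {v for v, c in counts.items() if c >= min_shared}

index = build_index(["China", "Japan", "Chile"])
result = candidates(index, "Chine")
print(result)
```

The candidates would still need to be verified with an actual similarity function such as edit distance; the index only narrows down which reference values are worth comparing.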
Rules: R1, R2, R3
t3(Ian, ALT, Chine, Beijing, Hangzhou, 33668323)
iteration 1: {(R1, Yes), (R2, Yes), (R3, No)}
iteration 2: {(R1, Yes), (R2, No), (R3, No)}
iteration 3: {(R1, Yes), (R2, No), (R3, No)}
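The iterations above stop once the per-rule status no longer changes. The sketch below shows one simple way such a fixpoint loop could look; the rule bodies are not recoverable from the slide, so `r1`, `r2`, and `r3` are invented stand-ins, and the logged booleans mean "this rule changed the tuple in this pass", which is not necessarily what Yes and No denote on the slide.

```python
def chase(t, rules):
    """Apply all rules repeatedly until a full pass changes nothing."""
    t = dict(t)
    log = []
    while True:
        fired = {}
        for name, rule in rules.items():
            new_t = rule(t)
            fired[name] = new_t != t
            t = new_t
        log.append(fired)
        if not any(fired.values()):
            return t, log

# Hypothetical rules, written as functions from tuple to tuple.
def r1(t):
    # Replace a mobile number stored as Ian's office phone.
    if t["name"] == "Ian" and t["officePhn"] == "33668323":
        return {**t, "officePhn": "27364928"}
    return t

def r2(t):
    # Fix the capital, but only once the nation is known to be China.
    if t["nation"] == "China" and t["capital"] == "Shanghai":
        return {**t, "capital": "Beijing"}
    return t

def r3(t):
    # Correct a misspelled nation.
    return {**t, "nation": "China"} if t["nation"] == "Chine" else t

t3 = {"name": "Ian", "dept": "ALT", "nation": "Chine",
      "capital": "Beijing", "bornat": "Hangzhou", "officePhn": "33668323"}
final, log = chase(t3, {"R1": r1, "R2": r2, "R3": r3})
print(final)
print(log)
```

Tracking which rules fired in each pass is also what makes pruning possible: a rule whose inputs did not change since the last pass does not need to be re-evaluated.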
accurately annotating and repairing data
Sherlock workhorse
to more data such as RDF (knowledge bases)