Sherlock Rules: Proof Positive and Negative in Data Cleaning - Matteo Interlandi, Nan Tang (presentation)

  1. Sherlock Rules: Proof Positive and Negative in Data Cleaning. Matteo Interlandi, Nan Tang

  2. Outline • Motivation • Sherlock Rules • Fundamental problems • Algorithms

  3-5. Roadblocks to Get Value from Data? Data Mining, Machine Learning, Rule Discovery. High Quality Data.

  6. Table D:

       name  nation  capital
       Si    China   Beijing
       Yan   China   Shanghai
       Ian   China   Tokyo

  7-8. Data repairing: D is made consistent with nation -> capital, giving D'.

       D:
         name  nation  capital
         Si    China   Beijing
         Yan   China   Shanghai
         Ian   China   Tokyo

       D' (consistent):
         name  nation  capital
         Si    China   Beijing
         Yan   China   Beijing
         Ian   China   Beijing
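The repair above starts from a violation of nation -> capital. A minimal Python sketch of that detection step follows, using table D from the slides; the function name `fd_violations` and the column indices are illustrative assumptions, not something the presentation specifies. It shows that the dependency flags the conflict without saying which capital is correct.

```python
from collections import defaultdict

# Table D from the slides: (name, nation, capital)
D = [
    ("Si", "China", "Beijing"),
    ("Yan", "China", "Shanghai"),
    ("Ian", "China", "Tokyo"),
]

def fd_violations(rows, lhs, rhs):
    """Group rows by the lhs column and keep groups with more than one rhs value."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    return {key: values for key, values in groups.items() if len(values) > 1}

# Check nation -> capital (column 1 determines column 2).
violations = fd_violations(D, lhs=1, rhs=2)
print(violations)
```

The result groups all three capitals under China, so any repair that relies on the dependency alone has to guess which value to keep.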

  9-11. Data repairing turns D into a consistent D' (nation -> capital); proof positive and negative turns D into an annotated D''. Sherlock Rules help.

       D:
         name  nation  capital
         Si    China   Beijing
         Yan   China   Shanghai
         Ian   China   Tokyo

       D' (consistent):
         name  nation  capital
         Si    China   Beijing
         Yan   China   Beijing
         Ian   China   Beijing

       D'' (annotated; the cell annotations shown on the slide are not preserved in this transcript):
         name  nation  capital
         Si    China   Beijing
         Yan   China   Shanghai
         Ian   China   Tokyo

  12. Outline • Motivation • Sherlock Rules • Fundamental problems • Algorithms

  13-17. Proof Positive and Negative

       Table t:
             name  dep  nation  capital   bornat    officePhn
         t1  Si    DA   China   Beijing   ChenYang  28098001
         t2  Yan   DA   China   Shanghai  Chengdu   24038698
         t3  Ian   ALT  China   Beijing   Hangzhou  33668323

       Table r:
             name  officePhn  mobile
         r1  Si    28098001   66700541
         r2  Yan   24038698   66706563
         r3  Ian   27364928   33668323

       r3 matches t3 on name, and t3[officePhn] equals r3[mobile].

       Proof Positive/Negative, Correction: t3[name] (Ian) is correct; t3[officePhn] = 27364928.
       Proof Positive/Negative: t3[name] (Ian) is correct; t3[officePhn] is wrong.

  18-19. Proof Positive and Negative

       Table t:
             name  dep  nation  capital   bornat    officePhn
         t1  Si    DA   China   Beijing   ChenYang  28098001
         t2  Yan   DA   China   Shanghai  Chengdu   24038698
         t3  Ian   ALT  China   Beijing   Hangzhou  33668323

       Table s:
             country  capital
         s1  China    Beijing
         s2  Japan    Tokyo
         s3  Chile    Santiago

       Proof Positive/Negative, Correction: t3[name] (Ian) is correct; t3[officePhn] = 27364928.
       Proof Positive/Negative: t3[name] (Ian) is correct; t3[officePhn] is wrong.
       Proof Positive: t1[nation, capital] is correct; t3[nation, capital] is correct.

  20-21. Sherlock Rules

       D:
             name  dep  nation  capital   bornat    officePhn
         t1  Si    DA   China   Beijing   ChenYang  28098001
         t2  Yan   DA   China   Shanghai  Chengdu   24038698
         t3  Ian   ALT  China   Beijing   Hangzhou  33668323

       Dm:
             country  capital
         s1  China    Beijing
         s2  Japan    Tokyo
         s3  Chile    Santiago

             name  officePhn  mobile
         r1  Si    28098001   66700541
         r2  Yan   24038698   66706563
         r3  Ian   27364928   33668323

       Labels on the slide: positive, evidence, negative.

  22-26. Point of Innovation

       Integrity Constraints: there do not exist t1, t2 with t1[X1] = t2[X2] but t1[B1] <> t2[B2].
       Example: (China, Shanghai) vs. (China, Beijing): the nations are equal (=), the capitals differ (<>).

       Sherlock Rules: if t1[X1] = t2[X2] and t1[B] = t2[B-], then t1[B] := t2[B+].
       Example: (China, Shanghai) matched against (China, Beijing, Shanghai).
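The rule shape on this slide (match on evidence, recognise a known-wrong value, write the correct one) can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' implementation: the `SherlockRule` class, the `apply_rule` function, exact-equality matching, and the column name `wrong_capital` for B- are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SherlockRule:
    evidence: tuple  # pairs (attribute in t1, attribute in t2), i.e. X1 and X2
    target: str      # B, the attribute of t1 being judged
    negative: str    # B-, attribute of t2 holding a known-wrong value
    positive: str    # B+, attribute of t2 holding the correct value

def apply_rule(rule, t1, master):
    """Return (repaired copy of t1, marks); marks maps attribute -> '+'."""
    repaired, marks = dict(t1), {}
    for t2 in master:
        # Evidence: t1[X1] = t2[X2] (exact equality is an assumption here).
        if any(t1[a] != t2[b] for a, b in rule.evidence):
            continue
        if t1[rule.target] == t2[rule.negative]:
            # Proof negative: t1[B] = t2[B-], so apply t1[B] := t2[B+].
            repaired[rule.target] = t2[rule.positive]
        elif t1[rule.target] != t2[rule.positive]:
            continue  # neither proof applies; leave t1 untouched
        # The evidence and the (now correct) target are marked positive.
        for a, _ in rule.evidence:
            marks[a] = "+"
        marks[rule.target] = "+"
        break
    return repaired, marks

# Master tuple (China, Beijing, Shanghai); "wrong_capital" is a hypothetical name for B-.
master = [{"country": "China", "capital": "Beijing", "wrong_capital": "Shanghai"}]
rule = SherlockRule(evidence=(("nation", "country"),), target="capital",
                    negative="wrong_capital", positive="capital")

repaired, marks = apply_rule(rule, {"nation": "China", "capital": "Shanghai"}, master)
print(repaired)
print(marks)
```

Unlike the integrity constraint, which only reports that (China, Shanghai) and (China, Beijing) conflict, the rule carries enough information to decide that Shanghai is the wrong value and Beijing the replacement.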

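  27. Applying Multiple Rules: Pos(t), the attributes of t marked positive (+); Neg(t), the attributes marked negative (-); Free(t), the remaining unmarked attributes.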

  28-29. Sherlock Rules in Action

       t1 (Si, DA, China, Beijing, ChenYang, 28098001)
       t1 (Si+, DA, China, Beijing, ChenYang-, 28098001+)
       t1 (Si+, DA, China, Beijing, ShenYang+, 28098001+)

       Pos(t1)
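The three states of t1 above can be replayed with a small annotated-tuple structure. The sketch below is hypothetical: the `AnnotatedTuple` class and its methods are illustrative names, and treating a positive mark as final is an assumption made here, not something the slides state.

```python
ATTRS = ("name", "dep", "nation", "capital", "bornat", "officePhn")

class AnnotatedTuple:
    def __init__(self, values):
        self.values = dict(zip(ATTRS, values))
        self.marks = {}  # attribute -> "+" or "-"; an absent attribute is free

    def mark(self, attr, sign):
        # Assumption: once an attribute is positive it is never re-marked.
        if self.marks.get(attr) == "+":
            raise ValueError(f"{attr} is already positive")
        self.marks[attr] = sign

    def correct(self, attr, value):
        """Replace a value that has a proof negative and mark the new value positive."""
        if self.marks.get(attr) != "-":
            raise ValueError(f"{attr} has no proof negative")
        self.values[attr] = value
        self.marks[attr] = "+"

    def pos(self):
        return {a for a in ATTRS if self.marks.get(a) == "+"}

    def neg(self):
        return {a for a in ATTRS if self.marks.get(a) == "-"}

    def free(self):
        return {a for a in ATTRS if a not in self.marks}

# State 1: the raw tuple t1 from the slide.
t1 = AnnotatedTuple(("Si", "DA", "China", "Beijing", "ChenYang", "28098001"))

# State 2: Si+, ChenYang-, 28098001+.
t1.mark("name", "+")
t1.mark("officePhn", "+")
t1.mark("bornat", "-")
negative_after_step_2 = t1.neg()

# State 3: ChenYang is replaced by ShenYang, which is marked positive.
t1.correct("bornat", "ShenYang")
print(sorted(t1.pos()), sorted(t1.free()))
```

After the last step Pos(t1) holds name, bornat and officePhn, Neg(t1) is empty, and dep, nation and capital remain free, matching the marks shown in the final state on the slide.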

  30. Transformation Rules

  31. Outline • Motivation • Sherlock Rules • Fundamental problems • Algorithms

  32. Fundamental Problems • Termination • Consistency (coNP-complete) • Determinism • Implication (coNP-complete)

  33. Algorithms • Motivation • Sherlock Rules • Fundamental problems • Algorithms

  34-36. Algorithms

       Naive Repairing: chase-based, O(|R| x |Sigma| x |M|)

       Fast Repairing: O(|R| x |Sigma| x com(S))
         • Similarity indices to reduce |M| (BK-tree, FastSS, n-gram)
         • Inverted index to reduce |Sigma| (hash map)
         • Caching similarity index accesses
         • Rule pruning based on dependency
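Of the similarity indices named above, the BK-tree is compact enough to sketch. The Python below is a generic BK-tree over edit distance, not the presentation's code: the class and function names, the sample values, and the misspelled query "Bejing" are illustrative assumptions. It shows how a lookup can return near matches while skipping subtrees, instead of comparing against every master value.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    """Each node is (word, {distance_to_child: child_node})."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # assumes at least one word
        for word in it:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def query(self, word, radius):
        """Return all stored words within `radius` edits of `word`."""
        hits, stack = [], [self.root]
        while stack:
            stored, children = stack.pop()
            d = edit_distance(word, stored)
            if d <= radius:
                hits.append(stored)
            # Triangle inequality: only children at distance d-radius..d+radius can match.
            for dist, child in children.items():
                if d - radius <= dist <= d + radius:
                    stack.append(child)
        return sorted(hits)

# Index the capital values that appear in the slides' tables.
tree = BKTree(["Beijing", "Tokyo", "Santiago", "Shanghai"])
print(tree.query("Bejing", radius=1))
```

The pruning step is what reduces the number of master values examined per lookup; how this maps onto com(S) in the stated complexity is not spelled out in the transcript.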

  37-40. Rule Pruning Example

       Rules R1, R2, R3 (the rule definitions and the R1/R2/R3 diagram are not preserved in this transcript), applied to
       t3(Ian, ALT, Chine, Beijing, Hangzhou, 33668323):

       iteration 1: {(R1, Yes), (R2, Yes), (R3, No)}
       iteration 2: {(R1, Yes), (R2, No), (R3, No)}
       iteration 3: {(R1, Yes), (R2, No), (R3, No)}

  41. Conclusion • Sherlock rules for accurately annotating and repairing data • Fundamental problems • Efficient algorithms
