Dependable Data Repairing with Fixing Rules
Jiannan Wang Nan Tang
Data is Dirty
incomplete, inconsistent, inaccurate, …
25% companies: flawed data
3+ trillion $: US economy
20%: labor productivity
…
Data transformation (ETL rules)
Statistical / ML
Typos (syntactic errors)
State-of-the-art
Dependency Theory
      name    country  capital   city      conf
r1    George  China    Beijing   Beijing   SIGMOD
r2    Ian     China    Shanghai  Hongkong  ICDE
r3    Peter   China    Tokyo     Tokyo     ICDE
r4    Mike    Canada   Toronto   Toronto   VLDB
FD: [country] -> [capital]
Data dependencies are not sufficient to guide dependable data repairing
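A minimal sketch of why the FD alone is not enough, using the four tuples from the slide. The helper name `fd_violations` is hypothetical. It reports which tuples disagree on `capital` for the same `country`, but it cannot say which value is wrong, and it stays silent about r4 because no other tuple contradicts it.

```python
from collections import defaultdict

rows = {
    "r1": {"name": "George", "country": "China", "capital": "Beijing", "city": "Beijing", "conf": "SIGMOD"},
    "r2": {"name": "Ian", "country": "China", "capital": "Shanghai", "city": "Hongkong", "conf": "ICDE"},
    "r3": {"name": "Peter", "country": "China", "capital": "Tokyo", "city": "Tokyo", "conf": "ICDE"},
    "r4": {"name": "Mike", "country": "Canada", "capital": "Toronto", "city": "Toronto", "conf": "VLDB"},
}

def fd_violations(rows, lhs, rhs):
    """Group tuples by their lhs value; a group with several rhs values violates lhs -> rhs."""
    groups = defaultdict(lambda: defaultdict(list))
    for rid, row in rows.items():
        groups[row[lhs]][row[rhs]].append(rid)
    return {key: dict(by_rhs) for key, by_rhs in groups.items() if len(by_rhs) > 1}

violations = fd_violations(rows, "country", "capital")
print(violations)
```

The China group holds three competing capitals, so the FD flags a conflict without pointing at a repair. Canada does not appear at all, although Toronto is not its capital.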
User Guidance
      name    country  capital   city      conf
r1    George  China    Beijing   Beijing   SIGMOD
r2    Ian     China    Shanghai  Hongkong  ICDE
r3    Peter   China    Tokyo     Tokyo     ICDE
r4    Mike    Canada   Toronto   Toronto   VLDB
Master data:
      country  capital
s1    China    Beijing
s2    Canada   Ottawa
s3    Japan    Tokyo
editing rule: ((country, country) -> (capital, capital))
Is r2[country] China? YES. r2[capital] becomes Beijing.
Is r1[country] China? Is r3[country] China? Is r4[country] Canada? …
check each tuple: not cheap !!
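The cost of user guidance can be sketched as follows. Everything here is illustrative: `ask_user` is a hypothetical stand-in for the interactive question on the slide, and the scripted answers (YES for r2, NO for r3) are assumptions made only so the example runs. The point is that the editing rule copies `capital` from master data only after one question per tuple.

```python
# Master data s1..s3 from the slide: country -> capital.
master = {"China": "Beijing", "Canada": "Ottawa", "Japan": "Tokyo"}

rows = {
    "r2": {"country": "China", "capital": "Shanghai"},
    "r3": {"country": "China", "capital": "Tokyo"},
}

# Scripted stand-in for the user (hypothetical answers, not from the slides).
answers = {"r2": True, "r3": False}
questions = []

def ask_user(rid, attr, value):
    """Record the question and return the scripted answer."""
    questions.append(f"Is {rid}[{attr}] {value}?")
    return answers[rid]

def apply_editing_rule(rid, row):
    """Copy capital from master data only if the user confirms the country."""
    country = row["country"]
    if country in master and ask_user(rid, "country", country):
        return {**row, "capital": master[country]}
    return row

repaired = {rid: apply_editing_rule(rid, row) for rid, row in rows.items()}
print(repaired, questions)
```

The number of questions grows with the number of tuples, which is the "not cheap" part of the slide.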
Heuristic (Automated): precision: +, recall: ++
Certain (User guided): precision: ++, recall: ++
Fixing Rules: precision: ++, recall: +
negative evidence
Data patterns
country  capital           (evidence, negative)
China    Shanghai
China    T                 ? (China, Beijing) (Japan, T
name     work mail         (evidence, negative)
Ian      ian@gmail.com
city     area code         (evidence, negative)
Beijing  110002
Fixing Rules
fR1: (([country], [China]), (capital, {Shanghai, Hongkong})) -> Beijing
      evidence           negative                          fact
fR1   country = China    capital in {Shanghai, Hongkong}   capital = Beijing
      name    country  capital   city      conf
r1    George  China    Beijing   Beijing   SIGMOD
r2    Ian     China    Shanghai  Hongkong  ICDE
r3    Peter   China    Tokyo     Tokyo     ICDE
r4    Mike    Canada   Toronto   Toronto   VLDB
r2[capital]: Shanghai -> Beijing
Deterministic, Conservative
Applying One Fixing Rule
evidence           negative                          fact
country = China    capital in {Shanghai, Hongkong}   capital = Beijing
r2    Ian  China  Shanghai  Hongkong  ICDE
r2’   Ian  China  Beijing   Hongkong  ICDE
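Applying a single rule can be sketched as below. The `FixingRule` class and `apply_rule` function are hypothetical names chosen for this illustration; the fields mirror the evidence, negative and fact parts shown on the slide. The rule fires only when the evidence matches and the current value is one of the listed negative patterns.

```python
from dataclasses import dataclass

@dataclass
class FixingRule:
    evidence: dict   # attribute -> value that must match
    attr: str        # attribute to repair
    negative: set    # values known to be wrong given the evidence
    fact: str        # correct value to write

def apply_rule(row, rule):
    """Return a repaired copy of row if the rule fires, otherwise row unchanged."""
    matches = all(row.get(a) == v for a, v in rule.evidence.items())
    if matches and row.get(rule.attr) in rule.negative:
        return {**row, rule.attr: rule.fact}
    return row

fR1 = FixingRule({"country": "China"}, "capital", {"Shanghai", "Hongkong"}, "Beijing")

r2 = {"name": "Ian", "country": "China", "capital": "Shanghai", "city": "Hongkong", "conf": "ICDE"}
r3 = {"name": "Peter", "country": "China", "capital": "Tokyo", "city": "Tokyo", "conf": "ICDE"}

r2_fixed = apply_rule(r2, fR1)
r3_fixed = apply_rule(r3, fR1)
print(r2_fixed["capital"], r3_fixed["capital"])
```

r3 is left alone because Tokyo is not among fR1's negative patterns, which is the conservative behaviour the slides refer to.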
Applying Multiple Fixing Rules
       evidence                                     negative                                 fact
fR1’   country = China                              capital in {Shanghai, Hongkong, Tokyo}   capital = Beijing
fR3    capital = Tokyo, city = Tokyo, conf = ICDE   country in {China}                       country = Japan
r2     Ian    China  Shanghai  Hongkong  ICDE
r2’    Ian    China  Beijing   Hongkong  ICDE   (fR1’)
r3     Peter  China  Tokyo     Tokyo     ICDE
r3’    Peter  China  Beijing   Tokyo     ICDE   (fR1’)
r3’’   Peter  Japan  Tokyo     Tokyo     ICDE   (fR3)
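The two outcomes for r3 can be reproduced with a short sketch. Rules are written here as plain tuples `(evidence, attr, negative, fact)`, and `apply_in_order` is a hypothetical helper that applies each rule once in the given order. Under these assumptions the order of fR1’ and fR3 decides the result, which is why a rule set has to be checked for consistency before it is used.

```python
def apply_rule(row, rule):
    """Fire the rule if its evidence matches and the target holds a negative value."""
    evidence, attr, negative, fact = rule
    if all(row.get(a) == v for a, v in evidence.items()) and row.get(attr) in negative:
        return {**row, attr: fact}
    return row

def apply_in_order(row, rules):
    """Apply each rule once, in the order given."""
    for rule in rules:
        row = apply_rule(row, rule)
    return row

fR1p = ({"country": "China"}, "capital", {"Shanghai", "Hongkong", "Tokyo"}, "Beijing")
fR3 = ({"capital": "Tokyo", "city": "Tokyo", "conf": "ICDE"}, "country", {"China"}, "Japan")

r3 = {"name": "Peter", "country": "China", "capital": "Tokyo", "city": "Tokyo", "conf": "ICDE"}

first_fR1p = apply_in_order(r3, [fR1p, fR3])
first_fR3 = apply_in_order(r3, [fR3, fR1p])
print(first_fR1p)
print(first_fR3)
```

Applying fR1’ first changes the capital, after which fR3's evidence no longer matches. Applying fR3 first changes the country, after which fR1’ no longer matches.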
Fundamental Problems
Termination: Yes
Consistency: PTIME
Implication: coNP-complete
Determinism: Yes
Ensuring Consistency
      evidence           negative                          fact
fR1   country = China    capital in {Shanghai, Hongkong}   capital = Beijing
fR2   country = Canada   capital in {Toronto}              capital = Ottawa
(◦, China, Shanghai, ◦, ◦)
(◦, China, Hongkong, ◦, ◦)
(◦, China, Toronto, ◦, ◦)
(◦, Canada, Shanghai, ◦, ◦)
(◦, Canada, Hongkong, ◦, ◦)
(◦, Canada, Toronto, ◦, ◦)
Xi, Xj Bi, Bj tpi, tpj two rules
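A rough sketch of a pairwise consistency check by tuple enumeration. It is not the authors' implementation. It assumes that attributes touched by an applied rule are frozen afterwards (an assumption of this sketch, which keeps the loop finite), and it enumerates candidate tuples from the evidence and negative values of both rules plus the placeholder ◦. The names `chase`, `candidate_rows` and `is_consistent` are hypothetical.

```python
from itertools import product

OTHER = "◦"  # placeholder for any value not mentioned by the rules

def chase(row, rules):
    """Apply rules in the given order until nothing changes."""
    row, frozen = dict(row), set()
    changed = True
    while changed:
        changed = False
        for evidence, attr, negative, fact in rules:
            if attr in frozen:
                continue  # assumption: repaired or evidence attributes are not touched again
            if all(row[a] == v for a, v in evidence.items()) and row[attr] in negative:
                row[attr] = fact
                frozen |= set(evidence) | {attr}
                changed = True
    return row

def candidate_rows(rules):
    """Enumerate tuples over the values the rules mention, plus the placeholder."""
    values = {}
    for evidence, attr, negative, _fact in rules:
        for a, v in evidence.items():
            values.setdefault(a, {OTHER}).add(v)
        values.setdefault(attr, {OTHER}).update(negative)
    attrs = sorted(values)
    for combo in product(*(sorted(values[a]) for a in attrs)):
        yield dict(zip(attrs, combo))

def is_consistent(rule_a, rule_b):
    """Two rules are consistent if both application orders agree on every candidate tuple."""
    return all(
        chase(row, [rule_a, rule_b]) == chase(row, [rule_b, rule_a])
        for row in candidate_rows([rule_a, rule_b])
    )

fR1 = ({"country": "China"}, "capital", {"Shanghai", "Hongkong"}, "Beijing")
fR2 = ({"country": "Canada"}, "capital", {"Toronto"}, "Ottawa")
fR1p = ({"country": "China"}, "capital", {"Shanghai", "Hongkong", "Tokyo"}, "Beijing")
fR3 = ({"capital": "Tokyo", "city": "Tokyo", "conf": "ICDE"}, "country", {"China"}, "Japan")

print(is_consistent(fR1, fR2), is_consistent(fR1p, fR3))
```

For fR1 and fR2 the enumeration covers the six tuples listed on the slide, plus those containing ◦ in the country or capital position. The pair fR1’ and fR3 fails on the tuple (China, Tokyo, Tokyo, ICDE).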
Repairing with Fixing Rules
      name    country  capital   city      conf
r1    George  China    Beijing   Beijing   SIGMOD
r2    Ian     China    Shanghai  Hongkong  ICDE
r3    Peter   China    Tokyo     Tokyo     ICDE
r4    Mike    Canada   Toronto   Toronto   VLDB
Rules:
      evidence                                     negative                          fact
fR1   country = China                              capital in {Shanghai, Hongkong}   capital = Beijing
fR2   country = Canada                             capital in {Toronto}              capital = Ottawa
fR3   capital = Tokyo, city = Tokyo, conf = ICDE   country in {China}                country = Japan
fR4   capital = Beijing, conf = ICDE               city in {Hongkong}                city = Shanghai
Inverted lists:
Key                  List
(country, China)     fR1
(country, Canada)    fR2
(conf, ICDE)         fR3, fR4
(capital, Tokyo)     fR3
(city, Tokyo)        fR3
(capital, Beijing)   fR4
r1:   cnt(fR1) = 1, cnt(fR4) = 1, rules = {fR1}
itr1: r1’ = r1, rules = {}
r2:   cnt(fR1, fR3, fR4) = 1, rules = {fR1}
itr1: r2’[capital] = Beijing, cnt(fR3) = 1, cnt(fR4) = 2, rules = {fR4}
itr2: r2’[city] = Shanghai, rules = {}
r3:   cnt(fR3) = 3, cnt(fR4) = 1, rules = {fR3}
itr1: r3’[country] = Japan, rules = {}
r4:   cnt(fR2) = 1, rules = {fR2}
itr1: r4’[capital] = Ottawa, rules = {}
Repaired values: r2[capital] = Beijing, r2[city] = Shanghai, r3[country] = Japan, r4[capital] = Ottawa
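The counter-based trace above can be approximated with the following sketch. It is an illustration under stated assumptions rather than the authors' lRepair: the names `FixingRule`, `build_index` and `repair` are hypothetical, and the extra evidence re-check before firing is a safety choice of this sketch. A rule becomes a candidate once its counter equals the number of its evidence attributes, and writing a new value bumps the counters of rules indexed under that value.

```python
from collections import defaultdict
from typing import NamedTuple

class FixingRule(NamedTuple):
    name: str
    evidence: tuple      # ((attribute, value), ...)
    attr: str            # attribute to repair
    negative: frozenset  # known-wrong values
    fact: str            # correct value

def build_index(rules):
    """Inverted lists: (attribute, value) -> rules whose evidence contains that pair."""
    index = defaultdict(list)
    for rule in rules:
        for key in rule.evidence:
            index[key].append(rule)
    return index

def repair(row, index):
    row = dict(row)
    count = defaultdict(int)
    ready = []

    def bump(attr, value):
        for rule in index.get((attr, value), []):
            count[rule.name] += 1
            if count[rule.name] == len(rule.evidence):
                ready.append(rule)  # all evidence attributes have matched

    for attr, value in row.items():
        bump(attr, value)
    while ready:
        rule = ready.pop()
        # Re-check the evidence, since an earlier repair may have changed it.
        still_matches = all(row.get(a) == v for a, v in rule.evidence)
        if still_matches and row.get(rule.attr) in rule.negative:
            row[rule.attr] = rule.fact
            bump(rule.attr, rule.fact)
    return row

rules = [
    FixingRule("fR1", (("country", "China"),), "capital", frozenset({"Shanghai", "Hongkong"}), "Beijing"),
    FixingRule("fR2", (("country", "Canada"),), "capital", frozenset({"Toronto"}), "Ottawa"),
    FixingRule("fR3", (("capital", "Tokyo"), ("city", "Tokyo"), ("conf", "ICDE")), "country", frozenset({"China"}), "Japan"),
    FixingRule("fR4", (("capital", "Beijing"), ("conf", "ICDE")), "city", frozenset({"Hongkong"}), "Shanghai"),
]
index = build_index(rules)

r1 = {"name": "George", "country": "China", "capital": "Beijing", "city": "Beijing", "conf": "SIGMOD"}
r2 = {"name": "Ian", "country": "China", "capital": "Shanghai", "city": "Hongkong", "conf": "ICDE"}
r3 = {"name": "Peter", "country": "China", "capital": "Tokyo", "city": "Tokyo", "conf": "ICDE"}
r4 = {"name": "Mike", "country": "Canada", "capital": "Toronto", "city": "Toronto", "conf": "VLDB"}

repaired = [repair(r, index) for r in (r1, r2, r3, r4)]
for r in repaired:
    print(r)
```

For r2, repairing the capital to Beijing raises fR4's counter to 2, which then repairs the city, matching the two iterations in the trace.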
Experimental Study
Efficiency of Checking Consistency
[Figure: Time (msec) vs. # of rules for isConsistt (worst case) and isConsistr (worst case). Panels: Hospital data (# of rules * 100), UIS (# of rules * 10)]
Accuracy
[Figure: Hospital data, Fix vs. Edit. # of errors corrected vs. Top 100 rules; Precision and Recall]
Efficiency of Repairing Algorithms
[Figure: Time (sec) vs. # of rules for cRepair and lRepair. Panels: Hospital data (# of rules * 100), UIS (# of rules * 10)]
Heuristic (Automated): precision: +, recall: ++
Certain (User guided): precision: ++, recall: ++
Fixing Rules: precision: ++, recall: +
Conclusion:
Automated, Dependable, Fundamentals, Repair
Future work:
Discovery, Generalized fixing rules
Generating Fixing Rules
evidence           negative                          fact
country = China    capital in {Shanghai, Hongkong}   capital = Beijing
MQL1:
[{
  "type": "/location/country",
  "name": null,
  "/location/country/capital": []
}]
MQL2:
[{
  "/location/country/iso3166_1_shortname": "CHINA",
  "/location/location/contains": [{
    "name": null,
    "type": "/location/citytown"
  }]
}]
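One way the two query results could be combined into a rule is sketched below. The slides show only the queries and the resulting rule, so the combination step is an assumption of this sketch: the capital returned by MQL1 becomes the fact, and the other cities returned by MQL2 become negative patterns. The dictionaries stand in for query results and use only values that appear in the slides; `make_capital_rule` is a hypothetical helper.

```python
# Stand-ins for the query results (values taken from the slides, not from a live query).
capitals = {"China": "Beijing"}                                # shape of an MQL1 answer
contained_cities = {"China": ["Beijing", "Shanghai", "Hongkong"]}  # shape of an MQL2 answer

def make_capital_rule(country, capital, cities):
    """Build a fixing rule: any contained city other than the capital is a negative pattern."""
    negative = {city for city in cities if city != capital}
    return {
        "evidence": {"country": country},
        "attr": "capital",
        "negative": negative,
        "fact": capital,
    }

generated = [
    make_capital_rule(country, capital, contained_cities.get(country, []))
    for country, capital in capitals.items()
]
print(generated)
```

With these inputs the sketch yields the same evidence, negative patterns and fact as fR1.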