1
Generic Entity Resolution with Negative Rules
Steven Whang, Hector Garcia-Molina, Omar Benjelloun
Stanford University, Google Inc.
2
Entity Resolution
- M(r1, r2) = T, merge <r1, r2> = r12
- M(r3, r12) = T, merge <r3, r12> = r123
      Name      SSN          Gender
r1    Pat       999-04-1234
r2    Patricia               F
r3    Pat       999-04-1234  M
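The two merge steps on this slide can be sketched in Python. The slides do not define M or the merge function, so the rules below are assumptions chosen only to reproduce the example: two records match if they share an SSN or if one name is a prefix of the other, and merging takes the union of each attribute's values.

```python
# Records from the slide: each attribute maps to a set of values;
# an empty set stands for a missing value.
r1 = {"name": {"Pat"}, "ssn": {"999-04-1234"}, "gender": set()}
r2 = {"name": {"Patricia"}, "ssn": set(), "gender": {"F"}}
r3 = {"name": {"Pat"}, "ssn": {"999-04-1234"}, "gender": {"M"}}


def match(a, b):
    """Assumed match rule: same SSN, or one name is a prefix of the other."""
    if a["ssn"] & b["ssn"]:
        return True
    return any(x.startswith(y) or y.startswith(x)
               for x in a["name"] for y in b["name"])


def merge(a, b):
    """Assumed merge rule: union the values of every attribute."""
    return {attr: a[attr] | b[attr] for attr in a}


assert match(r1, r2)        # M(r1, r2) = T
r12 = merge(r1, r2)         # merge <r1, r2> = r12
assert match(r3, r12)       # M(r3, r12) = T
r123 = merge(r3, r12)       # merge <r3, r12> = r123
print(r123)
```

Under these assumed rules, r123 ends up with both gender values, F and M.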
3
Entity Resolution
[Diagram: records r1, r2, r3 and merged records r12, r123]
      Name             SSN          Gender
r1    Pat              999-04-1234
r2    Patricia                      F
r3    Pat              999-04-1234  M
r12   {Pat, Patricia}  999-04-1234  F
r123  {Pat, Patricia}  999-04-1234  {F, M}
4
Entity Resolution
[Diagram: records r1, r2, r3 and merged records r12, r123, annotated "Negative Rules"]
      Name             SSN          Gender
r1    Pat              999-04-1234
r2    Patricia                      F
r3    Pat              999-04-1234  M
r12   {Pat, Patricia}  999-04-1234  F
r123  {Pat, Patricia}  999-04-1234  {F, M}
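A negative rule can be thought of as a predicate that flags records that should not appear in the result. The slides do not spell out the rule used here, so the sketch below assumes a hypothetical one: a record may carry at most one gender value. The function names are illustrative only.

```python
r12 = {"name": {"Pat", "Patricia"}, "ssn": {"999-04-1234"}, "gender": {"F"}}
r123 = {"name": {"Pat", "Patricia"}, "ssn": {"999-04-1234"}, "gender": {"F", "M"}}


def violates_gender_rule(record):
    """Hypothetical negative rule: more than one gender value is inconsistent."""
    return len(record["gender"]) > 1


def consistent(records, negative_rules):
    """Keep only the records that no negative rule flags."""
    return [r for r in records if not any(rule(r) for rule in negative_rules)]


kept = consistent([r12, r123], [violates_gender_rule])
print(len(kept))  # 1: only r12 survives
```

With this assumed rule, r123 is rejected because it holds both F and M, while r12 remains.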
5
Entity Resolution
[Diagram: records r1, r2, r3 and merged record r12, annotated "Negative Rules"]
      Name             SSN          Gender
r1    Pat              999-04-1234
r2    Patricia                      F
r3    Pat              999-04-1234  M
r12   {Pat, Patricia}  999-04-1234  F
6
Entity Resolution
      Name      SSN          Gender
r1    Pat       999-04-1234
r2    Patricia               F
r3    Pat       999-04-1234  M
Solutions: {r13, r2} or {r12}
Undesirable: {r1, r2}
7
Negative Rules
[Diagram: I (input records) → ER → R (resolved records); ER takes the match and merge functions and the negative rules]
9
Why not simply extend match func.?
[Diagram: records r1, r12, r123, r2, r3; labels M, M|F, F, M, M]
10
Algorithm
      Name      SSN          Gender
r1    Pat       999-04-1234
r2    Patricia               F
r3    Pat       999-04-1234  M
Records: r12, r123, r1, r2, r3, r23, r13
Solution
13
Algorithm
      Name      SSN          Gender
r1    Pat       999-04-1234
r2    Patricia               F
r3    Pat       999-04-1234  M
Records: r12, r1, r2, r3, r23, r13
Solution
15
Algorithm
      Name      SSN          Gender
r1    Pat       999-04-1234
r2    Patricia               F
r3    Pat       999-04-1234  M
Records: r12, r1, r2, r3, r13
Solution
18
Algorithm
      Name      SSN          Gender
r1    Pat       999-04-1234
r2    Patricia               F
r3    Pat       999-04-1234  M
Records: r1, r2, r13
Solution
19
Algorithm
      Name      SSN          Gender
r1    Pat       999-04-1234
r2    Patricia               F
r3    Pat       999-04-1234  M
Records: r2, r13
Solution
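The elimination shown across the Algorithm slides can be approximated with a small greedy sketch. This is not the authors' algorithm; it is one way to reproduce the sequence of record sets under several assumptions that the slides do not state: the match rule (shared SSN or name prefix), a unary negative rule (at most one gender per record), a binary negative rule (two records in the result may not share an SSN), and a solver that picks `first_choice` before anything else. All function names are hypothetical.

```python
from itertools import combinations

# Base records: attribute -> set of values (empty set = missing value).
BASE = {
    "r1": {"name": {"Pat"}, "ssn": {"999-04-1234"}, "gender": set()},
    "r2": {"name": {"Patricia"}, "ssn": set(), "gender": {"F"}},
    "r3": {"name": {"Pat"}, "ssn": {"999-04-1234"}, "gender": {"M"}},
}


def match(a, b):
    """Assumed match rule: same SSN, or one name is a prefix of the other."""
    if a["ssn"] & b["ssn"]:
        return True
    return any(x.startswith(y) or y.startswith(x)
               for x in a["name"] for y in b["name"])


def merge(a, b):
    """Assumed merge rule: union the values of every attribute."""
    return {attr: a[attr] | b[attr] for attr in a}


def closure(base):
    """All records derivable by merging, keyed by the base ids they contain."""
    derived = {frozenset([rid]): rec for rid, rec in base.items()}
    changed = True
    while changed:
        changed = False
        for (ka, a), (kb, b) in combinations(list(derived.items()), 2):
            key = ka | kb
            if key not in derived and match(a, b):
                derived[key] = merge(a, b)
                changed = True
    return derived


def unary_violation(rec):
    """Assumed unary negative rule: more than one gender value."""
    return len(rec["gender"]) > 1


def binary_violation(a, b):
    """Assumed binary negative rule: two result records share an SSN."""
    return bool(a["ssn"] & b["ssn"])


def solve(base, first_choice):
    """Greedy selection: the solver's pick first, then larger records first."""
    candidates = {k: r for k, r in closure(base).items()
                  if not unary_violation(r)}
    rest = sorted((k for k in candidates if k != first_choice),
                  key=lambda k: (-len(k), sorted(k)))
    solution = []
    for key in [first_choice] + rest:
        if key not in candidates:
            continue  # the pick itself was rejected by a unary rule
        if any(key <= chosen for chosen in solution):
            continue  # dominated: already contained in a chosen record
        if any(binary_violation(candidates[key], candidates[c])
               for c in solution):
            continue  # would conflict with a chosen record
        solution.append(key)
    return solution


def label(key):
    """Render frozenset({'r1', 'r3'}) as 'r13'."""
    return "r" + "".join(sorted(rid[1:] for rid in key))


print([label(k) for k in solve(BASE, frozenset({"r1", "r3"}))])  # ['r13', 'r2']
print([label(k) for k in solve(BASE, frozenset({"r1", "r2"}))])  # ['r12']
```

Under these assumptions the closure contains the seven records listed on the first Algorithm slide, the unary rule removes r123 and r23, and picking r13 first leaves r2 and r13, as on the last slide.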
20
Resolving Inconsistencies
- Discard: r1, r2
- Forced Merge: r1, r2 → r12
- Override: r1, r2
21
Precision and Recall
[Plot legend: Match and Merge Func., Discard, Forced Merge, Solver; annotation: Best Point]
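The slides do not say how precision and recall were computed. One common choice for entity resolution is the pairwise definition, sketched below: a pair of input records counts as correct if both the result and a gold standard place them in the same cluster. The gold standard in the example is hypothetical and chosen only for illustration.

```python
from itertools import combinations


def pairs(clusters):
    """All unordered pairs of record ids that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_precision_recall(predicted, gold):
    """Pairwise precision and recall of a predicted clustering."""
    pred_pairs, gold_pairs = pairs(predicted), pairs(gold)
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall


# Hypothetical gold standard: r1 and r3 are the same person, r2 is not.
gold = [{"r1", "r3"}, {"r2"}]
merged_all = [{"r1", "r2", "r3"}]      # everything merged into r123
with_rules = [{"r1", "r3"}, {"r2"}]    # the result {r13, r2}

print(pairwise_precision_recall(merged_all, gold))
print(pairwise_precision_recall(with_rules, gold))  # (1.0, 1.0)
```

In this toy setting, merging everything yields full recall but only one correct pair out of three, whereas {r13, r2} matches the assumed gold standard exactly.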
22
Runtime
[Plot: runtime of General Alg. and Enhanced Alg.]
23
Negative Rules Summary
- Negative rules can improve the precision and recall of entity resolution
- Entity resolution with negative rules is very expensive and should be used within buckets after blocking
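Blocking, as mentioned in the summary, limits the expensive step to small groups of records. The sketch below is a minimal illustration, assuming a hypothetical blocking key (the first letter of the name) and one extra made-up record, r4, so that there is more than one bucket.

```python
from collections import defaultdict
from itertools import combinations


def block(records, key):
    """Group record ids into buckets by a blocking key."""
    buckets = defaultdict(list)
    for rid, rec in records.items():
        buckets[key(rec)].append(rid)
    return dict(buckets)


def candidate_pairs(buckets):
    """Only records in the same bucket are ever compared."""
    return [pair for ids in buckets.values() for pair in combinations(ids, 2)]


records = {
    "r1": {"name": "Pat"},
    "r2": {"name": "Patricia"},
    "r3": {"name": "Pat"},
    "r4": {"name": "Sam"},  # hypothetical extra record, not from the slides
}

buckets = block(records, key=lambda rec: rec["name"][0])
all_pairs = list(combinations(records, 2))
print(len(all_pairs), len(candidate_pairs(buckets)))  # 6 3
```

ER with negative rules would then run separately inside each bucket, so its cost depends on the bucket sizes and not on the full input size.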
24
Evolving Rules
[Diagram: I (input records) → ER → R (resolved records); ER uses the old match, merge func.]
25
Evolving Rules
[Diagram: I (input records) → ER → R (resolved records), using the old match, merge func.; ER → S (resolved records), using the new match, merge func.]
26
Evolving Rules
[Diagram: I (input records) → ER → R (resolved records), using the old match, merge func.; ER → S (resolved records), using the new match, merge func.; Merge Undo, T (resolved records), ER]
27
ER in the InfoLab
- Generic ER
- Confidences
- Distributed ER
- Negative Rules
- Evolving Rules
- Blocking