A Decision Model for Cost Optimal Record Matching
Presenter: Vassilios S. Verykios IST College / Drexel University Affiliates Workshop on Data Quality NISS/Telcordia - December 1st, 2000
Comparison Vector
Given a pair of database records, the comparison vector holds one code per field:
1. Agreement
2. Disagreement
3. Missing
Example over the fields A, B, C, D, E, F: comparison vector (1, 1, 3, 2, 2, 1)
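The coding scheme can be sketched in a few lines of Python. This is only an illustration: the function name `comparison_vector`, the use of `None` for a missing value, exact string equality as the agreement test, and the sample field values are all assumptions, not part of the presentation.

```python
# Codes taken from the slide: 1 = agreement, 2 = disagreement, 3 = missing.
AGREEMENT, DISAGREEMENT, MISSING = 1, 2, 3


def comparison_vector(rec_a, rec_b, fields):
    """Compare two records field by field (hypothetical helper).

    A field counts as missing when either record has None for it,
    and as agreement only when both values are exactly equal.
    """
    vector = []
    for field in fields:
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            vector.append(MISSING)
        elif a == b:
            vector.append(AGREEMENT)
        else:
            vector.append(DISAGREEMENT)
    return vector


# Made-up records over the fields A..F, chosen to reproduce the slide's vector.
FIELDS = list("ABCDEF")
rec_a = {"A": "john", "B": "smith", "C": None, "D": "main st", "E": "19104", "F": "pa"}
rec_b = {"A": "john", "B": "smith", "C": "k", "D": "maine st", "E": "19014", "F": "pa"}

print(comparison_vector(rec_a, rec_b, FIELDS))  # [1, 1, 3, 2, 2, 1]
```

In practice the agreement test would usually be an approximate string comparison rather than strict equality; strict equality is used here only to keep the sketch short.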
[Figure: Distributions of samples from matched and unmatched record pairs, plotted on the axes x1 and x2.]
If we know these two distributions from past experience, we can set up a boundary between them, g(x1,x2)=0, which divides the two-dimensional space into two regions.
Once the boundary is chosen, we can assign a sample without a class label to the matched or the unmatched class, depending on the sign of g(x1,x2).
We call a procedure that detects the sign of g(x1,x2) a classifier.
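As a minimal sketch of a classifier that detects the sign of g(x1,x2), the Python below uses a linear boundary. The linear form, the weights, and the convention that a positive sign means "matched" are assumptions made for the example; the slides do not specify g.

```python
def g(x1, x2, w1=1.0, w2=1.0, b=-1.0):
    """Hypothetical linear boundary function; g(x1, x2) = 0 is the boundary."""
    return w1 * x1 + w2 * x2 + b


def classify(x1, x2):
    """Label a sample by the sign of g (assumed: positive means matched)."""
    return "matched" if g(x1, x2) > 0 else "unmatched"


print(classify(0.9, 0.8))  # matched
print(classify(0.1, 0.2))  # unmatched
```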
[Figure: Distributions of samples from matched and unmatched record pairs on the axes x1 and x2, separated by the boundary g(x1,x2) = 0.]
P(X)
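conditional density of class i: p(X|ci) or pi(X)
unconditional density function or mixture density function: $p(X) = \sum_{i=1}^{L} P_i \, p_i(X)$, where $P_i$ is the a priori probability of class i and L is the number of classes
a posteriori probability function: P(ci|X) or qi(X)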
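The relation between these quantities can be illustrated numerically. The sketch below assumes two classes with made-up priors and made-up conditional density values at a single point X; the function names are hypothetical. The posterior is obtained with Bayes' theorem, qi(X) = Pi pi(X) / p(X).

```python
def mixture_density(priors, cond_densities):
    """p(X) = sum over classes of P_i * p_i(X), evaluated at one point X."""
    return sum(P * p for P, p in zip(priors, cond_densities))


def posteriors(priors, cond_densities):
    """q_i(X) = P_i * p_i(X) / p(X) for every class i (Bayes' theorem)."""
    p_x = mixture_density(priors, cond_densities)
    return [P * p / p_x for P, p in zip(priors, cond_densities)]


# Illustrative numbers only: class order is (matched, unmatched).
priors = [0.25, 0.75]          # P_i
cond_at_x = [0.8, 0.1]         # p_i(X) at some comparison vector X

print(mixture_density(priors, cond_at_x))
print(posteriors(priors, cond_at_x))
```

With these numbers the mixture density is about 0.275 and the posterior of the matched class is about 0.727, so the evidence at X outweighs the smaller prior.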
$$\sum_{i=1}^{n-1} p_U(X_i) < \mu \le \sum_{i=1}^{n} p_U(X_i)$$

$$\sum_{i=n'}^{N} p_M(X_i) \ge \lambda > \sum_{i=n'+1}^{N} p_M(X_i)$$
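Decision areas: A1, A2, A3

$c_{ij}$: cost of decision $A_i$ (taken when $X \in A_i$) for a record pair whose actual status is $j \in \{M, U\}$

Decision   Actual status   Cost
A1         M               $c_{1M}$
A1         U               $c_{1U}$
A2         M               $c_{2M}$
A2         U               $c_{2U}$
A3         M               $c_{3M}$
A3         U               $c_{3U}$

Mean cost:

$$\bar{c} = c_{1M} P(X \in A_1, M) + c_{1U} P(X \in A_1, U) + c_{2M} P(X \in A_2, M) + c_{2U} P(X \in A_2, U) + c_{3M} P(X \in A_3, M) + c_{3U} P(X \in A_3, U)$$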
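Using the Bayes theorem and the definition of the conditional probability:

$$\bar{c} = \sum_{X \in A_1} \left[ c_{1M} \, p_M(X) \, \pi + c_{1U} \, p_U(X) \, (1-\pi) \right] + \sum_{X \in A_2} \left[ c_{2M} \, p_M(X) \, \pi + c_{2U} \, p_U(X) \, (1-\pi) \right] + \sum_{X \in A_3} \left[ c_{3M} \, p_M(X) \, \pi + c_{3U} \, p_U(X) \, (1-\pi) \right]$$

where $\pi$ is the a priori probability of M and $1-\pi$ that of U.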
Assign X to $A_1$ if:

$$c_{1M} \, p_M(X) \, \pi + c_{1U} \, p_U(X) \, (1-\pi) \le c_{2M} \, p_M(X) \, \pi + c_{2U} \, p_U(X) \, (1-\pi)$$
$$c_{1M} \, p_M(X) \, \pi + c_{1U} \, p_U(X) \, (1-\pi) \le c_{3M} \, p_M(X) \, \pi + c_{3U} \, p_U(X) \, (1-\pi)$$

To $A_2$ if:

$$c_{2M} \, p_M(X) \, \pi + c_{2U} \, p_U(X) \, (1-\pi) \le c_{1M} \, p_M(X) \, \pi + c_{1U} \, p_U(X) \, (1-\pi)$$
$$c_{2M} \, p_M(X) \, \pi + c_{2U} \, p_U(X) \, (1-\pi) \le c_{3M} \, p_M(X) \, \pi + c_{3U} \, p_U(X) \, (1-\pi)$$

To $A_3$ if:

$$c_{3M} \, p_M(X) \, \pi + c_{3U} \, p_U(X) \, (1-\pi) \le c_{1M} \, p_M(X) \, \pi + c_{1U} \, p_U(X) \, (1-\pi)$$
$$c_{3M} \, p_M(X) \, \pi + c_{3U} \, p_U(X) \, (1-\pi) \le c_{2M} \, p_M(X) \, \pi + c_{2U} \, p_U(X) \, (1-\pi)$$
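The rule amounts to computing, for each decision area, the quantity c_iM p_M(X) π + c_iU p_U(X) (1 − π) and picking the area where it is smallest. A minimal Python sketch follows. The function names, the prior π = 0.5, the density values, and the costs c1M = 0, c1U = 1, c3M = 1, c3U = 0 are assumptions chosen for illustration; only the pair c2M = 0.50, c2U = 0.25 appears in the results table of the slides. Reading A1 as "match", A2 as "possible match", and A3 as "non-match" is likewise an assumption.

```python
def expected_costs(p_m, p_u, pi, costs):
    """Return {i: c_iM * p_M(X) * pi + c_iU * p_U(X) * (1 - pi)} per area i.

    costs maps the area index i to the pair (c_iM, c_iU).
    """
    return {
        i: c_m * p_m * pi + c_u * p_u * (1.0 - pi)
        for i, (c_m, c_u) in costs.items()
    }


def decide(p_m, p_u, pi, costs):
    """Index of the decision area with the smallest expected cost at X."""
    z = expected_costs(p_m, p_u, pi, costs)
    return min(z, key=z.get)


# Assumed costs (c_iM, c_iU); only the middle pair is taken from the slides.
COSTS = {1: (0.0, 1.0), 2: (0.5, 0.25), 3: (1.0, 0.0)}
PI = 0.5  # assumed prior probability of M

print(decide(0.9, 0.1, PI, COSTS))  # 1
print(decide(0.5, 0.5, PI, COSTS))  # 2
print(decide(0.1, 0.9, PI, COSTS))  # 3
```

Ties are broken in favour of the lower area index here, which is an arbitrary choice: on a tie the expected costs are equal, so the mean cost does not depend on it.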
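In terms of the likelihood ratio $p_U(X) / p_M(X)$:

$$A_1 = \left\{ X : \frac{p_U(X)}{p_M(X)} \le \frac{\pi}{1-\pi} \cdot \frac{c_{3M} - c_{1M}}{c_{1U} - c_{3U}} \ \text{and,}\ \frac{p_U(X)}{p_M(X)} \le \frac{\pi}{1-\pi} \cdot \frac{c_{2M} - c_{1M}}{c_{1U} - c_{2U}} \right\}$$

$$A_2 = \left\{ X : \frac{p_U(X)}{p_M(X)} \ge \frac{\pi}{1-\pi} \cdot \frac{c_{2M} - c_{1M}}{c_{1U} - c_{2U}} \ \text{and,}\ \frac{p_U(X)}{p_M(X)} \le \frac{\pi}{1-\pi} \cdot \frac{c_{3M} - c_{2M}}{c_{2U} - c_{3U}} \right\}$$

$$A_3 = \left\{ X : \frac{p_U(X)}{p_M(X)} \ge \frac{\pi}{1-\pi} \cdot \frac{c_{3M} - c_{1M}}{c_{1U} - c_{3U}} \ \text{and,}\ \frac{p_U(X)}{p_M(X)} \ge \frac{\pi}{1-\pi} \cdot \frac{c_{3M} - c_{2M}}{c_{2U} - c_{3U}} \right\}$$

assuming $c_{1M} \le c_{2M} \le c_{3M}$ and $c_{1U} \ge c_{2U} \ge c_{3U}$.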
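When the costs are ordered as c1M ≤ c2M ≤ c3M and c1U > c2U > c3U, the decision can be made by comparing the ratio p_U(X)/p_M(X) with two thresholds: π/(1−π) · (c2M − c1M)/(c1U − c2U) between A1 and A2, and π/(1−π) · (c3M − c2M)/(c2U − c3U) between A2 and A3. The sketch below is a hedged illustration: the function names, the prior π = 0.5, and the costs for A1 and A3 are assumptions, and it presumes the lower threshold does not exceed the upper one.

```python
def thresholds(pi, costs):
    """Lower (A1/A2) and upper (A2/A3) thresholds on p_U(X) / p_M(X).

    costs maps i to (c_iM, c_iU); requires c_1U > c_2U > c_3U so that
    neither denominator is zero.
    """
    (c1m, c1u), (c2m, c2u), (c3m, c3u) = costs[1], costs[2], costs[3]
    odds = pi / (1.0 - pi)
    lower = odds * (c2m - c1m) / (c1u - c2u)
    upper = odds * (c3m - c2m) / (c2u - c3u)
    return lower, upper


def decide_by_ratio(ratio, lower, upper):
    """Map a likelihood ratio p_U/p_M to a decision area (needs lower <= upper)."""
    if ratio <= lower:
        return 1
    if ratio >= upper:
        return 3
    return 2


# Assumed costs (c_iM, c_iU); only (0.50, 0.25) for A2 is taken from the slides.
COSTS = {1: (0.0, 1.0), 2: (0.5, 0.25), 3: (1.0, 0.0)}
lower, upper = thresholds(0.5, COSTS)
print(lower, upper)

print(decide_by_ratio(0.1 / 0.9, lower, upper))  # 1
print(decide_by_ratio(0.5 / 0.5, lower, upper))  # 2
print(decide_by_ratio(0.9 / 0.1, lower, upper))  # 3
```

Comparing one ratio against two precomputed thresholds avoids evaluating all three expected costs for every record pair.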
For $A_2 \ne \emptyset$ to exist:

$$\frac{c_{2M} - c_{1M}}{c_{1U} - c_{2U}} \le \frac{c_{3M} - c_{2M}}{c_{2U} - c_{3U}}$$
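$$\bar{c} = \sum_{X \in A_1} z_1(X) + \sum_{X \in A_2} z_2(X) + \sum_{X \in A_3} z_3(X) = \sum_{X} \left[ z_1(X) \, I_{A_1}(X) + z_2(X) \, I_{A_2}(X) + z_3(X) \, I_{A_3}(X) \right] \ge \sum_{X} \min\{ z_1(X), z_2(X), z_3(X) \}$$

where $z_i(X) \stackrel{\text{def}}{=} c_{iM} \, p_M(X) \, \pi + c_{iU} \, p_U(X) \, (1-\pi)$ and $I_{A_i}(X)$ is the indicator function of $A_i$. The lower bound is attained when every X is assigned to the area $A_i$ with the smallest $z_i(X)$.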
$$P(A_3 \mid M) = \sum_{X \in A_3} p_M(X)$$

$$P(A_1 \mid U) = \sum_{X \in A_1} p_U(X)$$
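For a small discrete space of comparison vectors, the two sums over A3 and A1 can be evaluated directly. The sketch below is illustrative only: the two-field vectors (coded 1 = agreement, 2 = disagreement), the probability values, the assignment of vectors to areas, and the function name are all made up for the example.

```python
def area_probability(density, area):
    """Sum a class-conditional density over the comparison vectors in one area."""
    return sum(density[x] for x in area)


# Made-up densities over all vectors of two fields (1 = agree, 2 = disagree).
p_m = {(1, 1): 0.80, (1, 2): 0.10, (2, 1): 0.08, (2, 2): 0.02}
p_u = {(1, 1): 0.02, (1, 2): 0.08, (2, 1): 0.10, (2, 2): 0.80}

# Made-up decision areas that partition the space.
areas = {1: [(1, 1)], 2: [(1, 2), (2, 1)], 3: [(2, 2)]}

missed_match = area_probability(p_m, areas[3])  # matched pair falls in A3
false_match = area_probability(p_u, areas[1])   # unmatched pair falls in A1
print(missed_match, false_match)
```

Both error probabilities come out at 0.02 for these numbers, because the example densities were chosen to be mirror images of each other.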
[Slide equations over the comparison-vector components x1, x2, ..., xn for the classes M and U; the formulas did not survive extraction.]
SSN       1.00  0.00  0.35  0.65
FNAME     0.96  0.04  0.29  0.71
MINIT     0.95  0.05  0.05  0.95
LNAME     0.97  0.03  0.30  0.70
STREET#   1.00  0.00  0.00  1.00
SADDRESS  0.77  0.23  0.01  0.99
APRT#     1.00  0.00  0.00  1.00
CITY      0.89  0.11  0.06  0.94
STATE     1.00  0.00  0.00  1.00
ZIPCODE   0.97  0.03  0.43  0.75
GID  c2M     c2U     λ, µ    %Error  % of Recs in A2
A    0.50    0.50            1.0013  0.0000
     0.40    0.60            1.0013  0.0000
B    0.50    0.25    0.0884  1.0013  0.0000
     0.50    0.05    0.7874  1.0013  0.0062
     0.50    0.005   1.7884  0.3650  1.1692
     0.50    0.0005  2.7874  0.3602  1.5797
C    0.25    0.25    0.2645  0.9890  0.0186
     0.1     0.1     0.7416  0.9890  0.0186
     0.05    0.05    1.0661  0.9836  0.0995
     0.0005  0.0005  3.0882  0.2028  1.8720
     0.005   0.005   2.0862  0.3471  1.4553