

SLIDE 1

A Decision Model for Cost Optimal Record Matching

Presenter: Vassilios S. Verykios, IST College / Drexel University. Affiliates Workshop on Data Quality, NISS/Telcordia, December 1st, 2000.

SLIDE 2

Comparison Vector

  • Given a pair of database records with partially overlapping schemata, decide whether it is a match or not.
  • Compare the pairs of values stored in each common attribute/field (assume n common fields).
  • The n comparison measurements form a comparison vector X.
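As a minimal sketch of this construction in Python: the records, field names, and the binary agree/disagree encoding below are illustrative assumptions (the deck also allows a "missing" outcome, which is omitted here).

```python
def comparison_vector(rec_a, rec_b, common_fields):
    """Build the comparison vector X: one measurement per common field.

    Each measurement here is a simple agree (1) / disagree (0) flag;
    richer comparison functions (string distances, a 'missing' code)
    fit the same shape.
    """
    return [int(rec_a[f] == rec_b[f]) for f in common_fields]


# Hypothetical records sharing three common fields.
x = comparison_vector(
    {"last": "Smith", "first": "Jon", "sex": "M"},
    {"last": "Smith", "first": "John", "sex": "M"},
    ["last", "first", "sex"],
)
print(x)  # [1, 0, 1]
```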

SLIDE 3

Record Comparison

Record 1 fields:    A B C D E F
Record 2 fields:    A B C D   F
Comparison vector:  (1, 1, 3, 2, 2, 1)

Outcome codes: 1. Agreement  2. Disagreement  3. Missing

SLIDE 4

Random Vector

  • Even if a pair of records match, the observed value for each field comparison is different each time the observation is made.
  • Therefore, each field comparison variable is a random variable.
  • Likewise, the comparison vector X is a random vector.

SLIDE 5

Distribution of Vectors

  • Each pair of records is expressed by a comparison vector (or a sample) in an n-dimensional space.
  • Many comparison vectors form a distribution f_X in the n-dimensional space.
  • Figure 1 shows a simple two-dimensional example of two distributions corresponding to matched and unmatched pairs of records.

SLIDE 6

Figure 1

[Figure: scatter of comparison vectors in the (x1, x2) plane; matched pairs are marked "o" and unmatched pairs "x", forming two separate clusters.]

Distributions of samples from matched and unmatched record pairs.

SLIDE 7

Classifiers

  • If we know these two distributions of X from past experience, we can set up a boundary between them, g(x1, x2) = 0, which divides the two-dimensional space into two regions.
  • Once the boundary is selected, we can classify an unlabeled sample as matched or unmatched, depending on the sign of g(x1, x2).
  • We call g(x1, x2) a discriminant function, and a system that detects the sign of g(x1, x2) a classifier.
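As a minimal sketch, a linear discriminant and its sign-based classifier might look like the following; the weights are arbitrary illustrations, not values from the slides (in practice they are fit to the two distributions).

```python
def g(x1, x2, w1=1.0, w2=1.0, b=-1.0):
    """A linear discriminant g(x1, x2) = w1*x1 + w2*x2 + b.

    The weights here are illustrative placeholders; learning them from
    training samples is discussed later in the deck.
    """
    return w1 * x1 + w2 * x2 + b


def classify(x1, x2):
    # The classifier only inspects the sign of g(x1, x2).
    return "matched" if g(x1, x2) > 0 else "unmatched"


print(classify(1.0, 1.0))  # matched (g = 1 > 0)
print(classify(0.0, 0.0))  # unmatched (g = -1 < 0)
```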

SLIDE 8

Figure 2

[Figure: the same (x1, x2) scatter as Figure 1, with the decision boundary g(x1, x2) = 0 drawn between the matched ("o") and unmatched ("x") clusters.]

Distributions of samples from matched and unmatched record pairs.

SLIDE 9

Learning

  • In order to design a classifier, we must study the characteristics of the distribution of X for each category and find a proper discriminant function.
  • This process is called learning.
  • Samples used to design a classifier are called learning or training samples.

SLIDE 10

Statistical Hypothesis Testing

  • What is the best classifier, assuming that the distributions of the random vectors are given?
  • The Bayes classifier minimizes the probability of classification error.

SLIDE 11

Distribution and Density Functions

  • Random vector X
  • Distribution function P(X)
  • Density function p(X)
  • Class-i density, or conditional density of class i: p(X|c_i) or p_i(X)
  • Unconditional density function, or mixture density function: p(X) = Σ_{i=1..L} P_i · p_i(X)
  • A posteriori probability of class i given X: P(c_i|X) or q_i(X)
  • Bayes rule: q_i(X) = P_i · p_i(X) / p(X)
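These definitions translate directly into code. The sketch below uses two made-up constant class-conditional densities purely for illustration; only the mixture-density and Bayes-rule formulas come from the slide.

```python
def posteriors(x, priors, densities):
    """Bayes rule: q_i(X) = P_i * p_i(X) / p(X), where the mixture
    density is p(X) = sum_i P_i * p_i(X)."""
    mixture = sum(P * p(x) for P, p in zip(priors, densities))
    return [P * p(x) / mixture for P, p in zip(priors, densities)]


# Two classes with illustrative constant conditional densities at a point x.
p_M = lambda x: 0.9  # stand-in for p(X | c_M)
p_U = lambda x: 0.1  # stand-in for p(X | c_U)
q = posteriors(None, [0.5, 0.5], [p_M, p_U])
print(q)  # [0.9, 0.1] -- the posteriors sum to 1
```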

SLIDE 12

Bayes Rule for Minimum Error

  • Let X be a comparison vector.
  • Determine whether X belongs to M (matched) or U (unmatched).
  • If the a posteriori probability of M given X is larger than that of U, X is classified to M, and vice versa: decide M if q_M(X) > q_U(X).
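Because the mixture density p(X) is common to both posteriors, the rule reduces to comparing π·p_M(X) against (1−π)·p_U(X). A sketch, with illustrative densities:

```python
def bayes_decide(x, pi, p_M, p_U):
    """Decide M when P(M|X) > P(U|X); since p(X) cancels, it suffices
    to compare pi * p_M(X) with (1 - pi) * p_U(X)."""
    return "M" if pi * p_M(x) > (1 - pi) * p_U(x) else "U"


# Illustrative densities (not from the slides): x is an arbitrary point.
decision = bayes_decide(3, 0.5, lambda x: 0.8, lambda x: 0.05)
print(decision)  # M
```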

SLIDE 13

Fellegi-Sunter Model

  • Order the X's based on their likelihood ratio:

l(X) = p_M(X) / p_U(X)

  • For a pair of error levels (µ, λ), choose index values n and n′ such that:

Σ_{i=1..n−1} p_U(X_i) ≤ µ < Σ_{i=1..n} p_U(X_i)

Σ_{i=n′+1..N} p_M(X_i) ≤ λ < Σ_{i=n′..N} p_M(X_i)
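The index selection can be sketched as follows. This is a reconstruction under the assumption that configurations are sorted by decreasing likelihood ratio and the two cutoffs keep the cumulative error masses within µ and λ; the tie and boundary handling in the original Fellegi-Sunter procedure is more careful than this.

```python
def choose_cutoffs(configs, p_M, p_U, mu, lam):
    """Pick n (match cutoff) and n' (non-match cutoff) for error levels
    (mu, lam) after sorting configurations by l(X) = p_M(X) / p_U(X)."""
    order = sorted(configs, key=lambda X: p_M(X) / p_U(X), reverse=True)
    # n: longest prefix (declared matches) whose U-mass stays within mu.
    cum_u, n = 0.0, 0
    for X in order:
        if cum_u + p_U(X) > mu:
            break
        cum_u += p_U(X)
        n += 1
    # n': longest suffix (declared non-matches) whose M-mass stays within lam.
    cum_m, n_prime = 0.0, len(order)
    for X in reversed(order):
        if cum_m + p_M(X) > lam:
            break
        cum_m += p_M(X)
        n_prime -= 1
    return order, n, n_prime
```

Configurations between positions n and n′ fall into the middle (clerical-review) region.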

SLIDE 14

Minimum Cost Model

  • Minimizing the probability of error is not the best criterion for designing a decision rule, because the misclassifications of M and U samples may have different consequences.
  • The misclassification of a cancer patient as normal may have a more damaging effect than the misclassification of a normal patient as having cancer.
  • Therefore, it is appropriate to assign a cost to each situation.

SLIDE 15

Decision Costs

Class   Decision   Cost
M       A1         c_1M
U       A1         c_1U
M       A2         c_2M
U       A2         c_2U
M       A3         c_3M
U       A3         c_3U

SLIDE 16

Mean Cost (I)

c̄ = c_1M · P(d = A1, c = M) + c_1U · P(d = A1, c = U)
  + c_2M · P(d = A2, c = M) + c_2U · P(d = A2, c = U)
  + c_3M · P(d = A3, c = M) + c_3U · P(d = A3, c = U)

SLIDE 17

Bayes Theorem

P(d = A_j, c = i) = P(d = A_j | c = i) · P(c = i), where j = 1, 2, 3 and i = M, U

SLIDE 18

Conditional Probability

P(d = A_j | c = i) = Σ_{X ∈ A_j} p_i(X), where j = 1, 2, 3 and i = M, U

P(c = M) = π and P(c = U) = 1 − π

SLIDE 19

Mean Cost (II)

Using the Bayes theorem:

c̄ = c_1M · P(d = A1 | c = M) · P(c = M) + c_1U · P(d = A1 | c = U) · P(c = U)
  + c_2M · P(d = A2 | c = M) · P(c = M) + c_2U · P(d = A2 | c = U) · P(c = U)
  + c_3M · P(d = A3 | c = M) · P(c = M) + c_3U · P(d = A3 | c = U) · P(c = U)

Using the definition of the conditional probability:

c̄ = Σ_{X ∈ A1} [ c_1M · π · p_M(X) + c_1U · (1 − π) · p_U(X) ]
  + Σ_{X ∈ A2} [ c_2M · π · p_M(X) + c_2U · (1 − π) · p_U(X) ]
  + Σ_{X ∈ A3} [ c_3M · π · p_M(X) + c_3U · (1 − π) · p_U(X) ]

SLIDE 20

Mean Cost (III)

c̄ = Σ_{X ∈ A1} [ c_1M · π · p_M(X) + c_1U · (1 − π) · p_U(X) ]
  + Σ_{X ∈ A2} [ c_2M · π · p_M(X) + c_2U · (1 − π) · p_U(X) ]
  + Σ_{X ∈ A3} [ c_3M · π · p_M(X) + c_3U · (1 − π) · p_U(X) ]

SLIDE 21

Decision Areas

  • Every sample X in the decision space A should be assigned to exactly one decision class: A1, A2, or A3.
  • We should thus assign each sample to a class in such a way that its contribution to the mean cost is minimum.
  • This leads to the optimal selection for the three sets, which we denote by A1^0, A2^0, A3^0.

SLIDE 22

Decision Making

  • A sample is assigned to the optimal areas as follows:

To A1 if: c_1M·π·p_M(X) + c_1U·(1−π)·p_U(X) ≤ c_2M·π·p_M(X) + c_2U·(1−π)·p_U(X)
     and: c_1M·π·p_M(X) + c_1U·(1−π)·p_U(X) ≤ c_3M·π·p_M(X) + c_3U·(1−π)·p_U(X)

To A2 if: c_2M·π·p_M(X) + c_2U·(1−π)·p_U(X) ≤ c_1M·π·p_M(X) + c_1U·(1−π)·p_U(X)
     and: c_2M·π·p_M(X) + c_2U·(1−π)·p_U(X) ≤ c_3M·π·p_M(X) + c_3U·(1−π)·p_U(X)

To A3 if: c_3M·π·p_M(X) + c_3U·(1−π)·p_U(X) ≤ c_1M·π·p_M(X) + c_1U·(1−π)·p_U(X)
     and: c_3M·π·p_M(X) + c_3U·(1−π)·p_U(X) ≤ c_2M·π·p_M(X) + c_2U·(1−π)·p_U(X)

SLIDE 23

Optimal Decision Areas

  • We thus conclude from the previous slide:

A1^0: X such that p_U(X)/p_M(X) ≤ (π/(1−π)) · (c_2M − c_1M)/(c_1U − c_2U)
             and  p_U(X)/p_M(X) ≤ (π/(1−π)) · (c_3M − c_1M)/(c_1U − c_3U)

A2^0: X such that p_U(X)/p_M(X) ≥ (π/(1−π)) · (c_2M − c_1M)/(c_1U − c_2U)
             and  p_U(X)/p_M(X) ≤ (π/(1−π)) · (c_3M − c_2M)/(c_2U − c_3U)

A3^0: X such that p_U(X)/p_M(X) ≥ (π/(1−π)) · (c_3M − c_1M)/(c_1U − c_3U)
             and  p_U(X)/p_M(X) ≥ (π/(1−π)) · (c_3M − c_2M)/(c_2U − c_3U)

SLIDE 24

Threshold Values

λ = (π/(1−π)) · (c_2M − c_1M)/(c_1U − c_2U)

κ = (π/(1−π)) · (c_3M − c_1M)/(c_1U − c_3U)

µ = (π/(1−π)) · (c_3M − c_2M)/(c_2U − c_3U)

assuming c_1M ≤ c_2M ≤ c_3M and c_1U ≥ c_2U ≥ c_3U
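Assuming the threshold definitions λ = (π/(1−π))(c_2M − c_1M)/(c_1U − c_2U), κ = (π/(1−π))(c_3M − c_1M)/(c_1U − c_3U), and µ = (π/(1−π))(c_3M − c_2M)/(c_2U − c_3U), the three-way decision can be sketched as follows; the costs in the usage example are the ones used in the worked example later in the deck.

```python
def thresholds(pi, c1M, c2M, c3M, c1U, c2U, c3U):
    """Compute (lambda, kappa, mu) from the prior pi and the six costs."""
    r = pi / (1.0 - pi)
    lam = r * (c2M - c1M) / (c1U - c2U)
    kap = r * (c3M - c1M) / (c1U - c3U)
    mu = r * (c3M - c2M) / (c2U - c3U)
    return lam, kap, mu


def decide(ratio, lam, mu):
    """Classify by the ratio p_U(X)/p_M(X): A1 (match) at or below
    lambda, A3 (non-match) at or above mu, A2 (review) in between."""
    if ratio <= lam:
        return "A1"
    if ratio >= mu:
        return "A3"
    return "A2"


lam, kap, mu = thresholds(0.5, 0.0, 0.2, 1.0, 1.0, 0.2, 0.0)
print(round(lam, 4), round(kap, 4), round(mu, 4))  # 0.25 1.0 4.0
```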

SLIDE 25

Threshold Values

  • In order for A2^0 to exist:

λ = (π/(1−π)) · (c_2M − c_1M)/(c_1U − c_2U) ≤ (π/(1−π)) · (c_3M − c_2M)/(c_2U − c_3U) = µ

  • We can now easily prove that the threshold κ lies between λ and µ.

SLIDE 26

Threshold Values

  • Adding by parts the relationships above, we can easily show that λ ≤ κ.
  • Similarly, we can prove that κ ≤ µ.

λ · (1 − π) · (c_1U − c_2U) = π · (c_2M − c_1M)

λ · (1 − π) · (c_2U − c_3U) ≤ µ · (1 − π) · (c_2U − c_3U) = π · (c_3M − c_2M)

Adding these by parts gives λ · (1 − π) · (c_1U − c_3U) ≤ π · (c_3M − c_1M), i.e., λ ≤ κ.

SLIDE 27

Optimality of the Model

Define z_j(X) = c_jM · π · p_M(X) + c_jU · (1 − π) · p_U(X), the contribution of X to the mean cost when X is assigned to A_j. Then:

c̄ = Σ_{X ∈ A1} z_1(X) + Σ_{X ∈ A2} z_2(X) + Σ_{X ∈ A3} z_3(X)
  = Σ_{X ∈ A} [ I_{A1}(X) · z_1(X) + I_{A2}(X) · z_2(X) + I_{A3}(X) · z_3(X) ]
  ≥ Σ_{X ∈ A} min{ z_1(X), z_2(X), z_3(X) }
  = Σ_{X ∈ A1^0} z_1(X) + Σ_{X ∈ A2^0} z_2(X) + Σ_{X ∈ A3^0} z_3(X)

SLIDE 28

Probabilities of Errors

  • Type I error (a matched pair classified as unmatched):

P(d = A3, c = M) = P(d = A3 | c = M) · P(c = M) = π · Σ_{X ∈ A3} p_M(X)

  • Type II error (an unmatched pair classified as matched):

P(d = A1, c = U) = P(d = A1 | c = U) · P(c = U) = (1 − π) · Σ_{X ∈ A1} p_U(X)
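Given the decision areas, both error masses are simple sums. A sketch with a hypothetical two-configuration space and made-up densities:

```python
def error_probabilities(pi, areas, p_M, p_U):
    """Type I: matched pairs landing in A3 (missed matches);
    Type II: unmatched pairs landing in A1 (false matches)."""
    type1 = pi * sum(p_M(X) for X in areas["A3"])
    type2 = (1 - pi) * sum(p_U(X) for X in areas["A1"])
    return type1, type2


# Toy space: configuration 0 falls in A1, configuration 1 in A3.
areas = {"A1": [0], "A2": [], "A3": [1]}
t1, t2 = error_probabilities(0.5, areas, {0: 0.2, 1: 0.8}.get, {0: 0.7, 1: 0.3}.get)
print(t1, t2)  # 0.4 0.35
```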

SLIDE 29

Conditionally Independent Binary Components

X = [x_1 x_2 … x_n]

p_j(X) = p_j(x_1) · p_j(x_2) ⋯ p_j(x_n), where j = M, U

p_M(x_i = 1) = p_i,  p_M(x_i = 0) = 1 − p_i
p_U(x_i = 1) = q_i,  p_U(x_i = 0) = 1 − q_i
SLIDE 30

Conditionally Independent Binary Components

log ( p_U(x_1, x_2, …, x_n) / p_M(x_1, x_2, …, x_n) )
  = log ( [ p_U(x_1) · p_U(x_2) ⋯ p_U(x_n) ] / [ p_M(x_1) · p_M(x_2) ⋯ p_M(x_n) ] )
  = Σ_{i=1..n} log ( p_U(x_i) / p_M(x_i) )

SLIDE 31

Conditionally Independent Binary Components

  • Note that, since x_i can only assume the values 0 or 1:

log ( p_U / p_M ) = Σ_{i=1..n} [ x_i · log( q_i / p_i ) + (1 − x_i) · log( (1 − q_i) / (1 − p_i) ) ]
                  = Σ_{i=1..n} [ x_i · log( q_i · (1 − p_i) / ( p_i · (1 − q_i) ) ) + log( (1 − q_i) / (1 − p_i) ) ]
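This weight computation is easy to express directly. Base-10 logarithms are used here because they reproduce the numbers in the worked example later in the deck; the function name and argument layout are our own.

```python
import math


def log10_weight(x, p, q):
    """log10(p_U(X) / p_M(X)) for conditionally independent binary fields.

    x: sequence of 0/1 agreement indicators;
    p[i] = p_M(x_i = 1); q[i] = p_U(x_i = 1).
    """
    return sum(
        xi * math.log10(qi / pi) + (1 - xi) * math.log10((1 - qi) / (1 - pi))
        for xi, pi, qi in zip(x, p, q)
    )
```

With the three-attribute probabilities from the example, `log10_weight((1, 1, 1), [0.90, 0.85, 0.95], [0.05, 0.10, 0.45])` evaluates to about −2.51, matching the decisions table.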

SLIDE 32

Example

  • Records are being compared.
  • Three attributes: last name, first name, and sex.
  • Two possible outcomes per attribute: agree and disagree.
  • There are eight possible 3-component comparison vectors.

SLIDE 33

Probabilities of Agreement and Disagreement

Attribute     Under M            Under U
              p_i     1−p_i      q_i     1−q_i
Last Name     0.90    0.10       0.05    0.95
First Name    0.85    0.15       0.10    0.90
Sex           0.95    0.05       0.45    0.55

SLIDE 34

Comparisons and Costs

X = (x_1, x_2, x_3), where x_i = 1 if the attribute values agree and x_i = 0 otherwise

π = 0.5 (so 1 − π = 0.5)

c_1M = c_3U = 0,  c_2M = c_2U = 0.2,  c_3M = c_1U = 1

SLIDE 35

Decisions Made

i   X          log10(p_U/p_M)   Decision
1   (0,0,0)     2.795           A3
2   (0,0,1)     1.429           A3
3   (0,1,0)     1.088           A3
4   (0,1,1)    −0.272           A2
5   (1,0,0)     0.562           A2
6   (1,0,1)    −0.804           A1
7   (1,1,0)    −1.145           A1
8   (1,1,1)    −2.511           A1
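This table can be reproduced by combining the agreement probabilities with the cost-derived thresholds (λ = 0.25 and µ = 4 from the costs above, so log10 λ ≈ −0.602 and log10 µ ≈ 0.602):

```python
import itertools
import math

p = [0.90, 0.85, 0.95]  # P(agree | M): last name, first name, sex
q = [0.05, 0.10, 0.45]  # P(agree | U)
log_lam, log_mu = math.log10(0.25), math.log10(4.0)


def weight(x):
    # log10(p_U(X) / p_M(X)) under conditional independence.
    return sum(
        xi * math.log10(qi / pi) + (1 - xi) * math.log10((1 - qi) / (1 - pi))
        for xi, pi, qi in zip(x, p, q)
    )


for x in itertools.product([0, 1], repeat=3):
    w = weight(x)
    area = "A1" if w <= log_lam else ("A3" if w >= log_mu else "A2")
    print(x, round(w, 3), area)
```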

slide-36
SLIDE 36

Experiments

Attribute    Under M            Under U
             p_i     1−p_i      q_i     1−q_i
SSN          1.00    0.00       0.35    0.65
FNAME        0.96    0.04       0.29    0.71
MINIT        0.95    0.05       0.05    0.95
LNAME        0.97    0.03       0.30    0.70
STREET#      1.00    0.00       0.00    1.00
SADDRESS     0.77    0.23       0.01    0.99
APRT#        1.00    0.00       0.00    1.00
CITY         0.89    0.11       0.06    0.94
STATE        1.00    0.00       0.00    1.00
ZIPCODE      0.97    0.03       0.43    0.75

slide-37
SLIDE 37

Percent of Error vs. Percent of Records in A2

GID   c_2M     c_2U     λ          µ          %Error   % of Recs in A2
A     0.50     0.50     −0.2126    −0.2126    1.0013   0.0000
A     0.40     0.60     −0.2126    −0.2126    1.0013   0.0000
B     0.50     0.25     −0.3887     0.0884    1.0013   0.0000
B     0.50     0.05     −0.4914     0.7874    1.0013   0.0062
B     0.50     0.005    −0.5115     1.7884    0.3650   1.1692
B     0.50     0.0005   −0.5134     2.7874    0.3602   1.5797
C     0.25     0.25     −0.6897     0.2645    0.9890   0.0186
C     0.1      0.1      −1.1668     0.7416    0.9890   0.0186
C     0.05     0.05     −1.4914     1.0661    0.9836   0.0995
C     0.005    0.005    −2.5115     2.0862    0.3471   1.4553
C     0.0005   0.0005   −3.5134     3.0882    0.2028   1.8720

(λ and µ are given on a log10 scale.)

slide-38
SLIDE 38

Concluding Remarks

  • Efficiency
  • Time optimal models
  • Prototype implementation