

  1. A Decision Model for Cost Optimal Record Matching — Presenter: Vassilios S. Verykios, IST College / Drexel University. Affiliates Workshop on Data Quality, NISS/Telcordia, December 1st, 2000.

  2. Comparison Vector • Given a pair of database records with partially overlapping schemata, decide whether it is a match or not. • Compare the pairs of values stored in each common attribute/field (assume n common fields). • The n comparison measurements form a comparison vector X.

  3. Record Comparison • [Diagram: two records, one with fields A B C D E F and one with fields A B C D F; each field comparison is coded 1 = Agreement, 2 = Disagreement, 3 = Missing, and the resulting codes form the comparison vector.]
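The field-by-field coding above can be sketched in Python. This is an illustrative sketch, not code from the presentation; the records, field names, and equality-based comparison are made-up assumptions (real matchers would use string-similarity measures per field).

```python
# Codes for each field comparison, following the slide's 1/2/3 convention.
AGREE, DISAGREE, MISSING = 1, 2, 3

def comparison_vector(rec_a, rec_b, common_fields):
    """Compare the values stored in each common field of two records
    and return the comparison vector X as a list of codes."""
    x = []
    for field in common_fields:
        va, vb = rec_a.get(field), rec_b.get(field)
        if va is None or vb is None:
            x.append(MISSING)      # value absent in one record
        elif va == vb:
            x.append(AGREE)
        else:
            x.append(DISAGREE)
    return x

# Hypothetical records with partially overlapping content:
rec_a = {"A": "smith", "B": "john", "C": "1970", "D": None}
rec_b = {"A": "smith", "B": "jon",  "C": "1970", "D": "NY"}
print(comparison_vector(rec_a, rec_b, ["A", "B", "C", "D"]))  # -> [1, 2, 1, 3]
```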

  4. Random Vector • Even if a pair of records match, the observed value for each field comparison is different each time the observation is made. • Therefore, each field comparison variable is a random variable. • Likewise, the comparison vector X is a random vector.

  5. Distribution of Vectors • Each pair of records is expressed by a comparison vector (or a sample) in an n-dimensional space. • Many comparison vectors form a distribution of X in the n-dimensional space. • Figure 1 shows a simple two-dimensional example of two distributions corresponding to matched and unmatched pairs of records.

  6. Figure 1 • [Scatter plot in the (x1, x2) plane: distributions of samples from matched (x) and unmatched (o) record pairs.]

  7. Classifiers • If we know these two distributions of X from past experience, we can set up a boundary between them, g(x1, x2) = 0, which divides the two-dimensional space into two regions. • Once the boundary is selected, we can classify a sample without a class label as matched or unmatched, depending on the sign of g(x1, x2). • We call g(x1, x2) a discriminant function, and a system that detects the sign of g(x1, x2) a classifier.
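A minimal sketch of such a classifier, assuming a hypothetical linear boundary g(x1, x2) = x1 + x2 − 1 (the weights are illustrative, not learned from real data):

```python
def g(x1, x2):
    # Hypothetical discriminant function; g(x1, x2) = 0 is the boundary.
    return x1 + x2 - 1.0

def classify(x1, x2):
    """Detect the sign of g: positive side -> matched, else unmatched."""
    return "matched" if g(x1, x2) > 0 else "unmatched"

print(classify(0.9, 0.8))  # -> matched
print(classify(0.1, 0.2))  # -> unmatched
```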

  8. Figure 2 • [Same scatter plot as Figure 1, with the decision boundary g(x1, x2) = 0 drawn between the matched (x) and unmatched (o) samples.]

  9. Learning • In order to design a classifier, we must study the characteristics of the distribution of X for each category and find a proper discriminant function. • This process is called learning. • Samples used to design a classifier are called learning or training samples.

  10. Statistical Hypothesis Testing • What is the best classifier, assuming that the distributions of the random vectors are given? • The Bayes classifier minimizes the probability of classification error.

  11. Distribution and Density Functions • Random vector X, with distribution function P(X) and density function p(X). • Unconditional density function (or mixture density function): p(X) = Σ_i P_i p_i(X). • A posteriori probability of class i: P(c_i|X), also written q_i(X). • Class i density (or conditional density of class i): p(X|c_i), also written p_i(X). • Bayes rule: q_i(X) = P_i p_i(X) / p(X).
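These definitions can be sketched numerically for the two classes M and U. The priors and class-conditional density values below are made-up toy numbers:

```python
def mixture_density(p_m_of_x, p_u_of_x, prior_m):
    """Unconditional (mixture) density p(X) = P(M) p_M(X) + P(U) p_U(X)."""
    return prior_m * p_m_of_x + (1 - prior_m) * p_u_of_x

def posterior_m(p_m_of_x, p_u_of_x, prior_m):
    """A posteriori probability of M via Bayes rule:
    q_M(X) = P(M) p_M(X) / p(X)."""
    return prior_m * p_m_of_x / mixture_density(p_m_of_x, p_u_of_x, prior_m)

# Toy values: p_M(X) = 0.3, p_U(X) = 0.05, P(M) = 0.2
print(mixture_density(0.3, 0.05, 0.2))  # ≈ 0.1
print(posterior_m(0.3, 0.05, 0.2))      # ≈ 0.6
```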

  12. Bayes Rule for Minimum Error • Let X be a comparison vector. • Determine whether X belongs to M or U. • If the a posteriori probability of M given X is larger than that of U, X is classified to M, and vice versa.
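A sketch of this minimum-error rule, assuming the class-conditional probabilities and the prior P(M) are known (the numbers in the example are made up for illustration; since p(X) is common to both posteriors, it cancels from the comparison):

```python
def bayes_classify(p_x_given_m, p_x_given_u, prior_m):
    """Classify X to M if P(M|X) > P(U|X), else to U.
    The shared denominator p(X) cancels, so unnormalized
    posteriors suffice for the comparison."""
    post_m = p_x_given_m * prior_m
    post_u = p_x_given_u * (1 - prior_m)
    return "M" if post_m > post_u else "U"

print(bayes_classify(0.30, 0.05, prior_m=0.2))  # 0.06 vs 0.04 -> M
print(bayes_classify(0.05, 0.30, prior_m=0.2))  # 0.01 vs 0.24 -> U
```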

  13. Fellegi-Sunter Model • Order the X's by decreasing value of their likelihood ratio: l(X) = p(X|M) / p(X|U). • For a pair of error levels (μ, λ), choose index values n and n' such that:
Σ_{i=1}^{n−1} p(X_i|U) < μ ≤ Σ_{i=1}^{n} p(X_i|U)
Σ_{i=n'}^{N} p(X_i|M) ≥ λ > Σ_{i=n'+1}^{N} p(X_i|M)
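The index selection can be sketched as follows. This is an illustrative sketch, not the authors' code: the probability lists and error levels are made-up toy values, and the vectors are assumed to be already sorted by decreasing likelihood ratio.

```python
def fellegi_sunter_indices(p_u, p_m, mu, lam):
    """p_u[i], p_m[i]: probability of the i-th sorted vector under U and M.
    Returns (n, n') as 1-based indices satisfying the slide's inequalities
    for the error levels (mu, lam)."""
    N = len(p_u)
    n, cum_u = N, 0.0
    for i in range(N):                 # smallest n with prefix sum >= mu
        cum_u += p_u[i]
        if cum_u >= mu:
            n = i + 1
            break
    nprime, cum_m = 1, 0.0
    for i in range(N - 1, -1, -1):     # largest n' with tail sum >= lam
        cum_m += p_m[i]
        if cum_m >= lam:
            nprime = i + 1
            break
    return n, nprime

# Vectors sorted by decreasing l(X) = p(X|M)/p(X|U):
p_u = [0.01, 0.04, 0.15, 0.30, 0.50]
p_m = [0.50, 0.30, 0.15, 0.04, 0.01]
print(fellegi_sunter_indices(p_u, p_m, mu=0.04, lam=0.04))  # -> (2, 4)
```

Vectors with index below n are declared matches and those above n' non-matches; the ones in between are left for clerical review.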

  14. Minimum Cost Model • Minimizing the probability of error is not the best criterion to design a decision rule because the misclassifications of M and U samples may have different consequences. • The misclassification of a cancer patient to normal may have a more damaging effect than the misclassification of a normal patient to cancer. • Therefore, it is appropriate to assign a cost to each situation.

  15. Decision Costs

  Cost   Decision   Class
  c_1M   A_1        M
  c_1U   A_1        U
  c_2M   A_2        M
  c_2U   A_2        U
  c_3M   A_3        M
  c_3U   A_3        U

  16. Mean Cost (I)
c = c_1M · P(d = A_1, c = M) + c_1U · P(d = A_1, c = U)
  + c_2M · P(d = A_2, c = M) + c_2U · P(d = A_2, c = U)
  + c_3M · P(d = A_3, c = M) + c_3U · P(d = A_3, c = U)

  17. Bayes Theorem
P(d = A_i, c = j) = P(d = A_i | c = j) · P(c = j), where i = 1, 2, 3 and j = M, U

  18. Conditional Probability
P(d = A_i | c = j) = Σ_{X ∈ A_i} p_j(X), where i = 1, 2, 3 and j = M, U
P(c = M) = π_0 and P(c = U) = 1 − π_0

  19. Mean Cost (II)
Using the Bayes theorem:
c = c_1M · P(d = A_1 | c = M) · P(c = M) + c_1U · P(d = A_1 | c = U) · P(c = U)
  + c_2M · P(d = A_2 | c = M) · P(c = M) + c_2U · P(d = A_2 | c = U) · P(c = U)
  + c_3M · P(d = A_3 | c = M) · P(c = M) + c_3U · P(d = A_3 | c = U) · P(c = U)
Using the definition of the conditional probability:
c = c_1M · π_0 · Σ_{X ∈ A_1} p_M(X) + c_1U · (1 − π_0) · Σ_{X ∈ A_1} p_U(X)
  + c_2M · π_0 · Σ_{X ∈ A_2} p_M(X) + c_2U · (1 − π_0) · Σ_{X ∈ A_2} p_U(X)
  + c_3M · π_0 · Σ_{X ∈ A_3} p_M(X) + c_3U · (1 − π_0) · Σ_{X ∈ A_3} p_U(X)

  20. Mean Cost (III)
c = Σ_{X ∈ A_1} [c_1M · π_0 · p_M(X) + c_1U · (1 − π_0) · p_U(X)]
  + Σ_{X ∈ A_2} [c_2M · π_0 · p_M(X) + c_2U · (1 − π_0) · p_U(X)]
  + Σ_{X ∈ A_3} [c_3M · π_0 · p_M(X) + c_3U · (1 − π_0) · p_U(X)]
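The mean cost formula can be sketched directly. This is an illustrative sketch with made-up numbers: decision areas are lists of comparison vectors, and the densities p_M, p_U are toy dictionaries keyed by vector.

```python
def mean_cost(areas, costs_m, costs_u, p_m, p_u, pi0):
    """areas: [A1, A2, A3], each a list of vectors X.
    costs_m[i], costs_u[i]: cost of decision A_{i+1} when the true class
    is M or U. Implements
    c = sum_i sum_{X in A_i} [c_iM pi0 p_M(X) + c_iU (1-pi0) p_U(X)]."""
    c = 0.0
    for area, cm, cu in zip(areas, costs_m, costs_u):
        for x in area:
            c += cm * pi0 * p_m[x] + cu * (1 - pi0) * p_u[x]
    return c

areas = [[(1, 1)], [(1, 2)], [(2, 2)]]           # A1, A2, A3
p_m = {(1, 1): 0.7, (1, 2): 0.2, (2, 2): 0.1}    # toy p_M
p_u = {(1, 1): 0.1, (1, 2): 0.3, (2, 2): 0.6}    # toy p_U
print(mean_cost(areas, [0, 5, 10], [10, 5, 0], p_m, p_u, pi0=0.5))  # ≈ 2.25
```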

  21. Decision Areas • Every sample X in the decision space A should be assigned to only one decision class: A_1, A_2 or A_3. • We should thus assign each sample to a class in such a way that its contribution to the mean cost is minimum. • This leads to the optimal selection for the three sets, which we denote by A_1^0, A_2^0, A_3^0.

  22. Decision Making • A sample X is assigned to the optimal areas as follows:
To A_1^0 if:
  c_1M · π_0 · p_M(X) + c_1U · (1 − π_0) · p_U(X) ≤ c_2M · π_0 · p_M(X) + c_2U · (1 − π_0) · p_U(X)
  c_1M · π_0 · p_M(X) + c_1U · (1 − π_0) · p_U(X) ≤ c_3M · π_0 · p_M(X) + c_3U · (1 − π_0) · p_U(X)
To A_2^0 if:
  c_2M · π_0 · p_M(X) + c_2U · (1 − π_0) · p_U(X) ≤ c_1M · π_0 · p_M(X) + c_1U · (1 − π_0) · p_U(X)
  c_2M · π_0 · p_M(X) + c_2U · (1 − π_0) · p_U(X) ≤ c_3M · π_0 · p_M(X) + c_3U · (1 − π_0) · p_U(X)
To A_3^0 if:
  c_3M · π_0 · p_M(X) + c_3U · (1 − π_0) · p_U(X) ≤ c_1M · π_0 · p_M(X) + c_1U · (1 − π_0) · p_U(X)
  c_3M · π_0 · p_M(X) + c_3U · (1 − π_0) · p_U(X) ≤ c_2M · π_0 · p_M(X) + c_2U · (1 − π_0) · p_U(X)
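In other words, X goes to whichever area has the smallest per-sample cost contribution. A minimal sketch with toy cost values (the densities and costs are made up; only the cost ordering follows the model):

```python
def decide(p_m_of_x, p_u_of_x, pi0, costs_m, costs_u):
    """Return 1, 2 or 3: the decision A_i minimizing
    c_iM * pi0 * p_M(X) + c_iU * (1 - pi0) * p_U(X)."""
    contributions = [
        cm * pi0 * p_m_of_x + cu * (1 - pi0) * p_u_of_x
        for cm, cu in zip(costs_m, costs_u)
    ]
    return contributions.index(min(contributions)) + 1

# Toy costs with c_1M <= c_2M <= c_3M and c_1U >= c_2U >= c_3U:
costs_m, costs_u = [0, 5, 10], [10, 5, 0]
print(decide(0.7, 0.1, 0.5, costs_m, costs_u))  # strong match evidence -> 1
print(decide(0.1, 0.7, 0.5, costs_m, costs_u))  # strong non-match evidence -> 3
```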

  23. Optimal Decision Areas • We thus conclude from the previous slide (writing the ratio as p_U(X)/p_M(X)):
A_1^0 = { X : p_U(X)/p_M(X) ≤ (π_0/(1 − π_0)) · (c_3M − c_1M)/(c_1U − c_3U)  and  p_U(X)/p_M(X) ≤ (π_0/(1 − π_0)) · (c_2M − c_1M)/(c_1U − c_2U) }
A_2^0 = { X : p_U(X)/p_M(X) ≥ (π_0/(1 − π_0)) · (c_2M − c_1M)/(c_1U − c_2U)  and  p_U(X)/p_M(X) ≤ (π_0/(1 − π_0)) · (c_3M − c_2M)/(c_2U − c_3U) }
A_3^0 = { X : p_U(X)/p_M(X) ≥ (π_0/(1 − π_0)) · (c_3M − c_1M)/(c_1U − c_3U)  and  p_U(X)/p_M(X) ≥ (π_0/(1 − π_0)) · (c_3M − c_2M)/(c_2U − c_3U) }

  24. Threshold Values • Assuming c_1M ≤ c_2M ≤ c_3M and c_1U ≥ c_2U ≥ c_3U:
κ = (π_0/(1 − π_0)) · (c_3M − c_1M)/(c_1U − c_3U)
λ = (π_0/(1 − π_0)) · (c_2M − c_1M)/(c_1U − c_2U)
μ = (π_0/(1 − π_0)) · (c_3M − c_2M)/(c_2U − c_3U)
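A sketch computing the three thresholds from the cost assumptions above; the prior and cost values are toy numbers chosen to respect the orderings c_1M ≤ c_2M ≤ c_3M and c_1U ≥ c_2U ≥ c_3U:

```python
def thresholds(pi0, c_m, c_u):
    """c_m = (c_1M, c_2M, c_3M), c_u = (c_1U, c_2U, c_3U).
    Returns (kappa, lambda, mu) as defined on the slide."""
    c1m, c2m, c3m = c_m
    c1u, c2u, c3u = c_u
    prior = pi0 / (1 - pi0)
    kappa = prior * (c3m - c1m) / (c1u - c3u)
    lam   = prior * (c2m - c1m) / (c1u - c2u)
    mu    = prior * (c3m - c2m) / (c2u - c3u)
    return kappa, lam, mu

kappa, lam, mu = thresholds(0.5, (0, 2, 10), (10, 3, 0))
print(lam <= kappa <= mu)  # the review region A_2^0 exists when lambda <= mu
```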

  25. Threshold Values • In order for A_2^0 to exist:
λ = (π_0/(1 − π_0)) · (c_2M − c_1M)/(c_1U − c_2U) ≤ (π_0/(1 − π_0)) · (c_3M − c_2M)/(c_2U − c_3U) = μ
• We can now easily prove that the threshold κ lies between λ and μ.
