

SLIDE 1

A Decision Model for Cost Optimal Record Matching

Presenter: Vassilios S. Verykios, IST College / Drexel University. Affiliates Workshop on Data Quality, NISS/Telcordia, December 1st, 2000.

SLIDE 2

Comparison Vector

  • Given a pair of database records with partially overlapping schemata, decide whether it is a match or not.
  • Compare the pairs of values stored in each common attribute/field (assume n common fields).
  • The n comparison measurements form a comparison vector X.
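As a minimal sketch of this construction in Python: the records, field names, and the binary agree/disagree encoding below are illustrative assumptions (the deck also allows a "missing" outcome, which is omitted here).

```python
def comparison_vector(rec_a, rec_b, common_fields):
    """Build the comparison vector X: one measurement per common field.

    Each measurement here is a simple agree (1) / disagree (0) flag;
    richer comparison functions (string distances, a 'missing' code)
    fit the same shape.
    """
    return [int(rec_a[f] == rec_b[f]) for f in common_fields]


# Hypothetical records sharing three common fields.
x = comparison_vector(
    {"last": "Smith", "first": "Jon", "sex": "M"},
    {"last": "Smith", "first": "John", "sex": "M"},
    ["last", "first", "sex"],
)
print(x)  # [1, 0, 1]
```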

SLIDE 3

Record Comparison

Record 1 fields:    A B C D E F
Record 2 fields:    A B C D   F
Comparison vector:  (1, 1, 3, 2, 2, 1)

Outcome codes: 1. Agreement  2. Disagreement  3. Missing

SLIDE 4

Random Vector

  • Even if a pair of records match, the observed value for each field comparison is different each time the observation is made.
  • Therefore, each field comparison variable is a random variable.
  • Likewise, the comparison vector X is a random vector.

SLIDE 5

Distribution of Vectors

  • Each pair of records is expressed by a comparison vector (or a sample) in an n-dimensional space.
  • Many comparison vectors form a distribution f_X in the n-dimensional space.
  • Figure 1 shows a simple two-dimensional example of two distributions corresponding to matched and unmatched pairs of records.

SLIDE 6

Figure 1

[Figure: scatter of comparison vectors in the (x1, x2) plane; matched pairs are marked "o" and unmatched pairs "x", forming two separate clusters.]

Distributions of samples from matched and unmatched record pairs.

SLIDE 7

Classifiers

  • If we know these two distributions of X from past experience, we can set up a boundary between them, g(x1, x2) = 0, which divides the two-dimensional space into two regions.
  • Once the boundary is selected, we can classify an unlabeled sample as matched or unmatched, depending on the sign of g(x1, x2).
  • We call g(x1, x2) a discriminant function, and a system that detects the sign of g(x1, x2) a classifier.
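As a minimal sketch, a linear discriminant and its sign-based classifier might look like the following; the weights are arbitrary illustrations, not values from the slides (in practice they are fit to the two distributions).

```python
def g(x1, x2, w1=1.0, w2=1.0, b=-1.0):
    """A linear discriminant g(x1, x2) = w1*x1 + w2*x2 + b.

    The weights here are illustrative placeholders; learning them from
    training samples is discussed later in the deck.
    """
    return w1 * x1 + w2 * x2 + b


def classify(x1, x2):
    # The classifier only inspects the sign of g(x1, x2).
    return "matched" if g(x1, x2) > 0 else "unmatched"


print(classify(1.0, 1.0))  # matched (g = 1 > 0)
print(classify(0.0, 0.0))  # unmatched (g = -1 < 0)
```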

SLIDE 8

Figure 2

[Figure: the same (x1, x2) scatter as Figure 1, with the decision boundary g(x1, x2) = 0 drawn between the matched ("o") and unmatched ("x") clusters.]

Distributions of samples from matched and unmatched record pairs.

SLIDE 9

Learning

  • In order to design a classifier, we must study the characteristics of the distribution of X for each category and find a proper discriminant function.
  • This process is called learning.
  • Samples used to design a classifier are called learning or training samples.

SLIDE 10

Statistical Hypothesis Testing

  • What is the best classifier, assuming that the distributions of the random vectors are given?
  • The Bayes classifier minimizes the probability of classification error.

SLIDE 11

Distribution and Density Functions

  • Random vector X
  • Distribution function P(X)
  • Density function p(X)
  • Class-i density, or conditional density of class i: p(X|c_i) or p_i(X)
  • Unconditional density function, or mixture density function: p(X) = Σ_{i=1..L} P_i · p_i(X)
  • A posteriori probability of class i given X: P(c_i|X) or q_i(X)
  • Bayes rule: q_i(X) = P_i · p_i(X) / p(X)
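These definitions translate directly into code. The sketch below uses two made-up constant class-conditional densities purely for illustration; only the mixture-density and Bayes-rule formulas come from the slide.

```python
def posteriors(x, priors, densities):
    """Bayes rule: q_i(X) = P_i * p_i(X) / p(X), where the mixture
    density is p(X) = sum_i P_i * p_i(X)."""
    mixture = sum(P * p(x) for P, p in zip(priors, densities))
    return [P * p(x) / mixture for P, p in zip(priors, densities)]


# Two classes with illustrative constant conditional densities at a point x.
p_M = lambda x: 0.9  # stand-in for p(X | c_M)
p_U = lambda x: 0.1  # stand-in for p(X | c_U)
q = posteriors(None, [0.5, 0.5], [p_M, p_U])
print(q)  # [0.9, 0.1] -- the posteriors sum to 1
```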

SLIDE 12

Bayes Rule for Minimum Error

  • Let X be a comparison vector.
  • Determine whether X belongs to M (matched) or U (unmatched).
  • If the a posteriori probability of M given X is larger than that of U, X is classified to M, and vice versa: decide M if q_M(X) > q_U(X).
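Because the mixture density p(X) is common to both posteriors, the rule reduces to comparing π·p_M(X) against (1−π)·p_U(X). A sketch, with illustrative densities:

```python
def bayes_decide(x, pi, p_M, p_U):
    """Decide M when P(M|X) > P(U|X); since p(X) cancels, it suffices
    to compare pi * p_M(X) with (1 - pi) * p_U(X)."""
    return "M" if pi * p_M(x) > (1 - pi) * p_U(x) else "U"


# Illustrative densities (not from the slides): x is an arbitrary point.
decision = bayes_decide(3, 0.5, lambda x: 0.8, lambda x: 0.05)
print(decision)  # M
```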

SLIDE 13

Fellegi-Sunter Model

  • Order the X's based on their likelihood ratio:

l(X) = p_M(X) / p_U(X)

  • For a pair of error levels (µ, λ), choose index values n and n′ such that:

Σ_{i=1..n−1} p_U(X_i) ≤ µ < Σ_{i=1..n} p_U(X_i)

Σ_{i=n′+1..N} p_M(X_i) ≤ λ < Σ_{i=n′..N} p_M(X_i)
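The index selection can be sketched as follows. This is a reconstruction under the assumption that configurations are sorted by decreasing likelihood ratio and the two cutoffs keep the cumulative error masses within µ and λ; the tie and boundary handling in the original Fellegi-Sunter procedure is more careful than this.

```python
def choose_cutoffs(configs, p_M, p_U, mu, lam):
    """Pick n (match cutoff) and n' (non-match cutoff) for error levels
    (mu, lam) after sorting configurations by l(X) = p_M(X) / p_U(X)."""
    order = sorted(configs, key=lambda X: p_M(X) / p_U(X), reverse=True)
    # n: longest prefix (declared matches) whose U-mass stays within mu.
    cum_u, n = 0.0, 0
    for X in order:
        if cum_u + p_U(X) > mu:
            break
        cum_u += p_U(X)
        n += 1
    # n': longest suffix (declared non-matches) whose M-mass stays within lam.
    cum_m, n_prime = 0.0, len(order)
    for X in reversed(order):
        if cum_m + p_M(X) > lam:
            break
        cum_m += p_M(X)
        n_prime -= 1
    return order, n, n_prime
```

Configurations between positions n and n′ fall into the middle (clerical-review) region.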

SLIDE 14

Minimum Cost Model

  • Minimizing the probability of error is not the best criterion for designing a decision rule, because the misclassifications of M and U samples may have different consequences.
  • The misclassification of a cancer patient as normal may have a more damaging effect than the misclassification of a normal patient as having cancer.
  • Therefore, it is appropriate to assign a cost to each situation.

SLIDE 15

Decision Costs

Class   Decision   Cost
M       A1         c_1M
U       A1         c_1U
M       A2         c_2M
U       A2         c_2U
M       A3         c_3M
U       A3         c_3U

SLIDE 16

Mean Cost (I)

c̄ = c_1M · P(d = A1, c = M) + c_1U · P(d = A1, c = U)
  + c_2M · P(d = A2, c = M) + c_2U · P(d = A2, c = U)
  + c_3M · P(d = A3, c = M) + c_3U · P(d = A3, c = U)

SLIDE 17

Bayes Theorem

P(d = A_j, c = i) = P(d = A_j | c = i) · P(c = i), where j = 1, 2, 3 and i = M, U

SLIDE 18

Conditional Probability

P(d = A_j | c = i) = Σ_{X ∈ A_j} p_i(X), where j = 1, 2, 3 and i = M, U

P(c = M) = π and P(c = U) = 1 − π

SLIDE 19

Mean Cost (II)

Using the Bayes theorem:

c̄ = c_1M · P(d = A1 | c = M) · P(c = M) + c_1U · P(d = A1 | c = U) · P(c = U)
  + c_2M · P(d = A2 | c = M) · P(c = M) + c_2U · P(d = A2 | c = U) · P(c = U)
  + c_3M · P(d = A3 | c = M) · P(c = M) + c_3U · P(d = A3 | c = U) · P(c = U)

Using the definition of the conditional probability:

c̄ = Σ_{X ∈ A1} [ c_1M · π · p_M(X) + c_1U · (1 − π) · p_U(X) ]
  + Σ_{X ∈ A2} [ c_2M · π · p_M(X) + c_2U · (1 − π) · p_U(X) ]
  + Σ_{X ∈ A3} [ c_3M · π · p_M(X) + c_3U · (1 − π) · p_U(X) ]

SLIDE 20

Mean Cost (III)

c̄ = Σ_{X ∈ A1} [ c_1M · π · p_M(X) + c_1U · (1 − π) · p_U(X) ]
  + Σ_{X ∈ A2} [ c_2M · π · p_M(X) + c_2U · (1 − π) · p_U(X) ]
  + Σ_{X ∈ A3} [ c_3M · π · p_M(X) + c_3U · (1 − π) · p_U(X) ]

SLIDE 21

Decision Areas

  • Every sample X in the decision space A should be assigned to exactly one decision class: A1, A2, or A3.
  • We should thus assign each sample to a class in such a way that its contribution to the mean cost is minimum.
  • This leads to the optimal selection for the three sets, which we denote by A1^0, A2^0, A3^0.

SLIDE 22

Decision Making

  • A sample is assigned to the optimal areas as follows:

To A1 if: c_1M·π·p_M(X) + c_1U·(1−π)·p_U(X) ≤ c_2M·π·p_M(X) + c_2U·(1−π)·p_U(X)
     and: c_1M·π·p_M(X) + c_1U·(1−π)·p_U(X) ≤ c_3M·π·p_M(X) + c_3U·(1−π)·p_U(X)

To A2 if: c_2M·π·p_M(X) + c_2U·(1−π)·p_U(X) ≤ c_1M·π·p_M(X) + c_1U·(1−π)·p_U(X)
     and: c_2M·π·p_M(X) + c_2U·(1−π)·p_U(X) ≤ c_3M·π·p_M(X) + c_3U·(1−π)·p_U(X)

To A3 if: c_3M·π·p_M(X) + c_3U·(1−π)·p_U(X) ≤ c_1M·π·p_M(X) + c_1U·(1−π)·p_U(X)
     and: c_3M·π·p_M(X) + c_3U·(1−π)·p_U(X) ≤ c_2M·π·p_M(X) + c_2U·(1−π)·p_U(X)

SLIDE 23

Optimal Decision Areas

  • We thus conclude from the previous slide:

A1^0: X such that p_U(X)/p_M(X) ≤ (π/(1−π)) · (c_2M − c_1M)/(c_1U − c_2U)
             and  p_U(X)/p_M(X) ≤ (π/(1−π)) · (c_3M − c_1M)/(c_1U − c_3U)

A2^0: X such that p_U(X)/p_M(X) ≥ (π/(1−π)) · (c_2M − c_1M)/(c_1U − c_2U)
             and  p_U(X)/p_M(X) ≤ (π/(1−π)) · (c_3M − c_2M)/(c_2U − c_3U)

A3^0: X such that p_U(X)/p_M(X) ≥ (π/(1−π)) · (c_3M − c_1M)/(c_1U − c_3U)
             and  p_U(X)/p_M(X) ≥ (π/(1−π)) · (c_3M − c_2M)/(c_2U − c_3U)

SLIDE 24

Threshold Values

λ = (π/(1−π)) · (c_2M − c_1M)/(c_1U − c_2U)

κ = (π/(1−π)) · (c_3M − c_1M)/(c_1U − c_3U)

µ = (π/(1−π)) · (c_3M − c_2M)/(c_2U − c_3U)

assuming c_1M ≤ c_2M ≤ c_3M and c_1U ≥ c_2U ≥ c_3U
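Assuming the threshold definitions λ = (π/(1−π))(c_2M − c_1M)/(c_1U − c_2U), κ = (π/(1−π))(c_3M − c_1M)/(c_1U − c_3U), and µ = (π/(1−π))(c_3M − c_2M)/(c_2U − c_3U), the three-way decision can be sketched as follows; the costs in the usage example are the ones used in the worked example later in the deck.

```python
def thresholds(pi, c1M, c2M, c3M, c1U, c2U, c3U):
    """Compute (lambda, kappa, mu) from the prior pi and the six costs."""
    r = pi / (1.0 - pi)
    lam = r * (c2M - c1M) / (c1U - c2U)
    kap = r * (c3M - c1M) / (c1U - c3U)
    mu = r * (c3M - c2M) / (c2U - c3U)
    return lam, kap, mu


def decide(ratio, lam, mu):
    """Classify by the ratio p_U(X)/p_M(X): A1 (match) at or below
    lambda, A3 (non-match) at or above mu, A2 (review) in between."""
    if ratio <= lam:
        return "A1"
    if ratio >= mu:
        return "A3"
    return "A2"


lam, kap, mu = thresholds(0.5, 0.0, 0.2, 1.0, 1.0, 0.2, 0.0)
print(round(lam, 4), round(kap, 4), round(mu, 4))  # 0.25 1.0 4.0
```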

SLIDE 25

Threshold Values

  • In order for A2^0 to exist:

λ = (π/(1−π)) · (c_2M − c_1M)/(c_1U − c_2U) ≤ (π/(1−π)) · (c_3M − c_2M)/(c_2U − c_3U) = µ

  • We can now easily prove that the threshold κ lies between λ and µ.

SLIDE 26

Threshold Values

  • Adding by parts the relationships above, we can easily show that λ ≤ κ.
  • Similarly, we can prove that κ ≤ µ.

λ · (1 − π) · (c_1U − c_2U) = π · (c_2M − c_1M)

λ · (1 − π) · (c_2U − c_3U) ≤ µ · (1 − π) · (c_2U − c_3U) = π · (c_3M − c_2M)

Adding these by parts gives λ · (1 − π) · (c_1U − c_3U) ≤ π · (c_3M − c_1M), i.e., λ ≤ κ.

SLIDE 27

Optimality of the Model

Define z_j(X) = c_jM · π · p_M(X) + c_jU · (1 − π) · p_U(X), the contribution of X to the mean cost when X is assigned to A_j. Then:

c̄ = Σ_{X ∈ A1} z_1(X) + Σ_{X ∈ A2} z_2(X) + Σ_{X ∈ A3} z_3(X)
  = Σ_{X ∈ A} [ I_{A1}(X) · z_1(X) + I_{A2}(X) · z_2(X) + I_{A3}(X) · z_3(X) ]
  ≥ Σ_{X ∈ A} min{ z_1(X), z_2(X), z_3(X) }
  = Σ_{X ∈ A1^0} z_1(X) + Σ_{X ∈ A2^0} z_2(X) + Σ_{X ∈ A3^0} z_3(X)

SLIDE 28

Probabilities of Errors

  • Type I error (a matched pair classified as unmatched):

P(d = A3, c = M) = P(d = A3 | c = M) · P(c = M) = π · Σ_{X ∈ A3} p_M(X)

  • Type II error (an unmatched pair classified as matched):

P(d = A1, c = U) = P(d = A1 | c = U) · P(c = U) = (1 − π) · Σ_{X ∈ A1} p_U(X)
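Given the decision areas, both error masses are simple sums. A sketch with a hypothetical two-configuration space and made-up densities:

```python
def error_probabilities(pi, areas, p_M, p_U):
    """Type I: matched pairs landing in A3 (missed matches);
    Type II: unmatched pairs landing in A1 (false matches)."""
    type1 = pi * sum(p_M(X) for X in areas["A3"])
    type2 = (1 - pi) * sum(p_U(X) for X in areas["A1"])
    return type1, type2


# Toy space: configuration 0 falls in A1, configuration 1 in A3.
areas = {"A1": [0], "A2": [], "A3": [1]}
t1, t2 = error_probabilities(0.5, areas, {0: 0.2, 1: 0.8}.get, {0: 0.7, 1: 0.3}.get)
print(t1, t2)  # 0.4 0.35
```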

SLIDE 29

Conditionally Independent Binary Components

X = [x_1 x_2 … x_n]

p_j(X) = p_j(x_1) · p_j(x_2) ⋯ p_j(x_n), where j = M, U

p_M(x_i = 1) = p_i,  p_M(x_i = 0) = 1 − p_i
p_U(x_i = 1) = q_i,  p_U(x_i = 0) = 1 − q_i
SLIDE 30

Conditionally Independent Binary Components

log ( p_U(x_1, x_2, …, x_n) / p_M(x_1, x_2, …, x_n) )
  = log ( [ p_U(x_1) · p_U(x_2) ⋯ p_U(x_n) ] / [ p_M(x_1) · p_M(x_2) ⋯ p_M(x_n) ] )
  = Σ_{i=1..n} log ( p_U(x_i) / p_M(x_i) )

SLIDE 31

Conditionally Independent Binary Components

  • Note that, since x_i can only assume the values 0 or 1:

log ( p_U / p_M ) = Σ_{i=1..n} [ x_i · log( q_i / p_i ) + (1 − x_i) · log( (1 − q_i) / (1 − p_i) ) ]
                  = Σ_{i=1..n} [ x_i · log( q_i · (1 − p_i) / ( p_i · (1 − q_i) ) ) + log( (1 − q_i) / (1 − p_i) ) ]
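This weight computation is easy to express directly. Base-10 logarithms are used here because they reproduce the numbers in the worked example later in the deck; the function name and argument layout are our own.

```python
import math


def log10_weight(x, p, q):
    """log10(p_U(X) / p_M(X)) for conditionally independent binary fields.

    x: sequence of 0/1 agreement indicators;
    p[i] = p_M(x_i = 1); q[i] = p_U(x_i = 1).
    """
    return sum(
        xi * math.log10(qi / pi) + (1 - xi) * math.log10((1 - qi) / (1 - pi))
        for xi, pi, qi in zip(x, p, q)
    )
```

With the three-attribute probabilities from the example, `log10_weight((1, 1, 1), [0.90, 0.85, 0.95], [0.05, 0.10, 0.45])` evaluates to about −2.51, matching the decisions table.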

SLIDE 32

Example

  • Records are being compared.
  • Three attributes: last name, first name, and sex.
  • Two possible outcomes per attribute: agree and disagree.
  • There are eight possible 3-component comparison vectors.

SLIDE 33

Probabilities of Agreement and Disagreement

Attribute     Under M            Under U
              p_i     1−p_i      q_i     1−q_i
Last Name     0.90    0.10       0.05    0.95
First Name    0.85    0.15       0.10    0.90
Sex           0.95    0.05       0.45    0.55

SLIDE 34

Comparisons and Costs

X = (x_1, x_2, x_3), where x_i = 1 if the attribute values agree and x_i = 0 otherwise

π = 0.5 (so 1 − π = 0.5)

c_1M = c_3U = 0,  c_2M = c_2U = 0.2,  c_3M = c_1U = 1

SLIDE 35

Decisions Made

i   X          log10(p_U/p_M)   Decision
1   (0,0,0)     2.795           A3
2   (0,0,1)     1.429           A3
3   (0,1,0)     1.088           A3
4   (0,1,1)    −0.272           A2
5   (1,0,0)     0.562           A2
6   (1,0,1)    −0.804           A1
7   (1,1,0)    −1.145           A1
8   (1,1,1)    −2.511           A1
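This table can be reproduced by combining the agreement probabilities with the cost-derived thresholds (λ = 0.25 and µ = 4 from the costs above, so log10 λ ≈ −0.602 and log10 µ ≈ 0.602):

```python
import itertools
import math

p = [0.90, 0.85, 0.95]  # P(agree | M): last name, first name, sex
q = [0.05, 0.10, 0.45]  # P(agree | U)
log_lam, log_mu = math.log10(0.25), math.log10(4.0)


def weight(x):
    # log10(p_U(X) / p_M(X)) under conditional independence.
    return sum(
        xi * math.log10(qi / pi) + (1 - xi) * math.log10((1 - qi) / (1 - pi))
        for xi, pi, qi in zip(x, p, q)
    )


for x in itertools.product([0, 1], repeat=3):
    w = weight(x)
    area = "A1" if w <= log_lam else ("A3" if w >= log_mu else "A2")
    print(x, round(w, 3), area)
```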

slide-36
SLIDE 36

Experiments

Attribute    Under M            Under U
             p_i     1−p_i      q_i     1−q_i
SSN          1.00    0.00       0.35    0.65
FNAME        0.96    0.04       0.29    0.71
MINIT        0.95    0.05       0.05    0.95
LNAME        0.97    0.03       0.30    0.70
STREET#      1.00    0.00       0.00    1.00
SADDRESS     0.77    0.23       0.01    0.99
APRT#        1.00    0.00       0.00    1.00
CITY         0.89    0.11       0.06    0.94
STATE        1.00    0.00       0.00    1.00
ZIPCODE      0.97    0.03       0.43    0.75

slide-37
SLIDE 37

Percent of Error vs. Percent of Records in A2

GID   c_2M     c_2U     λ          µ          %Error   % of Recs in A2
A     0.50     0.50     −0.2126    −0.2126    1.0013   0.0000
A     0.40     0.60     −0.2126    −0.2126    1.0013   0.0000
B     0.50     0.25     −0.3887     0.0884    1.0013   0.0000
B     0.50     0.05     −0.4914     0.7874    1.0013   0.0062
B     0.50     0.005    −0.5115     1.7884    0.3650   1.1692
B     0.50     0.0005   −0.5134     2.7874    0.3602   1.5797
C     0.25     0.25     −0.6897     0.2645    0.9890   0.0186
C     0.1      0.1      −1.1668     0.7416    0.9890   0.0186
C     0.05     0.05     −1.4914     1.0661    0.9836   0.0995
C     0.005    0.005    −2.5115     2.0862    0.3471   1.4553
C     0.0005   0.0005   −3.5134     3.0882    0.2028   1.8720

(λ and µ are given on a log10 scale.)

slide-38
SLIDE 38

Concluding Remarks

  • Efficiency
  • Time optimal models
  • Prototype implementation