

SLIDE 1

Craig Knoblock University of Southern California 1

Record Linkage

Craig Knoblock University of Southern California

These slides are based in part on slides from Sheila Tejada and Misha Bilenko

SLIDE 2

Record Linkage Problem

  • Task: identify syntactically different records that refer to the same entity
  • Common sources of variation: database merges, typographic errors, abbreviations, extraction errors, OCR scanning errors, etc.

  Restaurant Name      Address                 City          Phone         Cuisine
  Fenix                8358 Sunset Blvd. West  Hollywood     213/848-6677  American
  Fenix at the Argyle  8358 Sunset Blvd.       W. Hollywood  213-848-6677  French (new)

  • Kaelbling, L. P., 1987. An architecture for intelligent reactive systems. In M. P. Georgeff & A. L. Lansky, eds., Reasoning about Actions and Plans, Morgan Kaufmann, Los Altos, CA, 395-410
  • L. P. Kaelbling. An architecture for intelligent reactive systems. In Reasoning About Actions and Plans: Proceedings of the 1986 Workshop. Morgan Kaufmann, 1986

SLIDE 3

Outline

  • Introduction
  • Candidate Generation
  • Field Matching
  • Record Matching
  • Discussion
SLIDE 4

Integrating Restaurant Sources

Zagat’s Restaurant Guide Source Department of Health Restaurant Rating Source

ARIADNE

Information Mediator

Question: What is the Review and Rating for the Restaurant “Art’s Deli”?

SLIDE 5

Ariadne Information Mediator

Zagat’s Wrapper
Dept. of Health Wrapper

User Query

ARIADNE

Information Mediator

Zagat’s:
  Name            Street                   Phone
  Art’s Deli      12224 Ventura Boulevard  818-756-4124
  Teresa’s        80 Montague St.          718-520-2910
  Steakhouse The  128 Fremont St.          702-382-1600
  Les Celebrites  155 W. 58th St.          212-484-5113

Dept of Health:
  Name                  Street                                 Phone
  Art’s Delicatessen    12224 Ventura Blvd.                    818/755-4100
  Teresa’s              103 1st Ave. between 6th and 7th Sts.  212/228-0604
  Binion’s Coffee Shop  128 Fremont St.                        702/382-1600
  Les Celebrites        5432 Sunset Blvd                       212/484-5113

Extract web objects in the form of database records.

SLIDE 6

Application Dependent Mapping

Observations:

  • Mapping objects can be application dependent
  • Example: are “Binion’s Coffee Shop” and “Steakhouse The” at the same address the same restaurant?
  • The mapping is in the application, not the data
  • User input is needed to increase accuracy of the mapping

Mapped?
  Binion’s Coffee Shop  128 Fremont St.     702/382-1600
  Steakhouse The        128 Fremont Street  702-382-1600

SLIDE 7

General Approach to Record Linkage

1. Identification of candidate pairs
  • Comparing all possible record pairs would be computationally wasteful
2. Compute Field Similarity
  • String similarity between individual fields is computed
3. Compute Record Similarity
  • Field similarities are combined into a total record similarity estimate
4. Linkage/Merging
  • Records with similarity higher than a threshold are labeled as matches
  • Equivalence classes are found by transitive closure
SLIDE 8

Outline

  • Introduction
  • Candidate Generation
  • Field Matching
  • Record Matching
  • Discussion
SLIDE 9

Candidate Generation

  • Comparing all possible matches across two data sets would require n^2 comparisons
  • On large datasets this is impractical and wasteful
  • Instead, we compare only those records that could possibly be matched
  • Also referred to as blocking
SLIDE 10

Approach to Candidate Generation

  • Construct an inverted index of all tokens in a document
  • Links the token to the documents in which it appears
  • Place each token in a hash table
  • Apply transformations on the tokens to find closely related tokens
  • Transformations include equality, stemming, soundex, and other unary transformations
  • Use a stop list to avoid common tokens
  • Tokens such as “the”, “s”, etc. would be on the stop list
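The indexing scheme above can be sketched in a few lines. This is a minimal illustration, assuming equality matching only (stemming and soundex are omitted); the stop list and record data are hypothetical:

```python
from collections import defaultdict

STOP_LIST = {"the", "s", "of", "and"}  # common tokens to skip

def build_inverted_index(records):
    """Map each non-stop-list token to the set of record ids containing it."""
    index = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            if token not in STOP_LIST:
                index[token].add(rid)
    return index

def candidate_pairs(source_a, source_b):
    """Generate only those cross-source pairs that share at least one token."""
    index = build_inverted_index(source_b)
    pairs = set()
    for rid_a, text in source_a.items():
        for token in text.lower().split():
            for rid_b in index.get(token, ()):
                pairs.add((rid_a, rid_b))
    return pairs

# Toy records (apostrophes dropped for simplicity)
zagats = {"Z1": "Arts Deli", "Z2": "Les Celebrites"}
health = {"D1": "Arts Delicatessen", "D2": "Les Celebrites", "D3": "Binions Coffee Shop"}
print(sorted(candidate_pairs(zagats, health)))  # [('Z1', 'D1'), ('Z2', 'D2')]
```

Only two of the six possible cross-source pairs survive blocking; the quadratic all-pairs comparison is avoided.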

SLIDE 11

Example: Partial Inverted Index for LA Department of Health

SLIDE 12

Outline

  • Introduction
  • Candidate Generation
  • Field Matching
  • Record Matching
  • Discussion
SLIDE 13

Field Matching Approaches

  • Expert-system rules
  • Manually written
  • Information retrieval
  • General string similarity
  • Used in Marlin
  • Learned weights for domain-specific transformations
  • Used in Active Atlas
SLIDE 14

Information Retrieval Approach

[Cohen, 1998]

  • Idea: evaluate the similarity of records via textual similarity. Used in Whirl (Cohen 1998).
  • Follows the same approach used by classical IR algorithms (including web search engines).
  • First, “stemming” is applied to each entry. E.g. “Joe’s Diner” -> “Joe [‘s] Diner”
  • Then, entries are compared by counting the number of words in common.
  • Note: infrequent words are weighted more heavily by the TFIDF metric (Term Frequency - Inverse Document Frequency).

SLIDE 15

Token-based Metrics

  • Any string can be treated as a bag of tokens.
  • Word tokens: “8358 Sunset Blvd” ► {8358, Sunset, Blvd}
  • 4-gram tokens: “8358 Sunset Blvd” ► {‘8358’, ‘358 ‘, ’58 S’, ‘8 Su’, ‘ Sun’, ‘Suns’, ‘unse’, ‘nset’, ‘set ‘, ‘et B’, ‘t Bl’, ‘ Blv’, ‘Blvd’}
  • Each token corresponds to a dimension in Euclidean space; string similarity is the normalized dot product (cosine) in the vector space.
  • Weighting tokens by Inverse Document Frequency (IDF) is a form of unsupervised string metric learning.
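A compact sketch of this vector-space similarity, using word tokenization and an IDF-weighted cosine; the +1 IDF smoothing and the sample corpus are illustrative choices, not from the slides:

```python
import math
from collections import Counter

def tfidf_vectors(strings):
    """Build IDF-weighted token vectors for a small corpus of strings."""
    bags = [Counter(s.lower().split()) for s in strings]
    n = len(bags)
    df = Counter(tok for bag in bags for tok in bag)  # document frequency
    idf = {tok: math.log(n / df[tok]) + 1.0 for tok in df}  # +1 keeps ubiquitous tokens nonzero
    return [{tok: tf * idf[tok] for tok, tf in bag.items()} for bag in bags]

def cosine(u, v):
    """Normalized dot product of two sparse vectors (dicts token -> weight)."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

corpus = ["8358 Sunset Blvd", "8358 Sunset Blvd West", "128 Fremont St"]
vecs = tfidf_vectors(corpus)
print(cosine(vecs[0], vecs[1]))  # high: three shared tokens
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared tokens
```

Rare tokens get larger IDF weights, so matching on an unusual street name counts for more than matching on a ubiquitous token like “Blvd”.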

SLIDE 16

String Similarity Measures

  • Metrics based on sequence comparison:
  • String edit distance
  • Variants: length of longest common subsequence, Smith-Waterman distance, etc. [Gusfield ‘97]
  • Metrics based on vector-space similarity:
  • Rely on representing strings as sets of tokens
  • Variants include word tokenization, n-grams, etc. [Baeza-Yates & Ribeiro-Neto ‘98]
SLIDE 17

Sequence-based String Metrics: String Edit Distance [Levenshtein, 1966]

  • Minimum number of character deletions, insertions, or substitutions needed to make two strings equivalent.
  • “misspell” to “mispell” is distance 1 (‘delete s’)
  • “misspell” to “mistell” is distance 2 (‘delete s’, ‘substitute p with t’ OR ‘substitute s with t’, ‘delete p’)
  • “misspell” to “misspelling” is distance 3 (‘insert i’, ‘insert n’, ‘insert g’)
  • Can be computed efficiently using dynamic programming in O(mn) time, where m and n are the lengths of the two strings being compared.
  • Unit cost is typically assigned to individual edit operations, but individual costs can be used.
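The O(mn) dynamic program is short enough to show in full; a space-saving two-row version with unit edit costs:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(len(a) * len(b)) time."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from the empty prefix of a to each prefix of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (a[i - 1] != b[j - 1])  # substitute (free if chars match)
            curr[j] = min(prev[j] + 1,       # delete from a
                          curr[j - 1] + 1,   # insert into a
                          sub)
        prev = curr
    return prev[n]

print(edit_distance("misspell", "mispell"))      # 1
print(edit_distance("misspell", "mistell"))      # 2
print(edit_distance("misspell", "misspelling"))  # 3
```

The three printed values reproduce the slide’s examples.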

SLIDE 18

String Edit Distance with Affine Gaps [Gotoh, 1982]

  • Cost of gaps formed by contiguous deletions/insertions should be lower than the cost of multiple non-contiguous edit operations.
  • Distance from “misspell” to “misspelling” is <3.
  • Affine model for gap cost: cost(gap) = s + e|gap|, e < s
  • Edit distance with affine gaps is more flexible since it is less susceptible to sequences of insertions/deletions that are frequent in natural language text (e.g. ‘Street’ vs. ‘Str’).
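The affine model is easy to check numerically; the parameter values s = 1.0 and e = 0.5 below are illustrative assumptions (any e < s behaves the same way):

```python
def gap_cost(length, s=1.0, e=0.5):
    """Affine gap model from the slide: cost(gap) = s + e * |gap|, with e < s."""
    return s + e * length

# One contiguous 3-character gap (inserting "ing" to turn "misspell" into
# "misspelling") is cheaper than three unit-cost edits...
print(gap_cost(3))      # 2.5, i.e. distance < 3 as claimed

# ...and cheaper than three separate single-character gaps, since each
# non-contiguous gap pays the start cost s again:
print(3 * gap_cost(1))  # 4.5
```

Opening a gap pays the start cost s once; extending it pays only e per character, so contiguous runs of insertions or deletions are favored.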

SLIDE 19

Learnable Edit Distance with Affine Gaps

  • Motivation: the significance of edit operations depends on the particular domain
  • Substituting ‘/’ with ‘-’ is insignificant for phone numbers.
  • Deleting ‘Q’ is significant for names.
  • Gap start/extension costs vary: sequence deletion is common for addresses (‘Street’ ► ’Str’), uncommon for zip codes.
  • Using individual weights for edit operations, as well as learning gap operation costs, allows adapting to a particular field domain.
  • [Ristad & Yianilos ‘97] proposed a one-state generative model for regular edit distance.

SLIDE 20

Learnable Edit Distance with Affine Gaps - the Generative Model

  • Matching/substituted pairs of characters are generated in state M.
  • Deleted/inserted characters that form gaps are generated in states D and I.
  • A special termination state “#” ends the alignment of two strings.
  • Similar to the pairwise alignment HMMs used in bioinformatics [Durbin et al. ’98]

Example alignment of “misspell” and “mistelling”:
(m,m) (i,i) (s,s) (s,t) (p,ε) (e,e) (l,l) (l,l) (ε,i) (ε,n) (ε,g)

SLIDE 21

Learnable Edit Distance with Affine Gaps: Training

  • Given a corpus of matched string pairs, the model is trained using Expectation-Maximization.
  • The model parameters take on values that result in a high probability of producing duplicate strings.
  • Frequent edit operations and typos have high probability.
  • Rare edit operations have low probability.
  • Gap parameters take on values that are optimal for duplicate strings in the training corpus.
  • Once trained, the distance between any two strings is estimated as the posterior probability of generating the most likely alignment between the strings as a sequence of edit operations.
  • Distance computation is performed with a simple dynamic programming algorithm.

SLIDE 22

Learning Transformation Weights

  • Learn general transformations to recognize related objects

  Zagat’s                   Dept of Health           Transformation
  Art’s Deli                Art’s Delicatessen       Prefix
  California Pizza Kitchen  CPK                      Acronym
  Philippe The Original     Philippe’s The Original  Stemming

SLIDE 23

Transformation Weights

  • Transformations can be more appropriate for a specific application domain
  • Restaurants, Companies, or Airports
  • Or for different attributes within an application domain
  • Acronym is more appropriate for the attribute Restaurant Name than for the Phone attribute
  • Learn the likelihood that if a transformation is applied then the objects are mapped: Transformation Weight = P(mapped | transformation)
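Estimating these weights reduces to counting over labeled candidate pairs; a sketch in which the transformation names follow the slides but the labeled pairs are invented:

```python
from collections import Counter

def transformation_weights(labeled_pairs):
    """Estimate P(mapped | transformation) from labeled candidate pairs.

    labeled_pairs: iterable of (transformations_applied, is_mapped), where
    transformations_applied is the set of transformation names that fired.
    """
    applied = Counter()
    applied_and_mapped = Counter()
    for transforms, mapped in labeled_pairs:
        for t in transforms:
            applied[t] += 1
            if mapped:
                applied_and_mapped[t] += 1
    return {t: applied_and_mapped[t] / applied[t] for t in applied}

# Hypothetical labels: Prefix usually indicates a mapping here, Acronym always does
pairs = [
    ({"Equality", "Prefix"}, True),
    ({"Prefix"}, True),
    ({"Prefix"}, False),
    ({"Acronym"}, True),
]
weights = transformation_weights(pairs)
print(weights["Prefix"])   # 0.666... (2 of the 3 Prefix pairs were mapped)
print(weights["Acronym"])  # 1.0
```

This direct frequency estimate is exactly the conditional probability the Bayes-rule formulation on the later slide computes.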

SLIDE 24

Types of Transformations

Unary Transformations
  • Equality (exact match)
  • Stemming
  • Soundex (e.g. “Celebrites” => “C453”)
  • Abbreviation (e.g. “3rd” => “third”)

Binary Transformations
  • Initial
  • Prefix (e.g. “Deli” & “Delicatessen”)
  • Suffix
  • Substring
  • Acronym (e.g. “California Pizza Kitchen” & “CPK”)
  • Drop Word
SLIDE 25

Applying Unary Transformations

  • Employs Information Retrieval techniques
  • One set of attribute values is broken into words or tokens: “Art” “s” “Delicatessen”
  • Apply Type I (unary) transformations to the tokens: “Art” “A630” “s” “S000” “Delicatessen” “D423”
  • Enter the tokens into an inverted index
  • Tokens from the second set are used to query the index
  • Transformed query set: “Art” “A630” “s” “S000” “Deli” “Del” “D400”

Example: Zagat’s “Art’s Deli” matches Dept of Health “Art’s Delicatessen” on the tokens “Art” (Equality) and “s” (Equality).

SLIDE 26

Applying Binary Transformations

Example: with binary transformations, Zagat’s “Art’s Deli” and Dept of Health “Art’s Delicatessen” now also match on “Deli”/“Delicatessen” (Prefix), in addition to “Art” (Equality) and “s” (Equality).

  • Binary transformations improve the measurement of similarity
SLIDE 27

Calculate Transformation Weights

Example pairs: (Art’s Deli, Art’s Delicatessen), (CPK, California Pizza Kitchen), (Ca’Brea, La Brea Bakery)

Examples are classified as Mapped or Not Mapped, labeled by the learner or by the user. Weights follow from Bayes’ rule:

P(mapped | transformation) = P(transformation | mapped) · P(mapped) / P(transformation)

SLIDE 28

Computing Textual Similarity

Zagat’s restaurant objects (Z1, Z2, Z3) and Department of Health objects (D1, D2, D3), each with the fields Name, Street, and Phone.

  • The candidate generator returns sets of similarity scores (S_name, S_street, S_phone) for each candidate pair, e.g. (.9, .79, .4), (.17, .3, .74), ...

SLIDE 29

Outline

  • Introduction
  • Candidate Generation
  • Field Matching
  • Record Matching
  • Discussion
SLIDE 30

Record Matching Approaches

  • Learning Decision Trees
  • Support Vector Machines (SVM)
  • Unsupervised Learning
SLIDE 31

Learning Mapping Rules with Decision Trees

  • Learning which attributes are important for determining a mapping

                  Name                Street                   Phone
  Zagat’s:        Art’s Deli          12224 Ventura Boulevard  818-756-4124
  Dept of Health: Art’s Delicatessen  12224 Ventura Blvd.      818/755-4100

SLIDE 32

Learning Mapping Rules with Decision Trees

Mapping rules:
  Name > .9 & Street > .87 => mapped
  Name > .95 & Phone > .96 => mapped

Zagat’s Restaurants:
  Name            Street                   Phone
  Art’s Deli      12224 Ventura Boulevard  818-756-4124
  Teresa’s        80 Montague St.          718-520-2910
  Steakhouse The  128 Fremont St.          702-382-1600
  Les Celebrites  155 W. 58th St.          212-484-5113

Dept. of Health:
  Name                  Street                                 Phone
  Art’s Delicatessen    12224 Ventura Blvd.                    818/755-4100
  Teresa’s              103 1st Ave. between 6th and 7th Sts.  212/228-0604
  Binion’s Coffee Shop  128 Fremont St.                        702/382-1600
  Les Celebrites        160 Central Park S                     212/484-5113
SLIDE 33

Learning Mapping Rules

Set of similarity scores → mapping rules

  Name  Street  Phone
  .967  .973    .3
  .17   .3      .74
  .8    .542    .49
  .95   .97     .67
  ...

Mapping rules:
  Name > .8 & Street > .79 => mapped
  Name > .89 => mapped
  Street < .57 => not mapped
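Applying learned rules of this form is straightforward; a sketch using the slide's thresholds, encoded in an assumed (field, op, threshold) rule format:

```python
def classify(scores, rules):
    """Apply learned mapping rules to a record pair's field-similarity scores.

    Each rule is (conditions, label), where conditions is a list of
    (field, op, threshold) triples; the first rule whose conditions all
    hold determines the label, otherwise the pair stays ambiguous.
    """
    ops = {">": lambda a, b: a > b, "<": lambda a, b: a < b}
    for conditions, label in rules:
        if all(ops[op](scores[field], thr) for field, op, thr in conditions):
            return label
    return "ambiguous"

# The three rules from the slide
rules = [
    ([("Name", ">", 0.8), ("Street", ">", 0.79)], "mapped"),
    ([("Name", ">", 0.89)], "mapped"),
    ([("Street", "<", 0.57)], "not mapped"),
]
print(classify({"Name": 0.967, "Street": 0.973, "Phone": 0.3}, rules))  # mapped
print(classify({"Name": 0.17, "Street": 0.3, "Phone": 0.74}, rules))    # not mapped
```

A decision-tree learner would induce the thresholds; here they are simply taken from the slide and replayed.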

SLIDE 34

Mapping Rule Learner with Active Learning

Active-learning loop (diagram summary):
1. Choose initial examples from the set of mapped objects
2. Generate a committee of learners
3. Each learner learns rules and classifies the examples
4. The committee members vote on each candidate pair
5. The most informative example is chosen and labeled by the user
6. Repeat with the newly labeled example

SLIDE 35

Committee Disagreement

  • Chooses an example based on the disagreement of the query committee
  • In this case (CPK, California Pizza Kitchen) is the most informative example based on disagreement

  Examples                         M1   M2   M3
  Art’s Deli, Art’s Delicatessen   Yes  Yes  Yes
  CPK, California Pizza Kitchen    Yes  No   Yes
  Ca’Brea, La Brea Bakery          No   No   No

SLIDE 36

Learnable Vector-space Similarity

x: “3130 Piedmont Road”    y: “3130 Piedmont Rd. NE”

Token dimensions: 3130, ne, piedmont, rd, road

  • Each string is converted to a vector-space representation
  • The pair vector p(x, y) is created from the componentwise products (x1y1, ..., x5y5)
  • The pair vector is classified as “similar” (S) or “dissimilar” (D)
  • Similarity between strings is obtained from the SVM output: Sim(x, y) ∝ f(p(x, y))

SLIDE 37

Combining String Similarity Across Fields

  • Some fields are more indicative of record similarity than others:
  • For addresses, street address similarity is more important than city similarity.
  • For bibliographic citations, author or title similarity is more important than venue (i.e. conference or journal name) similarity.
  • Field similarities should be weighted when combined to determine record similarity.
  • Weights can be learned using a learning algorithm [Cohen & Richman ‘02], [Sarawagi & Bhamidipaty ‘02], [Tejada et al. ‘02].
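The weighted combination can be sketched directly; the weights below are hand-set for illustration (in the systems cited they would be learned from labeled pairs):

```python
def record_similarity(field_sims, weights):
    """Weighted combination of per-field similarities into one record score."""
    total = sum(weights.values())
    return sum(weights[f] * field_sims[f] for f in weights) / total

# Illustrative weights: street matters more than city or phone for addresses
weights = {"street": 0.6, "city": 0.2, "phone": 0.2}
sims = {"street": 0.95, "city": 0.40, "phone": 1.0}
print(record_similarity(sims, weights))  # close to the strong street score
```

Because the street field dominates the weights, a strong street match keeps the record score high even when the city similarity is weak.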

SLIDE 38

Learned Record Similarity

  • String similarities for each field are used as components of a feature vector for a pair of records.
  • An SVM is trained on labeled feature vectors to discriminate duplicate from non-duplicate pairs.
  • Record similarity is based on the distance of the feature vector from the separating hyperplane.

SLIDE 39

Learning Record Similarity (cont.)

SLIDE 40

Unsupervised Record Linkage

  • Idea: analyze the data and automatically cluster pairs into three groups:
  • Let R = P(obs | Same) / P(obs | Different)
  • Matched if R > threshold TU
  • Unmatched if R < threshold TL
  • Ambiguous if TL < R < TU
  • This model for computing decision rules was introduced by Fellegi & Sunter in 1969
  • Particularly useful for statistically linking large sets of data, e.g., by the US Census Bureau
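The decision rule itself is just a three-way threshold on the likelihood ratio; a sketch with illustrative probabilities and thresholds (the numbers are invented, not estimated from data):

```python
def fellegi_sunter(p_obs_same, p_obs_diff, t_lower, t_upper):
    """Classify a pair by the likelihood ratio R = P(obs|Same) / P(obs|Different)."""
    r = p_obs_same / p_obs_diff
    if r > t_upper:
        return "matched"
    if r < t_lower:
        return "unmatched"
    return "ambiguous"

# Illustrative probabilities and thresholds
print(fellegi_sunter(0.9, 0.01, t_lower=2.0, t_upper=10.0))  # matched   (R = 90)
print(fellegi_sunter(0.05, 0.2, t_lower=2.0, t_upper=10.0))  # unmatched (R = 0.25)
print(fellegi_sunter(0.3, 0.06, t_lower=2.0, t_upper=10.0))  # ambiguous (R = 5)
```

Pairs in the ambiguous band between TL and TU are the ones routed to clerical review; the two component probabilities are what EM estimates on the next slide.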

SLIDE 41

Unsupervised Record Linkage (cont.)

  • Winkler (1998) used the EM algorithm to estimate P(obs | Same) and P(obs | Different)
  • EM computes the maximum likelihood estimate: the algorithm iteratively determines the parameters most likely to generate the observed data.
  • Additional mathematical techniques must be used to adjust for relative frequencies, e.g., the last name “Smith” is much more frequent than “Knoblock”.

SLIDE 42

Outline

  • Introduction
  • Candidate Generation
  • Field Matching
  • Record Matching
  • Discussion
SLIDE 43

Enforcing One-to-One Relationship

Given weights W, the matching method determines the most likely matching assignment.

Zagat’s (Name, Street, City):
  Z1: (Art’s Deli, 1745 Ventura Boulevard, Encino)
  Z2: (Citrus, 267 Citrus Ave., LA)
  Z3: (Spago, 456 Sunset Bl., LA)
  ...
  (not in source)

Dept of Health (Name, Street, City):
  D1: (Art’s Delicatessen, 1745 Ventura Blvd, Encino)
  D2: (Ca’ Brea, 6743 La Brea Ave., LA)
  D3: (Patina, 342 Melrose Ave., LA)
  ...
  (not in source)

  • Viewed as a weighted bipartite matching problem
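For small candidate sets, the best one-to-one assignment can be found by exhaustive search over permutations (a real system would use the Hungarian algorithm; the score matrix below is invented for illustration):

```python
from itertools import permutations

def best_assignment(weights):
    """Maximum-weight one-to-one assignment by exhaustive search.

    weights[i][j] is the match score between record i of source A and
    record j of source B; brute force is feasible only for small n.
    """
    n = len(weights)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(weights[i][perm[i]] for i in range(n))
        if total > best:
            best, best_perm = total, perm
    return best_perm, best

# Illustrative pairwise scores between (Z1, Z2, Z3) and (D1, D2, D3)
W = [
    [0.9, 0.1, 0.2],  # Z1 scores highest against D1
    [0.2, 0.3, 0.8],  # Z2 scores highest against D3
    [0.1, 0.7, 0.4],  # Z3 scores highest against D2
]
print(best_assignment(W))  # assignment (0, 2, 1): Z1-D1, Z2-D3, Z3-D2
```

Enforcing the one-to-one constraint jointly can override greedy per-record choices: a record is denied its best partner if that partner is more valuable to another record.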

SLIDE 44

Related Work

  • Record linkage [Newcombe et al. ’59; Fellegi & Sunter ’69; Winkler ’94, ’99, ‘02]

  • Database hardening [Cohen et al. ’00]
  • Merge/purge [Hernandez & Stolfo ’95]
  • Field matching [Monge & Elkan ’96]
  • Data cleansing [Lee et al. ’99]
  • Name matching [Cohen & Richman ’01, Cohen et al. ’03]
  • De-duplication [Sarawagi & Bhamidipaty ’02]
  • Object identification [Tejada et al. ’01, ’02]
  • Fuzzy duplicate elimination [Ananthakrishna et al. ’02]
  • Identity uncertainty [Pasula et al. ’02, McCallum & Wellner ‘03]
  • Object consolidation [Michalowski et al. ’03]
SLIDE 45

Conclusions

  • Technical choices in record linkage:
  • Approach to candidate generation
  • Approach to field matching
  • Approach to record matching
  • Learning approaches have the advantage of being able to:
  • Adapt to specific application domains
  • Learn which fields are important
  • Learn the most appropriate transformations
  • Optimal classifier choice is sensitive to the domain and the amount of available training data.