

SLIDE 1

Craig Knoblock University of Southern California 1

Record Linkage

Craig Knoblock University of Southern California

These slides are based in part on slides from Sheila Tejada and Misha Bilenko

SLIDE 2

Record Linkage Problem

  • Task: identify syntactically different records that refer to the same entity
  • Common sources of variation: database merges, typographic errors, abbreviations, extraction errors, OCR scanning errors, etc.

  Restaurant Name      Address                 City          Phone         Cuisine
  Fenix                8358 Sunset Blvd. West  Hollywood     213/848-6677  American
  Fenix at the Argyle  8358 Sunset Blvd.       W. Hollywood  213-848-6677  French (new)

  • Kaelbling, L. P., 1987. An architecture for intelligent reactive systems. In M. P. Georgeff & A. L. Lansky, eds., Reasoning about Actions and Plans, Morgan Kaufmann, Los Altos, CA, 395-410
  • L. P. Kaelbling. An architecture for intelligent reactive systems. In Reasoning About Actions and Plans: Proceedings of the 1986 Workshop. Morgan Kaufmann, 1986

SLIDE 3

Outline

  • Introduction
  • Candidate Generation
  • Field Matching
  • Record Matching
  • Discussion
SLIDE 4

Integrating Restaurant Sources

Zagat’s Restaurant Guide Source Department of Health Restaurant Rating Source

ARIADNE

Information Mediator

Question: What is the Review and Rating for the Restaurant “Art’s Deli”?

SLIDE 5

Ariadne Information Mediator

Zagat’s Wrapper
Dept. of Health Wrapper

User Query

ARIADNE

Information Mediator

Zagat’s:
  Name            Street                   Phone
  Art’s Deli      12224 Ventura Boulevard  818-756-4124
  Teresa’s        80 Montague St.          718-520-2910
  Steakhouse The  128 Fremont St.          702-382-1600
  Les Celebrites  155 W. 58th St.          212-484-5113

Dept of Health:
  Name                  Street                                 Phone
  Art’s Delicatessen    12224 Ventura Blvd.                    818/755-4100
  Teresa’s              103 1st Ave. between 6th and 7th Sts.  212/228-0604
  Binion’s Coffee Shop  128 Fremont St.                        702/382-1600
  Les Celebrites        5432 Sunset Blvd                       212/484-5113

Extract web objects in the form of database records.

SLIDE 6

Application Dependent Mapping

Observations:

  • Mapping objects can be application dependent
  • Example: are “Binion’s Coffee Shop” and “Steakhouse The” at the same address the same restaurant?
  • The mapping is in the application, not the data
  • User input is needed to increase accuracy of the mapping

Mapped?
  Binion’s Coffee Shop  128 Fremont St.     702/382-1600
  Steakhouse The        128 Fremont Street  702-382-1600

SLIDE 7

General Approach to Record Linkage

1. Identification of candidate pairs
  • Comparing all possible record pairs would be computationally wasteful
2. Compute Field Similarity
  • String similarity between individual fields is computed
3. Compute Record Similarity
  • Field similarities are combined into a total record similarity estimate
4. Linkage/Merging
  • Records with similarity higher than a threshold are labeled as matches
  • Equivalence classes are found by transitive closure
SLIDE 8

Outline

  • Introduction
  • Candidate Generation
  • Field Matching
  • Record Matching
  • Discussion
SLIDE 9

Candidate Generation

  • Comparing all possible matches across two data sets would require n^2 comparisons
  • On large datasets this is impractical and wasteful
  • Instead, we compare only those records that could possibly be matched
  • Also referred to as blocking
SLIDE 10

Approach to Candidate Generation

  • Construct an inverted index of all tokens in a document
  • Links the token to the documents in which it appears
  • Place each token in a hash table
  • Apply transformations on the tokens to find closely related tokens
  • Transformations include equality, stemming, soundex, and other unary transformations
  • Use a stop list to avoid common tokens
  • Tokens such as “the”, “s”, etc. would be on the stop list
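The indexing scheme above can be sketched in a few lines. This is a minimal illustration, assuming equality matching only (stemming and soundex are omitted); the stop list and record data are hypothetical:

```python
from collections import defaultdict

STOP_LIST = {"the", "s", "of", "and"}  # common tokens to skip

def build_inverted_index(records):
    """Map each non-stop-list token to the set of record ids containing it."""
    index = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            if token not in STOP_LIST:
                index[token].add(rid)
    return index

def candidate_pairs(source_a, source_b):
    """Generate only those cross-source pairs that share at least one token."""
    index = build_inverted_index(source_b)
    pairs = set()
    for rid_a, text in source_a.items():
        for token in text.lower().split():
            for rid_b in index.get(token, ()):
                pairs.add((rid_a, rid_b))
    return pairs

# Toy records (apostrophes dropped for simplicity)
zagats = {"Z1": "Arts Deli", "Z2": "Les Celebrites"}
health = {"D1": "Arts Delicatessen", "D2": "Les Celebrites", "D3": "Binions Coffee Shop"}
print(sorted(candidate_pairs(zagats, health)))  # [('Z1', 'D1'), ('Z2', 'D2')]
```

Only two of the six possible cross-source pairs survive blocking; the quadratic all-pairs comparison is avoided.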

SLIDE 11

Example: Partial Inverted Index for LA Department of Health

SLIDE 12

Outline

  • Introduction
  • Candidate Generation
  • Field Matching
  • Record Matching
  • Discussion
SLIDE 13

Field Matching Approaches

  • Expert-system rules
  • Manually written
  • Information retrieval
  • General string similarity
  • Used in Marlin
  • Learned weights for domain-specific transformations
  • Used in Active Atlas
SLIDE 14

Information Retrieval Approach

[Cohen, 1998]

  • Idea: evaluate the similarity of records via textual similarity. Used in Whirl (Cohen 1998).
  • Follows the same approach used by classical IR algorithms (including web search engines).
  • First, “stemming” is applied to each entry. E.g. “Joe’s Diner” -> “Joe [‘s] Diner”
  • Then, entries are compared by counting the number of words in common.
  • Note: infrequent words are weighted more heavily by the TFIDF metric (Term Frequency - Inverse Document Frequency).

SLIDE 15

Token-based Metrics

  • Any string can be treated as a bag of tokens.
  • Word tokens: “8358 Sunset Blvd” ► {8358, Sunset, Blvd}
  • 4-gram tokens: “8358 Sunset Blvd” ► {‘8358’, ‘358 ‘, ’58 S’, ‘8 Su’, ‘ Sun’, ‘Suns’, ‘unse’, ‘nset’, ‘set ‘, ‘et B’, ‘t Bl’, ‘ Blv’, ‘Blvd’}
  • Each token corresponds to a dimension in Euclidean space; string similarity is the normalized dot product (cosine) in the vector space.
  • Weighting tokens by Inverse Document Frequency (IDF) is a form of unsupervised string metric learning.
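A compact sketch of this vector-space similarity, using word tokenization and an IDF-weighted cosine; the +1 IDF smoothing and the sample corpus are illustrative choices, not from the slides:

```python
import math
from collections import Counter

def tfidf_vectors(strings):
    """Build IDF-weighted token vectors for a small corpus of strings."""
    bags = [Counter(s.lower().split()) for s in strings]
    n = len(bags)
    df = Counter(tok for bag in bags for tok in bag)  # document frequency
    idf = {tok: math.log(n / df[tok]) + 1.0 for tok in df}  # +1 keeps ubiquitous tokens nonzero
    return [{tok: tf * idf[tok] for tok, tf in bag.items()} for bag in bags]

def cosine(u, v):
    """Normalized dot product of two sparse vectors (dicts token -> weight)."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

corpus = ["8358 Sunset Blvd", "8358 Sunset Blvd West", "128 Fremont St"]
vecs = tfidf_vectors(corpus)
print(cosine(vecs[0], vecs[1]))  # high: three shared tokens
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared tokens
```

Rare tokens get larger IDF weights, so matching on an unusual street name counts for more than matching on a ubiquitous token like “Blvd”.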

SLIDE 16

String Similarity Measures

  • Metrics based on sequence comparison:
  • String edit distance
  • Variants: length of longest common subsequence, Smith-Waterman distance, etc. [Gusfield ‘97]
  • Metrics based on vector-space similarity:
  • Rely on representing strings as sets of tokens
  • Variants include word tokenization, n-grams, etc. [Baeza-Yates & Ribeiro-Neto ‘98]
SLIDE 17

Sequence-based String Metrics: String Edit Distance [Levenshtein, 1966]

  • Minimum number of character deletions, insertions, or substitutions needed to make two strings equivalent.
  • “misspell” to “mispell” is distance 1 (‘delete s’)
  • “misspell” to “mistell” is distance 2 (‘delete s’, ‘substitute p with t’ OR ‘substitute s with t’, ‘delete p’)
  • “misspell” to “misspelling” is distance 3 (‘insert i’, ‘insert n’, ‘insert g’)
  • Can be computed efficiently using dynamic programming in O(mn) time, where m and n are the lengths of the two strings being compared.
  • Unit cost is typically assigned to individual edit operations, but individual costs can be used.
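The O(mn) dynamic program is short enough to show in full; a space-saving two-row version with unit edit costs:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, O(len(a) * len(b)) time."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))  # distances from the empty prefix of a to each prefix of b
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (a[i - 1] != b[j - 1])  # substitute (free if chars match)
            curr[j] = min(prev[j] + 1,       # delete from a
                          curr[j - 1] + 1,   # insert into a
                          sub)
        prev = curr
    return prev[n]

print(edit_distance("misspell", "mispell"))      # 1
print(edit_distance("misspell", "mistell"))      # 2
print(edit_distance("misspell", "misspelling"))  # 3
```

The three printed values reproduce the slide’s examples.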

SLIDE 18

String Edit Distance with Affine Gaps [Gotoh, 1982]

  • Cost of gaps formed by contiguous deletions/insertions should be lower than the cost of multiple non-contiguous edit operations.
  • Distance from “misspell” to “misspelling” is <3.
  • Affine model for gap cost: cost(gap) = s + e|gap|, e < s
  • Edit distance with affine gaps is more flexible since it is less susceptible to sequences of insertions/deletions that are frequent in natural language text (e.g. ‘Street’ vs. ‘Str’).
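The affine model is easy to check numerically; the parameter values s = 1.0 and e = 0.5 below are illustrative assumptions (any e < s behaves the same way):

```python
def gap_cost(length, s=1.0, e=0.5):
    """Affine gap model from the slide: cost(gap) = s + e * |gap|, with e < s."""
    return s + e * length

# One contiguous 3-character gap (inserting "ing" to turn "misspell" into
# "misspelling") is cheaper than three unit-cost edits...
print(gap_cost(3))      # 2.5, i.e. distance < 3 as claimed

# ...and cheaper than three separate single-character gaps, since each
# non-contiguous gap pays the start cost s again:
print(3 * gap_cost(1))  # 4.5
```

Opening a gap pays the start cost s once; extending it pays only e per character, so contiguous runs of insertions or deletions are favored.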

SLIDE 19

Learnable Edit Distance with Affine Gaps

  • Motivation: the significance of edit operations depends on the particular domain
  • Substituting ‘/’ with ‘-’ is insignificant for phone numbers.
  • Deleting ‘Q’ is significant for names.
  • Gap start/extension costs vary: sequence deletion is common for addresses (‘Street’ ► ’Str’), uncommon for zip codes.
  • Using individual weights for edit operations, as well as learning gap operation costs, allows adapting to a particular field domain.
  • [Ristad & Yianilos ‘97] proposed a one-state generative model for regular edit distance.

SLIDE 20

Learnable Edit Distance with Affine Gaps - the Generative Model

  • Matching/substituted pairs of characters are generated in state M.
  • Deleted/inserted characters that form gaps are generated in states D and I.
  • A special termination state “#” ends the alignment of two strings.
  • Similar to the pairwise alignment HMMs used in bioinformatics [Durbin et al. ’98]

Example alignment of “misspell” and “mistelling”:
(m,m) (i,i) (s,s) (s,t) (p,ε) (e,e) (l,l) (l,l) (ε,i) (ε,n) (ε,g)

SLIDE 21

Learnable Edit Distance with Affine Gaps: Training

  • Given a corpus of matched string pairs, the model is trained using Expectation-Maximization.
  • The model parameters take on values that result in a high probability of producing duplicate strings.
  • Frequent edit operations and typos have high probability.
  • Rare edit operations have low probability.
  • Gap parameters take on values that are optimal for duplicate strings in the training corpus.
  • Once trained, the distance between any two strings is estimated as the posterior probability of generating the most likely alignment between the strings as a sequence of edit operations.
  • Distance computation is performed with a simple dynamic programming algorithm.

SLIDE 22

Learning Transformation Weights

  • Learn general transformations to recognize related objects

  Zagat’s                   Dept of Health           Transformation
  Art’s Deli                Art’s Delicatessen       Prefix
  California Pizza Kitchen  CPK                      Acronym
  Philippe The Original     Philippe’s The Original  Stemming

SLIDE 23

Transformation Weights

  • Transformations can be more appropriate for a specific application domain
  • Restaurants, Companies, or Airports
  • Or for different attributes within an application domain
  • Acronym is more appropriate for the attribute Restaurant Name than for the Phone attribute
  • Learn the likelihood that if a transformation is applied then the objects are mapped: Transformation Weight = P(mapped | transformation)
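Estimating these weights reduces to counting over labeled candidate pairs; a sketch in which the transformation names follow the slides but the labeled pairs are invented:

```python
from collections import Counter

def transformation_weights(labeled_pairs):
    """Estimate P(mapped | transformation) from labeled candidate pairs.

    labeled_pairs: iterable of (transformations_applied, is_mapped), where
    transformations_applied is the set of transformation names that fired.
    """
    applied = Counter()
    applied_and_mapped = Counter()
    for transforms, mapped in labeled_pairs:
        for t in transforms:
            applied[t] += 1
            if mapped:
                applied_and_mapped[t] += 1
    return {t: applied_and_mapped[t] / applied[t] for t in applied}

# Hypothetical labels: Prefix usually indicates a mapping here, Acronym always does
pairs = [
    ({"Equality", "Prefix"}, True),
    ({"Prefix"}, True),
    ({"Prefix"}, False),
    ({"Acronym"}, True),
]
weights = transformation_weights(pairs)
print(weights["Prefix"])   # 0.666... (2 of the 3 Prefix pairs were mapped)
print(weights["Acronym"])  # 1.0
```

This direct frequency estimate is exactly the conditional probability the Bayes-rule formulation on the later slide computes.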

SLIDE 24

Types of Transformations

Unary Transformations
  • Equality (exact match)
  • Stemming
  • Soundex (e.g. “Celebrites” => “C453”)
  • Abbreviation (e.g. “3rd” => “third”)

Binary Transformations
  • Initial
  • Prefix (e.g. “Deli” & “Delicatessen”)
  • Suffix
  • Substring
  • Acronym (e.g. “California Pizza Kitchen” & “CPK”)
  • Drop Word
SLIDE 25

Applying Unary Transformations

  • Employs Information Retrieval techniques
  • One set of attribute values is broken into words or tokens: “Art” “s” “Delicatessen”
  • Apply Type I (unary) transformations to the tokens: “Art” “A630” “s” “S000” “Delicatessen” “D423”
  • Enter the tokens into an inverted index
  • Tokens from the second set are used to query the index
  • Transformed query set: “Art” “A630” “s” “S000” “Deli” “Del” “D400”

Example: Zagat’s “Art’s Deli” matches Dept of Health “Art’s Delicatessen” on the tokens “Art” (Equality) and “s” (Equality).

SLIDE 26

Applying Binary Transformations

Example: with binary transformations, Zagat’s “Art’s Deli” and Dept of Health “Art’s Delicatessen” now also match on “Deli”/“Delicatessen” (Prefix), in addition to “Art” (Equality) and “s” (Equality).

  • Binary transformations improve the measurement of similarity
SLIDE 27

Calculate Transformation Weights

Example pairs: (Art’s Deli, Art’s Delicatessen), (CPK, California Pizza Kitchen), (Ca’Brea, La Brea Bakery)

Examples are classified as Mapped or Not Mapped, labeled by the learner or by the user. Weights follow from Bayes’ rule:

P(mapped | transformation) = P(transformation | mapped) · P(mapped) / P(transformation)

SLIDE 28

Computing Textual Similarity

Zagat’s restaurant objects (Z1, Z2, Z3) and Department of Health objects (D1, D2, D3), each with the fields Name, Street, and Phone.

  • The candidate generator returns sets of similarity scores (S_name, S_street, S_phone) for each candidate pair, e.g. (.9, .79, .4), (.17, .3, .74), ...

SLIDE 29

Outline

  • Introduction
  • Candidate Generation
  • Field Matching
  • Record Matching
  • Discussion
SLIDE 30

Record Matching Approaches

  • Learning Decision Trees
  • Support Vector Machines (SVM)
  • Unsupervised Learning
SLIDE 31

Learning Mapping Rules with Decision Trees

  • Learning which attributes are important for determining a mapping

                  Name                Street                   Phone
  Zagat’s:        Art’s Deli          12224 Ventura Boulevard  818-756-4124
  Dept of Health: Art’s Delicatessen  12224 Ventura Blvd.      818/755-4100

SLIDE 32

Learning Mapping Rules with Decision Trees

Mapping rules:
  Name > .9 & Street > .87 => mapped
  Name > .95 & Phone > .96 => mapped

Zagat’s Restaurants:
  Name            Street                   Phone
  Art’s Deli      12224 Ventura Boulevard  818-756-4124
  Teresa’s        80 Montague St.          718-520-2910
  Steakhouse The  128 Fremont St.          702-382-1600
  Les Celebrites  155 W. 58th St.          212-484-5113

Dept. of Health:
  Name                  Street                                 Phone
  Art’s Delicatessen    12224 Ventura Blvd.                    818/755-4100
  Teresa’s              103 1st Ave. between 6th and 7th Sts.  212/228-0604
  Binion’s Coffee Shop  128 Fremont St.                        702/382-1600
  Les Celebrites        160 Central Park S                     212/484-5113
SLIDE 33

Learning Mapping Rules

Set of similarity scores → mapping rules

  Name  Street  Phone
  .967  .973    .3
  .17   .3      .74
  .8    .542    .49
  .95   .97     .67
  ...

Mapping rules:
  Name > .8 & Street > .79 => mapped
  Name > .89 => mapped
  Street < .57 => not mapped
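Applying learned rules of this form is straightforward; a sketch using the slide's thresholds, encoded in an assumed (field, op, threshold) rule format:

```python
def classify(scores, rules):
    """Apply learned mapping rules to a record pair's field-similarity scores.

    Each rule is (conditions, label), where conditions is a list of
    (field, op, threshold) triples; the first rule whose conditions all
    hold determines the label, otherwise the pair stays ambiguous.
    """
    ops = {">": lambda a, b: a > b, "<": lambda a, b: a < b}
    for conditions, label in rules:
        if all(ops[op](scores[field], thr) for field, op, thr in conditions):
            return label
    return "ambiguous"

# The three rules from the slide
rules = [
    ([("Name", ">", 0.8), ("Street", ">", 0.79)], "mapped"),
    ([("Name", ">", 0.89)], "mapped"),
    ([("Street", "<", 0.57)], "not mapped"),
]
print(classify({"Name": 0.967, "Street": 0.973, "Phone": 0.3}, rules))  # mapped
print(classify({"Name": 0.17, "Street": 0.3, "Phone": 0.74}, rules))    # not mapped
```

A decision-tree learner would induce the thresholds; here they are simply taken from the slide and replayed.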

SLIDE 34

Mapping Rule Learner with Active Learning

Active-learning loop (diagram summary):
1. Choose initial examples from the set of mapped objects
2. Generate a committee of learners
3. Each learner learns rules and classifies the examples
4. The committee members vote on each candidate pair
5. The most informative example is chosen and labeled by the user
6. Repeat with the newly labeled example

SLIDE 35

Committee Disagreement

  • Chooses an example based on the disagreement of the query committee
  • In this case (CPK, California Pizza Kitchen) is the most informative example based on disagreement

  Examples                         M1   M2   M3
  Art’s Deli, Art’s Delicatessen   Yes  Yes  Yes
  CPK, California Pizza Kitchen    Yes  No   Yes
  Ca’Brea, La Brea Bakery          No   No   No

SLIDE 36

Learnable Vector-space Similarity

x: “3130 Piedmont Road”    y: “3130 Piedmont Rd. NE”

Token dimensions: 3130, ne, piedmont, rd, road

  • Each string is converted to a vector-space representation
  • The pair vector p(x, y) is created from the componentwise products (x1y1, ..., x5y5)
  • The pair vector is classified as “similar” (S) or “dissimilar” (D)
  • Similarity between strings is obtained from the SVM output: Sim(x, y) ∝ f(p(x, y))

SLIDE 37

Combining String Similarity Across Fields

  • Some fields are more indicative of record similarity than others:
  • For addresses, street address similarity is more important than city similarity.
  • For bibliographic citations, author or title similarity is more important than venue (i.e. conference or journal name) similarity.
  • Field similarities should be weighted when combined to determine record similarity.
  • Weights can be learned using a learning algorithm [Cohen & Richman ‘02], [Sarawagi & Bhamidipaty ‘02], [Tejada et al. ‘02].
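The weighted combination can be sketched directly; the weights below are hand-set for illustration (in the systems cited they would be learned from labeled pairs):

```python
def record_similarity(field_sims, weights):
    """Weighted combination of per-field similarities into one record score."""
    total = sum(weights.values())
    return sum(weights[f] * field_sims[f] for f in weights) / total

# Illustrative weights: street matters more than city or phone for addresses
weights = {"street": 0.6, "city": 0.2, "phone": 0.2}
sims = {"street": 0.95, "city": 0.40, "phone": 1.0}
print(record_similarity(sims, weights))  # close to the strong street score
```

Because the street field dominates the weights, a strong street match keeps the record score high even when the city similarity is weak.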

SLIDE 38

Learned Record Similarity

  • String similarities for each field are used as components of a feature vector for a pair of records.
  • An SVM is trained on labeled feature vectors to discriminate duplicate from non-duplicate pairs.
  • Record similarity is based on the distance of the feature vector from the separating hyperplane.

SLIDE 39

Learning Record Similarity (cont.)

SLIDE 40

Unsupervised Record Linkage

  • Idea: analyze the data and automatically cluster pairs into three groups:
  • Let R = P(obs | Same) / P(obs | Different)
  • Matched if R > threshold TU
  • Unmatched if R < threshold TL
  • Ambiguous if TL < R < TU
  • This model for computing decision rules was introduced by Fellegi & Sunter in 1969
  • Particularly useful for statistically linking large sets of data, e.g., by the US Census Bureau
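The decision rule itself is just a three-way threshold on the likelihood ratio; a sketch with illustrative probabilities and thresholds (the numbers are invented, not estimated from data):

```python
def fellegi_sunter(p_obs_same, p_obs_diff, t_lower, t_upper):
    """Classify a pair by the likelihood ratio R = P(obs|Same) / P(obs|Different)."""
    r = p_obs_same / p_obs_diff
    if r > t_upper:
        return "matched"
    if r < t_lower:
        return "unmatched"
    return "ambiguous"

# Illustrative probabilities and thresholds
print(fellegi_sunter(0.9, 0.01, t_lower=2.0, t_upper=10.0))  # matched   (R = 90)
print(fellegi_sunter(0.05, 0.2, t_lower=2.0, t_upper=10.0))  # unmatched (R = 0.25)
print(fellegi_sunter(0.3, 0.06, t_lower=2.0, t_upper=10.0))  # ambiguous (R = 5)
```

Pairs in the ambiguous band between TL and TU are the ones routed to clerical review; the two component probabilities are what EM estimates on the next slide.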

SLIDE 41

Unsupervised Record Linkage (cont.)

  • Winkler (1998) used the EM algorithm to estimate P(obs | Same) and P(obs | Different)
  • EM computes the maximum likelihood estimate: the algorithm iteratively determines the parameters most likely to generate the observed data.
  • Additional mathematical techniques must be used to adjust for relative frequencies, e.g., the last name “Smith” is much more frequent than “Knoblock”.

SLIDE 42

Outline

  • Introduction
  • Candidate Generation
  • Field Matching
  • Record Matching
  • Discussion
SLIDE 43

Enforcing One-to-One Relationship

Given weights W, the matching method determines the most likely matching assignment.

Zagat’s (Name, Street, City):
  Z1: (Art’s Deli, 1745 Ventura Boulevard, Encino)
  Z2: (Citrus, 267 Citrus Ave., LA)
  Z3: (Spago, 456 Sunset Bl., LA)
  ...
  (not in source)

Dept of Health (Name, Street, City):
  D1: (Art’s Delicatessen, 1745 Ventura Blvd, Encino)
  D2: (Ca’ Brea, 6743 La Brea Ave., LA)
  D3: (Patina, 342 Melrose Ave., LA)
  ...
  (not in source)

  • Viewed as a weighted bipartite matching problem
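For small candidate sets, the best one-to-one assignment can be found by exhaustive search over permutations (a real system would use the Hungarian algorithm; the score matrix below is invented for illustration):

```python
from itertools import permutations

def best_assignment(weights):
    """Maximum-weight one-to-one assignment by exhaustive search.

    weights[i][j] is the match score between record i of source A and
    record j of source B; brute force is feasible only for small n.
    """
    n = len(weights)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(weights[i][perm[i]] for i in range(n))
        if total > best:
            best, best_perm = total, perm
    return best_perm, best

# Illustrative pairwise scores between (Z1, Z2, Z3) and (D1, D2, D3)
W = [
    [0.9, 0.1, 0.2],  # Z1 scores highest against D1
    [0.2, 0.3, 0.8],  # Z2 scores highest against D3
    [0.1, 0.7, 0.4],  # Z3 scores highest against D2
]
print(best_assignment(W))  # assignment (0, 2, 1): Z1-D1, Z2-D3, Z3-D2
```

Enforcing the one-to-one constraint jointly can override greedy per-record choices: a record is denied its best partner if that partner is more valuable to another record.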

SLIDE 44

Related Work

  • Record linkage [Newcombe et al. ’59; Fellegi & Sunter ’69; Winkler ’94, ’99, ‘02]

  • Database hardening [Cohen et al. ’00]
  • Merge/purge [Hernandez & Stolfo ’95]
  • Field matching [Monge & Elkan ’96]
  • Data cleansing [Lee et al. ’99]
  • Name matching [Cohen & Richman ’01, Cohen et al. ’03]
  • De-duplication [Sarawagi & Bhamidipaty ’02]
  • Object identification [Tejada et al. ’01, ’02]
  • Fuzzy duplicate elimination [Ananthakrishna et al. ’02]
  • Identity uncertainty [Pasula et al. ’02, McCallum & Wellner ‘03]
  • Object consolidation [Michalowski et al. ’03]
SLIDE 45

Conclusions

  • Technical choices in record linkage:
  • Approach to candidate generation
  • Approach to field matching
  • Approach to record matching
  • Learning approaches have the advantage of being able to:
  • Adapt to specific application domains
  • Learn which fields are important
  • Learn the most appropriate transformations
  • Optimal classifier choice is sensitive to the domain and the amount of available training data.