Automatic Record Linkage using Seeded Nearest Neighbour and SVM Classification - PowerPoint PPT Presentation


  1. Automatic Record Linkage using Seeded Nearest Neighbour and SVM Classification
     Peter Christen
     Department of Computer Science, ANU College of Engineering and Computer Science, The Australian National University, Canberra, Australia
     Contact: peter.christen@anu.edu.au
     Project Web site: http://datamining.anu.edu.au/linkage.html
     Funded by the Australian National University, the New South Wales Department of Health, and the Australian Research Council (ARC) under Linkage Project 0453463.

  2. Outline
     - Record linkage and its challenges
     - The record linkage process
     - Record pair comparison and classification
     - Records and weight vectors example
     - Two-step classification approach
     - Experimental results
     - Outlook and future work

  3. Record linkage and its challenges
     - The process of linking and aggregating records that represent the same entity (such as a patient, a customer, a business, etc.)
     - Also called data matching, data scrubbing, entity resolution, object identification, merge-purge, etc.
     - Has several major challenges:
       - Real world data is dirty (typographical errors and variations, missing and out-of-date values, etc.)
       - Scalability (a naïve comparison of all record pairs is O(n²), so some form of blocking or indexing is required; a minimal blocking sketch follows this slide)
       - No training data available in many application areas (no data sets with known true match status)
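To make the scalability point concrete, here is a minimal sketch of standard blocking. The record layout and the surname-initial blocking key are purely illustrative assumptions (real systems typically use keys such as phonetic encodings of names), not details taken from the slides:

```python
from collections import defaultdict
from itertools import combinations

def block_records(records, blocking_key):
    """Group record identifiers by a blocking key so that only records
    within the same block are compared, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    return blocks

def candidate_pairs(blocks):
    """Generate the candidate record pairs produced by the blocking step."""
    for rec_ids in blocks.values():
        for pair in combinations(sorted(rec_ids), 2):
            yield pair

# Illustrative records; the blocking key is simply the first letter of the
# surname here (real systems often use phonetic encodings such as Soundex).
records = {
    'R1': {'given_name': 'Christine', 'surname': 'Smith'},
    'R2': {'given_name': 'Christina', 'surname': 'Smith'},
    'R3': {'given_name': 'Bob', 'surname': "O'Brian"},
}
blocks = block_records(records, lambda r: r['surname'][0].upper())
print(list(candidate_pairs(blocks)))  # [('R1', 'R2')]
```

Only pairs within the same block are compared, which is what avoids the quadratic number of candidate pairs.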

  4. The record linkage process
     [Flowchart: Database A and Database B each pass through cleaning and standardisation, then blocking / indexing and field comparison; the resulting weight vectors are classified into matches, possible matches (sent to clerical review) and non-matches, followed by evaluation.]

  5. Record pair comparison and classification
     - Pairs of records are compared field (attribute) wise using various field comparison functions
       - Such as exact or approximate string (edit distance, q-gram, Winkler), numeric, age, date, time, etc.
       - These return 1.0 for exact similarity and 0.0 for total dissimilarity (a simple example comparison function is sketched after this slide)
     - For each compared record pair, a weight vector containing matching weights is calculated
     - Record pairs are then classified into matches, non-matches (and possible matches)
     - Various techniques have been explored: summing and threshold based, decision trees, SVM, clustering, etc.
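As an illustration of such a comparison function, the sketch below implements a simple q-gram (bigram) similarity using the Dice coefficient. It is a generic stand-in for this class of functions, not the exact comparator implemented in Febrl:

```python
def qgram_similarity(s1, s2, q=2):
    """q-gram string similarity in [0.0, 1.0]: 1.0 for an exact match,
    0.0 for strings that share no q-grams (Dice coefficient)."""
    if s1 == s2:
        return 1.0
    grams1 = {s1[i:i + q] for i in range(len(s1) - q + 1)}
    grams2 = {s2[i:i + q] for i in range(len(s2) - q + 1)}
    if not grams1 or not grams2:
        return 0.0
    return 2.0 * len(grams1 & grams2) / (len(grams1) + len(grams2))

print(qgram_similarity('christine', 'christina'))  # 0.875
print(qgram_similarity('smith', 'jones'))          # 0.0
```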

  6. Records and weight vectors example
     R1: Christine Smith 42 Main Street
     R2: Christina Smith 42 Main St
     R3: Bob O’Brian 11 Smith Rd
     R4: Robert Bryce 12 Smythe Road

     WV(R1,R2): [0.9, 1.0, 1.0, 1.0, 0.9]
     WV(R1,R3): [0.0, 0.0, 0.0, 0.0, 0.0]
     WV(R1,R4): [0.0, 0.0, 0.5, 0.0, 0.0]
     WV(R2,R3): [0.0, 0.0, 0.0, 0.0, 0.0]
     WV(R2,R4): [0.0, 0.0, 0.5, 0.0, 0.0]
     WV(R3,R4): [0.7, 0.3, 0.5, 0.7, 0.9]
     (A sketch of how such weight vectors can be computed follows below.)
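A minimal sketch of how weight vectors like those above could be produced. The five-field split (given name, surname, street number, street name, street type) is an assumption inferred from the example, and the similarity function is a generic placeholder (Python's difflib), so the resulting numbers only approximate the values on the slide:

```python
from difflib import SequenceMatcher

FIELDS = ['given_name', 'surname', 'street_number', 'street_name', 'street_type']

def field_sim(v1, v2):
    """Generic approximate string similarity in [0.0, 1.0]."""
    if v1 == v2:
        return 1.0
    return round(SequenceMatcher(None, v1, v2).ratio(), 2)

def weight_vector(rec_a, rec_b):
    """Compare two records field by field and collect the matching weights."""
    return [field_sim(rec_a[f], rec_b[f]) for f in FIELDS]

R1 = dict(zip(FIELDS, ['christine', 'smith', '42', 'main', 'street']))
R2 = dict(zip(FIELDS, ['christina', 'smith', '42', 'main', 'st']))
R3 = dict(zip(FIELDS, ['bob', "o'brian", '11', 'smith', 'rd']))

print(weight_vector(R1, R2))  # mostly high similarities: a likely match
print(weight_vector(R1, R3))  # mostly low similarities: a likely non-match
```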

  7. Two-step classification approach
     Step 1: Select weight vectors into seed training sets
     - Weight vectors closest to the exact match weight vector (all similarities 1.0) go into the match seed training set
     - Weight vectors closest to the total dissimilarity weight vector (all similarities 0.0) go into the non-match seed training set
     Step 2: Start binary classification using the seed training sets
     - Nearest neighbour: iteratively add the not yet classified weight vector that is closest to one of the training sets into that set
     - Iterative SVM: train an SVM, add the weight vectors furthest away from the decision boundary into the training sets, then train a new SVM
     (A code sketch of the iterative SVM variant follows this slide.)
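The following is a minimal sketch of the iterative SVM variant of this two-step approach, written against scikit-learn rather than Febrl (an assumption for illustration). The seed_fraction and iterations parameters, and the use of Euclidean distance to the all-1.0 and all-0.0 vectors for seed selection, are illustrative choices, not values taken from the slides:

```python
import numpy as np
from sklearn.svm import SVC

def two_step_classify(weight_vectors, seed_fraction=0.1, iterations=5):
    """Unsupervised two-step classification of record pair weight vectors.

    Step 1: select seed training sets (vectors closest to the exact match
    vector of all 1.0s and to the total dissimilarity vector of all 0.0s).
    Step 2: iteratively train an SVM and move the unclassified vectors
    furthest from the decision boundary into the training sets.
    Returns 1 for predicted matches and 0 for non-matches.
    """
    W = np.asarray(weight_vectors, dtype=float)
    n, dim = W.shape
    n_seed = max(1, int(seed_fraction * n))

    dist_match = np.linalg.norm(W - 1.0, axis=1)  # distance to [1, ..., 1]
    dist_nonmatch = np.linalg.norm(W, axis=1)     # distance to [0, ..., 0]

    match_idx = set(np.argsort(dist_match)[:n_seed].tolist())
    nonmatch_idx = set([i for i in np.argsort(dist_nonmatch).tolist()
                        if i not in match_idx][:n_seed])

    svm = None
    for _ in range(iterations):
        train_idx = sorted(match_idx | nonmatch_idx)
        labels = [1 if i in match_idx else 0 for i in train_idx]
        svm = SVC(kernel='linear')
        svm.fit(W[train_idx], labels)

        unclassified = [i for i in range(n)
                        if i not in match_idx and i not in nonmatch_idx]
        if not unclassified:
            break
        margins = svm.decision_function(W[unclassified])
        # Move the most confidently classified vectors (largest absolute
        # margin) into the corresponding training set.
        for j in np.argsort(-np.abs(margins))[:n_seed]:
            i = unclassified[j]
            (match_idx if margins[j] > 0 else nonmatch_idx).add(i)

    return svm.predict(W)

# Illustrative use with the weight vectors from the example slide:
wvs = [[0.9, 1.0, 1.0, 1.0, 0.9], [0.0, 0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.0, 0.0], [0.7, 0.3, 0.5, 0.7, 0.9]]
print(two_step_classify(wvs))  # 1 = match, 0 = non-match
```

The nearest neighbour variant described on the slide would instead, in each round, move the single unclassified weight vector closest to either training set into that set, without training an SVM until the end.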

  8. Experimental results
     - All techniques are implemented in the Febrl open source record linkage system (available from: https://sourceforge.net/projects/febrl/)
     - Experiments use both real and synthetic data (the SecondString repository and the Febrl data set generator)
     - The proposed two-step approach is compared with two other classifiers:
       - Support vector machine (SVM) (supervised)
       - Hybrid TAILOR approach (k-means clustering followed by SVM)
     - The F-measure is used to evaluate classifier results, with minimum, average and maximum values shown in the graphs (a small example calculation follows below)
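For reference, the F-measure used here is the harmonic mean of precision and recall, typically computed for the match class since non-matches dominate in record linkage. A minimal sketch with purely illustrative counts:

```python
def f_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall for the match class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Purely illustrative counts of classified record pairs:
print(round(f_measure(true_pos=90, false_pos=10, false_neg=30), 3))  # 0.818
```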

  9. Classification results for ‘Cora’
     [Bar chart: minimum, average and maximum F-measure (0 to 1) on the ‘Cora’ data set (1295 records) for the SVM, TAILOR and several variants of the proposed two-step classifier.]

  10. Classification results for ‘Restaurant’
      [Bar chart: minimum, average and maximum F-measure (0 to 1) on the ‘Restaurant’ data set (864 records) for the SVM, TAILOR and several variants of the proposed two-step classifier.]

  11. Results for synthetic data sets
      [Bar chart: minimum, average and maximum F-measure (0 to 1), averaged over the four ‘DS-Gen’ data sets, for the SVM, TAILOR and several variants of the proposed two-step classifier.]

  12. Outlook and future work
      - The proposed two-step record pair classification approach shows promising results
        - It can automatically select good quality training examples
        - It can achieve better results than other unsupervised classification techniques
      - Improvements for the second step (classification):
        - Implement data reduction and fast indexing techniques to improve performance and scalability
        - Investigate how this approach can be combined with active learning
        - Conduct more experiments on larger data sets
