Probabilistic Data Generation for Deduplication and Data Linkage

Probabilistic Data Generation for Deduplication and Data Linkage

Peter Christen
Data Mining Group, Australian National University
Contact: peter.christen@anu.edu.au
Project web page: http://datamining.anu.edu.au/linkage.html

Funded by the ANU, the NSW Department of Health, and the Australian Research Council (ARC) (LP #0453463)

Peter Christen, July 2005 – p.1/13

Outline
- Data linkage and deduplication
- Data linkage techniques
- Test data for data linkage
- Artificial data
- Probabilistic data set generator
- Example data generated
- Experimental study
- Conclusions and outlook

Data linkage and deduplication
- The task of linking together records that represent the same entity (patient, customer, business, etc.) from one or more data sources
- Real-world data is dirty, so cleaning and standardisation are important
- Applications of data linkage
  - Remove duplicates in a data set (internal linkage)
  - Merge new records into a larger master data set
  - Create customer- or patient-oriented statistics
  - Compile data for longitudinal studies
  - Geocode data (match addresses with geographic reference data)

Data linkage techniques
- Computer-assisted linkage goes back to the 1950s
- Deterministic linkage
  - Exact linkage (if a unique identifier of high quality, i.e. precise, robust, and stable over time, is available)
  - Rule-based linkage (complex to build and maintain)
- Probabilistic linkage (Fellegi & Sunter, 1969)
  - Apply linkage using available (personal) information (which can be missing, wrong, coded differently, or out of date)
- Modern approaches
  - Based on machine learning, data mining, or information retrieval techniques (clustering, decision trees, active learning, learnable string metrics, graphical models, etc.)
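As a rough illustration of how probabilistic linkage in the Fellegi & Sunter style scores a record pair, the sketch below sums per-attribute agreement and disagreement weights. The attribute names and the m- and u-probabilities are hypothetical values chosen for the example; they are not taken from the slides or from any real data set, and the sketch ignores missing values and approximate string comparison.

```python
import math

# Hypothetical m- and u-probabilities per attribute (illustrative values only):
#   m = P(attribute agrees | the two records are a true match)
#   u = P(attribute agrees | the two records are a true non-match)
PROBS = {
    "given_name": (0.90, 0.10),
    "surname": (0.95, 0.05),
    "suburb": (0.80, 0.20),
}


def attribute_weight(m, u, agrees):
    """Log-likelihood ratio contributed by one attribute comparison."""
    if agrees:
        return math.log2(m / u)            # agreement weight (positive when m > u)
    return math.log2((1 - m) / (1 - u))    # disagreement weight (negative when m > u)


def matching_weight(rec_a, rec_b):
    """Sum the attribute weights; exact string equality stands in for agreement."""
    return sum(
        attribute_weight(m, u, rec_a.get(attr) == rec_b.get(attr))
        for attr, (m, u) in PROBS.items()
    )


a = {"given_name": "gail", "surname": "miller", "suburb": "taree"}
b = {"given_name": "gayle", "surname": "miller", "suburb": "taree"}
c = {"given_name": "peter", "surname": "smith", "suburb": "kilda"}

print(matching_weight(a, b))  # one disagreement, two agreements
print(matching_weight(a, c))  # all attributes disagree
```

Pairs with a high total weight are treated as likely matches and pairs with a low total weight as likely non-matches; where the thresholds are placed is a separate design decision that this sketch leaves out.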

Test data for data linkage
- Various data sets are used in recent publications (restaurant, cora, citeseer, census, etc.)
  - Usually very small (less than 2,000 records)
  - Proprietary and even confidential data has been used
- There is a lack of standard test data
  - Hard to compare new algorithms and to learn how to use and customise data linkage systems
- Recent small repository: RIDDLE (Repository of Information on Duplicate Detection, Record Linkage, and Identity Uncertainty)
  http://www.cs.utexas.edu/users/ml/riddle/

Artificial data
- Privacy issues prohibit publication of real data (for example names, addresses, dates of birth, etc.)
  - De-identified or encrypted data cannot be used (as linkage algorithms work on name and address strings)
- Artificial data as an alternative to real data
  - Based on real data (frequency and misspelling tables)
  - Must model the content and statistical properties of real data
- Advantages
  - Content and error modifications can be controlled
  - Data can be published
  - Easy to repeat and verify experiments

A probabilistic data set generator
- First data generator by Hernandez & Stolfo (1996)
- Improved by Bertolazzi et al. (2003) (no details given, not publicly available)
- Our generator
  - Open source (Python)
  - Part of the Febrl data linkage system (Freely extensible biomedical record linkage)
  - Easy for a user to modify and improve
  - Based on real-world frequency look-up tables for names, addresses, dates of birth, etc.
  - Includes look-up tables with real typographical errors and name variations (for example 'Gail' and 'Gayle')

Data generation
- Step 1: Create original records
  - Randomly select values from various frequency look-up tables, or from a user-specified range (e.g. for date of birth)
- Step 2: Create duplicates based on original records by introducing modifications
  - Single errors (insert, delete, or substitute a character; transpose two characters)
  - Insert or delete a whitespace (split or merge a word)
  - Set to missing (empty string), or insert a new value
  - Swap with another value from a look-up table
  - Swap two attribute values (e.g. given name and surname)
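The two steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Febrl generator itself: the look-up tables, their counts, the attribute names, and all function names are hypothetical, and only a subset of the listed modifications is implemented.

```python
import random
import string

# Hypothetical frequency look-up tables (value -> count); numbers are illustrative.
GIVEN_NAMES = {"gail": 30, "peter": 50, "mary": 20}
SURNAMES = {"smith": 60, "miller": 25, "dijkstra": 15}


def pick(table, rng):
    """Step 1 helper: draw a value with probability proportional to its count."""
    values = list(table)
    return rng.choices(values, weights=[table[v] for v in values], k=1)[0]


def make_original(rng):
    return {"given_name": pick(GIVEN_NAMES, rng), "surname": pick(SURNAMES, rng)}


def insert_char(s, rng):
    pos = rng.randrange(len(s) + 1)
    return s[:pos] + rng.choice(string.ascii_lowercase) + s[pos:]


def delete_char(s, rng):
    if not s:
        return s
    pos = rng.randrange(len(s))
    return s[:pos] + s[pos + 1:]


def substitute_char(s, rng):
    if not s:
        return s
    pos = rng.randrange(len(s))
    new = rng.choice([c for c in string.ascii_lowercase if c != s[pos]])
    return s[:pos] + new + s[pos + 1:]


def transpose_chars(s, rng):
    if len(s) < 2:
        return s
    pos = rng.randrange(len(s) - 1)
    return s[:pos] + s[pos + 1] + s[pos] + s[pos + 2:]


SINGLE_ERRORS = [insert_char, delete_char, substitute_char, transpose_chars]


def make_duplicate(record, rng):
    """Step 2: copy the original and apply one randomly chosen modification."""
    dup = dict(record)
    kind = rng.choice(["single_error", "missing", "swap_attributes"])
    if kind == "swap_attributes":
        dup["given_name"], dup["surname"] = dup["surname"], dup["given_name"]
    else:
        field = rng.choice(sorted(dup))
        if kind == "missing":
            dup[field] = ""  # set to missing (empty string)
        else:
            dup[field] = rng.choice(SINGLE_ERRORS)(dup[field], rng)
    return dup


rng = random.Random(42)  # fixed seed so a run can be repeated
original = make_original(rng)
duplicate = make_duplicate(original, rng)
print(original, duplicate)
```

Passing an explicit `random.Random` instance keeps the generation repeatable, which matches the stated advantage that experiments on artificial data are easy to repeat and verify.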

Example data generated
Data set with 4 original and 6 duplicate records

REC_ID, ADDRESS1, ADDRESS2, SUBURB
rec-0-org, wylly place, pine ret vill, taree
rec-0-dup-0, wyllyplace, pine ret vill, taree
rec-0-dup-1, pine ret vill, wylly place, taree
rec-0-dup-2, wylly place, pine ret vill, tared
rec-0-dup-3, wylly parade, pine ret vill, taree
rec-1-org, stuart street, hartford, menton
rec-2-org, griffiths street, myross, kilda
rec-2-dup-0, griffith sstreet, myross, kilda
rec-2-dup-1, griffith street, mycross, kilda
rec-3-org, ellenborough place, kalkite homestead, sydney

Each record is given a unique identifier, which allows the evaluation of accuracy and error rates
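One way the record identifiers could be used for evaluation is sketched below: the number after `rec-` is taken as the entity number, all records sharing it form the true match pairs, and a set of predicted pairs is scored against them. The function names and the choice of precision and recall as measures are assumptions made for this example; the slides do not specify how accuracy is computed.

```python
from itertools import combinations


def entity_of(rec_id):
    """'rec-0-dup-1' -> '0', 'rec-2-org' -> '2'."""
    return rec_id.split("-")[1]


def true_match_pairs(rec_ids):
    """All unordered pairs of records that share an entity number."""
    by_entity = {}
    for rec_id in rec_ids:
        by_entity.setdefault(entity_of(rec_id), []).append(rec_id)
    pairs = set()
    for group in by_entity.values():
        pairs.update(combinations(sorted(group), 2))
    return pairs


def precision_recall(predicted_pairs, truth):
    """Score predicted pairs against the true pairs, ignoring pair order."""
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


REC_IDS = [
    "rec-0-org", "rec-0-dup-0", "rec-0-dup-1", "rec-0-dup-2", "rec-0-dup-3",
    "rec-1-org", "rec-2-org", "rec-2-dup-0", "rec-2-dup-1", "rec-3-org",
]

truth = true_match_pairs(REC_IDS)
# Hypothetical output of a linkage run: one correct pair and one false match.
predicted = [("rec-0-org", "rec-0-dup-0"), ("rec-1-org", "rec-2-org")]
print(len(truth), precision_recall(predicted, truth))
```

For the ten identifiers in the example data this yields 13 true match pairs: 10 among the five records of entity 0 and 3 among the three records of entity 2.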

Experimental study
- NSW Midwives Data Collection (MDC)
  - Extracted years 1999 and 2000 (175,211 records)
  - Contained 5,331 twin and 177 triplet births
  - Linkage done by AutoMatch resulted in 8,442 duplicate record pairs
- Extracted frequency tables for the mother's name, address, and date of birth attributes
- Created 3 data sets with 175,211 records each, containing 5%, 10%, and 20% duplicates
- Then performed deduplication using the Febrl data linkage system

Sorted attribute frequencies
[Figure: four panels, one each for the attributes 'Given name', 'Mother date of birth', 'Locality', and 'Street number'. Each panel plots frequency (y-axis) against the number of attribute values (logarithmic x-axis, 1 to 10000) for the original data set and for the data sets generated with 5%, 10%, and 20% duplicates.]

Deduplication matching weights
[Figure: four panels: 'MDC 1999 and 2000 deduplication (AutoMatch match status)' and 'Generated MDC deduplication' with 5%, 10%, and 20% duplicates. Each panel plots frequency (logarithmic y-axis, 1 to 100000) against matching weight (x-axis, -60 to 120) for duplicates and non-duplicates.]

Conclusions and outlook
- Several possible improvements
  - Relax the independence assumption (based on real-world frequency tables); for example, a change of address results in a new street name, number, and type, as well as a new postcode and locality
  - Allow generation of groups of records, for example for households (census)
  - Fine-tune error modifications (scanning, typing, etc.)
- Do further comparison studies with real data sets
- See the project web page for more information: http://datamining.anu.edu.au/linkage.html
