SLIDE 1

Data Matching – Overview of Computer Science Methods and Research at the ANU

Peter Christen
Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University
Contact: peter.christen@anu.edu.au

March 2013
SLIDE 2

Outline

Recent interest in data matching
Data matching applications and challenges
The data matching process
Types of data matching techniques
Improving scalability: Indexing techniques
Improving matching quality: Learning techniques
Privacy-preserving record linkage
Research at the ANU: Febrl, privacy, matching historical census data, and real-time matching
Challenges and research directions

SLIDE 3

Recent interest in data matching

Traditionally, data matching has been used in statistics (census) and health (epidemiology)
In recent years, increased interest from businesses and governments

Massive amounts of data are being collected
Increased computing power and storage capacities
Often data from different sources need to be integrated
Need for data sharing between organisations
Data mining (analysis) of large data collections
E-Commerce and Web applications
Geocode matching and spatial data analysis

SLIDE 4

Applications of data matching

Remove duplicates in one data set (deduplication)
Merge new records into a larger master data set
Create patient or customer oriented statistics (for example for longitudinal studies)
Clean and enrich data for analysis and mining
Geocode matching (with reference address data)

Widespread use of data matching
  Immigration, taxation, social security, census
  Fraud, crime, and terrorism intelligence
  Business mailing lists, exchange of customer data
  Biomedical and social science research

SLIDE 5

Data matching challenges

Often no unique entity identifiers are available
Real world data are dirty (typographical errors and variations, missing and out-of-date values, different coding schemes, etc.)

Scalability
  Naïve comparison of all record pairs is quadratic
  Blocking, searching, or filtering is needed (indexing)

No training data in many data matching applications (no record pairs or groups with known true match status)

Privacy and confidentiality (because personal information, like names and addresses, is commonly required for matching)

SLIDE 6

The data matching process

[Figure: Database A and Database B each undergo data pre-processing, then indexing / searching generates candidate record pairs, which are compared and classified into matches, potential matches (sent to clerical review), and non-matches, followed by evaluation]

SLIDE 7

Types of data matching techniques

Deterministic matching
  Exact matching (if a unique identifier of high quality is available: precise, robust, stable over time); examples: social security or Medicare numbers
  Rule-based matching (complex to build and maintain)

Probabilistic record linkage (Fellegi and Sunter, 1969)
  Use available attributes for matching (often personal information, like names, addresses, dates of birth, etc.)
  Calculate matching weights for attributes (see the weight sketch after this list)

‘Computer science’ approaches (based on machine learning, data mining, database, or information retrieval techniques)
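
To make the weight idea concrete, here is a minimal Python sketch of Fellegi-Sunter style agreement/disagreement weights; the m- and u-probabilities and attribute names are illustrative assumptions, not values from any real linkage study.

```python
import math

# Illustrative (assumed) m- and u-probabilities per attribute:
#   m = P(attribute values agree | record pair is a true match)
#   u = P(attribute values agree | record pair is a true non-match)
m_prob = {'surname': 0.95, 'given_name': 0.90, 'birth_date': 0.98}
u_prob = {'surname': 0.01, 'given_name': 0.05, 'birth_date': 0.001}

def match_weight(agreements):
    """Sum log2 likelihood ratios over attributes.

    agreements maps an attribute name to True (values agree) or False.
    """
    weight = 0.0
    for attr, agrees in agreements.items():
        m, u = m_prob[attr], u_prob[attr]
        if agrees:
            weight += math.log2(m / u)                # agreement weight
        else:
            weight += math.log2((1 - m) / (1 - u))    # disagreement weight
    return weight

# Surname and birth date agree, given name disagrees:
print(match_weight({'surname': True, 'given_name': False, 'birth_date': True}))
```

Record pairs with a total weight above an upper threshold are classified as matches, below a lower threshold as non-matches, and in between as potential matches for clerical review.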

SLIDE 8

Improving scalability: Indexing

Number of record pair comparisons equals the product of the sizes of the two databases

(matching two databases containing 1 and 5 million records will result in 5 × 10^12 – 5 trillion – record pairs; see the quick check after this list)

Number of true matches is generally less than the number of records in the smaller of the two databases (assuming no duplicate records)
Performance bottleneck is usually the (expensive) detailed comparison of attribute values between records (using approximate string comparison functions)
Aim of indexing: cheaply remove record pairs that are obviously not matches
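
A quick back-of-the-envelope check of the numbers above; the figure of 10,000 blocks used to illustrate the effect of indexing is an arbitrary assumption.

```python
n_a = 1_000_000      # records in database A
n_b = 5_000_000      # records in database B

naive_pairs = n_a * n_b                  # compare every record with every record
print(f"{naive_pairs:.1e}")              # 5.0e+12, i.e. 5 trillion pairs

# With blocking on an attribute that splits each database into roughly
# 10,000 equally sized blocks, only pairs within the same block are compared:
blocks = 10_000
blocked_pairs = blocks * (n_a // blocks) * (n_b // blocks)
print(f"{blocked_pairs:.1e}")            # 5.0e+08 -- four orders of magnitude fewer
```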

SLIDE 9

Traditional blocking

Traditional blocking works by only comparing record pairs that have the same value for a blocking variable (for example, only compare records that have the same postcode value; see the sketch after this list)

Problems with traditional blocking
  An erroneous value in a blocking variable results in a record being inserted into the wrong block (several passes with different blocking variables can solve this)
  Values of the blocking variable should have uniform frequencies (as the most frequent values determine the size of the largest blocks); example: frequency of ‘Smith’ in NSW: 25,425
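
A minimal sketch of traditional blocking with an inverted index keyed on the blocking variable; the record layout and field names (postcode) are illustrative.

```python
from collections import defaultdict
from itertools import product

def build_blocks(records, blocking_field):
    """Group record identifiers by their value of the blocking variable."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[rec[blocking_field]].append(rec_id)
    return blocks

def candidate_pairs(records_a, records_b, blocking_field):
    """Generate only those record pairs that share the same blocking key."""
    blocks_a = build_blocks(records_a, blocking_field)
    blocks_b = build_blocks(records_b, blocking_field)
    for key, ids_a in blocks_a.items():
        for pair in product(ids_a, blocks_b.get(key, [])):
            yield pair

# Block on postcode, as in the example above:
db_a = {'a1': {'name': 'peter', 'postcode': '2600'},
        'a2': {'name': 'paul',  'postcode': '2601'}}
db_b = {'b1': {'name': 'pete',  'postcode': '2600'}}
print(list(candidate_pairs(db_a, db_b, 'postcode')))   # [('a1', 'b1')]
```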

SLIDE 10

Recent indexing approaches (1)

Sorted neighbourhood approach
  Sliding window over sorted databases
  Use several passes with different blocking variables

Q-gram based blocking (e.g. 2-grams / bigrams)
  Convert values into q-gram lists, then generate sub-lists
  ‘peter’ → [‘pe’,‘et’,‘te’,‘er’], [‘pe’,‘et’,‘te’], [‘pe’,‘et’,‘er’], ...
  ‘pete’ → [‘pe’,‘et’,‘te’], [‘pe’,‘et’], [‘pe’,‘te’], [‘et’,‘te’], ...
  Each record will be inserted into several blocks (see the sketch after this list)

Overlapping canopy clustering

Based on q-grams and a ‘cheap’ similarity measure, such as Jaccard (set intersection) or TF-IDF/Cosine
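
A minimal sketch of q-gram blocking key generation as described above; the 80% minimum sub-list length is an illustrative threshold, not a recommended setting.

```python
from itertools import combinations

def qgrams(value, q=2):
    """Split a string into its list of q-grams (bigrams by default)."""
    return [value[i:i + q] for i in range(len(value) - q + 1)]

def blocking_keys(value, q=2, threshold=0.8):
    """Blocking keys from all sub-lists of the q-gram list whose length is
    at least threshold times the number of q-grams."""
    grams = qgrams(value, q)
    min_len = max(1, int(round(threshold * len(grams))))
    keys = set()
    for length in range(min_len, len(grams) + 1):
        for sub in combinations(grams, length):
            keys.add(''.join(sub))
    return keys

print(sorted(blocking_keys('peter')))
# 'peter' and 'pete' both generate the key 'peette' (from ['pe','et','te']),
# so the two records land in a common block despite the missing character.
```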

SLIDE 11

Recent indexing approaches (2)

StringMap based blocking
  Map strings into a multi-dimensional space such that distances between pairs of strings are preserved
  Use a similarity join to find similar pairs (close strings)

Suffix array based blocking
  Generate a suffix array based inverted index (suffix array: ‘peter’ → ‘eter’, ‘ter’, ‘er’, ‘r’); see the sketch after this list

Post-blocking filtering (for example, string length or q-gram count differences)

US Census Bureau: BigMatch (pre-process the ‘smaller’ data set so its values can be directly accessed, with all blocking passes in one go)
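
A minimal sketch of suffix based blocking; the minimum suffix length of 5 and the sample surnames are illustrative assumptions.

```python
from collections import defaultdict

def suffixes(value, min_len=5):
    """All suffixes of a string down to a minimum length."""
    return [value[i:] for i in range(len(value) - min_len + 1)]

def suffix_index(records, field):
    """Inverted index mapping each suffix to the records that contain it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        for suffix in suffixes(rec[field]):
            index[suffix].add(rec_id)
    return index

db = {'r1': {'surname': 'christen'},
      'r2': {'surname': 'kristen'},
      'r3': {'surname': 'smith'}}
index = suffix_index(db, 'surname')
# Records sharing a suffix end up in the same block; only blocks with more
# than one record generate candidate pairs.
print({k: v for k, v in index.items() if len(v) > 1})   # 'risten', 'isten' -> {r1, r2}
```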

SLIDE 12

Improving matching quality: Learning techniques

View record pair classification as a multi-dimensional binary classification problem

(use numerical attribute similarities to classify record pairs as matches or non-matches)

Many machine learning techniques can be used

Supervised: Decision trees, SVMs, neural networks, learnable string comparisons, active learning, etc. (see the sketch after this list)
Un-supervised: Various clustering algorithms

Recently, collective classification techniques have been investigated (build a graph of the database and conduct an overall classification, rather than classifying each record pair independently)
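
A minimal sketch of the supervised view of record pair classification using scikit-learn; the similarity vectors and match labels are made-up toy data, not from a real linkage project.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row holds similarities for (given name, surname, birth date, postcode)
# computed by approximate string comparators; label 1 = match, 0 = non-match.
X_train = [[1.0, 0.9, 1.0, 1.0],
           [0.9, 1.0, 1.0, 0.0],
           [0.2, 0.1, 0.0, 1.0],
           [0.0, 0.3, 0.0, 0.0]]
y_train = [1, 1, 0, 0]

clf = DecisionTreeClassifier().fit(X_train, y_train)

# Classify an unseen candidate record pair from its similarity vector:
print(clf.predict([[0.8, 1.0, 1.0, 0.5]]))   # e.g. array([1]) -> match
```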

SLIDE 13

Collective classification example

[Figure: author-paper graph for the authors and papers listed below; the weights w1 to w4 of edges between ambiguous author references are unknown]

Authors: (A1, Dave White, Intel), (A2, Don White, CMU), (A3, Susan Grey, MIT), (A4, John Black, MIT), (A5, Joe Brown, unknown), (A6, Liz Pink, unknown)
Papers: (P1, John Black / Don White), (P2, Sue Grey / D. White), (P3, Dave White), (P4, Don White / Joe Brown), (P5, Joe Brown / Liz Pink), (P6, Liz Pink / D. White)

Adapted from Kalashnikov and Mehrotra, ACM TODS, 31(2), 2006

SLIDE 14

Managing transitive closure

[Figure: four records a1 to a4 connected by pairwise match decisions]

If record a1 is classified as matching with record a2, and record a2 as matching with record a3, then records a1 and a3 must also be matching
Possibility of record chains occurring
Various algorithms have been developed to find optimal solutions (special clustering algorithms); a simple union-find sketch follows below

Collective classification deals with this problem by default
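
A minimal union-find sketch of computing the transitive closure over pairwise match decisions; this illustrates the problem rather than any particular published clustering algorithm.

```python
def transitive_closure(record_ids, matched_pairs):
    """Group records into the clusters implied by pairwise match decisions."""
    parent = {r: r for r in record_ids}

    def find(r):                              # find the cluster representative
        while parent[r] != r:
            parent[r] = parent[parent[r]]     # path compression
            r = parent[r]
        return r

    for a, b in matched_pairs:                # union the two clusters
        parent[find(a)] = find(b)

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())

# a1-a2 and a2-a3 classified as matches: a1, a2 and a3 form one cluster.
print(transitive_closure(['a1', 'a2', 'a3', 'a4'],
                         [('a1', 'a2'), ('a2', 'a3')]))
```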

SLIDE 15

Classification challenges

In many cases there is no training data available

Possible to use results of earlier data matching projects? Or from a manual clerical review process?
How confident can we be about correct manual classification of potential matches?

Often there is no gold standard available

(no data sets with known true match status)

No large test data set collections available

(like in information retrieval or machine learning)

Many data matching researchers use synthetic or bibliographic data (which have very different characteristics)

SLIDE 16

Privacy-preserving record linkage

[Figure: a two-party protocol between Alice and Bob, and a three-party protocol in which Alice and Bob send encoded data to a third party, Carol; numbers indicate the protocol steps]

Assume two data sources, and possibly a third (trusted) party to conduct the matching
Objective: no party learns about the other parties’ private data, only matched records are revealed

Various approaches with different assumptions about threats, what can be inferred by parties, and what is being released

Based on some form of encoding or encryption techniques

SLIDE 17

Research at ANU 1: Collaboration with NSW Health

From 2002 to 2009, funded by ANU, APAC, and an ARC Linkage Project
Developed open source software Febrl (Freely extensible biomedical record linkage)

Several research areas
  Probabilistic techniques for automated data cleaning and standardisation (mainly of addresses)
  Novel geocode matching techniques
  New and improved blocking and indexing techniques
  Improved record pair classification using un-supervised machine learning techniques
  Improved performance (scalability and parallelism)

SLIDE 18

Research at ANU 2: Privacy-preserving record linkage

Currently 1 PhD student, with an ARC Discovery Project starting this year (collaboration with Vassilios Verykios, Greece)

Work so far has focused on scalability to large databases, and on two-party protocols
Protocols based on Bloom filters (bit strings used to calculate Dice/Jaccard similarities; see the sketch after this list)
We developed a taxonomy for PPRL techniques
Current work is on privacy measures for PPRL
Future work will focus on matching data from multiple parties, and on assessing matching quality and completeness in PPRL
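
A minimal sketch of the Bloom filter encoding idea (q-grams hashed into a bit string, Dice coefficient computed on bit strings); the filter length, number of hash functions, and hashing scheme are illustrative assumptions, not the protocol used in this project.

```python
import hashlib

BITS = 100         # illustrative Bloom filter length
NUM_HASHES = 2     # illustrative number of hash functions per q-gram

def bloom_encode(value, q=2):
    """Hash the q-grams of a string into a fixed-length bit array."""
    bf = [0] * BITS
    grams = [value[i:i + q] for i in range(len(value) - q + 1)]
    for gram in grams:
        for seed in range(NUM_HASHES):
            digest = hashlib.sha1(f"{seed}:{gram}".encode()).hexdigest()
            bf[int(digest, 16) % BITS] = 1
    return bf

def dice(bf1, bf2):
    """Dice coefficient: 2 * common set bits / (set bits in bf1 + set bits in bf2)."""
    common = sum(a & b for a, b in zip(bf1, bf2))
    return 2.0 * common / (sum(bf1) + sum(bf2))

# Each party encodes its values locally and only the bit strings are compared.
print(dice(bloom_encode('peter'), bloom_encode('pete')))     # high similarity
print(dice(bloom_encode('peter'), bloom_encode('miller')))   # low similarity
```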

SLIDE 19

Research at ANU 3: Historical census data matching

A collaboration with the Australian Demographic and Social Research Institute and 1 PhD student
Aim is to match individuals and households from census returns across time (UK, 1851 to 1901)
We developed novel group linkage approaches, achieving much improved matching quality
Currently exploring collective classification techniques

SLIDE 20

Research at ANU 4: Real-time and dynamic matching

Collaboration with Veda (Australian credit bureau) and Funnelback (enterprise search) through an ARC Linkage Project (1 post-doc and 1 PhD)
Aim is to develop real-time matching techniques for dynamic databases to detect identity fraud (using a consumer credit application database)

Initial work on similarity-aware indexing (AusDM’08, CIKM’09, DMApps’13); a sketch of the idea follows the figure below

[Figure: similarity-aware inverted index example with name values ‘peter’, ‘smith’, ‘smyth’, ‘myler’, ‘miller’, ‘millar’, pre-computed pairwise similarities between 0.7 and 1.0, and record identifiers r1 to r8 stored in the index]
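
A minimal sketch of what a similarity-aware inverted index can look like: similarities between indexed values are pre-computed at insertion time so that a query only needs index look-ups. The class and the cheap similarity function are illustrative assumptions, not the published data structure.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def sim(a, b):
    """Cheap illustrative string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

class SimilarityAwareIndex:
    """Inverted index on attribute values that pre-computes similarities
    between indexed values at insertion time, so that querying a value only
    needs index look-ups instead of expensive comparisons."""

    def __init__(self, min_sim=0.7):
        self.min_sim = min_sim
        self.postings = defaultdict(set)      # value -> record identifiers
        self.neighbours = defaultdict(dict)   # value -> {similar value: similarity}

    def insert(self, rec_id, value):
        if value not in self.postings:        # new value: pre-compute similarities
            for other in self.postings:
                s = sim(value, other)
                if s >= self.min_sim:
                    self.neighbours[value][other] = s
                    self.neighbours[other][value] = s
        self.postings[value].add(rec_id)

    def query(self, value):
        """Candidate records whose value is identical or pre-computed as similar."""
        candidates = set(self.postings.get(value, set()))
        for other in self.neighbours.get(value, {}):
            candidates |= self.postings[other]
        return candidates

idx = SimilarityAwareIndex()
for rec_id, name in [('r1', 'smith'), ('r2', 'smyth'),
                     ('r3', 'miller'), ('r4', 'millar')]:
    idx.insert(rec_id, name)
print(idx.query('smith'))   # {'r1', 'r2'}: 'smyth' was pre-computed as similar
```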

SLIDE 21

Challenges and research directions

Improved classification for data matching (collective classification for personal data)
Matching data from many sources
Use cloud computing platforms for large-scale data matching (privacy and cost efficiency)
Real-time matching and matching dynamic data
Develop and implement frameworks for data matching that allow comparative studies of different techniques (benchmarks)
Develop practical PPRL techniques (that facilitate accurate and automatic classification, as well as evaluation of matching quality and completeness)

SLIDE 22

Advertisement: Book ‘Data Matching’

“The book is very well organized and exceptionally well written. Because of the depth, amount, and quality of the material that is covered, I would expect this book to be one of the standard references in future years.”
William E. Winkler, U.S. Bureau of the Census

SLIDE 23

Synthetic data: Advantages

Privacy issues prohibit publication of real personal information
De-identified or encrypted data cannot be used to match databases (as real name and address values are required)

Several advantages of synthetic data
  Volume and characteristics can be controlled (errors and variations in records, number of duplicates, etc.)
  It is known which records are duplicates of each other, and so matching quality can be calculated
  Data and the data generator program can be published (allowing others to repeat experiments)

SLIDE 24

Synthetic data: Challenges

Modelling the content and characteristics of real data (frequencies of values; variations and errors)
Modelling dependencies between attributes (for example, given names often depend on gender)

Several data generators have been developed
  Hernandez and Stolfo (mid 1990s): only based on value tables, no frequencies, simple typographic errors
  Bertolazzi et al. (2003): added frequency tables, allowed missing values, still simple error generation
  Christen et al. (2009): added look-up tables with misspellings and nicknames, model different error types (typing, phonetic, OCR), attribute dependencies, generate households
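
A minimal sketch of injecting a single typographic error when generating a synthetic duplicate (not the Febrl data generator); the error operations follow the categories listed above.

```python
import random

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def typo(value, rng=random):
    """Apply one random typographic error: substitution, insertion,
    deletion, or transposition of characters."""
    pos = rng.randrange(len(value))
    op = rng.choice(['sub', 'ins', 'del', 'trans'])
    if op == 'sub':
        return value[:pos] + rng.choice(ALPHABET) + value[pos + 1:]
    if op == 'ins':
        return value[:pos] + rng.choice(ALPHABET) + value[pos:]
    if op == 'del':
        return value[:pos] + value[pos + 1:]
    # transpose the character at pos with its right-hand neighbour
    pos = min(pos, len(value) - 2)
    return value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]

# Create a corrupted synthetic duplicate of an original value:
random.seed(42)
print(typo('christen'))   # e.g. 'christan', 'chrisen', 'hcristen', ...
```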

SLIDE 25

Modelling of variations and errors

[Figure: sources of variations and errors along different data entry paths (typed, printed, handwritten, from memory, dictated; via OCR, speech recognition, or typing into an electronic document), annotated with the error types each path can introduce: character changes (typographic, phonetic, or OCR substitutions, insertions, deletions, transpositions), word splits and merges, and attribute swaps and replacements]

Abbreviations: cc: character change; wc: word change; sub: substitution; ins: insertion; del: deletion; trans: transpose; repl: replace; ty: typographic; ph: phonetic; attr: attribute