data cleaning and standardisation 1 data cleaning and
play

Data cleaning and standardisation (1) Data cleaning and - PDF document

Data cleaning and standardisation Febrl A parallel open source record linkage and geocoding system Record linkage and data integration Febrl overview Peter Christen Probabilistic data cleaning and standardisation Data Mining Group,


  1. Data cleaning and standardisation Febrl – A parallel open source record linkage and geocoding system Record linkage and data integration Febrl overview Peter Christen Probabilistic data cleaning and standardisation Data Mining Group, Australian National University Blocking / indexing in collaboration with Centre for Epidemiology and Research, New South Wales Department of Health Record pair classification Parallelisation in Febrl Contact: peter.christen@anu.edu.au Data set generation Project web page: http://datamining.anu.edu.au/linkage.html Geocoding Funded by the ANU, the NSW Department of Health, the Australian Research Council (ARC), and the Australian Partnership for Advanced Computing (APAC) Outlook Peter Christen, April 2005 – p.1/36 Peter Christen, April 2005 – p.2/36 Data cleaning and standardisation (1) Data cleaning and standardisation (2) Real world data is often dirty Name Address Date of Birth Doc Peter Miller 42Main Rd.App. 3a Canberra A.C.T. 2600 29/4/1986 Missing values, inconsistencies Typographical and other errors Street Locality Title Givenname Surname Day Month Year Different coding schemes / formats doctor peter miller 42Main Rd. App. 3a CanberraA.C.T. 2600 29 4 1986 Out-of-date data Wayfare Wayfare Wayfare Unit Names and addresses are especially prone to no. name type Unittype no. Localityname Territory Postcode data entry errors 42 main road apartment 3a canberra act 2600 Cleaned and standardised data is needed for Remove unwanted characters and words Loading into databases and data warehouses Expand abbreviations and correct misspellings Data mining and other data analysis studies Segment data into well defined output fields Record linkage and data integration Peter Christen, April 2005 – p.3/36 Peter Christen, April 2005 – p.4/36 Record linkage and data integration Record linkage techniques The task of linking together records representing the same entity from one or more data sources Deterministic or exact linkage If no unique identifier is available, probabilistic A unique identifier is needed, which is of high quality linkage techniques have to be applied (precise, robust, stable over time, highly available) Applications of record linkage For example Medicare , ABN or Tax file number (are they really unique, stable, trustworthy?) Remove duplicates in a data set (internal linkage) Probabilistic linkage ( Fellegi & Sunter , 1969) Merge new records into a larger master data set Create customer or patient oriented statistics Apply linkage using available (personal) information Compile data for longitudinal studies Examples: names , addresses , dates of birth Other techniques Geocode data (rule-based, fuzzy approach, information retrieval) Data cleaning and standardisation are important first steps for successful record linkage Peter Christen, April 2005 – p.5/36 Peter Christen, April 2005 – p.6/36 Febrl – Freely extensible biomedical Probabilistic data cleaning and record linkage standardisation Three step approach in Febrl An experimental platform for new and improved linkage algorithms 1. Cleaning – Based on look-up tables and correction lists Modules for data cleaning and standardisation, – Remove unwanted characters and words record linkage, deduplication and geocoding – Correct various misspellings and abbreviations Free, open source https://sourceforge.net/projects/febrl/ 2. Tagging Implemented in Python http://www.python.org – Split input into a list of words, numbers and separators Easy and rapid prototype software development – Assign one or more tags to each element of this list Object-oriented and cross-platform (Unix, Win, Mac) (using look-up tables and some hard-coded rules) 3. Segmenting Can handle large data sets stable and efficiently – Use either rules or a hidden Markov model (HMM) Many external modules, easy to extend to assign list elements to output fields

  2. Assume the input component is one string Cleaned strings are split at whitespace boundaries (either name or address – dates are processed differently) into lists of words, numbers, characters, etc. Convert all letters into lower case Using look-up tables and some hard-coded rules, each element is tagged with one or more tags Use correction lists which contain pairs of original:replacement strings Example: An empty replacement string results in removing Uncleaned input string: “Doc. peter Paul MILLER” the original string Cleaned string: “dr peter paul miller” Correction lists are stored in text files and can be Word and tag lists: modified by the user [‘dr’, ‘peter’, ‘paul’, ‘miller’] Different correction lists for names and addresses [‘TI’, ‘GM/SN’, ‘GM’, ‘SN’ ] Peter Christen, April 2005 – p.9/36 Peter Christen, April 2005 – p.10/36 Step 3: Segmenting Hidden Markov model (HMM) Givenname 5% 25% 55% Using the tag list, assign elements in the word list 85% 5% 5% 30% to the appropriate output fields 20% Start Title Middlename End 5% 65% Rules based approach (e.g. AutoStan ) 10% 100% 15% Surname 75% Example: “if an element has tag ‘TI’ then assign the corresponding word to the ‘Title’ output field” A HMM is a probabilistic finite state machine Hard to develop and maintain rules Made of a set of states and transition probabilities Different sets of rules needed for different data sets between these states Hidden Markov model (HMM) approach In each state an observation symbol is emitted with a certain probability distribution A machine learning technique (supervised learning) In our approach, the observation symbols are tags and Training data is needed to build HMMs the states correspond to the output fields Peter Christen, April 2005 – p.11/36 Peter Christen, April 2005 – p.12/36 HMM probability matrices HMM data segmentation Givenname 5% Givenname 5% 25% 25% 55% 55% 85% 5% 85% 5% 5% 5% 30% 30% Start Title 20% Middlename End Start Title 20% Middlename End 5% 65% 5% 65% 10% 10% 100% 100% 15% 15% Surname 75% Surname 75% For an observation sequence we are interested in State the most likely path through a given HMM Observation Start Title Givenname Middlename Surname End (in our case an observation sequence is a tag list ) TI – 96% 1% 1% 1% – GM – 1% 35% 33% 15% – The Viterbi algorithm is used for this task GF – 1% 35% 27% 14% – (a dynamic programming approach) SN – 1% 9% 14% 45% – Smoothing is applied to account for unseen data UN – 1% 20% 25% 25% – (assign small probabilities for unseen observation symbols) Peter Christen, April 2005 – p.13/36 Peter Christen, April 2005 – p.14/36 Address HMM standardisation HMM segmentation example example 5% Givenname 5% 25% 3% 55% 85% 5% 20% 90% Wayfare Wayfare 5% Territory Number 95% Type 10% 30% 40% 20% 80% Start Title Middlename End 95% Start End 2% 95% 2% 5% 65% Wayfare Locality Post− 90% 10% 8% 2% 100% Name Name code 15% 40% Surname 75% 3% 2% 18% Input word and tag list Raw input: ’73 Miller St, NORTH SYDENY 2060’ [‘dr’, ‘peter’, ‘paul’, ‘miller’] Cleaned into: ’73 miller street north sydney 2060’ [‘TI’, ‘GM/SN’, ‘GM’, ‘SN’ ] Word and tag lists: [’73’, ’miller’, ’street’, ’north_sydney’, ’2060’] Two example paths through the HMM [’NU’, ’UN’, ’WT’, ’LN’, ’PC’ ] 1: Start -> Title (TI) -> Givenname (GM) -> Middlename (GM) -> Surname (SN) -> End Example path through HMM 2: Start -> Title (TI) -> Surname (SN) -> Givenname (GM) -> Start -> Wayfare Number (NU) -> Wayfare Name (UN) -> Wayfare Surname (SN) -> End Type (WT) -> Locality Name (LN) -> Postcode (PC) -> End

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend