Genealogical Place Name Normalization Bob Leaman - PowerPoint PPT Presentation


  1. Genealogical Place Name Normalization Bob Leaman (bob.leaman@asu.edu)

  2. What is meant by “Normalization”?
     • Enforcing a standardized representation
     • Increases accuracy
     • Data shared over e-mail can be very hard to correct
     • Easier record linkage
     • Automated merging
     • Automated research
     (Slide footer: 3 Apr 2003, Genealogical Place Name Normalization)

  3. What format to use?
     • Fixed three-level
       • Mesa, Maricopa, Arizona
     • Variable-level
       • Mesa, Maricopa, Arizona, United States
     • Note absence of descriptors
       • “Of”, “Near”, etc.

  4. The Problem: What kinds of deviations from the standard are common?
     • Biographical notes
       • Johnsville, Arkansas. He had 6 children
     • Addresses and e-mails
     • Hospital, church and cemetery names
       • Bluff Cemetery, Elgin, Ill. → Elgin, Ill.
     • Leaving out one or more of the levels
       • Vancouver, Washington → Vancouver, Clark, Washington, United States

  5. The Problem
     • Excluding the comma between two of the place names
       • San Leandro CA → San Leandro, CA
     • Using an abbreviated, truncated, or alternate form of a place name
       • UT → Utah
       • Tenn → Tennessee
       • Holland → Denmark
     • Misspelling place names
       • Ypfilanti, Washtinaud, Michigan → Ypsilanti, Washtenaw, Michigan
     • Algorithmic contractions such as removing all vowels after the first letter
       • Oxfrd → Oxford
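
The missing-comma deviation on this slide can be sketched with a single regular expression. This is only an illustration, not the rule set used in the presentation: the `STATE_ABBREVIATIONS` set and the `insert_missing_comma` helper are hypothetical names, and the abbreviation list is deliberately tiny.

```python
import re

# Hypothetical, deliberately small set of postal abbreviations (assumption).
STATE_ABBREVIATIONS = {"CA", "UT", "AZ", "MI", "WA"}

# A head that does not end in whitespace or a comma, then whitespace,
# then a trailing two-letter token at the end of the string.
_TRAILING_TOKEN = re.compile(r"^(?P<head>.*[^\s,])\s+(?P<tail>[A-Za-z]{2})$")


def insert_missing_comma(place: str) -> str:
    """Insert a comma before a trailing state abbreviation that lacks one."""
    text = place.strip()
    match = _TRAILING_TOKEN.match(text)
    if match and match.group("tail").upper() in STATE_ABBREVIATIONS:
        return f"{match.group('head')}, {match.group('tail')}"
    return text
```

With this sketch, `insert_missing_comma("San Leandro CA")` returns `"San Leandro, CA"`, while a name that already has the comma is returned unchanged.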

  6. Strategy
     • Preprocessing: remove everything that is not part of the place name
     • Match against a name variations database (thesaurus)
     • Match against standardized names database (gazetteer)

  7. Preprocessing Place Names
     • Use regular expressions to detect patterns
       • 38th year, Benedict, Kansas. Buried High Prairie Cem, Wilson, Kansas
         becomes
       • 38th year, Benedict, Kansas.
         becomes
       • Benedict, Kansas
     • List of “note words” (e.g. occupations, causes of death, etc.)
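
A minimal sketch of this kind of preprocessing follows, assuming a two-pass approach: first drop trailing sentences that contain a note word or digit, then drop comma-separated pieces that do. The slides do not give the actual patterns, so the `NOTE_WORDS` list and the `preprocess` function are hypothetical stand-ins.

```python
import re

# Hypothetical "note words" (assumption); the real list is not in the slides.
NOTE_WORDS = {"buried", "year", "children", "farmer", "died", "cemetery", "cem"}

_SENTENCE_BREAK = re.compile(r"\.\s+(?=[A-Z])")  # ". " followed by a capital
_WORD = re.compile(r"[A-Za-z]+")


def _has_note(text: str) -> bool:
    """True if the text contains a digit or any note word."""
    if re.search(r"\d", text):
        return True
    return any(word.lower() in NOTE_WORDS for word in _WORD.findall(text))


def preprocess(raw: str) -> str:
    """Strip non-place material from a raw PLAC value (illustrative only)."""
    sentences = _SENTENCE_BREAK.split(raw.strip().rstrip("."))
    # Always keep the first sentence; keep later ones only if note-free,
    # so that "St. Louis" is not cut apart.
    kept = [sentences[0]] + [s for s in sentences[1:] if not _has_note(s)]
    text = ". ".join(kept)
    # Within what is left, drop comma-separated pieces that look like notes.
    pieces = [p.strip() for p in text.split(",")]
    return ", ".join(p for p in pieces if p and not _has_note(p))
```

On the slide's example, `preprocess("38th year, Benedict, Kansas. Buried High Prairie Cem, Wilson, Kansas")` yields `"Benedict, Kansas"`.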

  8. Preprocessing Place Names
     • Tested on 2450 randomly selected “PLAC” fields from 10 different GEDCOM files
     • Each was preprocessed by hand: 58.4% required modification
     • Preprocessing via the system matched preprocessing by hand 97.6% of the time

  9. Handling Name Variations
     • At this point all non-place name information has been removed
     • Each place name is looked up in a database of alternate names (thesaurus)
       • Livonia, MI → {Livonia, MI & Livonia, Michigan}
     • The original is included in case the wrong alternate was recorded originally
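
The thesaurus step can be sketched as a dictionary lookup per piece, followed by a cross product of the alternatives. The `THESAURUS` contents and the `expand_variations` name are assumptions for illustration; the presentation's actual database is not shown.

```python
from itertools import product

# Hypothetical thesaurus: lowercase variant -> list of alternate names.
THESAURUS = {
    "mi": ["Michigan"],
    "ut": ["Utah"],
    "tenn": ["Tennessee"],
}


def expand_variations(place: str) -> list[str]:
    """Return every combination of original and alternate piece names."""
    pieces = [p.strip() for p in place.split(",") if p.strip()]
    options = []
    for piece in pieces:
        alternates = [piece]  # the original is always kept as a candidate
        for alt in THESAURUS.get(piece.lower(), []):
            if alt not in alternates:
                alternates.append(alt)
        options.append(alternates)
    return [", ".join(combo) for combo in product(*options)]
```

For the slide's example, `expand_variations("Livonia, MI")` returns `["Livonia, MI", "Livonia, Michigan"]`.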

  10. Place Name Matching
     • Created a place name database
       • Mostly GNIS data
       • Includes all of the United States and some of England and Canada
       • Nearly 160,000 places
     • Database format
       • A single table was used to hold all place records
       • Utilized unique identifiers to point to the “parent” record
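
A single table with parent pointers could look like the sketch below, using an in-memory SQLite database. The table name `place`, its columns, and the sample rows are assumptions; the slides do not give the real schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny self-referencing place table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE place ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " parent_id INTEGER REFERENCES place(id))"
    )
    conn.executemany(
        "INSERT INTO place (id, name, parent_id) VALUES (?, ?, ?)",
        [
            (1, "United States", None),
            (2, "Arizona", 1),
            (3, "Maricopa", 2),
            (4, "Mesa", 3),
        ],
    )
    return conn


def full_name(conn: sqlite3.Connection, place_id: int) -> str:
    """Follow parent pointers upward and join the names most-specific first."""
    names = []
    current = place_id
    while current is not None:
        row = conn.execute(
            "SELECT name, parent_id FROM place WHERE id = ?", (current,)
        ).fetchone()
        if row is None:
            raise KeyError(f"no place with id {current}")
        names.append(row[0])
        current = row[1]
    return ", ".join(names)
```

Here `full_name(build_demo_db(), 4)` gives `"Mesa, Maricopa, Arizona, United States"`, the variable-level form described earlier in the slides.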

  11. Place Name Matching
     • Need to find the place name in the database that maximizes the “similarity” with respect to the input place name
       • 0 = no match
       • 1 = perfect match
     • Calculated using the average “similarity” of the individual pieces of the place name
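
The averaging idea can be sketched as follows. Two things here are assumptions rather than the presentation's method: `difflib.SequenceMatcher` stands in for the piece-level similarity, and each input piece is scored against its best-matching candidate piece because the slides do not say how pieces are aligned when levels are missing.

```python
from difflib import SequenceMatcher


def piece_similarity(a: str, b: str) -> float:
    """Stand-in piece score in [0, 1]; not the decision-tree score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def _pieces(place: str) -> list[str]:
    return [p.strip() for p in place.split(",") if p.strip()]


def place_similarity(query: str, candidate: str) -> float:
    """Average, over the query pieces, of the best piece-level similarity."""
    query_pieces = _pieces(query)
    candidate_pieces = _pieces(candidate)
    if not query_pieces or not candidate_pieces:
        return 0.0
    total = sum(
        max(piece_similarity(q, c) for c in candidate_pieces)
        for q in query_pieces
    )
    return total / len(query_pieces)


def best_match(query: str, gazetteer: list[str]) -> str:
    """Return the gazetteer entry that maximizes the averaged similarity."""
    return max(gazetteer, key=lambda cand: place_similarity(query, cand))
```

A linear scan like `best_match` is only practical for small lists; a gazetteer of the size mentioned on the previous slide would need some form of indexing.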

  12. Place Name Matching
     • Used the elements of the edit distance metric
       • Substitution, insertion, deletion
     • Added transposition, length of the longest common substring & a measure of truncation
     • Sorted through the several data points per potential match with a decision tree
       • Trained using the metric scores from a test set of place name pieces matched by hand
         • S Lk, Salt Lake, TRUE
     • Used the proportion of test cases that were matches in any leaf of the tree as the “similarity” score
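
The per-pair data points named on this slide could be computed roughly as below. This is a sketch under assumptions: the operation counts come from one optimal alignment of the restricted Damerau-Levenshtein (optimal string alignment) distance, and the truncation measure is a hypothetical definition (the fraction of the longer name missing when the shorter one is its prefix). The decision tree itself is not shown.

```python
def edit_operation_counts(source: str, target: str) -> dict[str, int]:
    """Count edit operations on one optimal alignment of source -> target."""
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j

    def swapped(i: int, j: int) -> bool:
        # True if the last two characters of each prefix are transposed.
        return (i > 1 and j > 1 and source[i - 1] == target[j - 2]
                and source[i - 2] == target[j - 1])

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            best = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                       dist[i - 1][j - 1] + cost)
            if swapped(i, j):
                best = min(best, dist[i - 2][j - 2] + 1)
            dist[i][j] = best

    counts = {"substitution": 0, "insertion": 0, "deletion": 0,
              "transposition": 0}
    i, j = n, m
    while i > 0 or j > 0:  # walk back along one optimal path
        if (i > 0 and j > 0 and source[i - 1] == target[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1
        elif swapped(i, j) and dist[i][j] == dist[i - 2][j - 2] + 1:
            counts["transposition"] += 1
            i, j = i - 2, j - 2
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            counts["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def truncation(a: str, b: str) -> float:
    """Hypothetical measure: share of the longer name cut off by the shorter."""
    short, long_ = sorted((a, b), key=len)
    if not long_ or not long_.startswith(short):
        return 0.0
    return 1.0 - len(short) / len(long_)


def match_features(a: str, b: str) -> dict[str, float]:
    """Feature vector for one (input piece, candidate piece) pair."""
    a, b = a.lower(), b.lower()
    features: dict[str, float] = dict(edit_operation_counts(a, b))
    features["longest_common_substring"] = longest_common_substring(a, b)
    features["truncation"] = truncation(a, b)
    return features
```

For `match_features("Oxfrd", "Oxford")` this reports one insertion, no other edit operations, and a longest common substring of length 3. Vectors like these, labeled TRUE or FALSE by hand, are what a decision tree would then be trained on.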

  13. Place Name Matching
     • Tested on 330 randomly selected “PLAC” fields from 10 different GEDCOM files
     • Each was preprocessed and matched by hand: 99.1% required modification after preprocessing
     • The first-ranked match was the same as the match found by hand 97.9% of the time
     • The average rank of the match generated by hand was 1.21
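
The two figures reported here, first-rank agreement and average rank of the hand match, can be computed as in this sketch. The `evaluate` function and its inputs are hypothetical; the data in the usage note is made up for illustration and is not the presentation's test set.

```python
def evaluate(ranked_results: list[list[str]],
             hand_matches: list[str]) -> tuple[float, float]:
    """Return (share of cases ranked first, mean 1-based rank of hand match).

    Assumes every hand match appears somewhere in its ranked list;
    list.index raises ValueError otherwise.
    """
    if not hand_matches or len(ranked_results) != len(hand_matches):
        raise ValueError("need one ranked list per hand match")
    ranks = [
        candidates.index(truth) + 1
        for candidates, truth in zip(ranked_results, hand_matches)
    ]
    top1 = sum(1 for rank in ranks if rank == 1) / len(ranks)
    mean_rank = sum(ranks) / len(ranks)
    return top1, mean_rank
```

As a toy usage example, three ranked lists whose hand matches sit at ranks 1, 2 and 1 give a first-rank agreement of 2/3 and an average rank of 4/3.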

  14. Future Directions
     • Recognize when the best match is not satisfactory
     • Acquisition of a suitable thesaurus and gazetteer
     • Alexandria Digital Library Project
     • Historical place information
     • Increased productization
     • Indexing scheme
     • Internationalization

  15. Questions?
     • Reference: K. Kukich. Techniques for Automatically Correcting Words in Text. ACM Computing Surveys, 24(4):377-440, Dec. 1992.
