SLIDE 1

Genealogical Place Name Normalization

Bob Leaman

(bob.leaman@asu.edu)

SLIDE 2

3 Apr 2003 Genealogical Place Name Normalization

What is meant by “Normalization”?

  • Enforcing a standardized representation
  • Increases accuracy
  • Data shared over e-mail can be very hard to correct
  • Easier record linkage
  • Automated merging
  • Automated research

SLIDE 3

What format to use?

  • Fixed three-level
      • Mesa, Maricopa, Arizona
  • Variable-level
      • Mesa, Maricopa, Arizona, United States
  • Note absence of descriptors
      • “Of”, “Near”, etc.
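
The two formats can be illustrated with a minimal sketch that splits a comma-separated place string into levels, most specific first. The function names `parse_levels` and `is_fixed_three_level` are hypothetical and are not taken from the slides.

```python
def parse_levels(text):
    """Split a place string into a tuple of levels, most specific first."""
    return tuple(piece.strip() for piece in text.split(",") if piece.strip())


def is_fixed_three_level(levels):
    """True when the place has exactly three levels, as in the fixed format."""
    return len(levels) == 3


fixed = parse_levels("Mesa, Maricopa, Arizona")
variable = parse_levels("Mesa, Maricopa, Arizona, United States")
print(fixed, is_fixed_three_level(fixed))
print(variable, is_fixed_three_level(variable))
```

A tuple of levels handles both formats with one representation; the fixed format is simply the case where the tuple has length three.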

SLIDE 4

The Problem

What kinds of deviations from the standard are common?

  • Biographical notes
      • Johnsville, Arkansas. He had 6 children
  • Addresses and e-mails
  • Hospital, church and cemetery names
      • Bluff Cemetery, Elgin, Ill. → Elgin, Ill.
  • Leaving out one or more of the levels
      • Vancouver, Washington → Vancouver, Clark, Washington, United States

SLIDE 5

The Problem

  • Excluding the comma between two of the place names
      • San Leandro CA → San Leandro, CA
  • Using an abbreviated, truncated, or alternate form of a place name
      • UT → Utah
      • Tenn → Tennessee
      • Holland → Netherlands
  • Misspelling place names
      • Ypfilanti, Washtinaud, Michigan → Ypsilanti, Washtenaw, Michigan
  • Algorithmic contractions such as removing all vowels after the first letter
      • Oxfrd → Oxford

SLIDE 6

Strategy

  • Preprocessing – remove everything that is not part of the place name
  • Match against a name variations database (thesaurus)
  • Match against standardized names database (gazetteer)

SLIDE 7

Preprocessing Place Names

  • Use regular expressions to detect patterns
      • 38th year, Benedict, Kansas. Buried High Prairie Cem, Wilson, Kansas
        becomes
      • 38th year, Benedict, Kansas.
        becomes
      • Benedict, Kansas
  • List of “note words” (e.g. occupations, causes of death, etc.)
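
A minimal sketch of this kind of preprocessing follows. The regular expressions and the note-word list are illustrative assumptions (the slides do not give the actual patterns), and `preprocess` is a hypothetical name.

```python
import re

# Hypothetical patterns and word list; the real ones are not in the slides.
SENTENCE_BREAK = re.compile(r"\.\s+(?=[A-Z])")  # ". " followed by a capital
AGE_NOTE = re.compile(r"^\d+(?:st|nd|rd|th)\s+year$", re.IGNORECASE)
NOTE_WORDS = {"buried", "farmer", "drowned"}


def preprocess(raw):
    """Return only the part of a PLAC value that looks like a place name."""
    # Keep the first sentence; later sentences are treated as notes.
    first = SENTENCE_BREAK.split(raw.strip(), maxsplit=1)[0]
    kept = []
    for piece in first.split(","):
        piece = piece.strip(" .")
        if not piece or AGE_NOTE.match(piece):
            continue  # empty piece or an age note such as "38th year"
        if any(word.lower() in NOTE_WORDS for word in piece.split()):
            continue  # piece contains a note word
        kept.append(piece)
    return ", ".join(kept)


print(preprocess("38th year, Benedict, Kansas. Buried High Prairie Cem, Wilson, Kansas"))
```

The sentence-break rule is deliberately naive: an abbreviation followed by a capitalized word (for example "St. Louis") would be split incorrectly, so a real rule set would need exceptions.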

SLIDE 8

Preprocessing Place Names

  • Tested on 2450 randomly selected “PLAC” fields from 10 different GEDCOM files
  • Each was preprocessed by hand: 58.4% required modification
  • Preprocessing via the system matched preprocessing by hand 97.6% of the time

SLIDE 9

Handling Name Variations

  • At this point all non-place name information has been removed
  • Each place name is looked up in a database of alternate names (thesaurus)
      • Livonia, MI → {Livonia, MI & Livonia, Michigan}
  • The original is included in case the wrong alternate was recorded originally
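
The lookup can be sketched as an expansion step that returns the original string together with every variant found in the thesaurus. The `THESAURUS` dictionary below is a tiny made-up stand-in for the alternate-names database, and `expand` is a hypothetical name.

```python
from itertools import product

# Illustrative stand-in for the thesaurus; keys are lowercased pieces.
THESAURUS = {
    "mi": ["Michigan"],
    "ut": ["Utah"],
    "tenn": ["Tennessee"],
}


def expand(place):
    """Return the original place plus every combination of known alternates."""
    pieces = [piece.strip() for piece in place.split(",")]
    # Each piece keeps its original form first, followed by its alternates.
    options = [[piece] + THESAURUS.get(piece.lower(), []) for piece in pieces]
    return [", ".join(combo) for combo in product(*options)]


print(expand("Livonia, MI"))
```

Because the original piece is always the first option, the unmodified input is always among the candidates passed on to matching.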

SLIDE 10

Place Name Matching

  • Created a place name database
      • Mostly GNIS data
      • Includes all of the United States and some of England and Canada
      • Nearly 160,000 places
  • Database format
      • A single table was used to hold all place records
      • Utilized unique identifiers to point to the “parent” record
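
A single table with parent pointers might look like the following sketch, which uses an in-memory SQLite database. The table name, column names, and sample rows are assumptions for illustration; the slides do not describe the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: every place, at any level, is a row in one table.
conn.execute(
    "CREATE TABLE place ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " parent_id INTEGER REFERENCES place(id))"
)
conn.executemany(
    "INSERT INTO place (id, name, parent_id) VALUES (?, ?, ?)",
    [
        (1, "United States", None),
        (2, "Arizona", 1),
        (3, "Maricopa", 2),
        (4, "Mesa", 3),
    ],
)


def full_name(conn, place_id):
    """Follow parent pointers upward and join the names, most specific first."""
    names = []
    while place_id is not None:
        row = conn.execute(
            "SELECT name, parent_id FROM place WHERE id = ?", (place_id,)
        ).fetchone()
        if row is None:
            break  # dangling parent pointer
        name, place_id = row
        names.append(name)
    return ", ".join(names)


print(full_name(conn, 4))
```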

SLIDE 11

Place Name Matching

  • Need to find the place name in the database that maximizes the “similarity” with respect to the input place name
      • 0 = no match
      • 1 = perfect match
  • Calculated using the average “similarity” of the individual pieces of the place name
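
The averaging step can be sketched as below. The per-piece score here is `difflib.SequenceMatcher.ratio()` from the Python standard library, used purely as a stand-in for the presentation's own similarity measure. Pairing pieces by position and scoring a missing piece as 0 are also assumptions.

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def piece_similarity(a, b):
    """Stand-in score in [0, 1] for two place name pieces."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def place_similarity(query, candidate):
    """Average the piece scores; a piece with no partner scores 0."""
    q = [piece.strip() for piece in query.split(",")]
    c = [piece.strip() for piece in candidate.split(",")]
    scores = [piece_similarity(x, y) for x, y in zip_longest(q, c, fillvalue="")]
    return sum(scores) / len(scores)


def best_match(query, gazetteer):
    """Return the gazetteer entry with the highest average similarity."""
    return max(gazetteer, key=lambda cand: place_similarity(query, cand))


gazetteer = ["Ypsilanti, Michigan", "Lansing, Michigan"]
print(best_match("Ypfilanti, Michigan", gazetteer))
```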

SLIDE 12

Place Name Matching

  • Used the elements of the edit distance metric
      • Substitution, insertion, deletion
      • Added transposition, length of the longest common substring & a measure of truncation
  • Sorted through the several data points per potential match with a decision tree
      • Trained using the metric scores from a test set of place name pieces matched by hand
          • S Lk, Salt Lake, TRUE
      • Used the proportion of test cases that were matches in any leaf of the tree as the “similarity” score

SLIDE 13

Place Name Matching

  • Tested on 330 randomly selected “PLAC” fields from 10 different GEDCOM files
  • Each was preprocessed and matched by hand: 99.1% required modification after preprocessing
  • The first-ranked match was the same as the match found by hand 97.9% of the time
  • The average rank of the match generated by hand was 1.21
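
The two reported measures can be computed from the rank at which the hand-chosen match appears in the system's ranked list. The ranks below are made-up toy values for illustration, not data from the test set, and the function names are hypothetical.

```python
def first_rank_agreement(ranks):
    """Fraction of cases where the hand-chosen match was ranked first."""
    return sum(1 for rank in ranks if rank == 1) / len(ranks)


def average_rank(ranks):
    """Mean position of the hand-chosen match in the ranked results."""
    return sum(ranks) / len(ranks)


toy_ranks = [1, 1, 1, 2, 1]  # toy values only
print(first_rank_agreement(toy_ranks), average_rank(toy_ranks))
```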

SLIDE 14

Future Directions

  • Recognize when the best match is not satisfactory
  • Acquisition of a suitable thesaurus and gazetteer

  • Alexandria Digital Library Project
  • Historical place information
  • Increased productization
  • Indexing scheme
  • Internationalization

SLIDE 15

Questions?

  • Reference:
      • K. Kukich. Techniques for Automatically Correcting Words in Text. ACM Computing Surveys, 24(4):377-440, Dec. 1992.