Genealogical Place Name Normalization, Bob Leaman


SLIDE 1

Genealogical Place Name Normalization

Bob Leaman

(bob.leaman@asu.edu)

SLIDE 2

3 Apr 2003 Genealogical Place Name Normalization

What is meant by “Normalization”?

Enforcing a standardized representation
Increases accuracy
Data shared over e-mail can be very hard to correct
Easier record linkage
Automated merging
Automated research

SLIDE 3


What format to use?

Fixed three-level
Mesa, Maricopa, Arizona
Variable-level
Mesa, Maricopa, Arizona, United States

Note absence of descriptors
“Of”, “Near”, etc.
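The two formats can be illustrated with a minimal sketch. It assumes levels are comma-separated and ordered from smallest to largest, as in the Mesa example; the function names `split_levels` and `to_fixed_three_level` are hypothetical and not taken from the presentation.

```python
def split_levels(place: str) -> list[str]:
    """Split a comma-separated place string into trimmed, non-empty levels."""
    return [part.strip() for part in place.split(",") if part.strip()]


def to_fixed_three_level(place: str) -> tuple[str, str, str]:
    """Return the first three levels (e.g. locality, county, state).

    Assumes the string is ordered smallest-to-largest and already has
    at least three levels; anything past the third level is dropped.
    """
    levels = split_levels(place)
    if len(levels) < 3:
        raise ValueError(f"need at least three levels, got {levels!r}")
    first, second, third = levels[:3]
    return first, second, third


variable = split_levels("Mesa, Maricopa, Arizona, United States")
fixed = to_fixed_three_level("Mesa, Maricopa, Arizona, United States")
```

Here `variable` keeps all four levels, while `fixed` keeps only the first three.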

SLIDE 4


The Problem

What kinds of deviations from the standard are common?

Biographical notes
Johnsville, Arkansas. He had 6 children
Addresses and e-mails
Hospital, church and cemetery names
Bluff Cemetery, Elgin, Ill. → Elgin, Ill.
Leaving out one or more of the levels
Vancouver, Washington → Vancouver, Clark, Washington, United States

SLIDE 5


The Problem

Excluding the comma between two of the place names
San Leandro CA → San Leandro, CA
Using an abbreviated, truncated, or alternate form of a place name
UT → Utah
Tenn → Tennessee
Holland → Netherlands
Misspelling place names
Ypfilanti, Washtinaud, Michigan → Ypsilanti, Washtenaw, Michigan
Algorithmic contractions such as removing all vowels after the first letter
Oxfrd → Oxford
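The vowel-dropping contraction is simple enough to sketch directly. The following is only an illustration of that one deviation, assuming the contraction is applied word by word; the helper names `contract` and `could_be_contraction` are made up for this example.

```python
VOWELS = set("aeiouAEIOU")


def contract(name: str) -> str:
    """Drop every vowel after the first letter of each word."""
    words = []
    for word in name.split():
        tail = "".join(ch for ch in word[1:] if ch not in VOWELS)
        words.append(word[:1] + tail)
    return " ".join(words)


def could_be_contraction(short: str, full: str) -> bool:
    """True if `short` is what `full` looks like after contraction."""
    return contract(full).lower() == short.lower()


print(contract("Oxford"))                       # Oxfrd
print(could_be_contraction("Oxfrd", "Oxford"))  # True
```

Running the contraction forward over known names and comparing is one way to detect it; the presentation does not say how its system handled this case.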

SLIDE 6


Strategy

Preprocessing: remove everything that is not part of the place name
Match against a name variations database (thesaurus)
Match against standardized names database (gazetteer)
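The three stages above can be sketched as a small pipeline. This is a toy outline under stated assumptions: the stage functions, the one-entry thesaurus and the one-entry gazetteer are all hypothetical stand-ins, and exact lookup replaces the similarity matching described later.

```python
from typing import Callable


def normalize(
    raw: str,
    preprocess: Callable[[str], str],
    expand: Callable[[str], list[str]],
    match: Callable[[list[str]], str],
) -> str:
    """Run the three stages in order and return the chosen standard name."""
    cleaned = preprocess(raw)    # stage 1: strip non-place text
    variants = expand(cleaned)   # stage 2: thesaurus lookup
    return match(variants)       # stage 3: gazetteer lookup


# Toy stand-ins, purely for illustration.
TOY_THESAURUS = {"Ill.": "Illinois"}
TOY_GAZETTEER = {"Elgin, Illinois"}


def toy_preprocess(raw: str) -> str:
    parts = [p.strip() for p in raw.split(",")]
    return ", ".join(p for p in parts if "Cemetery" not in p)


def toy_expand(place: str) -> list[str]:
    parts = [p.strip() for p in place.split(",")]
    expanded = ", ".join(TOY_THESAURUS.get(p, p) for p in parts)
    return [place, expanded]  # the original is kept as a candidate


def toy_match(variants: list[str]) -> str:
    for variant in variants:
        if variant in TOY_GAZETTEER:
            return variant
    return variants[0]  # fall back to the cleaned input


result = normalize("Bluff Cemetery, Elgin, Ill.", toy_preprocess, toy_expand, toy_match)
```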

SLIDE 7


Preprocessing Place Names

Use regular expressions to detect patterns
38th year, Benedict, Kansas. Buried High Prairie Cem, Wilson, Kansas
becomes
38th year, Benedict, Kansas.
becomes
Benedict, Kansas
List of “note words” (e.g. occupations, causes of death, etc.)
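A minimal sketch of this two-step cleanup on the slide's own example follows. The regular expression and the tiny note-word list are assumptions chosen to reproduce that one example; the presentation does not give the actual patterns or word list.

```python
import re

# Hypothetical pattern: a sentence break followed by a burial note.
BURIAL_NOTE = re.compile(r"\.\s*Buried\b.*$", re.IGNORECASE)

# Hypothetical "note words"; a real list would be much longer.
NOTE_WORDS = {"year", "farmer", "pneumonia"}


def preprocess(raw: str) -> str:
    """Strip a trailing burial note, then drop pieces containing note words."""
    text = BURIAL_NOTE.sub("", raw)
    # Trailing periods and spaces are stripped from each piece.
    pieces = [p.strip(" .") for p in text.split(",")]
    kept = [
        p for p in pieces
        if p and not any(word.lower() in NOTE_WORDS for word in p.split())
    ]
    return ", ".join(kept)


cleaned = preprocess(
    "38th year, Benedict, Kansas. Buried High Prairie Cem, Wilson, Kansas"
)
print(cleaned)  # Benedict, Kansas
```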

SLIDE 8


Preprocessing Place Names

Tested on 2450 randomly selected “PLAC” fields from 10 different GEDCOM files
Each was preprocessed by hand: 58.4% required modification
Preprocessing via the system matched preprocessing by hand 97.6% of the time

SLIDE 9


Handling Name Variations

At this point all non-place name information has been removed
Each place name is looked up in a database of alternate names (thesaurus)
Livonia, MI → {Livonia, MI & Livonia, Michigan}
The original is included in case the wrong alternate was recorded originally
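The thesaurus lookup can be sketched as a per-piece expansion that always keeps the original spelling. The dictionary below is a hypothetical three-entry stand-in for the alternate-names database, and expanding every combination of pieces is an assumption about how multi-piece names were handled.

```python
from itertools import product

# Hypothetical thesaurus: recorded form -> possible standard forms.
THESAURUS = {
    "MI": ["Michigan"],
    "UT": ["Utah"],
    "Tenn": ["Tennessee"],
}


def expand(place: str) -> list[str]:
    """Return the original place plus every thesaurus-expanded variant."""
    pieces = [p.strip() for p in place.split(",")]
    # Each piece keeps its original spelling as the first option.
    options = [[p] + THESAURUS.get(p, []) for p in pieces]
    return [", ".join(combo) for combo in product(*options)]


print(expand("Livonia, MI"))  # ['Livonia, MI', 'Livonia, Michigan']
```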

SLIDE 10


Place Name Matching

Created a place name database
Mostly GNIS data
Includes all of the United States and some of England and Canada
Nearly 160,000 places
Database format
A single table was used to hold all place records
Utilized unique identifiers to point to the “parent” record
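A single self-referencing table of this kind might look like the sketch below, using SQLite from the Python standard library. The table name, column names and sample rows are assumptions for illustration; the presentation does not show the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE place (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES place(id)  -- NULL for top-level places
    )
    """
)
conn.executemany(
    "INSERT INTO place (id, name, parent_id) VALUES (?, ?, ?)",
    [
        (1, "United States", None),
        (2, "Arizona", 1),
        (3, "Maricopa", 2),
        (4, "Mesa", 3),
    ],
)


def full_name(db: sqlite3.Connection, place_id: int) -> str:
    """Follow parent pointers upward and join the names, smallest first."""
    names = []
    current = place_id
    while current is not None:
        row = db.execute(
            "SELECT name, parent_id FROM place WHERE id = ?", (current,)
        ).fetchone()
        if row is None:
            raise KeyError(f"no place with id {current}")
        names.append(row[0])
        current = row[1]
    return ", ".join(names)


print(full_name(conn, 4))  # Mesa, Maricopa, Arizona, United States
```

Storing each level once and pointing to its parent avoids repeating the county and state names on every locality row.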

SLIDE 11


Place Name Matching

Need to find the place name in the database that maximizes the “similarity” with respect to the input place name
0 = no match
1 = perfect match
Calculated using the average “similarity” of the individual pieces of the place name
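The averaging step can be sketched as follows. Two things here are assumptions rather than the presented method: `difflib.SequenceMatcher` stands in for the per-piece similarity, and each input piece is paired with its best-scoring candidate piece, since the presentation does not say how pieces were aligned.

```python
from difflib import SequenceMatcher


def piece_similarity(a: str, b: str) -> float:
    """Stand-in per-piece score in [0, 1]; not the presented metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def place_similarity(input_place: str, candidate: str) -> float:
    """Average, over input pieces, of the best score against any candidate piece."""
    in_pieces = [p.strip() for p in input_place.split(",") if p.strip()]
    cand_pieces = [p.strip() for p in candidate.split(",") if p.strip()]
    if not in_pieces or not cand_pieces:
        return 0.0
    scores = [
        max(piece_similarity(p, c) for c in cand_pieces) for p in in_pieces
    ]
    return sum(scores) / len(scores)


def best_match(input_place: str, gazetteer: list[str]) -> str:
    """Return the gazetteer entry with the highest similarity."""
    return max(gazetteer, key=lambda c: place_similarity(input_place, c))


GAZETTEER = ["Ypsilanti, Washtenaw, Michigan", "Mesa, Maricopa, Arizona"]
print(best_match("Ypfilanti, Michigan", GAZETTEER))
```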

SLIDE 12


Place Name Matching

Used the elements of the edit distance metric
Substitution, insertion, deletion
Added transposition, length of the longest common substring & a measure of truncation
Sorted through the several data points per potential match with a decision tree
Trained using the metric scores from a test set of place name pieces matched by hand
S Lk, Salt Lake, TRUE
Used the proportion of test cases that were matches in any leaf of the tree as the “similarity” score
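A rough sketch of the feature extraction and leaf-proportion scoring follows. It makes several simplifications that are not from the presentation: the edit operations are collapsed into one distance with adjacent transposition (optimal string alignment), truncation is a simple prefix test, and a hand-written two-split tree (`leaf_of`) stands in for the trained decision tree. The training rows other than "S Lk, Salt Lake, TRUE" are invented for illustration.

```python
from collections import defaultdict


def osa_distance(a: str, b: str) -> int:
    """Edit distance counting substitution, insertion, deletion and
    adjacent transposition, each at cost 1."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest run of characters shared by both strings."""
    best, prev = 0, [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def features(inp: str, cand: str) -> tuple[int, int, bool]:
    a, b = inp.lower(), cand.lower()
    truncated = len(a) < len(b) and b.startswith(a)  # simple prefix test
    return osa_distance(a, b), longest_common_substring(a, b), truncated


def leaf_of(feats: tuple[int, int, bool]) -> str:
    """Hypothetical hand-written tree standing in for a trained one."""
    distance, _lcs, truncated = feats
    if truncated:
        return "truncation"
    return "close" if distance <= 2 else "far"


def leaf_scores(training: list[tuple[str, str, bool]]) -> dict[str, float]:
    """Proportion of hand-labelled matches among the examples in each leaf."""
    counts = defaultdict(lambda: [0, 0])  # leaf -> [matches, total]
    for inp, cand, is_match in training:
        leaf = leaf_of(features(inp, cand))
        counts[leaf][0] += int(is_match)
        counts[leaf][1] += 1
    return {leaf: m / total for leaf, (m, total) in counts.items()}


TRAINING = [
    ("S Lk", "Salt Lake", True),
    ("Tenn", "Tennessee", True),
    ("Oxfrd", "Oxford", True),
    ("Mesa", "Elgin", False),
    ("Utah", "Ohio", False),
]
SCORES = leaf_scores(TRAINING)
```

A new pair is scored by computing its features, finding its leaf, and reading that leaf's proportion from `SCORES`.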

SLIDE 13


Place Name Matching

Tested on 330 randomly selected “PLAC” fields from 10 different GEDCOM files
Each was preprocessed and matched by hand: 99.1% required modification after preprocessing
The first-ranked match was the same as the match found by hand 97.9% of the time
The average rank of the match generated by hand was 1.21
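The two reported measures, first-ranked agreement and average rank of the hand-chosen match, can be computed as in this sketch. The function name and the toy data are hypothetical; the figures above come from the presentation's own test set, not from this code.

```python
def evaluate(ranked_lists: list[list[str]], gold: list[str]) -> tuple[float, float]:
    """Return (share of cases where the hand match is ranked first,
    average 1-based rank of the hand match).

    Assumes every hand match appears somewhere in its ranked list;
    list.index raises ValueError otherwise.
    """
    ranks = [
        candidates.index(truth) + 1
        for candidates, truth in zip(ranked_lists, gold)
    ]
    top1 = sum(rank == 1 for rank in ranks) / len(ranks)
    average_rank = sum(ranks) / len(ranks)
    return top1, average_rank


# Toy data: four inputs with the system's ranked candidates.
ranked = [["a", "b"], ["b", "a"], ["c"], ["x", "y", "d"]]
hand_matches = ["a", "a", "c", "d"]
print(evaluate(ranked, hand_matches))  # (0.5, 1.75)
```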

SLIDE 14


Future Directions

Recognize when the best match is not satisfactory
Acquisition of a suitable thesaurus and gazetteer

Alexandria Digital Library Project
Historical place information
Increased productization
Indexing scheme
Internationalization


SLIDE 15

Questions?
