slide-1
SLIDE 1

Introduction Matching Verification Application Conclusion

Comparison between historical population archives and decentralized databases

Marijn Schraagen Dionysius Huijsmans

Leiden Institute of Advanced Computer Science (LIACS) Leiden University, The Netherlands

LaTeCH Workshop 2013

slide-2
SLIDE 2

Research subject

Historical databases have increasingly become digitized
  • Census data, civil registry, church records, trade records, ...
  • Millions of interrelated records → historical social networks
  • However, network structure is not given

Alternative data sources: personal and local archives
  • Family trees, legal archives, ...
  • Small amount of information
  • Relations between records generally indicated and verified

Research goal: combine the information from different sources

slide-3
SLIDE 3

Outline

1. Introduction
2. Matching
3. Verification
4. Application
5. Conclusion

slide-4
SLIDE 4

Motivation

Links between (historical) records are important for a wide range of applications:
  • Data Mining: graph traversal algorithms, community detection
  • Humanities: migration patterns, family size, occupational development
  • Linguistics: stability of spelling, morphology, phonetics
  • Onomastics: name inheritance, geographical name distribution

slide-5
SLIDE 5

Overview

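First match records from databases X and Y, then identify complementary or conflicting links.

[Diagram: in database X, birth record X1 is connected to death record X2 by link La; in database Y, birth record Y1 is connected to death record Y2 by link Lb. X1 is matched against Y1 and X2 against Y2, after which links La and Lb are compared.]

Example: If X1 = Y1 but X2 ≠ Y2, then either La or Lb or both are wrong.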

slide-6
SLIDE 6

Data formats

Large-scale historical databases
  • Syntax usually structured
      • XML, SQL, comma-separated
      • Occasionally structured natural language is used
  • Semantics generally based on events
      • Birth, marriage, baptism, change of ownership
      • Exception: census records

Family databases
  • Syntax often the legacy Gedcom format
      • Hierarchical level numbers and tags
  • Semantics generally based on individuals and families

slide-7
SLIDE 7

Example historical databases

Genlias civil certificate database
  • Official registration of birth, marriage and death
  • The Netherlands, ∼1811-1920
  • 15 million certificates (events)

Gedcom family archive
  • Hand-compiled from various sources
  • Mostly northern part of the Netherlands, ∼1600-now
  • 1750 records (individuals and families)

Overlap: ∼1100 events, of which ∼600 births

slide-8
SLIDE 8

Data formats example

Civil certificate
  Type: birth certificate
  Serial number: 176
  Date: 16-05-1883
  Place: Wonseradeel
  Child: Sierk Rolsma
  Father: Sjoerd Rolsma
  Mother: Agnes Weldring

Family archive
  0 @F294@ FAM
  1 HUSB @I840@
  1 WIFE @I787@
  1 CHIL @I848@
  1 CHIL @I849@
  · · ·
  0 @I787@ INDI
  1 NAME Agnes/Welderink/
  · · ·
  0 @I849@ INDI
  1 NAME Sierk/Rolsma/
  1 BIRT
  2 DATE 16 MAY 1883

slide-10
SLIDE 10

Parser

Grammar
  birth  → [FAM:CHIL]:child, father, mother.
  child  → bdate, bplace, name.
  father → [FAM:HUSB]:name.
  mother → [FAM:WIFE]:name.
  bdate  → [INDI:BIRT:DATE].
  bplace → [INDI:BIRT:PLAC].
  name   → [INDI:NAME].
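The grammar maps tag paths in the individual-based family archive to the fields of an event-based birth record. A minimal Python sketch of that idea follows. It is not the authors' parser: the function names `parse_gedcom` and `birth_events`, the dictionary layout, and the simplified line pattern are assumptions made for illustration, and only the tags that occur in the grammar are handled.

```python
import re
from collections import defaultdict

# Simplified Gedcom line: level, optional @xref@, tag, optional value (assumed pattern).
GEDCOM_LINE = re.compile(r"^(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s+(.*))?$")


def parse_gedcom(text):
    """Return {xref: {"type": tag, "fields": {"TAG:SUBTAG": [values]}}}."""
    records = {}
    current, path = None, []
    for raw in text.splitlines():
        m = GEDCOM_LINE.match(raw.strip())
        if not m:
            continue  # skip blank lines and elisions such as "· · ·"
        level, xref, tag, value = int(m.group(1)), m.group(2), m.group(3), m.group(4) or ""
        if level == 0:
            current = {"type": tag, "fields": defaultdict(list)}
            if xref:
                records[xref] = current
            path = []
            continue
        if current is None:
            continue
        # Keep the tag path up to the parent level, then append this tag.
        path = path[:level - 1] + [tag]
        current["fields"][":".join(path)].append(value)
    return records


def first(record, key):
    values = record["fields"].get(key, [])
    return values[0] if values else None


def clean_name(raw):
    # "Sierk/Rolsma/" -> "Sierk Rolsma"
    return " ".join(raw.replace("/", " ").split()) if raw else None


def birth_events(records):
    """Apply the birth grammar: one event per [FAM:CHIL] reference."""
    def person_name(xref):
        person = records.get(xref)
        return clean_name(first(person, "NAME")) if person else None

    events = []
    for fam in records.values():
        if fam["type"] != "FAM":
            continue
        father = person_name(first(fam, "HUSB"))
        mother = person_name(first(fam, "WIFE"))
        for child_ref in fam["fields"].get("CHIL", []):
            child = records.get(child_ref)
            if child is None:
                continue  # child record not present in this excerpt
            events.append({
                "child": clean_name(first(child, "NAME")),
                "father": father,
                "mother": mother,
                "bdate": first(child, "BIRT:DATE"),
                "bplace": first(child, "BIRT:PLAC"),
            })
    return events


SAMPLE = """\
0 @F294@ FAM
1 HUSB @I840@
1 WIFE @I787@
1 CHIL @I848@
1 CHIL @I849@
0 @I787@ INDI
1 NAME Agnes/Welderink/
0 @I849@ INDI
1 NAME Sierk/Rolsma/
1 BIRT
2 DATE 16 MAY 1883
"""

for event in birth_events(parse_gedcom(SAMPLE)):
    print(event)
```

In this excerpt the father (@I840@) and the first child (@I848@) are not defined, so the sketch yields a single birth event with an empty father field.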

slide-11
SLIDE 11

Record similarity measure

The parser provides uniform data for matching two records using similarity requirements for selected fields.

Example: Birth certificate similarity
  • Out of the four names of child and mother, at least two names are exactly equal.
  • The year of birth is equal, or the difference in year of birth is within a small margin and the edit distance between the names is below some threshold.

If multiple candidates for matching a record are found, then the candidate with the smallest edit distance is selected.

Note that the definition is domain-specific.
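The birth certificate similarity rule can be sketched in Python as follows. This is one possible reading of the rule, not the authors' implementation: the slide does not give the margin or the threshold, so `year_margin=2` and `max_distance=3` are placeholder values, and the rule is interpreted as "at least two equal names, and either an equal year or a near year combined with a small total edit distance". The `Birth` class and all function names are hypothetical.

```python
from dataclasses import dataclass


def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


@dataclass(frozen=True)
class Birth:
    child_first: str
    child_last: str
    mother_first: str
    mother_last: str
    year: int

    def names(self):
        return (self.child_first, self.child_last,
                self.mother_first, self.mother_last)


def name_distance(a, b):
    """Total edit distance over the four names of child and mother."""
    return sum(edit_distance(x.lower(), y.lower())
               for x, y in zip(a.names(), b.names()))


def is_match(a, b, year_margin=2, max_distance=3):
    # Requirement 1: at least two of the four names are exactly equal.
    equal = sum(x.lower() == y.lower() for x, y in zip(a.names(), b.names()))
    if equal < 2:
        return False
    # Requirement 2: same year, or a near year with similar names.
    if a.year == b.year:
        return True
    return (abs(a.year - b.year) <= year_margin
            and name_distance(a, b) <= max_distance)


def best_match(record, candidates, **kwargs):
    """Among matching candidates, select the one with the smallest edit distance."""
    matches = [c for c in candidates if is_match(record, c, **kwargs)]
    return min(matches, key=lambda c: name_distance(record, c), default=None)


civil = Birth("Sierk", "Rolsma", "Agnes", "Weldring", 1883)
family = Birth("Sierk", "Rolsma", "Agnes", "Welderink", 1883)
print(is_match(civil, family))  # True
```

Splitting names into first and last name is itself a simplification; Dutch names with patronymics or prefixes such as "van der" would need a domain-specific tokenization.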

slide-12
SLIDE 12

Matching example

Civil certificate
  Date: 16-05-1883
  Child: Sierk Rolsma
  Mother: Agnes Weldring

Family archive
  Date: 16 MAY 1883
  Child: Sierk Rolsma
  Mother: Agnes Welderink

Three out of four names equal (Sierk, Rolsma, Agnes), year of birth equal (1883) → match

slide-13
SLIDE 13

Matching results

Birth matches: 361/611 (59%)
  • Civil certificate database still in digitization phase
  • Family database contains many peripheral individuals for whom parent names and birth date are unknown
  • Similarity measure could be improved

Cf. results for marriage certificate matching: 154/176 (88%)

slide-14
SLIDE 14

Verification

Ideal case: gold standard
  • Generally not available for historical databases

Large variation in domain and data quality
  • Performance of matching algorithms obtained on one database is not indicative of performance on other databases
  • Unlike, e.g., newspaper archives, e-mail archives, co-author networks, ...

Possible solution: internal verification

slide-15
SLIDE 15

Internal verification

  • A similarity measure does not necessarily use all record fields for matching
  • Unused fields can provide a support level for a match
  • Example: the birth similarity measure used person names and year of birth
  • Location, exact date of birth, and serial number can be used for verification
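Internal verification can be sketched as a comparison of the fields that the similarity measure left unused. The Python example below is an illustration under stated assumptions, not the authors' procedure: the field names, the function names, the ASCII symbols ("+", "-", "~"), the treatment of a missing field as no support, and the 31-day window for a "near" date are all hypothetical choices.

```python
from datetime import date


def compare_exact(a, b):
    """'+' if both values are present and equal, '-' otherwise (assumption)."""
    if a is None or b is None:
        return "-"
    return "+" if a == b else "-"


def compare_date(a, b, near_days=31):
    """'+' for equal dates, '~' for dates within near_days, '-' otherwise."""
    if a is None or b is None:
        return "-"
    if a == b:
        return "+"
    return "~" if abs((a - b).days) <= near_days else "-"


def support_pattern(certificate, archive_record):
    """Support for an existing match, from fields not used during matching."""
    place_a = certificate.get("place")
    place_b = archive_record.get("place")
    return (
        compare_exact(certificate.get("serial"), archive_record.get("serial")),
        compare_exact(place_a.lower() if place_a else None,
                      place_b.lower() if place_b else None),
        compare_date(certificate.get("date"), archive_record.get("date")),
    )


certificate = {"serial": 176, "place": "Wonseradeel", "date": date(1883, 5, 16)}
archive_record = {"serial": None, "place": "Wonseradeel", "date": date(1883, 5, 16)}
print(support_pattern(certificate, archive_record))  # ('-', '+', '+')
```

A match whose pattern shows agreement on several unused fields is better supported than one that agrees only on the fields already required by the similarity measure.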

slide-16
SLIDE 16

Verification results

serial  location  date  dist   birth  marriage
  +        +       +            177      69
  +        o       +             31       2
  –        +       +             21      41
  –        +       ∼             33
  –        +       –              7       2
  –        –       +              3      10
  –        –       ∼              6       2
  –        –       –    ≤ 3       4      20
  –        –       –    > 3      79       8
total                           361     154

slide-17
SLIDE 17

Interpretation of support categories

serial  location  date  dist   birth  mean % unique  interpretation
  +        +       +            177      100         ok
  +        o       +             31      100         ok
  –        +       +             21       99.1       ok
  –        +       ∼             33       98.7       ok
  –        –       +              3       98.1       ok
  –        –       ∼              6       94.4       likely ok
  –        +       –              7       90.0       manual check
  –        –       –    ≤ 3       4       74.0       manual check
  –        –       –    > 3      79       74.0       incorrect
total                           361

slide-18
SLIDE 18

Application: link comparison

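First match records from databases X and Y, then identify complementary or conflicting links.

[Diagram: in database X, record X1 is connected to record X2 by link La; in database Y, record Y1 is connected to record Y2 by link Lb. X1 is matched against Y1 and X2 against Y2, after which links La and Lb are compared.]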

Application: compare links from Gedcom family archive (given) to links between civil certificates (computed)
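The link comparison can be sketched in Python as a classification of each link once record matches are known. This is an illustrative sketch, not the authors' tool: the function name `compare_links`, the dictionary representation (one outgoing link per record), and the four category labels are assumptions. A link counts as conflicting when its start records match but its end records do not, and as complementary when it is present in only one database.

```python
def compare_links(links_x, links_y, match):
    """Classify links given record matches.

    links_x, links_y: dict mapping a record id to the record it links to.
    match: dict mapping record ids in X to matched record ids in Y.
    """
    report = {"agree": [], "conflict": [], "only_x": [], "only_y": []}
    for x1, x2 in links_x.items():
        y1 = match.get(x1)
        if y1 is None or y1 not in links_y:
            # No counterpart link in Y: complementary information from X.
            report["only_x"].append((x1, x2))
            continue
        y2 = links_y[y1]
        if match.get(x2) == y2:
            report["agree"].append((x1, x2))
        else:
            # Start records match but end records differ:
            # at least one of the two links is wrong.
            report["conflict"].append((x1, x2))
    # Links in Y whose start record is not the match of any linked X record.
    covered_y = {match[x1] for x1 in links_x if x1 in match}
    for y1, y2 in links_y.items():
        if y1 not in covered_y:
            report["only_y"].append((y1, y2))
    return report


links_x = {"X1": "X2", "X3": "X4", "X5": "X6"}
links_y = {"Y1": "Y2", "Y3": "Y9", "Y7": "Y8"}
match = {"X1": "Y1", "X2": "Y2", "X3": "Y3", "X4": "Y4"}
print(compare_links(links_x, links_y, match))
```

In the application described here, one of the two link sets would hold the given links from the Gedcom family archive and the other the computed links between civil certificates.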

slide-19
SLIDE 19

Visualization tool

[Figure: link tree of marriage records]
  @F171@ 13-05-1848 Sjoerd Riemerts Riemersma Johanna Sikkes van der Zee
  @F100@ 01-05-1824 Sikke Sasses van der Zee
      Aafke Klazes de Boer / Afke de Boer
  @F15@ 09-05-1857 Jan Johannes Altena Klaaske Sikkes van der Zee
  @F16@ 02-07-1892 Johannes Altena Elisabeth Vonk
  @F17@ 16-11-1889 Eke Foekema Aaltje Altena
  @F18@ 09-01-1896 Sikke Altena
      Cornelia Verkooyen / Cornelia Verkooijen
  @F13@ 13-06-1896 Ruurd Altena Anna Jans Rolsma
  @F19@ ~1900 H Wesseling Agatha Altena
  9797998 08-05-1895 Hendrikus Wesseling Agatha Altena
  @F122@ ~1920 Sikkes ? IJbeltje Altena
  @F123@ ~1925 Bartolomeus Mathias van Oerle Klaaske Altena
  @F124@ 18-05-1923 Sikke Altena Trijntje Homminga

A tool has been developed to explore the link tree.
Red and blue: matched certificates have differences.

slide-20
SLIDE 20

Visualization tool

Only red or blue: marriage from family archive without match in civil certificates, or vice versa

slide-21
SLIDE 21

Visualization tool

Records F19 and 9797998 are a false negative match

slide-22
SLIDE 22

Visualization tool

Records F122, F123, F124 are outside of the civil certificate timeframe

slide-23
SLIDE 23

Summary

  • Combining information from different databases in the same domain
  • Syntactic and semantic parsing to convert records based on individuals into records based on events
  • Matching using domain-specific similarity measures
  • Match validation using additional record fields
  • Application: visualization of link comparison

slide-24
SLIDE 24

Future work

Scale up to more and larger databases
  • Crowdsourcing is particularly suited to obtaining data

Refine matching procedure

Public release of visualization tool

slide-25
SLIDE 25

Acknowledgment

This work is part of the research programme LINKS, which is financed by the Netherlands Organisation for Scientific Research (NWO), grant 640.004.804. The authors would like to thank Tom Altena for the use of his Gedcom database.