Introduction Matching Verification Application Conclusion
Comparison between historical population archives and decentralized databases
Marijn Schraagen, Dionysius Huijsmans
Leiden Institute of Advanced Computer Science (LIACS), Leiden
Research subject
Historical databases have increasingly become digitized
Census data, civil registry, church records, trade records, . . .
Millions of interrelated records → historical social networks
However, network structure is not given
Alternative data sources: personal and local archives
Family trees, legal archives, . . .
Small amount of information
Relations between records generally indicated and verified
Research goal: combine the information from different sources
Outline
1 Introduction
2 Matching
3 Verification
4 Application
5 Conclusion
Motivation
Links between (historical) records are important for a wide range of applications
Data Mining: graph traversal algorithms, community detection
Humanities: migration patterns, family size, occupational development
Linguistics: stability of spelling, morphology, phonetics
Onomastics: name inheritance, geographical name distribution
Overview
First match records from databases X and Y, then identify complementary or conflicting links
[Diagram: birth record X1 and death record X2 are connected by link La; birth record Y1 and death record Y2 are connected by link Lb. X1 is matched against Y1 and X2 against Y2, after which the links La and Lb are compared.]
Example: If X1 = Y1 but X2 ≠ Y2 then either La or Lb or both are wrong.
Data formats
Large-scale historical databases
Syntax usually structured
XML, SQL, comma-separated
Occasionally structured natural language is used
Semantics generally based on events
Birth, marriage, baptism, change of ownership
Exception: census records
Family databases
Syntax often the legacy Gedcom format
Hierarchical level numbers and tags
Semantics generally based on individuals and families
Example historical databases
Genlias civil certificate database
Official registration of birth, marriage and death
The Netherlands, ∼1811-1920
15 million certificates (events)
Gedcom family archive
Hand-compiled from various sources
Mostly northern part of the Netherlands, ∼1600-now
1750 records (individuals and families)
Overlap: ∼1100 events, of which ∼600 births
Data formats example
Civil certificate
    Type: birth certificate
    Serial number: 176
    Date: 16-05-1883
    Place: Wonseradeel
    Child: Sierk Rolsma
    Father: Sjoerd Rolsma
    Mother: Agnes Weldring
Family archive
    0 @F294@ FAM
    1 HUSB @I840@
    1 WIFE @I787@
    1 CHIL @I848@
    1 CHIL @I849@
    · · ·
    0 @I787@ INDI
    1 NAME Agnes/Welderink/
    · · ·
    0 @I849@ INDI
    1 NAME Sierk/Rolsma/
    1 BIRT
    2 DATE 16 MAY 1883
Parser
Grammar
    birth → [FAM:CHIL]:child, father, mother.
    child → bdate, bplace, name.
    father → [FAM:HUSB]:name.
    mother → [FAM:WIFE]:name.
    bdate → [INDI:BIRT:DATE].
    bplace → [INDI:BIRT:PLAC].
    name → [INDI:NAME].
Family archive
    0 @F294@ FAM
    1 HUSB @I840@
    1 WIFE @I787@
    1 CHIL @I848@
    1 CHIL @I849@
    · · ·
    0 @I787@ INDI
    1 NAME Agnes/Welderink/
    · · ·
    0 @I849@ INDI
    1 NAME Sierk/Rolsma/
    1 BIRT
    2 DATE 16 MAY 1883
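The grammar can be read as a set of tag paths to follow through the Gedcom hierarchy. The following is a minimal Python sketch of that idea, not the parser used in the talk: the function names `parse_gedcom`, `first` and `birth_events` and the dictionary layout are assumptions made for illustration. The sample contains only the lines shown in the family archive excerpt, so the father's name and the birth place come out as `None`.

```python
from collections import defaultdict

# Only the lines visible in the family archive excerpt; @I840@ and @I848@
# are elided there, so they are deliberately absent here as well.
GEDCOM = """\
0 @F294@ FAM
1 HUSB @I840@
1 WIFE @I787@
1 CHIL @I848@
1 CHIL @I849@
0 @I787@ INDI
1 NAME Agnes/Welderink/
0 @I849@ INDI
1 NAME Sierk/Rolsma/
1 BIRT
2 DATE 16 MAY 1883
"""


def parse_gedcom(text):
    """Return {xref: {tag_path: [values]}}, with paths such as 'INDI:BIRT:DATE'."""
    records = {}
    current, stack = None, []
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        level = int(parts[0])
        if level == 0:
            # Level-0 lines with an @xref@ start a new record: "0 @F294@ FAM".
            if len(parts) == 3 and parts[1].startswith("@"):
                current = records[parts[1]] = defaultdict(list)
                current["TYPE"].append(parts[2])
                stack = [parts[2]]
            else:
                current = None
            continue
        if current is None:
            continue
        value = parts[2] if len(parts) > 2 else ""
        # Keep the tags of the enclosing levels, then append this line's tag.
        stack = stack[:level] + [parts[1]]
        current[":".join(stack)].append(value)
    return records


def first(record, path):
    """First value stored under a tag path, or None if record or path is missing."""
    values = record.get(path) if record else None
    return values[0] if values else None


def birth_events(records):
    """Follow the grammar's tag paths to build one birth event per known child."""
    events = []
    for fam in records.values():
        if first(fam, "TYPE") != "FAM":
            continue
        father = records.get(first(fam, "FAM:HUSB"))
        mother = records.get(first(fam, "FAM:WIFE"))
        for child_ref in fam.get("FAM:CHIL", []):
            child = records.get(child_ref)
            if child is None:
                continue  # child record not present in the excerpt
            events.append({
                "child": first(child, "INDI:NAME"),
                "bdate": first(child, "INDI:BIRT:DATE"),
                "bplace": first(child, "INDI:BIRT:PLAC"),
                "father": first(father, "INDI:NAME"),
                "mother": first(mother, "INDI:NAME"),
            })
    return events


events = birth_events(parse_gedcom(GEDCOM))
```

Each resulting dictionary is an event-based view of data that the archive stores per individual and per family, which is the form needed for comparison with a civil certificate.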
Record similarity measure
The parser provides uniform data for matching two records using similarity requirements for selected fields.
Example: Birth certificate similarity
Out of the four names of child and mother, at least two names are exactly equal.
The year of birth is equal, or the difference in year of birth is within a small margin and the edit distance between the names is below some threshold.
If multiple candidates for matching a record are found, then the candidate with the smallest edit distance is selected.
Note that the definition is domain specific.
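One possible reading of this similarity measure is sketched below in Python. Several details are assumptions rather than the authors' definition: the two requirements are combined with a logical AND, the edit distance is the sum of Levenshtein distances over the four name fields, and the defaults `year_margin=2` and `max_distance=3` are hypothetical placeholders for the "small margin" and "some threshold". The field names are likewise illustrative.

```python
NAME_FIELDS = ("child_first", "child_last", "mother_first", "mother_last")


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def name_distance(r1, r2):
    """Assumed aggregate: summed edit distance over the four name fields."""
    return sum(levenshtein(r1[f], r2[f]) for f in NAME_FIELDS)


def is_birth_match(r1, r2, year_margin=2, max_distance=3):
    """Hypothetical thresholds; both requirements must hold."""
    exact = sum(r1[f] == r2[f] for f in NAME_FIELDS)
    if exact < 2:
        return False
    year_diff = abs(r1["year"] - r2["year"])
    if year_diff == 0:
        return True
    return year_diff <= year_margin and name_distance(r1, r2) < max_distance


def best_candidate(record, candidates, **thresholds):
    """Among matching candidates, pick the one with the smallest edit distance."""
    matches = [c for c in candidates if is_birth_match(record, c, **thresholds)]
    return min(matches, key=lambda c: name_distance(record, c), default=None)


civil = {"child_first": "Sierk", "child_last": "Rolsma",
         "mother_first": "Agnes", "mother_last": "Weldring", "year": 1883}
archive = {"child_first": "Sierk", "child_last": "Rolsma",
           "mother_first": "Agnes", "mother_last": "Welderink", "year": 1883}

matched = is_birth_match(civil, archive)
```

Keeping the margin and threshold as parameters reflects the remark that the definition is domain specific: other record types or archives would need different values.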
Matching example
Birth certificate similarity
Civil certificate
    Date: 16-05-1883
    Child: Sierk Rolsma
    Mother: Agnes Weldring
Family archive
    Date: 16 MAY 1883
    Child: Sierk Rolsma
    Mother: Agnes Welderink
Three out of four names equal (Sierk, Rolsma, Agnes), year of birth equal (1883) → match
Matching results
Birth certificate similarity
Birth matches: 361/611 (59%)
Civil certificate database still in digitization phase
Family database contains many peripheral individuals for which parent names and birth date are unknown
Similarity measure could be improved
Cf. results for marriage certificate matching: 154/176 (88%)
Verification
Ideal case: gold standard
Generally not available for historical databases
Large variation in domain and data quality
Performance of matching algorithms obtained on one database is not indicative of performance on other databases
Unlike, e.g., newspaper archives, e-mail archives, co-author networks, . . .
Possible solution: internal verification
Internal verification
A similarity measure does not necessarily use all record fields for matching
Unused fields can provide a support level for a match
Example: the birth similarity measure used person names and year of birth
Location, exact date of birth, and serial number can be used for verification
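A support level of this kind could be computed as in the Python sketch below. This is an illustration under assumptions, not the procedure from the talk: the function name `support`, the field names, the treatment of a missing value as no support, and the rule that a date within `near_days` counts as an approximate match ("~") are all hypothetical. The serial number and place of the archive record are assumed sample values, since the family archive excerpt does not show them.

```python
from collections import Counter
from datetime import date


def support(a, b, near_days=7):
    """Return a (serial, location, date) signature of '+', '-' or '~' marks."""
    def same(x, y):
        # A missing value gives no support (assumption).
        return "+" if x is not None and x == y else "-"

    delta = abs((a["date"] - b["date"]).days)
    if delta == 0:
        day = "+"
    elif delta <= near_days:   # hypothetical meaning of an approximate date
        day = "~"
    else:
        day = "-"
    return same(a["serial"], b["serial"]), same(a["place"], b["place"]), day


civil = {"serial": 176, "place": "Wonseradeel", "date": date(1883, 5, 16)}
# Assumed values: serial number unknown, place equal to the certificate's.
archive = {"serial": None, "place": "Wonseradeel", "date": date(1883, 5, 16)}

signature = support(civil, archive)

# Tallying signatures over all matched pairs gives one count per support category.
tally = Counter(support(a, b) for a, b in [(civil, archive)])
```

Because none of these fields take part in the similarity measure itself, agreement on them is independent evidence for a match.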
Verification results
serial  location  date  dist  birth  marriage
+       +         +           177    69
+       o         +           31     2
–       +         +           21     41
–       +         ∼           33
–       +         –           7      2
–       –         +           3      10
–       –         ∼           6      2
–       –         –     ≤ 3   4      20
–       –         –     > 3   79     8
total                         361    154
Interpretation of support categories
serial  location  date  dist  birth  mean % unique  interpretation
+       +         +           177    100            ok
+       o         +           31     100            ok
–       +         +           21     99.1           ok
–       +         ∼           33     98.7           ok
–       –         +           3      98.1           ok
–       –         ∼           6      94.4           likely ok
–       +         –           7      90.0           manual check
–       –         –     ≤ 3   4      74.0           manual check
–       –         –     > 3   79     74.0           incorrect
total                         361
Application: link comparison
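First match records from databases X and Y, then identify complementary or conflicting links
[Diagram: records X1 and X2 are connected by link La; records Y1 and Y2 are connected by link Lb. X1 is matched against Y1 and X2 against Y2, after which the links La and Lb are compared.]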
Application: compare links from Gedcom family archive (given) to links between civil certificates (computed)
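The comparison can be sketched in Python as follows. This is an illustrative sketch only: the function `compare_links`, the four category labels and the record identifiers in the sample are assumptions, not part of the presented tool. Links are modelled as (source, target) pairs and the matching step is represented as a dictionary from records in X to records in Y.

```python
def compare_links(links_x, links_y, match):
    """Classify each link in X against the links in Y, given record matches X -> Y."""
    targets_y = {}
    for y1, y2 in links_y:
        targets_y.setdefault(y1, set()).add(y2)

    result = {}
    for x1, x2 in links_x:
        y1 = match.get(x1)
        if y1 is None:
            result[(x1, x2)] = "unmatched"       # X1 has no counterpart in Y
        elif y1 not in targets_y:
            result[(x1, x2)] = "complementary"   # Y has no link starting at Y1
        elif match.get(x2) in targets_y[y1]:
            result[(x1, x2)] = "agree"           # La and Lb connect matched records
        else:
            result[(x1, x2)] = "conflict"        # X1 = Y1 but X2 differs from Y2
    return result


# Hypothetical identifiers for illustration.
links_x = [("X1", "X2"), ("X3", "X4"), ("X5", "X6"), ("X7", "X8")]
links_y = [("Y1", "Y2"), ("Y3", "Y9")]
match = {"X1": "Y1", "X2": "Y2", "X3": "Y3", "X4": "Y4", "X5": "Y5"}

report = compare_links(links_x, links_y, match)
```

A "conflict" does not say which side is wrong: either link may be incorrect, or the record match itself may have failed, so such cases are candidates for manual inspection.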
Visualization tool
[Figure: link tree of marriages; where a spouse appears with two name variants, both are shown separated by " | "]
@F171@   13-05-1848  Sjoerd Riemerts Riemersma, Johanna Sikkes van der Zee
@F100@   01-05-1824  Sikke Sasses van der Zee, Aafke Klazes de Boer | Afke de Boer
@F15@    09-05-1857  Jan Johannes Altena, Klaaske Sikkes van der Zee
@F16@    02-07-1892  Johannes Altena, Elisabeth Vonk
@F17@    16-11-1889  Eke Foekema, Aaltje Altena
@F18@    09-01-1896  Sikke Altena, Cornelia Verkooyen | Cornelia Verkooijen
@F13@    13-06-1896  Ruurd Altena, Anna Jans Rolsma
@F19@    ~1900       H Wesseling, Agatha Altena
9797998  08-05-1895  Hendrikus Wesseling, Agatha Altena
@F122@   ~1920       Sikkes ?, IJbeltje Altena
@F123@   ~1925       Bartolomeus Mathias van Oerle, Klaaske Altena
@F124@   18-05-1923  Sikke Altena, Trijntje Homminga
A tool has been developed to explore the link tree
Red and blue: matched certificates have differences
Only red or blue: marriage from family archive without match in civil certificates, or vice versa
Records F19 and 9797998 are a false negative match
Records F122, F123, F124 are outside of the civil certificate timeframe
Summary
Combining information from different databases in the same domain
Syntactic and semantic parsing of records based on individuals into records based on events
Matching using domain-specific similarity measures
Match validation using additional record fields
Application: visualization of link comparison
Future work
Scale up to more and larger databases
Crowdsourcing is particularly suited to obtaining data
Refine matching procedure
Public release of visualization tool