Statistical Methods for Dating Collections of Historical Documents



SLIDE 1

The Problem The Maximum Prevalence Method

Statistical Methods for Dating Collections of Historical Documents

Michael Gervers

University of Toronto

Michael Gervers DEEDS dating — 1 / 28

SLIDE 2

  • Problem: Statistical methodologies for dating documents and texts.
  • Motivation: Historians want to date source documents accurately.

SLIDE 3

The Data

  • A total of 3353 documents, all of which have been accurately dated by historians.
  • These documents are in digitized format.
  • The 3353 documents were divided into a training set, a validation set and a test set.

SLIDE 4

  • The training documents “teach” or “train” our dating algorithm.
  • The validation set is used for estimating certain parameters.
  • The test set is used to measure accuracy.
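
A three-way split of this kind can be sketched as follows. The slides give only the total (3353 documents) and the test-set size (326 documents); the validation size, the random seed and the function name `split_documents` are assumptions made for illustration.

```python
import random

def split_documents(doc_ids, n_validation, n_test, seed=0):
    """Shuffle document ids and split them into training, validation and test sets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    test = ids[:n_test]
    validation = ids[n_test:n_test + n_validation]
    training = ids[n_test + n_validation:]
    return training, validation, test

# 3353 documents and 326 test documents come from the slides;
# the validation size of 326 is an assumption for illustration only.
training, validation, test = split_documents(range(3353), n_validation=326, n_test=326)
print(len(training), len(validation), len(test))
```

The three sets are disjoint, so parameters tuned on the validation set never see the test documents used to measure accuracy.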

SLIDE 5

ID: 00640214 Document date: 1237

Haec est finalis concordia facta in curia domini regis apud Westmonasterium a die S Johannis Baptistae in !xv! dies anno regni regis Henrici filii regis Johannis !xxi! coram Roberto de Lexinton Willelmo de Eboraco Ada filio Willelmi Willelmo de Culewurth justitiariis et aliis domini regis fidelibus tunc ibi praesentibus inter Johannem Baioc quaerentem et Robertum Sarum episcopum et capitulum .....

SLIDE 6

  • The concept of shingles
  • A shingle is a consecutive sequence of words (Broder, 1998).
  • Example:

D = (a rose is a rose is a rose). Its k-shingles for k = 2, listed in order of occurrence with repeats, are: (a rose), (rose is), (is a), (a rose), (rose is), (is a), (a rose). As a set, S2(D) = {(a rose), (rose is), (is a)}.
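
A minimal sketch of shingle extraction for the example D, assuming whitespace tokenization (the slides do not say how the documents are tokenized). The function name `shingles` is illustrative.

```python
def shingles(text, k):
    """Return the k-shingles of text (runs of k consecutive words), in order, with repeats."""
    words = text.split()
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]

doc = "a rose is a rose is a rose"
ordered = shingles(doc, 2)   # every 2-shingle, in order of occurrence
distinct = set(ordered)      # the same shingles with duplicates removed
print(ordered)
print(sorted(distinct))
```

The list keeps the seven occurrences, while the set keeps only the three distinct shingles.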

SLIDES 7-10

The idea behind the maximum prevalence method

To date an undated document D:

  1) Construct the set S(D) for a fixed shingle order.
  2) For each shingle in the set S(D), estimate the probability of its occurrence as a function of time.
  3) Combine the probabilities of occurrence of the shingles.
  4) The date at which the resulting function peaks is taken as the date estimate of document D.
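
The slides do not specify the estimator used to obtain a shingle's probability of occurrence as a function of time. One plausible reading, assumed here, is a kernel-weighted average of per-document occurrence rates over the dated training documents. The Gaussian kernel, the bandwidth `h`, the function names and the toy data below are all illustrative assumptions, not the authors' choices.

```python
import math

def gaussian_kernel(u):
    """Unnormalized Gaussian weight; the normalizing constant cancels in the ratio below."""
    return math.exp(-0.5 * u * u)

def prevalence_curve(dates, rates, grid, h):
    """Kernel-weighted average of per-document rates at each grid date.

    dates[i] is the known date of training document i, and rates[i] is the
    fraction of that document's shingles equal to the shingle of interest.
    """
    curve = []
    for t in grid:
        weights = [gaussian_kernel((t - d) / h) for d in dates]
        total = sum(weights)
        # If every weight underflows to zero, report zero prevalence.
        value = sum(w * r for w, r in zip(weights, rates)) / total if total > 0 else 0.0
        curve.append(value)
    return curve

# Toy training data (made up): the shingle is used only around 1200-1220.
dates = [1150, 1160, 1200, 1210, 1220, 1300]
rates = [0.0, 0.0, 0.002, 0.003, 0.002, 0.0]
grid = list(range(1100, 1401, 10))
curve = prevalence_curve(dates, rates, grid, h=20)
print(round(curve[grid.index(1210)], 5), round(curve[grid.index(1150)], 5))
```

A larger `h` gives a smoother curve and a smaller `h` follows the individual documents more closely.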

SLIDE 11

The probability of occurrence of the shingle ibidem Deo seruientibus as a function of time

[Figure: probability of occurrence (vertical axis, 0.0000 to 0.0030) against document date (horizontal axis, 1100 to 1400).]

SLIDE 12

The probability of occurrence of the shingle testimonium huic as a function of time

SLIDES 13-16

[Figure, shown on each of slides 13 to 16: scatter plot of Output (vertical axis, 0.0 to 1.2) against Input (horizontal axis, 0.0 to 3.0).]

SLIDE 17

The probability of occurrence of the shingle Francis et Anglicis as a function of time

[Figure: probability of occurrence (vertical axis, 0.000 to 0.010) against document date (horizontal axis, 1100 to 1400).]

SLIDE 18

Estimating the probability of occurrences of shingles in order to date undated document D

SLIDE 19

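  • Construct the set S(D) for a fixed shingle order.

Let s1 be Francis et Anglicis
Let s2 be ibidem Deo seruientibus

  • At a candidate date, here 1130, multiply the estimated probabilities of all shingles in S(D):

P_{s1}(1130) × P_{s2}(1130) × P_{s3}(1130) × P_{s4}(1130) × · · · = 0.0007 × 0.0005 × · · ·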
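
The product of shingle probabilities can be evaluated at every candidate date and the peak taken as the estimate. The sketch below sums logarithms instead of multiplying, which ranks dates the same way but avoids numerical underflow when many small probabilities are combined. The `floor` guard against log(0), the function name `estimate_date` and all curve values other than 0.0007 and 0.0005 at 1130 are assumptions for illustration.

```python
import math

def estimate_date(shingle_curves, grid, floor=1e-12):
    """Return the grid date that maximizes the sum of log probabilities over all shingles."""
    best_date, best_score = None, -math.inf
    for i, t in enumerate(grid):
        # Sum of logs = log of the product; floor avoids log(0) for unseen shingles.
        score = sum(math.log(max(curve[i], floor)) for curve in shingle_curves)
        if score > best_score:
            best_date, best_score = t, score
    return best_date, best_score

grid = [1110, 1130, 1150]
curve_s1 = [0.0002, 0.0007, 0.0004]  # toy values; only 0.0007 at 1130 is from the slide
curve_s2 = [0.0001, 0.0005, 0.0006]  # toy values; only 0.0005 at 1130 is from the slide
date, score = estimate_date([curve_s1, curve_s2], grid)
print(date, math.exp(score))
```

With these toy curves the product is largest at 1130, so that date is returned as the estimate.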

SLIDE 20

The probability of occurrence of the shingle Francis et Anglicis as a function of time

[Figure: probability of occurrence (vertical axis, 0.000 to 0.010) against document date (horizontal axis, 1100 to 1400).]

SLIDE 21

The probability of occurrence of the shingle ibidem Deo seruientibus as a function of time

[Figure: probability of occurrence (vertical axis, 0.0000 to 0.0030) against document date (horizontal axis, 1100 to 1400).]

SLIDE 22

The probability of occurrence of document D as a function of time. The actual date for the document is 1211 and the estimated date is 1210 (the peak)

SLIDE 23

The probability of occurrence of a document as a function of time. True date is 1299. Estimated date is 1307

[Figure: estimate of pi_{D}(t) (vertical axis, −1200 to −600) against date (horizontal axis, 1100 to 1400), with series labelled h=3, h=12, h=20 and h=35 and the estimate marked at Date=1307.]

SLIDE 24

The probability of occurrence of the word omnibus as a function of time

[Figure: probability of occurrence (vertical axis, 0.000 to 0.015) against document date (horizontal axis, 1100 to 1400).]

SLIDE 25

Result

Of all shingle orders, shingle order 2 performed best. On a test set of 326 documents:

  • Average error in absolute terms (mean) is 9.0 years
  • The 50th percentile of the error in absolute terms (median) is 6.0 years
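
For reference, the two summary statistics can be computed as sketched below. The first two date pairs are the examples quoted on the slides (1211 versus 1210, and 1299 versus 1307); the other two pairs are made up for illustration, so the printed values are not the reported results.

```python
from statistics import mean, median

def absolute_errors(actual, estimated):
    """Absolute dating error, in years, for each document."""
    return [abs(a - e) for a, e in zip(actual, estimated)]

# First two pairs are from the slides; the last two are hypothetical.
actual = [1211, 1299, 1180, 1250]
estimated = [1210, 1307, 1186, 1241]
errors = absolute_errors(actual, estimated)
print(mean(errors), median(errors))
```

The median is less sensitive than the mean to a few badly dated documents, which is consistent with the reported median (6.0 years) being lower than the mean (9.0 years).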

SLIDE 26

Estimated versus actual document date for the 326 documents in the test set based on shingle order 2. The solid line is the line X = Y.

[Figure: scatter plot of estimated document date (vertical axis, 1150 to 1350) against actual document date (horizontal axis, 1100 to 1400).]

SLIDE 27

This is joint work with Andrey Feuerverger and Gelila Tilahun, University of Toronto.

SLIDE 28

Thank You! The End
