SLIDE 1 An approach to unsupervised historical text normalisation
Petar Mitankin
Sofia University FMI
Stefan Gerdjikov
Sofia University FMI
Stoyan Mihov
Bulgarian Academy of Sciences, IICT
DATeCH 2014, May 19 - 20, Madrid, Spain
SLIDE 3 Contents
- Supervised Text Normalisation
– CULTURA
– REBELS Translation Model
– Functional Automata
- Unsupervised Text Normalisation
– Unsupervised REBELS
– Experimental Results
– Future Improvements
SLIDE 4 Co-funded under the 7th Framework Programme of the European Commission
- Maye - 34 occurrences in the 1641 Depositions, 8022 documents, 17th century Early Modern English
- CULTURA: CULTivating Understanding and Research through Adaptivity
- Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI
SLIDE 6 Supervised Text Normalisation
- Manually created ground truth
– 500 documents from the 1641 Depositions
– All words: 205 291
– Normalised words: 51 133
- Statistical Machine Translation from historical language to modern language combines:
– Translation model
– Language model
SLIDE 8
REgularities Based Embedding of Language Structures
[Diagram: the historical word "shee" is fed to the REBELS Translation Model, which outputs scored normalisation candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]
Automatic Extraction of Historical Spelling Variations
SLIDE 9 Training of The REBELS Translation Model
- Training pairs from the ground truth:
(shee, she), (maye, may), (she, she), (tyme, time), (saith, says), (have, have), (tho:, thomas), ...
SLIDE 10 Training of The REBELS Translation Model
- Deterministic structure of all historical/modern subwords
- Each word has several hierarchical decompositions in the DAWG
[Diagrams: hierarchical decomposition of a historical word and of a modern word]
SLIDE 11 Training of The REBELS Translation Model
- For each training pair (knowth, knows) we find a mapping between the decompositions
- We collect statistics about historical subword -> modern subword
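The slides only outline how these statistics are gathered. As a rough illustration only (it uses a plain character-level edit-distance alignment instead of the paper's DAWG-based hierarchical decompositions, and the pair list is invented), one could count historical-to-modern correspondences and turn them into log-probabilities like this:

# Rough sketch (not the paper's method): collect historical -> modern
# correspondence statistics from training pairs via edit-distance alignment.
from collections import Counter
from math import log

def align(h, m):
    """Character-level alignment of h and m by a Levenshtein backtrace."""
    n, k = len(h), len(m)
    d = [[0] * (k + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(k + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j
            else:
                d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                              d[i-1][j-1] + (h[i-1] != m[j-1]))
    pairs, i, j = [], n, k
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (h[i-1] != m[j-1]):
            pairs.append((h[i-1], m[j-1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            pairs.append((h[i-1], "")); i -= 1
        else:
            pairs.append(("", m[j-1])); j -= 1
    return list(reversed(pairs))

def substitution_log_probs(training_pairs):
    """Count historical -> modern correspondences and normalise per source."""
    counts, totals = Counter(), Counter()
    for hist, modern in training_pairs:
        for src, dst in align(hist, modern):
            counts[(src, dst)] += 1
            totals[src] += 1
    return {k: log(v / totals[k[0]]) for k, v in counts.items()}

pairs = [("shee", "she"), ("maye", "may"), ("tyme", "time"), ("knowth", "knows")]
print(substitution_log_probs(pairs))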
SLIDE 12
REgularities Based Embedding of Language Structures
[Diagram, as on slide 8: the word "shee" is fed to the REBELS Translation Model, which outputs scored candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]
REBELS generates normalisation candidates for unseen historical words
SLIDE 13
[Diagram: each word of the historical phrase "shee knowth me" is passed separately through REBELS to generate normalisation candidates]
SLIDE 14 Combination of REBELS with Statistical Bigram Language Model
relevance score(he knuth my) = REBELS TM(he knuth my) * C_tm + Statistical Language Model(he knuth my) * C_lm
– Smoothing: Absolute Discounting, Backing-off
– Gutenberg English language corpus
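A minimal sketch of this combination, assuming per-word candidate lists with REBELS TM log-scores and a toy bigram language model are already available (the candidate lists, scores and bigram table below are invented for illustration):

# Illustration only: pick the candidate normalisation whose weighted
# translation-model + bigram language-model score is highest.
from itertools import product
from math import log

# Hypothetical per-word candidates with REBELS TM log-scores.
candidates = {
    "shee":   [("she", -1.0), ("see", -3.2)],
    "knowth": [("knows", -1.5), ("knoweth", -2.1)],
    "me":     [("me", -0.1), ("my", -2.8)],
}

def bigram_lm_logprob(words, bigram_logp, unk=-15.0):
    """Toy bigram LM: sum of log P(w_i | w_{i-1}) with a fixed unknown score."""
    padded = ["<s>"] + list(words)
    return sum(bigram_logp.get((a, b), unk) for a, b in zip(padded, padded[1:]))

def best_normalisation(hist_words, c_tm, c_lm, bigram_logp):
    word_options = [candidates[w] for w in hist_words]
    best, best_score = None, float("-inf")
    for choice in product(*word_options):           # all candidate sentences
        words = [w for w, _ in choice]
        tm = sum(s for _, s in choice)              # REBELS TM score
        lm = bigram_lm_logprob(words, bigram_logp)  # bigram LM score
        score = c_tm * tm + c_lm * lm               # relevance score
        if score > best_score:
            best, best_score = words, score
    return best, best_score

bigrams = {("<s>", "she"): log(0.1), ("she", "knows"): log(0.2), ("knows", "me"): log(0.3)}
print(best_normalisation(["shee", "knowth", "me"], c_tm=1.0, c_lm=1.0, bigram_logp=bigrams))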
SLIDE 15
Functional Automata
L(C_tm, C_lm) is represented with Functional Automata
SLIDE 16
Automatic Construction of Functional Automaton For The Partial Derivative w.r.t. x
L(C_tm, C_lm) is optimised with the Conjugate Gradient method
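To illustrate only the optimisation step (not the functional-automata machinery that actually evaluates L and its partial derivatives), the two weights could be tuned with SciPy's conjugate-gradient solver; training_loss below is a stand-in quadratic objective, not the paper's L:

# Illustration only: optimise the weights (C_tm, C_lm) with conjugate gradient.
# The real system evaluates L and its partial derivatives with functional
# automata; here training_loss is a stand-in objective.
import numpy as np
from scipy.optimize import minimize

def training_loss(c):
    c_tm, c_lm = c
    # Placeholder for L(C_tm, C_lm): in the real setting this would measure how
    # well the weighted relevance score ranks the ground-truth normalisations.
    return (c_tm - 1.3) ** 2 + 0.5 * (c_lm - 0.7) ** 2

def training_loss_grad(c):
    c_tm, c_lm = c
    # Partial derivatives of the stand-in objective w.r.t. C_tm and C_lm.
    return np.array([2.0 * (c_tm - 1.3), (c_lm - 0.7)])

result = minimize(training_loss, x0=np.array([1.0, 1.0]),
                  jac=training_loss_grad, method="CG")
print("optimised C_tm, C_lm:", result.x)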
SLIDE 17
Supervised Text Normalisation
[Architecture diagram: a training module based on functional automata trains the REBELS Translation Model from the ground truth; a search module based on functional automata then uses the model to map historical text to normalised text]
SLIDE 18
Unsupervised Text Normalisation
[Architecture diagram: training pairs such as (knoweth, knows) are generated from the historical text without supervision and used to train the REBELS Translation Model; a search module based on functional automata then maps historical text to normalised text]
SLIDE 19 Unsupervised Generation of the Training Pairs
- We use similarity search to generate training pairs (a sketch follows after this list):
– For each historical word H:
- If H is a modern word, then generate (H, H), else
- Find each modern word M that is at Levenshtein distance 1 from H and generate (H, M). If no modern words are found, then
- Find each modern word M that is at distance 2 from H and generate (H, M). If no modern words are found, then
- Find each modern word M that is at distance 3 from H and generate (H, M).
- If more than 6 modern words were generated for H, then do not use the corresponding pairs for training.
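A brute-force sketch of this procedure, assuming the modern lexicon is available as a Python set (the real system would use an efficient approximate-search index rather than scanning the whole lexicon, and the example words below are invented):

# Rough sketch: generate (historical, modern) training pairs by similarity
# search against a modern-word lexicon.
def levenshtein(a, b):
    """Standard Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def generate_training_pairs(historical_words, modern_lexicon, max_candidates=6):
    pairs = []
    for h in historical_words:
        if h in modern_lexicon:                 # H is already a modern word
            pairs.append((h, h))
            continue
        matches = []
        for d in (1, 2, 3):                     # widen the distance step by step
            matches = [m for m in modern_lexicon if levenshtein(h, m) == d]
            if matches:
                break
        if 0 < len(matches) <= max_candidates:  # discard overly ambiguous words
            pairs.extend((h, m) for m in matches)
    return pairs

modern_lexicon = {"she", "may", "time", "says", "knows", "me"}
print(generate_training_pairs(["shee", "maye", "tyme", "knowth"], modern_lexicon))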
SLIDE 24 Normalisation of the 1641 Depositions. Experimental results

Method | Generation of REBELS Training Pairs | Spelling Probabilities | Language Model     | Accuracy | BLEU
1      | –                                   | –                      | –                  | –        | 50.31
2      | Unsupervised                        | NO                     | YES                | 67.84    | 45.52
3      | Unsupervised                        | YES                    | NO                 | 79.18    | 56.55
4      | Unsupervised                        | YES                    | YES                | 81.79    | 61.88
5      | Unsupervised                        | Supervised Trained     | Supervised Trained | 84.82    | 68.78
6      | Supervised                          | Supervised Trained     | Supervised Trained | 93.96    | 87.30
SLIDE 25
Future Improvement
[Architecture diagram: training pairs with probabilities, e.g. (knoweth, knows), are generated from the historical text without supervision; a MAP Training Module uses them to train the REBELS Translation Model, and a search module based on functional automata maps historical text to normalised text]
SLIDE 26 Thank You! Comments / Questions?
ACKNOWLEDGEMENTS The reported research work is supported by the project CULTURA, grant 269973, funded by the FP7 Programme, and the project AComIn, grant 316087, funded by the FP7 Programme.