SLIDE 1 An approach to unsupervised historical text normalisation
Petar Mitankin
Sofia University FMI
Stefan Gerdjikov
Sofia University FMI
Stoyan Mihov
Bulgarian Academy of Sciences, IICT
DATeCH 2014, May 19 - 20, Madrid, Spain
SLIDE 3 Contents
- Supervised Text Normalisation
– CULTURA
– REBELS Translation Model
– Functional Automata
- Unsupervised Text Normalisation
– Unsupervised REBELS
– Experimental Results
– Future Improvements
SLIDE 4 Co-funded under the 7th Framework Programme of the European Commission
- Maye - 34 occurrences in the 1641 Depositions, 8022 documents, 17th century Early Modern English
- CULTURA: CULTivating Understanding and Research through Adaptivity
- Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI
SLIDE 6 Supervised Text Normalisation
- Manually created ground truth
– 500 documents from the 1641 Depositions
– All words: 205 291
– Normalised words: 51 133
- Statistical Machine Translation from historical language to modern language combines:
– Translation model
– Language model
SLIDE 8
REgularities Based Embedding of Language Structures
[Diagram: the historical word "shee" is fed to the REBELS Translation Model, which outputs scored normalisation candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]
Automatic Extraction of Historical Spelling Variations
SLIDE 9 Training of The REBELS Translation Model
- Training pairs from the ground truth:
(shee, she), (maye, may), (she, she), (tyme, time), (saith, says), (have, have), (tho:, thomas), ...
SLIDE 10 Training of The REBELS Translation Model
- Deterministic structure of all historical/modern subwords
- Each word has several hierarchical decompositions in the DAWG
[Diagrams: hierarchical decomposition of a historical word and of a modern word]
SLIDE 11 Training of The REBELS Translation Model
- For each training pair (knowth, knows) we find a mapping between the decompositions
- We collect statistics about historical subword -> modern subword
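The slides only outline how these statistics are gathered. As a rough illustration only (it uses a plain character-level edit-distance alignment instead of the paper's DAWG-based hierarchical decompositions, and the pair list is invented), one could count historical-to-modern correspondences and turn them into log-probabilities like this:

# Rough sketch (not the paper's method): collect historical -> modern
# correspondence statistics from training pairs via edit-distance alignment.
from collections import Counter
from math import log

def align(h, m):
    """Character-level alignment of h and m by a Levenshtein backtrace."""
    n, k = len(h), len(m)
    d = [[0] * (k + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(k + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j
            else:
                d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                              d[i-1][j-1] + (h[i-1] != m[j-1]))
    pairs, i, j = [], n, k
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (h[i-1] != m[j-1]):
            pairs.append((h[i-1], m[j-1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            pairs.append((h[i-1], "")); i -= 1
        else:
            pairs.append(("", m[j-1])); j -= 1
    return list(reversed(pairs))

def substitution_log_probs(training_pairs):
    """Count historical -> modern correspondences and normalise per source."""
    counts, totals = Counter(), Counter()
    for hist, modern in training_pairs:
        for src, dst in align(hist, modern):
            counts[(src, dst)] += 1
            totals[src] += 1
    return {k: log(v / totals[k[0]]) for k, v in counts.items()}

pairs = [("shee", "she"), ("maye", "may"), ("tyme", "time"), ("knowth", "knows")]
print(substitution_log_probs(pairs))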
SLIDE 12
REgularities Based Embedding of Language Structures
[Diagram, as on slide 8: the word "shee" is fed to the REBELS Translation Model, which outputs scored candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]
REBELS generates normalisation candidates for unseen historical words
SLIDE 13
[Diagram: each word of the historical phrase "shee knowth me" is passed separately through REBELS to generate normalisation candidates]
SLIDE 14 Combination of REBELS with Statistical Bigram Language Model
relevance score(he knuth my) = REBELS TM(he knuth my) * C_tm + Statistical Language Model(he knuth my) * C_lm
– Smoothing: Absolute Discounting, Backing-off
– Gutenberg English language corpus
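A minimal sketch of this combination, assuming per-word candidate lists with REBELS TM log-scores and a toy bigram language model are already available (the candidate lists, scores and bigram table below are invented for illustration):

# Illustration only: pick the candidate normalisation whose weighted
# translation-model + bigram language-model score is highest.
from itertools import product
from math import log

# Hypothetical per-word candidates with REBELS TM log-scores.
candidates = {
    "shee":   [("she", -1.0), ("see", -3.2)],
    "knowth": [("knows", -1.5), ("knoweth", -2.1)],
    "me":     [("me", -0.1), ("my", -2.8)],
}

def bigram_lm_logprob(words, bigram_logp, unk=-15.0):
    """Toy bigram LM: sum of log P(w_i | w_{i-1}) with a fixed unknown score."""
    padded = ["<s>"] + list(words)
    return sum(bigram_logp.get((a, b), unk) for a, b in zip(padded, padded[1:]))

def best_normalisation(hist_words, c_tm, c_lm, bigram_logp):
    word_options = [candidates[w] for w in hist_words]
    best, best_score = None, float("-inf")
    for choice in product(*word_options):           # all candidate sentences
        words = [w for w, _ in choice]
        tm = sum(s for _, s in choice)              # REBELS TM score
        lm = bigram_lm_logprob(words, bigram_logp)  # bigram LM score
        score = c_tm * tm + c_lm * lm               # relevance score
        if score > best_score:
            best, best_score = words, score
    return best, best_score

bigrams = {("<s>", "she"): log(0.1), ("she", "knows"): log(0.2), ("knows", "me"): log(0.3)}
print(best_normalisation(["shee", "knowth", "me"], c_tm=1.0, c_lm=1.0, bigram_logp=bigrams))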
SLIDE 15
Functional Automata
L(C_tm, C_lm) is represented with Functional Automata
SLIDE 16
Automatic Construction of Functional Automaton For The Partial Derivative w.r.t. x
L(C_tm, C_lm) is optimised with the Conjugate Gradient method
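To illustrate only the optimisation step (not the functional-automata machinery that actually evaluates L and its partial derivatives), the two weights could be tuned with SciPy's conjugate-gradient solver; training_loss below is a stand-in quadratic objective, not the paper's L:

# Illustration only: optimise the weights (C_tm, C_lm) with conjugate gradient.
# The real system evaluates L and its partial derivatives with functional
# automata; here training_loss is a stand-in objective.
import numpy as np
from scipy.optimize import minimize

def training_loss(c):
    c_tm, c_lm = c
    # Placeholder for L(C_tm, C_lm): in the real setting this would measure how
    # well the weighted relevance score ranks the ground-truth normalisations.
    return (c_tm - 1.3) ** 2 + 0.5 * (c_lm - 0.7) ** 2

def training_loss_grad(c):
    c_tm, c_lm = c
    # Partial derivatives of the stand-in objective w.r.t. C_tm and C_lm.
    return np.array([2.0 * (c_tm - 1.3), (c_lm - 0.7)])

result = minimize(training_loss, x0=np.array([1.0, 1.0]),
                  jac=training_loss_grad, method="CG")
print("optimised C_tm, C_lm:", result.x)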
SLIDE 17
Supervised Text Normalisation
[Architecture diagram: a training module based on functional automata trains the REBELS Translation Model from the ground truth; a search module based on functional automata then uses the model to map historical text to normalised text]
SLIDE 18
Unsupervised Text Normalisation
[Architecture diagram: training pairs such as (knoweth, knows) are generated from the historical text without supervision and used to train the REBELS Translation Model; a search module based on functional automata then maps historical text to normalised text]
SLIDE 19 Unsupervised Generation of the Training Pairs
- We use similarity search to generate training pairs (a sketch follows after this list):
– For each historical word H:
- If H is a modern word, then generate (H, H), else
- Find each modern word M that is at Levenshtein distance 1 from H and generate (H, M). If no modern words are found, then
- Find each modern word M that is at distance 2 from H and generate (H, M). If no modern words are found, then
- Find each modern word M that is at distance 3 from H and generate (H, M).
- If more than 6 modern words were generated for H, then do not use the corresponding pairs for training.
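A brute-force sketch of this procedure, assuming the modern lexicon is available as a Python set (the real system would use an efficient approximate-search index rather than scanning the whole lexicon, and the example words below are invented):

# Rough sketch: generate (historical, modern) training pairs by similarity
# search against a modern-word lexicon.
def levenshtein(a, b):
    """Standard Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def generate_training_pairs(historical_words, modern_lexicon, max_candidates=6):
    pairs = []
    for h in historical_words:
        if h in modern_lexicon:                 # H is already a modern word
            pairs.append((h, h))
            continue
        matches = []
        for d in (1, 2, 3):                     # widen the distance step by step
            matches = [m for m in modern_lexicon if levenshtein(h, m) == d]
            if matches:
                break
        if 0 < len(matches) <= max_candidates:  # discard overly ambiguous words
            pairs.extend((h, m) for m in matches)
    return pairs

modern_lexicon = {"she", "may", "time", "says", "knows", "me"}
print(generate_training_pairs(["shee", "maye", "tyme", "knowth"], modern_lexicon))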
SLIDE 24 Normalisation of the 1641 Depositions. Experimental results

Method | Generation of REBELS Training Pairs | Spelling Probabilities | Language Model     | Accuracy | BLEU
1      | –                                   | –                      | –                  | –        | 50.31
2      | Unsupervised                        | NO                     | YES                | 67.84    | 45.52
3      | Unsupervised                        | YES                    | NO                 | 79.18    | 56.55
4      | Unsupervised                        | YES                    | YES                | 81.79    | 61.88
5      | Unsupervised                        | Supervised Trained     | Supervised Trained | 84.82    | 68.78
6      | Supervised                          | Supervised Trained     | Supervised Trained | 93.96    | 87.30
SLIDE 25
Future Improvement
[Architecture diagram: training pairs with probabilities, e.g. (knoweth, knows), are generated from the historical text without supervision; a MAP Training Module uses them to train the REBELS Translation Model, and a search module based on functional automata maps historical text to normalised text]
SLIDE 26 Thank You! Comments / Questions?
ACKNOWLEDGEMENTS The reported research work is supported by the project CULTURA, grant 269973, funded by the FP7 Programme, and the project AComIn, grant 316087, funded by the FP7 Programme.