An approach to unsupervised historical text normalisation

Petar Mitankin (Sofia University, FMI), Stefan Gerdjikov (Sofia University, FMI), Stoyan Mihov (Bulgarian Academy of Sciences, IICT)

DATeCH 2014, May 19-20, Madrid, Spain


slide-1
SLIDE 1

An approach to unsupervised historical text normalisation

Petar Mitankin

Sofia University FMI

Stefan Gerdjikov

Sofia University FMI

Stoyan Mihov

Bulgarian Academy of Sciences, IICT

DATeCH 2014, May 19-20, Madrid, Spain


slide-3
SLIDE 3

Contents

  • Supervised Text Normalisation

– CULTURA
– REBELS Translation Model
– Functional Automata

  • Unsupervised Text Normalisation

– Unsupervised REBELS
– Experimental Results
– Future Improvements

slide-4
SLIDE 4

Co-funded under the 7th Framework Programme of the European Commission

  • Maye - 34 occurrences in the 1641 Depositions (8022 documents, 17th-century Early Modern English)

  • CULTURA: CULTivating Understanding and Research through Adaptivity

  • Partners: TRINITY COLLEGE DUBLIN, IBM ISRAEL - SCIENCE AND TECHNOLOGY LTD, COMMETRIC EOOD, PINTAIL LTD, UNIVERSITA DEGLI STUDI DI PADOVA, TECHNISCHE UNIVERSITAET GRAZ, SOFIA UNIVERSITY ST KLIMENT OHRIDSKI


slide-6
SLIDE 6

Supervised Text Normalisation

  • Manually created ground truth
    – 500 documents from the 1641 Depositions
    – All words: 205 291
    – Normalised words: 51 133

  • Statistical Machine Translation from historical language to modern language combines:
    – Translation model
    – Language model


slide-8
SLIDE 8

REBELS: REgularities Based Embedding of Language Structures

[Diagram: the historical word "shee" enters the REBELS Translation Model, which outputs scored candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]

Automatic Extraction of Historical Spelling Variations

slide-9
SLIDE 9

Training of The REBELS Translation Model

  • Training pairs from the ground truth:

(shee, she), (maye, may), (she, she), (tyme, time), (saith, says), (have, have), (tho:, thomas), ...

slide-10
SLIDE 10

Training of The REBELS Translation Model

  • Deterministic structure of all historical/modern subwords
  • Each word has several hierarchical decompositions in the DAWG:
    – Hierarchical decomposition of each historical word
    – Hierarchical decomposition of each modern word

slide-11
SLIDE 11

Training of The REBELS Translation Model

  • For each training pair (knowth, knows) we find a mapping between the decompositions
  • We collect statistics about historical subword -> modern subword
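The statistics step can be illustrated with a rough stand-in: the talk maps DAWG-based hierarchical decompositions onto each other, but a plain character-level alignment already shows the kind of subword rewrite counts being collected (the alignment method and all names below are illustrative, not the authors' implementation):

```python
from collections import Counter
from difflib import SequenceMatcher

def subword_rewrites(hist, modern):
    """Extract historical -> modern subword rewrites from one training pair
    via a character-level alignment (a stand-in for the DAWG-based
    hierarchical decompositions used in the talk)."""
    ops = SequenceMatcher(None, hist, modern).get_opcodes()
    return [(hist[i1:i2], modern[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]

def collect_statistics(pairs):
    """Count how often each historical subword rewrites to each modern one."""
    stats = Counter()
    for hist, modern in pairs:
        stats.update(subword_rewrites(hist, modern))
    return stats

pairs = [("shee", "she"), ("maye", "may"), ("tyme", "time"), ("knowth", "knows")]
stats = collect_statistics(pairs)
# e.g. stats[("e", "")] == 2: trailing-e deletion is seen in both shee->she and maye->may
```

Frequent rewrites such as `("e", "")` or `("th", "s")` are exactly the historical spelling variations the model extracts automatically.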

slide-12
SLIDE 12

REBELS: REgularities Based Embedding of Language Structures

[Diagram: the historical word "shee" enters the REBELS Translation Model, which outputs scored candidates: he / -1.89, se / -1.69, she / -9.75, shea / -10.04]

REBELS generates normalisation candidates for unseen historical words

slide-13
SLIDE 13

[Diagram: each word of the historical sentence "shee knowth me" is passed through REBELS separately]

slide-14
SLIDE 14

Combination of REBELS with Statistical Bigram Language Model

relevance score(he knuth my) = REBELS TM(he knuth my) * C_tm + Statistical Language Model(he knuth my) * C_lm

  • Bigram Statistical Model
    – Smoothing: Absolute Discounting, Backing-off
    – Gutenberg English language corpus
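The combination above is a weighted sum of two log-scores per candidate. A toy sketch (the scores and the weights C_TM, C_LM below are made-up numbers; in the system the weights are learned):

```python
# Hypothetical log-scores for candidate normalisations of "he knuth my";
# in the real system they come from the REBELS translation model and a
# bigram language model.
candidates = {
    "he knows me": {"tm": -4.1, "lm": -2.0},
    "he knuth my": {"tm": -0.5, "lm": -9.5},
}

C_TM, C_LM = 1.0, 0.8  # interpolation weights (illustrative values)

def relevance(scores):
    # relevance = REBELS_TM * C_tm + LanguageModel * C_lm
    return scores["tm"] * C_TM + scores["lm"] * C_LM

best = max(candidates, key=lambda c: relevance(candidates[c]))
# "he knuth my" has the better translation score, but the language model
# pulls the combined relevance towards the fluent "he knows me".
```

This is why the weights matter: with C_LM = 0 the implausible verbatim reading would win.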

slide-15
SLIDE 15

Functional Automata

L(C_tm, C_lm) is represented with Functional Automata

slide-16
SLIDE 16

Automatic Construction of Functional Automaton For The Partial Derivative w.r.t. x

L(C_tm, C_lm) is optimised with the Conjugate Gradient method
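The talk represents L and its partial derivatives with automatically constructed functional automata; as a generic stand-in for the optimisation step only, here is a minimal nonlinear conjugate-gradient loop (Fletcher-Reeves with numerical derivatives) run on a toy quadratic in place of the real L(C_tm, C_lm):

```python
def grad(f, x, h=1e-6):
    # Numerical partial derivatives; the talk derives these exactly via
    # functional automata, here plain central finite differences.
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def conjugate_gradient(f, x0, iters=50):
    x = list(x0)
    g = grad(f, x)
    d = [-gi for gi in g]
    for _ in range(iters):
        fx = f(x)
        m = sum(gi * di for gi, di in zip(g, d))  # directional derivative
        t = 1.0
        # Armijo backtracking line search along direction d
        while t > 1e-12 and f([xi + t * di for xi, di in zip(x, d)]) > fx + 1e-4 * t * m:
            t *= 0.5
        x = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(f, x)
        beta = sum(a * a for a in g_new) / max(sum(a * a for a in g), 1e-12)  # Fletcher-Reeves
        d = [-a + beta * di for a, di in zip(g_new, d)]
        g = g_new
    return x

# Toy stand-in for L(C_tm, C_lm): a quadratic with its optimum at (1.0, 2.0).
L = lambda c: (c[0] - 1.0) ** 2 + (c[1] - 2.0) ** 2
c_tm, c_lm = conjugate_gradient(L, [0.0, 0.0])
```

The real objective is not a simple quadratic, which is why the exact derivative automata of the next slide matter, but the outer optimisation loop has this shape.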

slide-17
SLIDE 17

Supervised Text Normalisation

[Architecture diagram: the Ground Truth feeds the Training Module Based on Functional Automata, which trains the REBELS Translation Model; the Search Module Based on Functional Automata uses it to turn Historical text into Normalised text]

slide-18
SLIDE 18

Unsupervised Text Normalisation

[Architecture diagram: Unsupervised Generation of Training Pairs, e.g. (knoweth, knows), trains the REBELS Translation Model; the Search Module Based on Functional Automata uses it to turn Historical text into Normalised text]

slide-19
SLIDE 19

Unsupervised Generation of the Training Pairs

  • We use similarity search to generate training pairs:
    – For each historical word H:
      • If H is a modern word, then generate (H, H); else
      • Find each modern word M at Levenshtein distance 1 from H and generate (H, M). If no modern words are found, then
      • Find each modern word M at distance 2 from H and generate (H, M). If no modern words are found, then
      • Find each modern word M at distance 3 from H and generate (H, M).
    – If more than 6 modern words were generated for H, then do not use the corresponding pairs for training.
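The procedure above can be sketched directly (a minimal illustration: the tiny `lexicon`, the word list, and the function names are made up, and a real system would use indexed similarity search over the modern dictionary rather than a linear scan):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def training_pairs(historical_words, modern_lexicon, max_candidates=6):
    """Generate (historical, modern) pairs by similarity search: exact match
    first, then distance 1, 2, 3; discard a word's pairs entirely when it
    has more than max_candidates modern candidates."""
    pairs = []
    for h in historical_words:
        if h in modern_lexicon:
            pairs.append((h, h))
            continue
        for d in (1, 2, 3):
            cands = [m for m in modern_lexicon if levenshtein(h, m) <= d]
            if cands:
                if len(cands) <= max_candidates:
                    pairs.extend((h, m) for m in cands)
                break  # stop at the smallest distance with any candidate
    return pairs

lexicon = {"she", "see", "may", "time", "says", "he"}
pairs = training_pairs(["shee", "maye", "she", "tyme"], lexicon)
# pairs covers (shee, she), (shee, see), (maye, may), (she, she), (tyme, time)
```

Note that ambiguity is kept: "shee" yields both (shee, she) and (shee, see), and the statistics over many such noisy pairs are what let REBELS separate systematic spelling variation from accidental matches.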



slide-24
SLIDE 24

Normalisation of the 1641 Depositions: Experimental Results

| Method | Generation of REBELS Training Pairs | Spelling Probabilities | Language Model     | Accuracy | BLEU  |
| 1      |                                     |                        |                    | 75.59    | 50.31 |
| 2      | Unsupervised                        | NO                     | YES                | 67.84    | 45.52 |
| 3      | Unsupervised                        | YES                    | NO                 | 79.18    | 56.55 |
| 4      | Unsupervised                        | YES                    | YES                | 81.79    | 61.88 |
| 5      | Unsupervised                        | Supervised Trained     | Supervised Trained | 84.82    | 68.78 |
| 6      | Supervised                          | Supervised Trained     | Supervised Trained | 93.96    | 87.30 |

slide-25
SLIDE 25

Future Improvements

[Architecture diagram: Unsupervised Generation of Training Pairs, e.g. (knoweth, knows), now with probabilities, feeds a MAP Training Module that trains the REBELS Translation Model; the Search Module Based on Functional Automata turns Historical text into Normalised text]

slide-26
SLIDE 26

Thank You! Comments / Questions?

ACKNOWLEDGEMENTS The reported research work is supported by the project CULTURA, grant 269973, funded by the FP7 Programme, and the project AComIn, grant 316087, funded by the FP7 Programme.