SLIDE 1

Corpora Methods Norma Tool Evaluation

(Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool

Marcel Bollmann

Department of Linguistics Ruhr-University Bochum, Germany

Second Workshop on Annotation of Corpora for Research in the Humanities November 29, 2012, Lisbon, Portugal

Marcel Bollmann (Semi-)Automatic Normalization of Historical Texts

SLIDE 3

Motivation

The problem with historical data...
  • High variance in spelling
  • Difficult to annotate with tools aimed at modern data, e.g. POS taggers
  • Little or no training data to (re-)train annotation tools

A possible solution...
  • Pre-processing data to “modernize” spelling
  • Normalization as the process of mapping historical spellings to their modern equivalents

SLIDE 4

Outline

1. Corpora
     Anselm Corpus
     Luther Bible
2. Methods
     Wordlist Mapping
     Rule-Based Normalization
     Distance-Based Normalization
3. Norma Tool
     Overview
     Description
     Example
4. Evaluation
     Procedure
     Results

SLIDE 5

Corpora

Anselm Corpus

  • Collection of Early New High German (ENHG) texts
  • “Interrogatio Sancti Anselmi de Passione Domini” (Questions by Saint Anselm about the Lord’s Passion)
  • More than 50 manuscripts and prints (in German)
  • 14th–16th centuries
  • Various German dialects

Sample from an Anselm manuscript

SLIDE 6

Corpora

Anselm Corpus

Goals
  • Lemmatization, POS tagging
  • Paragraph, sentence, and word alignment
  • Digital edition

Method
  • Normalization of historical wordforms to modern ones
      • Allows the use of already-existing tools
      • Simplifies wordform queries in a resulting corpus

SLIDE 7

Corpora

Anselm Corpus

ENHG1   do meín chind híet geezzen· ...
ENHG2   da myn kínt hatte geſzen ...
ENHG3   do mín kínt hatt geſſen ...
Norm    da mein kind hatte gegessen ...
        as my child had eaten

SLIDE 8

Corpora

Luther Bible

Bible translation by Martin Luther

1545 version and a modernized equivalent Freely available on the web: http://www.sermon-online.de/

Extraction of 550,000 alignment pairs

Randomly split into development/training/evaluation corpus

→ Large test corpus for normalization

SLIDE 9

Methods

Comparison of different normalization methods:

1. Wordlist mapping
2. Rule-based normalization
     Character rewrite rules
3. Distance-based normalization
     (Weighted) Levenshtein distance

SLIDE 10

Methods

Wordlist Mapping

  • Word-to-word mappings
  • Learned from an aligned corpus
  • Chooses most frequent candidate wordform
  • No knowledge about spelling variation

Example
  do   → da     50
  meín → mein   30
  myn  → mein   30
  mín  → mein   30
  ...
  hatt → hatte  50
  hatt → hat    20
  hatt → hut     1
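
A minimal sketch of wordlist mapping, assuming the aligned corpus is available as (historical, modern) pairs. The names train_wordlist and map_word are illustrative and not taken from Norma; the toy pairs reproduce the counts for hatt from the example.

```python
from collections import Counter, defaultdict


def train_wordlist(pairs):
    """Count how often each historical form is aligned to each modern form."""
    counts = defaultdict(Counter)
    for historical, modern in pairs:
        counts[historical][modern] += 1
    return counts


def map_word(counts, word):
    """Return the most frequent modern form, or None for an unseen word."""
    if word not in counts:
        return None  # no knowledge about spelling variation: unseen means no answer
    return counts[word].most_common(1)[0][0]


# Toy training data mirroring the counts on the slide (hypothetical corpus).
pairs = (
    [("hatt", "hatte")] * 50
    + [("hatt", "hat")] * 20
    + [("hatt", "hut")] * 1
    + [("do", "da")] * 50
)
mapping = train_wordlist(pairs)
hatt_result = map_word(mapping, "hatt")
unseen_result = map_word(mapping, "chind")
```

Here hatt is mapped to hatte because that pairing is the most frequent one, while the unseen form chind gets no candidate at all.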

SLIDE 11

Methods

Rule-Based Normalization

  • “Context-aware” character rewrite rules

      v → u / # _ n

      v n d
      ↓
      u n d

  • Learned from aligned training corpus
      • Levenshtein distance: minimum number of edit operations to transform string a into string b
      • Modified algorithm: outputs the actual edit operations
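
One way to obtain the edit operations rather than just their count is to backtrace through the Levenshtein table. The following is a sketch of that idea, not the implementation used in the talk; edit_operations is an illustrative name, and an empty string stands for ε.

```python
def edit_operations(a, b):
    """Return (distance, operations) for transforming string a into string b.

    Each operation is a (source, target) pair; "" marks an insertion or deletion.
    """
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # identity or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right cell and record the operations taken.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append((a[i - 1], ""))
            i -= 1
        else:
            ops.append(("", b[j - 1]))
            j -= 1
    ops.reverse()
    return dist[n][m], ops


vnd_distance, vnd_ops = edit_operations("vnd", "und")
```

For vnd and und this yields one substitution (v, u) followed by two identity operations, which is the raw material for rewrite rules.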

SLIDE 13

Methods

Rule-Based Normalization

  • Substitution rules   v → u / # _ n
  • Identity rules       n → n / e _ #
  • Insertion rules      ε → l / o _ l
  • Deletion rules       f → ε / u _ f
  → Identity and non-identity rules intended to “compete”
  • Additional lexicon lookup to prevent nonsense words
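
The rule notation can be illustrated by reading rules off an aligned word pair. This sketch assumes that both context characters are taken from the historical (source) side and that # marks a word boundary; the slides do not spell out how context is defined, so treat that as an assumption. The names rules_from_alignment and format_rule are illustrative.

```python
def rules_from_alignment(ops):
    """Turn aligned (source, target) character pairs into context rules.

    ops uses "" for an empty side (insertion or deletion).
    Returns tuples (source, target, left_context, right_context).
    """
    source_side = [src for src, _ in ops]
    rules = []
    for k, (src, tgt) in enumerate(ops):
        # Nearest non-empty source character on each side, else the boundary "#".
        left = next((s for s in reversed(source_side[:k]) if s), "#")
        right = next((s for s in source_side[k + 1:] if s), "#")
        rules.append((src or "ε", tgt or "ε", left, right))
    return rules


def format_rule(rule):
    """Render a rule in the notation used on the slide."""
    src, tgt, left, right = rule
    return f"{src} → {tgt} / {left} _ {right}"


# Hand-written alignments (hypothetical examples chosen to match the rule types).
substitution = rules_from_alignment([("v", "u"), ("n", "n"), ("d", "d")])
deletion = rules_from_alignment([("u", "u"), ("f", ""), ("f", "f")])
insertion = rules_from_alignment([("w", "w"), ("o", "o"), ("", "l"), ("l", "l")])
```

Under these assumptions the three alignments produce, among others, the substitution, deletion, and insertion rules listed on the slide; the remaining operations become identity rules.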

SLIDE 15

Methods

Distance-Based Normalization

  • Levenshtein distance: count number of edit operations
      myn → mein    d = 2
  • Weighted Levenshtein distance
      • Assigns weights to edit operations, e.g., d(‘y’, ‘ei’) = 0.8
      • Edit operations are directed/asymmetric
      • Edit operations may span multiple characters
      myn → mein    d = 0.8
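
A sketch of a weighted Levenshtein distance with directed, multi-character operations. It assumes that every single-character insertion, deletion, or substitution costs 1 unless a cheaper weighted operation applies; that default, the function name weighted_distance, and the WEIGHTS table are assumptions for illustration, with only the weight 0.8 for y → ei taken from the slide.

```python
def weighted_distance(source, target, weights):
    """Distance from source to target using extra weighted edit operations.

    weights maps (source_substring, target_substring) to a cost. The mapping is
    directed: a weight for ("y", "ei") says nothing about ("ei", "y").
    """
    n, m = len(source), len(target)
    inf = float("inf")
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            here = dist[i][j]
            if i < n:  # delete one source character
                dist[i + 1][j] = min(dist[i + 1][j], here + 1.0)
            if j < m:  # insert one target character
                dist[i][j + 1] = min(dist[i][j + 1], here + 1.0)
            if i < n and j < m:  # identity (free) or plain substitution
                step = 0.0 if source[i] == target[j] else 1.0
                dist[i + 1][j + 1] = min(dist[i + 1][j + 1], here + step)
            # Weighted operations may consume several characters on either side.
            for (src, tgt), weight in weights.items():
                if source.startswith(src, i) and target.startswith(tgt, j):
                    k, l = i + len(src), j + len(tgt)
                    dist[k][l] = min(dist[k][l], here + weight)
    return dist[n][m]


WEIGHTS = {("y", "ei"): 0.8}
forward = weighted_distance("myn", "mein", WEIGHTS)
backward = weighted_distance("mein", "myn", WEIGHTS)
```

With the single weighted operation, myn → mein costs 0.8, while the reverse direction falls back to two unit-cost edits, which shows the asymmetry.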

SLIDE 16

Methods

Distance-Based Normalization

  • Find lexicon entry with lowest distance to input string

      Input:    myn
      Lexicon:  ... main mein meine meins mine mini mimik ...
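
The lookup can be sketched as a search for the lexicon entries with minimal distance. This example uses plain (unweighted) Levenshtein distance as a stand-in and a brute-force scan of the lexicon; nearest_entries is an illustrative name, and a real lexicon would call for a more efficient search.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # identity or substitution
            ))
        previous = current
    return previous[-1]


def nearest_entries(word, lexicon, distance=levenshtein):
    """Return the lowest distance and all lexicon entries that reach it."""
    scored = [(distance(word, entry), entry) for entry in lexicon]
    best = min(score for score, _ in scored)
    return best, [entry for score, entry in scored if score == best]


lexicon = ["main", "mein", "meine", "meins", "mine", "mini", "mimik"]
best_distance, candidates = nearest_entries("myn", lexicon)
```

With unweighted distance, four entries tie at distance 2 (main, mein, mine, mini), so the input alone does not single out mein. Passing a weighted distance function as the distance argument is one way to break such ties.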

SLIDE 19

Methods

  • Which normalization method is “best”?
  • Does a combination of methods work better?
  • Many other algorithms for normalization
      • Ernst-Gerlach & Fuhr (2006), Hauser & Schulz (2007): Information Retrieval (IR) on historical texts
      • Baron & Rayson (2009): focus on Early Modern English
      • Jurish (2010): evaluated as IR task
  → Results not easily comparable!

SLIDE 21

Norma Tool

Overview

Norma: an interactive normalization tool

Key features
  • Automatic and semi-automatic modes
  • Easy extensibility (with regard to normalization algorithms)
  • Support for dynamically trainable normalization methods

Current limitations
  • No token context considered
  • Command-line interface only

SLIDE 22

Norma Tool

Description

[Diagram: Input (historical word form) → Norm. 1, Norm. 2, ..., Norm. n → generated word form → Validation → validated word form → Training]

SLIDE 29

Norma Tool

Example

  > chind
  kind (0.62) ?

1. Generate normalization candidate
     • Wordlist substitution: no mapping found
     • Rule-based normalizer: generates kind with probability score 0.62
2. User confirms kind as correct
3. Retrain normalizer modules
     • Wordlist substitution: learns mapping chind → kind
     • Rule-based normalizer: adds rules to its ruleset
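
The generate, confirm, retrain cycle can be sketched as a chain of normalizer objects. This is not Norma's actual interface: the classes, the method names normalize and train, and the toy prefix rule ch → k are all hypothetical, and probability scores are left out.

```python
class WordlistNormalizer:
    """Looks up exact word-to-word mappings learned from validated pairs."""

    def __init__(self):
        self.mapping = {}

    def normalize(self, word):
        return self.mapping.get(word)  # None means "no mapping found"

    def train(self, word, validated):
        self.mapping[word] = validated


class PrefixRuleNormalizer:
    """Toy stand-in for a rule-based normalizer (hypothetical, prefix rules only)."""

    def __init__(self, rules):
        self.rules = dict(rules)

    def normalize(self, word):
        for old, new in self.rules.items():
            if word.startswith(old):
                return new + word[len(old):]
        return None

    def train(self, word, validated):
        pass  # a real module would derive new rules from the validated pair


def normalize_chain(word, normalizers):
    """Ask each normalizer in turn; the first candidate wins."""
    for normalizer in normalizers:
        candidate = normalizer.normalize(word)
        if candidate is not None:
            return candidate, type(normalizer).__name__
    return word, None  # nothing applied: keep the word unchanged


def confirm_and_train(word, validated, normalizers):
    """Feed a user-validated pair back into every trainable module."""
    for normalizer in normalizers:
        normalizer.train(word, validated)


chain = [WordlistNormalizer(), PrefixRuleNormalizer({"ch": "k"})]
first = normalize_chain("chind", chain)    # wordlist is empty, rule module answers
confirm_and_train("chind", "kind", chain)  # user confirms kind as correct
second = normalize_chain("chind", chain)   # wordlist now knows chind → kind
```

On the first pass the candidate comes from the rule module; after confirmation, the same input is answered directly by the wordlist.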

SLIDE 30

Norma Tool

Comparison to VARD 2

VARD 2 normalization tool
  • http://www.comp.lancs.ac.uk/~barona/vard2/
  • Graphical user interface
  • Combination of normalization methods

Disadvantages
  • Designed for Early Modern English
  • Not completely customizable
  → Problematic especially for phonetic matching

SLIDE 31

Evaluation

Procedure

Compare different normalization methods
  1. On their own
  2. In combination

Luther Bible
  • Training part: 218,504 tokens
  • Evaluation part: 109,258 tokens

Anselm text
  • Training part: 500 tokens
  • Evaluation part: 4,020 tokens

SLIDE 32

Evaluation

Procedure

  • Wordlist mapping & rule-based normalization
      → Automatically trained
  • Weighted Levenshtein distance (MultiWLD)
      → Weights defined manually
      • Inspection of first 500 tokens as basis
  • Methods were trained separately for both texts

SLIDE 33

Evaluation

Results

Method                    Luther               Anselm
Baseline                   71,163 (65.13%)     1,512 (37.61%)
Mapper                    101,170 (92.60%)     2,448 (60.90%)
Rule-based                 98,767 (90.40%)     2,530 (62.94%)
MultiWLD                   96,510 (88.33%)     2,730 (67.91%)
Levenshtein                87,619 (80.19%)     2,161 (53.76%)
Mapper → Rules → WLD      102,193 (93.53%)     2,947 (73.31%)
Mapper → Rules → Leven    102,160 (93.50%)     2,793 (69.48%)
Mapper → WLD              102,131 (93.48%)     2,960 (73.63%)
Mapper → Leven            101,867 (93.24%)     2,705 (67.29%)
VARD 2                     99,632 (91.19%)     2,789 (69.38%)
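
The percentages appear to be the token counts divided by the size of the evaluation part (109,258 tokens for Luther, 4,020 for Anselm). A small check under that assumption; accuracy is an illustrative helper name.

```python
LUTHER_EVAL_TOKENS = 109_258
ANSELM_EVAL_TOKENS = 4_020


def accuracy(correct, total):
    """Share of tokens counted for a method, as a percentage with two decimals."""
    return round(100.0 * correct / total, 2)


baseline_luther = accuracy(71_163, LUTHER_EVAL_TOKENS)
baseline_anselm = accuracy(1_512, ANSELM_EVAL_TOKENS)
best_anselm = accuracy(2_960, ANSELM_EVAL_TOKENS)  # Mapper → WLD
```

The computed values agree with the table entries for these rows, which supports reading each cell as a count followed by its share of the evaluation tokens.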

SLIDE 34

Conclusion

Interactive normalization tool Norma
  • Flexible, easily extensible
  • Support for trainable normalization algorithms

Different normalization methods
  • Rule-based normalization: better with larger training corpora
  • Weighted Levenshtein distance: better with smaller training corpora
  • Simple wordlist mapping is always helpful

Normalization profits even from small amounts of training data

SLIDE 35

Thank you for listening!
