Natural Language Processing for historical language varieties


slide-1
SLIDE 1

Natural Language Processing for historical language varieties

Cristina Sánchez Marco
Gjøvik University College
MTL lectures

April 3 2013

C. Sánchez Marco, GUC
NLP for historical language varieties
April 3 2013
1 / 28

slide-2
SLIDE 2

NLP and its applications

Natural Language Processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.

Applications: question answering, sentiment analysis, machine translation, information extraction, ...

An example: Information extraction

→ Email:

Subject: curriculum meeting
Date: January 15, 2012
To: Dan

Hi Dan, we’ve now scheduled the curriculum meeting. It will be in Gates 159 tomorrow from 10:00-11:30. -Chris

→ Create new Calendar entry:

Event: Curriculum meeting
Date: Jan-16-2012
Start: 10:00am
End: 11:30am
Where: Gates 159
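The step from email to calendar entry can be sketched in a few lines of Python. This is only an illustration, not a description of any particular system: the function name `extract_event`, the regular expressions, and the rule that "tomorrow" means one day after the Date header are all assumptions made for this sketch. The times are copied as written, without guessing am/pm.

```python
import re
from datetime import datetime, timedelta

EMAIL = """Subject: curriculum meeting
Date: January 15, 2012
To: Dan

Hi Dan, we've now scheduled the curriculum meeting.
It will be in Gates 159 tomorrow from 10:00-11:30. -Chris"""


def extract_event(email: str) -> dict:
    """Pull calendar fields out of a plain-text email (hypothetical helper)."""
    subject = re.search(r"^Subject:\s*(.+)$", email, re.MULTILINE).group(1)
    sent_text = re.search(r"^Date:\s*(.+)$", email, re.MULTILINE).group(1)
    # %B expects English month names (default C/English locale).
    sent = datetime.strptime(sent_text, "%B %d, %Y")
    # Assumption: "tomorrow" is relative to the Date header.
    offset = 1 if re.search(r"\btomorrow\b", email, re.IGNORECASE) else 0
    day = sent + timedelta(days=offset)
    start, end = re.search(r"(\d{1,2}:\d{2})\s*-\s*(\d{1,2}:\d{2})", email).groups()
    # Assumption: a location is a capitalised name followed by a room number.
    where = re.search(r"\bin ([A-Z]\w+ \d+)", email).group(1)
    return {
        "Event": subject.capitalize(),
        "Date": day.strftime("%b-%d-%Y"),
        "Start": start,
        "End": end,
        "Where": where,
    }


event = extract_event(EMAIL)
```

Hand-written patterns like these break as soon as the wording changes, which is one reason such applications build on more general analysis steps.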

slide-3
SLIDE 3

Morphological analysis and tagging

Morphological analysis and tagging is a basic NLP task useful in many applications. Two steps:

Part-of-speech tagging: the process of marking up a word in a text as corresponding to a particular part of speech.
9 parts of speech: noun, verb, adjective, adverb, preposition, determiner, conjunction, pronoun, interjection.
For example, the word the is a determiner.

Lemmatisation: the process of grouping together the different inflected forms of a word so they can be analysed as a single item.
For example, the word better has good as its lemma.

→ Words often have more than one POS (“ambiguity”). The POS tagging problem is to determine the POS tag for a particular instance of a word in a sentence.
→ Uses: text-to-speech (how do we pronounce “lead”?), spelling checkers, as input to or to speed up a full parser, OCR scanning, ...
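A toy lexicon makes the two steps and the ambiguity problem concrete. The entries and the helper names `readings` and `is_ambiguous` below are invented for illustration; a real morphological dictionary is far larger and uses a formal tagset.

```python
# Toy lexicon: surface form -> list of (lemma, part of speech) readings.
# All entries are illustrative assumptions, not taken from a real resource.
LEXICON = {
    "the": [("the", "determiner")],
    "better": [("good", "adjective"), ("well", "adverb")],
    "lead": [("lead", "noun"), ("lead", "verb")],
}


def readings(word: str) -> list:
    """Morphological analysis: return every (lemma, POS) reading of a word."""
    return LEXICON.get(word.lower(), [])


def is_ambiguous(word: str) -> bool:
    """A word is ambiguous when its readings carry more than one POS."""
    return len({pos for _, pos in readings(word)}) > 1
```

The lookup only lists the possible readings. Choosing the right one for a given sentence is the job of the tagger.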

slide-4
SLIDE 4

Morphological analysis and tagging

slide-5
SLIDE 5

An example from the 12th century

slide-6
SLIDE 6

An example from the 12th century

slide-7
SLIDE 7

An example

→ Electronic editions prepared by the Hispanic Seminary of Medieval Studies

[fol. 32v] {CB1.
Ca delo q<ue> mas amaua yal viene el mandado
Dozi[en]tos cauall<er>os mando exir p<r><<i>>uado
Q<ue> Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bie<n> sabe q<ue> albarfanez t<r><<a>>he todo Recabdo

slide-8
SLIDE 8

An example: Paleographic symbols

[fol. 32v] {CB1.
Ca delo q<ue> mas amaua yal viene el mandado
Dozi[en]tos cauall<er>os mando exir p<r><i>uado
Q<ue> Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bie<n> sabe q<ue> albarfanez t<r><a>he todo Recabdo
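Before any linguistic analysis, the editorial symbols have to be resolved into plain word forms. The sketch below assumes that angle and square brackets only wrap letters supplied by the editors (expanded abbreviations and the like), so that dropping the brackets and keeping their content yields the reading form. The function name `expand_symbols` is hypothetical, and folio and column markers such as [fol. 32v] are assumed to have been separated from the verse lines beforehand.

```python
import re


def expand_symbols(line: str) -> str:
    """Drop editorial brackets but keep the letters they enclose."""
    # q<ue> -> que, p<r><<i>>uado -> priuado
    line = re.sub(r"<+([^<>]+)>+", r"\1", line)
    # Dozi[en]tos -> Dozientos
    line = re.sub(r"\[([^\[\]]+)\]", r"\1", line)
    return line


plain = expand_symbols("Dozi[en]tos cauall<er>os mando exir p<r><<i>>uado")
```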

slide-9
SLIDE 9

An example: Spelling

[fol. 32v] {CB1.
Ca delo (de lo) que mas amaua (amaba) yal (ya le) viene el mandado
Dozientos (Doscientos) caualleros (caballeros) mando (mandó) exir priuado (privado)
Que Reçiban (reciban) a myanaya & alas (a las) duenas (dueñas) fijas dalgo (hidalgas)
El (Él) sedie (sería) en valençia (valencia) curiando (curando) & guardando
Ca bien sabe que albarfanez trahe (trae) todo Recabdo (Recaudo)

slide-10
SLIDE 10

An example: Capital letters and punctuation

[fol. 32v] {CB1.
Ca delo que mas amaua yal viene el mandado(,)
Dozientos caualleros mando exir priuado(,)
Que Reçiban a myanaya (Myanaya) & alas duenas fijas dalgo(,)
El sedie en valençia (Valencia) curiando & guardando(,)
Ca bien sabe que albarfanez (Albarfanez) trahe todo Recabdo (recaudo)(.)

slide-11
SLIDE 11

An example: Word order

[fol. 32v] {CB1.
Ca delo que mas amaua yal viene el mandado (Ca yal viene el mandado delo que mas amaua)
Dozientos caualleros mando exir priuado
Que Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bien sabe que albarfanez trahe todo Recabdo

slide-12
SLIDE 12

Challenge and solution

The challenge is to enrich the text with lemma and POS tag.

1. Manually

2. Build a tagger from scratch

3. Use an existing tool for a modern language variety
Advantages: resource saving
Disadvantages: unacceptable accuracy, manual correction

4. Adapt an existing tool for a modern language variety
Advantages: reusable, sustainable, relatively easy to adapt, resource saving, extensible to other language varieties
Disadvantages: Is it easy and resource-saving to do this? Is it easy to adapt to other language varieties?
slide-13
SLIDE 13

Our specific case study and proposal

Solution 4: Adapt the tool

Tool: FreeLing http://nlp.lsi.upc.edu/freeling
Language: (standard) Modern Spanish

Specific advantages of adapting FreeLing:
open-source
well documented and actively maintained
modular, relatively easy to adapt

slide-14
SLIDE 14

FreeLing processing pipeline

[Diagram: raw text → ANALYZER (tokenizer; morphological analysis with dictionary, affixation and probabilities modules) → TAGGER → tagged corpus]
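The flow from raw text to tagged corpus can be mimicked with three small functions. This sketch does not use FreeLing or its API. The dictionary entries, the function names, the `UNKNOWN` tag, and the "pick the first reading" tagger are stand-ins chosen for illustration; a real tagger disambiguates using context.

```python
# Toy dictionary: form -> candidate (lemma, tag) readings (illustrative only).
DICTIONARY = {
    "viene": [("venir", "VMIP3S0")],
    "el": [("el", "DA0MS0"), ("él", "PP3MS000")],
    "mandado": [("mandado", "NCMS000")],
}


def tokenize(text: str) -> list:
    """Tokenizer stand-in: split on whitespace."""
    return text.split()


def analyze(tokens: list) -> list:
    """Analyzer stand-in: attach all dictionary readings to each token."""
    return [(tok, DICTIONARY.get(tok.lower(), [])) for tok in tokens]


def tag(analyzed: list) -> list:
    """Tagger stand-in: keep the first reading; mark unknown words."""
    result = []
    for form, candidates in analyzed:
        lemma, pos = candidates[0] if candidates else (form.lower(), "UNKNOWN")
        result.append((form, lemma, pos))
    return result


def run(text: str) -> str:
    """Chain the three stages and print form/lemma/tag triples."""
    return " ".join("/".join(triple) for triple in tag(analyze(tokenize(text))))
```

Because the stages are separate functions, any one of them can be swapped out without touching the others.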

slide-15
SLIDE 15

Method

Using the existing standard Modern Spanish tool as a basis to create an Old Spanish analyzer:
Expansion of the dictionary
Retraining of the tagger
Modification of other modules: tokenization, affixation

slide-16
SLIDE 16

Data

1. Old Spanish Corpus

2. Gold Standard Corpus

3. (Standard) Modern Spanish Corpus

slide-17
SLIDE 17
1. Old Spanish Corpus

Electronic editions by the Hispanic Seminary of Medieval Studies
Critical editions of the original manuscripts
12th to 16th century Spanish
Representative corpus:
more than 20 million tokens, 470 thousand types
variety of genres (fiction and non-fiction)

→ To expand the dictionary

slide-18
SLIDE 18
2. Gold Standard Corpus

60,000 tokens from the Old Spanish Corpus (50%) and a Modern Spanish tagged corpus (50%)
It mirrors the Old Spanish Corpus in size and text-type distribution

→ To retrain the tagger and carry out the evaluation and error analysis

slide-19
SLIDE 19
3. Standard Modern Spanish Corpus

Corpus LexEsp (Sebastián et al. 2000)
from 1975 to 1995
more than 5 million words
variety of genres

→ baseline performance for the tagger

slide-20
SLIDE 20

Dictionary expansion: Data

556,210 Standard Spanish words (669,121 lemma-tag pairs)
+ 58,435 Old Spanish words
= 614,000 word forms (744,160 lemma-tag pairs)

Distribution of words added to the dictionary:

Verbs          83.4%     Pronouns        1.3%
Nouns          26.8%     Determiners     1%
Adjectives      9.4%     Adverbs         0.7%
Prepositions    2.1%     Conjunctions    0.5%
Numbers         1.7%     Interjections   0.3%
Proper names    1.4%     Punctuation     0.01%

slide-21
SLIDE 21

Dictionary expansion: Method

Mapping rules:

Substring rules (54 sequences of characters): 42% of the words added

Old      Modern    Example
euo      evo       nueuo → nuevo ‘new’
uio      vio       uio → vio ‘saw’
[?]f     [?]ube    nuf → nube ‘cloud’
sp-      esp-      spera → espera ‘wait’

Word rules: 39% of the words added

consul → cónsul ‘consul’
catholica → católica ‘catholic’

VARD 2 (69 spelling rules): 19% of the words added

Old      Modern
j        í
nn       ñ
rr       r
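A minimal sketch of how such mapping rules could be applied: generate modern candidates for an old form and accept one only if the modern dictionary knows it, so that the old form can inherit its analysis. Only rules quoted on the slide are included. The acceptance check, the tiny form-to-lemma dictionary, and all names (`candidates`, `map_to_modern`) are assumptions of this sketch rather than the procedure actually used.

```python
# Rules taken from the examples above; the real rule sets are much larger.
SUBSTRING_RULES = [("euo", "evo"), ("uio", "vio")]
PREFIX_RULES = [("sp", "esp")]
WORD_RULES = {"consul": "cónsul", "catholica": "católica"}

# Toy modern dictionary: form -> lemma (illustrative entries only).
MODERN_LEMMAS = {
    "nuevo": "nuevo",
    "vio": "ver",
    "espera": "esperar",
    "cónsul": "cónsul",
    "católica": "católico",
}


def candidates(old_form: str):
    """Yield possible modern spellings of an old form."""
    word = old_form.lower()
    if word in WORD_RULES:
        yield WORD_RULES[word]
    for old, new in SUBSTRING_RULES:
        if old in word:
            yield word.replace(old, new)
    for old, new in PREFIX_RULES:
        if word.startswith(old):
            yield new + word[len(old):]


def map_to_modern(old_form: str):
    """Return (modern form, lemma) for the first known candidate, else None."""
    for candidate in candidates(old_form):
        if candidate in MODERN_LEMMAS:
            return candidate, MODERN_LEMMAS[candidate]
    return None
```

Checking candidates against the modern dictionary keeps a rule from producing non-words. Forms that no rule covers simply return `None` and stay out of the dictionary.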

slide-22
SLIDE 22

Retraining of the tagger

Use of the Gold Standard Corpus

Two taggers:
Hybrid (relax), integrating statistical and hand-coded grammatical rules
Hidden Markov Model (hmm), trigram Markovian tagger based on TnT (Brants, 2000)
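To show what an HMM tagger computes, here is a small Viterbi decoder. It is deliberately simplified: it uses bigram transitions instead of the trigram model named above, the probabilities are invented rather than estimated from a corpus, and the tag names and the `floor` smoothing constant are assumptions of this sketch.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely tag sequence under a bigram HMM, computed in log space."""
    def lp(table, key):
        # Unseen events get a small floor probability instead of zero.
        return math.log(table.get(key, floor))

    # best[i][t] = (log-probability of the best path ending in t, previous tag)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, words[0])), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + lp(trans_p, (p, t)), p) for p in tags)
            column[t] = (score + lp(emit_p, (t, word)), prev)
        best.append(column)

    # Follow the back-pointers from the best final tag.
    current = max(tags, key=lambda t: best[-1][t][0])
    path = [current]
    for column in reversed(best[1:]):
        current = column[current][1]
        path.append(current)
    return path[::-1]


# Invented toy model for the determiner/clitic ambiguity of "la".
TAGS = ["DET", "PRON", "NOUN", "VERB"]
START = {"DET": 0.5, "PRON": 0.3, "NOUN": 0.1, "VERB": 0.1}
TRANS = {("DET", "NOUN"): 0.9, ("PRON", "VERB"): 0.9}
EMIT = {("DET", "la"): 0.5, ("PRON", "la"): 0.5,
        ("NOUN", "casa"): 0.9, ("VERB", "vio"): 0.9}

the_house = viterbi(["la", "casa"], TAGS, START, TRANS, EMIT)
saw_her = viterbi(["la", "vio"], TAGS, START, TRANS, EMIT)
```

The same word receives a different tag depending on what follows it. Retraining on annotated Old Spanish text would let the model learn such context patterns for the historical variety.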

slide-23
SLIDE 23

Accuracy

C0: original tools for standard Modern Spanish (baseline)
C1-hmm: expanded dict. + modules + hmm trained tagger (60,000-token gold standard corpus)
C1-relax: expanded dict. + modules + relax trained tagger (60,000-token gold standard corpus)

            Lemma    PoS-1    PoS-2
C0          72.4     70.9     77.4
C1-hmm      95.8     90.1     95.3
C1-relax    95.8     92.6     95.7
SS          99.1     94       97.6

→ PoS-1: whole label. E.g. viene VMIP3S0
→ PoS-2: word class. E.g. viene VMIP3S0
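The two PoS measures differ only in how much of the tag has to match. A minimal sketch, assuming that the word class is encoded in the first character of the tag (the `prefix` parameter is a choice of this sketch), with invented gold and predicted tags:

```python
def accuracy(gold, predicted, prefix=None):
    """Percentage of tokens whose tags agree.

    prefix=None compares whole labels; prefix=1 compares only the first
    character, taken here to stand for the word class (an assumption).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    hits = sum(g[:prefix] == p[:prefix] for g, p in zip(gold, predicted))
    return 100.0 * hits / len(gold)


# Invented example: the last verb has the wrong person and tense.
gold = ["VMIP3S0", "DA0MS0", "NCMS000", "VMIS3S0"]
predicted = ["VMIP3S0", "DA0MS0", "NCMS000", "VMIP1S0"]

whole_label = accuracy(gold, predicted)           # 75.0
word_class = accuracy(gold, predicted, prefix=1)  # 100.0
```

A tagger can therefore score higher on word class than on the whole label: an error in person or tense still counts as a correct verb.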

slide-24
SLIDE 24

Error analysis

100 most frequent errors in the tagging, under condition C1-relax

81.6% of the errors are due to ambiguity:
determiner vs clitic readings of la, las ‘the/it’
first vs third person singular (queria ‘I/he wanted’)
accentuation (llego ‘arrive/arrived’) (standard: llego/llegó)

18.4% of the errors are words missing from the dictionary:
proper nouns (pierres, antolinez)
words not covered by any mapping rule (coita ‘wish’) (standard: cuita)

slide-25
SLIDE 25

An example from the 12th century

Ca delo que mas amaua yal viene el mandado
Dozientos caualleros mando exir priuado
Que Reçiban a myanaya & alas duenas fijas dalgo
El sedie en valençia curiando & guardando
Ca bien sabe que albarfanez trahe todo Recabdo

slide-26
SLIDE 26

An example from the 12th century: Tagged text

Ca/ca/CS delo de/de/SPS00 delo el/el/DA0NS0 que/que/PR0CN000 mas/más/RG amaua/amar/VMII3S0 yal/yal/AQ0CS0 viene/venir/VMIP3S0 el/el/DA0MS0 mandado/mandado/NCMS000
Dozientos/200/Z caualleros/caballero/NCMP000 mando/mandar/VMIP1S0 exir/salir/VMN0000 priuado/privar/VMP00SM
Que/que/PR0CN000 Reçiban/recibir/VMSP3P0 a/a/SPS00 myanaya/myanaya/NCFS000 &/y/CC alas a/a/SPS00 alas el/el/DA0FP0 duenas/duenas/AQ0FP0 fijas/fijo/AQ0FP0 dalgo/dalgo/NCMS000
El/él/PP3MS000 sedie/ser/VSIC3S0 en/en/SPS00 valençia/valencia/NCFS000 curiando/curiando/VMG0000 &/y/CC guardando/guardar/VMG0000
Ca/ca/CS bien/bien/RG sabe/saber/VMIP3S0 que/que/CS albarfanez/albarfanez/NCFS000 trahe/traer/VMIP3S0 todo/todo/DI0MS0 Recabdo/recaudo/NCMS000
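Output in this form/lemma/tag layout is easy to read back for further processing. The parser below is a sketch with hypothetical names (`Token`, `parse_tagged`). It assumes that items without slashes, such as the bare delo before de/de/SPS00, repeat the surface form of a contraction that was split, and it skips them.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    tag: str


def parse_tagged(text: str) -> list:
    """Read whitespace-separated form/lemma/tag items into Token tuples."""
    tokens = []
    for item in text.split():
        parts = item.split("/")
        if len(parts) == 3:  # skip bare surface forms of split contractions
            tokens.append(Token(*parts))
    return tokens


sample = "Ca/ca/CS delo de/de/SPS00 delo el/el/DA0NS0 viene/venir/VMIP3S0"
tokens = parse_tagged(sample)
verb_lemmas = [t.lemma for t in tokens if t.tag.startswith("V")]
```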

slide-27
SLIDE 27

Conclusion & Further work

Simple and general method to adapt an existing tool for Modern standard Spanish in order to annotate Old Spanish
Advantages: resource-saving, sustainable, relatively easy

Benefit from:
similarities between historical and modern standard language varieties
existing tools
existing philological editions of old manuscripts

The quality of the tagging is close to that of state-of-the-art taggers (but it can still be improved!)
The greatest improvement is obtained when the lexicon is expanded with non-standard variants
Method extensible to other non-standard language varieties

→ Other NLP applications can be built to deal with historical language varieties: syntactic and semantic analyzers, ...

slide-28
SLIDE 28

Thanks!

Thanks also to Gemma Boleda, Lluís Padró, Josep Maria Fontana, Eva Bofias...

Questions?
