Part-of-speech Tagging for Middle English through Alignment and Projection of Parallel Diachronic Texts




slide-1
SLIDE 1

Part-of-speech Tagging for Middle English through Alignment and Projection of Parallel Diachronic Texts

Taesun Moon and Jason Baldridge

Presenter: Yevgeni Berzak

22 February 2010

Text Mining for Historical Documents Non-Standard Language – Adapting NLP Tools

slide-2
SLIDE 2

Annotation of Historical Languages

Annotation: marking texts written in historical languages with linguistic information.

Motivation: diachronic linguistics

  • Language change.
  • Language variation.

Case study: POS tagging for Middle English

slide-3
SLIDE 3

Part of Speech Tagging

  • Sequence labeling task: associate words in context with their syntactic categories.

“In the beginning God created the heavens and the earth.”

slide-5
SLIDE 5

Part of Speech Tagging

  • Sequence labeling task: associate words in context with their syntactic categories.

“In/PREPOSITION the/DETERMINER beginning/NOUN God/NOUN created/VERB the/DETERMINER heavens/NOUN and/CONJUNCTION the/DETERMINER earth/NOUN.”

  • Useful for syntactic parsing, morphological analysis, and many other tasks.
  • Major problem: ambiguity.

slide-6
SLIDE 6

Part of Speech Tagging

How to do it?

  • Use a statistical tagger (n-grams, maximum entropy (ME), transformation-based tagging, …).

Supervised approach:

  • Use a manually annotated training corpus.
  • Train the tagger on this corpus.
  • Apply the tagger to new data.
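The three supervised steps can be sketched with a most-frequent-tag baseline. This is a deliberately simpler stand-in for the statistical taggers named on the slide, not one of them; the toy corpus, the function names, and the NOUN fallback for unseen words are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Step 2: count tags per word and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def apply_tagger(model, words, default_tag="NOUN"):
    """Step 3: tag new words; unseen words get an assumed default tag."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]

# Step 1: a (tiny, hypothetical) manually annotated training corpus.
training_corpus = [
    [("In", "PREPOSITION"), ("the", "DETERMINER"), ("beginning", "NOUN"),
     ("God", "NOUN"), ("created", "VERB"), ("the", "DETERMINER"),
     ("heavens", "NOUN"), ("and", "CONJUNCTION"), ("the", "DETERMINER"),
     ("earth", "NOUN")],
]

model = train_unigram_tagger(training_corpus)
tagged = apply_tagger(model, ["God", "created", "the", "light"])
print(tagged)
```

A baseline like this ignores context entirely, which is exactly where the ambiguity problem from the previous slide bites; the n-gram and maximum entropy taggers add context to resolve it.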

slide-7
SLIDE 7

Part of Speech Tagging

Middle English: 11th to 15th century

“In the bigynnyng God made of nouyt heuene and erthe.”

Challenges in tagging Middle English:

  • Limited amount of machine-readable text.
  • Inconsistent orthography.
  • Grammatical diversity (different genres, periods, dialects, etc.).

slide-8
SLIDE 8

Part of Speech Tagging

How can we induce a tagger for Middle English (or any other historical language)?

slide-9
SLIDE 9

Tagging a Historical Language

First approach

  • Do the same as for modern languages: use manually annotated data to train a tagger.

Problem:

  • Very few annotated resources exist for historical languages.
  • Manual annotation:
    – Costs time, money, and skills.
    – Is error-prone.

slide-10
SLIDE 10

Tagging a Historical Language

  • Second approach: avoid the annotation bottleneck by leveraging existing resources for relevant modern languages.
  • Use parallel corpora: translations of the same text into two languages.
  • Use the tagging of a modern language to approximate the tagging of a historical language (exploiting inherent similarities between the modern and the historical language).

slide-11
SLIDE 11

Tagging Middle English

  • Key idea: exploit parallel annotated corpora of Modern English to tag Middle English.
  • Align the words
  • Project the tags

In/? the/? bigynnyng/? …
In/PREPOSITION the/DETERMINER beginning/NOUN …

  • Train a tagger on this corpus

slide-13
SLIDE 13

Tagging Middle English

  • Key idea: exploit parallel annotated corpora of Modern English to tag Middle English.
  • Align the words
  • Project the tags

In/PREPOSITION the/DETERMINER bigynnyng/NOUN …
In/PREPOSITION the/DETERMINER beginning/NOUN …

  • Train a tagger on this corpus

slide-16
SLIDE 16

Tagging with Alignment & Projection

Question: Which parallel corpus can we use?

Answer: The Bible

  • Existing (electronic) translations for many historical and modern languages.
  • Relatively large: around 900,000 words.
  • Clear separation of verses, which facilitates sentence alignment.

slide-17
SLIDE 17

Tagging with Alignment & Projection

Dice alignment: a word in Middle English is aligned to the word in Modern English that co-occurs with it most often. To license an alignment, a threshold has to be passed.

GIZA++ alignment: off-the-shelf alignment software. Uses the IBM alignment models and HMMs.

slide-18
SLIDE 18

Tagging with Alignment & Projection

Tag projection: project the majority tag of the aligned Modern English word.

Middle English word → Modern English word → Majority tag

Problems:

  1) Alignment and projection are approximations.
  2) Some Middle English words are not aligned and thus don't receive tags.
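A minimal sketch of majority-tag projection, assuming a word-type alignment is already available as a dictionary. The function names and toy data are hypothetical; unaligned words receive `None`, which is the gap described in problem 2.

```python
from collections import Counter, defaultdict

def majority_tags(tagged_modern_tokens):
    """For each Modern English word type, keep its most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_modern_tokens:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def project(middle_words, alignment, majority):
    """Middle English word -> aligned modern word -> majority tag (or None)."""
    return [(w, majority.get(alignment.get(w))) for w in middle_words]

# Hypothetical tagged Modern English tokens and alignment.
tagged_modern = [
    ("God", "NOUN"), ("made", "VERB"), ("made", "VERB"), ("made", "ADJECTIVE"),
]
alignment = {"God": "God", "made": "made"}  # "of" and "nouyt" are unaligned

majority = majority_tags(tagged_modern)
projected = project(["God", "made", "of", "nouyt"], alignment, majority)
print(projected)
```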

slide-19
SLIDE 19

Bigram Tagging

  • Solution for gaps: complete missing tags with a bigram tagger.
  • Bigram tagger: find the most likely tag for a word given the preceding tag.

the/DETERMINER ($t_{i-1}$) bigynnyng ($w_i$)/NOUN ($t_i$)

  • Training: estimate $P(t_i \mid t_{i-1})$ and $P(w_i \mid t_i)$ from corpus counts of successfully projected sequences (smoothing unseen events).
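A bigram tagger of this kind can be sketched as below: transition and emission probabilities are estimated from counts, and a tag sequence is chosen for a sentence. Two choices here are assumptions rather than details from the slides: add-one smoothing for unseen events, and Viterbi decoding to pick the best sequence. The training sentences are toy data.

```python
import math
from collections import Counter

START = "<s>"

def train(tagged_sentences):
    """Estimate smoothed log P(t_i | t_{i-1}) and log P(w_i | t_i) from counts."""
    trans, context, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev, tag] += 1
            context[prev] += 1
            emit[tag, word] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(tag_count)
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 reserves mass for unseen words

    def log_trans(prev, tag):
        return math.log((trans[prev, tag] + 1) / (context[prev] + n_tags))

    def log_emit(tag, word):
        return math.log((emit[tag, word] + 1) / (tag_count[tag] + n_words))

    return tags, log_trans, log_emit

def viterbi(words, tags, log_trans, log_emit):
    """Return the most probable tag sequence for a non-empty list of words."""
    best = {t: (log_trans(START, t) + log_emit(t, words[0]), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(
                ((best[p][0] + log_trans(p, t), best[p][1]) for p in tags),
                key=lambda item: item[0],
            )
            new_best[t] = (score + log_emit(t, word), path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Hypothetical "successfully projected" sequences.
projected_sentences = [
    [("the", "DETERMINER"), ("bigynnyng", "NOUN")],
    [("the", "DETERMINER"), ("erthe", "NOUN")],
    [("God", "NOUN"), ("made", "VERB"), ("the", "DETERMINER"), ("erthe", "NOUN")],
]
tags, log_trans, log_emit = train(projected_sentences)
predicted = viterbi(["the", "heuene"], tags, log_trans, log_emit)
print(predicted)  # ['DETERMINER', 'NOUN']
```

The word "heuene" never occurs in the toy training data, so its tag comes almost entirely from the transition DETERMINER → NOUN, which is how the bigram model fills gaps left by projection.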

slide-20
SLIDE 20

Bigram Tagging

  • Side effect: a bigram tagger for Middle English.
  • Apply the tagger to its training corpus.

→ Retagged Middle English Bible, in which all words have tags.

slide-21
SLIDE 21

Maximum Entropy Tagging

  • Use the output of the bigram tagger to train a more sophisticated tagger: the C&C maximum entropy tagger.
  • It uses many features, including the two previous tags, the two previous and two following words, affixes, etc.
  • The induced C&C tagger can be considered a specialized tagger for Middle English!
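To make the feature list concrete, here is a sketch of a feature extractor for one token position. It does not use or reproduce the C&C tagger's actual feature templates; the function name, the feature names, the padding symbols, and the affix length of 3 are all assumptions.

```python
def extract_features(words, i, previous_tags, affix_len=3):
    """Build a feature dictionary for words[i].

    previous_tags holds the tags already assigned to words[:i].
    """
    pad_words = ["<s>", "<s>"] + [w.lower() for w in words] + ["</s>", "</s>"]
    pad_tags = ["<s>", "<s>"] + list(previous_tags[:i])
    k = i + 2  # position of words[i] inside the padded list
    word = pad_words[k]
    return {
        "word": word,
        "prev_word": pad_words[k - 1],
        "prev2_word": pad_words[k - 2],
        "next_word": pad_words[k + 1],
        "next2_word": pad_words[k + 2],
        "prev_tag": pad_tags[-1],
        "prev2_tag": pad_tags[-2],
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
    }

features = extract_features(
    ["In", "the", "bigynnyng"], 2, ["PREPOSITION", "DETERMINER"]
)
print(features)
```

Affix features are what let such a tagger generalize across inconsistent spellings: "bigynnyng" and a variant spelling can share a suffix feature even when the full word forms differ.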

slide-22
SLIDE 22

Recap

  1) Raw Middle English text.
  2) Align words and project tags from parallel Modern English text → partially tagged Middle English text.
  3) Train and apply the bigram tagger → fully tagged Middle English text (the training corpus for Middle English).
  4) Train the maximum entropy tagger.

Result: taggers for Middle English induced without human effort.

slide-23
SLIDE 23

Evaluation

  • Evaluation corpus: the Penn-Helsinki Parsed Corpus of Middle English (PPCME). Tagged text samples of Middle English from 55 different sources.
  • More than a million words.
  • Includes portions of the Bible.

slide-24
SLIDE 24

Evaluation

Model                                              In domain (PPCME Bible)   Out of domain (PPCME other texts)
C&C trained on Modern English                      56.2%-63.4%               56.2%-62.3%
C&C trained on Middle English projected tagging    78.8%-84.1%               61.3%-67.8%

  • ≈20 percentage points improvement on biblical material.
  • ≈5 percentage points improvement on other Middle English texts.

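
The improvement figures follow from the accuracy ranges in the table by simple subtraction. The sketch below shows a token-level accuracy function and recomputes the gains; the `accuracy` helper and the dictionary layout are illustrative assumptions, and the slide does not say what the two endpoints of each range correspond to, so they are simply compared endpoint by endpoint.

```python
def accuracy(predicted, gold):
    """Token-level tagging accuracy in percent."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)

# Accuracy ranges (low, high) as reported on the slide.
reported = {
    "in_domain": {"modern": (56.2, 63.4), "projected": (78.8, 84.1)},
    "out_of_domain": {"modern": (56.2, 62.3), "projected": (61.3, 67.8)},
}

# Gain in percentage points at each end of the range.
gains = {
    domain: tuple(
        round(p - m, 1) for m, p in zip(scores["modern"], scores["projected"])
    )
    for domain, scores in reported.items()
}
print(gains)  # {'in_domain': (22.6, 20.7), 'out_of_domain': (5.1, 5.5)}
```

These are absolute differences in percentage points, not relative improvements.
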
slide-25
SLIDE 25

Discussion

  • Strong domain effect.
  • Performance within the domain is much better, but still far from the state of the art. Why?
  • If high accuracy is needed, carefully sampled manual annotation is still a reasonable approach.
  • The tagger could be used for semi-automated tagging.

slide-26
SLIDE 26

To Sum Up

  • A reasonably good POS tagger for historical languages can be induced with minimal human effort using alignment and projection of tags from modern languages.
  • The Bible can be a useful resource for adapting NLP tools for historical languages.
  • Linguistic annotation can help us gain insight into language change and variation.