

  1. Language Processing with Perl and Prolog
     Chapter 8: Part-of-Speech Tagging Using Stochastic Techniques
     Pierre Nugues, Lund University
     Pierre.Nugues@cs.lth.se
     http://cs.lth.se/pierre_nugues/

  2. POS Annotation with Statistical Methods
     Modeling the problem as a noisy channel:
       t_1, t_2, t_3, \ldots, t_n \rightarrow \text{noisy channel} \rightarrow w_1, w_2, w_3, \ldots, w_n.
     The optimal part-of-speech sequence is
       \hat{T} = \operatorname*{argmax}_{t_1, t_2, t_3, \ldots, t_n} P(t_1, t_2, t_3, \ldots, t_n \mid w_1, w_2, w_3, \ldots, w_n).
     With Bayes' rule on conditional probabilities, P(A \mid B) P(B) = P(B \mid A) P(A), this becomes
       \hat{T} = \operatorname*{argmax}_T P(T) P(W \mid T).
     P(T) and P(W \mid T) are simplified and estimated on hand-annotated corpora, the "gold standard".
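
Spelled out, the Bayes step runs as follows; the denominator P(W) can be dropped from the argmax because it does not depend on the candidate tag sequence T:

    \hat{T} = \operatorname*{argmax}_T P(T \mid W)
            = \operatorname*{argmax}_T \frac{P(T)\, P(W \mid T)}{P(W)}
            = \operatorname*{argmax}_T P(T)\, P(W \mid T).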

  3. The First Term: N-Gram Approximation
     P(T) = P(t_1, t_2, t_3, \ldots, t_n) \approx P(t_1) P(t_2 \mid t_1) \prod_{i=3}^{n} P(t_i \mid t_{i-2}, t_{i-1}).
     If we use a start-of-sentence delimiter <s>, the first two terms of the product, P(t_1) P(t_2 \mid t_1), are rewritten as P(<s>) P(t_1 \mid <s>) P(t_2 \mid <s>, t_1), where P(<s>) = 1.
     We estimate the probabilities with maximum likelihood, P_{MLE}:
       P_{MLE}(t_i \mid t_{i-2}, t_{i-1}) = \frac{C(t_{i-2}, t_{i-1}, t_i)}{C(t_{i-2}, t_{i-1})}.
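
To make the estimation step concrete, here is a minimal Perl sketch of the trigram counts and of P_MLE, in the spirit of the book's Perl examples. The @tags array is a hypothetical tag sequence read from a hand-annotated corpus, padded with two <s> markers; the name p_mle_tri is ours.

    use strict;
    use warnings;

    # Count tag trigrams and bigrams; @tags is a hypothetical tag
    # sequence read from a hand-annotated corpus, padded with two
    # start-of-sentence markers.
    my @tags = ('<s>', '<s>', 'pro', 'art', 'verb');

    my (%trigrams, %bigrams);
    for my $i (2 .. $#tags) {
        $trigrams{"$tags[$i-2] $tags[$i-1] $tags[$i]"}++;
        $bigrams{"$tags[$i-2] $tags[$i-1]"}++;
    }

    # P_MLE(t_i | t_{i-2}, t_{i-1}) = C(t_{i-2}, t_{i-1}, t_i) / C(t_{i-2}, t_{i-1})
    sub p_mle_tri {
        my ($t2, $t1, $t) = @_;
        return 0 unless $bigrams{"$t2 $t1"};
        return ($trigrams{"$t2 $t1 $t"} // 0) / $bigrams{"$t2 $t1"};
    }

    print p_mle_tri('<s>', 'pro', 'art'), "\n";   # prints 1 on this toy sequence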

  4. Sparse Data
     If N_p is the number of different part-of-speech tags, there are N_p \times N_p \times N_p values to estimate.
     If data is missing, we can back off to bigrams:
       P(T) = P(t_1, t_2, t_3, \ldots, t_n) \approx P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}),
     or to unigrams:
       P(T) = P(t_1, t_2, t_3, \ldots, t_n) \approx \prod_{i=1}^{n} P(t_i).
     Finally, we can combine these approximations linearly:
       P_{LinearInter}(t_i \mid t_{i-2}, t_{i-1}) = \lambda_1 P(t_i \mid t_{i-2}, t_{i-1}) + \lambda_2 P(t_i \mid t_{i-1}) + \lambda_3 P(t_i),
     with \lambda_1 + \lambda_2 + \lambda_3 = 1, for example, \lambda_1 = 0.6, \lambda_2 = 0.3, \lambda_3 = 0.1.
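
A corresponding Perl sketch of the linear interpolation; the three arguments stand for the trigram, bigram, and unigram MLE estimates, computed from counts as above:

    use strict;
    use warnings;

    # Combine trigram, bigram, and unigram estimates with weights
    # that sum to 1 (0.6, 0.3, 0.1 as on the slide).
    my @lambda = (0.6, 0.3, 0.1);

    # $p3 = P(t_i | t_{i-2}, t_{i-1}), $p2 = P(t_i | t_{i-1}),
    # $p1 = P(t_i), all estimated from a corpus.
    sub p_interpolated {
        my ($p3, $p2, $p1) = @_;
        return $lambda[0] * $p3 + $lambda[1] * $p2 + $lambda[2] * $p1;
    }

    print p_interpolated(0.05, 0.2, 0.1), "\n";   # prints 0.1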

  5. The Second Term
     The probability of the word sequence given the part-of-speech sequence is usually approximated as
       P(W \mid T) = P(w_1, w_2, w_3, \ldots, w_n \mid t_1, t_2, t_3, \ldots, t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i).
     Like the previous probabilities, P(w_i \mid t_i) is estimated from hand-annotated corpora using maximum likelihood:
       P_{MLE}(w_i \mid t_i) = \frac{C(w_i, t_i)}{C(t_i)}.
     For N_w different words, there are N_p \times N_w values to obtain, but in this case, many of the estimates will be 0.
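
The lexical model can be estimated the same way. A minimal sketch, assuming @pairs holds the [word, tag] pairs of the annotated corpus (the toy pairs below are illustrative only):

    use strict;
    use warnings;

    # Count (word, tag) pairs; @pairs is a hypothetical list of
    # [word, tag] pairs read from the hand-annotated corpus.
    my @pairs = (['je', 'pro'], ['le', 'art'], ['donne', 'verb']);

    my (%word_tag, %tag);
    for my $pair (@pairs) {
        my ($w, $t) = @$pair;
        $word_tag{"$w $t"}++;
        $tag{$t}++;
    }

    # P_MLE(w_i | t_i) = C(w_i, t_i) / C(t_i); unseen pairs get 0,
    # which is the sparse-data problem mentioned above.
    sub p_word_given_tag {
        my ($w, $t) = @_;
        return 0 unless $tag{$t};
        return ($word_tag{"$w $t"} // 0) / $tag{$t};
    }

    print p_word_given_tag('je', 'pro'), "\n";   # prints 1 on this toy corpus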

  6. An Example
     Je le donne 'I give it'
     The word lattice is:
       je/PRO   le/ART   donne/VERB
                le/PRO   donne/NOUN
     It yields four candidate tag sequences:
       1. P(pro \mid \emptyset) \times P(art \mid \emptyset, pro) \times P(verb \mid pro, art) \times P(je \mid pro) \times P(le \mid art) \times P(donne \mid verb)
       2. P(pro \mid \emptyset) \times P(art \mid \emptyset, pro) \times P(noun \mid pro, art) \times P(je \mid pro) \times P(le \mid art) \times P(donne \mid noun)
       3. P(pro \mid \emptyset) \times P(pro \mid \emptyset, pro) \times P(verb \mid pro, pro) \times P(je \mid pro) \times P(le \mid pro) \times P(donne \mid verb)
       4. P(pro \mid \emptyset) \times P(pro \mid \emptyset, pro) \times P(noun \mid pro, pro) \times P(je \mid pro) \times P(le \mid pro) \times P(donne \mid noun)
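
The four products can be computed by brute-force enumeration, as in this Perl sketch. The %p_tri and %p_lex figures are toy assumptions, not corpus estimates, and <s> plays the role of the ∅ start context. Enumeration does not scale: the number of sequences grows exponentially with the sentence length, which motivates the Viterbi algorithm of the next slides.

    use strict;
    use warnings;

    my @words   = ('je', 'le', 'donne');
    my %lexicon = (je => ['pro'], le => ['art', 'pro'],
                   donne => ['verb', 'noun']);

    # P(t_i | t_{i-2}, t_{i-1}) keyed as "t_{i-2} t_{i-1} t_i"; toy values.
    my %p_tri = ('<s> <s> pro' => 0.5, '<s> pro art' => 0.4,
                 '<s> pro pro' => 0.1, 'pro art verb' => 0.5,
                 'pro art noun' => 0.3, 'pro pro verb' => 0.4,
                 'pro pro noun' => 0.2);
    # P(w_i | t_i) keyed as "w_i t_i"; toy values.
    my %p_lex = ('je pro' => 0.1, 'le art' => 0.4, 'le pro' => 0.1,
                 'donne verb' => 0.05, 'donne noun' => 0.02);

    # Enumerate every tag sequence and keep the most probable one.
    my ($best_p, $best_seq) = (0, '');
    for my $t1 (@{$lexicon{je}}) {
        for my $t2 (@{$lexicon{le}}) {
            for my $t3 (@{$lexicon{donne}}) {
                my $p = ($p_tri{"<s> <s> $t1"} // 0)
                      * ($p_tri{"<s> $t1 $t2"} // 0)
                      * ($p_tri{"$t1 $t2 $t3"} // 0)
                      * ($p_lex{"je $t1"}    // 0)
                      * ($p_lex{"le $t2"}    // 0)
                      * ($p_lex{"donne $t3"} // 0);
                ($best_p, $best_seq) = ($p, "$t1 $t2 $t3") if $p > $best_p;
            }
        }
    }
    print "$best_seq ($best_p)\n";   # pro art verb (0.0002) on these toy figures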

  7. Viterbi (Informal)
     Je le donne demain dans la matinée 'I give it tomorrow in the morning'
     The word lattice is:
       je/PRO   le/ART   donne/VERB   demain/ADV   dans/PREP   la/ART   matinée/NOUN
                le/PRO   donne/NOUN                            la/PRO

  8. Viterbi (Informal)
     The term brought by the word demain still retains the memory of the ambiguity of donne:
       P(adv \mid verb) \times P(demain \mid adv) and P(adv \mid noun) \times P(demain \mid adv).
     This is no longer the case with dans. According to the noisy channel model and the bigram assumption, the term brought by the word dans is
       P(dans \mid prep) \times P(prep \mid adv).
     It does not show the ambiguity of le and donne, and the subsequent terms will ignore it as well. We can therefore discard the corresponding paths: the optimal path does not contain nonoptimal subpaths.
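
Here is a minimal Perl sketch of Viterbi decoding under the bigram assumption of this slide. The %p_trans and %p_lex tables stand for P(t_i | t_{i-1}) and P(w_i | t_i); their values are toy assumptions. At each word, only the best path reaching each tag survives, which is exactly the pruning argued for above.

    use strict;
    use warnings;

    # Toy bigram model: P(t_i | t_{i-1}) keyed as "t_{i-1} t_i".
    my %p_trans = ('<s> pro' => 0.5, 'pro art' => 0.3, 'pro pro' => 0.2,
                   'art verb' => 0.4, 'art noun' => 0.5,
                   'pro verb' => 0.4, 'pro noun' => 0.3);
    # Toy lexical model: P(w_i | t_i) keyed as "w_i t_i".
    my %p_lex = ('je pro' => 0.1, 'le art' => 0.4, 'le pro' => 0.1,
                 'donne verb' => 0.05, 'donne noun' => 0.02);
    my @tagset = ('pro', 'art', 'noun', 'verb');
    my @words  = ('je', 'le', 'donne');

    # $delta{$t}: probability of the best path ending in tag $t;
    # $back[$i]{$t}: best predecessor of tag $t at position $i.
    my (%delta, @back);
    $delta{'<s>'} = 1;
    for my $i (0 .. $#words) {
        my %new_delta;
        for my $t (@tagset) {
            my $lex = $p_lex{"$words[$i] $t"} // 0;
            next unless $lex;
            for my $prev (keys %delta) {
                my $score = $delta{$prev} * ($p_trans{"$prev $t"} // 0) * $lex;
                if (!exists $new_delta{$t} || $score > $new_delta{$t}) {
                    $new_delta{$t} = $score;
                    $back[$i]{$t}  = $prev;
                }
            }
        }
        %delta = %new_delta;
    }

    # Follow the back pointers from the best final tag.
    my ($best) = sort { $delta{$b} <=> $delta{$a} } keys %delta;
    my @path = ($best);
    for (my $i = $#words; $i > 0; $i--) {
        unshift @path, $back[$i]{$path[0]};
    }
    print "@path\n";   # pro art verb on these toy figures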

  9. Supervised Learning: A Summary
     - Needs a manually annotated corpus called the Gold Standard.
     - The Gold Standard may contain errors (errare humanum est) that we ignore.
     - A classifier is trained on one part of the corpus, the training set, and evaluated on another part, the test set, where the automatic annotation is compared with the Gold Standard.
     - N-fold cross-validation is used to avoid the influence of a particular division; see the sketch after this list.
     - Some algorithms may require additional optimization on a development set.
     - Classifiers can use statistical or symbolic methods.
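
A minimal Perl sketch of the N-fold scheme; @sentences and evaluate() are placeholders for the annotated corpus and for a train-and-score routine:

    use strict;
    use warnings;

    my @sentences = ();   # placeholder: the annotated corpus, one sentence per element
    my $n = 10;           # number of folds

    # Placeholder: a real evaluate() would train a tagger on @$train
    # and return its tagging accuracy on @$test.
    sub evaluate { my ($train, $test) = @_; return 0 }

    my @accuracies;
    for my $fold (0 .. $n - 1) {
        my (@train, @test);
        for my $i (0 .. $#sentences) {
            if ($i % $n == $fold) { push @test,  $sentences[$i] }
            else                  { push @train, $sentences[$i] }
        }
        push @accuracies, evaluate(\@train, \@test);
    }
    my $mean = 0;
    $mean += $_ for @accuracies;
    print 'Mean accuracy: ', $mean / $n, "\n";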
