Part-of-Speech Tagging (Berlin Chen, 2003) - PowerPoint PPT Presentation



  1. Part-of-Speech Tagging
     Berlin Chen, 2003
     References:
     1. Speech and Language Processing, chapter 8
     2. Foundations of Statistical Natural Language Processing, chapter 10

  2. Review
     • Tagging (part-of-speech tagging)
       – The process of assigning (labeling) a part-of-speech or other lexical class marker to each word in a sentence (or a corpus)
         • Decide whether each word is a noun, verb, adjective, or whatever
           The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN
       – An intermediate layer of representation of syntactic structure, when compared with syntactic parsing
       – Above 96% accuracy for the most successful approaches

  3. Introduction
     • Parts-of-speech
       – Also known as POS, word classes, lexical tags, morphological classes
     • Tag sets
       – Penn Treebank: 45 word classes used (Marcus et al., 1993)
         • The Penn Treebank is a parsed corpus
       – Brown corpus: 87 word classes used (Francis, 1979)
       – ….
     The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

  4. The Penn Treebank POS Tag Set
     [Table of the Penn Treebank tag set, shown as an image on the original slide]

  5. Disambiguation
     • Resolve the ambiguities and choose the proper tag for the context
     • Most English words are unambiguous (have only one tag), but many of the most common words are ambiguous
       – E.g., "can" can be an (auxiliary) verb or a noun
       – E.g., statistics from the Brown corpus (a quick corpus check is sketched below):
         – 11.5% of word types are ambiguous
         – But 40% of tokens are ambiguous
           (However, the probabilities of the tags associated with a word are not equal → many ambiguous tokens are easy to disambiguate)
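To make the type/token distinction concrete, here is a minimal sketch (not part of the original slides) that recomputes these statistics over NLTK's copy of the Brown corpus. It assumes `nltk` is installed and `nltk.download("brown")` has been run; the exact percentages will differ somewhat from the slide's 11.5%/40% depending on case-folding and tagset details.

```python
from collections import defaultdict

import nltk  # assumes nltk is installed and nltk.download("brown") has been run

tagged = nltk.corpus.brown.tagged_words()

# Collect the set of tags seen for each word type (case-folded)
tags_per_type = defaultdict(set)
for word, tag in tagged:
    tags_per_type[word.lower()].add(tag)

ambiguous_types = {w for w, ts in tags_per_type.items() if len(ts) > 1}
ambiguous_tokens = sum(1 for w, _ in tagged if w.lower() in ambiguous_types)

print(f"ambiguous word types: {len(ambiguous_types) / len(tags_per_type):.1%}")
print(f"ambiguous tokens:     {ambiguous_tokens / len(tagged):.1%}")
```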

  6. Process of POS Tagging
     A String of Words + A Specified Tagset → Tagging Algorithm → A Single Best Tag for Each Word
       Book/VB that/DT flight/NN ./.
       Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/?

  7. POS Tagging Algorithms
     • Fall into one of two classes
     • Rule-based taggers
       – Involve a large database of hand-written disambiguation rules
         • E.g., a rule specifies that an ambiguous word is a noun rather than a verb if it follows a determiner (a toy version of this rule is sketched below)
         • ENGTWOL: a simple rule-based tagger based on the Constraint Grammar architecture
     • Stochastic/probabilistic taggers
       – Use a training corpus to compute the probability of a given word having a given tag in a given context
       – E.g., the HMM tagger chooses the best tag for a given word (maximizing the product of the word likelihood and the tag sequence probability)
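A toy illustration of the determiner rule mentioned above, not ENGTWOL itself: start from a dictionary of possible tags per word, then let a hand-written constraint discard the verb reading after a determiner. The mini-lexicon and tag names here are invented for the example; ENGTWOL's actual constraint formalism is far richer.

```python
# Invented mini-lexicon mapping each word to its possible tags
LEXICON = {
    "the": {"DT"},
    "book": {"NN", "VB"},   # ambiguous: noun or verb
    "that": {"DT", "CS"},
    "flight": {"NN"},
}

def tag_sentence(words):
    tagged = []
    prev_tags = set()
    for w in words:
        candidates = set(LEXICON[w])
        # Constraint: after a determiner, keep the noun and drop the verb reading
        if "DT" in prev_tags and {"NN", "VB"} <= candidates:
            candidates.discard("VB")
        tagged.append((w, candidates))
        prev_tags = candidates
    return tagged

print(tag_sentence(["the", "book"]))  # 'book' is resolved to {'NN'}
```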

  8. POS Tagging Algorithms (cont.)
     • Transformation-based/Brill tagger
       – A hybrid approach
       – Like the rule-based approach, it determines the tag of an ambiguous word based on rules
       – Like the stochastic approach, the rules are automatically induced from a previously tagged training corpus with machine learning techniques

  9. Rule-based POS Tagging
     • Two-stage architecture
       – First stage: use a dictionary to assign each word a list of potential parts-of-speech
       – Second stage: use large lists of hand-written disambiguation rules to winnow down this list to a single part-of-speech for each word
     • An example for the ENGTWOL tagger (a set of 1,100 constraints can be applied to the input sentence):
       Pavlov had shown that salivation …
       Pavlov      PAVLOV N NOM SG PROPER
       had         HAVE V PAST VFIN SVO
                   HAVE PCP2 SVO
       shown       SHOW PCP2 SVOO SVO SV
       that        ADV
                   PRON DEM SG
                   DET CENTRAL DEM SG
                   CS
       salivation  N NOM SG

  10. Rule-based POS Tagging
      • Simple lexical entries in the ENGTWOL lexicon (shown as a table image on the original slide; PCP2 = past participle)

  11. Rule-based POS Tagging
      • Example: disambiguating "that"
        – It isn't that odd!  → that = ADV
        – I consider that odd. → that = Complement (not ADV)

  12. HMM-based Tagging
      • Also called Maximum Likelihood Tagging
        – Pick the most likely tag for each word
      • For a given sentence or word sequence, an N-gram HMM tagger chooses the tag sequence that maximizes the following probability:

        tag_i = argmax P(word_i | tag_i) · P(tag_i | previous n-1 tags)

        where P(word_i | tag_i) is the word/lexical likelihood and P(tag_i | previous n-1 tags) is the tag sequence probability

  13. HMM-based Tagging
      • Assumptions made here
        – Words are independent of each other
          • A word's identity only depends on its tag
        – "Limited horizon" and "time invariant" ("stationary")
          • A word's tag only depends on the previous tag (limited horizon), and the dependency does not change over time (time invariance)
          • Time invariance means the tag dependency won't change as the tag sequence appears at different positions of a sentence

  14. HMM-based Tagging
      • Apply a bigram HMM tagger to choose the best tag for a given word
        – Choose the tag t_i for word w_i that is most probable given the previous tag t_{i-1} and the current word w_i:

          t_i = argmax_j P(t_j | t_{i-1}, w_i)

        – Through some simplifying Markov assumptions:

          t_i = argmax_j P(t_j | t_{i-1}) · P(w_i | t_j)

          (tag sequence probability × word/lexical likelihood)

  15. HMM-based Tagging
      • Apply a bigram HMM tagger to choose the best tag for a given word:

        t_i = argmax_j P(t_j | t_{i-1}, w_i)
            = argmax_j P(t_j, w_i | t_{i-1}) / P(w_i | t_{i-1})   (the denominator is the same for all tags)
            = argmax_j P(t_j, w_i | t_{i-1})
            = argmax_j P(w_i | t_j, t_{i-1}) · P(t_j | t_{i-1})
            = argmax_j P(w_i | t_j) · P(t_j | t_{i-1})            (the probability of a word only depends on its tag)
            = argmax_j P(t_j | t_{i-1}) · P(w_i | t_j)

  16. HMM-based Tagging
      • Example: choose the best tag for a given word (checked in code below)
        Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

        to/TO race/???
        P(VB|TO) = 0.34,  P(race|VB) = 0.00003  →  P(VB|TO) · P(race|VB) ≈ 0.00001
        P(NN|TO) = 0.021, P(race|NN) = 0.00041  →  P(NN|TO) · P(race|NN) ≈ 0.000007

        → race is tagged VB
        (Pretend that the previous word has already been tagged.)
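The arithmetic of this example can be verified directly. A small sketch using only the probabilities quoted on the slide; the table layout and the function name are ours.

```python
# P(t_i | t_{i-1}) and P(w_i | t_i), as given on the slide
P_TAG = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
P_WORD = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}

def best_tag(prev_tag, word, tags=("VB", "NN")):
    """Pick the tag t maximizing P(t | prev_tag) * P(word | t)."""
    return max(tags, key=lambda t: P_TAG[(prev_tag, t)] * P_WORD[(word, t)])

for t in ("VB", "NN"):
    print(t, P_TAG[("TO", t)] * P_WORD[("race", t)])
# VB 1.02e-05 (slide rounds to 0.00001), NN 8.61e-06 (slide rounds to 0.000007)
print(best_tag("TO", "race"))  # -> 'VB'
```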

  17. HMM-based Tagging
      • Apply the bigram HMM tagger to choose the best sequence of tags for a given sentence:

        T̂ = argmax_T P(T | W)
           = argmax_T P(T) · P(W | T) / P(W)
           = argmax_T P(T) · P(W | T)
           = argmax_{t_1,...,t_n} P(t_1, t_2, ..., t_n) · P(w_1, w_2, ..., w_n | t_1, t_2, ..., t_n)
           = argmax_{t_1,...,t_n} ∏_{i=1..n} P(t_i | t_1, ..., t_{i-1}) · P(w_i | w_1, ..., w_{i-1}, t_1, ..., t_n)
           = argmax_{t_1,...,t_n} ∏_{i=1..n} P(t_i | t_1, ..., t_{i-1}) · P(w_i | t_i)   (the probability of a word only depends on its tag)

  18. HMM-based Tagging
      • The Viterbi algorithm for the bigram HMM tagger
        [Trellis diagram: tag states t_1 ... t_J (with initial probabilities π_1 ... π_J) arranged vertically, the word sequence w_1 w_2 ... w_{n-1} w_n horizontally; at each word position the MAX over predecessor states is taken]

  19. HMM-based Tagging
      • The Viterbi algorithm for the bigram HMM tagger (a code sketch follows)

        1. Initialization:  δ_1(k) = π_k · P(w_1 | t_k),  1 ≤ k ≤ J
        2. Induction:       δ_i(j) = max_{1≤k≤J} [δ_{i-1}(k) · P(t_j | t_k)] · P(w_i | t_j),  2 ≤ i ≤ n, 1 ≤ j ≤ J
                            ψ_i(j) = argmax_{1≤k≤J} [δ_{i-1}(k) · P(t_j | t_k)]
        3. Termination:     X_n = argmax_{1≤j≤J} δ_n(j)
                            for i := n-1 down to 1:  X_i = ψ_{i+1}(X_{i+1})
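The recursion above translates almost line-for-line into code. A compact, unsmoothed sketch under the assumption that `pi`, `trans`, and `emit` are plain dictionaries of initial, tag-transition, and word-likelihood probabilities; unseen events default to zero, whereas a real tagger would smooth, as slide 20 notes.

```python
def viterbi(words, tags, pi, trans, emit):
    """Return the best tag path for `words` under a bigram HMM."""
    # delta[i][t]: probability of the best partial tag path ending in tag t at position i
    # psi[i][t]:   the predecessor tag on that best path
    delta = [{t: pi.get(t, 0.0) * emit.get((words[0], t), 0.0) for t in tags}]
    psi = [{}]
    for i in range(1, len(words)):
        delta.append({})
        psi.append({})
        for t in tags:
            prev = max(tags, key=lambda k: delta[i - 1][k] * trans.get((k, t), 0.0))
            delta[i][t] = (delta[i - 1][prev]
                           * trans.get((prev, t), 0.0)
                           * emit.get((words[i], t), 0.0))
            psi[i][t] = prev
    # Termination and backtrace
    last = max(tags, key=lambda t: delta[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(psi[i][path[-1]])
    return path[::-1]

# Tiny usage example built from the slide-16 probabilities (toy values otherwise)
words = ["to", "race"]
tags = ["TO", "VB", "NN"]
pi = {"TO": 1.0}
trans = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
emit = {("to", "TO"): 1.0, ("race", "VB"): 0.00003, ("race", "NN"): 0.00041}
print(viterbi(words, tags, pi, trans, emit))  # -> ['TO', 'VB']
```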

  20. HMM-based Tagging
      • Apply a trigram HMM tagger to choose the best sequence of tags for a given sentence
        – When a trigram model is used:

          T̂ = argmax_{t_1,...,t_n} [ P(t_1) · P(t_2 | t_1) · ∏_{i=3..n} P(t_i | t_{i-2}, t_{i-1}) ] · [ ∏_{i=1..n} P(w_i | t_i) ]

      • Maximum likelihood estimation based on the relative frequencies observed in the pre-tagged training corpus (labeled data):

          P(t_i | t_{i-2}, t_{i-1}) = c(t_{i-2} t_{i-1} t_i) / c(t_{i-2} t_{i-1})
          P(w_i | t_i) = c(w_i, t_i) / c(t_i)

        Smoothing is needed!
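These relative-frequency estimates amount to simple counting over the tagged corpus. A sketch assuming the training data is a list of sentences, each a list of (word, tag) pairs; as the slide warns, no smoothing is applied here, so unseen trigrams get zero probability.

```python
from collections import Counter

def estimate(tagged_sentences):
    """MLE of trigram tag transitions and word likelihoods from labeled data."""
    tri, bi, word_tag, tag = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        for w, t in sent:
            word_tag[(w, t)] += 1
            tag[t] += 1
        for i in range(2, len(tags)):
            tri[(tags[i - 2], tags[i - 1], tags[i])] += 1
            bi[(tags[i - 2], tags[i - 1])] += 1
    # P(t_i | t_{i-2}, t_{i-1}) = c(t_{i-2} t_{i-1} t_i) / c(t_{i-2} t_{i-1})
    p_trans = {k: v / bi[(k[0], k[1])] for k, v in tri.items()}
    # P(w_i | t_i) = c(w_i, t_i) / c(t_i)
    p_emit = {k: v / tag[k[1]] for k, v in word_tag.items()}
    return p_trans, p_emit

p_trans, p_emit = estimate([[("the", "DT"), ("grand", "JJ"), ("jury", "NN")]])
print(p_trans)  # {('DT', 'JJ', 'NN'): 1.0}
```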

  21. HMM-based Tagging
      • Apply a trigram HMM tagger to choose the best sequence of tags for a given sentence
        [Trellis diagram: J copies of the tag states, one per tag history t_1 ... t_J, over the word sequence w_1 w_2 ... w_{n-1} w_n, with the MAX taken at each position]

  22. HMM-based Tagging
      • Probability re-estimation based on unlabeled data
      • The EM (Expectation-Maximization) algorithm is applied
        – Start with a dictionary that lists which tags can be assigned to which words
          • The word likelihood function can then be estimated
          • Tag transition probabilities are set to be equal
        – The EM algorithm learns (re-estimates) the word likelihood function for each tag and the tag transition probabilities (the starting point is sketched below)
      • However, a tagger trained on hand-tagged data worked better than one trained via EM
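A sketch of just the starting point the slide describes: a dictionary constrains which tags each word may take, tag transitions begin uniform, and word likelihoods are spread uniformly over what the dictionary allows. The three-word dictionary is invented for illustration, and the EM (forward-backward) re-estimation itself is omitted.

```python
# Invented tag dictionary: which tags each word may take
TAG_DICT = {"the": ["DT"], "can": ["MD", "NN", "VB"], "race": ["NN", "VB"]}
ALL_TAGS = sorted({t for ts in TAG_DICT.values() for t in ts})

# Tag transition probabilities start out uniform over all tags
trans = {(a, b): 1.0 / len(ALL_TAGS) for a in ALL_TAGS for b in ALL_TAGS}

# Word likelihoods P(w | t): uniform over the words the dictionary allows for each tag
words_for_tag = {t: [w for w, ts in TAG_DICT.items() if t in ts] for t in ALL_TAGS}
emit = {(w, t): 1.0 / len(ws) for t, ws in words_for_tag.items() for w in ws}

# EM (forward-backward) would now iteratively re-estimate `trans` and `emit`
# from raw, untagged text; that step is not shown here.
```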

  23. Transformation-based Tagging
      • Also called Brill tagging
        – An instance of Transformation-Based Learning (TBL)
      • Spirit
        – Like the rule-based approach, TBL is based on rules that specify what tags should be assigned to what words
        – Like the stochastic approach, the rules are automatically induced from the data by machine learning techniques (one transformation pass is sketched below)
      • Note that TBL is a supervised learning technique
        – It assumes a pre-tagged training corpus
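A minimal sketch of applying one learned Brill transformation after an initial most-frequent-tag pass. The rule shown, "change NN to VB when the previous tag is TO", is the classic example from the TBL literature; the rule encoding and function name are ours.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Apply one transformation: from_tag -> to_tag when the previous tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

# Initial tagging (e.g., every word gets its most frequent tag), then one pass
initial = [("to", "TO"), ("race", "NN")]
print(apply_rule(initial, from_tag="NN", to_tag="VB", prev_tag="TO"))
# -> [('to', 'TO'), ('race', 'VB')]
```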
