Part-of-Speech Tagging - Berlin Chen 2005 - PowerPoint PPT Presentation


  1. Part-of-Speech Tagging. Berlin Chen, 2005. References: 1. Speech and Language Processing, chapter 8; 2. Foundations of Statistical Natural Language Processing, chapter 10.

  2. Review
  • Tagging (part-of-speech tagging)
    – The process of assigning (labeling) a part-of-speech or other lexical class marker to each word in a sentence (or a corpus)
      • Decide whether each word is a noun, verb, adjective, or some other class:
        The/AT representative/NN put/VBD chairs/NNS on/IN the/AT table/NN
        or
        The/AT representative/JJ put/NN chairs/VBZ on/IN the/AT table/NN
    – An intermediate layer of representation of syntactic structure
      • When compared with syntactic parsing
    – Above 96% accuracy for the most successful approaches
  • Tagging can be viewed as a kind of syntactic disambiguation

  3. Introduction
  • Parts-of-speech
    – Also known as POS, word classes, lexical tags, morphological classes
  • Tag sets
    – Penn Treebank: 45 word classes used (Marcus et al., 1993)
      • Penn Treebank is a parsed corpus
    – Brown corpus: 87 word classes used (Francis, 1979)
    – ...
  • Example: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

  4. The Penn Treebank POS Tag Set

  5. Disambiguation
  • Resolve the ambiguities and choose the proper tag for the context
  • Most English words are unambiguous (have only one tag), but many of the most common words are ambiguous
    – E.g.: "can" can be an (auxiliary) verb or a noun
    – E.g.: statistics of the Brown corpus
      • 11.5% of word types are ambiguous
      • But 40% of word tokens are ambiguous
      • However, the probabilities of the tags associated with a word are not equal, P(t_1|w) ≠ P(t_2|w) ≠ ..., so many ambiguous tokens are easy to disambiguate
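  The Brown-corpus figures above can be reproduced approximately with a short script. The sketch below is not from the slides; it uses NLTK's tagged Brown corpus, and the exact percentages depend on case normalization and on which tag set is loaded.

    # Count ambiguous word types vs. ambiguous word tokens in the Brown corpus.
    from collections import defaultdict
    import nltk
    nltk.download("brown", quiet=True)
    from nltk.corpus import brown

    tagged = brown.tagged_words()
    tags_per_type = defaultdict(set)
    for word, tag in tagged:
        tags_per_type[word.lower()].add(tag)

    ambiguous_types = {w for w, tags in tags_per_type.items() if len(tags) > 1}
    type_ratio = len(ambiguous_types) / len(tags_per_type)

    tokens = [w.lower() for w, _ in tagged]
    token_ratio = sum(w in ambiguous_types for w in tokens) / len(tokens)

    print(f"ambiguous word types:  {type_ratio:.1%}")
    print(f"ambiguous word tokens: {token_ratio:.1%}")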

  6. Process of POS Tagging
  • Input: a string of words and a specified tagset; the tagging algorithm outputs a single best tag for each word
    – Book that flight .              →  VB DT NN .
    – Does that flight serve dinner ? →  VBZ DT NN VB NN ?
  • Two information sources are used:
    – Syntagmatic information (looking at information about tag sequences)
    – Lexical information (predicting a tag based on the word concerned)

  7. POS Tagging Algorithms Fall into One of Two Classes
  • Rule-based Tagger
    – Involves a large database of handcrafted disambiguation rules
      • E.g., a rule specifies that an ambiguous word is a noun rather than a verb if it follows a determiner
      • ENGTWOL: a simple rule-based tagger based on the constraint grammar architecture
  • Stochastic/Probabilistic Tagger
    – Also called a model-based tagger
    – Uses a training corpus to compute the probability of a given word having a given tag in a given context
      • E.g., "a new play": P(NN|JJ) ≈ 0.45, P(VBP|JJ) ≈ 0.0005
    – E.g., the HMM tagger chooses the best tag for a given word (maximizing the product of the word likelihood and the tag sequence probability); see the sketch below
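  To make the two probability tables concrete, here is a minimal sketch with a toy hand-tagged corpus (hypothetical data, not the corpus behind the ≈0.45 figure). It estimates P(tag | previous tag) and P(word | tag) by maximum likelihood and scores the two readings of "play" after an adjective.

    from collections import Counter, defaultdict

    tagged_corpus = [  # hypothetical training data, one (word, tag) pair per token
        [("a", "DT"), ("new", "JJ"), ("play", "NN")],
        [("they", "PRP"), ("play", "VBP"), ("chess", "NN")],
        [("a", "DT"), ("good", "JJ"), ("play", "NN")],
    ]

    transition = defaultdict(Counter)   # counts of tag bigrams
    emission = defaultdict(Counter)     # counts of tag -> word
    for sent in tagged_corpus:
        prev = "<s>"
        for word, tag in sent:
            transition[prev][tag] += 1
            emission[tag][word.lower()] += 1
            prev = tag

    def p_tag(tag, prev):               # MLE estimate of P(tag | prev)
        total = sum(transition[prev].values())
        return transition[prev][tag] / total if total else 0.0

    def p_word(word, tag):              # MLE estimate of P(word | tag)
        total = sum(emission[tag].values())
        return emission[tag][word.lower()] / total if total else 0.0

    # Score the two readings of "play" after the adjective tag JJ:
    for tag in ("NN", "VBP"):
        print(tag, p_tag(tag, "JJ") * p_word("play", tag))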

  8. POS Tagging Algorithms (cont.)
  • Transformation-based/Brill Tagger
    – A hybrid approach
    – Like the rule-based approach, it determines the tag of an ambiguous word based on rules
    – Like the stochastic approach, the rules are automatically induced from a previously tagged training corpus using machine learning techniques
      • Supervised learning (a sketch of applying one such transformation follows below)
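  As an illustration of how a learned transformation is applied to the output of an initial tagger: the rule below is the classic "change NN to VB when the previous tag is TO" textbook example; nothing is actually learned in this sketch.

    def apply_transformation(tagged, from_tag, to_tag, prev_tag):
        """Rewrite tags according to a single contextual transformation rule."""
        out = list(tagged)
        for i in range(1, len(out)):
            word, tag = out[i]
            if tag == from_tag and out[i - 1][1] == prev_tag:
                out[i] = (word, to_tag)
        return out

    # Initial (most-frequent-tag) tagging of "expected to race tomorrow":
    initial = [("expected", "VBN"), ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
    print(apply_transformation(initial, from_tag="NN", to_tag="VB", prev_tag="TO"))
    # -> [('expected', 'VBN'), ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]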

  9. Rule-based POS Tagging
  • Two-stage architecture
    – First stage: use a dictionary to assign each word a list of potential parts-of-speech
    – Second stage: use large lists of hand-written disambiguation rules to winnow down this list to a single part-of-speech for each word
  • An example for the ENGTWOL tagger: "Pavlov had shown that salivation ..."
    – Pavlov       PAVLOV N NOM SG PROPER
    – had          HAVE V PAST VFIN SVO (preterit)
                   HAVE PCP2 SVO (past participle)
    – shown        SHOW PCP2 SVOO SVO SV
    – that         ADV
                   PRON DEM SG
                   DET CENTRAL DEM SG
                   CS (complementizer)
    – salivation   N NOM SG
  • A set of 1,100 constraints can then be applied to the input sentence (a toy sketch of the two-stage idea follows below)
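  A toy sketch of the two-stage idea, with a hypothetical mini-lexicon and one hand-written constraint standing in for the real ENGTWOL resources:

    LEXICON = {              # stage 1: each word maps to its possible tags
        "that": {"ADV", "PRON", "DET", "CS"},
        "salivation": {"N"},
        "shown": {"PCP2"},
    }

    def eliminate_that_adv(words, candidates, i):
        """Discard the ADV reading of 'that' unless sentence-final punctuation
        appears two positions later (a crude stand-in for the real constraint)."""
        if words[i] == "that" and not (i + 2 < len(words) and words[i + 2] in {".", "!"}):
            candidates[i].discard("ADV")

    def tag(words):
        candidates = [set(LEXICON.get(w, {"N"})) for w in words]   # stage 1
        for i in range(len(words)):                                # stage 2
            eliminate_that_adv(words, candidates, i)
        return candidates

    print(tag(["shown", "that", "salivation"]))
    # 'that' keeps PRON/DET/CS here; ADV is ruled out by the constraint.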

  10. Rule-based POS Tagging (cont.)
  • Simple lexical entries in the ENGTWOL lexicon (e.g., the PCP2 reading marks a past participle)

  11. Rule-based POS Tagging (cont.)
  • Example: disambiguating "that" with the adverbial-"that" constraint
    – It isn't that odd!   ("that" is tagged ADV: it modifies the following adjective)
    – I consider that odd. ("that" is not an adverb here; "consider" takes an adjectival complement)

  12. HMM-based Tagging
  • Also called Maximum Likelihood Tagging
    – Pick the most-likely tag for a word
  • For a given sentence or word sequence, an HMM tagger chooses the tag sequence that maximizes the following probability. For a word at position i:

      tag_i = argmax_j P(word_i | tag_j) · P(tag_j | previous n-1 tags)

    – P(word_i | tag_j): word/lexical likelihood
    – P(tag_j | previous n-1 tags): tag sequence probability
    – This is an N-gram HMM tagger

  13. HMM-based Tagging (cont.)
  • For a word w_i at position i, follow Bayes' theorem:

      t_i = argmax_j P(t_j | w_i, t_{i-1}, t_{i-2}, ..., t_1)
          = argmax_j P(w_i, t_j | t_{i-1}, t_{i-2}, ..., t_1) / P(w_i | t_{i-1}, t_{i-2}, ..., t_1)
          = argmax_j P(w_i, t_j | t_{i-1}, t_{i-2}, ..., t_1)        (the denominator is the same for all tags j)
          = argmax_j P(w_i | t_j, t_{i-1}, t_{i-2}, ..., t_1) · P(t_j | t_{i-1}, t_{i-2}, ..., t_1)
          ≈ argmax_j P(w_i | t_j) · P(t_j | t_{i-1}, t_{i-2}, ..., t_{i-n+1})

  14. HMM-based Tagging (cont.)
  • Assumptions made here
    – Words are independent of each other
      • A word's identity only depends on its tag
    – "Limited Horizon" and "Time Invariant" ("Stationary")
      • Limited Horizon: a word's tag only depends on the previous few tags
      • Time Invariant: the tag dependency does not change as the tag sequence appears at different positions of a sentence
  • These assumptions do not model long-distance relationships well (e.g., Wh-extraction, ...)

  15. HMM-based Tagging (cont.)
  • Apply the bigram-HMM tagger to choose the best tag for a given word
    – Choose the tag t_i for word w_i that is most probable given the previous tag t_{i-1} and the current word w_i:

        t_i = argmax_j P(t_j | t_{i-1}, w_i)

    – Through some simplifying Markov assumptions:

        t_i = argmax_j P(t_j | t_{i-1}) · P(w_i | t_j)

      where P(t_j | t_{i-1}) is the tag sequence probability and P(w_i | t_j) is the word/lexical likelihood (a greedy left-to-right sketch follows below)
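  A greedy left-to-right sketch of this decision rule. The probability tables here are made-up illustrative numbers, not corpus estimates.

    # Greedy per-word decision: t_i = argmax_j P(t_j | t_{i-1}) * P(w_i | t_j)
    P_TAG = {                      # P(tag | previous tag), hypothetical values
        ("<s>", "VB"): 0.10, ("<s>", "NN"): 0.20,
        ("DT", "NN"): 0.50, ("DT", "VB"): 0.01,
    }
    P_WORD = {                     # P(word | tag), hypothetical values
        ("VB", "book"): 0.004, ("NN", "book"): 0.001,
        ("NN", "flight"): 0.002, ("VB", "flight"): 0.00001,
    }
    TAGSET = ["VB", "NN", "DT"]

    def best_tag(prev_tag, word):
        """Pick the tag maximizing transition probability times word likelihood."""
        return max(TAGSET,
                   key=lambda t: P_TAG.get((prev_tag, t), 0.0) * P_WORD.get((t, word), 0.0))

    print(best_tag("<s>", "book"))   # -> 'VB' with these numbers
    print(best_tag("DT", "flight"))  # -> 'NN' with these numbers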

  16. HMM-based Tagging (cont.)
  • Apply the bigram-HMM tagger to choose the best tag for a given word:

      t_i = argmax_j P(t_j | t_{i-1}, w_i)
          = argmax_j P(t_j, w_i | t_{i-1}) / P(w_i | t_{i-1})     (the denominator is the same for all tags)
          = argmax_j P(t_j, w_i | t_{i-1})
          = argmax_j P(w_i | t_j, t_{i-1}) · P(t_j | t_{i-1})
          = argmax_j P(w_i | t_j) · P(t_j | t_{i-1})              (the probability of a word only depends on its tag)
          = argmax_j P(t_j | t_{i-1}) · P(w_i | t_j)

  17. HMM-based Tagging (cont.)
  • Example: choose the best tag for a given word
      Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
      to/TO race/???
    – P(VB|TO) = 0.34,  P(race|VB) = 0.00003  →  P(VB|TO) · P(race|VB) ≈ 0.00001
    – P(NN|TO) = 0.021, P(race|NN) = 0.00041  →  P(NN|TO) · P(race|NN) ≈ 0.000007
    – Pretend that the previous word has already been tagged
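  A quick arithmetic check of the comparison (the second product works out closer to 0.0000086 than the rounded value above):

    p_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB) ~ 1.0e-05
    p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN) ~ 8.6e-06
    print("VB" if p_vb > p_nn else "NN")   # -> VB, matching the reference tagging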

  18. HMM-based Tagging (cont.)
  • Apply the bigram-HMM tagger to choose the best sequence of tags for a given sentence:

      T̂ = argmax_T P(T | W)
         = argmax_T P(T) · P(W | T) / P(W)
         = argmax_T P(T) · P(W | T)
         = argmax_{t_1,t_2,...,t_n} P(t_1, t_2, ..., t_n) · P(w_1, w_2, ..., w_n | t_1, t_2, ..., t_n)
         = argmax_{t_1,t_2,...,t_n} ∏_{i=1}^{n} [ P(t_i | t_1, t_2, ..., t_{i-1}) · P(w_i | t_1, t_2, ..., t_n) ]
         = argmax_{t_1,t_2,...,t_n} ∏_{i=1}^{n} [ P(t_i | t_{i-m+1}, ..., t_{i-1}) · P(w_i | t_i) ]

    – Assumptions: words are independent of each other, and a word's identity only depends on its tag
    – Tag M-gram assumption for P(t_i | t_{i-m+1}, ..., t_{i-1})

  19. HMM-based Tagging (cont.)
  • The Viterbi algorithm for the bigram-HMM tagger
    – States: the distinct tags t_1, ..., t_J
    – Observations: the input words w_1, ..., w_n, each generated by a state
    – [Trellis figure: tag states stacked vertically with initial probabilities π_j, one column per word position 1..n; a MAX operation selects the best predecessor state at each step]

  20. HMM-based Tagging (cont.)
  • The Viterbi algorithm for the bigram-HMM tagger

      1. Initialization:  δ_1(j) = π_j · P(w_1 | t_j),  1 ≤ j ≤ J,  where π_j = P(t_j)
      2. Induction:       δ_i(j) = max_{1≤k≤J} [ δ_{i-1}(k) · P(t_j | t_k) ] · P(w_i | t_j),  2 ≤ i ≤ n,  1 ≤ j ≤ J
                          ψ_i(j) = argmax_{1≤k≤J} [ δ_{i-1}(k) · P(t_j | t_k) ]
      3. Termination:     X_n = argmax_{1≤j≤J} δ_n(j)
                          for i := n-1 down to 1:  X_i = ψ_{i+1}(X_{i+1})

    (a Python sketch of this recursion follows below)
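  A compact sketch of this recursion. The probability dictionaries and the toy numbers in the usage example are assumptions for illustration, not values from the slides; there is no smoothing, so unseen events simply get probability 0.

    def viterbi(words, tagset, pi, p_tag, p_word):
        """Return the most probable tag sequence for `words`.

        pi[t]          : P(t), initial tag probability
        p_tag[(a, b)]  : P(b | a), tag transition probability
        p_word[(t, w)] : P(w | t), word likelihood
        """
        n = len(words)
        # Initialization: delta_1(j) = pi_j * P(w_1 | t_j)
        delta = [{t: pi.get(t, 0.0) * p_word.get((t, words[0]), 0.0) for t in tagset}]
        psi = [{}]
        for i in range(1, n):                      # induction over word positions
            delta.append({})
            psi.append({})
            for t in tagset:
                best_prev = max(tagset,
                                key=lambda k: delta[i - 1][k] * p_tag.get((k, t), 0.0))
                delta[i][t] = (delta[i - 1][best_prev] * p_tag.get((best_prev, t), 0.0)
                               * p_word.get((t, words[i]), 0.0))
                psi[i][t] = best_prev
        tags = [max(tagset, key=lambda t: delta[n - 1][t])]   # termination
        for i in range(n - 1, 0, -1):                          # follow back-pointers
            tags.insert(0, psi[i][tags[0]])
        return tags

    # Toy run with made-up probabilities (illustrative only):
    pi = {"VB": 0.2, "NN": 0.3, "DT": 0.5}
    p_tag = {("VB", "DT"): 0.5, ("DT", "NN"): 0.6, ("NN", "DT"): 0.1,
             ("DT", "VB"): 0.01, ("VB", "NN"): 0.2, ("NN", "NN"): 0.2}
    p_word = {("VB", "book"): 0.004, ("NN", "book"): 0.001,
              ("DT", "that"): 0.05, ("NN", "flight"): 0.002}
    print(viterbi(["book", "that", "flight"], ["VB", "NN", "DT"], pi, p_tag, p_word))
    # -> ['VB', 'DT', 'NN'] with these numbers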
