
Natural Language Processing (CSE 490U): Featurized Language Models



1. Natural Language Processing (CSE 490U): Featurized Language Models. Noah Smith, © 2017 University of Washington. nasmith@cs.washington.edu. January 9, 2017.

2. What's wrong with n-grams? Data sparseness: most histories and most words will be seen only rarely (if at all).

3. What's wrong with n-grams? Data sparseness: most histories and most words will be seen only rarely (if at all). Next central idea: teach histories and words how to share.

4. Log-Linear Models: Definitions. We define a conditional log-linear model $p(Y \mid X)$ as follows:
   ◮ $\mathcal{Y}$ is the set of events/outputs (for language modeling, $\mathcal{V}$)
   ◮ $\mathcal{X}$ is the set of contexts/inputs (for n-gram language modeling, $\mathcal{V}^{n-1}$)
   ◮ $\boldsymbol{\phi}: \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}^d$ is a feature vector function
   ◮ $\mathbf{w} \in \mathbb{R}^d$ are the model parameters
   $$p_{\mathbf{w}}(Y = y \mid X = x) = \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y)}{\sum_{y' \in \mathcal{Y}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y')}$$

5. Breaking It Down.
   $$p_{\mathbf{w}}(Y = y \mid X = x) = \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y)}{\sum_{y' \in \mathcal{Y}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y')}$$

6. Breaking It Down.
   $$p_{\mathbf{w}}(Y = y \mid X = x) = \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y)}{\sum_{y' \in \mathcal{Y}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y')}$$
   Linear score: $\mathbf{w} \cdot \boldsymbol{\phi}(x, y)$.

7. Breaking It Down.
   $$p_{\mathbf{w}}(Y = y \mid X = x) = \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y)}{\sum_{y' \in \mathcal{Y}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y')}$$
   Linear score: $\mathbf{w} \cdot \boldsymbol{\phi}(x, y)$. Nonnegative: $\exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y)$.

8. Breaking It Down.
   $$p_{\mathbf{w}}(Y = y \mid X = x) = \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y)}{\sum_{y' \in \mathcal{Y}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y')}$$
   Linear score: $\mathbf{w} \cdot \boldsymbol{\phi}(x, y)$. Nonnegative: $\exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y)$. Normalizer: $\sum_{y' \in \mathcal{Y}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y') = Z_{\mathbf{w}}(x)$.

9. Breaking It Down.
   $$p_{\mathbf{w}}(Y = y \mid X = x) = \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y)}{\sum_{y' \in \mathcal{Y}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y')}$$
   Linear score: $\mathbf{w} \cdot \boldsymbol{\phi}(x, y)$. Nonnegative: $\exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y)$. Normalizer: $\sum_{y' \in \mathcal{Y}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(x, y') = Z_{\mathbf{w}}(x)$.
   "Log-linear" comes from the fact that $\log p_{\mathbf{w}}(Y = y \mid X = x) = \mathbf{w} \cdot \boldsymbol{\phi}(x, y) - \log Z_{\mathbf{w}}(x)$, where $\log Z_{\mathbf{w}}(x)$ is constant in $y$. This is an instance of the family of generalized linear models.
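
The pieces above translate almost directly into code. Below is a minimal sketch of the conditional log-linear distribution, assuming a hypothetical feature function `phi(x, y)` that returns a sparse dict of feature values and a weight dict `w`; it illustrates the definition, nothing more.

```python
import math

# Minimal sketch of a conditional log-linear model p_w(Y = y | X = x).
# `phi` and `w` are assumed/hypothetical: phi(x, y) returns a dict
# {feature_name: value}, and w maps feature names to real-valued weights.

def score(w, phi, x, y):
    """Linear score w . phi(x, y)."""
    return sum(w.get(k, 0.0) * v for k, v in phi(x, y).items())

def prob(w, phi, x, y, outputs):
    """p_w(Y = y | X = x) = exp(score) / Z_w(x), with Z_w summed over `outputs`."""
    Z = sum(math.exp(score(w, phi, x, yp)) for yp in outputs)  # normalizer Z_w(x)
    return math.exp(score(w, phi, x, y)) / Z
```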

10. The Geometric View. Suppose we have instance $x$, $\mathcal{Y} = \{y_1, y_2, y_3, y_4\}$, and there are only two features, $\phi_1$ and $\phi_2$.
   [Figure: the feature vectors $\boldsymbol{\phi}(x, y_1), \ldots, \boldsymbol{\phi}(x, y_4)$ plotted as points in the $(\phi_1, \phi_2)$ plane.]

11. The Geometric View. Suppose we have instance $x$, $\mathcal{Y} = \{y_1, y_2, y_3, y_4\}$, and there are only two features, $\phi_1$ and $\phi_2$.
   [Figure: the same four points, together with the line $\mathbf{w} \cdot \boldsymbol{\phi} = w_1 \phi_1 + w_2 \phi_2 = 0$.]

12. The Geometric View. Suppose we have instance $x$, $\mathcal{Y} = \{y_1, y_2, y_3, y_4\}$, and there are only two features, $\phi_1$ and $\phi_2$.
   [Figure: the same four points and weight vector.] The resulting ranking: $p(y_3 \mid x) > p(y_1 \mid x) > p(y_4 \mid x) > p(y_2 \mid x)$.

13. The Geometric View. Suppose we have instance $x$, $\mathcal{Y} = \{y_1, y_2, y_3, y_4\}$, and there are only two features, $\phi_1$ and $\phi_2$.
   [Figure: the same four points in the $(\phi_1, \phi_2)$ plane.]

14. The Geometric View. Suppose we have instance $x$, $\mathcal{Y} = \{y_1, y_2, y_3, y_4\}$, and there are only two features, $\phi_1$ and $\phi_2$.
   [Figure: the same four points.] The resulting ranking: $p(y_3 \mid x) > p(y_1 \mid x) > p(y_2 \mid x) > p(y_4 \mid x)$.
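
As a concrete (made-up) instance of this picture, the toy snippet below places four feature vectors in the plane and ranks the outputs by their linear scores under one choice of weights; with these numbers the ranking comes out $y_3, y_1, y_4, y_2$, and a different $\mathbf{w}$ reorders the same points.

```python
# Toy illustration of the geometric view (all numbers are made up):
# each output y is represented by the point phi(x, y) = (phi_1, phi_2),
# and the weight vector w ranks outputs by the dot product w . phi(x, y).
points = {"y1": (1.0, 2.0), "y2": (2.0, -1.0), "y3": (0.5, 3.0), "y4": (2.5, 0.5)}
w = (0.2, 1.0)  # a different w yields a different ranking of the same points

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

ranking = sorted(points, key=lambda y: dot(w, points[y]), reverse=True)
print(ranking)  # ['y3', 'y1', 'y4', 'y2']: higher score => higher probability
```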

15. Why Build Language Models This Way?
   ◮ Exploit features of histories for sharing of statistical strength and better smoothing (Lau et al., 1993).
   ◮ Condition the whole text on more interesting variables like the gender, age, or political affiliation of the author (Eisenstein et al., 2011).
   ◮ Interpretability!
      ◮ Each feature $\phi_k$ controls a multiplicative factor in the probability ($e^{w_k}$).
      ◮ If $w_k < 0$ then $\phi_k$ makes the event less likely by a factor of $1/e^{w_k}$.
      ◮ If $w_k > 0$ then $\phi_k$ makes the event more likely by a factor of $e^{w_k}$.
      ◮ If $w_k = 0$ then $\phi_k$ has no effect.
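
A quick numeric check of the interpretability claim (the weight values are arbitrary): a binary feature with weight $w_k$ multiplies the unnormalized probability by $e^{w_k}$ when it fires.

```python
import math

# Arbitrary weights for a single binary feature (phi_k = 1 when it fires).
w_k = 0.7
print(math.exp(w_k))   # ~2.01: the event becomes about twice as likely (before renormalization)

w_k = -0.7
print(math.exp(w_k))   # ~0.50: less likely, by a factor of 1 / e^{w_k} ~= 2.01
```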

16. Log-Linear n-Gram Models.
   $$\begin{aligned} p_{\mathbf{w}}(\boldsymbol{X} = \boldsymbol{x}) &= \prod_{j=1}^{\ell} p_{\mathbf{w}}(X_j = x_j \mid \boldsymbol{X}_{1:j-1} = \boldsymbol{x}_{1:j-1}) \\ &= \prod_{j=1}^{\ell} \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_{1:j-1}, x_j)}{Z_{\mathbf{w}}(\boldsymbol{x}_{1:j-1})} \\ &\stackrel{\text{assumption}}{=} \prod_{j=1}^{\ell} \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{x}_{j-n+1:j-1}, x_j)}{Z_{\mathbf{w}}(\boldsymbol{x}_{j-n+1:j-1})} \\ &= \prod_{j=1}^{\ell} \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{h}_j, x_j)}{Z_{\mathbf{w}}(\boldsymbol{h}_j)} \end{aligned}$$
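
A sketch of how this factored form is evaluated, reusing the hypothetical `score` helper from the earlier sketch; for simplicity it truncates histories at the sentence start instead of padding with start symbols, and `vocab` is an assumed iterable of all words.

```python
import math

def log_prob_sentence(w, phi, sentence, vocab, n):
    """log p_w(x) = sum_j [ w . phi(h_j, x_j) - log Z_w(h_j) ]."""
    total = 0.0
    for j, word in enumerate(sentence):
        h = tuple(sentence[max(0, j - n + 1):j])  # history h_j (truncated, no start padding)
        log_Z = math.log(sum(math.exp(score(w, phi, h, v)) for v in vocab))
        total += score(w, phi, h, word) - log_Z   # log p_w(x_j | h_j)
    return total
```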

17. Example. "The man who knew too ___", with candidate next words much, many, little, few, . . . , hippopotamus.

18. What Features in $\boldsymbol{\phi}(\boldsymbol{X}_{j-n+1:j-1}, X_j)$?

19. What Features in $\boldsymbol{\phi}(\boldsymbol{X}_{j-n+1:j-1}, X_j)$?
   ◮ Traditional n-gram features: "$X_{j-1} = \text{the} \wedge X_j = \text{man}$"

20. What Features in $\boldsymbol{\phi}(\boldsymbol{X}_{j-n+1:j-1}, X_j)$?
   ◮ Traditional n-gram features: "$X_{j-1} = \text{the} \wedge X_j = \text{man}$"
   ◮ "Gappy" n-grams: "$X_{j-2} = \text{the} \wedge X_j = \text{man}$"

21. What Features in $\boldsymbol{\phi}(\boldsymbol{X}_{j-n+1:j-1}, X_j)$?
   ◮ Traditional n-gram features: "$X_{j-1} = \text{the} \wedge X_j = \text{man}$"
   ◮ "Gappy" n-grams: "$X_{j-2} = \text{the} \wedge X_j = \text{man}$"
   ◮ Spelling features: "$X_j$'s first character is capitalized"

22. What Features in $\boldsymbol{\phi}(\boldsymbol{X}_{j-n+1:j-1}, X_j)$?
   ◮ Traditional n-gram features: "$X_{j-1} = \text{the} \wedge X_j = \text{man}$"
   ◮ "Gappy" n-grams: "$X_{j-2} = \text{the} \wedge X_j = \text{man}$"
   ◮ Spelling features: "$X_j$'s first character is capitalized"
   ◮ Class features: "$X_j$ is a member of class 132"

23. What Features in $\boldsymbol{\phi}(\boldsymbol{X}_{j-n+1:j-1}, X_j)$?
   ◮ Traditional n-gram features: "$X_{j-1} = \text{the} \wedge X_j = \text{man}$"
   ◮ "Gappy" n-grams: "$X_{j-2} = \text{the} \wedge X_j = \text{man}$"
   ◮ Spelling features: "$X_j$'s first character is capitalized"
   ◮ Class features: "$X_j$ is a member of class 132"
   ◮ Gazetteer features: "$X_j$ is listed as a geographic place name"

24. What Features in $\boldsymbol{\phi}(\boldsymbol{X}_{j-n+1:j-1}, X_j)$?
   ◮ Traditional n-gram features: "$X_{j-1} = \text{the} \wedge X_j = \text{man}$"
   ◮ "Gappy" n-grams: "$X_{j-2} = \text{the} \wedge X_j = \text{man}$"
   ◮ Spelling features: "$X_j$'s first character is capitalized"
   ◮ Class features: "$X_j$ is a member of class 132"
   ◮ Gazetteer features: "$X_j$ is listed as a geographic place name"
   You can define any features you want!
   ◮ Too many features, and your model will overfit.
   ◮ Too few (good) features, and your model will not learn.

25. What Features in $\boldsymbol{\phi}(\boldsymbol{X}_{j-n+1:j-1}, X_j)$?
   ◮ Traditional n-gram features: "$X_{j-1} = \text{the} \wedge X_j = \text{man}$"
   ◮ "Gappy" n-grams: "$X_{j-2} = \text{the} \wedge X_j = \text{man}$"
   ◮ Spelling features: "$X_j$'s first character is capitalized"
   ◮ Class features: "$X_j$ is a member of class 132"
   ◮ Gazetteer features: "$X_j$ is listed as a geographic place name"
   You can define any features you want!
   ◮ Too many features, and your model will overfit.
      ◮ "Feature selection" methods, e.g., ignoring features with very low counts, can help.
   ◮ Too few (good) features, and your model will not learn.
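
A sketch of a feature function implementing the templates listed above; the `word_class` map and `gazetteer` set are hypothetical lookups standing in for whatever resources are available.

```python
def phi(h, x, word_class=None, gazetteer=frozenset()):
    """Sparse features for history h (a tuple of words) and candidate word x."""
    feats = {}
    if len(h) >= 1:
        feats["bigram:%s_%s" % (h[-1], x)] = 1.0     # traditional n-gram feature
    if len(h) >= 2:
        feats["gappy:%s_*_%s" % (h[-2], x)] = 1.0    # "gappy" n-gram (skips one position)
    if x[:1].isupper():
        feats["capitalized"] = 1.0                   # spelling feature
    if word_class and x in word_class:
        feats["class:%s" % word_class[x]] = 1.0      # class feature
    if x in gazetteer:
        feats["gazetteer:place"] = 1.0               # gazetteer feature
    return feats
```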

26. "Feature Engineering"
   ◮ Many advances in NLP (not just language modeling) have come from careful design of features.

27. "Feature Engineering"
   ◮ Many advances in NLP (not just language modeling) have come from careful design of features.
   ◮ Sometimes "feature engineering" is used pejoratively.

28. "Feature Engineering"
   ◮ Many advances in NLP (not just language modeling) have come from careful design of features.
   ◮ Sometimes "feature engineering" is used pejoratively.
      ◮ Some people would rather not spend their time on it!

29. "Feature Engineering"
   ◮ Many advances in NLP (not just language modeling) have come from careful design of features.
   ◮ Sometimes "feature engineering" is used pejoratively.
      ◮ Some people would rather not spend their time on it!
   ◮ There is some work on automatically inducing features (Della Pietra et al., 1997).

30. "Feature Engineering"
   ◮ Many advances in NLP (not just language modeling) have come from careful design of features.
   ◮ Sometimes "feature engineering" is used pejoratively.
      ◮ Some people would rather not spend their time on it!
   ◮ There is some work on automatically inducing features (Della Pietra et al., 1997).
   ◮ More recent work in neural networks can be seen as discovering features (instead of engineering them).

31. "Feature Engineering"
   ◮ Many advances in NLP (not just language modeling) have come from careful design of features.
   ◮ Sometimes "feature engineering" is used pejoratively.
      ◮ Some people would rather not spend their time on it!
   ◮ There is some work on automatically inducing features (Della Pietra et al., 1997).
   ◮ More recent work in neural networks can be seen as discovering features (instead of engineering them).
   ◮ But in much of NLP, there's a strong preference for interpretable features.

32. How to Estimate w? A side-by-side comparison of the classical n-gram model and the log-linear n-gram model:
   ◮ Model: n-gram, $p_{\boldsymbol{\theta}}(\boldsymbol{x}) = \prod_{j=1}^{\ell} \theta_{x_j \mid \boldsymbol{h}_j}$; log-linear n-gram, $p_{\mathbf{w}}(\boldsymbol{x}) = \prod_{j=1}^{\ell} \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{h}_j, x_j)}{Z_{\mathbf{w}}(\boldsymbol{h}_j)}$.
   ◮ Parameters: n-gram, $\theta_{v \mid \boldsymbol{h}}$ for all $v \in \mathcal{V}$ and all histories $\boldsymbol{h} \in (\mathcal{V} \cup \{\circ\})^{n-1}$ (where $\circ$ denotes the start-padding symbol); log-linear n-gram, $w_k$ for all $k \in \{1, \ldots, d\}$.
   ◮ MLE: n-gram, closed form $\hat{\theta}_{v \mid \boldsymbol{h}} = \frac{c(\boldsymbol{h}v)}{c(\boldsymbol{h})}$; log-linear n-gram, no closed form.
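
For contrast, the closed-form count-based MLE in the n-gram column is only a few lines of code (a sketch, with histories truncated at the sentence start rather than padded); the log-linear weights must instead be fit by numerical optimization, which the next slides set up.

```python
from collections import Counter

def ngram_mle(sentences, n):
    """theta_{v|h} = c(h, v) / c(h), estimated from a list of tokenized sentences."""
    c_hv, c_h = Counter(), Counter()
    for s in sentences:
        for j, v in enumerate(s):
            h = tuple(s[max(0, j - n + 1):j])  # history (no start-symbol padding here)
            c_hv[(h, v)] += 1
            c_h[h] += 1
    return {(h, v): c_hv[(h, v)] / c_h[h] for (h, v) in c_hv}
```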

33. MLE for w.
   ◮ Let training data consist of $\{(\boldsymbol{h}_i, x_i)\}_{i=1}^{N}$.

34. MLE for w.
   ◮ Let training data consist of $\{(\boldsymbol{h}_i, x_i)\}_{i=1}^{N}$.
   ◮ Maximum likelihood estimation is:
   $$\max_{\mathbf{w} \in \mathbb{R}^d} \sum_{i=1}^{N} \log p_{\mathbf{w}}(x_i \mid \boldsymbol{h}_i) = \max_{\mathbf{w} \in \mathbb{R}^d} \sum_{i=1}^{N} \log \frac{\exp \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{h}_i, x_i)}{Z_{\mathbf{w}}(\boldsymbol{h}_i)} = \max_{\mathbf{w} \in \mathbb{R}^d} \sum_{i=1}^{N} \Big( \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{h}_i, x_i) - \log \underbrace{\sum_{v \in \mathcal{V}} \exp \mathbf{w} \cdot \boldsymbol{\phi}(\boldsymbol{h}_i, v)}_{Z_{\mathbf{w}}(\boldsymbol{h}_i)} \Big)$$
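
Although there is no closed form, this objective has a simple gradient: observed features minus expected features under the model. The sketch below computes the objective and its gradient, reusing the hypothetical `score` and `phi` helpers from the earlier sketches; a gradient-based optimizer would then take steps in the gradient direction.

```python
import math

def log_likelihood_and_grad(w, phi, data, vocab):
    """data: list of (history, word) pairs. Returns (sum_i log p_w(x_i | h_i), gradient dict)."""
    ll, grad = 0.0, {}
    for h, x in data:
        scores = {v: score(w, phi, h, v) for v in vocab}
        log_Z = math.log(sum(math.exp(s) for s in scores.values()))
        ll += scores[x] - log_Z                      # w . phi(h, x) - log Z_w(h)
        for k, val in phi(h, x).items():             # observed features
            grad[k] = grad.get(k, 0.0) + val
        for v, s in scores.items():                  # minus expected features under p_w(. | h)
            p_v = math.exp(s - log_Z)
            for k, val in phi(h, v).items():
                grad[k] = grad.get(k, 0.0) - p_v * val
    return ll, grad
```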
