 
Natural Language Processing (CSE 490U): Featurized Language Models
Noah Smith
© 2017 University of Washington
nasmith@cs.washington.edu
January 9, 2017
What’s wrong with n-grams?
Data sparseness: most histories and most words will be seen only rarely (if at all).
Next central idea: teach histories and words how to share.
Log-Linear Models: Definitions
We define a conditional log-linear model $p(Y \mid X)$ as:
◮ $\mathcal{Y}$ is the set of events/outputs (for language modeling, $\mathcal{V}$)
◮ $\mathcal{X}$ is the set of contexts/inputs (for n-gram language modeling, $\mathcal{V}^{n-1}$)
◮ $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ is a feature vector function
◮ $w \in \mathbb{R}^d$ are the model parameters
$$p_w(Y = y \mid X = x) = \frac{\exp w \cdot \phi(x, y)}{\sum_{y' \in \mathcal{Y}} \exp w \cdot \phi(x, y')}$$
Breaking It Down
$$p_w(Y = y \mid X = x) = \frac{\exp w \cdot \phi(x, y)}{\sum_{y' \in \mathcal{Y}} \exp w \cdot \phi(x, y')}$$
linear score: $w \cdot \phi(x, y)$
nonnegative: $\exp w \cdot \phi(x, y)$
normalizer: $\sum_{y' \in \mathcal{Y}} \exp w \cdot \phi(x, y') = Z_w(x)$
“Log-linear” comes from the fact that:
$$\log p_w(Y = y \mid X = x) = w \cdot \phi(x, y) - \underbrace{\log Z_w(x)}_{\text{constant in } y}$$
This is an instance of the family of generalized linear models.
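The three pieces above (linear score, exponentiation, normalizer) can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name `log_linear_probs`, the toy feature function `toy_phi`, and the weights are all assumptions made for the example. The max score is subtracted before exponentiating, which cancels in the ratio and only serves to avoid overflow.

```python
import math

def log_linear_probs(w, phi, x, outputs):
    """Return {y: p_w(y | x)} for a conditional log-linear model."""
    # Linear scores w . phi(x, y), one per candidate output y.
    scores = {y: sum(wk * fk for wk, fk in zip(w, phi(x, y))) for y in outputs}
    # log Z_w(x), computed stably by factoring out the largest score.
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    # log p = score - log Z, so p = exp(score - log Z).
    return {y: math.exp(s - log_z) for y, s in scores.items()}

# Hypothetical toy setup: two binary features, three candidate words.
def toy_phi(x, y):
    return [1.0 if (x, y) == ("the", "man") else 0.0,  # bigram feature
            1.0 if y[0].isupper() else 0.0]            # spelling feature

probs = log_linear_probs([2.0, -1.0], toy_phi, "the", ["man", "dog", "Paris"])
```

With these assumed weights the scores are 2, 0, and -1, so "man" is most probable and "Paris" least, and the three probabilities sum to one.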
The Geometric View
Suppose we have instance $x$, $\mathcal{Y} = \{y_1, y_2, y_3, y_4\}$, and there are only two features, $\phi_1$ and $\phi_2$.
[Figure: the four points $(x, y_1)$, $(x, y_2)$, $(x, y_3)$, $(x, y_4)$ plotted in the plane with axes $\phi_1$ and $\phi_2$, together with the line $w \cdot \phi = w_1 \phi_1 + w_2 \phi_2 = 0$.]
For one choice of $w$: $p(y_3 \mid x) > p(y_1 \mid x) > p(y_4 \mid x) > p(y_2 \mid x)$
For a different choice of $w$: $p(y_3 \mid x) > p(y_1 \mid x) > p(y_2 \mid x) > p(y_4 \mid x)$
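The geometric picture can be reproduced numerically. Because $\exp$ is monotonic and $Z_w(x)$ is shared by all outputs, ranking outputs by probability is the same as ranking them by the score $w \cdot \phi$. The sketch below uses made-up coordinates and weight vectors (the slide's actual values are not recoverable from the text), chosen so that the two orderings match the ones stated above.

```python
# Hypothetical feature vectors (phi_1, phi_2) for the four outputs.
points = {
    "y1": (1.0, 1.0),
    "y2": (1.5, -1.0),
    "y3": (-0.5, 2.0),
    "y4": (-1.0, -0.5),
}

def ranking(w, points):
    """Sort outputs from most to least probable under weights w."""
    def score(y):
        return w[0] * points[y][0] + w[1] * points[y][1]
    return sorted(points, key=score, reverse=True)

# Two hypothetical weight vectors: rotating w changes which side of the
# line w . phi = 0 each point falls on, and how far from it.
order_a = ranking((-0.2, 1.0), points)  # y3 > y1 > y4 > y2
order_b = ranking((0.5, 1.0), points)   # y3 > y1 > y2 > y4
```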
Why Build Language Models This Way?
◮ Exploit features of histories for sharing of statistical strength and better smoothing (Lau et al., 1993)
◮ Condition the whole text on more interesting variables like the gender, age, or political affiliation of the author (Eisenstein et al., 2011)
◮ Interpretability!
  ◮ Each feature $\phi_k$ contributes a multiplicative factor of $e^{w_k}$ to the (unnormalized) probability.
  ◮ If $w_k < 0$ then $\phi_k$ makes the event less likely by a factor of $\frac{1}{e^{w_k}}$.
  ◮ If $w_k > 0$ then $\phi_k$ makes the event more likely by a factor of $e^{w_k}$.
  ◮ If $w_k = 0$ then $\phi_k$ has no effect.
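To see where the factor $e^{w_k}$ comes from, it can help to compare two outputs in the same context. The following short derivation rests on an assumption not stated on the slide: $\phi_k$ is binary, and $y$ and $y'$ have identical feature vectors except that $\phi_k(x, y) = 1$ and $\phi_k(x, y') = 0$.

```latex
\begin{aligned}
\frac{p_w(y \mid x)}{p_w(y' \mid x)}
  &= \frac{\exp\big(w \cdot \phi(x, y)\big) / Z_w(x)}
          {\exp\big(w \cdot \phi(x, y')\big) / Z_w(x)} \\
  &= \exp\big(w \cdot (\phi(x, y) - \phi(x, y'))\big) \\
  &= \exp(w_k \cdot 1) = e^{w_k}
\end{aligned}
```

The normalizer cancels, so the comparison depends only on the weight of the feature that differs. For instance, $w_k = 0.7$ would make $y$ roughly twice as likely as $y'$, since $e^{0.7} \approx 2.01$.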
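Log-Linear n-Gram Models
$$\begin{aligned}
p_w(X = x) &= \prod_{j=1}^{\ell} p_w(X_j = x_j \mid X_{1:j-1} = x_{1:j-1}) \\
&= \prod_{j=1}^{\ell} \frac{\exp w \cdot \phi(x_{1:j-1}, x_j)}{Z_w(x_{1:j-1})} \\
&\overset{\text{assumption}}{=} \prod_{j=1}^{\ell} \frac{\exp w \cdot \phi(x_{j-n+1:j-1}, x_j)}{Z_w(x_{j-n+1:j-1})} \\
&= \prod_{j=1}^{\ell} \frac{\exp w \cdot \phi(h_j, x_j)}{Z_w(h_j)}
\end{aligned}$$
where $h_j = x_{j-n+1:j-1}$ is the history of the $n-1$ preceding words.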
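In log space the product above becomes a sum of per-position terms $w \cdot \phi(h_j, x_j) - \log Z_w(h_j)$. The sketch below shows one way this might be computed. The padding symbol `"<s>"`, the function name `sequence_log_prob`, and the toy feature function are assumptions for illustration; if a stop symbol is used, it would need to appear in both `words` and `vocab`.

```python
import math

def sequence_log_prob(w, phi, words, vocab, n, pad="<s>"):
    """log p_w(x) = sum_j [ w . phi(h_j, x_j) - log Z_w(h_j) ]."""
    # Pad the start so that every position has a full (n-1)-word history.
    padded = [pad] * (n - 1) + list(words)
    total = 0.0
    for j in range(n - 1, len(padded)):
        h = tuple(padded[j - n + 1:j])  # the n-1 preceding words
        # Score every vocabulary item in this history to get Z_w(h_j).
        scores = {v: sum(wk * fk for wk, fk in zip(w, phi(h, v))) for v in vocab}
        m = max(scores.values())
        log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
        total += scores[padded[j]] - log_z
    return total

# Hypothetical toy model: a single bigram feature "previous = the, word = man".
vocab = ["the", "man", "knew"]

def toy_phi(h, v):
    return [1.0 if h[-1] == "the" and v == "man" else 0.0]

lp = sequence_log_prob([1.5], toy_phi, ["the", "man", "knew"], vocab, n=2)
```

Note that each position requires a sum over the whole vocabulary to compute $Z_w(h_j)$, which is the main computational cost of this model family compared with a count-based n-gram lookup.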
Example
The man who knew too ___
Candidate next words: much, many, little, few, . . . , hippopotamus
What Features in $\phi(X_{j-n+1:j-1}, X_j)$?
◮ Traditional n-gram features: “$X_{j-1}$ = the ∧ $X_j$ = man”
◮ “Gappy” n-grams: “$X_{j-2}$ = the ∧ $X_j$ = man”
◮ Spelling features: “$X_j$’s first character is capitalized”
◮ Class features: “$X_j$ is a member of class 132”
◮ Gazetteer features: “$X_j$ is listed as a geographic place name”
You can define any features you want!
◮ Too many features, and your model will overfit
  ◮ “Feature selection” methods, e.g., ignoring features with very low counts, can help.
◮ Too few (good) features, and your model will not learn
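The feature templates listed above can be sketched as a function that returns the names of the binary features firing on a (history, word) pair, followed by the count-based feature selection the slide mentions. Everything concrete here is hypothetical: the gazetteer, the word-class table, the feature-name strings, and the `min_count` threshold are placeholders chosen for the example.

```python
from collections import Counter

GAZETTEER = {"Seattle", "Paris"}         # hypothetical place-name list
WORD_CLASS = {"man": 132, "woman": 132}  # hypothetical word-class ids

def features(history, word):
    """Return the names of the binary features that fire on (history, word)."""
    active = []
    if len(history) >= 1:
        active.append(f"prev={history[-1]}&word={word}")   # traditional bigram
    if len(history) >= 2:
        active.append(f"prev2={history[-2]}&word={word}")  # "gappy" bigram
    if word[:1].isupper():
        active.append("word_is_capitalized")               # spelling
    if word in WORD_CLASS:
        active.append(f"word_class={WORD_CLASS[word]}")    # class
    if word in GAZETTEER:
        active.append("word_in_gazetteer")                 # gazetteer
    return active

def select_features(training_pairs, min_count=2):
    """Keep only features that fire at least min_count times in training."""
    counts = Counter(f for h, x in training_pairs for f in features(h, x))
    return {f for f, c in counts.items() if c >= min_count}

fired = features(("saw", "the"), "man")
kept = select_features([(("the",), "man"), (("the",), "man"), (("a",), "Paris")])
```

Representing features by name rather than by dense vector index is one common design choice for sparse binary features; a name-to-index map would be needed to line them up with the weight vector $w$.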
“Feature Engineering”
◮ Many advances in NLP (not just language modeling) have come from careful design of features.
◮ Sometimes “feature engineering” is used pejoratively.
  ◮ Some people would rather not spend their time on it!
◮ There is some work on automatically inducing features (Della Pietra et al., 1997).
◮ More recent work in neural networks can be seen as discovering features (instead of engineering them).
◮ But in much of NLP, there’s a strong preference for interpretable features.
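How to Estimate w?
n-gram model:
◮ $p_\theta(x) = \prod_{j=1}^{\ell} \theta_{x_j \mid h_j}$
◮ Parameters: $\theta_{v \mid h}$, $\forall v \in \mathcal{V}$, $h \in (\mathcal{V} \cup \{\bigcirc\})^{n-1}$
◮ MLE: $\hat{\theta}_{v \mid h} = \frac{c(hv)}{c(h)}$
log-linear n-gram model:
◮ $p_w(x) = \prod_{j=1}^{\ell} \frac{\exp w \cdot \phi(h_j, x_j)}{Z_w(h_j)}$
◮ Parameters: $w_k$, $\forall k \in \{1, \ldots, d\}$
◮ MLE: no closed form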
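For contrast with the log-linear case, the closed-form n-gram estimate $\hat{\theta}_{v \mid h} = c(hv)/c(h)$ amounts to counting. A minimal sketch follows, assuming a start-padding symbol `"<s>"` and a function name `ngram_mle` that are not taken from the slides.

```python
from collections import Counter

def ngram_mle(sentences, n, pad="<s>"):
    """Relative-frequency estimates: theta[(h, v)] = c(h v) / c(h)."""
    hv_counts, h_counts = Counter(), Counter()
    for words in sentences:
        padded = [pad] * (n - 1) + list(words)
        for j in range(n - 1, len(padded)):
            h = tuple(padded[j - n + 1:j])  # the n-1 preceding words
            hv_counts[(h, padded[j])] += 1  # c(h v)
            h_counts[h] += 1                # c(h), counted as a history
    return {(h, v): c / h_counts[h] for (h, v), c in hv_counts.items()}

theta = ngram_mle([["the", "man"], ["the", "dog"], ["the", "man"]], n=2)
```

Here "man" follows "the" in two of three cases, so its estimate is 2/3, and any (history, word) pair that never occurred is simply absent, which is the data sparseness problem the featurized model is meant to address.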
MLE for w
◮ Let training data consist of $\{(h_i, x_i)\}_{i=1}^{N}$.
◮ Maximum likelihood estimation is:
$$\begin{aligned}
\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log p_w(x_i \mid h_i)
&= \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log \frac{\exp w \cdot \phi(h_i, x_i)}{Z_w(h_i)} \\
&= \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \Big( w \cdot \phi(h_i, x_i) - \log \underbrace{\sum_{v \in \mathcal{V}} \exp w \cdot \phi(h_i, v)}_{Z_w(h_i)} \Big)
\end{aligned}$$
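Since there is no closed form, the objective above is typically maximized iteratively. The sketch below evaluates the log-likelihood and runs plain gradient ascent on a toy problem. It relies on a standard result that this part of the slides has not stated: the gradient with respect to $w_k$ is the observed feature value minus its expected value under the model. The step size, iteration count, function names, and toy data are all assumptions, and practical systems would likely use a more capable optimizer and regularization.

```python
import math

def log_likelihood_and_gradient(w, phi, data, vocab):
    """Return (sum_i log p_w(x_i | h_i), its gradient with respect to w)."""
    ll, grad = 0.0, [0.0] * len(w)
    for h, x in data:
        feats = {v: phi(h, v) for v in vocab}
        scores = {v: sum(wk * fk for wk, fk in zip(w, f)) for v, f in feats.items()}
        m = max(scores.values())
        log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
        ll += scores[x] - log_z
        for k in range(len(w)):
            # Expected value of feature k under p_w(. | h).
            expected = sum(math.exp(scores[v] - log_z) * feats[v][k] for v in vocab)
            grad[k] += feats[x][k] - expected  # observed minus expected
    return ll, grad

def gradient_ascent(phi, data, vocab, d, step=0.5, iters=200):
    """Maximize the log-likelihood with fixed-step gradient ascent from w = 0."""
    w = [0.0] * d
    for _ in range(iters):
        _, g = log_likelihood_and_gradient(w, phi, data, vocab)
        w = [wk + step * gk for wk, gk in zip(w, g)]
    return w

# Hypothetical toy problem: one feature that fires when the word is "man".
toy_vocab = ["man", "dog"]

def toy_phi(h, v):
    return [1.0 if v == "man" else 0.0]

toy_data = [(("the",), "man")] * 3 + [(("the",), "dog")]
w_hat = gradient_ascent(toy_phi, toy_data, toy_vocab, d=1)
```

On this toy data "man" occurs three times out of four, so the maximizer should satisfy $p_w(\text{man}) = 0.75$, that is, $w = \log 3$.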