

slide-1
SLIDE 1

Natural Language Processing (CSE 490U): Featurized Language Models

Noah Smith

© 2017 University of Washington
nasmith@cs.washington.edu

January 9, 2017

1 / 62

slide-2
SLIDE 2

What’s wrong with n-grams?

Data sparseness: most histories and most words will be seen only rarely (if at all).

2 / 62

slide-3
SLIDE 3

What’s wrong with n-grams?

Data sparseness: most histories and most words will be seen only rarely (if at all). Next central idea: teach histories and words how to share.

3 / 62

slide-4
SLIDE 4

Log-Linear Models: Definitions

We define a conditional log-linear model p(Y | X) as:

◮ Y is the set of events/outputs (for language modeling, V)
◮ X is the set of contexts/inputs (for n-gram language modeling, V^{n−1})
◮ φ : X × Y → R^d is a feature vector function
◮ w ∈ R^d are the model parameters

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

4 / 62
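To make the definition concrete, here is a minimal NumPy sketch of the conditional log-linear distribution above. The feature function, candidate set, and weights are made-up placeholders for illustration, not anything from the slides.

```python
import numpy as np

def log_linear_probs(w, phi, x, Y):
    """p_w(y | x) for every y in Y, where phi(x, y) returns a feature vector."""
    scores = np.array([w @ phi(x, y) for y in Y])   # linear scores w . phi(x, y)
    scores -= scores.max()                          # shift for numerical stability
    expscores = np.exp(scores)                      # nonnegative
    return expscores / expscores.sum()              # divide by Z_w(x)

# Toy example with two hand-made (hypothetical) features:
Y = ["man", "woman", "hippopotamus"]

def phi(x, y):
    # phi_1: does y start with the same letter as the last history word?
    # phi_2: is y a long word?
    return np.array([float(y[0] == x[-1][0]), float(len(y) > 6)])

w = np.array([1.5, -0.5])
print(dict(zip(Y, log_linear_probs(w, phi, ("the", "man"), Y))))
```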

slide-5
SLIDE 5

Breaking It Down

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

5 / 62

slide-6
SLIDE 6

Breaking It Down

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

linear score: w \cdot \phi(x, y)

6 / 62

slide-7
SLIDE 7

Breaking It Down

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

linear score: w \cdot \phi(x, y)
nonnegative: \exp(w \cdot \phi(x, y))

7 / 62

slide-8
SLIDE 8

Breaking It Down

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

linear score: w \cdot \phi(x, y)
nonnegative: \exp(w \cdot \phi(x, y))
normalizer: \sum_{y' \in Y} \exp(w \cdot \phi(x, y')) = Z_w(x)

8 / 62

slide-9
SLIDE 9

Breaking It Down

p_w(Y = y \mid X = x) = \frac{\exp(w \cdot \phi(x, y))}{\sum_{y' \in Y} \exp(w \cdot \phi(x, y'))}

linear score: w \cdot \phi(x, y)
nonnegative: \exp(w \cdot \phi(x, y))
normalizer: \sum_{y' \in Y} \exp(w \cdot \phi(x, y')) = Z_w(x)

"Log-linear" comes from the fact that:

\log p_w(Y = y \mid X = x) = w \cdot \phi(x, y) - \underbrace{\log Z_w(x)}_{\text{constant in } y}

This is an instance of the family of generalized linear models.

9 / 62

slide-10
SLIDE 10

The Geometric View

Suppose we have instance x, Y = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the four feature vectors φ(x, y1), . . . , φ(x, y4) plotted in the (φ1, φ2) plane.]

10 / 62

slide-11
SLIDE 11

The Geometric View

Suppose we have instance x, Y = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the four feature vectors φ(x, y1), . . . , φ(x, y4) plotted in the (φ1, φ2) plane, together with the line]

w · φ = w1φ1 + w2φ2 = 0

11 / 62

slide-12
SLIDE 12

The Geometric View

Suppose we have instance x, Y = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the four feature vectors φ(x, y1), . . . , φ(x, y4) in the (φ1, φ2) plane, with a weight vector w inducing the ranking below.]

p(y3 | x) > p(y1 | x) > p(y4 | x) > p(y2 | x)

12 / 62

slide-13
SLIDE 13

The Geometric View

Suppose we have instance x, Y = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the four feature vectors φ(x, y1), . . . , φ(x, y4) plotted in the (φ1, φ2) plane.]

13 / 62

slide-14
SLIDE 14

The Geometric View

Suppose we have instance x, Y = {y1, y2, y3, y4}, and there are only two features, φ1 and φ2.

[Figure: the four feature vectors φ(x, y1), . . . , φ(x, y4) in the (φ1, φ2) plane, with a weight vector w inducing the ranking below.]

p(y3 | x) > p(y1 | x) > p(y2 | x) > p(y4 | x)

14 / 62

slide-15
SLIDE 15

Why Build Language Models This Way?

◮ Exploit features of histories for sharing of statistical strength and better smoothing (Lau et al., 1993)
◮ Condition the whole text on more interesting variables like the gender, age, or political affiliation of the author (Eisenstein et al., 2011)
◮ Interpretability!
  ◮ Each feature φ_k controls a factor of e^{w_k} in the probability.
  ◮ If w_k < 0 then φ_k makes the event less likely by a factor of 1/e^{w_k}.
  ◮ If w_k > 0 then φ_k makes the event more likely by a factor of e^{w_k}.
  ◮ If w_k = 0 then φ_k has no effect.

15 / 62

slide-16
SLIDE 16

Log-Linear n-Gram Models

p_w(X = x) = \prod_{j=1}^{\ell} p_w(X_j = x_j \mid X_{1:j-1} = x_{1:j-1})
           = \prod_{j=1}^{\ell} \frac{\exp(w \cdot \phi(x_{1:j-1}, x_j))}{Z_w(x_{1:j-1})}

(n-gram assumption)

           = \prod_{j=1}^{\ell} \frac{\exp(w \cdot \phi(x_{j-n+1:j-1}, x_j))}{Z_w(x_{j-n+1:j-1})}
           = \prod_{j=1}^{\ell} \frac{\exp(w \cdot \phi(h_j, x_j))}{Z_w(h_j)}

where \ell is the length of x and h_j abbreviates the truncated history x_{j-n+1:j-1}.

16 / 62
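A minimal sketch of this factorization: chain the per-position log-linear conditionals to score a whole sentence. The feature function phi, vocabulary V, and the "<s>" history padding are placeholders I introduce for the example; the slides leave the padding scheme unspecified.

```python
import numpy as np

def sentence_log_prob(w, phi, sentence, V, n=2):
    """log p_w(x) = sum_j [ w . phi(h_j, x_j) - log Z_w(h_j) ] under an n-gram
    log-linear model.  phi, V, and the padding symbol are illustrative."""
    padded = ["<s>"] * (n - 1) + list(sentence)         # hypothetical history padding
    total = 0.0
    for j in range(n - 1, len(padded)):
        h, x = tuple(padded[j - n + 1:j]), padded[j]
        scores = np.array([w @ phi(h, v) for v in V])   # one score per word in V
        log_Z = np.logaddexp.reduce(scores)             # log Z_w(h_j), computed stably
        total += w @ phi(h, x) - log_Z
    return total
```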

slide-17
SLIDE 17

Example

History: "The man who knew too"; candidate next words: much, many, little, few, . . . , hippopotamus

17 / 62

slide-18
SLIDE 18

What Features in φ(Xj−n+1:j−1, Xj)?

18 / 62

slide-19
SLIDE 19

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”

19 / 62

slide-20
SLIDE 20

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”

20 / 62

slide-21
SLIDE 21

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”
◮ Spelling features: “Xj’s first character is capitalized”

21 / 62

slide-22
SLIDE 22

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”
◮ Spelling features: “Xj’s first character is capitalized”
◮ Class features: “Xj is a member of class 132”

22 / 62

slide-23
SLIDE 23

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”
◮ Spelling features: “Xj’s first character is capitalized”
◮ Class features: “Xj is a member of class 132”
◮ Gazetteer features: “Xj is listed as a geographic place name”

23 / 62

slide-24
SLIDE 24

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”
◮ Spelling features: “Xj’s first character is capitalized”
◮ Class features: “Xj is a member of class 132”
◮ Gazetteer features: “Xj is listed as a geographic place name”

You can define any features you want!

◮ Too many features, and your model will overfit
◮ Too few (good) features, and your model will not learn

24 / 62

slide-25
SLIDE 25

What Features in φ(Xj−n+1:j−1, Xj)?

◮ Traditional n-gram features: “Xj−1 = the ∧ Xj = man”
◮ “Gappy” n-grams: “Xj−2 = the ∧ Xj = man”
◮ Spelling features: “Xj’s first character is capitalized”
◮ Class features: “Xj is a member of class 132”
◮ Gazetteer features: “Xj is listed as a geographic place name”

You can define any features you want!

◮ Too many features, and your model will overfit
  ◮ “Feature selection” methods, e.g., ignoring features with very low counts, can help.
◮ Too few (good) features, and your model will not learn

25 / 62
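As an illustration of the feature templates above, here is a small sketch of a feature function φ(h, v) that emits sparse binary features. The feature names, the word-class lookup, and the gazetteer are invented for the example.

```python
def phi(h, v, word_class=None, gazetteer=frozenset()):
    """Sparse feature map for history h (a tuple of words) and candidate word v.
    Returns {feature name: value}; absent features are implicitly 0."""
    feats = {}
    feats[f"bigram:{h[-1]}_{v}"] = 1.0                 # traditional n-gram feature
    if len(h) >= 2:
        feats[f"gappy:{h[-2]}_*_{v}"] = 1.0            # "gappy" n-gram (skips one word)
    if v[:1].isupper():
        feats["spelling:capitalized"] = 1.0            # spelling feature
    if word_class is not None:
        feats[f"class:{word_class.get(v)}"] = 1.0      # word-class feature
    if v in gazetteer:
        feats["gazetteer:place_name"] = 1.0            # gazetteer feature
    return feats

print(phi(("knew", "too"), "much"))
```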

slide-26
SLIDE 26

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.

26 / 62

slide-27
SLIDE 27

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.

◮ Sometimes “feature engineering” is used pejoratively.

27 / 62

slide-28
SLIDE 28

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.
◮ Sometimes “feature engineering” is used pejoratively.
◮ Some people would rather not spend their time on it!

28 / 62

slide-29
SLIDE 29

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.
◮ Sometimes “feature engineering” is used pejoratively.
◮ Some people would rather not spend their time on it!
◮ There is some work on automatically inducing features (Della Pietra et al., 1997).

29 / 62

slide-30
SLIDE 30

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.
◮ Sometimes “feature engineering” is used pejoratively.
◮ Some people would rather not spend their time on it!
◮ There is some work on automatically inducing features (Della Pietra et al., 1997).
◮ More recent work in neural networks can be seen as discovering features (instead of engineering them).

30 / 62

slide-31
SLIDE 31

“Feature Engineering”

◮ Many advances in NLP (not just language modeling) have come from careful design of features.
◮ Sometimes “feature engineering” is used pejoratively.
◮ Some people would rather not spend their time on it!
◮ There is some work on automatically inducing features (Della Pietra et al., 1997).
◮ More recent work in neural networks can be seen as discovering features (instead of engineering them).
◮ But in much of NLP, there’s a strong preference for interpretable features.

31 / 62

slide-32
SLIDE 32

How to Estimate w?

n-gram:
  p_\theta(x) = \prod_{j=1}^{\ell} \theta_{x_j \mid h_j}
  Parameters: \theta_{v \mid h}, \forall v \in V, h \in (V \cup \{\})^{n-1}
  MLE: \hat{\theta}_{v \mid h} = \frac{c(hv)}{c(h)}

log-linear n-gram:
  p_w(x) = \prod_{j=1}^{\ell} \frac{\exp(w \cdot \phi(h_j, x_j))}{Z_w(h_j)}
  Parameters: w_k, \forall k \in \{1, \ldots, d\}
  MLE: no closed form

32 / 62

slide-33
SLIDE 33

MLE for w

◮ Let training data consist of \{(h_i, x_i)\}_{i=1}^{N}.

33 / 62

slide-34
SLIDE 34

MLE for w

◮ Let training data consist of \{(h_i, x_i)\}_{i=1}^{N}.
◮ Maximum likelihood estimation is:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log p_w(x_i \mid h_i)
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log \frac{\exp(w \cdot \phi(h_i, x_i))}{Z_w(h_i)}
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \underbrace{\sum_{v \in V} \exp(w \cdot \phi(h_i, v))}_{Z_w(h_i)} \bigg)

34 / 62

slide-35
SLIDE 35

MLE for w

◮ Let training data consist of \{(h_i, x_i)\}_{i=1}^{N}.
◮ Maximum likelihood estimation is:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log p_w(x_i \mid h_i)
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log \frac{\exp(w \cdot \phi(h_i, x_i))}{Z_w(h_i)}
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \underbrace{\sum_{v \in V} \exp(w \cdot \phi(h_i, v))}_{Z_w(h_i)} \bigg)

◮ This is concave in w.

35 / 62

slide-36
SLIDE 36

MLE for w

◮ Let training data consist of \{(h_i, x_i)\}_{i=1}^{N}.
◮ Maximum likelihood estimation is:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log p_w(x_i \mid h_i)
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \log \frac{\exp(w \cdot \phi(h_i, x_i))}{Z_w(h_i)}
  = \max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \underbrace{\sum_{v \in V} \exp(w \cdot \phi(h_i, v))}_{Z_w(h_i)} \bigg)

◮ This is concave in w.
◮ Z_w(h_i) involves a sum over V (one term per word in the vocabulary).

36 / 62

slide-37
SLIDE 37

MLE for w

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \underbrace{w \cdot \phi(h_i, x_i) - \log Z_w(h_i)}_{f_i(w)}

37 / 62

slide-38
SLIDE 38

MLE for w

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \underbrace{w \cdot \phi(h_i, x_i) - \log Z_w(h_i)}_{f_i(w)}

Hope/fear view: for each instance i,

◮ increase the score of the correct output x_i:  \mathrm{score}(x_i) = w \cdot \phi(h_i, x_i)
◮ decrease the “softened max” score overall:  \log \sum_{v \in V} \exp \mathrm{score}(v)

38 / 62

slide-39
SLIDE 39

MLE for w

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \underbrace{w \cdot \phi(h_i, x_i) - \log Z_w(h_i)}_{f_i(w)}

Gradient view:

\nabla_w f_i = \underbrace{\phi(h_i, x_i)}_{\text{observed features}} - \underbrace{\sum_{v \in V} p_w(v \mid h_i) \cdot \phi(h_i, v)}_{\text{expected features}}

Setting this to zero means getting the model’s expectations to match empirical observations.

39 / 62
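A minimal NumPy sketch of this gradient for a single training instance: observed features minus expected features under the model. The feature function phi, the vocabulary V, and the dense feature vectors are placeholders for illustration.

```python
import numpy as np

def grad_f_i(w, phi, h_i, x_i, V):
    """Gradient of f_i(w) = w . phi(h_i, x_i) - log Z_w(h_i):
    observed features minus expected features under p_w(. | h_i)."""
    feats = np.stack([phi(h_i, v) for v in V])         # |V| x d matrix of feature vectors
    scores = feats @ w
    p = np.exp(scores - np.logaddexp.reduce(scores))   # p_w(v | h_i) for each v in V
    observed = phi(h_i, x_i)
    expected = p @ feats                               # sum_v p_w(v | h_i) * phi(h_i, v)
    return observed - expected
```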

slide-40
SLIDE 40

MLE for w: Algorithms

◮ Batch methods (L-BFGS is popular)
◮ Stochastic gradient ascent/descent more common today, especially with special tricks for adapting the step size over time

◮ Many specialized methods (e.g., “iterative scaling”)

40 / 62

slide-41
SLIDE 41

Stochastic Gradient Descent

Goal: minimize \sum_{i=1}^{N} f_i(w) with respect to w.

Input: initial value w, number of epochs T, learning rate α

For t ∈ {1, . . . , T}:
  ◮ Choose a random permutation π of {1, . . . , N}.
  ◮ For i ∈ {1, . . . , N}:
      w ← w − α · ∇_w f_{π(i)}(w)

Output: w

41 / 62
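The pseudocode above translates almost line for line into Python. This sketch assumes a gradient function grad(w, h_i, x_i) and NumPy-array weights; the learning rate and epoch count are placeholder values, not recommendations from the slides.

```python
import random

def sgd(w, data, grad, alpha=0.1, epochs=5):
    """Stochastic gradient descent, following the slide's pseudocode.
    `data` is a list of (h_i, x_i) pairs; `grad(w, h_i, x_i)` returns the
    gradient of f_i at w.  `w` is assumed to be a NumPy array."""
    for _ in range(epochs):                        # t in {1, ..., T}
        order = list(range(len(data)))
        random.shuffle(order)                      # random permutation pi of {1, ..., N}
        for i in order:
            h_i, x_i = data[i]
            w = w - alpha * grad(w, h_i, x_i)      # w <- w - alpha * grad f_{pi(i)}(w)
    return w
```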

slide-42
SLIDE 42

Avoiding Overfitting

Maximum likelihood estimation:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} w \cdot \phi(h_i, x_i) - \log Z_w(h_i)

◮ If φ_j(h, x) is (almost) always positive, we can always increase the objective (a little bit) by increasing w_j toward +∞.

42 / 62

slide-43
SLIDE 43

Avoiding Overfitting

Maximum likelihood estimation:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} w \cdot \phi(h_i, x_i) - \log Z_w(h_i)

◮ If φ_j(h, x) is (almost) always positive, we can always increase the objective (a little bit) by increasing w_j toward +∞.

Standard solution is to add a regularization term:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_p^p

where λ > 0 is a hyperparameter and p = 2 or 1.

43 / 62
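A minimal sketch of the regularized objective for p = 2, written as a function of w; the value of λ and the placeholders phi and V are illustrative, and the penalty is applied once to the whole objective as in the formula above.

```python
import numpy as np

def regularized_objective(w, data, phi, V, lam=0.1):
    """sum_i [ w . phi(h_i, x_i) - log Z_w(h_i) ]  -  lam * ||w||_2^2   (p = 2)."""
    total = 0.0
    for h_i, x_i in data:
        scores = np.array([w @ phi(h_i, v) for v in V])
        total += w @ phi(h_i, x_i) - np.logaddexp.reduce(scores)   # log-likelihood term
    return total - lam * np.sum(w ** 2)                            # L2 penalty
```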

slide-44
SLIDE 44

MLE for w

If we had more time, we’d study this problem more carefully! Here’s what you must remember:

◮ There is no closed form; you must use a numerical optimization algorithm like stochastic gradient descent.
◮ Log-linear models are powerful but expensive (Z_w(h_i)).
◮ Regularization is very important; we don’t actually do MLE.
  ◮ Just like for n-gram models! Only even more so, since log-linear models are even more expressive.

44 / 62

slide-45
SLIDE 45

To-Do List

◮ Online quiz: due 11:59 pm Tuesday
◮ Read: Collins (2011) §2
◮ A1, out today, due January 18

45 / 62

slide-46
SLIDE 46

References I

Galen Andrew and Jianfeng Gao. Scalable training of ℓ1-regularized log-linear models. In Proc. of ICML, 2007.

Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.

Michael Collins. Log-linear models, MEMMs, and CRFs, 2011. URL http://www.cs.columbia.edu/~mcollins/crf.pdf.

Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, 1997.

Jacob Eisenstein, Amr Ahmed, and Eric P. Xing. Sparse additive generative models of text. In Proc. of ICML, 2011.

Joshua Goodman. Classes for fast maximum entropy training. In Proc. of ICASSP, 2001.

John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. In NIPS, 2009.

Raymond Lau, Ronald Rosenfeld, and Salim Roukos. Trigger-based language models: A maximum entropy approach. In Proc. of ICASSP, 1993.

Roni Rosenfeld. Adaptive Statistical Language Modeling: A Maximum Entropy Approach. PhD thesis, Carnegie Mellon University, 1994.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

46 / 62

slide-47
SLIDE 47

Extras

47 / 62

slide-48
SLIDE 48

Special Case: Logistic Regression

Consider the case where Y = {+1, −1}.

p_w(Y = +1 \mid x) = \frac{\exp(w \cdot \phi(x, +1))}{\exp(w \cdot \phi(x, +1)) + \exp(w \cdot \phi(x, -1))}

48 / 62

slide-49
SLIDE 49

Special Case: Logistic Regression

Consider the case where Y = {+1, −1}.

p_w(Y = +1 \mid x) = \frac{\exp(w \cdot \phi(x, +1))}{\exp(w \cdot \phi(x, +1)) + \exp(w \cdot \phi(x, -1))}
                   = \mathrm{logit}^{-1}\big(w \cdot (\phi(x, +1) - \phi(x, -1))\big)

49 / 62

slide-50
SLIDE 50

Special Case: Logistic Regression

Consider the case where Y = {+1, −1}.

p_w(Y = +1 \mid x) = \frac{\exp(w \cdot \phi(x, +1))}{\exp(w \cdot \phi(x, +1)) + \exp(w \cdot \phi(x, -1))}
                   = \mathrm{logit}^{-1}\big(w \cdot (\phi(x, +1) - \phi(x, -1))\big)
                   = \mathrm{logit}^{-1}(w \cdot f(x))   (notation change: f(x) = \phi(x, +1) - \phi(x, -1))

50 / 62

slide-51
SLIDE 51

Special Case: Logistic Regression

Consider the case where Y = {+1, −1}.

p_w(Y = +1 \mid x) = \frac{\exp(w \cdot \phi(x, +1))}{\exp(w \cdot \phi(x, +1)) + \exp(w \cdot \phi(x, -1))}
                   = \mathrm{logit}^{-1}\big(w \cdot (\phi(x, +1) - \phi(x, -1))\big)
                   = \mathrm{logit}^{-1}(w \cdot f(x))   (notation change: f(x) = \phi(x, +1) - \phi(x, -1))

◮ Should be familiar, if you know about logistic regression.

51 / 62

slide-52
SLIDE 52

Special Case: Logistic Regression

Consider the case where Y = {+1, −1}.

p_w(Y = +1 \mid x) = \frac{\exp(w \cdot \phi(x, +1))}{\exp(w \cdot \phi(x, +1)) + \exp(w \cdot \phi(x, -1))}
                   = \mathrm{logit}^{-1}\big(w \cdot (\phi(x, +1) - \phi(x, -1))\big)
                   = \mathrm{logit}^{-1}(w \cdot f(x))   (notation change: f(x) = \phi(x, +1) - \phi(x, -1))

◮ Should be familiar, if you know about logistic regression.
◮ When Y = {1, 2, . . . , k}, log-linear models are often called multinomial logistic regression.

52 / 62
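A small sketch of this binary special case: the inverse logit is the logistic sigmoid of the score w · f(x). The weight vector and the two-dimensional f(x) below are made-up values for illustration.

```python
import numpy as np

def p_positive(w, f_x):
    """p_w(Y = +1 | x) = logit^{-1}(w . f(x)), i.e., the logistic sigmoid of the score."""
    return 1.0 / (1.0 + np.exp(-(w @ f_x)))

# Hypothetical example: f(x) = phi(x, +1) - phi(x, -1) for some feature design.
w = np.array([0.8, -1.2])
f_x = np.array([1.0, 0.5])
print(p_positive(w, f_x))   # probability of the +1 label
```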

slide-53
SLIDE 53

Special Case: Classic n-Gram Language Model

Consider an n-gram language model, where X = V^{n−1} and Y = V. Let:

◮ d = 1
◮ φ_1(h, v) = \log c(hv)
◮ w_1 = 1
◮ Z(h) = \sum_{v' \in V} \exp \log c(hv') = \sum_{v' \in V} c(hv') = c(h)

53 / 62

slide-54
SLIDE 54

Special Case: Classic n-Gram Language Model

Consider an n-gram language model, where X = V^{n−1} and Y = V. Let:

◮ d = 1
◮ φ_1(h, v) = \log c(hv)
◮ w_1 = 1
◮ Z(h) = \sum_{v' \in V} \exp \log c(hv') = \sum_{v' \in V} c(hv') = c(h)

Alternately:

◮ d = |V|^n
◮ φ_{\tilde{h},\tilde{v}}(h, v) = \begin{cases} 1 & \text{if } h = \tilde{h} \wedge v = \tilde{v} \\ 0 & \text{otherwise} \end{cases}
◮ w_{\tilde{h},\tilde{v}} = \log \frac{c(\tilde{h}\tilde{v})}{c(\tilde{h})}
◮ Z(h) = 1

54 / 62
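A quick numerical check of the first construction: with a single feature φ_1(h, v) = log c(hv) and w_1 = 1, the log-linear probability reduces to the relative-frequency estimate c(hv)/c(h). The toy bigram counts below are invented.

```python
import math
from collections import Counter

# Hypothetical bigram counts c(hv) for history h = ("too",).
counts = Counter({("too", "much"): 8, ("too", "many"): 4, ("too", "few"): 2})
h = ("too",)
V = ["much", "many", "few"]

def p(v):
    # exp(w1 * phi1(h, v)) / Z(h) with phi1 = log c(hv) and w1 = 1, so Z(h) = c(h).
    Z = sum(math.exp(math.log(counts[h + (u,)])) for u in V)
    return math.exp(math.log(counts[h + (v,)])) / Z

for v in V:
    print(v, p(v), counts[h + (v,)] / sum(counts[h + (u,)] for u in V))  # the two columns match
```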

slide-55
SLIDE 55

ℓ1 Regularization

This case warrants a little more discussion:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_1

Note that: \|w\|_1 = \sum_{j=1}^{d} |w_j|

◮ This results in sparsity (i.e., many wj = 0).

55 / 62

slide-56
SLIDE 56

ℓ1 Regularization

This case warrants a little more discussion:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_1

Note that: \|w\|_1 = \sum_{j=1}^{d} |w_j|

◮ This results in sparsity (i.e., many w_j = 0).
◮ Many have argued that this is a good thing (Tibshirani, 1996); it’s a kind of feature selection.

56 / 62

slide-57
SLIDE 57

ℓ1 Regularization

This case warrants a little more discussion:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_1

Note that: \|w\|_1 = \sum_{j=1}^{d} |w_j|

◮ This results in sparsity (i.e., many w_j = 0).
◮ Many have argued that this is a good thing (Tibshirani, 1996); it’s a kind of feature selection.
◮ Do not confuse it with data sparseness (a problem to be overcome)!

57 / 62

slide-58
SLIDE 58

ℓ1 Regularization

This case warrants a little more discussion:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_1

Note that: \|w\|_1 = \sum_{j=1}^{d} |w_j|

◮ This results in sparsity (i.e., many w_j = 0).
◮ Many have argued that this is a good thing (Tibshirani, 1996); it’s a kind of feature selection.
◮ Do not confuse it with data sparseness (a problem to be overcome)!
◮ This is not differentiable at w_j = 0.

58 / 62

slide-59
SLIDE 59

ℓ1 Regularization

This case warrants a little more discussion:

\max_{w \in \mathbb{R}^d} \sum_{i=1}^{N} \bigg( w \cdot \phi(h_i, x_i) - \log \sum_{v \in V} \exp(w \cdot \phi(h_i, v)) \bigg) - \lambda \|w\|_1

Note that: \|w\|_1 = \sum_{j=1}^{d} |w_j|

◮ This results in sparsity (i.e., many w_j = 0).
◮ Many have argued that this is a good thing (Tibshirani, 1996); it’s a kind of feature selection.
◮ Do not confuse it with data sparseness (a problem to be overcome)!
◮ This is not differentiable at w_j = 0.
◮ Optimization: special solutions for batch (e.g., Andrew and Gao, 2007) and stochastic (e.g., Langford et al., 2009) settings.

59 / 62
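This is not the slide's algorithm, but one common way to handle the nondifferentiability in a stochastic setting, related in spirit to the truncated-gradient work cited above: take a gradient step on the smooth part of the objective, then apply a soft-thresholding (proximal) step for the ℓ1 penalty, which is what drives weights exactly to zero. The step size and λ are placeholders.

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of tau * ||w||_1: shrink each coordinate toward 0,
    setting it exactly to 0 when its magnitude is below tau (this creates sparsity)."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def proximal_sgd_step(w, grad_smooth, alpha, lam):
    """One update: gradient descent on the smooth (negative log-likelihood) part,
    then soft-threshold to account for the lambda * ||w||_1 term."""
    return soft_threshold(w - alpha * grad_smooth, alpha * lam)
```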

slide-60
SLIDE 60

Maximum Entropy

Consider a distribution p over events in X. The Shannon entropy (in bits) of p is defined as:

H(p) = -\sum_{x \in X} p(X = x) \cdot \begin{cases} 0 & \text{if } p(X = x) = 0 \\ \log_2 p(X = x) & \text{otherwise} \end{cases}

This is a measure of “randomness”; entropy is zero when p is deterministic and \log_2 |X| when p is uniform.

Maximum entropy principle: among distributions that fit the data, pick the one with the greatest entropy.

60 / 62
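A small sketch of the entropy computation, with the 0 · log 0 = 0 convention handled by skipping zero-probability events; the example distributions are arbitrary.

```python
import math

def entropy_bits(p):
    """Shannon entropy H(p) in bits for a distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)   # skip p(x) = 0 terms (0 log 0 = 0)

print(entropy_bits([1.0, 0.0, 0.0]))           # deterministic: 0 bits
print(entropy_bits([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 events: log2(4) = 2 bits
```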

slide-61
SLIDE 61

Maximum Entropy

If “fit the data” is taken to mean

\forall k \in \{1, \ldots, d\}, \quad E_p[\phi_k] = \tilde{E}[\phi_k],

then the MLE of the log-linear family with features φ is the maximum entropy solution. This is why log-linear models are sometimes called “maxent” models (e.g., Berger et al., 1996).

61 / 62

slide-62
SLIDE 62

“Whole Sentence” Log-Linear Models

(Rosenfeld, 1994)

Instead of a log-linear model for each word-given-history, define a single log-linear model over the event space V†:

p_w(x) = \frac{\exp(w \cdot \phi(x))}{Z_w}

◮ Any feature of the sentence could be included in this model!
◮ Z_w is deceptively simple-looking!

Z_w = \sum_{x \in V^{\dagger}} \exp(w \cdot \phi(x))

62 / 62