CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm
Logistics CS546 Machine Learning in NLP 2
Schedule
Weeks 1–4: Lectures
Paper presentations (Lectures 9–28):
1. Word embeddings
2. Language models
3. More on RNNs for NLP
4. CNNs for NLP
5. Multitask learning for NLP
6. Syntactic parsing
7. Information extraction
8. Semantic parsing
9. Coreference resolution
10. Machine translation I
11. Machine translation II
12. Generation
13. Discourse
14. Dialog I
15. Dialog II
16. Multimodal NLP
17. Question answering
18. Entailment recognition
19. Reading comprehension
20. Knowledge graph modeling
Machine learning fundamentals
Learning scenarios
- Supervised learning: learning to predict labels/structures from correctly annotated data
- Unsupervised learning: learning to find hidden structure (e.g. clusters) in unannotated input data
- Semi-supervised learning: learning to predict labels/structures from (a little) annotated and (a lot of) unannotated data
- Reinforcement learning: learning to act through feedback for actions (rewards/punishments) from the environment
Input and output
- Input: an item x drawn from an input space X
- Output: an item y drawn from an output space Y
- System: y = f(x)
In (supervised) machine learning, we deal with systems whose f(x) is learned from (labeled) examples.
Supervised learning
- Target function: y = f(x)
- Learned model: y = g(x)
- x is an item drawn from an instance space X; y is an item drawn from a label space Y
You often see f̂(x) instead of g(x), but PowerPoint can’t really typeset that, so g(x) will have to do.
Supervised learning
- Regression: Y is continuous
- Classification: Y is discrete (and finite)
  - Binary classification: Y = {0,1} or {+1,-1}
  - Multiclass classification: Y = {1,…,K} (with K > 2)
- Structured prediction: Y consists of structured objects; Y often has some sort of compositional structure and may be infinite
Supervised learning: Training
Labeled training data: D_train = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}
Give the learner the examples in D_train; the learning algorithm returns a model g(x).
Supervised learning: Testing
Labeled test data: D_test = {(x'_1, y'_1), (x'_2, y'_2), …, (x'_M, y'_M)}
Reserve some labeled data for testing.
Supervised learning: Testing
Split the labeled test data D_test = {(x'_1, y'_1), …, (x'_M, y'_M)} into the raw test data X_test = {x'_1, …, x'_M} and the test labels Y_test = {y'_1, …, y'_M}.
Supervised learning: Testing
Apply the learned model g(x) to the raw test data X_test to obtain the predicted labels g(X_test) = {g(x'_1), …, g(x'_M)}.
Supervised learning: Testing
Evaluate the model by comparing the predicted labels g(X_test) against the test labels Y_test.
Design decisions
- What data do you use to train/test your system? Do you have enough training data? How noisy is it?
- What evaluation metrics do you use to test your system? Do they correlate with what you want to measure?
- What features do you use to represent your data X? (Feature engineering used to be really important.)
- What kind of a model do you want to use? What network architecture do you want to use?
- What learning algorithm do you use to train your system? How do you set the hyperparameters of the algorithm?
Linear classifiers: f(x) = w_0 + w·x
[Figure: a hyperplane f(x) = 0 in the (x_1, x_2) plane, with f(x) > 0 on one side and f(x) < 0 on the other]
Linear classifiers are defined over vector spaces.
Every hypothesis f(x) is a hyperplane: f(x) = w_0 + w·x
f(x) is also called the decision boundary:
- Assign ŷ = +1 to all x where f(x) > 0
- Assign ŷ = -1 to all x where f(x) < 0
That is, ŷ = sgn(f(x))
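As a small sketch of this prediction rule (the weights below are hand-picked for illustration, not learned or taken from the slides):

```python
# Minimal linear-classifier sketch: f(x) = w0 + w·x, prediction ŷ = sgn(f(x)).

def f(x, w0, w):
    """Score of x under the hyperplane defined by bias w0 and weights w."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def predict(x, w0, w):
    """Assign ŷ = +1 if f(x) > 0, else -1."""
    return 1 if f(x, w0, w) > 0 else -1

w0, w = -1.0, [2.0, 1.0]           # hypothetical decision boundary
print(predict([1.0, 0.5], w0, w))  # f = -1 + 2 + 0.5 = 1.5 > 0, so +1
print(predict([0.0, 0.0], w0, w))  # f = -1 < 0, so -1
```

Points with the same score sign get the same label; the boundary itself is the set of x with f(x) = 0.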
Learning a linear classifier
[Figure: labeled points in the sample space X = R², before and after a separating boundary is drawn]
Input: labeled training data D = {(x_1, y_1), …, (x_D, y_D)}, plotted in the sample space X = R², with y_i = +1 or y_i = -1
Output: a decision boundary f(x) = 0 that separates the training data: y_i·f(x_i) > 0
CS446 Machine Learning 16
Which model should we pick?
We need a metric (aka an objective function). We would like to minimize the probability of misclassifying unseen examples, but we can’t measure that probability. Instead: minimize the number of misclassified training examples.
Which model should we pick?
We need a more specific metric: there may be many models that are consistent with the training data. Loss functions provide such metrics.
y·f(x) > 0: Correct classification
An example (x, y) is correctly classified by f(x) if and only if y·f(x) > 0:
- Case 1 (y = +1 = ŷ): f(x) > 0 ⇒ y·f(x) > 0
- Case 2 (y = -1 = ŷ): f(x) < 0 ⇒ y·f(x) > 0
- Case 3 (y = +1 ≠ ŷ = -1): f(x) < 0 ⇒ y·f(x) < 0
- Case 4 (y = -1 ≠ ŷ = +1): f(x) > 0 ⇒ y·f(x) < 0
Loss functions for classification
Loss = what penalty do we incur if we misclassify x?
L(y, f(x)) is the loss (aka cost) of classifier f on example x when the true label of x is y. We assign label ŷ = sgn(f(x)) to x.
In plots of L(y, f(x)), the x-axis is typically y·f(x).
Today: 0-1 loss and square loss (more loss functions later).
0-1 Loss
L(y, f(x)) = 0 if y = ŷ, and 1 if y ≠ ŷ
Equivalently:
L(y·f(x)) = 0 if y·f(x) > 0 (correctly classified), and 1 if y·f(x) < 0 (misclassified)
Square Loss: (y − f(x))²
L(y, f(x)) = (y − f(x))²
Note: L(-1, f(x)) = (-1 − f(x))² = (1 + f(x))² = L(1, -f(x))
(the loss when y = -1 [red] is the mirror of the loss when y = +1 [blue])
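A quick numeric check of the two losses (the example scores are made up for illustration):

```python
def zero_one_loss(y, fx):
    """0-1 loss: 0 if sgn(f(x)) agrees with the true label y, else 1."""
    return 0 if y * fx > 0 else 1

def square_loss(y, fx):
    """Square loss: (y - f(x))**2."""
    return (y - fx) ** 2

# Correctly classified, but f(x) != y, so the square loss is still nonzero:
print(zero_one_loss(1, 0.5))   # 0
print(square_loss(1, 0.5))     # 0.25
# Misclassified: the square loss (1 - (-0.5))**2 = 2.25 exceeds the 0-1 loss:
print(zero_one_loss(1, -0.5))  # 1
print(square_loss(1, -0.5))    # 2.25
# The mirror-image property from the slide: L(-1, f(x)) == L(1, -f(x)):
print(square_loss(-1, 0.5) == square_loss(1, -0.5))  # True
```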
The square loss is a convex upper bound on the 0-1 loss
Batch learning: Gradient Descent for Least Mean Squares (LMS)
Gradient Descent
Iterative batch learning algorithm:
- The learner updates the hypothesis based on the entire training data
- The learner has to go multiple times over the training data
Goal: minimize the training error/loss
- At each step: move w in the direction of steepest descent along the error/loss surface
Gradient Descent
Error(w): error of w on the training data
w_i: weight at iteration i
[Figure: the error surface Error(w) over w, with iterates w_1, w_2, w_3, w_4 descending toward the minimum]
Least Mean Square Error
LMS error: the sum of the square loss over all training items (multiplied by 0.5 for convenience):
Err(w) = ½ ∑_{d∈D} (y_d − ŷ_d)²
D is fixed, so there is no need to divide by its size.
Goal of learning: find w* = argmin_w Err(w)
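The LMS error can be computed directly from its definition. A minimal sketch, assuming a linear model ŷ_d = w·x_d and a made-up two-item dataset (neither is from the slides):

```python
def lms_error(w, data):
    """Err(w) = 0.5 * sum over (x, y) in D of (y - w·x)**2."""
    return 0.5 * sum(
        (y - sum(wi * xi for wi, xi in zip(w, x))) ** 2
        for x, y in data
    )

# Hypothetical training items (x, y):
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
print(lms_error([1.0, -1.0], data))  # perfect fit: 0.0
print(lms_error([0.0, 0.0], data))   # 0.5 * (1 + 1) = 1.0
```

Learning then means searching for the w that drives this quantity to its minimum.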
Iterative batch learning
Initialization: initialize w_0 (the initial weight vector)
For each iteration i = 0…T:
- Determine by how much to change w based on the entire data set D: Δw = computeDelta(D, w_i)
- Update w: w_{i+1} = update(w_i, Δw)
Gradient Descent: Update
1. Compute ∇Err(w_i), the gradient of the training error at w_i. This requires going over the entire training data:
∇Err(w) = (∂Err(w)/∂w_0, ∂Err(w)/∂w_1, …, ∂Err(w)/∂w_N)ᵀ
2. Update w: w_{i+1} = w_i − α·∇Err(w_i), where α > 0 is the learning rate
What’s a gradient?
∇Err(w) = (∂Err(w)/∂w_0, ∂Err(w)/∂w_1, …, ∂Err(w)/∂w_N)ᵀ
The gradient is a vector of partial derivatives. It indicates the direction of steepest increase in Err(w). Hence the minus in the update rule: w_{i+1} = w_i − α·∇Err(w_i)
Computing ∇Err(w_i)
∂Err(w)/∂w_i = ∂/∂w_i [½ ∑_{d∈D} (y_d − f(x_d))²]
= ½ ∑_{d∈D} ∂/∂w_i (y_d − f(x_d))²
= ½ ∑_{d∈D} 2(y_d − f(x_d)) · ∂/∂w_i (y_d − w·x_d)
= −∑_{d∈D} (y_d − f(x_d))·x_di
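The closed form ∂Err(w)/∂w_i = −∑_d (y_d − f(x_d))·x_di can be sanity-checked against a finite-difference approximation of the error. The data and weight values below are made up for the check:

```python
def err(w, data):
    """Err(w) = 0.5 * sum_d (y_d - w·x_d)**2."""
    return 0.5 * sum(
        (y - sum(wi * xi for wi, xi in zip(w, x))) ** 2
        for x, y in data
    )

def grad(w, data):
    """Analytic gradient: component i is -sum_d (y_d - w·x_d) * x_di."""
    return [
        -sum((y - sum(wi * xi for wi, xi in zip(w, x))) * x[i] for x, y in data)
        for i in range(len(w))
    ]

data = [([1.0, 2.0], 1.0), ([3.0, -1.0], -1.0)]  # hypothetical items
w = [0.5, -0.5]
eps = 1e-6
for i in range(len(w)):
    w_plus = list(w);  w_plus[i] += eps
    w_minus = list(w); w_minus[i] -= eps
    numeric = (err(w_plus, data) - err(w_minus, data)) / (2 * eps)
    # Central differences should agree closely with the analytic gradient:
    print(abs(numeric - grad(w, data)[i]) < 1e-5)  # True
```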
Gradient descent (batch)
Initialize w_0 randomly
for i = 0…T:
    Δw = (0, …, 0)
    for every training item d = 1…D:
        f(x_d) = w_i · x_d
        for every component j = 0…N:
            Δw_j += α·(y_d − f(x_d))·x_dj
    w_{i+1} = w_i + Δw
return w_{i+1} when it has converged
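The pseudocode above can be turned into runnable Python. This is a sketch under my own assumptions: a tiny fabricated dataset, zero initialization instead of random (to keep it deterministic), and a fixed iteration count in place of a convergence test:

```python
def batch_gradient_descent(data, n_features, alpha=0.1, n_iters=100):
    """Batch LMS: accumulate alpha*(y_d - f(x_d))*x_dj over all of D, then update w."""
    w = [0.0] * n_features                # the slide initializes randomly; zeros for determinism
    for _ in range(n_iters):
        delta = [0.0] * n_features
        for x, y in data:                 # one full pass over the training data per update
            fx = sum(wj * xj for wj, xj in zip(w, x))
            for j in range(n_features):
                delta[j] += alpha * (y - fx) * x[j]
        w = [wj + dj for wj, dj in zip(w, delta)]
    return w

# Toy data consistent with the target w* = (1, -1); made up for illustration.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0)]
w = batch_gradient_descent(data, n_features=2)
print(w)  # approaches [1.0, -1.0]
```

Note that the weight vector changes only after the whole dataset has been seen; that is what makes this a batch (rather than online/stochastic) method.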
The batch update rule for each component w_i of w:
Δw_i = α ∑_{d=1}^{D} (y_d − w·x_d)·x_di
Implementing gradient descent: as you go through the training data, you can just accumulate the change in each component w_i of w.