Neural Network LMs
READ CHAPTERS 5 AND 7 IN JURAFSKY AND MARTIN. READ CHAPTER 4 FROM YOAV GOLDBERG’S BOOK NEURAL NETWORK METHODS FOR NLP (IT’S FREE TO DOWNLOAD FROM PENN’S CAMPUS!)
Reminders: QUIZ IS DUE TONIGHT BY 11:59PM. HOMEWORK 5 IS DUE WEDNESDAY.
Logistic regression solves this task by learning, from a training set, a vector of weights and a bias term:
z = ( Σ_{i=1}^{n} wi xi ) + b
We can also write this as a dot product:
z = w·x + b
P(y = 1) = σ(w·x+b) = 1 / (1 + e^{−(w·x+b)})
P(y = 0) = 1 − σ(w·x+b) = 1 − 1 / (1 + e^{−(w·x+b)}) = e^{−(w·x+b)} / (1 + e^{−(w·x+b)})
We need to determine for some observation x how close the classifier output ŷ = σ(w·x+b) is to the correct output y, which is 0 or 1. L(ŷ, y) = how much ŷ differs from the true y.
For one observation x, let's maximize the probability of the correct label p(y|x):
p(y|x) = ŷ^y (1 − ŷ)^{1−y}
If y = 1, then p(y|x) = ŷ. If y = 0, then p(y|x) = 1 − ŷ.
The result is cross-entropy loss:
LCE(ŷ, y) = −log p(y|x) = −[y log ŷ + (1 − y) log(1 − ŷ)]
Finally, plug in the definition ŷ = σ(w·x+b):
LCE(ŷ, y) = −[y log σ(w·x+b) + (1 − y) log(1 − σ(w·x+b))]
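As a quick illustration, here is a minimal numpy sketch of the sigmoid and of this per-example cross-entropy loss. The weights, bias, and features below are made-up toy values, not anything from the slides.

```python
import numpy as np

def sigmoid(z):
    """The logistic function: squashes a real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(y_hat, y):
    """L_CE(y_hat, y) = -[y log y_hat + (1 - y) log(1 - y_hat)]."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy example: one observation with three features and made-up weights.
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 0.2, 3.0])
y = 1                                  # true label

y_hat = sigmoid(np.dot(w, x) + b)      # P(y = 1 | x)
print(y_hat, cross_entropy_loss(y_hat, y))
```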
Why does minimizing this negative log probability do what we want? A perfect classifier would assign probability 1 to the correct outcome and probability 0 to the incorrect outcome. That means the higher ŷ is (the closer it is to 1), the better the classifier; the lower ŷ is (the closer it is to 0), the worse the classifier. The negative log of this probability is a convenient loss metric since it goes from 0 (negative log of 1, no loss) to infinity (negative log of 0, infinite loss).
log p(training labels) = log ∏_{i=1}^{m} p(y^(i) | x^(i)) = Σ_{i=1}^{m} log p(y^(i) | x^(i)) = − Σ_{i=1}^{m} LCE(ŷ^(i), y^(i))
We use gradient descent to find good settings for our weights and bias by minimizing the loss function. Gradient descent is a method that finds a minimum of a function by figuring out in which direction (in the space of the parameters θ) the function’s slope is rising the most steeply, and moving in the opposite direction.
θ̂ = argmin_θ (1/m) Σ_{i=1}^{m} LCE(y^(i), x^(i); θ)
For logistic regression, this loss function is conveniently convex. A convex function has just one minimum, so there are no local minima to get stuck in. So gradient descent starting from any point is guaranteed to find the minimum.
[Figure: the loss as a function of a single weight w. The slope of the loss at w1 is negative, so gradient descent moves w toward the minimum wmin (the goal).]
The magnitude of the amount to move in gradient descent is the value of the slope, weighted by a learning rate η. A higher (faster) learning rate means that we should move w more on each step. For a function of one parameter w (to build intuition), the update is:
wt+1 = wt − η (d/dw) f(x; w)
[Figure: the cost surface Cost(w, b) over the two parameters w and b.]
∇θ L(f(x;θ), y) = [ ∂L(f(x;θ), y)/∂w1 , ∂L(f(x;θ), y)/∂w2 , … , ∂L(f(x;θ), y)/∂wn ]
The final equation for updating θ based on the gradient is thus:
θt+1 = θt − η ∇L(f(x;θ), y)
To update θ, we need a definition for the gradient ∇L(f(x;θ), y). For logistic regression the cross-entropy loss function is:
LCE(w,b) = −[y log σ(w·x+b) + (1−y) log(1−σ(w·x+b))]
The derivative of this function for one observation vector x, for a single weight wj, is:
∂LCE(w,b)/∂wj = [σ(w·x+b) − y] xj
The gradient is a very intuitive value: the difference between the true y and our estimate ŷ for x, multiplied by the corresponding input value xj.
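Continuing the toy example from before, a small sketch of this per-weight derivative; vectorizing over j gives the whole gradient at once. The values are again made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One observation with made-up values.
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 0.2, 3.0])
y = 1

y_hat = sigmoid(np.dot(w, x) + b)
grad_w = (y_hat - y) * x    # dL_CE/dw_j = [sigma(w.x + b) - y] * x_j, for every j at once
grad_b = y_hat - y          # the bias acts like a weight on a feature fixed at 1
```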
This is what we want to minimize!!
Cost(w,b) = (1/m) Σ_{i=1}^{m} LCE(ŷ^(i), y^(i)) = −(1/m) Σ_{i=1}^{m} [ y^(i) log σ(w·x^(i)+b) + (1 − y^(i)) log(1 − σ(w·x^(i)+b)) ]
The loss for a batch of data or an entire dataset is just the average loss
The gradient for multiple data points is the sum of the individual gradients:
∂Cost(w,b)/∂wj = Σ_{i=1}^{m} [ σ(w·x^(i)+b) − y^(i) ] xj^(i)
function STOCHASTIC GRADIENT DESCENT(L(), f(), x, y) returns θ
    # where: L is the loss function
    #        f is a function parameterized by θ
    #        x is the set of training inputs x(1), x(2), ..., x(n)
    #        y is the set of training outputs (labels) y(1), y(2), ..., y(n)
    θ ← 0
    repeat T times
        for each training tuple (x(i), y(i)) (in random order)
            Compute ŷ(i) = f(x(i); θ)         # What is our estimated output ŷ?
            Compute the loss L(ŷ(i), y(i))    # How far off is ŷ(i) from the true output y(i)?
            g ← ∇θ L(f(x(i); θ), y(i))        # How should we move θ to maximize the loss?
            θ ← θ − η g                       # go the other way instead
    return θ
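A possible Python rendering of this loop for the binary logistic-regression case; the function name and the fixed learning rate and epoch count are my own choices, and this is only one of many equivalent ways to write it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, eta=0.1, T=100):
    """Stochastic gradient descent for binary logistic regression.

    X: (n, d) array of training inputs; y: (n,) array of 0/1 labels.
    Returns the learned weight vector w and bias b.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    rng = np.random.default_rng(0)
    for _ in range(T):                      # repeat T times
        for i in rng.permutation(n):        # each training example, in random order
            y_hat = sigmoid(X[i] @ w + b)   # estimated output
            g_w = (y_hat - y[i]) * X[i]     # gradient of the CE loss wrt w
            g_b = (y_hat - y[i])            # gradient wrt b
            w -= eta * g_w                  # move against the gradient
            b -= eta * g_b
    return w, b
```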
Instead of binary classification, we often want more than two classes. For sentiment classification we might extend the class labels to be positive, negative, and neutral. We want to know the probability of y for each class c ∈ C, p(y = c|x). To get a proper probability, we will use a generalization of the sigmoid function called the softmax function:
softmax(zi) = e^{zi} / Σ_{j=1}^{k} e^{zj}    for 1 ≤ i ≤ k
The softmax function takes in an input vector z = [z1, z2, ..., zk] and outputs a vector of values normalized into probabilities:
softmax(z) = [ e^{z1} / Σ_{i=1}^{k} e^{zi} , e^{z2} / Σ_{i=1}^{k} e^{zi} , … , e^{zk} / Σ_{i=1}^{k} e^{zi} ]
For example, for this input: z = [0.6, 1.1, −1.5, 1.2, 3.2, −1.1]
Softmax will output: [0.056, 0.090, 0.007, 0.099, 0.74, 0.010]
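A small numpy sketch of softmax; subtracting the max before exponentiating is a common numerical-stability trick and is not required by the definition above.

```python
import numpy as np

def softmax(z):
    """Map a vector of scores to a probability distribution."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
print(softmax(z))   # roughly [0.056, 0.090, 0.007, 0.099, 0.74, 0.010]
```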
[Figures: a single neuron with inputs x1–x4 and output y1; a feed-forward network with an input layer x1–x4, two hidden layers of activation units, and outputs y1–y3.]
The simplest neural network is called a perceptron. It is simply a linear model: NN_Perceptron(x) = xW + b, where W is the weight matrix and b is a bias term.
To go beyond linear functions, we introduce a nonlinear hidden layer. The result is called a Multi-Layer Perceptron with one hidden layer: NN_MLP1(x) = g(xW1 + b1)W2 + b2. Here W1 and b1 are a matrix and a bias term for the first linear transformation of the input x, g is a nonlinear function (also called an activation function), and W2 and b2 are the matrix and bias term for a second linear transform. We can add additional linear transformations and nonlinearities, resulting in an MLP with two hidden layers:
h1 = g1(xW1 + b1)
h2 = g2(h1W2 + b2)
y = h2W3
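A forward-pass sketch of this two-hidden-layer MLP in numpy, with made-up layer sizes and randomly initialized parameters standing in for learned ones (tanh is used here as the nonlinearity g, but any activation function would do).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h1, d_h2, d_out = 4, 8, 6, 3            # made-up layer sizes

# Randomly initialized parameters; in practice these are learned.
W1, b1 = rng.normal(size=(d_in, d_h1)), np.zeros(d_h1)
W2, b2 = rng.normal(size=(d_h1, d_h2)), np.zeros(d_h2)
W3 = rng.normal(size=(d_h2, d_out))

g = np.tanh                                     # the nonlinearity (activation function)

x = rng.normal(size=(1, d_in))                  # a single input vector
h1 = g(x @ W1 + b1)                             # first hidden layer
h2 = g(h1 @ W2 + b2)                            # second hidden layer
y = h2 @ W3                                     # output layer, as in the equations above
print(y.shape)                                  # (1, 3)
```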
A neural network can be described by the dimensions of its layers and intermediary variables:
din is the number of dimensions of the input vector; dout is the number of dimensions of the output vector. A fully connected layer l(x) = xW + b with input size din and output size dout will have the following dimensions: the dimensions of x are 1 × din, the dimensions of W are din × dout, and the dimensions of b are 1 × dout.
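These shapes can be checked mechanically; a tiny sketch assuming din = 4 and dout = 3:

```python
import numpy as np

d_in, d_out = 4, 3                 # assumed sizes for illustration
x = np.ones((1, d_in))             # 1 x d_in
W = np.ones((d_in, d_out))         # d_in x d_out
b = np.ones((1, d_out))            # 1 x d_out

out = x @ W + b
print(out.shape)                   # (1, 3), i.e. 1 x d_out
```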
dout = 1 means the neural network's output is a scalar. Such networks can be used for regression (scoring) or binary classification.
dout = k > 1 can be used for k-class classification: associate each output dimension with a class, and predict the class whose dimension has the maximal value.
If the output is passed through a softmax, it can be interpreted as a distribution over class assignments. The softmax forces the values in an output layer to be positive and sum to 1, making them interpretable as a probability distribution.
y = xW + b
ŷ[i] = e^{(xW+b)[i]} / Σ_j e^{(xW+b)[j]}
A Multi-Layer Perceptron with one hidden layer is a “universal approximator”: it can approximate a family of functions that includes all continuous functions on a closed and bounded subset of ℝⁿ, and it can approximate any function mapping from any finite dimensional discrete space to another.
[Figure: plots of four common activation functions: sigmoid(x), tanh(x), hardtanh(x), and ReLU(x).]
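Minimal numpy definitions of these four activation functions (hardtanh here is taken to clip values to [−1, 1]):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def hardtanh(x):
    return np.clip(x, -1.0, 1.0)   # -1 below -1, identity in between, 1 above 1

def relu(x):
    return np.maximum(0.0, x)
```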
Loss functions. Much like training a logistic regression classifier, we define a loss function L(ŷ, y) = how much ŷ differs from the true y. Loss functions like cross-entropy loss are relevant for neural nets too.
Regularization. We add a regularization term R(Θ) alongside our loss function when we search for the best parameters. Dropout attempts to avoid overfitting by randomly dropping (setting to 0) half of the neurons in the network for each training example in SGD.
Θ̂ = argmin_Θ [ L(Θ) + R(Θ) ] = argmin_Θ [ (1/n) Σ_{i=1}^{n} L(f(xi; Θ), yi) + R(Θ) ]
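A sketch of both ideas, using L2 (sum of squared weights) as one possible choice of R(Θ) and an "inverted dropout" style mask; the rescaling by the keep probability is a common convention rather than something the slides specify.

```python
import numpy as np

def l2_regularizer(params, lam=0.01):
    """One possible R(Theta): the (scaled) sum of squared parameter values."""
    return lam * sum(np.sum(W ** 2) for W in params)

def dropout(h, p_drop=0.5, rng=None):
    """Randomly zero out a fraction p_drop of the activations in h (training time only)."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(h.shape) >= p_drop
    return (h * mask) / (1.0 - p_drop)   # rescale so the expected activation is unchanged
```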
Estimate the probability of a sentence consisting of the word sequence w1:n. We need to estimate the probability P(wi+1 | wi−k:i) from a large corpus.
P(w1:n) ≈ ∏_{i=1}^{n} P(wi | wi−k:i−1)
MLE estimate: p̂(wi+1 = m | wi−k:i) = #(wi−k:i+1) / #(wi−k:i)
Add-α smoothed estimate: p̂α(wi+1 = m | wi−k:i) = ( #(wi−k:i+1) + α ) / ( #(wi−k:i) + α|V| )
Interpolated (back-off) estimate: p̂int(wi+1 = m | wi−k:i) = λ_{wi−k:i} · #(wi−k:i+1) / #(wi−k:i) + (1 − λ_{wi−k:i}) · p̂int(wi+1 = m | wi−(k−1):i)
(where # denotes corpus counts)
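A toy sketch of the MLE and add-α estimates using simple count tables; the corpus, function names, and α value are all illustrative.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the blue car and the black car and the blue car".split()
V = set(tokens)
bigrams, unigrams = ngram_counts(tokens, 2), ngram_counts(tokens, 1)

def p_mle(w, prev):
    """Maximum-likelihood estimate #(prev w) / #(prev)."""
    return bigrams[(prev, w)] / unigrams[(prev,)]

def p_add_alpha(w, prev, alpha=0.1):
    """Add-alpha smoothed estimate: never zero, even for unseen combinations."""
    return (bigrams[(prev, w)] + alpha) / (unigrams[(prev,)] + alpha * len(V))

print(p_mle("car", "blue"))        # 1.0: both occurrences of "blue" are followed by "car"
print(p_add_alpha("car", "red"))   # non-zero even though "red car" was never observed
```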
The “curse of dimensionality”: if we want to model the full joint distribution of 10 consecutive words with a vocabulary V of size 100,000, there are potentially 100,000^10 = 10^50 free parameters. In n-gram LMs, we simplify this to predict the next word given only a limited history. Only those combinations of successive words that actually occur in our training corpus are recorded in the table. Having observed black car and blue car does not influence our estimates of red car. A lot of what we do in language modeling (smoothing, backoff, etc.) is trying to deal with the unobserved entries.
1. Associate each word in the vocabulary with a vector representation, thereby creating a notion of similarity between words.
2. Express the joint probability function of a word sequence in terms of the word vectors for the words in that sequence.
3. Simultaneously learn the word vectors and the parameters of the function.
The word vectors are low-dimensional (d=30 to d=100) dense vectors, like we’ve seen before. The probability function is expressed as the product of conditional probabilities of the next word given the previous words, using a multi-layer neural network.
The input to the neural network is a k-gram of words w1:k. The output is a probability distribution over the next word. The k context words are treated as a word window. Each word is associated with an embedding vector v(w) ∈ ℝ^dw, and the input vector x just concatenates v(w) for each of the k words.
The input x is fed into a neural network with 1 or more hidden layers:
y = P(wi | w1:k) = LM(w1:k) = softmax(hW2 + b2)
h = g(xW1 + b1)
x = [v(w1); v(w2); …; v(wk)]
v(w) = E[w]  (the row of the embedding matrix E for word w)
where E ∈ ℝ^{|V|×dw}, W1 ∈ ℝ^{k·dw×d}, b1 ∈ ℝ^d, W2 ∈ ℝ^{d×|V|}, b2 ∈ ℝ^{|V|}.
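A forward-pass sketch of this architecture in numpy, with illustrative sizes (|V| = 10, dw = 5, k = 3, d = 16) and random parameters standing in for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_w, k, d = 10, 5, 3, 16          # vocab size, embedding dim, context length, hidden dim

E  = rng.normal(size=(V, d_w))       # embedding matrix,     |V| x d_w
W1 = rng.normal(size=(k * d_w, d))   # first layer weights,  k*d_w x d
b1 = np.zeros(d)
W2 = rng.normal(size=(d, V))         # output layer weights, d x |V|
b2 = np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def neural_lm(context_ids):
    """Distribution over the next word, given k context word ids."""
    x = np.concatenate([E[w] for w in context_ids])   # x = [v(w1); ...; v(wk)]
    h = np.tanh(x @ W1 + b1)                          # h = g(x W1 + b1)
    return softmax(h @ W2 + b2)                       # y = softmax(h W2 + b2)

probs = neural_lm([2, 7, 4])          # arbitrary context word ids
print(probs.shape, probs.sum())       # (10,) 1.0
```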
The training examples are simply word (k+1)-grams from the corpus: the identities of the first k words are used as features, and the last word is used as the target label for the classification. Conceptually, the model is trained using cross-entropy loss. Working with cross-entropy loss works very well, but requires the use of a costly softmax operation which can be prohibitive for very large vocabularies, so we often use alternative loss functions or approximations.
Better results. They achieve better perplexity scores than state-of-the-art n-gram LMs.
Larger n. Neural LMs can scale to much larger orders of n. This is achievable because parameters are associated only with individual words, and not with n-grams.
They generalize across contexts. For example, by observing that the words blue, green, red, black, etc. appear in similar contexts, the model will be able to assign a reasonable score to the green car even though it never observed it in training, because it did observe blue car and red car.
A by-product of training is word embeddings!
Goal: Learn a function that returns the joint probability of a sequence of words.
Primary difficulty: we only ever observe a tiny fraction of the possible words / word sequences. This is sometimes called the “curse of dimensionality”. Suppose we want a joint distribution over 10 words and we have a vocabulary of size 100,000: that is 100,000^10 = 10^50 parameters, which is too high to estimate from data.
In LMs we use the chain rule to get the conditional probability of upcoming words:
P(x1 x2 x3 … xn) = ∏_{t=1}^{n} P(xt | x1 … xt−1)
What assumption do we make in n-gram LMs to simplify this? The probability of the next word only depends on the previous n−1 words. A small n makes it easier for us to get an estimate of the probability from data.
We construct tables to look up the probability of seeing a word given a history, P(wt | wt−n … wt−1). The tables only store observed sequences. What happens when we have a new (unseen) combination? We are basically just stitching together the short sequences of words that we have observed.
Let’s try generalizing. Intuition: take a sentence like “The cat is walking in the bedroom” and use it when we assign probabilities to similar sentences like “The dog is running around the room”.
1. Associate each word with a vector of real values in ℝm (m = 30, 60, 100). This gives a way to compute word similarity.
2. Express the joint probability function of words in a sequence based on the sequence of these vectors.
3. Simultaneously learn the word vectors and the parameters of the probability function from data.
Seeing one of the cat/dog sentences allows the model to increase the probability for that sentence and its combinatorial number of “neighbor” sentences in vector space.
Bengio et al NIPS 2003
Given:
A training set w1 … wT where each wt ∈ V
Learn:
f(w1 … wt) = P(wt|w1 … wt-1) Subject to giving a high probability to an unseen text/dev set (e.g. minimizing the perplexity)
Constraint:
Create a proper probability distribution (e.g. sums to 1) so that we can take the product of conditional probabilities to get the joint probability of a sentence
Associate each word with a feature vector in ℝM. Store these in a V-by-M matrix C. Initialize it with singular value decomposition (SVD).
Learn a function g that maps the context word vectors onto a probability distribution over the vocabulary V: g(C(wt−n), …, C(wt−1)) = P(wt | wt−n … wt−1)
When the ~50-dimensional vectors that result from training a neural LM are projected down to two dimensions, we see that a lot of words that are intuitively similar to each other are close together.