  1. Logistic Regression & Neural Networks CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein

  2. Logistic Regression

  3. Perceptron & Probabilities • What if we want a probability p(y|x)? • The perceptron gives us a prediction y • Let’s illustrate this with binary classification (Illustrations: Graham Neubig)

  4. The logistic function • “Softer” function than in perceptron • Can account for uncertainty • Differentiable
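
  A minimal sketch of the logistic (sigmoid) function described above, in Python (the function name and the use of NumPy are my own choices, not from the slides):

      import numpy as np

      def logistic(z):
          # Maps any real-valued score to a probability in (0, 1).
          return 1.0 / (1.0 + np.exp(-z))

      # "Softer" than the perceptron's sign function: scores near 0 give
      # probabilities near 0.5 (uncertainty), and it is differentiable everywhere.
      print(logistic(0.0))   # 0.5
      print(logistic(4.0))   # ~0.982
      print(logistic(-4.0))  # ~0.018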

  5. Logistic regression: how to train? • Train based on conditional likelihood • Find the parameters w that maximize the conditional likelihood of all labels y_i given the examples x_i
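
  Written out (a reconstruction consistent with the p(y|x) notation on the earlier slide; σ is the logistic function and φ(x) the feature vector):

      \hat{w} = \arg\max_{w} \sum_{i} \log P(y_i \mid x_i; w),
      \qquad
      P(y = 1 \mid x; w) = \sigma\big(w \cdot \phi(x)\big) = \frac{1}{1 + e^{-w \cdot \phi(x)}}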

  6. Stochastic gradient ascent (or descent) • Online training algorithm for logistic regression and other probabilistic models • Update the weights for every training example • Move in the direction given by the gradient • Size of the update step is scaled by the learning rate
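
  For a single example (x, y), the resulting per-example update is (restating the bullets above in symbols, with α the learning rate):

      w \leftarrow w + \alpha \, \frac{\partial P(y \mid x; w)}{\partial w}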

  7. Gradient of the logistic function
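
  The figure on this slide is not reproduced here; the identity it illustrates (a standard fact about the logistic function σ) is:

      \frac{d}{dz}\,\sigma(z)
      = \frac{d}{dz}\,\frac{1}{1 + e^{-z}}
      = \frac{e^{-z}}{(1 + e^{-z})^{2}}
      = \sigma(z)\,\big(1 - \sigma(z)\big)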

  8. Example: Person/not-person classification problem • Given an introductory sentence in Wikipedia, predict whether the article is about a person

  9. Example: initial update

  10. Example: second update

  11. How to set the learning rate? • Various strategies • Decay over time: α = 1 / (C + t), where C is a parameter and t is the number of samples seen so far • Use a held-out test set, and increase the learning rate when the likelihood increases
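
  A one-line Python version of the decay schedule above (names are mine):

      def learning_rate(t, C=1.0):
          # alpha = 1 / (C + t): C is a tunable parameter, t counts samples seen so far.
          return 1.0 / (C + t)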

  12. Multiclass version

  13. Some models are better than others… • Consider these 2 examples • Which of the 2 models below is better? • Classifier 2 will probably generalize better: it does not include irrelevant information => the smaller model is better

  14. Regularization • A penalty on adding extra weights • L2 regularization (penalty ∝ ‖w‖₂²): big penalty on large weights, small penalty on small weights • L1 regularization (penalty ∝ ‖w‖₁): uniform penalty whether a weight is large or small; will cause many weights to become exactly zero
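
  A sketch of how the two penalties enter a per-example gradient step (variable names are my own; lam is the regularization strength, and grad is the likelihood gradient from the earlier slides):

      import numpy as np

      def regularized_update(w, grad, alpha=0.1, lam=0.01, penalty="l2"):
          # Gradient ascent on the likelihood minus the gradient of the penalty term.
          if penalty == "l2":
              # d/dw of lam * ||w||_2^2 is 2 * lam * w: shrinks large weights strongly,
              # barely touches small ones.
              return w + alpha * (grad - 2.0 * lam * w)
          else:  # "l1"
              # d/dw of lam * ||w||_1 is lam * sign(w): a uniform pull toward zero,
              # which drives many weights exactly to zero.
              return w + alpha * (grad - lam * np.sign(w))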

  15. L1 regularization in online learning

  16. What you should know • Standard supervised learning set-up for text classification • Difference between train vs. test data • How to evaluate • 3 examples of supervised linear classifiers • Naïve Bayes, Perceptron, Logistic Regression • Learning as optimization: what is the objective function optimized? • Difference between generative vs. discriminative classifiers • Smoothing, regularization • Overfitting, underfitting

  17. Neural networks

  18. Person/not-person classification problem • Given an introductory sentence in Wikipedia, predict whether the article is about a person

  19. Formalizing binary prediction

  20. The Perceptron: a “machine” to calculate a weighted sum • Prediction: y = sign(Σ_i w_i · φ_i(x)) • (figure: word-count features for an example sentence — φ(“A”) = 1, φ(“site”) = 1, φ(“located”) = 1, φ(“Maizuru”) = 1, φ(“,”) = 2, φ(“in”) = 1, φ(“Kyoto”) = 1, φ(“priest”) = 0, φ(“black”) = 0 — each multiplied by its weight and summed)
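
  A minimal sketch of that weighted-sum “machine” (dictionary feature vectors and all names are my own; the weights below are made up for illustration):

      def predict_one(w, phi):
          # Weighted sum of the features, then take the sign.
          score = sum(w.get(name, 0.0) * value for name, value in phi.items())
          return 1 if score >= 0 else -1

      # Word-count features for the example sentence in the figure.
      phi = {"A": 1, "site": 1, "located": 1, "Maizuru": 1, ",": 2,
             "in": 1, "Kyoto": 1, "priest": 0, "black": 0}
      w = {"site": -3, "Kyoto": 2}     # hypothetical weights
      print(predict_one(w, phi))       # -> -1  (score = -3 + 2 = -1)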

  21. The Perceptron: Geometric interpretation (figure: examples plotted as O and X points with a linear decision boundary between them)

  22. The Perceptron: Geometric interpretation (figure: continued)

  23. Limitation of perceptron ● Can only find linear separations between positive and negative examples (figure: an XOR-like arrangement of X and O points that no single line can separate)

  24. Neural Networks ● Connect together multiple perceptrons (figure: the word features from the earlier example feeding into a small multi-layer network) ● Motivation: can represent non-linear functions!

  25. Neural Networks: key terms • Input (aka features) • Output • Nodes • Layers • Hidden layers • Activation function (non-linear) • Multi-layer perceptron (figure: the same feature-to-output network with these parts labeled)

  26. Example ● Create two classifiers over the points φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}: φ1[0] = sign(w0,0 · φ0 + b0,0) with w0,0 = (1, 1), b0,0 = -1, and φ1[1] = sign(w0,1 · φ0 + b0,1) with w0,1 = (-1, -1), b0,1 = -1 (figure: the two units and the X/O points in the φ0 plane)

  27. Example ● These classifiers map the examples to a new space: φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1} (figure: the original φ0 plane on the left, the new φ1 plane on the right)

  28. Example ● In the new space, the examples are linearly separable! A single output unit φ2[0] = y = sign(1 · φ1[0] + 1 · φ1[1] + 1) separates them (figure: the separating line in the φ1 plane)

  29. Example wrap-up: Forward propagation ● The final net replaces sign with tanh: φ1[0] = tanh(1 · φ0[0] + 1 · φ0[1] - 1), φ1[1] = tanh(-1 · φ0[0] - 1 · φ0[1] - 1), and φ2[0] = tanh(1 · φ1[0] + 1 · φ1[1] + 1) (figure: the network with these weights)
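
  A runnable sketch of that forward pass (NumPy and the variable names are mine; the weights are the ones read off the slide):

      import numpy as np

      # Hidden layer: one weight row and one bias per hidden unit.
      W1 = np.array([[ 1.0,  1.0],
                     [-1.0, -1.0]])
      b1 = np.array([-1.0, -1.0])
      # Output layer.
      w2 = np.array([1.0, 1.0])
      b2 = 1.0

      def forward(phi0):
          phi1 = np.tanh(W1 @ phi0 + b1)   # hidden representation
          phi2 = np.tanh(w2 @ phi1 + b2)   # final score
          return phi2

      # The four points from the example: x2 and x3 come out positive,
      # x1 and x4 negative, matching the XOR-style labeling.
      for x in ([-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]):
          print(x, forward(np.array(x)))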

  30. Softmax function for multiclass classification ● Sigmoid function generalized to multiple classes: P(y | x) = exp(w · φ(x, y)) / Σ_{y'} exp(w · φ(x, y')) — the numerator scores the current class, the denominator sums over all classes ● Can be expressed using matrix/vector ops: s = exp(W · φ(x)), p = s / Σ_{s̃ ∈ s} s̃
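
  A sketch of the matrix/vector form (names are mine; subtracting the maximum score is a standard numerical-stability trick, not something from the slide):

      import numpy as np

      def softmax_probs(W, phi):
          # s = exp(W · phi(x)); p = s / sum(s)
          scores = W @ phi
          s = np.exp(scores - scores.max())
          return s / s.sum()

      W = np.array([[ 1.0, -0.5],
                    [ 0.2,  0.3],
                    [-1.0,  0.8]])   # one weight row per class (hypothetical)
      phi = np.array([1.0, 2.0])
      print(softmax_probs(W, phi))   # probabilities over 3 classes, sums to 1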

  31. Stochastic Gradient Descent ● Online training algorithm for probabilistic models:

      w = 0
      for I iterations:
          for each labeled pair (x, y) in the data:
              w += α * dP(y|x)/dw

  In other words: • For every training example, calculate the gradient (the direction that will increase the probability of y) • Move in that direction, multiplied by the learning rate α
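
  A runnable version of that loop for binary logistic regression with y ∈ {-1, +1} (NumPy and all names are mine; the gradient used is the one derived on the next slide, which equals y · φ(x) · σ(w·φ(x)) · (1 - σ(w·φ(x)))):

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def train_sgd(data, n_features, iterations=100, alpha=0.1):
          # data: list of (phi, y) pairs, phi a NumPy vector and y in {-1, +1}.
          w = np.zeros(n_features)
          for _ in range(iterations):
              for phi, y in data:
                  z = w @ phi
                  # dP(y|x)/dw: +phi * sigma'(z) if y = +1, -phi * sigma'(z) if y = -1.
                  grad = y * phi * sigmoid(z) * (1.0 - sigmoid(z))
                  w += alpha * grad
          return w

      # Toy dataset: label +1 iff the first feature is positive.
      data = [(np.array([ 1.0, 1.0]), +1), (np.array([-1.0, 1.0]), -1),
              (np.array([ 2.0, 1.0]), +1), (np.array([-2.0, 1.0]), -1)]
      w = train_sgd(data, n_features=2)
      print(w, sigmoid(w @ np.array([1.5, 1.0])))   # high probability for a positive example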

  32. Gradient of the Sigmoid Function ● Take the derivative of the probability:

      d/dw P(y = 1 | x) = d/dw [ e^(w·φ(x)) / (1 + e^(w·φ(x))) ] = φ(x) · e^(w·φ(x)) / (1 + e^(w·φ(x)))²
      d/dw P(y = -1 | x) = d/dw [ 1 - e^(w·φ(x)) / (1 + e^(w·φ(x))) ] = -φ(x) · e^(w·φ(x)) / (1 + e^(w·φ(x)))²

  33. Learning: We Don't Know the Derivative for Hidden Units! ● For NNs, we only know the correct tag for the last layer (figure: features φ(x) feed hidden units with weight vectors w1, w2, w3; their outputs h(x) feed the output unit with weights w4, and the gold label here is y = 1) ● For the output weights the gradient has the familiar form, dP(y = 1 | x)/dw4 = h(x) · e^(w4·h(x)) / (1 + e^(w4·h(x)))², but dP(y = 1 | x)/dw1 = ?, dP(y = 1 | x)/dw2 = ?, dP(y = 1 | x)/dw3 = ?

  34. Answer: Back-Propagation ● Calculate the derivative with the chain rule, e.g. for the weights w1 of a hidden unit: dP(y = 1 | x)/dw1 = [dP(y = 1 | x)/dh1(x)] · [dh1(x)/dw1], where the first factor is the error of the next unit (δ) times the connecting weight ● In general, calculate the gradient for unit i from the next units j it feeds into: dP(y = 1 | x)/dh_i(x) = Σ_j δ_j · w_{i,j}

  35. Backpropagation = Gradient descent + Chain rule
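
  A compact sketch of the idea for a one-hidden-layer network like the earlier example (tanh hidden layer, logistic output). It does gradient ascent on log P(y|x) rather than on P(y|x) itself, and every function and variable name is my own:

      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def forward_backward(phi, y, W1, b1, w2, b2):
          # Forward pass.
          h = np.tanh(W1 @ phi + b1)        # hidden activations
          p = sigmoid(w2 @ h + b2)          # P(y = 1 | x)
          # Backward pass: chain rule, pushing the output error back through the net.
          delta_out = y - p                 # d log P(y|x) / d(output pre-activation), y in {0, 1}
          grad_w2 = delta_out * h
          grad_b2 = delta_out
          delta_hidden = delta_out * w2 * (1.0 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
          grad_W1 = np.outer(delta_hidden, phi)
          grad_b1 = delta_hidden
          return p, (grad_W1, grad_b1, grad_w2, grad_b2)

      # One gradient-ascent step on a toy example.
      rng = np.random.default_rng(0)
      W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
      w2, b2 = rng.normal(size=2), 0.0
      phi, y, alpha = np.array([1.0, -1.0]), 1, 0.1
      p, (gW1, gb1, gw2, gb2) = forward_backward(phi, y, W1, b1, w2, b2)
      W1 += alpha * gW1; b1 += alpha * gb1; w2 += alpha * gw2; b2 += alpha * gb2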

  36. Feed-Forward Neural Nets ● All connections point forward (figure: features φ(x) flow through the network to the output y) ● It is a directed acyclic graph (DAG)

  37. Neural Networks • Non-linear classification • Prediction: forward propagation • Vector/matrix operations + non-linearities • Training: backpropagation + stochastic gradient descent For more details, see CIML Chap 7
