SLIDE 1

Logistic Regression & Neural Networks

CMSC 723 / LING 723 / INST 725 Marine Carpuat

Slides credit: Graham Neubig, Jacob Eisenstein

SLIDE 2

Logistic Regression

SLIDE 3

Perceptron & Probabilities

  • What if we want a probability p(y|x)?
  • The perceptron only gives us a hard prediction y, not a probability
  • Let’s illustrate this with binary classification

Illustrations: Graham Neubig

SLIDE 4

The logistic function

  • A “softer” function than the perceptron’s step function
  • Can account for uncertainty
  • Differentiable (see the formula below)
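The function itself appears only as an image on the slide; in the notation used elsewhere in this deck, it is presumably the sigmoid applied to the perceptron's score:

$P(y = 1 \mid x) = \frac{e^{\mathbf{w}\cdot\phi(x)}}{1 + e^{\mathbf{w}\cdot\phi(x)}} = \frac{1}{1 + e^{-\mathbf{w}\cdot\phi(x)}}$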
SLIDE 5

Logistic regression: how to train?

  • Train based on conditional likelihood
  • Find parameters w that maximize the conditional likelihood of all labels $y_i$ given examples $x_i$:

$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \prod_i P(y_i \mid x_i; \mathbf{w})$

SLIDE 6

Stochastic gradient ascent (or descent)

  • Online training algorithm for logistic regression (and other probabilistic models)
  • Update weights for every training example
  • Move in the direction given by the gradient
  • Size of the update step is scaled by the learning rate (a minimal sketch follows below)
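The sketch below is a minimal NumPy implementation of this update loop (not from the slides); the feature vectors, learning rate, and iteration count are illustrative assumptions. It uses the gradient of P(y|x) derived on the following slides, for labels y in {+1, -1}.

    import numpy as np

    def sigmoid(z):
        # logistic function: e^z / (1 + e^z) = 1 / (1 + e^-z)
        return 1.0 / (1.0 + np.exp(-z))

    def train_logreg_sgd(data, num_features, iterations=10, alpha=0.1):
        """Stochastic gradient ascent on P(y|x) with labels y in {+1, -1}."""
        w = np.zeros(num_features)
        for _ in range(iterations):
            for phi_x, y in data:           # phi_x: feature vector, y: label
                p = sigmoid(w @ phi_x)      # P(y = +1 | x)
                # dP(y|x)/dw = y * phi(x) * p * (1 - p)
                w += alpha * y * phi_x * p * (1.0 - p)
        return w

    # toy usage with made-up two-dimensional feature vectors
    data = [(np.array([1.0, 0.0]), +1), (np.array([0.0, 1.0]), -1)]
    w = train_logreg_sgd(data, num_features=2)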
SLIDE 7

Gradient of the logistic function
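The formula on this slide is an image; it presumably matches the derivative worked out on slide 32 later in the deck:

$\frac{d}{d\mathbf{w}}\,\frac{e^{\mathbf{w}\cdot\phi(x)}}{1 + e^{\mathbf{w}\cdot\phi(x)}} = \phi(x)\,\frac{e^{\mathbf{w}\cdot\phi(x)}}{\left(1 + e^{\mathbf{w}\cdot\phi(x)}\right)^{2}}$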

SLIDE 8

Example: Person/not-person classification problem

Given an introductory sentence in Wikipedia, predict whether the article is about a person.

SLIDE 9

Example: initial update

SLIDE 10

Example: second update

SLIDE 11

How to set the learning rate?

  • Various strategies:
  • Decay over time: $\alpha = \frac{1}{C + t}$, where $C$ is a parameter and $t$ is the number of samples seen so far
  • Use a held-out set, and decrease the learning rate when held-out likelihood stops improving
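For example, with C = 10 (an arbitrary illustrative value) the rate is α = 1/(10 + 1) ≈ 0.09 after the first sample and α = 1/(10 + 100) ≈ 0.009 after 100 samples, so later examples move the weights less than early ones.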

SLIDE 12

Multiclass version

SLIDE 13

Some models are better than others…

  • Consider these 2 examples
  • Which of the 2 models below is better?

Classifier 2 will probably generalize better! It does not include irrelevant information => the smaller model is better.

SLIDE 14

Regularization

  • A penalty on adding extra weights
  • L2 regularization:
    • big penalty on large weights
    • small penalty on small weights
  • L1 regularization:
    • uniform increase whether a weight is large or small
    • will drive many weights to exactly zero

L2 penalty: $\sum_j w_j^2$ &nbsp;&nbsp;&nbsp; L1 penalty: $\sum_j |w_j|$

SLIDE 15

L1 regularization in online learning
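The slide's content is a figure. As a rough sketch of one common way to apply an L1 penalty during online updates (a clipping step after each SGD update; this is a standard trick, not necessarily the exact scheme on the slide):

    import numpy as np

    def l1_clip(w, penalty):
        """Move every weight toward zero by `penalty`, clipping at zero so the
        regularization step never flips a weight's sign."""
        return np.sign(w) * np.maximum(np.abs(w) - penalty, 0.0)

    # after each stochastic gradient step:
    #   w = l1_clip(w + alpha * gradient, alpha * lam)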

SLIDE 16

What you should know

  • Standard supervised learning set-up for text classification
  • Difference between train vs. test data
  • How to evaluate
  • 3 examples of supervised linear classifiers
  • Naïve Bayes, Perceptron, Logistic Regression
  • Learning as optimization: what is the objective function optimized?
  • Difference between generative vs. discriminative classifiers
  • Smoothing, regularization
  • Overfitting, underfitting
SLIDE 17

Neural networks

SLIDE 18

Person/not-person classification problem

Given an introductory sentence in Wikipedia, predict whether the article is about a person.

SLIDE 19

Formalizing binary prediction

SLIDE 20

The Perceptron:

a “machine” to calculate a weighted sum

$y = \operatorname{sign}\left( \sum_{i=1}^{I} w_i \cdot \phi_i(x) \right)$

Example feature values for the Wikipedia sentence:
φ("A") = 1, φ("site") = 1, φ(",") = 2, φ("located") = 1, φ("in") = 1, φ("Maizuru") = 1, φ("Kyoto") = 1, φ("priest") = 0, φ("black") = 0

[The slide's figure also shows the corresponding weights, which do not survive extraction.]
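A minimal sketch of this weighted-sum "machine" in Python (the bag-of-words feature map and the weight dictionary are illustrative assumptions, not the slide's exact values):

    def extract_features(sentence):
        """Bag-of-words feature map phi(x): word -> count."""
        phi = {}
        for word in sentence.split():
            phi[word] = phi.get(word, 0) + 1
        return phi

    def predict_perceptron(w, phi):
        """y = sign(sum_i w_i * phi_i(x)): +1 if the weighted sum is positive, else -1."""
        score = sum(w.get(feat, 0.0) * value for feat, value in phi.items())
        return 1 if score > 0 else -1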
SLIDE 21

The Perceptron: Geometric interpretation


SLIDE 22

The Perceptron: Geometric interpretation


SLIDE 23

Limitation of perceptron

  • Can only find linear separations between positive and negative examples

[Figure: an XOR-like arrangement of X and O points that no single line can separate]

SLIDE 24

Neural Networks

  • Connect together multiple perceptrons

  • Motivation: Can represent non-linear functions!
SLIDE 25

Neural Networks: key terms

  • Input (aka features)
  • Output
  • Nodes
  • Layers
  • Hidden layers
  • Activation function (non-linear)

  • Multi-layer perceptron
SLIDE 26

Example

  • Create two classifiers

The four points in the original space (an XOR-like arrangement of X and O):
φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}

[Figure: two sign units side by side; each takes φ0[0], φ0[1], and a bias input of 1. The first has weights w0,0 and bias b0,0 and outputs φ1[0]; the second has weights w0,1 and bias b0,1 and outputs φ1[1].]
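One concrete choice of first-layer weights consistent with the mapping shown on the next slide (the exact values in the figure do not survive extraction, so treat these as an illustrative assumption):

$\phi_1[0] = \operatorname{sign}\!\left(\phi_0[0] + \phi_0[1] - 1\right), \qquad \phi_1[1] = \operatorname{sign}\!\left(-\phi_0[0] - \phi_0[1] - 1\right)$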

SLIDE 27

Example

  • These classifiers map to a new space

In the original space:
φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}

The two classifiers map each point into a new space φ1:
φ1(x1) = {-1, -1} (X), φ1(x2) = {1, -1} (O), φ1(x3) = {-1, 1} (O), φ1(x4) = {-1, -1} (X)

SLIDE 28

Example

  • In new space, the examples are linearly separable!

In the new space:
φ1(x1) = {-1, -1} (X), φ1(x2) = {1, -1} (O), φ1(x3) = {-1, 1} (O), φ1(x4) = {-1, -1} (X)

A single linear unit over φ1[0], φ1[1], and a bias of 1 now separates the classes; its output is the prediction φ2[0] = y.
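With those hidden values, one final unit that separates the classes (again an assumed set of weights) is $\phi_2[0] = \operatorname{sign}\!\left(\phi_1[0] + \phi_1[1] + 1\right)$: it outputs +1 for x2 and x3 and -1 for x1 and x4.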

SLIDE 29

Example wrap-up: Forward propagation

  • The final net

The final net: two tanh hidden units each take φ0[0], φ0[1], and a bias of 1 and produce φ1[0] and φ1[1]; a third tanh unit takes φ1[0], φ1[1], and a bias of 1 and produces the output φ2[0].
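A minimal NumPy sketch of forward propagation through this net. The weight and bias values are assumptions chosen so the net reproduces the mapping in the example; the exact numbers in the figure do not survive extraction.

    import numpy as np

    W1 = np.array([[ 1.0,  1.0],     # hidden unit computing phi_1[0]
                   [-1.0, -1.0]])    # hidden unit computing phi_1[1]
    b1 = np.array([-1.0, -1.0])
    w2 = np.array([1.0, 1.0])        # output unit computing phi_2[0] = y
    b2 = 1.0

    def forward(phi0):
        """Two tanh hidden units followed by a tanh output unit."""
        phi1 = np.tanh(W1 @ phi0 + b1)
        return np.tanh(w2 @ phi1 + b2)

    for phi0 in ([-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]):
        print(phi0, np.sign(forward(np.array(phi0))))
    # x2 = (1, 1) and x3 = (-1, -1) come out positive, x1 and x4 negative:
    # the XOR-like data is classified correctly, which no single linear unit could do.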

SLIDE 30

Softmax Function for multiclass classification

  • Sigmoid function for multiple classes
  • Can be expressed using matrix/vector ops

$P(y \mid x) = \frac{e^{\,\mathbf{w} \cdot \phi(x, y)}}{\sum_{\tilde{y}} e^{\,\mathbf{w} \cdot \phi(x, \tilde{y})}}$

The numerator scores the current class; the denominator sums over all classes.

In matrix/vector form: $\mathbf{s} = \exp\left(\mathbf{W} \cdot \phi(x)\right), \qquad \mathbf{q} = \frac{\mathbf{s}}{\sum_{\tilde{s} \in \mathbf{s}} \tilde{s}}$
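A small NumPy sketch of the matrix/vector form (arranging the weights as one row per class is an assumption about how the joint features are laid out):

    import numpy as np

    def softmax_probs(W, phi_x):
        """P(y | x) for every class: s = exp(W . phi(x)), then normalize."""
        z = W @ phi_x
        s = np.exp(z - z.max())   # subtract the max score for numerical stability
        return s / s.sum()        # divide by the sum over all classes

    # toy usage: 3 classes, 2 features
    W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    print(softmax_probs(W, np.array([2.0, 1.0])))   # probabilities sum to 1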

SLIDE 31

Stochastic Gradient Descent

Online training algorithm for probabilistic models:

    w = 0
    for I iterations
        for each labeled pair x, y in the data
            w += α * dP(y|x)/dw

In other words:

  • For every training example, calculate the gradient

(the direction that will increase the probability of y)

  • Move in that direction, multiplied by learning rate α
SLIDE 32

Gradient of the Sigmoid Function

Take the derivative of the probability

$\frac{d}{d\mathbf{w}}\,P(y = 1 \mid x) = \frac{d}{d\mathbf{w}}\,\frac{e^{\mathbf{w}\cdot\phi(x)}}{1 + e^{\mathbf{w}\cdot\phi(x)}} = \phi(x)\,\frac{e^{\mathbf{w}\cdot\phi(x)}}{\left(1 + e^{\mathbf{w}\cdot\phi(x)}\right)^{2}}$

$\frac{d}{d\mathbf{w}}\,P(y = -1 \mid x) = \frac{d}{d\mathbf{w}}\left(1 - \frac{e^{\mathbf{w}\cdot\phi(x)}}{1 + e^{\mathbf{w}\cdot\phi(x)}}\right) = -\phi(x)\,\frac{e^{\mathbf{w}\cdot\phi(x)}}{\left(1 + e^{\mathbf{w}\cdot\phi(x)}\right)^{2}}$
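A quick finite-difference check of the first derivative above (the weight and feature values are made up for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w   = np.array([0.5, -1.0])
    phi = np.array([2.0,  1.0])
    p = sigmoid(w @ phi)
    analytic = phi * p * (1.0 - p)   # phi(x) * e^{w.phi} / (1 + e^{w.phi})^2

    eps = 1e-6
    numeric = np.array([(sigmoid((w + eps * np.eye(2)[i]) @ phi) - p) / eps
                        for i in range(2)])
    print(analytic, numeric)         # the two vectors should agree closely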

SLIDE 33

Learning: We Don't Know the Derivative for Hidden Units!

For NNs, only know correct tag for last layer

For the output unit's weights $\mathbf{w}_5$ the derivative has the same form as for logistic regression, with the hidden-unit outputs $\mathbf{h}(x)$ playing the role of the features (and the correct output, here y = 1, known):

$\frac{\partial P(y=1 \mid x)}{\partial \mathbf{w}_5} = \mathbf{h}(x)\,\frac{e^{\mathbf{w}_5 \cdot \mathbf{h}(x)}}{\left(1 + e^{\mathbf{w}_5 \cdot \mathbf{h}(x)}\right)^{2}}$

But for the hidden units' weights $\mathbf{w}_1, \dots, \mathbf{w}_4$ the derivative is not immediately obvious:

$\frac{\partial P(y=1 \mid x)}{\partial \mathbf{w}_2} = \;?\qquad \frac{\partial P(y=1 \mid x)}{\partial \mathbf{w}_3} = \;?\qquad \frac{\partial P(y=1 \mid x)}{\partial \mathbf{w}_4} = \;?$

SLIDE 34

Answer: Back-Propagation

Calculate derivative with chain rule

$\frac{\partial P(y=1 \mid x)}{\partial \mathbf{w}_2} \;=\; \frac{\partial P(y=1 \mid x)}{\partial h_2(x)} \cdot \frac{\partial h_2(x)}{\partial \mathbf{w}_2} \;=\; \underbrace{\delta_4}_{\text{error of the next unit}} \cdot \underbrace{w_{2,4}}_{\text{weight}} \cdot \underbrace{\frac{\partial h_2(x)}{\partial \mathbf{w}_2}}_{\text{gradient of this unit}}$

In general, calculate the error term $\delta_i$ of unit $i$ from the next units $j$ it feeds into, using their errors $\delta_j$, the connecting weights $w_{i,j}$, and the gradient of unit $i$ itself:

$\delta_i \;=\; \frac{\partial h_i(x)}{\partial (\mathbf{w}_i \cdot \phi_i(x))} \,\sum_j \delta_j \, w_{i,j}$
SLIDE 35

Backpropagation = Gradient descent + Chain rule
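A minimal NumPy sketch of backpropagation with stochastic gradient descent for the two-layer tanh network from the earlier example, trained on the XOR-like points. The squared-error loss, learning rate, initialization, and iteration count are assumptions for illustration (the slides use the conditional likelihood instead), and a tiny network like this can occasionally get stuck in a poor local optimum depending on the random seed.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(scale=0.5, size=(2, 2)), np.zeros(2)   # hidden layer
    w2, b2 = rng.normal(scale=0.5, size=2), 0.0                # output unit

    data = [(np.array([-1.0,  1.0]), -1.0), (np.array([1.0,  1.0]),  1.0),
            (np.array([-1.0, -1.0]),  1.0), (np.array([1.0, -1.0]), -1.0)]

    alpha = 0.1
    for _ in range(1000):
        for phi0, y in data:
            # forward propagation
            h = np.tanh(W1 @ phi0 + b1)
            out = np.tanh(w2 @ h + b2)
            # backward pass (chain rule), for the squared error 0.5 * (out - y)^2
            d_out = (out - y) * (1.0 - out ** 2)   # error (delta) of the output unit
            d_h = (w2 * d_out) * (1.0 - h ** 2)    # errors (deltas) of the hidden units
            # gradient descent updates
            w2 -= alpha * d_out * h
            b2 -= alpha * d_out
            W1 -= alpha * np.outer(d_h, phi0)
            b1 -= alpha * d_h

    for phi0, y in data:
        print(phi0, y, np.tanh(w2 @ np.tanh(W1 @ phi0 + b1) + b2))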

SLIDE 36

Feed Forward Neural Nets

All connections point forward, from the input features φ(x) to the output y.

It is a directed acyclic graph (DAG).

SLIDE 37

Neural Networks

  • Non-linear classification
  • Prediction: forward propagation
  • Vector/matrix operations + non-linearities
  • Training: backpropagation + stochastic gradient descent

For more details, see CIML Chap 7