

SLIDE 1

From Logistic Regression to Neural Networks

CMSC 470 Marine Carpuat

SLIDE 2

Logistic Regression: What you should know

  • How to make a prediction with a logistic regression classifier
  • How to train a logistic regression classifier
  • Machine learning concepts: loss function, gradient descent algorithm
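For reference, a minimal sketch of prediction in the binary case (the function names and the 0.5 threshold convention are standard, but this code is illustrative, not the course's reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary logistic regression prediction: P(y = 1 | x) = sigmoid(w . x + b);
# predict class 1 when that probability is at least 0.5.
def predict(w, b, x):
    return int(sigmoid(w @ x + b) >= 0.5)
```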

SLIDE 3

SLIDE 4

SGD hyperparameter: the learning rate

  • The hyperparameter η that controls the size of the step down the gradient is called the learning rate
  • If η is too large, training might not converge; if η is too small, training might be very slow
  • How to set the learning rate? Common strategies:
  • decay over time: η = 1 / (C + n), where C is a constant hyperparameter set by the user and n is the number of samples seen so far
  • use a held-out set: increase the learning rate when held-out likelihood increases
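A minimal sketch of this decay schedule inside SGD for binary logistic regression (the dataset shape, epoch count, and function names are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_train(X, y, C=1.0, epochs=5):
    """Binary logistic regression trained by SGD with learning rate
    decay eta = 1 / (C + n), where n counts the samples seen so far."""
    w, b, n = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            n += 1
            eta = 1.0 / (C + n)                 # decay over time
            error = sigmoid(w @ x_i + b) - y_i  # gradient of the log loss
            w -= eta * error * x_i              # step down the gradient
            b -= eta * error
    return w, b
```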

SLIDE 5

Multiclass Logistic Regression

SLIDE 6

Formalizing classification

Task definition

  • Given inputs:
  • an example x
  • often x is a D-dimensional vector of binary or real values
  • a fixed set of classes Y = {y1, y2, …, yJ}
  • e.g. word senses from WordNet
  • Output: a predicted class y ∈ Y

Classifier definition: a function g: x → g(x) = y

Many different types of functions/classifiers can be defined

  • We’ll talk about perceptron, logistic regression, neural networks.

So far we’ve only worked with binary classification problems, i.e. J = 2

SLIDE 7

A multiclass logistic regression classifier

aka multinomial logistic regression, softmax logistic regression, maximum entropy (or maxent) classifier

Goal: predict probability P(y = c | x), where c is one of k classes in set C

SLIDE 8

The softmax function

  • A generalization of the sigmoid
  • Input: a vector z of dimensionality k
  • Output: a vector of dimensionality k, with softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

Looks like a probability distribution!

SLIDE 9

The softmax function Example

All values are in [0,1] and sum up to 1: they can be interpreted as probabilities!
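A quick numeric check of these properties in code (the input vector is an arbitrary example, not necessarily the one on the slide):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by max(z) before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.6, 1.1, -1.5, 1.2, 3.2, -1.1])
p = softmax(z)
print(p)         # every entry is in [0, 1]
print(p.sum())   # sums to 1: a probability distribution
```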

SLIDE 10

A multiclass logistic regression classifier

aka multinomial logistic regression, softmax logistic regression, maximum entropy (or maxent) classifier

Goal: predict probability P(y = c | x), where c is one of k classes in set C

Model definition:

P(y = c | x) = exp(w_c · x + b_c) / Σ_{c' in C} exp(w_{c'} · x + b_{c'})

We now have one weight vector and one bias PER CLASS
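A sketch of prediction under this model, assuming W stacks one weight vector per class as rows and b holds one bias per class (the shapes and function names are my choices):

```python
import numpy as np

def predict_proba(W, b, x):
    """Multiclass logistic regression: P(y = c | x) = softmax(W x + b)[c].
    W has shape (k, D): one weight vector per class; b has shape (k,)."""
    z = W @ x + b
    e = np.exp(z - np.max(z))   # stable softmax
    return e / e.sum()

def predict(W, b, x):
    return int(np.argmax(predict_proba(W, b, x)))  # most probable class
```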
SLIDE 11

Features in multiclass logistic regression

  • Features are a function of the input example and of a candidate output class c
  • f_i(x, c) represents feature i for a particular class c for a given example x

SLIDE 12

Example: sentiment analysis with 3 classes {positive (+), negative (-), neutral (0)}

  • Starting from the features for binary classification
  • We create one copy of each feature per class
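A sketch of this feature-copying scheme (the concrete features and the naming convention are illustrative assumptions):

```python
# Binary-classification features extracted from one example x.
binary_features = {"contains(great)": 1, "contains(boring)": 0}

classes = ["+", "-", "0"]  # positive, negative, neutral

def multiclass_features(binary_features, candidate_class):
    """One copy of each feature per class: the copy for class c is
    non-zero only when scoring c as the candidate output."""
    return {f"{name}&class={candidate_class}": value
            for name, value in binary_features.items()}

print(multiclass_features(binary_features, "+"))
# {'contains(great)&class=+': 1, 'contains(boring)&class=+': 0}
```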
SLIDE 13

Learning in Multiclass Logistic Regression

  • Loss function for a single example: the negative log likelihood of the true class

L = − Σ_c 1{y = c} log P(y = c | x), summing over all k classes

1{ } is an indicator function that evaluates to 1 if the condition in the brackets is true, and to 0 otherwise
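In code, the indicator sum reduces to picking out the log probability of the true class; a minimal sketch (class labels assumed to be integers 0..k-1):

```python
import numpy as np

def cross_entropy_loss(W, b, x, y):
    """Loss for a single example: -sum_c 1{y = c} log P(y = c | x),
    which is just -log P(y | x) for the true class y."""
    z = W @ x + b
    z = z - np.max(z)                          # for numerical stability
    log_probs = z - np.log(np.sum(np.exp(z)))  # log softmax
    return -log_probs[y]
```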

SLIDE 14

Learning in Multiclass Logistic Regression

  • Loss function for a single example
SLIDE 15

Learning in Multiclass Logistic Regression

SLIDE 16

Logistic Regression: What you should know

  • How to make a prediction with a logistic regression classifier
  • How to train a logistic regression classifier, for both binary and multiclass problems
  • Machine learning concepts: loss function, gradient descent algorithm, learning rate

SLIDE 17

Neural Networks

SLIDE 18

From logistic regression to a neural network unit

SLIDE 19

Limitation of perceptron

  • can only find linear separations between positive and negative examples

[figure: four points in an X O / O X arrangement that no single line can separate]

SLIDE 20

Example: binary classification with a neural network

  • Create two classifiers

[figure: the four points φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1} in an X O / O X layout, and a network diagram in which inputs φ0[0], φ0[1] and a bias of 1 feed two sign units: φ1[0] = sign(w0,0 · φ0 + b0,0) and φ1[1] = sign(w0,1 · φ0 + b0,1)]

SLIDE 21

Example: binary classification with a neural network

  • These classifiers map the examples to a new space

[figure: the two units map φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1} to φ1(x1) = {-1, -1}, φ1(x2) = {1, -1}, φ1(x3) = {-1, 1}, φ1(x4) = {-1, -1}; in the (φ1[0], φ1[1]) space the O examples are linearly separable from the X examples]

SLIDE 22

Example: binary classification with a neural network

[figure: a final unit φ2[0] = y, with inputs φ1[0], φ1[1] and a bias of 1, is stacked on the hidden layer; since the transformed examples are linearly separable, this unit classifies all four points correctly]

SLIDE 23

Example: the final network can correctly classify the examples that the perceptron could not.

[figure: the same network drawn with tanh units: inputs φ0[0] and φ0[1] plus a bias feed two tanh units producing φ1[0] and φ1[1], which together with a bias feed a final tanh unit producing φ2[0]]

Replace “sign” with smoother non-linear function (e.g. tanh, sigmoid)
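The whole example in code, with weights chosen to reproduce the mapping on the slides (the exact values are my reconstruction; any weights realizing the same signs would work):

```python
import numpy as np

def sign(z):
    return np.where(z >= 0, 1, -1)

# Hidden layer: the two classifiers from slide 20.
W1 = np.array([[ 1.0,  1.0],    # phi1[0] = 1 only for {1, 1}
               [-1.0, -1.0]])   # phi1[1] = 1 only for {-1, -1}
b1 = np.array([-1.0, -1.0])

# Output layer: separates the transformed space from slide 21.
w2 = np.array([1.0, 1.0])
b2 = 1.0

def network(phi0):
    phi1 = sign(W1 @ phi0 + b1)   # map the example to the new space
    return sign(w2 @ phi1 + b2)   # phi2[0] = y

for x, label in [([-1, 1], "X"), ([1, 1], "O"),
                 ([-1, -1], "O"), ([1, -1], "X")]:
    print(x, label, network(np.array(x)))  # O -> 1, X -> -1
```

Swapping sign for np.tanh, as on this slide, gives a smoother version of the same network.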

SLIDE 24

Feedforward Neural Networks

Components:

  • an input layer
  • an output layer
  • one or more hidden layers

In a fully connected network, each hidden unit takes as input all the units in the previous layer. No loops!

[figure: a 2-layer feedforward neural network]
SLIDE 25

Designing Neural Networks: Activation functions

  • The hidden layer can be viewed as a set of hidden features
  • The output of the hidden layer indicates the extent to which each hidden feature is “activated” by a given input
  • The activation function is a non-linear function that determines the range of the hidden feature values
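For concreteness, the usual candidates and the ranges they impose (this list is standard background, not taken from the slide text):

```python
import numpy as np

def sigmoid(z):   # squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):      # squashes values into (-1, 1)
    return np.tanh(z)

def relu(z):      # clips values to [0, +inf)
    return np.maximum(0.0, z)
```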

SLIDE 26

Designing Neural Networks: Activation functions

SLIDE 27

Designing Neural Networks: Network structure

  • 2 key decisions:
  • Width (number of nodes per layer)
  • Depth (number of hidden layers)
  • More parameters means that the network can learn more complex functions of the input

SLIDE 28

Forward Propagation: For a given network and some input values, compute the output

SLIDE 29

Forward Propagation: For a given network and some input values, compute the output

Given input (1,0) (and sigmoid non-linearities), we can calculate the output by processing one layer at a time:
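The slide's weights are in a figure not reproduced here; a generic layer-by-layer sketch of the same computation (the weights below are placeholders, not the slide's values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Forward propagation: apply each layer (W, b) in turn,
    computing h = sigmoid(W h + b) one layer at a time."""
    h = x
    for W, b in layers:
        h = sigmoid(W @ h + b)
    return h

# Placeholder 2-2-1 network (not the slide's weights):
layers = [(np.array([[1.0, 1.0], [-1.0, -1.0]]), np.array([-0.5, 0.5])),
          (np.array([[1.0, 1.0]]), np.array([-0.5]))]
print(forward(np.array([1.0, 0.0]), layers))
```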

SLIDE 30

Forward Propagation: For a given network and some input values, compute the output

Output table for all possible inputs:

SLIDE 31

Neural Networks as Computation Graphs

SLIDE 32

Computation Graphs Make Prediction Easy: forward propagation consists of traversing the graph in topological order
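A minimal computation-graph sketch (the Node class and the operation set are my assumptions): forward propagation just evaluates the nodes in an order where every node appears after all of its inputs:

```python
import numpy as np

class Node:
    def __init__(self, op, inputs=()):
        self.op, self.inputs, self.value = op, list(inputs), None

OPS = {"matmul": lambda a, b: a @ b,
       "add":    lambda a, b: a + b,
       "tanh":   lambda a: np.tanh(a)}

def forward(nodes_in_topological_order, feed):
    """Traverse the graph in topological order, so every node's
    inputs already have values when the node is evaluated."""
    for node in nodes_in_topological_order:
        if node.op == "input":
            node.value = feed[node]
        else:
            node.value = OPS[node.op](*[n.value for n in node.inputs])
    return nodes_in_topological_order[-1].value

# Graph for y = tanh(W x + b):
x, W, b = Node("input"), Node("input"), Node("input")
z1 = Node("matmul", [W, x])
z2 = Node("add", [z1, b])
y  = Node("tanh", [z2])

feed = {W: np.array([[1.0, 1.0], [-1.0, -1.0]]),
        x: np.array([1.0, 0.0]),
        b: np.array([-0.5, 0.5])}
print(forward([x, W, b, z1, z2, y], feed))
```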

SLIDE 33

Neural Networks so far

  • Powerful non-linear models for classification
  • Predictions are made as a sequence of simple operations
  • matrix-vector operations
  • non-linear activation functions
  • Choices in network structure
  • Width and depth
  • Choice of activation function
  • Feedforward networks
  • no loops
  • Next: how to train