SLIDE 1

Logistic Regression & Neural Networks

CMSC 723 / LING 723 / INST 725 Marine Carpuat

Slides credit: Graham Neubig, Jacob Eisenstein

SLIDE 2

Logistic Regression

SLIDE 3

Perceptron & Probabilities

  • What if we want a probability p(y|x)?
  • The perceptron only gives us a hard prediction y, not a probability
  • Let’s illustrate this with binary classification

Illustrations: Graham Neubig

SLIDE 4

The logistic function

  • A “softer” function than the perceptron’s step function
  • Can account for uncertainty
  • Differentiable (see the formula below)
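The function itself appears only as an image on the slide; in the notation used elsewhere in this deck, it is presumably the sigmoid applied to the perceptron's score:

$P(y = 1 \mid x) = \frac{e^{\mathbf{w}\cdot\phi(x)}}{1 + e^{\mathbf{w}\cdot\phi(x)}} = \frac{1}{1 + e^{-\mathbf{w}\cdot\phi(x)}}$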
SLIDE 5

Logistic regression: how to train?

  • Train based on conditional likelihood
  • Find parameters w that maximize the conditional likelihood of all labels $y_i$ given examples $x_i$:

$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} \prod_i P(y_i \mid x_i; \mathbf{w})$

SLIDE 6

Stochastic gradient ascent (or descent)

  • Online training algorithm for logistic regression (and other probabilistic models)
  • Update weights for every training example
  • Move in the direction given by the gradient
  • Size of the update step is scaled by the learning rate (a minimal sketch follows below)
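The sketch below is a minimal NumPy implementation of this update loop (not from the slides); the feature vectors, learning rate, and iteration count are illustrative assumptions. It uses the gradient of P(y|x) derived on the following slides, for labels y in {+1, -1}.

    import numpy as np

    def sigmoid(z):
        # logistic function: e^z / (1 + e^z) = 1 / (1 + e^-z)
        return 1.0 / (1.0 + np.exp(-z))

    def train_logreg_sgd(data, num_features, iterations=10, alpha=0.1):
        """Stochastic gradient ascent on P(y|x) with labels y in {+1, -1}."""
        w = np.zeros(num_features)
        for _ in range(iterations):
            for phi_x, y in data:           # phi_x: feature vector, y: label
                p = sigmoid(w @ phi_x)      # P(y = +1 | x)
                # dP(y|x)/dw = y * phi(x) * p * (1 - p)
                w += alpha * y * phi_x * p * (1.0 - p)
        return w

    # toy usage with made-up two-dimensional feature vectors
    data = [(np.array([1.0, 0.0]), +1), (np.array([0.0, 1.0]), -1)]
    w = train_logreg_sgd(data, num_features=2)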
SLIDE 7

Gradient of the logistic function
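The formula on this slide is an image; it presumably matches the derivative worked out on slide 32 later in the deck:

$\frac{d}{d\mathbf{w}}\,\frac{e^{\mathbf{w}\cdot\phi(x)}}{1 + e^{\mathbf{w}\cdot\phi(x)}} = \phi(x)\,\frac{e^{\mathbf{w}\cdot\phi(x)}}{\left(1 + e^{\mathbf{w}\cdot\phi(x)}\right)^{2}}$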

SLIDE 8

Example: Person/not-person classification problem

Given an introductory sentence in Wikipedia, predict whether the article is about a person.

SLIDE 9

Example: initial update

SLIDE 10

Example: second update

SLIDE 11

How to set the learning rate?

  • Various strategies:
  • Decay over time: $\alpha = \frac{1}{C + t}$, where $C$ is a parameter and $t$ is the number of samples seen so far
  • Use a held-out set, and decrease the learning rate when held-out likelihood stops improving
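For example, with C = 10 (an arbitrary illustrative value) the rate is α = 1/(10 + 1) ≈ 0.09 after the first sample and α = 1/(10 + 100) ≈ 0.009 after 100 samples, so later examples move the weights less than early ones.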

SLIDE 12

Multiclass version

SLIDE 13

Some models are better than others…

  • Consider these 2 examples
  • Which of the 2 models below is better?

Classifier 2 will probably generalize better! It does not include irrelevant information => the smaller model is better.

SLIDE 14

Regularization

  • A penalty on adding extra weights
  • L2 regularization:
    • big penalty on large weights
    • small penalty on small weights
  • L1 regularization:
    • uniform increase whether a weight is large or small
    • will drive many weights to exactly zero

L2 penalty: $\sum_j w_j^2$ &nbsp;&nbsp;&nbsp; L1 penalty: $\sum_j |w_j|$

SLIDE 15

L1 regularization in online learning
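The slide's content is a figure. As a rough sketch of one common way to apply an L1 penalty during online updates (a clipping step after each SGD update; this is a standard trick, not necessarily the exact scheme on the slide):

    import numpy as np

    def l1_clip(w, penalty):
        """Move every weight toward zero by `penalty`, clipping at zero so the
        regularization step never flips a weight's sign."""
        return np.sign(w) * np.maximum(np.abs(w) - penalty, 0.0)

    # after each stochastic gradient step:
    #   w = l1_clip(w + alpha * gradient, alpha * lam)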

SLIDE 16

What you should know

  • Standard supervised learning set-up for text classification
  • Difference between train vs. test data
  • How to evaluate
  • 3 examples of supervised linear classifiers
  • Naïve Bayes, Perceptron, Logistic Regression
  • Learning as optimization: what is the objective function optimized?
  • Difference between generative vs. discriminative classifiers
  • Smoothing, regularization
  • Overfitting, underfitting
SLIDE 17

Neural networks

SLIDE 18

Person/not-person classification problem

Given an introductory sentence in Wikipedia, predict whether the article is about a person.

SLIDE 19

Formalizing binary prediction

SLIDE 20

The Perceptron:

a “machine” to calculate a weighted sum

$y = \operatorname{sign}\left( \sum_{i=1}^{I} w_i \cdot \phi_i(x) \right)$

Example feature values for the Wikipedia sentence:
φ("A") = 1, φ("site") = 1, φ(",") = 2, φ("located") = 1, φ("in") = 1, φ("Maizuru") = 1, φ("Kyoto") = 1, φ("priest") = 0, φ("black") = 0

[The slide's figure also shows the corresponding weights, which do not survive extraction.]
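A minimal sketch of this weighted-sum "machine" in Python (the bag-of-words feature map and the weight dictionary are illustrative assumptions, not the slide's exact values):

    def extract_features(sentence):
        """Bag-of-words feature map phi(x): word -> count."""
        phi = {}
        for word in sentence.split():
            phi[word] = phi.get(word, 0) + 1
        return phi

    def predict_perceptron(w, phi):
        """y = sign(sum_i w_i * phi_i(x)): +1 if the weighted sum is positive, else -1."""
        score = sum(w.get(feat, 0.0) * value for feat, value in phi.items())
        return 1 if score > 0 else -1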
SLIDE 21

The Perceptron: Geometric interpretation


SLIDE 22

The Perceptron: Geometric interpretation


SLIDE 23

Limitation of perceptron

  • Can only find linear separations between positive and negative examples

[Figure: an XOR-like arrangement of X and O points that no single line can separate]

SLIDE 24

Neural Networks

  • Connect together multiple perceptrons

  • Motivation: Can represent non-linear functions!
SLIDE 25

Neural Networks: key terms

  • Input (aka features)
  • Output
  • Nodes
  • Layers
  • Hidden layers
  • Activation function (non-linear)

  • Multi-layer perceptron
SLIDE 26

Example

  • Create two classifiers

The four points in the original space (an XOR-like arrangement of X and O):
φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}

[Figure: two sign units side by side; each takes φ0[0], φ0[1], and a bias input of 1. The first has weights w0,0 and bias b0,0 and outputs φ1[0]; the second has weights w0,1 and bias b0,1 and outputs φ1[1].]
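One concrete choice of first-layer weights consistent with the mapping shown on the next slide (the exact values in the figure do not survive extraction, so treat these as an illustrative assumption):

$\phi_1[0] = \operatorname{sign}\!\left(\phi_0[0] + \phi_0[1] - 1\right), \qquad \phi_1[1] = \operatorname{sign}\!\left(-\phi_0[0] - \phi_0[1] - 1\right)$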

SLIDE 27

Example

  • These classifiers map to a new space

In the original space:
φ0(x1) = {-1, 1}, φ0(x2) = {1, 1}, φ0(x3) = {-1, -1}, φ0(x4) = {1, -1}

The two classifiers map each point into a new space φ1:
φ1(x1) = {-1, -1} (X), φ1(x2) = {1, -1} (O), φ1(x3) = {-1, 1} (O), φ1(x4) = {-1, -1} (X)

SLIDE 28

Example

  • In new space, the examples are linearly separable!

In the new space:
φ1(x1) = {-1, -1} (X), φ1(x2) = {1, -1} (O), φ1(x3) = {-1, 1} (O), φ1(x4) = {-1, -1} (X)

A single linear unit over φ1[0], φ1[1], and a bias of 1 now separates the classes; its output is the prediction φ2[0] = y.
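With those hidden values, one final unit that separates the classes (again an assumed set of weights) is $\phi_2[0] = \operatorname{sign}\!\left(\phi_1[0] + \phi_1[1] + 1\right)$: it outputs +1 for x2 and x3 and -1 for x1 and x4.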

SLIDE 29

Example wrap-up: Forward propagation

  • The final net

The final net: two tanh hidden units each take φ0[0], φ0[1], and a bias of 1 and produce φ1[0] and φ1[1]; a third tanh unit takes φ1[0], φ1[1], and a bias of 1 and produces the output φ2[0].
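A minimal NumPy sketch of forward propagation through this net. The weight and bias values are assumptions chosen so the net reproduces the mapping in the example; the exact numbers in the figure do not survive extraction.

    import numpy as np

    W1 = np.array([[ 1.0,  1.0],     # hidden unit computing phi_1[0]
                   [-1.0, -1.0]])    # hidden unit computing phi_1[1]
    b1 = np.array([-1.0, -1.0])
    w2 = np.array([1.0, 1.0])        # output unit computing phi_2[0] = y
    b2 = 1.0

    def forward(phi0):
        """Two tanh hidden units followed by a tanh output unit."""
        phi1 = np.tanh(W1 @ phi0 + b1)
        return np.tanh(w2 @ phi1 + b2)

    for phi0 in ([-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]):
        print(phi0, np.sign(forward(np.array(phi0))))
    # x2 = (1, 1) and x3 = (-1, -1) come out positive, x1 and x4 negative:
    # the XOR-like data is classified correctly, which no single linear unit could do.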

SLIDE 30

Softmax Function for multiclass classification

  • Sigmoid function for multiple classes
  • Can be expressed using matrix/vector ops

$P(y \mid x) = \frac{e^{\,\mathbf{w} \cdot \phi(x, y)}}{\sum_{\tilde{y}} e^{\,\mathbf{w} \cdot \phi(x, \tilde{y})}}$

The numerator scores the current class; the denominator sums over all classes.

In matrix/vector form: $\mathbf{s} = \exp\left(\mathbf{W} \cdot \phi(x)\right), \qquad \mathbf{q} = \frac{\mathbf{s}}{\sum_{\tilde{s} \in \mathbf{s}} \tilde{s}}$
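A small NumPy sketch of the matrix/vector form (arranging the weights as one row per class is an assumption about how the joint features are laid out):

    import numpy as np

    def softmax_probs(W, phi_x):
        """P(y | x) for every class: s = exp(W . phi(x)), then normalize."""
        z = W @ phi_x
        s = np.exp(z - z.max())   # subtract the max score for numerical stability
        return s / s.sum()        # divide by the sum over all classes

    # toy usage: 3 classes, 2 features
    W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    print(softmax_probs(W, np.array([2.0, 1.0])))   # probabilities sum to 1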

SLIDE 31

Stochastic Gradient Descent

Online training algorithm for probabilistic models:

    w = 0
    for I iterations
        for each labeled pair x, y in the data
            w += α * dP(y|x)/dw

In other words:

  • For every training example, calculate the gradient

(the direction that will increase the probability of y)

  • Move in that direction, multiplied by learning rate α
SLIDE 32

Gradient of the Sigmoid Function

Take the derivative of the probability

$\frac{d}{d\mathbf{w}}\,P(y = 1 \mid x) = \frac{d}{d\mathbf{w}}\,\frac{e^{\mathbf{w}\cdot\phi(x)}}{1 + e^{\mathbf{w}\cdot\phi(x)}} = \phi(x)\,\frac{e^{\mathbf{w}\cdot\phi(x)}}{\left(1 + e^{\mathbf{w}\cdot\phi(x)}\right)^{2}}$

$\frac{d}{d\mathbf{w}}\,P(y = -1 \mid x) = \frac{d}{d\mathbf{w}}\left(1 - \frac{e^{\mathbf{w}\cdot\phi(x)}}{1 + e^{\mathbf{w}\cdot\phi(x)}}\right) = -\phi(x)\,\frac{e^{\mathbf{w}\cdot\phi(x)}}{\left(1 + e^{\mathbf{w}\cdot\phi(x)}\right)^{2}}$
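A quick finite-difference check of the first derivative above (the weight and feature values are made up for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w   = np.array([0.5, -1.0])
    phi = np.array([2.0,  1.0])
    p = sigmoid(w @ phi)
    analytic = phi * p * (1.0 - p)   # phi(x) * e^{w.phi} / (1 + e^{w.phi})^2

    eps = 1e-6
    numeric = np.array([(sigmoid((w + eps * np.eye(2)[i]) @ phi) - p) / eps
                        for i in range(2)])
    print(analytic, numeric)         # the two vectors should agree closely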

SLIDE 33

Learning: We Don't Know the Derivative for Hidden Units!

For NNs, only know correct tag for last layer

For the output unit's weights $\mathbf{w}_5$ the derivative has the same form as for logistic regression, with the hidden-unit outputs $\mathbf{h}(x)$ playing the role of the features (and the correct output, here y = 1, known):

$\frac{\partial P(y=1 \mid x)}{\partial \mathbf{w}_5} = \mathbf{h}(x)\,\frac{e^{\mathbf{w}_5 \cdot \mathbf{h}(x)}}{\left(1 + e^{\mathbf{w}_5 \cdot \mathbf{h}(x)}\right)^{2}}$

But for the hidden units' weights $\mathbf{w}_1, \dots, \mathbf{w}_4$ the derivative is not immediately obvious:

$\frac{\partial P(y=1 \mid x)}{\partial \mathbf{w}_2} = \;?\qquad \frac{\partial P(y=1 \mid x)}{\partial \mathbf{w}_3} = \;?\qquad \frac{\partial P(y=1 \mid x)}{\partial \mathbf{w}_4} = \;?$

SLIDE 34

Answer: Back-Propagation

Calculate derivative with chain rule

$\frac{\partial P(y=1 \mid x)}{\partial \mathbf{w}_2} \;=\; \frac{\partial P(y=1 \mid x)}{\partial h_2(x)} \cdot \frac{\partial h_2(x)}{\partial \mathbf{w}_2} \;=\; \underbrace{\delta_4}_{\text{error of the next unit}} \cdot \underbrace{w_{2,4}}_{\text{weight}} \cdot \underbrace{\frac{\partial h_2(x)}{\partial \mathbf{w}_2}}_{\text{gradient of this unit}}$

In general, calculate the error term $\delta_i$ of unit $i$ from the next units $j$ it feeds into, using their errors $\delta_j$, the connecting weights $w_{i,j}$, and the gradient of unit $i$ itself:

$\delta_i \;=\; \frac{\partial h_i(x)}{\partial (\mathbf{w}_i \cdot \phi_i(x))} \,\sum_j \delta_j \, w_{i,j}$
SLIDE 35

Backpropagation = Gradient descent + Chain rule
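A minimal NumPy sketch of backpropagation with stochastic gradient descent for the two-layer tanh network from the earlier example, trained on the XOR-like points. The squared-error loss, learning rate, initialization, and iteration count are assumptions for illustration (the slides use the conditional likelihood instead), and a tiny network like this can occasionally get stuck in a poor local optimum depending on the random seed.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(scale=0.5, size=(2, 2)), np.zeros(2)   # hidden layer
    w2, b2 = rng.normal(scale=0.5, size=2), 0.0                # output unit

    data = [(np.array([-1.0,  1.0]), -1.0), (np.array([1.0,  1.0]),  1.0),
            (np.array([-1.0, -1.0]),  1.0), (np.array([1.0, -1.0]), -1.0)]

    alpha = 0.1
    for _ in range(1000):
        for phi0, y in data:
            # forward propagation
            h = np.tanh(W1 @ phi0 + b1)
            out = np.tanh(w2 @ h + b2)
            # backward pass (chain rule), for the squared error 0.5 * (out - y)^2
            d_out = (out - y) * (1.0 - out ** 2)   # error (delta) of the output unit
            d_h = (w2 * d_out) * (1.0 - h ** 2)    # errors (deltas) of the hidden units
            # gradient descent updates
            w2 -= alpha * d_out * h
            b2 -= alpha * d_out
            W1 -= alpha * np.outer(d_h, phi0)
            b1 -= alpha * d_h

    for phi0, y in data:
        print(phi0, y, np.tanh(w2 @ np.tanh(W1 @ phi0 + b1) + b2))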

SLIDE 36

Feed Forward Neural Nets

All connections point forward, from the input features φ(x) to the output y.

It is a directed acyclic graph (DAG).

SLIDE 37

Neural Networks

  • Non-linear classification
  • Prediction: forward propagation
  • Vector/matrix operations + non-linearities
  • Training: backpropagation + stochastic gradient descent

For more details, see CIML Chap 7