Linear Models Continued: Perceptron & Logistic Regression


SLIDE 1

Linear Models Continued: Perceptron & Logistic Regression

CMSC 723 / LING 723 / INST 725 Marine Carpuat

Slides credit: Graham Neubig, Jacob Eisenstein

SLIDE 2

Linear Models for Classification

  • Feature function representation
  • Weights

SLIDE 3

Naïve Bayes recap

SLIDE 4

The Perceptron

SLIDE 5

The perceptron

  • A linear model for classification
  • An algorithm to learn feature weights given labeled data
    • online: processes one example at a time
    • error-driven: updates weights only when it makes a mistake (see the sketch below)
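For concreteness, a minimal sketch of this algorithm in Python, assuming NumPy feature vectors and labels in {−1, +1} (the function name, zero initialization, and fixed epoch count are illustrative choices, not the slides' reference implementation):

```python
import numpy as np

def train_perceptron(examples, num_epochs=10):
    """Binary perceptron. `examples` is a list of (x, y) pairs where
    x is a NumPy feature vector and y is a label in {-1, +1}."""
    w = np.zeros(len(examples[0][0]))              # initialize weights to zero
    for _ in range(num_epochs):
        for x, y in examples:
            y_hat = 1 if np.dot(w, x) >= 0 else -1 # predict with current weights
            if y_hat != y:                         # error-driven: update only on mistakes
                w = w + y * x                      # move weights toward the correct label
    return w
```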
SLIDE 6

Multiclass perceptron
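The slides leave the details to the lecture, but a standard per-class-weights formulation looks like the following sketch (the per-class weight matrix, rather than a joint feature function f(x, y), is an assumption here):

```python
import numpy as np

def train_multiclass_perceptron(examples, num_classes, num_epochs=10):
    """Multiclass perceptron with one weight vector per class.
    `examples` is a list of (x, y) pairs with y a class index."""
    W = np.zeros((num_classes, len(examples[0][0])))
    for _ in range(num_epochs):
        for x, y in examples:
            y_hat = int(np.argmax(W @ x))   # predict the highest-scoring class
            if y_hat != y:
                W[y] += x                   # boost the weights of the correct class
                W[y_hat] -= x               # penalize the wrongly predicted class
    return W
```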

slide-7
SLIDE 7

Understanding the perceptron

  • What’s the impact of the update rule on parameters?
  • The perceptron algorithm will converge if the training data is linearly separable
    • Proof: see “A Course In Machine Learning”, Ch. 4
  • Practical issues
    • How to initialize?
    • When to stop?
    • How to order training examples?
slide-8
SLIDE 8

When to stop?

  • One technique: early stopping
    • Stop when the accuracy on held-out data starts to decrease
  • Requires splitting the data into 3 sets: training/development/test (a sketch follows below)
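A sketch of early stopping wrapped around perceptron training, assuming the `train_perceptron`-style update above and a simple "stop at the first drop in development accuracy" rule (real implementations often add patience):

```python
import numpy as np

def accuracy(w, examples):
    """Fraction of examples whose sign prediction matches the label."""
    return np.mean([(1 if np.dot(w, x) >= 0 else -1) == y for x, y in examples])

def train_with_early_stopping(train, dev, max_epochs=50):
    w = np.zeros(len(train[0][0]))
    best_w, best_acc = w.copy(), -1.0
    for _ in range(max_epochs):
        for x, y in train:                         # one perceptron epoch
            if (1 if np.dot(w, x) >= 0 else -1) != y:
                w = w + y * x
        dev_acc = accuracy(w, dev)
        if dev_acc < best_acc:                     # dev accuracy decreased: stop
            break
        best_w, best_acc = w.copy(), dev_acc       # remember the best weights so far
    return best_w
```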

slide-9
SLIDE 9

ML fundamentals aside:

  • Overfitting/underfitting/generalization
SLIDE 9

Training error is not sufficient

  • We care about generalization to new examples
  • A classifier can classify training data perfectly, yet classify new examples incorrectly
    • because training examples are only a sample of the data distribution: a feature might correlate with the class by coincidence
    • because training examples could be noisy, e.g., accidental labeling errors
SLIDE 10

Overfitting

  • Consider a model $h$ and its:
    • error rate over the training data: $\mathrm{error}_{\mathrm{train}}(h)$
    • true error rate over all data: $\mathrm{error}_{\mathrm{true}}(h)$
  • We say $h$ overfits the training data if

$$\mathrm{error}_{\mathrm{train}}(h) < \mathrm{error}_{\mathrm{true}}(h)$$

SLIDE 11

Evaluating on test data

  • Problem: we don’t know $\mathrm{error}_{\mathrm{true}}(h)$!
  • Solution:
    • we set aside a test set: some examples that will be used for evaluation
    • we don’t look at them during training!
    • after learning a classifier $h$, we calculate $\mathrm{error}_{\mathrm{test}}(h)$

SLIDE 12

Overfitting

  • Another way of putting it
  • A classifier $h$ is said to overfit the training data if there is another hypothesis $h'$ such that
    • $h$ has a smaller error than $h'$ on the training data,
    • but $h$ has a larger error on the test data than $h'$.
SLIDE 13

Underfitting/Overfitting

  • Underfitting
    • The learning algorithm had the opportunity to learn more from the training data, but didn’t
  • Overfitting
    • The learning algorithm paid too much attention to idiosyncrasies of the training data; the resulting classifier doesn’t generalize

SLIDE 14

Back to the Perceptron

SLIDE 15

Averaged Perceptron improves generalization
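The slides leave the algorithm to the lecture, but a naive sketch of weight averaging looks like this (the running-sum implementation is an assumption; efficient versions use a lazily updated accumulator):

```python
import numpy as np

def train_averaged_perceptron(examples, num_epochs=10):
    """Averaged perceptron: return the average of the weight vector over
    every step of training, which typically generalizes better than the
    final weights alone."""
    w = np.zeros(len(examples[0][0]))
    w_sum = np.zeros_like(w)                       # running sum of weight vectors
    steps = 0
    for _ in range(num_epochs):
        for x, y in examples:
            if (1 if np.dot(w, x) >= 0 else -1) != y:
                w = w + y * x
            w_sum += w                             # accumulate after every example
            steps += 1
    return w_sum / steps                           # averaged weights
```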

SLIDE 16

What objective/loss does the perceptron optimize?

  • The zero-one loss function (a sketch follows below)
  • What are the pros and cons compared to the Naïve Bayes loss?
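For reference, the zero-one loss simply counts mistakes; a minimal illustrative helper (not from the slides):

```python
import numpy as np

def zero_one_loss(w, examples):
    """Total zero-one loss: 1 per misclassified example, 0 otherwise.
    Piecewise constant, hence not differentiable in w."""
    return sum(1 for x, y in examples
               if (1 if np.dot(w, x) >= 0 else -1) != y)
```

Its flatness is the main "con": the gradient is zero almost everywhere, which is one motivation for the differentiable logistic loss introduced next.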
SLIDE 17

Logistic Regression

SLIDE 18

Perceptron & Probabilities

  • What if we want a probability p(y|x)?
  • The perceptron gives us a prediction y
  • Let’s illustrate this with binary classification

Illustrations: Graham Neubig

SLIDE 19

The logistic function

  • A “softer” function than the perceptron’s hard threshold
  • Can account for uncertainty
  • Differentiable (see the sketch below)
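A minimal sketch of the logistic (sigmoid) function applied to the linear score w · x (the helper names are illustrative):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def p_positive(w, x):
    """P(y = +1 | x) for a binary logistic regression model."""
    return logistic(np.dot(w, x))
```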
SLIDE 20

Logistic regression: how to train?

  • Train based on conditional likelihood
  • Find parameters $w$ that maximize the conditional likelihood of all answers $y_i$ given examples $x_i$ (written out below)
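Written out, the objective is the conditional log-likelihood; a standard formulation consistent with the slide (the binary logistic form of $P$ is an assumption here):

$$\hat{w} = \operatorname*{argmax}_{w} \sum_{i} \log P(y_i \mid x_i; w), \qquad P(y = +1 \mid x; w) = \frac{1}{1 + e^{-w \cdot x}}$$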

SLIDE 21

Stochastic gradient ascent (or descent)

  • Online training algorithm for logistic regression
  • and other probabilistic models
  • Update weights for every training example
  • Move in direction given by gradient
  • Size of update step scaled by the learning rate (see the sketch below)
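A sketch of stochastic gradient ascent for binary logistic regression with labels in {−1, +1}, assuming the `logistic` helper above (the constant learning rate and fixed epoch count are illustrative choices):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_sgd(examples, learning_rate=0.1, num_epochs=10):
    """Maximize the conditional log-likelihood by stochastic gradient
    ascent: one weight update per training example."""
    w = np.zeros(len(examples[0][0]))
    for _ in range(num_epochs):
        for x, y in examples:
            # Gradient of log P(y | x; w) w.r.t. w is (1 - P(y | x; w)) * y * x
            grad = (1.0 - logistic(y * np.dot(w, x))) * y * x
            w = w + learning_rate * grad   # step size scaled by the learning rate
    return w
```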
SLIDE 22

What you should know

  • Standard supervised learning set-up for text classification
  • Difference between train vs. test data
  • How to evaluate
  • 3 examples of supervised linear classifiers
  • Naïve Bayes, Perceptron, Logistic Regression
  • Learning as optimization: what is the objective function optimized?
  • Difference between generative vs. discriminative classifiers
  • Smoothing, regularization
  • Overfitting, underfitting
SLIDE 23

An online learning algorithm
SLIDE 24

Perceptron weight update

  • If y = +1, increase the weights for the features active in x
  • If y = −1, decrease the weights for the features active in x (a tiny worked example follows below)
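A tiny worked example of this update (the numbers are illustrative):

```python
import numpy as np

w = np.array([0.5, -1.0, 0.0])   # current weights
x = np.array([1.0, 2.0, 0.0])    # features of a misclassified example
y = 1                            # true label is +1

w = w + y * x                    # increase weights for the active features
print(w)                         # [1.5 1. 0.]
```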