CSCI 5622: Machine Learning
Lecture 12: Regularization, regression, and multi-class classification


SLIDE 1

Department of Computer Science
CSCI 5622: Machine Learning
Chenhao Tan
Lecture 12: Regularization, regression, and multi-class classification
Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

SLIDE 2

HW 2

SLIDE 3

Learning objectives

  • Review homeworks and multi-class classification
  • Linear regression
  • Examine regularization in the regression context
  • Recognize the effects of regularization on bias/variance

SLIDE 4

Outline

  • Multi-class classification
  • Linear regression
  • Regularization

SLIDE 5

Outline

  • Multi-class classification
  • Linear regression
  • Regularization

SLIDE 6

Multi-class classification

  • Binary examples
    • Spam classification
    • Sentiment classification

SLIDE 7

Multi-class classification

  • Binary examples
    • Spam classification
    • Sentiment classification
  • Multi-class examples
    • Star-ratings classification
    • Part-of-speech tagging
    • Image classification

SLIDE 8

What we learned so far

  • KNN
  • Naïve Bayes
  • Logistic regression
  • Neural networks
  • Support vector machines

SLIDE 9

Binary vs. Multi-class classification

SLIDE 10

Multi-class logistic regression
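(The equations on this slide did not survive extraction. The standard softmax formulation of multi-class logistic regression generalizes the binary sigmoid to K classes:)

```latex
P(y = k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^\top \mathbf{x})}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^\top \mathbf{x})}
```

Each class k gets its own weight vector w_k, and prediction takes the argmax over the K class probabilities.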

SLIDE 11

SLIDE 12

Multi-class Support Vector Machines

  • Reduction
    • One-against-all
    • All-pairs
  • Modify the objective function (SSBD 17.2)

SLIDE 13

Reduction

SLIDE 14

How do we use binary classifiers to output categorical labels?

SLIDE 15

One-against-all

SLIDE 16

One-against-all

  • Break the k-class problem into k binary problems and solve them separately
  • Combine predictions: evaluate all the h’s and take the one with the highest confidence (see the sketch below)
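A minimal sketch of this reduction, assuming scikit-learn's LogisticRegression as the base binary learner (the function names here are illustrative, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, y, num_classes):
    """Train one binary classifier per class: class k vs. the rest."""
    classifiers = []
    for k in range(num_classes):
        clf = LogisticRegression()
        # Relabel: 1 for examples of class k, 0 for everything else
        clf.fit(X, (y == k).astype(int))
        classifiers.append(clf)
    return classifiers

def predict_one_vs_all(classifiers, X):
    """Evaluate every h_k and take the class with the highest confidence."""
    # decision_function returns a signed distance from the separating boundary
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return np.argmax(scores, axis=1)
```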

SLIDE 17

One-against-all

SLIDE 18

All-pairs

SLIDE 19

All-pairs

  • Break the k-class problem into k(k-1)/2 binary problems, one per pair of classes, and solve them separately
  • Combine predictions: evaluate all the h’s and take the label with the highest summed confidence (a sketch follows)
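A matching sketch for all-pairs under the same assumptions (again, the helper names are made up for illustration):

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_all_pairs(X, y, num_classes):
    """Train one binary classifier for each of the k(k-1)/2 class pairs."""
    classifiers = {}
    for a, b in combinations(range(num_classes), 2):
        mask = (y == a) | (y == b)  # keep only examples of classes a and b
        clf = LogisticRegression()
        clf.fit(X[mask], (y[mask] == b).astype(int))  # label 1 means class b
        classifiers[(a, b)] = clf
    return classifiers

def predict_all_pairs(classifiers, X, num_classes):
    """Each pairwise classifier votes; return the label with the highest sum."""
    votes = np.zeros((X.shape[0], num_classes))
    for (a, b), clf in classifiers.items():
        pred = clf.predict(X)  # 0 votes for class a, 1 votes for class b
        votes[pred == 0, a] += 1
        votes[pred == 1, b] += 1
    return np.argmax(votes, axis=1)
```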

SLIDE 20

All-pairs

SLIDE 21

Outline

  • Multi-class classification
  • Linear regression
  • Regularization

SLIDE 22

Linear regression

  • Data are continuous inputs and outputs
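(The model equation is not preserved in this transcript; linear regression predicts the output as a weighted sum of the input features, folding the intercept into w via a constant feature:)

```latex
\hat{y} = w_0 + w_1 x_1 + \dots + w_d x_d = \mathbf{w}^\top \mathbf{x}
```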

SLIDE 23

Linear regression example

  • Given a person’s age and gender, predict their height
  • Given the square footage and number of bathrooms in a house, predict its sale price
  • Given unemployment, inflation, number of wars, and economic growth, predict the president’s approval rating
  • Given a user’s browsing history, predict how long they will stay on a product page
  • Given the advertising budget expenditures in various markets, predict the number of products sold

SLIDE 24

Linear regression example

SLIDE 25

Linear regression example

SLIDE 26

Derived features
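(The figures for this slide are lost; a typical derived-feature construction, and the one slide 34 refers to, is a polynomial expansion of a single raw input x:)

```latex
\hat{y} = w_0 + w_1 x + w_2 x^2 + \dots + w_p x^p
```

The model is nonlinear in x but still linear in the weights, so ordinary linear regression machinery applies unchanged.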

SLIDE 27

Derived features

SLIDE 28

Objective function

The objective function is called the residual sum of squares:
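(The displayed equation did not survive extraction; the standard definition is:)

```latex
RSS(\mathbf{w}) = \sum_{i=1}^{n} \left( y_i - \mathbf{w}^\top \mathbf{x}_i \right)^2
```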

SLIDE 29

Probabilistic interpretation

A discriminative model that assumes the response is Gaussian, with mean given by the linear prediction
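(The distributional statement on the slide is lost; in standard notation the assumption is:)

```latex
y_i = \mathbf{w}^\top \mathbf{x}_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2)
\quad\text{i.e.}\quad
p(y_i \mid \mathbf{x}_i; \mathbf{w}) = \mathcal{N}\!\left(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2\right)
```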

SLIDE 30

Probabilistic interpretation

A discriminative model that assumes the response is Gaussian, with mean given by the linear prediction

SLIDE 31

Probabilistic interpretation

Assuming i.i.d. samples, we can write the likelihood of the data as
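(The displayed formula is missing; under the Gaussian model above it is:)

```latex
L(\mathbf{w}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2} \right)
```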

SLIDE 32

Probabilistic interpretation

Negative log likelihood
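(The derivation on the slide is lost; taking the negative log of the likelihood above gives:)

```latex
-\log L(\mathbf{w}) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n} \left( y_i - \mathbf{w}^\top \mathbf{x}_i \right)^2
```

The first term does not depend on w, so minimizing the negative log likelihood is exactly minimizing RSS: maximum likelihood under Gaussian noise recovers least squares.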

SLIDE 33

Probabilistic interpretation

Negative log likelihood

SLIDE 34

Revisiting Bias-variance Tradeoff

  • Consider the case of fitting linear regression with derived polynomial features to a set of training data
  • In general, we want a model that explains the training data and can still generalize to unseen test data

SLIDE 35

Revisiting Bias-variance Tradeoff

SLIDE 36

Outline

  • Multi-class classification
  • Linear regression
  • Regularization

SLIDE 37

High variance

  • The model wiggles wildly to get close to the data
  • To get big swings, the model coefficients must be very large
  • Weights can grow to the order of 10^6

SLIDE 38

Regularization

  • Keep all the features, but force the coefficients to be smaller
  • This is called regularization

SLIDE 39

Regularization

  • Add a penalty term to the RSS objective function
  • Balance between small RSS and small coefficients
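(The penalized objective shown on the slide is not preserved; with an L2, i.e. ridge, penalty it takes the form:)

```latex
\min_{\mathbf{w}} \; \sum_{i=1}^{n} \left( y_i - \mathbf{w}^\top \mathbf{x}_i \right)^2 + \lambda \sum_{j=1}^{d} w_j^2
```

λ ≥ 0 controls the balance: λ = 0 recovers plain least squares, while large λ forces the coefficients toward zero. Swapping the squared penalty for Σ|w_j| gives the lasso variant discussed later.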

SLIDE 40

Regularization

  • Add a penalty term to the RSS objective function
  • Balance between small RSS and small coefficients
  • HW 2 extra credit question

SLIDE 41

SLIDE 42

Regularization

SLIDE 43

Ridge regularization
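(The math on this slide is lost; the ridge objective admits a closed-form minimizer, which is presumably what the slide derives:)

```latex
\hat{\mathbf{w}}_{\text{ridge}} = \left( X^\top X + \lambda I \right)^{-1} X^\top \mathbf{y}
```

Adding λI makes the matrix invertible even when X^T X is singular, one practical benefit of ridge over plain least squares.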

SLIDE 44

Ridge regularization

SLIDE 45

Bias-variance tradeoff

If the penalty λ decreases, then:

  • A. the bias increases, the variance increases
  • B. the bias increases, the variance decreases
  • C. the bias decreases, the variance increases
  • D. the bias decreases, the variance decreases
SLIDE 46

Ridge regularization vs. lasso regularization

  • How do the coefficients behave as λ increases?

SLIDE 47

Ridge regularization

  • Coefficients shrink toward zero smoothly and uniformly

SLIDE 48

Lasso regularization

  • Some coefficients shrink to zero very fast
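A quick way to see both behaviors, sketched with scikit-learn (its alpha parameter plays the role of λ here; the data is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: two truly zero coefficients among five
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([4.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_w + rng.normal(scale=0.5, size=100)

for lam in [0.01, 1.0, 100.0]:
    ridge = Ridge(alpha=lam).fit(X, y)
    lasso = Lasso(alpha=lam).fit(X, y)
    # Ridge shrinks every coefficient smoothly; lasso zeros some out entirely
    print(lam, np.round(ridge.coef_, 2), np.round(lasso.coef_, 2))
```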

SLIDE 49

Ridge regularization vs. lasso regularization

  • Why does the choice between the two types of regularization lead to very different behavior?
  • Several ways to look at it:
    • Constrained minimization
    • A simplified case of the data
    • Prior probabilities on parameters

SLIDE 50

Intuition 1: Constrained Minimization

SLIDE 51

Intuition 1: Constrained Minimization

SLIDE 52

Intuition 1: Constrained Minimization

The minimum is more likely to sit at a corner of the diamond with lasso, causing some feature weights to be set exactly to zero.
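(The contour plots are lost, but the constrained view they illustrate can be written down: the penalized objectives are equivalent to minimizing RSS subject to a budget on the weights:)

```latex
\text{ridge:}\; \min_{\mathbf{w}} RSS(\mathbf{w}) \ \text{ s.t. } \ \|\mathbf{w}\|_2^2 \le t
\qquad
\text{lasso:}\; \min_{\mathbf{w}} RSS(\mathbf{w}) \ \text{ s.t. } \ \|\mathbf{w}\|_1 \le t
```

The lasso's L1 ball is a diamond with corners on the coordinate axes, so the RSS contours tend to first touch it at a corner, where some coordinates are exactly zero; the ridge's L2 ball is round and has no such corners.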

SLIDE 53

Intuition 2: A Simplified Case

SLIDE 54

Intuition 2: A Simplified Case

SLIDE 55

Intuition 2: A Simplified Case

SLIDE 56

Intuition 2: A Simplified Case

SLIDE 57

Intuition 3: Prior Distribution

SLIDE 58

Intuition 3: Prior Distribution

SLIDE 59

Intuition 3: Prior Distribution

SLIDE 60

Intuition 3: Prior Distribution

SLIDE 61

Intuition 3: Prior Distribution

  • Lasso's prior is peaked at 0: we expect many parameters to be exactly zero
  • Ridge's prior is flatter and fatter around 0: we expect many coefficients to be smallish
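(The formulas for these slides are lost; the standard correspondence is that each penalty is the negative log of a prior, so regularized least squares is MAP estimation:)

```latex
w_j \sim \mathcal{N}(0, \tau^2) \;\Rightarrow\; \text{L2 penalty (ridge)},
\qquad
w_j \sim \mathrm{Laplace}(0, b) \;\Rightarrow\; \text{L1 penalty (lasso)}
```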

SLIDE 62

Wrap up

  • Regularization and the idea behind it are crucial for machine learning
  • Always use regularization in some form
  • Next: ensemble methods