Lecture 5: Linear regression (cont'd.), Regularization, ML Methodology, Learning Theory


slide-1
SLIDE 1

Lecture 5:

−Linear regression (cont’d.) −Regularization −ML Methodology −Learning theory

Aykut Erdem

October 2017 Hacettepe University

slide-2
SLIDE 2

About class projects

  • This semester the theme is machine learning and the city.
  • To be done in groups of 3 people.
  • Deliverables: Proposal, blog posts, progress report, project presentations

(classroom + video (new) presentations), final report and code

  • For more details please check the project webpage: 


http://web.cs.hacettepe.edu.tr/~aykut/classes/fall2017/bbm406/project.html.

3

slide-3
SLIDE 3

4

Recall from last time… Linear Regression

Model: $y(x) = w_0 + w_1 x$, with $\mathbf{w} = (w_0, w_1)$

Loss: $\ell(\mathbf{w}) = \sum_{n=1}^{N} \big[ t^{(n)} - (w_0 + w_1 x^{(n)}) \big]^2$

Gradient Descent Update Rule: $\mathbf{w} \leftarrow \mathbf{w} + 2\lambda \big( t^{(n)} - y(x^{(n)}) \big) x^{(n)}$

Closed Form Solution: $\mathbf{w} = (X^T X)^{-1} X^T \mathbf{t}$
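As a quick illustration, here is a minimal NumPy sketch of both approaches on a tiny hypothetical 1-D dataset (the data values and the learning rate lam are assumptions chosen just for the example):

```python
import numpy as np

# Toy 1-D data (hypothetical values, just for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.0, 2.9, 5.1, 7.2])

# Closed-form solution: augment x with a column of ones so w0 acts as the bias
X = np.column_stack([np.ones_like(x), x])      # N x 2 design matrix
w_closed = np.linalg.inv(X.T @ X) @ X.T @ t    # w = (X^T X)^{-1} X^T t

# Stochastic gradient descent on the same objective
w = np.zeros(2)
lam = 0.01                                     # learning rate (assumed value)
for epoch in range(1000):
    for xn, tn in zip(x, t):
        y = w[0] + w[1] * xn
        w += 2 * lam * (tn - y) * np.array([1.0, xn])   # w <- w + 2*lam*(t - y)*x

print(w_closed)   # (w0, w1) from the normal equations
print(w)          # should be close to the closed-form answer
```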

slide-4
SLIDE 4

Today

  • Linear regression (cont’d.)
  • Regularization
  • Machine Learning Methodology
  • validation
  • cross-validation (k-fold, leave-one-out)
  • model selection


  • Learning theory


5

slide-5
SLIDE 5

Multi-dimensional Inputs

  • One method of extending the model is to consider other input dimensions 



 


  • In the Boston housing example, we can look at the number of rooms

6

slide by Sanja Fidler

y(x) = w0 + w1x1 + w2x2

slide-6
SLIDE 6

Linear Regression with 
 Multi-dimensional Inputs

  • Imagine now we want to predict the median house price from

these multi-dimensional observations

  • Each house is a data point n, with observations indexed by j:
  • We can incorporate the bias w0 into w, by using x0 = 1, then
  • We can then solve for w = (w0,w1,…,wd). How?
  • We can use gradient descent to solve for each coefficient, or

compute w analytically (how does the solution change?)

7

slide by Sanja Fidler

$x^{(n)} = \big( x^{(n)}_1, \ldots, x^{(n)}_j, \ldots, x^{(n)}_d \big)$

$y(\mathbf{x}) = w_0 + \sum_{j=1}^{d} w_j x_j = \mathbf{w}^T \mathbf{x}$

slide-7
SLIDE 7

More Powerful Models?

  • What if our linear model is not good? How can we create a more

complicated model?

8

slide by Sanja Fidler

slide-8
SLIDE 8

Fitting a Polynomial

  • What if our linear model is not good? How can we create a more

complicated model?

  • We can create a more complicated model by defining input variables

that are combinations of components of x

  • Example: an M-th order polynomial function of one dimensional

feature x: 
 
 
 
 
 where $x^j$ is the j-th power of x

  • We can use the same approach to optimize for the weights w
  • How do we do that?

9

slide by Sanja Fidler

$y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j x^j$
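The same least-squares machinery applies once the polynomial design matrix is built. A minimal sketch, assuming some hypothetical noisy 1-D data and order M = 3:

```python
import numpy as np

# Hypothetical 1-D training data
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)

M = 3                                           # polynomial order
Phi = np.vander(x, M + 1, increasing=True)      # columns [1, x, x^2, ..., x^M]

# Fit the weights w by least squares (same objective as before)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w)          # w0 ... wM
print(Phi @ w)    # predictions y(x, w) on the training inputs
```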

slide-9
SLIDE 9

Some types of basis functions in 1-D

10

Sigmoids Gaussians Polynomials

Gaussians: $\phi_j(x) = \exp\!\left( -\frac{(x - \mu_j)^2}{2s^2} \right)$

Sigmoids: $\phi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$

slide by Erik Sudderth
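As a quick sketch, the three families can be written directly in NumPy (the centre mu_j and width s are assumed parameters of each basis family):

```python
import numpy as np

def poly_basis(x, j):
    """Polynomial basis: phi_j(x) = x^j."""
    return x ** j

def gaussian_basis(x, mu_j, s):
    """Gaussian basis: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-((x - mu_j) ** 2) / (2 * s ** 2))

def sigmoid_basis(x, mu_j, s):
    """Sigmoidal basis: phi_j(x) = sigma((x - mu_j) / s), sigma(a) = 1/(1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-(x - mu_j) / s))
```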

slide-10
SLIDE 10

$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d = \mathbf{w}^T \mathbf{x}$

$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \ldots + w_m \phi_m(\mathbf{x}) = \mathbf{w}^T \Phi(\mathbf{x})$

(here $w_0$ is the bias)

Two types of linear model that are equivalent with respect to learning

  • The first model has the same number of adaptive coefficients as the

dimensionality of the data +1.

  • The second model has the same number of adaptive coefficients as

the number of basis functions +1.

  • Once we have replaced the data by the outputs of the basis

functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick)

11

slide by Erik Sudderth

slide-11
SLIDE 11

General linear regression problem

  • Using our new notation, basis function linear regression can be written as

$y = \sum_{j=0}^{k} w_j \phi_j(\mathbf{x})$

  • where $\phi_j(\mathbf{x})$ can be either $x_j$ (for multivariate regression) or one of the nonlinear bases we defined
  • Once again we can use “least squares” to find the optimal solution.

12

slide by E. P. Xing

slide-12
SLIDE 12

LMS for the general linear regression problem

Our goal is to minimize the following loss function:

$J(\mathbf{w}) = \sum_i \Big( y_i - \sum_j w_j \phi_j(\mathbf{x}_i) \Big)^2$

Moving to vector notation we get:

$J(\mathbf{w}) = \sum_i \big( y_i - \mathbf{w}^T \phi(\mathbf{x}_i) \big)^2$

We take the derivative w.r.t. $\mathbf{w}$:

$\frac{\partial}{\partial \mathbf{w}} \sum_i \big( y_i - \mathbf{w}^T \phi(\mathbf{x}_i) \big)^2 = -2 \sum_i \big( y_i - \mathbf{w}^T \phi(\mathbf{x}_i) \big)\, \phi(\mathbf{x}_i)^T$

Equating to 0 we get:

$\sum_i y_i\, \phi(\mathbf{x}_i)^T = \mathbf{w}^T \sum_i \phi(\mathbf{x}_i)\, \phi(\mathbf{x}_i)^T$

  • $\mathbf{w}$ – vector of dimension k+1
  • $\phi(\mathbf{x}_i)$ – vector of dimension k+1
  • $y_i$ – a scalar

13

slide by E. P. Xing

slide-13
SLIDE 13

14

LMS for the general linear regression problem

We take the derivative w.r.t. $\mathbf{w}$ and equate it to 0 (as before):

$\sum_i y_i\, \phi(\mathbf{x}_i)^T = \mathbf{w}^T \sum_i \phi(\mathbf{x}_i)\, \phi(\mathbf{x}_i)^T$

Define:

$\Phi = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_m(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_m(\mathbf{x}_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(\mathbf{x}_n) & \phi_1(\mathbf{x}_n) & \cdots & \phi_m(\mathbf{x}_n) \end{pmatrix}$

Then deriving $\mathbf{w}$ we get:

$\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$

slide by E. P. Xing

slide-14
SLIDE 14

LMS for the general linear regression problem

15

$J(\mathbf{w}) = \sum_i \big( y_i - \mathbf{w}^T \phi(\mathbf{x}_i) \big)^2$

Deriving $\mathbf{w}$ we get:

$\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$

where $\Phi$ is an n × (k+1) matrix, $\mathbf{y}$ is a vector with n entries, and $\mathbf{w}$ is a vector with k+1 entries. This solution is also known as the “pseudo-inverse”.

slide by E. P. Xing
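A compact sketch of this solution with Gaussian basis functions (the data, the basis centres mu, and the width s are assumed choices); np.linalg.pinv is used because it is numerically safer than forming the inverse of Φ^T Φ explicitly:

```python
import numpy as np

def design_matrix(x, mu, s):
    """Phi[i, j] = phi_j(x_i); column 0 is the constant basis phi_0(x) = 1."""
    gauss = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), gauss])

# Hypothetical data and (assumed) basis placement
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
mu = np.linspace(0, 1, 5)            # centres of the Gaussian bases
Phi = design_matrix(x, mu, s=0.2)    # n x (k+1) design matrix

w = np.linalg.pinv(Phi) @ y          # w = (Phi^T Phi)^{-1} Phi^T y via the pseudo-inverse
print(w)
```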

slide-15
SLIDE 15

0th order polynomial

16

slide by Erik Sudderth

slide-16
SLIDE 16

1st order polynomial

17

slide by Erik Sudderth

slide-17
SLIDE 17

3rd order polynomial

18

slide by Erik Sudderth

slide-18
SLIDE 18

9th order polynomial

19

slide by Erik Sudderth

slide-19
SLIDE 19

Which Fit is Best?

20

slide by Sanja Fidler from Bishop

slide-20
SLIDE 20

Root Mean Square (RMS) Error

21

(figures: polynomial fits for M = 0, 1, 3, 9, plotting t against x)

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2$

$E_{RMS} = \sqrt{2 E(\mathbf{w}^\star)/N}$

The division by N allows us to compare different sizes of data sets on an equal footing, and 
 the square root ensures that ERMS is measured on the same scale (and in the same units) as the target variable t

slide by Erik Sudderth
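In code, assuming predictions y_pred from a fitted model and targets t, this is simply:

```python
import numpy as np

def rms_error(y_pred, t):
    """E_RMS = sqrt(2 E(w*) / N), where E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    N = len(t)
    E = 0.5 * np.sum((y_pred - t) ** 2)
    return np.sqrt(2 * E / N)   # equivalently, the square root of the mean squared error
```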

slide-21
SLIDE 21

Root Mean Square (RMS) Error

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big( t_n - \phi(x_n)^T \mathbf{w} \big)^2 = \frac{1}{2} \| \mathbf{t} - \Phi \mathbf{w} \|^2$

22

(figure: training and test E_RMS as a function of the polynomial order M)

slide by Erik Sudderth

slide-22
SLIDE 22

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)

23

slide by Sanja Fidler

(figure: training and test E_RMS as a function of the polynomial order M)

slide-23
SLIDE 23

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Not a problem if we have lots of training examples

24

slide by Sanja Fidler

slide-24
SLIDE 24

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Let’s look at the estimated weights for various M in the case of

fewer examples

25

slide by Sanja Fidler

slide-25
SLIDE 25

Generalization

  • Generalization = model’s ability to predict the held out data
  • What is happening?
  • Our model with M = 9 overfits the data (it models also noise)
  • Let’s look at the estimated weights for various M in the case of fewer examples
  • The weights are becoming huge to compensate for the noise
  • One way of dealing with this is to encourage the weights to be

small (this way no input dimension will have too much influence on prediction). This is called regularization.

26

slide by Sanja Fidler

slide-26
SLIDE 26

1-D regression illustrates key concepts

  • Data fits – is linear model best (model selection)?

− Simplest models do not capture all the important variations (signal) in the data: underfit − More complex model may overfit the training data 
 (fit not only the signal but also the noise in the data), especially if not enough data to constrain model

  • One method of assessing fit:

− test generalization = model’s ability to predict 
 the held out data

  • Optimization is essential: stochastic and batch

iterative approaches; analytic when available

27

slide by Richard Zemel

slide-27
SLIDE 27

Regularized Least Squares

  • A technique to control the overfitting phenomenon
  • Add a penalty term to the error function in order to

discourage the coefficients from reaching large values

28

$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \| \mathbf{w} \|^2$

where $\| \mathbf{w} \|^2 \equiv \mathbf{w}^T \mathbf{w} = w_0^2 + w_1^2 + \ldots + w_M^2$, and the coefficient $\lambda$ governs the relative importance of the regularization term compared with the sum-of-squares error.

This is known as ridge regression, which is minimized by $\mathbf{w} = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$.

slide by Erik Sudderth
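A minimal sketch of the regularized (ridge) solution, assuming a polynomial design matrix built as before; note that in practice the bias weight w0 is often left unpenalized, which this sketch ignores for brevity:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Minimize 1/2 ||t - Phi w||^2 + lam/2 ||w||^2  ->  w = (lam I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Hypothetical example: a 9th-order polynomial fit with and without regularization
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)
Phi = np.vander(x, 10, increasing=True)
w_unreg = ridge_fit(Phi, t, lam=0.0)          # tends to produce huge weights (overfitting)
w_reg = ridge_fit(Phi, t, lam=np.exp(-18))    # ln(lambda) = -18, as in the slides
```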

slide-28
SLIDE 28

The effect of regularization

29

(figures: M = 9 polynomial fits with ln λ = −18 and ln λ = 0, plotting t against x)

slide by Erik Sudderth

slide-29
SLIDE 29

(figure: training and test E_RMS as a function of ln λ)

The effect of regularization

30

            ln λ = −∞     ln λ = −18    ln λ = 0
w⋆0              0.35          0.35        0.13
w⋆1            232.37          4.74       −0.05
w⋆2          −5321.83         −0.77       −0.06
w⋆3          48568.31        −31.97       −0.05
w⋆4        −231639.30         −3.89       −0.03
w⋆5         640042.26         55.28       −0.02
w⋆6       −1061800.52         41.32       −0.01
w⋆7        1042400.18        −45.95       −0.00
w⋆8        −557682.99        −91.53        0.00
w⋆9         125201.43         72.68        0.01

The corresponding coefficients from the fitted polynomials, showing that regularization has the desired effect of reducing the magnitude of the coefficients.

slide by Erik Sudderth

slide-30
SLIDE 30

A more general regularizer

31

$\frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \phi(\mathbf{x}_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$

(figure: contours of the regularization term for q = 0.5, q = 1, q = 2, q = 4)

slide by Richard Zemel

slide-31
SLIDE 31

Machine Learning 
 Methodology

32

slide-32
SLIDE 32

Recap: Regression

  • In regression, labels yi are

continuous

  • Classification/regression are

solved very similarly

  • Everything we have done so

far transfers to classification with very minor changes

  • Error: sum of distances from

examples to the fitted model

33

slide by Olga Veksler


slide-33
SLIDE 33

Training/Test Data Split

  • Talked about splitting data in training/test sets
  • training data is used to fit parameters
  • test data is used to assess how classifier generalizes

to new data

  • What if classifier has “non‐tunable” parameters?
  • a parameter is “non‐tunable” if tuning (or training) it on the training data leads to overfitting
  • Examples:
  • k in kNN classifier
  • number of hidden units in MNN
  • number of hidden layers in MNN
  • etc …

34

slide by Olga Veksler

slide-34
SLIDE 34

Example of Overfitting

  • Want to fit a polynomial machine f (x,w)
  • Instead of fixing polynomial degree, 


make it parameter d

  • learning machine f (x,w,d)
  • Consider just three choices for d
  • degree 1
  • degree 2
  • degree 3 

  • Training error is a bad measure to choose d

− degree 3 is the best according to the training error, but overfits

the data

35

slide by Olga Veksler

x y

slide-35
SLIDE 35

Training/Test Data Split

  • What about test error? Seems appropriate

− degree 2 is the best model according to the test error

  • Except what do we report as the test error now?
  • Test error should be computed on data that was not used for

training at all!

  • Here used “test” data for training, i.e. choosing model

36

slide by Olga Veksler

slide-36
SLIDE 36

Validation data

  • Same question when choosing among several classifiers
  • our polynomial degree example can be looked at as

choosing among 3 classifiers (degree 1, 2, or 3)

  • Solution: split the labeled data into three parts

37

slide by Olga Veksler

Training ≈ 60% Validation ≈ 20% Test ≈ 20%

Training set: used to train the tunable parameters w. Validation set: used to train other parameters, or to select the classifier. Test set: used only to assess final performance.

labeled data
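A minimal sketch of such a split (a random 60/20/20 partition; the proportions follow the slide and the seed is an arbitrary assumption):

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly split labeled data into training / validation / test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test, val, train = idx[:n_test], idx[n_test:n_test + n_val], idx[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])
```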

slide-37
SLIDE 37

Training/Validation

38

slide by Olga Veksler

Training ≈ 60% Validation ≈ 20% Test ≈ 20%

Training error: computed on training examples. Validation error: computed on validation examples. Test error: computed on test examples.

labeled data

slide-38
SLIDE 38

Training/Validation/Test Data

  • Training Data
  • Validation Data
  • d = 2 is chosen
  • Test Data
  • 1.3 test error computed for d = 2

39

slide by Olga Veksler

validation errors: 3.3 (d = 1), 1.8 (d = 2), 3.4 (d = 3)

slide-39
SLIDE 39

Choosing Parameters: Example

  • Need to choose number of hidden units for a MNN
  • The more hidden units, the better can fit training data
  • But at some point we overfit the data

40

slide by Olga Veksler

(figure: training and validation error as a function of the number of basis functions)

slide-40
SLIDE 40

Diagnosing Underfitting/Overfitting

41

slide by Olga Veksler

Underfitting

  • large training error
  • large validation error

Just Right

  • small training error
  • small validation error

Overfitting

  • small training error
  • large validation error
slide-41
SLIDE 41

Fixing Underfitting/Overfitting

  • Fixing Underfitting
  • getting more training examples will not help
  • get more features
  • try a more complex classifier
  • if using MNN, try more hidden units

  • Fixing Overfitting
  • getting more training examples might help
  • try a smaller set of features
  • try a less complex classifier
  • if using MNN, try fewer hidden units

42

slide by Olga Veksler

slide-42
SLIDE 42

Train/Test/Validation Method

  • Good news
  • Very simple

  • Bad news:
  • Wastes data
  • in general, the more data we have, the better are the estimated

parameters

  • we estimate parameters on 40% less data, since 20% removed

for test and 20% for validation data

  • If we have a small dataset our test (validation) set might just

be lucky or unlucky

  • Cross Validation is a method for performance evaluation

that wastes less data

43

slide by Olga Veksler

slide-43
SLIDE 43

Small Dataset

44

slide by Olga Veksler

Linear model: Mean Squared Error = 2.4. Quadratic model: Mean Squared Error = 0.9. Join-the-dots model: Mean Squared Error = 2.2.

(figures: the three fits plotted as y against x)

slide-44
SLIDE 44

LOOCV (Leave-one-out Cross Validation)

45

slide by Olga Veksler

For k = 1 to n

  • 1. Let (xk,yk) be the kth example
  • 2. Temporarily remove (xk,yk)

from the dataset

  • 3. Train on the remaining n-1

examples

  • 4. Note your error on (xk,yk)

When you’ve done all points, report the mean error
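A sketch of the procedure, assuming a 1-D dataset (x, y) and a polynomial model of a fixed degree as the learner:

```python
import numpy as np

def loocv_mse(x, y, degree):
    """Leave-one-out cross-validation MSE for a polynomial of the given degree."""
    errors = []
    for k in range(len(x)):
        mask = np.arange(len(x)) != k               # temporarily remove example k
        w = np.polyfit(x[mask], y[mask], degree)    # train on the remaining n-1 examples
        y_hat = np.polyval(w, x[k])                 # predict the held-out point
        errors.append((y_hat - y[k]) ** 2)          # note the error on (x_k, y_k)
    return np.mean(errors)                          # report the mean error
```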


slide-49
SLIDE 49

LOOCV (Leave-one-out Cross Validation)

50

slide by Olga Veksler

MSE_LOOCV = 2.12

(figures: the leave-one-out fits for the linear model)

slide-50
SLIDE 50

LOOCV for Quadratic Regression

51

slide by Olga Veksler

MSE_LOOCV = 0.96

(figures: the leave-one-out fits for the quadratic model)

slide-51
SLIDE 51

LOOCV for Join the Dots

52

slide by Olga Veksler

MSE_LOOCV = 3.33

(figures: the leave-one-out fits for the join-the-dots model)

slide-52
SLIDE 52

Which kind of Cross Validation?

  • Can we get the best of both worlds?

53

  • Test set. Downside: may give an unreliable estimate of future performance. Upside: cheap.
  • Leave-one-out. Downside: expensive. Upside: doesn’t waste data.

slide by Olga Veksler
slide-53
SLIDE 53

K-Fold Cross Validation

54

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

slide by Olga Veksler

x y

slide-54
SLIDE 54

K-Fold Cross Validation

55

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

slide by Olga Veksler

x y

slide-55
SLIDE 55

K-Fold Cross Validation

56

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

  • For the green partition: train on all points not

in green partition. Find test‐set sum of errors on green points

slide by Olga Veksler

x y

slide-56
SLIDE 56

K-Fold Cross Validation

57

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

  • For the green partition: train on all points not

in green partition. Find test‐set sum of errors on green points

  • For the red partition: train on all points not in

red partition. Find the test‐set sum of errors on red points

slide by Olga Veksler

x y

slide-57
SLIDE 57

K-Fold Cross Validation

58

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

  • For the green partition: train on all points not

in green partition. Find test‐set sum of errors on green points

  • For the red partition: train on all points not in

red partition. Find the test‐set sum of errors on red points
  • Report the mean error

slide by Olga Veksler

x y

Linear Regression
 MSE3FOLD = 2.05
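A sketch of the k-fold procedure for the same polynomial learners (k = 3 as in the example; the random partition and seed are arbitrary assumptions):

```python
import numpy as np

def kfold_mse(x, y, degree, k=3, seed=0):
    """k-fold cross-validation MSE for a polynomial of the given degree."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # randomly break data into k partitions
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)     # train on all points not in this fold
        w = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(w, x[fold]) - y[fold]) ** 2))
    return np.mean(errors)                                # report the mean error
```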

slide-58
SLIDE 58

K-Fold Cross Validation

59

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

  • For the green partition: train on all points not

in green partition. Find test‐set sum of errors on green points

  • For the red partition: train on all points not in

red partition. Find the test‐set sum of errors on red points
  • Report the mean error

slide by Olga Veksler

Quadratic Regression
 MSE3FOLD = 1.1

x y

slide-59
SLIDE 59

K-Fold Cross Validation

60

  • Randomly break the dataset into k partitions
  • In this example, we have k=3 partitions

colored red green and blue

  • For the blue partition: train on all points not

in the blue partition. Find test‐set sum of errors on blue points

  • For the green partition: train on all points not

in green partition. Find test‐set sum of errors on green points

  • For the red partition: train on all points not in

red partition. Find the test‐set sum of errors on red points
  • Report the mean error

slide by Olga Veksler

Join the dots
 MSE3FOLD = 2.93

x y

slide-60
SLIDE 60

Which kind of Cross Validation?

61

  • Test set. Downside: may give an unreliable estimate of future performance. Upside: cheap.
  • Leave-one-out. Downside: expensive. Upside: doesn’t waste data.
  • 10-fold. Downside: wastes 10% of the data and is 10 times more expensive than a test set. Upside: only wastes 10%, and is only 10 times more expensive instead of n times.
  • 3-fold. Downside: wastes more data than 10-fold and is more expensive than a test set. Upside: slightly better than a test set.
  • N-fold. Identical to leave-one-out.

slide by Olga Veksler

slide-61
SLIDE 61

Cross-validation for classification

  • Instead of computing the sum squared

errors on a test set, you should compute...

62

slide by Andrew Moore

slide-62
SLIDE 62

Cross-validation for classification

  • Instead of computing the sum squared

errors on a test set, you should compute…

The total number of misclassifications on a test set

63

slide by Andrew Moore

slide-63
SLIDE 63

Cross-validation for classification

  • Instead of computing the sum squared

errors on a test set, you should compute…

The total number of misclassifications on a test set

64

  • What’s LOOCV of 1-NN?
  • What’s LOOCV of 3-NN?
  • What’s LOOCV of 22-NN?

slide by Andrew Moore

slide-64
SLIDE 64

Cross-validation for classification

  • Choosing k for k‐nearest neighbors
  • Choosing Kernel parameters for SVM
  • Any other “free” parameter of a classifier
  • Choosing Features to use
  • Choosing which classifier to use

65

slide by Andrew Moore

slide-65
SLIDE 65

CV-based Model Selection

  • We’re trying to decide which algorithm to use.
  • We train each machine and make a table...

66

(table: candidate learners f1, f2, f3, f4, f5, f6 and a column for each one’s training error)

slide by Olga Veksler

slide-66
SLIDE 66

CV-based Model Selection

  • We’re trying to decide which algorithm to use.
  • We train each machine and make a table...

67

(table: candidate learners f1 … f6 with their training error and 10-fold CV error)

slide by Olga Veksler

slide-67
SLIDE 67

CV-based Model Selection

  • We’re trying to decide which algorithm to use.
  • We train each machine and make a table...

68

(table: candidate learners f1 … f6 with their training error, 10-fold CV error, and the final choice)

slide by Olga Veksler

slide-68
SLIDE 68

CV-based Model Selection

  • Example: Choosing “k” for a k‐nearest‐neighbor regression.
  • Step 1: Compute LOOCV error for six different model classes:

69

(table: algorithms k = 1 … k = 6 with their training error, 10-fold CV error, and the final choice)

  • Step 2: Choose model that gave the best CV score
  • Train with all the data, and that’s the final model you’ll use

slide by Olga Veksler
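A sketch of the two steps for k-NN regression on 1-D data, reusing a leave-one-out loop (the brute-force distance computation and the averaging of neighbour targets are simplifying assumptions):

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """Plain k-nearest-neighbour regression: average the targets of the k closest points."""
    idx = np.argsort(np.abs(x_train - x_query))[:k]
    return y_train[idx].mean()

def loocv_knn(x, y, k):
    """LOOCV mean squared error for a k-NN regressor."""
    errs = [(knn_predict(np.delete(x, i), np.delete(y, i), x[i], k) - y[i]) ** 2
            for i in range(len(x))]
    return np.mean(errs)

# Step 1: compute the LOOCV error for each candidate model class k = 1 .. 6
# Step 2: choose the k with the best CV score, then retrain on all the data, e.g.
# best_k = min(range(1, 7), key=lambda k: loocv_knn(x, y, k))
```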

slide-69
SLIDE 69

CV-based Model Selection

  • Why stop at k=6?
  • No good reason, except it looked like things were getting

worse as K was increasing

  • Are we guaranteed that a local optimum of K vs LOOCV

will be the global optimum?

  • No, in fact the relationship can be very bumpy
  • What should we do if we are depressed at the expense of doing LOOCV for k = 1 through 1000?
  • Try: k=1, 2, 4, 8, 16, 32, 64, ... ,1024
  • Then do hillclimbing from an initial guess at k

70

slide by Olga Veksler

slide-70
SLIDE 70

Learning Theory: 
 Why ML Works

71

slide-71
SLIDE 71

Computational Learning 
 Theory

  • Entire subfield devoted to the 


mathematical analysis of machine 
 learning algorithms

  • Has led to several practical methods:

− PAC (probably approximately correct) learning 


→ boosting

− VC (Vapnik–Chervonenkis) theory 


→ support vector machines 


72

slide by Eric Eaton

(Annual conference: Conference on Learning Theory, COLT)

slide-72
SLIDE 72

Computational Learning Theory

  • Is learning always possible?
  • How many training examples will I need to do a

good job learning?

  • Is my test performance going to be much worse

than my training performance?

73

The key idea that underlies all these answers is that simple functions generalize well.

adapted from Hal Daume III

slide-73
SLIDE 73

The Role of Theory

  • Theory can serve two roles:

− It can justify and help understand why

common practice works.

− It can also serve to suggest new algorithms

and approaches that turn out to work well in practice.

74

adapted from Hal Daume III

(figure: “theory before” vs. “theory after” practice)

Often, it turns out to be a mix!

slide-74
SLIDE 74

The Role of Theory

  • Practitioners discover something that works

surprisingly well.

  • Theorists figure out why it works and prove

something about it.

− In the process, they make it better or find new

algorithms.

  • Theory can also help you understand what’s

possible and what’s not possible.

75

adapted from Hal Daume III

slide-75
SLIDE 75

Induction is Impossible

  • From an algorithmic perspective, a natural question is

− whether there is an “ultimate” learning algorithm, Aawesome,

that solves the Binary Classification problem.

  • Have you been wasting your time learning about KNN and other methods (Perceptron and decision trees), when Aawesome is out there?

  • What would such an ultimate learning algorithm do?

− Take in a data set D and produce a function f. − No matter what D looks like, this function f should get perfect

classification on all future examples drawn from the same distribution that produced D.

76

adapted from Hal Daume III

slide-76
SLIDE 76

Induction is Impossible

  • From an algorithmic perspective, a natural question is

− whether there is an “ultimate” learning algorithm, Aawesome,

that solves the Binary Classification problem.

  • Have you been wasting your time learning about KNN and other methods (Perceptron and decision trees), when Aawesome is out there?

  • What would such an ultimate learning algorithm do?

− Take in a data set D and produce a function f. − No matter what D looks like, this function f should get perfect

classification on all future examples drawn from the same distribution that produced D.

77

adapted from Hal Daume III

Impossible

slide-77
SLIDE 77

Label Noise

  • Let X = {−1, +1} (i.e., a one-dimensional, binary distribution)

− 80% of data points in this distribution have x = y and 20% don’t.

  • No matter what function your learning algorithm produces,

there’s no way that it can do better than 20% error on this data.

− No Aawesome exists that always achieves an error rate of zero. − The best that we can hope is that the error rate is not “too

large.”

78

adapted from Hal Daume III

D(⟨+1⟩, +1) = 0.4
D(⟨+1⟩, −1) = 0.1
D(⟨−1⟩, −1) = 0.4
D(⟨−1⟩, +1) = 0.1

slide-78
SLIDE 78

Sampling

  • Another source of difficulty comes from the fact that the only access we have to the data distribution is through sampling.

− When trying to learn about a distribution, you only get to

see data points drawn from that distribution.

− You know that “eventually” you will see enough data points

that your sample is representative of the distribution, but it might not happen immediately.

  • For instance, even though a fair coin will come up heads only with probability 1/2, it’s completely plausible that in

a sequence of four coin flips you never see a tails, or perhaps only see one tails.

79

adapted from Hal Daume III

slide-79
SLIDE 79

Induction is Impossible

  • We need to understand that Aawesome will not always

work.

− In particular, if we happen to get a lousy sample of

data from D, we need to allow Aawesome to do something completely unreasonable.

  • We cannot hope that Aawesome will do perfectly, every

time.

80

adapted from Hal Daume III

The best we can reasonably hope of Aawesome is that it will do pretty well, most of the time.

slide-80
SLIDE 80

Probably Approximately Correct 
 (PAC) Learning

  • A formalism based on the realization that

the best we can hope of an algorithm is that

− It does a good job most of the time (probably

approximately correct)

81

adapted from Hal Daume III

slide-81
SLIDE 81

Probably Approximately Correct 
 (PAC) Learning

  • Consider a hypothetical learning algorithm

− We have 10 different binary classification data sets. − For each one, it comes back with functions f1, f2, . . . , f10.

✦ For some reason, whenever you run f4 on a test point, it

crashes your computer. For the other learned functions, their performance on test data is always at most 5% error.

✦ If this situation is guaranteed to happen, then this

hypothetical learning algorithm is a PAC learning algorithm.

✤ It satisfies probably because it only failed in one out of

ten cases, and it’s approximate because it achieved low, but non-zero, error on the remainder of the cases.

82

adapted from Hal Daume III

slide-82
SLIDE 82

PAC Learning

83

adapted from Hal Daume III

Definition 1. An algorithm A is an (ε, δ)-PAC learning algorithm if, for all distributions D: given samples from D, the probability that it returns a “bad function” is at most δ; where a “bad” function is one with a test error rate of more than ε on D.

slide-83
SLIDE 83

PAC Learning

84

adapted from Hal Daume III

Definition: An algorithm A is an efficient (ε, δ)-PAC learning algorithm if it is an (ε, δ)-PAC learning algorithm whose runtime is polynomial in 1/ε and 1/δ.

In other words, if you want your algorithm to achieve 4% error rather than 5%, the runtime required to do so should not go up by an exponential factor!

  • Two notions of efficiency

− Computational complexity: Prefer an algorithm that runs quickly

to one that takes forever

− Sample complexity: The number of examples required for your

algorithm to achieve its goals

slide-84
SLIDE 84

Example: PAC Learning of Conjunctions

  • Data points are binary vectors, for instance x = ⟨0, 1, 1, 0, 1⟩
  • Some Boolean conjunction defines the true labeling of this data 


(e.g. x1 ⋀ x2 ⋀ x5)

  • There is some distribution DX over binary data points (vectors) 


x = ⟨x1, x2, . . . , xD⟩.

  • There is a fixed concept conjunction c that we are trying to learn.
  • There is no noise, so for any example x, its true label is simply 


y = c(x)

  • Example:

− Clearly, the true formula cannot 


include the terms x1 , x2, ¬x3, ¬x4 
 


85

adapted from Hal Daume III

 y   x1  x2  x3  x4
+1    0   0   1   1
+1    0   1   1   1
−1    1   1   0   1

Table 10.1: Data set for learning conjunctions.

slide-85
SLIDE 85

Example: PAC Learning of Conjunctions

f0(x) = x1 ⋀ ¬x1 ⋀ x2 ⋀ ¬x2 ⋀ x3 ⋀ ¬x3 ⋀ x4 ⋀ ¬x4
f1(x) = ¬x1 ⋀ ¬x2 ⋀ x3 ⋀ x4
f2(x) = ¬x1 ⋀ x3 ⋀ x4
f3(x) = ¬x1 ⋀ x3 ⋀ x4

  • After processing an example, it is guaranteed to classify that

example correctly (provided that there is no noise)

  • Computationally very efficient

− Given a data set of N examples in D dimensions, it takes O (ND)

time to process the data. This is linear in the size of the data set.

86

Algorithm 30 BinaryConjunctionTrain(D)
 1: f ← x1 ∧ ¬x1 ∧ x2 ∧ ¬x2 ∧ · · · ∧ xD ∧ ¬xD   // initialize function
 2: for all positive examples (x, +1) in D do
 3:   for d = 1 . . . D do
 4:     if xd = 0 then
 5:       f ← f without term “xd”
 6:     else
 7:       f ← f without term “¬xd”
 8:     end if
 9:   end for
10: end for
11: return f

adapted from Hal Daume III

 y   x1  x2  x3  x4
+1    0   0   1   1
+1    0   1   1   1
−1    1   1   0   1

Table 10.1: Data set for learning conjunctions.

“Throw Out Bad Terms”
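As a sanity check, here is a direct Python transcription of Algorithm 30 (the set-of-literals representation, with 0-based indices such as ('x', 2) for x3 and ('not_x', 2) for ¬x3, is an assumption made for illustration):

```python
def binary_conjunction_train(data):
    """data: list of (x, y) pairs, x a tuple of 0/1 values, y in {+1, -1}."""
    D = len(data[0][0])
    # initialize f to the conjunction of every literal: x1 ^ ~x1 ^ ... ^ xD ^ ~xD
    f = {('x', d) for d in range(D)} | {('not_x', d) for d in range(D)}
    for x, y in data:
        if y != +1:                      # only positive examples are used
            continue
        for d in range(D):
            if x[d] == 0:
                f.discard(('x', d))      # throw out the term "xd"
            else:
                f.discard(('not_x', d))  # throw out the term "~xd"
    return f

# Running it on Table 10.1 yields {('not_x', 0), ('x', 2), ('x', 3)}, i.e. ~x1 ^ x3 ^ x4.
data = [((0, 0, 1, 1), +1), ((0, 1, 1, 1), +1), ((1, 1, 0, 1), -1)]
print(binary_conjunction_train(data))
```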

slide-86
SLIDE 86
  • Is this an efficient (ε, δ)-PAC learning algorithm?
  • What about sample complexity?

− How many examples N do you need to see in order to guarantee that it achieves an error rate of at most ε (in all but δ-many cases)?

− Perhaps N has to be gigantic (like 2^(2D)/ε) to (probably) guarantee a small error.

87

adapted from Hal Daume III

Example: PAC Learning of Conjunctions

(Algorithm 30 and Table 10.1 are repeated from the previous slide, annotated “Throw Out Bad Terms”.)

slide-87
SLIDE 87
  • Prove that the number of samples N required to (probably) achieve a small error is not too big.
  • Sketch of the proof:

− Say there is some term (say ¬x8) that should have been thrown out, but wasn’t.

− If this was the case, then you must not have seen any positive training examples with ¬x8 = 0.

− So examples with ¬x8 = 0 must have low probability (otherwise you would have seen them). So such a thing is not that common.

88

adapted from Hal Daume III

Example: PAC Learning of Conjunctions

(Algorithm 30 and Table 10.1 are repeated from the previous slides, annotated “Throw Out Bad Terms”.)

slide-88
SLIDE 88

Occam’s Razor

  • Simple solutions generalize well
  • The hypothesis class H, is the set of all boolean formulae over D-

many variables.

− The hypothesis class for Boolean conjunctions is finite; the

hypothesis class for linear classifiers is infinite.

− For Occam’s razor, we can only work with finite hypothesis classes.

89

adapted from Hal Daume III

William of Occam 
 (c. 1288 – c. 1348)

“If one can explain a phenomenon without assuming this or that hypothetical entity, then there is no ground for assuming it i.e. that one should always opt for an explanation in terms of the fewest possible number of causes, factors, or variables.”

Theorem 14 (Occam’s Bound). Suppose A is an algorithm that learns a function f from some finite hypothesis class H. Suppose the learned function always gets zero error on the training data. Then, the sample complexity of f is at most log |H|.

slide-89
SLIDE 89

Complexity of Infinite 
 Hypothesis Spaces

  • Occam’s Bound is completely useless when |H| = ∞
  • In our example, instead of representing your hypothesis as 


a Boolean conjunction, represent it as a conjunction of inequalities.

− Instead of having x1 ∧ ¬x2 ∧ x5, you have 


[x1 > 0.2] ∧ [x2 < 0.77] ∧ [x5 < π/4]

− In this representation, for each feature, you need to choose

an inequality (< or >) and a threshold.

− Since the thresholds can be arbitrary real values, there are

now infinitely many possibilities: |H| = 2D×∞ = ∞

90

adapted from Hal Daume III

slide-90
SLIDE 90

Vapnik-Chervonenkis 
 (VC) Dimension

  • A classic measure of complexity of infinite hypothesis classes

based on this intuition.

  • The VC dimension is a very classification-oriented notion of

complexity

− The idea is to look at a finite set of unlabeled examples − no matter how these points were labeled, would we be able to

find a hypothesis that correctly classifies them

  • The idea is that as you add more points, being able to

represent an arbitrary labeling becomes harder and harder.

91

adapted from Hal Daume III

Definition 2. For data drawn from some space X, the VC dimension of a hypothesis space H over X is the maximal K such that: there exists a set X ⊆ X of size |X| = K, such that for all binary labelings of X, there exists a function f ∈ H that matches this labeling.

slide-91
SLIDE 91

VC Dimension Example

  • The first 3 examples show that the class of lines in the plane can

shatter 3 points.

  • However, the last example shows that this class cannot shatter 4

points.

  • Hence the VC dimension of the class of straight lines in the plane is 3.
  • Note that a class of nonlinear curves could shatter four points, and

hence has VC dimension greater than 3.

92

adapted from Trevor Hastie, Robert Tibshirani, Jerome Friedman