SLIDE 1

MLCC 2017 Regularization Networks I: Linear Models

Lorenzo Rosasco UNIGE-MIT-IIT June 27, 2017

SLIDE 2

About this class

◮ We introduce a class of learning algorithms based on Tikhonov regularization
◮ We study computational aspects of these algorithms.

SLIDE 3

Empirical Risk Minimization (ERM)

◮ Empirical Risk Minimization (ERM): probably the most popular approach to designing learning algorithms.
◮ General idea: consider the empirical error

Ê(f) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, f(xᵢ)),

as a proxy for the expected error

E(f) = E[ℓ(y, f(x))] = ∫ dx dy p(x, y) ℓ(y, f(x)).
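To make the proxy concrete, here is a minimal sketch (not from the slides; the toy data and helper names are mine) computing the empirical error of a linear model under the square loss:

import numpy as np

def empirical_error(w, X, y, loss):
    # Ê(f_w) = (1/n) Σᵢ loss(yᵢ, f_w(xᵢ)) for the linear model f_w(x) = x⊤w
    return np.mean(loss(y, X @ w))

def square_loss(y, t):
    return (y - t) ** 2

# toy data: n = 50 points in d = 3 dimensions, noisy linear labels
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(50)

print(empirical_error(w_true, X, y, square_loss))  # ≈ noise variance, i.e. small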

SLIDE 4

The Expected Risk is Not Computable

Recall that

◮ ℓ measures the price we pay predicting f(x) when the true label is y
◮ E(f) cannot be directly computed, since p(x, y) is unknown

SLIDE 5

From Theory to Algorithms: The Hypothesis Space

To turn the above idea into an actual algorithm, we:

◮ Fix a suitable hypothesis space H
◮ Minimize Ê over H

H should allow feasible computations and be rich, since the complexity of the problem is not known a priori.

SLIDE 6

Example: Space of Linear Functions

The simplest example of H is the space of linear functions:

H = {f : Rᵈ → R : ∃w ∈ Rᵈ such that f(x) = x⊤w, ∀x ∈ Rᵈ}.

◮ Each function f is defined by a vector w
◮ fw(x) = x⊤w

SLIDE 7

A Rich H May Require Regularization

◮ If H is rich enough, solving ERM may cause overfitting (solutions highly dependent on the data)
◮ Regularization techniques restore stability and ensure generalization

SLIDE 8

Tikhonov Regularization

Consider the Tikhonov regularization scheme,

min_{w∈Rᵈ} Ê(fw) + λ‖w‖²   (1)

It describes a large class of methods sometimes called Regularization Networks.
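In code, the objective in (1) adds one term to the empirical error; a sketch reusing the empirical_error helper introduced above (the function name and the λ value are illustrative):

def tikhonov_objective(w, X, y, loss, lam):
    # Ê(f_w) + λ‖w‖², with lam playing the role of λ
    return empirical_error(w, X, y, loss) + lam * np.dot(w, w)

print(tikhonov_objective(w_true, X, y, square_loss, lam=0.1))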

SLIDE 9

The Regularizer

◮ ‖w‖² is called the regularizer
◮ It controls the stability of the solution and prevents overfitting
◮ λ balances the error term and the regularizer

SLIDE 10

Loss Functions

◮ Different loss functions ℓ induce different classes of methods
◮ We will see common aspects and differences across the different loss functions
◮ There exists no general computational scheme to solve Tikhonov regularization
◮ The solution depends on the considered loss function

SLIDE 11

The Regularized Least Squares Algorithm

Regularized Least Squares: Tikhonov regularization

min_{w∈Rᴰ} Ê(fw) + λ‖w‖²,  Ê(fw) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, fw(xᵢ))   (2)

with the square loss function:

ℓ(y, fw(x)) = (y − fw(x))²

We then obtain the RLS optimization problem (linear model):

min_{w∈Rᴰ} (1/n) Σᵢ₌₁ⁿ (yᵢ − w⊤xᵢ)² + λw⊤w,  λ ≥ 0.   (3)

SLIDE 12

Matrix Notation

◮ The n × d matrix Xₙ, whose rows are the input points
◮ The n × 1 vector Yₙ, whose entries are the corresponding outputs

With this notation,

(1/n) Σᵢ₌₁ⁿ (yᵢ − w⊤xᵢ)² = (1/n) ‖Yₙ − Xₙw‖².
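A quick numerical check of this identity, reusing the toy data above (a sketch, not from the slides):

loop_form = np.mean([(y[i] - w_true @ X[i]) ** 2 for i in range(len(y))])
vector_form = np.mean((y - X @ w_true) ** 2)   # (1/n)‖Yₙ − Xₙw‖²
assert np.isclose(loop_form, vector_form)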

SLIDE 13

Gradients of the ER and of the Regularizer

By direct computation,

◮ Gradient of the empirical risk w.r.t. w: −(2/n) Xₙ⊤(Yₙ − Xₙw)
◮ Gradient of the regularizer w.r.t. w: 2w
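These expressions can be verified against finite differences; a minimal sketch (the helper name is mine):

def er_gradient(w, X, y):
    # gradient of (1/n)‖Yₙ − Xₙw‖² : −(2/n) Xₙ⊤(Yₙ − Xₙw)
    return -2.0 / len(y) * X.T @ (y - X @ w)

w0 = np.zeros(3)
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])   # perturb the first coordinate
numeric = (np.mean((y - X @ (w0 + e0)) ** 2)
           - np.mean((y - X @ (w0 - e0)) ** 2)) / (2 * eps)
assert np.isclose(er_gradient(w0, X, y)[0], numeric)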

SLIDE 14

The RLS Solution

Setting the gradient to zero shows that the solution of RLS solves the linear system

(Xₙ⊤Xₙ + λnI) w = Xₙ⊤Yₙ.

λ controls the invertibility of (Xₙ⊤Xₙ + λnI).

SLIDE 15

Choosing the Cholesky Solver

◮ Several methods can be used to solve the above linear system
◮ Cholesky decomposition is the method of choice, since Xₙ⊤Xₙ + λnI is symmetric and positive definite
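Putting slides 14 and 15 together, a minimal RLS solver sketch (scipy's cho_factor/cho_solve perform the Cholesky factorization and the triangular solves; the λ value is illustrative):

from scipy.linalg import cho_factor, cho_solve

def rls_fit(X, y, lam):
    # solve (Xₙ⊤Xₙ + λnI) w = Xₙ⊤Yₙ via Cholesky decomposition
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)   # symmetric and positive definite for λ > 0
    return cho_solve(cho_factor(A), X.T @ y)

w_rls = rls_fit(X, y, lam=0.1)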

SLIDE 16

Time Complexity

Time complexity of the method:

◮ Training: O(nd²) (assuming n ≫ d)
◮ Testing: O(d)

SLIDE 17

Dealing with an Offset

For linear models, especially in low dimensional spaces, it is useful to consider an offset:

w⊤x + b

How to estimate b from data?

SLIDE 18

Idea: Augmenting the Dimension of the Input Space

◮ Simple idea: augment the dimension of the input space, considering x̃ = (x, 1) and w̃ = (w, b).
◮ This is fine if we do not regularize; but if we do, this method tends to prefer linear functions passing through the origin (zero offset), since the regularizer becomes

‖w̃‖² = ‖w‖² + b².

SLIDE 19

Avoiding Penalizing the Offset

We want to regularize considering only ‖w‖², without penalizing the offset.

The modified regularized problem becomes:

min_{(w,b)∈Rᴰ⁺¹} (1/n) Σᵢ₌₁ⁿ (yᵢ − w⊤xᵢ − b)² + λ‖w‖².

SLIDE 20

Solution with Offset: Centering the Data

It can be proved that a solution (w∗, b∗) of the above problem is given by

b∗ = ȳ − x̄⊤w∗

where ȳ = (1/n) Σᵢ₌₁ⁿ yᵢ and x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ.

SLIDE 21

Solution with Offset: Centering the Data

w∗ solves

min_{w∈Rᴰ} (1/n) Σᵢ₌₁ⁿ (yᵢᶜ − w⊤xᵢᶜ)² + λ‖w‖²,

where yᵢᶜ = yᵢ − ȳ and xᵢᶜ = xᵢ − x̄ for all i = 1, . . . , n.

Note: This corresponds to centering the data and then applying the standard RLS algorithm.
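As a sketch, the whole recipe on top of the rls_fit helper above: center the data, solve standard RLS, then recover the offset via b∗ = ȳ − x̄⊤w∗:

def rls_fit_offset(X, y, lam):
    # RLS with an unpenalized offset: center, solve, recover b
    x_bar, y_bar = X.mean(axis=0), y.mean()
    w = rls_fit(X - x_bar, y - y_bar, lam)
    b = y_bar - x_bar @ w               # b∗ = ȳ − x̄⊤w∗
    return w, b

w_star, b_star = rls_fit_offset(X, y, lam=0.1)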

SLIDE 22

Introduction: Regularized Logistic Regression

Regularized logistic regression: Tikhonov regularization

min_{w∈Rᵈ} Ê(fw) + λ‖w‖²,  Ê(fw) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, fw(xᵢ))   (4)

with the logistic loss function:

ℓ(y, fw(x)) = log(1 + e^(−y fw(x)))

SLIDE 23

The Logistic Loss Function

Figure: Plot of the logistic regression loss function

SLIDE 24

Minimization Through Gradient Descent

◮ The logistic loss function is differentiable
◮ The natural candidate for computing a minimizer is the gradient descent (GD) algorithm

SLIDE 25

Regularized Logistic Regression (RLR)

◮ The regularized ERM problem associated with the logistic loss is called regularized logistic regression
◮ Its solution can be computed via gradient descent
◮ Note (written at the GD iterate wₜ₋₁ used below):

∇Ê(fwₜ₋₁) = (1/n) Σᵢ₌₁ⁿ xᵢ (−yᵢ e^(−yᵢ xᵢ⊤wₜ₋₁)) / (1 + e^(−yᵢ xᵢ⊤wₜ₋₁)) = (1/n) Σᵢ₌₁ⁿ xᵢ (−yᵢ) / (1 + e^(yᵢ xᵢ⊤wₜ₋₁))

SLIDE 26

RLR: Gradient Descent Iteration

For w₀ = 0, the GD iteration applied to min_{w∈Rᵈ} Ê(fw) + λ‖w‖² is

wₜ = wₜ₋₁ − γ [ (1/n) Σᵢ₌₁ⁿ xᵢ (−yᵢ) / (1 + e^(yᵢ xᵢ⊤wₜ₋₁)) + 2λwₜ₋₁ ]

for t = 1, . . . , T, where the term in brackets is ∇(Ê(fw) + λ‖w‖²) evaluated at wₜ₋₁.
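A direct transcription of this iteration (a sketch; the step size γ and iteration count T are illustrative choices, and labels are assumed to be in {−1, +1}):

def rlr_fit(X, y, lam, gamma=0.5, T=1000):
    # gradient descent for regularized logistic regression, starting at w₀ = 0
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        grad = np.mean(X * (-y / (1 + np.exp(y * (X @ w))))[:, None], axis=0)
        w = w - gamma * (grad + 2 * lam * w)   # step along −∇(Ê(f_w) + λ‖w‖²)
    return w

y_cls = np.sign(X @ w_true)                    # toy binary labels in {−1, +1}
w_rlr = rlr_fit(X, y_cls, lam=0.01)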

SLIDE 27

Logistic Regression and Confidence Estimation

◮ The solution of logistic regression has a probabilistic interpretation
◮ It can be derived from the following model:

p(1|x) = e^(x⊤w) / (1 + e^(x⊤w)) = h(x⊤w),

where h is called the logistic function.

◮ This can be used to compute a confidence for each prediction
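For instance, a sketch of the resulting confidence estimate (predict_proba is my name, not the slides'):

def predict_proba(X, w):
    # p(1|x) = h(x⊤w), with h(s) = eˢ / (1 + eˢ) = 1 / (1 + e^(−s))
    return 1.0 / (1.0 + np.exp(-(X @ w)))

confidences = predict_proba(X, w_rlr)   # values near 0.5 mean low confidence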

SLIDE 28

Support Vector Machines

Formulation in terms of Tikhonov regularization:

min_{w∈Rᵈ} Ê(fw) + λ‖w‖²,  Ê(fw) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, fw(xᵢ))   (5)

with the hinge loss function:

ℓ(y, fw(x)) = |1 − y fw(x)|₊

[Figure: plot of the hinge loss as a function of y·f(x)]

SLIDE 29

A more classical formulation (linear case)

w∗ = argmin_{w∈Rᵈ} (1/n) Σᵢ₌₁ⁿ |1 − yᵢ w⊤xᵢ|₊ + λ‖w‖²

with λ = 1/C.

SLIDE 30

A more classical formulation (linear case)

w∗ = argmin_{w∈Rᵈ, ξᵢ≥0} ‖w‖² + (C/n) Σᵢ₌₁ⁿ ξᵢ  subject to yᵢ w⊤xᵢ ≥ 1 − ξᵢ ∀i ∈ {1 . . . n}
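The two formulations match with λ = 1/C. Since the hinge loss is not differentiable at 1, gradient descent is replaced by subgradient descent; a sketch on the Tikhonov form (step size and iteration count are illustrative):

def svm_fit(X, y, lam, gamma=0.1, T=1000):
    # subgradient descent on (1/n) Σᵢ |1 − yᵢ w⊤xᵢ|₊ + λ‖w‖²
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        active = y * (X @ w) < 1                         # points with non-zero hinge loss
        subgrad = -(X[active] * y[active, None]).sum(axis=0) / n
        w = w - gamma * (subgrad + 2 * lam * w)
    return w

w_svm = svm_fit(X, y_cls, lam=0.01)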

SLIDE 31

A geometric intuition - classification

In general you have many solutions:

[Figure: two classes of points in the plane, separable by many different lines]

What do you select?

SLIDE 32

A geometric intuition - classification

Intuitively, I would choose an “equidistant” line:

[Figure: the two classes with an equidistant separating line]

SLIDE 33

A geometric intuition - classification

Intuitively, I would choose an “equidistant” line:

[Figure: the two classes with an equidistant separating line]

SLIDE 34

Maximum margin classifier

I want the classifier that

◮ classifies the dataset perfectly
◮ maximizes the distance from its closest examples

[Figure: the maximum margin separating line]

SLIDE 35

Point-Hyperplane distance

How do we express this mathematically? Let w define our separating hyperplane. Each point decomposes as

x = αw + x⊥, with α = x⊤w / ‖w‖² and x⊥ = x − αw.

Point-hyperplane distance: d(x, w) = ‖x − x⊥‖ = |x⊤w| / ‖w‖
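Numerically, a small sketch of this decomposition (helper name is mine):

def point_hyperplane_distance(x, w):
    # x = αw + x⊥ with α = x⊤w/‖w‖²; d(x, w) = ‖x − x⊥‖ = |x⊤w|/‖w‖
    alpha = (x @ w) / (w @ w)
    x_perp = x - alpha * w              # component lying on the hyperplane {z : z⊤w = 0}
    return np.linalg.norm(x - x_perp)

x_test, w_test = np.array([1.0, 2.0, 0.0]), np.array([0.0, 1.0, 0.0])
assert np.isclose(point_hyperplane_distance(x_test, w_test), 2.0)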

SLIDE 36

Margin

A hyperplane w correctly classifies an example (xᵢ, yᵢ) if

◮ yᵢ = 1 and w⊤xᵢ > 0, or
◮ yᵢ = −1 and w⊤xᵢ < 0;

therefore xᵢ is correctly classified iff yᵢ w⊤xᵢ > 0.

Margin: mᵢ = yᵢ w⊤xᵢ

Note that x⊥ = xᵢ − yᵢ mᵢ w / ‖w‖², so that d(xᵢ, w) = mᵢ / ‖w‖ for a correctly classified example.

SLIDE 37

Maximum margin classifier definition

I want the classifier that

◮ classifies the dataset perfectly
◮ maximizes the distance from its closest examples

w∗ = argmax_{w∈Rᵈ} min_{1≤i≤n} d(xᵢ, w)²  subject to mᵢ > 0 ∀i ∈ {1 . . . n}

Calling µ the smallest of the mᵢ, we have

w∗ = argmax_{w∈Rᵈ} min_{1≤i≤n, µ≥0} (xᵢ⊤w)² / ‖w‖²  subject to yᵢ w⊤xᵢ ≥ µ ∀i ∈ {1 . . . n},

that is,

SLIDE 38

Computation of w∗

w∗ = argmax_{w∈Rᵈ} min_{µ≥0} µ² / ‖w‖²  subject to yᵢ w⊤xᵢ ≥ µ ∀i ∈ {1 . . . n}

SLIDE 39

Computation of w∗

w∗ = argmax_{w∈Rᵈ, µ≥0} µ² / ‖w‖²  subject to yᵢ w⊤xᵢ ≥ µ ∀i ∈ {1 . . . n}

Note that if yᵢ w⊤xᵢ ≥ µ, then yᵢ(αw)⊤xᵢ ≥ αµ, and µ²/‖w‖² = (αµ)²/‖αw‖² for any α ≥ 0. Therefore we have to fix the scale, and in particular we choose µ = 1.

SLIDE 40

Computation of w∗

w∗ = argmax_{w∈Rᵈ} 1 / ‖w‖²  subject to yᵢ w⊤xᵢ ≥ 1 ∀i ∈ {1 . . . n}

SLIDE 41

Computation of w∗

w∗ = argmin_{w∈Rᵈ} ‖w‖²  subject to yᵢ w⊤xᵢ ≥ 1 ∀i ∈ {1 . . . n}

SLIDE 42

What if the problem is not separable?

We relax the constraints and penalize the relaxation. The starting point is the separable problem

w∗ = argmin_{w∈Rᵈ} ‖w‖²  subject to yᵢ w⊤xᵢ ≥ 1 ∀i ∈ {1 . . . n}

SLIDE 43

What if the problem is not separable?

We relax the constraints and penalize the relaxation:

w∗ = argmin_{w∈Rᵈ, ξᵢ≥0} ‖w‖² + (C/n) Σᵢ₌₁ⁿ ξᵢ  subject to yᵢ w⊤xᵢ ≥ 1 − ξᵢ ∀i ∈ {1 . . . n}

where C is a penalization parameter for the average error (1/n) Σᵢ₌₁ⁿ ξᵢ.

SLIDE 44

Dual formulation

It can be shown that the solution of the SVM problem is of the form

w = Σᵢ₌₁ⁿ αᵢ yᵢ xᵢ

where the αᵢ are given by the solution of the following quadratic programming problem:

max_{α∈Rⁿ} Σᵢ₌₁ⁿ αᵢ − (1/2) Σᵢ,ⱼ₌₁ⁿ yᵢ yⱼ αᵢ αⱼ xᵢ⊤xⱼ  subject to αᵢ ≥ 0, i = 1, . . . , n

◮ The solution requires estimating n rather than D coefficients
◮ The αᵢ are often sparse. The input points associated with non-zero coefficients are called support vectors
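A small sketch of the dual route for separable toy data: minimize the negative dual objective under αᵢ ≥ 0 with a generic bounded solver (a dedicated QP solver would be the standard choice), then recover w = Σᵢ αᵢ yᵢ xᵢ:

from scipy.optimize import minimize

def svm_dual_fit(X, y):
    # max_α Σᵢ αᵢ − ½ Σᵢⱼ yᵢyⱼαᵢαⱼ xᵢ⊤xⱼ, subject to αᵢ ≥ 0
    n = len(y)
    Q = (y[:, None] * X) @ (y[:, None] * X).T        # Qᵢⱼ = yᵢyⱼ xᵢ⊤xⱼ
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()  # negated dual (to minimize)
    gradient = lambda a: Q @ a - np.ones(n)
    res = minimize(objective, np.zeros(n), jac=gradient, bounds=[(0, None)] * n)
    alpha = res.x
    return (alpha * y) @ X, alpha                    # support vectors: αᵢ > 0

w_dual, alpha = svm_dual_fit(X, y_cls)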

SLIDE 45

Wrapping up

Regularized Empirical Risk Minimization:

w∗ = argmin_{w∈Rᵈ} (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, w⊤xᵢ) + λ‖w‖²

Examples of Regularization Networks:

◮ ℓ(y, t) = (y − t)² (square loss) leads to Least Squares
◮ ℓ(y, t) = log(1 + e^(−yt)) (logistic loss) leads to Logistic Regression
◮ ℓ(y, t) = |1 − yt|₊ (hinge loss) leads to the Maximum Margin Classifier
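In code, the three methods share one objective and differ only in the loss; a closing sketch reusing the tikhonov_objective helper from above:

losses = {
    "least squares": lambda y, t: (y - t) ** 2,                      # square loss
    "logistic regression": lambda y, t: np.log(1 + np.exp(-y * t)),  # logistic loss
    "max margin": lambda y, t: np.maximum(0.0, 1 - y * t),           # hinge loss |1 − yt|₊
}
for name, loss in losses.items():
    print(name, tikhonov_objective(np.zeros(3), X, y_cls, loss, lam=0.1))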

SLIDE 46

Next class

... beyond linear models!