SLIDE 1

INTRODUCTION TO LEARNING

REGULARIZATION METHODS FOR HIGH DIMENSIONAL LEARNING

Francesca Odone and Lorenzo Rosasco
odone@disi.unige.it - lrosasco@mit.edu

June 6, 2011

SLIDE 2

DIFFERENT PROBLEMS IN SUPERVISED LEARNING

In supervised learning we are given a set of input-output pairs (x1, y1), . . . , (xn, yn) that we call a training set.

• Classification: a learning problem with output values taken from a finite unordered set C = {C1, . . . , Ck}. A special case is binary classification, where yi ∈ {−1, 1}.
• Regression: a learning problem whose output values are real, yi ∈ R.
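As a concrete illustration of the two settings (a minimal sketch; the arrays are made up for demonstration):

```python
import numpy as np

# Binary classification: feature vectors with labels y_i in {-1, 1}
X_clf = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.2]])
y_clf = np.array([1, -1, 1])

# Regression: the same kind of inputs, but real-valued outputs y_i
X_reg = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.2]])
y_reg = np.array([0.7, -2.3, 1.1])
```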

SLIDE 3

LEARNING IS INFERENCE

PREDICTIVITY OR GENERALIZATION
Given the data, the goal is to learn how to make decisions/predictions about future data, i.e. data not belonging to the training set. The problem is: avoid overfitting!

SLIDE 4

PREDICTIVITY

Among many possible solutions, how can we choose one that correctly applies to previously unseen data?

SLIDE 8

THE ROLE OF PROBABILITY

In supervised learning we consider the relationship between input and output. The relationship can be stochastic, or deterministic with stochastic noise. If it is entirely unpredictable, no learning takes place (we are not about to learn how to predict lotto numbers!).

SLIDE 9

DATA GENERATED BY A PROBABILITY DISTRIBUTION

We assume that X and Y are two sets of random variables. We consider a training set S = {(x1, y1), . . . , (xn, yn)}, a set of independent, identically distributed samples drawn from the probability distribution on X × Y. The joint and conditional probabilities are related by

p(x, y) = p(y|x) p(x)

p(x, y) is fixed but unknown.
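A training set of this kind can be simulated by first drawing x from the marginal p(x) and then y from the conditional p(y|x). A minimal sketch, where the choice of p(x) (uniform) and p(y|x) (a noisy sine) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_set(n):
    """Draw n i.i.d. pairs from p(x, y) = p(y|x) p(x)."""
    x = rng.uniform(0.0, 1.0, size=n)                          # x_i ~ p(x)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n)   # y_i ~ p(y|x_i)
    return x, y

x_train, y_train = sample_training_set(20)
```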

SLIDE 10

NOISE ...

[Figure: the conditional distribution p(y|x) over Y for a fixed x ∈ X]

The same x can generate different y (according to p(y|x)):
• the underlying process is deterministic, but there is noise in the measurement of y;
• the underlying process is not deterministic;
• the underlying process is deterministic, but only incomplete information is available.

SLIDE 11

...AND SAMPLING

[Figure: a sample of input points x drawn from the marginal p(x)]

EVEN IN A NOISE FREE CASE WE HAVE TO DEAL WITH SAMPLING

The marginal distribution p(x) might model:
• errors in the location of the input points;
• discretization error for a given grid;
• presence or absence of certain input instances.

SLIDE 15

HYPOTHESIS SPACE

Predictivity is a trade-off between the information provided by the training data and the complexity of the solution we are looking for.
The hypothesis space H is the space of functions where we look for our solution.
Supervised learning uses the training data to learn a function f ∈ H, f : X → Y, that can be applied to previously unseen data: ypred = f(xnew)

SLIDE 16

LOSS FUNCTIONS

How do we choose a “good” f ∈ H?

LOSS FUNCTION
In order to measure the goodness of our function f we use a non-negative function V called the loss function. In general, V(f(x), y) denotes the price to pay for associating the prediction f(x) to x when the true output is y.

SLIDE 17

LOSS FUNCTIONS FOR REGRESSION

• The most common is the square loss, or L2 loss: V(f(x), y) = (f(x) − y)²
• Absolute value, or L1 loss: V(f(x), y) = |f(x) − y|
• Vapnik's ε-insensitive loss: V(f(x), y) = (|f(x) − y| − ε)+
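These three losses are direct to implement; a minimal sketch (function names are illustrative):

```python
import numpy as np

def square_loss(fx, y):
    """L2 loss: (f(x) - y)^2."""
    return (fx - y) ** 2

def absolute_loss(fx, y):
    """L1 loss: |f(x) - y|."""
    return np.abs(fx - y)

def eps_insensitive_loss(fx, y, eps=0.1):
    """Vapnik's epsilon-insensitive loss: (|f(x) - y| - eps)_+;
    deviations smaller than eps cost nothing."""
    return np.maximum(np.abs(fx - y) - eps, 0.0)
```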

SLIDE 18

LOSS FUNCTIONS FOR (BINARY) CLASSIFICATION

• The most intuitive one, the 0−1 loss: V(f(x), y) = θ(−y f(x)) (θ is the step function)
• The more tractable hinge loss: V(f(x), y) = (1 − y f(x))+
• And again the square loss, or L2 loss: V(f(x), y) = (f(x) − y)²
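For binary labels y ∈ {−1, 1} these can be sketched as follows (the value assigned at y f(x) = 0 depends on the convention chosen for the step function θ):

```python
import numpy as np

def zero_one_loss(fx, y):
    """0-1 loss: theta(-y * f(x)); here the boundary case counts as an error."""
    return (y * fx <= 0).astype(float)

def hinge_loss(fx, y):
    """Hinge loss: (1 - y * f(x))_+, a convex surrogate of the 0-1 loss."""
    return np.maximum(1.0 - y * fx, 0.0)
```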

SLIDE 20

LEARNING ALGORITHM

LEARNING ALGORITHM
If Z = X × Y, a learning algorithm is a map L : Z^n → H that looks at the training set S and selects from H a function fS : X → Y such that fS(x) ∼ y in a generalizing way.

SLIDE 21

WHAT WE HAVE SEEN SO FAR

We are considering:
• an input space X and an output space Y ⊂ R
• an unknown probability distribution on the product space Z = X × Y: p(x, y)
• a training set of n samples drawn i.i.d. from p: S = {(x1, y1), . . . , (xn, yn)}
• a hypothesis space H, that is, a space of functions f : X → Y
• a learning algorithm, that is, a map L : Z^n → H selecting from H a function fS such that fS(x) ∼ y in a predictive way

SLIDE 22

LEARNING AS RISK MINIMIZATION

Learning means to produce a hypothesis making the expected error or true error small.

Expected error:

I[f] = ∫_{X×Y} V(f(x), y) p(x, y) dx dy

We would like to obtain fH = arg min_{f∈H} I[f]. If the probability density is known, then learning is easy! Unfortunately it is usually fixed but unknown. What we do have is the training set S.
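Although p(x, y) is unknown in practice, when we can sample from it the integral can be approximated by Monte Carlo; a minimal sketch, reusing the illustrative noisy-sine distribution from above with the square loss:

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_risk_mc(f, n_mc=100_000):
    """Monte Carlo estimate of I[f] = E[(f(x) - y)^2] under the
    illustrative distribution (uniform x, noisy sine for y)."""
    x = rng.uniform(0.0, 1.0, size=n_mc)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n_mc)
    return np.mean((f(x) - y) ** 2)

# The regression function itself attains I[f] ~ noise variance (0.01):
print(expected_risk_mc(lambda x: np.sin(2 * np.pi * x)))
```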

SLIDE 23

EMPIRICAL RISK MINIMIZATION (ERM)

Given a loss function V = V(y, f(x)), we define the empirical risk Iemp[f, S] as

Iemp[f, S] = (1/n) Σ_{i=1}^{n} V(f(xi), yi)

ERM PRINCIPLE
The Empirical Risk Minimization (ERM) principle chooses the function fS ∈ H according to the following:

fS = arg min_{f∈H} Iemp[f, S]
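As a toy instance of ERM (a sketch, not an algorithm from the lecture): take H to be the polynomials of degree at most d with the square loss, so that the empirical risk minimizer is given by a least-squares fit:

```python
import numpy as np

def erm_polynomial(x, y, degree):
    """ERM over H = {polynomials of degree <= degree} with the square loss:
    the minimizer of the empirical risk is the least-squares polynomial."""
    f_S = np.poly1d(np.polyfit(x, y, degree))
    emp_risk = np.mean((f_S(x) - y) ** 2)
    return f_S, emp_risk
```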

SLIDE 24

GOOD QUALITIES OF A SOLUTION

For a solution to be useful in the context of learning it must:
• generalize
• be stable (well posed)

SLIDE 25

REMINDER

ILL-POSED PROBLEM
A mathematical problem is well posed in the sense of Hadamard if:
• the solution exists
• the solution is unique
• the solution depends continuously on the data
If a problem is not well posed, it is called ill posed.

SLIDE 26

REMINDER

CONVERGENCE IN PROBABILITY
Let {Xn} be a sequence of bounded random variables. Then

lim_{n→∞} Xn = X  in probability

if, for all ε > 0,

lim_{n→∞} P{|Xn − X| ≥ ε} = 0

SLIDE 27

CONSISTENCY AND GENERALIZATION

A desirable property for fS is consistency:

lim_{n→∞} I[fS] = I[fH]

that is, the expected error of the learned function must converge to the best expected error attainable in H. Consistency guarantees generalization.

SLIDE 28

EMPIRICAL RISK AND GENERALIZATION

The ERM principle approximates I[f] with the empirical risk Iemp[f, S]. How well? We know that the empirical risk converges in probability to the expected risk as the number of examples grows (law of large numbers). That is, for a given f,

lim_{n→∞} Iemp[f, S] = I[f]

ERM and consistency:

lim_{n→∞} min_{f∈H} Iemp[f, S] = min_{f∈H} I[f]

It can be shown that this property holds only under appropriate conditions on the hypothesis space.
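The first limit is easy to check numerically; a minimal sketch under the same illustrative noisy-sine distribution, for a fixed f:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)   # a fixed function, chosen for illustration

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.uniform(0.0, 1.0, size=n)
    y = f(x) + rng.normal(0.0, 0.1, size=n)
    print(n, np.mean((f(x) - y) ** 2))   # empirical risk -> I[f] = 0.01
```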

SLIDE 29

ERM AND ILL-POSEDNESS

Ill-posed problems often arise if one tries to infer general laws from few data:
• the hypothesis space is too large
• there are not enough data

In general ERM leads to ill-posed solutions because the solution:
• may be too complex
• may not be unique
• may change radically when leaving one sample out
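The instability is easy to exhibit; a sketch building on the toy polynomial ERM above (the sizes are arbitrary choices): fit a high-degree polynomial, refit after dropping one sample, and compare the two solutions on new inputs:

```python
import numpy as np

rng = np.random.default_rng(3)
n, degree = 12, 10                      # nearly as many parameters as points
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n)

f_all = np.poly1d(np.polyfit(x, y, degree))           # ERM on all n samples
f_loo = np.poly1d(np.polyfit(x[1:], y[1:], degree))   # one sample left out

x_new = np.linspace(0.0, 1.0, 5)
print(np.abs(f_all(x_new) - f_loo(x_new)))   # typically wildly different
```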

SLIDE 30

THE ERM, GENERALIZATION AND STABILITY

For a solution to be useful in the context of learning it must:
• generalize
• exist, be unique, and be stable (well-posedness)

Well-posedness is important in terms of stability of the solution. In the case of ERM:
• for a fixed training set S, the ERM principle does not in general exhibit generalization
• it often leads to ill-posed solutions

SLIDE 31

REGULARIZATION APPROACH

• Appropriate choices of the hypothesis space guarantee generalization and stability
• The basic idea of regularization is to restore well-posedness and generalization of ERM by constraining the hypothesis space
• Regularization was originally introduced by Tikhonov (1943) in the context of inverse problems

SLIDE 32

IVANOV REGULARIZATION

The direct way consists of minimizing the empirical error while requiring that f lie in a predefined subset of H, a ball of radius A:

IVANOV

min_{f∈H} (1/n) Σ_{i=1}^{n} V(f(xi), yi)   subject to   ||f||²_H ≤ A

where ||f||_H is the norm in the function space H
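For linear least squares the Ivanov problem can be solved explicitly (a sketch under illustrative assumptions): if the unconstrained least-squares solution already satisfies ||w||² ≤ A we are done; otherwise the constraint is active and the solution has the form (XᵀX + nλI)⁻¹Xᵀy for the λ > 0 at which ||w(λ)||² = A, found here by bisection:

```python
import numpy as np

def ivanov_least_squares(X, y, A, tol=1e-10):
    """Minimize (1/n) * ||X @ w - y||^2 subject to ||w||^2 <= A."""
    n, d = X.shape
    w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
    if w_ls @ w_ls <= A:
        return w_ls                                  # constraint inactive

    def w_of(lam):                                   # Tikhonov solution at lam
        return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

    lo, hi = 0.0, 1.0
    while w_of(hi) @ w_of(hi) > A:                   # bracket the boundary
        hi *= 2.0
    while hi - lo > tol:                             # ||w(lam)||^2 decreases in lam
        lam = 0.5 * (lo + hi)
        if w_of(lam) @ w_of(lam) > A:
            lo = lam
        else:
            hi = lam
    return w_of(hi)
```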

SLIDE 33

TIKHONOV REGULARIZATION

An indirect way, a relaxation of the Ivanov formulation (its Lagrangian form):

TIKHONOV

arg min_{f∈H} (1/n) Σ_{i=1}^{n} V(f(xi), yi) + λ||f||²_H

where ||f||_H is the norm in the function space H
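With the square loss and a linear hypothesis space this is ridge regression, whose minimizer has a closed form; a minimal sketch:

```python
import numpy as np

def tikhonov_least_squares(X, y, lam):
    """argmin_w (1/n) * ||X @ w - y||^2 + lam * ||w||^2.
    Setting the gradient to zero gives (X.T X + n*lam*I) w = X.T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
```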

SLIDE 34

REGULARIZATION APPROACH

More in general, a possible way is considering a penalized ERM:

ERR(f) + λ PEN(f)

• λ is a regularization parameter trading off between the two terms
• λ needs to be tuned (this is done by prior knowledge or, more often in learning, using the available data)
• This scheme is general: by using different loss functions we obtain different algorithms
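Tuning λ from the data is typically done on held-out samples; a minimal sketch (the grid and split ratio are arbitrary choices, and it reuses the ridge solver sketched above):

```python
import numpy as np

def choose_lambda(X, y, lambdas, val_fraction=0.3, seed=0):
    """Pick the lambda with the lowest square loss on a held-out split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_fraction * len(y))
    val, tr = idx[:n_val], idx[n_val:]
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        w = tikhonov_least_squares(X[tr], y[tr], lam)   # solver from the sketch above
        err = np.mean((X[val] @ w - y[val]) ** 2)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```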

SLIDE 35

OCCAM’S RAZOR

All things being equal, the simplest solution tends to be the best one (attributed to William of Ockham, 1285 - 1349). For regularization this principle can be interpreted as: choose the penalty term to measure the complexity of H.

SLIDE 36

REGULARIZATION APPROACH

More in general, a possible way is considering a penalized ERM:

ERR(f) + λ PEN(f)

λ is a regularization parameter trading off between the two terms. The penalization term should include some notion of smoothness of f.

The main point is understanding how to choose a norm that encodes a notion of smoothness of the solution. ... we need to formalize this a bit more.

SLIDE 37

APPENDIX: ERROR ANALYSIS

We start by defining the target space T, the space of functions that is assumed to contain the “true” function minimizing the risk:

f0 = arg min_{f∈T} I[f]

We may assume for simplicity that I[f0] = 0.
• approximation error: I[fH] − I[f0]
• generalization error: I[fS] − I[f0]
• sample or estimation error: I[fS] − I[fH]

The error is the sum of the sample error and the approximation error:

I[fS] − I[f0] = (I[fS] − I[fH]) + (I[fH] − I[f0])

A large H leads to a small approximation error; a small H leads to a small sample error.
