MLCC 2018 Statistical Learning: Basic Concepts
Lorenzo Rosasco, UNIGE-MIT-IIT


SLIDE 1

MLCC 2018 Statistical Learning: Basic Concepts

Lorenzo Rosasco UNIGE-MIT-IIT

SLIDE 2

Outline

◮ Learning from Examples
◮ Data Space and Distribution
◮ Loss Function and Expected Risk
◮ Stability, Overfitting and Regularization

MLCC 2017 2

SLIDE 3

Learning from Examples

◮ Machine Learning deals with systems that are trained from data rather than being explicitly programmed
◮ Here we describe the framework considered in statistical learning theory

SLIDE 5

Supervised Learning

The goal of supervised learning is to find an underlying input-output relation f(xnew) ∼ y, given data.

The data, called the training set, is a set of n input-output pairs (examples):

S = {(x1, y1), . . . , (xn, yn)}
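As a concrete illustration (not part of the slides), a training set of this form can be represented as two arrays; numpy is assumed and all values below are made up:

```python
import numpy as np

# A toy training set S = {(x_1, y_1), ..., (x_n, y_n)} with n = 5 examples
# and inputs in R^2 (values are illustrative, not from the slides).
X = np.array([[0.0, 1.0],
              [1.0, 0.5],
              [2.0, -1.0],
              [3.0, 0.0],
              [4.0, 2.0]])                 # inputs x_i, shape (n, D)
y = np.array([1.0, 0.8, -0.2, 0.1, 1.5])   # outputs y_i, shape (n,)

n, D = X.shape
print(n, D)  # n = 5 examples, D = 2 input dimensions
```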

SLIDE 6

We Need a Model to Learn

◮ We consider the approach to machine learning based on the learning from examples paradigm
◮ Goal: given the training set, learn a corresponding I/O relation
◮ We have to postulate the existence of a model for the data
◮ The model should take into account the possible uncertainty in the task and in the data

SLIDE 7

Outline

◮ Learning from Examples
◮ Data Space and Distribution
◮ Loss Function and Expected Risk
◮ Stability, Overfitting and Regularization

SLIDE 8

Data Space

◮ The inputs belong to an input space X; we assume that X ⊆ R^D
◮ The outputs belong to an output space Y, typically a subset of R
◮ The space X × Y is called the data space

SLIDE 9

Examples of Data Space

We consider several possible situations:

◮ Regression: Y ⊆ R
◮ Binary classification: Y = {−1, 1}
◮ Multi-category (multiclass) classification: Y = {1, 2, . . . , T}
◮ . . .

SLIDE 10

Modeling Uncertainty in the Data Space

◮ Assumption: there exists a fixed, unknown distribution p(x, y) according to which the data are identically and independently sampled
◮ The distribution p models different sources of uncertainty
◮ Assumption: p factorizes as p(x, y) = pX(x)p(y|x)
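A minimal sketch of this factorized sampling, for a binary classification task: the marginal pX and the label-noise level below are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Marginal p_X(x): inputs sampled uniformly on [-1, 1] (an illustrative choice).
x = rng.uniform(-1.0, 1.0, size=n)

# Conditional p(y|x): the "clean" label sign(x) is flipped with probability 0.1,
# modeling noise in the output.
flip = rng.random(n) < 0.1
y = np.where(x >= 0, 1, -1) * np.where(flip, -1, 1)

# Each pair (x_i, y_i) is drawn i.i.d. from p(x, y) = p_X(x) p(y|x).
print(x.shape, sorted(np.unique(y)))
```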

SLIDE 12

Marginal and Conditional

p(y|x) can be seen as a form of noise in the output.

Figure: For each input x there is a distribution of possible outputs p(y|x).

The marginal distribution pX(x) models uncertainty in the sampling of the input points.

SLIDE 13

Data Models

◮ In regression, the following model is often considered:

y = f∗(x) + ε

where:
– f∗: fixed unknown (regression) function
– ε: random noise, e.g. Gaussian N(0, σ), σ ∈ [0, ∞)

◮ In classification,

p(1|x) = 1 − p(−1|x), ∀x

Noiseless classification: p(1|x) ∈ {0, 1}, ∀x ∈ X

SLIDE 14

Outline

◮ Learning from Examples
◮ Data Space and Distribution
◮ Loss Function and Expected Risk
◮ Stability, Overfitting and Regularization

SLIDE 15

Loss Function

Goal of learning: estimate the “best” I/O relation (not the whole p(x, y))

◮ We need to fix a loss function ℓ : Y × Y → [0, ∞)
◮ ℓ(y, f(x)) is a point-wise error measure: the cost of predicting f(x) in place of y
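Two standard loss functions, sketched in numpy (the square loss for regression and the 0-1 loss for binary classification with labels in {−1, 1}):

```python
import numpy as np

def square_loss(y, fx):
    # Point-wise square loss, commonly used in regression.
    return (y - fx) ** 2

def zero_one_loss(y, fx):
    # Point-wise 0-1 loss for binary classification with labels in {-1, 1}:
    # cost 1 if the sign of the prediction disagrees with y, 0 otherwise.
    return (np.sign(fx) != y).astype(float)

print(square_loss(1.0, 0.5))                                    # 0.25
print(zero_one_loss(np.array([1, -1]), np.array([0.3, 0.7])))   # [0. 1.]
```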

SLIDE 17

Expected Risk and Target Function

The expected loss (or expected risk)

E(f) = E[ℓ(y, f(x))] = ∫ ℓ(y, f(x)) p(x, y) dx dy

can be seen as a measure of the error on past as well as future data.

Given ℓ and a distribution, the “best” I/O relation is the target function f∗ : X → Y that minimizes the expected risk

SLIDE 19

Learning from Data

◮ The target function f∗ cannot be computed, since p is unknown
◮ The goal of learning is to find an estimator of the target function from data

SLIDE 20

Outline

◮ Learning from Examples
◮ Data Space and Distribution
◮ Loss Function and Expected Risk
◮ Stability, Overfitting and Regularization

SLIDE 24

Learning Algorithms and Generalization

◮ A learning algorithm is a procedure that, given a training set S, computes an estimator fS
◮ An estimator should mimic the target function, in which case we say that it generalizes
◮ More formally, we are interested in an estimator such that the excess expected risk E(fS) − E(f∗) is small

The latter requirement needs some care, since fS depends on the training set and hence is random.

SLIDE 27

Generalization and Consistency

A natural approach is to consider the expectation of the excess expected risk, ES[E(fS) − E(f∗)]

◮ A basic requirement is consistency:

lim_{n→∞} ES[E(fS) − E(f∗)] = 0

◮ Learning rates provide finite sample information: for all ε > 0, if n ≥ n(ε), then ES[E(fS) − E(f∗)] ≤ ε
◮ n(ε) is called the sample complexity

SLIDE 31

Generalization: Fitting and Stability

How to design a good algorithm? Two concepts are key:

◮ Fitting: an estimator should fit the data well
◮ Stability: an estimator should be stable; it should not change much if the data change slightly

SLIDE 32

Generalization: Fitting and Stability

How to design a good algorithm? We say that an algorithm overfits if it fits the data well while being unstable. We say that an algorithm oversmooths if it is stable while disregarding the data.
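A classic way to see both failure modes, sketched under our own illustrative assumptions (noisy sine data, polynomial least squares): a degree-0 fit oversmooths, a very high degree overfits, and an intermediate degree balances the two.

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(n):
    # Noisy samples of an underlying sine function (illustrative choice).
    x = rng.uniform(0, 1, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=n)
    return x, y

x_tr, y_tr = make_data(20)      # small training set
x_te, y_te = make_data(1000)    # large test set as a proxy for expected risk

results = {}
for deg in [0, 3, 15]:
    coef = np.polyfit(x_tr, y_tr, deg)   # least-squares polynomial fit
    tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)   # training error
    te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)   # test error
    results[deg] = (tr, te)
    print(deg, tr, te)
```

Degree 0 has high error on both sets (oversmoothing: stable but ignores the data); degree 15 drives the training error down while its fit varies wildly between datasets (overfitting).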

SLIDE 33

Regularization as a Fitting-Stability Trade-off

◮ Most learning algorithms depend on one (or more) regularization parameters that control the trade-off between data fitting and stability
◮ We broadly refer to this class of approaches as regularization algorithms, our main topic of discussion

SLIDE 34

Wrapping up

In this class, we introduced the basic definitions in statistical learning theory, including the key concepts of overfitting, stability and generalization.

SLIDE 35

Next Class

We will introduce a first basic class of learning methods, namely local methods, and study more formally the fundamental trade-off between overfitting and stability.
