

SLIDE 1

RegML 2020 Class 1 Statistical Learning Theory

Lorenzo Rosasco UNIGE-MIT-IIT

SLIDE 2

All starts with DATA

◮ Supervised: $\{(x_1, y_1), \dots, (x_n, y_n)\}$,
◮ Unsupervised: $\{x_1, \dots, x_m\}$,
◮ Semi-supervised: $\{(x_1, y_1), \dots, (x_n, y_n)\} \cup \{x_1, \dots, x_m\}$.

SLIDE 3

Learning from examples

SLIDE 4

Setting for the supervised learning problem

◮ $X \times Y$ probability space, with measure $\rho$.
◮ $S_n = (x_1, y_1), \dots, (x_n, y_n) \sim \rho^n$, i.e. sampled i.i.d.
◮ $L : Y \times Y \to [0, \infty)$, measurable loss function.
◮ Expected risk: $\mathcal{E}(f) = \int_{X \times Y} L(y, f(x)) \, d\rho(x, y)$.

Problem: solve
$$\min_{f : X \to Y} \mathcal{E}(f),$$
given only $S_n$ ($\rho$ fixed, but unknown).
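The expected risk cannot be evaluated, since $\rho$ is unknown; all an algorithm can work with is the empirical average over $S_n$. A minimal numpy sketch of this substitution (the sampling distribution, the target $\sin(\pi x)$, and the square loss are illustrative assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem: rho is known here only so that we can sample from it.
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)

def empirical_risk(f, x, y, loss=lambda y, fx: (y - fx) ** 2):
    """Empirical counterpart of E(f) = integral of L(y, f(x)) d rho(x, y)."""
    return np.mean(loss(y, f(x)))

print(empirical_risk(lambda t: np.sin(np.pi * t), x, y))  # close to the noise variance 0.01
print(empirical_risk(lambda t: np.zeros_like(t), x, y))   # much larger: a poor predictor
```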

SLIDE 5

Data space

◮ $X$: input space
◮ $Y$: output space

SLIDE 6

Input space

$X$ input space:
◮ linear spaces, e.g.
  – vectors,
  – functions,
  – matrices/operators;
◮ “structured” spaces, e.g.
  – strings,
  – probability distributions,
  – graphs.

SLIDE 7

Output space

$Y$ output space:
◮ linear spaces, e.g.
  – $Y = \mathbb{R}$, regression,
  – $Y = \mathbb{R}^T$, multi-task regression,
  – $Y$ Hilbert space, functional regression;
◮ “structured” spaces, e.g.
  – $Y = \{+1, -1\}$, classification,
  – $Y = \{1, \dots, T\}$, multi-class classification,
  – strings,
  – probability distributions,
  – graphs.

SLIDE 8

Probability distribution

Reflects uncertainty and stochasticity of the learning problem:
$$\rho(x, y) = \rho_X(x)\, \rho(y|x),$$
◮ $\rho_X$ marginal distribution on $X$,
◮ $\rho(y|x)$ conditional distribution on $Y$ given $x \in X$.

SLIDE 9

Conditional distribution and noise

[Figure: samples $(x_1, y_1), \dots, (x_5, y_5)$ scattered around the graph of a target function $f_*$]

Regression model:
$$y_i = f_*(x_i) + \epsilon_i,$$
◮ $f_* : X \to Y$ a fixed function,
◮ $\epsilon_1, \dots, \epsilon_n$ zero-mean random variables,
◮ $x_1, \dots, x_n$ random.
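The noise model is easy to instantiate. A minimal sketch of sampling from it (the specific $f_*$, the input distribution, and the Gaussian noise are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    """Fixed (but, to the learner, unknown) target function; an arbitrary choice."""
    return x ** 3 - x

n = 50
x = rng.uniform(-1.5, 1.5, size=n)    # random inputs x_1, ..., x_n
eps = 0.2 * rng.standard_normal(n)    # zero-mean noise eps_1, ..., eps_n
y = f_star(x) + eps                   # observations y_i = f*(x_i) + eps_i
```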

SLIDE 10

Conditional distribution and misclassification

Classification: $\rho(y|x) = \{\rho(1|x), \rho(-1|x)\}$.

[Figure: $\rho(1|x)$ and $\rho(-1|x)$ as functions of $x$; the classes overlap where both are bounded away from 0 and 1]

Noise in classification is the overlap between the classes:
$$\Delta_t = \big\{ x \in X : |\rho(1|x) - \rho(-1|x)| \le t \big\}.$$
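For intuition, the noise region $\Delta_t$ can be computed in a toy one-dimensional model; here two equiprobable classes with Gaussian class-conditional densities (an assumption for illustration, not from the slides):

```python
import numpy as np
from scipy.stats import norm

# Toy model: equal class priors, Gaussian class-conditional densities.
p_plus, p_minus = norm(loc=1.0), norm(loc=-1.0)

x = np.linspace(-4.0, 4.0, 801)
# Bayes' rule with equal priors gives the posterior difference rho(1|x) - rho(-1|x).
diff = (p_plus.pdf(x) - p_minus.pdf(x)) / (p_plus.pdf(x) + p_minus.pdf(x))

t = 0.25
delta_t = x[np.abs(diff) <= t]       # grid points falling in Delta_t
print(delta_t.min(), delta_t.max())  # the overlap is concentrated around x = 0
```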
SLIDE 11

Marginal distribution and sampling

$\rho_X$ takes into account uneven sampling of the input space.

SLIDE 12

Marginal distribution, densities and manifolds

$$p(x) = \frac{d\rho_X(x)}{dx} \quad\to\quad p(x) = \frac{d\rho_X(x)}{d\mathrm{vol}(x)},$$

[Figure: samples drawn from densities supported on and around a low-dimensional manifold]

SLIDE 13

Loss functions

$L : Y \times Y \to [0, \infty)$,
◮ the cost of predicting $f(x)$ in place of $y$,
◮ part of the problem definition: $\mathcal{E}(f) = \int_{X \times Y} L(y, f(x)) \, d\rho(x, y)$,
◮ measures the pointwise error.

SLIDE 14

Losses for regression

$L(y, y') = L(y - y')$:
◮ Square loss: $L(y, y') = (y - y')^2$,
◮ Absolute loss: $L(y, y') = |y - y'|$,
◮ $\epsilon$-insensitive: $L(y, y') = \max(|y - y'| - \epsilon, 0)$.

[Figure: square loss, absolute loss and $\epsilon$-insensitive loss as functions of $y - y'$]
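These formulas translate directly into code; a short sketch (the tube width $\epsilon = 0.1$ is an arbitrary default):

```python
import numpy as np

def square_loss(y, y_pred):
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    return np.abs(y - y_pred)

def eps_insensitive_loss(y, y_pred, eps=0.1):
    # Zero inside the eps-tube around y, linear outside it.
    return np.maximum(np.abs(y - y_pred) - eps, 0.0)
```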
SLIDE 15

Losses for classification

$L(y, y') = L(-yy')$:
◮ 0-1 loss: $L(y, y') = \mathbf{1}_{\{-yy' > 0\}}$,
◮ Square loss: $L(y, y') = (1 - yy')^2$,
◮ Hinge loss: $L(y, y') = \max(1 - yy', 0)$,
◮ Logistic loss: $L(y, y') = \log(1 + \exp(-yy'))$.

[Figure: 0-1 loss, square loss, hinge loss and logistic loss as functions of the margin $yy'$]
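The classification losses are functions of the margin $yy'$ alone; a direct transcription of the formulas above (np.log1p is used for the numerics of $\log(1 + \exp(-yy'))$):

```python
import numpy as np

def zero_one_loss(y, y_pred):
    # 1 exactly when -y * y' > 0, i.e. when the margin is negative.
    return (y * y_pred < 0).astype(float)

def square_loss(y, y_pred):
    return (1.0 - y * y_pred) ** 2

def hinge_loss(y, y_pred):
    return np.maximum(1.0 - y * y_pred, 0.0)

def logistic_loss(y, y_pred):
    return np.log1p(np.exp(-y * y_pred))
```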

SLIDE 16

Losses for structured prediction

Losses are specific to each learning task, e.g.:
◮ Multi-class: square loss, weighted square loss, logistic loss, . . .
◮ Multi-task: weighted square loss, absolute loss, . . .
◮ . . .

SLIDE 17

Expected risk

$$\mathcal{E}(f) = \mathbb{E}[L(y, f(x))] = \int_{X \times Y} L(y, f(x)) \, d\rho(x, y);$$
note that $f \in \mathcal{F}$, where $\mathcal{F} = \{f : X \to Y \mid f \text{ measurable}\}$.

Example: $Y = \{-1, +1\}$, $L(y, f(x)) = \mathbf{1}_{\{-yf(x) > 0\}}$, then
$$\mathcal{E}(f) = P(\{(x, y) \in X \times Y \mid f(x) \neq y\}).$$

SLIDE 18

Target function

$$f_\rho = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathcal{E}(f)$$

can be derived for many loss functions...

SLIDE 19

Target functions in regression

◮ Square loss: $f_\rho(x) = \int_Y y \, d\rho(y|x)$.
◮ Absolute loss: $f_\rho(x) = \operatorname{median} \rho(y|x)$, where $\operatorname{median} p(\cdot) = y$ s.t.
$$\int_{-\infty}^{y} dp(t) = \int_{y}^{+\infty} dp(t).$$
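Both identities can be checked numerically: over a skewed sample, the empirical square risk of a constant prediction is minimized near the sample mean, the empirical absolute risk near the sample median (a sketch; the exponential sample is an arbitrary choice with mean $1$ and median $\log 2$):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=2000)       # skewed, so mean != median

c = np.linspace(0.0, 3.0, 1001)                 # candidate constant predictions
square_risk = np.mean((y[:, None] - c) ** 2, axis=0)
absolute_risk = np.mean(np.abs(y[:, None] - c), axis=0)

print(c[square_risk.argmin()], y.mean())        # both close to 1.0 (the mean)
print(c[absolute_risk.argmin()], np.median(y))  # both close to log 2 ~ 0.69 (the median)
```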

SLIDE 20

Target functions in classification

◮ 0-1 loss: $f_\rho(x) = \operatorname{sign}(\rho(1|x) - \rho(-1|x))$
◮ Square loss: $f_\rho(x) = \rho(1|x) - \rho(-1|x)$
◮ Logistic loss: $f_\rho(x) = \log \dfrac{\rho(1|x)}{\rho(-1|x)}$
◮ Hinge loss: $f_\rho(x) = \operatorname{sign}(\rho(1|x) - \rho(-1|x))$
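The square-loss entry, for instance, follows from the usual conditional-mean computation; spelled out for completeness (a standard one-line derivation, not on the slide):

```latex
\[
\operatorname*{arg\,min}_{a \in \mathbb{R}} \int_Y (y - a)^2 \, d\rho(y|x)
  = \int_Y y \, d\rho(y|x)
  = (+1)\,\rho(1|x) + (-1)\,\rho(-1|x)
  = \rho(1|x) - \rho(-1|x).
\]
```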

SLIDE 21

Learning algorithms

$$S_n \to f_n = f_{S_n}$$

$f_n$ estimates $f_\rho$ given the observed examples $S_n$. How to measure the error of an estimator?

SLIDE 22

Excess risk

Excess risk:
$$\mathcal{E}(f_n) - \inf_{f \in \mathcal{F}} \mathcal{E}(f)$$

Consistency: for any $\epsilon > 0$,
$$\lim_{n \to \infty} P\Big( \mathcal{E}(f_n) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) > \epsilon \Big) = 0.$$

SLIDE 23

Tail bounds, sample complexity and error bound

◮ Tail bounds: for any $\epsilon > 0$, $n \in \mathbb{N}$,
$$P\Big( \mathcal{E}(f_n) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) > \epsilon \Big) \le \delta(n, \mathcal{F}, \epsilon).$$

◮ Sample complexity: for any $\epsilon > 0$, $\delta \in (0, 1]$, when $n \ge n_0(\epsilon, \delta, \mathcal{F})$,
$$P\Big( \mathcal{E}(f_n) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) > \epsilon \Big) \le \delta.$$

◮ Error bounds: for any $\delta \in (0, 1]$, $n \in \mathbb{N}$, with probability at least $1 - \delta$,
$$\mathcal{E}(f_n) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) \le \epsilon(n, \mathcal{F}, \delta).$$
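The three statements package the same tail behavior: inverting $\delta(n, \mathcal{F}, \epsilon)$ with respect to $\epsilon$ yields an error bound, inverting it with respect to $n$ yields a sample complexity. As a concrete instance (a standard Hoeffding-plus-union-bound result for empirical risk minimization over a finite class with a loss bounded in $[0, 1]$; stated as an illustration, not taken from the slides):

```latex
\[
\mathcal{E}(f_n) - \inf_{f \in \mathcal{H}} \mathcal{E}(f)
  \;\le\; 2\sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2n}}
  \qquad \text{with probability at least } 1 - \delta .
\]
```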

SLIDE 24

Error bounds and no free-lunch theorem

Theorem. For any $f_n$, there exists a problem for which
$$\mathbb{E}\Big[ \mathcal{E}(f_n) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) \Big] > 0.$$

SLIDE 25

No free-lunch theorem continued

Theorem. For any $f_n$, there exists a $\rho$ such that
$$\mathbb{E}\Big[ \mathcal{E}(f_n) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) \Big] > 0.$$

Remedy: restrict $\mathcal{F} \to \mathcal{H}$, a hypothesis space.

SLIDE 26

Hypothesis space

$\mathcal{H} \subset \mathcal{F}$. E.g. $X = \mathbb{R}^d$,
$$\mathcal{H} = \Big\{ f(x) = \langle w, x \rangle = \sum_{j=1}^{d} w_j x_j \;\Big|\; w \in \mathbb{R}^d,\ \forall x \in X \Big\},$$
then $\mathcal{H} \simeq \mathbb{R}^d$.

SLIDE 27

Finite dictionaries

$$D = \{\phi_i : X \to \mathbb{R} \mid i = 1, \dots, p\}$$
$$\mathcal{H} = \Big\{ f(x) = \sum_{j=1}^{p} w_j \phi_j(x) \;\Big|\; w_1, \dots, w_p \in \mathbb{R},\ \forall x \in X \Big\}$$
$$f(x) = w^\top \Phi(x), \qquad \Phi(x) = (\phi_1(x), \dots, \phi_p(x))$$
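A minimal numpy sketch of a finite dictionary (monomials, an arbitrary choice) with empirical risk minimization for the square loss, which reduces to least squares on the features $\Phi(x)$; the data-generating target is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 4
def feature_map(x):
    """Phi(x) = (phi_1(x), ..., phi_p(x)) with phi_j(x) = x**j (monomial dictionary)."""
    return np.stack([x ** j for j in range(1, p + 1)], axis=1)

# Training data; the target and the noise level are assumptions for illustration.
n = 100
x = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(n)

Phi = feature_map(x)                            # n x p design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)     # ERM for the square loss over H

f = lambda t: feature_map(t) @ w                # f(x) = w^T Phi(x)
print(np.mean((f(x) - y) ** 2))                 # empirical square risk on S_n
```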

SLIDE 28

This class

Learning theory ingredients:
◮ Data space/distribution
◮ Loss function, risks and target functions
◮ Learning algorithms and error estimates
◮ Hypothesis space

SLIDE 29

Next class

◮ Regularized learning algorithm: penalization
◮ Statistics and computations
◮ Nonparametrics and kernels
