RegML 2020 Class 3 Early Stopping and Spectral Regularization


SLIDE 1

RegML 2020 Class 3 Early Stopping and Spectral Regularization

Lorenzo Rosasco UNIGE-MIT-IIT

SLIDE 2

Learning problem

Solve
$$\min_w E(w), \qquad E(w) = \int d\rho(x, y)\, L(w^\top x, y)$$
given $(x_1, y_1), \ldots, (x_n, y_n)$.

Beyond linear models: non-linear features and kernels.

SLIDE 3

Regularization by penalization

Replace
$$\min_w \hat E(w)$$
by
$$\min_w \underbrace{\hat E(w) + \lambda \|w\|^2}_{\hat E_\lambda(w)}$$

◮ $\hat E(w) = \frac{1}{n}\sum_{i=1}^n L(w^\top x_i, y_i)$
◮ $\lambda > 0$ regularization parameter

SLIDE 4

Loss functions and computational methods

◮ Logistic loss: $\log(1 + e^{-y w^\top x})$
◮ Hinge loss: $|1 - y w^\top x|_+$

$$w_{t+1} = w_t - \gamma_t \nabla \hat E_\lambda(w_t) \quad \ldots$$

SLIDE 5

Square loss

$$(1 - y w^\top x)^2 = (y - w^\top x)^2 \qquad \text{for } y \in \{-1, +1\}$$

SLIDE 6

Square loss

$$(1 - y w^\top x)^2 = (y - w^\top x)^2$$

$$\hat E_\lambda(w) = \hat E(w) + \lambda \|w\|^2 \quad\text{with}\quad \hat E(w) = \frac{1}{n}\|\hat X w - \hat y\|^2$$

◮ $\hat X$: $n \times d$ data matrix
◮ $\hat y$: $n \times 1$ output vector

SLIDE 7

Ridge regression / Tikhonov regression

$$\hat E_\lambda(w) = \frac{1}{n}\|\hat X w - \hat y\|^2 + \lambda \|w\|^2$$

◮ Smooth and strongly convex

$$\nabla \hat E_\lambda(w) = \frac{2}{n}\hat X^\top(\hat X w - \hat y) + 2\lambda w = 0 \;\Longrightarrow\; (\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$$
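A minimal NumPy sketch of this closed form (the toy data and function name are illustrative, not from the slides): form the regularized normal equations and solve them directly.

```python
import numpy as np

# Solve (X^T X + lambda*n*I) w = X^T y, matching the slide's convention
# in which the empirical error carries a 1/n factor.
def ridge_solve(X, y, lam):
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)
w_hat = ridge_solve(X, y, lam=1e-2)
```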

SLIDE 8

Linear systems

$$(\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$$

◮ $O(nd^2)$ to form $\hat X^\top \hat X$
◮ roughly $O(d^3)$ to solve the linear system

SLIDE 9

Representer theorem for square loss

$$f(x) = x^\top w \;\Longrightarrow\; f(x) = \sum_{i=1}^n x^\top x_i\, c_i$$

SLIDE 10

Representer theorem for square loss

$$f(x) = x^\top w \;\Longrightarrow\; f(x) = \sum_{i=1}^n x^\top x_i\, c_i$$

Using the SVD of $\hat X$ ...

$$w = (\hat X^\top \hat X + \lambda n I)^{-1}\hat X^\top \hat y = \hat X^\top \underbrace{(\hat X \hat X^\top + \lambda n I)^{-1}\hat y}_{c} \;\Longrightarrow\; w = \hat X^\top c = \sum_{i=1}^n x_i\, c_i$$
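A quick numerical sanity check of this identity on toy data (names are illustrative): the primal solution and the dual form $\hat X^\top c$ coincide.

```python
import numpy as np

# Primal: w = (X^T X + lam*n*I)^{-1} X^T y
# Dual:   w = X^T c with c = (X X^T + lam*n*I)^{-1} y
rng = np.random.default_rng(0)
n, d, lam = 50, 8, 1e-2
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_primal = np.linalg.solve(X.T @ X + lam * n * np.eye(d), X.T @ y)
c = np.linalg.solve(X @ X.T + lam * n * np.eye(n), y)
w_dual = X.T @ c

assert np.allclose(w_primal, w_dual)
```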

SLIDE 11

Beyond linear models

$$f(x) = x^\top w = \sum_{i=1}^n x^\top x_i\, c_i, \qquad w = (\hat X^\top \hat X + \lambda n I)^{-1}\hat X^\top \hat y, \qquad c = (\hat X \hat X^\top + \lambda n I)^{-1}\hat y$$

◮ non-linear features: $x \mapsto \varphi(x) = (\varphi_1(x), \ldots, \varphi_n(x))$, $f(x) = \varphi(x)^\top w$
◮ non-linear kernels: $\hat X \hat X^\top = \hat K$, $f(x) = \sum_{i=1}^n K(x, x_i)\, c_i$.

SLIDE 12

Interlude: linear systems and stability

$$A w = y, \quad A = \mathrm{diag}(a_1, \ldots, a_d), \quad c = \frac{a_1}{a_d} < \infty \ \text{(condition number)}, \quad w = A^{-1}y, \quad A^{-1} = \mathrm{diag}(a_1^{-1}, \ldots, a_d^{-1})$$

More generally,
$$A = U\Sigma U^\top, \quad \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d), \qquad A^{-1} = U\Sigma^{-1}U^\top, \quad \Sigma^{-1} = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_d^{-1})$$

SLIDE 13

Tikhonov Regularization

$$\frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 \;\to\; \frac{1}{n}\sum_{i=1}^n (y_i - w^\top x_i)^2 + \lambda \|w\|^2$$

$$\hat X^\top \hat X\, w = \hat X^\top \hat y \;\to\; (\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$$

Overfitting and numerical stability.

SLIDE 14

Beyond Tikhonov: TSVD

$$\hat X^\top \hat X = V\Sigma V^\top, \qquad w_M = (\hat X^\top \hat X)^{-1}_M\, \hat X^\top \hat y$$

◮ $(\hat X^\top \hat X)^{-1}_M = V\Sigma_M^{-1}V^\top$
◮ $\Sigma_M^{-1} = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_M^{-1}, 0, \ldots, 0)$

Also known as principal component regression (PCR).
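A minimal TSVD/PCR sketch under the slide's definitions (toy data and names are illustrative): invert $\hat X^\top \hat X$ only on its top-$M$ eigendirections and zero out the rest.

```python
import numpy as np

def tsvd_regression(X, y, M):
    """w_M = (X^T X)^{-1}_M X^T y, keeping the top-M spectral components."""
    s, V = np.linalg.eigh(X.T @ X)   # ascending eigenvalues
    s, V = s[::-1], V[:, ::-1]       # sort descending
    s_inv = np.zeros_like(s)
    s_inv[:M] = 1.0 / s[:M]          # truncated inverse spectrum
    return V @ (s_inv * (V.T @ (X.T @ y)))

# usage: M plays the role of the regularization parameter
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200)
w_5 = tsvd_regression(X, y, M=5)
```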

SLIDE 15

Principal component analysis (PCA)

Dimensionality reduction: $\hat X^\top \hat X = V\Sigma V^\top$. Eigenvectors are directions of
◮ maximum variance
◮ best reconstruction

SLIDE 16

TSVD and PCA

$$\text{TSVD} \;\Leftrightarrow\; \text{PCA} + \text{ERM}$$

Regularization by projection.

SLIDE 17

TSVD/PCR beyond linearity

Non-linear function
$$f(x) = \sum_{i=1}^p w_i\,\varphi_i(x) = \Phi(x)^\top w \quad\text{with}\quad w = (\hat\Phi^\top \hat\Phi)^{-1}_M\, \hat\Phi^\top \hat y$$

Let $\hat\Phi = (\Phi(x_1), \ldots, \Phi(x_n))^\top \in \mathbb{R}^{n \times p}$.

$$\hat\Phi^\top \hat\Phi = V\Sigma V^\top, \qquad (\hat\Phi^\top \hat\Phi)^{-1}_M = V\Sigma_M^{-1}V^\top$$
$$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_p), \qquad \Sigma_M^{-1} = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_M^{-1}, 0, \ldots)$$

SLIDE 18

TSVD/PCR with kernels

$$f(x) = \sum_{i=1}^n K(x, x_i)\, c_i, \qquad c = \hat K^{-1}_M\, \hat y$$

◮ $\hat K_{ij} = K(x_i, x_j)$, $\hat K = U\Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$
◮ $\hat K^{-1}_M = U\Sigma_M^{-1}U^\top$, $\Sigma_M^{-1} = \mathrm{diag}(\sigma_1^{-1}, \ldots, \sigma_M^{-1}, 0, \ldots)$
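The same idea in the kernel setting, as a sketch; the Gaussian kernel below is an assumption, since the slides leave $K$ generic.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def kernel_tsvd_fit(X, y, M, sigma=1.0):
    """c = K^{-1}_M y: invert the kernel matrix on its top-M eigendirections."""
    K = gaussian_kernel(X, X, sigma)
    s, U = np.linalg.eigh(K)          # ascending eigenvalues
    s, U = s[::-1], U[:, ::-1]
    s_inv = np.zeros_like(s)
    s_inv[:M] = 1.0 / s[:M]           # truncated inverse spectrum
    return U @ (s_inv * (U.T @ y))    # coefficients c

def kernel_predict(X_train, c, X_test, sigma=1.0):
    """f(x) = sum_i K(x, x_i) c_i evaluated at the test points."""
    return gaussian_kernel(X_test, X_train, sigma) @ c
```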

SLIDE 19

Early stopping regularization

Another example of regularization: early stopping of an iterative procedure applied to noisy data.

SLIDE 20

Gradient descent for square loss

$$w_{t+1} = w_t - \gamma\, \hat X^\top(\hat X w_t - \hat y), \qquad \sum_{i=1}^n (y_i - w^\top x_i)^2 = \|\hat X w - \hat y\|^2$$

◮ no penalty
◮ stepsize chosen a priori: $\gamma = \dfrac{2}{\|\hat X^\top \hat X\|}$
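A minimal sketch of this iteration with early stopping; the hold-out set used to pick the stopping time is an assumption, since the slides only specify the update and the a priori step size.

```python
import numpy as np

def gd_early_stopping(X, y, X_val, y_val, max_iter=1000):
    """Gradient descent on the unpenalized square loss; return the iterate
    with the smallest hold-out error, i.e. early stopping."""
    n, d = X.shape
    # conservative choice, within the slide's admissible range gamma <= 2/||X^T X||
    gamma = 1.0 / np.linalg.norm(X.T @ X, ord=2)
    w = np.zeros(d)
    best_w, best_err = w.copy(), np.inf
    for t in range(max_iter):
        w = w - gamma * (X.T @ (X @ w - y))      # gradient step, no penalty
        val_err = np.mean((X_val @ w - y_val) ** 2)
        if val_err < best_err:                   # keep the best iterate so far
            best_err, best_w = val_err, w.copy()
    return best_w
```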

SLIDE 21

Early stopping at work

[Figure: fitting on the training set, iteration #1]

SLIDE 22

Early stopping at work

[Figure: fitting on the training set, iteration #2]

SLIDE 23

Early stopping at work

[Figure: fitting on the training set, iteration #7]

SLIDE 24

Early stopping at work

[Figure: fitting on the training set, iteration #5000]

SLIDE 25

Semi-convergence

$$\min_w E(w) \quad\text{vs}\quad \min_w \hat E(w)$$

SLIDE 26

Connection to Tikhonov or TSVD

$$w_{t+1} = w_t - \gamma\, \hat X^\top(\hat X w_t - \hat y) = (I - \gamma \hat X^\top \hat X)\, w_t + \gamma\, \hat X^\top \hat y$$

By induction,
$$w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j\, \hat X^\top \hat y$$
(a truncated power series).

SLIDE 27

Neumann series

$$\gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$$

SLIDE 28

Neumann series

$$\gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$$

◮ for $|a| < 1$:
$$(1 - a)^{-1} = \sum_{j=0}^{\infty} a^j \;\Longrightarrow\; a^{-1} = \sum_{j=0}^{\infty} (1 - a)^j$$

SLIDE 29

Neumann series

$$\gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$$

◮ for $|a| < 1$:
$$(1 - a)^{-1} = \sum_{j=0}^{\infty} a^j \;\Longrightarrow\; a^{-1} = \sum_{j=0}^{\infty} (1 - a)^j$$

◮ $A \in \mathbb{R}^{d \times d}$, $\|A\| < 1$, invertible:
$$A^{-1} = \sum_{j=0}^{\infty} (I - A)^j$$

SLIDE 30

Stable matrix inversion

Truncated Neumann series:
$$(\hat X^\top \hat X)^{-1} = \gamma \sum_{j=0}^{\infty} (I - \gamma \hat X^\top \hat X)^j \;\approx\; \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j$$

compare to
$$(\hat X^\top \hat X)^{-1} \approx (\hat X^\top \hat X + \lambda n I)^{-1}$$
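A small numerical check of this correspondence on toy data (names are illustrative): the gradient descent iterate started from zero equals the truncated Neumann series applied to $\hat X^\top \hat y$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 30, 4, 25
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
gamma = 1.0 / np.linalg.norm(X.T @ X, ord=2)

# gradient descent from w_0 = 0
w = np.zeros(d)
for _ in range(t):
    w = w - gamma * (X.T @ (X @ w - y))

# truncated Neumann series: gamma * sum_{j<t} (I - gamma X^T X)^j X^T y
A = np.eye(d) - gamma * (X.T @ X)
series = sum(np.linalg.matrix_power(A, j) for j in range(t))
w_neumann = gamma * series @ (X.T @ y)

assert np.allclose(w, w_neumann)
```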

SLIDE 31

Spectral filtering

Different instances of the same principle.

◮ Tikhonov: $w_\lambda = (\hat X^\top \hat X + \lambda n I)^{-1}\hat X^\top \hat y$
◮ Early stopping: $w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j\, \hat X^\top \hat y$
◮ TSVD: $w_M = (\hat X^\top \hat X)^{-1}_M\, \hat X^\top \hat y$
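Seen on the spectrum, each method replaces $1/\sigma$ by a regularized filter; a sketch of the three filter functions on an eigenvalue $s$ of $\hat X^\top \hat X$ (the closed form for early stopping follows from summing the geometric series above).

```python
import numpy as np

def tikhonov_filter(s, lam, n):
    """Tikhonov: 1/(s + lam*n) instead of 1/s."""
    return 1.0 / (s + lam * n)

def early_stopping_filter(s, gamma, t):
    """Early stopping: gamma * sum_{j<t} (1 - gamma*s)^j = (1 - (1 - gamma*s)^t) / s."""
    return (1.0 - (1.0 - gamma * s) ** t) / s

def tsvd_filter(s, threshold):
    """TSVD: 1/s on the retained components, 0 on the discarded ones."""
    return np.where(s >= threshold, 1.0 / s, 0.0)
```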

SLIDE 32

Statistics and optimization

$$w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma \hat X^\top \hat X)^j\, \hat X^\top \hat y$$

The difference is in the computations:
$$w_{t+1} = w_t - \gamma\, \hat X^\top(\hat X w_t - \hat y)$$

◮ Tikhonov: $O(nd^2 + d^3)$
◮ TSVD: $O(nd^2 + d^2 M)$
◮ GD: $O(ndt)$

SLIDE 33

Regularization path and warm restart

$$\min_w E(w) \quad\text{vs}\quad \min_w \hat E(w)$$
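A minimal sketch of the warm restart idea (the gradient descent solver and the grid of $\lambda$ values are assumptions, not from the slides): sweep $\lambda$ along the regularization path, initializing each solve at the previous solution.

```python
import numpy as np

def ridge_gd(X, y, lam, w0, n_iter=200):
    """Gradient descent on (1/n)||Xw - y||^2 + lam*||w||^2 starting from w0."""
    n, d = X.shape
    gamma = 0.5 / (np.linalg.norm(X.T @ X, ord=2) / n + lam)  # safe step size
    w = w0.copy()
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y) + 2.0 * lam * w
        w = w - gamma * grad
    return w

def regularization_path(X, y, lambdas):
    """Solve for each lambda, warm-starting from the previous solution."""
    w = np.zeros(X.shape[1])
    path = []
    for lam in sorted(lambdas, reverse=True):   # strong to weak regularization
        w = ridge_gd(X, y, lam, w0=w)           # warm restart
        path.append((lam, w.copy()))
    return path
```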

SLIDE 34

Beyond linear models

Non-linear function
$$f(x) = \sum_{i=1}^p w_i\,\varphi_i(x) = \Phi(x)^\top w$$

◮ Replace $x$ by $\Phi(x) = (\varphi_1(x), \ldots, \varphi_p(x))$
◮ Replace $\hat X$ by $\hat\Phi = (\Phi(x_1), \ldots, \Phi(x_n))^\top \in \mathbb{R}^{n \times p}$

$$w_{t+1} = w_t - \gamma\, \hat\Phi^\top(\hat\Phi w_t - \hat y)$$

Computational cost: $O(npt)$.

SLIDE 35

Early stopping and kernels

$$f(x) = \sum_{i=1}^n K(x_i, x)\, c_i$$

By induction,
$$c_{t+1} = c_t - \gamma(\hat K c_t - \hat y), \qquad \hat K_{ij} = K(x_i, x_j)$$

Computational complexity: $O(n^2 t)$.
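A minimal sketch of this kernel iteration (the Gaussian kernel and the step-size choice are assumptions): the number of iterations $t$ plays the role of the regularization parameter.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def kernel_gd(X, y, t, sigma=1.0):
    """Iterate c_{t+1} = c_t - gamma (K c_t - y), stopping after t steps."""
    K = gaussian_kernel(X, X, sigma)
    gamma = 1.0 / np.linalg.norm(K, ord=2)   # step size from the kernel norm
    c = np.zeros(len(y))
    for _ in range(t):                       # t is the regularization parameter
        c = c - gamma * (K @ c - y)
    return c

def predict(X_train, c, X_test, sigma=1.0):
    """f(x) = sum_i K(x_i, x) c_i evaluated at the test points."""
    return gaussian_kernel(X_test, X_train, sigma) @ c
```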

SLIDE 36

What about other loss functions?

◮ PCA + ERM
◮ Gradient / subgradient descent. Iterations for regularization, not only optimization!

SLIDE 37

Going big...

Bottleneck of kernel methods: memory. Storing $\hat K$ is $O(n^2)$.

SLIDE 38

Approaches to large scale

◮ (Random) features: find $\Phi : X \to \mathbb{R}^M$, with $M \ll n$, s.t. $K(x, x') \approx \Phi(x)^\top \Phi(x')$
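A sketch of one such construction, random Fourier features for the Gaussian kernel (the kernel choice is an assumption; the slides leave $\Phi$ generic).

```python
import numpy as np

def random_fourier_features(X, M, sigma=1.0, seed=0):
    """Return Phi(X) in R^{n x M} with Phi(x)^T Phi(x') ~ exp(-||x-x'||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, M))   # random frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=M)          # random phases
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

# usage: run ridge (or early-stopped GD) on Phi instead of the n x n kernel matrix
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
Phi = random_fourier_features(X, M=200)
K_approx = Phi @ Phi.T   # approximates the Gaussian kernel matrix
```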

SLIDE 39

Approaches to large scale

◮ (Random) features: find $\Phi : X \to \mathbb{R}^M$, with $M \ll n$, s.t. $K(x, x') \approx \Phi(x)^\top \Phi(x')$
◮ Subsampling (Nyström): replace
$$f(x) = \sum_{i=1}^n K(x, x_i)\, c_i \quad\text{by}\quad f(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_i$$
with $\tilde x_i$ subsampled from the training set, $M \ll n$

SLIDE 40

Approaches to large scale

◮ (Random) features: find $\Phi : X \to \mathbb{R}^M$, with $M \ll n$, s.t. $K(x, x') \approx \Phi(x)^\top \Phi(x')$
◮ Subsampling (Nyström): replace
$$f(x) = \sum_{i=1}^n K(x, x_i)\, c_i \quad\text{by}\quad f(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_i$$
with $\tilde x_i$ subsampled from the training set, $M \ll n$
◮ Greedy!
◮ Neural nets
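A minimal Nyström-style sketch (the Gaussian kernel and the regularized least squares solve for the coefficients are assumptions, not from the slides): pick $M$ landmark points from the training set and fit only their coefficients.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def nystrom_fit(X, y, M, lam=1e-3, sigma=1.0, seed=0):
    """Fit f(x) = sum_{i<=M} K(x, x~_i) c_i with landmarks x~_i drawn from X."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=M, replace=False)   # landmark points
    X_M = X[idx]
    K_nM = gaussian_kernel(X, X_M, sigma)             # n x M
    K_MM = gaussian_kernel(X_M, X_M, sigma)           # M x M
    # regularized least squares for the M coefficients
    A = K_nM.T @ K_nM + lam * len(X) * K_MM
    c = np.linalg.solve(A, K_nM.T @ y)
    return X_M, c

def nystrom_predict(X_M, c, X_test, sigma=1.0):
    return gaussian_kernel(X_test, X_M, sigma) @ c
```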

SLIDE 41

This class

Regularization beyond penalization:
◮ Regularization by projection
◮ Regularization by early stopping

SLIDE 42

Next class

Multi-output learning:
◮ Multitask learning
◮ Vector-valued learning
◮ Multiclass learning