SLIDE 1

Iterative Convex Regularization

Lorenzo Rosasco

Universita’ di Genova, Istituto Italiano di Tecnologia, Massachusetts Institute of Technology

Optimization and Statistical Learning Workshop, Les Houches, January 14

  • Ongoing work with S. Villa (IIT-MIT), B.C. Vu (IIT-MIT)

SLIDE 2

Early Stopping

SLIDE 3

Plan

  • part I: introduction to iterative regularization
  • part II: iterative convex regularization: problem and results

[Diagram: Statistics/Estimation, Optimization & …]

SLIDE 4

Linear Inverse Problems

Φw = y,    Φ : H → G linear and bounded

w† = arg min_{Φw=y} R(w)    (Moore-Penrose solution; R strongly convex, lsc)

Examples: (endless list here)

SLIDE 5

Data:  Φw = y

Data Type I:   ‖y − ŷ‖ ≤ δ

Data Type II:  Φ̂ : H → Ĝ,   ‖Φ∗y − Φ̂∗ŷ‖ ≤ δ,   ‖Φ∗Φ − Φ̂∗Φ̂‖ ≤ η

  • Data type I: deterministic/stochastic noise […]
  • Data type II: stochastic noise, statistical learning [R. et al. ’05]; also econometrics, discretized PDEs (?)

SLIDE 6

Learning* as an Inverse Problem

Yi = ⟨w†, Xi⟩ + Ni,    i = 1, . . . , n

Φ∗Φ = E[XXᵀ],    Φ̂∗Φ̂ = (1/n) Σ_{i=1}^{n} Xi Xiᵀ

Φ∗y = E[XY],     Φ̂∗ŷ = (1/n) Σ_{i=1}^{n} Xi Yi

Can be shown to fit Data Type II with δ, η ∼ 1/√n

*Random design regression [De Vito et al. ’05]

Nonparametric extensions via RKHS theory: covariance operators become integral operators
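A minimal NumPy sketch of the empirical quantities above for random design regression; the dimensions, the standard Gaussian design, and the noise level are illustrative assumptions, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10                       # sample size and dimension (illustrative)
w_true = rng.standard_normal(d)      # plays the role of w†

X = rng.standard_normal((n, d))                  # X_i, i = 1, ..., n
y = X @ w_true + 0.1 * rng.standard_normal(n)    # Y_i = <w†, X_i> + N_i

# Empirical versions of the Data Type II quantities
cov_hat = X.T @ X / n    # Φ̂*Φ̂ = (1/n) Σ X_i X_iᵀ,  population version E[XXᵀ] = I here
b_hat = X.T @ y / n      # Φ̂*ŷ  = (1/n) Σ X_i Y_i,   population version E[XY] = w† here

eta = np.linalg.norm(cov_hat - np.eye(d), 2)     # ‖Φ̂*Φ̂ − Φ*Φ‖
delta = np.linalg.norm(b_hat - w_true)           # ‖Φ̂*ŷ − Φ*y‖
print(f"eta = {eta:.3f}, delta = {delta:.3f}, 1/sqrt(n) = {1 / np.sqrt(n):.3f}")
```

With a standard Gaussian design, E[XXᵀ] = I and E[XY] = w†, so the printed perturbations can be compared directly with 1/√n.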

SLIDE 7

Tikhonov Regularization

w† = arg min_{Φw=y} R(w)

ŵλ = arg min_{w∈H} ‖Φ̂w − ŷ‖² + λR(w),    λ ≥ 0

wλ = arg min_{w∈H} ‖Φw − y‖² + λR(w)

  • New trade-offs (?): bias / variance / computations for ŵ_{t,λ}
  • Complexity of model selection?
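For R(w) = ‖w‖² this is ridge regression, with the closed form ŵλ = (Φ̂∗Φ̂ + λI)⁻¹Φ̂∗ŷ (see the next slide). A minimal sketch, on synthetic data with illustrative sizes, showing that every candidate λ costs a separate linear solve, which is the model-selection cost hinted at above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
Phi = rng.standard_normal((n, d)) / np.sqrt(n)         # plays the role of Φ̂
w_true = rng.standard_normal(d)
y_hat = Phi @ w_true + 0.05 * rng.standard_normal(n)   # ŷ = Φ̂ w + noise

# Tikhonov / ridge: ŵλ = (Φ̂*Φ̂ + λI)^{-1} Φ̂*ŷ  -- one linear solve per candidate λ
A, b = Phi.T @ Phi, Phi.T @ y_hat
for lam in [1.0, 0.1, 0.01, 0.001]:
    w_lam = np.linalg.solve(A + lam * np.eye(d), b)
    print(f"lambda = {lam:6.3f}   error to w_true = {np.linalg.norm(w_lam - w_true):.3f}")
```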
SLIDE 8

From Tikhonov Regularization … to Landweber Regularization

R(w) = ‖w‖²:   wλ = (Φ∗Φ + λI)⁻¹Φ∗y,    w† = Φ†y

wt ≈ Σ_{j=0}^{t} (I − Φ∗Φ)^j Φ∗y

wt+1 = wt − Φ∗(Φwt − y)          (exact data)
ŵt+1 = ŵt − Φ̂∗(Φ̂ŵt − ŷ)         (noisy data)
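A minimal NumPy sketch of the Landweber / gradient descent iteration for R(w) = ‖w‖², with an explicit step size τ (the slide's update has no step size, which implicitly assumes ‖Φ‖ ≤ 1); it also checks numerically that t steps from w0 = 0 reproduce the truncated series Σ_{j<t} (I − τΦ∗Φ)^j τΦ∗y. All data are synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 15
Phi = rng.standard_normal((n, d))
y = Phi @ rng.standard_normal(d)

A, b = Phi.T @ Phi, Phi.T @ y
tau = 1.0 / np.linalg.norm(A, 2)     # step size so that the spectrum of I − τΦ*Φ lies in [0, 1]

# Landweber / gradient descent on ‖Φw − y‖², started at w_0 = 0:
#   w_{t+1} = w_t − τ Φ*(Φ w_t − y)
T = 50
w = np.zeros(d)
for _ in range(T):
    w = w - tau * (A @ w - b)

# Equivalently, the truncated series  Σ_{j=0}^{T-1} (I − τΦ*Φ)^j τΦ*y
M, series, term = np.eye(d) - tau * A, np.zeros(d), tau * b
for _ in range(T):
    series += term
    term = M @ term

print(np.allclose(w, series))        # expected: True
```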

SLIDES 9-12

Landweber Regularization aka Gradient Descent

R(w) = ‖w‖²,   wλ = (Φ∗Φ + λI)⁻¹Φ∗y,   w† = Φ†y ≈ Σ_{j=0}^{t} (I − Φ∗Φ)^j Φ∗y

wt+1 = wt − Φ∗(Φwt − y),    ŵt+1 = ŵt − Φ̂∗(Φ̂ŵt − ŷ)

[Plots: data (X, Y) and the fitted function as the iterations proceed]

SLIDES 13-14

Landweber Regularization aka Gradient Descent

R(w) = ‖w‖²,   wλ = (Φ∗Φ + λI)⁻¹Φ∗y,   w† = Φ†y ≈ Σ_{j=0}^{t} (I − Φ∗Φ)^j Φ∗y

wt+1 = wt − Φ∗(Φwt − y),    ŵt+1 = ŵt − Φ̂∗(Φ̂ŵt − ŷ)

[Plots: "Emp Err" and "Val Err" versus the iteration t (log scale); curves ‖ŵt − w†‖ and ‖ŵt − ŵ†‖, with ŵ† = Φ̂†ŷ]

Semi-Convergence
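A minimal sketch of semi-convergence and of early stopping with a hold-out set: run the noisy iteration ŵt+1 = ŵt − τΦ̂∗(Φ̂ŵt − ŷ), watch the validation error first decrease and then increase, and keep the iterate where it is smallest. The data-generating model, the split, and the step size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 60, 200                                   # fewer samples than unknowns: ill-posed
w_true = np.zeros(d); w_true[:10] = 1.0
X = rng.standard_normal((n, d))
y = X @ w_true + 0.5 * rng.standard_normal(n)
Xtr, ytr, Xva, yva = X[:40], y[:40], X[40:], y[40:]   # train / hold-out split

A, b = Xtr.T @ Xtr / 40, Xtr.T @ ytr / 40
tau = 1.0 / np.linalg.norm(A, 2)

w = np.zeros(d)
best_val, best_t = np.inf, 0
for t in range(1, 2001):
    w = w - tau * (A @ w - b)                    # noisy Landweber step
    val = np.mean((Xva @ w - yva) ** 2)          # validation error: decreases, then increases
    if val < best_val:
        best_val, best_t = val, t
print("early-stopping iteration:", best_t, " validation error:", round(best_val, 3))
```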

SLIDE 15

Remarks I

  • History: iteration + semi-convergence [Landweber ’50] … […Nemirovski ’86…]
  • Other iterative approaches, some with acceleration: nu-method/Chebyshev method [Brakhage ’87, Nemirovski Polyak ’84], conjugate gradient [Nemirovski ’86…]…

R(w) = ‖w‖²

Data type I:

  • Deterministic noise [Engl et al. ’96], stochastic noise […, Buhlmann, Yu ’02 (L2 Boosting), Bissantz et al. ’07]
  • Extensions to noise in the operator [Nemirovski ’86, …]
  • Nonlinear problems [Kaltenbacher et al. ’08]
  • Banach spaces [Schuster et al. ’12]
SLIDE 16

Remarks II

R(w) = ‖w‖²

Data type II:

  • Deterministic noise: Landweber and nu-method [De Vito et al. ’06]
  • Stochastic noise/learning: Landweber and nu-method [Ong Canu ’04, R et al. ’04, Yao et al. ’05, Bauer et al. ’06, Caponetto Yao ’07, Raskutti et al. ’13]
  • …also conjugate gradient [Blanchard Cramer ’10]
  • …and incremental gradient, aka multiple passes of SGD [R et al. ’14]
  • …and (convex) loss, subgradient method [Lin, R, Zhou ’15]
  • Works really well in practice [Huang et al. ’14, Perronnin et al. ’13]
  • The regularization “path” comes for free
SLIDE 17

Remarks III

Take-home message: computations/iterations control stability/regularization. New trade-offs?

[Plots: "Emp Err" and "Val Err" versus the iteration t; curves ‖ŵt − w†‖ and ‖ŵt − ŵ†‖, with ŵ† = Φ̂†ŷ]

ŵt+1 = ŵt − Φ̂∗(Φ̂ŵt − ŷ)

Semi-Convergence

SLIDE 18

Can we derive iterative regularization for any (strongly) convex regularization?

SLIDE 19

Plan

  • part I: introduction to iterative regularization
  • part II: iterative convex regularization: problem and results

SLIDE 20

How can I tell the iteration which regularization I want to use?

ŵt+1 = ŵt − Φ̂∗(Φ̂ŵt − ŷ)

w† = arg min_{Φw=y} R(w)

SLIDE 21

Iterative Regularization and Early Stopping

wt = A(w0, . . . , wt−1, Φ, y)

Convergence (exact data):   ‖wt − w†‖ → 0  as  t → ∞

Convergence (noisy data):   ∃ t† = t†(w†, δ, η)  such that  ‖ŵt† − w†‖ → 0  as  (δ, η) → 0

Error bounds:   ∃ t† = t†(w†, δ, η)  such that  ‖ŵt† − w†‖ ≤ ε(w†, δ, η)

  • adaptivity, e.g. via discrepancy or Lepskii principles
SLIDE 22

Dual Forward Backward (DFB)

  • Analogous iteration for noisy data
  • Special case of dual forward-backward splitting [Combettes et al. ’10]…
  • …also a form of augmented Lagrangian method/ADMM [see Beck Teboulle ’14]
  • …also can be shown to be equivalent to linearized Bregmanized operator splitting [Burger, Osher et al. …]
  • Reduces to the Landweber iteration if we consider only the squared norm

R = F + (α/2)‖·‖²,   F convex, lsc,   α > 0

(∀t ∈ N)
    wt = prox_{α⁻¹F}(−α⁻¹Φ∗vt)
    vt+1 = vt + γt(Φwt − y),    γt = α

w† = arg min_{Φw=y} R(w)
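A minimal NumPy sketch of the DFB iteration above for the choice F(w) = λ‖w‖₁, so that R(w) = λ‖w‖₁ + (α/2)‖w‖²; then prox_{α⁻¹F} is componentwise soft-thresholding at level λ/α. The problem data, λ, α, and the normalization ‖Φ‖ ≤ 1 (which makes γt = α a safe step size) are illustrative assumptions:

```python
import numpy as np

def soft_threshold(z, kappa):
    """Prox of kappa * ||.||_1 : componentwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - kappa, 0.0)

rng = np.random.default_rng(3)
n, d = 30, 60
Phi = rng.standard_normal((n, d))
Phi /= np.linalg.norm(Phi, 2)                # normalize so that ‖Φ‖ ≤ 1
w_star = np.zeros(d); w_star[:5] = 1.0       # a sparse solution (illustrative)
y = Phi @ w_star                             # exact data: Φw = y is feasible

lam, alpha = 0.1, 1.0                        # F = lam*||.||_1,  R = F + (alpha/2)*||.||^2
gamma = alpha                                # γt = α, as on the slide

# DFB:  w_t = prox_{F/α}(−Φ* v_t / α),   v_{t+1} = v_t + γ (Φ w_t − y)
v = np.zeros(n)
for _ in range(5000):
    w = soft_threshold(-Phi.T @ v / alpha, lam / alpha)
    v = v + gamma * (Phi @ w - y)

print("residual:", np.linalg.norm(Phi @ w - y), " nonzeros:", int(np.sum(np.abs(w) > 1e-3)))
```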

SLIDE 23

Analysis for Data Type I

[R., Villa, Vu et al. ’14]

‖ŵt − w†‖ ≤ ‖ŵt − wt‖ + ‖wt − w†‖,    ‖wt − w†‖ ≤ ‖v†‖/(α√t)

Theorem. If there exists v† ∈ G such that Φ∗v† ∈ ∂R(w†), then the DFB sequence (wt)t for v0 = 0 satisfies

    ‖wt − w†‖ ≤ ‖v†‖/(α√t)

Proof idea:   (α/2)‖wt − w†‖² ≤ D(vt) − D(v†)

SLIDE 24

Analysis for Data Type I

[R., Villa, Vu et al. ’14]

‖ŵt − w†‖ ≤ ‖ŵt − wt‖ + ‖wt − w†‖ ≤ c δ t + ‖v†‖/(α√t)

Theorem. Let (wt)t, (ŵt)t be the DFB sequences for v̂0 = v0 = 0. Then it holds

    ‖ŵt − wt‖ ≤ 2 t δ ‖Φ‖

SLIDE 25

Analysis for Data Type I

[R., Villa, Vu et al. ’14]

‖ŵt − w†‖ ≤ ‖ŵt − wt‖ + ‖wt − w†‖ ≤ c δ t + ‖v†‖/(α√t)

Balancing the two terms:   t† = c δ^{−2/3}   ⇒   ‖ŵt† − w†‖ ≤ c δ^{1/3}
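The balancing step behind t† can be filled in as follows (a sketch; all constants are absorbed into c):

```latex
\[
\min_{t > 0}\;\Big( c\,\delta\,t \;+\; \tfrac{\|v^\dagger\|}{\alpha\sqrt{t}} \Big):
\qquad
c\,\delta \;-\; \tfrac{\|v^\dagger\|}{2\alpha}\,t^{-3/2} = 0
\;\;\Longrightarrow\;\;
t^\dagger \,\propto\, \delta^{-2/3},
\]
\[
c\,\delta\,t^\dagger \,\sim\, \delta^{1/3},
\qquad
\tfrac{\|v^\dagger\|}{\alpha\sqrt{t^\dagger}} \,\sim\, \delta^{1/3},
\qquad\text{hence}\qquad
\|\hat w_{t^\dagger} - w^\dagger\| \,\le\, c\,\delta^{1/3}.
\]
```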
SLIDE 26

Analysis for Data Type II

[R., Villa, Vu et al. ’14]

‖ŵt − w†‖ ≤ ‖ŵt − wt‖ + ‖wt − w†‖ ≤ (δ + η)(1 + c)^t + ‖v†‖/(α√t)

t̂ = c log √(1/(δ + η))   ⇒   ‖ŵt̂ − w†‖ ≤ c / √(log(1/√(δ + η)))

SLIDE 27

Remarks

  • General convex setting: only weak convergence [Burger, Osher et al. ’09, ’10], no stability results, no strong convergence.
  • Sparsity-based regularization [Osher et al. ’14]
  • No previous results, either on convergence or on error bounds.
  • Directly give results for statistical learning.
  • Acceleration possible, but stability harder to prove (e.g. via dual FISTA, Chambolle Pock…)
  • Polynomial estimates of the variance under stronger conditions (satisfied in certain smooth cases, e.g. Landweber)
  • Connections to the regularization path, e.g. Lasso path/LARS results…

Data Type I / Data Type II

SLIDE 28

Work in Progress

  • Purely convex case: exact penalization result for atomic norms?
  • Analysis under partial smoothness
  • Sharper bounds (high but finite dimension)
  • Truly ill-posed problems
  • (more) Experiments
SLIDE 29

Conclusions

  • Iterative regularization: a viable alternative to Tikhonov regularization for large problems
  • Old (?) trade-offs in ML: computational regularization?
  • A whole new playground: loss, iterations, randomization