SLIDE 1

Sparse Convex Optimization Methods for Machine Learning

PhD Defense Talk, 2011/10/04, Martin Jaggi

Examiner: Emo Welzl
Co-Examiners: Bernd Gärtner, Elad Hazan, Joachim Giesen, Joachim Buhmann

SLIDE 2

Convex Optimization

D ⊂ R^n

SLIDE 3

min_{x∈D} f(x),  D ⊂ R^n

[Figure: a convex function f(x) over the convex domain D]


SLIDE 7

min_{x∈D} f(x),  D ⊂ R^n

The Linearized Problem:  min_{y∈D} f(x) + ⟨y − x, d_x⟩

Algorithm 1: Greedy on a Compact Convex Set
  Pick an arbitrary starting point x^(0) ∈ D
  for k = 0 … ∞ do
    Let d_x ∈ ∂f(x^(k)) be a subgradient to f at x^(k)
    Compute s := approx. argmin_{y∈D} ⟨y, d_x⟩
    Let α := 2/(k+2)
    Update x^(k+1) := x^(k) + α(s − x^(k))
  end for

Theorem: The algorithm obtains accuracy O(1/k) after k steps.
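As a quick illustration, here is a minimal Python sketch of Algorithm 1. The names `frank_wolfe`, `grad`, and `linear_min` are mine, not the talk's: `grad` stands in for the subgradient oracle and `linear_min` for the (possibly approximate) solver of the linearized subproblem on D.

```python
import numpy as np

def frank_wolfe(grad, linear_min, x0, steps=1000):
    """Minimal sketch of Algorithm 1 (illustrative names, not the talk's).

    grad(x)       -- subgradient oracle: returns some d_x in ∂f(x)
    linear_min(d) -- solves (possibly approximately) s := argmin_{y in D} <y, d>
    """
    x = np.asarray(x0, dtype=float)
    for k in range(steps):
        d_x = grad(x)              # d_x in the subdifferential of f at x^(k)
        s = linear_min(d_x)        # the linearized problem on D
        alpha = 2.0 / (k + 2)      # the known step size alpha := 2/(k+2)
        x = x + alpha * (s - x)    # convex combination, so x stays in D
    return x
```

Note that the only access to D is through `linear_min`; no projection step is ever needed, which is what the comparison on the next slide is about.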
SLIDE 8

min_{x∈D} f(x),  D ⊂ R^n

The Linearized Problem:  min_{y∈D} f(x) + ⟨y − x, d_x⟩

                             Our Method                        Gradient Descent
Cost per step                approx. solve the linearized      projection back to D
                             problem on D
Convergence                  1/k                               1/k
Sparse / low rank solutions  ✓ (depending on the domain)       ✗

SLIDE 9

History & Related Work

                       Domain                         Known     Approx.     Primal-Dual
                                                      Stepsize  Subproblem  Guarantee
Frank & Wolfe 1956     linear inequality constraints  ✗         ✗           ✗
Dunn 1978, 1980        general bounded convex domain  ✗         ✓           ✗
Zhang 2003             convex hulls                   ✗         ✓           ✗
Clarkson 2008, 2010    unit simplex                   ✓         ✗           ✓
Hazan 2008             semidefinite matrices of       ✓         ✓           ✓
                       bounded trace
J., PhD Thesis         general bounded convex domain  ✓         ✓           ✓

SLIDE 10

Sparse Approximation on the unit simplex:  min_{x∈∆_n} f(x),  D := conv({e_i | i ∈ [n]})

Sparsity as a function of the approximation quality ("coresets").

Algorithm [Clarkson SODA '08]:
  for k = 0 … ∞ do
    Let d_x ∈ ∂f(x^(k)) be a subgradient to f at x^(k)
    Compute i := argmin_i (d_x)_i
    Let α := 2/(k+2)
    Update x^(k+1) := x^(k) + α(e_i − x^(k))
  end for

Corollary: The algorithm gives an ε-approximate solution of sparsity O(1/ε).
Lower bound: Ω(1/ε).
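A minimal sketch of this simplex variant in Python, assuming a vertex starting point (my choice, so the iterates stay sparse); the linearized subproblem over ∆_n is solved exactly by picking the coordinate with the smallest subgradient entry:

```python
import numpy as np

def frank_wolfe_simplex(grad, n, steps=1000):
    """Sparse greedy on the unit simplex (a sketch of the Clarkson variant).

    The subproblem argmin_{y in Delta_n} <y, d_x> is solved exactly by the
    vertex e_i with i = argmin_i (d_x)_i.
    """
    x = np.zeros(n)
    x[0] = 1.0                   # start at a vertex, so iterates stay sparse
    for k in range(steps):
        d_x = grad(x)
        i = np.argmin(d_x)       # best simplex vertex e_i
        alpha = 2.0 / (k + 2)
        x *= 1.0 - alpha         # x^(k+1) := x^(k) + alpha (e_i - x^(k))
        x[i] += alpha
    return x                     # after k steps: at most k+1 nonzeros
```

Each step adds at most one new vertex, so the iterate after k steps is supported on at most k+1 coordinates, which is exactly the coreset-style sparsity bound of the corollary.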
SLIDE 11

Applications

  • Smallest enclosing ball
  • Model Predictive Control
  • Linear classifiers (such as Support Vector Machines with ℓ2-loss):
      min_{x∈∆_n} x^T (K + t·1) x
    (a toy usage sketch follows the list)
  • Mean-variance portfolio optimization:
      min_{x∈∆_n} x^T C x − t · b^T x
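For instance, the SVM-style quadratic objective plugs directly into the `frank_wolfe_simplex` sketch from the previous slide. The data below is made up, and I read the "1" in the objective as the identity matrix; both are assumptions for illustration only:

```python
import numpy as np

# Hypothetical PSD kernel matrix K and trade-off parameter t, with "1"
# interpreted as the identity matrix.
rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
K = G @ G.T                      # some PSD kernel matrix (made up)
t = 0.1
Q = K + t * np.eye(5)

# Gradient of x^T Q x is 2 Q x; reuse the simplex sketch from above.
x = frank_wolfe_simplex(lambda x: 2.0 * Q @ x, n=5, steps=200)
print("objective:", x @ Q @ x, "support size:", np.count_nonzero(x))
```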

SLIDE 12

Sparse Approximation on the ℓ1-ball:  min_{‖x‖₁≤1} f(x),  D := conv({±e_i | i ∈ [n]})

Sparsity as a function of the approximation quality ("coresets").

Algorithm:
  for k = 0 … ∞ do
    Let d_x ∈ ∂f(x^(k)) be a subgradient to f at x^(k)
    Compute i := argmax_i |(d_x)_i|, and let s := e_i · sign((−d_x)_i)
    Let α := 2/(k+2)
    Update x^(k+1) := x^(k) + α(s − x^(k))
  end for

Corollary: The algorithm gives an ε-approximate solution of sparsity O(1/ε).
Lower bound: Ω(1/ε).
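The ℓ1-ball variant differs from the simplex one only in the vertex selection, since D now has the signed vertices ±e_i. A minimal sketch, again with my own function name and the origin as starting point:

```python
import numpy as np

def frank_wolfe_l1(grad, n, steps=1000):
    """Sparse greedy on the unit l1-ball, D = conv({+/- e_i}) (a sketch).

    The subproblem is solved by the signed vertex s = -sign((d_x)_i) * e_i
    with i = argmax_i |(d_x)_i|.
    """
    x = np.zeros(n)              # the origin is feasible in the l1-ball
    for k in range(steps):
        d_x = grad(x)
        i = np.argmax(np.abs(d_x))
        s_i = -np.sign(d_x[i])   # vertex with the most negative inner product
        alpha = 2.0 / (k + 2)
        x *= 1.0 - alpha
        x[i] += alpha * s_i      # x^(k+1) := x^(k) + alpha (s - x^(k))
    return x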

SLIDE 13

Applications: Sparse Recovery

  • ℓ1-regularized regression:  min_{‖x‖₁≤t} ‖Ax − b‖₂²  (see the sketch below)
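A toy instance of this, reusing `frank_wolfe_l1` from the previous slide. The ball of radius t is handled by the substitution x = t·z (my choice; one could equally scale the vertices), and the data is fabricated for illustration:

```python
import numpy as np

# min_{||x||_1 <= t} ||Ax - b||_2^2 on made-up sparse-recovery data.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 100))
x_true = np.zeros(100)
x_true[:3] = [1.0, -2.0, 0.5]    # a 3-sparse ground truth
b = A @ x_true
t = 4.0                          # ball radius; ||x_true||_1 = 3.5 <= t

# Objective in z under x = t z is ||A(tz) - b||^2, gradient 2t A^T (A(tz) - b).
grad = lambda z: 2.0 * t * (A.T @ (A @ (t * z) - b))
x = t * frank_wolfe_l1(grad, n=100, steps=500)
print("nonzeros:", np.count_nonzero(x), "residual:", np.linalg.norm(A @ x - b))
```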

SLIDE 14

Low Rank Approximation on the spectahedron:  min_{X∈D} f(X),
  D := conv({vv^T | v ∈ R^n, ‖v‖₂ = 1}) = {X ∈ Sym^{n×n} | X ⪰ 0, Tr(X) = 1}

Algorithm [Hazan LATIN '08]:
  for k = 0 … ∞ do
    Let D_X ∈ ∂f(X^(k)) be a subgradient to f at X^(k)
    Let α := 2/(k+2)
    Compute v := v^(k) = ApproxEV(D_X, αC_f)
    Update X^(k+1) := X^(k) + α(vv^T − X^(k))
  end for

Corollary: The algorithm gives an ε-approximate solution of rank O(1/ε).
Lower bound: Ω(1/ε).
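A dense toy sketch of this spectahedron variant follows. A real implementation would use an approximate eigenvector routine for the smallest eigenvalue (the ApproxEV of the slide, e.g. Lanczos) rather than the full eigendecomposition used here for clarity; the function name and the feasible starting point are my choices:

```python
import numpy as np

def frank_wolfe_spectahedron(grad, n, steps=200):
    """Hazan-style greedy on the spectahedron (a dense toy sketch).

    The subproblem argmin_{Y in D} <Y, D_X> over D = conv({v v^T : ||v||=1})
    is solved by an eigenvector for the smallest eigenvalue of D_X.
    """
    X = np.eye(n) / n                         # feasible start: PSD, trace one
    for k in range(steps):
        D_X = grad(X)                         # (sub)gradient of f at X^(k)
        w, V = np.linalg.eigh(D_X)            # dense stand-in for ApproxEV
        v = V[:, 0]                           # eigenvector, smallest eigenvalue
        alpha = 2.0 / (k + 2)
        X = X + alpha * (np.outer(v, v) - X)  # rank grows by at most 1 per step
    return X
```

Since each step adds a single rank-one term vv^T, the iterate after k steps has rank at most k+1, mirroring the rank-O(1/ε) corollary.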

SLIDE 15

Applications

  • Trace norm regularized problems, e.g. low-rank matrix recovery:  min_{‖X‖_*≤t} f(X)
  • Max norm regularized problems
SLIDE 16

Matrix Factorizations for recommender systems

The Netflix challenge: 17k movies, 500k customers, 100M observed entries (≈ 1%).

[Figure: a partially observed customers × movies rating matrix Y, approximated as Y ≈ UV^T with rank-k factors U and V]

min_{U,V}  Σ_{(i,j)∈Ω} (Y_ij − (UV^T)_ij)²   s.t.  ‖U‖²_Fro + ‖V‖²_Fro = t

is equivalent to [J, Sulovský ICML 2010]:

min_{X⪰0} f(X)   s.t.  Tr(X) = t,   where   X := ( UU^T  UV^T
                                                   VU^T  VV^T )
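The mechanics of the equivalence can be checked numerically: stacking Z = [U; V] gives X = ZZ^T, which is PSD by construction, has trace ‖U‖²_Fro + ‖V‖²_Fro, and carries the prediction UV^T in its off-diagonal block. A sketch on random factors with made-up toy sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
m, p, k = 4, 3, 2                  # movies, customers, rank (toy sizes)
U = rng.standard_normal((m, k))
V = rng.standard_normal((p, k))

Z = np.vstack([U, V])              # stack the factors
X = Z @ Z.T                        # X = [[U U^T, U V^T], [V U^T, V V^T]], PSD

assert np.isclose(np.trace(X), np.sum(U**2) + np.sum(V**2))
assert np.allclose(X[:m, m:], U @ V.T)   # off-diagonal block is U V^T
```

So the trace constraint on X is exactly the Frobenius-norm constraint on the factors, and Hazan's spectahedron algorithm (slide 14, rescaled to trace t) applies directly.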

SLIDE 17

A Simple Alternative: Optimization Duality

The Problem:  min_{x∈D} f(x),  D ⊂ R^n

The Dual:  ω(x) := min_{y∈D} f(x) + ⟨y − x, d_x⟩

Weak Duality:  ω(x) ≤ f(x*) ≤ f(x′)

[Figure: f(x) and its linearization over D, with the duality gap gap(x) between f(x) and ω(x)]
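From the definition of ω, the gap gap(x) = f(x) − ω(x) equals ⟨x − s, d_x⟩ where s is the linearized minimizer, so it falls out of a Frank-Wolfe step for free. A small sketch, with `linear_min` again a hypothetical subproblem solver:

```python
import numpy as np

def duality_gap(x, d_x, linear_min):
    """gap(x) = f(x) - omega(x) = <x - s, d_x>, s the linearized minimizer.

    By weak duality omega(x) <= f(x*), so the gap upper bounds f(x) - f(x*)
    and serves as a stopping certificate.
    """
    s = linear_min(d_x)            # s := argmin_{y in D} <y, d_x>
    return float(np.vdot(x - s, d_x))   # vdot also covers the matrix case
```

On the unit simplex, for example, `linear_min` just returns the vertex e_i with i = argmin_i (d_x)_i, so the gap costs one extra inner product per iteration.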

SLIDE 18

Pathwise Optimization

The Parameterized Problem:  min_{x∈D} f_t(x)

“Better than necessary”:  g_t(x) ≤ ε/2
“Still good enough”:  g_{t′}(x) ≤ ε
“Continuity in the parameter”:  the difference g_{t′}(x) − g_t(x) is ≤ ε/2 whenever |t′ − t| ≤ ε · P_f

[Figure: the optimal value curve f_t(x*_t) over the parameter t, with f_{t′}(x) and ω_{t′}(x) bracketing it near t′]

Theorem: There are O(1/ε) many intervals on which a single x remains an ε-approximate solution, i.e. the approximate solution path is piecewise constant.  [Giesen, J, Laue ESA 2010]
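A purely schematic sketch of the resulting path-tracking loop. The callbacks `solve_to_gap` (a warm-started solver returning an iterate with g_t(x) ≤ ε/2) and `gap_at` (evaluating g_t(x)) are hypothetical, as is the fixed step `dt`; a real tracker would step t by the continuity bound ε · P_f instead:

```python
def approx_solution_path(solve_to_gap, gap_at, t_start, t_end, eps, dt=1e-3):
    """Sketch of pathwise optimization for min_{x in D} f_t(x).

    Each x is computed "better than necessary" (gap <= eps/2) and then kept
    while it is "still good enough" (gap <= eps); the theorem above bounds
    the number of resulting intervals by O(1/eps).
    """
    t, x = t_start, None
    path = []
    while t <= t_end:
        x = solve_to_gap(t, x)                 # warm start from the previous x
        t0 = t
        while t <= t_end and gap_at(t, x) <= eps:
            t += dt                            # advance t while x stays valid
        path.append((t0, t, x))                # one piecewise-constant interval
    return path
```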
SLIDE 19

Applications

  • Smallest enclosing ball of moving points
  • Model Predictive Control
  • SVMs, MKL (with 2 base kernels):  min_{x∈∆_n} x^T (K + t·1) x
  • Mean-variance portfolio optimization:  min_{x∈∆_n} x^T C x − t · b^T x
  • Robust PCA
  • Recommender systems

[Figure: test accuracy as a function of t on the ionosphere and breast-cancer datasets]
SLIDE 20

Thanks

Co-authors: Bernd Gärtner, Joachim Giesen, Soeren Laue, Marek Sulovský

3D visualization: Robert Carnecky