RegML 2016 Class 6: Structured Sparsity (Lorenzo Rosasco)


SLIDE 1

RegML 2016 Class 6 Structured sparsity

Lorenzo Rosasco, UNIGE-MIT-IIT, June 30, 2016

SLIDE 2

Exploiting structure

Building blocks of a function can be more structured than single variables.

L.Rosasco, RegML 2016

SLIDE 3

Sparsity

Variables divided into non-overlapping groups.

SLIDE 4

Group sparsity

◮ f(x) = \sum_{j=1}^{d} w_j x_j
◮ w = (\underbrace{w_1, \dots}_{w^{(1)}}, \dots, \underbrace{\dots, w_d}_{w^{(G)}}), the variables partitioned into groups G_1, \dots, G_G
◮ each group G_g has size |G_g|, so w^{(g)} \in \mathbb{R}^{|G_g|}

SLIDE 5

Group sparsity regularization

Regularization exploiting structure:

R_{group}(w) = \sum_{g=1}^{G} \|w^{(g)}\| = \sum_{g=1}^{G} \sqrt{\sum_{j=1}^{|G_g|} \big(w^{(g)}_j\big)^2}

SLIDE 6

Group sparsity regularization

Regularization exploiting structure:

R_{group}(w) = \sum_{g=1}^{G} \|w^{(g)}\| = \sum_{g=1}^{G} \sqrt{\sum_{j=1}^{|G_g|} \big(w^{(g)}_j\big)^2}

Compare to

\sum_{g=1}^{G} \|w^{(g)}\|^2 = \sum_{g=1}^{G} \sum_{j=1}^{|G_g|} \big(w^{(g)}_j\big)^2

SLIDE 7

Group sparsity regularization

Regularization exploiting structure:

R_{group}(w) = \sum_{g=1}^{G} \|w^{(g)}\| = \sum_{g=1}^{G} \sqrt{\sum_{j=1}^{|G_g|} \big(w^{(g)}_j\big)^2}

Compare to

\sum_{g=1}^{G} \|w^{(g)}\|^2 = \sum_{g=1}^{G} \sum_{j=1}^{|G_g|} \big(w^{(g)}_j\big)^2

and

\sum_{g=1}^{G} \|w^{(g)}\|_1 = \sum_{g=1}^{G} \sum_{j=1}^{|G_g|} \big|(w^{(g)})_j\big|

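As a quick numerical illustration of the three penalties above, the following sketch (not from the slides; the group partition and weight values are made up) computes each one with NumPy:

```python
import numpy as np

def group_penalty(w, groups):
    """ell1-ell2 penalty: sum over groups of the Euclidean norm of each block."""
    return sum(np.linalg.norm(w[g]) for g in groups)

def ridge_penalty(w, groups):
    """Sum of squared group norms: equals the squared ell2 norm of w."""
    return sum(np.linalg.norm(w[g]) ** 2 for g in groups)

def l1_penalty(w, groups):
    """Sum of group ell1 norms: equals the ell1 norm of w."""
    return sum(np.abs(w[g]).sum() for g in groups)

# Toy weight vector with two groups, of sizes 2 and 3 (hypothetical values).
groups = [np.array([0, 1]), np.array([2, 3, 4])]
w = np.array([3.0, 4.0, 0.0, 0.0, 2.0])

r_group = group_penalty(w, groups)   # ||(3,4)|| + ||(0,0,2)|| = 5 + 2 = 7
r_ridge = ridge_penalty(w, groups)   # 25 + 4 = 29
r_l1 = l1_penalty(w, groups)         # (3+4) + 2 = 9
```

Note how the group penalty sits between the other two: it is an ℓ1 norm across groups but an ℓ2 norm within each group.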
SLIDE 8

ℓ1 − ℓ2 norm

We take the ℓ2 norm of each group, and then the ℓ1 norm of the resulting vector (\|w^{(1)}\|, \dots, \|w^{(G)}\|):

\sum_{g=1}^{G} \|w^{(g)}\|

SLIDE 9

Group Lasso

\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

◮ reduces to the Lasso if groups have cardinality one

SLIDE 10

Computations

\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \underbrace{\sum_{g=1}^{G} \|w^{(g)}\|}_{\text{non-differentiable}}

Convex and non-smooth, but with composite structure:

w_{t+1} = \mathrm{Prox}_{\gamma\lambda R_{group}}\Big( w_t - \gamma \frac{2}{n} \hat{X}^\top(\hat{X}w_t - \hat{y}) \Big)

SLIDE 11

Block thresholding

It can be shown that

\mathrm{Prox}_{\lambda R_{group}}(w) = \big(\mathrm{Prox}_{\lambda\|\cdot\|}(w^{(1)}), \dots, \mathrm{Prox}_{\lambda\|\cdot\|}(w^{(G)})\big)

\big(\mathrm{Prox}_{\lambda\|\cdot\|}(w^{(g)})\big)_j =
\begin{cases}
w^{(g)}_j - \lambda \dfrac{w^{(g)}_j}{\|w^{(g)}\|} & \|w^{(g)}\| > \lambda \\
0 & \|w^{(g)}\| \le \lambda
\end{cases}

◮ Entire groups of coefficients are set to zero!
◮ Reduces to soft thresholding if groups have cardinality one

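A minimal NumPy sketch of block thresholding and of the resulting proximal gradient iteration (the groups, data, and step size below are made-up illustrations, not from the slides):

```python
import numpy as np

def block_threshold(w, groups, lam):
    """Prox of lam * sum_g ||w^(g)||: shrink each group's norm by lam, zeroing small groups."""
    out = np.zeros_like(w)
    for g in groups:
        norm = np.linalg.norm(w[g])
        if norm > lam:
            out[g] = w[g] * (1.0 - lam / norm)  # keep direction, shrink length
    return out

def group_lasso_step(w, X, y, groups, lam, gamma):
    """One iteration w_{t+1} = Prox_{gamma*lam*R}( w_t - gamma*(2/n) X^T (X w - y) )."""
    n = X.shape[0]
    grad = (2.0 / n) * X.T @ (X @ w - y)
    return block_threshold(w - gamma * grad, groups, gamma * lam)

groups = [np.array([0, 1]), np.array([2, 3])]
w = np.array([3.0, 4.0, 0.3, 0.4])

# First group has norm 5 > 1, so it is shrunk to norm 4;
# second group has norm 0.5 <= 1, so it is zeroed entirely.
shrunk = block_threshold(w, groups, lam=1.0)
```

This is exactly the "entire groups set to zero" behavior: the threshold acts on the group norm, not on individual coefficients.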
SLIDE 12

Other norms

ℓ1 − ℓp norms:

R(w) = \sum_{g=1}^{G} \|w^{(g)}\|_p = \sum_{g=1}^{G} \Big( \sum_{j=1}^{|G_g|} \big|(w^{(g)})_j\big|^p \Big)^{1/p}

SLIDE 13

Overlapping groups

Variables divided into possibly overlapping groups.

SLIDE 14

Regularization with overlapping groups

Group Lasso:

R_{GL}(w) = \sum_{g=1}^{G} \|w^{(g)}\|

SLIDE 15

Regularization with overlapping groups

Group Lasso:

R_{GL}(w) = \sum_{g=1}^{G} \|w^{(g)}\|

→ The selected variables form an intersection of group complements (whole groups are zeroed out, so the support is the complement of a union of groups)

SLIDE 16

Regularization with overlapping groups

Let \bar{w}^{(g)} \in \mathbb{R}^d be equal to w^{(g)} on the group G_g and zero otherwise.

SLIDE 17

Regularization with overlapping groups

Let \bar{w}^{(g)} \in \mathbb{R}^d be equal to w^{(g)} on the group G_g and zero otherwise.

Group Lasso with overlap:

R_{GLO}(w) = \inf\Big\{ \sum_{g=1}^{G} \|w^{(g)}\| \;:\; w^{(1)}, \dots, w^{(G)} \text{ s.t. } w = \sum_{g=1}^{G} \bar{w}^{(g)} \Big\}

SLIDE 18

Regularization with overlapping groups

Let \bar{w}^{(g)} \in \mathbb{R}^d be equal to w^{(g)} on the group G_g and zero otherwise.

Group Lasso with overlap:

R_{GLO}(w) = \inf\Big\{ \sum_{g=1}^{G} \|w^{(g)}\| \;:\; w^{(1)}, \dots, w^{(G)} \text{ s.t. } w = \sum_{g=1}^{G} \bar{w}^{(g)} \Big\}

◮ Multiple ways to write w = \sum_{g=1}^{G} \bar{w}^{(g)}

SLIDE 19

Regularization with overlapping groups

Let \bar{w}^{(g)} \in \mathbb{R}^d be equal to w^{(g)} on the group G_g and zero otherwise.

Group Lasso with overlap:

R_{GLO}(w) = \inf\Big\{ \sum_{g=1}^{G} \|w^{(g)}\| \;:\; w^{(1)}, \dots, w^{(G)} \text{ s.t. } w = \sum_{g=1}^{G} \bar{w}^{(g)} \Big\}

◮ Multiple ways to write w = \sum_{g=1}^{G} \bar{w}^{(g)}
◮ Selected variables are unions of groups!

SLIDE 20

An equivalence

It holds that

\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda R_{GLO}(w) \;\Longleftrightarrow\; \min_{\tilde{w}} \frac{1}{n}\|\tilde{X}\tilde{w} - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

◮ \tilde{X} is the matrix obtained by replicating columns/variables
◮ \tilde{w} = (w^{(1)}, \dots, w^{(G)}), a vector with (non-overlapping!) groups

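The replicated design matrix can be built directly; a sketch (the overlapping groups and the data dimensions below are hypothetical):

```python
import numpy as np

# Hypothetical overlapping groups over d = 4 variables: variable 2 is shared.
groups = [[0, 1, 2], [2, 3]]

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))  # n = 10 samples, d = 4 variables

# X_tilde replicates the columns of each group side by side,
# so the shared variable 2 appears twice.
X_tilde = np.hstack([X[:, g] for g in groups])

# w_tilde lives in R^5 with *non-overlapping* groups of sizes 3 and 2,
# so the standard group lasso machinery applies to (X_tilde, w_tilde).
tilde_groups = [np.arange(3), np.arange(3, 5)]
```

After solving the non-overlapping problem in the replicated space, a solution for the original problem is recovered by summing the replicated copies back, w = Σ_g w̄^(g).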
SLIDE 21

An equivalence (cont.)

Indeed,

\min_{w} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \inf_{\substack{w^{(1)}, \dots, w^{(G)} \\ \text{s.t. } \sum_{g=1}^{G} \bar{w}^{(g)} = w}} \sum_{g=1}^{G} \|w^{(g)}\|

= \inf_{\substack{w^{(1)}, \dots, w^{(G)} \\ \text{s.t. } \sum_{g=1}^{G} \bar{w}^{(g)} = w}} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

= \inf_{w^{(1)}, \dots, w^{(G)}} \frac{1}{n}\Big\|\hat{X}\Big(\sum_{g=1}^{G} \bar{w}^{(g)}\Big) - \hat{y}\Big\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

= \inf_{w^{(1)}, \dots, w^{(G)}} \frac{1}{n}\Big\|\sum_{g=1}^{G} \hat{X}_{|G_g} w^{(g)} - \hat{y}\Big\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

= \min_{\tilde{w}} \frac{1}{n}\|\tilde{X}\tilde{w} - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

SLIDE 22

Computations

◮ Can use block thresholding with the replicated variables ⇒ potentially wasteful
◮ The proximal operator for R_{GLO} can be computed efficiently, but not in closed form

SLIDE 23

More structure

Structured overlapping groups:

◮ trees
◮ DAGs
◮ . . .

Structure can be exploited in computations. . .

SLIDE 24

Beyond linear models

Consider a dictionary made by the union of distinct dictionaries:

f(x) = \sum_{g=1}^{G} f_g(x) = \sum_{g=1}^{G} \Phi_g(x)^\top w^{(g)},

where each dictionary defines a feature map \Phi_g(x) = (\phi^g_1(x), \dots, \phi^g_{p_g}(x)).

Easy extension with the usual change of variable...

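Such a union of dictionaries can be sketched as follows (the two feature maps, polynomial and Gaussian, are hypothetical choices for illustration):

```python
import numpy as np

# Two hypothetical dictionaries/feature maps on x in R:
# a few polynomial features and a couple of Gaussian bumps.
def phi_poly(x):
    return np.array([x, x ** 2])

def phi_gauss(x):
    return np.array([np.exp(-(x - 1.0) ** 2), np.exp(-(x + 1.0) ** 2)])

def f(x, w_poly, w_gauss):
    """f(x) = Phi_1(x)^T w^(1) + Phi_2(x)^T w^(2): one additive term per dictionary."""
    return phi_poly(x) @ w_poly + phi_gauss(x) @ w_gauss

# With the Gaussian group set to zero, only the polynomial part contributes:
# f(2) = 1*2 + 0.5*4 = 4.
val = f(2.0, w_poly=np.array([1.0, 0.5]), w_gauss=np.zeros(2))
```

Group sparsity on (w^(1), w^(2)) then selects or discards whole dictionaries at once.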
SLIDE 25

Representer theorems

Let

f(x) = x^\top\Big(\sum_{g=1}^{G} \bar{w}^{(g)}\Big) = \sum_{g=1}^{G} \bar{x}^{(g)\top} \bar{w}^{(g)} = \sum_{g=1}^{G} f_g(x).

SLIDE 26

Representer theorems

Let

f(x) = x^\top\Big(\sum_{g=1}^{G} \bar{w}^{(g)}\Big) = \sum_{g=1}^{G} \bar{x}^{(g)\top} \bar{w}^{(g)} = \sum_{g=1}^{G} f_g(x).

Idea: show that \bar{w}^{(g)} = \sum_{i=1}^{n} \bar{x}^{(g)}_i c^{(g)}_i, i.e.

f_g(x) = \sum_{i=1}^{n} \bar{x}^{(g)\top} \bar{x}^{(g)}_i c^{(g)}_i = \sum_{i=1}^{n} \underbrace{x^{(g)\top} x^{(g)}_i}_{\Phi_g(x)^\top \Phi_g(x_i) = K_g(x, x_i)} c^{(g)}_i

SLIDE 27

Representer theorems

Let

f(x) = x^\top\Big(\sum_{g=1}^{G} \bar{w}^{(g)}\Big) = \sum_{g=1}^{G} \bar{x}^{(g)\top} \bar{w}^{(g)} = \sum_{g=1}^{G} f_g(x).

Idea: show that \bar{w}^{(g)} = \sum_{i=1}^{n} \bar{x}^{(g)}_i c^{(g)}_i, i.e.

f_g(x) = \sum_{i=1}^{n} \bar{x}^{(g)\top} \bar{x}^{(g)}_i c^{(g)}_i = \sum_{i=1}^{n} \underbrace{x^{(g)\top} x^{(g)}_i}_{\Phi_g(x)^\top \Phi_g(x_i) = K_g(x, x_i)} c^{(g)}_i

Note that in this case

\|f_g\|^2 = \|w^{(g)}\|^2 = c^{(g)\top} \underbrace{\hat{X}^{(g)} \hat{X}^{(g)\top}}_{\hat{K}^{(g)}} c^{(g)}

SLIDE 28

Coefficients update

c_{t+1} = \mathrm{Prox}_{\gamma\lambda R_{group}}\big( c_t - \gamma(\hat{K} c_t - \hat{y}) \big)

where \hat{K} = (\hat{K}^{(1)}, \dots, \hat{K}^{(G)}) and c_t = (c_t^{(1)}, \dots, c_t^{(G)}).

Block Thresholding: it can be shown that

\big(\mathrm{Prox}_{\lambda\|\cdot\|}(c^{(g)})\big)_j =
\begin{cases}
c^{(g)}_j - \lambda \dfrac{c^{(g)}_j}{\|f_g\|} & \|f_g\| > \lambda \\
0 & \|f_g\| \le \lambda
\end{cases}
\qquad \text{with } \|f_g\| = \sqrt{c^{(g)\top} \hat{K}^{(g)} c^{(g)}}

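In the kernel parametrization the group norm ‖f_g‖ is computed through the Gram matrix; a NumPy sketch for one group (the Gram matrix and coefficients below are toy, made-up data):

```python
import numpy as np

def kernel_block_threshold(c, K, lam):
    """Prox step on one group's coefficients: threshold on ||f_g|| = sqrt(c^T K c)."""
    norm_fg = np.sqrt(c @ K @ c)
    if norm_fg <= lam:
        return np.zeros_like(c)       # the whole function f_g is discarded
    return c * (1.0 - lam / norm_fg)  # otherwise the coefficients are shrunk

# Toy positive-semidefinite Gram matrix (linear kernel of random features).
rng = np.random.default_rng(1)
Xg = rng.standard_normal((5, 3))
Kg = Xg @ Xg.T
c = rng.standard_normal(5)

c_zeroed = kernel_block_threshold(c, Kg, lam=1e6)  # huge lambda: group is zeroed
c_kept = kernel_block_threshold(c, Kg, lam=0.0)    # lambda = 0: coefficients unchanged
```

The only difference from the parametric case is that the thresholding test uses the RKHS norm of f_g rather than the Euclidean norm of a weight block.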
SLIDE 29

Non-parametric sparsity

f(x) = \sum_{g=1}^{G} f_g(x), \qquad f_g(x) = \sum_{i=1}^{n} x^{(g)\top} x^{(g)}_i (c^{(g)})_i \;\to\; f_g(x) = \sum_{i=1}^{n} K_g(x, x_i)(c^{(g)})_i

with (K_1, \dots, K_G) a family of kernels, and

\sum_{g=1}^{G} \|w^{(g)}\| \;\Longrightarrow\; \sum_{g=1}^{G} \|f_g\|_{K_g}

SLIDE 30

ℓ1 MKL

\inf_{\substack{w^{(1)}, \dots, w^{(G)} \\ \text{s.t. } \sum_{g=1}^{G} \bar{w}^{(g)} = w}} \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \sum_{g=1}^{G} \|w^{(g)}\|

⇓

\min_{\substack{f_1, \dots, f_G \\ \text{s.t. } \sum_{g=1}^{G} f_g = f}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \sum_{g=1}^{G} \|f_g\|_{K_g}

SLIDE 31

ℓ2 MKL

\sum_{g=1}^{G} \|w^{(g)}\|^2 \;\Longrightarrow\; \sum_{g=1}^{G} \|f_g\|^2_{K_g}

Corresponds to using the kernel

K(x, x') = \sum_{g=1}^{G} K_g(x, x')

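So ℓ2 MKL reduces to kernel ridge regression with the sum kernel; a sketch with two Gaussian kernels of different widths (the widths, data, and regularization value are made-up choices):

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 2))
y = np.sin(X[:, 0])

# Sum kernel K = K_1 + K_2: this is what ell-2 MKL uses implicitly.
K = gaussian_kernel(X, X, 0.5) + gaussian_kernel(X, X, 2.0)

# Kernel ridge regression with the sum kernel: solve (K + lam*n*I) c = y.
lam, n = 0.1, len(y)
c = np.linalg.solve(K + lam * n * np.eye(n), y)
y_hat = K @ c
```

A new point x is predicted by evaluating both kernels against the training points and summing, i.e. with the same sum kernel.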
SLIDE 32

ℓ1 or ℓ2 MKL

◮ ℓ2 is *much* faster
◮ ℓ1 can be useful if only a few kernels are relevant

SLIDE 33

Why MKL?

◮ Data fusion: different features
◮ Model selection, e.g. Gaussian kernels with different widths
◮ Richer models: many kernels!

SLIDE 34

MKL & kernel learning

It can be shown that

\min_{\substack{f_1, \dots, f_G \\ \text{s.t. } \sum_{g=1}^{G} f_g = f}} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \sum_{g=1}^{G} \|f_g\|_{K_g}
\;=\; \min_{K \in \mathcal{K}} \; \min_{f \in \mathcal{H}_K} \frac{1}{n}\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|^2_K

where \mathcal{K} = \{ K \mid K = \sum_g K_g \alpha_g, \; \alpha_g \ge 0 \}.

SLIDE 35

Sparsity beyond vectors

Recall multi-variable regression: (x_i, y_i)_{i=1}^{n} with x_i \in \mathbb{R}^d, y_i \in \mathbb{R}^T,

f(x) = x^\top \underbrace{W}_{d \times T}

\min_{W} \|\hat{X}W - \hat{Y}\|^2_F + \lambda \, \mathrm{Tr}(W A W^\top)

SLIDE 36

Sparse regularization

◮ We have seen

\mathrm{Tr}(WW^\top) = \sum_{j=1}^{d} \sum_{t=1}^{T} (W_{t,j})^2

◮ We could now consider

\sum_{j=1}^{d} \sum_{t=1}^{T} |W_{t,j}|

◮ . . .

SLIDE 37

Spectral Norms/p-Schatten norms

◮ We have seen

\mathrm{Tr}(WW^\top) = \sum_{i=1}^{\min\{d,T\}} \sigma_i^2

◮ We could now consider

R(W) = \|W\|_* = \sum_{i=1}^{\min\{d,T\}} \sigma_i, \quad \text{the nuclear norm}

◮ or

R(W) = \Big(\sum_{i=1}^{\min\{d,T\}} \sigma_i^p\Big)^{1/p}, \quad \text{the p-Schatten norm}

SLIDE 38

Nuclear norm regularization

\min_{W} \|\hat{X}W - \hat{Y}\|^2_F + \lambda \|W\|_*

SLIDE 39

Computations

W_{t+1} = \mathrm{Prox}_{\gamma\lambda\|\cdot\|_*}\Big( W_t - 2\gamma\,\hat{X}^\top(\hat{X}W_t - \hat{Y}) \Big)

Let W = U\Sigma V^\top with \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_p). Then

\mathrm{Prox}_{\lambda\|\cdot\|_*}(W) = U \, \mathrm{diag}\big(\mathrm{Prox}_{\lambda\|\cdot\|_1}(\sigma_1, \dots, \sigma_p)\big) V^\top

i.e. soft thresholding of the singular values.

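Singular-value soft thresholding can be sketched in a few lines of NumPy (the test matrix below, with singular values 3 and 1, is made up for illustration):

```python
import numpy as np

def prox_nuclear(W, lam):
    """Prox of lam * ||.||_*: soft-threshold the singular values of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_thr = np.maximum(s - lam, 0.0)   # soft thresholding, as for the ell-1 norm
    return U @ np.diag(s_thr) @ Vt

# Diagonal matrix with singular values 3 and 1.
W = np.diag([3.0, 1.0])

# With lam = 2 the singular values become (1, 0): the result has rank 1,
# mirroring how block thresholding zeroes out entire groups.
W_thr = prox_nuclear(W, lam=2.0)
```

Just as the group prox can kill whole groups, this prox can kill whole singular directions, which is what drives low-rank solutions.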
SLIDE 40

This class

◮ Structured sparsity
◮ MKL
◮ Matrix sparsity

SLIDE 41

Next class

Data representation!
