

SLIDE 1

Low ℓ1 norm and guarantees on sparsifiability

Shai Shalev-Shwartz & Nathan Srebro

Toyota Technological Institute at Chicago

ICML/COLT/UAI workshop, July 2008

SLIDE 2

Motivation

Problem I:

$$w_0 = \operatorname*{argmin}_{w}\; \mathbb{E}[L(w, x, y)] \quad \text{s.t.} \quad \|w\|_0 \le S$$

SLIDE 3

Motivation

Problem I:

$$w_0 = \operatorname*{argmin}_{w}\; \mathbb{E}[L(w, x, y)] \quad \text{s.t.} \quad \|w\|_0 \le S$$

Problem II:

$$w_1 = \operatorname*{argmin}_{w}\; \mathbb{E}[L(w, x, y)] \quad \text{s.t.} \quad \|w\|_1 \le B$$

SLIDE 4

Motivation

Problem I:

$$w_0 = \operatorname*{argmin}_{w}\; \mathbb{E}[L(w, x, y)] \quad \text{s.t.} \quad \|w\|_0 \le S$$

Problem II:

$$w_1 = \operatorname*{argmin}_{w}\; \mathbb{E}[L(w, x, y)] \quad \text{s.t.} \quad \|w\|_1 \le B$$

Strict assumptions on the data distribution ⇒ $w_1$ is also sparse. But what if $w_1$ is not sparse?

SLIDE 5

Motivation

Problem I:

$$w_0 = \operatorname*{argmin}_{w}\; \mathbb{E}[L(w, x, y)] \quad \text{s.t.} \quad \|w\|_0 \le S$$

Problem II:

$$w_1 = \operatorname*{argmin}_{w}\; \mathbb{E}[L(w, x, y)] \quad \text{s.t.} \quad \|w\|_1 \le B$$

Strict assumptions on the data distribution (e.g., features not correlated) ⇒ $w_1$ is also sparse. But what if $w_1$ is not sparse?
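A concrete illustration of how the two problems can pull apart (this is the same dense predictor that reappears on the tightness slides below): the uniform predictor has ℓ1 norm $B$ while using every coordinate,

$$w = \Big(\tfrac{B}{n}, \dots, \tfrac{B}{n}\Big) \in \mathbb{R}^n, \qquad \|w\|_1 = B, \qquad \|w\|_0 = n,$$

so a bound on $\|w\|_1$ alone says nothing about $\|w\|_0$. The question is how well such a $w$ can be approximated by a sparse predictor.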

SLIDE 6

Sparsification

Sparsification procedure

Predictor $w$ with $\|w\|_1 = B$  →  predictor $\tilde{w}$ with $\|\tilde{w}\|_0 = S$

SLIDE 7

Sparsification

Sparsification procedure

Predictor $w$ with $\|w\|_1 = B$  →  predictor $\tilde{w}$ with $\|\tilde{w}\|_0 = S$

Constraint: $\mathbb{E}[L(\tilde{w}, x, y)] \le \mathbb{E}[L(w, x, y)] + \epsilon$

Goal: the minimal $S$ that satisfies the constraint.

Question: how does $S$ depend on $B$ and $\epsilon$?

SLIDE 8

Main Result

Theorem: For any predictor $w$, any $\lambda$-Lipschitz loss function $L$, any distribution $D$ over $X \times Y$, and any desired accuracy $\epsilon$, there exists $\tilde{w}$ such that

$$\mathbb{E}[L(\tilde{w}, x, y)] \le \mathbb{E}[L(w, x, y)] + \epsilon \qquad \text{and} \qquad \|\tilde{w}\|_0 = O\!\left(\Big(\frac{\lambda \|w\|_1}{\epsilon}\Big)^2\right)$$

Tightness: there exist a data distribution, a loss function, and a dense predictor $w$ with loss $\ell$ such that achieving loss $\ell + \epsilon$ requires $\Omega\!\left((\|w\|_1/\epsilon)^2\right)$ features.

Sparsifying by taking the largest weights, or by following the ℓ1 regularization path, might fail.

A low ℓ2 norm predictor does not imply a sparse predictor.
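An illustrative calculation with numbers chosen here (not from the slides): the hinge and absolute losses are 1-Lipschitz, so with $\lambda = 1$, $\|w\|_1 = 10$, and accuracy $\epsilon = 0.1$, the theorem promises a sparse predictor with

$$\|\tilde{w}\|_0 = O\!\left(\Big(\frac{1 \cdot 10}{0.1}\Big)^2\right) = O(10^4)$$

non-zero coordinates, regardless of the ambient dimension $n$.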

SLIDE 9

Main Result (cont.)

Distribution $D$, loss $L$

SLIDE 10

Main Result (cont.)

Distribution $D$, loss $L$ --(convex opt.)--> low ℓ1 predictor $w$

SLIDE 11

Main Result (cont.)

Distribution $D$, loss $L$ --(convex opt.)--> low ℓ1 predictor $w$ --(randomized sparsification)--> sparse predictor $\tilde{w}$

SLIDE 12

Main Result (cont.)

Distribution $D$, loss $L$ --(convex opt.)--> low ℓ1 predictor $w$ --(randomized sparsification / forward selection procedure)--> sparse predictor $\tilde{w}$

SLIDE 13

Randomized Sparsification Procedure

[Figure: bar charts of the normalized weights $|w_1|/Z, \dots, |w_n|/Z$ and of the sparsified counts $|\tilde{w}_1|/Z', \dots, |\tilde{w}_n|/Z'$]

Sparsification procedure: for $j = 1, \dots, S$, sample an index $i$ from the distribution $P$ with $P_i \propto |w_i|$, and add: $|\tilde{w}_i| \leftarrow |\tilde{w}_i| + 1$.


SLIDE 15

Randomized Sparsification Procedure

Sparsification procedure: for $j = 1, \dots, S$, sample an index $i$ from the distribution $P$ with $P_i \propto |w_i|$, and add: $|\tilde{w}_i| \leftarrow |\tilde{w}_i| + 1$.

Guarantee. Assume $X = \{x : \|x\|_\infty \le 1\}$, $Y$ is an arbitrary set, $D$ is an arbitrary distribution over $X \times Y$, and the loss $L : \mathbb{R} \times Y \to \mathbb{R}$ is $\lambda$-Lipschitz w.r.t. its first argument. If

$$S \ge \Omega\!\left(\frac{\lambda^2 \|w\|_1^2 \log(1/\delta)}{\epsilon^2}\right)$$

then, with probability at least $1 - \delta$,

$$\mathbb{E}[L(\tilde{w}, x, y)] - \mathbb{E}[L(w, x, y)] \le \epsilon$$
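A minimal Python sketch of this randomized procedure. The slides only show the per-index counting step; the sign and $\|w\|_1/S$ rescaling below are one natural way to turn the counts into a predictor (an assumption on my part), and the function name is mine.

```python
import numpy as np

def randomized_sparsify(w, S, rng=None):
    """Sample S feature indices i.i.d. with probability proportional to |w_i|
    and count how often each index is drawn (the step shown on the slide).
    The final sign/scale bookkeeping turns the counts into a predictor with
    E[w_tilde] = w; that rescaling is an assumption, not shown on the slide."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(w, dtype=float)
    l1 = np.abs(w).sum()
    p = np.abs(w) / l1                        # P_i proportional to |w_i|
    counts = np.zeros_like(w)
    for _ in range(S):                        # for j = 1, ..., S
        i = rng.choice(len(w), p=p)           # sample index i ~ P
        counts[i] += 1.0                      # |w_tilde_i| <- |w_tilde_i| + 1
    return np.sign(w) * counts * (l1 / S)     # at most S non-zero coordinates
```

With $S$ on the order of $\lambda^2 \|w\|_1^2 \log(1/\delta)/\epsilon^2$, the guarantee above says the resulting $\tilde{w}$ loses at most $\epsilon$ in expected loss with probability $1 - \delta$, while having at most $S$ non-zero coordinates.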

SLIDE 16

Randomized Sparsification Procedure

Distribution $D$, loss $L$ --(convex opt.)--> low ℓ1 predictor $w$ --(randomized sparsification)--> sparse predictor $\tilde{w}$

  • Requires access to w
  • Does not require access to D
SLIDE 17

Tightness

[Figure: label $Y$ with features $X_1, \dots, X_n$]

$$P(Y = \pm 1) = \tfrac{1}{2}, \qquad P(X_i = y \mid Y = y) = \frac{1 + 1/B}{2}$$

Data distribution: spread the 'information' about the label among all of the features.

SLIDE 18

Tightness (cont.)

Dense predictor: $w_i = B/n$, and thus $\|w\|_1 = B$ and

$$\mathbb{E}[|\langle w, x\rangle - y|] \le \frac{B}{\sqrt{n}}$$

Sparse predictor: any $u$ with $\mathbb{E}[|\langle u, x\rangle - y|] \le \epsilon$ must satisfy $\|u\|_0 = \Omega\!\left(\frac{B^2}{\epsilon^2}\right)$.

[Figure: label $Y$ with features $X_1, \dots, X_n$; $P(Y = \pm 1) = \tfrac{1}{2}$, $P(X_i = y \mid Y = y) = \frac{1 + 1/B}{2}$]
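A quick numerical sanity check of the dense predictor's loss on this construction (a sketch; the values of B, n, and the sample size are arbitrary illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
B, n, m = 10.0, 10_000, 1_000

y = rng.choice([-1.0, 1.0], size=m)              # P(Y = +/-1) = 1/2
# Each feature equals y with probability (1 + 1/B)/2, otherwise -y.
agree = rng.random((m, n)) < (1 + 1 / B) / 2
x = np.where(agree, y[:, None], -y[:, None])

w = np.full(n, B / n)                            # dense predictor, ||w||_1 = B
print(np.abs(x @ w - y).mean(), B / np.sqrt(n))  # absolute loss vs the B/sqrt(n) bound
```

The empirical absolute loss comes out on the order of $B/\sqrt{n} = 0.1$ here, while $\|w\|_0 = n$: the predictor is accurate and has low ℓ1 norm, yet is completely dense.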

SLIDE 19

Tightness (cont.)

Proof uses a generalization of the Khintchine inequality: if $x = (x_1, \dots, x_n)$ are independent random variables with $P[x_k = 1] \in (5\%, 95\%)$ and $Q$ is a degree-$d$ polynomial, then

$$\mathbb{E}[|Q(x)|] \ge (0.2)^d \left(\mathbb{E}[|Q(x)|^2]\right)^{1/2}$$

Dense predictor: $w_i = B/n$, and thus $\|w\|_1 = B$ and $\mathbb{E}[|\langle w, x\rangle - y|] \le B/\sqrt{n}$.

Sparse predictor: any $u$ with $\mathbb{E}[|\langle u, x\rangle - y|] \le \epsilon$ must satisfy $\|u\|_0 = \Omega(B^2/\epsilon^2)$.

[Figure: label $Y$ with features $X_1, \dots, X_n$; $P(Y = \pm 1) = \tfrac{1}{2}$, $P(X_i = y \mid Y = y) = \frac{1 + 1/B}{2}$]

SLIDE 20

Low ℓ2 norm does not guarantee sparsifiability

Same data distribution as before, with $B = \epsilon\sqrt{n}$.

Dense predictor: $w_i = B/n$, so

$$\mathbb{E}[|\langle w, x\rangle - y|] \le \frac{B}{\sqrt{n}} = \epsilon \qquad \text{and} \qquad \|w\|_2 = \frac{B}{\sqrt{n}} = \epsilon$$

Sparse predictor: any $u$ with $\mathbb{E}[|\langle u, x\rangle - y|] \le 2\epsilon$ must use almost all of the features: $\|u\|_0 = \Omega(B^2/\epsilon^2) = \Omega(n)$.

ℓ1 captures sparsity but ℓ2 doesn't!

SLIDE 21

Sparsifying by zeroing small weights fails

[Figure: label $Y$ with hidden variables $Z_1, \dots, Z_s$ and features $X_1, \dots, X_{sn}$]

$$P(Y = \pm 1) = \tfrac{1}{2}, \qquad P(X_j = z_{\lceil j/s\rceil} \mid z_{\lceil j/s\rceil}) = \tfrac{7}{8}$$

$$P(Z_1 = y \mid y) = \frac{1 + 2/3}{2}, \qquad P(Z_s = y \mid y) = \frac{1 + 1/3}{2}$$

SLIDE 22

Sparsifying by zeroing small weights fails

[Figure: label $Y$ with hidden variables $Z_1, \dots, Z_s$ and features $X_1, \dots, X_{sn}$]

$$P(Y = \pm 1) = \tfrac{1}{2}, \qquad P(X_j = z_{\lceil j/s\rceil} \mid z_{\lceil j/s\rceil}) = \tfrac{7}{8}$$

$$P(Z_1 = y \mid y) = \frac{1 + 2/3}{2}, \qquad P(Z_s = y \mid y) = \frac{1 + 1/3}{2}$$

[Callout in figure: larger weights]

SLIDE 23

Sparsifying by zeroing small weights fails

[Figure: label $Y$ with hidden variables $Z_1, \dots, Z_s$ and features $X_1, \dots, X_{sn}$]

$$P(Y = \pm 1) = \tfrac{1}{2}, \qquad P(X_j = z_{\lceil j/s\rceil} \mid z_{\lceil j/s\rceil}) = \tfrac{7}{8}$$

$$P(Z_1 = y \mid y) = \frac{1 + 2/3}{2}, \qquad P(Z_s = y \mid y) = \frac{1 + 1/3}{2}$$

[Callouts in figure: larger weights; initial weights on the regularization path; the regularization path also fails on this example]
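For reference, the baseline these slides warn about is plain hard truncation: keep the $S$ largest-magnitude weights and zero the rest. A minimal sketch (the function name is mine):

```python
import numpy as np

def keep_largest_weights(w, S):
    """Keep only the S largest-magnitude coordinates of w, zeroing the rest.
    The slides argue this kind of truncation can fail on constructions like
    the one above, even though w itself has low l1 norm."""
    w = np.asarray(w, dtype=float)
    w_trunc = np.zeros_like(w)
    top = np.argsort(np.abs(w))[-S:]   # indices of the S largest |w_i|
    w_trunc[top] = w[top]
    return w_trunc
```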

SLIDE 24

Intermediate Summary

We answer a fundamental question: how much sparsity does a low ℓ1 norm guarantee?

$$\|\tilde{w}\|_0 \le O\!\left(\frac{\|w\|_1^2}{\epsilon^2}\right)$$

This is tight.

Achievable by a simple randomized procedure.

Coming next: a direct approach also works!

SLIDE 25

Intermediate Summary

We answer a fundamental question: how much sparsity does a low ℓ1 norm guarantee?

$$\|\tilde{w}\|_0 \le O\!\left(\frac{\|w\|_1^2}{\epsilon^2}\right)$$

This is tight.

Achievable by a simple randomized procedure.

Coming next: a direct approach also works!

Distribution $D$, loss $L$ --(convex opt.)--> low ℓ1 predictor $w$ --(randomized sparsification / forward selection procedure)--> sparse predictor $\tilde{w}$

SLIDE 26

Greedy Forward Selection

Step 1: Define a slightly modified loss function

$$\tilde{L}(v, y) = \min_{u} \left[ \frac{\lambda^2}{\epsilon} (u - v)^2 + L(u, y) \right]$$

Using infimal convolution theory, it can be shown that $\tilde{L}$ has a Lipschitz-continuous derivative and that $\forall v, y:\ |L(v, y) - \tilde{L}(v, y)| \le \epsilon/4$.

Step 2: Apply forward greedy selection on $\tilde{L}$:

  • Initialize $w_1 = 0$
  • Choose a feature using the largest element of the gradient
  • Choose a step size $\eta_t$ (a closed-form solution exists)
  • Update $w_{t+1} = (1 - \eta_t)\, w_t + \eta_t B\, e_{j_t}$
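A minimal Python sketch of this greedy step (a Frank-Wolfe-style update over the ℓ1 ball of radius $B$). The closed-form step size mentioned on the slide is not reproduced here; the generic schedule $\eta_t = 2/(t+2)$ is used as a stand-in, the step is signed so that it decreases the loss, and the function name is mine.

```python
import numpy as np

def greedy_forward_selection(X, y, B, T, grad_loss):
    """Forward greedy selection over the l1 ball of radius B.

    X: (m, n) data matrix, y: (m,) labels, grad_loss(v, y): derivative of the
    smoothed loss L~ with respect to the prediction v.  The generic step size
    eta_t = 2/(t+2) stands in for the closed-form step size from the slide."""
    m, n = X.shape
    w = np.zeros(n)                              # initialize w_1 = 0
    for t in range(T):
        g = X.T @ grad_loss(X @ w, y) / m        # gradient of the empirical smoothed risk
        j = int(np.argmax(np.abs(g)))            # feature with the largest gradient element
        eta = 2.0 / (t + 2)
        w *= (1.0 - eta)                         # w_{t+1} = (1 - eta_t) w_t + eta_t * (+/- B) e_j
        w[j] += eta * (-np.sign(g[j])) * B       # signed so the step decreases the loss
    return w
```

Each iteration touches at most one new coordinate, so after $T$ steps $\|\tilde{w}\|_0 \le T$; the guarantee on the slides below says $T = O(\lambda^2 B^2/\epsilon^2)$ steps suffice.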

SLIDE 27

Greedy Forward Selection

Example – hinge loss: $L(v, y) = \max\{0,\, 1 - v\}$, and

$$\tilde{L}(v, y) = \begin{cases} 0 & \text{if } v > 1 \\ \frac{1}{\epsilon}(v - 1)^2 & \text{if } v \in [1 - \epsilon/2,\ 1] \\ \left(1 - \frac{\epsilon}{4}\right) - v & \text{otherwise} \end{cases}$$
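A small sketch of this smoothed hinge and its derivative (for $\lambda = 1$), written in margin form so it plugs into the grad_loss slot of the greedy sketch above; the margin/label handling and the function names are my own conventions, not spelled out on the slide.

```python
import numpy as np

def smoothed_hinge(margin, eps):
    """L~ applied to the margin m = y * <w, x>: zero for m > 1, quadratic on
    [1 - eps/2, 1], and linear (1 - eps/4) - m below that interval."""
    m = np.asarray(margin, dtype=float)
    return np.where(m > 1, 0.0,
           np.where(m >= 1 - eps / 2, (m - 1) ** 2 / eps, (1 - eps / 4) - m))

def smoothed_hinge_grad(v, y, eps=0.1):
    """Derivative of L~(y * v) with respect to the prediction v; matches the
    grad_loss(v, y) convention used in the greedy-selection sketch above."""
    m = y * np.asarray(v, dtype=float)
    dm = np.where(m > 1, 0.0,
         np.where(m >= 1 - eps / 2, 2 * (m - 1) / eps, -1.0))
    return y * dm
```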


SLIDE 29

Guarantees

Theorem. Let $X = \{x : \|x\|_\infty \le 1\}$, let $Y$ be an arbitrary set, let $D$ be an arbitrary distribution over $X \times Y$, and let the loss $L : \mathbb{R} \times Y \to \mathbb{R}$ be proper, convex, and $\lambda$-Lipschitz w.r.t. its first argument. Forward greedy selection on $\tilde{L}$ finds $\tilde{w}$ such that

$$\|\tilde{w}\|_0 = O\!\left(\frac{\lambda^2 B^2}{\epsilon^2}\right)$$

and, for any $w$ with $\|w\|_1 \le B$,

$$\mathbb{E}[L(\tilde{w}, x, y)] - \mathbb{E}[L(w, x, y)] \le \epsilon$$
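Tying the two sketches above together on toy data (every value below is an arbitrary illustrative choice, and the data-generating scheme is mine, not the construction from the tightness slides):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, B, eps = 500, 200, 5.0, 0.25

# Toy data: the first 10 features are noisy copies of the label, the rest are noise.
y = rng.choice([-1.0, 1.0], size=m)
X = rng.choice([-1.0, 1.0], size=(m, n))
X[:, :10] = np.where(rng.random((m, 10)) < 0.9, y[:, None], -y[:, None])

T = 50  # greedy steps, hence at most 50 non-zero coordinates
w_sparse = greedy_forward_selection(
    X, y, B, T, lambda v, yy: smoothed_hinge_grad(v, yy, eps))
print(np.count_nonzero(w_sparse),
      smoothed_hinge(y * (X @ w_sparse), eps).mean())
```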

SLIDE 30

Related Work

ℓ1 norm and sparsity: Donoho provides sufficient conditions under which the minimizer of the ℓ1 norm is also sparse. But what if these conditions are not met?

Compressed sensing: the ℓ1 norm recovers a sparse predictor, but only under severe assumptions on the design matrix (in our case, the training examples).

Converse question: does a small $\|\tilde{w}\|_0$ imply a small $\|w\|_1$? Servedio gives a partial answer for the case of linear classification; Wainwright gives a partial answer for the Lasso.

Sparsification: a randomized sparsification procedure was previously proposed by Schapire et al.; however, their bound depends on the training set size. Lee, Bartlett, and Williamson addressed a similar question for the special case of the squared-error loss. Zhang presented a forward greedy procedure for twice-differentiable losses.

SLIDE 31

Summary

[Summary diagram] Distribution $D$, loss $L$ --(convex opt.)--> low ℓ1 predictor $w$ --(randomized sparsification; taking the largest weights might fail)--> sparse predictor $\tilde{w}$

$$\|\tilde{w}\|_0 \le O\!\left(\frac{\|w\|_1^2}{\epsilon^2}\right)$$

This is tight.

SLIDE 32

Summary

[Summary diagram] Distribution $D$, loss $L$ --(convex opt.)--> low ℓ1 predictor $w$ --(randomized sparsification or forward selection; taking the largest weights or following the regularization path might fail)--> sparse predictor $\tilde{w}$

$$\|\tilde{w}\|_0 \le O\!\left(\frac{\|w\|_1^2}{\epsilon^2}\right)$$

This is tight.

SLIDE 33

Summary

[Summary diagram] Distribution $D$, loss $L$ --(convex opt.)--> low ℓ1 predictor $w$ --(randomized sparsification or forward selection; taking the largest weights or following the regularization path might fail)--> sparse predictor $\tilde{w}$. A low ℓ2 predictor obtained by convex opt. does not come with the same guarantee.

$$\|\tilde{w}\|_0 \le O\!\left(\frac{\|w\|_1^2}{\epsilon^2}\right)$$

This is tight.