slide-1
SLIDE 1

Efficient Algorithms for Learning Sparse Models from High Dimensional Data

Yoram Singer, Google Research

Machine Learning Summer School, UC Santa Cruz, July 18, 2012

slide-2
SLIDE 2

Sparsity ?

  • In many applications there are numerous input features (e.g. all words in a dictionary, all possible html tokens)
  • However, only a fraction of the features are highly relevant to the task at hand
  • Keeping all the features might also be computationally infeasible

THE HIGHER MINIMUM WAGE THAT WAS SIGNED INTO LAW ... WILL BE WELCOME RELIEF OF WORKERS ... THE 90 CENT-AN-HOUR INCREASE... REGULATIONS LABOUR ECONOMICS

slide-3
SLIDE 3
Sparsity ?

  • A large collection of investment tools (stocks, bonds, ETFs, cash, options, ...)
  • Cannot afford and/or maintain investments in all possible financial instruments across the globe
  • Need to select a relatively small number of financial instruments to achieve a certain goal (e.g. a volatility-return profile)

slide-4
SLIDE 4
High Dimensional Sparse Data

  • Web search and advertisement placement employ a large number of boolean predicates
  • Most predicates evaluate to false most of the time
  • Example:
    • User types [flowers] and sees “Fernando’s Flower Shop”
    • Instantiated features:
      query: “flowers”
      query: “flowers” && creative_keyword: “flower”
      lang: “en-US”
    • Resulting instance: $x \in \{0, 1\}^n$, e.g. $(0, 0, 1, 0, 1, 0, \ldots, 0, 0, 1, 0)$

Which predicates are important?

slide-5
SLIDE 5

Methods to Achieve Compact Models

  • Forward greedy feature induction (bottom-up)
  • Backward feature pruning (top-down)
  • Combination (FOBA): alternate between (i) feature induction, (ii) model fitting, (iii) feature pruning
  • This tutorial: efficient algorithms for learning “compact” linear models from large amounts of high dimensional data

slide-6
SLIDE 6

Linear Models

[Diagram: inputs $x_1, \ldots, x_8$ connected through weights $w_1, \ldots, w_8$ to the prediction]

  • Input $x$, weights $w$
  • Prediction: $\hat{y} = w \cdot x = \sum_{j=1}^{n} w_j x_j$
  • True target $y$ $\Rightarrow$ loss function $\ell(y, \hat{y})$
  • Examples of losses: squared error $\ell(y, \hat{y}) = (y - \hat{y})^2$ and exponential loss $\ell(y, \hat{y}) = e^{-y \hat{y}}$
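For concreteness, here is a minimal Python/NumPy sketch (not from the slides) of the linear prediction and the two example losses; the toy values are invented for illustration.

```python
import numpy as np

def predict(w, x):
    """Linear prediction: y_hat = w . x."""
    return np.dot(w, x)

def squared_loss(y, y_hat):
    """Squared error loss."""
    return (y - y_hat) ** 2

def exp_loss(y, y_hat):
    """Exponential loss (as in AdaBoost-style methods)."""
    return np.exp(-y * y_hat)

# Toy usage with made-up values.
w = np.array([0.5, -1.0, 0.0, 2.0])
x = np.array([1.0, 0.0, 1.0, 1.0])
y = 1.0
y_hat = predict(w, x)
print(squared_loss(y, y_hat), exp_loss(y, y_hat))
```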

slide-7
SLIDE 7

Empirical Loss Minimization

  • Training set (sample): $S = \{(x_i, y_i)\}_{i=1}^{m}$
  • Goal: find $w$ that attains low loss $L(w)$ on $S$ ... and performs well on unseen data
  • Empirical Risk Minimization (ERM) balances between loss minimization and the “complexity” of $w$

$$L(w) = \frac{1}{m} \sum_{i=1}^{m} \ell(y_i, w \cdot x_i)$$

slide-8
SLIDE 8

Two Forms of ERM

DOMAIN CONSTRAINED EMPIRICAL RISK

$$\arg\min_{w} \; \mathbb{E}_{(x,y) \sim D} \left[ \ell(w; (x, y)) \right] \quad \text{s.t. } w \in \Omega$$

Example: $\sum_{t=1}^{T} \log(w \cdot x_t)$ s.t. $w \in \Delta$

PENALIZED EMPIRICAL RISK

$$\arg\min_{w} \; R(w) + \mathbb{E}_{(x,y) \sim D} \left[ \ell(w; (x, y)) \right]$$

Example: $\arg\min_{w} \; \sigma \|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} \left[ 1 - y_i (w \cdot x_i) \right]_+$

slide-9
SLIDE 9

Sparse Linear Models

[Diagram: weight vector with many zero entries, e.g. $(w_1, 0, w_3, w_4, 0, w_6, 0, \ldots)$]

  • Use regularization functions or domain constraints that promote sparsity
  • Base tool: penalize or constrain the 1-norm of $w$: $\|w\|_1 = \sum_{j=1}^{n} |w_j|$
  • Use the base tools to build (promote) models with structural (block) sparsity

slide-10
SLIDE 10

Why 1-norm ?

[Figure: level sets of $L(w)$ in the $(w_1, w_2)$ plane meeting the constraint set $\|w\|_1 = z$ at a “corner” of the $\ell_1$ ball]

slide-11
SLIDE 11
1-norm as Proxy to “0-norm”

  • The 0-norm counts the number of non-zero coefficients (it is not a norm): $\|w\|_0 = |\{j \mid w_j \neq 0\}|$
  • ERM with 0-norm constraints is NP-hard
  • The 1-norm is a relaxation of the 0-norm
  • Under (mild to restrictive) conditions, the 1-norm “behaves like” the 0-norm (Candes ’06, Donoho ’06, ...)

slide-12
SLIDE 12

L1 and Generalization

  • By constraining or penalizing the 1-norm of the weights we prevent excessively “complex” predictors
  • The penalties / constraints are especially important for very high-dimensional data with binary features: $x \in \{0, 1\}^n$, e.g. $(0, 0, 1, 0, 1, 0, \ldots, 0, 0, 1, 0)$
  • By Hölder’s inequality, $\|w\|_1 \le z$ and $\|x\|_\infty = 1$ imply $|w \cdot x| \le z$: the 1-norm constraint caps the maximal value of the predictions

slide-13
SLIDE 13

Rough Outline

  • Algorithms with sparsity-promoting domain constraints
  • Algorithms with sparsity-promoting regularization
  • Efficient implementation in high dimensions
  • Structural sparsity from the base algorithms
  • Improved algorithms from the base algorithms
  • Coordinate descent with sparsity-promoting regularization [time permitting]
  • A few experimental results

(Slide annotations: “Work really well”, “Trust Me”, “Try Yourself”)

slide-14
SLIDE 14

Loss Minimization & Gradient Descent

  • Gradient descent main loop:
    • Compute the gradient: $\nabla_t L = \frac{1}{|S|} \sum_{i \in S} \frac{\partial}{\partial w} \ell(w; (x_i, y_i)) \Big|_{w = w_t}$
    • Update: $w_{t+1} \leftarrow w_t - \eta_t \nabla_t L$
  • Step size: $\eta_t \sim \frac{1}{\sqrt{t}}$ or $\eta_t \sim \frac{1}{t}$

[Figure: iterates $w_1, w_2, w_3$ descending the loss surface $L(w)$ along $-\nabla L(w_1), -\nabla L(w_2), -\nabla L(w_3)$]
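As a concrete illustration, here is a minimal sketch of this loop for the squared error loss, assuming a step size $\eta_t \propto 1/\sqrt{t}$; the dataset and mini-batch size are invented for the example (using a mini-batch anticipates the stochastic variant on the next slide).

```python
import numpy as np

def gradient_step(w, X, y, eta):
    """One gradient step for the squared error loss L(w) = mean((X w - y)^2) / 2."""
    grad = X.T @ (X @ w - y) / X.shape[0]   # average gradient over the given (mini-)batch
    return w - eta * grad

# Toy data; sampling a mini-batch gives the stochastic gradient estimate.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
w = np.zeros(5)
for t in range(1, 51):
    batch = rng.choice(100, size=10, replace=False)
    w = gradient_step(w, X[batch], y[batch], eta=1.0 / np.sqrt(t))
```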

slide-15
SLIDE 15

Stochastic Gradient

  • Often when the training set is large we can use an estimate of the gradient computed on a subset $S' \subset S$:

$$\hat{\nabla}_t L = \frac{1}{|S'|} \sum_{i \in S'} \frac{\partial}{\partial w} \ell(w; (x_i, y_i)) \Big|_{w = w_t}$$
slide-16
SLIDE 16

Projection Onto A Convex Set

$$\Pi_\Omega(w) = \arg\min_{v \in \Omega} \|v - w\|$$

Example: for $\Omega$ a (2-norm) ball of radius $z$ and $\|w\| > z$, the projection simply rescales: $u = \Pi_\Omega(w) = s\,w$ with $s = \frac{z}{\|w\|}$, i.e. $u_j = \frac{z}{\|w\|} w_j$
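A minimal sketch of this example in Python/NumPy (an illustrative implementation, not code from the tutorial):

```python
import numpy as np

def project_l2_ball(w, z):
    """Euclidean projection of w onto the 2-norm ball of radius z:
    rescale w onto the sphere if it lies outside, leave it unchanged otherwise."""
    norm = np.linalg.norm(w)
    if norm <= z:
        return w
    return (z / norm) * w
```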

slide-17
SLIDE 17

Gradient Descent with Domain Constraints

  • Loop:
    • Compute the (stochastic) gradient: $\hat{\nabla}_t L = \frac{1}{|S'|} \sum_{i \in S'} \frac{\partial}{\partial w} \ell(w; (x_i, y_i)) \Big|_{w = w_t}$
    • Update: $w_{t+1} = \Pi_\Omega\!\left( w_t - \eta_t \hat{\nabla}_t L \right)$
  • Similar convergence guarantees to GD

slide-18
SLIDE 18

Gradient Descent with 1-norm Constraint

[Figure: iterates $w_t, w_{t+1}, w_{t+2}$, each gradient step followed by a projection back onto the $\ell_1$ ball]

slide-19
SLIDE 19

Projection Onto the 1-norm Ball

[Figure: in two dimensions, both coordinates are shifted by the same threshold: $v_1 := v_1 - \theta$, $\; v_2 := v_2 - \theta$]

slide-20
SLIDE 20

Projection Onto the $\ell_1$ Ball

[Figure: coordinates are also clipped at zero: $v_1 := \max\{0, v_1 - \theta\}$, $\; v_2 := \max\{0, v_2 - \theta\}$]

slide-21
SLIDE 21

Projection Onto the $\ell_1$ Ball

General coordinate-wise form (soft thresholding): $v_j := \mathrm{sign}(v_j) \max\{0, |v_j| - \theta\}$

slide-22
SLIDE 22

Algebraic-Geometric View

[Figure: components $v_1, \ldots, v_7$ shown as bars; each component above the threshold $\theta$ is reduced by $\theta$, the rest become zero]

slide-23
SLIDE 23

Algebraic-Geometric View

[Figure: the components above the threshold ($v_1, v_2, v_4, v_5$) are each reduced by $\theta$ and must sum to $z$]

$$(v_1 - \theta) + (v_2 - \theta) + (v_4 - \theta) + (v_5 - \theta) = z \;\Rightarrow\; \theta = \frac{v_1 + v_2 + v_4 + v_5 - z}{4}$$

slide-24
SLIDE 24

Chicken and Egg Problem

  • Had we known the threshold, we could have found all the zero elements
  • Had we known which elements become zero, we could have calculated the threshold

slide-25
SLIDE 25

From Egg to Omelet

If $v_j < v_k$ and, after the projection, the $k$'th component is zero, then the $j$'th component must be zero as well.

[Figure: two components $v_3, v_6$ lying below the threshold $\theta$]

slide-26
SLIDE 26

The Omelet

If two feasible solutions exist, one with $k$ and one with $k+1$ non-zero elements, then the solution with $k+1$ non-zero elements attains a lower loss.

[Figure: components $v_1, \ldots, v_7$]

slide-27
SLIDE 27

Calculating the Projection

  • Sort the vector to be projected: $v \Rightarrow \mu$ s.t. $\mu_1 \ge \mu_2 \ge \mu_3 \ge \ldots \ge \mu_n$
  • If $j$ is a feasible index then $\mu_j > \theta \;\Rightarrow\; \mu_j > \frac{1}{j} \left( \sum_{r=1}^{j} \mu_r - z \right)$
  • Number of non-zero elements: $\rho = \max\left\{ j \;:\; \mu_j - \frac{1}{j} \left( \sum_{r=1}^{j} \mu_r - z \right) > 0 \right\}$
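A sketch of this sort-based projection in Python/NumPy, combined with the soft-thresholding form from the earlier slides; it is an illustrative implementation under the usual convention $\theta = \frac{1}{\rho}\left(\sum_{r=1}^{\rho}\mu_r - z\right)$, not code from the tutorial.

```python
import numpy as np

def project_l1_ball(v, z):
    """Euclidean projection of v onto the l1 ball of radius z (sort-based, O(n log n))."""
    if np.sum(np.abs(v)) <= z:
        return v.copy()                            # already inside the ball
    mu = np.sort(np.abs(v))[::-1]                  # mu_1 >= mu_2 >= ... >= mu_n
    cumsum = np.cumsum(mu)
    j = np.arange(1, len(v) + 1)
    feasible = mu - (cumsum - z) / j > 0           # feasibility test for each index
    rho = j[feasible][-1]                          # number of non-zero elements
    theta = (cumsum[rho - 1] - z) / rho            # threshold
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# Example: the projected vector has 1-norm (approximately) z.
v = np.array([0.9, -0.1, 0.5, -1.3, 0.2])
print(np.sum(np.abs(project_l1_ball(v, z=1.0))))   # ~1.0
```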

slide-28
SLIDE 28

Calculating the Projection

[Figure: an example with components $v_1, \ldots, v_7$, of which three survive the projection]

$$v_4 - (v_4 - z) > 0, \qquad v_5 - \tfrac{1}{2}(v_4 + v_5 - z) > 0$$

$$\rho = 3, \qquad \theta = \tfrac{1}{3}(v_2 + v_4 + v_5 - z)$$

slide-29
SLIDE 29

More Efficient Procedure

  • Assume we know the number of elements greater than or equal to $v_j$: $\rho(v_j) = |\{v_i : v_i \ge v_j\}|$
  • Assume we know the sum of the elements greater than or equal to $v_j$: $s(v_j) = \sum_{i : v_i \ge v_j} v_i$
  • Then we can check the status of $v_j$ in constant time:

$$v_j > \theta \;\Leftrightarrow\; v_j > \frac{1}{\rho(v_j)} \left( s(v_j) - z \right) \;\Leftrightarrow\; s(v_j) - \rho(v_j)\, v_j < z$$

  • Randomized median-like search: $O(n)$ instead of $O(n \log n)$

slide-30
SLIDE 30

Efficient Implementation in High Dimensions

  • In many applications the dimension is very high [text applications: dictionaries of 20+ million words; web data: often more than $10^{10}$ different html tokens]
  • Each example has a small number of non-zero elements [text applications: a news document contains 1000s of words; web data: a web page is often short, fewer than $10^4$ html tokens]
  • Online/stochastic updates only modify the weights corresponding to the non-zero features in the example
  • Use a red-black (RB) tree to store only the non-zero weights, plus an additional data structure and lazy evaluation
  • Upon projection, removal of a whole sub-tree is performed in logarithmic time with Tarjan’s (’83) algorithm for splitting an RB tree

slide-31
SLIDE 31

Empirical Results for Digit Recognition

  • 60,000 training examples, 28x28 pixel images
  • Engineered 25,000 features
  • Multiclass logistic regression:
    • Gradient descent with L1 projection
    • Exponentiated Gradient (EG, mirror descent), by Prof. Manfred Warmuth and colleagues
  • Batch (deterministic) and stochastic GD & EG

$$w_{t+1,j} = \frac{w_{t,j} \, e^{-\eta_t \hat{\nabla} L_{t,j}}}{Z_t}$$

slide-32
SLIDE 32

GD+L1 vs. EG on MNIST

[Plots: $f - f^*$ (log scale) vs. the number of gradient evaluations (deterministic) and vs. the number of stochastic subgradient evaluations (stochastic), comparing EG and L1]

slide-33
SLIDE 33

Sparsity “on-the-fly”

[Plot: % sparsity vs. number of training examples for text classification (800,000 docs.); curves for % of total features and % of total features seen]

slide-34
SLIDE 34
Penalized Risk Minimization & L1

  • Penalized empirical risk minimization:

$$\min_{w} \; L(w) + \lambda \|w\|_1$$

[Figure: the two terms of the objective, the loss $L(w)$ and the $\ell_1$ penalty $\|w\|_1$]

slide-35
SLIDE 35

Subgradients

  • Subgradient set of a function $f$ at $x_0$:

$$\partial f(x_0) = \left\{ g \;:\; f(x) \ge f(x_0) + g \cdot (x - x_0) \;\; \forall x \right\}$$

slide-36
SLIDE 36

ERM with L1 using Subgradients

  • Penalized empirical risk minimization: $\min_{w} \; L(w) + \lambda \|w\|_1$
  • Unconstrained (stochastic) subgradient descent: $w_{t+1} = w_t - \eta_t g_t$ with $g_t \in \partial\left( L(w) + \lambda R(w) \right)\big|_{w = w_t}$
  • Subgradient for L1, $R(w) = \|w\|_1$:

$$\frac{\partial R}{\partial w_j} = \begin{cases} [-1, 1] & w_{t,j} = 0 \\ \mathrm{sign}(w_{t,j}) & w_{t,j} \neq 0 \end{cases}$$

slide-37
SLIDE 37

Subgradients: Caveats

  • The subgradient set is large at singularities
  • Subgradients are “non-informative” at singularities

$$\partial \|w\|_1 \text{ at } w = 0 \text{ is } \{ w \mid \|w\|_\infty \le 1 \}, \qquad \partial \|w\|_2 \text{ at } w = 0 \text{ is } \{ w \mid \|w\|_2 \le 1 \}$$

★ DENSE SOLUTIONS ★ SLOW CONVERGENCE

slide-38
SLIDE 38

Two Phase Approach

[Figure: $w_t \to w_{t+\frac{1}{2}}$ by a gradient step $-\eta_t \hat{\nabla} L_t$ (GD on $L$ only), then $w_{t+\frac{1}{2}} \to w_{t+1}$ by solving analytically with the regularizer]

slide-39
SLIDE 39

Two Phase Approach

  • First phase: unconstrained (stochastic) gradient step on the empirical loss, with learning rate $\eta_t$: $w_{t+\frac{1}{2}} = w_t - \eta_t g_t$, where $g_t \in \partial L(w)\big|_{w = w_t}$
  • Second phase: incorporate the regularization and solve, with learning rate $\eta_{t+\frac{1}{2}}$:

$$w_{t+1} = \arg\min_{w} \left\{ \frac{1}{2} \left\| w - w_{t+\frac{1}{2}} \right\|^2 + \eta_{t+\frac{1}{2}} \lambda\, R(w) \right\}$$

slide-40
SLIDE 40

Analysis Tool

  • If $f(\theta)$ is differentiable, then at the optimum $\theta^\star$: $\nabla f(\theta^\star) = 0$
  • If $f(\theta)$ is continuous (possibly non-differentiable), then at an optimum $\theta^\star$: $0 \in \partial f(\theta^\star)$

slide-41
SLIDE 41
Forward Looking Subgradient

  • The optimum ($w_{t+1}$) satisfies

$$0 \in w_{t+1} - w_t + \eta_t \nabla L(w_t) + \eta_{t+\frac{1}{2}} \lambda\, \partial R(w_{t+1})$$

  • Equivalently, with $g^L_t$ the current gradient of the empirical loss and $g^R_{t+1}$ a forward subgradient of the regularization:

$$w_{t+1} = w_t - \eta_t\, g^L_t - \eta_{t+\frac{1}{2}} \lambda\, g^R_{t+1}$$

FORWARD & ONWARD LOOKING SUBGRADIENT: FOBOS (FOLOS)

slide-42
SLIDE 42

Setting The Learning Rate (*)

  • With an appropriate choice of $\eta_{t+\frac{1}{2}}$ (either $\eta_{t+1}$ or $\eta_t$) we obtain batch convergence and online regret bounds
  • The analysis exploits the forward-looking property
  • With $\eta_t \propto \frac{1}{\sqrt{t}}$:

$$\min_{t = 1, \ldots, T} \; L(w_t) + \lambda R(w_t) - \left( L(w^\star) + \lambda R(w^\star) \right) = O\!\left( \frac{1}{\sqrt{T}} \right)$$

  • With $\eta_t \propto \frac{1}{t}$ (strong convexity): $O\!\left( \frac{\log T}{T} \right)$

slide-43
SLIDE 43

Similar Approaches

  • Truncation, shrinkage, and many other names:
    • Wright, Nowak, Figueiredo, “Sparse Reconstruction by Separable Approximation”
    • Langford, Li, Zhang, “Sparse Online Learning via Truncated Gradient”
    • Beck & Teboulle, “Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems” (and many references therein)
  • Fobos provides a distilled and simple analysis, which makes it easy to use and to obtain structural sparsity

slide-44
SLIDE 44

Fobos with L1

  • Minimize $\min_{w} \; \frac{1}{2} \left\| w - w_{t+\frac{1}{2}} \right\|^2 + \lambda \|w\|_1$, which decomposes per coordinate into $\frac{1}{2} \left( w - w_{t+\frac{1}{2},j} \right)^2 + \lambda |w|$
  • Hinge function: $[z]_+ = \max\{0, z\}$
  • The update yields sparsity (soft thresholding):

$$w_{t+1,j} = \mathrm{sign}\!\left( w_{t+\frac{1}{2},j} \right) \left[ \left| w_{t+\frac{1}{2},j} \right| - \lambda \eta_{t+\frac{1}{2}} \right]_+$$

[Figure: $w_{t+1,j} = 0$ whenever $\left| w_{t+\frac{1}{2},j} \right| \le \lambda \eta_{t+\frac{1}{2}}$]
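A minimal sketch of this update in Python/NumPy, assuming for simplicity that the same learning rate is used in both phases ($\eta_{t+\frac{1}{2}} = \eta_t$); an illustration, not code from the tutorial.

```python
import numpy as np

def fobos_l1_step(w, grad, eta, lam):
    """One FOBOS step with R(w) = ||w||_1:
    (1) unconstrained gradient step on the loss, (2) closed-form soft-thresholding."""
    w_half = w - eta * grad
    return np.sign(w_half) * np.maximum(np.abs(w_half) - lam * eta, 0.0)
```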

slide-45
SLIDE 45

Structural (Block) Sparsity

Often the parameters have a predefined grouping:
  • multiclass categorization
  • multi-task learning
  • tasks that use the same feature set

We would like to zero a whole group or keep it whole.

[Figure: weight vector $w_1, \ldots, w_n$ with an entire block of consecutive entries set to zero]

slide-46
SLIDE 46

Structural Sparsity

Multiclass: each class has a weight vector that operates over the same features, e.g. $f_1(x) = \langle w_1, x \rangle$ and

$$\Pr(y = j \mid x) = \frac{e^{f_j(x)}}{\sum_{r=1}^{k} e^{f_r(x)}}$$

[Figure: the weight matrix $W = (w_{ij})$ with rows indexed by the features $X_1, \ldots, X_n$ and columns by the $k$ classes]

If feature $X_2$ is irrelevant, then the weights associated with $X_2$ in every class are redundant.

slide-47
SLIDE 47

Tool for Structural Sparsity: Fobos with L2

  • If $R(w) = \|w\|_2^2$, the two-phase update amounts to gradient descent plus shrinkage:

$$w_{t+1} = \frac{w_{t+\frac{1}{2}}}{1 + \lambda \eta_{t+\frac{1}{2}}}$$

  • If $R(w) = \|w\|_2$, we obtain an all-or-nothing update: the zero vector if $\left\| w_{t+\frac{1}{2}} \right\| \le \lambda \eta_{t+\frac{1}{2}}$, otherwise shrinkage:

$$w_{t+1} = \left[ 1 - \frac{\lambda \eta_{t+\frac{1}{2}}}{\left\| w_{t+\frac{1}{2}} \right\|} \right]_+ w_{t+\frac{1}{2}}$$

slide-48
SLIDE 48
Fobos with L2

  • Define $v = w_{t+\frac{1}{2}}$ and $\tilde{\lambda} = \eta_{t+\frac{1}{2}} \lambda$, and assume $\|w_{t+1}\| > 0$
  • Subgradient of the 2-norm: $\frac{\partial}{\partial w_k} \sqrt{\sum_j w_j^2} = \frac{w_k}{\sqrt{\sum_j w_j^2}} \;\Rightarrow\; \partial \|w\| = \frac{w}{\|w\|}$ for $w \ne 0$
  • “Solve”: 0 must lie in the subgradient set of the “mini” objective:

$$0 \in \frac{\partial}{\partial w} \left\{ \frac{1}{2} \|w - v\|^2 + \tilde{\lambda} \|w\| \right\} \;\Rightarrow\; w - v + \tilde{\lambda} \frac{w}{\|w\|} = 0 \;\Rightarrow\; w \left( 1 + \frac{\tilde{\lambda}}{\|w\|} \right) = v \;\Rightarrow\; w = s\,v$$

slide-49
SLIDE 49

Fobos with L2

  • Find the minimum w.r.t. $s$: $\frac{d}{ds} \left\{ \frac{1}{2} \|s v - v\|^2 + \tilde{\lambda} \|s v\| \right\} = 0 \;\Rightarrow\; (s - 1) \|v\| + \tilde{\lambda} = 0$
  • This implies that $s = 1 - \frac{\tilde{\lambda}}{\|v\|}$
  • Now we need to impose the constraint $s \ge 0$
  • Therefore $s = 0$ if $1 - \frac{\tilde{\lambda}}{\|v\|} \le 0$, i.e. if $\|v\| \le \tilde{\lambda}$, and we get

$$w_{t+1} = \left[ 1 - \frac{\lambda \eta_{t+\frac{1}{2}}}{\left\| w_{t+\frac{1}{2}} \right\|} \right]_+ w_{t+\frac{1}{2}}$$
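A minimal sketch of this all-or-nothing block update in Python/NumPy (same simplifying assumption as before, a single learning rate for both phases); an illustration, not code from the tutorial.

```python
import numpy as np

def fobos_l2_step(w_group, grad_group, eta, lam):
    """One FOBOS step for a block of weights with R(w) = ||w||_2 (not squared):
    the whole block is either shrunk toward zero or zeroed out entirely."""
    v = w_group - eta * grad_group                  # first phase: gradient step on the loss
    norm = np.linalg.norm(v)
    scale = max(0.0, 1.0 - lam * eta / norm) if norm > 0 else 0.0
    return scale * v                                # [1 - lam*eta/||v||]_+ * v
```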

slide-50
SLIDE 50
Fobos with L∞

  • With $R(w) = \|w\|_\infty$, the update amounts to thresholding: $w_{t+1,j} = \min\left\{ w_{t+\frac{1}{2},j},\, \theta \right\}$
  • If $\theta = 0$ then the result is the zero vector
  • But how do we find $\theta$? Via the dual problem, which is a projection onto the L1 ball:

$$\min_{w} \; \frac{1}{2} \left\| w - w_{t+\frac{1}{2}} \right\|^2 + \lambda \eta_{t+\frac{1}{2}} \|w\|_\infty \qquad \Longleftrightarrow \qquad \max_{u} \; -\frac{1}{2} \left\| u - w_{t+\frac{1}{2}} \right\|^2 \;\; \text{s.t. } \|u\|_1 \le \lambda \eta_{t+\frac{1}{2}}$$

slide-51
SLIDE 51

Mixed-Norms

$$\|W\|_{\ell_1/\ell_q} = \sum_{i=1}^{n} \|\bar{w}_i\|_q = \sum_{i=1}^{n} |r_i|, \qquad r_i = \|\bar{w}_i\|_q$$

[Figure: the weight matrix $W = (w_{ij})$, $i = 1, \ldots, n$, $j = 1, \ldots, k$; the $q$-norm is applied to each row $\bar{w}_i$, and the 1-norm to the resulting vector of row norms $(r_1, \ldots, r_n)$]

slide-52
SLIDE 52

Fobos with Mixed-Norms

  • Applications with grouped parameters: multiclass categorization and multitask prediction
  • Block-Lasso with tied parameters (weights)
  • Use L1 over the feature space and the 2-norm or infinity norm over the different classes / blocks / tasks, and obtain “structural sparsity” (“group sparsity”)

slide-53
SLIDE 53

High Dim Data ➪ Sparse Gradients

[Table: gradient coordinates $g_1, \ldots, g_4$ over rounds $t = 1, 2, \ldots, 6, \ldots$; only a few entries are non-zero in each round]

slide-54
SLIDE 54

Fobos in High Dimensions

  • The input space is often sparse (e.g. short documents from a very large dictionary)
  • Need to perform “just-in-time” updates
  • The following lemma comes to the rescue: applying P.1 for $T$ consecutive rounds is equivalent to a single application of P.2 ($T \times \text{P.1} \equiv \text{P.2}$)

$$\text{P.1}: \;\; w_t = \arg\min_{w} \frac{1}{2} \|w - w_{t-1}\|^2 + \lambda_t \|w\|_q \qquad\qquad \text{P.2}: \;\; w = \arg\min_{w} \frac{1}{2} \|w - w_0\|^2 + \left( \sum_{t=1}^{T} \lambda_t \right) \|w\|_q$$
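A sketch of how this lemma enables lazy (“just-in-time”) FOBOS/L1 updates: shrinkage steps for coordinates that received no gradient are deferred and applied in one shot with the accumulated regularization amount. The class and variable names are illustrative, not from the tutorial, and the per-round amounts are taken to be $\lambda \eta_t$.

```python
import numpy as np

def soft_threshold(w, amount):
    return np.sign(w) * np.maximum(np.abs(w) - amount, 0.0)

class LazyL1Fobos:
    """Just-in-time FOBOS/L1: deferred shrinkage, applied later with the summed lambdas."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
        self.cum = 0.0                    # running sum of lambda * eta over all rounds
        self.seen = np.zeros(dim)         # value of cum when each coordinate was last updated

    def update(self, idx, grad, eta, lam):
        # Catch the touched coordinates up on all skipped shrinkage steps (the lemma).
        self.w[idx] = soft_threshold(self.w[idx], self.cum - self.seen[idx])
        # Ordinary FOBOS/L1 step, restricted to the touched coordinates.
        self.w[idx] = soft_threshold(self.w[idx] - eta * grad, lam * eta)
        self.cum += lam * eta
        self.seen[idx] = self.cum

    def snapshot(self):
        # Flush pending shrinkage before reading the full weight vector.
        return soft_threshold(self.w, self.cum - self.seen)
```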

slide-55
SLIDE 55

High Dimensional Update

[Table: as on the previous slide, gradient coordinates over rounds; for a coordinate that receives no gradient between $t = 2$ and $t = 6$, the regularization phases are skipped (lazy evaluation) and a single FOBOS update is applied with $\lambda = \sum_{t=2}^{6} \lambda_t$]

slide-56
SLIDE 56

Fobos vs. Subgradient vs. Interior Point

[Plot: $f(w_t) - f(w^*)$ (log scale) vs. number of operations, comparing L1 Folos, L1 IP, and L1 Subgrad]

slide-57
SLIDE 57

Fobos and Sparsity

[Plot: sparsity proportion vs. Folos steps, for $\lambda = 1$, $\lambda = 2$, and $\lambda = 10$]

slide-58
SLIDE 58

Folos with Mixed-Norms

[Figure: recovered sparsity patterns compared with the true pattern, for $\ell_1/\ell_2$, $\ell_1/\ell_\infty$, and plain $\ell_1$ regularization]

slide-59
SLIDE 59

Multiple Image Reconstruction

  • Sequence of similar images (e.g. video frames)
  • MSE for compression with wavelet functions
  • Use a mixed-norm to decide which functions to use

[Plot: MSE (log scale) vs. group sparsity, for L1/L1, L1/L2, and L1/Linf regularization]

slide-60
SLIDE 60

Problem with Fobos (*advanced*)

  • The gradient step treats all features as equal
  • They are not!
  • Can we “adapt” to the “geometry” of the space?

[Table: toy examples $(y_t, x_{t,1}, x_{t,2}, x_{t,3})$; some features are frequent but “garbage” (“EASY”), others are infrequent but informative (“HARD”)]

slide-61
SLIDE 61

Adapting to the Problem’s Geometry

  • (Stochastic) gradient of the objective: $g_t = \hat{\nabla} L(w)\big|_{w = w_t}$
  • Fobos update: $w_{t+1} = \arg\min_{w} \left\{ \frac{1}{2} \|w - w_t\|^2 + \eta \langle g_t, w \rangle + \eta R(w) \right\}$
  • Adaptive gradient update, with $\|w\|_A^2 = \langle w, A w \rangle$ and $A \succeq 0$:

$$w_{t+1} = \arg\min_{w} \left\{ \frac{1}{2} \|w - w_t\|_A^2 + \eta \langle g_t, w \rangle + \eta R(w) \right\}$$

HOW TO SET AND ADAPT THE MATRIX A?

slide-62
SLIDE 62

“Offline” Motivation

  • Suppose we were to find a matrix $S$ s.t. the gradients look “good” w.r.t. $S$ (the geometry becomes “easy”)
  • $S$ must be positive semidefinite
  • We need to limit the norm of $S$ (degrees of freedom):

$$\min_{S} \; \sum_{t=1}^{T} \left\langle g_t, S^{-1} g_t \right\rangle \quad \text{s.t. } S \succeq 0, \;\; \mathrm{tr}(S) \le c \qquad \Rightarrow \qquad S = \frac{c}{\mathrm{tr}\!\left( G_T^{1/2} \right)} \, G_T^{1/2}, \qquad G_T = \sum_{t=1}^{T} g_t g_t^\top$$

slide-63
SLIDE 63

Efficient Adaptation

  • Limit to diagonal matrices $A_t = \mathrm{diag}(a_{t,1}, a_{t,2}, \ldots, a_{t,d})$
  • Smooth the diagonal to ensure positive definiteness: $a_{t,j} = \sqrt{ \sum_{\tau=1}^{t} g_{\tau,j}^2 } + \delta$
  • Solve (e.g.) the minimization with L1 using Fobos, with a per-feature learning rate:

$$w_{t+\frac{1}{2},j} = w_{t,j} - \frac{\eta}{a_{t,j}} \, g_{t,j}, \qquad w_{t+1,j} = \mathrm{sign}\!\left( w_{t+\frac{1}{2},j} \right) \left[ \left| w_{t+\frac{1}{2},j} \right| - \frac{\lambda \eta}{a_{t,j}} \right]_+$$
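A minimal sketch of this diagonal adaptive-gradient update with L1 regularization, directly following the two formulas above; the hyperparameter values are placeholders.

```python
import numpy as np

class AdaGradL1:
    """Diagonal adaptive-gradient update with L1: per-feature learning rate eta / a_{t,j},
    where a_{t,j} = sqrt(sum of squared gradients) + delta, followed by soft-thresholding."""
    def __init__(self, dim, eta=0.1, lam=0.01, delta=1e-6):
        self.w = np.zeros(dim)
        self.g2 = np.zeros(dim)            # running sum of squared gradients per feature
        self.eta, self.lam, self.delta = eta, lam, delta

    def step(self, grad):
        self.g2 += grad ** 2
        a = np.sqrt(self.g2) + self.delta  # a_{t,j}
        w_half = self.w - (self.eta / a) * grad
        self.w = np.sign(w_half) * np.maximum(np.abs(w_half) - self.lam * self.eta / a, 0.0)
        return self.w
```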

slide-64
SLIDE 64

Convergence and Regret

  • The asymptotic rate of convergence is similar
  • Concrete convergence rate:

$$L(w_t) + \lambda R(w_t) - \left( L(w^*) + \lambda R(w^*) \right) \le \frac{\sqrt{2d}\, D_\infty\, C}{T}$$

$$D_\infty = \|w_0 - w^*\|_\infty, \qquad C = \sqrt{ \inf_{S} \left\{ \sum_{t=1}^{T} g_t^\top S^{-1} g_t \;:\; S \succeq 0, \; \mathrm{tr}(S) \le d \right\} }$$

FAST RATE (STRONGLY CONVEX) · SCALES WELL WITH DIMENSION · SMALL WHEN THE SPACE CAN BE RESHAPED “ON THE FLY”

slide-65
SLIDE 65

Text Classification Experiment

  • Reuters RCV1 document classification task:
    • 800,000 documents, two million features total
    • About 4000 non-zero features per document
    • Four top-level categories (ECAT, CCAT, GCAT, MCAT)

[Table: error rates of Fobos, AdaFobos, PA, and AROW on the Economics, Corporate, Government, and Medicine categories (0.580, 0.440, 0.059, 0.049, 0.111, 0.053, 0.107, 0.061, 0.056, 0.040, 0.066, 0.044, 0.056, 0.035, 0.053, 0.039)]
slide-66
SLIDE 66

Sparsity (NNZ) on RCV1

[Bar chart: number of non-zero weights per category, L1 Only vs. L1 & AdaGrad]

                ECAT   CCAT    GCAT   MCAT
  L1 Only       6.29   7.99   10.54   8.64
  L1 & AdaGrad  7.36   9.24   12.32   9.91

slide-67
SLIDE 67

Sparsity Level of AdaGrad

[Plot: test-set error rate vs. proportion of non-zero weights (log scale), comparing AdaGrad and AROW]

(AROW: a second-order algorithm without sparsity)

10% NNZ WEIGHTS WHILE PERFORMANCE IS CLOSE TO OPTIMAL

slide-68
SLIDE 68

Coordinate Descent with L1

  • Fix all weights but one
  • Solve for the single “unfrozen” weight
  • Cycle through the coordinates (randomly, using a duality-gap criterion, ...)
  • The update can be parallelized such that all coordinates are updated at the same time, but with a less aggressive step

slide-69
SLIDE 69

Example: Exp-Loss with L1

  • Classification problem: $S = \{(x_i, y_i)\}_{i=1}^{m}$ with $x_i \in \{0, 1\}^n$ (i.e. $x_{i,j} \in \{0, 1\}$) and $y_i \in \{-1, +1\}$
  • Use the exponential loss (AdaBoost) as the empirical loss [see also Yoav Freund’s tutorial]: find $w$ for

$$L(w) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i (w \cdot x_i)}$$

  • Impose sparsity using L1 regularization:

$$Q(w) = \frac{1}{m} \sum_{i=1}^{m} e^{-y_i (w \cdot x_i)} + \lambda \|w\|_1$$

slide-70
SLIDE 70

Single Coordinate Update

  • Weight vector excluding coordinate $j$: $w_{\downarrow j}$
  • Instances excluding coordinate $j$: $x_{\downarrow j}$
  • Freeze all weights and find the minimum w.r.t. $w_j$:

$$\frac{1}{m} \sum_{i=1}^{m} e^{-y_i \left( w_{\downarrow j} \cdot x_{i \downarrow j} + w_j x_{i,j} \right)} + \lambda |w_j|$$

  • Define:

$$\mu_j^{-} = \sum_{y_i x_{i,j} = -1} e^{-y_i \left( w_{\downarrow j} \cdot x_{i \downarrow j} \right)}, \qquad \mu_j^{+} = \sum_{y_i x_{i,j} = +1} e^{-y_i \left( w_{\downarrow j} \cdot x_{i \downarrow j} \right)}$$

slide-71
SLIDE 71

Single Coordinate Update

  • AdaBoost exp-loss per feature: $\mu_j^{+} e^{-w_j} + \mu_j^{-} e^{w_j}$
  • Incorporating the regularization: $\mu_j^{+} e^{-w_j} + \mu_j^{-} e^{w_j} + \lambda |w_j|$
  • Assume that $\mu_j^{+} > \mu_j^{-}$, so that $w_j \ge 0$; then

$$\frac{d}{dw_j} \left[ \mu_j^{+} e^{-w_j} + \mu_j^{-} e^{w_j} + \lambda w_j \right] = 0 \;\Rightarrow\; -\mu_j^{+} e^{-w_j} + \mu_j^{-} e^{w_j} + \lambda = 0$$

  • The solution along a single coordinate is found by solving a polynomial equation: with $a = e^{w_j}$,

$$-\mu_j^{+} / a + \mu_j^{-} a + \lambda = 0 \;\Rightarrow\; \mu_j^{-} a^2 + \lambda a - \mu_j^{+} = 0$$

slide-72
SLIDE 72

Update with L1 Regularization

δj = ⇥ ⇧ ⇧ ⇧ ⇧ ⌅ ⇧ ⇧ ⇧ ⇧ ⇤ −wj

  • µ+

j eηwj − µ− j e−ηwj

≤ λ η log

−λ+ q λ2+4µ+

j µ− j

2µ−

j

µ+

j eηwj > µ− j e−ηwj + λ

η log

λ+ q λ2+4µ+

j µ− j

2µ−

j

µ+

j eηwj < µ− j e−ηwj − λ

wj = 0 ∧ |µ+

j − µ− j | ≤ λ

⇒ δj = 0

Regularization pushes weights back to zero if correlation difference is small

wj ← wj + δj

Step size (1 for a single coordinate)
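A sketch of a single coordinate-descent step for the L1-regularized exp-loss, following the derivation above with step size $\eta = 1$; the dense 0/1 data matrix, the guard against an empty $\mu_j^{-}$ sum, and all names are illustrative choices, not code from the tutorial.

```python
import numpy as np

def coord_step_exp_loss_l1(w, X, y, j, lam):
    """One coordinate update for Q(w) = sum_i exp(-y_i w.x_i) + lam*||w||_1, X in {0,1}, y in {-1,+1}."""
    # Margins with coordinate j removed: w_{down j} . x_{i, down j}
    margins_wo_j = y * (X @ w - X[:, j] * w[j])
    weights = np.exp(-margins_wo_j)
    mu_plus = np.sum(weights[(y * X[:, j]) == 1])                 # mu_j^+
    mu_minus = max(np.sum(weights[(y * X[:, j]) == -1]), 1e-12)   # mu_j^- (guarded for the sketch)
    if abs(mu_plus - mu_minus) <= lam:
        new_wj = 0.0                                              # regularization pulls the weight to zero
    elif mu_plus > mu_minus + lam:
        new_wj = np.log((-lam + np.sqrt(lam**2 + 4 * mu_plus * mu_minus)) / (2 * mu_minus))
    else:
        new_wj = np.log((lam + np.sqrt(lam**2 + 4 * mu_plus * mu_minus)) / (2 * mu_minus))
    w = w.copy()
    w[j] = new_wj                                                 # equivalent to w_j <- w_j + delta_j
    return w
```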

slide-73
SLIDE 73

Sibyl: A Large Scale Supervised Learning System

  • Uses parallel boosting (Collins et al.) with L1 & L2 regularization
  • See also the talk by Tushar Chandra (this past Friday)
  • Sibyl currently serves:
    • YouTube: featured video & related videos
    • Ads: Gmail, mobile ads, product ads
    • ...
  • Can deal with all but the largest supervised problems [1+ billion features, 100s of billions of examples]

slide-74
SLIDE 74

Thanks

  • MLSS organizers & Prof. Manfred Warmuth
  • John Duchi, UCB & Google
  • Samy Bengio, Fernando Pereira, ... (Google Research)
  • Sibyl team
  • Further details:

http://www.magicbroom.info/Sparsity.html