Efficient Algorithms for Learning Sparse Models from High Dimensional Data
Yoram Singer Google Research
Machine Learning Summer School, UC Santa Cruz, July 18, 2012
The set of possible features is huge (e.g. all words in a dictionary, all possible HTML tokens)
Only a small subset is relevant to the task at hand
Using all possible features is infeasible
Example document: "THE HIGHER MINIMUM WAGE THAT WAS SIGNED INTO LAW ... WILL BE WELCOME RELIEF OF WORKERS ... THE 90 CENT-AN-HOUR INCREASE..."  with topic labels REGULATIONS, LABOUR ECONOMICS
Portfolio selection: a huge number of possible financial instruments across the globe (stocks, bonds, ETFs, cash, options, ...)
Select a small subset of instruments to achieve a certain goal (e.g. a volatility-return profile)
A large number of boolean predicates, e.g.
query:"flowers", or query:"flowers" && creative_keyword:"flower" && lang:"en-US"
$x \in \{0,1\}^n$, e.g. $(0, 0, 1, 0, 1, 0, \ldots, 0, 0, 1, 0)$
Which Predicates are important?
(i) Feature induction (ii) Model fitting (iii) Feature pruning
Efficient algorithms for learning “compact” linear models from large amounts of high dimensional data
Input $x$, weights $w$, prediction $\hat{y} = \sum_{j=1}^{n} w_j x_j$, true target $y$, loss function $\ell(\hat{y}, y)$
Examples of losses: squared error, exponential loss
Find $w$ that attains a small loss .... and performs well on unseen data
Balance loss minimization and the "complexity" of $w$
$$\min_w\ \frac{1}{m}\sum_{i=1}^{m}\ell\big(w;(x_i,y_i)\big) \qquad\qquad \max_w\ \sum_{t=1}^{T}\log\big(w\cdot x_t\big)\ \ \text{s.t.}\ \ w\in\Delta$$
PENALIZED EMPIRICAL RISK: $\ \arg\min_w\ R(w) + \mathbb{E}_{(x,y)\sim D}\big[\ell(w;(x,y))\big]$
DOMAIN CONSTRAINED EMPIRICAL RISK: $\ \arg\min_w\ \mathbb{E}_{(x,y)\sim D}\big[\ell(w;(x,y))\big]\ \ \text{s.t.}\ \ w\in\Omega$
Example (SVM):
$$\arg\min_w\ \sigma\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m}\big[1 - y_i(w\cdot x_i)\big]_+$$
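A minimal sketch of the hinge-loss example above, assuming NumPy; the variable names (`X`, `y`, `sigma`) are illustrative and not from the slides:

```python
import numpy as np

def svm_objective(w, X, y, sigma):
    """sigma*||w||^2 + (1/m) * sum_i [1 - y_i (w . x_i)]_+  (hinge loss)."""
    margins = y * (X @ w)                      # y_i (w . x_i) for every example
    hinge = np.maximum(0.0, 1.0 - margins)     # [1 - y_i (w . x_i)]_+
    return sigma * np.dot(w, w) + hinge.mean()
```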
Weights $W$: $(w_1, 0, w_3, w_4, 0, w_6, 0, \ldots)$
Regularizers that promote sparsity
Structural (block) sparsity
$$\|w\|_1 = \sum_{j=1}^{n} |w_j| \qquad\qquad \|w\|_0 = \big|\{\, j : w_j \ne 0 \,\}\big| \ \ \text{(not a norm)}$$
The 1-norm "behaves like" the 0-norm (Candes '06, Donoho '06, ...); sparse solutions sit at the "corners" of the $\ell_1$ ball.
By bounding the weights we prevent excessively "complex" predictors.
Very high-dimensional data with binary features: $x \in \{0,1\}^n$, e.g. $(0, 0, 1, 0, 1, 0, \ldots, 0, 0, 1, 0)$
A 1-norm constraint caps the maximal value of the predictions (Hölder's inequality: $|w\cdot x| \le \|w\|_1\,\|x\|_\infty$)
[time permitting]
Work really well Trust Me Try Yourself
[Illustration: gradient descent on $L(w)$, with gradients $\nabla L(w_1), \nabla L(w_2), \nabla L(w_3)$ at iterates $w_1, w_2, w_3$.]
Gradient over the training set $S$:
$$\nabla_t L = \frac{1}{|S|}\sum_{i\in S}\frac{\partial}{\partial w}\,\ell\big(w;(x_i,y_i)\big)$$
Update: $w_{t+1} \leftarrow w_t - \eta_t \nabla_t L$, with STEP SIZE $\eta_t \sim \tfrac{1}{\sqrt{t}}$ or $\eta_t \sim \tfrac{1}{t}$
Stochastic version: use an estimate of the gradient over a small random subset $S' \subset S$:
$$\hat\nabla_t L = \frac{1}{|S'|}\sum_{i\in S'}\frac{\partial}{\partial w}\,\ell\big(w;(x_i,y_i)\big)$$
Projection: $\arg\min_{v\in\Omega}\|v - w\|$, e.g. $\Omega$ = the 1-norm ball of radius $z$
Similar convergence guarantees to GD
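A minimal sketch of the projected (stochastic) gradient loop just described; `loss_grad` and `project` are placeholder callables (not from the slides) for the mini-batch gradient estimate and the projection onto $\Omega$, e.g. the 1-norm ball of radius $z$:

```python
import numpy as np

def projected_sgd(X, y, loss_grad, project, T=1000, batch=16, seed=0):
    """Stochastic projected gradient descent with eta_t ~ 1/sqrt(t)."""
    m, n = X.shape
    w = np.zeros(n)
    rng = np.random.default_rng(seed)
    for t in range(1, T + 1):
        S = rng.integers(0, m, size=batch)   # random subsample S' of the examples
        g = loss_grad(w, X[S], y[S])         # gradient estimate on S'
        eta = 1.0 / np.sqrt(t)               # step size eta_t ~ 1/sqrt(t)
        w = project(w - eta * g)             # gradient step, then project onto Omega
    return w
```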
The projection is a soft thresholding with a single threshold $\theta$:
$$w_j = \mathrm{sign}(v_j)\,\max\{0,\ |v_j| - \theta\} \qquad \big(\text{on the simplex: } v_1 := \max\{0, v_1 - \theta\},\ v_2 := \max\{0, v_2 - \theta\},\ \ldots\big)$$
If, say, components 1, 2, 4, 5 are the ones that stay non-zero, the threshold satisfies
$$(v_1 - \theta) + (v_2 - \theta) + (v_4 - \theta) + (v_5 - \theta) = z$$
Had we known the zero elements (e.g. $v_3$, $v_6$) we could have calculated the threshold directly.
If $v_j < v_k$, then if after the projection the $k$'th component is zero, the $j$'th component must be zero as well.
If two feasible solutions exist, with $k$ and $k+1$ non-zero elements, then the solution with $k+1$ non-zero elements attains a lower loss.
Let $\mu$ denote $v$ sorted in decreasing order. Then
$$\mu_j > \theta \;\Rightarrow\; \mu_j > \frac{1}{j}\Big(\sum_{r=1}^{j}\mu_r - z\Big)$$
$$\rho = \max\Big\{\, j \;:\; \mu_j - \frac{1}{j}\Big(\sum_{r=1}^{j}\mu_r - z\Big) > 0 \,\Big\}, \qquad \theta = \frac{1}{\rho}\Big(\sum_{r=1}^{\rho}\mu_r - z\Big)$$
(For the example above: $v_4 - (v_4 - z) > 0$ and $v_5 - \tfrac12(v_4 + v_5 - z) > 0$.)
The same test can be run per element without sorting:
$$v_j > \theta \;\Leftrightarrow\; v_j > \frac{1}{\rho(v_j)}\big(s(v_j) - z\big) \;\Leftrightarrow\; s(v_j) - \rho(v_j)\,v_j < z, \qquad \rho(v_j) = \big|\{v_i : v_i \ge v_j\}\big|,\ \ s(v_j) = \sum_{v_i \ge v_j} v_i$$
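A sketch of the resulting sort-based projection, following the threshold characterization above (NumPy; the function names are mine):

```python
import numpy as np

def project_simplex(v, z=1.0):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = z}."""
    mu = np.sort(v)[::-1]                     # sort in decreasing order
    cssv = np.cumsum(mu) - z                  # sum_{r<=j} mu_r - z
    j = np.arange(1, len(v) + 1)
    rho = j[mu - cssv / j > 0][-1]            # rho = max{ j : mu_j - (sum_{r<=j} mu_r - z)/j > 0 }
    theta = cssv[rho - 1] / rho               # theta = (sum_{r<=rho} mu_r - z) / rho
    return np.maximum(v - theta, 0.0)         # soft threshold

def project_l1_ball(v, z=1.0):
    """Euclidean projection of v onto {w : ||w||_1 <= z}."""
    if np.abs(v).sum() <= z:
        return v                              # already inside the ball
    return np.sign(v) * project_simplex(np.abs(v), z)
```

The per-element test with $\rho(v_j)$ and $s(v_j)$ above is what lets the sort be replaced by a randomized, expected linear-time search.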
[ text applications: dictionaries of 20+ million words ]  [ web data: often > 10^10 different HTML tokens ]
[ text applications: a news document contains 1000s of words ]  [ web data: a web page is often short, fewer than 10^4 HTML tokens ]
Gradient updates touch only the weights corresponding to the non-zero features in the example
Projection needs an additional data structure + lazy evaluation
Projection in log time w/ Tarjan's (1983) algorithm for splitting red-black trees
EG (exponentiated gradient), by Prof. Manfred Warmuth and colleagues
[Plots: $f - f^*$ vs. number of gradient evaluations, and vs. number of stochastic subgradient evaluations, comparing EG and L1; sparsity (% of total features, % of total seen) vs. number of training examples.]
$$\min_w\ L(w) + \lambda\|w\|_1$$
Subgradient set: $\partial f(x_0) = \{\, g : f(x) \ge f(x_0) + g\cdot(x - x_0)\ \ \forall x \,\}$
Subgradient step: $w_{t+1} = w_t - \eta_t g_t$, $\quad g_t \in \partial\big(L(w) + \lambda R(w)\big)\big|_{w=w_t}$
For $R(w) = \|w\|_1$: $\ \partial R / \partial w_j \in [-1, 1]$ if $w_{t,j} = 0$, and $\partial R / \partial w_j = \mathrm{sign}(w_{t,j})$ if $w_{t,j} \ne 0$
$\partial\|w\|_1$ at $w = 0$ is $\{g : \|g\|_\infty \le 1\}$; $\ \partial\|w\|_2$ at $w = 0$ is $\{g : \|g\|_2 \le 1\}$
Two phases: GD on $L$ only, then solve analytically.
$$w_{t+\frac12} = w_t - \eta_t\, g_t, \qquad g_t \in \partial L(w)\big|_{w=w_t}$$
$$w_{t+1} = \arg\min_w\ \Big\{\tfrac12\big\|w - w_{t+\frac12}\big\|^2 + \eta_{t+\frac12}\,\lambda\, R(w)\Big\}$$
Here $g_t$ is the (stochastic) gradient of the empirical loss, $\eta_t$ the first-phase learning rate, and $\eta_{t+\frac12}$ the second-phase learning rate.
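The generic two-phase step in code form, a sketch assuming a `prox` callable that solves the second phase analytically (names are mine; $\eta$ is reused for both phases for simplicity):

```python
def fobos_step(w, g, eta, lam, prox):
    """One FOBOS step: an unconstrained gradient step on the loss, followed
    by an analytic proximal step on the regularizer R.

    prox(v, tau) should return argmin_w { 0.5*||w - v||^2 + tau*R(w) }.
    """
    w_half = w - eta * g             # first phase: GD on L only
    return prox(w_half, eta * lam)   # second phase: solve analytically
```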
At an optimum of a smooth $f$: $\nabla f(\theta^\star) = 0$; at an optimum of a non-smooth $f$: $0 \in \partial f(\theta^\star)$.
Unfolding the update, $w_{t+1}$ combines the current gradient of the empirical loss with a forward subgradient of the regularization:
$$w_{t+1} = w_t - \eta_t\, g^L_t - \eta_{t+\frac12}\,\lambda\, g^R_{t+1}$$
equivalently
$$0 \in w_{t+1} - w_t + \eta_t \nabla L(w_t) + \eta_{t+\frac12}\,\lambda\, \partial R(w_{t+1})$$
Learning-rate choices: $\eta_{t+\frac12} = \eta_{t+1}$ or $\eta_{t+\frac12} = \eta_t$.
Convergence:
$$\min_{t=1,\ldots,T}\ \big[L(w_t) + \lambda R(w_t)\big] - \big[L(w^\star) + \lambda R(w^\star)\big] = O\Big(\tfrac{1}{\sqrt{T}}\Big)\ \text{ for } \eta_t \propto \tfrac{1}{\sqrt{t}}, \qquad O\Big(\tfrac{\log T}{T}\Big)\ \text{ for } \eta_t \propto \tfrac{1}{t}\ \text{ (strong convexity)}$$
"Sparse Reconstruction by Separable Approximation" (Wright, Nowak & Figueiredo)
"Sparse Online Learning via Truncated Gradient" (Langford, Li & Zhang)
"A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems" (Beck & Teboulle; and many references therein)
FOBOS makes it easy to use and to obtain structural sparsity.
With $[z]_+ = \max\{0, z\}$, the coordinate-wise solution of
$$\min_w\ \tfrac12\big\|w - w_{t+\frac12}\big\|^2 + \lambda\|w\|_1 \qquad \Big(\text{per coordinate: } \min_w\ \tfrac12\big(w - w_{t+\frac12,j}\big)^2 + \lambda|w|\Big)$$
is soft thresholding:
$$w_{t+1,j} = \mathrm{sign}\big(w_{t+\frac12,j}\big)\,\Big[\,\big|w_{t+\frac12,j}\big| - \lambda\,\eta_{t+\frac12}\Big]_+$$
In particular $w_{t+1,j} = 0$ whenever $\big|w_{t+\frac12,j}\big| \le \lambda\,\eta_{t+\frac12}$.
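The corresponding second phase for $R(w) = \|w\|_1$, i.e. the soft-thresholding formula above, written as a prox (a NumPy sketch; the function name is mine):

```python
import numpy as np

def prox_l1(v, tau):
    """argmin_w { 0.5*||w - v||^2 + tau*||w||_1 }: coordinate-wise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)
```

Plugged into the `fobos_step` sketch above as `fobos_step(w, g, eta, lam, prox_l1)`.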
Often parameters have a predefined grouping:
We would like to zero them all or keep them all
If feature $X_2$ is irrelevant then $W_{1,2}, W_{2,2}, W_{3,2}, \ldots, W_{n,2}$ are redundant
Multiclass: each class has its own weight vector operating on the same features $X_1, X_2, X_3, X_4, \ldots$
$$\Pr(y = j \mid x) = \frac{e^{f_j(x)}}{\sum_{r=1}^{k} e^{f_r(x)}}, \qquad f_j(x) = \langle w_j, x\rangle$$
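For concreteness, the multiclass probabilities above in code (a small sketch; `W` holds one weight vector per class as rows):

```python
import numpy as np

def class_probs(W, x):
    """Pr(y = j | x) = exp(f_j(x)) / sum_r exp(f_r(x)), with f_j(x) = <w_j, x>."""
    f = W @ x                          # scores f_1(x), ..., f_k(x)
    f = f - f.max()                    # shift for numerical stability
    e = np.exp(f)
    return e / e.sum()
```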
For $R(w) = \|w\|_2^2$: gradient descent + shrinkage,
$$w_{t+1} = \frac{w_{t+\frac12}}{1 + \lambda\,\eta_{t+\frac12}}$$
For $R(w) = \|w\|_2$ (not squared): the update returns the zero vector if $\big\|w_{t+\frac12}\big\| \le \lambda\,\eta_{t+\frac12}$, otherwise shrinkage,
$$w_{t+1} = \left[\,1 - \frac{\lambda\,\eta_{t+\frac12}}{\big\|w_{t+\frac12}\big\|}\,\right]_+ w_{t+\frac12}$$
Derivation (assume $\|w_{t+1}\| > 0$; write $v = w_{t+\frac12}$, $\tilde\lambda = \eta_{t+\frac12}\lambda$):
$$\frac{\partial}{\partial w_k}\sqrt{\textstyle\sum_j w_j^2} = \frac{w_k}{\sqrt{\sum_j w_j^2}} \;\Rightarrow\; \partial\|w\| = \Big\{\frac{w}{\|w\|}\Big\}\ \text{ for } w \ne 0$$
Setting the gradient of $\tfrac12\|w - v\|^2 + \tilde\lambda\|w\|$ to zero:
$$w - v + \tilde\lambda\,\frac{w}{\|w\|} = 0 \;\Rightarrow\; w\Big(1 + \frac{\tilde\lambda}{\|w\|}\Big) = v \;\Rightarrow\; w = s\,v \ \text{ for some } s \ge 0$$
$$\frac{d}{ds}\Big\{\tfrac12\|s v - v\|^2 + \tilde\lambda\|s v\|\Big\} = 0 \;\Rightarrow\; s = 1 - \frac{\tilde\lambda}{\|v\|}, \qquad s = 0 \ \text{ if }\ 1 - \frac{\tilde\lambda}{\|v\|} \le 0\ \ (\|v\| \le \tilde\lambda)$$
Hence
$$w_{t+1} = \left[\,1 - \frac{\lambda\,\eta_{t+\frac12}}{\big\|w_{t+\frac12}\big\|}\,\right]_+ w_{t+\frac12}$$
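The 2-norm (block) shrinkage above as a prox, a sketch that zeroes the whole group when its norm falls below the threshold (function name is mine):

```python
import numpy as np

def prox_l2(v, tau):
    """argmin_w { 0.5*||w - v||^2 + tau*||w||_2 }: shrink the whole block toward zero;
    returns the zero vector when ||v|| <= tau, which is what yields group sparsity."""
    norm = np.linalg.norm(v)
    if norm <= tau:
        return np.zeros_like(v)
    return (1.0 - tau / norm) * v
```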
For $R(w) = \|w\|_\infty$:
$$\min_w\ \tfrac12\big\|w - w_{t+\frac12}\big\|^2 + \lambda\,\eta_{t+\frac12}\,\|w\|_\infty$$
Dual problem: a PROJECTION ONTO THE L1 BALL,
$$\min_u\ \tfrac12\big\|u - w_{t+\frac12}\big\|^2 \quad \text{s.t.}\quad \|u\|_1 \le \lambda\,\eta_{t+\frac12}$$
The solution caps the coordinates at a threshold $\theta$ ($\theta = 0$ when the whole vector is zeroed): $w_{t+1,j} = \min\big\{ w_{t+\frac12,j},\ \theta \big\}$ (stated here for non-negative coordinates).
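The $\ell_\infty$ prox computed through its dual as described above: subtract the projection of $w_{t+\frac12}$ onto the $\ell_1$ ball of radius $\lambda\,\eta_{t+\frac12}$. A sketch reusing `project_l1_ball` from the earlier projection code:

```python
def prox_linf(v, tau):
    """argmin_w { 0.5*||w - v||^2 + tau*||w||_inf }, via the dual L1-ball projection:
    w = v - Pi_{||u||_1 <= tau}(v).  project_l1_ball is defined in the earlier sketch."""
    return v - project_l1_ball(v, z=tau)
```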
For multiclass categorization and multitask prediction, mix a norm over the different classes / blocks / tasks and obtain "structural sparsity" ("group sparsity"), e.g. the $\ell_1/\ell_\infty$ mixed norm $\sum_r \max_n |w_{r,n}|$.
Examples are typically sparse (e.g. short documents drawn from a very large dictionary)
$$\text{P.1}:\ w_t = \arg\min_w\ \tfrac12\|w - w_{t-1}\|^2 + \lambda_t\|w\|_q \qquad\quad \text{P.2}:\ w = \arg\min_w\ \tfrac12\|w - w_0\|^2 + \Big(\textstyle\sum_{t=1}^{T}\lambda_t\Big)\|w\|_q$$
A run of P.1 steps with no interleaved gradient updates is equivalent to a single P.2 step, i.e. a FOBOS update with $\lambda = \sum_t \lambda_t$ (in the illustration, $\sum_{t=1}^{6}\lambda_t$); the update phase can therefore be skipped for inactive coordinates and applied lazily (LAZY EVAL).
[Illustration: groups $g_1, \ldots, g_4$ over steps $t = 1, \ldots, 6$, with the accumulated regularization applied when a group next becomes active.]
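A sketch of the lazy catch-up this equivalence enables: for a coordinate (or group) that has been inactive for several steps, the skipped shrinkage steps are applied in one shot using the accumulated thresholds (names are illustrative):

```python
import numpy as np

def catch_up_l1(w_j, skipped_lambdas):
    """Apply the L1 shrinkage steps skipped while coordinate j was inactive.
    By the P.1 / P.2 equivalence above, soft-thresholding by each lambda_t in turn
    equals a single soft-threshold by their sum."""
    lam = sum(skipped_lambdas)
    return np.sign(w_j) * max(abs(w_j) - lam, 0.0)
```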
[Plot: $f(w_t) - f(w^*)$ vs. number of operations for L1 Folos, L1 IP, and L1 subgradient.]
[Plot: sparsity proportion vs. Folos steps for $\lambda = 1$, $\lambda = 2$, $\lambda = 10$.]
[Plot: group-sparsity MSE of $\ell_1/\ell_1$, $\ell_1/\ell_2$, $\ell_1/\ell_\infty$ against the true model.]
[Toy example: a table of labels $y_t$ and features $x_{t,1}, x_{t,2}, x_{t,3}$ in which feature 1 is frequent but "garbage" ("EASY") while feature 3 is infrequent but informative ("HARD").]
$$g_t = \hat\nabla L(w)\big|_{w=w_t}, \qquad w_{t+1} = \arg\min_w\ \Big\{\tfrac12\|w - w_t\|^2 + \eta\,\langle g_t, w\rangle + R(w)\Big\}$$
Replace the Euclidean distance with a Mahalanobis distance:
$$w_{t+1} = \arg\min_w\ \Big\{\tfrac12\|w - w_t\|_A^2 + \eta\,\langle g_t, w\rangle + R(w)\Big\}, \qquad \|w\|_A^2 = \langle w, A w\rangle,\ \ A \succeq 0$$
HOW TO SET AND ADAPT THE MATRIX $A$? Choose a matrix $S$ w.r.t. which the observed gradients look "good" (the geometry becomes "easy").
$$\min_S\ \sum_{t=1}^{T}\big\langle g_t,\, S^{-1} g_t\big\rangle \quad \text{s.t.}\quad S \succeq 0,\ \mathrm{tr}(S) \le c
\qquad\Rightarrow\qquad
S = \frac{c}{\mathrm{tr}\big(G_T^{1/2}\big)}\, G_T^{1/2}, \qquad G_T = \sum_{t=1}^{T} g_t\, g_t^{\top}$$
Diagonal version with a PER FEATURE LEARNING RATE:
$$a^t_j = \sqrt{\sum_{\tau=1}^{t} g_{\tau,j}^2 + \delta}$$
$$w_{t+\frac12,j} = w_{t,j} - \frac{\eta}{a^t_j}\, g_{t,j}, \qquad
w_{t+1,j} = \mathrm{sign}\big(w_{t+\frac12,j}\big)\left[\,\big|w_{t+\frac12,j}\big| - \frac{\eta\lambda}{a^t_j}\right]_+$$
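A sketch of the diagonal AdaGrad variant above: per-feature step sizes from accumulated squared gradients, followed by per-feature soft thresholding. `grad_fn`, `eta`, `lam`, and `delta` are illustrative names and values:

```python
import numpy as np

def adagrad_fobos_l1(grad_fn, n, T, eta=0.1, lam=0.01, delta=1e-8):
    """AdaGrad + FOBOS with R(w) = lam * ||w||_1 (diagonal version)."""
    w = np.zeros(n)
    gsq = np.zeros(n)                                 # running sum of g_{tau,j}^2
    for t in range(T):
        g = grad_fn(w, t)                             # (stochastic) gradient at step t
        gsq += g * g
        a = np.sqrt(gsq + delta)                      # a_j^t: per-feature scale
        w_half = w - (eta / a) * g                    # per-feature learning-rate step
        w = np.sign(w_half) * np.maximum(np.abs(w_half) - eta * lam / a, 0.0)
    return w
```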
$$L(w_t) + \lambda R(w_t) - \big(L(w^*) + \lambda R(w^*)\big) \le \frac{\sqrt{2d}\, D_\infty\, C}{T}, \qquad
D_\infty = \|w_0 - w^*\|_\infty, \qquad
C = \sqrt{\inf_S\Big\{\textstyle\sum_{t=1}^{T} g_t^\top S^{-1} g_t \;:\; S \succeq 0,\ \mathrm{tr}(S) \le d\Big\}}$$
FAST RATE (STRONGLY CONVEX); SCALES WELL WITH DIMENSION; $C$ IS SMALL WHEN THE SPACE CAN BE RESHAPED "ON THE FLY"
[Tables: comparison of Fobos, AdaFobos, PA, and AROW on the Economics, Corporate, Government, and Medicine categories; and of L1 only vs. L1 & AdaGrad on ECAT, CCAT, GCAT, MCAT.]
[Plot: test-set error rate vs. proportion of non-zero weights for AdaGrad and AROW.]
(AROW: second order algorithm w/o sparsity)
~10% non-zero weights while performance stays close to optimal
Choose which coordinate to update (randomly, using a duality-gap criterion, ...)
In the parallel version all coordinates are updated at the same time, but with a less aggressive step
[see also Yoav Freund's tutorial]: find $w$ for
$$\min_w\ \sum_{i=1}^{m} e^{-y_i(w\cdot x_i)} + \lambda\|w\|_1$$
with $x_i \in \{0,1\}^n$ ($x_{i,j} \in \{0,1\}$) and $y_i \in \{-1,+1\}$.
$$\mu^-_j = \sum_{i:\ y_i x_{i,j} = -1} e^{-y_i\,(w_{\downarrow j}\cdot\, x_{i\downarrow j})} \qquad\qquad \mu^+_j = \sum_{i:\ y_i x_{i,j} = +1} e^{-y_i\,(w_{\downarrow j}\cdot\, x_{i\downarrow j})}$$
The coordinate objective $\mu^+_j e^{-w_j} + \mu^-_j e^{w_j} + \lambda|w_j|$ reduces to a polynomial equation. When $\mu^+_j > \mu^-_j$ the solution satisfies $w_j \ge 0$, so with $a = e^{w_j}$:
$$\frac{d}{dw_j}\Big[\mu^+_j e^{-w_j} + \mu^-_j e^{w_j} + \lambda w_j\Big] = 0 \;\Rightarrow\; -\mu^+_j e^{-w_j} + \mu^-_j e^{w_j} + \lambda = 0$$
$$-\mu^+_j \tfrac{1}{a} + \mu^-_j a + \lambda = 0 \;\Rightarrow\; \mu^-_j a^2 + \lambda a - \mu^+_j = 0$$
$$\delta_j = \begin{cases}
-w_j & \big|\mu^+_j e^{\eta w_j} - \mu^-_j e^{-\eta w_j}\big| \le \lambda \\[6pt]
\dfrac{1}{\eta}\log\dfrac{-\lambda + \sqrt{\lambda^2 + 4\mu^+_j\mu^-_j}}{2\mu^-_j} & \mu^+_j e^{\eta w_j} > \mu^-_j e^{-\eta w_j} + \lambda \\[10pt]
\dfrac{1}{\eta}\log\dfrac{\lambda + \sqrt{\lambda^2 + 4\mu^+_j\mu^-_j}}{2\mu^-_j} & \mu^+_j e^{\eta w_j} < \mu^-_j e^{-\eta w_j} - \lambda
\end{cases}$$
When $w_j = 0$ the first condition reduces to $|\mu^+_j - \mu^-_j| \le \lambda$: regularization pushes weights back to zero if the correlation difference is small.
$\eta$ is the step size (1 for a single-coordinate update).
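A direct transcription of the case analysis above into code (a sketch; it assumes $\mu^-_j > 0$ and does not handle the degenerate case of an all-positive or all-negative feature):

```python
import math

def coordinate_delta(w_j, mu_plus, mu_minus, lam, eta=1.0):
    """L1-regularized exponential-loss coordinate update delta_j (cases as above)."""
    up = mu_plus * math.exp(eta * w_j)
    down = mu_minus * math.exp(-eta * w_j)
    if abs(up - down) <= lam:
        return -w_j                                   # regularization zeroes the weight
    disc = math.sqrt(lam * lam + 4.0 * mu_plus * mu_minus)
    if up > down + lam:
        return math.log((-lam + disc) / (2.0 * mu_minus)) / eta
    else:  # up < down - lam
        return math.log((lam + disc) / (2.0 * mu_minus)) / eta
```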
Sibyl: a large-scale supervised learning system
L1 regularization at scale [1+ billion features, 100s of billions of examples]
http://www.magicbroom.info/Sparsity.html