SLIDE 1

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization

Julien Mairal

INRIA LEAR, Grenoble

Gargantua workshop, LJK, November 2013

SLIDE 2

A simple optimization principle

[Figure: the objective f(θ) and a majorizing surrogate g(θ) that touches f at the current estimate κ.]

Objective: min_{θ∈Θ} f(θ).

Principle called Majorization-Minimization [Lange et al., 2000]; quite popular in statistics and signal processing.

SLIDE 3

In this work

[Figure: again the objective f(θ) and a surrogate g(θ) touching it at κ.]

Scalable Majorization-Minimization algorithms, for convex or non-convex and smooth or non-smooth problems.

References

  • J. Mairal. Optimization with First-Order Surrogate Functions. ICML'13;
  • J. Mairal. Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization. NIPS'13.

SLIDE 4

Setting: First-Order Surrogate Functions

[Figure: f(θ), a surrogate g(θ) touching f at κ, and the approximation error h(θ).]

  • g(θ′) ≥ f(θ′) for all θ′ in arg min_{θ∈Θ} g(θ);
  • the approximation error h ≜ g − f is differentiable, and ∇h is L-Lipschitz. Moreover, h(κ) = 0 and ∇h(κ) = 0.
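
For illustration, a small numerical check (not from the slides) that the Lipschitz gradient surrogate of the following slides satisfies these conditions on a smooth toy function, with h = g − f and h(κ) = 0; all names are illustrative.

import numpy as np

# Toy f(theta) = 0.5 * ||A theta - b||^2 and its Lipschitz gradient surrogate at kappa.
A = np.random.randn(30, 10)
b = np.random.randn(30)
f = lambda th: 0.5 * np.sum((A @ th - b) ** 2)
grad_f = lambda th: A.T @ (A @ th - b)
L = np.linalg.norm(A.T @ A, 2)              # Lipschitz constant of grad f

kappa = np.random.randn(10)
g = lambda th: f(kappa) + grad_f(kappa) @ (th - kappa) + 0.5 * L * np.sum((th - kappa) ** 2)
h = lambda th: g(th) - f(th)                # approximation error h = g - f

theta = np.random.randn(10)
assert h(theta) >= -1e-9                    # g majorizes f (h >= 0) for this surrogate
assert abs(h(kappa)) < 1e-9                 # h(kappa) = 0; by construction grad h(kappa) = 0 as well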

SLIDE 5

The Basic MM Algorithm

Algorithm 1 Basic Majorization-Minimization Scheme
1: Input: θ0 ∈ Θ (initial estimate); N (number of iterations).
2: for n = 1, . . . , N do
3:   Compute a surrogate gn of f near θn−1;
4:   Minimize gn and update the solution: θn ∈ arg min_{θ∈Θ} gn(θ).
5: end for
6: Output: θN (final estimate).
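
For illustration, a minimal Python sketch of Algorithm 1, instantiated on a toy least-squares problem with the Lipschitz gradient surrogate of the following slides; mm_optimize and the surrogate callables are illustrative names, not from the talk.

import numpy as np

def mm_optimize(build_surrogate, minimize_surrogate, theta0, n_iters=100):
    theta = theta0
    for _ in range(n_iters):
        g = build_surrogate(theta)        # step 3: surrogate g_n of f near theta_{n-1}
        theta = minimize_surrogate(g)     # step 4: theta_n in arg min of g_n
    return theta

# Toy instantiation: f(theta) = 0.5 * ||A theta - b||^2 with the Lipschitz gradient
# surrogate, whose exact minimizer is a gradient step of size 1/L.
A = np.random.randn(20, 5)
b = np.random.randn(20)
L = np.linalg.norm(A.T @ A, 2)                 # Lipschitz constant of the gradient
grad_f = lambda theta: A.T @ (A @ theta - b)

build = lambda kappa: (kappa, grad_f(kappa))   # the surrogate is summarized by (kappa, grad f(kappa))
minimize = lambda g: g[0] - g[1] / L           # arg min of the quadratic surrogate

theta_hat = mm_optimize(build, minimize, np.zeros(5))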

SLIDE 6

Examples of First-Order Surrogate Functions

Lipschitz Gradient Surrogates: f is L-smooth (differentiable + L-Lipschitz gradient).

  g : θ → f(κ) + ∇f(κ)⊤(θ − κ) + (L/2)‖θ − κ‖₂².

Minimizing g yields a gradient descent step θ ← κ − (1/L)∇f(κ).

SLIDE 7

Examples of First-Order Surrogate Functions

Proximal Gradient Surrogates: f = f1 + f2 with f1 smooth.

  g : θ → f1(κ) + ∇f1(κ)⊤(θ − κ) + (L/2)‖θ − κ‖₂² + f2(θ).

Minimizing g amounts to one step of the forward-backward, ISTA, or proximal gradient descent algorithm [Beck and Teboulle, 2009, Combettes and Pesquet, 2010, Wright et al., 2008, Nesterov, 2007].
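
For illustration, a minimal sketch of the resulting ISTA/proximal gradient step when f2 = λ‖·‖1; soft_threshold and ista_step are illustrative names, not part of the slides.

import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_step(kappa, grad_f1, L, lam):
    """Minimizer of f1(kappa) + <grad f1(kappa), theta - kappa>
       + (L/2)||theta - kappa||^2 + lam * ||theta||_1."""
    return soft_threshold(kappa - grad_f1(kappa) / L, lam / L)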

SLIDE 8

Examples of First-Order Surrogate Functions

Linearizing Concave Functions and DC-Programming: f = f1 + f2 with f2 smooth and concave.

  g : θ → f1(θ) + f2(κ) + ∇f2(κ)⊤(θ − κ).

When f1 is convex, the algorithm is called DC-programming.

SLIDE 9

Examples of First-Order Surrogate Functions

Quadratic Surrogates: f is twice differentiable, and H is a uniform upper bound of ∇²f:

  g : θ → f(κ) + ∇f(κ)⊤(θ − κ) + (1/2)(θ − κ)⊤H(θ − κ).

Actually a big deal in statistics and machine learning [Böhning and Lindsay, 1988, Khan et al., 2010, Jebara and Choromanska, 2012].
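
One classical instance (not from the slides) is logistic regression with the Böhning-Lindsay bound H = (1/(4n)) X⊤X, which uniformly dominates the Hessian of the averaged logistic loss; a minimal sketch, assuming H is invertible and with illustrative names.

import numpy as np

def logistic_grad(theta, X, y):
    """Gradient of (1/n) * sum_i log(1 + exp(-y_i x_i^T theta)), y_i in {-1, +1}."""
    s = 1.0 / (1.0 + np.exp(y * (X @ theta)))      # sigmoid(-y x^T theta)
    return -(X.T @ (y * s)) / X.shape[0]

def quadratic_mm_step(theta, X, y):
    """One MM step with the uniform quadratic surrogate g built at theta."""
    H = (X.T @ X) / (4.0 * X.shape[0])             # uniform upper bound on the Hessian
    return theta - np.linalg.solve(H, logistic_grad(theta, X, y))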

SLIDE 10

Examples of First-Order Surrogate Functions

More Exotic Surrogates: Consider a smooth approximation of the trace (nuclear) norm

  fµ : θ → Tr((θ⊤θ + µI)^(1/2)) = Σ_{i=1}^p √(λi(θ⊤θ) + µ).

The function f′ : H → Tr(H^(1/2)) is concave on the set of p.d. matrices and ∇f′(H) = (1/2)H^(−1/2), which gives the surrogate

  gµ : θ → fµ(κ) + (1/2) Tr((κ⊤κ + µI)^(−1/2)(θ⊤θ − κ⊤κ)),

and yields the algorithm of Mohan and Fazel [2012].
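
A small sketch of these two functions, computed with dense eigendecompositions; this is illustrative only (Mohan and Fazel's full algorithm is not implemented here), and the names are placeholders.

import numpy as np

def f_mu(theta, mu):
    """f_mu(theta) = Tr((theta^T theta + mu I)^(1/2)) = sum_i sqrt(lambda_i + mu)."""
    eigvals = np.linalg.eigvalsh(theta.T @ theta)
    return np.sum(np.sqrt(eigvals + mu))

def g_mu(theta, kappa, mu):
    """Surrogate built at kappa by linearizing the concave map H -> Tr(H^(1/2))."""
    M = kappa.T @ kappa + mu * np.eye(kappa.shape[1])
    w, V = np.linalg.eigh(M)
    M_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return f_mu(kappa, mu) + 0.5 * np.trace(M_inv_sqrt @ (theta.T @ theta - kappa.T @ kappa))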

SLIDE 11

Examples of First-Order Surrogate Functions

Variational Surrogates: f(θ1) ≜ min_{θ2∈Θ2} f̃(θ1, θ2), where f̃ is "smooth" w.r.t. θ1 and strongly convex w.r.t. θ2:

  g : θ1 → f̃(θ1, κ⋆2), with κ⋆2 ≜ arg min_{θ2∈Θ2} f̃(κ1, θ2).

Saddle-Point Surrogates: f(θ1) ≜ max_{θ2∈Θ2} f̃(θ1, θ2), where f̃ is "smooth" w.r.t. θ1 and strongly concave w.r.t. θ2:

  g : θ1 → f̃(θ1, κ⋆2) + (L″/2)‖θ1 − κ1‖₂².

Jensen Surrogates: f(θ) ≜ f̃(x⊤θ), where f̃ is L-smooth. Choose a weight vector w in R^p with nonnegative entries such that ‖w‖1 = 1 and wi ≠ 0 whenever xi ≠ 0:

  g : θ → Σ_{i=1}^p wi f̃((xi/wi)(θi − κi) + x⊤κ).
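
For illustration, a numerical check of the Jensen surrogate with a convex, smooth f̃ (the logistic function) and the common weight choice wi = |xi|/‖x‖1; this is an assumption-laden sketch, not from the slides, and all names are illustrative.

import numpy as np

f_tilde = lambda u: np.logaddexp(0.0, -u)   # log(1 + exp(-u)), convex and smooth

p = 8
x = np.random.randn(p)                      # almost surely has no exact zeros
w = np.abs(x) / np.sum(np.abs(x))           # w_i != 0 wherever x_i != 0, sum_i w_i = 1

kappa = np.random.randn(p)
theta = np.random.randn(p)

f_val = f_tilde(x @ theta)                                          # f(theta) = f_tilde(x^T theta)
g_val = np.sum(w * f_tilde(x * (theta - kappa) / w + x @ kappa))    # Jensen surrogate at kappa
assert g_val >= f_val - 1e-9                # majorization follows from convexity of f_tilde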

SLIDE 12

Theoretical Guarantees

  • for non-convex problems: f(θn) monotonically decreases and

      lim inf_{n→+∞} inf_{θ∈Θ} ∇f(θn, θ − θn) / ‖θ − θn‖2 ≥ 0,

    where ∇f(θn, θ − θn) denotes a directional derivative; this is an asymptotic stationary point condition;
  • for convex problems: f(θn) − f⋆ = O(1/n);
  • for µ-strongly convex problems: the convergence rate is linear, O((1 − µ/L)^n);
  • the convergence rates and the proof techniques are the same as for proximal gradient methods [Nesterov, 2007, Beck and Teboulle, 2009].

SLIDE 13

New Majorization-Minimization Algorithms

Given f : Rp → R and Θ ⊆ Rp, our goal is to solve min_{θ∈Θ} f(θ).

We introduce algorithms for non-convex and convex optimization:
  • a block coordinate scheme for separable surrogates;
  • an incremental algorithm dubbed MISO for separable functions f;
  • a stochastic algorithm for minimizing expectations.

Also several variants for convex optimization:
  • an accelerated one (Nesterov-like);
  • a "Frank-Wolfe" majorization-minimization algorithm.

SLIDE 14

Incremental Optimization: MISO

Suppose that f splits into many components: f(θ) = (1/T) Σ_{t=1}^T f^t(θ).

Recipe
  • incrementally update an approximate surrogate (1/T) Σ_{t=1}^T g^t;
  • add some heuristics for practical implementations.

Related (Inspiring) Work for Convex Problems
  • related to SAG [Schmidt et al., 2013] and SDCA [Shalev-Shwartz and Zhang, 2012], but offers different update rules.

SLIDE 15

Incremental Optimization: MISO

Algorithm 2 Incremental Scheme MISO
1: Input: θ0 ∈ Θ; N (number of iterations).
2: Choose surrogates g0^t of f^t near θ0 for all t;
3: for n = 1, . . . , N do
4:   Randomly pick one index t̂n and choose a surrogate gn^t̂n of f^t̂n near θn−1; set gn^t ≜ gn−1^t for all t ≠ t̂n;
5:   Update the solution: θn ∈ arg min_{θ∈Θ} (1/T) Σ_{t=1}^T gn^t(θ).
6: end for
7: Output: θN (final estimate).

SLIDE 16

Incremental Optimization: MISO

Update Rule for Proximal Gradient Surrogates

We want to minimize (1/T) Σ_{t=1}^T f1^t(θ) + f2(θ).

  θn = arg min_{θ∈Θ} (1/T) Σ_{t=1}^T [ f1^t(κ^t) + ∇f1^t(κ^t)⊤(θ − κ^t) + (L/2)‖θ − κ^t‖₂² ] + f2(θ)
     = arg min_{θ∈Θ} (1/2)‖θ − ( (1/T) Σ_{t=1}^T κ^t − (1/(LT)) Σ_{t=1}^T ∇f1^t(κ^t) )‖₂² + (1/L) f2(θ).

Then, randomly draw one index t̂n, and update κ^t̂n ← θn.

Remark
  • removing f2 and replacing (1/T) Σ_{t=1}^T κ^t by θn−1 yields SAG [Schmidt et al., 2013];
  • replacing L by µ is "close" to SDCA [Shalev-Shwartz and Zhang, 2012].
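
A minimal sketch of this update for f2 = λ‖·‖1, where the averaged surrogate is minimized by soft-thresholding the averaged anchor/gradient point; miso_l1, grad_f1 and the other names are illustrative, and the per-iteration averages are recomputed naively rather than maintained incrementally as one would in practice.

import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def miso_l1(grad_f1, theta0, L, lam, n_iters, rng=np.random):
    """grad_f1 is a list of T per-component gradient functions of f1^t."""
    T = len(grad_f1)
    kappa = np.tile(theta0, (T, 1))                     # anchor points kappa^t
    grads = np.array([g(theta0) for g in grad_f1])      # gradients stored at the anchors
    theta = theta0.copy()
    for _ in range(n_iters):
        # minimizer of the averaged surrogate: soft-threshold the averaged point
        center = kappa.mean(axis=0) - grads.mean(axis=0) / L
        theta = soft_threshold(center, lam / L)
        t = rng.randint(T)                              # draw one index t_n
        kappa[t] = theta                                # refresh surrogate t
        grads[t] = grad_f1[t](theta)
    return theta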

SLIDE 17

Incremental Optimization: MISO

Theoretical Guarantees
  • for non-convex problems, the guarantees are the same as for the generic MM algorithm, with probability one;
  • for convex problems and proximal gradient surrogates, the expected convergence rate becomes O(T/n);
  • for µ-strongly convex problems and proximal gradient surrogates, the expected convergence rate is linear, O((1 − µ/(TL))^n).

SLIDE 18

Incremental Optimization: MISO

Remarks
  • for µ-strongly convex problems, the rates of SDCA and SAG are better: µ/(TL) is replaced by O(min(µ/L, 1/T));
  • MISO with minorizing surrogates is close to SDCA, with "similar" convergence rates (details still to be written).

SLIDE 19

Stochastic Majorization-Minimization: SMM

Suppose that f is an expectation: f(θ) = E_x[ℓ(θ, x)].

Recipe
  • draw a function fn : θ → ℓ(θ, xn) at iteration n;
  • iteratively update an approximate surrogate ḡn = (1 − wn)ḡn−1 + wn gn;
  • possibly use an averaging scheme of the iterates.

Related Work
  • online-EM [Neal and Hinton, 1998, Cappé and Moulines, 2009];
  • online dictionary learning [Mairal et al., 2010a].

SLIDE 20

Stochastic Majorization-Minimization: SMM

Algorithm 3 Stochastic Majorization-Minimization Scheme
1: Input: θ0 ∈ Θ (initial estimate); N (number of iterations); (wn)n≥1, weights in (0, 1];
2: initialize the approximate surrogate: ḡ0 : θ → (ρ/2)‖θ − θ0‖₂²;
3: for n = 1, . . . , N do
4:   draw a training point xn;
5:   choose a surrogate function gn of fn : θ → ℓ(θ, xn) near θn−1;
6:   update the approximate surrogate: ḡn = (1 − wn)ḡn−1 + wn gn;
7:   update the current estimate: θn ∈ arg min_{θ∈Θ} ḡn(θ);
8: end for
9: Output: θN (current estimate).

SLIDE 21

Stochastic Majorization-Minimization: SMM

Update Rule for Proximal Gradient Surrogates

  θn ← arg min_{θ∈Θ} Σ_{i=1}^n wi^n [ ∇fi(θi−1)⊤θ + (L/2)‖θ − θi−1‖₂² + ψ(θ) ].   (SMM)

Other schemes in the literature [Duchi and Singer, 2009]:

  θn ← arg min_{θ∈Θ} ∇fn(θn−1)⊤θ + (1/(2ηn))‖θ − θn−1‖₂² + ψ(θ),   (FOBOS)

or regularized dual averaging (RDA) of Xiao [2010]:

  θn ← arg min_{θ∈Θ} (1/n) Σ_{i=1}^n ∇fi(θi−1)⊤θ + (1/(2ηn))‖θ‖₂² + ψ(θ).   (RDA)
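
A minimal sketch of the (SMM) update for ψ = λ‖·‖1: because the averaged surrogate is a quadratic with curvature L plus ψ, it suffices to track weighted averages of the past iterates and gradients. The weight schedule, sample_x and grad_loss are illustrative placeholders, and the ρ-quadratic initialization of Algorithm 3 is replaced here by a simple anchor at θ0.

import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def smm_l1(sample_x, grad_loss, theta0, L, lam, n_iters):
    theta = theta0.copy()
    z_bar = theta0.copy()            # weighted average of past iterates theta_{i-1}
    g_bar = np.zeros_like(theta0)    # weighted average of past gradients
    for n in range(1, n_iters + 1):
        w = 1.0 / n                  # a simple weight schedule in (0, 1]
        x = sample_x()               # draw a training point x_n
        g = grad_loss(theta, x)      # gradient of f_n at theta_{n-1}
        z_bar = (1 - w) * z_bar + w * theta
        g_bar = (1 - w) * g_bar + w * g
        theta = soft_threshold(z_bar - g_bar / L, lam / L)   # minimize the averaged surrogate
    return theta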

SLIDE 22

Stochastic Majorization-Minimization: SMM

Theoretical Guarantees - Non-Convex Problems
  • under a set of reasonable assumptions, f(θn) almost surely converges;
  • the function ḡn asymptotically behaves as a first-order surrogate;
  • we almost surely have asymptotic stationary point conditions.

Theoretical Guarantees - Convex Problems
  • for proximal gradient surrogates, we obtain expected rates similar to SGD with averaging [see Nemirovski et al., 2009, Polyak and Juditsky, 1992]: O(1/n) for strongly convex problems, and O(1/√n) for convex ones.

SLIDE 23

Experimental Conclusions for ℓ2-logistic Regression

Datasets

  name      m          p       storage   size (GB)
  alpha     250 000    500     dense     1
  rcv1      781 265    47 152  sparse    0.95
  covtype   581 012    54      dense     0.11
  ocr       2 500 000  1 155   dense     23.1

  • Incremental and stochastic schemes were significantly faster than batch ones;
  • MISO with heuristics was competitive with the state of the art (SAG, SGD, Liblinear);
  • after one pass over the data, SMM quickly achieves a low-precision solution; for higher precision, MISO is preferred;
  • the problems tested were large but relatively well conditioned.

SLIDE 24

Stochastic DC programming

Consider a binary classification problem with enormous training data (yn, xn), with yn in {−1, +1} and xn in Rp. Assume that there exists a sparse linear model y ≈ sign(θ⊤x), learned by minimizing

  min_{θ∈Rp} E_(y,x)[log(1 + e^(−y θ⊤x))] + λψ(θ).

Traditional choices for ψ: ψ(θ) = ‖θ‖₂² or ‖θ‖1.

Non-convex sparsity-inducing penalty: ψ(θ) = Σ_{j=1}^p log(|θ[j]| + ε).

[Figure: the non-convex penalty ψ(θ) plotted against θ.]

SLIDE 25

Stochastic DC programming

  • upper-bound fn : θ → log(1 + e^(−yn θ⊤xn)) by

      θ → fn(θn−1) + ∇fn(θn−1)⊤(θ − θn−1) + (L/2)‖θ − θn−1‖₂²;

  • upper-bound λ Σ_{j=1}^p log(|θ[j]| + ε) by

      θ → λ Σ_{j=1}^p |θ[j]| / (|θn−1[j]| + ε).

This is a stochastic reweighted-ℓ1 algorithm [Candès et al., 2008].

SLIDE 26

Stochastic DC programming

Datasets

  name      Ntr (train)  Nte (test)  p           density (%)
  rcv1      781 265      23 149      47 152      0.161
  webspam   250 000      100 000     16 091 143  0.023
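
For illustration, a minimal sketch of a single such update (one data point, without the surrogate averaging of SMM), combining the two upper bounds of the previous slide; it reduces to a gradient step followed by coordinate-wise soft-thresholding with reweighted thresholds λ/(L(|θn−1[j]| + ε)). All names are illustrative.

import numpy as np

def stochastic_dc_step(theta, x, y, L, lam, eps):
    """One reweighted-l1 update for the logistic loss + log penalty objective."""
    s = 1.0 / (1.0 + np.exp(y * (x @ theta)))          # sigmoid(-y x^T theta)
    grad = -y * s * x                                  # gradient of log(1 + exp(-y x^T theta))
    v = theta - grad / L                               # gradient step on the smooth upper bound
    thresh = lam / (L * (np.abs(theta) + eps))         # per-coordinate reweighted-l1 threshold
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)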

SLIDE 27

Stochastic DC programming

[Figure: objective value vs. iterations (epochs over the data) for Online DC vs. Batch DC, on the training and test sets of rcv1 (top) and webspam (bottom).]

SLIDE 28

Online Structured Matrix Factorization

Consider some signals x in Rm. We want to find a dictionary D in Rm×K. The quality of D is measured through the loss

  ℓ(x, D) ≜ min_{α∈RK} (1/2)‖x − Dα‖₂² + λ1‖α‖1 + (λ2/2)‖α‖₂².

Then, learning the dictionary amounts to solving

  min_{D∈C} E_x[ℓ(x, D)] + ϕ(D),

and we can use the proximal gradient surrogate.

Why is it a matrix factorization problem? The empirical counterpart of the problem is

  min_{D∈C, A∈RK×n} (1/(2n))‖X − DA‖²_F + (1/n) Σ_{i=1}^n [ λ1‖αi‖1 + (λ2/2)‖αi‖₂² ] + ϕ(D).

SLIDE 29

Online Structured Matrix Factorization

  • when C = {D ∈ Rm×K s.t. ‖dj‖2 ≤ 1} and ϕ = 0, the problem is called sparse coding or dictionary learning [Olshausen and Field, 1997, Elad and Aharon, 2006]. We can use the upper bound

      ℓ(xn, D) ≤ (1/2)‖xn − Dαn‖₂² + λ1‖αn‖1 + (λ2/2)‖αn‖₂²,

    where αn = arg min_{α∈RK} (1/2)‖xn − Dn−1α‖₂² + λ1‖α‖1 + (λ2/2)‖α‖₂²,

    and we obtain the online dictionary learning of Mairal et al. [2010a] (see the sketch after this slide);
  • non-negativity constraints can be easily added, yielding an online nonnegative matrix factorization algorithm;
  • ϕ can be a function encouraging a particular structure in D [Jenatton et al., 2009].
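
A condensed sketch in the spirit of the online dictionary learning algorithm of Mairal et al. [2010a], with C the set of dictionaries with unit-norm columns and ϕ = 0; the sparse-coding step uses a few ISTA iterations instead of an exact solver, and stream, sparse_code and the other names are illustrative.

import numpy as np

def sparse_code(x, D, lam1, lam2, n_ista=50):
    """Approximate argmin_a 0.5||x - D a||^2 + lam1||a||_1 + (lam2/2)||a||^2 by ISTA."""
    alpha = np.zeros(D.shape[1])
    L = np.linalg.norm(D.T @ D, 2) + lam2             # Lipschitz constant of the smooth part
    for _ in range(n_ista):
        grad = D.T @ (D @ alpha - x) + lam2 * alpha
        v = alpha - grad / L
        alpha = np.sign(v) * np.maximum(np.abs(v) - lam1 / L, 0.0)
    return alpha

def online_dictionary_learning(stream, D0, lam1, lam2, n_iters):
    D = D0.copy()
    A = np.zeros((D.shape[1], D.shape[1]))            # accumulates alpha alpha^T
    B = np.zeros_like(D)                              # accumulates x alpha^T
    for _ in range(n_iters):
        x = next(stream)                              # draw a training signal
        alpha = sparse_code(x, D, lam1, lam2)
        A += np.outer(alpha, alpha)
        B += np.outer(x, alpha)
        for j in range(D.shape[1]):                   # block coordinate update of D
            if A[j, j] > 1e-12:
                dj = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
                D[:, j] = dj / max(1.0, np.linalg.norm(dj))   # project onto the unit ball
    return D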

SLIDES 30-35

Online Structured Matrix Factorization

Dictionary Learning on Natural Image Patches

Consider n = 250 000 whitened natural image patches of size m = 12 × 12. We learn a dictionary with K = 256 elements.

[Figures: the learned dictionary shown at successive stages of training, on an old laptop with a 1.2 GHz dual-core CPU.]
  • 0 s      (initialization)
  • 1.15 s   (0.1 pass over the data)
  • 5.97 s   (0.5 pass)
  • 12.44 s  (1 pass)
  • 23.22 s  (2 passes)
  • 60.60 s  (5 passes)

SLIDE 36

Online Structured Matrix Factorization

With a structured regularization function ϕ [Jenatton et al., 2009]:

  ϕ(D) ≜ γ1 Σ_{j=1}^K Σ_{g∈G} max_{k∈g} |dj[k]| + γ2‖D‖²_F.

The proximal operator of ϕ can be computed by using network flow optimization [Mairal et al., 2010b].

Figure: Left: subset of a larger dictionary obtained with ℓ1; Right: subset obtained with ϕ after initialization with the dictionary on the left.

About 20 minutes per pass over the data on the 1.2 GHz laptop CPU.

SLIDE 37

Conclusion

What we have done
  • we have given a unified view of a large number of algorithms;
  • ... and introduced new ones for large-scale optimization.

A take-home message
  • our algorithms are likely to be useful for large-scale non-convex and possibly non-smooth problems.

Source Code
  • code will be included in the toolbox SPAMS (C++, interfaced with Matlab, Python, R): http://spams-devel.gforge.inria.fr/;
  • the online dictionary learning algorithm is already in SPAMS.

SLIDE 38

References I

  • A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  • D. Böhning and B. G. Lindsay. Monotonicity of quadratic-approximation algorithms. Annals of the Institute of Statistical Mathematics, 40(4):641–663, 1988.
  • E. J. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14:877–905, 2008.
  • O. Cappé and E. Moulines. On-line expectation–maximization algorithm for latent data models. 71(3):593–613, 2009.
  • P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer, 2010.

SLIDE 39

References II

  • J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934, 2009.
  • M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 54(12):3736–3745, December 2006.
  • T. Jebara and A. Choromanska. Majorization for CRFs and latent likelihoods. In Advances in Neural Information Processing Systems, 2012.
  • R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, 2009. Preprint arXiv:0904.3523v1.
  • E. Khan, B. Marlin, G. Bouchard, and K. Murphy. Variational bounds for mixed-data factor analysis. In Advances in Neural Information Processing Systems, 2010.

SLIDE 40

References III

  • K. Lange, D. R. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. 9(1):1–20, 2000.
  • J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 2010a.
  • J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems, 2010b.
  • K. Mohan and M. Fazel. Iterative reweighted algorithms for matrix rank minimization. Journal of Machine Learning Research, 13:3441–3473, 2012.
  • R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89, 1998.

SLIDE 41

References IV

  • A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. 19(4):1574–1609, 2009.
  • Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, CORE, 2007.
  • B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
  • B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  • M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Preprint arXiv:1309.2388, 2013.
  • S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Preprint arXiv:1211.2717v1, 2012.

SLIDE 42

References V

  • S. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 2008.
  • L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

SLIDE 43

Performance of MISO for logistic-ℓ2 regression

With a preliminary version of SAG.

[Figure: distance to optimum (log scale) vs. effective passes over the data and vs. training time (sec), on the datasets alpha and rcv1, comparing FISTA-LS, LIBLINEAR, SAG-LS, ASGD (Bottou), SGD (Bottou), and several MISO variants (MISO1, MISO2, and mini-batch versions b1000/b10000).]

SLIDE 44

Online Dictionary Learning

Experimental results, batch vs. online: m = 8 × 8, k = 256.

SLIDE 45

Online Dictionary Learning

Experimental results, batch vs. online: m = 12 × 12 × 3, k = 512.

SLIDE 46

Online Dictionary Learning

Experimental results, batch vs. online: m = 16 × 16, k = 1024.