Bridging the gap between Stochastic Approximation and Markov chains
Aymeric DIEULEVEUT
ENS Paris, INRIA
17 November 2017. Joint work with Francis Bach and Alain Durmus.
Outline
◮ Introduction to Stochastic Approximation for Machine Learning.
◮ Markov chain: a simple yet insightful point of view on constant step-size SGD.
◮ Consider an input/output pair (X, Y) ∈ X × Y,
◮ Y = R (regression) or {−1, 1} (classification).
◮ We want to find a function θ : X → R, such that θ(X) predicts Y well.
◮ Prediction as a linear function ⟨θ, Φ(X)⟩ of features Φ(X) ∈ R^d.
◮ Consider a loss function ℓ : Y × R → R+: squared loss, logistic loss.
◮ We define the risk (generalization error) as R(θ) = E[ℓ(Y, ⟨θ, Φ(X)⟩)].
◮ Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n, i.i.d.
◮ n very large, up to 10^9.
◮ Computer vision: d = 10^4 to 10^6.
◮ Empirical risk (or training error): R̂(θ) = (1/n) Σ_{i=1}^n ℓ(yi, ⟨θ, Φ(xi)⟩).
◮ Empirical risk minimization (regularized): find
θ̂ ∈ argmin_{θ∈R^d} (1/n) Σ_{i=1}^n ℓ(yi, ⟨θ, Φ(xi)⟩) + µΩ(θ).
◮ For example, least-squares regression:
min_{θ∈R^d} (1/(2n)) Σ_{i=1}^n (yi − ⟨θ, Φ(xi)⟩)^2,
◮ and logistic regression:
min_{θ∈R^d} (1/n) Σ_{i=1}^n log(1 + exp(−yi ⟨θ, Φ(xi)⟩)).
◮ Two fundamental questions: (1) computing the estimator θ̂; (2) analyzing its statistical performance.
◮ Goal: min_{θ∈R^d} f(θ), with f the risk R or the empirical risk R̂.
◮ θ* := argmin_{θ∈R^d} f(θ).
◮ Use one observation at each step!
◮ Complexity: O(d) per iteration.
◮ Can be used for both the true risk and the empirical risk.
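As an illustration, here is a minimal numpy sketch of single-pass averaged SGD on synthetic least-squares data. The data-generating model, the step size gamma = 0.01 and the noise level are invented for this demo, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 5
theta_star = rng.normal(size=d)                 # unknown target (synthetic)
X = rng.normal(size=(n, d))                     # features Phi(x_i)
y = X @ theta_star + 0.1 * rng.normal(size=n)   # noisy linear outputs

theta = np.zeros(d)
theta_bar = np.zeros(d)
gamma = 0.01                                    # constant step size (illustrative)
for k in range(n):                              # single pass: one observation per step
    grad = (X[k] @ theta - y[k]) * X[k]         # stochastic gradient of the squared loss
    theta -= gamma * grad
    theta_bar += (theta - theta_bar) / (k + 1)  # on-line Polyak-Ruppert average

print(np.linalg.norm(theta_bar - theta_star))   # small: close to the target
```

Each step costs O(d), and the whole pass costs O(nd), as announced above.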
◮ For the empirical risk R̂(θ) = (1/n) Σ_{i=1}^n ℓ(yi, ⟨θ, Φ(xi)⟩):
◮ At each step k ∈ N*, sample Ik ∼ U{1, . . . , n}.
◮ Fk = σ((xi, yi)_{1≤i≤n}, (Ii)_{1≤i≤k}).
◮ At step k ∈ N*, use f′_{Ik}(θ_{k−1}) = ℓ′(y_{Ik}, ⟨θ_{k−1}, Φ(x_{Ik})⟩) Φ(x_{Ik}),
◮ so that E[f′_{Ik}(θ_{k−1}) | F_{k−1}] = R̂′(θ_{k−1}).
◮ For the risk R(θ) = E fk(θ) = E ℓ(yk, ⟨θ, Φ(xk)⟩):
◮ For 0 ≤ k ≤ n, Fk = σ((xi, yi)_{1≤i≤k}).
◮ At step 0 < k ≤ n, use a new point, independent of θ_{k−1}: f′_k(θ_{k−1}) = ℓ′(yk, ⟨θ_{k−1}, Φ(xk)⟩) Φ(xk),
◮ so that E[f′_k(θ_{k−1}) | F_{k−1}] = R′(θ_{k−1}).
◮ Single pass through the data, running time = O(nd).
◮ “Automatic” regularization.
◮ A function g : R^d → R is L-smooth if and only if it is differentiable with an L-Lipschitz gradient:
for all θ1, θ2 ∈ R^d, ‖g′(θ1) − g′(θ2)‖ ≤ L ‖θ1 − θ2‖.
◮ A twice differentiable function g : R^d → R is µ-strongly convex if and only if,
for all θ ∈ R^d, the eigenvalues of the Hessian g′′(θ) are at least µ.
◮ We consider a loss that is almost surely convex in θ. Thus R̂ and R are convex.
◮ For the squared loss, the Hessian of R̂ (resp. R) is
(1/n) Σ_{i=1}^n Φ(xi)Φ(xi)^⊤ or E[Φ(X)Φ(X)^⊤].
◮ If ℓ is smooth and E[‖Φ(X)‖^2] ≤ r^2, then R is smooth.
◮ If ℓ is µ-strongly convex and the data has an invertible covariance matrix, then R is strongly convex.
◮ SGD with decreasing step sizes: θn = θ_{n−1} − γn f′_n(θ_{n−1}),
with (Robbins and Monro, 1951) Σ_n γn = ∞ and Σ_n γn^2 < ∞.
◮ For µ-strongly convex problems, e.g., γn = γ0/n with γ0 ≥ 1/µ.
◮ Limit variance scales as 1/µ^2.
◮ Very sensitive to ill-conditioned problems.
◮ µ generally unknown, so hard to choose the step size...
◮ Polyak–Ruppert averaging: θ̄n = (1/n) Σ_{k=1}^n θk.
◮ Off-line averaging reduces the noise effect.
◮ On-line computation: θ̄_{n+1} = (1/(n+1)) θ_{n+1} + (n/(n+1)) θ̄n.
◮ One could also consider other averaging schemes (e.g., weighted or tail averaging).
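The on-line update above reproduces the off-line average exactly; the iterates below are random stand-ins for an actual SGD trajectory.

```python
import numpy as np

rng = np.random.default_rng(1)
thetas = rng.normal(size=(100, 3))  # stand-ins for SGD iterates theta_1..theta_100

theta_bar = thetas[0].copy()
for n in range(1, len(thetas)):
    # on-line update: theta_bar_{n+1} = theta_{n+1}/(n+1) + n/(n+1) * theta_bar_n
    theta_bar = thetas[n] / (n + 1) + n / (n + 1) * theta_bar

print(np.allclose(theta_bar, thetas.mean(axis=0)))  # → True: matches the off-line average
```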
◮ Known global minimax rates of convergence for stochastic approximation (Nemirovsky and Yudin, 1983):
◮ Strongly convex: O((µn)^{−1}),
◮ Non-strongly convex: O(n^{−1/2}).
◮ All step sizes γn = C n^{−α} with α ∈ (1/2, 1), with averaging:
◮ asymptotic normality, Polyak and Juditsky (1992),
◮ non-asymptotic analysis, Bach and Moulines (2011).
◮ Rate O(1/(µn)) for γn ∝ n^{−1/2}: adapts to strong convexity.
◮ Powerful algorithm:
◮ Simple to implement,
◮ Cheap,
◮ No regularization needed.
◮ Convergence guarantees:
◮ γn = 1/√n is a good choice in most situations.
◮ Initial conditions can be forgotten slowly: could we use larger, constant step sizes?
[Figures: illustration of SGD runs with step size γn = 1/√n]
◮ Least-squares: R(θ) = (1/2) E[(Y − ⟨θ, Φ(X)⟩)^2].
◮ SGD = least-mean-squares (LMS) algorithm.
◮ Usually studied without averaging and with decreasing step sizes.
◮ New analysis for averaging and constant step size γ = 1/(4r^2) (Bach and Moulines, 2013).
◮ Assume ‖Φ(xn)‖ ≤ r and |yn − ⟨Φ(xn), θ*⟩| ≤ σ almost surely.
◮ No assumption regarding the lowest eigenvalues of the Hessian.
◮ Main result: E R(θ̄_{n−1}) − R(θ*) ≤ (4/n)(σ√d + r‖θ0 − θ*‖)^2.
◮ Matches the statistical lower bound of Tsybakov (2003).
◮ Beyond parametric models: non-parametric stochastic approximation with large step sizes (Dieuleveut and Bach, 2015).
◮ Improved sampling: averaged least-mean-squares, bias-variance trade-offs and optimal sampling distributions (Défossez and Bach, 2015).
◮ Acceleration: Harder, Better, Faster, Stronger convergence rates for least-squares regression (Dieuleveut, Flammarion, and Bach, 2016).
◮ Beyond smoothness and Euclidean geometry: stochastic composite least-squares regression (Flammarion and Bach, 2017).
Constant step-size SGD: θ^γ_{k+1} = θ^γ_k − γ (R′(θ^γ_k) + ε_{k+1}(θ^γ_k)).
◮ The sequence (θ^γ_k)_{k∈N} satisfies the Markov property,
◮ and is homogeneous, for γ constant and (εk)_{k∈N} i.i.d.
◮ Assumptions: f′_k = R′ + εk is almost surely L-co-coercive; bounded moments of the noise.
◮ Existence of a limit distribution πγ, and linear convergence of (θ^γ_n)_{n≥0} to πγ in distribution.
◮ Convergence of the second order moments of the chain, and of the averaged iterate: θ̄^γ_n → θ̄_γ := ∫_{R^d} θ dπγ(θ) as n → ∞.
◮ Behavior under the limit distribution as γ → 0: expansion of θ̄_γ around θ*.
†Dieuleveut, Durmus, Bach [2017].
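Before the formal statements, the Markov-chain picture can be illustrated numerically: constant-step SGD on a one-dimensional quadratic with additive noise is an AR(1) chain, and two runs started far apart settle into the same stationary distribution πγ. All constants below (γ, µ, θ*, noise level, run lengths) are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, mu, theta_star = 0.1, 1.0, 3.0
n_steps, burn = 50000, 1000

def chain(theta0, rng):
    # constant-step SGD on R(t) = mu/2 * (t - theta*)^2 with additive noise:
    # an AR(1) Markov chain t_{k+1} = t_k - gamma*(mu*(t_k - theta*) + eps_{k+1})
    t, out = theta0, np.empty(n_steps)
    for k in range(n_steps):
        t = t - gamma * (mu * (t - theta_star) + rng.normal())
        out[k] = t
    return out[burn:]

a = chain(-10.0, rng)
b = chain(+10.0, rng)
# both chains forget their start and sample the same stationary distribution
print(abs(a.mean() - b.mean()), abs(a.std() - b.std()))
```

For this quadratic the stationary mean is exactly θ*; the next slides show that this is special to least-squares.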
Theorem: for γ small enough, the chain (θ^γ_n)_{n≥0} admits a unique stationary distribution πγ, and converges to it linearly:
W2^2(θ^γ_n, πγ) ≤ (1 − µγ)^n W2^2(θ^γ_0, πγ).
Assumptions: f′_1 = R′ + ε1 is almost surely L-co-coercive. Moreover, ε1(θ*) has a finite second order moment.
◮ Rγ denotes the Markov kernel of the chain (θ^γ_k)_{k≥0} starting at θ0 ∼ ν0.
◮ θ^γ_k ∼ δ_{θ0} R^k_γ.
◮ For a test function h, (Rγ h)(θ) = E[h(θ^γ_1) | θ^γ_0 = θ], and E[h(θ^γ_k)] = (δ_{θ0} R^k_γ) h.
◮ Convergence in distribution: (δ_{θ0} R^k_γ)_{k≥0} → πγ, in the Wasserstein metric
W2^2(λ, ν) = inf_{ξ∈Π(λ,ν)} ∫ ‖x − y‖^2 dξ(x, y).
◮ Theorem: (θ^γ_k)_{k≥0} admits a unique stationary distribution πγ, and
W2^2(δθ R^n_γ, πγ) ≤ (1 − 2µγ(1 − γL))^n ∫ ‖θ − θ′‖^2 dπγ(θ′).
◮ Coupling: let θ1 ∼ λ1 and θ2 ∼ λ2 be independent, and build two chains (θ^{(1)}_{k,γ})_{k≥0}, (θ^{(2)}_{k,γ})_{k≥0} driven by the same noise:
θ^{(1)}_{k+1,γ} = θ^{(1)}_{k,γ} − γ (R′(θ^{(1)}_{k,γ}) + ε_{k+1}(θ^{(1)}_{k,γ})),
θ^{(2)}_{k+1,γ} = θ^{(2)}_{k,γ} − γ (R′(θ^{(2)}_{k,γ}) + ε_{k+1}(θ^{(2)}_{k,γ})).
◮ For all k ≥ 0, the distribution of (θ^{(1)}_{k,γ}, θ^{(2)}_{k,γ}) is in Π(λ1 R^k_γ, λ2 R^k_γ).
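A one-dimensional numerical sketch of the coupling: with purely additive noise shared between the two chains, the difference of the iterates contracts deterministically by (1 − γµ) per step. The constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, mu, theta_star = 0.1, 1.0, 0.0

t1, t2 = -5.0, 5.0          # two starting points, 10 apart
dists = []
for k in range(50):
    eps = rng.normal()      # the SAME noise drives both chains (coupling)
    t1 = t1 - gamma * (mu * (t1 - theta_star) + eps)
    t2 = t2 - gamma * (mu * (t2 - theta_star) + eps)
    dists.append(abs(t1 - t2))

# with additive noise, the coupled distance contracts exactly by (1 - gamma*mu)
print(dists[-1], 10.0 * (1 - gamma * mu) ** 50)
```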
W2^2(λ1 Rγ, λ2 Rγ) ≤ E ‖θ^{(1)}_{1,γ} − θ^{(2)}_{1,γ}‖^2
= E ‖(θ1 − γ f′_1(θ1)) − (θ2 − γ f′_1(θ2))‖^2
= E [‖θ1 − θ2‖^2 − 2γ ⟨θ1 − θ2, f′_1(θ1) − f′_1(θ2)⟩ + γ^2 ‖f′_1(θ1) − f′_1(θ2)‖^2]
≤ (1 − 2µγ(1 − γL)) E ‖θ1 − θ2‖^2,
using almost sure L-co-coercivity of f′_1 and µ-strong convexity of R.
◮ Iterating, W2^2(λ1 R^n_γ, λ2 R^n_γ) ≤ E ‖θ^{(1)}_{n,γ} − θ^{(2)}_{n,γ}‖^2.
◮ Thus W2^2(δx R^n_γ, δy R^n_γ) ≤ (1 − 2µγ(1 − γL))^n ‖x − y‖^2.
◮ The set of probability measures with finite second order moment, equipped with W2, is a Polish space.
◮ By the Picard fixed point theorem, (λ1 R^n_γ)_{n≥0} is a Cauchy sequence, hence converges to some πγ.
◮ Uniqueness, invariance, and the Theorem follow:
W2^2(δθ R^n_γ, πγ) ≤ (1 − 2µγ(1 − γL))^n ∫ ‖θ − θ′‖^2 dπγ(θ′).
◮ Since W2(δ_{θ0} R^k_γ, πγ) → 0, expectations E[h(θ^γ_k)] of sufficiently regular functions h converge to πγ(h); in particular, the first and second order moments of the chain converge to those of πγ.
◮ Under stationarity, write one step of the recursion: θ^γ_1 = θ^γ_0 − γ (R′(θ^γ_0) + ε1(θ^γ_0)), with θ^γ_0 ∼ πγ, so that θ^γ_1 ∼ πγ as well.
◮ Taking expectations: ∫ R′(θ) dπγ(θ) = 0 — the mean of the gradient vanishes under πγ, which does not imply θ̄_γ = θ* in general.
◮ Similarly, squaring the one-step recursion θ^γ_1 = θ^γ_0 − γ (R′(θ^γ_0) + ε1(θ^γ_0)) under θ^γ_0 ∼ πγ, and taking expectations, relates the second order moments of πγ to the covariance of the noise.
Expansion of θ̄_γ − θ*, via solutions of Poisson equations:
◮ φ(θ) = θ − θ*; ψγ Poisson solution associated to φ,
◮ χ^1_γ Poisson solution associated to φφ^⊤,
◮ χ^2_γ Poisson solution associated to (ψγ − φ)(ψγ − φ)^⊤.
◮ Algebraic calculation: Rγ encodes a linear relationship between successive moments E[φ(θ^γ_k)].
◮ For the first result, by the Poisson equation ψγ − Rγ ψγ = φ − πγ(φ), telescoping gives
(1/k) Σ_{j=0}^{k−1} (R^j_γ φ)(θ0) = πγ(φ) + (ψγ(θ0) − R^k_γ ψγ(θ0))/k,
◮ with Rγ πγ(φ) = πγ(φ), and R^k_γ ψγ(θ0) = O(ρ^k).
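The telescoping identity behind this argument can be sanity-checked on a finite-state analogue, where the Poisson equation ψ − Rψ = φ − π(φ) can be solved directly. The 2-state kernel and test function below are invented for the check.

```python
import numpy as np

# Tiny finite-state analogue: kernel R (2x2 stochastic matrix), test function phi.
R = np.array([[0.9, 0.1], [0.2, 0.8]])
phi = np.array([1.0, -2.0])

# stationary distribution pi: left eigenvector of R for eigenvalue 1
evals, evecs = np.linalg.eig(R.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()
pi_phi = pi @ phi

# Poisson solution psi: (I - R) psi = phi - pi(phi), normalized so that pi(psi) = 0
psi = np.linalg.lstsq(np.eye(2) - R, phi - pi_phi, rcond=None)[0]
psi = psi - pi @ psi

# Cesaro average of R^j phi versus the telescoped form pi(phi) + (psi - R^k psi)/k
k = 200
cesaro = sum(np.linalg.matrix_power(R, j) @ phi for j in range(k)) / k
telescoped = pi_phi + (psi - np.linalg.matrix_power(R, k) @ psi) / k
print(np.allclose(cesaro, telescoped))  # → True
```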
◮ For least-squares, one can compute E_{πγ}[(θ − θ*)^{⊗2}] explicitly.
◮ Noise decomposition: εn(θ) = f′_n(θ) − R′(θ) = (Φ(xn)Φ(xn)^⊤ − Σ)(θ − θ*) + (⟨θ*, Φ(xn)⟩ − yn)Φ(xn),
a multiplicative part and an additive part.
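This decomposition of the least-squares noise into a multiplicative and an additive part is a one-line algebraic identity; the following numpy check verifies it on random data (dimension, points and noise level are arbitrary, and Σ is taken to be the identity for the check).

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
theta_star = rng.normal(size=d)
Sigma = np.eye(d)               # assume E[Phi Phi^T] = I for this check (illustrative)

theta = rng.normal(size=d)      # current iterate
x = rng.normal(size=d)          # Phi(x_n), so that E[x x^T] = Sigma
y = x @ theta_star + 0.5 * rng.normal()

grad = (x @ theta - y) * x                  # f'_n(theta) for the squared loss
R_prime = Sigma @ (theta - theta_star)      # R'(theta)
eps = grad - R_prime                        # noise eps_n(theta)
decomp = (np.outer(x, x) - Sigma) @ (theta - theta_star) + (x @ theta_star - y) * x
print(np.allclose(eps, decomp))  # → True: multiplicative + additive parts
```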
◮ Convergence in distribution of the Markov chain (Wasserstein metric).
◮ Allows to prove and analyze convergence of the moments of the chain.
◮ We provide a second order development as γ → 0: θ̄_γ = θ* + γ∆ + O(γ^2).
◮ Error decomposition as a sum of three terms:
E f(θ̄^γ_n) − f(θ*) ≤ (initial conditions) + (noise) + (drift θ̄_γ − θ*).
◮ As a consequence, we recover the rate, for γ = 1/√n:
E f(θ̄^γ_n) − f(θ*) = O(1/√n).
◮ Beyond: comparison to the continuous gradient flow for a more refined analysis.
[Figures: behavior of the averaged iterate θ̄n (simulations)]
◮ Richardson–Romberg extrapolation: since θ̄_γ = θ* + γ∆ + O(γ^2), combining averaged iterates run with step sizes γ, 2γ, 4γ cancels the leading error terms:
θ̃^γ_n = (8/3) θ̄^γ_n − 2 θ̄^{2γ}_n + (1/3) θ̄^{4γ}_n.
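A quick numerical check that the coefficients (8/3, −2, 1/3) cancel both the γ and γ² terms of the expansion; the scalar values of θ*, ∆ and the second-order coefficient are invented for the demo.

```python
# Suppose theta_bar(gamma) = theta* + gamma*Delta + gamma^2*C (illustrative scalars).
theta_star, Delta, C = 2.0, 0.7, -0.3

def theta_bar(gamma):
    return theta_star + gamma * Delta + gamma**2 * C

g = 0.1
# Richardson-Romberg combination over step sizes g, 2g, 4g
extrap = 8/3 * theta_bar(g) - 2 * theta_bar(2*g) + 1/3 * theta_bar(4*g)
print(extrap)  # → 2.0: the gamma and gamma^2 terms cancel
```

The coefficients solve a + b + c = 1, a + 2b + 4c = 0, a + 4b + 16c = 0, which removes the first- and second-order terms of the expansion.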
◮ Extending the proofs to the self-concordant setting.
◮ Does this three-term decomposition extend to decaying step sizes?
◮ Understanding the convex case more precisely.
References
Agarwal, A., Negahban, S., and Wainwright, M. J. (2012). Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics, 40(5):2452–2482.
Bach, F. and Moulines, E. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS), pages 451–459.
Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems (NIPS).
Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS).
Défossez, A. and Bach, F. (2015). Averaged least-mean-squares: bias-variance trade-offs and optimal sampling distributions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
Dieuleveut, A. and Bach, F. (2015). Non-parametric stochastic approximation with large step sizes. Annals of Statistics.
Dieuleveut, A., Flammarion, N., and Bach, F. (2016). Harder, Better, Faster, Stronger convergence rates for least-squares regression. ArXiv e-prints.
Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, pages 1327–1332.
Flammarion, N. and Bach, F. (2017). Stochastic composite least-squares regression with convergence rate O(1/n).
Jones, G. L. (2004). On the Markov chain central limit theorem. Probability Surveys, 1:299–320.
Lacoste-Julien, S., Schmidt, M., and Bach, F. (2012). A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. ArXiv e-prints 1212.2002.
Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, John Wiley & Sons, New York.
Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.
Robbins, H. and Siegmund, D. (1985). A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer.
Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.
Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory (COLT).