
Convex Optimization for Data Science. Gasnikov Alexander (PowerPoint presentation transcript)



  1. Convex Optimization for Data Science. Gasnikov Alexander, gasnikov.av@mipt.ru. Lecture 6: Gradient-free methods. Coordinate descent. February 2017.

  2. Main books:
  Spall J.C. Introduction to stochastic search and optimization: estimation, simulation and control. Wiley, 2003.
  Nesterov Yu. Random gradient-free minimization of convex functions // CORE Discussion Paper 2011/1. 2011.
  Nesterov Yu. Efficiency of coordinate descent methods on huge-scale optimization problems // SIAM Journal on Optimization. 2012. V. 22. № 2. P. 341–362.
  Fercoq O., Richtarik P. Accelerated, parallel and proximal coordinate descent // e-print, 2013. arXiv:1312.5799
  Duchi J.C., Jordan M.I., Wainwright M.J., Wibisono A. Optimal rates for zero-order convex optimization: the power of two function evaluations // IEEE Transactions on Information Theory. 2015. V. 61. № 5. P. 2788–2806.
  Wright S.J. Coordinate descent algorithms // e-print, 2015. arXiv:1502.04759
  Gasnikov A.V. Searching equilibriums in large transport networks. Doctoral Thesis. MIPT, 2016. arXiv:1607.03142

  3. Structure of Lecture 6
  - Two-point gradient-free methods and directional derivative methods (preliminary results)
  - Stochastic Mirror Descent and gradient-free methods
  - The principal difference between one-point and two-point feedbacks
  - Non-smooth case (double-smoothing technique)
  - Randomized Similar Triangles Method
  - Randomized coordinate version of the Similar Triangles Method
  - Explanations of why coordinate descent methods can work better in practice than their full-gradient variants
  - Nesterov's examples
  - A typical Data Science problem and its consideration from the (primal / dual) randomized coordinate descent point of view

  4. Two-point gradient-free methods and directional derivative methods

  $$f(x) \to \min_{x \in \mathbb{R}^n}.$$

  All the results can be generalized to the composite case (Lecture 3). We assume the accuracy requirement $\mathbb{E}\!\left[f(x^N)\right] - f_* \le \varepsilon$.

  $N$ – number of required iterations (oracle calls): evaluations of $f$ (realizations) / of the directional derivative of $f$.
  $R$ – "distance" between the starting point and the nearest solution.

  | | $\mathbb{E}\|\nabla_x f(x,\xi)\|_2^2 \le M^2$ | $\|\nabla f(y) - \nabla f(x)\|_2 \le L\|y - x\|_2$ | $\|\nabla f(y) - \nabla f(x)\|_2 \le L\|y - x\|_2$, $\ \mathbb{E}\|\nabla_x f(x,\xi) - \nabla f(x)\|_2^2 \le D^2$ |
  |---|---|---|---|
  | $f(x)$ convex | $\dfrac{n M^2 R^2}{\varepsilon^2}$ | $n\sqrt{\dfrac{L R^2}{\varepsilon}}$ | $n\max\left\{\sqrt{\dfrac{L R^2}{\varepsilon}},\ \dfrac{D^2 R^2}{\varepsilon^2}\right\}$ |
  | $f(x)$ $\mu$-strongly convex in $\|\cdot\|_2$ | $\dfrac{n M^2}{\mu \varepsilon}$ | $n\sqrt{\dfrac{L}{\mu}}\ln\dfrac{L R^2}{\varepsilon}$ | $n\max\left\{\sqrt{\dfrac{L}{\mu}}\ln\dfrac{L R^2}{\varepsilon},\ \dfrac{D^2}{\mu \varepsilon}\right\}$ |
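  As a rough numerical illustration of the first column (the particular values of $n$, $M$, $R$, $\varepsilon$ are arbitrary and not from the lecture), take the convex case with a bounded stochastic gradient:
  $$N \sim \frac{n M^2 R^2}{\varepsilon^2} = \frac{10^3 \cdot 1^2 \cdot 1^2}{(10^{-2})^2} = 10^7 \quad \text{oracle calls for } n = 10^3,\ M = R = 1,\ \varepsilon = 10^{-2}.$$
  Compare this with the $\dfrac{2 M^2 R^2}{\varepsilon^2} = 2\cdot 10^4$ calls of a stochastic gradient oracle needed by SMD on slide 6 below: passing from stochastic gradients to two-point function-value feedback costs roughly an extra factor of $n$.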

  5. Stochastic Mirror Descent (SMD) (Lectures 3, 4)

  Consider the convex optimization problem
  $$f(x) \to \min_{x \in Q}, \qquad (1)$$
  with a stochastic oracle that returns a stochastic subgradient $\nabla_x f(x,\xi)$ such that
  $$\mathbb{E}_\xi\!\left[\nabla_x f(x,\xi)\right] = \nabla f(x). \qquad (2)$$
  We introduce the $p$-norm ($p \in [1,2]$) with $1/p + 1/q = 1$ (so $q \in [2,\infty]$) and assume that
  $$\mathbb{E}\!\left[\|\nabla_x f(x,\xi)\|_q^2\right] \le M^2. \qquad (3)$$
  We introduce a prox-function $d(x) \ge 0$ ($d(x^0) = 0$) which is 1-strongly convex with respect to the $p$-norm, and Bregman's divergence (Lecture 3)
  $$V(x,z) = d(x) - d(z) - \langle \nabla d(z),\, x - z \rangle.$$
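  For instance (an illustrative instantiation, not necessarily the one used on this slide): for the unit simplex $Q = S_n(1)$ with $p = 1$ one can take the entropy prox-function $d(x) = \sum_{i=1}^n x_i \ln x_i + \ln n$, so that $d(x^0) = 0$ at the barycenter $x^0 = (1/n,\dots,1/n)$ and $d$ is 1-strongly convex in $\|\cdot\|_1$ on the simplex (Pinsker's inequality). Then Bregman's divergence is the Kullback–Leibler divergence:
  $$V(x,z) = \sum_{i=1}^n x_i \ln\frac{x_i}{z_i}, \qquad V(x_*, x^0) \le \ln n.$$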

  6. The method is
  $$x^{k+1} = \mathrm{Mirr}_{x^k}\!\left(h\,\nabla_x f(x^k,\xi^k)\right), \qquad \mathrm{Mirr}_{x^k}(v) = \arg\min_{x \in Q}\left\{\langle v,\, x - x^k\rangle + V(x, x^k)\right\}.$$
  We put $R^2 = V(x_*, x^0)$, where $x_*$ is the solution of (1) (if $x_*$ isn't unique, we take the $x_*$ that minimizes $V(x_*, x^0)$). If the $\xi^k$ are i.i.d., we set
  $$\bar{x}^N = \frac{1}{N}\sum_{k=0}^{N-1} x^k, \qquad h = \frac{R}{M}\sqrt{\frac{2}{N}}.$$
  Then, after (all the results cited below in this Lecture can also be expressed in terms of bounds on the probabilities of large deviations, see Lecture 4)
  $$N = \frac{2 M^2 R^2}{\varepsilon^2}$$
  iterations (oracle calls),
  $$\mathbb{E}\!\left[f(\bar{x}^N)\right] - f_* \le \varepsilon.$$
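  A minimal runnable sketch of SMD with the entropy prox on the unit simplex ($p = 1$); the objective, the noise model and the constants $M$, $R$ below are illustrative assumptions, not data from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 50, 20000
A = rng.standard_normal((n, n))
A = A.T @ A / n                       # f(x) = 0.5 * x^T A x, convex on the simplex Q = S_n(1)

def stoch_grad(x):
    # unbiased stochastic gradient: exact gradient plus zero-mean noise, cf. (2)
    return A @ x + 0.1 * rng.standard_normal(n)

def mirror_step(x, v, h):
    # Mirr_x(h*v) for d(x) = sum_i x_i ln x_i: a multiplicative-weights update
    w = x * np.exp(-h * v)
    return w / w.sum()

x = np.full(n, 1.0 / n)               # x^0 = barycenter, so V(x_*, x^0) <= ln n
R = np.sqrt(np.log(n))
M = np.abs(A).max() + 0.5             # crude bound M on ||stoch_grad(x)||_inf over the simplex (assumption)
h = (R / M) * np.sqrt(2.0 / N)        # step size from the SMD rate above
x_avg = np.zeros(n)
for k in range(N):
    x_avg += x / N                    # running average bar{x}^N of the iterates
    x = mirror_step(x, stoch_grad(x), h)

print("f(x_avg) =", 0.5 * x_avg @ A @ x_avg)
```

  The multiplicative update is the closed form of the mirror step for the entropy prox; with a Euclidean prox the same loop would become projected stochastic subgradient descent.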

  7. Idea (randomization!)

  $$\nabla_x f(x^k, \xi^k, e^k) := \frac{n}{\tau}\, f(x^k + \tau e^k, \xi^k)\, e^k, \qquad \text{(one-point feedback)} \quad (4)$$
  $$\nabla_x f(x^k, \xi^k, e^k) := \frac{n}{\tau}\left(f(x^k + \tau e^k, \xi^k) - f(x^k, \xi^k)\right) e^k, \qquad \text{(two-point feedback)} \quad (5)$$
  $$\nabla_x f(x^k, \xi^k, e^k) := n\,\langle \nabla_x f(x^k, \xi^k),\, e^k \rangle\, e^k. \qquad \text{(directional derivative feedback)} \quad (6)$$

  Assume that $f(x^k, \xi^k)$ is available with (non-stochastic) small noise of level $\delta$.

  How to choose the i.i.d. $e^k$? Two main approaches:
  - $e^k \in RS_2^n(1)$ – $e^k$ is uniformly distributed on the unit Euclidean sphere in $\mathbb{R}^n$;
  - $e^k = (0,\dots,0,1,0,\dots,0)^T$ (the 1 in position $i$), each $i$ with probability $1/n$ (coordinate descent), for (5), (6).
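  A minimal sketch of the three randomized feedbacks (4)–(6) with $e \in RS_2^n(1)$; the oracles `f_noisy` and `grad` passed in are assumptions for illustration (any function-value / stochastic-gradient oracle would do):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_direction(n):
    # e ~ RS_2^n(1): uniform on the unit Euclidean sphere
    e = rng.standard_normal(n)
    return e / np.linalg.norm(e)

def one_point(f_noisy, x, tau):
    # feedback (4): (n / tau) * f(x + tau*e, xi) * e
    e = random_direction(x.size)
    return (x.size / tau) * f_noisy(x + tau * e) * e

def two_point(f_noisy, x, tau):
    # feedback (5): (n / tau) * (f(x + tau*e, xi) - f(x, xi)) * e
    # (both calls should share the same realization xi of the randomness)
    e = random_direction(x.size)
    return (x.size / tau) * (f_noisy(x + tau * e) - f_noisy(x)) * e

def directional(grad, x):
    # feedback (6): n * <grad_x f(x, xi), e> * e
    e = random_direction(x.size)
    return x.size * np.dot(grad(x), e) * e
```

  For the coordinate-descent choice of $e^k$, `random_direction` would instead return a random standard basis vector; the factor $n$ in front stays the same.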

  8. Note that (we can't let $\tau \to 0$ in (5) because of $\delta$ in (7))
  $$\mathbb{E}_{e^k}\!\left[n\,\langle \nabla_x f(x^k,\xi^k),\, e^k\rangle\, e^k\right] = \nabla_x f(x^k,\xi^k), \qquad \text{(see (2))}$$
  $$\mathbb{E}_{e^k}\left\|\frac{n}{\tau}\left(f(x^k + \tau e^k,\xi^k) - f(x^k,\xi^k)\right) e^k\right\|_q^2 \le 3 n^2\, \mathbb{E}_{e^k}\!\left[\langle \nabla_x f(x^k,\xi^k),\, e^k\rangle^2 \|e^k\|_q^2\right] + \frac{3 n^2 L^2 \tau^2}{4}\, \mathbb{E}\|e^k\|_q^2 + \frac{12\, n^2 \delta^2}{\tau^2}\, \mathbb{E}\|e^k\|_q^2. \qquad \text{(see (3))} \quad (7)$$

  If $\mathbb{E}\!\left[f(x^k,\xi^k)^2\right] \le B^2$, then
  $$\mathbb{E}_{e^k}\left\|\frac{n}{\tau}\, f(x^k + \tau e^k,\xi^k)\, e^k\right\|_q^2 \le \frac{n^2 B^2}{\tau^2}\, \mathbb{E}\|e^k\|_q^2. \qquad \text{(see (3))} \quad (8)$$

  For the coordinate-descent randomization it is optimal to choose $p = q = 2$. The results will be the same as for $e^k \in RS_2^n(1)$, so from now on we concentrate on $e^k \in RS_2^n(1)$.
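  A small Monte-Carlo check of the first identity (unbiasedness of the directional-derivative estimator (6)); the gradient vector used here is an arbitrary stand-in:

```python
import numpy as np

rng = np.random.default_rng(3)
n, samples = 10, 200000
g = rng.standard_normal(n)                       # stands in for grad_x f(x^k, xi^k)

E = rng.standard_normal((samples, n))
E /= np.linalg.norm(E, axis=1, keepdims=True)    # rows are samples of e^k ~ RS_2^n(1)

est = n * (E @ g)[:, None] * E                   # n * <g, e> * e for each sample
# the empirical mean should approach g (error shrinks like 1/sqrt(samples))
print("max |mean(est) - g| =", np.abs(est.mean(axis=0) - g).max())
```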

  9. If $e \in RS_2^n(1)$, then due to the measure concentration phenomenon (I. Usmanova):
  $$\mathbb{E}\|e\|_q^2 \le \min\{q - 1,\, 4\ln n\}\, n^{2/q - 1}, \qquad \mathbb{E}\langle c, e\rangle^2 = \frac{\|c\|_2^2}{n},$$
  $$\mathbb{E}\!\left[\langle c, e\rangle^2\, \|e\|_q^2\right] \le \frac{4\,\|c\|_2^2}{n}\, \min\{q - 1,\, 4\ln n\}\, n^{2/q - 1}.$$

  So the choice of $p \in [1,2]$ ($q \in [2,\infty]$) is already nontrivial! For example, for $Q = S_n(1)$ – the unit simplex in $\mathbb{R}^n$ – it is natural to choose $p = 1$ ($q = \infty$).

  For the function-value feedbacks ((4), (5)) we have a biased estimate of the gradient ((2) no longer holds), so one has to generalize the approach described above:
  $$\mathbb{E}_{e^k}\!\left[\frac{n}{\tau}\, f(x^k + \tau e^k,\xi^k)\, e^k\right] = \mathbb{E}_{e^k}\!\left[\frac{n}{\tau}\left(f(x^k + \tau e^k,\xi^k) - f(x^k,\xi^k)\right) e^k\right] \quad \text{(because } \mathbb{E}_{e^k}[e^k] = 0\text{)}$$
  $$\longrightarrow\ \mathbb{E}_{e^k}\!\left[n\,\langle \nabla_x f(x^k,\xi^k),\, e^k\rangle\, e^k\right] \quad \text{as } \tau \to 0 \text{ (if } \delta = 0\text{)}.$$
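  A quick Monte-Carlo sanity check of the first two quantities for $e \in RS_2^n(1)$ (the dimension, sample size and test vector $c$ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, samples = 200, 20000
c = rng.standard_normal(n)

E = rng.standard_normal((samples, n))
E /= np.linalg.norm(E, axis=1, keepdims=True)   # rows are uniform on the unit sphere

# E<c, e>^2 equals ||c||_2^2 / n exactly (in expectation)
print("E<c,e>^2      ~", ((E @ c) ** 2).mean())
print("||c||_2^2 / n =", np.dot(c, c) / n)

# E||e||_inf^2 (the q = infinity case) should be of order ln(n) / n
print("E||e||_inf^2  ~", (np.abs(E).max(axis=1) ** 2).mean())
print("ln(n) / n     =", np.log(n) / n)
```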

  10. Assume that, instead of the real (unbiased) stochastic gradients $\nabla_x f(x^k,\xi^k)$ (see (2)), only biased ones $\tilde{\nabla}_x f(x^k,\xi^k)$ are available, which satisfy (3) and additionally
  $$\frac{1}{N}\sum_{k=1}^{N} \sup_{x^k}\left\langle \mathbb{E}\!\left[\tilde{\nabla}_x f(x^k,\xi^k) - \nabla_x f(x^k,\xi^k) \,\middle|\, \xi^1,\dots,\xi^{k-1}\right],\ x^k - x_* \right\rangle \le \tilde{\delta};$$
  then
  $$\mathbb{E}\!\left[f(\bar{x}^N)\right] - f_* \le \varepsilon + \tilde{\delta}.$$

  If $\delta$ is small enough, then one can show (by the optimal choice of $\tau$) that for the one-point feedback (4) (stochastic setting, $\mathbb{E}\!\left[f(x,\xi)^2\right] \le B^2$, $R = \|x^0 - x_*\|_p$):

  | | $\mathbb{E}\|\nabla_x f(x,\xi)\|_2^2 \le M^2$ | $\|\nabla f(y) - \nabla f(x)\|_2 \le L\|y - x\|_2$ |
  |---|---|---|
  | $f(x)$ convex | $\dfrac{n^{1 + 2/q}\, B^2 M^2 R^2}{\varepsilon^4}$ | $\dfrac{n^{1 + 2/q}\, B^2 L R^2}{\varepsilon^3}$ |
  | $f(x)$ $\mu$-strongly convex in $\|\cdot\|_2$ | $\dfrac{n^{1 + 2/q}\, B^2 M^2}{\mu\, \varepsilon^3}$ | $\dfrac{n^{1 + 2/q}\, B^2 L}{\mu\, \varepsilon^2}$ |
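  Taking the cells of this table and of the table on slide 4 at face value, a short comparison for the Euclidean setup $p = q = 2$ (so $n^{1+2/q} = n^2$):
  $$\underbrace{\frac{n^2 B^2 M^2 R^2}{\varepsilon^4}}_{\text{one-point (4)}} \Bigg/ \underbrace{\frac{n M^2 R^2}{\varepsilon^2}}_{\text{two-point (5)}} = \frac{n B^2}{\varepsilon^2},$$
  so with one-point function-value feedback the complexity carries an extra factor that blows up as $\varepsilon \to 0$: this is the "principal difference between one-point and two-point feedbacks" announced in the lecture plan (slide 3).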
