

slide-1
SLIDE 1

Optimization

Aymeric DIEULEVEUT

EPFL, Lausanne

January 26, 2018, Journées YSP

1

slide-2
SLIDE 2

Outline

  • 1. General context and examples.
  • 2. What makes optimization hard ?

2

slide-3
SLIDE 3

Outline

  • 1. General context and examples.
  • 2. What makes optimization hard ?

In the context of supervised machine learning:

  • 3. Minimizing Empirical Risk.

2

slide-4
SLIDE 4

Outline

  • 1. General context and examples.
  • 2. What makes optimization hard ?

In the context of supervised machine learning:

  • 3. Minimizing Empirical Risk.
  • 4. Minimizing Generalization Risk.

2

slide-5
SLIDE 5

General context

What is optimization about?

$$\min_{\theta \in \Theta} f(\theta)$$

With $\theta$ a parameter, and $f$ a cost function.

3

slide-6
SLIDE 6

General context

What is optimization about?

$$\min_{\theta \in \Theta} f(\theta)$$

With $\theta$ a parameter, and $f$ a cost function. Why?

3

slide-7
SLIDE 7

General context

What is optimization about?

$$\min_{\theta \in \Theta} f(\theta)$$

With $\theta$ a parameter, and $f$ a cost function. Why? We formulate our problem as an optimization problem. 3 examples:

◮ Supervised machine learning
◮ Signal processing
◮ Optimal transport

3

slide-8
SLIDE 8

Some Examples

Example 1: Supervised Machine Learning. Goal: predict a phenomenon from "explanatory variables", given a set of observations.

4

slide-9
SLIDE 9

Some Examples

Example 1: Supervised Machine Learning. Goal: predict a phenomenon from "explanatory variables", given a set of observations.

◮ Bio-informatics. Input: DNA/RNA sequence; Output: drug responsiveness.
◮ Image classification. Input: images; Output: digit.

4

slide-10
SLIDE 10

Supervised Machine Learning

Example 1: Supervised Machine Learning Consider an input/output pair (X, Y ) ∈ X × Y, (X, Y ) ∼ ρ. Goal: function θ : X → R, s.t. θ(X) good prediction for Y .

5

slide-11
SLIDE 11

Supervised Machine Learning

Example 1: Supervised Machine Learning. Consider an input/output pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, $(X, Y) \sim \rho$. Goal: a function $\theta : \mathcal{X} \to \mathbb{R}$, s.t. $\theta(X)$ is a good prediction for $Y$. Here, as a linear function $\langle \theta, \Phi(X) \rangle$ of features $\Phi(X) \in \mathbb{R}^d$.

5

slide-12
SLIDE 12

Supervised Machine Learning

Example 1: Supervised Machine Learning. Consider an input/output pair $(X, Y) \in \mathcal{X} \times \mathcal{Y}$, $(X, Y) \sim \rho$. Goal: a function $\theta : \mathcal{X} \to \mathbb{R}$, s.t. $\theta(X)$ is a good prediction for $Y$. Here, as a linear function $\langle \theta, \Phi(X) \rangle$ of features $\Phi(X) \in \mathbb{R}^d$. Consider a loss function $\ell : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}_+$. Define the Generalization risk: $R(\theta) := \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$.

5

slide-13
SLIDE 13

Empirical Risk minimization (I)

Data: $n$ observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d. Empirical risk (or training error):

$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big).$$

6

slide-14
SLIDE 14

Empirical Risk minimization (I)

Data: $n$ observations $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, i.i.d. Empirical risk (or training error):

$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big).$$

Empirical risk minimization (ERM): find $\hat{\theta}$ solution of

$$\min_{\theta \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big) + \mu \Omega(\theta).$$

convex data-fitting term + regularizer

6

slide-15
SLIDE 15

Empirical Risk minimization (II)

For example, least-squares regression:

$$\min_{\theta \in \mathbb{R}^d} \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i - \langle \theta, \Phi(x_i) \rangle\big)^2 + \mu \Omega(\theta),$$

7

slide-16
SLIDE 16

Empirical Risk minimization (II)

For example, least-squares regression:

$$\min_{\theta \in \mathbb{R}^d} \; \frac{1}{2n} \sum_{i=1}^{n} \big(y_i - \langle \theta, \Phi(x_i) \rangle\big)^2 + \mu \Omega(\theta),$$

and logistic regression:

$$\min_{\theta \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \log\big(1 + \exp(-y_i \langle \theta, \Phi(x_i) \rangle)\big) + \mu \Omega(\theta).$$

Two fundamental questions: (1) computing and (2) analyzing $\hat{\theta}$. The problem is formalized as a (convex) optimization problem. In the large-scale setting, the problem is high dimensional and has many examples.
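As a concrete illustration of the regularized logistic-regression objective above, here is a minimal NumPy sketch of the empirical risk and its gradient, assuming the identity feature map $\Phi(x) = x$ and $\Omega(\theta) = \tfrac{1}{2}\|\theta\|^2$; the data and the constant mu are synthetic placeholders, not part of the talk.

```python
import numpy as np

def logistic_erm(theta, X, y, mu):
    """(1/n) * sum_i log(1 + exp(-y_i <theta, x_i>)) + (mu/2) * ||theta||^2."""
    margins = y * (X @ theta)                      # y_i <theta, Phi(x_i)>
    risk = np.mean(np.log1p(np.exp(-margins)))     # data-fitting term
    return risk + 0.5 * mu * theta @ theta         # + regularizer

def logistic_erm_grad(theta, X, y, mu):
    """Gradient of the objective above."""
    margins = y * (X @ theta)
    sigma = 1.0 / (1.0 + np.exp(margins))          # sigmoid(-margin)
    grad_risk = -(X.T @ (y * sigma)) / len(y)
    return grad_risk + mu * theta

# Hypothetical usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
theta_true = rng.normal(size=5)
y = np.sign(X @ theta_true + 0.1 * rng.normal(size=200))
theta0 = np.zeros(5)
print(logistic_erm(theta0, X, y, mu=0.1))
```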

7

slide-17
SLIDE 17

Some Examples

Example 2: Signal processing. Observe a signal $Y \in \mathbb{R}^{n \times q}$, try to recover the source $B \in \mathbb{R}^{p \times q}$, knowing the "forward matrix" $X \in \mathbb{R}^{n \times p}$ (multi-task regression):

$$\min_{\beta} \; \|X\beta - Y\|_F^2 + \lambda \Omega(\beta)$$

8

slide-18
SLIDE 18

Some Examples

Example 2: Signal processing. Observe a signal $Y \in \mathbb{R}^{n \times q}$, try to recover the source $B \in \mathbb{R}^{p \times q}$, knowing the "forward matrix" $X \in \mathbb{R}^{n \times p}$ (multi-task regression):

$$\min_{\beta} \; \|X\beta - Y\|_F^2 + \lambda \Omega(\beta)$$

$\Omega$: sparsity-inducing regularization.

8

slide-19
SLIDE 19

Some Examples

Example 2: Signal processing. Observe a signal $Y \in \mathbb{R}^{n \times q}$, try to recover the source $B \in \mathbb{R}^{p \times q}$, knowing the "forward matrix" $X \in \mathbb{R}^{n \times p}$ (multi-task regression):

$$\min_{\beta} \; \|X\beta - Y\|_F^2 + \lambda \Omega(\beta)$$

$\Omega$: sparsity-inducing regularization. How to choose $\lambda$?

8

slide-20
SLIDE 20

Some Examples

Example 3: Optimal transport

$$\min_{\pi \in \Pi} \int c(x, y)\, d\pi(x, y)$$

$\Pi$: set of probability distributions; $c(x, y)$: "distance" from $x$ to $y$; + regularization. Kantorovich formulation of OT.

9

slide-21
SLIDE 21

Is it a (hard) problem?

for convex optimization, in 99 % of the cases, no.

10

slide-22
SLIDE 22

Is it a (hard) problem?

for convex optimization, in 99 % of the cases, no. In other words:

10

slide-23
SLIDE 23

Is it a (hard) problem?

for convex optimization, in 99 % of the cases, no. In other words: Use cvxpy

10

slide-24
SLIDE 24

Is it a (hard) problem?

for convex optimization, in 99% of the cases, no. In other words: use cvxpy. The interesting (or hard) problems are the remaining cases.
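For instance, the regularized least-squares problem from the previous slides can be handed directly to cvxpy; a minimal sketch on synthetic placeholder data (the regularization weight mu is illustrative, not from the talk):

```python
import cvxpy as cp
import numpy as np

# Synthetic data (placeholders)
n, d = 100, 10
rng = np.random.default_rng(0)
A = rng.normal(size=(n, d))
y = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
mu = 0.1

# Regularized least squares: (1/2n) ||A theta - y||^2 + mu * ||theta||_1
theta = cp.Variable(d)
objective = cp.Minimize(cp.sum_squares(A @ theta - y) / (2 * n) + mu * cp.norm1(theta))
problem = cp.Problem(objective)
problem.solve()            # generic convex solver does the work
print(theta.value)         # approximate minimizer
```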

10

slide-25
SLIDE 25

What makes it hard: 1. Convexity

Why?

11

slide-26
SLIDE 26

What makes it hard: 1. Convexity

Why? Typical non-convex problems: Empirical risk minimization with 0-1 loss.

11

slide-27
SLIDE 27

What makes it hard: 1. Convexity

Why? Typical non-convex problems: Empirical risk minimization with the 0-1 loss:

$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{y_i \neq \operatorname{sign}\langle \theta, \Phi(x_i) \rangle}.$$

11

slide-28
SLIDE 28

What makes it hard: 1. Convexity

Why? Typical non-convex problems: Empirical risk minimization with the 0-1 loss:

$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{y_i \neq \operatorname{sign}\langle \theta, \Phi(x_i) \rangle}.$$

Matrix factorization: $\min_{Y, W} \|X - YW\|_F^2$ is not jointly convex.

11

slide-29
SLIDE 29

What makes it hard: 1. Convexity

Why? Typical non-convex problems: Empirical risk minimization with the 0-1 loss:

$$\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{y_i \neq \operatorname{sign}\langle \theta, \Phi(x_i) \rangle}.$$

Matrix factorization: $\min_{Y, W} \|X - YW\|_F^2$ is not jointly convex. Neural networks: parametric non-convex functions.

11

slide-30
SLIDE 30

What makes it hard: 2. Regularity of the function

  • a. Smoothness

◮ A function $g : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if and only if it is twice differentiable and, $\forall \theta \in \mathbb{R}^d$, $\mathrm{eigenvalues}\big(g''(\theta)\big) \le L$.

12

slide-31
SLIDE 31

What makes it hard: 2. Regularity of the function

  • a. Smoothness

◮ A function $g : \mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if and only if it is twice differentiable and, $\forall \theta \in \mathbb{R}^d$, $\mathrm{eigenvalues}\big(g''(\theta)\big) \le L$.

For all $\theta, \theta' \in \mathbb{R}^d$:

$$g(\theta) \le g(\theta') + \langle g'(\theta'), \theta - \theta' \rangle + \frac{L}{2} \|\theta - \theta'\|^2$$

12

slide-32
SLIDE 32

What makes it hard: 2. Regularity of the function

  • b. Strong Convexity

◮ A twice differentiable function $g : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex if and only if, $\forall \theta \in \mathbb{R}^d$, $\mathrm{eigenvalues}\big(g''(\theta)\big) \ge \mu$.

13

slide-33
SLIDE 33

What makes it hard: 2. Regularity of the function

  • b. Strong Convexity

◮ A twice differentiable function $g : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex if and only if, $\forall \theta \in \mathbb{R}^d$, $\mathrm{eigenvalues}\big(g''(\theta)\big) \ge \mu$.

For all $\theta, \theta' \in \mathbb{R}^d$:

$$g(\theta) \ge g(\theta') + \langle g'(\theta'), \theta - \theta' \rangle + \frac{\mu}{2} \|\theta - \theta'\|^2$$

13

slide-34
SLIDE 34

What makes it hard: 2. Regularity of the function

Why? Rates typically depend on the condition number $\kappa = L/\mu$:

14

slide-35
SLIDE 35

What makes it hard: 2. Regularity of the function

Why? Rates typically depend on the condition number $\kappa = L/\mu$:

Large $\kappa$: harder to optimize. Small $\kappa$: easier to optimize.
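As a small numerical aside (not from the slides): for the least-squares empirical risk the Hessian is the feature covariance matrix, so $L$, $\mu$ and $\kappa$ can be read off its extreme eigenvalues. A sketch on synthetic placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
# Features with deliberately uneven scales, to make the problem ill-conditioned
Phi = rng.normal(size=(n, d)) @ np.diag(np.linspace(0.1, 3.0, d))

# Hessian of the least-squares empirical risk (1/2n) sum_i (y_i - <theta, Phi(x_i)>)^2
H = Phi.T @ Phi / n
eigs = np.linalg.eigvalsh(H)
L, mu = eigs.max(), eigs.min()
print(f"L = {L:.3f}, mu = {mu:.4f}, condition number kappa = {L / mu:.1f}")
```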

14

slide-36
SLIDE 36

Smoothness and strong convexity in ML

We consider an a.s. convex loss in $\theta$. Thus $\hat{R}$ and $R$ are convex.

15

slide-37
SLIDE 37

Smoothness and strong convexity in ML

We consider an a.s. convex loss in $\theta$. Thus $\hat{R}$ and $R$ are convex. Hessian of $\hat{R}$ $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^\top$.

15

slide-38
SLIDE 38

Smoothness and strong convexity in ML

We consider an a.s. convex loss in $\theta$. Thus $\hat{R}$ and $R$ are convex. Hessian of $\hat{R}$ $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^\top$.

If $\ell$ is smooth, and $\mathbb{E}[\|\Phi(X)\|^2] \le r^2$, then $R$ is smooth. If $\ell$ is $\mu$-strongly convex, and the data has an invertible covariance matrix (low correlation/dimension), then $R$ is strongly convex.

15

slide-39
SLIDE 39

Smoothness and strong convexity in ML

We consider an a.s. convex loss in $\theta$. Thus $\hat{R}$ and $R$ are convex. Hessian of $\hat{R}$ $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^\top$.

If $\ell$ is smooth, and $\mathbb{E}[\|\Phi(X)\|^2] \le r^2$, then $R$ is smooth. If $\ell$ is $\mu$-strongly convex, and the data has an invertible covariance matrix (low correlation/dimension), then $R$ is strongly convex.

Importance of regularization: provides strong convexity, and avoids overfitting.

15

slide-40
SLIDE 40

Smoothness and strong convexity in ML

We consider an a.s. convex loss in $\theta$. Thus $\hat{R}$ and $R$ are convex. Hessian of $\hat{R}$ $\approx$ covariance matrix $\frac{1}{n} \sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^\top$.

If $\ell$ is smooth, and $\mathbb{E}[\|\Phi(X)\|^2] \le r^2$, then $R$ is smooth. If $\ell$ is $\mu$-strongly convex, and the data has an invertible covariance matrix (low correlation/dimension), then $R$ is strongly convex.

Importance of regularization: provides strong convexity, and avoids overfitting.

Note: when considering the dual formulation of the problem:

◮ $L$-smoothness $\leftrightarrow$ $1/L$-strong convexity
◮ $\mu$-strong convexity $\leftrightarrow$ $1/\mu$-smoothness

15

slide-41
SLIDE 41

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$.

16

slide-42
SLIDE 42

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

16

slide-43
SLIDE 43

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

◮ Projection might be difficult or impossible.

16

slide-44
SLIDE 44

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

◮ Projection might be difficult or impossible.

use algorithms requiring linear minimization oracle instead of quadratic oracles (Frank Wolfe)

16

slide-45
SLIDE 45

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

◮ Projection might be difficult or impossible.

use algorithms requiring linear minimization oracle instead of quadratic oracles (Frank Wolfe)

◮ Even when Θ = Rd, d might be very large (typically

millions)

16

slide-46
SLIDE 46

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

◮ Projection might be difficult or impossible.

use algorithms requiring linear minimization oracle instead of quadratic oracles (Frank Wolfe)

◮ Even when Θ = Rd, d might be very large (typically

millions) use only first order methods

16

slide-47
SLIDE 47

What makes it hard: 3. Set Θ, complexity of f

  • a. Set Θ: (if Θ is a convex set.)

◮ May be described implicitly (via equations):

$\Theta = \{\theta \in \mathbb{R}^d \text{ s.t. } \|\theta\|_2 \le R \text{ and } \langle \theta, \mathbf{1} \rangle = r\}$. Use the dual formulation of the problem.

◮ Projection might be difficult or impossible.

use algorithms requiring linear minimization oracle instead of quadratic oracles (Frank Wolfe)

◮ Even when Θ = Rd, d might be very large (typically

millions) use only first order methods

  • b. Structure of $f$. If $f = \hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big)$, computing a gradient has a cost proportional to $n$.
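To make that cost concrete, here is a hypothetical NumPy sketch contrasting one full-gradient evaluation (cost $O(nd)$) with one stochastic-gradient evaluation (cost $O(d)$) for least squares; data and sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 50
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d)
theta = np.zeros(d)

def full_gradient(theta):
    # Touches all n observations: cost O(n d)
    residuals = Phi @ theta - y
    return Phi.T @ residuals / n

def stochastic_gradient(theta):
    # Touches a single observation: cost O(d)
    i = rng.integers(n)
    return (Phi[i] @ theta - y[i]) * Phi[i]

print(full_gradient(theta).shape, stochastic_gradient(theta).shape)
```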

16

slide-48
SLIDE 48

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

17

slide-49
SLIDE 49

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

What happens for supervised machine learning?

17

slide-50
SLIDE 50

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

What happens for supervised machine learning? Goals:

◮ present algorithms (convex, large dimension, high number of observations)

17

slide-51
SLIDE 51

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

What happens for supervised machine learning? Goals:

◮ present algorithms (convex, large dimension, high number of observations)
◮ show how rates depend on smoothness and strong convexity

17

slide-52
SLIDE 52

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

What happens for supervised machine learning? Goals:

◮ present algorithms (convex, large dimension, high number of observations)
◮ show how rates depend on smoothness and strong convexity
◮ show how we can use the structure

17

slide-53
SLIDE 53

Optimization

Take home

◮ We express problems as minimizing a function over a set
◮ Most convex problems are solved
◮ Difficulties come from non-convexity, lack of regularity, complexity of the set Θ (or high dimension), complexity of computing gradients

What happens for supervised machine learning? Goals:

◮ present algorithms (convex, large dimension, high number of observations)
◮ show how rates depend on smoothness and strong convexity
◮ show how we can use the structure
◮ not forgetting the initial problem...!

17

slide-54
SLIDE 54

Stochastic algorithms for ERM

$$\min_{\theta \in \mathbb{R}^d} \; \hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big).$$

Two fundamental questions: (a) computing, (b) analyzing $\hat{\theta}$.

18

slide-55
SLIDE 55

Stochastic algorithms for ERM

$$\min_{\theta \in \mathbb{R}^d} \; \hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big).$$

Two fundamental questions: (a) computing, (b) analyzing $\hat{\theta}$. "Large scale" framework: the number of examples $n$ and the number of explanatory variables $d$ are both large.

1. High dimension $d$ $\Longrightarrow$ first-order algorithms. Gradient Descent (GD): $\theta_k = \theta_{k-1} - \gamma_k \hat{R}'(\theta_{k-1})$

18

slide-56
SLIDE 56

Stochastic algorithms for ERM

$$\min_{\theta \in \mathbb{R}^d} \; \hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, \langle \theta, \Phi(x_i) \rangle\big).$$

Two fundamental questions: (a) computing, (b) analyzing $\hat{\theta}$. "Large scale" framework: the number of examples $n$ and the number of explanatory variables $d$ are both large.

1. High dimension $d$ $\Longrightarrow$ first-order algorithms. Gradient Descent (GD): $\theta_k = \theta_{k-1} - \gamma_k \hat{R}'(\theta_{k-1})$. Problem: computing the gradient costs $O(dn)$ per iteration.

2. Large $n$ $\Longrightarrow$ stochastic algorithms: Stochastic Gradient Descent (SGD). A minimal sketch of the GD update follows below.
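A minimal sketch of plain gradient descent on the least-squares empirical risk, on synthetic placeholder data with an untuned constant step size; each iteration touches all $n$ observations, which is exactly the $O(dn)$ cost noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 20
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
gamma = 0.5                                 # constant step, should stay below 2/L
for k in range(200):
    grad = Phi.T @ (Phi @ theta - y) / n    # full gradient: O(n d) per iteration
    theta -= gamma * grad

print("empirical risk:", 0.5 * np.mean((Phi @ theta - y) ** 2))
```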

18

slide-57
SLIDE 57

Stochastic Gradient descent

◮ Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$, given unbiased gradient estimates $f'_n$.
◮ $\theta_* := \arg\min_{\mathbb{R}^d} f(\theta)$.

[Figure: the minimizer $\theta_*$.]

19

slide-58
SLIDE 58

Stochastic Gradient descent

◮ Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$, given unbiased gradient estimates $f'_n$.
◮ $\theta_* := \arg\min_{\mathbb{R}^d} f(\theta)$.
◮ Key algorithm: Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951): $\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$
◮ $\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = f'(\theta_{k-1})$ for a filtration $(\mathcal{F}_k)_{k \ge 0}$; $\theta_k$ is $\mathcal{F}_k$-measurable.

19

[Figure: starting point $\theta_0$ and minimizer $\theta_*$.]

slide-59
SLIDE 59

Stochastic Gradient descent

◮ Goal: $\min_{\theta \in \mathbb{R}^d} f(\theta)$, given unbiased gradient estimates $f'_n$.
◮ $\theta_* := \arg\min_{\mathbb{R}^d} f(\theta)$.
◮ Key algorithm: Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951): $\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$
◮ $\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = f'(\theta_{k-1})$ for a filtration $(\mathcal{F}_k)_{k \ge 0}$; $\theta_k$ is $\mathcal{F}_k$-measurable.

[Figure: SGD iterates $\theta_0, \theta_1, \dots, \theta_n$ approaching $\theta_*$.]

19

slide-60
SLIDE 60

SGD for ERM: $f = \hat{R}$

Loss for a single pair of observations, for any $j \le n$: $f_j(\theta) := \ell(y_j, \langle \theta, \Phi(x_j) \rangle)$. One observation at each step $\Longrightarrow$ complexity $O(d)$ per iteration.

20

slide-61
SLIDE 61

SGD for ERM: $f = \hat{R}$

Loss for a single pair of observations, for any $j \le n$: $f_j(\theta) := \ell(y_j, \langle \theta, \Phi(x_j) \rangle)$. One observation at each step $\Longrightarrow$ complexity $O(d)$ per iteration. For the empirical risk $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$:

◮ At each step $k \in \mathbb{N}^*$, sample $I_k \sim \mathcal{U}\{1, \dots, n\}$, and use: $f'_{I_k}(\theta_{k-1}) = \ell'(y_{I_k}, \langle \theta_{k-1}, \Phi(x_{I_k}) \rangle)$

20

slide-62
SLIDE 62

SGD for ERM: $f = \hat{R}$

Loss for a single pair of observations, for any $j \le n$: $f_j(\theta) := \ell(y_j, \langle \theta, \Phi(x_j) \rangle)$. One observation at each step $\Longrightarrow$ complexity $O(d)$ per iteration. For the empirical risk $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$:

◮ At each step $k \in \mathbb{N}^*$, sample $I_k \sim \mathcal{U}\{1, \dots, n\}$, and use: $f'_{I_k}(\theta_{k-1}) = \ell'(y_{I_k}, \langle \theta_{k-1}, \Phi(x_{I_k}) \rangle)$

$$\mathbb{E}[f'_{I_k}(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \frac{1}{n} \sum_{i=1}^{n} \ell'\big(y_i, \langle \theta_{k-1}, \Phi(x_i) \rangle\big)$$

20

slide-63
SLIDE 63

SGD for ERM: $f = \hat{R}$

Loss for a single pair of observations, for any $j \le n$: $f_j(\theta) := \ell(y_j, \langle \theta, \Phi(x_j) \rangle)$. One observation at each step $\Longrightarrow$ complexity $O(d)$ per iteration. For the empirical risk $\hat{R}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \langle \theta, \Phi(x_i) \rangle)$:

◮ At each step $k \in \mathbb{N}^*$, sample $I_k \sim \mathcal{U}\{1, \dots, n\}$, and use: $f'_{I_k}(\theta_{k-1}) = \ell'(y_{I_k}, \langle \theta_{k-1}, \Phi(x_{I_k}) \rangle)$

$$\mathbb{E}[f'_{I_k}(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \frac{1}{n} \sum_{i=1}^{n} \ell'\big(y_i, \langle \theta_{k-1}, \Phi(x_i) \rangle\big) = \hat{R}'(\theta_{k-1}),$$

with $\mathcal{F}_k = \sigma\big((x_i, y_i)_{1 \le i \le n}, (I_i)_{1 \le i \le k}\big)$.
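A minimal sketch of this sampling scheme for the least-squares empirical risk, mirroring the update $\theta_k = \theta_{k-1} - \gamma_k f'_{I_k}(\theta_{k-1})$; the synthetic data and the step-size schedule are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
gamma0 = 0.02
for k in range(1, 50_001):
    i = rng.integers(n)                             # I_k ~ Uniform{1, ..., n}
    grad_i = (Phi[i] @ theta - y[i]) * Phi[i]       # f'_{I_k}(theta_{k-1}) for least squares
    theta -= (gamma0 / np.sqrt(k)) * grad_i         # SGD step with decaying step size

print("empirical risk after one run:", 0.5 * np.mean((Phi @ theta - y) ** 2))
```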

20

slide-64
SLIDE 64

Analysis: behaviour of $(\theta_n)_{n \ge 0}$

$$\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$$

Importance of the learning rate $(\gamma_k)_{k \ge 0}$. For a smooth and strongly convex problem, $\theta_k \to \theta_*$ a.s. if

$$\sum_{k=1}^{\infty} \gamma_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \gamma_k^2 < \infty.$$

21

slide-65
SLIDE 65

Analysis: behaviour of $(\theta_n)_{n \ge 0}$

$$\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$$

Importance of the learning rate $(\gamma_k)_{k \ge 0}$. For a smooth and strongly convex problem, $\theta_k \to \theta_*$ a.s. if

$$\sum_{k=1}^{\infty} \gamma_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \gamma_k^2 < \infty.$$

And asymptotic normality: $\sqrt{k}\,(\theta_k - \theta_*) \xrightarrow{d} \mathcal{N}(0, V)$, for $\gamma_k = \frac{\gamma_0}{k}$, $\gamma_0 \ge \frac{1}{\mu}$ (a toy illustration follows below).
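A toy one-dimensional illustration of the sensitivity of the $\gamma_k = \gamma_0/k$ schedule to the choice of $\gamma_0$ relative to $1/\mu$; the quadratic objective, noise level and constants are placeholders, not the talk's setting.

```python
import numpy as np

def sgd_run(mu=0.1, gamma0=1.0, sigma=1.0, n_steps=10_000, seed=0):
    """SGD on f(theta) = (mu/2) theta^2 with noisy gradients mu*theta + noise."""
    rng = np.random.default_rng(seed)
    theta = 5.0
    for k in range(1, n_steps + 1):
        grad = mu * theta + sigma * rng.normal()   # unbiased gradient estimate
        theta -= (gamma0 / k) * grad               # gamma_k = gamma0 / k
    return theta

# gamma0 >= 1/mu (here 1/mu = 10) behaves well; gamma0 = 1 converges very slowly.
print("gamma0 = 10:", sgd_run(gamma0=10.0))
print("gamma0 = 1: ", sgd_run(gamma0=1.0))
```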

21

slide-66
SLIDE 66

Analysis: behaviour of $(\theta_n)_{n \ge 0}$

$$\theta_k = \theta_{k-1} - \gamma_k f'_k(\theta_{k-1})$$

Importance of the learning rate $(\gamma_k)_{k \ge 0}$. For a smooth and strongly convex problem, $\theta_k \to \theta_*$ a.s. if

$$\sum_{k=1}^{\infty} \gamma_k = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \gamma_k^2 < \infty.$$

And asymptotic normality: $\sqrt{k}\,(\theta_k - \theta_*) \xrightarrow{d} \mathcal{N}(0, V)$, for $\gamma_k = \frac{\gamma_0}{k}$, $\gamma_0 \ge \frac{1}{\mu}$.

◮ Limit variance scales as $1/\mu^2$
◮ Very sensitive to ill-conditioned problems
◮ $\mu$ generally unknown...

21

slide-67
SLIDE 67

Polyak Ruppert averaging

Introduced by Polyak and Juditsky (1992) and Ruppert (1988):

$$\bar{\theta}_k = \frac{1}{k+1} \sum_{i=0}^{k} \theta_i.$$

◮ Offline averaging reduces the noise effect.

22

[Figure: SGD iterates $\theta_0, \theta_1, \dots, \theta_n$ around $\theta_*$.]

slide-68
SLIDE 68

Polyak Ruppert averaging

Introduced by Polyak and Juditsky (1992) and Ruppert (1988):

$$\bar{\theta}_k = \frac{1}{k+1} \sum_{i=0}^{k} \theta_i.$$

[Figure: raw SGD iterates $\theta_0, \theta_1, \dots, \theta_n$ and averaged iterates $\bar{\theta}_1, \bar{\theta}_2, \dots, \bar{\theta}_n$ around $\theta_*$.]

◮ Offline averaging reduces the noise effect.
◮ Online computing: $\bar{\theta}_{k+1} = \frac{1}{k+2}\, \theta_{k+1} + \frac{k+1}{k+2}\, \bar{\theta}_k.$
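A minimal sketch of SGD with this online running average for a least-squares toy problem (synthetic placeholder data, untuned constant step size); the average costs $O(d)$ extra per step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
theta_bar = theta.copy()
gamma = 0.01                                    # constant step size, illustrative only
for k in range(1, 20_001):
    i = rng.integers(n)
    grad = (Phi[i] @ theta - y[i]) * Phi[i]
    theta -= gamma * grad                       # raw SGD iterate
    theta_bar += (theta - theta_bar) / (k + 1)  # online Polyak-Ruppert average

print("risk (last iterate):", 0.5 * np.mean((Phi @ theta - y) ** 2))
print("risk (averaged):    ", 0.5 * np.mean((Phi @ theta_bar - y) ** 2))
```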

22

slide-69
SLIDE 69

Convex stochastic approximation: convergence

Known global minimax rates for non-smooth problems:

◮ Strongly convex: $O((\mu k)^{-1})$. Attained by averaged stochastic gradient descent with $\gamma_k \propto (\mu k)^{-1}$.
◮ Non-strongly convex: $O(k^{-1/2})$. Attained by averaged stochastic gradient descent with $\gamma_k \propto k^{-1/2}$.

23

slide-70
SLIDE 70

Convex stochastic approximation: convergence

Known global minimax rates for non-smooth problems:

◮ Strongly convex: $O((\mu k)^{-1})$. Attained by averaged stochastic gradient descent with $\gamma_k \propto (\mu k)^{-1}$.
◮ Non-strongly convex: $O(k^{-1/2})$. Attained by averaged stochastic gradient descent with $\gamma_k \propto k^{-1/2}$.

For smooth problems:

◮ Strongly convex: $O((\mu k)^{-1})$ for $\gamma_k \propto k^{-1/2}$: adapts to strong convexity.

23

slide-71
SLIDE 71

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)

24
slide-72
SLIDE 72

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

⊖ A gradient descent update costs n times as much as an SGD update. Can we get the best of both worlds?

24

slide-73
SLIDE 73

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

⊖ A gradient descent update costs n times as much as an SGD update.

24

slide-74
SLIDE 74

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

⊖ A gradient descent update costs n times as much as an SGD update. Can we get the best of both worlds?

24

slide-75
SLIDE 75

Methods for finite sum minimization

◮ GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$

25

slide-76
SLIDE 76

Methods for finite sum minimization

◮ GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$
◮ SGD: at step $k$, sample $i_k \sim \mathcal{U}[1; n]$, use $f'_{i_k}(\theta_k)$

25

slide-77
SLIDE 77

Methods for finite sum minimization

◮ GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$
◮ SGD: at step $k$, sample $i_k \sim \mathcal{U}[1; n]$, use $f'_{i_k}(\theta_k)$
◮ SAG: at step $k$,
  ◮ keep a "full gradient" $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_{k_i})$, with $\theta_{k_i} \in \{\theta_1, \dots, \theta_k\}$

25

slide-78
SLIDE 78

Methods for finite sum minimization

◮ GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$
◮ SGD: at step $k$, sample $i_k \sim \mathcal{U}[1; n]$, use $f'_{i_k}(\theta_k)$
◮ SAG: at step $k$,
  ◮ keep a "full gradient" $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_{k_i})$, with $\theta_{k_i} \in \{\theta_1, \dots, \theta_k\}$
  ◮ sample $i_k \sim \mathcal{U}[1; n]$, use

$$\frac{1}{n} \left( \sum_{i=1}^{n} f'_i(\theta_{k_i}) - f'_{i_k}(\theta_{k_{i_k}}) + f'_{i_k}(\theta_k) \right),$$

25

slide-79
SLIDE 79

Methods for finite sum minimization

◮ GD: at step $k$, use $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_k)$
◮ SGD: at step $k$, sample $i_k \sim \mathcal{U}[1; n]$, use $f'_{i_k}(\theta_k)$
◮ SAG: at step $k$,
  ◮ keep a "full gradient" $\frac{1}{n} \sum_{i=1}^{n} f'_i(\theta_{k_i})$, with $\theta_{k_i} \in \{\theta_1, \dots, \theta_k\}$
  ◮ sample $i_k \sim \mathcal{U}[1; n]$, use

$$\frac{1}{n} \left( \sum_{i=1}^{n} f'_i(\theta_{k_i}) - f'_{i_k}(\theta_{k_{i_k}}) + f'_{i_k}(\theta_k) \right),$$

⊕ update costs the same as SGD
⊖ needs to store all gradients $f'_i(\theta_{k_i})$ at "points in the past" (see the sketch after this list)

Some references:

◮ SAG Schmidt et al. (2013), SAGA Defazio et al. (2014a)
◮ SVRG Johnson and Zhang (2013) (reduces memory cost but 2 epochs...)
◮ FINITO Defazio et al. (2014b)
◮ S2GD Konečný and Richtárik (2013)... And many others... See for example Niao He's lecture notes for a nice overview.
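A compact sketch of a SAG-style update for least squares, following the generic recipe above rather than any paper's exact pseudocode; the data, the zero initialization of stored gradients, and the step-size heuristic are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 20
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_i(theta, i):
    return (Phi[i] @ theta - y[i]) * Phi[i]    # gradient of the i-th least-squares loss

theta = np.zeros(d)
stored = np.zeros((n, d))                      # g_i = f'_i(theta_{k_i}), one per observation
grad_sum = stored.sum(axis=0)                  # running sum of stored gradients
L_max = np.max(np.sum(Phi ** 2, axis=1))       # largest per-example smoothness constant
gamma = 1.0 / (16.0 * L_max)                   # conservative step size (assumption)

for k in range(50_000):
    i = rng.integers(n)
    new_g = grad_i(theta, i)
    grad_sum += new_g - stored[i]              # replace the old stored gradient in the sum
    stored[i] = new_g
    theta -= gamma * grad_sum / n              # step with the averaged estimate

print("empirical risk:", 0.5 * np.mean((Phi @ theta - y) ** 2))
```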

25

slide-80
SLIDE 80

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)

26
slide-81
SLIDE 81

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

[Figure: comparison of GD, SGD, and SAG (from Schmidt et al. (2013)).]

26

slide-82
SLIDE 82

Take home

Stochastic algorithms for Empirical Risk Minimization.

◮ Rates depend on the regularity of the function.
◮ Several algorithms to optimize the empirical risk; the most efficient ones are stochastic and rely on the finite-sum structure.

slide-83
SLIDE 83

Take home

Stochastic algorithms for Empirical Risk Minimization.

◮ Rates depend on the regularity of the function.
◮ Several algorithms to optimize the empirical risk; the most efficient ones are stochastic and rely on the finite-sum structure.
◮ Stochastic algorithms to optimize a deterministic function.

27

slide-84
SLIDE 84

What about generalization risk

Initial problem: generalization guarantees.

◮ Uniform upper bound $\sup_\theta \big|\hat{R}(\theta) - R(\theta)\big|$ (empirical process theory).
◮ More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

28

slide-85
SLIDE 85

What about generalization risk

Initial problem: generalization guarantees.

◮ Uniform upper bound $\sup_\theta \big|\hat{R}(\theta) - R(\theta)\big|$ (empirical process theory).
◮ More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

Problems for ERM:

◮ Choose the regularization (overfitting risk)
◮ How many iterations (i.e., passes on the data)?
◮ Generalization guarantees are generally of order $O(1/\sqrt{n})$, no need to be more precise

28

slide-86
SLIDE 86

What about generalization risk

Initial problem: generalization guarantees.

◮ Uniform upper bound $\sup_\theta \big|\hat{R}(\theta) - R(\theta)\big|$ (empirical process theory).
◮ More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

Problems for ERM:

◮ Choose the regularization (overfitting risk)
◮ How many iterations (i.e., passes on the data)?
◮ Generalization guarantees are generally of order $O(1/\sqrt{n})$, no need to be more precise

2 important insights:

1. No need to optimize below the statistical error,
2. Generalization risk is more important than empirical risk.

28

slide-87
SLIDE 87

What about generalization risk

Initial problem: generalization guarantees.

◮ Uniform upper bound $\sup_\theta \big|\hat{R}(\theta) - R(\theta)\big|$ (empirical process theory).
◮ More precise: localized complexities (Bartlett et al., 2002), stability (Bousquet and Elisseeff, 2002).

Problems for ERM:

◮ Choose the regularization (overfitting risk)
◮ How many iterations (i.e., passes on the data)?
◮ Generalization guarantees are generally of order $O(1/\sqrt{n})$, no need to be more precise

2 important insights:

1. No need to optimize below the statistical error,
2. Generalization risk is more important than empirical risk.

SGD can be used to minimize the generalization risk.

28

slide-88
SLIDE 88

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

29

slide-89
SLIDE 89

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

For the risk $R(\theta) = \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$:

◮ At step $0 < k \le n$, use a new point, independent of $\theta_{k-1}$: $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$

29

slide-90
SLIDE 90

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

For the risk $R(\theta) = \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$:

◮ At step $0 < k \le n$, use a new point, independent of $\theta_{k-1}$: $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$
◮ For $0 \le k \le n$, $\mathcal{F}_k = \sigma\big((x_i, y_i)_{1 \le i \le k}\big)$.

$$\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \mathbb{E}_\rho\big[\ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle) \mid \mathcal{F}_{k-1}\big]$$

29

slide-91
SLIDE 91

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

For the risk $R(\theta) = \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$:

◮ At step $0 < k \le n$, use a new point, independent of $\theta_{k-1}$: $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$
◮ For $0 \le k \le n$, $\mathcal{F}_k = \sigma\big((x_i, y_i)_{1 \le i \le k}\big)$.

$$\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \mathbb{E}_\rho\big[\ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle) \mid \mathcal{F}_{k-1}\big] = \mathbb{E}_\rho\big[\ell'(Y, \langle \theta_{k-1}, \Phi(X) \rangle)\big] = R'(\theta_{k-1})$$

29

slide-92
SLIDE 92

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

For the risk $R(\theta) = \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$:

◮ At step $0 < k \le n$, use a new point, independent of $\theta_{k-1}$: $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$
◮ For $0 \le k \le n$, $\mathcal{F}_k = \sigma\big((x_i, y_i)_{1 \le i \le k}\big)$.

$$\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \mathbb{E}_\rho\big[\ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle) \mid \mathcal{F}_{k-1}\big] = \mathbb{E}_\rho\big[\ell'(Y, \langle \theta_{k-1}, \Phi(X) \rangle)\big] = R'(\theta_{k-1})$$

◮ Single pass through the data, running time $= O(nd)$,
◮ "Automatic" regularization.
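A minimal sketch of this single-pass scheme: each observation is used exactly once, so the iterate only ever sees fresh, independent samples. The streaming least-squares data, the decaying step size and the averaging are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50_000, 20
theta_star = rng.normal(size=d)

theta = np.zeros(d)
theta_bar = theta.copy()
gamma0 = 0.02
for k in range(1, n + 1):
    x = rng.normal(size=d)                       # fresh sample (x_k, y_k) ~ rho
    y = x @ theta_star + 0.1 * rng.normal()
    grad = (x @ theta - y) * x                   # unbiased estimate of R'(theta)
    theta -= (gamma0 / np.sqrt(k)) * grad        # one SGD step per observation, single pass
    theta_bar += (theta - theta_bar) / (k + 1)   # Polyak-Ruppert average

print("||theta_bar - theta_star|| =", np.linalg.norm(theta_bar - theta_star))
```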

29

slide-93
SLIDE 93

SGD for the generalization risk: $f = R$

SGD: key assumption $\mathbb{E}[f'_n(\theta_{n-1}) \mid \mathcal{F}_{n-1}] = f'(\theta_{n-1})$.

For the risk $R(\theta) = \mathbb{E}_\rho\big[\ell(Y, \langle \theta, \Phi(X) \rangle)\big]$:

◮ At step $0 < k \le n$, use a new point, independent of $\theta_{k-1}$: $f'_k(\theta_{k-1}) = \ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle)$
◮ For $0 \le k \le n$, $\mathcal{F}_k = \sigma\big((x_i, y_i)_{1 \le i \le k}\big)$.

$$\mathbb{E}[f'_k(\theta_{k-1}) \mid \mathcal{F}_{k-1}] = \mathbb{E}_\rho\big[\ell'(y_k, \langle \theta_{k-1}, \Phi(x_k) \rangle) \mid \mathcal{F}_{k-1}\big] = \mathbb{E}_\rho\big[\ell'(Y, \langle \theta_{k-1}, \Phi(X) \rangle)\big] = R'(\theta_{k-1})$$

◮ Single pass through the data, running time $= O(nd)$,
◮ "Automatic" regularization.

29

slide-94
SLIDE 94

SGD for the generalization risk: f = R

ERM minimization: several passes ($0 \le k$); $(x_i, y_i)$ is $\mathcal{F}_t$-measurable for any $t$.

Generalization-risk minimization: one pass ($0 \le k \le n$); $(x_i, y_i)$ is $\mathcal{F}_t$-measurable only for $t \ge i$.

30

slide-95
SLIDE 95

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)

31
slide-96
SLIDE 96

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
             | 0 ≤ k       | 0 ≤ k       | 0 ≤ k                | 0 ≤ k ≤ n
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√k)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µk))

Lower bounds (annotated α, β, γ, δ on the original slide); δ: information-theoretic lower bound, statistical theory (Tsybakov, 2003). Gradient does not even exist.

31

slide-97
SLIDE 97

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
             | 0 ≤ k       | 0 ≤ k       | 0 ≤ k                | 0 ≤ k ≤ n
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√n)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µn))

31

slide-98
SLIDE 98

Convergence rate for $f(\tilde{\theta}_k) - f(\theta_*)$, smooth objective $f$.

             | min ˆR: SGD | min ˆR: GD  | min ˆR: SAG          | min R: SGD
             | 0 ≤ k       | 0 ≤ k       | 0 ≤ k                | 0 ≤ k ≤ n
Convex       | O(1/√k)     | O(1/k)      |                      | O(1/√n)
Stgly-Cvx    | O(1/(µk))   | O(e^(−µk))  | O((1 − (µ ∧ 1/n))^k) | O(1/(µn))

Gradient is unknown.

31

slide-99
SLIDE 99

Least Mean Squares: rate independent of µ

Least-squares: $R(\theta) = \frac{1}{2} \mathbb{E}\big[(Y - \langle \Phi(X), \theta \rangle)^2\big]$. Analysis for averaging and constant step-size $\gamma = 1/(4r^2)$ (Bach and Moulines, 2013):

◮ Assume $\|\Phi(x_n)\| \le r$ and $|y_n - \langle \Phi(x_n), \theta_* \rangle| \le \sigma$
◮ No assumption regarding the lowest eigenvalues of the Hessian

$$\mathbb{E}\, R(\bar{\theta}_n) - R(\theta_*) \le \frac{4\sigma^2 d}{n} + \frac{\|\theta_0 - \theta_*\|^2}{\gamma n}$$
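A minimal simulation of this setting: constant step size of order $1/(4r^2)$ with Polyak-Ruppert averaging on synthetic least-squares data. The constants (and the crude bound on $r^2$) are placeholders, not the theorem's exact conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 10
theta_star = rng.normal(size=d)
sigma = 0.5

r2 = d                                          # crude placeholder for E||Phi(X)||^2
gamma = 1.0 / (4.0 * r2)                        # constant "large" step size

theta = np.zeros(d)
theta_bar = theta.copy()
for k in range(1, n + 1):
    x = rng.normal(size=d)
    y = x @ theta_star + sigma * rng.normal()
    theta -= gamma * (x @ theta - y) * x        # constant-step least-mean-squares update
    theta_bar += (theta - theta_bar) / (k + 1)  # averaging

excess = 0.5 * np.sum((theta_bar - theta_star) ** 2)   # excess risk for identity covariance
bound = 4 * sigma**2 * d / n + np.sum(theta_star**2) / (gamma * n)
print(f"excess risk ~= {excess:.5f}, bound from the slide ~= {bound:.5f}")
```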

32

slide-100
SLIDE 100

Least Mean Squares: rate independent of µ

Least-squares: $R(\theta) = \frac{1}{2} \mathbb{E}\big[(Y - \langle \Phi(X), \theta \rangle)^2\big]$. Analysis for averaging and constant step-size $\gamma = 1/(4r^2)$ (Bach and Moulines, 2013):

◮ Assume $\|\Phi(x_n)\| \le r$ and $|y_n - \langle \Phi(x_n), \theta_* \rangle| \le \sigma$
◮ No assumption regarding the lowest eigenvalues of the Hessian

$$\mathbb{E}\, R(\bar{\theta}_n) - R(\theta_*) \le \frac{4\sigma^2 d}{n} + \frac{\|\theta_0 - \theta_*\|^2}{\gamma n}$$

◮ Matches the statistical lower bound (Tsybakov, 2003).
◮ Optimal rate with "large" step sizes.

32

slide-101
SLIDE 101

Take home

◮ SGD can be used to minimize the true risk directly
◮ Stochastic algorithm to minimize an unknown function

Stochastic approximation beyond least squares?

slide-102
SLIDE 102

Take home

◮ SGD can be used to minimize the true risk directly
◮ Stochastic algorithm to minimize an unknown function
◮ No regularization needed, only one pass

Stochastic approximation beyond least squares?

slide-103
SLIDE 103

Take home

◮ SGD can be used to minimize the true risk directly
◮ Stochastic algorithm to minimize an unknown function
◮ No regularization needed, only one pass
◮ For least squares, with a constant step size, optimal rate.

Stochastic approximation beyond least squares?

33

slide-104
SLIDE 104

Further references

Many stochastic algorithms are not covered in this talk (coordinate descent, online Newton, composite optimization, non-convex learning)...

◮ Good introduction: Francis Bach's lecture notes at Orsay
◮ Book: Convex Optimization: Algorithms and Complexity, Sébastien Bubeck

34

slide-105
SLIDE 105

Agarwal, A., Bartlett, P. L., Ravikumar, P., and Wainwright, M. J. (2012). Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249.

Agarwal, A. and Bottou, L. (2014). A lower bound for the optimization of finite sums. ArXiv e-prints.

Arjevani, Y. and Shamir, O. (2016). Dimension-free iteration complexity of finite sum optimization problems. In Advances in Neural Information Processing Systems 29, pages 3540–3548. Curran Associates, Inc.

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Advances in Neural Information Processing Systems (NIPS).

Bartlett, P. L., Bousquet, O., and Mendelson, S. (2002). Localized Rademacher complexities, pages 44–58. Springer Berlin Heidelberg, Berlin, Heidelberg.

Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526.

Defazio, A., Bach, F., and Lacoste-Julien, S. (2014a). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654.

Defazio, A., Domke, J., and Caetano, T. (2014b). Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1125–1133.

Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, pages 1327–1332.

34

slide-106
SLIDE 106

Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323.

Konečný, J. and Richtárik, P. (2013). Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666.

Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, John Wiley & Sons, Inc., New York.

Nesterov, Y. (2004). Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization. Springer.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855.

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.

Robbins, H. and Siegmund, D. (1985). A convergence theorem for non negative almost supermartingales and some applications. In Herbert Robbins Selected Papers, pages 111–135. Springer.

Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.

Schmidt, M., Le Roux, N., and Bach, F. (2013). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112.

Tsybakov, A. B. (2003). Optimal rates of aggregation. In Proceedings of the Annual Conference on Computational Learning Theory.

34