

1. Surrogate Losses for Online Learning of Stepsizes in Stochastic Non-Convex Optimization
Zhenxun Zhuang¹, Ashok Cutkosky², Francesco Orabona¹,³
¹ Department of Computer Science, Boston University; ² Google; ³ Department of Electrical & Computer Engineering, Boston University

2. Convex vs. Non-Convex Functions
[Figure: a convex function and a non-convex function.]
Stationary points: $\|\nabla f(x)\| = 0$

3. Gradient Descent vs. Stochastic Gradient Descent
Gradient Descent: $x_{t+1} = x_t - \eta_t \nabla f(x_t)$
SGD: $x_{t+1} = x_t - \eta_t\, g(x_t, \xi_t)$ with $\mathbb{E}_t[g(x_t, \xi_t)] = \nabla f(x_t)$
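A minimal sketch of the SGD update rule in Python, assuming a hypothetical oracle stoch_grad that returns an unbiased stochastic gradient (the names are illustrative, not from the slides):

    import numpy as np

    def sgd(x0, stoch_grad, stepsizes):
        """Plain SGD: x_{t+1} = x_t - eta_t * g(x_t, xi_t)."""
        x = np.asarray(x0, dtype=float)
        for eta in stepsizes:
            g = stoch_grad(x)   # unbiased estimate of the gradient of f at x
            x = x - eta * g     # descent step with stepsize eta_t
        return x

Gradient descent is recovered by passing an exact-gradient oracle in place of stoch_grad.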

4. Curse of Constant Stepsize
• Ghadimi & Lan (2013): running SGD on $M$-smooth functions with $\eta \le \frac{1}{M}$ and assuming $\mathbb{E}_t\big[\|g(x_t,\xi_t) - \nabla f(x_t)\|^2\big] \le \sigma^2$ yields
$\mathbb{E}\big[\|\nabla f(x_i)\|^2\big] \le O\!\left(\frac{f(x_1) - f^\star}{\eta T} + \eta\sigma^2\right).$
• Ward et al. (2018) and Li & Orabona (2019) removed the need to know $f^\star$ and $\sigma$ to obtain the optimal rate by using AdaGrad global stepsizes.
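As a hedged worked step (not on the slide): the bound is optimized over the constant stepsize by balancing its two terms, which is exactly where knowledge of $f^\star$ and $\sigma$ enters:

\[
\eta = \min\!\left(\frac{1}{M},\ \sqrt{\frac{f(x_1)-f^\star}{\sigma^2 T}}\right)
\quad\Longrightarrow\quad
\frac{f(x_1)-f^\star}{\eta T} + \eta\sigma^2
\;\le\; \frac{M\,(f(x_1)-f^\star)}{T} + 2\sigma\sqrt{\frac{f(x_1)-f^\star}{T}} .
\]

This is the $O(1/T + \sigma/\sqrt{T})$ rate that the adaptive methods above recover without knowing $f^\star$ or $\sigma$.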

5. Transform Non-Convexity to Convexity by Surrogate Losses
When the objective function is $M$-smooth and we draw two independent stochastic gradients in each round of SGD, we have (assume for now that $\eta_t$ depends only on past gradients):
\[
\mathbb{E}[f(x_{t+1}) - f(x_t)]
\le \mathbb{E}\Big[\langle \nabla f(x_t), x_{t+1} - x_t\rangle + \tfrac{M}{2}\|x_{t+1} - x_t\|^2\Big]
= \mathbb{E}\Big[\langle \nabla f(x_t), -\eta_t\, g(x_t,\xi_t)\rangle + \tfrac{M}{2}\eta_t^2\, \|g(x_t,\xi_t)\|^2\Big]
= \mathbb{E}\Big[-\eta_t\,\langle g(x_t,\xi_t), g(x_t,\xi'_t)\rangle + \tfrac{M}{2}\eta_t^2\, \|g(x_t,\xi_t)\|^2\Big].
\]
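The last equality replaces $\nabla f(x_t)$ with the second stochastic gradient; a brief justification (not spelled out on the slide) uses the tower rule together with the independence of $\xi'_t$ from $\xi_t$ and the past:

\[
\mathbb{E}\big[\langle \nabla f(x_t), -\eta_t\, g(x_t,\xi_t)\rangle\big]
= \mathbb{E}\big[\big\langle \mathbb{E}[\,g(x_t,\xi'_t)\mid x_t,\xi_t\,],\, -\eta_t\, g(x_t,\xi_t)\big\rangle\big]
= \mathbb{E}\big[-\eta_t\,\langle g(x_t,\xi'_t),\, g(x_t,\xi_t)\rangle\big],
\]

since $\eta_t$ depends only on past gradients and $\mathbb{E}[g(x_t,\xi'_t)\mid x_t] = \nabla f(x_t)$.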

6. Transform Non-Convexity to Convexity by Surrogate Losses
We define the surrogate loss for $f$ at round $t$ as
\[
\ell_t(\eta) \;\triangleq\; -\eta\,\langle g(x_t,\xi_t), g(x_t,\xi'_t)\rangle + \tfrac{M}{2}\eta^2\,\|g(x_t,\xi_t)\|^2 .
\]
The inequality of the last slide becomes $\mathbb{E}[f(x_{t+1}) - f(x_t)] \le \mathbb{E}[\ell_t(\eta_t)]$, which, after summing from $t = 1$ to $T$ and using $f(x_{T+1}) \ge f^\star$, gives us:
\[
f^\star - f(x_1)
\;\le\; \underbrace{\mathbb{E}\Big[\textstyle\sum_{t=1}^{T}\big(\ell_t(\eta_t) - \ell_t(\eta)\big)\Big]}_{\text{Regret of } \eta_t \text{ w.r.t. the optimal } \eta}
\;+\; \underbrace{\mathbb{E}\Big[\textstyle\sum_{t=1}^{T}\ell_t(\eta)\Big]}_{\text{Cumulative loss of the optimal } \eta} .
\]
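A minimal Python sketch of the surrogate loss as a function of $\eta$; the point is that $\ell_t$ is a convex quadratic in $\eta$ even though $f$ itself is non-convex (the function name is illustrative, not from the paper):

    import numpy as np

    def surrogate_loss(eta, g, g_prime, M):
        """ell_t(eta) = -eta * <g, g'> + (M/2) * eta^2 * ||g||^2."""
        return -eta * np.dot(g, g_prime) + 0.5 * M * eta**2 * np.dot(g, g)

Its unconstrained minimizer is $\eta = \langle g, g'\rangle / (M\|g\|^2)$, which is why an online learner fed these quadratics can track a good stepsize.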

7. SGD with Online Learning
Algorithm 1: Stochastic Gradient Descent with Online Learning (SGDOL)
1: Input: $x_1 \in X$, $M$, an online learning algorithm $\mathcal{A}$
2: for $t = 1, 2, \dots, T$ do
3:   Compute $\eta_t$ by running $\mathcal{A}$ on $\ell_i(\eta) = -\eta\,\langle g(x_i,\xi_i), g(x_i,\xi'_i)\rangle + \frac{M}{2}\eta^2\,\|g(x_i,\xi_i)\|^2$, $i = 1, \dots, t-1$
4:   Receive two independent unbiased estimates of $\nabla f(x_t)$: $g(x_t,\xi_t)$, $g(x_t,\xi'_t)$
5:   Update $x_{t+1} = x_t - \eta_t\, g(x_t,\xi_t)$
6: end for
7: Output: choose $x_k$ uniformly at random from $x_1, \dots, x_T$
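A sketch of Algorithm 1 in Python, with follow-the-leader on the quadratic surrogate losses standing in for the abstract online learner $\mathcal{A}$; this choice, the clipping range, and the oracle two_grads are assumptions made for illustration, not necessarily the learner used in the paper:

    import numpy as np

    def sgdol(x0, two_grads, M, T, eta_max=None, rng=None):
        """SGDOL sketch: learn eta_t online from the surrogate losses ell_1..ell_{t-1}."""
        rng = rng or np.random.default_rng()
        x = np.asarray(x0, dtype=float)
        iterates = []
        sum_corr, sum_sqnorm = 0.0, 0.0       # running sums of <g_i, g'_i> and ||g_i||^2
        eta_max = 2.0 / M if eta_max is None else eta_max
        for t in range(T):
            iterates.append(x.copy())         # record x_t; the output is drawn from x_1..x_T
            # Follow-the-leader: minimizer of sum_i ell_i(eta) over [0, eta_max]
            eta = sum_corr / (M * sum_sqnorm) if sum_sqnorm > 0 else 0.0
            eta = float(np.clip(eta, 0.0, eta_max))
            g, g_prime = two_grads(x)         # two independent unbiased estimates of grad f(x_t)
            x = x - eta * g                   # SGD step with the learned stepsize
            sum_corr += np.dot(g, g_prime)
            sum_sqnorm += np.dot(g, g)
        return iterates[rng.integers(T)]      # uniformly random iterate, as in step 7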

8. Main Theorem
Theorem 1: Under suitable assumptions and with an appropriate choice of the online learning algorithm in Algorithm 1, for a smooth function and a uniformly randomly picked $x_k$ from $x_1, \dots, x_T$, we have
\[
\mathbb{E}\big[\|\nabla f(x_k)\|^2\big] \;\le\; \tilde{O}\!\left(\frac{1}{T} + \frac{\sigma}{\sqrt{T}}\right),
\]
where $\tilde{O}$ hides logarithmic factors.

9. Classification Problem
Objective function: $\frac{1}{m}\sum_{i=1}^{m} \phi(a_i^\top x - y_i)$ with $\phi(\theta) = \frac{\theta^2}{1+\theta^2}$, on the adult (a9a) training dataset.
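A small Python sketch of this non-convex objective and its gradient; the feature matrix A (rows $a_i$) and labels y are assumed to be loaded from the a9a dataset by the caller:

    import numpy as np

    def phi(theta):
        """Non-convex loss phi(theta) = theta^2 / (1 + theta^2)."""
        return theta**2 / (1.0 + theta**2)

    def objective(x, A, y):
        """(1/m) * sum_i phi(a_i^T x - y_i)."""
        r = A @ x - y
        return np.mean(phi(r))

    def gradient(x, A, y):
        """(1/m) * sum_i phi'(a_i^T x - y_i) * a_i, with phi'(t) = 2 t / (1 + t^2)^2."""
        r = A @ x - y
        w = 2.0 * r / (1.0 + r**2) ** 2
        return A.T @ w / len(y)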

