Subgradient method


  1. Subgradient method
     Geoff Gordon & Ryan Tibshirani, Optimization 10-725 / 36-725

  2. Remember gradient descent
     We want to solve
         min_{x ∈ R^n} f(x)
     for f convex and differentiable. Gradient descent: choose initial x^(0) ∈ R^n, repeat:
         x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),   k = 1, 2, 3, ...
     If ∇f is Lipschitz, gradient descent has convergence rate O(1/k)
     Downsides:
     • Can be slow ← later
     • Doesn't work for nondifferentiable functions ← today
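For concreteness, here is a minimal Python sketch of the gradient descent update above; the function and variable names, the fixed step size, and the example data are illustrative, not from the slides.

    import numpy as np

    def gradient_descent(grad, x0, step, n_iter=100):
        """Minimal sketch of gradient descent with a fixed step size.

        grad(x) should return the gradient of a differentiable convex f at x.
        """
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            x = x - step * grad(x)   # x^(k) = x^(k-1) - t * grad f(x^(k-1))
        return x

    # Example: minimize f(x) = (1/2)||x - c||^2, whose gradient is x - c.
    c = np.array([1.0, -2.0])
    x_hat = gradient_descent(lambda x: x - c, x0=np.zeros(2), step=0.1, n_iter=200)
    print(x_hat)  # close to c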

  3. Outline
     Today:
     • Subgradients
     • Examples and properties
     • Subgradient method
     • Convergence rate

  4. Subgradients
     Remember that for convex f : R^n → R,
         f(y) ≥ f(x) + ∇f(x)^T (y − x)   for all x, y
     I.e., the linear approximation always underestimates f
     A subgradient of convex f : R^n → R at x is any g ∈ R^n such that
         f(y) ≥ f(x) + g^T (y − x)   for all y
     • Always exists
     • If f is differentiable at x, then g = ∇f(x) uniquely
     • Actually, the same definition works for nonconvex f (however, a subgradient need not exist)

  5. Examples
     Consider f : R → R, f(x) = |x|
     (Plot of f(x) = |x| over x ∈ [−2, 2])
     • For x ≠ 0, unique subgradient g = sign(x)
     • For x = 0, subgradient g is any element of [−1, 1]

  6. Consider f : R^n → R, f(x) = ‖x‖ (Euclidean norm)
     (Surface plot of f over (x_1, x_2))
     • For x ≠ 0, unique subgradient g = x/‖x‖
     • For x = 0, subgradient g is any element of {z : ‖z‖ ≤ 1}

  7. Consider f : R^n → R, f(x) = ‖x‖_1
     (Surface plot of f over (x_1, x_2))
     • For x_i ≠ 0, unique ith component g_i = sign(x_i)
     • For x_i = 0, ith component g_i is any element of [−1, 1]
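Since the slides compute ℓ1 subgradients componentwise, here is a small sketch of one valid subgradient of ‖x‖_1 in Python; the function name and the choice of 0 at the zero coordinates are illustrative assumptions (any value in [−1, 1] would do there).

    import numpy as np

    def l1_subgradient(x, tie_value=0.0):
        """Return one valid subgradient of f(x) = ||x||_1 at x.

        For x_i != 0 the i-th component must be sign(x_i); for x_i == 0 any
        value in [-1, 1] is valid, and tie_value picks one of them.
        """
        g = np.sign(x).astype(float)
        g[x == 0] = tie_value
        return g

    print(l1_subgradient(np.array([1.5, 0.0, -0.3])))  # [ 1.  0. -1.]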

  8. Let f_1, f_2 : R^n → R be convex and differentiable, and consider f(x) = max{f_1(x), f_2(x)}
     (Plot of two crossing convex curves and their pointwise maximum)
     • For f_1(x) > f_2(x), unique subgradient g = ∇f_1(x)
     • For f_2(x) > f_1(x), unique subgradient g = ∇f_2(x)
     • For f_1(x) = f_2(x), subgradient g is any point on the line segment between ∇f_1(x) and ∇f_2(x)

  9. Subdifferential
     The set of all subgradients of convex f is called the subdifferential:
         ∂f(x) = {g ∈ R^n : g is a subgradient of f at x}
     • ∂f(x) is closed and convex (even for nonconvex f)
     • Nonempty for convex f (can be empty for nonconvex f)
     • If f is differentiable at x, then ∂f(x) = {∇f(x)}
     • If ∂f(x) = {g}, then f is differentiable at x and ∇f(x) = g

  10. Connection to convex geometry
     Convex set C ⊆ R^n, consider the indicator function I_C : R^n → R,
         I_C(x) = I{x ∈ C} = 0 if x ∈ C,  ∞ if x ∉ C
     For x ∈ C, ∂I_C(x) = N_C(x), the normal cone of C at x,
         N_C(x) = {g ∈ R^n : g^T x ≥ g^T y for any y ∈ C}
     Why? Recall the definition of subgradient g,
         I_C(y) ≥ I_C(x) + g^T (y − x)   for all y
     • For y ∉ C, I_C(y) = ∞ and the inequality holds trivially
     • For y ∈ C, this means 0 ≥ g^T (y − x)

  11. ● ● ● ● 11

  12. Subgradient calculus
     Basic rules for convex functions:
     • Scaling: ∂(a f) = a · ∂f, provided a > 0
     • Addition: ∂(f_1 + f_2) = ∂f_1 + ∂f_2
     • Affine composition: if g(x) = f(Ax + b), then ∂g(x) = A^T ∂f(Ax + b)
     • Finite pointwise maximum: if f(x) = max_{i=1,...,m} f_i(x), then
           ∂f(x) = conv( ∪_{i : f_i(x) = f(x)} ∂f_i(x) ),
       the convex hull of the union of subdifferentials of all active functions at x
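To illustrate the rules, here is a sketch that combines affine composition, scaling, and addition to produce one subgradient of h(x) = ‖Ax − b‖_1 + λ‖x‖_1; the function names and data are hypothetical, not from the slides.

    import numpy as np

    def l1_subgrad(z):
        # One element of the subdifferential of ||z||_1 (0 chosen at zeros).
        return np.sign(z).astype(float)

    def composite_subgrad(A, b, lam, x):
        """One subgradient of h(x) = ||Ax - b||_1 + lam * ||x||_1, built from
        the calculus rules:
          affine composition:  A^T (subgradient of ||.||_1 at Ax - b)
          scaling + addition:  + lam * (subgradient of ||.||_1 at x)
        """
        return A.T @ l1_subgrad(A @ x - b) + lam * l1_subgrad(x)

    A = np.array([[1.0, 2.0], [0.0, -1.0]])
    b = np.array([1.0, 0.5])
    print(composite_subgrad(A, b, lam=0.1, x=np.array([0.3, 0.0])))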

  13. • General pointwise maximum: if f(x) = max_{s ∈ S} f_s(x), then
           ∂f(x) ⊇ cl( conv( ∪_{s : f_s(x) = f(x)} ∂f_s(x) ) ),
       and under some regularity conditions (on S, f_s) we get equality
     • Norms: important special case, f(x) = ‖x‖_p. Let q be such that 1/p + 1/q = 1; then
           ∂f(x) = {y : ‖y‖_q ≤ 1 and y^T x = max_{‖z‖_q ≤ 1} z^T x}
       Why is this a special case? Note that ‖x‖_p = max_{‖z‖_q ≤ 1} z^T x

  14. Why subgradients?
     Subgradients are important for two reasons:
     • Convex analysis: optimality characterization via subgradients, monotonicity, relationship to duality
     • Convex optimization: if you can compute subgradients, then you can minimize (almost) any convex function

  15. Optimality condition
     For convex f,
         f(x⋆) = min_{x ∈ R^n} f(x)   ⇔   0 ∈ ∂f(x⋆)
     I.e., x⋆ is a minimizer if and only if 0 is a subgradient of f at x⋆
     Why? Easy: g = 0 being a subgradient means that for all y,
         f(y) ≥ f(x⋆) + 0^T (y − x⋆) = f(x⋆)
     Note the analogy to the differentiable case, where ∂f(x) = {∇f(x)} and the condition becomes ∇f(x⋆) = 0

  16. Soft-thresholding
     The lasso problem can be parametrized as
         min_x  (1/2) ‖y − Ax‖^2 + λ ‖x‖_1,
     where λ ≥ 0. Consider the simplified problem with A = I:
         min_x  (1/2) ‖y − x‖^2 + λ ‖x‖_1
     Claim: the solution of the simple problem is x⋆ = S_λ(y), where S_λ is the soft-thresholding operator:
         [S_λ(y)]_i = y_i − λ   if y_i > λ
                      0         if −λ ≤ y_i ≤ λ
                      y_i + λ   if y_i < −λ

  17. Why? Subgradients of f(x) = (1/2) ‖y − x‖^2 + λ ‖x‖_1 are
         g = x − y + λ s,
     where s_i = sign(x_i) if x_i ≠ 0 and s_i ∈ [−1, 1] if x_i = 0
     Now just plug in x = S_λ(y) and check that we can get g = 0
     (Plot: soft-thresholding in one variable)
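A small Python sketch of the soft-thresholding operator together with a numerical check that 0 is a subgradient at x = S_λ(y); the function names and test data are illustrative assumptions.

    import numpy as np

    def soft_threshold(y, lam):
        """Soft-thresholding operator S_lambda, applied elementwise."""
        return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

    def has_zero_subgradient(x, y, lam, tol=1e-12):
        """Check 0 is in the subdifferential of (1/2)||y - x||^2 + lam*||x||_1 at x.

        Subgradients have the form g = x - y + lam * s, with s_i = sign(x_i)
        when x_i != 0 and s_i free in [-1, 1] when x_i == 0.
        """
        r = y - x  # need lam * s = y - x for some valid s
        ok_nonzero = np.all(np.abs(r[x != 0] - lam * np.sign(x[x != 0])) <= tol)
        ok_zero = np.all(np.abs(r[x == 0]) <= lam + tol)
        return bool(ok_nonzero and ok_zero)

    y = np.array([2.0, 0.3, -1.5])
    x_star = soft_threshold(y, lam=0.5)
    print(x_star)                                # [ 1.5  0.  -1. ]
    print(has_zero_subgradient(x_star, y, 0.5))  # True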

  18. Subgradient method
     Given convex f : R^n → R, not necessarily differentiable
     Subgradient method: just like gradient descent, but replacing gradients with subgradients. I.e., initialize x^(0), then repeat
         x^(k) = x^(k−1) − t_k · g^(k−1),   k = 1, 2, 3, ...,
     where g^(k−1) is any subgradient of f at x^(k−1)
     The subgradient method is not necessarily a descent method, so we keep track of the best iterate x^(k)_best among x^(1), ..., x^(k) so far, i.e.,
         f(x^(k)_best) = min_{i=1,...,k} f(x^(i))
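A minimal Python sketch of the method above, including the best-iterate bookkeeping; the solver interface, the step-size schedule, and the ℓ1 test problem are illustrative choices, not prescribed by the slides.

    import numpy as np

    def subgradient_method(subgrad, f, x0, step, n_iter=1000):
        """Subgradient method with best-iterate tracking (sketch).

        subgrad(x) returns any subgradient of f at x, f(x) the objective value,
        and step(k) the step size t_k for iteration k = 1, 2, 3, ...
        """
        x = np.asarray(x0, dtype=float)
        x_best, f_best = x.copy(), f(x)
        for k in range(1, n_iter + 1):
            x = x - step(k) * subgrad(x)   # not necessarily a descent step
            fx = f(x)
            if fx < f_best:                # keep the best iterate seen so far
                x_best, f_best = x.copy(), fx
        return x_best, f_best

    # Example: minimize f(x) = ||x - c||_1, whose minimum is at x = c.
    c = np.array([1.0, -2.0, 0.5])
    x_best, f_best = subgradient_method(
        subgrad=lambda x: np.sign(x - c),
        f=lambda x: np.sum(np.abs(x - c)),
        x0=np.zeros(3),
        step=lambda k: 1.0 / k,            # diminishing step sizes
        n_iter=5000)
    print(x_best, f_best)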

  19. Step size choices
     • Fixed step size: t_k = t for all k = 1, 2, 3, ...
     • Diminishing step size: choose t_k to satisfy
           Σ_{k=1}^∞ t_k^2 < ∞,   Σ_{k=1}^∞ t_k = ∞,
       i.e., square summable but not summable. Important that step sizes go to zero, but not too fast
     There are other options too, but an important difference to gradient descent: all step size options are pre-specified, not adaptively computed
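For concreteness, the two choices above as Python callables that could be passed to a solver like the subgradient_method sketch earlier (the constant 0.01 and the 1/k schedule are arbitrary illustrations, not values from the slides).

    fixed_step = lambda k, t=0.01: t        # fixed: t_k = t for all k
    diminishing_step = lambda k: 1.0 / k    # square summable but not summable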

  20. Convergence analysis
     Assume that f : R^n → R is convex, and also:
     • f is Lipschitz continuous with constant G > 0,
           |f(x) − f(y)| ≤ G ‖x − y‖   for all x, y
       Equivalently: ‖g‖ ≤ G for any subgradient of f at any x
     • ‖x^(1) − x⋆‖ ≤ R (equivalently, ‖x^(0) − x⋆‖ is bounded)
     Theorem: For a fixed step size t, the subgradient method satisfies
         lim_{k→∞} f(x^(k)_best) ≤ f(x⋆) + G^2 t / 2
     Theorem: For diminishing step sizes, the subgradient method satisfies
         lim_{k→∞} f(x^(k)_best) = f(x⋆)

  21. Basic inequality
     Can prove both results from the same basic inequality. Key steps:
     • Using the definition of subgradient,
           ‖x^(k+1) − x⋆‖^2 ≤ ‖x^(k) − x⋆‖^2 − 2 t_k (f(x^(k)) − f(x⋆)) + t_k^2 ‖g^(k)‖^2
     • Iterating the last inequality,
           ‖x^(k+1) − x⋆‖^2 ≤ ‖x^(1) − x⋆‖^2 − 2 Σ_{i=1}^k t_i (f(x^(i)) − f(x⋆)) + Σ_{i=1}^k t_i^2 ‖g^(i)‖^2
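The first bullet follows by expanding the update x^(k+1) = x^(k) − t_k g^(k); a short derivation, filled in here in the slide's notation:

    \begin{aligned}
    \|x^{(k+1)} - x^\star\|^2
      &= \|x^{(k)} - t_k g^{(k)} - x^\star\|^2 \\
      &= \|x^{(k)} - x^\star\|^2 - 2 t_k (g^{(k)})^T (x^{(k)} - x^\star) + t_k^2 \|g^{(k)}\|^2 \\
      &\le \|x^{(k)} - x^\star\|^2 - 2 t_k \bigl(f(x^{(k)}) - f(x^\star)\bigr) + t_k^2 \|g^{(k)}\|^2,
    \end{aligned}

where the last step uses the subgradient inequality f(x⋆) ≥ f(x^(k)) + (g^(k))^T (x⋆ − x^(k)).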

  22. • Using ‖x^(k+1) − x⋆‖ ≥ 0 and ‖x^(1) − x⋆‖ ≤ R,
           2 Σ_{i=1}^k t_i (f(x^(i)) − f(x⋆)) ≤ R^2 + Σ_{i=1}^k t_i^2 ‖g^(i)‖^2
     • Introducing f(x^(k)_best),
           2 Σ_{i=1}^k t_i (f(x^(i)) − f(x⋆)) ≥ 2 (f(x^(k)_best) − f(x⋆)) Σ_{i=1}^k t_i
     • Plugging this in and using ‖g^(i)‖ ≤ G,
           f(x^(k)_best) − f(x⋆) ≤ (R^2 + G^2 Σ_{i=1}^k t_i^2) / (2 Σ_{i=1}^k t_i)

  23. Convergence proofs
     For constant step size t, the basic bound is
         (R^2 + G^2 t^2 k) / (2 t k) → G^2 t / 2   as k → ∞
     For diminishing step sizes t_k satisfying
         Σ_{i=1}^∞ t_i^2 < ∞,   Σ_{i=1}^∞ t_i = ∞,
     we get
         (R^2 + G^2 Σ_{i=1}^k t_i^2) / (2 Σ_{i=1}^k t_i) → 0   as k → ∞

  24. Convergence rate
     After k iterations, what is the complexity of the error f(x^(k)_best) − f(x⋆)?
     Consider taking t_i = R/(G√k) for all i = 1, ..., k. Then the basic bound is
         (R^2 + G^2 Σ_{i=1}^k t_i^2) / (2 Σ_{i=1}^k t_i) = RG/√k
     Can show this choice is the best we can do (i.e., it minimizes the bound)
     I.e., the subgradient method has convergence rate O(1/√k)
     I.e., to get f(x^(k)_best) − f(x⋆) ≤ ε, we need O(1/ε^2) iterations
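To see why t_i = R/(G√k) minimizes the bound, set all t_i = t in the basic bound and optimize over t (a short check, filled in here):

    \frac{R^2 + G^2 k t^2}{2 k t} = \frac{R^2}{2 k t} + \frac{G^2 t}{2},
    \qquad
    \frac{d}{dt}\left( \frac{R^2}{2 k t} + \frac{G^2 t}{2} \right)
      = -\frac{R^2}{2 k t^2} + \frac{G^2}{2} = 0
    \;\Longrightarrow\; t = \frac{R}{G \sqrt{k}},

and substituting back gives RG/(2√k) + RG/(2√k) = RG/√k.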

  25. Intersection of sets
     Example from Boyd's lecture notes: suppose we want to find x⋆ ∈ C_1 ∩ ... ∩ C_m, i.e., find a point in the intersection of closed, convex sets C_1, ..., C_m
     First define
         f(x) = max_{i=1,...,m} dist(x, C_i),
     and now solve
         min_{x ∈ R^n} f(x)
     Note that f(x⋆) = 0 ⇒ x⋆ ∈ C_1 ∩ ... ∩ C_m
     Recall the distance to a set C, dist(x, C) = min{‖x − u‖ : u ∈ C}

  26. For closed, convex C, there is a unique point minimizing ‖x − u‖ over u ∈ C, denoted u⋆ = P_C(x), so dist(x, C) = ‖x − P_C(x)‖
     (Figure: a point x and its projection P_C(x) onto C)
     Let f_i(x) = dist(x, C_i) for each i. Then f(x) = max_{i=1,...,m} f_i(x), and
     • For each i and x ∉ C_i,
           ∇f_i(x) = (x − P_{C_i}(x)) / ‖x − P_{C_i}(x)‖
     • If f(x) = f_i(x) ≠ 0, then
           (x − P_{C_i}(x)) / ‖x − P_{C_i}(x)‖ ∈ ∂f(x)
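A Python sketch of the whole example: two convex sets with closed-form projections, the subgradient of f(x) = max_i dist(x, C_i) exactly as on this slide, and a few subgradient iterations. The specific sets, the step choice t_k = f(x^(k)) (which here amounts to projecting onto the farthest set), and all names are illustrative assumptions, not from the slides.

    import numpy as np

    # Illustrative sets in R^2: C1 = unit ball, C2 = halfspace {x : a^T x >= 1}.
    def proj_ball(x, radius=1.0):
        nx = np.linalg.norm(x)
        return x if nx <= radius else radius * x / nx

    def proj_halfspace(x, a=np.array([1.0, 1.0]), c=1.0):
        viol = c - a @ x
        return x if viol <= 0 else x + viol * a / (a @ a)

    projections = [proj_ball, proj_halfspace]

    def dist_and_subgrad(x):
        """f(x) = max_i dist(x, C_i) and one subgradient,
        (x - P_Ci(x)) / ||x - P_Ci(x)|| for a farthest set C_i."""
        dists = [np.linalg.norm(x - P(x)) for P in projections]
        i = int(np.argmax(dists))
        if dists[i] == 0.0:
            return 0.0, np.zeros_like(x)   # already in the intersection
        return dists[i], (x - projections[i](x)) / dists[i]

    x = np.array([3.0, -2.0])
    for k in range(100):
        fx, g = dist_and_subgrad(x)
        if fx < 1e-10:
            break
        x = x - fx * g                      # step t_k = f(x^(k))
    print(x, [float(np.linalg.norm(x - P(x))) for P in projections])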
