Subgradient method
Geoff Gordon & Ryan Tibshirani
Optimization 10-725 / 36-725

Remember gradient descent
We want to solve
$$\min_{x \in \mathbb{R}^n} f(x),$$
for $f$ convex and differentiable
Gradient descent: choose initial $x^{(0)} \in \mathbb{R}^n$, repeat:
$$x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
If $\nabla f$ is Lipschitz, gradient descent has convergence rate $O(1/k)$
Downsides:
• Can be slow ← later
• Doesn't work for nondifferentiable functions ← today
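
The update rule above can be sketched in a few lines. This is only an illustration: the quadratic test function, the fixed step size 0.5, and the names `gradient_descent`, `grad_f`, and `c` are assumptions made for the example, not part of the lecture.

```python
def gradient_descent(grad, x0, step, num_iters):
    """Run x(k) = x(k-1) - t * grad(x(k-1)) with a fixed step size t."""
    x = list(x0)
    for _ in range(num_iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical test problem: f(x) = 0.5 * ||x - c||^2, whose gradient is x - c.
c = [1.0, -2.0]
grad_f = lambda x: [xi - ci for xi, ci in zip(x, c)]

x_final = gradient_descent(grad_f, x0=[0.0, 0.0], step=0.5, num_iters=50)
print(x_final)  # close to c, the minimizer of f
```

With this step size the distance to the minimizer halves at every iteration, so 50 iterations are more than enough here.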

Outline
Today:
• Subgradients
• Examples and properties
• Subgradient method
• Convergence rate

Subgradients
Remember that for convex and differentiable $f : \mathbb{R}^n \to \mathbb{R}$,
$$f(y) \geq f(x) + \nabla f(x)^T (y - x) \quad \text{all } x, y$$
I.e., linear approximation always underestimates $f$
A subgradient of convex $f : \mathbb{R}^n \to \mathbb{R}$ at $x$ is any $g \in \mathbb{R}^n$ such that
$$f(y) \geq f(x) + g^T (y - x), \quad \text{all } y$$
• Always exists
• If $f$ differentiable at $x$, then $g = \nabla f(x)$ uniquely
• Actually, same definition works for nonconvex $f$ (however, subgradient need not exist)

Examples
Consider $f : \mathbb{R} \to \mathbb{R}$, $f(x) = |x|$
[Figure: plot of $f(x) = |x|$ against $x$]
• For $x \neq 0$, unique subgradient $g = \mathrm{sign}(x)$
• For $x = 0$, subgradient $g$ is any element of $[-1, 1]$

Consider $f : \mathbb{R}^n \to \mathbb{R}$, $f(x) = \|x\|$ (Euclidean norm)
[Figure: surface plot of $f(x)$ over $x_1$, $x_2$]
• For $x \neq 0$, unique subgradient $g = x / \|x\|$
• For $x = 0$, subgradient $g$ is any element of $\{ z : \|z\| \leq 1 \}$

Consider $f : \mathbb{R}^n \to \mathbb{R}$, $f(x) = \|x\|_1$
[Figure: surface plot of $f(x)$ over $x_1$, $x_2$]
• For $x_i \neq 0$, unique $i$th component $g_i = \mathrm{sign}(x_i)$
• For $x_i = 0$, $i$th component $g_i$ is an element of $[-1, 1]$
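
As a quick numerical sanity check of the $\ell_1$ example, the sketch below picks one subgradient (choosing $g_i = 0$ where $x_i = 0$, which is one allowed value in $[-1, 1]$) and tests the subgradient inequality at random points. The helper names, the test point, and the sampling range are assumptions made for illustration.

```python
import random

def l1_norm(x):
    return sum(abs(xi) for xi in x)

def l1_subgradient(x):
    """One valid subgradient of ||x||_1: sign(x_i), taking 0 where x_i == 0."""
    return [(xi > 0) - (xi < 0) for xi in x]

def satisfies_subgradient_inequality(f, x, g, y, tol=1e-12):
    """Check f(y) >= f(x) + g^T (y - x) at a single point y."""
    lin = f(x) + sum(gi * (yi - xi) for gi, xi, yi in zip(g, x, y))
    return f(y) >= lin - tol

random.seed(0)
x = [1.5, 0.0, -2.0]
g = l1_subgradient(x)
ok = all(
    satisfies_subgradient_inequality(
        l1_norm, x, g, [random.uniform(-5, 5) for _ in x]
    )
    for _ in range(1000)
)
print(g, ok)
```

A vector whose component at a zero coordinate falls outside $[-1, 1]$, such as `[1, 2, -1]` for this `x`, fails the same check at `y = [1.5, 1.0, -2.0]`.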

Let $f_1, f_2 : \mathbb{R}^n \to \mathbb{R}$ be convex, differentiable, and consider $f(x) = \max \{ f_1(x), f_2(x) \}$
[Figure: plot of $f(x)$ against $x$]
• For $f_1(x) > f_2(x)$, unique subgradient $g = \nabla f_1(x)$
• For $f_2(x) > f_1(x)$, unique subgradient $g = \nabla f_2(x)$
• For $f_1(x) = f_2(x)$, subgradient $g$ is any point on the line segment between $\nabla f_1(x)$ and $\nabla f_2(x)$
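
The three cases can be written out for a concrete pair of functions. The choice $f_1(x) = (x+1)^2$, $f_2(x) = (x-1)^2$ and the parameter `theta` (which selects a point on the segment between the two gradients) are assumptions of this sketch, not taken from the slides.

```python
def f1(x): return (x + 1.0) ** 2
def f2(x): return (x - 1.0) ** 2
def df1(x): return 2.0 * (x + 1.0)
def df2(x): return 2.0 * (x - 1.0)

def f(x):
    return max(f1(x), f2(x))

def max_subgradient(x, theta=0.5):
    """Return one subgradient of f = max(f1, f2) at x.

    theta in [0, 1] picks a point on the segment between the two gradients
    and is only used when both functions are active (f1(x) == f2(x)).
    """
    if f1(x) > f2(x):
        return df1(x)
    if f2(x) > f1(x):
        return df2(x)
    return theta * df1(x) + (1.0 - theta) * df2(x)

# At x = 0 both functions are active, so every theta gives a valid subgradient.
ys = [i / 10.0 for i in range(-30, 31)]
for theta in (0.0, 0.25, 0.5, 1.0):
    g = max_subgradient(0.0, theta)
    print(theta, g, all(f(y) >= f(0.0) + g * (y - 0.0) - 1e-12 for y in ys))
```

With `theta = 0.5` the subgradient at $x = 0$ is $0$, which is consistent with $x = 0$ minimizing this particular $f$.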

Subdifferential
Set of all subgradients of convex $f$ is called the subdifferential:
$$\partial f(x) = \{ g \in \mathbb{R}^n : g \text{ is a subgradient of } f \text{ at } x \}$$
• $\partial f(x)$ is closed and convex (even for nonconvex $f$)
• Nonempty (can be empty for nonconvex $f$)
• If $f$ is differentiable at $x$, then $\partial f(x) = \{ \nabla f(x) \}$
• If $\partial f(x) = \{ g \}$, then $f$ is differentiable at $x$ and $\nabla f(x) = g$

Connection to convex geometry
Convex set $C \subseteq \mathbb{R}^n$, consider indicator function $I_C : \mathbb{R}^n \to \mathbb{R}$,
$$I_C(x) = I\{x \in C\} = \begin{cases} 0 & \text{if } x \in C \\ \infty & \text{if } x \notin C \end{cases}$$
For $x \in C$, $\partial I_C(x) = N_C(x)$, the normal cone of $C$ at $x$,
$$N_C(x) = \{ g \in \mathbb{R}^n : g^T x \geq g^T y \text{ for any } y \in C \}$$
Why? Recall definition of subgradient $g$,
$$I_C(y) \geq I_C(x) + g^T (y - x) \quad \text{for all } y$$
• For $y \notin C$, $I_C(y) = \infty$, so the inequality holds trivially
• For $y \in C$, this means $0 \geq g^T (y - x)$
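
As a small worked example (the interval is chosen here for illustration and is not from the slides), take $C = [0, 1] \subset \mathbb{R}$ and apply the definition of $N_C(x)$ directly:

```latex
N_C(x) =
\begin{cases}
(-\infty, 0] & \text{if } x = 0 \quad (0 \geq g y \text{ for all } y \in [0, 1]) \\
\{0\} & \text{if } 0 < x < 1 \quad (g (x - y) \geq 0 \text{ for } y \text{ on both sides of } x) \\
[0, \infty) & \text{if } x = 1 \quad (g (1 - y) \geq 0 \text{ for all } y \in [0, 1])
\end{cases}
```

At an interior point the only normal direction is zero, while at each endpoint the cone points away from the set.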

Subgradient calculus
Basic rules for convex functions:
• Scaling: $\partial(af) = a \cdot \partial f$ provided $a > 0$
• Addition: $\partial(f_1 + f_2) = \partial f_1 + \partial f_2$
• Affine composition: if $g(x) = f(Ax + b)$, then $\partial g(x) = A^T \partial f(Ax + b)$
• Finite pointwise maximum: if $f(x) = \max_{i=1,\ldots,m} f_i(x)$, then
$$\partial f(x) = \mathrm{conv}\Big( \bigcup_{i : f_i(x) = f(x)} \partial f_i(x) \Big),$$
the convex hull of union of subdifferentials of all active functions at $x$

• General pointwise maximum: if $f(x) = \max_{s \in \mathcal{S}} f_s(x)$, then
$$\partial f(x) \supseteq \mathrm{cl}\Big\{ \mathrm{conv}\Big( \bigcup_{s : f_s(x) = f(x)} \partial f_s(x) \Big) \Big\}$$
and under some regularity conditions (on $\mathcal{S}$, $f_s$), we get $=$
• Norms: important special case, $f(x) = \|x\|_p$. Let $q$ be such that $1/p + 1/q = 1$, then
$$\partial f(x) = \Big\{ y : \|y\|_q \leq 1 \text{ and } y^T x = \max_{\|z\|_q \leq 1} z^T x \Big\}$$
Why is this a special case? Note
$$\|x\|_p = \max_{\|z\|_q \leq 1} z^T x$$
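
To make the norm rule concrete, the sketch below takes $p = \infty$ (so $q = 1$) and builds one element of the subdifferential: a signed unit vector at a coordinate of largest magnitude. The function names and the test vector are assumptions for illustration; the code then checks the two conditions $\|y\|_1 \leq 1$ and $y^T x = \|x\|_\infty$.

```python
def linf_norm(x):
    return max(abs(xi) for xi in x)

def linf_subgradient(x):
    """One element of the subdifferential of ||x||_inf (p = inf, q = 1)."""
    # Pick a coordinate of largest magnitude and put its sign there.
    j = max(range(len(x)), key=lambda i: abs(x[i]))
    y = [0.0] * len(x)
    y[j] = float((x[j] > 0) - (x[j] < 0))
    return y

x = [0.5, -3.0, 2.0]
y = linf_subgradient(x)
dual_norm = sum(abs(yi) for yi in y)          # ||y||_1
inner = sum(yi * xi for yi, xi in zip(y, x))  # y^T x
print(y, dual_norm, inner, linf_norm(x))
```

When several coordinates tie for the largest magnitude, any convex combination of the corresponding signed unit vectors would also satisfy both conditions; this sketch simply returns the first one.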

Why subgradients?
Subgradients are important for two reasons:
• Convex analysis: optimality characterization via subgradients, monotonicity, relationship to duality
• Convex optimization: if you can compute subgradients, then you can minimize (almost) any convex function

Optimality condition
For convex $f$,
$$f(x^\star) = \min_{x \in \mathbb{R}^n} f(x) \iff 0 \in \partial f(x^\star)$$
I.e., $x^\star$ is a minimizer if and only if 0 is a subgradient of $f$ at $x^\star$
Why? Easy: $g = 0$ being a subgradient means that for all $y$
$$f(y) \geq f(x^\star) + 0^T (y - x^\star) = f(x^\star)$$
Note analogy to differentiable case, where $\partial f(x) = \{ \nabla f(x) \}$

Soft-thresholding
Lasso problem can be parametrized as
$$\min_x \; \frac{1}{2} \|y - Ax\|^2 + \lambda \|x\|_1$$
where $\lambda \geq 0$. Consider simplified problem with $A = I$:
$$\min_x \; \frac{1}{2} \|y - x\|^2 + \lambda \|x\|_1$$
Claim: solution of simple problem is $x^\star = S_\lambda(y)$, where $S_\lambda$ is the soft-thresholding operator:
$$[S_\lambda(y)]_i = \begin{cases} y_i - \lambda & \text{if } y_i > \lambda \\ 0 & \text{if } -\lambda \leq y_i \leq \lambda \\ y_i + \lambda & \text{if } y_i < -\lambda \end{cases}$$

Why? Subgradients of $f(x) = \frac{1}{2} \|y - x\|^2 + \lambda \|x\|_1$ are
$$g = x - y + \lambda s,$$
where $s_i = \mathrm{sign}(x_i)$ if $x_i \neq 0$ and $s_i \in [-1, 1]$ if $x_i = 0$
Now just plug in $x = S_\lambda(y)$ and check we can get $g = 0$
[Figure: soft-thresholding in one variable, both axes from $-1$ to $1$]
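
The "plug in and check" step can be carried out numerically. In the sketch below, `soft_threshold` follows the three-case definition of $S_\lambda$, and `zero_is_subgradient` (a helper name introduced here) tests coordinate by coordinate whether some valid $s$ gives $x - y + \lambda s = 0$. The sample `y` and `lam` are arbitrary illustrative values.

```python
def soft_threshold(y, lam):
    """Apply the soft-thresholding operator S_lam to each coordinate of y."""
    out = []
    for yi in y:
        if yi > lam:
            out.append(yi - lam)
        elif yi < -lam:
            out.append(yi + lam)
        else:
            out.append(0.0)
    return out

def zero_is_subgradient(x, y, lam, tol=1e-12):
    """Check whether 0 = x - y + lam * s is achievable with a valid s."""
    for xi, yi in zip(x, y):
        if xi != 0:
            # s_i is forced to sign(x_i)
            s = 1.0 if xi > 0 else -1.0
            if abs(xi - yi + lam * s) > tol:
                return False
        else:
            # need s_i in [-1, 1] with -y_i + lam * s_i = 0, i.e. |y_i| <= lam
            if abs(yi) > lam + tol:
                return False
    return True

y = [3.0, -0.5, 0.2, -2.0]
lam = 1.0
x_star = soft_threshold(y, lam)
print(x_star, zero_is_subgradient(x_star, y, lam))
```

For comparison, the unregularized solution `x = y` does not pass the same check when `lam > 0` and `y` has a nonzero entry.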

Subgradient method
Given convex $f : \mathbb{R}^n \to \mathbb{R}$, not necessarily differentiable
Subgradient method: just like gradient descent, but replacing gradients with subgradients. I.e., initialize $x^{(0)}$, then repeat
$$x^{(k)} = x^{(k-1)} - t_k \cdot g^{(k-1)}, \quad k = 1, 2, 3, \ldots,$$
where $g^{(k-1)}$ is any subgradient of $f$ at $x^{(k-1)}$
Subgradient method is not necessarily a descent method, so we keep track of best iterate $x^{(k)}_{\mathrm{best}}$ among $x^{(1)}, \ldots, x^{(k)}$ so far, i.e.,
$$f(x^{(k)}_{\mathrm{best}}) = \min_{i=1,\ldots,k} f(x^{(i)})$$
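
A minimal sketch of this iteration with best-iterate tracking follows. The test objective $f(x) = \|x - c\|_1$, the step rule $t_k = 1/k$, and all function and variable names are assumptions made for the example.

```python
def subgradient_method(f, subgrad, x0, step_size, num_iters):
    """Minimize convex f via x(k) = x(k-1) - t_k * g(k-1), tracking the best iterate.

    step_size is a function k -> t_k for k = 1, 2, 3, ...
    """
    x = list(x0)
    x_best, f_best = None, float("inf")
    for k in range(1, num_iters + 1):
        g = subgrad(x)
        t = step_size(k)
        x = [xi - t * gi for xi, gi in zip(x, g)]
        fx = f(x)
        # Not a descent method: f(x) may go up, so remember the best value seen.
        if fx < f_best:
            x_best, f_best = list(x), fx
    return x_best, f_best

# Hypothetical test problem: f(x) = ||x - c||_1, minimized at x = c.
c = [1.0, -2.0]
f = lambda x: sum(abs(xi - ci) for xi, ci in zip(x, c))
subgrad = lambda x: [(xi > ci) - (xi < ci) for xi, ci in zip(x, c)]

x_best, f_best = subgradient_method(f, subgrad, [0.0, 0.0], lambda k: 1.0 / k, 2000)
print(x_best, f_best)
```

Printing `f(x)` inside the loop would show the objective moving up and down from one iteration to the next, which is why the best iterate is stored separately.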

Step size choices
• Fixed step size: $t_k = t$ all $k = 1, 2, 3, \ldots$
• Diminishing step size: choose $t_k$ to satisfy
$$\sum_{k=1}^\infty t_k^2 < \infty, \quad \sum_{k=1}^\infty t_k = \infty,$$
i.e., square summable but not summable
Important that step sizes go to zero, but not too fast
Other options too, but important difference to gradient descent: all step size options are pre-specified, not adaptively computed
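
One standard sequence meeting both conditions is $t_k = 1/k$. The sketch below (with `partial_sums` and `harmonic` as illustrative names) prints partial sums to show the first sum growing without bound, roughly like $\log n$, while the sum of squares stays below $\pi^2/6$.

```python
import math

def partial_sums(step_size, n):
    """Return (sum of t_k, sum of t_k^2) over k = 1..n."""
    s1 = sum(step_size(k) for k in range(1, n + 1))
    s2 = sum(step_size(k) ** 2 for k in range(1, n + 1))
    return s1, s2

harmonic = lambda k: 1.0 / k
for n in (10, 1000, 100000):
    s1, s2 = partial_sums(harmonic, n)
    print(n, round(s1, 3), round(s2, 5), round(math.pi ** 2 / 6, 5))
```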

Convergence analysis
Assume that $f : \mathbb{R}^n \to \mathbb{R}$ is convex, also:
• $f$ is Lipschitz continuous with constant $G > 0$,
$$|f(x) - f(y)| \leq G \|x - y\| \quad \text{for all } x, y$$
Equivalently: $\|g\| \leq G$ for any subgradient of $f$ at any $x$
• $\|x^{(1)} - x^\star\| \leq R$ (equivalently, $\|x^{(0)} - x^\star\|$ is bounded)
Theorem: For a fixed step size $t$, subgradient method satisfies
$$\lim_{k \to \infty} f(x^{(k)}_{\mathrm{best}}) \leq f(x^\star) + G^2 t / 2$$
Theorem: For diminishing step sizes, subgradient method satisfies
$$\lim_{k \to \infty} f(x^{(k)}_{\mathrm{best}}) = f(x^\star)$$

Basic inequality
Can prove both results from same basic inequality. Key steps:
• Using definition of subgradient (with the update written as $x^{(k+1)} = x^{(k)} - t_k g^{(k)}$),
$$\|x^{(k+1)} - x^\star\|^2 \leq \|x^{(k)} - x^\star\|^2 - 2 t_k (f(x^{(k)}) - f(x^\star)) + t_k^2 \|g^{(k)}\|^2$$
• Iterating last inequality,
$$\|x^{(k+1)} - x^\star\|^2 \leq \|x^{(1)} - x^\star\|^2 - 2 \sum_{i=1}^k t_i (f(x^{(i)}) - f(x^\star)) + \sum_{i=1}^k t_i^2 \|g^{(i)}\|^2$$

• Using $\|x^{(k+1)} - x^\star\| \geq 0$ and $\|x^{(1)} - x^\star\| \leq R$,
$$2 \sum_{i=1}^k t_i (f(x^{(i)}) - f(x^\star)) \leq R^2 + \sum_{i=1}^k t_i^2 \|g^{(i)}\|^2$$
• Introducing $f(x^{(k)}_{\mathrm{best}})$,
$$2 \sum_{i=1}^k t_i (f(x^{(i)}) - f(x^\star)) \geq 2 \Big( \sum_{i=1}^k t_i \Big) (f(x^{(k)}_{\mathrm{best}}) - f(x^\star))$$
• Plugging this in and using $\|g^{(i)}\| \leq G$,
$$f(x^{(k)}_{\mathrm{best}}) - f(x^\star) \leq \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i}$$

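Convergence proofs
For constant step size $t$, basic bound is
$$\frac{R^2 + G^2 t^2 k}{2 t k} \to \frac{G^2 t}{2} \quad \text{as } k \to \infty$$
For diminishing step sizes $t_k$,
$$\sum_{i=1}^\infty t_i^2 < \infty, \quad \sum_{i=1}^\infty t_i = \infty,$$
we get
$$\frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} \to 0 \quad \text{as } k \to \infty$$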
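
Both limits can be observed by evaluating the bound for growing $k$. The constants $R = G = 1$, the fixed step $t = 0.1$, the diminishing rule $t_i = 1/i$, and the name `basic_bound` are assumptions chosen for illustration.

```python
def basic_bound(R, G, steps):
    """(R^2 + G^2 * sum t_i^2) / (2 * sum t_i) for step sizes t_1, ..., t_k."""
    return (R ** 2 + G ** 2 * sum(t * t for t in steps)) / (2.0 * sum(steps))

R, G, t = 1.0, 1.0, 0.1  # illustrative constants, not from the slides
for k in (10, 1000, 100000):
    fixed = basic_bound(R, G, [t] * k)
    diminishing = basic_bound(R, G, [1.0 / i for i in range(1, k + 1)])
    print(k, fixed, diminishing)
```

The fixed-step column settles near $G^2 t / 2 = 0.05$, while the diminishing-step column keeps shrinking, though only slowly for $t_i = 1/i$.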

Convergence rate
After $k$ iterations, what is complexity of error $f(x^{(k)}_{\mathrm{best}}) - f(x^\star)$?
Consider taking $t_i = R / (G \sqrt{k})$, all $i = 1, \ldots, k$. Then basic bound is
$$\frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} = \frac{RG}{\sqrt{k}}$$
Can show this choice is the best we can do (i.e., minimizes bound)
I.e., subgradient method has convergence rate $O(1/\sqrt{k})$
I.e., to get $f(x^{(k)}_{\mathrm{best}}) - f(x^\star) \leq \epsilon$, need $O(1/\epsilon^2)$ iterations
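
The value $RG/\sqrt{k}$ can be checked by direct substitution. The second line below restricts attention to constant step sizes, which is an assumption made here to keep the minimization one-dimensional; the last line solves for the iteration count.

```latex
\begin{align*}
t_i = \frac{R}{G\sqrt{k}}: \quad
\sum_{i=1}^k t_i^2 = \frac{R^2}{G^2}, \quad
\sum_{i=1}^k t_i = \frac{R\sqrt{k}}{G}
\quad &\Longrightarrow \quad
\frac{R^2 + G^2 \cdot R^2 / G^2}{2 R \sqrt{k} / G} = \frac{RG}{\sqrt{k}} \\
\frac{d}{dt} \Big( \frac{R^2}{2 t k} + \frac{G^2 t}{2} \Big) = 0
\quad &\Longleftrightarrow \quad
t = \frac{R}{G\sqrt{k}} \\
\frac{RG}{\sqrt{k}} \leq \epsilon
\quad &\Longleftrightarrow \quad
k \geq \frac{R^2 G^2}{\epsilon^2}
\end{align*}
```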

Intersection of sets
Example from Boyd's lecture notes: suppose we want to find $x^\star \in C_1 \cap \ldots \cap C_m$, i.e., find point in intersection of closed, convex sets $C_1, \ldots, C_m$
First define
$$f(x) = \max_{i=1,\ldots,m} \mathrm{dist}(x, C_i),$$
and now solve
$$\min_{x \in \mathbb{R}^n} f(x)$$
Note that $f(x^\star) = 0 \Rightarrow x^\star \in C_1 \cap \ldots \cap C_m$
Recall distance to set $C$,
$$\mathrm{dist}(x, C) = \min \{ \|x - u\| : u \in C \}$$

For closed, convex $C$, there is a unique point minimizing $\|x - u\|$ over $u \in C$. Denoted $u^\star = P_C(x)$, so $\mathrm{dist}(x, C) = \|x - P_C(x)\|$
Let $f_i(x) = \mathrm{dist}(x, C_i)$, each $i$. Then $f(x) = \max_{i=1,\ldots,m} f_i(x)$, and
• For each $i$, and $x \notin C_i$,
$$\nabla f_i(x) = \frac{x - P_{C_i}(x)}{\|x - P_{C_i}(x)\|}$$
• If $f(x) = f_i(x) \neq 0$, then
$$\frac{x - P_{C_i}(x)}{\|x - P_{C_i}(x)\|} \in \partial f(x)$$
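
A small two-dimensional sketch of this construction is given below. The two sets (the unit ball and the halfspace $\{x : x_1 \geq 0.5\}$), the starting point, and the step rule $t_k = 1/k$ are all assumptions chosen for illustration; the subgradient is the normalized residual for a farthest set, as in the last bullet.

```python
import math

def proj_ball(x):
    """Projection onto the unit Euclidean ball."""
    n = math.hypot(x[0], x[1])
    return list(x) if n <= 1.0 else [x[0] / n, x[1] / n]

def proj_halfspace(x):
    """Projection onto {x : x[0] >= 0.5}."""
    return [max(x[0], 0.5), x[1]]

PROJECTIONS = [proj_ball, proj_halfspace]

def dist(x, proj):
    p = proj(x)
    return math.hypot(x[0] - p[0], x[1] - p[1])

def f(x):
    return max(dist(x, proj) for proj in PROJECTIONS)

def subgrad(x):
    """(x - P(x)) / ||x - P(x)|| for a farthest set; zero vector if x lies in every set."""
    proj = max(PROJECTIONS, key=lambda pr: dist(x, pr))
    p = proj(x)
    d = math.hypot(x[0] - p[0], x[1] - p[1])
    if d == 0.0:
        return [0.0, 0.0]
    return [(x[0] - p[0]) / d, (x[1] - p[1]) / d]

# Subgradient method with diminishing steps t_k = 1/k, tracking the best iterate.
x = [-3.0, 4.0]
x_best, f_best = list(x), f(x)
for k in range(1, 201):
    g = subgrad(x)
    x = [x[0] - g[0] / k, x[1] - g[1] / k]
    if f(x) < f_best:
        x_best, f_best = list(x), f(x)
print(x_best, f_best)
```

Each step moves the iterate a distance $t_k$ toward whichever set is currently farthest, so the farthest set can change from one iteration to the next.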
