 
Machine learning theory: Convex learning problems
Hamid Beigy
Sharif University of Technology
June 8, 2020
Table of contents
1. Introduction
2. Convexity
3. Lipschitzness
4. Smoothness
5. Convex learning problems
6. Surrogate loss functions
7. Assignments
8. Summary
Introduction
◮ Convex learning comprises an important family of learning problems, because most of what we can learn efficiently falls into this family.
◮ Linear regression with the squared loss is a convex problem for regression.
◮ Logistic regression is a convex problem for classification.
◮ Learning halfspaces with the 0-1 loss, which is computationally hard in the unrealizable case, is a non-convex problem.
◮ In general, a convex learning problem is a problem
  1. whose hypothesis class is a convex set, and
  2. whose loss function is a convex function for each example.
◮ Other properties of the loss function that facilitate successful learning are
  1. Lipschitzness
  2. Smoothness
◮ In this session, we study the learnability of
  1. Convex-Smooth problems
  2. Lipschitz-Bounded problems
Convexity
Convex set
Definition (Convex set) A set $C$ in a vector space is convex if for any two vectors $u, v \in C$, the line segment between $u$ and $v$ is contained in $C$. That is, for any $\alpha \in [0, 1]$, the point $\alpha u + (1 - \alpha) v$, called a convex combination of $u$ and $v$, belongs to $C$.
Example (Convex and non-convex sets) [Figure: examples of non-convex sets and convex sets in $\mathbb{R}^2$.]
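The definition can be probed numerically. The following is a minimal sketch, assuming two illustrative sets in $\mathbb{R}^2$ that are not taken from the slides: a closed disk (convex) and an annulus (non-convex). The helper names `in_disk`, `in_annulus`, and `convex_combination` are hypothetical.

```python
import math

def in_disk(p, radius=1.0):
    """Membership test for the closed disk of the given radius (a convex set)."""
    return math.hypot(p[0], p[1]) <= radius

def in_annulus(p, inner=0.5, outer=1.0):
    """Membership test for a ring-shaped set (not convex: it has a hole)."""
    return inner <= math.hypot(p[0], p[1]) <= outer

def convex_combination(u, v, alpha):
    """Return alpha * u + (1 - alpha) * v for points u, v in the plane."""
    return (alpha * u[0] + (1 - alpha) * v[0],
            alpha * u[1] + (1 - alpha) * v[1])

# Two points that belong to both sets; their midpoint is the origin.
u, v = (1.0, 0.0), (-1.0, 0.0)
midpoint = convex_combination(u, v, 0.5)
```

Here the midpoint stays inside the disk but falls into the hole of the annulus, so the annulus violates the definition for this pair of points.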
Convex function
Definition (Convex function) Let $C$ be a convex set. A function $f : C \to \mathbb{R}$ is convex if for any two vectors $u, v \in C$ and any $\alpha \in [0, 1]$,
$$f(\alpha u + (1 - \alpha) v) \le \alpha f(u) + (1 - \alpha) f(v).$$
In words, $f$ is convex if for any $u, v \in C$, the graph of $f$ between $u$ and $v$ lies on or below the line segment joining $(u, f(u))$ and $(v, f(v))$.
Example (Convex function) [Figure: graph of a convex function; at the point $\alpha u + (1 - \alpha) v$ the value $f(\alpha u + (1 - \alpha) v)$ lies below the chord value $\alpha f(u) + (1 - \alpha) f(v)$.]
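The defining inequality suggests a simple randomized test for scalar functions. This is a sketch under stated assumptions: the sampling interval, the tolerance, and the names `violates_convexity` and `looks_convex` are illustrative choices, and a passing check is only evidence, not a proof of convexity.

```python
import random

def violates_convexity(f, u, v, alpha, tol=1e-9):
    """Return True if f breaks the convexity inequality at (u, v, alpha)."""
    point = alpha * u + (1 - alpha) * v
    return f(point) > alpha * f(u) + (1 - alpha) * f(v) + tol

def looks_convex(f, low=-5.0, high=5.0, trials=10_000, seed=0):
    """Randomized check on [low, high].

    False means a counterexample to convexity was found.
    True only means that no violation was found among the samples.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        u, v = rng.uniform(low, high), rng.uniform(low, high)
        alpha = rng.random()
        if violates_convexity(f, u, v, alpha):
            return False
    return True
```

For instance, `looks_convex(lambda x: x * x)` finds no violation, while `looks_convex(lambda x: -x * x)` returns `False` after finding a counterexample.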
Epigraph
A function $f$ is convex if and only if its epigraph is a convex set, where
$$\mathrm{epigraph}(f) = \{ (x, \beta) \mid f(x) \le \beta \}.$$
[Figure: the epigraph of $f$ is the region on or above the graph of $f(x)$.]
Properties of convex functions
1. If $f$ is convex, then every local minimum of $f$ is also a global minimum.
◮ Let $B(u, r) = \{ v \mid \|v - u\| \le r \}$ be the ball of radius $r$ centered at $u$.
◮ $f(u)$ is a local minimum of $f$ at $u$ if there exists $r > 0$ such that $f(v) \ge f(u)$ for all $v \in B(u, r)$.
◮ It follows that for any $v$ (not necessarily in $B(u, r)$), there is a small enough $\alpha > 0$ such that $u + \alpha (v - u) \in B(u, r)$, and therefore
$$f(u) \le f(u + \alpha (v - u)).$$
◮ If $f$ is convex, we also have
$$f(u + \alpha (v - u)) = f(\alpha v + (1 - \alpha) u) \le (1 - \alpha) f(u) + \alpha f(v).$$
◮ Combining these two inequalities gives $f(u) \le (1 - \alpha) f(u) + \alpha f(v)$, that is, $\alpha f(u) \le \alpha f(v)$. Dividing by $\alpha > 0$ yields $f(u) \le f(v)$.
◮ This holds for every $v$, hence $f(u)$ is also a global minimum of $f$.
Properties of convex functions
2. If $f$ is convex and differentiable, then
$$\forall u, \quad f(u) \ge f(w) + \langle \nabla f(w), u - w \rangle,$$
where $\nabla f(w) = \left( \frac{\partial f(w)}{\partial w_1}, \ldots, \frac{\partial f(w)}{\partial w_n} \right)$ is the gradient of $f$ at $w$.
◮ If $f$ is convex, for every $w$ we can construct a tangent to $f$ at $w$ that lies below $f$ everywhere.
◮ If $f$ is differentiable, this tangent is the linear function $l(u) = f(w) + \langle \nabla f(w), u - w \rangle$.
[Figure: the tangent $f(w) + \langle u - w, \nabla f(w) \rangle$ at $w$ lies below the graph of $f(u)$.]
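The tangent inequality can be checked on a concrete function. The sketch below assumes the illustrative choice $f(w) = \|w\|^2$ with gradient $2w$, which is not an example from the slides. The name `tangent_gap` is hypothetical.

```python
def f(w):
    """Squared Euclidean norm, a convex differentiable function."""
    return sum(x * x for x in w)

def grad_f(w):
    """Gradient of the squared Euclidean norm: 2 * w."""
    return [2 * x for x in w]

def tangent_gap(u, w):
    """Return f(u) - [f(w) + <grad f(w), u - w>].

    The first-order property of convex functions says this is nonnegative.
    """
    inner = sum(g * (a - b) for g, a, b in zip(grad_f(w), u, w))
    return f(u) - (f(w) + inner)
```

For this particular $f$, expanding the terms shows that the gap equals $\|u - w\|^2$, so it is zero only at $u = w$.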
Properties of convex functions (Sub-gradients)
◮ A vector $v$ is a sub-gradient of $f$ at $w$ if
$$\forall u, \quad f(u) \ge f(w) + \langle v, u - w \rangle.$$
◮ The differential set, $\partial f(w)$, is the set of sub-gradients of $f$ at $w$.
◮ If $f$ is convex and differentiable at $w$, then the gradient $\nabla f(w) = \left( \frac{\partial f(w)}{\partial w_1}, \ldots, \frac{\partial f(w)}{\partial w_n} \right)$ is a sub-gradient of $f$ at $w$.
Lemma Function $f$ is convex iff for every $w$, $\partial f(w) \ne \emptyset$.
[Figure: the line $f(w) + \langle u - w, v \rangle$ through $(w, f(w))$ lies below the graph of $f(u)$.]
◮ $f$ is locally flat around $w$ ($0$ is a sub-gradient) iff $w$ is a global minimizer.
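As an illustration of the sub-gradient inequality, the sketch below tests candidate sub-gradients of the scalar function $f(x) = |x|$ on a finite grid of points. The grid and the helper name `is_subgradient` are assumptions for this example, and checking finitely many points can only refute a candidate, not certify it in general.

```python
def is_subgradient(f, v, w, test_points, tol=1e-12):
    """Check f(u) >= f(w) + v * (u - w) for every u in test_points (scalar case)."""
    return all(f(u) >= f(w) + v * (u - w) - tol for u in test_points)

# A grid of test points on [-5, 5].
points = [i / 10 for i in range(-50, 51)]

# At the kink w = 0, every v in [-1, 1] satisfies the inequality for |x|.
candidates_at_zero = [v for v in (-1.5, -1.0, 0.0, 0.5, 1.0, 1.5)
                      if is_subgradient(abs, v, 0.0, points)]
```

Since $0$ passes the check at $w = 0$, this matches the last bullet above: $w = 0$ is a global minimizer of $|x|$.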
Convex functions
Lemma (Convexity of a scalar function) Let $f : \mathbb{R} \to \mathbb{R}$ be a scalar twice differentiable function, and let $f'$, $f''$ be its first and second derivatives, respectively. Then the following are equivalent:
1. $f$ is convex.
2. $f'$ is monotonically nondecreasing.
3. $f''$ is nonnegative.
Example (Convexity of scalar functions)
1. The scalar function $f(x) = x^2$ is convex, because $f'(x) = 2x$ and $f''(x) = 2 > 0$.
2. The scalar function $f(x) = \log(1 + e^x)$ is convex, because
◮ $f'(x) = \frac{e^x}{1 + e^x} = \frac{1}{e^{-x} + 1}$ is a monotonically increasing function, since the exponential function is monotonically increasing (so $e^{-x} + 1$ is decreasing in $x$).
◮ $f''(x) = \frac{e^{-x}}{(e^{-x} + 1)^2} = f'(x)(1 - f'(x))$ is nonnegative.
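The derivative formulas for $f(x) = \log(1 + e^x)$ can be sanity-checked with finite differences. This is a minimal sketch: the function names and the step size `h` are illustrative assumptions, and a central difference only approximates the derivative.

```python
import math

def softplus(x):
    """f(x) = log(1 + e^x)."""
    return math.log1p(math.exp(x))

def softplus_prime(x):
    """f'(x) = 1 / (e^{-x} + 1)."""
    return 1.0 / (math.exp(-x) + 1.0)

def softplus_second(x):
    """f''(x) = f'(x) * (1 - f'(x)), which is always nonnegative."""
    s = softplus_prime(x)
    return s * (1.0 - s)

def central_difference(g, x, h=1e-5):
    """Numerical approximation of g'(x)."""
    return (g(x + h) - g(x - h)) / (2 * h)
```

Comparing `central_difference(softplus, x)` with `softplus_prime(x)`, and `central_difference(softplus_prime, x)` with `softplus_second(x)`, at a few values of `x` gives a quick consistency check of both formulas.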
Convex functions
Lemma (Convexity of composition of a convex scalar function with a linear function) Assume that $f : \mathbb{R}^n \to \mathbb{R}$ can be written as $f(w) = g(\langle w, x \rangle + y)$ for some $x \in \mathbb{R}^n$, $y \in \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$. Then convexity of $g$ implies the convexity of $f$.
Proof (Convexity of composition of a convex scalar function with a linear function). Let $w_1, w_2 \in \mathbb{R}^n$ and $\alpha \in [0, 1]$. We have
$$\begin{aligned}
f(\alpha w_1 + (1 - \alpha) w_2) &= g(\langle \alpha w_1 + (1 - \alpha) w_2, x \rangle + y) \\
&= g(\alpha \langle w_1, x \rangle + (1 - \alpha) \langle w_2, x \rangle + y) \\
&= g(\alpha (\langle w_1, x \rangle + y) + (1 - \alpha)(\langle w_2, x \rangle + y)) \\
&\le \alpha g(\langle w_1, x \rangle + y) + (1 - \alpha) g(\langle w_2, x \rangle + y),
\end{aligned}$$
where the last inequality follows from the convexity of $g$.
Example (Convexity of composition of a convex scalar function with a linear function)
1. Given some $x \in \mathbb{R}^n$ and $y \in \mathbb{R}$, let $f(w) = (\langle w, x \rangle - y)^2$. Then $f$ is a composition of the function $g(a) = a^2$ with a linear function, and hence $f$ is a convex function.
2. Given some $x \in \mathbb{R}^n$ and $y \in \{-1, +1\}$, let $f(w) = \log(1 + \exp(-y \langle w, x \rangle))$. Then $f$ is a composition of the function $g(a) = \log(1 + e^a)$ with a linear function, and hence $f$ is a convex function.
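The two example losses can be checked against the convexity inequality along a segment between two weight vectors. The sketch below uses plain Python lists as vectors. The helper names `dot` and `convexity_gap` are hypothetical, and a nonnegative gap on sample points is only a sanity check of the lemma, not a proof.

```python
import math

def dot(a, b):
    """Inner product <a, b> of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))

def squared_loss(w, x, y):
    """f(w) = (<w, x> - y)^2."""
    return (dot(w, x) - y) ** 2

def logistic_loss(w, x, y):
    """f(w) = log(1 + exp(-y <w, x>)) with label y in {-1, +1}."""
    return math.log1p(math.exp(-y * dot(w, x)))

def convexity_gap(loss, w1, w2, alpha, x, y):
    """Return alpha f(w1) + (1 - alpha) f(w2) - f(alpha w1 + (1 - alpha) w2).

    For a convex f this quantity is nonnegative.
    """
    mix = [alpha * a + (1 - alpha) * b for a, b in zip(w1, w2)]
    return (alpha * loss(w1, x, y) + (1 - alpha) * loss(w2, x, y)
            - loss(mix, x, y))
```

Calling `convexity_gap(logistic_loss, w1, w2, 0.5, x, 1.0)` for a few choices of `w1`, `w2`, and `x` should never return a value below zero, up to floating-point rounding.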
Convex functions
Lemma (Convexity of maximum and sum of convex functions) Let $f_i : \mathbb{R}^n \to \mathbb{R}$ ($1 \le i \le r$) be convex functions. The following functions $g : \mathbb{R}^n \to \mathbb{R}$ are convex.
1. $g(x) = \max_{i \in \{1, \ldots, r\}} f_i(x)$.
2. $g(x) = \sum_{i=1}^{r} w_i f_i(x)$, where $w_i \ge 0$ for all $i$.
Proof (Convexity of maximum and sum of convex functions).
1. The first claim follows by
$$\begin{aligned}
g(\alpha u + (1 - \alpha) v) &= \max_i f_i(\alpha u + (1 - \alpha) v) \le \max_i \left[ \alpha f_i(u) + (1 - \alpha) f_i(v) \right] \\
&\le \alpha \max_i f_i(u) + (1 - \alpha) \max_i f_i(v) = \alpha g(u) + (1 - \alpha) g(v).
\end{aligned}$$
2. The second claim follows by
$$\begin{aligned}
g(\alpha u + (1 - \alpha) v) &= \sum_{i=1}^{r} w_i f_i(\alpha u + (1 - \alpha) v) \le \sum_{i=1}^{r} w_i \left[ \alpha f_i(u) + (1 - \alpha) f_i(v) \right] \\
&= \alpha \sum_{i=1}^{r} w_i f_i(u) + (1 - \alpha) \sum_{i=1}^{r} w_i f_i(v) = \alpha g(u) + (1 - \alpha) g(v).
\end{aligned}$$
Example The function $g(x) = |x|$ is convex, because $g(x) = \max\{ f_1(x), f_2(x) \}$, where both $f_1(x) = x$ and $f_2(x) = -x$ are convex.
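Both constructions in the lemma are easy to express as higher-order functions. The following sketch works with scalar inputs for simplicity. The names `pointwise_max` and `weighted_sum` are hypothetical, and rejecting negative weights is a design choice made here to mirror the condition $w_i \ge 0$.

```python
def pointwise_max(functions):
    """Return g(x) = max_i f_i(x) for a list of functions f_i."""
    def g(x):
        return max(f(x) for f in functions)
    return g

def weighted_sum(functions, weights):
    """Return g(x) = sum_i w_i f_i(x); all weights must be nonnegative."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative to preserve convexity")
    def g(x):
        return sum(w * f(x) for w, f in zip(weights, functions))
    return g

# |x| written as the maximum of the two linear functions x and -x.
absolute_value = pointwise_max([lambda x: x, lambda x: -x])
```

With this definition, `absolute_value(-3.0)` evaluates the larger of `-3.0` and `3.0`, which agrees with the built-in `abs`.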
Lipschitzness
◮ The definition below is with respect to the Euclidean norm on $\mathbb{R}^n$, but Lipschitzness can be defined with respect to any norm.
Definition (Lipschitzness) Let $C \subset \mathbb{R}^n$. A function $f : \mathbb{R}^n \to \mathbb{R}^k$ is $\rho$-Lipschitz over $C$ if for all $w_1, w_2 \in C$ we have
$$\| f(w_1) - f(w_2) \| \le \rho \| w_1 - w_2 \|.$$
◮ A Lipschitz function cannot change too fast. If $f : \mathbb{R} \to \mathbb{R}$ is differentiable, then by the mean value theorem we have $f(w_1) - f(w_2) = f'(u)(w_1 - w_2)$, where $u$ is a point between $w_1$ and $w_2$.
Theorem (Mean-Value Theorem) If $f(x)$ is defined and continuous on the interval $[a, b]$ and differentiable on $(a, b)$, then there is at least one number $c$ in the interval $(a, b)$ (that is, $a < c < b$) such that
$$f'(c) = \frac{f(b) - f(a)}{b - a}.$$
◮ If $f'$ is bounded everywhere (in absolute value) by $\rho$, then $f$ is $\rho$-Lipschitz, since $|f(w_1) - f(w_2)| = |f'(u)| \, |w_1 - w_2| \le \rho |w_1 - w_2|$.
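For a scalar function, the smallest valid $\rho$ on an interval can be bounded from below by sampling pairs of points. This is a rough sketch: the name `lipschitz_lower_bound`, the sampling scheme, and the number of trials are assumptions, and the returned value never exceeds the true Lipschitz constant on the interval (up to rounding) but may underestimate it.

```python
import random

def lipschitz_lower_bound(f, low, high, trials=10_000, seed=0):
    """Largest observed ratio |f(w1) - f(w2)| / |w1 - w2| on [low, high].

    Any valid Lipschitz constant rho over the interval must be at least
    this large; the converse does not follow from finitely many samples.
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(trials):
        w1, w2 = rng.uniform(low, high), rng.uniform(low, high)
        if w1 != w2:
            ratio = abs(f(w1) - f(w2)) / abs(w1 - w2)
            best = max(best, ratio)
    return best
```

For example, the estimate for `abs` on $[-5, 5]$ does not exceed $1$, and the estimate for $x^2$ on $[-\rho/2, \rho/2]$ does not exceed $\rho$.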
Example (Lipschitzness)
1. The function $f(x) = |x|$ is 1-Lipschitz over $\mathbb{R}$, because (using the triangle inequality)
$$|x_1| - |x_2| = |x_1 - x_2 + x_2| - |x_2| \le |x_1 - x_2| + |x_2| - |x_2| = |x_1 - x_2|.$$
Exchanging the roles of $x_1$ and $x_2$ gives $|x_2| - |x_1| \le |x_1 - x_2|$, hence $\big| |x_1| - |x_2| \big| \le |x_1 - x_2|$.
2. The function $f(x) = \log(1 + e^x)$ is 1-Lipschitz over $\mathbb{R}$, because
$$|f'(x)| = \left| \frac{e^x}{1 + e^x} \right| = \left| \frac{1}{e^{-x} + 1} \right| \le 1.$$
3. The function $f(x) = x^2$ is not $\rho$-Lipschitz over $\mathbb{R}$ for any $\rho$. Let $x_1 = 0$ and $x_2 = 1 + \rho$, then
$$f(x_2) - f(x_1) = (1 + \rho)^2 > \rho (1 + \rho) = \rho |x_2 - x_1|.$$
4. The function $f(x) = x^2$ is $\rho$-Lipschitz over the set $C = \{ x : |x| \le \rho / 2 \}$. For $x_1, x_2 \in C$, we have
$$|x_1^2 - x_2^2| = |x_1 - x_2| \, |x_1 + x_2| \le 2 \cdot \frac{\rho}{2} \, |x_1 - x_2| = \rho |x_1 - x_2|.$$
5. The linear function $f : \mathbb{R}^n \to \mathbb{R}$ defined by $f(w) = \langle v, w \rangle + b$, where $v \in \mathbb{R}^n$, is $\|v\|$-Lipschitz. By the Cauchy-Schwarz inequality, we have
$$|f(w_1) - f(w_2)| = |\langle v, w_1 - w_2 \rangle| \le \|v\| \, \|w_1 - w_2\|.$$
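The counterexample for $f(x) = x^2$ and the bound for linear functions can both be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, and the first check relies on floating-point arithmetic, so it is meaningful only for moderate values of `rho` (for very large `rho`, `1.0 + rho` rounds to `rho`).

```python
import math

def square_breaks_rho_lipschitz(rho):
    """Evaluate the counterexample x1 = 0, x2 = 1 + rho for f(x) = x^2.

    Returns True when |f(x2) - f(x1)| > rho * |x2 - x1|.
    """
    x1, x2 = 0.0, 1.0 + rho
    return abs(x2 ** 2 - x1 ** 2) > rho * abs(x2 - x1)

def linear_lipschitz_holds(v, b, w1, w2, tol=1e-9):
    """Check |f(w1) - f(w2)| <= ||v|| * ||w1 - w2|| for f(w) = <v, w> + b."""
    def f(w):
        return sum(p * q for p, q in zip(v, w)) + b
    norm_v = math.sqrt(sum(p * p for p in v))
    dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(w1, w2)))
    return abs(f(w1) - f(w2)) <= norm_v * dist + tol
```

The first function returns `True` for any moderate `rho`, reflecting that no single constant works over all of $\mathbb{R}$. The second returns `True` for any pair of points, as the Cauchy-Schwarz argument predicts.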