

SLIDE 1

Machine learning theory

Convex learning problems

Hamid Beigy

Sharif University of Technology

June 8, 2020

SLIDE 2

Table of contents

  • 1. Introduction
  • 2. Convexity
  • 3. Lipschitzness
  • 4. Smoothness
  • 5. Convex learning problems
  • 6. Surrogate loss functions
  • 7. Assignments
  • 8. Summary

SLIDE 3

Introduction

SLIDE 4

Introduction

◮ Convex learning comprises an important family of learning problems, mainly because most of what we can learn efficiently falls into this family.

◮ Linear regression with the squared loss is a convex problem for regression.
◮ Logistic regression is a convex problem for classification.
◮ Halfspaces with the 0 − 1 loss, which is a computationally hard problem to learn in the unrealizable case, is non-convex.

◮ In general, a convex learning problem is a problem

  • 1. whose hypothesis class is a convex set and
  • 2. whose loss function is a convex function for each example.

◮ Other properties of the loss function that facilitate successful learning are

  • 1. Lipschitzness
  • 2. Smoothness

◮ In this session, we study the learnability of

  • 1. Convex-Smooth-Bounded problems
  • 2. Convex-Lipschitz-Bounded problems

SLIDE 5

Convexity

SLIDE 6

Convex set

Definition (Convex set) A set C in a vector space is convex if for any two vectors u, v ∈ C, the line segment between u and v is contained in C. That is, for any α ∈ [0, 1], the point αu + (1 − α)v ∈ C. Given α ∈ [0, 1], the point αu + (1 − α)v is called a convex combination of u and v.

Example (Convex and non-convex sets) [Figure: some examples of convex and non-convex sets in R2.]
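A quick numerical illustration (my own sketch, not part of the original slides): the snippet below spot-checks the definition for the Euclidean unit ball in R2, which is convex; the sampling scheme is arbitrary and only for the demo.

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    # draw two points inside the unit ball of R2 (rescale if they fall outside)
    u = rng.normal(size=2); u /= max(1.0, np.linalg.norm(u))
    v = rng.normal(size=2); v /= max(1.0, np.linalg.norm(v))
    alpha = rng.uniform()
    w = alpha * u + (1 - alpha) * v          # convex combination of u and v
    assert np.linalg.norm(w) <= 1 + 1e-12    # it stays inside the (convex) ball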

SLIDE 7

Convex function

Definition (Convex function) Let C be a convex set. A function f : C → R is convex if for any two vectors u, v ∈ C and any α ∈ [0, 1],

f(αu + (1 − α)v) ≤ αf(u) + (1 − α)f(v).

In words, f is convex if for any u, v ∈ C, the graph of f between u and v lies below the line segment joining f(u) and f(v).

Example (Convex function) [Figure: the chord joining (u, f(u)) and (v, f(v)) lies above the graph of f; at the point αu + (1 − α)v, the value f(αu + (1 − α)v) is at most αf(u) + (1 − α)f(v).]
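A small numerical spot-check of this inequality (my own illustration, not from the slides): random convex combinations never violate it for the convex function x², while a non-convex function such as sin does violate it at some points.

import numpy as np

rng = np.random.default_rng(1)

def jensen_violations(f, trials=10_000):
    """Count samples where f(a*u + (1-a)*v) > a*f(u) + (1-a)*f(v)."""
    u = rng.uniform(-5, 5, trials)
    v = rng.uniform(-5, 5, trials)
    a = rng.uniform(size=trials)
    return int(np.sum(f(a * u + (1 - a) * v) > a * f(u) + (1 - a) * f(v) + 1e-9))

print(jensen_violations(np.square))  # 0 violations: x**2 is convex
print(jensen_violations(np.sin))     # > 0 violations: sin is not convex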

SLIDE 8

Epigraph

A function f is convex if and only if its epigraph is a convex set, where

epigraph(f) = {(x, β) | f(x) ≤ β}.

[Figure: a convex function f and its epigraph, the region lying above the graph of f.]

SLIDE 9

Properties of convex functions

  • 1. If f is convex then every local minimum of f is also a global minimum.

◮ Let B(u, r) = {v | ‖v − u‖ ≤ r} be a ball of radius r centered at u.
◮ f(u) is a local minimum of f at u if ∃r > 0 such that ∀v ∈ B(u, r) we have f(v) ≥ f(u).
◮ It follows that for any v (not necessarily in B), there is a small enough α > 0 such that u + α(v − u) ∈ B(u, r), and therefore f(u) ≤ f(u + α(v − u)).
◮ If f is convex, we also have that

f(u + α(v − u)) = f((1 − α)u + αv) ≤ (1 − α)f(u) + αf(v).

◮ Combining these two inequalities and rearranging terms, we conclude that f(u) ≤ f(v).
◮ This holds for every v, hence f(u) is also a global minimum of f.
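The following toy sketch (my own, using plain gradient descent, which is covered in Chapter 14) illustrates this property: for the convex quadratic f(w) = ‖w − c‖², runs started from very different points all reach the same global minimizer.

import numpy as np

def gradient_descent(grad, w0, eta=0.1, steps=2000):
    """Plain gradient descent; returns the final iterate."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

c = np.array([1.0, -2.0])           # f(w) = ||w - c||^2 has the unique global minimizer c
grad_f = lambda w: 2 * (w - c)

for start in ([10.0, 10.0], [-7.0, 3.0], [0.0, 0.0]):
    print(gradient_descent(grad_f, start))   # every run converges to [1, -2]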

SLIDE 10

Properties of convex functions

  • 2. If f is convex and differentiable, then for all u,

f(u) ≥ f(w) + ⟨∇f(w), u − w⟩,

where ∇f(w) = (∂f(w)/∂w1, . . . , ∂f(w)/∂wn) is the gradient of f at w.

◮ If f is convex, then for every w we can construct a tangent to f at w that lies below f everywhere.
◮ If f is differentiable, this tangent is the linear function l(u) = f(w) + ⟨∇f(w), u − w⟩.

[Figure: the tangent l(u) = f(w) + ⟨∇f(w), u − w⟩ at w lies below the graph of f.]
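A quick numerical check of this first-order lower bound (my own illustration): for the convex function f(w) = ‖w‖² with ∇f(w) = 2w, the tangent at any w never exceeds f.

import numpy as np

rng = np.random.default_rng(2)
f = lambda w: np.dot(w, w)      # f(w) = ||w||^2 is convex
grad = lambda w: 2 * w          # its gradient

for _ in range(1000):
    u, w = rng.normal(size=3), rng.normal(size=3)
    # f(u) >= f(w) + <grad f(w), u - w>  (up to rounding)
    assert f(u) >= f(w) + np.dot(grad(w), u - w) - 1e-9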

SLIDE 11

Properties of convex functions (Sub-gradients)

◮ v is a sub-gradient of f at w if for all u,

f(u) ≥ f(w) + ⟨v, u − w⟩.

◮ The differential set, ∂f(w), is the set of all sub-gradients of f at w.
◮ If f is differentiable at w, then its gradient ∇f(w) = (∂f(w)/∂w1, . . . , ∂f(w)/∂wn) is a sub-gradient of f at w.

Lemma Function f is convex iff for every w, ∂f(w) is nonempty.

[Figure: the line f(w) + ⟨v, u − w⟩ determined by a sub-gradient v at w lies below the graph of f.]

◮ 0 is a sub-gradient of f at w (f is locally flat around w) iff w is a global minimizer of f.
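As a small illustration of sub-gradients (my own sketch): f(x) = |x| is not differentiable at 0, but every v ∈ [−1, 1] is a sub-gradient there, which the snippet below spot-checks.

import numpy as np

rng = np.random.default_rng(3)
w = 0.0                                  # the kink of f(x) = |x|
u = rng.uniform(-10, 10, 1000)

for v in np.linspace(-1.0, 1.0, 11):     # candidate sub-gradients at w = 0
    # sub-gradient inequality: f(u) >= f(w) + v * (u - w)
    assert np.all(np.abs(u) >= abs(w) + v * (u - w))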

SLIDE 12

Convex functions

Lemma (Convexity of a scalar function) Let f : R → R be a scalar twice differentiable function, and let f ′, f ′′ be its first and second derivatives, respectively. Then, the following are equivalent:

  • 1. f is convex.
  • 2. f ′ is monotonically nondecreasing.
  • 3. f ′′ is nonnegative.

Example (Convexity of scalar functions)

  • 1. The scalar function f(x) = x² is convex, because f ′(x) = 2x and f ′′(x) = 2 > 0.
  • 2. The scalar function f(x) = log (1 + e^x) is convex, because

◮ f ′(x) = e^x / (1 + e^x) = 1 / (e^−x + 1) is monotonically increasing, since the exponential function is monotonically increasing.
◮ f ′′(x) = e^−x / (e^−x + 1)² = f ′(x)(1 − f ′(x)) is nonnegative.
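A finite-difference spot-check of criterion 3 for f(x) = log(1 + e^x) (my own illustration; the grid and step size are arbitrary choices):

import numpy as np

f = lambda x: np.log1p(np.exp(x))    # f(x) = log(1 + e^x)
h = 1e-4
x = np.linspace(-20, 20, 2001)

f2 = (f(x + h) - 2 * f(x) + f(x - h)) / h**2   # central-difference estimate of f''
print(f2.min() >= -1e-6)   # True: the second derivative is (numerically) nonnegative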

SLIDE 13

Convex functions

Lemma (Convexity of composition of a convex scalar function with a linear function) Let f : Rn → R be written as f(w) = g(⟨w, x⟩ + y), for some x ∈ Rn, y ∈ R and g : R → R. Then convexity of g implies the convexity of f.

Proof (Convexity of composition of a convex scalar function with a linear function). Let w1, w2 ∈ Rn and α ∈ [0, 1]. We have

f(αw1 + (1 − α)w2) = g(⟨αw1 + (1 − α)w2, x⟩ + y)
                   = g(α⟨w1, x⟩ + (1 − α)⟨w2, x⟩ + y)
                   = g(α(⟨w1, x⟩ + y) + (1 − α)(⟨w2, x⟩ + y))
                   ≤ αg(⟨w1, x⟩ + y) + (1 − α)g(⟨w2, x⟩ + y)
                   = αf(w1) + (1 − α)f(w2),

where the inequality follows from the convexity of g.

Example (Convexity of composition of a convex scalar function with a linear function)

  • 1. Given some x ∈ Rn and y ∈ R, let f(w) = (⟨w, x⟩ − y)². Then, f is the composition of the function g(a) = a² with a linear function, and hence f is a convex function.
  • 2. Given some x ∈ Rn and y ∈ {−1, +1}, let f(w) = log (1 + exp (−y⟨w, x⟩)). Then, f is the composition of the function g(a) = log (1 + e^a) with a linear function, and hence f is a convex function.

SLIDE 14

Convex functions

Lemma (Convexity of maximum and sum of convex functions) Let fi : Rn → R (1 ≤ i ≤ r) be convex functions. The following functions g : Rn → R are convex.

  • 1. g(x) = max_{i∈{1,...,r}} fi(x).
  • 2. g(x) = Σ_{i=1}^r wi fi(x), where wi ≥ 0 for all i.

Proof (Convexity of maximum and sum of convex functions).

  • 1. The first claim follows from

g(αu + (1 − α)v) = max_i fi(αu + (1 − α)v)
                 ≤ max_i [αfi(u) + (1 − α)fi(v)]
                 ≤ α max_i fi(u) + (1 − α) max_i fi(v)
                 = αg(u) + (1 − α)g(v).

  • 2. The second claim follows from

g(αu + (1 − α)v) = Σ_{i=1}^r wi fi(αu + (1 − α)v)
                 ≤ Σ_{i=1}^r wi [αfi(u) + (1 − α)fi(v)]
                 = α Σ_{i=1}^r wi fi(u) + (1 − α) Σ_{i=1}^r wi fi(v)
                 = αg(u) + (1 − α)g(v).

Example: the function g(x) = |x| is convex, because g(x) = max{f1(x), f2(x)}, where both f1(x) = x and f2(x) = −x are convex.

SLIDE 15

Lipschitzness

SLIDE 16

Lipschitzness

◮ The definition of Lipschitzness below is w.r.t. the Euclidean norm on Rn, but it can be given w.r.t. any norm.

Definition (Lipschitzness) Let C ⊆ Rn. A function f : Rn → Rk is ρ-Lipschitz over C if for all w1, w2 ∈ C we have ‖f(w1) − f(w2)‖ ≤ ρ‖w1 − w2‖.

◮ A Lipschitz function cannot change too fast. If f : R → R is differentiable, then by the mean value theorem we have f(w1) − f(w2) = f ′(u)(w1 − w2), where u is a point between w1 and w2.

Theorem (Mean-Value Theorem) If f(x) is defined and continuous on the interval [a, b] and differentiable on (a, b), then there is at least one number c in the interval (a, b) (that is, a < c < b) such that f ′(c) = (f(b) − f(a)) / (b − a).

◮ If f ′ is bounded everywhere (in absolute value) by ρ, then f is ρ-Lipschitz.
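A crude empirical estimate of a Lipschitz constant (my own illustration): for f(x) = log(1 + e^x), the difference quotient over many random pairs stays below 1, consistent with the 1-Lipschitz claim on the next slide.

import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.log1p(np.exp(x))        # claimed to be 1-Lipschitz over R

x1 = rng.uniform(-50, 50, 100_000)
x2 = rng.uniform(-50, 50, 100_000)
ratios = np.abs(f(x1) - f(x2)) / np.abs(x1 - x2)
print(ratios.max())                      # stays below 1 (up to rounding)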

SLIDE 17

Lipschitzness

Example (Lipschitzness)

  • 1. Function f(x) = |x| is 1-Lipschitz over R, because (using the triangle inequality)

|x1| − |x2| = |x1 − x2 + x2| − |x2| ≤ |x1 − x2| + |x2| − |x2| = |x1 − x2|.

  • 2. Function f(x) = log (1 + e^x) is 1-Lipschitz over R, because

|f ′(x)| = e^x / (1 + e^x) = 1 / (e^−x + 1) ≤ 1.

  • 3. Function f(x) = x² is not ρ-Lipschitz over R for any ρ. Let x1 = 0 and x2 = 1 + ρ; then

f(x2) − f(x1) = (1 + ρ)² > ρ(1 + ρ) = ρ|x2 − x1|.

  • 4. Function f(x) = x² is ρ-Lipschitz over the set C = {x : |x| ≤ ρ/2}. For x1, x2 ∈ C, we have

|x1² − x2²| = |x1 − x2| |x1 + x2| ≤ 2(ρ/2)|x1 − x2| = ρ|x1 − x2|.

  • 5. The linear function f : Rn → R defined by f(w) = ⟨v, w⟩ + b, where v ∈ Rn, is ‖v‖-Lipschitz. By the Cauchy-Schwarz inequality, we have

|f(w1) − f(w2)| = |⟨v, w1 − w2⟩| ≤ ‖v‖ ‖w1 − w2‖.

SLIDE 18

Lipschitzness

The following lemma shows that the composition of Lipschitz functions preserves Lipschitzness.

Lemma (Composition of Lipschitz functions) Let f(x) = g1(g2(x)), where g1 is ρ1-Lipschitz and g2 is ρ2-Lipschitz. Then f is (ρ1ρ2)-Lipschitz. In particular, if g2 is the linear function g2(x) = ⟨v, x⟩ + b, for some v ∈ Rn and b ∈ R, then f is (ρ1‖v‖)-Lipschitz.

Proof (Composition of Lipschitz functions).

|f(w1) − f(w2)| = |g1(g2(w1)) − g1(g2(w2))| ≤ ρ1 ‖g2(w1) − g2(w2)‖ ≤ ρ1ρ2 ‖w1 − w2‖.
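A small empirical check of the composition lemma (my own sketch): with g1(a) = |a| (1-Lipschitz) and the linear g2(x) = ⟨v, x⟩ + b (‖v‖-Lipschitz), the difference quotients of f = g1 ∘ g2 stay below ‖v‖.

import numpy as np

rng = np.random.default_rng(5)
v, b = np.array([3.0, -4.0]), 0.5       # g2(x) = <v, x> + b is ||v||-Lipschitz, ||v|| = 5
f = lambda X: np.abs(X @ v + b)         # f = g1(g2(x)) with g1 = |.| (1-Lipschitz)

x1 = rng.normal(size=(100_000, 2))
x2 = rng.normal(size=(100_000, 2))
ratios = np.abs(f(x1) - f(x2)) / np.linalg.norm(x1 - x2, axis=1)
print(np.linalg.norm(v), ratios.max())  # the max ratio does not exceed ||v|| = 5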

SLIDE 19

Smoothness

SLIDE 20

Smoothness

◮ The definition of a smooth function relies on the notion of the gradient.
◮ Let f : Rn → R be differentiable at w, with gradient

∇f(w) = (∂f(w)/∂w1, . . . , ∂f(w)/∂wn).

◮ Smoothness of f is defined as follows.

Definition (Smoothness) A differentiable function f : Rn → R is β-smooth if its gradient is β-Lipschitz; namely, for all v, w we have ‖∇f(v) − ∇f(w)‖ ≤ β‖v − w‖.

◮ One can show that smoothness implies that for all v, w we have

f(v) ≤ f(w) + ⟨∇f(w), v − w⟩ + (β/2)‖v − w‖²,     (1)

while convexity of f implies that f(v) ≥ f(w) + ⟨∇f(w), v − w⟩.

◮ When a function is both convex and smooth, we have both upper and lower bounds on the difference between the function and its first-order approximation.

◮ Setting v = w − (1/β)∇f(w) in the right-hand side of (1), we obtain

(1/(2β))‖∇f(w)‖² ≤ f(w) − f(v).
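A numerical spot-check of the bound (1) and of the last displayed inequality (my own illustration), using f(w) = ‖w‖², which is β-smooth with β = 2 since ∇f(w) = 2w:

import numpy as np

rng = np.random.default_rng(6)
beta = 2.0
f = lambda w: np.dot(w, w)      # f(w) = ||w||^2 is 2-smooth
grad = lambda w: 2 * w

for _ in range(1000):
    w, v = rng.normal(size=3), rng.normal(size=3)
    # quadratic upper bound (1)
    assert f(v) <= f(w) + grad(w) @ (v - w) + beta / 2 * np.dot(v - w, v - w) + 1e-9
    # the step v = w - grad(w)/beta decreases f by at least ||grad(w)||^2 / (2*beta)
    v_step = w - grad(w) / beta
    assert np.dot(grad(w), grad(w)) / (2 * beta) <= f(w) - f(v_step) + 1e-9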

SLIDE 21

Smoothness

◮ We had

(1/(2β))‖∇f(w)‖² ≤ f(w) − f(v).

◮ If f(v) ≥ 0 for all v, then smoothness implies that

‖∇f(w)‖² ≤ 2βf(w).

◮ A function that satisfies this property is also called a self-bounded function.

Example (Smooth functions)

  • 1. Function f(x) = x² is 2-smooth. This follows from f ′(x) = 2x, which is 2-Lipschitz.
  • 2. Function f(x) = log (1 + e^x) is (1/4)-smooth. Since f ′(x) = 1 / (1 + e^−x), we have

f ′′(x) = e^−x / (1 + e^−x)² = 1 / ((1 + e^−x)(1 + e^x)) ≤ 1/4.

Hence f ′ is (1/4)-Lipschitz.
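A quick check of the self-bounded property for the (1/4)-smooth, nonnegative function f(x) = log(1 + e^x) (my own illustration):

import numpy as np

f = lambda x: np.log1p(np.exp(x))        # nonnegative and 1/4-smooth
fp = lambda x: 1.0 / (1.0 + np.exp(-x))  # f'(x)
beta = 0.25

x = np.linspace(-30, 30, 10_001)
print(np.all(fp(x) ** 2 <= 2 * beta * f(x) + 1e-12))   # self-bounded: |f'|^2 <= 2*beta*f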

SLIDE 22

Smoothness

Lemma (Composition of a smooth scalar function with a linear function) Let f(w) = g(⟨w, x⟩ + b), where g : R → R is a β-smooth function, x ∈ Rn and b ∈ R. Then, f is (β‖x‖²)-smooth.

Proof (Composition of a smooth scalar function with a linear function).

  • 1. By the chain rule we have ∇f(w) = g ′(⟨w, x⟩ + b) x.
  • 2. Using the smoothness of g and the Cauchy-Schwarz inequality, we obtain

f(v) = g(⟨v, x⟩ + b)
     ≤ g(⟨w, x⟩ + b) + g ′(⟨w, x⟩ + b)⟨v − w, x⟩ + (β/2)(⟨v − w, x⟩)²
     ≤ g(⟨w, x⟩ + b) + g ′(⟨w, x⟩ + b)⟨v − w, x⟩ + (β/2)(‖v − w‖ ‖x‖)²
     = f(w) + ⟨∇f(w), v − w⟩ + (β‖x‖²/2)‖v − w‖².

Example (Smooth functions)

  • 1. For any x ∈ Rn and y ∈ R, let f(w) = (⟨w, x⟩ − y)². Then, f is (2‖x‖²)-smooth.
  • 2. For any x ∈ Rn and y ∈ {±1}, let f(w) = log (1 + exp (−y⟨w, x⟩)). Then, f is (‖x‖²/4)-smooth.

SLIDE 23

Convex learning problems

SLIDE 24

Convex optimization

◮ Approximately solve

argmin_{w∈C} f(w),

where C is a convex set and f is a convex function.

Example (Convex optimization) The linear regression problem can be written as the following convex optimization problem (a small numerical sketch is given at the end of this slide):

argmin_{‖w‖≤1} (1/m) Σ_{i=1}^m (⟨w, xi⟩ − yi)².

◮ A special case is unconstrained minimization, C = Rn.
◮ The constrained and unconstrained forms can be reduced to one another:

  • 1. Adding the indicator function IC(w) of the set C (zero on C, infinite outside) to the objective eliminates the constraint.
  • 2. Adding the constraint f(w) ≤ f* + ǫ, where f* is the optimal objective value, eliminates the objective.
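A minimal sketch of solving the constrained regression problem above by projected gradient descent (my own illustration; the synthetic data, step size and iteration count are arbitrary choices, and the projection keeps the iterate in the convex set {w : ‖w‖ ≤ 1}):

import numpy as np

rng = np.random.default_rng(7)
m, n = 200, 5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)   # synthetic regression data

def project_ball(w, B=1.0):
    """Euclidean projection onto {w : ||w|| <= B}."""
    norm = np.linalg.norm(w)
    return w if norm <= B else B * w / norm

w, eta = np.zeros(n), 0.01
for _ in range(5000):
    grad = 2.0 / m * X.T @ (X @ w - y)   # gradient of the empirical squared loss
    w = project_ball(w - eta * grad)     # gradient step followed by projection

print(np.linalg.norm(w), np.mean((X @ w - y) ** 2))   # ||w|| <= 1 and the attained loss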

SLIDE 25

Learning problems

Definition (Agnostic PAC learnability) A hypothesis class H is agnostic PAC learnable with respect to a set Z and a loss function ℓ : H × Z → R+, if there exist a function mH : (0, 1)² → N and a learning algorithm A with the following property: For every ǫ, δ ∈ (0, 1) and for every distribution D over Z, when running the learning algorithm on m ≥ mH(ǫ, δ) i.i.d. examples generated by D, the algorithm returns h ∈ H such that, with probability of at least (1 − δ) (over the choice of the m training examples),

R(h) ≤ min_{h′∈H} R(h′) + ǫ,

where R(h) = Ez∼D [ℓ(h, z)]. In this definition, we have

  • 1. a hypothesis class H,
  • 2. a set of examples Z, and
  • 3. a loss function ℓ : H × Z → R+

Now, we consider hypothesis classes H that are subsets of the Euclidean space Rn; therefore, we denote a hypothesis in H by w.

SLIDE 26

Convex learning problems

Definition (Convex learning problems) A learning problem (H, Z, ℓ) is called convex if

  • 1. the hypothesis class H is a convex set, and
  • 2. for all z ∈ Z, the loss function, ℓ(., z), is a convex function, where, for any z, ℓ(., z) denotes the function f : H → R defined by f(w) = ℓ(w, z).

Example (Linear regression with the squared loss)

  • 1. The domain set is X ⊂ Rn and the label set is Y ⊂ R, a set of real numbers.
  • 2. We need to learn a linear function h : Rn → R that best approximates the relationship between our variables.
  • 3. Let H be the set of homogeneous linear functions H = {x → ⟨w, x⟩ | w ∈ Rn}.
  • 4. Let the squared loss function ℓ(h, (x, y)) = (h(x) − y)² be used to measure the error.
  • 5. This is a convex learning problem because

◮ Each linear function is parameterized by a vector w ∈ Rn. Hence, H = Rn.
◮ The set of examples is Z = X × Y = Rn × R = Rn+1.
◮ The loss function is ℓ(w, (x, y)) = (⟨w, x⟩ − y)².
◮ Clearly, H is a convex set and ℓ(., .) is also convex with respect to its first argument.

SLIDE 27

Convex learning problems

Lemma (Convex learning problems) If ℓ is a convex loss function and the class H is convex, then the ermH problem, of minimizing the empirical loss over H, is a convex optimization problem (that is, a problem of minimizing a convex function over a convex set). Proof (Convex learning problems).

  • 1. The ermH problem is defined as

ermH(S) = argmin_{w∈H} ˆR(w).

  • 2. For a sample S = {z1, . . . , zm}, we have ˆR(w) = (1/m) Σ_{i=1}^m ℓ(w, zi) for every w. Since each ℓ(., zi) is convex, the Lemma (Convexity of maximum and sum of convex functions) implies that ˆR(w) is a convex function.
  • 3. Therefore, the ermH rule is a problem of minimizing a convex function subject to the constraint that the solution should be in a convex set.
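For instance, with the squared loss and the unconstrained class H = Rn, the ermH problem is the familiar least-squares problem; a small sketch (my own illustration with synthetic data):

import numpy as np

rng = np.random.default_rng(8)
m, n = 100, 3
X = rng.normal(size=(m, n))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=m)

# ERM for the squared loss over H = R^n is an (unconstrained) least-squares problem
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)

emp_risk = lambda w: np.mean((X @ w - y) ** 2)
print(w_erm, emp_risk(w_erm))

# convexity: no other w achieves a smaller empirical risk (spot-check on random candidates)
assert all(emp_risk(w_erm) <= emp_risk(w_erm + rng.normal(size=n)) for _ in range(100))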

SLIDE 28

Learnability of convex learning problems

◮ We have seen that in many cases implementing the erm rule for convex learning problems can be done efficiently.
◮ Is convexity a sufficient condition for the learnability of a problem?
◮ In VC theory, we saw that halfspaces in n dimensions are learnable (perhaps inefficiently).
◮ Using the discretization trick, if a problem has n parameters, it is learnable with a sample complexity that is a function of n.
◮ That is, for a constant n, the problem should be learnable.
◮ Maybe all convex learning problems over Rn are learnable?
◮ The answer is negative, even when n is low (show that linear regression is not learnable even if n = 1).
◮ Hence, not all convex learning problems over Rn are learnable.
◮ Under some additional restricting conditions that hold in many practical scenarios, convex problems are learnable.
◮ A possible solution is to add another constraint on the hypothesis class.
◮ In addition to the convexity requirement, we require that H be bounded (i.e., for some predefined scalar B, every hypothesis w ∈ H satisfies ‖w‖ ≤ B).
◮ Boundedness and convexity alone are still not sufficient for ensuring that the problem is learnable (show that linear regression with the squared loss and H = {w | |w| ≤ 1} ⊂ R is not learnable).

SLIDE 29

Convex-Lipschitz-bounded learning problems

Definition (Convex-Lipschitz-bounded learning problems) A learning problem (H, Z, ℓ) is called convex-Lipschitz-bounded, with parameters ρ, B, if the following hold.

  • 1. The hypothesis class H is a convex set, and for all w ∈ H we have ‖w‖ ≤ B.
  • 2. For all z ∈ Z, the loss function, ℓ(., z), is a convex and ρ-Lipschitz function.

Example (Linear regression with the absolute-value loss)

  • 1. Let X = {x ∈ Rn | ‖x‖ ≤ ρ} and Y ⊂ R.
  • 2. Let H = {w ∈ Rn | ‖w‖ ≤ B}.
  • 3. Let the loss function be ℓ(w, (x, y)) = |⟨w, x⟩ − y|.
  • 4. Then, this problem is convex-Lipschitz-bounded with parameters ρ, B.

SLIDE 30

Convex-smooth-bounded learning problems

Definition (Convex-smooth-bounded learning problems) A learning problem (H, Z, ℓ) is called convex-smooth-bounded, with parameters β, B, if the following hold.

  • 1. The hypothesis class H is a convex set, and for all w ∈ H we have ‖w‖ ≤ B.
  • 2. For all z ∈ Z, the loss function, ℓ(., z), is a convex, nonnegative and β-smooth function.

Example (Linear regression with the squared loss)

  • 1. Let X = {x ∈ Rn | ‖x‖ ≤ β/2} and Y ⊂ R.
  • 2. Let H = {w ∈ Rn | ‖w‖ ≤ B}.
  • 3. Let the loss function be ℓ(w, (x, y)) = (⟨w, x⟩ − y)².
  • 4. Then, this problem is convex-smooth-bounded with parameters β, B.

Lemma (Learnability of Convex-Lipschitz/-smooth-bounded learning problems) The following two families of learning problems are learnable.

  • 1. Convex-smooth-bounded learning problems.
  • 2. Convex-Lipschitz-bounded learning problems.

That is, the properties of convexity, boundedness, and Lipschitzness or smoothness of the loss function are sufficient for learnability.

SLIDE 31

Surrogate loss functions

SLIDE 32

Surrogate loss functions

◮ In many cases, the loss function is not convex and, hence, implementing the ERM rule is hard.
◮ Consider the problem of learning halfspaces with respect to the 0-1 loss,

ℓ0−1(w, (x, y)) = I [y ≠ sgn (⟨w, x⟩)] = I [y⟨w, x⟩ ≤ 0].

◮ This loss function is not convex with respect to w.
◮ When trying to minimize ˆR(w) with respect to this loss function, we might encounter local minima.
◮ We also showed that solving the ERM problem with respect to the 0-1 loss in the unrealizable case is known to be NP-hard.
◮ One popular approach is to upper bound the nonconvex loss function by a convex surrogate loss function.

◮ The requirements from a convex surrogate loss are as follows:

  • 1. It should be convex.
  • 2. It should upper bound the original loss.

SLIDE 33

Hinge-loss

◮ The hinge-loss function is defined as

ℓhinge(w, (x, y)) = max{0, 1 − y⟨w, x⟩}.

◮ Hinge-loss has the following two properties

  • 1. For all w and all (x, y), we have ℓ0−1(w, (x, y)) ≤ ℓhinge(w, (x, y)).
  • 2. Hinge-loss is a convex function.

[Figure: the 0−1 loss ℓ0−1 and the hinge loss ℓhinge plotted as functions of y⟨w, x⟩; the hinge loss lies above the 0−1 loss and equals zero for y⟨w, x⟩ ≥ 1.]

◮ Hence, the hinge loss satisfies the requirements of a convex surrogate loss function for the zero-one loss.
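A small numerical confirmation of the two requirements (my own illustration), working with the margin y⟨w, x⟩ as a single scalar:

import numpy as np

rng = np.random.default_rng(9)
zero_one = lambda margin: (margin <= 0).astype(float)     # 0-1 loss as a function of y<w, x>
hinge = lambda margin: np.maximum(0.0, 1.0 - margin)      # hinge loss

margins = rng.uniform(-5, 5, 100_000)
assert np.all(hinge(margins) >= zero_one(margins))        # requirement 2: upper bounds the 0-1 loss

u, v = rng.uniform(-5, 5, 1000), rng.uniform(-5, 5, 1000)
a = rng.uniform(size=1000)
assert np.all(hinge(a * u + (1 - a) * v) <= a * hinge(u) + (1 - a) * hinge(v) + 1e-12)  # requirement 1: convexity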

SLIDE 34

Error decomposition revisited

◮ Suppose we have a learner for the hinge loss that guarantees

Rhinge(A(S)) ≤ min_{w∈H} Rhinge(w) + ǫ.

◮ Using the surrogate property (ℓ0−1 ≤ ℓhinge pointwise, hence R0−1 ≤ Rhinge),

R0−1(A(S)) ≤ min_{w∈H} Rhinge(w) + ǫ.

◮ We can further rewrite the upper bound as

R0−1(A(S)) ≤ min_{w∈H} R0−1(w) + [min_{w∈H} Rhinge(w) − min_{w∈H} R0−1(w)] + ǫ
           = ǫapproximation + ǫoptimization + ǫestimation.

◮ The optimization error is a result of our inability to minimize the training loss with respect to the original loss.

SLIDE 35

Assignments

SLIDE 36

Assignments

  • 1. Specify to which category of problems each of the following learning problems belongs.

◮ Support vector regression (SVR)
◮ Kernel ridge regression
◮ Least absolute shrinkage and selection operator (Lasso)
◮ Support vector machine (SVM)
◮ Logistic regression
◮ AdaBoost

Prove your claim.

  • 2. Prove the Lemma (Learnability of Convex-Lipschitz/-smooth-bounded learning problems).

SLIDE 37

Summary

SLIDE 38

Summary

◮ We introduced two families of learning problems:

  • 1. Convex-Lipschitz-bounded learning problems.
  • 2. Convex-smooth-bounded learning problems.

◮ There are generic learning algorithms, such as the stochastic gradient descent algorithm, for solving these problems (please read Chapter 14).

◮ We also introduced the notion of a convex surrogate loss function, which enables us to utilize the convex machinery for nonconvex problems as well.

SLIDE 39

Readings

  • 1. Chapters 12 and 14 of Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

SLIDE 40

References

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

SLIDE 41

Questions?
