 
CS257 Linear and Convex Optimization
Lecture 1
Bo Jiang
John Hopcroft Center for Computer Science, Shanghai Jiao Tong University
September 7, 2020
Contents
1. Mathematical Optimization
2. Global and Local Optima
Mathematical Optimization Problems

$$\min_{x} \; f(x) \quad \text{subject to } x \in X \qquad \text{or} \qquad \min_{x \in X} f(x)$$

• $f : \mathbb{R}^n \to \mathbb{R}$: objective function
• $x = (x_1, x_2, \dots, x_n)^T \in \mathbb{R}^n$: optimization/decision variables
• $X \subset \mathbb{R}^n$: feasible set or constraint set
  ◮ $x$ is called feasible if $x \in X$ and infeasible if $x \notin X$.

Maximizing $f$ is equivalent to minimizing $-f$; we will focus on minimization.

The problem is unconstrained if $X = \mathbb{R}^n$ and constrained if $X \neq \mathbb{R}^n$.

$X$ is often specified by constraint functions,

$$\min_{x} \; f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \quad i = 1, 2, \dots, m$$

General optimization problems are very difficult; we will focus on convex optimization problems (to be defined later).
Example: Data Fitting

Recall Hooke's law in physics,

$$F = -k(x - x_0) = -kx + b, \quad \text{where } b = k x_0$$

• $F$: force
• $x$: length
• $k$: spring constant
• $x_0$: length at rest

Given $m$ measurements $(x_1, F_1), (x_2, F_2), \dots, (x_m, F_m)$,

$$F_i = -k x_i + b + \epsilon_i$$

• $\epsilon_i$: measurement error

find $k$, $b$ by fitting a line through the data.

[Figure: data points and fitted line, with axes $x$ and $F$]

Least squares criterion,

$$\min_{k > 0,\, b > 0} \; \sum_{i=1}^{m} \epsilon_i^2 = \sum_{i=1}^{m} (F_i + k x_i - b)^2$$
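A minimal sketch of the line fit above, using the textbook closed-form solution of the one-variable least squares problem. The function name `fit_line` and the synthetic measurements are assumptions made for illustration, and the sign constraints $k > 0$, $b > 0$ are not enforced here; they happen to hold for this data.

```python
def fit_line(xs, Fs):
    """Least-squares fit of F = -k*x + b; returns (k, b)."""
    m = len(xs)
    x_mean = sum(xs) / m
    F_mean = sum(Fs) / m
    # Centered sums used by the closed-form slope of a simple linear fit
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxF = sum((x - x_mean) * (F - F_mean) for x, F in zip(xs, Fs))
    slope = sxF / sxx            # slope of F against x, which equals -k
    k = -slope
    b = F_mean + k * x_mean      # from F_mean = -k * x_mean + b
    return k, b

# Hypothetical noiseless data generated with k = 2 and x0 = 0.5, so b = k * x0 = 1
xs = [0.6, 0.8, 1.0, 1.2]
Fs = [-2.0 * x + 1.0 for x in xs]
k, b = fit_line(xs, Fs)
print(k, b)
```

With noisy measurements the recovered values would only approximate the true $k$ and $b$.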
Example: Linear Least Squares Regression

A linear model predicts a response/target by a linear combination of predictors/features (plus an intercept/bias),

$$\hat{y} = f(x) = b + \sum_{i=1}^{n} w_i x_i = w^T x + b$$

Given $m$ data points $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$, linear (least squares) regression finds $w$ and $b$ by minimizing the sum of squared errors,

$$\min_{w \in \mathbb{R}^n,\, b \in \mathbb{R}} \; \sum_{i=1}^{m} (f(x_i) - y_i)^2 = \sum_{i=1}^{m} (w^T x_i + b - y_i)^2$$

In a more compact form,

$$\min_{w \in \mathbb{R}^n,\, b \in \mathbb{R}} \; \| Xw + b\mathbf{1} - y \|^2$$

• $X = (x_1, \dots, x_m)^T \in \mathbb{R}^{m \times n}$, $y = (y_1, \dots, y_m)^T \in \mathbb{R}^m$
• $\mathbf{1} = (1, 1, \dots, 1)^T \in \mathbb{R}^m$
• $\|z\| = \sqrt{z^T z} = \sqrt{\sum_{i=1}^{n} z_i^2}$ for $z = (z_1, \dots, z_n)^T \in \mathbb{R}^n$
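As a quick sanity check, the sketch below evaluates the objective both as a sum of squared errors and in the compact form $\|Xw + b\mathbf{1} - y\|^2$, and the two agree. The data, the values of `w` and `b`, and the function names are made up for illustration; nothing is being optimized here.

```python
import math

def sum_sq_errors(X, y, w, b):
    """Sum over data points of (w^T x_i + b - y_i)^2."""
    return sum(
        (sum(wj * xj for wj, xj in zip(w, x_i)) + b - y_i) ** 2
        for x_i, y_i in zip(X, y)
    )

def compact_objective(X, y, w, b):
    """Squared Euclidean norm of the residual vector Xw + b*1 - y."""
    r = [sum(wj * xj for wj, xj in zip(w, row)) + b - y_i
         for row, y_i in zip(X, y)]
    norm = math.sqrt(sum(ri * ri for ri in r))
    return norm ** 2

# Hypothetical data: m = 3 points in R^2, one row of X per data point
X = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
y = [1.0, 0.0, 2.0]
w = [0.5, -0.5]
b = 0.25

sse = sum_sq_errors(X, y, w, b)
compact = compact_objective(X, y, w, b)
print(sse, compact)
```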
Example: Shipping Problem

• need to ship products from $n$ warehouses to $m$ customers
• inventory at warehouse $i$ is $a_i$, $i = 1, 2, \dots, n$
• quantity ordered by customer $j$ is $b_j$, $j = 1, 2, \dots, m$
• unit shipping cost from warehouse $i$ to customer $j$ is $c_{ij}$

Let $x_{ij}$ be the quantity shipped from warehouse $i$ to customer $j$.

Minimize total cost by solving the following linear program

$$
\begin{aligned}
\min_{(x_{ij})} \quad & \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij} x_{ij} \\
\text{s.t.} \quad & \sum_{i=1}^{n} x_{ij} = b_j \quad \text{for } j = 1, 2, \dots, m \\
& \sum_{j=1}^{m} x_{ij} \le a_i \quad \text{for } i = 1, 2, \dots, n \\
& x_{ij} \ge 0 \quad \text{for } i = 1, 2, \dots, n;\; j = 1, 2, \dots, m
\end{aligned}
$$
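To make the constraints concrete, here is a small sketch that checks whether a given shipping plan is feasible and evaluates its cost. It does not solve the linear program; the instance (2 warehouses, 3 customers), the plans, and the helper names are all hypothetical.

```python
def total_cost(c, x):
    """Objective: sum over i, j of c[i][j] * x[i][j]."""
    return sum(cij * xij for c_row, x_row in zip(c, x)
               for cij, xij in zip(c_row, x_row))

def is_feasible(x, a, b, tol=1e-9):
    """Check demand equalities, inventory limits, and nonnegativity."""
    n, m = len(a), len(b)
    demand_met = all(abs(sum(x[i][j] for i in range(n)) - b[j]) <= tol
                     for j in range(m))
    supply_ok = all(sum(x[i][j] for j in range(m)) <= a[i] + tol
                    for i in range(n))
    nonneg = all(x[i][j] >= -tol for i in range(n) for j in range(m))
    return demand_met and supply_ok and nonneg

a = [30, 20]                  # inventory at each warehouse
b = [10, 25, 15]              # quantity ordered by each customer
c = [[2, 4, 5], [3, 1, 2]]    # unit shipping costs c[i][j]

plan = [[10, 5, 15], [0, 20, 0]]      # respects all constraints
bad_plan = [[10, 25, 15], [0, 0, 0]]  # warehouse 0 would ship 50 > 30

print(is_feasible(plan, a, b), total_cost(c, plan))
print(is_feasible(bad_plan, a, b))
```

The feasible plan shown is only one point of the feasible set; finding the cheapest one is the job of an LP solver.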
Example: Binary Classification

Represent an image by a vector $x \in \mathbb{R}^n$, with label $y \in \{+1, -1\}$.

Given a set of images with labels $(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)$, we want a function $f : \mathbb{R}^n \to \mathbb{R}$, called a classifier, such that

$$
\begin{cases}
f(x_i) > 0, & \text{iff } y_i = +1 \\
f(x_i) < 0, & \text{iff } y_i = -1
\end{cases}
\iff y_i f(x_i) > 0
$$

Once we find $f$, we can use $\hat{y} = \operatorname{sign}[f(x)]$ to classify new images.

How to find $f$? Let's consider linear classifiers, i.e. $f(x) = w^T x + b$.
Example: Binary Classification (cont'd)

Assume the data is linearly separable, i.e. there exists a hyperplane $w^T x + b = 0$ s.t.

$$y_i (w^T x_i + b) > 0, \quad \forall i$$

There may exist many such hyperplanes. We want to maximize the minimum distance to the hyperplane
• more robust against noise

Support vector machine: linear classifier with maximum margin

$$
\begin{aligned}
\max_{w,\, b} \; \min_{1 \le i \le m} \quad & \frac{|w^T x_i + b|}{\|w\|} \\
\text{s.t.} \quad & y_i (w^T x_i + b) > 0, \quad i = 1, 2, \dots, m
\end{aligned}
$$

This can be reformulated as an equivalent convex optimization problem yielding the same optimal hyperplane.
Example: Binary Classification (cont'd)

Support vector machine: linear classifier with maximum margin, in reformulated form

$$
\begin{aligned}
\min_{w,\, b} \quad & \frac{1}{2} \|w\|^2 \\
\text{s.t.} \quad & y_i (w^T x_i + b) \ge 1, \quad i = 1, 2, \dots, m
\end{aligned}
$$

We will see this is a convex optimization problem.
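SVM Problem reformulation

• Note $|w^T x_i + b| = y_i (w^T x_i + b)$, as $y_i = \operatorname{sgn}(w^T x_i + b)$.
• For $\alpha > 0$, $\tilde{w} = \alpha w$ and $\tilde{b} = \alpha b$ determine the same hyperplane $P$,

$$x \in P \iff w^T x + b = 0 \iff \tilde{w}^T x + \tilde{b} = 0$$

• Choosing $\alpha$ properly, we can assume $\min_{1 \le i \le m} y_i (\tilde{w}^T x_i + \tilde{b}) = 1$,

$$
\begin{aligned}
\max_{\tilde{w},\, \tilde{b}} \quad & \frac{1}{\|\tilde{w}\|} \\
\text{s.t.} \quad & y_i (\tilde{w}^T x_i + \tilde{b}) \ge 1, \quad i = 1, 2, \dots, m
\end{aligned}
$$

• Maximizing $1/z$ is equivalent to minimizing $\frac{1}{2} z^2$ (for $z > 0$),

$$
\begin{aligned}
\min_{\tilde{w},\, \tilde{b}} \quad & \frac{1}{2} \|\tilde{w}\|^2 \\
\text{s.t.} \quad & y_i (\tilde{w}^T x_i + \tilde{b}) \ge 1, \quad i = 1, 2, \dots, m
\end{aligned}
$$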
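The rescaling step can be illustrated with a short sketch: pick $\alpha = 1 / \min_i y_i (w^T x_i + b)$, scale $(w, b)$, and observe that the smallest value of $y_i(\tilde{w}^T x_i + \tilde{b})$ becomes 1 while $1/\|\tilde{w}\|$ equals the minimum distance to the hyperplane. The toy data, the separating hyperplane, and the function names are assumptions chosen for illustration.

```python
import math

def margins(w, b, X, y):
    """Values y_i * (w^T x_i + b) for every data point."""
    return [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for xi, yi in zip(X, y)]

def normalize(w, b, X, y):
    """Scale (w, b) by alpha > 0 so that the smallest margin value is 1.

    Assumes (w, b) already separates the data, i.e. all margins are positive.
    """
    alpha = 1.0 / min(margins(w, b, X, y))
    return [alpha * wj for wj in w], alpha * b

# Hypothetical linearly separable data in R^2
X = [[2.0, 0.0], [3.0, 1.0], [-1.0, 0.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w, b = [2.0, 0.0], -1.0      # the hyperplane x_1 = 0.5

w_t, b_t = normalize(w, b, X, y)
min_margin = min(margins(w_t, b_t, X, y))
geometric_margin = 1.0 / math.sqrt(sum(wj * wj for wj in w_t))
print(w_t, b_t, min_margin, geometric_margin)
```

Here the closest points lie at distance 1.5 from the hyperplane, which matches $1/\|\tilde{w}\|$ after normalization.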
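Appendix: Distance to Hyperplane

[Figure: point $x_i$, its projection $x_i'$ onto the hyperplane $P: w^T x + b = 0$, the normal vector $w$, and the origin $O$]

• $w \perp$ hyperplane $P: w^T x + b = 0$
• $x_i'$ is the orthogonal projection of $x_i$ onto $P$, i.e.

$$x_i - x_i' \perp P, \qquad w^T x_i' + b = 0$$

• $x_i - x_i' = \gamma_i w$ for some $\gamma_i \in \mathbb{R}$,

$$w^T (x_i - \gamma_i w) + b = 0 \implies \gamma_i = \frac{w^T x_i + b}{w^T w}$$

• distance from $x_i$ to $P$ is

$$\min_{y \in P} \|x_i - y\| = \|x_i - x_i'\| = \|\gamma_i w\| = \frac{|w^T x_i + b|}{\|w\|}$$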
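The derivation above can be checked numerically. The sketch below computes $\gamma_i$, the projection $x_i' = x_i - \gamma_i w$, and the distance formula for one made-up point and hyperplane; the helper names are illustrative only.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def project(w, b, x):
    """Orthogonal projection of x onto the hyperplane w^T x + b = 0."""
    gamma = (dot(w, x) + b) / dot(w, w)
    return [xi - gamma * wi for xi, wi in zip(x, w)]

def distance_to_hyperplane(w, b, x):
    """|w^T x + b| / ||w||."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0 and point (3, 4)
w, b = [3.0, 4.0], -5.0
x = [3.0, 4.0]

x_proj = project(w, b, x)
dist = distance_to_hyperplane(w, b, x)
# Distance measured directly between x and its projection
direct = math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, x_proj)))
print(x_proj, dist, direct)
```

The projection lies on the hyperplane, and the directly measured distance agrees with the closed-form expression.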
Soft Margin SVM

Hard margin SVM requires linear separability

$$
\begin{aligned}
\min_{w,\, b} \quad & \frac{1}{2} \|w\|^2 \\
\text{s.t.} \quad & y_i (w^T x_i + b) \ge 1, \quad \forall i
\end{aligned}
$$

When the data is not linearly separable,
• relax constraints
• penalize deviation

Soft margin SVM: introduce slack variables $\xi = (\xi_1, \dots, \xi_m)^T$

$$
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{m} \xi_i \qquad (C > 0 \text{ is a hyperparameter}) \\
\text{s.t.} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \dots, m \\
& \xi \ge 0 \quad (\text{i.e. } \xi_i \ge 0, \; i = 1, 2, \dots, m)
\end{aligned}
$$
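For a fixed $(w, b)$, the smallest slack satisfying both constraints is $\xi_i = \max(0,\, 1 - y_i(w^T x_i + b))$, so the soft margin objective can be evaluated directly. The sketch below does this for a toy data set; the data, the choice $C = 10$, and the function names are assumptions for illustration, and no optimization over $(w, b)$ is performed.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) using the smallest feasible slack per point."""
    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    objective = 0.5 * dot(w, w) + C * sum(slacks)
    return objective, slacks

# Hypothetical data: the second point lies inside the margin
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [1, 1, -1]
w, b, C = [1.0, 0.0], 0.0, 10.0

objective, slacks = soft_margin_objective(w, b, X, y, C)
print(objective, slacks)
```

Only the point inside the margin has nonzero slack, and a larger $C$ penalizes that violation more heavily.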
Global Optima

$x^* \in X$ is a global minimum¹ of $f$ if

$$f(x^*) \le f(x), \quad \forall x \in X$$

It is also called an optimal solution of the minimization problem

$$\min_{x \in X} f(x) \tag{P}$$

and $f(x^*)$ is the optimal value of (P).

A global maximum is defined by reversing the direction of the inequality. Maxima and minima are collectively called extrema.

Note. Global extrema may not exist.
• $f(x) = x$, $X = \mathbb{R}$: $\inf_{x \in X} f(x) = -\infty$, unbounded from below
• $f(x) = x$, $X = (0, 1)$: $\inf_{x \in X} f(x) = 0$, but not achievable

¹ Global minimum often also refers to the minimum value $f(x^*)$.
Math Review

Euclidean inner product on $\mathbb{R}^n$: $\langle x, y \rangle = x^T y = \sum_{i=1}^{n} x_i y_i$

Euclidean norm (2-norm): $\|x\|_2 = \sqrt{x^T x} = \sqrt{\sum_{i=1}^{n} x_i^2}$

A norm on $\mathbb{R}^n$ is a function $\|\cdot\| : \mathbb{R}^n \to \mathbb{R}$ satisfying
1. $\|x\| \ge 0$, $\forall x \in \mathbb{R}^n$
2. $\|x\| = 0$ iff $x = 0$
3. $\|ax\| = |a| \|x\|$, $\forall a \in \mathbb{R}$, $x \in \mathbb{R}^n$ (positive homogeneity)
4. $\|x + y\| \le \|x\| + \|y\|$, $\forall x, y \in \mathbb{R}^n$ (triangle inequality)

Example.
• 1-norm: $\|x\|_1 = \sum_{i=1}^{n} |x_i|$
• $p$-norm: $\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$, $p \ge 1$
• $\infty$-norm: $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$

For the $p$-norm, property 4 is given by Minkowski's inequality.

By default, $\|x\|$ means $\|x\|_2$.
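A short sketch of the norms listed above, written with the standard library only. The function name `p_norm` and the test vectors are illustrative; the final check evaluates the triangle inequality for a few values of $p$ on one pair of vectors, which is an example rather than a proof.

```python
def p_norm(x, p):
    """p-norm of a vector x for p >= 1, with p = float('inf') for the max norm."""
    if p == float("inf"):
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0]
y = [1.0, 2.0]

norm_1 = p_norm(x, 1)                 # |3| + |-4|
norm_2 = p_norm(x, 2)                 # sqrt(9 + 16)
norm_inf = p_norm(x, float("inf"))    # max(|3|, |-4|)

# Triangle inequality ||x + y|| <= ||x|| + ||y|| for several p
x_plus_y = [xi + yi for xi, yi in zip(x, y)]
triangle_ok = all(
    p_norm(x_plus_y, p) <= p_norm(x, p) + p_norm(y, p) + 1e-12
    for p in (1, 2, 3, float("inf"))
)
print(norm_1, norm_2, norm_inf, triangle_ok)
```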
Math Review

Open ball of radius $r$ centered at $x_0$

$$B(x_0, r) = \{ x : \|x - x_0\| < r \}$$

Closed ball of radius $r$ centered at $x_0$

$$\bar{B}(x_0, r) = \{ x : \|x - x_0\| \le r \}$$

[Figure: unit balls in $\mathbb{R}^2$ with different norms ($\infty$-norm, 1-norm, 2-norm)]
Math Review (cont'd)

[Figure: unit balls in $\mathbb{R}^3$ with different norms ($\infty$-norm, 1-norm, 2-norm), drawn on $x$, $y$, $z$ axes]
Math Review

A set $S$ is open if for any $x \in S$, there exists $\epsilon > 0$ s.t. $B(x, \epsilon) \subset S$.

A set $S$ is closed if its complement $S^c$ is open.

Examples in $\mathbb{R}$.
• $(0, 1)$ is open.
• $[0, 1]$ is closed.
• $(0, 1]$ is neither open nor closed.
• $[1, \infty)$ is closed.

A sequence $\{x_n\}$ converges to $x$, denoted $x_n \to x$ or $\lim_{n \to \infty} x_n = x$, if

$$\lim_{n \to \infty} \|x - x_n\| = 0$$

Note. In $\mathbb{R}^n$, if $x_n \to x$ in one norm, it converges in any norm.

Theorem. $S$ is closed iff for any sequence $\{x_n\} \subset S$, $x_n \to x \implies x \in S$.
Math Review

A set $S$ is bounded if there exists $M < \infty$ s.t. $\|x\| \le M$, $\forall x \in S$.

A set $S \subset \mathbb{R}^n$ is compact if it is closed and bounded.

Examples in $\mathbb{R}$.
• $[0, 1]$ is compact
• $(0, 1)$, $(0, 1]$ and $[1, \infty)$ are not compact

A function $f : X \subset \mathbb{R}^n \to \mathbb{R}$ is continuous at $x$ if for any $\epsilon > 0$, there exists $\delta > 0$ s.t.

$$y \in X \cap B(x, \delta) \implies |f(y) - f(x)| < \epsilon$$

Equivalently, $f$ is continuous at $x \in X$ if

$$\forall \{x_n\} \subset X, \quad x_n \to x \implies f(x_n) \to f(x)$$

$f$ is continuous on $X$ if it is continuous at every $x \in X$.