
Chapter 22 Learning, Linear Separability and Linear Programming

CS 573: Algorithms, Fall 2013 November 12, 2013

22.1 The Perceptron algorithm

22.1.0.1 Labeling...

(A) Given examples: a database of cars. (B) We would like to determine which cars are sports cars. (C) Each car record is interpreted as a point in high dimensions. (D) Example: a sports car with 4 doors, manufactured in 1997 by Quaky (with manufacturer ID 6): (4, 1997, 6). Labeled as a sports car. (E) A tractor by General Mess (manufacturer ID 3) in 1998: (0, 1998, 3). Labeled as not a sports car. (F) Real world: hundreds of attributes, in some cases even millions of attributes! (G) Goal: automate this classification process, labeling sports/regular cars automatically.

22.1.0.2 Automatic classification...

(A) A learning algorithm: (a) is given several (or many) classified examples... (b) ...develops its own conjecture for a classification rule... (c) ...and can then use it to classify new data. (B) Learning: training + classifying. (C) Learn a function f : ℝ^d → {−1, 1}. (D) Challenge: f might have infinite complexity... (E) ...a rare situation in the real world. Assume learnable functions. (F) Example: red and blue points that are linearly separable. (G) We are trying to learn a line ℓ that separates the red points from the blue points.
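As a concrete toy illustration of such a function f : ℝ^d → {−1, 1}, here is a sketch in Python. The car records are the ones from the text; the weight vector is a made-up illustration value, not a learned classifier.

```python
# Sketch of a classification function f : R^d -> {-1, 1}.
# The records are the car/tractor examples from the text; the weights
# below are hypothetical illustration values, not a trained classifier.

def f(x, a, b):
    """Evaluate sign(<a, x> + b): +1 means 'sports car', -1 means 'not'."""
    value = sum(ai * xi for ai, xi in zip(a, x)) + b
    return 1 if value >= 0 else -1

# (doors, year, manufacturer id) records from the text.
sports_car = (4, 1997, 6)    # labeled +1
tractor    = (0, 1998, 3)    # labeled -1

# A hypothetical linear rule that happens to separate these two records.
a, b = (1.0, 0.0, 0.0), -2.0
print(f(sports_car, a, b), f(tractor, a, b))
```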


22.1.0.3 Linear separability example...

22.1.0.4 Learning linear separation

(A) Given red and blue points, how do we compute the separating line ℓ? (B) A line/plane/hyperplane is the zero set of a linear function. (C) Form: for all x ∈ ℝ^d, f(x) = ⟨a, x⟩ + b, where a = (a1, ..., ad) ∈ ℝ^d and b ∈ ℝ, and ⟨a, x⟩ = ∑_i a_i x_i is the dot product of a and x. (D) Classification is done by computing the sign of f(x): sign(f(x)). (E) If sign(f(x)) is negative, x is not in the class; if positive, it is. (F) A set of training examples: S = {(x1, y1), ..., (xn, yn)}, where xi ∈ ℝ^d and yi ∈ {−1, 1}, for i = 1, ..., n.

22.1.0.5 Classification...

(A) A linear classifier h: a pair (w, b), where w ∈ ℝ^d and b ∈ ℝ. (B) The classification of x ∈ ℝ^d is sign(⟨w, x⟩ + b). (C) For a labeled example (x, y), h classifies (x, y) correctly if sign(⟨w, x⟩ + b) = y. (D) Assume a linear classifier exists. (E) Given n labeled examples, how do we compute a linear classifier for them? (F) Use linear programming... (G) We are looking for (w, b) such that for all (xi, yi) we have sign(⟨w, xi⟩ + b) = yi, which is ⟨w, xi⟩ + b ≥ 0 if yi = 1, and ⟨w, xi⟩ + b ≤ 0 if yi = −1.
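The search for (w, b) satisfying these sign constraints can be handed to an off-the-shelf LP solver. A sketch using scipy.optimize.linprog (assumed available): since LP solvers want non-strict constraints, we ask for yi(⟨w, xi⟩ + b) ≥ 1, which is equivalent up to scaling whenever the points are strictly separable. The data at the bottom is made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def lp_classifier(X, y):
    """Find (w, b) with y_i * (<w, x_i> + b) >= 1 for all i, as an LP
    feasibility problem (the objective is 0).  X: n x d array, y: +/-1
    labels.  Returns (w, b), or None if no separating classifier exists."""
    n, d = X.shape
    # Variables are (w_1, ..., w_d, b); constraint -y_i*(<w,x_i> + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    if not res.success:
        return None
    return res.x[:d], res.x[d]

# Two separable clusters in the plane (illustration data).
X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -1.0], [-3.0, -2.5]])
y = np.array([1, 1, -1, -1])
w, b = lp_classifier(X, y)
print(np.sign(X @ w + b))   # matches the labels y
```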


22.1.0.6 Classification...

(A) Or equivalently, let xi = (x_i^1, ..., x_i^d) ∈ ℝ^d, for i = 1, ..., n, and let w = (w_1, ..., w_d); then we get the linear constraint

∑_{k=1}^d w_k x_i^k + b ≥ 0 if yi = 1, and ∑_{k=1}^d w_k x_i^k + b ≤ 0 if yi = −1.

Thus, we get a set of linear constraints, one for each training example, and we need to solve the resulting linear program.

22.1.0.7 Linear programming for learning?

(A) Stumbling block: linear programming is very sensitive to noise. (B) If some points are misclassified ⟹ there is no solution. (C) Instead, use an iterative algorithm that converges to a separating classifier if one exists...

22.1.0.8 Perceptron algorithm...

perceptron(S: a set of labeled examples)
    w0 ← 0, k ← 0
    R ← max_{(x,y)∈S} ∥x∥
    repeat
        for (x, y) ∈ S do
            if sign(⟨wk, x⟩) ≠ y then
                wk+1 ← wk + y·x
                k ← k + 1
    until no mistakes are made in the classification
    return wk and k
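The pseudocode above translates almost line for line into Python. A minimal sketch; as in the pseudocode there is no bias term, and the example point set is made up:

```python
# A direct Python rendering of the perceptron pseudocode above.  As in
# the text there is no separate bias term, and sign(0) counts as a
# mistake for positive examples, so the zero start vector gets updated.

def sign(v):
    return 1 if v > 0 else -1

def perceptron(S, max_rounds=1000):
    """S: list of (x, y) pairs with x a tuple in R^d and y in {-1, +1}.
    Returns the final weight vector w and the number of updates k."""
    d = len(S[0][0])
    w, k = [0.0] * d, 0
    for _ in range(max_rounds):                         # "repeat"
        mistakes = False
        for x, y in S:
            if sign(sum(wi * xi for wi, xi in zip(w, x))) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]   # w_{k+1} <- w_k + y*x
                k += 1
                mistakes = True
        if not mistakes:                                # "until no mistakes"
            break
    return w, k

# Points separable by a line through the origin (illustration data).
S = [((2.0, 1.0), 1), ((1.0, 3.0), 1), ((-1.0, -2.0), -1), ((-3.0, -1.0), -1)]
w, k = perceptron(S)
print(w, k)
```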

22.1.0.9 Perceptron algorithm

(A) Why does the perceptron algorithm converge? (B) Assume we made a mistake on a sample (x, y) and y = 1. Then ⟨wk, x⟩ < 0, and

⟨wk+1, x⟩ = ⟨wk + y·x, x⟩ = ⟨wk, x⟩ + y⟨x, x⟩ = ⟨wk, x⟩ + y∥x∥² > ⟨wk, x⟩.

(C) We are "walking" in the right direction... (D) ...the new value assigned to x by wk+1 is larger ("more positive") than the old value assigned to x by wk. (E) After enough iterations of such fix-ups, the label would change...

22.1.0.10 Perceptron algorithm converges

Theorem 22.1.1. Let S be a training set of examples, and let R = max_{(x,y)∈S} ∥x∥. Suppose that there exists a vector wopt such that ∥wopt∥ = 1, and a number γ > 0, such that y⟨wopt, x⟩ ≥ γ for all (x, y) ∈ S. Then, the number of mistakes made by the online perceptron algorithm on S is at most (R/γ)².
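The bound of the theorem is easy to check numerically. A sketch on made-up data that is separable through the origin by wopt = (1, 0), so that γ and R can be computed directly:

```python
# Check the mistake bound (R/gamma)^2 on a small made-up data set that
# is separable through the origin by the unit vector w_opt = (1, 0).
import math

def sign(v):
    return 1 if v > 0 else -1

def perceptron_updates(S, cap=10000):
    """Run the (bias-free) perceptron on S; return the update count k."""
    d = len(S[0][0])
    w, k = [0.0] * d, 0
    for _ in range(cap):
        mistake = False
        for x, y in S:
            if sign(sum(wi * xi for wi, xi in zip(w, x))) != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                k += 1
                mistake = True
        if not mistake:
            break
    return k

S = [((2.0, 0.5), 1), ((1.5, -0.3), 1), ((-2.0, 0.4), -1), ((-1.0, 0.1), -1)]
w_opt = (1.0, 0.0)                                       # ||w_opt|| = 1
gamma = min(y * sum(wi * xi for wi, xi in zip(w_opt, x)) for x, y in S)
R = max(math.hypot(*x) for x, y in S)                    # max norm in S
k = perceptron_updates(S)
print(k, (R / gamma) ** 2)   # the theorem guarantees k <= (R/gamma)^2
```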


22.1.0.11 Claim by figure...

[Figure: two configurations of points of norm at most R with an optimal separator wopt. Hard case: small margin γ, # errors (R/γ)². Easy case: larger margin γ′, # errors (R/γ′)².]

22.1.0.12 Proof of Perceptron convergence...

(A) Idea of proof: the perceptron weight vector converges to wopt. (B) Distance between wopt and the kth update vector: αk = ∥wk − (R²/γ)wopt∥². (C) Quantify the change between αk and αk+1. (D) The example being misclassified is (x, y).


22.1.0.13 Proof of Perceptron convergence...

(A) The example being misclassified is (x, y) (both are constants). (B) wk+1 ← wk + y·x. (C)

αk+1 = ∥wk+1 − (R²/γ)wopt∥²
     = ∥wk + y·x − (R²/γ)wopt∥²
     = ∥(wk − (R²/γ)wopt) + y·x∥²
     = ⟨(wk − (R²/γ)wopt) + y·x, (wk − (R²/γ)wopt) + y·x⟩
     = ⟨wk − (R²/γ)wopt, wk − (R²/γ)wopt⟩ + 2y⟨wk − (R²/γ)wopt, x⟩ + ⟨x, x⟩
     = αk + 2y⟨wk − (R²/γ)wopt, x⟩ + ∥x∥².
22.1.0.14 Proof of Perceptron convergence...

(A) We proved: αk+1 = αk + 2y⟨wk − (R²/γ)wopt, x⟩ + ∥x∥². (B) (x, y) is misclassified: sign(⟨wk, x⟩) ≠ y (C) ⟹ sign(y⟨wk, x⟩) = −1 (D) ⟹ y⟨wk, x⟩ < 0. (E) ∥x∥ ≤ R ⟹

αk+1 ≤ αk + R² + 2y⟨wk, x⟩ − (2R²/γ)·y⟨wopt, x⟩ ≤ αk + R² − (2R²/γ)·y⟨wopt, x⟩,

(F) ...since 2y⟨wk, x⟩ < 0.

22.1.0.15 Proof of Perceptron convergence...

(A) Proved: αk+1 ≤ αk + R² − (2R²/γ)·y⟨wopt, x⟩. (B) sign(⟨wopt, x⟩) = y. (C) By the margin assumption: y⟨wopt, x⟩ ≥ γ, for all (x, y) ∈ S. (D) Hence

αk+1 ≤ αk + R² − (2R²/γ)·y⟨wopt, x⟩ ≤ αk + R² − (2R²/γ)·γ ≤ αk + R² − 2R² ≤ αk − R².

22.1.0.16 Proof of Perceptron convergence...

(A) We have: αk+1 ≤ αk − R². (B) α0 = ∥0 − (R²/γ)wopt∥² = (R⁴/γ²)·∥wopt∥² = R⁴/γ². (C) αi ≥ 0 for all i. (D) Q: what is the maximum number of classification errors the algorithm can make? (E) ...it equals the number of updates. (F) ...and the number of updates is ≤ α0/R²... (G) A: at most R²/γ².
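The potential αk from the proof can be tracked while the perceptron runs: each update should decrease it by at least R². A numerical sketch on made-up data with a known unit vector wopt:

```python
# Track alpha_k = ||w_k - (R^2/gamma) w_opt||^2 across perceptron updates
# and record the drop at each update; the proof says each drop is >= R^2.
# The data set and w_opt below are made-up illustration values.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

S = [((2.0, 0.5), 1), ((1.5, -0.3), 1), ((-2.0, 0.4), -1), ((-1.0, 0.1), -1)]
w_opt = (1.0, 0.0)                                   # ||w_opt|| = 1
gamma = min(y * dot(w_opt, x) for x, y in S)         # margin gamma
R2 = max(dot(x, x) for x, y in S)                    # R^2
target = [R2 / gamma * wi for wi in w_opt]           # (R^2/gamma) * w_opt

def alpha(w):
    """alpha_k = ||w_k - (R^2/gamma) w_opt||^2."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))

w, drops = [0.0, 0.0], []
for _ in range(100):                                 # perceptron main loop
    mistake = False
    for x, y in S:
        if (1 if dot(w, x) > 0 else -1) != y:
            a_old = alpha(w)
            w = [wi + y * xi for wi, xi in zip(w, x)]
            drops.append(a_old - alpha(w))           # proof: drop >= R^2
            mistake = True
    if not mistake:
        break
print(drops, R2)   # every recorded drop is at least R^2
```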


22.1.0.17 Concluding comment... Any linear program can be written as the problem of separating red points from blue points. As such, the perceptron algorithm can be used to solve linear programs.

22.2 Learning A Circle

22.2.0.18 Learning a circle...

(A) Given a set of red points and a set of blue points in the plane, we want to learn a circle σ that contains all the red points and none of the blue points. (B) Q: How do we compute the circle σ? (C) Lifting: ℓ : (x, y) → (x, y, x² + y²). (D) z(P) = { ℓ(x, y) = (x, y, x² + y²) | (x, y) ∈ P }.

22.2.0.19 Learning a circle...

Theorem 22.2.1. Two sets of points R and B are separable by a circle in two dimensions if and only if ℓ(R) and ℓ(B) are separable by a plane in three dimensions.

22.2.0.20 Proof

(A) σ ≡ (x − a)² + (y − b)² = r²: a circle containing all of R, with all points of B outside. (B) ∀(x, y) ∈ R: (x − a)² + (y − b)² ≤ r², and ∀(x, y) ∈ B: (x − a)² + (y − b)² > r². (C) Expanding: ∀(x, y) ∈ R: −2ax − 2by + (x² + y²) − r² + a² + b² ≤ 0, and ∀(x, y) ∈ B: −2ax − 2by + (x² + y²) − r² + a² + b² > 0. (D) Setting z = z(x, y) = x² + y² and h(x, y, z) = −2ax − 2by + z − r² + a² + b², we get: ∀(x, y) ∈ R: h(x, y, z(x, y)) ≤ 0. (E) ⟺ ∀(x, y) ∈ R: h(ℓ(x, y)) ≤ 0, and ∀(x, y) ∈ B: h(ℓ(x, y)) > 0. (F) p ∈ σ ⟺ h(ℓ(p)) ≤ 0. (G) Proved: if the point sets are separable by a circle ⟹ the lifted point sets ℓ(R) and ℓ(B) are separable by a plane.

22.2.0.21 Proof: Other direction

(A) Assume ℓ(R) and ℓ(B) are linearly separable. Let the separating plane be h ≡ ax + by + cz + d = 0. (B) ∀(x, y, x² + y²) ∈ ℓ(R): ax + by + c(x² + y²) + d ≤ 0.


(C) ∀(x, y, x² + y²) ∈ ℓ(B): ax + by + c(x² + y²) + d ≥ 0. (D) U(h) = { (x, y) | h((x, y, x² + y²)) ≤ 0 }. (E) If U(h) is a disk ⟹ R ⊆ U(h) and B ∩ U(h) = ∅. (F) U(h) ≡ ax + by + c(x² + y²) ≤ −d. (G) Dividing by c (for c > 0): ⟺ (x² + (a/c)x) + (y² + (b/c)y) ≤ −d/c. (H) Completing the squares: ⟺ (x + a/(2c))² + (y + b/(2c))² ≤ (a² + b²)/(4c²) − d/c. (I) This is a disk in the plane, as claimed.

22.2.0.22 A closing comment...

Linear separability is a powerful technique that can be used to learn concepts that are considerably more complicated than hyperplane separation. The lifting technique shown above is known as the kernel technique, or linearization.

22.3 A Little Bit On VC Dimension

22.3.0.23 A Little Bit On VC Dimension

(A) Q: How complex is the function we are trying to learn? (B) The VC dimension is one way of capturing this notion (VC = Vapnik and Chervonenkis, 1971). (C) A matter of expressivity: what is harder to learn? (a) A rectangle in the plane. (b) A halfplane. (c) A convex polygon with k sides.

22.3.0.24 Thinking about concepts as binary functions...

(A) X = {p1, p2, ..., pm}: points in the plane. (B) H: the set of all halfplanes. (C) A halfplane r ∈ H defines a binary vector r(X) = (b1, ..., bm), where bi = 1 if and only if pi is inside r. (D) The possible binary vectors generated by halfplanes: U(X, H) = { r(X) | r ∈ H }. (E) A set X of m elements is shattered by a set of ranges R if |U(X, R)| = 2^m. (F) What does this mean? (G) The VC dimension of a set of ranges R is the size of the largest set that it can shatter.
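The shattering definition can be explored by brute force: enumerate all dichotomies of X and test each for realizability by a halfplane via an LP feasibility check. A sketch (scipy assumed, strict separation with margin normalized to 1), showing that halfplanes shatter 3 points in general position but fail on 4 points in convex position (the XOR labeling of a square's corners is not realizable):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def halfplane_realizes(X, labels):
    """Is there a halfplane sign(<w, p> + b) realizing the given +/-1
    labels on the points X?  Checked as an LP (strict separation)."""
    pts = np.array(X, dtype=float)
    lab = np.array(labels, dtype=float)
    n, d = pts.shape
    A_ub = -lab[:, None] * np.hstack([pts, np.ones((n, 1))])
    res = linprog(np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    """X is shattered if every dichotomy is realizable: |U(X, H)| = 2^|X|."""
    return all(halfplane_realizes(X, lab)
               for lab in itertools.product([1, -1], repeat=len(X)))

triangle = [(0, 0), (1, 0), (0, 1)]          # 3 points in general position
square   = [(0, 0), (1, 0), (1, 1), (0, 1)]  # 4 points in convex position
print(shattered(triangle))   # all 8 dichotomies are realizable
print(shattered(square))     # the XOR labeling of opposite corners fails
```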

22.3.1 Examples

22.3.1.1 Examples

What is the VC dimension of circles in the plane? Let X be a set of n points in the plane, and let C be the set of all circles. Consider X = {p, q, r, s}.


What subsets of X can we generate by a circle?

[Figure: four points p, q, r, s in the plane.]

22.3.1.2 Subsets realized by disks

{}, {r}, {p}, {q}, {s}, {p, s}, {p, q}, {p, r}, {r, q}, {q, s}, and {r, p, q}, {p, r, s}, {p, s, q}, {s, q, r}, and {r, p, q, s}. We got only 15 sets; there is one subset which is not there. Which one? The VC dimension of circles in the plane is 3.

22.3.1.3 Sauer's Lemma

Lemma 22.3.1 (Sauer's Lemma). If R has VC dimension d, then |U(X, R)| = O(m^d), where m is the size of X.