Learning, Linear Separability and Linear Programming

CS 573: Algorithms, Fall 2013. Lecture 22, November 12, 2013. Sariel (UIUC).

Labeling: given examples (a database of cars), we would like to determine which...


Learning linear separation

1. Given red and blue points – how to compute the separating line ℓ?
2. A line/plane/hyperplane is the zero set of a linear function.
3. Form: f(x) = ⟨a, x⟩ + b for all x ∈ ℝ^d, where a = (a_1, ..., a_d) ∈ ℝ^d, b ∈ ℝ, and ⟨a, x⟩ = Σ_i a_i x_i is the dot product of a and x.
4. Classification is done by computing the sign of f(x): sign(f(x)).
5. If sign(f(x)) is negative: x is not in the class. If positive: x is in the class.
6. A set of training examples: S = {(x_1, y_1), ..., (x_n, y_n)}, where x_i ∈ ℝ^d and y_i ∈ {−1, 1}, for i = 1, ..., n.

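To make the sign-based rule concrete, here is a minimal sketch in Python; the vector a, offset b, and test points are made-up values, not from the lecture.

```python
import numpy as np

# Hypothetical linear function f(x) = <a, x> + b in the plane.
a, b = np.array([1.0, -2.0]), 0.5

def classify(x):
    """Classify by the sign of f(x): +1 means inside the class."""
    return 1 if a @ x + b > 0 else -1

print(classify(np.array([3.0, 1.0])))   # +1, since f = 1.5 > 0
print(classify(np.array([0.0, 2.0])))   # -1, since f = -3.5 < 0
```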

Classification...

1. A linear classifier h is a pair (w, b), where w ∈ ℝ^d and b ∈ ℝ.
2. The classification of x ∈ ℝ^d is sign(⟨w, x⟩ + b).
3. Given a labeled example (x, y), h classifies (x, y) correctly if sign(⟨w, x⟩ + b) = y.
4. Assume a linear classifier exists.
5. Given n labeled examples, how do we compute a linear classifier for these examples?
6. Use linear programming...
7. We are looking for (w, b) such that for each (x_i, y_i) we have sign(⟨w, x_i⟩ + b) = y_i; that is, ⟨w, x_i⟩ + b ≥ 0 if y_i = 1, and ⟨w, x_i⟩ + b ≤ 0 if y_i = −1.


Classification...

Or equivalently: let x_i = (x_i^1, ..., x_i^d) ∈ ℝ^d, for i = 1, ..., n, and let w = (w^1, ..., w^d). Then we get the linear constraint
Σ_{k=1}^d w^k x_i^k + b ≥ 0 if y_i = 1, and Σ_{k=1}^d w^k x_i^k + b ≤ 0 if y_i = −1.
Thus we get a set of linear constraints, one for each training example, and we need to solve the resulting linear program.
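
As an illustration, here is a minimal sketch of this LP using scipy.optimize.linprog on made-up toy data. One caveat the slide glosses over: with the constraints exactly as stated, w = 0, b = 0 is always feasible, so the sketch asks for y_i(⟨w, x_i⟩ + b) ≥ 1 instead; any strictly separating (w, b) can be scaled to satisfy this.

```python
import numpy as np
from scipy.optimize import linprog

# Made-up toy data: two positive and two negative examples in the plane.
X = np.array([[2.0, 3.0], [3.0, 4.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
n, d = X.shape

# Variables v = (w_1, ..., w_d, b). The constraint y_i(<w, x_i> + b) >= 1
# becomes -y_i * [x_i, 1] . v <= -1, matching linprog's A_ub @ v <= b_ub form.
A_ub = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))
b_ub = -np.ones(n)

# Pure feasibility problem: any feasible point will do, so the objective is 0.
# The bounds are needed because linprog defaults to nonnegative variables.
res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (d + 1))
if res.success:
    w, b = res.x[:d], res.x[d]
    print("w =", w, ", b =", b)
    print("signs:", np.sign(X @ w + b))   # should match y
else:
    print("the examples are not linearly separable")
```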

Linear programming for learning?

1. Stumbling block: linear programming is very sensitive to noise.
2. If even a few points are mislabeled, the linear program has no solution: it is infeasible.
3. Instead, use an iterative algorithm that converges to the optimal solution, if it exists...


Perceptron algorithm...

perceptron(S: a set of n examples)
    w_0 ← 0, k ← 0
    R = max_{(x, y) ∈ S} ‖x‖
    repeat
        for (x, y) ∈ S do
            if sign(⟨w_k, x⟩) ≠ y then
                w_{k+1} ← w_k + y·x
                k ← k + 1
    until no mistakes are made in the classification
    return w_k and k
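
A direct Python transcription of the pseudocode above, as a sketch: it assumes the examples come as (x, y) pairs with numpy vectors x and labels y in {−1, +1}. Note the slide's version has no bias term b; the usual trick is to append a constant 1 coordinate to every x.

```python
import numpy as np

def perceptron(S, max_epochs=1000):
    """Return (w, k): the final weight vector and the number of updates."""
    d = len(S[0][0])
    w, k = np.zeros(d), 0
    for _ in range(max_epochs):          # guard against non-separable input
        mistakes = 0
        for x, y in S:
            if np.sign(w @ x) != y:      # misclassified (note sign(0) = 0 != y)
                w = w + y * x            # nudge w toward the correct label
                k += 1
                mistakes += 1
        if mistakes == 0:                # a full pass with no errors: done
            return w, k
    raise RuntimeError("no separating hyperplane found within max_epochs")

# Tiny usage example: here R = 2 and margin gamma = 2, so the convergence
# theorem below predicts at most (R/gamma)^2 = 1 update.
S = [(np.array([2.0, 0.0]), 1), (np.array([-2.0, 0.0]), -1)]
print(perceptron(S))   # e.g. (array([2., 0.]), 1)
```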

Perceptron algorithm

1. Why does the perceptron algorithm converge?
2. Assume we made a mistake on a sample (x, y) with y = 1. Then ⟨w_k, x⟩ < 0, and
   ⟨w_{k+1}, x⟩ = ⟨w_k + y·x, x⟩ = ⟨w_k, x⟩ + y⟨x, x⟩ = ⟨w_k, x⟩ + y‖x‖² > ⟨w_k, x⟩.
3. The update is "walking" in the right direction...
4. ... the new value assigned to x by w_{k+1} is larger ("more positive") than the old value assigned to x by w_k.
5. After enough iterations of such fix-ups, the predicted label would change...


Perceptron algorithm converges

Theorem. Let S be a training set of examples, and let R = max_{(x, y) ∈ S} ‖x‖. Suppose that there exists a vector w_opt such that ‖w_opt‖ = 1, and a number γ > 0, such that y⟨w_opt, x⟩ ≥ γ for all (x, y) ∈ S. Then the number of mistakes made by the online perceptron algorithm on S is at most (R/γ)².

Claim by figure...

[Figure: two instances of radius R with separating line ℓ and optimal direction w_opt. The "hard" instance has a small margin γ, the "easy" one a larger margin γ′; the corresponding error bounds are (R/γ)² and (R/γ′)².]

Proof of Perceptron convergence...

1. Idea of proof: the perceptron weight vector converges to w_opt.
2. Distance between w_opt and the k-th update vector: α_k = ‖w_k − (R²/γ)·w_opt‖².
3. Quantify the change between α_k and α_{k+1}.
4. Let (x, y) be the example being misclassified.


Proof of Perceptron convergence...

1. The example being misclassified is (x, y) (both are constants).
2. w_{k+1} ← w_k + y·x.
3. Then:
   α_{k+1} = ‖w_{k+1} − (R²/γ)·w_opt‖²
           = ‖w_k + y·x − (R²/γ)·w_opt‖²
           = ‖(w_k − (R²/γ)·w_opt) + y·x‖²
           = ⟨(w_k − (R²/γ)·w_opt) + y·x, (w_k − (R²/γ)·w_opt) + y·x⟩
           = ⟨w_k − (R²/γ)·w_opt, w_k − (R²/γ)·w_opt⟩ + 2y⟨w_k − (R²/γ)·w_opt, x⟩ + ⟨x, x⟩
           = α_k + 2y⟨w_k − (R²/γ)·w_opt, x⟩ + ‖x‖².


Proof of Perceptron convergence...

1. We proved: α_{k+1} = α_k + 2y⟨w_k − (R²/γ)·w_opt, x⟩ + ‖x‖².
2. (x, y) is misclassified: sign(⟨w_k, x⟩) ≠ y
3. ⟹ sign(y⟨w_k, x⟩) = −1
4. ⟹ y⟨w_k, x⟩ < 0.
5. ‖x‖ ≤ R implies
   α_{k+1} ≤ α_k + R² + 2y⟨w_k, x⟩ − 2y⟨(R²/γ)·w_opt, x⟩ ≤ α_k + R² − (2R²/γ)·y⟨w_opt, x⟩,
6. ... since 2y⟨w_k, x⟩ < 0.


Proof of Perceptron convergence...

Proved: α_{k+1} ≤ α_k + R² − (2R²/γ)·y⟨w_opt, x⟩.
1. sign(⟨w_opt, x⟩) = y.
2. By the margin assumption: y⟨w_opt, x⟩ ≥ γ for all (x, y) ∈ S.
3. Therefore:
   α_{k+1} ≤ α_k + R² − (2R²/γ)·y⟨w_opt, x⟩
           ≤ α_k + R² − (2R²/γ)·γ
           = α_k + R² − 2R²
           = α_k − R².


Proof of Perceptron convergence...

1. We have: α_{k+1} ≤ α_k − R².
2. α_0 = ‖0 − (R²/γ)·w_opt‖² = (R⁴/γ²)·‖w_opt‖² = R⁴/γ².
3. α_i ≥ 0 for all i.
4. Q: What is the maximum number of classification errors the algorithm can make?
5. ... which equals the number of updates.
6. Each update decreases α by at least R², so the number of updates is ≤ α_0 / R²...
7. A: ≤ R²/γ².


Concluding comment...

Any linear program can be written as the problem of separating red points from blue points. As such, the perceptron algorithm can be used to solve linear programs.

Learning a circle...

1. Given a set of red points and a set of blue points in the plane, we want to learn a circle σ that contains all the red points and does not contain the blue points.
2. Q: How to compute the circle σ?
3. Lifting: ℓ(x, y) = (x, y, x² + y²).
4. z(P) = { ℓ(x, y) = (x, y, x² + y²) | (x, y) ∈ P }.

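Putting the pieces together, here is a minimal sketch that lifts made-up planar points to 3D, finds a separating plane with the same LP feasibility trick as before, and converts the plane back into a circle. All data and names here are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

red  = np.array([[0.0, 0.0], [0.5, 0.5], [-0.5, 0.3]])               # inside
blue = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0], [0.0, -3.0]])  # outside

def lift(P):
    """The lifting map: (x, y) -> (x, y, x^2 + y^2)."""
    return np.hstack([P, (P ** 2).sum(axis=1, keepdims=True)])

X = np.vstack([lift(red), lift(blue)])
# Labels follow the slides' convention (h <= 0 inside the circle): red = -1.
y = np.array([-1] * len(red) + [1] * len(blue))

n, d = X.shape
# Find (w, e) with y_i(<w, x_i> + e) >= 1, as in the separability LP.
A_ub = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))
res = linprog(np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
              bounds=[(None, None)] * (d + 1))
assert res.success, "not separable by a circle"
w, e = res.x[:d], res.x[d]

# Unpack h = w1*x + w2*y + w3*z + e <= 0 (with z = x^2 + y^2) into a circle
# by completing the square; w3 > 0 here since red sits inside blue's hull.
center = -w[:2] / (2 * w[2])
r = np.sqrt(center @ center - e / w[2])
print("center:", center, " radius:", r)
```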

Learning a circle...

Theorem. Two sets of points R and B are separable by a circle in two dimensions if and only if ℓ(R) and ℓ(B) are separable by a plane in three dimensions.

Proof

1. Let σ ≡ (x − a)² + (y − b)² = r² be a circle containing all of R, with all points of B outside.
2. ∀(x, y) ∈ R: (x − a)² + (y − b)² ≤ r², and ∀(x, y) ∈ B: (x − a)² + (y − b)² > r².
3. Expanding: ∀(x, y) ∈ R: −2ax − 2by + (x² + y²) − r² + a² + b² ≤ 0, and ∀(x, y) ∈ B: −2ax − 2by + (x² + y²) − r² + a² + b² > 0.
4. Setting z = z(x, y) = x² + y², define h(x, y, z) = −2ax − 2by + z − r² + a² + b².
5. ∀(x, y) ∈ R: h(x, y, z(x, y)) ≤ 0 ⟺ ∀(x, y) ∈ R: h(ℓ(x, y)) ≤ 0, and ∀(x, y) ∈ B: h(ℓ(x, y)) > 0.
6. p ∈ σ ⟺ h(ℓ(p)) ≤ 0.
7. Proved: if the point sets are separable by a circle, then the lifted point sets ℓ(R) and ℓ(B) are separable by a plane.


Proof: Other direction

1. Assume ℓ(R) and ℓ(B) are linearly separable, and let the separating plane be h ≡ ax + by + cz + d = 0.
2. ∀(x, y, x² + y²) ∈ ℓ(R): ax + by + c(x² + y²) + d ≤ 0.
3. ∀(x, y, x² + y²) ∈ ℓ(B): ax + by + c(x² + y²) + d ≥ 0.
4. U(h) = { (x, y) | h((x, y, x² + y²)) ≤ 0 }.
5. If U(h) is a disk, then R ⊂ U(h) and B ∩ U(h) = ∅.
6. U(h) ≡ ax + by + c(x² + y²) ≤ −d
7. ⟺ x² + (a/c)x + y² + (b/c)y ≤ −d/c (dividing by c, assuming c > 0)
8. ⟺ (x + a/(2c))² + (y + b/(2c))² ≤ (a² + b²)/(4c²) − d/c.
9. This is a disk in the plane, as claimed.

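The completing-the-square step above can be sanity-checked symbolically. A small sketch, assuming sympy is available (and c > 0, so the inequality direction is preserved):

```python
import sympy as sp

x, y, a, b, c, d = sp.symbols('x y a b c d')
lhs = (x + a / (2 * c))**2 + (y + b / (2 * c))**2
rhs = (a**2 + b**2) / (4 * c**2) - d / c

# lhs - rhs should equal (a*x + b*y + c*(x^2 + y^2) + d) / c identically,
# so the difference below simplifies to 0.
diff = sp.expand(lhs - rhs) - (a*x + b*y + c*(x**2 + y**2) + d) / c
print(sp.simplify(diff))   # 0
```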

A closing comment...

Linear separability is a powerful technique that can be used to learn concepts that are considerably more complicated than hyperplane separation. The lifting technique shown above is known as the kernel technique, or linearization.

A Little Bit On VC Dimension

1. Q: How complex is the function we are trying to learn?
2. VC-dimension is one way of capturing this notion (VC = Vapnik and Chervonenkis, 1971).
3. A matter of expressivity: which is harder to learn,
   1. a rectangle in the plane?
   2. a halfplane?
   3. a convex polygon with k sides?


Thinking about concepts as binary functions...

1. X = {p_1, p_2, ..., p_m}: points in the plane.
2. H: the set of all halfplanes.
3. A halfplane r ∈ H defines a binary vector r(X) = (b_1, ..., b_m), where b_i = 1 if and only if p_i is inside r.
4. Possible binary vectors generated by halfplanes: U(X, H) = { r(X) | r ∈ H }.
5. A set X of m elements is shattered by a set of ranges R if |U(X, R)| = 2^m.
6. What does this mean?
7. The VC-dimension of a set of ranges R is the size of the largest set that it can shatter.

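To make "shattered" concrete, here is a small sketch that brute-forces all 2³ labelings of three made-up points in general position and confirms a halfplane realizes each one, reusing the LP separability test from earlier in the lecture as a helper.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Is there (w, b) with y_i(<w, x_i> + b) >= 1 for all i?"""
    n, d = X.shape
    A_ub = -(y[:, None] * np.hstack([X, np.ones((n, 1))]))
    res = linprog(np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # general position
shattered = all(separable(X, np.array(lab))
                for lab in itertools.product([-1, 1], repeat=len(X)))
print(shattered)   # True: halfplanes shatter these 3 points
```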
