 
ECE 6254 - Spring 2020 - Lecture 9
v1.1 - revised February 4, 2020

Perceptron Learning Algorithm
Matthieu R. Bloch

1 A bit of geometry

Definition 1.1. A dataset $\{(x_i, y_i)\}_{i=1}^N$ with $y_i \in \{\pm 1\}$ is linearly separable if there exist $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ such that

\[ \forall i \in [\![1, N]\!] \quad y_i = \mathrm{sign}(w^\intercal x_i + b). \tag{1} \]

By definition, $\mathrm{sign}(x) = +1$ if $x > 0$ and $-1$ otherwise. The affine set $\{x : w^\intercal x + b = 0\}$ is then called a separating hyperplane.

As illustrated in Fig. 1, it is important to note that $H \triangleq \{x : w^\intercal x + b = 0\}$ is not a vector space because of the presence of the offset $b$. It is an affine space, meaning that it can be described as $H = x_0 + V$, where $x_0 \in H$ and $V$ is a vector space. Make sure that this is clear and check for yourself.

Figure 1: Illustration of linearly separable dataset (axes $x_1$ and $x_2$, with the hyperplane $H = \{w^\intercal x + b = 0\}$).

Lemma 1.2. Consider the hyperplane $H \triangleq \{x : w^\intercal x + b = 0\}$. The vector $w$ is orthogonal to all vectors parallel to the hyperplane. For $z \in \mathbb{R}^d$, the distance of $z$ to the hyperplane is

\[ d(z, H) = \frac{|w^\intercal z + b|}{\|w\|_2}. \]

Proof. Consider $x, x'$ in $H$. Then, by definition, $w^\intercal x + b = 0 = w^\intercal x' + b$, so that $w^\intercal (x - x') = 0$. Hence, $w$ is orthogonal to all vectors parallel to $H$. Consider now any point $z \in \mathbb{R}^d$ and a point $x_0 \in H$. The distance of $z$ to $H$ is the distance between $z$ and its orthogonal projection onto $H$, which we can compute as

\[ d(z, H) = \frac{|w^\intercal (z - x_0)|}{\|w\|_2}. \]

Since $x_0 \in H$, we have $w^\intercal x_0 = -b$, and therefore $|w^\intercal (z - x_0)| = |w^\intercal z + b|$. ■
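As a numerical check of Lemma 1.2, the following is a minimal sketch in Python using only the standard library. The helper names `dot` and `distance_to_hyperplane`, as well as the example line $3x_1 + 4x_2 - 5 = 0$, are illustrative assumptions and not part of the notes.

```python
import math


def dot(u, v):
    """Inner product of two vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def distance_to_hyperplane(z, w, b):
    """Distance from z to H = {x : w^T x + b = 0}, i.e. |w^T z + b| / ||w||_2."""
    norm_w = math.sqrt(dot(w, w))
    if norm_w == 0.0:
        raise ValueError("w must be nonzero to define a hyperplane")
    return abs(dot(w, z) + b) / norm_w


# Hypothetical example in R^2: the line 3*x1 + 4*x2 - 5 = 0.
w, b = [3.0, 4.0], -5.0

# Two points on H; their difference is a vector parallel to H.
x, x_prime = [3.0, -1.0], [-1.0, 2.0]
direction = [p - q for p, q in zip(x, x_prime)]

orth = dot(w, direction)                              # w is orthogonal to it
d_origin = distance_to_hyperplane([0.0, 0.0], w, b)   # |b| / ||w||_2
d_z = distance_to_hyperplane([3.0, 4.0], w, b)        # |9 + 16 - 5| / 5
```

With these values, `orth` is zero, the origin lies at distance 1 from the line, and the point $(3, 4)$ lies at distance 4, in agreement with the formula of the lemma.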
2 The Perceptron Learning Algorithm

The Perceptron Learning Algorithm (PLA) was proposed by Rosenblatt to identify a separating hyperplane in a linearly separable dataset $\{(x_i, y_i)\}_{i=1}^N$ if it exists. We assume that every vector $x \in \mathbb{R}^{d+1}$ has $x_0 = 1$, so that we can use the shorthand $\theta^\intercal x = 0$ to describe an affine hyperplane. The principle of the algorithm is the following.

1. Start from a guess $\theta^{(0)}$.
2. For $j \geqslant 1$, iterate over the data points (in any order) and update

\[ \theta^{(j+1)} = \begin{cases} \theta^{(j)} + y_i x_i & \text{if } y_i \neq \mathrm{sign}\big((\theta^{(j)})^\intercal x_i\big) \\ \theta^{(j)} & \text{else} \end{cases} \tag{2} \]

Geometric view of PLA

The effect of the PLA update is illustrated in Fig. 2. Note that the update of $\theta^{(j+1)}$ changes the overall hyperplane, and not just the associated vector space. This is best seen in Fig. 2, where the offset changes and not just the slope of the separator.

Figure 2: PLA update

Gradient descent view of PLA

Consider a loss function called the "perceptron loss" defined as

\[ \ell(\theta) \triangleq \sum_{i=1}^N \max(0, -y_i \theta^\intercal x_i). \tag{3} \]

Intuitively, the loss penalizes misclassified points (according to $\theta$) with a penalty proportional to how badly they are misclassified. Setting $\ell_i(\theta) \triangleq \max(0, -y_i \theta^\intercal x_i)$, we have

\[ \nabla \ell_i(\theta) = \begin{cases} 0 & \text{if } y_i \theta^\intercal x_i > 0 \\ -y_i x_i & \text{if } y_i \theta^\intercal x_i < 0 \\ [0, 1] \times (-y_i x_i) & \text{if } \theta^\intercal x_i = 0 \end{cases} \tag{4} \]

The case of equality $\theta^\intercal x_i = 0$ corresponds to the point where the loss function $\ell_i(\theta)$ is not differentiable. In such a case, we have to use a subgradient of $\ell_i$ at $\theta$, which is any vector $v$ such that for all $\theta'$, $\ell_i(\theta') - \ell_i(\theta) \geqslant v^\intercal (\theta' - \theta)$. A subgradient is not unique and the set of subgradients is usually denoted $\partial \ell_i(\theta)$.

Let us now apply a stochastic gradient descent algorithm with a step size of 1 to the loss function. We obtain the following algorithm.
1. Start from a guess $\theta^{(0)}$.
2. For $j \geqslant 1$, iterate over the data points (in any order) and update

\[ \theta^{(j+1)} = \theta^{(j)} - \nabla \ell_i(\theta^{(j)}) = \begin{cases} \theta^{(j)} + y_i x_i & \text{if } -y_i (\theta^{(j)})^\intercal x_i > 0 \\ \theta^{(j)} & \text{if } -y_i (\theta^{(j)})^\intercal x_i < 0 \\ \theta^{(j)} - v \text{ where } v \in \partial \ell_i(\theta^{(j)}) & \text{if } (\theta^{(j)})^\intercal x_i = 0 \end{cases} \tag{5} \]

Note that (5) is almost identical to (2). The PLA update rule is essentially a stochastic gradient descent that treats the case of subgradients with its own rule.

Theorem 2.1. Consider a linearly separable data set $\{(x_i, y_i)\}_{i=1}^N$. The number of updates made by the PLA because of classification errors is bounded and the PLA eventually identifies a separating hyperplane.

3 To go further

A simple and accessible review of gradient descent techniques can be found in [1]. The original proposal of the PLA is in [2]. Introductory treatments of the PLA are found in [3, Section 4.5] and [4, Section 8.5.4].

References

[1] S. Ruder, "An overview of gradient descent optimization algorithms," arXiv preprint, Jun. 2017.
[2] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, pp. 386–408, 1958.
[3] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, ser. Springer series in statistics. Springer, 2009.
[4] K. P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
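As a supplement to Section 2, the PLA update (2) and the perceptron loss can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not a reference implementation: the function names `pla` and `perceptron_loss`, the `max_epochs` safeguard, the initial guess $\theta^{(0)} = 0$, and the toy dataset are all hypothetical choices made for this example. The convention $\mathrm{sign}(0) = -1$ follows Definition 1.1.

```python
def dot(u, v):
    """Inner product of two vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def sign(t):
    """sign(t) = +1 if t > 0 and -1 otherwise, as in Definition 1.1."""
    return 1 if t > 0 else -1


def perceptron_loss(theta, xs, ys):
    """Perceptron loss: sum over i of max(0, -y_i * theta^T x_i)."""
    return sum(max(0.0, -y * dot(theta, x)) for x, y in zip(xs, ys))


def pla(xs, ys, max_epochs=1000):
    """Run the PLA update until a full pass makes no correction.

    xs must already be augmented with a leading 1 (x_0 = 1).
    max_epochs is a safeguard assumed here for data that is not separable.
    Returns the final theta and the total number of updates.
    """
    theta = [0.0] * len(xs[0])  # assumed initial guess theta^(0) = 0
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            if y != sign(dot(theta, x)):  # misclassified point
                theta = [t + y * xi for t, xi in zip(theta, x)]
                mistakes += 1
        updates += mistakes
        if mistakes == 0:  # every point is correctly classified
            return theta, updates
    raise RuntimeError("no separating hyperplane found within max_epochs")


# Hypothetical toy dataset in R^2, linearly separable.
points = [[2.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, 0.5]]
labels = [1, 1, -1, -1]
xs = [[1.0] + p for p in points]  # prepend x_0 = 1

theta, updates = pla(xs, labels)
final_loss = perceptron_loss(theta, xs, labels)
```

On this toy dataset a single update suffices, and the perceptron loss of the returned `theta` is zero since no point is misclassified. The stopping test (a full pass without correction) is one simple design choice; Theorem 2.1 is what guarantees that such a pass eventually occurs when the data is linearly separable.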