  1. Online Learning II
     Presenter: Adams Wei Yu, Carnegie Mellon University, March 2015

  2. Recap of Online Learning
     - The data comes sequentially.
     - No need to assume a data distribution.
     - Adversarial setting (worst-case analysis).
     - Regret minimization:
       $R_T = \sum_{t=1}^{T} L(\hat{y}_t, y_t) - \min_{i \in \{1,\dots,N\}} \sum_{t=1}^{T} L(\hat{y}_{t,i}, y_t)$
     - Several simple algorithms with theoretical guarantees (Halving, Weighted Majority, Randomized Weighted Majority, Exponential Weighted Average).
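     To make the regret quantity concrete, here is a minimal Python sketch that computes R_T from recorded losses; the array layout and the use of 0/1 losses are assumptions made for this example, not something specified on the slide.

     import numpy as np

     def regret(learner_losses, expert_losses):
         # R_T = sum_t L(y_hat_t, y_t) - min_i sum_t L(y_hat_{t,i}, y_t)
         # learner_losses: shape (T,); expert_losses: shape (T, N), one column per expert.
         return learner_losses.sum() - expert_losses.sum(axis=0).min()

     # Example with 0/1 losses over T = 4 rounds and N = 3 experts (made-up numbers).
     learner = np.array([1, 0, 1, 1])
     experts = np.array([[1, 0, 0],
                         [0, 1, 0],
                         [1, 0, 1],
                         [0, 0, 1]])
     print(regret(learner, experts))   # 3 - min(2, 1, 2) = 2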

  3. Weighted Majority Algorithm

     Algorithm 1 WEIGHTED-MAJORITY(N)
       for i ← 1 to N do
         w_{1,i} ← 1
       for t ← 1 to T do
         RECEIVE(x_t)
         if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i} then
           ŷ_t ← 1
         else
           ŷ_t ← 0
         RECEIVE(y_t)
         if ŷ_t ≠ y_t then
           for i ← 1 to N do
             if y_{t,i} ≠ y_t then
               w_{t+1,i} ← β w_{t,i}
             else
               w_{t+1,i} ← w_{t,i}
       return w_{T+1}
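     The following is a minimal Python sketch of Algorithm 1, assuming the expert predictions are given as a (T, N) array of 0/1 values and the labels as a length-T array; the names weighted_majority and beta are illustrative.

     import numpy as np

     def weighted_majority(expert_preds, labels, beta=0.5):
         # Predict by weighted vote; multiply the weights of wrong experts by beta
         # after each of the learner's mistakes.
         # expert_preds: (T, N) array of 0/1 expert predictions; labels: (T,) 0/1 labels.
         T, N = expert_preds.shape
         w = np.ones(N)
         for t in range(T):
             vote_1 = w[expert_preds[t] == 1].sum()
             vote_0 = w[expert_preds[t] == 0].sum()
             y_hat = 1 if vote_1 >= vote_0 else 0
             if y_hat != labels[t]:                   # update only after a mistake
                 wrong = expert_preds[t] != labels[t]
                 w[wrong] *= beta                     # penalize the experts that erred
         return w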

  4. Randomized Weighted Majority Algorithm

     Algorithm 2 RANDOMIZED-WEIGHTED-MAJORITY(N)
       for i ← 1 to N do
         w_{1,i} ← 1; p_{1,i} ← 1/N
       for t ← 1 to T do
         RECEIVE(x_t)
         p̃_1 ← Σ_{i: y_{t,i}=1} p_{t,i};  p̃_0 ← Σ_{i: y_{t,i}=0} p_{t,i}
         draw u ~ Uniform(0,1)
         if u < p̃_1 then
           ŷ_t ← 1
         else
           ŷ_t ← 0
         for i ← 1 to N do
           if l_{t,i} = 1 then
             w_{t+1,i} ← β w_{t,i}
           else
             w_{t+1,i} ← w_{t,i}
         W_{t+1} ← Σ_{i=1}^{N} w_{t+1,i}
         for i ← 1 to N do
           p_{t+1,i} ← w_{t+1,i} / W_{t+1}
       return w_{T+1}
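     A corresponding Python sketch of Algorithm 2 is below. It assumes that l_{t,i} = 1 means expert i errs at round t, so the weight update is applied to every expert that errs, whether or not the learner's randomized guess was wrong; the data layout mirrors the previous sketch.

     import numpy as np

     def randomized_weighted_majority(expert_preds, labels, beta=0.5, rng=None):
         # expert_preds: (T, N) 0/1 expert predictions; labels: (T,) true 0/1 labels.
         rng = rng or np.random.default_rng()
         T, N = expert_preds.shape
         w = np.ones(N)
         p = w / w.sum()
         y_hat = np.empty(T, dtype=int)
         for t in range(T):
             p1 = p[expert_preds[t] == 1].sum()         # probability mass on "predict 1"
             y_hat[t] = 1 if rng.uniform() < p1 else 0  # draw u ~ Uniform(0,1)
             wrong = expert_preds[t] != labels[t]       # l_{t,i} = 1 for experts that err
             w[wrong] *= beta
             p = w / w.sum()                            # p_{t+1,i} = w_{t+1,i} / W_{t+1}
         return w, y_hat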

  5. Topics Today
     - Perceptron algorithm and its mistake bound.
     - Winnow algorithm and its mistake bound.
     - Conversion from online to batch algorithms and its analysis.

  6. Perceptron Algorithm

     Algorithm 3 PERCEPTRON(w_0)
       for t ← 1 to T do
         RECEIVE(x_t)
         ŷ_t ← sgn(w_t · x_t)
         RECEIVE(y_t)
         if ŷ_t ≠ y_t then
           w_{t+1} ← w_t + y_t x_t        ⊲ more generally, w_t + η y_t x_t
         else
           w_{t+1} ← w_t
       return w_{T+1}

     If x_t is misclassified, then y_t (w_t · x_t) is negative. After one update, y_t (w_{t+1} · x_t) = y_t (w_t · x_t) + η ‖x_t‖_2^2, so the term y_t (w_t · x_t) is corrected by η ‖x_t‖_2^2.
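     A minimal Python sketch of Algorithm 3, assuming labels in {−1, +1} and taking sgn(0) = +1; the learning rate eta corresponds to the η mentioned in the pseudocode comment.

     import numpy as np

     def perceptron(X, y, eta=1.0):
         # X: (T, N) array of points, y: (T,) labels in {-1, +1}.
         T, N = X.shape
         w = np.zeros(N)
         for t in range(T):
             y_hat = 1 if w @ X[t] >= 0 else -1   # sgn(w_t . x_t), with sgn(0) taken as +1
             if y_hat != y[t]:
                 w = w + eta * y[t] * X[t]        # additive update on mistakes only
         return w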

  7. Another Point of View: Stochastic Gradient Descent

     The Perceptron algorithm can be seen as finding a minimizer of the objective function F:
       $F(w) = \frac{1}{T} \sum_{t=1}^{T} \max(0, -y_t (w \cdot x_t)) = E_{x \sim \hat{D}}[\tilde{F}(w, x)]$
     where $\tilde{F}(w, x) = \max(0, -f(x)(w \cdot x))$, with f(x) the label of x and $\hat{D}$ the empirical distribution of the sample (x_1, ..., x_T).
     F(w) is convex in w.

  8. Another Point of View: Stochastic Gradient Descent

     $w_{t+1} \leftarrow \begin{cases} w_t - \eta \nabla_w \tilde{F}(w_t, x_t), & \text{if } \tilde{F}(w, x_t) \text{ is differentiable at } w_t \\ w_t, & \text{otherwise} \end{cases}$

     Note that $\tilde{F}(w, x_t) = \max(0, -y_t(w \cdot x_t))$, so
       $\nabla_w \tilde{F}(w, x_t) = \begin{cases} -y_t x_t, & \text{if } y_t(w \cdot x_t) < 0 \\ 0, & \text{if } y_t(w \cdot x_t) > 0 \end{cases}$
     which gives
       $w_{t+1} \leftarrow \begin{cases} w_t + \eta y_t x_t, & \text{if } y_t(w_t \cdot x_t) < 0 \\ w_t, & \text{if } y_t(w_t \cdot x_t) > 0 \\ w_t, & \text{otherwise} \end{cases}$
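     The same update written as a (sub)gradient step is sketched below; on misclassified points it reproduces the Perceptron update exactly. The function names are chosen for illustration only.

     import numpy as np

     def surrogate_loss(w, x, y):
         # F_tilde(w, x) = max(0, -y (w . x)), the per-point surrogate minimized by the Perceptron.
         return max(0.0, -y * (w @ x))

     def sgd_step(w, x, y, eta=1.0):
         # One stochastic (sub)gradient step on F_tilde; on a misclassified point this
         # is exactly the Perceptron update, otherwise w is left unchanged.
         if y * (w @ x) < 0:
             return w + eta * y * x
         return w     # includes the non-differentiable boundary case y (w . x) = 0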

  9. Upper Bound on the Number of Mistakes: Separable Case

     Theorem 1. Let $x_1, \dots, x_T \in R^N$ be a sequence of T points with $\|x_t\| \le r$ for all $t \in [1, T]$, for some $r > 0$. Assume that there exist $\rho > 0$ and $v \in R^N$ such that for all $t \in [1, T]$,
       $\rho \le \frac{y_t (v \cdot x_t)}{\|v\|_2}.$
     Then the number of updates made by the Perceptron algorithm when processing $x_1, \dots, x_T$ is bounded by $r^2 / \rho^2$.
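     Below is a small numerical sanity check of Theorem 1 on synthetic data: points are sampled, filtered to respect a chosen margin ρ with respect to a unit vector v, and the Perceptron is cycled over the data until an error-free pass; the total number of updates is then compared with r²/ρ². All constants (N, T, ρ, the seed) are arbitrary choices for the example.

     import numpy as np

     rng = np.random.default_rng(0)
     N, T = 5, 200
     v = rng.normal(size=N); v /= np.linalg.norm(v)      # unit vector defining the labels

     rho = 0.1
     X = rng.uniform(-1, 1, size=(T, N))
     X = X[np.abs(X @ v) >= rho]                         # keep only points with margin >= rho
     y = np.where(X @ v >= 0, 1, -1)
     r = np.linalg.norm(X, axis=1).max()

     w = np.zeros(N)
     updates = 0
     while True:                                         # cycle until an error-free pass
         mistakes = 0
         for x_t, y_t in zip(X, y):
             if y_t * (w @ x_t) <= 0:                    # misclassified (boundary counts as a mistake)
                 w += y_t * x_t
                 updates += 1
                 mistakes += 1
         if mistakes == 0:
             break

     print(updates, "<=", r**2 / rho**2)                 # the bound of Theorem 1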

  10. Proof

      Let I be the subset of the T rounds at which there is an update, and let M = |I| be the total number of updates.
      $$
      \begin{aligned}
      M\rho &\le \sum_{t \in I} \frac{y_t (v \cdot x_t)}{\|v\|_2} = \frac{v}{\|v\|_2} \cdot \sum_{t \in I} y_t x_t && (\rho \le y_t (v \cdot x_t)/\|v\|_2) \\
      &\le \Big\| \sum_{t \in I} y_t x_t \Big\| && \text{(Cauchy-Schwarz inequality)} \\
      &= \Big\| \sum_{t \in I} (w_{t+1} - w_t) \Big\| && \text{(definition of updates)} \\
      &= \| w_{T+1} \| && \text{(telescoping sum, } w_0 = 0) \\
      &= \sqrt{\sum_{t \in I} \big( \|w_{t+1}\|^2 - \|w_t\|^2 \big)} && \text{(telescoping sum, } w_0 = 0) \\
      &= \sqrt{\sum_{t \in I} \big( \|w_t + y_t x_t\|^2 - \|w_t\|^2 \big)} && \text{(definition of updates)} \\
      &= \sqrt{\sum_{t \in I} \big( 2 y_t (w_t \cdot x_t) + \|x_t\|^2 \big)} \le \sqrt{\sum_{t \in I} \|x_t\|^2} \le \sqrt{M r^2},
      \end{aligned}
      $$
      where the last line uses that $y_t (w_t \cdot x_t) \le 0$ on update rounds. Hence $M \le r^2/\rho^2$.

  11. Remarks
      - The Perceptron algorithm is simple.
      - The bound on the number of updates depends only on the margin ρ (we may assume r = 1) and is independent of the dimension N.
      - This O(1/ρ²) bound is tight for the Perceptron algorithm.
      - It may be very slow when ρ is small, and we may need multiple passes over the data.
      - It loops forever if the data is not separable.

  12. Upper Bound on the Number of Mistakes: Inseparable Case

      Theorem 2. Let $x_1, \dots, x_T \in R^N$ be a sequence of T points with $\|x_t\| \le r$ for all $t \in [1, T]$, for some $r > 0$. Let $\rho > 0$ and $v \in R^N$ with $\|v\| = 1$. Define the deviation of $x_t$ by $d_t = \max\{0, \rho - y_t (v \cdot x_t)\}$ and let $\delta = \sqrt{\sum_{t=1}^{T} d_t^2}$. Then the number of updates made by the Perceptron algorithm when processing $x_1, \dots, x_T$ is bounded by $(r + \delta)^2 / \rho^2$.

      Key idea: construct data points in a higher-dimensional space that are separable and have the same prediction behavior as in the original space.
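      For reference, the quantities of Theorem 2 can be computed directly; the sketch below evaluates the deviations d_t, δ, and the resulting bound (r + δ)²/ρ² for given data, labels, a unit vector v, and a margin ρ (the function name is illustrative).

      import numpy as np

      def perceptron_inseparable_bound(X, y, v, rho):
          # X: (T, N) points, y: labels in {-1, +1}, v: unit vector, rho > 0.
          margins = y * (X @ v)
          d = np.maximum(0.0, rho - margins)        # per-round deviations d_t
          delta = np.sqrt(np.sum(d ** 2))           # delta = sqrt(sum_t d_t^2)
          r = np.linalg.norm(X, axis=1).max()
          return (r + delta) ** 2 / rho ** 2        # the bound of Theorem 2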

  13. Proof

      We first reduce the problem to the separable case by mapping each data point $x_t \in R^N$ to a higher-dimensional vector $x'_t \in R^{N+T}$:
        $x_t = (x_{t,1}, \dots, x_{t,N})^\top \;\mapsto\; x'_t = (x_{t,1}, \dots, x_{t,N}, 0, \dots, \Delta, \dots, 0)^\top$, with $\Delta$ in the (N+t)-th component,
        $v = (v_1, \dots, v_N)^\top \;\mapsto\; v' = (v_1/Z, \dots, v_N/Z,\; y_1 d_1/(\Delta Z), \dots, y_T d_T/(\Delta Z))^\top.$
      To make $\|v'\| = 1$, we take $Z = \sqrt{1 + \delta^2/\Delta^2}$. The predictions made by the Perceptron for $x'_t$, $t \in [1, T]$, then coincide with those made in the original space for $x_t$.
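      The construction can be checked numerically: the sketch below builds x'_t and v' as on the slide and verifies that ‖v'‖ = 1 and that the lifted points are separated with margin at least ρ/Z. The synthetic data and the choice Δ = 1 are assumptions made for the example.

      import numpy as np

      def lift(X, y, v, rho, Delta):
          # Map x_t in R^N to x'_t in R^{N+T} and build v' as on the slide.
          T, N = X.shape
          d = np.maximum(0.0, rho - y * (X @ v))            # deviations d_t
          delta = np.sqrt(np.sum(d ** 2))
          Z = np.sqrt(1.0 + delta ** 2 / Delta ** 2)
          X_lift = np.hstack([X, Delta * np.eye(T)])        # Delta in the (N+t)-th slot
          v_lift = np.concatenate([v / Z, y * d / (Delta * Z)])
          return X_lift, v_lift, Z

      # Sanity check: ||v'|| = 1 and lifted margin at least rho / Z.
      rng = np.random.default_rng(1)
      T, N, rho, Delta = 50, 4, 0.2, 1.0
      v = rng.normal(size=N); v /= np.linalg.norm(v)
      X = rng.uniform(-1, 1, size=(T, N))
      y = np.where(X @ v >= 0, 1, -1)                       # labels, possibly violating the margin
      X_lift, v_lift, Z = lift(X, y, v, rho, Delta)
      assert np.isclose(np.linalg.norm(v_lift), 1.0)
      assert np.all(y * (X_lift @ v_lift) >= rho / Z - 1e-9)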

  14. Proof (cont'd)

      $y_t (v' \cdot x'_t) = y_t \Big( \frac{v \cdot x_t}{Z} + \Delta \frac{y_t d_t}{\Delta Z} \Big) = \frac{y_t (v \cdot x_t)}{Z} + \frac{d_t}{Z} \ge \frac{y_t (v \cdot x_t)}{Z} + \frac{\rho - y_t (v \cdot x_t)}{Z} = \frac{\rho}{Z}$

      So $x'_1, \dots, x'_T$ are linearly separable with margin $\rho/Z$. Noting that $\|x'_t\|^2 \le r^2 + \Delta^2$ and using Theorem 1, the number of updates made by the Perceptron algorithm is bounded by
        $\frac{(r^2 + \Delta^2)(1 + \delta^2/\Delta^2)}{\rho^2}.$
      Choosing $\Delta^2$ to minimize the bound leads to $\Delta^2 = r\delta$, and the bound becomes
        $\frac{(r + \delta)^2}{\rho^2}.$

  15. Dual Perceptron

      For the original Perceptron, we can write the separating hyperplane as
        $w = \sum_{s=1}^{T} \alpha_s y_s x_s,$
      where $\alpha_s$ is incremented by one whenever the prediction on $x_s$ does not match the correct label. The algorithm can then be written as:

      Algorithm 4 DUAL-PERCEPTRON(α_0)
        α ← α_0                              ⊲ typically α_0 = 0
        for t ← 1 to T do
          RECEIVE(x_t)
          ŷ_t ← sgn(Σ_{s=1}^{T} α_s y_s (x_s · x_t))
          RECEIVE(y_t)
          if ŷ_t ≠ y_t then
            α_t ← α_t + 1
          else
            α_t ← α_t
        return α
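      A minimal Python sketch of the dual Perceptron, keeping one counter α_s per training point; labels are assumed to be in {−1, +1}, and an optional number of passes over the data is added for convenience.

      import numpy as np

      def dual_perceptron(X, y, passes=1):
          # The hypothesis is w = sum_s alpha_s * y_s * x_s, so prediction only
          # needs inner products with the stored examples.
          T = len(X)
          alpha = np.zeros(T)
          for _ in range(passes):
              for t in range(T):
                  score = np.sum(alpha * y * (X @ X[t]))   # sgn(sum_s alpha_s y_s x_s . x_t)
                  y_hat = 1 if score >= 0 else -1
                  if y_hat != y[t]:
                      alpha[t] += 1                        # increment the counter on a mistake
          return alpha

      Since α starts at zero, summing over all s is equivalent to summing only over the examples already processed.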

  16. Kernel Perceptron

      Algorithm 5 KERNEL-PERCEPTRON(α_0)
        α ← α_0                              ⊲ typically α_0 = 0
        for t ← 1 to T do
          RECEIVE(x_t)
          ŷ_t ← sgn(Σ_{s=1}^{T} α_s y_s K(x_s, x_t))
          RECEIVE(y_t)
          if ŷ_t ≠ y_t then
            α_t ← α_t + 1
          else
            α_t ← α_t
        return α

      Any PDS (positive definite symmetric) kernel can be used.
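      The kernelized version only changes the inner product; the sketch below uses a Gaussian (RBF) kernel as one possible PDS kernel, which is an illustrative choice rather than something prescribed by the slide.

      import numpy as np

      def rbf_kernel(x, z, gamma=1.0):
          # K(x, z) = exp(-gamma ||x - z||^2), a PDS kernel.
          return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

      def kernel_perceptron(X, y, kernel=rbf_kernel, passes=1):
          # Same counters as the dual Perceptron, with K(x_s, x_t) replacing x_s . x_t.
          T = len(X)
          alpha = np.zeros(T)
          for _ in range(passes):
              for t in range(T):
                  score = sum(alpha[s] * y[s] * kernel(X[s], X[t]) for s in range(T))
                  y_hat = 1 if score >= 0 else -1
                  if y_hat != y[t]:
                      alpha[t] += 1
          return alpha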

  17. Winnow Algorithm

      Algorithm 6 WINNOW(η)
        w_1 ← 1/N
        for t ← 1 to T do
          RECEIVE(x_t)
          ŷ_t ← sgn(w_t · x_t)
          RECEIVE(y_t)
          if ŷ_t ≠ y_t then
            Z_t ← Σ_{i=1}^{N} w_{t,i} exp(η y_t x_{t,i})
            for i ← 1 to N do
              w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
          else
            w_{t+1} ← w_t
        return w_{T+1}
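      A minimal Python sketch of Algorithm 6, with the weights kept on the probability simplex and updated multiplicatively only when a mistake is made; labels are assumed to be in {−1, +1}.

      import numpy as np

      def winnow(X, y, eta):
          # X: (T, N) points, y: (T,) labels in {-1, +1}, eta: step size.
          T, N = X.shape
          w = np.full(N, 1.0 / N)                  # w_1 = 1/N
          for t in range(T):
              y_hat = 1 if w @ X[t] >= 0 else -1   # sgn(w_t . x_t)
              if y_hat != y[t]:
                  w = w * np.exp(eta * y[t] * X[t])
                  w /= w.sum()                     # divide by Z_t to renormalize
          return w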

  18. Upper Bound on the Number of Mistakes: Separable Case

      Theorem 3. Let $x_1, \dots, x_T \in R^N$ be a sequence of T points with $\|x_t\|_\infty \le r_\infty$ for all $t \in [1, T]$, for some $r_\infty > 0$. Assume that there exist $\rho_\infty > 0$ and $v \in R^N$ such that for all $t \in [1, T]$,
        $\rho_\infty \le \frac{y_t (v \cdot x_t)}{\|v\|_1}.$
      Then, for $\eta = \rho_\infty / r_\infty^2$, the number of updates made by the Winnow algorithm when processing $x_1, \dots, x_T$ is upper bounded by $2 (r_\infty^2 / \rho_\infty^2) \log N$.
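      As with Theorem 1, the bound can be checked numerically. The sketch below builds synthetic data satisfying the ℓ1/ℓ∞ margin condition with respect to a nonnegative v (an extra assumption made here so the comparison vector lies on the simplex), runs Winnow with η = ρ_∞/r_∞² until an error-free pass, and compares the update count with 2(r_∞²/ρ_∞²) log N; all constants are arbitrary.

      import numpy as np

      rng = np.random.default_rng(0)
      N, T = 20, 300
      v = np.abs(rng.normal(size=N)); v /= v.sum()         # nonnegative v with ||v||_1 = 1

      rho_inf = 0.05
      X = rng.uniform(-1, 1, size=(T, N))
      X = X[np.abs(X @ v) >= rho_inf]                      # keep points with margin >= rho_inf
      y = np.where(X @ v >= 0, 1, -1)
      r_inf = np.abs(X).max()

      eta = rho_inf / r_inf ** 2                           # the step size of Theorem 3
      w = np.full(N, 1.0 / N)
      updates = 0
      while True:                                          # cycle until an error-free pass
          mistakes = 0
          for x_t, y_t in zip(X, y):
              if (1 if w @ x_t >= 0 else -1) != y_t:
                  w *= np.exp(eta * y_t * x_t)
                  w /= w.sum()
                  updates += 1
                  mistakes += 1
          if mistakes == 0:
              break

      print(updates, "<=", 2 * (r_inf / rho_inf) ** 2 * np.log(N))   # Theorem 3 bound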

  19. Proof

      Let I be the subset of the T rounds at which there is an update, and let M = |I| be the total number of updates. The potential function $\Phi_t$ is the relative entropy between the distribution defined by the normalized weights $v_i / \|v\|_1$, $i \in [1, N]$, and the one defined by the components of the weight vector $w_t$:
        $\Phi_t = \sum_{i=1}^{N} \frac{v_i}{\|v\|_1} \log \frac{v_i / \|v\|_1}{w_{t,i}}.$
