SLIDE 1

Online Learning II

Presenter: Adams Wei Yu

Carnegie Mellon University

Mar 2015

SLIDE 2

Recap of Online Learning

The data comes sequentially, and we do not need to assume anything about the data distribution: this is the adversarial setting (worst-case analysis). Regret minimization:

R_T = Σ_{t=1}^T L(ŷ_t, y_t) − min_{i∈{1,...,N}} Σ_{t=1}^T L(ŷ_{t,i}, y_t)

Several simple algorithms come with theoretical guarantees (Halving, Weighted Majority, Randomized Weighted Majority, Exponential Weighted Average).

SLIDE 3

Weighted Majority Algorithm

Algorithm 1 WEIGHTED-MAJORITY(N)

1: for i ← 1 to N do
2:   w_{1,i} ← 1
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i} then
6:     ŷ_t ← 1
7:   else
8:     ŷ_t ← 0
9:   RECEIVE(y_t)
10:  if ŷ_t ≠ y_t then
11:    for i ← 1 to N do
12:      if y_{t,i} ≠ y_t then
13:        w_{t+1,i} ← β w_{t,i}
14:      else w_{t+1,i} ← w_{t,i}
15: return w_{T+1}
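
A minimal Python sketch of Algorithm 1, assuming the experts' predictions are given as a 0/1 matrix (the names expert_preds and labels are illustrative, not from the slides):

import numpy as np

def weighted_majority(expert_preds, labels, beta=0.5):
    """Weighted Majority: down-weight experts that err on mistake rounds.
    expert_preds: (T, N) array of 0/1 expert predictions.
    labels: (T,) array of 0/1 true labels."""
    T, N = expert_preds.shape
    w = np.ones(N)                              # w_{1,i} <- 1
    mistakes = 0
    for t in range(T):
        vote_1 = w[expert_preds[t] == 1].sum()  # weighted vote for label 1
        vote_0 = w[expert_preds[t] == 0].sum()  # weighted vote for label 0
        y_hat = 1 if vote_1 >= vote_0 else 0
        if y_hat != labels[t]:                  # update only on a mistake
            mistakes += 1
            wrong = expert_preds[t] != labels[t]
            w[wrong] *= beta                    # w_{t+1,i} <- beta w_{t,i}
    return w, mistakes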

SLIDE 4

Randomized Weighted Majority Algorithm

Algorithm 2 RANDOMIZED-WEIGHTED-MAJORITY(N)

1: for i ← 1 to N do
2:   w_{1,i} ← 1; p_{1,i} ← 1/N
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   p_1 ← Σ_{i: y_{t,i}=1} p_{t,i};  p_0 ← Σ_{i: y_{t,i}=0} p_{t,i}
6:   Draw u ∼ Uniform(0,1)
7:   if u < p_1 then
8:     ŷ_t ← 1
9:   else
10:    ŷ_t ← 0
11:  for i ← 1 to N do
12:    if l_{t,i} = 1 then        ⊲ expert i errs, i.e. y_{t,i} ≠ y_t
13:      w_{t+1,i} ← β w_{t,i}
14:    else w_{t+1,i} ← w_{t,i}
15:  W_{t+1} ← Σ_{i=1}^N w_{t+1,i}
16:  for i ← 1 to N do
17:    p_{t+1,i} ← w_{t+1,i} / W_{t+1}
18: return w_{T+1}
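
A corresponding sketch of Algorithm 2; the prediction is now randomized according to the normalized weights, and the update runs on every round (again, variable names are illustrative):

import numpy as np

def randomized_weighted_majority(expert_preds, labels, beta=0.5, seed=0):
    """Randomized Weighted Majority: predict 1 with probability p_1,
    then apply the same multiplicative update."""
    rng = np.random.default_rng(seed)
    T, N = expert_preds.shape
    w = np.ones(N)
    p = w / N                                   # p_{1,i} <- 1/N
    for t in range(T):
        p1 = p[expert_preds[t] == 1].sum()      # mass on experts voting 1
        y_hat = 1 if rng.uniform() < p1 else 0
        err = expert_preds[t] != labels[t]      # l_{t,i} = 1 iff expert i errs
        w[err] *= beta
        p = w / w.sum()                         # renormalize: p_{t+1} = w / W_{t+1}
    return w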

SLIDE 5

Topics today

Perceptron algorithm and its mistake bound. Winnow algorithm and its mistake bound. Conversion from online to batch algorithms, with analysis.

SLIDE 6

Perceptron Algorithm

Algorithm 3 PERCEPTRON(w_0)

1: for t ← 1 to T do
2:   RECEIVE(x_t)
3:   ŷ_t ← sgn(w_t · x_t)
4:   RECEIVE(y_t)
5:   if ŷ_t ≠ y_t then
6:     w_{t+1} ← w_t + y_t x_t        ⊲ more generally η y_t x_t
7:   else w_{t+1} ← w_t
8: return w_{T+1}

If x_t is misclassified, then y_t (w_t · x_t) is negative. After one update, y_t (w_{t+1} · x_t) = y_t (w_t · x_t) + η ‖x_t‖₂², so the term y_t (w_t · x_t) is corrected by η ‖x_t‖₂².
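
A short Python sketch of Algorithm 3, assuming labels in {−1, +1} (a minimal illustration, not the full setup of the slides):

import numpy as np

def perceptron(X, y, eta=1.0):
    """Perceptron: additive update w <- w + eta y_t x_t on each mistake.
    X: (T, N) array of points; y: (T,) labels in {-1, +1}."""
    T, N = X.shape
    w = np.zeros(N)                      # start from w = 0
    updates = 0
    for t in range(T):
        if np.sign(w @ X[t]) != y[t]:    # mistake (sign(0) counts as one)
            w += eta * y[t] * X[t]       # raises y_t (w . x_t) by eta ||x_t||^2
            updates += 1
    return w, updates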

SLIDE 7

Another Point of View: Stochastic Gradient Descent

The Perceptron algorithm can be seen as finding the minimizer of an objective function F:

F(w) = (1/T) Σ_{t=1}^T max(0, −y_t (w · x_t)) = E_{x∼D̂}[F(w, x)]

where F(w, x) = max(0, −f(x)(w · x)), with f(x) the label of x and D̂ the empirical distribution of the sample (x_1, ..., x_T). F(w) is convex in w.

SLIDE 8

Another Point of View: Stochastic Gradient Descent

w_{t+1} ← w_t − η ∇_w F(w_t, x_t)  if F(w, x_t) is differentiable at w_t;
w_{t+1} ← w_t                      otherwise.

Note that for F(w, x_t) = max(0, −y_t (w · x_t)),

∇_w F(w, x_t) = −y_t x_t  if y_t (w · x_t) < 0;
∇_w F(w, x_t) = 0         if y_t (w · x_t) > 0.

It follows that

w_{t+1} ← w_t + η y_t x_t  if y_t (w_t · x_t) < 0;
w_{t+1} ← w_t              if y_t (w_t · x_t) > 0;
w_{t+1} ← w_t              otherwise.
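
The case analysis above amounts to the following one-step function, assuming labels in {−1, +1} (at the non-differentiable point y_t (w · x_t) = 0 the weight is left unchanged, as on the slide):

import numpy as np

def sgd_step(w, x_t, y_t, eta=1.0):
    """One stochastic (sub)gradient step on F(w, x_t) = max(0, -y_t (w . x_t))."""
    if y_t * (w @ x_t) < 0:          # misclassified: gradient is -y_t x_t
        return w + eta * y_t * x_t
    return w                          # correct side or boundary: no change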

SLIDE 9

Upper Bound on the Number of Mistakes: Separable Case

Theorem 1

Let x_1, ..., x_T ∈ R^N be a sequence of T points with ‖x_t‖ ≤ r for all t ∈ [1, T], for some r > 0. Assume that there exist ρ > 0 and v ∈ R^N such that for all t ∈ [1, T], ρ ≤ y_t (v · x_t)/‖v‖. Then the number of updates made by the Perceptron algorithm when processing x_1, ..., x_T is bounded by r²/ρ².

SLIDE 10

Proof

I: the subset of the T rounds at which there is an update. M: the total number

  • f updates, i.e. |I| = M.

Mρ ≤v ·

t∈I ytxt

v (ρ ≤ yt(v · xt) v ) ≤

  • t∈I

ytxt (Cauchy-Schwarz inequality) =

  • t∈I

(wt+1 − wt) (definition of updates) =wT+1 (telescope sum, w0 = 0) =

  • t∈I

wt+12 − wt2 (telescope sum, w0 = 0) =

  • t∈I

wt + ytxt2 − wt2 (definition of updates) =

  • t∈I

2ytwt · xt + xt2 ≤

  • t∈I

xt2 ≤ √ Mr 2 ⇒ M ≤ r 2/ρ2

SLIDE 11

Remarks

The Perceptron algorithm is simple. The bound on the number of updates depends only on the margin ρ (we may assume r = 1) and is independent of the dimension N. This O(1/ρ²) bound is tight for the Perceptron algorithm. The algorithm may be very slow when ρ is small, and multiple passes over the data may be needed. It loops forever if the data is not separable.

SLIDE 12

Upper Bound on the Number of Mistakes: Inseparable Case

Theorem 2

Let x_1, ..., x_T ∈ R^N be a sequence of T points with ‖x_t‖ ≤ r for all t ∈ [1, T], for some r > 0. Let ρ > 0 and v ∈ R^N with ‖v‖ = 1. Define the deviation of x_t by d_t = max{0, ρ − y_t (v · x_t)} and let δ = (Σ_{t=1}^T d_t²)^{1/2}. Then the number of updates made by the Perceptron algorithm when processing x_1, ..., x_T is bounded by (r + δ)²/ρ².

Key idea: construct data points in a higher-dimensional space that are separable and have the same prediction behavior as the original points.

SLIDE 13

Proof

We first reduce the problem to the separable case by mapping each point x_t ∈ R^N to a higher-dimensional vector x′_t ∈ R^{N+T}:

x_t = (x_{t,1}, ..., x_{t,N})ᵀ → x′_t = (x_{t,1}, ..., x_{t,N}, 0, ..., Δ, ..., 0)ᵀ

where the (N+t)-th component of x′_t equals Δ and the remaining new components are 0. The vector v is mapped to

v = (v_1, ..., v_N)ᵀ → v′ = (v_1/Z, ..., v_N/Z, y_1 d_1/(ΔZ), ..., y_T d_T/(ΔZ))ᵀ

To make ‖v′‖ = 1, we set Z = √(1 + δ²/Δ²).

Then the predictions made by the Perceptron for x′_t, t ∈ [1, T], coincide with those made in the original space for x_t.

SLIDE 14

Proof (con’t)

y_t (v′ · x′_t) = y_t (v · x_t)/Z + Δ y_t² d_t/(ΔZ) = y_t (v · x_t)/Z + d_t/Z ≥ y_t (v · x_t)/Z + (ρ − y_t (v · x_t))/Z = ρ/Z

So x′_1, ..., x′_T is linearly separable with margin ρ/Z. Noting that ‖x′_t‖² ≤ r² + Δ² and applying Theorem 1, the number of updates made by the Perceptron algorithm is bounded by

(r² + Δ²)(1 + δ²/Δ²)/ρ²

Choosing Δ² to minimize this bound leads to Δ² = rδ, and the bound becomes (r + δ)²/ρ².

SLIDE 15

Dual Perceptron

For the original Perceptron, we can write the separating hyperplane as

w = Σ_{s=1}^T α_s y_s x_s

where α_s is incremented by one each time the prediction on x_s does not match the correct label. The algorithm can then be written as:

Algorithm 4 DUAL-PERCEPTRON(α_0)

1: α ← α_0            ⊲ typically α_0 = 0
2: for t ← 1 to T do
3:   RECEIVE(x_t)
4:   ŷ_t ← sgn(Σ_{s=1}^T α_s y_s (x_s · x_t))
5:   RECEIVE(y_t)
6:   if ŷ_t ≠ y_t then
7:     α_t ← α_t + 1
8:   else α_t ← α_t
9: return α
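
A sketch of Algorithm 4 in Python; it also shows how the primal weight vector is recovered from the dual coefficients (names are illustrative):

import numpy as np

def dual_perceptron(X, y):
    """Dual Perceptron: keep one counter alpha_s per training point.
    X: (T, N) array; y: (T,) labels in {-1, +1}."""
    T = len(X)
    alpha = np.zeros(T)
    G = X @ X.T                              # Gram matrix of inner products
    for t in range(T):
        if np.sign((alpha * y) @ G[:, t]) != y[t]:
            alpha[t] += 1                    # increment on each mistake
    return alpha

# The primal hyperplane is recovered as w = sum_s alpha_s y_s x_s:
# w = (alpha * y) @ X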

SLIDE 16

Kernel Perceptron

Algorithm 5 KERNEL-PERCEPTRON(α_0)

1: α ← α_0            ⊲ typically α_0 = 0
2: for t ← 1 to T do
3:   RECEIVE(x_t)
4:   ŷ_t ← sgn(Σ_{s=1}^T α_s y_s K(x_s, x_t))
5:   RECEIVE(y_t)
6:   if ŷ_t ≠ y_t then
7:     α_t ← α_t + 1
8:   else α_t ← α_t
9: return α

Any positive definite symmetric (PDS) kernel can be used.
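
Replacing the inner product with a kernel gives a direct sketch of Algorithm 5; the Gaussian (RBF) kernel below is one standard PDS choice (the gamma parameter is illustrative):

import numpy as np

def kernel_perceptron(X, y, kernel):
    """Kernel Perceptron: Algorithm 4 with x_s . x_t replaced by K(x_s, x_t)."""
    T = len(X)
    alpha = np.zeros(T)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    for t in range(T):
        if np.sign((alpha * y) @ K[:, t]) != y[t]:
            alpha[t] += 1
    return alpha

def rbf(a, b, gamma=1.0):
    """Gaussian kernel, a PDS kernel: K(a, b) = exp(-gamma ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))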

SLIDE 17

Winnow Algorithm

Algorithm 6 WINNOW(η)

1: w_1 ← 1/N
2: for t ← 1 to T do
3:   RECEIVE(x_t)
4:   ŷ_t ← sgn(w_t · x_t)
5:   RECEIVE(y_t)
6:   if ŷ_t ≠ y_t then
7:     Z_t ← Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i})
8:     for i ← 1 to N do
9:       w_{t+1,i} ← w_{t,i} exp(η y_t x_{t,i}) / Z_t
10:  else w_{t+1} ← w_t
11: return w_{T+1}
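
A Python sketch of Algorithm 6; the update is multiplicative and the weights stay normalized (a minimal illustration with labels in {−1, +1}):

import numpy as np

def winnow(X, y, eta):
    """Winnow: multiplicative update w_i <- w_i exp(eta y_t x_{t,i}) / Z_t
    on each mistake. X: (T, N); y: (T,) in {-1, +1}."""
    T, N = X.shape
    w = np.ones(N) / N                    # w_1 = 1/N
    for t in range(T):
        if np.sign(w @ X[t]) != y[t]:
            w *= np.exp(eta * y[t] * X[t])
            w /= w.sum()                  # divide by Z_t
    return w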

SLIDE 18

Upper Bound on the Number of Mistakes: Separable Case

Theorem 3

Let x_1, ..., x_T ∈ R^N be a sequence of T points with ‖x_t‖_∞ ≤ r_∞ for all t ∈ [1, T], for some r_∞ > 0. Assume that there exist ρ_∞ > 0 and v ∈ R^N such that for all t ∈ [1, T], ρ_∞ ≤ y_t (v · x_t)/‖v‖₁. Then, for η = ρ_∞/r_∞², the number of updates made by the Winnow algorithm when processing x_1, ..., x_T is upper bounded by 2 (r_∞²/ρ_∞²) log N.

SLIDE 19

Proof

I: the subset of the T rounds at which there is an update. M: the total number of updates, i.e. |I| = M. The potential function Φ_t is the relative entropy between the distribution defined by the normalized weights v_i/‖v‖₁, i ∈ [1, N], and the one defined by the components of the weight vector w_t:

Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log( (v_i/‖v‖₁) / w_{t,i} )

SLIDE 20

Proof(con’t)

For any round t ∈ I,

Φ_{t+1} − Φ_t = Σ_{i=1}^N (v_i/‖v‖₁) log(w_{t,i}/w_{t+1,i})
            = Σ_{i=1}^N (v_i/‖v‖₁) log( Z_t / exp(η y_t x_{t,i}) )
            = log Z_t − η Σ_{i=1}^N (v_i/‖v‖₁) y_t x_{t,i}
            ≤ log( Σ_{i=1}^N w_{t,i} exp(η y_t x_{t,i}) ) − η ρ_∞
            = log E_{w_t}[exp(η y_t x_t)] − η ρ_∞
            ≤ log exp(η² (2 r_∞)²/8) − η ρ_∞    (Hoeffding's lemma, using E_{w_t}[y_t x_t] = y_t (w_t · x_t) ≤ 0 on an update round)
            = η² r_∞²/2 − η ρ_∞

SLIDE 21

Proof(con’t)

Summing this inequality over the M update rounds gives

Φ_{T+1} − Φ_1 ≤ M (η² r_∞²/2 − η ρ_∞)

We also have an upper bound on Φ_1:

Φ_1 = Σ_{i=1}^N (v_i/‖v‖₁) log( (v_i/‖v‖₁) / (1/N) ) = log N + Σ_{i=1}^N (v_i/‖v‖₁) log(v_i/‖v‖₁) ≤ log N

and, since the relative entropy Φ_{T+1} is nonnegative,

Φ_{T+1} − Φ_1 ≥ 0 − log N = − log N

So − log N ≤ M (η² r_∞²/2 − η ρ_∞). Setting η = ρ_∞/r_∞² yields the statement of the theorem.

SLIDE 22

Remarks

For both Perceptron and Winnow, a norm ‖·‖_p is used for the input vectors x_t and the dual norm ‖·‖_q for the separating hyperplane v (1/p + 1/q = 1). Perceptron: p = q = 2; Winnow: p = ∞, q = 1. The Winnow bound is favorable when a sparse set of experts can predict well, and the Perceptron bound is more favorable in the opposite situation. For example, if v = e_1 = (1, 0, ..., 0) ∈ R^N and x_t ∈ {−1, 1}^N, then the upper bound for Winnow is of order log N while that for Perceptron is N.
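
To make the example concrete, here is the arithmetic for a large N (using r = √N, ρ = 1 for the Perceptron and r_∞ = ρ_∞ = 1 for Winnow, which is what v = e_1 and x_t ∈ {−1, 1}^N give):

import math

N = 10**6                          # number of experts / dimension
perceptron_bound = N               # r^2 / rho^2 = N, since r = sqrt(N), rho = 1
winnow_bound = 2 * math.log(N)     # 2 (r_inf^2 / rho_inf^2) log N = 2 log N
print(perceptron_bound, winnow_bound)   # 1000000 vs. about 27.6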

SLIDE 23

Online to Batch Conversion

What if we assume the data is drawn from some unknown distribution? Can these online algorithms be used to derive hypotheses with small generalization error in the standard stochastic setting?

SLIDE 24

Basic Setting

Let S = ((x_1, y_1), ..., (x_T, y_T)) be drawn i.i.d. from some fixed but unknown distribution D. The sample is sequentially processed by an online learning algorithm A: it starts with an initial hypothesis h_1 ∈ H and generates a new hypothesis h_{i+1} ∈ H after processing the pair (x_i, y_i), i ∈ [1, T]. The regret is defined as

R_T = Σ_{i=1}^T L(h_i(x_i), y_i) − min_{h∈H} Σ_{i=1}^T L(h(x_i), y_i)

The generalization error of h ∈ H is its expected loss

R(h) = E_{(x,y)∼D}[L(h(x), y)]

SLIDE 25

Theoretical results

Lemma 4

Let S = ((x_1, y_1), ..., (x_T, y_T)) be drawn i.i.d. from some fixed but unknown distribution D, let L be a loss bounded by M, and let h_1, ..., h_T be the sequence of hypotheses generated by an online algorithm sequentially processing S. Then, for any δ > 0, with probability at least 1 − δ, the following holds:

(1/T) Σ_{i=1}^T R(h_i) ≤ (1/T) Σ_{i=1}^T L(h_i(x_i), y_i) + M √(2 log(1/δ)/T)

SLIDE 26

Theoretical results

Theorem 5

Let S = ((x_1, y_1), ..., (x_T, y_T)) be drawn i.i.d. from some fixed but unknown distribution D, let L be a loss bounded by M and convex with respect to its first argument, and let h_1, ..., h_{T+1} be the sequence of hypotheses generated by an online algorithm sequentially processing S. Then, for any δ > 0, with probability at least 1 − δ, each of the following holds:

R( (1/T) Σ_{i=1}^T h_i ) ≤ (1/T) Σ_{i=1}^T L(h_i(x_i), y_i) + M √(2 log(1/δ)/T)

R( (1/T) Σ_{i=1}^T h_i ) ≤ inf_{h∈H} R(h) + R_T/T + 2M √(2 log(2/δ)/T)

SLIDE 27

Proof

By the convexity of L with respect to its first argument, for any (x, y),

L( (1/T) Σ_{i=1}^T h_i(x), y ) ≤ (1/T) Σ_{i=1}^T L(h_i(x), y)

Taking expectations over (x, y) ∼ D, we get

R( (1/T) Σ_{i=1}^T h_i ) ≤ (1/T) Σ_{i=1}^T R(h_i)

The first inequality then follows immediately from the previous lemma.

SLIDE 28

Proof(con’t)

By the definition of the regret R_T, for any δ > 0, the following holds with probability at least 1 − δ/2:

R( (1/T) Σ_{i=1}^T h_i ) ≤ (1/T) Σ_{i=1}^T L(h_i(x_i), y_i) + M √(2 log(2/δ)/T)
                        ≤ min_{h∈H} (1/T) Σ_{i=1}^T L(h(x_i), y_i) + R_T/T + M √(2 log(2/δ)/T)

By the definition of inf_{h∈H} R(h), for any ε > 0 there exists h* ∈ H with R(h*) ≤ inf_{h∈H} R(h) + ε. By Hoeffding's inequality, for any δ > 0, with probability at least 1 − δ/2,

(1/T) Σ_{i=1}^T L(h*(x_i), y_i) ≤ R(h*) + M √(2 log(2/δ)/T)

SLIDE 29

Proof(con’t)

Thus, for any ε > 0, by the union bound, the following holds with probability at least 1 − δ:

R( (1/T) Σ_{i=1}^T h_i ) ≤ (1/T) Σ_{i=1}^T L(h*(x_i), y_i) + R_T/T + M √(2 log(2/δ)/T)
                        ≤ R(h*) + M √(2 log(2/δ)/T) + R_T/T + M √(2 log(2/δ)/T)
                        = R(h*) + R_T/T + 2M √(2 log(2/δ)/T)
                        ≤ inf_{h∈H} R(h) + ε + R_T/T + 2M √(2 log(2/δ)/T)

Since ε > 0 is arbitrary, we have

R( (1/T) Σ_{i=1}^T h_i ) ≤ inf_{h∈H} R(h) + R_T/T + 2M √(2 log(2/δ)/T)

SLIDE 30

Application to Exponential Weighted Average Algorithm

Assume the loss function is bounded by M = 1. Recall that the regret bound of the Exponential Weighted Average algorithm is

R_T ≤ √((T/2) log N)

Substituting into the previous theorem, we get

R( (1/T) Σ_{i=1}^T h_i ) ≤ inf_{h∈H} R(h) + √(log N/(2T)) + 2 √(2 log(2/δ)/T)
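
As a concrete illustration of the conversion, the sketch below averages the iterates of an online linear learner (here the Perceptron, so h_i(x) = w_i · x and averaging hypotheses reduces to averaging weight vectors); this is an assumption-laden toy, not part of the slides:

import numpy as np

def online_to_batch(X, y, eta=1.0):
    """Run an online learner over S and return the averaged hypothesis
    (1/T) sum_{i=1}^T h_i, whose risk Theorem 5 bounds."""
    T, N = X.shape
    w = np.zeros(N)
    w_sum = np.zeros(N)
    for t in range(T):
        w_sum += w                        # h_t is the hypothesis before round t
        if np.sign(w @ X[t]) != y[t]:     # Perceptron update on a mistake
            w += eta * y[t] * X[t]
    return w_sum / T                      # averaged hypothesis for batch use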

SLIDE 31

Summary

Perceptron algorithm and its mistake bound. Winnow algorithm and its mistake bound. Conversion from online to batch algorithms.
