SLIDE 1
Online Learning
Fei Xia, Language Technology Institute
feixia@cs.cmu.edu
March 16, 2015

Outline
- Introduction
  - Why online learning
  - Basics of online learning
- Prediction with expert advice
  - Halving algorithm
  - Weighted majority
SLIDE 2
SLIDE 3
Why online learning?
- In many cases, data arrives sequentially while predictions are required on the fly
- Online algorithms do not require any distributional assumption
- Applicable in adversarial environments
- Simple algorithms
- Theoretical guarantees
SLIDE 4
Introduction
Basic properties:
- Instead of learning from a training set and then testing on a test set, the online learning scenario mixes the training and test phases.
- Instead of assuming that data points are sampled i.i.d. from a fixed distribution for both training and testing, online learning makes no distributional assumption.
- Instead of learning a hypothesis with small generalization error, an online learning algorithm is measured by its mistakes and its regret.
SLIDE 5
Introduction
Basic setting:
For t = 1, 2, ..., T:
- Receive an instance x_t ∈ X
- Make a prediction ŷ_t ∈ Y
- Receive the true label y_t ∈ Y
- Suffer loss L(ŷ_t, y_t)

Objective: minimize the cumulative loss Σ_{t=1}^T L(ŷ_t, y_t)
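The protocol above can be sketched in a few lines of code. This is a minimal illustration, assuming zero-one loss and a trivial constant predictor; all names are illustrative, not from the slides.

```python
def zero_one_loss(y_hat, y):
    """L(y_hat, y): 1 on a mistake, 0 otherwise (an illustrative loss choice)."""
    return 0 if y_hat == y else 1

def run_online(instances, labels, predict):
    """One pass of the online protocol: predict, observe the label, suffer loss."""
    total_loss = 0
    for x_t, y_t in zip(instances, labels):
        y_hat = predict(x_t)                     # make a prediction
        total_loss += zero_one_loss(y_hat, y_t)  # receive y_t, suffer loss
    return total_loss

# A constant predictor on a toy binary sequence: wrong only on the first label.
loss = run_online([1, 2, 3, 4], [0, 1, 1, 1], predict=lambda x: 1)
print(loss)  # 1
```

Note that training and testing are interleaved: every round is simultaneously a test (the loss is suffered) and a training signal (the label is revealed).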
SLIDE 6
Prediction with Expert Advice
For t = 1, 2, ..., T:
- Receive an instance x_t ∈ X
- Receive advice y_{t,i} ∈ Y, i ∈ [1, N], from N experts
- Make a prediction ŷ_t ∈ Y
- Receive the true label y_t ∈ Y
- Suffer loss L(ŷ_t, y_t)

Figure: Weather forecast, an example of a prediction problem based on expert advice [Mohri et al., 2012]
SLIDE 7
Regret Analysis
Objective: minimize the regret R_T

R_T = Σ_{t=1}^T L(ŷ_t, y_t) − min_{i∈[1,N]} Σ_{t=1}^T L(y_{t,i}, y_t)

What does low regret mean?
- We don't lose much from not knowing future events
- We can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight
- We can compete with a changing environment
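A tiny numeric illustration of the regret definition. The 0/1 losses below are invented for illustration only.

```python
# Invented 0/1 losses over T = 5 rounds with N = 2 experts.
learner_losses = [1, 0, 1, 0, 1]    # learner's per-round losses, total 3
expert_losses = [
    [1, 1, 0, 0, 1],                # expert 1: total loss 3
    [0, 0, 1, 0, 0],                # expert 2: total loss 1 (best in hindsight)
]

cumulative = sum(learner_losses)
best_expert = min(sum(row) for row in expert_losses)
regret = cumulative - best_expert
print(regret)  # 3 - 1 = 2
```

Low regret here would mean the learner's total loss stays close to 1, the loss of the best expert chosen in hindsight.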
SLIDE 8
Halving algorithm
Realizable case: after some number of rounds T, we learn the concept and no longer make errors.
Mistake bound: how many mistakes do we make before we learn a particular concept?
Maximum number of mistakes a learning algorithm A makes on a concept c (over all sequences S):

M_A(c) = max_S |mistakes(A, c)|

Maximum number of mistakes a learning algorithm A makes on a concept class C:

M_A(C) = max_{c∈C} M_A(c)
SLIDE 9
Halving algorithm
Algorithm 1 HALVING(H)
1: H_1 ← H
2: for t ← 1 to T do
3:   RECEIVE(x_t)
4:   ŷ_t ← MAJORITYVOTE(H_t, x_t)
5:   RECEIVE(y_t)
6:   if ŷ_t ≠ y_t then
7:     H_{t+1} ← {c ∈ H_t : c(x_t) = y_t}
8: return H_{T+1}
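Algorithm 1 can be sketched directly in code. The threshold hypothesis class below is an illustrative choice, not from the slides.

```python
import math

def halving(hypotheses, stream):
    """HALVING: predict by majority vote; on a mistake, drop inconsistent hypotheses."""
    active = list(hypotheses)
    mistakes = 0
    for x_t, y_t in stream:
        votes = sum(h(x_t) for h in active)
        y_hat = 1 if 2 * votes >= len(active) else 0  # majority vote
        if y_hat != y_t:
            mistakes += 1
            # keep only hypotheses consistent with the revealed label
            active = [h for h in active if h(x_t) == y_t]
    return active, mistakes

# Realizable toy case: thresholds h_k(x) = 1[x >= k] for k = 0..4; target is h_0.
H = [lambda x, k=k: int(x >= k) for k in range(5)]
data = [(0, 1), (1, 1), (3, 1)]
survivors, m = halving(H, data)
print(m)                        # 1 mistake
print(m <= math.log2(len(H)))   # True: within the log2 |H| mistake bound
```

On the single mistake (x=0, y=1), only h_0 is consistent, so the active set collapses immediately to the target.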
SLIDE 10
Halving algorithm
Theorem. Let H be a finite hypothesis set. Then M_Halving(H) ≤ log₂ |H|.

Proof. The algorithm makes predictions using a majority vote over the active set. Thus, at each mistake, the active set is reduced by at least half. Hence, after log₂ |H| mistakes, at most one active hypothesis can remain. Since we are in the realizable case, this hypothesis must coincide with the target concept, and we make no further mistakes.
SLIDE 11
Weighted majority algorithm
Algorithm 2 WEIGHTED-MAJORITY(N)
1: for i ← 1 to N do
2:   w_{1,i} ← 1
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   if Σ_{i: y_{t,i}=1} w_{t,i} ≥ Σ_{i: y_{t,i}=0} w_{t,i} then
6:     ŷ_t ← 1
7:   else
8:     ŷ_t ← 0
9:   RECEIVE(y_t)
10:  if ŷ_t ≠ y_t then
11:    for i ← 1 to N do
12:      if y_{t,i} ≠ y_t then
13:        w_{t+1,i} ← β·w_{t,i}
14:      else w_{t+1,i} ← w_{t,i}
15: return w_{T+1}
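Algorithm 2 can be sketched as follows. This is a minimal implementation of the on-mistake update above; the three-expert example data is invented for illustration.

```python
def weighted_majority(expert_advice, labels, beta=0.5):
    """WM sketch: expert_advice[t][i] in {0,1}; returns (final weights, mistakes)."""
    n = len(expert_advice[0])
    w = [1.0] * n
    mistakes = 0
    for advice, y_t in zip(expert_advice, labels):
        w1 = sum(w[i] for i in range(n) if advice[i] == 1)
        w0 = sum(w[i] for i in range(n) if advice[i] == 0)
        y_hat = 1 if w1 >= w0 else 0
        if y_hat != y_t:
            mistakes += 1
            # on a mistake, every expert that erred this round is penalized by beta
            w = [w[i] * beta if advice[i] != y_t else w[i] for i in range(n)]
    return w, mistakes

# Three experts; expert 2 (index 2) is always right.
advice = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
labels = [0, 0, 1]
w, m = weighted_majority(advice, labels)
print(m)  # 1: only the first round is mispredicted
print(w)  # [0.5, 0.5, 1.0]
```

After the single mistake, the two wrong experts are halved and the always-correct expert dominates the vote from then on.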
SLIDE 12
Weighted majority algorithm
Theorem. Fix β ∈ (0, 1). Let m_T be the number of mistakes made by algorithm WM after T ≥ 1 rounds, and let m*_T be the number of mistakes made by the best of the N experts. Then the following inequality holds:

m_T ≤ (log N + m*_T log(1/β)) / log(2/(1+β))

Proof. Introduce a potential function W_t = Σ_{i=1}^N w_{t,i}, then derive its upper and lower bounds. Since the predictions are made by weighted majority vote, if the algorithm makes an error at round t, the experts holding at least half of the total weight were wrong and have their weight multiplied by β, so

W_{t+1} ≤ ((1+β)/2) · W_t
SLIDE 13
Weighted majority algorithm
Proof (cont.) After T rounds with m_T mistakes, and since W_1 = N,

W_T ≤ ((1+β)/2)^{m_T} · N

Note that we also have W_T ≥ w_{T,i} = β^{m_{T,i}}, where m_{T,i} is the number of mistakes made by the i-th expert. Thus,

β^{m*_T} ≤ ((1+β)/2)^{m_T} · N
⇒ m_T ≤ (log N + m*_T log(1/β)) / log(2/(1+β))
SLIDE 14
Weighted majority algorithm
m_T ≤ (log N + m*_T log(1/β)) / log(2/(1+β))

- m_T ≤ O(log N) + constant × (mistakes of the best expert)
- No assumption about the sequence of samples
- The number of mistakes is roughly a constant times that of the best expert in hindsight
- When m*_T = 0, the bound reduces to m_T ≤ O(log N), the same as for the Halving algorithm
SLIDE 15
Randomized weighted majority algorithm
Drawback of the weighted majority algorithm: with zero-one loss, no deterministic algorithm can achieve regret R_T = o(T).
In the randomized scenario, a set A = {1, ..., N} of N actions is available:
- At each round t ∈ [1, T], the online algorithm selects a distribution p_t over the set of actions
- It then receives a loss vector l_t, where l_{t,i} ∈ {0, 1} is the loss associated with action i
- Define the expected loss at round t: L_t = Σ_{i=1}^N p_{t,i} l_{t,i}, and the total expected loss over T rounds: L_T = Σ_{t=1}^T L_t
- Define the total loss of action i: L_{T,i} = Σ_{t=1}^T l_{t,i}, and the minimal loss of a single action: L^min_T = min_{i∈A} L_{T,i}
SLIDE 16
Randomized weighted majority algorithm
Algorithm 3 RANDOMIZED-WEIGHTED-MAJORITY(N)
1: for i ← 1 to N do
2:   w_{1,i} ← 1
3:   p_{1,i} ← 1/N
4: for t ← 1 to T do
5:   for i ← 1 to N do
6:     if l_{t,i} = 1 then
7:       w_{t+1,i} ← β·w_{t,i}
8:     else w_{t+1,i} ← w_{t,i}
9:   W_{t+1} ← Σ_{i=1}^N w_{t+1,i}
10:  for i ← 1 to N do
11:    p_{t+1,i} ← w_{t+1,i}/W_{t+1}
12: return w_{T+1}

Note: For binary prediction, let w0 be the total weight on outcome 0 and w1 the total weight on outcome 1, with W = w0 + w1; the prediction strategy is then to predict outcome i with probability wi/W.
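Algorithm 3 can be sketched as below, tracking the expected loss L_T. The two-action loss sequence is invented for illustration, and the bound check uses the theorem that follows.

```python
import math
import random

def rwm(loss_vectors, beta=0.5, seed=0):
    """RWM sketch: l_{t,i} in {0,1}; returns the total expected loss L_T."""
    rng = random.Random(seed)
    n = len(loss_vectors[0])
    w = [1.0] * n
    expected_loss = 0.0
    for l_t in loss_vectors:
        W = sum(w)
        p = [wi / W for wi in w]                       # p_t is proportional to w_t
        _action = rng.choices(range(n), weights=p)[0]  # the action actually played
        expected_loss += sum(p[i] * l_t[i] for i in range(n))
        w = [w[i] * beta if l_t[i] == 1 else w[i] for i in range(n)]
    return expected_loss

# Two actions; action 1 suffers loss only once, so L_min_T = 1.
losses = [[1, 0], [1, 0], [0, 1], [1, 0]]
L_T = rwm(losses)
bound = math.log(2) / (1 - 0.5) + (2 - 0.5) * 1  # log N/(1-beta) + (2-beta)·L_min_T
print(L_T <= bound)  # True
```

The expected loss here is about 1.97 against a bound of about 2.89, consistent with the guarantee.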
SLIDE 17
Randomized weighted majority algorithm
Theorem. Fix β ∈ [1/2, 1). Then for any T ≥ 1, the loss of algorithm RWM on any sequence can be bounded as follows:

L_T ≤ log N/(1−β) + (2−β)·L^min_T

In particular, for β = max{1/2, 1 − √((log N)/T)}, the loss can be bounded as:

L_T ≤ L^min_T + 2√(T log N)

Proof. Define the potential function W_t = Σ_{i=1}^N w_{t,i}, t ∈ [1, T].
SLIDE 18
Proof (cont.)

W_{t+1} = Σ_{i: l_{t,i}=0} w_{t,i} + β Σ_{i: l_{t,i}=1} w_{t,i}
        = W_t + (β − 1) W_t Σ_{i: l_{t,i}=1} p_{t,i}
        = W_t (1 − (1−β) L_t)
⇒ W_{T+1} = N Π_{t=1}^T (1 − (1−β) L_t)

Note that we also have W_{T+1} ≥ max_{i∈[1,N]} w_{T+1,i} = β^{L^min_T}. Thus,

β^{L^min_T} ≤ N Π_{t=1}^T (1 − (1−β) L_t)
⇒ L^min_T log β ≤ log N − (1−β) L_T   (using log(1−x) ≤ −x)
⇒ L_T ≤ log N/(1−β) + (2−β) L^min_T   (using −log(1−x) ≤ x + x² for x ∈ [0, 1/2], with x = 1−β)

Since L^min_T ≤ T, this also implies

L_T ≤ log N/(1−β) + (1−β)T + L^min_T

Minimizing the RHS with respect to β, we get

L_T ≤ L^min_T + 2√(T log N)  ⇔  R_T ≤ 2√(T log N)
SLIDE 19
Exponential weighted average algorithm
The WM algorithm extends to loss functions L taking values in [0, 1]. The EWA algorithm here is a further extension to any loss L that is convex in its first argument.

Algorithm 4 EXPONENTIAL-WEIGHTED-AVERAGE(N)
1: for i ← 1 to N do
2:   w_{1,i} ← 1
3: for t ← 1 to T do
4:   RECEIVE(x_t)
5:   ŷ_t ← (Σ_{i=1}^N w_{t,i} y_{t,i}) / (Σ_{i=1}^N w_{t,i})
6:   RECEIVE(y_t)
7:   for i ← 1 to N do
8:     w_{t+1,i} ← w_{t,i} e^{−η L(y_{t,i}, y_t)}
9: return w_{T+1}
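Algorithm 4 can be sketched as below. The squared loss is my choice of a loss that is convex in its first argument and stays in [0, 1] for predictions and labels in [0, 1]; the two-expert example is invented for illustration.

```python
import math

def sq_loss(y_hat, y):
    """Squared loss: convex in its first argument; in [0,1] for y_hat, y in [0,1]."""
    return (y_hat - y) ** 2

def ewa(expert_predictions, labels, eta=1.0):
    """EWA sketch: predict the weighted average of the experts' predictions."""
    n = len(expert_predictions[0])
    w = [1.0] * n
    total_loss = 0.0
    for preds, y_t in zip(expert_predictions, labels):
        W = sum(w)
        y_hat = sum(w[i] * preds[i] for i in range(n)) / W
        total_loss += sq_loss(y_hat, y_t)
        # exponential update based on each expert's own loss
        w = [w[i] * math.exp(-eta * sq_loss(preds[i], y_t)) for i in range(n)]
    return total_loss

# Expert 1 always predicts the true label 1.0; expert 0 never does.
preds = [[0.0, 1.0]] * 3
labels = [1.0, 1.0, 1.0]
L = ewa(preds, labels)
print(L < math.log(2) / 1.0 + 3 / 8)  # True: best expert has loss 0, so
                                      # the regret bound log N/eta + eta*T/8 applies
```

The weight of the bad expert decays as e^{−t}, so the learner's prediction converges quickly to the perfect expert's.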
SLIDE 20
Exponential weighted average algorithm
Theorem. Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any η > 0 and any sequence y_1, ..., y_T ∈ Y, the regret of the EWA algorithm is bounded as:

R_T ≤ log N/η + ηT/8

In particular, for η = √(8 log N / T), the regret is bounded as:

R_T ≤ √((T/2) log N)

Proof. Define the potential function Φ_t = log Σ_{i=1}^N w_{t,i}, t ∈ [1, T].
SLIDE 21
Exponential weighted average algorithm
Proof. Using Hoeffding's inequality and the convexity of L in its first argument, we can prove that

Φ_{t+1} − Φ_t ≤ −η L(ŷ_t, y_t) + η²/8
⇒ Φ_{T+1} − Φ_1 ≤ −η Σ_{t=1}^T L(ŷ_t, y_t) + η²T/8

Next, we lower-bound Φ_{T+1} − Φ_1:

Φ_{T+1} − Φ_1 = log Σ_{i=1}^N e^{−η L_{T,i}} − log N
             ≥ log max_{i∈[1,N]} e^{−η L_{T,i}} − log N
             = −η min_{i∈[1,N]} L_{T,i} − log N

Combining the lower and upper bounds, we get

Σ_{t=1}^T L(ŷ_t, y_t) − min_{i∈[1,N]} L_{T,i} ≤ log N/η + ηT/8
SLIDE 22
Exponential weighted average algorithm
The optimal choice of η requires knowledge of T, which is a disadvantage of this analysis. How can this be solved? The doubling trick: divide time into periods [2^k, 2^{k+1} − 1] of length 2^k, k = 0, ..., n, and choose η_k = √(8 log N / 2^k) within period k. This leads to the following theorem.

Theorem. Assume that the loss function L is convex in its first argument and takes values in [0, 1]. Then, for any T ≥ 1 and any sequence y_1, ..., y_T ∈ Y, the regret of the EWA algorithm after T rounds is bounded as follows:

R_T ≤ (√2/(√2 − 1)) √((T/2) log N) + √((log N)/2)
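The η schedule of the doubling trick can be sketched as follows; the function names are illustrative.

```python
import math

def doubling_etas(T, n_experts):
    """eta for rounds t = 1..T: round t lies in period k = floor(log2 t),
    and eta_k = sqrt(8 log N / 2^k) is tuned to the period length 2^k."""
    etas = []
    for t in range(1, T + 1):
        k = t.bit_length() - 1  # floor(log2 t), exact for positive integers
        etas.append(math.sqrt(8 * math.log(n_experts) / 2 ** k))
    return etas

etas = doubling_etas(8, n_experts=4)
# periods: round 1 | rounds 2-3 | rounds 4-7 | round 8  (k = 0, 1, 2, 3)
print(etas[1] == etas[2] and etas[3] == etas[6])  # True: eta is constant per period
```

Restarting EWA at each period boundary with η_k pays only the constant factor √2/(√2 − 1) ≈ 3.41 over the bound obtained with known T.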
SLIDE 23
Summary
- In many cases, data arrives sequentially while predictions are required on the fly
- Online algorithms do not require any distributional assumption
- Applicable in adversarial environments
- Simple algorithms
- Theoretical guarantees
SLIDE 24