SLIDE 1 Introduction to Machine Learning
- 25. Multiplicative Updates, Games and Boosting
Alex Smola Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701
SLIDE 2 Multiplicative updates and experts
http://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf
SLIDE 3 Finding an expert
http://xkcd.com/451/
SLIDE 4 Finding an expert
- Pool of Experts E
- At each time step
- Each expert e makes a prediction f_{e,t}
- We observe the outcome y_t
- Goals
- Find the expert who gets things right
- Predict well in the meantime
SLIDE 5-11 Halving algorithm (worked example, figures only)
SLIDE 12 Halving algorithm
- Start with pool of experts E
- Predict with the majority of experts
- Observe outcome
- Discard those that got it wrong
- Theorem
The algorithm makes at most log_2 |E| errors.
Each time we make an error, at least half of the remaining experts are removed; otherwise the majority would have been correct. Since the best expert is never removed, this can happen at most log_2 |E| times.
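The discard-the-wrong-half loop above can be sketched in a few lines; the name `halving_predict` and the encoding of experts as lists of precomputed predictions are my assumptions, not from the slides.

```python
def halving_predict(experts, outcomes):
    """Halving algorithm: predict with the majority vote of the surviving
    experts, then discard every expert that was wrong.  Assumes at least
    one expert in the pool is always right, so the pool never empties."""
    alive = set(range(len(experts)))
    mistakes = 0
    for t, y in enumerate(outcomes):
        votes = [experts[e][t] for e in alive]
        if max(set(votes), key=votes.count) != y:   # majority was wrong
            mistakes += 1
        alive = {e for e in alive if experts[e][t] == y}  # fire wrong experts
    return mistakes
```

Whenever the majority is wrong, at least half of the surviving experts voted wrong and get discarded, which is exactly the log_2 |E| mistake bound of the theorem.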
SLIDE 13 Predicting as well as the best expert
- Experts (can) make mistakes
So we shouldn’t fire them immediately
- Can we predict nearly as well as the
best expert in the pool?
- Regret - error relative to best expert
Our predictions need not match any expert!
L_e(t) := \sum_{\tau=1}^{t} l(y_\tau, f_{e,\tau})  and  R(\hat{y}, t) := L(\hat{y}, t) - \min_{e'} L_{e'}(t)
SLIDE 14-20 Weighted Majority (worked example, figures only)
SLIDE 21 Weighted majority
- Binary loss (1 if wrong, 0 if correct)
- Experts initially have some weight w_{e,0}
- For all observations do
- Predict using weighted majority
\hat{y}_t = \mathrm{sgn}\left(\frac{\sum_e w_{e,t} y_{e,t}}{\sum_e w_{e,t}}\right)
- Observe outcome and reweight wrong experts
w_{e,t+1} = \beta w_{e,t}
- Alternative variant draws from pool of experts
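The prediction and reweighting rules on this slide can be sketched directly; `weighted_majority` is a hypothetical name, and experts are assumed to be given as precomputed sequences of ±1 predictions.

```python
def weighted_majority(expert_preds, outcomes, beta=0.5):
    """Weighted majority: predict sgn(sum_e w_e * y_{e,t}), observe the
    outcome, then multiply the weight of every wrong expert by beta.
    expert_preds[e][t] and outcomes[t] are +1/-1."""
    w = [1.0] * len(expert_preds)          # uniform initial weights
    mistakes = 0
    for t, y in enumerate(outcomes):
        score = sum(w[e] * expert_preds[e][t] for e in range(len(w)))
        if (1 if score >= 0 else -1) != y:
            mistakes += 1
        # downweight only the experts that got this round wrong
        w = [w[e] * (beta if expert_preds[e][t] != y else 1.0)
             for e in range(len(w))]
    return mistakes, w
```

Setting beta = 0 recovers the halving algorithm (wrong experts are silenced forever); beta close to 1 forgives mistakes more gently.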
SLIDE 22 Weighted majority analysis
- Update equation for mistakes
w_{e,t+1} = \beta w_{e,t}
- Total expert weight
W_t := \sum_e w_{e,t}
- We incur a loss when the majority gets it wrong; then at least half of the total weight is multiplied by \beta:
W_{t+1} \le \frac{1}{2} W_t + \frac{\beta}{2} W_t, hence W_t \le \left(\frac{1+\beta}{2}\right)^{\hat{L}_t} W_0
- For each expert we have the bound
w_{e,t+1} = w_{e,0} \beta^{L_e(t)} \le W_{t+1}
- Solving for loss yields
\hat{L}_t \le \frac{L_e(t) \log \beta^{-1} + \log W_0 - \log w_{e,0}}{\log \frac{2}{1+\beta}}
SLIDE 23 Weighted majority analysis
- Solving for loss yields
\hat{L}_t \le \frac{L_e(t) \log \beta^{-1} + \log W_0 - \log w_{e,0}}{\log \frac{2}{1+\beta}}
- Small downweighting leads to small regret in the long term.
- Initially give uniform weight to all experts (this is where you could hide your prior). For w_{e,0} = 1 and W_0 = n this gives
\hat{L}_t \le \frac{L_e(t) \log \beta^{-1}}{\log \frac{2}{1+\beta}} + \frac{\log n}{\log \frac{2}{1+\beta}}
- Converges to the best expert exponentially fast!
SLIDE 24 Multiplicative Updates
- Multiply by loss expert e would incur at time t
w_{e,t+1} = w_{e,t} e^{-\eta l(f_{e,t}, y_t)} = w_{e,0} e^{-\eta L_e(t)}
- Lower bound for all experts e
W_{t+1} > w_{e,t+1}
- Hoeffding bound (rather nonstandard variant)
E[e^{\eta X}] \le e^{\eta E[X] + \eta^2/8}
- Upper bound
W_{t+1} = W_t \sum_e \frac{w_{e,t}}{W_t} e^{-\eta l(f_{e,t}, y_t)} \le W_t e^{-\eta l_t + \eta^2/8} \le W_0 e^{-\eta L(\hat{y}, t) + t\eta^2/8}
- Solving for the loss, with \eta = \sqrt{8 \log n / t}:
L(\hat{y}, t) \le L_e(t) + \frac{\eta t}{8} + \eta^{-1} \log n
SLIDE 25 Multiplicative Updates
- Multiply by loss expert e would incur at time t
w_{e,t+1} = w_{e,t} e^{-\eta l(f_{e,t}, y_t)} = w_{e,0} e^{-\eta L_e(t)}
- Lower bound for all experts e
W_{t+1} > w_{e,t+1}
- Hoeffding bound (rather nonstandard variant)
E[e^{\eta X}] \le e^{\eta E[X] + \eta^2/8}
- Upper bound
L(\hat{y}, t) \le L_e(t) + \sqrt{\tfrac{1}{2} t \log n}
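With general losses in [0, 1] this multiplicative update is the Hedge algorithm. A minimal sketch, assuming the per-round losses of all experts are given as a list of vectors; the name `hedge` is mine.

```python
import math

def hedge(loss_rows, eta):
    """Multiplicative weights (Hedge): play the normalized weight vector
    as a distribution over experts, then update
    w_e <- w_e * exp(-eta * loss_e).  Returns the algorithm's total
    expected loss L(yhat, t)."""
    n = len(loss_rows[0])
    w = [1.0] * n                        # w_{e,0} = 1, so W_0 = n
    total = 0.0
    for losses in loss_rows:             # one loss in [0, 1] per expert
        W = sum(w)
        total += sum(w[e] * losses[e] for e in range(n)) / W
        w = [w[e] * math.exp(-eta * losses[e]) for e in range(n)]
    return total
```

With eta = sqrt(8 log n / t) the total loss stays within sqrt(t log n / 2) of the best expert, matching the bound on this slide.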
SLIDE 26
Application to Boosting
SLIDE 27 Boosting intuition
- Data set (x_i, y_i)
- Weak learners that perform better than random for a weighted distribution of the data (w_i, x_i, y_i)
L(w, f) := \frac{1}{m} \sum_{i=1}^{m} w_i y_i f(x_i) \ge \frac{1}{2} + \gamma
- Combine weak learners to get a strong learner
f = \frac{1}{t} \sum_{\tau=1}^{t} f_\tau
- How do we weigh instances to generate good weak learners? Idea - find difficult instances.
SLIDE 28 Non-adaptive Boosting
- Data set (w_i, x_i, y_i)
- For t iterations do
- Invoke weak learner
f_t = \mathrm{argmax}_f L(w_{t-1}, f) = \mathrm{argmax}_f \frac{1}{m} \sum_{i=1}^{m} w_{i,t-1} y_i f(x_i)
- Reweight instances (reduce weight if we got things right)
w_{i,t} = w_{i,t-1} e^{-\alpha y_i f_t(x_i)}
- Output linear combination
f = \frac{1}{t} \sum_{\tau=1}^{t} f_\tau
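The loop on this slide can be sketched with a pluggable weak learner; the interface (a `weak_learner(X, y, w)` callable returning a classifier) and the name `nonadaptive_boost` are my assumptions.

```python
import numpy as np

def nonadaptive_boost(X, y, weak_learner, rounds, alpha):
    """Non-adaptive boosting: call the weak learner on the current
    instance weights, downweight instances it got right by exp(-alpha),
    and output the sign of the uniform average of all learners."""
    w = np.ones(len(y)) / len(y)
    learners = []
    for _ in range(rounds):
        h = weak_learner(X, y, w)           # should maximize sum_i w_i y_i h(x_i)
        w = w * np.exp(-alpha * y * h(X))   # reduce weight where we were right
        w /= w.sum()
        learners.append(h)
    return lambda Xq: np.sign(np.mean([h(Xq) for h in learners], axis=0))
```

The fixed stepsize alpha is what makes this variant non-adaptive; AdaBoost on slide 30 replaces it with a per-round alpha_t.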
SLIDE 29 Boosting Analysis
- For mistakes (majority wrong) we have
w_{i,t} \ge e^{-\alpha t/2}, hence W_t \ge |\{i : f(x_i) y_i \le 0\}| \, e^{-\alpha t/2}
- Upper bound on weight
W_t = \sum_i w_{i,t-1} e^{-\alpha y_i f_t(x_i)} \le W_{t-1} e^{-\alpha(\gamma + 1/2) + \alpha^2/8} \le n e^{-\alpha t(\gamma + 1/2) + t\alpha^2/8}
- Combining upper and lower bound
n_{\mathrm{errors}} \le n e^{-t(\alpha\gamma - \alpha^2/8)}, hence n_{\mathrm{errors}} \le n e^{-2t\gamma^2} for \alpha = 4\gamma
Error vanishes exponentially fast.
SLIDE 30 AdaBoost
- Refine algorithm by weighting functions
- Adaptive in the performance of the weak learner
- Error for weighted observations
\epsilon_t := \frac{1}{n} \sum_{i=1}^{n} w_{i,t} \, \frac{1}{2}\left(1 - y_i f_t(x_i)\right)
- Stepsize and weight
\alpha_t := \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}, \quad f = \sum_t \alpha_t f_t, \quad w_{i,t} = w_{i,0} e^{-\sum_t \alpha_t y_i f_t(x_i)}
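A from-scratch sketch of AdaBoost using the slide's epsilon_t and alpha_t; the decision-stump weak learner, the normalization of the instance weights, and the names `stump_learner`/`adaboost` are my choices, not prescribed by the slides.

```python
import numpy as np

def stump_learner(X, y, w):
    """Weak learner: the axis-aligned decision stump (feature, threshold,
    sign) minimizing the weighted error on (w, X, y) with y in {-1, +1}."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] >= thresh, 1, -1)
                err = w @ (pred != y)
                if err < best[0]:
                    best = (err, (j, thresh, sign))
    return best[1]

def adaboost(X, y, rounds=10):
    """AdaBoost: upweight the observations the current stump got wrong,
    and weight each stump by alpha_t = 0.5 * log((1 - eps_t) / eps_t)."""
    n = len(y)
    w = np.ones(n) / n
    ensemble = []
    for _ in range(rounds):
        j, thresh, sign = stump_learner(X, y, w)
        pred = sign * np.where(X[:, j] >= thresh, 1, -1)
        eps = max(w @ (pred != y), 1e-12)      # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)  # stepsize alpha_t
        w = w * np.exp(-alpha * y * pred)      # multiplicative reweighting
        w /= w.sum()
        ensemble.append((alpha, j, thresh, sign))
    def predict(Xq):
        score = sum(a * s * np.where(Xq[:, j] >= t, 1, -1)
                    for a, j, t, s in ensemble)
        return np.sign(score)
    return predict
```

Even though a single stump cannot separate the interval-shaped toy problem below, a few boosting rounds can.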
SLIDE 31 Usage
- Simple classifiers (weak learners)
- Linear threshold functions
- Decision trees (simple ones)
- Neural networks
- Do not boost SVMs. Why?
Boosting the Margin, Schapire et al http://goo.gl/aLCSO
Boosting noisy data for too long overfits; fix e.g. by limiting the weight of individual observations.
SLIDE 32
Application to Game Theory
SLIDE 33
Games
SLIDE 34 Games
Rock-scissors-paper payoff matrix for the row player (win = 1, loss = -1, draw = 0):
           rock  scissors  paper
rock        0       1       -1
scissors   -1       0        1
paper       1      -1        0
SLIDE 35 Games
- Game
- Row player picks i, column player picks j
Each player k receives outcome M_{ij,k}
- Zero sum game has M_{ij,1} = -M_{ij,2}
(my gain is your loss)
- How to play
- Deterministic strategy usually not optimal
- Distribution over actions
- Nash equilibrium
Players have no incentive to change policy
SLIDE 36 Games
- von Neumann minimax theorem
\min_{x \in P} \max_{y \in P} x^\top M y = \max_{y \in P} \min_{x \in P} x^\top M y
- Proof
The inner maximum is attained at a vertex of the simplex, hence
\min_{x \in P} \max_j [x^\top M]_j = \min_{x \in P} \max_{y \in P} x^\top M y.
Apply linear programming duality to get
\min_{x \in P} \max_j [x^\top M]_j = \max_{y \in P} \min_i [M y]_i.
Apply the vertex property again to complete the proof:
\max_{y \in P} \min_i [M y]_i = \max_{y \in P} \min_{x \in P} x^\top M y.
SLIDE 37 Finding a Nash equilibrium approximately
- Repeated game (initial distribution p_0 for player)
- For t rounds do
- Opponent picks best distribution q_{t+1} using p_t
- Player updates action distribution p_{t+1} using
p_{i,t+1} \propto p_{i,0} \exp\left(-\eta \sum_{\tau=1}^{t} [M q_\tau]_i\right)
- Regret bound tells us that
\frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau \le \min_i \frac{1}{t} \sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-1/2})
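The repeated game above can be sketched for rock-scissors-paper, with the row player running the multiplicative update on cumulative losses [M q_tau]_i and the opponent best-responding each round. The name `approx_nash` is hypothetical, and subtracting the minimum cumulative loss before exponentiating is a numerical-stability choice of mine.

```python
import numpy as np

def approx_nash(M, rounds=4000, eta=0.03):
    """Approximate Nash equilibrium of a zero-sum game (row player
    minimizes p^T M q) via multiplicative weights against a
    best-responding opponent; returns the averaged strategies."""
    n, m = M.shape
    cum_loss = np.zeros(n)
    p_avg, q_avg = np.zeros(n), np.zeros(m)
    for _ in range(rounds):
        z = cum_loss - cum_loss.min()     # stabilize the exponentials
        p = np.exp(-eta * z)
        p /= p.sum()
        q = np.zeros(m)
        q[np.argmax(p @ M)] = 1.0         # opponent's best response to p
        cum_loss += M @ q                 # per-action losses [M q]_i
        p_avg += p
        q_avg += q
    return p_avg / rounds, q_avg / rounds
```

For rock-scissors-paper the averaged strategies approach the uniform equilibrium, with exploitability shrinking at the O(t^{-1/2}) rate of the regret bound.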
SLIDE 38 Finding a Nash equilibrium approximately
- Regret bound
\frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau \le \min_i \frac{1}{t} \sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-1/2}) = \min_p \frac{1}{t} \sum_{\tau=1}^{t} p^\top M q_\tau + O(t^{-1/2})
- By construction of the algorithm we have
\min_p \max_q p^\top M q \le \max_q \frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q \le \frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau
- Combining this yields
\max_q \frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q \le \max_q \min_p p^\top M q + O(t^{-1/2})
SLIDE 39 Simplified algorithm
- Repeated game (initial distribution p_0 for player)
- For t rounds do
- Opponent picks best action q_{t+1} using p_t
- Player updates action distribution p_{t+1} using
p_{i,t+1} \propto p_{i,0} \exp\left(-\eta \sum_{\tau=1}^{t} [M q_\tau]_i\right)
- Regret bound tells us that
\frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau \le \min_i \frac{1}{t} \sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-1/2})
SLIDE 40
Application to Particle Filtering
SLIDE 41 Sequential Monte Carlo
- Recall particle filter idea (simplified)
- Observe data in sequence
- At each step approximate the distribution p(\theta|x_{1:n}) by weighted samples from the posterior
p(\theta|x_{1:n+1}) \propto p(x_{n+1}|\theta, x_{1:n}) p(\theta|x_{1:n})
- Assuming conditional independence x_i \perp x_j \mid \theta, the weight update is multiplicative:
w_{i,n+1} = w_{i,n} \, p(x_{n+1}|\theta) = w_{i,n} \, e^{\log p(x_{n+1}|\theta)}
SLIDE 42 Sequential Monte Carlo
- Experts: loss, weights, convergence as good as the best expert (good news)
- Particle Filters: neg. log-likelihood, weights, convergence until only the best particle is left (bad news)
- Need to resample
- Adaptively find better solution
SLIDE 43 Sequential Monte Carlo
- On a chain
- Observe data in sequence
- Fill in latent variables in sequence
- Bayes Rule
p(x_{n+1}, \theta_{n+1}|x_{1:n}, \theta_{1:n}) = p(x_{n+1}|x_{1:n}, \theta_{1:n}) \, p(\theta_{n+1}|x_{1:n+1}, \theta_{1:n})
- Sample latent parameter
\theta_{n+1} \sim p(\theta_{n+1}|x_{1:n+1}, \theta_{1:n})
- Update particle weight with the prediction "error"
w_{i,n+1} = w_{i,n} \, p(x_{n+1}|\theta_{1:n}, x_{1:n})
SLIDE 44 Sequential Monte Carlo
- Clustering
- Observe data in sequence
- We assume a collapsed representation (integrate out natural parameters)
- For each particle do (naive method)
- Compute data likelihood (for reweighting)
p(x_{n+1}|x_{1:n}, y_{1:n}) = \sum_{y_{n+1}} p(x_{n+1}|x_{1:n}, y_{1:n}, y_{n+1}) \, p(y_{n+1}|x_{1:n}, y_{1:n})
- Draw cluster ID from
p(y_{n+1}|x_{1:n+1}, y_{1:n}) \propto p(x_{n+1}|x_{1:n}, y_{1:n}, y_{n+1}) \, p(y_{n+1}|x_{1:n}, y_{1:n})
- Resample particles if too skewed.
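The multiplicative reweighting plus resample-when-skewed pattern of these slides can be sketched on the simplest conjugate example (posterior over a Gaussian mean) rather than the clustering model; the resampling trigger (effective sample size below half the particle count) and the name `smc_posterior_mean` are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def smc_posterior_mean(xs, n_particles=500, sigma=1.0):
    """Sequential Monte Carlo for the mean theta of a Gaussian with known
    variance sigma^2 and prior theta ~ N(0, 1): weights are multiplied by
    the likelihood p(x_{n+1}|theta) as data arrives, and particles are
    resampled whenever the effective sample size collapses."""
    theta = rng.normal(0.0, 1.0, n_particles)    # samples from the prior
    w = np.ones(n_particles) / n_particles
    for x in xs:
        # multiplicative update: w_i <- w_i * p(x | theta_i)
        w *= np.exp(-0.5 * (x - theta) ** 2 / sigma ** 2)
        w /= w.sum()
        ess = 1.0 / np.sum(w ** 2)               # effective sample size
        if ess < n_particles / 2:                # too skewed: resample
            idx = rng.choice(n_particles, size=n_particles, p=w)
            theta = theta[idx]
            w = np.ones(n_particles) / n_particles
    return float(w @ theta)
```

Without the resampling step, the weight concentrates on a single particle exactly as the bad-news column of slide 42 warns.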
SLIDE 45 Resampling generates tree
Canini et al, 2009 http://jmlr.csail.mit.edu/proceedings/papers/v5/canini09a/canini09a.pdf
SLIDE 46
Updates
- Boosting
- Game Theory
- Particle Filtering
SLIDE 47 What we missed
- Neural Networks
- Reinforcement learning
- Bandits
- Collaborative Filtering
- Scalability
- Kalman Filter / HMM
- Lasso / l1 programming / sparse reconstruction
- ...