SLIDE 1 Introduction to Machine Learning
- 25. Multiplicative Updates, Games and Boosting
Alex Smola Carnegie Mellon University
http://alex.smola.org/teaching/cmu2013-10-701
SLIDE 2 Multiplicative updates and experts
http://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf
SLIDE 3 Finding an expert
http://xkcd.com/451/
SLIDE 4 Finding an expert
- Pool of Experts E
- At each time step
- Each expert e makes a prediction f_{e,t}
- We observe the outcome y_t
- Goals
- Find the expert who gets things right
- Predict well in the meantime
SLIDE 5-11 Halving algorithm (worked example, figures only)
SLIDE 12 Halving algorithm
- Start with pool of experts E
- Predict with the majority of experts
- Observe outcome
- Discard those that got it wrong
- Theorem
The algorithm makes at most log_2 |E| errors.
Each time we make an error, at least half of the remaining experts are removed; otherwise the majority would have been correct. Since the best expert is never removed, this can happen at most log_2 |E| times.
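The discard-the-wrong-half loop above can be sketched in a few lines; the name `halving_predict` and the encoding of experts as lists of precomputed predictions are my assumptions, not from the slides.

```python
def halving_predict(experts, outcomes):
    """Halving algorithm: predict with the majority vote of the surviving
    experts, then discard every expert that was wrong.  Assumes at least
    one expert in the pool is always right, so the pool never empties."""
    alive = set(range(len(experts)))
    mistakes = 0
    for t, y in enumerate(outcomes):
        votes = [experts[e][t] for e in alive]
        if max(set(votes), key=votes.count) != y:   # majority was wrong
            mistakes += 1
        alive = {e for e in alive if experts[e][t] == y}  # fire wrong experts
    return mistakes
```

Whenever the majority is wrong, at least half of the surviving experts voted wrong and get discarded, which is exactly the log_2 |E| mistake bound of the theorem.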
SLIDE 13 Predicting as well as the best expert
- Experts (can) make mistakes
So we shouldn’t fire them immediately
- Can we predict nearly as well as the
best expert in the pool?
- Regret - error relative to best expert
Our predictions need not match any expert!
L_e(t) := \sum_{\tau=1}^{t} l(y_\tau, f_{e,\tau})  and  R(\hat{y}, t) := L(\hat{y}, t) - \min_{e'} L_{e'}(t)
SLIDE 14-20 Weighted Majority (worked example, figures only)
SLIDE 21 Weighted majority
- Binary loss (1 if wrong, 0 if correct)
- Experts initially have some weight w_{e,0}
- For all observations do
- Predict using weighted majority
\hat{y}_t = \mathrm{sgn}\left(\frac{\sum_e w_{e,t} y_{e,t}}{\sum_e w_{e,t}}\right)
- Observe outcome and reweight wrong experts
w_{e,t+1} = \beta w_{e,t}
- Alternative variant draws from pool of experts
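The prediction and reweighting rules on this slide can be sketched directly; `weighted_majority` is a hypothetical name, and experts are assumed to be given as precomputed sequences of ±1 predictions.

```python
def weighted_majority(expert_preds, outcomes, beta=0.5):
    """Weighted majority: predict sgn(sum_e w_e * y_{e,t}), observe the
    outcome, then multiply the weight of every wrong expert by beta.
    expert_preds[e][t] and outcomes[t] are +1/-1."""
    w = [1.0] * len(expert_preds)          # uniform initial weights
    mistakes = 0
    for t, y in enumerate(outcomes):
        score = sum(w[e] * expert_preds[e][t] for e in range(len(w)))
        if (1 if score >= 0 else -1) != y:
            mistakes += 1
        # downweight only the experts that got this round wrong
        w = [w[e] * (beta if expert_preds[e][t] != y else 1.0)
             for e in range(len(w))]
    return mistakes, w
```

Setting beta = 0 recovers the halving algorithm (wrong experts are silenced forever); beta close to 1 forgives mistakes more gently.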
SLIDE 22 Weighted majority analysis
- Update equation for mistakes
w_{e,t+1} = \beta w_{e,t}
- Total expert weight
W_t := \sum_e w_{e,t}
- We incur a loss when the majority gets it wrong; then at least half of the total weight is multiplied by \beta:
W_{t+1} \le \frac{1}{2} W_t + \frac{\beta}{2} W_t, hence W_t \le \left(\frac{1+\beta}{2}\right)^{\hat{L}_t} W_0
- For each expert we have the bound
w_{e,t+1} = w_{e,0} \beta^{L_e(t)} \le W_{t+1}
- Solving for loss yields
\hat{L}_t \le \frac{L_e(t) \log \beta^{-1} + \log W_0 - \log w_{e,0}}{\log \frac{2}{1+\beta}}
SLIDE 23 Weighted majority analysis
- Solving for loss yields
\hat{L}_t \le \frac{L_e(t) \log \beta^{-1} + \log W_0 - \log w_{e,0}}{\log \frac{2}{1+\beta}}
- Small downweighting leads to small regret in the long term.
- Initially give uniform weight to all experts (this is where you could hide your prior). For w_{e,0} = 1 and W_0 = n this gives
\hat{L}_t \le \frac{L_e(t) \log \beta^{-1}}{\log \frac{2}{1+\beta}} + \frac{\log n}{\log \frac{2}{1+\beta}}
- Converges to the best expert exponentially fast!
SLIDE 24 Multiplicative Updates
- Multiply by loss expert e would incur at time t
w_{e,t+1} = w_{e,t} e^{-\eta l(f_{e,t}, y_t)} = w_{e,0} e^{-\eta L_e(t)}
- Lower bound for all experts e
W_{t+1} > w_{e,t+1}
- Hoeffding bound (rather nonstandard variant)
E[e^{\eta X}] \le e^{\eta E[X] + \eta^2/8}
- Upper bound
W_{t+1} = W_t \sum_e \frac{w_{e,t}}{W_t} e^{-\eta l(f_{e,t}, y_t)} \le W_t e^{-\eta l_t + \eta^2/8} \le W_0 e^{-\eta L(\hat{y}, t) + t\eta^2/8}
- Solving for the loss, with \eta = \sqrt{8 \log n / t}:
L(\hat{y}, t) \le L_e(t) + \frac{\eta t}{8} + \eta^{-1} \log n
SLIDE 25 Multiplicative Updates
- Multiply by loss expert e would incur at time t
w_{e,t+1} = w_{e,t} e^{-\eta l(f_{e,t}, y_t)} = w_{e,0} e^{-\eta L_e(t)}
- Lower bound for all experts e
W_{t+1} > w_{e,t+1}
- Hoeffding bound (rather nonstandard variant)
E[e^{\eta X}] \le e^{\eta E[X] + \eta^2/8}
- Upper bound
L(\hat{y}, t) \le L_e(t) + \sqrt{\tfrac{1}{2} t \log n}
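With general losses in [0, 1] this multiplicative update is the Hedge algorithm. A minimal sketch, assuming the per-round losses of all experts are given as a list of vectors; the name `hedge` is mine.

```python
import math

def hedge(loss_rows, eta):
    """Multiplicative weights (Hedge): play the normalized weight vector
    as a distribution over experts, then update
    w_e <- w_e * exp(-eta * loss_e).  Returns the algorithm's total
    expected loss L(yhat, t)."""
    n = len(loss_rows[0])
    w = [1.0] * n                        # w_{e,0} = 1, so W_0 = n
    total = 0.0
    for losses in loss_rows:             # one loss in [0, 1] per expert
        W = sum(w)
        total += sum(w[e] * losses[e] for e in range(n)) / W
        w = [w[e] * math.exp(-eta * losses[e]) for e in range(n)]
    return total
```

With eta = sqrt(8 log n / t) the total loss stays within sqrt(t log n / 2) of the best expert, matching the bound on this slide.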
SLIDE 26
Application to Boosting
SLIDE 27 Boosting intuition
- Data set (x_i, y_i)
- Weak learners that perform better than random for a weighted distribution of the data (w_i, x_i, y_i)
L(w, f) := \frac{1}{m} \sum_{i=1}^{m} w_i y_i f(x_i) \ge \frac{1}{2} + \gamma
- Combine weak learners to get a strong learner
f = \frac{1}{t} \sum_{\tau=1}^{t} f_\tau
- How do we weigh instances to generate good weak learners? Idea - find difficult instances.
SLIDE 28 Non-adaptive Boosting
- Data set (w_i, x_i, y_i)
- For t iterations do
- Invoke weak learner
f_t = \mathrm{argmax}_f L(w_{t-1}, f) = \mathrm{argmax}_f \frac{1}{m} \sum_{i=1}^{m} w_{i,t-1} y_i f(x_i)
- Reweight instances (reduce weight if we got things right)
w_{i,t} = w_{i,t-1} e^{-\alpha y_i f_t(x_i)}
- Output linear combination
f = \frac{1}{t} \sum_{\tau=1}^{t} f_\tau
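The loop on this slide can be sketched with a pluggable weak learner; the interface (a `weak_learner(X, y, w)` callable returning a classifier) and the name `nonadaptive_boost` are my assumptions.

```python
import numpy as np

def nonadaptive_boost(X, y, weak_learner, rounds, alpha):
    """Non-adaptive boosting: call the weak learner on the current
    instance weights, downweight instances it got right by exp(-alpha),
    and output the sign of the uniform average of all learners."""
    w = np.ones(len(y)) / len(y)
    learners = []
    for _ in range(rounds):
        h = weak_learner(X, y, w)           # should maximize sum_i w_i y_i h(x_i)
        w = w * np.exp(-alpha * y * h(X))   # reduce weight where we were right
        w /= w.sum()
        learners.append(h)
    return lambda Xq: np.sign(np.mean([h(Xq) for h in learners], axis=0))
```

The fixed stepsize alpha is what makes this variant non-adaptive; AdaBoost on slide 30 replaces it with a per-round alpha_t.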
SLIDE 29 Boosting Analysis
- For mistakes (majority wrong) we have
w_{i,t} \ge e^{-\alpha t/2}, hence W_t \ge |\{i : f(x_i) y_i \le 0\}| \, e^{-\alpha t/2}
- Upper bound on weight
W_t = \sum_i w_{i,t-1} e^{-\alpha y_i f_t(x_i)} \le W_{t-1} e^{-\alpha(\gamma + 1/2) + \alpha^2/8} \le n e^{-\alpha t(\gamma + 1/2) + t\alpha^2/8}
- Combining upper and lower bound
n_{\mathrm{errors}} \le n e^{-t(\alpha\gamma - \alpha^2/8)}, hence n_{\mathrm{errors}} \le n e^{-2t\gamma^2} for \alpha = 4\gamma
Error vanishes exponentially fast.
SLIDE 30 AdaBoost
- Refine algorithm by weighting functions
- Adaptive in the performance of the weak learner
- Error for weighted observations
\epsilon_t := \frac{1}{n} \sum_{i=1}^{n} w_{i,t} \, \frac{1}{2}\left(1 - y_i f_t(x_i)\right)
- Stepsize and weight
\alpha_t := \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}, \quad f = \sum_t \alpha_t f_t, \quad w_{i,t} = w_{i,0} e^{-\sum_t \alpha_t y_i f_t(x_i)}
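A from-scratch sketch of AdaBoost using the slide's epsilon_t and alpha_t; the decision-stump weak learner, the normalization of the instance weights, and the names `stump_learner`/`adaboost` are my choices, not prescribed by the slides.

```python
import numpy as np

def stump_learner(X, y, w):
    """Weak learner: the axis-aligned decision stump (feature, threshold,
    sign) minimizing the weighted error on (w, X, y) with y in {-1, +1}."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] >= thresh, 1, -1)
                err = w @ (pred != y)
                if err < best[0]:
                    best = (err, (j, thresh, sign))
    return best[1]

def adaboost(X, y, rounds=10):
    """AdaBoost: upweight the observations the current stump got wrong,
    and weight each stump by alpha_t = 0.5 * log((1 - eps_t) / eps_t)."""
    n = len(y)
    w = np.ones(n) / n
    ensemble = []
    for _ in range(rounds):
        j, thresh, sign = stump_learner(X, y, w)
        pred = sign * np.where(X[:, j] >= thresh, 1, -1)
        eps = max(w @ (pred != y), 1e-12)      # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)  # stepsize alpha_t
        w = w * np.exp(-alpha * y * pred)      # multiplicative reweighting
        w /= w.sum()
        ensemble.append((alpha, j, thresh, sign))
    def predict(Xq):
        score = sum(a * s * np.where(Xq[:, j] >= t, 1, -1)
                    for a, j, t, s in ensemble)
        return np.sign(score)
    return predict
```

Even though a single stump cannot separate the interval-shaped toy problem below, a few boosting rounds can.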
SLIDE 31 Usage
- Simple classifiers (weak learners)
- Linear threshold functions
- Decision trees (simple ones)
- Neural networks
- Do not boost SVMs. Why?
Boosting the Margin, Schapire et al http://goo.gl/aLCSO
Boosting noisy data for too long overfits; fix e.g. by limiting the weight of individual observations.
SLIDE 32
Application to Game Theory
SLIDE 33
Games
SLIDE 34 Games
Rock-scissors-paper payoff matrix for the row player (win = 1, loss = -1, draw = 0):
           rock  scissors  paper
rock        0       1       -1
scissors   -1       0        1
paper       1      -1        0
SLIDE 35 Games
- Game
- Row player picks i, column player picks j
Each player k receives outcome M_{ij,k}
- Zero sum game has M_{ij,1} = -M_{ij,2}
(my gain is your loss)
- How to play
- Deterministic strategy usually not optimal
- Distribution over actions
- Nash equilibrium
Players have no incentive to change policy
SLIDE 36 Games
- von Neumann minimax theorem
\min_{x \in P} \max_{y \in P} x^\top M y = \max_{y \in P} \min_{x \in P} x^\top M y
- Proof
The inner maximum is attained at a vertex of the simplex, hence
\min_{x \in P} \max_j [x^\top M]_j = \min_{x \in P} \max_{y \in P} x^\top M y.
Apply linear programming duality to get
\min_{x \in P} \max_j [x^\top M]_j = \max_{y \in P} \min_i [M y]_i.
Apply the vertex property again to complete the proof:
\max_{y \in P} \min_i [M y]_i = \max_{y \in P} \min_{x \in P} x^\top M y.
SLIDE 37 Finding a Nash equilibrium approximately
- Repeated game (initial distribution p_0 for player)
- For t rounds do
- Opponent picks best distribution q_{t+1} using p_t
- Player updates action distribution p_{t+1} using
p_{i,t+1} \propto p_{i,0} \exp\left(-\eta \sum_{\tau=1}^{t} [M q_\tau]_i\right)
- Regret bound tells us that
\frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau \le \min_i \frac{1}{t} \sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-1/2})
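The repeated game above can be sketched for rock-scissors-paper, with the row player running the multiplicative update on cumulative losses [M q_tau]_i and the opponent best-responding each round. The name `approx_nash` is hypothetical, and subtracting the minimum cumulative loss before exponentiating is a numerical-stability choice of mine.

```python
import numpy as np

def approx_nash(M, rounds=4000, eta=0.03):
    """Approximate Nash equilibrium of a zero-sum game (row player
    minimizes p^T M q) via multiplicative weights against a
    best-responding opponent; returns the averaged strategies."""
    n, m = M.shape
    cum_loss = np.zeros(n)
    p_avg, q_avg = np.zeros(n), np.zeros(m)
    for _ in range(rounds):
        z = cum_loss - cum_loss.min()     # stabilize the exponentials
        p = np.exp(-eta * z)
        p /= p.sum()
        q = np.zeros(m)
        q[np.argmax(p @ M)] = 1.0         # opponent's best response to p
        cum_loss += M @ q                 # per-action losses [M q]_i
        p_avg += p
        q_avg += q
    return p_avg / rounds, q_avg / rounds
```

For rock-scissors-paper the averaged strategies approach the uniform equilibrium, with exploitability shrinking at the O(t^{-1/2}) rate of the regret bound.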
SLIDE 38 Finding a Nash equilibrium approximately
- Regret bound
\frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau \le \min_i \frac{1}{t} \sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-1/2}) = \min_p \frac{1}{t} \sum_{\tau=1}^{t} p^\top M q_\tau + O(t^{-1/2})
- By construction of the algorithm we have
\min_p \max_q p^\top M q \le \max_q \frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q \le \frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau
- Combining this yields
\max_q \frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q \le \max_q \min_p p^\top M q + O(t^{-1/2})
SLIDE 39 Simplified algorithm
- Repeated game (initial distribution p_0 for player)
- For t rounds do
- Opponent picks best action q_{t+1} using p_t
- Player updates action distribution p_{t+1} using
p_{i,t+1} \propto p_{i,0} \exp\left(-\eta \sum_{\tau=1}^{t} [M q_\tau]_i\right)
- Regret bound tells us that
\frac{1}{t} \sum_{\tau=1}^{t} p_\tau^\top M q_\tau \le \min_i \frac{1}{t} \sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-1/2})
SLIDE 40
Application to Particle Filtering
SLIDE 41 Sequential Monte Carlo
- Recall particle filter idea (simplified)
- Observe data in sequence
- At each step approximate the distribution p(\theta|x_{1:n}) by weighted samples from the posterior
p(\theta|x_{1:n+1}) \propto p(x_{n+1}|\theta, x_{1:n}) p(\theta|x_{1:n})
- Assuming conditional independence x_i \perp x_j \mid \theta, the weight update is multiplicative:
w_{i,n+1} = w_{i,n} \, p(x_{n+1}|\theta) = w_{i,n} \, e^{\log p(x_{n+1}|\theta)}
SLIDE 42 Sequential Monte Carlo
- Experts: loss, weights, convergence as good as the best expert (good news)
- Particle Filters: neg. log-likelihood, weights, convergence until only the best particle is left (bad news)
- Need to resample
- Adaptively find better solution
SLIDE 43 Sequential Monte Carlo
- On a chain
- Observe data in sequence
- Fill in latent variables in sequence
- Bayes Rule
p(x_{n+1}, \theta_{n+1}|x_{1:n}, \theta_{1:n}) = p(x_{n+1}|x_{1:n}, \theta_{1:n}) \, p(\theta_{n+1}|x_{1:n+1}, \theta_{1:n})
- Sample latent parameter
\theta_{n+1} \sim p(\theta_{n+1}|x_{1:n+1}, \theta_{1:n})
- Update particle weight with the prediction "error"
w_{i,n+1} = w_{i,n} \, p(x_{n+1}|\theta_{1:n}, x_{1:n})
SLIDE 44 Sequential Monte Carlo
- Clustering
- Observe data in sequence
- We assume a collapsed representation (integrate out natural parameters)
- For each particle do (naive method)
- Compute data likelihood (for reweighting)
p(x_{n+1}|x_{1:n}, y_{1:n}) = \sum_{y_{n+1}} p(x_{n+1}|x_{1:n}, y_{1:n}, y_{n+1}) \, p(y_{n+1}|x_{1:n}, y_{1:n})
- Draw cluster ID from
p(y_{n+1}|x_{1:n+1}, y_{1:n}) \propto p(x_{n+1}|x_{1:n}, y_{1:n}, y_{n+1}) \, p(y_{n+1}|x_{1:n}, y_{1:n})
- Resample particles if too skewed.
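The multiplicative reweighting plus resample-when-skewed pattern of these slides can be sketched on the simplest conjugate example (posterior over a Gaussian mean) rather than the clustering model; the resampling trigger (effective sample size below half the particle count) and the name `smc_posterior_mean` are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def smc_posterior_mean(xs, n_particles=500, sigma=1.0):
    """Sequential Monte Carlo for the mean theta of a Gaussian with known
    variance sigma^2 and prior theta ~ N(0, 1): weights are multiplied by
    the likelihood p(x_{n+1}|theta) as data arrives, and particles are
    resampled whenever the effective sample size collapses."""
    theta = rng.normal(0.0, 1.0, n_particles)    # samples from the prior
    w = np.ones(n_particles) / n_particles
    for x in xs:
        # multiplicative update: w_i <- w_i * p(x | theta_i)
        w *= np.exp(-0.5 * (x - theta) ** 2 / sigma ** 2)
        w /= w.sum()
        ess = 1.0 / np.sum(w ** 2)               # effective sample size
        if ess < n_particles / 2:                # too skewed: resample
            idx = rng.choice(n_particles, size=n_particles, p=w)
            theta = theta[idx]
            w = np.ones(n_particles) / n_particles
    return float(w @ theta)
```

Without the resampling step, the weight concentrates on a single particle exactly as the bad-news column of slide 42 warns.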
SLIDE 45 Resampling generates tree
Canini et al, 2009 http://jmlr.csail.mit.edu/proceedings/papers/v5/canini09a/canini09a.pdf
SLIDE 46
Updates
- Boosting
- Game Theory
- Particle Filtering
SLIDE 47 What we missed
- Neural Networks
- Reinforcement learning
- Bandits
- Collaborative Filtering
- Scalability
- Kalman Filter / HMM
- Lasso / l1 programming / sparse reconstruction
- ...