SLIDE 1

Introduction to Machine Learning

25. Multiplicative Updates, Games and Boosting

Alex Smola, Carnegie Mellon University

http://alex.smola.org/teaching/cmu2013-10-701 (10-701)

SLIDE 2

Multiplicative updates and experts

http://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf

SLIDE 3

Finding an expert

http://xkcd.com/451/

SLIDE 4

Finding an expert

  • Pool of experts E
  • At each time step t
    • Each expert e makes a prediction f_{e,t}
    • We observe event y_t
  • Goals
    • Find the expert who gets things right
    • Predict well in the meantime
SLIDE 12

Halving algorithm

  • Start with pool of experts E
  • Predict with the majority of the remaining experts
  • Observe outcome
  • Discard those that got it wrong
  • Theorem: the algorithm makes at most log2 |E| errors
  • Proof: each time we make an error, the majority was wrong, so at least half of the remaining experts are removed. After more than log2 |E| errors no experts would be left, contradicting the assumption that at least one expert is always right.
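In code, the halving scheme is only a few lines. Below is a minimal sketch in Python, assuming a pool of deterministic experts (functions mapping inputs to labels in {-1, +1}) of which at least one is always correct; all names are illustrative.

```python
def halving(experts, stream):
    """Halving algorithm: predict with the majority of the remaining
    consistent experts, then discard every expert that was wrong."""
    active = list(range(len(experts)))  # indices of still-consistent experts
    mistakes = 0
    for x, y in stream:
        votes = sum(experts[e](x) for e in active)
        y_hat = 1 if votes >= 0 else -1          # majority vote
        if y_hat != y:
            mistakes += 1                        # each mistake halves the pool
        active = [e for e in active if experts[e](x) == y]
    return active, mistakes
```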

SLIDE 13

Predicting as well as the best expert

  • Experts (can) make mistakes, so we shouldn't fire them immediately
  • Can we predict nearly as well as the best expert in the pool?
  • Regret: the error relative to the best expert. Our predictions need not match any expert!

$$L_e(t) := \sum_{\tau=1}^{t} l(y_\tau, f_{e,\tau}) \qquad\text{and}\qquad R(\hat{y}, t) := L(\hat{y}, t) - \min_{e'} L_{e'}(t)$$

SLIDE 21

Weighted majority

  • Binary loss (1 if wrong, 0 if correct)
  • Experts initially have some weight $w_{e,0}$
  • For all observations do
    • Predict using the weighted majority

$$\hat{y}_t = \mathrm{sgn}\,\frac{\sum_e w_{e,t}\, y_{e,t}}{\sum_e w_{e,t}}$$

    • Observe the outcome and reweight wrong experts: $w_{e,t+1} = \beta w_{e,t}$
  • An alternative variant draws an expert at random from the pool
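A minimal sketch of this update in Python, assuming expert predictions and outcomes take values in {-1, +1}; the array layout and names are illustrative.

```python
import numpy as np

def weighted_majority(predictions, outcomes, beta=0.5):
    """Weighted majority: downweight wrong experts by beta in (0, 1).

    predictions: (T, n) array of expert predictions in {-1, +1}
    outcomes:    (T,) array of observed labels in {-1, +1}
    """
    T, n = predictions.shape
    w = np.ones(n)                                # initial weights w_{e,0} = 1
    mistakes = 0
    for t in range(T):
        # predict with the weighted majority: sgn(sum_e w_{e,t} y_{e,t})
        y_hat = 1.0 if w @ predictions[t] >= 0 else -1.0
        if y_hat != outcomes[t]:
            mistakes += 1
        w[predictions[t] != outcomes[t]] *= beta  # w_{e,t+1} = beta * w_{e,t}
    return w, mistakes
```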

SLIDE 22

Weighted majority analysis

  • Update equation for mistaken experts: $w_{e,t+1} = \beta w_{e,t}$
  • Total expert weight: $W_t := \sum_e w_{e,t}$
  • We incur a loss when the majority gets it wrong, hence

$$W_{t+1} \le \tfrac{1}{2} W_t + \tfrac{\beta}{2} W_t \quad\Rightarrow\quad W_t \le \left(\tfrac{1+\beta}{2}\right)^{\hat{L}_t} W_0$$

  • For each expert $e$ we have the bound $w_{e,t+1} = w_{e,0}\,\beta^{L_e(t)} \le W_{t+1}$
  • Solving for the loss yields

$$\hat{L}_t \le \frac{L_e(t)\log\beta^{-1} + \log W_0 - \log w_{e,0}}{\log\frac{2}{1+\beta}}$$

SLIDE 23

Weighted majority analysis

  • Solving for the loss yields

$$\hat{L}_t \le \frac{L_i(t)\log\beta^{-1} + \log W_0 - \log w_{i,0}}{\log\frac{2}{1+\beta}}$$

  • With uniform initial weights over $n$ experts this becomes

$$\hat{L}_t \le L_i(t)\,\frac{\log\beta^{-1}}{\log\frac{2}{1+\beta}} + \frac{\log n}{\log\frac{2}{1+\beta}}$$

  • Small downweighting leads to small regret in the long term.
  • Initially give uniform weight to all experts (this is where you could hide your prior).
  • Converges exponentially fast to the best expert!

SLIDE 24

Multiplicative Updates

  • Multiply weights by the exponentiated loss expert e incurs at time t:

$$w_{e,t+1} = w_{e,t}\,e^{-\eta\, l(f_{e,t},\, y_t)} = w_{e,0}\,e^{-\eta L_e(t)}$$

  • Lower bound for all experts e: $W_{t+1} > w_{e,t+1}$
  • Hoeffding bound (rather nonstandard variant):

$$\mathbf{E}[\exp(\eta X)] \le \exp\left(\eta\,\mathbf{E}[X] + \eta^2/8\right)$$

  • Upper bound, where $l_t$ denotes our expected loss at round $t$:

$$W_{t+1} = W_t \sum_{e} \frac{w_{e,t}}{W_t}\,e^{-\eta\, l(f_{e,t},\, y_t)} \le W_t\,e^{-\eta l_t + \eta^2/8} \le W_0\,e^{-\eta L(\hat{y},t) + t\eta^2/8}$$

  • Combining both bounds, we set $\eta = \sqrt{8\log n / t}$ to obtain

$$L(\hat{y}, t) \le L_e(t) + \frac{\eta t}{8} + \eta^{-1}\log n$$
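A sketch of the resulting algorithm (often called Hedge) in Python, assuming the losses arrive as a (T, n) array with entries in [0, 1]; names are illustrative.

```python
import numpy as np

def hedge(losses):
    """Multiplicative updates with the learning rate from the slide."""
    T, n = losses.shape
    eta = np.sqrt(8.0 * np.log(n) / T)   # eta = sqrt(8 log n / t)
    w = np.ones(n)                       # w_{e,0} = 1, hence W_0 = n
    total = 0.0
    for t in range(T):
        p = w / w.sum()                  # normalized expert weights
        total += p @ losses[t]           # expected loss of the learner
        w *= np.exp(-eta * losses[t])    # w_{e,t+1} = w_{e,t} e^{-eta l(f_{e,t}, y_t)}
    return total                         # <= min_e L_e(T) + sqrt((T/2) log n)
```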

SLIDE 25

Multiplicative Updates

  • Plugging the optimal learning rate $\eta = \sqrt{8\log n / t}$ into the bound from the previous slide yields, for every expert $e$,

$$L(\hat{y}, t) \le L_e(t) + \sqrt{\tfrac{t}{2}\log n}$$

SLIDE 26

Application to Boosting

SLIDE 27

Boosting intuition

  • Data set (x_i, y_i)
  • Weak learners that perform better than random for a weighted distribution of the data (w_i, x_i, y_i):

$$L(w, f) := \frac{1}{m}\sum_{i=1}^{m} w_i y_i f(x_i) \ge \frac{1}{2} + \gamma$$

  • Combine weak learners into a strong learner:

$$f = \frac{1}{t}\sum_{\tau=1}^{t} f_\tau$$

  • How do we weight instances to generate good weak learners? Idea: find difficult instances.

SLIDE 28

Non-adaptive Boosting

  • Data set (w_i, x_i, y_i)
  • For t iterations do (a code sketch follows below)
    • Invoke weak learner:

$$f_t = \mathop{\mathrm{argmax}}_f\, L(w_t, f) = \mathop{\mathrm{argmax}}_f\, \frac{1}{m}\sum_{i=1}^{m} w_{i,t-1}\, y_i f(x_i)$$

    • Reweight instances (reduce weight if we got things right):

$$w_{i,t} = w_{i,t-1}\,e^{-\alpha y_i f_t(x_i)}$$

  • Output the linear combination

$$f = \frac{1}{t}\sum_{\tau=1}^{t} f_\tau$$
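As referenced above, a sketch of the loop with a fixed stepsize alpha. The weak learner interface is an assumption: `weak_learner(X, y, w)` is taken to return a classifier `h` with predictions in {-1, +1} that beats random on the weighted sample; renormalizing the weights is for numerical convenience only.

```python
import numpy as np

def boost_nonadaptive(weak_learner, X, y, rounds, alpha):
    """Non-adaptive boosting with a fixed stepsize alpha."""
    w = np.ones(len(y)) / len(y)
    learners = []
    for t in range(rounds):
        h = weak_learner(X, y, w)           # invoke weak learner on (w, x, y)
        w = w * np.exp(-alpha * y * h(X))   # reduce weight if we got it right
        w = w / w.sum()                     # renormalize (numerical convenience)
        learners.append(h)
    # classify with the uniform combination f = (1/t) sum_tau f_tau
    return lambda X_new: np.sign(sum(h(X_new) for h in learners))
```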

SLIDE 29

Boosting Analysis

  • For mistakes (majority wrong, $y_i f(x_i) \le 0$) we have

$$w_{i,t} \ge e^{-\alpha t/2}, \quad\text{hence}\quad W_t \ge |\{i : f(x_i) y_i \le 0\}|\; e^{-\alpha t/2}$$

  • Upper bound on the total weight:

$$W_t = \sum_i w_{i,t-1}\,e^{-\alpha y_i f_t(x_i)} \le W_{t-1}\,e^{-\alpha(\gamma+1/2)+\alpha^2/8} \le n\,e^{-\alpha t(\gamma+1/2)+t\alpha^2/8}$$

  • Combining the upper and lower bounds:

$$n_{\mathrm{errors}} \le n\,e^{-t(\alpha\gamma - \alpha^2/8)}, \quad\text{hence}\quad n_{\mathrm{errors}} \le n\,e^{-2t\gamma^2}\ \text{for}\ \alpha = 4\gamma$$

  • The error vanishes exponentially fast.

SLIDE 30

AdaBoost

  • Refine the algorithm by weighting the functions
  • Adaptive in the performance of the weak learner
  • Error for weighted observations:

$$\epsilon_t := \frac{1}{n}\sum_{i=1}^{n} w_{i,t}\,\frac{1}{2}\{1 - y_i f_t(x_i)\}$$

  • Stepsize and weights:

$$\alpha_t := \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}, \qquad f = \sum_t \alpha_t f_t \quad\text{and}\quad w_{i,t} = w_{i,0}\,e^{-\sum_t \alpha_t y_i f_t(x_i)}$$
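A sketch of AdaBoost using the $\epsilon_t$ and $\alpha_t$ above, with normalized weights (so the $1/n$ factor is absorbed into the weights); the weak-learner interface is the same illustrative assumption as before.

```python
import numpy as np

def adaboost(weak_learner, X, y, rounds):
    """AdaBoost: adaptive stepsize alpha_t from the weighted error eps_t."""
    w = np.ones(len(y)) / len(y)
    learners, alphas = [], []
    for t in range(rounds):
        h = weak_learner(X, y, w)
        pred = h(X)
        eps = np.sum(w * 0.5 * (1.0 - y * pred))  # weighted error eps_t
        if eps >= 0.5:                            # no better than random: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)   # alpha_t
        w = w * np.exp(-alpha * y * pred)         # reweight observations
        w = w / w.sum()
        learners.append(h)
        alphas.append(alpha)
    # classify with f = sum_t alpha_t f_t
    return lambda X_new: np.sign(sum(a * h(X_new) for a, h in zip(alphas, learners)))
```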

SLIDE 31

Usage

  • Simple classifiers (weak learners)
    • Linear threshold functions
    • Decision trees (simple ones)
    • Neural networks
  • Do not boost SVMs. Why?
    Boosting the Margin, Schapire et al., http://goo.gl/aLCSO
  • Overfitting is possible: boosting with noisy data for too long causes it. Fix e.g. by limiting the weight of individual observations.

SLIDE 32

Application to Game Theory

SLIDE 34

Games

Rock-scissors-paper payoff matrix for the row player:

           rock   scissors   paper
rock        0        1        -1
scissors   -1        0         1
paper       1       -1         0
SLIDE 35

Games

  • Game
    • Row player picks i, column player picks j
    • Receive outcomes M_{ij,k} for player k
    • A zero-sum game has M_{ij,1} = -M_{ij,2} (my gain is your loss)
  • How to play
    • A deterministic strategy is usually not optimal
    • Use a distribution over actions
  • Nash equilibrium: players have no incentive to change their policy

SLIDE 36

Games

  • von Neumann minimax theorem:

$$\min_{x\in P}\max_{y\in P}\, x^\top M y = \max_{y\in P}\min_{x\in P}\, x^\top M y$$

  • Proof: since a linear function attains its maximum over the simplex at a vertex,

$$\min_{x\in P}\max_j \left[x^\top M\right]_j = \min_{x\in P}\max_{y\in P}\, x^\top M y$$

    Apply linear programming duality to get

$$\min_{x\in P}\max_j \left[x^\top M\right]_j = \max_{y\in P}\min_i \left[M y\right]_i$$

    Apply the vertex property again to complete the proof.

SLIDE 37

Finding a Nash equilibrium approximately

  • Repeated game (initial distribution p_0 for the player)
  • For t rounds do (sketched in code below)
    • Opponent picks the best distribution q_{t+1} using p_t
    • Player updates the action distribution p_{t+1} using

$$p_{i,t+1} \propto p_{i,0} \exp\left(-\eta \sum_{\tau=1}^{t} [M q_\tau]_i\right)$$

  • The regret bound tells us that

$$\frac{1}{t}\sum_{\tau=1}^{t} p_\tau^\top M q_\tau \le \min_i \frac{1}{t}\sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-\frac{1}{2}})$$
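A sketch of this procedure on rock-scissors-paper, where M is taken as the row player's loss matrix and the opponent plays an exact best response each round; the fixed learning rate and round count are illustrative choices.

```python
import numpy as np

def approx_nash(M, rounds=5000, eta=0.05):
    """Approximate Nash equilibrium via multiplicative updates."""
    n, m = M.shape
    cum = np.zeros(n)                 # cumulative losses sum_tau [M q_tau]_i
    p_avg, q_avg = np.zeros(n), np.zeros(m)
    for t in range(rounds):
        p = np.exp(-eta * cum)
        p /= p.sum()                  # p_{i,t+1} proportional to exp(-eta sum [M q_tau]_i)
        j = int(np.argmax(p @ M))     # opponent best-responds to p_t
        cum += M[:, j]
        p_avg += p
        q_avg[j] += 1.0
    return p_avg / rounds, q_avg / rounds

# rock-scissors-paper as the row player's loss matrix: both averaged
# strategies approach the uniform equilibrium (1/3, 1/3, 1/3)
M = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
p, q = approx_nash(M)
```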

SLIDE 38

Finding a Nash equilibrium approximately

  • By construction of the algorithm (the opponent best-responds each round) we have

$$\min_{p}\max_{q}\, p^\top M q \;\le\; \max_{q}\, \frac{1}{t}\sum_{\tau=1}^{t} p_\tau^\top M q \;\le\; \frac{1}{t}\sum_{\tau=1}^{t} p_\tau^\top M q_\tau$$

  • The regret bound gives

$$\frac{1}{t}\sum_{\tau=1}^{t} p_\tau^\top M q_\tau \;\le\; \min_i \frac{1}{t}\sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-\frac{1}{2}}) \;=\; \min_{p}\, \frac{1}{t}\sum_{\tau=1}^{t} p^\top M q_\tau + O(t^{-\frac{1}{2}})$$

  • Combining this yields

$$\max_{q}\, \frac{1}{t}\sum_{\tau=1}^{t} p_\tau^\top M q \;\le\; \max_{q}\min_{p}\, p^\top M q + O(t^{-\frac{1}{2}})$$

    so the averaged strategies form an approximate Nash equilibrium.

SLIDE 39

Simplified algorithm

  • Repeated game (initial distribution p_0 for the player)
  • For t rounds do
    • Opponent picks its best response action (rather than a full distribution) using p_t
    • Player updates the action distribution p_{t+1} using

$$p_{i,t+1} \propto p_{i,0} \exp\left(-\eta \sum_{\tau=1}^{t} [M q_\tau]_i\right)$$

  • The regret bound tells us that

$$\frac{1}{t}\sum_{\tau=1}^{t} p_\tau^\top M q_\tau \le \min_i \frac{1}{t}\sum_{\tau=1}^{t} [M q_\tau]_i + O(t^{-\frac{1}{2}})$$

SLIDE 40

Application to Particle Filtering

SLIDE 41

Sequential Monte Carlo

  • Recall the particle filter idea (simplified)
    • Observe data in sequence
    • At each step approximate the distribution of $\theta$ by weighted samples from the posterior $p(\theta|x_{1:n})$
  • Bayes rule:

$$p(\theta|x_{1:n+1}) \propto p(x_{n+1}|\theta, x_{1:n})\,p(\theta|x_{1:n})$$

  • Assuming conditional independence $x_i \perp x_j \mid \theta$, the weight update is multiplicative:

$$w_{i,n+1} = w_{i,n}\,p(x_{n+1}|\theta) = w_{i,n}\,e^{\log p(x_{n+1}|\theta)}$$
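The analogy with experts can be made explicit in code: each particle plays the role of an expert and the negative log-likelihood the role of the loss. A minimal sketch, where `loglik(x, theta)` is an assumed per-observation log-likelihood $\log p(x|\theta)$:

```python
import numpy as np

def smc_step(thetas, weights, x_new, loglik):
    """One sequential Monte Carlo reweighting step.

    Conditional independence x_i | x_j given theta makes the update
    multiplicative: w_{i,n+1} = w_{i,n} p(x_{n+1} | theta_i).
    """
    logw = np.array([loglik(x_new, th) for th in thetas])
    new_w = weights * np.exp(logw)   # w_{i,n} e^{log p(x_{n+1} | theta_i)}
    return new_w / new_w.sum()       # normalize for stability
```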

SLIDE 42

Sequential Monte Carlo

  • Experts
    • Loss
    • Weights
    • Convergence is good news: we found the best expert (good)
    • But the solution is only as good as the best expert
  • Particle Filters
    • Negative log-likelihood as the loss
    • Weights
    • Convergence is bad news: eventually only a single sample is left
    • Need to resample
    • Adaptively find a better solution

SLIDE 43

Sequential Monte Carlo

  • On a chain
    • Observe data in sequence
    • Fill in latent variables in sequence
  • Bayes rule:

$$p(x_{n+1}, \theta_{n+1}|x_{1:n}, \theta_{1:n}) = p(x_{n+1}|x_{1:n}, \theta_{1:n})\,p(\theta_{n+1}|x_{1:n+1}, \theta_{1:n})$$

  • Sample the latent parameter $\theta_{n+1} \sim p(\theta_{n+1}|x_{1:n+1}, \theta_{1:n})$
  • Update the particle weight with the prediction "error":

$$w_{i,n+1} = w_{i,n}\,p(x_{n+1}|\theta_{1:n}, x_{1:n})$$

SLIDE 44

Sequential Monte Carlo

  • Clustering
    • Observe data in sequence
    • We assume a collapsed representation (integrate out the natural parameters)
  • For each particle do (naive method)
    • Compute the data likelihood (for reweighting):

$$p(x_{n+1}|x_{1:n}, y_{1:n}) = \sum_{y_{n+1}} p(x_{n+1}|x_{1:n}, y_{1:n}, y_{n+1})\,p(y_{n+1}|x_{1:n}, y_{1:n})$$

    • Draw the cluster ID $y_{n+1}$ from

$$p(y_{n+1}|x_{1:n+1}, y_{1:n}) \propto p(x_{n+1}|x_{1:n}, y_{1:n}, y_{n+1})\,p(y_{n+1}|x_{1:n}, y_{1:n})$$

  • Resample particles if the weights get too skewed (a sketch follows below).
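A sketch of the resampling step referenced in the last bullet. Measuring skew via the effective sample size ESS = 1 / sum_i w_i^2 is a common convention, not something the slide specifies.

```python
import numpy as np

def resample_if_skewed(particles, weights, rng, frac=0.5):
    """Multinomial resampling when the effective sample size drops."""
    n = len(weights)
    ess = 1.0 / np.sum(weights ** 2)            # effective number of particles
    if ess < frac * n:
        idx = rng.choice(n, size=n, p=weights)  # draw particles by weight
        particles = [particles[i] for i in idx]
        weights = np.ones(n) / n                # reset to uniform weights
    return particles, weights

# usage: particles, weights = resample_if_skewed(particles, weights,
#                                                np.random.default_rng(0))
```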

SLIDE 45

Resampling generates a tree

Canini et al., 2009: http://jmlr.csail.mit.edu/proceedings/papers/v5/canini09a/canini09a.pdf

SLIDE 46
  • Multiplicative Updates
  • Boosting
  • Game Theory
  • Particle Filtering
SLIDE 47
What we missed

  • Neural Networks
  • Reinforcement learning
  • Bandits
  • Collaborative Filtering
  • Scalability
  • Kalman Filter / HMM
  • Lasso / l1 programming / sparse reconstruction
  • ...