slide-1
SLIDE 1

Following the Flattened Leader

Wojciech Kotłowski 1, Peter Grünwald 1, Steven de Rooij 2

1National Research Institute for Mathematics and Computer Science (CWI)

The Netherlands

2University of Cambridge

COLT 2010

1 / 14

slide-5
SLIDE 5

Outline

1 Sequential prediction with log-loss.

Set of experts = exponential family.

2 Prediction strategies:

Bayes strategy: achieves optimal regret; usually hard to calculate.
“Follow the leader” strategy: simple to compute/update; suboptimal.

3 Our contribution

“Follow the flattened leader” strategy: a slight modification of “follow the leader” that achieves the performance of Bayes while retaining the simplicity of ML.

4 Applications: prediction, coding, model selection.

2 / 14

slide-11
SLIDE 11

Sequential Prediction

Family of distributions (model) M = {Pµ | µ ∈ Θ}. Sequence of outcomes x1, x2, . . . ∈ X ∞, revealed one by one. In each iteration, after observing xn = x1, x2, . . . , xn, predict xn+1 by assigning a distribution P(·|xn). After xn+1 is revealed, incur log-loss − log P(xn+1|xn). Regret w.r.t. the best “expert” from M:

R(P, x^n) = \sum_{i=1}^{n} -\log P(x_i \mid x^{i-1}) \;-\; \inf_{\mu \in \Theta} \sum_{i=1}^{n} -\log P_\mu(x_i \mid x^{i-1})

Process generating the outcomes:

adversarial: only boundedness assumptions on xn,
stochastic: X1, X2, . . . i.i.d. ∼ P∗, possibly P∗ ∉ M; then R(P, Xn) is a random variable.

3 / 14
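To make the log-loss and regret definitions concrete, here is a minimal sketch for the Bernoulli expert class (the talk's running example), using Laplace's rule of succession as the prediction strategy. The function names are illustrative, not from the talk.

```python
import math

def regret(strategy, xs):
    """Cumulative log-loss of `strategy` on the binary sequence xs,
    minus the log-loss of the best fixed Bernoulli expert in hindsight."""
    loss = 0.0
    for i, x in enumerate(xs):
        p = strategy(xs[:i])                      # predicted P(X = 1 | x^i)
        loss += -math.log(p if x == 1 else 1.0 - p)
    mu = sum(xs) / len(xs)                        # ML expert in hindsight
    best = 0.0
    for x in xs:
        q = mu if x == 1 else 1.0 - mu
        best += -math.log(q) if q > 0 else 0.0    # convention 0*log 0 = 0
    return loss - best

# Laplace's rule of succession as a simple in-model strategy
laplace = lambda past: (sum(past) + 1) / (len(past) + 2)
print(regret(laplace, [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]))   # about 1 nat here
```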

slide-17
SLIDE 17

Sequential Prediction: Example

M = {Pµ | µ ∈ [0, 1]}, Pµ Bernoulli. xn = 1010110110.

Best expert in M: Pµ̂n, with µ̂n = #1/n (= 3/5 here).

“Follow the leader” prediction strategy:

P(·|xi) = Pµ̂i(·) ⇐ µ̂0 undefined, P(x2|x1) = 0 . . .

P(·|xi) = Pµ̂°i(·), with µ̂°i = (#1 + 1)/(i + 2) (Laplace’s rule of succession):

µ̂°i: 1/2, 2/3, 1/2, 3/5, 1/2, 4/7, 5/8, 5/9, 3/5, 7/11, 7/12.

If x∞ is such that for large n, µ̂n stays bounded away from {0, 1}: R(P, xn) = (1/2) log n + O(1).

4 / 14
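The sequence of smoothed estimates µ̂°i on this slide can be reproduced exactly with rational arithmetic:

```python
from fractions import Fraction

xs = [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]        # the slide's sequence 1010110110

# Laplace's rule of succession: mu_i = (#1s in x^i + 1) / (i + 2)
mus = [Fraction(sum(xs[:i]) + 1, i + 2) for i in range(len(xs) + 1)]
print(mus)
# → [1/2, 2/3, 1/2, 3/5, 1/2, 4/7, 5/8, 5/9, 3/5, 7/11, 7/12]
```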

slide-24
SLIDE 24

Problem Statement

M = {Pµ | µ ∈ Θ} is a k-parameter exponential family:

Bernoulli, Gaussian, Poisson, gamma, beta, geometric, χ², . . .
Mean-value parametrization, µ = E[X].

Bayes strategy:

P_{\text{bayes}}(x_{n+1} \mid x^n) = \int_\Theta P_\mu(x_{n+1}) \, d\pi(\mu \mid x^n)

Pbayes(· | xn) ∉ M (strategy outside the model). R(Pbayes, xn) = (k/2) log n + O(1) (asymptotically optimal).

Plug-in strategy: Pplug-in(xn+1 | xn) = Pµ̄(xn)(xn+1), with estimator µ̄: X ∞ → Θ.

Pplug-in(· | xn) ∈ M (in-model strategy).

ML plug-in strategy (“follow the leader”) if µ̄(xn) = µ̂°n:

\hat\mu^\circ_n = \frac{n_0 x_0 + \sum_{i=1}^{n} x_i}{n_0 + n} \quad \text{(smoothed ML estimator)}

R(Pplug-in, xn) ≥ c (k/2) log n + O(1), worst case: c ≫ 1.

5 / 14
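The computational appeal of "follow the leader" is that the smoothed ML estimator can be maintained online in O(1) time per outcome. A minimal sketch; the defaults x0 = 0.5, n0 = 1 are illustrative (for Bernoulli, x0 = 1/2 with n0 = 2 recovers Laplace's rule):

```python
def smoothed_ml(xs, x0=0.5, n0=1.0):
    """Online smoothed ML estimates mu_n = (n0*x0 + sum_{i<=n} x_i)/(n0 + n),
    one O(1) update per outcome."""
    s, n = n0 * x0, n0
    out = []
    for x in xs:
        s, n = s + x, n + 1
        out.append(s / n)
    return out

print(smoothed_ml([1, 0, 1, 0]))                 # → [0.75, 0.5, 0.625, 0.5]
print(smoothed_ml([1, 0, 1], x0=0.5, n0=2.0))    # Laplace: [2/3, 1/2, 3/5]
```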

slide-25
SLIDE 25

Contribution

Bayes strategy (strategy outside the model):
asymptotically optimal regret, (k/2) log n + O(1);
usually hard to calculate.

Plug-in strategy, incl. ML (strategy in the model):
suboptimal, c (k/2) log n + O(1);
simple to compute/update.

“Follow the Flattened Leader”: a slight modification (“flattening”) of the ML plug-in strategy, “almost” in the model, achieving optimal regret:
achieves the performance of Bayes;
retains the simplicity of ML.

6 / 14

slide-28
SLIDE 28

Motivating Example: Why Is Bayes Better than ML?

M = {N(µ, 1): µ ∈ R}.

ML strategy prediction: N(µ̂°n, 1).

Bayes strategy prediction: N(µ̂°n, 1 + 1/(n + 1)).

Sequence of outcomes xn: 2, −2, 2, −2, . . .

[Figure: left, predictive densities p(x) of the ML and Bayes strategies; right, regret (nats) as a function of n for ML and Bayes.]

7 / 14
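A minimal numerical sketch of this comparison; x0 = 0 and n0 = 1 for the smoothed estimator are assumed constants (the slide does not fix them), and the Bayes predictive variance follows the formula above.

```python
import math

def nll_gauss(x, mu, var):
    """Negative log-density of N(mu, var) at x (log-loss in nats)."""
    return 0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)

xs = [2.0 if i % 2 == 0 else -2.0 for i in range(20)]   # 2, -2, 2, -2, ...

ml = bayes = 0.0
s, m = 0.0, 1.0              # smoothed ML state: x0 = 0, n0 = 1 (assumption)
for x in xs:
    mu = s / m
    ml    += nll_gauss(x, mu, 1.0)             # ML predicts N(mu, 1)
    bayes += nll_gauss(x, mu, 1.0 + 1.0 / m)   # Bayes predicts N(mu, 1 + 1/(n+1))
    s, m = s + x, m + 1

best = sum(nll_gauss(x, 0.0, 1.0) for x in xs)  # best expert: mu = mean = 0
print(ml - best, bayes - best)   # ML's regret exceeds Bayes's on this sequence
```

The slightly wider Bayes predictive pays little when an outcome lands near the mean, but saves a lot when the outcome is 2 away from it, which on this sequence is every round.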


slide-34
SLIDE 34

Suboptimal Performance of ML Strategy

Grünwald & de Rooij (2005); Grünwald & Kotłowski (2010). M is a single-parameter exponential family, X1, X2, . . . i.i.d. ∼ P∗, EP∗[X] = µ∗ ∈ Θ. Then

\mathbb{E}_{P^*}[R(P_{\text{plug-in}}, X^n)] \;\ge\; \frac{1}{2}\,\frac{\operatorname{var}_{P^*} X}{\operatorname{var}_{P_{\mu^*}} X}\,\log n + O(1).

Inferior performance when the variance of the data is greater than the variance of Pµ∗ ∈ M. ⇒ Compensate for the variability of the data.

8 / 14


slide-36
SLIDE 36

Improving ML Strategy

Flattened ML Strategy

P_{\text{fml}}(x_{n+1} \mid x^n) \;:=\; P_{\hat\mu^\circ_n}(x_{n+1}) \cdot \frac{n + n_0 + \tfrac{1}{2}\,(x_{n+1} - \hat\mu^\circ_n)^T I(\hat\mu^\circ_n)\,(x_{n+1} - \hat\mu^\circ_n)}{n + n_0 + \tfrac{k}{2}}

The flattening term is 1 + O(1/n); it compensates for the variability of the data. For exponential families, I(µ) = Cov⁻¹µ X (the Fisher information in the mean-value parametrization).

9 / 14
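For the Gaussian location family M = {N(µ, 1)} we have k = 1 and I(µ) = 1, and since the quadratic flattening factor has mean 1 under P_µ̂°n, the flattened predictive is still a probability density. A numeric sketch checking the normalization (parameter names are illustrative):

```python
import math

def p_fml(x, mu, n, n0=1.0, k=1):
    """Flattened ML predictive density for M = {N(mu, 1)}: the N(mu, 1)
    density times the flattening factor, with I(mu) = 1 and k = 1."""
    base = math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)
    flat = (n + n0 + 0.5 * (x - mu) ** 2) / (n + n0 + 0.5 * k)
    return base * flat

# E[(X - mu)^2] = 1 under N(mu, 1), so the flattening factor averages to 1
# and p_fml integrates to 1; check with a Riemann sum over [-8, 8]:
total = sum(p_fml(x * 0.001, 0.0, n=5) * 0.001 for x in range(-8000, 8000))
print(total)   # ≈ 1.0
```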

slide-37
SLIDE 37

Main Result: Adversarial Case

Assumptions on outcomes, for all large n:
the sequence of data is bounded: ‖xn‖ ≤ B;
the sequence of ML estimators µ̂n is bounded away from ∂Θ.

Then the flattened ML strategy Pfml achieves asymptotically optimal regret, i.e.

R(Pfml, xn) = (k/2) log n + O(1),

where the constant under O(·) does not depend on the outcomes.

10 / 14

slide-38
SLIDE 38

Main Result: Stochastic Case

Assumptions on outcomes: X1, X2, . . . i.i.d. ∼ P∗, with EP∗[X] = µ∗ ∈ Θ, and the first four moments of P∗ exist. Then the flattened ML strategy Pfml almost surely achieves asymptotically optimal regret, i.e. R(Pfml, Xn) = (k/2) log n + O(1) holds with probability one.

11 / 14


slide-40
SLIDE 40

Flattened ML vs. ML and Bayes

M = {N(µ, 1): µ ∈ R}. ML vs. Bayes, and Flattened ML vs. Bayes.

[Figure: two panels of predictive densities p(x); left compares ML with Bayes, right compares Flattened ML with Bayes.]

12 / 14


slide-44
SLIDE 44

Flattened ML vs. ML and Bayes

[Figure: regret (nats) as a function of n for the ML, Flattened ML, and Bayes strategies.]

13 / 14
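A sketch reproducing the spirit of this comparison on the earlier alternating sequence 2, −2, 2, −2, . . . (constants x0 = 0, n0 = 1 assumed; for this family I(µ) = 1 and k = 1):

```python
import math

def nll_gauss(x, mu, var=1.0):
    """Negative log-density of N(mu, var) at x (log-loss in nats)."""
    return 0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)

xs = [2.0 if i % 2 == 0 else -2.0 for i in range(30)]   # 2, -2, 2, -2, ...

n0, k = 1.0, 1                      # x0 = 0, n0 = 1 are assumed constants
ml = fml = 0.0
s, n = 0.0, 0
for x in xs:
    mu = s / (n + n0)               # smoothed ML estimate
    ml += nll_gauss(x, mu)          # plain ML strategy: N(mu, 1)
    # flattened ML: N(mu, 1) density times the flattening factor
    flat = (n + n0 + 0.5 * (x - mu) ** 2) / (n + n0 + 0.5 * k)
    fml += nll_gauss(x, mu) - math.log(flat)
    s, n = s + x, n + 1

best = sum(nll_gauss(x, 0.0) for x in xs)   # best expert: mu = mean = 0
print(ml - best, fml - best)    # flattened regret is noticeably smaller
```

Every outcome lands about 2 away from the predicted mean, so the flattening factor exceeds 1 at every step and the flattened strategy pays strictly less log-loss than plain ML.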

slide-45
SLIDE 45

Conclusions

We proposed a simple “flattening” of the ML distribution that achieves the optimal asymptotic regret. The flattened ML strategy retains the simplicity of the ML strategy while matching the performance of Bayes and NML. Applications: prediction, coding, model selection.

14 / 14