On-line estimation with the multivariate Gaussian distribution

SLIDE 1

On-line estimation with the multivariate Gaussian distribution

Sanjoy Dasgupta and Daniel Hsu UC San Diego

On-line estimation with the multivariate Gaussian distribution

  • Nr. 1
SLIDE 2

Outline

  • 1. On-line density estimation and previous work
  • 2. On-line multivariate Gaussian density estimation
  • 3. Regret analysis of follow-the-leader
  • 4. Open problem

SLIDE 3

On-line density estimation

Learning protocol: for trial t = 1, 2, …

  • 1. Learner chooses parameter θt ∈ Θ
  • 2. Nature chooses instance xt ∈ X
  • 3. Learner incurs loss ℓt(θt) = ℓ(θt, xt)

In on-line (parametric) density estimation, ℓ(θ, x) = − log p(x|θ) where {p(·|θ) : θ ∈ Θ} is a parametric family of densities.
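The protocol can be sketched in a few lines of Python. This is a minimal illustration, not from the slides: the unit-variance Gaussian family and the `ftl` strategy below are illustrative choices of {p(·|θ)} and of a learner.

```python
import math

def nll(theta, x):
    # ℓ(θ, x) = −log p(x|θ) for a unit-variance Gaussian family
    # (an illustrative parametric family; any {p(·|θ) : θ ∈ Θ} works)
    return 0.5 * (x - theta) ** 2 + 0.5 * math.log(2 * math.pi)

def ftl(history):
    # follow-the-leader for this family: mean of past instances (0 if none)
    return sum(history) / len(history) if history else 0.0

def run_protocol(instances, choose):
    """Run the learning protocol and return the learner's total loss L_T."""
    history, total_loss = [], 0.0
    for x_t in instances:                # Nature's choices (fixed in advance here)
        theta_t = choose(history)        # 1. learner chooses θ_t ∈ Θ
        total_loss += nll(theta_t, x_t)  # 2-3. x_t is revealed, loss ℓ_t(θ_t) incurred
        history.append(x_t)
    return total_loss
```

Note that θ_t may depend only on x_1, …, x_{t−1}, which is why `choose` sees `history` before `x_t` is appended.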

SLIDE 4

On-line density estimation

Loss and regret:

  L_T = Σ_{t=1}^T ℓ_t(θ_t)            (total loss of the learner after T trials)

  L*_T = inf_{θ∈Θ} Σ_{t=1}^T ℓ_t(θ)   (best-in-hindsight fixed-parameter loss after T trials)

  R_T = L_T − L*_T                     (regret of the learner after T trials)

Goal: on-line density estimation strategies with regret bounds. As usual, no stochastic assumption about how Nature generates data.

SLIDE 5

Previous work in on-line density estimation

Some on-line learning literature:

  Freund, 1996: Bernoulli (weighted coins)
  Azoury & Warmuth, 1999: some exponential families, including ↓
  Takimoto & Warmuth, 2000a: fixed-covariance Gaussian
  Takimoto & Warmuth, 2000b: some one-dimensional exponential families

  • Straightforward on-line parameter estimation yields O(log T) regret; subtle variations can improve the constants. In the case of the fixed-covariance Gaussian, a recursively-defined update rule yields the minimax strategy.
  • Often, simple random sequences yield lower bounds.

SLIDE 6

On-line Gaussian density estimation

For simplicity, just look at the one-dimensional case; the results generalize to the multivariate case with linear algebra.

  • Parameter space: Θ = ℝ × ℝ_{>0} (mean and variance)
    – Learner chooses (µ_t, σ²_t) in trial t
  • Data space: X = ℝ
  • Loss function:

  ℓ((µ, σ²), x) = (x − µ)²/(2σ²) + ½ ln σ²

Can view this as squared loss on the prediction µ with "confidence" parameter σ².
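A direct transcription of this loss (a sketch; note the slide's ℓ drops the constant ½ ln 2π from the exact negative log-likelihood, which does not affect the regret):

```python
import math

def gauss_loss(mu, var, x):
    # ℓ((µ, σ²), x) = (x − µ)² / (2σ²) + ½ ln σ²
    # squared loss on prediction µ weighted by "confidence" 1/σ²;
    # the ½ ln σ² term makes overconfidence (tiny σ²) cheap only
    # when the squared error |x − µ|² is itself small
    return (x - mu) ** 2 / (2.0 * var) + 0.5 * math.log(var)
```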

SLIDE 7

Main results

  • Standard formulation suffers from degenerate cases, similar to problems in MLE of Gaussian distributions.
  • Instead, consider an alternative formulation with a hallucinated zeroth trial.
  • We study the strategy that chooses the sample mean and variance of previous instances (follow-the-leader). The trivial regret bound is O(T²).
  • 1. For any p > 1, there are sequences (x_t) for which the regret is Ω(T^(1−1/p)). Similar for any sublinear function of T.
  • 2. Linear bound on regret for all sequences.
  • 3. For any sequence, the average regret → 0; i.e. for any sequence,

  lim sup_{T→∞} R_T / T ≤ 0.

SLIDE 8

Problems with standard formulation

Unbounded instances

  • Learner’s means satisfy |µ_t| < ∞.
  • Nature can choose x_t so that |x_t − µ_t| is arbitrarily large.
  • ∴ Regret unbounded.

Fix: assume all |xt| ≤ r for some r ≥ 0 (same as in fixed-variance case).

SLIDE 9

Problems with standard formulation

Non-varying instances

  • If x₁ = x₂ = … = x_T, then L*_T = lim_{σ²→0} (T/2) ln σ² = −∞.
  • ∴ Regret unbounded.

Fix: force some variance by hallucinating a zeroth trial, and include it in the loss and regret quantities:

  ℓ₀(µ, σ²) = ½ Σ_{x∈{±s}} [ (x − µ)²/(2σ²) + ½ ln σ² ]

for some constant s > 0.

Consequence: L*_T > −∞ for all T. (Alternative: compare to best-in-hindsight loss plus Bregman divergence to initial parameter.)

SLIDE 10

Follow-the-leader

Follow-the-leader: use the parameter setting that minimizes the total loss over all previous trials. This is the "natural" strategy: choose the sample mean and variance of the previously seen instances.

  (µ₁, σ²₁) = (µ₀, σ²₀) = (0, s²)   for some s² > 0, due to trial zero

  µ_{t+1} = (1/(t+1)) Σ_{i=1}^t x_i   and   σ²_{t+1} = (1/(t+1)) ( s² + Σ_{i=1}^t x_i² ) − µ²_{t+1}

  • No randomization / perturbation (cf. follow-the-perturbed-leader).
  • Similar to algorithms proposed in Azoury & Warmuth (1999) for exponential families, which enjoyed O(log T) regret bounds. In our setting, it looks like O(T²) bounds (without further assumptions).
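The update has a simple streaming implementation. A minimal sketch, with helper name and the default s² = 1 my own:

```python
def ftl_params(xs, s2=1.0):
    """Follow-the-leader parameters: entry t (0-indexed) is the (µ, σ²)
    played at trial t+1, after seeing x_1..x_t plus the zeroth trial."""
    out, sum_x, sum_sq = [], 0.0, 0.0
    for n in range(len(xs) + 1):
        mu = sum_x / (n + 1)                      # µ_{n+1}
        var = (s2 + sum_sq) / (n + 1) - mu * mu   # σ²_{n+1}
        out.append((mu, var))
        if n < len(xs):
            sum_x += xs[n]
            sum_sq += xs[n] ** 2
    return out
```

For an empty history this returns (0, s²), matching the initialization above; the trial-zero mass also guarantees σ²_t ≥ s²/t for every t.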

SLIDE 11

#1: Regret lower bound for follow-the-leader

Finite sequence: s = 1; the sequence is x₁ = … = x_{T−1} = 0 and x_T = 1.

Learner's parameters: µ_t ≡ 0, σ²_t = 1/t.

Final regret: R_T = Ω(1/σ²_T) = Ω(T).

[Plots: σ²_t and R_t versus t for this sequence.]
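A quick simulation makes the mechanism concrete (a sketch reusing the slide's loss; variable names are mine): on the all-zeros prefix the variance shrinks like 1/t, so the single surprise at trial T costs about T/2.

```python
import math

T = 20
xs = [0.0] * (T - 1) + [1.0]        # the lower-bound sequence, with s = 1
sum_x = sum_sq = 0.0
losses = []
for n, x in enumerate(xs):          # trial t = n + 1
    mu = sum_x / (n + 1)            # stays 0 on the zero prefix
    var = (1.0 + sum_sq) / (n + 1) - mu * mu   # equals 1/t on the prefix
    losses.append((x - mu) ** 2 / (2 * var) + 0.5 * math.log(var))
    sum_x += x
    sum_sq += x * x
# final trial: (1 − 0)² / (2 · (1/T)) + ½ ln(1/T) = T/2 − ½ ln T = Ω(T)
```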

SLIDE 12

#1: Regret lower bound for follow-the-leader

Infinite sequence: iterate the finite sequence; regret comes arbitrarily close to linear.

[Plot: R_t versus t for the iterated sequence.]

Let f : ℕ → ℕ be increasing. There is a sequence such that for any T in the range of f,

  R_T ≥ C · (T + 1) / (f⁻¹(T) + 1)².

SLIDE 13

#2: Regret upper bound

Can derive an expression for the regret of follow-the-leader either directly or, say, via the Bregman divergence formulation (Azoury & Warmuth, 1999):

  R_T ≤ Σ_{t=1}^T [ 1/(4(t+1)) ] ( (x_t − µ_t)² / σ²_t )² + ¼ ln(T + 1) + Θ(1)

(bound due to a second-order Taylor approximation). Since σ²_t ≥ s²/t (where s² is the trial-zero variance), each squared term can be O(t²), so each summand is O(t) and the sum gives a trivial bound of O(T²).

Problem: small variances … (but they can't always be small).

SLIDE 14

#2: Regret upper bound

Rewrite the variance parameter:

  σ²_t = (1/t) ( s² + Σ_{k=1}^{t−1} ∆_k ),   where   ∆_k = (k/(k+1)) (x_k − µ_k)².

(∆_k is ≈ the squared distance of the new instance to the average of the old instances.)

∴ Use a potential function argument plus algebra:

  R_T ≤ ¼ Σ_{t=1}^T (t + 1) ( ∆_t / (s² + Σ_{k=1}^{t−1} ∆_k) )² + O(log T)
      ≤ (C/4) · (T + 1) · ( 1 − s² / (s² + Σ_{k=1}^T ∆_k) ) + O(log T),

i.e. a linear bound.
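The rewrite of σ²_t in terms of the ∆_k is an exact algebraic identity, easy to confirm numerically (a sketch; the helper name is mine):

```python
def check_rewrite(xs, s2=1.0):
    """Verify t·σ²_t = s² + Σ_{k<t} ∆_k with ∆_k = (k/(k+1))(x_k − µ_k)²."""
    sum_x = sum_sq = delta_sum = 0.0
    for n, x in enumerate(xs):        # trial t = n + 1
        t = n + 1
        mu = sum_x / t                             # µ_t
        var = (s2 + sum_sq) / t - mu * mu          # σ²_t, from the definition
        assert abs(t * var - (s2 + delta_sum)) < 1e-9
        delta_sum += (t / (t + 1)) * (x - mu) ** 2   # ∆_t
        sum_x += x
        sum_sq += x * x
    return True
```

The check passes because (t+1)·σ²_{t+1} − t·σ²_t = (t/(t+1))(x_t − µ_t)² = ∆_t, by the usual incremental-variance algebra.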

SLIDE 15

Two regimes for follow-the-leader

  • 1. Sequences achieving the lower bounds have σ²_t → 0: regret can be arbitrarily close to linear.
  • 2. If, instead, lim inf σ²_t > 0, then there is some T₀ such that for all T > T₀,

  R_T ≤ c · T₀ + O( log(T/T₀) ),

i.e. eventually, the average regret R_T/T tends to zero at rate O(log T / T).

SLIDE 16

#3: lim sup average regret bound

Actually, even when σ²_t → 0, the average regret tends to zero.

Formally, for any sequence, lim sup_{T→∞} R_T / T ≤ 0.

Proof idea: show lim sup R_T/T ≤ ε for any ε > 0.

  • Two types of trials, depending on ∆_t ≈ (x_t − µ_t)²:
  • 1. ∆_t small → contributes ≪ ε to the regret.
  • 2. ∆_t large → causes the variance to rise substantially; behavior is more like the second regime.

SLIDE 17

Multivariate Gaussians

  • For d-dimensional Gaussians, essentially have an extra factor of d in front of all bounds.
  • "Progress" in covariance happens one dimension at a time, so lower bounds can also exploit each dimension (almost) independently.
  • The potential function for the upper bound is tr(Σ⁻¹).

SLIDE 18

Open problem

  • This work: analysis of follow-the-leader for on-line Gaussian density estimation with arbitrary covariance.
  • Still open (from Takimoto & Warmuth, 2000a): what is the minimax strategy?

SLIDE 19

Thanks!

Authors supported by:

  • NSF grant IIS-0347646
  • Engineering Institute (Los Alamos National Laboratory / UC San Diego) graduate fellowship

SLIDE 20

Incremental off-line algorithm

Derived by Azoury & Warmuth (1999) for general exponential families.

Update rule: choose an initial parameter (µ₁, σ²₁) ∈ ℝ × ℝ_{>0}; then

  (µ_{t+1}, σ²_{t+1}) = arg min_{(µ,σ²)} [ η₁⁻¹ ∆((µ, σ²), (µ₁, σ²₁)) + Σ_{i=1}^t ℓ_i(µ, σ²) ]

where ∆(·, ·) is the Bregman divergence for Gaussians,

  ∆((µ, σ²), (µ̃, σ̃²)) = ½ [ (µ − µ̃)²/σ² + σ̃²/σ² − ln(σ̃²/σ²) − 1 ],

and η₁⁻¹ is a parameter (e.g. η₁⁻¹ = 1).

SLIDE 21

Incremental off-line algorithm

Theorem of Azoury & Warmuth (1999):

  Σ_{t=1}^T ℓ_t(µ_t, σ²_t) − Σ_{t=1}^T ℓ_t(µ, σ²)
    = η₁⁻¹ ∆((µ, σ²), (µ₁, σ²₁))                               [divergence to initial]
    − η_{T+1}⁻¹ ∆((µ, σ²), (µ_{T+1}, σ²_{T+1}))                [divergence to final]
    + Σ_{t=1}^T η_{t+1}⁻¹ ∆((µ_t, σ²_t), (µ_{t+1}, σ²_{t+1}))  [update cost]

where η_t⁻¹ = η₁⁻¹ + t − 1.

SLIDE 22

Incremental off-line algorithm

What is the total update cost?

  Σ_{t=1}^T η_{t+1}⁻¹ ∆((µ_t, σ²_t), (µ_{t+1}, σ²_{t+1}))   [update cost]
    = ½ Σ_{t=1}^T [ (x_t − µ_t)²/σ²_t − (η₁⁻¹ + t) ln( 1 + (x_t − µ_t)² / ((η₁⁻¹ + t) σ²_t) ) ]
    + ¼ ln( (η₁⁻¹ + T) / η₁⁻¹ ) + Θ(1)

Our results show how to bound this quantity.

SLIDE 23

Potential function

One-dimensional case: for any a₁, a₂, …, a_T ∈ [0, c],

  Σ_{t=1}^T t ( a_t / (1 + Σ_{i=1}^{t−1} a_i) )² ≤ (c² + c) · T · ( 1 − 1 / (1 + Σ_{t=1}^T a_t) ).

(Whenever a_t > 0, progress is made in the denominator for term t + 1.)
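The inequality is easy to probe numerically (a sketch; the helper name is mine), on random sequences and on the extreme a_t ≡ c, where for T = 1 the bound holds with equality:

```python
import random

def potential_bound(a, c):
    """Return (lhs, rhs) of the one-dimensional potential inequality."""
    lhs, prefix = 0.0, 0.0
    for t, a_t in enumerate(a, start=1):
        lhs += t * (a_t / (1.0 + prefix)) ** 2   # denominator uses Σ_{i<t} a_i
        prefix += a_t                            # "progress" for term t + 1
    rhs = (c * c + c) * len(a) * (1.0 - 1.0 / (1.0 + prefix))
    return lhs, rhs
```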

SLIDE 24

Potential function

d-dimensional case: for any rank-one A₁, A₂, …, A_T ∈ ℝ^{d×d} with tr(A_t) ∈ [0, c],

  Σ_{t=1}^T t [ tr( A_t ( I + Σ_{i=1}^{t−1} A_i )⁻¹ ) ]² ≤ (c² + c) · T · ( d − tr( ( I + Σ_{i=1}^T A_i )⁻¹ ) ).

(Need to worry about the "denominator" in d directions.)
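A numerical probe of the matrix version (a sketch; it takes the A_t to be symmetric PSD rank-one, A_t = v vᵀ, the case arising from the covariance analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, c = 3, 30, 1.5
I = np.eye(d)
S = np.zeros((d, d))                         # running sum of past A_i
lhs = 0.0
for t in range(1, T + 1):
    v = rng.standard_normal(d)
    A = np.outer(v, v)                       # symmetric PSD, rank one
    A *= rng.uniform(0.0, c) / np.trace(A)   # rescale so tr(A) ∈ [0, c]
    lhs += t * np.trace(A @ np.linalg.inv(I + S)) ** 2
    S += A
rhs = (c * c + c) * T * (d - np.trace(np.linalg.inv(I + S)))
```

With d = 1 this reduces exactly to the one-dimensional inequality on the previous slide.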

SLIDE 25

The zeroth trial

Let s > 0 be a fixed constant. In the zeroth trial, the learner incurs loss

  ℓ₀(µ, σ²) = ½ Σ_{x∈{±s}} [ (x − µ)²/(2σ²) + ½ ln σ² ] = (µ² + s²)/(2σ²) + ½ ln σ².

Redefine the loss and regret quantities to include the zeroth trial:

  R_T = Σ_{t=0}^T ℓ_t(µ_t, σ²_t) − inf_{(µ,σ²)} Σ_{t=0}^T ℓ_t(µ, σ²).

(Alternative: compare to best-in-hindsight loss plus Bregman divergence to initial parameter.)

SLIDE 26

The zeroth trial

  inf_{(µ,σ²)} Σ_{t=0}^T ℓ_t(µ, σ²)
    = inf_{(µ,σ²)} [ (µ² + s²)/(2σ²) + ½ ln σ² + Σ_{t=1}^T ( (x_t − µ)²/(2σ²) + ½ ln σ² ) ]
    = (T + 1)/2 + ((T + 1)/2) ln σ²_T > −∞,

where

  µ_T = (1/(T + 1)) Σ_{t=1}^T x_t   and   σ²_T = (1/(T + 1)) ( s² + Σ_{t=1}^T x_t² ) − µ²_T ≥ s²/(T + 1).