SLIDE 1

Boosted Density Estimation Remastered

Zac Cranko (1,2) and Richard Nock (2,1,3)

(1) The Australian National University  (2) CSIRO Data61  (3) The University of Sydney

SLIDE 2

Quick Summary

  • Learn a density function incrementally
  • Use classifiers for the incremental updates (similar to GAN discriminators)
  • Unlike other state-of-the-art attempts, achieve strong convergence results (geometric) using a weak learning assumption on the classifiers (in the paper!)

SLIDE 3

$$\sup_{D : \mathcal{X} \to (0,1)} \mathbb{E}_P[\log D] + \mathbb{E}_{Q_0}[\log(1-D)]$$
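This is the familiar GAN discriminator objective; for reference, its pointwise maximiser (a standard fact, stated here as an aside) is

$$D^*(x) = \frac{dP}{dP + dQ_0}(x), \qquad \text{so that} \qquad \frac{D^*}{1 - D^*} = \frac{dP}{dQ_0},$$

which is exactly the density ratio recovered on the next slides.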

SLIDES 4-5

Take $f(t) \overset{\text{def}}{=} t \log t - (t+1)\log(t+1)$ and $\varphi(D) \overset{\text{def}}{=} \frac{D}{1-D}$. Then

$$
\begin{aligned}
\sup_{D : \mathcal{X} \to (0,1)} \mathbb{E}_P[\log D] + \mathbb{E}_{Q_0}[\log(1-D)]
  &= \sup_{D : \mathcal{X} \to (0,1)} \mathbb{E}_P[f' \circ \varphi \circ D] - \mathbb{E}_{Q_0}[f^* \circ f' \circ \varphi \circ D] \\
  &= \sup_{d : \mathcal{X} \to (0,\infty)} \mathbb{E}_P[f' \circ d] - \mathbb{E}_{Q_0}[f^* \circ f' \circ d] \\
  &= \mathbb{E}_P\!\left[ f' \circ \tfrac{dP}{dQ_0} \right] - \mathbb{E}_{Q_0}\!\left[ f^* \circ f' \circ \tfrac{dP}{dQ_0} \right].
\end{aligned}
$$

Recall: for all $f$,

$$\int f(x)\, P(dx) = \int f(x)\, \tfrac{dP}{dQ_0}(x)\, Q_0(dx).$$
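Filling in the computations behind the first two equalities (a short check, assuming the standard convex conjugate $f^*(u) = \sup_{t>0}\,[ut - f(t)]$ and the Fenchel equality $f^*(f'(t)) = t f'(t) - f(t)$):

$$f'(t) = \log \frac{t}{t+1}, \qquad (f^* \circ f')(t) = t \log \frac{t}{t+1} - f(t) = \log(t+1),$$

so that, substituting $d = \varphi(D) = \frac{D}{1-D}$,

$$(f' \circ \varphi)(D) = \log D, \qquad (f^* \circ f' \circ \varphi)(D) = \log(\varphi(D) + 1) = -\log(1-D).$$

Since $\varphi$ is a bijection from $(0,1)$ onto $(0,\infty)$, the change of variables between the suprema over $D$ and over $d$ is lossless.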

SLIDES 6-9

Main Idea

$$d_1 \in \operatorname*{argmax}_{d' : \mathcal{X} \to (0,\infty)} \mathbb{E}_P[f' \circ d'] - \mathbb{E}_{Q_0}[f^* \circ f' \circ d']$$

  1. Find $d_1$ as above
  2. Multiply $d_1(x)\,Q_0(dx)$ to find $P(dx)$ (see the identity below)
  3. Finished. Get a job at a hedge fund next door

Unfortunately this is not so simple, since in practice we can only approximately solve the maximisation. Sadface.
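Step 2 works because, at the exact optimum, the maximiser is the density ratio itself, so multiplying it back into $Q_0$ recovers $P$:

$$d_1 = \frac{dP}{dQ_0} \implies d_1(x)\, Q_0(dx) = \frac{dP}{dQ_0}(x)\, Q_0(dx) = P(dx).$$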

SLIDES 10-12

Solution

$$d_t \in \operatorname*{argmax}_{d' : \mathcal{X} \to (0,\infty)} \mathbb{E}_P[f' \circ d'] - \mathbb{E}_{Q_{t-1}}[f^* \circ f' \circ d']$$

$$\tilde{Q}_t(dx) = d_t^{\alpha_t}(x) \cdot \tilde{Q}_{t-1}(dx), \qquad Q_t = \frac{1}{Z_t}\tilde{Q}_t, \quad \text{where } Z_t \overset{\text{def}}{=} \int d\tilde{Q}_t,$$

  1. Some step size parameters $\alpha_t \in (0,1)$
  2. Treat the updates as classifiers $d_t = \exp \circ\, c_t$

  • The classifiers distinguish between samples originating from $P$ and $Q_{t-1}$, as in a GAN
  • However, unlike a GAN there is not necessarily a simple, fast sampler for $Q_{t-1}$, but there is a closed-form density function (a sketch of the resulting loop follows below)

Convergence of $Q_t \to P$ in KL-divergence holds under a weak learning assumption on the updates as classifiers. With additional minimal assumptions: geometric convergence.
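A minimal sketch of the resulting boosting loop, assuming a logistic-regression classifier and a user-supplied approximate sampler for $Q_{t-1}$ (e.g. MCMC); `boost_density`, `sample_q`, and the fixed step size are illustrative assumptions, not the paper's implementation:

import numpy as np
from sklearn.linear_model import LogisticRegression

def boost_density(p_samples, log_q0, sample_q, n_rounds=3, alpha=0.5):
    # p_samples: (n, dim) array of samples from P
    # log_q0: closed-form initial log-density, e.g. a Gaussian fit
    # sample_q: draws approximate samples given an unnormalised log-density
    log_q = log_q0
    for t in range(n_rounds):
        # Train c_t to distinguish P-samples (label 1) from Q_{t-1}-samples (label 0).
        q_samples = sample_q(log_q, n=len(p_samples))
        X = np.vstack([p_samples, q_samples])
        y = np.concatenate([np.ones(len(p_samples)), np.zeros(len(q_samples))])
        clf = LogisticRegression().fit(X, y)

        # d_t = exp(c_t) with c_t the classifier logit, so the multiplicative
        # update d_t^alpha is additive in log space:
        #   log q_t(x) = log q_{t-1}(x) + alpha * c_t(x) - log Z_t
        def make_log_q(prev, clf):
            return lambda x: prev(x) + alpha * clf.decision_function(x)
        log_q = make_log_q(log_q, clf)
        # Z_t = \int d\tilde{Q}_t is left out here; it would be estimated
        # separately, e.g. by importance sampling against Q_0.
    return log_q  # unnormalised log-density of Q_T

Each round only adds one classifier's logit to the running log-density, which is why $Q_t$ keeps a closed-form (unnormalised) density even when no fast sampler exists.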

SLIDE 13

Experiments

[Figure: discriminator accuracy (0.5 if $Q_t \to P$) and $\mathrm{KL}(P, Q_t)$ (lower is better), plotted over boosting rounds $t = 0, 1, 2, 3$.]

SLIDE 14

Thanks for listening, come chat to us at poster #161. (Bring beer!)
