Linking losses for density ratio and class-probability estimation

Aditya Krishna Menon, Cheng Soon Ong. NICTA and The Australian National University.


slide-1
SLIDE 1

Linking losses for density ratio and class-probability estimation

Aditya Krishna Menon Cheng Soon Ong

NICTA and The Australian National University

0 / 34

slide-2
SLIDE 2

Linking losses for density ratio and class-probability estimation

Aditya Krishna Menon Cheng Soon Ong

Data61 and The Australian National University

0 / 34

slide-4
SLIDE 4

Class-probability estimation (CPE)

From labelled instances, estimate the probability of an instance being positive

e.g. using logistic regression

[Figure: positively and negatively labelled instances, with an example probability estimate of 0.6]

1 / 34

slide-6
SLIDE 6

Density ratio estimation (DRE)

Given samples from densities p,q, estimate density ratio r = p/q

[Figure: densities p and q and their ratio r]

2 / 34

slide-9
SLIDE 9

Application: covariate shift adaptation

Marginal training distribution ≠ marginal test distribution

[Figure: two samples of positive and negative instances, one from the training distribution and one from the test distribution]

Can overcome by reweighting training instances

use ratio between test and train densities
e.g. weighted class-probability estimator

3 / 34

slide-13
SLIDE 13

This paper

Formal link between CPE and DRE

existing DRE approaches → implicitly performing CPE
CPE → Bregman minimisation for DRE
new application of DRE losses to “top ranking”

[Diagram: CPE (logistic, exponential, square hinge) and DRE (KLIEP, LSIF), linked via proper losses, Bregman minimisation and ranking]

4 / 34

slide-14
SLIDE 14

DRE and CPE: formally

5 / 34

slide-16
SLIDE 16

Distributions for learning with binary labels

Fix an instance space $\mathcal{X}$ (e.g. $\mathbb{R}^n$)

Let $D$ be a distribution over $\mathcal{X} \times \{\pm 1\}$, with $\mathbb{P}(Y = 1) = \frac{1}{2}$ and

$$(P(x), Q(x)) = (\mathbb{P}(X = x \mid Y = 1), \mathbb{P}(X = x \mid Y = -1))$$

$$(M(x), \eta(x)) = (\mathbb{P}(X = x), \mathbb{P}(Y = 1 \mid X = x))$$

[Figure, two panels: "Class conditionals" and "Marginal and class-probability function"]

6 / 34

slide-19
SLIDE 19

Scorers, losses, risks

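A scorer is any $s \colon \mathcal{X} \to \mathbb{R}$

e.g. linear scorer $s \colon x \mapsto \langle w, x \rangle$

A loss is any $\ell \colon \{\pm 1\} \times \mathbb{R} \to \mathbb{R}_+$

e.g. logistic loss $\ell \colon (y, v) \mapsto \log(1 + e^{-yv})$

[Figure: logistic loss $\ell(y, v)$ plotted against $v$]

The risk of scorer $s$ wrt loss $\ell$ and distribution $D$ is

$$L(s; D, \ell) = \mathbb{E}_{(X,Y) \sim D}\left[\ell(Y, s(X))\right]$$

average loss on a random sample

[Figure: instances annotated with example loss values]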
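As a minimal sketch of the risk definition above, the empirical risk of a linear scorer under the logistic loss can be computed as follows. The synthetic data, the two weight vectors and the helper names `logistic_loss` and `empirical_risk` are illustrative assumptions, not part of the talk.

```python
import numpy as np

def logistic_loss(y, v):
    """Logistic loss l(y, v) = log(1 + exp(-y v)), computed stably."""
    return np.logaddexp(0.0, -y * v)

def empirical_risk(w, X, y, loss=logistic_loss):
    """Average loss of the linear scorer s(x) = <w, x> over a sample."""
    scores = X @ w
    return float(np.mean(loss(y, scores)))

# Hypothetical sample: labels in {-1, +1} depend noisily on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=200) > 0, 1.0, -1.0)

risk_zero = empirical_risk(np.zeros(2), X, y)            # zero scorer: risk is log 2
risk_good = empirical_risk(np.array([2.0, 0.0]), X, y)   # scorer aligned with the labels
```

The zero scorer incurs log 2 on every instance, so any scorer that carries information about the labels should achieve a lower average.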

7 / 34

slide-22
SLIDE 22

CPE versus DRE

Given samples $S \sim D^N$, with $D = (P, Q) = (M, \eta)$:

Class-probability estimation (CPE)

Estimate $\eta$

class-probability function

[Figure: positively and negatively labelled instances, with an example probability estimate of 0.6]

Density ratio estimation (DRE)

Estimate $r = p/q$

class-conditional density ratio

[Figure: densities p and q and their ratio r]

8 / 34

slide-24
SLIDE 24

CPE approaches: proper composite losses

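For suitable $\mathcal{S} \subseteq \mathbb{R}^{\mathcal{X}}$, find

$$\operatorname*{argmin}_{s \in \mathcal{S}} L(s; D, \ell)$$

where $\ell$ is such that, for some invertible $\Psi \colon [0,1] \to \mathbb{R}$,

$$\operatorname*{argmin}_{s \in \mathbb{R}^{\mathcal{X}}} L(s; D, \ell) = \Psi \circ \eta$$

estimate $\hat{\eta} = \Psi^{-1} \circ s$

Such an $\ell$ is called strictly proper composite with link $\Psi$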

9 / 34

slide-25
SLIDE 25

Examples of proper composite losses

[Figure, three panels: each loss $\ell(y, v)$ plotted against $v$]

Logistic loss: $\Psi^{-1} \colon v \mapsto \sigma(v)$

Exponential loss: $\Psi^{-1} \colon v \mapsto \sigma(2v)$

Square hinge loss: $\Psi^{-1} \colon v \mapsto \min(\max(0, (v+1)/2), 1)$
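The three inverse links can be sanity-checked numerically. The sketch below minimises the conditional risk $\eta \cdot \ell(1, v) + (1 - \eta) \cdot \ell(-1, v)$ over a grid of scores and applies the inverse link to the minimiser; the grid, the value $\eta = 0.3$ and the names `LOSSES` and `recovered_probability` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# (loss, inverse link) pairs as listed on the slide.
LOSSES = {
    "logistic": (lambda y, v: np.logaddexp(0.0, -y * v), sigmoid),
    "exponential": (lambda y, v: np.exp(-y * v), lambda v: sigmoid(2.0 * v)),
    "square_hinge": (lambda y, v: np.maximum(0.0, 1.0 - y * v) ** 2,
                     lambda v: np.clip((v + 1.0) / 2.0, 0.0, 1.0)),
}

def recovered_probability(name, eta, grid=np.linspace(-5.0, 5.0, 200001)):
    """Minimise the conditional risk over a grid of scores, then apply the inverse link."""
    loss, inverse_link = LOSSES[name]
    cond_risk = eta * loss(1.0, grid) + (1.0 - eta) * loss(-1.0, grid)
    v_star = grid[np.argmin(cond_risk)]
    return float(inverse_link(v_star))

# For each loss, the recovered value should be close to the true eta = 0.3.
recovered = {name: recovered_probability(name, eta=0.3) for name in LOSSES}
```

A grid search is used here only because it makes no assumption about the loss; any one-dimensional minimiser would serve.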

10 / 34

slide-26
SLIDE 26

DRE approaches: divergence minimisation

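For suitable $\mathcal{S} \subseteq \mathbb{R}^{\mathcal{X}}$, find

KLIEP: (Sugiyama et al., 2008)

$$\operatorname*{argmin}_{s \in \mathcal{S}} \mathrm{KL}(p \,\|\, q \odot s)$$

constrained KL minimisation

LSIF: (Kanamori et al., 2009)

$$\operatorname*{argmin}_{s \in \mathcal{S}} \mathbb{E}_{X \sim Q}\left[(r(X) - s(X))^2\right]$$

direct least squares minimisation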

11 / 34

slide-27
SLIDE 27

Story so far

[Diagram: CPE (logistic, exponential, square hinge) alongside DRE (KLIEP, LSIF)]

12 / 34

slide-28
SLIDE 28

Roadmap

We begin by showing existing DRE losses implicitly perform CPE

[Diagram: CPE (logistic, exponential, square hinge) and DRE (KLIEP, LSIF), linked via proper losses]

12 / 34

slide-29
SLIDE 29

Existing DRE losses are proper composite

13 / 34

slide-30
SLIDE 30

Existing DRE approaches

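Suppose $D = (P, Q)$

KLIEP: (Sugiyama et al., 2008)

$$\operatorname*{argmin}_{s \in \mathcal{S}} \mathrm{KL}(p \,\|\, q \odot s)$$

LSIF: (Kanamori et al., 2009)

$$\operatorname*{argmin}_{s \in \mathcal{S}} \mathbb{E}_{X \sim Q}\left[(r(X) - s(X))^2\right]$$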

14 / 34

slide-32
SLIDE 32

Existing DRE approaches as loss minimisation

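Suppose $D = (P, Q)$

KLIEP: (Sugiyama et al., 2008)

$$\operatorname*{argmin}_{s \in \mathcal{S}} \mathbb{E}_{(X,Y) \sim D}\left[\ell(Y, s(X))\right]$$

$\ell(-1, v) = a \cdot v$ and $\ell(1, v) = -\log v$, for suitable $a > 0$

LSIF: (Kanamori et al., 2009)

$$\operatorname*{argmin}_{s \in \mathcal{S}} \mathbb{E}_{(X,Y) \sim D}\left[\ell(Y, s(X))\right]$$

$\ell(-1, v) = \frac{1}{2} \cdot v^2$ and $\ell(1, v) = -v$

These are no ordinary losses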

14 / 34

slide-34
SLIDE 34

Existing DRE approaches as CPE

For $u \in [0,1]$, let

$$\Psi_{\mathrm{dr}} \colon u \mapsto \frac{u}{1-u}.$$

Lemma

The LSIF loss is strictly proper composite with link $\Psi_{\mathrm{dr}}$. The KLIEP loss with $a > 0$ is strictly proper composite with link $a^{-1} \cdot \Psi_{\mathrm{dr}}$.

KLIEP and LSIF perform CPE in disguise!
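One way to see the lemma concretely is to minimise the conditional risk $\eta \cdot \ell(1, v) + (1 - \eta) \cdot \ell(-1, v)$ pointwise and compare the minimiser with $\Psi_{\mathrm{dr}}(\eta)$. The sketch below does this on a grid; the choices $\eta = 0.6$, $a = 2$, the grid and the function names are illustrative assumptions.

```python
import numpy as np

def lsif_loss(y, v):
    """LSIF as a loss: l(1, v) = -v and l(-1, v) = v^2 / 2."""
    return -v if y == 1 else 0.5 * v ** 2

def kliep_loss(y, v, a=1.0):
    """KLIEP as a loss: l(1, v) = -log v and l(-1, v) = a * v."""
    return -np.log(v) if y == 1 else a * v

def pointwise_minimiser(loss, eta, grid):
    """Score minimising the conditional risk at class-probability eta."""
    cond_risk = eta * loss(1, grid) + (1.0 - eta) * loss(-1, grid)
    return float(grid[np.argmin(cond_risk)])

eta = 0.6
grid = np.linspace(1e-3, 10.0, 1_000_001)   # positive scores only, since KLIEP takes log v
psi_dr = eta / (1.0 - eta)                  # the link evaluated at eta

v_lsif = pointwise_minimiser(lsif_loss, eta, grid)
v_kliep = pointwise_minimiser(lambda y, v: kliep_loss(y, v, a=2.0), eta, grid)
```

Under these assumptions `v_lsif` should sit near `psi_dr` and `v_kliep` near `psi_dr / 2`, matching the links $\Psi_{\mathrm{dr}}$ and $a^{-1} \cdot \Psi_{\mathrm{dr}}$.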

15 / 34

slide-38
SLIDE 38

Proof

For LSIF and KLIEP (with $a = 1$),

$$\frac{\ell'(1, v)}{\ell'(-1, v)} = -\frac{1}{v},$$

so that

$$f(v) = \frac{1}{1 - \frac{\ell'(1, v)}{\ell'(-1, v)}} = \frac{v}{1+v} = \Psi_{\mathrm{dr}}^{-1}(v).$$

Proper compositeness follows from (Reid and Williamson, 2010).

The link $\Psi_{\mathrm{dr}}$ is especially suitable for DRE...

16 / 34

slide-43
SLIDE 43

Another view of Ψdr

Bayes’ rule shows targets of DRE and CPE are linked:

$$(\forall x \in \mathcal{X}) \quad r(x) \doteq \frac{p(x)}{q(x)} = \frac{\eta(x)}{1 - \eta(x)} = \Psi_{\mathrm{dr}}(\eta(x))$$

as is well known (Bickel et al, 2009)

KLIEP and LSIF apposite for DRE

Optimal scorer is exactly $\Psi_{\mathrm{dr}} \circ \eta = r$
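For completeness, the Bayes' rule step can be written out as follows. It uses only the definitions of $P$, $Q$, $M$ and $\eta$ from the earlier slides, together with the assumption $\mathbb{P}(Y = 1) = \frac{1}{2}$ made there.

```latex
\begin{align*}
r(x) = \frac{p(x)}{q(x)}
  &= \frac{\mathbb{P}(X = x \mid Y = 1)}{\mathbb{P}(X = x \mid Y = -1)} \\
  &= \frac{\eta(x) \, M(x) / \mathbb{P}(Y = 1)}{(1 - \eta(x)) \, M(x) / \mathbb{P}(Y = -1)} \\
  &= \frac{\eta(x)}{1 - \eta(x)} \cdot \frac{\mathbb{P}(Y = -1)}{\mathbb{P}(Y = 1)}
   = \frac{\eta(x)}{1 - \eta(x)}
   = \Psi_{\mathrm{dr}}(\eta(x)).
\end{align*}
```

With unbalanced classes the prior ratio in the third line would not cancel, which is why the balanced-prior assumption matters here.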

17 / 34

slide-44
SLIDE 44

Story so far

Existing DRE losses are specific examples of CPE losses

[Diagram: CPE (logistic, exponential, square hinge) and DRE (KLIEP, LSIF), linked via proper losses]

18 / 34

slide-45
SLIDE 45

Roadmap

Now consider using arbitrary CPE losses for DRE

[Diagram: CPE (logistic, exponential, square hinge) and DRE (KLIEP, LSIF), linked via proper losses, with a further link marked "?"]

18 / 34

slide-46
SLIDE 46

CPE as Bregman minimisation

19 / 34

slide-49
SLIDE 49

General CPE approach to DRE?

Suppose $\ell$ proper composite with link $\Psi$

Class-probability estimate $\hat{\eta} = \Psi^{-1} \circ s$

for logistic loss, $\hat{\eta}(x) = 1/(1 + e^{-s(x)})$

Density ratio estimate is naturally:

$$\hat{r}(x) \doteq \Psi_{\mathrm{dr}}(\hat{\eta}(x)) = \frac{\hat{\eta}(x)}{1 - \hat{\eta}(x)}.$$

e.g. for logistic loss, $\hat{r}(x) = e^{s(x)}$

Intuitive, but what can we guarantee about this?

preceding analysis only asymptotic
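The recipe above can be tried on a toy problem. The sketch below assumes $p = N(1, 1)$ and $q = N(0, 1)$, for which the true ratio is $r(x) = e^{x - 1/2}$, fits a logistic regression with a bias term by Newton's method, and reads off $\hat{r}(x) = e^{s(x)}$. The distributions, sample sizes and variable names are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x_p = rng.normal(1.0, 1.0, size=n)   # samples from p, labelled positive
x_q = rng.normal(0.0, 1.0, size=n)   # samples from q, labelled negative

# Design matrix with a bias column; equal sample sizes give balanced classes.
X = np.column_stack([np.concatenate([x_p, x_q]), np.ones(2 * n)])
y = np.concatenate([np.ones(n), np.zeros(n)])   # 1 for p, 0 for q

# Newton's method on the average logistic loss of the scorer s(x) = w[0] * x + w[1].
w = np.zeros(2)
for _ in range(25):
    eta_hat = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (eta_hat - y) / len(y)
    hess = (X * (eta_hat * (1.0 - eta_hat))[:, None]).T @ X / len(y)
    w -= np.linalg.solve(hess, grad)

def ratio_estimate(x):
    """Density ratio estimate r_hat(x) = exp(s(x)) implied by the logistic link."""
    return np.exp(w[0] * x + w[1])

def ratio_true(x):
    """True ratio of N(1, 1) to N(0, 1) densities."""
    return np.exp(x - 0.5)
```

Since $\log r(x) = x - \frac{1}{2}$ is linear in $x$, the fitted weights should land close to $(1, -\frac{1}{2})$ in this well-specified setting; with a misspecified scorer class no such agreement should be expected.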

20 / 34

slide-53
SLIDE 53

A Bregman minimisation view of CPE

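For proper composite $\ell$, the regret or excess risk of a scorer is

$$\mathrm{reg}(s; D, \ell) = L(s; D, \ell) - \min_{s^* \in \mathbb{R}^{\mathcal{X}}} L(s^*; D, \ell) = \mathbb{E}_{X \sim M}\left[B_f(\eta(X), \hat{\eta}(X))\right]$$

for Bregman divergence $B_f$ and loss-specific $f$

e.g. for logistic loss, regret is a KL projection

$$\mathrm{reg}(s; D, \ell) = \mathbb{E}_{X \sim M}\left[\mathrm{KL}(\eta(X) \,\|\, \hat{\eta}(X))\right]$$

Does this imply a Bregman projection onto $r$?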

21 / 34

slide-57
SLIDE 57

A Bregman identity

The following lemma lets us make progress.

Lemma

Pick any convex and twice differentiable $f \colon [0,1] \to \mathbb{R}$. Then,

$$(\forall x, y \in [0, \infty)) \quad B_f\left(\frac{x}{1+x}, \frac{y}{1+y}\right) = \frac{1}{1+x} \cdot B_{f^{\circledast}}(x, y),$$

where $f^{\circledast} \colon z \mapsto (1+z) \cdot f\left(\frac{z}{1+z}\right)$.

$f^{\circledast}$ is closely related to the perspective transform

Unlike standard dual symmetry,

$$B_f(x, y) = B_{f^*}(f'(y), f'(x)),$$

order of $x$ and $y$ retained, and only $x$ appears in extra scaling factor
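The identity is easy to check numerically for a particular $f$. The sketch below takes $f$ to be the negative binary entropy (an assumed choice, for which $B_f$ is the binary KL divergence), builds the transformed function $z \mapsto (1+z) \cdot f(z/(1+z))$ together with its derivative, and compares both sides on random points. All function names are illustrative.

```python
import numpy as np

def bregman(f, df, x, y):
    """Bregman divergence B_f(x, y) = f(x) - f(y) - f'(y) (x - y)."""
    return f(x) - f(y) - df(y) * (x - y)

def transformed(f, df):
    """Return g(z) = (1 + z) f(z / (1 + z)) and its derivative g'(z)."""
    def g(z):
        return (1.0 + z) * f(z / (1.0 + z))
    def dg(z):
        u = z / (1.0 + z)
        # Product and chain rule: g'(z) = f(u) + f'(u) / (1 + z).
        return f(u) + df(u) / (1.0 + z)
    return g, dg

# Assumed example: negative binary entropy on (0, 1).
f = lambda u: u * np.log(u) + (1.0 - u) * np.log(1.0 - u)
df = lambda u: np.log(u) - np.log(1.0 - u)
g, dg = transformed(f, df)

rng = np.random.default_rng(0)
x, y = rng.uniform(0.1, 5.0, size=(2, 1000))

lhs = bregman(f, df, x / (1.0 + x), y / (1.0 + y))
rhs = bregman(g, dg, x, y) / (1.0 + x)
max_gap = float(np.max(np.abs(lhs - rhs)))   # should be at floating-point noise level
```

Swapping in another convex, twice differentiable `f` with its derivative `df` exercises the same check without further changes.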

22 / 34

slide-58
SLIDE 58

Proof - I

By (Reid and Williamson 2009, Equation 12),

$$B_f(x, y) = \int_y^x (x - z) \cdot f''(z) \, dz.$$

Applying this to the LHS,

$$B_f\left(\frac{x}{1+x}, \frac{y}{1+y}\right) = \int_{\frac{y}{1+y}}^{\frac{x}{1+x}} \left(\frac{x}{1+x} - z\right) \cdot f''(z) \, dz.$$

23 / 34

slide-62
SLIDE 62

Proof - II

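Employing the substitution $z = \frac{u}{1+u}$, with $dz = \frac{du}{(1+u)^2}$,

$$\mathrm{LHS} = \int_y^x \left(\frac{x}{1+x} - \frac{u}{1+u}\right) \cdot f''\left(\frac{u}{1+u}\right) \cdot \frac{1}{(1+u)^2} \, du$$

$$= \frac{1}{1+x} \cdot \int_y^x (x - u) \cdot f''\left(\frac{u}{1+u}\right) \cdot \frac{1}{(1+u)^3} \, du$$

$$= \frac{1}{1+x} \cdot B_{f^{\circledast}}(x, y),$$

since by definition of $f^{\circledast}$,

$$(f^{\circledast})''(z) = f''\left(\frac{z}{1+z}\right) \cdot \frac{1}{(1+z)^3}.$$

Not obviously generalisable with another substitution

RHS does not remain a Bregman divergence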

24 / 34

slide-65
SLIDE 65

Implication for DRE via CPE

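Identity is equivalently

$$B_f\left(\Psi_{\mathrm{dr}}^{-1}(x), \Psi_{\mathrm{dr}}^{-1}(y)\right) = \frac{1}{1+x} \cdot B_{f^{\circledast}}(x, y).$$

Apply to $x = r$, so that $\Psi_{\mathrm{dr}}^{-1}(x) = \eta$

Lemma

Pick any strictly proper composite $\ell$ with $f$ twice differentiable. Then, for any distribution $D = (P, Q)$ and scorer $s \colon \mathcal{X} \to \mathbb{R}$,

$$\mathrm{reg}(s; D, \ell) = \frac{1}{2} \cdot \mathbb{E}_{X \sim Q}\left[B_{f^{\circledast}}(r(X), \hat{r}(X))\right],$$

for $\hat{r} = \Psi_{\mathrm{dr}} \circ \hat{\eta} = \Psi_{\mathrm{dr}} \circ \Psi^{-1} \circ s$.

Justifies using CPE for DRE

concrete sense in which $\hat{r}$ is a good estimate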

25 / 34

slide-66
SLIDE 66

Story so far

Shown how to perform DRE with range of CPE losses

[Diagram: CPE (logistic, exponential, square hinge) and DRE (KLIEP, LSIF), linked via proper losses and Bregman minimisation]

26 / 34

slide-67
SLIDE 67

Roadmap

Final link is to use DRE losses for CPE problems

[Diagram: CPE (logistic, exponential, square hinge) and DRE (KLIEP, LSIF), linked via proper losses and Bregman minimisation, with a further link marked "?"]

26 / 34

slide-68
SLIDE 68

DRE for bipartite top ranking

27 / 34

slide-71
SLIDE 71

Bipartite top ranking

Given $S \sim D^N$ as before, learn scorer $s \colon \mathcal{X} \to \mathbb{R}$ with

Bipartite ranking: maximal area under ROC curve

rank average positives above negatives
CPE is suitable (Kotlowski et al, 2010, Agarwal, 2014)

Top ranking: maximal partial area under ROC curve

rank top positives above negatives
is CPE suitable?

28 / 34

slide-72
SLIDE 72

CPE and weight functions

Any proper composite ℓ has weight function w: [0,1] → R∗

large w(c) → more focus on η ≈ c

[Figure, three panels: weight function $w(c)$ plotted against $c$ for each loss]

Logistic loss: $w(c) = \frac{1}{2 \cdot c \cdot (1-c)}$

Exponential loss: $w(c) = \frac{1}{4 \cdot c^{3/2} \cdot (1-c)^{3/2}}$

Square hinge loss: $w(c) = 2$

29 / 34

slide-73
SLIDE 73

Top ranking via LSIF

Carefully selected $\ell$ suitable for top ranking

choose $\ell$ with $w$ focussing on large values of $\eta$

Easy to check that for LSIF,

$$w(c) = \frac{1}{(1-c)^3}.$$

focusses on $\eta \approx 1$
appealing due to closed-form solution!

See paper for details

$$\ell(-1, v) = \frac{1}{2} \cdot v^2 \quad \text{and} \quad \ell(1, v) = -v.$$
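To illustrate the closed-form solution mentioned above, here is a sketch of LSIF with a scorer that is linear in fixed features, $s(x) = \theta^\top \phi(x)$. Minimising $\frac{1}{2} \mathbb{E}_Q[s(X)^2] - \mathbb{E}_P[s(X)]$ plus a ridge penalty then reduces to one linear solve. The Gaussian-bump feature map, the ridge penalty `lam`, the toy samples and all function names are assumptions for illustration; the talk does not specify them.

```python
import numpy as np

def features(x):
    """Hypothetical feature map: Gaussian bumps at fixed centres."""
    centres = np.linspace(-3.0, 4.0, 8)
    return np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2)

def lsif_objective(theta, x_p, x_q, lam=1e-3):
    """Empirical LSIF risk (up to scaling) plus an assumed ridge penalty."""
    s_p, s_q = features(x_p) @ theta, features(x_q) @ theta
    return 0.5 * np.mean(s_q ** 2) - np.mean(s_p) + 0.5 * lam * theta @ theta

def lsif_fit(x_p, x_q, lam=1e-3):
    """Closed-form minimiser: theta = (H + lam I)^{-1} h."""
    phi_p, phi_q = features(x_p), features(x_q)
    H = phi_q.T @ phi_q / len(x_q)   # second moment of features under q
    h = phi_p.mean(axis=0)           # mean of features under p
    return np.linalg.solve(H + lam * np.eye(H.shape[0]), h)

rng = np.random.default_rng(0)
x_p = rng.normal(1.0, 1.0, size=2000)   # "positive" sample
x_q = rng.normal(0.0, 1.0, size=2000)   # "negative" sample

theta = lsif_fit(x_p, x_q)
scores = features(np.array([-1.0, 0.0, 1.0, 2.0])) @ theta   # scores usable for ranking
```

Because the objective is a convex quadratic in `theta`, no iterative optimiser is needed, which is the practical appeal of the closed form.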

30 / 34

slide-74
SLIDE 74

Conclusion

31 / 34

slide-75
SLIDE 75

Summary

Formal links between (losses for) CPE and DRE

[Diagram: CPE (logistic, exponential, square hinge) and DRE (KLIEP, LSIF), linked via proper losses, Bregman minimisation and ranking]

32 / 34

slide-76
SLIDE 76

Future work

Finite sample analysis

understanding of when importance weighting doesn’t help

Other applications of DRE losses?

closed form solution for LSIF is appealing

Other applications for Bregman lemma?

33 / 34

slide-77
SLIDE 77

Thanks!

Drop by the poster for more (Paper ID 152)

34 / 34