
Non-parametric Estimation of Integral Probability Metrics

Bharath K. Sriperumbudur⋆, Kenji Fukumizu†, Arthur Gretton‡,×, Bernhard Schölkopf× and Gert R. G. Lanckriet⋆

⋆UC San Diego   †The Institute of Statistical Mathematics   ‡CMU   ×MPI for Biological Cybernetics

ISIT 2010

Probability Metrics

◮ X : a measurable space.
◮ P : the set of all probability measures defined on X.
◮ γ : P × P → R+ is a notion of distance on P, called a probability metric.

Popular example: the φ-divergence

$$D_\phi(P,Q) := \begin{cases} \int_X \phi\!\left(\frac{dP}{dQ}\right) dQ, & P \ll Q, \\ +\infty, & \text{otherwise}, \end{cases}$$

where φ : [0, ∞) → (−∞, ∞] is a convex function. Appropriate choices of φ yield the Kullback-Leibler divergence, Jensen-Shannon divergence, total-variation distance, Hellinger distance, and χ²-distance.
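To make the definition concrete, the following minimal sketch evaluates Dφ on a finite sample space, assuming q > 0 everywhere so that P ≪ Q holds (the helper phi_divergence and the example distributions are ours, not from the slides):

```python
import numpy as np

def phi_divergence(phi, p, q):
    """D_phi(P, Q) = sum_x q(x) * phi(p(x) / q(x)) on a finite sample
    space; assumes q(x) > 0 for all x, i.e., P << Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * phi(p / q)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

# phi(t) = t log t  ->  Kullback-Leibler divergence
print(phi_divergence(lambda t: t * np.log(t), p, q))
# phi(t) = |t - 1|  ->  total-variation distance sum_x |p(x) - q(x)|,
# the normalization used for TV later in this deck
print(phi_divergence(lambda t: np.abs(t - 1.0), p, q))
```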



Applications

Two-sample problem:

◮ Given random samples {X1, . . . , Xm} and {Y1, . . . , Yn} drawn i.i.d. from P and Q, respectively.
◮ Determine: are P and Q different?
◮ γ(P, Q) : a distance metric between P and Q. The problem is equivalently stated as

$$H_0 : P = Q \;\equiv\; H_0 : \gamma(P,Q) = 0 \qquad\qquad H_1 : P \neq Q \;\equiv\; H_1 : \gamma(P,Q) > 0$$

◮ Test: say H0 if γ(P, Q) < ε; otherwise say H1 (see the sketch after this list).

Other applications:

◮ Hypothesis testing: independence tests, goodness-of-fit tests, etc.
◮ Limit theorems (central limit theorem), density estimation, etc.
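A minimal sketch of such a test. It uses the closed-form RKHS estimator of γ derived later in the deck; calibrating ε by permuting the pooled sample is our choice here (the slides leave the choice of ε open):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gamma_hat(X, Y, sigma=1.0):
    # Kernel estimator from the RKHS slides below:
    # sqrt( sum_{i,j} S_i S_j k(V_i, V_j) ), Gaussian kernel
    m, n = len(X), len(Y)
    V = np.vstack([X, Y])
    S = np.concatenate([np.full(m, 1.0 / m), np.full(n, -1.0 / n)])
    K = np.exp(-sigma * cdist(V, V, "sqeuclidean"))
    return float(np.sqrt(max(S @ K @ S, 0.0)))

def two_sample_test(X, Y, level=0.05, n_perm=200, seed=0):
    """Say H1 if gamma_hat(X, Y) exceeds a threshold eps estimated by
    relabeling the pooled sample, which simulates H0: P = Q."""
    rng = np.random.default_rng(seed)
    stat = gamma_hat(X, Y)
    V, m = np.vstack([X, Y]), len(X)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(V))
        null.append(gamma_hat(V[idx[:m]], V[idx[m:]]))
    eps = float(np.quantile(null, 1.0 - level))
    return stat, eps, stat > eps  # True -> say H1 (reject H0)

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(100, 1))  # samples from P
Y = rng.normal(0.7, 1.0, size=(100, 1))  # samples from Q
print(two_sample_test(X, Y))
```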


Estimation of Dφ(P, Q)

◮ Given random samples {X1, . . . , Xm} and {Y1, . . . , Yn} drawn i.i.d. from P and Q, estimate Dφ(P, Q).
◮ Well studied for φ(t) = t log t, t ∈ [0, ∞), i.e., the Kullback-Leibler divergence.
◮ Approaches:
  ◮ Histogram estimator based on a space-partitioning scheme [Wang et al., 2005].
  ◮ M-estimation based on the variational characterization [Nguyen et al., 2008],

$$D_\phi(P,Q) = \sup_{f : X \to \mathbb{R}} \left\{ \int_X f \, dP - \int_X \phi^*(f) \, dQ \right\},$$

where φ∗ is the convex conjugate of φ.
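As a sanity check of the variational characterization in the KL case φ(t) = t log t: the convex conjugate is φ∗(u) = e^(u−1), the supremum is attained at f∗ = 1 + log(dP/dQ) (a standard fact, not stated on the slide), and plugging f∗ in recovers Dφ exactly:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])  # P on a three-point space
q = np.array([0.4, 0.4, 0.2])  # Q on the same space

# Optimal witness for phi(t) = t log t, with phi*(u) = exp(u - 1)
f_star = 1.0 + np.log(p / q)
variational = np.sum(f_star * p) - np.sum(np.exp(f_star - 1.0) * q)
kl = np.sum(p * np.log(p / q))
print(variational, kl)  # both print KL(P || Q): the bound is tight at f*
```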



Properties of Estimators

◮ Computability
◮ Consistency
◮ Rate of convergence

Issues:

◮ Though the estimators of Dφ(P, Q) are consistent, their rate of convergence can be arbitrarily slow depending on P and Q.
◮ Let X ⊂ Rd. For large d, the estimator proposed by [Wang et al., 2005] is computationally inefficient.


Integral Probability Metrics

◮ The integral probability metric [Müller, 1997] between P and Q is defined as

$$\gamma_{\mathcal{F}}(P,Q) = \sup_{f \in \mathcal{F}} \left| \int_X f \, dP - \int_X f \, dQ \right|.$$

◮ Many popular probability metrics can be obtained by appropriately choosing F:
  ◮ Total-variation distance: F = {f : ‖f‖∞ := sup_{x∈X} |f(x)| ≤ 1}.
  ◮ Wasserstein distance: F = {f : ‖f‖_L := sup_{x≠y∈X} |f(x) − f(y)| / ρ(x, y) ≤ 1}.
  ◮ Dudley metric: F = {f : ‖f‖_L + ‖f‖∞ ≤ 1}.
  ◮ Lp metric: F = {f : ‖f‖_{Lp(X,µ)} := (∫_X |f|^p dµ)^{1/p} ≤ 1}, 1 ≤ p < ∞.
◮ Well studied in probability theory, mass transportation problems, etc.


Outline

◮ Relation between γF(P, Q) and Dφ(P, Q)
◮ Estimation of γF(P, Q)
◮ Consistency analysis and rate of convergence


γF(P, Q) vs. Dφ(P, Q)

$$D_{\phi,\mathcal{F}}(P,Q) := \sup_{f \in \mathcal{F}} \left\{ \int_X f \, dP - \int_X \phi^*(f) \, dQ \right\}$$

◮ Dφ,F(P, Q) = Dφ(P, Q) if F is the set of all real-valued measurable functions on X.
◮ Dφ,F(P, Q) = γF(P, Q) if

$$\phi(t) = \begin{cases} 0, & t = 1, \\ +\infty, & t \neq 1. \end{cases}$$

◮ Dφ(P, Q) = γF(P, Q) if and only if one of the following holds:

(i) F = {f : ‖f‖∞ ≤ (β − α)/2} and

$$\phi(t) = \begin{cases} \alpha(t - 1), & 0 \le t \le 1, \\ \beta(t - 1), & t \ge 1, \end{cases}$$

for some α < β < ∞;

(ii) F = {f : f = c, c ∈ R} and φ(t) = α(t − 1), t ≥ 0, for some α ∈ R.

◮ Total variation is the only φ-divergence that is also an integral probability metric.



Estimation of γF(P, Q)

◮ Given random samples {X1, . . . , Xm} and {Y1, . . . , Yn} drawn i.i.d. from P and Q, estimate γF(P, Q).
◮ Estimator:

$$\gamma_{\mathcal{F}}(P_m, Q_n) = \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^{m} f(X_i) - \frac{1}{n} \sum_{i=1}^{n} f(Y_i) \right|,$$

where $P_m := \frac{1}{m} \sum_{i=1}^{m} \delta_{X_i}$ and $Q_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{Y_i}$ are the empirical measures.

◮ Computability: possible for certain choices of F:
  ◮ F = {f : ‖f‖∞ ≤ 1}
  ◮ F = {f : ‖f‖_L ≤ 1}
  ◮ F = {f : ‖f‖_L + ‖f‖∞ ≤ 1}
  ◮ F = {f : ‖f‖_H ≤ 1}, where H is a reproducing kernel Hilbert space.
◮ Consistency and rate of convergence: determined by the “size” of F.


Estimation of γF(P, Q)

V := {X1, . . . , Xm, Y1, . . . , Yn}, S := {1/m, . . . , 1/m, −1/n, . . . , −1/n}, N := m + n.

Theorem

◮ F = {f : ‖f‖_L ≤ 1}: $\gamma_{\mathcal{F}}(P_m, Q_n) = \sum_{i=1}^{N} S_i a_i^*$, where

$$\{a_i^*\}_{i=1}^{N} = \arg\max \left\{ \sum_{i=1}^{N} S_i a_i : -\rho(V_i, V_j) \le a_i - a_j \le \rho(V_i, V_j), \ \forall\, i, j \right\}.$$

◮ F = {f : ‖f‖_L + ‖f‖∞ ≤ 1}: $\gamma_{\mathcal{F}}(P_m, Q_n) = \sum_{i=1}^{N} S_i b_i^*$, where

$$\{b_i^*\}_{i=1}^{N} = \arg\max_{b_1, \ldots, b_N, e, c} \sum_{i=1}^{N} S_i b_i \quad \text{s.t.} \quad -e\,\rho(V_i, V_j) \le b_i - b_j \le e\,\rho(V_i, V_j) \ \forall\, i, j, \quad -c \le b_i \le c \ \forall\, i, \quad e + c \le 1.$$

Both are linear programs; a sketch of the first appears below.
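A sketch of the first linear program with scipy. Pinning a_1 = 0 is our addition: the objective is invariant to adding a constant to all a_i (since Σ_i S_i = 0), so fixing one coordinate removes that degeneracy without changing the optimum:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
m, n = 30, 30
X = rng.normal(0.0, 1.0, size=(m, 1))  # samples from P
Y = rng.normal(1.0, 1.0, size=(n, 1))  # samples from Q

N = m + n
V = np.vstack([X, Y])
S = np.concatenate([np.full(m, 1.0 / m), np.full(n, -1.0 / n)])
rho = cdist(V, V)  # rho(V_i, V_j): Euclidean metric on X

# Pairwise constraints -rho_ij <= a_i - a_j <= rho_ij
rows, rhs = [], []
for i in range(N):
    for j in range(i + 1, N):
        r = np.zeros(N)
        r[i], r[j] = 1.0, -1.0
        rows.append(r)
        rhs.append(rho[i, j])   #  a_i - a_j <= rho_ij
        rows.append(-r)
        rhs.append(rho[i, j])   #  a_j - a_i <= rho_ij

# maximize S . a  <=>  minimize -S . a
res = linprog(c=-S, A_ub=np.array(rows), b_ub=np.array(rhs),
              A_eq=np.eye(N)[:1], b_eq=[0.0],       # pin a_1 = 0
              bounds=[(None, None)] * N)
print("empirical Wasserstein distance:", -res.fun)
```

The second program (the Dudley ball) has the same shape, with the extra scalar variables e and c and the coupling constraint e + c ≤ 1.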


Estimation of γF(P, Q)

F = {f : ‖f‖_H ≤ 1}, where H is a reproducing kernel Hilbert space (RKHS).

Definition

A Hilbert space H of real-valued functions on X is said to be an RKHS if the evaluation functionals (δx(f) = f(x), x ∈ X, f ∈ H) are bounded and continuous.

◮ There exists a unique kernel k : X × X → R such that ∀ x ∈ X, ∀ f ∈ H, ⟨f, k(·, x)⟩_H = f(x) (the reproducing property).
◮ k is called the reproducing kernel (r.k.) of H, and it satisfies k(x, y) = ⟨k(·, x), k(·, y)⟩_H, x, y ∈ X.
◮ Every r.k. is a positive definite function.
◮ Conversely, for every positive definite function k on X × X, there exists a unique RKHS H with k as its r.k.
◮ Example: k(x, y) = e^{−|x−y|}, x, y ∈ R, induces a Sobolev space.


Estimation of γF(P, Q)

V := {X1, . . . , Xm, Y1, . . . , Yn}, S := {1/m, . . . , 1/m, −1/n, . . . , −1/n}, N := m + n.

Theorem

Let F = {f : ‖f‖_H ≤ 1} with k bounded and measurable. Then

$$\gamma_{\mathcal{F}}(P_m, Q_n) = \sqrt{\sum_{i,j=1}^{N} S_i S_j \, k(V_i, V_j)}.$$
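A sketch of this closed-form estimator (the Gaussian kernel and its bandwidth are our choices for illustration); this is the gamma_hat used in the two-sample sketch earlier:

```python
import numpy as np
from scipy.spatial.distance import cdist

def gamma_rkhs(X, Y, sigma=1.0):
    """sqrt( sum_{i,j} S_i S_j k(V_i, V_j) ) with k(x,y) = exp(-sigma ||x-y||^2)."""
    m, n = len(X), len(Y)
    V = np.vstack([X, Y])
    S = np.concatenate([np.full(m, 1.0 / m), np.full(n, -1.0 / n)])
    K = np.exp(-sigma * cdist(V, V, "sqeuclidean"))
    # S @ K @ S >= 0 because k is positive definite; max() guards round-off
    return float(np.sqrt(max(S @ K @ S, 0.0)))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))
print(gamma_rkhs(X, Y))
```

Unlike the two linear programs above, this costs only one N × N kernel matrix, which is why the RKHS ball is particularly convenient in practice.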


Consistency and Rate of Convergence

Theorem

Suppose F is such that ν := sup_{f∈F, x∈X} |f(x)| < ∞. Fix δ ∈ (0, 1). Then, with probability at least 1 − δ over the choice of the samples {Xi} and {Yi}, the following holds:

$$|\gamma_{\mathcal{F}}(P_m, Q_n) - \gamma_{\mathcal{F}}(P, Q)| \le \sqrt{18\nu^2 \log\frac{4}{\delta}} \left( \frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}} \right) + 2R_m(\mathcal{F}; \{X_i\}) + 2R_n(\mathcal{F}; \{Y_i\}),$$

where

$$R_m(\mathcal{F}; \{x_i\}_{i=1}^{m}) := \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \left| \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(x_i) \right|$$

is called the Rademacher complexity of F, and the {σi} are independent Rademacher random variables, σi = 2Bi − 1 with the {Bi} i.i.d. Bernoulli(1/2).
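For the RKHS ball the supremum inside R_m has a closed form via the reproducing property: sup_{‖f‖_H ≤ 1} |(1/m) Σ_i σ_i f(x_i)| = (1/m) ‖Σ_i σ_i k(·, x_i)‖_H = (1/m) √(σᵀKσ). This step is standard but not on the slide; a Monte Carlo sketch over draws of σ:

```python
import numpy as np
from scipy.spatial.distance import cdist

def rademacher_rkhs(X, sigma=1.0, n_draws=500, seed=0):
    """Monte Carlo estimate of R_m(F; {x_i}) for F = {||f||_H <= 1}:
    for each sign vector s, the inner sup equals sqrt(s @ K @ s) / m."""
    rng = np.random.default_rng(seed)
    m = len(X)
    K = np.exp(-sigma * cdist(X, X, "sqeuclidean"))
    draws = rng.choice([-1.0, 1.0], size=(n_draws, m))  # Rademacher signs
    return float(np.mean([np.sqrt(s @ K @ s) / m for s in draws]))

rng = np.random.default_rng(1)
for m in (100, 400, 1600):
    X = rng.normal(size=(m, 1))
    print(m, rademacher_rkhs(X))  # shrinks roughly like m**(-1/2)
```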


Consistency and Rate of Convergence

Note that if R_m(F; {Xi}) = O_P(r_m) and R_n(F; {Yi}) = O_Q(r_n), then

$$|\gamma_{\mathcal{F}}(P_m, Q_n) - \gamma_{\mathcal{F}}(P, Q)| = O_{P,Q}(r_m \vee m^{-1/2} + r_n \vee n^{-1/2}),$$

where a ∨ b := max(a, b).

Theorem ([von Luxburg and Bousquet, 2004])

For every ε > 0, the following holds:

$$R_m(\mathcal{F}; \{x_i\}_{i=1}^{m}) \le 2\varepsilon + \frac{4\sqrt{2}}{\sqrt{m}} \int_{\varepsilon/4}^{\infty} \sqrt{\log N(\tau, \mathcal{F}, L^2(P_m))} \, d\tau,$$

where N(τ, F, L²(P_m)) denotes the covering number of F at radius τ in L²(P_m).

Consistency and Rate of Convergence

Corollary

◮ Let X be a bounded subset of (Rd, ‖·‖s) for some 1 ≤ s ≤ ∞. Then, for F = {f : ‖f‖_L ≤ 1} and F = {f : ‖f‖∞ + ‖f‖_L ≤ 1}, we have |γF(Pm, Qn) − γF(P, Q)| = O_{P,Q}(r_m + r_n), where

$$r_m = \begin{cases} m^{-1/2} \log m, & d = 1, \\ m^{-1/(d+1)}, & d \ge 2. \end{cases}$$

If, in addition, X is a bounded, convex subset of (Rd, ‖·‖s) with non-empty interior, then

$$r_m = \begin{cases} m^{-1/2}, & d = 1, \\ m^{-1/2} \log m, & d = 2, \\ m^{-1/d}, & d > 2. \end{cases}$$


Consistency and Rate of Convergence

Corollary

◮ Let X be a measurable space. Suppose k is measurable and sup_{x∈X} k(x, x) ≤ C < ∞. Then, for F = {f : ‖f‖_H ≤ 1}, we have |γF(Pm, Qn) − γF(P, Q)| = O_{P,Q}(m^{−1/2} + n^{−1/2}). Note that this rate is independent of the dimension d.

Examples:

◮ Gaussian kernel: k(x, y) = e^{−σ‖x−y‖₂²}, σ > 0, x, y ∈ Rd.
◮ Laplacian kernel: k(x, y) = e^{−σ‖x−y‖₁}, σ > 0, x, y ∈ Rd.
◮ Inverse multiquadric kernel: k(x, y) = (c² + ‖x−y‖₂²)^{−t}, c > 0, t > d/2, x, y ∈ Rd.


Estimation of Total Variation Distance

The total-variation distance is both a φ-divergence and an integral probability metric:

$$TV(P,Q) = \sup \left\{ \int_X f \, d(P - Q) : \|f\|_\infty \le 1 \right\}.$$

◮ Estimator: $TV(P_m, Q_n) = \sum_{i=1}^{N} S_i a_i^*$, where $\{a_i^*\}_{i=1}^{N}$ solve the linear program

$$\max \left\{ \sum_{i=1}^{N} S_i a_i : -1 \le a_i \le 1, \ \forall\, i \right\}.$$

It is easy to see that a_i^* = sign(S_i), and therefore TV(Pm, Qn) = Σ_i |S_i| = 2 for any m and n (whenever the pooled sample points are distinct). The estimator is not consistent; see the demonstration below.

◮ TV can, however, be estimated consistently using kernel density estimators.
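A short demonstration of the degeneracy (assuming all pooled sample points are distinct):

```python
import numpy as np

m, n = 50, 80
S = np.concatenate([np.full(m, 1.0 / m), np.full(n, -1.0 / n)])
a_star = np.sign(S)  # optimizer of max{ S.a : -1 <= a_i <= 1 }
print(S @ a_star)    # = sum_i |S_i| = 2.0, regardless of the data
```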


Lower Bounds on Total Variation Distance

◮ W(P, Q) = sup{∫_X f d(P − Q) : ‖f‖_L ≤ 1}
◮ β(P, Q) = sup{∫_X f d(P − Q) : ‖f‖_L + ‖f‖∞ ≤ 1}
◮ γk(P, Q) = sup{∫_X f d(P − Q) : ‖f‖_H ≤ 1}

Theorem

(i) For all P ≠ Q, we have

$$TV(P,Q) \ge \frac{W(P,Q)\,\beta(P,Q)}{W(P,Q) - \beta(P,Q)}.$$

(ii) Suppose C := sup_{x∈X} k(x, x) < ∞. Then

$$TV(P,Q) \ge \frac{\gamma_k(P,Q)}{\sqrt{C}}.$$

◮ Since W, β and γk are consistently estimable, these inequalities give computable lower bounds on TV; lower bounds on the Kullback-Leibler divergence then follow through Pinsker's inequality.
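A sketch of bound (ii): for the Gaussian kernel, C = sup_x k(x, x) = 1, so the consistent kernel estimate of γk directly lower-bounds TV (kernel and bandwidth are again our illustrative choices):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))  # samples from P
Y = rng.normal(1.0, 1.0, size=(500, 1))  # samples from Q

m, n = len(X), len(Y)
V = np.vstack([X, Y])
S = np.concatenate([np.full(m, 1.0 / m), np.full(n, -1.0 / n)])
K = np.exp(-cdist(V, V, "sqeuclidean"))  # Gaussian kernel: C = k(x, x) = 1
gamma_k = float(np.sqrt(max(S @ K @ S, 0.0)))
print("plug-in lower bound on TV(P, Q):", gamma_k / np.sqrt(1.0))
```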


Summary

◮ Integral probability metrics vs. φ-divergences.
◮ Estimation of integral probability metrics from finite samples: easily computable compared to φ-divergences.
◮ Fast rates of convergence compared to φ-divergences.
◮ Open question: minimax rates for estimating integral probability metrics.


Thank You


References

Müller, A. (1997). Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29:429–443.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2008). Estimating divergence functionals and the likelihood ratio by convex risk minimization. Technical Report 764, Department of Statistics, University of California, Berkeley.

von Luxburg, U. and Bousquet, O. (2004). Distance-based classification with Lipschitz functions. Journal of Machine Learning Research, 5:669–695.

Wang, Q., Kulkarni, S. R., and Verdú, S. (2005). Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 51(9):3064–3074.