The Nearest Neighbor Information Estimator is Adaptively Near Minimax Rate-Optimal

Jiantao Jiao (Berkeley EECS) · Weihao Gao (UIUC ECE) · Yanjun Han (Stanford EE)
NIPS 2018, Montréal, Canada

Differential Entropy Estimation

Differential entropy of a continuous density f on R^d:

    h(f) = \int_{\mathbb{R}^d} f(x) \log \frac{1}{f(x)} \, dx

◮ machine learning tasks, e.g., classification, clustering, feature selection
◮ causal inference
◮ sociology
◮ computational biology
◮ · · ·

Our Task

Given empirical samples X_1, · · · , X_n ∼ f, estimate h(f).

2 / 6
Ideas of Nearest Neighbor

Notation:

◮ n: number of samples
◮ d: dimensionality
◮ k: number of nearest neighbors
◮ R_{i,k}: ℓ2 distance of the i-th sample to its k-th nearest neighbor
◮ vol_d(r): volume of the d-dimensional ball with radius r

Idea

    h(f) = \mathbb{E}[-\log f(X)] \approx -\frac{1}{n} \sum_{i=1}^{n} \log f(X_i),
    \qquad f(X_i) \cdot \mathrm{vol}_d(R_{i,k}) \approx \frac{k}{n}

3 / 6
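
Concretely, the d-dimensional ball volume has the standard closed form (a known fact, useful for implementation):

    \mathrm{vol}_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} \, r^d

so the plug-in estimate f(X_i) ≈ k / (n · vol_d(R_{i,k})) is computable from nearest-neighbor distances alone.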

Kozachenko–Leonenko Estimator

Definition (Kozachenko–Leonenko Estimator)

    \hat{h}^{\mathrm{KL}}_{n,k} = \frac{1}{n} \sum_{i=1}^{n} \log\!\left( \frac{n}{k} \, \mathrm{vol}_d(R_{i,k}) \right) + \underbrace{\log(k) - \psi(k)}_{\text{bias correction term}}

where ψ denotes the digamma function.

◮ Easy to implement: no numerical integration
◮ Only tuning parameter: k
◮ Good empirical performance, but previously without theoretical guarantee, especially when the density may be close to zero

4 / 6
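
A minimal Python sketch of this definition (an illustration, not the authors' code; it assumes NumPy/SciPy and the closed-form ball volume given earlier):

    # Minimal sketch of the Kozachenko-Leonenko estimator (illustrative only).
    # Assumes no duplicate samples, so every nearest-neighbor distance is > 0.
    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.special import digamma, gammaln

    def kl_entropy(X, k=1):
        """Estimate h(f) in nats from samples X of shape (n, d)."""
        X = np.asarray(X, dtype=float)
        n, d = X.shape
        # R[i] = l2 distance from X[i] to its k-th nearest neighbor (self excluded)
        dists, _ = cKDTree(X).query(X, k=k + 1)  # column 0 is the point itself
        R = dists[:, k]
        # log vol_d(R) = (d/2) log(pi) - log Gamma(d/2 + 1) + d log R
        log_vol = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1) + d * np.log(R)
        # (1/n) sum_i log((n/k) vol_d(R_{i,k})) + log(k) - psi(k)
        return np.mean(np.log(n / k) + log_vol) + np.log(k) - digamma(k)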

Main Result

Let H^s_d be the class of probability densities supported on [0, 1]^d that are Hölder smooth with parameter s ≥ 0.

Theorem (Main Result)

For fixed k and s ∈ (0, 2],

    \sup_{f \in \mathcal{H}^s_d} \left( \mathbb{E}_f \left( \hat{h}^{\mathrm{KL}}_{n,k} - h(f) \right)^2 \right)^{1/2} \lesssim n^{-\frac{s}{s+d}} \log n + n^{-\frac{1}{2}}.

This is the first theoretical guarantee for the Kozachenko–Leonenko estimator that does not assume the density is bounded away from zero.

5 / 6
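
As a quick empirical sanity check (a sketch, assuming the kl_entropy function above; for Uniform[0,1]^2 the true value is h(f) = 0):

    # Estimates should drift toward h(f) = 0 as n grows.
    import numpy as np

    rng = np.random.default_rng(0)
    for n in (1_000, 10_000, 100_000):
        X = rng.random((n, 2))          # n samples from Uniform[0,1]^2
        print(n, kl_entropy(X, k=1))    # kl_entropy defined in the sketch above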

Matching Lower Bound

Theorem (Han–Jiao–Weissman–Wu ’17)

For any s ≥ 0,

    \inf_{\hat{h}} \sup_{f \in \mathcal{H}^s_d} \left( \mathbb{E}_f \left( \hat{h} - h(f) \right)^2 \right)^{1/2} \gtrsim n^{-\frac{s}{s+d}} (\log n)^{-\frac{s+2d}{s+d}} + n^{-\frac{1}{2}}.

Take-home Message

◮ The nearest neighbor estimator is nearly minimax rate-optimal.
◮ The nearest neighbor estimator adapts to the unknown smoothness s.
◮ A maximal inequality plays a central role in dealing with small densities.

6 / 6
