SLIDE 1

Three Approaches towards Optimal Property Estimation and Testing

Jiantao Jiao (Stanford EE) Joint work with: Yanjun Han, Dmitri Pavlichin, Kartik Venkat, Tsachy Weissman

Frontiers in Distribution Testing Workshop, FOCS 2017

  • Oct. 14th, 2017

SLIDE 2

Statistical properties

Disclaimer: Throughout this talk, n refers to the number of samples and S refers to the alphabet size of the distribution.

1. Shannon entropy: $H(P) = \sum_{i=1}^S -p_i \ln p_i$.

2. Power sum functional: $F_\alpha(P) = \sum_{i=1}^S p_i^\alpha$, $\alpha > 0$.

3. KL divergence, $\chi^2$ divergence, $L_1$ distance, Hellinger distance: $F(P, Q) = \sum_{i=1}^S f(p_i, q_i)$ for $f(x, y) = x \ln(x/y)$, $(x - y)^2 / y$, $|x - y|$, $(\sqrt{x} - \sqrt{y})^2$, respectively.
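For concreteness, here is a minimal numpy sketch of the plug-in (empirical) versions of these functionals, i.e. the MLE estimators that the rest of the talk compares against; the particular distributions, alphabet size, and sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: alphabet size S, sample size n, two arbitrary distributions P and Q.
S, n = 100, 10_000
P = rng.dirichlet(np.ones(S))
Q = rng.dirichlet(np.ones(S))
p_hat = rng.multinomial(n, P) / n          # empirical distribution from n i.i.d. samples of P

def entropy(p):
    """Shannon entropy H(P) = sum_i -p_i ln p_i (with 0 ln 0 := 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def power_sum(p, alpha):
    """F_alpha(P) = sum_i p_i^alpha."""
    return np.sum(p ** alpha)

def f_functional(p, q, f):
    """F(P, Q) = sum_i f(p_i, q_i)."""
    return np.sum(f(p, q))

kl        = lambda x, y: np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0) / y), 0.0)
chi2      = lambda x, y: (x - y) ** 2 / y
l1        = lambda x, y: np.abs(x - y)
hellinger = lambda x, y: (np.sqrt(x) - np.sqrt(y)) ** 2

print("plug-in entropy:  ", entropy(p_hat), "   true:", entropy(P))
print("plug-in F_0.5:    ", power_sum(p_hat, 0.5), "   true:", power_sum(P, 0.5))
print("plug-in L1(P, Q): ", f_functional(p_hat, Q, l1), "   true:", f_functional(P, Q, l1))
print("plug-in KL(P||Q): ", f_functional(p_hat, Q, kl), "   true:", f_functional(P, Q, kl))
```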

SLIDE 4

Tolerant testing/learning/estimation

We focus on the question: how many samples are needed to achieve accuracy ε when estimating these properties from empirical data?

Example: $L_1(P, U_S)$ with $U_S = (1/S, 1/S, \ldots, 1/S)$; observe n i.i.d. samples from P.

  • (VV'11, VV'11): there exists an approach whose error is $\asymp \sqrt{\frac{S}{n \ln n}}$ when $\frac{S}{\ln S} \lesssim n \lesssim S$; no consistent estimator exists when $n \lesssim \frac{S}{\ln S}$.

  • The MLE plug-in $L_1(\hat P_n, U_S)$ achieves error $\asymp \sqrt{\frac{S}{n}}$ when $n \gtrsim S$.

Effective sample size enlargement

Minimax rate-optimal with n samples ⟺ MLE with n ln n samples

Similar results also hold for Shannon entropy (VV'11, VV'11, VV'13, WY'16, JVHW'15), the power sum functional (JVHW'15), Rényi entropy estimation (AOST'14), χ², Hellinger, and KL divergence estimation (HJW'16, BZLV'16), L_r norm estimation under the Gaussian white noise model (HJMW'17), L_1 distance estimation (JHW'16), etc., except for support size (WY'16).

SLIDE 5

Effective sample size enlargement

$$R_{\text{minimax}}(F, \mathcal{P}, n) = \inf_{\hat F(X_1, \ldots, X_n)} \sup_{P \in \mathcal{P}} \mathbb{E}\,|\hat F - F(P)|, \qquad R_{\text{plug-in}}(F, \mathcal{P}, n) = \sup_{P \in \mathcal{P}} \mathbb{E}\,|F(\hat P_n) - F(P)|.$$

| $F(P)$ | $\mathcal{P}$ | $R_{\text{minimax}}(F, \mathcal{P}, n)$ | $R_{\text{plug-in}}(F, \mathcal{P}, n)$ |
|---|---|---|---|
| $\sum_{i=1}^S p_i \log\frac{1}{p_i}$ | $\mathcal{M}_S$ | $\frac{S}{n \log n} + \frac{\log S}{\sqrt{n}}$ | $\frac{S}{n} + \frac{\log S}{\sqrt{n}}$ |
| $F_\alpha(P) = \sum_{i=1}^S p_i^\alpha$, $0 < \alpha \le \frac12$ | $\mathcal{M}_S$ | $\frac{S}{(n \log n)^\alpha}$ | $\frac{S}{n^\alpha}$ |
| $F_\alpha(P)$, $\frac12 < \alpha < 1$ | $\mathcal{M}_S$ | $\frac{S}{(n \log n)^\alpha} + \frac{S^{1-\alpha}}{\sqrt{n}}$ | $\frac{S}{n^\alpha} + \frac{S^{1-\alpha}}{\sqrt{n}}$ |
| $F_\alpha(P)$, $1 < \alpha < \frac32$ | $\mathcal{M}_S$ | $(n \log n)^{-(\alpha-1)}$ | $n^{-(\alpha-1)}$ |
| $F_\alpha(P)$, $\alpha \ge \frac32$ | $\mathcal{M}_S$ | $\frac{1}{\sqrt{n}}$ | $\frac{1}{\sqrt{n}}$ |
| $\sum_{i=1}^S \mathbb{1}(p_i > 0)$ | $\{P : \min_{i: p_i > 0} p_i \ge \frac{1}{S}\}$ | $S e^{-\Theta\left(\max\left\{\sqrt{\frac{n \log n}{S}},\, \frac{n}{S}\right\}\right)}$ | $S e^{-\Theta\left(\frac{n}{S}\right)}$ |
| $\sum_{i=1}^S \lvert p_i - q_i \rvert$ | $\mathcal{M}_S$ | $\sum_{i=1}^S \left(q_i \wedge \sqrt{\frac{q_i}{n \ln n}}\right)$ | $\sum_{i=1}^S \left(q_i \wedge \sqrt{\frac{q_i}{n}}\right)$ |

SLIDE 6

Effective sample size enlargement

Divergence functionals: here $P, Q \in \mathcal{M}_S$, and we observe m samples from P and n samples from Q. For the Kullback-Leibler and $\chi^2$ divergence estimators we only consider $(P, Q) \in \{(P, Q) \mid P, Q \in \mathcal{M}_S,\ \frac{p_i}{q_i} \le u(S)\}$, where $u(S)$ is some function of $S$.

| $F(P, Q)$ | $R_{\text{minimax}}(F, \mathcal{P}, m, n)$ | $R_{\text{plug-in}}(F, \mathcal{P}, m, n)$ |
|---|---|---|
| $\sum_{i=1}^S \lvert p_i - q_i \rvert$ | $\sqrt{\frac{S}{\min\{m,n\} \log(\min\{m,n\})}}$ | $\sqrt{\frac{S}{\min\{m,n\}}}$ |
| $\frac12 \sum_{i=1}^S (\sqrt{p_i} - \sqrt{q_i})^2$ | $\frac{S}{\min\{m,n\} \log(\min\{m,n\})}$ | $\frac{S}{\min\{m,n\}}$ |
| $D(P \| Q) = \sum_{i=1}^S p_i \log\frac{p_i}{q_i}$ | $\frac{S}{m \log m} + \frac{S u(S)}{n \log n} + \frac{\log u(S)}{\sqrt{m}} + \sqrt{\frac{u(S)}{n}}$ | $\frac{S}{m} + \frac{S u(S)}{n} + \frac{\log u(S)}{\sqrt{m}} + \sqrt{\frac{u(S)}{n}}$ |
| $\chi^2(P \| Q) = \sum_{i=1}^S \frac{p_i^2}{q_i} - 1$ | $\frac{S u(S)^2}{n \log n} + \frac{u(S)}{\sqrt{m}} + \frac{u(S)^{3/2}}{\sqrt{n}}$ | $\frac{S u(S)^2}{n} + \frac{u(S)}{\sqrt{m}} + \frac{u(S)^{3/2}}{\sqrt{n}}$ |

SLIDE 7

Goal of this talk

Understand the mechanism behind the logarithmic sample size enlargement. For which functionals does this phenomenon occur? Which concrete algorithms achieve it? If multiple approaches exist, what are their relative advantages and disadvantages?

SLIDE 9

First approach: Approximation methodology

Question

Is the enlargement phenomenon caused by the fact that the functionals are permutation invariant (symmetric)?

Answer

  • Nope. :)

Literature on approximation methodology

VV'11 (linear estimator), WY'16, WY'16, JVHW'15, AOST'14, HJW'16, BZLV'16, HJMW'16, JHW'16

SLIDE 10

Example: L1 distance estimation

Given $Q = (q_1, q_2, \ldots, q_S)$, we estimate $L_1(P, Q)$ from n i.i.d. samples drawn from P.

Theorem (J., Han, Weissman’16)

Suppose $\ln n \asymp \ln S$, $S \ge 2$, and Q satisfies a mild regularity condition. Then,

$$\inf_{\hat L} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P|\hat L - L_1(P, Q)| \asymp \sum_{i=1}^S \left(q_i \wedge \sqrt{\frac{q_i}{n \ln n}}\right). \qquad (1)$$

For the MLE, we have

$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P|L_1(\hat P_n, Q) - L_1(P, Q)| \asymp \sum_{i=1}^S \left(q_i \wedge \sqrt{\frac{q_i}{n}}\right). \qquad (2)$$

SLIDE 17

Confidence sets in the binomial model: coverage probability $\asymp 1 - n^{-A}$

Let $\Theta = [0, 1]$ and $n \hat p \sim B(n, p)$. The confidence set $U(\hat p)$ satisfies:

  • if $\hat p < \frac{\ln n}{n}$, then $U(\hat p) \sim \frac{\ln n}{n}$;

  • if $\hat p > \frac{\ln n}{n}$, then $U(\hat p) \sim \sqrt{\frac{\hat p \ln n}{n}}$.

Theorem (J., Han, Weissman'16)

Partition $[0, 1]$ into finitely many intervals $I_i = [x_i, x_{i+1}]$ with $x_0 = 0$, $x_1 \asymp \frac{\ln n}{n}$, and $\sqrt{x_{i+1}} - \sqrt{x_i} \asymp \sqrt{\frac{\ln n}{n}}$. Then,

1. if $p \in I_i$, then $\hat p \in 2I_i$ with probability $1 - n^{-A}$;
2. if $\hat p \in I_i$, then $p \in 2I_i$ with probability $1 - n^{-A}$;
3. these intervals are of the shortest possible length.
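A minimal numpy sketch of this interval construction together with a Monte Carlo check of claim 1; the constant c, the sample size, and the convention that $2I_i$ is the two-fold dilation of $I_i$ about its midpoint are illustrative assumptions, not specifications from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 10_000, 1.0          # sample size and the absolute constant hidden in the "≍" (both assumed)

# Build the partition of [0, 1]: x_0 = 0, x_1 ≍ ln(n)/n, sqrt(x_{i+1}) - sqrt(x_i) ≍ sqrt(ln(n)/n).
step = np.sqrt(c * np.log(n) / n)
grid = [0.0, c * np.log(n) / n]
while grid[-1] < 1.0:
    grid.append(min((np.sqrt(grid[-1]) + step) ** 2, 1.0))
grid = np.array(grid)
print("number of intervals:", len(grid) - 1)          # roughly sqrt(n / ln n) of them

def interval_index(x):
    """Index i with x in I_i = [x_i, x_{i+1})."""
    return int(np.searchsorted(grid, x, side="right")) - 1

def in_2I(x, i):
    """Membership in 2I_i, taken here to be I_i dilated by a factor of 2 about its midpoint."""
    lo, hi = grid[i], grid[i + 1]
    mid, half = (lo + hi) / 2, (hi - lo) / 2
    return mid - 2 * half <= x <= mid + 2 * half

# Empirical check of claim 1: if p ∈ I_i, then p_hat ∈ 2I_i with probability close to one.
for p in [0.3 * np.log(n) / n, 5 * np.log(n) / n, 0.01, 0.3]:
    i = interval_index(p)
    p_hat = rng.binomial(n, p, size=20_000) / n
    coverage = np.mean([in_2I(x, i) for x in p_hat])
    print(f"p = {p:.2e}: empirical P(p_hat in 2I_i) = {coverage:.4f}")
```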

SLIDE 18

Algorithmic description of Approximation methodology

First conduct sample splitting to obtain $\hat p_i, \hat p_i'$ i.i.d. with distribution $\frac{2}{n} \cdot B(n/2, p_i)$.

Suppose $q_i \in I_j$. For each i do the following (a code sketch is given after this list):

1. if $\hat p_i \in I_j$, compute the best polynomial approximation on $2I_j$,
$$P_K(x; q_i) = \arg\min_{P \in \mathrm{Poly}_K} \max_{z \in 2I_j} \big|\, |z - q_i| - P(z) \,\big|, \qquad (3)$$
and then estimate $|p_i - q_i|$ by the unbiased estimator of $P_K(p_i; q_i)$ computed from $\hat p_i'$;
2. if $\hat p_i \notin I_j$, estimate $|p_i - q_i|$ by $|\hat p_i' - q_i|$;
3. sum everything up.
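A minimal single-symbol Python sketch of steps 1-2; the interval $I_j$, the degree K, and the use of a least-squares fit in place of the exact best (minimax) approximation in (3) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def falling(x, k):
    """Falling factorial x(x-1)...(x-k+1), with falling(x, 0) = 1."""
    out = 1.0
    for j in range(k):
        out = out * (x - j)
    return out

def unbiased_poly(coeffs, x, m):
    """Unbiased estimate of sum_k coeffs[k] * p^k from a count x ~ Binomial(m, p),
    using E[falling(x, k)] = falling(m, k) * p^k."""
    return sum(a * falling(x, k) / falling(m, k) for k, a in enumerate(coeffs))

def best_poly_coeffs(lo, hi, q, K):
    """Degree-K least-squares fit of |z - q| on [lo, hi], returned as monomial coefficients in z.
    (A stand-in for the exact best minimax approximation P_K(x; q) in (3).)"""
    z = np.linspace(lo, hi, 400)
    b = np.polyfit(z / hi, np.abs(z - q), K)[::-1]      # fit in the rescaled variable z/hi
    return [b_k / hi ** k for k, b_k in enumerate(b)]   # convert back to coefficients in z

def estimate_abs_dev(x1, x2, m, q, interval, K):
    """Approximation-methodology estimate of |p - q| for one symbol, given two independent
    counts x1, x2 ~ Binomial(m, p) from the sample split; x1 picks the regime, x2 estimates."""
    lo, hi = interval
    if lo <= x1 / m <= hi:                              # non-smooth regime: polynomial bias correction
        mid, half = (lo + hi) / 2, (hi - lo) / 2
        coeffs = best_poly_coeffs(max(mid - 2 * half, 0.0), min(mid + 2 * half, 1.0), q, K)
        return unbiased_poly(coeffs, x2, m)             # unbiased for P_K(p; q)
    return abs(x2 / m - q)                              # otherwise: plain plug-in

# Toy usage on one symbol with p, q of order ln(n)/n (all numbers below are arbitrary choices).
n, K = 20_000, 8
m = n // 2
p, q = 3.0 * np.log(n) / n, 2.5 * np.log(n) / n
Ij = (1.5 * np.log(n) / n, 4.0 * np.log(n) / n)         # an interval I_j containing q
estimates = []
for _ in range(2_000):
    x1, x2 = rng.binomial(m, p, size=2)
    estimates.append(estimate_abs_dev(x1, x2, m, q, Ij, K))
print("mean estimate:", np.mean(estimates), "  truth |p - q|:", abs(p - q))
```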

SLIDE 19

Why does it work?

1. Suppose $\hat p_i \in I_j$. No matter what estimator we use, one can always assume that $p_i \in 2I_j$;

2. the bias of the MLE is approximately (Strukov and Timan'77)
$$\sup_{p_i \in 2I_j} \Big|\, |p_i - q_i| - \mathbb{E}|\hat p_i - q_i| \,\Big| \asymp q_i \wedge \sqrt{\frac{q_i}{n}}; \qquad (4)$$

3. the bias of the Approximation methodology is approximately (Ditzian and Totik'87)
$$\sup_{p_i \in 2I_j} \Big|\, |p_i - q_i| - P_K(p_i; q_i) \,\Big| \asymp q_i \wedge \sqrt{\frac{q_i}{n \ln n}}; \qquad (5)$$

4. permutation invariance does not play a role, since we are doing symbol-by-symbol bias correction;

5. the bias dominates in high dimensions (a measure concentration phenomenon).
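The MLE bias in (4) can be computed exactly for a single symbol by summing over the binomial distribution; below is a small scipy sketch, where the values of n and q and the interval standing in for $2I_j$ are arbitrary choices.

```python
import numpy as np
from scipy.stats import binom

# Exact bias of the plug-in |p_hat - q|, p_hat = X/n with X ~ Binomial(n, p), over a range of p.
# Numerically illustrates (4): the worst-case MLE bias is of order sqrt(q/n) in this regime.
n, q = 2_000, 0.02
lo, hi = 0.01, 0.04
x = np.arange(n + 1)

def plug_in_bias(p):
    pmf = binom.pmf(x, n, p)
    return np.dot(pmf, np.abs(x / n - q)) - abs(p - q)

worst = max(abs(plug_in_bias(p)) for p in np.linspace(lo, hi, 200))
print(f"worst-case plug-in bias over [{lo}, {hi}]: {worst:.2e}")
print(f"sqrt(q / n) = {np.sqrt(q / n):.2e}")
```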

SLIDE 20

Properties of the Approximation Methodology

1. Applies to essentially any functional
2. Applies to a wide range of statistical models (binomial, Poisson, Gaussian, etc.)
3. Near-linear complexity
4. Explicit polynomial approximation needed for each different functional
5. Parameters need to be tuned in practice

SLIDE 23

Second approach: Local moment matching methodology

Motivation

Does there exist a single plug-in estimator that can replace the Approximation methodology?

Answer

  • No. For any plug-in rule $\hat P$, there exists a fixed Q such that $L_1(\hat P, Q)$ requires $n \gg S$ samples to consistently estimate $L_1(P, Q)$, while the optimal method requires only $n \gg \frac{S}{\ln S}$.

Weakened goal

What if we only consider permutation-invariant functionals?

Literature on the local moment matching methodology

VV’11 (linear programming), HJW’17

SLIDE 24

Local moment matching methodology

Theorem (Han, J., Weissman’17)

There exists a single, efficiently computable estimator $\hat P$ that achieves the optimal phase transitions for ALL the permutation-invariant functionals mentioned above. In particular, it solves the minimax problem

$$\inf_{\hat P} \sup_{P \in \mathcal{M}_S} \mathbb{E}\,\|\hat P^< - P^<\|_1 \asymp \sqrt{\frac{S}{n \ln n}} + \Big(\tilde O(n^{-1/3}) \wedge \sqrt{\frac{S}{n}}\Big), \qquad (6)$$

where $P^< = (p_{(1)}, p_{(2)}, \ldots, p_{(S)})$, $p_{(i)} \le p_{(i+1)}$, denotes the sorted distribution.

SLIDE 26

A simple example

Assume that $p_i \le \frac{\ln n}{n}$ and $\hat p_i \le \frac{\ln n}{n}$ for all i. Consider the Shannon entropy functional $H(P) = \sum_{i=1}^S f(p_i)$, $f(x) = x \ln(1/x)$.

Theorem (VV'11, Wu and Yang'16, J. et al.'15)

The optimal error in estimating H is $\asymp \frac{S}{n \ln n}$, while the MLE error is $\asymp \frac{S}{n}$.

Suppose we use the plug-in rule $\sum_{i=1}^S f(q_i)$ to estimate $H(P)$, where $q_i \le \frac{\ln n}{n}$. Then, for any $P_K(x) \in \mathrm{Poly}_K$ with $K = \ln n$,

$$H - \sum_i f(q_i) = \sum_i \big(f(p_i) - P_K(p_i)\big) + \sum_i \big(P_K(p_i) - P_K(q_i)\big) + \sum_i \big(P_K(q_i) - f(q_i)\big)$$
$$\le 2S \cdot \inf_{P_K} \max_{x \in [0, \frac{\ln n}{n}]} |f(x) - P_K(x)| + \sum_i \big(P_K(p_i) - P_K(q_i)\big)$$
$$\lesssim \frac{S}{n \ln n} + \sum_i \big(P_K(p_i) - P_K(q_i)\big).$$
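A quick Monte Carlo sketch of the S/n plug-in error in this regime; the uniform choice of P (so that $p_i = 1/S \le \frac{\ln n}{n}$), the values of S and n, and the comparison against the classical first-order bias formula are all illustrative choices, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo sketch: the plug-in (MLE) entropy H(P_hat_n) has a bias of order S/n.
# P is uniform with 1/S <= ln(n)/n, so every p_i lies in the regime assumed above.
S, n, reps = 2_000, 4_000, 200
P = np.full(S, 1.0 / S)
H_true = np.log(S)

def plugin_entropy(counts, n):
    p_hat = counts[counts > 0] / n
    return -np.sum(p_hat * np.log(p_hat))

biases = [plugin_entropy(rng.multinomial(n, P), n) - H_true for _ in range(reps)]
print(f"empirical bias of H(P_hat_n): {np.mean(biases):+.4f}")
print(f"classical first-order bias -(S - 1) / (2n): {-(S - 1) / (2 * n):+.4f}")
```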

SLIDE 27

Local moment matching

We showed that for any plug-in rule Q,

$$H - \sum_i f(q_i) \lesssim \frac{S}{n \ln n} + \sum_i \big(P_K(p_i) - P_K(q_i)\big). \qquad (7)$$

Why is the MLE bad?

The MLE is bad because

$$\mathbb{E}\Big[\sum_i \big(P_K(p_i) - P_K(q_i)\big)\Big] \asymp \frac{S}{n}. \qquad (8)$$

Solution

It suffices to reduce the bias of $P_K(q_i)$ in estimating $P_K(p_i)$.

SLIDE 28

Local moment matching

Ideal situation

Suppose that for each $0 \le k \le \ln n$,

$$\sum_j p_j^k = \sum_j q_j^k; \qquad (9)$$

then we immediately have

$$\mathbb{E}\Big[\sum_i \big(P_K(p_i) - P_K(q_i)\big)\Big] = 0. \qquad (10)$$

SLIDE 29

Algorithmic description of local moment matching

For each interval $I_j$, collect $A = \{i : \hat p_i \in I_j\}$. Then, for each $0 \le k \le \ln n$, we solve for Q such that

$$\Big|\sum_{i \in A} q_i^k - \big(\text{unbiased estimate of } \textstyle\sum_{i \in A} p_i^k\big)\Big| \le n^{\epsilon} \cdot \sigma_{k,A}, \qquad (11)$$

where

$$\sigma_{k,A} = \text{standard deviation of the unbiased estimate of } \sum_{i \in A} p_i^k. \qquad (12)$$

A code sketch of this feasibility step is given below.

Existence of a solution

The solution exists with overwhelming probability, since the true distribution P satisfies these inequalities with overwhelming probability.
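Since each moment $\sum_g w_g g^k$ is linear in nonnegative masses $w_g$ placed on a grid inside $I_j$, the constraints (11) form a small linear feasibility problem. Below is a minimal scipy sketch under Poisson sampling; the grid size, the slack exponent ε, and the crude plug-in proxy for $\sigma_{k,A}$ are illustrative assumptions, not the construction of HJW'17.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)

# Feasibility sketch of local moment matching on one interval I_j, under Poisson sampling
# (X_i ~ Poisson(n * p_i)).
n, eps = 5_000, 0.1
K = int(np.log(n))                        # match moments of order 0, 1, ..., K with K ≍ ln n
lo, hi = 0.0, np.log(n) / n               # I_j: the "small probability" interval
p = rng.uniform(lo, hi, size=300)         # true (unknown) probabilities of the symbols in A
X = rng.poisson(n * p)                    # their observed counts

def falling(x, k):
    out = np.ones_like(np.asarray(x, dtype=float))
    for j in range(k):
        out = out * (x - j)
    return out

# Unbiased estimates of sum_{i in A} p_i^k:  E[falling(X_i, k)] = (n p_i)^k under Poisson.
est = np.array([np.sum(falling(X, k)) / n ** k for k in range(K + 1)])
# Crude plug-in proxy for sigma_{k,A} (controlled analytically in the actual methodology).
sig = np.array([max(np.sqrt(np.sum(falling(X, k) ** 2)) / n ** k, 1e-12) for k in range(K + 1)])
tol = n ** eps * sig

# Represent the candidate Q by nonnegative masses w_g on a grid inside I_j; every moment
# sum_g w_g g^k is linear in w, so (11) becomes a small linear feasibility problem.
grid = np.linspace(lo + 1e-9, hi, 50)
scale = hi ** np.arange(K + 1)            # rescale the k-th constraint for numerical stability
A_ub = np.vstack([np.vstack([(grid / hi) ** k, -((grid / hi) ** k)]) for k in range(K + 1)])
b_ub = np.concatenate([[(est[k] + tol[k]) / scale[k], -(est[k] - tol[k]) / scale[k]]
                       for k in range(K + 1)])
res = linprog(c=np.zeros(len(grid)), A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
print("feasible:", res.success)
if res.success:
    for k in range(1, 4):
        print(f"k={k}: matched moment {np.dot(res.x, grid ** k):.3e}  vs  estimate {est[k]:.3e}")
```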

SLIDE 30

Properties of the Local moment matching Methodology

1. Applies only to permutation-invariant functionals
2. Applies to a wide range of statistical models (binomial, Poisson, Gaussian, etc.)
3. Polynomial complexity
4. Implicit polynomial approximation; it only needs to be computed once
5. Parameters need to be tuned in practice

SLIDE 31

Third approach: the profile maximum likelihood methodology (PML)

| Properties | Approximation | Local MM | PML |
|---|---|---|---|
| Permutation invariant | No | Yes | Yes |
| Statistical model | Broad | Broad | (Conjectured) broad |
| Complexity | Near-linear | Polynomial | Unclear |
| Functional dependent | Yes | No | No |
| Parameter tuning | Yes | Yes | No |

Thank you!

SLIDE 32

Literature

Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. "A unified maximum likelihood approach for optimal distribution property estimation." Proceedings of ICML, 2017.

Jiantao Jiao, Yanjun Han, and Tsachy Weissman. "Minimax Estimation of the L1 Distance." arXiv e-prints, May 2017.

Gregory Valiant and Paul Valiant. "A CLT and tight lower bounds for estimating entropy." Electronic Colloquium on Computational Complexity (ECCC), 2010.

Gregory Valiant and Paul Valiant. "Estimating the unseen: a sublinear-sample canonical estimator of distributions." Electronic Colloquium on Computational Complexity (ECCC), 2010.

Gregory Valiant and Paul Valiant. "Estimating the unseen: an n/log n sample estimator for entropy and support size, shown optimal via new CLTs." Proceedings of STOC, 2011.

Gregory Valiant and Paul Valiant. "The power of linear estimators." Proceedings of FOCS, 2011.

SLIDE 33

Literature

Yihong Wu and Pengkun Yang. "Minimax rates of entropy estimation on large alphabets via best polynomial approximation." IEEE Transactions on Information Theory 62.6 (2016): 3702-3720.

Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. "Minimax estimation of functionals of discrete distributions." IEEE Transactions on Information Theory 61.5 (2015): 2835-2885.

Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. "The complexity of estimating Rényi entropy." Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 2014.

Yanjun Han, Jiantao Jiao, and Tsachy Weissman. "Minimax Rate-Optimal Estimation of Divergences between Discrete Distributions." arXiv preprint arXiv:1605.09124 (2016).

Yuheng Bu, Shaofeng Zou, Yingbin Liang, and Venugopal V. Veeravalli. "Estimation of KL Divergence: Optimal Minimax Rate." arXiv preprint arXiv:1607.02653 (2016).

SLIDE 34

Literature

Yanjun Han, Jiantao Jiao, Rajarshi Mukherjee, and Tsachy Weissman. "On Estimation of Lr-Norms in Gaussian White Noise Models." arXiv preprint arXiv:1710.03863 (2017).

Yihong Wu and Pengkun Yang. "Chebyshev polynomials, moment matching, and optimal estimation of the unseen." arXiv preprint arXiv:1504.01227 (2015).

Yanjun Han, Jiantao Jiao, and Tsachy Weissman. "Local moment matching: a unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance." In preparation.
