SLIDE 1

An asymptotic analysis of nonparametric divide-and-conquer methods

Botond Szabó and Harry van Zanten. Van Dantzig seminar, Delft, 06.04.2017.

SLIDE 2

Table of contents

1 Motivation
2 Distributed methods: examples and counterexamples
  • Kernel density estimation
  • Gaussian white noise model
  • Data-driven distributed methods
3 Distributed methods: fundamental limits
  • Communication constraints
  • Data-driven methods with limited communication
4 Summary, ongoing work

SLIDE 3

Distributed methods

SLIDE 4

Applications

  • Volunteer computing (NASA, CERN, SETI,... projects)
  • Massively multiplayer online games (peer-to-peer networks)
  • Aircraft control systems
  • Meteorology, Astronomy
  • Medical data from different hospitals
SLIDE 5

Distributed setting

SLIDES 6-7

Distributed setting II

Interested in high-dimensional and nonparametric models.

  • Methods have tuning, regularity, sparsity, or bandwidth hyperparameters to adjust for the optimal bias-variance trade-off. How does this work in distributed settings?

  • Several approaches in the literature (Consensus MC, WASP, Fast-KRR, Distributed GP, ...)

  • Limited theoretical underpinning
  • No unified framework to compare methods
  • Statistical models for illustration:
  • Kernel density estimation,
  • Gaussian white noise model,
  • Random design nonparametric regression.
SLIDE 8

Kernel density estimation I

  • Model: observe $X_1, \dots, X_n \overset{iid}{\sim} f_0$ with $f_0 \in H^\beta(L)$.
  • Distributed setting: distribute the data randomly over $m$ machines.
  • Method:
  • Local machines: kernel density estimation in each machine,
$$\hat f_h^{(i)}(x) = \frac{1}{h\, n/m} \sum_{j=1}^{n/m} K\Big(\frac{x - X_j^{(i)}}{h}\Big).$$
  • Central machine: average the local estimators,
$$\hat f_h(x) = \frac{1}{m} \sum_{i=1}^{m} \hat f_h^{(i)}(x).$$
(A numerical sketch follows below.)
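A minimal numerical sketch of this divide-and-conquer recipe (our illustration, not the authors' code; it assumes a Gaussian kernel and uses the two bandwidth choices discussed on the next slides):

```python
# Divide-and-conquer KDE: each machine computes a kernel density estimate on
# its n/m observations; the central machine averages the m local estimates.
import numpy as np

def local_kde(data, x, h):
    # f_hat^(i)(x) = 1/(h * len(data)) * sum_j K((x - X_j)/h), Gaussian K
    u = (x[:, None] - data[None, :]) / h
    return (np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)).mean(axis=1) / h

def distributed_kde(samples, x, h, m, rng):
    shards = np.array_split(rng.permutation(samples), m)   # random split
    return np.mean([local_kde(s, x, h) for s in shards], axis=0)

rng = np.random.default_rng(0)
n, m, beta = 10_000, 20, 2.0
samples = rng.normal(size=n)                  # toy data
x = np.linspace(-3, 3, 201)
h_local = (n / m) ** (-1 / (1 + 2 * beta))    # locally optimal: oversmooths
h_global = n ** (-1 / (1 + 2 * beta))         # globally optimal order
f_local = distributed_kde(samples, x, h_local, m, rng)
f_global = distributed_kde(samples, x, h_global, m, rng)
```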

SLIDES 9-12

Kernel density estimation II

Problem: the choice of the bandwidth parameter $h$.

  • Local bias-variance trade-off:
$$|f_0(x) - \mathrm{E}_{f_0}\hat f_h^{(i)}(x)| \lesssim h^\beta, \qquad \mathrm{Var}_{f_0}\hat f_h^{(i)}(x) \asymp \frac{m}{hn},$$
with optimal bandwidth $h = (n/m)^{-1/(1+2\beta)}$.
  • Global bias-variance trade-off:
$$|f_0(x) - \mathrm{E}_{f_0}\hat f_h(x)| \lesssim h^\beta, \qquad \mathrm{Var}_{f_0}\hat f_h(x) \asymp \frac{1}{hn},$$
with optimal bandwidth $h = n^{-1/(1+2\beta)}$.
  • The local bias-variance trade-off results in too large a bias for $\hat f_h$: oversmoothing (see the computation below).
  • In practice $\beta$ is unknown: distributed data-driven methods?
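To spell out the oversmoothing (a direct consequence of the displays above): plugging the locally optimal bandwidth $h = (n/m)^{-1/(1+2\beta)}$ into the aggregated estimator gives
$$|f_0(x) - \mathrm{E}_{f_0}\hat f_h(x)| \lesssim (n/m)^{-\frac{\beta}{1+2\beta}} = m^{\frac{\beta}{1+2\beta}}\, n^{-\frac{\beta}{1+2\beta}},$$
a bias inflated by the factor $m^{\beta/(1+2\beta)}$ compared with the optimal order $n^{-\beta/(1+2\beta)}$, while averaging drives the variance down to $\asymp m^{-\frac{1}{1+2\beta}}\, n^{-\frac{2\beta}{1+2\beta}}$, so bias dominates.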
SLIDES 13-14

Gaussian white noise model

Single observer:
$$dY_t = f_0(t)\,dt + \frac{1}{\sqrt{n}}\,dW_t, \qquad t \in [0,1].$$

Distributed case: $m$ observers,
$$dY_t^{(i)} = f_0(t)\,dt + \sqrt{\frac{m}{n}}\,dW_t^{(i)}, \qquad t \in [0,1],\ i \in \{1,\dots,m\},$$
where the $W^{(i)}$ are independent Brownian motions.

Assumption: $f_0 \in S^\beta(L)$ for some $\beta > 0$.
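In the Fourier basis $(\varphi_j)_j$ used below, projecting onto $\varphi_j$ gives the equivalent sequence formulation (a standard reduction, recorded here for later reference):
$$Y_j^{(i)} := \int_0^1 \varphi_j(t)\, dY_t^{(i)} = f_{0,j} + \sqrt{m/n}\, Z_j^{(i)}, \qquad Z_j^{(i)} \overset{iid}{\sim} N(0,1),$$
where $f_{0,j}$ is the $j$-th Fourier coefficient of $f_0$.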

SLIDE 15

Distributed Bayesian approach

  • Endow $f_0$ in each local problem with a Gaussian process prior of the form
$$f \mid \alpha \sim \sum_{j=1}^{\infty} j^{-1/2-\alpha} Z_j \varphi_j,$$
where the $Z_j$ are iid $N(0,1)$ and $(\varphi_j)_j$ is the Fourier basis.
  • Compute the posterior (or a modification of it) locally.
  • Aggregate the local posteriors into a global one.
  • Can we get optimal recovery and reliable uncertainty quantification? (A conjugate sketch of the local computation follows below.)
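In the sequence formulation the prior and likelihood are conjugate, so each local posterior is Gaussian coefficientwise. A minimal sketch (ours, with an arbitrary truncation level $J$ and a toy truth):

```python
# Local posteriors in the distributed sequence model: prior f_j ~ N(0, j^(-1-2a)),
# local data Y_j^(i) = f_j + sqrt(m/n) Z, so the posterior is Gaussian per (i, j).
import numpy as np

def local_posterior(Y_i, alpha, sigma2):
    """Coefficientwise posterior mean and variance for one machine."""
    j = np.arange(1, Y_i.size + 1)
    tau2 = j ** (-1.0 - 2 * alpha)                  # prior variances
    gain = tau2 / (tau2 + sigma2)                   # shrinkage factor
    return gain * Y_i, gain * sigma2                # mean, variance

rng = np.random.default_rng(1)
n, m, beta, J = 5000, 10, 1.0, 400
sigma2 = m / n                                      # local noise variance
j = np.arange(1, J + 1)
f0 = j ** (-1.0 - beta) * np.cos(j)                 # a toy truth
Y = f0 + np.sqrt(sigma2) * rng.standard_normal((m, J))
means, variances = zip(*(local_posterior(Y[i], beta, sigma2) for i in range(m)))
```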
SLIDE 16

Benchmark: Non-distributed setting I

  • One server: $m = 1$.
  • Squared bias (of the posterior mean): $\|f_0 - \mathrm{E}\hat f_\alpha\|_2^2 \lesssim n^{-\frac{2\beta}{1+2\alpha}}$.
  • Variance, posterior spread: $\mathrm{Var}(\hat f_\alpha) \asymp \sigma^2_{|Y} \asymp n^{-\frac{2\alpha}{1+2\alpha}}$.
  • Optimal bias-variance trade-off at $\alpha = \beta$.
SLIDE 17

Benchmark: Non-distributed setting II

[Figure: posterior from non-distributed data; $f(t)$ against $t \in [0,1]$.]

SLIDE 18

Distributed naive method

  • We have $m$ local machines, with data $(Y^{(1)}, \dots, Y^{(m)})$.
  • Take $\alpha = \beta$.
  • Local posteriors:
$$\Pi^{(i)}_\beta(f \in B \mid Y^{(i)}) = \frac{\int_B p_f(Y^{(i)})\, d\Pi_\beta(f)}{\int p_f(Y^{(i)})\, d\Pi_\beta(f)}.$$
  • Aggregate the local posteriors by averaging draws taken from them (a sketch follows below).

Result: sub-optimal contraction, misleading uncertainty quantification:
$$\|f_0 - \mathrm{E}\hat f\|_2^2 \asymp (n/m)^{-\frac{2\beta}{1+2\beta}}, \qquad \mathrm{Var}(\hat f) \asymp \sigma^2_{|Y} \asymp m^{-\frac{1}{1+2\beta}}\, n^{-\frac{2\beta}{1+2\beta}}.$$
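A sketch of the naive aggregation in the conjugate setting above (our illustration): averaging one draw per machine averages the local means and divides the local posterior variance by $m$, shrinking the spread without repairing the local bias-variance trade-off.

```python
# Naive distributed posterior: draw once from each local posterior (alpha = beta)
# and average the m draws; the resulting spread is (local variance) / m.
import numpy as np

def naive_aggregate(local_means, local_vars, rng):
    """local_means, local_vars: arrays of shape (m, J)."""
    draws = local_means + np.sqrt(local_vars) * rng.standard_normal(local_means.shape)
    return draws.mean(axis=0)       # one draw from the aggregated "posterior"

# Toy usage: with local posteriors N(mu_i, v) per coefficient, the average of
# the m draws is N(mean(mu_i), v/m): the spread shrinks, the bias does not.
rng = np.random.default_rng(2)
mu = rng.normal(size=(10, 400))
v = np.full((10, 400), 1e-3)
draw = naive_aggregate(mu, v, rng)
```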

SLIDE 19

Distributed naive method II

[Figure: posterior from the naive distributed method; $f(t)$ against $t \in [0,1]$.]

SLIDES 20-21

The likelihood approach

  • Again $m$ local machines, with data $(Y^{(1)}, \dots, Y^{(m)})$, and take $\alpha = \beta$.
  • Modify the local likelihoods: each machine raises its likelihood to the power $m$,
$$\Pi^{(i)}(f \in B \mid Y^{(i)}) = \frac{\int_B p_f(Y^{(i)})^m\, d\Pi(f)}{\int p_f(Y^{(i)})^m\, d\Pi(f)}.$$
  • Aggregate the modified posteriors by averaging draws taken from them (sketch below).

Result: optimal posterior contraction, but bad uncertainty quantification:
$$\|f_0 - \mathrm{E}\hat f\|_2^2 \lesssim n^{-\frac{2\beta}{1+2\beta}}, \qquad \mathrm{Var}(\hat f) \asymp n^{-\frac{2\beta}{1+2\beta}}, \qquad \sigma^2_{|Y} \asymp m^{-1}\, n^{-\frac{2\beta}{1+2\beta}}.$$
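In the conjugate sketch, raising the local likelihood to the power $m$ simply divides the effective noise variance by $m$ (from $m/n$ to $1/n$). A minimal sketch of the resulting local posterior (our illustration):

```python
# Tempered-likelihood local posterior: p_f(Y^(i))^m corresponds to noise
# variance sigma2/m = 1/n, as if each machine saw the full sample size.
import numpy as np

def tempered_posterior(Y_i, alpha, sigma2, m):
    j = np.arange(1, Y_i.size + 1)
    tau2 = j ** (-1.0 - 2 * alpha)
    s2 = sigma2 / m                      # effective noise after tempering
    gain = tau2 / (tau2 + s2)
    return gain * Y_i, gain * s2         # mean, variance

# Averaging one draw per machine then has variance (gain * s2) / m: the centre
# is right (optimal contraction) but the spread is m times too small (bad UQ).
```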

SLIDE 22

The likelihood approach II

[Figure: posterior from the likelihood-modified distributed method; $f(t)$ against $t \in [0,1]$.]

SLIDES 23-24

The prior rescaling approach

  • Again $m$ local machines, with data $(Y^{(1)}, \dots, Y^{(m)})$.
  • Modify the local priors: each machine raises its prior density to the power $1/m$,
$$\Pi^{(i)}(f \in B \mid Y^{(i)}) = \frac{\int_B p_f(Y^{(i)})\, \pi(f)^{1/m}\, d\lambda(f)}{\int p_f(Y^{(i)})\, \pi(f)^{1/m}\, d\lambda(f)}.$$
  • Aggregate the modified posteriors by averaging draws taken from them (sketch below).

Result: optimal posterior contraction and uncertainty quantification:
$$\|f_0 - \mathrm{E}\hat f\|_2^2 \lesssim n^{-\frac{2\beta}{1+2\beta}}, \qquad \mathrm{Var}(\hat f) \asymp \sigma^2_{|Y} \asymp n^{-\frac{2\beta}{1+2\beta}}.$$
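In the conjugate sketch, raising the Gaussian prior density to the power $1/m$ multiplies the prior variances by $m$; averaging one draw per machine then reproduces the non-distributed posterior exactly in this toy model (a Gaussian computation of ours, consistent with the optimality stated above):

```python
# Rescaled-prior local posterior: pi(f)^(1/m) turns N(0, tau2) into N(0, m*tau2).
import numpy as np

def rescaled_posterior(Y_i, alpha, sigma2, m):
    j = np.arange(1, Y_i.size + 1)
    tau2 = m * j ** (-1.0 - 2 * alpha)   # inflated prior variance
    gain = tau2 / (tau2 + sigma2)
    return gain * Y_i, gain * sigma2     # mean, variance

# With sigma2 = m/n: gain = j^(-1-2a) / (j^(-1-2a) + 1/n), and the average of
# m draws has variance gain*sigma2/m = gain/n: exactly the non-distributed
# posterior spread, with a correctly centred mean.
```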

SLIDE 25

The prior rescaling approach II

[Figure: posterior from the rescaled-prior distributed method; $f(t)$ against $t \in [0,1]$.]

SLIDE 26

Other approaches

Method                            Posterior contraction rate     Coverage
naive, average                    sub-optimal                    no
naive, Wasserstein                sub-optimal                    yes
likelihood, average               minimax                        no
likelihood, Wasserstein (WASP)    minimax                        yes
scaling, average (consensus MC)   minimax                        yes
scaling, Wasserstein              minimax                        yes
undersmoothing                    minimax (on a range of β, m)   yes (on a range of β, m)
PoE                               sub-optimal                    no
gPoE                              sub-optimal                    yes
BCM                               minimax                        yes
rBCM                              sub-optimal                    yes

SLIDES 27-28

Data-driven methods

Note: all the methods above use knowledge of the true regularity parameter $\beta$, which in practice is usually not available.
Solution: data-driven choice of the regularity (tuning) hyperparameter.

Benchmark: in the non-distributed case ($m = 1$),

  • Hierarchical Bayes: endow $\alpha$ with a hyperprior.
  • Empirical Bayes: estimate $\alpha$ from the data (marginal maximum likelihood estimator, MMLE).
  • Adaptive minimax posterior contraction rates.
  • Coverage of credible sets (under polished tail/self-similarity assumptions, using blow-up factors).
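For orientation (a standard Gaussian computation, not spelled out on the slides): in the sequence form of the white noise model, marginally $Y_j \mid \alpha \sim N(0,\ j^{-1-2\alpha} + 1/n)$, so the MMLE maximizes
$$\ell_n(\alpha) = -\frac{1}{2} \sum_j \Big[ \log\big(j^{-1-2\alpha} + n^{-1}\big) + \frac{Y_j^2}{j^{-1-2\alpha} + n^{-1}} \Big].$$
This is the curve plotted on the next slides.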

SLIDE 29

Empirical Bayes posterior

[Figure: empirical Bayes posterior; $f(t)$ against $t \in [0,1]$.]

SLIDE 30

Marginal likelihood

[Figure: marginal log-likelihood as a function of $\alpha \in [2,10]$.]

SLIDE 31

Data-driven distributed methods

Proposed methods:

  • Naive EB: local MMLE,
$$\hat\alpha^{(i)} = \arg\max_\alpha \int p_f(Y^{(i)})\, d\Pi_\alpha(f).$$
  • Interactive EB, Deisenroth and Ng (2015):
$$\hat\alpha = \arg\max_\alpha \sum_{i=1}^m \log \int p_f(Y^{(i)})\, d\Pi_\alpha(f).$$
  • Other EB: Lepskii's method $\tilde\alpha^{(i)}$, or cross-validation (in the context of ridge regression: Zhang, Duchi, Wainwright (2015)).

(A grid-search sketch of the first two follows below.)
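A grid-search sketch of the naive and interactive empirical Bayes choices in the sequence model (our illustration; the marginal likelihood is the standard Gaussian one noted after SLIDES 27-28, with local noise variance $m/n$):

```python
# Naive EB: each machine maximizes its own log marginal likelihood over a grid.
# Interactive EB: maximize the SUM of the m local log marginal likelihoods.
import numpy as np

def log_marginal(Y_i, alpha, sigma2):
    j = np.arange(1, Y_i.size + 1)
    v = j ** (-1.0 - 2 * alpha) + sigma2            # marginal var of Y_j^(i)
    return -0.5 * np.sum(np.log(v) + Y_i**2 / v)

def naive_eb(Y, sigma2, grid):
    return [grid[np.argmax([log_marginal(Yi, a, sigma2) for a in grid])]
            for Yi in Y]

def interactive_eb(Y, sigma2, grid):
    totals = [sum(log_marginal(Yi, a, sigma2) for Yi in Y) for a in grid]
    return grid[int(np.argmax(totals))]

# Example: Y of shape (m, J) as in the sketch after SLIDE 15, sigma2 = m/n,
# grid = np.linspace(0.1, 10, 100).
```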

SLIDE 32

Counterexample

Theorem: Consider $f_0 \in S^\beta(L)$ with Fourier coefficients
$$f_{0,j}^2 = \begin{cases} j^{-1-2\beta}, & \text{if } j \ge (n/\sqrt{m})^{\frac{1}{1+2\beta}},\\ 0, & \text{else.}\end{cases}$$
Then for all the above empirical Bayes methods (naive, interactive, Lepskii) the regularity hyperparameter oversmooths:
$$P\big(\min(\hat\alpha^{(i)}, \hat\alpha, \tilde\alpha^{(i)}) \ge \beta + 1/2\big) = 1 + o(1).$$
Combining this with any of the aggregation methods above that are optimal in the non-adaptive case, one gets
$$\Pi_{aggr,\hat\alpha}\big(f : \|f - f_0\|_2^2 \ge c\,(n/\sqrt{m})^{-\frac{2\beta}{1+2\beta}} \,\big|\, Y\big) = 1 + o(1).$$
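Why this happens (a heuristic reading of the construction, not stated this way on the slide): each machine has local noise level $m/n$ per Fourier coefficient, so it can only detect signal coefficients with $f_{0,j}^2 = j^{-1-2\beta} \gtrsim m/n$, i.e. frequencies $j \lesssim (n/m)^{1/(1+2\beta)}$. The construction places all signal above $(n/\sqrt{m})^{1/(1+2\beta)}$, which is invisible locally but would be detectable from the pooled data (noise level $1/n$); hence every local marginal likelihood points to a much smoother truth.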

SLIDE 33

Aggregated empirical Bayes posterior

[Figure: aggregated empirical Bayes posterior; $f(t)$ against $t \in [0,1]$.]

SLIDE 34

Local marginal likelihoods

[Figure: grid of 20 panels, one per machine, each showing the local marginal log-likelihood against $\alpha \in [2,10]$.]

SLIDES 35-38

Data-driven methods: constraints

Question: Is it possible to construct data-driven distributed methods with good recovery at all?

  • Yes: transfer all the data from the local machines to the central machine and run a data-driven method in the central machine.
  • BUT this is clearly not what we are looking for...
  • In practice there are constraints on the method:
  • Computational: minimize the amount of computation in the central machine.
  • Communication: as little communication between servers as possible.

New question: Are there distributed data-driven methods with optimal recovery and "optimal" communication/computational costs?

SLIDE 39

Communication constraints

SLIDES 40-41

Communication constraints: minimax rate

  • No restriction ($B_i = \infty$): back to the non-distributed case.
  • No communication ($B_i = 0$): no (sensible) inference is possible.
  • In parametric models: Zhang et al. (2013). No results in nonparametric models.

Theorem: For $\beta, L > 0$,
$$\inf_{\hat f \in \mathcal{F}_{dist}(B_1,\dots,B_m)}\ \sup_{f \in B^\beta_{2,\infty}(L)} \mathrm{E}_f \|\hat f - f\|_2^2 \gtrsim \delta_n^{\frac{2\beta}{1+2\beta}},$$
where $\delta_n$ is the solution of
$$\delta_n = \min\Big\{\frac{m}{n \log m},\ \frac{m}{n \log m\, \sum_{i=1}^m \big(\delta_n^{\frac{1}{1+2\beta}} B_i \wedge 1\big)}\Big\};$$
for equal budgets $B = B_1 = \dots = B_m$ this simplifies to
$$\delta_n = \min\Big\{\frac{m}{n \log m},\ \frac{1}{n \log m\, \big(\delta_n^{\frac{1}{1+2\beta}} B \wedge 1\big)}\Big\}.$$

SLIDE 42

Remarks

  • The proof is via Fano's inequality (using mutual information).
  • If $B_i \ge n^{\frac{1}{1+2\beta}}$, then $\delta_n \asymp (\log m)^{\gamma_1}/n$ and the minimax lower bound is $(\log m)^{\gamma_2}\, n^{-\frac{2\beta}{1+2\beta}}$.
  • If $B_i \le n^{\rho}\, n^{\frac{1}{1+2\beta}}$ (for some $\rho < 0$), then the lower bound is $n^{\rho_1}\, n^{-\frac{2\beta}{1+2\beta}}$ (for some $\rho_1 > 0$).
  • It is easy to construct estimators which attain the lower bounds up to logarithmic terms.
  • So the optimal communication cost is $B_i = n^{\frac{1}{1+2\beta}}$ (up to $\log m$ terms).
  • Problem: $\beta$ is usually not available in practice.
SLIDES 43-45

Adaptive distributed methods - bad news

Question: Is it possible to achieve the minimax (non-distributed) convergence rate and optimal communication at the same time (without knowing $\beta$)?

Theorem: Let $\beta, L > 0$ be arbitrary. If $m \gg n^{\frac{1}{2+2\beta}}$, then there exists no procedure that can adapt both the transmission rate and the estimation rate uniformly over all $f_0 \in B^\beta_{2,\infty}(L)$.

Corollary: Suppose $m = n^p$ for $p \in (0, 1/2)$ and let $\beta, L > 0$ be arbitrary. If $\beta > 1/(4p) - 1/2$, then there exists no procedure that can adapt both the transmission rate and the estimation rate uniformly over all $f_0 \in B^\beta_{2,\infty}(L)$.

SLIDE 46

Idea of the proof

One can construct a finite sieve $\mathcal{F} \subset B^\beta_{2,\infty}(L)$ such that:

  • Local machines cannot consistently test whether $f = 0$ or $f \in \mathcal{F}$ (the elements are close to 0 and there are not too many of them).
  • The set is large enough that the minimax (non-distributed) rate for estimation over it is $n^{-\frac{2\beta}{1+2\beta}}$.
  • To achieve this rate (up to a logarithmic factor) one has to transmit, on average, $n^{1/(1+2\beta)}$ bits (up to a logarithmic factor).
  • Using the transmitted bits one could construct tests with higher precision than the first theoretical limit allows. Contradiction.

SLIDE 47

Adaptive distributed methods - good news

Theorem: Assume $m = n^p$ for $p \in (0, 1/2)$ and let $L > 0$. Then there exists a distributed procedure with transmission rates $\hat B_i$ and aggregated estimator $\hat f$ such that for all $0 < \underline{\beta} < \bar{\beta} < 1/(4p) - 1/2$,
$$\inf_{\underline{\beta} \le \beta \le \bar{\beta}}\ \inf_{f \in B^\beta_{2,\infty}(L)} P_f\big(\hat B_i \le C (\log n)^{\delta}\, n^{\frac{1}{1+2\beta}}\big) \to 1,$$
$$\inf_{\underline{\beta} \le \beta \le \bar{\beta}}\ \inf_{f \in B^\beta_{2,\infty}(L)} P_f\big(\|f - \hat f\|_2^2 \le C (\log n)^{\delta}\, n^{-\frac{2\beta}{1+2\beta}}\big) \to 1.$$

SLIDES 48-49

Good news: Idea of the proof I

We show adaptation to two classes indexed by $0 < \beta_1 < \beta_2 < 1/(4p) - 1/2$ (adaptation over a continuum of classes can be obtained by introducing a grid).

Local machines:

  • Split the data into two iid parts (at the price of twice the variance).
  • Using the first part, construct a consistent test $\varphi$ for
$$H_0: f \in B^{\beta_2}_{2,\infty}(L) \quad \text{vs} \quad H_a: f \in \Big\{f \in B^{\beta_1}_{2,\infty}(L) : \big\|f - B^{\beta_2}_{2,\infty}(L)\big\|_2^2 \ge (n/m)^{-\frac{\beta_1}{1/2+2\beta_1}}\Big\}.$$
  • Turn the test into an estimator of the smoothness:
$$\hat\beta^{(i)} = \begin{cases}\beta_2, & \text{if } \varphi = 0,\\ \beta_1, & \text{if } \varphi = 1.\end{cases}$$
  • Transmit the first $\log n$ bits of the first $\hat N^{(i)} = n^{\frac{1}{1+2\hat\beta^{(i)}}}$ wavelet coefficients of $Y^{(i)}_t$.

SLIDE 50

Good news: Idea of the proof II

Central machine:

  • Compute the median of the numbers of transmitted coefficients: $\hat N$.
  • Define the estimator (see the sketch below):
$$\hat f_{j,k} = \begin{cases}\dfrac{1}{|N_{j,k}|} \displaystyle\sum_{i \in N_{j,k}} Y^{(i)}_{j,k}, & \text{if } 2^j \le \hat N,\\ 0, & \text{else,}\end{cases}$$
where $Y^{(i)}_{j,k}$ is the (first $\log n$ bits of the) $(j,k)$-th wavelet coefficient of $Y^{(i)}_t$, $\hat f_{j,k}$ is the $(j,k)$-th wavelet coefficient of $\hat f$, and $N_{j,k} = \{1 \le i \le m : \hat N^{(i)} \ge 2^j\}$.
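A schematic end-to-end sketch (ours): the smoothness test on the slides is abstract, so below each machine is simply handed its $\hat\beta^{(i)} \in \{\beta_1, \beta_2\}$; quantization to $\log n$ bits is mimicked by rounding, and a flat coefficient index stands in for the wavelet index $(j,k)$.

```python
# Two-stage protocol sketch: machine i transmits its first N_i quantized
# coefficients, with N_i = n^(1/(1+2*beta_hat_i)); the central machine takes
# the median transmitted length N_hat and, for each index below N_hat,
# averages over the machines that transmitted that index.
import numpy as np

def quantize(x, bits):
    scale = 2.0 ** bits
    return np.round(x * scale) / scale           # crude stand-in for log n bits

def local_message(coeffs, n, beta_hat):
    N_i = int(n ** (1.0 / (1 + 2 * beta_hat)))   # transmission length
    bits = int(np.log2(n)) + 1
    return N_i, quantize(coeffs[:N_i], bits)

def central_estimate(messages, J):
    N_hat = int(np.median([N for N, _ in messages]))   # median length
    f_hat = np.zeros(J)
    for k in range(min(N_hat, J)):
        sent = [c[k] for N, c in messages if N > k]    # machines that sent k
        if sent:
            f_hat[k] = np.mean(sent)
    return f_hat
```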

SLIDE 51

Summary

  • Several distributed methods have been proposed in the literature (Bayesian and frequentist).
  • We compared them in a unified framework (distributed Gaussian white noise).
  • We investigated standard data-driven methods: they do not work.
  • Theoretical limitations under communication constraints.
  • An adaptive estimator with optimal communication cost (in $L_2$) exists only on a range of regularity classes.

SLIDE 52

Further results/Ongoing work

  • For $f_0 \in B^\beta_{\infty,\infty}$ and $L_\infty$-loss there exists no adaptive procedure (not even on a limited range).
  • Under a self-similarity assumption there exists an adaptive procedure.
  • Similar results can be derived for random design regression (technically more demanding): ongoing.
  • Uncertainty quantification in the adaptive setting: ongoing.
  • Computational constraints: NP vs P, quadratic, linear algorithms: future.
  • Combining computational and communication constraints: future.
  • A general theorem (both Bayesian and non-Bayesian): future.