
An asymptotic analysis of nonparametric divide-and-conquer methods - PowerPoint PPT Presentation



  1. An asymptotic analysis of nonparametric divide-and-conquer methods. Botond Szabó and Harry van Zanten. Van Dantzig seminar, Delft, 06.04.2017.

  2. Table of contents
     1 Motivation
     2 Distributed methods: examples and counterexamples
       • Kernel density estimation
       • Gaussian white noise model
       • Data-driven distributed methods
     3 Distributed methods: fundamental limits
       • Communication constraints
       • Data-driven methods with limited communication
     4 Summary, ongoing work

  3. Distributed methods

  4. Applications • Volunteer computing (NASA, CERN, SETI,... projects) • Massive multiplayer online games (peer network) • Aircraft control systems • Meteorology, Astronomy • Medical data from different hospitals

  5. Distributed setting

  6-7. Distributed setting II
  Interested in high-dimensional and nonparametric models.
  • Methods have tuning hyperparameters (regularity, sparsity, bandwidth) that must be adjusted for an optimal bias-variance trade-off. How does this work in distributed settings?
  • Several approaches exist in the literature (Consensus MC, WASP, Fast-KRR, Distributed GP, ...), but there is limited theoretical underpinning and no unified framework to compare methods.
  • Statistical models for illustration:
    • Kernel density estimation,
    • Gaussian white noise model,
    • Random design nonparametric regression.

  8. Kernel density estimation I
  • Model: observe X_1, ..., X_n iid ~ f_0, with f_0 ∈ H^β(L).
  • Distributed setting: distribute the data randomly over m machines.
  • Method:
    • Local machines: kernel density estimation on each machine,
        f̂_h^{(i)}(x) = (m/(hn)) Σ_{j=1}^{n/m} K((x − X_j^{(i)})/h).
    • Central machine: average the local estimators,
        f̂_h(x) = (1/m) Σ_{i=1}^{m} f̂_h^{(i)}(x).
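The two-stage estimator above can be sketched in a few lines of numpy. This is a minimal sketch: the Gaussian kernel, the sample, the grid, and the bandwidth below are illustrative choices, not from the talk.

```python
import numpy as np

def local_kde(x, data, h):
    """Kernel density estimate at the points x from one machine's data,
    using a Gaussian kernel K and bandwidth h."""
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-u ** 2 / 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def distributed_kde(x, data, m, h):
    """Divide-and-conquer: split the sample over m machines, estimate
    locally, and average the local estimators on the central machine."""
    blocks = np.array_split(data, m)
    return np.mean([local_kde(x, block, h) for block in blocks], axis=0)

rng = np.random.default_rng(0)
n, m = 1000, 10
sample = rng.normal(size=n)       # illustrative truth: standard normal density
grid = np.linspace(-3, 3, 61)
h = n ** (-1 / 3)                 # globally tuned bandwidth for beta = 1
f_hat = distributed_kde(grid, sample, m, h)
```

Note that with equal-sized splits and a common bandwidth the averaged estimator coincides with the full-sample kernel estimator, so the distributed difficulty lies entirely in the choice of h.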

  9-12. Kernel density estimation II
  Problem: the choice of the bandwidth parameter h.
  • Local bias-variance trade-off:
      |f_0(x) − E_{f_0} f̂_h^{(i)}(x)| ≲ h^β,   Var_{f_0} f̂_h^{(i)}(x) ≍ m/(hn),
    optimal bandwidth: h = (n/m)^{−1/(1+2β)}.
  • Global bias-variance trade-off:
      |f_0(x) − E_{f_0} f̂_h(x)| ≲ h^β,   Var_{f_0} f̂_h(x) ≍ 1/(hn),
    optimal bandwidth: h = n^{−1/(1+2β)}.
  • The local bias-variance trade-off results in too large a bias for f̂_h: oversmoothing.
  • In practice β is unknown: distributed data-driven methods?
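The size of the oversmoothing is easy to quantify numerically; the values of n, m, and β below are illustrative.

```python
# Oversmoothing from locally tuned bandwidths: illustrative numbers.
n, m, beta = 10 ** 6, 100, 1.0
h_local = (n / m) ** (-1 / (1 + 2 * beta))   # tuned to the n/m local observations
h_global = n ** (-1 / (1 + 2 * beta))        # tuned to the full sample size n
ratio = h_local / h_global                   # equals m ** (1/(1+2*beta))
```

The locally tuned bandwidth is a factor m^{1/(1+2β)} too large (here 100^{1/3} ≈ 4.64), which is exactly the oversmoothing of f̂_h noted above.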

  13-14. Gaussian white noise model
  Single observer:
      dY_t = f_0(t) dt + (1/√n) dW_t,   t ∈ [0, 1].
  Distributed case: m observers,
      dY_t^{(i)} = f_0(t) dt + √(m/n) dW_t^{(i)},   t ∈ [0, 1],   i ∈ {1, ..., m},
  where the W^{(i)} are independent Brownian motions.
  Assumption: f_0 ∈ S^β(L) for some β > 0.
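Expanding in a basis turns the model into a sequence model: machine i observes Y_j^{(i)} = f_{0,j} + √(m/n)·ε_j^{(i)} with iid standard normal ε_j^{(i)}. A minimal simulation sketch; the truncation level, sample sizes, and test signal are illustrative assumptions.

```python
import numpy as np

def distributed_observations(f0, n, m, rng):
    """Sequence-space form of the distributed white noise model:
    machine i observes Y_j^(i) = f0_j + sqrt(m/n) * eps_j^(i)."""
    return f0 + rng.normal(size=(m, len(f0))) * np.sqrt(m / n)

rng = np.random.default_rng(1)
J, n, m = 200, 10_000, 20
f0 = np.arange(1, J + 1) ** (-1.5)     # an illustrative smooth signal
Y = distributed_observations(f0, n, m, rng)
# Averaging over the m machines recovers the single-observer noise level 1/sqrt(n):
Y_bar = Y.mean(axis=0)
```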

  15. Distributed Bayesian approach
  • Endow f_0 in each local problem with a Gaussian process prior of the form
      f | α ~ Σ_{j=1}^{∞} j^{−1/2−α} Z_j φ_j,
    where the Z_j are iid N(0, 1) and (φ_j)_j is the Fourier basis.
  • Compute the posterior (or a modification of it) locally.
  • Aggregate the local posteriors into a global one.
  • Can we get optimal recovery and reliable uncertainty quantification?
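A draw from this series prior is easy to generate. The sketch below truncates the sum at J terms and uses a cosine basis as a stand-in for the Fourier basis; both are illustrative assumptions.

```python
import numpy as np

def prior_draw(alpha, J, t, rng):
    """One draw from f | alpha = sum_j j^(-1/2-alpha) Z_j phi_j, Z_j iid N(0,1),
    truncated at J terms and evaluated on the grid t."""
    Z = rng.normal(size=J)
    coefs = np.arange(1, J + 1) ** (-0.5 - alpha) * Z
    # cosine basis on [0, 1] as a stand-in for the Fourier basis
    Phi = np.sqrt(2) * np.cos(np.pi * np.outer(np.arange(1, J + 1), t))
    return coefs @ Phi

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 101)
f_rough = prior_draw(alpha=0.5, J=500, t=t, rng=rng)   # rougher sample path
f_smooth = prior_draw(alpha=2.0, J=500, t=t, rng=rng)  # larger alpha, smoother path
```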

  16. Benchmark: Non-distributed setting I
  • One server: m = 1.
  • Squared bias (of the posterior mean): ‖f_0 − E f̂_α‖_2^2 ≲ n^{−2β/(1+2α)}.
  • Variance and posterior spread: Var(f̂_α) ≍ σ^2|Y ≍ n^{−2α/(1+2α)}.
  • Optimal bias-variance trade-off: at α = β.

  17. Benchmark: Non-distributed setting II
  [Figure: posterior from non-distributed data, f(t) plotted over t ∈ [0, 1].]

  18. Distributed naive method
  • We have m local machines, with data (Y^{(1)}, ..., Y^{(m)}).
  • Take α = β.
  • Local posteriors:
      Π_β^{(i)}(f ∈ B | Y^{(i)}) = ∫_B p_f(Y^{(i)}) dΠ_β(f) / ∫ p_f(Y^{(i)}) dΠ_β(f).
  • Aggregate the local posteriors by averaging the draws taken from them.
  Result: sub-optimal contraction and misleading uncertainty quantification,
      ‖f_0 − E f̂‖_2^2 ≍ (n/m)^{−2β/(1+2β)},   Var(f̂) ≍ σ^2|Y ≍ m^{−1/(1+2β)} n^{−2β/(1+2β)}.
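In the sequence-space form the prior coordinates are f_j ~ N(0, j^{−1−2α}) and each machine's noise variance is m/n, so the local posteriors are Gaussian conjugate and the naive aggregation can be written down exactly. A conjugate sketch; the setup and parameter values are illustrative.

```python
import numpy as np

def naive_distributed_posterior(Y, n, alpha):
    """Y has shape (m, J): machine i's observed coefficients, noise variance m/n.
    Each machine computes its conjugate posterior; draws are then averaged.
    Returns the aggregated posterior mean and the variance of an averaged draw."""
    m, J = Y.shape
    lam = np.arange(1, J + 1) ** (-1.0 - 2 * alpha)   # prior variances
    s2 = m / n                                        # local noise variance
    local_mean = lam / (lam + s2) * Y                 # local posterior means
    local_var = lam * s2 / (lam + s2)                 # local posterior variances
    agg_mean = local_mean.mean(axis=0)
    agg_var = local_var / m   # averaging m independent draws divides the spread by m
    return agg_mean, agg_var
```

The local shrinkage factor lam/(lam + m/n) is tuned to n/m observations (too much bias), while the averaged spread is a further factor m smaller: oversmoothed and overconfident at once.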

  19. Distributed naive method II
  [Figure: posterior from the naive distributed method, f(t) plotted over t ∈ [0, 1].]

  20-21. The likelihood approach
  • Again m local machines, with data (Y^{(1)}, ..., Y^{(m)}), and take α = β.
  • Modify the local likelihoods: for each machine,
      Π^{(i)}(f ∈ B | Y^{(i)}) = ∫_B p_f(Y^{(i)})^m dΠ(f) / ∫ p_f(Y^{(i)})^m dΠ(f).
  • Aggregate the modified posteriors by averaging the draws taken from them.
  Result: optimal posterior contraction, but bad uncertainty quantification,
      ‖f_0 − E f̂‖_2^2 ≲ n^{−2β/(1+2β)},   Var(f̂) ≍ n^{−2β/(1+2β)},   σ^2|Y ≍ m^{−1} n^{−2β/(1+2β)}.
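For Gaussian noise, raising the local likelihood to the power m is the same as dividing the effective noise variance m/n by m. A conjugate sketch in the sequence-space form, with prior coordinates f_j ~ N(0, j^{−1−2α}); the setup is illustrative.

```python
import numpy as np

def tempered_distributed_posterior(Y, n, alpha):
    """Likelihood approach: each machine uses p_f(Y^(i))^m, i.e. for Gaussian
    noise the effective local noise variance becomes (m/n)/m = 1/n."""
    m, J = Y.shape
    lam = np.arange(1, J + 1) ** (-1.0 - 2 * alpha)  # prior variances
    s2 = 1.0 / n                        # tempered effective noise variance
    local_mean = lam / (lam + s2) * Y   # global-level shrinkage: optimal rate
    local_var = lam * s2 / (lam + s2)   # already at the single-machine level
    agg_mean = local_mean.mean(axis=0)
    agg_var = local_var / m             # averaging shrinks it m-fold: too narrow
    return agg_mean, agg_var
```

The aggregated spread is a factor m below the single-machine benchmark, which is the too-narrow credible band behind the bad uncertainty quantification on the slide.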

  22. The likelihood approach II
  [Figure: posterior from the likelihood distributed method, f(t) plotted over t ∈ [0, 1].]

  23-24. The prior rescaling approach
  • Again m local machines, with data (Y^{(1)}, ..., Y^{(m)}).
  • Modify the local priors: for each machine,
      Π^{(i)}(f ∈ B | Y^{(i)}) = ∫_B p_f(Y^{(i)}) π(f)^{1/m} dλ(f) / ∫ p_f(Y^{(i)}) π(f)^{1/m} dλ(f).
  • Aggregate the modified posteriors by averaging the draws taken from them.
  Result: optimal posterior contraction and uncertainty quantification,
      ‖f_0 − E f̂‖_2^2 ≲ n^{−2β/(1+2β)},   Var(f̂) ≍ σ^2|Y ≍ n^{−2β/(1+2β)}.
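Raising a Gaussian prior density to the power 1/m multiplies the prior variances by m, and after averaging the m local draws the spread lands exactly at the non-distributed level. A conjugate sequence-space sketch with illustrative prior coordinates f_j ~ N(0, j^{−1−2α}) and local noise variance m/n.

```python
import numpy as np

def rescaled_distributed_posterior(Y, n, alpha):
    """Prior rescaling: pi(f)^(1/m) inflates the Gaussian prior variances
    by a factor m; the local noise variance stays m/n."""
    m, J = Y.shape
    lam = m * np.arange(1, J + 1) ** (-1.0 - 2 * alpha)  # rescaled prior variances
    s2 = m / n
    local_mean = lam / (lam + s2) * Y   # simplifies to the global shrinkage
    local_var = lam * s2 / (lam + s2)
    agg_mean = local_mean.mean(axis=0)
    agg_var = local_var / m             # equals the non-distributed spread
    return agg_mean, agg_var
```

Algebraically, local_mean reduces to λ_j/(λ_j + 1/n)·Y_j^{(i)} and agg_var to λ_j(1/n)/(λ_j + 1/n): both the centering and the spread match the single-machine benchmark, which is why this approach gets both the rate and the uncertainty quantification right.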

  25. The prior rescaling approach II
  [Figure: posterior from the rescaled distributed method, f(t) plotted over t ∈ [0, 1].]

  26. Other approaches

  Method                            Posterior contraction rate     Coverage
  naive, average                    sub-optimal                    no
  naive, Wasserstein                sub-optimal                    yes
  likelihood, average               minimax                        no
  likelihood, Wasserstein (WASP)    minimax                        yes
  scaling, average (consensus MC)   minimax                        yes
  scaling, Wasserstein              minimax                        yes
  undersmoothing                    minimax (on a range of β, m)   yes (on a range of β, m)
  PoE                               sub-optimal                    no
  gPoE                              sub-optimal                    yes
  BCM                               minimax                        yes
  rBCM                              sub-optimal                    yes

  27-28. Data-driven methods
  Note: all the methods above use knowledge of the true regularity parameter β, which is usually not available in practice.
  Solution: data-driven choice of the regularity (tuning) hyperparameter.
  Benchmark: in the non-distributed case (m = 1),
  • Hierarchical Bayes: endow α with a hyperprior.
  • Empirical Bayes: estimate α from the data (marginal maximum likelihood estimator).
  • Both achieve the adaptive minimax posterior contraction rate.
  • Coverage of credible sets (under polished tail/self-similarity assumptions, using blow-up factors).

  29. Empirical Bayes posterior
  [Figure: empirical Bayes posterior, f(t) plotted over t ∈ [0, 1].]

  30. Marginal likelihood
  [Figure: marginal likelihood plotted as a function of α ∈ [0, 10].]

  31. Data-driven distributed methods
  Proposed methods:
  • Naive EB: local MMLE,
      α̂^{(i)} = argmax_α ∫ p_f(Y^{(i)}) dΠ_α(f).
  • Interactive EB, Deisenroth and Ng (2015):
      α̂ = argmax_α Σ_{i=1}^{m} log ∫ p_f(Y^{(i)}) dΠ_α(f).
  • Other EB: Lepskii's method α̃^{(i)}, or cross-validation (in the context of ridge regression, Zhang, Duchi and Wainwright (2015)).
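In the conjugate sequence-space sketch the marginal likelihood is explicit: under the prior, Y_j^{(i)} ~ N(0, j^{−1−2α} + m/n) independently over j, so both proposals reduce to one-dimensional maximisations over α. A hypothetical grid-search sketch; the grid, signal, and sample sizes are illustrative assumptions.

```python
import numpy as np

def log_marginal(Y_i, noise_var, alpha):
    """Log marginal likelihood of one machine's coefficients:
    Y_j ~ N(0, j^(-1-2*alpha) + noise_var), independently over j."""
    v = np.arange(1, len(Y_i) + 1) ** (-1.0 - 2 * alpha) + noise_var
    return -0.5 * np.sum(np.log(2 * np.pi * v) + Y_i ** 2 / v)

def naive_eb(Y, n, grid):
    """Naive EB: each machine maximises its own marginal likelihood (local MMLE)."""
    m = Y.shape[0]
    return np.array([grid[np.argmax([log_marginal(Y[i], m / n, a) for a in grid])]
                     for i in range(m)])

def interactive_eb(Y, n, grid):
    """Interactive EB (Deisenroth-Ng style): maximise the sum of the local
    log marginal likelihoods over all machines."""
    m = Y.shape[0]
    totals = [sum(log_marginal(Y[i], m / n, a) for i in range(m)) for a in grid]
    return grid[int(np.argmax(totals))]

rng = np.random.default_rng(3)
J, n, m = 100, 10_000, 10
f0 = np.arange(1, J + 1) ** (-1.5)        # illustrative truth
Y = f0 + rng.normal(size=(m, J)) * np.sqrt(m / n)
grid = np.linspace(0.1, 3.0, 30)
alpha_local = naive_eb(Y, n, grid)        # one estimate per machine
alpha_joint = interactive_eb(Y, n, grid)  # one shared estimate
```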
