Distributed Learning for Cooperative Inference
César A. Uribe (PowerPoint presentation)


  1. Distributed Learning for Cooperative Inference. César A. Uribe. Collaboration with: Alex Olshevsky and Angelia Nedić. LCCC - Focus Period on Large-Scale and Distributed Optimization, June 5th, 2017.

  2. Distributed Learning for Cooperative Inference, with each term annotated: "Distributed" (consensus-based), "Learning" (statistical estimation), "Cooperative Inference" (optimization). César A. Uribe. Collaboration with: Alex Olshevsky and Angelia Nedić. LCCC - Focus Period on Large-Scale and Distributed Optimization, June 5th, 2017.

  3. The three components for estimation
  Data: X ∼ P* is a r.v. with sample space (X, 𝒳). P* is unknown.
  Model:
  ◮ 𝒫, a collection of probability measures P: 𝒳 → [0, 1].
  ◮ Parametrized by Θ: ∃ an injective map Θ → 𝒫, θ ↦ P_θ.
  ◮ Dominated: ∃ λ s.t. P_θ ≪ λ, with p_θ = dP_θ/dλ.
  (Point) Estimator: a map P̂: X → 𝒫, the best guess P̂ ∈ 𝒫 for P* based on X, e.g.
    θ̂(X) = argmax_{θ∈Θ} p_θ(X).
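  As a concrete illustration of the three components, here is a minimal sketch. It is our own toy instance, not from the slides: the data are coin flips, the model is a finite family of Bernoulli measures, and the point estimator is maximum likelihood.

```python
import math

# Toy instance of the three components (all concrete values are assumed).
# Model: a finite family {P_theta} of Bernoulli(theta) laws, theta in Theta.
Theta = [0.2, 0.5, 0.8]

def log_p(theta, xs):
    # log-density p_theta of an i.i.d. Bernoulli sample
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)

def mle(xs):
    # (Point) estimator: theta_hat(X) = argmax_theta p_theta(X)
    return max(Theta, key=lambda th: log_p(th, xs))

# Data: a sample generated by the unknown P*.
X = [1, 1, 0, 1, 1, 1, 0, 1]   # 6 heads out of 8
print(mle(X))                  # -> 0.8
```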

  4. Bayesian Methods
  The parameter is a r.v. ϑ taking values in (Θ, 𝒯). There is a probability measure on X × Θ with ℱ = σ(𝒳 × 𝒯), Π: ℱ → [0, 1].
  Model: the distribution of X conditioned on ϑ, Π_{X|ϑ}.
  Prior: the marginal of Π on ϑ, Π: 𝒯 → [0, 1].
  Posterior: the distribution Π_{ϑ|X}: 𝒯 × X → [0, 1]. In particular,
    Π(ϑ ∈ B | X) = ∫_B p_θ(X) dΠ(θ) / ∫_Θ p_θ(X) dΠ(θ).

  5. One can construct the MAP and MMSE estimators as:
    θ̂_MAP(X) = argmax_{θ∈Θ} Π(θ | X)
    θ̂_MMSE(X) = ∫_Θ θ dΠ(θ | X)
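  On a finite Θ the posterior is just a weight vector and both estimators are one-liners; the hypotheses and posterior weights below are assumed toy values.

```python
# MAP and MMSE estimators from a discrete posterior Pi(theta | X).
# The numbers are assumed for illustration.
thetas = [0.0, 1.0, 2.0]
posterior = [0.2, 0.5, 0.3]

theta_map = max(zip(thetas, posterior), key=lambda tp: tp[1])[0]  # argmax of Pi(theta | X)
theta_mmse = sum(t * p for t, p in zip(thetas, posterior))        # posterior mean

print(theta_map)    # -> 1.0
print(theta_mmse)   # -> 1.1
```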

  6. The Belief Notation
  We are interested in computing posterior distributions. Thus, let us define the belief density on a hypothesis θ ∈ Θ at time k as
    dµ_k(θ) = dΠ(θ | X_1, …, X_k) ∝ ∏_{i=1}^k p_θ(X_i) dΠ(θ) = p_θ(X_k) dµ_{k−1}(θ).
  This defines an iterative algorithm:
    dµ_{k+1}(θ) ∝ dµ_k(θ) p_θ(x_{k+1}).
  We will say that we learn a parameter θ* if lim_{k→∞} µ_k(θ*) = 1 a.s. (usually). We hope that P_{θ*} is the closest to P* (in a sense defined later).
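  When Θ is finite, this recursion can be simulated directly. A sketch under assumed toy choices (three Bernoulli hypotheses, data generated by θ*): the belief µ_k concentrates on θ*.

```python
import math, random

# Iterative belief update d mu_{k+1}(theta) ∝ d mu_k(theta) p_theta(x_{k+1})
# on a finite hypothesis set; all concrete values are our own assumptions.
random.seed(0)
Theta = [0.3, 0.5, 0.7]   # Bernoulli success probabilities
theta_star = 0.7          # the data-generating parameter

def p(theta, x):
    return theta if x == 1 else 1.0 - theta

mu = [1.0 / len(Theta)] * len(Theta)   # uniform prior mu_0
for k in range(2000):
    x = 1 if random.random() < theta_star else 0
    mu = [m * p(th, x) for m, th in zip(mu, Theta)]
    total = sum(mu)
    mu = [m / total for m in mu]       # normalized belief mu_{k+1}

# The belief on theta* tends to 1: we "learn" theta*.
print([round(m, 3) for m in mu])
```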

  7. Example: Estimating the Mean of a Gaussian Model
  Data: assume we receive a sample x_1, …, x_k, where X_k ∼ N(θ*, σ²); σ² is known and we want to estimate θ*.
  Model: the collection of all Normal distributions with variance σ², i.e. 𝒫 = {N(θ, σ²)}.
  Prior: the standard Normal distribution, dµ_0(θ) = N(0, 1).
  Posterior: the posterior is defined as
    dµ_k(θ) ∝ dµ_0(θ) ∏_{t=1}^k p_θ(x_t) = N( Σ_{t=1}^k x_t / (σ² + k), σ² / (σ² + k) ).
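  A quick numerical check of this closed form; the ground-truth parameter, σ, and sample size are our own choices.

```python
import random

# Posterior for the Gaussian example: prior N(0, 1), likelihood N(theta, sigma^2)
# with sigma known; after k samples the posterior is
#   N( sum(x_t) / (sigma^2 + k),  sigma^2 / (sigma^2 + k) ).
random.seed(1)
theta_star, sigma = 2.0, 1.5          # assumed ground truth
xs = [random.gauss(theta_star, sigma) for _ in range(500)]

k = len(xs)
post_mean = sum(xs) / (sigma**2 + k)
post_var = sigma**2 / (sigma**2 + k)

print(round(post_mean, 2))  # near theta* for large k
print(round(post_var, 4))   # shrinks roughly like sigma^2 / k
```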

  8. (Slides 8-14: animation of the belief density dµ_k(·) over Θ for k = 0 through 6, concentrating around θ* and moving away from the initial guess θ_0.)

  15. Geometric Interpretation for Finite Hypotheses (Figure: the beliefs dµ_0 and dµ_1 plotted in mean-variance coordinates, approaching θ*.)

  16. Bayes’ Theorem Belongs to Stochastic Approximations

  17. Consider the following optimization problem:
    min_{θ∈Θ} F(θ) = D_KL(P ‖ P_θ).  (1)
  We can rewrite Eq. (1) as
    min_{θ∈Θ} D_KL(P ‖ P_θ) = min_{π∈Δ_Θ} E_π[ D_KL(P ‖ P_θ) ],  θ ∼ π
                            = min_{π∈Δ_Θ} E_π E_P[ −log dP_θ/dP ].
  Moreover,
    argmin_{θ∈Θ} D_KL(P ‖ P_θ) = argmin_{π∈Δ_Θ} E_π E_P[ −log p_θ(X) ],  θ ∼ π, X ∼ P
                               = argmin_{π∈Δ_Θ} E_P E_π[ −log p_θ(X) ],  θ ∼ π, X ∼ P.
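  The lifting from Θ to Δ_Θ changes nothing about the optimal value because the objective is linear in π, and a linear function on the simplex is minimized at a vertex, i.e. at a point mass on the best θ. A small numerical sanity check with made-up KL values:

```python
import random

# Check that min over mixtures pi of E_pi[D_KL(P || P_theta)] is never
# below the min over individual theta. The KL values are random stand-ins.
random.seed(2)
kl = [random.uniform(0.1, 2.0) for _ in range(4)]   # D_KL(P || P_theta), one per theta

best_vertex = min(kl)   # min over theta in Theta

def random_simplex_point(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

# random mixtures never do better than the best single hypothesis
best_mixture = min(
    sum(p * d for p, d in zip(random_simplex_point(len(kl)), kl))
    for _ in range(10000)
)

print(best_mixture >= best_vertex)   # -> True
```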

  18. Consider the following optimization problem:
    min_{x∈Z} E[ F(x, Ξ) ].
  The stochastic mirror descent approach constructs a sequence {x_k} as follows:
    x_{k+1} = argmin_{x∈Z} { ⟨∇F(x, ξ_k), x⟩ + (1/α_k) D_w(x, x_k) }.
  Recall our original problem:
    min_{π∈Δ_Θ} E_P E_π[ −log p_θ(X) ],  θ ∼ π, X ∼ P.  (2)
  For Eq. (2), stochastic mirror descent generates a sequence of densities {dµ_k} as follows:
    dµ_{k+1} = argmin_{π∈Δ_Θ} { ⟨−log p_θ(x_{k+1}), π⟩ + (1/α_k) D_w(π, dµ_k) },  θ ∼ π.  (3)

  19.
    dµ_{k+1} = argmin_{π∈Δ_Θ} { ⟨−log p_θ(x_{k+1}), π⟩ + D_KL(π ‖ dµ_k) },  θ ∼ π.
  Choose w(x) = x log x; then the corresponding Bregman distance is the Kullback-Leibler (KL) divergence D_KL. Additionally, by selecting α_k = 1, for each θ ∈ Θ,
    dµ_{k+1}(θ) ∝ p_θ(x_{k+1}) dµ_k(θ)   (the Bayesian posterior).
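  On a finite Θ this equivalence is easy to verify numerically: the closed-form multiplicative update matches a brute-force minimization of the mirror-descent objective over the simplex. The belief and likelihood values below are assumed toy numbers.

```python
import math

# Entropic mirror-descent step with alpha_k = 1 versus the Bayes update,
# for an assumed 3-hypothesis belief and likelihood vector.
mu_k = [0.5, 0.3, 0.2]      # current belief mu_k
lik = [0.9, 0.4, 0.1]       # p_theta(x_{k+1}) for each theta

# Closed form: mu_{k+1}(theta) ∝ p_theta(x_{k+1}) mu_k(theta)
z = sum(l * m for l, m in zip(lik, mu_k))
bayes = [l * m / z for l, m in zip(lik, mu_k)]

def objective(pi):
    # <-log p_theta(x_{k+1}), pi> + D_KL(pi || mu_k)
    return sum(p * (-math.log(l)) + p * math.log(p / m)
               for p, l, m in zip(pi, lik, mu_k) if p > 0)

# Brute-force grid search over the simplex
grid = [(a / 50, b / 50, (50 - a - b) / 50)
        for a in range(51) for b in range(51 - a)]
best = min(grid, key=objective)

print([round(x, 2) for x in bayes])   # the Bayes posterior...
print(best)                           # ...and the grid minimizer nearby
```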

  20. Distributed Inference Setup

  21. Distributed Inference Setup
  ◮ n agents: V = {1, 2, …, n}.
  ◮ Agent i observes X^i_k: Ω → 𝒳^i, X^i_k ∼ P^i.
  ◮ Agent i has a model about P^i: 𝒫^i = {P^i_θ | θ ∈ Θ}.
  ◮ Agent i has a local belief density dµ^i_k(θ).
  ◮ Agents share beliefs over the network (connected, fixed, undirected).
  ◮ a_ij ∈ (0, 1) is how agent i weights agent j's information, Σ_j a_ij = 1.
  Agents want to collectively solve the following optimization problem:
    min_{θ∈Θ} F(θ) ≜ D_KL(P ‖ P_θ) = Σ_{i=1}^n D_KL(P^i ‖ P^i_θ).  (4)
  Consensus Learning: dµ^i_∞(θ*) = 1 for all i.

  22. Our approach: include the beliefs of other agents in the regularization term (distributed stochastic entropic mirror descent):
    dµ^i_{k+1} = argmin_{π∈Δ_Θ} { −E_π[ log p^i_θ(x^i_{k+1}) ] + Σ_{j=1}^n a_ij D_KL(π ‖ dµ^j_k) }
    dµ^i_{k+1}(θ) ∝ p^i_θ(x^i_{k+1}) ∏_{j=1}^n dµ^j_k(θ)^{a_ij}.  (5)
  Q1. Does (5) achieve consensus learning?
  Q2. If Q1 is positive, at what rate does this happen?
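  Update (5) is simple to simulate on a finite Θ: each agent mixes its neighbors' beliefs geometrically (weights a_ij) and multiplies in its own local likelihood. Everything concrete below is our own toy choice: a 3-agent path graph with doubly stochastic weights, Bernoulli observations, and models chosen so that only agent 2 can distinguish the hypotheses, forcing the information to spread through the network.

```python
import math, random

# Simulation sketch of update (5); all concrete values are assumptions.
random.seed(3)
n, Theta = 3, [0, 1]
theta_star = 1

def lik(i, th, x):
    if i < 2:
        return 0.5                      # agents 0 and 1: flat likelihood
    if th == 1:
        return 0.8 if x == 1 else 0.2   # agent 2 under theta = 1
    return 0.5                          # agent 2 under theta = 0

A = [[2/3, 1/3, 0.0],                   # doubly stochastic weights, path graph
     [1/3, 1/3, 1/3],
     [0.0, 1/3, 2/3]]

mu = [[0.5, 0.5] for _ in range(n)]     # uniform priors
for k in range(3000):
    xs = [1 if random.random() < 0.8 else 0 for _ in range(n)]  # local data
    new = []
    for i in range(n):
        row = [math.exp(sum(A[i][j] * math.log(mu[j][t]) for j in range(n)))
               * lik(i, Theta[t], xs[i]) for t in range(len(Theta))]
        s = sum(row)
        new.append([r / s for r in row])
    mu = new

print([round(m[theta_star], 3) for m in mu])   # every agent's belief on theta = 1
```

Even though two of the three agents are locally uninformative, the geometric averaging step carries agent 2's evidence across the graph, so all beliefs concentrate on θ* (consensus learning, Q1, in this toy case).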

  23. A finite set Θ
  Extensive literature for finite parameter sets Θ:
  ◮ The network is static/time-varying.
  ◮ The network is directed/undirected.
  ◮ Prove consistency of the algorithm.
  ◮ Prove asymptotic/non-asymptotic convergence rates.
  Shahrampour, Rahimian, Jadbabaie, Lalitha, Sarwate, Javidi, Su, Vaidya, Qipeng, Bandyopadhyay, Sahu, Kar, Sayed, Chazelle, Olshevsky, Nedić, U.

  24. Geometric Interpretation for Finite Hypotheses (Figure: the true distribution P and the hypothesis distributions P_θ1, P_θ2, P_θ3 in the space of probability measures.)

  25. Distributed Source Localization (Figures: (a) a network of agents and a source on the x-y plane; (b) nine hypothesis distributions θ_1, …, θ_9 tiling the plane.)

  26. Distributed Source Localization

  27. Distributed Source Localization

  28. Our results for three different problems
  1. Time-varying undirected graphs (Nedić, Olshevsky, U., to appear in TAC):
  ◮ A_k is doubly stochastic, with [A_k]_ij > 0 if (i, j) ∈ E_k.
  2. Time-varying directed graphs (Nedić, Olshevsky, U., ACC 2016):
  ◮ [A_k]_ij = 1/d^j_k if j ∈ N^i_k, and 0 otherwise, where d^i_k is the out-degree of node i at time k and N^i_k is the set of in-neighbors of node i.
  3. Acceleration in static graphs (Nedić, Olshevsky, U., to appear in TAC):
  ◮ Ā = (1/2) I + (1/2) A, where A_ij = 1/max{d_i, d_j} if (i, j) ∈ E and 0 if (i, j) ∉ E, with d_i the degree of node i.
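  The lazy Metropolis weights in item 3 are easy to construct explicitly; here is a sketch for an assumed 4-node path graph (the graph itself is our example, not from the slides).

```python
# Lazy Metropolis weights Abar = (1/2) I + (1/2) A for a static graph,
# with A_ij = 1/max{d_i, d_j} on edges. The 4-node path graph is an
# assumed example.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

deg = [0] * n
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

A = [[0.0] * n for _ in range(n)]
for i, j in edges:
    w = 1.0 / max(deg[i], deg[j])
    A[i][j] = A[j][i] = w
for i in range(n):
    A[i][i] = 1.0 - sum(A[i])           # leftover mass goes on the diagonal

Abar = [[(0.5 if i == j else 0.0) + 0.5 * A[i][j] for j in range(n)]
        for i in range(n)]

# Abar is symmetric and doubly stochastic
print([round(sum(row), 6) for row in Abar])   # -> [1.0, 1.0, 1.0, 1.0]
```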

  29. The update rules for the three problems:
  Time-varying undirected:
    µ^i_{k+1}(θ) ∝ p^i_θ(x^i_{k+1}) ∏_{j=1}^n µ^j_k(θ)^{[A_k]_ij}
  Fixed undirected (accelerated):
    µ^i_{k+1}(θ) ∝ p^i_θ(x^i_{k+1}) ∏_{j=1}^n µ^j_k(θ)^{(1+σ)Ā_ij} / ∏_{j=1}^n ( µ^j_{k−1}(θ) p^j_θ(x^j_k) )^{σĀ_ij}
  Time-varying directed:
    y^i_{k+1} = Σ_{j∈N^i_k} y^j_k / d^j_k
    µ^i_{k+1}(θ) ∝ ( ∏_{j∈N^i_k} µ^j_k(θ)^{y^j_k / d^j_k} · p^i_θ(x^i_{k+1}) )^{1/y^i_{k+1}}

  30. General form of the theorems
  Under appropriate assumptions, for a group of agents following algorithm X: there is a time N(n, λ, ρ) such that, with probability 1 − ρ, for all k ≥ N(n, λ, ρ) and all θ ∉ Θ*,
    µ^i_k(θ) ≤ exp(−kγ_2 + γ_1)   for all i = 1, …, n.

  31. After a time N(n, λ, ρ): with probability 1 − ρ, for all k ≥ N(n, λ, ρ) and all θ ∉ Θ*,
    µ^i_{k+1}(θ) ≤ exp(−kγ_2 + γ_1)   for all i = 1, …, n.
  Graph | N | γ_1 | γ_2
  Time-varying undirected | O(log(1/ρ)) | O((n²/η) log n) | O(1)
  … + Metropolis | O(log(1/ρ)) | O(n² log n) | O(1)
  Time-varying directed (δ ≥ 1/n^n) | O(log(1/ρ)) | O(n^n log n) | O(1)
  … + regular | O(log(1/ρ)) | O(n³ log n) | O(1)
  Fixed undirected | O(log(1/ρ)) | O(n log n) | O(1)

  32. (Figure: empirical mean over 50 Monte Carlo runs of the number of iterations required for µ^i_k(θ) < ε for all agents and all θ ∉ Θ*, versus the number of nodes, on (a) a path graph, (b) a circle graph, and (c) a grid graph. All agents but one have all their hypotheses observationally equivalent. Dotted line: the algorithm proposed by Jadbabaie et al.; dashed line: no acceleration; solid line: acceleration.)

  33. A particularly bad graph
