

  1. privacy-preserving decentralized learning of personalized models and collaboration graphs Aurélien Bellet (Inria) Includes work with: M. Tommasi, P. Vanhaesebrouck (University of Lille & Inria) R. Guerraoui, M. Taziki (EPFL) V. Zantedeschi (University of Saint-Etienne) Workshop on Optimization for Machine Learning Centre International de Rencontres Mathématiques, Marseille March 10, 2020

  2. connected devices: pervasive or invasive? • Connected devices are spreading rapidly and collect increasingly personal data • Ex: browsing logs, health, speech, accelerometer, geolocation... • Opportunity to provide personalized services but also a potential threat to privacy • A first step to try and reconcile the two: keep and process data on the user device • Training on the edge: train ML model on data from many devices 2

  3. training on the edge: challenges
• How to deal with imbalanced and non-i.i.d. local datasets
• How to scale to a large number of devices
• How to provide formal privacy guarantees
• ...

  4. federated vs fully decentralized training
Standard federated learning
• Coordination by a central server
• Single point of failure, server may become a bottleneck
Fully decentralized learning
• Device-to-device communication in a sparse network graph
• Naturally scales to many devices
See [Kairouz et al., 2019] for a detailed overview of federated/decentralized ML

  5. global model vs personalized models
Global model
• One-size-fits-all: same model makes predictions for all devices/users
• Model should be trained on data from all users
• Large model may be needed to capture the specificities of each user
Personalized models
• One model per device
• Model should be trained on data from that user and from similar users
• Smaller models may be sufficient

  6. our approach We propose to learn personalized models in a fully decentralized setting: • Learn “who to communicate with” by inferring a graph of similarities between users • Collaboratively learn personalized models over this graph • Jointly optimize the models and the graph, in an alternating fashion 6

  7. problem formulation

  8. users and local datasets
• A set of n users (devices) with common feature space X and label space Y
• User i has local dataset S_i = {(x_i^j, y_i^j)}_{j=1}^{m_i} drawn from a personal distribution, and wants to learn a model θ_i ∈ R^p which generalizes well to future local data
• Let ℓ : R^p × X × Y → R be a loss function, differentiable in its first argument
• In isolation, user i can learn a model by minimizing a local objective L_i(θ; S_i), e.g.,
  L_i(θ; S_i) = (1/m_i) Σ_{j=1}^{m_i} ℓ(θ; x_i^j, y_i^j) + λ_i ∥θ∥², with λ_i ≥ 0
• This will generalize poorly when local data is scarce → need to collaborate
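To make the local objective concrete, here is a minimal sketch of "learning in isolation" with plain gradient descent; the logistic loss, the synthetic data, and all names (`local_objective`, `local_gradient`) are illustrative choices, not the paper's:

```python
import numpy as np

def local_objective(theta, X, y, lam):
    """L_i(theta; S_i) = (1/m_i) sum_j log(1 + exp(-y_j x_j.theta)) + lam ||theta||^2."""
    margins = y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins))) + lam * np.dot(theta, theta)

def local_gradient(theta, X, y, lam):
    """Gradient of the regularized logistic loss with respect to theta."""
    margins = y * (X @ theta)
    coeffs = -y / (1.0 + np.exp(margins))            # d loss / d margin
    return (X.T @ coeffs) / len(y) + 2 * lam * theta

# Synthetic local dataset S_i with m_i = 20 points in p = 5 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.sign(X @ np.ones(5) + 0.1 * rng.normal(size=20))

# Train in isolation: gradient descent on the local objective only.
theta = np.zeros(5)
for _ in range(200):
    theta -= 0.5 * local_gradient(theta, X, y, lam=0.1)
```

With only 20 points this model fits its own data but, as the slide notes, would generalize poorly; that gap is what motivates collaboration.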

  9. decentralized setting
• Asynchronous time model: each user becomes active at random times, asynchronously and in parallel (we use a global counter t to denote the t-th activation)
• Communication model: all users can exchange messages, but we want to restrict communication to pairs of most similar users
• We model this by a collaboration graph: a sparse weighted graph with edge weight w_ij ≥ 0 reflecting the similarity between the learning tasks of users i and j

  10. joint optimization problem
• Learn personalized models Θ ∈ R^{n×p} and graph weights w ∈ R^{n(n−1)/2}_{≥0} as solutions to
  min_{Θ ∈ R^{n×p}, w ∈ R^{n(n−1)/2}_{≥0}} J(Θ, w) = Σ_{i=1}^n d_i c_i L_i(θ_i; S_i) + µ Σ_{i<j} w_ij ∥θ_i − θ_j∥² + λ g(w)
• c_i ∈ (0, 1] ∝ m_i: “confidence” of user i; d_i = Σ_{j≠i} w_ij: degree of i
• Trade-off between accurate models on local data and smooth models over the graph
• Term g(w): avoid trivial collaboration graph, encourage sparsity
• Flexible relationships: hyperparameter µ ≥ 0 interpolates between learning purely local models (n local models) and a shared model per connected component
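The objective J(Θ, w) can be evaluated term by term; a small sketch (function and argument names are mine, and the tiny check at the end uses made-up inputs):

```python
import numpy as np

def joint_objective(Theta, W, losses, c, mu, lam, g):
    """J(Theta, w) = sum_i d_i c_i L_i(theta_i)
    + mu * sum_{i<j} w_ij ||theta_i - theta_j||^2 + lam * g(w).
    Theta: (n, p) stacked personalized models; W: (n, n) symmetric weights."""
    n = Theta.shape[0]
    d = W.sum(axis=1)                                     # degrees d_i = sum_j w_ij
    fit = sum(d[i] * c[i] * losses[i](Theta[i]) for i in range(n))
    smooth = sum(W[i, j] * np.sum((Theta[i] - Theta[j]) ** 2)
                 for i in range(n) for j in range(i + 1, n))
    return fit + mu * smooth + lam * g(W)

# Tiny check: two users with identical models incur no smoothness penalty,
# so only the lam * g(w) term remains here (losses are set to zero).
Theta = np.ones((2, 3))
W = np.array([[0.0, 1.0], [1.0, 0.0]])
J = joint_objective(Theta, W, [lambda th: 0.0] * 2, c=[1.0, 1.0],
                    mu=10.0, lam=1.0, g=lambda W: W.sum() / 2)
```

Note how the degrees d_i rescale the data-fit term: well-connected users must balance their local loss against more neighbors.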

  11. outline of the proposed algorithm
We design an alternating optimization procedure over Θ and w:
1. A decentralized algorithm to learn the models given the graph
2. A decentralized algorithm to learn a graph given the models

  12. learning models given the graph

  13. properties of objective function
• For fixed graph weights w, denote f(Θ) := J(Θ, w)
• Assume local loss L_i has L_i^loc-Lipschitz continuous gradient
• Then ∇f is L_i-Lipschitz w.r.t. block θ_i with L_i = d_i(µ + c_i L_i^loc)
• Can also assume that L_i is σ^loc-strongly convex, where σ^loc > 0
• Then f is σ-strongly convex with σ ≥ min_{1≤i≤n} [d_i c_i σ^loc] > 0

  14. decentralized algorithm
• Denote the neighborhood of user i by N_i = {j : w_ij > 0}
• Initialize models Θ(0) ∈ R^{n×p}
• At step t ≥ 0, a random user i becomes active:
  1. user i updates its model based on its local dataset S_i and the information from its neighbors:
     θ_i(t+1) = θ_i(t) − (1/(µ + c_i L_i^loc)) ( c_i ∇L_i(θ_i(t); S_i) + µ ( θ_i(t) − Σ_{j∈N_i} (w_ij/d_i) θ_j(t) ) )
  2. user i sends its updated model θ_i(t+1) to its neighborhood N_i
• This is an instance of block coordinate descent!
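The update above can be simulated in a few lines. This is only a sketch: the quadratic stand-in loss L_i(θ) = ½∥θ − target_i∥² (so ∇L_i(θ) = θ − target_i and L_i^loc = 1), the fully connected unit-weight graph, and all names are assumptions for illustration; in a real deployment each θ_i lives on a separate device:

```python
import numpy as np

def async_round(Theta, W, targets, c, L_loc, mu, rng):
    """One activation: a random user i takes a block coordinate step
    using its local gradient and the weighted average of its neighbors."""
    n = Theta.shape[0]
    i = rng.integers(n)                               # random active user
    d_i = W[i].sum()                                  # degree d_i
    neigh_avg = (W[i] @ Theta) / d_i                  # sum_j (w_ij / d_i) theta_j
    grad = c[i] * (Theta[i] - targets[i]) + mu * (Theta[i] - neigh_avg)
    Theta[i] = Theta[i] - grad / (mu + c[i] * L_loc[i])
    return Theta

rng = np.random.default_rng(1)
n, p = 4, 2
targets = rng.normal(size=(n, p))                     # each user's own optimum
W = np.ones((n, n)) - np.eye(n)                       # fully connected, unit weights
Theta = np.zeros((n, p))
for _ in range(500):
    Theta = async_round(Theta, W, targets,
                        c=[1.0] * n, L_loc=[1.0] * n, mu=0.1, rng=rng)
```

With a small µ each θ_i lands close to its own target, only mildly pulled toward its neighbors, matching the role of µ on slide 10.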

  15. convergence rate
Proposition ([Bellet et al., 2018])
For any T > 0, let (Θ(t))_{t=1}^T be the sequence of iterates generated by the algorithm running for T iterations from an initial point Θ(0). When the local losses L_i are strongly convex, we have:
  E[f(Θ(T)) − f⋆] ≤ (1 − σ/(nL_max))^T ( f(Θ(0)) − f⋆ ),
where L_max = max_i L_i and σ are the smoothness and strong convexity parameters.
• Constant number of per-user updates → optimality gap roughly constant in n
• Makes the algorithm naturally scalable to many users
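To get a feel for the rate, plug in some numbers (all values here are made up for illustration, not taken from the paper):

```python
# Contraction per activation: (1 - sigma / (n * L_max)).
n, L_max, sigma = 100, 10.0, 1.0
rho = 1.0 - sigma / (n * L_max)        # 0.999 per step

# A constant number of updates *per user* means T grows linearly with n:
T = 10 * n * int(L_max / sigma)        # here 10 local updates per user "epoch"
remaining_gap = rho ** T               # fraction of the initial optimality gap left
```

Because T scales with n while the contraction factor degrades as 1/n, the remaining gap (here roughly e^{-10}) is essentially independent of the number of users, which is the scalability claim on the slide.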

  16. what about privacy?
• In some applications, data may be sensitive and users may not want to reveal it
• In our algorithms, users never communicate their local data, but they exchange sequences of models computed from data
• Consider an adversary observing all the information sent over the network (but not the internal memory of users)
• Goal: formally quantify how much information is leaked about the local dataset

  17. differential privacy
ϵ-Differential Privacy [Dwork, 2006]
Let M be a randomized mechanism taking a dataset as input, and let ϵ > 0. We say that M is ϵ-differentially private if for all datasets S, S′ differing in a single data point and for all sets of possible outputs O ⊆ range(M), we have:
  Pr(M(S) ∈ O) ≤ e^ϵ Pr(M(S′) ∈ O).
• Output of M is almost the same regardless of whether a particular data point was used
• Information-theoretic (no computational assumptions)
• Robust to background knowledge that the adversary may have
• Composition property: the combined output of two ϵ-DP mechanisms run on the same dataset is 2ϵ-DP

  18. differentially private algorithm
1. Replace the update of the algorithm by
   θ̃_i(t+1) = θ̃_i(t) − (1/(µ + c_i L_i^loc)) ( c_i ( ∇L_i(θ̃_i(t); S_i) + η_i ) + µ ( θ̃_i(t) − Σ_{j∈N_i} (w_ij/d_i) θ̃_j(t) ) ),
   where η_i ∼ Laplace(0, s_i)^p ∈ R^p
2. User i then broadcasts the noisy iterate θ̃_i(t+1) to its neighbors
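A sketch of the noisy step (function and variable names are mine; the point is only the Laplace perturbation of the local gradient before the usual coordinate step):

```python
import numpy as np

def private_update(theta_i, neigh_avg, grad_fn, c_i, L_loc_i, mu, s_i, rng):
    """Add Laplace(0, s_i) noise to each of the p coordinates of the local
    gradient, then take the block coordinate step; the returned noisy
    iterate is what user i broadcasts to its neighbors."""
    eta = rng.laplace(loc=0.0, scale=s_i, size=theta_i.shape)
    grad = c_i * (grad_fn(theta_i) + eta) + mu * (theta_i - neigh_avg)
    return theta_i - grad / (mu + c_i * L_loc_i)

# Seeded runs are reproducible; only the gradient (not the raw data) is perturbed.
theta = np.zeros(3)
out1 = private_update(theta, np.zeros(3), lambda th: th,
                      c_i=1.0, L_loc_i=1.0, mu=0.1, s_i=0.5,
                      rng=np.random.default_rng(0))
out2 = private_update(theta, np.zeros(3), lambda th: th,
                      c_i=1.0, L_loc_i=1.0, mu=0.1, s_i=0.5,
                      rng=np.random.default_rng(0))
```

Noise is injected inside the gradient term, so its effect on the broadcast iterate is damped by the same 1/(µ + c_i L_i^loc) step size as the true gradient.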

  19. privacy guarantee
Theorem ([Bellet et al., 2018])
Let i ∈ [n] and assume:
• ℓ(·; x, y) is L_0-Lipschitz w.r.t. the L_1-norm for all (x, y) ∈ X × Y
• User i wakes up T_i times and uses noise scale s_i = L_0/(ϵ_i m_i)
• Mechanism M_i(S_i): releases the sequence of user i's models
For any Θ̃(0) independent of S_i, M_i(S_i) is ϵ̄_i-DP with ϵ̄_i = T_i ϵ_i.
• Follows from a sensitivity analysis of the update
• Can be improved by strong composition [Kairouz et al., 2015] (under relaxed DP)
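The final budget ϵ̄_i = T_i ϵ_i is just the basic composition property from slide 17 applied T_i times; with illustrative numbers (not from the paper):

```python
# Each of user i's T_i released iterates is eps_i-DP on its own, so by basic
# composition the whole released sequence is (T_i * eps_i)-DP.
T_i, eps_i = 100, 0.005
eps_bar_i = T_i * eps_i    # total privacy budget spent by user i
```

This linear growth in T_i is why the number of wake-ups matters for privacy, and why strong composition (which grows roughly like √T_i under relaxed DP) is an improvement.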

  20. privacy/utility trade-off
Theorem ([Bellet et al., 2018])
For any T > 0, let (Θ̃(t))_{t=1}^T be the sequence of iterates generated by T iterations. For σ-strongly convex f, we have:
  E[f(Θ̃(T)) − f⋆] ≤ (1 − σ/(nL_max))^T ( f(Θ̃(0)) − f⋆ ) + (1/(nL_min)) Σ_{t=0}^{T−1} (1 − σ/(nL_max))^t Σ_{i=1}^n [d_i c_i s_i(t)]²,
where L_min = min_{1≤i≤n} L_i.
• T rules a trade-off between optimization error and noise error
• A good (differentially private) warm start can help a lot
• Users with less data add more noise, but their contribution to the error is smaller
• See the paper for details on the warm start strategy and how to scale noise across iterations

  21. extension: personalized l1-adaboost
• Consider a set of base models H = {h_k : X → R}_{k=1}^K (e.g., pre-trained on proxy data)
• Find personalized ensembles θ_1, ..., θ_n ∈ R^K as solutions to:
  min_{∥θ_1∥_1 ≤ β, ..., ∥θ_n∥_1 ≤ β, w ∈ R^{n(n−1)/2}_{≥0}} Σ_{i=1}^n d_i c_i log( Σ_{j=1}^{m_i} exp(−(A_i θ_i)_j) ) + µ Σ_{i<j} w_ij ∥θ_i − θ_j∥² + λ g(w)
• A_i ∈ R^{m_i×K}: margins of the base models on each data point of user i
• Use block coordinate Frank-Wolfe → communication cost logarithmic in K
• More details in [Zantedeschi et al., 2020]
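The reason Frank-Wolfe is cheap to communicate is that its linear minimization oracle over the l1 ball returns a single vertex ±β·e_k, so an update is fully described by one coordinate index (O(log K) bits) plus a sign and a step size. A minimal sketch of one such step on a toy quadratic objective (names and the toy problem are mine, not the paper's algorithm):

```python
import numpy as np

def fw_step_l1(theta, grad, beta, t):
    """One Frank-Wolfe step over the l1 ball {theta : ||theta||_1 <= beta}.
    The oracle picks the single coordinate with the largest |gradient|."""
    k = int(np.argmax(np.abs(grad)))          # coordinate to move
    s = np.zeros_like(theta)
    s[k] = -beta * np.sign(grad[k])           # vertex minimizing <grad, s>
    gamma = 2.0 / (t + 2.0)                   # standard FW step size
    return (1.0 - gamma) * theta + gamma * s

# Toy run: minimize 0.5 * ||theta - target||^2 over the l1 ball of radius 1.
target = np.array([2.0, -1.0, 0.5])
theta = np.zeros(3)
for t in range(200):
    theta = fw_step_l1(theta, theta - target, beta=1.0, t=t)
```

Every iterate is a convex combination of points in the ball, so the constraint ∥θ∥_1 ≤ β holds at all times with no projection step.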

  22. learning the graph given models
