

SLIDE 1

privacy-preserving decentralized learning of personalized models and collaboration graphs

Aurélien Bellet (Inria)

Includes work with:

  • M. Tommasi, P. Vanhaesebrouck (University of Lille & Inria)
  • R. Guerraoui, M. Taziki (EPFL)
  • V. Zantedeschi (University of Saint-Etienne)

Workshop on Optimization for Machine Learning Centre International de Rencontres Mathématiques, Marseille March 10, 2020

SLIDE 2

connected devices: pervasive or invasive?

  • Connected devices are spreading rapidly and collect increasingly personal data
  • Ex: browsing logs, health, speech, accelerometer, geolocation...
  • Opportunity to provide personalized services but also a potential threat to privacy
  • A first step to try and reconcile the two: keep and process data on the user device
  • Training on the edge: train ML model on data from many devices

SLIDE 3

training on the edge: challenges

  • How to deal with imbalanced and non-i.i.d. local datasets
  • How to scale to a large number of devices
  • How to provide formal privacy guarantees
  • ...

SLIDE 4

federated vs fully decentralized training

Standard federated learning

  • Coordination by a central server
  • Single point of failure; the server may become a bottleneck

Fully decentralized learning

  • Device-to-device communication in a sparse network graph
  • Naturally scales to many devices

See [Kairouz et al., 2019] for a detailed overview of federated/decentralized ML

SLIDE 5

global model vs personalized models

Global model

  • One-size-fits-all: the same model makes predictions for all devices
  • Model should be trained on data from all users
  • A large model may be needed to capture the specificities of each user

Personalized models

  • One model per device
  • Model should be trained on data from that user and from similar users
  • Smaller models may be sufficient

SLIDE 6
our approach

We propose to learn personalized models in a fully decentralized setting:

  • Learn “who to communicate with” by inferring a graph of similarities between users
  • Collaboratively learn personalized models over this graph
  • Jointly optimize the models and the graph, in an alternating fashion

SLIDE 7

problem formulation

SLIDE 8

users and local datasets

  • A set of n users (devices) with common feature space X and label space Y
  • User i has a local dataset Si = {(x_i^j, y_i^j)}_{j=1}^{mi} drawn from a personal distribution, and wants to learn a model θi ∈ R^p which generalizes well to future local data

  • Let ℓ : Rp × X × Y → R be a loss function, differentiable in first argument
  • In isolation, user i can learn a model by minimizing a local objective Li(θ; Si), e.g.,

    Li(θ; Si) = (1/mi) ∑_{j=1}^{mi} ℓ(θ; x_i^j, y_i^j) + λi ∥θ∥², with λi ≥ 0

  • This will generalize poorly when local data is scarce → need to collaborate
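The local objective Li(θ; Si) above can be sketched in a few lines. A minimal sketch, taking logistic loss as a concrete (illustrative) choice of ℓ; the data and λ value are made up:

```python
import numpy as np

def local_objective(theta, X, y, lam):
    """L_i(theta; S_i): average loss over the local dataset plus an L2 penalty.
    Logistic loss stands in for the generic differentiable loss l."""
    margins = y * (X @ theta)
    loss = np.mean(np.log1p(np.exp(-margins)))   # (1/m_i) sum_j l(theta; x_i^j, y_i^j)
    return loss + lam * theta @ theta            # + lambda_i ||theta||^2

# A user with only m_i = 5 local examples: minimizing this in isolation
# overfits, which is what motivates collaborating with similar users.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]))
print(local_objective(np.zeros(3), X, y, lam=0.1))  # log(2) at theta = 0
```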

SLIDE 9

decentralized setting

  • Asynchronous time model: each user becomes active at random times, asynchronously and in parallel (we use a global counter t to denote the t-th activation)
  • Communication model: all users can exchange messages, but we want to restrict communication to pairs of most similar users
  • We model this by a collaboration graph: a sparse weighted graph with edge weight wij ≥ 0 reflecting similarity between the learning tasks of users i and j

SLIDE 10

joint optimization problem

  • Learn personalized models Θ ∈ R^{n×p} and graph weights w ∈ R^{n(n−1)/2}_{≥0} as solutions to

    min_{Θ ∈ R^{n×p}, w ∈ R^{n(n−1)/2}_{≥0}}  J(Θ, w) = ∑_{i=1}^{n} di ci Li(θi; Si) + (µ/2) ∑_{i<j} wij ∥θi − θj∥² + λ g(w)

  • ci ∈ (0, 1] ∝ mi: “confidence” of user i; di = ∑_{j≠i} wij: degree of i
  • Trade-off between accurate models on local data and smooth models over the graph
  • Term g(w): avoid trivial collaboration graph, encourage sparsity
  • Flexible relationships: hyperparameter µ ≥ 0 interpolates between learning purely local models and a shared model per connected component
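The objective J(Θ, w) can be evaluated term by term. A minimal sketch; the quadratic local losses and all constants in the usage example are illustrative, not from the talk:

```python
import numpy as np

def joint_objective(Theta, W, local_losses, c, mu, lam, g):
    """Evaluate J(Theta, w): degree/confidence-weighted data-fitting term,
    graph smoothness term, and graph regularizer g on the edge weights."""
    n = len(c)
    d = W.sum(axis=1)                                  # degrees d_i = sum_j w_ij
    fit = sum(d[i] * c[i] * local_losses[i](Theta[i]) for i in range(n))
    iu = np.triu_indices(n, k=1)                       # edges with i < j
    sq = np.sum((Theta[:, None, :] - Theta[None, :, :])**2, axis=-1)
    smooth = 0.5 * mu * np.sum(W[iu] * sq[iu])
    return fit + smooth + lam * g(W[iu])

# Two users, scalar models, quadratic local losses (purely illustrative).
Theta = np.array([[1.0], [3.0]])
W = np.array([[0.0, 2.0], [2.0, 0.0]])
J = joint_objective(Theta, W, [lambda t: t @ t] * 2, c=[1.0, 1.0],
                    mu=1.0, lam=0.0, g=lambda w: 0.0)
print(J)  # fit 2*1 + 2*9 = 20, smoothness 0.5*1*(2*4) = 4, total 24.0
```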

SLIDE 11
outline of the proposed algorithm

We design an alternating optimization procedure over Θ and w:

  • 1. A decentralized algorithm to learn the models given the graph
  • 2. A decentralized algorithm to learn a graph given the models

SLIDE 12

learning models given the graph

SLIDE 13

properties of objective function

  • For fixed graph weights w, denote f(Θ) := J(Θ, w)
  • Assume each local loss Li has L_i^loc-Lipschitz continuous gradient
  • Then ∇f is Li-Lipschitz w.r.t. block θi, with Li = di (µ + ci L_i^loc)
  • Can also assume that Li is σ_i^loc-strongly convex, with σ_i^loc > 0
  • Then f is σ-strongly convex, with σ ≥ min_{1≤i≤n} [di ci σ_i^loc] > 0

SLIDE 14

decentralized algorithm

  • Denote neighborhood of user i by Ni = {j : wij > 0}
  • Initialize models Θ(0) ∈ Rn×p
  • At step t ≥ 0, a random user i becomes active:
  • 1. User i updates its model based on its local dataset Si and the information from its neighbors:

    θi(t+1) = θi(t) − 1/(µ + ci L_i^loc) · ( ci ∇Li(θi(t); Si) + µ θi(t) − µ ∑_{j∈Ni} (wij/di) θj(t) )

  • 2. User i sends its updated model θi(t+1) to its neighborhood Ni
  • This is an instance of block coordinate descent!
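The block coordinate descent step can be simulated end to end. A minimal sketch assuming quadratic local losses Li(θ) = ½∥θ − ti∥² (so ∇Li(θ) = θ − ti and L_i^loc = 1); the targets, graph, and µ are illustrative:

```python
import numpy as np

def cd_step(i, Theta, W, grads, c, L_loc, mu):
    """One activation of user i: a block coordinate descent step on
    f(Theta) = sum_i d_i c_i L_i(theta_i) + (mu/2) sum_{i<j} w_ij ||theta_i - theta_j||^2
    with step size 1/L_i, where L_i = d_i (mu + c_i L_loc_i)."""
    d_i = W[i].sum()
    nbr_avg = (W[i] @ Theta) / d_i                       # sum_j (w_ij / d_i) theta_j
    direction = c[i] * grads[i](Theta[i]) + mu * (Theta[i] - nbr_avg)
    Theta = Theta.copy()
    Theta[i] -= direction / (mu + c[i] * L_loc[i])       # user i then broadcasts Theta[i] to N_i
    return Theta

# Three users on a complete graph, quadratic losses pulling toward targets t_i.
rng = np.random.default_rng(1)
targets = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
W = np.ones((3, 3)) - np.eye(3)
grads = [lambda th, t=t: th - t for t in targets]        # gradient of 0.5||theta - t||^2
Theta = np.zeros((3, 2))
for _ in range(300):
    Theta = cd_step(rng.integers(3), Theta, W, grads, c=[1.0]*3, L_loc=[1.0]*3, mu=0.5)
# Each model ends up between its own target (mu -> 0) and the global mean (mu -> infinity).
```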

SLIDE 15

convergence rate

Proposition ([Bellet et al., 2018]) For any T > 0, let (Θ(t))_{t=1}^T be the sequence of iterates generated by the algorithm running for T iterations from an initial point Θ(0). When the local losses Li are strongly convex, we have:

    E[f(Θ(T)) − f⋆] ≤ (1 − σ/(n Lmax))^T (f(Θ(0)) − f⋆),

where Lmax = max_i Li and σ are the smoothness and strong convexity parameters.

  • With a constant number of updates per user (i.e., T proportional to n), the optimality gap stays roughly constant in n
  • Makes the algorithm naturally scalable to many users

SLIDE 16

what about privacy?

  • In some applications, data may be sensitive and users may not want to reveal it
  • In our algorithms, users never communicate their local data, but they exchange sequences of models computed from data
  • Consider an adversary observing all the information sent over the network (but not the internal memory of users)
  • Goal: formally quantify how much information is leaked about the local dataset

SLIDE 17

differential privacy

ϵ-Differential Privacy [Dwork, 2006] Let M be a randomized mechanism taking a dataset as input, and let ϵ > 0. We say that M is ϵ-differentially private if, for all datasets S, S′ differing in a single data point and for all sets of possible outputs O ⊆ range(M), we have:

    Pr(M(S) ∈ O) ≤ e^ϵ Pr(M(S′) ∈ O).

  • Output of M almost the same regardless of whether a particular data point was used
  • Information-theoretic (no computational assumptions)
  • Robust to background knowledge that adversary may have
  • Composition property: the combined output of two ϵ-DP mechanisms run on the same dataset is 2ϵ-DP
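A standard way to achieve ϵ-DP for a numeric output is the Laplace mechanism: add noise scaled to (L1-sensitivity)/ϵ, which is also the kind of noise injected in the private algorithm later in the talk. A generic sketch; the mean query and all constants are purely illustrative:

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps, rng):
    """Release value + Laplace(sensitivity/eps) noise: eps-DP for any query
    whose L1-sensitivity over neighboring datasets is at most `sensitivity`."""
    return value + rng.laplace(0.0, sensitivity / eps, size=np.shape(value))

# Example: the mean of m points in [0, 1] changes by at most 1/m when one
# point changes, so its sensitivity is 1/m -- the noise needed for a fixed
# eps shrinks as the dataset grows.
rng = np.random.default_rng(0)
data = rng.uniform(size=1000)
release = laplace_mechanism(data.mean(), sensitivity=1 / len(data), eps=0.5, rng=rng)
```

By the composition property above, releasing T such outputs computed on the same dataset is Tϵ-DP, which is why the per-release budget matters when an iterative algorithm publishes many models.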

SLIDE 18

differentially private algorithm

  • 1. Replace the update of the algorithm by

    θ̃i(t+1) = θ̃i(t) − 1/(µ + ci L_i^loc) · ( ci ( ∇Li(θ̃i(t); Si) + ηi ) + µ θ̃i(t) − µ ∑_{j∈Ni} (wij/di) θ̃j(t) ),  where ηi ∼ Laplace(0, si)^p ∈ R^p

  • 2. User i then broadcasts the noisy iterate θ̃i(t+1) to its neighbors

SLIDE 19

privacy guarantee

Theorem ([Bellet et al., 2018]) Let i ∈ [n] and assume:

  • ℓ(·; x, y) is L0-Lipschitz w.r.t. the L1-norm for all (x, y) ∈ X × Y
  • User i wakes up Ti times and uses noise scale si = L0 / (mi ϵi)
  • The mechanism Mi(Si) releases the sequence of user i’s models

Then, for any Θ(0) independent of Si, Mi(Si) is ϵ̄i-DP with ϵ̄i = Ti ϵi.

  • Follows from sensitivity analysis of the update
  • Can be improved by strong composition [Kairouz et al., 2015] (under relaxed DP)

SLIDE 20

privacy/utility trade-off

Theorem ([Bellet et al., 2018]) For any T > 0, let (Θ̃(t))_{t=1}^T be the sequence of iterates generated by T iterations of the private algorithm. For σ-strongly convex f, we have:

    E[f(Θ̃(T)) − f⋆] ≤ (1 − σ/(n Lmax))^T (f(Θ̃(0)) − f⋆) + (1/(n Lmin)) ∑_{t=0}^{T−1} ∑_{i=1}^{n} (1 − σ/(n Lmax))^t [di ci si(t)]²,

where Lmin = min_{1≤i≤n} Li.

  • Users with less data add more noise but their contribution to the error is smaller
  • T controls a trade-off between optimization error and noise error
  • A good (differentially private) warm start can help a lot
  • See paper for details on warm start strategy and how to scale noise across iterations

SLIDE 21

extension: personalized l1-adaboost

  • Consider a set of base models H = {hk : X → R}_{k=1}^K (e.g., pre-trained on proxy data)
  • Find personalized ensembles θ1, …, θn ∈ R^K as solutions to:

    min_{∥θ1∥1 ≤ β, …, ∥θn∥1 ≤ β; w ∈ R^{n(n−1)/2}_{≥0}}  ∑_{i=1}^{n} di ci log( ∑_{j=1}^{mi} exp( −(Ai θi)_j ) ) + (µ/2) ∑_{i<j} wij ∥θi − θj∥² + λ g(w)

  • Ai ∈ R^{mi×K}: margins of the base models on each data point of user i
  • Use block coordinate Frank-Wolfe → communication cost logarithmic in K
  • More details in [Zantedeschi et al., 2020]
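The logarithmic communication cost comes from a property of Frank-Wolfe over an L1 ball: the linear minimization step returns a single signed vertex β·(±e_k), so each update touches one coordinate and can be communicated as an index. A sketch on the local (exponential-surrogate) part of the objective only, with illustrative data, β, and step sizes:

```python
import numpy as np

def ensemble_loss(theta, A):
    """log sum_j exp(-(A theta)_j), with A in R^{m x K} holding the margins
    of the K base models on each of the m local examples."""
    z = -(A @ theta)
    zmax = z.max()
    return zmax + np.log(np.sum(np.exp(z - zmax)))     # numerically stable log-sum-exp

def frank_wolfe_step(theta, A, beta, gamma):
    """One Frank-Wolfe step over {||theta||_1 <= beta}: the linear minimizer
    is a single signed vertex, hence a one-coordinate (sparse) update."""
    z = -(A @ theta)
    p = np.exp(z - z.max()); p /= p.sum()
    grad = -A.T @ p                                    # gradient of ensemble_loss
    k = np.argmax(np.abs(grad))
    vertex = np.zeros_like(theta)
    vertex[k] = -beta * np.sign(grad[k])               # minimizes <grad, v> over the ball
    return (1 - gamma) * theta + gamma * vertex

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 8))                           # 20 examples, K = 8 base models
theta = np.zeros(8)
for t in range(100):
    theta = frank_wolfe_step(theta, A, beta=1.0, gamma=2 / (t + 2))
```

Transmitting only the chosen coordinate index (O(log K) bits) plus its value is what keeps per-update communication logarithmic in the number of base models.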

SLIDE 22

learning the graph given models

SLIDE 23

regularization on the graph weights

Recall the joint problem:

    min_{Θ ∈ R^{n×p}, w ∈ R^{n(n−1)/2}_{≥0}}  J(Θ, w) = ∑_{i=1}^{n} di ci Li(θi; Si) + (µ/2) ∑_{i<j} wij ∥θi − θj∥² + λ g(w)

  • Inspired by [Kalofolias, 2016], we can define

    g(w) = β∥w∥² − 1ᵀ log(d + δ)  (with δ a small constant)

  • Log barrier on the degree vector d to avoid isolated users, and L2 penalty on the weights to control the graph sparsity
  • For fixed models Θ, the resulting objective h(w) := J(Θ, w) is strongly convex

SLIDE 24

decentralized algorithm

  • We rely on decentralized peer sampling [Jelasity et al., 2007] to let users communicate with a set of κ random peers
  • Initialize weights w(0); choose parameter κ ∈ {1, …, n − 1}
  • At each step t ≥ 0, a random user i becomes active:
  • 1. Draw a set K of κ users and request their model, loss and degree
  • 2. Update the associated weights w(t+1)_{i,K} = (w(t+1)_{ij})_{j∈K} ∈ R^κ:

    w(t+1)_{i,K} ← max( 0, w(t)_{i,K} − (1/Lκ) [∇h(w(t))]_{i,K} ),  where Lκ = 2µ( (κ+1)/δ² + β ) is the block Lipschitz constant of ∇h(w)

  • 3. Send each updated weight w(t+1)_{ij} to the associated user j ∈ K
  • This is proximal block coordinate descent with an overlapping block structure
  • Can be extended to any weight/degree-separable g
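For fixed models, the partial derivative of h(w) with respect to a single weight wij has a closed form involving only quantities user i can request from peer j (its loss value and degree, plus the model distance). A sketch assuming the g(w) defined above; `weight_grad`/`weight_step` are hypothetical helper names and all constants in use are illustrative:

```python
import numpy as np

def weight_grad(i, j, W, losses, sq_dists, c, mu, lam, beta, delta):
    """d h / d w_ij for fixed models, where
    h(w) = sum_i d_i c_i L_i + (mu/2) sum_{i<j} w_ij ||theta_i - theta_j||^2
           + lam * (beta ||w||^2 - 1^T log(d + delta))."""
    d = W.sum(axis=1)                                    # degrees d_i
    return (c[i] * losses[i] + c[j] * losses[j]          # w_ij enters degrees d_i and d_j
            + 0.5 * mu * sq_dists[i, j]                  # smoothness term
            + lam * (2 * beta * W[i, j]                  # L2 penalty
                     - 1 / (d[i] + delta) - 1 / (d[j] + delta)))  # log barrier

def weight_step(i, peers, W, losses, sq_dists, c, mu, lam, beta, delta, kappa):
    """One activation of user i: projected gradient step on (w_ij)_{j in peers},
    using the block Lipschitz constant from the slide."""
    L_kappa = 2 * mu * ((kappa + 1) / delta**2 + beta)
    gs = [weight_grad(i, j, W, losses, sq_dists, c, mu, lam, beta, delta)
          for j in peers]                                # all gradients taken at w(t)
    W = W.copy()
    for j, g in zip(peers, gs):
        W[i, j] = W[j, i] = max(0.0, W[i, j] - g / L_kappa)  # projection onto w_ij >= 0
    return W
```

Each updated wij is then sent back to peer j, keeping the two copies of the symmetric weight consistent.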

SLIDE 25

convergence rate

Theorem ([Zantedeschi et al., 2020]) For any T > 0, let (w(t))_{t=1}^T be the sequence of iterates generated by the algorithm running for T iterations from an initial point w(0). We have:

    E[h(w(T)) − h⋆] ≤ ρ^T (h(w(0)) − h⋆),  where ρ = 1 − (4/(n(n−1))) · κβδ² / (κ + 1 + 2βδ²)

  • κ can be used to trade off communication cost against convergence speed
  • Communication cost per iteration is linear in κ, while its impact on ρ fades quickly (due to the worst-case dependence of Lκ on κ)
  • κ = 1 minimizes total communication cost if moderate precision is sufficient, while larger values reduce the number of rounds

SLIDE 26

numerical experiments

SLIDE 27

synthetic linear classification task

  • Results when using the “oracle” graph

[Figure: test accuracy vs. dimension p (20–100), comparing purely local models, non-private CD, and private CD at three privacy levels (1.05, 0.55, 0.15)]

SLIDE 28

synthetic linear classification task

  • We approximately recover the ground-truth cluster structure
  • Prediction accuracy is close to that of the oracle graph

[Figure: similarity matrices on the synthetic task — “Oracle” (ground-truth graph) vs. “Dada-Learned” (graph learned by our approach)]

SLIDE 29

real datasets

  • Real datasets that are naturally collected at the user/device level
  • Number of users n from 23 to 190, no task similarity available
  • Linear models and nonlinear ensembles
  • Our approach clearly outperforms both global and purely local models

Dataset    Global-lin  Local-lin  Ours-lin  Global-nonlin  Local-nonlin  Ours-nonlin
Harws      93.64       92.69      96.31     94.34          93.16         95.70
Vehicle    87.11       90.38      91.37     88.02          90.59         90.81
Computer   62.18       60.68      69.08     69.16          66.61         72.09
School     57.06       70.43      71.92     69.16          66.61         72.22

(bold blue = best, regular blue = second best)

SLIDE 30

lots to do at the intersection of optimization and privacy!

  • So far, differentially private optimization algorithms remain somewhat naive adaptations of standard optimizers
  • A lot of tricks are needed to make them work in practice (e.g., gradient clipping)
  • There is room for better coupling between DP and optimization, e.g., by designing optimization algorithms with DP constraints in mind
  • Federated and decentralized settings: how to accommodate efficient crypto primitives which can improve the privacy-utility trade-off and robustness

SLIDE 31

Thank you for your attention! Questions?

SLIDE 32

references I

[Bellet et al., 2018] Bellet, A., Guerraoui, R., Taziki, M., and Tommasi, M. (2018). Personalized and Private Peer-to-Peer Machine Learning. In AISTATS.

[Dwork, 2006] Dwork, C. (2006). Differential Privacy. In ICALP.

[Jelasity et al., 2007] Jelasity, M., Voulgaris, S., Guerraoui, R., Kermarrec, A.-M., and van Steen, M. (2007). Gossip-based peer sampling. ACM Trans. Comput. Syst., 25(3).

[Kairouz et al., 2019] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., D’Oliveira, R. G. L., Rouayheb, S. E., Evans, D., Gardner, J., Garrett, Z., Gascón, A., Ghazi, B., Gibbons, P. B., Gruteser, M., Harchaoui, Z., He, C., He, L., Huo, Z., Hutchinson, B., Hsu, J., Jaggi, M., Javidi, T., Joshi, G., Khodak, M., Konečný, J., Korolova, A., Koushanfar, F., Koyejo, S., Lepoint, T., Liu, Y., Mittal, P., Mohri, M., Nock, R., Özgür, A., Pagh, R., Raykova, M., Qi, H., Ramage, D., Raskar, R., Song, D., Song, W., Stich, S. U., Sun, Z., Suresh, A. T., Tramèr, F., Vepakomma, P., Wang, J., Xiong, L., Xu, Z., Yang, Q., Yu, F. X., Yu, H., and Zhao, S. (2019). Advances and Open Problems in Federated Learning. Technical report, arXiv:1912.04977.

SLIDE 33

references II

[Kairouz et al., 2015] Kairouz, P., Oh, S., and Viswanath, P. (2015). The Composition Theorem for Differential Privacy. In ICML.

[Kalofolias, 2016] Kalofolias, V. (2016). How to learn a graph from smooth signals. In AISTATS.

[Vanhaesebrouck et al., 2017] Vanhaesebrouck, P., Bellet, A., and Tommasi, M. (2017). Decentralized Collaborative Learning of Personalized Models over Networks. In AISTATS.

[Zantedeschi et al., 2020] Zantedeschi, V., Bellet, A., and Tommasi, M. (2020). Fully decentralized joint learning of personalized models and collaboration graphs. In AISTATS.