cpSGD: communication-efficient and differentially-private distributed SGD



slide-1
SLIDE 1

cpSGD: communication-efficient and differentially-private distributed SGD

Naman Agarwal, Ananda Theertha Suresh, Felix X. Yu Sanjiv Kumar, H. Brendan McMahan

slide-2
SLIDE 2

Distributed learning with mobile devices

Train a centralized model; data stays on mobile phones. In each iteration...

slide-3
SLIDE 3

Server sends model to clients...

w ∈ R^d: the model vector

slide-4
SLIDE 4

Clients send updates back...

Server update: w ← w - learning_rate · (1/n) ∑_i δw_i

  • n: number of clients
  • δw_i: gradient of the i-th client
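The averaging update on this slide can be sketched in a few lines. This is a minimal illustration only; the function name `server_update` and the toy numbers are assumptions, not part of the original slides.

```python
from typing import List


def server_update(w: List[float],
                  client_grads: List[List[float]],
                  learning_rate: float) -> List[float]:
    """Return w - learning_rate * (1/n) * sum_i client_grads[i]."""
    n = len(client_grads)  # number of clients
    d = len(w)             # model dimension
    # Coordinate-wise mean of the client gradients.
    mean_grad = [sum(g[j] for g in client_grads) / n for j in range(d)]
    return [w[j] - learning_rate * mean_grad[j] for j in range(d)]


# Toy example: two clients, two coordinates (made-up numbers).
w = [1.0, 2.0]
grads = [[1.0, 0.0], [3.0, 4.0]]
new_w = server_update(w, grads, learning_rate=0.5)
print(new_w)  # [0.0, 1.0]
```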

slide-5
SLIDE 5

Challenge I: uplink communication is expensive

Server update: w ← w - learning_rate · (1/n) ∑_i Q(δw_i)

  • Q: quantization
slide-6
SLIDE 6

How to design the quantization?

  • Convergence of SGD depends on the MSE of the estimated gradient.
  • Sufficient to study: bits vs. quantization error in distributed mean estimation.

○ No compression (float): 32 bits per coordinate; 0 MSE.
○ Binary quantization: 1 bit; O(d/n) MSE.
○ Variable-length coding: O(1/n) MSE.
○ [Suresh et al., 17] [Alistarh et al., 17] [Wen et al., 17] [Bernstein et al., 18]
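To make the binary-quantization entry concrete, here is a minimal sketch of one common scheme, stochastic binary quantization, in which each coordinate is rounded at random to the vector's minimum or maximum so that the result is unbiased. The slides do not spell out the exact scheme, so treat this as an assumption; all function names and the toy data are hypothetical.

```python
import random
from typing import List, Tuple


def binary_quantize(x: List[float], rng: random.Random) -> Tuple[List[int], float, float]:
    """Stochastically round each coordinate to min(x) or max(x): one bit each."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return [0] * len(x), lo, hi
    # P(bit = 1) is chosen so that the dequantized value is unbiased.
    bits = [1 if rng.random() < (v - lo) / (hi - lo) else 0 for v in x]
    return bits, lo, hi


def dequantize(bits: List[int], lo: float, hi: float) -> List[float]:
    """Map each bit back to the corresponding endpoint."""
    return [hi if b else lo for b in bits]


def quantized_mean(client_vectors: List[List[float]], rng: random.Random) -> List[float]:
    """Server-side estimate: average of the dequantized client vectors."""
    n, d = len(client_vectors), len(client_vectors[0])
    total = [0.0] * d
    for x in client_vectors:
        bits, lo, hi = binary_quantize(x, rng)
        for j, q in enumerate(dequantize(bits, lo, hi)):
            total[j] += q
    return [t / n for t in total]


def squared_error(a: List[float], b: List[float]) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b))


# Toy experiment (made-up data): the error should shrink as n grows.
rng = random.Random(0)
d = 20
for n in (10, 1000):
    clients = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
    true_mean = [sum(c[j] for c in clients) / n for j in range(d)]
    err = squared_error(quantized_mean(clients, rng), true_mean)
    print(f"n={n}: squared error of the quantized mean = {err:.4f}")
```

Each client sends d bits plus the two endpoint values, compared with 32 bits per coordinate for uncompressed floats.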

slide-7
SLIDE 7

Challenge II: user privacy is important

  • Differential privacy (DP)

○ Removing or changing a single client's data should not result in a big difference in the estimated mean.
○ Adding Gaussian noise [Abadi et al., 16]

  • Goal of this paper: both communication efficiency and differential privacy

slide-8
SLIDE 8

Attempt 1: add Gaussian noise on the server

Server estimate: (1/n) ∑_i Q(x_i) + Gaussian noise

  • DP results readily available

○ Assuming L2 norm of the gradient is bounded (gradient clipping).

  • Server has to be trustworthy.
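A minimal sketch of this server-side approach: clip each client vector to a bounded L2 norm, average, then add Gaussian noise on the server. Quantization is omitted for brevity. The names `clip_norm` and `sigma` are hypothetical parameters, and choosing `sigma` to meet a given privacy target is not shown here.

```python
import math
import random
from typing import List


def clip_l2(x: List[float], clip_norm: float) -> List[float]:
    """Scale x down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= clip_norm or norm == 0.0:
        return list(x)
    scale = clip_norm / norm
    return [v * scale for v in x]


def noisy_mean(client_vectors: List[List[float]],
               clip_norm: float,
               sigma: float,
               rng: random.Random) -> List[float]:
    """Average the clipped vectors, then add N(0, sigma^2) noise per coordinate.

    The noise is added by the server, so the server sees every client's
    raw (clipped) vector; this is why the server must be trusted.
    """
    n, d = len(client_vectors), len(client_vectors[0])
    clipped = [clip_l2(x, clip_norm) for x in client_vectors]
    mean = [sum(c[j] for c in clipped) / n for j in range(d)]
    return [m + rng.gauss(0.0, sigma) for m in mean]


# Toy usage with made-up numbers.
rng = random.Random(0)
clients = [[3.0, 4.0], [0.1, -0.2], [-1.0, 0.5]]
print(noisy_mean(clients, clip_norm=1.0, sigma=0.1, rng=rng))
```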
slide-9
SLIDE 9

Attempt 2: add Gaussian noise on the client

Server estimate: (1/n) ∑_i Q(x_i)

  • Noise added after quantization: the message is no longer discrete, so communication efficiency is lost.
  • Noise added before quantization: the privacy guarantee is hard to analyze.
slide-10
SLIDE 10

cpSGD: add binomial noise after quantization

Server estimate: (1/n) ∑_i Q(x_i)

slide-11
SLIDE 11

cpSGD

  • Maintains communication efficiency

○ Binomial is discrete.

  • Differentially private

○ Binomial similar to Gaussian.
○ Extended to d dimensions with an improved bound.

  • Works if server is negligent but not malicious
  • Works even if clients do not trust the server

○ Secure aggregation.
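The idea of adding binomial noise after quantization can be sketched as follows. This is an illustrative reading of the slides, not the authors' exact algorithm: the number of levels `k`, the noise parameters `m` and `p`, the bound `x_max`, and all function names are hypothetical, and no privacy accounting is performed.

```python
import random
from typing import List


def stochastic_round_to_level(v: float, x_max: float, k: int, rng: random.Random) -> int:
    """Unbiased rounding of v in [-x_max, x_max] to one of k evenly spaced levels.

    Returns the level index in {0, ..., k-1}.
    """
    v = max(-x_max, min(x_max, v))
    step = 2.0 * x_max / (k - 1)
    pos = (v + x_max) / step  # real-valued position in [0, k-1]
    low = int(pos)            # floor, since pos >= 0
    if low >= k - 1:
        return k - 1
    return low + (1 if rng.random() < pos - low else 0)


def binomial_sample(m: int, p: float, rng: random.Random) -> int:
    """Draw from Binomial(m, p) as a sum of m Bernoulli(p) trials."""
    return sum(1 for _ in range(m) if rng.random() < p)


def client_message(x: List[float], x_max: float, k: int, m: int, p: float,
                   rng: random.Random) -> List[int]:
    """Quantize each coordinate, then add integer binomial noise."""
    return [stochastic_round_to_level(v, x_max, k, rng) + binomial_sample(m, p, rng)
            for v in x]


def server_estimate(messages: List[List[int]], x_max: float, k: int,
                    m: int, p: float) -> List[float]:
    """Average the integer messages, remove the noise mean m*p, and rescale."""
    n, d = len(messages), len(messages[0])
    step = 2.0 * x_max / (k - 1)
    estimate = []
    for j in range(d):
        total = sum(msg[j] for msg in messages)
        centered = total / n - m * p
        estimate.append(-x_max + step * centered)
    return estimate


# Toy usage with made-up data: 500 clients holding the same vector.
rng = random.Random(0)
msgs = [client_message([0.3, -0.5], 1.0, 5, 16, 0.5, rng) for _ in range(500)]
print(server_estimate(msgs, 1.0, 5, 16, 0.5))
```

Because the noise is an integer, each transmitted coordinate remains an integer in the range 0 to k - 1 + m and fits in ceil(log2(k + m)) bits, which is how the discrete noise preserves communication efficiency.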

slide-12
SLIDE 12

Tue Dec 4th 05:00 -- 07:00 PM Room 210 & 230 AB #27

For d variables and n ≈ d clients, cpSGD uses

  • O(log log(nd)) bits of communication per client per coordinate
  • Constant privacy