SLIDE 1
cpSGD: communication-efficient and differentially-private distributed SGD
Naman Agarwal, Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, H. Brendan McMahan
Distributed learning with mobile devices
Train a centralized model; data stays on
SLIDE 2
SLIDE 3
Server sends model to clients...
[Figure: the server sends w to each client]
w ∈ R^d: the model vector
SLIDE 4
Clients send updates back...
[Figure: clients send δw1, δw2, δw3, δw4 to the server]
Server update: w - learning_rate · ∑_i δw_i / n
n: number of clients
δw_i: gradient of the i-th client
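The server update on this slide can be sketched in a few lines. This is a minimal illustration only: the function name `sgd_server_step` and the toy numbers are assumptions, not part of the slides.

```python
def sgd_server_step(w, client_grads, learning_rate):
    """Return w - learning_rate * (1/n) * sum_i client_grads[i]."""
    n = len(client_grads)  # number of clients
    d = len(w)             # model dimension
    avg = [sum(g[j] for g in client_grads) / n for j in range(d)]
    return [w[j] - learning_rate * avg[j] for j in range(d)]


# Toy example with n = 4 clients and d = 2 (values are made up).
w = [1.0, -2.0]
client_grads = [[0.5, 0.0], [1.5, 2.0], [1.0, -1.0], [1.0, 3.0]]
new_w = sgd_server_step(w, client_grads, learning_rate=0.1)
```

The only quantity the server needs from the clients is the mean of their gradients, which is why the later slides reduce the problem to distributed mean estimation.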
SLIDE 5
Challenge I: uplink communication is expensive
[Figure: clients send quantized updates Q(δw1), Q(δw2), Q(δw3), Q(δw4) to the server]
Server update: w - learning_rate · ∑_i Q(δw_i) / n
- Q: quantization
SLIDE 6
How to design the quantization?
- Convergence of SGD depends on the MSE of the estimated gradient.
- Sufficient to study: bits vs. quantization error in distributed mean estimation.
○ No compression (float): 32 bits per coordinate; 0 MSE.
○ Binary quantization: 1 bit per coordinate; O(d/n) MSE.
○ Variable length coding: O(1/n) MSE.
○ [Suresh et al., 17] [Alistarh et al., 17] [Wen et al., 17] [Bernstein et al., 18]
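One common form of binary quantization is stochastic rounding of each coordinate to the vector's minimum or maximum. The sketch below assumes that form; the slides do not spell out the exact scheme, and the names `binary_quantize` and `estimate_mean` are hypothetical.

```python
import random


def binary_quantize(x, rng):
    """Stochastic 1-bit quantization: each coordinate becomes min(x) or max(x).

    Coordinate x_j is rounded up to max(x) with probability
    (x_j - min) / (max - min), so the result is unbiased: E[Q(x)] = x.
    """
    lo, hi = min(x), max(x)
    if hi == lo:
        return list(x)
    return [hi if rng.random() < (xj - lo) / (hi - lo) else lo for xj in x]


def estimate_mean(client_vectors, rng):
    """Server-side estimate: average of the clients' quantized vectors."""
    n = len(client_vectors)
    d = len(client_vectors[0])
    quantized = [binary_quantize(x, rng) for x in client_vectors]
    return [sum(q[j] for q in quantized) / n for j in range(d)]


rng = random.Random(0)
clients = [[0.1, 0.9, -0.4, 0.3] for _ in range(5000)]  # toy data
estimate = estimate_mean(clients, rng)
```

Each client sends one bit per coordinate plus the two scalars min and max. Because the rounding is unbiased, the error of the averaged estimate shrinks as the number of clients n grows.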
SLIDE 7
Challenge II: user privacy is important
- Differential privacy (DP)
○ Removing or changing a single client’s data should not result in a big difference in the estimated mean.
○ Adding Gaussian noise [Abadi et al., 16]
Goal of this paper
- Both communication efficiency and differential privacy
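For reference, the differential privacy requirement on this slide is usually formalized as (ε, δ)-differential privacy. The symbols M (the randomized mechanism), D and D' (datasets differing in one client's data), S, ε and δ are notation introduced here, not taken from the slides.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\, \Pr[M(D') \in S] + \delta
\quad \text{for all neighboring } D, D' \text{ and all measurable sets } S.
```

Smaller ε and δ mean that one client's data has less influence on the distribution of the output, here the estimated mean.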
SLIDE 8
Attempt 1: add Gaussian noise on the server
[Figure: clients send Q(x1), Q(x2), Q(x3), Q(x4); the server computes ∑_i Q(x_i)/n + Gaussian noise]
- DP results readily available
○ Assuming L2 norm of the gradient is bounded (gradient clipping).
- Server has to be trustworthy.
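A minimal sketch of server-side Gaussian noise with gradient clipping. Several things here are assumptions rather than content from the slides: the function names, the classical calibration σ = sqrt(2 ln(1.25/δ)) · Δ / ε (valid for 0 < ε < 1), and the sensitivity bound 2 · clip_norm / n for replacing one client's vector. Quantization is left out to keep the example short.

```python
import math
import random


def clip_l2(x, clip_norm):
    """Scale x down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= clip_norm:
        return list(x)
    return [v * clip_norm / norm for v in x]


def gaussian_sigma(epsilon, delta, sensitivity):
    """Classical Gaussian-mechanism noise scale (assumes 0 < epsilon < 1)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon


def noisy_mean(client_vectors, clip_norm, epsilon, delta, rng):
    """Server clips each vector, averages them, and adds Gaussian noise."""
    n = len(client_vectors)
    d = len(client_vectors[0])
    clipped = [clip_l2(x, clip_norm) for x in client_vectors]
    # Replacing one client's clipped vector moves the mean by at most
    # 2 * clip_norm / n in L2 norm.
    sigma = gaussian_sigma(epsilon, delta, 2.0 * clip_norm / n)
    return [sum(x[j] for x in clipped) / n + rng.gauss(0.0, sigma)
            for j in range(d)]


rng = random.Random(0)
vectors = [[0.3, 0.4] for _ in range(1000)]  # toy data, L2 norm 0.5
result = noisy_mean(vectors, clip_norm=1.0, epsilon=0.5, delta=1e-5, rng=rng)
```

In this sketch the server sees every client's vector before the noise is added, which is why the slide notes that the server has to be trustworthy.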
SLIDE 9
Attempt 2: add Gaussian noise on the client
[Figure: clients send Q(x1), Q(x2), Q(x3), Q(x4); the server computes ∑_i Q(x_i)/n]
- Noise added after quantization: no communication efficiency (the noisy value is no longer discrete).
- Noise added before quantization: hard to analyze.
SLIDE 10
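cpSGD: add binomial noise after quantization
[Figure: clients send Q(x1), Q(x2), Q(x3), Q(x4), each with binomial noise added after quantization; the server computes ∑_i Q(x_i)/n]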
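A rough sketch of the idea on this slide: quantize each coordinate to an integer level, add integer-valued binomial noise on the client, and let the server subtract the known noise mean. The details are assumptions for illustration and are not the paper's exact construction or parameters: k levels on a fixed range [lo, hi], Binomial(m, 1/2) noise, and all function names.

```python
import random


def stochastic_round_to_level(x, lo, hi, k, rng):
    """Map x in [lo, hi] to an integer level in {0, ..., k-1}, unbiasedly."""
    scaled = (x - lo) / (hi - lo) * (k - 1)
    floor = int(scaled)
    if floor >= k - 1:
        return k - 1
    return floor + (1 if rng.random() < scaled - floor else 0)


def client_message(x, lo, hi, k, m, rng):
    """Quantize each coordinate, then add Binomial(m, 1/2) noise (m >= 1)."""
    msg = []
    for xj in x:
        level = stochastic_round_to_level(xj, lo, hi, k, rng)
        # Number of 1-bits among m random bits = sum of m fair coin flips.
        noise = bin(rng.getrandbits(m)).count("1")
        msg.append(level + noise)
    return msg


def server_estimate(messages, lo, hi, k, m):
    """Average the integer messages, remove the noise mean, and rescale."""
    n = len(messages)
    d = len(messages[0])
    est = []
    for j in range(d):
        total = sum(msg[j] for msg in messages)
        mean_level = (total - n * m / 2) / n  # each client's noise has mean m/2
        est.append(lo + mean_level * (hi - lo) / (k - 1))
    return est


rng = random.Random(1)
lo, hi, k, m = -1.0, 1.0, 16, 32
msgs = [client_message([0.3, -0.5], lo, hi, k, m, rng) for _ in range(2000)]
est = server_estimate(msgs, lo, hi, k, m)
```

Every message coordinate is an integer in {0, ..., k - 1 + m}, so it fits in ceil(log2(k + m)) bits. Gaussian noise would instead produce a real number that has to be re-encoded before it can be sent.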
SLIDE 11
cpSGD
- Maintains communication efficiency
○ Binomial is discrete.
- Differentially private
○ Binomial is similar to Gaussian.
○ Extended to d dimensions with an improved bound.
- Works if server is negligent but not malicious
- Works even if clients do not trust the server
○ Secure aggregation.
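A toy illustration of why secure aggregation fits integer-valued messages: pairwise random masks hide each client's value but cancel in the sum. This is a simplified sketch under stated assumptions, not a real protocol. The name `masked_inputs` is hypothetical, a single shared random generator stands in for pairwise secrets, and key agreement and client dropouts are ignored.

```python
import random


def masked_inputs(values, modulus, rng):
    """Pairwise masking: client i adds r_ij, client j subtracts it (mod modulus)."""
    n = len(values)
    masked = [v % modulus for v in values]
    for i in range(n):
        for j in range(i + 1, n):
            r = rng.randrange(modulus)  # stands in for a secret shared by i and j
            masked[i] = (masked[i] + r) % modulus
            masked[j] = (masked[j] - r) % modulus
    return masked


rng = random.Random(0)
values = [3, 17, 8, 25]   # e.g. quantized-plus-noise integers from four clients
modulus = 1 << 16         # must exceed the largest possible true sum
masked = masked_inputs(values, modulus, rng)
total = sum(masked) % modulus  # the server only needs this sum
```

The server recovers the sum of the clients' values from the masked messages without being given any individual value in the clear.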
SLIDE 12
Tue Dec 4th 05:00 -- 07:00 PM Room 210 & 230 AB #27
For d variables and n ≈ d clients, cpSGD uses
- O(log log(nd)) bits of communication per client per coordinate
- Constant privacy
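To get a feel for the O(log log(nd)) rate, the snippet below evaluates log2(log2(nd)) for an example size. The O(·) notation hides constants, so this is an order-of-magnitude illustration only; the helper name `loglog_bits` and the choice n = d = 10^6 are assumptions.

```python
import math


def loglog_bits(n, d):
    """Evaluate log2(log2(n * d)), ignoring the constants hidden by O(.)."""
    return math.log2(math.log2(n * d))


n = d = 10**6
bits = loglog_bits(n, d)  # a single-digit number, versus 32 bits for a float
```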