Federated Learning
Min Du Postdoc, UC Berkeley
Outline
- Preliminary: deep learning and SGD
- Federated learning: FedSGD and FedAvg
- Related research in federated learning
- Open problems
Example tasks (given input → desired output):
- Image classification: an image → "8"
- Playing Go: the current board → next move
- Next-word prediction: "Looking forward to your ?" → "reply"
w is the set of parameters contained by the function.
The goal of deep learning
is to find a set of parameters w that maximizes the probability of outputting y_i given x_i.
Given input: x_i    Maximize: P(y_i | x_i, w)
Finding the function: model training
The goal of deep learning model training is to find a set of parameters w such that, averaged over all training samples, the probability P(y_i | x_i, w) is maximized.
Finding the function: model training
Which is equivalent to:
maximize (1/n) Σ_{i=1}^n P(y_i | x_i, w)   ⟺   minimize (1/n) Σ_{i=1}^n −log P(y_i | x_i, w)
A basic component is the loss function ℓ(x_i, y_i, w) defined for each sample (x_i, y_i): let f_i(w) = ℓ(x_i, y_i, w) denote the loss function.
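As a concrete illustration of this loss (a sketch of mine, not from the slides), the per-sample negative log-likelihood can be computed for a simple linear softmax classifier; the function and variable names below are assumptions.

```python
# An illustrative per-sample loss f_i(w) = -log P(y_i | x_i, w) for a linear softmax
# classifier with parameters w = (W, b); all names here are mine, not from the slides.
import numpy as np

def per_sample_loss(W, b, x_i, y_i):
    """Negative log-likelihood of the correct label y_i given input x_i."""
    logits = W @ x_i + b                       # unnormalized class scores
    probs = np.exp(logits - logits.max())      # softmax, shifted for numerical stability
    probs /= probs.sum()
    return -np.log(probs[y_i])                 # f_i(w) = -log P(y_i | x_i, w)

def dataset_loss(W, b, X, y):
    """f(w) = (1/n) * sum_i f_i(w), the average loss over the training set."""
    return np.mean([per_sample_loss(W, b, x_i, y_i) for x_i, y_i in zip(X, y)])
```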
Finding the function: model training
Deep learning model training
For a training dataset containing n samples (x_i, y_i), 1 ≤ i ≤ n, the training objective is:
min_{w ∈ ℝ^d} f(w),   where f(w) ≜ (1/n) Σ_{i=1}^n f_i(w)
and f_i(w) = ℓ(x_i, y_i, w) is the loss of the prediction on example (x_i, y_i).
No closed-form solution: in a typical deep learning model, w may contain millions of parameters. Non-convex: multiple local minima exist.
[Figure: a non-convex loss curve f(w) plotted against w.]
Solution: Gradient Descent
Start from a randomly initialized weight w. Compute the gradient ∇f(w) and update w_{t+1} = w_t − η∇f(w_t) (gradient descent); the learning rate η controls the step size. At a local minimum of the loss f(w), ∇f(w) is close to 0.
How to stop? When the update is small enough, training has converged: ‖w_{t+1} − w_t‖ ≤ ε.
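A minimal sketch of this loop, assuming a toy quadratic loss f(w) = ‖w‖²/2 instead of a real network (so its gradient is simply w); the function names are illustrative.

```python
# A minimal gradient-descent loop (my illustration) on a toy quadratic loss
# f(w) = ||w||^2 / 2, whose gradient is simply w; real models use backpropagation instead.
import numpy as np

def gradient_descent(grad_f, w0, lr=0.1, eps=1e-6, max_iters=10_000):
    w = w0
    for _ in range(max_iters):
        w_next = w - lr * grad_f(w)                 # w_{t+1} = w_t - eta * grad f(w_t)
        if np.linalg.norm(w_next - w) <= eps:       # stop when the update is small enough
            return w_next
        w = w_next
    return w

w_star = gradient_descent(grad_f=lambda w: w, w0=np.array([3.0, -2.0]))
print(w_star)  # close to the minimum at [0, 0]
```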
Problem: usually the number of training samples n is large, and every gradient step touches all n samples → slow convergence.
Solution: Stochastic Gradient Descent (SGD)
At each step, instead of using all n training samples, randomly pick a small subset (mini-batch) of training samples (x_b, y_b).
Each step is much faster:
w_{t+1} ← w_t − η∇ℓ(w_t; x_b, y_b)
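A minimal sketch of a single SGD step under the same assumptions; grad_loss is a hypothetical placeholder for the mini-batch gradient of the loss.

```python
# One SGD step (illustrative): sample a mini-batch (x_b, y_b) and step along its gradient.
# grad_loss(w, X_b, y_b) is a placeholder for the gradient of the loss on the mini-batch.
import numpy as np

def sgd_step(w, X, y, grad_loss, lr=0.01, batch_size=32):
    idx = np.random.choice(len(X), size=batch_size, replace=False)  # random mini-batch
    return w - lr * grad_loss(w, X[idx], y[idx])                    # w_{t+1} = w_t - eta * grad
```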
Outline
- Preliminary: deep learning and SGD
- Federated learning: FedSGD and FedAvg
- Related research in federated learning
- Open problems
The biggest obstacle to using advanced data analysis isn't skill base or technology; it's plain old access to the data.
The importance of data for ML
"Data is the New Oil"
Google, Apple, ......
Private data: all the photos a user takes and everything they type on their mobile keyboard, including passwords, URLs, messages, etc.
ML model use cases:
- Image classification: e.g., to predict which photos are most likely to be viewed multiple times in the future
- Language models: e.g., voice recognition, next-word prediction, and auto-reply in Gmail
Google, Apple, ......
Instead of uploading the raw data, train a model locally and upload the model.
Addressing privacy: the model parameters never contain more information than the raw training data.
Addressing network overhead: the size of the model is generally much smaller than the size of the raw training data.
[Diagram: per-device ML models are combined on the server via model aggregation.]
Federated optimization
- Non-IID: the data generated by each user are quite different
- Unbalanced: some users produce significantly more data than others
- Massively distributed: # mobile device owners >> avg # training samples on each device
- Limited communication: unstable mobile network connections
A new paradigm – Federated Learning
a synchronous update scheme that proceeds in rounds of communication
McMahan, H. Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. "Communication-efficient learning of deep networks from decentralized data." AISTATS, 2017.
Federated learning – overview
In round number i…
[Diagram: the central server holds the global model M(i); each client, with its own local data, downloads model M(i) and computes gradient updates for M(i).]
Deployed by Google, Apple, etc.
Federated learning – overview
In round number i…
[Diagram: each client sends its updates of M(i) (the gradient updates computed on its local data) to the central server, which performs model aggregation to produce M(i+1).]
Federated learning – overview
Round number i+1 and continue…
[Diagram: the central server distributes the new global model M(i+1) to the clients, and the process repeats.]
Federated learning – detail
For efficiency, at the beginning of each round, a random fraction C of clients is selected, and the server sends the current model parameters to each of these clients.
Federated learning – detail
- For a training dataset containing n samples (x_i, y_i), 1 ≤ i ≤ n, the training objective is
  min_{w ∈ ℝ^d} f(w),   where f(w) ≜ (1/n) Σ_{i=1}^n f_i(w)
  and f_i(w) = ℓ(x_i, y_i, w) is the loss of the prediction on example (x_i, y_i).
- Deep learning optimization relies on SGD and its variants, applied through mini-batches:
  w_{t+1} ← w_t − η∇ℓ(w_t; x_b, y_b)
Federated learning – detail
- Suppose the n training samples are distributed across K clients, where P_k is the set of indices of data points on client k, and n_k = |P_k|.
- The training objective min_{w ∈ ℝ^d} f(w) can then be rewritten as
  f(w) = Σ_{k=1}^K (n_k/n) F_k(w),   where F_k(w) ≜ (1/n_k) Σ_{i∈P_k} f_i(w)
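A quick numeric sanity check of this decomposition (my own illustration): re-weighting the per-client average losses F_k by n_k/n recovers the global average loss f(w).

```python
# A quick numeric check (my own illustration): splitting n per-sample losses across K
# clients and re-weighting the per-client averages by n_k/n recovers the global average.
import numpy as np

n, K = 100, 5
per_sample_losses = np.random.rand(n)             # stand-ins for f_i(w)
partition = np.array_split(np.arange(n), K)       # P_k: indices of the samples on client k

f_global = per_sample_losses.mean()                                      # (1/n) sum_i f_i(w)
F_k = [per_sample_losses[P_k].mean() for P_k in partition]               # F_k(w)
f_federated = sum(len(P_k) / n * F for P_k, F in zip(partition, F_k))    # sum_k (n_k/n) F_k(w)

assert np.isclose(f_global, f_federated)
```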
A baseline – FederatedSGD (FedSGD)
- A randomly selected client in federated learning plays the role of a randomly selected sample (mini-batch) in traditional deep learning.
- A fraction C of clients is selected in each round; each selected client computes the gradient over all of its local data in that round.
- C = 1: full-batch (non-stochastic) gradient descent
- C < 1: stochastic gradient descent (SGD)
A baseline – FederatedSGD (FedSGD)
Learning rate: η; total #samples: n; total #clients: K; #samples on client k: n_k; client fraction C = 1.
- The central server broadcasts the current model w_t to each client; each client k computes the gradient g_k = ∇F_k(w_t) on its local data.
- Approach 1: each client k submits g_k; the central server aggregates the gradients to generate the new model:
  w_{t+1} ← w_t − η Σ_{k=1}^K (n_k/n) g_k
- Approach 2: each client k computes w_{t+1}^k ← w_t − η g_k; the central server performs a weighted average:
  w_{t+1} ← Σ_{k=1}^K (n_k/n) w_{t+1}^k
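A minimal sketch (with illustrative names) showing that, when each client takes exactly one local step, the two approaches produce the same model, since Σ_k n_k/n = 1.

```python
# A minimal sketch (illustrative names) showing the two FedSGD variants give the same
# model when each client takes exactly one local step: grads[k] = g_k = grad F_k(w_t).
import numpy as np

def fedsgd_round(w_t, grads, n_k, lr):
    n = sum(n_k)
    # Approach 1: aggregate gradients on the server, then take one step.
    g_agg = sum(nk / n * g_k for nk, g_k in zip(n_k, grads))
    w_approach1 = w_t - lr * g_agg
    # Approach 2: each client takes one local step, the server averages the models.
    local_models = [w_t - lr * g_k for g_k in grads]
    w_approach2 = sum(nk / n * w_k for nk, w_k in zip(n_k, local_models))
    assert np.allclose(w_approach1, w_approach2)  # identical because sum_k n_k/n = 1
    return w_approach1
```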
Performing the local update of Approach 2 multiple times before averaging ⟹ FederatedAveraging (FedAvg)
Recall f(w) = Σ_{k=1}^K (n_k/n) F_k(w)
Federated learning – dealing with limited communication
- Select more clients for training in each communication round
- Increase the computation performed on each client
Federated learning – FederatedAveraging (FedAvg)
Learning rate: η; total #samples: n; total #clients: K; #samples on client k: n_k; client fraction C.
- The central server broadcasts the current model w_t to each client; each client k computes the gradient g_k = ∇F_k(w_t) on its local data.
- Approach 2, applied multiple times locally: each client k repeatedly updates w_{t+1}^k ← w_t − η g_k, running E local epochs over its n_k samples with mini-batch size B (E·n_k/B local updates per round); the central server then aggregates:
  w_{t+1} ← Σ_{k=1}^K (n_k/n) w_{t+1}^k
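A minimal FedAvg sketch, simplified from the paper's ClientUpdate/server pseudocode and using assumed names; grad_loss stands in for the mini-batch gradient of the local loss.

```python
# A minimal FedAvg sketch (illustrative, simplified from the paper's pseudocode): each
# selected client runs E local epochs of mini-batch SGD with batch size B, starting from
# the broadcast model w_t; the server then averages the local models weighted by n_k/n.
import numpy as np

def client_update(w_t, X_k, y_k, grad_loss, lr, E, B):
    """Local training on client k; grad_loss(w, X_b, y_b) is a placeholder gradient."""
    w = w_t.copy()
    for _ in range(E):                                   # E local epochs
        order = np.random.permutation(len(X_k))
        for start in range(0, len(X_k), B):              # mini-batches of size B
            batch = order[start:start + B]
            w = w - lr * grad_loss(w, X_k[batch], y_k[batch])
    return w

def fedavg_aggregate(local_models, n_k):
    """Weighted average: w_{t+1} = sum_k (n_k / n) * w^k_{t+1}."""
    n = sum(n_k)
    return sum(nk / n * w_k for nk, w_k in zip(n_k, local_models))
```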
Federated learning – FederatedAveraging (FedAvg)
Model initialization: the two models being averaged can be initialized
- on the central server (shared initialization), or
- independently on each client.
[Figure: the loss on the full MNIST training set for models generated by θ·w + (1−θ)·w′.]
Shared initialization works better in practice.
Federated learning – FederatedAveraging (FedAvg)
Model averaging (right figure): the loss on the full MNIST training set for models generated by θ·w + (1−θ)·w′.
In practice, naïve parameter averaging works surprisingly well.
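A small sketch of the interpolation behind this figure (names and dummy weights are mine): blend two parameter sets as θ·w + (1−θ)·w′ and evaluate each blend.

```python
# A small sketch of the interpolation behind the figure (illustrative names): blend two
# sets of parameters w and w_prime as theta*w + (1-theta)*w_prime and evaluate each blend.
import numpy as np

def interpolate(w, w_prime, theta):
    """Return the parameter-wise blend theta*w + (1-theta)*w_prime."""
    return {name: theta * w[name] + (1 - theta) * w_prime[name] for name in w}

# Dummy parameters standing in for two independently trained MNIST models.
w = {"W": np.random.randn(10, 784), "b": np.zeros(10)}
w_prime = {"W": np.random.randn(10, 784), "b": np.zeros(10)}

for theta in np.linspace(0.0, 1.0, 5):
    blended = interpolate(w, w_prime, theta)
    # the blended model's loss would be evaluated on the full MNIST training set here
```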
Federated learning – FederatedAveraging (FedAvg)
The model is initialized on the central server. In each round:
i. a random set of clients is chosen;
ii. each client performs local gradient descent steps;
iii. the central server averages the model parameters submitted by the clients.
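The round as a whole can be sketched as a server-side loop (my illustration, reusing the hypothetical client_update and fedavg_aggregate helpers from the earlier sketch).

```python
# A server-side round loop tying the three steps together (my sketch; reuses the
# client_update and fedavg_aggregate helpers from the earlier FedAvg sketch).
import numpy as np

def fedavg_round(w_t, clients, grad_loss, C=0.1, lr=0.01, E=5, B=10):
    """clients: list of (X_k, y_k) tuples, one per client."""
    m = max(1, int(C * len(clients)))
    selected = np.random.choice(len(clients), size=m, replace=False)        # i. choose clients
    local_models, n_k = [], []
    for k in selected:
        X_k, y_k = clients[k]
        local_models.append(client_update(w_t, X_k, y_k, grad_loss, lr, E, B))  # ii. local steps
        n_k.append(len(X_k))
    return fedavg_aggregate(local_models, n_k)                               # iii. average
```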
Federated learning – Evaluation
Image classification (MNIST):
- IID: random partition
- Non-IID: each client only contains two digits
- Balanced
[Table: #rounds required to achieve a target accuracy on the test dataset, for a single client, FedSGD, and FedAvg.]
Impact of varying C: in general, the higher C, the fewer rounds needed to reach the target accuracy.
Federated learning – Evaluation
Language modeling (Shakespeare text, split by speaking role):
- #clients: 1146, each corresponding to a speaking role
- Unbalanced: a different #lines for each role
- Train–test split ratio: 80% – 20%
- A balanced and IID dataset with 1146 clients is also constructed
Federated learning – Evaluation
Image classification: adding more computation per client in each round (decrease B / increase E).
In general, the more computation in each round, the faster the model trains. FedAvg also converges to a higher test accuracy (B=10, E=20).
Federated learning – Evaluation
Language modeling: adding more computation per client in each round (decrease B / increase E).
In general, the more computation in each round, the faster the model trains. FedAvg also converges to a higher test accuracy (B=10, E=5).
Federated learning – Evaluation
The best performance may be reached at earlier rounds; increasing #rounds does not always improve it. Best practice: decay the amount of local computation as the model gets close to convergence.
Federated learning – Evaluation
Language modeling – a real-world problem:
- Posts from a large social network are used to train a language model
- #clients: 500,000, each corresponding to an author
- 200 clients per round; B=8, E=1
Outline
- Preliminary: deep learning and SGD
- Federated learning: FedSGD and FedAvg
- Related research in federated learning
- Open problems
Federated learning – related research
- Data poisoning attacks: How To Backdoor Federated Learning, arXiv:1807.00459.
- Secure aggregation: Practical Secure Aggregation for Privacy-Preserving Machine Learning, CCS '17.
- Client-level differential privacy: Differentially Private Federated Learning: A Client-Level Perspective, ICLR '19.
- Decentralizing the central server via blockchain.
Google FL Workshop: https://sites.google.com/view/federated-learning-2019/home
HiveMind: Decentralized Federated Learning
[Architecture: clients with local data, the Oasis Blockchain Platform, and a HiveMind smart contract that maintains a differentially private global model.]
Building blocks: differential privacy (per-client DP noise, only aggregated noise in the sum), secure aggregation, model encryption.
In round number i…
The HiveMind smart contract holds the global model M(i); each client downloads model M(i).
Each client computes gradient updates for M(i) on its local data and adds DP noise to its update before submitting it.
The noisy updates are combined via secure aggregation: individual updates remain hidden, and only the aggregate (carrying the aggregated DP noise) is revealed, producing the differentially private global model M(i+1).
Global model M(i+1) replaces M(i) in the HiveMind smart contract, and round number i+1 continues…
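To make the mechanism concrete, here is a hedged sketch of DP noise plus mask-based secure aggregation (my own illustration; HiveMind's actual protocol and the Oasis platform APIs are not shown in the slides).

```python
# A hedged sketch (my own illustration, not HiveMind's actual protocol): each client adds
# Gaussian DP noise plus pairwise random masks to its update; the masks cancel when all
# masked updates are summed, so only the noisy aggregate is revealed.
import numpy as np

def mask_updates(updates, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_clients, dim = len(updates), len(updates[0])
    noisy = [u + rng.normal(0, sigma, dim) for u in updates]   # per-client DP noise
    masked = [u.copy() for u in noisy]
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            pairwise = rng.normal(0, 1, dim)   # mask shared by clients i and j
            masked[i] += pairwise              # client i adds the mask
            masked[j] -= pairwise              # client j subtracts it
    return masked

updates = [np.ones(4) * k for k in range(3)]   # toy gradient updates from 3 clients
aggregate = sum(mask_updates(updates))          # masks cancel; aggregated DP noise remains
print(aggregate)                                # close to sum(updates) = [3, 3, 3, 3]
```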
Outline
- Preliminary: deep learning and SGD
- Federated learning: FedSGD and FedAvg
- Related research in federated learning
- Open problems
Federated learning – open problems
… with secure aggregation.
Min Du min.du@berkeley.edu