SLIDE 1

Federated Learning

Min Du, Postdoc, UC Berkeley

SLIDE 2

Outline

• Preliminary: deep learning and SGD
• Federated learning: FedSGD and FedAvg
• Related research in federated learning
• Open problems

SLIDE 3

Outline

• Preliminary: deep learning and SGD
• Federated learning: FedSGD and FedAvg
• Related research in federated learning
• Open problems

SLIDE 4
  • Find a function that produces a desired output given a particular input.

Example task          | Given input                    | Desired output
Image classification  | (an image of a digit)          | "8"
Playing Go            | (current board position)       | Next move
Next-word prediction  | "Looking forward to your ___"  | "reply"

π‘₯ is the set of parameters contained by the function

The goal of deep learning

SLIDE 5
  • Given one input sample pair $(x_1, y_1)$, the goal of deep learning model training is to find a set of parameters $w$ that maximizes the probability of outputting $y_1$ given $x_1$.

Given input: $x_1$. Maximize: $p(y_1 \mid x_1, w)$.

Finding the function: model training

SLIDE 6
  • Given a training dataset containing $n$ input-output pairs $(x_i, y_i)$, $i \in [1, n]$, the goal of deep learning model training is to find a set of parameters $w$ such that the average of $p(y_i \mid x_i, w)$ is maximized.

Finding the function: model training

SLIDE 7
  • That is,

$$\text{maximize } \frac{1}{n}\sum_{i=1}^{n} p(y_i \mid x_i, w)$$

which is equivalent to

$$\text{minimize } \frac{1}{n}\sum_{i=1}^{n} -\log\big(p(y_i \mid x_i, w)\big)$$

A basic component for the loss function given sample $(x_i, y_i)$: $\ell(x_i, y_i, w) = -\log\big(p(y_i \mid x_i, w)\big)$. Let $f_i(w) = \ell(x_i, y_i, w)$ denote the loss function.

  • Given a training dataset containing $n$ input-output pairs $(x_i, y_i)$, $i \in [1, n]$, the goal of deep learning model training is to find a set of parameters $w$ such that the average of $p(y_i \mid x_i, w)$ is maximized.

Finding the function: model training
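To make this loss concrete, here is a minimal sketch (assuming a linear softmax classifier; the function and variable names are illustrative, not from the slides) that computes $f_i(w) = -\log p(y_i \mid x_i, w)$ for one sample:

```python
import numpy as np

def negative_log_likelihood(w, x_i, y_i):
    """f_i(w) = -log p(y_i | x_i, w) for a linear softmax classifier."""
    logits = x_i @ w                               # one score per class
    logits -= logits.max()                         # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # p(. | x_i, w)
    return -np.log(probs[y_i])

# Example: 4 input features, 3 classes, true label 2
w = np.random.randn(4, 3)
x_i = np.random.randn(4)
print(negative_log_likelihood(w, x_i, y_i=2))
```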

SLIDE 8

Deep learning model training

For a training dataset containing $n$ samples $(x_i, y_i)$, $1 \le i \le n$, the training objective is:

$$\min_{w \in \mathbb{R}^d} f(w), \quad \text{where } f(w) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(w)$$

$f_i(w) = \ell(x_i, y_i, w)$ is the loss of the prediction on example $(x_i, y_i)$.

No closed-form solution: in a typical deep learning model, $w$ may contain millions of parameters. Non-convex: multiple local minima exist.

π‘₯ 𝑔(π‘₯)

SLIDE 9

Solution: Gradient Descent

π‘₯ Loss 𝑔(π‘₯) Randomly initialized weight π‘₯ Compute gradient βˆ‡π‘”(π‘₯) π‘₯IJ6 = π‘₯I βˆ’ πœƒβˆ‡π‘”(π‘₯) (Gradient Descent) At the local minimum, βˆ‡π‘”(π‘₯) is close to 0. Learning rate πœƒ controls the step size

How to stop? – when the update is small enough – converge. βˆ₯ π‘₯IJ6 βˆ’ π‘₯I βˆ₯≀ πœ—

  • r βˆ₯ βˆ‡π‘”(π‘₯I) βˆ₯≀ πœ—

Problem: Usually the number of training samples n is large – slow convergence
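A minimal sketch of this loop (assuming a `grad_f` callable that returns $\nabla f(w)$ over the full training set; names are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, w, eta=0.1, eps=1e-6, max_steps=10_000):
    """Repeat w_{t+1} = w_t - eta * grad_f(w_t) until the gradient is tiny."""
    for _ in range(max_steps):
        g = grad_f(w)
        if np.linalg.norm(g) <= eps:   # stopping criterion: ||grad f(w_t)|| <= eps
            break
        w = w - eta * g
    return w

# Example: minimize f(w) = ||w||^2, whose gradient is 2w
w_star = gradient_descent(lambda w: 2 * w, w=np.array([3.0, -2.0]))
```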

SLIDE 10

Solution: Stochastic Gradient Descent (SGD)

  • At each step of gradient descent, instead of computing the gradient over all training samples, randomly pick a small subset (a mini-batch) of training samples $(x_k, y_k)$.

  • Compared to gradient descent, SGD takes more steps to converge, but each step is much faster.

$$w_{t+1} \leftarrow w_t - \eta \nabla f(w_t; x_k, y_k)$$

SLIDE 11

Outline

• Preliminary: deep learning and SGD
• Federated learning: FedSGD and FedAvg
• Related research in federated learning
• Open problems

SLIDE 12

"The biggest obstacle to using advanced data analysis isn't skill base or technology; it's plain old access to the data."

  – Edd Wilder-James, Harvard Business Review

The importance of data for ML

SLIDE 13

β€œData is the New Oil”

SLIDE 14

Google, Apple, ......

Private data: all the photos a user takes and everything they type on their mobile keyboard, including passwords, URLs, messages, etc.

ML models trained on it:
• Image classification: e.g. to predict which photos are most likely to be viewed multiple times in the future
• Language models: e.g. voice recognition, next-word prediction, and auto-reply in Gmail

SLIDE 15

Google, Apple, ......

Instead of uploading the raw data, train a model locally and upload the model.
Addressing privacy: model parameters will never contain more information than the raw training data.
Addressing network overhead: the size of the model is generally smaller than the size of the raw training data.

[Diagram: per-device ML models combined through model aggregation.]

SLIDE 16

Federated optimization

  • Characteristics (major challenges):
    ○ Non-IID: the data generated by each user are quite different
    ○ Unbalanced: some users produce significantly more data than others
    ○ Massively distributed: # mobile device owners >> avg # training samples on each device
    ○ Limited communication: unstable mobile network connections

SLIDE 17

A new paradigm – Federated Learning

A synchronous update scheme that proceeds in rounds of communication.

McMahan, H. Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. "Communication-efficient learning of deep networks from decentralized data." AISTATS, 2017.

SLIDE 18

[Diagram] In round number i: the central server holds the global model M(i) and sends it to each client; each client computes gradient updates for M(i) on its local data.

Deployed by Google, Apple, etc.

Federated learning – overview

SLIDE 19

[Diagram] In round number i: each client submits its updates of M(i) to the central server, which performs model aggregation to produce M(i+1).

Federated learning – overview

SLIDE 20

[Diagram] In round number i+1: the central server broadcasts the new global model M(i+1) to each client, and the process continues.

Federated learning – overview

SLIDE 21

Federated learning – detail

For efficiency, at the beginning of each round, a random fraction C of clients is selected, and the server sends the current model parameters to each of these clients.

SLIDE 22

Federated learning – detail

  • Recall in traditional deep learning model training:
    ○ For a training dataset containing $n$ samples $(x_i, y_i)$, $1 \le i \le n$, the training objective is:

$$\min_{w \in \mathbb{R}^d} f(w), \quad \text{where } f(w) \overset{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f_i(w)$$

$f_i(w) = \ell(x_i, y_i, w)$ is the loss of the prediction on example $(x_i, y_i)$.

    ○ Deep learning optimization relies on SGD and its variants, through mini-batches:

$$w_{t+1} \leftarrow w_t - \eta \nabla f(w_t; x_k, y_k)$$

SLIDE 23

Federated learning – detail

  • In federated learning:
    ○ Suppose the $n$ training samples are distributed across $K$ clients, where $P_k$ is the set of indices of data points on client $k$, and $n_k = |P_k|$.
    ○ The training objective is $\min_{w \in \mathbb{R}^d} f(w)$, where

$$f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad F_k(w) \overset{\text{def}}{=} \frac{1}{n_k}\sum_{i \in P_k} f_i(w)$$

SLIDE 24

A baseline – FederatedSGD (FedSGD)

  • A randomly selected client that has $n_k$ training samples in federated learning ≈ a randomly selected sample in traditional deep learning.

  • FederatedSGD (FedSGD): a single step of gradient descent is done per round.

  • Recall that in federated learning, a C-fraction of clients is selected at each round:
    ○ C = 1: full-batch (non-stochastic) gradient descent
    ○ C < 1: stochastic gradient descent (SGD)

SLIDE 25

A baseline – FederatedSGD (FedSGD)

Learning rate: πœƒ; total #samples: π‘œ; total #clients: 𝐿; #samples on a client k: π‘œN; clients fraction 𝐷 = 1

  • In a round t:

β—‹

The central server broadcasts current model π‘₯I to each client; each client k computes gradient: 𝑕N = βˆ‡πΊN(π‘₯I), on its local data.

β– 

Approach 1: Each client k submits 𝑕N; the central server aggregates the gradients to generate a new model:

  • π‘₯IJ6 ← π‘₯I βˆ’ πœƒβˆ‡π‘” π‘₯I = π‘₯I βˆ’ πœƒ βˆ‘N56

T 7U 7 𝑕N.

β– 

Approach 2: Each client k computes: π‘₯IJ6

N ← π‘₯I βˆ’ πœƒπ‘•N; the central server performs

aggregation:

  • π‘₯IJ6 ← βˆ‘N56

T 7U 7 π‘₯IJ6 N

For multiple times ⟹ FederatedAveraging (FedAvg)

Recall f w = βˆ‘^56

_ `a ` F^(w)
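A sketch of one FedSGD round under Approach 1 (assuming a hypothetical `client_grad(w, data)` that returns $\nabla F_k(w_t)$ as a NumPy array, and `clients` as a list of per-client datasets; C = 1, so all clients participate):

```python
import numpy as np

def fedsgd_round(w, clients, client_grad, eta=0.1):
    """One FedSGD round with C = 1: every client contributes one gradient."""
    n = sum(len(data) for data in clients)
    # Each client k computes g_k = grad F_k(w_t) on its local data.
    grads = [client_grad(w, data) for data in clients]
    # Server update: w_{t+1} = w_t - eta * sum_k (n_k / n) * g_k
    weighted = sum(len(data) / n * g for data, g in zip(clients, grads))
    return w - eta * weighted
```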

SLIDE 26

Federated learning – dealing with limited communication

  • Increase computation:
    ○ Select more clients for training in each communication round
    ○ Increase computation on each client

SLIDE 27

Federated learning – FederatedAveraging (FedAvg)

Learning rate: πœƒ; total #samples: π‘œ; total #clients: 𝐿; #samples on a client k: π‘œN; clients fraction 𝐷

  • In a round t:

β—‹

The central server broadcasts current model π‘₯I to each client; each client k computes gradient: 𝑕N = βˆ‡πΊN(π‘₯I), on its local data.

β– 

Approach 2:

  • Each client k computes for E epochs : π‘₯IJ6

N

← π‘₯I βˆ’ πœƒπ‘•N

  • The central server performs aggregation: π‘₯IJ6 ← βˆ‘N56

T 7U 7 π‘₯IJ6 N

  • Suppose B is the local mini-batch size, #updates on client k in each round: 𝑣N = 𝐹

7U e .
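A sketch of the client-side computation ($E$ epochs of mini-batch SGD with batch size $B$; `grad_f_batch` is a hypothetical per-batch gradient function, and all names are illustrative):

```python
import numpy as np

def client_update(w, X, Y, grad_f_batch, eta=0.1, E=5, B=10):
    """Runs E epochs of local mini-batch SGD: about E * n_k / B updates per round."""
    w = w.astype(float).copy()
    n_k = len(X)
    for _ in range(E):
        perm = np.random.permutation(n_k)       # reshuffle each epoch
        for start in range(0, n_k, B):
            idx = perm[start:start + B]
            w -= eta * grad_f_batch(w, X[idx], Y[idx])
    return w
```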

SLIDE 28

Federated learning – FederatedAveraging (FedAvg)

Model initialization

  • Two choices:
    ○ On the central server (shared initialization)
    ○ On each client (independent initialization)

[Figure: the loss on the full MNIST training set for models generated by $\theta w + (1 - \theta) w'$.] Shared initialization works better in practice.

SLIDE 29

Federated learning – FederatedAveraging (FedAvg)

Model averaging

  • As shown in the figure: [Figure: the loss on the full MNIST training set for models generated by $\theta w + (1 - \theta) w'$.] In practice, naïve parameter averaging works surprisingly well.

SLIDE 30

Federated learning – FederatedAveraging (FedAvg)

  • 1. At first, a model is randomly initialized on the central server.
  • 2. For each round $t$:
    i. A random set of clients is chosen;
    ii. Each client performs local gradient descent steps;
    iii. The server aggregates the model parameters submitted by the clients.
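Putting the pieces together, a minimal FedAvg server loop sketch (assuming a `client_update(w, X, Y)` callable like the earlier sketch with its gradient helper bound in, e.g. via `functools.partial`; details are illustrative, not the exact pseudocode from the paper):

```python
import numpy as np

def fedavg(init_w, clients, client_update, rounds=100, C=0.1):
    """clients: list of (X, Y) pairs, one per client; client_update(w, X, Y) -> local w."""
    w = init_w
    K = len(clients)
    for _ in range(rounds):
        # i. Choose a random set of clients (a C-fraction, at least one).
        m = max(int(C * K), 1)
        chosen = np.random.choice(K, size=m, replace=False)
        # ii. Each chosen client performs local gradient descent steps.
        local = [(len(clients[k][0]), client_update(w, *clients[k])) for k in chosen]
        # iii. Server aggregates: average weighted by local sample counts n_k.
        n = sum(n_k for n_k, _ in local)
        w = sum((n_k / n) * w_k for n_k, w_k in local)
    return w
```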

SLIDE 31

Federated learning – Evaluation

  • #clients: 100
  • Dataset: MNIST
    ○ IID: random partition
    ○ Non-IID: each client contains only two digits
    ○ Balanced

[Table: #rounds required to achieve a target accuracy on the test dataset, comparing a single client, FedSGD, and FedAvg.]

Image classification

Impact of varying C: in general, the higher C is, the fewer rounds are needed to reach the target accuracy.

SLIDE 32

Federated learning – Evaluation

  • Dataset from: The Complete Works of Shakespeare
    ○ #clients: 1146, each corresponding to a speaking role
    ○ Unbalanced: different #lines for each role
    ○ Train-test split ratio: 80% / 20%
    ○ A balanced and IID dataset with 1146 clients is also constructed
  • Task: next-character prediction
  • Model: character-level LSTM language model

Language modeling

SLIDE 33

Federated learning – Evaluation

  • The effect of increasing computation in each round (decrease B / increase E)
  • Fix C = 0.1

Image classification

In general, the more computation in each round, the faster the model trains. FedAvg also converges to a higher test accuracy (B=10, E=20).

SLIDE 34

Federated learning – Evaluation

Language modeling

  • The effect of increasing computation in each round (decrease B / increase E)
  • Fix C = 0.1

In general, the more computation in each round, the faster the model trains. FedAvg also converges to a higher test accuracy (B=10, E=5).

SLIDE 35

Federated learning – Evaluation

  • What if we maximize the computation on each client? $E \to \infty$

The best performance may be achieved at earlier rounds; increasing #rounds does not improve it further. Best practice: decay the amount of local computation when the model is close to convergence.

slide-36
SLIDE 36

Federated learning – Evaluation

  • #clients: 100
  • Dataset: CIFAR-10
    ○ IID: random partition
    ○ Non-IID: each client contains only two classes
    ○ Balanced

Image classification

SLIDE 37

Federated learning – Evaluation

  • Dataset from: 10 million public posts from a large social network
    ○ #clients: 500,000, each corresponding to an author
  • Task: next-word prediction
  • Model: word-level LSTM language model

Language modeling: a real-world problem

200 clients per round; B=8, E=1

SLIDE 38

Outline

• Preliminary: deep learning and SGD
• Federated learning: FedSGD and FedAvg
• Related research in federated learning
• Open problems

SLIDE 39

Federated learning – related research

• Data poisoning attacks: "How to Backdoor Federated Learning," arXiv:1807.00459.
• Secure aggregation: "Practical Secure Aggregation for Privacy-Preserving Machine Learning," CCS'17.
• Client-level differential privacy: "Differentially Private Federated Learning: A Client-level Perspective," ICLR'19.
• Decentralizing the central server via blockchain.

Google FL Workshop: https://sites.google.com/view/federated-learning-2019/home

SLIDE 40

HiveMind: Decentralized Federated Learning

[Diagram] Clients holding local data interact with a HiveMind smart contract on the Oasis Blockchain Platform. Building blocks: differential privacy (per-client DP noise, aggregated noise), secure aggregation, and model encryption, producing a differentially private global model.

SLIDE 41

HiveMind: Decentralized Federated Learning

[Diagram] In round number i: the HiveMind smart contract on the Oasis Blockchain Platform distributes the global model M(i) to each client.

SLIDE 42

HiveMind: Decentralized Federated Learning

[Diagram] In round number i: each client computes gradient updates for M(i) on its local data and adds DP noise locally (differential privacy, secure aggregation, model encryption).

SLIDE 43

HiveMind: Decentralized Federated Learning

[Diagram] In round number i: each client submits its noised update (gradient updates for M(i) + DP noise) to the HiveMind smart contract.

SLIDE 44

HiveMind: Decentralized Federated Learning

[Diagram] In round number i: the smart contract combines the clients' noised updates (gradient updates for M(i) + DP noise) via secure aggregation.
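The intuition behind secure aggregation can be shown with a toy pairwise-masking sketch (an illustration only, not the CCS'17 protocol: each pair of clients shares a random mask that one adds and the other subtracts, so the masks cancel in the server's sum and individual updates stay hidden):

```python
import numpy as np

def masked_updates(updates, seed=0):
    """Toy pairwise masking: clients i < j share a mask that i adds and j subtracts."""
    rng = np.random.default_rng(seed)
    masked = [u.astype(float) for u in updates]
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            m = rng.normal(size=updates[i].shape)
            masked[i] += m   # client i adds the shared mask m_ij
            masked[j] -= m   # client j subtracts it; the masks cancel in the sum
    return masked

# The server only ever sees sum(masked_updates(updates)),
# which equals sum(updates) because every pairwise mask cancels.
```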

SLIDE 45

HiveMind: Decentralized Federated Learning

[Diagram] In round number i: secure aggregation of the clients' noised updates (gradient updates for M(i) + DP noise) yields the new global model M(i+1); the per-client DP noise combines into an aggregated noise term, making the global model differentially private.

SLIDE 46

HiveMind: Decentralized Federated Learning

[Diagram] Round number i+1: the differentially private global model M(i+1) is distributed to the clients, and the process continues.

SLIDE 47

HiveMind: Decentralized Federated Learning

[Diagram] Summary: differential privacy, secure aggregation, and model encryption on the Oasis Blockchain Platform produce a differentially private global model.

SLIDE 48

Outline

• Preliminary: deep learning and SGD
• Federated learning: FedSGD and FedAvg
• Related research in federated learning
• Open problems

SLIDE 49

Federated learning – open problems

  • Detecting data poisoning attacks while secure aggregation is in use.
  • Asynchronous model updates in federated learning, and their co-existence with secure aggregation.
  • Further reducing communication overhead, e.g. through quantization.
  • The usage of differential privacy in each of the above settings.
  • ...
SLIDE 50

Min Du min.du@berkeley.edu

Thank you!