Genuinely Distributed Byzantine Machine Learning - El-Mahdi El-Mhamdi - PowerPoint PPT Presentation
first.last@epfl.ch
Genuinely Distributed Byzantine Machine Learning
El-Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, Lê Nguyên Hoang, Sébastien Rouault
Swiss Federal Institute of Technology (EPFL)
August 6, 2020
The Big Picture
Machine learning (ML) tackles critical tasks...
...so ML should be made robust.
Literature (starting ~4 years ago): making the model robust when training.
This talk: genuinely distributed, Byzantine ML.
Machine learning (ML)
A model with ~1 to 100 million parameters maps inputs to labels: "Boat", "Goat", ...
(Animation: while the parameters are still untuned, the outputs are garbled, e.g. "Krust", "ZrOm", "Brust", "GOrm".)
Stochastic Gradient Descent (SGD)
~1 to 100 million parameters, e.g. (4.2, -0.5, 0.8, 0.3, -1.0, -5.7)
Training loop:
1. Estimate the gradient
2. Turn the potentiometers (the parameters) following the gradient
3. Loop back to step 1
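The three-step loop above can be sketched in a few lines; `loss_grad`, the dataset, and the quadratic loss are illustrative stand-ins, not part of the talk.

```python
import random

def loss_grad(w, x, y):
    # Gradient of the squared error (w*x - y)^2 w.r.t. the single parameter w.
    return 2 * (w * x - y) * x

random.seed(0)
# Toy dataset generated by y = 3*x, so training should drive w toward 3.
data = [(x, 3 * x) for x in [0.5, 1.0, 1.5, 2.0]]

w = 0.0    # the "potentiometer" (here a single parameter)
lr = 0.05  # learning rate
for step in range(500):
    x, y = random.choice(data)  # 1. sample a point, estimate the gradient
    g = loss_grad(w, x, y)
    w -= lr * g                 # 2. turn the potentiometer along -gradient
                                # 3. loop back to step 1
```

In a real network the single `w` is replaced by the millions of parameters the slide mentions, but the loop is the same.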
Distributed SGD
Setup: workers, a parameter server, and the network in between; the model has ~1 to 100 million parameters.
Each worker sends its own gradient estimate to the server, e.g. (4.2, -0.5, 0.8, 0.4, -1.0, -5.7), (4.3, -0.5, 0.7, 0.3, -0.9, -5.7), (4.1, -0.5, 0.7, 0.3, -1.0, -5.7), ...; the server aggregates them and updates the model.
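A minimal sketch of this parameter-server step, assuming the standard aggregation (a plain coordinate-wise average); function names and the learning rate are illustrative.

```python
def average(gradients):
    # Coordinate-wise mean of the workers' gradient estimates.
    n = len(gradients)
    return [sum(g[i] for g in gradients) / n for i in range(len(gradients[0]))]

def apply_update(params, grad, lr=0.1):
    # The server moves the model against the aggregated gradient.
    return [p - lr * g for p, g in zip(params, grad)]

# Three workers send slightly different estimates of the same gradient
# (each computed on its own mini-batch, as on the slide).
grads = [
    [4.2, -0.5, 0.8, 0.4, -1.0, -5.7],
    [4.3, -0.5, 0.7, 0.3, -0.9, -5.7],
    [4.1, -0.5, 0.7, 0.3, -1.0, -5.7],
]
params = [0.0] * 6
params = apply_update(params, average(grads))
```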
Distributed, Byzantine SGD
Same setup: workers, a parameter server, and the network; ~1 to 100 million parameters.
Correct workers send plausible gradient estimates, e.g. (4.2, -0.5, 0.8, 0.4, -1.0, -5.7); a Byzantine worker can send an arbitrary vector instead, e.g. (412, -153, 824, 349, -752, -537).
Byzantine-resilient SGD
Averaging is not Byzantine-resilient: a single arbitrary vector, e.g. (412, -153, 824, 349, -752, -537), can pull the average anywhere.
Replace the average with a robust gradient aggregation rule: Krum, Median, Bulyan, GeoMed, MDA, ... The aggregate, e.g. (4.1, -0.5, 0.7, 0.3, -1.0, -5.7), stays close ("≈") to the correct workers' gradients.
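One of the rules listed above, the coordinate-wise Median, fits in a few lines; this is an illustration of the idea, not the exact implementation from the paper.

```python
def coordinatewise_median(gradients):
    # Median of each coordinate independently: with a majority of correct
    # workers, every output coordinate is bracketed by correct values.
    def median(xs):
        s = sorted(xs)
        m = len(s) // 2
        return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2
    return [median([g[i] for g in gradients]) for i in range(len(gradients[0]))]

correct = [
    [4.2, -0.5, 0.8, 0.4, -1.0, -5.7],
    [4.1, -0.5, 0.7, 0.3, -1.0, -5.7],
]
byzantine = [[412, -153, 824, 349, -752, -537]]  # arbitrary adversarial vector

agg = coordinatewise_median(correct + byzantine)
# Each coordinate of `agg` lies between the correct workers' values,
# whereas a plain average would be dragged far away by the Byzantine vector.
```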
Problem
The parameter server is a single point of failure.
Problem… solution
Byzantine Consensus
Problem… solution… nope
Byzantine Consensus cannot be solved in an asynchronous network.
Key problem: divergence
(Animation over nodes A, B, C, D and 1, 2, 3: without coordination, the models held at the different nodes progressively drift apart.)
The goal
Can we keep the models (~1 to 100 million parameters each) "close" to each other...
...despite network asynchrony...
...and Byzantine behaviors?
Key approach
Can we bring the models back closer to each other, despite network asynchrony and Byzantine behaviors?
Key approach: +1 round
(Animation over nodes A, B, C, D and 1, 2, 3: one extra communication round.)
Key approach: toy example
Four nodes (1, 2, 3, 4), each holding one 1-parameter model.
The diameter is the largest distance between any two of the models.
(Animation: after the extra round of exchange and aggregation, the diameter is reduced.)
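The toy example can be checked numerically; the aggregation used here, a median over the received models, is an assumed stand-in for the contracting rule of the talk.

```python
def median(xs):
    s = sorted(xs)
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2

models = [1.0, 2.0, 3.0, 4.0]         # one 1-parameter model per node
diameter = max(models) - min(models)  # 3.0

# One extra round: every node gathers all models and takes the median.
# With identical received sets all nodes land on the same value; with
# per-node message subsets the diameter shrinks rather than vanishes.
after = [median(models) for _ in models]
new_diameter = max(after) - min(after)
```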
Key approach: last remark
Starting again from the four 1-parameter models (values 1, 2, 3, 4), multiply each model by ×2: every pairwise distance, and hence the diameter, is multiplied by ×2 as well.
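The remark is simple arithmetic, shown here on the same four 1-parameter models: scaling every model by ×2 (as a training step might) scales the diameter by the same factor, so the contraction from the extra round has to compete with this growth.

```python
models = [1.0, 2.0, 3.0, 4.0]
scaled = [2 * m for m in models]  # the ×2 step from the slide

diameter = max(models) - min(models)         # 3.0
scaled_diameter = max(scaled) - min(scaled)  # 6.0: doubled along with the models
```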