Genuinely Distributed Byzantine Machine Learning (El-Mahdi El-Mhamdi)



slide-1
SLIDE 1

Genuinely Distributed Byzantine Machine Learning

El-Mahdi El-Mhamdi Rachid Guerraoui Arsany Guirguis Lê Nguyên Hoang Sébastien Rouault first.last@epfl.ch

Swiss Federal Institute of Technology (EPFL) August 6, 2020

slide-8
SLIDE 8

The Big Picture

  • Machine learning (ML) tackles critical tasks...
  • ...so ML should be made robust.
  • Literature: robust when using the model, and (4y ago) robust when training the model.
  • Genuinely distributed, Byzantine ML

slide-14
SLIDE 14

Machine learning (ML)

[Figure: a model annotated "~1 to 100 millions" outputs labels such as "Boat", "Goat", ... Over the animation frames the outputs improve from "Krust ZrOm" through "Brust GOrm" and "Bost GOat" to "Boat Goat".]

slide-17
SLIDE 17

Stochastic Gradient Descent (SGD)

[Figure: the model (~1 to 100 millions) drawn as a row of potentiometers, with example values 4.2, 0.5, 0.8, 0.3, 1.0, 5.7.]

Training loop:

  • 1. Estimate gradient
  • 2. Turn potentiometers following the gradient
  • 3. Loop back to step 1.
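
The training loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the quadratic loss, the `TARGET` optimum, the noise level, and the names `make_gradient_estimator` and `sgd` are hypothetical and do not come from the slides.

```python
import random

# Hypothetical optimum of a toy quadratic loss sum((p - t)^2); not from the slides.
TARGET = [1.0, -2.0, 3.0]


def make_gradient_estimator(seed=0, noise=0.01):
    """Return a function giving a noisy ("stochastic") gradient of the toy loss."""
    rng = random.Random(seed)

    def estimate(params):
        return [2.0 * (p - t) + rng.gauss(0.0, noise) for p, t in zip(params, TARGET)]

    return estimate


def sgd(estimate_gradient, params, learning_rate=0.1, steps=200):
    params = list(params)
    for _ in range(steps):                      # 3. loop back to step 1
        gradient = estimate_gradient(params)    # 1. estimate gradient
        # 2. "turn the potentiometers": step against the gradient to lower the loss
        params = [p - learning_rate * g for p, g in zip(params, gradient)]
    return params


trained = sgd(make_gradient_estimator(), [0.0, 0.0, 0.0])
```

With these settings `trained` ends up close to `TARGET`; a real model would have the ~1 to 100 million parameters mentioned on the slide rather than three.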

slide-21
SLIDE 21

Distributed SGD

[Figure: several workers connected through the network to one parameter server holding the model (~1 to 100 millions). During the animation each worker sends its own estimate, e.g. 4.2, 0.5, 0.8, 0.4, 1.0, 5.7 and 4.3, 0.5, 0.7, 0.3, 0.9, 5.7; the estimates differ only slightly from one worker to the next.]
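
A minimal sketch of the worker / parameter server pattern pictured above, simulated in a single process. It assumes that the server averages the workers' gradient estimates before each step; the toy loss, `TARGET`, and all function names are hypothetical.

```python
import random

# Hypothetical optimum of a toy quadratic loss; not from the slides.
TARGET = [1.0, -2.0, 3.0]


def worker_gradient(params, rng, noise=0.1):
    """One worker's noisy gradient estimate at the server's current parameters."""
    return [2.0 * (p - t) + rng.gauss(0.0, noise) for p, t in zip(params, TARGET)]


def average(gradients):
    """Coordinate-wise mean of a list of equally sized vectors."""
    n = len(gradients)
    return [sum(column) / n for column in zip(*gradients)]


def distributed_sgd(n_workers=6, learning_rate=0.1, steps=200, seed=0):
    rngs = [random.Random(seed + i) for i in range(n_workers)]  # one noise source per worker
    params = [0.0] * len(TARGET)                                # held by the parameter server
    for _ in range(steps):
        gradients = [worker_gradient(params, rng) for rng in rngs]  # workers -> server
        aggregated = average(gradients)                             # server aggregates
        params = [p - learning_rate * g for p, g in zip(params, aggregated)]
    return params


model = distributed_sgd()
```

Averaging reduces the noise of the individual estimates, which is why it is the natural default when every worker is honest.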

slide-25
SLIDE 25

Distributed, Byzantine SGD

[Figure: the same workers, network and parameter server (~1 to 100 millions). During the animation most workers send similar estimates (e.g. 4.2, 0.5, 0.8, 0.4, 1.0, 5.7), while Byzantine workers send arbitrary ones (412, 153, 824, 349, 752, 537).]
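
To see why this matters, the sketch below averages the vectors from the figure, with the values taken as they appear in this transcript. Treating them as gradients that the server averages is an assumption, and the variable names are hypothetical.

```python
# Estimates sent by four honest workers (values as shown in the figure).
honest = [
    [4.2, 0.5, 0.8, 0.4, 1.0, 5.7],
    [4.3, 0.5, 0.7, 0.3, 0.9, 5.7],
    [4.2, 0.5, 0.9, 0.2, 1.0, 5.7],
    [4.1, 0.5, 0.8, 0.3, 1.0, 5.7],
]
# Two Byzantine workers send the same arbitrary vector.
byzantine = [[412.0, 153.0, 824.0, 349.0, 752.0, 537.0]] * 2


def average(gradients):
    """Coordinate-wise mean of a list of equally sized vectors."""
    n = len(gradients)
    return [sum(column) / n for column in zip(*gradients)]


clean = average(honest)                  # first coordinate stays near 4.2
poisoned = average(honest + byzantine)   # first coordinate jumps above 100
```

Because the mean is linear in every input, a single Byzantine worker can already move the average to any vector it likes by sending a large enough value.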

slide-28
SLIDE 28

Byzantine-resilient SGD

[Figure: the parameter server aggregates the vectors it received: honest ones (e.g. 4.2, 0.5, 0.8, 0.4, 1.0, 5.7 and 4.1, 0.5, 0.7, 0.3, 1.0, 5.7) and a Byzantine one (412, 153, 824, 349, 752, 537). With "Average" the output shown is 412, 153, 824, 349, 752, 537; with a robust rule the output shown is 4.1, 0.5, 0.7, 0.3, 1.0, 5.7.]

Aggregation rules shown in place of Average: Krum, Median, Bulyan, GeoMed, MDA. The last frame keeps only MDA.
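
Two of the rules named above can be sketched briefly. The code assumes the usual definitions: the coordinate-wise median, and MDA (minimum-diameter averaging), which averages the subset of n - f vectors with the smallest diameter. It is an illustrative, brute-force version rather than the authors' implementation; the function names and the choice f = 2 are hypothetical, and the input values are those appearing in the figure.

```python
import itertools
import math
import statistics


def coordinate_median(gradients):
    """Median taken independently on each coordinate."""
    return [statistics.median(column) for column in zip(*gradients)]


def diameter(vectors):
    """Largest Euclidean distance between any two vectors of the set."""
    return max(
        (math.dist(a, b) for a, b in itertools.combinations(vectors, 2)),
        default=0.0,
    )


def mda(gradients, f):
    """Average the n - f vectors that form the tightest cluster (exhaustive search)."""
    n = len(gradients)
    best = min(itertools.combinations(gradients, n - f), key=diameter)
    return [sum(column) / len(best) for column in zip(*best)]


honest = [
    [4.2, 0.5, 0.8, 0.4, 1.0, 5.7],
    [4.3, 0.5, 0.7, 0.3, 0.9, 5.7],
    [4.2, 0.5, 0.9, 0.2, 1.0, 5.7],
    [4.1, 0.5, 0.8, 0.3, 1.0, 5.7],
]
byzantine = [[412.0, 153.0, 824.0, 349.0, 752.0, 537.0]] * 2
gradients = honest + byzantine

robust_median = coordinate_median(gradients)
robust_mda = mda(gradients, f=2)
```

Both outputs stay next to the honest vectors. The exhaustive subset search is exponential in the number of workers, so this form of MDA is only practical for small n.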

slide-29
SLIDE 29

Problem

single point of failure

slide-32
SLIDE 32

Problem… solution… nope

  • Solution: Byzantine Consensus
  • Nope: asynchronous network

slide-44
SLIDE 44

Key problem: divergence

[Figure: animation over nodes labeled A, B, C, D and 1, 2, 3.]

slide-45
SLIDE 45

The goal

Can we keep the models (three copies, ~1 to 100 millions each) "close" to each other...
...despite network asynchrony...
...and Byzantine behaviors?

slide-46
SLIDE 46

Key approach

Can we bring the models (three copies, ~1 to 100 millions each) back closer to each other...
...despite network asynchrony...
...and Byzantine behaviors?

slide-47
SLIDE 47

Key approach: +1 round

[Figure: nodes labeled A, B, C, D and 1, 2, 3, with one extra communication round added.]

slide-54
SLIDE 54

Key approach: toy example

[Figure: nodes 1, 2, 3, 4 & one further node (presumably Byzantine), each holding a 1-parameter model. The spread of the models is marked "diameter"; after the extra round it is marked "reduced diameter".]
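
One way such a diameter reduction can come about is sketched below. The slides only show that the diameter shrinks, so the rule used here is an assumption: every honest node replaces its 1-parameter model by the median of the values it received. The sketch also assumes that each honest node hears from all five nodes, which sidesteps the asynchrony the talk is concerned with. The parameter values and function names are hypothetical.

```python
import statistics


def contraction_round(honest_params, byzantine_values):
    """One extra round: each honest node takes the median of what it received.

    byzantine_values[i] is the (arbitrary) value the Byzantine node sends to
    honest node i; it may tell every recipient something different.
    """
    return [
        statistics.median(honest_params + [lie])
        for lie in byzantine_values
    ]


def diameter(values):
    """Spread of a set of 1-parameter models."""
    return max(values) - min(values)


before = [1.0, 2.0, 3.0, 4.0]            # hypothetical models of nodes 1..4
lies = [-100.0, 100.0, -100.0, 100.0]    # Byzantine node lies differently to each node
after = contraction_round(before, lies)  # [2.0, 3.0, 2.0, 3.0]
```

Here the diameter drops from 3.0 to 1.0. With four honest values and one Byzantine value, the median of the five always lies between the second and third smallest honest values, so no lie can push an honest node outside the honest range.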

slide-57
SLIDE 57

Key approach: last remark

[Figure: the toy example with nodes 1, 2, 3, 4 & one; during the animation a "×2" annotation appears next to node 1 and then next to nodes 2, 3 and 4.]