Genuinely Distributed Byzantine Machine Learning - El-Mahdi El-Mhamdi - PowerPoint PPT Presentation
first.last@epfl.ch
Genuinely Distributed Byzantine Machine Learning
El-Mahdi El-Mhamdi, Rachid Guerraoui, Arsany Guirguis, Lê Nguyên Hoang, Sébastien Rouault
Swiss Federal Institute of Technology (EPFL)
August 6, 2020
The Big Picture
Machine learning (ML) tackles critical tasks...
...so ML should be made robust.
Literature (starting ~4 years ago): making the model robust when training.
This talk: genuinely distributed, Byzantine ML.
Machine learning (ML)
A model with ~1 to 100 million parameters maps inputs to labels: "Boat", "Goat", ...
(Animation: while the parameters are still untuned, the outputs are garbled, e.g. "Krust", "ZrOm", "Brust", "GOrm".)
Stochastic Gradient Descent (SGD)
~1 to 100 million parameters, e.g. (4.2, -0.5, 0.8, 0.3, -1.0, -5.7)
Training loop:
1. Estimate the gradient
2. Turn the potentiometers (the parameters) following the gradient
3. Loop back to step 1
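The three-step loop above can be sketched in a few lines; `loss_grad`, the dataset, and the quadratic loss are illustrative stand-ins, not part of the talk.

```python
import random

def loss_grad(w, x, y):
    # Gradient of the squared error (w*x - y)^2 w.r.t. the single parameter w.
    return 2 * (w * x - y) * x

random.seed(0)
# Toy dataset generated by y = 3*x, so training should drive w toward 3.
data = [(x, 3 * x) for x in [0.5, 1.0, 1.5, 2.0]]

w = 0.0    # the "potentiometer" (here a single parameter)
lr = 0.05  # learning rate
for step in range(500):
    x, y = random.choice(data)  # 1. sample a point, estimate the gradient
    g = loss_grad(w, x, y)
    w -= lr * g                 # 2. turn the potentiometer along -gradient
                                # 3. loop back to step 1
```

In a real network the single `w` is replaced by the millions of parameters the slide mentions, but the loop is the same.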
Distributed SGD
Setup: workers, a parameter server, and the network in between; the model has ~1 to 100 million parameters.
Each worker sends its own gradient estimate to the server, e.g. (4.2, -0.5, 0.8, 0.4, -1.0, -5.7), (4.3, -0.5, 0.7, 0.3, -0.9, -5.7), (4.1, -0.5, 0.7, 0.3, -1.0, -5.7), ...; the server aggregates them and updates the model.
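A minimal sketch of this parameter-server step, assuming the standard aggregation (a plain coordinate-wise average); function names and the learning rate are illustrative.

```python
def average(gradients):
    # Coordinate-wise mean of the workers' gradient estimates.
    n = len(gradients)
    return [sum(g[i] for g in gradients) / n for i in range(len(gradients[0]))]

def apply_update(params, grad, lr=0.1):
    # The server moves the model against the aggregated gradient.
    return [p - lr * g for p, g in zip(params, grad)]

# Three workers send slightly different estimates of the same gradient
# (each computed on its own mini-batch, as on the slide).
grads = [
    [4.2, -0.5, 0.8, 0.4, -1.0, -5.7],
    [4.3, -0.5, 0.7, 0.3, -0.9, -5.7],
    [4.1, -0.5, 0.7, 0.3, -1.0, -5.7],
]
params = [0.0] * 6
params = apply_update(params, average(grads))
```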
Distributed, Byzantine SGD
Same setup: workers, a parameter server, and the network; ~1 to 100 million parameters.
Correct workers send plausible gradient estimates, e.g. (4.2, -0.5, 0.8, 0.4, -1.0, -5.7); a Byzantine worker can send an arbitrary vector instead, e.g. (412, -153, 824, 349, -752, -537).
Byzantine-resilient SGD
Averaging is not Byzantine-resilient: a single arbitrary vector, e.g. (412, -153, 824, 349, -752, -537), can pull the average anywhere.
Replace the average with a robust gradient aggregation rule: Krum, Median, Bulyan, GeoMed, MDA, ... The aggregate, e.g. (4.1, -0.5, 0.7, 0.3, -1.0, -5.7), stays close ("≈") to the correct workers' gradients.
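One of the rules listed above, the coordinate-wise Median, fits in a few lines; this is an illustration of the idea, not the exact implementation from the paper.

```python
def coordinatewise_median(gradients):
    # Median of each coordinate independently: with a majority of correct
    # workers, every output coordinate is bracketed by correct values.
    def median(xs):
        s = sorted(xs)
        m = len(s) // 2
        return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2
    return [median([g[i] for g in gradients]) for i in range(len(gradients[0]))]

correct = [
    [4.2, -0.5, 0.8, 0.4, -1.0, -5.7],
    [4.1, -0.5, 0.7, 0.3, -1.0, -5.7],
]
byzantine = [[412, -153, 824, 349, -752, -537]]  # arbitrary adversarial vector

agg = coordinatewise_median(correct + byzantine)
# Each coordinate of `agg` lies between the correct workers' values,
# whereas a plain average would be dragged far away by the Byzantine vector.
```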
Problem
The parameter server is a single point of failure.
Problem… solution
Byzantine Consensus
Problem… solution… nope
Byzantine Consensus cannot be solved in an asynchronous network.
Key problem: divergence
(Animation over nodes A, B, C, D and 1, 2, 3: without coordination, the models held at the different nodes progressively drift apart.)
The goal
Can we keep the models (~1 to 100 million parameters each) "close" to each other...
...despite network asynchrony...
...and Byzantine behaviors?
Key approach
Can we bring the models back closer to each other, despite network asynchrony and Byzantine behaviors?
Key approach: +1 round
(Animation over nodes A, B, C, D and 1, 2, 3: one extra communication round.)
Key approach: toy example
Four nodes (1, 2, 3, 4), each holding one 1-parameter model.
The diameter is the largest distance between any two of the models.
(Animation: after the extra round of exchange and aggregation, the diameter is reduced.)
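The toy example can be checked numerically; the aggregation used here, a median over the received models, is an assumed stand-in for the contracting rule of the talk.

```python
def median(xs):
    s = sorted(xs)
    m = len(s) // 2
    return s[m] if len(s) % 2 else (s[m - 1] + s[m]) / 2

models = [1.0, 2.0, 3.0, 4.0]         # one 1-parameter model per node
diameter = max(models) - min(models)  # 3.0

# One extra round: every node gathers all models and takes the median.
# With identical received sets all nodes land on the same value; with
# per-node message subsets the diameter shrinks rather than vanishes.
after = [median(models) for _ in models]
new_diameter = max(after) - min(after)
```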
Key approach: last remark
Starting again from the four 1-parameter models (values 1, 2, 3, 4), multiply each model by ×2: every pairwise distance, and hence the diameter, is multiplied by ×2 as well.
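The remark is simple arithmetic, shown here on the same four 1-parameter models: scaling every model by ×2 (as a training step might) scales the diameter by the same factor, so the contraction from the extra round has to compete with this growth.

```python
models = [1.0, 2.0, 3.0, 4.0]
scaled = [2 * m for m in models]  # the ×2 step from the slide

diameter = max(models) - min(models)         # 3.0
scaled_diameter = max(scaled) - min(scaled)  # 6.0: doubled along with the models
```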