

SLIDE 1

High Performance Machine Learning: Advances, Challenges and Opportunities

Eduardo Rodrigues
Lecture @ ERAD-RS, April 11th, 2019

SLIDE 2

IBM Research

SLIDE 3

ADVANCES

SLIDE 4

Artificial Intelligence

Deep Blue (1997)

SLIDE 5

AI and Machine Learning

AI ML

SLIDE 6

Jeopardy (2011)

SLIDE 7

Debater

https://www.youtube.com/watch?v=UeF_N1r91RQ

SLIDE 8

Machine Learning is becoming central to all industries

◮ Nine out of 10 executives from around the world describe AI as important to solving their organizations’ strategic challenges.
◮ Over the next decade, AI enterprise software revenue will grow from $644 million to nearly $39 billion.
◮ Services-related revenue should reach almost $150 billion.

SLIDE 9

AI identifies which primates could be carrying the Zika virus

SLIDE 10

Biophysics-Inspired AI Uses Photons to Help Surgeons Identify Cancer

SLIDE 11

IBM takes on Alzheimer’s disease with machine learning

SLIDE 12

Seismic Facies Segmentation Using Deep Learning

SLIDE 13

Crop detection

SLIDE 14

Automatic Citrus Tree Detection from UAV Images

SLIDE 15

Agropad

https://www.youtube.com/watch?v=UYVc0TeuK-w

SLIDE 16

HPC and ML/AI

◮ As data abounds, deeper and more complex models are developed
◮ These models have many parameters and hyperparameters to tune
◮ A cycle of train, test and adjust is repeated many times before good results are achieved
◮ Speeding up this exploratory cycle improves productivity
◮ Parallel execution is the solution

SLIDE 17

Basics: deep learning sequential execution

Training basics:
◮ loop over mini-batches and epochs
◮ forward propagation
◮ compute loss
◮ backward propagation (gradients)
◮ update parameters

$L = \frac{1}{N_{bs}} \sum_i L_i, \qquad \frac{\partial L_i}{\partial W_n}$
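The sequential loop above can be sketched in plain Python for a one-parameter model. The dataset, model and hyperparameters below are illustrative only, not taken from the talk:

```python
import random

# Toy dataset for the target y = 3*x (hypothetical example data).
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]]

def train(samples, epochs=50, batch_size=4, lr=0.01, seed=0):
    rng = random.Random(seed)
    w = 0.0  # single model parameter
    for _ in range(epochs):                           # loop over epochs
        rng.shuffle(samples)
        for i in range(0, len(samples), batch_size):  # loop over mini-batches
            batch = samples[i:i + batch_size]
            preds = [w * x for x, _ in batch]         # forward propagation
            # compute loss: L = (1/N_bs) * sum_i L_i, with L_i = (w*x_i - y_i)^2
            loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
            # backward propagation: average of dL_i/dw over the mini-batch
            grad = sum(2 * (p - y) * x
                       for p, (x, y) in zip(preds, batch)) / len(batch)
            w -= lr * grad                            # update parameters
    return w

w = train(list(data))
```

After training, `w` should be close to the true slope of 3.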

SLIDE 18

Parallel execution

single node, multi-GPU system

There are many ways to divide the deep neural network. The most common strategy is to divide mini-batches across GPUs:

◮ The model is replicated across GPUs
◮ Data is divided among them
◮ Two possible approaches: non-overlapping division, or shuffled division
◮ Each GPU computes forward, cost and mini-batch gradients
◮ Gradients are then averaged and stored in a shared space (visible to all GPUs)
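The data-parallel scheme above can be simulated in a few lines of Python. The model, data and replica count are hypothetical; the point is that averaging per-replica gradients over equal, non-overlapping shards reproduces the full-batch gradient:

```python
# One full batch of (x, y) pairs for a model y ≈ w*x with squared loss.
full_batch = [(float(x), 2.0 * x) for x in range(1, 9)]
w = 0.5
N_GPUS = 4

def minibatch_grad(batch, w):
    # Forward + backward on one replica: mean of dL_i/dw over its shard.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Non-overlapping division of the data across the model replicas ("GPUs").
shards = [full_batch[i::N_GPUS] for i in range(N_GPUS)]

# Each replica computes its own mini-batch gradient...
local_grads = [minibatch_grad(s, w) for s in shards]

# ...then the gradients are averaged, as if stored in a space visible to all GPUs.
avg_grad = sum(local_grads) / len(local_grads)

# With equal-sized shards this matches the gradient of the full batch.
full_grad = minibatch_grad(full_batch, w)
```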

SLIDE 19

Parallelization strategies

multi-node

One can use a similar strategy with multiple nodes, but it requires communication across nodes. There are two strategies:

◮ Asynchronous
◮ Synchronous

SLIDE 20

Synchronous

◮ Can be implemented with high-efficiency protocols
◮ No need to exchange variables
◮ Faster in terms of time to quality

SLIDE 21

DDL - Distributed Deep Learning

◮ We use a mesh/tori-like reduction
◮ Earlier dimensions need more bandwidth (BW) to transfer
◮ Later dimensions need less bandwidth to transfer
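A toy version of this idea reduces gradients along one grid dimension at a time. The grid shape and vector values below are hypothetical; the comments note where each phase moves more or less data:

```python
# 2-D grid of nodes, each holding a local gradient vector (toy values).
ROWS, COLS = 2, 4
grid = [[[float(r * COLS + c), 1.0] for c in range(COLS)] for r in range(ROWS)]

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

# Phase 1: reduce along the first dimension (within each row).
# Every node's full gradient vector travels in this phase, so the
# earlier dimension needs the most bandwidth.
row_sums = []
for r in range(ROWS):
    total = grid[r][0]
    for c in range(1, COLS):
        total = vec_add(total, grid[r][c])
    row_sums.append(total)

# Phase 2: reduce along the second dimension (across rows).
# Only one partial sum per row moves, so the later dimension
# needs less bandwidth.
global_sum = row_sums[0]
for r in range(1, ROWS):
    global_sum = vec_add(global_sum, row_sums[r])

# A final broadcast would leave every node with the same reduced gradient.
```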

SLIDE 22

Hierarchical communication (1)

SLIDE 23

Hierarchical communication (2)

Reduce example

This shows a single example of a communication pattern that benefits from hierarchical communication.

More bandwidth at the beginning

SLIDE 24

Hierarchical communication (2)

Reduce example

This shows a single example of a communication pattern that benefits from hierarchical communication.

Progressively less bandwidth is required


SLIDE 26

Seismic Segmentation Models based on DNNs

A symbiotic partnership

◮ Deep Neural Networks have become the main tool for visual recognition
◮ They have also been used by seismologists to help interpret seismic data
◮ Relevant training examples may be sparse
◮ Training these models may take very long
◮ Parallel execution speeds up training

SLIDE 27

Seismic Segmentation Models based on DNNs

Challenges

◮ Current deep learning models (AlexNet, VGG, Inception) do not fit the task well
◮ They are too big
◮ There is little data (compared to traditional visual recognition tasks)
◮ Data pre-processing forces the model’s input to be smaller
◮ Parallel execution strategies proposed in the literature are not appropriate

SLIDE 28

What is the recommendation?

SLIDE 29

Traditional technique

SLIDE 30

Traditional technique

SLIDE 31

Traditional technique pitfalls

Key assumptions are:

◮ the full batch is very large
◮ the effective mini-batch is still a small fraction of the full batch

A hidden assumption is that small full batches don’t need to run in parallel.

SLIDE 32

ImageNet is not the only workload that can benefit from parallel execution

SLIDE 33

weak scaling, strong scaling

SLIDE 34

weak scaling, strong scaling
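The two regimes can be made concrete with a small sketch. The base batch sizes below are hypothetical: strong scaling keeps the global mini-batch fixed and splits it across GPUs, while weak scaling keeps the per-GPU mini-batch fixed, so the effective global batch grows.

```python
def batch_sizes(base_global, base_per_gpu, gpus):
    # Strong scaling: the global mini-batch (total work per step) is fixed,
    # so each GPU processes a smaller slice as GPUs are added.
    strong_per_gpu = base_global // gpus
    # Weak scaling: the per-GPU mini-batch is fixed, so the effective
    # global batch grows with the number of GPUs.
    weak_global = base_per_gpu * gpus
    return strong_per_gpu, weak_global

# Hypothetical base sizes: global batch 256, per-GPU batch 32.
results = {n: batch_sizes(256, 32, n) for n in (2, 4, 8)}
```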

SLIDE 35

Our experiments (1)

Time to run 200 epochs

[Plot: execution time (s), 500 to 2500, vs. number of GPUs (2, 4, 8), for strong and weak scaling]
SLIDE 36

Our experiments (1)

Time to run 200 epochs

[Plot: execution time (s) vs. number of GPUs (2, 4, 8), for strong and weak scaling]

Intersection over union

[Plot: IOU (0.0 to 0.7) vs. epochs (25 to 200), for strong and weak scaling with 2, 4 and 8 GPUs]
SLIDE 37

Our experiments (2)

Time to reach 60% IOU

[Plot: execution time (s), 2500 to 17500, vs. number of GPUs (2, 4, 8), for strong and weak scaling]
SLIDE 38

Our experiments (2)

Time to reach 60% IOU

[Plot: execution time (s) vs. number of GPUs (2, 4, 8), for strong and weak scaling]

Intersection over union

[Plot: IOU (0.0 to 0.7) vs. epochs (up to 2000), for strong and weak scaling with 2, 4 and 8 GPUs]
SLIDE 39

HPC AI

SLIDE 40

HPC AI

SLIDE 41

Motivation

◮ End-users must specify several parameters in their job submissions to the queue system, e.g.:
  ◮ Number of processors
  ◮ Queue / Partition
  ◮ Memory requirements
  ◮ Other resource requirements
◮ Those parameters have a direct impact on the job turnaround time and, more importantly, on the total system utilization
◮ Frequently, end-users are not aware of the implications of the parameters they use
◮ The system log keeps valuable information that can be leveraged to improve parameter choice

SLIDE 42

Related work

◮ Karnak has been used in XSEDE to predict waiting time and runtime
◮ Useful for users to plan their experiments
◮ The method may not apply well to other job parameters, for example memory requirements

[Diagram: instance-based prediction. A query q with neighborhood E(q) containing neighbors p; points x in the knowledge base carry features fa, fb and a label; distance D(q, x) = D(x, q) = d]
SLIDE 43

Memory requirements

◮ The system owner wants to maximize utilization
◮ Users may not specify memory precisely
◮ Log data can provide training examples for a machine learning approach to predicting memory requirements
◮ This can be seen as a supervised learning task
◮ We have a set of features (e.g. user id, cwd, command parameters, submission time, etc.)
◮ We want to predict memory requirements (label)
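In the spirit of the instance-based learners mentioned above, the task can be sketched as a nearest-neighbor lookup over past jobs. The log records, feature names and distance function below are entirely hypothetical, not the method or data from the talk:

```python
# Hypothetical job-log records: (features, peak memory in GB).
history = [
    ({"user": "alice", "queue": "short", "nprocs": 4},  8.0),
    ({"user": "alice", "queue": "short", "nprocs": 8},  16.0),
    ({"user": "bob",   "queue": "long",  "nprocs": 4},  32.0),
    ({"user": "bob",   "queue": "long",  "nprocs": 16}, 64.0),
]

def distance(a, b):
    # Toy distance: count categorical mismatches, add a scaled numeric term.
    d = sum(a[k] != b[k] for k in ("user", "queue"))
    d += abs(a["nprocs"] - b["nprocs"]) / 16.0
    return d

def predict_memory(query, history):
    # Instance-based prediction: use the label of the closest past job.
    _, mem = min(((distance(query, f), m) for f, m in history),
                 key=lambda t: t[0])
    return mem

# A new submission is matched against the knowledge base of past jobs.
estimate = predict_memory({"user": "alice", "queue": "short", "nprocs": 5},
                          history)
```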

SLIDE 44

The Wisdom of Crowds

There are many learning algorithms available, e.g. classification trees, neural networks, instance-based learners, etc. Instead of relying on a single algorithm, we aggregate the predictions of several methods: "Aggregating the judgment of many consistently beats the accuracy of the average member of the group."
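The aggregation step can be as simple as a majority vote over the individual predictions. The learners and their votes below are hypothetical; this is only a sketch of the idea:

```python
from collections import Counter

def aggregate(predictions):
    # Majority vote across the predictions of several learners;
    # ties are broken by whichever value was counted first.
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical learners predict a memory bucket for the same job:
votes = ["8GB", "16GB", "8GB"]
winner = aggregate(votes)
```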

SLIDE 45

Comparison between mode and poll

x86 system

[Bar chart: prediction accuracy per segment (1 to 4) in the x86 system, comparing mode and poll; accuracies range from 0.602 to 0.909]

Prediction performance in the x86 system (mode vs. poll)

SLIDE 46

CHALLENGES

SLIDE 47

Is the singularity really near?

Nick Bostrom - Superintelligence
Yuval Noah Harari - 21 Lessons for the 21st Century

SLIDE 48

Employment

SLIDE 49

Employment

SLIDE 50

Flexibility and care

Kai-Fu Lee - AI Superpowers: China, Silicon Valley, and the New World Order

SLIDE 51

Knowledge

https://xkcd.com/1838/

SLIDE 52

http://tylervigen.com/view_correlation?id=359

SLIDE 53

http://tylervigen.com/view_correlation?id=1703

SLIDE 54

https://xkcd.com/552/

SLIDE 55

Judea Pearl - The Book of Why
Pedro Domingos - The Master Algorithm

SLIDE 56

OPPORTUNITIES

SLIDE 57

AI

SLIDE 58

AI HPC

SLIDE 59

AI HPC App

SLIDE 60

AI HPC Agri

SLIDE 61

IBM Cloud

SLIDE 62

IBM to launch AI research center in Brazil

SLIDE 63

HPML 2019

High Performance Machine Learning Workshop

@ IEEE/ACM CCGrid - Cyprus

http://hpml2019.github.io