SLIDE 1

CS 744: PYTORCH

Shivaram Venkataraman Fall 2020

Hi!

SLIDE 2

ADMINISTRIVIA

Assignment 2 out! Due Oct 1
Bid on topics, submit group (1 sentence) – Oct 5 (Monday next week), on Piazza
Project Proposal (2 pages) – Oct 16: Introduction, Related Work, Timeline (with eval plan)
SLIDE 3

Course stack: Applications (Machine Learning, SQL, Streaming, Graph) on top of Computational Engines (e.g., Spark, MapReduce), Resource Management (e.g., Mesos, DRF), Scalable Storage Systems, and Datacenter Architecture.
SLIDE 4

EMPIRICAL RISK MINIMIZATION

Loss Function, Data (Examples), Model, Regularization

Supervised learning: given training data (examples and labels), fit a model f by minimizing the loss on the training data plus a regularization term.
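Written out, a standard form of the ERM objective (the notation here is assumed, not from the slides) is

\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \ell\left(f(x_i; w),\, y_i\right) \;+\; \lambda R(w)

where (x_i, y_i) are the training examples and labels, f is the model with parameters w, \ell is the loss function, and R(w) is the regularizer.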
SLIDE 5

DEEP LEARNING

ResNet18

Convolution ReLU MaxPool Fully Connected SoftMax

Handwritten note: the number on a Fully Connected layer is its output width, e.g., FC-84 maps its input to an [84-dim] vector.
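To make these layer types concrete, here is a minimal PyTorch sketch (the architecture and sizes are invented for illustration; this is not ResNet18):

import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # Convolution
        self.relu = nn.ReLU()                                   # ReLU
        self.pool = nn.MaxPool2d(2)                             # MaxPool
        self.fc = nn.Linear(16 * 16 * 16, num_classes)          # Fully Connected

    def forward(self, x):
        x = self.pool(self.relu(self.conv(x)))   # 3x32x32 -> 16x16x16 per image
        return torch.softmax(self.fc(x.flatten(1)), dim=1)  # SoftMax over classes

probs = TinyConvNet()(torch.randn(8, 3, 32, 32))  # batch of 8 RGB 32x32 images

In practice the SoftMax is usually folded into the loss (nn.CrossEntropyLoss consumes raw logits).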
SLIDE 6

STOCHASTIC GRADIENT DESCENT

Initialize w
For many iterations:
    Loss = forward pass, f(model, input)
    Gradient = backward pass (chain rule)
    Update model
End

Notes: the model is shared across iterations, and every iteration depends on the previous one, so how do we parallelize?
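This loop maps directly onto PyTorch's autograd API; a minimal sketch, assuming model, data_loader, and loss_fn are defined elsewhere:

import torch

opt = torch.optim.SGD(model.parameters(), lr=0.01)  # initialize w
for inputs, labels in data_loader:                  # for many iterations
    loss = loss_fn(model(inputs), labels)           # Loss = forward pass
    opt.zero_grad()
    loss.backward()                                 # Gradient = backward (chain rule)
    opt.step()                                      # update model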
SLIDE 7

DATA PARALLEL MODEL TRAINING

Parallelize one iteration; the next iteration still needs the updated model. Example: a batch of 256 data points is split into four shards B1, B2, B3, B4 of 64 each. Each worker runs the forward pass f(model, Bi) on its own replica of the model and computes gradient(Bi). The gradients are then averaged, and a single update step takes all gradients into account, so every replica starts the next iteration with the same model.
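In symbols (notation assumed for illustration), with K workers and batch shards B_1, ..., B_K, the synchronized update is

w_{t+1} = w_t - \eta \cdot \frac{1}{K} \sum_{k=1}^{K} \nabla_w \, \ell(w_t; B_k)

With equal shard sizes, averaging the shard gradients equals the gradient over the full batch of 256, so every replica computes the same w_{t+1} and stays in sync.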
SLIDE 8

COLLECTIVE COMMUNICATION

Broadcast, Scatter, Gather, Reduce

From https://mpitutorial.com/tutorials/ (MPI collectives)

Notes: unlike point-to-point send/recv, a collective involves every process in the group and typically takes the data and a root rank, e.g., broadcast(data, root). Example: Reduce with sum over the values 5, 2, 7, 4 (one per process) computes 5+2+7+4 = 18 at the root.
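PyTorch exposes the same primitives through torch.distributed; a minimal sketch, assuming a process group of 4 ranks has already been set up with dist.init_process_group:

import torch
import torch.distributed as dist

rank = dist.get_rank()

# Broadcast: after the call, every rank holds rank 0's tensor.
t = torch.tensor([float(rank)])
dist.broadcast(t, src=0)

# Reduce: sum one value per rank (5, 2, 7, 4) onto the root.
x = torch.tensor([[5.0, 2.0, 7.0, 4.0][rank]])
dist.reduce(x, dst=0, op=dist.ReduceOp.SUM)  # rank 0 now holds 18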
SLIDE 9

ALL REDUCE

From https://mpitutorial.com/tutorials/ – Ring AllReduce

Notes: in a ring, each process P_i passes its running partial sum to the next process; for inputs 5, 2, 7, 4 the partial sums grow 5 → 7 → 14 → 18 around the ring, and the final value 18 is then circulated so every process ends with it.
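To make the ring concrete, here is a toy single-process simulation (invented for illustration, not the paper's code) of ring AllReduce over K workers: a reduce-scatter phase in which partial sums travel around the ring, then an all-gather phase that circulates the completed chunks:

def ring_allreduce(data):
    # data[i][c] = chunk c held by worker i; K workers, K chunks each.
    K = len(data)
    # Reduce-scatter: in step s, worker i sends chunk (i - s) % K to its
    # neighbor (i + 1) % K, which adds it to its own copy of that chunk.
    for s in range(K - 1):
        sends = [(i, (i - s) % K, data[i][(i - s) % K]) for i in range(K)]
        for i, c, v in sends:
            data[(i + 1) % K][c] += v
    # Worker i now holds the fully reduced chunk (i + 1) % K.
    # All-gather: circulate the completed chunks around the ring.
    for s in range(K - 1):
        sends = [(i, (i + 1 - s) % K, data[i][(i + 1 - s) % K]) for i in range(K)]
        for i, c, v in sends:
            data[(i + 1) % K][c] = v
    return data

# Four workers contributing 5, 2, 7, 4 in every chunk:
print(ring_allreduce([[5.0] * 4, [2.0] * 4, [7.0] * 4, [4.0] * 4]))
# -> every worker ends with [18.0, 18.0, 18.0, 18.0]

Each worker sends 2(K-1)/K of the data regardless of K, which is why the ring variant is bandwidth-optimal.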
SLIDE 10

DISTRIBUTED DATA PARALLEL API

Only one line of code change: wrap the local model.
Non-intrusive to user code.
Hooks to do optimizations in the background.
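A minimal sketch of that one-line change (process-group setup is standard; local_model is a placeholder already moved to this process's GPU):

import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # one process per GPU
model = DDP(local_model)                 # the one-line change
# The training loop is unchanged: forward, backward, step. DDP's autograd
# hooks run AllReduce on the gradients in the background during backward().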

SLIDE 11

GRADIENT BUCKETING

Why do we need gradient bucketing?

A 60M-parameter model spreads its gradients over many small tensors, and small tensor sizes lead to greater total time for AllReduce: every AllReduce pays a latency cost (fixed overhead) plus handoff time, no matter how small the tensor. Why not one big bucket? Then we must wait for all gradients to be ready, so we cannot overlap the backward pass with AllReduce.
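A standard latency/bandwidth model (an assumption, not from the slides) makes the trade-off explicit. If one AllReduce of n bytes costs

T(n) \approx \alpha + \beta n

then N bytes of gradients split across k tensors cost roughly k\alpha + \beta N: many tiny tensors pay the fixed overhead \alpha k times, while a single giant bucket (k = 1) minimizes overhead but cannot start until the last gradient is ready.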
SLIDE 12

GRADIENT BUCKETING + ALL REDUCE

Parameters (grouped by layers) are assigned to buckets. As buckets become ready, we start AllReduce on them; in the background, the gradient computation continues. Default bucket size = 25 MB.
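A minimal sketch of the overlap mechanism (names invented; this is not PyTorch's internal implementation): when a bucket's gradients are ready, an asynchronous AllReduce is launched while autograd keeps producing gradients for earlier layers:

import torch.distributed as dist

pending = []  # outstanding async AllReduce handles

def on_bucket_ready(bucket):
    # async_op=True returns immediately; the AllReduce (sum) proceeds in the
    # background while autograd keeps filling earlier buckets.
    pending.append(dist.all_reduce(bucket, async_op=True))

def finish_backward():
    # Wait for all outstanding AllReduces before the optimizer step;
    # dividing by the world size to average is omitted here.
    for work in pending:
        work.wait()
    pending.clear()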
SLIDE 13

Gradient Accumulation

Notes: with no_sync, AllReduce is skipped while gradients from micro-batches B1, B2, B3 accumulate locally; AllReduce runs only on the final micro-batch B4, just before the optimizer step.
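DDP exposes this through its no_sync() context manager; a minimal sketch with placeholder micro-batches b1..b4 and a placeholder loss_fn:

# Accumulate gradients locally over the first micro-batches ...
with model.no_sync():                    # model is a DDP-wrapped module
    for inputs, labels in [b1, b2, b3]:
        loss_fn(model(inputs), labels).backward()   # no AllReduce fired
# ... and synchronize once, on the last micro-batch.
loss_fn(model(b4_inputs), b4_labels).backward()     # AllReduce runs here
opt.step()
opt.zero_grad()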

SLIDE 14

IMPLEMENTATION

bucket_cap_mb: a tunable parameter. Small buckets → more fixed overhead per AllReduce; large buckets → no overlap with the backward pass; the middle ground (the default) is 25 MB.
Parameter-to-bucket mapping: buckets are filled in the order gradients become ready during the backward pass (roughly reverse layer order).
Round-robin ProcessGroups: rotate AllReduce calls across multiple process groups so that several AllReduce operations can be in flight at once.
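bucket_cap_mb is exposed on the DDP constructor; a minimal sketch of tuning it (local_model is a placeholder):

from torch.nn.parallel import DistributedDataParallel as DDP

# Default is 25 MB: smaller buckets start AllReduce sooner but pay more
# fixed overhead; larger buckets amortize overhead but reduce overlap.
model = DDP(local_model, bucket_cap_mb=25)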
SLIDE 15

BREAKDOWN

SLIDE 16

SUMMARY

PyTorch: framework for deep learning
DistributedDataParallel API
Gradient bucketing, AllReduce
Overlap computation and communication

SLIDE 17

DISCUSSION

https://forms.gle/6xhVBNBhdzsJ6gBE6

SLIDE 18

Discussion notes: the system scales well. The optimal bucket size depends on the setup. NCCL is more performant, and its variance across runs is lower.
SLIDE 19

This paper scales well!? Weak scaling vs. strong scaling: weak scaling fixes the per-GPU batch size (B = 64) and increases the number of GPUs; strong scaling fixes the global batch size (B = 256) and splits it across more GPUs.
SLIDE 20

What could be some challenges in implementing similar optimizations for AllReduce in Apache Spark?

Discussion notes: Spark targets larger data workloads. Each worker node in Spark holds a partition of the dataset, so aggregating gradients requires a shuffle operation, which is more expensive than NCCL's ring or tree reduce. Overlapping compute and communication is also harder to express in Spark's task model.
SLIDE 21

NEXT STEPS

Next class: PipeDream
Assignment 2 is due soon!
Project Proposal: Groups by Oct 5, 2-pager by Oct 16
