CS 744: PYTORCH
Shivaram Venkataraman Fall 2020
Hi!
ADMINISTRIVIA
Assignment 2 out! Due 10/5 (Monday next week)
Bid on topics, submit group (1 sentence) – Oct 5
Project Proposal (2 pages) – Oct 16
    Introduction
    Related Work
    Timeline (with eval plan)
Recap: MapReduce, Spark, Mesos, DRF

EMPIRICAL RISK MINIMIZATION
min_w Σᵢ loss(f(xᵢ; w), yᵢ) + R(w)
    loss Function · Data (Examples) · Model · Regularization
Supervised learning: given training data and labels, fit a model

DEEP LEARNING
ResNet18
Convolution, ReLU, MaxPool, Fully Connected, SoftMax
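As a rough illustration, a toy stack of the listed layer types in PyTorch; the dimensions below are placeholders, not the real ResNet-18 architecture:

```python
import torch.nn as nn

# A toy stack of the listed layer types (dimensions are illustrative
# placeholders, not the real ResNet-18).
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # Convolution
    nn.ReLU(),                                   # ReLU
    nn.MaxPool2d(2),                             # MaxPool
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),                 # Fully Connected (32x32 inputs)
    nn.Softmax(dim=1),                           # SoftMax
)
```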
STOCHASTIC GRADIENT DESCENT
Initialize w
For many iterations:
    Loss = forward pass
    Gradient = backward pass
    Update model
End
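A minimal sketch of this loop in PyTorch; the model, data, and learning rate here are stand-ins:

```python
import torch

model = torch.nn.Linear(10, 1)                       # initialize w
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):                              # for many iterations
    x, y = torch.randn(64, 10), torch.randn(64, 1)   # a random stand-in batch
    loss = loss_fn(model(x), y)                      # Loss = forward pass
    optimizer.zero_grad()
    loss.backward()                                  # Gradient = backward
    optimizer.step()                                 # update model
```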
How to parallelize? Every iteration depends on the previous model update.
DATA PARALLEL MODEL TRAINING
Partition the data: e.g., 256 data points → 4 batches B1, B2, B3, B4 of 64 each
Each worker runs a forward pass of model w on its batch Bi and computes gradient(Bi)
Average the gradients; the update step takes all gradients into account
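A single-process toy sketch of this idea, with four stand-in batches and manual gradient averaging (real training would run the workers in parallel):

```python
import copy
import torch

model = torch.nn.Linear(10, 1)        # shared model weights w
loss_fn = torch.nn.MSELoss()
# B1..B4: four stand-in batches of 64 examples each (256 total)
batches = [(torch.randn(64, 10), torch.randn(64, 1)) for _ in range(4)]

grads = []
for x, y in batches:                  # each "worker" computes gradient(Bi)
    worker = copy.deepcopy(model)     # every worker starts from the same w
    loss_fn(worker(x), y).backward()
    grads.append([p.grad for p in worker.parameters()])

with torch.no_grad():                 # average gradients, then one update step
    for i, p in enumerate(model.parameters()):
        p -= 0.01 * torch.stack([g[i] for g in grads]).mean(dim=0)
```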
COLLECTIVE COMMUNICATION
Broadcast, Scatter, Gather, Reduce
From https://mpitutorial.com/tutorials/
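A sketch of two of these primitives using torch.distributed; it assumes the script is launched with torchrun (which sets rank/world size) and uses the gloo backend for CPU:

```python
import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=4 collectives_demo.py
dist.init_process_group(backend="gloo")
rank = dist.get_rank()

t = torch.full((4,), float(rank))
dist.broadcast(t, src=0)                     # every rank now holds rank 0's tensor

u = torch.ones(4)
dist.reduce(u, dst=0, op=dist.ReduceOp.SUM)  # rank 0 holds the elementwise sum
```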
ALL REDUCE
Ring All Reduce
From https://mpitutorial.com/tutorials/
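A minimal all-reduce sketch with torch.distributed, under the same launch assumptions as above:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
rank = dist.get_rank()

t = torch.ones(3) * rank                   # each rank starts with a different tensor
dist.all_reduce(t, op=dist.ReduceOp.SUM)   # afterwards every rank holds the sum
print(f"rank {rank}: {t}")                 # with 4 ranks: tensor([6., 6., 6.])
```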
DISTRIBUTED DATA PARALLEL API
Only one line change in user code: wrap the local model; gradient synchronization happens in the background
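A sketch of that one-line change; the model, data, and hyperparameters are placeholders, and gloo is used so the example runs on CPU:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")

model = DDP(torch.nn.Linear(10, 1))    # the one line change: wrap the local model
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(64, 10), torch.randn(64, 1)
optimizer.zero_grad()
loss_fn(model(x), y).backward()        # gradient AllReduce runs during backward
optimizer.step()
```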
GRADIENT BUCKETING
Why do we need gradient bucketing?
A 60M-parameter model is split across many tensors; small tensor sizes lead to greater time spent in AllReduce
Every AllReduce pays a fixed latency cost, plus handoff overhead per call
Why not one big bucket? That means waiting for all gradients to be ready, so communication cannot overlap with computation
GRADIENT BUCKETING + ALL REDUCE
Parameters are grouped into buckets (by layers)
As buckets become ready, we start an AllReduce on each bucket
In the background, gradient computation continues
Default bucket size = 25 MB
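A conceptual sketch of this overlap using async collectives; the "buckets" here are stand-in tensors, not real DDP internals:

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")

# Stand-in gradient buckets (real DDP groups gradients into ~25 MB buckets).
buckets = [torch.randn(1024) for _ in range(4)]

handles = []
for b in reversed(buckets):      # gradients become ready from the last layer back
    handles.append(dist.all_reduce(b, op=dist.ReduceOp.SUM, async_op=True))
    # ... backward computation for earlier layers would continue here ...

for h in handles:                # finally wait for all communication to finish
    h.wait()
```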
GRADIENT ACCUMULATION
Accumulate gradients locally over several micro-batches (B1, B2, B3, ...) instead of synchronizing every backward pass
Use the no_sync() context to skip AllReduce on intermediate micro-batches; AllReduce fires only on the last one
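A sketch of this with DDP's no_sync() context; it reuses the model, loss_fn, and optimizer names from the DDP sketch above, and the micro-batches are placeholders:

```python
import torch

micro_batches = [(torch.randn(64, 10), torch.randn(64, 1)) for _ in range(4)]

optimizer.zero_grad()
for x, y in micro_batches[:-1]:
    with model.no_sync():              # accumulate locally, skip AllReduce
        loss_fn(model(x), y).backward()
x, y = micro_batches[-1]
loss_fn(model(x), y).backward()        # AllReduce fires on this final backward
optimizer.step()
```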
IMPLEMENTATION
bucket_cap_mb
Parameter-to-bucket mapping
Round-robin ProcessGroups
bucket_cap_mb is a tunable parameter (default 25 MB): too small → many AllReduce calls and more fixed latency, too large → no overlap with the backward pass
Parameters are mapped to buckets roughly by layer
All processes must launch AllReduce on buckets in the same order, even though gradient readiness during the backward pass may differ across GPUs
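The knob is a real DDP constructor argument; a minimal sketch (25 is the default value, in MB):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")

# bucket_cap_mb controls the gradient bucket size; 25 MB is the default.
model = DDP(torch.nn.Linear(10, 1), bucket_cap_mb=25)
```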
BREAKDOWN
SUMMARY
PyTorch: framework for deep learning
DistributedDataParallel API
Gradient bucketing, AllReduce
Overlap computation and communication
DISCUSSION
https://forms.gle/6xhVBNBhdzsJ6gBE6
How does performance vary with bucket size? It depends on the model and network: smaller buckets give more overlap but more per-call latency
Weak scaling vs. strong scaling: keep B = 64 per GPU and increase the number of GPUs, vs. fix total B = 256 and split it across more GPUs
What could be some challenges in implementing similar ideas in Spark for larger workloads?
    Each Spark worker node has a partition of the dataset
    Spark needs to shuffle data, which is more expensive than NCCL collectives
    Overlapping compute and communication is harder
NEXT STEPS
Next class: PipeDream
Assignment 2 is due soon!
Project Proposal: groups by Oct 5, 2-pager by Oct 16