  1. Hi! CS 744: PYTORCH. Shivaram Venkataraman, Fall 2020

  2. ADMINISTRIVIA Assignment 2 out! Due Oct 1. Bid on topics, submit group (1 sentence) – Oct 5 (Monday, next week). Project Proposal (2 pages) – Oct 16, via Piazza: Introduction, Related Work, Timeline (with eval plan).

  3. Applications (MapReduce, Machine Learning, SQL, Streaming, Graph) → Computational Engines (Spark) → Scalable Storage Systems → Resource Management (Mesos, DRF) → Datacenter Architecture

  4. EMPIRICAL RISK MINIMIZATION Given training data (examples) and their labels, fit a function (the model) to the data, with regularization.
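The slide names the pieces; written out, the standard formulation (not printed on the slide) is:

```latex
\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i; w),\, y_i\big) \;+\; \lambda R(w)
```

Here the (x_i, y_i) are training examples with their labels, f is the model with parameters w, ℓ is the loss measuring fit, and R is the regularizer.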

  5. DEEP LEARNING Example: a convolutional network such as ResNet-18, built from a stack of layers: Convolution → ReLU → MaxPool → ... → Fully Connected → SoftMax.
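A minimal PyTorch sketch of the layer stack on the slide (the dimensions below are illustrative, not taken from the slide):

```python
import torch
import torch.nn as nn

# Tiny CNN following the slide's layer stack; all sizes are illustrative.
class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # Convolution
            nn.ReLU(),                                   # ReLU
            nn.MaxPool2d(2),                             # MaxPool
        )
        self.fc = nn.Linear(16 * 16 * 16, num_classes)   # Fully Connected

    def forward(self, x):                 # x: (batch, 3, 32, 32)
        x = self.features(x)              # -> (batch, 16, 16, 16)
        x = torch.flatten(x, 1)
        return self.fc(x)                 # SoftMax is folded into CrossEntropyLoss

logits = TinyConvNet()(torch.randn(4, 3, 32, 32))  # batch of 4 fake images
```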

  6. STOCHASTIC GRADIENT DESCENT Goal: a good fit to the training data. Initialize w. For many iterations: Loss = forward pass f(w, input); Gradient = backward pass (chain rule); Update the model. How do we parallelize this? The model is shared, and every iteration depends on the previous one.
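The loop on the slide maps directly onto PyTorch; a minimal sketch with a placeholder model and fake data:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                              # Initialize w
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for step in range(100):                               # For many iterations:
    x, y = torch.randn(64, 10), torch.randn(64, 1)    # fake minibatch
    loss = loss_fn(model(x), y)                       # Loss = forward pass f(w, input)
    opt.zero_grad()
    loss.backward()                                   # Gradient = backward pass (chain rule)
    opt.step()                                        # Update model: w <- w - lr * grad
```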

  7. DATA PARALLEL MODEL TRAINING Parallelize one iteration: partition the data points of a batch B (e.g., 256) into B1, ..., B4 of 64 each; each worker computes a forward pass f(model, Bi) and a gradient on its own copy of the model; average all the gradients, take one update step, then move to the next iteration.
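The slide's picture as a single-process simulation: split a 256-example batch across four model replicas (64 each), compute per-replica gradients, then average before one update. (Real systems do the averaging with AllReduce across workers.)

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
batch_x, batch_y = torch.randn(256, 10), torch.randn(256, 1)

# Each "worker" gets a 64-example shard and computes gradients on its own copy.
grads = []
for shard_x, shard_y in zip(batch_x.chunk(4), batch_y.chunk(4)):
    replica = copy.deepcopy(model)
    loss_fn(replica(shard_x), shard_y).backward()
    grads.append([p.grad for p in replica.parameters()])

# Average the gradients across replicas (the AllReduce step), then update once.
with torch.no_grad():
    for p, *worker_grads in zip(model.parameters(), *grads):
        p -= 0.1 * torch.stack(worker_grads).mean(dim=0)
```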

  8. COLLECTIVE COMMUNICATION MPI-style primitives: Broadcast, Scatter, Gather, Reduce(data, root). Example: Reduce sums a value from each process onto the root (e.g., 5 + 2 + 7 + 4). From https://mpitutorial.com/tutorials/
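PyTorch exposes these MPI-style collectives through torch.distributed. A sketch of the slide's Reduce example (summing one value per process onto a root), assuming a process group has already been initialized:

```python
import torch
import torch.distributed as dist

def demo(rank: int):
    # Assumes dist.init_process_group(...) has already run in each process.
    data = torch.tensor([float(rank)])               # each rank contributes a value

    dist.reduce(data, dst=0, op=dist.ReduceOp.SUM)   # root (rank 0) holds the sum
    if rank == 0:
        print("sum over ranks:", data.item())

    flag = torch.tensor([42.0]) if rank == 0 else torch.zeros(1)
    dist.broadcast(flag, src=0)                      # now every rank holds 42
```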

  9. ALL REDUCE Ring AllReduce across processes P0, ..., P3: every process ends up holding the reduced result. From https://mpitutorial.com/tutorials/
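A plain-Python simulation of what the ring picture does: each of p processes holds p chunks; a reduce-scatter phase of p - 1 steps leaves each process with one fully summed chunk, and an all-gather phase of p - 1 steps circulates the finished chunks. (This mirrors the mpitutorial diagram; chunk indexing conventions vary.)

```python
import copy

p = 4
# Rank r starts with its own vector of p chunks (scalars here for simplicity).
vecs = [[r * 10 + c for c in range(p)] for r in range(p)]
expected = [sum(r * 10 + c for r in range(p)) for c in range(p)]

# Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) % p to the
# next rank, which adds it in. The snapshot models sends happening in parallel.
for s in range(p - 1):
    snap = copy.deepcopy(vecs)
    for r in range(p):
        c = (r - s) % p
        vecs[(r + 1) % p][c] += snap[r][c]

# Rank r now owns the fully reduced chunk (r + 1) % p.
# Phase 2, all-gather: circulate the finished chunks around the ring.
for s in range(p - 1):
    snap = copy.deepcopy(vecs)
    for r in range(p):
        c = (r + 1 - s) % p
        vecs[(r + 1) % p][c] = snap[r][c]

assert all(v == expected for v in vecs)   # every rank holds the full sum
```

Each process sends and receives only about 2n/p data per step, so total traffic per process stays O(n) regardless of p.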

  10. DISTRIBUTED DATA PARALLEL API Only one line of code changes: wrap the local model. Non-intrusive; hooks perform the optimizations in the background.
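The one-line change the slide refers to: wrap the local model in DistributedDataParallel. A sketch of one worker process, assuming the process group is already initialized (e.g., via torchrun):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(local_rank: int):
    # Assumes dist.init_process_group(backend="nccl") has already run.
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 1).cuda()
    ddp_model = DDP(model)                        # the one-line change

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    x, y = torch.randn(64, 10).cuda(), torch.randn(64, 1).cuda()
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    opt.zero_grad()
    loss.backward()      # DDP's backward hooks AllReduce gradients here
    opt.step()
```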

  11. GRADIENT BUCKETING (model with ~60M parameters) Why do we need gradient bucketing? Small tensor sizes lead to greater total AllReduce time, since every AllReduce call pays a fixed latency/handoff overhead. Why not one big bucket? Then we would have to wait for all gradients to be ready before the AllReduce could start, so communication could not overlap with the backward pass.
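The trade-off is easiest to see with the usual latency-bandwidth cost model (an assumption; the slide argues this informally):

```latex
T_{\text{allreduce}}(n) \approx \alpha + \beta n
```

Sending the gradients as k separate tensors of total size n costs roughly kα + βn, so the fixed latency α is paid k times. One giant tensor pays α once but cannot start until the very last gradient is ready, killing overlap. Buckets sit between the two extremes.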

  12. GRADIENT BUCKETING + ALL REDUCE Parameters (layers) become buckets, 25 MB in size by default. As gradients become ready, we start an AllReduce on them in the background while the gradient computation continues.
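A hand-rolled illustration of the mechanism (DDP does this internally through autograd hooks; this is not its actual code): launch each bucket's AllReduce asynchronously as it becomes ready, keep computing, and wait on the handles at the end.

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already run.
buckets = [torch.randn(1000) for _ in range(4)]    # stand-ins for gradient buckets

handles = []
for bucket in reversed(buckets):                   # gradients become ready back-to-front
    handles.append(dist.all_reduce(bucket, async_op=True))  # runs in background
    # ... backward pass keeps computing earlier layers' gradients here ...

for h in handles:
    h.wait()                                       # ensure all AllReduces finished
for bucket in buckets:
    bucket /= dist.get_world_size()                # turn sums into averages
```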

  13. GRADIENT ACCUMULATION With no_sync, each worker accumulates gradients locally over several micro-batches (B1, B2, ...) and skips the AllReduce; the AllReduce runs only on the final micro-batch.
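The no_sync API in code, continuing the DDP sketch above (ddp_model, opt, loss_fn, and micro_batches are placeholders): skip the AllReduce on all but the last micro-batch.

```python
# Gradient accumulation with DDP: communicate once per group of micro-batches.
opt.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    if i < len(micro_batches) - 1:
        with ddp_model.no_sync():                 # accumulate locally, no AllReduce
            loss_fn(ddp_model(x), y).backward()
    else:
        loss_fn(ddp_model(x), y).backward()       # AllReduce fires on the last one
opt.step()
```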

  14. IMPLEMENTATION bucket_cap_mb: tunable parameter, ~25 MB by default; too small → fixed overhead dominates, too large → no overlap. Parameter-to-bucket mapping: buckets are filled roughly in reverse order of model.parameters(), i.e., the order in which gradients become ready during the backward pass. Round-robin ProcessGroups spread AllReduce traffic across groups.
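The knob from the slide is exposed directly on the constructor; a sketch (model as in the earlier DDP example):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# bucket_cap_mb defaults to 25 MB.
# Too small: many AllReduce calls, fixed per-call overhead dominates.
# Too large: the first AllReduce starts late, little compute/comm overlap.
ddp_model = DDP(model, bucket_cap_mb=25)
```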

  15. BREAKDOWN

  16. SUMMARY PyTorch: framework for deep learning. DistributedDataParallel API. Gradient bucketing, AllReduce. Overlap computation and communication.

  17. DISCUSSION https://forms.gle/6xhVBNBhdzsJ6gBE6

  18. Discussion notes: the system scales well; the optimal bucket size depends on the setup (network, model size); NCCL performs better, with less variance.

  19. Does this paper scale well? Weak scaling: keep the per-GPU batch size fixed (B = 64) and increase the number of GPUs. Strong scaling: keep the total batch size fixed (e.g., 256) and increase the number of GPUs.

  20. What could be some challenges in implementing similar optimizations for AllReduce in Apache Spark? Spark workloads have larger datasets, with data spread across worker nodes, so a reduction needs a shuffle operation, which is more expensive than an NCCL AllReduce. Spark reduces through tree-structured stages to the driver rather than a ring, and overlapping computation with communication is hard because tasks compete for the same resources.
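Spark's closest built-in to the tree-reduce point is treeReduce / treeAggregate, which reduce through intermediate stages and deliver the result to the driver rather than to every worker. A PySpark sketch (assumes a live SparkContext sc):

```python
# Tree-structured reduction over 64 partitions. Unlike ring AllReduce, the
# result flows through shuffle stages to the driver, and Spark gives no
# built-in way to overlap this communication with ongoing computation.
rdd = sc.parallelize(range(1000), numSlices=64)
total = rdd.treeReduce(lambda a, b: a + b, depth=2)  # 499500
```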

  21. NEXT STEPS Next class: PipeDream. Assignment 2 is due soon! Project Proposal: groups by Oct 5, 2-pager by Oct 16.
