  1. Scaling Deep Learning to 100s of GPUs on Hops Hadoop. Fabio Buso, Software Engineer, Logical Clocks AB

  2. HopsFS: next-generation HDFS. 37x number of files*, 16x throughput**. Scale Challenge Winner (2017). *https://www.usenix.org/conference/fast17/technical-sessions/presentation/niazi **https://eurosys2017.github.io/assets/data/posters/poster09-Niazi.pdf

  3. Hops platform (top to bottom): Projects, Datasets, Users; REST API; Jobs, Grafana, ELK, Jupyter, Zeppelin; Spark, TensorFlow, Hive, Kafka, Flink; HopsFS, HopsYARN, MySQL NDB Cluster. Version 0.3.0 just released!

  4. Python first ● Per-project Conda environment (Python-3.6, pandas-1.4, Numpy-0.9), usable by Spark/TensorFlow ● Install/remove libraries via a Conda repo ● Hops python library: make development easy ● Hyperparameter searching ● Manage TensorBoard lifecycle

  5. Find big datasets - Dela* ● Discover, share and experiment with interesting datasets ● p2p network of Hops clusters ● ImageNet, YouTube8M, Reddit comments... ● Exploits unused bandwidth *http://ieeexplore.ieee.org/document/7980225/ (ICDCS 2017)

  6. Scale-out level 1: Parallel hyperparameter search

  7. Parallel hyperparameter search

  def model(lr, dropout):
      ...

  args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
               'dropout': [0.45, 0.7]}
  args_dict_grid = util.grid_params(args_dict)
  tflauncher.launch(spark, model, args_dict_grid)

  Starts 6 parallel experiments.
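What `util.grid_params` does to the arguments dictionary can be sketched with the standard library (a minimal reimplementation for illustration; the actual Hops helper may differ in details):

```python
from itertools import product

def grid_params(args_dict):
    """Expand lists of hyperparameter values into one dict per experiment
    (the Cartesian product of all value lists)."""
    keys = list(args_dict)
    return [dict(zip(keys, combo))
            for combo in product(*(args_dict[k] for k in keys))]

args_dict = {'learning_rate': [0.001, 0.0005, 0.0001],
             'dropout': [0.45, 0.7]}
grid = grid_params(args_dict)  # 3 learning rates x 2 dropouts = 6 runs
```

Each of the 6 dicts can then be handed to one executor, which is why the slide's launch call starts 6 parallel experiments.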

  8. Scale-out level 2: Distributed training

  9. TensorFlowOnSpark (TFoS) by Yahoo! ● Distributed TensorFlow over Spark ● Runs on top of a Hadoop cluster ● PS/workers executed inside Spark executors ● Uses Spark for resource allocation – Our version: exclusive GPU allocations – Parameter server(s) do not get GPU(s) ● Manages TensorBoard

  10. Run TFoS

  def training_fun(argv, ctx):
      ...
      TFNode.start_cluster_server()
      ...

  TFCluster.run(spark, training_fun, num_exec, num_ps, …)

  Full conversion guide: https://github.com/yahoo/TensorFlowOnSpark/wiki/Conversion-Guide
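TFoS invokes the user's `training_fun(argv, ctx)` once per Spark executor, and `ctx` tells each executor whether it plays the parameter-server or worker role. A toy stdlib sketch of that dispatch pattern (the `Ctx` fields mirror TFoS's `job_name`/`task_index`, but this is an illustration, not the real runtime):

```python
from collections import namedtuple

# Stand-in for the context TFoS passes to training_fun on each executor.
Ctx = namedtuple('Ctx', ['job_name', 'task_index'])

def training_fun(argv, ctx):
    # Each executor branches on its assigned role in the TF cluster.
    if ctx.job_name == 'ps':
        return f'ps:{ctx.task_index} serving variables'
    return f'worker:{ctx.task_index} training'

# One PS and three workers, as the num_ps/num_exec arguments would set up:
num_ps, num_workers = 1, 3
ctxs = ([Ctx('ps', i) for i in range(num_ps)] +
        [Ctx('worker', i) for i in range(num_workers)])
roles = [training_fun([], c) for c in ctxs]
```

In the real system the PS processes host variables and the workers run the graph; the point here is only that one function runs everywhere and the context decides the role.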

  11. Scale-out level: master of the dark arts. Horovod

  12. The PS server architecture doesn't scale (figure from https://github.com/uber/horovod)

  13. Horovod by Uber ● Based on previous work by Baidu ● Organizes workers in a ring ● Gradient updates distributed using all-reduce ● Synchronous protocol
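Why the ring beats a central parameter server can be seen with a little arithmetic (a back-of-the-envelope sketch; the numbers are illustrative, not from the slides): with N workers each producing G bytes of gradients per step, the PS must receive N·G bytes, while ring all-reduce moves only 2·G·(N-1)/N bytes over each link, which stays below 2·G no matter how many workers join.

```python
def ps_ingress_bytes(n_workers, grad_bytes):
    # Central PS receives every worker's full gradient each step:
    # traffic into one node grows linearly with the worker count.
    return n_workers * grad_bytes

def ring_per_link_bytes(n_workers, grad_bytes):
    # Ring all-reduce: (N - 1) reduce-scatter steps plus (N - 1)
    # all-gather steps, each moving a 1/N-sized chunk over every link.
    return 2 * (n_workers - 1) * grad_bytes / n_workers

G = 100  # MB of gradients per step (illustrative)
ps_16 = ps_ingress_bytes(16, G)       # 1600 MB into a single node
ring_16 = ring_per_link_bytes(16, G)  # 187.5 MB per link, bounded by 2*G
```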

  14. All-Reduce, start: GPU1: [a0, b0, c0] | GPU2: [a1, b1, c1] | GPU3: [a2, b2, c2]

  15. All-Reduce, reduce-scatter step 1: GPU1: [a0, b0, c0+c2] | GPU2: [a0+a1, b1, c1] | GPU3: [a2, b1+b2, c2]

  16. All-Reduce, reduce-scatter step 2: GPU1: [a0, b0+b1+b2, c0+c2] | GPU2: [a0+a1, b1, c0+c1+c2] | GPU3: [a0+a1+a2, b1+b2, c2]

  17. All-Reduce: reduce-scatter done; each GPU now holds one fully summed chunk (same state as slide 16)

  18. All-Reduce, all-gather step 1: GPU1: [a0+a1+a2, b0+b1+b2, c0+c2] | GPU2: [a0+a1, b0+b1+b2, c0+c1+c2] | GPU3: [a0+a1+a2, b1+b2, c0+c1+c2]

  19. All-Reduce, all-gather step 2: every GPU holds [a0+a1+a2, b0+b1+b2, c0+c1+c2]
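The sequence in slides 14-19 can be reproduced with a short simulation: a minimal sketch of a ring all-reduce over three "GPUs" in plain Python (illustrative only, not Horovod's actual NCCL/MPI path):

```python
def ring_allreduce(values):
    """Simulate ring all-reduce on n GPUs, each holding n chunks.

    values[g][c] is GPU g's contribution to chunk c; on return every
    GPU holds the element-wise sum over all GPUs, as in slide 19."""
    n = len(values)
    state = [list(v) for v in values]
    # Phase 1, reduce-scatter: in step s, GPU g sends chunk (g - s) % n
    # to its right neighbour, which adds it to its own copy.
    for s in range(n - 1):
        out = [(g, (g - s) % n, state[g][(g - s) % n]) for g in range(n)]
        for g, c, val in out:
            state[(g + 1) % n][c] += val
    # Now GPU g holds the fully summed chunk (g + 1) % n.
    # Phase 2, all-gather: circulate the finished chunks around the ring,
    # overwriting instead of adding.
    for s in range(n - 1):
        out = [(g, (g + 1 - s) % n, state[g][(g + 1 - s) % n])
               for g in range(n)]
        for g, c, val in out:
            state[(g + 1) % n][c] = val
    return state

# Three GPUs with chunks a/b/c as in the slides (numbers stand in for
# gradient chunks): every GPU should end with [1+2+4, 10+20+40, 100+200+400].
result = ring_allreduce([[1, 10, 100], [2, 20, 200], [4, 40, 400]])
```

Each of the 2(N-1) steps moves only one 1/N-sized chunk per GPU, which is where the bounded per-link traffic comes from.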

  20. Hops AllReduce

  import horovod.tensorflow as hvd

  def conv_model(feature, target, mode):
      ...

  def main(_):
      hvd.init()
      opt = hvd.DistributedOptimizer(opt)
      if hvd.local_rank() == 0:
          hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
          ...
      else:
          hooks = [hvd.BroadcastGlobalVariablesHook(0), ..]
          ...

  from hops import allreduce
  allreduce.launch(spark, 'hdfs:///Projects/…/all_reduce.ipynb')
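Conceptually, `hvd.DistributedOptimizer` wraps the optimizer so that each worker's gradients are averaged across all ranks via all-reduce before the update is applied, and `BroadcastGlobalVariablesHook(0)` ships rank 0's initial weights to every rank so all replicas start identical. A stdlib sketch of the averaging step only (illustrative, not the Horovod API):

```python
def average_gradients(per_worker_grads):
    """What a distributed optimizer effectively computes: the mean of
    every worker's gradient, so all replicas apply the same update."""
    n = len(per_worker_grads)
    return [sum(g) / n for g in zip(*per_worker_grads)]

# Four workers, two parameters each; after the all-reduce every worker
# applies the same averaged gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
avg = average_gradients(grads)  # [4.0, 5.0] on every worker
```

Averaging (rather than summing) keeps the effective per-example learning rate stable as workers are added, which is why synchronous data-parallel setups commonly divide by the rank count.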

  21. Demo time!

  22. Play with it → hops.io/?q=content/hopsworks-vagrant Doc → hops.io Star us! → github.com/hopshadoop Follow us! → @hopshadoop
