Big Data for Data Science: Scalable Machine Learning
  1. Big Data for Data Science Scalable Machine Learning event.cwi.nl/lsde

  2. A SHORT INTRODUCTION TO NEURAL NETWORKS credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde

  3. Example: Image Recognition • input image ➔ weights ➔ loss, computed by AlexNet, a ‘convolutional’ neural network credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
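
A concrete sketch of that input image ➔ weights ➔ loss pipeline, assuming PyTorch and torchvision are available; the image tensor and the target class index below are random stand-ins, not real data from the slides:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Pretrained AlexNet: its weights are the "W" the network has learned
model = models.alexnet(pretrained=True)
model.eval()

# A stand-in "image": one 3-channel 224x224 tensor
# (a real pipeline would load and normalize a photo)
image = torch.randn(1, 3, 224, 224)
label = torch.tensor([281])             # hypothetical target class index

scores = model(image)                   # forward pass: image + weights -> class scores
loss = F.cross_entropy(scores, label)   # scores + target -> loss
print(loss.item())
```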

  4. Neural Nets - Basics • Score function (linear, matrix) • Activation function (normalize [0-1]) • Regularization function (penalize complex W) credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
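
A minimal numpy sketch of these three ingredients; the array shapes and the 0.01 regularization strength are illustrative choices, not values from the slides:

```python
import numpy as np

def score(W, x, b):
    # linear score function: one matrix multiply plus a bias
    return W @ x + b

def sigmoid(s):
    # activation function: squashes scores into [0, 1]
    return 1.0 / (1.0 + np.exp(-s))

def l2_regularization(W, lam=0.01):
    # regularization: penalize complex (large) weights
    return lam * np.sum(W * W)

x = np.random.randn(4)        # one input with 4 features
W = np.random.randn(3, 4)     # weights mapping 4 features to 3 outputs
b = np.zeros(3)

out = sigmoid(score(W, x, b))
penalty = l2_regularization(W)
```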

  5. Neural Nets are Computational Graphs • Score, Activation and Regularization functions, together with a Loss function, form one computational graph • For backpropagation, we need a formula for the “gradient”, i.e. the derivative of each computational function credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde

  6. Training the model: backpropagation • backpropagate the loss to the weights to be adjusted, proportional to a learning rate • for backpropagation we need a formula for the “gradient”, i.e. the derivative of each computational function • first step, the 1/x node: -1/(1.37)^2 * 1.00 = -0.53 credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung, Song Han event.cwi.nl/lsde

  7. Training the model: backpropagation • next, the “+1” node: 1 * -0.53 = -0.53 credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung, Song Han event.cwi.nl/lsde

  8. Training the model: backpropagation • next, the exp node: e^(-1.00) * -0.53 = -0.20 credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde

  9. Training the model: backpropagation • finally, the “+” and “*” nodes pass the gradient on to the weights and inputs (0.20, -0.20, 0.40, -0.40, -0.60) credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
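
The four steps above can be reproduced in a few lines of numpy, assuming the standard cs231n example inputs (w0=2, x0=-1, w1=-3, x1=-2, w2=-3), which reproduce the 1.37, -0.53, -0.20 and 0.20 values shown in the figures:

```python
import numpy as np

# Assumed example inputs (they reproduce the numbers in the figures)
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass: f(w, x) = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2)))
s   = w0*x0 + w1*x1 + w2     # 1.00
neg = -s                     # -1.00
e   = np.exp(neg)            # 0.37
den = e + 1.0                # 1.37
f   = 1.0 / den              # 0.73

# Backward pass: multiply each local derivative by the gradient flowing from above
dden = -1.0 / den**2 * 1.00  # 1/x node:  -1/(1.37)^2 * 1.00 = -0.53
de   = 1.0 * dden            # +1 node:    1 * -0.53         = -0.53
dneg = np.exp(neg) * de      # exp node:   e^(-1.00) * -0.53 = -0.20
ds   = -1.0 * dneg           # *-1 node:                       0.20
dw0, dx0 = x0*ds, w0*ds      # * node:    -0.20, 0.40
dw1, dx1 = x1*ds, w1*ds      # * node:    -0.40, -0.60
dw2 = 1.0 * ds               # + node:     0.20
```

A gradient-descent step would then adjust each weight by w -= learning_rate * dw, which is what "proportional to a learning rate" refers to.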

  10. Activation Functions credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
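
The activation functions typically covered at this point, as a short numpy sketch; exactly which ones the original figure shows is not recoverable from the transcript, so this selection is an assumption:

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))    # squashes to (0, 1)
def tanh(x):       return np.tanh(x)                   # squashes to (-1, 1)
def relu(x):       return np.maximum(0.0, x)           # zero for negative inputs
def leaky_relu(x): return np.where(x > 0, x, 0.01*x)   # small slope instead of zero
```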

  11. Get going quickly: Transfer Learning credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
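
A minimal transfer-learning sketch in PyTorch (assuming torchvision; ResNet-18 and the 10-class head are illustrative choices): reuse a network pretrained on ImageNet, freeze its weights, and train only a new final layer on your own data.

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True)   # weights learned on ImageNet

for param in model.parameters():           # freeze the pretrained layers
    param.requires_grad = False

# replace the final classifier with a fresh layer for our own task (10 classes here)
model.fc = nn.Linear(model.fc.in_features, 10)
# only model.fc.parameters() need to be passed to the optimizer
```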

  12. Neural Network Architecture • (mini) batch-wise training • matrix calculations galore credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
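
A sketch of what "(mini) batch-wise training" and "matrix calculations galore" mean in practice: each batch is pushed through the layer as one matrix multiplication rather than one sample at a time. The batch size of 32 and the layer sizes are arbitrary choices for illustration.

```python
import numpy as np

X = np.random.randn(1024, 100)          # 1024 samples, 100 features each
W = np.random.randn(100, 10) * 0.01     # one linear layer: 100 -> 10
batch_size = 32

for start in range(0, len(X), batch_size):
    batch = X[start:start + batch_size]  # (32, 100)
    scores = batch @ W                   # one matrix multiply for the whole batch: (32, 10)
    # ...compute the loss, backpropagate, and update W here...
```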

  13. DEEP LEARNING SOFTWARE credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde

  14. Deep Learning Frameworks • Caffe (UC Berkeley) ➔ Caffe2 (Facebook) • Torch (NYU/Facebook) ➔ PyTorch (Facebook) • Theano (Univ. Montreal) ➔ TensorFlow (Google) • also: Paddle (Baidu), CNTK (Microsoft), MXNET (Amazon) • Easily build big computational graphs • Easily compute gradients in these graphs • Run it at high speed (e.g. on a GPU) credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
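
Those three framework jobs in one sketch, assuming the TensorFlow 1.x-style static-graph API (the toy loss is an invented example; with a GPU build, the same code runs on the GPU automatically):

```python
import numpy as np
import tensorflow as tf   # assumes TensorFlow 1.x

# 1. Build a computational graph
x = tf.placeholder(tf.float32, shape=(None, 3))
w = tf.Variable(tf.random_normal((3, 1)))
loss = tf.reduce_sum(tf.square(tf.matmul(x, w)))   # toy loss

# 2. Gradients are derived from the graph automatically
grad_w, = tf.gradients(loss, [w])

# 3. Run it at high speed (on a GPU if one is available)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    loss_val, grad_val = sess.run([loss, grad_w],
                                  feed_dict={x: np.random.randn(8, 3)})
```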

  15. Deep Learning Frameworks • plain numpy: you have to compute gradients by hand, and there is no GPU support • TensorFlow: gradient computations are generated automagically from the forward phase (z=x*y; b=a+x; c=sum(b)), plus GPU support • PyTorch: similar to TensorFlow; not a “new language” but embedded in Python (control flow), plus GPU support credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
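
The same tiny forward phase written both ways; the expression on the slide looks garbled in this transcript, so the sketch assumes the consistent form a = x*y; b = a + z; c = sum(b), with arbitrary input values. In numpy the gradients are derived by hand; in PyTorch, autograd generates them from the forward pass.

```python
import numpy as np
import torch

x0, y0, z0 = 2.0, -3.0, 4.0   # arbitrary inputs

# Numpy: forward pass, then gradients worked out by hand
a = x0 * y0
b = a + z0
c = np.sum(b)
grad_b = 1.0            # dc/db
grad_a = grad_b         # db/da = 1
grad_z = grad_b         # db/dz = 1
grad_x = grad_a * y0    # da/dx = y
grad_y = grad_a * x0    # da/dy = x

# PyTorch: the same forward pass; gradients generated automatically
x = torch.tensor(x0, requires_grad=True)
y = torch.tensor(y0, requires_grad=True)
z = torch.tensor(z0, requires_grad=True)
c = torch.sum(x * y + z)
c.backward()
print(x.grad, y.grad, z.grad)   # matches grad_x, grad_y, grad_z
```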

  16. TensorFlow: TensorBoard GUI credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde

  17. Higher Levels of Abstraction • formulas “by name”, e.g. “sgd” = stochastic gradient descent credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
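
What selecting formulas "by name" looks like in a high-level wrapper such as Keras; the tiny two-layer model below is an illustrative assumption, not the one from the slide:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(100,)),
    keras.layers.Dense(10, activation='softmax'),
])

# formulas "by name": 'sgd' = stochastic gradient descent,
# 'categorical_crossentropy' = the loss function
model.compile(optimizer='sgd',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```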

  18. Static vs Dynamic Graphs credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde

  19. Static vs Dynamic: optimization credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung, Song Han event.cwi.nl/lsde

  20. Static vs Dynamic: serialization • serialization = create a runnable program from the trained network credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung, Song Han event.cwi.nl/lsde
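
One way to get such a runnable program out of a dynamic-graph framework is tracing; a hedged PyTorch sketch, with the model and the example input as placeholders:

```python
import torch
import torchvision.models as models

model = models.resnet18(pretrained=True).eval()
example = torch.randn(1, 3, 224, 224)

# Trace the Python forward pass once to record a static, runnable graph...
traced = torch.jit.trace(model, example)
traced.save("resnet18_traced.pt")   # ...which can later be loaded and run
                                    # without the Python training code
```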

  21. Static vs Dynamic: conditionals, loops credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
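
The dynamic-graph side of this comparison, sketched in PyTorch: ordinary Python conditionals and loops decide the graph structure per input, whereas a static framework needs special graph operators for the same thing. The threshold and loop count below are arbitrary.

```python
import torch

def forward(x, w1, w2, steps):
    # conditional: which weight matrix is used depends on the data itself
    w = w1 if x.sum() > 0 else w2
    # loop: the number of multiplications is an ordinary Python loop variable
    for _ in range(steps):
        x = torch.relu(x @ w)
    return x

x  = torch.randn(4, 8)
w1 = torch.randn(8, 8)
w2 = torch.randn(8, 8)
out = forward(x, w1, w2, steps=3)
```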

  22. What to Use? • TensorFlow is a safe bet for most projects. Not perfect, but it has a huge community and wide usage. Maybe pair it with a high-level wrapper (Keras, Sonnet, etc.) • PyTorch is best for research. However, it is still new and there can be rough patches. • Use TensorFlow for one graph over many machines • Consider Caffe, Caffe2, or TensorFlow for production deployment • Consider TensorFlow or Caffe2 for mobile credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde

  23. DEEP LEARNING PERFORMANCE OPTIMIZATIONS credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  24. ML models are getting larger credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  25. First Challenge: Model Size credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  26. Second Challenge: Energy Efficiency credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  27. Third Challenge: Training Speed credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  28. Hardware Basics credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  29. Special hardware? It’s in your pocket.. • iPhone 8 with A11 chip: 6 CPU cores (2 powerful, 4 energy-efficient), an Apple GPU, and an Apple TPU (deep learning ASIC); only an on-chip FPGA is missing (will come in time..) event.cwi.nl/lsde

  30. Hardware Basics: Number Representation credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  31. Hardware Basics: Number Representation credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  32. Hardware Basics: Memory = Energy larger model ➔ more memory references ➔ more energy consumed credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  33. Pruning Neural Networks credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  34. Pruning Neural Networks • Learning both Weights and Connections for Efficient Neural Networks, Han, Pool, Tran, Dally, NIPS 2015 credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde
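
The core idea of magnitude pruning, as a hedged numpy sketch: drop (zero out) the connections whose learned weights are smallest in magnitude, and keep a mask so they stay zero during fine-tuning. The 90% pruning ratio and the random weight matrix are illustrative, not values from the paper.

```python
import numpy as np

W = np.random.randn(256, 256)     # a trained weight matrix (random stand-in)
prune_ratio = 0.9

# magnitude below which connections are dropped
threshold = np.quantile(np.abs(W), prune_ratio)
mask = np.abs(W) >= threshold     # keep only the largest 10% of weights
W_pruned = W * mask

# during fine-tuning, re-apply the mask after every update so pruned weights stay zero
print("fraction of nonzero weights:", mask.mean())
```

The retained weights are then retrained, which is why pruning changes the weight distribution shown on the next slide.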

  35. Pruning Changes the Weight Distribution credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  36. Pruning Happens in the Human Brain credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  37. Trained Quantization credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  38. Trained Quantization credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  39. Trained Quantization: Before • Continuous weight distribution credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  40. Trained Quantization: After • Discrete weight distribution credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde
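
A sketch of how trained quantization turns the continuous weight distribution into a discrete one: cluster the weights (a tiny hand-rolled 1-D k-means here; 16 clusters, i.e. 4-bit indices, is an illustrative choice) and replace every weight by its nearest shared centroid, so only the centroid index needs to be stored. The paper additionally fine-tunes the centroids with gradients, which this sketch omits.

```python
import numpy as np

def quantize_weights(W, k=16, iters=20):
    flat = W.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)   # initial shared values
    for _ in range(iters):                               # plain k-means in 1-D
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = flat[idx == c].mean()
    return centroids[idx].reshape(W.shape), idx          # discrete weights + 4-bit codes

W = np.random.randn(64, 64)
W_q, codes = quantize_weights(W)
print(np.unique(W_q).size)   # at most 16 distinct weight values remain
```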

  41. Trained Quantization: How Many Bits? • Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, Han, Mao, Dally, ICLR 2016 credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  42. Quantization to Fixed Point Decimals (=Ints) credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde
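
A sketch of quantizing floats to fixed-point integers; 8 bits and the symmetric scaling scheme below are one common choice, not necessarily the exact scheme on the slide:

```python
import numpy as np

def to_fixed_point(x, bits=8):
    scale = np.max(np.abs(x)) / (2**(bits - 1) - 1)   # map the largest value to 127
    q = np.round(x / scale).astype(np.int8)           # store small ints instead of floats
    return q, scale

def from_fixed_point(q, scale):
    return q.astype(np.float32) * scale               # approximate reconstruction

x = np.random.randn(1000).astype(np.float32)
q, scale = to_fixed_point(x)
err = np.abs(x - from_fixed_point(q, scale)).max()    # small quantization error
```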

  43. Hardware Basics: Number Representation credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  44. Mixed Precision Training credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  45. Mixed Precision Training credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde
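
The essence of mixed precision training, as a hedged numpy sketch: forward and backward arithmetic in float16, a float32 master copy of the weights for the update, and a loss-scaling factor so tiny gradients do not underflow in half precision. The loss scale of 1024, the stand-in gradient, and the layer sizes are arbitrary example values.

```python
import numpy as np

w_master = np.random.randn(100, 10).astype(np.float32)   # FP32 master weights
x = np.random.randn(32, 100).astype(np.float16)
loss_scale, lr = 1024.0, 0.01

for _ in range(10):
    w16 = w_master.astype(np.float16)           # FP16 copy used for compute
    out = x @ w16                                # forward pass in half precision
    grad_out = np.ones_like(out) / out.size      # stand-in upstream gradient
    grad_w16 = x.T @ (grad_out * np.float16(loss_scale))   # scaled backward pass in FP16
    grad_w = grad_w16.astype(np.float32) / loss_scale       # unscale in FP32
    w_master -= lr * grad_w                      # update the FP32 master weights
```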

  46. DEEP LEARNING HARDWARE credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  47. The end of CPU scaling event.cwi.nl/lsde

  48. CPUs for Training - SIMD to the rescue? credits: cs231n.stanford.edu, Song Han event.cwi.nl/lsde

  49. CPUs for Training - SIMD to the rescue? • 4 scalar instructions vs. 1 SIMD instruction event.cwi.nl/lsde
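
From Python, this is essentially what numpy's vectorized operations give you: one call whose compiled loop processes several elements per CPU instruction, instead of a scalar Python-level loop. This is an analogy sketch; the exact SIMD width depends on the CPU.

```python
import numpy as np

a = np.random.rand(100_000).astype(np.float32)
b = np.random.rand(100_000).astype(np.float32)

# scalar version: one add at a time, interpreted in Python
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# vectorized version: numpy's compiled loop, which uses SIMD instructions
# to add several floats per instruction
c_simd = a + b
```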

  50. CPU vs GPU • “ALU”: arithmetic logic unit (implements the +, *, - etc. instructions) • CPU: a lot of chip surface for cache memory and control • GPU: almost all chip surface for ALUs (compute power) • GPU cards have their own memory chips: smaller, but nearby and faster than system memory credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde

  51. Programming GPUs • CUDA (NVIDIA only) – Write C-like code that runs directly on the GPU – Higher-level APIs: cuBLAS, cuFFT, cuDNN, etc • OpenCL – Similar to CUDA, but runs on anything – Usually slower :( All major deep learning libraries (TensorFlow, PyTorch, MXNET, etc) support training and model evaluation on GPUs. credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
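
How that GPU support looks from user code, sketched in PyTorch; the layer sizes are arbitrary, and the same code falls back to the CPU if no GPU is present:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(1024, 1024).to(device)    # move the weights to GPU memory
x = torch.randn(256, 1024, device=device)   # allocate the batch on the GPU

y = model(x)   # the matrix multiply runs on the GPU (via cuBLAS under the hood)
```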

  52. CPU vs GPU: performance credits: cs231n.stanford.edu; Fei-Fei Li, Justin Johnson, Serena Yeung event.cwi.nl/lsde
