CS 4803 / 7643: Deep Learning. Topics: Specifying Layers, Forward & Backward Auto-differentiation, (Beginning of) Convolutional Neural Networks


  1. CS 4803 / 7643: Deep Learning Topics: – Specifying Layers – Forward & Backward autodifferentiation – (Beginning of) Convolutional neural networks Zsolt Kira Georgia Tech

  2. Administrivia • PS0 released – mean of 20.7 – standard deviation of 3.4 – median of 21 – max of 25 – See me if you did not pass • PS1/HW1 out • Start thinking about project topics/teams – More details on project next time (C) Dhruv Batra & Zsolt Kira 2

  3. Recap from last time (C) Dhruv Batra & Zsolt Kira 3

  4. Gradient Descent Pseudocode: for i in {0, …, num_epochs}: for x, y in data: (compute the gradient on the mini-batch and update the weights). Some design decisions: • How many examples to use to calculate the gradient per iteration? • What should alpha (the learning rate) be? • Should it be constant throughout? • How many epochs to run?
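A minimal Python sketch of this loop, filling in the update step the slide leaves implicit; `grad_fn`, `data`, and `alpha` are illustrative placeholders, not names from the course.

```python
def sgd(w, data, grad_fn, alpha=0.01, num_epochs=10):
    """w: parameter vector; data: iterable of (x, y) mini-batches;
    grad_fn(w, x, y): gradient of the loss on that mini-batch w.r.t. w."""
    for epoch in range(num_epochs):      # how many epochs to run?
        for x, y in data:                # how many examples per gradient step?
            g = grad_fn(w, x, y)         # gradient on this mini-batch
            w = w - alpha * g            # step against the gradient (alpha = learning rate)
    return w
```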

  5. Computational Graph Any DAG of differentiable modules is allowed! (C) Dhruv Batra & Zsolt Kira 5 Slide Credit: Marc'Aurelio Ranzato

  6. Key Computation: Back-Prop (C) Dhruv Batra & Zsolt Kira 6 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

  7. Neural Network Training • Step 1: Compute Loss on mini-batch [F-Pass] (C) Dhruv Batra & Zsolt Kira 7 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

  8. Neural Network Training • Step 1: Compute Loss on mini-batch [F-Pass] • Step 2: Compute gradients wrt parameters [B-Pass] (C) Dhruv Batra & Zsolt Kira 8 Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

  9. General Flow Graphs “Deep Learning” book, Goodfellow, Bengio & Courville

  12. Jacobian of ReLU 4096-d 4096-d g(x) = max(0,x) input vector output vector (elementwise) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  13. Jacobian of ReLU 4096-d 4096-d g(x) = max(0,x) input vector output vector (elementwise) Q: what is the size of the Jacobian matrix? 13 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  14. Jacobian of ReLU 4096-d 4096-d g(x) = max(0,x) input vector output vector (elementwise) Q: what is the size of the Jacobian matrix? [4096 x 4096!] 14 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  15. Jacobian of ReLU 4096-d 4096-d g(x) = max(0,x) input vector output vector (elementwise) Q: what is the size of the Jacobian matrix? [4096 x 4096!] Q2: what does it look like? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
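To make Q2 concrete, here is a small numpy sketch (my own illustration, not from the slides): the Jacobian of an elementwise ReLU is diagonal, with a 1 wherever the input is positive and 0 elsewhere, so in practice you never materialize the 4096 x 4096 matrix.

```python
import numpy as np

x = np.random.randn(4096)               # input vector
y = np.maximum(0, x)                    # g(x) = max(0, x), elementwise

# Full Jacobian, only to show its structure (never build this in practice):
J = np.diag((x > 0).astype(x.dtype))    # 4096 x 4096, diagonal, entries in {0, 1}

# Backprop uses the Jacobian-vector product instead, which is just a mask:
upstream = np.random.randn(4096)        # dL/dy
dx = upstream * (x > 0)                 # dL/dx, equals J.T @ upstream
assert np.allclose(dx, J.T @ upstream)
```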

  16. Plan for Today • Specifying Layers • Forward & Backward auto-differentiation • (Beginning of) Convolutional neural networks (C) Dhruv Batra & Zsolt Kira 17

  17. Deep Learning = Differentiable Programming • Computation = Graph – Input = Data + Parameters – Output = Loss – Scheduling = Topological ordering • What do we need to do? – Generic code for representing the graph of modules – Specify modules (both forward and backward function) (C) Dhruv Batra & Zsolt Kira 18

  18. Modularized implementation: forward / backward API Graph (or Net) object (rough pseudo-code) 19 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
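The pseudo-code on this slide has roughly the following shape; this is a sketch in the spirit of the CS231n example, not the slide's actual code. It assumes each gate exposes `forward`/`backward` and that the gates are already in topological order of the DAG (here chained on a single value for simplicity).

```python
class Net:
    """Sketch of a Graph/Net object: run gates forward, then in reverse for backward."""
    def __init__(self, gates):
        self.gates = gates                     # assumed topologically sorted

    def forward(self, x):
        for gate in self.gates:                # left-to-right pass
            x = gate.forward(x)
        self.loss = x
        return self.loss

    def backward(self):
        grad = 1.0                             # dL/dL = 1 at the output
        for gate in reversed(self.gates):      # right-to-left pass
            grad = gate.backward(grad)         # each gate applies the chain rule
        return grad
```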

  19. Modularized implementation: forward / backward API x z * y (x,y,z are scalars) 20 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  20. Modularized implementation: forward / backward API x z * y (x,y,z are scalars) 21 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
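A sketch of the multiply gate from these two slides (scalar x, y; class and method names are illustrative): forward caches its inputs, and backward applies the chain rule with the local derivatives dz/dx = y and dz/dy = x.

```python
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y          # cache inputs for the backward pass
        return x * y                   # z = x * y

    def backward(self, dz):
        dx = self.y * dz               # dL/dx = (dz/dx) * dL/dz = y * dz
        dy = self.x * dz               # dL/dy = (dz/dy) * dL/dz = x * dz
        return dx, dy

# Usage
gate = MultiplyGate()
z = gate.forward(3.0, -4.0)            # z = -12.0
dx, dy = gate.backward(1.0)            # dx = -4.0, dy = 3.0
```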

  21. Example: Caffe layers Caffe is licensed under BSD 2-Clause 22 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  22. Caffe Sigmoid Layer: the backward pass multiplies the local sigmoid gradient by top_diff (chain rule) Caffe is licensed under BSD 2-Clause 23 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
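In the same spirit, a Python sketch of what the sigmoid layer computes (not Caffe's actual C++ code): forward stores the output y = sigmoid(x), and backward multiplies the local gradient y(1 - y) by top_diff, exactly the chain-rule step the slide highlights.

```python
import numpy as np

class SigmoidLayer:
    def forward(self, bottom):
        self.top = 1.0 / (1.0 + np.exp(-bottom))   # y = sigmoid(x)
        return self.top

    def backward(self, top_diff):
        y = self.top
        bottom_diff = top_diff * y * (1.0 - y)     # dL/dx = dL/dy * y * (1 - y)
        return bottom_diff
```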

  23. Deep Learning = Differentiable Programming • Computation = Graph – Input = Data + Parameters – Output = Loss – Scheduling = Topological ordering • Auto-Diff – A family of algorithms for implementing chain-rule on computation graphs (C) Dhruv Batra & Zsolt Kira 24

  24. Forward mode vs Reverse Mode • Key Computations (C) Dhruv Batra & Zsolt Kira 25

  25. Forward mode AD g 26

  26. Reverse mode AD g 27

  27. Example: Forward mode AD + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 28

  28. Example: Forward mode AD + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 29

  29. Example: Forward mode AD + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 30

  30. Example: Forward mode AD Q: What happens if there’s another input variable x 3 ? + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 31

  31. Example: Forward mode AD Q: What happens if there’s another input variable x 3 ? A: more sophisticated graph; d “forward props” for d input variables + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 32

  32. Example: Forward mode AD Q: What happens if there’s another output variable f 2 ? + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 33

  33. Example: Forward mode AD Q: What happens if there’s another output variable f 2 ? A: more sophisticated graph; single “forward prop” + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 34
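A small forward-mode sketch using dual numbers (my own illustration; I assume the graph on these slides computes f(x1, x2) = x1·x2 + sin(x1), consistent with the +, sin( ) and * nodes shown). One forward pass propagates a (value, derivative) pair and yields one directional derivative, which is why d input variables need d forward props, while extra outputs come for free.

```python
import math

class Dual:
    """Dual number (val, dot): a value and its derivative along one input direction."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)  # product rule

def dsin(a):
    return Dual(math.sin(a.val), math.cos(a.val) * a.dot)         # chain rule for sin

def f(x1, x2):
    return x1 * x2 + dsin(x1)          # assumed function for the slide's graph

# One forward prop with seed (1, 0) gives df/dx1; a second with (0, 1) gives df/dx2.
df_dx1 = f(Dual(2.0, 1.0), Dual(3.0, 0.0)).dot    # = x2 + cos(x1) at (2, 3)
df_dx2 = f(Dual(2.0, 0.0), Dual(3.0, 1.0)).dot    # = x1 = 2.0
```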

  34. Example: Reverse mode AD + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 35

  35. Example: Reverse mode AD + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 36

  36. Gradients add at branches + Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  37. Example: Reverse mode AD Q: What happens if there’s another input variable x 3 ? + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 38

  38. Example: Reverse mode AD Q: What happens if there’s another input variable x 3 ? A: more sophisticated graph; single “backward prop” + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 39

  39. Example: Reverse mode AD Q: What happens if there’s another output variable f 2 ? + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 40

  40. Example: Reverse mode AD Q: What happens if there’s another output variable f 2 ? A: more sophisticated graph; c “backward props” for c output variables + sin( ) * x 1 x 2 (C) Dhruv Batra & Zsolt Kira 41
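For contrast, a reverse-mode sketch for the same assumed function f(x1, x2) = x1·x2 + sin(x1): one forward pass records intermediates, then a single backward pass accumulates gradients for all inputs (and gradients add where x1 branches into both the * and sin nodes), which is why extra inputs are free but c output variables need c backward props.

```python
import math

def f_and_grad(x1, x2):
    # Forward pass: record intermediates
    a = x1 * x2            # * node
    b = math.sin(x1)       # sin node
    f = a + b              # + node (output)

    # Backward pass: start from df/df = 1 and apply the chain rule in reverse
    df_da = 1.0
    df_db = 1.0
    df_dx2 = df_da * x1                           # through the * node
    df_dx1 = df_da * x2 + df_db * math.cos(x1)    # gradients add at the x1 branch
    return f, (df_dx1, df_dx2)

f_val, (g1, g2) = f_and_grad(2.0, 3.0)   # g1 = 3 + cos(2), g2 = 2
```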

  41. Forward mode vs Reverse Mode • x  Graph  L • Intuition of Jacobian (C) Dhruv Batra & Zsolt Kira 42

  42. Forward mode vs Reverse Mode • What are the differences? • Which one is faster to compute? – Forward or backward? (C) Dhruv Batra & Zsolt Kira 43

  43. Forward mode vs Reverse Mode • What are the differences? • Which one is faster to compute? – Forward or backward? • Which one is more memory efficient (less storage)? – Forward or backward? [two copies of the example graph: +, sin( ), *, x 1, x 2] (C) Dhruv Batra & Zsolt Kira 44

  44. Practical Note 2: Software Frameworks A few weeks ago! +Keras Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  45. PyTorch

  46. Plan for Today (Cont.) • Specifying Layers • Forward & Backward auto-differentiation • (Beginning of) Convolutional neural networks – What is a convolution? – FC vs Conv Layers (C) Dhruv Batra & Zsolt Kira 48

  47. Recall: Linear Classifier f(x, W) = Wx + b. Image: array of 32x32x3 numbers (3072 numbers total), stretched into a 3072x1 column x. W: 10x3072 parameters (weights). b: 10x1. Output f(x, W): 10x1, i.e., 10 numbers giving class scores. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
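A numpy sketch with the shapes on the slide (random values, purely to show the dimensions): a 32x32x3 image is stretched into a 3072-vector, and W and b map it to 10 class scores.

```python
import numpy as np

image = np.random.rand(32, 32, 3)      # input image
x = image.reshape(3072)                # stretch pixels into a column (3072,)

W = np.random.randn(10, 3072) * 0.01   # weights: 10 x 3072
b = np.zeros(10)                       # biases: 10

scores = W @ x + b                     # f(x, W) = Wx + b -> 10 class scores
assert scores.shape == (10,)
```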

  48. Example with an image with 4 pixels, and 3 classes (cat/dog/ship). Stretch pixels into a column: x = [56, 231, 24, 2]. W = [0.2 -0.5 0.1 2.0; 1.5 1.3 2.1 0.0; 0.0 0.25 0.2 -0.3], b = [1.1, 3.2, -1.2]. Scores Wx + b: -96.8 (cat), 437.9 (dog), 61.95 (ship). 50 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  49. Recall: (Fully-Connected) Neural networks. (Before) Linear score function f = Wx. (Now) 2-layer Neural Network: x (3072) → h (100) → s (10), with weight matrices W1 and W2. 51 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
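Continuing the sketch with the sizes on the slide (x: 3072, hidden h: 100, scores s: 10); the ReLU between the two layers is my assumption of the usual choice, and the weights are random placeholders.

```python
import numpy as np

x = np.random.rand(3072)                # input (stretched image)
W1 = np.random.randn(100, 3072) * 0.01  # first layer: 3072 -> 100
W2 = np.random.randn(10, 100) * 0.01    # second layer: 100 -> 10

h = np.maximum(0, W1 @ x)               # hidden layer (assuming a ReLU nonlinearity)
s = W2 @ h                              # 10 class scores
```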

  50. Convolutional Neural Networks (without the brain stuff) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  51. Fully Connected Layer Example: 200x200 image, 40K hidden units: ~2B parameters!!! – Spatial correlation is local – Waste of resources, plus we don’t have enough training samples anyway… 53 Slide Credit: Marc'Aurelio Ranzato

  52. Locally Connected Layer Example: 200x200 image, 40K hidden units, “filter” size 10x10: 4M parameters. Note: This parameterization is good when the input image is registered (e.g., face recognition). 54 Slide Credit: Marc'Aurelio Ranzato
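The parameter counts on these two slides, worked out (biases ignored): a fully connected layer needs one weight per (pixel, hidden unit) pair, while a locally connected layer with 10x10 filters needs only 100 weights per hidden unit.

```python
pixels = 200 * 200                 # 40,000 input pixels
hidden = 40_000                    # 40K hidden units

fc_params = pixels * hidden        # 1.6e9, i.e. ~2B parameters
local_params = hidden * 10 * 10    # 4e6, i.e. 4M parameters (one 10x10 filter per unit)
```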

  53. Locally Connected Layer STATIONARITY? Statistics similar at all locations 55 Slide Credit: Marc'Aurelio Ranzato

  54. Convolutional Layer Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels 56 Slide Credit: Marc'Aurelio Ranzato

  55. What filter to use?

  56. Discrete convolution • Discrete Convolution! • Very similar to correlation, but associative [figures/equations: 1D Convolution, 2D Convolution, Filter]
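The formulas behind the "1D Convolution" and "2D Convolution" labels did not survive extraction; the standard discrete definitions are:

```latex
\[
(x * w)[n] = \sum_{k} x[k]\, w[n-k]
\qquad
(x * w)[i,j] = \sum_{m}\sum_{n} x[m,n]\, w[i-m,\, j-n]
\]
```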

  57. A note on sizes: input N x N, filter m x m, output (N-m+1) x (N-m+1). MATLAB to the rescue! • conv2(x, w, ‘valid’)
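The same size bookkeeping in Python (scipy's convolve2d plays the role of MATLAB's conv2; the shapes are illustrative): a 'valid' convolution of an N x N input with an m x m filter gives an (N-m+1) x (N-m+1) output.

```python
import numpy as np
from scipy.signal import convolve2d   # analog of MATLAB's conv2

N, m = 32, 5
x = np.random.rand(N, N)              # input
w = np.random.rand(m, m)              # filter

y = convolve2d(x, w, mode='valid')    # conv2(x, w, 'valid')
assert y.shape == (N - m + 1, N - m + 1)   # 28 x 28
```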

  58. Convolutions! • Math vs. CS vs. programming viewpoints (C) Dhruv Batra & Zsolt Kira 60
