

  1. CS 7643: Deep Learning. Topics: Computational Graphs; Notation + example; Computing Gradients; Forward mode vs Reverse mode AD. Dhruv Batra, Georgia Tech

  2. Administrivia
     • HW1 released – due 09/22
     • PS1 solutions – coming soon

  3. Project
     • Goal
       – Chance to try Deep Learning
       – Combine with other classes / research / credits / anything: you have our blanket permission
       – Extra credit for shooting for a publication
       – Encouraged to apply to your research (computer vision, NLP, robotics, ...)
       – Must be done this semester
     • Main categories
       – Application/Survey: compare a bunch of existing algorithms on a new application domain of your interest
       – Formulation/Development: formulate a new model or algorithm for a new or old problem
       – Theory: theoretically analyze an existing algorithm

  4. Administrivia
     • Project Teams Google Doc
       – https://docs.google.com/spreadsheets/d/1AaXY0JE4lAbHvoDaWlc9zsmfKMyuGS39JAn9dpeXhhQ/edit#gid=0
       – Project title
       – 1-3 sentence project summary (TL;DR)
       – Team member names + GT IDs

  5. Recap of last time

  6. How do we compute gradients?
     • Manual Differentiation
     • Symbolic Differentiation
     • Numerical Differentiation
     • Automatic Differentiation
       – Forward mode AD
       – Reverse mode AD (aka "backprop")
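Two of these are easy to sketch side by side. Below is a minimal comparison (not from the slides) of numerical differentiation against a hand-derived manual gradient, using the lecture's running example $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$:

```python
# A minimal sketch (not from the slides): numerical vs manual gradients
# for f(x1, x2) = x1*x2 + sin(x1).
import math

def f(x1, x2):
    return x1 * x2 + math.sin(x1)

def numerical_grad(f, x1, x2, h=1e-5):
    # Central differences: accurate to O(h^2), but 2 evaluations per input.
    d1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    d2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return d1, d2

def manual_grad(x1, x2):
    # Derived by hand: df/dx1 = x2 + cos(x1), df/dx2 = x1
    return x2 + math.cos(x1), x1

print(numerical_grad(f, 2.0, 3.0))  # ~(2.5839, 2.0)
print(manual_grad(2.0, 3.0))        # (2.5838..., 2.0)
```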

  7. Computational Graph: any DAG of differentiable modules is allowed! Slide Credit: Marc'Aurelio Ranzato

  8. Directed Acyclic Graphs (DAGs)
     • Exactly what the name suggests
       – Directed edges
       – No (directed) cycles
       – Underlying undirected cycles are okay

  9. Directed Acyclic Graphs (DAGs) • Concept: topological ordering
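Topological ordering is what makes forward evaluation (and, reversed, backprop) well-defined on a DAG. A minimal sketch (not from the slides) using Kahn's algorithm:

```python
# A minimal sketch (adjacency-list DAG assumed) of topological ordering
# via Kahn's algorithm: the order in which graph nodes can be evaluated.
from collections import deque

def topological_order(graph):
    """graph: dict mapping node -> list of successor nodes."""
    indegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] = indegree.get(v, 0) + 1
    queue = deque(u for u, d in indegree.items() if d == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in graph.get(u, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(indegree):
        raise ValueError("graph has a directed cycle")
    return order

# The lecture's example graph: x1 feeds sin and *, x2 feeds *, both feed +
print(topological_order({"x1": ["sin", "mul"], "x2": ["mul"],
                         "sin": ["add"], "mul": ["add"], "add": []}))
# ['x1', 'x2', 'sin', 'mul', 'add']
```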

  10. Directed Acyclic Graphs (DAGs) [figure]

  11. Computational Graphs • Notation #1: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$

  12. Computational Graphs • Notation #2: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$

  13. Example: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ [figure: computational graph with nodes $x_1$, $x_2$, $*$, $\sin(\cdot)$, $+$]
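Reading the graph off the formula, the forward pass just visits the nodes in topological order. A minimal sketch (the intermediate names w1, w2, w3 are my assumption, chosen to match the forward-mode slides below):

```python
# A minimal sketch of the forward pass through this graph.
import math

def forward(x1, x2):
    w1 = math.sin(x1)   # sin node
    w2 = x1 * x2        # * node
    w3 = w1 + w2        # + node: the output f
    return w3

print(forward(2.0, 3.0))  # sin(2) + 6 = 6.909...
```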

  14. Logistic Regression as a Cascade: given a library of simple functions, compose them into a complicated function, e.g. $-\log\left(\dfrac{1}{1 + e^{-w^\top x}}\right)$. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
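A minimal sketch (not the slide's code) of the cascade idea: build the loss out of a library of simple, individually differentiable functions:

```python
# A minimal sketch: the logistic regression negative log-likelihood
# -log(1 / (1 + exp(-w^T x))) composed from simple modules.
import math

def dot(w, x):   return sum(wi * xi for wi, xi in zip(w, x))
def sigmoid(a):  return 1.0 / (1.0 + math.exp(-a))
def neg_log(p):  return -math.log(p)

def logistic_loss(w, x):
    # Cascade of simple functions: dot -> sigmoid -> -log
    return neg_log(sigmoid(dot(w, x)))

print(logistic_loss([0.5, -0.25], [1.0, 2.0]))  # -log(sigmoid(0)) = log 2
```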

  15. Forward mode vs Reverse Mode • Key Computations

  16. Forward mode AD [figure]

  17. Reverse mode AD [figure]

  18. Example: Forward mode AD, $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ [figure: computational graph]

  19. Example: Forward mode AD, $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$. With $w_1 = \sin(x_1)$, $w_2 = x_1 x_2$, $w_3 = w_1 + w_2$, the tangents propagate forward as: $\dot w_1 = \cos(x_1)\,\dot x_1$, $\dot w_2 = \dot x_1\, x_2 + x_1\, \dot x_2$, $\dot w_3 = \dot w_1 + \dot w_2$

  20. Example: Forward mode AD, $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ (build-up of the same derivation): $\dot w_1 = \cos(x_1)\,\dot x_1$, $\dot w_2 = \dot x_1\, x_2 + x_1\, \dot x_2$, $\dot w_3 = \dot w_1 + \dot w_2$
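A classic way to implement these tangent rules (my sketch, not from the slides) is dual numbers: every value carries its $\dot w$ alongside it, and each primitive updates both:

```python
# A minimal sketch of forward-mode AD with dual numbers.
import math

class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

def sin(u):
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

def f(x1, x2):
    return x1 * x2 + sin(x1)

# Seed x1-dot = 1, x2-dot = 0 to get df/dx1 in one forward pass.
out = f(Dual(2.0, 1.0), Dual(3.0, 0.0))
print(out.val, out.dot)  # f = 6.909..., df/dx1 = x2 + cos(x1) = 2.583...
```

Note that each seed vector yields one directional derivative, so a full gradient over $n$ inputs needs $n$ forward passes.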

  21. Example: Reverse mode AD, $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$ [figure: computational graph]

  22. Example: Reverse mode AD, $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$. The adjoints propagate from the output back: $\bar w_3 = 1$, $\bar w_1 = \bar w_3$, $\bar w_2 = \bar w_3$, $\bar x_1 = \bar w_1 \cos(x_1) + \bar w_2\, x_2$ (the two contributions to $\bar x_1$ add), $\bar x_2 = \bar w_2\, x_1$
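The same example as straight-line code (my sketch, not from the slides): one forward pass that stores intermediates, one backward pass that accumulates the adjoints:

```python
# A minimal sketch of reverse-mode AD for this graph.
import math

def f_and_grad(x1, x2):
    # forward pass (store intermediates)
    w1 = math.sin(x1)
    w2 = x1 * x2
    w3 = w1 + w2
    # backward pass (adjoints, from output to inputs)
    w3_bar = 1.0
    w1_bar = w3_bar                               # + distributes the gradient
    w2_bar = w3_bar
    x1_bar = w1_bar * math.cos(x1) + w2_bar * x2  # contributions add
    x2_bar = w2_bar * x1
    return w3, (x1_bar, x2_bar)

print(f_and_grad(2.0, 3.0))  # (6.909..., (2.583..., 2.0))
```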

  23. Forward Pass vs Forward mode AD vs Reverse mode AD, $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$
      Forward pass: $w_1 = \sin(x_1)$, $w_2 = x_1 x_2$, $w_3 = w_1 + w_2$
      Forward mode: $\dot w_1 = \cos(x_1)\,\dot x_1$, $\dot w_2 = \dot x_1\, x_2 + x_1\, \dot x_2$, $\dot w_3 = \dot w_1 + \dot w_2$
      Reverse mode: $\bar w_3 = 1$, $\bar w_1 = \bar w_3$, $\bar w_2 = \bar w_3$, $\bar x_1 = \bar w_1 \cos(x_1) + \bar w_2\, x_2$, $\bar x_2 = \bar w_2\, x_1$

  24. Forward mode vs Reverse Mode • What are the differences? • Which one is more memory efficient (less storage)? – Forward or backward?

  25. Forward mode vs Reverse Mode • What are the differences? • Which one is more memory efficient (less storage)? – Forward or backward? • Which one is faster to compute? – Forward or backward?
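One way to summarize the answer (my gloss, not the slide's text): for $f: \mathbb{R}^n \to \mathbb{R}^m$, forward mode needs one pass per input direction, so $n$ passes for a full Jacobian, but can discard intermediates as it goes; reverse mode gets the gradients with respect to all $n$ inputs of a scalar output in a single backward pass, but must store every intermediate value from the forward pass. Deep learning losses have $m = 1$ and $n$ in the millions, so reverse mode (backprop) wins on compute while forward mode wins on memory.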

  26. Plan for Today • (Finish) Computing Gradients – Forward mode vs Reverse mode AD – Patterns in backprop – Backprop in FC+ReLU NNs • Convolutional Neural Networks

  27. Patterns in backward flow Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  28. Patterns in backward flow Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  29. Patterns in backward flow add gate: gradient distributor Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  30. Patterns in backward flow add gate: gradient distributor Q: What is a max gate? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  31. Patterns in backward flow add gate: gradient distributor max gate: gradient router Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  32. Patterns in backward flow add gate: gradient distributor max gate: gradient router Q: What is a mul gate? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  33. Patterns in backward flow add gate: gradient distributor max gate: gradient router mul gate: gradient switcher Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
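The three gate patterns above condense into a few lines. A minimal sketch (not from the slides), where each backward returns (dL/dx, dL/dy) given the upstream gradient dz:

```python
# add gate distributes, max gate routes, mul gate "switches" the inputs.
def add_backward(x, y, dz):
    return dz, dz            # distributor: both inputs get dz unchanged

def max_backward(x, y, dz):
    return (dz, 0.0) if x >= y else (0.0, dz)   # router: winner takes all

def mul_backward(x, y, dz):
    return dz * y, dz * x    # switcher: each input gets dz times the other

print(add_backward(3.0, -1.0, 2.0))  # (2.0, 2.0)
print(max_backward(3.0, -1.0, 2.0))  # (2.0, 0.0)
print(mul_backward(3.0, -1.0, 2.0))  # (-2.0, 6.0)
```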

  34. Gradients add at branches Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  35. Duality in Fprop and Bprop: a SUM node in the forward pass becomes a COPY in the backward pass, and a COPY (branch) in the forward pass becomes a SUM in the backward pass.

  36. Modularized implementation: forward / backward API. Graph (or Net) object (rough pseudocode below). Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
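The slide's pseudocode did not survive extraction; below is a minimal sketch in the spirit of the CS231n forward/backward API (the names `Net` and `gates` are my assumptions, and each gate is taken as unary for brevity): run gates forward in topological order, then backward in reverse.

```python
# A minimal sketch of a Graph/Net object with a forward/backward API.
class Net:
    def __init__(self, gates):
        self.gates = gates                 # assumed topologically sorted

    def forward(self, x):
        for gate in self.gates:            # topological order
            x = gate.forward(x)
        return x                           # final output (e.g. the loss)

    def backward(self, dout=1.0):
        for gate in reversed(self.gates):  # reverse topological order
            dout = gate.backward(dout)
        return dout                        # gradient w.r.t. the input
```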

  37. Modularized implementation: forward / backward API [figure: inputs x and y feed a * gate producing z; x, y, z are scalars]. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  38. Modularized implementation: forward / backward API [figure: the same * gate, now annotated with its forward/backward code]. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
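For the multiply gate on these two slides, a minimal sketch of the forward/backward pair (the class name `MultiplyGate` follows the CS231n convention; caching the inputs during forward is the point):

```python
# Forward caches x and y so backward can apply the chain rule.
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y   # cache inputs for the backward pass
        return x * y

    def backward(self, dz):
        dx = dz * self.y        # dL/dx = dL/dz * dz/dx
        dy = dz * self.x        # dL/dy = dL/dz * dz/dy
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)     # z = -12
print(gate.backward(1.0))       # (-4.0, 3.0)
```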

  39. Example: Caffe layers. Caffe is licensed under BSD 2-Clause. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  40. Caffe Sigmoid Layer: the backward pass multiplies the local sigmoid gradient by top_diff (chain rule). Caffe is licensed under BSD 2-Clause. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
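A NumPy sketch (my reconstruction, not Caffe's actual C++) of what the sigmoid layer's backward computes: the local gradient $y(1-y)$ times `top_diff`:

```python
import numpy as np

def sigmoid_forward(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_backward(top_diff, y):
    # dL/dx = dL/dy * dy/dx, and dy/dx = y * (1 - y) for the sigmoid
    return top_diff * y * (1.0 - y)

x = np.array([-1.0, 0.0, 2.0])
y = sigmoid_forward(x)
print(sigmoid_backward(np.ones_like(y), y))  # peaks at 0.25 when x = 0
```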

  41. [figure-only slide]

  42. [figure-only slide]

  43. Key Computation in DL: Forward-Prop. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

  44. Key Computation in DL: Back-Prop. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

  45. Jacobian of ReLU: f(x) = max(0, x) (elementwise), mapping a 4096-d input vector to a 4096-d output vector. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  46. Jacobian of ReLU: f(x) = max(0, x) (elementwise), 4096-d input vector to 4096-d output vector. Q: what is the size of the Jacobian matrix? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  47. Jacobian of ReLU: f(x) = max(0, x) (elementwise), 4096-d input vector to 4096-d output vector. Q: what is the size of the Jacobian matrix? [4096 x 4096!] Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  48. Jacobian of ReLU: f(x) = max(0, x) (elementwise), 4096-d input vector to 4096-d output vector. Q: what is the size of the Jacobian matrix? [4096 x 4096!] In practice we process an entire minibatch (e.g. 100 examples) at one time, so the Jacobian would technically be a [409,600 x 409,600] matrix :\ Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  49. Jacobian of ReLU: f(x) = max(0, x) (elementwise), 4096-d input vector to 4096-d output vector. Q: what is the size of the Jacobian matrix? [4096 x 4096!] Q2: what does it look like? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
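The answer to Q2 (standard, though the slide's figure didn't survive extraction): the Jacobian of an elementwise ReLU is diagonal, with entry (i, i) equal to 1 if $x_i > 0$ and 0 otherwise, so in practice nobody materializes it. A small sketch:

```python
import numpy as np

x = np.array([2.0, -3.0, 0.5, -0.1])
J = np.diag((x > 0).astype(float))   # explicit (wasteful) Jacobian
dout = np.ones(4)                    # pretend upstream gradient

print(J @ dout)            # [1. 0. 1. 0.]
print(dout * (x > 0))      # same result, without ever forming J
```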

  50. Jacobians of FC-Layer [figure]

  51. Jacobians of FC-Layer [figure]

  52. Jacobians of FC-Layer [figure]
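The derivation on these three slides lives in the figures, but the punchline is standard. A minimal sketch (my notation): for $y = Wx$, the Jacobian $\partial y / \partial x$ is $W$ itself, so the backward pass is just two matrix products:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)

y = W @ x                 # forward
dy = np.ones(3)           # pretend upstream gradient dL/dy
dx = W.T @ dy             # (4,)  gradient w.r.t. the input
dW = np.outer(dy, x)      # (3,4) gradient w.r.t. the weights
print(dx.shape, dW.shape)
```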

  53. Convolutional Neural Networks (without the brain stuff) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

  54. Fully Connected Layer. Example: 200x200 image, 40K hidden units: ~2B parameters!!! – Spatial correlation is local – Waste of resources, and we don't have enough training samples anyway. Slide Credit: Marc'Aurelio Ranzato

  55. Locally Connected Layer. Example: 200x200 image, 40K hidden units, filter size 10x10: 4M parameters. Note: this parameterization is good when the input image is registered (e.g., face recognition). Slide Credit: Marc'Aurelio Ranzato

  56. Locally Connected Layer. STATIONARITY? Statistics are similar at different locations. Example: 200x200 image, 40K hidden units, filter size 10x10: 4M parameters. Note: this parameterization is good when the input image is registered (e.g., face recognition). Slide Credit: Marc'Aurelio Ranzato

  57. Convolutional Layer: share the same parameters across different locations (assuming the input is stationary): convolutions with learned kernels. Slide Credit: Marc'Aurelio Ranzato
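A quick check of the parameter counts quoted on slides 54-57 (my arithmetic; biases ignored, and "40K hidden units" read as one 200x200 output map):

```python
n_in = 200 * 200          # input pixels
n_hid = 40_000            # hidden units

fc = n_in * n_hid                # fully connected: every pair gets a weight
local = n_hid * (10 * 10)        # locally connected: 10x10 field per unit
conv = 10 * 10                   # convolutional: one shared 10x10 kernel

print(f"{fc:,}")     # 1,600,000,000  (the slide's ~2B)
print(f"{local:,}")  # 4,000,000      (the slide's 4M)
print(f"{conv:,}")   # 100
```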

  58. Convolutions for mathematicians
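The slide's formula did not survive extraction; the standard discrete 2-D convolution it presumably showed is

$$(I * K)(i, j) = \sum_{m}\sum_{n} I(i - m,\; j - n)\, K(m, n)$$

Note that most deep learning libraries actually compute the unflipped variant, cross-correlation $\sum_m \sum_n I(i+m,\, j+n)\, K(m, n)$, and still call it convolution; for learned kernels the flip is immaterial.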
