  1. Neural Networks and Autodifferentiation CMSC 678 UMBC

  2. Recap from last time…

  3. Maximum Entropy (Log-linear) Models: p(y | x) ∝ exp(ΞΈ^T f(x, y)). β€œmodel the posterior probabilities of the K classes via linear functions in ΞΈ, while at the same time ensuring that they sum to one and remain in [0, 1]” ~ Ch 4.4. β€œ[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.” ~ Jaynes, 1957

  4. Normalization for Classification: Z = Ξ£_label exp( weight_1 * f_1(fatally shot, x) + weight_2 * f_2(seriously wounded, x) + weight_3 * f_3(Shining Path, x) + … ), summing over every possible label.
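To make the scoring and normalization concrete, here is a minimal NumPy sketch (not from the slides): it assumes a hypothetical feature function feats(x, label) and weight vector theta, scores each label with ΞΈ^T f(x, y), and divides by Z.

```python
import numpy as np

def feats(x, label):
    # Hypothetical indicator-style features: copy the input into the block for this label.
    vec = np.zeros(len(x) * 3)
    vec[label * len(x):(label + 1) * len(x)] = x
    return vec

def posterior(theta, x, labels=(0, 1, 2)):
    scores = np.array([theta @ feats(x, y) for y in labels])  # theta^T f(x, y) for each label
    scores -= scores.max()           # subtract the max for numerical stability
    unnorm = np.exp(scores)          # exp(theta^T f(x, y))
    Z = unnorm.sum()                 # Z = sum over labels of exp(theta^T f(x, y))
    return unnorm / Z                # p(y | x)

theta = np.random.default_rng(0).normal(size=12)
print(posterior(theta, np.array([1.0, 0.0, 2.0, 0.5])))  # a distribution over the 3 labels
```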

  5. Connections to Other Techniques: Log-Linear Models, (multinomial) logistic regression, softmax regression, Maximum Entropy models (MaxEnt), Generalized Linear Models, discriminative NaΓ―ve Bayes, very shallow (sigmoidal) neural nets. Generalized linear model: y = Ξ£_k ΞΈ_k x_k + b, where the response can be a general (transformed) version of another response. Logistic regression: log p(y = j) βˆ’ log p(y = K) = Ξ£_k ΞΈ_k f(x_k, j) + b.

  6. Log-Likelihood Gradient: each component k is the difference between the total value of feature f_k in the training data and the total value the current model p_ΞΈ thinks it computes for feature f_k, Ξ£_i 𝔼_{p_ΞΈ}[f_k(x_i, y')].
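A hedged sketch of that gradient, passing in the hypothetical feats() and posterior() from the previous sketch as arguments: for each training pair it adds the observed feature values and subtracts the values the current model p_ΞΈ expects.

```python
import numpy as np

def loglik_gradient(theta, data, feats, posterior, labels=(0, 1, 2)):
    # data: list of (x, y) training pairs; feats/posterior as in the previous sketch
    grad = np.zeros_like(theta)
    for x, y in data:
        grad += feats(x, y)                          # observed total of each feature f_k
        p = posterior(theta, x, labels)              # current model p_theta(y' | x)
        for y_prime in labels:
            grad -= p[y_prime] * feats(x, y_prime)   # expected total under p_theta
    return grad
```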

  7. Outline Neural networks: non-linear classifiers Learning weights: backpropagation of error Autodifferentiation (in reverse mode)

  8. Sigmoid: Οƒ(v) = 1 / (1 + exp(βˆ’s v)). [Plot of the sigmoid for s = 10, s = 1, and s = 0.5.]

  9. Sigmoid: Οƒ(v) = 1 / (1 + exp(βˆ’s v)); βˆ‚Οƒ(v)/βˆ‚v = s * Οƒ(v) * (1 βˆ’ Οƒ(v)). Calc practice: verify this for yourself. [Same plot, s = 10, 1, 0.5.]
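A small sanity check (assumed code, not part of the deck) that the stated derivative matches a finite-difference estimate:

```python
import numpy as np

def sigmoid(v, s=1.0):
    return 1.0 / (1.0 + np.exp(-s * v))

def sigmoid_grad(v, s=1.0):
    sv = sigmoid(v, s)
    return s * sv * (1.0 - sv)      # s * sigma(v) * (1 - sigma(v))

v, s, eps = 0.3, 10.0, 1e-6
numeric = (sigmoid(v + eps, s) - sigmoid(v - eps, s)) / (2 * eps)
print(sigmoid_grad(v, s), numeric)  # the two values should agree closely
```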

  10. Remember Multi-class Linear Regression/Perceptron? y = w^T x + b. Output: if y > 0, class 1; else, class 2.

  11. Linear Regression/Perceptron: A Per-Class View: y_1 = w_1^T x + b_1, y_2 = w_2^T x + b_2; in matrix form, y = W^T x + b. Output: i = argmax{y_1, y_2}, predict class i. The binary version is a special case.

  12. Logistic Regression/Classification: y_1 ∝ exp(w_1^T x + b_1), y_2 ∝ exp(w_2^T x + b_2); equivalently, y = Οƒ(w^T x + b) (binary) or y = softmax(W^T x + b) (multi-class). Output: i = argmax{y_1, y_2}, predict class i.
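A minimal sketch (my own illustration, not the author's code) of the per-class view: each class k gets its own weight vector w_k and bias b_k, and the prediction is the argmax of softmax(W^T x + b).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # stabilized softmax
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))   # 4 input features, 2 classes: columns are w_1, w_2
b = np.zeros(2)
x = rng.normal(size=4)

y = softmax(W.T @ x + b)      # y_k proportional to exp(w_k^T x + b_k)
print(y, "predicted class:", int(np.argmax(y)))
```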

  13. Logistic Regression/Classification: y_1 ∝ exp(w_1^T x + b_1), y_2 ∝ exp(w_2^T x + b_2); output: i = argmax{y_1, y_2}, predict class i. Q: Why didn't our maxent formulation from last class have multiple weight vectors?

  14. Logistic Regression/Classification: Q: Why didn't our maxent formulation from last class have multiple weight vectors? A: Implicitly it did. Our formulation was p(y | x) ∝ exp(ΞΈ^T f(x, y)); the label-dependent feature function f(x, y) plays the role of the per-class weight vectors.

  15. Stacking Logistic Regression: Goal: you still want to predict y. Idea: can making an initial round of separate (independent) binary predictions h help? h_j = Οƒ(w_j^T x + b_0).

  16. Stacking Logistic Regression: h_j = Οƒ(w_j^T x + b_0), y_k = softmax(Ξ²_k^T h + b_1). Predict y from your first round of predictions h. Idea: data/signal compression.

  17. Stacking Logistic Regression: h_j = Οƒ(w_j^T x + b_0), y_k = softmax(Ξ²_k^T h + b_1). Do we need (binary) probabilities here?

  18. Stacking Logistic Regression: h_j = F(w_j^T x + b_0), y_k = softmax(Ξ²_k^T h + b_1), where F is a (non-linear) activation function. Do we need probabilities here?

  19. Stacking Logistic Regression: h_j = F(w_j^T x + b_0), y_k = softmax(Ξ²_k^T h + b_1), where F is a (non-linear) activation function. Do we need probabilities here? Classification: probably. Regression: not really.

  20. Stacking Logistic Regression: h_j = F(w_j^T x + b_0), y_k = G(Ξ²_k^T h + b_1). F: a (non-linear) activation function. G: a (non-linear) activation function; for classification, softmax; for regression, the identity.

  21. Multilayer Perceptron, a.k.a. Feed-Forward Neural Network: h_j = F(w_j^T x + b_0), y_k = G(Ξ²_k^T h + b_1). F: a (non-linear) activation function. G: a (non-linear) activation function; for classification, softmax; for regression, the identity.

  22. Feed-Forward Neural Network: h_j = F(w_j^T x + b_0), y_k = G(Ξ²_k^T h + b_1). Dimensions: Ξ² is #output Γ— #hidden; W is #hidden Γ— #input.
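A forward-pass sketch matching the stated shapes, assuming sigmoid for F and softmax for G (the sizes and random parameter values are my own placeholders, not from the slides):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n_input, n_hidden, n_output = 5, 4, 2
rng = np.random.default_rng(1)
W = rng.normal(size=(n_hidden, n_input))      # first-layer weights: row j is w_j
b0 = np.zeros(n_hidden)
beta = rng.normal(size=(n_output, n_hidden))  # second-layer weights: row k is beta_k
b1 = np.zeros(n_output)

x = rng.normal(size=n_input)
h = sigmoid(W @ x + b0)      # hidden layer: F applied elementwise to w_j^T x + b0
y = softmax(beta @ h + b1)   # output layer: G = softmax over beta_k^T h + b1
print(h.shape, y.shape, y.sum())   # (4,) (2,) 1.0
```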

  23. Why Non-Linear? y_k = G(Ξ²_k^T h + b_1) = G(Ξ£_j Ξ²_{kj} F(w_j^T x + b_0) + b_1). Without a non-linear F, the two layers would collapse into a single linear map.
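A quick numerical illustration (assumed, not from the deck) of why the non-linearity matters: with F set to the identity, two stacked linear layers reduce to one linear layer, since Ξ²(Wx) = (Ξ²W)x.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(4, 5))      # hidden x input
beta = rng.normal(size=(3, 4))   # output x hidden
x = rng.normal(size=5)

two_layers = beta @ (W @ x)      # "deep" network with a linear (identity) F
one_layer = (beta @ W) @ x       # equivalent single linear map
print(np.allclose(two_layers, one_layer))   # True
```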

  24. Feed-Forward: information/computation flows in one direction; no self-loops (no recurrence or reuse of weights).

  25. Why β€œNeural?” Arguing from a neuroscience perspective: neurons (in the brain) receive input and β€œfire” when sufficiently excited/activated. Image courtesy Hamed Pirsiavash

  26. Universal Function Approximator: Theorem [Kurt Hornik et al., 1989]: let F be a continuous function on a bounded subset of D-dimensional space. Then there exists a two-layer network G with a finite number of hidden units that approximates F arbitrarily well: for all x in the domain of F, |F(x) βˆ’ G(x)| < Ξ΅. In short, β€œa two-layer network can approximate any function.” Going from one to two layers dramatically improves the representational power of the network. Slide courtesy Hamed Pirsiavash

  27. How Deep Can They Be? So many choices: architecture, # of hidden layers, # of units per hidden layer. Computational issues: vanishing gradients (gradients shrink as one moves away from the output layer); convergence is slow. Opportunities: training deep networks is an active area of research; layer-wise initialization (perhaps using unsupervised data); engineering (GPUs to train on massive labelled datasets). Slide courtesy Hamed Pirsiavash

  28. Some Results: Digit Classification: comparison of a simple feed-forward network and logistic regression (similar to MNIST in A2, but not exactly the same). ESL, Ch 11

  29. Tensorflow Playground: http://playground.tensorflow.org. Experiment with small (toy-data) neural networks in your browser. Feel free to use this to gain an intuition.

  30. Outline Neural networks: non-linear classifiers Learning weights: backpropagation of error Autodifferentiation (in reverse mode)

  31. Empirical Risk Minimization: cross-entropy loss: β„“_xent(y*, y) = βˆ’Ξ£_k y*_k log p(y = k); mean squared error / L2 loss: β„“_L2(y*, y) = ||y* βˆ’ y||_2^2; squared expectation loss: β„“_sq-expt(y*, y) = ||y* βˆ’ p(y)||_2^2; hinge loss: β„“_hinge(y*, y) = max(0, 1 + max_{k β‰  y*} y[k] βˆ’ y[y*]).
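Hedged sketches of three of the listed losses (my own code), assuming a one-hot target y* and either predicted probabilities (cross entropy, L2) or raw scores (hinge):

```python
import numpy as np

def xent_loss(y_star, y_prob):
    # cross entropy: -sum_k y*_k log p(y = k)
    return -np.sum(y_star * np.log(y_prob))

def l2_loss(y_star, y):
    # squared error / L2 loss: ||y* - y||_2^2
    return np.sum((y_star - y) ** 2)

def hinge_loss(y_star_idx, scores):
    # multiclass hinge: max(0, 1 + max_{k != y*} y[k] - y[y*])
    wrong = np.delete(scores, y_star_idx)
    return max(0.0, 1.0 + wrong.max() - scores[y_star_idx])

y_star = np.array([0.0, 1.0, 0.0])
y_prob = np.array([0.2, 0.7, 0.1])
scores = np.array([0.5, 2.0, -1.0])
print(xent_loss(y_star, y_prob), l2_loss(y_star, y_prob), hinge_loss(1, scores))
```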

  32. Gradient Descent: Backpropagate the Error. Set t = 0; pick a starting value ΞΈ_t. Until converged, for example(s) i: 1. compute loss l on x_i; 2. get gradient g_t = l'(x_i); 3. get scaling factor ρ_t; 4. set ΞΈ_{t+1} = ΞΈ_t βˆ’ ρ_t * g_t; 5. set t += 1. Epoch: a single run over all the training data. (Mini-)batch: a run over a subset of the data.
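The update loop as a short sketch; the constant step size rho and the toy quadratic loss are my own choices for illustration, not part of the slide:

```python
import numpy as np

def sgd(loss_grad, theta0, data, rho=0.1, epochs=5):
    theta, t = np.asarray(theta0, dtype=float), 0
    for _ in range(epochs):           # one epoch = one full pass over the data
        for x_i in data:              # here each "(mini)batch" is a single example
            g_t = loss_grad(theta, x_i)   # gradient of the loss on x_i
            theta = theta - rho * g_t     # theta_{t+1} = theta_t - rho_t * g_t
            t += 1
    return theta

# toy example: each example contributes the loss (theta - x_i)^2
data = np.array([1.0, 2.0, 3.0])
grad = lambda theta, x: 2.0 * (theta - x)
print(sgd(grad, theta0=0.0, data=data))   # moves toward the data mean, 2.0
```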

  33. Gradients for Feed-Forward Neural Network: β„’ = βˆ’Ξ£_k y*_k log y_k, with y_k = Οƒ(Ξ²_k^T h) and h = Οƒ(W^T x + b_0) (a vector). βˆ‚β„’/βˆ‚Ξ²_{kj} = (βˆ’1 / y_{y*}) Β· βˆ‚y_{y*} / βˆ‚Ξ²_{kj}; βˆ‚β„’/βˆ‚w_{jl} comes next.

  34. Gradients for Feed-Forward Neural Network: β„’ = βˆ’Ξ£_k y*_k log y_k, with y_k = Οƒ(Ξ²_k^T h) and h = Οƒ(W^T x + b_0) (a vector). βˆ‚β„’/βˆ‚Ξ²_{kj} = (βˆ’1 / y_{y*}) Β· βˆ‚y_{y*} / βˆ‚Ξ²_{kj} = (βˆ’Οƒ'(Ξ²_{y*}^T h) / Οƒ(Ξ²_{y*}^T h)) Β· βˆ‚(Ξ²_{y*}^T h) / βˆ‚Ξ²_{kj}; βˆ‚β„’/βˆ‚w_{jl} comes next.

  35. Gradients for Feed-Forward Neural Network: β„’ = βˆ’Ξ£_k y*_k log y_k, with y_k = Οƒ(Ξ²_k^T h) and h = Οƒ(W^T x + b_0) (a vector). βˆ‚β„’/βˆ‚Ξ²_{kj} = (βˆ’1 / y_{y*}) Β· βˆ‚y_{y*} / βˆ‚Ξ²_{kj} = (βˆ’Οƒ'(Ξ²_{y*}^T h) / Οƒ(Ξ²_{y*}^T h)) Β· βˆ‚(Ξ²_{y*}^T h) / βˆ‚Ξ²_{kj} = (βˆ’Οƒ'(Ξ²_{y*}^T h) / Οƒ(Ξ²_{y*}^T h)) Β· βˆ‚(Ξ£_j Ξ²_{y*,j} h_j) / βˆ‚Ξ²_{kj}; βˆ‚β„’/βˆ‚w_{jl} comes next.
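To tie this to autodifferentiation: the sketch below (an assumed setup with a one-hot target, not the lecture's code) computes βˆ‚β„’/βˆ‚Ξ²_{kj} from the chain rule above and checks it against a finite difference, which is exactly the kind of derivative reverse-mode autodiff produces for us automatically.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def loss(beta, h, k_star):
    y = sigmoid(beta @ h)
    return -np.log(y[k_star])       # L = -log y_{y*} for a one-hot target y*

rng = np.random.default_rng(3)
beta = rng.normal(size=(3, 4))      # output x hidden
h = rng.normal(size=4)              # hidden activations, treated as given
k_star = 1                          # index of the correct class

# analytic gradient from the chain rule (nonzero only in row y*):
y = sigmoid(beta @ h)
grad = np.zeros_like(beta)
grad[k_star] = -(1.0 - y[k_star]) * h   # -(sigma'/sigma) * h, using sigma' = y(1-y)

# finite-difference check on one entry of beta
eps = 1e-6
beta_p = beta.copy(); beta_p[k_star, 0] += eps
beta_m = beta.copy(); beta_m[k_star, 0] -= eps
numeric = (loss(beta_p, h, k_star) - loss(beta_m, h, k_star)) / (2 * eps)
print(grad[k_star, 0], numeric)     # should agree closely
```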
