Neural Networks and Autodifferentiation
CMSC 678 UMBC
Recap from last time…
Maximum Entropy (Log-linear) Models
p(y | x) ∝ exp(θᵀ f(x, y))
"model the posterior probabilities of the K classes via linear functions in θ, while at the same time ensuring that they sum to one and remain in [0, 1]" ~ Ch 4.4
"[The log-linear estimate] is the least biased estimate possible… is maximally noncommittal with regard to missing information." Jaynes, 1957
Normalization for Classification
weight1 * f1(fatally shot, X) weight2 * f2(seriously wounded, X) weight3 * f3(Shining Path, X)
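A minimal sketch of the normalization step this slide leads up to, assuming three made-up feature values and weights per candidate label (all numbers hypothetical): each label's weighted feature score is exponentiated and divided by the sum, giving a posterior over labels.

import numpy as np

# hypothetical weights and feature values f_k(..., X) for 3 candidate labels
weights = np.array([0.5, 1.2, -0.3])
features = np.array([
    [1.0, 0.0, 1.0],   # feature values if label = 0
    [0.0, 1.0, 0.0],   # feature values if label = 1
    [1.0, 1.0, 0.0],   # feature values if label = 2
])
scores = features @ weights                  # sum_k weight_k * f_k(..., X) per label
p = np.exp(scores) / np.exp(scores).sum()    # normalize so the scores form a distribution
print(p, p.sum())                            # p sums to 1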
Connections to Other Techniques
Log-Linear Models
(Multinomial) logistic regression
Softmax regression
Maximum Entropy models (MaxEnt)
Generalized Linear Models
Discriminative Naïve Bayes
Very shallow (sigmoidal) neural nets
y = Σ_j θ_j x_j + b
the response can be a general (transformed) version of another response
log [ p(y = k) / p(y = K) ] = Σ_j θ_j f_j(x, k) + b
logistic regression
Log-Likelihood Gradient
Each component k is the difference between:
the total value of feature f_k in the training data
and
the total value the current model p_θ expects for feature f_k: Σ_i E_{y′ ~ p_θ}[ f_k(x_i, y′) ]
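A small sketch of that "observed minus expected" difference for a single training example, assuming a matrix feats[y, k] holding f_k(x_i, y) for each candidate label (names and numbers hypothetical):

import numpy as np

theta = np.array([0.2, -0.1, 0.4])
feats = np.array([               # feats[y, k] = f_k(x_i, y) for 3 labels, 3 features
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
])
y_obs = 2                        # observed label for this example

p = np.exp(feats @ theta)
p /= p.sum()                     # p_theta(y | x_i)

observed = feats[y_obs]          # value of each feature f_k in the (one-example) data
expected = p @ feats             # E_{y' ~ p_theta}[ f_k(x_i, y') ]
grad = observed - expected       # components of the log-likelihood gradient
print(grad)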
Outline
Neural networks: non-linear classifiers Learning weights: backpropagation of error Autodifferentiation (in reverse mode)
Sigmoid
σ(z) = 1 / (1 + exp(−s·z))    (curves shown for s = 10, s = 1, s = 0.5)
dσ(z)/dz = s · σ(z) · (1 − σ(z))
calc practice: verify for yourself
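One way to do that calc practice numerically rather than by hand, checking dσ(z)/dz = s·σ(z)·(1 − σ(z)) against a central difference (a quick sketch, not from the slides):

import numpy as np

def sigma(z, s=1.0):
    return 1.0 / (1.0 + np.exp(-s * z))

z, s, eps = 0.7, 10.0, 1e-6
analytic = s * sigma(z, s) * (1.0 - sigma(z, s))
numeric = (sigma(z + eps, s) - sigma(z - eps, s)) / (2 * eps)
print(np.isclose(analytic, numeric))   # True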
Remember Multi-class Linear Regression/Perceptron?
x → y
y = wᵀx + b
if y > 0: class 1 else: class 2
Linear Regression/Perceptron: A Per-Class View
x → y: y = wᵀx + b
per class: y_1 = w_1ᵀx + b, y_2 = w_2ᵀx + b
if y > 0: class 1 else: class 2
i = argmax {y1, y2} class i
binary version is special case
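A tiny sketch of that per-class view with made-up weight vectors w_1 and w_2 (all numbers hypothetical): each class gets its own score, and the predicted class is the argmax; with two classes this reduces to the sign test above.

import numpy as np

x = np.array([1.0, -2.0, 0.5])
W = np.array([[0.3, 0.1, -0.4],    # w_1
              [-0.2, 0.5, 0.1]])   # w_2
b = 0.1
y = W @ x + b                      # y_k = w_k^T x + b
print(np.argmax(y))                # class i = argmax {y1, y2}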
Logistic Regression/Classification
x → y
binary: y = σ(wᵀx + b)
multi-class: y = softmax(wᵀx + b), i.e. y_1 ∝ exp(w_1ᵀx + b), y_2 ∝ exp(w_2ᵀx + b)
i = argmax {y1, y2}   class i
Q: Why didn't our maxent formulation from last class have multiple weight vectors?
A: Implicitly it did. Our formulation was p(y | x) ∝ exp(θᵀ f(x, y)): the features f(x, y) already depend on the candidate label y, so a single θ plays the role of the per-class weight vectors.
Stacking Logistic Regression
x → h → y
h_i = σ(w_iᵀ x + b_0)
Goal: you still want to predict y
Idea: can making an initial round of predictions h help?

Stacking Logistic Regression
h_i = σ(w_iᵀ x + b_0)
y_j = softmax(β_jᵀ h + b_1)
Predict y from your first round of predictions h
Idea: data/signal compression

Stacking Logistic Regression
h_i = F(w_iᵀ x + b_0)
F: (non-linear) activation function
Do we need (binary) probabilities here? Classification: probably; Regression: not really

Stacking Logistic Regression
y_j = G(β_jᵀ h + b_1)
G: (non-linear) activation function (Classification: softmax; Regression: identity)

Multilayer Perceptron, a.k.a. Feed-Forward Neural Network
h_i = F(w_iᵀ x + b_0)
y_j = G(β_jᵀ h + b_1)

Feed-Forward Neural Network
h_i = F(w_iᵀ x + b_0)    y_j = G(β_jᵀ h + b_1)
β: # output × # hidden    w: # hidden × # input
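A minimal numpy forward pass matching those shapes (the layer sizes and the sigmoid/softmax choices here are illustrative assumptions, not fixed by the slide):

import numpy as np

n_in, n_hid, n_out = 5, 4, 3
rng = np.random.default_rng(0)
w = rng.normal(size=(n_hid, n_in))         # w: # hidden x # input
b0 = np.zeros(n_hid)
beta = rng.normal(size=(n_out, n_hid))     # beta: # output x # hidden
b1 = np.zeros(n_out)

x = rng.normal(size=n_in)
h = 1.0 / (1.0 + np.exp(-(w @ x + b0)))    # h_i = F(w_i^T x + b_0), F = sigmoid
scores = beta @ h + b1
y = np.exp(scores) / np.exp(scores).sum()  # y = G(beta h + b_1), G = softmax
print(h.shape, y.shape, y.sum())           # (4,) (3,) 1.0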
Why Non-Linear?
x → h → y
y_j = G(β_jᵀ h + b_1)
y_j = G( Σ_i β_ji F(w_iᵀ x + b_0) + b_1 )
if F and G were both linear (identity), this whole composition would collapse to a single linear function of x; the non-linear activations are what give the extra layer its added power
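A quick numerical illustration of that collapse with made-up matrices (a sketch; biases omitted for brevity): with identity activations, two stacked layers equal one linear map.

import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 5))
beta = rng.normal(size=(3, 4))
x = rng.normal(size=5)

two_linear_layers = beta @ (w @ x)     # F and G both identity
one_linear_layer = (beta @ w) @ x      # a single equivalent weight matrix
print(np.allclose(two_linear_layers, one_linear_layer))   # True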
Feed-Forward
x → h → y
information/computation flow
no self-loops (recurrence/reuse of weights)
Why βNeural?β
argue from neuroscience perspective neurons (in the brain) receive input and βfireβ when sufficiently excited/activated
Image courtesy Hamed Pirsiavash
Universal Function Approximator
Theorem [Kurt Hornik et al., 1989]: Let F be a continuous function on a bounded subset of D-dimensional space. Then there exists a two-layer network G with a finite number of hidden units that approximates F arbitrarily well.
"a two-layer network can approximate any function"
Going from one to two layers dramatically improves the representation power of the network
Slide courtesy Hamed Pirsiavash
How Deep Can They Be?
So many choices: Architecture # of hidden layers # of units per hidden layer
Slide courtesy Hamed Pirsiavash
Computational Issues:
Vanishing gradients: gradients shrink as one moves away from the output layer
Convergence is slow
Opportunities:
Training deep networks is an active area of research
Layer-wise initialization (perhaps using unsupervised data)
Engineering: GPUs to train on massive labelled datasets
Some Results: Digit Classification
logistic regression
ESL, Ch 11
simple feed forward (similar to MNIST in A2, but not exactly the same)
Tensorflow Playground
http://playground.tensorflow.org
Experiment with neural networks on small (toy) data in your browser
Feel free to use this to gain an intuition
Outline
Neural networks: non-linear classifiers Learning weights: backpropagation of error Autodifferentiation (in reverse mode)
Empirical Risk Minimization
ℓ_xent(y*, y) = −Σ_k y*[k] log p(y = k)
cross-entropy loss

ℓ_L2(y*, y) = (y* − y)²
mean squared error / L2 loss

ℓ_sq-expt(y*, y) = ‖y* − E[y]‖₂²
squared expectation loss

ℓ_hinge(y*, y) = max(0, 1 + max_{k ≠ y*} y[k] − y[y*])
hinge loss
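Three of these losses written out for a one-hot y* and a model output y (a short sketch with made-up numbers, following the definitions above):

import numpy as np

y_star = np.array([0.0, 1.0, 0.0])      # one-hot true label (class 1)
y = np.array([0.2, 0.5, 0.3])           # model output over 3 classes

xent = -np.sum(y_star * np.log(y))                                  # cross-entropy loss
l2 = np.sum((y_star - y) ** 2)                                      # squared error / L2 loss
true_k = np.argmax(y_star)
hinge = max(0.0, 1.0 + np.max(np.delete(y, true_k)) - y[true_k])    # hinge loss
print(xent, l2, hinge)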
Gradient Descent: Backpropagate the Error
Set t = 0
Pick a starting value θ_t
Until converged: for example(s) i:
epoch: a single run over all training data
(mini-)batch: a run over a subset
Gradients for Feed Forward Neural Network
y_k = F(β_kᵀ h), with h_i = F(w_iᵀ x + b_0)
h: a vector
ℓ = −Σ_k y*[k] log y_k

∂ℓ/∂β_kj = (−1 / y_{y*}) · ∂y_{y*} / ∂β_kj
         = −[ F′(β_{y*}ᵀ h) / F(β_{y*}ᵀ h) ] · ∂(β_{y*}ᵀ h) / ∂β_kj
         = −[ F′(β_{y*}ᵀ h) / F(β_{y*}ᵀ h) ] · ∂(Σ_j β_{y*,j} h_j) / ∂β_kj
         = −(1 − F(β_{y*}ᵀ h)) · h_j        (using F′ = F · (1 − F), e.g. a sigmoid F)

∂ℓ/∂w_ij = −(1 − F(β_{y*}ᵀ h)) · β_{y*,i} · F′(w_iᵀ x + b_0) · x_j

Debugging can be hard to do!
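Because debugging these by hand is error-prone, a finite-difference "gradient check" is the usual safeguard. Below is a sketch that implements the model and gradients exactly as derived above (F = sigmoid in both layers, loss = −log y_{y*}); every size and value is made up.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
D, H, K = 5, 4, 3                      # input, hidden, output sizes (arbitrary)
x = rng.normal(size=D)
w = rng.normal(size=(H, D))            # first layer: # hidden x # input
b0 = rng.normal(size=H)
beta = rng.normal(size=(K, H))         # output layer: # output x # hidden
ystar = 1                              # index of the true class

def loss(w, beta):
    h = sigmoid(w @ x + b0)            # h_i = F(w_i^T x + b_0)
    y = sigmoid(beta @ h)              # y_k = F(beta_k^T h)
    return -np.log(y[ystar])           # cross-entropy with one-hot y*

# analytic gradients, as derived above
h = sigmoid(w @ x + b0)
y = sigmoid(beta @ h)
d_beta_row = -(1.0 - y[ystar]) * h                                   # dl / d beta_{y*, j}
d_w = -(1.0 - y[ystar]) * np.outer(beta[ystar] * h * (1.0 - h), x)   # dl / d w_{ij}

# finite-difference check of one entry of each
eps = 1e-6
bp, bm = beta.copy(), beta.copy()
bp[ystar, 2] += eps; bm[ystar, 2] -= eps
wp, wm = w.copy(), w.copy()
wp[1, 3] += eps; wm[1, 3] -= eps
print(np.isclose(d_beta_row[2], (loss(w, bp) - loss(w, bm)) / (2 * eps)))    # True
print(np.isclose(d_w[1, 3], (loss(wp, beta) - loss(wm, beta)) / (2 * eps)))  # True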
Dropout: Regularization in Neural Networks
x → h → y
randomly ignore "neurons" (h_i) during training
a different random subset of the h_i is dropped for each training instance (instance 1, instance 2, instance 3, …)
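A sketch of how that per-instance masking is typically applied to the hidden layer during training (the "inverted dropout" rescaling is a common convention, not something stated on the slide):

import numpy as np

rng = np.random.default_rng(0)
h = rng.uniform(size=6)                    # hidden activations h_i for one instance
p_keep = 0.5
mask = rng.random(size=h.shape) < p_keep   # a fresh random mask per training instance
h_dropped = (h * mask) / p_keep            # zero out dropped units; rescale the rest
print(mask, h_dropped)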
Outline
Neural networks: non-linear classifiers Learning weights: backpropagation of error Autodifferentiation (in reverse mode)
Finding Gradients
f(x_1, x_2) = x_1² + (x_1 − x_2)^a − log(x_1² + x_2²)
what are the partial derivatives?

∂f(x_1, x_2)/∂x_1 = 2x_1 + a(x_1 − x_2)^(a−1) − 2x_1 / (x_1² + x_2²)
∂f(x_1, x_2)/∂x_2 = −a(x_1 − x_2)^(a−1) − 2x_2 / (x_1² + x_2²)
chain rule (multiple times)
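Those two expressions can be sanity-checked numerically; with a = 1 at (x_1, x_2) = (2, 1) they give (4.2, −1.4), the same values the autodiff example below arrives at. A quick sketch:

import numpy as np

a = 1.0
def f(x1, x2):
    return x1**2 + (x1 - x2)**a - np.log(x1**2 + x2**2)

df_dx1 = lambda x1, x2: 2*x1 + a*(x1 - x2)**(a - 1) - 2*x1 / (x1**2 + x2**2)
df_dx2 = lambda x1, x2: -a*(x1 - x2)**(a - 1) - 2*x2 / (x1**2 + x2**2)

eps = 1e-6
print(df_dx1(2.0, 1.0), (f(2.0 + eps, 1.0) - f(2.0 - eps, 1.0)) / (2 * eps))  # both ~ 4.2
print(df_dx2(2.0, 1.0), (f(2.0, 1.0 + eps) - f(2.0, 1.0 - eps)) / (2 * eps))  # both ~ -1.4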
Autodifferentiation
f(x_1, x_2) = x_1² + (x_1 − x_2)^a − log(x_1² + x_2²)

"straight line" program:
z_1 = x_1²
z_2 = x_2²
z_3 = (x_1 − x_2)
z_4 = z_3^a
z_5 = z_1 + z_2
z_6 = log z_5
z_7 = z_1 + z_4 − z_6
y = z_7

autodiff: a way of finding gradients
mechanistic/procedural
two (standard) modes: forward and reverse
ML often uses reverse mode

computation graph: nodes x_1, x_2, z_1 … z_7, y, with an edge from each variable to every expression that uses it

goals: ∂y/∂x_1 and ∂y/∂x_2
adjoint: ðu = ∂y/∂u, with ðy = 1

work backwards through the program, one adjoint at a time:
ðz_7 = ∂y/∂z_7 = 1
ðz_6 = ∂y/∂z_6 = (∂y/∂z_7)(∂z_7/∂z_6) = ðz_7 · (−1)
ðz_4 = ∂y/∂z_4 = ðz_7 · 1
ðz_5 = ∂y/∂z_5 = (∂y/∂z_7)(∂z_7/∂z_6)(∂z_6/∂z_5) = ðz_6 · (1/z_5)
ðz_1 = ∂y/∂z_1 = (∂y/∂z_7)(∂z_7/∂z_1) + (∂y/∂z_7)(∂z_7/∂z_6)(∂z_6/∂z_5)(∂z_5/∂z_1), accumulated as ðz_1 += ðz_7 · 1 and ðz_1 += ðz_5 · 1 (z_1 feeds both z_7 and z_5)
ðz_2 = ∂y/∂z_2 = ðz_5 · 1
ðz_3 = ∂y/∂z_3 = ðz_4 · a · z_3^(a−1)
ðx_1 += ðz_1 · 2x_1;  ðx_1 += ðz_3 · 1
ðx_2 += ðz_2 · 2x_2;  ðx_2 += ðz_3 · (−1)

autodifferentiation in reverse mode

Autodifferentiation in Reverse Mode
at x_1 = 2, x_2 = 1, a = 1:  f(x_1 = 2, x_2 = 1) ≈ 3.390562
∇_x f = (4.2, −1.4) by exact gradients
∇_x f = (4.2, −1.4) by autodiff
Code Proof of Autodiff
>> import numpy

>> def autodiff(x1, x2, a=1.0):
       # forward pass
       z1 = x1**2
       z2 = x2**2
       z3 = (x1 - x2)
       z4 = z3**a
       z5 = z1 + z2
       z6 = numpy.log(z5)
       z7 = z1 + z4 - z6
       y = z7
       # backward pass
       dy = 1
       dz7 = dy
       dz6 = dz7 * -1.0
       dz5 = dz6 * 1.0 / z5
       dz4 = dz7 * 1.0
       dz3 = dz4 * a * z3**(a - 1)
       dz2 = dz5 * 1.0
       dz1 = dz7 * 1.0 + dz5 * 1.0
       dx1 = dz1 * 2 * x1 + dz3 * 1.0
       dx2 = dz2 * 2 * x2 + dz3 * -1.0
       return dx1, dx2

>> autodiff(2, 1)
(4.2, -1.4)

>> def f(x1, x2):
       return x1**2 + (x1 - x2)**1 - numpy.log(x1**2 + x2**2)
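As an extra check beyond the slide, the same numbers fall out of a central difference on the f defined above (a quick sketch):

>> eps = 1e-6
>> (f(2 + eps, 1) - f(2 - eps, 1)) / (2 * eps)   # approximately 4.2
>> (f(2, 1 + eps) - f(2, 1 - eps)) / (2 * eps)   # approximately -1.4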
Outline
Neural networks: non-linear classifiers Learning weights: backpropagation of error Autodifferentiation (in reverse mode)
Gradient Descent: Backpropagate the Error
Set t = 0
Pick a starting value θ_t
Until converged: for example(s) i:
1. Compute loss l on x_i
2. Get gradient g_t = l′(x_i)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t − ρ_t · g_t
5. Set t += 1
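A minimal sketch of that loop for a toy model (logistic regression on made-up data; the constant scaling factor ρ, the fixed epoch count, and the batch size of 1 are illustrative choices, not the course's prescription):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                       # made-up training examples x_i
true_theta = np.array([1.0, -2.0, 0.5])
Y = (X @ true_theta > 0).astype(float)              # made-up labels

theta = np.zeros(3)                                 # pick a starting value theta_t
rho = 0.1                                           # scaling factor (learning rate)
for epoch in range(50):                             # epoch: a run over all training data
    for i in rng.permutation(len(X)):               # here, "mini-batches" of size 1
        p = 1.0 / (1.0 + np.exp(-X[i] @ theta))     # 1. loss is -log p(y_i | x_i)
        g = (p - Y[i]) * X[i]                       # 2. gradient of that loss
        theta = theta - rho * g                     # 3-4. theta_{t+1} = theta_t - rho * g_t
print(theta)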