ECE 417 Fall 2018 Lecture 17: Neural Networks
Mark Hasegawa-Johnson, University of Illinois
October 23, 2018
Outline
1. What is a Neural Net?
2. Knowledge-Based Design
3. Nonlinearities
4. Error Metric
5. Gradient Descent
Two-Layer Feedforward Neural Network
- $\vec{x} = (x_1, \ldots, x_p)^T$ is the input vector.
- Synapse, first layer: $a_k = u_{k0} + \sum_{j=1}^{p} u_{kj} x_j$, i.e., $\vec{a} = U\vec{x}$ (with a constant $x_0 = 1$ appended to carry the bias).
- Axon, first layer: $y_k = f(a_k)$, i.e., $\vec{y} = f(\vec{a})$, for hidden nodes $y_1, \ldots, y_q$.
- Synapse, second layer: $b_\ell = v_{\ell 0} + \sum_{k=1}^{q} v_{\ell k} y_k$, i.e., $\vec{b} = V\vec{y}$.
- Axon, second layer: $z_\ell = g(b_\ell)$, i.e., $\vec{z} = g(\vec{b})$, for output nodes $z_1, \ldots, z_r$.
Thus $\vec{z} = h(\vec{x}, U, V)$, which is decomposed as above.
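To make the decomposition concrete, here is a minimal NumPy sketch of the forward pass (the function name and the choice of tanh for $f(\cdot)$ and identity for $g(\cdot)$ are illustrative assumptions, not fixed by the slides):

```python
import numpy as np

def forward(x, U, V, f=np.tanh, g=lambda b: b):
    """Two-layer forward pass: x -> a -> y -> b -> z.

    U has shape (q, p+1) and V has shape (r, q+1); column 0 of each
    holds the bias terms u_{k0} and v_{l0}.
    """
    a = U @ np.concatenate(([1.0], x))  # a_k = u_{k0} + sum_j u_{kj} x_j
    y = f(a)                            # y_k = f(a_k)
    b = V @ np.concatenate(([1.0], y))  # b_l = v_{l0} + sum_k v_{lk} y_k
    z = g(b)                            # z_l = g(b_l)
    return a, y, b, z
```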
A Neural Net is Made Of...
- Linear transformations: $\vec{a} = U\vec{x}$, $\vec{b} = V\vec{y}$, one per layer.
- Scalar nonlinearities: $\vec{y} = f(\vec{a})$ means that, element by element, $y_k = f(a_k)$ for some nonlinear function $f(\cdot)$. The nonlinearities can all be different, if you want. For today, I'll assume that all nodes in the first layer use one function $f(\cdot)$, and all nodes in the second layer use some other function $g(\cdot)$.
- Networks with more than two layers are called "deep neural networks" (DNNs). I won't talk about them today.
- Barron (1993) proved that two layers of linear transforms, with one scalar nonlinearity between them, are enough to approximate any sufficiently smooth multivariate nonlinear function $\vec{z} = h(\vec{x})$.
Neural Network = Universal Approximator
Assume...
- Linear output nodes: $g(b) = b$.
- Smoothly nonlinear hidden nodes: $f'(a) = \frac{df}{da}$ exists and is finite.
- Smooth target function: $\vec{z} = h(\vec{x}, U, V)$ approximates $\vec{\zeta} = h^*(\vec{x}) \in \mathcal{H}$, where $\mathcal{H}$ is some class of sufficiently smooth functions of $\vec{x}$ (functions whose Fourier transform has a first moment less than some finite number $C$).
- There are $q$ hidden nodes, $y_k$, $1 \le k \le q$.
- The input vectors are distributed with some probability density function, $p(\vec{x})$, over which we can compute expected values.
Then Barron (1993) showed that
$$\max_{h^*(\vec{x}) \in \mathcal{H}} \min_{U,V} E\left[\,|h(\vec{x}, U, V) - h^*(\vec{x})|^2\,\right] \le O\left(\frac{1}{q}\right)$$
Neural Network Problems: Outline of Remainder of this Talk
1. Knowledge-Based Design. Given $U$, $V$, $f$, $g$, what kind of function is $h(\vec{x}, U, V)$? Can we draw $\vec{z}$ as a function of $\vec{x}$? Can we heuristically choose $U$ and $V$ so that $\vec{z}$ looks kinda like $\vec{\zeta}$?
2. Nonlinearities. They come in pairs: the test-time nonlinearity and the training-time nonlinearity.
3. Error Metric. In what way should $\vec{z} = h(\vec{x})$ be "similar to" $\vec{\zeta} = h^*(\vec{x})$?
4. Training: Gradient Descent with Back-Propagation. Given an initial $U, V$, how do I find $\hat{U}, \hat{V}$ that more closely approximate $\vec{\zeta}$?
Knowledge-Based Design
Synapse, First Layer: $a_k = u_{k0} + \sum_{j=1}^{2} u_{kj} x_j$
Axon, First Layer: $y_k = \tanh(a_k)$
Synapse, Second Layer: $b_\ell = v_{\ell 0} + \sum_{k=1}^{2} v_{\ell k} y_k$
Axon, Second Layer: $z_\ell = \operatorname{sign}(b_\ell)$
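Putting the four steps together, here is a sketch of a knowledge-based design: the weights below are hypothetical hand-chosen values (not from the slides) that make this 2-2-1 network compute XOR on $\{0, 1\}^2$.

```python
import numpy as np

# Hypothetical hand-chosen weights; column 0 holds the biases.
U = np.array([[ -5.0, 10.0, 10.0],   # row k: [u_k0, u_k1, u_k2]
              [-15.0, 10.0, 10.0]])
V = np.array([[ -0.5,  1.0, -1.0]])  # row l: [v_l0, v_l1, v_l2]

def classify(x):
    a = U @ np.concatenate(([1.0], x))  # synapse, first layer
    y = np.tanh(a)                      # axon, first layer
    b = V @ np.concatenate(([1.0], y))  # synapse, second layer
    return np.sign(b)                   # axon, second layer

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, classify(np.array(x, dtype=float)))  # -1, +1, +1, -1
```

The first hidden unit turns on when $x_1 + x_2 > 0.5$, the second when $x_1 + x_2 > 1.5$; the output fires only when exactly one of them is on.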
Nonlinearities
Differentiable and Non-differentiable Nonlinearities
The nonlinearities come in pairs: (1) the test-time nonlinearity is the one you use in the output layer of your learned classifier, e.g., in the app on your cell phone; (2) the training-time nonlinearity is used in the output layer during training, and in the hidden layers during both training and test.

Application                | Test-time output | Training-time output & hidden
{0, 1} classification      | step             | logistic or ReLU
{−1, +1} classification    | signum           | tanh
multinomial classification | argmax           | softmax
regression                 | linear           | linear (hidden nodes must be nonlinear)
(Figures: the step and logistic nonlinearities; the signum and tanh nonlinearities.)
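Since the plots did not survive extraction, here is a sketch of the four functions using their standard definitions (the step convention at exactly 0 is an assumption):

```python
import numpy as np

step     = lambda b: (b > 0).astype(float)     # test-time, {0, 1} targets
logistic = lambda b: 1.0 / (1.0 + np.exp(-b))  # its smooth training-time twin
signum   = np.sign                             # test-time, {-1, +1} targets
tanh     = np.tanh                             # its smooth training-time twin
```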
(Figures: the "linear nonlinearity" and ReLU; argmax and softmax.)

Argmax: $z_\ell = \begin{cases} 1 & b_\ell = \max_m b_m \\ 0 & \text{otherwise} \end{cases}$

Softmax: $z_\ell = \frac{e^{b_\ell}}{\sum_m e^{b_m}}$
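A small sketch of both output nonlinearities; subtracting $\max_m b_m$ before exponentiating is a standard overflow guard, not part of the slide's formula:

```python
import numpy as np

def softmax(b):
    e = np.exp(b - np.max(b))  # shift-invariant: same z, no overflow
    return e / e.sum()

def argmax_onehot(b):
    z = np.zeros_like(b, dtype=float)
    z[np.argmax(b)] = 1.0      # z_l = 1 where b_l = max_m b_m, else 0
    return z
```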
Error Metric
Error Metric: MMSE for Linear Output Nodes
Minimum Mean Squared Error (MMSE):
$$U^*, V^* = \arg\min E = \arg\min \frac{1}{2n} \sum_{i=1}^{n} |\vec{\zeta}_i - \vec{z}(\vec{x}_i)|^2$$
Why would we want to use this metric? If the training samples $(\vec{x}_i, \vec{\zeta}_i)$ are i.i.d., then in the limit as the number of training tokens goes to infinity, $h(\vec{x}) \to E[\vec{\zeta}\,|\,\vec{x}]$.
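As a sketch, the same criterion computed over a whole training set (rows index tokens $i$):

```python
import numpy as np

def mmse(Z, Zeta):
    """E = (1/2n) * sum_i |zeta_i - z(x_i)|^2, for Z, Zeta of shape (n, r)."""
    n = Z.shape[0]
    return np.sum((Zeta - Z) ** 2) / (2 * n)
```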
Error Metric: MMSE for Binary Target Vector
Suppose
$$\zeta_\ell = \begin{cases} 1 & \text{with probability } P_\ell(\vec{x}) \\ 0 & \text{with probability } 1 - P_\ell(\vec{x}) \end{cases}$$
and suppose $0 \le z_\ell \le 1$, e.g., logistic output nodes. Why does MMSE make sense for binary targets? Because
$$E[\zeta_\ell\,|\,\vec{x}] = 1 \cdot P_\ell(\vec{x}) + 0 \cdot (1 - P_\ell(\vec{x})) = P_\ell(\vec{x})$$
So the MMSE neural network solution converges to $z_\ell \to E[\zeta_\ell\,|\,\vec{x}] = P_\ell(\vec{x})$.
Softmax versus Logistic Output Nodes
Encoding the Neural Net Output using a "One-Hot Vector"
Suppose $\vec{\zeta}_i$ is a "one-hot" vector, i.e., only one element is "hot" ($\zeta_{\ell(i),i} = 1$) and all others are "cold" ($\zeta_{m,i} = 0$, $m \ne \ell(i)$). Training logistic output nodes with MMSE will approach the solution $z_\ell = \Pr\{\zeta_\ell = 1\,|\,\vec{x}\}$, but there is no guarantee that $\vec{z}$ is a correctly normalized pmf ($\sum_\ell z_\ell = 1$) until training has fully converged. Softmax output nodes guarantee that $\sum_\ell z_\ell = 1$:
$$z_\ell = \frac{e^{b_\ell}}{\sum_m e^{b_m}}$$
Cross-Entropy
The softmax nonlinearity is "matched" to an error criterion called "cross-entropy," in the sense that its derivative simplifies to a very, very simple form.
- $\zeta_{\ell,i}$ is the true reference probability that observation $\vec{x}_i$ is of class $\ell$. In most cases, this reference probability is either 0 or 1 (one-hot).
- $z_{\ell,i}$ is the neural network's hypothesis about the probability that $\vec{x}_i$ is of class $\ell$. The softmax function constrains this to satisfy $0 \le z_{\ell,i} \le 1$ and $\sum_\ell z_{\ell,i} = 1$.
The average cross-entropy between these two distributions is
$$E = -\frac{1}{n} \sum_{i=1}^{n} \sum_{\ell} \zeta_{\ell,i} \log z_{\ell,i}$$
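A minimal sketch of this criterion (the eps guard against $\log 0$ is a numerical assumption, not part of the slide's formula):

```python
import numpy as np

def cross_entropy(Z, Zeta, eps=1e-12):
    """E = -(1/n) * sum_i sum_l zeta_{l,i} log z_{l,i}, for shape (n, r)."""
    n = Z.shape[0]
    return -np.sum(Zeta * np.log(Z + eps)) / n
```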
Cross-Entropy = Log Probability
Suppose token $\vec{x}_i$ is of class $\ell^*$, meaning that $\zeta_{\ell^*,i} = 1$ and all others are zero. Then the cross-entropy is just the negative log of the neural net's estimated probability of the correct class:
$$E = -\frac{1}{n} \sum_{i=1}^{n} \log z_{\ell^*,i}$$
In other words, $E$ is the average of the negative log probability of each training token:
$$E = \frac{1}{n} \sum_{i=1}^{n} E_i, \quad E_i = -\log z_{\ell^*,i}$$
Cross-Entropy is Matched to Softmax
Now let's plug in the softmax:
$$E_i = -\log z_{\ell^*,i}, \quad z_{\ell^*,i} = \frac{e^{b_{\ell^*,i}}}{\sum_k e^{b_{k,i}}}$$
Its gradient with respect to the softmax inputs, $b_{m,i}$, is
$$\frac{\partial E_i}{\partial b_{m,i}} = -\frac{1}{z_{\ell^*,i}} \frac{\partial z_{\ell^*,i}}{\partial b_{m,i}} = \begin{cases} -\frac{1}{z_{\ell^*,i}} \left( \frac{e^{b_{\ell^*,i}}}{\sum_k e^{b_{k,i}}} - \frac{\left(e^{b_{\ell^*,i}}\right)^2}{\left(\sum_k e^{b_{k,i}}\right)^2} \right) & m = \ell^* \\[2ex] -\frac{1}{z_{\ell^*,i}} \left( -\frac{e^{b_{\ell^*,i}}\, e^{b_{m,i}}}{\left(\sum_k e^{b_{k,i}}\right)^2} \right) & m \ne \ell^* \end{cases} \;=\; z_{m,i} - \zeta_{m,i}$$
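A quick finite-difference sanity check of this identity (an illustrative sketch, not from the slides):

```python
import numpy as np

def softmax(b):
    e = np.exp(b - b.max())
    return e / e.sum()

rng = np.random.default_rng(0)
b = rng.standard_normal(5)
lstar = 2                                 # true class, so zeta = e_{l*}
zeta = np.eye(5)[lstar]
E = lambda b: -np.log(softmax(b)[lstar])  # E_i = -log z_{l*,i}

analytic = softmax(b) - zeta              # claimed gradient: z - zeta
h = 1e-6
numeric = np.array([(E(b + h * np.eye(5)[m]) - E(b - h * np.eye(5)[m])) / (2 * h)
                    for m in range(5)])
print(np.allclose(analytic, numeric))     # True
```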
Error Metrics Summarized
- Use MSE to achieve $\vec{z} = E[\vec{\zeta}\,|\,\vec{x}]$. That's almost always what you want.
- If $\vec{\zeta}$ is a one-hot vector, then use cross-entropy (with a softmax nonlinearity on the output nodes) to guarantee that $\vec{z}$ is a properly normalized probability mass function, and because it gives you the amazingly easy formula $\frac{\partial E_i}{\partial b_{m,i}} = z_{m,i} - \zeta_{m,i}$.
- If $\zeta_\ell$ is binary, but not necessarily one-hot, then use MSE (with a logistic nonlinearity) to achieve $z_\ell = \Pr\{\zeta_\ell = 1\,|\,\vec{x}\}$.
Gradient Descent
Gradient Descent = Local Optimization
Given an initial $U, V$, find $\hat{U}, \hat{V}$ with lower error:
$$\hat{u}_{kj} = u_{kj} - \eta \frac{\partial E}{\partial u_{kj}}, \quad \hat{v}_{\ell k} = v_{\ell k} - \eta \frac{\partial E}{\partial v_{\ell k}}$$
where $\eta$ is the learning rate. If $\eta$ is too large, gradient descent won't converge; if it's too small, convergence is slow. Usually we pick $\eta \approx 0.001$, see whether training converges, and if not, tweak $\eta$ and try again. Methods like Newton's algorithm, L-BFGS, Adam, and Hessian-free optimization adapt the step size at each iteration, so they typically converge MUCH faster.
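In code, one step of the update is just the following sketch (dE_dU and dE_dV stand for the gradients derived on the next slides):

```python
def gd_step(U, V, dE_dU, dE_dV, eta=0.001):
    """One gradient-descent step: each weight moves against its gradient."""
    return U - eta * dE_dU, V - eta * dE_dV
```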
Computing the Gradient
$$E = \frac{1}{n} \sum_{i=1}^{n} E_i, \quad E_i = \text{cross-entropy or MMSE}$$
$$\frac{\partial E}{\partial v_{\ell k}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial E_i}{\partial b_{\ell,i}} \frac{\partial b_{\ell,i}}{\partial v_{\ell k}} = \frac{1}{n} \sum_{i=1}^{n} \epsilon_{\ell,i}\, y_{k,i}$$
where I've used one thing you already know, and one new definition. Here's the thing you already know:
$$b_{\ell,i} = \sum_k v_{\ell k}\, y_{k,i}, \quad \text{therefore} \quad \frac{\partial b_{\ell,i}}{\partial v_{\ell k}} = y_{k,i}$$
Here's the new definition:
$$\epsilon_{\ell,i} = \frac{\partial E_i}{\partial b_{\ell,i}} = \begin{cases} z_{\ell,i} - \zeta_{\ell,i} & \text{cross-entropy with softmax} \\ (z_{\ell,i} - \zeta_{\ell,i})\, g'(b_{\ell,i}) & \text{MMSE with nonlinearity } g(b) \end{cases}$$
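A sketch of this second-layer gradient, assuming the same bias-in-column-0 layout as the earlier forward-pass sketch; which branch of $\epsilon$ you use depends on the error metric:

```python
import numpy as np

def grad_V(Y, Z, Zeta, B=None, g_prime=None):
    """dE/dV = (1/n) * sum_i eps_i [1; y_i]^T; rows of Y, Z, Zeta index tokens.

    Cross-entropy with softmax: eps = Z - Zeta (leave g_prime=None).
    MMSE with output nonlinearity g: eps = (Z - Zeta) * g'(B).
    """
    eps = Z - Zeta if g_prime is None else (Z - Zeta) * g_prime(B)
    n = Y.shape[0]
    Y1 = np.hstack([np.ones((n, 1)), Y])  # prepend the bias input y_0 = 1
    return eps.T @ Y1 / n                 # shape (r, q+1), matches V
```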
Forward Propagation and Back-Propagation
$$\frac{\partial E}{\partial v_{\ell k}} = \frac{1}{n} \sum_{i=1}^{n} \epsilon_{\ell,i}\, y_{k,i}$$
First, $y_{k,i}$ and $z_{\ell,i}$ are generated from $\vec{x}_i$ in the forward pass. Then $\epsilon_{\ell,i}$ is generated from $z_{\ell,i} - \zeta_{\ell,i}$ in the back-propagation.
$g'(b)$: Derivatives of the Nonlinearities
(Figures: the derivatives of the logistic, tanh, and ReLU nonlinearities.)
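Again the plots are gone; here is a sketch of the three derivatives, each written in terms of quantities the forward pass already computed where possible:

```python
import numpy as np

def logistic_prime(b):
    z = 1 / (1 + np.exp(-b))
    return z * (1 - z)            # derivative in terms of the output z

def tanh_prime(b):
    return 1 - np.tanh(b) ** 2    # = 1 - y^2 in terms of the output y

def relu_prime(b):
    return (b > 0).astype(float)  # 0 for b < 0, 1 for b > 0
```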
Back-Propagating to the First Layer
$$\frac{\partial E}{\partial u_{kj}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial E_i}{\partial a_{k,i}} \frac{\partial a_{k,i}}{\partial u_{kj}} = \frac{1}{n} \sum_{i=1}^{n} \delta_{k,i}\, x_{j,i}$$
where
$$\delta_{k,i} = \frac{\partial E_i}{\partial a_{k,i}} = \sum_{\ell=1}^{r} \epsilon_{\ell,i}\, v_{\ell k}\, f'(a_{k,i})$$
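A sketch of the first-layer gradient, chaining $\epsilon$ back through $V$ and $f'$ (same layout assumptions as before; V[:, 1:] drops the bias column, which does not feed back to the hidden layer):

```python
import numpy as np

def grad_U(X, A, eps, V, f_prime):
    """dE/dU = (1/n) * sum_i delta_i [1; x_i]^T.

    delta_{k,i} = (sum_l eps_{l,i} v_{lk}) * f'(a_{k,i});
    rows of X, A, eps index tokens i.
    """
    delta = (eps @ V[:, 1:]) * f_prime(A)  # shape (n, q)
    n = X.shape[0]
    X1 = np.hstack([np.ones((n, 1)), X])   # prepend the bias input x_0 = 1
    return delta.T @ X1 / n                # shape (q, p+1), matches U
```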
Forward Propagation and Back-Propagation
$$\frac{\partial E}{\partial v_{\ell k}} = \frac{1}{n} \sum_{i=1}^{n} \epsilon_{\ell,i}\, y_{k,i}, \quad \frac{\partial E}{\partial u_{kj}} = \frac{1}{n} \sum_{i=1}^{n} \delta_{k,i}\, x_{j,i}$$