slide-1
SLIDE 1

Introduction to Data Science: Neural Networks and Deep Learning

Héctor Corrada Bravo

University of Maryland, College Park, USA

CMSC 320: 2020-05-10

slide-2
SLIDE 2

Historical Overview

Neural networks are a decades-old area of study. Initially, these computational models were created with the goal of mimicking the processing of neuronal networks.

slide-3
SLIDE 3

Historical Overview

Inspiration: model the neuron as a processing unit. Some of the mathematical functions historically used in neural network models arise from biologically plausible activation functions.

slide-4
SLIDE 4

Historical Overview

Despite somewhat limited success in modeling neuronal processing, neural network models gained traction as general Machine Learning models.

slide-5
SLIDE 5

Historical Overview

Strong results about the ability of these models to approximate arbitrary functions made them the subject of intense study in ML. In practice, however, effective training of these models was both technically and computationally difficult.

slide-6
SLIDE 6

Historical Overview

Starting from 2005, technical advances have led to a resurgence of interest in neural networks, specifically in Deep Neural Networks.

slide-7
SLIDE 7

Deep Learning

Advances in computational processing: powerful parallel processing provided by Graphics Processing Units (GPUs).

slide-8
SLIDE 8

Deep Learning

Advances in computational processing: powerful parallel processing provided by Graphics Processing Units (GPUs). Advances in neural network architecture design and network optimization.

slide-9
SLIDE 9

Deep Learning

Advances in computational processing: powerful parallel processing provided by Graphics Processing Units (GPUs). Advances in neural network architecture design and network optimization. Researchers now apply Deep Neural Networks successfully in a number of applications.

slide-10
SLIDE 10

Deep Learning

Self-driving cars make use of Deep Learning models for sensor processing.

slide-11
SLIDE 11

Deep Learning

Image recognition software uses Deep Learning to identify individuals within photos.

slide-12
SLIDE 12

Deep Learning

Deep Learning models have been applied to medical imaging to yield expert-level prognosis.

slide-13
SLIDE 13

Deep Learning

An automated Go player, making heavy use of Deep Learning, is capable of beating the best human Go players in the world.

slide-14
SLIDE 14

Neural Networks and Deep Learning

In this unit we study neural networks and recent advances in Deep Learning.

slide-15
SLIDE 15

Projection-Pursuit Regression

To motivate our discussion of Deep Neural Networks, let's turn to a simple but very powerful class of models. As in the usual regression setting, suppose that given predictors (attributes) $\{X_1, \ldots, X_p\}$ for an observation we want to predict a continuous outcome $Y$.

slide-16
SLIDE 16

Projection-Pursuit Regression

The Projection-Pursuit Regression (PPR) model predicts outcome $Y$ using function $f(X)$ as

$$f(X) = \sum_{m=1}^{M} g_m(w_m' X)$$

where $w_m$ is a $p$-dimensional weight vector, so $w_m' X = \sum_{j=1}^{p} w_{mj} x_j$ is a linear combination of predictors $x_j$, and $g_m$, $m = 1, \ldots, M$, are univariate non-linear functions (a smoothing spline, for example).
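Base R ships a PPR implementation in stats::ppr(). Below is a minimal sketch of fitting the model on simulated data; the data, term count, and variable names are illustrative assumptions, not part of the slides.

```r
# Fit a Projection-Pursuit Regression model with base R's stats::ppr().
set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
# outcome depends non-linearly on two projections of X
y <- as.numeric(sin(X %*% c(1, 1, 0, 0, 0)) +
                (X %*% c(0, 0, 1, -1, 0))^2) + rnorm(n, sd = 0.1)

df <- data.frame(y = y, X)
fit <- ppr(y ~ ., data = df, nterms = 2)  # M = 2 terms g_m(w_m'X)
summary(fit)   # estimated projection directions w_m
plot(fit)      # fitted ridge functions g_m
```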

slide-17
SLIDE 17

Projection-Pursuit Regression

Our prediction function is a linear function with $M$ terms. Each term $g_m(w_m' X)$ is the result of applying a non-linear function to what we can think of as a derived feature (or derived predictor) $V_m = w_m' X$.

slide-18
SLIDE 18

Projection-Pursuit Regression

Here's another intuition. Recall the Principal Component Analysis problem we saw in the previous unit. Given: data set $\{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the vector of $p$ variable values for the $i$-th observation. Return: matrix $[\phi_1, \phi_2, \ldots, \phi_p]$ of linear transformations that retain maximal variance.

slide-19
SLIDE 19

Projection-Pursuit Regression

Matrix $[\phi_1, \phi_2, \ldots, \phi_p]$ of linear transformations: you can think of the first vector $\phi_1$ as a linear transformation that embeds observations into 1 dimension,

$$Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p,$$

where $\phi_1$ is selected so that the resulting dataset $\{z_1, \ldots, z_n\}$ has maximum variance.
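As a quick reminder of that embedding, here is a small sketch with base R's prcomp(); the simulated data are illustrative only.

```r
# First principal component as a 1-dimensional embedding.
set.seed(1)
X <- matrix(rnorm(100 * 3), 100, 3)
pca <- prcomp(X, center = TRUE, scale. = TRUE)
phi1 <- pca$rotation[, 1]       # loadings: the linear transformation phi_1
z1 <- scale(X) %*% phi1         # z_i = phi_1' x_i, the embedded data
all.equal(as.numeric(z1), as.numeric(pca$x[, 1]))  # TRUE
```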

slide-20
SLIDE 20

Projection-Pursuit Regression

In PPR we are reducing the dimensionality of $X$ from $p$ to $M$ using linear projections,

$$f(X) = \sum_{m=1}^{M} g_m(w_m' X),$$

and building a regression function over the representation with reduced dimension.

slide-21
SLIDE 21

Projection-Pursuit Regression

Let's revisit the data from our previous unit and see how the PPR model performs. This is a time series dataset of mortgage affordability as calculated and distributed by Zillow: https://www.zillow.com/research/data/. The dataset contains affordability measurements for 76 counties with data from 1979 to 2017. Here we plot the time series of affordability for all counties.

slide-22
SLIDE 22

Projection-Pursuit Regression

We will try to predict affordability at the last time-point given in the dataset based on the time series up to one year previous to the last time point.

slide-23
SLIDE 23

Projection-Pursuit Regression


slide-24
SLIDE 24

Projection-Pursuit Regression

So, how can we fit the PPR model? As we have done previously in other regression settings, we start with a loss function to minimize,

$$L(g, W) = \sum_{i=1}^{N} \left[ y_i - \sum_{m=1}^{M} g_m(w_m' x_i) \right]^2,$$

and use an optimization method to minimize the error of the model. For simplicity, let's consider a model with $M = 1$ and drop the subscript $m$.

slide-25
SLIDE 25

Projection-Pursuit Regression

Consider the following procedure. Initialize weight vector $w$ to some value $w_{old}$. Construct the derived variable $v = w_{old}' x$. Use a non-linear regression method to fit function $g$ based on the model $E[Y \mid V] = g(v)$; you can use additive splines or loess, for example.

slide-26
SLIDE 26

Projection-Pursuit Regression

Given function $g$, now update weight vector $w_{old}$ using a gradient descent method,

$$w = w_{old} + 2\alpha \sum_{i=1}^{N} (y_i - g(v_i))\, g'(v_i)\, x_i = w_{old} + 2\alpha \sum_{i=1}^{N} r_i x_i,$$

where $\alpha$ is a learning rate.
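Putting the last two slides together, here is a minimal sketch of the alternating fit for $M = 1$: fit $g$ with loess, then take a gradient step on $w$. It is illustrative only (simulated data, fixed learning rate, finite-difference derivative), not the exact algorithm used by stats::ppr().

```r
# Alternating PPR fit for M = 1: smooth to get g, gradient step to get w.
set.seed(1)
n <- 300; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(sin(2 * (X %*% c(1, -1, 0)))) + rnorm(n, sd = 0.1)
w <- rep(1, p) / sqrt(p)                  # initial weight vector w_old
alpha <- 1e-3                             # learning rate

for (iter in 1:25) {
  v <- as.numeric(X %*% w)                # derived feature v = w'x
  fit <- loess(y ~ v)                     # fit g with a non-linear smoother
  g_v <- predict(fit, v)                  # g(v_i)
  eps <- 1e-4                             # finite-difference estimate of g'(v_i)
  g_prime <- (predict(fit, v + eps) - g_v) / eps
  r <- (y - g_v) * g_prime                # weighted residuals
  r[is.na(r)] <- 0                        # drop points outside the loess range
  w <- w + 2 * alpha * colSums(r * X)     # gradient step on w
  w <- w / sqrt(sum(w^2))                 # keep w on the unit sphere
}
w   # rotates toward the true direction (1, -1, 0)/sqrt(2), up to sign
```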

slide-27
SLIDE 27

Projection-Pursuit Regression

In the second line we rewrite the gradient in terms of the residual $\tilde{r}_i$ of the current model (using the derived feature $v$), weighted by what we can think of as the sensitivity of the model $g(v_i)$ to changes in derived feature $v_i$:

$$w = w_{old} + 2\alpha \sum_{i=1}^{N} (y_i - g(v_i))\, g'(v_i)\, x_i = w_{old} + 2\alpha \sum_{i=1}^{N} \tilde{r}_i x_i.$$

slide-28
SLIDE 28

Projection-Pursuit Regression

Given an updated weight vector $w$, we can then fit $g$ again and continue iterating until a stop condition is reached.

slide-29
SLIDE 29

Projection-Pursuit Regression

Let's consider the PPR model and this fitting technique in a bit more detail, with a few observations. We can think of the PPR model as composing three functions: the linear projection $w'x$, the result of the non-linear function $g$, and, in the case when $M > 1$, the linear combination of the $g_m$ functions.

slide-30
SLIDE 30

Projection-Pursuit Regression

To tie this to the formulation usually described in the neural network literature, we make one slight change to our understanding of derived features. Consider the case $M > 1$: the final predictor is a linear combination $\sum_{m=1}^{M} g_m(v_m)$. We could also think of each term $g_m(v_m)$ as providing a non-linear dimensionality reduction to a single derived feature.

slide-31
SLIDE 31

Projection-Pursuit Regression

This interpretation is closer to that used in the neural network literature: at each stage of the composition we apply a non-linear transform to the data, of the type $g(w'x)$.

slide-32
SLIDE 32

Projection-Pursuit Regression

The fitting procedure propagates errors (residuals) down this function composition in a stage-wise manner.

slide-33
SLIDE 33

Feed-forward Neural Networks

We can now write the general formulation for a feed-forward neural network. We will present the formulation for a general case where we are modeling $K$ outcomes $Y_1, \ldots, Y_K$ as $f_1(X), \ldots, f_K(X)$.

slide-34
SLIDE 34

Feed-forward Neural Networks

In multi-class classification, the categorical outcome may take multiple values. We consider $Y_k$ as a discriminant function for class $k$; the final classification is made using $\arg\max_k Y_k$. For regression, we can take $K = 1$.

slide-35
SLIDE 35

Feed-forward Neural Networks

A single-layer feed-forward neural network is defined as

$$h_m = g_h(w_{1m}' X), \quad m = 1, \ldots, M,$$

$$f_k = g_{fk}(w_{2k}' h), \quad k = 1, \ldots, K.$$

slide-36
SLIDE 36

Feed-forward Neural Networks

The network is organized into input, hidden, and output layers.

slide-37
SLIDE 37

Feed-forward Neural Networks

Units $h_m$ represent a hidden layer, which we can interpret as a derived non-linear representation of the input data, as we saw before.

slide-38
SLIDE 38

Feed-forward Neural Networks

Function $g_h$ is an activation function used to introduce non-linearity to the representation.

slide-39
SLIDE 39

Feed-forward Neural Networks

Historically, the sigmoid activation function was commonly used, or the hyperbolic tangent:

$$g_h(v) = \frac{1}{1 + e^{-v}}.$$

slide-40
SLIDE 40

Feed-forward Neural Networks

Nowadays, a rectified linear unit (ReLU) is used more frequently in practice (there are many extensions):

$$g_h(v) = \max\{0, v\}.$$
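These activations are one-liners in R; a quick sketch for comparison (tanh() is built in):

```r
sigmoid <- function(v) 1 / (1 + exp(-v))   # squashes into (0, 1)
relu    <- function(v) pmax(0, v)          # rectified linear unit
curve(sigmoid, -4, 4, ylab = "g(v)")
curve(relu, -4, 4, add = TRUE, lty = 2)
```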

slide-41
SLIDE 41

Feed-forward Neural Networks

The function $g_f$ used in the output layer depends on the outcome modeled. For classification, a soft-max function can be used,

$$g_{fk}(t_k) = \frac{e^{t_k}}{\sum_{l=1}^{K} e^{t_l}},$$

where $t_k = w_{2k}' h$. For regression, we may take $g_{fk}$ to be the identity function.
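With the activations in hand, the whole forward pass is a few lines of matrix arithmetic. A sketch with ReLU hidden units and a soft-max output; the weight matrices are random stand-ins and, as on the slides, bias terms are omitted:

```r
# Forward pass of a single-hidden-layer network.
softmax <- function(t) { e <- exp(t - max(t)); e / sum(e) }  # stable soft-max

set.seed(1)
p <- 4; M <- 8; K <- 3
W1 <- matrix(runif(M * p, -0.7, 0.7), M, p)  # input -> hidden weights
W2 <- matrix(runif(K * M, -0.7, 0.7), K, M)  # hidden -> output weights
x <- rnorm(p)

h <- pmax(0, W1 %*% x)   # hidden layer: h_m = g_h(w'_{1m} x), ReLU
f <- softmax(W2 %*% h)   # output layer: class probabilities, sum to 1
f
```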

slide-42
SLIDE 42

Feed-forward Neural Networks

The single-layer feed-forward neural network has the same parameterization as the PPR model, but the activation functions $g_h$ are much simpler, as opposed to, e.g., the smoothing splines used in PPR.

slide-43
SLIDE 43

Feed-forward Neural Networks

A classic result of the Neural Network literature is the universal function representational ability of the single-layer feed-forward neural network with ReLU activation functions (Leshno et al. 1993).

slide-44
SLIDE 44

Feed-forward Neural Networks

A classic result of the Neural Network literature is the universal function representational ability of the single-layer feed-forward neural network with ReLU activation functions (Leshno et al. 1993). However, the number of units in the hidden layer may need to be exponentially large to approximate arbitrary functions.

slide-45
SLIDE 45

Feed-forward Neural Networks

Empirically, a single-layer feed-forward neural network has similar performance to kernel-based methods like SVMs. This is not usually the case once more than a single layer is used in a neural network.

slide-46
SLIDE 46

Fitting with back propagation

In modern neural network literature, the graphical representation of neural nets we saw above has been extended to computational graphs.

slide-47
SLIDE 47

Fitting with back propagation

In modern neural network literature, the graphical representation of neural nets we saw above has been extended to computational graphs. These are especially useful to guide the design of general-use programming libraries for the specification of neural nets.

slide-48
SLIDE 48

Fitting with back propagation

In modern neural network literature, the graphical representation of neural nets we saw above has been extended to computational graphs. These are especially useful to guide the design of general-use programming libraries for the specification of neural nets. They have the advantage of explicitly representing all operations used in a neural network, which then permits easier specification of gradient-based algorithms.

slide-49
SLIDE 49

Fitting with back propagation


slide-50
SLIDE 50

Fitting with back propagation

Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks.

slide-51
SLIDE 51

Fitting with back propagation

Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks. These methods require that gradients are computed based on model error.

slide-52
SLIDE 52

Fitting with back propagation

Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks. These methods require that gradients are computed based on model error. The layer-wise propagation of error is at the core of these gradient computations.

slide-53
SLIDE 53

Fitting with back propagation

Gradient-based methods based on stochastic gradient descent are most frequently used to fit the parameters of neural networks. These methods require that gradients are computed based on model error. The layer-wise propagation of error is at the core of these gradient computations. This is called back-propagation.

slide-54
SLIDE 54

Fitting with back propagation

Assume we have a current estimate of model parameters, and we are processing one observation $x$ (in practice, a small batch of observations is used).

slide-55
SLIDE 55

Fitting with back propagation

First, to perform back propagation we must compute the error of the model on observation $x$ given the current set of parameters. To do this we compute all activation functions along the computation graph from the bottom up.

slide-56
SLIDE 56

Fitting with back propagation

Once we have computed output $\hat{y}$, we can compute the error (or, generally, cost) $J(y, \hat{y})$. Once we do this, we can walk back through the computation graph to obtain gradients of cost $J$ with respect to any of the model parameters by applying the chain rule.

slide-57
SLIDE 57

Fitting with back propagation

We will continuously update a gradient vector $\nabla$. First, we set $\nabla \leftarrow \nabla_{\hat{y}} J$.

slide-58
SLIDE 58

Fitting with back propagation

Next, we need the gradient $\nabla_t J$. We apply the chain rule to obtain $\nabla_t J = \nabla \odot f'(t)$, where $f'$ is the derivative of the soft-max function and $\odot$ is element-wise multiplication. Set $\nabla \leftarrow \nabla_t J$.

slide-59
SLIDE 59

Fitting with back propagation

Next, we want to compute $\nabla_{W_k} J$. We can do so using the gradient $\nabla$ we just computed, since $\nabla_{W_k} J = \nabla_t J \, \nabla_{W_k} t$. In this case, we get $\nabla_{W_k} J = \nabla h'$.

slide-60
SLIDE 60

Fitting with back propagation

At this point we have computed gradients for the weight matrix $W_k$ from the hidden layer to the output layer, which we can use to update those parameters as part of stochastic gradient descent.

slide-61
SLIDE 61

Fitting with back propagation

Once we have computed gradients for weights connecting the hidden and output layers, we can compute gradients for weights connecting the input and hidden layers.

slide-62
SLIDE 62

Fitting with back propagation

We require $\nabla_h J$, which we can compute as $W_k' \nabla$, since $\nabla$ currently has value $\nabla_t J$. At this point we can set $\nabla \leftarrow \nabla_h J$.

slide-63
SLIDE 63

Fitting with back propagation

Finally, we set $\nabla \leftarrow \nabla_z J = \nabla \cdot g'(z)$, where $g'$ is the derivative of the ReLU activation function. This gives us $\nabla_{W_h} J = \nabla x'$.

slide-64
SLIDE 64

Fitting with back propagation

At this point we have propagated the gradient of cost function $J$ to all parameters of the model. We can thus update the model for the next step of stochastic gradient descent.
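To make the walk concrete, here is a numeric sketch of one forward and backward pass for the single-hidden-layer network. It assumes a cross-entropy cost, so the soft-max and cost derivatives combine into the familiar $\hat{y} - y$ form rather than the generic element-wise form above; dimensions, data, and the learning rate are illustrative.

```r
# One step of back-propagation for a single-hidden-layer network.
set.seed(1)
p <- 4; M <- 8; K <- 3
W1 <- matrix(runif(M * p, -0.7, 0.7), M, p)  # input -> hidden
W2 <- matrix(runif(K * M, -0.7, 0.7), K, M)  # hidden -> output
x <- rnorm(p)
y <- c(1, 0, 0)                              # one-hot true class

# forward pass (bottom-up)
z <- as.numeric(W1 %*% x)                    # hidden pre-activation
h <- pmax(0, z)                              # ReLU
t_out <- as.numeric(W2 %*% h)                # output pre-activation t
yhat <- exp(t_out - max(t_out)); yhat <- yhat / sum(yhat)  # soft-max
J <- -sum(y * log(yhat))                     # cross-entropy cost

# backward pass (chain rule, top-down)
grad_t  <- yhat - y                          # dJ/dt for soft-max + cross-entropy
grad_W2 <- grad_t %o% h                      # dJ/dW_k = grad_t h'
grad_h  <- as.numeric(t(W2) %*% grad_t)      # dJ/dh = W_k' grad_t
grad_z  <- grad_h * (z > 0)                  # through ReLU: g'(z) is 0 or 1
grad_W1 <- grad_z %o% x                      # dJ/dW_h = grad_z x'

# stochastic gradient descent update
alpha <- 0.1
W2 <- W2 - alpha * grad_W2
W1 <- W1 - alpha * grad_W1
```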

slide-65
SLIDE 65

Practical Issues

Stochastic gradient descent (SGD) based on the back-propagation algorithm as shown above introduces some complications.

slide-66
SLIDE 66

Scaling

The scale of inputs $x$ effectively determines the scale of weight matrices $W$, and scale can have a large effect on how well SGD behaves. In practice, all inputs are usually standardized to have zero mean and unit variance before application of SGD.
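In R this is one call to scale(); a quick sketch on simulated inputs:

```r
X <- matrix(rnorm(100 * 3, mean = 5, sd = 10), 100, 3)
X_std <- scale(X, center = TRUE, scale = TRUE)  # zero mean, unit variance
round(colMeans(X_std), 10)   # all ~0
apply(X_std, 2, sd)          # all 1
```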

slide-67
SLIDE 67

Initialization

With properly scaled inputs, initialization of weights can be done in a somewhat reasonable manner: randomly choose initial weights in $[-0.7, 0.7]$.

slide-68
SLIDE 68

Overfitting

As with other highly-flexible models we have seen previously, feed-forward neural nets are prone to overfit data.

slide-69
SLIDE 69

Overfitting

As with other highly-flexible models we have seen previously, feed-forward neural nets are prone to overfit data. We can incorporate penalty terms to control model complexity to some degree.

slide-70
SLIDE 70

Architecture Design

A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer.

slide-71
SLIDE 71

Architecture Design

A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer. We saw above that a wide enough hidden layer is capable of perfectly fitting data.

slide-72
SLIDE 72

Architecture Design

A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer. We saw above that a wide enough hidden layer is capable of perfectly fitting data. We will also see later that in many cases making the neural network deeper instead of wider performs better.

slide-73
SLIDE 73

Architecture Design

A significant issue in the application of feed-forward neural networks is that we need to choose the number of units in the hidden layer. We saw above that a wide enough hidden layer is capable of perfectly fitting data. We will also see later that in many cases making the neural network deeper instead of wider performs better. In this case, models may have significantly fewer parameters, but tend to be much harder to fit.

slide-74
SLIDE 74

Architecture Design

Ideal network architectures are task dependent and require much experimentation, with judicious use of cross-validation methods to measure expected prediction error and guide architecture choice.

slide-75
SLIDE 75

Multiple Minima

As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem.

slide-76
SLIDE 76

Multiple Minima

As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem. This leads to the problem of multiple local minima, from which methods like SGD can suffer.

slide-77
SLIDE 77

Multiple Minima

As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem. This leads to the problem of multiple local minima, from which methods like SGD can suffer. We will see later in detail a variety of approaches used to address this problem.

slide-78
SLIDE 78

Multiple Minima

As opposed to other learning methods we have seen so far, the feed-forward neural network yields a non-convex optimization problem. This leads to the problem of multiple local minima, from which methods like SGD can suffer. We will see later in detail a variety of approaches used to address this problem. Here, we present a few rules of thumb to follow.

slide-79
SLIDE 79

Multiple Minima

The local minima a method like SGD may yield depend on the initial parameter values chosen.

slide-80
SLIDE 80

Multiple Minima

The local minima a method like SGD may yield depend on the initial parameter values chosen. One idea is to train multiple models using different initial values and make predictions using the model that gives the best expected prediction error.

slide-81
SLIDE 81

Multiple Minima

The local minima a method like SGD may yield depend on the initial parameter values chosen. One idea is to train multiple models using different initial values and make predictions using the model that gives the best expected prediction error. A related idea is to average the predictions of these multiple models.

slide-82
SLIDE 82

Multiple Minima

The local minima a method like SGD may yield depend on the initial parameter values chosen. One idea is to train multiple models using different initial values and make predictions using the model that gives the best expected prediction error. A related idea is to average the predictions of these multiple models. Finally, we can use bagging, as described in a previous session, to create an ensemble of neural networks and circumvent the local minima problem.
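A sketch of the multiple-restart idea using the nnet package (a single-hidden-layer network): fit the same model from several random initializations, keep the fit with the lowest validation error, and also form an averaged prediction. Data, sizes, and the number of restarts are illustrative assumptions.

```r
library(nnet)

set.seed(1)
n <- 400
X <- matrix(rnorm(n * 3), n, 3)
y <- as.numeric(sin(X[, 1]) + X[, 2]^2) + rnorm(n, sd = 0.1)
train <- 1:300; valid <- 301:400

fits <- lapply(1:10, function(s) {
  set.seed(s)   # different random initial weights on each restart
  nnet(X[train, ], y[train], size = 10, linout = TRUE,
       decay = 1e-3, maxit = 500, trace = FALSE)
})
val_err <- sapply(fits, function(f)
  mean((y[valid] - predict(f, X[valid, ]))^2))

best_fit <- fits[[which.min(val_err)]]                    # best restart
avg_pred <- rowMeans(sapply(fits, predict, X[valid, ]))   # averaged predictions
```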

slide-83
SLIDE 83

Summary

Neural networks are representationally powerful prediction models. They can be difficult to optimize properly due to the non-convexity of the resulting optimization problem. Deciding on network architecture is a significant challenge. We'll see later that recent proposals use deep but thinner networks effectively; even in this case, the choice of model depth is difficult. There is tremendous excitement over the recent excellent performance of deep neural networks in many applications.

slide-84
SLIDE 84

Deep Feed-Forward Neural Networks

The general form of the feed-forward network can be extended by adding additional hidden layers.

slide-85
SLIDE 85

Deep Feed-Forward Neural Networks

The same principles we saw before apply: we arrange computation using a computation graph, use stochastic gradient descent, and use back-propagation for gradient calculation along the computation graph.
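A sketch of such a deep feed-forward network in R keras; the layer sizes, optimizer, and simulated data shapes are illustrative assumptions.

```r
library(keras)

x_train <- matrix(rnorm(200 * 10), 200, 10)
y_train <- rnorm(200)

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(10)) %>%
  layer_dense(units = 64, activation = "relu") %>%  # additional hidden layers
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1)                            # regression output

model %>% compile(loss = "mse", optimizer = "sgd")
model %>% fit(x_train, y_train, epochs = 20, batch_size = 32, verbose = 0)
```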

slide-86
SLIDE 86

Deep Feed-Forward Neural Networks

Empirically, it is found that by using more, thinner layers, better expected prediction error is obtained. However, each layer introduces more non-linearity into the network, making optimization markedly more difficult.

slide-87
SLIDE 87

Deep Feed-Forward Neural Networks

We may interpret hidden layers as progressively derived representations of the input data. Since we train based on a loss function, these derived representations should make modeling the outcome of interest progressively easier.

slide-88
SLIDE 88

Deep Feed-Forward Neural Networks

In many applications, these derived representations are used for model interpretation.

slide-89
SLIDE 89

Deep Feed-Forward Neural Networks

Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections.

slide-90
SLIDE 90

Deep Feed-Forward Neural Networks

Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections. The applications we discussed previously build this type of massive deep network.

slide-91
SLIDE 91

Deep Feed-Forward Neural Networks

Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections. The applications we discussed previously build this type of massive deep network. They also require massive amounts of data to train.

slide-92
SLIDE 92

Deep Feed-Forward Neural Networks

Advanced parallel computation systems and methods are used in order to train these deep networks, with billions of connections. The applications we discussed previously build this type of massive deep network. They also require massive amounts of data to train. However, this approach can still be applicable to moderate dataset sizes with careful network design, regularization, and training.

slide-93
SLIDE 93

Supervised Pre-training

A clever idea for training deep networks: train each layer successively on the outcome of interest, then use the resulting weights as initial weights for a network with one additional layer.

slide-94
SLIDE 94

Supervised Pre-training

Train the first layer as a single-layer feed-forward network, with weights initialized as standard practice. This fits $W_h^1$.

slide-95
SLIDE 95

Supervised Pre-training

Now train a two-layer network, with weights $W_h^1$ initialized to the result of the previous fit.

slide-96
SLIDE 96

Supervised Pre-training

This procedure continues until all layers are trained. The hypothesis is that training each layer on the outcome of interest moves the weights to parts of parameter space that lead to good performance, and that minimizing updates can ameliorate the dependency problem.
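A sketch of one pre-training step in R keras: fit a single-hidden-layer network, then build a two-hidden-layer network whose first layer is initialized from that fit. Layer names, sizes, and the simulated data are illustrative assumptions.

```r
library(keras)

x_train <- matrix(rnorm(200 * 10), 200, 10)
y_train <- rnorm(200)

# Step 1: single-hidden-layer network
net1 <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(10),
              name = "hidden1") %>%
  layer_dense(units = 1)
net1 %>% compile(loss = "mse", optimizer = "sgd")
net1 %>% fit(x_train, y_train, epochs = 10, verbose = 0)

# Step 2: two-hidden-layer network, first layer initialized from net1
net2 <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(10),
              name = "hidden1") %>%
  layer_dense(units = 32, activation = "relu", name = "hidden2") %>%
  layer_dense(units = 1)
set_weights(get_layer(net2, "hidden1"),
            get_weights(get_layer(net1, "hidden1")))
net2 %>% compile(loss = "mse", optimizer = "sgd")
net2 %>% fit(x_train, y_train, epochs = 10, verbose = 0)
```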

slide-97
SLIDE 97

Supervised Pre-training

This is one strategy; others are popular and effective. For example: train each layer as a single-layer network using the hidden layer of the previous layer as inputs to the model. In this case, no long-term dependencies occur at all, but performance may suffer.

slide-98
SLIDE 98

Supervised Pre-training

This is one strategy; others are popular and effective. For example: train each layer as a single-layer network on the hidden layer of the previous layer, but also add the original input data as input to every layer of the network. There is no long-term dependency and performance improves, but the number of parameters increases.

slide-99
SLIDE 99

Parameter Sharing

Another method for reducing the number of parameters in a deep learning model: when predictors $X$ exhibit some internal structure, parts of the model can share parameters.

slide-100
SLIDE 100

Parameter Sharing

Two important applications use this idea. Image processing: local structure of nearby pixels. Sequence modeling: structure given by the sequence itself; the latter includes modeling of time series data.

slide-101
SLIDE 101

Parameter Sharing

Convolutional Networks are used in imaging applications. Input is pixel data, and parameters are shared across nearby parts of the image.

slide-102
SLIDE 102

Recurrent Networks

Recurrent Networks are used in sequence modeling applications, for instance time series and forecasting. Parameters are shared across a time lag.

slide-103
SLIDE 103

Recurrent Networks

The long short-term memory (LSTM) model is very popular in time series analysis.

slide-104
SLIDE 104

Example

Learn to add: "55+22=77" https://keras.rstudio.com/articles/examples/addition_rnn.html

slide-105
SLIDE 105

Example

Learn to add: "55+22=77" https://keras.rstudio.com/articles/examples/addition_rnn.html

Addition encoded as a sequence of one-hot vectors:

##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 5    0    0    0    0    0    0    0    1    0     0
## 5    0    0    0    0    0    0    0    1    0     0
## +    0    1    0    0    0    0    0    0    0     0
## 2    0    0    0    0    1    0    0    0    0     0
## 2    0    0    0    0    1    0    0    0    0     0
##   [,11] [,12]
## 5
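A sketch of this encoding: each character of "55+22" (padded) becomes a row over a 12-character vocabulary. The vocabulary ordering here is an assumption consistent with the output above (space, '+', then digits 0-9).

```r
vocab <- c(" ", "+", as.character(0:9))

one_hot <- function(s, width) {
  s <- formatC(s, width = width)          # pad with spaces to a fixed length
  chars <- strsplit(s, "")[[1]]
  m <- matrix(0L, nrow = width, ncol = length(vocab),
              dimnames = list(chars, NULL))
  m[cbind(seq_len(width), match(chars, vocab))] <- 1L  # one 1 per time step
  m
}
one_hot("55+22", width = 7)   # 7 time steps x 12 vocabulary columns
```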

slide-106
SLIDE 106

Example

Learn to add: "55+22=77" https://keras.rstudio.com/articles/examples/addition_rnn.html

Result encoded as a sequence of one-hot vectors:

##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 7    0    0    0    0    0    0    0    0    0     1
## 7    0    0    0    0    0    0    0    0    0     1
##   [,11] [,12]
## 7     0     0
## 7     0     0

slide-107
SLIDE 107

Example

Learn to add: "55+22=77" https://keras.rstudio.com/articles/examples/addition_rnn.html

Result encoded as a sequence of one-hot vectors:

##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## 7    0    0    0    0    0    0    0    0    0     1
## 7    0    0    0    0    0    0    0    0    0     1
##   [,11] [,12]
## 7     0     0
## 7     0     0

This is a sequence-to-sequence model: a perfect application for a recurrent network.

slide-108
SLIDE 108

Example

## ______________________________________________________
## Layer (type)              Output Shape         Param #
## ======================================================
## lstm_10 (LSTM)            (None, 128)          72192
## ______________________________________________________
## repeat_vector_5 (Repeat   (None, 3, 128)       0
## ______________________________________________________
## lstm_11 (LSTM)            (None, 3, 128)       131584
## ______________________________________________________
## time_distributed_5 (Tim   (None, 3, 12)        1548
## ______________________________________________________
## activation_5 (Activatio   (None, 3, 12)        0
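For reference, here is a sketch in R keras of the model behind this summary. The input shape (7 time steps over a 12-character vocabulary) and output length (3 digits) are inferred from the example and should be treated as assumptions.

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_lstm(units = 128, input_shape = c(7, 12)) %>%   # encode "  55+22"
  layer_repeat_vector(3) %>%            # repeat encoding once per output digit
  layer_lstm(units = 128, return_sequences = TRUE) %>%  # decode
  time_distributed(layer_dense(units = 12)) %>%  # per-step vocabulary scores
  layer_activation("softmax")

model %>% compile(loss = "categorical_crossentropy", optimizer = "adam",
                  metrics = "accuracy")
summary(model)   # parameter counts should match the summary shown above
```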

slide-109
SLIDE 109

Summary

Deep Learning is riding a big wave of popularity. State-of-the-art results in many applications. Best results in applications with massive amounts of data. However, newer methods allow use in other situations.

slide-110
SLIDE 110

Summary

Many recent advances stem from computational and technical approaches to modeling. Keeping track of these advances is hard, and many of them are ad hoc. It is not straightforward to determine a priori how these technical advances may help in a specific application; they require a significant amount of experimentation.

slide-111
SLIDE 111

Summary

The interpretation of hidden units as representations can lead to insight. There is current research on interpreting these to support some notion of statistical inference. Excellent textbook: http://deeplearningbook.org