Neural Networks: Neural Net Basics

Neural Networks: Neural Net Basics. Dan Klein, John DeNero, UC Berkeley. PowerPoint PPT Presentation.



  1. Neural Networks: Neural Net Basics
     Dan Klein, John DeNero, UC Berkeley. Slides adapted from Greg Durrett.

     Neural Networks
     ‣ Linear classification: argmax_y w^T f(x, y) ... possible because we transformed the space! (Figure contrasting a linear classifier with a neural network, taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/)
     ‣ Want to learn intermediate conjunctive features of the input: for "the movie was not all that good", the feature I[contains "not" & contains "good"]
     ‣ How do we learn this if our feature vector is just the unigram indicators I[contains "not"], I[contains "good"]? (A small sketch follows this slide.)
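     To make the conjunctive-feature bullet concrete, here is a minimal sketch; the sentence is the slide's example, but the feature names and code structure are illustrative, not from the slides:

       # Illustrative sketch: unigram indicators vs. a conjunctive feature.
       sentence = "the movie was not all that good"
       tokens = set(sentence.split())

       # Unigram indicator features the linear classifier actually sees:
       f_unigram = {
           "contains_not": int("not" in tokens),
           "contains_good": int("good" in tokens),
       }

       # The conjunctive feature we would like: I[contains "not" & contains "good"].
       # A linear model over the unigram indicators alone cannot express this AND;
       # a hidden layer can learn it from those same inputs.
       f_conjunction = int("not" in tokens and "good" in tokens)

       print(f_unigram, f_conjunction)  # {'contains_not': 1, 'contains_good': 1} 1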

  2. Logistic Regression with NNs
     ‣ Single scalar probability: P(y | x) = exp(w^T f(x, y)) / Σ_{y'} exp(w^T f(x, y'))
     ‣ Compute scores for all possible labels at once (returns a vector): [w^T f(x, y)]_{y ∈ Y}
     ‣ softmax: exponentiates and normalizes a given vector: softmax(p)_i = exp(p_i) / Σ_{i'} exp(p_{i'})
     ‣ Weight vector per class: P(y | x) = softmax(W f(x)), where W is a [num classes x num feats] matrix
     ‣ Now one hidden layer: P(y | x) = softmax(W g(V f(x)))

     Neural Networks for Classification
     ‣ P(y | x) = softmax(W g(V f(x))): f(x) is a vector of n features, V is a d x n matrix, g is a nonlinearity (tanh, relu, ...) giving d activations of "hidden" units, W is a num_classes x d matrix, and softmax turns the resulting scores into probs.

     Objective Function
     ‣ P(y | x) = softmax(W z), with z = g(V f(x))
     ‣ Maximize the log likelihood of the training data: L(x, i*) = log P(y = i* | x) = log(softmax(W z) · e_{i*})
     ‣ i*: index of the gold label
     ‣ e_i: 1 in the i-th row, zero elsewhere. This dot product selects the i*-th index.
     ‣ L(x, i*) = W z · e_{i*} − log Σ_j exp(W z) · e_j

     Training Procedure (a sketch of this model and loop follows this slide)
     • Initialize parameters
     • For each epoch (one pass through all the training examples):
       • Shuffle the examples
       • Group them into mini-batches
       • For each mini-batch (these days often just called a "batch"):
         • Compute the loss over the mini-batch
         • Compute the gradient of the loss w.r.t. the parameters
         • Update parameters according to a gradient-based optimizer
       • Evaluate the current network on a held-out validation set
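     As a rough illustration of P(y | x) = softmax(W g(V f(x))) and the training procedure above, here is a minimal PyTorch sketch; the dimensions, random data, and SGD hyperparameters are assumptions made for the example, not values from the slides:

       import torch
       import torch.nn as nn

       n, d, num_classes = 100, 32, 2              # num feats, hidden size, num labels

       model = nn.Sequential(
           nn.Linear(n, d),                        # V: d x n
           nn.Tanh(),                              # g: nonlinearity (tanh, relu, ...)
           nn.Linear(d, num_classes),              # W: num_classes x d
       )
       # CrossEntropyLoss applies log-softmax internally, so it computes the
       # negative of L(x, i*) above, averaged over the batch.
       loss_fn = nn.CrossEntropyLoss()
       optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

       X = torch.randn(256, n)                     # fake feature vectors f(x)
       y = torch.randint(0, num_classes, (256,))   # gold label indices i*

       for epoch in range(5):                      # one pass through the training examples
           perm = torch.randperm(len(X))           # shuffle the examples
           for start in range(0, len(X), 32):      # group into mini-batches of 32
               idx = perm[start:start + 32]
               loss = loss_fn(model(X[idx]), y[idx])   # loss over the mini-batch
               optimizer.zero_grad()
               loss.backward()                     # gradient of the loss w.r.t. the parameters
               optimizer.step()                    # gradient-based parameter update
           # (evaluate on a held-out validation set here)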

  3. Training Tips

     Batching
     ‣ Batching data gives speedups due to more efficient matrix operations
     ‣ Need to process a batch at a time:

       # input is [batch_size, num_feats]
       # gold_label is [batch_size, num_classes] (one-hot rows)
       def make_update(input, gold_label):
           ...
           probs = ffnn.forward(input)  # [batch_size, num_classes]
           loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)  # elementwise product selects the gold-label log-probs
           ...

     ‣ A batch size of 32 is typical, but the best choice is model- and application-dependent

     Initialization
     ‣ Nonlinear model... how does this affect things?
        1) Can't use zeroes for the parameters that generate the hidden layer: all values in that hidden layer are always 0 and have zero gradients.
        2) Initialize too large and the cells are saturated.
     ‣ If cell activations are large in absolute value, gradients are small. ReLU: zero gradient when the activation is negative.
     ‣ A common approach is random uniform/normal initialization with an appropriate scale (small is typically good)
     ‣ Want the variance of the inputs and gradients for each layer to be similar
     ‣ Xavier Glorot (2010) uniform initializer: U[ −sqrt(6 / (fan-in + fan-out)), +sqrt(6 / (fan-in + fan-out)) ] (a sketch follows this slide)
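     A small PyTorch sketch of the Glorot/Xavier uniform initializer described above; the layer sizes are arbitrary, and nn.init.xavier_uniform_ is shown as the built-in counterpart of the hand-written bound:

       import math
       import torch
       import torch.nn as nn

       layer = nn.Linear(100, 32)   # fan-in = 100, fan-out = 32 (arbitrary sizes)

       # Built-in Glorot/Xavier uniform initialization:
       nn.init.xavier_uniform_(layer.weight)
       nn.init.zeros_(layer.bias)

       # The same bound written out: U[-sqrt(6/(fan-in+fan-out)), +sqrt(6/(fan-in+fan-out))]
       bound = math.sqrt(6.0 / (layer.in_features + layer.out_features))
       with torch.no_grad():
           layer.weight.uniform_(-bound, bound)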

  4. Dropout
     ‣ Probabilistically zero out some activations during training to prevent overfitting, but use the whole network at test time
     ‣ A form of stochastic regularization
     ‣ Similar to the benefits of ensembling: the network needs to be robust to missing signals, so it has redundancy
     ‣ One line in PyTorch/TensorFlow
     Srivastava et al. (2014)

     Optimizer
     ‣ Adam (Kingma and Ba, ICLR 2015): very widely used. Adaptive step size + momentum
     ‣ Wilson et al. NIPS 2017: adaptive methods can actually perform badly at test time (in the referenced plot, Adam is in pink, SGD in black)
     ‣ One more trick: gradient clipping (set a max value for your gradients)

     Symbol Embeddings
     ‣ Words and characters are discrete symbols, but the input to a neural network must be real-valued
     ‣ Different symbols in language do have common characteristics that correlate with their distributional properties
     ‣ An "embedding" for a symbol: a learned low-dimensional vector (e.g., dim=2, dim=32, dim=128)
     ‣ Intuition: a low-rank approximation to a co-occurrence matrix
     (A sketch combining these pieces follows this slide.)
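     To show how the pieces on this slide fit together, here is a minimal PyTorch sketch with dropout, the Adam optimizer, gradient value clipping, and an embedding layer; the vocabulary size, window length, dimensions, and random data are assumptions for the example, not from the slides:

       import torch
       import torch.nn as nn

       vocab_size, emb_dim, hidden, num_classes = 10000, 128, 64, 2
       window = 5   # assumed fixed-length window of symbol ids per example

       model = nn.Sequential(
           nn.Embedding(vocab_size, emb_dim),   # learned low-dimensional vector per symbol
           nn.Flatten(),                        # concatenate the window of embeddings
           nn.Linear(window * emb_dim, hidden),
           nn.ReLU(),
           nn.Dropout(p=0.5),                   # dropout: one line; model.eval() disables it at test time
           nn.Linear(hidden, num_classes),
       )
       optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adaptive step size + momentum

       x = torch.randint(0, vocab_size, (32, window))   # a batch of symbol-id windows
       y = torch.randint(0, num_classes, (32,))
       loss = nn.CrossEntropyLoss()(model(x), y)
       loss.backward()
       torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=5.0)  # set a max value for the gradients
       optimizer.step()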
