
Neural Networks

Dan Klein and John DeNero, UC Berkeley. Slides adapted from Greg Durrett.

Neural Net Basics

Neural Networks

  • Linear classification: argmax_y w^T f(x, y)
  • Want to learn intermediate conjunctive features of the input:
    "the movie was not all that good"  →  I[contains not & contains good]
  • How do we learn this if our feature vector is just the unigram indicators I[contains not], I[contains good]?

Neural Networks

[Figure: "Linear classifier" vs. "Neural network", taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/]

…possible because we transformed the space!


Logistic Regression with NNs

  • Single scalar probability:
    P(y|x) = exp(w^T f(x, y)) / Σ_{y'} exp(w^T f(x, y'))
  • Compute scores for all possible labels at once (returns a vector):
    P(y|x) = softmax([w^T f(x, y)]_{y ∈ Y})
  • softmax exps and normalizes a given vector:
    softmax(p)_i = exp(p_i) / Σ_{i'} exp(p_{i'})
  • Weight vector per class; W is [num_classes x num_feats]:
    P(y|x) = softmax(W f(x))
  • Now one hidden layer:
    P(y|x) = softmax(W g(V f(x)))
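A minimal sketch of softmax itself (not from the slides; the max is subtracted first so exp cannot overflow):

    import numpy as np

    def softmax(p):
        # exponentiate and normalize so the entries sum to 1
        exps = np.exp(p - np.max(p))   # subtract the max for numerical stability
        return exps / np.sum(exps)

    scores = np.array([2.0, 1.0, -1.0])   # one score per label
    print(softmax(scores))                # probabilities over the labels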

Neural Networks for Classification

P(y|x) = softmax(W g(V f(x)))

[Diagram of the network:]
  f(x): n features
  V: d x n matrix
  g: nonlinearity (tanh, relu, …)
  z = g(V f(x)): d activations of "hidden" units
  W: num_classes x d matrix
  softmax over the num_classes scores gives P(y|x), a vector of num_classes probs
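A minimal PyTorch sketch of this architecture (the class name, layer sizes, and choice of tanh are illustrative, not the course's reference code; nn.Linear also adds bias terms that the slide's equation omits):

    import torch
    import torch.nn as nn

    class FFNN(nn.Module):
        def __init__(self, num_feats, hidden_size, num_classes):
            super().__init__()
            self.V = nn.Linear(num_feats, hidden_size)    # d x n matrix
            self.g = nn.Tanh()                            # nonlinearity
            self.W = nn.Linear(hidden_size, num_classes)  # num_classes x d matrix

        def forward(self, x):
            z = self.g(self.V(x))                         # hidden activations z = g(V f(x))
            return torch.softmax(self.W(z), dim=-1)       # P(y|x), num_classes probs

    ffnn = FFNN(num_feats=1000, hidden_size=100, num_classes=2)
    probs = ffnn(torch.randn(1000))                       # P(y|x) for one feature vector f(x)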

Objective Function

  • Maximize log likelihood of training data observations
  • i*: index of the gold label
  • e_i: 1 in the ith row, zero elsewhere. This dot product selects the i*th index

z = g(V f(x))
P(y|x) = softmax(Wz)

L(x, i*) = log P(y = i* | x) = log (softmax(Wz) · e_{i*})
         = Wz · e_{i*} − log Σ_j exp(Wz) · e_j
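The same quantity for a single example, sketched in PyTorch (shapes and values are illustrative; in practice torch.nn.CrossEntropyLoss fuses the softmax and the log):

    import torch

    z = torch.randn(4)         # z = g(V f(x)): d = 4 hidden activations
    W = torch.randn(3, 4)      # num_classes x d
    gold = 1                   # i*: index of the gold label

    scores = W @ z                                 # Wz: one score per class
    log_probs = torch.log_softmax(scores, dim=-1)  # log softmax(Wz)
    loss = -log_probs[gold]    # negative log likelihood; minimizing it maximizes L(x, i*)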

Training Procedure

  • Initialize parameters
  • For each epoch (one pass through all the training examples):
    • Shuffle the examples
    • Group them into mini-batches
    • For each mini-batch (these days often just called a "batch"):
      • Compute the loss over the mini-batch
      • Compute the gradient of the loss w.r.t. the parameters
      • Update parameters according to a gradient-based optimizer
    • Evaluate the current network on a held-out validation set
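A minimal sketch of this loop in PyTorch; model, train_examples, dev_examples, batch_size, num_epochs, compute_loss, and evaluate are all assumed names used only for illustration:

    import random
    import torch

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # model: assumed nn.Module

    for epoch in range(num_epochs):
        random.shuffle(train_examples)                  # shuffle the examples
        batches = [train_examples[i:i + batch_size]     # group them into mini-batches
                   for i in range(0, len(train_examples), batch_size)]
        for batch in batches:
            optimizer.zero_grad()
            loss = compute_loss(model, batch)           # assumed helper: loss over the mini-batch
            loss.backward()                             # gradient of the loss w.r.t. the parameters
            optimizer.step()                            # gradient-based parameter update
        evaluate(model, dev_examples)                   # assumed helper: held-out validation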

Training Tips

Batching

  • Batching data gives speedups due to more efficient matrix operations
  • Need to process a batch at a time:

    def make_update(input, gold_label):
        # input is [batch_size, num_feats]
        # gold_label is [batch_size, num_classes] (one-hot rows)
        ...
        probs = ffnn.forward(input)  # [batch_size, num_classes]
        # summed negative log likelihood of the gold labels over the batch
        # (elementwise product with the one-hot rows replaces the 1-D-only .dot)
        loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)
        ...

  • A batch size of 32 is typical, but the best choice is model- and application-dependent

  • Nonlinear model…how does this affect things?
  • If cell activations are large in absolute value, gradients are small.
  • ReLU: Zero gradient when activation is negative.

Initialization

1) Can't use zeroes for parameters to generate hidden layers: all values in that hidden layer are always 0 and have zero gradients.
2) Initialize too large and cells are saturated.

  • A common approach is random uniform/normal initialization with appropriate scale (small is typically good)
  • Xavier Glorot (2010) uniform initializer:
    U[ −√(6 / (fan-in + fan-out)), +√(6 / (fan-in + fan-out)) ]
  • Want variance of inputs and gradients for each layer to be similar
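This initializer is built into PyTorch; a minimal sketch (the layer sizes are illustrative):

    import torch.nn as nn

    layer = nn.Linear(1000, 100)             # e.g. V, mapping n = 1000 features to d = 100 hidden units
    nn.init.xavier_uniform_(layer.weight)    # U[-sqrt(6/(fan-in+fan-out)), +sqrt(6/(fan-in+fan-out))]
    nn.init.zeros_(layer.bias)               # biases, unlike weights, can safely start at zero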


Dropout

  • Probabilistically zero out some activations during training to prevent overfitting, but use the whole network at test time
  • Similar to the benefits of ensembling: the network needs to be robust to missing signals, so it has redundancy
  • Form of stochastic regularization
  • One line in PyTorch/TensorFlow

Srivastava et al. (2014)
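That one line, sketched in PyTorch (the drop probability of 0.5 is illustrative):

    import torch
    import torch.nn as nn

    drop = nn.Dropout(p=0.5)     # each activation is zeroed with probability 0.5 during training
    z = torch.randn(8, 100)      # a batch of hidden activations
    z_train = drop(z)            # survivors are rescaled by 1/(1-p); drop.eval() disables dropout at test time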

Optimizer

  • Adam (Kingma and Ba, ICLR 2015): very widely used; adaptive step size + momentum
  • Wilson et al. (NIPS 2017): adaptive methods can actually perform badly at test time (in the figure, Adam is in pink, SGD in black)
  • One more trick: gradient clipping (set a max value for your gradients)
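A sketch of both in PyTorch (the model, loss, learning rate, and clipping threshold are stand-ins, not from the slides):

    import torch

    model = torch.nn.Linear(10, 2)                              # stand-in model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # adaptive step size + momentum

    loss = model(torch.randn(4, 10)).pow(2).mean()              # stand-in loss, just to produce gradients
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # cap the overall gradient norm
    # (torch.nn.utils.clip_grad_value_ instead caps each individual gradient entry)
    optimizer.step()
    optimizer.zero_grad()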

Embeddings

Symbol Embeddings

  • Words and characters are discrete symbols, but input to a neural network must be real-valued
  • Different symbols in language do have common characteristics that correlate with their distributional properties
  • An "embedding" for a symbol: a learned low-dimensional vector
  • Intuition: low-rank approximation to a co-occurrence matrix

[Figure: embeddings at dim=128, dim=32, dim=2]
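A minimal PyTorch sketch of a learned embedding table (the vocabulary size and dimension are illustrative):

    import torch
    import torch.nn as nn

    vocab_size, dim = 10000, 128
    embed = nn.Embedding(vocab_size, dim)   # one learned dim-dimensional vector per symbol
    word_ids = torch.tensor([4, 17, 4])     # discrete symbols as integer indices
    vectors = embed(word_ids)               # real-valued input to the network, shape [3, 128]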