Feedforward neural nets

CSE 250B
Outline
1. Architecture
2. Expressivity
3. Learning
The architecture
x → h(1) → h(2) → · · · → h(ℓ) → y
The value at a hidden unit

[Figure: a single hidden unit h with parent nodes z1, z2, . . . , zm.]

How is h computed from z1, . . . , zm?
- h = σ(w1 z1 + w2 z2 + · · · + wm zm + b)
- σ(·) is a nonlinear activation function, e.g. “rectified linear”:

  σ(u) = u if u ≥ 0, and 0 otherwise
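As a concrete illustration (a minimal sketch added here, not part of the original slides), this is the computation at one hidden unit with a ReLU activation; the numbers are made up:

import numpy as np

def relu(u):
    return np.maximum(0.0, u)

# Hypothetical values for one hidden unit with m = 3 parents.
z = np.array([0.5, -1.0, 2.0])   # parent values z1, ..., zm
w = np.array([0.2, 0.4, -0.1])   # weights w1, ..., wm
b = 0.05                         # bias

h = relu(w @ z + b)              # h = sigma(w1 z1 + ... + wm zm + b)
print(h)                         # 0.0 here, since the pre-activation is negative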
Common activation functions
- Threshold function or Heaviside step function:
  σ(z) = 1 if z ≥ 0, and 0 otherwise
- Sigmoid:
  σ(z) = 1 / (1 + e^(−z))
- Hyperbolic tangent:
  σ(z) = tanh(z)
- ReLU (rectified linear unit):
  σ(z) = max(0, z)
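For reference, a small NumPy sketch of these four activations (added for illustration; not from the slides):

import numpy as np

def threshold(z):
    return np.where(z >= 0, 1.0, 0.0)   # Heaviside step

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.linspace(-3.0, 3.0, 7)
for f in (threshold, sigmoid, np.tanh, relu):
    print(f.__name__, f(z))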
Why do we need nonlinear activation functions?

x → h(1) → h(2) → · · · → h(ℓ) → y

(Without a nonlinearity, each layer would be a linear map, and a composition of linear maps is itself linear: the whole net would collapse to a single linear function of x.)
The output layer

Classification with k labels: want k probabilities summing to 1.

[Figure: output nodes y1, y2, . . . , yk, each connected to the last hidden layer z1, . . . , zm.]

- y1, . . . , yk are linear functions of the parent nodes zi.
- Get probabilities using softmax:

  Pr(label j) = e^(yj) / (e^(y1) + · · · + e^(yk))
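A minimal sketch of this softmax; subtracting max(y) before exponentiating is a standard numerical-stability detail added here, and it leaves the result unchanged:

import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))   # shift so exp() cannot overflow
    return e / e.sum()

y = np.array([2.0, 1.0, -1.0])  # linear outputs y1, ..., yk
p = softmax(y)
print(p, p.sum())               # k probabilities summing to 1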
The complexity
x → h(1) → h(2) → · · · → h(ℓ) → y
Outline
1. Architecture
2. Expressivity
3. Learning
Approximation capability

Let f : R^d → R be any continuous function. There is a neural net with a single hidden layer that approximates f arbitrarily well.

- The hidden layer may need a lot of nodes.
- For certain classes of functions, the choice is between:
  - either one hidden layer of enormous size,
  - or multiple hidden layers of moderate size.
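To make the single-hidden-layer claim concrete in one dimension, here is a sketch added for illustration (not from the slides): piecewise-linear interpolation of f at m + 1 knots is exactly a one-hidden-layer ReLU net, and its sup-norm error shrinks as m grows.

import numpy as np

def relu_interpolant(f, a, b, m):
    """One-hidden-layer ReLU net interpolating f at m + 1 equally spaced knots."""
    t = np.linspace(a, b, m + 1)                        # knot locations
    y = f(t)
    slopes = np.diff(y) / np.diff(t)                    # slope on each interval
    c = np.concatenate(([slopes[0]], np.diff(slopes)))  # output-layer weights
    def g(x):
        x = np.asarray(x, dtype=float)
        # Hidden layer: one ReLU unit per interval; output: linear combination.
        return y[0] + np.maximum(0.0, x[..., None] - t[:-1]) @ c
    return g

g = relu_interpolant(np.sin, 0.0, 2 * np.pi, m=50)
xs = np.linspace(0.0, 2 * np.pi, 1000)
print(np.max(np.abs(np.sin(xs) - g(xs))))  # sup-norm error; shrinks for larger m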
Stone-Weierstrass theorem I
If f : [a, b] → R is continuous, then there is a sequence of polynomials Pn, with Pn of degree n, such that

  sup_{x ∈ [a,b]} |Pn(x) − f(x)| → 0 as n → ∞.
Stone-Weierstrass theorem II
Let K ⊂ R^d be some bounded set. Suppose there is a collection of functions A such that:

- A is an algebra: closed under addition, scalar multiplication, and multiplication.
- A does not vanish on K: for any x ∈ K, there is some h ∈ A with h(x) ≠ 0.
- A separates points in K: for any x ≠ y in K, there is some h ∈ A with h(x) ≠ h(y).

Then for any continuous function f : K → R and any ε > 0, there is some h ∈ A with

  sup_{x ∈ K} |f(x) − h(x)| ≤ ε.
Example: exponentiated linear functions
For domain K = R^d, let A be all linear combinations of {e^(w·x+b) : w ∈ R^d, b ∈ R}.

1. A is an algebra.
2. A does not vanish.
3. A separates points.
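A quick verification of the three conditions (worked out here; the slides only assert them):

1. Algebra: sums and scalar multiples are immediate, and a product of two generators is again a generator, since e^(w·x+b) · e^(w′·x+b′) = e^((w+w′)·x + (b+b′)).
2. Does not vanish: e^(w·x+b) > 0 for every x.
3. Separates points: if x ≠ y, take w = x − y and b = 0; then w·x − w·y = ‖x − y‖² > 0, so e^(w·x) ≠ e^(w·y).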
Variation: RBF kernels
For domain K = R^d and any σ > 0, let A be all linear combinations of {e^(−‖x − u‖² / σ²) : u ∈ R^d}. Any continuous function is approximated arbitrarily well by A.
A class of activation functions
For domain K = R^d, let A be all linear combinations of {σ(w · x + b) : w ∈ R^d, b ∈ R}, where σ : R → R is continuous and non-decreasing with

  σ(z) → 1 as z → ∞ and σ(z) → 0 as z → −∞.

This also satisfies the conditions of the approximation result.
Outline
1. Architecture
2. Expressivity
3. Learning
Learning a net: the loss function

Classification problem with k labels.

- Parameters of entire net: W
- For any input x, the net computes probabilities of labels: Pr_W(label = j | x)
- Given a data set (x^(1), y^(1)), . . . , (x^(n), y^(n)), the loss function is

  L(W) = − Σ_{i=1}^{n} ln Pr_W(y^(i) | x^(i))

  (also called cross-entropy).
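A minimal sketch of this loss for a batch of hard labels (the arrays are hypothetical, added here for illustration):

import numpy as np

def cross_entropy(probs, labels):
    # probs:  (n, k) array; row i holds the net's probabilities for x^(i)
    # labels: (n,) integer array; labels[i] = y^(i) in {0, ..., k-1}
    n = probs.shape[0]
    return -np.sum(np.log(probs[np.arange(n), labels]))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))  # -(ln 0.7 + ln 0.8)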
Nature of the loss function
[Figure: two plots of the loss L(w) as a function of a parameter w.]
Variants of gradient descent
Initialize W and then repeatedly update.
1. Gradient descent: each update involves the entire training set.
2. Stochastic gradient descent: each update involves a single data point.
3. Mini-batch stochastic gradient descent: each update involves a modest, fixed number of data points.
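A schematic of the mini-batch variant (a sketch under assumed names: grad_loss, X, y, and all defaults are hypothetical, not from the slides):

import numpy as np

def minibatch_sgd(W, X, y, grad_loss, eta=0.1, batch_size=32, epochs=10):
    """Repeatedly update W using gradients computed on small random batches."""
    n = X.shape[0]
    for _ in range(epochs):
        order = np.random.permutation(n)        # fresh shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            W = W - eta * grad_loss(W, X[idx], y[idx])
    return W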
Derivative of the loss function
Update for a specific parameter: derivative of loss function wrt that parameter.
x → h(1) → h(2) → · · · → h(ℓ) → y
Chain rule

1. Suppose h(x) = g(f(x)), where x ∈ R and f, g : R → R. Then:

  h′(x) = g′(f(x)) f′(x)

2. Suppose z is a function of y, which is a function of x:

  x → y → z

  Then: dz/dx = (dz/dy)(dy/dx)
A single chain of nodes

A neural net with one node per hidden layer:

x = h0 → h1 → h2 → h3 → · · · → hℓ

For a specific input x,

- hi = σ(wi hi−1 + bi)
- the loss L can be gleaned from hℓ

To compute dL/dwi we just need dL/dhi:

  dL/dwi = (dL/dhi)(dhi/dwi) = (dL/dhi) σ′(wi hi−1 + bi) hi−1
Backpropagation

- On a single forward pass, compute all the hi.
- On a single backward pass, compute dL/dhℓ, . . . , dL/dh1.

x = h0 → h1 → h2 → h3 → · · · → hℓ

From hi+1 = σ(wi+1 hi + bi+1), we have

  dL/dhi = (dL/dhi+1)(dhi+1/dhi) = (dL/dhi+1) σ′(wi+1 hi + bi+1) wi+1
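A NumPy sketch of these two passes for the chain net, with σ taken to be the sigmoid and a squared-error loss at the end; the loss choice and all numbers are assumptions for illustration:

import numpy as np

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigma_prime(u):
    s = sigma(u)
    return s * (1.0 - s)

def chain_backprop(x, w, b, target):
    ell = len(w)
    # Forward pass: h[0] = x, then h_i = sigma(w_i h_{i-1} + b_i).
    h, pre = [x], []
    for i in range(ell):
        pre.append(w[i] * h[-1] + b[i])
        h.append(sigma(pre[-1]))
    # Assumed loss L = (h_ell - target)^2 / 2, so dL/dh_ell = h_ell - target.
    dL_dh = h[ell] - target
    dL_dw = [0.0] * ell
    # Backward pass, applying the two identities from the slides.
    for i in reversed(range(ell)):
        dL_dw[i] = dL_dh * sigma_prime(pre[i]) * h[i]  # dL/dw_i = dL/dh_i * sigma' * h_{i-1}
        dL_dh = dL_dh * sigma_prime(pre[i]) * w[i]     # dL/dh_{i-1} from dL/dh_i
    return dL_dw

print(chain_backprop(x=0.3, w=[1.0, -2.0, 0.5], b=[0.0, 0.1, -0.1], target=1.0))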
Two-dimensional examples

What kind of net to use for this data?
- Input layer: 2 nodes
- One hidden layer: H nodes
- Output layer: 1 node
- Input → hidden: linear functions, ReLU activation
- Hidden → output: linear function, sigmoid activation
Example 1

How many hidden units should we use?

[Figures: the data and the fit learned with H = 2.]

Example 2

How many hidden units should we use?

[Figures: fits with H = 4; with H = 8 the net is overparametrized.]

Example 3

How many hidden units should we use?

[Figures: fits with H = 4, 8, 16, 32, and 64.]
PyTorch snippet

Declaring and initializing the network:

import torch

d, H = 2, 8
model = torch.nn.Sequential(
    torch.nn.Linear(d, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, 1),
    torch.nn.Sigmoid())
lossfn = torch.nn.BCELoss()

A gradient step:

ypred = model(x)
loss = lossfn(ypred, y)
model.zero_grad()
loss.backward()
with torch.no_grad():
    for param in model.parameters():
        param -= eta * param.grad
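To run the snippet end to end it needs data and a loop; here is a hypothetical completion, continuing from the code above (the toy data, eta, and the step count are assumptions, not from the slides):

eta = 0.1
x = torch.randn(200, d)                           # 200 random 2-d points
y = (x.sum(dim=1, keepdim=True) > 0).float()      # label 1 iff x1 + x2 > 0

for step in range(500):                           # repeat the gradient step
    ypred = model(x)
    loss = lossfn(ypred, y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            param -= eta * param.grad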