CS 6355: Structured Prediction
Neural Networks
Based on slides and material from Geoffrey Hinton, Richard Socher, Dan Roth, Yoav Goldberg, Shai Shalev-Shwartz and Shai Ben-David, and others
This lecture:
– What is a neural network?
– Training neural networks
Prediction: features, a dot product, then a threshold
  sgn(w^T x + b) = sgn(Σ_i w_i x_i + b)
Learning: various algorithms (perceptron, SVM, logistic regression, …); in general, minimize a loss.
But where do these input features come from? What if the features were the outputs of another classifier?
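A minimal sketch of this prediction rule in NumPy; the weight, bias, and feature values below are made up purely for illustration:

    import numpy as np

    def predict(w, b, x):
        # dot product followed by a threshold: sgn(w^T x + b)
        return 1 if np.dot(w, x) + b >= 0 else -1

    w = np.array([0.5, -1.0, 2.0])   # made-up weight vector
    b = 0.1                          # made-up bias
    x = np.array([1.0, 0.0, 1.0])    # made-up feature vector
    print(predict(w, b, x))          # prints 1, since 0.5 + 2.0 + 0.1 > 0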
Each of these connections has its own weight as well.
This is a two-layer feed-forward neural network.
The input layer, the hidden layer, and the output layer. This is a two-layer feed-forward neural network. Think of the hidden layer as learning a good representation of the inputs.
The dot product followed by the threshold constitutes a neuron. There are five neurons in this picture (four in the hidden layer and one output). This is a two-layer feed-forward neural network.
What if the inputs to the input layer were themselves the outputs of a classifier? We can make a three-layer network… and so on.
The first drawing of brain cells, by Santiago Ramón y Cajal in 1899. Neurons: core components of the brain and the nervous system, consisting of dendrites, a cell body, and an axon.
Modern artificial neurons are "inspired" by biological neurons. But there are many, many fundamental differences. Don't take the similarity seriously (the same goes for claims in the news about the "emergence" of intelligent behavior).
Functions that very loosely mimic a biological neuron
output = activation(w^T x + b)
A dot product, followed by a threshold activation. Other activations are possible:
– Linear unit: activation(z) = z
– Threshold/sign unit: sgn(z)
– Sigmoid unit: 1 / (1 + exp(-z))
– Rectified linear unit (ReLU): max(0, z)
– Tanh unit: tanh(z)
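The activations in this list are easy to write down directly; a small sketch in NumPy (the function names are illustrative):

    import numpy as np

    def linear(z):   return z
    def sign(z):     return np.sign(z)
    def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
    def relu(z):     return np.maximum(0.0, z)

    z = np.array([-2.0, 0.0, 2.0])
    for f in (linear, sign, sigmoid, relu, np.tanh):
        print(f.__name__, f(z))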
Many more activation functions exist (sinusoid, sinc, Gaussian, polynomial, …). Activation functions are also called transfer functions.
Called the architecture: typically predefined, part of the design of the classifier. The weights on the connections are learned from data.
[Figure: a network with input, hidden, and output layers and the weights on their connections]
See also: http://people.idsia.ch/~juergen/deep-learning-overview.html
In general, convex polygons
Figure from Shai Shalev-Shwartz and Shai Ben-David, 2014
In general, unions of convex polygons
Figure from Shai Shalev-Shwartz and Shai Ben-David, 2014
[DasGupta et al 1993]
– Exercise: Prove this
– Upper bound: Θ(…)
– Lower bound: Ω(…)
Exercise: Show that if we have only linear units, then adding multiple layers does not change the expressiveness.
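One way to start (a sketch of the key identity, not the full exercise): composing two linear layers gives
  W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2),
which is again a single linear function of x, so stacking any number of linear layers expresses nothing beyond one linear layer.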
Naming conventions for this example: the bias feature is always 1; the hidden units use sigmoid activations; the output unit uses a linear activation.
y = w^o_01 + w^o_11 z1 + w^o_21 z2
z1 = σ(w^h_01 + w^h_11 x1 + w^h_21 x2)
z2 = σ(w^h_02 + w^h_12 x1 + w^h_22 x2)
Questions?
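A small sketch of the forward pass for this example network; the weight values are made up for illustration, and the array layout (w_h[i, j] holding the weight from input i to hidden unit j) is just one possible convention:

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def forward(x1, x2, w_h, w_o):
        # hidden layer: two sigmoid units, bias feature always 1
        z1 = sigmoid(w_h[0, 0] + w_h[1, 0] * x1 + w_h[2, 0] * x2)
        z2 = sigmoid(w_h[0, 1] + w_h[1, 1] * x1 + w_h[2, 1] * x2)
        # output layer: one linear unit
        y = w_o[0] + w_o[1] * z1 + w_o[2] * z2
        return y, z1, z2

    w_h = np.array([[0.1, -0.2],    # w^h_01, w^h_02 (bias -> z1, z2)
                    [0.4,  0.3],    # w^h_11, w^h_12 (x1 -> z1, z2)
                    [-0.5, 0.8]])   # w^h_21, w^h_22 (x2 -> z1, z2)
    w_o = np.array([0.2, 1.0, -1.0])   # w^o_01, w^o_11, w^o_21
    print(forward(1.0, -1.0, w_h, w_o))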
Suppose the true label for this example is a number y*. We can write the square loss for this example as L = (y - y*)^2.
The learning objective, perhaps with a regularizer:
  min_w Σ_i L(NN(x_i, w), y_i)
Stochastic gradient descent on
  min_w Σ_i L(NN(x_i, w), y_i)
Learning rate: many tweaks possible. The objective is not convex; initialization can be important.
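A minimal sketch of the SGD loop for this objective. The gradient routine is assumed to come from backpropagation, discussed below; the learning-rate value, the decay schedule, and the toy usage are illustrative choices, not prescriptions:

    import random

    def sgd(examples, w, grad, epochs=20, lr=0.1):
        # examples: list of (x, y) pairs; grad(w, x, y) returns the per-example gradient
        for _ in range(epochs):
            random.shuffle(examples)      # a common tweak: reshuffle every epoch
            for x, y in examples:
                g = grad(w, x, y)
                w = [wi - lr * gi for wi, gi in zip(w, g)]
            lr *= 0.9                     # one possible learning-rate decay
        return w

    # toy usage: one weight, square loss (w0 * x - y)^2, so the gradient is 2 (w0 * x - y) x
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    grad = lambda w, x, y: [2.0 * (w[0] * x - y) * x]
    print(sgd(data, [0.0], grad))   # w[0] approaches 2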
Where are we?
Questions?
Slide courtesy Richard Socher
Slide courtesy Richard Socher
Slide courtesy Richard Socher
Backpropagation example
The network:
  y = w^o_01 + w^o_11 z1 + w^o_21 z2
  z1 = σ(w^h_01 + w^h_11 x1 + w^h_21 x2)
  z2 = σ(w^h_02 + w^h_12 x1 + w^h_22 x2)
We need the derivative of the loss with respect to every weight: both the output weights w^o and the hidden weights w^h.
Applying the chain rule to compute the gradient (and remembering partial computations along the way to speed things up).
First, an output-layer weight. By the chain rule,
  ∂L/∂w^o_01 = (∂L/∂y) · (∂y/∂w^o_01)
For the square loss, ∂L/∂y = 2(y - y*). Since y = w^o_01 + w^o_11 z1 + w^o_21 z2, we have ∂y/∂w^o_01 = 1.
Similarly,
  ∂L/∂w^o_11 = (∂L/∂y) · (∂y/∂w^o_11), with ∂y/∂w^o_11 = z1.
We have already computed the partial derivative ∂L/∂y for the previous case. Cache it to speed up!
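These output-layer gradients written as code; a sketch assuming the square loss L = (y - y*)^2, with z1 and z2 coming from the forward pass:

    def output_weight_gradients(y, y_star, z1, z2):
        dL_dy = 2.0 * (y - y_star)     # dL/dy for the square loss
        # y = w^o_01 + w^o_11 z1 + w^o_21 z2, so
        # dy/dw^o_01 = 1, dy/dw^o_11 = z1, dy/dw^o_21 = z2
        return dL_dy * 1.0, dL_dy * z1, dL_dy * z2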
Next, a hidden-layer weight, say w^h_22 (connecting x2 to z2). Again by the chain rule,
  ∂L/∂w^h_22 = (∂L/∂y) · ∂(w^o_01 + w^o_11 z1 + w^o_21 z2)/∂w^h_22
Only z2 depends on w^h_22, so this reduces to (∂L/∂y) · w^o_21 · (∂z2/∂w^h_22).
Now write z2 = σ(s), where s = w^h_02 + w^h_12 x1 + w^h_22 x2. Then
  ∂z2/∂w^h_22 = (∂z2/∂s) · (∂s/∂w^h_22)
Why is ∂z2/∂s easy? Because z2(s) is the logistic function we have already seen, so ∂z2/∂s = z2 (1 - z2). And ∂s/∂w^h_22 = x2.
Each of these partial derivatives is easy. Multiplying them together, and reusing the cached ∂L/∂y, gives the gradient for the hidden weight.
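The same chain of easy partial derivatives for w^h_22, as a code sketch; it reuses the cached dL/dy from the output-layer step:

    def hidden_weight_gradient_w22(y, y_star, z2, x2, w_o21):
        dL_dy = 2.0 * (y - y_star)   # cached from the output-layer computation
        dy_dz2 = w_o21               # y depends on w^h_22 only through z2
        dz2_ds = z2 * (1.0 - z2)     # derivative of the logistic function at s
        ds_dw22 = x2                 # s = w^h_02 + w^h_12 x1 + w^h_22 x2
        return dL_dy * dy_dz2 * dz2_ds * ds_dw22

A quick finite-difference check on the toy network from earlier is a good way to convince yourself the chain is right.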
This procedure, computing gradients with the chain rule and caching intermediate partial derivatives along the way, is called backpropagation.
Training: stochastic gradient descent on min_w Σ_i L(NN(x_i, w), y_i). The objective is not convex; initialization can be important.
The usual stochastic gradient descent tricks apply here
– Use k-fold cross-validation to determine the average number of epochs that works best on held-out data
– Train on the full data set using this many epochs to produce the final results
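A sketch of this recipe; train(...) and error(...) below are hypothetical stand-ins for your training and evaluation code, and the fold construction ignores shuffling for brevity:

    def pick_num_epochs(data, train, error, k=5, max_epochs=100):
        # k-fold cross-validation over the number of epochs
        fold_size = len(data) // k
        best = []
        for i in range(k):
            held_out = data[i * fold_size:(i + 1) * fold_size]
            rest = data[:i * fold_size] + data[(i + 1) * fold_size:]
            errors = [error(train(rest, epochs=e), held_out)
                      for e in range(1, max_epochs + 1)]
            best.append(1 + errors.index(min(errors)))   # epoch count with lowest held-out error
        return round(sum(best) / k)   # average over folds; then train on the full data set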
Cat, Dog, Tiger, Table: these vectors do not capture inherent similarities. Distances or dot products between any pair are all equal.
Cat, Dog, Tiger, Table: dense vector (often lower-dimensional) representations can capture similarities better.
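A small sketch of the contrast, treating three of the symbols as one-hot vectors versus made-up dense vectors:

    import numpy as np

    # one-hot: every pair of distinct symbols has dot product 0 and distance sqrt(2)
    cat, dog, table = np.eye(4)[0], np.eye(4)[1], np.eye(4)[3]
    print(np.dot(cat, dog), np.dot(cat, table))            # 0.0 0.0

    # dense (made-up numbers): cat and dog end up close, table far away
    cat_d   = np.array([0.9, 0.1])
    dog_d   = np.array([0.8, 0.2])
    table_d = np.array([-0.7, 0.6])
    print(np.dot(cat_d, dog_d), np.dot(cat_d, table_d))    # about 0.74 and -0.57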
https://colah.github.io/posts/2015-08-Understanding-LSTMs/ https://karpathy.github.io/2015/05/21/rnn-effectiveness/
Vanilla neural networks: input → neural network → prediction. We can assign labels to inputs ("cat", "burrito"). But what if the label for an input depends on a previous state of the network?
Recurrent connections: sequential input, sequential output. The same template is repeated at every step.
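A minimal sketch of one recurrent step. The tanh nonlinearity and the shapes here are common illustrative choices; the key point is that the same weights (the repeated template) are applied at every position, with the hidden state carrying information forward:

    import numpy as np

    def rnn_step(x_t, h_prev, W_x, W_h, b):
        # the same W_x, W_h, b are reused at every time step
        return np.tanh(W_x @ x_t + W_h @ h_prev + b)

    rng = np.random.default_rng(0)
    W_x, W_h, b = rng.normal(size=(3, 2)), rng.normal(size=(3, 3)), np.zeros(3)
    h = np.zeros(3)
    for x_t in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]:
        h = rnn_step(x_t, h, W_x, W_h, b)   # h now summarizes the sequence seen so far
    print(h)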
Vanilla networks; sequence output (e.g., image captioning); sequence input (e.g., sentiment analysis); seq2seq (e.g., translation).
"I grew up in France… I speak ____." RNNs don't seem to be able to learn such long-range dependencies [Hochreiter 1991, Bengio et al 1994]. The answer: better control over the memory via Long Short-Term Memory (LSTM) units.
The LSTM adds an additional memory to the cell.
Cell state
The "forget gate": use the current input to decide what to erase in the cell state.
Create a new cell state and also a filter that decides what part of the newly created cell state should be remembered
New cell state = remaining part of previous state + newly computed information
Finally, output = filtered version of the new cell state
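Putting the four steps together as one cell update; a sketch of a standard LSTM formulation (the gate names and the stacked-weight layout are conventional choices, not the exact notation of the figures):

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # W maps [h_prev; x_t] to the three gates and the candidate state, stacked
        z = W @ np.concatenate([h_prev, x_t]) + b
        f, i, o, g = np.split(z, 4)
        f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
        c = f * c_prev + i * np.tanh(g)   # new cell state = kept part of old state + new information
        h = o * np.tanh(c)                # output = filtered version of the new cell state
        return h, c

    H, D = 2, 3
    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(4 * H, H + D)), np.zeros(4 * H)
    h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)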
https://karpathy.github.io/2015/05/21/rnn-effectiveness/
A three-layer RNN with 512 hidden nodes in each layer: millions of parameters.
https://highnoongmt.wordpress.com/2015/05/22/lisls-stis-recurrent-neural-networks-for-folk-music-generation/