Introduction to Neural Networks
Philipp Koehn 24 September 2020
Linear Models

We previously used a weighted linear combination of feature values hj and weights λj:

score(λ, di) = Σj λj hj(di)
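A minimal Python sketch of this weighted linear combination; the feature values and weights are made-up illustration numbers, not from the slides:

```python
feature_values = [2.0, 0.5, 1.0]   # h_j(d_i), hypothetical
weights = [0.3, -1.2, 0.8]         # lambda_j, hypothetical

# score(lambda, d_i) = sum_j lambda_j * h_j(d_i)
score = sum(w * h for w, h in zip(weights, feature_values))
print(score)  # 0.3*2.0 - 1.2*0.5 + 0.8*1.0 = 0.8
```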
Limits of linearity: a linear model cannot express more complex value relationships, e.g.:
– any value in the range [0;5] is equally good
– values over 8 are bad
– higher than 10 is not worse
[Figure: a multi-layer network diagram; each arrow is a weight]
Linear models:

score(λ, di) = Σj λj hj(di)

Adding a non-linear function f:

score(λ, di) = f( Σj λj hj(di) )

Popular choices for f:

tanh(x)
sigmoid(x) = 1 / (1 + e^(−x))
relu(x) = max(0, x)

(sigmoid is also called the "logistic function")
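A small Python sketch of these three activation functions:

```python
import math

def sigmoid(x):
    # logistic function: maps any real value into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # hyperbolic tangent: maps any real value into (-1, 1)
    return math.tanh(x)

def relu(x):
    # rectified linear unit: zero for negative inputs, identity otherwise
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))
```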
– computer = sequence of Boolean gates
– neural computer = sequence of layers

(e.g., sorting on input values)
[Figure: example network with two input nodes, two hidden nodes, and one output node; weights 3.7 and 2.9 from the inputs to the hidden nodes, 4.5 and −5.2 from the hidden nodes to the output, and bias weights −1.5, −4.5, and −2.0]
Setting the input values to x0 = 1.0 and x1 = 0.0, the hidden node values are computed as:
sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^(−2.2)) = 0.90
sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.5) = sigmoid(−1.6) = 1 / (1 + e^(1.6)) = 0.17
With the hidden node values .90 and .17, the output node value is computed as:
sigmoid(.90 × 4.5 + .17 × −5.2 + 1 × −2.0) = sigmoid(1.17) = 1 / (1 + e^(−1.17)) = 0.76
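A minimal Python sketch of this forward pass, using the weights from the example network; it reproduces the hidden values .90 and .17 and the output .76:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# weights taken from the worked example above
x0, x1 = 1.0, 0.0                                # input values
h0 = sigmoid(3.7 * x0 + 3.7 * x1 + 1 * -1.5)     # hidden node 1 -> 0.90
h1 = sigmoid(2.9 * x0 + 2.9 * x1 + 1 * -4.5)     # hidden node 2 -> 0.17
y  = sigmoid(4.5 * h0 + -5.2 * h1 + 1 * -2.0)    # output node   -> 0.76
print(round(h0, 2), round(h1, 2), round(y, 2))   # 0.9 0.17 0.76
```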
Input x0   Input x1   Hidden h0   Hidden h1   Output y0
0          0          0.12        0.02        0.18 → 0
0          1          0.88        0.27        0.74 → 1
1          0          0.73        0.12        0.74 → 1
1          1          0.99        0.73        0.33 → 0
The network computes XOR:
– hidden node h0 is OR
– hidden node h1 is AND
– final layer operation is h0 − h1
just as: larger Boolean circuits → more complex computations possible
[Figure: a biological neuron, with soma, nucleus, dendrites, axon, and axon terminals]
[Figure: a synapse, with axon terminal, synaptic vesicles, neurotransmitter, neurotransmitter transporter, voltage gated Ca++ channel, synaptic cleft, receptor, and postsynaptic density]
Inspired by the brain:
– neurons, connections between neurons
– learning = change of connections, not change of neurons
– massive parallel processing

But artificial neural networks are much simpler:
– computation within a neuron is vastly simplified
– discrete time steps
– typically some form of supervised learning with a massive number of stimuli
[Figure: the example network computes output .76 for input values 1.0 and 0.0 (hidden values .90 and .17); the target output is 1.0]
⇒ How do we adjust the weights?
Gradient descent (see the sketch below):
– error is a function of the weights
– we want to reduce the error
– gradient descent: move towards the error minimum
– compute gradient → get direction to the error minimum
– adjust weights towards direction of lower error

Back-propagation:
– first adjust last set of weights
– propagate error back to each previous layer
– adjust their weights
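A minimal sketch of gradient descent on a toy error surface; the error function error(w) = (w − 3)² and the learning rate are illustration choices, not from the slides:

```python
# error(w) = (w - 3)**2, whose gradient is 2 * (w - 3)
w = 0.0            # initial weight
mu = 0.1           # learning rate
for step in range(50):
    gradient = 2 * (w - 3)
    w = w - mu * gradient   # move against the gradient, downhill
print(w)  # close to the minimum at w = 3
```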
[Figure: error(λ) curve, with the gradient at the current λ giving the direction towards the error minimum]
[Figure: the gradients for w1 and w2 combine into one direction of steepest descent in weight space]
Derivative of the sigmoid:

sigmoid(x) = 1 / (1 + e^(−x))

Reminder, quotient rule:

( f(x) / g(x) )′ = ( g(x) f′(x) − f(x) g′(x) ) / g(x)²

d sigmoid(x) / dx = d/dx ( 1 / (1 + e^(−x)) )
= ( 0 × (1 + e^(−x)) − 1 × (−e^(−x)) ) / (1 + e^(−x))²
= ( 1 / (1 + e^(−x)) ) × ( e^(−x) / (1 + e^(−x)) )
= sigmoid(x) × ( 1 − 1 / (1 + e^(−x)) )
= sigmoid(x) (1 − sigmoid(x))
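A quick numerical sanity check of this derivative, comparing a finite-difference estimate against the closed form:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# check d sigmoid(x)/dx = sigmoid(x) * (1 - sigmoid(x)) at an arbitrary point
x, eps = 0.7, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)  # both about 0.2217
```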
For a single output node with:
– linear combination of inputs: s = Σk wk hk
– output: y = sigmoid(s)
– squared error: E = ½ (t − y)²

the derivative of the error with regard to a weight wk follows the chain rule:

dE/dwk = dE/dy · dy/ds · ds/dwk
First factor, the derivative of the error with regard to the output y:

dE/dy = d/dy ½ (t − y)² = −(t − y)
Second factor, the derivative of the output y with regard to s:

dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)
Third factor, the derivative of s with regard to the weight wk:

ds/dwk = d/dwk Σk wk hk = hk
Putting it all together:

dE/dwk = dE/dy · dy/ds · ds/dwk = −(t − y) · y(1 − y) · hk

– error: (t − y)
– derivative of sigmoid: y′ = y(1 − y)

Weight adjustment, scaled by a learning rate µ and applied against the gradient:

∆wk = µ (t − y) y′ hk
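A minimal Python sketch of one such update for a single output node; the inputs, weights, target, and learning rate are taken from the running example (with rounding, so the printed numbers may differ slightly from the slides):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

h = [0.9, 0.17, 1.0]        # inputs to the node (last one is the bias)
w = [4.5, -5.2, -2.0]       # current weights
t = 1.0                     # target output
mu = 10                     # learning rate, as in the worked example

s = sum(wk * hk for wk, hk in zip(w, h))   # linear combination s
y = sigmoid(s)                             # node output y
for k in range(len(w)):
    # delta_wk = mu * (t - y) * y' * hk, with y' = y * (1 - y)
    w[k] += mu * (t - y) * y * (1 - y) * h[k]
print([round(wk, 3) for wk in w])
```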
For multiple output nodes, the error is summed over all output nodes j:

E = Σj ½ (tj − yj)²

Weight update for the weight wj←k from node k into output node j:

∆wj←k = µ (tj − yj) y′j hk
Define the error term for output node j:

δj = (tj − yj) y′j

(why this way? there is math to back it up...)

The error term back-propagated to a hidden node i:

δi = ( Σj wj←i δj ) y′i

Universal weight update formula:

∆wj←k = µ δj hk
[Figure: the example network with labeled nodes: inputs A = 1.0 and B = 0.0 with bias node C = 1; hidden nodes D = .90 and E = .17 with bias node F = 1; output node G = .76]
Error term and final-layer weight updates, with target t = 1 and learning rate µ = 10:
– δG = (t − y) y′ = (1 − .76) × 0.181 = .0434
– ∆wGD = µ δG hD = 10 × .0434 × .90 = .391
– ∆wGE = µ δG hE = 10 × .0434 × .17 = .074
– ∆wGF = µ δG hF = 10 × .0434 × 1 = .434
(the weight wGD is thus updated from 4.5 to 4.891)
– δD = ( Σj wj←D δj ) y′D = wGD δG y′D = 4.5 × .0434 × .0898 = .0175
– ∆wDA = µ δD hA = 10 × .0175 × 1.0 = .175
– ∆wDB = µ δD hB = 10 × .0175 × 0.0 = 0
– ∆wDC = µ δD hC = 10 × .0175 × 1 = .175
– δE = ( Σj wj←E δj ) y′E = wGE δG y′E = −5.2 × .0434 × 0.2055 = −.0464
– ∆wEA = µ δE hA = 10 × −.0464 × 1.0 = −.464
– etc.
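A Python sketch of this backward pass, using the node names from the figure; it starts from the rounded forward-pass values, so the printed numbers may differ slightly in the last digits from the slides:

```python
mu = 10                            # learning rate from the example
h_A, h_B, h_C = 1.0, 0.0, 1.0      # input nodes (C is the bias unit)
h_D, h_E, h_F = 0.90, 0.17, 1.0    # hidden nodes (F is the bias unit)
y, t = 0.76, 1.0                   # output value and target
w_GD, w_GE = 4.5, -5.2             # weights from hidden nodes into G

# error term at the output node: delta_G = (t - y) * y'
delta_G = (t - y) * y * (1 - y)

# weight updates for the final layer: delta_w = mu * delta_G * h
dw_GD = mu * delta_G * h_D
dw_GE = mu * delta_G * h_E
dw_GF = mu * delta_G * h_F

# error terms propagated back: delta_i = (sum_j w_ji * delta_j) * y_i'
delta_D = w_GD * delta_G * h_D * (1 - h_D)
delta_E = w_GE * delta_G * h_E * (1 - h_E)

# weight updates for the first layer, e.g., for the weights into D
dw_DA = mu * delta_D * h_A
dw_DB = mu * delta_D * h_B
dw_DC = mu * delta_D * h_C

print(round(delta_G, 4), round(dw_GD, 3), round(dw_GE, 3), round(dw_GF, 3))
print(round(delta_D, 4), round(delta_E, 4), round(dw_DA, 3))
```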
Weights are initialized randomly, e.g., uniformly from the interval [−0.01, 0.01]

Glorot and Bengio (2010) suggest:
– for shallow neural networks: [ −1/√n, 1/√n ] where n is the size of the previous layer
– for deep neural networks: [ −√6 / √(nj + nj+1), √6 / √(nj + nj+1) ] where nj is the size of the previous layer and nj+1 the size of the next layer
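A minimal sketch of the deep-network initialization above; the layer sizes are illustrative:

```python
import math
import random

def init_weights(n_prev, n_next):
    # uniform initialization in [-sqrt(6)/sqrt(n_j + n_j+1), +sqrt(6)/sqrt(n_j + n_j+1)]
    bound = math.sqrt(6.0) / math.sqrt(n_prev + n_next)
    return [[random.uniform(-bound, bound) for _ in range(n_prev)]
            for _ in range(n_next)]

W = init_weights(200, 200)   # hypothetical 200-node layers
print(len(W), len(W[0]), W[0][0])
```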
The training data output for classification is a one-hot vector, e.g., y = (0, 0, 1)T
– predicted class is the output node yi with the highest value
– obtain a posterior probability distribution by softmax:

softmax(yi) = e^(yi) / Σj e^(yj)
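A small Python sketch of softmax; subtracting the maximum is a standard numerical-stability trick, not part of the formula above, and does not change the result:

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.0, 2.0, 3.0]))  # sums to 1; the largest score gets the highest probability
```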
[Figure: error(λ) curve; a too high learning rate overshoots the minimum]
[Figure: error(λ) curve; bad initialization starts the search far from the minimum]
[Figure: error(λ) curve with a local optimum and the global optimum; gradient descent may get stuck in the local optimum]
Momentum term: add a decaying memory (factor ρ) of the previous update ∆wj←k(n − 1) to each new update:

∆wj←k(n) = µ δj hk + ρ ∆wj←k(n − 1)
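One momentum update, following the formula above; the error term, input value, learning rate µ, and momentum factor ρ are made-up illustration numbers:

```python
mu, rho = 0.1, 0.9
delta_w_prev = 0.05          # previous update, delta_w(n-1)
delta_j, h_k = 0.04, 0.9     # error term and input value

delta_w = mu * delta_j * h_k + rho * delta_w_prev
print(delta_w)  # 0.1*0.04*0.9 + 0.9*0.05 = 0.0486
```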
Adapting the learning rate over time:
– at the beginning, things have to change a lot
– later, just fine-tuning

Adagrad adapts the learning rate per weight, based on the gradient gt = dE/dw of the error E with respect to the weight w at time t:

∆wt = µ / √( Στ=1..t gτ² ) × gt
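A minimal sketch of this Adagrad-style update: the per-weight step size shrinks as squared gradients accumulate. The gradient sequence is a made-up illustration:

```python
import math

mu = 1.0
w = 0.0
sum_sq = 0.0
for g in [0.5, 0.4, 0.3, 0.2]:        # gradients g_t over time
    sum_sq += g * g                   # accumulate g_tau^2
    w -= mu / math.sqrt(sum_sq) * g   # step shrinks as sum_sq grows
    print(round(w, 4))
```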
A general problem of machine learning is overfitting (very good on train, bad on unseen test)

Dropout:
– mask: set of hidden units dropped
– randomly generate, say, 10–20 masks
– alternate between the masks during training

Why does this work? → related to bagging, ensembles, ...
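A minimal sketch of applying one randomly generated dropout mask to a layer of hidden values; the drop rate and hidden values are illustrative:

```python
import random

def apply_dropout(hidden, p_drop=0.5):
    # randomly zero out hidden units; each call generates a new mask
    mask = [0.0 if random.random() < p_drop else 1.0 for _ in hidden]
    return [h * m for h, m in zip(hidden, mask)]

print(apply_dropout([0.9, 0.17, 0.4, 0.8]))
```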
Mini batches: process several training examples at once
– sum up their updates
– apply the sum to the model
In vector and matrix notation:
– forward computation: s = W h
– activation function: y = sigmoid(s)
– error term: δ = (t − y) · sigmoid′(s)
– propagation of error term: δi = W^T δi+1 · sigmoid′(si)
– weight updates: ∆W = µ δ h^T
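A numpy sketch of these operations on the running example network; the bias weights are kept in separate vectors for readability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W1 = np.array([[3.7, 3.7],
               [2.9, 2.9]])        # input -> hidden weights
b1 = np.array([-1.5, -4.5])        # hidden bias weights
W2 = np.array([[4.5, -5.2]])       # hidden -> output weights
b2 = np.array([-2.0])              # output bias weight
x = np.array([1.0, 0.0])           # input
t = np.array([1.0])                # target
mu = 10                            # learning rate from the example

# forward computation: s = W h, y = sigmoid(s)
s1 = W1 @ x + b1
h = sigmoid(s1)
s2 = W2 @ h + b2
y = sigmoid(s2)                    # -> about 0.76

# error term: delta = (t - y) * sigmoid'(s)
delta2 = (t - y) * sigmoid(s2) * (1 - sigmoid(s2))
# propagation of error term: delta_i = W^T delta_(i+1) * sigmoid'(s_i)
delta1 = (W2.T @ delta2) * sigmoid(s1) * (1 - sigmoid(s1))
# weight updates: Delta W = mu * delta * h^T
dW2 = mu * np.outer(delta2, h)
dW1 = mu * np.outer(delta1, x)
print(np.round(y, 2), np.round(dW2, 3))
```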
Neural network layers may have, say, 200 nodes, so a matrix-vector computation such as s = W h requires 200 × 200 = 40,000 multiplications

GPUs are well suited for this:
– image rendering requires such vector and matrix operations
– massively multi-core, but lean processing units
– example: NVIDIA Tesla K20c GPU provides 2496 thread processors