Neural Networks
Hugo Larochelle (@hugo_larochelle), Google Brain
What we'll cover:
- computer vision architectures: convolutional networks, data augmentation, residual networks
- ...
[Figure: a multi-layer neural network computing f(x) from inputs x_1, ..., x_d, with bias units]
Why not simply use a fully connected network on images? Even a small image of 112 × 150 pixels contains 16,800 inputs, so fully connecting each hidden unit to the whole image gives an unmanageable number of parameters, and computing the activations of the hidden units would be very expensive. Instead, each hidden unit is connected only to a subregion (patch) of the input image.
[Figure: hidden units organized into feature map 1, feature map 2, and feature map 3; units drawn in the same color share the same weight matrix]

Units are organized into feature maps that share parameters: within a feature map, the same weights are applied at every position of the image. W_ij is the matrix connecting the i-th input channel with the j-th feature map.
Computing a feature map then amounts to a convolution: each input channel is convolved with the hidden weights matrix W_ij with its rows and columns flipped, and the results are summed over channels (a bias could also have been added).
Worked example. Consider the 3×3 image

  x =   0   80   40
       20   40    0
        0    0   40

and a 2×2 kernel whose flipped weight matrix, as applied to each image patch, is

       1     0.5
       0.25  0

Each output value (x * k)_{pq} is the dot product between these weights and the image patch at position (p, q):

(x * k)_{11} = 1 × 0 + 0.5 × 80 + 0.25 × 20 + 0 × 40 = 45
(x * k)_{12} = 1 × 80 + 0.5 × 40 + 0.25 × 40 + 0 × 0 = 110
(x * k)_{21} = 1 × 20 + 0.5 × 40 + 0.25 × 0 + 0 × 0 = 40
(x * k)_{22} = 1 × 40 + 0.5 × 0 + 0.25 × 0 + 0 × 40 = 40

The resulting feature map is

  45  110
  40   40
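As a sanity check, the sliding-window computation above can be reproduced in a few lines of NumPy (a minimal sketch; the image values are the ones reconstructed from the four dot products shown):

```python
import numpy as np

# Worked example from the slides: a 3x3 image and a 2x2 weight matrix.
X = np.array([[ 0, 80, 40],
              [20, 40,  0],
              [ 0,  0, 40]], dtype=float)
K = np.array([[1.0,  0.5],
              [0.25, 0.0]])  # flipped kernel, as applied to each patch

# Valid cross-correlation: slide the window over every position.
out = np.zeros((X.shape[0] - K.shape[0] + 1, X.shape[1] - K.shape[1] + 1))
for p in range(out.shape[0]):
    for q in range(out.shape[1]):
        out[p, q] = np.sum(X[p:p+2, q:q+2] * K)

print(out)  # [[ 45. 110.]
            #  [ 40.  40.]]
```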
[Figure: an image with pixel values 0, 128, and 255 is convolved with the kernel (0, 0.5; 0.5, 0); the resulting feature map contains values such as 0.02, 0.19, and 0.75]
[Figure: additional example images (pixel values 0, 128, 255) and their feature maps under the same filter]
[Figure: a "complex cell" (pooling) layer on top of the feature map; it aggregates nearby responses (e.g. 0.02, 0.19, 0.75), so its output is similar when the pattern in the input image is slightly displaced]
After alternating convolutional and pooling layers, the final layers of the network are fully connected. The pooling operation summarizes each local region of a feature map, for example by taking the maximum or the average of its values.
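A minimal sketch of one common pooling operation, 2×2 max pooling with stride 2 (the choice of max and of the window size are illustrative assumptions, not from the slides):

```python
import numpy as np

# Non-overlapping max pooling; average pooling would use .mean() instead.
def max_pool(fmap, size=2):
    out = np.zeros((fmap.shape[0] // size, fmap.shape[1] // size))
    for p in range(out.shape[0]):
        for q in range(out.shape[1]):
            out[p, q] = fmap[p*size:(p+1)*size, q*size:(q+1)*size].max()
    return out

fmap = np.array([[45., 110.], [40., 40.]])
print(max_pool(fmap))  # [[110.]]
```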
[Figure: residual network building blocks. Left: a block for 64-d inputs with two 3×3, 64 convolutions and ReLU activations. Right: a "bottleneck" block for 256-d inputs: a 1×1, 64 convolution, a 3×3, 64 convolution, then a 1×1, 256 convolution, with ReLUs in between. In both cases the block's input is added back to its output through a skip connection.]
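Here is a minimal sketch of the 256-d bottleneck block, written in PyTorch for concreteness (the framework and the class/attribute names are my assumptions, not from the slides):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.conv = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.expand = nn.Conv2d(bottleneck, channels, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.reduce(x))   # 1x1, 64
        h = self.relu(self.conv(h))     # 3x3, 64
        h = self.expand(h)              # 1x1, 256
        return self.relu(h + x)         # skip connection, then ReLU

y = Bottleneck()(torch.randn(1, 256, 14, 14))  # shape preserved: (1, 256, 14, 14)
```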
Data augmentation: create additional training examples by transforming the images (translation, rotation, scaling, cropping). At test time, predictions can be made on several crops of the image and combined, undoing the transformations where necessary.
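A toy augmentation sketch (the crop size and the use of horizontal flipping are illustrative choices, not from the slides):

```python
import numpy as np

def augment(image, crop=100, rng=np.random.default_rng()):
    H, W = image.shape[:2]
    top = rng.integers(0, H - crop + 1)    # random translation via cropping
    left = rng.integers(0, W - crop + 1)
    patch = image[top:top+crop, left:left+crop]
    if rng.random() < 0.5:                 # random horizontal mirror
        patch = patch[:, ::-1]
    return patch

image = np.zeros((112, 150))
print(augment(image).shape)  # (100, 100)
```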
Turning to natural language data (in English or French, say): preprocessing starts with tokenization. The sentence "He's spending 7 days in San Francisco." becomes

"He", "'s", "spending", "7", "days", "in", "San Francisco", "."

and after lemmatization and normalization:

"he", "be", "spend", "NUMBER", "day", "in", "San Francisco", "."
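A toy sketch of the normalization step above (the token list is from the slide; the lemma table is illustrative, not a real lemmatizer):

```python
# Hypothetical lemma table, just enough for the example sentence.
LEMMAS = {"He": "he", "'s": "be", "spending": "spend", "days": "day"}

def normalize(token):
    if token.isdigit():
        return "NUMBER"          # map all numbers to a single token
    return LEMMAS.get(token, token)

tokens = ["He", "'s", "spending", "7", "days", "in", "San Francisco", "."]
print([normalize(t) for t in tokens])
# ['he', 'be', 'spend', 'NUMBER', 'day', 'in', 'San Francisco', '.']
```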
Each word w is then mapped to an ID (the position of the word in the vocabulary); words not in the vocabulary map to a special OOV (out-of-vocabulary) token.

Vocabulary:
  "the" → 1
  "and" → 2
  "dog" → 3
  "."   → 4
  "OOV" → 5

The sentence "the cat and the dog play ." becomes the ID sequence 1 5 2 1 3 5 4 ("cat" and "play" are out of vocabulary).
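A minimal sketch of this ID mapping (vocabulary taken from the slide):

```python
vocab = {"the": 1, "and": 2, "dog": 3, ".": 4}
OOV = 5  # ID for out-of-vocabulary words

sentence = ["the", "cat", "and", "the", "dog", "play", "."]
ids = [vocab.get(w, OOV) for w in sentence]
print(ids)  # [1, 5, 2, 1, 3, 5, 4]
```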
From its ID, a word w can be encoded as a one-hot vector with a single 1 at the position associated with the ID:

e(w) = [ 0 0 0 1 0 0 0 0 0 0 ]

For a realistic vocabulary this representation is huge: one dimension per word means a very large number of input units!
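A sketch of the one-hot encoding, using 1-based IDs so that ID 4 gives the vector shown above (the vocabulary size of 10 is illustrative):

```python
import numpy as np

def one_hot(word_id, vocab_size=10):
    e = np.zeros(vocab_size, dtype=int)
    e[word_id - 1] = 1   # single 1 at the position associated with the ID
    return e

print(one_hot(4))  # [0 0 0 1 0 0 0 0 0 0]
```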
A better idea is to learn a low-dimensional, continuous representation C(w) for each word:

  "the"   1   [ 0.6762, -0.9607, 0.3626, -0.2410, 0.6636 ]
  "a"     2   [ 0.6859, -0.9266, 0.3777, -0.2140, 0.6711 ]
  "have"  3   [ 0.1656, -0.1530, 0.0310, -0.3321, -0.1342 ]
  "be"    4   [ 0.1760, -0.1340, 0.0702, -0.2981, -0.1111 ]
  "cat"   5   [ 0.5896, 0.9137, 0.0452, 0.7603, -0.6541 ]
  "dog"   6   [ 0.5965, 0.9143, 0.0899, 0.7702, -0.6392 ]
  "car"   7   [ -0.0069, 0.7995, 0.6433, 0.2898, 0.6359 ]

Notice that related words ("the"/"a", "have"/"be", "cat"/"dog") have similar vectors.
Such representations capture similarity between words: nearby vectors form clusters such as

MAY, WOULD, COULD, SHOULD, MIGHT, MUST, CAN, CANNOT, COULDN'T, WON'T, WILL
ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT, NINE, TEN, ELEVEN, TWELVE, THIRTEEN, FOURTEEN, FIFTEEN, SIXTEEN, SEVENTEEN, EIGHTEEN
JANUARY, FEBRUARY, MARCH, APRIL, JUNE, JULY, AUGUST, SEPTEMBER, OCTOBER, NOVEMBER, DECEMBER
MILLION, BILLION
MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY
ZERO
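Using the example table above, we can check the similarity numerically, for instance with cosine similarity ("cat" and "dog" come out much closer to each other than to "car"):

```python
import numpy as np

C = {
    "cat": np.array([0.5896, 0.9137, 0.0452, 0.7603, -0.6541]),
    "dog": np.array([0.5965, 0.9143, 0.0899, 0.7702, -0.6392]),
    "car": np.array([-0.0069, 0.7995, 0.6433, 0.2898, 0.6359]),
}

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(C["cat"], C["dog"]))  # ~1.0
print(cos(C["cat"], C["car"]))  # ~0.3, noticeably smaller
```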
In a neural network, the input is the concatenation of the representations of each word, e.g. for a 10-word window x = [C(w_1)⊤, ..., C(w_10)⊤]⊤. The representations are learned like any other parameter, by gradient descent:

C(w) ← C(w) − α ∇_{C(w)} l

where l is the loss function optimized by the neural network.
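A sketch of this update rule (sizes and the stand-in gradients are illustrative); note that only the rows of the lookup table for words that actually appear in the input are updated, and a repeated word accumulates several updates:

```python
import numpy as np

vocab_size, dim, alpha = 10, 5, 0.1
C = np.random.randn(vocab_size, dim) * 0.01   # one row per word ID

word_ids = [1, 5, 2, 1, 3, 5, 4]              # the example sentence
grads = np.random.randn(len(word_ids), dim)   # stand-in for grad of l w.r.t. C(w_t)

for t, w in enumerate(word_ids):
    C[w] -= alpha * grads[t]                  # C(w) <- C(w) - alpha * grad
```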
Neural network language model (Bengio, Ducharme, Vincent and Jauvin, 2003): predict the next word from the n−1 previous words.

[Figure: each context word w_{t−n+1}, ..., w_{t−2}, w_{t−1} is mapped through a shared lookup-table matrix C to its representation C(w); the concatenated representations feed a tanh hidden layer (where most of the computation happens) through matrices W_1, W_2, ..., W_{n−1}; a softmax across words gives the i-th output P(w_t = i | context)]

The parameters C are shared across context positions, which allows the model to transfer to n-grams not observed in the training corpus.
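A minimal forward-pass sketch of this language model (all sizes are illustrative assumptions; only tanh and the softmax come from the slide):

```python
import numpy as np

V, dim, hidden, n = 10, 5, 8, 4              # vocab, embedding, hidden, n-gram
rng = np.random.default_rng(0)
C = rng.normal(size=(V, dim))                # shared lookup table
W = rng.normal(size=(n - 1, hidden, dim))    # W_1 ... W_{n-1}
b = np.zeros(hidden)
Wout = rng.normal(size=(V, hidden))

def predict(context_ids):                    # IDs of w_{t-n+1}, ..., w_{t-1}
    a = b + sum(W[i] @ C[w] for i, w in enumerate(context_ids))
    h = np.tanh(a)
    logits = Wout @ h
    p = np.exp(logits - logits.max())
    return p / p.sum()                       # i-th output = P(w_t = i | context)

print(predict([1, 5, 2]).round(3))
```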
Example: suppose the training corpus contains the 4-grams ["the", "cat", "was", "sleeping"] and ["the", "dog", "was", "sleeping"], and that ["the", "cat", "is", "eating"] is never observed but ["the", "dog", "is", "eating"] is. Because "cat" and "dog" end up with similar representations, the model should be able to generalize to the case of "cat".
To see why, let a(x) be the linear activation of the hidden layer, and write the block of weights connecting the representation at context position i to the hidden layer as W_i. The gradient of the loss with respect to a word representation is

∇_{C(w)} l = Σ_{i=1}^{n−1} 1_{(w_{t−i} = w)} W_i⊤ ∇_{a(x)} l

i.e. C(w) receives one term W_i⊤ ∇_{a(x)} l for every context position i where w occurs. For instance, a word appearing only at position 2 gets W_2⊤ ∇_{a(x)} l, a word appearing only at position 3 gets W_3⊤ ∇_{a(x)} l, and a word appearing at positions 1 and 4 gets W_1⊤ ∇_{a(x)} l + W_4⊤ ∇_{a(x)} l.
A simple model for classifying a sequence of words: average their representations.

[Figure: each word w_1, ..., w_4 is looked up in the shared table C; the representations C(w_1), ..., C(w_4) are averaged (there are no weights within the average), then passed through layers W(1) and W(2) and an output layer W(3) whose softmax gives the i-th output P(y = i-th class | w)]
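A sketch of this averaging classifier (sizes and the tanh nonlinearities are illustrative assumptions):

```python
import numpy as np

V, dim, hid, classes = 10, 5, 8, 3
rng = np.random.default_rng(0)
C = rng.normal(size=(V, dim))
W1 = rng.normal(size=(hid, dim))
W2 = rng.normal(size=(hid, hid))
W3 = rng.normal(size=(classes, hid))

def classify(word_ids):
    x = C[word_ids].mean(axis=0)             # average of C(w_1)...C(w_4)
    h = np.tanh(W2 @ np.tanh(W1 @ x))        # layers W(1), W(2)
    logits = W3 @ h                          # output layer W(3)
    p = np.exp(logits - logits.max())
    return p / p.sum()                       # P(y = i-th class | w)

print(classify([1, 5, 2, 4]).round(3))
```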
A recurrent neural network instead processes the words in order, maintaining a hidden state:

[Figure: each representation C(w_t) enters the recurrent layer through W(1); the hidden states h_1^(1), ..., h_4^(1) are chained through U(1); the layers W(2) and W(3) and the softmax giving P(y = i-th class | w) sit on top]

h_t^(1) = tanh(b^(1) + U^(1) h_{t−1}^(1) + W^(1) C(w_t))
A second recurrent layer can be stacked on top of the first:

[Figure: the states h_1^(2), ..., h_4^(2) of a second recurrent layer, chained through U(2), are computed from the first layer's states through W(2), with W(3) and the softmax on top]

h_t^(1) = tanh(b^(1) + U^(1) h_{t−1}^(1) + W^(1) C(w_t))
h_t^(2) = tanh(b^(2) + U^(2) h_{t−1}^(2) + W^(2) h_t^(1))
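A minimal sketch of this stacked recurrence (sizes and random inputs are illustrative; the recurrences follow the two equations above):

```python
import numpy as np

dim, hid = 5, 8
rng = np.random.default_rng(0)
W1, U1, b1 = rng.normal(size=(hid, dim)), rng.normal(size=(hid, hid)), np.zeros(hid)
W2, U2, b2 = rng.normal(size=(hid, hid)), rng.normal(size=(hid, hid)), np.zeros(hid)

h1 = h2 = np.zeros(hid)
for x in rng.normal(size=(4, dim)):          # stand-ins for C(w_1)...C(w_4)
    h1 = np.tanh(b1 + U1 @ h1 + W1 @ x)      # first recurrent layer
    h2 = np.tanh(b2 + U2 @ h2 + W2 @ h1)     # second layer stacked on top
```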
A bidirectional RNN runs a separate recurrence in each direction (left-to-right and right-to-left) and combines the representation from both directions before the output layer W(3) and its softmax, whose i-th output is P(y = i-th class | w).
[Figure: a recurrent network over w_1, w_2, w_3: each representation C(w_t) feeds its hidden state h_t through a shared matrix W, and consecutive hidden states h_1, h_2, h_3 are chained through a shared matrix U]
The long short-term memory (LSTM) network (Hochreiter, Schmidhuber 1995) replaces each hidden unit with a gated memory cell.

[Figure: the same recurrent network, with each hidden state h_t additionally producing an output through a shared matrix V; each h_t is computed by an LSTM cell from C(w_t) and h_{t−1}]
In full, the LSTM computes its hidden layer as follows (⊙ is element-wise multiplication; sigm is the logistic sigmoid).

Input, forget, output gates:

i_t = sigm(b^[i] + U^[i] h_{t−1} + W^[i] C(w_t))
f_t = sigm(b^[f] + U^[f] h_{t−1} + W^[f] C(w_t))
o_t = sigm(b^[o] + U^[o] h_{t−1} + W^[o] C(w_t))

Cell state:

c̃_t = tanh(b^[c] + U^[c] h_{t−1} + W^[c] C(w_t))
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t

Hidden layer:

h_t = o_t ⊙ tanh(c_t)
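One LSTM step implementing these equations, as a sketch with illustrative sizes (in NumPy, * on vectors is the element-wise product ⊙):

```python
import numpy as np

dim, hid = 5, 8
rng = np.random.default_rng(0)
U = {g: rng.normal(size=(hid, hid)) for g in "ifoc"}   # recurrent weights
W = {g: rng.normal(size=(hid, dim)) for g in "ifoc"}   # input weights
b = {g: np.zeros(hid) for g in "ifoc"}                 # biases

sigm = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x):                      # x stands in for C(w_t)
    i = sigm(b["i"] + U["i"] @ h_prev + W["i"] @ x)    # input gate
    f = sigm(b["f"] + U["f"] @ h_prev + W["f"] @ x)    # forget gate
    o = sigm(b["o"] + U["o"] @ h_prev + W["o"] @ x)    # output gate
    c_tilde = np.tanh(b["c"] + U["c"] @ h_prev + W["c"] @ x)
    c = f * c_prev + i * c_tilde                       # cell state
    h = o * np.tanh(c)                                 # hidden layer
    return h, c

h, c = np.zeros(hid), np.zeros(hid)
for x in rng.normal(size=(3, dim)):
    h, c = lstm_step(h, c, x)
```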
Why does this help with long-range dependencies? Unrolling the cell state recurrence:

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
    = f_t ⊙ f_{t−1} ⊙ c_{t−2} + f_t ⊙ i_{t−1} ⊙ c̃_{t−1} + i_t ⊙ c̃_t
    ...
    = Σ_{t'=0}^{t} (f_t ⊙ ··· ⊙ f_{t'+1}) ⊙ i_{t'} ⊙ c̃_{t'}

so the cell state is a gated sum of all past candidate states c̃_{t'}: as long as the forget gates stay close to 1, information (and gradients) can be carried across many time steps.
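A quick numeric check that the unrolled sum equals the step-by-step recurrence (scalar cell state for simplicity; the random gate values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 6
f, i, c_tilde = rng.random(T), rng.random(T), rng.normal(size=T)

c = 0.0
for t in range(T):                            # recurrence c_t = f_t c_{t-1} + i_t c~_t
    c = f[t] * c + i[t] * c_tilde[t]

# Unrolled form: sum over t' of (f_t ... f_{t'+1}) i_{t'} c~_{t'}
unrolled = sum(np.prod(f[tp+1:T]) * i[tp] * c_tilde[tp] for tp in range(T))
print(np.isclose(c, unrolled))                # True
```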