Understanding Hidden Memories of Recurrent Neural Networks
Yao Ming, Shaozu Cao, Ruixiang Zhang, Zhen Li, Yuanzhe Chen, Yangqiu Song, Huamin Qu.
THE HONG KONG UNIVERSITY OF SCIENCE AND TECHNOLOGY
What is a Recurrent Neural Network?
A vanilla RNN
[Diagram: input x(t), hidden state h(t) with tanh activation, output y(t)]
A deep learning model used for: machine translation, speech recognition, language modeling, …
[Diagram: an RNN with its input and hidden state]
A vanilla RNN takes an input $x^{(t)}$ and updates its hidden state $h^{(t)}$ using:

$$h^{(t)} = \tanh(W h^{(t-1)} + V x^{(t)})$$

[Diagram: a 2-layer RNN, two stacked tanh cells with hidden states $h_1^{(t)}$ and $h_2^{(t)}$, unrolled over four time steps with inputs and outputs at each step]
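The vanilla RNN update can be sketched in a few lines of NumPy. This is a minimal illustration of the recurrence above; the weight matrices, toy dimensions, and random inputs are assumptions for demonstration, not the models discussed in the slides.

```python
import numpy as np

def rnn_step(W, V, h_prev, x):
    """One step of a vanilla RNN: h(t) = tanh(W h(t-1) + V x(t))."""
    return np.tanh(W @ h_prev + V @ x)

# Toy dimensions (hypothetical): hidden size 4, input size 3.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1  # hidden-to-hidden weights
V = rng.standard_normal((4, 3)) * 0.1  # input-to-hidden weights
h = np.zeros(4)                        # initial hidden state h(0)

# Feed a sequence of 5 random input vectors through the recurrence.
for x in rng.standard_normal((5, 3)):
    h = rnn_step(W, V, h, x)
print(h.shape)  # (4,)
```

A 2-layer RNN simply feeds the first layer's hidden state as the input of a second `rnn_step` with its own weights.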
A unit sensitive to position in a line. Many more units have no clear meaning.
Each column represents the value of the hidden-state vector as the model reads an input word.

Machine Translation: 4-layer, 1000 units/layer (Sutskever et al., 2014)
Language Modeling: 2-layer, 1500 units/layer (Zaremba et al., 2015)
The model's response to an input word is the hidden-state update $\Delta h^{(t)} = h^{(t)} - h^{(t-1)}$, with components $\Delta h_i^{(t)}$, $i = 1, \dots, n$.

A larger $|\Delta h_i^{(t)}|$ implies that the word $x$ is more salient to unit $i$. Since $\Delta h^{(t)}$ can vary given the same word $x$, we use the expectation:

$$s(x) = \mathbb{E}\big[\Delta h^{(t)} \mid x^{(t)} = x\big]$$
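The expected response per word can be estimated empirically by running the model over a corpus, recording the hidden-state update at each step, and averaging per word. A minimal sketch, assuming the (word, delta_h) records have already been collected; the toy data below is hypothetical:

```python
import numpy as np
from collections import defaultdict

# Hypothetical records: the input word at each step and the observed
# hidden-state update delta_h(t), for a model with 2 hidden units.
records = [
    ("good", np.array([0.5, -0.1])),
    ("good", np.array([0.3,  0.1])),
    ("bad",  np.array([-0.4, 0.2])),
]

# Estimate s(w) = E[delta_h | word w] by averaging over all occurrences of w.
sums = defaultdict(lambda: np.zeros(2))
counts = defaultdict(int)
for word, dh in records:
    sums[word] += dh
    counts[word] += 1
expected_response = {w: sums[w] / counts[w] for w in sums}
print(expected_response["good"])
```

Units with a large average absolute response to a word are the ones the word is most salient to.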
[Figure: distribution of the response of unit #36, showing the mean, 25%-75% and 9%-91% ranges]

Highly responsive hidden units
[Figure: per-unit response distributions, showing the mean, 25%-75% and 9%-91% ranges]
Color: sign of the average edge weight
Width: scale of the average edge weight
Color: each unit's salience to the selected word
The ratio of preserved value. Each bar represents the average scale of the value in a hidden-unit cluster: current value, increased value, decreased value.
More positive values preserved / more negative values preserved; update towards positive / update towards negative.
Sentence A: I love the food, though the staff is not helpful
Sentence B: The staff is not helpful, though I love the food

[Plot: hidden-state updates for sentences A and B on a negative-to-positive scale, showing updates towards positive and towards negative]
[Model comparison, accuracy on test set: 88.6% vs. 91.9%]
Contact: Yao Ming, ymingaa@connect.ust.hk
Page: www.myaooo.com/rnnvis
Code: www.github.com/myaooo/rnnvis
The output of an RNN at step $t$ is typically a probability distribution:

$$p_k = \mathrm{softmax}(V h^{(t)})_k = \frac{\exp(v_k^\top h^{(t)})}{\sum_j \exp(v_j^\top h^{(t)})}$$

where $V = [v_1, \dots, v_K]^\top$, $k = 1, 2, \dots, K$, is the output projection matrix.

The numerator of $p_k$ can be decomposed (assuming $h^{(0)} = 0$) as:

$$\exp(v_k^\top h^{(t)}) = \exp\Big(\sum_{\tau=1}^{t} v_k^\top \big(h^{(\tau)} - h^{(\tau-1)}\big)\Big) = \prod_{\tau=1}^{t} \exp(v_k^\top \Delta h^{(\tau)})$$

Here $\exp(v_k^\top \Delta h^{(\tau)})$ is the multiplicative contribution of the input word $x^{(\tau)}$; the hidden-state update $\Delta h^{(\tau)}$ can be regarded as the model's response to $x^{(\tau)}$.
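This telescoping decomposition can be verified numerically: the softmax numerator computed directly from the final hidden state equals the product of per-step multiplicative contributions. A small sketch with random states (the dimensions and random data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, K = 6, 4, 5  # time steps, hidden size, output vocabulary size

# Hidden-state trajectory h(0)..h(T), with h(0) = 0 as the decomposition assumes.
h = np.vstack([np.zeros(n), rng.standard_normal((T, n))])
V = rng.standard_normal((K, n))  # output projection matrix, rows v_k

k, t = 2, T
# Direct numerator: exp(v_k^T h(t))
direct = np.exp(V[k] @ h[t])

# Product of per-step contributions exp(v_k^T delta_h(tau)), tau = 1..t
deltas = h[1:t + 1] - h[0:t]            # each row is delta_h(tau)
product = np.prod(np.exp(V[k] @ deltas.T))

assert np.isclose(direct, product)
print("decomposition holds")
```

Because the sum of the $\Delta h^{(\tau)}$ telescopes to $h^{(t)} - h^{(0)}$, the identity holds exactly (up to floating point) whenever $h^{(0)} = 0$.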
User study procedure:
1. Show a tutorial video
2. Explore the tool
3. Compare two models
4. Answer questions
5. Finish a survey
Left (A-C): co-cluster visualization of the last layer of an RNN. Right (D-F): visualization of the cell states of the last layer of an LSTM. Bottom (G-H): two models' responses to the same word "offer".