- EN. 601.467/667
Introduction to Human Language Technology: Deep Learning I
Shinji Watanabe
Today's agenda:
- Introduction of deep neural networks
- Basics of neural networks

Short bio / research interests: automatic speech recognition, and the application of machine learning to speech processing
Reference:          "I want to go to the Johns Hopkins campus"
Recognition result: "I want to go to the 10 top kids campus"
[Figure: Word error rate (WER) on the Switchboard task (telephone conversation speech), 1995-2016 (Pallett'03, Saon'15, Xiong'16). Deep learning drove the WER down to 5.9% in 2016.]
$$\hat{W} = \mathop{\mathrm{argmax}}_{W} p(W|O)
= \mathop{\mathrm{argmax}}_{W} \sum_{L} p(W, L|O)
= \mathop{\mathrm{argmax}}_{W} \sum_{L} p(O|L, W)\, p(L, W)
= \mathop{\mathrm{argmax}}_{W} \sum_{L} \underbrace{p(O|L)}_{\text{Acoustic model}}\, p(L|W)\, \underbrace{p(W)}_{\text{Language model}}$$
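As a toy illustration of this decomposition, the hypothetical probabilities below score two candidate word sequences by combining acoustic, lexicon, and language model factors (all numbers are made up, and each word sequence gets a single pronunciation for simplicity):

```python
# Toy illustration of the hybrid ASR decomposition
#   W_hat = argmax_W sum_L p(O|L) p(L|W) p(W)
# All probabilities below are made-up numbers.

acoustic = {"L1": 0.02, "L2": 0.05}               # p(O|L): acoustic model
lexicon = {("W1", "L1"): 1.0, ("W2", "L2"): 1.0}  # p(L|W): lexicon
lm = {"W1": 0.6, "W2": 0.1}                       # p(W): language model

def score(W):
    # sum over pronunciations L of p(O|L) p(L|W) p(W)
    return sum(acoustic[L] * pLW * lm[w]
               for (w, L), pLW in lexicon.items() if w == W)

best = max(["W1", "W2"], key=score)
print(best)  # "W1": 0.02 * 1.0 * 0.6 = 0.012 beats 0.05 * 1.0 * 0.1 = 0.005
```

Note how the strong language model prior p(W) overturns the acoustic model's preference: this is exactly why decoding multiplies all three factors.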
A linear decision boundary in the two-dimensional feature space $(p_1, p_2)$:
$$b_1 p_1 + b_2 p_2 + c = 0$$
from http://cs.jhu.edu/~kevinduh/a/deep2014/140114-ResearchSeminar.pdf

The classifier compares class discriminant functions, e.g., decide /a/ when $C(\mathbf{p}|/a/) > C(\mathbf{p}|/k/)$.
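A minimal sketch of such a linear classifier (the weights b and bias c are hypothetical, chosen only to illustrate the two sides of the boundary):

```python
import numpy as np

# A linear decision boundary b1*p1 + b2*p2 + c = 0 in the 2-D feature
# space (p1, p2). The weights b and bias c are hypothetical.
b = np.array([1.0, -1.0])
c = 0.5

def classify(p):
    # sign of the discriminant picks the class: /a/ vs. /k/
    return "/a/" if b @ p + c > 0 else "/k/"

print(classify(np.array([2.0, 1.0])))  # 2 - 1 + 0.5 = 1.5 > 0 -> /a/
print(classify(np.array([0.0, 2.0])))  # 0 - 2 + 0.5 = -1.5 < 0 -> /k/
```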
DNN-based acoustic model:
- Input speech features: log mel filterbank + 11 context frames
- ~7 hidden layers, 2048 units each
- Output: HMM state or phoneme (a, i, u, ..., w, N), 30 ~ 10,000 units
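A forward-pass sketch of this architecture with untrained random weights: only the layer sizes follow the slide, while the 40-dim filterbank, the 3000 output states, and all weight values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, context, n_hidden_layers, hidden, n_states = 40, 11, 7, 2048, 3000

# Splice 11 context frames of 40-dim log mel filterbank features into one vector
x = rng.standard_normal((context, feat_dim)).reshape(-1)   # 440-dim input

h = x
for _ in range(n_hidden_layers):
    W = rng.standard_normal((hidden, h.shape[0])) * 0.01   # random (untrained) weights
    h = 1.0 / (1.0 + np.exp(-(W @ h)))                     # sigmoid hidden layer

W_out = rng.standard_normal((n_states, hidden)) * 0.01
logits = W_out @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                       # softmax over HMM states
print(probs.shape)                                         # (3000,) state posteriors
```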
from https://en.wikipedia.org/wiki/Geoffrey_Hinton
A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.
training
→ Provides stable estimation
[Figure: network output layer over phonemes (a, i, u, ..., w, N), 1,000 ~ 10,000 units]
References:
- (2011): "Analysis and comparison of recent MLP features for LVCSR systems," in INTERSPEECH 2011, pp. 1245-1248.
- Seide, Frank / Li, Gang / Yu, Dong (2011): "Conversational speech transcription using context-dependent deep neural networks," in INTERSPEECH 2011, pp. 437-440.
- Burget, Lukáš / Černocký, Jan, et al. (2011): "Empirical evaluation and combination," in INTERSPEECH 2011.
Word error rates (%):
                                            Hub5 '00 (SWB)   WSJ
GMM                                              18.6        5.6
DNN                                              14.2        3.6
DNN with sequence-discriminative training        12.6        3.2
Meeting, October, 2013
ImageNet classification error rate by year:
2010 (NEC)       28.2
2011 (Xerox)     25.8
2012 (Toronto)   16.4
2013 (Clarifai)  11.7
2014 (Google)     6.7
[Figure: RNN language model. Given the history "<s>" "I" "want" "to" "go" "to", each step reads the previous word w_t and predicts the next word y_t ("I", "want", "to", "go", "to", "my", ...).]

Perplexity:
N-gram (conventional)  336
RNNLM                  156
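A minimal RNN language model in the same spirit: the vocabulary, sizes, and random (untrained) weights below are all illustrative, so the resulting perplexity value is meaningless; the point is the recurrence and how perplexity is computed.

```python
import numpy as np

# Minimal RNN language model: at each step the previous word and the
# hidden state predict the next word.
rng = np.random.default_rng(0)
vocab = ["<s>", "I", "want", "to", "go"]
V, H = len(vocab), 16
E = rng.standard_normal((V, H)) * 0.1  # input word embeddings
W = rng.standard_normal((H, H)) * 0.1  # recurrent weights
U = rng.standard_normal((V, H)) * 0.1  # output projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

sent = ["<s>", "I", "want", "to", "go"]
h, log_prob = np.zeros(H), 0.0
for prev, nxt in zip(sent[:-1], sent[1:]):
    h = np.tanh(E[vocab.index(prev)] + W @ h)  # update hidden state
    p = softmax(U @ h)                         # p(next word | history)
    log_prob += np.log(p[vocab.index(nxt)])

perplexity = np.exp(-log_prob / (len(sent) - 1))
print(perplexity)
```

Unlike an N-gram model, the hidden state h carries the entire history, which is why RNNLMs reach lower perplexity than fixed-order N-grams.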
from https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html
The DNN acoustic model above takes a fixed window of input speech features (log mel filterbank + 11 context frames): it simply concatenates the left and right contexts and throws them into the network (longer-range temporal structure is instead handled by an HMM during recognition).
Derivatives of each component (here $\varepsilon(j, j')$ denotes Kronecker's delta):

Linear transformation $y_{j} = \sum_{j''} w_{j j''} h_{j''} + b_{j}$:
$$\frac{\partial y_j}{\partial b_{j'}} = \varepsilon(j, j'), \qquad \frac{\partial y_j}{\partial w_{j' j''}} = \varepsilon(j, j')\, h_{j''}$$

Sigmoid activation $\tau(y)$:
$$\frac{\partial \tau(y)}{\partial y} = \tau(y)\,(1 - \tau(y))$$

Softmax activation $q(k|\mathbf{i})$ with pre-activation $v_k$:
$$\frac{\partial q(k|\mathbf{i})}{\partial v_k} = q(k|\mathbf{i})\,(1 - q(k|\mathbf{i})), \qquad \frac{\partial q(j|\mathbf{i})}{\partial v_k} = -q(j|\mathbf{i})\, q(k|\mathbf{i}) \quad (j \neq k)$$

Combined:
$$\frac{\partial q(j|\mathbf{i})}{\partial v_k} = q(k|\mathbf{i})\,\bigl(\varepsilon(j,k) - q(j|\mathbf{i})\bigr)$$
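The combined softmax derivative can be verified numerically against a finite-difference approximation (the logit vector v below is arbitrary):

```python
import numpy as np

# Numerical check of the combined softmax derivative:
#   d q(j|i) / d v_k = q(k|i) * (eps(j,k) - q(j|i)),  eps = Kronecker's delta
rng = np.random.default_rng(0)
v = rng.standard_normal(4)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

q = softmax(v)
h = 1e-6
for j in range(4):
    for k in range(4):
        analytic = q[k] * ((1.0 if j == k else 0.0) - q[j])
        v2 = v.copy(); v2[k] += h                 # perturb logit k
        numeric = (softmax(v2)[j] - q[j]) / h     # finite difference
        assert abs(analytic - numeric) < 1e-5
print("all softmax derivatives match")
```

Note that the single combined formula covers both the diagonal (j = k) and off-diagonal (j ≠ k) cases above.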
The network maps input $\mathbf{p}_t \in \mathbb{R}^N$ to output $t \in \{1, \dots, K\}$, and is built entirely from simple components:
- Linear transformation
- Sigmoid activation
- Softmax activation
- Elementary operations: $+$, $-$, $\exp$, $\log$, etc.

Deeper networks are obtained by stacking additional linear transformation + sigmoid activation pairs before the final softmax.
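These components compose into a complete forward pass; here is a sketch with illustrative sizes (N, H, K) and random weights:

```python
import numpy as np

# Forward pass composed of the components above:
# linear transformation -> sigmoid -> linear transformation -> softmax,
# mapping p in R^N to a distribution over K classes.
rng = np.random.default_rng(0)
N, H, K = 5, 8, 3
W1, b1 = rng.standard_normal((H, N)), np.zeros(H)
W2, b2 = rng.standard_normal((K, H)), np.zeros(K)

def forward(p):
    h = 1.0 / (1.0 + np.exp(-(W1 @ p + b1)))  # linear + sigmoid
    z = W2 @ h + b2                           # linear
    e = np.exp(z - z.max())                   # softmax (numerically stabilized)
    return e / e.sum()

q = forward(rng.standard_normal(N))
print(q.shape)  # (3,): a distribution over K classes
```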
Backpropagation: we just combine these derivatives with the chain rule, working backward through the network:
Softmax activation → Linear transformation → Sigmoid activation → Linear transformation
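A sketch of this combination for such a network (backward through softmax → linear → sigmoid → linear), with a finite-difference check of the resulting input gradient; sizes and the random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, K = 4, 6, 3
W1 = rng.standard_normal((H, N))
W2 = rng.standard_normal((K, H))
p, t = rng.standard_normal(N), 1  # input and target class

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(p):
    return -np.log(softmax(W2 @ sigmoid(W1 @ p))[t])  # cross-entropy

# forward
h = sigmoid(W1 @ p)
q = softmax(W2 @ h)
# backward (chain rule, combining the derivatives above)
dz = q.copy(); dz[t] -= 1.0  # softmax + cross-entropy derivative
dh = W2.T @ dz               # through the second linear transformation
dy = dh * h * (1.0 - h)      # through the sigmoid
dp = W1.T @ dy               # through the first linear transformation

# finite-difference check of one input-gradient component
eps = 1e-6
p2 = p.copy(); p2[0] += eps
assert abs(dp[0] - (loss(p2) - loss(p)) / eps) < 1e-3
```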
Stochastic gradient descent with minibatches: split the whole data (the batch) into minibatches, and update the parameters Θ after each minibatch instead of after a full pass over the batch.
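A minibatch SGD sketch on a toy 1-D linear regression (the model, data sizes, and hyperparameters are all illustrative):

```python
import numpy as np

# Split the whole data into minibatches and update theta after each one.
rng = np.random.default_rng(0)
X = rng.standard_normal(1000)
Y = 3.0 * X + 0.1 * rng.standard_normal(1000)  # true slope 3.0

theta, lr, batch_size = 0.0, 0.1, 100
for epoch in range(5):
    perm = rng.permutation(len(X))             # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]   # one minibatch
        grad = np.mean(2.0 * (theta * X[idx] - Y[idx]) * X[idx])
        theta -= lr * grad                     # parameter update
print(round(theta, 2))  # close to the true slope 3.0
```

Each epoch performs ten parameter updates instead of one, which is why minibatch SGD converges with far fewer passes over the data than full-batch gradient descent.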