EN.601.467/667 Introduction to Human Language Technology: Deep Learning I (PowerPoint PPT presentation)

SLIDE 1
  • EN.601.467/667

Introduction to Human Language Technology: Deep Learning I

Shinji Watanabe

SLIDE 2

Today’s agenda

  • Introduction of deep neural network
  • Basics of neural network

SLIDE 3

Short bio

  • Research interests: automatic speech recognition (ASR), speech enhancement, application of machine learning to speech processing
  • Around 20 years of ASR experience since 2001
SLIDE 4

Speech recognition evaluation metric

  • Word error rate (WER)
  • Using edit distance word-by-word:
  • # insertion errors = 1, # substitution errors = 2, # deletion errors = 0 ➡ Edit distance = 3
  • Word error rate (%): Edit distance (=3) / # reference words (=9) * 100 = 33.3%
  • How to compute WERs for languages that do not have word boundaries?
  • Chunking or using character error rate

Reference) I want to go to the Johns Hopkins campus
Recognition result) I want to go to the 10 top kids campus
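The WER computation above can be sketched as a standard word-level edit distance (dynamic programming over insertions, deletions, and substitutions). This is a minimal stdlib-Python illustration; `edit_distance` is our own helper name, not from any ASR toolkit.

```python
def edit_distance(ref, hyp):
    # Levenshtein distance over word sequences; each insertion,
    # deletion, or substitution costs 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)]

ref = "I want to go to the Johns Hopkins campus".split()
hyp = "I want to go to the 10 top kids campus".split()
dist = edit_distance(ref, hyp)
wer = 100.0 * dist / len(ref)
print(dist, round(wer, 1))  # 3 33.3
```

This reproduces the slide's numbers: edit distance 3 over 9 reference words gives a 33.3% WER.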

SLIDE 5

2001: when I started speech recognition….

[Figure: word error rate (WER) on the Switchboard task (telephone conversation speech), 1995 to 2016, log scale from 1% to 100% (Pallett’03, Saon’15, Xiong’16)]

SLIDE 6

Really bad age….

  • No application
  • No breakthrough technologies
  • Everyone outside speech research criticized it…
  • General people don’t know “what is speech recognition”
SLIDE 7

SLIDE 8

Now we are at

[Figure: WER on the Switchboard task (telephone conversation speech), 1995 to 2016, with the deep learning era marked; 5.9% WER in 2016 (Pallett’03, Saon’15, Xiong’16)]

SLIDE 9

Everything changed

  • No application
  • No breakthrough technologies
  • Everyone outside speech research criticized it…
  • General people don’t know “what is speech recognition”
SLIDE 10

Everything changed

  • No application → voice search, smart speakers
  • No breakthrough technologies
  • Everyone outside speech research criticized it…
  • General people don’t know “what is speech recognition”
SLIDE 11

Everything changed

  • No application → voice search, smart speakers
  • No breakthrough technologies → deep neural network
  • Everyone outside speech research criticized it…
  • General people don’t know “what is speech recognition”
SLIDE 12

Everything changed

  • No application → voice search, smart speakers
  • No breakthrough technologies → deep neural network
  • Everyone outside speech research criticized it… → many people outside speech research know/respect it
  • General people don’t know “what is speech recognition”
SLIDE 13

Everything changed

  • No application → voice search, smart speakers
  • No breakthrough technologies → deep neural network
  • Everyone outside speech research criticized it… → many people outside speech research know/respect it
  • General people don’t know “what is speech recognition” → now my wife knows what I’m doing

SLIDE 14

SLIDE 15

Acoustic model

  • From Bayes decision theory to the acoustic model:

      argmax_W p(W|O) = argmax_W Σ_L p(W, L|O)
                      = argmax_W Σ_L p(O|L, W) p(L, W)
                      = argmax_W Σ_L p(O|L) p(L|W) p(W)

      Acoustic model: p(O|L)    Language model: p(W)

SLIDE 16

GMM/HMM

  • Given HMM state j, we can represent the likelihood function as a Gaussian mixture model (GMM)
  • A deep neural network acoustic model only replaces this GMM representation with a neural network
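As a rough sketch of the GMM likelihood that the DNN later replaces, here is a diagonal-covariance mixture log-likelihood in plain Python. The mixture weights and Gaussian parameters below are illustrative toy values, not trained ones.

```python
import math

def gauss_logpdf(o, mean, var):
    # log N(o; mean, diag(var)) for a diagonal-covariance Gaussian
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))

def gmm_loglik(o, weights, means, vars_):
    # log p(o | state j) = log sum_m w_m N(o; mu_m, Sigma_m),
    # computed with log-sum-exp for numerical stability
    logs = [math.log(w) + gauss_logpdf(o, m, v)
            for w, m, v in zip(weights, means, vars_)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

# Toy 2-mixture, 2-dimensional example (illustrative numbers only)
w = [0.6, 0.4]
mu = [[0.0, 0.0], [3.0, 3.0]]
var = [[1.0, 1.0], [1.0, 1.0]]
print(gmm_loglik([0.1, -0.2], w, mu, var))
```

Observations near a mixture mean score much higher than observations far from all components, which is exactly the behaviour the per-state likelihood is meant to capture.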

SLIDE 17

Problem

  • Input MFCC vector: p_u
  • Output phoneme (or HMM state): t_u ∈ {/a/, /k/, …}
  • How to find a probabilistic distribution q(t_u|p_u)???
  • We use a large amount of paired data {(p_u, t_u)}_(u=1)^U to train the model parameters of the distribution

SLIDE 18

Very easy case

  • We can use a linear classifier
  • /a/: b1*p1 + b2*p2 + c ≥ 0
  • /k/: b1*p1 + b2*p2 + c < 0
  • We can also make a probability with the sigmoid function τ()
  • q(/a/|p) = τ(b1*p1 + b2*p2 + c)
  • q(/k/|p) = 1 - τ(b1*p1 + b2*p2 + c)
  • Sigmoid function: τ(y) = 1 / (1 + e^(-y))

[Figure: /a/ and /k/ samples in the (p1, p2) plane separated by the line b1*p1 + b2*p2 + c = 0; from http://cs.jhu.edu/~kevinduh/a/deep2014/140114-ResearchSeminar.pdf]
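The two-class sigmoid classifier on this slide can be written directly in a few lines; the weights b = (1, -1) and bias c = 0 below are arbitrary illustrative values, not from the slide.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def q_a(p, b=(1.0, -1.0), c=0.0):
    # q(/a/ | p) = sigmoid(b1*p1 + b2*p2 + c); q(/k/ | p) = 1 - q(/a/ | p)
    return sigmoid(b[0] * p[0] + b[1] * p[1] + c)

p = (2.0, 0.5)
print(q_a(p), 1.0 - q_a(p))
```

On the decision boundary (b1*p1 + b2*p2 + c = 0) the sigmoid gives exactly 0.5, so the probabilistic classifier agrees with the hard linear one.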

SLIDE 19

Very easy case

  • We can use a GMM (although not so suitable)

      q(p|/a/) = Σ_m w_m N(p|μ_m, Σ_m)
      q(p|/k/) = Σ_m w′_m N(p|μ′_m, Σ′_m)

  • q(/a/|p) ≈ q(p|/a/) / (q(p|/a/) + q(p|/k/))

[Figure: /a/ and /k/ clusters in the (p1, p2) plane, each modeled by a Gaussian mixture]

SLIDE 20

Getting more difficult with the GMM classifier

  • r linear classifier

[Figure: /a/ and /k/ samples in the (p1, p2) plane forming a more complicated pattern]

SLIDE 21

Getting more difficult with the GMM classifier

  • r linear classifier

[Figure: /a/ and /k/ samples in the (p1, p2) plane forming a more complicated pattern]

SLIDE 22

Neural network

  • Combination of linear classifiers to classify complicated patterns
  • More layers, more complicated patterns
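A classic illustration of "combining linear classifiers classifies complicated patterns" is XOR, which no single linear boundary can separate but two combined linear units can. The hand-picked weights below are our own illustrative values.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def xor_net(p1, p2):
    # Two linear classifiers (hidden units) whose combination carves out
    # a pattern no single linear boundary can: XOR.
    h1 = sigmoid(20 * p1 + 20 * p2 - 10)    # roughly "p1 OR p2"
    h2 = sigmoid(-20 * p1 - 20 * p2 + 30)   # roughly "NOT (p1 AND p2)"
    return sigmoid(20 * h1 + 20 * h2 - 30)  # roughly "h1 AND h2"

for p1, p2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(p1, p2, round(xor_net(p1, p2)))  # 0, 1, 1, 0
```

Each hidden unit is itself just a sigmoid linear classifier from the previous slide; stacking them is what makes the non-linear decision region possible.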

SLIDE 23

Neural network used in speech recognition

  • Very large combination of linear classifiers

[Figure: DNN acoustic model. Input speech features (log mel filterbank + 11 context frames); ~7 hidden layers, 2048 units each; output HMM state or phoneme (a, i, u, …, w, N), 30 to 10,000 units]

SLIDE 24

Why neural networks were not focused on

  • 1. Very difficult to train
  • Batch? On-line? Mini-batch?
  • Stochastic gradient descent
  • Learning rate? Scheduling?
  • What kind of topologies?
  • Large computational cost
  • 2. The amount of training data is very critical
  • 3. CPU -> GPU

[Figure: the same DNN acoustic model diagram as on the previous slide]

SLIDE 25

Before deep learning (2002 – 2009)

  • The success of neural networks dates from a much older period
  • People believed that GMM was better
  • But very small gain over standard GMMs

SLIDE 26


from https://en.wikipedia.org/wiki/Geoffrey_Hinton

SLIDE 27

When I noticed deep learning (2010)

  • A. Mohamed, G. E. Dahl, and G. E. Hinton, “Deep belief networks for phone recognition,” in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.
  • This still did not fully convince me (I introduced it at NTT’s reading group)
  • Using a deep belief network as pre-training
  • Fine-tuning the deep neural network
    → Provides stable estimation

SLIDE 28

Pre-training and fine-tuning

  • First train neural-network-like parameters with a deep belief network or autoencoder
  • Then continue with standard deep neural network training

[Figure: stacked network layers with 1,000 to 10,000 units per layer; output phonemes (a, i, u, …, w, N)]

SLIDE 29

Interspeech 2011 at Florence

  • The following three papers convinced me
  • Feature extraction: Valente, Fabio / Magimai-Doss, Mathew / Wang, Wen (2011): "Analysis and comparison of recent MLP features for LVCSR systems", In INTERSPEECH-2011, 1245-1248.
  • Acoustic model: Seide, Frank / Li, Gang / Yu, Dong (2011): "Conversational speech transcription using context-dependent deep neural networks", In INTERSPEECH-2011, 437-440.
  • Language model: Mikolov, Tomáš / Deoras, Anoop / Kombrink, Stefan / Burget, Lukáš / Černocký, Jan (2011): "Empirical evaluation and combination of advanced language modeling techniques", In INTERSPEECH-2011, 605-608.
  • I discussed this potential with my NLP folks at NTT but they did not believe it (SVM, log linear model)

SLIDE 30

Late 2012

  • My first deep learning (Kaldi nnet)
  • Kaldi started to support DNNs in 2012 (mainly developed by Karel Vesely)
  • Deep belief network based pre-training
  • Feed-forward neural network
  • Sequence-discriminative training

  (WER, %)                                    Hub5 ’00 (SWB)   WSJ
  GMM                                              18.6        5.6
  DNN                                              14.2        3.6
  DNN with sequence-discriminative training        12.6        3.2

SLIDE 31

SLIDE 32

Build speech recognition with public tools and resources

  • TED-LIUM (~100 hours)
  • LIBRISPEECH (~1000 hours)
  • We can use Kaldi + DNN + TED-LIUM to build an English speech recognition system on a single machine (GPU + many-core machine)
  • Before this, it was only possible for big companies.

SLIDE 33

The same thing happened in computer vision

SLIDE 34


from https://en.wikipedia.org/wiki/Geoffrey_Hinton

SLIDE 35

ImageNet challenge (Large scale data)

  • L. Fei-Fei and O. Russakovsky, Analysis of Large-Scale Visual Recognition, Bay Area Vision Meeting, October, 2013

SLIDE 36

ImageNet challenge: AlexNet, GoogLeNet, VGG, ResNet, …

  Error rate (%):
  2010 (NEC)       28.2
  2011 (Xerox)     25.8
  2012 (Toronto)   16.4  ← Deep learning!
  2013 (Clarifai)  11.7
  2014 (Google)     6.7

SLIDE 37

The same thing happened in text processing

  • Recurrent neural network language model (RNNLM) [Mikolov+ (2010)]

[Figure: RNNLM predicting the next word in “<s> I want to go to the …”]

  Perplexity: N-gram (conventional) 336, RNNLM 156

SLIDE 38

Word embedding example

https://www.tensorflow.org/tutorials/word2vec

SLIDE 39

Neural machine translation (New York Times, December 2016)

SLIDE 40


from https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html

SLIDE 41

Deep neural network toolkit (2013-)

  • Theano
  • Caffe
  • Torch
  • CNTK
  • Chainer
  • Keras

Later

  • TensorFlow
  • PyTorch (used in this course)
  • MXNet
  • Etc.

SLIDE 42

Summary

  • Before 2011
  • GMM/HMM, limitation of the performance, a bit boring
  • This is because they are linear models…
  • After 2011
  • DNN/HMM
  • Toolkits
  • Public large data
  • GPUs
  • NLP and image/vision also moved to DNNs
  • Always something exciting

SLIDE 43

Today’s agenda

  • Introduction of deep neural network
  • Basics of neural network

SLIDE 44

Feed-forward neural network for acoustic model

  • Configurations
  • Input features
  • Context expansion
  • Output class
  • Softmax function
  • Training criterion
  • Number of layers
  • Number of hidden states
  • Type of non-linear activations

[Figure: the DNN acoustic model diagram again. Input speech features (log mel filterbank + 11 context frames); ~7 hidden layers, 2048 units each; output HMM state, 30 to 10,000 units]

SLIDE 45

Input feature

  • GMM/HMM formulation
  • Lots of conditional independence assumptions and Markov assumptions
  • Many of our trials are about how to break these assumptions
  • In GMM, we always have to care about the correlation
  • Delta, linear discriminant analysis, semi-tied covariance
  • In DNN, we don’t have to care :)
  • We can simply concatenate the left and right contexts, and just throw it in!

SLIDE 46

Output

  • Phoneme or HMM state ID is used
  • We need paired output and input data at frame u
  • First use the Viterbi alignment to obtain the state sequence
  • Then, we get the input and output pairs
  • Treat the acoustic model as a multiclass classification problem by predicting the HMM state ID given the observation
  • No constraints are considered at this stage (e.g., left-to-right, which is handled by an HMM during recognition)

SLIDE 47

Feed-forward neural networks

  • Affine transformation and non-linear activation function (sigmoid function)
  • Apply the above transformation L times
  • Softmax operation to get the probability distribution
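The three steps above (affine transformation, sigmoid, repeated L times, then softmax) can be sketched end-to-end in plain Python. The layer sizes and random weights below are arbitrary illustrative choices.

```python
import math, random

random.seed(0)

def affine(x, W, b):
    # g(x) = W x + b
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def sigmoid_vec(y):
    return [1.0 / (1.0 + math.exp(-v)) for v in y]

def softmax(y):
    mx = max(y)  # subtract the max for numerical stability
    e = [math.exp(v - mx) for v in y]
    s = sum(e)
    return [v / s for v in e]

def feedforward(x, layers):
    # layers: list of (W, b); sigmoid after every layer except the last,
    # softmax on the last to get a distribution over output classes
    for W, b in layers[:-1]:
        x = sigmoid_vec(affine(x, W, b))
    W, b = layers[-1]
    return softmax(affine(x, W, b))

def rand_layer(n_out, n_in):
    W = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return (W, [0.0] * n_out)

layers = [rand_layer(4, 3), rand_layer(4, 4), rand_layer(2, 4)]  # 3 -> 4 -> 4 -> 2
q = feedforward([0.2, -0.1, 0.5], layers)
print(q, sum(q))
```

Whatever the weights, the output is a valid probability distribution over the classes, which is what lets it serve as q(t|p) in the acoustic model.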

SLIDE 48

Linear operation

  • Transforms a D^(l-1)-dimensional input into a D^(l)-dimensional output:

      g(i^(l-1)) = X^(l) i^(l-1) + c^(l)

  • X^(l) ∈ ℝ^(D^(l) × D^(l-1)): linear transformation matrix
  • c^(l) ∈ ℝ^(D^(l)): bias vector
  • Derivatives (for one output y_j = Σ_k x_jk i_k + c_j):
  • ∂(Σ_k x_jk i_k + c_j) / ∂c_j′ = ε(j, j′)
  • ∂(Σ_k x_jk i_k + c_j) / ∂x_j′k′ = ε(j, j′) i_k′

SLIDE 49

Sigmoid function

  • Sigmoid function: τ(y) = 1 / (1 + e^(-y))
  • Converts the domain from ℝ to [0, 1]
  • Applied elementwise
  • No trainable parameters in general
  • Derivative: ∂τ(y)/∂y = τ(y)(1 - τ(y))
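The derivative τ′(y) = τ(y)(1 − τ(y)) is easy to sanity-check against a central finite difference:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def dsigmoid(y):
    # Analytical derivative from the slide: tau'(y) = tau(y) * (1 - tau(y))
    s = sigmoid(y)
    return s * (1.0 - s)

# Check against a central finite difference
y, eps = 0.7, 1e-6
numeric = (sigmoid(y + eps) - sigmoid(y - eps)) / (2 * eps)
print(abs(dsigmoid(y) - numeric) < 1e-8)  # True
```

This kind of numerical gradient check is also a standard way to debug hand-written backpropagation code.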

SLIDE 50

Softmax function

  • Softmax function: q(k|i) = exp(i_k) / Σ_k′ exp(i_k′)
  • Converts the domain from ℝ^K to [0, 1]^K (makes a multinomial dist. → classification)
  • Satisfies the sum-to-one condition, i.e., Σ_(k=1)^K q(k|i) = 1
  • K = 2: sigmoid function
  • Derivative
  • For j = k: ∂q(k|i)/∂i_j = q(k|i)(1 - q(k|i))
  • For j ≠ k: ∂q(k|i)/∂i_j = -q(j|i) q(k|i)
  • Or we can write ∂q(k|i)/∂i_j = q(k|i)(ε(j, k) - q(j|i)), where ε(j, k) is Kronecker’s delta
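The compact form ∂q(k|i)/∂i_j = q(k|i)(ε(j, k) − q(j|i)) can likewise be verified numerically (the input values below are arbitrary):

```python
import math

def softmax(y):
    mx = max(y)  # subtract the max for numerical stability
    e = [math.exp(v - mx) for v in y]
    s = sum(e)
    return [v / s for v in e]

def softmax_jacobian(y):
    # dq(k|i)/di_j = q(k|i) * (delta(j, k) - q(j|i))
    q = softmax(y)
    return [[q[k] * ((1.0 if j == k else 0.0) - q[j]) for j in range(len(y))]
            for k in range(len(y))]

y, eps = [0.1, -0.4, 1.2], 1e-6
J = softmax_jacobian(y)
for k in range(3):
    for j in range(3):
        yp = y[:]; yp[j] += eps
        ym = y[:]; ym[j] -= eps
        numeric = (softmax(yp)[k] - softmax(ym)[k]) / (2 * eps)
        assert abs(J[k][j] - numeric) < 1e-8
print("Jacobian matches finite differences")
```

Note that the j = k and j ≠ k cases on the slide are just the two branches of this single Kronecker-delta expression.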

SLIDE 51

What functions/operations we cannot use?

  • Functions/operations for which we cannot take a derivative, including some discrete operations
  • argmax_X q(X|P): a basic ASR operation, but we cannot take a derivative….
  • Discretization
  • Etc.

SLIDE 52

Objective function design

  • We usually use the cross entropy as an objective function
  • Since the Viterbi sequence is a hard assignment, the summation over states is simplified
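With a hard Viterbi target the cross entropy collapses to −log q(t|i), the negative log-probability of the single aligned state. A small check, using a toy output distribution and a hypothetical state ID:

```python
import math

def cross_entropy(q, target_dist):
    # CE = - sum_k p(k) log q(k); terms with p(k) = 0 vanish
    return -sum(p * math.log(qk) for p, qk in zip(target_dist, q) if p > 0)

q = [0.1, 0.7, 0.2]  # network output distribution over states (toy values)
t = 1                # Viterbi alignment gives a single (hard) state ID
one_hot = [1.0 if k == t else 0.0 for k in range(len(q))]

# With a hard assignment the sum over states collapses to a single term:
assert abs(cross_entropy(q, one_hot) - (-math.log(q[t]))) < 1e-12
print(cross_entropy(q, one_hot))
```

This is why frame-level DNN training with Viterbi alignments is usually described simply as minimizing −log q(t_u|p_u).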

SLIDE 53

Other objective functions

  • Square error: ‖i_ref - i‖²
  • We could also use the p-norm, e.g., the L1 norm
  • Binary cross entropy
  • Again, this is a special case of the cross entropy when the number of classes is two

SLIDE 54

Building blocks

[Figure: building blocks. Input: p_u ∈ ℝ^D; output: t_u ∈ {1, …, K}; linear transformation, sigmoid activation, softmax activation, +, -, exp, log, etc.]

SLIDE 55

Building blocks

[Figure: building blocks. Input: p_u ∈ ℝ^D; output: t_u ∈ {1, …, K}; linear transformation, sigmoid activation, softmax activation]

SLIDE 56

Building blocks

[Figure: building blocks stacked. Input: p_u ∈ ℝ^D; output: t_u ∈ {1, …, K}; linear transformation + sigmoid activation repeated, then softmax activation]

SLIDE 57

Building blocks

[Figure: building blocks. Input: p_u ∈ ℝ^D; output: t_u ∈ {1, …, K}; linear transformation, sigmoid activation, softmax activation, +, -, exp, log, etc.]

SLIDE 58

Building blocks

[Figure: building blocks. Input: p_u ∈ ℝ^D; output: t_u ∈ {1, …, K}; linear transformation, sigmoid activation, softmax activation]

SLIDE 59

How to optimize? Gradient descent and its variants

  • Take a derivative and update the parameters with this derivative
  • Chain rule

SLIDE 60

Deep neural network: nested function

  • Chain rule to get the derivative recursively
  • Each transformation (affine, sigmoid, and softmax) has an analytical derivative, and we just combine these derivatives
  • We can obtain the derivative with the back-propagation algorithm

[Figure: stacked blocks: linear transformation → sigmoid activation → linear transformation → softmax activation]
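Combining the softmax and cross-entropy derivatives from the previous slides via the chain rule gives the standard softmax-with-cross-entropy gradient q(j) − ε(j, t), which we can verify numerically (the toy logits below are arbitrary):

```python
import math

def softmax(y):
    mx = max(y)
    e = [math.exp(v - mx) for v in y]
    s = sum(e)
    return [v / s for v in e]

def ce_grad_wrt_logits(y, t):
    # Chain rule: combining dCE/dq with the softmax Jacobian gives the
    # well-known result dCE/dy_j = q(j) - delta(j, t)
    q = softmax(y)
    return [qj - (1.0 if j == t else 0.0) for j, qj in enumerate(q)]

y, t, eps = [0.3, -1.0, 0.8], 2, 1e-6
g = ce_grad_wrt_logits(y, t)
for j in range(3):
    yp = y[:]; yp[j] += eps
    ym = y[:]; ym[j] -= eps
    numeric = (-math.log(softmax(yp)[t]) + math.log(softmax(ym)[t])) / (2 * eps)
    assert abs(g[j] - numeric) < 1e-8
print("analytic gradient matches finite differences")
```

Back-propagation applies exactly this kind of combination layer by layer, passing the gradient back through each affine and sigmoid block in turn.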

SLIDE 61

Minibatch processing

  • Batch processing
  • Slow convergence
  • Efficient computation
  • Online processing
  • Fast convergence
  • Very inefficient computation
  • Minibatch processing
  • Something between batch and online processing

[Figure: the whole dataset (batch) split into minibatches, with the parameters updated after each minibatch]
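Splitting the whole batch into minibatches and updating after each one can be sketched as follows. The toy objective (fitting a scalar to the data mean by SGD) is ours, purely for illustration.

```python
import random

random.seed(0)

def minibatches(data, batch_size):
    # Shuffle once per epoch, then split the whole (batch) dataset
    # into consecutive minibatches
    data = data[:]
    random.shuffle(data)
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# Toy example: minimize the mean of (theta - x)^2 with minibatch SGD;
# the optimum is the data mean (4.5 here)
data = [float(x) for x in range(10)]
theta, lr = 0.0, 0.1
for epoch in range(200):
    for batch in minibatches(data, batch_size=3):
        grad = sum(2 * (theta - x) for x in batch) / len(batch)
        theta -= lr * grad  # one parameter update per minibatch
print(round(theta, 1))
```

Each epoch touches the full dataset (like batch processing) but makes several noisy updates along the way (like online processing), which is exactly the trade-off the slide describes.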

SLIDE 62

Summary of today’s talk

  • Deep learning changes the world
  • A lot of human language technologies are boosted by deep learning
  • Deep neural network basics
  • Input
  • Output
  • Function
  • Back propagation