EN.601.467/667 Introduction to Human Language Technology: Deep Learning I (PowerPoint PPT presentation)

SLIDE 1
  • EN.601.467/667

Introduction to Human Language Technology: Deep Learning I

Shinji Watanabe

SLIDE 2

Today’s agenda

  • Introduction of deep neural network
  • Basics of neural network

SLIDE 3

Short bio

  • Research interests: automatic speech recognition (ASR), speech enhancement, application of machine learning to speech processing
  • Around 20 years of ASR experience since 2001
SLIDE 4

Speech recognition evaluation metric

  • Word error rate (WER)
  • Using edit distance word-by-word:
  • # insertion errors = 1, # substitution errors = 2, # deletion errors = 0 ➡ Edit distance = 3
  • Word error rate (%): Edit distance (=3) / # reference words (=9) * 100 = 33.3%
  • How to compute WERs for languages that do not have word boundaries?
  • Chunking or using character error rate

Reference) I want to go to the Johns Hopkins campus
Recognition result) I want to go to the 10 top kids campus
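The WER computation above can be sketched as a standard word-level edit distance (dynamic programming over insertions, deletions, and substitutions). This is a minimal stdlib-Python illustration; `edit_distance` is our own helper name, not from any ASR toolkit.

```python
def edit_distance(ref, hyp):
    # Levenshtein distance over word sequences; each insertion,
    # deletion, or substitution costs 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)]

ref = "I want to go to the Johns Hopkins campus".split()
hyp = "I want to go to the 10 top kids campus".split()
dist = edit_distance(ref, hyp)
wer = 100.0 * dist / len(ref)
print(dist, round(wer, 1))  # 3 33.3
```

This reproduces the slide's numbers: edit distance 3 over 9 reference words gives a 33.3% WER.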

SLIDE 5

2001: when I started speech recognition….

[Figure: word error rate (WER) on the Switchboard task (telephone conversation speech), 1995 to 2016, log scale from 1% to 100% (Pallett’03, Saon’15, Xiong’16)]

SLIDE 6

Really bad age….

  • No application
  • No breakthrough technologies
  • Everyone outside speech research criticized it…
  • General people don’t know “what is speech recognition”
SLIDE 7

SLIDE 8

Now we are at

[Figure: WER on the Switchboard task (telephone conversation speech), 1995 to 2016, with the deep learning era marked; 5.9% WER in 2016 (Pallett’03, Saon’15, Xiong’16)]

SLIDE 9

Everything changed

  • No application
  • No breakthrough technologies
  • Everyone outside speech research criticized it…
  • General people don’t know “what is speech recognition”
SLIDE 10

Everything changed

  • No application → voice search, smart speakers
  • No breakthrough technologies
  • Everyone outside speech research criticized it…
  • General people don’t know “what is speech recognition”
SLIDE 11

Everything changed

  • No application → voice search, smart speakers
  • No breakthrough technologies → deep neural network
  • Everyone outside speech research criticized it…
  • General people don’t know “what is speech recognition”
SLIDE 12

Everything changed

  • No application → voice search, smart speakers
  • No breakthrough technologies → deep neural network
  • Everyone outside speech research criticized it… → many people outside speech research know/respect it
  • General people don’t know “what is speech recognition”
SLIDE 13

Everything changed

  • No application → voice search, smart speakers
  • No breakthrough technologies → deep neural network
  • Everyone outside speech research criticized it… → many people outside speech research know/respect it
  • General people don’t know “what is speech recognition” → now my wife knows what I’m doing

SLIDE 14

SLIDE 15

Acoustic model

  • From Bayes decision theory to the acoustic model:

      argmax_W p(W|O) = argmax_W Σ_L p(W, L|O)
                      = argmax_W Σ_L p(O|L, W) p(L, W)
                      = argmax_W Σ_L p(O|L) p(L|W) p(W)

      Acoustic model: p(O|L)    Language model: p(W)

SLIDE 16

GMM/HMM

  • Given HMM state j, we can represent the likelihood function as a Gaussian mixture model (GMM)
  • A deep neural network acoustic model only replaces this GMM representation with a neural network
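As a rough sketch of the GMM likelihood that the DNN later replaces, here is a diagonal-covariance mixture log-likelihood in plain Python. The mixture weights and Gaussian parameters below are illustrative toy values, not trained ones.

```python
import math

def gauss_logpdf(o, mean, var):
    # log N(o; mean, diag(var)) for a diagonal-covariance Gaussian
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))

def gmm_loglik(o, weights, means, vars_):
    # log p(o | state j) = log sum_m w_m N(o; mu_m, Sigma_m),
    # computed with log-sum-exp for numerical stability
    logs = [math.log(w) + gauss_logpdf(o, m, v)
            for w, m, v in zip(weights, means, vars_)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

# Toy 2-mixture, 2-dimensional example (illustrative numbers only)
w = [0.6, 0.4]
mu = [[0.0, 0.0], [3.0, 3.0]]
var = [[1.0, 1.0], [1.0, 1.0]]
print(gmm_loglik([0.1, -0.2], w, mu, var))
```

Observations near a mixture mean score much higher than observations far from all components, which is exactly the behaviour the per-state likelihood is meant to capture.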

SLIDE 17

Problem

  • Input MFCC vector: p_u
  • Output phoneme (or HMM state): t_u ∈ {/a/, /k/, …}
  • How to find a probabilistic distribution q(t_u|p_u)???
  • We use a large amount of paired data {(p_u, t_u)}_(u=1)^U to train the model parameters of the distribution

SLIDE 18

Very easy case

  • We can use a linear classifier
  • /a/: b1*p1 + b2*p2 + c ≥ 0
  • /k/: b1*p1 + b2*p2 + c < 0
  • We can also make a probability with the sigmoid function τ()
  • q(/a/|p) = τ(b1*p1 + b2*p2 + c)
  • q(/k/|p) = 1 - τ(b1*p1 + b2*p2 + c)
  • Sigmoid function: τ(y) = 1 / (1 + e^(-y))

[Figure: /a/ and /k/ samples in the (p1, p2) plane separated by the line b1*p1 + b2*p2 + c = 0; from http://cs.jhu.edu/~kevinduh/a/deep2014/140114-ResearchSeminar.pdf]
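The two-class sigmoid classifier on this slide can be written directly in a few lines; the weights b = (1, -1) and bias c = 0 below are arbitrary illustrative values, not from the slide.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def q_a(p, b=(1.0, -1.0), c=0.0):
    # q(/a/ | p) = sigmoid(b1*p1 + b2*p2 + c); q(/k/ | p) = 1 - q(/a/ | p)
    return sigmoid(b[0] * p[0] + b[1] * p[1] + c)

p = (2.0, 0.5)
print(q_a(p), 1.0 - q_a(p))
```

On the decision boundary (b1*p1 + b2*p2 + c = 0) the sigmoid gives exactly 0.5, so the probabilistic classifier agrees with the hard linear one.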

SLIDE 19

Very easy case

  • We can use a GMM (although not so suitable)

      q(p|/a/) = Σ_m w_m N(p|μ_m, Σ_m)
      q(p|/k/) = Σ_m w′_m N(p|μ′_m, Σ′_m)

  • q(/a/|p) ≈ q(p|/a/) / (q(p|/a/) + q(p|/k/))

[Figure: /a/ and /k/ clusters in the (p1, p2) plane, each modeled by a Gaussian mixture]

SLIDE 20

Getting more difficult with the GMM classifier

  • r linear classifier

[Figure: /a/ and /k/ samples in the (p1, p2) plane forming a more complicated pattern]

SLIDE 21

Getting more difficult with the GMM classifier

  • r linear classifier

[Figure: /a/ and /k/ samples in the (p1, p2) plane forming a more complicated pattern]

SLIDE 22

Neural network

  • Combination of linear classifiers to classify complicated patterns
  • More layers, more complicated patterns
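A classic illustration of "combining linear classifiers classifies complicated patterns" is XOR, which no single linear boundary can separate but two combined linear units can. The hand-picked weights below are our own illustrative values.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def xor_net(p1, p2):
    # Two linear classifiers (hidden units) whose combination carves out
    # a pattern no single linear boundary can: XOR.
    h1 = sigmoid(20 * p1 + 20 * p2 - 10)    # roughly "p1 OR p2"
    h2 = sigmoid(-20 * p1 - 20 * p2 + 30)   # roughly "NOT (p1 AND p2)"
    return sigmoid(20 * h1 + 20 * h2 - 30)  # roughly "h1 AND h2"

for p1, p2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(p1, p2, round(xor_net(p1, p2)))  # 0, 1, 1, 0
```

Each hidden unit is itself just a sigmoid linear classifier from the previous slide; stacking them is what makes the non-linear decision region possible.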

SLIDE 23

Neural network used in speech recognition

  • Very large combination of linear classifiers

[Figure: DNN acoustic model. Input speech features (log mel filterbank + 11 context frames); ~7 hidden layers, 2048 units each; output HMM state or phoneme (a, i, u, …, w, N), 30 to 10,000 units]

SLIDE 24

Why neural networks were not focused on

  • 1. Very difficult to train
  • Batch? On-line? Mini-batch?
  • Stochastic gradient descent
  • Learning rate? Scheduling?
  • What kind of topologies?
  • Large computational cost
  • 2. The amount of training data is very critical
  • 3. CPU -> GPU

[Figure: the same DNN acoustic model diagram as on the previous slide]

SLIDE 25

Before deep learning (2002 – 2009)

  • The success of neural networks dates from a much older period
  • People believed that GMM was better
  • But very small gain over standard GMMs

SLIDE 26


from https://en.wikipedia.org/wiki/Geoffrey_Hinton

SLIDE 27

When I noticed deep learning (2010)

  • A. Mohamed, G. E. Dahl, and G. E. Hinton, “Deep belief networks for phone recognition,” in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.
  • This still did not fully convince me (I introduced it at NTT’s reading group)
  • Using a deep belief network as pre-training
  • Fine-tuning the deep neural network
    → Provides stable estimation

SLIDE 28

Pre-training and fine-tuning

  • First train neural-network-like parameters with a deep belief network or autoencoder
  • Then continue with standard deep neural network training

[Figure: stacked network layers with 1,000 to 10,000 units per layer; output phonemes (a, i, u, …, w, N)]

SLIDE 29

Interspeech 2011 at Florence

  • The following three papers convinced me
  • Feature extraction: Valente, Fabio / Magimai-Doss, Mathew / Wang, Wen (2011): "Analysis and comparison of recent MLP features for LVCSR systems", In INTERSPEECH-2011, 1245-1248.
  • Acoustic model: Seide, Frank / Li, Gang / Yu, Dong (2011): "Conversational speech transcription using context-dependent deep neural networks", In INTERSPEECH-2011, 437-440.
  • Language model: Mikolov, Tomáš / Deoras, Anoop / Kombrink, Stefan / Burget, Lukáš / Černocký, Jan (2011): "Empirical evaluation and combination of advanced language modeling techniques", In INTERSPEECH-2011, 605-608.
  • I discussed this potential with my NLP folks at NTT but they did not believe it (SVM, log linear model)

SLIDE 30

Late 2012

  • My first deep learning (Kaldi nnet)
  • Kaldi started to support DNNs in 2012 (mainly developed by Karel Vesely)
  • Deep belief network based pre-training
  • Feed-forward neural network
  • Sequence-discriminative training

  (WER, %)                                    Hub5 ’00 (SWB)   WSJ
  GMM                                              18.6        5.6
  DNN                                              14.2        3.6
  DNN with sequence-discriminative training        12.6        3.2

SLIDE 31

SLIDE 32

Build speech recognition with public tools and resources

  • TED-LIUM (~100 hours)
  • LIBRISPEECH (~1000 hours)
  • We can use Kaldi + DNN + TED-LIUM to build an English speech recognition system on a single machine (GPU + many-core machine)
  • Before this, it was only possible for big companies.

SLIDE 33

The same thing happened in computer vision

SLIDE 34


from https://en.wikipedia.org/wiki/Geoffrey_Hinton

SLIDE 35

ImageNet challenge (Large scale data)

  • L. Fei-Fei and O. Russakovsky, Analysis of Large-Scale Visual Recognition, Bay Area Vision Meeting, October, 2013

SLIDE 36

ImageNet challenge: AlexNet, GoogLeNet, VGG, ResNet, …

  Error rate (%):
  2010 (NEC)       28.2
  2011 (Xerox)     25.8
  2012 (Toronto)   16.4  ← Deep learning!
  2013 (Clarifai)  11.7
  2014 (Google)     6.7

SLIDE 37

The same thing happened in text processing

  • Recurrent neural network language model (RNNLM) [Mikolov+ (2010)]

[Figure: RNNLM predicting the next word in “<s> I want to go to the …”]

  Perplexity: N-gram (conventional) 336, RNNLM 156

SLIDE 38

Word embedding example

https://www.tensorflow.org/tutorials/word2vec

SLIDE 39

Neural machine translation (New York Times, December 2016)

SLIDE 40


from https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html

SLIDE 41

Deep neural network toolkit (2013-)

  • Theano
  • Caffe
  • Torch
  • CNTK
  • Chainer
  • Keras

Later

  • TensorFlow
  • PyTorch (used in this course)
  • MXNet
  • Etc.

SLIDE 42

Summary

  • Before 2011
  • GMM/HMM, limitation of the performance, a bit boring
  • This is because they are linear models…
  • After 2011
  • DNN/HMM
  • Toolkits
  • Public large data
  • GPUs
  • NLP and image/vision also moved to DNNs
  • Always something exciting

SLIDE 43

Today’s agenda

  • Introduction of deep neural network
  • Basics of neural network

SLIDE 44

Feed-forward neural network for acoustic model

  • Configurations
  • Input features
  • Context expansion
  • Output class
  • Softmax function
  • Training criterion
  • Number of layers
  • Number of hidden states
  • Type of non-linear activations

[Figure: the DNN acoustic model diagram again. Input speech features (log mel filterbank + 11 context frames); ~7 hidden layers, 2048 units each; output HMM state, 30 to 10,000 units]

SLIDE 45

Input feature

  • GMM/HMM formulation
  • Lots of conditional independence assumptions and Markov assumptions
  • Many of our trials are about how to break these assumptions
  • In GMM, we always have to care about the correlation
  • Delta, linear discriminant analysis, semi-tied covariance
  • In DNN, we don’t have to care :)
  • We can simply concatenate the left and right contexts, and just throw it in!

SLIDE 46

Output

  • Phoneme or HMM state ID is used
  • We need paired output and input data at frame u
  • First use the Viterbi alignment to obtain the state sequence
  • Then, we get the input and output pairs
  • Treat the acoustic model as a multiclass classification problem by predicting the HMM state ID given the observation
  • No constraints are considered at this stage (e.g., left-to-right, which is handled by an HMM during recognition)

SLIDE 47

Feed-forward neural networks

  • Affine transformation and non-linear activation function (sigmoid function)
  • Apply the above transformation L times
  • Softmax operation to get the probability distribution
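The three steps above (affine transformation, sigmoid, repeated L times, then softmax) can be sketched end-to-end in plain Python. The layer sizes and random weights below are arbitrary illustrative choices.

```python
import math, random

random.seed(0)

def affine(x, W, b):
    # g(x) = W x + b
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def sigmoid_vec(y):
    return [1.0 / (1.0 + math.exp(-v)) for v in y]

def softmax(y):
    mx = max(y)  # subtract the max for numerical stability
    e = [math.exp(v - mx) for v in y]
    s = sum(e)
    return [v / s for v in e]

def feedforward(x, layers):
    # layers: list of (W, b); sigmoid after every layer except the last,
    # softmax on the last to get a distribution over output classes
    for W, b in layers[:-1]:
        x = sigmoid_vec(affine(x, W, b))
    W, b = layers[-1]
    return softmax(affine(x, W, b))

def rand_layer(n_out, n_in):
    W = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return (W, [0.0] * n_out)

layers = [rand_layer(4, 3), rand_layer(4, 4), rand_layer(2, 4)]  # 3 -> 4 -> 4 -> 2
q = feedforward([0.2, -0.1, 0.5], layers)
print(q, sum(q))
```

Whatever the weights, the output is a valid probability distribution over the classes, which is what lets it serve as q(t|p) in the acoustic model.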

SLIDE 48

Linear operation

  • Transforms a D^(l-1)-dimensional input into a D^(l)-dimensional output:

      g(i^(l-1)) = X^(l) i^(l-1) + c^(l)

  • X^(l) ∈ ℝ^(D^(l) × D^(l-1)): linear transformation matrix
  • c^(l) ∈ ℝ^(D^(l)): bias vector
  • Derivatives (for one output y_j = Σ_k x_jk i_k + c_j):
  • ∂(Σ_k x_jk i_k + c_j) / ∂c_j′ = ε(j, j′)
  • ∂(Σ_k x_jk i_k + c_j) / ∂x_j′k′ = ε(j, j′) i_k′

SLIDE 49

Sigmoid function

  • Sigmoid function: τ(y) = 1 / (1 + e^(-y))
  • Converts the domain from ℝ to [0, 1]
  • Applied elementwise
  • No trainable parameters in general
  • Derivative: ∂τ(y)/∂y = τ(y)(1 - τ(y))
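The derivative τ′(y) = τ(y)(1 − τ(y)) is easy to sanity-check against a central finite difference:

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def dsigmoid(y):
    # Analytical derivative from the slide: tau'(y) = tau(y) * (1 - tau(y))
    s = sigmoid(y)
    return s * (1.0 - s)

# Check against a central finite difference
y, eps = 0.7, 1e-6
numeric = (sigmoid(y + eps) - sigmoid(y - eps)) / (2 * eps)
print(abs(dsigmoid(y) - numeric) < 1e-8)  # True
```

This kind of numerical gradient check is also a standard way to debug hand-written backpropagation code.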

SLIDE 50

Softmax function

  • Softmax function: q(k|i) = exp(i_k) / Σ_k′ exp(i_k′)
  • Converts the domain from ℝ^K to [0, 1]^K (makes a multinomial dist. → classification)
  • Satisfies the sum-to-one condition, i.e., Σ_(k=1)^K q(k|i) = 1
  • K = 2: sigmoid function
  • Derivative
  • For j = k: ∂q(k|i)/∂i_j = q(k|i)(1 - q(k|i))
  • For j ≠ k: ∂q(k|i)/∂i_j = -q(j|i) q(k|i)
  • Or we can write ∂q(k|i)/∂i_j = q(k|i)(ε(j, k) - q(j|i)), where ε(j, k) is Kronecker’s delta
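The compact form ∂q(k|i)/∂i_j = q(k|i)(ε(j, k) − q(j|i)) can likewise be verified numerically (the input values below are arbitrary):

```python
import math

def softmax(y):
    mx = max(y)  # subtract the max for numerical stability
    e = [math.exp(v - mx) for v in y]
    s = sum(e)
    return [v / s for v in e]

def softmax_jacobian(y):
    # dq(k|i)/di_j = q(k|i) * (delta(j, k) - q(j|i))
    q = softmax(y)
    return [[q[k] * ((1.0 if j == k else 0.0) - q[j]) for j in range(len(y))]
            for k in range(len(y))]

y, eps = [0.1, -0.4, 1.2], 1e-6
J = softmax_jacobian(y)
for k in range(3):
    for j in range(3):
        yp = y[:]; yp[j] += eps
        ym = y[:]; ym[j] -= eps
        numeric = (softmax(yp)[k] - softmax(ym)[k]) / (2 * eps)
        assert abs(J[k][j] - numeric) < 1e-8
print("Jacobian matches finite differences")
```

Note that the j = k and j ≠ k cases on the slide are just the two branches of this single Kronecker-delta expression.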

SLIDE 51

What functions/operations we cannot use?

  • Functions/operations for which we cannot take a derivative, including some discrete operations
  • argmax_X q(X|P): a basic ASR operation, but we cannot take a derivative….
  • Discretization
  • Etc.

SLIDE 52

Objective function design

  • We usually use the cross entropy as an objective function
  • Since the Viterbi sequence is a hard assignment, the summation over states is simplified
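With a hard Viterbi target the cross entropy collapses to −log q(t|i), the negative log-probability of the single aligned state. A small check, using a toy output distribution and a hypothetical state ID:

```python
import math

def cross_entropy(q, target_dist):
    # CE = - sum_k p(k) log q(k); terms with p(k) = 0 vanish
    return -sum(p * math.log(qk) for p, qk in zip(target_dist, q) if p > 0)

q = [0.1, 0.7, 0.2]  # network output distribution over states (toy values)
t = 1                # Viterbi alignment gives a single (hard) state ID
one_hot = [1.0 if k == t else 0.0 for k in range(len(q))]

# With a hard assignment the sum over states collapses to a single term:
assert abs(cross_entropy(q, one_hot) - (-math.log(q[t]))) < 1e-12
print(cross_entropy(q, one_hot))
```

This is why frame-level DNN training with Viterbi alignments is usually described simply as minimizing −log q(t_u|p_u).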

SLIDE 53

Other objective functions

  • Square error: ‖i_ref - i‖²
  • We could also use the p-norm, e.g., the L1 norm
  • Binary cross entropy
  • Again, this is a special case of the cross entropy when the number of classes is two

SLIDE 54

Building blocks

[Figure: building blocks. Input: p_u ∈ ℝ^D; output: t_u ∈ {1, …, K}; linear transformation, sigmoid activation, softmax activation, +, -, exp, log, etc.]

SLIDE 55

Building blocks

[Figure: building blocks. Input: p_u ∈ ℝ^D; output: t_u ∈ {1, …, K}; linear transformation, sigmoid activation, softmax activation]

SLIDE 56

Building blocks

[Figure: building blocks stacked. Input: p_u ∈ ℝ^D; output: t_u ∈ {1, …, K}; linear transformation + sigmoid activation repeated, then softmax activation]

SLIDE 57

Building blocks

[Figure: building blocks. Input: p_u ∈ ℝ^D; output: t_u ∈ {1, …, K}; linear transformation, sigmoid activation, softmax activation, +, -, exp, log, etc.]

SLIDE 58

Building blocks

[Figure: building blocks. Input: p_u ∈ ℝ^D; output: t_u ∈ {1, …, K}; linear transformation, sigmoid activation, softmax activation]

SLIDE 59

How to optimize? Gradient descent and its variants

  • Take a derivative and update the parameters with this derivative
  • Chain rule

SLIDE 60

Deep neural network: nested function

  • Chain rule to get the derivative recursively
  • Each transformation (affine, sigmoid, and softmax) has an analytical derivative, and we just combine these derivatives
  • We can obtain the derivative with the back-propagation algorithm

[Figure: stacked blocks: linear transformation → sigmoid activation → linear transformation → softmax activation]
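Combining the softmax and cross-entropy derivatives from the previous slides via the chain rule gives the standard softmax-with-cross-entropy gradient q(j) − ε(j, t), which we can verify numerically (the toy logits below are arbitrary):

```python
import math

def softmax(y):
    mx = max(y)
    e = [math.exp(v - mx) for v in y]
    s = sum(e)
    return [v / s for v in e]

def ce_grad_wrt_logits(y, t):
    # Chain rule: combining dCE/dq with the softmax Jacobian gives the
    # well-known result dCE/dy_j = q(j) - delta(j, t)
    q = softmax(y)
    return [qj - (1.0 if j == t else 0.0) for j, qj in enumerate(q)]

y, t, eps = [0.3, -1.0, 0.8], 2, 1e-6
g = ce_grad_wrt_logits(y, t)
for j in range(3):
    yp = y[:]; yp[j] += eps
    ym = y[:]; ym[j] -= eps
    numeric = (-math.log(softmax(yp)[t]) + math.log(softmax(ym)[t])) / (2 * eps)
    assert abs(g[j] - numeric) < 1e-8
print("analytic gradient matches finite differences")
```

Back-propagation applies exactly this kind of combination layer by layer, passing the gradient back through each affine and sigmoid block in turn.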

SLIDE 61

Minibatch processing

  • Batch processing
  • Slow convergence
  • Efficient computation
  • Online processing
  • Fast convergence
  • Very inefficient computation
  • Minibatch processing
  • Something between batch and online processing

[Figure: the whole dataset (batch) split into minibatches, with the parameters updated after each minibatch]
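Splitting the whole batch into minibatches and updating after each one can be sketched as follows. The toy objective (fitting a scalar to the data mean by SGD) is ours, purely for illustration.

```python
import random

random.seed(0)

def minibatches(data, batch_size):
    # Shuffle once per epoch, then split the whole (batch) dataset
    # into consecutive minibatches
    data = data[:]
    random.shuffle(data)
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

# Toy example: minimize the mean of (theta - x)^2 with minibatch SGD;
# the optimum is the data mean (4.5 here)
data = [float(x) for x in range(10)]
theta, lr = 0.0, 0.1
for epoch in range(200):
    for batch in minibatches(data, batch_size=3):
        grad = sum(2 * (theta - x) for x in batch) / len(batch)
        theta -= lr * grad  # one parameter update per minibatch
print(round(theta, 1))
```

Each epoch touches the full dataset (like batch processing) but makes several noisy updates along the way (like online processing), which is exactly the trade-off the slide describes.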

SLIDE 62

Summary of today’s talk

  • Deep learning changes the world
  • A lot of human language technologies are boosted by deep learning
  • Deep neural network basics
  • Input
  • Output
  • Function
  • Back propagation