Neural Networks for Distant Speech Recognition – Steve Renals (PowerPoint presentation)



SLIDE 1

Neural Networks for Distant Speech Recognition

Steve Renals
Centre for Speech Technology Research, University of Edinburgh
s.renals@ed.ac.uk
14 May 2014

Joint work with Paweł Świętojański
Significant contributions from Peter Bell & Arnab Ghoshal

SLIDE 2

Distant Speech Recognition

“... so you have your energy source your user interface who’s controlling the chip ... hmm [click] [rustle]”

SLIDE 3

Why study meetings?

  • Natural communication scenes
  • Multistream – multiple asynchronous streams of data
  • Multimodal – words, prosody, gesture, attention
  • Multiparty – social roles, individual and group behaviours
  • Meetings offer realistic, complex behaviours, but in a circumscribed setting
  • Applications based on meeting capture, analysis, recognition and interpretation
  • Great arena for interdisciplinary research
SLIDE 4

“ASR Complete” problem

  • Transcription of conversational speech
  • Distant speech recognition with microphone arrays
  • Speech separation, multiple acoustic channels
  • Reverberation
  • Overlap detection
  • Utterance and speaker segmentation
  • Disfluency detection
SLIDE 5

Today’s Menu

  • MDM corpora: the ICSI and AMI meeting corpora
  • MDM systems in 2010: GMMs, beamforming, and lots of adaptation
  • MDM systems in 2014: neural networks, less beamforming, and less adaptation

SLIDE 6

Corpora

SLIDE 7

ICSI Corpus

[Photos: tabletop boundary mics, headset mics]

SLIDE 8

AMI Corpus

[Photos: mic array, headset mic, lapel mic]

http://corpus.amiproject.org

SLIDE 9

AMI Corpus Example

SLIDE 10

Meeting recording (c. 2005)

SLIDE 11

Meeting recording (2010s)

SLIDE 12

GMM-based systems (state of the art, 2010)

SLIDE 13

[Diagram: beamformer front end]

Basic system

  • Speech/non-speech segmentation
  • PLP/MFCC features
  • ML-trained HMM/GMM system (122k 39-dimensional Gaussians)
  • 50k vocabulary
  • Trigram language model (small: 26M words, PPL 78)
  • Weighted FST decoder
SLIDE 14

Additional components

  • Microphone array front end
  • Speaker / channel adaptation
  • Vocal tract length normalisation (VTLN)
  • Maximum likelihood linear regression (MLLR)
  • Input feature transform – LDA/STC
  • Discriminative training, e.g. boosted maximum mutual information (BMMI)
  • Discriminative features
  • Model combination
SLIDE 15

GMM results (WER)

ASR word error rates (%) for GMM/HMM systems:

  SDM               ICSI  63.2
  SDM               AMI   56.1
  MDM beamforming   ICSI  54.8
  MDM beamforming   AMI   46.8
  IHM               AMI   29.6

SLIDE 16

Microphone array processing for distant speech recognition

  • Mic array processing in the AMIDA ASR system (Hain et al, 2012)
    • Wiener noise filter
    • Filter-sum beamforming based on time-delay-of-arrival
    • Viterbi smoother post-processing
    • Tracks the direction of maximum energy
  • Optimise beamforming for speech recognition
    • LIMABEAM (Seltzer et al, 2004, 2006) [explicit]
    • Simply concatenate feature vectors from multiple mics (Marino and Hain, 2011) [implicit]
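The delay-sum idea above can be sketched in a few lines: estimate each channel's time-delay-of-arrival against a reference channel from the peak of their cross-correlation, undo the delays, and average. This is only a minimal illustration of the principle, not the AMIDA front end (no Wiener filtering or Viterbi smoothing), and `tdoa` / `delay_and_sum` are hypothetical helper names.

```python
import numpy as np

def tdoa(ref, sig):
    """Estimate the delay (in samples) of `sig` relative to `ref`
    from the peak of their cross-correlation."""
    corr = np.correlate(sig, ref, mode="full")
    return np.argmax(corr) - (len(ref) - 1)

def delay_and_sum(channels):
    """Align every channel to channels[0] and average (delay-sum)."""
    ref = channels[0]
    out = np.zeros_like(ref, dtype=float)
    for ch in channels:
        d = tdoa(ref, ch)
        out += np.roll(ch, -d)   # undo the estimated delay (circular, for the sketch)
    return out / len(channels)
```

In a real front end the delays would be estimated per block on filtered signals and combined with channel weights; here the circular `np.roll` stands in for proper fractional-delay alignment.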

SLIDE 17

(Deep) Neural Networks

SLIDE 18

The Perceptron (Rosenblatt)
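Rosenblatt's learning rule is simple enough to state in code: cycle through the data and, whenever an example is misclassified, add y_i·x_i to the weights. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def perceptron_train(X, y, epochs=20):
    """Rosenblatt's perceptron: on each mistake, w <- w + y_i * x_i.
    X: (n, d) inputs; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi
                b += yi
                errors += 1
        if errors == 0:                  # converged (guaranteed if separable)
            break
    return w, b
```

The convergence theorem guarantees a finite number of mistakes on linearly separable data; its failure on problems like XOR is exactly what triggered the first "NN winter" mentioned two slides on.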

SLIDE 19

The Perceptron (Rosenblatt)

SLIDE 20

The Perceptron (Rosenblatt)

NN Winter #1

SLIDE 21

MLPs and backprop (mid 1980s)

SLIDE 22

MLPs and backprop

[Diagram: MLP with inputs x_i, hidden units z_j, outputs y_1 … y_K; weights w(1)_ji and w(2)_kj; hidden error signals δ_j backpropagated from the output errors δ_1 … δ_K]

  • Train multiple layers of hidden units – nested nonlinear functions
  • Powerful feature detectors
  • Posterior probability estimation
  • Theorem: any function can be approximated with a single hidden layer
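Backprop for a one-hidden-layer network fits in a few lines: with a softmax output and cross-entropy loss, the output error is simply y − t, and the hidden error is the backpropagated output error times the sigmoid derivative z(1 − z). A toy sketch with illustrative names (single example, plain gradient descent):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    z = sigmoid(W1 @ x + b1)                 # hidden activations
    a = W2 @ z + b2
    y = np.exp(a - a.max()); y /= y.sum()    # softmax posteriors
    return z, y

def backprop(x, t, W1, b1, W2, b2, lr=0.5):
    """One gradient step on one example; t is a one-hot target."""
    z, y = forward(x, W1, b1, W2, b2)
    delta2 = y - t                           # output error (softmax + cross-entropy)
    delta1 = (W2.T @ delta2) * z * (1 - z)   # backpropagated hidden error
    W2 -= lr * np.outer(delta2, z); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x); b1 -= lr * delta1
    return -np.log(y[t.argmax()])            # cross-entropy before the update
```

The same delta recursion, applied layer by layer, is what trains the deep networks later in the talk.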

SLIDE 23

“Hybrid” neural network acoustic models (1990s)

[Chart: million parameters vs error (%) on DARPA RM 1992 – CI-HMM, CI-MLP, CD-HMM, MIX]

[Diagram: 1998 Broadcast News system – CI RNN, CI MLP and CD RNN acoustic models over perceptual linear prediction and modulation spectrogram features, decoded with Chronos and combined via ROVER]

Broadcast News 1998: 20.8% WER (best GMM-based system: 13.5%)

Renals, Morgan, Cohen & Franco, ICASSP 1992; Bourlard & Morgan, 1994; Robinson, IEEE TNN 1994; Cook, Christie, Ellis, Fosler-Lussier, Gotoh, Kingsbury, Morgan, Renals, Robinson & Williams, DARPA, 1999

SLIDE 24

NN acoustic models: limitations vs GMMs

  • Computationally restricted to monophone outputs
  • CD-RNN factored over multiple networks – limited within-word context
  • Training not easily parallelisable
    • experimental turnaround slower
    • systems less complex (fewer parameters)
    • RNN – <100k parameters
    • MLP – ~1M parameters
  • Rapid adaptation hard (cf MLLR)
SLIDE 25

NN Winter #2

[Word cloud: triphone models (s-iy+l, f-iy-l, t-iy-n, t-iy-m) and competing methods – GMM, CRF, SVM]

SLIDE 26

Discriminative long-term features – Tandem

  • A neural network-based technique provided the biggest increase in speech recognition accuracy during the 2000s
  • Tandem features (Hermansky, Ellis & Sharma, 2000)
    • use (transformed) outputs or (bottleneck) hidden values as input features for a GMM
    • deep networks – e.g. a 5-layer MLP to obtain bottleneck features (Grézl, Karafiát, Kontár & Černocký, 2007)
    • reduce errors by about 10% relative (Hain, Burget, Dines, Garner, Grézl, el Hannani, Huijbregts, Karafiát, Lincoln & Wan, 2012)
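Operationally, the tandem idea is just: forward each acoustic frame through a trained network and read off a narrow hidden ("bottleneck") layer as the feature vector handed to the GMM. A schematic sketch; the helper name and the tanh nonlinearity are assumptions for illustration, not the cited systems' exact recipe:

```python
import numpy as np

def bottleneck_features(frames, weights, biases, bottleneck_index):
    """Forward (T, d) acoustic frames through a trained MLP and return
    the activations of the narrow layer at `bottleneck_index`, which
    then serve as input features for a conventional GMM-HMM."""
    h = frames
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = np.tanh(h @ W + b)        # hidden nonlinearity (assumed tanh)
        if i == bottleneck_index:
            return h                  # e.g. a 26-dimensional bottleneck
    return h
```

In practice the bottleneck outputs are usually decorrelated (e.g. with PCA/HLDA) and appended to standard features before GMM training.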

SLIDE 27

Deep Neural Networks (2010s)

Hybrid: MFCC inputs (39×9 = 351) → 3–8 hidden layers of ~2000 units → ~12000 CD phone outputs

Tandem: the same stack with a narrow bottleneck layer (e.g. 26 units) read out as features

Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath & Kingsbury, IEEE SP Mag 2012; Dahl, Yu, Deng & Acero, IEEE TASLP 2012

SLIDE 28

Deep neural networks

What’s new?

SLIDE 29

Deep neural networks

  1. Unsupervised pretraining (Hinton, Osindero & Teh, 2006)
     • Train a stacked RBM generative model, then fine-tune
     • Good initialisation
     • Regularisation
  2. Deep – many hidden layers
     • Deeper models are more accurate
     • GPUs gave us the computational power
  3. Wide output layer (context-dependent phone classes) rather than factorised into multiple nets
     • More accurate phone models
     • GPUs gave us the computational power
SLIDE 30

Deep neural networks (repeat of the previous slide)
SLIDE 31

Switchboard: Hub5 '00 test set, 300-hour training set (WER/%)

              SWB    CHE    AVE
  GMM/BMMI    18.6   33.0   25.8
  DNN/CE      14.2   25.7   20.0
  DNN/sMBR    12.6   24.1   18.4

K Veselý, A Ghoshal, L Burget and D Povey, “Sequence-discriminative training of deep neural networks”, Interspeech 2013.

SLIDE 32

Switchboard results (as on the previous slide)

http://kaldi.sf.net/

SLIDE 33

Neural network acoustic models

  • 9×39 MFCC inputs → 3–8 hidden layers of ~2000 hidden units → ~6000 CD phone outputs (softmax)
  • Automatically learned feature extraction
  • Aim: learn representations for distant speech recognition based on multiple mic channels

SLIDE 34

Neural network acoustic models (as on the previous slide)

Multi-channel input? Spectral domain?

SLIDE 35

Neural network acoustic models for distant speech recognition

  • NNs have proven to result in accurate systems for a variety of tasks – TIMIT, WSJ, Switchboard, Broadcast News, Lectures, Aurora4, …
  • NNs can integrate information from multiple frames of data (in comparison with GMMs)
  • NNs can construct feature representations from multiple sources of data
  • NNs are well suited to learning multiple modules with a common objective function

SLIDE 36

Baseline DNN system

  • Mic array front end: Wiener filter noise cancellation, smoothed TDOA estimates, delay-sum beamforming
  • 11×120 FBANK inputs → 6 hidden layers of 2048 hidden units → ~4000 tied-state outputs
  • 50,000-word pronunciation dictionary
  • Small trigram LM (PPL 78, trained on 26M words)
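The 11×120 FBANK input means each frame is spliced with ±5 neighbouring frames of 120-dimensional filterbank features. A sketch of such context splicing, with edge frames replicated (the function name is illustrative):

```python
import numpy as np

def splice(frames, context=5):
    """Stack each frame with +/- `context` neighbours, replicating the
    edge frames, so a (T, d) utterance becomes (T, (2*context+1)*d)."""
    T, d = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    # column block i corresponds to frame offset i - context
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])
```

With `context=5` and 120 filterbank dimensions this yields the 1320-dimensional (11×120) vectors the baseline DNN consumes.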

SLIDE 37

Baseline GMM results

ASR word error rates (%) for GMM/HMM systems:

  SDM               ICSI  63.2
  SDM               AMI   56.1
  MDM beamforming   ICSI  54.8
  MDM beamforming   AMI   46.8
  IHM               AMI   29.6

SLIDE 38

Baseline DNN results

ASR word error rates (%) for baseline DNN/HMM systems:

  SDM               ICSI  53.1
  SDM               AMI   47.8
  MDM beamforming   ICSI  49.5
  MDM beamforming   AMI   41.0
  IHM               AMI   26.6

SLIDE 39

Baseline DNN results (as on the previous slide)

The DNN gives a 10–15% relative WER reduction over the GMM

SLIDE 40

Concatenating input features

Mic array → 8 × (11×120) FBANK inputs → 6 hidden layers of 2048 hidden units → ~4000 tied-state outputs
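Concatenating input features amounts to handing the network one long vector per frame, built from every microphone's context window, and letting the hidden layers learn the channel combination implicitly (no explicit beamformer). A hypothetical sketch for a single frame:

```python
import numpy as np

def multi_channel_input(channel_fbanks, context=5):
    """Build one DNN input vector from several channels: each channel
    contributes a (2*context+1, n_bands) window of FBANK frames, and
    the flattened windows are concatenated into a single vector."""
    vecs = []
    for fbank in channel_fbanks:          # one (T, n_bands) array per mic
        T = fbank.shape[0]
        t = T // 2                        # centre frame, for illustration
        window = fbank[t - context:t + context + 1]
        vecs.append(window.ravel())
    return np.concatenate(vecs)
```

For 8 channels of 11×120 features this produces a 10560-dimensional input, as on the slide.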

SLIDE 41

DNN results: beamforming vs concatenation

ASR word error rates (%) for AMI corpus MDM:

  GMM / Beamforming           54.8
  DNN / Beamforming           49.5
  DNN / Concatenation (4ch)   51.2

SLIDE 42

Convolutional Neural Networks (CNNs)

SLIDE 43

Convolutional Neural Networks

Yann LeCun, 1989 onwards

SLIDE 44

CNN – Single channel

[Diagram: inputs v1 … v40; convolutional bands h1 … h4 with shared weights; maxpool outputs p1, p2]

SLIDE 45

CNN – Single channel

[Diagram as before, feeding 5 sigmoid layers of 2048 hidden units and ~4000 tied-state outputs]

SLIDE 46

CNN – Single channel

  • Inputs to each convolutional band: statics, deltas and double-deltas for all context frames of that band
  • 128 convolutional filters of width 9, shift 1; maxpool size 2
  • Followed by 5 sigmoid layers of 2048 hidden units and ~4000 tied-state outputs
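The convolution here runs along the frequency axis: shared filters of width 9 slide over the 40 filterbank bands with shift 1, followed by non-overlapping max-pooling of size 2. A single-frame, loop-based sketch (sigmoid band activations and the function name are assumptions for illustration):

```python
import numpy as np

def conv_bands(v, filters, pool=2):
    """1-D convolution along frequency: v is a (40,) vector of filterbank
    energies, filters is (n_filt, 9). Each filter slides over the bands
    with shift 1, then non-overlapping max-pooling of size `pool`."""
    n_filt, width = filters.shape
    n_bands = len(v) - width + 1                 # 32 band positions for 40 inputs
    h = np.empty((n_filt, n_bands))
    for k in range(n_bands):
        h[:, k] = filters @ v[k:k + width]       # shared weights across bands
    h = 1.0 / (1.0 + np.exp(-h))                 # sigmoid band activations
    n_pool = n_bands // pool
    return h[:, :n_pool * pool].reshape(n_filt, n_pool, pool).max(axis=2)
```

Weight sharing plus pooling is what buys the small frequency-shift invariance discussed later; in the real model each band position also sees the frame's temporal context.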

SLIDE 47

CNN – Multi-channel

[Diagram: per-channel inputs v1 … v40, convolved with shared weights into bands h1 … h4, then maxpool p1, p2]

SLIDE 48

CNN – Multi-channel

[Diagram as before, feeding 5 sigmoid layers of 2048 hidden units and ~4000 tied-state outputs]

SLIDE 49

CNN – Cross-channel

[Diagram: each channel convolved separately into bands h1 … h4; cross-channel maxpooling gives c1 … c4, followed by cross-band maxpooling]
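Cross-channel max-pooling takes, for each convolutional band, the maximum activation over the channels, so the network can latch onto whichever microphone is currently most informative. A one-line sketch (the function name is illustrative):

```python
import numpy as np

def cross_channel_maxpool(channel_activations):
    """Channel-wise convolution followed by cross-channel max-pooling:
    given one (n_filt, n_bands) activation array per channel, take the
    element-wise max over channels."""
    return np.max(np.stack(channel_activations), axis=0)
```

Because a max over channels is invariant to channel order, this design is consistent with the later observation that cross-channel CNNs still work when the channel order is changed at test time.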

SLIDE 50

SDM systems

ASR word error rates (%):

                 ICSI   AMI
  GMM            63.2   56.1
  DNN            53.1   47.8
  CNN            51.3   46.5
  DNN / maxout   50.8   45.9

SLIDE 51

MDM systems

ASR word error rates (%):

                    ICSI   AMI
  GMM BF            54.8   46.8
  DNN BF            49.5   41.0
  CNN BF            45.9   38.1
  CNN x-chan        48.4   37.8
  DNN BF / maxout   46.4   39.0

SLIDE 52

MDM systems (results as on the previous slide)

The CNN gives a 7–8% relative WER reduction over the DNN (16–19% relative reduction over the GMM)

SLIDE 53

Discussion

  • A CNN with a single distant mic has similar WER to a DNN with 8 beamformed mics
  • Channel-wise convolution followed by cross-channel max-pooling is better than multi-channel convolution
  • Cross-channel CNNs still work even if the channel order is changed at test time… able to pick the most informative channel
  • Invariances across time and frequency are important in the multi-channel case
  • CNN improvements over DNN are smaller for ‘piecewise’ hidden units (maxout, ReLU) [but only on DSR data so far…]

SLIDE 54

Future data?

  • Existing corpora: MC-WSJ, AMI, ICSI
  • Desiderata for new datasets:
    • Recorded in a variety of environments
    • Highly multimodal
    • Natural & conversational speech
    • Wide range of challenges – signal processing, language processing
    • Evaluation campaigns
SLIDE 55

Future data?

  • Sheffield Wargames Corpus
    • Many mic channels: arrays + head-mounted
    • Tracking info
    • Cameras
    • http://mini.dcs.shef.ac.uk/data-2/
SLIDE 56

Conclusions

  • Improvements due to:
    • deep structures to learn feature representations
    • wide context-dependent output
    • temporal context at input, correlated features
  • From these experiments and others:
    • DNNs offer 10–30% relative improvement over GMMs
    • CNNs offer 5–10% relative improvement over DNNs
    • Beamforming is still ~5% better than multichannel learning
    • Log spectral features give an improvement over MFCCs
SLIDE 57

Practical details for DNNs

  • Computing platform: high-end PC with a “gamer’s” GPU (e.g. GTX 690)
  • Open source software, e.g.:
    • Kaldi – http://kaldi.sourceforge.net
    • Theano – http://deeplearning.net/software/theano
    • Pylearn2 – http://deeplearning.net/software/pylearn2/
    • Torch7 – http://www.torch.ch
    • Quicknet – http://www.icsi.berkeley.edu/Speech/qn.html