

SLIDE 1

Review: Filterbanks Waveform CLDNN What do these things learn Multichannel waveform CLDNN

Training neural network acoustic models on (multichannel) waveforms

Ron Weiss in SANE 2015 2015-10-22 Joint work with Tara Sainath, Kevin Wilson, Andrew Senior, Arun Narayanan, Michiel Bacchiani, Oriol Vinyals, Yedid Hoshen

Ron Weiss Training neural network acoustic models on (multichannel) waveforms in SANE 2015 1 / 31

View this talk on YouTube: https://youtu.be/sI_8EA0_ha8

SLIDE 2

Outline

1. Review: Filterbanks
2. Waveform CLDNN (Sainath et al., 2015b)
3. What do these things learn
4. Multichannel waveform CLDNN (Sainath et al., 2015c)

SLIDE 3

Acoustic modeling in 2015

[Figure: mel spectrogram (~2 s) of "his captain was thin and haggard", with its frame-level phone alignment.]

Classify each 10 ms audio frame into a context-dependent phoneme state, with log-mel filterbank features passed into a neural network. Modern vision models are trained directly from pixels; can we train an acoustic model directly from the samples?

SLIDE 4

Frequency domain filterbank: log-mel

Per frame n = 1…N: waveform → window_n → FFT → | · | → mel → log → feature frame n

Stages: localization in time (windowing), bandpass filtering (FFT plus mel warping), pointwise nonlinearity (magnitude), and dynamic range compression (log).
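As a concrete sketch of this pipeline, the NumPy code below computes log-mel features with a 25 ms window and 10 ms hop at 16 kHz; the FFT size, triangular mel filters, and log floor are illustrative assumptions rather than details from the talk.

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel scale with a 700 Hz break frequency
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers equally spaced on the mel scale
    mels = np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for j in range(lo, ctr):
            fb[i, j] = (j - lo) / max(ctr - lo, 1)
        for j in range(ctr, hi):
            fb[i, j] = (hi - j) / max(hi - ctr, 1)
    return fb

def log_mel(waveform, sr=16000, win=400, hop=160, n_fft=512, n_mels=40):
    # window -> |FFT| -> mel warping -> log, one feature frame per 10 ms hop
    fb = mel_filterbank(n_mels, n_fft, sr)
    hann = np.hanning(win)
    frames = []
    for start in range(0, len(waveform) - win + 1, hop):
        seg = waveform[start:start + win] * hann
        mag = np.abs(np.fft.rfft(seg, n_fft))   # pointwise nonlinearity
        frames.append(np.log(fb @ mag + 1e-6))  # floored log compression
    return np.array(frames)                     # (num_frames, n_mels)
```

One second of 16 kHz audio yields 98 frames of 40 log-mel values each.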

SLIDE 5

Time-domain filterbank

Per band p = 1…P: waveform → bandpass filter_p → nonlinearity → smoothing/decimation → log or cube root → feature band p

(fine time structure removed at the smoothing stage! :)

This swaps the order of filtering and decimation relative to log-mel, but computes basically the same thing; cf. cochleagrams and gammatone features for ASR (Schlüter et al., 2007).

SLIDE 6

Time-domain filterbank as a neural net layer

Per band p = 1…P: windowed waveform segment n → conv_p → ReLU → max pool → stabilized log → f_p[n]

These are common neural network operations:

- (FIR) filter → convolution
- nonlinearity → rectified linear (ReLU) activation
- smoothing/decimation → pooling

Window the waveform into short (< 300 ms) overlapping segments, and pass each segment through the FIR filterbank to generate one feature frame.
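A minimal single-frame sketch of this layer in NumPy, assuming generic filter shapes; the stabilization constant and filter lengths are placeholders, not values from the talk.

```python
import numpy as np

def tconv_frame(segment, filters, eps=0.01):
    # One feature frame from one windowed waveform segment:
    # FIR filtering (convolution) -> ReLU -> max pool over time -> stabilized log.
    feats = np.empty(len(filters))
    for p, h in enumerate(filters):                # filters: (P, filter_len)
        y = np.convolve(segment, h, mode="valid")  # FIR bandpass filter p
        y = np.maximum(y, 0.0)                     # ReLU nonlinearity
        feats[p] = np.log(y.max() + eps)           # pool, then stabilized log
    return feats                                   # (P,) one value per band
```

Sliding this over overlapping segments produces one P-dimensional feature vector per hop, just like a frame of filterbank features.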

SLIDE 7

Previous work: Representation learning from waveforms

- Jaitly and Hinton (2011): unsupervised representation learning using a time-convolutional RBM, followed by supervised DNN training on the learned features for phone recognition
- Tüske et al. (2014), Bhargava and Rose (2015): supervised training; a fully connected DNN learns similar filter shapes at different shifts
- Palaz et al. (2013, 2015b,a), Hoshen et al. (2015), Golik et al. (2015): supervised training, with convolution to share parameters across time shifts

None of the above improved over a log-mel baseline on a large vocabulary task.

SLIDE 8

Deep waveform DNN (Hoshen et al., 2015)

Architecture (input to output):

- Input: 275 ms waveform
- Convolution: F filters, 25 ms weights (output: F × 4401)
- Max pooling: 25 ms window, 10 ms step
- Nonlinearity: log(ReLU(...)) (output: F × 26)
- Fully connected: 4 layers, 640 units, ReLU activations
- Softmax: 13568 classes

Parameters chosen to match the log-mel DNN: 40 filters with 25 ms impulse responses and a 10 ms hop; 26 frames of context stacked using strided pooling, giving a 40×26 "brainogram". Adding stabilized log compression gave a 3-5% relative WER decrease; overall, a 5-6% relative WER increase compared to the log-mel DNN.
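The shapes quoted above can be checked with a little arithmetic, assuming a 16 kHz sample rate (so 25 ms filters are 400 taps):

```python
# Shape arithmetic for the deep waveform DNN, assuming 16 kHz audio.
sr = 16000
filt_len = int(0.025 * sr)           # 25 ms FIR filters -> 400 taps
conv_out = 4401                      # conv output length quoted on the slide
n_input = conv_out + filt_len - 1    # implied input: 275 ms of context plus
                                     # one filter length = 4800 samples
pool_win = int(0.025 * sr)           # 25 ms pooling window
pool_hop = int(0.010 * sr)           # 10 ms step
n_frames = (conv_out - pool_win) // pool_hop + 1   # frames after pooling
print(n_input, n_frames)             # 4800 26
```

The pooling arithmetic recovers exactly the 26 context frames of the "brainogram".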

SLIDE 9

CLDNN (Sainath et al., 2015a)

Combine all the neural net tricks: CLDNN = Convolution + LSTM + DNN

- Frequency convolution gives some pitch/vocal-tract-length invariance
- LSTM layers model long-term temporal structure
- The DNN learns a linearly separable function of the LSTM state

4-6% improvement over an LSTM baseline, with no need for extra frames of context in the input: the LSTM memory can remember previous inputs.

SLIDE 10

Waveform CLDNN (Sainath et al., 2015b)

Time convolution (tConv) produces a 40-dim frame from a 35 ms window (M = 561 samples), hopped by 10 ms. The CLDNN is similar to (Sainath et al., 2015a):

- Frequency convolution (fConv) layer: 8×1 filter, 256 outputs, pooled by 3 without overlap; the 8×256 output feeds a linear dimensionality-reduction layer
- 3 LSTM layers: 832 cells/layer with a 512-dim projection layer
- DNN layer: 1024 nodes with ReLU activations, then a linear dimensionality-reduction layer with 512 outputs

Total of 19M parameters, 16K of them in tConv; everything, including the tConv filterbank, is trained jointly.

Stack: raw waveform (M samples) → tConv (x_t ∈ ℝ^P) → fConv → LSTM → LSTM → LSTM → DNN → output targets
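A quick sanity check on the quoted parameter count, assuming the tConv layer keeps the 40 filters with 25 ms (401-tap at 16 kHz) impulse responses from the earlier slides:

```python
# Parameter accounting for the tConv layer, under the assumption above.
n_filters, n_taps = 40, 401
tconv_params = n_filters * n_taps    # 16040, i.e. the ~16K quoted,
                                     # a tiny fraction of the 19M total
```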

SLIDE 11

Experiments

US English Voice Search task:

1. Clean dataset: 3M utterances (~2k hours) train, 30k utterances (~20 hours) test
2. MTR20 multicondition dataset: simulated noise and reverberation
   - SNR between 5-25 dB (average ~20 dB)
   - RT60 between 0-400 ms (average ~160 ms)
   - Target-to-mic distance between 0-2 m (average ~0.75 m)

13522 context-dependent state outputs. Asynchronous SGD training, optimizing a cross-entropy loss.

SLIDE 12

Compared to log-mel (Sainath et al., 2015b)

Train/test set  Feature           WER
Clean           log-mel           14.0
Clean           waveform          13.7
MTR20           log-mel           16.2
MTR20           waveform          16.2
MTR20           waveform+log-mel  15.7

Waveform matches the log-mel baseline in clean and moderate noise; stacking log-mel features with the tConv output gives a 3% relative improvement.

SLIDE 13

How important are LSTM layers? (Sainath et al., 2015b)

MTR20 WER:

Architecture  log-mel  waveform
D6            22.3     23.2
F1L1D1        17.3     17.8
F1L2D1        16.6     16.6
F1L3D1        16.2     16.2

With a fully connected DNN, waveform is 4% worse than log-mel, and log-mel still wins with one or zero LSTM layers. The time convolution layer gives short-term shift invariance, but seems to need recurrence to model longer time scales.

SLIDE 14

Bring on the noise (Sainath et al., 2015c)

MTR12, a noisier version of MTR20: 12 dB average SNR, 600 ms average RT60, more farfield.

Num filters  log-mel  waveform
40           25.2     24.7
84           25.0     23.7
128          24.4     23.5

Waveform consistently outperforms log-mel in high noise, with larger improvements as the number of filters grows.

SLIDE 15

Filterbank magnitude responses

[Figure: magnitude responses of the mel filterbank vs. the trained filterbank, with filters sorted by the frequency of their peak response.]

Sorting filters by the index of the frequency band with peak magnitude, the trained filterbank looks mostly like an auditory filterbank: mostly bandpass filters whose bandwidth increases with center frequency. It has consistently higher resolution in low frequencies: 20 filters below 1 kHz vs. ~10 in mel, somewhat consistent with an ERB auditory frequency scale.

SLIDE 16

What happens when we add more filters?

With more filters, > 80 land below 1 kHz: an overcomplete basis. Not all filters are bandpass anymore: some contain harmonic stacks, others have wider bandwidths.

[Figure: magnitude responses of the enlarged filterbanks, full band and zoomed below 0.75 kHz.]

SLIDE 17

What if we had a microphone array...

Build a noise-robust multichannel ASR system by cascading:

1. speech enhancement to reduce noise, e.g. localization + beamforming + nonlinear postfiltering
2. an acoustic model, possibly trained on the output of 1

Can we perform multichannel enhancement and acoustic modeling jointly? Seltzer et al. (2004) explored this idea using a GMM acoustic model; we're going to use neural nets.

SLIDE 18

Filter-and-sum beamforming

y[t] = Σ_{c=0}^{C−1} h_c[t] ∗ x_c[t − τ_c]

Typically a separate localization model estimates the steering delays τ_c, and a beamformer estimates the filter weights h_c. Instead, use P filters to capture many fixed steering delays:

y_p[t] = Σ_{c=0}^{C−1} h_c^p[t] ∗ x_c[t]

This is just another convolution across a multichannel waveform.
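A direct NumPy transcription of the first equation, assuming non-negative integer sample delays; this is a sketch of classic filter-and-sum, not the learned layer.

```python
import numpy as np

def filter_and_sum(x, h, tau):
    # x: (C, T) multichannel waveform, h: (C, L) per-channel FIR filters,
    # tau: (C,) non-negative integer steering delays in samples.
    C, T = x.shape
    y = np.zeros(T)
    for c in range(C):
        xc = np.roll(x[c], tau[c])               # integer-delay alignment
        if tau[c] > 0:
            xc[:tau[c]] = 0.0                    # zero the wrapped samples
        y += np.convolve(xc, h[c], mode="full")[:T]
    return y
```

With unit-impulse filters and zero delays this reduces to a plain channel sum, the simplest beamformer.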

SLIDE 19

Multichannel waveform CLDNN (Sainath et al., 2015c)

Multichannel tConv layer: input of C × M samples, convolution with C × N × P weights, max pooling over an (M − N + 1)-sample window, and a log(ReLU(...)) nonlinearity, giving a 1 × P output x_t ∈ ℝ^P per frame.

- A bank of filter-and-sum beamformers, but without explicit localization and alignment
- Performs both spatial and spectral filtering

The tConv output feeds into the same CLDNN stack (fConv → LSTM → LSTM → LSTM → DNN → output targets) as in the single-channel case.
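A per-frame sketch of the multichannel tConv layer in NumPy: each of the P outputs is a filter-and-sum beamformer followed by the ReLU, max-pool, and stabilized-log stages. Filter shapes and the stabilization constant are placeholders.

```python
import numpy as np

def multichannel_tconv(x, H, eps=0.01):
    # x: (C, M) multichannel segment; H: (P, C, N) filter coefficients.
    # Each of the P outputs is a filter-and-sum beamformer (no explicit
    # alignment), then ReLU -> max pool over time -> stabilized log.
    P, C, N = H.shape
    feats = np.empty(P)
    for p in range(P):
        y = sum(np.convolve(x[c], H[p, c], mode="valid") for c in range(C))
        feats[p] = np.log(np.maximum(y, 0.0).max() + eps)
    return feats   # (P,) i.e. x_t in R^P
```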

SLIDE 20

Experiments

MTR12 dataset, but simulating an 8-channel linear mic array with 2 cm spacing. Look at different microphone subsets:

- 1 channel: mic 1
- 2 channels: mics 1,8 (14 cm spacing)
- 4 channels: mics 1,3,6,8 (4cm-6cm-4cm spacing)
- 8 channels: mics 1-8 (2 cm spacing)

100 different room configurations, with noise and target speaker locations randomly selected for each utterance. The main test set has the same conditions as training.
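The quoted spacings follow directly from the 2 cm array geometry:

```python
# Positions (in cm) of the 8 mics in the simulated 2 cm linear array.
pos = [2 * i for i in range(8)]          # mics 1..8 at 0, 2, ..., 14 cm
aperture_2ch = pos[7] - pos[0]           # subset {1, 8}: 14 cm
gaps_4ch = [pos[2] - pos[0],             # subset {1, 3, 6, 8}:
            pos[5] - pos[2],             # 4 cm, 6 cm, 4 cm gaps
            pos[7] - pos[5]]
```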

SLIDE 21

Compared to log mel (Sainath et al., 2015c)

Input     Num filters  1ch   2ch   4ch   8ch
log-mel   128          24.4  22.0  21.7  22.0
waveform  128          23.5  21.8  21.3  21.1
waveform  256          -     21.7  20.8  20.6

Log-mel improves with additional channels (stacking the features from each channel, as in Swietojanski et al., 2013), but not as much as waveform: the fine time structure is discarded along with the phase. Waveform improvements saturate at 128 filters with 2 channels, but continue with 256 filters at 4 and 8 channels: more microphones allow more complex spatial responses, letting the net make good use of the extra capacity in the filterbank layer.

SLIDE 22

How many LSTM layers does it take?

Input          Num filters  Num LSTM layers  WER
waveform, 2ch  128          1                25.8
waveform, 2ch  128          2                23.9
waveform, 2ch  128          3                21.8
waveform, 2ch  128          4                21.5

As in the 1-channel case, modeling temporal context with LSTM layers is key to good performance; gains start to saturate at 3 LSTM layers.

SLIDE 23

What’s a Beampattern?!

Magnitude response as a function of direction of arrival at the microphone array: pass a "multimic impulse" with different inter-channel delays through the filter and measure the response.

[Figure: a two-channel impulse response, its per-channel frequency response, spatial responses at 0.8/1.0/1.2 kHz, and the resulting beampattern, which shows a null whose direction varies with frequency.]
SLIDE 24

What is this thing learning? Example filters

[Figure: impulse responses (channels 0 and 1) and beampatterns for several example learned filters.]

Similar coefficients across channels, but shifted, much like a steering delay. Most filters have bandpass frequency responses on a similar scale to the 1-channel filters. ~80% of the filters have a significant spatial response.

SLIDE 25

Even more example filters

Several filters share the same center frequency but have different null directions, enabling upstream layers to differentiate between energy arriving from different directions within narrow frequency bands.

[Figure: impulse responses and beampatterns for filters with matching center frequencies and differing null directions.]

SLIDE 26

Compared to traditional beamforming (Sainath et al., 2015c)

System      2ch   4ch   8ch
Oracle D+S  22.8  22.5  22.4
waveform    21.8  21.3  21.1

The delay-and-sum (D+S) baseline uses the oracle time difference of arrival and passes the enhanced signal into the 1ch waveform model. Despite the lack of explicit localization, the waveform model outperforms D+S; perhaps the upper layers learn invariance to direction of arrival?
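The oracle D+S baseline amounts to aligning channels by their known delays and averaging; a sketch with integer sample delays (the helper name is mine):

```python
import numpy as np

def delay_and_sum(x, tdoa):
    # x: (C, T) multichannel waveform; tdoa: (C,) oracle integer sample
    # delays. Delay each channel into alignment, then average.
    C, T = x.shape
    y = np.zeros(T)
    for c in range(C):
        d = tdoa[c]
        if d >= 0:
            y[d:] += x[c, :T - d] if d > 0 else x[c]
        else:
            y[:d] += x[c, -d:]
    return y / C
```

Delaying the early channel lines up an impulse pair: two unit impulses offset by two samples sum coherently into one.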

SLIDE 27

Mismatched array geometry (Sainath et al., 2015c)

                         Test spacing
System                   14cm  10cm  6cm   2cm   0cm¹
Oracle D+S, 2ch          22.8  23.2  23.3  23.7  23.5
waveform 2ch, 14cm       21.8  22.2  23.3  30.7  33.9
waveform 2ch, multi-geo  21.9  21.7  21.9  21.8  23.1

Oracle D+S is more robust to mismatches in microphone spacing; the fixed-geometry waveform model degrades badly when the array spacing differs widely from training. The "multi-geometry" training set is built by sampling 2 channels with replacement for each utterance from the original 8-channel set: a net trained on this data becomes invariant to microphone spacing, and is even robust to decoding a single channel.
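The multi-geometry sampling scheme is simple to sketch (the function name is mine; channel count and spacings as described above):

```python
import numpy as np

def sample_multigeo_pair(x, rng):
    # x: (8, T) utterance recorded by the 8-channel linear array.
    # Draw 2 of the 8 channels *with replacement*, so the effective pair
    # spacing varies from 0 cm (the same mic twice) up to 14 cm.
    idx = rng.integers(0, x.shape[0], size=2)
    return x[idx]   # (2, T) training example with a random geometry
```

Sampling with replacement is what produces the 0 cm (repeated-channel) condition, which in turn makes the trained net robust to single-channel decoding.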

¹ 0cm: repeat the signal from mic 1.

SLIDE 28

Multi-geometry trained filters

[Figure: impulse responses and beampatterns of filters trained with multi-geometry vs. fixed-geometry data.]

Multi-geometry training still yields bandpass filters, but without strong spatial responses: only ~30% of the filters have a null, and several filters respond primarily to a single channel. Do the upper layers of the network somehow learn to model directionality?

SLIDE 29

Mismatched test (Sainath et al., 2015c)

System                    Simulated (14cm)  Rerecorded (28cm)
waveform, 1ch             19.3              23.8
waveform, 2ch, 14cm       18.2              23.7
Oracle D+S, 2ch           19.2              23.3
waveform, 2ch, multi-geo  17.8              21.1

*after sequence training

Slightly more realistic "Rerecorded" test set: sources from the eval set are replayed through speakers in a living room, recorded with an 8-channel linear microphone array with 4 cm spacing, and artificially mixed with noise using the same SNR distribution as the MTR12 set. Multi-geometry training still works.

SLIDE 30

Conclusion

From feature engineering to... deep net architecture engineering:

- Supervised training learns the filter coefficients, optimized jointly with the target objective
- Waveform CLDNN matches log-mel on clean speech and outperforms it on noisy speech
- Larger performance improvements with multichannel input
- Secret sauce: LSTM layers
- Multicondition training/data augmentation works really well: clean and noisy conditions, various mic array spacings

SLIDE 31

References I

Bhargava, M. and Rose, R. (2015). Architectures for deep neural network based acoustic models defined over windowed speech waveforms. In Proc. Interspeech.

Golik, P., Tüske, Z., Schlüter, R., and Ney, H. (2015). Convolutional neural networks for acoustic modeling of raw time signal in LVCSR. In Proc. Interspeech.

Hoshen, Y., Weiss, R. J., and Wilson, K. W. (2015). Speech acoustic modeling from raw multichannel waveforms. In Proc. ICASSP.

Jaitly, N. and Hinton, G. (2011). Learning a better representation of speech soundwaves using restricted Boltzmann machines. In Proc. ICASSP.

Palaz, D., Collobert, R., and Magimai-Doss, M. (2013). Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. In Proc. Interspeech.

Palaz, D., Magimai-Doss, M., and Collobert, R. (2015a). Analysis of CNN-based speech recognition system using raw speech as input. In Proc. Interspeech.

Palaz, D., Magimai-Doss, M., and Collobert, R. (2015b). Convolutional neural networks-based continuous speech recognition using raw speech signal. Technical report.

Sainath, T. N., Vinyals, O., Senior, A., and Sak, H. (2015a). Convolutional, long short-term memory, fully connected deep neural networks. In Proc. ICASSP.

Sainath, T. N., Weiss, R. J., Senior, A., Wilson, K. W., and Vinyals, O. (2015b). Learning the speech front-end with raw waveform CLDNNs. In Proc. Interspeech.

Sainath, T. N., Weiss, R. J., Wilson, K. W., Narayanan, A., Bacchiani, M., and Senior, A. (2015c). Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms. In Proc. ASRU, to appear.

Schlüter, R., Bezrukov, I., Wagner, H., and Ney, H. (2007). Gammatone features and feature combination for large vocabulary speech recognition. In Proc. ICASSP.

Seltzer, M. L., Raj, B., and Stern, R. M. (2004). Likelihood-maximizing beamforming for robust hands-free speech recognition. IEEE Transactions on Speech and Audio Processing, 12(5):489–498.

Swietojanski, P., Ghoshal, A., and Renals, S. (2013). Hybrid acoustic models for distant and multichannel large vocabulary speech recognition. In Proc. ASRU, pages 285–290.

Tüske, Z., Golik, P., Schlüter, R., and Ney, H. (2014). Acoustic modeling with deep neural networks using raw time signal for LVCSR. In Proc. Interspeech.

SLIDE 32

Extra slides


SLIDE 33

Even more multicondition training

Test-set WER* by input and train set:

Input     Train set  Clean  MTR20  MTR12
log-mel   MTR20      10.9   12.6   25.8
log-mel   MTR12      13.4   14.6   19.6
log-mel   MTR20+12   11.1   12.3   19.6
waveform  MTR12      13.7   14.5   18.6
waveform  MTR20+12   11.0   12.6   18.4

*after sequence training

Training on very noisy data hurts performance on clean speech. CLDNNs have a lot of capacity: training on both recovers clean performance while still working well in noise.

SLIDE 34

Why does this work? tConv / pooling (Sainath et al., 2015b)

Input window  Pooling  Initialization     MTR20 WER
25ms          none     random             19.9
35ms          max      random             16.4
35ms          max      gammatone (fixed)  16.4
35ms          max      gammatone          16.2
35ms          l2       gammatone          16.4
35ms          average  gammatone          16.8

Pooling gives shift invariance over a short (35 − 25 = 10 ms) time scale; without pooling the phase is fixed and performance is poor. Best results come from (ERB) gammatone initialization with max pooling:

- because of the filter ordering assumed by fConv?
- max preserves transients smoothed out by other pooling functions?

Not training tConv layer is slightly worse

SLIDE 35

How important is frequency convolution? (Sainath et al., 2015b)

Input     Architecture     MTR20 WER
log-mel   F1L3D1           16.2
waveform  F1L3D1           16.2
log-mel   L3D1             16.5
waveform  L3D1             16.5
waveform  L3D1, rand init  16.5

Analyzing different FxLyDz architectures: log-mel and waveform match in performance once the fConv layer is removed, and randomly initializing the tConv layer then makes no difference. The fConv layer requires an ordering of the features coming out of the tConv layer.

SLIDE 36

Filterbank impulse responses

[Figure: impulse responses of the learned filters vs. gammatone filters, over 25 ms windows.]

SLIDE 37

Does it correspond to an auditory frequency scale?

[Figure: filter center frequency vs. filter index for mel (f_break = 700 Hz), ERB (f_break = 228 Hz, f_max = 3.8 kHz), and filterbanks trained on Clean, MTR 20dB, and MTR 12dB data.]

Dick Lyon on mel spectrograms: "their amplitude scale is too logarithmic, and their frequency scale not logarithmic enough". Deep learning agrees: the learned scale is consistent with an ERB scale spanning 3.8 kHz, except that it adds ~5 filters above 4 kHz.

SLIDE 38

Single channel “brainograms”

[Figure: single-channel "brainogram" features from the gammatone-initialized vs. trained filterbanks.]

SLIDE 39

Multichannel WER breakdown (Sainath et al., 2015c)

[Figure: WER vs. SNR, reverb time, and target-to-mic distance for 1/2/4/8-channel waveform models.]

Larger improvements at the lowest SNRs, and consistent improvements across the range of reverb times and target distances.

SLIDE 40

Compared to traditional beamforming (Sainath et al., 2015c)

Compare the waveform model to two baselines:

1. delay-and-sum (D+S): oracle time difference of arrival (TDOA), output passed into the 1ch waveform model
2. time-aligned multichannel (TAM): channels aligned using the oracle TDOA, passed into the multichannel waveform model

System      2ch   4ch   8ch
Oracle D+S  22.8  22.5  22.4
Oracle TAM  21.7  21.3  21.3
waveform    21.8  21.3  21.1

Despite the lack of explicit localization, the waveform model does better than D+S and matches TAM; perhaps the upper layers learn invariance to direction of arrival? TAM learns filters similar to the "uncompensated" waveform model.