Large Vocabulary Continuous Speech Recognition with Long Short-Term - PowerPoint PPT Presentation

Large Vocabulary Continuous Speech Recognition with Long Short-Term Recurrent Networks Ha¸ sim Sak, Andrew Senior, Oriol Vinyals, Georg Heigold, Erik McDermott, Rajat Monga, Mark Mao, Fran¸ coise Beaufays andrewsenior/hasim@google.com See Sak et al. [2014b,a]

Overview • Recurrent neural networks • Training RNNs • Long short-term memory recurrent neural networks • Distributed training of LSTM RNNs • Acoustic modeling experiments • Sequence training LSTM RNNs Google Speech LVCSR with LSTM RNNs 2/37

Recurrent neural networks • An extension of feed-forward neural x 1 networks x 2 • Output fed back as input with time x 3 delay. x 4 • A dynamic time-varying neural r 1 y 1 network y 2 r 2 • Recurrent layer activations encode y 3 r 3 a “state”. y 4 r 4 • Sequence labelling, classification, y 5 r 5 prediction, mapping. r 6 • Speech recognition [Robinson et al., 1993] Google Speech LVCSR with LSTM RNNs 3/37

Back propagation through time Unroll the recurrent network through time. y 1 x 1 y 2 x 2 y 3 x 3 y 4 x 4 • Truncating at some limit “bptt steps” it looks like a DNN. • External gradients provided at the outputs • e.g. gradient of cross entropy loss • Internal gradients computed with the chain rule (backpropagation). Google Speech LVCSR with LSTM RNNs 4/37

Simple RNN Simple RNN architecture in two alternative representations: x t x t − 1 x t input W hx W hx W hx W hh . . . h t h t − 1 h t hidden W yh W hh W yh W yh y t y t − 1 y t output (a) RNN (b) RNN unrolled in time RNN hidden and output layer activations: h t = σ ( W hx x t + W hh h t − 1 + b h ) y t = φ ( W yh h t + b y ) Google Speech LVCSR with LSTM RNNs 5/37

Training RNNs • Forward pass: calculate activations for each input sequentially and update network state • Backward pass: calculate error and back propagate through network and time (back-propagation through time (BPTT)) • Update weights with the gradients summed over all time steps for each weight • Truncated BPTT: error is truncated after a specified back-propagation time steps Google Speech LVCSR with LSTM RNNs 6/37

Backpropagation through time Acoustic features θ θ' θ'' θ''' State posteriors θ θ' θ'' θ''' External gradients θ θ' θ'' θ''' Internal gradients } } } } + + + + δθ' δθ'' δθ''' δθ Google Speech LVCSR with LSTM RNNs 7/37

Long Short-Term Memory (LSTM) RNN • Learning long-term dependencies is difficult with simple RNNs, unstable training due to vanishing gradients problem [Hochreiter, 1991] • Limited capability (5-10 time steps) to model long-term dependencies • LSTM RNN architecture designed to address these problems [Hochreiter and Schmidhuber, 1997] • LSTM memory block: memory cell storing temporal state of network and 3 multiplicative units (gates) controlling the flow of information Google Speech LVCSR with LSTM RNNs 8/37

Long Short-Term Memory Recurrent Neural Networks • Replace the units of an RNN with memory cells with sigmoid • Input gate • Forget gate • Output gate Input Cell State Input Output Gate Δt Forget Δt Gate Output Gate • Enables long-term dependency learning • Reduces the vanishing/exploding gradient problems • 4 × more parameters than RNN Google Speech LVCSR with LSTM RNNs 9/37

LSTM RNN Architecture Input gate: controls flow of input activations into cell Output gate: controls output flow of cell activations Forget gate: process continuous input streams [Gers et al., 2000] “Peephole” connections added from cells to gates to learn precise timing of outputs [Gers et al., 2003] f t o t output: y t i t input: x t c t g h cell m t Google Speech LVCSR with LSTM RNNs 10/37

LSTM RNN Related Work • LSTM performs better than RNN for learning context-free and context-sensitive languages [Gers and Schmidhuber, 2001] • Bidirectional LSTM for phonetic labeling of acoustic frames on the TIMIT [Graves and Schmidhuber, 2005] • Online and offline handwriting recognition with bidirectional LSTM better than HMM-based system [Graves et al., 2009] • Deep LSTM - stack of multiple LSTM layers - combined with CTC and RNN transducer predicting phone sequences gets state of the art results on TIMIT [Graves et al., 2013] Google Speech LVCSR with LSTM RNNs 11/37

LSTM RNN Activation Equations An LSTM network computes a mapping from an input sequence x = ( x 1 , ..., x T ) to an output sequence y = ( y 1 , ..., y T ) by calculating the network unit activations using the following equations iteratively from t = 1 to T : i t = σ ( W ix x t + W im m t − 1 + W ic c t − 1 + b i ) (1) f t = σ ( W fx x t + W fm m t − 1 + W fc c t − 1 + b f ) (2) c t = f t ⊙ c t − 1 + i t ⊙ g ( W cx x t + W cm m t − 1 + b c ) (3) o t = σ ( W ox x t + W om m t − 1 + W oc c t + b o ) (4) m t = o t ⊙ h ( c t ) (5) y t = φ ( W ym m t + b y ) (6) Google Speech LVCSR with LSTM RNNs 12/37

Proposed LSTM Projected (LSTMP) RNN • O ( N ) learning computational complexity with stochastic gradient descent (SGD) per time step • Recurrent connections from cell output units ( n c ) to cell input units, input gates, output gates and forget gates • Cell output units connected to network output units • Learning computational complexity dominated by n c × (4 × n c + n o ) parameters • For more effective use of parameters, add a recurrent projection layer with n r linear projections ( n r < n c ) after LSTM layer. • Now n r × (4 × n c + n o ) parameters Google Speech LVCSR with LSTM RNNs 13/37

LSTM RNN Architectures LSTM RNN architectures input input LSTM LSTM recurrent output (a) LSTM output (b) LSTMP Google Speech LVCSR with LSTM RNNs 14/37

LSTMP RNN Activation Equations With the proposed LSTMP architecture, the equations for the activations of network units change slightly, the m t − 1 activation vector is replaced with r t − 1 and the following is added: i t = σ ( W ix x t + W im r t − 1 + W ic c t − 1 + b i ) (7) f t = σ ( W fx x t + W fm r t − 1 + W fc c t − 1 + b f ) (8) c t = f t ⊙ c t − 1 + i t ⊙ g ( W cx x t + W cm r t − 1 + b c ) (9) o t = σ ( W ox x t + W om r t − 1 + W oc c t + b o ) (10) m t = o t ⊙ h ( c t ) (11) r t = W rm m t (12) y t = φ ( W yr r t + b y ) (13) where the r denote the recurrent unit activations. Google Speech LVCSR with LSTM RNNs 15/37

Deep LSTM RNN Architectures LSTM RNN architectures input input input input LSTM LSTM LSTM LSTM recurrent recurrent output LSTM (a) LSTM output LSTM output (c) LSTMP recurrent (b) DLSTM output (d) DLSTMP Google Speech LVCSR with LSTM RNNs 16/37

Distributed Training of LSTM RNNs • Asynchronous stochastic gradient descent (ASGD) to optimize network parameters • Google Brain’s distributed parameter server: store, read and update the model parameters (50 shards) • Training replicas on 200 machines (data parallelism) • 3 synchronized threads in each machine (data parallelism) • Each thread operating on mini batch of 4 sequences simultaneously • TBPTT: 20 time steps of forward and backward pass • Training: read fresh parameters, process 3 × 4 × 20 time steps of input, send gradients to parameter server • Clip cell activations to [-50, 50] range for long utterances Google Speech LVCSR with LSTM RNNs 17/37

Asynchronous Stochastic Gradient Descent 1 Replica 4 Utterances per thread Thread 0 Thread 1 Thread 2 Internal gradients Parameter server shards Google Speech LVCSR with LSTM RNNs 18/37

Asynchronous Stochastic Gradient Descent 199 more replicas 1 Replica 4 Utterances per thread Thread 0 Thread 1 Thread 2 Internal gradients Parameter server shards Google Speech LVCSR with LSTM RNNs 19/37

Asynchrony Three forms of asynchrony: • Within a replica every bptt steps frame chunk is computed with different parameters. • State is carried over from one chunk to the next. • Each replica is updating independently. • Each shard of the parameter server is updated independently. Google Speech LVCSR with LSTM RNNs 20/37

System • Google Voice Search in US English • 3M (1900hours) 8kHz anonymized training utterances • 600M 25ms frames (10ms offset) • Normalized 40-dimensional log-filterbank energy features • 3-state HMMs with 14 , 000 context-dependent states • Cross-entropy loss • Targets from DNN Viterbi forced-alignment • 5 frame output delay • Hybrid Unidirectional“DLSMTP” • 2 layers of 800 cells with 512 linear projection layer. • 13M parameters Google Speech LVCSR with LSTM RNNs 21/37

Evaluation • Scale posteriors by priors for inference. • Deweight silence prior. • Evaluate ASR on a test set of 22 , 500 utterances • First pass LM of 23 million n -grams, lattice rescoring with an LM of 1 billion 5-grams Google Speech LVCSR with LSTM RNNs 22/37

Results for LSTM RNN Acoustic Models WERs and frame accuracies on development and training sets: L number of layers, for shallow (1L) and deep (2,4,5,7L) networks C number of memory cells and N total number of parameters C Depth N Dev Train WER (%) (%) (%) 840 5L 37M 67.7 70.7 10.9 440 5L 13M 67.6 70.1 10.8 600 2L 13M 66.4 68.5 11.3 385 7L 13M 66.2 68.5 11.2 750 1L 13M 63.3 65.5 12.4 Google Speech LVCSR with LSTM RNNs 23/37

Large Vocabulary Continuous Speech Recognition with Long Short-Term - PowerPoint PPT Presentation

Large Vocabulary Continuous Speech Recognition with Long Short-Term Recurrent Networks Ha sim Sak, Andrew Senior, Oriol Vinyals, Georg Heigold, Erik McDermott, Rajat Monga, Mark Mao, Fran coise Beaufays andrewsenior/hasim@google.com See

GPU-Accelerated GPU-Accelerated Large Vocabulary Continuous Speech Recognition Large

8-Speech Recognition Speech Recognition Concepts Speech Recognition Approaches

Speech Processing Speech Processing Using Speech with Computers Overview Overview Speech vs

EECS E6870 converting speech to text Speech Recognition automatic speech recognition

HMMS and Speech HMMS and Speech HMMS and Speech Recognition Recognition Recognition Presented

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 25: Speech

Speech recognition Brief history Technology Computer Literacy 1 Lecture 22 How does

Effective Open Source Speech Recognition in Your Application #kde-speech Peter Grasch

6-Text To Speech (TTS) Speech Synthesis Speech Synthesis Concept Speech Naturalness Phone

Speech Processing 11-492/18-492 Speech Processing 11-492/18-492 Speech Recognition Acoustic

Speech Processing 15-492/18-492 Speech Recognition Template matching Speech Recognition by

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 23: Speech

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 1: Introduction

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 1: Introduction

Vocabulary and Reading in Secondary School (VaRiSS) Jessie Ricketts Royal Holloway Vocabulary

VOCABULARY ATI TEAS ENGLISH AND LANGUAGE USAGE VOCABULARY Vocabulary questions on this part of

HIGH IMPACT PERSUASIVE PRESENTATION SKILLS atcen.com | info@atcen.com | +603 7728 2623 : 8 th

Market Performance and Planning Forum Welcome to the Inaugural Meeting Karen Edson Vice

Increase Revenue Reducing sliding fee applications Jennifer Hagerty Vice President PointCare

Checking your vision For your PTO For your practices Using strategies for improving

Nondestructive Evaluation Education, Experiences and Career at NASA ASNT Chapter Meeting

MENDHAM TOWNSHIP BUDGET INTRODUCTION MARCH 27, 2018 MAYOR, RICHARD DIEGNAN DEPUTY MAYOR,

Spectral Techniques in the Sirt Basin, Libya Authors: Sam Yates, Irena Kivior, Shiferaw Damte,

Global Network Interference Detection over the RIPE Atlas Network Adventures in Pervasive

Sambuz

Useful Links

Newsletter

Mail Us