Automatic Speech Recognition (CS753)
Lecture 20: Pronunciation Modeling
Instructor: Preethi Jyothi
Oct 16, 2017
Pronunciation Dictionary/Lexicon
- Pronunciation model/dictionary/lexicon: Lists one or more pronunciations for a word
- Typically derived from language experts: a sequence of phones written down for each word
- Dictionary construction involves:
- 1. Selecting which words to include in the dictionary
- 2. Listing the pronunciation of each word, checking for multiple pronunciations (a toy sketch follows)
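To make the two steps concrete, here is a minimal sketch of such a lexicon in Python; the words and ARPAbet-style phone strings are illustrative:

```python
# A toy pronunciation lexicon: each word maps to one or more phone
# sequences (ARPAbet-style phones; all entries are illustrative).
lexicon = {
    "the":   [["dh", "ah"], ["dh", "iy"]],  # a word with two pronunciations
    "sense": [["s", "eh", "n", "s"]],
    "solve": [["s", "ao", "l", "v"]],
}

def pronunciations(word):
    """Return all listed pronunciations; [] marks an OOV word."""
    return lexicon.get(word.lower(), [])

print(pronunciations("the"))  # [['dh', 'ah'], ['dh', 'iy']]
```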
Grapheme-based models
Graphemes vs. Phonemes
- Instead of a pronunciation dictionary, one could represent a pronunciation as a sequence of graphemes (or letters). That is, model at the grapheme level.
- Useful technique for low-resourced/under-resourced languages
- Main advantages:
- 1. Avoid the need for phone-based pronunciations
- 2. Avoid the need for a phone alphabet
- 3. Works pretty well for languages with a direct link between graphemes (letters) and phonemes (sounds); see the sketch below
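As a minimal sketch of the idea (assuming a language written in a simple alphabetic script), a graphemic "lexicon" can be derived mechanically from the spelling:

```python
# Minimal sketch of a graphemic lexicon: the "pronunciation" of a word
# is just its letter sequence, so no phone set or expert dictionary is
# needed. Real systems (e.g. the unicode-based graphemic systems cited
# on the next slide) also normalize and decompose characters; this toy
# version only lowercases.
def graphemic_pronunciation(word):
    return list(word.lower())

for w in ["sense", "solve"]:
    print(w, "->", graphemic_pronunciation(w))
# sense -> ['s', 'e', 'n', 's', 'e']
# solve -> ['s', 'o', 'l', 'v', 'e']
```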
Grapheme-based ASR
Image from: Gales et al., Unicode-based graphemic systems for limited resource languages, ICASSP 2015
WER (%) by decoding method (Vit: Viterbi; CN: confusion network; CNC: confusion-network combination of the two systems, hence one value per language):

Language           ID    System      Vit    CN     CNC
Kurmanji Kurdish   205   Phonetic    67.6   65.8   64.1
                         Graphemic   67.0   65.3
Tok Pisin          207   Phonetic    41.8   40.6   39.4
                         Graphemic   42.1   41.1
Cebuano            301   Phonetic    55.5   54.0   52.6
                         Graphemic   55.5   54.2
Kazakh             302   Phonetic    54.9   53.5   51.5
                         Graphemic   54.0   52.7
Telugu             303   Phonetic    70.6   69.1   67.5
                         Graphemic   70.9   69.5
Lithuanian         304   Phonetic    51.5   50.2   48.3
                         Graphemic   50.9   49.5
Grapheme to phoneme (G2P) conversion
- Produce a pronunciation (phoneme sequence) given a written word (grapheme sequence)
- Learn G2P mappings from a pronunciation dictionary
- Useful for:
- ASR systems in languages with no pre-built lexicons
- Speech synthesis systems
- Deriving pronunciations for out-of-vocabulary (OOV) words
G2P conversion (I)
- One popular paradigm: joint sequence models [BN12]
- Grapheme and phoneme sequences are first aligned using an EM-based algorithm
- This results in a sequence of graphones (joint grapheme-phoneme tokens)
- N-gram models are trained on these graphone sequences (see the sketch below)
- A WFST-based implementation of such a joint graphone model is available [Phonetisaurus]
[BN12] Bisani & Ney, "Joint sequence models for grapheme-to-phoneme conversion", Speech Communication, 2012. [Phonetisaurus] J. Novak, Phonetisaurus Toolkit.
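As an illustration of the graphone idea, here is a minimal Python sketch. The grapheme-phoneme alignments are hand-specified toy examples (in practice they are learned with the EM-based procedure above), and the n-gram model is an unsmoothed bigram:

```python
from collections import Counter

# A graphone is a joint grapheme/phoneme token, e.g. ("se", "s").
# The alignments below are illustrative, not from a real aligner.
aligned = [
    [("s", "s"), ("e", "eh"), ("n", "n"), ("se", "s")],   # "sense"
    [("s", "s"), ("o", "ao"), ("l", "l"), ("ve", "v")],   # "solve"
]

# Train a bigram model over graphone tokens (with BOS/EOS markers).
bigrams, unigrams = Counter(), Counter()
for seq in aligned:
    toks = [("<s>", "<s>")] + seq + [("</s>", "</s>")]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))

def bigram_prob(prev, cur):
    """Maximum-likelihood bigram probability (no smoothing)."""
    return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob(("s", "s"), ("e", "eh")))  # 0.5 in this toy corpus
```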
G2P conversion (II)
- Neural network based methods are the new state of the art for G2P:
- Bidirectional LSTM-based networks using a CTC output layer [Rao15]; comparable to N-gram models
- Incorporating alignment information [Yao15]; beats N-gram models
- No alignment: encoder-decoder with attention [Toshniwal16]; beats the above systems
LSTM + CTC for G2P conversion [Rao15]
Model                                 Word Error Rate (%)
Galescu and Allen [4]                 28.5
Chen [7]                              24.7
Bisani and Ney [2]                    24.5
Novak et al. [6]                      24.4
Wu et al. [12]                        23.4
5-gram FST                            27.2
8-gram FST                            26.5
Unidirectional LSTM with Full-delay   30.1
DBLSTM-CTC 128 Units                  27.9
DBLSTM-CTC 512 Units                  25.8
DBLSTM-CTC 512 + 5-gram FST           21.3
[Rao15] Grapheme-to-phoneme conversion using LSTM RNNs, ICASSP 2015
Seq2seq models (with alignment information [Yao15])
Method                                 PER (%)   WER (%)
encoder-decoder LSTM                   7.53      29.21
encoder-decoder LSTM (2 layers)        7.63      28.61
uni-directional LSTM                   8.22      32.64
uni-directional LSTM (window size 6)   6.58      28.56
bi-directional LSTM                    5.98      25.72
bi-directional LSTM (2 layers)         5.84      25.02
bi-directional LSTM (3 layers)         5.45      23.55
Data      Method                 PER (%)   WER (%)
CMUDict   past results [20]      5.88      24.53
          bi-directional LSTM    5.45      23.55
NetTalk   past results [20]      8.26      33.67
          bi-directional LSTM    7.38      30.77
Pronlex   past results [20,21]   6.78      27.33
          bi-directional LSTM    6.51      26.69
[Yao15] Sequence-to-sequence neural net models for G2P conversion, Interspeech 2015
Encoder-decoder + attention for G2P [Toshniwal16]
[Toshniwal16] Jointly learning to align and convert graphemes to phonemes with neural attention models, SLT 2016.
[Figure: encoder-decoder with attention. Encoder states h1 … hTg are computed over input graphemes x1 … xTg; attention weights αt give a context vector ct, which is combined with the decoder state dt to predict the output phoneme yt.]
Data      Method                                           PER (%)
CMUDict   BiDir LSTM + Alignment [6]                       5.45
          DBLSTM-CTC [5]                                   —
          DBLSTM-CTC + 5-gram model [5]                    —
          Encoder-decoder + global attn                    5.04 ± 0.03
          Encoder-decoder + local-m attn                   5.11 ± 0.03
          Encoder-decoder + local-p attn                   5.39 ± 0.04
          Ensemble of 5 [Encoder-decoder + global attn]    4.69
Pronlex   BiDir LSTM + Alignment [6]                       6.51
          Encoder-decoder + global attn                    6.24 ± 0.1
          Encoder-decoder + local-m attn                   5.99 ± 0.11
          Encoder-decoder + local-p attn                   6.49 ± 0.06
NetTalk   BiDir LSTM + Alignment [6]                       7.38
          Encoder-decoder + global attn                    7.14 ± 0.72
          Encoder-decoder + local-m attn                   7.13 ± 0.11
          Encoder-decoder + local-p attn                   8.41 ± 0.19
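To make the attention mechanism concrete, here is a minimal numpy sketch of one decoder step of global dot-product attention, in the figure's notation (encoder states h, decoder state dt, weights αt, context ct). The vectors are random stand-ins, and [Toshniwal16] additionally explores local attention variants:

```python
import numpy as np

# One decoder step of global (dot-product) attention, with random
# vectors standing in for real encoder/decoder states.
rng = np.random.default_rng(0)
Tg, dim = 5, 8                       # grapheme length, hidden size
H = rng.standard_normal((Tg, dim))   # encoder states, one row per h_j
d_t = rng.standard_normal(dim)       # current decoder state

scores = H @ d_t                     # e_tj = h_j . d_t
alpha_t = np.exp(scores - scores.max())
alpha_t /= alpha_t.sum()             # softmax over input positions
c_t = alpha_t @ H                    # context c_t: weighted sum of h_j

print(alpha_t.round(3), c_t.shape)   # weights over Tg inputs, (8,)
```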
Sub-phonetic feature-based models
Pronunciation Model
- Each word is a sequence of phones
- Tends to be highly language dependent
Phone-Based vs. Articulatory Features
- Articulatory features: parallel streams of articulator movements
- Based on the theory of articulatory phonology [1]

[Figure: phone-based representation of "sense": PHONE stream s eh n s]

[1] C. P. Browman and L. Goldstein, Phonology '86
Pronunciation Model

[Figure: articulatory feature streams for "sense" (PHONE stream s eh n s), over variables LIP-LOC, LIP-OPEN, TT-LOC, TT-OPEN, TB-LOC, TB-OPEN, GLOTTIS, VELUM:
LIP: open/labial
TON.TIP: critical/alveolar, mid/alveolar, closed/alveolar, critical/alveolar
TON.BODY: mid/uvular, mid/palatal, mid/uvular
GLOTTIS: open, critical, open
VELUM: closed, open, closed]
Articulatory Features
- Parallel streams of articulator movements
- Based on the theory of articulatory phonology [1]
- Example: pronunciations for the word "sense"
- Simple asynchrony across feature streams can appear as many phone alterations (a toy sketch follows the figure below)

[Adapted from Livescu '05]
[Figure: CANONICAL vs. OBSERVED feature streams for "sense".
CANONICAL (PHONE: s eh n s): LIP open/labial; TT critical/alveolar, mid/alveolar, closed/alveolar, critical/alveolar; TB mid/uvular, mid/palatal, mid/uvular; GLOTTIS open, critical, open; VELUM closed, open, closed.
OBSERVED: asynchrony between the velum and tongue-tip streams nasalizes the vowel (eh → eh_n) and inserts an epenthetic [t].]
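A minimal Python sketch of how stream asynchrony produces such variants; the frame-level values and the frame-to-phone mapping are made up for illustration:

```python
# Two articulatory streams for "sense" as frame-by-frame values
# (8 frames, 2 per canonical phone; values abbreviated).
TT  = ["crit"] * 2 + ["mid"] * 2 + ["clo"] * 2 + ["crit"] * 2      # tongue tip
VEL = ["clo"] * 2 + ["clo", "open", "open", "open"] + ["clo"] * 2  # velum

def phones(tt, vel):
    """Map each (tongue tip, velum) frame pair to a phone label."""
    mapping = {("crit", "clo"): "s", ("mid", "clo"): "eh",
               ("mid", "open"): "eh_n", ("clo", "open"): "n",
               ("clo", "clo"): "t"}   # oral closure with velum shut ~ [t]
    return [mapping[pair] for pair in zip(tt, vel)]

print(phones(TT, VEL))        # ['s','s','eh','eh_n','n','n','s','s']
VEL_early = VEL[1:] + ["clo"] # velum gestures shifted one frame earlier
print(phones(TT, VEL_early))  # ['s','s','eh_n','eh_n','n','t','s','s']: epenthetic [t]
```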
Dynamic Bayesian Networks (DBNs)
- Provides a natural framework to efficiently encode multiple streams of articulatory features
- Simple DBN with three random variables in each time frame (a filtering sketch follows the figure)
[Figure: random variables At, Bt, Ct unrolled over frames t-1, t, t+1.]
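As a minimal sketch of inference in such a DBN, the following assumes binary variables, hand-picked conditional probability tables, and dependencies A_t on A_{t-1}, B_t on (B_{t-1}, A_t), and C_t on (C_{t-1}, B_t); it does exact forward filtering over the joint state:

```python
import numpy as np
from itertools import product

# Made-up CPTs for illustration.
p_A = np.array([[0.9, 0.1], [0.2, 0.8]])                              # P(A_t | A_{t-1})
def p_B(b_prev, a): return [0.8, 0.2] if b_prev == a else [0.3, 0.7]  # P(B_t | B_{t-1}, A_t)
def p_C(c_prev, b): return [0.7, 0.3] if c_prev == b else [0.4, 0.6]  # P(C_t | C_{t-1}, B_t)

states = list(product([0, 1], repeat=3))   # joint (a, b, c) states
belief = np.full(len(states), 1 / 8)       # uniform prior over frame 0

for _ in range(3):                         # unroll three frames
    new = np.zeros_like(belief)
    for i, (a0, b0, c0) in enumerate(states):
        for j, (a1, b1, c1) in enumerate(states):
            new[j] += belief[i] * p_A[a0, a1] * p_B(b0, a1)[b1] * p_C(c0, b1)[c1]
    belief = new

print(belief.round(3), belief.sum())       # filtered joint, sums to 1
```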
DBN model of pronunciation

[Figure: DBN for the word "solve" (phones s ao l v, word positions 1 2 3 …). Variables include Word, Posn, Phn, Trans, Prev-Phone; per-articulator phone variables L-Phn, T-Phn, G-Phn with asynchrony lags L-Lag, T-Lag, G-Lag; feature variables Lip-Op, TT-Op, Glot; and observed surface feature values surLip-Op, surTT-Op, surGlot.]

[P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech '12]
Factorized DBN model [1]: Cascade of Finite State Machines

[Figure: the DBN variables are partitioned into five sets, Set1-Set5, each compiled into a finite-state machine.]
[Figure: the resulting cascade of machines F1-F5. Along the cascade, the symbol alphabets move from Word, through (Phn, Trans, Posn), the lags (L-Lag, T-Lag, G-Lag), (Prev-Phn, Phn) with the per-articulator phones (L-Phn, T-Phn, G-Phn), and the feature values (Lip-Op, TT-Op, Glot), to the observed surface values (surLip-Op, surTT-Op, surGlot).]
[1] P. Jyothi, E. Fosler-Lussier, K. Livescu, Interspeech '12
Weighted Finite State Machine
- Decoding: For input X, find the path with minimum cost:

    a* = argmin_{path a} w_α(X, a)

  where w_α(X, a) is the weight of path a on input X, and α are the learned parameters.
- Linear model: w_α(X, a) = α · φ(X, a), where φ is a feature function.

[Figure: a weighted FST with arcs labeled input:output/weight, e.g. x1:y1/1.5, x2:y2/1.3, x3:y3/2.0, x4:y4/0.6]
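A minimal sketch of such decoding as a shortest-path search over a toy machine; the arc weights stand in for precomputed values of α · φ on each arc:

```python
import heapq

# Arcs: state -> list of (next_state, input, output, weight).
# The tiny machine below is illustrative.
arcs = {
    0: [(1, "x1", "y1", 1.5), (1, "x2", "y2", 1.3)],
    1: [(2, "x3", "y3", 2.0), (2, "x4", "y4", 0.6)],
}
start, final = 0, 2

def decode(arcs, start, final):
    """Return (cost, output labels) of the minimum-cost path."""
    heap = [(0.0, start, [])]
    best = {}
    while heap:
        cost, state, out = heapq.heappop(heap)
        if state == final:
            return cost, out            # first pop of final = min cost
        if best.get(state, float("inf")) <= cost:
            continue                    # already reached more cheaply
        best[state] = cost
        for nxt, _inp, outp, w in arcs.get(state, []):
            heapq.heappush(heap, (cost + w, nxt, out + [outp]))
    return float("inf"), []

print(decode(arcs, start, final))  # (1.9, ['y2', 'y4'])
```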
Discriminative Training
- Online discriminative training algorithm to learn α
- Similar to the structured perceptron [Collins '02]: each training sample gives a "decoded path" and a "correct path"; update α to bias towards the correct path (a toy sketch follows)
- Use a large-margin training algorithm adapted to work with a cascade of finite state machines [1]
[1] P. Jyothi, E. Fosler-Lussier & K. Livescu, Interspeech '13
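A minimal sketch of the perceptron-style update; the feature function here simply counts labels on a path, whereas real feature functions over (X, a) would be richer:

```python
from collections import Counter

def phi(path):
    """Toy feature vector: counts of labels on the path."""
    return Counter(path)

def perceptron_update(alpha, correct, decoded, lr=1.0):
    """Move alpha toward the correct path's features, away from the decoded path's."""
    if decoded == correct:
        return alpha                    # no mistake, no update
    updated = Counter(alpha)
    for f, v in phi(correct).items():
        updated[f] += lr * v            # reward correct-path features
    for f, v in phi(decoded).items():
        updated[f] -= lr * v            # penalize decoded-path features
    return updated

alpha = Counter()
alpha = perceptron_update(alpha, ["s", "eh", "n", "s"], ["s", "eh", "n", "t", "s"])
print(alpha)  # shared features cancel; the spurious 't' gets weight -1
```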