SLIDE 1

Acoustic Modeling

Hsin-min Wang

References:

  • 1. X. Huang et al., Spoken Language Processing, Chapter 9
  • 2. The HTK Book
SLIDE 2

Definition of Speech Recognition Problem

For a given acoustic observation $X = X_1 X_2 \ldots X_n$, the goal of speech recognition is to find the corresponding word sequence $W = w_1 w_2 \ldots w_m$ that has the maximum posterior probability $P(W \mid X)$

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_{W} P(X \mid W)\,P(W)$$

where $W = w_1 w_2 \ldots w_m$ and $w_i \in V = \{v_1, v_2, \ldots, v_N\}$

$P(X \mid W)$: acoustic modeling; $P(W)$: language modeling
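To make the decision rule concrete, here is a toy rescoring sketch in Python. The candidate sentences and all log-probabilities below are made-up numbers; in a real system they come from the acoustic and language models discussed in the following slides.

```python
# Made-up candidates with (log P(X|W), log P(W)) scores; P(X) is constant
# over W, so maximizing P(W|X) reduces to maximizing the sum of the two
# log scores, exactly as in the decision rule above.
candidates = {
    "the effect is clear": (-120.3, -9.1),
    "effect is not clear": (-118.7, -12.4),
}
w_hat = max(candidates, key=lambda w: sum(candidates[w]))
print(w_hat)   # "the effect is clear"  (-129.4 beats -131.1)
```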

SLIDE 3

Major Challenges

The practical challenge is how to build accurate acoustic models, P(X|W), and language models, P(W), that can truly reflect the spoken language to be recognized

– For large-vocabulary speech recognition, there are a large number of words. We need to decompose a word into a subword sequence, so P(X|W) is closely related to phonetic modeling
– P(X|W) should take into account speaker variations, pronunciation variations, environment variations, and context-dependent phonetic coarticulation variations
– Any static acoustic or language model will not meet the needs of real applications, so it is vital to dynamically adapt both P(X|W) and P(W) to maximize P(W|X)

The decoding process of finding the best word sequence W to match the input speech signal X in speech recognition systems is more than a simple pattern recognition problem since there are an infinite number of word patterns to search

SLIDE 4

Variability in the Speech Signal

Context Variability

– Context variability at the word/sentence level

  • E.g., "Mr. Wright should write to Ms. Wright right away about his Ford or four door Honda"
  • Same pronunciation but different meaning: Wright, write, right
  • Phonetically identical and semantically relevant: Ford or, four door

– Context variability at the phonetic level

  • The acoustic realization of the phoneme /ee/ in the words peat and wheel depends on its left and right context
  • Fast speech or spontaneous speech?

SLIDE 5

Variability in the Speech Signal (cont.)

Style Variability

– Isolated speech recognition

  • Users have to pause between each word
  • Eliminate errors such as Ford or and four door
  • A significant reduction in computational complexity
  • Unnatural to most people
  • The throughput is significantly lower than that for continuous speech

– Continuous speech recognition

  • The error rate for casual, spontaneous, and conversational speech is higher than for carefully articulated read speech

  • The higher the speaking rate, the higher the error rate
  • Emotional changes cause even more significant variations
SLIDE 6

Variability in the Speech Signal (cont.)

Speaker Variability

– Inter-speaker differences

  • Vocal tract size, length and width of the neck, and a range of physical characteristics: gender, age, dialect, health, education, and personal style

– The same speaker is often unable to precisely produce the same utterance

  • The shape of the vocal tract, the movement, and the rate of delivery may vary from utterance to utterance

– Speaker-independent (SI) speech recognition

  • Large performance fluctuations among different speakers
  • Speakers with accents have higher error rates

– Speaker-dependent (SD) speech recognition

  • With SD data and training, the system can capture the SD acoustic characteristics and thus improve the recognition accuracy
  • A typical SD speech recognition system can reduce the word recognition error by more than 30% compared with a comparable SI speech recognition system

SLIDE 7

Variability in the Speech Signal (cont.)

Environment Variability

– The world we live in is full of sounds of varying loudness from different sources

  • We have to deal with various background sounds (noises)

– In mobile environments, the spectrum of noise varies significantly as the speaker moves around
– Noise may also come from the input device itself, such as microphone and A/D interference noise
– We can reduce error rates by using multi-style training or adaptive techniques
– Environment variability remains one of the most severe challenges facing today's state-of-the-art speech systems

SLIDE 8

Evaluation of Automatic Speech Recognition

Performance evaluation of speech recognition systems is critical, and the word recognition error rate (WER) is one of the most important measures

There are typically three types of word recognition errors

– Substitution
– Deletion
– Insertion

The WER is calculated by aligning the correct word string against the recognized word string

– A maximum substring matching problem, handled by the dynamic programming algorithm
– Example: Correct: "the effect is clear"; Recognized: "effect is not clear" → one deletion and one insertion

$$\text{Word Error Rate} = 100\% \times \frac{\text{Subs} + \text{Dels} + \text{Ins}}{\text{No. of words in the correct sentence}}$$

SLIDE 9

Algorithm to Measure the WER

// i denotes the word length of the correct sentence (Cor)
// j denotes the word length of the recognized sentence (Rec)

// Two common settings of error penalties
subPen = 10;      /* HTK error penalties */
delPen = 7;
insPen = 7;
subPenNIST = 4;   /* NIST error penalties */
delPenNIST = 3;
insPenNIST = 3;

Presentation topic: Write a tool to calculate speech recognition accuracy of the 2nd project, and give a presentation to introduce your algorithm and source codes.
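As a starting point for such a tool, here is a minimal Python sketch of the dynamic-programming alignment and error counting described above. The penalty values are the HTK/NIST settings quoted on this slide; all function and variable names are illustrative.

```python
def wer(cor, rec, sub_pen=10, del_pen=7, ins_pen=7):
    """Word error rate via dynamic-programming string alignment.
    `cor` and `rec` are word lists; the default penalties are the HTK
    settings from the slide (use 4/3/3 for the NIST settings)."""
    n, m = len(cor), len(rec)
    # cost[i][j]: minimal penalty to align cor[:i] with rec[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[''] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i * del_pen, 'D'
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j * ins_pen, 'I'
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            hit = cor[i - 1] == rec[j - 1]
            cost[i][j], back[i][j] = min(
                (cost[i - 1][j - 1] + (0 if hit else sub_pen), 'H' if hit else 'S'),
                (cost[i - 1][j] + del_pen, 'D'),
                (cost[i][j - 1] + ins_pen, 'I'))
    # Trace back along the best path, counting the three error types
    errors = {'S': 0, 'D': 0, 'I': 0, 'H': 0}
    i, j = n, m
    while i > 0 or j > 0:
        op = back[i][j]
        errors[op] += 1
        if op in ('H', 'S'):
            i, j = i - 1, j - 1
        elif op == 'D':
            i -= 1
        else:
            j -= 1
    return 100.0 * (errors['S'] + errors['D'] + errors['I']) / n

print(wer("the effect is clear".split(), "effect is not clear".split()))
# 50.0  (one deletion + one insertion over 4 reference words)
```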

SLIDE 10

Signal Processing – Extracting Features

Signal Acquisition

– Microphone + PC soundcard (sampling rate)

End-Point Detection

– We can use either push-to-talk or continuous listening to activate speech signal acquisition

MFCC and Its Dynamic Features

– Time-domain features vs. frequency-domain features
– Capture temporal changes by using delta coefficients

Feature Transformation

– We can transform the feature vectors to improve class separability
– We can use a number of dimension-reduction techniques to map the feature vectors into more effective representations, e.g., principal component analysis (PCA), linear discriminant analysis (LDA), etc.

(Push-to-talk: push and hold while talking; continuous listening: requires a speech end-point detector)

Presentation topic: LDA for speech recognition
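As an illustration of the delta (dynamic) features mentioned above, here is a minimal sketch assuming the common HTK-style regression formula with window half-width K = 2; function names are illustrative.

```python
import numpy as np

def delta(feats, K=2):
    """Delta coefficients of a (T, D) static-feature array via the
    standard regression formula, padding by repeating the edge frames:
        d_t = sum_{k=1..K} k * (c_{t+k} - c_{t-k}) / (2 * sum_{k=1..K} k^2)
    """
    T = len(feats)
    pad = np.concatenate([feats[:1].repeat(K, axis=0),
                          feats,
                          feats[-1:].repeat(K, axis=0)])
    denom = 2 * sum(k * k for k in range(1, K + 1))
    return np.stack([
        sum(k * (pad[t + K + k] - pad[t + K - k]) for k in range(1, K + 1)) / denom
        for t in range(T)
    ])

# Typical usage: append deltas and delta-deltas to the static MFCC vectors
mfcc = np.random.randn(100, 13)                            # stand-in for real MFCC frames
obs = np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])   # shape (100, 39)
```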

SLIDE 11

Phonetic Modeling – Selecting Appropriate

Units

For general-purpose large vocabulary speech recognition, it is difficult to build whole-word models because

– Every new task contains novel words without available training data, such as proper nouns and newly invented words
– There are simply too many words, and different words may have different acoustic realizations; it is unlikely to have sufficient repetitions of all words in various contexts

Issues in choosing appropriate modeling units

– Accurate: accurately represent the acoustic realizations that appear in different contexts
– Trainable: have enough data to estimate the parameters of the unit (HMM model parameters)
– Generalizable: any new word can be derived from a predefined unit inventory for task-independent speech recognition

SLIDE 12

Comparison of Different Units

Word vs. Subword

– Word: semantic meaning, capturing within-word coarticulation, accurate if enough data are available, trainable only for small tasks, not generalizable

  • For small-vocabulary speech recognition, e.g., digit recognition, whole-word models are both accurate and trainable, and there is no need to be generalizable

– Phone: more trainable and generalizable, but less accurate

  • There are only about 50 phones in English and 30 in Mandarin Chinese
  • The realization of a phoneme is strongly affected by its immediately neighboring phonemes

– Syllable: a compromise between the word and phonetic models.

  • Syllables are larger than phones
  • There are only about 1,300 tone-dependent syllables in Mandarin Chinese and 50 in Japanese, which makes the syllable a suitable unit for these languages
  • The large number of syllables (over 30,000) in English presents a challenge in terms of trainability

SLIDE 13

Context Dependency

Phone and Phoneme

– In speech science, the term phoneme is used to denote any of the minimal units of speech sound in a language that can serve to distinguish one word from another
– The term phone is used to denote a phoneme's acoustic realization
– E.g., the English phoneme /t/ has two very different acoustic realizations in the words sat and meter. We had better treat them as two different phones when building a spoken language system

Why Context Dependency?

– If we make units context dependent, we can significantly improve the recognition accuracy, provided there are enough training data
– A context usually refers to the immediate left and/or right neighboring phones

SLIDE 14

Context Dependency (cont.)

Triphone (Intra-word triphone)

– A triphone model is a phonetic model that takes into consideration both the left and right neighboring phones
– Two phones having the same identity but different left or right contexts are considered different triphones
– Triphone models capture the most important coarticulatory effects
– Trainability is a challenging issue: we need to balance trainability and accuracy with a number of parameter-sharing techniques

Modeling inter-word context-dependent phones is complicated

– The juncture effect on word boundaries is one of the most serious coarticulation phenomena in continuous speech
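To make the triphone notion concrete, here is a small sketch that expands a word's phone string into intra-word triphones written in the HTK-style left-phone+right notation; the function name is illustrative.

```python
def to_triphones(phones):
    """Expand a word's phone sequence into HTK-style intra-word triphones;
    word-boundary phones keep only the context that exists, so the edges
    become biphones."""
    tris = []
    for i, p in enumerate(phones):
        left = phones[i - 1] + '-' if i > 0 else ''
        right = '+' + phones[i + 1] if i < len(phones) - 1 else ''
        tris.append(left + p + right)
    return tris

print(to_triphones(['s', 'ih', 'k', 's']))
# ['s+ih', 's-ih+k', 'ih-k+s', 'k-s']   (the word "six")
```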

SLIDE 15

Context Dependency (cont.)

Stress also plays an important role in the realization of a particular phoneme

– Word-level stress

  • Stressed vowels tend to have longer duration, higher pitch, and more intensity, while unstressed vowels appear to move toward a neutral, central schwa-like phoneme
  • Free stress (e.g., English) vs. bound stress (e.g., French and Polish), in which the position of the stressed syllable is fixed within a word

– Sentence-level stress

  • Represents the overall stress pattern of continuous speech
  • Sentence-level stress is very hard to model without incorporating semantic and pragmatic knowledge

– Usually only word stress is used for creating allophones

  • When the variations resulting from coarticulatory processes can be consciously perceived, the modified phonemes are called allophones

SLIDE 16

Clustered Acoustic-Phonetic Units

Triphone modeling assumes that every triphone context is different. Actually, many phones have similar effects on the neighboring phones

– /b/ and /p/ (or /r/ and /w/) have similar effects on the following vowel

It is desirable to find instances of similar contexts and merge them

– This would lead to a much more manageable number of models that can be better trained

SLIDE 17

Clustered Acoustic-Phonetic Units (cont.)

Model-based clustering

State-based clustering (state-tying)

– Merge the similar states of two models while keeping the dissimilar states distinct

Data-driven state-tying

SLIDE 18

Clustered Acoustic-Phonetic Units (cont.)

Microsoft’s approach - State-based clustering

– Generalize clustering to the state-dependent output distributions across different phonetic models

  • Each cluster represents a set of similar HMM states and is called a senone
  • A subword model is composed of a sequence of senones
SLIDE 19

Clustered Acoustic-Phonetic Units (cont.)

Some example questions used in building senone trees
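The figure itself is not reproduced here. As an illustration, such questions are phonetic-class membership tests on a triphone's left or right context; the class contents and names below are illustrative, not the actual question set.

```python
# A decision-tree "question" is a membership test on a triphone's context.
NASALS = {"m", "n", "ng"}

def left_context(triphone):
    """Extract the left context from an HTK-style triphone such as 'n-iy+t'."""
    return triphone.split("-")[0] if "-" in triphone else None

def q_left_nasal(triphone):
    """Example question: 'Is the left context a nasal?'"""
    return left_context(triphone) in NASALS

print(q_left_nasal("n-iy+t"))   # True  -> goes down the 'yes' branch
print(q_left_nasal("s-iy+t"))   # False -> goes down the 'no' branch
```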

SLIDE 20

Clustered Acoustic-Phonetic Units (cont.)

Comparison of Recognition Performance

SLIDE 21

Lexical Baseforms

When appropriate subword units are used, we must have the correct pronunciation for each word so that concatenation of subword units can accurately represent the word to be recognized

– The dictionary represents the standard pronunciation used as a starting point for building a workable speech recognition system
– The COMLEX dictionary from the LDC has about 90,000 baseforms that cover most words used in many years of The Wall Street Journal
– The CMU dictionary has about 100,000 baseforms
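To illustrate how baseforms are used, here is a minimal sketch of a lexicon and of building a word model by concatenating subword (phone) HMMs. The entries and names are illustrative, not taken from COMLEX or the CMU dictionary.

```python
# Illustrative baseform entries (word -> phone sequence)
LEXICON = {
    "speech": ["s", "p", "iy", "ch"],
    "wright": ["r", "ay", "t"],
    "write":  ["r", "ay", "t"],    # homophones share the same baseform
}

def word_model(word, phone_hmms):
    """Build a word model by concatenating subword (phone) HMMs
    according to the word's baseform in the lexicon."""
    return [phone_hmms[p] for p in LEXICON[word]]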

SLIDE 22

Lexical Baseforms – Proper Names

Dictionaries often don’t include proper names

– The 20,000 names included in the COMLEX dictionary are a small fraction of the 1-2 million names in the USA

We often need to derive their pronunciation automatically

– Unlike in Spanish or Italian, rule-based letter-to-sound (LTS) conversion is often impractical for English, since so many English words don't follow phonological rules
– A trainable LTS converter is attractive, since its performance can be improved by constantly learning from examples so that it can generalize rules for the specific task

  • Neural networks, HMMs or CART (See Chap. 4)
  • Need an exception dictionary
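A minimal sketch of the lookup strategy implied by the last two bullets, assuming a trained LTS model object with a predict method; all names here are hypothetical.

```python
# Hypothetical sketch: exception dictionary first, trainable LTS as fallback
EXCEPTIONS = {
    "wright": ["r", "ay", "t"],    # proper name that breaks the usual rules
}

def pronounce(word, lts_model):
    """Return a phone sequence: look the word up in the exception
    dictionary, otherwise back off to the trainable LTS converter
    (e.g., a neural network, HMM, or CART model)."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    return lts_model.predict(word)   # `predict` is an assumed interface
```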
SLIDE 23

Lexical - Pronunciation Variation

We also need to provide alternative pronunciations for words that may have very different pronunciations

In continuous speech recognition, we must also use phonological rules to modify interword pronunciations or to allow reduced sounds

– Assimilation (同化)

  • A typical coarticulation phenomenon: a change in a segment to make it more like a neighboring segment
  • "did you" /d ih jh y ah/, "set you" /s eh ch er/

– Deletion

  • /t/ and /d/ are often deleted before a consonant

Pronunciations in spontaneous speech differ significantly from the standard baseform

SLIDE 24

Pronunciation Network

To incorporate widespread pronunciations, we can use a probabilistic finite state machine to model each word’s pronunciation variations
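A minimal sketch of such a probabilistic finite state machine, using the word tomato with invented arc probabilities; the data structure and names are illustrative.

```python
# Pronunciation network as a probabilistic FSM: each state maps to a list
# of (phone, probability, next_state) arcs; alternative arcs encode
# pronunciation variants. Probabilities below are made up.
NETWORK = {
    0: [("t", 1.0, 1)],
    1: [("ah", 1.0, 2)],
    2: [("m", 1.0, 3)],
    3: [("ey", 0.7, 4), ("aa", 0.3, 4)],   # the variant arc
    4: [("t", 1.0, 5)],
    5: [("ow", 1.0, 6)],
    6: [],                                  # final state
}

def paths(state=0, prob=1.0, phones=()):
    """Enumerate all pronunciations with their path probabilities."""
    if not NETWORK[state]:
        yield list(phones), prob
    for phone, p, nxt in NETWORK[state]:
        yield from paths(nxt, prob * p, phones + (phone,))

for pron, p in paths():
    print(p, " ".join(pron))
# 0.7 t ah m ey t ow
# 0.3 t ah m aa t ow
```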

SLIDE 25

Acoustic Modeling – Scoring Acoustic Features

After feature extraction, we have a sequence of acoustic feature vectors X = x1x2…xn, such as MFCC vectors, as our input data

We need to estimate the probability of these features given the word or phonetic model W, so that we can recognize the input data as the correct word

This probability is referred to as the acoustic probability, P(X|W)

– The most successful method for acoustic modeling is the HMM
– When the amount of training data is sufficient, parameter tying becomes unnecessary
– Discrete, continuous, and semicontinuous HMMs

  • A continuous model with a large number of mixtures offers the best recognition accuracy
  • The discrete model is computationally efficient, but has the worst performance
  • The semicontinuous model provides a viable compromise between system robustness and trainability

SLIDE 26

Review of HMM Modeling

Three types of HMM state output probabilities

– Discrete HMM (DHMM): $b_j(k) = P(o_t = v_k \mid s_t = j)$

  • The observations are quantized into a number of symbols by vector quantization (using a codebook of M codewords, $\{v_1, v_2, \ldots, v_M\}$)

– Continuous HMM (CHMM)

  • The state observation distribution of the HMM is modeled by multivariate Gaussian mixture density functions (M mixtures)

– Semicontinuous or tied-mixture HMM (SCHMM)

  • The HMM state mixture density functions are tied together across all the models to form a set of shared kernels (shared Gaussians)

CHMM state output probability:

$$b_j(\mathbf{x}) = \sum_{k=1}^{M} c_{jk}\, N(\mathbf{x};\, \boldsymbol{\mu}_{jk}, \boldsymbol{\Sigma}_{jk}) = \sum_{k=1}^{M} \frac{c_{jk}}{(2\pi)^{L/2} \left|\boldsymbol{\Sigma}_{jk}\right|^{1/2}} \exp\!\left(-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{jk}\right)^{T} \boldsymbol{\Sigma}_{jk}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{jk}\right)\right)$$

where the mixture weights satisfy $\sum_{k=1}^{M} c_{jk} = 1$ and $c_{jk} \geq 0$.

SCHMM state output probability (all states share one set of codebook Gaussians):

$$b_j(\mathbf{x}) = \sum_{k=1}^{M} b_j(k)\, f(\mathbf{x} \mid v_k) = \sum_{k=1}^{M} b_j(k)\, N(\mathbf{x};\, \boldsymbol{\mu}_{k}, \boldsymbol{\Sigma}_{k})$$
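As a companion to the CHMM formula above, here is a minimal sketch that evaluates log b_j(x) for one state, assuming diagonal covariance matrices; function and variable names are illustrative.

```python
import numpy as np

def log_output_prob(x, c, mu, var):
    """log b_j(x) for one HMM state with an M-component Gaussian mixture,
    where c is (M,), mu and var are (M, L) (diagonal covariances).
    Uses a log-sum-exp over components for numerical stability."""
    L = x.shape[0]
    # per-component log( c_k * N(x; mu_k, Sigma_k) )
    log_norm = -0.5 * (L * np.log(2 * np.pi) + np.sum(np.log(var), axis=1))
    log_exp = -0.5 * np.sum((x - mu) ** 2 / var, axis=1)
    terms = np.log(c) + log_norm + log_exp
    m = terms.max()
    return m + np.log(np.exp(terms - m).sum())

# Toy check: a 2-component mixture over 3-dimensional features
x = np.zeros(3)
c = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
var = np.ones((2, 3))
print(log_output_prob(x, c, mu, var))
```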

SLIDE 27

Review of HMM Modeling (cont.)

The most important parameter for the output probability distribution is the number of mixtures or the size of the codebook

Comparison of Recognition Performance

SLIDE 28

Isolated vs. Continuous Speech Training

The added null-arc transition probabilities should satisfy the constraint $\sum_j \hat{a}_{ij} = 1$

The ability to automatically align each individual HMM to the corresponding unsegmented speech observation sequence is one of the most powerful features of the forward-backward algorithm

SLIDE 29

Isolated vs. Continuous Speech Training (cont.)

If we have a continuous training utterance "one three", we compose a sentence HMM by concatenating the corresponding word HMMs

No explicit effort is needed to find word boundaries
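A minimal sketch of composing the sentence HMM by concatenation; the word-HMM representation here is deliberately schematic, and all names are illustrative.

```python
# Schematic word HMMs: here just named state lists; in practice these are
# full HMMs joined by null arcs (see the constraint on the previous slide).
WORD_HMMS = {
    "one":   ["one.1", "one.2", "one.3"],
    "three": ["three.1", "three.2", "three.3"],
}

def sentence_hmm(transcript):
    """Compose the sentence HMM for a transcription such as 'one three'
    by concatenating the word HMMs; the forward-backward algorithm then
    aligns states to frames, so no explicit word boundaries are needed."""
    states = []
    for word in transcript.split():
        states.extend(WORD_HMMS[word])
    return states

print(sentence_hmm("one three"))
```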

SLIDE 30

Characteristics of Mandarin Chinese

Four levels of linguistic units

A monosyllabic-structure language

– All characters are monosyllabic

Most characters are morphemes (詞素)

A word is composed of one to several characters

Homophones

– Different characters sharing the same syllable

(The four levels, from phonological to semantic significance: INITIAL/FINAL → syllable → character → word)

SLIDE 31

Characteristics of Mandarin Chinese

Chinese syllable structure

(Example: the syllable ㄐㄧㄤ "jiang" decomposes into the INITIAL /j/ and the FINAL (韻) /iang/, where the FINAL consists of the medial glide (滑音) /i/, the nucleus /a/, and the ending /ng/)

SLIDE 32

Phonetic Structure of Mandarin Syllables

– Syllables: 1,345
– Base-syllables (ignoring tone): 408
– INITIALs: 21
– FINALs: 37
– Phone-like units/phones: 33
– Tones: 4 + 1 (neutral tone)

SLIDE 33

Sub-syllable HMM Modeling for Mandarin Chinese

INITIALs

SLIDE 34

Sub-syllable HMM Modeling for Mandarin Chinese

FINALs

(Table of FINAL models, e.g., io (ㄧㄛ), ㄤ; 38 FINALs in total)
38 FINALs