

SLIDE 1

Pattern Recognition

Gerhard Schmidt
Christian-Albrechts-Universität zu Kiel, Faculty of Engineering
Institute of Electrical and Information Engineering
Digital Signal Processing and System Theory

Part 9: Speaker and Speech Recognition

SLIDE 2

Contents

❑ Literature
❑ Speaker recognition
  ❑ Motivation
  ❑ Speaker verification and speaker identification
  ❑ Model adaptation
  ❑ Discriminative approaches
❑ Speech recognition
  ❑ Fundamentals
  ❑ Statistical speech recognition
❑ Conclusion and outlook

SLIDE 3

Literature

Gaussian mixture models:

❑ C. M. Bishop: Pattern Recognition and Machine Learning, Springer, 2006
❑ L. Rabiner, B. H. Juang: Fundamentals of Speech Recognition, Prentice Hall, 1993

Speech recognition:

❑ C. M. Bishop: Pattern Recognition and Machine Learning, Springer, 2006
❑ B. Pfister, T. Kaufmann: Sprachverarbeitung, Springer, 2008 (in German)

Speaker recognition:

❑ G. Kolano: Lernverfahren zur Sprecherverifikation, Shaker, 2000 (in German)
❑ J. Benesty et al.: Handbook on Speech Processing, chapters 37 and 38 on "Speaker Recognition", Springer, 2008

SLIDE 4

Contents

❑ Literature
❑ Speaker recognition
  ❑ Motivation
  ❑ Speaker verification and speaker identification
  ❑ Model adaptation
  ❑ Discriminative approaches
❑ Speech recognition
  ❑ Fundamentals
  ❑ Statistical speech recognition
❑ Conclusion and outlook

SLIDE 5

Motivation

Applications for speaker recognition:

❑ Admission control (e.g., supplementing immobilizer systems in cars, or controlling admission to protected areas or rooms).
❑ Personalization of speech services (systems recognize the user/caller again and can access preference databases).
❑ Improvement of speech signal enhancement schemes (e.g., speaker-specific signal reconstruction).
❑ The post-training (optimization) of a speech recognition system can be done speaker-dependently: if a speech dialog system is used by multiple users, the post-training/adaptation of the recognizer can be performed for each speaker individually.

SLIDE 6

Variants of Speaker Recognition – Part 1

Differentiation between verification and identification:

❑ Speaker verification: binary decision – is a speaker really the person he claims to be?
❑ Speaker identification: 1-out-of-N decision – which one of N speakers is active?

SLIDE 7

Variants of Speaker Recognition – Part 2

Differentiation between text-dependent and text-independent speaker verification:

❑ Text-dependent verification: the speaker knows a password that he has to speak, or a new password that has to be spoken is provided for every verification.
❑ Text-independent verification: the speaker's utterance is unknown.

SLIDE 8

Variants of Speaker Recognition – Part 3

Differentiation between closed-set and open-set identification:

❑ Closed-set identification: all potential speakers are known in advance – no new speakers are added later.
❑ Open-set identification: the potential speakers are not known in advance. It is not necessarily known how many speakers exist.

SLIDE 9

Variants of Speaker Recognition – Part 4

Again, a differentiation between text-dependent and text-independent variants is possible.

SLIDE 10

Variants of Speaker Recognition – Part 5

Differentiation between non-discriminant and discriminant training methods:

❑ Non-discriminant training: the models are trained for each speaker independently, i.e., each model has to fit the extracted training data as well as possible – however, good discrimination from other speakers is not considered.
❑ Discriminant training: all speakers are considered during the training of the models, in order to fit the individual models not only to one speaker, but also to learn the differences between the speaker features.

SLIDE 11

Basics of Speaker Recognition – Part 1

Speaker verification (block diagram): distortion-reducing preprocessing and segmentation → feature extraction (with normalization), applied to the short-term spectrum of the distortion-reduced signal → accumulation of the single logarithmic probabilities or distances over time → binary decision. The feature vectors are compared against a model for the features of the speaker to be verified and against a universal background model for other speakers; the decision is fed back for adapting the model.

SLIDE 12

Basics of Speaker Recognition – Part 2

Speaker identification (block diagram): distortion-reducing preprocessing and segmentation → feature extraction (with normalization), applied to the short-term spectrum of the distortion-reduced signal → accumulation of the single logarithmic probabilities or distances over time → 1-out-of-(N+1) decision. The feature vectors are compared against the speaker models 1 to N and against a universal background model for other speakers; if none of the existing models fits, a new speaker model is generated.

SLIDE 13

Difficulties in Speaker Recognition

Some typical problems:

❑ In many practical applications, only a relatively small amount of training data is available for the individual speakers. Additionally, this training data is often not phonetically "balanced". During the recognition itself, a decision should be made as fast as possible.
❑ As a consequence, text-independent systems develop a strong text dependency: speaker A speaks words that are contained in the small training set of speaker B, but not in his own. For a small amount of training data, the probability of (wrongly) identifying speaker B is then rather high.
❑ It is often reported in the literature that preprocessing or normalization has a negative influence on the recognition rate. This is true if the recording conditions during training and test match well. However, such a match between training and test conditions is not always given in practice.
❑ Speech pauses should be removed before the recognition task itself. Otherwise, the background noise will have a strong influence on the decision: speakers with similar background noise during recording will be preferred.

SLIDE 14

Preprocessing and Segmentation – Part 1

Subband structure (block diagram): analysis filterbank → input PSD estimation and noise PSD estimation → filter characteristic → segmentation (PSD = power spectral density).

SLIDE 15

Preprocessing and Segmentation – Part 2

Noise reduction:

❑ Noise reduction without limitation of the attenuation (needed for the segmentation)
❑ Noise reduction with limitation of the attenuation (needed for the signal enhancement)

Segmentation:

If the noise reduction filter is open in 10…30 percent of all subbands, the current frame is classified as containing speech (see the sketch below).
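
This rule can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's exact implementation: the Wiener-style gain and the parameters `open_threshold` and `speech_fraction` are assumptions.

```python
import numpy as np

def classify_speech_frames(input_psd, noise_psd,
                           open_threshold=0.5, speech_fraction=0.2):
    """Frame-wise speech/pause classification.

    input_psd, noise_psd: arrays of shape (num_frames, num_subbands)
    containing the estimated power spectral densities. A subband
    filter is considered 'open' if its Wiener-style gain exceeds
    open_threshold; a frame is classified as speech if the fraction
    of open subbands exceeds speech_fraction (the slide suggests
    10...30 percent).
    """
    # Wiener filter gain per subband, without limiting the attenuation
    gain = np.maximum(1.0 - noise_psd / np.maximum(input_psd, 1e-12), 0.0)
    open_ratio = np.mean(gain > open_threshold, axis=1)
    return open_ratio > speech_fraction
```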

SLIDE 16

Preprocessing and Segmentation – Part 3

Example: time-frequency analyses (frequency in Hz over time in seconds) of

❑ the noisy input signal,
❑ the signal after noise reduction, and
❑ the signal after segmentation.

SLIDE 17

Feature Extraction – Part 1

Mel-frequency cepstral coefficients (MFCCs):

Processing chain: computation of the (squared) magnitude → mel filtering → logarithm → discrete cosine transform.

❑ The first (zeroth) coefficient of the feature vectors is often replaced by the normalized short-term power of the current signal frame.
❑ The normalization is done such that the maximum short-term power of an utterance is mapped to a defined value.
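
As an illustration of this processing chain, a minimal Python sketch (the windowing and the mel filterbank `mel_filterbank` are assumed to be given; this is not the lecture's exact implementation):

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_frame(frame, mel_filterbank, num_coeffs=13):
    """MFCCs for one windowed signal frame, following the chain above:
    squared magnitude -> mel filtering -> logarithm -> DCT."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2               # (squared) magnitude
    mel_energies = mel_filterbank @ spectrum                 # mel filtering
    log_energies = np.log(np.maximum(mel_energies, 1e-12))   # logarithm
    coeffs = dct(log_energies, type=2, norm='ortho')         # discrete cosine transform
    # As remarked above, coeffs[0] is often replaced by the
    # normalized short-term power of the current frame.
    return coeffs[:num_coeffs]
```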

SLIDE 18

Feature Extraction – Part 2

Some remarks:

❑ Many publications deal with the selection of features. The most common conclusion is that a compact representation of the short-term spectral envelope should be used.
❑ MFCCs and cepstral coefficients (with slight modifications) have proven to be useful.
❑ It is astonishing that these are the same features that are used for speech recognition – there, the interest is to remove differences between speakers in order to obtain only information about the words that have been spoken.
❑ However, it should be mentioned that different preprocessing is used for speaker and speech recognition.
❑ As a consequence, it can be concluded that speaker-specific speech recognition yields better results than non-speaker-specific recognition – this can also be observed in practice. For this reason, it is often desired to adapt the models of a speech recognition system to the current speaker.

SLIDE 19

Speaker Recognition With Codebooks – Recognition Phase

Flow chart – speaker verification: a test utterance passes through distance calculation with the speaker-specific feature codebook and with the background codebook; the distances are compared with consideration of the speaker-specific threshold, which leads to acceptance or rejection of the speaker identity under test.

SLIDE 20

Speaker Recognition With Codebooks – Training Phase

Flow chart – speaker verification (training): the speech data of a speaker and the speech data of the background speakers each pass through feature extraction and codebook training. The speaker-specific feature codebook and the background feature codebook are saved; from both, the speaker-specific thresholds are calculated and saved in the speaker-specific threshold codebook.
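
A minimal training sketch in Python (k-means style codebook training and an average-distance score; the initialization, the threshold calculation, and all parameter names are assumptions, not the system described in the lecture):

```python
import numpy as np

def train_codebook(features, num_entries, num_iters=20, seed=0):
    """k-means (LBG-style) codebook training on the rows of `features`."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), num_entries, replace=False)].copy()
    for _ in range(num_iters):
        # assign each feature vector to its nearest codebook entry
        dist = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(num_entries):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook

def average_distance(features, codebook):
    """Average nearest-entry distance of an utterance to a codebook,
    accumulated over time; used as the verification score."""
    dist = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dist.min(axis=1).mean()

# A speaker-specific threshold could then be placed, e.g., between the
# typical speaker-codebook distance and the background-codebook distance.
```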

SLIDE 21

Speaker Recognition With Gaussian Mixture Models – Recognition Phase (Part 1)

Approach for speaker verification:

❑ Pose two hypotheses: H0 – the test utterance stems from the target speaker; H1 – the test utterance does not stem from the target speaker.
❑ If the same "costs" for the different kinds of errors are assumed, the target and the test speaker are decided to be the same person if the likelihood ratio of the two hypotheses exceeds a threshold (see below). The matrix X contains the feature vectors of the utterance (after noise and speech pauses have been removed).
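
In the usual notation (the slide's formulas are not reproduced in this transcript, so the following is a standard reconstruction), the two hypotheses and the resulting likelihood-ratio test read:

\[
H_0:\ \mathbf{X} \text{ stems from the target speaker}, \qquad
H_1:\ \mathbf{X} \text{ does not stem from the target speaker},
\]
\[
\frac{p(\mathbf{X} \mid H_0)}{p(\mathbf{X} \mid H_1)} \ \gtrless \ \eta .
\]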

SLIDE 22

Speaker Recognition With Gaussian Mixture Models – Recognition Phase (Part 2)

Approach for speaker verification:

❑ The conditional probabilities can be rewritten using Bayes' theorem (see below).
❑ This yields a modified threshold for our decision condition.
❑ Different speaker probabilities can be modeled by the ratio of the two a-priori probabilities.
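
A standard reconstruction of this rewriting (notation assumed): by Bayes' theorem,

\[
P(H_i \mid \mathbf{X}) \;=\; \frac{p(\mathbf{X} \mid H_i)\, P(H_i)}{p(\mathbf{X})}, \qquad i \in \{0, 1\},
\]

so that the a-posteriori decision condition becomes

\[
\frac{p(\mathbf{X} \mid H_0)}{p(\mathbf{X} \mid H_1)} \ \gtrless \ \eta \, \frac{P(H_1)}{P(H_0)},
\]

with the prior ratio \(P(H_1)/P(H_0)\) modeling the different speaker probabilities.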

SLIDE 23

Speaker Recognition With Gaussian Mixture Models – Recognition Phase (Part 3)

Illustration (joint densities over feature 1 and feature 2): the observed data is evaluated under a probability density model trained on data of hypothesis H0 (i.e., on training data of the target speaker) and under a probability density model trained on data of hypothesis H1 (i.e., on training data of non-target speakers). After multiplication with the speaker probability and the complementary speaker probability, respectively, the decision is made.

SLIDE 24

Speaker Recognition With Gaussian Mixture Models – Recognition Phase (Part 4)

Approach for speaker verification:

❑ If Gaussian mixture models are used, the (logarithmic) probability density functions take the form shown below. The superscripts (s) and (b) denote the individual speaker model and the background model, respectively.
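
A standard form of these densities (the mixture weights \(c_k\), means \(\boldsymbol{\mu}_k\), and covariance matrices \(\mathbf{C}_k\) are assumed notation), per feature vector \(\mathbf{x}_n\):

\[
\log p(\mathbf{x}_n \mid H_0) \;=\; \log \sum_{k=1}^{K} c_k^{(s)}\, \mathcal{N}\!\big(\mathbf{x}_n;\, \boldsymbol{\mu}_k^{(s)},\, \mathbf{C}_k^{(s)}\big),
\]
\[
\log p(\mathbf{x}_n \mid H_1) \;=\; \log \sum_{k=1}^{K} c_k^{(b)}\, \mathcal{N}\!\big(\mathbf{x}_n;\, \boldsymbol{\mu}_k^{(b)},\, \mathbf{C}_k^{(b)}\big).
\]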

SLIDE 25

Speaker Recognition With Gaussian Mixture Models – Recognition Phase (Part 5)

Approach for speaker verification:

❑ The decision rule can be rewritten in the logarithmic domain: the log-likelihoods of the two hypotheses are accumulated over all feature vectors, and their difference is compared against a threshold (see below).
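
In logarithmic form (a standard reconstruction, consistent with the "accumulation of the single logarithmic probabilities over time" in the block diagrams above), the test becomes

\[
\sum_{n} \Big[ \log p(\mathbf{x}_n \mid H_0) \;-\; \log p(\mathbf{x}_n \mid H_1) \Big] \ \gtrless \ \theta,
\]

where the threshold \(\theta\) absorbs the logarithm of the prior ratio.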

SLIDE 26

Results of a Speaker Verification – Part 1

Boundary conditions:

❑ The results are taken from the dissertation of G. Kolano (work done at the Daimler Research Center in Ulm; see the literature section for details).
❑ A database with 106 speakers (only male speakers) has been used. The database consists of English double digits (i.e., the vocabulary is limited).
❑ All data has been transmitted over telephone channels. Thus, the bandwidth of the data is approximately 3.8 kHz (8 kHz sample rate). Especially for speaker recognition, these are rather bad boundary conditions.
❑ Out of the 106 speakers, 33 have been used for training the background models; the remaining 73 have been used for the evaluation of the speaker identification.
❑ MFCCs have been used as features. They were only computed if the current signal frame was classified as voiced speech.

SLIDE 27

Results of a Speaker Verification – Part 2

Comparison between codebooks and GMMs:

❑ The background model has the same size as the speaker model in all cases.
❑ Results in terms of error rates:

Model order (number of codebook      Codebook      Gaussian
entries or number of Gaussian        approach      mixture model
distributions)
 4                                   11.5 %        4.2 %
 8                                    9.6 %        3.0 %
16                                    8.2 %        2.3 %
32                                    6.8 %        2.0 %

Conclusion: GMMs are – at least in this test – clearly superior to codebook approaches, but …

SLIDE 28

Results of a Speaker Verification – Part 3

Comparison between codebooks and GMMs:

❑ The covariance matrices of the GMM approach were fully populated. Thus, a clearly larger number of model parameters has been used in this approach, and the computational complexity is clearly higher.
❑ Number of model parameters:

Model order (number of codebook      Codebook      Gaussian
entries or number of Gaussian        approach      mixture model
distributions)
 4                                    68            683
 8                                   136           1367
16                                   272           2735
32                                   544           5471

Conclusion: … GMMs require clearly more memory and computational power compared to codebook approaches.

SLIDE 29

Results of a Speaker Verification – Part 4

Comparison between global and individual thresholds:

❑ So far, individual thresholds and a-priori probabilities have been trained for each speaker.
❑ Comparison between global and individual thresholds (error rates given as global threshold / individual threshold):

Model order (number of codebook      Codebook             Gaussian
entries or number of Gaussian        approach             mixture model
distributions)
 4                                   12.9 % / 11.5 %      5.3 % / 4.2 %
 8                                   11.1 % /  9.6 %      4.1 % / 3.0 %
16                                    9.6 % /  8.2 %      3.4 % / 2.3 %
32                                    8.2 % /  6.8 %      3.0 % / 2.0 %

Conclusion: by training individual thresholds, the recognition rate can be improved, or the number of parameters can be decreased.

SLIDE 30

From Speaker Verification to Speaker Identification

Flow chart – speaker identification: a test utterance is "scored" with the speaker-specific models (speaker-specific feature models and threshold/distance models) and with the background model. Then the best speaker model is selected, or a new speaker is detected. Finally, the "winning" speaker model is adapted, or a new speaker model is generated.

SLIDE 31

Results of a Speaker Identification – Part 1

Boundary conditions:

❑ The results are taken from a publication by D. Reynolds (work done at MIT; see the literature section for details).
❑ A database with 51 speakers (only male speakers) has been used. The database consists of English conversations (approximately 10 utterances with a duration of 45 seconds each).
❑ All data has been transmitted over telephone channels. Thus, the bandwidth of the data is approximately 3.8 kHz (8 kHz sample rate).
❑ MFCCs have been used as features. Modeling has been done with GMMs, where only diagonal covariance matrices have been used.

SLIDE 32

Results of a Speaker Identification – Part 2

Results:

❑ Length of training and test data vs. recognition rate:

Length of       Model order             Length of test data
training data   (number of Gaussian     1 sec      5 sec      10 sec
                distributions)
30 sec           8                      54.6 %     79.8 %     86.6 %
                16                      63.7 %     87.3 %     90.5 %
                32                      64.6 %     85.3 %     88.4 %
60 sec           8                      66.1 %     91.5 %     97.3 %
                16                      74.9 %     95.7 %     98.8 %
                32                      78.6 %     95.6 %     98.3 %
90 sec           8                      71.5 %     95.5 %     98.8 %
                16                      79.0 %     98.0 %     99.7 %
                32                      84.7 %     98.8 %     99.6 %

SLIDE 33

Adaptation of the Models During Run-Time – Part 1

General:

❑ After a speaker recognition has been successful (this should be validated, e.g., by using a dialog system), the speaker model of the active speaker can be adapted.
❑ Generally, all model parameters can be adapted. However, updating only the mean values of GMMs has proven to provide a good cost-value ratio. For codebooks, the mean values can be seen as the individual codebook entries, i.e., all parameters are adapted.
❑ Both the amount of training data and the number of new feature vectors should be considered. The codebook adaptation weights the old codebook entry against the newly assigned feature vectors, using the number of vectors that formed the entry during training and the number of feature vectors newly assigned to the corresponding codebook vector (a sketch of such an update follows below).
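
A plausible reconstruction of this update (the transcript omits the formula, so the notation is assumed): with \(N_k\) the number of training vectors that formed entry \(k\), \(\mathcal{A}_k\) the set of new feature vectors assigned to it, and \(M_k = |\mathcal{A}_k|\),

\[
\mathbf{c}_k^{\text{(new)}} \;=\; \frac{N_k\, \mathbf{c}_k^{\text{(old)}} \;+\; \sum_{n \in \mathcal{A}_k} \mathbf{x}_n}{N_k + M_k}.
\]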

SLIDE 34

Adaptation of the Models During Run-Time – Part 2

General:

❑ The mean values of GMMs can be updated similar to the codebooks by a modified iteration step of the EM algorithm

(see last lecture). First, a „soft“ assignment to the individual classes is done (E-step): Next, the mean values are corrected (M-step) by The variable denotes the sum of the „soft“ assignments of the kth class in the last iteration during the training.
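
A reconstruction of the two steps (notation assumed): the E-step computes the responsibilities

\[
\gamma_{n,k} \;=\; \frac{c_k\, \mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_k, \mathbf{C}_k)}{\sum_{j} c_j\, \mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_j, \mathbf{C}_j)},
\]

and the M-step corrects the means, weighted by \(\Gamma_k\), the sum of the "soft" assignments of the \(k\)-th class from the last training iteration:

\[
\boldsymbol{\mu}_k^{\text{(new)}} \;=\; \frac{\Gamma_k\, \boldsymbol{\mu}_k^{\text{(old)}} \;+\; \sum_{n} \gamma_{n,k}\, \mathbf{x}_n}{\Gamma_k \;+\; \sum_{n} \gamma_{n,k}}.
\]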

SLIDE 35

Adaptation of the Models During Run-Time – Part 3

Example (feature-space plots over input features 1 and 2): Gaussian distributions before adaptation and after adaptation, together with the new feature vectors.

SLIDE 36

Discriminative Approaches – Part 1

General:

❑ In discriminative approaches, the aim is not only to optimize the assignment of training data to a model, but also to optimize the discrimination between other models at the same time.
❑ Examples of such approaches are neural networks or learning vector quantization (LVQ).
❑ The advantage of such methods is, in general, an improved recognition rate.
❑ However, it is more difficult to include new speakers into the models when discriminant approaches are used. If this becomes necessary, all model parameters (even those of already known speakers) have to be recalculated – whereas in the approaches discussed so far, only a new speaker model had to be generated.

SLIDE 37

Discriminative Approaches – Part 2

Neural networks (e.g., with radial basis functions):

❑ The input data of the neural network are the feature vectors of the training data of all speakers.
❑ The desired output is a vector which contains a 1 at the index of the current speaker; all other vector elements are set to 0 (or −1, depending on the variant).
❑ Standard training methods for neural networks attempt to minimize the quadratic distance between the output of the neural network and the desired output. This leads to discriminant methods (a sketch is given below).
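
A minimal Python sketch of such a discriminant training (radial basis functions with least-squares output weights; the RBF centers, the width, and all names are assumptions, not the lecture's setup):

```python
import numpy as np

def train_rbf_speaker_net(features, speaker_ids, centers, width, num_speakers):
    """RBF network trained discriminatively: one-hot targets (1 at the
    index of the current speaker, 0 elsewhere) and minimization of the
    quadratic distance between network output and desired output,
    jointly over the data of all speakers."""
    # RBF activations for all training vectors
    sq_dist = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-sq_dist / (2.0 * width ** 2))
    # one-hot target matrix
    T = np.eye(num_speakers)[speaker_ids]
    # least-squares solution for the output weights
    W, *_ = np.linalg.lstsq(H, T, rcond=None)
    return W  # scores for a new feature vector: activations @ W
```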

SLIDE 38

Contents

❑ Literature
❑ Speaker recognition
  ❑ Motivation
  ❑ Speaker verification and speaker identification
  ❑ Model adaptation
  ❑ Discriminative approaches
❑ Speech recognition
  ❑ Fundamentals
  ❑ Statistical speech recognition
❑ Conclusion and outlook

SLIDE 39

Speech Generation and Speech Recognition

Overview (from speech production to perception): creation of the message → integration of language (English) → neuromuscular activities → vocal cords and vocal tract → "acoustic" channel → movement of the basilar membrane → neuronal activity → conversion based on a language → understanding of the message.

SLIDE 40

"History"

1952 at Bell Labs:

❑ First digit recognition
❑ Estimation of the energy in the formant frequencies (resonance frequencies of the vocal tract)

In the 1960s:

❑ Improved digit recognition
❑ Breakthroughs in spectral estimation (FFT, cepstrum), dynamic time warping, and hidden Markov models

Hidden Markov models in speech recognition:

❑ Mathematics from Baum et al. (1966 – 1972)
❑ Application to speech recognition by Baker (CMU Dragon System, 1974)
❑ Development at IBM (Baker, Jelinek, Bahl, Mercer, and others)

(Deep) neural networks:

❑ Helped this technology to become as successful as seen in today's products (usually server-based architectures)
❑ Development started about 15 years ago

SLIDE 41

Speech Dialog Systems

Overview of a speech dialog system: the speech recognition converts the speech signal into text, and a parser maps the text to a semantic representation for the dialog manager. The dialog manager's answer (again a semantic representation) is converted back into text by the prompter, and the speech synthesis turns this text into a speech signal.

SLIDE 42

Fundamental Principle of Speech Recognition

Processing chain: speech signal with background noise → preprocessing → speech signal with little disturbance → feature extraction → features (e.g., MFCCs) → classification → evaluation ("N-best" list) → decision → recognized text.

❑ Preprocessing: reduces background noise and echoes; combines several microphone signals.
❑ Feature extraction: compresses the amount of data; extracts the important parameters for the speech recognition.
❑ Classification: performs an evaluation for each activated class; as a result, often the N best evaluations are determined.
❑ Decision: determines the best entry based on additional prior knowledge (word probabilities).

SLIDE 43

Variants of Speech Recognition Systems – Part 1

Single-word recognizers:

❑ Single words or short commands
❑ The (command) words are spoken in isolation (with pauses)

Keyword spotters:

❑ Single words or word sequences within arbitrary utterances
❑ Once the keyword has been detected, a new recognizer is started

Recognition of connected words:

❑ Sequences of fluently spoken words from a small vocabulary

Continuous speech recognizers:

❑ Whole, fluently spoken sentences

SLIDE 44

Variants of Speech Recognition Systems – Part 2

Speaker-dependent systems:

❑ Such systems have to be trained individually for each speaker.
❑ The training phase can take some time (depending on the size of the vocabulary and the desired quality).

Speaker-independent systems:

❑ There is no need for a (speaker-specific) training phase.
❑ To obtain an appropriate quality, a large training database has to be provided.

Speaker-adaptive systems:

❑ Such a system starts with a universal model, which is then gradually adapted to the speaker.
❑ This can be done during a short training phase or during runtime.

SLIDE 45

Variants of Speech Recognition Systems – Part 3

Systems with a small vocabulary:

❑ Up to a few hundred words
❑ Typically used for control tasks

Systems with a large vocabulary:

❑ Several hundred thousand words
❑ Dictation, address input
❑ Vocabularies of this size often contain many phonetically similar words.
❑ To reduce the number of mistakes, a so-called language model (describing the relationships between words) is usually needed.

SLIDE 46

Evaluation of Speech Recognition Systems – Part 1

Basics:

❑ Usually, speech recognition systems and speech dialog systems are evaluated using word error rates.
❑ In practice, however, the importance of this value is often overrated; other criteria matter as well. It is also important, for example, how much computing power and memory the system needs, or after which time the result is available.
❑ For the evaluation of speech dialog systems, the quality of the speech synthesis, the so-called start-up time, and many other values are also of special interest.

SLIDE 47

Evaluation of Speech Recognition Systems – Part 2

Word error rate:

❑ The word error rate can be derived efficiently by means of dynamic programming. Defining the word reference sequence and the word sequence determined by the recognizer, a distance between the two sequences can be computed (see below).
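
A standard reconstruction of this distance (the transcript omits the symbols) is the Levenshtein distance between the reference sequence \(W = (w_1, \dots, w_I)\) and the recognized sequence \(\hat{W} = (\hat{w}_1, \dots, \hat{w}_J)\):

\[
D(i,j) \;=\; \min\Big\{ D(i-1,j) + 1,\;\; D(i,j-1) + 1,\;\; D(i-1,j-1) + \big[w_i \neq \hat{w}_j\big] \Big\}.
\]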

SLIDE 48

Evaluation of Speech Recognition Systems – Part 3

Word error rate:

❑ Initialization: the first row and column of the distance matrix are filled with the costs of pure insertions and deletions, respectively.
❑ Derivation of the word error rate: the accumulated distance is normalized by the length of the reference sequence (a sketch is given below).
❑ Note that with this definition, word error rates greater than 100 % can occur (due to a large number of insertions).
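
A minimal Python sketch of this dynamic program (assuming the standard initialization D(i,0) = i, D(0,j) = j and normalization by the reference length):

```python
import numpy as np

def word_error_rate(reference, hypothesis):
    """Word error rate via dynamic programming (Levenshtein distance).
    Counts substitutions, deletions, and insertions; the result can
    exceed 100 % if the recognizer inserts many words."""
    I, J = len(reference), len(hypothesis)
    D = np.zeros((I + 1, J + 1), dtype=int)
    D[:, 0] = np.arange(I + 1)   # initialization: deletions only
    D[0, :] = np.arange(J + 1)   # initialization: insertions only
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            sub = D[i - 1, j - 1] + (reference[i - 1] != hypothesis[j - 1])
            D[i, j] = min(sub, D[i - 1, j] + 1, D[i, j - 1] + 1)
    return D[I, J] / I

# one substitution + one insertion in a four-word reference -> 50 %
print(word_error_rate("turn the radio on".split(),
                      "turn a the radio off".split()))
```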

SLIDE 49

Maximum A-posteriori Probability Rule (MAP Rule) – Part 1

Recognition criterion:

❑ For a given feature sequence, the one word sequence out of all possible (permitted) word sequences should be selected which exhibits the maximum a-posteriori probability.
❑ Using Bayes' theorem, the criterion can be rewritten (see below).
❑ Due to the fact that the probability of the feature sequence is constant during the maximization, it has no influence on the decision and can be neglected.
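
Written out with the notation used later in this part (acoustic model \(p(X \mid W)\), language model \(p(W)\)), the MAP rule reads:

\[
\hat{W} \;=\; \operatorname*{argmax}_{W} P(W \mid X)
        \;=\; \operatorname*{argmax}_{W} \frac{p(X \mid W)\, p(W)}{p(X)}
        \;=\; \operatorname*{argmax}_{W} p(X \mid W)\, p(W).
\]
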
SLIDE 50

Maximum A-posteriori Probability Rule (MAP Rule) – Part 2

Recognition criterion:

❑ Optimization function: maximize the product of the acoustic model p(X|W) and the language model p(W) over all permitted word sequences W.
❑ The probability p(X|W) indicates the probability of observing the feature sequence X if the word sequence W was spoken. To model such probabilities, hidden Markov models (HMMs) have proven to be suitable. This part of the optimization criterion is called the acoustic model.
❑ The probability p(W) is the a-priori probability of the word sequence W. This probability is independent of the observation sequence and describes a-priori knowledge about the word sequence (e.g., that some words occur more often than others). This part of the optimization criterion is called the language model.

SLIDE 51

Computationally Efficient Model Restrictions – Part 1

Limitation of the HMM degrees of freedom:

❑ If at first a large number of Gaussian distributions is permitted for the various HMMs or for the states of the HMMs, one can try to use the same mean vectors and covariance matrices for all further calculations. The weights of the individual Gaussian distributions can still be selected individually for each model or model state.
❑ This kind of hidden Markov model is called a semi-continuous HMM.
❑ In this way, a lot of memory and computational load can be saved:
  ❑ The required memory is reduced because the mean vectors and covariance matrices are reused for all models (and do not have to be saved individually for each model or model state).
  ❑ The contributions of the individual Gaussian distributions are computed only once per time frame (i.e., per feature vector), and not individually for each model or model state (see the sketch below).
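
A minimal sketch of the per-frame saving (the shared Gaussian pool and all parameter names are assumptions): the expensive Gaussian evaluations are done exactly once per frame, and every state only applies its own weights.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def semi_continuous_frame_scores(x, pool_means, pool_covs, state_weights):
    """Emission log-likelihoods of all states for one feature vector x.

    pool_means, pool_covs: shared pool of base Gaussians;
    state_weights: array of shape (num_states, pool_size) with the
    per-state mixture weights of a semi-continuous HMM."""
    # evaluate each shared base Gaussian exactly once per frame
    log_dens = np.array([multivariate_normal.logpdf(x, mean, cov)
                         for mean, cov in zip(pool_means, pool_covs)])
    # per-state score: log sum_k w_{s,k} * N_k(x), reusing log_dens
    return logsumexp(np.log(np.maximum(state_weights, 1e-300))
                     + log_dens[None, :], axis=1)
```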

SLIDE 52

Computationally Efficient Model Restrictions – Part 2

Limitation of the HMM degrees of freedom:

❑ In addition, a limited number of Gaussian distributions can be assigned to each HMM or HMM state. In this case, the computational load (and, to a small extent, also the memory) can be reduced further.
❑ This is done by storing only the indices of the active Gaussian distributions of each model state (e.g., 8 to 32 out of 512 to 2048 Gaussian distributions).
❑ The base Gaussian distributions can be derived, for example, from a big speech database by using the EM algorithm.

SLIDE 53

Base Units of Hidden Markov Models – Part 1

Acoustic modeling:

❑ Acoustic base units are extracted from a big speech database (e.g., approximately 1000 speakers, each talking for about one hour) to train the HMMs.
❑ Such base units can be phonemes, but also phoneme pairs or groups of 3 phonemes.
❑ In addition, single-word models are often trained for key words (e.g., numbers).
❑ There are about 50 phonemes in each language. For phoneme pairs and groups of 3 phonemes, the number of groups occurring within a language is considerably smaller than 50² and 50³, respectively.

SLIDE 54

Base Units of Hidden Markov Models – Part 2

Acoustic modeling:

❑ Composing the required models (given by the vocabulary) out of base units has the advantage that it is fairly simple (see the following slides) and can be done during the runtime of the speech recognizer.
❑ It is therefore possible to wait for the answer within a speech dialog, generate a corresponding answer by means of a speech synthesis system, and then start the recognition anew. In each recognition step, only those word alternatives are used that make sense in the current dialog context.
❑ This allows the vocabulary to be kept small, which leads to fewer errors and a lower computational load.
❑ This is especially important for database queries (e.g., Google search, operating an MP3 player, etc.).

SLIDE 55

Training versus Processing During Run-Time of a Recognizer

Training: the acoustic model p(X|W) is trained on speech training material, and the language model p(W) is trained on text material (after text processing). During run-time: the speech signal passes through feature extraction and decoding, where the decoder combines the acoustic model, the language model, and the vocabulary.

SLIDE 56

Composition of HMMs – Part 1

Parallel connection of HMMs:

❑ For the parallel composition of HMMs, only the transition probabilities and the a-priori probabilities have to be combined (a sketch follows below).
❑ Example of a parallel connection of two simple left-right models, each with two emitting states.
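
A minimal sketch of this combination (a block-diagonal transition matrix and concatenated, prior-weighted initial probabilities; the model prior `p1` is an assumed parameter):

```python
import numpy as np
from scipy.linalg import block_diag

def connect_parallel(A1, pi1, A2, pi2, p1=0.5):
    """Parallel connection of two HMMs: no transitions between the
    models, initial-state probabilities weighted by the a-priori
    probabilities p1 and 1 - p1 of the two models."""
    A = block_diag(A1, A2)                              # combined transitions
    pi = np.concatenate([p1 * pi1, (1.0 - p1) * pi2])   # combined priors
    return A, pi
```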

SLIDE 57

Composition of HMMs – Part 2

Parallel connection of HMMs

❑ Example of the parallel connection of two simple left-right models.

SLIDE 58

Composition of HMMs – Part 3

Series connection of HMMs

❑ Example of the series connection of two simple left-right models.

SLIDE 59

Composition of HMMs – Part 4

Generation of the active vocabulary:

❑ The HMMs have to be connected with each other as efficiently as possible (graph theory).
❑ Example (fragment) for German double digits: a word graph connecting "ein", "zwei", "drei", …, "neun" via "und" with "dreißig", "vierzig", "fünfzig", …, "neunzig".

SLIDE 60

Research Directions

Feature extraction:

❑ Psychoacoustically motivated feature extraction
❑ Use of additional information (speaker direction, etc.)

Acoustic modeling:

❑ Use of base units other than phonemes
❑ Improved modeling (e.g., with sound duration models)
❑ Use of neural networks and mixed approaches

Adaptation:

❑ Adaptation to the current speaker
❑ Feature transformation to reduce the dependence on the recording conditions

Training:

❑ Discriminative approaches

SLIDE 61

Contents

❑ Literature
❑ Speaker recognition
  ❑ Motivation
  ❑ Speaker verification and speaker identification
  ❑ Model adaptation
  ❑ Discriminative approaches
❑ Speech recognition
  ❑ Fundamentals
  ❑ Statistical speech recognition
❑ Conclusion and outlook

SLIDE 62

Summary and Outlook

Summary:

❑ Speaker recognition
  ❑ Motivation
  ❑ Speaker verification and speaker identification
  ❑ Model adaptation
  ❑ Discriminative approaches
❑ Speech recognition
  ❑ Fundamentals
  ❑ Statistical speech recognition

Next week:

❑ Neural networks