
SLIDE 1

Speech Technology - Kishore Prahallad (skishore@cs.cmu.edu) 1

Speech Technology: A Practical Introduction

Topic: Spectrogram, Cepstrum and Mel-Frequency Analysis

Kishore Prahallad
Email: skishore@cs.cmu.edu
Carnegie Mellon University & International Institute of Information Technology Hyderabad

SLIDE 2


Topics

Spectrogram
Cepstrum
Mel-Frequency Analysis
Mel-Frequency Cepstral Coefficients

SLIDE 3


Spectrogram

SLIDE 4

Speech signal represented as a sequence of spectral vectors

[Figure: an FFT is applied to each short analysis window of the speech signal, giving a spectrum.]

SLIDE 5

Speech signal represented as a sequence of spectral vectors

[Figure: the FFT is repeated for every analysis window along the signal.]

SLIDE 6

Speech signal represented as a sequence of spectral vectors

[Figure: one spectrum per analysis window; axes: frequency (Hz) vs. amplitude.]

SLIDE 7

Speech signal represented as a sequence of spectral vectors

[Figure: one spectrum per analysis window; axes: frequency (Hz) vs. amplitude.]

Rotate each spectrum by 90 degrees

SLIDE 8

Speech signal represented as a sequence of spectral vectors

[Figure: rotated spectrum; axis labels: Hz, Amplitude.]

Map the spectral amplitude to a grey-level value (0-255), where 0 represents black and 255 represents white.
The higher the amplitude, the darker the corresponding region.
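
A minimal sketch of this grey-level mapping, assuming NumPy and simple min-max scaling over the values passed in. The function name `to_grey_levels` and the scaling choice are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def to_grey_levels(amplitudes):
    """Map spectral amplitudes to grey levels 0-255 (0 = black, 255 = white).

    Higher amplitudes are mapped to darker (smaller) values, so the
    scale is inverted after min-max normalisation.
    """
    amplitudes = np.asarray(amplitudes, dtype=float)
    lo, hi = amplitudes.min(), amplitudes.max()
    if hi == lo:
        # Flat input: no contrast to show, so return mid-grey everywhere.
        return np.full(amplitudes.shape, 128, dtype=np.uint8)
    scaled = (amplitudes - lo) / (hi - lo)          # 0.0 .. 1.0
    return np.round(255.0 * (1.0 - scaled)).astype(np.uint8)

grey = to_grey_levels([0.0, 5.0, 20.0])  # strongest component becomes black
```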

SLIDE 9

Speech signal represented as a sequence of spectral vectors

[Figure: grey-level spectra placed side by side; axes: time vs. frequency (Hz).]

SLIDE 10

Speech signal represented as a sequence of spectral vectors

[Figure: grey-level spectra placed side by side; axes: time vs. frequency (Hz).]

The time vs. frequency representation of a speech signal is referred to as a spectrogram
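
The construction above can be sketched in a few lines of NumPy. The frame length (400 samples), hop (160 samples), FFT size (512) and the Hamming window are assumed values chosen for 16 kHz speech; the slides do not specify them, and the function name `spectrogram` is hypothetical.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Return a (n_frames, n_fft // 2 + 1) array of magnitude spectra.

    Each row is the FFT magnitude of one short analysis window, so
    rows run along time and columns along frequency.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

# Usage: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
S = spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_hz = np.argmax(S.mean(axis=0)) * sample_rate / 512
```

Plotting `S.T` as an image (time on the horizontal axis, frequency on the vertical axis) gives the picture built up on these slides.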

SLIDE 11


Some Real Spectrograms

Dark regions indicate peaks (formants) in the spectrum

SLIDE 12

Why we are bothered about spectrograms

Phones and their properties are better observed in a spectrogram

SLIDE 13

Why we are bothered about spectrograms

Sounds can be identified much better by the formants and by their transitions

SLIDE 14

Why we are bothered about spectrograms

Sounds can be identified much better by the formants and by their transitions
Hidden Markov Models implicitly model these spectrograms to perform speech recognition

SLIDE 15

Usefulness of Spectrogram

Time-Frequency representation of the speech signal
Spectrogram is a tool to study speech sounds (phones)
Phones and their properties are visually studied by phoneticians
Hidden Markov Models implicitly model spectrograms for speech-to-text systems
Useful for evaluation of text-to-speech systems

– A high-quality text-to-speech system should produce synthesized speech whose spectrograms nearly match those of the natural sentences.

SLIDE 16


Cepstral Analysis

SLIDE 17

A Sample Speech Spectrum

[Figure: speech spectrum; axes: frequency (Hz) vs. dB.]

Peaks denote dominant frequency components in the speech signal
Peaks are referred to as formants
Formants carry the identity of the sound

SLIDE 18

What do we want to extract? – Spectral Envelope

Formants and a smooth curve connecting them
This smooth curve is referred to as the spectral envelope

[Figure: spectrum with its spectral envelope; axes: frequency (Hz) vs. dB.]

SLIDE 19

Spectral Envelope

[Figure: three curves labelled Spectrum, Spectral Envelope and Spectral details.]

SLIDE 20

Spectral Envelope

[Figure: Spectrum (log X[k]), Spectral Envelope (log H[k]) and Spectral details (log E[k]).]

SLIDE 21

Spectral Envelope

[Figure: Spectrum (log X[k]), Spectral Envelope (log H[k]) and Spectral details (log E[k]).]

log X[k] = log H[k] + log E[k]

1. Our goal: we want to separate the spectral envelope and the spectral details from the spectrum.
2. i.e., given log X[k], obtain log H[k] and log E[k], such that log X[k] = log H[k] + log E[k]

SLIDE 22

How to achieve this separation?

SLIDE 23

Play a Mathematical Trick

[Figure: Spectrum, Spectral Envelope and Spectral details.]

Trick: Take the FFT of the spectrum!!
An FFT on the spectrum is referred to as the Inverse FFT (IFFT).
Note: We are dealing with the spectrum in the log domain (part of the trick)
The IFFT of the log spectrum represents the signal on a pseudo-frequency axis

SLIDE 24

Play a Mathematical Trick

[Figure: Spectrum, Spectral Envelope and Spectral details, with a pseudo-frequency axis.]

SLIDE 25

Play a Mathematical Trick

[Figure: Spectrum, Spectral Envelope and Spectral details; the pseudo-frequency axis is divided into a low-frequency region and a high-frequency region.]

SLIDE 26

Play a Mathematical Trick

[Figure: Spectrum, Spectral Envelope and Spectral details; an IFFT maps onto the pseudo-frequency axis with its low- and high-frequency regions.]

SLIDE 27

Play a Mathematical Trick

[Figure: Spectrum, Spectral Envelope and Spectral details; an IFFT maps onto the pseudo-frequency axis with its low- and high-frequency regions.]

Treat this as a sine wave with 4 cycles per sec.

SLIDE 28

Play a Mathematical Trick

[Figure: Spectrum, Spectral Envelope and Spectral details; an IFFT maps onto the pseudo-frequency axis with its low- and high-frequency regions.]

Treat this as a sine wave with 4 cycles per sec.
Gives a peak at 4 Hz on the frequency axis

SLIDE 29

Play a Mathematical Trick

[Figure: Spectrum, Spectral Envelope and Spectral details; an IFFT maps onto the pseudo-frequency axis with its low- and high-frequency regions.]

Treat this as a sine wave with 4 cycles per sec.
Gives a peak at 4 Hz on the frequency axis

SLIDE 30

Play a Mathematical Trick

[Figure: Spectrum, Spectral Envelope and Spectral details; an IFFT maps onto the pseudo-frequency axis with its low- and high-frequency regions.]

SLIDE 31

Play a Mathematical Trick

[Figure: Spectrum, Spectral Envelope and Spectral details; an IFFT maps onto the pseudo-frequency axis with its low- and high-frequency regions.]

Treat this as a sine wave with 100 cycles per sec.
Gives a peak at 100 Hz on the frequency axis

SLIDE 32

Play a Mathematical Trick

[Figure: Spectral Envelope and Spectral details are each mapped by an IFFT onto the pseudo-frequency axis, into the low- and high-frequency regions.]

SLIDE 33

Play a Mathematical Trick

[Figure: Spectrum, Spectral Envelope and Spectral details, with a pseudo-frequency axis.]

SLIDE 34

Play a Mathematical Trick

[Figure: Spectrum (log X[k]), Spectral Envelope (log H[k]) and Spectral details (log E[k]); an IFFT maps onto the pseudo-frequency axis.]

log X[k] = log H[k] + log E[k]

SLIDE 35

Play a Mathematical Trick

[Figure: Spectrum (log X[k]), Spectral Envelope (log H[k]) and Spectral details (log E[k]); an IFFT maps onto the pseudo-frequency axis.]

log X[k] = log H[k] + log E[k]
x[k] = h[k] + e[k]

SLIDE 36

Play a Mathematical Trick

[Figure: Spectrum (log X[k]), Spectral Envelope (log H[k]) and Spectral details (log E[k]); an IFFT maps onto the pseudo-frequency axis.]

log X[k] = log H[k] + log E[k]
x[k] = h[k] + e[k]

In practice, all you have access to is log X[k], and hence you can obtain x[k]

SLIDE 37

Play a Mathematical Trick

[Figure: Spectrum (log X[k]), Spectral Envelope (log H[k]) and Spectral details (log E[k]); an IFFT maps onto the pseudo-frequency axis.]

log X[k] = log H[k] + log E[k]
x[k] = h[k] + e[k]

If you know x[k], filter the low-frequency region to get h[k]

SLIDE 38

Play a Mathematical Trick

[Figure: Spectrum (log X[k]), Spectral Envelope (log H[k]) and Spectral details (log E[k]); an IFFT maps onto the pseudo-frequency axis.]

log X[k] = log H[k] + log E[k]
x[k] = h[k] + e[k]

x[k] is referred to as the cepstrum
h[k] is obtained by considering the low-frequency region of x[k]
h[k] represents the spectral envelope and is widely used as a feature for speech recognition
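
A rough sketch of this trick in NumPy: compute x[k] as the IFFT of the log magnitude spectrum, keep only its low pseudo-frequency part as h[k], and transform back to see the smoothed log spectrum. The function names, the FFT size of 512 and the cutoff `n_keep=30` are illustrative assumptions; the slides do not give a cutoff.

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """x[k]: IFFT of the log magnitude spectrum of one frame."""
    log_spectrum = np.log(np.abs(np.fft.fft(frame, n=n_fft)) + 1e-10)
    # The log magnitude spectrum of a real frame is real and symmetric,
    # so its IFFT is real up to rounding error.
    return np.fft.ifft(log_spectrum).real

def spectral_envelope(frame, n_keep=30, n_fft=512):
    """Estimate log H[k] by keeping only the low region of the cepstrum."""
    cepstrum = real_cepstrum(frame, n_fft)
    keep = np.zeros(n_fft)
    keep[:n_keep] = 1.0
    if n_keep > 1:
        # Keep the mirrored coefficients too, so the result stays real.
        keep[-(n_keep - 1):] = 1.0
    h = cepstrum * keep                 # h[k]: low pseudo-frequency part
    return np.fft.fft(h).real           # back to the log-spectral domain

rng = np.random.default_rng(0)
frame = rng.standard_normal(400)
envelope = spectral_envelope(frame)     # smooth curve, length n_fft
```

The part of the cepstrum that is zeroed out corresponds to e[k], the spectral details.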

SLIDE 39

Cepstral Analysis

X[k] = H[k] E[k]
|| X[k] || = || H[k] || || E[k] ||      (|| . || denotes magnitude)
Take log on both sides: log || X[k] || = log || H[k] || + log || E[k] ||
Taking inverse FFT on both sides: x[k] = h[k] + e[k]

SLIDE 40


Mel-Frequency Analysis

SLIDE 41

Review: What we did

We captured the spectral envelope (curve connecting all formants)
BUT: Perceptual experiments say the human ear concentrates on certain regions rather than using the whole of the spectral envelope.

[Figure: spectral envelope; axes: frequency (Hz) vs. dB.]

SLIDE 42

Mel-Frequency Analysis

Mel-Frequency analysis of speech is based on human perception experiments
It is observed that the human ear acts as a filter

– It concentrates on only certain frequency components

These filters are non-uniformly spaced on the frequency axis

– More filters in the low-frequency regions
– Fewer filters in the high-frequency regions
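
One way to see this non-uniform spacing is to place filter centres evenly on a mel scale and convert them back to Hz. The slides do not give a mel formula; the sketch below assumes the commonly used mapping mel = 2595 log10(1 + f / 700), and the function names, the 26 filters and the 0-8000 Hz range are illustrative choices.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Assumed mel mapping (a common convention, not stated in the slides).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_center_frequencies(n_filters=26, f_min=0.0, f_max=8000.0):
    """Centre frequencies (Hz) of filters spaced evenly on the mel axis."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    return mel_to_hz(mels)[1:-1]        # drop the two outer edge points

centers = mel_center_frequencies()
n_low = int(np.sum(centers < 1000.0))    # filters centred below 1 kHz
n_high = int(np.sum(centers >= 7000.0))  # filters centred in the top 1 kHz
```

Because equal steps in mel become wider and wider steps in Hz, `n_low` comes out larger than `n_high`.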

SLIDE 43


Mel-Frequency Filters

SLIDE 44

Mel-Frequency Filters

More filters in the low-frequency region
Fewer filters in the high-frequency region

SLIDE 45

Mel-Frequency Cepstral Coefficients (MFCC)

Spectrum → Mel-Filters → Mel-Spectrum
Say log X[k] = log (Mel-Spectrum)
NOW perform cepstral analysis on log X[k]

– log X[k] = log H[k] + log E[k]
– Taking IFFT
– x[k] = h[k] + e[k]

Cepstral coefficients h[k] obtained for the Mel-spectrum are referred to as Mel-Frequency Cepstral Coefficients, often denoted by MFCC
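
A minimal sketch of this last step, starting from a mel spectrum that has already been computed. Note one substitution: the slides take an IFFT of the log mel spectrum, whereas this sketch uses an unnormalised DCT-II, which is the usual practical stand-in for a real, symmetric log spectrum. The function name and the choice of 13 coefficients are assumptions.

```python
import numpy as np

def mfcc_from_mel_spectrum(mel_spectrum, n_coeffs=13):
    """Return the first n_coeffs cepstral coefficients of a mel spectrum.

    mel_spectrum may be one vector (n_filters,) or a matrix
    (n_frames, n_filters); the transform is applied along the last axis.
    """
    log_mel = np.log(np.asarray(mel_spectrum, dtype=float) + 1e-10)
    n_filters = log_mel.shape[-1]
    k = np.arange(n_coeffs)[:, None]      # cepstral index
    m = np.arange(n_filters)[None, :]     # mel filter index
    dct_basis = np.cos(np.pi * k * (m + 0.5) / n_filters)
    # Keeping only the low-order coefficients retains the smooth
    # envelope part h[k] and discards the fine detail e[k].
    return log_mel @ dct_basis.T

flat = np.full(26, np.e)                  # flat mel spectrum, log = 1
coeffs = mfcc_from_mel_spectrum(flat)
```

For a perfectly flat mel spectrum only the zeroth coefficient is non-zero, since there is no envelope shape to describe.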

SLIDE 46

Speech signal represented as a sequence of spectral vectors

[Figure: the spectrum of each analysis window is passed through Mel-Filters and then cepstral analysis.]

SLIDE 47

Speech signal represented as a sequence of CEPSTRAL vectors

[Figure: the spectrum of each analysis window is converted into a cepstral vector.]

SLIDE 48

Why we are going to use MFCC

Speech synthesis

– Used for joining two speech segments S1 and S2
– Represent S1 as a sequence of MFCC
– Represent S2 as a sequence of MFCC
– Join at the point where MFCCs of S1 and S2 have minimal Euclidean distance

Used in speech recognition

– MFCC are the most commonly used features in state-of-the-art speech recognition systems
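
The join rule for speech synthesis can be sketched as a search over frame pairs. This assumes one reading of the slide: every frame of S1 is compared with every frame of S2, and the pair with the smallest Euclidean distance marks the join. The function name `best_join_point` is hypothetical.

```python
import numpy as np

def best_join_point(mfcc_s1, mfcc_s2):
    """Find the frame pair (i, j) with minimal Euclidean distance.

    mfcc_s1: (n_frames_1, n_coeffs) MFCC vectors of segment S1
    mfcc_s2: (n_frames_2, n_coeffs) MFCC vectors of segment S2
    """
    mfcc_s1 = np.asarray(mfcc_s1, dtype=float)
    mfcc_s2 = np.asarray(mfcc_s2, dtype=float)
    diffs = mfcc_s1[:, None, :] - mfcc_s2[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)      # all pairwise distances
    i, j = np.unravel_index(np.argmin(distances), distances.shape)
    return int(i), int(j), float(distances[i, j])

s1 = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
s2 = [[9.0, 9.0], [5.0, 5.1]]
i, j, d = best_join_point(s1, s2)   # last frame of s1 joins second frame of s2
```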

SLIDE 49

Summary: Process of Feature Extraction

Speech is analyzed over short analysis windows
For each short analysis window a spectrum is obtained using FFT
Spectrum is passed through Mel-Filters to obtain Mel-Spectrum
Cepstral analysis is performed on Mel-Spectrum to obtain Mel-Frequency Cepstral Coefficients
Thus speech is represented as a sequence of cepstral vectors
It is these cepstral vectors which are given to pattern classifiers for speech recognition purposes
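
The whole chain can be put together in one short NumPy function. This is a sketch under several assumptions that the slides leave open: 25 ms windows with a 10 ms hop at 16 kHz, a Hamming window, a power spectrum, triangular filters spaced with the common mel formula 2595 log10(1 + f / 700), and a DCT-II in place of the IFFT. The function name `mfcc` and all default values are illustrative.

```python
import numpy as np

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_coeffs=13):
    """Return a (n_frames, n_coeffs) array of cepstral vectors."""
    signal = np.asarray(signal, dtype=float)

    # 1. Cut the speech into short analysis windows.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Spectrum of each window via FFT (power spectrum assumed here).
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

    # 3. Triangular mel filters, evenly spaced on the assumed mel scale.
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_points = np.linspace(0.0, mel_max, n_filters + 2)
    edges = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    left, center, right = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (freqs[None, :] - left) / (center - left)
    falling = (right - freqs[None, :]) / (right - center)
    bank = np.clip(np.minimum(rising, falling), 0.0, None)
    mel_spectrum = power @ bank.T

    # 4. Cepstral analysis: log, then DCT-II, keeping low-order terms.
    log_mel = np.log(mel_spectrum + 1e-10)
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / n_filters)
    return log_mel @ basis.T

rng = np.random.default_rng(0)
features = mfcc(rng.standard_normal(16000))   # one second of noise
```

Each row of `features` is one cepstral vector, ready to be passed to a pattern classifier.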

SLIDE 50


Additional Reading

Chapter 6

– Pg: 273-281
– Pg: 304-311
– Pg: 314-316
