- 3. Feature Extraction
2
3.1 Feature Extraction from Speech
… or other types of audio like music
See Schukat-Talamazzini Chapter 3
3
Goal of Feature Extraction
- Capture essential information about speech
- Be robust against background noise
- Steps:
- Sampling and quantization
- Short time analysis
- Transform to frequency space
- Filtering
- Optimize class separability
4
Overview Feature Extraction
- Convert the continuous speech signal into a sequence of vectors
- Each window gives one vector
- The following slides give the details of this procedure
From: HTK manual
5
Sampling and Quantization
- Measure the signal periodically and store it in a variable
- Sampling interval: T (sampling rate: 1/T)
- Quantization: use B bits to represent the signal → $2^B$ possible values
- $f_n$: sampled values of the signal, numbered using index n
What happens when you store a signal in a computer?
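A minimal sketch of sampling and quantization, assuming a signal normalized to the range [-1, 1) and uniform quantization. The 440 Hz tone, the 16 kHz rate and the function name `quantize` are illustrative choices, not taken from the slides.

```python
import numpy as np

def quantize(signal, bits):
    """Uniformly quantize values in [-1, 1) to signed integers with `bits` bits."""
    levels = 2 ** bits                                # 2^B possible values
    scaled = np.round(signal * (levels // 2))
    return np.clip(scaled, -(levels // 2), levels // 2 - 1).astype(int)

# Sample a 440 Hz sine every T = 1/16000 s (hypothetical values).
T = 1.0 / 16000
n = np.arange(160)                                    # sample index n
f_n = np.sin(2 * np.pi * 440 * n * T)                 # f_n: sampled values
q_n = quantize(f_n, bits=8)                           # integers in -128..127
```

With B = 8 the stored values can take at most 256 distinct levels; everything between two levels is lost.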
6
Sampling Theorem
- Reconstruction of the original signal is only possible if the signal's highest frequency is limited
- Let $f_G$ be the frequency limit; the sampling interval must satisfy $T \le \frac{1}{2 f_G}$
- Else: spectral aliasing, that is, frequencies will be confused
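A small illustration of aliasing, assuming a sampling rate of 8000 Hz (a hypothetical value): a 7000 Hz tone lies above half the sampling rate, and its samples coincide with those of a sign-flipped 1000 Hz tone.

```python
import numpy as np

fs = 8000                    # sampling rate 1/T in Hz (hypothetical)
n = np.arange(64)            # sample index
t = n / fs                   # sampling instants in seconds

above_limit = np.sin(2 * np.pi * 7000 * t)   # 7000 Hz > fs/2 = 4000 Hz
alias = -np.sin(2 * np.pi * 1000 * t)        # 1000 Hz tone, sign flipped

# After sampling, the two sequences cannot be told apart.
same = np.allclose(above_limit, alias)
```

Once the samples are stored, nothing distinguishes the two tones, which is why the signal has to be band-limited before sampling.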
7
Pre-emphasis
- Correct for filtering of the lips
- Boosts higher frequencies
- Iterative scheme: $f'_n = f_n - \alpha f_{n-1}$
- Typical value: $\alpha = 0.95$
What does it do for $\alpha = 1$?
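A minimal sketch of pre-emphasis as a first-order difference, f'_n = f_n - alpha * f_{n-1}. Passing the first sample through unchanged is an assumption made here, since the slide does not say how the start of the signal is handled.

```python
import numpy as np

def pre_emphasis(f, alpha=0.95):
    """Return f'_n = f_n - alpha * f_{n-1}; the first sample is kept unchanged."""
    f = np.asarray(f, dtype=float)
    out = np.empty_like(f)
    out[0] = f[0]
    out[1:] = f[1:] - alpha * f[:-1]
    return out

constant = np.ones(5)
emphasized = pre_emphasis(constant, alpha=1.0)   # constant part vanishes after the first sample
```

For alpha = 1 the filter is a pure difference, so any constant (0 Hz) component is removed.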
8
From Signal to Spectrum: Fourier Transform
- Definition: $F_m(e^{i\omega}) = \sum_{n} f_n \, w_{n-m} \, e^{-i\omega n}$
- $w_n$: window function
- $\omega$: frequency times $2\pi$
- $i$: imaginary unit
- The window cuts the sum down to a finite number of values
- Complex exponentials are easier to handle than cos or sin functions
9
Example: putting a rectangular window on a speech signal
- Frame shift, typ.: 10 ms
- Frame width, typ.: 25 ms
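A sketch of short-time analysis with a rectangular window, using the typical frame width of 25 ms and frame shift of 10 ms. The function name `split_into_frames` and the dummy signal are illustrative; dropping an incomplete last frame is an assumption.

```python
import numpy as np

def split_into_frames(signal, sample_rate, width_ms=25.0, shift_ms=10.0):
    """Cut a 1-D signal into overlapping frames (rectangular window)."""
    width = int(round(sample_rate * width_ms / 1000.0))   # samples per frame
    shift = int(round(sample_rate * shift_ms / 1000.0))   # samples between frame starts
    if len(signal) < width:
        return np.empty((0, width))
    count = 1 + (len(signal) - width) // shift             # incomplete last frame is dropped
    starts = shift * np.arange(count)
    return np.stack([signal[s:s + width] for s in starts])

signal = np.arange(16000, dtype=float)       # 1 s of dummy samples at 16 kHz
frames = split_into_frames(signal, 16000)    # 400-sample frames, 160 samples apart
```

Each row of `frames` later becomes one feature vector.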
10
Fourier Transform in Practice
- Use “Fast Fourier Transform” (FFT)
- Requires number of samples N to be a power of 2 (e.g. N=256)
- Code available
- Complexity: O(N log N)
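A sketch using NumPy's FFT, assuming a 25 ms frame at 16 kHz (400 samples) that is zero-padded to the next power of two. The helper name `magnitude_spectrum` and the 1000 Hz test tone are illustrative.

```python
import numpy as np

def magnitude_spectrum(frame, n_fft=None):
    """Zero-pad a frame to a power of two and return |F| for the non-negative frequencies."""
    if n_fft is None:
        n_fft = 1 << (len(frame) - 1).bit_length()   # next power of two >= len(frame)
    return np.abs(np.fft.rfft(frame, n=n_fft))

fs = 16000
n = np.arange(400)                                # one 25 ms frame at 16 kHz
frame = np.sin(2 * np.pi * 1000 * n / fs)         # 1000 Hz test tone
spectrum = magnitude_spectrum(frame)              # padded to N = 512
peak_hz = np.argmax(spectrum) * fs / 512          # frequency of the strongest bin
```

`np.fft.rfft` returns only the N/2 + 1 non-negative frequency bins, which is sufficient for a real-valued signal.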
11
Established Window Functions
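- Used to get sharper peaks
- Rectangular window: $w^R_n = 1$
- Generalized Hamming window: $w^H_n = \alpha - (1 - \alpha)\cos\left(\frac{2\pi n}{N - 1}\right)$ ($\alpha = 0.54$: standard Hamming window)
- Gauss window: $w^G_n = e^{-0.5\, x_n^2}$, with $x_n$ the distance of n from the window centre N/2, scaled by a width proportional to N
- Parabola window: $w^P_n = 4\,\frac{n}{N}\left(1 - \frac{n}{N}\right)$
- All for n = 0 ... N-1; window functions vanish outside this interval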
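A sketch of three of the windows, assuming the generalized Hamming form alpha - (1 - alpha) * cos(2*pi*n / (N - 1)) and the parabola form 4 * (n/N) * (1 - n/N). The Gauss window is left out because its width constant is not fixed here. Function names are illustrative.

```python
import numpy as np

def rectangular_window(N):
    return np.ones(N)

def generalized_hamming_window(N, alpha=0.54):
    n = np.arange(N)
    return alpha - (1.0 - alpha) * np.cos(2.0 * np.pi * n / (N - 1))

def parabola_window(N):
    n = np.arange(N)
    return 4.0 * (n / N) * (1.0 - n / N)

N = 256
w_hamming = generalized_hamming_window(N)            # alpha = 0.54: standard Hamming
w_hann = generalized_hamming_window(N, alpha=0.5)    # alpha = 0.5: Hann window
```

Multiplying a frame element-wise with one of these arrays before the FFT applies the window.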
12
Rewrite of Fourier Transform
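- Definition: $F_m(e^{i\omega}) = \sum_{n} f_n \, w_{n-m} \, e^{-i\omega n}$
- Window functions vanish outside the interval n = 0 ... N-1, so only N terms of the sum remain
- Define $\omega_k = \frac{2\pi k}{N}$, $k = 0 \ldots N-1$
- $F_m^{(k)} = \sum_{n=0}^{N-1} f_{m+n} \, w_n \, e^{-2\pi i k n / N}$ (equal to $F_m(e^{i\omega_k})$ up to the phase factor $e^{-i\omega_k m}$)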
13
Example for ö
Figure panels: short-time spectrum; smoothed spectrum. x-axis: frequency (Hz)
How can you best look at multiple spectra at the same time?
14
Spectrogram
- Calculate a spectrum for every point in time
- Code the local intensity as color or grey scale
Figure: spectrogram; x-axis: time
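A sketch of a spectrogram as a stack of short-time spectra, assuming 16 kHz audio, 25 ms Hamming-windowed frames, a 10 ms shift and a 512-point FFT. The rising test tone and the dB scaling are illustrative choices.

```python
import numpy as np

def spectrogram(signal, width=400, shift=160, n_fft=512):
    """Return log-magnitude short-time spectra, one row per frame."""
    window = np.hamming(width)
    starts = range(0, len(signal) - width + 1, shift)
    frames = np.stack([signal[s:s + width] * window for s in starts])
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return 20.0 * np.log10(magnitude + 1e-10)      # dB; small offset avoids log(0)

fs = 16000
t = np.arange(fs) / fs
chirp = np.sin(2 * np.pi * (500 + 1500 * t) * t)   # tone rising in frequency
S = spectrogram(chirp)                             # rows: time, columns: frequency
```

Displaying `S` as an image with time on one axis and frequency on the other, intensity coded as grey level, gives the usual spectrogram picture.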
15
Spectrogram
http://www.wilhelm-kurz-software.de/dynaplot/applicationnotes/spectrogram.htm
"To return to the main menu, press the star key".
16
Use Praat to generate a Spectrogram
- Praat: software for doing phonetics by computer
- Written by: Paul Boersma and David Weenink
- Quite powerful: spectrograms, formants, pitch, …
- Download: http://www.fon.hum.uva.nl/praat/
18
Use praat to generate a Spectrogram
a demo
19
Smoothing the Spectrum: filter bank
- Idea: imitate the ear
- Average over neighboring frequencies
- Scale the frequencies according to the Mel or the Bark scale
- Result: reduction from 256 Fourier coefficients to 24 outputs of a filter bank
20
Example of a Filterbank
21
Filterbank
- Spacing of center frequencies: according to the mel scale, $\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$
- Low frequency cut-off: e.g. 300 Hz (for telephone speech)
- High frequency cut-off: e.g. 3400 Hz (for telephone speech)
- Different settings for e.g. a headset connected to a PC
How can you adjust to different vocal tracts?
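A sketch of a mel-spaced filter bank, assuming triangular filters, telephone-band cut-offs of 300 Hz and 3400 Hz, 8 kHz sampling and a 512-point FFT. The triangular shape, the sampling rate and all function names are assumptions for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, fs=8000, f_low=300.0, f_high=3400.0):
    """Triangular filters with centre frequencies equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * fs / n_fft      # frequency of each FFT bin
    bank = np.zeros((n_filters, len(bin_hz)))
    for j in range(n_filters):
        left, centre, right = edges_hz[j], edges_hz[j + 1], edges_hz[j + 2]
        rising = (bin_hz - left) / (centre - left)
        falling = (right - bin_hz) / (right - centre)
        bank[j] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

bank = mel_filterbank()     # shape: (24 filters, 257 FFT bins)
```

Multiplying `bank` with a magnitude spectrum of matching length (`bank @ spectrum`) yields the 24 filter bank outputs.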
22
Vocal Tract Length Normalization
- Idea:
- Average position of formants depends on the length of the vocal tract
- → Vary the position of the filter bank frequencies
- A kind of speaker adaptation
23
Vocal Tract Length Normalization: Frequency Warping
- Translation table for frequencies
- Keep minimum and maximum frequency unchanged
- Warping factor: from min = 0.8 to max = 1.2
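One possible translation table is a piecewise-linear warp that scales frequencies by the warping factor in the middle of the band and bends back to the fixed end points. This is a sketch under assumptions: the knot positions `f_lo` and `f_hi` and the telephone-band limits are hypothetical, and the slides do not state which warping function is used.

```python
import numpy as np

def warp_frequencies(f, alpha, f_min=300.0, f_max=3400.0, f_lo=600.0, f_hi=2800.0):
    """Piecewise-linear warp: scale by alpha between f_lo and f_hi, keep both end points fixed."""
    knots_in = [f_min, f_lo, f_hi, f_max]
    knots_out = [f_min, alpha * f_lo, alpha * f_hi, f_max]
    return np.interp(f, knots_in, knots_out)

centres = np.linspace(300.0, 3400.0, 5)
warped = {alpha: warp_frequencies(centres, alpha) for alpha in (0.8, 1.0, 1.2)}
```

Applying the warp to the filter bank centre frequencies shifts the filters per speaker while the band edges stay where they are.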
24
Training the Warping Factor
- Issue: how to scale for a specific speaker
- Slow version:
- Use 11 different warping factors
- Do speech recognition with all of them
- Pick the best one
- Oldest approach
- Not very efficient
- Improvement: 10% fewer recognition errors
25
From Spectrum to Cepstrum
- Name: swapping of letters (spectrum/cepstrum)
- Useful as a preparation to remove channel distortions
- Cepstral mean subtraction (CMS): method to remove channel distortions
What are examples of channel distortions?
26
Definition “Cepstrum”
Signal → Fourier Transform → Spectrum → log → Discrete Cosine Transform → Cepstrum
27
Math for Cepstrum
- $e_n$: original signal (e.g. excitation from the glottis)
- $f_n$: measured signal
- $h_n$: impulse response of the channel (e.g. vocal tract, telephone, room acoustics)
- The measured signal is the convolution $f_n = \sum_m h_{n-m} \, e_m$
28
Math for Cepstrum
- Apply Fourier transform $\mathcal{F}$: $\mathcal{F}\{f_n\} = \mathcal{F}\{\sum_m h_{n-m} \, e_m\}$
- Use convolution theorem: $\mathcal{F}\{f_n\} = \mathcal{F}\{h_n\} \cdot \mathcal{F}\{e_n\}$
29
Math for Cepstrum
- Apply logarithm: $\log(\mathcal{F}\{f_n\}) = \log(\mathcal{F}\{h_n\}) + \log(\mathcal{F}\{e_n\})$
- Impulse response and excitation are now separated
- If the impulse response $h_n$ is stationary, its contribution can now be removed
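A numerical sketch of why the logarithm helps, assuming a short, fixed channel impulse response and random stand-in frames (both hypothetical). The convolution becomes an additive offset in the log spectrum, and subtracting the mean over time removes that offset. The subtraction is done here on log spectra; since the discrete cosine transform is linear, the same holds after it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fft = 64
h = np.array([1.0, 0.5, 0.25])                  # hypothetical stationary channel h_n
log_H = np.log(np.abs(np.fft.rfft(h, n=n_fft)))

def log_spectra(frames):
    return np.log(np.abs(np.fft.rfft(frames, n=n_fft, axis=1)))

clean = rng.standard_normal((200, 16))           # stand-in excitation frames e_n
# Linear convolution of every frame with the channel: f = h * e
distorted = np.stack([np.convolve(frame, h) for frame in clean])

log_E = log_spectra(clean)
log_F = log_spectra(distorted)                   # equals log_E + log_H

# Subtracting the mean over time removes the constant offset log|H|.
normalized_clean = log_E - log_E.mean(axis=0)
normalized_distorted = log_F - log_F.mean(axis=0)
```

After mean subtraction the clean and the distorted features coincide, so the stationary channel no longer matters.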
30
Cepstrum: do discrete cosine transform after log
- Discrete cosine transform: $c_m^{(n)} = \sqrt{\frac{2}{N}} \sum_{l=1}^{N} \log\left(F_m^{(l)}\right) \cos\left(\frac{\pi n}{N}\left(l - \frac{1}{2}\right)\right), \quad n = 1, 2, \ldots$
You do not need to remember this formula
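A sketch of the discrete cosine transform step, assuming the normalization sqrt(2/N) and that the input holds the N log values of one frame (for example log filter bank outputs). Both the normalization and the function name are assumptions.

```python
import numpy as np

def cepstrum_from_log_spectrum(log_values, n_cepstra):
    """c^(n) = sqrt(2/N) * sum_l log_values[l] * cos(pi*n/N * (l - 1/2)), n = 1..n_cepstra."""
    log_values = np.asarray(log_values, dtype=float)
    N = len(log_values)
    l = np.arange(1, N + 1)                    # channel index l = 1..N
    n = np.arange(1, n_cepstra + 1)[:, None]   # cepstral index n = 1..n_cepstra
    basis = np.cos(np.pi * n / N * (l - 0.5))  # one cosine per cepstral index
    return np.sqrt(2.0 / N) * basis @ log_values

flat = cepstrum_from_log_spectrum(np.ones(24), 20)   # flat log spectrum: all cepstra vanish
```

A flat log spectrum has no structure, so every coefficient with n >= 1 comes out as zero; slow ripples across the channels end up in the low-order coefficients.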
31
Dynamic Features
- Spectrum captures local aspects of speech
- Window size 25 ms
- Capture slow changes in spectrum
- Other name: delta features
32
Dynamic Features
- Capture slow changes in spectrum
33
Dynamic Features
- Calculate first and second derivatives
- Naïve approach to first derivative
- Continuous function: $\frac{df(t)}{dt} \approx \frac{f(t + \Delta t) - f(t - \Delta t)}{2\,\Delta t}$
- Time discrete sampling: $\frac{df(t_m)}{dt} \approx \frac{f(t_{m+1}) - f(t_{m-1})}{2}$
- $t_m$: m-th sample of the signal
34
Difference/Regression
Figure: i-th component of the feature vector plotted over the samples m-3 ... m+3, comparing the regression curve with the line through the extremes
35
Regression Formula
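$\frac{df(t_m)}{dt} \approx \frac{\sum_{i=1}^{M} i \, \left(f(t_{m+i}) - f(t_{m-i})\right)}{2 \sum_{i=1}^{M} i^2}$
Can you make it agree with $\frac{df(t)}{dt} \approx \frac{f(t + \Delta t) - f(t - \Delta t)}{2\,\Delta t}$?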
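A sketch of regression-based delta features, assuming the form sum_i i * (f(t_{m+i}) - f(t_{m-i})) / (2 * sum_i i^2) with i = 1..M. Repeating the first and last frame at the edges is an assumption; the function name is illustrative.

```python
import numpy as np

def delta_features(features, M=2):
    """Regression-based time derivative of a (frames x dims) feature matrix."""
    features = np.asarray(features, dtype=float)
    # Repeat the first and last frame so that every frame has M neighbours.
    padded = np.pad(features, ((M, M), (0, 0)), mode="edge")
    T = len(features)
    numerator = np.zeros_like(features)
    for i in range(1, M + 1):
        numerator += i * (padded[M + i:M + i + T] - padded[M - i:M - i + T])
    return numerator / (2.0 * sum(i * i for i in range(1, M + 1)))

ramp = 3.0 * np.arange(10)[:, None]       # feature growing by 3 per frame
delta = delta_features(ramp, M=2)         # interior frames give slope 3
delta_delta = delta_features(delta, M=2)  # second derivative: apply the operator again
```

With M = 1 the formula reduces to the naive central difference (f(t_{m+1}) - f(t_{m-1})) / 2.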
36
Dynamic Features
- Invented by Furui 1981
- Standard in any modern ASR system
- Alternative:
- Linear mapping of neighboring feature vectors
- Issue:
- Dimension of feature vectors
37
Linear Discriminant Analysis
- Method to decrease size of feature vector
- Maximize separability of class regions
- Linear transform of feature vectors
- More: later in the lecture
38
Complete Pipeline for Mel-Frequency Cepstral Coefficients (MFCC)
Signal → Sampling → Pre-emphasis → Windowing → Fast Fourier Transform → Absolute Value → Mel-scaled Filterbank → log → Discrete Cosine Transform → Dynamic Features (1st and 2nd derivative) → Linear Discriminant Analysis → Feature Vectors
Typical values:
- Sampling: 16 kHz; 16 bit quantization
- Window size: 25 ms
- 512 Fourier coefficients
- 24 filterbank values
- Keep only the 20 lowest cepstra
- 60-dimensional vector
39
Alternative Feature Extraction Methods
- LP-Cepstrum (LP=linear prediction)
- Derived from speech coding
- No longer much in use
- PLP (=Perceptual linear prediction)
- Popular for certain applications
- Claim: more noise robust than MFCCs
- Main change: use $|\cdot|^{1/3}$ instead of log in MFCC
40
Summary
- Classical “plain vanilla” feature extraction:
Mel-Frequency Cepstral Coefficients
- Main deficiency: not very noise robust
- Used in
- Speech Recognition
- Speaker Recognition
- Music genre classification
41
3.2 Feature Extraction from Image Processing
42
Overview
- Feature types:
- Color
- Texture
- Edge
43
Image
44
Physics
- It’s all electromagnetic (EM) radiation
- Different colors correspond to radiation of different wavelengths
- Intensity of each wavelength specified by amplitude
- We perceive EM radiation within the 400-700 nm range, a tiny piece of the spectrum between infra-red and ultraviolet
45
Visible Light
46
Color and Wavelength
Most light we see is not just a single wavelength, but a combination of many wavelengths (see below). This profile is often referred to as a spectrum, or spectral power distribution.
47
Image Representation (RGB)
48
Image Representation (Channels)
49
Image Representation
Each pixel holds an (r, g, b) triple; the image is C pixels wide and R pixels long
50
Color Histogram
- Calculate the percentage of each color present in the image
- Deficiency: loss of regional information
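A sketch of a color histogram, assuming an 8-bit RGB image stored as an array of shape (rows, columns, 3) and a coarse grid of 4 bins per channel. Bin count and function name are illustrative.

```python
import numpy as np

def color_histogram(image, bins_per_channel=4):
    """Fraction of pixels falling into each (r, g, b) bin of an 8-bit RGB image."""
    pixels = np.asarray(image).reshape(-1, 3)
    bin_index = (pixels.astype(int) * bins_per_channel) // 256     # per-channel bin
    flat = (bin_index[:, 0] * bins_per_channel + bin_index[:, 1]) * bins_per_channel + bin_index[:, 2]
    counts = np.bincount(flat, minlength=bins_per_channel ** 3)
    return counts / counts.sum()

image = np.zeros((2, 2, 3), dtype=np.uint8)
image[0, :] = (255, 0, 0)              # top row red, bottom row black
hist = color_histogram(image)          # half the mass in the black bin, half in the red bin
flipped = color_histogram(image[::-1]) # rows swapped: same histogram
```

The flipped image yields the identical histogram, which shows the loss of regional information; calling the same function on sub-blocks of the image is one way to keep some locality.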
51
Localized Features
Compute a color histogram for each region of the image
52
Edge Detection: Sobel Operator
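$G_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \quad G_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix}$
$|G| = \sqrt{G_x^2 + G_y^2}, \quad \theta = \arctan(G_y / G_x)$
Apply the matrices Gx and Gy to every image region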
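A sketch of the Sobel operator on a grey-scale image, assuming the common sign convention for the two 3x3 kernels and skipping the border pixels. `np.arctan2` is used as a quadrant-aware replacement for arctan(Gy/Gx) so that Gx = 0 causes no division by zero; that substitution is a choice made here.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def apply_kernel(image, kernel):
    """Correlate a 3x3 kernel with every interior 3x3 region of a grey-scale image."""
    rows, cols = image.shape
    out = np.zeros((rows - 2, cols - 2))
    for r in range(rows - 2):
        for c in range(cols - 2):
            out[r, c] = np.sum(image[r:r + 3, c:c + 3] * kernel)
    return out

def sobel(image):
    image = np.asarray(image, dtype=float)
    gx = apply_kernel(image, SOBEL_X)
    gy = apply_kernel(image, SOBEL_Y)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    direction = np.arctan2(gy, gx)     # quadrant-aware version of arctan(Gy / Gx)
    return magnitude, direction

image = np.zeros((5, 6))
image[:, 3:] = 1.0                     # vertical edge between columns 2 and 3
magnitude, direction = sobel(image)
```

The magnitude is large only in the regions that straddle the vertical edge, and the direction there points along the x axis.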
53
Texture Image Examples
- From the VisTex Texture Database
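54
Gabor filters
- Gaussian window modulated with a complex sinusoid
- $G(x, y) = K \exp\left(j(\omega_x x + \omega_y y)\right) \exp\left(-\frac{x^2 / \sigma_x^2}{2}\right) \exp\left(-\frac{y^2 / \sigma_y^2}{2}\right)$, with K a normalization constant, $(\omega_x, \omega_y)$ the spatial frequency and $\sigma_x, \sigma_y$ the widths of the Gaussian
- Figure: Gabor filters at different scales and spatial frequencies; the top row shows anti-symmetric (or odd) filters, the bottom row the symmetric (or even) filters
- Visual cortical cells have band-pass responses very similar to Gabor filters
You do not need to remember this formula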
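A sketch of a Gabor kernel as a Gaussian window multiplied by a complex sinusoid. An isotropic Gaussian, a normalization constant of 1 and the parameter names `sigma`, `omega_x`, `omega_y` are assumptions for illustration.

```python
import numpy as np

def gabor_kernel(size=21, sigma=4.0, omega_x=0.5, omega_y=0.0):
    """Complex Gabor kernel: Gaussian envelope times exp(j*(omega_x*x + omega_y*y))."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * (omega_x * x + omega_y * y))
    return envelope * carrier

kernel = gabor_kernel()
even_filter = kernel.real      # symmetric (cosine) part
odd_filter = kernel.imag       # anti-symmetric (sine) part
```

The real part gives the symmetric (even) filter and the imaginary part the anti-symmetric (odd) filter; varying `sigma` and the frequencies produces filters at different scales and orientations.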
55
Summary
- Main features for image recognition
- Color
- Edges
- Texture