Nonlinear Aspects of Speech Production: Fractals and Chaotic Dynamics



slide-1
SLIDE 1

Computer Vision, Speech Communication & Signal Processing Group, Intelligent Robotics and Automation Laboratory

National Technical University of Athens, Greece (NTUA)

Robot Perception and Interaction Unit,

Athena Research and Innovation Center (Athena RIC)

Nonlinear Aspects of Speech Production: Fractals and Chaotic Dynamics

Petros Maragos


Summer School on Speech Signal Processing (S4P) DA-IICT, Gandhinagar, India, 9-11 Sept. 2018

slide-2
SLIDE 2

Outline

  • Nonlinear Speech Processing
  • Turbulence: Fractals, Chaotic Dynamics
  • Multiscale Fractal Dimensions of Speech Sounds
  • Fractal Modulations for Fricative Sounds
  • Chaotic Dynamics of Speech Sounds
  • Algorithms for Speech Fractal & Chaos Analysis
  • Application to Speech Recognition
  • Application to Music Recognition

slide-3
SLIDE 3

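Linear Source-Filter Model (Rabiner & Schafer, 1978)

[Block diagram: an IMPULSE TRAIN GENERATOR (set by the PITCH PERIOD) feeds the GLOTTAL PULSE MODEL G(z), scaled by gain A_V; a RANDOM NOISE GENERATOR is scaled by gain A_N; a VOICED/UNVOICED SWITCH selects the excitation u(n), which passes through the VOCAL TRACT MODEL V(z) (set by the VOCAL TRACT PARAMETERS) and the RADIATION MODEL R(z) to give the speech signal s(n).]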
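The block diagram can be illustrated with a minimal sketch. It is only an assumption-laden toy: the glottal model G(z) and radiation model R(z) are omitted, V(z) is reduced to a single two-pole resonance, and every name and numeric value (`excitation`, `resonator`, 500 Hz, 80 Hz, and so on) is hypothetical rather than taken from the slides.

```python
import math
import random

def excitation(n, fs, voiced, pitch_hz=100.0, gain=1.0, seed=0):
    """Impulse train (voiced) or white noise (unvoiced), scaled by a gain."""
    if voiced:
        period = int(round(fs / pitch_hz))  # pitch period in samples
        return [gain if k % period == 0 else 0.0 for k in range(n)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n)]

def resonator(x, freq_hz, bandwidth_hz, fs):
    """Two-pole all-pole filter, a one-formant stand-in for V(z)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)  # pole radius, always < 1
    a1 = -2.0 * r * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = r * r
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn - a1 * y1 - a2 * y2  # y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
        y.append(yn)
        y1, y2 = yn, y1
    return y

fs = 8000  # assumed sampling rate in Hz
voiced_sound = resonator(excitation(800, fs, voiced=True), 500.0, 80.0, fs)
unvoiced_sound = resonator(excitation(800, fs, voiced=False, gain=0.1), 2500.0, 300.0, fs)
```

The voiced/unvoiced switch is simply the `voiced` flag; the rest of the deck argues that this linear picture misses the turbulent, nonlinear part of speech airflow.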

slide-4
SLIDE 4

Nonlinear Fluid Dynamics of the Vocal Tract

(Kaiser 1993)

slide-5
SLIDE 5

Physics of Speech Airflow

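  • airflow variables: $\rho$ = air density; $p$ = pressure; $\mathbf{u}$ = 3D air particle velocity
  • governing equations:
      – mass conservation (continuity eqn): $\dfrac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{u}) = 0$
      – momentum conservation (Navier-Stokes eqn): $\rho \left( \dfrac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} \right) = -\nabla p + \rho \mathbf{g} + \mu \nabla^2 \mathbf{u} + \dfrac{1}{3} \mu \nabla (\nabla \cdot \mathbf{u})$
      – state equation: $p \, \rho^{-\gamma} = \text{const.}$, with $\gamma = 1.4$
  • time-varying boundary conditions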

slide-6
SLIDE 6

Speech Aerodynamics

  • Reynolds number: $\mathrm{Re} = \dfrac{\rho \, (\text{velocity scale}) \, (\text{length scale})}{\mu} \sim \dfrac{\text{inertia forces}}{\text{viscous forces}}$
  • low viscosity μ ⇒ high Re
  • “aerodynamic” phenomena (Re >> 1): air jet, rotational motion, separated airflow, boundary layers, vortices, turbulence
  • experimental & theoretical evidence for nonlinear phenomena: Teager (1970s–1980s), Kaiser (1983 – ), Thomas (1986), McGowan (1988), Barney, Shadle & Davis (1999), ...

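As a rough worked example of the Reynolds number on this slide, the sketch below evaluates Re for one set of illustrative values. The air density, air viscosity, jet velocity and constriction size are assumed round numbers chosen for the example, not measurements quoted in the slides.

```python
def reynolds_number(density, velocity_scale, length_scale, viscosity):
    """Re = rho * U * L / mu, the ratio of inertia to viscous forces."""
    return density * velocity_scale * length_scale / viscosity

RHO_AIR = 1.2    # kg/m^3, assumed air density
MU_AIR = 1.8e-5  # Pa*s, assumed dynamic viscosity of air

# Hypothetical jet: 10 m/s through a 1 cm constriction.
re_example = reynolds_number(RHO_AIR, 10.0, 0.01, MU_AIR)
```

Under these assumptions Re is in the thousands, that is, Re >> 1, which is the regime the slide calls "aerodynamic".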
slide-7
SLIDE 7

Vortices

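  • vorticity: $\boldsymbol{\omega} = \nabla \times \mathbf{u}$
  • VORTEX is a flow region of similar vorticity $\boldsymbol{\omega}$
  • a vortex can be generated by:
      – velocity gradients in boundary layers
      – separated air flow
      – curved geometry of vocal tract
  • dynamics of vortex propagation: $\dfrac{\partial \boldsymbol{\omega}}{\partial t} + \mathbf{u} \cdot \nabla \boldsymbol{\omega} = \boldsymbol{\omega} \cdot \nabla \mathbf{u} + \nu \nabla^2 \boldsymbol{\omega}$
      – $\boldsymbol{\omega} \cdot \nabla \mathbf{u}$: vorticity twisting & stretching
      – $\nu \nabla^2 \boldsymbol{\omega}$: diffusion of vorticity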

slide-8
SLIDE 8

Nonlinear Speech Processing

  • Modulations
  • Turbulence

– Fractals – Chaos

slide-9
SLIDE 9

Turbulence

  • flow state with broad-spectrum rapidly-varying (in space and time) velocity and vorticity
  • transition to turbulence is easier for higher Re flows
  • eddies: vortices of a characteristic size $\lambda$
  • Energy Cascade Theory (Richardson, 1922) (multiscale hierarchy of eddies)
  • 5/3 spectral law (Kolmogorov, 1941): $S(k, r) \propto r^{2/3} k^{-5/3}$
      – $k = 2\pi / \lambda$: wavenumber
      – $r$: energy dissipation rate
      – $S(k, r)$: velocity wavenumber spectrum

slide-10
SLIDE 10

Turbulence, Fractals and Chaos

  • fractal geometry quantifies multiscale structures in turbulence
  • Kolmogorov’s 5/3 law ⇒ $\mathrm{Var}[u(x + \Delta x) - u(x)] \propto |\Delta x|^{2/3}$
  • we use fractal dimension to quantify “amount” of turbulence in speech
  • chaos ↔ turbulence

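The 5/3 spectral law and the 2/3 variance law can be tied to a fractal dimension in a short worked derivation. It rests on an assumption that the slides do not state: the one-dimensional velocity profile is treated as a fractional-Brownian-type process with Hurst exponent $H$.

```latex
\begin{align*}
\mathrm{Var}[u(x+\Delta x) - u(x)] &\propto |\Delta x|^{2H},
  & 2H = \tfrac{2}{3} \;&\Rightarrow\; H = \tfrac{1}{3}, \\
S(k) &\propto k^{-(2H+1)} = k^{-5/3}, \\
D &= 2 - H = \tfrac{5}{3}
  && \text{(dimension of the graph of } u\text{)}.
\end{align*}
```

Under that assumption a rougher profile (smaller $H$) gives a larger $D$, which is the sense in which fractal dimension can measure the "amount" of turbulence.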
slide-11
SLIDE 11

Multiscale Fractal Dimension of Speech Sounds

[Figure: speech signals /f/, /v/ and /iy/ (amplitude vs. time, 0–30 ms), each shown with its fractal dimension (1 to 2) vs. scale (0.5–4 ms).]

[ P. Maragos & A. Potamianos, JASA 1999 ]

slide-12
SLIDE 12

Speech Attractors

[Figure: signal X(t) vs. time and reconstructed 3D attractor for four phonemes: /ao/ (DE=6, #1846), /iy/ (DE=5, #1068), /k/ (DE=6, #816), /s/ (DE=5, #829).]

[ Pitsikalis & Maragos, Speech Commun 2009 ]

slide-13
SLIDE 13

Multiscale Fractal Dimensions for Speech Sounds

Refs:

  • P. Maragos and A. Potamianos, “Fractal Dimensions of Speech Sounds: Computation and Application to Automatic Speech Recognition”, Journal of the Acoustical Society of America, March 1999.
  • P. Maragos, “Fractal Signal Analysis Using Mathematical Morphology”, in Advances in Electronics and Electron Physics, vol. 88, Academic Press, 1994.

slide-14
SLIDE 14

FRACTALS: Definitions

  • Mandelbrot’s definition: set $S$ is fractal $\iff$ Hausdorff dim $D_H(S)$ > topological dim $D_T(S)$
  • Examples:
      – $S$ = fractal dust: $0 = D_T < D_H \le 1$
      – $S$ = fractal curve: $1 = D_T < D_H \le 2$
      – $S$ = fractal surface: $2 = D_T < D_H \le 3$
  • Signals: a function $f: \mathbb{R}^{\nu} \to \mathbb{R}$ is a fractal if its graph $Gr(f)$ is a fractal set in $\mathbb{R}^{\nu+1}$; if $f$ is continuous, $\nu = D_T \le D_H[Gr(f)] \le \nu + 1$

slide-15
SLIDE 15

‘FRACTAL’ DIMENSIONS (OF SETS IN $\mathbb{R}^{\nu}$)

  • $D_H$ = Hausdorff dimension
  • $D_{MB}$ = Minkowski-Bouligand dimension
  • $D_{BC}$ = box counting dimension
  • $D_S$ = similarity dimension

$D_T \le D_H \le D_{MB} = D_{BC} \le \nu$

$D_H \le D_S$

slide-16
SLIDE 16

Morphological Measurement of Fractal Dimension

  • Minkowski cover of curve $G$: $C_B(r) = G \oplus rB = \bigcup_{z \in G} (rB + z)$
  • Fractal (Minkowski-Bouligand) dimension $D \in [1, 2]$: with $A_B(r) = \text{area}[C_B(r)]$, the length of $G$ at scale $r$ is $A_B(r) / (2r) \propto r^{1-D}$ as $r \to 0$
  • Least-Squares line fit to data $\left( \log(1/r),\ \log[A_B(r)/r^2] \right)$ ⇒ slope $D$

slide-17
SLIDE 17

Morphological (Flat & Weighted) Filters

Dilation (Max-plus convolution): $(f \oplus g)(x) = \max_y \, [f(y) + g(x - y)]$

Erosion (Min-plus correlation): $(f \ominus g)(x) = \min_y \, [f(y) - g(y - x)]$

Opening: $f \circ g = (f \ominus g) \oplus g$

Closing: $f \bullet g = (f \oplus g) \ominus g$

[Figure: original signal and parabola pulse (vs. sample index), followed by the dilation, erosion, opening and closing of the signal by flat and parabolic structuring elements (SE).]
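The four operators can be sketched directly from their max-plus and min-plus definitions. This is a minimal illustration, not the authors' implementation: the structuring function `g` is represented as a dictionary from integer offsets to heights, samples outside the signal are simply ignored, and the helper names and the parabola curvature are hypothetical choices.

```python
import math

def dilate(f, g):
    """(f ⊕ g)(x) = max_y [f(y) + g(x - y)]; g maps integer offsets to heights."""
    n = len(f)
    return [max(f[x - k] + gk for k, gk in g.items() if 0 <= x - k < n)
            for x in range(n)]

def erode(f, g):
    """(f ⊖ g)(x) = min_y [f(y) - g(y - x)]."""
    n = len(f)
    return [min(f[x + k] - gk for k, gk in g.items() if 0 <= x + k < n)
            for x in range(n)]

def opening(f, g):
    """Erosion followed by dilation; never exceeds f."""
    return dilate(erode(f, g), g)

def closing(f, g):
    """Dilation followed by erosion; never falls below f."""
    return erode(dilate(f, g), g)

def flat_se(half_width):
    """Flat structuring element: zero height on [-half_width, half_width]."""
    return {k: 0.0 for k in range(-half_width, half_width + 1)}

def parabolic_se(half_width, curvature):
    """Parabolic structuring function g(k) = -curvature * k^2 (assumed shape)."""
    return {k: -curvature * k * k for k in range(-half_width, half_width + 1)}

signal = [math.sin(0.2 * i) + 0.3 * math.sin(1.7 * i) for i in range(100)]
se = parabolic_se(5, 0.05)
opened = opening(signal, se)
closed = closing(signal, se)
```

With a flat element the dilation and erosion reduce to a moving maximum and a moving minimum, which is the case used by the covering method for fractal dimension.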

slide-18
SLIDE 18


Minkowski Fractal Dimension of 1D Curve and Morphological Algorithm for 1D Signals
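A minimal sketch of the morphological covering idea for a sampled 1D signal follows. It assumes a flat, symmetric structuring element, so the cover area at integer scale s is the sum of (moving max minus moving min) over a window of 2s+1 samples, and it estimates D as the least-squares slope of log[A(s)/s²] against log(1/s). The function names, the scale range 1..10 and the test signals are illustrative assumptions.

```python
import math
import random

def cover_area(f, s):
    """Area between the flat dilation and flat erosion of f at scale s."""
    n = len(f)
    area = 0.0
    for i in range(n):
        window = f[max(0, i - s):min(n, i + s + 1)]
        area += max(window) - min(window)
    return area

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def minkowski_dimension(f, max_scale=10):
    """Slope of log[A(s)/s^2] vs log(1/s); needs a non-constant signal (A > 0)."""
    xs = [math.log(1.0 / s) for s in range(1, max_scale + 1)]
    ys = [math.log(cover_area(f, s) / (s * s)) for s in range(1, max_scale + 1)]
    return fit_slope(xs, ys)

rng = random.Random(0)
line = [0.01 * i for i in range(1000)]               # smooth curve
noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]  # very rough curve
d_line = minkowski_dimension(line)
d_noise = minkowski_dimension(noise)
```

On the straight line the estimate stays close to 1, while on white noise it moves toward 2, matching the range D ∈ [1, 2] for signal graphs.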

slide-19
SLIDE 19

ST Speech & Fractal Dimension

[Figure: speech signal /soothing/ vs. time (0–1 s), with its short-time zero-crossings, MS-amplitude and fractal dimension (1 to 2).]

slide-20
SLIDE 20

Multiscale Speech Fractal Dimension

  • short-time speech signal: $S(t)$, $0 \le t \le T$
  • signal graph: $G = \{ (t, S(t)) \in \mathbb{R}^2 : 0 \le t \le T \}$
  • fractal (constant power law): $\text{area}[G \oplus \varepsilon B] \approx C \, \varepsilon^{2-D}$ as $\varepsilon \to 0$
  • variable power law: $\text{area}[G \oplus \varepsilon B] \approx C \, \varepsilon^{2-D(\varepsilon)}$
  • multiscale fractal “dimension” (speech fractogram): $MFD(\varepsilon, t) = D(\varepsilon)$ of the short-time speech segment around time $t$
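The step from a single dimension to a multiscale one can be sketched as a local line fit. The code below assumes the cover areas have already been measured at integer scales s = 1, 2, ..., and fits the slope of log[A(s)/s²] against log(1/s) over a short moving window of scales. The window length and the synthetic power-law input are assumptions made for the illustration.

```python
import math

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def multiscale_fractal_dimension(areas, window=3):
    """Local dimension D(s): areas[k] is the cover area at scale s = k + 1."""
    xs = [math.log(1.0 / (k + 1)) for k in range(len(areas))]
    ys = [math.log(a / ((k + 1) ** 2)) for k, a in enumerate(areas)]
    return [fit_slope(xs[i:i + window], ys[i:i + window])
            for i in range(len(areas) - window + 1)]

# Synthetic check: an exact power law A(s) = C * s^(2 - D) with D = 1.6.
true_d = 1.6
areas = [50.0 * s ** (2.0 - true_d) for s in range(1, 11)]
mfd_profile = multiscale_fractal_dimension(areas)
```

For an exact power law the profile is flat at D; for speech it varies with scale, and repeating the computation frame by frame would give the fractogram over time.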

slide-21
SLIDE 21

Multiscale Fractal Dimension of Speech Sounds

[Figure: speech signals /f/, /v/ and /iy/ (amplitude vs. time, 0–30 ms), each shown with its fractal dimension (1 to 2) vs. scale (0.5–4 ms).]

[ P. Maragos & A. Potamianos, JASA 1999 ]

slide-22
SLIDE 22

[Figure: six panels of fractal dimension (1 to 2) vs. scale (1–4 ms) for /AA/, /B/, /F/, /R/, /EN/ and /M/.]

Mean and standard deviation (error bars) of the multiscale fractal dimension for the phonemes /aa/, /b/, /en/, /f/, /m/, /r/ from the TIMIT database (20 ms window, updated every 10 ms; average over 200 phonemic instances).

slide-23
SLIDE 23

Mean MFD for /sh/, /zh/, /uh/, /t/, /d/

slide-24
SLIDE 24

Word Percent Correct for the E-set Recognition Task (ISOLET Database, 5-Mixture Gaussians per HMM State)

Feature sets: $\{C, E, \Delta C, \Delta E\}$; $\{C, E, \Delta C, \Delta E, D_1, \Delta D_1\}$; $\{C, E, \Delta C, \Delta E, D_{1\text{-}16}, \Delta D_{1\text{-}16}\}$
Word percent correct: 81.2%, 83.5%, 84.5%

Word Percent Correct for the E-set Recognition Task

Models \ Features        $\{C, E, \Delta C, \Delta E, \Delta\Delta C, \Delta\Delta E\}$    $\{C, E, \Delta C, \Delta E, \Delta\Delta C, \Delta\Delta E\} + \{D, \Delta D\}$
5-mixture Gaussians      85.6%                                                             86.3%
10-mixture Gaussians     88.6%                                                             88.9%

Maragos & Potamianos, JASA 1999

slide-25
SLIDE 25

Fractal Modulations for Fricative Sounds

Ref:

  • A. G. Dimakis and P. Maragos, “Phase Modulated Resonances Modeled as Self-Similar Processes With Application to Turbulent Sounds”, IEEE Transactions on Signal Processing, Nov. 2005.

slide-26
SLIDE 26

1/f Noises

  • An important class of statistically self-similar random processes defined by their measured power spectra: $S(\omega) \sim \dfrac{\sigma^2}{|\omega|^{\gamma}}$
  • A truly enormous collection of natural phenomena exhibit 1/f-type spectral behavior over a wide frequency range (frequency variations in quartz crystal oscillators, geophysical variations, heart rate variations, electronic device noises, network traffic flow and economic time series).
  • Most popular mathematical model for Gaussian 1/f processes: Fractional Brownian Motion (FBM)

slide-27
SLIDE 27

$1/f^{\gamma}$ Noises

  • Stochastic processes with power spectrum $1/f^{\gamma}$
  • Filtering white noise with convolution kernel $t^{\gamma/2 - 1}$ (Fractional Integration)
  • Non-exponential autocorrelation
  • White noise: $\gamma = 0$
  • Pink noise: $\gamma = 1$
  • Fractal Brownian Motion: $1 < \gamma < 3$
  • Brown noise: $\gamma = 2$
  • Black noise: $\gamma > 2$
  • Applications: electronics, geophysics, astronomy, music, acoustics, optics, economics, traffic flows, communications, geometry of nature

slide-28
SLIDE 28

Examples of FFT-based Synthesis of 1D FBM
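Spectral synthesis of a 1/f^γ signal can be sketched as follows. To stay dependency-free the sketch replaces the inverse FFT with a slow direct sum of cosines, and it simplifies the usual recipe by fixing each harmonic amplitude at k^(−γ/2) and randomizing only the phase. Both are choices made for this illustration, not a description of how the examples on the slide were generated. A periodogram regression then recovers the exponent.

```python
import math
import random

def synthesize_power_law_noise(n, gamma, seed=0):
    """Sum of harmonics k = 1 .. n/2 - 1 with amplitude k^(-gamma/2), random phase."""
    rng = random.Random(seed)
    half = n // 2
    amps = [k ** (-gamma / 2.0) for k in range(1, half)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(1, half)]
    return [sum(a * math.cos(2.0 * math.pi * (k + 1) * t / n + ph)
                for k, (a, ph) in enumerate(zip(amps, phases)))
            for t in range(n)]

def periodogram(x, k):
    """Power of x at DFT bin k, computed by a direct sum."""
    n = len(x)
    re = sum(v * math.cos(2.0 * math.pi * k * t / n) for t, v in enumerate(x))
    im = sum(v * math.sin(2.0 * math.pi * k * t / n) for t, v in enumerate(x))
    return (re * re + im * im) / n

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

x = synthesize_power_law_noise(256, gamma=2.0)  # gamma = 2: Brownian-like
bins = range(1, 33)
gamma_hat = -fit_slope([math.log(k) for k in bins],
                       [math.log(periodogram(x, k)) for k in bins])
```

Because the amplitudes are deterministic here, the log-log periodogram is an exact straight line; with random amplitudes, as in measured data, the regression would scatter around the same slope.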

slide-29
SLIDE 29


FBM Synthesis of Fractal Landscapes

D = 2.15 D = 2.5 D = 2.8

  • R. Voss, 1988
slide-30
SLIDE 30

1/f Speech Modulation Model

  • Model a resonance of a random speech phoneme as a phase-modulated 1/f signal: $S(t) = A \cos\left( \omega_c t + P(t) \right)$
  • Nonlinear phase signal P(t) modeled as 1/f random process.
  • Useful model for broad resonances often observed in fricative voiced or unvoiced sounds and probably caused by nonlinear phenomena during speech production.

slide-31
SLIDE 31

Parameter Estimation in 1/f-PM

  • Isolate resonance: bandpass filter the speech signal.
  • Demodulate the filtered signal using ESA, obtain the instantaneous frequency F(t), and median filter it to reduce spikes.
  • Estimate the phase modulation signal P(t) by integrating the instantaneous frequency: $P(t) = 2\pi \int_0^t \left( F(\tau) - F_c \right) d\tau$
  • Fit the $1/f^{\gamma}$ model to P(t). Methods tested include:
      – Linear regression on periodogram
      – Estimation using variance of wavelet coefficients
      – Maximum Likelihood estimation
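The median-filtering and integration steps can be sketched on a sampled instantaneous-frequency track. The sketch assumes the track F[n] is already available (the ESA demodulation itself is not shown), approximates the integral with a running rectangle-rule sum, and takes the centre frequency F_c as a given input. The function names and the numbers in the example are hypothetical.

```python
import math

def median_filter(x, half_width=2):
    """Sliding median; windows are truncated at the signal edges."""
    n = len(x)
    out = []
    for i in range(n):
        window = sorted(x[max(0, i - half_width):min(n, i + half_width + 1)])
        out.append(window[len(window) // 2])
    return out

def phase_modulation(inst_freq_hz, center_freq_hz, fs):
    """P[n] ~ 2*pi * sum_{m<=n} (F[m] - F_c) / fs (rectangle rule)."""
    phase, acc = [], 0.0
    for f in inst_freq_hz:
        acc += 2.0 * math.pi * (f - center_freq_hz) / fs
        phase.append(acc)
    return phase

# A flat 3000 Hz track with one demodulation spike at sample 5.
track = [3000.0] * 10
track[5] = 9000.0
smoothed = median_filter(track)
phase = phase_modulation(smoothed, 3000.0, fs=16000.0)
```

After the spike is removed the track equals F_c everywhere, so the recovered phase modulation is identically zero; a constant offset from F_c would instead give a linear phase ramp.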

slide-32
SLIDE 32

/S/ phoneme experiment

[Figure: panels for the /S/ phoneme: speech signal vs. time (0–0.05 s); power spectrum vs. frequency (0–8000 Hz); instant frequency vs. time; phase modulation P(t) vs. time; PSD of P(t) vs. frequency (log axis), with fitted exponent γ = 2.99; variance of wavelet coefficients vs. scale m.]
slide-33
SLIDE 33

/Z/ phoneme experiment

[Figure: panels for the /Z/ phoneme: speech signal vs. time (0–0.07 s); power spectrum vs. frequency (0–8000 Hz); instant frequency vs. time; phase modulation P(t) vs. time; PSD of P(t) vs. frequency (log axis), with fitted exponent γ = 3.63; variance of wavelet coefficients vs. scale m.]
slide-34
SLIDE 34

Chaotic Dynamics of Speech Sounds

Refs:

  • V. Pitsikalis and P. Maragos, “Filtered Dynamics and Fractal Dimensions for Noisy Speech Recognition”, IEEE Signal Processing Letters, Nov. 2006.
  • V. Pitsikalis and P. Maragos, “Analysis and Classification of Speech Signals by Generalized Fractal Dimension Features”, Speech Communication, Dec. 2009.
  • I. Kokkinos and P. Maragos, “Nonlinear Speech Analysis Using Models for Chaotic Systems”, IEEE Transactions on Speech and Audio Processing, Nov. 2005.

slide-35
SLIDE 35

Embedding-Attractor Reconstruction

  • Parameters to specify: time delay $T$ and embedding dimension $D_E$
  • Embedding vector: $Y(t) = \left[ x_1(t),\ x_1(t + T),\ x_1(t + 2T),\ \ldots,\ x_1(t + (D_E - 1) T) \right]$

[Figure: Lorenz attractor (σ=5, R=15, B=1; ∆t=0.25; 20000 iterations) with axes X, Y, Z; its X-projection X(t) vs. time; and the reconstructed Lorenz attractor (20000 iterations) in the delay coordinates $x_1(t)$, $x_1(t+T)$, $x_1(t+2T)$.]
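Delay-coordinate embedding of a scalar signal can be sketched in a few lines. The function name and the forward-delay convention x(t), x(t+T), ... are assumptions of this illustration.

```python
def delay_embed(x, dim, delay):
    """Return vectors [x[i], x[i+delay], ..., x[i+(dim-1)*delay]] for all valid i."""
    n_vectors = len(x) - (dim - 1) * delay
    if n_vectors <= 0:
        raise ValueError("signal too short for this dim and delay")
    return [[x[i + k * delay] for k in range(dim)] for i in range(n_vectors)]

samples = list(range(10))
vectors = delay_embed(samples, dim=3, delay=2)
```

Each row of `vectors` is one point of the reconstructed attractor; the choice of `delay` and `dim` is what the next two slides address.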

slide-36
SLIDE 36

  • Nonlinear Dynamic System (Lorenz):
      $\dfrac{dx}{dt} = \sigma (y - x), \qquad \dfrac{dy}{dt} = R x - y - x z, \qquad \dfrac{dz}{dt} = -B z + x y$
  • Attractor
  • 1D Projection

[Figure: Lorenz attractor (σ=5, R=15, B=1; ∆t=0.25; 20000 iterations) with axes X, Y, Z, and its X-projection X(t) vs. time.]
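The Lorenz equations above can be integrated numerically to produce such an orbit and its 1D projection. The sketch uses a classical fourth-order Runge-Kutta step with the slide's parameter values σ=5, R=15, B=1; the internal step size of 0.01 and the initial state (1, 1, 1) are assumptions of this example, not values confirmed by the slides.

```python
import math

def lorenz_rhs(state, sigma, big_r, big_b):
    """Right-hand side of the Lorenz system."""
    x, y, z = state
    return (sigma * (y - x), big_r * x - y - x * z, -big_b * z + x * y)

def rk4_step(state, dt, sigma, big_r, big_b):
    """One classical Runge-Kutta step of size dt."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = lorenz_rhs(state, sigma, big_r, big_b)
    k2 = lorenz_rhs(shifted(state, k1, dt / 2.0), sigma, big_r, big_b)
    k3 = lorenz_rhs(shifted(state, k2, dt / 2.0), sigma, big_r, big_b)
    k4 = lorenz_rhs(shifted(state, k3, dt), sigma, big_r, big_b)
    return tuple(s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate(n_steps, dt=0.01, state=(1.0, 1.0, 1.0),
             sigma=5.0, big_r=15.0, big_b=1.0):
    """Return the orbit as a list of (x, y, z) states, initial state included."""
    orbit = [state]
    for _ in range(n_steps):
        state = rk4_step(state, dt, sigma, big_r, big_b)
        orbit.append(state)
    return orbit

orbit = simulate(2000)
x_projection = [s[0] for s in orbit]  # the scalar signal X(t)
```

Only `x_projection` would be available to an observer of a single variable; the embedding step rebuilds a multidimensional orbit from it.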

slide-37
SLIDE 37

Time Delay

  • Average Mutual Information between $x(t)$ and $x(t + T)$: $I(T) = \sum \Pr\left( x(t), x(t+T) \right) \log \dfrac{\Pr\left( x(t), x(t+T) \right)}{\Pr\left( x(t) \right) \Pr\left( x(t+T) \right)}$
  • “Optimum” Time Delay: $T_{opt} = \min \{ \arg\min_T I(T) \}$, i.e. the smallest delay at which $I(T)$ attains a minimum

[Figure: Average Mutual Information vs. time delay T (0–150) for the Lorenz system, σ=5, R=15, B=1, ∆t=0.25, #10000.]
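A histogram-based estimate of the average mutual information, and the search for its first minimum, can be sketched as below. The equal-width binning, the bin count of 16 and the sinusoidal test signal are assumptions of this illustration; other density estimators would work equally well.

```python
import math
from collections import Counter

def average_mutual_information(x, delay, bins=16):
    """Plug-in estimate of I(T) from a 2D histogram of (x[t], x[t+delay])."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0  # guard against a constant signal
    def bin_of(v):
        return min(int((v - lo) / width), bins - 1)
    pairs = [(bin_of(x[i]), bin_of(x[i + delay])) for i in range(len(x) - delay)]
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum((c / n) * math.log(c * n / (first[a] * second[b]))
               for (a, b), c in joint.items())

def first_minimum_delay(x, max_delay, bins=16):
    """Smallest delay at a local minimum of I(T); falls back to the global minimum."""
    ami = [average_mutual_information(x, d, bins) for d in range(1, max_delay + 1)]
    for i in range(1, len(ami) - 1):
        if ami[i - 1] > ami[i] <= ami[i + 1]:
            return i + 1, ami  # delays start at 1
    return ami.index(min(ami)) + 1, ami

signal = [math.sin(0.3 * t) for t in range(400)]
t_opt, ami_curve = first_minimum_delay(signal, max_delay=30)
```

The returned delay is then used as T when building the delay vectors.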

slide-38
SLIDE 38

Embedding Dimension

  • Sufficient: $D_E > 2 D_{Attractor}$
  • False Neighbors: from projection
  • True Neighbors: from dynamics
  • False Neighbors Criterion: $R_{i,j} = \dfrac{\| y_{d+1}(i) - y_{d+1}(j) \| - \| y_d(i) - y_d(j) \|}{\| y_d(i) - y_d(j) \|} > \text{Threshold}$
  • When % false neighbors = 0, the attractor is unfolded

[Figure: % false neighbors vs. embedding dimension DE (1–5) for the Lorenz system, σ=5, R=15, B=1, ∆t=0.25, #2000.]
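The false-neighbors criterion can be sketched with a brute-force nearest-neighbor search. For each point the nearest neighbor in dimension d is found, and the pair counts as false when the relative growth of their distance in dimension d+1 exceeds a threshold. The threshold of 2.0, the delay of 5 and the sinusoidal test signal are assumptions chosen to keep the example small.

```python
import math

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def false_neighbor_fraction(x, dim, delay, threshold=2.0):
    """Fraction of nearest neighbors in dimension dim that are false in dim + 1."""
    n = len(x) - dim * delay  # points that exist in both dim and dim + 1
    high = [[x[i + k * delay] for k in range(dim + 1)] for i in range(n)]
    low = [v[:dim] for v in high]
    n_false = n_used = 0
    for i in range(n):
        j_best, d_best = -1, float("inf")
        for j in range(n):
            if j != i:
                d = distance(low[i], low[j])
                if d < d_best:
                    j_best, d_best = j, d
        if j_best < 0 or d_best == 0.0:
            continue  # skip degenerate pairs
        n_used += 1
        ratio = (distance(high[i], high[j_best]) - d_best) / d_best
        if ratio > threshold:
            n_false += 1
    return n_false / n_used if n_used else 0.0

signal = [math.sin(0.3 * t) for t in range(300)]
fractions = {d: false_neighbor_fraction(signal, d, delay=5) for d in (1, 2, 3)}
```

A sinusoid traces a closed curve, so it is already unfolded in two dimensions: many neighbors are false for d = 1 and essentially none for d = 2.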

slide-39
SLIDE 39

Speech Attractors

[Figure: signal X(t) vs. time and reconstructed 3D attractor for four phonemes: /ao/ (DE=6, #1846), /iy/ (DE=5, #1068), /k/ (DE=6, #816), /s/ (DE=5, #829).]

[ Pitsikalis & Maragos, Speech Commun 2009 ]

slide-40
SLIDE 40

Correlation Dimension (Speech)

  • Correlation integral: $C(N, r) = \dfrac{1}{N (N - 1)} \sum_{i=1}^{N} \sum_{j \ne i} H\left( r - \| x_i - x_j \| \right)$
  • Correlation dimension: $D_C = \lim_{r \to 0} \lim_{N \to \infty} \dfrac{\log C(N, r)}{\log r}$

N: # of points, r: scale, x: set points, H: Heaviside function

slide-41
SLIDE 41

Correlation Dimension (Lorenz)

  • Correlation integral: $C(N, r) = \dfrac{1}{N (N - 1)} \sum_{i=1}^{N} \sum_{j \ne i} H\left( r - \| x_i - x_j \| \right)$
  • Correlation dimension: $D_C = \lim_{r \to 0} \lim_{N \to \infty} \dfrac{\log C(N, r)}{\log r}$

[Figure: correlation integral vs. scale (log-log axes) and its local slope vs. scale (slope axis 1.8–2.2) for the Lorenz system, σ=5, R=15, B=1, ∆t=0.25, #4000.]
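The correlation integral and a slope-based estimate of the correlation dimension can be sketched as follows. The sketch sorts all pairwise distances once and counts pairs closer than r by bisection; the limit in the definition is replaced by a least-squares slope over a small, hand-picked range of radii. The test set (random points on a unit circle, expected dimension near 1) and the radii are assumptions of this example.

```python
import bisect
import math
import random

def pairwise_distances(points):
    """Sorted Euclidean distances over all unordered pairs of points."""
    dists = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dists.append(math.sqrt(sum((a - b) ** 2
                                       for a, b in zip(points[i], points[j]))))
    dists.sort()
    return dists

def correlation_integral(sorted_dists, n_points, r):
    """C(N, r): fraction of ordered pairs (i != j) with distance below r."""
    count = bisect.bisect_left(sorted_dists, r)
    return 2.0 * count / (n_points * (n_points - 1))

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def correlation_dimension(points, radii):
    """Slope of log C(N, r) against log r over the given radii."""
    dists = pairwise_distances(points)
    logs_c = [math.log(correlation_integral(dists, len(points), r)) for r in radii]
    return fit_slope([math.log(r) for r in radii], logs_c)

rng = random.Random(1)
angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(400)]
circle = [(math.cos(a), math.sin(a)) for a in angles]
d_circle = correlation_dimension(circle, radii=[0.05, 0.1, 0.2, 0.4])
```

In practice the radii must lie in a scaling region where the local slope is roughly constant, which is what the local-slope plot on this slide is used to check.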

slide-42
SLIDE 42

Correlation Integrals of Speech Sounds

[Figure: four log-log plots of correlation integral vs. scale, each comparing plain averaging with weighted averaging: averaging over 8 phonemes of type /ao/, 15 of type /iy/, 13 of type /s/ and 11 of type /k/.]

slide-43
SLIDE 43

Fractal Features

[Diagram: the speech signal is mapped to an N-d embedding. For noisy speech, the noisy embedding passes through Geometrical Filtering (Local SVD, Neighborhood Distance Reduction, Projection) to give a filtered (cleaned) embedding and Enhanced Speech. Two feature sets are extracted: Filtered Dynamics - Correlation Dimension (FDCD, 8) and Multiscale Fractal Dimension (MFD, 6). Insets: density of the median neighborhood distance for filtered vs. noisy embeddings, and the projected signals Mproj-noisy, Mproj-clean, Mproj-filt.]

[ Pitsikalis & Maragos, IEEE SPL 2006 ]

slide-44
SLIDE 44

Noisy Speech Database: Aurora 2

  • Task: Speaker Independent Recognition of Digit Sequences
  • TI-Digits at 8 kHz
  • Training (8440 utterances per scenario, 55M/55F)
      – Clean (8 kHz, G712)
      – Multi-Condition (8 kHz, G712)
          4 noises (artificial): subway, babble, car, exhibition
          5 SNRs: 5, 10, 15, 20 dB, clean
  • Testing, artificially added noise
      – 7 SNRs: [-5, 0, 5, 10, 15, 20 dB, clean]
      – A: noises as in multi-condition training, G712 (28028 utterances)
      – B: restaurant, street, airport, train station, G712 (28028 utterances)
      – C: subway, street (MIRS) (14014 utterances)

slide-45
SLIDE 45

Average Recognition Results on Aurora 2: plain CD vs FDCD

[Figure: bar chart of word accuracy (%) vs. SNR (20, 15, 10, 5, 0 dB and average) for +Plain CD and +FDCD.]

Plain CD: Correlation Dimension without Dynamical Filtering

slide-46
SLIDE 46

Average Recognition Results on Aurora 2: FDCD

[Figure: bar chart of word accuracy (%) vs. SNR (clean, 20, 15, 10, 5, 0 dB and average) for Baseline and +FDCD.]

Up to +40%

slide-47
SLIDE 47

Average Recognition Results on Aurora 2: MFD

[Figure: bar chart of word accuracy vs. SNR (clean, 20, 15, 10, 5, 0 dB and average) for Baseline and MFD.]

Up to +27%

slide-48
SLIDE 48

Average Recognition Results on Aurora 2

[Figure: Plain Fractal Features (Aurora 2): bar chart of accuracy vs. SNR (clean, 20, 15, 10, 5 dB and average) for CD, FDCD and MFD.]

slide-49
SLIDE 49

Average Recognition Results on Aurora 2: Hybrid Features: Fractals and Modulations

[Figure: bar chart of accuracy (30 to 100) vs. SNR (20, 10, 5 dB) for Baseline, +FMP, +FDCD and +FMP+FDCD.]

Up to +61%

[ Pitsikalis & Maragos, IEEE SPL 2006; Speech Commun 2009 ]

slide-50
SLIDE 50

Lyapunov Exponents (L.E.s)

k-th Lyapunov Number: $L_k = \lim_{n \to \infty} \left( r_k^n \right)^{1/n}$

k-th Lyapunov Exponent: $\lambda_k = \ln(L_k)$

$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_k \ge \lambda_{k+1} \ge \dots \ge \lambda_{D_e}$

slide-51
SLIDE 51

Lyapunov Exponents (II)

  • Quantify signal predictability (orbits convergence-divergence rates in phase space)
  • Positive L.E. ⇒ exponential divergence; negative L.E. ⇒ exponential convergence
  • Dissipative system ⇒ sum of L.E.s < 0; chaotic system ⇒ at least one L.E. > 0
  • Invariants of system dynamics ⇒ useful for characterization/recognition purposes
  • Determine prediction horizon (upper bound of system predictability)

slide-52
SLIDE 52

Prediction on Reconstructed Attractor

(Kokkinos & Maragos, T-SAP 2005)

Goal: capture dynamics of MIMO system from input-output pairs: the true map $X_{n+1} = \mathbf{f}(X_n)$ is approximated by a learned model $X_{n+1} = F(X_n)$.

Models tested:

  • Local Polynomials
  • Global Polynomials
  • Radial Basis Function networks
  • Takagi-Sugeno-Kang models
  • Support Vector Machines

slide-53
SLIDE 53

Computation of Lyapunov Exponents

  • Consider an orbit $X_{n+1} = \mathbf{f}(X_n),\ n = 1, 2, \dots, N$
  • Oseledec matrix: $OSL = \lim_{N \to \infty} \left[ J_F^T(X_1) \cdots J_F^T(X_N) \, J_F(X_N) \cdots J_F(X_1) \right]^{1/(2N)}$
  • i-th L.E.: $\lambda_i = \log(s_i)$, where $s_i$ is the i-th eigenvalue of OSL
  • Limitations:
      – Only an approximation of the Jacobian J of f is available (F is an approximation to f)
      – Ill-conditioned nature of OSL ⇒ recursive QR decomposition technique
      – Limited data set ⇒ local L.E.s
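The recursive QR technique can be sketched on a map whose Jacobian is known in closed form. The example uses the Hénon map with its classical parameters a = 1.4, b = 0.3 as a stand-in system; this is an assumption of the illustration, since the slides estimate Jacobians from a model F fitted to speech data. At every step the Jacobian is multiplied into the current orthonormal frame, the product is re-orthogonalized by Gram-Schmidt, and the logarithms of the diagonal of R are accumulated.

```python
import math

def henon(state, a=1.4, b=0.3):
    """One iteration of the Henon map."""
    x, y = state
    return (1.0 - a * x * x + y, b * x)

def henon_jacobian(state, a=1.4, b=0.3):
    """Jacobian of the Henon map at the given state."""
    x, _ = state
    return ((-2.0 * a * x, 1.0), (b, 0.0))

def matmul2(p, q):
    """Product of two 2x2 matrices stored as tuples of rows."""
    return tuple(tuple(sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2))
                 for i in range(2))

def qr2(m):
    """Gram-Schmidt QR of a 2x2 matrix; returns Q and the diagonal of R."""
    c1 = (m[0][0], m[1][0])
    c2 = (m[0][1], m[1][1])
    r11 = math.hypot(c1[0], c1[1])
    q1 = (c1[0] / r11, c1[1] / r11)
    r12 = q1[0] * c2[0] + q1[1] * c2[1]
    v = (c2[0] - r12 * q1[0], c2[1] - r12 * q1[1])
    r22 = math.hypot(v[0], v[1])
    q2 = (v[0] / r22, v[1] / r22)
    return ((q1[0], q2[0]), (q1[1], q2[1])), (r11, r22)

def lyapunov_exponents(n_steps=5000, burn_in=100):
    """Average log of the R diagonal along the orbit (natural log, per iteration)."""
    state = (0.0, 0.0)
    for _ in range(burn_in):  # discard the transient
        state = henon(state)
    q = ((1.0, 0.0), (0.0, 1.0))
    sums = [0.0, 0.0]
    for _ in range(n_steps):
        q, diag = qr2(matmul2(henon_jacobian(state), q))
        sums[0] += math.log(diag[0])
        sums[1] += math.log(diag[1])
        state = henon(state)
    return [s / n_steps for s in sums]

lambda_1, lambda_2 = lyapunov_exponents()
```

A built-in sanity check is that the exponents sum to the average of log|det J|, which for this map equals log(b) at every point, so the sum is negative (dissipative) while the largest exponent is positive (chaotic).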

slide-54
SLIDE 54

Validation of Lyapunov Exponents

  • Inverse time sequencing of data
  • True exponents flip sign (divergence of nearby orbits becomes convergence & vice versa)
  • False exponents remain negative (artifact of embedding process, no dependence on system dynamics)
  • Models that learn the data (and not the system dynamics) fail to give such results.
      – RBF nets, TSK-0, Global Polynomials ... failed
      – SVM, TSK-1 succeeded
slide-55
SLIDE 55

Applications to Speech Signals

(Kokkinos & Maragos 2005)

  • Prediction – coding with global polynomials (smaller MSE than LPC with same # of params)
  • Speech analysis using Lyapunov exponents
      – Vowels have small positive L.E.s
      – Voiced fricatives have bigger positive L.E.s
      – Unvoiced fricatives have no validated L.E.s (too noisy)
      – Stop sounds have no validated L.E.s (non-stationary)
  • Non-validated L.E.s are still useful
slide-56
SLIDE 56

Speech Results: validated L.E.s

  • Phoneme: /aa/
  • Phoneme: /v/

[Figure: for each of /aa/ and /v/: original 1D signal vs. sample index; reconstructed attractor in coordinates X1, X2, X3; Lyapunov exponents (Direct L.E. and −Inverse L.E.) vs. # of Jacobians used.]

slide-57
SLIDE 57

Speech Results: Non-validated L.E.s

  • Phoneme: /sh/
  • Phoneme: /t/

[Figure: for each of /sh/ and /t/: original 1D signal vs. sample index; reconstructed attractor in coordinates X1, X2, X3; Lyapunov exponents (Direct L.E. and −Inverse L.E.) vs. # of Jacobians used.]

slide-58
SLIDE 58

Speech Lyapunov Exponents

[ Kokkinos & Maragos, IEEE T-SAP 2005 ]

slide-59
SLIDE 59

Speech Sound Classification

  • Using only L.E.s (PCA projection of 3 first L.E.s), pairwise scatter plots:
      – x: Unvoiced Fric., o: Unvoiced Stop
      – x: Unvoiced Fric., o: Voiced Fric.
      – x: Vowel, o: Voiced Stop
      – x: Vowel, o: Unvoiced Fric.
      – >: Unvoiced Stop, o: Voiced Stop
      – x: Vowel, o: Unvoiced Stop
  • When combined with MFCC (4 classes): ~12% smaller error using K-NN classifier

slide-60
SLIDE 60

Other Works on Speech Fractals or Chaotic Dynamics

  • C. A. Pickover and A. Khorasani, “Fractal Characterization of Speech Waveform Graphs,” Computer Graphics 1986.
  • P. J. B. Jackson and C. H. Shadle, “Frication noise modulated by voicing, as revealed by pitch-scaled decomposition”, J. Acoust. Soc. Amer. 2000.
  • S. McLaughlin and P. Maragos, “Nonlinear Methods for Speech Analysis and Synthesis”, in Advances in Nonlinear Signal and Image Processing, edited by S. Marshall and G. L. Sicuranza, EURASIP Book Series on Signal Processing and Communications, Hindawi Publ. Corp., 2006, pp. 103-140.
  • M. Zaki, J. N. Shah and H. A. Patil, “Effectiveness of Multiscale Fractal Dimension-based Phonetic Segmentation in Speech Synthesis for Low Resource Language”, in Proc. Int’l Conf. on Asian Language Processing (IALP) 2014.
  • K. López-de-Ipina, J. Solé-Casals, H. Eguiraun, J. B. Alonso, C. M. Travieso, A. Ezeiza, N. Barroso, M. Ecay-Torres, P. Martinez-Lage, Blanca Beitia, “Feature selection for spontaneous speech analysis to aid in Alzheimer’s disease diagnosis: A fractal dimension approach”, Computer Speech & Language 2015.
  • E. Tzinis, G. Paraskevopoulos, C. Baziotis, A. Potamianos, “Integrating Recurrence Dynamics for Speech Emotion Recognition”, in Proc. Interspeech 2018.

slide-61
SLIDE 61

Fractals and Music

Ref:

  • A. Zlatintsi and P. Maragos, “Multiscale Fractal Analysis of Musical Instrument Signals with Application to Recognition”, IEEE Transactions on Audio, Speech and Language Processing, Apr. 2013.

slide-62
SLIDE 62

Multiscale Fractal Dimension of Music Sounds

[Figure: LOG AREA vs. LOG SCALE for Bass, Bassoon, Bb Clarinet, Cello, Flute, Horn and Tuba.]

[ Zlatintsi & Maragos, T‐ASLP 2013 ]

slide-63
SLIDE 63

Morphological Covering Method

$D = \lim_{s \to 0} \dfrac{\log[A_B(s) / s^2]}{\log(1/s)}$

[Figure: Double Bass steady state (solid line) and its multiscale flat dilations and erosions at scales s = 25, 75, where B is a 3-sample symmetric horizontal segment with zero height.]

  • P. Maragos and A. Potamianos, J. Acoust. Soc. Amer., 1999.

Multiscale Fractal Analysis of Musical Instrument Signals

[Figure: $\log[A_B(s)]$ vs. $\log(s)$ for the seven analyzed instruments (Bass, Bassoon, Bb Clarinet, Cello, Flute, Horn, Tuba) for the note C3, except Bb Clarinet and Flute, which are shown for C5. Note the difference in slope at larger scales (30 ms analysis window).]

slide-64
SLIDE 64


MFD Analysis for Steady State of the Note

Upright Bass Clarinet Cello Flute Clarinet Oboe Piano

Mean MFD (middle line) and standard deviation (error bars) (for 30 ms analysis window, updated every 15 ms).

slide-65
SLIDE 65


MFD Analysis for Steady State of the Note

D = 1.42 D = 1.35 D = 1.65 D = 1.46 D = 1.37 D = 1.89

Horn Tuba Bassoon Bb Clarinet Flute Bb Clarinet Flute

slide-66
SLIDE 66


MFD Analysis on Synthesized Signals

  • Strong dependence on the frequency

f=5Hz f=50Hz f=100Hz f=200Hz f=300Hz f=400Hz f=500Hz f=600Hz

slide-67
SLIDE 67

Analysis of MFD during the Attack

  • MFDs estimated for the attacks of the 7 analyzed instruments, averaged over the whole range (using 30 ms analysis windows).
  • Similar as for the steady state
  • Higher D for small scales s and more fragmentation.
  • Increased value of D(s = 1).
  • Clear distinction of D among some of the analyzed instruments.

Mean MFD and standard deviation of the attack and steady state of A3 for Cello (left images) and F4 for Flute (right images).

slide-68
SLIDE 68

MFD Variability of the Steady State for the Same Instrument over One Octave

  • Dependence on the acoustical frequency: the MFD profile increases rapidly for higher-frequency sounds.
  • The instrument-specific MFD retains the shape observed for the specific octave.

MFD of Bb Clarinet steady-state notes over one octave, for one 30 ms analysis window.

slide-69
SLIDE 69

Experimental Evaluation

  • Double Bass, Bassoon & Tuba best recognized
  • Low discriminability between Bb Clarinet & Flute
  • Enhanced discriminability for Bassoon, Bb Clarinet and Horn
  • Decreased for Cello & Flute
  • On average MFD+MFCC features improve the recognition over the baseline

Mean Accuracy

HMM N=5 M=3

Example of the 13 logarithmically sampled points of the MFD, for Bb Clarinet (A3), forming the MFDLG feature vector.

slide-70
SLIDE 70

Conclusions

  • Existence-Importance of nonlinear speech structure of turbulence type (fractals, chaotic dynamics)
  • Speech technology systems can benefit from including such nonlinear features
  • Find/extract robust nonlinear features of turbulence type
  • Improve computational algorithms
  • Fuse nonlinear with linear features
  • Applications also to other sound signals, e.g. music

For more information, demos, and current results: http://cvsp.cs.ntua.gr and http://robotics.ntua.gr