Harmonic Structure Transform for Speaker Recognition


SLIDE 1

Prolegomena Baseline Filterbank Design Score-Level Fusion Generalization Conclusions

Harmonic Structure Transform for Speaker Recognition

Kornel Laskowski & Qin Jin

Carnegie Mellon University, Pittsburgh PA, USA
KTH Speech Music & Hearing, Stockholm, Sweden

29 August, 2011

Laskowski & Jin INTERSPEECH 2011, Firenze, Italy 1/21

SLIDE 3

Spectral Transforms in General

Given x ≡ the energy spectrum of a speech frame,

$$ y = F^{-1} \log\left( M^{T} x \right) - \text{normalization term} $$

The matrix M is a filterbank. [Figure: example columns of M]

M defines the number of filters, and their central frequencies, widths, and general shapes.

Importantly here, the filters of all such filterbanks integrate energy across frequencies related by adjacency.
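A minimal sketch of such a transform may help make the roles of M and F^{-1} concrete. It assumes triangular filters over adjacent FFT bins and an unnormalized DCT-II standing in for the decorrelating transform; the function names, the filter shape, and the energy floor are illustrative assumptions, and the normalization term is left out.

```python
import math

def triangular_filterbank(n_bins, centers, half_width):
    """Build M as a list of columns; each column weights a run of adjacent bins."""
    columns = []
    for c in centers:
        col = [max(0.0, 1.0 - abs(k - c) / half_width) for k in range(n_bins)]
        columns.append(col)
    return columns

def dct_ii(values):
    """Unnormalized DCT-II, used here as a stand-in for F^{-1} (an assumption)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (j + 0.5) / n) for j, v in enumerate(values))
        for k in range(n)
    ]

def spectral_transform(x, columns, floor=1e-10):
    """Compute F^{-1} log(M^T x); the normalization term is omitted in this sketch."""
    # M^T x: one integrated energy per filter
    energies = [sum(w * e for w, e in zip(col, x)) for col in columns]
    # Floor the energies so the logarithm is always defined
    log_energies = [math.log(max(e, floor)) for e in energies]
    return dct_ii(log_energies)
```

For a flat spectrum every filter collects the same energy, so only the zeroth output coefficient is non-zero.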

SLIDE 4

The Harmonic Structure Transform (HST)

In contrast, the HST is implemented by a matrix H. [Figure: example columns of H]

Each filter integrates energy across frequencies related by harmonicity (not adjacency).
  this is novel (Laskowski & Jin, 2010) for speaker recognition
  related to (Liénard, Barras & Signol, 2008) for pitch detection
  unknown: number of filters, and their fundamental frequencies, tooth widths, and individual tooth shapes
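One column of H can be sketched as a comb whose teeth sit at integer multiples of a fundamental frequency. The triangular tooth shape follows the deck's later description; the function name, the bin spacing, and the tooth half-width parameter are assumptions made for illustration.

```python
def comb_filter(f0_hz, n_bins, bin_hz, tooth_half_width_hz):
    """Return one comb-shaped column: triangular teeth at multiples of f0_hz."""
    column = []
    for k in range(n_bins):
        freq = k * bin_hz
        # Nearest harmonic of f0 (harmonic number at least 1)
        harmonic = max(1, round(freq / f0_hz))
        dist = abs(freq - harmonic * f0_hz)
        # Triangular tooth: weight 1 at the harmonic, falling to 0 at the half-width
        column.append(max(0.0, 1.0 - dist / tooth_half_width_hz))
    return column
```

With a 100 Hz fundamental and 10 Hz bins, the bins at 100, 200, and 300 Hz get full weight, while bins midway between harmonics get none.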

SLIDE 5

Outline of this Talk

1 Baseline Performance
  What is known?
2 Experiments in HSCC Filterbank Design
  linear spacing in fundamental frequency
  piecewise linear spacing in fundamental frequency
  logarithmic spacing in fundamental frequency
  fundamental frequency range and density
3 Score-level Fusion with Standard MFCCs
4 Generalization
5 Conclusions

SLIDE 6

HST Processing

[Figure: frame x_t → FFT; idealized FFT (comb filter h); filter output for f_h[i − 1], f_h[i], f_h[i + 1], as a function of i]

analysis every 8 ms
frames 32 ms wide
comb filter teeth triangular (global width parameter)
400 filters, linearly spanning from 50 Hz to 450 Hz
logarithm at each filter output, then normalization
decorrelation using LDA
yields harmonic structure cepstral coefficients (HSCCs)
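The framing policy (a 32 ms frame every 8 ms) can be sketched as follows. The sampling rate is not given in the deck, so it is a parameter here, and the function name is hypothetical.

```python
def frame_signal(samples, sample_rate_hz, frame_ms=32.0, hop_ms=8.0):
    """Split a signal into overlapping frames of frame_ms, advancing by hop_ms."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop_len = int(round(sample_rate_hz * hop_ms / 1000.0))
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames
```

At an assumed 16 kHz sampling rate, each frame holds 512 samples and successive frames start 128 samples apart.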

SLIDE 7

HSCC Modeling for Classification

As simple as possible.

one GMM per speaker
  1 assume one Gaussian element
  2 determine optimal number ND of LDA dimensions
  3 hold ND fixed
  4 determine optimal number NG of Gaussians

maximum likelihood closed-set classification (MAP under uniform prior)
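Maximum likelihood closed-set classification can be sketched with the simplest case named above, one Gaussian per speaker. The diagonal covariance, the variance floor, and all function names are assumptions for illustration; a full system would use NG mixture components per speaker.

```python
import math

def fit_diag_gaussian(frames):
    """Estimate per-dimension mean and variance from a speaker's training frames."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Variance floor (an assumption) avoids division by zero
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6) for d in range(dim)]
    return mean, var

def log_likelihood(frames, model):
    """Sum of per-frame log densities under a diagonal Gaussian."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total

def classify(frames, models):
    """Pick the speaker whose model gives the trial the highest likelihood."""
    return max(models, key=lambda spk: log_likelihood(frames, models[spk]))
```

Under a uniform prior over speakers, choosing the highest likelihood is the same decision as choosing the highest posterior, which is why the slide equates the two.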

SLIDE 8

Available Results (Laskowski & Jin, ODYSSEY 2010)

Wall Street Journal data, mostly read speech
100-way closed-set classification, per gender
≈1500 10-second trials, per gender and dataset
matched channel and matched multi-session conditions

Accuracy (%):

            Female, ♀         Male, ♂
System      Dev     Test      Dev     Test
F0          17.6    18.4      26.2    27.4
HST/LDA     99.7    99.9      99.7    99.7
MEL/DCT     98.7    99.3      99.3    98.6
MEL/LDA     98.7    99.3      99.3    98.9

SLIDE 9

Session Mismatch

MIXER5 data, various speaking styles
66-way closed-set classification
≈3000 10-second trials, per dataset
matched channel and matched session: accuracies of 100%
matched channel but mismatched session:

System      Dev     Test
F0          14.1    16.2
HST/LDA     59.8    68.1
MEL/DCT     74.4    84.4
MEL/LDA     81.5    87.8
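Comparing systems by error count rather than accuracy makes the session-mismatch gap easier to read. The sketch below converts the accuracies reported above into error rates and takes their ratio; the helper names are hypothetical.

```python
def error_rate(accuracy_pct):
    """Error rate in percent for a given accuracy in percent."""
    return 100.0 - accuracy_pct

def error_ratio(accuracy_a_pct, accuracy_b_pct):
    """How many times more errors system A makes than system B."""
    return error_rate(accuracy_a_pct) / error_rate(accuracy_b_pct)

# HST/LDA versus MEL/LDA under session mismatch
dev_ratio = error_ratio(59.8, 81.5)    # about 2.17
test_ratio = error_ratio(68.1, 87.8)   # about 2.61
```

On these figures the baseline HST system makes roughly two to two and a half times as many errors as MEL/LDA.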

SLIDE 22

Linear Spacing of Fundamental Frequencies

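[Figure: accuracies of linearly spaced filterbanks with 200 or 400 filters over fundamental-frequency ranges between 50 Hz and 850 Hz (axis ticks at 50, 150, 250, 350, 450, 550, 650, 750, and 850 Hz); the final build adds a "?" to the figure. Values as extracted, first series: 51.6 56.7 49.6 59.8 48.2 56.7 48.5 63.9 52.5 60.4 59.4 67.7 54.5 64.7 65.0 66.5 56.7; second series: 34.4 26.8 27.9 27.2 38.0 26.5 28.5 28.9 42.2 28.4 30.3 37.1 42.4 26.8 33.3 41.6 42.0]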

SLIDE 27

Piecewise Linear Spacing of Fundamental Frequencies

[Figure: accuracies of piecewise linearly spaced filterbanks, with segment sizes labeled 400, 400+200, 400+200+200, 400+200+200+200, and 400+200+200+200+200, over fundamental-frequency ranges between 50 Hz and 850 Hz (axis ticks at 50, 150, 250, 350, 450, 550, 650, 750, and 850 Hz). Values as extracted, first series: 59.8 62.5 63.9 66.1 67.7 65.1 66.8 66.6 66.3 66.5 67.1 66.3 66.5 65.5; second series: 38.0 40.2 42.2 42.8 42.4 44.0 43.5 44.1 45.4 42.0 42.3 43.4 44.1 45.3]

SLIDE 31

Logarithmic Spacing of Fundamental Frequencies

In the linear case, the N_h fundamental frequencies f_h between f_h^min and f_h^max are:

$$ f_h[i] = f_h^{\min} + \frac{i-1}{N_h-1} \left( f_h^{\max} - f_h^{\min} \right), \qquad 1 \le i \le N_h $$

[Figure: fundamental frequency f_h (Hz, ticks at 50, 150, 250, 450, 850) versus filter index i (ticks at 200, 400, 600, 800, 1000) for the Baseline, Linear, PW Linear, and Logarithmic filterbanks. Accuracy values as extracted: 34.4 38.0 45.4 45.3 59.8 56.7 66.3 66.0]

In the logarithmic case:

$$ f_h[i] = f_h^{\min} \left( \frac{f_h^{\max}}{f_h^{\min}} \right)^{\frac{i-1}{N_h-1}} $$
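The two spacing rules can be compared directly in code. This is a small sketch of the linear and logarithmic formulas for f_h[i], with i running from 1 to N_h; the function names are hypothetical and the example ranges are chosen only so the outputs are easy to check by hand.

```python
def linear_spacing(f_min, f_max, n):
    """f_h[i] = f_min + (i - 1) / (n - 1) * (f_max - f_min), for 1 <= i <= n (n >= 2)."""
    return [f_min + (i - 1) / (n - 1) * (f_max - f_min) for i in range(1, n + 1)]

def log_spacing(f_min, f_max, n):
    """f_h[i] = f_min * (f_max / f_min) ** ((i - 1) / (n - 1)), for 1 <= i <= n (n >= 2)."""
    return [f_min * (f_max / f_min) ** ((i - 1) / (n - 1)) for i in range(1, n + 1)]

linear = linear_spacing(50.0, 450.0, 5)     # steps of 100 Hz
logarithmic = log_spacing(50.0, 800.0, 5)   # each step doubles the frequency
```

Both rules share the same endpoints, but the logarithmic rule places a larger share of its filters at low fundamental frequencies.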

SLIDE 35

Range and Density of Fundamental Frequencies

Three manipulations:
  fmin: raise to 62.5 Hz
  fmax: raise to maximize accuracy (to 4000 Hz)
  Nh: lower to maximize accuracy (to 1129)

[Figure: fundamental frequency f_h (Hz, ticks at 50, 850, 4000) versus filter index i (ticks at 921, 1000, 1129, 1468) for the Logarithmic, Raise LB, Raise UB, and Lower N filterbanks, alongside Baseline, Linear, and PW Linear. Accuracy values as extracted: 38.0 59.8 34.4 45.4 45.3 56.7 66.3 66.0 64.7 70.2 70.9 51.2 50.6 43.4]

SLIDE 36

Interpolation with Standard MFCCs

linear interpolation with DCT-decorrelated log-Mel energies
  20 coefficients
  25 coefficients (always worse)

[Figure: accuracy (ticks from 70 to 82) versus interpolation parameter (ticks from 0.2 to 1) for MEL/DCT 20 and MEL/DCT 25. Annotated accuracies: HSCC 70.9, MEL/DCT 74.4, MEL/DCT + HSCC 80.8; differences of 3.5%abs and 6.4%abs]
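Score-level fusion by linear interpolation can be sketched as below. The deck does not say which system the interpolation parameter weights, so the convention here (alpha on the Mel system, 1 − alpha on the HSCC system) is an assumption, as are the function names and the use of per-speaker log-likelihoods as scores.

```python
def fuse_scores(mel_scores, hscc_scores, alpha):
    """Interpolate per-speaker scores: alpha * Mel + (1 - alpha) * HSCC."""
    return {
        spk: alpha * mel_scores[spk] + (1.0 - alpha) * hscc_scores[spk]
        for spk in mel_scores
    }

def classify_fused(mel_scores, hscc_scores, alpha):
    """Closed-set decision on the fused scores."""
    fused = fuse_scores(mel_scores, hscc_scores, alpha)
    return max(fused, key=fused.get)
```

With alpha = 1 the decision is the Mel system's alone; intermediate values let the HSCC scores overturn a narrow Mel decision.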

SLIDE 37

Interpolation with LDA-Rotated MFCCs

linear interpolation with LDA-decorrelated log-Mel energies
  20 coefficients (better in combination)
  25 coefficients (better alone)

[Figure: accuracy (ticks from 70 to 90) versus interpolation parameter (ticks from 0.2 to 1) for MEL/LDA 20 and MEL/LDA 25. Annotated accuracies: HSCC 70.9, MEL/LDA 81.5, MEL/LDA + HSCC 84.8; differences of 10.6%abs and 3.3%abs]

SLIDE 38

Summary of Accuracy on DevSet

[Figure: summary of DevSet accuracy. HST filterbank variants (Baseline, Linear, PW Linear, Logarithmic, Raise LB, Raise UB, Lower N), values as extracted: 38.0 59.8 34.4 45.4 45.3 56.7 66.3 66.0 64.7 70.2 70.9 51.2 50.6 43.4. Fusion: MEL/DCT 74.4, MEL/DCT + HSCC 80.8; MEL/LDA 81.5, MEL/LDA + HSCC 84.8]

SLIDE 41

Accuracy on EvalSet

[Figure: DevSet and EvalSet accuracy for the same systems. EvalSet values as extracted for the HST filterbank variants (Baseline, Linear, PW Linear, Logarithmic, Raise LB, Raise UB, Lower N): 44.3 50.8 50.8 49.6 57.8 59.2 68.1 72.5 72.7 72.1 78.1 79.8 41.0 ??.?. HST filterbank optimization: 10.7%abs, 33.5%rel. EvalSet fusion: MEL/DCT 84.8, MEL/DCT + HSCC 89.0 (4.2%abs, 27.6%rel); MEL/LDA 84.3, MEL/LDA + HSCC 89.3 (5.0%abs, 31.8%rel)]

SLIDE 42

Conclusions

1 evaluated the baseline transform in session mismatch
  viable: twice as many errors as an equivalent MFCC system
2 errors can be reduced by a third (33.5%rel) by optimizing:
  the number of filters in the filterbank
  the fundamental frequency corresponding to each filter
3 logarithmic spacing of fundamental frequencies is better than linear spacing
  more filters for low fundamental frequencies
  fewer filters for high fundamental frequencies
4 in an equivalent MFCC system, errors can be reduced by almost a third (27.6-31.8%rel) via score-level fusion with the improved HST system
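The "%rel" figures are relative reductions in error rate, which can be checked from accuracies. The sketch below uses EvalSet accuracies read from the summary figure (84.8 → 89.0 and 84.3 → 89.3); pairing those values with the Mel systems before and after fusion is an assumption based on the reported absolute gains, and the function name is hypothetical.

```python
def relative_error_reduction(acc_before_pct, acc_after_pct):
    """Percentage of the original errors removed when accuracy improves."""
    err_before = 100.0 - acc_before_pct
    err_after = 100.0 - acc_after_pct
    return 100.0 * (err_before - err_after) / err_before

first = relative_error_reduction(84.8, 89.0)    # about 27.6
second = relative_error_reduction(84.3, 89.3)   # about 31.8
```

An absolute gain of 4 to 5 points looks small, but against a baseline error rate of about 15% it removes close to a third of the errors.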

SLIDE 43

Future Directions

1 change framing policy from 8 ms/32 ms to something longer
  intonation and voice quality are supra-segmental
  larger temporal support → greater spectral resolution
2 optimize the tooth shape of comb filters
3 find a data-independent decorrelation transform
  leading to a compact (< 25 coefficients) representation
4 explore adaptation from a universal background model
5 generalize to binary speaker verification (and NIST SREs)
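The link between temporal support and spectral resolution in the first item follows from the spacing of DFT bins, which is the reciprocal of the frame duration when no zero-padding is used. A one-line sketch, with a hypothetical function name:

```python
def bin_spacing_hz(frame_ms):
    """Spacing between DFT bins for a frame of the given duration, without zero-padding."""
    return 1000.0 / frame_ms

current = bin_spacing_hz(32.0)   # 31.25 Hz for the 32 ms frames used here
doubled = bin_spacing_hz(64.0)   # 15.625 Hz if the frame were twice as long
```

Doubling the frame length halves the bin spacing, so closely spaced harmonics of a low fundamental are easier to separate.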

SLIDE 44

Potential Impact

1 a new general representation of the spectrum
2 deliberately orthogonal to spectral envelope features (MFCCs, LPCCs, etc.)
  but computed in an identical manner
3 likely beneficial not only for speaker recognition, but also:
  online speaker diarization
  classification of “emotional speech”
  clinical voice quality assessment

SLIDE 45

Some Interesting Insights ...

1 currently in HST, spectral energy < 300 Hz is zeroed out
  this improves closed-set speaker classification
  but the fundamental (zeroth harmonic) is ignored
  the fundamental is thought to play a role in emotional expression
2 that optimal filter spacing is logarithmic is curious
  independent of the logarithmic tonotopicity of the basilar membrane
  greater acuity in discriminating among harmonic sounds (not just pure tones)

SLIDE 46

THANK YOU
