Harmonic Structure Transform for Speaker Recognition
Kornel Laskowski & Qin Jin
Carnegie Mellon University, Pittsburgh PA, USA
KTH Speech Music & Hearing, Stockholm, Sweden
29 August, 2011
INTERSPEECH 2011, Firenze, Italy

Sections: Prolegomena, Baseline, Filterbank Design, Score-Level Fusion, Generalization, Conclusions

Spectral Transforms in General

Given x ≡ the energy spectrum of a speech frame,

$$y = F^{-1}\left( \log\left( M^{T} x \right) - \langle\text{normalization term}\rangle \right)$$

The matrix M is a filterbank. [Figure: columns of M]

M defines the number of filters, and their central frequencies, widths, and general shapes.

Importantly here, the filters of all such filterbanks integrate energy across frequencies related by adjacency.
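The general form above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the triangular, linearly spaced filters, the use of an orthonormal DCT-II as a stand-in for F^{-1}, and the choice of the mean log energy as the normalization term are all assumed here, and the function names are hypothetical.

```python
import numpy as np

def triangular_filterbank(n_bins, n_filters):
    """Return M (n_bins x n_filters); each column is a triangle over adjacent bins.
    Linear spacing of the centers is an assumption made for illustration."""
    centers = np.linspace(0, n_bins - 1, n_filters + 2)
    bins = np.arange(n_bins)
    M = np.zeros((n_bins, n_filters))
    for j in range(n_filters):
        left, mid, right = centers[j], centers[j + 1], centers[j + 2]
        rising = (bins - left) / (mid - left)
        falling = (right - bins) / (right - mid)
        M[:, j] = np.clip(np.minimum(rising, falling), 0.0, None)
    return M

def dct_matrix(n):
    """Orthonormal DCT-II matrix, an assumed stand-in for F^{-1}."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    D[0, :] /= np.sqrt(2.0)
    return D

def spectral_transform(x, M, eps=1e-12):
    """y = F^{-1}(log(M^T x) - normalization term) for one energy spectrum x."""
    log_energy = np.log(M.T @ x + eps)
    normalized = log_energy - log_energy.mean()  # assumed normalization term
    return dct_matrix(M.shape[1]) @ normalized
```

With this assumed normalization, scaling x by a constant gain leaves y unchanged, which is one common motivation for including such a term.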

The Harmonic Structure Transform (HST)

In contrast, the HST is implemented by a matrix H. [Figure: columns of H]

Each filter integrates energy across frequencies related by harmonicity (not adjacency).

- This is novel for speaker recognition (Laskowski & Jin, 2010).
- It is related to (Liénard, Barras & Signol, 2008) for pitch detection.
- Unknown: the number of filters, and their fundamental frequencies, tooth widths, and individual tooth shapes.
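To make the contrast with adjacency-based filters concrete, one column of H might be built as below. This is a minimal sketch, assuming triangular teeth of a single half-width (in Hz) centered on integer multiples of the filter's fundamental frequency; the function names and the frequency grid are hypothetical.

```python
import numpy as np

def comb_filter(f0, freqs, tooth_half_width):
    """One column of H: triangular teeth at integer multiples of f0 (in Hz).
    freqs holds the center frequency of each spectrum bin."""
    # Nearest harmonic number for every bin, never below the first harmonic.
    harmonic = np.maximum(np.round(freqs / f0), 1.0)
    distance = np.abs(freqs - harmonic * f0)
    # Weight falls linearly from 1 at the harmonic to 0 at tooth_half_width.
    return np.clip(1.0 - distance / tooth_half_width, 0.0, None)

def comb_filterbank(fundamentals, freqs, tooth_half_width):
    """Stack one comb filter per fundamental frequency into the columns of H."""
    columns = [comb_filter(f0, freqs, tooth_half_width) for f0 in fundamentals]
    return np.stack(columns, axis=1)
```

Each column is nonzero only near f0, 2·f0, 3·f0, and so on, so a product such as `H.T @ x` sums energy over harmonically related bins rather than over neighboring ones.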

Outline of this Talk

1. Baseline Performance
   - What is known?
2. Experiments in HSCC Filterbank Design
   - linear spacing in fundamental frequency
   - piecewise linear spacing in fundamental frequency
   - logarithmic spacing in fundamental frequency
   - fundamental frequency range and density
3. Score-level Fusion with Standard MFCCs
4. Generalization
5. Conclusions

HST Processing

- FFT analysis every 8 ms, with frames 32 ms wide
- comb filter teeth are triangular (global width parameter)
- 400 filters, linearly spanning from 50 Hz to 450 Hz
- logarithm at each filter output, then normalization
- decorrelation using LDA
- yields harmonic structure cepstral coefficients (HSCCs)

[Figure labels: frame FFT x_t; idealized comb filter h; fundamental frequencies f_h[i−1], f_h[i], f_h[i+1]; plotted as a function of i]
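The processing steps up to the normalization can be sketched as follows. The frame width, frame shift, filter count and frequency range come from the slide; everything else is an assumption made for illustration, including the 16 kHz sampling rate, the Hann window, the 20 Hz tooth half-width, inclusion of both range endpoints, and mean subtraction as the normalization. The LDA step is omitted because it needs labeled training data, and all function names are hypothetical.

```python
import numpy as np

def frame_signal(signal, sample_rate, width_s=0.032, shift_s=0.008):
    """Cut the signal into 32 ms frames taken every 8 ms."""
    width = int(round(width_s * sample_rate))
    shift = int(round(shift_s * sample_rate))
    n_frames = 1 + (len(signal) - width) // shift
    idx = np.arange(width)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]

def energy_spectra(frames):
    """Energy spectrum of each frame (Hann window is an assumption)."""
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1)) ** 2

def comb_filterbank(fundamentals, freqs, tooth_half_width):
    """Columns are comb filters with triangular teeth at multiples of each f0."""
    f0 = np.asarray(fundamentals)[None, :]
    f = freqs[:, None]
    harmonic = np.maximum(np.round(f / f0), 1.0)
    return np.clip(1.0 - np.abs(f - harmonic * f0) / tooth_half_width, 0.0, None)

def hst_log_outputs(signal, sample_rate=16000, n_filters=400,
                    f_lo=50.0, f_hi=450.0, tooth_half_width=20.0, eps=1e-12):
    """Normalized log comb-filter outputs, one row per frame (before LDA)."""
    frames = frame_signal(signal, sample_rate)
    X = energy_spectra(frames)                         # (n_frames, n_bins)
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sample_rate)
    fundamentals = np.linspace(f_lo, f_hi, n_filters)  # linear spacing
    H = comb_filterbank(fundamentals, freqs, tooth_half_width)
    log_out = np.log(X @ H + eps)                      # logarithm at each filter output
    return log_out - log_out.mean(axis=1, keepdims=True)  # assumed normalization
```

For a one-second signal at the assumed 16 kHz rate, this yields a matrix with one row per 8 ms step and 400 columns, to which a trained LDA projection would then be applied to obtain HSCCs.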

HSCC Modeling for Classification

As simple as possible.

- one GMM per speaker
- assume one Gaussian element
  1. determine optimal number N_D of LDA dimensions
  2. hold N_D fixed
  3. determine optimal number N_G of Gaussians
  4. maximum likelihood closed-set classification (MAP under uniform prior)
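The classification rule can be illustrated with the simplest case named on the slide, a single Gaussian per speaker (N_G = 1). The sketch below assumes diagonal covariances and frame-independent likelihoods, neither of which is stated in the source, and its function names are hypothetical; a full GMM would replace `fit_gaussian` and `log_likelihood` with mixture versions.

```python
import numpy as np

def fit_gaussian(features, var_floor=1e-6):
    """Fit one diagonal-covariance Gaussian to a speaker's training frames."""
    return features.mean(axis=0), features.var(axis=0) + var_floor

def log_likelihood(features, model):
    """Total log-likelihood of all frames in a trial under one speaker model."""
    mean, var = model
    per_dim = np.log(2.0 * np.pi * var) + (features - mean) ** 2 / var
    return (-0.5 * per_dim.sum(axis=1)).sum()

def classify(trial, models):
    """Closed-set decision: the speaker whose model maximizes the likelihood.
    Under a uniform prior over speakers this equals the MAP decision."""
    return max(models, key=lambda spk: log_likelihood(trial, models[spk]))
```

Here `models` is a dictionary from speaker label to fitted model, and `trial` is a frames-by-dimensions array of features for one test segment.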

Available Results (Laskowski & Jin, ODYSSEY 2010)

- Wall Street Journal data, mostly read speech
- 100-way closed-set classification, per gender
- ≈ 1500 10-second trials, per gender and dataset
- matched channel and matched multi-session conditions

Classification accuracy (%):

System     Female ♀ Dev   Female ♀ Test   Male ♂ Dev   Male ♂ Test
F0         17.6           18.4            26.2         27.4
HST/LDA    99.7           99.9            99.7         99.7
MEL/DCT    98.7           99.3            99.3         98.6
MEL/LDA    98.7           99.3            99.3         98.9

Session Mismatch

- MIXER5 data, various speaking styles
- 66-way closed-set classification
- ≈ 3000 10-second trials, per dataset
- matched channel and matched session: accuracies of 100%
- matched channel but mismatched session:

System     Dev    Test
F0         14.1   16.2
HST/LDA    59.8   68.1
MEL/DCT    74.4   84.4
MEL/LDA    81.5   87.8

Linear Spacing of Fundamental Frequencies

[Figure: filterbank configurations laid out along a fundamental-frequency axis marked from 50 Hz to 850 Hz in 100 Hz steps. Each configuration is annotated with its number of filters and two values, listed below in the order they appear. The frequency range of each configuration and the labels of the two values are not recoverable from this transcript.]

Filters   Value 1   Value 2
400       34.4      56.7
200       26.8      51.6
400       27.9      56.7
200       27.2      49.6
400       38.0      59.8
200       26.5      48.2
400       28.5      56.7
200       28.9      48.5
400       42.2      63.9
200       28.4      52.5
400       30.3      60.4
200       37.1      59.4
400       42.4      67.7
200       26.8      54.5
400       33.3      64.7
200       41.6      65.0
400       42.0      66.5
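The two design variables explored here, the fundamental-frequency range and the number of filters placed in it, can be expressed as a small helper. This is a hedged sketch: the function names are hypothetical, both range endpoints are assumed to be included, and the logarithmic variant is shown only as a contrast to linear spacing, using NumPy's `geomspace`.

```python
import numpy as np

def linear_fundamentals(f_lo, f_hi, n_filters):
    """Fundamental frequencies with a constant step in Hz."""
    return np.linspace(f_lo, f_hi, n_filters)

def log_fundamentals(f_lo, f_hi, n_filters):
    """Fundamental frequencies with a constant ratio between neighbors."""
    return np.geomspace(f_lo, f_hi, n_filters)

def density(fundamentals):
    """Average number of filters per Hz over the covered range."""
    return len(fundamentals) / (fundamentals[-1] - fundamentals[0])

# Same 50-450 Hz range at the two filter counts that appear in the figure.
for n in (200, 400):
    f = linear_fundamentals(50.0, 450.0, n)
    print(n, "filters:", round(density(f), 3), "filters/Hz")
```

Halving the filter count over a fixed range halves the density, while keeping the count and widening the range lowers the density in the same way, which is why range and density are treated together.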
