Language, Dialect, and Speaker Recognition Using Gaussian Mixture Models on the Cell Processor

Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner {nmalyska, smohindra, karen.lauro, reynolds, kepner}@ll.mit.edu


  1. Language, Dialect, and Speaker Recognition Using Gaussian Mixture Models on the Cell Processor
     Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner {nmalyska, smohindra, karen.lauro, reynolds, kepner}@ll.mit.edu
     MIT Lincoln Laboratory
     This work is sponsored by the United States Air Force under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

  2. Outline
     • Introduction
     • Recognition for speech applications using GMMs
     • Parallel implementation of the GMM
     • Performance model
     • Conclusions and future work

  3. Introduction: Automatic Recognition Systems
     • In this presentation, we will discuss technology that can be applied to different kinds of recognition systems
       – Language recognition: What language are they speaking?
       – Dialect recognition: What dialect are they using?
       – Speaker recognition: Who is the speaker?

  4. Introduction: The Scale Challenge
     • Speech processing problems are often described as one person interacting with a single computer system and receiving a response

  5. Introduction: The Scale Challenge
     • Real speech applications, however, often involve data from multiple talkers and use multiple networked multicore machines
       – Interactive voice response systems
       – Voice portals
       – Large corpus evaluations with hundreds of hours of data
     [Figure label: Information About Speaker, Dialect, or Language]

  6. Introduction: The Computational Challenge
     • Speech-processing algorithms are computationally expensive
     • Large amounts of data need to be available for these applications
       – Must cache required data efficiently so that it is quickly available
     • Algorithms must be parallelized to maximize throughput
       – Conventional approaches focus on parallel solutions over multiple networked computers
       – Existing packages are not optimized for the high-performance-per-watt multicore machines required in embedded systems with power, thermal, and size constraints
       – Many applications, including embedded systems, need highly responsive "real-time" behavior

  7. Outline
     • Introduction
     • Recognition for speech applications using GMMs
     • Parallel implementation of the GMM
     • Performance model
     • Conclusions and future work

  8. Recognition Systems: Summary
     • A modern language, dialect, or speaker recognition system is composed of two main stages
       – Front-end processing
       – Pattern recognition
     [Figure: Speech → Front End → Pattern Recognition → Decision on the identity, dialect, or language of speaker]
     • We will show how a speech signal is processed by modern recognition systems
       – Focus on a recognition technology called Gaussian mixture models

  9. Recognition Systems: Frame-Based Processing
     • The first step in modern speech systems is to convert incoming speech samples into frames
     • A typical frame rate for a speech stream is 100 frames per second
     [Figure: speech samples over time, divided into numbered speech frames]
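
The framing step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 100 frames-per-second rate comes from the slide, while the function name `split_into_frames`, the 8 kHz sample rate, and the 20 ms frame length are assumptions chosen only for the example.

```python
def split_into_frames(samples, sample_rate_hz, frame_rate_hz=100, frame_len_s=0.020):
    """Split a sequence of samples into (possibly overlapping) frames.

    frame_rate_hz=100 follows the slide. The 20 ms frame length is an
    illustrative assumption, not a value from the presentation.
    """
    hop = sample_rate_hz // frame_rate_hz            # samples between frame starts
    frame_len = int(round(frame_len_s * sample_rate_hz))
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


if __name__ == "__main__":
    one_second = [0.0] * 8000  # 1 s of silence at an assumed 8 kHz sample rate
    frames = split_into_frames(one_second, sample_rate_hz=8000)
    print(len(frames), len(frames[0]))
```

With these assumed settings each frame starts 80 samples after the previous one, so one second of audio yields just under 100 complete frames because the last partial frames are discarded.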

  14. Recognition Systems: Front-End Processing
      • Front-end processing converts observed speech frames into an alternative representation, features
        – Lower dimensionality
        – Carries information relevant to the problem
      Feature vectors: $X = \{x_1, x_2, \ldots, x_K\}$
      [Figure: Speech Frames → Front End → Feature Vectors $x_1, x_2, x_3, x_4$ plotted over Dim 1 and Dim 2; axes Feature Number and Frame Number]

  15. Recognition Systems: Pattern Recognition Training
      • A recognition system makes decisions about observed data based on knowledge of past data
      • During training, the system learns about the data it uses to make decisions
        – A set of features is collected from a certain language, dialect, or speaker
      [Figure: training features $x_1, x_2, x_3, x_4$ plotted over Dim 1 and Dim 2]

  16. Recognition Systems: Pattern Recognition Training
      • A recognition system makes decisions about observed data based on knowledge of past data
      • During training, the system learns about the data it uses to make decisions
        – A set of features is collected from a certain language, dialect, or speaker
        – A model is generated to represent the data
      [Figure: training features and the resulting model $p(x)$, each plotted over Dim 1 and Dim 2]

  17. Recognition Systems: Gaussian Mixture Models
      • A Gaussian mixture model (GMM) represents features as the weighted sum of multiple Gaussian distributions
      • Each Gaussian state $i$ has a
        – Mean $\mu_i$
        – Covariance $\Sigma_i$
        – Weight $w_i$
      [Figure: model $\lambda$ with density $p(x \mid \lambda)$ plotted over Dim 1 and Dim 2]
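
Written out, the weighted sum described on this slide takes the standard GMM form below. The slide does not show this equation itself; the symbols $M$ (number of Gaussian states) and $D$ (feature dimension) are introduced here for illustration.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1,

\mathcal{N}(x; \mu_i, \Sigma_i) =
  \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right)
```

The model $\lambda$ is then the collection of parameters $\{w_i, \mu_i, \Sigma_i\}$ for $i = 1, \ldots, M$.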

  18. Recognition Systems: Gaussian Mixture Models
      [Figure: $p(x)$ plotted over Dim 1 and Dim 2, with the parameters $w_i$, $\mu_i$, and $\Sigma_i$ labeled]

  19. Recognition Systems: Gaussian Mixture Models
      [Figure: $p(x)$ plotted over Dim 1 and Dim 2, with labels Parameters and Model States]

  20. Recognition Systems: Language, Speaker, and Dialect Models
      • In language, dialect, and speaker identification (LID, DID, and SID), we train a set of target models $\lambda_C$ for each dialect, language, or speaker
      [Figure: languages, dialects, or speakers mapped to models $\lambda_1$, $\lambda_2$, $\lambda_3$; each model $p(x \mid \lambda_C)$ shown with its parameters and model states over Dim 1 and Dim 2]

  21. Recognition Systems: Universal Background Model
      • We also train a universal background model $\lambda_{\bar{C}}$ representing all speech
      [Figure: background model $p(x \mid \lambda_{\bar{C}})$ shown with its parameters and model states over Dim 1 and Dim 2]

  22. Recognition Systems: Hypothesis Test
      • Given a set of test observations $X_{\text{test}} = \{x_1, x_2, \ldots, x_K\}$, we perform a hypothesis test to determine whether a certain class produced it
        – $H_0$: $X_{\text{test}}$ is from the hypothesized class
        – $H_1$: $X_{\text{test}}$ is not from the hypothesized class

  23. Recognition Systems: Hypothesis Test
      • Given a set of test observations $X_{\text{test}} = \{x_1, x_2, \ldots, x_K\}$, we perform a hypothesis test to determine whether a certain class produced it
        – $H_0$: $X_{\text{test}}$ is from the hypothesized class
        – $H_1$: $X_{\text{test}}$ is not from the hypothesized class
      [Figure: $X_{\text{test}}$ compared against the target model $p(x \mid \lambda_1)$ ($H_0$?) and the background model $p(x \mid \lambda_{\bar{C}})$ ($H_1$?), each plotted over Dim 1 and Dim 2]

  24. Recognition Systems: Hypothesis Test
      • Given a set of test observations $X_{\text{test}} = \{x_1, x_2, \ldots, x_K\}$, we perform a hypothesis test to determine whether a certain class produced it
      [Figure: $X_{\text{test}}$ compared against the target model $p(x \mid \lambda_1)$ (English?) and the background model $p(x \mid \lambda_{\bar{C}})$ (Not English?), each plotted over Dim 1 and Dim 2]

  25. Recognition Systems: Log-Likelihood Ratio Score
      • We determine which hypothesis is true using the ratio:
        $$\frac{p(X \mid H_0)}{p(X \mid H_1)} \begin{cases} \geq \text{threshold}, & \text{accept } H_0 \\ < \text{threshold}, & \text{reject } H_0 \end{cases}$$
      • We use the log-likelihood ratio score to decide whether an observed speaker, language, or dialect is the target
        $$\Lambda(X) = \log[p(X \mid \lambda_C)] - \log[p(X \mid \lambda_{\bar{C}})]$$
        $$\Lambda(X) \begin{cases} \geq \text{threshold}, & X \text{ generated by } \lambda_C \\ < \text{threshold}, & X \text{ generated by } \lambda_{\bar{C}} \end{cases}$$
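
The scoring rule on this slide reduces to a subtraction and a comparison. The sketch below assumes the two log-likelihoods have already been computed, one under the target model and one under the background model; the function names and the default threshold of 0.0 are hypothetical and not taken from the presentation.

```python
def llr_score(loglik_target, loglik_background):
    """Log-likelihood ratio: log p(X | target) - log p(X | background)."""
    return loglik_target - loglik_background


def accept_target(loglik_target, loglik_background, threshold=0.0):
    """Return True when the score meets the threshold (accept H0)."""
    return llr_score(loglik_target, loglik_background) >= threshold


if __name__ == "__main__":
    # Illustrative log-likelihood values only; not results from the presentation.
    print(llr_score(-42.0, -45.5))
    print(accept_target(-42.0, -45.5))
```

In practice the threshold would be tuned on held-out data to trade off false acceptances against false rejections.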

  26. Recognition Systems: Log-Likelihood Computation
      • The observation log-likelihood given a model $\lambda$ is: $\log[p(X \mid \lambda)]$?
      [Figure: observations and model $p(x \mid \lambda)$, each plotted over Dim 1 and Dim 2]

  27. Recognition Systems: Log-Likelihood Computation
      • The observation log-likelihood given a model $\lambda$ is:
        $$\log[p(X \mid \lambda)] = \frac{1}{K} \sum_{k=1}^{K} \log\left( \sum_{i=1}^{M} C_i \exp\left( -\tfrac{1}{2} (x_k - \mu_i)^{T} \Sigma_i^{-1} (x_k - \mu_i) \right) \right)$$
      Annotation: Dot product

  28. Recognition Systems: Log-Likelihood Computation
      • The observation log-likelihood given a model $\lambda$ is:
        $$\log[p(X \mid \lambda)] = \frac{1}{K} \sum_{k=1}^{K} \log\left( \sum_{i=1}^{M} C_i \exp\left( -\tfrac{1}{2} (x_k - \mu_i)^{T} \Sigma_i^{-1} (x_k - \mu_i) \right) \right)$$
      Annotation on $C_i$: Constant derived from weight and covariance
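
A minimal sketch of this computation follows, under two assumptions that the slides do not state outright: covariances are diagonal (which the "Dot product" annotation suggests, since the quadratic form then reduces to a weighted dot product), and the per-state constant is the usual Gaussian normalizer times the weight, $C_i = w_i / \sqrt{(2\pi)^D \prod_d \sigma_{i,d}^2}$. The function names are hypothetical. The inner sum is evaluated with the log-sum-exp trick, which gives the same value as the direct formula but avoids underflow in `exp`.

```python
import math


def state_log_constants(weights, variances):
    """Return log C_i, with C_i = w_i / sqrt((2*pi)^D * prod(variances_i))."""
    log_consts = []
    for w, var in zip(weights, variances):
        dim = len(var)
        log_det = sum(math.log(v) for v in var)
        log_consts.append(math.log(w) - 0.5 * (dim * math.log(2.0 * math.pi) + log_det))
    return log_consts


def avg_log_likelihood(frames, means, variances, log_consts):
    """(1/K) * sum over frames of log sum_i C_i * exp(-0.5 * quad_i)."""
    total = 0.0
    for x in frames:
        terms = []
        for mu, var, log_c in zip(means, variances, log_consts):
            # Diagonal covariance: the quadratic form is a weighted dot product.
            quad = sum((xd - md) ** 2 / vd for xd, md, vd in zip(x, mu, var))
            terms.append(log_c - 0.5 * quad)
        peak = max(terms)  # log-sum-exp keeps exp() from underflowing
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(frames)


if __name__ == "__main__":
    # Toy two-state model in two dimensions; values are illustrative only.
    weights = [0.6, 0.4]
    means = [[0.0, 0.0], [3.0, 3.0]]
    variances = [[1.0, 1.0], [0.5, 2.0]]
    frames = [[0.1, -0.2], [2.9, 3.3], [1.5, 1.5]]
    consts = state_log_constants(weights, variances)
    print(avg_log_likelihood(frames, means, variances, consts))
```

Because the constants depend only on the model, they can be computed once and reused for every frame, leaving the per-frame work dominated by the dot products.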
