
Augmented Statistical Models: Exploiting Generative Models in Discriminative Classifiers. Martin Layton & Mark Gales, Cambridge University Engineering Department. NIPS 2005, 9 December 2005.


  1. Augmented Statistical Models: Exploiting Generative Models in Discriminative Classifiers
     Martin Layton & Mark Gales
     Cambridge University Engineering Department
     NIPS 2005, 9 December 2005

  2. Overview
     • Generative models in discriminative classifiers
       – Fisher score-space
       – generative score-space
     • Augmented Statistical Models
       – extensions of standard models, e.g. GMMs and HMMs
       – allow additional dependencies to be represented
     • Discriminative training
       – maximum margin
       – conditional maximum likelihood
     • TIMIT results
     Cambridge University Engineering Department, NIPS 2005

  3. Generative Models in Discriminative Classifiers

  4. The Hidden Markov Model
     [Figure: (a) standard left-to-right HMM phone topology with states 1–5, transition probabilities a_12, a_22, a_23, a_33, a_34, a_44, a_45 and output distributions b_2, b_3, b_4 generating o_1 … o_T; (b) the HMM as a Dynamic Bayesian Network linking states q_t, q_{t+1} to observations o_t, o_{t+1}]
     • Observations are conditionally independent of all other observations given the state.
     • States are conditionally independent of all other states given the previous state.
     • A poor model of the speech process: the state-space is piecewise constant.
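The two conditional-independence assumptions above are exactly what make the HMM likelihood tractable via the forward recursion. The sketch below is an added illustration, not from the slides; the 2-state model and all numbers are invented:

```python
import numpy as np

def hmm_loglik(obs_probs, init, trans):
    """Forward algorithm: the HMM's two conditional-independence
    assumptions let p(o_1 ... o_T) be computed by this recursion,
    where obs_probs[t, j] = b_j(o_t) and trans[i, j] = a_ij."""
    alpha = init * obs_probs[0]
    for t in range(1, obs_probs.shape[0]):
        alpha = (alpha @ trans) * obs_probs[t]
    return np.log(alpha.sum())

# Hypothetical 2-state left-to-right model; all numbers invented:
init = np.array([1.0, 0.0])
trans = np.array([[0.6, 0.4], [0.0, 1.0]])
obs_probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.7]])
print(hmm_loglik(obs_probs, init, trans))  # ≈ -1.4525
```

The recursion costs O(T·S²) rather than the O(S^T) of summing over all state sequences.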

  5. Fisher Score-spaces
     • Jaakkola & Haussler (1999)
     • A method of incorporating generative models within a discriminative framework.
     • Define a base generative model p̂(O; λ)
       – its 1-dimensional log-likelihood alone is not enough information for good classification
     • Instead use a score-space φ_F(O; λ) whose tangent space captures the essence of the generative process:
           φ_F(O; λ) = ∇_λ ln p̂(O; λ)
       – dimensionality of the score-space: the number of parameters λ
       – suitable for discriminative training (SVMs, etc.)
       – has been applied to many tasks, e.g. computational biology and speech recognition
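To make the score-space concrete, here is a minimal numpy sketch (an illustration added here, not from the slides) of φ_F(O; λ) = ∇_λ ln p̂(O; λ) for the simplest possible base model, a 1-D Gaussian with λ = (μ, σ²):

```python
import numpy as np

def fisher_score(obs, mu, var):
    """Fisher score phi_F(O; lambda) = grad_lambda ln p(O; lambda) for a
    1-D Gaussian base model with lambda = (mu, var); the gradients of
    sum_t ln N(o_t; mu, var) are analytic."""
    obs = np.asarray(obs, dtype=float)
    d_mu = np.sum((obs - mu) / var)                                   # d/d mu
    d_var = np.sum((obs - mu) ** 2 / (2 * var ** 2) - 1 / (2 * var))  # d/d var
    return np.array([d_mu, d_var])

o = np.array([-1.0, 0.0, 1.0])
# At the ML estimate of (mu, var) the score is (numerically) zero;
# the discriminative information lies in how each O perturbs it.
print(fisher_score(o, mu=0.0, var=np.mean(o ** 2)))
```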

  6. Generative Score-spaces
     • Smith & Gales (2002)
     • An extension for supervised binary classification tasks.
     • Define class-conditional base models p̂(O; λ^(1)) and p̂(O; λ^(2))
       – includes the log-likelihood ratio to improve discrimination
       – avoids wrap-around (different O's mapping to the same point in score-space)
     • Score-space:
           φ_LL(O; λ) = [ ln p̂(O; λ^(1)) − ln p̂(O; λ^(2)),
                          ∇_{λ^(1)} ln p̂(O; λ^(1)),
                         −∇_{λ^(2)} ln p̂(O; λ^(2)) ]^T
       – suitable for discriminative training with SVMs
       – no probabilistic interpretation
       – restricted to binary problems
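A sketch of the generative score-space for the simplest case: two 1-D Gaussian class models, restricted here to mean-only parameters for brevity (that restriction is an assumption of this illustration, not the slides' full λ):

```python
import numpy as np

def log_gauss(o, mu, var):
    """Log-likelihood sum_t ln N(o_t; mu, var) of a 1-D Gaussian."""
    o = np.asarray(o, dtype=float)
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (o - mu) ** 2 / (2 * var))

def score_space(o, mu1, var1, mu2, var2):
    """phi_LL(O; lambda): log-likelihood ratio, plus the gradient of each
    class log-likelihood w.r.t. its own mean (mean-only lambda here)."""
    o = np.asarray(o, dtype=float)
    llr = log_gauss(o, mu1, var1) - log_gauss(o, mu2, var2)
    d1 = np.sum((o - mu1) / var1)   #  grad_{lambda^(1)} ln p(O; lambda^(1))
    d2 = -np.sum((o - mu2) / var2)  # -grad_{lambda^(2)} ln p(O; lambda^(2))
    return np.array([llr, d1, d2])

print(score_space([0.5, 0.7], mu1=1.0, var1=1.0, mu2=-1.0, var2=1.0))
# ≈ [2.4, -0.8, -3.2]
```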

  7. Augmented Statistical Models

  8. Dependency Modelling
     • Speech data is dynamic: observations are not of a fixed length.
     • Dependency modelling is an essential part of speech recognition:
           p(o_1, …, o_T; λ) = p(o_1; λ) p(o_2 | o_1; λ) … p(o_T | o_1, …, o_{T−1}; λ)
       – impractical to model directly in this form
       – instead make extensive use of conditional independence
     • Two possible forms of conditional independence:
       – latent (unobserved) variables
       – observed variables
     • Even given a set of dependencies (the form of a Bayesian Network), we still need to determine how those dependencies interact.
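The chain-rule factorisation above is exact but impractical. The sketch below (an added illustration with invented numbers) applies the simplest conditional-independence choice, a first-order Markov assumption p(o_t | o_1, …, o_{t−1}) = p(o_t | o_{t−1}), and checks that the factorised probabilities still sum to one:

```python
import numpy as np

# Chain-rule factorisation p(o_1 .. o_T) = p(o_1) prod_t p(o_t | history),
# specialised to a first-order Markov assumption over a binary alphabet.
init = np.array([0.6, 0.4])                  # p(o_1)
trans = np.array([[0.7, 0.3], [0.2, 0.8]])   # p(o_t | o_{t-1})

def seq_prob(seq):
    """Probability of an observed symbol sequence under the Markov model."""
    p = init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev, cur]
    return p

# Probabilities of all length-3 sequences sum to 1:
total = sum(seq_prob((a, b, c)) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 10))  # → 1.0
```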

  9. Dependency Modelling (cont.)
     [Figure: Dynamic Bayesian Network over states q_{t−1}, q_t, q_{t+1}, q_{t+2} and observations o_{t−1}, o_t, o_{t+1}, o_{t+2}, with direct dependencies between neighbouring observations]
     • Commonly use a member (or mixture) of the exponential family:
           p(O; α) = (1/τ(α)) h(O) exp( α^T T(O) )
       – h(O) is the reference distribution
       – α are the natural parameters
       – τ is the normalisation term
       – T(O) are the sufficient statistics
     • What is the appropriate form of the statistics T(O)?
       – for the diagram above, T(O) = Σ_{t=1}^{T−2} o_t o_{t+1} o_{t+2}
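As an added illustration (not from the slides), the statistic T(O) for the diagram's third-order dependencies, and the corresponding unnormalised exponential-family score, can be written directly for scalar observations:

```python
import numpy as np

def triple_stat(obs):
    """Sufficient statistic T(O) = sum_{t=1}^{T-2} o_t o_{t+1} o_{t+2}
    for third-order observation dependencies (scalar observations)."""
    o = np.asarray(obs, dtype=float)
    return np.sum(o[:-2] * o[1:-1] * o[2:])

def unnorm_exp_model(obs, alpha, h=1.0):
    """Unnormalised exponential-family member:
    p(O; alpha) ∝ h(O) exp(alpha * T(O)); tau(alpha) is omitted."""
    return h * np.exp(alpha * triple_stat(obs))

print(triple_stat([1.0, 2.0, 3.0, 4.0]))  # 1*2*3 + 2*3*4 → 30.0
```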

  10. Augmented Statistical Models
      • Augmented statistical models (related to fibre bundles):
            p(O; λ, α) = (1/τ(λ, α)) p̂(O; λ) exp( α^T [ ∇_λ ln p̂(O; λ);
                                                         (1/2!) vec(∇²_λ ln p̂(O; λ));
                                                         …;
                                                         (1/ρ!) vec(∇^ρ_λ ln p̂(O; λ)) ] )
      • Two sets of parameters:
        – λ: parameters of the base distribution p̂(O; λ)
        – α: natural parameters of the local exponential model
      • The normalisation term τ(λ, α) ensures a valid PDF:
            p(O; λ, α) = p̄(O; λ, α) / τ(λ, α),   ∫ p(O; λ, α) dO = 1
        – τ(λ, α) can be very difficult to estimate

  11. Example: Augmented GMM
      • Use a GMM as the base distribution: p̂(o; λ) = Σ_{m=1}^M c_m N(o; μ_m, Σ_m)
            p(o; λ, α) = (1/τ) [ Σ_{m=1}^M c_m N(o; μ_m, Σ_m) ] exp( Σ_{n=1}^M P(n | o; λ) α_n^T Σ_n^{−1} (o − μ_n) )
      • Simple two-component, one-dimensional example:
        [Figure: three density plots over the range −10 to 10, for α = [0.0, 0.0]^T (the base GMM), α = [−1.0, −1.0]^T and α = [1.0, −1.0]^T]
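A numerical sketch of the unnormalised augmented GMM density above (1-D, added here for illustration; component weights and means are invented). Setting α = 0 recovers the base GMM exactly, since the exponential factor becomes 1:

```python
import numpy as np

def gauss(o, mu, var):
    """1-D Gaussian density N(o; mu, var), vectorised over components."""
    return np.exp(-(o - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def unnorm_aug_gmm(o, c, mu, var, alpha):
    """Unnormalised augmented GMM density (1-D):
    [sum_m c_m N(o; mu_m, var_m)] * exp(sum_n P(n|o) alpha_n (o - mu_n)/var_n);
    the normaliser tau(lambda, alpha) is deliberately omitted."""
    comp = c * gauss(o, mu, var)   # c_m N(o; mu_m, var_m)
    post = comp / comp.sum()       # component posterior P(n | o; lambda)
    return comp.sum() * np.exp(np.sum(post * alpha * (o - mu) / var))

c = np.array([0.5, 0.5]); mu = np.array([-2.0, 2.0]); var = np.array([1.0, 1.0])
# With alpha = 0 the augmented density equals the base GMM density:
print(unnorm_aug_gmm(0.0, c, mu, var, np.zeros(2)))
```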

  12. Augmented Model Dependencies
      • If the base distribution is a latent-variable model (GMM, HMM, …), the sufficient statistics contain a first-order differential:
            ∇_{μ_jm} ln p̂(O; λ) = Σ_{t=1}^T P(θ_t = {s_j, m} | O; λ) Σ_{jm}^{−1} (o_t − μ_jm)
        – this depends on a posterior
        – a compact representation of the effects of all observations
      • Augmented models of this form:
        – retain the independence assumptions of the base model
        – remove the conditional-independence assumptions of the base model, since the local exponential model depends on a posterior
      • For HMM base models:
        – observations become dependent on all observations and all latent states
        – higher-order derivatives create increasingly powerful models
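The GMM analogue of the derivative above is easy to compute explicitly. This added sketch shows how each component's mean-gradient is a posterior-weighted sum over all observations, which is where the extra dependencies come from:

```python
import numpy as np

def gmm_mean_grads(obs, c, mu, var):
    """grad_{mu_m} ln p(O; lambda) for a 1-D GMM: each observation o_t
    contributes P(m | o_t; lambda) * (o_t - mu_m) / var_m, i.e. a
    posterior-weighted sum over ALL observations (the GMM analogue of
    the HMM expression on the slide)."""
    obs = np.asarray(obs, dtype=float)
    grads = np.zeros_like(mu)
    for o in obs:
        comp = c * np.exp(-(o - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        post = comp / comp.sum()            # P(m | o; lambda)
        grads += post * (o - mu) / var
    return grads

# Symmetric invented example: the two mean-gradients mirror each other.
c = np.array([0.5, 0.5]); mu = np.array([-1.0, 1.0]); var = np.array([1.0, 1.0])
print(gmm_mean_grads([-1.0, 1.0], c, mu, var))
```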

  13. Discriminative Training

  14. Maximum Margin Estimation
      • Consider the simplified two-class problem.
      • Bayes' decision rule (with λ fixed), for class priors P(ω_1) and P(ω_2): choose ω_1 if P(ω_1 | O) / P(ω_2 | O) > 1, else ω_2. For augmented models this becomes
            [ p̄(O; λ^(1), α^(1)) / p̄(O; λ^(2), α^(2)) ] · [ P(ω_1) τ(λ^(2), α^(2)) / ( P(ω_2) τ(λ^(1), α^(1)) ) ]  ≷  1
      • This can be rewritten as a linear decision boundary in a generative score-space:
            (1/T) ln[ p̄(O; λ^(1), α^(1)) / p̄(O; λ^(2), α^(2)) ]  +  (1/T) ln[ P(ω_1) τ(λ^(2), α^(2)) / ( P(ω_2) τ(λ^(1), α^(1)) ) ]  ≷  0
        where the first term is w^T φ_LL(O; λ) and the second is the bias b
        – no need to explicitly calculate τ(λ^(1), α^(1)) or τ(λ^(2), α^(2))
      • Note: restrictions on the α's are required to ensure a valid PDF.
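Once the problem is cast as a linear boundary in score-space, classification is a single dot product. A small added sketch (the φ, w and b values below are invented placeholders, not trained values):

```python
import numpy as np

def decide(phi, w, b):
    """Bayes decision as a linear boundary in score-space:
    choose class 1 iff w^T phi + b > 0, else class 2."""
    return 1 if float(np.dot(w, phi) + b) > 0 else 2

# Invented score-space point and weights; w plays the role of
# [1, alpha^(1)T, alpha^(2)T] and b absorbs the priors and normalisers.
phi = np.array([0.8, -0.1, 0.3])   # [llr, grad term 1, grad term 2] / T
w = np.array([1.0, 0.5, 0.5])
b = 0.0
print(decide(phi, w, b))  # → 1
```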

  15. Maximum Margin Estimation (cont.)
      • The first-order generative score-space is given by
            φ_LL(O; λ) = (1/T) [ ln p̂(O; λ^(1)) − ln p̂(O; λ^(2)),
                                 ∇_{λ^(1)} ln p̂(O; λ^(1)),
                                −∇_{λ^(2)} ln p̂(O; λ^(2)) ]^T
        – independent of the augmented parameters α
      • The linear decision boundary is specified by
            w^T = [ 1  α^(1)T  α^(2)T ]
        – only a function of the exponential-model parameters α
      • The bias is calculated as a by-product of training; it depends on both α and λ.
      • There are potentially many parameters to estimate:
        – maximum margin estimation (MME) is a good choice: SVM training
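The slide recommends maximum-margin (SVM) training for the many parameters. As a stand-in for a real SVM package, here is a minimal sub-gradient descent on the primal SVM objective over toy score-space features (everything below is an invented illustration, not the paper's setup):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimal sub-gradient descent on the primal SVM objective
    0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)).
    A real system would use a proper SVM solver; this is a sketch."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            if t * (np.dot(w, x) + b) < 1:      # margin violated: hinge active
                w = w - lr * (w - C * t * x)
                b = b + lr * C * t
            else:                               # only the regulariser acts
                w = w - lr * w
    return w, b

# Toy separable "score-space" data: class +1 has a positive llr feature.
X = np.array([[2.0, 0.5], [1.5, -0.2], [-2.0, 0.1], [-1.8, -0.4]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))
```

With separable data and enough epochs, the learned (w, b) classifies all four points correctly.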
