Unsupervised neural and Bayesian models for zero-resource speech processing
MIT CSAIL, 15 Nov. 2016
Herman Kamper (University of Edinburgh; TTI at Chicago)
http://www.kamperh.com


  1. Unsupervised neural and Bayesian models for zero-resource speech processing MIT CSAIL, 15 Nov. 2016 Herman Kamper University of Edinburgh; TTI at Chicago http://www.kamperh.com

  2. Speech recognition success [Xiong et al., arXiv’16]
     • Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
     • Data: 2000 hours of labelled speech audio; ∼350M words of text
     • But: can we do this for all 7000 languages spoken in the world?

  8. Unsupervised speech processing
     Developing unsupervised methods that can learn structure directly from raw speech audio, i.e. zero-resource technology
     Criticism: there is always some data, so this is really a semi-supervised problem
     Reasons for the purely unsupervised case:
     • Modelling infant language acquisition [Räsänen, SpecCom’12]
     • Language acquisition in robotics [Renkens and Van hamme, IS’15]
     • Analysis of audio for unwritten languages [Besacier et al., SpecCom’14]
     • New insights and models for speech processing [Jansen et al., ICASSP’13]

  11. Unsupervised speech processing: Two problems
      1. Unsupervised frame-level representation learning: a model that learns a feature extractor f_a(·) [figure]
      2. Unsupervised segmentation and clustering: how do we discover meaningful units in unlabelled speech?

  16. Unsupervised term discovery (UTD) [Park and Glass, TASLP’08]

  20. Full-coverage segmentation and clustering

  23. Unsupervised speech processing: Two problems
      1. Unsupervised frame-level representation learning: a feature extractor f_a(·)
      2. Unsupervised segmentation and clustering: we focus on full-coverage segmentation and clustering
      Our claim: unsupervised speech processing benefits from both top-down and bottom-up modelling

  25. Top-down and bottom-up modelling [Feldman et al., CCSS’09]
      Top-down: use knowledge of higher-level units to learn about lower-level parts
      Bottom-up: piece together lower-level parts to get more complex higher-level structures

  26. Unsupervised frame-level representation learning: The Correspondence Autoencoder
      With Micha Elsner, Daniel Renshaw, Aren Jansen, Sharon Goldwater

  28. Supervised representation learning using a DNN
      Input: speech frame(s), e.g. MFCCs, filterbanks
      Feature extractor f_a(·) learned from data; phone classifier learned jointly
      Output: predict phone states (ay, ey, k, v)
      Unsupervised modelling: no phone class targets to train the network on

  31. Autoencoder (AE) neural network [Badino et al., ICASSP’14]
      Input speech frame → reconstruct input
      • Completely unsupervised
      • But purely bottom-up
      • Can we use top-down information?
      • Idea: unsupervised term discovery

  34. Unsupervised term discovery (UTD)
      Can we use these discovered word pairs to give weak top-down supervision?

  36. Weak top-down supervision: Align frames [Jansen et al., ICASSP’13]
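The frame alignment above can be sketched with standard dynamic time warping (DTW): two discovered word segments are warped onto each other, and the matched frame pairs become the weak supervision signal. A minimal sketch; the toy 2-D "frames" and the squared-Euclidean frame distance are assumptions, stand-ins for real MFCC vectors.

```python
# Minimal DTW sketch: align the frames of two discovered word segments so
# that the matched frame pairs can serve as weak top-down supervision.

def dtw_align(seq_a, seq_b):
    """Return the DTW alignment path as (i, j) frame-index pairs."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j]: accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sum((a - b) ** 2 for a, b in zip(seq_a[i - 1], seq_b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack to recover the aligned frame-index pairs
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, (i, j) = min((cost[i - 1][j - 1], (i - 1, j - 1)),
                        (cost[i - 1][j], (i - 1, j)),
                        (cost[i][j - 1], (i, j - 1)))
    return path[::-1]

# Two toy "segments" of 2-D frames standing in for MFCC sequences
seg_a = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
seg_b = [[0.0, 0.1], [0.9, 1.0], [1.1, 1.0], [2.0, 2.1]]
pairs = dtw_align(seg_a, seg_b)
```

Each `(i, j)` in `pairs` says frame `i` of one word is matched to frame `j` of the other; these matched frames are exactly the input/target pairs the correspondence model trains on.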

  39. Autoencoder (AE): input speech frame → reconstruct input

  40. Correspondence autoencoder (cAE)
      Input: a frame from one word in a discovered pair; target: the aligned frame from the other word
      The hidden layers give an unsupervised feature extractor f_a(·), combining top-down and bottom-up information

  43. Correspondence autoencoder (cAE) training [Kamper et al., ICASSP’15]
      (1) Train a stacked autoencoder on the speech corpus (pretraining)
      (2) Run unsupervised term discovery and align word-pair frames
      (3) Initialize the cAE weights from the stacked autoencoder
      (4) Train the correspondence autoencoder → unsupervised feature extractor
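The core cAE update can be sketched as follows: the network takes a frame from one word and is trained to reconstruct the aligned frame from the other word in the pair. This is a one-hidden-layer sketch with plain SGD; the real model is a deeper stacked network pretrained as an autoencoder, and the dimensions, learning rate, and toy "aligned pairs" here are all illustrative assumptions.

```python
import numpy as np

# Sketch of cAE training: input = frame from one word, target = the
# DTW-aligned frame from the other word in a discovered pair.
rng = np.random.default_rng(0)
d_in, d_hid = 13, 8                     # e.g. 13 MFCCs -> 8-D learned features
W1 = rng.normal(0.0, 0.1, (d_hid, d_in))
W2 = rng.normal(0.0, 0.1, (d_in, d_hid))

def forward(x):
    h = np.tanh(W1 @ x)                 # hidden activation = learned feature f_a(x)
    return W2 @ h, h

# Toy aligned frame pairs (in practice: DTW-aligned frames from UTD word pairs)
frames = rng.normal(size=(50, d_in))
pairs = [(x, x + 0.05 * rng.normal(size=d_in)) for x in frames]

def avg_loss():
    return float(np.mean([np.sum((forward(x)[0] - y) ** 2) for x, y in pairs]))

loss_before = avg_loss()
lr = 0.01
for _ in range(200):                    # epochs of plain SGD
    for x, y in pairs:
        out, h = forward(x)
        err = out - y                   # grad of 0.5 * ||out - y||^2 w.r.t. out
        gW2 = np.outer(err, h)
        gh = W2.T @ err                 # backprop into the hidden layer
        gW1 = np.outer(gh * (1.0 - h ** 2), x)
        W2 -= lr * gW2
        W1 -= lr * gW1
loss_after = avg_loss()
```

After training, the hidden activation `h` is used as the frame-level feature representation; the reconstruction head is discarded.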

  44. Intrinsic evaluation: Isolated word query task
      [bar chart: average precision (0–0.5) for Autoencoder, UBM-GMM, TopUBM and cAE]
      Extended: [Renshaw et al., IS’15] and [Yuan et al., IS’16]

  46. Unsupervised segmentation and clustering: The Segmental Bayesian Model
      With Aren Jansen, Sharon Goldwater

  48. Full-coverage segmentation and clustering

  50. Segmental modelling for full-coverage segmentation
      Previous models use explicit subword discovery directly on speech features, e.g. [Lee et al., 2015]
      Our approach uses whole-word segmental representations, i.e. acoustic word embeddings [Kamper et al., IS’15; Kamper et al., TASLP’16]

  52. Acoustic word embeddings
      Map a variable-length segment Y_i to a fixed-dimensional vector x_i = f_e(Y_i) ∈ R^d
      Dynamic programming alignment has quadratic complexity, while embedding comparison is linear time; standard clustering methods can be used
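One very simple choice of f_e makes the fixed-dimensionality point concrete: uniformly downsample a variable-length frame sequence to k frames and flatten it, so every segment, whatever its duration, maps to a vector of the same size. This is a minimal sketch; k and the 13-D random "frames" below are illustrative assumptions.

```python
import numpy as np

# Minimal acoustic word embedding f_e: uniform downsampling + flattening.
def embed(frames, k=10):
    """Map an (n_frames, d) segment to a fixed (k * d,) embedding vector."""
    frames = np.asarray(frames, dtype=float)
    n, d = frames.shape
    idx = np.linspace(0, n - 1, k).round().astype(int)  # k evenly spaced frames
    return frames[idx].reshape(-1)

rng = np.random.default_rng(1)
seg_short = rng.normal(size=(23, 13))   # e.g. 23 frames of 13-D MFCCs
seg_long = rng.normal(size=(71, 13))    # a longer segment, same frame dim

x1, x2 = embed(seg_short), embed(seg_long)
dist = float(np.linalg.norm(x1 - x2))   # one cheap fixed-dim distance per pair
```

Because both segments land in the same R^130, comparing them is a single vector distance rather than a dynamic-programming alignment, and any off-the-shelf clustering method can operate on the embeddings.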

  55. Unsupervised segmental Bayesian model
      Speech waveform → acoustic frames y_{1:M} via f_a(·)
      Embeddings x_i = f_e(y_{t1:t2}) for hypothesized word segments
      Acoustic modelling: a Bayesian Gaussian mixture model gives p(x_i | h^-)
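The acoustic-model score p(x_i | h^-) above can be sketched for one simplified case: a spherical Gaussian component with known variance and a conjugate Gaussian prior on its mean, where the collapsed predictive density of a new embedding given the embeddings already assigned to the component is itself Gaussian. All hyperparameters and the toy 2-D embeddings below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Collapsed predictive log-density of a new embedding under one Bayesian GMM
# component (spherical Gaussian, known variance, Gaussian prior on the mean).
def log_predictive(x, assigned, mu0=0.0, var0=1.0, var=0.5):
    """log p(x | points already assigned to the component)."""
    x = np.asarray(x, dtype=float)
    n = len(assigned)
    if n:
        xbar = np.mean(assigned, axis=0)
        post_var = 1.0 / (n / var + 1.0 / var0)          # posterior var of mean
        post_mu = post_var * (n * xbar / var + mu0 / var0)
    else:
        post_mu, post_var = mu0, var0                    # empty component: prior
    pred_var = var + post_var                            # predictive variance
    return float(np.sum(-0.5 * np.log(2 * np.pi * pred_var)
                        - 0.5 * (x - post_mu) ** 2 / pred_var))

# A new embedding scores higher under the component whose members it resembles
comp_a = [np.array([2.0, 2.0]), np.array([2.1, 1.9])]
comp_b = [np.array([-2.0, -2.0]), np.array([-1.9, -2.1])]
x_new = np.array([1.8, 2.2])
score_a = log_predictive(x_new, comp_a)
score_b = log_predictive(x_new, comp_b)
```

In the full segmental model, scores like these (weighted by component counts) drive the sampling of word-segment assignments jointly with the segmentation itself.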
