Multimodal learning from images and speech

KU Leuven & UPF Barcelona, January 2019
Herman Kamper, E&E Engineering, Stellenbosch University, South Africa
http://www.kamperh.com/

Advances in speech recognition 3 / 35


  4. Task 2: Keyword spotting 16 / 35

     | Keyword | Example of matched utterance | Type |
     |---|---|---|
     | beach | a boy in a yellow shirt is walking on a beach . . . | correct |
     | behind | a surfer does a flip on a wave | mistake |
     | bike | a dirt biker flies through the air | variant |
     | boys | two children play soccer in the park | semantic |
     | large | . . . a rocky cliff overlooking a body of water | semantic |
     | play | children playing in a ball pit | variant |
     | sitting | two people are seated at a table with drinks | semantic |
     | yellow | a tan dog jumping over a red and blue toy | mistake |
     | young | a little girl on a kid swing | semantic |

  5. Task 3: Semantic speech retrieval. Written query: fire. [Figure: three items, each labelled “burning”.] [Kamper et al., TASLP’19] 17 / 35

  7. Human (MTurk) evaluation 18 / 35

     | Keyword | Top retrieved utterance | Human label |
     |---|---|---|
     | ocean | man falling off a blue surfboard in the ocean | 5 / 5 |
     | snowy | a skier catches air over the snow | 5 / 5 |
     | bike | a dirt biker rides through some trees | 4 / 5 |
     | children | a group of young boys playing soccer | 4 / 5 |
     | field | two white dogs running in the grass together | 3 / 5 |
     | swimming | a woman holding a young boy slide down a water slide into a pool | 3 / 5 |
     | carrying | small dog running in the grass with a toy in its mouth | 2 / 5 ∗ |
     | large | a group of people on a zig path through the mountains | 1 / 5 ∗ |
     | hair | two women and a man smile for the camera | 0 / 5 ∗ |

  9. Task 3: Semantic speech retrieval. [Bar chart: P@10, axis 0 to 100, for TextPrior, VisionTagPrior, VisionSpeechCNN, VisionCNN, SupervisedBoWCNN, TextWuP and TextParagram; the bar values did not survive extraction.] 19 / 35
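
P@10, the axis of the chart on this slide, is usually defined as precision at 10: the fraction of the ten highest-scoring utterances that are relevant to the query. The sketch below is a plain-Python illustration under that assumption, not the evaluation code behind the slide; `precision_at_k`, `toy_scores` and `toy_relevant` are hypothetical names with made-up values.

```python
def precision_at_k(scores, relevant, k=10):
    """Fraction of the k highest-scoring utterances that are relevant.

    scores:   dict mapping utterance id -> model score for one query
    relevant: set of utterance ids judged relevant to that query
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Rank utterance ids from highest to lowest score.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_k = ranked[:k]
    hits = sum(1 for utt in top_k if utt in relevant)
    return hits / k


# Toy example with made-up scores for a single keyword query.
toy_scores = {"utt1": 0.9, "utt2": 0.8, "utt3": 0.4, "utt4": 0.1}
toy_relevant = {"utt1", "utt3"}
p_at_2 = precision_at_k(toy_scores, toy_relevant, k=2)
```

In a full evaluation this value would typically be averaged over all query keywords.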

  10. Task 3: Semantic speech retrieval. [Bar chart: Spearman’s ρ, axis 0 to 30, for TextPrior, VisionTagPrior, VisionSpeechCNN, VisionCNN, SupervisedBoWCNN, TextWuP and TextParagram; the bar values did not survive extraction.] 20 / 35
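
Spearman’s ρ, the axis of the chart on this slide, is the Pearson correlation between two rank-transformed lists. The sketch below is a plain-Python illustration, assuming the comparison is between model scores and per-utterance human ratings like the MTurk labels earlier in the deck; `model_scores` and `human_labels` are made-up values, not data from the talk.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up model scores and human ratings (out of 5) for five utterances.
model_scores = [0.91, 0.75, 0.40, 0.62, 0.10]
human_labels = [5, 4, 1, 3, 0]
rho = spearman_rho(model_scores, human_labels)
```

Because only ranks matter, ρ rewards a model for ordering utterances the way the annotators did, regardless of the scale of its scores.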

  11. But this model is trained for English? [Figure: model diagram with the labels X (input), conv, max, feedfwd, f(X), Loss L, VGG and y_vis; the remaining characters are unrecoverable extraction debris.] [Kamper et al., Interspeech’17] 21 / 35

  12. Task 4: Cross-lingual keyword spotting. Arapaho speech collection (want to search). Given English keyword: ‘Disease’. [Kamper and Roth, SLTU’18] 22 / 35

  13. Task 4: Cross-lingual keyword spotting. English speech collection (want to search). Given German keyword: ‘Hunde’. [Kamper and Roth, SLTU’18] 22 / 35

  14. Task 4: Cross-lingual keyword spotting. [Figure: cross-lingual keyword spotter with the labels English speech X, conv, max, feedfwd, f(X), Loss ℓ, image I, VGG-16, ŷ_de and German (text) tags such as “springt” and “Hunde”; the remaining characters are unrecoverable extraction debris.] [Kamper and Roth, SLTU’18] 23 / 35

  16. 2. Multimodal One-Shot Learning from Images and Speech. With Ryan Eloff and Herman Engelbrecht.

  17. You are the robot 26 / 35

  24. You are the robot ? 26 / 35

  28. Unimodal one-shot learning and classification. One-shot speech learning: a support set with one example each of “three”, “one”, “five”, “two” and “four”. One-shot speech classification: for the query (“two”), predict the label ŷ = ? [Fei-Fei et al., PAMI’06]; [Lake et al., CogSci’14] 27 / 35

  31. Unimodal one-shot learning and classification. Support set: “three”, “one”, “five”, “two”, “four”. The query (“two”) is classified as ŷ = two. [Fei-Fei et al., PAMI’06]; [Lake et al., CogSci’14] 27 / 35

  34. Multimodal one-shot learning and matching. Multimodal one-shot learning: a support set. Multimodal one-shot matching: given the query (“two”), which item in the matching set matches? [Eloff et al., arXiv’18] 28 / 35

  35. Our framework. [Figure: framework diagram with a support set (multimodal one-shot learning), a query (“two”) and a matching set (multimodal one-shot matching).] [Eloff et al., arXiv’18] 29 / 35

  42. Our approach to multimodal one-shot learning 30 / 35
      • Requires within-modality distance metrics
      • Can be done directly over features: DTW over speech, cosine over image pixels
      • Or distance metrics can be learned from background data
      • Compare these on TIDigits (speech) paired with MNIST (images)
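
The bullets on this slide name two within-modality distances: DTW over speech and cosine over image pixels. The sketch below shows one way they could be chained for matching, in plain Python. The two-step procedure (nearest support speech item, then nearest matching-set image to its paired image), the cosine frame cost inside DTW and the length normalisation are assumptions made for illustration, and all names and toy values are hypothetical. Feature vectors are assumed to be non-zero.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of feature frames."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = cosine_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = frame_cost + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m] / (n + m)  # length normalisation (an assumption)


def one_shot_match(query_speech, support_set, matching_images):
    """Return the index of the matching-set image chosen for a spoken query.

    support_set: list of (speech_frames, image_pixels) pairs, one per class
    """
    # Step 1: nearest support speech item to the query (speech modality).
    _, support_image = min(
        support_set, key=lambda pair: dtw_distance(query_speech, pair[0])
    )
    # Step 2: nearest matching-set image to that support image (image modality).
    return min(
        range(len(matching_images)),
        key=lambda k: cosine_distance(support_image, matching_images[k]),
    )


# Toy data: two classes, 2-dimensional speech frames, 3-pixel "images".
support = [
    ([[1.0, 0.0], [1.0, 0.1]], [1.0, 0.0, 0.0]),  # class A
    ([[0.0, 1.0], [0.1, 1.0]], [0.0, 1.0, 0.0]),  # class B
]
query = [[0.0, 1.0], [0.0, 1.0], [0.2, 1.0]]  # resembles the class B speech
matching = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]]
chosen = one_shot_match(query, support, matching)
```

Swapping `dtw_distance` or `cosine_distance` for a distance over learned embeddings would give the "learned from background data" variant of the same pipeline.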

  44. Background data. Images: Omniglot (no digits). Speech: isolated labelled words (no digits). 31 / 35

  46. Models for metric learning. Classifier network: an input X is mapped to a class label (the example shown is “cricket”). Siamese network: two inputs X_1 and X_2 are embedded as y_1 = f(X_1) and y_2 = f(X_2) and compared with a distance d(y_1, y_2). 32 / 35
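
The Siamese idea on this slide, one shared function f applied to both inputs followed by a distance d(y_1, y_2), can be sketched in a few lines of plain Python. The slide does not say which architecture, distance or loss is used, so the single tanh layer, the Euclidean distance and the contrastive loss below are stand-ins chosen for illustration, and `weights` holds made-up parameters.

```python
import math


def embed(x, weights):
    """Shared embedding f(X): one linear layer followed by tanh (a stand-in)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def euclidean(y1, y2):
    """d(y1, y2): Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y1, y2)))


def contrastive_loss(x1, x2, same_class, weights, margin=1.0):
    """Pull same-class pairs together; push different-class pairs past margin."""
    # Both inputs go through the SAME weights: the sharing makes it Siamese.
    d = euclidean(embed(x1, weights), embed(x2, weights))
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


weights = [[0.5, -0.5], [0.25, 0.75]]  # made-up parameters
loss_same = contrastive_loss([1.0, 0.0], [1.0, 0.0], True, weights)
loss_diff = contrastive_loss([1.0, 0.0], [1.0, 0.0], False, weights)
```

After training on background data, only `embed` and the distance are needed at test time: unseen classes are compared in the embedding space rather than classified directly.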

  47. Multimodal one-shot matching. [Bar chart: accuracy (%), axis 0 to 100, for DTW + Pixels, FFNN Classifier, CNN Classifier, Siamese CNN (offline) and Siamese CNN (online); the bar values did not survive extraction.] 33 / 35
