
Vision and Sound. Computer Vision, Fall 2018, Columbia University



  1. Vision and Sound. Computer Vision, Fall 2018, Columbia University

  2. Single-modality video representations. Vision: a stack of 3D convolutions. Hearing: a stack of 1D convolutions. Slide credit: Andrew Owens

  3. (McGurk 1976)

  4. Same audio, different video! (McGurk 1976)

  5. Object Recognition: Sound, $f(x_s; \omega)$, Objects

  6. Natural Synchronization: $\min_{f} \sum_{i} D_{KL}\big(F(x_i) \,\|\, f(x_i)\big)$, with sound network $f(x_s; \omega)$ and vision network $F(x_v; \Omega)$. Example label: Lion
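
The objective on slide 6 minimizes a KL divergence between the vision network's output F and the sound network's output f for the same video. The following is a minimal sketch of that loss for a single video, assuming both networks emit class logits that are turned into distributions with a softmax. The logit values, the three-class setup, and the helper names `softmax` and `kl_divergence` are illustrative assumptions, not taken from the slides.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical outputs for one video over three classes (assumed values).
teacher = softmax([4.0, 1.0, 0.5])        # F(x_v): vision network on the frames
student_good = softmax([3.5, 1.2, 0.4])   # f(x_s): sound network that agrees
student_bad = softmax([0.2, 3.0, 1.0])    # f(x_s): sound network that disagrees

loss_good = kl_divergence(teacher, student_good)
loss_bad = kl_divergence(teacher, student_bad)
print(loss_good, loss_bad)
```

The sound network that agrees with the vision network receives the smaller loss, so no sound labels are needed: the vision network's predictions act as the training target.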

  7. Millions of Unlabeled Videos

  8. SoundNet: Waveform, Convolutional Neural Network, Categories

  9. Sound Recognition: classifying sounds in ESC-50. Method and accuracy: Chance 2%; Human Consistency 81%

  10. Sound Recognition: classifying sounds in ESC-50. Method and accuracy: Chance 2%; SVM-MFCC 39%; Random Forest 44%; CNN (Piczak 2015) 64%; Human Consistency 81%

  11. Sound Recognition: classifying sounds in ESC-50. Method and accuracy: Chance 2%; SVM-MFCC 39%; Random Forest 44%; CNN (Piczak 2015) 64%; SoundNet 74% (10% gain); Human Consistency 81%

  12. Vision vs Sound: low-dimensional embeddings via t-SNE (van der Maaten and Hinton, 2008). Panels: Vision, Sound

  13. Sensor Power Consumption: Camera ~1 watt; Microphone ~1 milliwatt

  14. What does it learn? (Waveform, Categories)

  15. Layer 1

  16. What does it learn? (Waveform, Categories)

  17. Layer 5 Smacking-like

  18. Layer 5 Chime-like

  19. What does it learn? (Waveform, Categories)

  20. Layer 7 Scuba-like

  21. Layer 7 Parents-like

  22. Audiovisual Grounding Which regions are making which sounds?

  23. Audiovisual Grounding

  24. Which objects make which sounds?

  25. The sound of clicked object

  26. The sound of clicked object

  27. The sound of clicked object

  28. Collect unlabeled videos

  29. Mix Sound Tracks

  30. How to recover originals? Audio-only: ill-posed; permutation problem
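
Slides 29 and 30 mix sound tracks and note that audio-only recovery is ill-posed and suffers from a permutation problem. The sketch below illustrates both points on toy signals, assuming that mixing is plain sample-wise addition. The sine tones stand in for two video sound tracks, and the names `tone` and `mix` are hypothetical.

```python
import math

def tone(freq_hz, n_samples=800, rate=8000):
    """A pure sine tone, used here as a stand-in for one video's sound track."""
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n_samples)]

def mix(track_a, track_b):
    """Mix two tracks by adding them sample by sample."""
    return [a + b for a, b in zip(track_a, track_b)]

guitar = tone(220.0)   # assumed source 1
flute = tone(440.0)    # assumed source 2
mixture = mix(guitar, flute)

# Permutation problem: swapping the sources yields the same mixture,
# so audio alone cannot say which output slot belongs to which source.
swapped = mix(flute, guitar)

# Ill-posed: a completely different pair of signals explains the mixture too.
half_a = [0.5 * m for m in mixture]
half_b = [0.5 * m for m in mixture]
alternative = mix(half_a, half_b)

print(swapped == mixture)
print(max(abs(x - y) for x, y in zip(alternative, mixture)))
```

Because the mixture is a sum of known originals, the originals can serve as training targets without any manual labels.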

  31. Vision can help. Components: Video Analysis Network, Audio Analysis Network, Audio Synthesizer Network. Output: sound of target video

  32. Audiovisual Model. Video Analysis Network: CNN, Max Pool, K vision channels

  33. Audiovisual Model. Video Analysis Network: CNN, Max Pool, K vision channels. Audio Analysis Network: sound spectrogram (STFT), U-Net, K audio channels s_1, s_2, …, s_K

  34. Audiovisual Model. Video Analysis Network: CNN, Max Pool, K vision channels. Audio Analysis Network: sound spectrogram (STFT), U-Net, K audio channels s_1, s_2, …, s_K. Audio Synthesizer Network: sound of target video
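
The slides do not spell out how the Audio Synthesizer Network combines the K vision channels with the K audio channels. One plausible reading, sketched below as an assumption, is a weighted sum: the K vision activations at a chosen pixel weight the K audio channels s_1 … s_K to produce that pixel's spectrogram. The function name `synthesize` and the toy 2×2 spectrograms are hypothetical.

```python
def synthesize(pixel_features, audio_channels):
    """Weight K audio channels (each a freq x time grid) by K vision activations.

    Assumption: a plain linear combination; the real synthesizer may differ.
    """
    if len(pixel_features) != len(audio_channels):
        raise ValueError("need one vision activation per audio channel")
    n_freq = len(audio_channels[0])
    n_time = len(audio_channels[0][0])
    out = [[0.0] * n_time for _ in range(n_freq)]
    for weight, channel in zip(pixel_features, audio_channels):
        for f in range(n_freq):
            for t in range(n_time):
                out[f][t] += weight * channel[f][t]
    return out

# K = 2 toy audio channels (assumed values), each 2 frequency bins x 2 time steps.
s1 = [[1.0, 0.0], [0.0, 0.0]]
s2 = [[0.0, 0.0], [0.0, 1.0]]

pixel_on_object_a = [1.0, 0.0]   # vision features that select channel 1
pixel_on_object_b = [0.0, 1.0]   # vision features that select channel 2

sound_a = synthesize(pixel_on_object_a, [s1, s2])
sound_b = synthesize(pixel_on_object_b, [s1, s2])
print(sound_a, sound_b)
```

Under this reading, clicking a different pixel changes only the weights, which is what lets one mixture yield a different sound per image region.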

  35. Original Audio

  36. What does this sound like?

  37. What does this sound like?

  38. What does this sound like?

  39. What regions are making sound? Panels: Original Video, Estimated Volume

  40. What sounds are they making? Panels: Original Video, Embedding (projected and visualized as color)

  41. Adjusting Volume

  42. Learning audio-visual correspondences: (image, sound) pairs, real or fake? Slide credit: Andrew Owens

  43. Learning audio-visual correspondences: (image, “moo”), real or fake? Slide credit: Andrew Owens

  44. Idea #1: random pairs. Arandjelovic, Zisserman. ICCV 2017. Slide credit: Andrew Owens
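
The random-pairs idea can be sketched as a data-generation step: a frame with its own audio is a "real" example, and the same frame with audio drawn from a different video is a "fake" one. This is a minimal sketch under that assumption; the function name `make_pairs`, the 1/0 labels, and the string placeholders for frames and audio are hypothetical.

```python
import random

def make_pairs(videos, rng):
    """Build (frame, audio, label) examples from a list of (frame, audio) videos.

    label 1: frame and audio come from the same video (real)
    label 0: audio comes from a randomly chosen other video (fake)
    """
    if len(videos) < 2:
        raise ValueError("need at least two videos to draw a mismatched pair")
    pairs = []
    for idx, (frame, audio) in enumerate(videos):
        pairs.append((frame, audio, 1))
        other = rng.choice([j for j in range(len(videos)) if j != idx])
        pairs.append((frame, videos[other][1], 0))
    return pairs

# Placeholder data: strings stand in for image frames and audio clips.
videos = [
    ("frame_cow", "audio_cow"),
    ("frame_guitar", "audio_guitar"),
    ("frame_dog", "audio_dog"),
]
pairs = make_pairs(videos, random.Random(0))
for example in pairs:
    print(example)
```

A classifier trained on such pairs has to learn what each sound source looks like, since that is the only cue separating real from fake.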

  45. Arandjelovic, Zisserman. ICCV 2017

  46. Vision hidden units Arandjelovic, Zisserman. ICCV 2017

  47. Sound hidden units Arandjelovic, Zisserman. ICCV 2017

  48. Sound Recognition Arandjelovic, Zisserman. ICCV 2017

  49. Visual Recognition Linear classifier on top of features (ImageNet) Arandjelovic, Zisserman. ICCV 2017

  50. Idea #1: random pairs. Slide credit: Andrew Owens

  51. Idea #2: time-shifted pairs. Slide credit: Andrew Owens

  52. Idea #2: time-shifted pairs Slide credit: Andrew Owens
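
Time-shifted pairs differ from random pairs in that the negative example uses audio from the same video, offset in time. The sketch below assumes the audio track is longer than the video clip, so a shifted window can be cut from it. The function name `make_time_shift_pairs`, its parameters, and the integer placeholder data are hypothetical.

```python
def make_time_shift_pairs(frames, audio, samples_per_frame,
                          clip_frames, start_frame, shift_frames):
    """Return one aligned and one misaligned (video, audio, label) example."""
    if shift_frames == 0:
        raise ValueError("a zero shift would make the negative example aligned")
    if start_frame < 0 or start_frame + clip_frames > len(frames):
        raise ValueError("video clip falls outside the frame list")

    def audio_window(first_frame):
        # Audio samples covering clip_frames frames, starting at first_frame.
        begin = first_frame * samples_per_frame
        end = begin + clip_frames * samples_per_frame
        if begin < 0 or end > len(audio):
            raise ValueError("audio window falls outside the track")
        return audio[begin:end]

    video_clip = frames[start_frame:start_frame + clip_frames]
    aligned = (video_clip, audio_window(start_frame), 1)
    misaligned = (video_clip, audio_window(start_frame + shift_frames), 0)
    return aligned, misaligned

# Placeholder data: integers stand in for frames and audio samples.
frames = list(range(10))
audio = list(range(40))        # 4 audio samples per frame
aligned, misaligned = make_time_shift_pairs(
    frames, audio, samples_per_frame=4,
    clip_frames=3, start_frame=2, shift_frames=4)
print(aligned)
print(misaligned)
```

Since both examples share the same frames and the same overall audio content, the network cannot rely on semantics alone and has to attend to timing.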

  53. Fused audio-visual representation. Task: aligned vs. misaligned. Video stream: 3D convolutions. Audio stream: 1D convolutions. Fused stream: 3D convolutions. Slide credit: Andrew Owens

  54. Fused audio-visual representation. Task: aligned vs. misaligned. Video stream: 3D convolutions. Audio stream: 1D convolutions. Fusion: concat at “conv2”, followed by 3D convolutions. Slide credit: Andrew Owens

  55. What does the network learn? Class activation map (Zhou et al. 2016) for the aligned vs. misaligned classifier. Slide credit: Andrew Owens
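
A class activation map in the sense of Zhou et al. 2016 is the weighted sum of the last convolutional feature maps, using the linear classifier weights of one class that follow global average pooling. The sketch below shows this on toy 2×2 feature maps. The values and the function names are illustrative assumptions, and the bias term is omitted.

```python
def class_activation_map(feature_maps, class_weights):
    """CAM[y][x] = sum_k w_k * f_k[y][x] for one class."""
    if len(feature_maps) != len(class_weights):
        raise ValueError("need one weight per feature map")
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(class_weights, feature_maps):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam

def mean_of_grid(grid):
    """Global average pooling of a single 2D map."""
    values = [v for row in grid for v in row]
    return sum(values) / len(values)

# Two toy feature maps and the weights of the "aligned" class (assumed values).
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
aligned_weights = [2.0, -1.0]

cam = class_activation_map(feature_maps, aligned_weights)
class_score = sum(w * mean_of_grid(f) for w, f in zip(aligned_weights, feature_maps))
print(cam, class_score)
```

The spatial mean of the map equals the class score (without bias), which is why high-valued locations can be read as the regions that drove the decision.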

  56. Top responses per category (speech examples omitted): Dribbling basketball

  57. Dribbling basketball

  58. Dribbling basketball

  59. Playing organ

  60. Playing organ

  61. Playing organ

  62. Chopping wood

  63. Chopping wood

  64. Chopping wood

  65. Application: on/off-screen source separation. Example speech: “Good morning!”, “Guten Morgen!” Task: separate on-screen sounds from background noise. Slide credit: Andrew Owens

  66. Creating training data. Synthetic sound mixture: on-screen + off-screen (VoxCeleb). Slide credit: Andrew Owens

  67. On/off-screen source separation. Input: on-screen + off-screen mixture as an STFT spectrogram (frequency × time). Multisensory features, regression. Slide credit: Andrew Owens

  68. On/off-screen source separation. Input: on-screen + off-screen mixture as a spectrogram (frequency × time). u-net (Ronneberger 2015) with concat. Slide credit: Andrew Owens

  69. On/off-screen source separation. Training: 4-sec. videos; VoxCeleb + AudioSet; L1 loss on log spec.; inverse STFT; no labels or face detection. u-net (Ronneberger 2015) with concat. Slide credit: Andrew Owens
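
Slide 69 names an L1 loss on the log spectrogram as the training objective. The sketch below computes such a loss on toy magnitude spectrograms. The mean reduction, the small `eps` added before the logarithm, and the function name `log_spec_l1` are assumptions, since the slide gives no further detail.

```python
import math

def log_spec_l1(pred_mag, target_mag, eps=1e-5):
    """Mean absolute difference between log-magnitude spectrograms.

    Both inputs are freq x time grids of non-negative magnitudes.
    eps (an assumed value) keeps the logarithm finite at zero magnitude.
    """
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred_mag, target_mag):
        for p, t in zip(pred_row, target_row):
            total += abs(math.log(p + eps) - math.log(t + eps))
            count += 1
    return total / count

# Toy 2 x 2 spectrograms (assumed values).
target = [[1.0, 2.0], [4.0, 8.0]]
perfect = [[1.0, 2.0], [4.0, 8.0]]
too_loud = [[2.0, 4.0], [8.0, 16.0]]   # every bin has twice the magnitude

print(log_spec_l1(perfect, target))
print(log_spec_l1(too_loud, target))
```

Working in the log domain makes the penalty depend on the ratio between predicted and target magnitudes, so quiet and loud bins contribute on a comparable scale.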

  70. Input video

  71. On-screen prediction

  72. Off-screen prediction

  73. Input video

  74. On-screen prediction

  75. On-screen prediction
