Vision and Sound. Computer Vision, Fall 2018, Columbia University (PowerPoint PPT presentation)



SLIDE 1

Vision and Sound

Computer Vision Fall 2018 Columbia University

SLIDE 2

[Diagram: a vision stream of stacked 3D convolutions and a hearing stream of stacked 1D convolutions]

Single-modality video representations Vision Hearing

Slide credit: Andrew Owens
SLIDE 3

(McGurk and MacDonald, 1976)

SLIDE 4

Same audio, different video! (McGurk and MacDonald, 1976)

SLIDE 5

f(x_s; ω)

Object Recognition

Sound Objects

SLIDE 6

F(x_v; Ω)

Lion

f(x_s; ω)

Natural Synchronization

Sound Vision

min_f Σ_i D_KL( F(x_i) || f(x_i) )
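The transfer objective can be sketched numerically. This is a toy numpy illustration, not SoundNet's training code: F_x stands in for the vision teacher's class posteriors F(x_i), f_x for the sound student's posteriors f(x_i); the numbers are made up.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Row-wise D_KL(p || q) for discrete distributions (each row sums to 1)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

# Made-up class posteriors for i = 1..3 videos over 4 sound categories.
F_x = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.2, 0.6, 0.1, 0.1],
                [0.1, 0.1, 0.1, 0.7]])
f_x = np.array([[0.40, 0.20, 0.20, 0.20],
                [0.25, 0.25, 0.25, 0.25],
                [0.10, 0.20, 0.20, 0.50]])

# The transfer loss: min_f sum_i D_KL(F(x_i) || f(x_i)).
loss = kl_divergence(F_x, f_x).sum()
```

Training would backpropagate this loss into the student's parameters ω; here the student outputs are fixed numbers, so the sketch only evaluates the objective.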

SLIDE 7

Millions of Unlabeled Videos

SLIDE 8

SoundNet

Waveform → Convolutional Neural Network → Categories
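SoundNet's front end is a stack of 1D convolutions applied directly to the raw waveform. A minimal sketch of one such layer in numpy (random filter weights and made-up sizes; real layers use many learned filters):

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Valid 1-D cross-correlation of signal x with a single kernel w."""
    k = len(w)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], w) for i in range(n_out)])

rng = np.random.default_rng(0)
waveform = rng.standard_normal(22050)   # one second of mono audio at 22.05 kHz
kernel = rng.standard_normal(64)        # one filter; weights are random, not learned
features = np.maximum(conv1d(waveform, kernel, stride=8), 0.0)  # conv + ReLU
```

The stride downsamples the one-dimensional signal, much as pooling does for images.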

SLIDE 9

SLIDE 10

SLIDE 11

SLIDE 12

SLIDE 13

Sound Recognition

Method              Accuracy
Chance                    2%
Human Consistency        81%

Classifying sounds in ESC-50

SLIDE 14

Sound Recognition

Method              Accuracy
Chance                    2%
SVM-MFCC                 39%
Random Forest            44%
CNN, Piczak 2015         64%
Human Consistency        81%

Classifying sounds in ESC-50

SLIDE 15

Sound Recognition

Method              Accuracy
Chance                    2%
SVM-MFCC                 39%
Random Forest            44%
CNN, Piczak 2015         64%
SoundNet                 74%
Human Consistency        81%

Classifying sounds in ESC-50

10% gain

SLIDE 16

Vision vs Sound

Low-dimensional embeddings via t-SNE (van der Maaten and Hinton, 2008)

Vision Sound

SLIDE 17

Sensor Power Consumption

Camera:     ~1 watt
Microphone: ~1 milliwatt

SLIDE 18

What does it learn?

Waveform → Categories

SLIDE 19

Layer 1

SLIDE 20

What does it learn?

Waveform → Categories

SLIDE 21

Layer 5

Smacking-like

SLIDE 22

Layer 5

Chime-like

SLIDE 23

What does it learn?

Waveform → Categories

SLIDE 24

Layer 7

Scuba-like

SLIDE 25

Layer 7

Parents-like

SLIDE 26

Audiovisual Grounding

Which regions are making which sounds?

SLIDE 27

Audiovisual Grounding

SLIDE 28

Which objects make which sounds?

SLIDE 29

The sound of the clicked object

SLIDE 30

The sound of the clicked object

SLIDE 31

The sound of the clicked object

SLIDE 32

Collect unlabeled videos

SLIDE 33

Mix Sound Tracks

SLIDE 34

Audio-only:

  • Ill-posed
  • Permutation problem

How to recover originals?
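The mix-and-separate idea builds supervised examples from unlabeled video by adding two soundtracks: the mixture is the input and the originals are the targets. A toy numpy sketch, with synthetic tones standing in for instrument tracks:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 8000, endpoint=False)   # 1 s at a made-up 8 kHz rate
track_a = np.sin(2 * np.pi * 440 * t)             # stand-in for one instrument
track_b = np.sin(2 * np.pi * 660 * t)             # stand-in for another
mixture = track_a + track_b                       # the only signal the model hears

# Audio alone is ill-posed: (track_a, track_b) and (track_b, track_a)
# explain the mixture equally well (the permutation problem).
```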

SLIDE 35

[Diagram: a Video Analysis Network and an Audio Analysis Network feed an Audio Synthesizer Network, which outputs the sound of the target video]

Vision can help

SLIDE 36

[Diagram: Video Analysis Network. A CNN over the frames, max-pooled, produces K vision channels]

Audiovisual Model

SLIDE 37

[Diagram: adds the Audio Analysis Network. An STFT yields the sound spectrogram, and a U-Net splits it into K audio channels s1 … sK; the Video Analysis Network (CNN + max pool) yields K vision channels]

Audiovisual Model
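The audio side starts with an STFT of the mixture. A minimal Hann-window magnitude STFT in numpy (made-up frame sizes; practical systems typically call a library routine):

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=128):
    """Magnitude spectrogram: Hann-windowed frames -> real FFT -> abs."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq bins, frames)

sr = 8000
x = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)  # a 1 kHz test tone
S = stft_mag(x)   # 1000 Hz lands in bin 1000 / (8000 / 256) = 32
```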

SLIDE 38

[Diagram: the full pipeline. The K vision channels and the K audio channels s1 … sK enter the Audio Synthesizer Network, which outputs the sound of the target video]

Audiovisual Model
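One common reading of the synthesizer (an assumption consistent with the diagram, not a quote of the exact architecture): at each pixel, the K vision channels weight the K audio channels, and the result becomes a soft mask on the mixture spectrogram. A toy numpy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
K, F, T = 8, 64, 32                              # channels, freq bins, frames

mix_spec = np.abs(rng.standard_normal((F, T)))   # mixture magnitude spectrogram
audio_ch = rng.standard_normal((K, F, T))        # K audio channels (U-Net output)
vision_ch = rng.standard_normal(K)               # K vision channels at one pixel

# Weight the audio channels by the pixel's vision features, squash to a
# (0, 1) mask, and apply the mask to the mixture spectrogram.
logits = np.einsum('k,kft->ft', vision_ch, audio_ch)
mask = 1.0 / (1.0 + np.exp(-logits))
pixel_spec = mask * mix_spec   # estimated sound of the object at that pixel
```

Doing this for every pixel gives "the sound of the clicked object" demos: click a region, play its masked spectrogram.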

SLIDE 39

Original Audio

SLIDE 40

What does this sound like?

SLIDE 41

What does this sound like?

SLIDE 42

What does this sound like?

SLIDE 43

What regions are making sound?

Original Video Estimated Volume

SLIDE 44

What sounds are they making?

Original Video Embedding (projected and visualized as color)

SLIDE 45

Adjusting Volume

SLIDE 46

[Figure: (video frame, audio) pairs; the network judges each pair: real or fake?]

Slide credit: Andrew Owens

Learning audio-visual correspondences

SLIDE 47

[Figure: an image paired with a “moo” sound]

Learning audio-visual correspondences

Slide credit: Andrew Owens

real or fake?

SLIDE 48


Idea #1: random pairs

Arandjelovic, Zisserman. ICCV 2017
Slide credit: Andrew Owens
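The random-pairs idea reduces to building a binary dataset: a frame with its own audio is a positive, a frame with audio from a random other video is a negative. A toy sketch with made-up feature vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
n_videos = 6
frame_feats = rng.standard_normal((n_videos, 4))   # toy per-video image features
audio_feats = rng.standard_normal((n_videos, 4))   # toy per-video audio features

pairs, labels = [], []
for i in range(n_videos):
    pairs.append((frame_feats[i], audio_feats[i]))            # real pair
    labels.append(1)
    j = rng.choice([v for v in range(n_videos) if v != i])    # random other video
    pairs.append((frame_feats[i], audio_feats[j]))            # fake pair
    labels.append(0)
```

A small network is then trained to classify each pair as real or fake, which forces the two feature extractors to agree on semantics.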
SLIDE 49

Arandjelovic, Zisserman. ICCV 2017
SLIDE 50

Arandjelovic, Zisserman. ICCV 2017

Vision hidden units

SLIDE 51

Arandjelovic, Zisserman. ICCV 2017

Sound hidden units

SLIDE 52

Arandjelovic, Zisserman. ICCV 2017

Sound Recognition

SLIDE 53

Arandjelovic, Zisserman. ICCV 2017

Visual Recognition

Linear classifier on top of features (ImageNet)

SLIDE 54


Idea #1: random pairs

Slide credit: Andrew Owens
SLIDE 55


Idea #2: time-shifted pairs

Slide credit: Andrew Owens
SLIDE 56

Idea #2: time-shifted pairs

Slide credit: Andrew Owens
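Time-shifted negatives keep the audio from the same video but slide it in time, so the network must learn synchronization rather than mere co-occurrence. A toy sketch (circular shift, made-up sample rate):

```python
import numpy as np

def misalign(audio, shift_s, sr=16000):
    """Make a 'fake' pair: same sounds, shifted by shift_s seconds (circularly)."""
    return np.roll(audio, int(round(shift_s * sr)))

rng = np.random.default_rng(3)
sr = 16000
audio = rng.standard_normal(4 * sr)        # 4 s of audio aligned with its video
shifted = misalign(audio, shift_s=1.0)     # misaligned negative example
```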
SLIDE 57

Fused audio-visual representation

[Diagram: a video stream of 3D convolutions and an audio stream of 1D convolutions merging into a single network]

Aligned vs. misaligned

Slide credit: Andrew Owens
SLIDE 58

+ Fused audio-visual representation

[Diagram: the same video (3D convolution) and audio (1D convolution) streams, joined early in the network]

Aligned vs. misaligned (concat at “conv2”)

Slide credit: Andrew Owens
SLIDE 59

What does the network learn?

Class activation map (Zhou et al. 2016)

Aligned vs. misaligned

Slide credit: Andrew Owens
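A class activation map is a weighted sum of the final convolutional feature maps, using the linear classifier's weights for the class of interest (Zhou et al. 2016). A toy numpy sketch with random features and weights:

```python
import numpy as np

rng = np.random.default_rng(4)
C, H, W = 16, 7, 7
feat = np.maximum(rng.standard_normal((C, H, W)), 0.0)  # last conv maps (post-ReLU)
w_cls = rng.standard_normal(C)                          # classifier weights, one class

cam = np.einsum('c,chw->hw', w_cls, feat)               # class activation map
cam = (cam - cam.min()) / (cam.max() - cam.min())       # rescale to [0, 1] for display
```

Upsampled to the input resolution, the map highlights which regions drove the aligned-vs-misaligned decision.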
SLIDE 60

Dribbling basketball

Top responses per category (speech examples omitted)

SLIDE 61

Dribbling basketball

SLIDE 62

Dribbling basketball

SLIDE 63

Playing organ

SLIDE 64

Playing organ

SLIDE 65

Playing organ

SLIDE 66

Chopping wood

SLIDE 67

Chopping wood

SLIDE 68

Chopping wood

SLIDE 69

Application: on/off-screen source separation

Task: separate on-screen sounds from background noise

Good morning! Guten Morgen! (German: “Good morning!”)

Slide credit: Andrew Owens
SLIDE 70

Creating training data

Synthetic sound mixture: an on-screen speech track (VoxCeleb) plus an off-screen track

Slide credit: Andrew Owens
SLIDE 71

On/off-screen source separation

[Diagram: multisensory features are regressed onto the STFT spectrograms (time × frequency) of the on-screen and off-screen tracks]

Slide credit: Andrew Owens
SLIDE 72

On/off-screen source separation

[Diagram: the multisensory features are concatenated into a U-Net (Ronneberger et al., 2015) that predicts the on-screen and off-screen spectrograms (time × frequency)]

Slide credit: Andrew Owens
SLIDE 73

On/off-screen source separation

[Diagram: as before, with an inverse STFT converting the predicted on-screen and off-screen spectrograms back to waveforms (U-Net, Ronneberger et al., 2015)]

Training:

  • 4-sec. videos
  • VoxCeleb + AudioSet
  • L1 loss on log spectrograms
  • No labels or face detection

Slide credit: Andrew Owens
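The L1-on-log-spectrogram loss from the training recipe can be sketched directly (toy arrays, not the authors' code):

```python
import numpy as np

def log_spec_l1(pred, target, eps=1e-7):
    """Mean L1 distance between log-magnitude spectrograms."""
    return np.mean(np.abs(np.log(pred + eps) - np.log(target + eps)))

rng = np.random.default_rng(5)
target = np.abs(rng.standard_normal((128, 100)))   # toy magnitude spectrogram
pred = 1.1 * target                                # a slightly-off prediction
loss = log_spec_l1(pred, target)
```

The log compresses dynamic range, so quiet time-frequency bins are not drowned out by loud ones.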
SLIDE 74

Input video

SLIDE 75

On-screen prediction

SLIDE 76

Off-screen prediction

SLIDE 77

Input video

SLIDE 78

On-screen prediction

SLIDE 79

Off-screen prediction