Vision and Sound. Computer Vision, Fall 2018, Columbia University (PowerPoint PPT presentation)



SLIDE 1

Vision and Sound

Computer Vision Fall 2018 Columbia University

SLIDE 2

[Diagram: a vision stream of stacked 3D convolutions and a hearing stream of stacked 1D convolutions]

Single-modality video representations Vision Hearing

Slide credit: Andrew Owens
SLIDE 3

(McGurk and MacDonald, 1976)

SLIDE 4

Same audio, different video! (McGurk and MacDonald, 1976)

SLIDE 5

f(x_s; ω)

Object Recognition

Sound Objects

SLIDE 6

F(x_v; Ω)

Lion

f(x_s; ω)

Natural Synchronization

Sound Vision

min_f Σ_i D_KL( F(x_i) || f(x_i) )
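The transfer objective can be sketched numerically. This is a toy numpy illustration, not SoundNet's training code: F_x stands in for the vision teacher's class posteriors F(x_i), f_x for the sound student's posteriors f(x_i); the numbers are made up.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Row-wise D_KL(p || q) for discrete distributions (each row sums to 1)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

# Made-up class posteriors for i = 1..3 videos over 4 sound categories.
F_x = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.2, 0.6, 0.1, 0.1],
                [0.1, 0.1, 0.1, 0.7]])
f_x = np.array([[0.40, 0.20, 0.20, 0.20],
                [0.25, 0.25, 0.25, 0.25],
                [0.10, 0.20, 0.20, 0.50]])

# The transfer loss: min_f sum_i D_KL(F(x_i) || f(x_i)).
loss = kl_divergence(F_x, f_x).sum()
```

Training would backpropagate this loss into the student's parameters ω; here the student outputs are fixed numbers, so the sketch only evaluates the objective.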

SLIDE 7

Millions of Unlabeled Videos

SLIDE 8

SoundNet

Waveform → Convolutional Neural Network → Categories
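SoundNet's front end is a stack of 1D convolutions applied directly to the raw waveform. A minimal sketch of one such layer in numpy (random filter weights and made-up sizes; real layers use many learned filters):

```python
import numpy as np

def conv1d(x, w, stride=1):
    """Valid 1-D cross-correlation of signal x with a single kernel w."""
    k = len(w)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], w) for i in range(n_out)])

rng = np.random.default_rng(0)
waveform = rng.standard_normal(22050)   # one second of mono audio at 22.05 kHz
kernel = rng.standard_normal(64)        # one filter; weights are random, not learned
features = np.maximum(conv1d(waveform, kernel, stride=8), 0.0)  # conv + ReLU
```

The stride downsamples the one-dimensional signal, much as pooling does for images.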

SLIDE 9

SLIDE 10

SLIDE 11

SLIDE 12

SLIDE 13

Sound Recognition

Method              Accuracy
Chance                    2%
Human Consistency        81%

Classifying sounds in ESC-50

SLIDE 14

Sound Recognition

Method              Accuracy
Chance                    2%
SVM-MFCC                 39%
Random Forest            44%
CNN, Piczak 2015         64%
Human Consistency        81%

Classifying sounds in ESC-50

SLIDE 15

Sound Recognition

Method              Accuracy
Chance                    2%
SVM-MFCC                 39%
Random Forest            44%
CNN, Piczak 2015         64%
SoundNet                 74%
Human Consistency        81%

Classifying sounds in ESC-50

10% gain

SLIDE 16

Vision vs Sound

Low-dimensional embeddings via t-SNE (van der Maaten and Hinton, 2008)

Vision Sound

SLIDE 17

Sensor Power Consumption

Camera:     ~1 watt
Microphone: ~1 milliwatt

SLIDE 18

What does it learn?

Waveform → Categories

SLIDE 19

Layer 1

SLIDE 20

What does it learn?

Waveform → Categories

SLIDE 21

Layer 5

Smacking-like

SLIDE 22

Layer 5

Chime-like

SLIDE 23

What does it learn?

Waveform → Categories

SLIDE 24

Layer 7

Scuba-like

SLIDE 25

Layer 7

Parents-like

SLIDE 26

Audiovisual Grounding

Which regions are making which sounds?

SLIDE 27

Audiovisual Grounding

SLIDE 28

Which objects make which sounds?

SLIDE 29

The sound of the clicked object

SLIDE 30

The sound of the clicked object

SLIDE 31

The sound of the clicked object

SLIDE 32

Collect unlabeled videos

SLIDE 33

Mix Sound Tracks

SLIDE 34

Audio-only:

  • Ill-posed
  • Permutation problem

How to recover originals?
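The mix-and-separate idea builds supervised examples from unlabeled video by adding two soundtracks: the mixture is the input and the originals are the targets. A toy numpy sketch, with synthetic tones standing in for instrument tracks:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 8000, endpoint=False)   # 1 s at a made-up 8 kHz rate
track_a = np.sin(2 * np.pi * 440 * t)             # stand-in for one instrument
track_b = np.sin(2 * np.pi * 660 * t)             # stand-in for another
mixture = track_a + track_b                       # the only signal the model hears

# Audio alone is ill-posed: (track_a, track_b) and (track_b, track_a)
# explain the mixture equally well (the permutation problem).
```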

SLIDE 35

[Diagram: a Video Analysis Network and an Audio Analysis Network feed an Audio Synthesizer Network, which outputs the sound of the target video]

Vision can help

SLIDE 36

[Diagram: Video Analysis Network. A CNN over the frames, max-pooled, produces K vision channels]

Audiovisual Model

SLIDE 37

[Diagram: adds the Audio Analysis Network. An STFT yields the sound spectrogram, and a U-Net splits it into K audio channels s1 … sK; the Video Analysis Network (CNN + max pool) yields K vision channels]

Audiovisual Model
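The audio side starts with an STFT of the mixture. A minimal Hann-window magnitude STFT in numpy (made-up frame sizes; practical systems typically call a library routine):

```python
import numpy as np

def stft_mag(x, n_fft=256, hop=128):
    """Magnitude spectrogram: Hann-windowed frames -> real FFT -> abs."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq bins, frames)

sr = 8000
x = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)  # a 1 kHz test tone
S = stft_mag(x)   # 1000 Hz lands in bin 1000 / (8000 / 256) = 32
```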

SLIDE 38

[Diagram: the full pipeline. The K vision channels and the K audio channels s1 … sK enter the Audio Synthesizer Network, which outputs the sound of the target video]

Audiovisual Model
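One common reading of the synthesizer (an assumption consistent with the diagram, not a quote of the exact architecture): at each pixel, the K vision channels weight the K audio channels, and the result becomes a soft mask on the mixture spectrogram. A toy numpy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
K, F, T = 8, 64, 32                              # channels, freq bins, frames

mix_spec = np.abs(rng.standard_normal((F, T)))   # mixture magnitude spectrogram
audio_ch = rng.standard_normal((K, F, T))        # K audio channels (U-Net output)
vision_ch = rng.standard_normal(K)               # K vision channels at one pixel

# Weight the audio channels by the pixel's vision features, squash to a
# (0, 1) mask, and apply the mask to the mixture spectrogram.
logits = np.einsum('k,kft->ft', vision_ch, audio_ch)
mask = 1.0 / (1.0 + np.exp(-logits))
pixel_spec = mask * mix_spec   # estimated sound of the object at that pixel
```

Doing this for every pixel gives "the sound of the clicked object" demos: click a region, play its masked spectrogram.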

SLIDE 39

Original Audio

SLIDE 40

What does this sound like?

SLIDE 41

What does this sound like?

SLIDE 42

What does this sound like?

SLIDE 43

What regions are making sound?

Original Video Estimated Volume

SLIDE 44

What sounds are they making?

Original Video Embedding (projected and visualized as color)

SLIDE 45

Adjusting Volume

SLIDE 46

[Figure: (video frame, audio) pairs; the network judges each pair: real or fake?]

Slide credit: Andrew Owens

Learning audio-visual correspondences

SLIDE 47

[Figure: an image paired with a “moo” sound]

Learning audio-visual correspondences

Slide credit: Andrew Owens

real or fake?

SLIDE 48


Idea #1: random pairs

Arandjelovic, Zisserman. ICCV 2017
Slide credit: Andrew Owens
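The random-pairs idea reduces to building a binary dataset: a frame with its own audio is a positive, a frame with audio from a random other video is a negative. A toy sketch with made-up feature vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
n_videos = 6
frame_feats = rng.standard_normal((n_videos, 4))   # toy per-video image features
audio_feats = rng.standard_normal((n_videos, 4))   # toy per-video audio features

pairs, labels = [], []
for i in range(n_videos):
    pairs.append((frame_feats[i], audio_feats[i]))            # real pair
    labels.append(1)
    j = rng.choice([v for v in range(n_videos) if v != i])    # random other video
    pairs.append((frame_feats[i], audio_feats[j]))            # fake pair
    labels.append(0)
```

A small network is then trained to classify each pair as real or fake, which forces the two feature extractors to agree on semantics.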
SLIDE 49

Arandjelovic, Zisserman. ICCV 2017
SLIDE 50

Arandjelovic, Zisserman. ICCV 2017

Vision hidden units

SLIDE 51

Arandjelovic, Zisserman. ICCV 2017

Sound hidden units

SLIDE 52

Arandjelovic, Zisserman. ICCV 2017

Sound Recognition

SLIDE 53

Arandjelovic, Zisserman. ICCV 2017

Visual Recognition

Linear classifier on top of features (ImageNet)

SLIDE 54


Idea #1: random pairs

Slide credit: Andrew Owens
SLIDE 55


Idea #2: time-shifted pairs

Slide credit: Andrew Owens
SLIDE 56

Idea #2: time-shifted pairs

Slide credit: Andrew Owens
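Time-shifted negatives keep the audio from the same video but slide it in time, so the network must learn synchronization rather than mere co-occurrence. A toy sketch (circular shift, made-up sample rate):

```python
import numpy as np

def misalign(audio, shift_s, sr=16000):
    """Make a 'fake' pair: same sounds, shifted by shift_s seconds (circularly)."""
    return np.roll(audio, int(round(shift_s * sr)))

rng = np.random.default_rng(3)
sr = 16000
audio = rng.standard_normal(4 * sr)        # 4 s of audio aligned with its video
shifted = misalign(audio, shift_s=1.0)     # misaligned negative example
```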
SLIDE 57

Fused audio-visual representation

[Diagram: a video stream of 3D convolutions and an audio stream of 1D convolutions merging into a single network]

Aligned vs. misaligned

Slide credit: Andrew Owens
SLIDE 58

+ Fused audio-visual representation

[Diagram: the same video (3D convolution) and audio (1D convolution) streams, joined early in the network]

Aligned vs. misaligned (concat at “conv2”)

Slide credit: Andrew Owens
SLIDE 59

What does the network learn?

Class activation map (Zhou et al. 2016)

Aligned vs. misaligned

Slide credit: Andrew Owens
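A class activation map is a weighted sum of the final convolutional feature maps, using the linear classifier's weights for the class of interest (Zhou et al. 2016). A toy numpy sketch with random features and weights:

```python
import numpy as np

rng = np.random.default_rng(4)
C, H, W = 16, 7, 7
feat = np.maximum(rng.standard_normal((C, H, W)), 0.0)  # last conv maps (post-ReLU)
w_cls = rng.standard_normal(C)                          # classifier weights, one class

cam = np.einsum('c,chw->hw', w_cls, feat)               # class activation map
cam = (cam - cam.min()) / (cam.max() - cam.min())       # rescale to [0, 1] for display
```

Upsampled to the input resolution, the map highlights which regions drove the aligned-vs-misaligned decision.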
SLIDE 60

Dribbling basketball

Top responses per category (speech examples omitted)

SLIDE 61

Dribbling basketball

SLIDE 62

Dribbling basketball

SLIDE 63

Playing organ

SLIDE 64

Playing organ

SLIDE 65

Playing organ

SLIDE 66

Chopping wood

SLIDE 67

Chopping wood

SLIDE 68

Chopping wood

SLIDE 69

Application: on/off-screen source separation

Task: separate on-screen sounds from background noise

Good morning! Guten Morgen! (German: “Good morning!”)

Slide credit: Andrew Owens
SLIDE 70

Creating training data

Synthetic sound mixture: an on-screen speech track (VoxCeleb) plus an off-screen track

Slide credit: Andrew Owens
SLIDE 71

On/off-screen source separation

[Diagram: multisensory features are regressed onto the STFT spectrograms (time × frequency) of the on-screen and off-screen tracks]

Slide credit: Andrew Owens
SLIDE 72

On/off-screen source separation

[Diagram: the multisensory features are concatenated into a U-Net (Ronneberger et al., 2015) that predicts the on-screen and off-screen spectrograms (time × frequency)]

Slide credit: Andrew Owens
SLIDE 73

On/off-screen source separation

[Diagram: as before, with an inverse STFT converting the predicted on-screen and off-screen spectrograms back to waveforms (U-Net, Ronneberger et al., 2015)]

Training:

  • 4-sec. videos
  • VoxCeleb + AudioSet
  • L1 loss on log spectrograms
  • No labels or face detection

Slide credit: Andrew Owens
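The L1-on-log-spectrogram loss from the training recipe can be sketched directly (toy arrays, not the authors' code):

```python
import numpy as np

def log_spec_l1(pred, target, eps=1e-7):
    """Mean L1 distance between log-magnitude spectrograms."""
    return np.mean(np.abs(np.log(pred + eps) - np.log(target + eps)))

rng = np.random.default_rng(5)
target = np.abs(rng.standard_normal((128, 100)))   # toy magnitude spectrogram
pred = 1.1 * target                                # a slightly-off prediction
loss = log_spec_l1(pred, target)
```

The log compresses dynamic range, so quiet time-frequency bins are not drowned out by loud ones.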
SLIDE 74

Input video

SLIDE 75

On-screen prediction

SLIDE 76

Off-screen prediction

SLIDE 77

Input video

SLIDE 78

On-screen prediction

SLIDE 79

Off-screen prediction