Ambient Sound Provides Supervision for Visual Learning, by Andrew Owens (PowerPoint PPT Presentation)



SLIDE 1

Ambient Sound Provides Supervision for Visual Learning

Andrew Owens¹, Jiajun Wu¹, Josh H. McDermott¹, William T. Freeman¹,², and Antonio Torralba¹

¹MIT & ²Google Research

ECCV 2016

Presented by An T. Nguyen


SLIDE 6

Introduction

Problem

◮ Learn an image representation without labels... ◮ ...that is useful for a real task (e.g. Object Recognition).

Idea

◮ Set up a pretext task. ◮ To solve the pretext task, the model must learn a good representation.

Learn to predict a “natural signal”...

◮ ...that is available for ‘free’. ◮ This paper: Sound. ◮ Others: Camera motion.

(Agrawal et al., Jayaraman & Grauman, 2015)

SLIDE 8

Data

Yahoo Flickr Creative Commons 100 Million Dataset (Thomee et al. 2015)

◮ A 360,000-video subset. ◮ Sample one image every 10 sec. ◮ Extract 3.75 sec of sound around each sampled image. ◮ 1.8 million training examples.
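The sampling scheme above amounts to simple index arithmetic; a minimal sketch (the function name, and the assumption that the 3.75 s window is centered on each sampled frame, are mine):

```python
def sample_pairs(video_duration_s, frame_every_s=10.0, sound_window_s=3.75):
    """Sample (frame_time, sound_interval) pairs: one frame every 10 s,
    with ~3.75 s of audio centered on each sampled frame."""
    pairs = []
    t = frame_every_s / 2  # first sample at the middle of the first window
    while t < video_duration_s:
        start = max(0.0, t - sound_window_s / 2)
        end = min(video_duration_s, t + sound_window_s / 2)
        pairs.append((t, (start, end)))
        t += frame_every_s
    return pairs
```

A 60 s clip then yields six image/sound training pairs.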

SLIDE 9

Examples 1

(flickr.com/photos/41894173046@N01/4530333858)

[Sound / video clip]

SLIDE 10

Examples 2

(flickr.com/photos/42035325@N00/8029349128)

[Sound / video clip]

SLIDE 11

Examples 3

(flickr.com/photos/zen/2479982751)

[Sound / video clip]

SLIDE 15

Challenges

◮ Sound is sometimes indicative of the image content. ◮ But sometimes it is not.

Sound-producing objects

◮ may lie outside the image. ◮ do not always produce sound.

Video

◮ is edited. ◮ has noisy background sound.

Question: What representation can we learn?

SLIDE 22

Represent sound

Pre-process

◮ Filter the waveform (mimicking the human ear). ◮ Compute statistics (e.g. the mean of each frequency channel). ◮ → a sound texture: a 502-dim vector.

Two labeling models

  • 1. Cluster the sound textures (k-means).
  • 2. PCA, 30 projections, threshold → binary codes.

Given an image

  • 1. Predict sound cluster.
  • 2. Predict 30 binary codes (multi-label classification).
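The two labeling models can be sketched with scikit-learn on a stand-in matrix of 502-dim sound textures (the variable names and the median threshold are my assumptions; the paper's exact thresholding may differ):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
textures = rng.normal(size=(1000, 502))  # toy stand-in for 502-dim sound textures

# Labeling model 1: cluster the sound textures with k-means; each image is
# labeled with the cluster id of its accompanying sound.
kmeans = KMeans(n_clusters=30, n_init=10, random_state=0).fit(textures)
cluster_labels = kmeans.labels_                      # shape (1000,)

# Labeling model 2: project onto the top 30 PCA directions and threshold
# each projection (here at its median) -> 30 binary codes per example.
proj = PCA(n_components=30).fit_transform(textures)  # shape (1000, 30)
binary_codes = (proj > np.median(proj, axis=0)).astype(int)
```

Given an image, the network then predicts either the cluster id (single-label classification) or the 30 codes (multi-label classification).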

SLIDE 24

Training

Convolutional Neural Network

◮ Similar to (Krizhevsky et al. 2012). ◮ Implemented in Caffe.
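For the binary-code variant, multi-label classification over 30 codes is typically trained with a per-code sigmoid cross-entropy; a numerically stable NumPy sketch of that objective (not the paper's exact Caffe loss layer):

```python
import numpy as np

def multilabel_logistic_loss(logits, codes):
    """Mean sigmoid cross-entropy over 30 binary sound codes.
    logits: (batch, 30) raw network outputs; codes: (batch, 30) in {0, 1}."""
    t = 2.0 * codes - 1.0           # targets rewritten as -1/+1
    z = logits * t
    # log(1 + exp(-z)) computed stably for large |z|
    return np.mean(np.log1p(np.exp(-np.abs(z))) + np.maximum(-z, 0.0))
```

At logits of zero the loss equals log 2 per code; confident correct predictions drive it toward zero.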

SLIDE 31

Visualizing neurons (in upper layers)

Method: for each neuron

  • 1. Find images with large activation.
  • 2. Find locations with large contribution to activation.
  • 3. Highlight these regions.
  • 4. Show the highlighted images to humans on AMT (Amazon Mechanical Turk).
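Steps 1-2 can be sketched in NumPy, given one neuron's activation maps over a set of images (the fractional-peak threshold is my assumption; steps 3-4 are display and human-evaluation steps):

```python
import numpy as np

def top_activations(act_maps, k=5, thresh=0.5):
    """act_maps: (n_images, H, W) activation maps of a single neuron.
    Returns the indices of the k images with the largest peak activation,
    plus a boolean mask per image marking locations contributing most
    (here: above thresh * that image's peak)."""
    peaks = act_maps.reshape(len(act_maps), -1).max(axis=1)
    top = np.argsort(peaks)[::-1][:k]   # image indices, strongest first
    masks = [act_maps[i] > thresh * peaks[i] for i in top]
    return top, masks
```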

SLIDE 32

Visualizing neurons

SLIDE 40

Detectors Histogram

Sound / Ego Motion / Labeled Scenes (supervised)

SLIDE 42

Observations

◮ Each method learns some kind of representation... ◮ ...depending on the pretext task.

Representation learned from sound

◮ Objects with a distinctive sound. ◮ Complementary to other methods.

SLIDE 45

Object/Scene Recognition (1-vs-rest SVM)

Comparable Performance to Others

  • 1. Agrawal et al. 2015
  • 4. Doersch et al. 2015
  • 20. Krähenbühl et al. 2016
  • 35. Wang & Gupta 2015
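The evaluation setup, 1-vs-rest linear SVMs on frozen features, can be sketched with scikit-learn; the data here are synthetic stand-ins for the CNN features and object labels:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
# Toy stand-ins for frozen CNN features and object-class labels.
feats_train, y_train = rng.normal(size=(200, 64)), rng.integers(0, 5, 200)
feats_test, y_test = rng.normal(size=(50, 64)), rng.integers(0, 5, 50)

# One linear SVM per class, each trained class-vs-rest on the features.
clf = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=5000)).fit(feats_train, y_train)
accuracy = clf.score(feats_test, y_test)
```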

SLIDE 48

Object Detection (Pretraining Fast R-CNN)

Similar Performance to the Motion-based Method

  • 1. Agrawal et al. 2015
  • 4. Doersch et al. 2015
  • 20. Krähenbühl et al. 2016
  • 35. Wang & Gupta 2015

SLIDE 50

Discussion

Sound

◮ is abundant. ◮ can supervise the learning of good representations. ◮ is complementary to visual information.

Future work

◮ Other sound representations. ◮ Which objects/scenes are detectable by sound?

SLIDE 51

Bonus: Visually Indicative Sound

(Owens et al. 2016, vis.csail.mit.edu)
