Concept Detection: Convergence to Local Features and Opportunities Beyond - PowerPoint PPT Presentation



SLIDE 1

Concept Detection: Convergence to Local Features and Opportunities Beyond

Shih-Fu Chang1, Junfeng He1, Yu-Gang Jiang1,2, Elie El Khoury3, Chong-Wah Ngo2, Akira Yanagawa1, Eric Zavesky1

1 DVMM Lab, Columbia University
2 City University of Hong Kong
3 IRIT, Toulouse, France

TRECVID 2008 workshop, NIST

SLIDE 2

Overview: 5 components & 6 runs

[Diagram: five components (Local Feature, Global Feature, CU-VIREO374 with 374-d features, Web Images, Face & Audio Filtering) feed SVM classifiers and are combined into six submitted runs, numbered 1-6.]

SLIDE 3

Overview: overall performance

[Chart: Mean Average Precision of the 161 TRECVID 2008 Type-A submissions; y-axis 0.02-0.18.]

– Local feature alone already achieves near-top performance
– Every other component contributes incrementally to the final detection

SLIDE 4

Overview: per-concept performance

[Chart: per-concept average precision (y-axis 0.05-0.4) for CU_2_run4+face&audio, CU_4_run5+cu-vireo374, CU_5_local_global, and CU_6_local_only, with the per-concept MEAN and MAX of all submissions.]

SLIDE 5

Outline

[Outline slide repeating the component diagram from Slide 2: five components, SVM classifiers, six runs.]

SLIDE 6

Bag-of-Visual-Words (BoW)

SLIDE 7

Representation Choices of BoW

  • Word weighting scheme
– How to weight the importance of a word to an image?
  • Spatial information
– Are the spatial locations of keypoints useful?

SLIDE 8

Weighting Scheme

  • Traditional
– Binary, term frequency (TF), inverse document frequency (IDF), …
  • Our method – soft weighting
– Assign a keypoint to multiple visual words; weights are determined by keypoint-to-word similarity

Details in: Jiang et al. CIVR 2007.

Image from http://www.cs.joensuu.fi/pages/franti/vq/lkm15.gif
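The soft-weighting idea on this slide can be sketched in a few lines. This is a minimal illustration, not the CIVR 2007 implementation: the inverse-distance similarity and the rank-halving factor are assumed choices, and the codebook is a toy 3-word vocabulary.

```python
import numpy as np

def soft_weight_bow(descriptors, codebook, n_neighbors=4):
    """Soft-weighted bag-of-visual-words histogram.

    Each keypoint descriptor contributes to its n_neighbors nearest
    visual words, with a contribution that halves at each rank
    (one common form of soft weighting; the similarity function
    here is an illustrative choice).
    """
    hist = np.zeros(len(codebook))
    for d in descriptors:
        dists = np.linalg.norm(codebook - d, axis=1)
        nearest = np.argsort(dists)[:n_neighbors]
        for rank, w in enumerate(nearest):
            sim = 1.0 / (1.0 + dists[w])   # similarity proxy
            hist[w] += sim * 0.5 ** rank   # halve weight per rank
    return hist

# toy example: 2-D descriptors, 3 visual words
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descs = np.array([[0.1, 0.1], [0.9, 1.1]])
h = soft_weight_bow(descs, codebook, n_neighbors=2)
```

With hard assignment each descriptor would vote for exactly one word; here both nearby words receive mass, which is what makes the histogram more robust to quantization error.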

SLIDE 9

Vocabulary Size & Weighting Scheme

[Chart: Mean Average Precision on TRECVID 2006 test data vs. vocabulary size (500; 1,000; 5,000; 10,000) for Binary, TF, TF-IDF, and Soft weighting; y-axis 0.02-0.12.]

– Soft weighting
  • Improves TF by 10%-20%
– More accurate in assessing the importance of a keypoint

SLIDE 10

Spatial Information

  • Partition image into equal-sized regions
  • Concatenate BoW features from the regions
– Poor generalizability

F = (f11, f12, f13, f21, f22, f23, f31, f32, f33)
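The partition-and-concatenate scheme can be sketched as follows. Hard-assignment counts are used for brevity (the deck pairs this with soft weighting), and all names are illustrative.

```python
import numpy as np

def spatial_bow(keypoints, words, image_size, vocab_size, grid=3):
    """Concatenate per-region BoW histograms over a grid x grid
    partition of the image, yielding F = (f11, f12, ..., f33).

    keypoints: (n, 2) array of (x, y) positions
    words:     (n,) array of visual-word indices for those keypoints
    """
    w, h = image_size
    feats = np.zeros((grid, grid, vocab_size))
    for (x, y), word in zip(keypoints, words):
        col = min(int(x * grid / w), grid - 1)  # which grid cell
        row = min(int(y * grid / h), grid - 1)
        feats[row, col, word] += 1              # per-region count
    return feats.reshape(-1)                    # concatenate regions

pts = np.array([[10, 10], [90, 90]])
ids = np.array([0, 1])
F = spatial_bow(pts, ids, image_size=(100, 100), vocab_size=2, grid=3)
```

The concatenated vector is grid² times longer than the plain histogram, which is one reason fine partitions (3×3, 4×4) generalize poorly: each region sees few keypoints and the feature becomes sparse and position-sensitive.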

SLIDE 11

Spatial Information

[Chart: mean average precision on TRECVID 2006 test data (soft weighting) vs. vocabulary size (500-10,000) for 1×1, 2×2, 3×3, and 4×4 region layouts; y-axis 0.00-0.14.]

– Spatial information does not help much for concept detection
  • 2×2 is a good choice
  • 3×3 and 4×4 may cause mismatch problems

SLIDE 12

Local Feature Representation Framework

  • K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, L. Van Gool, "A comparison of affine region detectors", IJCV, vol. 65, pp. 43-72, 2005.

SLIDE 13

Internal Results – Local Features

  • Over TRECVID 2008 Test Data

[Chart: average precision (y-axis 0.02-0.16) for the 1×1 (3 detectors), 1×1 (2 detectors), 2×2, and 1×3 configurations, which perform similarly, and for Run6, their fusion (13% gain, MAP 0.157).]

SLIDE 14

Failure Cases – I

  • Flower (misses)
– Small visual area
– Coloration/texture too similar to background scene
  • Possible Solutions
– Color descriptor
– Class-specific visual words

SLIDE 15

Failure Cases – II

  • Boat_Ship, Airplane_flying (misses)
– Learning biased by background scene
– Difficulty from occlusion
  • Possible Solution
– Feature selection

SLIDE 16

Summary – Local Features

  • BoW with good representation choices achieved very impressive performance
  • Soft weighting is very effective
  • Multiple spatial layouts are useful
  • Multiple detectors do not help much
  • Room for future improvement
– Class-specific visual words, feature selection, color descriptors, etc.

SLIDE 17

Outline

[Outline slide repeating the component diagram from Slide 2: five components, SVM classifiers, six runs.]

SLIDE 18

Global Features

  • Grid-based color moments (225-d)
  • Wavelet texture (81-d)

[Chart: per-concept average precision (y-axis 0.05-0.4) comparing A_CU_6_local_only_6 and A_CU_5_local_global_5.]

SLIDE 19

Outline

[Outline slide repeating the component diagram from Slide 2: five components, SVM classifiers, six runs.]

SLIDE 20

CU-VIREO374

  • Fusion of Columbia374 and VIREO374

              Feature                               Dimension
Columbia374   Grid-based color moment (LUV)         225
              Gabor texture                         48
              Edge direction histogram              73
VIREO374      Bag-of-visual-words (soft weighting)  500
              Grid-based color moment (Lab)         225
              Grid-based wavelet texture            81

[Chart: performance of CU-VIREO374, VIREO374, and Columbia374 over TRECVID 2006 test data.]

Scores on the TRECVID 2008 corpora: http://www.ee.columbia.edu/ln/dvmm/CU-VIREO374/

Yu-Gang Jiang, Akira Yanagawa, Shih-Fu Chang, Chong-Wah Ngo, "Fusing Columbia374 and VIREO-374 for Large Scale Semantic Concept Detection", Columbia University ADVENT Technical Report #223-2008-1, Aug. 2008.

SLIDE 21

Concept Fusion Using CU-VIREO374

  • Train an SVM for each concept
– Using CU-VIREO374 scores as features

[Chart: Mean Average Precision on TRECVID 2008 test data (y-axis 0.02-0.18) for CU-VIREO374, Run5 (local+global), and Run4 (run5+CU-VIREO374); Run4 gains 2.2% over Run5.]

– Performance improvement is merely 2%
  • Need a better concept fusion model!
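The concept-fusion step can be sketched with scikit-learn's SVC on synthetic stand-in scores. The real features would be the 374 CU-VIREO374 detector outputs per shot; the kernel, parameters, and data here are assumptions for illustration, not the submitted system's settings.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_shots, n_detectors = 200, 374

# stand-in for CU-VIREO374 detector scores, one row per shot
X = rng.random((n_shots, n_detectors))
# toy target concept tied to detector 0's score (purely synthetic)
y = (X[:, 0] > 0.5).astype(int)

# one SVM per target concept, trained on the 374-d score vectors
fusion_svm = SVC(kernel="linear", C=1.0).fit(X[:150], y[:150])
acc = fusion_svm.score(X[150:], y[150:])
```

The appeal of this design is that the fusion classifier can learn inter-concept correlations (e.g. Boat_Ship co-occurring with Waterscape) from the score vectors alone; the slide's point is that a plain per-concept SVM exploits too little of that structure.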

SLIDE 22

Outline

[Outline slide repeating the component diagram from Slide 2: five components, SVM classifiers, six runs.]

SLIDE 23

Exploring External Images from the Web

  • Problem
– Sparsity of positive data

Concept             # Positive shots    Concept                  # Positive shots
Classroom           224                 Harbor                   195
Bridge              158                 Telephone                184
Emergency_Vehicle   88                  Street                   1551
Dog                 122                 Demonstration/Protest    134
Kitchen             250                 Hand                     1515
Airplane_flying     72                  Mountain                 239
Two_people          3630                Nighttime                424
Bus                 87                  Boat_Ship                437
Driver              258                 Flower                   582
Cityscape           288                 Singing                  366

Total # of shots in TV'08 Dev: 36,262

SLIDE 24

Challenging Issues

  • How to make use of the large amount of "noisily labeled" web images for concept detection?
– Issue 1: filter the false positive samples

[Figure: Flickr images retrieved for a concept, marked Good / Bad.]

SLIDE 25

Challenging Issues

  • How to make use of the large amount of "noisily labeled" web images for concept detection?
– Issue 1: filter the false positive samples
– Issue 2: overcome the cross-domain problem

[Figure: example Flickr vs. TRECVID images of the same concept.]

SLIDE 26

Preliminary Results

  • Web image set: 18,000 from Flickr
– Issue 1: filter the false positive samples
  • Graph-based semi-supervised learning
– Issue 2: overcome the cross-domain problem
  • Weighted SVM
  • Results
– MAP: no difference
– "Bus": improved 50%
  • Open problem!

[Chart: per-concept average precision (y-axis 0.05-0.3) for A_CU_Run-4 vs. C_CU_Run-3 (bug free).]
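The weighted-SVM idea for the cross-domain issue can be sketched as follows: noisily labeled web samples receive a lower sample weight than trusted in-domain shots. The Gaussian toy data, the 30% flip rate, and the 0.3 weight are all illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def make_data(n, noise_rate):
    """Two Gaussian classes 3 units apart; a fraction of labels flipped."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, 2)) + np.outer(y, [3.0, 0.0])
    flip = rng.random(n) < noise_rate
    return X, np.where(flip, 1 - y, y)

X_dom, y_dom = make_data(100, 0.0)   # trusted in-domain (TRECVID) shots
X_web, y_web = make_data(100, 0.3)   # noisily labeled web (Flickr) images

X = np.vstack([X_dom, X_web])
y = np.concatenate([y_dom, y_web])
# downweight the web samples so the SVM trusts in-domain data more
weights = np.concatenate([np.ones(100), np.full(100, 0.3)])

clf = SVC(kernel="linear").fit(X, y, sample_weight=weights)
X_test, y_test = make_data(200, 0.0)
acc = clf.score(X_test, y_test)
```

Setting the web weight to 1.0 instead would let the 30% label noise pull the decision boundary; the per-sample weight is the single knob that encodes "this source is less reliable."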

SLIDE 27

Outline

[Outline slide repeating the component diagram from Slide 2: five components, SVM classifiers, six runs.]

SLIDE 28

Face Detection and Tracking

  • Face detection (OpenCV toolbox)
  • Tracking based on face location and skin color

[Diagram: for each character, a face detected in a start frame is tracked backward and forward through the shot to the end frame.]
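The skin-color side of the tracker can be sketched without OpenCV. The RGB thresholds below are one common skin-detection heuristic, not necessarily the system's values, and the search-window re-centering is a simplified assumption about how the tracker follows a face between detections.

```python
import numpy as np

def skin_mask(rgb):
    """Boolean mask of skin-colored pixels using a simple RGB rule
    (thresholds are a common heuristic, not the paper's exact values).
    rgb: (h, w, 3) uint8 image."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    return ((r > 95) & (g > 40) & (b > 20)
            & (r - np.minimum(g, b) > 15)
            & (abs(r - g) > 15)
            & (r > g) & (r > b))

def track_region(prev_box, mask):
    """Re-center a face box on the skin pixels inside a slightly
    enlarged search window around the previous detection."""
    x, y, w, h = prev_box
    pad = w // 2
    win = mask[max(0, y - pad):y + h + pad, max(0, x - pad):x + w + pad]
    ys, xs = np.nonzero(win)
    if len(xs) == 0:
        return prev_box                      # lost: keep the old box
    cx = int(xs.mean()) + max(0, x - pad)    # centroid of skin pixels
    cy = int(ys.mean()) + max(0, y - pad)
    return (cx - w // 2, cy - h // 2, w, h)

# toy frame: a skin-colored patch on a dark background
frame = np.zeros((60, 60, 3), dtype=np.uint8)
frame[20:40, 30:50] = [200, 120, 90]         # skin-like RGB
m = skin_mask(frame)
box = track_region((10, 10, 20, 20), m)
```

Running the re-centering backward and forward from each detected frame is what the slide's backward/forward tracking arrows describe: detection anchors the track, color keeps it alive between detections.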

SLIDE 29

"Two_people" Detector

  • 2 people: 250 frames
  • 2 people: 250 frames/person1 & 150 frames/person2
  • 100 frames/1 person & 150 frames/2 people
  • Drawback
– Cannot find a person when the face is too small or invisible
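A heuristic of this kind can be sketched as a simple rule over per-frame face-track counts; the function name and the 40% threshold are assumed values for illustration, not the submitted detector's.

```python
def two_people_score(faces_per_frame, min_fraction=0.4):
    """Heuristic Two_people detector: the fraction of frames in a
    shot whose face tracker reports exactly two faces, thresholded
    to a yes/no decision (threshold is an illustrative choice)."""
    n = len(faces_per_frame)
    frac = sum(1 for c in faces_per_frame if c == 2) / n
    return frac, frac >= min_fraction

# shot like the slide's third example:
# 100 frames with 1 tracked person, 150 frames with 2 people
counts = [1] * 100 + [2] * 150
frac, fired = two_people_score(counts)
```

The drawback on the slide falls out directly: a person whose face is too small or turned away never enters `faces_per_frame`, so the count, and the detector, miss them.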

SLIDE 30

Detecting "Singing" Based on Audio

  • Vibrato
– "the variation of the frequency of a musical instrument or of the voice"
  • Harmonic coefficient Ha
– Corresponds to the most important trigonometric series of the spectrum
– Ha is higher in the presence of singing voice
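A simplified harmonicity measure in the spirit of Ha can be sketched with a normalized autocorrelation. The slide's Ha also involves the spectrum, so this temporal-only proxy, and the assumed pitch range, are illustrative rather than the system's definition.

```python
import numpy as np

def harmonic_coefficient(frame, sr=16000, fmin=80.0, fmax=500.0):
    """Peak of the normalized autocorrelation within the pitch-lag
    range: high for periodic (voiced/sung) frames, low for noise.
    A temporal-only proxy for the slide's Ha."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                       # normalize by zero-lag energy
    lo, hi = int(sr / fmax), int(sr / fmin)
    return float(ac[lo:hi + 1].max())

sr = 16000
t = np.arange(int(0.04 * sr)) / sr        # one 40 ms analysis frame
voiced = np.sin(2 * np.pi * 220 * t)      # periodic, singing-like tone
noise = np.random.default_rng(2).normal(size=len(t))
ha_voiced = harmonic_coefficient(voiced, sr)
ha_noise = harmonic_coefficient(noise, sr)
```

Tracking this value frame by frame also exposes vibrato: for a sung note the peak lag oscillates slowly around the mean pitch period, while the coefficient itself stays high.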

SLIDE 31

Performance – Face & Audio

  • Improve "Two_people" by 4% and "Singing" by 8%
– Simple heuristics help detect specific concepts.

[Chart: average precision (y-axis 0.05-0.3) for Two_people and Singing, comparing CU_4_fuse_baseline_4 and CU_2_face_base_2.]

SLIDE 32

Conclusions

  • Convergence to local features
– Local feature alone achieved an impressive MAP of 0.157
  • Representation choices are critical for good performance
– The combination of local and global features introduces a moderate gain (MAP 0.162)
  • CU-VIREO374
– Useful resource for concept fusion and video search
– A better fusion model is needed
  • Face & audio detectors
– Simple heuristics help detect specific concepts
  • Training from external web images – open problem
– Useful for concepts lacking positive training samples
– Challenges: unreliable labels, domain differences

SLIDE 33

More information at: http://www.ee.columbia.edu/dvmm/