A Compact and Discriminative Face Track Descriptor



SLIDE 1

Omkar M Parkhi, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

A Compact and Discriminative Face Track Descriptor

SLIDE 2

Recognising and verifying faces in videos

Recognition

Verification: same / different

SLIDE 3

VF²: a new compact face track descriptor

▶ Discriminative
▶ Useful for different tasks (Recognition, Verification)
▶ Extremely compact

Face track: sequence of face detections in consecutive frames.

SLIDE 4

Large scale face retrieval

▶ Example of a typical target dataset
  ▶ 5 years of evening news programs
  ▶ 10,000 hrs of broadcast
  ▶ 20 Million frames
  ▶ 2.1 Million face tracks
  ▶ 30 frames per track on average
▶ Real time performance
▶ Typical 4000D descriptor → 1 TB
▶ Our descriptor → 270 MB

http://www.robots.ox.ac.uk/~vgg/research/on-the-fly/
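The two storage figures can be reproduced with a back-of-the-envelope calculation. The sketch below assumes that the typical 4000-D descriptor is stored once per face detection as 32-bit floats, and that the compact descriptor is a single 1024-bit (128-byte) code per track. Both assumptions are inferred from the numbers on the slides, not stated on this one.

```python
# Back-of-the-envelope storage estimate for the retrieval dataset.
# Assumptions (not stated on the slide): float32 storage, one 4000-D
# descriptor per face detection, one 128-byte binary code per track.

NUM_TRACKS = 2_100_000        # face tracks in the dataset
FRAMES_PER_TRACK = 30         # average detections per track
TYPICAL_DIM = 4000            # "typical" descriptor dimensionality
BYTES_PER_FLOAT = 4           # assumed float32
COMPACT_BYTES = 1024 // 8     # 1024-bit binary descriptor = 128 bytes


def typical_storage_bytes() -> int:
    """One real-valued descriptor for every detection in every track."""
    return NUM_TRACKS * FRAMES_PER_TRACK * TYPICAL_DIM * BYTES_PER_FLOAT


def compact_storage_bytes() -> int:
    """One binary descriptor per track."""
    return NUM_TRACKS * COMPACT_BYTES


if __name__ == "__main__":
    print(f"typical: {typical_storage_bytes() / 1e12:.2f} TB")  # typical: 1.01 TB
    print(f"compact: {compact_storage_bytes() / 1e6:.1f} MB")   # compact: 268.8 MB
```

Under these assumptions the gap comes from two factors: one descriptor per track instead of one per frame, and 128 bytes instead of 16,000 bytes per descriptor.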

SLIDE 5

Outline

1. Dense feature computation
2. Fisher Vector encoding
3. Video and jittered pooling
4. Compression by metric learning
5. Binarisation
6. Results

SLIDE 6

1. Dense feature computation

▶ Input: a face track
  ▶ Aligned or unaligned
  ▶ No facial landmarks required (eyes, nose, etc.)
▶ Output: a set of local features
  ▶ Extracted from all frames
  ▶ Dense RootSIFT at multiple scales
  ▶ 64-D PCA
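As a small illustration of the RootSIFT step, the sketch below applies the usual RootSIFT mapping (L1-normalise, then element-wise square root) to one descriptor. Dense multi-scale extraction and the 64-D PCA projection are not shown, and the toy 4-D input is only a stand-in for a real 128-D SIFT histogram.

```python
import math
from typing import List


def root_sift(descriptor: List[float], eps: float = 1e-12) -> List[float]:
    """Map a SIFT histogram to RootSIFT.

    The descriptor is L1-normalised and then square-rooted element-wise.
    """
    l1 = sum(abs(v) for v in descriptor) + eps
    return [math.sqrt(abs(v) / l1) for v in descriptor]


if __name__ == "__main__":
    sift = [4.0, 0.0, 9.0, 3.0]  # toy stand-in for a 128-D SIFT vector
    rs = root_sift(sift)
    print(rs)
    # The squared entries sum to (almost exactly) 1, so the result has
    # unit L2 norm.
    print(sum(v * v for v in rs))
```

Comparing RootSIFT vectors with the Euclidean distance corresponds to comparing the original histograms with the Hellinger kernel, which is the usual motivation for this mapping.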


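SLIDE 8

2. Fisher Vector encoding

Dense SIFT → GMM → FV encoding (first and second order statistics) + sqrt-L2 normalisation

[Perronnin et al. ECCV 2012]

Each feature $x_i$ is assigned to the Gaussians $(\mu_k, \Sigma_k)$ with weights $\gamma_k(x_i)$ (hard assignment). With $M$ features, mixture weights $\pi_k$ and per-dimension standard deviations $\sigma_k$:

$$\Phi = \begin{bmatrix} v_1 \\ u_1 \\ v_2 \\ u_2 \\ \vdots \\ v_K \\ u_K \end{bmatrix}$$

$$v_k = \frac{1}{M\sqrt{\pi_k}} \sum_{i=1}^{M} \gamma_k(x_i)\, \frac{x_i - \mu_k}{\sigma_k}$$

$$u_k = \frac{1}{M\sqrt{2\pi_k}} \sum_{i=1}^{M} \gamma_k(x_i) \left[ \left( \frac{x_i - \mu_k}{\sigma_k} \right)^{2} - 1 \right]$$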
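The encoding on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a GMM with diagonal covariances whose parameters are already known, hard assignment (each feature contributes only to its most likely Gaussian), and "sqrt-L2" read as a signed square root followed by L2 normalisation. The function and variable names are hypothetical, not taken from the authors' code.

```python
import math
from typing import List, Sequence

Vector = List[float]


def hard_assign(x: Sequence[float], weights, means, sigmas) -> int:
    """Index of the Gaussian with the highest weighted log-likelihood for x.

    The constant -0.5 * D * log(2 * pi) is the same for every component
    and is therefore omitted.
    """
    best_k, best_score = 0, -math.inf
    for k, (w, mu, sd) in enumerate(zip(weights, means, sigmas)):
        score = math.log(w)
        for xj, mj, sj in zip(x, mu, sd):
            score += -0.5 * ((xj - mj) / sj) ** 2 - math.log(sj)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def fisher_vector(features, weights, means, sigmas) -> Vector:
    """Fisher Vector [v_1, u_1, ..., v_K, u_K] with sqrt + L2 normalisation."""
    M, K, D = len(features), len(weights), len(means[0])
    v = [[0.0] * D for _ in range(K)]  # first-order statistics
    u = [[0.0] * D for _ in range(K)]  # second-order statistics
    for x in features:
        k = hard_assign(x, weights, means, sigmas)  # gamma is 1 for this k only
        for j in range(D):
            z = (x[j] - means[k][j]) / sigmas[k][j]
            v[k][j] += z
            u[k][j] += z * z - 1.0
    phi: Vector = []
    for k in range(K):
        phi += [val / (M * math.sqrt(weights[k])) for val in v[k]]
        phi += [val / (M * math.sqrt(2.0 * weights[k])) for val in u[k]]
    # Signed square root followed by L2 normalisation.
    phi = [math.copysign(math.sqrt(abs(p)), p) for p in phi]
    norm = math.sqrt(sum(p * p for p in phi)) or 1.0
    return [p / norm for p in phi]


if __name__ == "__main__":
    weights = [0.5, 0.5]
    means = [[0.0, 0.0], [4.0, 4.0]]
    sigmas = [[1.0, 1.0], [1.0, 1.0]]
    feats = [[0.2, -0.1], [3.5, 4.2], [4.1, 3.9]]
    phi = fisher_vector(feats, weights, means, sigmas)
    print(len(phi))  # 8, i.e. 2 * K * D
```

The descriptor length is 2KD regardless of how many features are encoded, which is what allows a whole track to be summarised by one fixed-size vector.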

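SLIDE 9

2. Fisher Vector encoding

Gaussian components as part detectors

Spatial (x, y) augmentation: each local feature is extended with its normalised position in the W × H face image,

$$\left[ \frac{x}{W} - \frac{1}{2},\ \frac{y}{H} - \frac{1}{2} \right]^{\top}$$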
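A minimal sketch of the spatial augmentation, assuming W and H are the width and height of the face image and that the two normalised coordinates are simply appended to the local feature. The function name and the toy values are illustrative only.

```python
from typing import List, Sequence


def augment_with_position(feature: Sequence[float], x: float, y: float,
                          width: float, height: float) -> List[float]:
    """Append normalised coordinates (x/W - 1/2, y/H - 1/2) to a local feature."""
    return list(feature) + [x / width - 0.5, y / height - 0.5]


if __name__ == "__main__":
    # Toy 3-D feature located at the centre of a 160 x 125 image.
    print(augment_with_position([0.1, 0.2, 0.3], x=80, y=62.5,
                                width=160, height=125))
    # -> [0.1, 0.2, 0.3, 0.0, 0.0]
```

Because position becomes part of the feature, each Gaussian in the GMM covers both an appearance and a location, which is consistent with reading the components as part detectors.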


SLIDE 11

3. Video and jittered pooling

▶ Typically each frame is pooled independently
▶ Complex inference procedures combining multiple descriptors
▶ Large memory footprint

[Sivic et al. CVPR 09, Everingham et al. IVC 09, Wolf et al. CVPR 2011]

SLIDE 12

3. Video and jittered pooling

▶ Single descriptor per track
▶ Smaller memory footprint
▶ Easy to use
▶ Improved performance

[Application to Action Recognition: Oneata, Verbeek, Schmid ICCV 2013]
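The difference between per-frame pooling and video pooling can be sketched as follows. The encoder used here is a deliberately simple stand-in (a per-dimension mean, not a Fisher Vector), and all names are hypothetical. Jittered pooling is not shown.

```python
from typing import Callable, List, Sequence

Feature = Sequence[float]
Encoder = Callable[[List[Feature]], List[float]]


def mean_encoder(features: List[Feature]) -> List[float]:
    """Stand-in encoder (NOT a Fisher Vector): per-dimension mean."""
    dim = len(features[0])
    return [sum(f[j] for f in features) / len(features) for j in range(dim)]


def frame_pooling(track: List[List[Feature]], encode: Encoder) -> List[List[float]]:
    """One descriptor per frame: storage grows with the track length."""
    return [encode(frame) for frame in track]


def video_pooling(track: List[List[Feature]], encode: Encoder) -> List[float]:
    """One descriptor per track: features from all frames are encoded together."""
    pooled = [f for frame in track for f in frame]
    return encode(pooled)


if __name__ == "__main__":
    # A toy track with three frames and 2-D local features.
    track = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0]],
        [[7.0, 8.0], [9.0, 10.0]],
    ]
    print(len(frame_pooling(track, mean_encoder)))  # 3 descriptors
    print(video_pooling(track, mean_encoder))       # [5.0, 6.0]
```

With video pooling, comparing two tracks needs a single distance computation instead of a procedure that combines many frame-to-frame comparisons.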

SLIDE 13

3. Video and jittered pooling

▶ Data augmentation
▶ Data augmentation without training set increase
▶ Improvement in the performance

[Paulin et al. CVPR 2014]


SLIDE 15

4. Metric Learning

Learn to discriminate faces

$z = Wx$, where $x$ is the Fisher Vector and $W$ is the learnt projection.

$$d^2_W(x, y) = \|Wx - Wy\|^2$$

Same person (x, y):

$$d^2_W(x, y) = \|Wx - Wy\|^2 < b$$

Different people (u, v):

$$d^2_W(u, v) = \|Wu - Wv\|^2 > b$$

[Simonyan, Parkhi, Vedaldi, Zisserman BMVC 2013]
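The verification rule on this slide can be sketched as below. The projection W in the example is a hand-picked toy matrix, not a learnt one, and the threshold b is arbitrary. Learning W and b from labelled pairs is not shown.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def project(W: Matrix, x: Sequence[float]) -> List[float]:
    """Compute z = W x for a matrix W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def squared_distance(W: Matrix, x: Sequence[float], y: Sequence[float]) -> float:
    """d^2_W(x, y) = ||W x - W y||^2."""
    zx, zy = project(W, x), project(W, y)
    return sum((a - b) ** 2 for a, b in zip(zx, zy))


def same_person(W: Matrix, x: Sequence[float], y: Sequence[float], b: float) -> bool:
    """Declare a match when the projected squared distance is below b."""
    return squared_distance(W, x, y) < b


if __name__ == "__main__":
    # Toy 2 x 3 projection that keeps the first two dimensions only.
    W = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]
    x, y = [1.0, 2.0, 3.0], [1.0, 2.0, 9.0]   # differ only in the dropped dimension
    u, v = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]   # differ in the kept dimensions
    print(squared_distance(W, x, y), same_person(W, x, y, b=1.0))  # 0.0 True
    print(squared_distance(W, u, v), same_person(W, u, v, b=1.0))  # 25.0 False
```

Since W has fewer rows than columns, the same matrix that defines the metric also compresses the Fisher Vector: only z = Wx needs to be stored.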


SLIDE 17

5. Binarisation

Parseval Tight Frame

▶ Low-dimensional real-valued descriptor → high dimensional binary
▶ 4x decrease in memory footprint (128D real → 1024D binary)
▶ Fast distance computation
▶ Alternative binarisation methods could be used

The m-dimensional real-valued descriptor $z$ is mapped to q bits only:

$$\operatorname{sign}(Uz)$$

where $U$ is a q × m matrix formed from columns of a random rotation matrix.

[Jégou et al. ICASSP 2012, Simonyan et al. PAMI 2014]
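A minimal sketch of sign(Uz) binarisation in plain Python. It assumes that U can be built by orthonormalising m random Gaussian vectors of length q, which gives the same kind of matrix as keeping m columns of a random q × q orthogonal matrix. Codes are packed into Python integers so that distances reduce to a bit count. All names are hypothetical.

```python
import math
import random
from typing import List


def random_frame(q: int, m: int, seed: int = 0) -> List[List[float]]:
    """Return m orthonormal vectors of length q: the columns of U (q >= m)."""
    rng = random.Random(seed)
    cols: List[List[float]] = []
    while len(cols) < m:
        v = [rng.gauss(0.0, 1.0) for _ in range(q)]
        for c in cols:  # Gram-Schmidt: remove components along earlier columns
            dot = sum(a * b for a, b in zip(v, c))
            v = [a - dot * b for a, b in zip(v, c)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            cols.append([a / norm for a in v])
    return cols


def binarise(cols: List[List[float]], z: List[float]) -> int:
    """Pack sign(U z) into an integer, one bit per row of U."""
    q = len(cols[0])
    code = 0
    for i in range(q):
        proj = sum(cols[j][i] * z[j] for j in range(len(z)))
        code = (code << 1) | (1 if proj >= 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    U_cols = random_frame(q=32, m=4, seed=1)
    z1 = [0.3, -1.2, 0.7, 0.05]
    z2 = [0.31, -1.19, 0.72, 0.04]   # small perturbation of z1
    z3 = [-v for v in z1]            # opposite direction
    c1, c2, c3 = (binarise(U_cols, z) for z in (z1, z2, z3))
    print(hamming(c1, c2))  # small: nearby descriptors share most bits
    print(hamming(c1, c3))  # 32: every sign flips
```

For the sizes quoted on the slide, and assuming 32-bit floats, 128 real values take 512 bytes while 1024 bits take 128 bytes, which matches the stated 4x saving.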


SLIDE 19

Face Verification

YouTube Faces Dataset

▶ Face verification in videos
▶ 3,425 videos of 1,595 celebrities
▶ Videos collected from internet
▶ Wide pose, expression and illumination variation
▶ 10 splits of 600 pairs of videos
  ▶ Restricted setting: Use provided pairs
  ▶ Unrestricted setting: Free to form own pairs

[Figure: example video pairs labelled "different" and "same"]

[Wolf, Hassner, Maoz CVPR 2011]

SLIDE 20

Face Verification

YouTube Faces Dataset

[Bar chart: Error for each configuration]

▶ Image Pool (Soft assignment FV)
▶ Video Pool (Soft assignment FV)
▶ Video Pool (Hard assignment FV)
▶ Video Pool + Jittered Pool
▶ Video Pool + Binar. 1024 bit + jitt.
▶ Video Pool + Joint sim. + jitt.

Bar values as extracted (their order may not match the label order): 12.3, 13.4, 14.2, 16.2, 15, 17.3

SLIDE 21

Face Verification

YouTube Faces Dataset

[Bar chart: Error for each method]

▶ MBGS & SVM-
▶ APEM FUSION
▶ STFRD & PMML
▶ VSOF & OSS (Adaboost)
▶ DDML (Combined)
▶ VF² 1024D (binary)
▶ VF² 256D
▶ Deep Face (facebook.com)

Bar values as extracted (their order may not match the label order): 8.6, 12.3, 13.4, 18.5, 20, 19.9, 21.4, 21.2

Chart annotation: Requires additional training data.

SLIDE 22

Weakly supervised face classification

Oxford Buffy Dataset

▶ “Buffy The Vampire Slayer”
▶ Face tracks from 7 episodes of season 5
▶ Both frontal and profile detections
▶ Weak supervision from transcript and subtitles
▶ Multi-class classification for every episode

[Everingham et al. IVC 2009, Sivic et al. CVPR 2009]

SLIDE 23

Weakly supervised classification

Oxford Buffy Dataset

[Bar chart: Avg. AP for each method]

▶ Sivic et al. (HOG RBF MKL)
▶ VF² (GMMs trained on Buffy)
▶ VF² (GMMs trained on YTF)
▶ VF² (GMMs trained on YTF) + Jitt. Pool 1024D
▶ VF² (GMMs trained on YTF 2048b)

Bar values as extracted (their order may not match the label order): 0.82, 0.86, 0.8, 0.81, 0.81

SLIDE 24

Recap

Very simple yet powerful face track descriptor

▶ Track descriptor in 128 bytes
▶ Face landmarks and alignment not required
▶ One descriptor per track
▶ State of the art/comparable results on multiple tasks
  ▶ YouTube Faces Dataset
  ▶ Oxford Buffy Dataset
▶ Can be trained with a very small amount of data
▶ Extremely easy to compute

Code online soon. Questions?