[PPT] - A Compact and Discriminative Face Track Descriptor Omkar M Parkhi, PowerPoint Presentation

SLIDE 1

Omkar M Parkhi, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman

A Compact and Discriminative Face Track Descriptor

SLIDE 2

Recognising and verifying faces in videos

Recognition

Verification: same / different

SLIDE 3

VF²: a new compact face track descriptor

▶ Discriminative
▶ Useful for different tasks (Recognition, Verification)
▶ Extremely compact

Face track: sequence of face detections in consecutive frames.

SLIDE 4

Large scale face retrieval

▶ Example of a typical target dataset
  ▶ 5 years of evening news programs
  ▶ 10,000 hrs of broadcast
  ▶ 20 million frames
  ▶ 2.1 million face tracks
  ▶ 30 frames per track on average
▶ Real time performance
▶ Typical 4000D descriptor → 1 TB
▶ Our descriptor → 270 MB

http://www.robots.ox.ac.uk/~vgg/research/on-the-fly/
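The two storage figures on this slide can be checked with a rough calculation. The sketch below assumes 4-byte floats and one 4000-D descriptor per frame for the typical case, and one 128-byte code per track for the compact case (the 128-byte size is the one quoted in the recap slide). These storage assumptions are not stated on the slide itself.

```python
# Back-of-the-envelope storage estimate for the dataset on this slide.
# Assumptions (not on the slide): real-valued descriptors are stored as
# 4-byte floats, and the "typical" pipeline keeps one descriptor per frame.
NUM_TRACKS = 2_100_000
FRAMES_PER_TRACK = 30
TYPICAL_DIM = 4000
BYTES_PER_FLOAT = 4
TRACK_DESCRIPTOR_BYTES = 128  # one 1024-bit binary code per track

typical_bytes = NUM_TRACKS * FRAMES_PER_TRACK * TYPICAL_DIM * BYTES_PER_FLOAT
compact_bytes = NUM_TRACKS * TRACK_DESCRIPTOR_BYTES

print(f"per-frame 4000-D floats: {typical_bytes / 1e12:.2f} TB")    # 1.01 TB
print(f"one 128-byte code per track: {compact_bytes / 1e6:.1f} MB")  # 268.8 MB
```

Under these assumptions the totals land at roughly 1 TB and 270 MB, matching the figures quoted above.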

SLIDE 5

Outline

1. Dense feature computation
2. Fisher Vector encoding
3. Video and jittered pooling
4. Compression by metric learning
5. Binarisation
6. Results

SLIDE 6

1. Dense feature computation

▶ Input: a face track
  ▶ Aligned or unaligned
  ▶ No facial landmarks required (eyes, nose, etc.)
▶ Output: a set of local features
  ▶ Extracted from all frames
  ▶ Dense RootSIFT at multiple scales
  ▶ 64-D PCA
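A minimal sketch of the last two output steps, assuming the dense SIFT descriptors are already available as a NumPy array. RootSIFT is taken here in its usual form (L1 normalisation followed by an element-wise square root), and the PCA is fitted with a plain SVD. The random input and the function names are illustrative stand-ins, not the authors' code.

```python
import numpy as np

def root_sift(sift, eps=1e-12):
    """L1-normalise each SIFT descriptor, then take the element-wise square root."""
    sift = np.asarray(sift, dtype=np.float64)
    l1 = np.abs(sift).sum(axis=1, keepdims=True)
    return np.sqrt(sift / (l1 + eps))

def fit_pca(features, out_dim=64):
    """Return (mean, projection) for a PCA that keeps out_dim components."""
    mean = features.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:out_dim].T

def apply_pca(features, mean, projection):
    """Centre the features and project them onto the kept components."""
    return (features - mean) @ projection

rng = np.random.default_rng(0)
# Stand-in for dense SIFT: 500 non-negative 128-D descriptors (random, illustrative only).
sift = rng.random((500, 128))
rs = root_sift(sift)
mean, proj = fit_pca(rs, out_dim=64)
local_features = apply_pca(rs, mean, proj)
print(local_features.shape)  # (500, 64)
```

In practice the PCA would be fitted once on training descriptors and then reused for every track.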

SLIDE 8

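2. Fisher Vector encoding

[Perronnin et al. ECCV 2012]

Dense SIFT features $x_i$ → GMM with Gaussians $(\mu_k, \Sigma_k)$ → hard assignment $\gamma_k(x_i)$ → FV encoding (first and second order statistics) + sqrt-L2 normalisation

$$v_k = \frac{1}{M\sqrt{\pi_k}} \sum_{i=1}^{M} \gamma_k(x_i)\, \frac{x_i - \mu_k}{\sigma_k}$$

$$u_k = \frac{1}{M\sqrt{2\pi_k}} \sum_{i=1}^{M} \gamma_k(x_i) \left[ \left( \frac{x_i - \mu_k}{\sigma_k} \right)^{2} - 1 \right]$$

$$\Phi = \begin{bmatrix} v_1 & u_1 & v_2 & u_2 & \cdots & v_K & u_K \end{bmatrix}^{\top}$$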
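The first and second order statistics on this slide can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the GMM is diagonal, its parameters (`means`, `sigmas`, `weights`) are toy values rather than a model fitted by EM, and "sqrt-L2 normalisation" is read as a signed square root followed by L2 normalisation.

```python
import numpy as np

def fisher_vector_hard(X, means, sigmas, weights):
    """X: (M, D) features; means, sigmas: (K, D); weights: (K,) mixture weights pi_k."""
    M, D = X.shape
    K = means.shape[0]
    # Normalised deviations (x_i - mu_k) / sigma_k, shape (M, K, D).
    dev = (X[:, None, :] - means[None, :, :]) / sigmas[None, :, :]
    # Diagonal-Gaussian log-likelihood plus log prior, up to an additive constant.
    log_post = (np.log(weights)[None, :]
                - np.log(sigmas).sum(axis=1)[None, :]
                - 0.5 * (dev ** 2).sum(axis=2))
    # Hard assignment: gamma_k(x_i) is 1 for the most likely Gaussian, 0 otherwise.
    gamma = np.zeros((M, K))
    gamma[np.arange(M), log_post.argmax(axis=1)] = 1.0
    # First order (v_k) and second order (u_k) statistics.
    v = (gamma[:, :, None] * dev).sum(axis=0) / (M * np.sqrt(weights)[:, None])
    u = (gamma[:, :, None] * (dev ** 2 - 1)).sum(axis=0) / (M * np.sqrt(2 * weights)[:, None])
    phi = np.concatenate([v, u], axis=1).ravel()  # [v_1, u_1, ..., v_K, u_K]
    # Signed square root, then L2 normalisation.
    phi = np.sign(phi) * np.sqrt(np.abs(phi))
    return phi / max(np.linalg.norm(phi), 1e-12)

rng = np.random.default_rng(0)
K, D = 4, 64                      # toy sizes, chosen only for the example
X = rng.normal(size=(200, D))
means = rng.normal(size=(K, D))
sigmas = np.ones((K, D))
weights = np.full(K, 1.0 / K)
phi = fisher_vector_hard(X, means, sigmas, weights)
print(phi.shape)  # (512,), i.e. 2 * K * D
```

With hard assignment each feature contributes to a single Gaussian, so the sums can be accumulated per component without computing soft posteriors.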

SLIDE 9

2. Fisher Vector Encoding

Gaussian components as part detectors

Spatial (x, y) augmentation, appended to each local feature:

$$\left[ \frac{x}{W} - \frac{1}{2},\; \frac{y}{H} - \frac{1}{2} \right]^{\top}$$
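As a small illustration of the spatial augmentation, the sketch below appends the two normalised coordinates to each local feature. It assumes that W and H are the width and height of the face image and that (x, y) is the pixel location of the feature. The function name and the toy values are hypothetical.

```python
import numpy as np

def augment_with_location(features, xs, ys, width, height):
    """Append (x/W - 1/2, y/H - 1/2) to each local feature."""
    coords = np.stack([xs / width - 0.5, ys / height - 0.5], axis=1)
    return np.hstack([features, coords])

features = np.zeros((3, 64))            # three toy 64-D features
xs = np.array([0.0, 80.0, 160.0])       # assumed pixel locations
ys = np.array([0.0, 40.0, 160.0])
aug = augment_with_location(features, xs, ys, width=160, height=160)
print(aug.shape)  # (3, 66)
```

Because the location becomes part of the feature, each Gaussian of the GMM models appearance and position jointly, which fits the "part detector" reading above.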

SLIDE 11

3. Video and jittered pooling

▶ Typically each frame is pooled independently
▶ Complex inference procedures combining multiple descriptors
▶ Large memory footprint

[Sivic et al. CVPR 2009, Everingham et al. IVC 2009, Wolf et al. CVPR 2011]

SLIDE 12

3. Video and jittered pooling

▶ Single descriptor per track
▶ Smaller memory footprint
▶ Easy to use
▶ Improved performance

[Application to Action Recognition: Oneata, Verbeek, Schmid ICCV 2013]

SLIDE 13

3. Video and jittered pooling

▶ Data augmentation
▶ Jittered pooling: data augmentation without increasing the training set
▶ Improved performance

[Paulin et al. CVPR 2014]
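One way to picture video and jittered pooling is as a single pooled set of local features per track. Features from every frame, and from jittered copies of every frame, go into the same set before encoding, so the number of training tracks does not grow. In the sketch below the patch-based extractor is a hypothetical stand-in for dense RootSIFT, and the horizontal flip is an assumed jitter. The slides do not say which transformations are used.

```python
import numpy as np

def extract_local_features(frame, patch=8):
    """Hypothetical stand-in for dense RootSIFT: flattened non-overlapping patches."""
    h, w = frame.shape
    rows = [frame[r:r + patch, c:c + patch].ravel()
            for r in range(0, h - patch + 1, patch)
            for c in range(0, w - patch + 1, patch)]
    return np.array(rows)

def pooled_track_features(frames, jitter=True):
    """Stack local features from every frame (and its mirrored copy) into one set."""
    sets = []
    for frame in frames:
        sets.append(extract_local_features(frame))
        if jitter:
            # Assumed jitter: horizontal flip of the face image.
            sets.append(extract_local_features(frame[:, ::-1]))
    return np.vstack(sets)

rng = np.random.default_rng(0)
track = [rng.random((32, 32)) for _ in range(5)]  # 5 toy frames
pooled = pooled_track_features(track)
print(pooled.shape)  # (160, 64): 5 frames x 2 copies x 16 patches
```

The pooled set would then go through one Fisher Vector encoding, giving one descriptor for the whole track.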

SLIDE 15

4. Metric Learning

Learn to discriminate faces [Simonyan, Parkhi, Vedaldi, Zisserman BMVC 2013]

$$z = Wx$$

where $x$ is the Fisher Vector and $W$ is the learnt projection.

$$d_W^2(x, y) = \lVert Wx - Wy \rVert^2$$

Same person $(x, y)$:

$$d_W^2(x, y) = \lVert Wx - Wy \rVert^2 < b$$

Different people $(u, v)$:

$$d_W^2(u, v) = \lVert Wu - Wv \rVert^2 > b$$

SLIDE 17

5. Binarisation

Parseval Tight Frame [Jégou et al. ICASSP 2012, Simonyan et al. PAMI 2014]

▶ Low-dimensional real-valued descriptor → high-dimensional binary
▶ 4x decrease in memory footprint (128D real → 1024D binary)
▶ Fast distance computation
▶ Alternative binarisation methods could be used

$$z \in \mathbb{R}^{m} \;\mapsto\; \operatorname{sign}(Uz), \qquad U \in \mathbb{R}^{q \times m}$$

where $z$ is the real-valued descriptor and $U$ consists of columns from a random rotation matrix. The result takes $q$ bits only.
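The binarisation step can be sketched as below, assuming U is built by keeping the first m columns of a random q × q orthogonal matrix (obtained here from a QR decomposition) and that binary codes are compared with the Hamming distance. The sizes follow the 128D real → 1024D binary example on this slide. The function names are illustrative.

```python
import numpy as np

def parseval_tight_frame(q, m, rng):
    """q x m matrix U with U.T @ U = I: first m columns of a random orthogonal matrix."""
    orthogonal, _ = np.linalg.qr(rng.normal(size=(q, q)))
    return orthogonal[:, :m]

def binarise(U, z):
    """sign(Uz) stored as q bits (1 where the projection is positive, else 0)."""
    return (U @ z > 0).astype(np.uint8)

def hamming(a, b):
    """Number of differing bits between two codes."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
m, q = 128, 1024
U = parseval_tight_frame(q, m, rng)
z1, z2 = rng.normal(size=m), rng.normal(size=m)
code1, code2 = binarise(U, z1), binarise(U, z2)
packed = np.packbits(code1)
print(packed.nbytes)           # 128 bytes, versus 128 * 4 = 512 bytes as float32
print(hamming(code1, code2))
```

Packing 1024 bits into 128 bytes is where the 4x saving over a 128-D float32 descriptor comes from, and the Hamming distance can be computed with bitwise operations.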

SLIDE 19

Face Verification

YouTube Faces Dataset [Wolf, Hassner, Maoz CVPR 2011]

▶ Face verification in videos
▶ 3,425 videos of 1,595 celebrities
▶ Videos collected from the internet
▶ Wide pose, expression and illumination variation
▶ 10 splits of 600 pairs of videos
▶ Restricted setting: use provided pairs
▶ Unrestricted setting: free to form own pairs

SLIDE 20

Face Verification

YouTube Faces Dataset

[Bar chart; axis: Error; bar values: 12.3, 13.4, 14.2, 16.2, 15, 17.3. Configurations compared:]

▶ Image Pool (soft assignment FV)
▶ Video Pool (soft assignment FV)
▶ Video Pool (hard assignment FV)
▶ Video Pool + Jittered Pool
▶ Video Pool + Binar. 1024 bit + jitt.
▶ Video Pool + Joint sim. + jitt.

SLIDE 21

Face Verification

YouTube Faces Dataset

Error:

▶ MGBS & SVM-: 21.2
▶ APEM FUSION: 21.4
▶ STFRD & PMML: 19.9
▶ VSOF & OSS (Adaboost): 20
▶ DDML (Combined): 18.5
▶ VF² 1024D (binary): 13.4
▶ VF² 256D: 12.3
▶ Deep Face (facebook.com): 8.6 (requires additional training data)

SLIDE 22

Weakly supervised face classification

Oxford Buffy Dataset

▶ “Buffy The Vampire Slayer”
▶ Face tracks from 7 episodes of season 5
▶ Both frontal and profile detections
▶ Weak supervision from transcript and subtitles
▶ Multi-class classification for every episode

[Everingham et al. IVC 2009, Sivic et al. CVPR 2009]

SLIDE 23

Weakly supervised classification

Oxford Buffy Dataset

[Bar chart; axis: Avg. AP; bar values: 0.82, 0.86, 0.8, 0.81, 0.81. Methods compared:]

▶ Sivic et al. (HOG RBF MKL)
▶ VF² (GMMs trained on Buffy)
▶ VF² (GMMs trained on YTF)
▶ VF² (GMMs trained on YTF) + Jitt. Pool 1024D
▶ VF² (GMMs trained on YTF 2048b)

SLIDE 24

Very simple yet powerful face track descriptor

Recap

▶ Track descriptor in 128 bytes
▶ Face landmarks and alignment not required
▶ One descriptor per track
▶ State of the art/comparable results on multiple tasks
  ▶ YouTube Faces Dataset
  ▶ Oxford Buffy Dataset
▶ Can be trained with a very small amount of data
▶ Extremely easy to compute
▶ Code online soon.

Questions?