From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a script (PowerPoint presentation)



SLIDE 1

[Title-slide figure: facetracks labelled ANGELO, SHERLOCK, ANDERSON, DONOVAN, JOHN, LESTRADE]

From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a script

Arsha Nagrani, Andrew Zisserman VGG, University of Oxford

SLIDE 2

VGG, Dept. of Engineering Science, University of Oxford

Goal

Identify every character in every frame of the video

Useful for:
• Content-based browsing, e.g. "Fast-forward to when Sherlock first meets John"
• One step closer to story understanding

SLIDE 3

Previous Approaches

• Transcripts tell us who speaks what; subtitles tell us what is spoken when; aligning the two tells us who speaks when (Everingham et al., 2006)
• These methods rely on transcripts or subtitles as weak supervision
• Because this supervision is only weak, techniques such as Multiple Instance Learning (MIL) are also required (Cour et al., 2009; Bojanowski et al., 2013; Tapaswi et al., 2016)

SLIDE 4

CNN face descriptors are remarkably effective. Can we recognise characters in TV shows using only face images of their actors?

Inputs: raw facetracks from the video, and actor images from the web.

SLIDE 5

Challenges – Different Domains

Actor vs. character: e.g. Benedict Cumberbatch vs. Sherlock Holmes.

Actor images are usually taken from red-carpet photoshoots:
• Frontal
• Good lighting
• Standard expressions

SLIDE 6

Challenges

• Profiles
• Extreme poses
• Partial occlusions
• Small faces, low resolution
• Lighting and contrast

SLIDE 7

How do we deal with this?

1. Augmentation of actor images
2. Character context
3. Speech modality

SLIDE 8

How do we deal with this?

1. Augmentation of actor images
2. Character context
3. Speech modality

SLIDE 9

1. Augmentation of Actor Images

• Down-sampling using bicubic interpolation
• Contrast adjustment
• Horizontal flips
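A minimal sketch of these three augmentations using Pillow. The scale and contrast factors are hypothetical choices for illustration, not values from the paper.

```python
# Sketch of the three actor-image augmentations: bicubic down-sampling,
# contrast adjustment, and horizontal flipping. The scale and contrast
# factors below are illustrative assumptions, not the paper's values.
from PIL import Image, ImageEnhance

def augment(face, scales=(0.5, 0.25), contrasts=(0.7, 1.3)):
    """Return augmented copies of a PIL face image."""
    out = []
    w, h = face.size
    for s in scales:
        # Down-sample with bicubic interpolation, then restore the size,
        # simulating small, low-resolution faces.
        small = face.resize((int(w * s), int(h * s)), Image.BICUBIC)
        out.append(small.resize((w, h), Image.BICUBIC))
    for c in contrasts:
        # Simulate poor lighting and contrast.
        out.append(ImageEnhance.Contrast(face).enhance(c))
    # Horizontal flip.
    out.append(face.transpose(Image.FLIP_LEFT_RIGHT))
    return out
```

Each augmented copy keeps the original resolution, so the downstream CNN sees a consistent input size.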
SLIDE 10

How do we deal with this?

1. Augmentation of actor images
2. Character context
3. Speech modality

SLIDE 11

2. Character Context

Actor and character classifiers have different regions of support.

• VGGFace CNN features are obtained from cropped face regions
• By retraining on character images we also learn the hair, make-up, and expressions of the character
• We learn the hairstyle of the character, not the hairstyle of the actor
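One way to picture the larger region of support is to expand the detected face box before cropping, so the crop covers hair and surrounding context. This helper is a hypothetical illustration; the 0.4 margin is not from the paper.

```python
# Illustrative only: grow a detected face box so the crop also covers
# hair and nearby context before CNN feature extraction. The 0.4 margin
# is a hypothetical choice, not a value from the paper.
def expand_box(x, y, w, h, img_w, img_h, margin=0.4):
    """Expand (x, y, w, h) by `margin` of the box size on each side,
    clipped to the image bounds."""
    dx, dy = int(w * margin), int(h * margin)
    x0, y0 = max(0, x - dx), max(0, y - dy)
    x1, y1 = min(img_w, x + w + dx), min(img_h, y + h + dy)
    return x0, y0, x1 - x0, y1 - y0
```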

SLIDE 12

How do we deal with this?

1. Augmentation of actor images
2. Retraining on facetracks from the TV show
3. Speech modality

SLIDE 13

3. Speech Modality – Voice Classifier

Speaker identification: who is the speaker? Input: an audio segment. Output: a character name, e.g. SHERLOCK.

SLIDE 14

Speech Modality Pipeline

[Pipeline figure: high-confidence facetrack → character face classifier label; audio → active speaker verification (ASV) → spectrogram → VoxCeleb CNN → 1024-d feature → SVM classifier → SHERLOCK]

1. Labels from facetracks
2. Active Speaker Verification
3. Feature Extraction
4. Classification
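The four steps above can be sketched as a small collection routine. Here `face_label`, `is_active_speaker`, and `embed` are hypothetical stand-ins for the character face classifier, SyncNet ASV, and the VoxCeleb CNN respectively.

```python
# Hedged sketch of pipeline steps 1-3: gather (voice-feature, name)
# training pairs only where a high-confidence face label coincides with
# a verified active speaker. The callables are hypothetical stand-ins
# for the face classifier, SyncNet ASV, and the VoxCeleb CNN.
def collect_training_voices(segments, face_label, is_active_speaker, embed):
    X, y = [], []
    for seg in segments:
        name = face_label(seg)            # 1. label from a facetrack
        if name is None:
            continue
        if not is_active_speaker(seg):    # 2. active speaker verification
            continue
        X.append(embed(seg))              # 3. voice feature extraction
        y.append(name)
    return X, y                           # 4. used to train the SVM classifier
```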
SLIDE 15

Active Speaker Verification – SyncNet

Audio and mouth-motion embeddings have a small distance if they are synchronised, i.e. if the visible face is actually speaking.

Chung, J. S., and Zisserman, A. "Out of time: automated lip sync in the wild." Asian Conference on Computer Vision, 2016.
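The decision rule reduces to a distance threshold between the two embeddings. A minimal sketch, where the embeddings and the 0.5 threshold are illustrative assumptions:

```python
# Minimal sketch of the SyncNet decision rule: compare the distance
# between an audio embedding and a mouth-motion embedding against a
# threshold. The embeddings and the 0.5 threshold are illustrative.
import math

def embedding_distance(audio_emb, video_emb):
    """Euclidean distance between two equal-length embeddings."""
    return math.sqrt(sum((a - v) ** 2 for a, v in zip(audio_emb, video_emb)))

def is_synchronised(audio_emb, video_emb, threshold=0.5):
    """Small distance means the visible face is the active speaker."""
    return embedding_distance(audio_emb, video_emb) < threshold
```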
SLIDE 16

Voice Feature Extraction – VGGVox

[Architecture figure: raw audio signal → 512×300 spectrogram → conv layers C1 (7×7×96), C2 (5×5×256), C3–C5 (3×3×256) interleaved with maxpools → avgpool (9×1) → FC6 → FC7 → FC8]

Pretrained on 1,251 speakers (VoxCeleb). Nagrani, A., Chung, J. S., and Zisserman, A. "VoxCeleb: a large-scale speaker identification dataset." INTERSPEECH, 2017.
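The spectrogram front end can be sketched with plain NumPy. The window, hop, and FFT sizes below are assumptions in the spirit of the VoxCeleb setup (25 ms Hamming windows, 10 ms hop, 16 kHz audio), not code from the authors.

```python
# Illustrative spectrogram front end: 25 ms Hamming windows, 10 ms hop,
# 512-point FFT over 16 kHz audio. These parameters are assumptions
# based on the VoxCeleb setup, not the authors' code.
import numpy as np

def spectrogram(signal, sr=16000, win_ms=25, hop_ms=10, n_fft=512):
    win = int(sr * win_ms / 1000)   # 400-sample window
    hop = int(sr * hop_ms / 1000)   # 160-sample hop
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    # Magnitude spectrum per frame; result is (frequency bins, time frames).
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)).T
```

For roughly three seconds of 16 kHz audio this yields about 257×298 bins, close to the 512×300 input on the slide (which counts the full FFT rather than the one-sided spectrum).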

SLIDE 17

Voice Classifier

• One-vs-rest SVM classifier
• Applied to audio segments where the corresponding face is difficult to identify
• Obtains labels for extreme poses and profiles
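A sketch of the one-vs-rest SVM voice classifier with scikit-learn. The paper's features are 1024-d VGGVox embeddings; the tiny 2-d synthetic features and training pairs below are purely illustrative.

```python
# Sketch of the one-vs-rest SVM voice classifier. Real features are
# 1024-d voice embeddings; these 2-d synthetic ones are illustrative.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Hypothetical embeddings for segments whose speaker is already known
# from high-confidence, speaker-verified facetracks.
X_train = [[0.9, 0.1], [1.0, 0.2], [0.1, 0.9], [0.2, 1.0]]
y_train = ["SHERLOCK", "SHERLOCK", "JOHN", "JOHN"]

clf = OneVsRestClassifier(LinearSVC()).fit(X_train, y_train)

# Label an audio segment whose face was too hard to identify (say, a
# profile view) from its voice embedding alone.
pred = clf.predict([[0.95, 0.15]])[0]
```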

SLIDE 18

Putting it all together – Inputs

1. Actor images from the web
• Cast lists (e.g. Benedict Cumberbatch, Martin Freeman, Rupert Graves) are easily available on IMDb
2. Unlabelled facetracks from the TV show
• Facetracks are obtained using tracking by detection
• The goal is to update their labels using all the techniques described so far
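Tracking by detection can be illustrated with a toy linker: per-frame face detections are joined into tracks when consecutive boxes overlap enough. Real systems use stronger association models; the 0.5 IoU threshold here is an assumption.

```python
# Toy sketch of tracking by detection: link per-frame face detections
# into facetracks when consecutive boxes overlap enough (IoU). The 0.5
# threshold and the greedy matching are illustrative simplifications.
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def link_tracks(frames, thresh=0.5):
    """frames: list of per-frame detection lists. Returns facetracks,
    each a list of boxes, by greedily matching each detection to the
    last box of an existing track."""
    tracks = []
    for boxes in frames:
        for box in boxes:
            for track in tracks:
                if iou(track[-1], box) >= thresh:
                    track.append(box)
                    break
            else:
                tracks.append([box])
    return tracks
```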
SLIDE 19

Putting it all together

We use three 1-vs-rest SVM face classifiers:

1. Actor Face Classifier
• Trained on augmented actor images only
2. Character Face Classifier
• Trained on character face images; takes face context into account
3. Character Face Classifier after Voice Correction
• Trained on face labels following correction by the voice classifier

SLIDE 20

Propagation of Confident Labels

[Flow figure: actor images → 1. Actor Face Classifier → most confident labels on facetracks → train 2. Character Face Classifier → correct some labels with the Voice Classifier → train 3. Character Face Classifier after Voice Correction]
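The propagation step follows a self-training pattern: train a classifier, pseudo-label the most confident unlabelled tracks, fold them in, and retrain. This is a schematic illustration of that idea, not the paper's exact procedure; the data, rounds, and selection size are synthetic.

```python
# Schematic self-training loop for propagating confident labels (an
# illustration of the idea, not the paper's exact procedure): train an
# SVM, pseudo-label the most confident unlabelled samples, retrain.
import numpy as np
from sklearn.svm import LinearSVC

def propagate(X_lab, y_lab, X_pool, rounds=2, top_k=2):
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_pool)
    clf = LinearSVC().fit(X_lab, y_lab)
    for _ in range(rounds):
        if not pool:
            break
        scores = np.abs(clf.decision_function(pool))
        if scores.ndim > 1:                    # multi-class: best margin
            scores = scores.max(axis=1)
        keep = set(np.argsort(scores)[::-1][:top_k])  # most confident
        preds = clf.predict(pool)
        for i in keep:
            X_lab.append(pool[i])
            y_lab.append(preds[i])
        pool = [p for j, p in enumerate(pool) if j not in keep]
        clf = LinearSVC().fit(X_lab, y_lab)    # retrain on enlarged set
    return clf
```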

SLIDE 21

Demo of results at each stage


SLIDE 22

Results at each stage - Sherlock

[Figure: PR curve (per-sample accuracy vs. proportion of tracks) for episode E01 and per-character accuracy, comparing the three classifiers. AP: face (actor) 0.98; face (character) 0.99; face + voice (character) 0.99. Plot annotations: "very few actor images", "many speaking parts".]

SLIDE 23

Results - Casablanca

[Example frames with predicted labels RICK, ILSA, LASZLO, RENAULT, including profiles, dark faces, and partial occlusions]

SLIDE 24

Comparison to state-of-the-art - Casablanca

[Figure: per-sample accuracy vs. proportion of tracks for the four methods below]

• Bojanowski '13 [1]: AP 0.75
• Parkhi '15 [2]: AP 0.93
• Actor face only: AP 0.89
• Our method (final): AP 0.96

[1] P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic, "Finding actors and actions in movies", ICCV, 2013
[2] O. M. Parkhi, E. Rahtu, A. Zisserman, "It's in the bag: Stronger supervision for automated face labelling", ICCV Workshop, 2015

SLIDE 25

Demo


SLIDE 26

What have we missed?

• Small and very dark faces
• Extreme occlusion cases where the character is not speaking
• Backs of heads

SLIDE 27

Summary

• A novel approach that eschews transcripts, subtitles, and manual annotation
• A multimodal method using both voice and face context for recognition
• Recognises profiles, partial occlusions, and extreme poses
• Beats the state of the art on the Casablanca dataset