TEXT, LANGUAGE, AND IMAGERY
Yu-Ting Peng
Resource - Scripts
Resource - Subtitles
Resource - News
Resource - Wikipedia

Paper | Resource | Objective
Names and Faces in the News, by T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-Miller and D. Forsyth, CVPR 2004. | News photos | Name faces
Everingham, J. Sivic and A. Zisserman, BMVC 2006. | Movies / TV series (video) | Name faces
Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. | Movies / TV series (video) | Action retrieval and movie structure recovery
2008. | Corel dataset (images) | Learning classifiers for nouns and relationships (prep. & adj.)
Aim: Given an input image and an associated
Application: to label faces in news images or to
half a million news pictures and captions from
Obtained 44,773 face images
more realistic than usual face recognition
it contains faces captured “in the wild” in a variety
Build a lexicon of names by identifying two or more capitalized words followed by a present tense verb.
Words are classified as verbs by first applying a list of morphological rules to present tense singular forms, and then comparing these to a database of known verbs.
This lexicon is matched to each caption.
Each face detected in an image is associated with every name extracted from the associated caption.
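
The name-extraction rule above can be sketched in a few lines of Python. This is only an illustration: the verb list, the suffix rules, the regular expression, the function names, and the example captions are all made-up stand-ins, not the lexicon or rules used in the paper.

```python
import re

# Hypothetical stand-in for the "database of known verbs" (base forms only).
KNOWN_VERBS = {"speak", "watch", "wave", "arrive", "carry", "say"}

def base_forms(word):
    """Toy morphological rules: map a present tense singular form
    to the base forms it could come from."""
    w = word.lower()
    forms = []
    if w.endswith("ies"):
        forms.append(w[:-3] + "y")   # carries -> carry
    if w.endswith("es"):
        forms.append(w[:-2])         # watches -> watch
    if w.endswith("s"):
        forms.append(w[:-1])         # speaks -> speak
    return forms

def is_present_tense_verb(word):
    """A word counts as a verb if one of its base forms is a known verb."""
    return any(form in KNOWN_VERBS for form in base_forms(word))

# Two or more capitalized words, followed by one lowercase word.
NAME_THEN_WORD = re.compile(r"\b((?:[A-Z][a-z]+\s+)+[A-Z][a-z]+)\s+([a-z]+)")

def extract_names(caption):
    """Return capitalized word sequences that are followed by a present tense verb."""
    return [m.group(1) for m in NAME_THEN_WORD.finditer(caption)
            if is_present_tense_verb(m.group(2))]

def candidate_names_per_face(face_ids, caption):
    """Associate every detected face with every name extracted from the caption."""
    names = extract_names(caption)
    return {face_id: list(names) for face_id in face_ids}
```

For an invented caption such as "Jane Doe speaks to reporters while John Smith watches.", both two-word names are picked up, and every face in the image starts out with both names as candidates.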
Face detector (K. Mikolajczyk) - biased to frontal faces
Rectify face to canonical pose.
points to canonical positions
Remove images with low rectification score
kernel principal components analysis (kPCA) - to
linear discriminant analysis (LDA) - to project
Repeat until convergence
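
One loop that fits "repeat until convergence" is a constrained k-means-style clustering, in which each face may only take a name from its own caption. The sketch below assumes that procedure; the function names, the deterministic initialisation (first candidate name), and the squared Euclidean distance are illustrative choices, not details taken from the slides.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def constrained_kmeans(features, candidates, max_iters=100):
    """features: one feature vector per face.
    candidates: for each face, the names extracted from its caption.
    Returns one name per face."""
    # Assumed initialisation: the first candidate name of each face.
    assign = [names[0] for names in candidates]
    for _ in range(max_iters):
        # Mean feature vector of every name cluster.
        sums, counts = {}, {}
        for x, name in zip(features, assign):
            total = sums.setdefault(name, [0.0] * len(x))
            for i, value in enumerate(x):
                total[i] += value
            counts[name] = counts.get(name, 0) + 1
        means = {n: [v / counts[n] for v in total] for n, total in sums.items()}

        # Reassign each face to the closest mean among its own candidate names.
        new_assign = []
        for x, names, current in zip(features, candidates, assign):
            best = min((n for n in names if n in means),
                       key=lambda n: sqdist(x, means[n]),
                       default=current)
            new_assign.append(best)

        if new_assign == assign:   # no label changed: converged
            break
        assign = new_assign
    return assign
```

The caption constraint is what separates this from plain k-means: a face is never given a name that did not appear alongside it.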
Aim: different names that actually correspond to
merge names that correspond to faces that look
*slides from Andrew Zisserman
Aim - automatically label television or movie
Face detector- by P
Frontal face KLT tracker-point
Reduces the volume
Allows stronger
*Pictures from Andrew Zisserman
Representing face (SIFT descriptor or simple pixel-wise descriptor)
Face normalization (affine transform)
Locate facial features (nine facial features: eyes/nose/mouth)
Matching the appearance of the face can be extremely challenging; clothing can provide additional cues
Represent clothing appearance by detecting a bounding box containing the clothing of a person
Similar clothing appearance suggests the same character, but different clothing does not necessarily imply a different character
Straightforward weighting of the clothing appearance relative to the face appearance proved effective
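
The weighting of clothing against face appearance can be illustrated with a nearest-exemplar rule over a combined distance. This is a minimal sketch under assumptions: the additive form, the default weight of 0.5, the Euclidean distance, the dictionary layout, and the two-dimensional toy descriptors are all hypothetical and are not taken from the slides.

```python
import math

def combined_distance(query, exemplar, clothing_weight=0.5):
    """Face distance plus a down-weighted clothing distance.
    Both arguments are dicts with 'face' and 'clothing' descriptor vectors."""
    face_d = math.dist(query["face"], exemplar["face"])
    clothing_d = math.dist(query["clothing"], exemplar["clothing"])
    return face_d + clothing_weight * clothing_d

def name_by_nearest_exemplar(query, exemplars, clothing_weight=0.5):
    """Return the name of the labelled exemplar closest to the query."""
    best = min(exemplars,
               key=lambda e: combined_distance(query, e, clothing_weight))
    return best["name"]
```

Keeping the clothing weight below 1 matches the asymmetry noted above: similar clothing can break a tie between two equally plausible faces, while a clothing mismatch alone is not enough to overrule a strong face match.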
*Graphs from Andrew Zisserman
Aim: Automatically extracting large collections of
Method: recovering scene structure in movies
Application: semantic retrieval and indexing,
This uncovered structure can be used to analyze
To model the scene structure, we propose a
Screenplay: dialogues, speaker identity, no time-stamps
Closed captions: dialogues, time-stamps, nothing else
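
Because the screenplay carries speaker identity and the closed captions carry time-stamps, matching the dialogue text shared by both gives lines that have a speaker and a time. The sketch below uses greedy word-level string matching from Python's standard `difflib`; this is a simplification chosen for brevity, not the alignment method of the paper, and the function names, the 0.6 threshold, and the example lines are assumptions.

```python
import re
from difflib import SequenceMatcher

def words(text):
    """Lowercase word tokens, punctuation dropped."""
    return re.findall(r"[a-z']+", text.lower())

def attach_speakers(script, captions, min_ratio=0.6):
    """script: list of (speaker, dialogue_line) from the screenplay.
    captions: list of (start_sec, end_sec, text) from the closed captions.
    Returns (speaker_or_None, start_sec, end_sec, text) per caption."""
    labelled = []
    for start, end, caption_text in captions:
        best_speaker, best_ratio = None, 0.0
        for speaker, line in script:
            ratio = SequenceMatcher(None, words(line), words(caption_text)).ratio()
            if ratio > best_ratio:
                best_speaker, best_ratio = speaker, ratio
        if best_ratio < min_ratio:
            best_speaker = None   # no screenplay line is similar enough
        labelled.append((best_speaker, start, end, caption_text))
    return labelled
```

A fuller version would also enforce that matches respect the order of the script, which greedy per-caption matching does not.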
After pronoun resolution
Aim: to learn classifiers
above(statue,rocks);
larger(water,statue)
Each image region is represented by a set of
The classifiers for nouns are based on these
The classifiers for relationships are based on
[Figure: graphical model. Labels: image features (visual features), nouns, parameters of the appearance models, type of relationship, parameters of the relationship model]
The r_jk represent the possible words for the relationship between regions (j, k).
to simultaneously solve for the correspondence
E-step: evaluate possible assignments using the
M-step: Using the probabilistic distribution of
use a Bayesian network to represent our labeling
Previous approaches estimate nouns for regions
near(birds, sea); below(birds, sun); above(sun, sea); larger(sea, sun); brighter(sun, sea); below(waves, sun)
below(coyote, sky); below(bush, sky); left(bush, coyote); greener(grass, coyote); below(grass, sky)
below(building, sky); below(tree, building); below(tree, skyline); behind(buildings, tree); blueish(sky, tree)
above(statue, rocks); larger(water, statue)
below(flowers, horses); below(flowers, foals)
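
Annotations written in this `relation(noun, noun)` form are easy to turn into structured triples before any learning is done. The parser below is a small sketch; the function names and the regular expression are illustrative and assume each argument is a single word.

```python
import re

# relation(noun_a, noun_b), with optional spaces around the arguments.
RELATION = re.compile(r"(\w+)\(\s*(\w+)\s*,\s*(\w+)\s*\)")

def parse_relationships(annotation):
    """Return (relation, noun_a, noun_b) triples from an annotation string."""
    return [m.groups() for m in RELATION.finditer(annotation)]

def vocabulary(triples):
    """Split triples into the set of nouns and the set of relationship words."""
    nouns = {noun for _, a, b in triples for noun in (a, b)}
    relations = {rel for rel, _, _ in triples}
    return nouns, relations
```

For the statue example, `parse_relationships("above(statue,rocks); larger(water,statue)")` yields one triple per relationship, giving three nouns and two relationship words.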
Lots of data out there with both text and images
Text provides potential labels of images
Scripts give cues about scene structure and
Understanding the semantics of language can
extract names wrong association