

SLIDE 1

IMAGE TAG ASSIGNMENT, REFINEMENT AND RETRIEVAL

Xirong Li

Renmin University of China

Tiberio Uricchio

University of Florence

Lamberto Ballan

University of Florence & Stanford University

Marco Bertini

University of Florence

Cees Snoek

University of Amsterdam & Qualcomm Research Netherlands

Alberto Del Bimbo

University of Florence


June 26, 2016

Tutorial

SLIDE 2

ORGANIZATION OF THE TUTORIAL

8:30 – 9:30   Part 1: Introduction; Part 2: Taxonomy
9:30 – 10:00  Part 3: Experimental protocol; Part 4: Evaluation
10:00 – 10:45 Coffee break
10:45 – 12:00 Part 4: Evaluation cont’d
12:00 – 12:30 Part 5: Conclusion and future directions

SLIDE 3

READING MATERIAL

X. Li, T. Uricchio, L. Ballan, M. Bertini, C. Snoek, A. Del Bimbo. Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval. ACM Computing Surveys, 49(1):14, June 2016.

SLIDE 4

PART 1 INTRODUCTION

  • Problem statement
  • Course organization

SLIDE 5

PROGRESS IN IMAGE RETRIEVAL

  • Query-by-Image content


IBM, QBIC

SLIDE 6

PROGRESS IN IMAGE RETRIEVAL

  • Query-by-sketch


Del Bimbo, PAMI 1997

SLIDE 7

PROGRESS IN IMAGE RETRIEVAL

  • By 2000 the problem was well understood


Smeulders, PAMI 2000

SLIDE 8

PROGRESS IN IMAGE RETRIEVAL

  • By 2008 the field blossomed, but social context mostly ignored


Datta, CSUR 2008

SLIDE 9

IMAGES WANT TO BE SHARED


Almost all these services allow users to tag, rate, like, and swipe photos. 


SLIDE 10

DAILY NUMBER OF PHOTOS SHARED ON SELECT PLATFORMS


Mary Meeker Internet Trends 2016

SLIDE 11

BUSINESS CASE


Mary Meeker Internet Trends 2016

SLIDE 12

AVERAGE DAILY TIME SPENT PER USA USER


Mary Meeker Internet Trends 2016

SLIDE 13

EXAMPLES

SLIDE 14

EXAMPLES

SLIDE 15

EXAMPLES

SLIDE 16

PROBLEMS OF TAGS: IRRELEVANCE

  • Tags are few, imprecise, ambiguous, and overly personalized


Nikon Airplane 2016

SLIDE 17

PROBLEMS OF TAGS: DYNAMICS

  • In a social network, users continuously add images and, given the freedom of tagging, create new terms.


Brexit

SLIDE 18

PROBLEMS OF TAGS: SCALE

  • Web-scale quantity of media.

SLIDE 19

THE LONG TAIL OF IMAGE TAGS


  • Some tags are popular and have millions of example images.
  • Others are rare, occurring in few images

Kordumova et al. MMM 2016

SLIDE 20

TAGGING BEHAVIOR

Study by Sigurbjörnsson and van Zwol in WWW 2008 on Flickr

  • The head of the distribution contains tags too generic to be useful (the top 5 most frequent: 2006, 2005, wedding, party, and 2004).
  • The tail contains infrequent tags, with incidentally occurring terms such as misspellings and complex phrases.

SLIDE 21

AN N-GRAM PERSPECTIVE

Study by Kordumova et al. in MMM 2016 on Flickr

  • Most of the frequent tags are unigrams.
  • As the frequency goes down, more bigrams appear.
  • Towards the tail, trigrams and four-grams occur.


Examples: christmas tree, kaffir cat, wine cellar, barrel storage, mediterranean water shrew

SLIDE 22

TAGS PER PHOTO (IN 2008)

  • A few photos are exceptionally well tagged
  • 64% of photos have 1, 2 or 3 tags only.


Sigurbjörnsson and van Zwol, WWW 2008

SLIDE 23

WORDNET CATEGORIES OF TAGS

  • 48% of 3.7M tags could not be matched.


[Pie chart: Unclassified 48%; Location 28%, Artefact or Object 16%, Person or Group 13%, Action or Event 9%, Time 7%, Other 27%]

Sigurbjörnsson and van Zwol, WWW 2008

SLIDE 24

ABOUT THIS TUTORIAL

  • This tutorial focuses on challenges and solutions for content-based image retrieval in the context of online image sharing and tagging.
  • We present a unified review of three closely linked problems: tag assignment, tag refinement, and tag-based image retrieval.
  • We introduce a taxonomy to structure the literature, understand the ingredients of the main works, clarify their connections and differences, and recognize their merits and limitations.
  • We present an open-source testbed, with training sets of varying sizes and three test datasets, to evaluate 11 methods of varied learning complexity. http://www.micc.unifi.it/tagsurvey/

SLIDE 29

TASK: TAG ASSIGNMENT

  • Given an unlabeled image, tag assignment strives to assign a number of tags related to the image content.
  • How many tags? A fixed or variable number?

SLIDE 30

TASK: TAG REFINEMENT

  • Given an image associated with some initial tags, tag refinement aims to remove irrelevant tags from the initial tag list and enrich it with novel, yet relevant, tags.

SLIDE 31

TASK: TAG RETRIEVAL

  • Given a tag and a collection of images labeled with the tag (and possibly other tags), the goal of tag retrieval is to retrieve images relevant with respect to the tag of interest.


Query: bride

SLIDE 32

PART 2 TAXONOMY

  • Foundations
  • Tag relevance
  • A two-dimensional taxonomy
  • Media for tag relevance
  • Learning for tag relevance

SLIDE 33

FOUNDATIONS

The basic elements to be considered when developing methods for tag assignment, refinement and retrieval are:

  • An image x
  • A tag t
  • A user u
  • A user u can share an image x, assigning tag t to it

SLIDE 34

FOUNDATIONS

A set of users U contributes a set of n socially tagged images X. All tags used to describe X form a vocabulary V composed of m tags.


Vocabulary = {court, 1, number, bristol, roby, fishing, me}

SLIDE 35

FOUNDATIONS

  • Depending on the social network, we can assume the availability of a set of user information Θ (e.g., user contacts, geo-localization, etc.)

SLIDE 36

TAG RELEVANCE

Tag assignment, refinement and retrieval share an essential component: a way to measure the relevance between a tag and a given image. This function considers the image x, the tag t, and the prior information Θ:

fΦ(x, t; Θ)
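As a minimal sketch of how one scoring function serves all three tasks (the helper names and toy data below are ours, not from the tutorial):

```python
from typing import Callable, Dict, List

# f(x, t; Theta): tag-relevance score; higher means tag t fits image x better.
# Theta stands for whatever the method precomputed (filtered media, statistics, ...).
TagRelevance = Callable[[str, str, Dict], float]

def assign_tags(x: str, vocabulary: List[str], f: TagRelevance,
                theta: Dict, top_k: int = 3) -> List[str]:
    """Tag assignment: rank the whole vocabulary for an unlabeled image."""
    return sorted(vocabulary, key=lambda t: f(x, t, theta), reverse=True)[:top_k]

def retrieve_images(t: str, images: List[str], f: TagRelevance,
                    theta: Dict) -> List[str]:
    """Tag retrieval: rank labeled images for a query tag."""
    return sorted(images, key=lambda x: f(x, t, theta), reverse=True)

# Toy scoring function: relevance read from a precomputed table inside theta.
toy_theta = {("img1", "dog"): 0.9, ("img1", "cat"): 0.2, ("img2", "dog"): 0.1}
toy_f = lambda x, t, theta: theta.get((x, t), 0.0)

assign_tags("img1", ["dog", "cat"], toy_f, toy_theta, top_k=1)   # ['dog']
retrieve_images("dog", ["img1", "img2"], toy_f, toy_theta)       # ['img1', 'img2']
```

Tag refinement fits the same interface: re-score the image's existing tags plus candidate new ones with f.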

SLIDE 37

EXAMPLE FOR TAG REFINEMENT


Li et al. TMM 2009

SLIDE 38

UNIFIED FRAMEWORK

38

[Framework diagram: training media X is filtered and precomputed into filtered media S and a prior Θ; a learning stage (instance-based or model-based — both inductive — or transduction-based) yields the tag-relevance function fΦ(x, t; Θ), which, given a test image x, a tag t and user information, serves the tasks of assignment, refinement and retrieval.]

SLIDE 40

UNIFIED FRAMEWORK

Training media is obtained from social networks, i.e. with unreliable user-generated annotations. It can be filtered to remove unwanted tags or images.

SLIDE 42

AUXILIARY COMPONENTS: FILTER

  • A common practice is to eliminate overly personalized tags like ‘hadtopostsomething’
  • e.g. by excluding tags that are not part of WordNet or Wikipedia
  • Often tags that do not appear enough times in the collection are eliminated.
  • Reduction of vocabulary size is also important when using an image-tag association matrix
  • Since batch tagging tends to reduce the quality of tags, batch-tagged images can be excluded
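An illustrative sketch of these filter tactics; the reference vocabulary and the frequency threshold below are placeholders, not the tutorial's actual settings:

```python
from collections import Counter
from typing import Iterable, List, Set

def filter_vocabulary(tag_lists: Iterable[List[str]],
                      known_words: Set[str],
                      min_count: int = 2) -> Set[str]:
    """Keep tags that appear in a reference vocabulary (a stand-in for
    WordNet/Wikipedia membership) and occur often enough in the collection."""
    counts = Counter(t for tags in tag_lists for t in set(tags))
    return {t for t, c in counts.items() if t in known_words and c >= min_count}

photos = [["dog", "hadtopostsomething"], ["dog", "beach"], ["beach"]]
filter_vocabulary(photos, known_words={"dog", "beach"})  # {'dog', 'beach'}
```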

SLIDE 43

BATCH TAGGING

A unique user constraint prevents ‘spam’ from batch tagging


Li et al. TMM 2009

SLIDE 44

AUXILIARY COMPONENTS: PRECOMPUTE

  • It is practical to precompute information for the learning stage.
  • A common precomputation is tag occurrence and co-occurrence.
  • Occurrence can be used to penalize excessively frequent tags
  • Co-occurrence is used to capture semantic similarity of tags directly from users’ behavior
  • Semantic similarity is typically obtained by the Flickr context distance
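The occurrence/co-occurrence precomputation can be sketched as follows (helper names are ours):

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple

def tag_statistics(tag_lists: Iterable[List[str]]) -> Tuple[Counter, Counter]:
    """Precompute tag occurrence and pairwise co-occurrence counts from
    per-image tag lists (each image counted once per tag)."""
    occurrence, cooccurrence = Counter(), Counter()
    for tags in tag_lists:
        unique = sorted(set(tags))
        occurrence.update(unique)
        cooccurrence.update(combinations(unique, 2))  # unordered tag pairs
    return occurrence, cooccurrence

occ, cooc = tag_statistics([["bridge", "river"], ["bridge", "city"], ["river"]])
# occ['bridge'] == 2, cooc[('bridge', 'river')] == 1
```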

SLIDE 45

FLICKR CONTEXT DISTANCE

  • Based on the Normalized Google Distance:

NGD(x, y) = (max{log h(x), log h(y)} − log h(x, y)) / (log N − min{log h(x), log h(y)})

where h(x), h(y), h(x, y) are the numbers of images tagged with x, with y, and with both, and N is the collection size.

  • Measures the co-occurrence of two tags relative to their single-tag occurrences.
  • No semantics is involved; works for any tag.

FCS(x, y) = e^(−NGD(x, y)/σ)

Example: FCS(bridge, river) = 0.65

[Jiang et al. 2009]
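A small sketch of these two formulas, assuming the tag counts h(·) and the collection size N are available; σ is a bandwidth parameter (our default below is arbitrary):

```python
import math

def ngd(hx: int, hy: int, hxy: int, n: int) -> float:
    """Normalized Google Distance from tag counts: hx, hy images tagged
    with x / y, hxy images tagged with both, n the collection size."""
    lx, ly, lxy = math.log(hx), math.log(hy), math.log(hxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

def fcs(hx: int, hy: int, hxy: int, n: int, sigma: float = 1.0) -> float:
    """Flickr Context Similarity: exponentiated negative NGD."""
    return math.exp(-ngd(hx, hy, hxy, n) / sigma)

fcs(1000, 1000, 1000, 10**6)  # 1.0: the two tags always co-occur
fcs(1000, 1000, 10, 10**6)    # < 1: rare co-occurrence, lower similarity
```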

SLIDE 46

UNIFIED FRAMEWORK

SLIDE 47

TAXONOMY

Taxonomy structures 60 papers along the Media and Learning dimensions:

Media \ Learning     Instance-based   Model-based   Transduction-based
Tag                  2                1             –
Tag + Image          13               15            12
Tag + Image + User   5                7             3

SLIDE 49

MEDIA FOR TAG RELEVANCE

Depending on the modalities exploited we can divide the methods between those that use:

  • Tag
  • e.g. considering ranking of tags as a proxy of user’s priorities
  • Tag + image
  • e.g. considering the set of tags assigned to an image
  • Tag + image + user information
  • e.g. considering the behaviors of different users tagging similar images

SLIDE 50

MEDIA: TAGS

These methods reduce the problem to text retrieval: find similarly tagged images by

  • user-provided tag ranking [Sun et al. 2011],
  • tag co-occurrence [Sigurbjörnsson and van Zwol 2008; Zhu et al. 2012] or
  • topic modelling [Xu et al. 2009]

These methods assume that test images have already been tagged as well, so they are unsuited for tag assignment.

SLIDE 51

MEDIA: TAGS AND IMAGES

The main idea of these works is to exploit visual consistency, i.e. the fact that visually similar images should have similar tags. Three main approaches:

  • 1. Use visual similarity between test image and database
  • 2. Use similarity between images with same tags
  • 3. Learn classifiers from social images + tags

SLIDE 52

MEDIA: TAGS AND IMAGES

Two tactics to combine the similarity between images and tags

  • 1. Sequential: compute visual similarity, then use the tag modality
  • 2. Simultaneous: use both modalities at the same time
  • A unified graph composed by fusing a visual similarity graph with an image-tag connection graph [Ma et al. 2010]
  • Tag and image similarities as constraints to reconstruct an image-tag association matrix [Wu et al. 2013; Xu et al. 2014; Zhu et al. 2010]

SLIDE 53

MEDIA: TAGS, IMAGES AND USER INFO

In addition to tags and images, this group of works exploits user information, motivated from varied perspectives, such as:

  • User identities [Li et al. 2009b],
  • Tagging preferences [Sawant et al. 2010],
  • User reliability [Ginsca et al. 2014],
  • Photo time stamps [Kim and Xing 2013, McParlane et al. 2013a]
  • Geo-localization [McParlane et al. 2013b]
  • Image group memberships [Johnson et al. 2015]

SLIDE 54

TAXONOMY

SLIDE 55

LEARNING FOR TAG RELEVANCE

  • We can divide the learning methods into transductive and inductive. The former make no distinction between training and test data; the latter can be further divided into methods that produce an explicit model and those that are instance-based.
  • We therefore divide the methods into instance-based, model-based and transduction-based.
  • Inductive methods typically have better computational scalability than transductive ones.

SLIDE 56

INSTANCE BASED

  • This class of methods compares new test images with training instances.
  • There are no parameters, and the complexity grows with the number of instances.
  • Approaches are typically based on variants of k-NN, with or without weighted voting

SLIDE 57

MODEL BASED

  • This class of methods learns its parameters from a training set. A model can be tag-specific or holistic, i.e. one model for all tags.
  • Tag-specific: use linear or fast intersection kernel SVMs trained on features augmented by pre-trained classifiers of popular tags, or on relevant positive and negative examples
  • Holistic: use topic modeling, with relevance computed from a topic vector of the image and a topic vector of the tag.

SLIDE 58

TRANSDUCTION BASED

  • This class of methods evaluates tag relevance for a given image-tag pair by minimizing a cost function over a set of images.
  • The majority of these methods are based on matrix factorization

SLIDE 59

PROS AND CONS

Instance-based
  • Pro: flexible and adaptable to new images and tags.
  • Con: require managing training media, a task that may become complex with increasing amounts of data.

Model-based
  • Pro: training data is represented compactly, leading to swift computations, especially when using linear classifiers.
  • Con: need retraining to cope with new imagery of a tag or when expanding the vocabulary.

Transduction-based
  • Pro: better exploit inter-tag and inter-image relationships, through matrix factorization.
  • Con: difficult to manage large datasets, because of memory and/or computational complexity.

SLIDE 60

UNIFIED FRAMEWORK

SLIDE 61

TAXONOMY

SLIDE 62

TAXONOMY


TagCooccur, SemanticField, TagRanking, KNN, TagVote, TagCooccur+, TagProp, TagFeature, RelExample, RobustPCA, TensorAnalysis

SLIDE 63

ORGANIZATION OF THE TUTORIAL

8:30 – 9:30   Part 1: Introduction; Part 2: Taxonomy
9:30 – 10:00  Part 3: Experimental protocol; Part 4: Evaluation
10:00 – 10:45 Coffee break
10:45 – 12:00 Part 4: Evaluation cont’d
12:00 – 12:30 Part 5: Conclusion and future directions

SLIDE 64

PART 3 OUR EXPERIMENTAL PROTOCOL

  • Limitations in current evaluation
  • Training and test data
  • Evaluation setup

SLIDE 65

LIMITATIONS IN CURRENT EVALUATION

  • Results are not directly comparable
  • homemade datasets
  • selected subsets of a benchmark set
  • varied implementation
  • preprocessing, parameters, features, …
  • Results are not easily reproducible
  • For many methods, no source code or executable is provided
  • Single-set evaluation
  • Split a dataset into training/testing, at risk of overfitting

SLIDE 66

PROPOSED PROTOCOL

  • Results are directly comparable
  • use full-size test datasets
  • same implementation whenever applicable
  • Results are reproducible
  • open-source
  • Cross-set evaluation
  • Training and test datasets are constructed independently

SLIDE 67

SOCIALLY-TAGGED TRAINING DATA

  • Data gathering procedure [Li et al. 2012]
  • using WordNet nouns as queries to uniformly sample Flickr images uploaded between 2006 and 2010

  • remove batch-tagged images (simple yet effective trick to improve data quality)
  • Training sets of varied size
  • Train1M (a random subset of the collected Flickr images)
  • Train100k (a random subset of Train1M)
  • Train10k (a random subset of Train1M)


ImageNet already provides labeled examples for over 20k categories. Is it necessary to learn from socially tagged data?

SLIDE 68

SOCIAL TAGS VERSUS IMAGENET ANNOTATIONS

  • ImageNet annotations
  • computer vision oriented, focusing on fine-grained visual objects
  • single label per image
  • Social tags
  • follow context, trends and events in the real world
  • describe both the situation and the entity presented in the visual content

[Figure: a Flickr user’s album, with dated photos (2007–2008) tagged e.g. poppy, tulip, red, summer, nature, flower, winter, tree, baum, frost, rot, sky, cloud, field]

A Flickr user’s album

Credits: http://www.flickr.com/people/regina_austria

SLIDE 69

IMAGENET EXAMPLES ARE BIASED

  • By web image search engines


  • D. Vreeswijk, K. van de Sande, C. Snoek, A. Smeulders, All Vehicles are Cars: Subclass Preferences in Container Concepts, ICMR 2012

Credit: figure from [Vreeswijk et al. 2012]

SLIDE 70

TEST DATA

  • Three test datasets
  • contributed by distinct research groups


Test dataset               Contributors
MIRFlickr [Huiskes 2010]   LIACS Medialab, Leiden University
NUS-WIDE [Chua 2009]       LMS, National University of Singapore
Flickr51 [Wang 2010]       Microsoft Research Asia

SLIDE 71

MIRFLICKR

  • Image collection
  • 25,000 high-quality photographic images from Flickr
  • Labeling criteria
  • Potential labels: visible to some extent
  • Relevant labels: saliently present
  • Test tag set
  • 14 relevant labels: baby, bird, car, cloud, dog, flower, girl, man, night, people, portrait, river, sea, tree

  • Applicability
  • Tag assignment
  • Tag refinement


  • M. Huiskes, B. Thomee, M. Lew. “New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative”, MIR 2010. http://press.liacs.nl/mirflickr/

SLIDE 72

NUS-WIDE

  • Image collection
  • 260K images randomly crawled from Flickr
  • Labeling criteria
  • An active learning strategy to reduce the amount of manual labeling
  • Test tag set
  • 81 tags covering objects (car, dog), people (police, military), scenes (airport, beach), and events (swimming, wedding)

  • Applicability
  • tag assignment
  • tag refinement
  • tag retrieval


T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y.-T. Zheng. “NUS-WIDE: A Real-World Web Image Database from National University of Singapore”, CIVR 2009. http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm

SLIDE 73

FLICKR51

  • Image collection
  • 80k images collected from Flickr using a predefined set of tags as queries

  • Labeling criteria
  • Given a tag, manually check the relevance of images labelled with the tag
  • Three relevance levels: very relevant, relevant, and irrelevant
  • Test tag set
  • 51 tags, some of them ambiguous, e.g., apple, jaguar
  • Applicability
  • Tag retrieval


[1] M. Wang, X.-S. Hua, H.-J. Zhang. “Towards a relevant and diverse search of social images”, IEEE Transactions on Multimedia 2010
[2] Y. Gao, M. Wang, Z.-J. Zha, J. Sheng, X. Li, X. Wu. “Visual-Textual Joint Relevance Learning for Tag-Based Social Image Search”, IEEE Transactions on Image Processing, 2013

SLIDE 74

VISUAL FEATURES

  • Traditional bag of visual words [van de Sande 2010]
  • SIFT points quantized by a codebook of size 1,024
  • Plus a compact 64-d color feature vector [Li 2007]
  • CNN features
  • A 4,096-d FC7 vector after ReLU activation, extracted by the pre-trained 16-layer VGGNet [Simonyan 2015]

SLIDE 75

EVALUATION

Three tasks as introduced in Part 1

  • Tag assignment
  • Tag refinement
  • Tag retrieval

SLIDE 76

EVALUATING TAG ASSIGNMENT/REFINEMENT

  • A good method for tag assignment shall
  • rank relevant tags before irrelevant tags for a given image
  • rank relevant images before irrelevant images for a given tag
  • Two criteria
  • Image-centric: Mean image Average Precision (MiAP)
  • Tag-centric: Mean Average Precision (MAP)


MiAP is biased towards frequent tags; MAP is affected by rare tags.
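A minimal sketch of the underlying Average Precision computation over binary relevance judgments (helper names are ours):

```python
from typing import List

def average_precision(ranked_relevance: List[int]) -> float:
    """AP of one ranked list of binary relevance judgments (1 = relevant)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(ranked_lists: List[List[int]]) -> float:
    """Averaged per image (MiAP, over per-image tag rankings) or per tag
    (MAP, over per-tag image rankings), depending on what each list holds."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

average_precision([1, 0, 1])  # (1/1 + 2/3) / 2 = 0.833...
```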

SLIDE 77

EVALUATING TAG RETRIEVAL

  • A good method for tag retrieval shall
  • rank relevant images before irrelevant images for a given tag
  • Two criteria
  • Mean Average Precision (MAP) to measure the overall ranks
  • Normalized Discounted Cumulative Gain (NDCG) to measure the top ranks
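A minimal NDCG sketch with the common log2 position discount (the exact gain/discount variant used in the testbed may differ; helper names are ours):

```python
import math
from typing import List, Optional

def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains: List[float], k: Optional[int] = None) -> float:
    """NDCG: DCG of the ranking divided by the DCG of the ideal ordering."""
    top = gains[:k] if k is not None else gains
    ideal = sorted(gains, reverse=True)[:len(top)]
    best = dcg(ideal)
    return dcg(top) / best if best > 0 else 0.0

ndcg([3, 2, 0])  # 1.0: this ranking is already ideally ordered
```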

SLIDE 78

SUMMARY


Data servers [1] http://www.micc.unifi.it/tagsurvey [2] http://www.mmc.ruc.edu.cn/research/tagsurvey/data.html

SLIDE 79

LIMITATIONS IN OUR PROTOCOL

  • Tag informativeness in tag assignment


[Example: tags {dog, pet} versus {dog, beach}]

  • X. Qian, X.-S. Hua, Y. Tang, T. Mei, Social Image Tagging With Diverse Semantics, IEEE Transactions on Cybernetics 2014

How to assess informativeness?

SLIDE 80

LIMITATIONS IN OUR PROTOCOL

  • Image diversity in tag retrieval


Figure from [Wang et al. 2010]

How to measure diversity?

  • M. Wang, X.-S. Hua, H.-J. Zhang, Towards a relevant and diverse search of social images, IEEE Transactions on Multimedia 2010

SLIDE 81

LIMITATIONS IN OUR PROTOCOL

  • Semantic ambiguity
  • E.g., search for jaguar in Flickr51


[Retrieval results for ‘jaguar’: SemanticField vs. RelExample]

Need fine-grained annotation

  • X. Li, S. Liao, W. Lan, X. Du, G. Yang, Zero-shot image tagging by hierarchical semantic embedding, SIGIR 2015

SLIDE 82

REFERENCES

82

  • [Chua 2009] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y.-T. Zheng. NUS-WIDE: A Real-World Web Image Database from National University of Singapore, CIVR 2009
  • [Huiskes 2010] M. Huiskes, B. Thomee, M. Lew. New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative, MIR 2010
  • [Li 2007] M. Li. Texture Moment for Content-Based Image Retrieval, ICME 2007
  • [Li 2012] X. Li, C. Snoek, M. Worring, A. Smeulders. Harvesting social images for bi-concept search, IEEE Transactions on Multimedia 2012
  • [Li 2015] X. Li, S. Liao, W. Lan, X. Du, G. Yang. Zero-shot image tagging by hierarchical semantic embedding, SIGIR 2015
  • [Simonyan 2015] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
  • [Qian 2014] X. Qian, X.-S. Hua, Y. Tang, T. Mei. Social Image Tagging With Diverse Semantics, IEEE Transactions on Cybernetics 2014
  • [van de Sande 2010] K. van de Sande, T. Gevers, C. Snoek. Evaluating Color Descriptors for Object and Scene Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 2010
  • [Vreeswijk 2012] D. Vreeswijk, K. van de Sande, C. Snoek, A. Smeulders. All Vehicles are Cars: Subclass Preferences in Container Concepts, ICMR 2012
  • [Wang 2010] M. Wang, X.-S. Hua, H.-J. Zhang. Towards a relevant and diverse search of social images, IEEE Transactions on Multimedia 2010

SLIDE 83

PART 4 EVALUATION: ELEVEN KEY METHODS

  • Goal: evaluate key methods across the Media and Learning paradigms
  • Q: What are their key ingredients?
  • Q: What is the computational cost of each of them?

SLIDE 84

KEY METHODS

  • Covering all published methods is obviously impractical
  • We do not consider methods:
  • which do not show significant improvements or novelties w.r.t. the seminal papers in the field
  • which are difficult to replicate
  • Our choice is driven by the intention to cover methods that address each of the three tasks, exploit varied modalities, and use distinct learning mechanisms
  • We select 11 representative methods

SLIDE 85

KEY METHODS

  • Each method is required to output the tag relevance of each test image and each test tag


An n × m score matrix over n images and m tags:

    | f(x1, t1)  f(x1, t2)  ...  f(x1, tm) |
    | f(x2, t1)  f(x2, t2)  ...  f(x2, tm) |
    |    ...        ...     ...     ...    |
    | f(xn, t1)  f(xn, t2)  ...  f(xn, tm) |

SLIDE 86

KEY METHODS

The 11 methods, organized along the Media \ Learning dimensions:

Tag
  • Instance-based: SemanticField [Zhu et al. 2012], TagCooccur [Sigurbjörnsson and van Zwol 2008]

Tag + Image
  • Instance-based: TagRanking [Liu et al. 2009], KNN [Makadia et al. 2010]
  • Model-based: TagProp [Guillaumin et al. 2009], TagFeature [Chen et al. 2012], RelExample [Li and Snoek 2013]
  • Transduction-based: RobustPCA [Zhu et al. 2010]

Tag + Image + User
  • Instance-based: TagVote, TagCooccur+ [Li et al. 2009b]
  • Transduction-based: TensorAnalysis [Sang et al. 2012a]

SLIDE 88
  • Tags of similar semantics usually co-occur in user images
  • SemanticField measures an averaged similarity between a tag and the user tags already assigned to the image
  • Two similarity measures between words:
  • Flickr context similarity
  • Wu-Palmer similarity on WordNet

SEMANTICFIELD


[Zhu et al. 2012] Instance-Based Tag

[Example: candidate tag ‘sunset’ vs. user tags {red, sun, beach, birthday, canon} → 0.9; candidate tag ‘cat’ vs. {band, lights, concert, personal, guitar} → 0.1]

Zhu et al. Sampling and Ontologically Pooling Web Images for Visual Concept Learning. IEEE TMM 2012

SLIDE 89

FLICKR CONTEXT SIMILARITY

FCS(bridge, river) = 0.65

  • Based on the Normalized Google Distance:

NGD(x, y) = (max{log h(x), log h(y)} − log h(x, y)) / (log N − min{log h(x), log h(y)})

  • Measures the co-occurrence of two tags relative to their single-tag occurrences h(x), h(y), h(x, y).
  • No semantics is involved; works for any tag.

FCS(x, y) = e^(−NGD(x, y)/σ)

Y.-G. Jiang, C.-W. Ngo, S.-F. Chang. Semantic context transfer across heterogeneous sources for domain adaptive video search. ACM Multimedia 2009

SLIDE 90

WU-PALMER SIMILARITY


Sim(w1, w2) = 2 · depth(LCS(w1, w2)) / (length(w1, w2) + 2 · depth(LCS(w1, w2)))

  • A measure between concepts in an ontology restricted to taxonomic links.
  • Considers the depths of w1, w2 and of their least common subsumer (LCS).
  • Typically used with WordNet.
  • Z. Wu, M. Palmer. Verb semantics and lexical selection. ACL 1994
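A toy illustration of the formula on a hand-made is-a hierarchy standing in for WordNet (the taxonomy and helper names are ours):

```python
# Child -> parent links of a tiny is-a taxonomy rooted at 'entity'.
PARENT = {"dog": "canine", "canine": "animal",
          "cat": "feline", "feline": "animal", "animal": "entity"}

def path_to_root(word: str) -> list:
    chain = [word]
    while word in PARENT:
        word = PARENT[word]
        chain.append(word)
    return chain  # word, its parent, ..., root

def wu_palmer(w1: str, w2: str) -> float:
    """Sim = 2*depth(LCS) / (length(w1, w2) + 2*depth(LCS))."""
    p1, p2 = path_to_root(w1), path_to_root(w2)
    lcs = next(a for a in p1 if a in p2)        # least common subsumer
    depth = {w: len(path_to_root(w)) for w in (w1, w2, lcs)}
    length = (depth[w1] - depth[lcs]) + (depth[w2] - depth[lcs])
    return 2 * depth[lcs] / (length + 2 * depth[lcs])

wu_palmer("dog", "cat")  # 0.5
wu_palmer("dog", "dog")  # 1.0
```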
SLIDE 91
  • Sim is the similarity between t and the other image tags
  • Needs some user tags. Not applicable to Tag Assignment
  • Complexity O(m · lx): the number of image tags lx times m tags
  • Memory O(m2): quadratic w.r.t. the vocabulary of m tags

SEMANTICFIELD


[Zhu et al. 2012] Instance-Based Tag

fSemField(x, t) := (1/lx) · Σ i=1..lx sim(t, ti)
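The scoring rule can be sketched directly (the toy similarity below is a placeholder for FCS or Wu-Palmer):

```python
from typing import Callable, List

def semantic_field(t: str, user_tags: List[str],
                   sim: Callable[[str, str], float]) -> float:
    """Mean similarity between candidate tag t and the tags already on the
    image; sim can be the Flickr context or Wu-Palmer similarity."""
    if not user_tags:  # no user tags: the method is not applicable
        return 0.0
    return sum(sim(t, ti) for ti in user_tags) / len(user_tags)

# Toy similarity: 1 for identical tags, 0 otherwise.
toy_sim = lambda a, b: 1.0 if a == b else 0.0
semantic_field("sun", ["sun", "beach"], toy_sim)  # 0.5
```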

SLIDE 93
  • TagRanking assigns a rank to each user tag, based on its relevance to the image content.
  • Tag probabilities are first estimated in a kernel density estimation (KDE) phase.
  • Then a random walk is performed on a tag graph, built from exemplar visual similarity and tag semantic similarity.

TAGRANKING


[Liu et al. 2009] Instance-Based Tag + Image

[Figure: user tags {flower, tree, bird, sky} re-ranked via Gaussian kernel density estimation of p(t|x) followed by a random walk on a tag graph (exemplar similarity + concurrence similarity), yielding (1) bird 0.36, (2) flower 0.28, (3) sky 0.21, (4) tree 0.15]

  • D. Liu, X.-S. Hua, L. Yang, M. Wang, H.-J. Zhang. Tag ranking. WWW 2009
SLIDE 94

TAGRANKING


[Liu et al. 2009] Instance-Based Tag + Image

  • Suitable only for Tag Retrieval: it doesn’t add or remove user tags.
  • The 1/lx term breaks ties when two images have the same rank for a tag.
  • Complexity O(m · d · n + L · m²): KDE on n images plus L iterations of the random walk
  • Memory O(max(d · n, m²)): the maximum of the two steps

fTagRanking(x, t) = −rank(t) + 1 / lx
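The random-walk step can be sketched as a generic random walk with restart over a column-normalized tag similarity matrix; the exact graph construction of [Liu et al. 2009] differs in its details:

```python
import numpy as np

def random_walk_scores(S, v, alpha=0.85, n_iter=100):
    """Random walk with restart on a tag graph:
    p <- alpha * P @ p + (1 - alpha) * v_norm,
    where P column-normalizes the tag similarity matrix S and v holds
    the initial (e.g. KDE-based) relevance scores."""
    P = S / S.sum(axis=0, keepdims=True)
    v_norm = v / v.sum()
    p = v_norm.copy()
    for _ in range(n_iter):
        p = alpha * P @ p + (1 - alpha) * v_norm
    return p
```

Because P is column-stochastic, the scores stay a probability distribution over the tags at every step.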

slide-95
SLIDE 95

KEY METHODS

95

(Recap of the key-methods table shown on slide 92.)

slide-96
SLIDE 96

KNN

96

[Makadia et al. 2010] Instance-Based Tag + Image

[Figure: a test image and its nearest neighbors with distances (0.3, 1.5, 1.9, 2.8, 3.2, 4.9) and their user tags, e.g. sunset, sun, sea, palm.]

  • Similar images share similar tags
  • Finds the k nearest images under a distance d
  • Counts the frequency of tags in the neighborhood
  • Assigns the top-ranked tags to the test image

  • A. Makadia, V. Pavlovic, and S. Kumar. A new baseline for image annotation. ECCV 2008
slide-97
SLIDE 97

KNN

97

[Makadia et al. 2010] Instance-Based Tag + Image

(Same figure and bullets as slide 96.)

slide-98
SLIDE 98
  • kt is the number of images tagged with t in the visual neighborhood of x.
  • User tags on the test image are not used. Not applicable to Tag Refinement.
  • Complexity O(d · |S| + k · log|S|): proportional to the feature dimensionality d and the k nearest neighbors
  • Memory O(d · |S|): d-dimensional features for all of S

KNN

98

Instance-Based Tag + Image [Makadia et al. 2010]

fKNN(x, t) := kt
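A toy sketch of the counting rule, assuming 2-D features and Euclidean distance; the data and names are illustrative only:

```python
import numpy as np

def f_knn(x, train_feats, train_tags, t, k=3):
    """f_KNN(x, t) := k_t, the number of images tagged with t among
    the k visual neighbors of x (Euclidean distance on toy features)."""
    dists = np.linalg.norm(train_feats - x, axis=1)
    neighbors = np.argsort(dists)[:k]
    return sum(1 for i in neighbors if t in train_tags[i])

# Toy 2-D "features" and per-image tag sets.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.2, 0.0]])
tags = [{"sea"}, {"sea", "sun"}, {"wood"}, {"sun"}]
```

For a test image near the first cluster, "sea" and "sun" collect votes while the distant "wood" image contributes nothing.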

slide-99
SLIDE 99

TAGVOTE

99

[Li et al. 2009b] Instance-Based Tag + Image + User

  • Adds two improvements to KNN voting:
  • a unique-user constraint on the visual neighborhood
  • a correction for the tag prior frequency
  • X. Li, C. Snoek, M. Worring. Learning Social Tag Relevance by Neighbor Voting. IEEE TMM 2009
slide-100
SLIDE 100
  • kt is the number of images with t in the visual neighborhood of x
  • nt is the frequency of tag t in S, i.e. the number of images labeled with t in S
  • Like KNN, user tags on the test image are not used. Not applicable to Tag Refinement
  • Complexity O(d · |S| + k · log|S|): same complexity as KNN
  • Memory O(d · |S|)

TAGVOTE

100

Instance-Based Tag + Image + User

fTagVote(x, t) := kt − k · nt / |S|

[Li et al. 2009b]
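The prior correction fits in one line; the function name and the example numbers below are illustrative:

```python
def f_tag_vote(kt, k, nt, s_size):
    """f_TagVote(x, t) := k_t - k * n_t / |S|: the k_t neighbor votes
    for tag t, minus the number of votes t would be expected to
    receive by chance given its prior frequency n_t / |S|."""
    return kt - k * nt / s_size
```

A very frequent tag (e.g. "2006") thus needs many more neighbor votes than a rare tag to reach the same relevance score.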

slide-101
SLIDE 101

TAGPROP

101

[Guillaumin et al. 2009] Model-Based Tag + Image

[Figure: a test image, its weighted nearest neighbors, and their user tags.]

  • Key improvement: give different weights to image neighbors
  • Probabilistic metric learning on image ranks or distances

p(yiw = +1) = Σj πij · p(yiw = +1 | j)    (probability of tag w on image i)

p(yiw = +1 | j) = 1 − ε if yjw = +1, ε otherwise    (probability of tag w given neighbor j)

  • M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid. TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto-Annotation. ICCV 2009

slide-102
SLIDE 102
  • I(xj, t) returns 1 if xj is labeled with t, 0 otherwise.

TAGPROP

102

[Guillaumin et al. 2009] Model-Based Tag + Image

fTagProp(x, t) := Σj=1..k πj · I(xj, t)

The weights πij are set by maximizing the log-likelihood, using either rank weights or distance weights:

πij = exp(−dθ(i, j)) / Σj′ exp(−dθ(i, j′))

slide-103
SLIDE 103
  • A per-tag logistic regressor on top of fTagProp is added to promote rare tags and penalize frequent ones.
  • User tags on the test image are not used. Not applicable to Tag Refinement
  • Complexity O(l · m · k): l steps of gradient descent
  • Memory O(d · |S|): same as KNN, plus 2m extra parameters for the logistic regressors

TAGPROP

103

[Guillaumin et al. 2009] Model-Based Tag + Image

fTagProp(x, t) := σ(at · Σj=1..k πj · I(xj, t) + bt),   σ(z) = 1 / (1 + e−z)
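A sketch of the two ingredients: softmax distance weights and the per-tag logistic calibration. The values passed for the learned at, bt are placeholders:

```python
import math

def distance_weights(dists):
    """pi_ij = exp(-d(i, j)) / sum_j' exp(-d(i, j')): a softmax over
    the (learned) distances to the k neighbors."""
    e = [math.exp(-d) for d in dists]
    s = sum(e)
    return [x / s for x in e]

def f_tagprop(pi, neighbor_has_tag, a_t=1.0, b_t=0.0):
    """sigma(a_t * sum_j pi_j * I(x_j, t) + b_t), where a_t and b_t
    are per-tag logistic parameters (placeholder values here)."""
    raw = sum(p for p, has in zip(pi, neighbor_has_tag) if has)
    return 1.0 / (1.0 + math.exp(-(a_t * raw + b_t)))
```

Closer neighbors get larger weights, and the logistic layer rescales the raw vote per tag.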

slide-104
SLIDE 104

KEY METHODS

104

(Recap of the key-methods table shown on slide 92.)

slide-105
SLIDE 105

TAGCOOCCUR

105

[Sigurbjörnsson and van Zwol 2008]

Instance-Based Tag

  • Refines user tags by looking for co-occurrences in the training set
  • Tags are given a score based on a heuristic that takes into account the ranks, stability and frequency of tags

[Figure: tag co-occurrence, then tag aggregation and ranking. The user-defined tags “Sagrada Familia” and “Barcelona” produce candidate tags (Barcelona, Gaudi, Spain, architecture, Catalunya, church, 2006, Europe, travel), from which the recommended tags are Gaudi, Spain, Catalunya, architecture, church.]

  • B. Sigurbjörnsson, R. van Zwol. Flickr tag recommendation based on collective knowledge, WWW 2008
slide-106
SLIDE 106
  • descriptive lowers the contribution of very high frequency tags
  • rank-promotion measures a tag’s contribution w.r.t. tag ranks
  • stability promotes tags for which the statistics are more stable
  • vote is 1 if t is among the 25 top-ranked tags of ti, 0 otherwise
  • Depends on the user tags of the test image. Not applicable to Tag Assignment
  • Complexity O(m · lx): same as SemanticField
  • Memory O(m²)

106

TAGCOOCCUR

[Sigurbjörnsson and van Zwol 2008]

Instance-Based Tag

ftagcooccur(x, t) = descriptive(t) · Σi=1..lx vote(ti, t) · rank-promotion(ti, t) · stability(ti)

slide-107
SLIDE 107

KEY METHODS

107

(Recap of the key-methods table shown on slide 92.)

slide-108
SLIDE 108

TAGCOOCCUR+

108

[Li et al. 2009b] Instance-Based Tag + Image

  • A variant of TagCooccur, improved by considering the image content in addition to user tags alone
  • The heuristic is updated by multiplying the TagCooccur score with a corrective factor based on TagVote scores
  • rc(t) is the rank of t when sorting ftagvote(x, t) in descending order; kc is a positive weighting parameter
  • Complexity O(d · |S| + k · log|S|): same complexity as TagVote
  • Memory O(d · |S|)

ftagcooccur+(x, t) = ftagcooccur(x, t) · kc / (kc + rc(t) − 1)

  • X. Li, C. Snoek, M. Worring. Learning Social Tag Relevance by Neighbor Voting. IEEE TMM 2009
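The corrective factor in isolation, with an illustrative function name and example values:

```python
def f_tagcooccur_plus(cooccur_score, rc, kc=1.0):
    """Multiplies a TagCooccur score by kc / (kc + rc - 1), where rc
    is the rank of the tag under ftagvote: content-based promotion of
    tags that the visual neighbors also vote for."""
    return cooccur_score * kc / (kc + rc - 1)
```

A tag ranked first by TagVote keeps its full co-occurrence score; lower-ranked tags are damped hyperbolically.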
slide-109
SLIDE 109

KEY METHODS

109

(Recap of the key-methods table shown on slide 92.)

slide-110
SLIDE 110

TAGFEATURE

110

Model-Based Tag + Image

  • Trains a per-tag classifier, with tagged images as positive examples and randomly selected untagged images as negative examples.
  • Since rare tags are associated with only a limited number of positive training images, they may degrade SVM performance.

[Figure: an SVM for “sunset”, trained on images tagged with sunset (positives) and randomly selected images not tagged with sunset (negatives), scores a test image 0.9 from its visual features.]

[Chen et al. 2012]

  • L. Chen, D. Xu, I. Tsang, J. Luo. Tag-Based Image Retrieval Improved by Augmented Features and Group-Based Refinement. IEEE TMM 2012

slide-111
SLIDE 111

TAGFEATURE

111

[Chen et al. 2012] Model-Based Tag + Image

  • The TagFeature idea is to enrich visual features with tag-augmented features, derived from pre-learned SVM classifiers of popular concepts.

[Figure: visual features are scored by pre-learned SVMs (beach: 0.7, cat: 0.1, sunset: 0.9); these scores form the augmented feature, and a final SVM for “sunset” scores the test image 0.9.]

slide-112
SLIDE 112
  • Linear classifiers are used to reduce the computational cost
  • This allows summing all the support vectors into a single vector xt
  • d visual features and d′ tag features, i.e. SVM classifier scores
  • User tags on the test image are not used. Not applicable to Tag Refinement.
  • Complexity O((d + d′) · n · m): n images, m tags
  • Memory O(m · (d + d′))

112

TAGFEATURE

[Chen et al. 2012] Model-Based Tag + Image

fTagFeature(x, t) := b + ⟨xt, x⟩
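A sketch of the augmented-feature construction, with toy (w, b) pairs standing in for the pre-learned SVMs of popular tags:

```python
import numpy as np

def augmented_feature(x_visual, tag_classifiers):
    """Concatenates the d visual features with d' scores from
    pre-learned linear classifiers of popular tags (toy (w, b) pairs
    here stand in for the actual prelearned SVMs)."""
    scores = [float(np.dot(w, x_visual) + b) for w, b in tag_classifiers]
    return np.concatenate([x_visual, scores])

def f_tagfeature(x_aug, w_t, b_t):
    """Linear scoring f_TagFeature(x, t) := b + <x_t, x> on the
    augmented feature, with all support vectors summed into w_t."""
    return float(np.dot(w_t, x_aug) + b_t)
```

The final per-tag SVM thus sees both the raw visual features and the responses of the popular-concept classifiers.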

slide-113
SLIDE 113

KEY METHODS

113

(Recap of the key-methods table shown on slide 92.)

slide-114
SLIDE 114

RELEXAMPLE

114

[Li and Snoek 2013] Model-Based Tag + Image

  • Negative examples that are visually similar to positives can be misclassified
  • RelExample exploits positive and negative training examples that are deemed more relevant with respect to the test tag t

[Figure: for the tag “sheep”, crowd-annotated images feed positive example selection and Negative Bootstrap; compressed ensembles of SVMs produce tag relevance scores, e.g. 0.87, 0.70, 0.15.]

  • Positive examples are selected by taking the top-ranked images by TagVote and SemanticField
  • Negative examples are selected by Negative Bootstrap [Li et al. 2013]

  • X. Li, C. Snoek. Classifying tag relevance with relevant positive and negative examples. ACM MM 2013
slide-115
SLIDE 115

RELEXAMPLE

115

[Li and Snoek 2013] Model-Based Tag + Image

  • Negative Bootstrap [Li et al. 2013] trains a series of classifiers gt that explicitly address the examples misclassified at the previous step

[Figure: iterative loop over the unlabeled pool: random sampling, adaptive sampling and selection, virtual labeling, classifier learning, prediction, and classifier aggregation.]

Gt(x, w) = ((t − 1) / t) · Gt−1(x, w) + (1 / t) · gt(x, w)

slide-116
SLIDE 116
  • T iterations yield a corresponding number of trained classifiers
  • User tags on the test image are not used. Not applicable to Tag Refinement.
  • Complexity O(T · d · p²): training T SVM classifiers
  • Memory O(d · p + d · q): d visual features, p positive and q negative examples

116

RELEXAMPLE

[Li and Snoek 2013] Model-Based Tag + Image

fRelExample(x, t) := (1 / T) · Σl=1..T (bl + Σj=1..nl αl,j · yl,j · K(x, xl,j))
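The aggregation rule Gt = ((t − 1)/t) Gt−1 + (1/t) gt is simply the incremental form of averaging the T per-iteration classifiers; a sketch with toy classifier functions:

```python
def aggregate_classifiers(classifiers, x):
    """Negative Bootstrap aggregation: applies the running update
    G_t = ((t - 1) / t) * G_{t-1} + (1 / t) * g_t(x), which after T
    steps equals the plain average (1 / T) * sum_t g_t(x)."""
    G = 0.0
    for t, g in enumerate(classifiers, start=1):
        G = (t - 1) / t * G + g(x) / t
    return G
```

The incremental form avoids keeping all T classifier outputs around at once.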

slide-117
SLIDE 117

KEY METHODS

117

(Recap of the key-methods table shown on slide 92.)

slide-118
SLIDE 118

ROBUSTPCA

118

[Zhu et al. 2010] Transduction-Based Tag + Image

  • Based on a few assumptions about tag characteristics:
  • low-rank property: the semantic space spanned by tags can be approximated by a smaller subset of salient words derived from the original space
  • tag correlation: semantic tags are correlated
  • visual consistency: visually similar images have similar tags
  • error sparsity of the image-tag matrix: user tagging is reasonably accurate, and one image is usually labelled with few tags

Zhu et al. Image Tag Refinement Towards Low-Rank, Content-Tag Prior and Error Sparsity. ACM MM 2010

slide-119
SLIDE 119

ROBUSTPCA

119

[Zhu et al. 2010] Transduction-Based Tag + Image

  • RobustPCA factorizes the tag matrix D into a low-rank matrix A and a sparse error matrix E.
  • It explicitly enforces content consistency and tag correlation with Laplacian graph-based regularizers.

slide-120
SLIDE 120
  • The problem reduces to recovering the noise-free matrix A, whose column vectors then represent the corresponding images.
  • Tc and Tt are regularizers based respectively on the similarity of images and of tags.
  • Complexity O(c · m² · n + c′ · n³): SVD computation
  • Memory O(c · n · m + c′ · (n² + m²)): the full matrix D plus the tag and image similarity matrices

120

min A,E  ||A||∗ + λ1 ||E||1 + λ2 [Tc(A) + Tt(A)]   subject to D = A + E

ROBUSTPCA

[Zhu et al. 2010] Transduction-Based Tag + Image
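A minimal sketch of the plain low-rank-plus-sparse decomposition solved with a basic augmented-Lagrangian loop. The graph regularizers Tc, Tt of [Zhu et al. 2010] are omitted for brevity, and the parameter defaults follow common RPCA practice (our choice, not the paper's):

```python
import numpy as np

def robust_pca(D, lam=None, mu=None, n_iter=200):
    """min ||A||_* + lam * ||E||_1  s.t.  D = A + E, via alternating
    singular-value thresholding (for A) and soft thresholding (for E)
    with dual updates on Y."""
    n, m = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(n, m))
    mu = mu if mu is not None else 0.25 * n * m / np.abs(D).sum()
    Y = np.zeros_like(D)
    E = np.zeros_like(D)
    for _ in range(n_iter):
        # Singular-value thresholding step for the low-rank part A.
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = U @ np.diag(np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Soft-thresholding step for the sparse error E.
        R = D - A + Y / mu
        E = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Dual update enforcing D = A + E.
        Y += mu * (D - A - E)
    return A, E
```

On a rank-1 matrix corrupted by a few large sparse errors, the loop separates the two components to good accuracy.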

slide-121
SLIDE 121

KEY METHODS

121

(Recap of the key-methods table shown on slide 92.)

slide-122
SLIDE 122

TENSORANALYSIS

122

[Sang et al. 2012a] Transduction-Based Tag + Image + User

  • The method considers that, on top of visual appearance, images tagged by similar users can capture more semantic correlations
  • It jointly models the ternary relations between users, tags and images
  • It uses a tensor-based representation and Tucker decomposition to infer latent subspaces for the latent factors

tag(u, i, t) ⊆ U × I × T

Sang et al. User-Aware Image Tag Refinement via Ternary Semantic Analysis. IEEE TMM 2012

slide-123
SLIDE 123

TENSORANALYSIS

123

[Sang et al. 2012a] Transduction-Based Tag + Image + User

  • Only qualitative differences matter: the task is cast as a ranking problem, determining which tag is more relevant for a user to describe an image.
  • Thus the method adopts a three-state logic:
  • positive tags: tags assigned by the users
  • negative tags: dissimilar tags that do not occur together with positive tags
  • neutral tags: all other tags, removed from the learning process

Binary vs ternary logic

slide-124
SLIDE 124

TENSORANALYSIS

124

[Sang et al. 2012a] Transduction-Based Tag + Image + User

  • H is the Heaviside function; T{U,I,T} are Laplacian graph-based regularizers.
  • Optimization is performed iteratively using stochastic gradient descent, one latent matrix at a time.
  • Complexity O(|P1| · (rT · m² + rU · rI · rT)): P1 is the set of ones in D, and r{U,I,T} are the latent matrix dimensionalities.
  • Memory O(n² + m² + u²): the three regularizer matrices.

argminθ Σt+∈T+ Σt−∈T− H(ŷt− − ŷt+) + λ1 ||θ||² + λ2 (TU(θ) + TI(θ) + TT(θ)),   θ = {U, I, T}
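The ranking loss counts pairs in which a negative tag scores at least as high as a positive one; in isolation (illustrative sketch, names ours):

```python
def ranking_violations(scores_pos, scores_neg):
    """Sum of H(y_neg - y_pos) over all (positive, negative) tag
    pairs: the number of pairs where a negative tag outranks (or
    ties) a positive one, which the optimization drives down."""
    return sum(1 for yp in scores_pos
                 for yn in scores_neg
                 if yn - yp >= 0)
```

In practice the non-differentiable Heaviside is replaced by a smooth surrogate for gradient descent.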

slide-125
SLIDE 125

EVALUATION: EXPERIMENTAL RESULTS

  • Q: We evaluate the eleven methods on different tasks and scenarios. What are their performances?

  • Q: What is the computational cost of each of them?

125

slide-126
SLIDE 126

ANALYSIS OF COMPLEXITY

126

  • SemanticField and TagCooccur have the best scalability
  • The model-based methods require less memory and run faster at test time, but at the expense of SVM model learning in the training stage
  • The two transduction-based methods have limited scalability, and can operate only on a small-sized S

[Figure: the eleven methods plotted by computational complexity versus memory footprint: SemanticField, TagRanking, KNN, TagProp, TagCooccur, TagCooccur+, TagVote, TagFeature, RelExample, RobustPCA, TensorAnalysis.]

slide-127
SLIDE 127

EVALUATION

127

  • We report a thorough evaluation of the methods on the proposed testbed
  • Here we discuss only a few main results; please refer to our survey paper for the full picture.

Method            Assignment   Refinement   Retrieval
KNN                   X                         X
TagVote               X                         X
TagProp               X                         X
TagFeature            X                         X
RelExample            X                         X
TagCooccur                         X            X
TagCooccur+                        X            X
RobustPCA                          X            X
TensorAnalysis                     X            X
SemanticField                                   X
TagFeature                                      X

slide-128
SLIDE 128

TAG ASSIGNMENT

128

[Figure: per-tag average precision on the MIRFlickr test set, trained on Train1m, for KNN, TagVote, TagProp, TagFeature and RelExample with CNN and BovW features; tags: river, sea, night, man, baby, cloud, tree, girl, portrait, car, people, bird, flower, dog.]

  • All methods benefit from using CNN features
  • RelExample performs better than TagFeature due to its filtering component
  • TagProp has the best MAP; its performance is similar to KNN and TagVote, since they all use the same basic nearest-neighbor label propagation

slide-129
SLIDE 129

TAG ASSIGNMENT

129

[Figure: three histograms (Train10k, Train100k, Train1m on NUS-WIDE) of the number of images with the best AP, grouped by number of ground-truth tags, for CNN + KNN / TagVote / TagProp / TagFeature / RelExample.]

  • Test images are grouped by their number of ground-truth tags. The area of a colored bar is proportional to the number of images on which the corresponding method scores best.
  • When increasing the training set size, the most visible change is that of TagFeature and RelExample on images with one ground-truth tag.

slide-130
SLIDE 130

TAG REFINEMENT

130

[Figure: per-tag average precision for tag refinement on the MIRFlickr test set, trained on Train100k, comparing UserTags, TagCooccur, TagCooccur+ and RobustPCA with BovW and CNN features.]

  • All methods perform better than plain user tagging
  • The tag + image based methods outperform the tag-based TagCooccur
  • RobustPCA provides the best performance
slide-131
SLIDE 131

TAG REFINEMENT

131

[Figure: three histograms (Train10k, Train100k, Train1m on NUS-WIDE) of the number of images with the best AP, grouped by number of ground-truth tags, for UserTags, TagCooccur, CNN + TagCooccur+ and CNN + RobustPCA.]

  • CNN + RobustPCA has the best performance in every group of images
  • Almost all images with more than 4 ground-truth tags are better refined by RobustPCA than by the other methods
  • TagCooccur+ refines tags better than TagCooccur
slide-132
SLIDE 132

TAG RETRIEVAL

132

[Figure: per-tag average precision for tag retrieval on 51 test tags (basin, chicken, apple, telephone, rainbow, beach, weapon, matrix, spider, sailboat, olympics, cow, chopper, snowman, aquarium, dolphin, rabbit, jaguar, lion, wolf, owl, fighter, starfish, penguin, swimmer, horse, rice, flame, forest, fruit, seagull, palace, decoration, wildlife, hairstyle, waterfall, sport, eagle, glacier, turtle, watch, car, dog, furniture, shark, jellyfish, panda, statue, bird, flower, hockey), comparing TagPosition, SemanticField, CNN + TagVote, CNN + TagProp and CNN + RelExample.]

  • As for Tag Assignment, TagVote and TagProp provide the best performance
  • For 33 out of 51 test tags, RelExample gives an average precision higher than 0.9

slide-133
SLIDE 133

TAG RETRIEVAL

133

[Figure: the top 10 ranked images for ‘jaguar’ returned by (a) TagPosition, (b) SemanticField, (c) BovW + RelExample and (d) CNN + RelExample; the CNN-based ranking shows lower diversity.]

slide-134
SLIDE 134

COMMON PATTERNS

134

  • Some common patterns have emerged, independently of the task:
  • All methods benefit from using CNN features
  • The more social data used for training, the better the performance
  • With small-scale training sets, tag + image based methods that conduct model-based learning with denoised training examples turn out to be the most effective solution
slide-135
SLIDE 135

IMAGENET AS TRAINING SET

135

  • Some methods can’t be run or require modifications:
  • There is no user information in ImageNet, so Tag + Image + User methods must remove their dependency on users
  • Tag co-occurrences are limited in ImageNet, because images are labelled with a single WordNet synset
  • We ran an empirical comparison between Train100k, Train1m and ImageNet
  • We tested TagVote (without the unique-user constraint) and TagProp

ImageNet already provides labeled examples for over 20k categories. Is it necessary to learn from socially tagged data?
slide-136
SLIDE 136

IMAGENET RESULTS

136

Tag Assignment: scores on MIRFlickr and NUS-WIDE

                   MIRFlickr            NUS-WIDE
Training Set     TagVote  TagProp    TagVote  TagProp
MiAP scores:
Train100k         0.377    0.383      0.392    0.389
Train1M           0.389    0.392      0.414    0.393
ImageNet200k      0.345    0.304      0.325    0.368
MAP scores:
Train100k         0.641    0.647      0.386    0.405
Train1M           0.664    0.668      0.429    0.420
ImageNet200k      0.532    0.532      0.363    0.362

  • Methods trained on socially tagged datasets show better performance for tag assignment.

slide-137
SLIDE 137

IMAGENET RESULTS

137

  • TagVote and TagProp trained on ImageNet200k perform better on images with a single relevant tag.
  • On the other groups, Train100k and Train1M are a better choice.
  • Due to its single-label nature, ImageNet is less effective for assigning multiple labels to an image.

[Figure: two histograms (TagVote, TagProp) of the number of images with the best AP, grouped by number of ground-truth tags, comparing Train100k, Train1m and ImageNet200k.]

slide-138
SLIDE 138

IMAGENET RESULTS

138

Tag Retrieval: scores on Flickr51 and NUS-WIDE

                   Flickr51             NUS-WIDE
Training Set     TagVote  TagProp    TagVote  TagProp
MAP scores:
Train100k         0.854    0.860      0.742    0.745
Train1M           0.874    0.871      0.753    0.745
ImageNet200k      0.873    0.873      0.762    0.762
NDCG20 scores:
Train100k         0.838    0.863      0.849    0.856
Train1M           0.894    0.851      0.891    0.853
ImageNet200k      0.920    0.898      0.843    0.847

  • For retrieval, the two socially tagged sets in general yield better performance than ImageNet200k, though not in every case.
  • Train100k and Train1m yield better performance on tags where the ImageNet examples lack diversity (for instance ‘running’).
  • The ImageNet200k performance gain is largely due to a few tags for which social tagging is very noisy.

slide-139
SLIDE 139

IMAGENET RESULTS

139

ImageNet already provides labeled examples for over 20k categories. Is it necessary to learn from socially tagged data?

  • Yes!
  • For tag assignment, social media examples are the preferred source of training data.
  • For tag retrieval, ImageNet may provide better performance, yet the gain is largely due to a few tags where social tagging is very noisy.

slide-140
SLIDE 140

CONCLUSIONS

140

  • We went through eleven key methods across the media and learning dimensions.
  • Take-home messages:
  • The more social data used for training, the better the performance
  • Substituting CNN features for BovW boosts the performance of all methods.
  • TagVote and TagProp provide the best overall performance for Assignment and Retrieval.
  • RobustPCA is the choice for Refinement.
  • Given a small-sized training set, the model-based RelExample may perform better.

slide-141
SLIDE 141

SOFTWARE

  • Jingwei, a framework for evaluating image tag assignment, tag

refinement and tag-based image retrieval:

  • https://github.com/li-xirong/jingwei
  • Hands on:
  • Run TagVote on Train10k + MIRFlickr
  • Learning new tag models on the fly

141

slide-142
SLIDE 142

PRINCIPLES OF DESIGN

142

  • Usability
  • Python APIs
  • Cross-platform: Linux, Windows, Mac
  • Readability
  • The majority of the code is written in Python
  • Flexibility
  • Extends easily to new datasets and new visual features
slide-143
SLIDE 143

CODE ARCHITECTURE OF JINGWEI

143

[Figure: the Jingwei code architecture, from test images and test tags to a pickled result matrix.]

slide-144
SLIDE 144

REFERENCES

144

  • [Jiang et al. 2009] Jiang, Yu-Gang, Chong-Wah Ngo, and Shih-Fu Chang. "Semantic context

transfer across heterogeneous sources for domain adaptive video search." Proceedings of the 17th ACM international conference on Multimedia. ACM, 2009.

  • [Liu et al. 2011] Liu, Yiming, et al. "Textual query of personal photos facilitated by large-scale

web data." Pattern Analysis and Machine Intelligence, IEEE Transactions on 33.5 (2011): 1022-1036.

  • [Zhu et al. 2012] S. Zhu, C.-W. Ngo, and Y.-G. Jiang. 2012. “Sampling and Ontologically

Pooling Web Images for Visual Concept Learning”. IEEE Transactions on Multimedia 14, 4 (2012), 1068–1078.

  • [Liu et al. 2009] D. Liu, X.-S. Hua, L. Yang, M. Wang, and H.-J. Zhang. 2009. “Tag Ranking”. In Proc. of WWW. 351–360.
  • [Makadia et al. 2010] A. Makadia, V. Pavlovic, and S. Kumar. 2010. “Baselines for Image

Annotation”. International Journal of Computer Vision 90, 1 (2010), 88–105.

  • [Li et al. 2009b] X. Li, C. Snoek, and M. Worring. “Learning Social Tag Relevance by Neighbor

Voting”. IEEE Transactions on Multimedia 11, 7 (2009), 1310–1322.

  • [Guillaumin et al. 2009] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. 2009.

“TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto- Annotation”. In Proc. of ICCV. 309–316.

slide-145
SLIDE 145

REFERENCES

145

  • [Sigurbjörnsson and van Zwol 2008] B. Sigurbjörnsson and R. van Zwol. 2008. “Flickr tag

recommendation based on collective knowledge”. In Proc. of WWW. 327–336.

  • [Chen et al. 2012] L. Chen, D. Xu, I. Tsang, and J. Luo. 2012. “Tag-Based Image Retrieval

Improved by Augmented Features and Group-Based Refinement”. IEEE Transactions on Multimedia 14, 4 (2012), 1057–1067.

  • [Li and Snoek 2013] X. Li and C. Snoek. 2013. “Classifying tag relevance with relevant positive

and negative examples”. In Proc. of ACM MM. 485–488.

  • [Zhu et al. 2010] G. Zhu, S. Yan, and Y. Ma. 2010. “Image Tag Refinement Towards Low-Rank,

Content-Tag Prior and Error Sparsity”. In Proc. of ACM MM. 461–470.

  • [Sang et al. 2012] J. Sang, C. Xu, and J. Liu. 2012a. “User-Aware Image Tag Refinement via

Ternary Semantic Analysis”. IEEE Transactions on Multimedia 14, 3 (2012), 883–895.

slide-146
SLIDE 146

ORGANIZATION OF THE TUTORIAL

8:30 – 9:30    Part 1: Introduction, Part 2: Taxonomy
9:30 – 10:00   Part 3: Experimental protocol, Part 4: Evaluation
10:00 – 10:45  Coffee break
10:45 – 12:00  Part 4: Evaluation cont’d
12:00 – 12:30  Part 5: Conclusion and future directions

146

slide-147
SLIDE 147
  • Summary
  • Future directions

147

PART 5 CONCLUSION AND FUTURE DIRECTIONS

slide-148
SLIDE 148

READING MATERIAL

Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval, 
 ACM Computing Surveys, 49(1):14, June 2016.

148

slide-149
SLIDE 149

SUMMARY: UNIFIED FRAMEWORK

149

[Figure: the unified framework. Training media S and user information pass through a filter-and-precompute stage, yielding filtered media Ŝ and prior Θ; a tag relevance function fΦ(x, t; Θ), learned instance-based, model-based (inductive) or transduction-based, scores test image x and tag t for the Assignment, Refinement and Retrieval tasks.]

slide-150
SLIDE 150

SUMMARY: TAXONOMY

150

Media \ Learning        Instance   Model   Transductive
Tag                        2          1
Tag + Image               13         15        12
Tag + Image + User         5          7         3

The taxonomy structures 60 papers along the Media and Learning dimensions

slide-151
SLIDE 151

SUMMARY: KEY METHODS

151

(The key-methods table from slide 92, repeated as a summary.)

slide-152
SLIDE 152

SUMMARY: OPEN-SOURCE TESTBED

152

Data servers [1] http://www.micc.unifi.it/tagsurvey [2] http://www.mmc.ruc.edu.cn/research/tagsurvey/data.html Jingwei, a framework for evaluating image tag assignment, tag refinement and tag-based image retrieval:
 [3] https://github.com/li-xirong/jingwei

slide-153
SLIDE 153

SUMMARY: TAKE HOME MESSAGES

153

  • The more social data used for training, the better the performance
  • Substituting CNN features for BovW boosts the performance of all methods.
  • TagVote and TagProp provide the best overall performance for Assignment and Retrieval.
  • RobustPCA is the choice for Refinement.
  • Given a small-sized training set, the model-based RelExample may perform better.

slide-154
SLIDE 154

FUTURE: MUCH REMAINS TO BE DONE

154

  • Novel deep-learning features are likely to boost the performance of the tag + image methods further
  • A learning strategy capable of jointly exploiting tag, image, and user information in a much more scalable manner than currently feasible.
  • The importance of the filter component, which refines socially tagged training examples prior to learning, is underestimated.
  • Image retrieval by multi-tag query is another important yet largely unexplored problem.

slide-155
SLIDE 155

CNN THAT BLENDS VISUAL INFORMATION FROM THE IMAGE AND ITS NEIGHBORS

155

slide-156
SLIDE 156

QUALITATIVE RESULTS

156

slide-157
SLIDE 157

POPULAR AND UNPOPULAR LATENT SENSES

  • Introduce latent senses to capture nuances in popularity
  • What makes an image unpopular is also informative
  • Popularity and unpopularity learned independently at train time
  • Single popularity score calculated at test time

157

Cappallo et al. ICMR 2015

[Figure: examples of popular latent senses versus unpopular latent senses.]

slide-158
SLIDE 158

1M MICRO-BLOG IMAGES

  • New, challenging dataset of 1 million images from social media
  • Twitter posts containing images from TREC 2013 Microblog track
  • Retweet and Favorite counts for popularity prediction research
  • Many graphical, non-photographic images

158

http://staff.fnwi.uva.nl/s.h.cappallo/data.html

slide-159
SLIDE 159

PROBLEM: EVENT DETECTION IN VIDEO

Mazloom et al. TMM 2016

slide-160
SLIDE 160

[Figure: social-tagged web videos (the source set) with their tags, e.g. “woman, outdoor, metal-crads-project, welding machine”, “man, kitchen, metallic, cleaning, oven, spray, glasses”, “man, snowboard, snow, board-trick”, “man, climb-on, wall, gym, rock-climbing”, pooled into a TagBook = {woman, outdoor, metal-crads-project, welding machine, man, kitchen, …, wall, gym, rock-climbing}.]

TAGBOOK: DERIVED FROM SOCIAL-TAGGED VIDEO

Mazloom et al. TMM 2016

slide-161
SLIDE 161

TAGBOOK: NEW VIDEO REPRESENTATION

Mazloom et al. TMM 2016

slide-162
SLIDE 162

BEYOND TAGS: EMOJI

  • Visual grammar of interaction
  • Language independent
  • Age accessible
  • Widely supported
  • Semantically diverse
  • Easy form factor for smart phones and watches

162

Cappallo et al. MM 2015

slide-163
SLIDE 163

IMAGE2EMOJI

163

Cappallo et al. MM 2015

slide-164
SLIDE 164

THIS CVPR: Fast Zero-Shot Image Tagging

164

Zhang et al. CVPR 2016

slide-165
SLIDE 165

WISHING YOU A GREAT CONFERENCE

Xirong Li

Renmin University of China

Tiberio Uricchio

University of Florence

Lamberto Ballan

University of Florence & Stanford University

Marco Bertini

University of Florence

Cees Snoek

University of Amsterdam & Qualcomm Research Netherlands

Alberto Del Bimbo

University of Florence

165

June 26, 2016

Tutorial