Machine Learning Solutions to Visual Recognition Problems (Jakob Verbeek, PowerPoint presentation)

SLIDE 1

Machine Learning Solutions to Visual Recognition Problems

Jakob Verbeek
Habilitation à Diriger des Recherches, Université Grenoble Alpes

Jury

  • Prof. Eric Gaussier, Univ. Grenoble Alpes (Président)
  • Prof. Matthieu Cord, Univ. Pierre et Marie Curie (Rapporteur)
  • Prof. Erik Learned-Miller, Univ. of Massachusetts (Rapporteur)
  • Prof. Andrew Zisserman, Univ. of Oxford (Rapporteur)
  • Dr. Cordelia Schmid, INRIA Rhône-Alpes (Examinateur)
  • Prof. Tinne Tuytelaars, K.U. Leuven (Examinateur)

SLIDE 2

Learning-based methods to understand natural imagery

◮ Recognition: people, objects, actions, events, ...
◮ Localization: box, segmentation mask, space-time tube, ...
◮ A technique-driven rather than application-driven approach

SLIDE 3

Layout of this presentation

◮ Synthetic review of past activities
◮ Overview of contributions
◮ Perspectives

SLIDE 4

Part I Synthetic review of past activities

SLIDE 5

Academic background: 1994 — 2005 — 2016

◮ 1994-1998: MSc Artificial Intelligence, University of Amsterdam
  ◮ With honors; Peter Grünwald, Ronald de Wolf, Paul Vitányi
◮ 1999-2000: MSc Logic, ILLC, University of Amsterdam
  ◮ With honors; Michiel van Lambalgen
◮ 2000-2004: PhD Computer Science, University of Amsterdam
  ◮ Ben Kröse, Nikos Vlassis, Frans Groen
◮ 2005-2007: Postdoctoral fellow, INRIA Rhône-Alpes
  ◮ Bill Triggs
◮ Since 2007: Permanent researcher, INRIA Rhône-Alpes
◮ 2009: Promotion to CR1
◮ 2016: Outstanding research distinction (PEDR)

SLIDE 6

Supervised PhD students

◮ 2006-2010: Matthieu Guillaumin
  ◮ Amazon research, Berlin, Germany
◮ 2008-2011: Josip Krapac
  ◮ PostDoc Univ. Zagreb, Croatia
◮ 2009-2012: Thomas Mensink, AFRIF best thesis award 2012
  ◮ PostDoc Univ. Amsterdam, Netherlands
◮ 2010-2014: Gokberk Cinbis, AFRIF best thesis award 2014
  ◮ Assistant Prof. Bilkent Univ. Ankara, Turkey
◮ 2011-2015: Dan Oneață
  ◮ Data scientist, Eloquentix, Bucharest, Romania
◮ Since 2013: Shreyas Saxena
◮ Since 2016: Pauline Luc

SLIDE 7

Research funding: ANR, EU, Cifre, LabEx

◮ 2006-2009: Cognitive-Level Annotation using Latent Statistical Structure (CLASS), funded by European Union
◮ 2008-2010: Interactive Image Search, funded by ANR
◮ 2009-2012: Modeling multi-media documents for cross-media access, Cifre PhD with Xerox Research Centre Europe
◮ 2010-2013: Quaero Consortium for Multimodal Person Recognition, funded by ANR
◮ 2011-2015: AXES: Access to Audiovisual Archives, funded by European Union
◮ 2013-2016: Physionomie: Physiognomic Recognition for Forensic Investigation, funded by ANR
◮ 2016-2018: Weakly supervised structured prediction for semantic segmentation, Cifre with Facebook AI Research
◮ 2016-2020: Deep convolutional and recurrent networks for image, speech and text, Laboratory of Excellence Persyval

SLIDE 8

Publications

◮ 19 journal articles: 14 in TPAMI, IJCV, PR, TIP
◮ 34 conference papers: 25 (6 oral) in ECCV, CVPR, ICCV, NIPS
◮ 5723 citations, H-index 36, i10-index 58 (Google Scholar)
◮ 3 patents, joint inventions with Xerox Research Centre Europe

SLIDE 9

Research community service

◮ Associate editor
  ◮ International Journal of Computer Vision (since 2014)
  ◮ Image and Vision Computing Journal (since 2011)
◮ Chairs for international conferences
  ◮ Tutorial chair ECCV 2016
  ◮ Area chair CVPR 2015
  ◮ Area chair ECCV 2012, 2014
  ◮ Area chair BMVC 2012, 2013, 2014

SLIDE 10

Part II Overview of contributions

SLIDE 11

Layout of this presentation

◮ Synthetic review of past activities
◮ Overview of contributions

  • 1. The Fisher vector representation
  • 2. Metric learning approaches
  • 3. Learning with incomplete supervision

◮ Perspectives

SLIDE 12

The Fisher vector representation

◮ Data representation by the Fisher score vector [Jaakkola and Haussler, 1999]

  ∇_θ ln p(x; θ),  θ ∈ ℝ^D    (1)

◮ Useful to represent non-vectorial data, e.g. sets, sequences, ...
◮ For images: iid GMM for sets of local descriptors [Perronnin and Dance, 2007]

  p(x_{1:N}) = ∏_{n=1}^{N} ∑_{k=1}^{K} π_k N(x_n; µ_k, σ_k)    (2)

◮ The Fisher vector contains local first and second order statistics

  ∇_{(π_k, µ_k, σ_k)} ln p(x; θ) = b + A ∑_{n=1}^{N} p(k|x_n) [1, x_n, x_n²]^⊤    (3)
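As an illustration, the statistics of Eq. (3) can be computed in a few lines of numpy. This is a minimal sketch assuming a diagonal-covariance GMM, using the common closed-form normalization by the mixing weights (standing in for the matrix A and offset b) rather than the exact Fisher information:

```python
import numpy as np

def fisher_vector(X, pi, mu, sigma2):
    # X: (N, D) local descriptors; pi: (K,) mixing weights;
    # mu, sigma2: (K, D) means and diagonal variances of the GMM.
    N, _ = X.shape
    diff = X[:, None, :] - mu[None, :, :]                        # (N, K, D)
    log_gauss = -0.5 * ((diff ** 2) / sigma2 + np.log(2 * np.pi * sigma2)).sum(axis=2)
    log_post = np.log(pi) + log_gauss                            # (N, K), unnormalized
    log_post -= log_post.max(axis=1, keepdims=True)              # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                      # posteriors p(k | x_n)

    sigma = np.sqrt(sigma2)
    s0 = post.sum(axis=0)                                        # soft counts (K,)
    g_pi = (s0 - N * pi) / np.sqrt(pi)                           # weight gradients
    g_mu = np.einsum('nk,nkd->kd', post, diff / sigma) / np.sqrt(pi)[:, None]
    g_sig = np.einsum('nk,nkd->kd', post, (diff / sigma) ** 2 - 1.0) / np.sqrt(2 * pi)[:, None]
    return np.concatenate([g_pi, g_mu.ravel(), g_sig.ravel()])   # length K + 2KD
```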

SLIDE 13

Related publications

◮ Fisher vectors for non-iid image models

Cinbis, Schmid, Verbeek [CVPR’12, PAMI’16], 40 citations

◮ Approximate power and L2 normalization of FV

Oneata, Schmid, Verbeek [CVPR’14], 23 citations

◮ Application for action and event recognition

Oneata, Schmid, Verbeek, Wang [ICCV’13, IJCV’15], 158 citations

◮ Application for object localization

Cinbis, Schmid, Verbeek [ICCV’13], 64 citations

◮ Fisher vectors for descriptor layout coding

Jurie, Krapac, Verbeek [ICCV’11], 135 citations

SLIDE 14

Fisher vectors for non-iid image models

◮ The independence assumption yields sum-pooling in the FV
  ◮ Bag-of-words [Csurka et al., 2004, Sivic and Zisserman, 2003] and iid GMM FV [Perronnin and Dance, 2007]
◮ A very poor assumption from a modeling perspective
  ◮ Images are locally self-similar
  ◮ The representation should discount frequent events
◮ Compensated by power normalization, Hellinger or χ² kernel
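The compensation referred to here is simple to state in code; a minimal sketch of signed power normalization followed by L2 normalization (α = 0.5 gives the common square-rooting):

```python
import numpy as np

def normalize_fv(fv, alpha=0.5):
    # Signed power normalization: sign(z) * |z|^alpha, which discounts
    # frequent (large-magnitude) statistics, followed by L2 normalization.
    fv = np.sign(fv) * np.abs(fv) ** alpha
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv
```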

SLIDE 15

Replace iid models with non-iid exchangeable counterparts

[Graphical models: Gaussian mixture (left) vs. latent Gaussian mixture (right), with priors (α; a_k, b_k, m_k, β_k) on the mixing weights π and component parameters µ_k, λ_k; plates over i = 1, ..., N and k = 1, ..., K]

◮ Bayesian approach: treat model parameters as latent variables
◮ Compute the Fisher vector w.r.t. the hyper-parameters
◮ Variational inference to approximate intractable gradients

SLIDE 16

Comparison to power normalization

[Plots: (left) transformation of bag-of-word counts under the latent model for α from 10⁻² to 10³, compared to square-rooting; (right) SqrtMoG vs. LatMoG Fisher vector transformations of the Gaussian mean parameter]

◮ Fisher vector of the non-iid model vs. power normalization
◮ Qualitatively similar monotonic concave transformations
◮ The latent variable model explains the effectiveness of power normalization

SLIDE 17

Layout of this presentation

◮ Synthetic review of past activities
◮ Overview of contributions

  • 1. The Fisher vector representation
  • 2. Metric learning approaches
  • 3. Learning with incomplete supervision

◮ Perspectives

SLIDE 18

Metric learning approaches

◮ Measures of similarity or distance have many applications
  ◮ Retrieval and matching of local descriptors or entire images
  ◮ Nearest neighbor prediction models
  ◮ Verification: do two objects belong to the same category?
◮ Supervised training to discover the important features
  ◮ The notion of similarity is task dependent
  ◮ Methods such as FDA [Fisher, 1936] use only second moments

[Illustration: FDA projection, from Mensink et al., 2012]
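Such approaches typically learn a Mahalanobis metric parametrized by a linear projection W, so that d(x, y) = ||W(x − y)||² = (x − y)ᵀ(WᵀW)(x − y). A schematic sketch, illustrative only and not the exact formulation of any of the cited papers; `p_same` follows the general shape of a logistic discriminant model over distances:

```python
import numpy as np

def mahalanobis_dist(W, x, y):
    # Squared Mahalanobis distance induced by a learned projection W:
    # d(x, y) = ||W(x - y)||^2.
    d = W @ (x - y)
    return float(d @ d)

def p_same(W, b, x, y):
    # Logistic model of the probability that x and y belong to the same
    # class: sigma(b - d(x, y)); b acts as a learned distance threshold.
    return 1.0 / (1.0 + np.exp(-(b - mahalanobis_dist(W, x, y))))
```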

SLIDE 19

Related publications

◮ Coordinated Local Metric Learning

Saxena and Verbeek [ICCV’15 Workshop]

◮ Metric learning for nearest class-mean classifier

Csurka, Mensink, Perronnin, Verbeek [PAMI’13, ECCV’12 oral], 126 citations

◮ Multiple instance metric learning

Guillaumin, Schmid, Verbeek [ECCV’10], 83 citations

◮ Discriminative metric learning in nearest neighbor models

Guillaumin, Mensink, Schmid, Verbeek [ICCV’09 oral], 377 citations

◮ Logistic discriminant metric learning

Guillaumin, Schmid, Verbeek [ICCV’09], 420 citations

SLIDE 20

Instantaneous adaptation to new samples and classes

◮ Consider photo-sharing service: stream of labeled images

◮ Re-training a discriminative model for new data is costly ◮ Generative models easily updated, but often perform worse ◮ KNN classifiers are very costly to evaluate for large dataset 20 / 45

SLIDE 21

Instantaneous adaptation to new samples and classes

◮ Consider a photo-sharing service: a stream of labeled images
  ◮ Re-training a discriminative model for new data is costly
  ◮ Generative models are easily updated, but often perform worse
  ◮ KNN classifiers are very costly to evaluate for large datasets

◮ The nearest class-mean classifier is linear and easily updated

  y = arg min_k ||W(x − µ_k)||²    (4)

◮ Maximum likelihood estimation with softmax loss

  p(y = k | x) ∝ exp(−||W(x − µ_k)||²)    (5)

◮ Corresponds to the posterior in a generative Gaussian mixture model

  p(x | y = k) = N(x; µ_k, Σ)    (6)
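Eqs. (4)-(6) translate directly into a tiny classifier; a minimal sketch assuming a fixed learned projection W, where adding a new class only requires storing its mean:

```python
import numpy as np

class NearestClassMean:
    # Nearest class-mean classifier with a learned linear metric W:
    # predict arg min_k ||W (x - mu_k)||^2. New classes are added by
    # storing their mean; the metric W stays fixed.
    def __init__(self, W):
        self.W = W
        self.means = {}

    def add_class(self, label, examples):
        # Instantaneous update: only the class mean is computed.
        self.means[label] = np.mean(examples, axis=0)

    def predict(self, x):
        dists = {k: float(np.sum((self.W @ (x - mu)) ** 2))
                 for k, mu in self.means.items()}
        return min(dists, key=dists.get)
```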

SLIDE 22

Experimental evaluation: ImageNet Challenge 2010

◮ Train 1: metric and means from 1,000 classes
◮ Train 2: metric from 800 classes, means on all 1,000
◮ Test: 200 classes not used for the metric in Train 2

  Error in %        KNN    NCM
  Trained on all    38.4   36.4
  Trained on 800    42.4   39.9

◮ The linear NCM classifier is better than non-parametric KNN
  ◮ In both cases the metric is learned
◮ Training the metric on other classes only moderately impacts performance

SLIDE 23

Visualization of nearest classes using L2 and learned metric

◮ Classes closest to the center of “Palm” in FV image space
◮ The learned Mahalanobis metric is semantically more meaningful
  ◮ Improves prediction accuracy
  ◮ Remaining errors are more sensible

SLIDE 24

Layout of this presentation

◮ Synthetic review of past activities
◮ Overview of contributions

  • 1. The Fisher vector representation
  • 2. Metric learning approaches
  • 3. Learning with incomplete supervision

◮ Perspectives

SLIDE 25

Learning with incomplete supervision

◮ Acquiring labeled training data is often costly
  ◮ Pixel-wise semantic segmentation of images
  ◮ Spatio-temporal localization of actions in video
◮ Latent variable modeling to handle missing labels
  ◮ Iterative learning and inference schemes
  ◮ Expectation-maximization, multiple instance learning

SLIDE 26

Related publications

◮ Multi-fold multiple instance learning

Cinbis, Schmid, Verbeek [CVPR’14, PAMI’16], 48 citations

◮ Tree-structured CRF models for interactive image labeling.

Csurka, Mensink, Verbeek [CVPR’11, PAMI’13], 48 citations

◮ Face recognition from caption-based supervision

Guillaumin, Mensink, Schmid, Verbeek [CVPR’08, ECCV’08 oral, IJCV’12], 172 citations

◮ Unsupervised metric learning for face identification in video

Cinbis, Schmid, Verbeek [ICCV’11], 71 citations

◮ Multimodal semi-supervised learning for image classification

Guillaumin, Schmid, Verbeek [CVPR’10 oral], 246 citations

◮ Web image search using query-relative classifiers

Allan, Jurie, Krapac, Verbeek [CVPR’10], 88 citations

◮ Weakly supervised semantic segmentation

Triggs and Verbeek [CVPR’07, NIPS’08 oral], 348 citations

SLIDE 27

Weakly supervised object localization

◮ Bounding box annotations are expensive and error prone
◮ The image classification response is expected to be strongest on the object
◮ This suggests an iterative refinement procedure
  ◮ Train a model from location hypotheses (initialized from the full image)
  ◮ Update the location hypotheses based on the estimated model

SLIDE 28

Limitation of standard multiple instance learning [Dietterich et al., 1997]

◮ Select the object hypothesis with maximum score

  f(x) = b + ⟨w, x⟩ = b + ∑_{n=1}^{N} a_n ⟨x_n, x⟩    (7)

◮ Data are near orthogonal in high-dimensional feature spaces
◮ This causes immediate and poor convergence of the training process

[Plots: (left) density of inner products between high-dimensional FVs; (right) normalized score frequency of training windows below and above 50% overlap]
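The near-orthogonality claim is easy to verify numerically: inner products between random unit vectors concentrate around zero as the dimension grows, so in Eq. (7) a window's score is dominated by its own dual term and the previously selected hypothesis keeps winning. A small illustrative check, not from the paper:

```python
import numpy as np

def mean_abs_cosine(dim, n_pairs=500, seed=0):
    # Average |cosine similarity| between pairs of random unit vectors;
    # shrinks roughly as 1/sqrt(dim), illustrating near-orthogonality.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_pairs, dim))
    y = rng.normal(size=(n_pairs, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    y /= np.linalg.norm(y, axis=1, keepdims=True)
    return float(np.mean(np.abs(np.sum(x * y, axis=1))))
```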

SLIDE 29

Multi-fold multiple instance learning

◮ Goal: avoid score domination by the previous hypothesis
◮ Approach: split the data into non-overlapping subsets
  ◮ Train the model on all but one set
  ◮ Update hypotheses on the held-out set

[Plot: CorLoc over training iterations for 2/10/20-fold MIL vs. standard MIL, in the high-dimensional FV and 516-dimensional settings]

◮ Note the difference between the high and low dimensional case
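The update scheme above can be sketched as follows; `train_fn` and `score_fn` are hypothetical hooks standing in for detector training and window scoring, and the fold handling is a simplified illustration of the multi-fold idea:

```python
import numpy as np

def multifold_mil_update(bags, hypotheses, train_fn, score_fn, n_folds=10):
    # One round of multi-fold MIL: each fold is re-localized with a model
    # trained on the remaining folds, so a window never scores itself.
    # bags: list of (n_windows, dim) arrays; hypotheses: selected window
    # index per bag; train_fn(features) -> model; score_fn(model, windows).
    folds = np.array_split(np.arange(len(bags)), n_folds)
    new_hyp = list(hypotheses)
    for fold in folds:
        held_out = set(fold.tolist())
        train_feats = np.stack([bags[i][hypotheses[i]]
                                for i in range(len(bags)) if i not in held_out])
        model = train_fn(train_feats)          # model sees other folds only
        for i in held_out:
            new_hyp[i] = int(np.argmax(score_fn(model, bags[i])))
    return new_hyp
```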

SLIDE 30

Part III Perspectives

SLIDE 31

Layout of this presentation

◮ Synthetic review of past activities
◮ Overview of contributions
◮ Perspectives

  • 1. Interface between graphical models and deep learning
  • 2. Towards a generic visual recognition engine
  • 3. Towards systematic network architecture learning

SLIDE 32

Interface between graphical models and deep learning

◮ Relations between probabilistic inference and recurrent nets
  ◮ Mean-field inference as an RNN [Schwing and Urtasun, 2015, Zheng et al., 2015]
  ◮ Generalize to loopy belief propagation and beyond
  ◮ Trainable special-purpose inference algorithms

SLIDE 33

Interface between graphical models and deep learning

◮ Relations between probabilistic inference and recurrent nets
  ◮ Mean-field inference as an RNN [Schwing and Urtasun, 2015, Zheng et al., 2015]
  ◮ Generalize to loopy belief propagation and beyond
  ◮ Trainable special-purpose inference algorithms

◮ Conditional random field models are extremely useful in vision
  ◮ Semantic segmentation, depth estimation, optical flow, ...
  ◮ CNNs used for unary and pairwise terms [Lin et al., 2016]
  ◮ Flexible higher-order terms are missing; an RNN-based approach building on ideas in [Pinheiro and Collobert, 2014]?

SLIDE 34

Interface between graphical models and deep learning

◮ Relations between probabilistic inference and recurrent nets
  ◮ Mean-field inference as an RNN [Schwing and Urtasun, 2015, Zheng et al., 2015]
  ◮ Generalize to loopy belief propagation and beyond
  ◮ Trainable special-purpose inference algorithms

◮ Conditional random field models are extremely useful in vision
  ◮ Semantic segmentation, depth estimation, optical flow, ...
  ◮ CNNs used for unary and pairwise terms [Lin et al., 2016]
  ◮ Flexible higher-order terms are missing; an RNN-based approach building on ideas in [Pinheiro and Collobert, 2014]?

◮ Hierarchical Bayesian networks to define structured priors
  ◮ Parameter sharing: formalize pre-training and fine-tuning
  ◮ Network topology learning: sparsity, stick-breaking, etc. [Adams et al., 2010, Kulkarni et al., 2015]
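The mean-field-as-RNN connection can be made concrete with the update that such networks unroll. A schematic single step for a pairwise model with a Potts-style label compatibility (an assumption of this sketch, which rewards agreement between coupled nodes):

```python
import numpy as np

def mean_field_step(unary, pairwise, Q):
    # One mean-field update for a pairwise CRF, the operation CRF-as-RNN
    # unrolls as a recurrent layer. unary: (n_nodes, n_labels) energies;
    # pairwise: (n_nodes, n_nodes) coupling weights, zero diagonal;
    # Q: (n_nodes, n_labels) current approximate marginals.
    message = pairwise @ Q                       # expected neighbor labels
    logits = -unary + message                    # Potts: agreement rewarded
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    Q_new = np.exp(logits)
    return Q_new / Q_new.sum(axis=1, keepdims=True)
```

Iterating this map a fixed number of times, with the potentials produced by a CNN, is exactly what makes the whole pipeline trainable end-to-end by back-propagation.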

SLIDE 35

Towards a generic visual recognition engine

◮ Explicit supervision is a bottleneck in recognition
  ◮ Cost effectiveness of acquiring a labeled training set
  ◮ Latency: new categories require time to get labeled
  ◮ Once labeled and trained, new categories must be indexed

SLIDE 36

Towards a generic visual recognition engine

◮ Explicit supervision is a bottleneck in recognition
  ◮ Cost effectiveness of acquiring a labeled training set
  ◮ Latency: new categories require time to get labeled
  ◮ Once labeled and trained, new categories must be indexed

◮ Towards a “generic visual recognition engine”
  ◮ Generalization across semantics, e.g. using word embeddings [Frome et al., 2013, Mikolov et al., 2013]
  ◮ Massive noisy weakly supervised datasets [Chatfield et al., 2015, Joulin et al., 2015]
  ◮ Unsupervised learning as regularization [Doersch et al., 2015, Isola et al., 2016, Kingma and Welling, 2014]
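The word-embedding route can be sketched concisely: map an image into the label-embedding space and pick the nearest label vector by cosine similarity, so an unseen category only needs a word vector, not labeled images. An illustrative sketch in the spirit of DeViSE, not the exact model:

```python
import numpy as np

def zero_shot_predict(image_emb, label_embs):
    # image_emb: image mapped into the word-embedding space (by a learned
    # projection, omitted here); label_embs: dict label -> word vector.
    labels = list(label_embs)
    M = np.stack([label_embs[l] for l in labels])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)   # unit label vectors
    v = image_emb / np.linalg.norm(image_emb)
    return labels[int(np.argmax(M @ v))]               # nearest by cosine
```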

SLIDE 37

Towards systematic network architecture learning

◮ CNN architecture-land is a scary place

  ◮ Hyper-parameters include: number of layers, number of channels per layer, filter size per layer, stride per layer, number of pooling vs. convolutional layers, type of pooling operator per layer, size of the pooling regions, ordering of pooling and convolutional layers, channel connectivity pattern between layers, type of activation per layer (ReLU, MaxOut)

SLIDE 38

Towards systematic network architecture learning

◮ CNN architecture-land is a scary place

  ◮ Hyper-parameters include: number of layers, number of channels per layer, filter size per layer, stride per layer, number of pooling vs. convolutional layers, type of pooling operator per layer, size of the pooling regions, ordering of pooling and convolutional layers, channel connectivity pattern between layers, type of activation per layer (ReLU, MaxOut)

◮ Exponentially large: exhaustive search is intractable
  ◮ Massive reliance on pre-trained nets: AlexNet, VGG-16/19
  ◮ Local search techniques [Chen et al., 2016]

SLIDE 39

Towards systematic network architecture learning

◮ CNN architecture-land is a scary place

  ◮ Hyper-parameters include: number of layers, number of channels per layer, filter size per layer, stride per layer, number of pooling vs. convolutional layers, type of pooling operator per layer, size of the pooling regions, ordering of pooling and convolutional layers, channel connectivity pattern between layers, type of activation per layer (ReLU, MaxOut)

◮ Exponentially large: exhaustive search is intractable
  ◮ Massive reliance on pre-trained nets: AlexNet, VGG-16/19
  ◮ Local search techniques [Chen et al., 2016]

◮ Important recent advances are orthogonal: residual and highway networks [He et al., 2016, Srivastava et al., 2015]

SLIDE 40

Making progress

SLIDE 41

Making progress

◮ Formulate as a continuous optimization problem by relaxation
◮ Recent work with Shreyas Saxena, currently under review

SLIDE 42

Meeting structural desiderata

◮ Should encompass all CNN architectures
  ◮ Up to a certain size
◮ Should consist of few atomic building blocks
  ◮ Recombined in exponentially many configurations
◮ Should be configurable with continuous parameters
  ◮ Learning the architecture from data

SLIDE 43

Meeting structural desiderata

◮ Should encompass all CNN architectures
  ◮ Up to a certain size
◮ Should consist of few atomic building blocks
  ◮ Recombined in exponentially many configurations
◮ Should be configurable with continuous parameters
  ◮ Learning the architecture from data

◮ All conditions are met by a multi-dimensional network
  ◮ Ensembles of all architectures are also included
  ◮ Joint classification and segmentation output is possible

[Figure: multi-dimensional network spanning scales and layers, from input to output]

SLIDE 44

The Convolutional Neural Fabric

◮ Node: one response map at a particular scale
◮ Layer axis: all signal flows in this direction
◮ Scale axis: full scale pyramid down to a 1×1 map, S = log₂ N_pix
◮ Channel axis: different maps of the same scale in the same layer

[Figure: fabric trellis with scale, layer and channel axes, showing input and internal nodes]

SLIDE 45

A homogeneous local connectivity structure

◮ Activations are computed as a sum of convolutions over neighboring scales and channels at the previous layer

  a(s, c, l) = ∑_{i=−1}^{+1} ∑_{j=−1}^{+1} Conv(a(s+i, c+j, l−1); W^{ij}_{s,c,l})    (8)

◮ Trellis → standard back-propagation [Rumelhart et al., 1986]

[Figure: fabric trellis with scale, layer and channel axes, showing input and internal nodes]
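Eq. (8) can be sketched directly; for brevity the convolutions are reduced to scalar (1×1) filters and cross-scale resampling is omitted, both simplifications relative to the actual fabric:

```python
import numpy as np

def fabric_node(acts_prev, s, c, weights):
    # Activation of fabric node (scale s, channel c) at the current layer:
    # sum over the 3x3 neighborhood of scales/channels of the previous
    # layer, cf. Eq. (8). acts_prev: dict (scale, channel) -> 2D map;
    # weights: dict (i, j) -> scalar filter (1x1 convolution for brevity).
    out = None
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            key = (s + i, c + j)
            if key in acts_prev:                 # skip neighbors off the grid
                term = weights[(i, j)] * acts_prev[key]
                out = term if out is None else out + term
    return out
```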

SLIDE 46

Experimental evaluation: Part Labels

  Part Labels                              Year   # Params.   SP Acc.   P Acc.
  Tsogkas et al. [Tsogkas et al., 2015]    2015   >414M       96.97     —
  Liu et al. [Liu et al., 2015]            2015   >33M        —         95.24
  Kae et al. [Kae et al., 2013]            2013   0.7M        94.95     —
  Convolutional Neural Fabric                     0.1M        95.58     94.60

[Figure: example images with predicted part-label segmentations]

◮ Competitive with the best hand-crafted architectures
  ◮ Without a structured prediction model: CRF, RBM, etc.
  ◮ Much fewer (pre-trained) parameters
  ◮ Only using 2,000 training images, instead of >1M from ImageNet

SLIDE 47

Mean-squared filter weight per edge: MNIST classification

[Figure: mean-squared filter weight per fabric edge, over scales and layers from input to output]

◮ Multiple scales are active per layer, and multiple layers per scale

SLIDE 48

Personal perspective: 2005 — 2015 — 2025

◮ 2005: a fresh PhD looking for an applied machine learning field
◮ Machine vision: difficult structured prediction problems
◮ INRIA-LEAR: vision + machine learning + mountains = :-)

40 / 45

SLIDE 49

Personal perspective: 2005 — 2015 — 2025

◮ 2005: a fresh PhD looking for an applied machine learning field
◮ Machine vision: difficult structured prediction problems
◮ INRIA-LEAR: vision + machine learning + mountains = :-)

◮ 2025: Back to the future, back to machine learning?
◮ Convergence of machine perception techniques, following decades of divergence in specialized fields
◮ Fundamental progress from general learning, inference, and memory mechanisms rather than from application specifics

SLIDE 50

Thank you!

In particular to my former and current students: Matthieu, Josip, Thomas, Gokberk, Dan, Shreyas, Pauline.

SLIDE 51

References I

[Adams et al., 2010] Adams, R., Wallach, H., and Ghahramani, Z. (2010). Learning the structure of deep sparse graphical models. In AISTATS.

[Chatfield et al., 2015] Chatfield, K., Arandjelović, R., Parkhi, O., and Zisserman, A. (2015). On-the-fly learning for visual search of large-scale image and video datasets. International Journal of Multimedia Information Retrieval.

[Chen et al., 2016] Chen, T., Goodfellow, I., and Shlens, J. (2016). Net2Net: Accelerating learning via knowledge transfer. In ICLR.

[Csurka et al., 2004] Csurka, G., Dance, C., Fan, L., Willamowski, J., and Bray, C. (2004). Visual categorization with bags of keypoints. In ECCV Int. Workshop on Stat. Learning in Computer Vision.

[Dietterich et al., 1997] Dietterich, T., Lathrop, R., and Lozano-Pérez, T. (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71.

[Doersch et al., 2015] Doersch, C., Gupta, A., and Efros, A. (2015). Unsupervised visual representation learning by context prediction. In ICCV.

[Fisher, 1936] Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188.

[Frome et al., 2013] Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M., and Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. In NIPS.

SLIDE 52

References II

[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. arXiv preprint.

[Isola et al., 2016] Isola, P., Zoran, D., Krishnan, D., and Adelson, E. (2016). Learning visual groups from co-occurrences in space and time. In ICLR.

[Jaakkola and Haussler, 1999] Jaakkola, T. and Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In NIPS.

[Joulin et al., 2015] Joulin, A., van der Maaten, L., Jabri, A., and Vasilache, N. (2015). Learning visual features from large weakly supervised data. arXiv preprint.

[Kae et al., 2013] Kae, A., Sohn, K., Lee, H., and Learned-Miller, E. (2013). Augmenting CRFs with Boltzmann machine shape priors for image labeling. In CVPR.

[Kingma and Welling, 2014] Kingma, D. and Welling, M. (2014). Auto-encoding variational Bayes. In ICLR.

[Kulkarni et al., 2015] Kulkarni, P., Zepeda, J., Jurie, F., Pérez, P., and Chevallier, L. (2015). Learning the structure of deep architectures using l1 regularization. In BMVC.

[Lin et al., 2016] Lin, G., Shen, C., van den Hengel, A., and Reid, I. (2016). Efficient piecewise training of deep structured models for semantic segmentation. In CVPR.

[Liu et al., 2015] Liu, S., Yang, J., Huang, C., and Yang, M.-H. (2015). Multi-objective convolutional learning for face labeling. In CVPR.

SLIDE 53

References III

[Mensink et al., 2012] Mensink, T., Verbeek, J., Perronnin, F., and Csurka, G. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.

[Mikolov et al., 2013] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In NIPS.

[Perronnin and Dance, 2007] Perronnin, F. and Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In CVPR.

[Pinheiro and Collobert, 2014] Pinheiro, P. and Collobert, R. (2014). Recurrent convolutional neural networks for scene labeling. In ICML.

[Rumelhart et al., 1986] Rumelhart, D., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323:533–536.

[Schwing and Urtasun, 2015] Schwing, A. and Urtasun, R. (2015). Fully connected deep structured networks. arXiv preprint.

[Sivic and Zisserman, 2003] Sivic, J. and Zisserman, A. (2003). Video Google: a text retrieval approach to object matching in videos. In ICCV.

[Srivastava et al., 2015] Srivastava, R., Greff, K., and Schmidhuber, J. (2015). Training very deep networks. In NIPS.

[Tsogkas et al., 2015] Tsogkas, S., Kokkinos, I., Papandreou, G., and Vedaldi, A. (2015). Deep learning for semantic part segmentation with high-level guidance. arXiv preprint.

SLIDE 54

References IV

[Zheng et al., 2015] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., and Torr, P. (2015). Conditional random fields as recurrent neural networks. In ICCV.