Deep TEN: Texture Encoding Network
Hang Zhang, Jia Xue, Kristin Dana



SLIDE 1

Deep TEN: Texture Encoding Network

Hang Zhang, Jia Xue, Kristin Dana

Hang Zhang

SLIDE 2

Highlight and Overview

  • Introduced Encoding-Net, a new architecture of CNNs
  • Achieved state-of-the-art results on texture recognition: MINC-2500, FMD, GTOS, KTH, 4D-Light
  • Released the arXiv paper (CVPR 17) and a Torch implementation (GPU backend)

SLIDE 3

Challenges for Texture Recognition

  • Orderless
  • Distributions

SLIDE 4

Classic Vision Approaches

SLIDE 5

Classic Vision Approaches

Feature extraction: filter-bank responses or SIFT

SLIDE 6

Classic Vision Approaches

Feature extraction → Dictionary Learning

SLIDE 7

Classic Vision Approaches

Feature extraction → Dictionary Learning → Encoding

Encoding: Bag-of-Words, VQ, or VLAD

SLIDE 8

Classic Vision Approaches

Feature extraction → Dictionary Learning → Encoding → Classifier

SLIDE 9

Classic Vision Approaches

Feature extraction → Dictionary Learning → Encoding → Classifier

  • The input image sizes are flexible
  • No domain-transfer problem

SLIDE 10

Comparing to Deep Learning Framework

Feature extraction → Dictionary Learning → Encoding → SVM (classic) vs. FC Layer (deep learning)

  • Preserves spatial info
  • Domain transfer
  • Fixed input size

SLIDE 11

Comparing to Deep Learning Framework

Feature extraction → Dictionary Learning → Encoding → SVM (classic) vs. FC Layer (deep learning)

  • Can we bridge the gap?

SLIDE 12

Hybrid Solution

BoWs: SIFT / Filter Bank Responses → Dictionary → Histogram Encoding → SVM

SLIDE 13

Hybrid Solution and Its Limitation

  • Off-the-shelf: the dictionary and the encoders are fixed once built
  • Feature learning and encoding do not benefit from the labeled data

BoWs: SIFT / Filter Bank Responses → Dictionary → Histogram Encoding → SVM (off-the-shelf)
FV-CNN: Pre-trained CNNs → Dictionary → Fisher Vector → SVM (off-the-shelf)

SLIDE 14

End-to-end Encoding

Deep-TEN: Convolutional Layers → Dictionary → Residual Encoding (Encoding Layer) → FC Layer (end-to-end)
BoWs: SIFT / Filter Bank Responses → Dictionary → Histogram Encoding → SVM (off-the-shelf)
FV-CNN: Pre-trained CNNs → Dictionary → Fisher Vector → SVM (off-the-shelf)

SLIDE 15

Bag-of-Words (BoW) Encoder

  • Given a set of visual features $Y = \{y_1, \dots, y_N\}$ and a learned codebook $C = \{d_1, \dots, d_K\}$ (the input features are $D$-dimensional, $N$ is the number of visual features, and $K$ is the number of codewords)
  • The assignment weight $b_{ik}$ corresponds to the visual feature $y_i$ assigned to each codeword $d_k$. Hard-assignment: $b_{ik} = \mathbb{1}\big(\|y_i - d_k\|^2 = \min_{j \in \{1,\dots,K\}} \|y_i - d_j\|^2\big)$
  • BoWs counts the occurrences of the visual words: $\sum_i b_{ik}$
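The hard-assignment histogram can be sketched in a few lines of numpy; `bow_encode` and the toy features/codebook below are illustrative stand-ins, not the paper's code.

```python
import numpy as np

def bow_encode(Y, C):
    """Hard-assignment bag-of-words histogram (numpy sketch).

    Y: (N, D) visual features; C: (K, D) codebook.
    Returns a length-K histogram: for each codeword, the number of
    features whose nearest codeword it is.
    """
    # Squared distance from every feature to every codeword: (N, K)
    d2 = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)            # hard assignments b_ik
    return np.bincount(nearest, minlength=len(C))

# Two features near codeword 0 and one near codeword 1:
Y = np.array([[0.1, 0.0], [0.0, 0.1], [1.0, 1.1]])
C = np.array([[0.0, 0.0], [1.0, 1.0]])
print(bow_encode(Y, C))                    # [2 1]
```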

SLIDE 16

Residual Encoders

  • The Fisher Vector concatenates the gradients of the GMM with respect to the mean and standard deviation:
    $H^{\mu}_k = \sum_{i=1}^{N} b_{ik}\,(y_i - d_k)$,  $H^{\sigma}_k = \sum_{i=1}^{N} b_{ik}\,\big((y_i - d_k)^2 - 1\big)$
  • VLAD (1st order, hard-assignment): $W_k = \sum_{i:\,\mathrm{NN}(y_i)=d_k} (y_i - d_k)$
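The VLAD aggregation above can be written as a short numpy sketch; `vlad_encode` and the toy data are illustrative, not the paper's implementation.

```python
import numpy as np

def vlad_encode(Y, C):
    """First-order, hard-assignment residual encoding (VLAD sketch).

    For each codeword d_k, sum the residuals y_i - d_k over the
    features whose nearest codeword is d_k. Output shape: (K, D).
    """
    # Nearest codeword for each feature (the hard assignment NN(y_i))
    d2 = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)
    V = np.zeros_like(C)
    for i, k in enumerate(nearest):
        V[k] += Y[i] - C[k]                # accumulate residual
    return V

Y = np.array([[0.2, 0.0], [0.0, 0.2], [1.0, 1.0]])
C = np.array([[0.0, 0.0], [1.0, 1.0]])
# Codeword 0 collects [0.2, 0] + [0, 0.2]; codeword 1 gets [0, 0].
```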

SLIDE 17

Residual Encoding Model

  • Residual vector: $s_{ik} = y_i - d_k$
  • Aggregating residuals with assignment weights: $f_k = \sum_i b_{ik}\, s_{ik}$

Encoding-Layer: Input → Dictionary → Residuals → Assign → Aggregate

SLIDE 18

Feature Distributions and Assigning

  • Soft-assignment: $b_{ik} = \dfrac{\exp(-\gamma \|s_{ik}\|^2)}{\sum_{j=1}^{K} \exp(-\gamma \|s_{ij}\|^2)}$
  • Learnable smoothing factor: $b_{ik} = \dfrac{\exp(-t_k \|s_{ik}\|^2)}{\sum_{j=1}^{K} \exp(-t_j \|s_{ij}\|^2)}$
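Putting the residuals, soft assignments with learnable smoothing factors, and aggregation together gives the Encoding Layer's forward pass. The numpy sketch below is a minimal stand-in for the released Torch code; the random toy inputs are assumptions for illustration.

```python
import numpy as np

def encoding_layer(Y, C, t):
    """Forward pass of a residual encoding layer (numpy sketch).

    Y: (N, D) features, C: (K, D) dictionary, t: (K,) smoothing
    factors t_k. Returns f of shape (K, D), f_k = sum_i b_ik * s_ik.
    """
    S = Y[:, None, :] - C[None, :, :]      # residuals s_ik: (N, K, D)
    d2 = (S ** 2).sum(-1)                  # ||s_ik||^2: (N, K)
    logits = -t[None, :] * d2
    logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    B = np.exp(logits)
    B /= B.sum(axis=1, keepdims=True)      # soft assignments b_ik
    return (B[:, :, None] * S).sum(axis=0)  # aggregate over features i

rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 3))                # N = 8 features
C = rng.normal(size=(4, 3))                # K = 4 codewords
f = encoding_layer(Y, C, t=np.ones(4))
# f.shape == (4, 3): fixed-length output regardless of N
```

Because every step is differentiable in $Y$, $C$, and $t$, an autograd framework can backpropagate through this forward pass, which is what makes the layer trainable end-to-end.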

SLIDE 19

End-to-end Learning

  • The loss function is differentiable w.r.t. the input $Y$ and the parameters (the dictionary $\{d_k\}$ and the smoothing factors $\{t_k\}$)
  • The Encoding Layer can therefore be trained end-to-end by standard Stochastic Gradient Descent (SGD) with backpropagation

Deep-TEN: Convolutional Layers → Dictionary → Residual Encoding (Encoding Layer) → FC Layer (end-to-end)

SLIDE 20-22 (figure-only slides)

SLIDE 23

Relation to Dictionary Learning

  • Dictionary learning is usually achieved by unsupervised grouping (e.g. K-means) or by minimizing a reconstruction error (e.g. K-SVD).
  • The Encoding Layer makes the inherent dictionary differentiable w.r.t. the loss function, so the dictionary is learned in a supervised manner.

SLIDE 24

Relation to BoWs and Residual Encoders

  • Generalizes BoWs, VLAD & Fisher Vector
  • Accepts arbitrary input sizes and outputs a fixed-length representation
  • NetVLAD decouples the codewords from their assignments: $b = g(y)$ instead of $b = g(y, d)$

Encoding-Layer: Input → Dictionary → Residuals → Assign → Aggregate

SLIDE 25

Relation to Global Pooling Layer

  • Sum pooling (avg pooling): let $K = 1$ and $d = 0$; then $f = \sum_{i=1}^{N} y_i$ and $\frac{\partial \ell}{\partial y_i} = \frac{\partial \ell}{\partial f}$
  • SPP-Layer (He et al., ECCV 2014): fixes the number of bins instead of the receptive field (reshaping, arbitrary input size)
  • Bilinear Pooling (Lin et al., ICCV 2015): sum of the outer products across different locations
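The sum-pooling special case can be checked numerically: with a single codeword fixed at the origin, the softmax over one entry is identically 1 and the encoding collapses to a plain sum. A minimal sketch with toy numbers:

```python
import numpy as np

# With one codeword (K = 1) fixed at d = 0, the residual encoder's
# softmax over a single codeword is identically 1, so the aggregate
# f = sum_i b_i1 * (y_i - 0) collapses to plain sum pooling.
Y = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

S = Y - 0.0                    # residuals s_i1 = y_i - d, with d = 0
B = np.ones((len(Y), 1))       # b_i1 = 1: softmax over one entry
f = (B * S).sum(axis=0)

assert np.allclose(f, Y.sum(axis=0))   # identical to sum pooling
```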

SLIDE 26

Methods Overview

SLIDE 27

Domain Transfer

  • The residual encoding representation: $f_k = \sum_i b_{ik}\, s_{ik}$
  • For a visual feature $y_i$ that appears frequently in the data:
  • It is likely to be close to a visual center $d_k$
  • $f_k$ is close to zero, since $s_{ik} = y_i - d_k \approx 0$
  • $f_j$ ($j \neq k$) is also close to zero, since $b_{ij} = \dfrac{\exp(-t_j \|s_{ij}\|^2)}{\sum_{m=1}^{K} \exp(-t_m \|s_{im}\|^2)} \approx 0$
  • The residual encoding thus discards frequently appearing features, which are likely to be domain-specific (useful for fine-tuning pre-trained features)
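This suppression effect can be seen in a small numerical experiment; the cluster sizes and coordinates below are toy assumptions chosen to make the effect visible, not values from the paper.

```python
import numpy as np

def encoding_layer(Y, C, t):
    """Residual encoding with soft assignments (numpy sketch)."""
    S = Y[:, None, :] - C[None, :, :]
    d2 = (S ** 2).sum(-1)
    B = np.exp(-t * d2)
    B /= B.sum(axis=1, keepdims=True)
    return (B[:, :, None] * S).sum(axis=0)

# 50 "domain-specific" features sitting almost exactly on codeword 0,
# plus one rare feature near codeword 1 (toy numbers, for illustration).
frequent = np.full((50, 2), 1e-4)
rare = np.array([[2.5, 2.5]])
Y = np.vstack([frequent, rare])
C = np.array([[0.0, 0.0], [3.0, 3.0]])

f = encoding_layer(Y, C, t=np.ones(2))
# The 50 frequent features contribute almost nothing to the encoding,
# while the single rare feature dominates f_1.
```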

SLIDE 28

Experiments

  • Datasets
  • Gold-standard material & texture datasets: MINC-2500, KTH, FMD
  • Two recent datasets: GTOS, Light Field
  • General recognition datasets: MIT-Indoor, Caltech-101
  • Baseline approaches (off-the-shelf)
  • FV-SIFT (128 Gaussian components, 32K → 512 dimensions)
  • FV-CNN (Cimpoi et al.; pre-trained VGG-VD & ResNet, 32 GMM components)

SLIDE 29

Dataset Examples

SLIDE 30

Deep-TEN Architecture

SLIDE 31

Comparing to the Baselines

SLIDE 32

Multi-size Training (using different image sizes)

  • Deep-TEN ideally accepts arbitrary input sizes (larger than a constant)
  • Train with predefined sizes, iterated across different epochs, without modifying the solver
  • Adopt single-size testing for simplicity
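The epoch-level size schedule can be sketched in plain Python. `train_one_epoch` and the concrete size values are hypothetical stand-ins, not taken from the released code.

```python
# Sketch of the multi-size training schedule: iterate over predefined
# input sizes in different epochs without modifying the solver.
SIZES = [224, 352]                 # assumed example sizes

def train_one_epoch(size, history):
    # A real implementation would resize/crop batches to `size` and
    # run one epoch of SGD; here we just record the schedule.
    history.append(size)

history = []
for epoch in range(4):
    train_one_epoch(SIZES[epoch % len(SIZES)], history)

assert history == [224, 352, 224, 352]
```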

SLIDE 33

Multi-size Training

SLIDE 34

Comparing to State-of-the-Art

  • Prior approaches:
  • (1) rely on assembling features
  • (2) adopt an additional SVM classifier for classification

SLIDE 35

Extra Thoughts

  • There are many labeled datasets: object recognition, scene understanding, material recognition
  • How can we benefit from them?
  • Simply merging datasets (different labeling strategies)
  • Sharing convolutional features (domain-transfer problem)

SLIDE 36

Joint Encoding

  • Multi-task learning
  • The Encoding Layer carries the domain-specific information
  • The Convolutional Layers are generic
  • Joint training on two datasets:
  • CIFAR-10 (50,000 training images of size 36×36)
  • STL-10 (5,000 training images of size 96×96)

Architecture: shared Conv Layers, with separate Encoding Layers E1 (CIFAR) and E2 (STL)
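The shared-backbone, per-dataset-head idea can be sketched with toy linear "features" in numpy; the shapes, random weights, and names (`W_shared`, `C_cifar`, `C_stl`) are illustrative assumptions, not the actual network.

```python
import numpy as np

rng = np.random.default_rng(1)
W_shared = rng.normal(size=(3, 8)) * 0.1   # stand-in for shared conv layers
C_cifar = rng.normal(size=(4, 8))          # dataset-specific dictionary (E1)
C_stl = rng.normal(size=(4, 8))            # dataset-specific dictionary (E2)

def shared_features(X):
    return X @ W_shared                    # generic "convolutional" features

def encode(Y, C):
    S = Y[:, None, :] - C[None, :, :]      # residuals per codeword
    logits = -(S ** 2).sum(-1)
    logits = logits - logits.max(axis=1, keepdims=True)
    B = np.exp(logits)
    B /= B.sum(axis=1, keepdims=True)      # soft assignments
    return (B[:, :, None] * S).sum(axis=0)

# Different input sizes, same shared backbone, separate encoding heads.
x_cifar = rng.normal(size=(36, 3))         # e.g. 36 spatial positions
x_stl = rng.normal(size=(96, 3))           # e.g. 96 spatial positions
f1 = encode(shared_features(x_cifar), C_cifar)
f2 = encode(shared_features(x_stl), C_stl)
# Both heads emit the same fixed-length (4, 8) representation.
```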

SLIDE 37

Experimental Results for Joint Training

  • Joint training on two datasets (simple network architecture)
  • CIFAR-10 (50,000 training images of size 36×36)
  • STL-10 (5,000 training images of size 96×96)

The state of the art for CIFAR-10 is 95.4% using a 1,001-layer ResNet (He et al., ECCV 2016)

SLIDE 38

Summary

  • Proposed a new model
  • Integrates the entire dictionary learning and encoding pipeline into a single CNN layer
  • Generalizes residual encoders (VLAD, FV); well suited for texture recognition, achieving state-of-the-art results
  • Introduced a new CNN architecture
  • Makes the deep learning framework more flexible by allowing arbitrary input image sizes
  • Carries domain-specific information and makes the learned features easier to transfer

SLIDE 39

Thank you!

  • We provide an efficient Torch implementation with a CUDA backend at https://github.com/zhanghang1989/Deep-Encoding
