

SLIDE 1

Unsupervised Discovery of Object Landmarks as Structural Representations

Yuting Zhang¹, Yijie Guo¹, Yixin Jin¹, Yijun Luo¹, Zhiyuan He¹, Honglak Lee¹,²

¹ University of Michigan, Ann Arbor  ² Google Brain

SLIDE 2

Our paper: Unsupervised Discovery of Object Landmarks as Structural Representations

Structural representations of images

  • Computer vision seeks to understand visual structures.
      • Poses, contours, 3D shapes, …
      • Physically conceptualized, perceptible by humans
  • Deep neural networks can learn latent representations.
      • Desired properties: distributed, sparse, transferable, …
      • Not as conceptualized and interpretable as explicit structures
  • Extra supervision is needed to bridge the gap between latent representations and explicit structures
      • costly to obtain and often unavailable

SLIDE 3

Structural representations of images

  • Computer vision seeks to understand visual structures.
      • Poses, contours, 3D shapes, …
      • Physically conceptualized, perceptible by humans
  • Deep neural networks can learn latent representations.
      • Desired properties: distributed, sparse, transferable, …
      • Not as conceptualized and interpretable as explicit structures
  • Typically, extra supervision is needed to bridge the gap between latent representations and explicit structures
      • costly to obtain and often unavailable

Can we train a deep neural network to get image representations of explicit structures without supervision?

SLIDE 4

The explicit structure

  • We consider a specific type of explicit structure: object landmarks
      • Compact representation of object shapes
      • Generally applicable to many object categories

Can we train a deep neural network to get image representations of explicit structures without supervision?

SLIDE 5

Our framework

Unsupervised landmark discovery Image representation Latent features

Task

SLIDE 6

Our framework

Image representation Latent features Image reconstruction Unsupervised landmark discovery

SLIDE 7

Our framework

Latent features Image reconstruction Unsupervised landmark discovery

SLIDE 8

Technical outline

Latent features Image reconstruction

  • Unsupervised object landmark discovery

Unsupervised landmark discovery

SLIDE 9

Technical outline

Latent features Image reconstruction

  • Unsupervised object landmark discovery
  • A fully differentiable neural network architecture
  • The image reconstruction can encourage the learning of informative landmarks and features.

Training signal

Unsupervised landmark discovery

SLIDE 10

Technical outline

Latent features Image reconstruction Unsupervised landmark discovery

SLIDE 11

Overview of our neural network architecture

Input image Reconstructed image Latent features Landmark coordinates

SLIDE 12

Overview of our neural network architecture

Input image Latent features Reconstructed image Landmark coordinates Input image Landmark coordinates

Related work:

James Thewlis, Hakan Bilen, and Andrea Vedaldi, “Unsupervised learning of object landmarks by factorized spatial embeddings,” In ICCV, 2017.

Unsupervised landmark discovery

  • A differentiable formulation
  • Unsupervised constraints to define a valid landmark detector

SLIDE 13

Landmark detector: Architecture

Encoder-decoder with skip-links Foreground Background Input image Landmark coordinates

Channel-wise softmax

Heatmap to coordinate

SLIDE 14

Landmark detector: Architecture

Encoder-decoder with skip-links Foreground Input image Landmark coordinates

Channel-wise softmax

Heatmap to coordinate Background

SLIDE 15

From heatmaps to coordinates

Ours:

A foreground heatmap Isotropic Gaussian approximation

  • Averaged coordinate weighted by the heatmap
  • (x,y) is differentiable with respect to the heatmap

$\mathcal{N}\big((x, y),\ \operatorname{diag}(\sigma, \sigma)\big)$

Landmark coordinate
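
The "averaged coordinate weighted by the heatmap" step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `heatmap_to_coordinate` and the `heatmap[row][col]` layout (x along columns, y along rows) are assumptions made for illustration. Because the result is a ratio of weighted sums, it is differentiable with respect to every heatmap value.

```python
def heatmap_to_coordinate(heatmap):
    """Return the heatmap-weighted mean (x, y) of the pixel coordinates.

    Assumed layout: heatmap[row][col] holds a non-negative weight,
    x runs along columns and y runs along rows.
    """
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap must have positive total weight")
    x = sum(w * col for row in heatmap for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(heatmap) for w in row) / total
    return x, y


# A 3x3 heatmap with all of its mass at row 1, column 2.
peak = [[0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0]]
print(heatmap_to_coordinate(peak))  # (2.0, 1.0)
```

A uniform heatmap yields the image center, which is why further constraints are needed before the coordinate behaves like a landmark.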

SLIDE 16

Landmark discovery

  • The neural network can be used to output landmark coordinates.
  • However, without additional training objectives, the landmark coordinates can be arbitrary latent features.

(x1, y1) (x2, y2) … (xK, yK)

Can be arbitrary without physical meanings

3 desirable properties for a landmark detector

SLIDE 17

Original heatmap Gaussian heatmap

Property 1: Concentration of heatmap values

For a detector, the output heatmap should concentrate in a local region.

  • Encourage the Gaussian variance to be small.

Earlier stage Later stage
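
One way to express "encourage the variance to be small" is to penalize the spatial variance of each heatmap. The sketch below is a simplified stand-in: the slide does not give the exact loss, so the plain sum of variances and the names `heatmap_variance` and `concentration_penalty` are assumptions for illustration only.

```python
def heatmap_variance(heatmap):
    """Return (var_x, var_y) of pixel coordinates under the normalized heatmap."""
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap must have positive total weight")
    mean_x = sum(w * c for row in heatmap for c, w in enumerate(row)) / total
    mean_y = sum(w * r for r, row in enumerate(heatmap) for w in row) / total
    var_x = sum(w * (c - mean_x) ** 2
                for row in heatmap for c, w in enumerate(row)) / total
    var_y = sum(w * (r - mean_y) ** 2
                for r, row in enumerate(heatmap) for w in row) / total
    return var_x, var_y


def concentration_penalty(heatmaps):
    """Sum of x and y variances over all landmark heatmaps (illustrative form)."""
    return sum(sum(heatmap_variance(h)) for h in heatmaps)


concentrated = [[0.0, 1.0, 0.0]]   # all mass on one pixel
spread = [[1.0, 0.0, 1.0]]         # mass split between two distant pixels
print(concentration_penalty([concentrated]), concentration_penalty([spread]))
```

The concentrated heatmap incurs no penalty, while the spread one does, which matches the intended direction of the constraint.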

SLIDE 18

Property 2: Separation of landmarks

  • Different landmarks should cover different visual semantics.
  • Penalize if the pairwise distances among landmarks are too small.

$L_{\mathrm{sep}} = \sum_{k \neq k'}^{1,\dots,K} \exp\left( -\frac{\left\| (x_{k'}, y_{k'}) - (x_k, y_k) \right\|_2^2}{2\sigma_{\mathrm{sep}}^2} \right)$
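
As a rough illustration of the separation penalty, the sketch below reads the formula as a sum over ordered pairs k ≠ k′ of exp(−‖(x_k′, y_k′) − (x_k, y_k)‖² / (2σ_sep²)). The function name `separation_loss` and the list-of-tuples input are assumptions, not the authors' code.

```python
import math


def separation_loss(landmarks, sigma_sep):
    """Sum over ordered pairs k != k' of exp(-squared_distance / (2 * sigma_sep^2)).

    `landmarks` is a list of (x, y) tuples; the value grows as landmarks
    move closer together.
    """
    total = 0.0
    for k, (xk, yk) in enumerate(landmarks):
        for k2, (xk2, yk2) in enumerate(landmarks):
            if k == k2:
                continue
            sq_dist = (xk2 - xk) ** 2 + (yk2 - yk) ** 2
            total += math.exp(-sq_dist / (2.0 * sigma_sep ** 2))
    return total


print(separation_loss([(0.0, 0.0), (0.0, 0.0)], sigma_sep=1.0))    # 2.0
print(separation_loss([(0.0, 0.0), (10.0, 0.0)], sigma_sep=1.0))   # close to 0
```

Two coincident landmarks give the maximum contribution of 1 per ordered pair, and the penalty decays quickly once landmarks are more than a few σ_sep apart.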

SLIDE 19

Property 3: Equivariance

  • Consider a transformation g that does not change local visual semantics.
  • The landmarks on the two images should satisfy the same transformation g.

SLIDE 23

Property 3: Equivariance

  • Consider a transformation g that does not change local visual semantics.
  • The landmarks on the two images should satisfy the same transformation g.
  • Equivariance for landmark discovery has been explored by Thewlis et al., 2017.
  • Ours is formulated directly on the landmark coordinates.

$L_{\mathrm{eqv}} = \sum_{k=1}^{K} \left\| g(x'_k, y'_k) - (x_k, y_k) \right\|_2^2$

(Thewlis et al., 2017) James Thewlis, Hakan Bilen, and Andrea Vedaldi, “Unsupervised learning of object landmarks by factorized spatial embeddings,” In ICCV, 2017.
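
A minimal sketch of the equivariance penalty, reading the slide's formula as the sum over k of ‖g(x′_k, y′_k) − (x_k, y_k)‖². The function name `equivariance_loss` and the convention that `g` is a callable mapping a coordinate pair to a coordinate pair are assumptions for illustration.

```python
def equivariance_loss(landmarks, landmarks_prime, g):
    """Sum over k of || g(x'_k, y'_k) - (x_k, y_k) ||^2.

    `landmarks` holds (x_k, y_k) detected on one image, `landmarks_prime`
    holds (x'_k, y'_k) detected on the other image, and `g` is the known
    coordinate transformation relating the two.
    """
    if len(landmarks) != len(landmarks_prime):
        raise ValueError("both images must have the same number of landmarks")
    total = 0.0
    for (x, y), (xp, yp) in zip(landmarks, landmarks_prime):
        gx, gy = g(xp, yp)
        total += (gx - x) ** 2 + (gy - y) ** 2
    return total


def shift_right(x, y):
    """Example transformation: translate by 2 pixels along x."""
    return x + 2.0, y


print(equivariance_loss([(3.0, 1.0)], [(1.0, 1.0)], shift_right))  # 0.0
```

The loss is zero exactly when the detector's outputs on the two images are related by g, and it needs no landmark annotations because g is known by construction.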

SLIDE 24

Property 3: Equivariance – the transformation

  • Random thin-plate-spline (TPS) to synthesize the transformation g
      • Global affine: translation, scaling, rotation
      • Local TPS:
  • For videos, also use optical flow as the transformation g
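
The global affine part of the synthesized transformation (translation, scaling, rotation) can be sketched as below. The local TPS warp is not covered here. The function name `random_affine`, the parameter names, and the sampling ranges are all hypothetical; the slides do not state the ranges used.

```python
import math
import random


def random_affine(rng, max_shift=0.1, max_log_scale=0.2, max_angle=math.pi / 6):
    """Sample a random translation + scaling + rotation acting on (x, y).

    All ranges are illustrative defaults, not values from the slides.
    """
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    scale = math.exp(rng.uniform(-max_log_scale, max_log_scale))
    angle = rng.uniform(-max_angle, max_angle)
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    def g(x, y):
        # Rotate, scale, then translate the coordinate.
        return (scale * (cos_a * x - sin_a * y) + tx,
                scale * (sin_a * x + cos_a * y) + ty)

    return g


rng = random.Random(0)
g = random_affine(rng)
print(g(0.5, 0.5))
```

The same `g` would be applied both to the image (to create the second view) and to the landmark coordinates (to evaluate the equivariance constraint).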

SLIDE 25

Overview of our neural network architecture

Input image Latent features Reconstructed image Landmark coordinates Input image Landmark coordinates

Unsupervised landmark discovery

SLIDE 26

Overview of our neural network architecture

Input image Latent features Reconstructed image Landmark coordinates

SLIDE 27

Overview of our neural network architecture

Input image Latent features Reconstructed image Landmark coordinates

Landmark-based extraction of latent features

Input image Latent features

  • Weighted average-pooling with differentiable pooling masks

SLIDE 28

Overview of our neural network architecture

Landmark-based extraction of latent features

Input image Latent features

SLIDE 29

Landmark-based feature extraction

H W # channels

Gaussian heatmap

SLIDE 30

Landmark-based feature extraction

H W # channels Weighted global average pooling # channels

SLIDE 31

Landmark-based feature extraction

H W # channels

Weighted global average pooling

SLIDE 32

Landmark-based feature extraction

H W # channels

Weighted global average pooling

Original heatmap

SLIDE 33

Landmark-based feature extraction

Weighted global average pooling using each heatmap Latent feature attached to each landmark # channels

H W # channels
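
Weighted global average pooling with one heatmap per landmark can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the names `weighted_average_pool` and `landmark_features`, and the `feature_map[channel][row][col]` layout, are assumptions.

```python
def weighted_average_pool(feature_map, heatmap):
    """Pool a C x H x W feature map into C values using an H x W heatmap as weights."""
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap must have positive total weight")
    pooled = []
    for channel in feature_map:
        acc = sum(w * v
                  for h_row, f_row in zip(heatmap, channel)
                  for w, v in zip(h_row, f_row))
        pooled.append(acc / total)
    return pooled


def landmark_features(feature_map, heatmaps):
    """Return one pooled feature vector per landmark heatmap."""
    return [weighted_average_pool(feature_map, h) for h in heatmaps]


feature_map = [[[1.0, 2.0], [3.0, 4.0]],        # channel 0
               [[10.0, 20.0], [30.0, 40.0]]]    # channel 1
top_right = [[0.0, 1.0], [0.0, 0.0]]            # heatmap peaked at row 0, col 1
print(landmark_features(feature_map, [top_right]))  # [[2.0, 20.0]]
```

Each landmark thus receives a latent feature vector with as many entries as the feature map has channels, drawn mostly from the pixels around that landmark.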

SLIDE 35

Overview of our neural network architecture

Landmark-based extraction of latent features

Input image Latent features

SLIDE 36

Overview of our neural network architecture

Input image Latent features Reconstructed image Landmark coordinates

SLIDE 37

Overview of our neural network architecture

Input image Latent features Reconstructed image Landmark coordinates

Landmark-conditioned image decoding

Latent features Reconstructed image Landmark coordinates

  • Reverting the landmark and feature encoding

SLIDE 38

Overview of our neural network architecture

Input image Latent features Reconstructed image Landmark coordinates Latent features Reconstructed image Landmark coordinates


Landmark-conditioned image decoding

SLIDE 39

Landmark-based decoding

Gaussian heatmap with fixed variance

Uniform heatmap for BG

Normalization across channels

# channels
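
The three labels on this slide (fixed-variance Gaussian heatmaps, a uniform background heatmap, normalization across channels) can be combined into a short sketch. The function name `decoding_heatmaps`, the constant background value, and the unnormalized Gaussian form are assumptions for illustration.

```python
import math


def decoding_heatmaps(landmarks, height, width, sigma, background=1.0):
    """Build K Gaussian maps plus one uniform background map, then normalize
    so that the K + 1 channels sum to 1 at every pixel.

    `landmarks` is a list of (x, y) with x along columns and y along rows.
    """
    maps = []
    for x, y in landmarks:
        maps.append([[math.exp(-((c - x) ** 2 + (r - y) ** 2) / (2.0 * sigma ** 2))
                      for c in range(width)]
                     for r in range(height)])
    # Uniform background channel (the constant value is an assumption).
    maps.append([[background] * width for _ in range(height)])
    for r in range(height):
        for c in range(width):
            z = sum(m[r][c] for m in maps)
            for m in maps:
                m[r][c] /= z
    return maps


maps = decoding_heatmaps([(1.0, 1.0)], height=3, width=3, sigma=1.0)
print(maps[0][1][1], maps[1][1][1])  # 0.5 0.5
```

Far from every landmark the background channel dominates, so pixels that no landmark explains are attributed to the background.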

SLIDE 40

Landmark-based decoding: all landmarks


Latent features

Pixel-wise duplication
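
One plausible reading of "pixel-wise duplication" is that each landmark's latent feature vector is copied to every pixel, scaled by that landmark's heatmap, and the results are summed over landmarks before decoding. The slide does not spell out how the maps are combined, so the summation, the name `duplicate_features`, and the data layout below are assumptions.

```python
def duplicate_features(features, heatmaps):
    """Spread per-landmark feature vectors over the image plane.

    features[k][c]      : latent feature of landmark k (C values)
    heatmaps[k][row][col]: heatmap of landmark k (H x W)
    Returns a C x H x W map where each pixel holds the heatmap-weighted
    sum of the landmark features (combination rule is an assumption).
    """
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    channels = len(features[0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for feat, hm in zip(features, heatmaps):
        for c, value in enumerate(feat):
            for r in range(height):
                for col in range(width):
                    out[c][r][col] += value * hm[r][col]
    return out


features = [[1.0, 2.0], [10.0, 20.0]]       # two landmarks, two channels each
heatmaps = [[[1.0, 0.0]], [[0.0, 1.0]]]     # 1 x 2 image, one pixel per landmark
print(duplicate_features(features, heatmaps))  # [[[1.0, 10.0]], [[2.0, 20.0]]]
```

In this toy case each pixel simply receives the feature of the landmark that owns it, which is the inverse of the pooling step used during encoding.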

SLIDE 45

Overview of our neural network architecture

Landmark-conditioned image decoding

Latent features Reconstructed image Landmark coordinates

SLIDE 46

Overview of our neural network architecture

Input image Latent features Reconstructed image Landmark coordinates

SLIDE 47

Experimental results

SLIDE 48

Landmark discovery: Faces, 10 landmarks

Ours Thewlis et al.

(Thewlis et al.) James Thewlis, Hakan Bilen, and Andrea Vedaldi, “Unsupervised learning of object landmarks by factorized spatial embeddings,” In ICCV, 2017.

Thewlis et al. Ours

SLIDE 49

Unsupervised discovery: Faces, 10 landmarks

Thewlis et al.: errors

  • Forehead landmark to the left
  • Lower-lip landmark to the right
  • Mouth-corner landmark on the forehead
  • Right-eyebrow landmark on the left side
  • Forehead landmark to the left

Thewlis et al. (legend: incorrect landmarks, expected correct location)

SLIDE 50

Unsupervised discovery: Faces, 10 landmarks

Ours

Thewlis et al.: errors

  • Forehead landmark to the left
  • Lower-lip landmark to the right
  • Mouth-corner landmark on the forehead
  • Right-eyebrow landmark on the left side
  • Forehead landmark to the left

SLIDE 51

Unsupervised discovery: Faces, 30 landmarks

SLIDE 53

Unsupervised landmark discovery: Cat head

SLIDE 55

Unsupervised landmarks: shoes, cars, animals, MNIST

SLIDE 56

Unsupervised landmark discovery: Human3.6M

SLIDE 57

Unsupervised landmark discovery: Human3.6M

Optical flows for the equivariance

SLIDE 58

Quantitative evaluation: Regression to Ground Truth Landmarks

  • Train a linear regression model to map the discovered landmarks to human-annotated landmarks without fine-tuning the neural network.

Linear regression

Discovered landmarks Human-annotated landmarks
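
The evaluation protocol, a linear map from discovered landmarks to annotated ones with the detector kept fixed, can be sketched with ordinary least squares. The helper names (`solve`, `fit_linear_regression`, `predict`), the bias term, and the toy data are assumptions; the slides do not specify these details.

```python
def solve(a, b):
    """Solve a @ x = b (square `a`, matrix `b`) by Gauss-Jordan elimination
    with partial pivoting. Inputs are lists of rows."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system; add samples or regularize")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_regression(inputs, targets):
    """Least-squares weights (bias in the last row) via the normal equations.

    inputs[i]  : flattened discovered landmark coordinates of sample i
    targets[i] : flattened annotated landmark coordinates of sample i
    """
    x = [list(row) + [1.0] for row in inputs]
    d, m = len(x[0]), len(targets[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(d)] for i in range(d)]
    xty = [[sum(r[i] * t[j] for r, t in zip(x, targets)) for j in range(m)]
           for i in range(d)]
    return solve(xtx, xty)


def predict(weights, row):
    """Apply the fitted linear map (with bias) to one input row."""
    x = list(row) + [1.0]
    return [sum(xi * w[j] for xi, w in zip(x, weights))
            for j in range(len(weights[0]))]


# Toy data following target = 2 * input + 1.
weights = fit_linear_regression([[0.0], [1.0], [2.0]], [[1.0], [3.0], [5.0]])
print(predict(weights, [3.0]))
```

Only `weights` is fitted on annotated data; the discovered landmarks that feed it come from the unsupervised network unchanged.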

SLIDE 59

[Bar chart: localization error on MAFL faces (5 target landmarks) for supervised methods, Thewlis et al., and ours. Bar values as extracted, in order: 7.95, 9.73, 5.39, 6.67, 5.83, 3.46, 3.15.]

SLIDE 60

Semi-supervised learning

  • A better landmark detector using fewer training samples

[Plot: localization error vs. number of labeled samples. Legend: TCDCN (error: 7.95), MTCNN (error: 5.39), Ours (10 landmarks), Ours (30 landmarks). Annotation: 500 training samples.]

SLIDE 61

Cars, cat heads, human bodies

[Bar chart: localization error, Thewlis et al. vs. ours, for Human (16 discovered, 32 target), Cat (20 discovered, 7 target), Cat (10 discovered, 7 target), Car (24 discovered, 5 target), and Car (10 discovered, 5 target). Bar values as extracted, in order: 4.14, 14.84, 15.35, 5.80, 5.87, 7.51, 26.94, 26.76, 11.11, 11.42.]

SLIDE 62

Facial attribute classification

  • Landmark coordinates as visual representations
  • Predicting 13 binary facial attributes that are related to the facial shape.

Arched Eyebrows, Bags Under Eyes, Big Lips, Big Nose, Double Chin, High Cheekbones, Male, Mouth Slightly Open, Narrow Eyes, Oval Face, Pointy Nose, Receding Hairline, Smiling

Method                        Feature dimension   Accuracy
Ours (discovered landmarks)   60                  83.2

SLIDE 63

Facial attribute classification

[FaceNet] Florian Schroff, Dmitry Kalenichenko, and James Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in CVPR, 2015

  • Landmark coordinates as visual representations
  • Predicting 13 binary facial attributes that are related to the facial shape.

Arched Eyebrows, Bags Under Eyes, Big Lips, Big Nose, Double Chin, High Cheekbones, Male, Mouth Slightly Open, Narrow Eyes, Oval Face, Pointy Nose, Receding Hairline, Smiling

Method                        Feature dimension   Accuracy
Ours (discovered landmarks)   60                  83.2
FaceNet (top-layer)           128                 80.0
FaceNet (conv-layer)          1792                82.4

SLIDE 64

Landmark discovery Latent features Image reconstruction

SLIDE 65

Latent features Image reconstruction

Disentangled

SLIDE 66

Image manipulation

  • Discover landmarks and extract latent features from an image.
  • Manipulate the landmarks to generate new images / videos.

Discovered landmarks Manipulated landmarks

Original images Manipulated images Landmarks

SLIDE 67

Image manipulation

  • Discover landmarks and extract latent features from an image.
  • Manipulate the landmarks to generate new images / videos.

Original images Manipulated images Landmarks

Discovered landmarks Manipulated landmarks

SLIDE 68

Image manipulation: Human body

  • Discover landmarks and extract latent features from an image.
  • Manipulate the landmarks to generate new images / videos.

manipulating all 16 landmarks

Landmarks Manipulated images Original images

Discovered landmarks Manipulated landmarks

SLIDE 69

Conclusions

  • Unsupervised object landmark discovery as image representations with explicit structures

  • A fully differentiable neural network architecture

SLIDE 70

Conclusions

  • Unsupervised object landmark discovery as image representations with explicit structures
  • A fully differentiable neural network architecture
  • Our unsupervised model can
      • produce meaningful landmarks
      • perform competitively with supervised facial landmark detectors
      • provide a neural-network interface that humans can manipulate

SLIDE 71

Poster E19

Thank you!

Project page (Code & results):

http://ytzhang.net/projects/lmdis-rep
