SLIDE 1

Self-supervised Learning of Interpretable Keypoints from Unlabelled Videos

Tomas Jakab (VGG, University of Oxford)
Ankush Gupta (DeepMind, London)
Hakan Bilen (School of Informatics, University of Edinburgh)
Andrea Vedaldi (VGG, University of Oxford; Facebook AI Research, London)

www.robots.ox.ac.uk/~vgg/research/unsupervised_pose/

Oral presentation, CVPR 2020

SLIDE 2

Goal

Learn object keypoint detectors from unlabelled videos and an unpaired pose prior

Self-supervised landmarks predicted by our model

SLIDE 3

Factorizing appearance and keypoints

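[Diagram: the input image passes through an encoder to a 2D keypoint bottleneck, a list of 2D coordinates (one pair per keypoint). A decoder combines these keypoints with features from an appearance encoder applied to a style image and outputs a reconstruction. Reconstruction loss: reconstruction ≈ input image.]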

Related works on self-supervised landmarks:
Thewlis, Bilen, Vedaldi. Proc. NIPS, 2017
Thewlis, Bilen, Vedaldi. Proc. ICCV, 2017
Jakab, Gupta, Bilen, Vedaldi. Proc. NeurIPS, 2018
Zhang, Guo, Jin, Luo, He, Lee. Proc. CVPR, 2018
Wiles, Koepke, Zisserman. Proc. BMVC, 2018
Thewlis, Albanie, Bilen, Vedaldi. Proc. CVPR, 2019
Lorenz, Bereska, Milbich, Ommer. Proc. CVPR, 2019

keypoints not interpretable ☹
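The slide does not say how the 2D keypoint bottleneck is computed. One common way to turn per-keypoint score maps into coordinates is a spatial soft-argmax; the sketch below assumes that choice, and the function name `soft_argmax_2d` and the [-1, 1] coordinate convention are illustrative, not taken from the slides.

```python
import numpy as np


def soft_argmax_2d(heatmaps):
    """Map K score maps of shape (K, H, W) to K (x, y) pairs in [-1, 1].

    Assumption: the bottleneck is a spatial soft-argmax over each map.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # for numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax over all pixels
    probs = probs.reshape(k, h, w)

    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    x = (probs.sum(axis=1) * xs).sum(axis=1)  # expectation over columns
    y = (probs.sum(axis=2) * ys).sum(axis=1)  # expectation over rows
    return np.stack([x, y], axis=1)           # shape (K, 2)
```

Because each keypoint is reduced to two numbers, the decoder can recover geometry from the bottleneck but must take appearance from the style image.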

SLIDE 4

Learning to label as image translation

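[Diagram: the encoder maps the input image to a skeleton image. The decoder combines the skeleton image with features from the appearance encoder, applied to a style image, to produce the reconstruction. A discriminator compares the predicted skeleton image with samples from the unpaired pose prior: looks like a skeleton?]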
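The discriminator's "looks like a skeleton?" test can be written as an adversarial objective. The slide does not name the formulation, so the sketch below assumes a least-squares GAN loss purely for illustration; `d_real` and `d_fake` are hypothetical arrays of discriminator scores for skeleton images drawn from the unpaired pose prior and for skeleton images produced by the encoder.

```python
import numpy as np


def discriminator_loss(d_real, d_fake):
    """Least-squares loss (an assumed choice): prior samples -> 1, predictions -> 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)


def encoder_adversarial_loss(d_fake):
    """The encoder is rewarded when its skeleton images are scored as real (1)."""
    return np.mean((d_fake - 1.0) ** 2)
```

No image in the video data needs a matching pose annotation: the two sets of skeleton images are only compared as distributions.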

SLIDE 5

Reintroducing bottleneck

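[Diagram: as on the previous slide, the encoder maps the input image to a skeleton image that a discriminator checks against the unpaired pose prior (looks like a skeleton?). The skeleton image then passes through a handcrafted bottleneck: a keypoint detector, pre-trained offline, followed by an analytical renderer that redraws the skeleton image from the detected keypoints. The decoder reconstructs the input from the re-rendered skeleton image and the appearance encoder's features of the style image.]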
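An analytical renderer draws a skeleton image directly from keypoint coordinates, with no learned parameters. The sketch below is one plausible version, assuming each bone is a line segment whose intensity decays as a Gaussian of the distance to the segment; `render_skeleton`, `edges`, `size`, and `sigma` are illustrative names and values, not taken from the slides.

```python
import numpy as np


def render_skeleton(keypoints, edges, size=64, sigma=1.5):
    """Render a (size, size) skeleton image.

    keypoints: float array (K, 2) of (x, y) pixel coordinates.
    edges: list of (i, j) keypoint index pairs, one per bone.
    """
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float64)
    pixels = np.stack([xs, ys], axis=-1)  # (size, size, 2), (x, y) per pixel
    image = np.zeros((size, size))
    for i, j in edges:
        a, b = keypoints[i], keypoints[j]
        ab = b - a
        # Project every pixel onto the segment a-b, clamped to its endpoints.
        t = ((pixels - a) @ ab) / max(ab @ ab, 1e-8)
        t = np.clip(t, 0.0, 1.0)
        closest = a + t[..., None] * ab
        dist2 = ((pixels - closest) ** 2).sum(axis=-1)
        # Keep the brightest bone at each pixel.
        image = np.maximum(image, np.exp(-dist2 / (2.0 * sigma ** 2)))
    return image
```

Since the output depends only on the K coordinates, anything the encoder hides in the raw skeleton image beyond pose is discarded before the decoder sees it.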

SLIDE 6

Landmarks estimation

[Figure: predictions on Human3.6M and Penn Action]

SLIDE 7

Unpaired transfer

[Figure panels: Multi-PIE landmarks (unpaired prior); unlabelled face videos (VoxCeleb2); predictions on 300-W]

SLIDE 8

Evaluation: unsupervised to labeled keypoints

Other self-supervised methods: discovered keypoints → supervised linear regression → semantic keypoints
training time: unlabelled videos/images
test time: images annotated with object keypoints

Our method: directly predicting semantic keypoints
training time: unlabelled videos/images + unpaired pose prior
test time: no additional data
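The supervised linear regression step used to evaluate other self-supervised methods can be sketched as an ordinary least-squares fit from discovered keypoints to annotated ones. The function names and the bias column below are assumptions made for illustration; individual papers may fit the regressor differently.

```python
import numpy as np


def _with_bias(discovered):
    """Flatten (N, K, 2) keypoints to (N, 2K) and append a constant column."""
    n = discovered.shape[0]
    return np.hstack([discovered.reshape(n, -1), np.ones((n, 1))])


def fit_keypoint_regressor(discovered, annotated):
    """Least-squares map from discovered (N, K, 2) to annotated (N, M, 2) keypoints."""
    x = _with_bias(discovered)
    y = annotated.reshape(annotated.shape[0], -1)
    weights, *_ = np.linalg.lstsq(x, y, rcond=None)
    return weights  # shape (2K + 1, 2M)


def predict_keypoints(weights, discovered):
    """Apply the fitted map; returns semantic keypoints of shape (N, M, 2)."""
    return (_with_bias(discovered) @ weights).reshape(discovered.shape[0], -1, 2)
```

This fit is the step that needs annotated images. A model that predicts semantic keypoints directly skips it.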

SLIDE 9

Human pose estimation

[Bar chart: Simplified Human3.6M, %-MSE normalised by image size. Methods: hourglass (supervised); Thewlis et al., Zhang et al., Lorenz et al. (unsupervised discovery + supervised regression); Ours (no paired data).]

[Bar chart: Human3.6M, MSE in pixels. Methods: hourglass (supervised); Jakab & Gupta et al. (unsupervised discovery + supervised regression); Ours (no paired data).]

SLIDE 10

Facial landmarks estimation

[Bar chart: 300-W, %-MSE normalised by inter-ocular distance. Methods: LBF, TCDCN, RAR, Wing Loss; Thewlis et al. (three entries) and Wiles et al. (unsupervised discovery + supervised regression); Ours (no paired data); Ours + supervised regression.]
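The slides label the metric "%-MSE" normalised by inter-ocular distance (300-W) or by image size (Simplified Human3.6M) without giving the formula. The sketch below assumes the common convention of averaging point-to-point distances, dividing by the normaliser, and reporting a percentage; the function names and the eye index arguments are hypothetical.

```python
import numpy as np


def normalised_error_percent(pred, target, normaliser):
    """Mean landmark distance as a percentage of a reference length.

    pred, target: arrays (N, K, 2); normaliser: scalar or (N,) lengths in pixels.
    Assumed definition, since the slides do not spell out the formula.
    """
    dist = np.linalg.norm(pred - target, axis=-1)  # (N, K) point-to-point errors
    per_image = dist.mean(axis=1) / np.asarray(normaliser, dtype=np.float64)
    return 100.0 * per_image.mean()


def inter_ocular_distance(landmarks, left_eye_idx, right_eye_idx):
    """Per-image distance between two chosen eye landmarks, shape (N,)."""
    diff = landmarks[:, left_eye_idx] - landmarks[:, right_eye_idx]
    return np.linalg.norm(diff, axis=-1)
```

For an image-size normaliser, pass the image side length in pixels as a scalar instead of the per-image inter-ocular distances.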

SLIDE 11

Unpaired sample efficiency

[Bar chart: Simplified Human3.6M, %-MSE normalised by image size, against the number of unpaired samples: 400k (full dataset), 5000, 500, 50.]

[Bar chart: 300-W, %-MSE normalised by inter-ocular distance, against the number of unpaired samples: 6000 (full dataset), 500, 50.]

SLIDE 12

Disentangling appearance and geometry

Mixing geometry and appearance by conditioning on a different identity

[Figure columns: geometry, appearance, reconstruction]

SLIDE 13

Manipulating geometric representation
