
SLIDE 1

Synthesizing Normalized Faces from Facial Identity Features

Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, William T. Freeman, Google, Inc. University of Massachusetts Amherst, MIT CSAIL CVPR 2017 Presented by: Kapil Krishnakumar

SLIDE 2

Problem

  • Goal: synthesize a frontal, neutral-expression image of a person’s face given an input face photograph
  • Learn a one-to-one mapping from identity to image
  • Serves as a pre-processing method for removing irregularities from face images

Image Credit: Cole et al.

SLIDE 3

Related Work

  • Zhmoginov and Sandler. Inverting Face Embeddings with Convolutional Neural Networks
  • Blanz and Vetter. A Morphable Model for the Synthesis of 3D Faces
  • Cootes et al. Active Appearance Models
  • Hassner et al. Effective Face Frontalization in Unconstrained Images

Image Credit: Zhmoginov and Sandler; Blanz and Vetter; Cootes et al.; Hassner et al.

SLIDE 4

Approach

  • Morphing of images (data augmentation)
  • Encoder (image to feature vector)
  • Decoder (feature vector to normalized image)

○ Landmarks ○ Texture

Image Credit: Cole et al.

SLIDE 5

Architecture

Image Credit: Cole et al.

[Diagram: FaceNet -> MLP -> Landmarks; FC/CNN -> Textures; Differentiable Warping]

SLIDE 6

FaceNet (Background) (Schroff et al. 2015)

  • Maps face images to 128-D embedding vectors
  • Trained using triplet loss: the embeddings of two pictures of person A should be more similar to each other than to the embedding of a picture of person B
  • Uses GoogLeNet’s NN2 architecture

Image Credit: Cole CVPR 2017 Talk (https://www.youtube.com/watch?v=jVAClXpHgAI) | Szegedy et al. Going deeper with convolutions
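The triplet loss mentioned above can be sketched in a few lines. This is a minimal NumPy illustration, not FaceNet's actual training code; the margin value and the toy embeddings are assumptions for demonstration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on embeddings: pull the anchor toward the positive
    (same identity) and push it away from the negative (different
    identity) by at least `margin`. Margin value is illustrative."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance, same person
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance, other person
    return max(d_pos - d_neg + margin, 0.0)

# Toy 128-D unit vectors standing in for FaceNet embeddings.
rng = np.random.default_rng(0)
def unit(v): return v / np.linalg.norm(v)
a = unit(rng.normal(size=128))                # anchor: person A
p = unit(a + 0.05 * rng.normal(size=128))     # positive: near the anchor
n = unit(rng.normal(size=128))                # negative: unrelated person B
assert triplet_loss(a, p, n) < triplet_loss(a, n, p)
```

An easy triplet (positive close, negative far) yields a smaller loss than the same triplet with the roles swapped, which is exactly the ordering the loss enforces.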

SLIDE 7

Encoder

  • Use pretrained FaceNet
  • Extract the 1024-D “avgpool” layer of the “NN2” architecture
  • Append and train a fully connected layer from 1024 to F dimensions on top of this layer

Image Credit: Szegedy et al. Going deeper with convolutions
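The encoder's only trainable part is that single fully connected layer on top of the frozen FaceNet activations. A minimal NumPy sketch, where the value of F and the weight initialization are illustrative assumptions:

```python
import numpy as np

F = 64  # target feature dimension (illustrative; the paper tunes this)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(1024, F))  # trainable FC weights
b = np.zeros(F)                             # trainable FC bias

def encode(avgpool_1024):
    """Project the frozen FaceNet 1024-D 'avgpool' activation down to
    an F-dimensional identity feature vector with one linear layer."""
    return avgpool_1024 @ W + b

avgpool = rng.normal(size=1024)  # stand-in for a FaceNet activation
feature = encode(avgpool)
assert feature.shape == (F,)
```

Keeping FaceNet frozen means only `W` and `b` receive gradients during training, so the identity representation itself is inherited from FaceNet.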

SLIDE 8

Encoder

[Diagram: FaceNet “avgpool” -> Fully Connected -> Feature Vector]

Image Credit: Szegedy et al. Going deeper with convolutions

SLIDE 9

Decoder

  • Separating landmarks and textures is more effective than predicting the image directly
  • Landmarks are estimated by a shallow MLP with ReLUs applied to the feature vector

○ FV -> [(x,y), ...]

  • Textures are estimated by a fully connected network or a CNN

○ FV -> Image

[Diagram: Feature Vector -> Shallow MLP -> Predicted Landmarks; Feature Vector -> CNN/FC -> Predicted Texture]

Image Credit: Cole et al.
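The two decoder heads described above can be sketched as follows. This is a NumPy illustration only; the dimensions F, H, L, the landmark count, and the texture size are assumptions, and the paper's texture head is a real CNN rather than this single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
F, H, L = 64, 256, 68  # feature dim, hidden units, landmark count (illustrative)

# Shallow MLP with ReLU: feature vector -> L (x, y) landmark coordinates.
W1, b1 = rng.normal(scale=0.01, size=(F, H)), np.zeros(H)
W2, b2 = rng.normal(scale=0.01, size=(H, 2 * L)), np.zeros(2 * L)

def predict_landmarks(fv):
    h = np.maximum(fv @ W1 + b1, 0.0)   # ReLU hidden layer
    return (h @ W2 + b2).reshape(L, 2)  # [(x, y), ...]

# Fully connected texture head: feature vector -> small texture image.
T = 32  # texture side length (illustrative; the paper uses larger images)
Wt, bt = rng.normal(scale=0.01, size=(F, T * T * 3)), np.zeros(T * T * 3)

def predict_texture(fv):
    return (fv @ Wt + bt).reshape(T, T, 3)

fv = rng.normal(size=F)
assert predict_landmarks(fv).shape == (L, 2)
assert predict_texture(fv).shape == (T, T, 3)
```

The key design point survives even in this sketch: landmarks (geometry) and texture (appearance) come from separate heads reading the same feature vector, and are only combined later by warping.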

SLIDE 10

Decoder

  • Use differentiable image warping to combine landmarks and textures

[Diagram: Predicted Landmarks + Predicted Texture -> Differentiable Image Warping]

Image Credit: Cole et al.

SLIDE 11

Decoder

Image Credit: Cole et al

SLIDE 12

Differentiable Image Warping

[Diagram: Input Image with Landmarks; Final Landmarks; Texture with Landmarks; Mean Landmarks of training data -> Dense Flow Field via Spline Interpolation -> Final Output]

Image Credit: Cole et al.

SLIDE 13

Differentiable Spline Interpolation

[Diagram: Input Landmarks + Final Landmarks -> Distance Matrix -> Polyharmonic Interpolation -> Displacement -> Flow Field (X, Y, Magnitude)]

Image Credit: Cole et al.
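The interpolation step can be sketched with a thin-plate spline, the 2-D polyharmonic kernel phi(r) = r² log r: fit radial-basis weights so the spline reproduces the landmark displacements, then evaluate it anywhere to get a dense flow field. This is a NumPy sketch under that kernel assumption, not the paper's implementation; the regularization constant is illustrative.

```python
import numpy as np

def tps_kernel(r):
    """Thin-plate spline radial basis: phi(r) = r^2 log(r), phi(0) = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(r > 0, r ** 2 * np.log(r), 0.0)

def fit_spline(src_pts, displacements, reg=1e-6):
    """Solve for RBF weights (plus an affine part) so the spline
    reproduces the displacement at every source landmark."""
    n = len(src_pts)
    r = np.linalg.norm(src_pts[:, None] - src_pts[None, :], axis=-1)
    K = tps_kernel(r) + reg * np.eye(n)                # pairwise distance matrix
    P = np.hstack([np.ones((n, 1)), src_pts])          # affine terms: 1, x, y
    A = np.block([[K, P], [P.T, np.zeros((3, 3))]])
    rhs = np.vstack([displacements, np.zeros((3, 2))])
    return np.linalg.solve(A, rhs)

def evaluate_spline(params, src_pts, query_pts):
    """Evaluate the fitted spline at arbitrary points -> dense flow."""
    r = np.linalg.norm(query_pts[:, None] - src_pts[None, :], axis=-1)
    K = tps_kernel(r)
    P = np.hstack([np.ones((len(query_pts), 1)), query_pts])
    return K @ params[:len(src_pts)] + P @ params[len(src_pts):]

rng = np.random.default_rng(0)
src = rng.uniform(0, 64, size=(10, 2))            # input landmarks
dst = src + rng.normal(scale=2.0, size=(10, 2))   # final landmarks
params = fit_spline(src, dst - src)
flow_at_src = evaluate_spline(params, src, src)
assert np.allclose(flow_at_src, dst - src, atol=1e-3)
```

Because fitting and evaluation are just distance computations and a linear solve, every step is differentiable, which is what lets the warp sit inside the network as a module.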

SLIDE 14

Training

Image Credit: Cole et al.

[Diagram: FaceNet -> MLP -> Landmarks; FC/CNN -> Textures]

SLIDE 15

Training

Image Credit: Cole et al.

[Diagram: FaceNet -> MLP -> Landmarks; FC/CNN -> Textures; compared against Ground Truth Landmarks and Ground Truth Textures]

SLIDE 16

Training with FaceNet Loss

Image Credit: Cole et al.

[Diagram: FaceNet -> MLP -> Landmarks; FC/CNN -> Textures; compared against Ground Truth Landmarks and Ground Truth Textures; rendered output passed back through FaceNet]

SLIDE 17

Training Loss

  • Separately penalize predicted landmarks and textures using mean squared error
  • Penalize differences between the encodings of the input image and the rendered image when passed through FaceNet

○ Highly expensive to train

Image Credit: Cole et al.
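The combined loss can be sketched as the sum of those three terms. This is a NumPy illustration: the loss weights are assumptions (not the paper's values) and `fake_facenet` is a hypothetical stand-in for the real embedding network.

```python
import numpy as np

def mse(pred, target):
    return np.mean((pred - target) ** 2)

def total_loss(pred_lm, gt_lm, pred_tex, gt_tex,
               facenet, input_img, rendered_img,
               w_lm=1.0, w_tex=1.0, w_id=1.0):
    """Combined training loss: landmark MSE + texture MSE + an identity
    term comparing FaceNet embeddings of the input and rendered images.
    Weights w_* are illustrative, not the paper's values."""
    identity = np.sum((facenet(input_img) - facenet(rendered_img)) ** 2)
    return w_lm * mse(pred_lm, gt_lm) + w_tex * mse(pred_tex, gt_tex) + w_id * identity

# Hypothetical stand-in "FaceNet": average-pools the image to a short vector.
def fake_facenet(img):
    return img.mean(axis=(0, 1))

rng = np.random.default_rng(0)
lm, tex = rng.normal(size=(68, 2)), rng.normal(size=(8, 8, 3))
img = rng.normal(size=(8, 8, 3))
# Perfect predictions and an identical rendering give zero loss.
assert total_loss(lm, lm, tex, tex, fake_facenet, img, img) == 0.0
```

The identity term is what makes training expensive: every rendered image must be pushed through the full FaceNet forward pass (and backward, for gradients) at each step.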

SLIDE 18

Data Augmentation: Random Morphs

  • Problem: no database of normalized face photos on which to train the decoder network
  • Solution: morphing data augmentation

[Diagram: select one of k = 200 nearest neighbors, using a distance defined by landmarks and textures -> linear interpolation (landmarks & textures)]

Image Credit: Cole et al.
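The morph step can be sketched as follows: pick one of a sample's k nearest neighbors under a landmark-plus-texture distance, then linearly interpolate both representations. A NumPy sketch; here k, the distance weighting, and the uniform interpolation weight are illustrative assumptions (the slide states k = 200).

```python
import numpy as np

def random_morph(landmarks, textures, idx, k=5, rng=None):
    """Create an augmented sample by interpolating sample `idx` with one
    of its k nearest neighbors. The distance here simply adds landmark
    and texture squared differences; the paper's weighting may differ."""
    rng = rng or np.random.default_rng()
    d_lm = np.sum((landmarks - landmarks[idx]) ** 2, axis=(1, 2))
    d_tex = np.sum((textures - textures[idx]) ** 2, axis=(1, 2, 3))
    dist = d_lm + d_tex
    dist[idx] = np.inf                   # exclude the sample itself
    neighbors = np.argsort(dist)[:k]     # k nearest neighbors
    j = rng.choice(neighbors)            # pick one at random
    t = rng.uniform(0.0, 1.0)            # interpolation weight
    morph_lm = (1 - t) * landmarks[idx] + t * landmarks[j]
    morph_tex = (1 - t) * textures[idx] + t * textures[j]
    return morph_lm, morph_tex

rng = np.random.default_rng(0)
lms = rng.normal(size=(20, 68, 2))       # 20 samples of 68 landmarks each
texs = rng.normal(size=(20, 16, 16, 3))  # 20 small textures
m_lm, m_tex = random_morph(lms, texs, idx=0, rng=rng)
assert m_lm.shape == (68, 2) and m_tex.shape == (16, 16, 3)
```

Because landmarks and textures interpolate independently, each morph yields a plausible new "identity" in the decoder's output space rather than a ghosted pixel average.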

SLIDE 19

Data Augmentation: Gradient Domain Compositing

  • Morphing cannot capture hair and background detail
  • Composite the morphed face onto an original background using gradient domain compositing

Image Credit: Cole et al.

SLIDE 20

Data Augmentation

[Figure: Input vs. Augmented examples]

Image Credit: Cole et al.

SLIDE 21

Data Augmentation

Image Credit: Cole et al

SLIDE 22

Training Data

  • Dataset used to train the VGG-Face network: 2.6M photos
  • Processing:

○ Average all images of each individual by morphing ○ Each image is warped to the individual’s average landmarks ○ Pixel values are averaged to form the individual’s average image

  • Yields 1K unique identity images
  • Use Kazemi and Sullivan’s method to extract ground-truth landmarks
  • Augmentation produces 1M images

Image Credit: Cole et al

SLIDE 23

Experiments: Labeled Faces in the Wild

  • Identities are mutually exclusive from the VGG-Face dataset

[Figure: comparison with Hassner et al.]

Image Credit: Cole et al.

SLIDE 24

Experiments: Labeled Faces in the Wild

  • Histograms of FaceNet L2 error between input and synthesized images
  • 1.242 is the threshold for clustering identities in FaceNet feature space
  • Blue: with FaceNet training loss
  • Green: without FaceNet training loss

Image Credit: Cole et al

SLIDE 25

Robustness to Occlusions

Image Credit: Cole CVPR 2017 Talk (https://www.youtube.com/watch?v=jVAClXpHgAI)

SLIDE 26

Extensions: 3-D Model Fitting

  • A normalized face image is easier to fit to a 3D morphable model

Image Credit: Cole et al

SLIDE 27

Extensions: Automatic Photo Adjustment

Image Credit: Cole et al

SLIDE 28

Extensions: Automatic Photo Adjustment

Image Credit: Cole et al

SLIDE 29

Advantages

  • Splitting the generative task (landmarks and textures) can work better than directly outputting the result
  • Fresh use of spline interpolation as a differentiable module in a neural network
  • The augmentation technique lets a decoder trained on only 1K identity images perform extremely well
  • Difficult features like hair and eyes are well defined in the normalized images
  • Robust to occlusions
SLIDE 30

Disadvantages

  • No “ground truth” against which to compare normalized images

○ Though performance can be measured as FaceNet closeness between the input image and the normalized image ○ Cannot obtain human-annotated ground truth

  • Dependent on off-the-shelf methods for generating landmark and texture labels

○ The paper shows no experiments with techniques other than Kazemi and Sullivan’s ○ Unclear how texture labels are generated

  • Backgrounds are unrealistic and blurry