SLIDE 1

More Generative Models

  • Prof. Leal-Taixé and Prof. Niessner

SLIDE 2

Conditional GANs on Videos

  • Challenge:

– Each frame is high quality, but temporally inconsistent


SLIDE 3

Video-to-Video Synthesis

  • Sequential Generator
  • Conditional Image Discriminator D_I (is it a real image?)
  • Conditional Video Discriminator D_V (temporal consistency via optical flow)

Wang et al. ’18: Vid2Vid

The generator conditions on the past L source frames and the past L generated frames (here L = 2); the full learning objective trains the generator against both discriminators jointly.
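The sequential conditioning can be sketched in plain Python (the helper name `generator_inputs` and the list-of-frames representation are illustrative, not from the paper):

```python
# Sketch (not the authors' code): assembling the conditioning window for a
# sequential video-to-video generator with history length L = 2.
L = 2

def generator_inputs(source_frames, generated_frames, t):
    """Inputs for predicting frame t: the past L source frames (plus the
    current one) and the past L frames the generator already produced."""
    src_window = source_frames[t - L : t + 1]   # s_{t-2}, s_{t-1}, s_t
    gen_window = generated_frames[t - L : t]    # g_{t-2}, g_{t-1}
    return src_window + gen_window

# Toy frames represented by labels instead of image tensors.
sources = [f"s{t}" for t in range(5)]
generated = [f"g{t}" for t in range(5)]
print(generator_inputs(sources, generated, 3))  # ['s1', 's2', 's3', 'g1', 'g2']
```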

SLIDE 4

Video-to-Video Synthesis



SLIDE 5

Video-to-Video Synthesis

  • Key ideas:

– Separate discriminator for temporal parts

  • In this case based on optical flow

– Consider the recent history of previous frames
– Train everything jointly
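A minimal sketch of the optical-flow idea behind the temporal discriminator: penalize the difference between the current generated frame and the previous frame warped by the flow. One-dimensional "frames", nearest-neighbor warping, and all names are illustrative simplifications:

```python
# Sketch of flow-based temporal consistency (1-D "frames" for brevity;
# names are illustrative, not from the paper).

def warp(frame, flow):
    """Warp the previous frame toward the current one: each target position
    i samples the previous frame at i - flow[i] (rounded, clamped)."""
    n = len(frame)
    out = []
    for i in range(n):
        j = min(max(int(round(i - flow[i])), 0), n - 1)
        out.append(frame[j])
    return out

def temporal_l1(curr, prev, flow):
    """Mean L1 distance between the current frame and the flow-warped
    previous frame; small values mean temporally consistent output."""
    warped = warp(prev, flow)
    return sum(abs(c - w) for c, w in zip(curr, warped)) / len(curr)

prev = [0.0, 1.0, 2.0, 3.0]
flow = [1.0, 1.0, 1.0, 1.0]           # everything moved one pixel right
curr = [0.0, 0.0, 1.0, 2.0]           # exactly prev shifted right
print(temporal_l1(curr, prev, flow))  # 0.0
```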


SLIDE 6

Deep Video Portraits

SIGGRAPH ’18 [Kim et al. ’18]: Deep Video Portraits

SLIDE 7

Deep Video Portraits

Similar to “Image-to-Image Translation” (Pix2Pix) [Isola et al.]


SLIDE 8

Deep Video Portraits

SLIDE 9

Deep Video Portraits

Neural Network converts synthetic data to realistic video

SLIDE 10

Deep Video Portraits

SLIDE 11

Deep Video Portraits

SLIDE 12

Deep Video Portraits

SLIDE 13

Deep Video Portraits

Interactive Video Editing

SLIDE 14

Deep Video Portraits: Insights

  • Synthetic data for tracking is great anchor / stabilizer
  • Overfitting on small datasets works pretty well
  • Need to stay within training set w.r.t. motions
  • No real learning; essentially optimizing the problem with SGD
  • → Should be pretty interesting for future directions


SLIDE 15

Everybody Dance Now

[Chan et al. ’18] Everybody Dance Now

SLIDE 16

Everybody Dance Now

SLIDE 17

Everybody Dance Now

SLIDE 18

Everybody Dance Now: Insights

  • Conditioning via tracking seems promising!

– Tracking quality translates to resulting image quality
– Tracking human skeletons is less developed than tracking faces

  • Temporally it’s not stable (e.g., with OpenPose)

– Fun fact: around four papers with essentially the same idea appeared at about the same time…


SLIDE 19

Deep Voxels

[Sitzmann et al. ’18] Deep Voxels

SLIDE 20

Deep Voxels

  • Main idea for video generation:

– Why learn 3D operations with 2D convolutions?!
– We know how 3D transformations work

  • E.g., 6 DoF rigid pose [ R | t ]

– Incorporate these into the architectures

  • Need to be differentiable!

– Example application: novel view point synthesis

  • Given rigid pose, generate image for that view

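The 6-DoF rigid transform [R | t] mentioned above, as a small sketch in plain Python (a 90° rotation about z plus a translation; every step is just multiplication and addition, hence differentiable):

```python
# Sketch of a 6-DoF rigid transform applied to 3-D points: p' = R p + t.
# Composed only of multiplies and adds, so it is differentiable and can sit
# inside a network architecture.

def rigid_transform(points, R, t):
    """Apply the 3x3 rotation R and translation t to each 3-D point."""
    out = []
    for p in points:
        q = [sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3)]
        out.append(q)
    return out

R = [[0, -1, 0],   # rotate 90 degrees around the z-axis
     [1,  0, 0],
     [0,  0, 1]]
t = [0, 0, 2]      # then shift along z

print(rigid_transform([[1, 0, 0]], R, t))  # [[0, 1, 2]]
```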

SLIDE 21

Deep Voxels

SLIDE 22

Deep Voxels

Issue: we don’t know the depth for the target!

  • → Per-pixel softmax along the ray
  • → The network learns the depth

Occlusion Network:
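A sketch of the per-pixel softmax along a ray (scalar features and hand-picked scores, purely for illustration): visibility scores along each ray are turned into weights, so the network can learn where the surface lies without depth supervision:

```python
# Sketch of softmax-based depth compositing along one ray (illustrative).
import math

def softmax(xs):
    m = max(xs)                        # subtract max for numerical stability
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

# One ray with 4 depth steps: a visibility score and a scalar feature each.
scores   = [0.1, 3.0, 0.2, -1.0]   # high score ~ "the surface is here"
features = [0.0, 1.0, 0.5, 0.2]

weights = softmax(scores)                              # distribution over depth
pixel = sum(w * f for w, f in zip(weights, features))  # blended output value

print(abs(sum(weights) - 1.0) < 1e-9)  # weights form a distribution
```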

SLIDE 23

Deep Voxels

SLIDE 24

Deep Voxels

SLIDE 25

Deep Voxels: Insights

  • Lifting from 2D to 3D works great

– No need to take specific care for temp. coherency!

  • All 3D operations are differentiable
  • Currently, only for novel view-point synthesis

– I.e., cGAN for new pose in a given scene


SLIDE 26

Neural Rendering with Neural Textures

SLIDE 27

Autoregressive Models

SLIDE 28

Autoregressive Models vs. GANs

  • GANs learn an implicit data distribution

– i.e., outputs are samples (the distribution lives in the model)

  • Autoregressive models learn an explicit distribution, governed by a prior imposed by the model structure

– i.e., outputs are probabilities (e.g., a softmax)


SLIDE 29

PixelRNN

  • Goal: model the distribution of natural images
  • Interpret an image as a product of per-pixel conditional distributions

– Modeling an image becomes a sequence problem
– Predict one pixel at a time
– The next pixel is determined by all previously predicted pixels

  • Use a Recurrent Neural Network

[van den Oord et al. ’16]
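The factorization p(x) = ∏_i p(x_i | x_1, …, x_{i−1}) can be made concrete on a toy binary "image" with a hand-made conditional (illustrative, not a trained model):

```python
# The chain-rule factorization behind PixelRNN on a toy 4-"pixel" binary
# image: each pixel repeats the previous one with probability 0.9.
from itertools import product

def cond_prob(value, prev):
    """p(x_i = value | x_{i-1} = prev); the first pixel is uniform."""
    if prev is None:
        return 0.5
    return 0.9 if value == prev else 0.1

def joint_prob(pixels):
    """p(x) as the product of the per-pixel conditionals."""
    p = 1.0
    prev = None
    for v in pixels:
        p *= cond_prob(v, prev)
        prev = v
    return p

# The joint probabilities over all 2^4 images must sum to 1.
total = sum(joint_prob(img) for img in product([0, 1], repeat=4))
print(round(total, 10))  # 1.0
```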

SLIDE 30

PixelRNN

For RGB

SLIDE 31

PixelRNN

y_i ∈ {0, …, 255} → 256-way softmax


SLIDE 32

PixelRNN

  • Row LSTM model architecture
  • Image processed row by row
  • Hidden state of a pixel depends on the 3 pixels above it

– Can compute the pixels in a row in parallel

  • Incomplete context for each pixel

SLIDE 33

PixelRNN

  • Diagonal BiLSTM model architecture
  • Solves the incomplete-context problem
  • Hidden state of pixel q_{j,k} depends on q_{j,k−1} and q_{j−1,k}
  • Image processed along diagonals

SLIDE 34

PixelRNN

  • Masked convolutions
  • Only previously predicted values can be used as context
  • Mask A: restricts the context during the 1st convolution
  • Mask B: used in subsequent convolutions
  • Masking implemented by zeroing out weights

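A sketch of how the two mask types can be built for a single-channel 3×3 kernel (the R/G/B channel ordering of the paper is omitted for brevity):

```python
# Construction of mask A and mask B for a 3x3 masked convolution
# (single channel; illustrative sketch).

def make_mask(kernel_size, mask_type):
    """Mask A zeroes the center weight too (first layer: the prediction may
    not see the pixel it predicts); mask B keeps the center (later layers)."""
    k = kernel_size
    c = k // 2
    mask = [[1] * k for _ in range(k)]
    for r in range(k):
        for col in range(k):
            # Zero every weight that lies after the center in raster order.
            if r > c or (r == c and col > c):
                mask[r][col] = 0
    if mask_type == "A":
        mask[c][c] = 0
    return mask

print(make_mask(3, "A"))  # [[1, 1, 1], [1, 0, 0], [0, 0, 0]]
print(make_mask(3, "B"))  # [[1, 1, 1], [1, 1, 0], [0, 0, 0]]
```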

SLIDE 35

PixelRNN

  • Generated 64×64 images, trained on ImageNet

SLIDE 36

PixelCNN

  • Row and Diagonal LSTM layers have a potentially unbounded dependency range within the receptive field

– Can be very computationally costly

  • PixelCNN:

– Standard convolutions capture a bounded receptive field
– All pixel features can be computed at once (during training)

SLIDE 37

PixelCNN

  • Model preserves spatial dimensions
  • Masked convolutions avoid seeing future context

http://sergeiturukin.com/2017/02/22/pixelcnn.htm

Mask A

SLIDE 38

Gated PixelCNN

  • Gated blocks
  • Imitate the multiplicative interactions of PixelRNNs to reduce the performance gap between PixelCNN and PixelRNN
  • Replace ReLU with a gated block of sigmoid and tanh

z = tanh(W_{k,f} ∗ x) ⊙ σ(W_{k,g} ∗ x)

(k indexes the layer; ∗ is convolution; ⊙ is the element-wise product; σ is the sigmoid)
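The gated activation z = tanh(·) ⊙ σ(·) in scalar form (here a and b stand for the outputs of the two convolutions; illustrative only):

```python
# Scalar sketch of the gated activation: the tanh path provides the signal,
# the sigmoid path acts as a multiplicative gate in (0, 1).
import math

def gated(a, b):
    """z = tanh(a) * sigmoid(b)."""
    return math.tanh(a) * (1.0 / (1.0 + math.exp(-b)))

# The gate can shut a unit off entirely or let the tanh value pass through.
print(gated(10.0, -100.0) < 1e-9)            # gate closed -> output ~ 0
print(abs(gated(10.0, 100.0) - 1.0) < 1e-3)  # gate open -> ~ tanh(10) ~ 1
```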

SLIDE 39

PixelCNN Blind Spot


http://sergeiturukin.com/2017/02/24/gated-pixelcnn

Figure: 5×5 image, 3×3 convolutions — receptive field vs. unseen context (the blind spot)

SLIDE 40

PixelCNN: Eliminating the Blind Spot

  • Split the convolution into two stacks
  • The horizontal stack conditions on the current row
  • The vertical stack conditions on the pixels above


SLIDE 41

Conditional PixelCNN

  • Conditional image generation
  • E.g., condition on a semantic class or a text description


z = tanh(W_{k,f} ∗ x + V_{k,f}ᵀ h) ⊙ σ(W_{k,g} ∗ x + V_{k,g}ᵀ h)

(h is the latent vector being conditioned on)

SLIDE 42

Conditional PixelCNN

SLIDE 43

Autoregressive Models vs. GANs

  • Advantages of autoregressive:

– Explicitly model probability densities
– More stable training
– Can be applied to both discrete and continuous data

  • Advantages of GANs:

– Have been empirically demonstrated to produce higher-quality images
– Faster to train


SLIDE 44

Deep Learning in Higher Dimensions

SLIDE 45

Multi-Dimensional ConvNets

  • 1D ConvNets

– Audio / speech
– Also point clouds

  • 2D ConvNets

– Images (AlexNet, VGG, ResNet → classification, localization, etc.)

  • 3D ConvNets

– For videos
– For 3D data

  • 4D ConvNets

– E.g., dynamic 3D data (haven’t seen much work there)
– Simulations

SLIDE 46

Remember: 1D Convolutions

g = [4, 3, 2, −5, 3, 5, 2, 5, 5, 6], h = [1/3, 1/3, 1/3]

(g ∗ h)[0] = 4·(1/3) + 3·(1/3) + 2·(1/3) = 3

SLIDE 47

Remember: 1D Convolutions

g = [4, 3, 2, −5, 3, 5, 2, 5, 5, 6], h = [1/3, 1/3, 1/3]

(g ∗ h)[1] = 3·(1/3) + 2·(1/3) + (−5)·(1/3) = 0

SLIDE 48

Remember: 1D Convolutions

g = [4, 3, 2, −5, 3, 5, 2, 5, 5, 6], h = [1/3, 1/3, 1/3]

(g ∗ h)[2] = 2·(1/3) + (−5)·(1/3) + 3·(1/3) = 0

SLIDE 49

Remember: 1D Convolutions

g = [4, 3, 2, −5, 3, 5, 2, 5, 5, 6], h = [1/3, 1/3, 1/3]

(g ∗ h)[3] = (−5)·(1/3) + 3·(1/3) + 5·(1/3) = 1

SLIDE 50

Remember: 1D Convolutions

g = [4, 3, 2, −5, 3, 5, 2, 5, 5, 6], h = [1/3, 1/3, 1/3]

(g ∗ h)[4] = 3·(1/3) + 5·(1/3) + 2·(1/3) = 10/3

SLIDE 51

Remember: 1D Convolutions

g = [4, 3, 2, −5, 3, 5, 2, 5, 5, 6], h = [1/3, 1/3, 1/3]

(g ∗ h)[5] = 5·(1/3) + 2·(1/3) + 5·(1/3) = 4

SLIDE 52

Remember: 1D Convolutions

g = [4, 3, 2, −5, 3, 5, 2, 5, 5, 6], h = [1/3, 1/3, 1/3]

(g ∗ h)[6] = 2·(1/3) + 5·(1/3) + 5·(1/3) = 4

SLIDE 53

Remember: 1D Convolutions

g = [4, 3, 2, −5, 3, 5, 2, 5, 5, 6], h = [1/3, 1/3, 1/3]

(g ∗ h)[7] = 5·(1/3) + 5·(1/3) + 6·(1/3) = 16/3

g ∗ h = [3, 0, 0, 1, 10/3, 4, 4, 16/3]
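The sliding-window computation of slides 46–53, as plain Python ("valid" convolution; the box kernel is symmetric, so flipping it makes no difference):

```python
# 1-D "valid" convolution with a moving-average kernel, reproducing the
# worked example from the slides. Fractions keep the arithmetic exact.
from fractions import Fraction

def conv1d_valid(g, h):
    """Slide the kernel h over the signal g; output has len(g)-len(h)+1 values."""
    k = len(h)
    return [sum(g[i + j] * h[j] for j in range(k)) for i in range(len(g) - k + 1)]

g = [4, 3, 2, -5, 3, 5, 2, 5, 5, 6]
h = [Fraction(1, 3)] * 3            # box filter of width 3

print([str(x) for x in conv1d_valid(g, h)])
# ['3', '0', '0', '1', '10/3', '4', '4', '16/3']
```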

SLIDE 54

1D ConvNets: WaveNet

[van den Oord et al. ’16] https://deepmind.com/blog/wavenet-generative-model-raw-audio/

SLIDE 55

1D ConvNets: WaveNet
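A sketch of the dilated causal convolutions behind WaveNet (illustrative, not DeepMind's code): each output depends only on past samples, and stacking dilations 1, 2, 4 grows the receptive field exponentially:

```python
# Dilated causal convolution sketch: y[t] = sum_j w[j] * x[t - j*dilation],
# with implicit zero-padding on the left so no output looks into the future.

def causal_dilated_conv(x, w, dilation):
    out = []
    for t in range(len(x)):
        s = 0.0
        for j, wj in enumerate(w):
            idx = t - j * dilation
            if idx >= 0:
                s += wj * x[idx]
        out.append(s)
    return out

x = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # unit impulse
y = x
for d in (1, 2, 4):                            # three layers, kernel size 2
    y = causal_dilated_conv(y, [1.0, 1.0], d)

# The impulse response now spans 1 + (2-1)*(1+2+4) = 8 past samples.
print(sum(1 for v in y if v != 0.0))  # 8
```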

SLIDE 56

3D Classification

Predict the class from a 3D model (e.g., obtained with a Kinect scan)

[Maturana et al. 15] & [Qi et al. 16] 3D vs Multi-view

SLIDE 57

3D Semantic Segmentation

[Dai et al. 17] ScanNet

1,500 densely annotated 3D scans; 2.5 million RGB-D frames

SLIDE 58

Volumetric Grids

SLIDE 59

Volumetric Grids

Volumetric Data Structures

– Occupancy grids
– Ternary grids
– Distance fields
– Signed distance fields

Figure: (binary) voxel grid; shape completion error (higher = better)
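A toy example of two of the structures listed above: a signed distance field for a sphere sampled on a voxel grid, and the binary occupancy grid derived from it (grid size and sphere are arbitrary illustrative choices):

```python
# Signed distance field (SDF) of a sphere on an 8x8x8 voxel grid over
# [0, 1]^3, plus the binary occupancy grid it induces (illustrative).
import math

def sphere_sdf(p, center, radius):
    """Signed distance: negative inside the sphere, positive outside."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, center)))
    return d - radius

n = 8
center, radius = (0.5, 0.5, 0.5), 0.3
sdf = {}
for i in range(n):
    for j in range(n):
        for k in range(n):
            p = ((i + 0.5) / n, (j + 0.5) / n, (k + 0.5) / n)  # voxel center
            sdf[(i, j, k)] = sphere_sdf(p, center, radius)

occupancy = {v: d < 0 for v, d in sdf.items()}  # binary occupancy grid
print(sum(occupancy.values()))                  # number of occupied voxels
```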

SLIDE 60

3D Shape Completion on Grids

[Dai et al. 17] CNNComplete

Works with 32 x 32 x 32 voxels…

SLIDE 61

ScanNet: Semantic Segmentation in 3D


SLIDE 62

ScanNet: Sliding Window

SLIDE 63

SurfaceNet: Stereo Reconstruction

Runs on 32 × 32 × 32 blocks → takes forever…

[Ji et al. 17] SurfaceNet

SLIDE 64

ScanComplete: Fully Convolutional

Train on crops of scenes; test on entire scenes

[Dai et al. 18] ScanComplete

SLIDE 65

Dependent Predictions: Autoregressive Neural Networks

SLIDE 66

Spatial Extent: Coarse-to-Fine Predictions

SLIDE 67

ScanComplete: Fully Convolutional

Figure: input partial scan → completed scan

SLIDE 68

Conclusion so far

  • Volumetric Grids are easy

– Encode free space
– Encode distance fields
– Need a lot of memory
– Need a lot of processing time
– But can be used sliding-window or fully convolutionally

SLIDE 69

Conclusion so far

Surface occupancy gets smaller with higher resolutions
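This can be checked numerically (assumed setup: a sphere voxelized at two resolutions): surface voxels grow roughly as n² while the grid grows as n³, so the fraction occupied by the surface shrinks roughly like 1/n:

```python
# Fraction of an n^3 grid whose voxel centers lie near a sphere's surface.
# Illustrative check that surface occupancy drops as resolution grows.
import math

def surface_fraction(n, radius=0.4):
    """Voxels within one voxel width (1/n) of the sphere surface, as a
    fraction of the whole n^3 grid over [-0.5, 0.5]^3."""
    thickness = 1.0 / n
    count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                p = ((i + 0.5) / n - 0.5, (j + 0.5) / n - 0.5, (k + 0.5) / n - 0.5)
                if abs(math.sqrt(sum(c * c for c in p)) - radius) < thickness:
                    count += 1
    return count / n ** 3

low, high = surface_fraction(16), surface_fraction(32)
print(low > high)  # the surface occupies a smaller fraction at higher resolution
```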

SLIDE 70

Volumetric Hierarchies

SLIDE 71

Discriminative Tasks

Structure is known in advance!

OctNet: Learning Deep 3D Representations at High Resolutions (CVPR ’17)
O-CNN: Octree-based Convolutional Neural Networks for 3D Shape Analysis (SIGGRAPH ’17)

State of the art is somewhere here…

SLIDE 72

Generative Tasks

Need to infer structure!

Octree Generating Networks: Efficient Convolutional Architectures for High-resolution Outputs
OctNetFusion: Learning Depth Fusion from Data (that one is not end to end)

Pretty interesting: there is an end-to-end method, i.e., split voxels that are partially occupied
SLIDE 73

Conclusion so far

  • Hierarchies

– are great for reducing memory and runtime
– come at a performance hit
– are easier for discriminative tasks, where the structure is known

SLIDE 74

Multi-view

SLIDE 75

Multiple Views: Classification

  • RGB images from fixed views around object:

– View pooling for classification (RGB only; no spatial correlation)

Multi-view Convolutional Neural Networks for 3D Shape Recognition
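A sketch of MVCNN-style view pooling (illustrative names and toy feature vectors): per-view features are merged by an element-wise max before the classification head:

```python
# View pooling sketch: element-wise max over the feature vectors produced
# for each rendered view of the object (toy 3-dimensional features).

def view_pool(view_features):
    """Element-wise max over V feature vectors of equal length."""
    return [max(col) for col in zip(*view_features)]

views = [
    [0.1, 0.9, 0.3],   # features from view 1
    [0.4, 0.2, 0.8],   # features from view 2
    [0.5, 0.1, 0.6],   # features from view 3
]
print(view_pool(views))  # [0.5, 0.9, 0.8]
```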

SLIDE 76

Multiple Views: Segmentation

3D Shape Segmentation with Projective Convolutional Networks

Interesting in the sense that it does 3D shape segmentation (only on CAD models), but it uses multiple views and has spatial correlation on top of the mesh surface

SLIDE 77

Fun thing…

Volumetric and Multi-View CNNs for Object Classification on 3D Data

SLIDE 78

Hybrid: Volumetric + Multi-view

SLIDE 79

3D Volumetric + Multi-view

[Dai & Niessner ’18] 3DMV

SLIDE 80

3D Volumetric + Multi-view

SLIDE 81

3D Volumetric + Multi-view

SLIDE 82

3D Volumetric + Multi-view

SLIDE 83

Conclusion so far

  • Hybrid:

– Nice way to combine color and geometry
– Great performance (best so far for segmentation)
– End-to-end helps less than we hoped for
– Could be faster…

SLIDE 84

Next Lectures

  • Next Lecture → Jan 28th

– Domain Adaptation and Transfer Learning
– Possibly graphs, if time permits

  • Keep working on the projects!