

SLIDE 1

StyleGAN

Prof. Leal-Taixé and Prof. Niessner

SLIDE 2

StyleGAN

[Karras et al. 19] StyleGAN

Figure: traditional generator vs. style-based generator.
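
To make the figure concrete, here is a minimal sketch of the style-based generator's core mechanism, assuming hypothetical names and sizes (MappingNetwork, AdaIN, 512-dim latents); it is illustrative, not the official NVIDIA implementation:

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps latent z to intermediate latent w (the paper uses 8 FC layers)."""
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(z_dim, w_dim), nn.LeakyReLU(0.2)]
            z_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

class AdaIN(nn.Module):
    """Adaptive instance norm: per-channel scale/bias predicted from w."""
    def __init__(self, channels, w_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.affine = nn.Linear(w_dim, 2 * channels)  # w -> (scale, bias)

    def forward(self, x, w):
        scale, bias = self.affine(w).chunk(2, dim=1)  # [B, C] each
        return (1 + scale[:, :, None, None]) * self.norm(x) + bias[:, :, None, None]

# Usage: w modulates every conv block instead of z seeding the input layer.
w = MappingNetwork()(torch.randn(4, 512))
x = AdaIN(64)(torch.randn(4, 64, 32, 32), w)
```

In contrast to the traditional generator, z never enters the synthesis network directly; each resolution block is modulated by w, which is what gives the coarse-to-fine "style" control.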

SLIDE 3

StyleGAN

[Karras et al. 19] StyleGAN

Figure: traditional generator vs. style-based generator.

SLIDE 4

StyleGAN

[Karras et al. 19] StyleGAN

FID (Fréchet Inception Distance, defined below) computed on 50k generated images

  • > Architecture is similar to Progressive Growing GAN
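
For reference, FID fits a Gaussian (mean μ, covariance Σ) to Inception features of real (r) and generated (g) images and compares the two distributions; lower is better:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```
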
SLIDE 5

StyleGAN

[Karras et al. 19] StyleGAN

https://youtu.be/kSLJriaOumA

SLIDE 6

StyleGAN

[Karras et al. 19] StyleGAN

https://youtu.be/kSLJriaOumA

SLIDE 7

StyleGAN2

An interesting analysis of the design choices!

– https://arxiv.org/pdf/1912.04958.pdf
– https://github.com/NVlabs/stylegan2
– https://youtu.be/c-NJtV9Jvp0


SLIDE 8

Autoregressive Models

SLIDE 9

Autoregressive Models vs. GANs

  • GANs learn an implicit data distribution

– i.e., outputs are samples (the distribution lives inside the model)

  • Autoregressive models learn an explicit distribution governed by a prior imposed by the model structure

– i.e., outputs are probabilities (e.g., a softmax; see the sketch below)
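
A toy illustration of the contrast, with hypothetical stand-in modules (not from the lecture code): the GAN only ever emits samples, while the autoregressive head emits a normalized distribution that can be evaluated or sampled:

```python
import torch
import torch.nn as nn

# Toy stand-ins, purely for illustration.
gan_generator = nn.Linear(512, 784)   # z -> flattened "image" sample
ar_head = nn.Linear(128, 256)         # context features -> 256 logits

# Implicit (GAN): the model outputs a sample; p(x) is never materialized.
z = torch.randn(1, 512)
sample = gan_generator(z)

# Explicit (autoregressive): the model outputs probabilities for the next
# pixel, so p(x_i | x_<i) can be evaluated, maximized, or sampled from.
probs = torch.softmax(ar_head(torch.randn(1, 128)), dim=-1)  # sums to 1
next_pixel = torch.multinomial(probs, num_samples=1)
```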


SLIDE 10

PixelRNN

  • Goal: model the distribution of natural images
  • Interpret the pixels of an image as a product of conditional distributions (see the factorization below)

– Modeling an image becomes a sequence problem
– Predict one pixel at a time
– The next pixel is determined by all previously predicted pixels

  • Use a Recurrent Neural Network
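
The factorization referred to above is the standard chain rule over the n² pixels of an n×n image, taken in raster-scan order (as in the PixelRNN paper):

```latex
p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})
```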

[Van den Oord et al 2016]

SLIDE 11

PixelRNN

[Van den Oord et al 2016]

For RGB: each pixel's three color channels are predicted in sequence, each conditioned on the previously generated channels.

SLIDE 12

PixelRNN

y_j ∈ {0, …, 255} → 256-way softmax over the discrete pixel values
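
A minimal sketch of such an output head (assumed shapes, not the paper's exact code): predict 256 logits per channel per pixel and train with cross-entropy against the discrete ground-truth value:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, C, H, W = 4, 3, 32, 32             # batch, RGB channels, spatial dims
features = torch.randn(B, 64, H, W)   # stand-in for the network's features

head = nn.Conv2d(64, C * 256, kernel_size=1)    # 256 logits per channel
logits = head(features).view(B, C, 256, H, W)

target = torch.randint(0, 256, (B, C, H, W))    # ground-truth 8-bit pixels
# cross_entropy expects the class dim second, so fold channels into batch.
loss = F.cross_entropy(logits.view(B * C, 256, H, W),
                       target.view(B * C, H, W))
```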

[Van den Oord et al 2016]

SLIDE 13

PixelRNN

  • Row LSTM model architecture
  • Image processed row by row
  • Hidden state of a pixel depends on the 3 pixels above it

– Pixels within a row can be computed in parallel

  • Incomplete context for each pixel


[Van den Oord et al 2016]

SLIDE 14

PixelRNN

  • Diagonal BiLSTM model architecture
  • Solves the incomplete-context problem
  • Hidden state of pixel q_{j,k} depends on q_{j,k−1} and q_{j−1,k}
  • Image processed along its diagonals

[Van den Oord et al 2016]

SLIDE 15

PixelRNN

  • Masked convolutions: only previously predicted values can be used as context
  • Mask A: restricts the context during the first convolution (also excludes the current pixel)
  • Mask B: used in subsequent convolutions (may include the current position, which by then holds features rather than the raw pixel)
  • Masking is implemented by zeroing out kernel weights (see the sketch below)
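
A common way to implement this (a sketch in the PixelCNN style, not the authors' code; per-channel RGB masking is omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Conv whose kernel is zeroed to the right of / below the center.
    Mask 'A' also zeroes the center weight; mask 'B' keeps it."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        mask[kH // 2, kW // 2 + (mask_type == "B"):] = 0  # center row, right part
        mask[kH // 2 + 1:, :] = 0                         # all rows below
        self.register_buffer("mask", mask[None, None])    # [1, 1, kH, kW]

    def forward(self, x):
        # Zero out "future" weights on every forward pass.
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# First layer sees raw pixels -> mask A; later layers -> mask B.
layer1 = MaskedConv2d("A", 3, 64, kernel_size=7, padding=3)
layer2 = MaskedConv2d("B", 64, 64, kernel_size=3, padding=1)
out = layer2(torch.relu(layer1(torch.randn(1, 3, 32, 32))))
```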


[Van den Oord et al 2016]

SLIDE 16

PixelRNN

  • Generated 64×64 images, trained on ImageNet


[Van den Oord et al 2016]

SLIDE 17

PixelCNN

  • Row and Diagonal LSTM layers have a potentially unbounded dependency range within the receptive field

– Can be very computationally costly

  • PixelCNN:

– Standard convolutions capture a bounded receptive field
– All pixel features can be computed at once (during training)


[Van den Oord et al 2016]

SLIDE 18

PixelCNN

  • The model preserves spatial dimensions
  • Masked convolutions avoid seeing future context


[Van den Oord et al 2016]

http://sergeiturukin.com/2017/02/22/pixelcnn.html

Figure: the Mask A pattern.

SLIDE 19

Gated PixelCNN

  • Gated blocks
  • Imitate the multiplicative interactions of PixelRNNs to reduce the performance gap between PixelCNN and PixelRNN
  • Replace ReLU with a gated block of sigmoid and tanh

[Van den Oord et al 2016]

z = tanh(X_{l,g} ∗ y) ⊙ σ(X_{l,h} ∗ y)

(l: layer index; σ: sigmoid; ⊙: element-wise product; ∗: convolution)
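
A minimal sketch of this gated activation (illustrative only; the masking and the two-stack structure from the following slides are omitted here):

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # One conv produces both halves: tanh "features" and sigmoid "gate".
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size=3, padding=1)

    def forward(self, y):
        f, g = self.conv(y).chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)  # z = tanh(.) ⊙ σ(.)

z = GatedBlock(64)(torch.randn(1, 64, 32, 32))
```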

SLIDE 20

PixelCNN Blind Spot

[Van den Oord et al 2016]

http://sergeiturukin.com/2017/02/24/gated-pixelcnn.html

Figure: 5×5 image with a 3×3 masked convolution; receptive field vs. unseen context (the blind spot).

SLIDE 21

PixelCNN: Eliminating the Blind Spot

  • Split the convolution into two stacks (see the sketch below)
  • Horizontal stack conditions on the current row
  • Vertical stack conditions on the pixels above
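
A rough sketch of the two-stack idea, simplified (the real Gated PixelCNN also gates each stack and feeds the vertical stack into the horizontal one via 1×1 convolutions); the padding and cropping are what keep each stack causal:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStack(nn.Module):
    def __init__(self, ch, k=3):
        super().__init__()
        self.k = k
        self.v = nn.Conv2d(ch, ch, (k // 2, k))   # rows above, full width
        self.h = nn.Conv2d(ch, ch, (1, k // 2))   # pixels to the left

    def forward(self, x):
        k, (H, W) = self.k, x.shape[-2:]
        # Vertical stack: pad the top so row i only sees rows strictly above.
        v = self.v(F.pad(x, (k // 2, k // 2, k // 2, 0)))[:, :, :H, :]
        # Horizontal stack: pad the left so pixel j only sees pixels left of it.
        h = self.h(F.pad(x, (k // 2, 0, 0, 0)))[:, :, :, :W]
        return v + h

out = TwoStack(64)(torch.randn(1, 64, 32, 32))   # -> [1, 64, 32, 32]
```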


[Van den Oord et al 2016]

SLIDE 22

Conditional PixelCNN

  • Conditional image generation
  • E.g., condition on a semantic class or a text description

[Van den Oord et al 2016]

z = tanh(X_{l,g} ∗ y + W_{l,g} h) ⊙ σ(X_{l,h} ∗ y + W_{l,h} h)

(h: the latent vector to be conditioned on)
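
Extending the earlier gated block, a sketch of the conditional variant (illustrative shapes and names; masking again omitted): the conditioning vector h contributes a spatially constant bias to both the tanh and the sigmoid paths:

```python
import torch
import torch.nn as nn

class CondGatedBlock(nn.Module):
    def __init__(self, ch, h_dim):
        super().__init__()
        self.conv = nn.Conv2d(ch, 2 * ch, kernel_size=3, padding=1)
        self.proj = nn.Linear(h_dim, 2 * ch)  # plays the role of W_{l,g}, W_{l,h}

    def forward(self, y, h):
        c = self.conv(y) + self.proj(h)[:, :, None, None]  # broadcast over HxW
        f, g = c.chunk(2, dim=1)
        return torch.tanh(f) * torch.sigmoid(g)

# h could be, e.g., a class embedding.
z = CondGatedBlock(64, 10)(torch.randn(1, 64, 32, 32), torch.randn(1, 10))
```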

SLIDE 23

Conditional PixelCNN

[Van den Oord et al 2016]

SLIDE 24

Autoregressive Models vs. GANs

  • Advantages of autoregressive models:

– Explicitly model probability densities
– More stable training
– Can be applied to both discrete and continuous data

  • Advantages of GANs:

– Empirically demonstrated to produce higher-quality images
– Faster to train


SLIDE 25

Autoregressive Models

  • The state of the art is pretty impressive

[Razavi et al. 19] Generating Diverse High-Fidelity Images with VQ-VAE-2 (Vector Quantized Variational AutoEncoder), https://arxiv.org/pdf/1906.00446.pdf

SLIDE 26

Generative Models on Videos

SLIDE 27

GANs on Videos

Two options:

– A single random variable z seeds the entire video (all frames)

  • Very high-dimensional output
  • How to handle variable-length videos?
  • Future frames are deterministic given the past

– A random variable z for each frame of the video (see the sketch below)

  • Need to condition the future on the past
  • How to combine past frames and random vectors during training?

General issues:

– Temporal coherency
– Drift over time (many models collapse to the mean image)
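
A purely illustrative sketch of the per-frame-z option (a hypothetical toy model, not any specific paper's architecture): a recurrent state summarizes the past, and a fresh z per frame injects stochasticity, which naturally handles variable length:

```python
import torch
import torch.nn as nn

class FrameGenerator(nn.Module):
    def __init__(self, z_dim=64, state_dim=256, frame_dim=3 * 64 * 64):
        super().__init__()
        self.z_dim = z_dim
        self.rnn = nn.GRUCell(z_dim + frame_dim, state_dim)
        self.decode = nn.Linear(state_dim, frame_dim)

    def forward(self, first_frame, num_frames):
        B = first_frame.size(0)
        state = torch.zeros(B, self.rnn.hidden_size)
        frame, frames = first_frame.flatten(1), []
        for _ in range(num_frames):
            z = torch.randn(B, self.z_dim)          # fresh noise per frame
            state = self.rnn(torch.cat([z, frame], 1), state)
            frame = torch.tanh(self.decode(state))  # predict the next frame
            frames.append(frame)
        return torch.stack(frames, dim=1)           # [B, T, frame_dim]

video = FrameGenerator()(torch.randn(2, 3, 64, 64), num_frames=8)
```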


SLIDE 28

GANs on Videos: DVD-GAN

[Clark et al. 2019] Adversarial Video Generation on Complex Datasets

SLIDE 29

GANs on Videos: DVD-GAN

[Clark et al. 2019] Adversarial Video Generation on Complex Datasets

SLIDE 30

GANs on Videos: DVD-GAN

  • Trained on the Kinetics-600 dataset

– Resolutions of 256×256, 128×128, and 64×64
– Lengths of up to 48 frames

  • > This is the state of the art!
  • > Generating videos from scratch is still incredibly challenging

[Clark et al. 2019] Adversarial Video Generation on Complex Datasets

SLIDE 31

Conditional GANs on Videos

  • Challenge:

– Each frame is high quality, but temporally inconsistent


SLIDE 32

Video-to-Video Synthesis

  • Sequential generator
  • Conditional image discriminator D_I (is each frame a real image?)
  • Conditional video discriminator D_V (temporal consistency via optical flow)

Wang et al. 18: Vid2Vid

The generator conditions on the past L source frames and the past L generated frames (here L = 2). The full learning objective is sketched below.
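
Reconstructed from the Vid2Vid paper (the slide's exact notation may differ): the generator F plays a minimax game against both discriminators, plus a weighted flow-estimation loss L_W:

```latex
\min_F \Big( \max_{D_I} \mathcal{L}_I(F, D_I)
           + \max_{D_V} \mathcal{L}_V(F, D_V) \Big)
           + \lambda_W \, \mathcal{L}_W(F)
```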

SLIDE 33

Video-to-Video Synthesis

Wang et al. 18: Vid2Vid

SLIDE 34

Video-to-Video Synthesis

Wang et al. 18: Vid2Vid

SLIDE 35

Video-to-Video Synthesis

  • Key ideas:

– A separate discriminator for the temporal parts

  • In this case based on optical flow

– Consider the recent history of previous frames
– Train everything jointly


Wang et al. 18: Vid2Vid

SLIDE 36

Deep Video Portraits

Siggraph’18 [Kim et al 18]: Deep Portraits

SLIDE 37

Deep Video Portraits

Similar to “Image-to-Image Translation” (Pix2Pix) [Isola et al.]

Siggraph’18 [Kim et al 18]: Deep Portraits

SLIDE 38

Deep Video Portraits

Siggraph’18 [Kim et al 18]: Deep Portraits

SLIDE 39

Deep Video Portraits

Siggraph’18 [Kim et al 18]: Deep Portraits

A neural network converts the synthetic renderings into realistic video.

SLIDE 40

Deep Video Portraits

Siggraph’18 [Kim et al 18]: Deep Portraits

SLIDE 41

Deep Video Portraits

Siggraph’18 [Kim et al 18]: Deep Portraits

SLIDE 42

Deep Video Portraits

Siggraph’18 [Kim et al 18]: Deep Portraits

SLIDE 43

Deep Video Portraits

Siggraph’18 [Kim et al 18]: Deep Portraits

Interactive Video Editing

SLIDE 44

Deep Video Portraits: Insights

  • Synthetic data for tracking is a great anchor / stabilizer
  • Overfitting on small datasets works pretty well
  • Need to stay within the training set w.r.t. motions
  • No real learning; essentially, optimizing the problem with SGD
  • > Should be pretty interesting for future directions

Siggraph’18 [Kim et al 18]: Deep Portraits

SLIDE 45

Everybody Dance Now

[Chan et al. ’18] Everybody Dance Now

SLIDE 46

Everybody Dance Now

[Chan et al. ’18] Everybody Dance Now

SLIDE 47

Everybody Dance Now

[Chan et al. ’18] Everybody Dance Now

SLIDE 48

Everybody Dance Now

  • cGANs work with different kinds of input
  • Require consistent input, i.e., accurate tracking
  • The network has no explicit notion of 3D

[Chan et al. ’18] Everybody Dance Now

SLIDE 49

Everybody Dance Now: Insights

  • Conditioning via tracking seems promising!

– Tracking quality translates directly into resulting image quality
– Tracking human skeletons is less mature than tracking faces

  • Temporally it is not stable… (e.g., OpenPose)

– Fun fact: roughly four papers with a similar idea appeared around the same time…

[Chan et al. ’18] Everybody Dance Now

SLIDE 50

Next Lectures

  • Next lectures:

– Neural Rendering
– 3D Deep Learning

  • Keep working on the projects!

SLIDE 51

See you next week!
