VIDEO-TO-VIDEO SYNTHESIS. Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro. PowerPoint PPT Presentation.





SLIDE 1

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro

VIDEO-TO-VIDEO SYNTHESIS

SLIDE 2

GENERATIVE ADVERSARIAL NETWORKS

Unconditional GANs: the Generator turns random noise into samples that the Discriminator should classify as False, while real images should be classified as True.

Image credit: Celebrity dataset, Jensen Huang, Founder and CEO of NVIDIA, Ian Goodfellow, Father of GANs.


SLIDE 3

After training for a while using NVIDIA DGX-1 machines

Fun sampling time begins: draw random samples and feed them to the Generator.

Image credit: NVIDIA StyleGAN

SLIDE 4

CONDITIONAL GANS

Allow the user more control over the sampling process

Modeling (training) vs. sampling (testing): given info (e.g. an image or text), the generator produces a result in the desired output style.

SLIDE 5

SKETCH-CONDITIONAL GANS

Generator

Image credit: NVIDIA pix2pixHD

SLIDE 6

IMAGE-CONDITIONAL GANS

Image credit: NVIDIA MUNIT

SLIDE 7

MASK-CONDITIONAL GANS

Semantic Image Synthesis

SLIDE 8

MASK-CONDITIONAL GANS

Semantic Image Synthesis

SLIDE 9

LIVE DEMO

Running on an RTX-ready laptop (https://www.nvidia.com/en-us/geforce/gaming-laptops/20-series/). It is running live at GTC, and will be online for everyone to try out on the NVIDIA AI Playground website (https://www.nvidia.com/en-us/research/ai-playground/).

SLIDE 10

Interface

SLIDE 11

SLIDE 12

PROBLEM WITH PREVIOUS METHODS

input result

SLIDE 13

PROBLEM WITH PREVIOUS METHODS

Batch Norm (Ioffe et al. 2015):

z = ((y − ν) / τ) ⋅ δ + γ

normalization, then an affine transform (de-normalization)

Normalization removes label information: a uniform input y (e.g. y = 1 everywhere) is normalized to zero regardless of which label it encodes, so different labels give the same output!
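The failure mode on this slide can be reproduced in a few lines: a uniform map, whatever value its label is encoded as, normalizes to all zeros. A minimal pure-Python sketch (the 4-pixel maps and the label values 1.0 and 2.0 are made up for illustration):

```python
def normalize(feature_map, eps=1e-5):
    """Zero-mean, unit-variance normalization over a flat feature map."""
    mean = sum(feature_map) / len(feature_map)
    var = sum((v - mean) ** 2 for v in feature_map) / len(feature_map)
    return [(v - mean) / (var + eps) ** 0.5 for v in feature_map]

# Two uniform maps encoding different labels (e.g. "sky" = 1, "grass" = 2).
sky = [1.0] * 4
grass = [2.0] * 4

print(normalize(sky))    # all zeros
print(normalize(grass))  # all zeros: the label information is gone
```

Because a uniform signal has zero deviation from its own mean, both maps collapse to the same all-zero output, which is exactly the motivation for SPADE on the next slides.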

SLIDE 14

PROBLEM WITH PREVIOUS METHODS

input result

SLIDE 15

PROBLEM WITH PREVIOUS METHODS

  • Do not feed the label map directly to the network
  • Use the label map to generate normalization layers instead
SLIDE 16

SPADE (SPatially Adaptive DEnormalization)

z = ((y − ν) / τ) ⋅ δ + γ

A parameter-free Batch Norm normalizes the network input y; the scale δ and shift γ are produced by convolutions on the label map and applied element-wise to give the network output z. Both y and z are label-free: the label enters only through δ and γ.
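A minimal sketch of the SPADE idea in pure Python: activations are normalized without learned parameters, then scaled and shifted per pixel by values predicted from the label map. The two linear "conv" stand-ins and their weights are hypothetical, not the paper's layers:

```python
def spade(y, label_map, eps=1e-5):
    """SPADE sketch: parameter-free normalization of activations y, then
    a per-pixel scale (delta) and shift (gamma) predicted from the label
    map. The 'conv' branches are hypothetical 1x1 linear stand-ins."""
    # Parameter-free batch norm over the activations.
    nu = sum(y) / len(y)
    tau = (sum((v - nu) ** 2 for v in y) / len(y) + eps) ** 0.5
    normalized = [(v - nu) / tau for v in y]
    # Hypothetical stand-ins for the two conv branches on the label map.
    delta = [0.5 * m + 1.0 for m in label_map]  # spatially varying scale
    gamma = [0.1 * m for m in label_map]        # spatially varying shift
    # z = ((y - nu) / tau) * delta + gamma, element-wise.
    return [n * d + g for n, d, g in zip(normalized, delta, gamma)]

# Uniform activations but a non-uniform label map: the label now survives
# normalization, unlike plain batch norm on the previous slides.
z = spade([1.0, 1.0, 1.0, 1.0], [1, 1, 2, 2])
```

Even though the uniform activations normalize to zero, the output still differs pixel by pixel because δ and γ carry the label information.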

SLIDE 17

SPADE

SPatially Adaptive DE-normalization: a parameter-free Batch Norm, followed by element-wise modulation with the conv-predicted scale δ and shift γ.

SLIDE 18

SPADE RESIDUAL BLOCKS

SPADE ResBlk: SPADE → ReLU → 3x3 Conv → SPADE → ReLU → 3x3 Conv

SLIDE 19

SPADE GENERATOR

A random sample is passed through a stack of four SPADE ResBlks to produce the output image.

SLIDE 20

PROBLEM WITH PREVIOUS METHODS

input w/o SPADE w/ SPADE

SLIDE 21

SLIDE 22

SLIDE 23

SLIDE 24

Multimodal Results on Flickr

IMAGE RESULTS

SLIDE 25

Multimodal Results on Flickr

IMAGE RESULTS

SLIDE 26

SLIDE 27

SLIDE 28

SLIDE 29

SLIDE 30

VIDEO-TO-VIDEO SYNTHESIS

SLIDE 31

IMAGE-TO-IMAGE SYNTHESIS

Car Road Tree Sidewalk Building

SLIDE 32

VIDEO-TO-VIDEO SYNTHESIS

SLIDE 33

VIDEO-TO-VIDEO SYNTHESIS

SLIDE 34

VIDEO-TO-VIDEO SYNTHESIS

SLIDE 35

VIDEO-TO-VIDEO SYNTHESIS

SLIDE 36

MOTIVATION

  • AI-based rendering

Traditional graphics: geometry, texture, lighting. Machine learning graphics: data.

SLIDE 37

MOTIVATION

  • AI-based rendering
  • High-level semantic manipulation

Original image → high-level representation (segmentation, keypoint detection, etc.): largely explored. High-level representation → image/video synthesis (edit here, then render a new image): little explored (this work).

SLIDE 38

PREVIOUS WORK

  • Video style transfer: COVST [2017], ArtST [2016]
  • Unconditional synthesis: MoCoGAN [2018], TGAN [2017], VGAN [2016]
  • Video prediction: MCNet [2017], PredNet [2017]
  • Image translation: pix2pixHD [2018], CRN [2017], pix2pix [2017]

SLIDE 39

PREVIOUS WORK: FRAME-BY-FRAME RESULT

SLIDE 40

OUR METHOD

  • Sequential generator
  • Multi-scale temporal discriminator
  • Spatio-temporal progressive training procedure
SLIDE 41

OUR METHOD

Sequential Generator

W: the previous output frame is warped with estimated optical flow and reused when generating the next frame.
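The sequential generator can be sketched as a loop: each frame after the first blends a warped copy of the previous output frame with newly synthesized content. Everything below (`synthesize`, `warp`, `mask`, and the toy inputs) is a hypothetical stand-in for the learned sub-networks, not the actual vid2vid code:

```python
def generate_video(semantic_maps, synthesize, warp, mask):
    """Sketch of a sequential generator: each frame after the first
    blends a warped copy (W) of the previous output frame with newly
    synthesized content, weighted by an occlusion mask."""
    frames = []
    prev = None
    for s in semantic_maps:
        if prev is None:
            frames.append(synthesize(s))  # first frame: pure synthesis
        else:
            warped = warp(prev, s)        # reuse the previous frame
            new = synthesize(s)           # hallucinate new content
            m = mask(s)                   # blending weight in [0, 1]
            frames.append([(1 - m) * w + m * n
                           for w, n in zip(warped, new)])
        prev = frames[-1]
    return frames

# Toy run on 1-D "frames" with trivial stand-in sub-networks.
video = generate_video(
    semantic_maps=[[1, 2], [1, 2], [2, 2]],
    synthesize=lambda s: [float(v) for v in s],
    warp=lambda prev, s: prev,   # identity warp for the sketch
    mask=lambda s: 0.5,
)
```

The design point is that reusing warped pixels from the previous frame is what makes the output temporally coherent, while the mask lets the network fill in newly revealed regions.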

SLIDE 42

OUR METHOD

Multi-scale Discriminators on top of the Sequential Generator (W): an Image Discriminator and a Video Discriminator, each applied at three scales (D1, D2, D3).

SLIDE 43

OUR METHOD

Spatio-temporally Progressive Training: grow the network spatially (adding residual blocks for higher resolution) and temporally (training on longer sequences), alternating between spatial (S) and temporal (T) phases.

SLIDE 44

RESULTS

SLIDE 45

RESULTS

  • Semantic → Street view scenes
  • Edges → Human faces
  • Poses → Human bodies
SLIDE 46

RESULTS

  • Semantic → Street view scenes
  • Edges → Human faces
  • Poses → Human bodies
SLIDE 47

STREET VIEW: CITYSCAPES

Semantic map | pix2pixHD | COVST (video style transfer) | Ours

SLIDE 48

STREET VIEW: BOSTON

SLIDE 49

STREET VIEW: NYC

SLIDE 50

RESULTS

  • Semantic → Street view scenes
  • Edges → Human faces
  • Poses → Human bodies
SLIDE 51

FACE SWAPPING (FACE → EDGE → FACE)

input | edges | output
SLIDE 52

FACE SWAPPING (SLIMMER FACE)

input (slimmed) edges (slimmed) output

SLIDE 53

FACE SWAPPING (SLIMMER FACE)

input (slimmed) edges (slimmed) output

SLIDE 54

MULTI-MODAL EDGE → FACE

Style 1 Style 2 Style 3

SLIDE 55

RESULTS

  • Semantic → Street view scenes
  • Edges → Human faces
  • Poses → Human bodies
SLIDE 56

MOTION TRANSFER (BODY → POSE → BODY)

input | poses | output
SLIDE 57

MOTION TRANSFER (BODY → POSE → BODY)

input | poses | output
SLIDE 58

MOTION TRANSFER (BODY → POSE → BODY)

input | poses | output
SLIDE 59

MOTION TRANSFER (BODY → POSE → BODY)

input | poses | output
SLIDE 60

MOTION TRANSFER

SLIDE 61

EXTENSION: FRAME PREDICTION

  • Goal: predict future frames given past frames
  • Our method: decompose prediction into two steps
  • 1. predict the semantic map for the next frame
  • 2. synthesize the frame based on the predicted semantic map
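The two-step decomposition above can be sketched as a loop; `predict_map` and `synthesize` are hypothetical stand-ins for the two learned models, and the toy maps are made up for illustration:

```python
def predict_frames(past_maps, predict_map, synthesize, n_future):
    """Sketch of two-step frame prediction: (1) predict the next
    semantic map from the past maps, (2) synthesize the frame from it,
    then feed the predicted map back in for the next step."""
    maps = list(past_maps)
    frames = []
    for _ in range(n_future):
        next_map = predict_map(maps)         # step 1: semantics first
        frames.append(synthesize(next_map))  # step 2: render appearance
        maps.append(next_map)                # autoregressive rollout
    return frames

# Toy run: the map "moves" by one each step; synthesis just scales it.
frames = predict_frames(
    past_maps=[[0, 1], [1, 2]],
    predict_map=lambda ms: [v + 1 for v in ms[-1]],
    synthesize=lambda m: [10 * v for v in m],
    n_future=2,
)
```

Predicting in semantic space first keeps the hard temporal reasoning in a low-dimensional representation, and leaves appearance to the synthesis network.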
SLIDE 62

EXTENSION: FRAME PREDICTION

Ground truth PredNet MCNet Ours

SLIDE 63

INTERACTIVE GRAPHICS

SLIDE 64

PATH TO INTERACTIVE GRAPHICS

  • Real-time inference
  • Combining with existing graphics pipeline
  • Domain gap between real input and synthetic input
SLIDE 65

PATH TO INTERACTIVE GRAPHICS

  • Real-time inference
  • Combining with existing graphics pipeline
  • Domain gap between real input and synthetic input
SLIDE 66

PATH TO INTERACTIVE GRAPHICS

  • Real-time inference
  • FP16 + TensorRT → ~5× speed-up
  • 36 ms (27.8 fps) for 1080p inference
  • Overall: 15~25 fps
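The latency and frame-rate figures on this slide are consistent with each other, as a quick check shows (the ~180 ms FP32 baseline is implied by the slide's ~5× figure, not stated on it):

```python
latency_ms = 36.0                    # per-frame 1080p inference (slide)
fps = 1000.0 / latency_ms            # ~27.8 fps, matching the slide
speedup = 5.0                        # approximate FP16 + TensorRT gain
baseline_ms = latency_ms * speedup   # implied FP32 latency, ~180 ms
```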
SLIDE 67

PATH TO INTERACTIVE GRAPHICS

  • Real-time inference
  • Combining with existing graphics pipeline
  • CARLA: open-source simulator for autonomous driving research
  • Make game engine render semantic maps
  • Pass the maps to the network and display the inference result
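The pipeline in these bullets can be sketched as a render loop; `simulator`, `network`, and `display` are hypothetical stand-ins for CARLA's semantic-map renderer, the trained generator, and the UI, not real APIs:

```python
def interactive_loop(simulator, network, display, n_steps):
    """Sketch of the interactive-graphics pipeline: each step, the game
    engine renders a semantic map, the network synthesizes a frame from
    it (conditioned on the previous frame for temporal coherence), and
    the result is displayed."""
    shown = []
    prev_frame = None
    for _ in range(n_steps):
        sem_map = simulator()                 # engine renders semantics
        frame = network(sem_map, prev_frame)  # vid2vid-style inference
        display(frame)                        # show the synthesized frame
        shown.append(frame)
        prev_frame = frame
    return shown

# Toy run with trivial stand-ins on 1-D "maps" and "frames".
out = interactive_loop(
    simulator=lambda: [1, 2],
    network=lambda m, prev: [2 * v for v in m],
    display=lambda f: None,
    n_steps=3,
)
```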
SLIDE 68

PATH TO INTERACTIVE GRAPHICS

  • Real-time inference
  • Combining with existing graphics pipeline
  • Domain gap between real input and synthetic input
  • Network trained on real data but tested on synthetic data
  • Things that differ: object shapes/edges, density of objects, camera viewpoints, etc.
  • On-going work
SLIDE 69

ORIGINAL CARLA IMAGE

SLIDE 70

RENDERED SEMANTIC MAPS

SLIDE 71

RECORDED DEMO RESULTS

SLIDE 72

RECORDED DEMO RESULTS

SLIDE 73

CONCLUSION

SLIDE 74

CONCLUSION

  • What can we achieve?
  • What can it be used for?
SLIDE 75

CONCLUSION

  • What can we achieve?
  • Synthesize high-res realistic images
SLIDE 76

CONCLUSION

  • What can we achieve?
  • Synthesize high-res realistic images
  • Produce temporally-smooth videos
SLIDE 77

CONCLUSION

  • What can we achieve?
  • Synthesize high-res realistic images
  • Produce temporally-smooth videos
  • Reinvent interactive graphics
SLIDE 78

CONCLUSION

  • What can we achieve?
  • What can it be used for?
  • AI-based rendering
  • High-level semantic manipulation

Traditional graphics

Machine learning graphics

Original image New image

High-level representation

SLIDE 79

THANK YOU

https://github.com/NVIDIA/vid2vid