VIDEO-TO-VIDEO SYNTHESIS Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro
GENERATIVE ADVERSARIAL NETWORKS Unconditional GANs [Diagram: the Generator's output goes to the Discriminator, which outputs False; a real image goes to the Discriminator, which outputs True] 2 Image credit: Celebrity dataset, Jensen Huang, Founder and CEO of NVIDIA, Ian Goodfellow, Father of GANs.
After training for a while on NVIDIA DGX-1 machines, the fun sampling time begins. Generator 3 Image credit: NVIDIA StyleGAN
CONDITIONAL GANS Allow the user more control over the sampling process. Modeling (training): given info (e.g. image, text) → generated result. Sampling (testing): given info (e.g. image, text) + style → output. 4
SKETCH-CONDITIONAL GANS Generator 5 Image credit: NVIDIA pix2pixHD
IMAGE-CONDITIONAL GANS 6 Image credit: NVIDIA MUNIT
MASK-CONDITIONAL GANS Semantic Image Synthesis 7
LIVE DEMO I need to get an RTX Ready Laptop (https://www.nvidia.com/en-us/geforce/gaming-laptops/20-series/). It is running live at GTC. It will be online for everyone to try out on the NVIDIA AI Playground website (https://www.nvidia.com/en-us/research/ai-playground/). 9
Interface 10
PROBLEM WITH PREVIOUS METHODS input result 12
PROBLEM WITH PREVIOUS METHODS Batch Norm (Ioffe et al. 2015): $z = \frac{y - \mu}{\sigma} \cdot \delta + \gamma$ (normalization, followed by the affine transform, i.e. de-normalization). [Figure: two different example inputs $y$ give the same output!] Normalization removes label information. 13
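A minimal sketch of the point on this slide, under stated assumptions: the two inputs are taken to be uniform label maps (all 0s and all 1s), flattened to lists, and only the normalization step is modeled. The helper `normalize` is illustrative and not taken from the authors' code.

```python
from statistics import fmean, pvariance

def normalize(values, eps=1e-5):
    """Normalization step only: subtract the mean, divide by the standard deviation."""
    mean = fmean(values)
    std = (pvariance(values, mean) + eps) ** 0.5
    return [(v - mean) / std for v in values]

# Two uniform label maps (flattened) that differ only in the class id they carry.
# The specific values are an assumption made for illustration.
label_a = [0.0] * 16
label_b = [1.0] * 16

out_a = normalize(label_a)
out_b = normalize(label_b)

# Both maps collapse to all zeros, so the class id is no longer recoverable.
same_output = out_a == out_b
print(same_output)
```

Any affine transform applied afterwards sees identical inputs in both cases, so it cannot restore the lost label.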
PROBLEM WITH PREVIOUS METHODS input result 14
PROBLEM WITH PREVIOUS METHODS • Do not feed the label map directly to the network • Use the label map to generate normalization layers instead 15
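SPADE (SPatially Adaptive DE-normalization) [Diagram: conv layers produce $\gamma$ and $\delta$; the network input $y$ passes through a parameter-free (label-free) Batch Norm and is then modulated element-wise to give the network output $z$] $z = \frac{y - \mu}{\sigma} \cdot \delta + \gamma$ 16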
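A rough sketch of the SPADE idea, assuming a single channel flattened to a list. The conv layers that produce the modulation parameters are replaced here by a per-class lookup (roughly what a 1x1 conv over a one-hot label map would compute); the names `spade`, `scale_per_class` and `shift_per_class` and all numeric values are hypothetical. Following the slide's formula, the scale plays the role of δ and the shift the role of γ.

```python
from statistics import fmean, pvariance

def normalize(values, eps=1e-5):
    """Parameter-free normalization: no learned scale or shift."""
    mean = fmean(values)
    std = (pvariance(values, mean) + eps) ** 0.5
    return [(v - mean) / std for v in values]

def spade(features, labels, scale_per_class, shift_per_class):
    """Normalize the features, then de-normalize each pixel with a scale (delta)
    and shift (gamma) selected by that pixel's label."""
    normalized = normalize(features)
    return [
        n * scale_per_class[c] + shift_per_class[c]
        for n, c in zip(normalized, labels)
    ]

# Hypothetical per-class parameters standing in for the conv outputs.
scale_per_class = {0: 1.0, 1: 2.0}
shift_per_class = {0: 0.0, 1: 0.5}

features = [0.3] * 4  # uniform activations, the case where plain normalization loses the label
out_class0 = spade(features, [0, 0, 0, 0], scale_per_class, shift_per_class)
out_class1 = spade(features, [1, 1, 1, 1], scale_per_class, shift_per_class)
print(out_class0 != out_class1)
```

Because the label enters after normalization, two uniform inputs with different classes no longer produce the same output.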
SPADE (SPatially Adaptive DE-normalization) [Diagram: conv layers produce $\gamma$ and $\delta$, which are applied element-wise after a parameter-free Batch Norm] 19
SPADE RESIDUAL BLOCKS SPADE ResBlk: SPADE → ReLU → 3x3 Conv → SPADE → ReLU → 3x3 Conv 20
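The block structure can be sketched as plain function composition. This is an illustration only: the SPADE and conv layers are passed in as callables, the skip connection is assumed to be a simple element-wise addition, and the stand-in layers below exist just to make the sketch runnable.

```python
def relu(values):
    """Element-wise ReLU on a flat list of activations."""
    return [max(0.0, v) for v in values]

def spade_resblk(x, labels, spade1, conv1, spade2, conv2):
    """Residual branch: SPADE -> ReLU -> conv, twice, then add the skip path."""
    h = conv1(relu(spade1(x, labels)))
    h = conv2(relu(spade2(h, labels)))
    return [a + b for a, b in zip(x, h)]

# Stand-ins (hypothetical): this SPADE ignores the labels, this conv halves its input.
identity_spade = lambda values, labels: values
halving_conv = lambda values: [0.5 * v for v in values]

out = spade_resblk([1.0, -2.0, 4.0], [0, 0, 1],
                   identity_spade, halving_conv, identity_spade, halving_conv)
print(out)
```

The label map is handed to every SPADE layer, so the semantic information is re-injected at each stage rather than only at the network input.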
SPADE GENERATOR [Diagram: a chain of four SPADE ResBlks] 21
PROBLEM WITH PREVIOUS METHODS input w/o SPADE w/ SPADE 22
Multimodal Results on Flickr IMAGE RESULTS 26
VIDEO-TO-VIDEO SYNTHESIS 33
IMAGE-TO-IMAGE SYNTHESIS Building Tree Car Sidewalk Road 34
VIDEO-TO-VIDEO SYNTHESIS 35
MOTIVATION • AI-based rendering: traditional graphics (geometry, texture, lighting) vs. machine learning graphics (data) 39
MOTIVATION • AI-based rendering • High-level semantic manipulation [Diagram: original image → high-level representation via segmentation, keypoint detection, etc. (largely explored); "Edit here!" on the high-level representation; high-level representation → new image via image/video synthesis (little explored, this work)] 40
PREVIOUS WORK • Image translation: pix2pixHD [2018], CRN [2017], pix2pix [2017] • Unconditional synthesis: MoCoGAN [2018], TGAN [2017], VGAN [2016] • Video style transfer: COVST [2017], ArtST [2016] • Video prediction: MCNet [2017], PredNet [2017] 41
PREVIOUS WORK: FRAME-BY-FRAME RESULT 42
OUR METHOD • Sequential generator • Multi-scale temporal discriminator • Spatio-temporal progressive training procedure 43
OUR METHOD Sequential Generator W 44
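A sketch of what "sequential" could look like in code, assuming each frame is generated from a short window of previously generated frames plus the current and recent semantic maps. The window length `history=2` and the `generate_frame` callable are assumptions for illustration, not the released vid2vid interface.

```python
def synthesize_video(semantic_maps, generate_frame, history=2):
    """Generate frames in order; each step sees the last `history` generated
    frames plus the current and recent semantic maps."""
    frames = []
    for t in range(len(semantic_maps)):
        past_frames = frames[max(0, t - history):t]
        recent_maps = semantic_maps[max(0, t - history):t + 1]
        frames.append(generate_frame(past_frames, recent_maps))
    return frames

# Toy stand-in generator: it only records how much context it received.
toy_generator = lambda past, maps: (len(past), len(maps))

result = synthesize_video(["s0", "s1", "s2", "s3"], toy_generator)
print(result)
```

Conditioning on earlier outputs is what separates this from frame-by-frame translation, where every frame is produced independently.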
OUR METHOD Sequential Generator (W) • Multi-scale Discriminators • Image Discriminator: D1, D2, D3 • Video Discriminator: D1, D2, D3 45
OUR METHOD Spatio-temporally Progressive Training • Spatially progressive (S) • Temporally progressive (T) • Alternating training: S, T, S, T, ... [Diagram: residual blocks] 46
RESULTS 47
RESULTS • Semantic → Street view scenes • Edges → Human faces • Poses → Human bodies 48
STREET VIEW: CITYSCAPES Semantic map pix2pixHD COVST (video style transfer) Ours 50
STREET VIEW: BOSTON 51
STREET VIEW: NYC 52
RESULTS • Semantic → Street view scenes • Edges → Human faces • Poses → Human bodies 53
FACE SWAPPING (FACE → EDGE → FACE) input edges output 54
FACE SWAPPING (SLIMMER FACE) input (slimmed) edges (slimmed) output 55
MULTI-MODAL EDGE → FACE Style 1 Style 2 Style 3 57
RESULTS • Semantic → Street view scenes • Edges → Human faces • Poses → Human bodies 58
MOTION TRANSFER (BODY → POSE → BODY) input poses output 59
MOTION TRANSFER 63
EXTENSION: FRAME PREDICTION • Goal: predict future frames given past frames • Our method: decompose prediction into two steps • 1. Predict the semantic map for the next frame • 2. Synthesize the frame based on the semantic map 64
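The two-step decomposition can be sketched as follows. Both models are passed in as callables; `predict_map`, `synthesize` and the toy stand-ins are hypothetical names used only to make the control flow concrete.

```python
def predict_next_frame(past_frames, past_maps, predict_map, synthesize):
    """Step 1: predict the next semantic map.
    Step 2: synthesize the next frame from the maps, including the predicted one."""
    next_map = predict_map(past_maps)
    next_frame = synthesize(past_frames, past_maps + [next_map])
    return next_map, next_frame

# Toy stand-ins: repeat the last map, and describe the inputs instead of rendering.
repeat_last_map = lambda maps: maps[-1]
describe = lambda frames, maps: f"frame from {maps[-1]} after {len(frames)} frames"

next_map, next_frame = predict_next_frame(
    ["x0", "x1"], ["s0", "s1"], repeat_last_map, describe
)
print(next_map, "|", next_frame)
```

Splitting the problem this way lets the second step reuse the same kind of map-to-frame synthesis as the rest of the talk.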
EXTENSION: FRAME PREDICTION Ground truth PredNet MCNet Ours 65
INTERACTIVE GRAPHICS 66
PATH TO INTERACTIVE GRAPHICS • Real-time inference • Combining with existing graphics pipeline • Domain gap between real input and synthetic input 67
PATH TO INTERACTIVE GRAPHICS • Real-time inference • FP16 + TensorRT → ~5 times speedup • 36 ms (27.8 fps) for 1080p inference • Overall: 15~25 fps 69
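The latency and frame-rate figures on this slide are related by a simple reciprocal, worked out below. The helper names are illustrative; reading the overall range as a per-frame time budget is just the same arithmetic applied in reverse.

```python
def frames_per_second(latency_ms):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms

def frame_time_ms(fps):
    """Convert frames per second to the time available per frame in milliseconds."""
    return 1000.0 / fps

network_fps = frames_per_second(36.0)
print(round(network_fps, 1))       # 27.8, matching the slide

# Per-frame time implied by the overall 15~25 fps range.
slowest_ms = frame_time_ms(15.0)   # about 66.7 ms per frame
fastest_ms = frame_time_ms(25.0)   # 40.0 ms per frame
print(round(slowest_ms, 1), fastest_ms)
```

The overall rate is lower than the 27.8 fps inference rate, which suggests that each displayed frame takes longer than the 36 ms network pass alone.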
PATH TO INTERACTIVE GRAPHICS • Real-time inference • Combining with existing graphics pipeline • CARLA: open-source simulator for autonomous driving research • Make the game engine render semantic maps • Pass the maps to the network and display the inference result 70
PATH TO INTERACTIVE GRAPHICS • Real-time inference • Combining with existing graphics pipeline • Domain gap between real input and synthetic input • Network trained on real data but tested on synthetic data • Things that differ: object shapes/edges, density of objects, camera viewpoints, etc. • Ongoing work 71
ORIGINAL CARLA IMAGE 72
RENDERED SEMANTIC MAPS 73
RECORDED DEMO RESULTS 74
CONCLUSION 76
CONCLUSION • What can we achieve? • What can it be used for? 77
CONCLUSION • What can we achieve? • Synthesize high-res realistic images 78
CONCLUSION • What can we achieve? • Synthesize high-res realistic images • Produce temporally-smooth videos 79
CONCLUSION • What can we achieve? • Synthesize high-res realistic images • Produce temporally-smooth videos • Reinvent interactive graphics 80
CONCLUSION • What can we achieve? • What can it be used for? • AI-based rendering (traditional graphics → machine learning graphics) • High-level semantic manipulation (original image → high-level representation → new image) 81
THANK YOU https://github.com/NVIDIA/vid2vid