DVDNet: Deep Blind Video Decaptioning with 3D-2D Gated Convolutions



SLIDE 1

DVDNet Deep Blind Video Decaptioning with 3D-2D Gated Convolutions

Dahun Kim*, Sanghyun Woo*, Joonyoung Lee, In So Kweon

2018 ChaLearn Looking at People Challenge

  • Track 2. Video Decaptioning
SLIDE 2

Our Problem

Need to consider two important points:

  • 1. Video : Sequence of frames
  • 2. Blind : No inpainting mask

Remove text overlays in video

SLIDE 3

Model Overview

Two important points :

  • Video : Sequence of frames
  • Blind : No inpainting mask
  • 3D-2D U-net
  • Residual learning

+ Gated convolution

[Figure: input → 3D gated-CNN encoder → 2D gated-CNN decoder → prediction, with skip connections]

SLIDE 4

Vanilla 2D U-Net*

Two important points :

  • Video : Sequence of frames
  • Blind : No inpainting mask

Frame-by-frame operation

  • Captures spatial context
  • Misses scene dynamics

[Figure: input → 2D CNN encoder → 2D CNN decoder → prediction, with skip connections]

* Ronneberger, O. et al. “U-Net: Convolutional networks for biomedical image segmentation.” MICCAI 2015.

SLIDE 5

Input : Multiple frames

Scene dynamics

  • Aggregate hints from spatio-temporal neighborhoods

  • Object movements
  • Subtitle changes

SLIDE 6

Vanilla 3D U-Net*

Multiple frame prediction

[Figure: input frames → 3D CNN encoder → 3D CNN decoder → multi-frame prediction, with skip connections]

  • Hard problem
  • Computationally heavy
  • Non-uniform quality across predicted frames

* Çiçek, Ö. et al. “3D U-Net: Learning dense volumetric segmentation from sparse annotation.” MICCAI 2016.

SLIDE 7

Output : Single frame

Focus on a single frame

  • Aggregate hints from lagging and leading frames.

[Figure: lagging frames + center frame + leading frames → 3D-2D U-Net → single output frame]

  • Easy problem
  • Light-weight
  • Temporal view range
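This single-center-frame formulation implies a sliding temporal window at inference time. A hypothetical driver (function and model names are assumptions, not the authors' code) maps each window of T frames to one restored center frame, replicating the first/last frame at video boundaries:

```python
# Hypothetical sliding-window driver: the model restores the center frame of
# each T-frame window; boundary frames are handled by replication padding.
def decaption_video(frames, model, T=5):
    half = T // 2
    # replicate boundary frames so every position has a full T-frame window
    padded = frames[:1] * half + frames + frames[-1:] * half
    return [model(padded[i:i + T]) for i in range(len(frames))]

# toy stand-in "model": just returns the center frame of the window unchanged
restored = decaption_video(list(range(7)), model=lambda w: w[len(w) // 2], T=5)
```

With the identity stand-in model, every frame maps to itself, which checks that the window indexing is centered correctly.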
SLIDE 8

3D-2D U-Net architecture

Focus on a single frame


  • 3D convolutions flatten the encoder features into one frame.
  • Skip features are flattened the same way to match the decoder shape and concatenated.

[Figure: input → 3D gated-CNN encoder → 2D gated-CNN decoder → prediction, with skip connections]
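A minimal PyTorch sketch of the flattening idea (layer sizes, channel counts, and names are assumptions, not the paper's architecture): the 3D encoder's final kernel spans the full temporal extent, collapsing T frames to one, after which a 2D decoder produces the single restored frame.

```python
import torch
import torch.nn as nn

class Tiny3D2DUNet(nn.Module):
    """Toy 3D-encoder / 2D-decoder hybrid (illustrative, not DVDNet itself)."""
    def __init__(self, ch=8, T=5):
        super().__init__()
        self.enc = nn.Sequential(
            # spatial stride 2, temporal stride 1: (B,3,T,H,W) -> (B,ch,T,H/2,W/2)
            nn.Conv3d(3, ch, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(),
            # temporal kernel spans all T frames: collapses time axis to 1
            nn.Conv3d(ch, ch, kernel_size=(T, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(),
        )
        self.dec = nn.Sequential(
            # 2D decoder upsamples the single flattened frame back to full size
            nn.ConvTranspose2d(ch, ch, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(ch, 3, kernel_size=3, padding=1),
        )

    def forward(self, clip):          # clip: (B, 3, T, H, W)
        feat = self.enc(clip)         # (B, ch, 1, H/2, W/2)
        feat2d = feat.squeeze(2)      # drop the collapsed time axis
        return self.dec(feat2d)       # (B, 3, H, W): one restored frame

clip = torch.randn(1, 3, 5, 32, 32)
out = Tiny3D2DUNet().forward(clip)
```

The full-span temporal kernel is one simple way to "flatten" the time axis; strided temporal convolutions stacked over several layers would achieve the same shape match for the skip connections.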

SLIDE 9

Residual Learning

Two important points :

  • Video : Sequence of frames
  • Blind : No inpainting mask
  • Residual learning
  • Not touching good pixels
  • Focus on the corrupted regions

[Figure: input → 3D gated-CNN encoder → 2D gated-CNN decoder → prediction, with skip connections]

→ Implicitly learns the inpainting mask
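The residual formulation can be stated in one line: the network predicts only a correction, which is added back onto the input center frame, so wherever the predicted residual is near zero the original (uncorrupted) pixels pass through untouched. A minimal sketch (function name is an assumption):

```python
import torch

def residual_decaption(center_frame, predicted_residual):
    """Residual learning: output = input + predicted correction, clamped to
    valid intensities. A zero residual leaves good pixels exactly as they were,
    so the network only needs to act on the corrupted (captioned) regions."""
    return (center_frame + predicted_residual).clamp(0.0, 1.0)

frame = torch.rand(3, 4, 4)                     # intensities in [0, 1)
restored = residual_decaption(frame, torch.zeros_like(frame))
```

Because the network must output near-zero residuals on clean pixels, it is pushed to localize the corrupted regions, which is why the slide says it implicitly learns the inpainting mask.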

SLIDE 10

Gated Convolution*

(Convolution + Attention)

[Figure: input feature → two parallel convolutions; one passes through a sigmoid and gates the other]

  • Gating values in [0, 1]
  • Acts as soft attention over the features

* Yu, J. et al. “Free-form image inpainting with gated convolution”. arXiv preprint arXiv:1806.03589.
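The gating mechanism from Yu et al. is compact enough to sketch directly: two parallel convolutions over the same input, with one branch squashed through a sigmoid and used to softly mask the other. Class and parameter names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: feature branch modulated by a learned sigmoid gate
    in [0, 1] at every channel and spatial location (soft attention)."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.gate = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        # element-wise product: gate ~ 0 suppresses, gate ~ 1 passes through
        return self.feature(x) * torch.sigmoid(self.gate(x))

x = torch.randn(1, 3, 8, 8)
y = GatedConv2d(3, 6).forward(x)
```

In the blind setting this matters because there is no mask input: the gate branch can learn to down-weight features from caption-covered pixels on its own. The same construction extends to `nn.Conv3d` for the encoder.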

SLIDE 11

Loss Function

L1 + gradient L1 + SSIM loss
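A sketch of the first two terms (weights and function names are assumptions; the SSIM term from the slide is omitted here for brevity): pixel-wise L1 plus an L1 penalty on horizontal and vertical finite-difference image gradients, which encourages sharp edges in the restoration.

```python
import torch
import torch.nn.functional as F

def decaption_loss(pred, target, w_grad=1.0):
    """Pixel L1 + gradient L1 (SSIM term from the slide not included).
    Weights are illustrative, not the authors' values."""
    l1 = F.l1_loss(pred, target)
    # finite-difference gradients along width (dx) and height (dy)
    dx = lambda t: t[..., :, 1:] - t[..., :, :-1]
    dy = lambda t: t[..., 1:, :] - t[..., :-1, :]
    grad_l1 = F.l1_loss(dx(pred), dx(target)) + F.l1_loss(dy(pred), dy(target))
    return l1 + w_grad * grad_l1

t = torch.rand(1, 3, 8, 8)
loss = decaption_loss(t, t)   # identical inputs give exactly zero loss
```

The gradient term penalizes edge mismatches that plain L1 averages away, while SSIM (e.g. `skimage.metrics.structural_similarity` as a reference implementation) adds a structural/perceptual component.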

SLIDE 12

Quantitative Results

SLIDE 13

Qualitative Results

SLIDE 14

DVDNet Deep Blind Video Decaptioning with 3D-2D Gated Convolutions

Dahun Kim*, Sanghyun Woo*, Joonyoung Lee, In So Kweon

2018 ChaLearn Looking at People Challenge

  • Track 2. Video Decaptioning