
DVDNet: Deep Blind Video Decaptioning with 3D-2D Gated Convolutions

2018 ChaLearn Looking at People Challenge, Track 2: Video Decaptioning. DVDNet: Deep Blind Video Decaptioning with 3D-2D Gated Convolutions. Dahun Kim*, Sanghyun Woo*, Joonyoung Lee, In So Kweon. Our problem: remove text overlays in video.


  1. 2018 ChaLearn Looking at People Challenge, Track 2: Video Decaptioning. DVDNet: Deep Blind Video Decaptioning with 3D-2D Gated Convolutions. Dahun Kim*, Sanghyun Woo*, Joonyoung Lee, In So Kweon

  2. Our Problem: Remove text overlays in video. Need to consider two important points: 1. Video (sequence of frames) 2. Blind (no inpainting mask)

  3. Model Overview. [Diagram: input, 3D gated CNN encoder, skip connections, 2D gated CNN decoder, prediction, output] Two important points: • Video (sequence of frames): 3D-2D U-Net • Blind (no inpainting mask): residual learning + gated convolution

  4. Vanilla 2D U-Net*. Frame-by-frame operation • Spatial context. [Diagram: input, 2D CNN encoder, skip connections, 2D CNN decoder, prediction] Two important points: • Video: sequence of frames • Scene dynamics • Blind: no inpainting mask. * Ronneberger, O. et al. "U-Net: Convolutional networks for biomedical image segmentation." MICCAI 2015.

  5. Input: Multiple frames. Scene dynamics • Aggregate hints from spatio-temporal neighborhoods: object movements, subtitle changes

  6. Vanilla 3D U-Net*. Multiple frame prediction. [Diagram: input, 3D CNN encoder, skip connections, 3D CNN decoder, prediction] • Hard problem • Heavy • Not uniform prediction. * Çiçek, Ö. et al. "3D U-Net: Learning dense volumetric segmentation from sparse annotation." MICCAI 2016.

  7. Output: Single frame. Focus on a single frame • Aggregate hints from lagging and leading frames. [Diagram: lagging frames, center frame, leading frames, temporal view range, output] 3D-2D U-Net: • Easy problem • Light-weight

  8. 3D-2D U-Net architecture. Focus on a single frame. [Diagram: input, 3D gated CNN encoder, skip connections, 2D gated CNN decoder, prediction] • 3D convolutions flatten the encoder features into one frame, so that they match the shape of the 2D features and can be concatenated.

  9. Residual Learning. [Diagram: input, 3D gated CNN encoder, skip connections, 2D gated CNN decoder, prediction, output] The network implicitly knows the inpainting mask. Two important points: • Video: sequence of frames • Blind (no inpainting mask): residual learning, which does not touch good pixels and focuses on the corrupted regions

  10. + Attention: Gated Convolution*. • 0-1 value (gating) • Attentioning. [Diagram labels: Sigmoid, Conv, Conv, Input feature] * Yu, J. et al. "Free-form image inpainting with gated convolution." arXiv preprint arXiv:1806.03589.

  11. Loss Function: L1 + gradient L1 + SSIM loss

  12. Quantitative Results

  13. Qualitative Results

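Slide 7 describes predicting a single center frame from its lagging and leading neighbors. A minimal sketch of how such an input window might be gathered is shown here. The function name `temporal_window`, the `radius` parameter, and the choice to repeat the first or last frame at the clip boundaries are assumptions made for illustration; the slides do not state the window size or the boundary handling.

```python
import numpy as np

def temporal_window(frames, center, radius):
    """Return the frames from center - radius to center + radius (inclusive).

    Indices that fall outside the clip are clamped to the first or last
    frame (an assumed boundary rule, not taken from the slides).
    """
    last = len(frames) - 1
    idx = np.clip(np.arange(center - radius, center + radius + 1), 0, last)
    return frames[idx]

# Hypothetical clip: 10 RGB frames of size 4x4.
video = np.random.default_rng(0).random((10, 4, 4, 3))

# Window around the very first frame: indices become [0, 0, 0, 1, 2].
window = temporal_window(video, center=0, radius=2)
print(window.shape)
```

The network would then receive `window` as input and predict only the frame at the middle position, which is what makes the task "easy" and "light-weight" compared with predicting every frame.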
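Slide 8 says that 3D convolutions flatten the encoder features into one frame so they can be concatenated on the skip connections. One way to picture this, under assumptions, is a 3D convolution whose kernel spans the whole temporal axis with no temporal padding, so the time dimension shrinks to 1. The sketch below uses a per-channel temporal kernel of size T x 1 x 1 with fixed averaging weights; the real layer would have learned weights and would presumably mix channels as well, and the names `collapse_time`, `enc_feat`, and `dec_feat` are hypothetical.

```python
import numpy as np

def collapse_time(enc_feat, w_t):
    """Collapse the temporal axis of encoder features.

    enc_feat: array of shape (C, T, H, W)
    w_t:      temporal kernel of shape (T,), applied per channel
    returns:  array of shape (C, H, W)
    """
    return np.einsum("cthw,t->chw", enc_feat, w_t)

rng = np.random.default_rng(0)
enc_feat = rng.random((8, 5, 16, 16))   # 3D encoder features, T = 5 frames
dec_feat = rng.random((8, 16, 16))      # 2D decoder features at the same scale

w_t = np.full(5, 1.0 / 5.0)             # stand-in for learned weights
skip = collapse_time(enc_feat, w_t)     # (8, 16, 16), now a single "frame"

# With matching spatial shapes, the skip features can be concatenated
# with the decoder features along the channel axis.
merged = np.concatenate([dec_feat, skip], axis=0)
print(skip.shape, merged.shape)
```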
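The residual learning idea on slide 9 can be illustrated with a toy example: the network predicts a residual that is added to the corrupted center frame, so pixels with a near-zero residual are left untouched. In this sketch the residual is computed from a known clean frame purely for illustration (in the slides it is the network's prediction), and the clipping to [0, 1], the tolerance `tol`, and the helper names are assumptions.

```python
import numpy as np

def apply_residual(center_frame, residual):
    """Add the predicted residual to the input frame; clip to the valid range."""
    return np.clip(center_frame + residual, 0.0, 1.0)

def implied_mask(residual, tol=1e-3):
    """Pixels the residual actually changes: an implicit inpainting mask."""
    return np.abs(residual).max(axis=-1) > tol

# Toy frame: uniform gray background with a 2x2 white "caption" patch.
clean = np.full((4, 4, 3), 0.2)
corrupted = clean.copy()
corrupted[1:3, 1:3] = 1.0

# Ideal residual (for illustration only): zero on good pixels.
residual = clean - corrupted

restored = apply_residual(corrupted, residual)
mask = implied_mask(residual)
print(mask.astype(int))
```

The mask is never given to the model; it only falls out of where the residual is non-zero, which matches the "blind" setting described in the slides.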
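The gated convolution on slide 10 multiplies the output of one convolution by a 0-1 gate obtained by passing a second convolution through a sigmoid. The following is a minimal single-channel NumPy sketch of that idea, not the authors' implementation: the naive `conv2d_same` helper, the random kernels, and the omission of the feature activation used by Yu et al. are all simplifications chosen for illustration.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel 2D cross-correlation with zero padding.

    Assumes an odd-sized kernel; the output has the same shape as x.
    """
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_conv2d(x, k_feat, k_gate):
    """Return (gated output, gate); the gate lies between 0 and 1."""
    feature = conv2d_same(x, k_feat)        # ordinary convolution branch
    gate = sigmoid(conv2d_same(x, k_gate))  # soft per-pixel attention
    return feature * gate, gate

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 6))
k_feat = rng.standard_normal((3, 3))
k_gate = rng.standard_normal((3, 3))

out, gate = gated_conv2d(x, k_feat, k_gate)
print(out.shape, float(gate.min()), float(gate.max()))
```

Because the gate is learned per pixel, it can act as a soft attention map over corrupted regions without an explicit mask being supplied.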
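Slide 11 lists the loss as L1 + gradient L1 + SSIM loss without giving weights or implementation details. A rough sketch of how the three terms could be combined for a grayscale image is given below. The equal weights `w_l1`, `w_grad`, `w_ssim`, the finite-difference gradient, and the use of a single global SSIM window (instead of the usual local sliding windows) are assumptions made to keep the example short.

```python
import numpy as np

def l1(a, b):
    return np.mean(np.abs(a - b))

def gradient_l1(pred, target):
    """L1 distance between vertical and horizontal finite differences."""
    dy = l1(np.diff(pred, axis=0), np.diff(target, axis=0))
    dx = l1(np.diff(pred, axis=1), np.diff(target, axis=1))
    return dx + dy

def global_ssim(a, b, data_range=1.0):
    """SSIM computed from whole-image statistics (simplified, one window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = np.mean((a - mu_a) * (b - mu_b))
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def total_loss(pred, target, w_l1=1.0, w_grad=1.0, w_ssim=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_l1 * l1(pred, target)
            + w_grad * gradient_l1(pred, target)
            + w_ssim * (1.0 - global_ssim(pred, target)))

rng = np.random.default_rng(0)
target = rng.random((8, 8))
pred = np.clip(target + 0.1 * rng.standard_normal((8, 8)), 0.0, 1.0)

loss_same = total_loss(target, target)   # identical images
loss_noisy = total_loss(pred, target)    # noisy prediction
print(loss_same, loss_noisy)
```

The gradient term penalizes blurry or misplaced edges, and the SSIM term penalizes structural differences that a plain per-pixel L1 can miss.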
