Video Images WangTao, wtao@qiyi.com IQIYI ltd. 2016.4 Outline - PowerPoint PPT Presentation

CNN Based Object Detection in Large Video Images WangTao, wtao@qiyi.com IQIYI ltd. 2016.4

Outline
• Introduction
• Background
• Challenge
• Our approach
• System framework
• Object detection
• Scene recognition
• Body segmentation
• Same style matching
• Experiments
• Conclusion

Background
• Image retrieval
• Video advertising
Video out applications

Challenge
• Real video data vs. image dataset
  - Cluttered background
  - Multiple objects
  - Small objects
  - Varying pose/position
  - Partial occlusion

Our task
• Problems:
  - Content-based object retrieval in large video images
  - High accuracy for same style matching
  - High speed in large video database
• Solution:
  - Accurate object detection + scene classification
  - Discriminative DNN features and PCA/LDA transformation
  - Speed-up by parallel indexing and hierarchical filtering

System framework
[Figure: system framework]
• Indexing path: video → key frame → scene classification, object detection → body segmentation → CNN feature → indexing → database
• Query path: query image → scene classification, Faster-RCNN query rect → body segmentation → CNN feature → match → distance sort → result

Object detection (I)
• Object detection by Faster-RCNN
  - Faster-RCNN: region proposals + object scores [Ren, Shaoqing, et al. NIPS2015]
  - Trained on MS COCO db (300k images) + video images (10k images)
  - More pervasive and general for images with multiple objects

• Multi-class object detection including
  - Clothes (skirt, jacket, trousers)
  - Bags (handbag, backpack, trolley case (draw-bar box))
  - Electronics (mobile phone, laptop, TV, keyboard, mouse, microwave oven, oven, refrigerator)
  - Glasses, necklace, hat
  - Shoes

Object detection (II)
• Object detection by CNN regression
  - Input an image, output the coordinates of the object rectangle [Erhan, Dumitru, et al. CVPR2014]
  - Efficient for images with a single object that Faster-RCNN does not recognize

Body Segmentation
• Constrained by human body parts
• CNN based body segmentation [Jonathan Long, CVPR2015]
• Bounding box, body mask, body parsing
[Figure: original image and segmentation image]

Scene classification
• CNN based scene classification [Bolei Zhou, NIPS2014]
• Pipeline: video → key frame → is scene? (yes/no) → CNN based scene classification → multi-frame tags fusion → scene classification
• Precision: 65.8%, Recall: 74%
• Threshold@0.7: Precision: 83.8%, Recall: 56.7%
[Figure: non-scene images; scene images of kitchen, office, living room, and bedroom]
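The slide names a multi-frame tag fusion step and a 0.7 threshold but does not say how the fusion is done. The sketch below is one plausible reading, assuming per-frame class scores are averaged over the key frames of a shot and the top class is accepted only if its mean score reaches the threshold. The function name `fuse_scene_tags` and the score dictionaries are hypothetical, not taken from the slides.

```python
from collections import defaultdict


def fuse_scene_tags(frame_scores, threshold=0.7):
    """Fuse per-frame scene scores by averaging (an assumed fusion rule).

    frame_scores: list of dicts mapping scene label -> score in [0, 1],
                  one dict per key frame.
    Returns (label, mean_score) for the best class, or None when the best
    mean score is below the threshold (treated here as "non-scene").
    """
    if not frame_scores:
        return None
    totals = defaultdict(float)
    for scores in frame_scores:
        for label, score in scores.items():
            totals[label] += score
    # A label missing from a frame contributes 0 for that frame.
    label, total = max(totals.items(), key=lambda kv: kv[1])
    mean_score = total / len(frame_scores)
    return (label, mean_score) if mean_score >= threshold else None


# Hypothetical scores for three key frames of one shot.
scene_frames = [
    {"kitchen": 0.9, "office": 0.1},
    {"kitchen": 0.8, "office": 0.2},
    {"kitchen": 0.7, "bedroom": 0.3},
]
uncertain_frames = [
    {"kitchen": 0.4, "office": 0.3},
    {"bedroom": 0.5, "office": 0.2},
]
scene_result = fuse_scene_tags(scene_frames)
uncertain_result = fuse_scene_tags(uncertain_frames)
```

Raising the threshold rejects more uncertain shots, which is consistent with the higher precision and lower recall reported at 0.7.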

Scene classes
• 0 kitchen, 1 dining, 2 bakery, 3 ice_cream_parlor, 4 bathroom, 5 washing_room, 6 bedroom, 7 living_room, 8 office, 9 children_room, 10 nursery, 11 toyshop, 12 shoe_shop, 13 jewelry_shop
• 14 outdoor_ice_world, 15 indoor_ice_skating_rink, 16 baseball, 17 football, 18 basketball_court, 19 swimming_pool, 20 track, 21 bowling_alley, 22 billiards, 23 tennis, 24 volleyball, 25 gymnasium, 26 pleasure_ground, 27 hospital_room
• 28 dentists, 29 drugstore, 30 music_studio, 31 music_store, 32 sandbeach, 33 hairsalon, 34 bar, 35 pagoda, 36 bamboo_forest, 37 mountain, 38 coast, 39 creek, 40 waterfall, 41 grass, 42 other

Same style matching
• SIFT feature matching
  - Normalization of SIFT
  - Dimension: 128 dim x 400 pts
  - MAP 22%
• CNN feature of ImageNet 1k classifier
  - Model: VGG19
  - Layers: fc7
  - Dimension: 4096 → 600
  - MAP 28%
• CNN feature of same style classifier
  - Model: VGG19
  - Layers: fc7
  - Dimension: 4096 → 600
  - MAP 34%
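The slides report MAP for each feature without defining it. As a reference point, here is a minimal sketch of the standard mean average precision over ranked retrieval results, assuming each query yields a ranked list of relevant/not-relevant flags. Whether the authors used exactly this variant (for example, which denominator or cut-off) is not stated, so treat the function names and the example data as illustrative.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: list of booleans, best-ranked result first;
                      True marks a same-style (relevant) result.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of average_precision over all queries."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)


# Two hypothetical queries.
ap_first = average_precision([True, False, True])   # (1/1 + 2/3) / 2
ap_second = average_precision([False, True])        # (1/2) / 1
map_score = mean_average_precision([[True, False, True], [False, True]])
```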

Multi-feature fusion
• Same class matching classifier trained on ImageNet 21k classes of 15M images
• Same style matching classifier trained on 1239 queries of 1M images
  CNN model                      Feature dim   MAP
  Inception_bn1k                 1024          24%
  Inception_21k                  1024          34%
  Vgg19_caffe                    4096          34%
  Inception_21k + vgg19_caffe    5120          43%
• Speed
  - Nvidia K40 GPU, 10x faster than CPU i7
  - Faster RCNN speed: 200 ms/frame, image size 1920x1080
  - Vgg19 feature speed: 60 ms/frame, image size 256x256
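The fused feature dimension in the table (5120 = 1024 + 4096) suggests the two CNN features are concatenated. The sketch below shows that idea together with the distance sort from the system framework. The L2 normalization of each part before concatenation and the Euclidean distance are assumptions, and `fuse` and `rank_by_distance` are hypothetical helper names, not the authors' code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(feat_a, feat_b):
    """Concatenate two features after normalizing each (assumed scheme)."""
    return l2_normalize(feat_a) + l2_normalize(feat_b)


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_by_distance(query, database):
    """Sort (item_id, feature) pairs by distance to the query, nearest first."""
    return sorted(database, key=lambda item: euclidean(query, item[1]))


# Dimension check with placeholder vectors of the sizes given in the table.
fused = fuse([1.0] * 1024, [1.0] * 4096)
fused_dim = len(fused)

# Tiny ranking example with 2-dim stand-ins for the two feature types.
database = [
    ("b", fuse([0.0, 1.0], [1.0, 0.0])),
    ("a", fuse([1.0, 0.0], [0.0, 1.0])),
]
ranked_ids = [item_id for item_id, _ in rank_by_distance(fuse([1.0, 0.0], [0.0, 1.0]), database)]
```

Normalizing each part first keeps the 4096-dim feature from dominating the distance purely because of its scale.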

Experiments
• MAP on 3M test images, trained on 1M images
  Columns: Vgg19 model, Full image, Object rectangle, PCA+LDA, Inception-21k, MAP
  × × × √ √ 27.8%
  × × × √ √ 34.2%
  × × √ √ √ 37.3%
  × × √ √ √ 43.1%
  × √ √ √ √ 46.1%
• Speed up
  - Parallel FLANN tree indexing
  - Hierarchical filtering by object classes, 10x faster
  - Query speed: 1 s/image on 5000 teleplays with 2M images
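Hierarchical filtering by object class can be pictured as keeping one index per detected class and searching only the bucket that matches the query's class. The sketch below illustrates that with a brute-force search inside each bucket, standing in for the FLANN tree index mentioned on the slide. The class `ClassFilteredIndex` and its methods are hypothetical names, and the sketch does not reproduce the parallel indexing.

```python
import math
from collections import defaultdict


class ClassFilteredIndex:
    """Per-class feature buckets; queries search only the matching bucket."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def add(self, object_class, item_id, feature):
        """Store a feature under the object class predicted by the detector."""
        self._buckets[object_class].append((item_id, feature))

    def query(self, object_class, feature, top_k=5):
        """Return up to top_k item ids of the same class, nearest first."""
        candidates = self._buckets.get(object_class, [])
        # Brute-force scan; a real system would use a tree index per bucket.
        scored = sorted((math.dist(feature, f), item_id) for item_id, f in candidates)
        return [item_id for _, item_id in scored[:top_k]]


index = ClassFilteredIndex()
index.add("handbag", "h1", [0.0, 0.0])
index.add("handbag", "h2", [1.0, 1.0])
index.add("shoes", "s1", [0.0, 0.1])

handbag_hits = index.query("handbag", [0.1, 0.0], top_k=2)
hat_hits = index.query("hat", [0.1, 0.0])
```

Because the shoe item is never compared against a handbag query, the search cost scales with the size of one class bucket rather than the whole database.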

Query system GUI

Query examples on image dataset

Query examples on video dataset

Conclusion
• The bounding box is important for recognizing objects
• Fusing same style matching features with same class matching features gives higher accuracy
• PCA and LDA further improve accuracy and speed
• GPU is faster than CPU for CNN feature extraction
• Queries are sped up by parallel indexing and hierarchical filtering

References
• Erhan, Dumitru, et al. "Scalable object detection using deep neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
• Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
• Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
• Arandjelović, Relja, and Andrew Zisserman. "Three things everyone should know to improve object retrieval." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2012.
• Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR. 2015. arXiv:1411.4038.
• Zheng, S., S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. "Conditional random fields as recurrent neural networks." ICCV. 2015.
• Shen, Li, Zhouchen Lin, and Qingming Huang. "Learning deep convolutional neural networks for Places2 scene recognition." CoRR. 2015.
• Zhou, Bolei, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. "Learning deep features for scene recognition using Places database." NIPS. 2014.
• Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. "Object detectors emerge in deep scene CNNs." ICLR. 2015.
• Wu, Ruobing, Baoyuan Wang, Wenping Wang, and Yizhou Yu. "Harvesting discriminative meta objects with deep CNN features for scene classification." ICCV. 2015.
• Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. "Rethinking the Inception architecture for computer vision." arXiv:1512.00567. 2015.
