BUPT-MCPRL@TRECVID 2019 - PowerPoint PPT Presentation

BUPT-MCPRL@TRECVID 2019
Guanyu Chen, Chong Chen, Xinyu Li, Xuanli Xiang, Zhicheng Zhao, Yanyun Zhao, Fei Su
Multimedia Communication and Pattern Recognition Labs, Beijing University of Posts and Telecommunications (BUPT-MCPRL)
loraschen@bupt.edu.cn

Instance Search
- Parse INS into multiple related visual subtasks, and propose a novel INS framework based on multi-task retrieval and re-ranking.
- An improved two-pathway ECO network (IECO) is designed to enhance video feature extraction.
- A new relative pose representation (RPR) is presented, and a light pose-based action recognition network is constructed to restrain the impact of camera movement.
- The experimental results on four datasets demonstrate the effectiveness of the proposed INS framework.

Instance Search
[Figure: overview of the INS framework and its subtasks: face detection, expression recognition, action recognition, human-object interaction, object detection, and pose estimation.]

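Face Detection
[Figure: face matching pipeline. Faces are detected and cropped from the video and from the query image, a face feature extractor produces face features for both, and the dot product of the features gives their cosine similarity.]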
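The matching step in this pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes face features are plain lists of floats, and the names `l2_normalize`, `cosine_similarity`, and `rank_faces` are hypothetical. After L2 normalization, the dot product of two features equals their cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two L2-normalized features."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_faces(query_feature, gallery):
    """Rank gallery faces by similarity to the query face.

    gallery: dict mapping a (hypothetical) shot id to a face feature.
    Returns a list of (shot_id, similarity), most similar first.
    """
    scored = [(shot_id, cosine_similarity(query_feature, feat))
              for shot_id, feat in gallery.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In practice the features would come from a face feature extractor applied to each cropped face; the extractor itself is outside this sketch.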

Face Detection
Comparison of MTCNN with DSFD: the DSFD model sometimes detects false faces, and its bounding boxes are not always accurate.

Face Detection
[Figure: retrieved faces at ranks 1, 1000, 3000, 5000, and 10000.]

Expression Recognition
[Figure: architecture of expression-related action retrieval: face detection, crop and resize, expression recognition.]

Accuracy on the public dataset FER2013:

Model strategy                        FER2013 test set accuracy
VGG19_SOFTMAX                         68.89%
VGG19_DROPOUT_RANDOMCROP_SOFTMAX      71.49%

Expression Recognition
[Figure: example results. Panel labels: False Detection, Laughing, Crying, Shouting.]

Human-Object Interaction
[Figure: pipeline. Object detection produces human bounding boxes, which go to pose estimation; both feed human-object interaction.]
1) Use YOLOv3 to detect key objects such as glass, bag, phone, and person.
2) Feed the human bounding boxes into HRNet to estimate human poses.
3) Calculate the relative distance between each key object and the interacting keypoint to measure the dependence of the human-object interaction, and group the initial rank list accordingly.
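Step 3 can be sketched as follows. This is only an illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the distance is normalized by the diagonal of the person box, and the grouping threshold of 0.25 is a made-up value. The slides specify neither the normalization nor the grouping rule, and all function names are hypothetical.

```python
import math


def box_center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def relative_distance(object_box, keypoint, person_box):
    """Distance from the object center to a keypoint, divided by the
    person-box diagonal (an assumed normalization) so that the value
    does not depend on how large the person appears in the frame."""
    cx, cy = box_center(object_box)
    kx, ky = keypoint
    px1, py1, px2, py2 = person_box
    scale = math.hypot(px2 - px1, py2 - py1)
    if scale == 0.0:
        return float("inf")
    return math.hypot(cx - kx, cy - ky) / scale


def group_ranklist(ranklist, threshold=0.25):
    """Split an initial rank list into likely-interacting shots and the rest.

    ranklist: list of (shot_id, relative_distance) in initial rank order.
    The original order is preserved inside each group.
    """
    interacting = [shot for shot, dist in ranklist if dist <= threshold]
    others = [shot for shot, dist in ranklist if dist > threshold]
    return interacting, others
```

For an action such as Holding_phone, the keypoint would be a wrist and the object box a detected phone; shots in the first group can then be placed ahead of the others.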

Human-Object Interaction
Left: Architecture of HRNet [1]. It can extract a high-resolution representation from the input image.
Right: Comparison of OpenPose and HRNet. The former performs poorly when one person overlaps with another.
[1] Sun, Ke, et al. "Deep High-Resolution Representation Learning for Human Pose Estimation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2019.

Human-Object Interaction
[Figure: example results for the queries "Pat Sit_on_couch" and "Ian Holding_phone".]

Action Recognition
Left: Architecture of ECO [1]. We choose it as the basic network for video vector extraction.
Right: Architecture of SlowFast [2], which takes videos with different frame rates as input.
[1] Zolfaghari, Mohammadreza, Kamaljeet Singh, and Thomas Brox. "ECO: Efficient Convolutional Network for Online Video Understanding." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
[2] Feichtenhofer, Christoph, et al. "SlowFast Networks for Video Recognition." arXiv preprint arXiv:1812.03982 (2018).

Action Recognition
[Figure: architecture of the proposed IECO. One ECO pathway takes 4 frames and produces video feature 1, a second ECO pathway takes 32 frames and produces video feature 2, and the two are combined into the final video feature.]

Results on HMDB and UCF101 based on ECO with different pathways. IECO improves on both datasets.

Pathway               HMDB (mAP)   UCF101 (mAP)
One (16 frame)        46.68        67.90
Two (4 & 32 frame)    54.39        72.89
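A rough sketch of the two-pathway idea is given below. It is not the IECO implementation: frame sampling takes the middle frame of equal-length segments, and the two pathway features are simply concatenated, both of which are assumptions since the slide does not state how frames are chosen or how the features are fused. `extract_sparse` and `extract_dense` are hypothetical stand-ins for the two ECO pathways.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick num_samples indices by taking the middle frame of each of
    num_samples equal-length segments of the video."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    segment = num_frames / num_samples
    return [min(num_frames - 1, int(segment * i + segment / 2))
            for i in range(num_samples)]


def two_pathway_feature(frames, extract_sparse, extract_dense):
    """Build a video feature from a 4-frame and a 32-frame pathway.

    extract_sparse / extract_dense: callables that map a list of frames
    to a feature vector (stand-ins for the two ECO networks).
    Fusion by concatenation is an assumption of this sketch.
    """
    sparse = [frames[i] for i in sample_frame_indices(len(frames), 4)]
    dense = [frames[i] for i in sample_frame_indices(len(frames), 32)]
    return list(extract_sparse(sparse)) + list(extract_dense(dense))
```

The sparse pathway sees a coarse summary of the whole clip, while the dense pathway sees finer temporal detail.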

Action Recognition
[Figure: example results for the queries "Jack Kissing" and "Stacey Hugging".]

Pose-based Action Detection
Two types of pose-based action detection models. The left one [1] encodes the temporal information of keypoint motion, and the right one [2] encodes the positions of keypoints in the image.
[1] Choutas, Vasileios, et al. "PoTion: Pose MoTion Representation for Action Recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2018.
[2] Ludl, Dennis, Thomas Gulde, and Cristóbal Curio. "Simple yet efficient real-time pose-based action recognition." arXiv preprint arXiv:1904.09140 (2019).

Pose-based Action Detection
rela_dis: the normalized distance between a keypoint and the nose.
rela_angle: the angle between the x-axis and the line that joins the keypoint and the nose.
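The two quantities can be computed per keypoint as in the sketch below. It assumes keypoints are `(x, y)` pixel coordinates with the nose at index 0, and it normalizes distances by the diagonal of the keypoints' bounding box; the slide says only "normalized", so that choice and the function name `relative_pose` are assumptions.

```python
import math


def relative_pose(keypoints, nose_index=0, scale=None):
    """Return a list of (rela_dis, rela_angle) pairs, one per non-nose keypoint.

    keypoints: list of (x, y) coordinates for one person in one frame.
    scale: normalization length; defaults to the diagonal of the
           bounding box around all keypoints (an assumed choice).
    """
    nose_x, nose_y = keypoints[nose_index]
    if scale is None:
        xs = [x for x, _ in keypoints]
        ys = [y for _, y in keypoints]
        scale = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    features = []
    for i, (x, y) in enumerate(keypoints):
        if i == nose_index:
            continue
        dx, dy = x - nose_x, y - nose_y
        rela_dis = math.hypot(dx, dy) / scale if scale > 0 else 0.0
        rela_angle = math.atan2(dy, dx)  # radians, measured from the x-axis
        features.append((rela_dis, rela_angle))
    return features
```

Because every value is measured relative to the nose, shifting all keypoints by the same offset (as a panning camera does) leaves the representation unchanged.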

Pose-based Action Detection
[Figure: network used for training RPR.]

Results on JHMDB-1 with various channels and blocks:

Architecture (channels)   JHMDB-1
(64, 128)                 60.11 ± 2.81
(128, 256)                62.29 ± 2.50 *
(64, 128, 256)            60.49 ± 3.93
(128, 256, 512)           61.09 ± 4.08

* best result

Pose-based Action Detection
Comparison of two different concatenation methods:

Concatenation method     JHMDB-1-GT
Stacked (one pathway)    68.51 ± 4.25
Two pathway              71.38 ± 2.13 *

Improvement on INS19:

Run ID                   mAP
F_M_E_E_BUPT_MCPRL_2     11.6
F_M_E_E_BUPT_MCPRL_1     11.9 *

Results on JHMDB-1 compared with two state-of-the-art algorithms. JHMDB-1-GT means using the pose data given by the JHMDB dataset to classify pose representations.

Methods           JHMDB-1           JHMDB-1-GT
Choutas et al.    59.1              70.8
Ludl et al.       60.3 ± 1.3        65.5 ± 2.8
RPR (ours)        62.29 ± 2.50 *    71.38 ± 2.13 *

* best result

Pose-based Action Detection
[Figure: example result for the query "Ian Open_door_enter".]

Conclusion
- Parse INS into several related subtasks and propose a multi-task retrieval framework.
- Detect the specific person based on face matching.
- Apply expression recognition to related instances.
- Measure the semantic dependence between target persons and the corresponding objects to detect human-object interactions.
- Construct a light pose-based action detection network and a two-pathway ECO to re-rank the INS result list.
- The experimental results on four datasets demonstrate the effectiveness of this INS framework.

Future work
- Human tracking
- End-to-end trainable HOI models
- Action localization
- Integrating text and audio information
- More reasonable fusion methods
- …

Thanks!
