
Understanding complex scenes a man holding a tennis racquet on - PowerPoint PPT Presentation



  1.    

  2. Understanding complex scenes. Example captions: "a man holding a tennis racquet on a tennis court"; "the man is on the tennis court playing a game".

  3.       

  4. Knowledge: Freebase. Text (Wikipedia.org): "Barack Obama is an American politician serving as the 44th President of the United States. Born in Honolulu, Hawaii, … in 2008, he defeated Republican nominee and was inaugurated as president on January 20, 2009." Vision: http://s122.photobucket.com/user/bmeuppls/media/stampede.jpg.html

  5. Winning entries of the COCO 2015 Caption Challenge. The compositional framework is *less elegant* but can potentially exploit non-paired image-caption data more effectively.

  6. Turing Test results at the MS COCO Captioning Challenge 2015

      | System | % of captions that pass the Turing Test | Official rank |
      |---|---|---|
      | MSR | 32.2% | 1st |
      | Google | 31.7% | 1st |
      | MSR Captivator | 30.1% | 3rd |
      | Montreal/Toronto | 27.2% | 3rd |
      | Berkeley LRCN | 26.8% | 5th |
      | Human | 67.5% | -- |

      Other groups: Baidu/UCLA, Stanford, Tsinghua, etc. Still a big gap!

  7.     

  8. [Figure: captioning pipeline. Components shown: ConvNets, features vector, visual concepts, celebrity model, landmark model, language model, DMSM, confidence. High-confidence output: "A small boat in Ha Long Bay". Low-confidence output: "This image contains: water, boat, lake, mountain, etc."] [Kenneth Tran, Xiaodong He, Lei Zhang, Jian Sun, Cornelia Carapcea, Chris Thrasher, Chris Buehler, Chris Sienkiewicz, submitted to CVPR Deep Vision 2016]

  9. [He, Zhang, Ren, Sun, 2015]    

  10. [Figure: caption generation from detected words (cabinets, room, wooden, kitchen, stove, sink, floor). Repeat to generate 500 candidates.] [Fang, et al., CVPR 2015]

  11. The deep multimodal semantic model [Fang, et al., CVPR 2015]. Semantic space: The overall semantics of a caption will also be represented by a vector in this space. If these two vectors are close to each other, then the caption is a good match for the image. Otherwise, not a matching caption. [Figure: two networks with hidden layers H1, H2, H3 and weights W1, W2, W3, W4, with inputs labeled t1 and s. Text input: "a man holding a tennis racquet on a tennis court". Image pathway labels: raw image pixels, convolution/pooling, image feature, fully connected.] [Huang, He, Gao, Deng et al., 2013] [He, Zhang, Ren, Sun, 2015]

  12.    [Guo, Zhang, Hu, He, Gao, 2016]

  13. [Figure: two networks with hidden layers H1, H2, H3 and weights W1, W2, W3, W4, with inputs labeled t1 and s. Inputs shown: the caption "a man holding a tennis racquet on a tennis court" and an image.]

  14. Human evaluation on 1000 random samples of the COCO test set.

      | System | Excellent | Good | Bad | Embarrassing |
      |---|---|---|---|---|
      | Fang et al., 2015 | 40.6% | 26.8% | 28.8% | 3.8% |
      | New system | 51.8% | 23.4% | 22.5% | 2.4% |

  15. Human evaluation on the Instagram test set, which contains 1380 random images that we scraped from Instagram.

      | System | Excellent | Good | Bad | Embarrassing |
      |---|---|---|---|---|
      | Fang et al., 2015 | 12.0% | 13.4% | 63.0% | 11.6% |
      | New system | 25.4% | 24.1% | 45.3% | 5.2% |

  16. Cognitive Services http://CaptionBot.ai
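
Slide 11 says a caption matches an image when their vectors lie close together in the shared semantic space. The sketch below illustrates that idea only; it is not the authors' implementation. It assumes cosine similarity as the closeness measure (the slide says only "close"), and the 3-dimensional vectors and the helper names `cosine` and `rank_captions` are hypothetical stand-ins for the outputs of the two networks.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_captions(image_vec, caption_vecs):
    """Return (score, caption) pairs sorted from best to worst match."""
    scored = [(cosine(image_vec, vec), caption) for caption, vec in caption_vecs.items()]
    return sorted(scored, reverse=True)


# Made-up vectors standing in for the image and caption embeddings.
image_vec = [0.9, 0.1, 0.3]
caption_vecs = {
    "a man holding a tennis racquet on a tennis court": [0.8, 0.2, 0.4],
    "a small boat on a lake": [-0.1, 0.9, 0.2],
}

ranking = rank_captions(image_vec, caption_vecs)
best_score, best_caption = ranking[0]
print(best_caption)
```

With these illustrative vectors the tennis caption ranks first, because its vector points in nearly the same direction as the image vector.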

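The human-evaluation figures on slides 14 and 15 are easier to compare once the ratings are grouped. The sketch below adds "Excellent" and "Good" into a single positive share per system. That grouping is an assumption made here for illustration, not something the slides define, and the names `RATINGS`, `positive_share`, and `gain` are hypothetical.

```python
# Percentages copied from slides 14 and 15: (Excellent, Good, Bad, Embarrassing).
RATINGS = {
    "COCO": {
        "Fang et al., 2015": (40.6, 26.8, 28.8, 3.8),
        "New system": (51.8, 23.4, 22.5, 2.4),
    },
    "Instagram": {
        "Fang et al., 2015": (12.0, 13.4, 63.0, 11.6),
        "New system": (25.4, 24.1, 45.3, 5.2),
    },
}


def positive_share(row):
    """Excellent + Good, in percent (grouping assumed for illustration)."""
    excellent, good, _bad, _embarrassing = row
    return round(excellent + good, 1)


def gain(test_set):
    """Percentage-point improvement of the new system over Fang et al., 2015."""
    rows = RATINGS[test_set]
    new = positive_share(rows["New system"])
    old = positive_share(rows["Fang et al., 2015"])
    return round(new - old, 1)


for test_set, rows in RATINGS.items():
    for system, row in rows.items():
        print(test_set, system, positive_share(row))
    print(test_set, "gain (points):", gain(test_set))
```

On the slides' numbers this gives 67.4% versus 75.2% on COCO and 25.4% versus 49.5% on Instagram, so the gain is much larger on the Instagram images (24.1 points) than on COCO (7.8 points).
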