Jiyang Zhang, Tong Gao
February 2020
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Background
[Figure: model diagrams. Labels: mean pooling; Linear + Softmax; object embeddings; attribute; mean pooling; last timestep (language LSTM); word embedding (learned); truncate]
Confidence score for every candidate answer, trained with binary cross-entropy loss
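The answer-scoring step can be illustrated with a minimal sketch: each candidate answer gets an independent sigmoid confidence, and the loss is the mean binary cross-entropy over candidates. The candidate list, logits, and target values below are hypothetical placeholders, not values from the slides, and the helper names are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logits, targets):
    """Mean binary cross-entropy over all candidate answers."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += -(t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps))
    return total / len(logits)


# Hypothetical candidate answers and model outputs (assumptions).
candidates = ["yes", "no", "2"]
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 0.0]  # 1.0 marks the correct answer

scores = [sigmoid(z) for z in logits]
best = candidates[max(range(len(scores)), key=scores.__getitem__)]
loss = bce_loss(logits, targets)
```

Because each candidate is scored independently, more than one answer can receive a high confidence, which a softmax over candidates would not allow.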
SPICE: Semantic Propositional Image Caption Evaluation
dependency parse trees → semantic scene graph
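Once candidate and reference captions have been turned into scene graphs, SPICE scores the candidate with an F-score over the tuples (objects, attributes, relations) of the two graphs. The sketch below is a simplified illustration that assumes the tuples are already extracted and uses exact matching only; the real metric also matches synonyms. The function name and the example tuples are hypothetical.

```python
def tuple_f1(candidate: set, reference: set) -> float:
    """F1 between candidate and reference scene-graph tuples (exact match)."""
    matched = candidate & reference
    if not matched:
        return 0.0
    precision = len(matched) / len(candidate)
    recall = len(matched) / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tuples: 1-tuples are objects, 2-tuples are attributes,
# 3-tuples are relations.
cand = {("girl",), ("girl", "young"), ("girl", "standing-on", "court")}
ref = {
    ("girl",),
    ("court",),
    ("girl", "standing-on", "court"),
    ("court", "tennis"),
}

score = tuple_f1(cand, ref)  # precision 2/3, recall 2/4
```

Scoring tuples rather than n-grams means two captions with different wording but the same objects and relations can still match.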
vectors on VQA model?
Visual Genome - cheating?
row, but some prepositions can appear twice or more
it be harder to generate captions for more complicated images.
image caption generation task, like relevance, expressiveness, concreteness, creativity.
for different levels of accuracy achieved by our system, estimate at what age humans perform as well as the model.