SLIDE 1 Attention for Machine Comprehension
Made by : Rishab Goel
Based on slides by: Alex Graves, Hien Quoc, Renjie Liao
SLIDE 2
Highway Networks
SLIDE 3
SLIDE 4
Benefits ...
SLIDE 5
Benefits ...
SLIDE 6 Importance ...
- For training very deep architectures
- By allowing better information flow
- Better optimization
- Intuition: a linear transformation of the input (or the input itself) may suffice for learning; is language at a higher level of abstraction?
http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
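A minimal sketch of a single highway layer, to make the "better information flow" point concrete. It assumes the standard formulation y = T(x) * H(x) + (1 - T(x)) * x with a tanh transform and a sigmoid gate; the names `highway_layer`, `W_h`, `W_t` and the toy sizes are illustrative, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x."""
    h = np.tanh(x @ W_h + b_h)       # candidate transformation H(x) (tanh is an assumption)
    t = sigmoid(x @ W_t + b_t)       # transform gate T(x), values in (0, 1)
    return t * h + (1.0 - t) * x     # the carry gate is 1 - T(x)

rng = np.random.default_rng(0)
dim = 4
x = rng.normal(size=(2, dim))
W_h = rng.normal(size=(dim, dim))
W_t = rng.normal(size=(dim, dim))
b_h = np.zeros(dim)

# A strongly negative gate bias closes the transform gate, so the layer
# passes its input through almost unchanged.
b_t_closed = np.full(dim, -30.0)
y_closed = highway_layer(x, W_h, b_h, W_t, b_t_closed)
```

With the gate closed the layer behaves like the identity, which is why stacking many such layers does not block the signal (or the gradient) by default.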
SLIDE 8 Idea of Maxout
Hien Quoc Dang
SLIDE 9 Intuitions
- Inspired by dropout
- Similar to bagging, but integrated as part of a single network
Hien Quoc Dang
SLIDE 10 Idea of Maxout ...
Hien Quoc Dang
SLIDE 11 Idea of Maxout ...
Hien Quoc Dang
SLIDE 12 Comparison to Rectifiers
Hien Quoc Dang
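A small sketch of a maxout layer and its relation to rectifiers. It assumes the usual definition (each output is the maximum over k affine pieces); the function names and toy shapes are illustrative. The check at the end builds a ReLU as the special case k = 2 with one piece fixed at zero.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout layer: each output unit is the max over k affine pieces.

    x: (batch, d_in), W: (k, d_in, d_out), b: (k, d_out)
    """
    z = np.einsum("bi,kio->bko", x, W) + b   # (batch, k, d_out)
    return z.max(axis=1)

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
w = rng.normal(size=(3, 2))
c = rng.normal(size=2)

# ReLU as a special case: k = 2 pieces, the second one fixed at zero.
W = np.stack([w, np.zeros_like(w)])    # (2, 3, 2)
b = np.stack([c, np.zeros_like(c)])    # (2, 2)
out = maxout(x, W, b)
relu_out = relu(x @ w + c)
```

Unlike a rectifier, a general maxout unit learns all of its pieces, so the shape of the activation is itself trained.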
SLIDE 13 Why Does Maxout Work?
Hien Quoc Dang
SLIDE 14
SLIDE 15
SLIDE 16 Slides : Santi Pascual
SLIDE 17 LSTMs ...
Chris Olah’s blog
SLIDE 18
Need for Attention
- Embeddings are not sufficient to encode information over long distances
- Helps attend to the important patch of data
- Adds interpretability to the model
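A generic attention sketch illustrating these points. It assumes plain dot-product scoring, which is a simplification chosen for brevity and not necessarily the scoring function used by any model in these slides; `attend`, `states` and `query` are hypothetical names.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attend(query, states):
    """Return attention weights over `states` and their weighted sum.

    query: (d,), states: (T, d)
    """
    scores = states @ query        # one relevance score per position
    weights = softmax(scores)      # non-negative, sums to 1
    context = weights @ states     # (d,) summary focused on relevant positions
    return weights, context

# Toy encodings: positions 0 and 2 resemble the query, position 1 does not.
states = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([4.0, 0.0])
weights, context = attend(query, states)
```

The `weights` vector can be inspected directly, which is where the interpretability claim comes from.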
SLIDE 19
Attentive Reader
SLIDE 20
SLIDE 21
DYNAMIC COATTENTION NETWORKS FOR QUESTION ANSWERING
Authors : Caiming Xiong, Victor Zhong, Richard Socher
SLIDE 22
Introduction
- Machine comprehension
- No knowledge base required
- Until SQuAD, no large-scale, natural dataset
- Cloze-style datasets like CNN/Daily Mail
- Synthetic / small in size
SLIDE 23 About SQuAD
- Consists of questions on a set of Wikipedia articles
- Wh-type questions
- The answer is a segment of text, or span
Source : Rajpurkar et al.
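To illustrate what "the answer is a span" means in practice, here is a small sketch that represents an answer as inclusive (start, end) token indices and scores a prediction by exact match and token-overlap F1. The passage is made up, and the F1 here skips the text normalization that the official SQuAD evaluation applies, so treat it as a simplified approximation.

```python
from collections import Counter

def extract_span(tokens, start, end):
    """Return the answer tokens for an inclusive (start, end) span."""
    return tokens[start:end + 1]

def token_f1(predicted, gold):
    """Token-overlap F1 between two token lists (no text normalization)."""
    common = Counter(predicted) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

passage = "the tower was completed in 1889 in paris".split()  # made-up passage
gold = extract_span(passage, 5, 5)        # ["1889"]
predicted = extract_span(passage, 4, 5)   # ["in", "1889"]
exact_match = predicted == gold
f1 = token_f1(predicted, gold)
```

A prediction that overshoots the gold span by one token fails exact match but still earns partial F1 credit.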
SLIDE 24 Model in nutshell ...
Socher et al
SLIDE 25 Doc and Query Encoder
Socher et al
SLIDE 26 Liked
Socher et al
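A rough sketch of the coattention step that gives the paper its name, written from my reading of the DCN paper and not from the slide images, so the details should be checked against the paper. It assumes document and question encodings `D` and `Q` with words as columns, and it leaves out the sentinel vectors and the final bidirectional LSTM fusion.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def coattention(D, Q):
    """D: (l, m) document encodings, Q: (l, n) question encodings.

    Columns are words. Returns both attention maps and the
    per-document-word coattention context.
    """
    L = D.T @ Q                      # (m, n) affinity of every doc/question word pair
    A_Q = softmax(L, axis=0)         # per question word, weights over document words
    A_D = softmax(L.T, axis=0)       # per document word, weights over question words
    C_Q = D @ A_Q                    # (l, n) document summary for each question word
    C_D = np.vstack([Q, C_Q]) @ A_D  # (2l, m) context for each document word
    return A_Q, A_D, C_D

rng = np.random.default_rng(0)
l, m, n = 4, 6, 3                    # toy sizes
D = rng.normal(size=(l, m))
Q = rng.normal(size=(l, n))
A_Q, A_D, C_D = coattention(D, Q)
```

The second product attends over the question while carrying along the document summaries `C_Q`, which is how attention flows in both directions.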
SLIDE 27 Dynamic Decoder
Liked: all
Socher et al
SLIDE 28 Highway Maxout Network ...
Socher et al
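A sketch of how a Highway Maxout Network could score one document position, based on my reading of the equations in the DCN paper; shapes, parameter names and toy sizes are assumptions for illustration. The paper uses separate networks for start and end scores, while this sketch shows a single scorer with random, untrained weights.

```python
import numpy as np

def hmn_score(u_t, h, u_s, u_e, params):
    """Score document position u_t given decoder state h and the
    encodings u_s, u_e of the current start/end estimates."""
    W_d, W1, b1, W2, b2, W3, b3 = params
    r = np.tanh(W_d @ np.concatenate([h, u_s, u_e]))         # (l,)
    m1 = (W1 @ np.concatenate([u_t, r]) + b1).max(axis=0)    # (l,) first maxout layer
    m2 = (W2 @ m1 + b2).max(axis=0)                          # (l,) second maxout layer
    # Highway connection: m1 skips past the second layer into the output.
    return (W3 @ np.concatenate([m1, m2]) + b3).max(axis=0)[0]

rng = np.random.default_rng(0)
l, p = 3, 4                          # hidden size and maxout pool size (toy values)
params = (
    rng.normal(size=(l, 5 * l)),                               # W_d
    rng.normal(size=(p, l, 3 * l)), rng.normal(size=(p, l)),   # W1, b1
    rng.normal(size=(p, l, l)), rng.normal(size=(p, l)),       # W2, b2
    rng.normal(size=(p, 1, 2 * l)), rng.normal(size=(p, 1)),   # W3, b3
)
U = rng.normal(size=(5, 2 * l))      # encodings of 5 document positions
h = rng.normal(size=l)               # decoder state
scores = np.array([hmn_score(u, h, U[0], U[4], params) for u in U])
best = int(scores.argmax())
```

Every layer takes a max over `p` pieces, and the concatenation `[m1, m2]` is the highway part.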
SLIDE 31 Implementation
- 1. CoreNLP for preprocessing
- 2. GloVe word vectors pretrained on the 840B Common Crawl corpus
- 3. Embeddings of out-of-vocabulary (OOV) words set to 0
- 4. Sentinel vectors randomly initialized and optimized during training
Disliked
- Gagan (pt. 3)
- Akshay (pt. 4): claim not proven
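Point 3 can be sketched as a lookup that falls back to a zero vector. The tiny dictionary below stands in for the pretrained GloVe table, and its 3-dimensional values are made up; `embed` is a hypothetical helper, not part of any library.

```python
import numpy as np

def embed(tokens, vectors, dim):
    """Look up each token; out-of-vocabulary tokens get an all-zero vector."""
    zero = np.zeros(dim)
    return np.stack([vectors.get(tok, zero) for tok in tokens])

# Stand-in for pretrained GloVe vectors (made-up 3-dimensional values).
vectors = {
    "who": np.array([0.1, 0.2, 0.3]),
    "wrote": np.array([0.4, 0.5, 0.6]),
}
E = embed(["who", "wrote", "zyzzyva"], vectors, dim=3)
```

One consequence of this choice is that all OOV words become indistinguishable to the model.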
SLIDE 32 Iterative process visualisation ...
Socher et al
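The iterative idea can be sketched as alternating argmax updates of the start and end pointers until they stop changing. This is a toy under stated assumptions: the scorers are hypothetical hand-written functions standing in for the learned networks, there is no recurrent decoder state, the end update sees the freshly updated start, and `max_iters=4` is just a small cap for the example.

```python
def iterative_span(score_start, score_end, n_positions, max_iters=4):
    """Alternate between re-estimating start and end until they stop changing."""
    start, end = 0, n_positions - 1          # initial guess (an assumption)
    history = [(start, end)]
    for _ in range(max_iters):
        new_start = max(range(n_positions),
                        key=lambda t: score_start(t, start, end))
        new_end = max(range(n_positions),
                      key=lambda t: score_end(t, new_start, end))
        history.append((new_start, new_end))
        if (new_start, new_end) == (start, end):
            break                            # fixed point reached
        start, end = new_start, new_end
    return start, end, history

# Hypothetical toy scorers: the end always prefers position 5, and the
# start prefers the position three steps before the current end estimate.
def score_start(t, start, end):
    return -abs(t - (end - 3))

def score_end(t, start, end):
    return -abs(t - 5)

start, end, history = iterative_span(score_start, score_end, n_positions=8)
```

In this toy run the start estimate moves from 0 to 4 to 2 as the end estimate settles, which mirrors how the pointers can shift between candidate answers across iterations.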
SLIDE 34 Results
Disliked
- Haroun (ensemble gain too much)
Socher et al
SLIDE 35 Liked
Socher et al
SLIDE 36 Performance across different types of questions
Liked
Socher et al
SLIDE 37 Ablation studies ...
Liked
Socher et al
SLIDE 38 Predictions
Socher et al
SLIDE 39 Logistic Regression Prediction : Theatre Museum
Socher et al
SLIDE 40 Comments: Trouble decoding when there are multiple intuitive answers
Socher et al
SLIDE 41
Cons
- Lacks error analysis; needs more ablation studies [Barun, Surag]
- System gives extractive, not abstractive, answers [Nupur]
- Does not compare HMN and MN [all]
- Unintuitive decoder [Dinesh]
SLIDE 42
Doubts ...
- Why did the HMN work?
- Role of sentinel vectors?
- Error propagation through the argmax function
- Maxout for LSTMs as well (not clear)
- Use multiple initialisations of the start and end pointers (how?)
SLIDE 43
Extensions ...
- Use the approach for other datasets like CNN/Daily Mail and MS COCO QA [Barun]
- Use different attention, Match-LSTM [Barun]
- Bi-directional attention [Gagan]
- Apply the iterative idea to visual QA, classification, NER, SRL etc. [Akshay, Surag]
- Find synonyms [Haroun]
SLIDE 44
Extensions ...
Combine char2vec and word2vec embeddings to represent the document and query
SLIDE 45
Thanks!
SLIDE 46