SLIDE 1

(Some) Research Trends in Question Answering (QA)

Manoj Kumar Chinnakotla, Senior Applied Scientist, Artificial Intelligence and Research (AI&R), Microsoft, India
Advanced Topics in AI (Spring 2017), IIT Delhi

SLIDE 2

Lecture Objective and Outline

  • To cover some interesting trends in QA research
    • Especially in applying deep learning techniques
    • With minimal manual intervention and feature design
  • QA from various sources
    • Unstructured Text – Web, free text, entities
    • Knowledge Bases – Freebase, REVERB, DBpedia
    • Community Forums – Stack Overflow, Yahoo! Answers, etc.
  • Will cover three recent papers, one from each of the above streams

SLIDE 3

A Neural Network for Factoid QA over Paragraphs Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher (EMNLP 2014)

SLIDE 4

Factoid QA

  • Given a description of an entity, identify the
    • Person
    • Place
    • Thing
  • Quiz Bowl
    • Quiz competition for high-school and college students
    • Players must guess entities from a question given as raw text
    • Each question has 4-6 sentences
    • Each sentence contains unique clues about the target entity
    • Players can answer at any time – the earlier the answer, the more points scored
    • Questions follow “pyramidality” – earlier clues are harder

(Example answer on slide: Holy Roman Empire)

SLIDE 5

How do they solve it?

  • Dependency Tree RNN (DT-RNN)
  • Trans-sentential averaging
  • The resulting model is named QANTA: Question Answering Neural Network with Trans-sentential Averaging
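A DT-RNN builds a sentence representation bottom-up over its dependency parse: each node combines its own word vector with its children's hidden vectors through relation-specific matrices, followed by a nonlinearity. A minimal numpy sketch of that composition (the dimensions, initialization, and the toy two-word tree are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def dtrnn_node(word_vec, children, params):
    """Compose the hidden vector for one dependency-tree node.

    children: list of (relation, child_hidden_vector) pairs.
    params: 'Wv' (d x d), bias 'b' (d,), and one d x d matrix per
    dependency relation under 'Wr'.
    """
    total = params["Wv"] @ word_vec + params["b"]
    for rel, child_h in children:
        total = total + params["Wr"][rel] @ child_h  # relation-specific transform
    return np.tanh(total)

# Toy composition of "tanks rolled" (edge: rolled --nsubj--> tanks).
d = 4
rng = np.random.default_rng(0)
params = {
    "Wv": rng.normal(0, 0.1, (d, d)),
    "b": np.zeros(d),
    "Wr": {"nsubj": rng.normal(0, 0.1, (d, d))},
}
x_tanks = rng.normal(0, 0.1, d)
x_rolled = rng.normal(0, 0.1, d)
h_tanks = dtrnn_node(x_tanks, [], params)                      # leaf node
h_rolled = dtrnn_node(x_rolled, [("nsubj", h_tanks)], params)  # root node
```

The root's hidden vector `h_rolled` then serves as a sentence representation; averaging such vectors across a question's sentences gives the "trans-sentential averaging" part of QANTA.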
SLIDE 6

Example Model Construction

SLIDE 7

Training

SLIDE 8

Dataset

  • 100,000 Quiz Bowl question/answer pairs from 1998-2013
  • From these, only History and Literature questions are used (~40%)
    • History – 21,041
    • Literature – 22,956
  • Answer-based filtering
    • 65.6% of answers occur only once or twice and are eliminated
    • All answers that do not occur at least six times are eliminated
  • Final number of unique answers
    • History: 451
    • Literature: 595
  • Total dataset (question/answer pairs)
    • History: 4,460
    • Literature: 5,685
SLIDE 9

Experimental Setup

             Train               Test
History      Questions: 3,761    Questions: 699
             Sentences: 14,217   Sentences: 2,768
Literature   Questions: 4,777    Questions: 908
             Sentences: 17,972   Sentences: 3,577

Baselines

  • Bag of Words (BOW)
  • Bag of Words with dependency-tree-based features (BOW-DT)
  • IR-QB: maps questions to answers using Whoosh – a state-of-the-art IR engine
  • IR-Wiki: same as IR-QB, but answers are augmented with the text of their Wikipedia pages
SLIDE 10

Results

SLIDE 11

Positive Examples

SLIDE 12

Negative Examples

SLIDE 13

Nice Things about the Paper

  • Their use of dependency trees instead of a bag of words or a mere sequence of words was nice. They tried to use sophisticated linguistic information in their model.

  • Joint learning of answer and question representations in the same vector space

"most answers are themselves words (features) in other questions (e.g., a question on World War II might mention the Battle of the Bulge and vice versa). Thus, word vectors associated with such answers can be trained in the same vector space as question text enabling us to model relationships between answers instead of assuming incorrectly that all answers are independent."

  • Nice Error Analysis in Sections 5.2 and 5.3
  • Tried some obvious extensions and provided reasoning for not working

"We tried transforming Wikipedia sentences into quiz bowl sentences by replacing answer mentions with appropriate descriptors (e.g., 'Joseph Heller' with 'this author'), but the resulting sentences suffered from a variety of grammatical issues and did not help the final result."

SLIDE 14

Shortcomings

  • RNNs need multiple redundant training examples to learn powerful meaning representations. In the real world, how will we get those?
  • In some sense, they had lots of data about each question, which could be used to learn such rich representations. This is an unrealistic assumption.
  • They whittled down their training and test data to a small set of QA pairs that *fit* their needs (no messy data)

"451 history answers and 595 literature answers that occur on average twelve times in the corpus"

SLIDE 15

Open Question Answering with Weakly Supervised Embedding Models Bordes A, Weston J and Usunier N (ECML PKDD 2014)

SLIDE 16

Task Definition

A scoring function S(q, t) is computed over all KB triples t; the top-ranked triple gives the answer to question q

SLIDE 17

Training Data

  • REVERB
    • 14M triples
    • 2M entities
    • 600K relationships
    • Extracted from the ClueWeb09 corpus
  • Contains lots of noise
    • For instance, the same real-world entity can appear as three different entity strings (examples shown on slide)
SLIDE 18

Automatic Training Data Generation

  • Generate question-triple pairs automatically using the following 16 rules
  • For each triple, generate these 16 questions
    • Some exceptions: only the *-in.r relation can generate the “where did e r?” pattern question
  • Approximate size of question-triple pairs: 16 × 14M
  • Syntactic structure is simple
  • Some of them may not be semantically valid
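The generation step above amounts to instantiating question templates for every triple (e, r, m). A small sketch of the idea; the templates below are illustrative stand-ins, not the paper's actual 16 rules:

```python
# Illustrative rule-based question generation from KB triples.
# The templates below are stand-ins: the paper uses 16 hand-written
# rules, which are not reproduced here.
TEMPLATES = [
    lambda e, r, m: f"what {r} {e} ?",     # asks for the object m
    lambda e, r, m: f"what does {e} {r} ?",
    lambda e, r, m: f"who {r} {m} ?",      # asks for the subject e
]

def generate_pairs(triples):
    """Yield (question, triple) weakly supervised training pairs."""
    for e, r, m in triples:
        for template in TEMPLATES:
            yield template(e, r, m), (e, r, m)

pairs = list(generate_pairs([("neil-armstrong", "walked-on", "the-moon")]))
# one (question, triple) pair per template per triple
```

As the slide notes, such template output is syntactically simple and not always semantically valid; the embedding model is expected to be robust to that noise.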
SLIDE 19

Essence of the Technique

  • Project both the question (bag of words or n-grams) and the KB triple into a shared embedding space
  • Compute the similarity score in the shared embedding space and rank triples based on it
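Concretely, the question embedding is a sum of word-embedding columns of a matrix W, the triple embedding a sum of entity/relationship columns of a matrix V, and the score their dot product. A sketch with random untrained embeddings (so the ranking here is not meaningful; dimensions and vocabularies are toy assumptions):

```python
import numpy as np

def embed_question(tokens, W, vocab):
    """f(q): bag-of-words embedding – sum of the word columns of W."""
    return sum(W[:, vocab[t]] for t in tokens)

def embed_triple(triple, V, kb_index):
    """g(t): sum of the entity/relationship columns of V."""
    return sum(V[:, kb_index[x]] for x in triple)

def score(tokens, triple, W, V, vocab, kb_index):
    """S(q, t): dot product of the two embeddings in the shared space."""
    return float(embed_question(tokens, W, vocab) @ embed_triple(triple, V, kb_index))

# Toy usage: rank two candidate triples against a question.
rng = np.random.default_rng(1)
vocab = {"who": 0, "wrote": 1, "hamlet": 2}
kb = {"shakespeare": 0, "authored": 1, "hamlet": 2,
      "paris": 3, "capital-of": 4, "france": 5}
k = 8  # embedding dimension (illustrative)
W = rng.normal(size=(k, len(vocab)))
V = rng.normal(size=(k, len(kb)))
q = ["who", "wrote", "hamlet"]
candidates = [("shakespeare", "authored", "hamlet"),
              ("paris", "capital-of", "france")]
ranked = sorted(candidates, key=lambda t: score(q, t, W, V, vocab, kb), reverse=True)
```

Training (next slides) adjusts W and V so that correct triples score above corrupted ones.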

SLIDE 20

Training Data

  • Positive training samples
    • All questions generated from triples using the rules
  • Negative samples
    • For each positive pair (q, t), randomly sample a triple from the KB
    • Replace each field of the original triple t with the corresponding field of the sampled triple with probability 0.66
    • This way, it is possible to create negative samples which are close enough to the original triples
  • Pose it as a ranking problem
    • Ranking loss: margin-based – the true triple must outscore the corrupted one by a margin
    • Constraints: the embedding vectors are kept normalized
SLIDE 21

Training Algorithm

  • Stochastic Gradient Descent (SGD)
  • 1. Randomly initialize W and V (mean 0, std. dev. 1/√l)
  • 2. Sample a positive training pair (qᵢ, tᵢ) from D
  • 3. Create a corrupted triple tᵢ′ such that tᵢ′ ≠ tᵢ
  • 4. Compute the SGD update from the ranking loss
  • 5. Enforce that W and V are normalized at each step
  • Learning rate adapted using AdaGrad
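The steps above can be sketched as a plain-SGD loop over the margin ranking loss (the margin, learning rate, and corruption probability below are assumed values; AdaGrad adaptation is omitted for brevity):

```python
import numpy as np

def sgd_step(q_idx, t_idx, W, V, rng, lr=0.01, margin=0.1):
    """One SGD step on the margin ranking loss (sketch).

    q_idx: column indices in W of the question's words; t_idx: column
    indices in V of the true triple's three fields. Margin, learning
    rate, and corruption probability are assumptions, not the paper's.
    """
    fq = W[:, q_idx].sum(axis=1)        # question embedding f(q)
    gt = V[:, t_idx].sum(axis=1)        # positive triple embedding g(t)
    # Corrupt the triple: swap each field for a random KB element.
    t_neg = [i if rng.random() > 0.66 else int(rng.integers(V.shape[1]))
             for i in t_idx]
    gn = V[:, t_neg].sum(axis=1)        # negative triple embedding g(t')
    loss = max(0.0, margin - fq @ gt + fq @ gn)
    if loss > 0:                        # subgradient update
        W[:, q_idx] += lr * (gt - gn)[:, None]
        V[:, t_idx] += lr * fq[:, None]
        V[:, t_neg] -= lr * fq[:, None]
    # Project columns back into the unit ball (the normalization step).
    for M in (W, V):
        M /= np.maximum(np.linalg.norm(M, axis=0), 1.0)
    return loss

rng = np.random.default_rng(3)
W = rng.normal(0, 0.1, (8, 20))   # word embeddings
V = rng.normal(0, 0.1, (8, 30))   # entity/relationship embeddings
losses = [sgd_step([0, 4, 7], [1, 2, 3], W, V, rng) for _ in range(50)]
```

The projection step is what the "enforce W and V are normalized" bullet refers to: it keeps embedding norms bounded so the ranking loss cannot be trivially minimized by growing the vectors.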
SLIDE 22

Capturing Question Variations using WikiAnswers

  • 2.5M distinct questions
  • 18M distinct paraphrase pairs
SLIDE 23

Multitask Learning with Paraphrases

(Architecture diagram: the question q and triple tᵢ, and the corrupted triple tᵢ′, are embedded via W and V and scored with a ranking loss; paraphrase pairs (q, qₚ) are scored with a second ranking loss.)

  • 800K vocabulary
  • 3.5M entities (2 embeddings per entity)
  • 250M examples (rules + paraphrases)
  • 64-dimensional embeddings throughout
SLIDE 24

Fine-Tuning of the Similarity Function

  • The earlier model usually doesn’t converge to a global optimum, and many correct answers are still not ranked at the top
  • For further improvement, introduce one more weighting matrix M into the query-triple similarity metric
  • After fixing W and V, M can be learnt using the same ranking loss
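The fine-tuned metric replaces the plain dot product with a bilinear form between the frozen embeddings. A minimal sketch (the symbol names are assumptions based on the slide's W, V, M):

```python
import numpy as np

def score_fine_tuned(fq, gt, M):
    """Fine-tuned similarity S_ft(q, t) = f(q)^T M g(t), with the
    embeddings fq, gt frozen and only the weighting matrix M learnt
    (via the same ranking loss)."""
    return float(fq @ M @ gt)

# With M = I the original dot-product similarity is recovered.
k = 4
fq = np.ones(k)
gt = np.arange(k, dtype=float)
s = score_fine_tuned(fq, gt, np.eye(k))
```

Since only M is trained while W and V stay fixed, this step is a cheap reweighting of the already-learnt embedding dimensions.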
SLIDE 25

Evaluation Results

  • Test set
    • 37 questions from WikiAnswers that have at least one answer in REVERB
    • Additional paraphrase questions – 691 questions
  • Run various versions of PARALEX to get candidate triples and have them human-judged
    • Total: 48K triples labeled
SLIDE 26

Candidate Generation using String Matching

  • Construct a set of candidate strings
    • POS-tag the question
    • Collect the following candidate strings
      • All noun phrases that appear fewer than 1,000 times in REVERB
      • All proper nouns, if any
    • Augment these strings with their singular forms, removing any trailing “s”
  • Only consider triples that contain at least one of the above candidate strings
  • Significantly reduces the number of candidates processed – 10K (on avg.) vs. 14M
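The filtering logic can be sketched as follows; a crude stopword check stands in for real POS tagging and noun-phrase chunking, and the frequency counts are hypothetical:

```python
def candidate_strings(tokens, stopwords, reverb_counts, max_freq=1000):
    """Crude stand-in for the paper's POS-based extraction: keep
    non-stopword tokens that are rare enough in REVERB, plus their
    singular forms (trailing "s" removed)."""
    cands = set()
    for tok in tokens:
        if tok in stopwords or reverb_counts.get(tok, 0) >= max_freq:
            continue
        cands.add(tok)
        if tok.endswith("s"):
            cands.add(tok[:-1])  # singular form
    return cands

def filter_triples(triples, cands):
    """Keep only triples mentioning at least one candidate string."""
    return [t for t in triples if any(field in cands for field in t)]

stop = {"what", "are", "of"}
counts = {"cats": 40, "afraid": 2000}  # hypothetical REVERB frequencies
cands = candidate_strings(["what", "are", "cats", "afraid", "of"], stop, counts)
kept = filter_triples(
    [("cats", "afraid-of", "dogs"), ("paris", "capital-of", "france")],
    cands,
)
```

Only triples sharing a rare surface string with the question survive, which is what cuts the candidate set from 14M to ~10K.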
SLIDE 27

Sample Embeddings

SLIDE 28

Nice things about the paper

  • Doesn’t use any manual annotation or prior information from any other dataset
  • Works on a highly noisy Knowledge Base (KB) – REVERB
  • Doesn’t assume the existence of a “word alignment” tool for learning term equivalences
    • In fact, it automatically learns these term equivalences
  • Use of multi-task learning to combine learnings from paraphrases as well as Q, A pairs was interesting
  • Achieves impressive results with minimal assumptions about training or other resources
SLIDE 29

Shortcomings

  • Many ad-hoc decisions across the paper
  • Bag of words for modeling a question is very crude
    • “what are cats afraid of?” vs. “what are afraid of cats?”
    • Perhaps a DT-RNN or a semantic-role-based RNN would be much better
  • String-matching-based candidate generation was a major contributor to the accuracy boost, yet it was introduced towards the end as a rescue step!
  • The fine-tuning step was introduced to improve the poor performance of the initial model
  • Overall, more value seems to come from such hacks and tricks than from the original idea itself!
SLIDE 30

Potential Extensions

  • Try out DT-RNN and variants for modeling the question
  • Similarly, try word2vec/GloVe embeddings for the entity and object, and model the predicate using a simple RNN
  • Is it possible to model this problem using the sequence-to-sequence framework?
    • Question -> Entity, Predicate, Object
SLIDE 31

Hand in Glove: Deep Feature Fusion Network Architectures for Answer Quality Prediction Sai Praneeth, Goutham, Manoj Chinnakotla, Manish Shrivastava (SIGIR 2016, COLING 2016)

SLIDE 32
Answer Quality Prediction in Community QA Forums

  • Given a question Q and its set of community answers, rate the answers according to their quality
  • Task announced in SemEval 2015 and SemEval 2016
  • All previous best-performing systems were based on either a) purely hand-crafted feature engineering or b) only deep learning

Question: Can I obtain Driving License my QID is written Employee

Description: Can I obtain Driving License my QID is written Employee, I saw list in gulf times but there isn't mentioned EMPLOYEE QID Profession.

Answer 1: the word employee is a general term that refers to all the staff in your company either the manager, secretary up to the lowest position or whatever positions they have. you are all considered employees of your company. (Potential)

Answer 2: your QID should specify what is the actual profession you have. I think for me, your chances to have a drivers license is low. (Good)

Answer 3: dear Richard, his asking if he can obtain. means he have the drivers license. (Bad)

Answer 4: Slim chance...... (Good)
SLIDE 33
Deep Feature Fusion Networks (DFFN)

  • Central idea
    • Exploit external resources available in the form of click-through data, Wikipedia text, etc. for improving the matching
    • Combine both hand-crafted and deep-learning features using various architectures
  • Two solutions
    • Learn deep features using a CNN, then combine with hand-crafted features using a fully connected network (DFFN-CNN)
    • Learn deep features using a bi-directional LSTM, then combine with hand-crafted features using a fully connected network (DFFN-BLNA)
  • Hand-crafted features focus on learning similarities with the help of various resources
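The fusion step itself is simple: concatenate the deep (CNN or BLSTM) feature vector with the hand-crafted feature vector and pass the result through a fully connected network that predicts the quality label. A sketch of that idea; the dimensions, single hidden layer, ReLU, and softmax head are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def dffn_score(deep_feats, hcf_feats, params):
    """Fuse deep and hand-crafted features with a small fully
    connected network (illustrative sizes and activations)."""
    x = np.concatenate([deep_feats, hcf_feats])           # feature fusion
    h = np.maximum(0.0, params["W1"] @ x + params["b1"])  # hidden ReLU layer
    logits = params["W2"] @ h + params["b2"]              # quality-class scores
    e = np.exp(logits - logits.max())
    return e / e.sum()                                    # softmax over labels

rng = np.random.default_rng(2)
d_deep, d_hcf, d_hid, n_cls = 16, 8, 12, 3  # e.g. Good / Potential / Bad
params = {
    "W1": rng.normal(0, 0.1, (d_hid, d_deep + d_hcf)), "b1": np.zeros(d_hid),
    "W2": rng.normal(0, 0.1, (n_cls, d_hid)), "b2": np.zeros(n_cls),
}
probs = dffn_score(rng.normal(size=d_deep), rng.normal(size=d_hcf), params)
```

The point of the concatenation is that gradients from the fully connected layers reach both feature families, so the network learns how much to trust each source per example.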

SLIDE 34
Hand-Crafted Features (HCF)

  • Wikipedia-based features
    • TagMe similarity
    • NE similarity
    • Anchor text
    • Google Cross-Wiki dictionary
  • Clickthrough data
    • DSSM similarity
    • Paragraph2Vec similarity
  • Metadata features
    • Author reputation score
    • Was the answer written by the asker himself?
    • Position of the answer
    • Presence of URLs
    • Presence of special characters – question marks, smileys, exclamations
SLIDE 35

Deep Feature Fusion Networks (DFFN) (Contd.)
SLIDE 36
Deep Feature Fusion Networks (DFFN) (Contd.)

  • Both models achieve state-of-the-art performance on the SemEval 2015 and SemEval 2016 tasks

SemEval 2015                          SemEval 2016
Model          F1      Accu.         Model          MAP     F1      Accu.
DFFN-BLNA      62.01*  75.20*        DFFN-BLNA      83.91*  66.70*  77.65*
DFFN-CNN       60.86   74.54         DFFN           81.77   65.75   76.42
JAIST          57.29   72.67         Kelp           79.19   64.36   75.11
HITSZ-ICRC     56.44   69.43         ConvKN         78.71   63.55   74.95
DFFN w/o HCF   56.04   69.73         DFFN w/o HCF   74.36   60.22   72.88
DFFN w/o CNN   51.65   67.12         DFFN w/o CNN   70.21   56.77   68.65
SLIDE 37

Qualitative Examples

SLIDE 38

Questions / Discussion

Thanks!