

SLIDE 1

Question Classification

Ling573 NLP Systems and Applications April 22, 2014

SLIDE 2

Roadmap

— Question classification variations:

— Classification with diverse features — SVM classifiers — Sequence classifiers

SLIDE 3

Question Classification: Li & Roth

SLIDE 4

Why Question Classification?

SLIDE 5

Why Question Classification?

— Question classification categorizes possible answers

SLIDE 6

Why Question Classification?

— Question classification categorizes possible answers

— Constrains answer types to help find and verify the answer. Q: What Canadian city has the largest population?

— Type?

SLIDE 7

Why Question Classification?

— Question classification categorizes possible answers

— Constrains answer types to help find and verify the answer. Q: What Canadian city has the largest population?

— Type? → City — Can ignore all non-city NPs

SLIDE 8

Why Question Classification?

— Question classification categorizes possible answers

— Constrains answer types to help find and verify the answer. Q: What Canadian city has the largest population?

— Type? → City — Can ignore all non-city NPs

— Provides information for type-specific answer selection

— Q: What is a prism? — Type? →

SLIDE 9

Why Question Classification?

— Question classification categorizes possible answers

— Constrains answer types to help find and verify the answer. Q: What Canadian city has the largest population?

— Type? → City — Can ignore all non-city NPs

— Provides information for type-specific answer selection

— Q: What is a prism? — Type? → Definition

— Answer patterns include: ‘A prism is…’

SLIDE 10

Challenges

SLIDE 11

Challenges

— Variability:

— What tourist attractions are there in Reims? — What are the names of the tourist attractions in Reims?

— What is worth seeing in Reims?

— Type?

SLIDE 12

Challenges

— Variability:

— What tourist attractions are there in Reims? — What are the names of the tourist attractions in Reims?

— What is worth seeing in Reims?

— Type? → Location

SLIDE 13

Challenges

— Variability:

— What tourist attractions are there in Reims? — What are the names of the tourist attractions in Reims?

— What is worth seeing in Reims?

— Type? → Location

— Manual rules?

SLIDE 14

Challenges

— Variability:

— What tourist attractions are there in Reims? — What are the names of the tourist attractions in Reims?

— What is worth seeing in Reims?

— Type? → Location

— Manual rules?

— Nearly impossible to create sufficient patterns

— Solution?

SLIDE 15

Challenges

— Variability:

— What tourist attractions are there in Reims? — What are the names of the tourist attractions in Reims?

— What is worth seeing in Reims?

— Type? → Location

— Manual rules?

— Nearly impossible to create sufficient patterns

— Solution?

— Machine learning – rich feature set

SLIDE 16

Approach

— Employ machine learning to categorize by answer type

— Hierarchical classifier on semantic hierarchy of types

— Coarse vs fine-grained

— Up to 50 classes

— Differs from text categorization?

SLIDE 17

Approach

— Employ machine learning to categorize by answer type

— Hierarchical classifier on semantic hierarchy of types

— Coarse vs fine-grained

— Up to 50 classes

— Differs from text categorization?

— Shorter (much!) — Less information, but — Deep analysis more tractable

SLIDE 18

Approach

— Exploit syntactic and semantic information

— Diverse semantic resources

SLIDE 19

Approach

— Exploit syntactic and semantic information

— Diverse semantic resources

— Named Entity categories — WordNet senses — Manually constructed word lists — Automatically extracted semantically similar word lists

SLIDE 20

Approach

— Exploit syntactic and semantic information

— Diverse semantic resources

— Named Entity categories — WordNet senses — Manually constructed word lists — Automatically extracted semantically similar word lists

— Results:

— Coarse: 92.5%; Fine: 89.3% — Semantic features reduce error by 28%

SLIDE 21

Question Hierarchy

SLIDE 22

Learning a Hierarchical Question Classifier

— Many manual approaches use only:

SLIDE 23

Learning a Hierarchical Question Classifier

— Many manual approaches use only:

— Small set of entity types, set of handcrafted rules

SLIDE 24

Learning a Hierarchical Question Classifier

— Many manual approaches use only:

— Small set of entity types, set of handcrafted rules

— Note: Webclopedia’s 96-node taxonomy w/ 276 manual rules

SLIDE 25

Learning a Hierarchical Question Classifier

— Many manual approaches use only:

— Small set of entity types, set of handcrafted rules

— Note: Webclopedia’s 96-node taxonomy w/ 276 manual rules

— Learning approaches can learn to generalize

— Train on new taxonomy, but

SLIDE 26

Learning a Hierarchical Question Classifier

— Many manual approaches use only:

— Small set of entity types, set of handcrafted rules

— Note: Webclopedia’s 96-node taxonomy w/ 276 manual rules

— Learning approaches can learn to generalize

— Train on new taxonomy, but

— Someone still has to label the data…

— Two-step learning (Winnow):

— Same features in both cases

SLIDE 27

Learning a Hierarchical Question Classifier

— Many manual approaches use only:

— Small set of entity types, set of handcrafted rules

— Note: Webclopedia’s 96-node taxonomy w/ 276 manual rules

— Learning approaches can learn to generalize

— Train on new taxonomy, but

— Someone still has to label the data…

— Two-step learning (Winnow):

— Same features in both cases

— First classifier produces (a set of) coarse labels — Second classifier selects from fine-grained children of coarse tags generated by the previous stage

— Select highest density classes above threshold
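A minimal sketch of this two-step coarse→fine scheme, using scikit-learn's LogisticRegression as a stand-in for the Winnow-based learners named on the slide; the hierarchy, training questions, and threshold value below are invented for illustration.

```python
# Minimal sketch of the coarse->fine cascade (not Li & Roth's actual Winnow code).
# HIERARCHY, the toy training questions, and the 0.2 threshold are assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

HIERARCHY = {
    "LOC":  ["LOC:city", "LOC:country"],
    "HUM":  ["HUM:individual", "HUM:group"],
    "DESC": ["DESC:definition", "DESC:reason"],
}
train_q = ["What Canadian city has the largest population ?",
           "Who wrote Hamlet ?",
           "What is a prism ?",
           "What country borders France ?"]
coarse_y = ["LOC", "HUM", "DESC", "LOC"]
fine_y = ["LOC:city", "HUM:individual", "DESC:definition", "LOC:country"]

vec = CountVectorizer(ngram_range=(1, 2), binary=True)   # sparse, binary ngram features
X = vec.fit_transform(train_q)
coarse_clf = LogisticRegression(max_iter=1000).fit(X, coarse_y)
fine_clf = LogisticRegression(max_iter=1000).fit(X, fine_y)

def classify(question, threshold=0.2):
    x = vec.transform([question])
    # Step 1: keep every coarse class whose probability clears the threshold.
    coarse_p = dict(zip(coarse_clf.classes_, coarse_clf.predict_proba(x)[0]))
    kept = [c for c, p in coarse_p.items() if p >= threshold] or \
           [max(coarse_p, key=coarse_p.get)]
    # Step 2: choose only among fine classes that are children of surviving coarse tags.
    allowed = {f for c in kept for f in HIERARCHY[c]}
    fine_p = dict(zip(fine_clf.classes_, fine_clf.predict_proba(x)[0]))
    return max((f for f in fine_p if f in allowed), key=fine_p.get)

print(classify("What European city hosted the 1992 Olympics ?"))
```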

SLIDE 28

Features for Question Classification

— Primitive lexical, syntactic, lexical-semantic features

— Automatically derived — Combined into conjunctive, relational features — Sparse, binary representation
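A rough illustration of how primitive features might be conjoined into such a sparse, binary representation (a sketch, not Li & Roth's actual feature inventory; the feature names and hand-supplied POS tags are assumptions):

```python
# Sketch: primitive lexical/syntactic features plus simple conjunctive and
# relational combinations, vectorized into a sparse binary matrix.
from sklearn.feature_extraction import DictVectorizer

def features(words, pos_tags):
    f = {}
    for w in words:
        f[f"word={w.lower()}"] = 1                  # primitive lexical feature
    for w, t in zip(words, pos_tags):
        f[f"pos={t}"] = 1                           # primitive syntactic feature
        f[f"word+pos={w.lower()}+{t}"] = 1          # conjunctive feature
    for w1, w2 in zip(words, words[1:]):
        f[f"bigram={w1.lower()}_{w2.lower()}"] = 1  # relational (adjacent-pair) feature
    return f

q = "What Canadian city has the largest population ?".split()
tags = "WP JJ NN VBZ DT JJS NN .".split()           # hand-supplied for the example
vec = DictVectorizer()
X = vec.fit_transform([features(q, tags)])          # sparse, binary matrix
print(X.shape, X.nnz)
```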

SLIDE 29

Features for Question Classification

— Primitive lexical, syntactic, lexical-semantic features

— Automatically derived — Combined into conjunctive, relational features — Sparse, binary representation

— Words

— Combined into ngrams

SLIDE 30

Features for Question Classification

— Primitive lexical, syntactic, lexical-semantic features

— Automatically derived — Combined into conjunctive, relational features — Sparse, binary representation

— Words

— Combined into ngrams

— Syntactic features:

— Part-of-speech tags — Chunks — Head chunks: first N and V chunks after the Q-word

SLIDE 31

Syntactic Feature Example

— Q: Who was the first woman killed in the Vietnam War?

SLIDE 32

Syntactic Feature Example

— Q: Who was the first woman killed in the Vietnam War? — POS: [Who WP] [was VBD] [the DT] [first JJ] [woman NN] [killed VBN] [in IN] [the DT] [Vietnam NNP] [War NNP] [? .]

SLIDE 33

Syntactic Feature Example

— Q: Who was the first woman killed in the Vietnam War? — POS: [Who WP] [was VBD] [the DT] [first JJ] [woman NN] [killed VBN] [in IN] [the DT] [Vietnam NNP] [War NNP] [? .]

— Chunking: [NP Who] [VP was] [NP the first woman] [VP killed] [PP in] [NP the Vietnam War] ?

SLIDE 34

Syntactic Feature Example

— Q: Who was the first woman killed in the Vietnam War? — POS: [Who WP] [was VBD] [the DT] [first JJ] [woman NN] [killed VBN] [in IN] [the DT] [Vietnam NNP] [War NNP] [? .]

— Chunking: [NP Who] [VP was] [NP the first woman] [VP killed] [PP in] [NP the Vietnam War] ?

— Head noun chunk: ‘the first woman’
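The same analysis can be approximated with NLTK (assuming nltk plus its 'punkt' and 'averaged_perceptron_tagger' data are installed); the chunk grammar and the head-chunk rule below are simplified stand-ins for whatever chunker Li & Roth used.

```python
# Sketch: POS tags, chunks, and head chunks for the example question.
import nltk

q = "Who was the first woman killed in the Vietnam War ?"
tagged = nltk.pos_tag(nltk.word_tokenize(q))        # POS features
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}    # simple noun-phrase chunks
  VP: {<VB.*>+}              # simple verb-phrase chunks
"""
tree = nltk.RegexpParser(grammar).parse(tagged)     # chunk features

# Head chunks: the first NP and first VP after the question word.
nps = [st for st in tree.subtrees() if st.label() == "NP"]
vps = [st for st in tree.subtrees() if st.label() == "VP"]
print(tagged)
print("head NP:", " ".join(w for w, _ in nps[0].leaves()) if nps else None)
print("head VP:", " ".join(w for w, _ in vps[0].leaves()) if vps else None)
```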

SLIDE 35

Semantic Features

— Treat analogously to syntax?

SLIDE 36

Semantic Features

— Treat analogously to syntax?

— Q1: What’s the semantic equivalent of POS tagging?

SLIDE 37

Semantic Features

— Treat analogously to syntax?

— Q1: What’s the semantic equivalent of POS tagging? — Q2: POS tagging > 97% accurate;

— Semantics? Semantic ambiguity?

SLIDE 38

Semantic Features

— Treat analogously to syntax?

— Q1: What’s the semantic equivalent of POS tagging? — Q2: POS tagging > 97% accurate;

— Semantics? Semantic ambiguity?

— A1: Explore different lexical semantic info sources

— Differ in granularity, difficulty, and accuracy

SLIDE 39

Semantic Features

— Treat analogously to syntax?

— Q1: What’s the semantic equivalent of POS tagging? — Q2: POS tagging > 97% accurate;

— Semantics? Semantic ambiguity?

— A1: Explore different lexical semantic info sources

— Differ in granularity, difficulty, and accuracy

— Named Entities — WordNet Senses — Manual word lists — Distributional sense clusters

SLIDE 40

Tagging & Ambiguity

— Augment each word with semantic category — What about ambiguity?

— E.g. ‘water’ as ‘liquid’ or ‘body of water’

SLIDE 41

Tagging & Ambiguity

— Augment each word with semantic category — What about ambiguity?

— E.g. ‘water’ as ‘liquid’ or ‘body of water’ — Don’t disambiguate

— Keep all alternatives — Let the learning algorithm sort it out — Why?
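One way this "keep all alternatives" policy can be realized with WordNet (a sketch assuming NLTK's WordNet data is installed; Li & Roth's exact sense features may differ): every noun sense of a word, plus its direct hypernyms and hyponyms, becomes a feature, and the learner decides which ones matter.

```python
# Sketch: undisambiguated WordNet sense features for a single word.
from nltk.corpus import wordnet as wn

def wordnet_features(word):
    feats = set()
    for synset in wn.synsets(word, pos=wn.NOUN):    # all noun senses, no disambiguation
        feats.add(f"sense={synset.name()}")
        for hyper in synset.hypernyms():
            feats.add(f"hypernym={hyper.name()}")
        for hypo in synset.hyponyms():
            feats.add(f"hyponym={hypo.name()}")
    return feats

# 'water' contributes both its 'liquid' and 'body of water' senses;
# the classifier can learn which of them predict which answer types.
print(sorted(wordnet_features("water"))[:10])
```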

SLIDE 42

Semantic Categories

— Named Entities

— Expanded class set: 34 categories

— E.g. Profession, event, holiday, plant,…

SLIDE 43

Semantic Categories

— Named Entities

— Expanded class set: 34 categories

— E.g. Profession, event, holiday, plant,…

— WordNet: IS-A hierarchy of senses

— All senses of word + direct hyper/hyponyms

SLIDE 44

Semantic Categories

— Named Entities

— Expanded class set: 34 categories

— E.g. Profession, event, holiday, plant,…

— WordNet: IS-A hierarchy of senses

— All senses of word + direct hyper/hyponyms

— Class-specific words

— Manually derived from 5500 questions

— E.g. Class: Food

— {alcoholic, apple, beer, berry, breakfast, brew, butter, candy, cereal, champagne, cook, delicious, eat, fat, …}

— Class is semantic tag for word in the list
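A sketch of how such class-specific lists can be turned into semantic tags (the lists below are tiny illustrative fragments, not the actual lists derived from the 5,500 questions):

```python
# Sketch: map words to the classes of the manually built lists they appear on.
CLASS_LISTS = {                      # illustrative fragments only
    "Food":     {"alcoholic", "apple", "beer", "berry", "breakfast", "butter"},
    "Mountain": {"peak", "summit", "everest", "alps"},
}

WORD2CLASSES = {}
for cls, words in CLASS_LISTS.items():
    for w in words:
        WORD2CLASSES.setdefault(w, set()).add(cls)

def class_list_features(tokens):
    # A word appearing on several lists keeps all of its class tags.
    return {f"class={c}" for t in tokens for c in WORD2CLASSES.get(t.lower(), ())}

print(class_list_features("Name a beer brewed in Belgium".split()))   # -> {'class=Food'}
```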

SLIDE 45

Semantic Types

— Distributional clusters:

— Based on Pantel and Lin — Cluster based on similarity in dependency relations — Word lists for 20K English words

SLIDE 46

Semantic Types

— Distributional clusters:

— Based on Pantel and Lin — Cluster based on similarity in dependency relations — Word lists for 20K English words

— Lists correspond to word senses — Water:

— Sense 1: {oil, gas, fuel, food, milk, liquid} — Sense 2: {air, moisture, soil, heat, area, rain} — Sense 3: {waste, sewage, pollution, runoff}

SLIDE 47

Semantic Types

— Distributional clusters:

— Based on Pantel and Lin — Cluster based on similarity in dependency relations — Word lists for 20K English words

— Lists correspond to word senses — Water:

— Sense 1: {oil, gas, fuel, food, milk, liquid} — Sense 2: {air, moisture, soil, heat, area, rain} — Sense 3: {waste, sewage, pollution, runoff}

— Treat the head word as the semantic category of the words on its list
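Read this way, the cluster lists become a lookup from each listed word to the head word of its list; a sketch using the slide's toy "water" senses (not Pantel & Lin's real 20K-word output):

```python
# Sketch: the head word of each cluster list tags every word on that list.
CLUSTERS = {                                   # (head word, sense id) -> list members
    ("water", 1): {"oil", "gas", "fuel", "food", "milk", "liquid"},
    ("water", 2): {"air", "moisture", "soil", "heat", "area", "rain"},
    ("water", 3): {"waste", "sewage", "pollution", "runoff"},
}

def cluster_tags(word):
    # Every list containing the word contributes its head word as a semantic tag.
    return {head for (head, _sense), members in CLUSTERS.items() if word in members}

print(cluster_tags("milk"))   # -> {'water'}
```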

SLIDE 48

Evaluation

— Assess hierarchical coarse→fine classification — Assess impact of different semantic features — Assess training requirements for different feature sets

SLIDE 49

Evaluation

— Assess hierarchical coarse→fine classification — Assess impact of different semantic features — Assess training requirements for different feature sets — Training:

— 21.5K questions from TREC 8,9; manual; USC data

— Test:

— 1K questions from TREC 10,11

SLIDE 50

Evaluation

— Assess hierarchical coarse→fine classification — Assess impact of different semantic features — Assess training requirements for different feature sets — Training:

— 21.5K questions from TREC 8,9; manual; USC data

— Test:

— 1K questions from TREC 10,11

— Measures: Accuracy and class-specific precision
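The two measures, sketched on made-up labels: accuracy over all questions, and precision computed separately for each predicted class.

```python
# Sketch: overall accuracy plus class-specific precision (labels are invented).
from collections import Counter

gold = ["LOC", "HUM", "LOC", "DESC", "NUM", "LOC"]
pred = ["LOC", "HUM", "HUM", "DESC", "NUM", "NUM"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

precision = {}
for cls, n_pred in Counter(pred).items():
    n_correct = sum(1 for g, p in zip(gold, pred) if p == cls == g)
    precision[cls] = n_correct / n_pred

print(f"accuracy={accuracy:.2f}", precision)
```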

SLIDE 51

Results

— Syntactic features only:

— POS useful; chunks useful for contributing head chunks — Fine categories more ambiguous

SLIDE 52

Results

— Syntactic features only:

— POS useful; chunks useful for contributing head chunks — Fine categories more ambiguous

— Semantic features:

— Best combination: SYN, NE, Manual & Auto word lists

— Coarse: same; Fine: 89.3% (28.7% error reduction)

SLIDE 53

Results

— Syntactic features only:

— POS useful; chunks useful for contributing head chunks — Fine categories more ambiguous

— Semantic features:

— Best combination: SYN, NE, Manual & Auto word lists

— Coarse: same; Fine: 89.3% (28.7% error reduction)

— Wh-word most common class: 41%

SLIDE 54
SLIDE 55
SLIDE 56

Observations

— Effective coarse and fine-grained categorization

— Mix of information sources and learning — Shallow syntactic features effective for coarse — Semantic features improve fine-grained

— Most feature types help

— WordNet features appear noisy — Use of distributional sense clusters dramatically increases feature dimensionality

SLIDE 57

Question Classification with Support Vector Machines

— Hacioglu & Ward 2003 — Same taxonomy, training, test data as Li & Roth

SLIDE 58

Question Classification with Support Vector Machines

— Hacioglu & Ward 2003 — Same taxonomy, training, test data as Li & Roth — Approach:

— Shallow processing — Simpler features — Strong discriminative classifiers

SLIDE 59

Features & Processing

— Contrast (Li & Roth):

— POS, chunk info; NE tagging; other sense info

SLIDE 60

Features & Processing

— Contrast (Li & Roth):

— POS, chunk info; NE tagging; other sense info

— Preprocessing:

— Only letters kept; converted to lower case, stopword-filtered, stemmed

SLIDE 61

Features & Processing

— Contrast (Li & Roth):

— POS, chunk info; NE tagging; other sense info

— Preprocessing:

— Only letters kept; converted to lower case, stopword-filtered, stemmed

— Terms:

— Most informative 2,000 word n-grams — IdentiFinder NE tags (7 or 29 tags)
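A sketch of this shallow pipeline. NLTK's Porter stemmer and stopword list plus scikit-learn's LinearSVC are convenient stand-ins for whatever tools Hacioglu & Ward used; keeping the wh-words out of the stop list and the chi2 feature selection are this sketch's assumptions, and the IdentiFinder NE-tag features are omitted.

```python
# Sketch: lowercase letters only, stop/stem, word uni- and bigrams, linear SVM.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

stemmer = PorterStemmer()
# The slide only says 'stopped'; retaining wh-words is an assumption of this sketch.
stops = set(stopwords.words("english")) - {"who", "what", "when", "where", "why", "how", "which"}

def preprocess(q):
    tokens = re.findall(r"[a-z]+", q.lower())           # letters only, lower-cased
    return " ".join(stemmer.stem(t) for t in tokens if t not in stops)

questions = ["What Canadian city has the largest population ?",
             "Who wrote Hamlet ?",
             "How many dogs pull a sled at the Iditarod ?",
             "What is a prism ?"]
labels = ["LOC:city", "HUM:individual", "NUM:count", "DESC:definition"]

clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),   # word uni- and bigrams
    SelectKBest(chi2, k="all"),                         # would be k=2000 with real data
    LinearSVC(),
)
clf.fit([preprocess(q) for q in questions], labels)
print(clf.predict([preprocess("Who is the CEO of IBM ?")]))
```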

SLIDE 62

Classification & Results

— Employs support vector machines for classification

— Best results: Bi-gram, 7 NE classes

SLIDE 63

Classification & Results

— Employs support vector machines for classification

— Best results: Bi-gram, 7 NE classes

— Fewer NE categories better

— More categories, more errors

SLIDE 64

Enhanced Answer Type Inference … Using Sequential Models

— Krishnan, Das, and Chakrabarti 2005 — Improves QC with CRF extraction of ‘informer spans’

SLIDE 65

Enhanced Answer Type Inference … Using Sequential Models

— Krishnan, Das, and Chakrabarti 2005 — Improves QC with CRF extraction of ‘informer spans’ — Intuition:

— Humans identify Atype from few tokens w/little syntax

SLIDE 66

Enhanced Answer Type Inference … Using Sequential Models

— Krishnan, Das, and Chakrabarti 2005 — Improves QC with CRF extraction of ‘informer spans’ — Intuition:

— Humans identify Atype from few tokens w/little syntax

— Who wrote Hamlet?

SLIDE 67

Enhanced Answer Type Inference … Using Sequential Models

— Krishnan, Das, and Chakrabarti 2005 — Improves QC with CRF extraction of ‘informer spans’ — Intuition:

— Humans identify Atype from few tokens w/little syntax

— Who wrote Hamlet?

SLIDE 68

Enhanced Answer Type Inference … Using Sequential Models

— Krishnan, Das, and Chakrabarti 2005 — Improves QC with CRF extraction of ‘informer spans’ — Intuition:

— Humans identify Atype from few tokens w/little syntax

— Who wrote Hamlet? — How many dogs pull a sled at Iditarod?

SLIDE 69

Enhanced Answer Type Inference … Using Sequential Models

— Krishnan, Das, and Chakrabarti 2005 — Improves QC with CRF extraction of ‘informer spans’ — Intuition:

— Humans identify Atype from few tokens w/little syntax

— Who wrote Hamlet? — How many dogs pull a sled at Iditarod?

SLIDE 70

Enhanced Answer Type Inference … Using Sequential Models

— Krishnan, Das, and Chakrabarti 2005 — Improves QC with CRF extraction of ‘informer spans’ — Intuition:

— Humans identify Atype from few tokens w/little syntax

— Who wrote Hamlet? — How many dogs pull a sled at Iditarod? — How much does a rhino weigh?

SLIDE 71

Enhanced Answer Type Inference … Using Sequential Models

— Krishnan, Das, and Chakrabarti 2005 — Improves QC with CRF extraction of ‘informer spans’ — Intuition:

— Humans identify Atype from few tokens w/little syntax

— Who wrote Hamlet? — How many dogs pull a sled at Iditarod? — How much does a rhino weigh?

SLIDE 72

Enhanced Answer Type Inference … Using Sequential Models

— Krishnan, Das, and Chakrabarti 2005 — Improves QC with CRF extraction of ‘informer spans’ — Intuition:

— Humans identify Atype from few tokens w/little syntax

— Who wrote Hamlet? — How many dogs pull a sled at Iditarod? — How much does a rhino weigh?

— Single contiguous span of tokens

SLIDE 73

Enhanced Answer Type Inference … Using Sequential Models

— Krishnan, Das, and Chakrabarti 2005 — Improves QC with CRF extraction of ‘informer spans’ — Intuition:

— Humans identify Atype from few tokens w/little syntax

— Who wrote Hamlet? — How many dogs pull a sled at Iditarod? — How much does a rhino weigh?

— Single contiguous span of tokens

— How much does a rhino weigh?

SLIDE 74

Enhanced Answer Type Inference … Using Sequential Models

— Krishnan, Das, and Chakrabarti 2005 — Improves QC with CRF extraction of ‘informer spans’ — Intuition:

— Humans identify Atype from few tokens w/little syntax

— Who wrote Hamlet? — How many dogs pull a sled at Iditarod? — How much does a rhino weigh?

— Single contiguous span of tokens

— How much does a rhino weigh? — Who is the CEO of IBM?

SLIDE 75

Informer Spans as Features

— Sensitive to question structure

— What is Bill Clinton’s wife’s profession?

SLIDE 76

Informer Spans as Features

— Sensitive to question structure

— What is Bill Clinton’s wife’s profession?

SLIDE 77

Informer Spans as Features

— Sensitive to question structure

— What is Bill Clinton’s wife’s profession?

— Idea: Augment Q classifier word ngrams w/IS info

SLIDE 78

Informer Spans as Features

— Sensitive to question structure

— What is Bill Clinton’s wife’s profession?

— Idea: Augment Q classifier word ngrams w/IS info — Informer span features:

— IS ngrams

SLIDE 79

Informer Spans as Features

— Sensitive to question structure

— What is Bill Clinton’s wife’s profession?

— Idea: Augment Q classifier word ngrams w/IS info — Informer span features:

— IS ngrams — Informer ngrams hypernyms:

— Generalize over words or compounds

SLIDE 80

Informer Spans as Features

— Sensitive to question structure

— What is Bill Clinton’s wife’s profession?

— Idea: Augment Q classifier word ngrams w/IS info — Informer span features:

— IS ngrams — Informer ngrams hypernyms:

— Generalize over words or compounds — WSD?

SLIDE 81

Informer Spans as Features

— Sensitive to question structure

— What is Bill Clinton’s wife’s profession?

— Idea: Augment Q classifier word ngrams w/IS info — Informer span features:

— IS ngrams — Informer ngrams hypernyms:

— Generalize over words or compounds — WSD? No
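A sketch of the resulting feature set: ordinary word features for the whole question, plus WordNet hypernym features for the informer span only, keeping every sense rather than disambiguating. The informer span is supplied by hand here; in the paper it comes from the CRF tagger.

```python
# Sketch: word unigrams for the question plus hypernyms of the informer span.
from nltk.corpus import wordnet as wn

def informer_features(question_tokens, informer_span):
    feats = {f"unigram={t.lower()}" for t in question_tokens}
    for word in informer_span:
        feats.add(f"informer={word.lower()}")
        for synset in wn.synsets(word, pos=wn.NOUN):    # no WSD: every noun sense counts
            for hyper in synset.hypernyms():
                feats.add(f"informer_hyper={hyper.name()}")
    return feats

q = "What is Bill Clinton 's wife 's profession ?".split()
print(sorted(informer_features(q, ["profession"]))[:12])
```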

SLIDE 82

Effect of Informer Spans

— Classifier: Linear SVM + multiclass

SLIDE 83

Effect of Informer Spans

— Classifier: Linear SVM + multiclass

— Notable improvement for IS hypernyms

SLIDE 84

Effect of Informer Spans

— Classifier: Linear SVM + multiclass

— Notable improvement for IS hypernyms

— Better than hypernyms of all words – filters out sources of noise

— Biggest improvements for ‘what’, ‘which’ questions

SLIDE 85

Perfect vs CRF Informer Spans

SLIDE 86

Recognizing Informer Spans

— Idea: contiguous spans, syntactically governed

SLIDE 87

Recognizing Informer Spans

— Idea: contiguous spans, syntactically governed

— Use sequential learner w/syntactic information

SLIDE 88

Recognizing Informer Spans

— Idea: contiguous spans, syntactically governed

— Use sequential learner w/syntactic information

— Tag spans with B(egin), I(nside), O(utside)

— Employ syntax to capture long range factors

SLIDE 89

Recognizing Informer Spans

— Idea: contiguous spans, syntactically governed

— Use sequential learner w/syntactic information

— Tag spans with B(egin), I(nside), O(utside)

— Employ syntax to capture long range factors

— Matrix of features derived from parse tree

SLIDE 90

Recognizing Informer Spans

— Idea: contiguous spans, syntactically governed

— Use sequential learner w/syntactic information

— Tag spans with B(egin), I(nside), O(utside)

— Employ syntax to capture long range factors

— Matrix of features derived from parse tree

— Cell: x[i,l], where i is the token position and l is the depth in the parse tree (only 2 levels used) — Values:

— Tag: POS or constituent label at that position — Num: number of preceding chunks with the same tag
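A sketch of that matrix for the running example, with the two levels (POS and chunk label) supplied by hand instead of read off a real parse, and a BIO labelling in which 'woman' is assumed to be the informer span.

```python
# Sketch: two-level feature matrix x[i,l] plus BIO labels for a CRF.
tokens = ["Who", "was", "the", "first", "woman", "killed", "in", "the", "Vietnam", "War", "?"]
pos    = ["WP", "VBD", "DT", "JJ", "NN", "VBN", "IN", "DT", "NNP", "NNP", "."]
chunk  = ["NP", "VP", "NP", "NP", "NP", "VP", "PP", "NP", "NP", "NP", "O"]
bio    = ["O", "O", "O", "O", "B", "O", "O", "O", "O", "O", "O"]   # assumed informer: 'woman'

def chunk_segments(labels):
    # Collapse the per-token labels into segments; remember each token's segment index.
    seg_labels, seg_index = [], []
    for j, lab in enumerate(labels):
        if j == 0 or lab != labels[j - 1]:
            seg_labels.append(lab)
        seg_index.append(len(seg_labels) - 1)
    return seg_labels, seg_index

SEG_LABELS, SEG_INDEX = chunk_segments(chunk)

def cell_features(i):
    feats = {"tag[1]": pos[i], "tag[2]": chunk[i]}
    # Num: number of earlier chunk segments carrying the same level-2 tag.
    feats["num[2]"] = sum(1 for lab in SEG_LABELS[:SEG_INDEX[i]] if lab == chunk[i])
    return feats

X = [cell_features(i) for i in range(len(tokens))]   # one feature dict per position
y = bio
print(X[4], y[4])                                    # features and BIO label for 'woman'
# X and y in this shape could be fed to a linear-chain CRF implementation,
# e.g. sklearn-crfsuite, to learn the informer-span tagger.
```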

SLIDE 91

Parser Output

— Parse

SLIDE 92

Parse Tabulation

— Encoding and table:

SLIDE 93

CRF Indicator Features

— Cell:

— IsTag, IsNum: e.g. y_4 = 1 and x[4,2].tag = NP — Also, IsPrevTag, IsNextTag

SLIDE 94

CRF Indicator Features

— Cell:

— IsTag, IsNum: e.g. y_4 = 1 and x[4,2].tag = NP — Also, IsPrevTag, IsNextTag

— Edge:

— IsEdge(u,v): y_{i-1} = u and y_i = v — IsBegin, IsEnd

SLIDE 95

CRF Indicator Features

— Cell:

— IsTag, IsNum: e.g. y_4 = 1 and x[4,2].tag = NP — Also, IsPrevTag, IsNextTag

— Edge:

— IsEdge(u,v): y_{i-1} = u and y_i = v — IsBegin, IsEnd

— All features improve

SLIDE 96

CRF Indicator Features

— Cell:

— IsTag, IsNum: e.g. y_4 = 1 and x[4,2].tag = NP — Also, IsPrevTag, IsNextTag

— Edge:

— IsEdge(u,v): y_{i-1} = u and y_i = v — IsBegin, IsEnd

— All features improve — Question accuracy: Oracle: 88%; CRF: 86.2%