SLIDE 1

PRESENTATION ON:

“A SHORTEST PATH DEPENDENCY KERNEL FOR RELATION EXTRACTION”


**Taken from CS388 by Raymond J. Mooney, University of Texas at Austin

  • Hypothesis
  • Dependency Parsing - CFG vs CCG
  • Kernel
  • Evaluation
SLIDE 2

DEPENDENCY PARSING


*Taken from CS388 by Raymond J. Mooney, University of Texas at Austin.

[Figure: example dependency parse, from Wikipedia]

SLIDE 3

For sentence: John liked the dog in the pen.

[Figures: a phrase structure parse tree* and a typed dependency parse tree* for the example sentence. The dependency parse has edges nsubj(liked, John), dobj(liked, dog), det(dog, the), det(pen, the), with "in the pen" attached to "dog" as a prepositional modifier.]

  • Represent structure in sentences using “lexical terms” linked by binary relations called “dependencies”
  • Labelled case vs. unlabelled
  • Throw in a ROOT
  • Can be generated from parse trees, or generated directly (Collins’ NLP course has more) – see the sketch below
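As an illustration (not from the slides), a minimal sketch of producing a typed dependency parse directly, assuming spaCy and its en_core_web_sm model are installed:

import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
doc = nlp("John liked the dog in the pen.")

# Each token points at its head through a labelled ("typed") dependency.
for tok in doc:
    print(f"{tok.dep_}({tok.head.text}, {tok.text})")
# Expected edges include nsubj(liked, John), dobj(liked, dog),
# det(dog, the), det(pen, the).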
SLIDE 4


Parse Trees using CFG – with heads and without.

  • Created using PCFGs – using the CKY algorithm.
  • Weights for hand-written rules are obtained through training.
  • Problem: the production used to expand a non-terminal is independent of context – solution? Add heads.
  • Even with heads, it doesn’t really care about semantics.

[Figures: the same phrase structure and typed dependency parse trees as Slide 3.]

Can convert a phrase structure parse to a dependency tree by making the head of each non-head child of a node depend on the head of the head child.*

[Figure: example PCFG rules (no heads) with weights]
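A minimal sketch of that conversion (the (label, head_idx, children) tree encoding is my assumption, not the paper's): recursively find each node's lexical head, then make the head of every non-head child depend on the head of the head child.

def lexical_head(node):
    # A leaf is a plain word string and is its own head.
    if isinstance(node, str):
        return node
    label, head_idx, children = node
    return lexical_head(children[head_idx])

def to_dependencies(node, deps=None):
    # Collect (head_word, dependent_word) edges from a headed parse tree.
    if deps is None:
        deps = []
    if isinstance(node, str):
        return deps
    label, head_idx, children = node
    head_word = lexical_head(children[head_idx])
    for i, child in enumerate(children):
        if i != head_idx:
            deps.append((head_word, lexical_head(child)))
        to_dependencies(child, deps)
    return deps

# (S (NP John) (VP liked (NP the dog))): VP heads S, liked heads VP, dog heads NP.
tree = ("S", 1, [("NP", 0, ["John"]),
                 ("VP", 0, ["liked", ("NP", 1, ["the", "dog"])])])
print(to_dependencies(tree))
# -> [('liked', 'John'), ('liked', 'dog'), ('dog', 'the')]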

SLIDE 5


Combinatory Categorial Grammars

  • Phrase structure grammar, not dependency based
  • Can be converted to dependency parse
  • Based on the application of different types of combinators.
  • Combinators act on an argument and have a functor
  • Think λ-calculus (a minimal sketch follows this list)
  • Special syntactic category for relative pronouns + using heads = long range (?)
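As an illustration (not from the slides), a minimal sketch of the two most basic combinators, forward and backward application, with CCG categories encoded as plain strings:

def forward_apply(functor, arg):
    # X/Y  Y  =>  X   (forward application)
    if functor.endswith("/" + arg):
        return functor[:-len("/" + arg)].strip("()")
    return None

def backward_apply(arg, functor):
    # Y  X\Y  =>  X   (backward application)
    if functor.endswith("\\" + arg):
        return functor[:-len("\\" + arg)].strip("()")
    return None

# "John liked the dog": a transitive verb is (S\NP)/NP, the NPs are NP.
vp = forward_apply("(S\\NP)/NP", "NP")  # consumes the object: S\NP
s = backward_apply("NP", vp)            # consumes the subject: S
print(vp, "|", s)                       # S\NP | S

Real CCG parsers also use composition and type-raising combinators; this sketch shows application only.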

From Steedman and Wikipedia; Bunescu and Mooney, 2005. Unclear how to go from the phrasal parse to a dependency parse. Gagan: interesting. Rishab: better CCG parser? Modern-day parser?

SLIDE 6

KERNEL METHODS

Adapted from: ACL’04 tutorial, Jean-Michel Renders, Xerox Research Centre Europe (France)


SLIDE 7

[Figures: a polynomial kernel mapping 2D data into 3D via a feature map f, and an RBF kernel decision boundary]

ACL-04 Tutorial

SLIDE 8

ACL-04 Tutorial

KERNEL METHODS: INTUITIVE IDEA

  • Find a mapping f such that, in the new space, problem solving is easier (e.g. linear) … the mapping is similar to features.
  • The kernel represents the similarity between two objects (documents, terms, …), defined as the dot product in this new vector space.
  • The mapping is left implicit – avoids the expensive transformation.
  • Easy generalization of a lot of dot-product (or distance) based algorithms.


SLIDE 9

KERNEL: FORMAL

  • A kernel k(x, y) is a similarity measure, defined by an implicit mapping f from the original space to a vector space (the feature space), such that: k(x, y) = f(x)•f(y)
  • This similarity measure and the mapping include:
    • simpler structure (linear representation of the data)
    • possibly infinite dimension (hypothesis space for learning)
    • … but still computational efficiency when computing k(x, y)
  • Valid kernel: any kernel that satisfies Mercer’s theorem (a numerical check of the definition follows)

ACL-04 Tutorial
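As a numerical check of the definition (illustrative, not from the tutorial): the degree-2 polynomial kernel k(x, y) = (x·y)² on 2D inputs corresponds to an explicit 3D feature map f, so the similarity can be computed without ever applying f:

# Sketch: k(x, y) = (x.y)^2 equals a dot product under the explicit
# feature map f(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
import math

def k(x, y):
    return (x[0]*y[0] + x[1]*y[1]) ** 2

def f(x):
    return (x[0]**2, math.sqrt(2)*x[0]*x[1], x[1]**2)

x, y = (1.0, 2.0), (3.0, 0.5)
print(k(x, y))                               # 16.0, via the kernel
print(sum(a*b for a, b in zip(f(x), f(y))))  # 16.0, via the explicit map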


SLIDE 10

KERNELS FOR TEXT

  • Seen as a ‘bag of words’: dot product or polynomial kernels (multi-words) – a minimal sketch follows below
  • Seen as a set of concepts: GVSM kernels, Kernel LSI (or Kernel PCA), Kernel ICA, … possibly multilingual
  • Seen as a string of characters: string kernels
  • Seen as a string of terms/concepts: word sequence kernels
  • Seen as trees (dependency or parse trees): tree kernels
  • Seen as the realization of a probability distribution (generative model)


ACL-04 Tutorial
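A minimal sketch of the first item above, a ‘bag of words’ kernel as a dot product of term-count vectors (illustrative only):

from collections import Counter

def bow_kernel(doc1, doc2):
    # Dot product of term-count vectors: sum counts over shared words.
    c1, c2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    return sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())

print(bow_kernel("John liked the dog", "the dog liked the pen"))
# -> 4: liked (1*1) + the (1*2) + dog (1*1)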

SLIDE 11


Tree Kernels

  • Special case of general kernels defined on discrete structures (graphs).
  • Consider the following example:

k(T1, T2) = #common subtrees between T1 and T2

  • Feature space is space of all subtrees. (huge)
  • Kernel is computed in polynomial time using:

k(T1, T2) = Σ_{n1 ∈ T1} Σ_{n2 ∈ T2} k_co-rooted(n1, n2)

k_co-rooted(n1, n2) =
  0, if n1 or n2 is a leaf, or n1 ≠ n2
  ∏_{i ∈ children} (1 + k_co-rooted(ch(n1, i), ch(n2, i))), otherwise

From the ACL 2004 tutorial and Wikipedia. In the example, #common subtrees = 7. Similar ideas are used in Culotta and Sorensen, 2004. Shantanu: not intuitive. A minimal sketch of the recursion follows.
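In this sketch (the (label, children) tuple encoding of trees is an assumption), matching interior nodes multiply (1 + child score) over aligned children, and the kernel sums the co-rooted count over all node pairs:

def k_co_rooted(n1, n2):
    (l1, c1), (l2, c2) = n1, n2
    if not c1 or not c2 or l1 != l2 or len(c1) != len(c2):
        return 0  # a leaf, or n1 != n2: no co-rooted common subtree
    prod = 1
    for a, b in zip(c1, c2):
        prod *= 1 + k_co_rooted(a, b)
    return prod

def tree_kernel(t1, t2):
    # k(T1, T2): sum k_co_rooted over every node pair (n1 in T1, n2 in T2).
    def nodes(t):
        yield t
        for c in t[1]:
            yield from nodes(c)
    return sum(k_co_rooted(a, b) for a in nodes(t1) for b in nodes(t2))

t = ("S", [("NP", [("John", [])]), ("VP", [("liked", [])])])
print(tree_kernel(t, t))  # -> 6 common subtrees between the tree and itself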

SLIDE 12


Interesting:

  • Remembers training data unlike other methods (regression, GDA)
  • Nice theoretical properties
  • Dual space
  • Most popular kernel method is SVMs
  • Kernel trick can lift other linear methods into ϕ space – PCA, for example

Multiclass SVM:

Given classes 0, 1, 2, …, M.
One-vs-all: learn {h_i}, i = 1…M, which are functions on the input space, and assign the label that gives the maximum h_i – O(M) classifiers.
One-vs-one: learn {h_ij}, i, j = 1…M, one function for each pair of classes, and assign the label with the most “votes” – O(M²) classifiers. (A scikit-learn sketch of both reductions follows.)
How does hierarchy help? (think S1 vs S2)
Anshul: Multiclass SVM blows up for many classes. No finer relations.
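As an illustration (not from the slides), a minimal scikit-learn sketch of both reductions; SVC is one-vs-one natively, and the wrappers make the construction explicit:

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)

ova = OneVsRestClassifier(SVC(kernel="rbf"))  # M binary classifiers
ovo = OneVsOneClassifier(SVC(kernel="rbf"))   # M*(M-1)/2 binary classifiers

for name, clf in [("one-vs-all", ova), ("one-vs-one", ovo)]:
    clf.fit(X, y)
    print(name, clf.score(X, y))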

SLIDE 13

A SHORTEST PATH DEPENDENCY KERNEL FOR RELATION EXTRACTION

HYPOTHESIS: If e1 and e2 are entities in a sentence related by R, then hypothesize that the contribution of the sentence’s dependency graph to establishing R(e1, e2) is almost exclusively concentrated in the shortest path between e1 and e2 in the undirected dependency graph.

Arindam: oversimplified. Nupur: didn’t verify the hypothesis. Barun: useful, but no statistical backing. Swarnadeep: more examples/backing for the hypothesis. Dhruvin: when does it fail? Happy: intuition.

All figures and tables from Bunescu and Mooney, 2005.

SLIDE 14

[Diagram: Information Extraction and its subtasks – Coreference Resolution, NER, Relation Extraction]


  • The paper extracts relations between intra-sentential entities, which have already been tagged, using a kernel method.
  • The kernel is based on the shortest path between the entities in the sentence.

Akshay: limited ontology. Surag: no temporality.

SLIDE 15

PoS → Chunking → Shallow Parse Trees → Dependency Trees (increasing amount of syntactic knowledge)


Syntactic knowledge helps with IE.

Different levels of syntactic knowledge have been used. The paper states the hypothesis that most of the information useful for relation extraction is concentrated in the shortest path between the entities in the undirected dependency graph.

Assumptions:

All relations are intra-sentential. Sentences are independent of each other. Relations and entities are known.

How do we use syntactic knowledge?

Ray and Craven, 2001: PoS and chunking. Zelenko et al., 2003: kernel methods based on shallow parse trees. Culotta and Sorensen, 2004: dependency trees.
Anshul: mines implicit relations??? No strong reasons. Himanshu: dependency is hard. Arindam: likes deep syntactic knowledge. Nupur: likes the idea. Barun: is classification even useful? Gagan: dislikes the sentence assumption.

SLIDE 16


Mathematical! – Before the Kernel

Original training data: articles, with entities and relations.
Processed training data: (x_i, y_i), i = 1…N, where x_i is the shortest path with types – will go through the SVM in a bit.
y_i: 5+1 top-level relations; 24 fine relation types. Top-level y_i ∈ {ROLE, PART, LOCATED, NEAR, SOCIAL}.
Handles negation – attach a (-) suffix to words modified by negative determiners.
x_i → ϕ(x_i)?
Nupur, Gagan: nots! Anshul: more nots? Yashoteja: general – add more features! Prachi: what happened to e3? Happy: verb semantics? Use Markov logic? Anshul: intra-sentence context; no novel relations.

SLIDE 17

Mathematical! – The Kernel

x and y are vectors. Comments:

  • Don’t longer paths produce higher kernel scores?

Yes, but it doesn’t matter, because they use an entirely different set of features… and hence get lower SVM weights?

  • Normalization of the kernel produces a drop in accuracy.
  • Empirically set to 0 for longer chains. (intuitively good?) A minimal sketch of the kernel follows.
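In this sketch, each position along the shortest path carries a set of features (word, POS, general POS class, entity type, or arrow direction), and the kernel multiplies the number of common features position by position, returning 0 for paths of different length. The feature sets below are adapted from the paper’s “his actions in Brcko” / “his arrival in Beijing” example; treat the exact sets as illustrative.

def sp_kernel(x, y):
    # x, y: sequences of feature sets along each shortest dependency path.
    if len(x) != len(y):
        return 0  # paths of different length never match
    score = 1
    for fx, fy in zip(x, y):
        score *= len(fx & fy)  # count of common features at this position
    return score

x = [{"his", "PRP", "PERSON"}, {"<-"}, {"actions", "NNS", "Noun"},
     {"->"}, {"in", "IN"}, {"->"}, {"Brcko", "NNP", "Noun", "LOCATION"}]
y = [{"his", "PRP", "PERSON"}, {"<-"}, {"arrival", "NN", "Noun"},
     {"->"}, {"in", "IN"}, {"->"}, {"Beijing", "NNP", "Noun", "LOCATION"}]
print(sp_kernel(x, y))  # 3 * 1 * 1 * 1 * 2 * 1 * 3 = 18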

SVM:

  • S1: One multi-class SVM to do relation detection and relation extraction
  • S2: One SVM to do relation detection, another SVM to classify relation after detection

Himanshu, Barun: other similarity metrics? Semantics? Rishab: theoretically SVMs are awesome; efficient; synonyms? Nupur, Akshay, …: antonyms, synonyms, etc.? 0 if m != n? Ankit: use redundancy? …: word2vec too? Prachi: words + classes against sparsity. Happy: RBF kernel? Shantanu: it isn’t using the feature space fully – weight it? Anshul: handle multiple shortest paths. Surag: lexical and unlexicalized features; an additional layer of inference?

SLIDE 18


Experiments

ACE corpus: 422 documents + 97 test documents; 6k training relation instances + 1.5k test.

Dependency parsers:

  • CCG: Long range dependencies. Forms a DAG.
  • CFG: Local dependencies. Forms a tree. Used Collins’ parser.

SVMs:

  • S1 – normal multiclass SVM
  • S2 – two SVMs: one detects a relation, the second classifies it
  • helps increase recall

Kernel:

  • 0 for paths longer than 10; 0 if the path lengths are unequal.

Comparison:

  • S1: Simple multiclass SVM
  • S2: Hierarchical SVM
  • K4: Sum of BoW kernel and dependency kernel from Culotta and Sorensen, 2004

RECALL: assumes independent sentences and only intra-sentential relations.
Akshay, Ankit: limited dataset. Rishab: something between CCG and CFG? 5 classes; likes the 2-step approach. Gagan: CCGs are interesting; why 10? Ankit: likes hierarchy. Yashoteja: a lot of dot products are 0 – mess around with the optimizer?

SLIDE 19


Results

CFG dependency parsing does better than CCG – the paper attributes this to Collins’ parser finding local dependencies better. The intuition behind the shortest path is similar to Richards and Mooney, 1992: “In most relational domains, important concepts will be represented by a small number of fixed paths among the constants defining a positive instance – for example, the grandparent relation by a single path consisting of two parent relations.”

Arindam: CCG vs CFG. Akshay: other papers? Surag, Dinesh, Shantanu: no error analysis for the low recall. Anshul: confusion matrix?

SLIDE 20

Extensions:

A summary of a lot of people’s comments

  • Completeness: What about the finer classes?
  • Scale: NELL? OpenIE? Larger datasets? SVM might not work? A lot of features means no large scale?
  • System: combine with NER and Coreference? Entity and Relation extraction at the same time?
  • Dependency Parser: How does it do today? Better CCG parsers? Deeper semantic parsers?
  • Kernel: Convex combination of common kernels? RBF? (How?)
  • Data: Unstructured data/twitter [Gagan]
  • Setting: Is treating it as classification even a good idea? [Barun]
