PhD course in Machine Learning: Kernel Engineering, Alessandro Moschitti

slide-1
SLIDE 1

PhD course in Machine Learning

Alessandro Moschitti

Department of Information and Communication Technology, University of Trento

Email: moschitti@dit.unitn.it

Kernel Engineering

slide-2
SLIDE 2

Kernel Engineering approaches

Basic Combinations
Canonical Mappings, e.g. object transformations
Merging of Kernels

slide-3
SLIDE 3

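Kernel Combinations: an example

Kernel Combinations:

\[ K_{Tree+P} = \gamma \times K_{Tree} + K_{p^3}, \qquad K_{Tree \times P} = K_{Tree} \times K_{p^3} \]

\[ K_{Tree+P} = \gamma \times \frac{K_{Tree}}{|K_{Tree}|} + \frac{K_{p^3}}{|K_{p^3}|}, \qquad K_{Tree \times P} = \frac{K_{Tree} \times K_{p^3}}{|K_{Tree}| \times |K_{p^3}|} \]

K_{Tree}: kernel of Tree features
K_{p^3}: polynomial kernel of flat features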
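A minimal C sketch of the two unnormalized combinations named on this slide: the weighted sum and the product of a tree kernel with a degree-3 polynomial kernel. The function names, the dense-vector representation and the offset c are assumptions made for illustration. The tree-kernel value is taken as an already computed number, and the normalized variants are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two dense vectors of length n. */
double dot(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += x[i] * y[i];
    }
    return s;
}

/* Polynomial kernel (x.y + c)^d over flat feature vectors; d = 3 gives the
   degree-3 kernel of the slide. The offset c is an assumption. */
double poly_kernel(const double *x, const double *y, size_t n,
                   double c, unsigned d) {
    double base = dot(x, y, n) + c;
    double k = 1.0;
    for (unsigned i = 0; i < d; i++) {
        k *= base;
    }
    return k;
}

/* Weighted sum: gamma * K_tree + K_poly. */
double combine_sum(double k_tree, double k_poly, double gamma) {
    assert(gamma >= 0.0); /* a non-negative weight keeps the sum a valid kernel */
    return gamma * k_tree + k_poly;
}

/* Product: K_tree * K_poly. */
double combine_product(double k_tree, double k_poly) {
    return k_tree * k_poly;
}
```

For example, with x = (1, 2), y = (3, 4) and c = 1, poly_kernel evaluates (11 + 1)^3 = 1728.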

slide-4
SLIDE 4

Object Transformation [Moschitti et al, CLJ 2008]

Canonical Mapping, φM()

object transformation, e.g. a syntactic parse tree into a verb subcategorization frame tree.

Feature Extraction, φE()

maps the canonical structure into all its fragments in different fragment spaces, e.g. ST, SST and PT.

\[ K(O_1, O_2) = \phi(O_1) \cdot \phi(O_2) = \phi_E(\phi_M(O_1)) \cdot \phi_E(\phi_M(O_2)) = \phi_E(S_1) \cdot \phi_E(S_2) = K_E(S_1, S_2), \qquad S_i = \phi_M(O_i) \]

slide-5
SLIDE 5

Predicate Argument Classification

In an event:

target words describe relations among different entities
the participants are often seen as the predicate's arguments.

Example:

Paul gives a talk in Rome

slide-6
SLIDE 6

Predicate Argument Classification

In an event:

target words describe relations among different entities
the participants are often seen as the predicate's arguments.

Example:

[ Arg0 Paul] [ predicate gives ] [ Arg1 a talk] [ ArgM in Rome]

slide-7
SLIDE 7

Predicate-Argument Feature Representation

Given a sentence and a predicate p:

  1. Derive the sentence parse tree
  2. For each node pair <Np,Nx>
     a. Extract a feature representation set F
     b. If Nx exactly covers the Arg-i, F is one of its positive examples
     c. F is a negative example otherwise
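The labeling rule in steps b and c can be sketched in C as follows. This is only an illustration under stated assumptions: nodes and arguments are represented by hypothetical word spans [start, end), the names Span, label_node and label_candidates are invented here, and the extraction of the feature set F is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Word span [start, end) covered by a parse-tree node (hypothetical type). */
typedef struct {
    size_t start;
    size_t end;
} Span;

/* Returns +1 if node Nx covers exactly the words of the argument, else -1. */
int label_node(Span nx, Span arg) {
    return (nx.start == arg.start && nx.end == arg.end) ? 1 : -1;
}

/* Labels every candidate node Nx against one annotated argument Arg-i. */
void label_candidates(const Span *nodes, size_t n, Span arg, int *labels) {
    assert(nodes != NULL && labels != NULL);
    for (size_t i = 0; i < n; i++) {
        labels[i] = label_node(nodes[i], arg);
    }
}
```

With the words of "Paul gives a talk in Rome" indexed from 0, the node spanning "a talk" is [2, 4) and is the only positive example for Arg1; every other node is a negative example.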
slide-8
SLIDE 8

Vector Representation for the linear kernel

Phrase Type
Predicate Word
Head Word
Parse Tree Path
Voice: Active
Position: Right

slide-9
SLIDE 9

Kernel Engineering: Tree Tailoring

slide-10
SLIDE 10

PAT Kernel [Moschitti, ACL 2004]

[Figure: three copies of the parse tree of the sentence below, marking a) the structure F_{v,arg.0} for Arg. 0, b) the structure F_{v,arg.1} for Arg. 1, and c) the structure F_{v,arg.M} for Arg. M.]

Given the sentence:

[ Arg0 Paul] [ predicate delivers] [ Arg1 a talk] [ ArgM in formal style]

These are Semantic Structures

slide-11
SLIDE 11

In other words we consider…

[Figure: parse tree of "Paul delivers a talk in formal style" with the structure for Arg. 1 marked.]
slide-12
SLIDE 12

Sub-Categorization Kernel (SCF) [Moschitti, ACL 2004]

[Figure: parse tree of "Paul delivers a talk in formal style" with the nodes for the Predicate, Arg. 0, Arg. 1 and Arg. M marked.]

slide-13
SLIDE 13

Experiments on Gold Standard Trees

PropBank and Penn Treebank

about 53,700 sentences
Sections 2 to 21 for training, 23 for testing, 1 and 22 for development
Arguments from Arg0 to Arg5, ArgA and ArgM, for a total of 122,774 and 7,359

FrameNet and Collins’ automatic trees

24,558 sentences from the 40 frames of Senseval 3
18 roles (same names are mapped together)
Only verbs
70% for training and 30% for testing

slide-14
SLIDE 14

Argument Classification with Poly Kernel

slide-15
SLIDE 15

PropBank Results

slide-16
SLIDE 16

Argument Classification on PAT using different Tree Fragment Extractors

[Figure: accuracy (y-axis, 0.75 to 0.88) against % of training data (x-axis, 10 to 100) for the ST, SST, Linear and PT kernels.]

slide-17
SLIDE 17

FrameNet Results

PropBank arguments vs. Semantic Roles

slide-18
SLIDE 18

Kernel Engineering: Node marking

slide-19
SLIDE 19

Marking Boundary nodes

slide-20
SLIDE 20

Node Marking Effect

slide-21
SLIDE 21

Different tailoring and marking

CMST MMST

slide-22
SLIDE 22

Experiments

PropBank and Penn Treebank

about 53,700 sentences
Charniak trees from CoNLL 2005

Boundary detection:

Section 2 training
Section 24 testing
PAF and MPAF

slide-23
SLIDE 23

Number of examples/nodes of Section 2

slide-24
SLIDE 24

Predicate Argument Feature (PAF) vs. Marked PAF (MPAF) [Moschitti et al, ACL-ws-2005]

slide-25
SLIDE 25

More general mappings: Semantic structures for re-ranking [Moschitti et al, CoNLL 2006]

slide-26
SLIDE 26

Other Shallow Semantic structures

[Moschitti and Quarteroni, NAACL 2008]

[ARG1 Antigens] were [AM−TMP originally] [rel defined] [ARG2 as non-self molecules].

[ARG0 Researchers] [rel describe] [ARG1 antigens] [ARG2 as foreign molecules] [ARGM−LOC in the body]

slide-27
SLIDE 27

Shallow Semantic Trees for SST kernel

[Moschitti et al, ACL 2007]

slide-28
SLIDE 28

Merging of Kernels [ECIR 2007]: Question/Answer Classification

Syntactic/Semantic Tree Kernel
Kernel Combinations
Experiments

slide-29
SLIDE 29

Merging of Kernels [Bloehdorn & Moschitti, ECIR 2007 & CIKM 2007]

slide-30
SLIDE 30

Merging of Kernels

[Figure: two parse-tree fragments over "gives a talk" (nodes VP, V, NP, D, N) that differ in one word, "good" in the first and "solid" in the second.]

slide-31
SLIDE 31

Delta Evaluation is very simple

slide-32
SLIDE 32

Question Classification

Definition: What does HTML stand for?
Description: What's the final line in the Edgar Allan Poe poem "The Raven"?
Entity: What foods can cause allergic reaction in people?
Human: Who won the Nobel Peace Prize in 1992?
Location: Where is the Statue of Liberty?
Manner: How did Bob Marley die?
Numeric: When was Martin Luther King Jr. born?
Organization: What company makes Bentley cars?

slide-33
SLIDE 33

Question Classifier based on Tree Kernels

Question dataset (http://l2r.cs.uiuc.edu/~cogcomp/Data/QA/QC/) [Li and Roth, 2005]

Distributed on 6 categories: Abbreviations, Descriptions, Entity, Human, Location, and Numeric.

Fixed split: 5500 training and 500 test questions

Cross-validation (10 folds)

Using the whole question parse trees

Constituent parsing

Example

“What is an offer of direct stock purchase plan ?”

slide-34
SLIDE 34
slide-35
SLIDE 35

Kernels

BOW, POS are obtained with a simple tree, e.g.

[Figure: a flat tree rooted at BOX whose children are the question words (What, is, an, offer, ...), each with a single "*" leaf.]

PT (parse tree)

PAS (predicate argument structure)

slide-36
SLIDE 36

Question classification

slide-37
SLIDE 37

Similarity based on WordNet

slide-38
SLIDE 38

Question Classification with S/STK

slide-39
SLIDE 39

Multiple Kernel Combinations

[Moschitti, CIKM 2008; Moschitti & Quarteroni, NAACL 2008; Moschitti et al., ACL 2007]

slide-40
SLIDE 40

TASK: Question/Answer Classification

The classifier detects if a pair (question and answer) is correct or not

A representation for the pair is needed

The classifier can be used to re-rank the output of a basic QA system

slide-42
SLIDE 42

Dataset 2: TREC data

138 TREC 2001 test questions labeled as “description”

2,256 sentences, extracted from the best ranked paragraphs (using a basic QA system based on the Lucene search engine on the TREC dataset)

216 of which labeled as correct by one annotator

A question is linked to many answers: the pairs derived from one question must not be split between the training and test sets
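One way to respect this constraint is to assign folds per question rather than per pair. The C sketch below is an assumption-laden illustration, not the procedure used in the experiments: the QAPair type, the numeric question ids and the modulo rule are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One question/answer pair; question_id identifies the source question. */
typedef struct {
    unsigned question_id;
    unsigned answer_id;
} QAPair;

/* Assigns each pair to a fold using only its question id, so every pair
   derived from the same question ends up in the same fold. */
void assign_folds(const QAPair *pairs, size_t n, unsigned n_folds,
                  unsigned *fold_of_pair) {
    assert(n_folds > 0);
    for (size_t i = 0; i < n; i++) {
        fold_of_pair[i] = pairs[i].question_id % n_folds;
    }
}
```

Because the fold depends only on question_id, no question contributes pairs to both the training and the test side of any fold.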

slide-43
SLIDE 43

Bags of words (BOW) and POS-tags (POS)

To save time, apply STK to these trees:

(BOX (What *)(is *)(an *)(offer *)(of *))

(BOX (WHNP *)(VBZ *)(DT *)(NN *)(IN *))

slide-44
SLIDE 44

Word and POS Sequences

What is an offer of…? (word sequence, WSK)

  What_is_offer
  What_is

WHNP VBZ DT NN IN… (POS sequence, POSSK)

  WHNP_VBZ_NN
  WHNP_NN_IN

slide-45
SLIDE 45

Syntactic Parse Trees (PT)

slide-46
SLIDE 46

Predicate Argument Classification

In an event:

target words describe relations among different entities
the participants are often seen as the predicate's arguments.

Example:

Paul gives a lecture in Rome

slide-47
SLIDE 47

Predicate Argument Classification

In an event:

target words describe relations among different entities
the participants are often seen as the predicate's arguments.

Example:

[ Arg0 Paul] [ predicate gives ] [ Arg1 a lecture] [ ArgM in Rome]

slide-48
SLIDE 48

Predicate Argument Structure for Partial Tree Kernel (PASPTK)

[ARG1 Antigens] were [AM−TMP originally] [rel defined] [ARG2 as non-self molecules].

[ARG0 Researchers] [rel describe] [ARG1 antigens] [ARG2 as foreign molecules] [ARGM−LOC in the body]

slide-49
SLIDE 49

Kernels and Combinations

Exploiting the property that the sum of two kernels is a kernel: k(x,z) = k1(x,z) + k2(x,z)

BOW, POS, WSK, POSSK, PT, PASPTK

⇒ BOW+POS, BOW+PT, PT+POS, …
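Why the sum is again a kernel can be seen by concatenating feature maps. A short derivation, where φ1 and φ2 stand for the feature maps of k1 and k2 (symbols introduced here for illustration):

```latex
\begin{align*}
k(x,z) &= k_1(x,z) + k_2(x,z) \\
       &= \phi_1(x)\cdot\phi_1(z) + \phi_2(x)\cdot\phi_2(z) \\
       &= \big[\phi_1(x),\,\phi_2(x)\big]\cdot\big[\phi_1(z),\,\phi_2(z)\big]
\end{align*}
```

So a combination such as BOW+PT is the inner product in the joined feature space of the two kernels.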

slide-50
SLIDE 50

Results on TREC Data

(5-fold cross-validation)

[Figure: bar chart of F1-measure (y-axis, 20 to 40) by kernel type (x-axis), built up step by step over slides 50 to 56.]

BOW ≈ 24, POSSK+STK+PAS-PTK ≈ 39 ⇒ 62% improvement
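The quoted figure is the relative improvement computed from the two approximate F1 values:

```latex
\[
\frac{39 - 24}{24} = \frac{15}{24} = 0.625 \approx 62\%
\]
```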

slide-57
SLIDE 57

SVM-light-TK Software

Encodes ST, SST and combination kernels in SVM-light [Joachims, 1999]

Available at http://dit.unitn.it/~moschitt/

Tree forests, vector sets

New extensions: the PT kernel will be released asap

slide-58
SLIDE 58

Data Format

“What does Html stand for?”

1 |BT| (SBARQ (WHNP (WP What))(SQ (AUX does)(NP (NNP S.O.S.))(VP (VB stand)(PP (IN for))))(. ?)) |BT| (BOW (What *)(does *)(S.O.S. *)(stand *)(for *)(? *)) |BT| (BOP (WP *)(AUX *)(NNP *)(VB *)(IN *)(. *)) |BT| (PAS (ARG0 (R-A1 (What *)))(ARG1 (A1 (S.O.S. NNP)))(ARG2 (rel stand))) |ET| 1:1 21:2.742439465642236E-4 23:1 30:1 36:1 39:1 41:1 46:1 49:1 66:1 152:1 274:1 333:1 |BV| 2:1 21:1.4421347148614654E-4 23:1 31:1 36:1 39:1 41:1 46:1 49:1 52:1 66:1 152:1 246:1 333:1 392:1 |EV|

slide-59
SLIDE 59

Basic Commands

Training and classification

./svm_learn -t 5 -C T train.dat model
./svm_classify test.dat model

Learning with a vector sequence

./svm_learn -t 5 -C V train.dat model

Learning with the sum of vector and kernel sequences

./svm_learn -t 5 -C + train.dat model

slide-60
SLIDE 60

Custom Kernel

Kernel.h

double custom_kernel(KERNEL_PARM *kernel_parm, DOC *a, DOC *b);

if(a->num_of_trees && b->num_of_trees && a->forest_vec[i]!=NULL && b->forest_vec[i]!=NULL){ // Proceed only if the i-th trees of instances a and b are both non-empty

slide-61
SLIDE 61

Custom Kernel: tree-kernel

k1 = // summation of tree kernels
  tree_kernel(kernel_parm, a, b, i, i) / // Evaluate the tree kernel between the two i-th trees.
  sqrt(tree_kernel(kernel_parm, a, a, i, i) * tree_kernel(kernel_parm, b, b, i, i)); // Normalize with respect to both i-th trees.

slide-62
SLIDE 62

Custom Kernel: Polynomial kernel

if(a->num_of_vectors && b->num_of_vectors && a->vectors[i]!=NULL && b->vectors[i]!=NULL){ // Proceed only if the i-th vectors of a and b are both non-empty.

k2 = // summation of vectors
  basic_kernel(kernel_parm, a, b, i, i) / // Compute the standard kernel (selected according to the "second_kernel" parameter).

slide-63
SLIDE 63

Custom Kernel: Polynomial kernel

  sqrt(basic_kernel(kernel_parm, a, a, i, i) * basic_kernel(kernel_parm, b, b, i, i)); // normalize vectors

return k1+k2;
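The tree and vector fragments apply the same normalization before the two terms are added. Written for a generic kernel K, with K′ as notation introduced here for the normalized kernel and assuming K(a,a) and K(b,b) are positive:

```latex
\[
K'(a,b) = \frac{K(a,b)}{\sqrt{K(a,a)\,K(b,b)}}, \qquad K'(a,a) = 1
\]
\[
k_1 + k_2 = K'_{\mathrm{tree}}(a_i, b_i) + K'_{\mathrm{vec}}(a_i, b_i)
\]
```

Here a_i and b_i denote the i-th tree (or vector) of instances a and b. Normalizing both terms keeps either kernel from dominating the sum purely because of its scale.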

slide-64
SLIDE 64

Conclusions

Kernel methods and SVMs are useful tools to design language applications

Kernel design still requires some level of expertise

Engineering approaches to tree kernels

Basic Combinations
Canonical Mappings, e.g. Node Marking
Merging of kernels in more complex kernels

State-of-the-art in SRL and QC

An efficient tool to use them

slide-65
SLIDE 65

Thank you

slide-66
SLIDE 66

References

Alessandro Moschitti, Silvia Quarteroni, Roberto Basili and Suresh Manandhar, Exploiting Syntactic and Shallow Semantic Kernels for Question/Answer Classification, Proceedings of the 45th Conference of the Association for Computational Linguistics (ACL), Prague, June 2007.

Alessandro Moschitti and Fabio Massimo Zanzotto, Fast and Effective Kernels for Relational Learning from Texts, Proceedings of The 24th Annual International Conference on Machine Learning (ICML 2007), Corvallis, OR, USA.

Daniele Pighin, Alessandro Moschitti and Roberto Basili, RTV: Tree Kernels for Thematic Role Classification, Proceedings of the 4th International Workshop on Semantic Evaluation (SemEval-4), English Semantic Labeling, Prague, June 2007.

Stephan Bloehdorn and Alessandro Moschitti, Combined Syntactic and Semantic Kernels for Text Classification, to appear in the 29th European Conference on Information Retrieval (ECIR), April 2007, Rome, Italy.

Fabio Aiolli, Giovanni Da San Martino, Alessandro Sperduti, and Alessandro Moschitti, Efficient Kernel-based Learning for Trees, to appear in the IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Honolulu, Hawaii, 2007.

slide-67
SLIDE 67

An introductory book on SVMs, Kernel methods and Text Categorization

slide-68
SLIDE 68

References

Roberto Basili and Alessandro Moschitti, Automatic Text Categorization: from Information Retrieval to Support Vector Learning, Aracne editrice, Rome, Italy.

Alessandro Moschitti, Efficient Convolution Kernels for Dependency and Constituent Syntactic Trees. In Proceedings of the 17th European Conference on Machine Learning, Berlin, Germany, 2006.

Alessandro Moschitti, Daniele Pighin, and Roberto Basili, Tree Kernel Engineering for Proposition Re-ranking, In Proceedings of Mining and Learning with Graphs (MLG 2006), Workshop held with ECML/PKDD 2006, Berlin, Germany, 2006.

Elisa Cilia, Alessandro Moschitti, Sergio Ammendola, and Roberto Basili, Structured Kernels for Automatic Detection of Protein Active Sites. In Proceedings of Mining and Learning with Graphs (MLG 2006), Workshop held with ECML/PKDD 2006, Berlin, Germany, 2006.

slide-69
SLIDE 69

References

Fabio Massimo Zanzotto and Alessandro Moschitti, Automatic learning of textual entailments with cross-pair similarities. In Proceedings of COLING-ACL, Sydney, Australia, 2006.

Alessandro Moschitti, Making tree kernels practical for natural language learning. In Proceedings of the Eleventh International Conference on European Association for Computational Linguistics, Trento, Italy, 2006.

Alessandro Moschitti, Daniele Pighin and Roberto Basili. Semantic Role Labeling via Tree Kernel joint inference. In Proceedings of the 10th Conference on Computational Natural Language Learning, New York, USA, 2006.

Alessandro Moschitti, Bonaventura Coppola, Daniele Pighin and Roberto Basili, Semantic Tree Kernels to classify Predicate Argument Structures. In Proceedings of the 17th European Conference on Artificial Intelligence, Riva del Garda, Italy, 2006.

slide-70
SLIDE 70

References

Alessandro Moschitti and Roberto Basili, A Tree Kernel approach to Question and Answer Classification in Question Answering Systems. In Proceedings of the Conference on Language Resources and Evaluation, Genova, Italy, 2006.

Ana-Maria Giuglea and Alessandro Moschitti, Semantic Role Labeling via FrameNet, VerbNet and PropBank. In Proceedings of the Joint 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Sydney, Australia, 2006.

Roberto Basili, Marco Cammisa and Alessandro Moschitti, Effective use of WordNet semantics via kernel-based learning. In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005), Ann Arbor (MI), USA, 2005.

slide-71
SLIDE 71

References

Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola and Roberto Basili. Hierarchical Semantic Role Labeling. In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), Ann Arbor (MI), USA, 2005.

Roberto Basili, Marco Cammisa and Alessandro Moschitti, A Semantic Kernel to classify texts with very few training examples. In Proceedings of the Workshop on Learning in Web Search, at the 22nd International Conference on Machine Learning (ICML 2005), Bonn, Germany, 2005.

Alessandro Moschitti, Bonaventura Coppola, Daniele Pighin and Roberto Basili. Engineering of Syntactic Features for Shallow Semantic Parsing. In Proceedings of the ACL05 Workshop on Feature Engineering for Machine Learning in Natural Language Processing, Ann Arbor (MI), USA, 2005.

slide-72
SLIDE 72

References

Alessandro Moschitti, A study on Convolution Kernel for Shallow Semantic Parsing. In proceedings of ACL-2004, Spain, 2004.

Alessandro Moschitti and Cosmin Adrian Bejan, A Semantic Kernel for Predicate Argument Classification. In proceedings of the CoNLL-2004, Boston, MA, USA, 2004.

M. Collins and N. Duffy, New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In ACL02, 2002.

S.V.N. Vishwanathan and A.J. Smola. Fast kernels on strings and trees. In Proceedings of Neural Information Processing Systems, 2002.
slide-73
SLIDE 73

References

N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines (and other kernel-based learning methods), Cambridge University Press.

Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In proceedings of CoNLL’05.

Sameer Pradhan, Kadri Hacioglu, Valeri Krugler, Wayne Ward, James H. Martin, and Daniel Jurafsky. 2005. Support vector learning for semantic argument classification. To appear in Machine Learning Journal.

slide-74
SLIDE 74

Algorithm

slide-75
SLIDE 75

The Impact of SSTK in Answer Classification

[Figure: F1-measure (y-axis, 64 to 69) against the parameter j (x-axis, 1.5 to 7) for the following kernel combinations on questions (Q) and answers (A):]

Q(BOW)+A(BOW)
Q(BOW)+A(PT,BOW)
Q(PT)+A(PT,BOW)
Q(BOW)+A(BOW,PT,PAS)
Q(BOW)+A(BOW,PT,PAS_N)
Q(PT)+A(PT,BOW,PAS)
Q(BOW)+A(BOW,PAS)
Q(BOW)+A(BOW,PAS_N)

slide-76
SLIDE 76

Mercer’s conditions (1)

slide-77
SLIDE 77

Mercer’s conditions (2)

If the Gram matrix

\[ G = \big( k(x_i, x_j) \big)_{i,j} \]

is positive semi-definite, there is a mapping φ that produces the target kernel function

slide-78
SLIDE 78

The lexical semantic kernel is not always a kernel

It may not be a kernel, so we can use M′·M, where M is the initial similarity matrix and M′ is its transpose
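That M′·M is always a valid Gram matrix can be checked directly, assuming M′ denotes the transpose of M. For any real vector c:

```latex
\[
c^{\top}\left(M^{\top} M\right)c = (Mc)^{\top}(Mc) = \lVert Mc \rVert^{2} \ge 0
\]
```

So M′·M is symmetric and positive semi-definite even when M itself is not.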