Knowledge-, Corpus-, and Web-based Similarity Measures for Semantic - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Introduction Methodology Results Discussion and Further Research

Knowledge-, Corpus-, and Web-based Similarity Measures for Semantic Relations Extraction

Alexander Panchenko

alexander.panchenko@student.uclouvain.be Université catholique de Louvain & Bauman Moscow State Technical University

14 October 2011 / Seminar of CENTAL

Alexander Panchenko 1/42

slide-2
SLIDE 2

Introduction Methodology Results Discussion and Further Research

Plan

1

Introduction

2

Methodology

3

Results

4

Discussion and Further Research

Alexander Panchenko 2/42

slide-3
SLIDE 3

Introduction Methodology Results Discussion and Further Research

Reference Paper

Panchenko A. Comparison of the Knowledge-, Corpus-, and Web-based Similarity Measures for Semantic Relations Extraction // Proceedings of the GEMS 2011 Workshop on Geometrical Models of Natural Language Semantics, EMNLP 2011, pages 11–21, 2011.

Alexander Panchenko 3/42

slide-4
SLIDE 4

Introduction Methodology Results Discussion and Further Research

Semantic Relations

r = ⟨ci, t, cj⟩ – a semantic relation, where ci, cj ∈ C, t ∈ T
C – concepts, e.g. radio or receiver operating characteristic
T – semantic relation types, e.g. hyponymy or synonymy
R ⊆ C × T × C – set of semantic relations

Alexander Panchenko 4/42

slide-6
SLIDE 6

Introduction Methodology Results Discussion and Further Research

Semantic Relations Example: BLESS

Parameters:
200 source concepts Cs
8625 destination concepts Cd
each concept c ∈ Cs ∪ Cd is a single English word
T = { hyper, coord, mero, event, attri, random }
26554 semantic relations R ⊆ Cs × T × Cd

Examples, R:
⟨alligator, coord, snake⟩
⟨freezer, attri, empty⟩
⟨phone, hyper, device⟩
⟨radio, mero, headphone⟩
⟨eagle, random, award⟩

Alexander Panchenko 5/42

slide-7
SLIDE 7

Introduction Methodology Results Discussion and Further Research

BLESS (Baroni & Lenci, 2011)

target concept   relation type   relatum concept
alligator        attri           aggressive
alligator        attri           aquatic
alligator        coord           crocodile
alligator        coord           frog
alligator        hyper           animal
alligator        hyper           beast
alligator        mero            eye
alligator        mero            foot
...              ...             ...
alligator        random          addition
alligator        random          constructive

Alexander Panchenko 6/42

slide-8
SLIDE 8

Introduction Methodology Results Discussion and Further Research

Another Example: Information Retrieval Thesaurus

Figure: A part of the information retrieval thesaurus EuroVoc.

Alexander Panchenko 7/42

slide-9
SLIDE 9

Introduction Methodology Results Discussion and Further Research

Another Example: Information Retrieval Thesaurus

Figure: A part of the information retrieval thesaurus EuroVoc.

R = { ⟨energy-generating product, NT, energy industry⟩, ⟨energy technology, NT, energy industry⟩, ⟨petroleum, RT, fossil fuel⟩, ⟨energy technology, RT, oil technology⟩, ... }

Alexander Panchenko 7/42

slide-15
SLIDE 15

Introduction Methodology Results Discussion and Further Research

Problem

Semantic Relations Extraction Method
Input: lexically expressed concepts C, semantic relation types T
Output: lexico-semantic relations R̂ ∼ R

Solutions:
Pattern-based methods – manually constructed patterns (Hearst, 1992); semi-automatically constructed patterns (Snow et al., 2004); unsupervised pattern learning (Etzioni et al., 2005)
Unsupervised similarity-based methods (Lin, 1998; Sahlgren, 2006)

Research Questions w.r.t. similarity-based methods:
Which similarity measure is the best for relations extraction?
Do various measures capture relations of the same type?

Alexander Panchenko 8/42

slide-20
SLIDE 20

Introduction Methodology Results Discussion and Further Research

Motivation: Automatic Thesaurus Construction

Figure: A technology for automatic thesaurus construction.

Applications:
Query expansion and query suggestion
Navigation and browsing of the corpus
Visualization of the corpus
...

Alexander Panchenko 9/42

slide-25
SLIDE 25

Introduction Methodology Results Discussion and Further Research

The Contributions

Studying 21 corpus-, knowledge-, and web-based measures
Using the BLESS dataset
Analysis of the semantic relation types
Reporting empirical relation distributions
Finding the most and least similar measures

Alexander Panchenko 10/42

slide-29
SLIDE 29

Introduction Methodology Results Discussion and Further Research

Similarity-based Semantic Relations Extraction

Semantic Relations Extraction Algorithm
Input: concepts C, parameters of similarity measure P, threshold k, min. similarity value γ
Output: unlabeled semantic relations R̂

1. S ← sim(C, P)
2. S ← normalize(S)
3. R̂ ← threshold(S, k, γ)
4. return R̂

sim – one of 21 tested similarity measures
normalize – similarity score normalization
threshold – kNN thresholding function:
R̂ = ⋃i=1..|C| { ⟨ci, t, cj⟩ : cj ∈ top k% concepts ∧ sij ≥ γ }
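The sim → normalize → threshold pipeline above can be sketched as follows; the toy context vectors and the cosine `sim` are hypothetical stand-ins for the 21 measures studied in the talk:

```python
import math

def sim_cosine(vi, vj):
    # Cosine similarity between two sparse context vectors (dicts).
    dot = sum(v * vj.get(f, 0.0) for f, v in vi.items())
    ni = math.sqrt(sum(v * v for v in vi.values()))
    nj = math.sqrt(sum(v * v for v in vj.values()))
    return dot / (ni * nj) if ni and nj else 0.0

def extract_relations(vectors, k=0.2, gamma=0.0):
    """Unlabeled relation extraction: for each concept, keep the
    top k% most similar other concepts with similarity >= gamma."""
    concepts = list(vectors)
    relations = set()
    for ci in concepts:
        # Steps 1-2: similarity row (cosine is already in [0, 1] for
        # non-negative vectors, so no extra normalization here).
        scores = {cj: sim_cosine(vectors[ci], vectors[cj])
                  for cj in concepts if cj != ci}
        # Step 3: kNN thresholding with minimum similarity gamma.
        top = sorted(scores, key=scores.get, reverse=True)
        top = top[:max(1, int(k * len(scores)))]
        relations |= {(ci, cj) for cj in top if scores[cj] >= gamma}
    return relations

toy_vectors = {  # hypothetical context vectors
    "radio":  {"signal": 2.0, "music": 1.0},
    "phone":  {"signal": 1.5, "call": 2.0},
    "banana": {"fruit": 3.0},
}
print(extract_relations(toy_vectors, k=0.5, gamma=0.1))
```

The extracted pairs are unlabeled: assigning the type t is outside this step.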

Alexander Panchenko 11/42

slide-35
SLIDE 35

Introduction Methodology Results Discussion and Further Research

Knowledge-based Measures

Description
Data: semantic network (WORDNET 3.0), corpus (SEMCOR).
Variables:
h – the height of the network
len(ci, cj) – length of the shortest path between concepts
P(c) – probability of the concept, estimated from a corpus

Inverted Edge Count: sij = len(ci, cj)⁻¹
Leacock-Chodorow: sij = −log( len(ci, cj) / (2h) )

Alexander Panchenko 12/42

slide-39
SLIDE 39

Introduction Methodology Results Discussion and Further Research

Knowledge-based Measures (8)

Wu-Palmer: sij = 2 · len(croot, lcs(ci, cj)) / ( len(ci, lcs(ci, cj)) + len(cj, lcs(ci, cj)) + 2 · len(croot, lcs(ci, cj)) )
Resnik: sij = −log P(lcs(ci, cj))
Jiang-Conrath: sij = [ 2 · log P(lcs(ci, cj)) − log P(ci) − log P(cj) ]⁻¹
Lin: sij = 2 · log P(lcs(ci, cj)) / ( log P(ci) + log P(cj) )
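The three information-content measures above reduce to arithmetic on concept probabilities once lcs(ci, cj) is known; a sketch with hypothetical probabilities (and lcs(phone, radio) = device assumed):

```python
import math

# Hypothetical concept probabilities P(c), estimated from a corpus.
P = {"entity": 1.0, "device": 0.1, "phone": 0.02, "radio": 0.01}

def resnik(p_lcs):
    return -math.log(p_lcs)

def lin(p_i, p_j, p_lcs):
    return 2.0 * math.log(p_lcs) / (math.log(p_i) + math.log(p_j))

def jiang_conrath(p_i, p_j, p_lcs):
    # Inverse of the Jiang-Conrath distance IC(ci) + IC(cj) - 2*IC(lcs).
    return 1.0 / (2.0 * math.log(p_lcs) - math.log(p_i) - math.log(p_j))

print(resnik(P["device"]))
print(lin(P["phone"], P["radio"], P["device"]))
print(jiang_conrath(P["phone"], P["radio"], P["device"]))
```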

Alexander Panchenko 13/42

slide-45
SLIDE 45

Introduction Methodology Results Discussion and Further Research

Knowledge-based Measures

Description
Data: semantic network (WORDNET 3.0).
Variables:
gloss(c) – definition of the concept
sim(gloss(ci), gloss(cj)) – similarity of concepts’ glosses
fi – context vector of ci, calculated on the corpus of all glosses

Extended Lesk (Banerjee and Pedersen, 2003):
sij = Σc∈Ci Σc′∈Cj sim(gloss(c), gloss(c′)), where Ci = {c : ∃⟨c, t, ci⟩}.
Gloss Vectors (Patwardhan and Pedersen, 2006):
sij = (vi · vj) / (‖vi‖ ‖vj‖), where vi = Σ∀j: cj∈Gi fj and Gi = ∪c∈Ci gloss(c)
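A minimal sketch in the spirit of Extended Lesk: gloss similarity is taken here simply as word overlap, summed over the concepts related to ci and cj. The glosses are invented for illustration, not WordNet's:

```python
# sim(gloss_i, gloss_j): number of shared word types (a crude stand-in
# for the phrase-overlap scoring used by Banerjee and Pedersen).
def gloss_overlap(gloss_i, gloss_j):
    return len(set(gloss_i.lower().split()) & set(gloss_j.lower().split()))

def extended_lesk(related_i, related_j, gloss):
    # Sum pairwise gloss similarities over the related concept sets.
    return sum(gloss_overlap(gloss[a], gloss[b])
               for a in related_i for b in related_j)

gloss = {  # hypothetical definitions
    "phone": "a device used to call and talk over a distance",
    "radio": "a device that receives broadcast signals",
    "device": "an object made for a particular purpose",
}
print(extended_lesk(["phone", "device"], ["radio", "device"], gloss))
```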

Alexander Panchenko 14/42

slide-50
SLIDE 50

Introduction Methodology Results Discussion and Further Research

Corpus-based Measures (4)

Description
Data: corpus (WACYPEDIA (800M), UKWAC (2000M)).
Variables: fi – context vector for ci

Cosine: sij = (fi · fj) / (‖fi‖ ‖fj‖)
Jaccard: sij = ‖min(fi, fj)‖₁ / ‖max(fi, fj)‖₁
Euclidean: sij = ‖fi − fj‖₂

Alexander Panchenko 15/42

slide-51
SLIDE 51

Introduction Methodology Results Discussion and Further Research

Corpus-based Measures (4)

Manhattan: sij = ‖fi − fj‖₁
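The four vector measures above can be sketched on dense toy vectors (note that Euclidean and Manhattan are distances, so smaller means more similar):

```python
import math

def cosine(fi, fj):
    dot = sum(a * b for a, b in zip(fi, fj))
    return dot / (math.sqrt(sum(a * a for a in fi)) *
                  math.sqrt(sum(b * b for b in fj)))

def jaccard(fi, fj):
    # Generalized Jaccard for non-negative weights: ||min||_1 / ||max||_1.
    return (sum(min(a, b) for a, b in zip(fi, fj)) /
            sum(max(a, b) for a, b in zip(fi, fj)))

def euclidean(fi, fj):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fi, fj)))

def manhattan(fi, fj):
    return sum(abs(a - b) for a, b in zip(fi, fj))

fi, fj = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]  # hypothetical context vectors
print(cosine(fi, fj), jaccard(fi, fj), euclidean(fi, fj), manhattan(fi, fj))
```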

Alexander Panchenko 16/42

slide-52
SLIDE 52

Introduction Methodology Results Discussion and Further Research

Corpus-based Measures – Context Vector f

Bag-of-words Distributional Analysis (BDA)

Example context (window marked with ||): "|| La proposition de loi se réfère aux candidats || au permis de conduire et ne contient aucune disposition relative à des véhicules." ("The bill refers to candidates for the driving licence and contains no provision relating to vehicles.")
Feature: lemma, e.g. "candidat"
Feature Normalization (PMI): f(wordi, featurej) = fij = log [ P(word, feature) / (P(word) P(feature)) ]

Parameters:
window size (1-20)
window type (exact, floating)
using stopwords
number of features (5,000-300,000)
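The PMI normalization above can be sketched from raw co-occurrence counts; the word-feature counts below are invented for illustration:

```python
import math

# Hypothetical co-occurrence counts n[word][feature] from corpus windows.
n = {"candidate": {"licence": 8, "law": 2},
     "vehicle":   {"licence": 2, "road": 8}}

total = sum(c for feats in n.values() for c in feats.values())

def pmi(word, feature):
    # f_ij = log [ P(word, feature) / (P(word) P(feature)) ]
    p_wf = n[word].get(feature, 0) / total
    p_w = sum(n[word].values()) / total
    p_f = sum(feats.get(feature, 0) for feats in n.values()) / total
    return math.log(p_wf / (p_w * p_f)) if p_wf else float("-inf")

print(pmi("candidate", "licence"))
```

In practice unseen pairs are usually dropped or clipped to zero (positive PMI) rather than kept at -inf.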

Alexander Panchenko 17/42

slide-53
SLIDE 53

Introduction Methodology Results Discussion and Further Research

Corpus-based Measures – Context Vector f

Syntactic Distributional Analysis (SDA)
Feature: lemma#dependency, e.g. contenir#SUBJ
Feature Normalization (PMI)
Parameters:
types of dependencies (6, 9, or 21), e.g. SUBJ, OBJ, NMOD, VMOD
window type (exact, floating)
using stopwords
number of features (5,000-300,000)

Alexander Panchenko 18/42

slide-59
SLIDE 59

Introduction Methodology Results Discussion and Further Research

Web-based Measures (9)

Description Data: number of the hits returned by an IR system (GOOGLE, YAHOO, YAHOO BOSS, FACTIVA). Variables:

hi – number of hits returned by query "ci" hij – number of hits returned by the query "ci AND cj"

Normalized Google Distance (Cilibrasi and Vitanyi, 2007):
sij = [ max(log(hi), log(hj)) − log(hij) ] / [ log(M) − min(log(hi), log(hj)) ], where M is the number of indexed documents
PMI-IR (Turney, 2001):
sij = log [ P(ci, cj) / (P(ci) P(cj)) ], with the probabilities estimated from the hit counts hi, hj, hij
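Both web-based scores reduce to arithmetic on three hit counts; a sketch with invented hit counts and an assumed index size M (here P(c) is estimated as h/M):

```python
import math

M = 10_000_000  # assumed total number of indexed documents

def ngd(h_i, h_j, h_ij):
    # Normalized Google Distance: near 0 for terms that always
    # co-occur, larger for less related terms.
    num = max(math.log(h_i), math.log(h_j)) - math.log(h_ij)
    den = math.log(M) - min(math.log(h_i), math.log(h_j))
    return num / den

def pmi_ir(h_i, h_j, h_ij):
    # PMI with P(ci) = h_i / M and P(ci, cj) = h_ij / M.
    return math.log((h_ij * M) / (h_i * h_j))

# Hypothetical hit counts for "radio", "receiver", "radio AND receiver".
print(ngd(500_000, 300_000, 60_000))
print(pmi_ir(500_000, 300_000, 60_000))
```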

Alexander Panchenko 19/42

slide-60
SLIDE 60

Introduction Methodology Results Discussion and Further Research

"Theoretical" Classification of the Similarity Measures

Alexander Panchenko 20/42

slide-61
SLIDE 61

Introduction Methodology Results Discussion and Further Research

General Performance

Evaluation Protocol
Precision = |R ∩ R̂| / |R̂|, Recall = |R ∩ R̂| / |R|, F1 = 2 · Precision · Recall / (Precision + Recall)

R – all relations from BLESS, except random
R̂ – extracted relations
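This protocol can be sketched directly on sets of relation pairs; the gold and extracted sets below are hypothetical, not BLESS:

```python
def prf1(gold, extracted):
    """Precision, recall and F1 over relation sets.
    gold: correct relations (random pairs already excluded);
    extracted: relations returned by a similarity measure."""
    tp = len(gold & extracted)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

gold = {("phone", "device"), ("radio", "device"), ("dog", "animal")}
extracted = {("phone", "device"), ("radio", "device"), ("phone", "banana")}
print(prf1(gold, extracted))
```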

Alexander Panchenko 21/42

slide-63
SLIDE 63

Introduction Methodology Results Discussion and Further Research

General Performance: Scores

@ Precision = 0.80

Alexander Panchenko 22/42

slide-64
SLIDE 64

Introduction Methodology Results Discussion and Further Research

General Performance: Scores

High Recall (@Precision=0.80) vs. High Precision (@k=10-20%)

Alexander Panchenko 23/42

slide-68
SLIDE 68

Introduction Methodology Results Discussion and Further Research

General Performance: Learning Curve of the BDA-Cos

ΔF1 (1M → 10M) ≈ 0.44
ΔF1 (10M → 100M) ≈ 0.16
ΔF1 (100M → 1000M) ≈ 0.03

Alexander Panchenko 24/42

slide-70
SLIDE 70

Introduction Methodology Results Discussion and Further Research

Number of Dimensions (Features)

Features are the K most frequent dimensions. Which K to choose?

[Figure: Precision @ k=20% as a function of the number of dimensions (bag-of-words or syntactic features) for BDA-exact3-Cos, BDA-float3-Cos, BDA-sent-Cos, SDA-6-Cos, and SDA-21-Cos; the best value (≈0.954) is reached around 100,000 dimensions.]

Alexander Panchenko 25/42

slide-71
SLIDE 71

Introduction Methodology Results Discussion and Further Research

Example of the Extracted Relations (BDA-sent-Cos)

Alexander Panchenko 26/42

slide-73
SLIDE 73

Introduction Methodology Results Discussion and Further Research

Comparing Relation Distributions

Evaluation Protocol
Percent(t) = |R̂t| / |R ∩ R̂| · 100
R̂t – set of extracted relations of type t, so Σt∈T |R̂t| = |R ∩ R̂|

Issue: high sensitivity of the Percent to k.

Alexander Panchenko 27/42

slide-74
SLIDE 74

Introduction Methodology Results Discussion and Further Research

Comparing Relation Distributions: Scores @ k = 10%

Alexander Panchenko 28/42

slide-75
SLIDE 75

Introduction Methodology Results Discussion and Further Research

Comparing Relation Distributions: Scores @ k = 40%

Alexander Panchenko 29/42

slide-77
SLIDE 77

Introduction Methodology Results Discussion and Further Research

Comparing Relation Distributions: Scores

Similarity to the BLESS:
Random measure: χ² = 5.36, p = 0.252
21 measures: χ² = 89.94-4000, p < 0.001

Independence of the Relation Distributions:
21 measures: χ² = 10487, df = 80, p < 0.001
knowledge-based measures: χ² = 2529, df = 28, p < 0.001
corpus-based measures: χ² = 245, df = 12, p < 0.001
web-based measures: χ² = 3158, df = 32, p < 0.001

Alexander Panchenko 30/42

slide-78
SLIDE 78

Introduction Methodology Results Discussion and Further Research

Relation Distributions: Distribution of the Scores

Alexander Panchenko 31/42

slide-79
SLIDE 79

Introduction Methodology Results Discussion and Further Research

Relation Distributions: Most Similar Measures

Measures Dissimilarity
Calculate the distance xij between measures simi and simj:
xij = xji = Σt∈T ( |R̂ti| − |R̂tj| )² / |R̂tj|
R̂ti – correctly extracted relations of type t with measure simi
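The χ²-style distance above can be sketched from per-type counts of correctly extracted relations; the counts below are invented, and (as on the slide) the denominator uses the second measure's counts:

```python
# counts_i[t] = number of correctly extracted relations of type t
# for one measure (hypothetical values).
def measure_distance(counts_i, counts_j):
    return sum((counts_i[t] - counts_j[t]) ** 2 / counts_j[t]
               for t in counts_j)

counts = {
    "Resnik":  {"hyper": 900, "coord": 2500, "mero": 600},
    "BDA-Cos": {"hyper": 500, "coord": 3100, "mero": 400},
}
d = measure_distance(counts["Resnik"], counts["BDA-Cos"])
print(d)
```

Thresholding the resulting 21 × 21 matrix then gives the graph visualized with the Fruchterman-Reingold layout.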

Alexander Panchenko 32/42

slide-80
SLIDE 80

Introduction Methodology Results Discussion and Further Research

Relation Distributions: Most Similar Measures

Alexander Panchenko 33/42

slide-81
SLIDE 81

Introduction Methodology Results Discussion and Further Research

Relation Distributions: Most Similar Measures

Threshold the 21 × 21 matrix X: if xij < 220 then xij = 0 Visualize the distances with the Fruchterman-Reingold (1991) graph layout

Alexander Panchenko 34/42

slide-82
SLIDE 82

Introduction Methodology Results Discussion and Further Research

Conclusion:

General Performance
Best knowledge-based measure – Resnik (WORDNET)
Best corpus-based measures, and best overall – BDA-exact3/5-Cos and SDA-21-Cos (UKWAC)
Best web-based measure – NGD-YAHOO
The best measures clearly separate correct and random relations

Relation Distributions
All measures extract many co-hyponyms
The measures were grouped according to the similarity of their relation distributions
The measures provide complementary results

Alexander Panchenko 35/42

slide-84
SLIDE 84

Introduction Methodology Results Discussion and Further Research

Further Research:

Evaluation
Correlation r with human judgments
Using a gold standard with synonyms
Using a gold standard with MWEs – thesauri, ontologies
An application-based evaluation – query expansion

Methods
A combined similarity measure – linear combination, logistic regression, committees, feature/similarity tensor, ...
More measures – LSA, LDA, surface-based similarity, syntactic tree kernels, definition-based measures, SimRank, ...
Working with MWEs
Classifying relations by type: hyponymy, synonymy, etc.

Alexander Panchenko 36/42

slide-87
SLIDE 87

Introduction Methodology Results Discussion and Further Research

Further Research: Evaluation

Correlation r with human judgments:
WordSim353 (Finkelstein, 2002) – 353 pairs
Miller & Charles (1991) – 30 pairs
Rubenstein & Goodenough (1965) – 65 pairs

word1        word2      x (human)  y (sim)
tiger        cat        7.35       0.85
book         paper      7.46       0.95
computer     keyboard   7.62       0.81
...          ...        ...        ...
possibility  girl       1.94       0.25
sugar        approach   0.88       0.05

Pearson correlation: r = covxy / (sx sy)
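The Pearson correlation above is a few lines of arithmetic; the judgment and similarity values below are taken from the example table:

```python
import math

def pearson(xs, ys):
    # r = cov(x, y) / (s_x * s_y), using population moments.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

human = [7.35, 7.46, 7.62, 1.94, 0.88]   # x: human judgments
scores = [0.85, 0.95, 0.81, 0.25, 0.05]  # y: similarity measure
print(pearson(human, scores))
```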

Alexander Panchenko 37/42

slide-90
SLIDE 90

Introduction Methodology Results Discussion and Further Research

Further Research: Evaluation

A gold standard with synonyms – the Semantic Neighbors dataset
Sources of synonyms:
WordNet synsets – 66,662
Roget’s paragraphs and semicolon groups – 142,366
Free DB of synonyms – 133,292

Merging
Filtering: single noun ∧ many relations ∧ little ambiguity
Add Randoms: ∀ ⟨wordi, syn, wordj⟩ ⇒ add ⟨wordi, random, wordr⟩

Result:
463 "target" concepts
5913 "relatum" concepts
7341 semantic neighbors – synonyms, associations, etc.
14682 relations – to verify manually

Alexander Panchenko 38/42

slide-91
SLIDE 91

Introduction Methodology Results Discussion and Further Research

Further Research: Evaluation

A golden standard with synonyms – Semantic Neighbors dataset

target word  relatum word  relation type
judge        adjudicate    syn
judge        arbitrate     syn
judge        asessor       syn
judge        chancellor    syn
judge        decide        syn
judge        gendarmerie   syn
judge        sheriff       syn
...          ...           ...
judge        pc            random
judge        fare          random
judge        lemon         random

Alexander Panchenko 39/42

slide-92
SLIDE 92

Introduction Methodology Results Discussion and Further Research

Further Research: Evaluation

Evaluation of the SDA-6-Cos with the Semantic Neighbors: Precision@k=10% = 0.9736, Precision@k=20% = 0.9384

target word  relatum word  relation type  sim
aficionado   enthusiast    syn            0.07197
aficionado   fan           syn            0.05195
aficionado   admirer       syn            0.01964
aficionado   addict        syn            0.01326
aficionado   devotee       syn            0.01163
aficionado   foundling     random         0.00777
aficionado   fanatic       syn            0.00414
aficionado   adherent      syn            0.00353
aficionado   capital       random         0.00232
aficionado   statute       random         0.00029
aficionado   blot          random         0.00025
aficionado   meddler       random         0.00005
aficionado   enlargement   random         0.00003
aficionado   bawdyhouse    random         0.00000

Alexander Panchenko 40/42

slide-93
SLIDE 93

Introduction Methodology Results Discussion and Further Research

Further Research: Evaluation

Evaluation of the SDA-6-Cos with the Semantic Neighbors:

Alexander Panchenko 41/42

slide-94
SLIDE 94

Introduction Methodology Results Discussion and Further Research

Questions

Thank you! Questions?

Alexander Panchenko 42/42