Making Sense of Word Sense
24 February 2011
Deutsche Gesellschaft für Sprachwissenschaft (DGfS), Göttingen
Rebecca J. Passonneau, Nancy Ide (Vassar College), Vikas Bhardwaj, Ansaf Salleb-Aouissi
Outline
- The word sense conundrum
- The MASC Project
- WordNet and sense annotation
- MASC annotation rounds
- Round 2: Multiple trained annotators
- Interannotator agreement and beyond
- Round 2: Mechanical turkers
- Machine learning from labels versus features
- Conclusion
Word sense conundrum
- Adam Kilgarriff, 2003, "I don't believe in word senses"
  – Word senses are abstractions from corpus clusters
  – "Corpus citations . . . are the basic objects in the ontology"
- James Pustejovsky, 1991, The Generative Lexicon
  – No fixed set of conceptual primitives
  – A fixed number of generative devices
  – Lexical semantics is an interface between commonsense knowledge and linguistic form
Zipf’s Law
An epiphenomenon of . . .
- Words (types or tokens)
- Senses
- Many other phenomena (Newman, M. E. J., 2005): city population, books sold, net worth, . . .
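The rank-frequency relation behind Zipf's law is easy to inspect on any word list: sort frequencies in descending rank order and the head dominates. A minimal stdlib sketch (the toy sentence is purely illustrative):

```python
from collections import Counter

def rank_frequencies(text):
    """Word frequencies sorted by rank (most frequent first)."""
    counts = Counter(text.lower().split())
    return sorted(counts.values(), reverse=True)

# Even in a tiny sample the head of the distribution dominates; on a
# real corpus, plotting log(frequency) against log(rank) gives the
# familiar near-linear Zipf curve.
freqs = rank_frequencies("the cat sat on the mat and the dog sat by the door")
print(freqs)  # [4, 2, 1, 1, 1, 1, 1, 1, 1]
```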
Granularity
Concepts versus comparisons of experience
- Infinite divisibility of reality: how fine-grained should a cluster be?
  – WordNet senses for primitive, Adj:
    1. Belonging to an early stage of development
    2. Characteristic of an ancestral type
    3. Preliterate or non-industrial societies
    4. Created by one without formal training
- Shared experience: the basis of social reality, ways of verbalizing social reality
  – Senses 1–3: anthropology; sense 4: art history
Corpus‐based Sense Classes
- Ontological questions: deferred
– How are clusters used as basic ontological objects?
– How is commonsense knowledge represented?
- Identify same/different contexts, within limits:
  – Same? ". . . a primitive granite boar, carved in prehistoric times"; ". . . has a primitive Easter-island look"
  – Different? "Bin Laden's training camps were primitive"; ". . . one or more of the primitive gluing or ungluing operations"
American National Corpus
- 100 Million Words
- Completely unrestricted
- Post 1990 American English
- Many genres
MASC: Manually Annotated Sub‐Corpus
Participants
- Nancy Ide (PI, NSF CRI; Vassar College)
- Collin Baker (ICSI; FrameNet)
- Christiane Fellbaum (Princeton; WordNet)
- Rebecca J. Passonneau (Columbia Univ.)

Size: 500,000 words
- Manually validated automatic annotations
- Manual annotations

Selected annotations
- Token, sentence, lemma (validated)
- Named entities (validated)
- WordNet (manual: 1.5-million-word sentence corpus)
- FrameNet (manual: 150K words)

http://www.anc.org/MASC/
MASC Corpus
- Three releases
  – MASC I: 82K words, released May 2010
  – MASC I–II: 142K words, release date March 2011
  – MASC I–III: 500K words, release date July 2011
- Fourteen types of annotation
  – Manually validated automatic, e.g., NP chunks
  – Manual, e.g., word sense
- Twenty genres, evenly balanced
- Freely available from the MASC website, and from:
  – LDC
  – NLTK
Genre
 #   Genre                 Words   % of corpus
 1   Court transcript      20817        4
 2   Debate transcript     32325        6
 3   Email                 20470        4
 4   Essay                 25590        5
 5   Fiction               25681        5
 6   Gov't documents       24605        5
 7   Journal               25635        5
 8   Letters               24750        5
 9   Newspaper/newswire    17951        4
10   Non-fiction           25182        5
11   Spoken                25783        5
12   Technical             25426        5
13   Travel guides         26708        5
14   Twitter               24180        5
15   Blog                 *25000        5
16   Ficlets              *25000        5
17   Movie script          28240        6
18   Poetry               *25000        5
19   Spam                 *25000        5
20   Jokes                *25000        5
     Total               498343

*Starred figures (bold in the original slide) indicate that the texts have not yet been chosen.
Word Sense Annotation Goals
- Freely available word sense corpus
- Harmonize WordNet and FrameNet
- Investigate moderately polysemous words (avg. = 7 senses)
- Large sentence-based corpus
  – 100 words, balanced for part of speech
  – 1000 sentences per word
  – Avg. sentence length in MASC I > 20 words
  – 2-million-word corpus, representing 700 senses
- Provide measures of interannotator agreement
  – Chance-corrected coefficients
  – Krippendorff's Alpha
WordNet Sense Information
- SENSEID: a unique identifier
- SYNSET: a list of synonymous senses (SENSEIDs)
- DEFINITION: a phrase
- EXAMPLES: a list of glosses
- FREQUENCY COUNT: an integer
- Nouns also have domain, etc.; verbs have verb group, etc.; adjectives have attributes, etc.
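The fields above map naturally onto a record type. This is an illustrative sketch of the data layout just described, not the actual WordNet file format or API; the field names follow the slide, and the sample values are taken from sense 1 of time-N discussed later in the talk:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sense:
    """Illustrative WordNet-style sense record; fields follow the slide."""
    sense_id: str                     # SENSEID: a unique identifier
    synset: List[str]                 # SYNSET: sense IDs of synonymous senses
    definition: str                   # DEFINITION: a phrase
    examples: List[str] = field(default_factory=list)  # EXAMPLES: glosses
    frequency_count: int = 0          # FREQUENCY COUNT: integer

time1 = Sense(
    sense_id="time1",
    synset=["time1", "clip2"],
    definition="an instance or single occasion for some event",
    examples=["this time he succeeded", "he called four times"],
)
print(time1.definition)
```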
WordNet Senses: time (noun)
8 WordNet senses used:
1. (time1, clip2) (an instance or single occasion for some event) "this time he succeeded"; "he called four times"; "he could do ten at a clip"
2. (a period of time considered as a resource under your control and sufficient to accomplish something) "take time to smell the roses"; "I didn't have time to finish"; "it took more than half my time"
3. (an indefinite period (usually marked by specific attributes or activities)) "he waited a long time"; "the time of year for planting"; "he was a great actor in his time"
4. (a suitable moment) "it is time to go"
5. (the continuum of experience in which events pass from the future through the present to the past)
6. (a person's experience on a particular occasion) "he had a time holding back the tears"; "they had a good time together"
7. (time7, clock_time1) (a reading of a point in time as given by a clock) "do you know what time it is?"; "the time is 10 o'clock"
8. (time8, fourth_dimension1) (the fourth coordinate that is required (along with three spatial dimensions) to specify a physical event)

2 WordNet senses not used:
9. (time9, meter4, metre3) (rhythm as given by division into parts of equal duration)
10. (time10, prison_term1, sentence3) (the period of time a prisoner is imprisoned) "he served a prison term of 15 months"; "his sentence was 5 to 10 years"; "he is doing time in the county jail"
Senses of time‐N
Sense   Num   Definition
1       171   An instance or single occasion for an event
2       131   A period of time . . . sufficient to accomplish something
3       427   An indefinite period
4        59   A suitable moment
5        34   The continuum of experience . . . the future . . .
6        19   A person's experience on a particular occasion
7        38   A reading of a point in time as given by a clock
8        47   The fourth coordinate . . . to specify an event
Total   926
Senses of time‐N: Examples
- Sense 1 (an instance or single occasion for an event): "When the bride and groom came together for the first time . . ."
- Sense 4 (a suitable moment): "A time for a youngster to enjoy the fun and benefits of camp . . ."
- Sense 5 (the continuum of experience): "Turn back the hands of time and remember when you . . ."
MASC Sense Annotation Rounds
- 1050 sentences per lemma-POS
  – N up to 1050 from MASC
  – If N < 1050, balance from OANC
- Pre-annotation sample: sense inventory revision
  – Random selection of 50 instances
  – WordNet 3.0 sense annotation
  – Multiple annotators
    - Typically 2–3; 6 in Round 2
    - Same core group of Vassar and Columbia undergrads
    - Highly trained
  – Sense revision (to be added to WordNet 3.1)
- Core annotation
  – 100-sentence subsample
    - FrameNet annotation
    - Multiple annotators (typically 2–3) for interannotator agreement
  – 900 sentences, one annotator per sentence (not always the same annotator)
Annotation Tool
- Loads WordNet 3.0
  – Sense number
  – Definition
  – Glosses
  – Synset
- Other labels
  – Collocation
  – Wrong POS
  – No sense applies
  – Not enough context
- Subversion
- Comment field used during pre-annotation sample
Multiple Annotators
Poesio and Artstein, "Reliability of Anaphoric Annotation Reconsidered," 2005
- Anaphora
  – Lack of a unique interpretation in the context of occurrence, attributed to a problematic annotation scheme, to be fixed by less specific representations (cf. word senses: Buitelaar, 1998; Palmer et al., 2005)
  – This applies to polysemy, but not to many other cases, e.g., anaphora
- Word sense
  – Need for revision of the sense inventory
  – Need for underspecification: annotators disagree unsystematically
  – Need to account for differences in interpretation: annotators disagree systematically
Interannotator Agreement
Instances i, annotators j, annotation values k
- Percent agreement: proportion of i where all j pick the same k
  – Does not generalize well to multiple annotators
  – Does not take probability into account
  – Sensitive to data skew
  – Primarily a measure of coverage: what proportion of instances have unanimity?
- Agreement coefficient (Krippendorff's Alpha): proportion of agreement above that predicted by chance
  – Same interpretation independent of data skew
  – Handles multiple annotators
  – Primarily a measure of variance
  – Does not indicate coverage
Example Agreement Matrix
Instance:  1  2  3  4  5  6  7  8  9  10
Ann1:      a  b  b  b  c  a  c  c  b  a
Ann2:      a  c  b  a  c  c  c  b  b  b
Ann3:      a  a  b  a  c  b  c  b  b  c

- Percent agreement
  – 0.50: 5 of 10 columns are unanimous
  – 0.63: 19 of 30 cells match at least one other annotator's label for the same instance
- Agreement coefficients
  – Krippendorff's Alpha: 0.301
    p(a) = 8/30 = 0.27
  – Cohen's Kappa: 0.278
    p_Ann1(a) = p_Ann3(a) = 3/10 = 0.30; p_Ann2(a) = 2/10 = 0.20
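Both measures can be computed directly from the matrix above. This is a minimal sketch: the alpha below uses the standard nominal-data formula with Krippendorff's finite-sample correction, and variant definitions (e.g., averaged pairwise coefficients) yield somewhat different values:

```python
from collections import Counter

# The example matrix: 3 annotators x 10 instances.
labels = [
    list("abbbcaccba"),  # Ann1
    list("acbacccbbb"),  # Ann2
    list("aabacbcbbc"),  # Ann3
]
columns = list(zip(*labels))  # one tuple of labels per instance
m = len(labels)               # annotators per instance

# Percent agreement as coverage: fraction of unanimous instances.
unanimous = sum(len(set(col)) == 1 for col in columns) / len(columns)

# Krippendorff's alpha (nominal data): 1 - D_observed / D_expected.
totals = Counter(v for col in columns for v in col)
n_total = sum(totals.values())

# Observed disagreement: fraction of disagreeing ordered pairs per instance.
d_o = sum(
    sum(1 for x in col for y in col if x != y) / (m * (m - 1))
    for col in columns
) / len(columns)

# Expected disagreement from the pooled label counts (finite-sample form).
d_e = 1 - sum(c * (c - 1) for c in totals.values()) / (n_total * (n_total - 1))

alpha = 1 - d_o / d_e
print(f"unanimous: {unanimous:.2f}, alpha: {alpha:.3f}")  # unanimous: 0.50
```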
Alpha Scores on Round 2, Trained
Lemma   POS    WN Senses   Senses Used   Alpha   Outliers   Alpha′
long    Adj        9            4         0.67       1       0.80
fair    Adj       10            6         0.54       2       0.63
quiet   Adj        6            7         0.49               0.49
time    Noun       7            7         0.68               0.68
work    Noun      10            8         0.62               0.62
land    Noun      11            9         0.49       1       0.54
tell    Verb      12           10         0.46               0.46
say     Verb       8            8         0.38       2       0.52
show    Verb      11           10         0.46       1       0.48
know    Verb      11           10         0.38       2       0.63
Interpreting Interannotator Agreement
- Krippendorff:
  – ≥ 0.67 supports tentative conclusions
  – ≥ 0.80 good reliability
- Landis & Koch
  – 0.21–0.40 fair
  – 0.41–0.60 moderate
  – 0.61–0.80 substantial
- Poesio & Artstein
  – No single threshold applicable for all purposes
  – 0.70 for many NLP annotations
- Passonneau
  – Paradigmatic reliability analysis
  – E.g., significance tests based on uses of different annotators' labels
Observed Variation in Alpha
- Part‐of‐Speech effect
  – Adjectives and nouns have higher Alpha than verbs
  – Only a partial explanation of the variation
- Within each part of speech, Alpha varies
- Weak inverse correlation of Alpha with number of senses available (r = −0.38)
- Modest inverse correlation of Alpha with number of senses used (r = −0.56)
Anveshan: Annotation Variance Estimation
- For use with data from multiple annotators
  – Identify outliers among annotators
  – Find subsets of annotators with similar behavior (systematic disagreement)
  – Identify confusable senses
- Variation can occur due to differences among
  – Annotators (expertise)
  – Items (difficulty)
  – Label sets (number and similarity of labels)
- Uses Kullback–Leibler divergence and Jensen–Shannon divergence
Anveshan Basics
- For each annotator a_i and sense s_j, compute the annotator's sense distribution:

    P_{a_i}(S = s_j) = count(s_j, a_i) / Σ_k count(s_k, a_i)

- For each annotator a_i, compute the average of all other annotators' distributions:

    P̄_{a_i}(S = s_j) = (1 / (n − 1)) Σ_{m ≠ i} P_{a_m}(S = s_j)

- Compute the leverage of each a_i as Lev(P_{a_i}, P̄_{a_i}), where

    Lev(P, Q) = Σ_k |P(k) − Q(k)|
Anveshan Basics, Continued
- Compute the Kullback–Leibler divergence of each annotator's sense distribution from the average of all other annotators' sense distributions:

    KLD(P, Q) = Σ_i P(i) log ( P(i) / Q(i) )

- Compute the Jensen–Shannon divergence between each pair of annotators' sense distributions:

    JSD(P, Q) = ½ KLD(P, M) + ½ KLD(Q, M), where M = (P + Q) / 2
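The quantities above take only a few lines of stdlib Python. The two annotators' label lists here are hypothetical, and natural log is used (the base only rescales the divergences):

```python
import math
from collections import Counter

def sense_distribution(labels, senses):
    """Normalized distribution over the sense inventory for one annotator."""
    counts = Counter(labels)
    return [counts[s] / len(labels) for s in senses]

def leverage(p, q):
    # Lev(P, Q) = sum_k |P(k) - Q(k)|  (L1 distance between distributions)
    return sum(abs(pk - qk) for pk, qk in zip(p, q))

def kld(p, q):
    # KLD(P, Q) = sum_i P(i) log(P(i)/Q(i)); terms with P(i) = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # JSD(P, Q) = 1/2 KLD(P, M) + 1/2 KLD(Q, M), with M = (P + Q)/2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

# Hypothetical labels over a 3-sense inventory for two annotators.
senses = ["s1", "s2", "s3"]
ann1 = ["s1"] * 6 + ["s2"] * 3 + ["s3"] * 1
ann2 = ["s1"] * 3 + ["s2"] * 6 + ["s3"] * 1
p, q = sense_distribution(ann1, senses), sense_distribution(ann2, senses)
print(round(leverage(p, q), 3))  # 0.6
print(round(jsd(p, q), 3))       # 0.051
```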
Outliers
KLD for long‐Adj
Systematic Disagreements
JSD and Alpha for show‐Verb
Confusability of Senses
Sense distributions for say‐Verb
[Chart: distributions over senses WN1–WN9 and WN11 for annotators A101, A103, and overall]
Mechanical Turkers
- Two adjectives
- 150 sentences per adjective
  – 15 HITs, 10 sentences per HIT
  – Work rejected for turkers who did not complete all 15 HITs
  – 13 turkers
  – long (9 WN senses, all used): Alpha = 0.15
  – fair (10 WN senses, all used): Alpha = 0.25
fair (10 WN senses):
  wn1  wn2  wn3  wn4  wn5  wn6  wn7  wn8  wn9  wn10  Other  Total
  891  437   52   60   63   21   21  135   10    14    236   1950

long (9 WN senses):
  wn1  wn2  wn3  wn4  wn5  wn6  wn7  wn8  wn9  Other  Total
  659  458  115  110  160  156   66   56   64    115   1950

(1950 labels per word = 150 sentences × 13 turkers)
Comparison of Two Learning Paradigms
- Ground truth labels from author
- Unsupervised learning from multilabels
  – Maximum likelihood estimates obtained using EM
- Supervised learning from features
  – Features:
    - Word and sentence length features
    - Tf·idf
    - Named entities
    - DAL features (Dictionary of Affect in Language)
  – SVMLight
  – 4-fold cross-validation
  – Best C values
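One of the listed features, tf·idf, is easy to sketch. This is an illustrative stdlib implementation of the standard weighting scheme, not the feature extractor the authors used (their tokenization and normalization choices are not specified here):

```python
import math
from collections import Counter

def tfidf(docs):
    """tf*idf vectors for a list of tokenized documents (illustrative sketch)."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # term frequency (normalized by doc length) x inverse doc frequency
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

docs = [
    "the long road home".split(),
    "a long long time".split(),
    "the time is now".split(),
]
vecs = tfidf(docs)
# "long" occurs in 2 of 3 documents, so its idf is log(3/2) > 0;
# a term occurring in every document would get idf 0.
print(round(vecs[1]["long"], 3))  # 0.203
```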
Machine Learning from MultiLabels
- GLAD (Whitehill et al., 2009, "Whose vote should count more?")
- Graphical model
- Hidden variables
  – True labels (Z)
  – Labeler accuracy (α)
  – Image difficulty (β)
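In GLAD's model, annotator i labels item j correctly with probability σ(α_i·β_j), where α is annotator accuracy and 1/β is item difficulty; given estimates of α and β, the posterior over a binary true label follows by Bayes' rule. The sketch below treats α and β as known, with made-up values, whereas GLAD itself estimates them with EM:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def posterior_true_label(labels, alphas, beta, prior=0.5):
    """P(z = 1 | labels) for one item, GLAD-style: each annotator is
    correct with probability sigmoid(alpha_i * beta)."""
    p1, p0 = prior, 1.0 - prior
    for label, alpha in zip(labels, alphas):
        p_correct = sigmoid(alpha * beta)
        p1 *= p_correct if label == 1 else 1.0 - p_correct
        p0 *= p_correct if label == 0 else 1.0 - p_correct
    return p1 / (p1 + p0)

# Three annotators vote 1, 1, 0; the dissenter has low accuracy (alpha=0.2),
# so the posterior sides strongly with the two accurate annotators.
post = posterior_true_label([1, 1, 0], alphas=[2.0, 1.5, 0.2], beta=1.0)
print(round(post, 3))  # 0.964
```

This is why GLAD can outperform majority voting: a dissenting vote from a low-accuracy annotator barely moves the posterior, while the same vote from a high-accuracy annotator would.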
Learning Results, Fair
        Sense        Ann    Recall   Precision   F measure   Accuracy
GLAD    fair-j, WN1  MASC    0.92      0.94        0.93        0.93
GLAD    fair-j, WN1  AMT     1.00      0.71        0.85        0.79
GLAD    fair-j, WN1  Both    1.00      0.74        0.87        0.82
SVM     fair-j, WN1  NA      1.00      0.65        0.82        0.72
GLAD    fair-j, WN2  MASC    0.69      0.48        0.59        0.83
GLAD    fair-j, WN2  AMT     0.81      0.93        0.87        0.96
GLAD    fair-j, WN2  Both    0.81      0.93        0.87        0.96
SVM     fair-j, WN2  NA      0.60      0.33        0.46        0.68
Learning Results, Long
        Sense        Ann    Recall   Precision   F measure   Accuracy
GLAD    long-j, WN1  MASC    0.88      0.84        0.86        0.84
GLAD    long-j, WN1  AMT     1.00      0.98        0.99        0.99
GLAD    long-j, WN1  Both    1.00      0.98        0.99        0.99
SVM     long-j, WN1          1.00      0.61        0.80        0.63
GLAD    long-j, WN2  MASC    0.74      0.80        0.77        0.83
GLAD    long-j, WN2  AMT     0.79      0.94        0.86        0.90
GLAD    long-j, WN2  Both    0.95      0.97        0.96        0.97
SVM     long-j, WN2          0.85      0.83        0.84        0.66
Summary of Learning Results
- GLAD
  – Performs better on 13 turkers than on 6 trained annotators, apart from fair-adj, WN1; why?
  – Combining trained and untrained labels:
    - Improvement for long-adj, WN2
    - Degradation for fair-adj, WN1
    - No improvement for long-adj, WN1 or fair-adj, WN2
  – No consistent pattern of results
  – No apparent correlation of instance difficulty with features, or of annotator expertise with interannotator agreement
- SVM
  – Performance not quite as good as GLAD
Resnik and Yarowsky Proposal
1999, Distinguishing Systems and Distinguishing Senses
- Use cross-entropy, KLD, or a related measure to compare a system's probability distribution over senses, P_S(cs_i | w_i, context_i), against the gold senses

Senses of interest       System 1   System 2   System 3   System 4
monetary                   0.47       0.85       0.28       1.00
stake or share             0.42       0.05       0.24       0.00
benefit, advantage         0.06       0.05       0.24       0.00
intellectual curiosity     0.05       0.05       0.24       0.00
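A sketch of the proposal: score each system by the divergence between its sense distribution and a gold distribution. The system columns are taken from the table above; the uniform gold distribution and the smoothing constant are illustrative assumptions, not values from the paper:

```python
import math

def kld(p, q, eps=1e-9):
    # KLD(P || Q) with light smoothing so q = 0 does not blow up;
    # the smoothing constant is an implementation choice, not from the paper.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical gold distribution over the four senses of "interest"
# (uniform here purely for illustration), scored against the four
# system distributions from the slide's table.
gold = [0.25, 0.25, 0.25, 0.25]
systems = {
    "System 1": [0.47, 0.42, 0.06, 0.05],
    "System 2": [0.85, 0.05, 0.05, 0.05],
    "System 3": [0.28, 0.24, 0.24, 0.24],
    "System 4": [1.00, 0.00, 0.00, 0.00],
}
for name, dist in systems.items():
    print(name, round(kld(gold, dist), 3))
# Against this uniform gold, System 3 is the closest distribution and so
# gets the lowest divergence; hard-decision System 4 is penalized most.
```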
Conclusion
- Annotators can agree well above chance on a fine-grained sense inventory
- Disagreement can be systematic
  – Sense confusion
  – Subsets of annotators with different interpretations
- Ground truth as a distribution over senses
- Evaluation by comparison of sense distributions
- Learning methods that take into account
  – Distribution of sense probabilities
  – Features
  – Item difficulty
  – Annotator expertise
  – Sense difficulty