Making Sense of Word Sense
24 February 2011
Deutsche Gesellschaft für Sprachwissenschaft (DGfS), Göttingen
Rebecca J. Passonneau, Nancy Ide (Vassar College), Vikas Bhardwaj, Ansaf Salleb-Aouissi
Outline
- The word sense conundrum
- The MASC Project
- WordNet and sense annotation
- MASC annotation rounds
- Round 2: Multiple trained annotators
- Interannotator agreement and beyond
- Round 2: Mechanical turkers
- Machine learning from labels versus features
- Conclusion
Word sense conundrum
- Adam Kilgarriff, 2003, "I don't believe in word senses"
  – Word senses are abstractions from corpus clusters
  – "Corpus citations . . . are the basic objects in the ontology"
- James Pustejovsky, 1991, The Generative Lexicon
  – No fixed set of conceptual primitives
  – A fixed number of generative devices
  – Lexical semantics is an interface between commonsense knowledge and linguistic form
Zipf’s Law
An epiphenomenon of . . .
- Words (types or tokens)
- Senses
- Many other phenomena (Newman, M. E. J., 2005): city population, books sold, net worth, . . .
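The rank-frequency relation behind Zipf's law is easy to inspect on any word list: sort frequencies in descending rank order and the head dominates. A minimal stdlib sketch (the toy sentence is purely illustrative):

```python
from collections import Counter

def rank_frequencies(text):
    """Word frequencies sorted by rank (most frequent first)."""
    counts = Counter(text.lower().split())
    return sorted(counts.values(), reverse=True)

# Even in a tiny sample the head of the distribution dominates; on a
# real corpus, plotting log(frequency) against log(rank) gives the
# familiar near-linear Zipf curve.
freqs = rank_frequencies("the cat sat on the mat and the dog sat by the door")
print(freqs)  # [4, 2, 1, 1, 1, 1, 1, 1, 1]
```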
Granularity
Concepts versus comparisons of experience
- Infinite divisibility of reality: how fine-grained should a cluster be?
  – WordNet senses for primitive, Adj:
    1. Belonging to an early stage of development
    2. Characteristic of an ancestral type
    3. Preliterate or non-industrial societies
    4. Created by one without formal training
- Shared experience: the basis of social reality, ways of verbalizing social reality
  – Senses 1–3: anthropology; sense 4: art history
Corpus‐based Sense Classes
- Ontological questions: deferred
– How are clusters used as basic ontological objects?
– How is commonsense knowledge represented?
- Identify same/different contexts, within limits:
  – Same? ". . . a primitive granite boar, carved in prehistoric times"; ". . . has a primitive Easter-island look"
  – Different? "Bin Laden's training camps were primitive"; ". . . one or more of the primitive gluing or ungluing operations"
American National Corpus
- 100 Million Words
- Completely unrestricted
- Post 1990 American English
- Many genres
MASC: Manually Annotated Sub‐Corpus
Participants
- Nancy Ide (PI, NSF CRI; Vassar College)
- Collin Baker (ICSI; FrameNet)
- Christiane Fellbaum (Princeton; WordNet)
- Rebecca J. Passonneau (Columbia Univ.)

Size: 500,000 words
- Manually validated automatic annotations
- Manual annotations

Selected annotations
- Token, sentence, lemma (validated)
- Named entities (validated)
- WordNet (manual: 1.5-million-word sentence corpus)
- FrameNet (manual: 150K words)

http://www.anc.org/MASC/
MASC Corpus
- Three releases
  – MASC I: 82K words, released May 2010
  – MASC I–II: 142K words, release date March 2011
  – MASC I–III: 500K words, release date July 2011
- Fourteen types of annotation
  – Manually validated automatic, e.g., NP chunks
  – Manual, e.g., word sense
- Twenty genres, evenly balanced
- Freely available from the MASC website, and from:
  – LDC
  – NLTK
Genre
 #   Genre                 Words   % of corpus
 1   Court transcript      20817        4
 2   Debate transcript     32325        6
 3   Email                 20470        4
 4   Essay                 25590        5
 5   Fiction               25681        5
 6   Gov't documents       24605        5
 7   Journal               25635        5
 8   Letters               24750        5
 9   Newspaper/newswire    17951        4
10   Non-fiction           25182        5
11   Spoken                25783        5
12   Technical             25426        5
13   Travel guides         26708        5
14   Twitter               24180        5
15   Blog                 *25000        5
16   Ficlets              *25000        5
17   Movie script          28240        6
18   Poetry               *25000        5
19   Spam                 *25000        5
20   Jokes                *25000        5
     Total               498343

*Starred figures (bold in the original slide) indicate that the texts have not yet been chosen.
Word Sense Annotation Goals
- Freely available word sense corpus
- Harmonize WordNet and FrameNet
- Investigate moderately polysemous words (avg. = 7 senses)
- Large sentence-based corpus
  – 100 words, balanced for part of speech
  – 1000 sentences per word
  – Avg. sentence length in MASC I > 20 words
  – 2-million-word corpus, representing 700 senses
- Provide measures of interannotator agreement
  – Chance-corrected coefficients
  – Krippendorff's Alpha
WordNet Sense Information
- SENSEID: a unique identifier
- SYNSET: a list of synonymous senses (SENSEIDs)
- DEFINITION: a phrase
- EXAMPLES: a list of glosses
- FREQUENCY COUNT: an integer
- Nouns also have domain, etc.; verbs have verb group, etc.; adjectives have attributes, etc.
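The fields above map naturally onto a record type. This is an illustrative sketch of the data layout just described, not the actual WordNet file format or API; the field names follow the slide, and the sample values are taken from sense 1 of time-N discussed later in the talk:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sense:
    """Illustrative WordNet-style sense record; fields follow the slide."""
    sense_id: str                     # SENSEID: a unique identifier
    synset: List[str]                 # SYNSET: sense IDs of synonymous senses
    definition: str                   # DEFINITION: a phrase
    examples: List[str] = field(default_factory=list)  # EXAMPLES: glosses
    frequency_count: int = 0          # FREQUENCY COUNT: integer

time1 = Sense(
    sense_id="time1",
    synset=["time1", "clip2"],
    definition="an instance or single occasion for some event",
    examples=["this time he succeeded", "he called four times"],
)
print(time1.definition)
```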
WordNet Senses: time (noun)
8 WordNet senses used:
1. (time1, clip2) (an instance or single occasion for some event) "this time he succeeded"; "he called four times"; "he could do ten at a clip"
2. (a period of time considered as a resource under your control and sufficient to accomplish something) "take time to smell the roses"; "I didn't have time to finish"; "it took more than half my time"
3. (an indefinite period (usually marked by specific attributes or activities)) "he waited a long time"; "the time of year for planting"; "he was a great actor in his time"
4. (a suitable moment) "it is time to go"
5. (the continuum of experience in which events pass from the future through the present to the past)
6. (a person's experience on a particular occasion) "he had a time holding back the tears"; "they had a good time together"
7. (time7, clock_time1) (a reading of a point in time as given by a clock) "do you know what time it is?"; "the time is 10 o'clock"
8. (time8, fourth_dimension1) (the fourth coordinate that is required (along with three spatial dimensions) to specify a physical event)

2 WordNet senses not used:
9. (time9, meter4, metre3) (rhythm as given by division into parts of equal duration)
10. (time10, prison_term1, sentence3) (the period of time a prisoner is imprisoned) "he served a prison term of 15 months"; "his sentence was 5 to 10 years"; "he is doing time in the county jail"
Senses of time‐N
Sense   Num   Definition
1       171   An instance or single occasion for an event
2       131   A period of time . . . sufficient to accomplish something
3       427   An indefinite period
4        59   A suitable moment
5        34   The continuum of experience . . . the future . . .
6        19   A person's experience on a particular occasion
7        38   A reading of a point in time as given by a clock
8        47   The fourth coordinate . . . to specify an event
Total   926
Senses of time‐N: Examples
- Sense 1 (an instance or single occasion for an event): "When the bride and groom came together for the first time . . ."
- Sense 4 (a suitable moment): "A time for a youngster to enjoy the fun and benefits of camp . . ."
- Sense 5 (the continuum of experience): "Turn back the hands of time and remember when you . . ."
MASC Sense Annotation Rounds
- 1050 sentences per lemma-POS
  – N up to 1050 from MASC
  – If N < 1050, balance from OANC
- Pre-annotation sample: sense inventory revision
  – Random selection of 50 instances
  – WordNet 3.0 sense annotation
  – Multiple annotators
    - Typically 2–3; 6 in Round 2
    - Same core group of Vassar and Columbia undergrads
    - Highly trained
  – Sense revision (to be added to WordNet 3.1)
- Core annotation
  – 100-sentence subsample
    - FrameNet annotation
    - Multiple annotators (typically 2–3) for interannotator agreement
  – 900 sentences, one annotator per sentence (not always the same annotator)
Annotation Tool
- Loads WordNet 3.0
  – Sense number
  – Definition
  – Glosses
  – Synset
- Other labels
  – Collocation
  – Wrong POS
  – No sense applies
  – Not enough context
- Subversion
- Comment field used during pre-annotation sample
Multiple Annotators
Poesio and Artstein, "Reliability of Anaphoric Annotation Reconsidered," 2005
- Anaphora
  – Lack of a unique interpretation in the context of occurrence, attributed to a problematic annotation scheme, to be fixed by less specific representations (cf. word senses: Buitelaar, 1998; Palmer et al., 2005)
  – This applies to polysemy, but not to many other cases, e.g., anaphora
- Word sense
  – Need for revision of the sense inventory
  – Need for underspecification: annotators disagree unsystematically
  – Need to account for differences in interpretation: annotators disagree systematically
Interannotator Agreement
Instances i, annotators j, annotation values k
- Percent agreement: proportion of i where all j pick the same k
  – Does not generalize well to multiple annotators
  – Does not take probability into account
  – Sensitive to data skew
  – Primarily a measure of coverage: what proportion of instances have unanimity?
- Agreement coefficient (Krippendorff's Alpha): proportion of agreement above that predicted by chance
  – Same interpretation independent of data skew
  – Handles multiple annotators
  – Primarily a measure of variance
  – Does not indicate coverage
Example Agreement Matrix
Instance:  1  2  3  4  5  6  7  8  9  10
Ann1:      a  b  b  b  c  a  c  c  b  a
Ann2:      a  c  b  a  c  c  c  b  b  b
Ann3:      a  a  b  a  c  b  c  b  b  c

- Percent agreement
  – 0.50: 5 of 10 columns are unanimous
  – 0.63: 19 of 30 cells match at least one other annotator's label for the same instance
- Agreement coefficients
  – Krippendorff's Alpha: 0.301
    p(a) = 8/30 = 0.27
  – Cohen's Kappa: 0.278
    p_Ann1(a) = p_Ann3(a) = 3/10 = 0.30; p_Ann2(a) = 2/10 = 0.20
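Both measures can be computed directly from the matrix above. This is a minimal sketch: the alpha below uses the standard nominal-data formula with Krippendorff's finite-sample correction, and variant definitions (e.g., averaged pairwise coefficients) yield somewhat different values:

```python
from collections import Counter

# The example matrix: 3 annotators x 10 instances.
labels = [
    list("abbbcaccba"),  # Ann1
    list("acbacccbbb"),  # Ann2
    list("aabacbcbbc"),  # Ann3
]
columns = list(zip(*labels))  # one tuple of labels per instance
m = len(labels)               # annotators per instance

# Percent agreement as coverage: fraction of unanimous instances.
unanimous = sum(len(set(col)) == 1 for col in columns) / len(columns)

# Krippendorff's alpha (nominal data): 1 - D_observed / D_expected.
totals = Counter(v for col in columns for v in col)
n_total = sum(totals.values())

# Observed disagreement: fraction of disagreeing ordered pairs per instance.
d_o = sum(
    sum(1 for x in col for y in col if x != y) / (m * (m - 1))
    for col in columns
) / len(columns)

# Expected disagreement from the pooled label counts (finite-sample form).
d_e = 1 - sum(c * (c - 1) for c in totals.values()) / (n_total * (n_total - 1))

alpha = 1 - d_o / d_e
print(f"unanimous: {unanimous:.2f}, alpha: {alpha:.3f}")  # unanimous: 0.50
```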
Alpha Scores on Round 2, Trained
Lemma   POS    WN Senses   Senses Used   Alpha   Outliers   Alpha′
long    Adj        9            4         0.67       1       0.80
fair    Adj       10            6         0.54       2       0.63
quiet   Adj        6            7         0.49               0.49
time    Noun       7            7         0.68               0.68
work    Noun      10            8         0.62               0.62
land    Noun      11            9         0.49       1       0.54
tell    Verb      12           10         0.46               0.46
say     Verb       8            8         0.38       2       0.52
show    Verb      11           10         0.46       1       0.48
know    Verb      11           10         0.38       2       0.63
Interpreting Interannotator Agreement
- Krippendorff:
  – ≥ 0.67 supports tentative conclusions
  – ≥ 0.80 good reliability
- Landis & Koch
  – 0.21–0.40 fair
  – 0.41–0.60 moderate
  – 0.61–0.80 substantial
- Poesio & Artstein
  – No single threshold applicable for all purposes
  – 0.70 for many NLP annotations
- Passonneau
  – Paradigmatic reliability analysis
  – E.g., significance tests based on uses of different annotators' labels
Observed Variation in Alpha
- Part‐of‐Speech effect
  – Adjectives and nouns have higher Alpha than verbs
  – Only a partial explanation of the variation
- Within each part of speech, Alpha varies
- Weak inverse correlation of Alpha with number of senses available (r = −0.38)
- Modest inverse correlation of Alpha with number of senses used (r = −0.56)
Anveshan: Annotation Variance Estimation
- For use with data from multiple annotators
  – Identify outliers among annotators
  – Find subsets of annotators with similar behavior (systematic disagreement)
  – Identify confusable senses
- Variation can occur due to differences among
  – Annotators (expertise)
  – Items (difficulty)
  – Label sets (number and similarity of labels)
- Uses Kullback–Leibler divergence and Jensen–Shannon divergence
Anveshan Basics
- For each annotator a_i and sense s_j, compute the annotator's sense distribution:

    P_{a_i}(S = s_j) = count(s_j, a_i) / Σ_k count(s_k, a_i)

- For each annotator a_i, compute the average of all other annotators' distributions:

    P̄_{a_i}(S = s_j) = (1 / (n − 1)) Σ_{m ≠ i} P_{a_m}(S = s_j)

- Compute the leverage of each a_i as Lev(P_{a_i}, P̄_{a_i}), where

    Lev(P, Q) = Σ_k |P(k) − Q(k)|
Anveshan Basics, Continued
- Compute the Kullback–Leibler divergence of each annotator's sense distribution from the average of all other annotators' sense distributions:

    KLD(P, Q) = Σ_i P(i) log ( P(i) / Q(i) )

- Compute the Jensen–Shannon divergence between each pair of annotators' sense distributions:

    JSD(P, Q) = ½ KLD(P, M) + ½ KLD(Q, M), where M = (P + Q) / 2
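The quantities above take only a few lines of stdlib Python. The two annotators' label lists here are hypothetical, and natural log is used (the base only rescales the divergences):

```python
import math
from collections import Counter

def sense_distribution(labels, senses):
    """Normalized distribution over the sense inventory for one annotator."""
    counts = Counter(labels)
    return [counts[s] / len(labels) for s in senses]

def leverage(p, q):
    # Lev(P, Q) = sum_k |P(k) - Q(k)|  (L1 distance between distributions)
    return sum(abs(pk - qk) for pk, qk in zip(p, q))

def kld(p, q):
    # KLD(P, Q) = sum_i P(i) log(P(i)/Q(i)); terms with P(i) = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # JSD(P, Q) = 1/2 KLD(P, M) + 1/2 KLD(Q, M), with M = (P + Q)/2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)

# Hypothetical labels over a 3-sense inventory for two annotators.
senses = ["s1", "s2", "s3"]
ann1 = ["s1"] * 6 + ["s2"] * 3 + ["s3"] * 1
ann2 = ["s1"] * 3 + ["s2"] * 6 + ["s3"] * 1
p, q = sense_distribution(ann1, senses), sense_distribution(ann2, senses)
print(round(leverage(p, q), 3))  # 0.6
print(round(jsd(p, q), 3))       # 0.051
```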
Outliers
KLD for long‐Adj
Systematic Disagreements
JSD and Alpha for show‐Verb
Confusability of Senses
Sense distributions for say‐Verb
[Chart: distributions over senses WN1–WN9 and WN11 for annotators A101, A103, and overall]
Mechanical Turkers
- Two adjectives
- 150 sentences per adjective
  – 15 HITs, 10 sentences per HIT
  – Work rejected for turkers who did not complete all 15 HITs
  – 13 turkers
  – long (9 WN senses, all used): Alpha = 0.15
  – fair (10 WN senses, all used): Alpha = 0.25
fair (10 WN senses):
  wn1  wn2  wn3  wn4  wn5  wn6  wn7  wn8  wn9  wn10  Other  Total
  891  437   52   60   63   21   21  135   10    14    236   1950

long (9 WN senses):
  wn1  wn2  wn3  wn4  wn5  wn6  wn7  wn8  wn9  Other  Total
  659  458  115  110  160  156   66   56   64    115   1950

(1950 labels per word = 150 sentences × 13 turkers)
Comparison of Two Learning Paradigms
- Ground truth labels from author
- Unsupervised learning from multilabels
  – Maximum likelihood estimates obtained using EM
- Supervised learning from features
  – Features:
    - Word and sentence length features
    - Tf·idf
    - Named entities
    - DAL features (Dictionary of Affect in Language)
  – SVMLight
  – 4-fold cross-validation
  – Best C values
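One of the listed features, tf·idf, is easy to sketch. This is an illustrative stdlib implementation of the standard weighting scheme, not the feature extractor the authors used (their tokenization and normalization choices are not specified here):

```python
import math
from collections import Counter

def tfidf(docs):
    """tf*idf vectors for a list of tokenized documents (illustrative sketch)."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # term frequency (normalized by doc length) x inverse doc frequency
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return vectors

docs = [
    "the long road home".split(),
    "a long long time".split(),
    "the time is now".split(),
]
vecs = tfidf(docs)
# "long" occurs in 2 of 3 documents, so its idf is log(3/2) > 0;
# a term occurring in every document would get idf 0.
print(round(vecs[1]["long"], 3))  # 0.203
```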
Machine Learning from MultiLabels
- GLAD (Whitehill et al., 2009, "Whose vote should count more?")
- Graphical model
- Hidden variables
  – True labels (Z)
  – Labeler accuracy (α)
  – Image difficulty (β)
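In GLAD's model, annotator i labels item j correctly with probability σ(α_i·β_j), where α is annotator accuracy and 1/β is item difficulty; given estimates of α and β, the posterior over a binary true label follows by Bayes' rule. The sketch below treats α and β as known, with made-up values, whereas GLAD itself estimates them with EM:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def posterior_true_label(labels, alphas, beta, prior=0.5):
    """P(z = 1 | labels) for one item, GLAD-style: each annotator is
    correct with probability sigmoid(alpha_i * beta)."""
    p1, p0 = prior, 1.0 - prior
    for label, alpha in zip(labels, alphas):
        p_correct = sigmoid(alpha * beta)
        p1 *= p_correct if label == 1 else 1.0 - p_correct
        p0 *= p_correct if label == 0 else 1.0 - p_correct
    return p1 / (p1 + p0)

# Three annotators vote 1, 1, 0; the dissenter has low accuracy (alpha=0.2),
# so the posterior sides strongly with the two accurate annotators.
post = posterior_true_label([1, 1, 0], alphas=[2.0, 1.5, 0.2], beta=1.0)
print(round(post, 3))  # 0.964
```

This is why GLAD can outperform majority voting: a dissenting vote from a low-accuracy annotator barely moves the posterior, while the same vote from a high-accuracy annotator would.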
Learning Results, Fair
        Sense        Ann    Recall   Precision   F measure   Accuracy
GLAD    fair-j, WN1  MASC    0.92      0.94        0.93        0.93
GLAD    fair-j, WN1  AMT     1.00      0.71        0.85        0.79
GLAD    fair-j, WN1  Both    1.00      0.74        0.87        0.82
SVM     fair-j, WN1  NA      1.00      0.65        0.82        0.72
GLAD    fair-j, WN2  MASC    0.69      0.48        0.59        0.83
GLAD    fair-j, WN2  AMT     0.81      0.93        0.87        0.96
GLAD    fair-j, WN2  Both    0.81      0.93        0.87        0.96
SVM     fair-j, WN2  NA      0.60      0.33        0.46        0.68
Learning Results, Long
        Sense        Ann    Recall   Precision   F measure   Accuracy
GLAD    long-j, WN1  MASC    0.88      0.84        0.86        0.84
GLAD    long-j, WN1  AMT     1.00      0.98        0.99        0.99
GLAD    long-j, WN1  Both    1.00      0.98        0.99        0.99
SVM     long-j, WN1          1.00      0.61        0.80        0.63
GLAD    long-j, WN2  MASC    0.74      0.80        0.77        0.83
GLAD    long-j, WN2  AMT     0.79      0.94        0.86        0.90
GLAD    long-j, WN2  Both    0.95      0.97        0.96        0.97
SVM     long-j, WN2          0.85      0.83        0.84        0.66
Summary of Learning Results
- GLAD
  – Performs better on 13 turkers than on 6 trained annotators, apart from fair-adj, WN1; why?
  – Combining trained and untrained labels:
    - Improvement for long-adj, WN2
    - Degradation for fair-adj, WN1
    - No improvement for long-adj, WN1 or fair-adj, WN2
  – No consistent pattern of results
  – No apparent correlation of instance difficulty with features, or of annotator expertise with interannotator agreement
- SVM
  – Performance not quite as good as GLAD
Resnik and Yarowsky Proposal
1999, Distinguishing Systems and Distinguishing Senses
- Use cross-entropy, KLD, or a related measure to compare a system's probability distribution over senses, P_S(cs_i | w_i, context_i), against the gold senses

Senses of interest       System 1   System 2   System 3   System 4
monetary                   0.47       0.85       0.28       1.00
stake or share             0.42       0.05       0.24       0.00
benefit, advantage         0.06       0.05       0.24       0.00
intellectual curiosity     0.05       0.05       0.24       0.00
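A sketch of the proposal: score each system by the divergence between its sense distribution and a gold distribution. The system columns are taken from the table above; the uniform gold distribution and the smoothing constant are illustrative assumptions, not values from the paper:

```python
import math

def kld(p, q, eps=1e-9):
    # KLD(P || Q) with light smoothing so q = 0 does not blow up;
    # the smoothing constant is an implementation choice, not from the paper.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical gold distribution over the four senses of "interest"
# (uniform here purely for illustration), scored against the four
# system distributions from the slide's table.
gold = [0.25, 0.25, 0.25, 0.25]
systems = {
    "System 1": [0.47, 0.42, 0.06, 0.05],
    "System 2": [0.85, 0.05, 0.05, 0.05],
    "System 3": [0.28, 0.24, 0.24, 0.24],
    "System 4": [1.00, 0.00, 0.00, 0.00],
}
for name, dist in systems.items():
    print(name, round(kld(gold, dist), 3))
# Against this uniform gold, System 3 is the closest distribution and so
# gets the lowest divergence; hard-decision System 4 is penalized most.
```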
Conclusion
- Annotators can agree well above chance on a fine-grained sense inventory
- Disagreement can be systematic
  – Sense confusion
  – Subsets of annotators with different interpretations
- Ground truth as a distribution over senses
- Evaluation by comparison of sense distributions
- Learning methods that take into account
  – Distribution of sense probabilities
  – Features
  – Item difficulty
  – Annotator expertise
  – Sense difficulty