Overview SVS types&tokens Data Visualization Conclusion References
Looking at Word Meaning
An interactive visualization of Semantic Vector Spaces for Dutch synsets
Kris Heylen, Dirk Speelman & Dirk Geeraerts
KULeuven Quantitative
Purpose of the talk
- Peek inside the black box of Vector Space Models of lexical semantics
- through an interactive visualization of word uses
- Allow Computational Linguists to do a direct, intrinsic
evaluation of their models and the semantics they capture
- Provide Lexicologists and Lexicographers with an explorative
tool for analyzing word meaning in large corpora
Overview
- 1. Semantic Vector Spaces as models of word meaning
- 2. Type vs token-level vector spaces
- 3. Case study: Data and set-up
- 4. Visualization
- 5. Conclusion and future work
Semantic Vector Spaces as models of word meaning
Semantic Vector Spaces in Computational Linguistics
- standard technique in statistical NLP for the large-scale
automatic modeling of (lexical) semantics
- also known as Vector Space Models, Distributional Semantic Models, Word Spaces, ... (see Turney & Pantel (2010) for an overview)
- intuitive rationale, but largely black-box statistical technique
Linguistic origin: Distributional Hypothesis
- “You shall know a word by the company it keeps” (Firth, 1957)
- a word’s meaning can be induced from its co-occurring words
- words appearing in similar contexts will have similar meanings
Semantic Vector Spaces as models of word meaning
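Practical
Which two words out of a set of three have the same meaning?
- ongeval, koffie, accident
Occurrences in context from a corpus
- Op de Brusselse ring deed zich een ongeval met een vrachtwagen voor (An accident involving a truck occurred on the Brussels ring road)
- ’s Morgens drinkt hij een kop koffie met melk en suiker (In the morning he drinks a cup of coffee with milk and sugar)
- 2 bestuurders raakten gekwetst bij een ongeval met een vrachtwagen (2 drivers were injured in an accident involving a truck)
- in de avondspits veroorzaakte een accident een kilometerslange file (During the evening rush hour an accident caused a traffic jam several kilometres long)
- als vieruurtje serveert het hotel koffie en gebak voor de gasten (As an afternoon snack the hotel serves coffee and pastries to the guests)
- de auto was betrokken in een accident met een dodelijke afloop (The car was involved in an accident with a fatal outcome)
- Met winterbanden is het risico op een ongeval bij vriesweer veel kleiner (With winter tyres the risk of an accident in freezing weather is much smaller)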
Semantic Vector Spaces as models of word meaning
word by context co-occurrence matrix
          auto  slachtoffer  vrachtwagen  file  gekwetst  suiker  melk  kop
ongeval    120          424          388    82       270      11     3    1
accident   154          401          376    99       305      20     1    5
koffie       5            8           18     4         1      72   102   93
Semantic Vector Spaces as models of word meaning
word by word similarity matrix
          ongeval  accident  koffie
ongeval         1       .91     .08
accident      .91         1     .17
koffie        .08       .17       1
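The step from the co-occurrence matrix to the word by word similarity matrix can be sketched as below. This is a minimal illustration, not the authors' implementation: it applies plain cosine similarity to the raw counts of the co-occurrence example, whereas the similarity values on the slide (.91, .08, .17) may come from a weighted matrix, so the numbers are not expected to match exactly. The names `COUNTS`, `cosine` and `similarity_matrix` are illustrative.

```python
import math

# Context words (columns) and raw co-occurrence counts (rows) of the example.
CONTEXTS = ["auto", "slachtoffer", "vrachtwagen", "file",
            "gekwetst", "suiker", "melk", "kop"]
COUNTS = {
    "ongeval":  [120, 424, 388, 82, 270, 11, 3, 1],
    "accident": [154, 401, 376, 99, 305, 20, 1, 5],
    "koffie":   [5, 8, 18, 4, 1, 72, 102, 93],
}


def cosine(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Word by word matrix of pairwise cosine similarities."""
    words = list(vectors)
    return {w1: {w2: cosine(vectors[w1], vectors[w2]) for w2 in words}
            for w1 in words}


SIM = similarity_matrix(COUNTS)

if __name__ == "__main__":
    for w1, row in SIM.items():
        print(w1, " ".join(f"{w2}={s:.2f}" for w2, s in row.items()))
```

On these counts, ongeval and accident come out as near-synonyms and koffie stays far from both, which is the pattern the similarity matrix illustrates.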
Vector Space Models of lexical semantics
Many different parameter settings
- context definition (document, window, dependency relations)
- weighting and similarity measures (PMI, cosine, Jaccard, ...)
- dimensionality reduction (SVD, LDA, NNMF, RI, ...)
- type vs token level; words vs relations
Wide variety of applications
- Psycholinguistic modeling of semantic memory
- Thesaurus extraction (WordNet)
- Lexical entailment, Query expansion
- Word sense disambiguation/induction
- Lexical variation between language varieties
- Historical studies of change in word meaning
Vector Space Models of lexical semantics
Unclear relation between parameters and semantics
- Which semantic structure do SVS models capture and how?
- Task-based evaluations only assess a-priori relations
- actual lexical-semantic structure is richer (Geeraerts (2010))
- Appeal for an intrinsic evaluation (Baroni & Lenci (2011))
SVSs have found little application in Linguistics proper
- Theoretical linguistics is becoming more data-driven
- Lexicologists (and lexicographers) try to describe semantic
structure based on a large number of corpus occurrences
- SVSs can provide such a (preliminary) structure but it needs
to be accessible for linguists
Vector Space Models of lexical semantics
Potential win-win solution for both problems
⇒ An intuitive visualization of the SVS output matrix
Benefits:
- For computational linguists: making SVSs accessible for evaluation by lexical semantic experts, going beyond the pre-defined semantic relations of task-based evaluation
- For lexicology: a tool for exploring and analyzing word meaning in large amounts of corpus data that, unlike traditional concordances, have some preliminary structure
Type vs token-level SVS
SVSs can model lexical semantics on two levels:
- 1. the type level: aggregating over all occurrences of a word, giving a representation of the word’s general semantics (e.g. thesaurus extraction)
- 2. the token level: representing the semantics of each individual occurrence of a word (e.g. WSD)
Lexicological studies typically take a set of types and analyze how they ’carve up’ semantic space by looking at their tokens.
We use a type-level SVS for finding synsets and a token-level space for modeling the tokens within each synset.
Type vs token-level vector spaces
Token vector approach of Schütze (1998):
Token vector = average of the context words’ type vectors
Example: While walking to work, the teacher saw a barking dog chasing a cat
             foot  office   night    hare     pet   pupil    milk    purr
walk          4.7     2.3     2.4     0.2     1.9     0.1       -       -
work          1.2     4.9     3.2       -     0.1     2.3     0.1       -
teacher       0.3     1.3     0.8       -     1.2     4.3     0.5     0.1
see           0.2     0.4     1.2     0.7     0.9     0.8     0.7     0.1
bark          0.3     0.2     1.9     1.8     2.1     1.8     0.7     2.1
chase         2.8       1     2.1     3.1     2.2     1.1     0.9     0.8
cat           1.1     0.9     2.3     1.9     3.9     0.5     2.8     4.6
AVERAGE      1.51    1.57    1.99    1.10    1.76    1.56    0.81    1.10
(- = empty cell)
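The averaging step can be sketched as follows. This is a minimal illustration rather than the authors' code: it uses the type vectors of the example sentence, with the column labels and the positions of the empty cells read off the example table and the empty cells taken as 0. That reading is an assumption, though it is consistent with the AVERAGE row. The names `TYPE_VECTORS` and `token_vector` are illustrative.

```python
# Dimensions of the type vectors in the example.
FEATURES = ["foot", "office", "night", "hare", "pet", "pupil", "milk", "purr"]

# Type vectors of the context words; empty cells are read as 0.0 (assumption).
TYPE_VECTORS = {
    "walk":    [4.7, 2.3, 2.4, 0.2, 1.9, 0.1, 0.0, 0.0],
    "work":    [1.2, 4.9, 3.2, 0.0, 0.1, 2.3, 0.1, 0.0],
    "teacher": [0.3, 1.3, 0.8, 0.0, 1.2, 4.3, 0.5, 0.1],
    "see":     [0.2, 0.4, 1.2, 0.7, 0.9, 0.8, 0.7, 0.1],
    "bark":    [0.3, 0.2, 1.9, 1.8, 2.1, 1.8, 0.7, 2.1],
    "chase":   [2.8, 1.0, 2.1, 3.1, 2.2, 1.1, 0.9, 0.8],
    "cat":     [1.1, 0.9, 2.3, 1.9, 3.9, 0.5, 2.8, 4.6],
}


def token_vector(context_words, type_vectors):
    """Unweighted average of the type vectors of the known context words."""
    vectors = [type_vectors[w] for w in context_words if w in type_vectors]
    if not vectors:
        raise ValueError("no context word has a type vector")
    n = len(vectors)
    # zip(*vectors) iterates over the columns (one per feature).
    return [sum(column) / n for column in zip(*vectors)]


CONTEXT = ["walk", "work", "teacher", "see", "bark", "chase", "cat"]
TOKEN = token_vector(CONTEXT, TYPE_VECTORS)

if __name__ == "__main__":
    for feature, value in zip(FEATURES, TOKEN):
        print(f"{feature:<7}{value:.2f}")
```

Every context word contributes equally here, so words that say little about the target (walk, work) pull the token vector as strongly as informative ones (bark, cat).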
Type vs token-level vector spaces
Our modified approach:
Token vector = weighted average of the context words’ type vectors, with the PMI values between the type and the context words as weights
          WEIGHT    foot  office   night    hare     pet   pupil    milk    purr
walk         1.1     4.7     2.3     2.4     0.2     1.9     0.1       -       -
work         0.2     1.2     4.9     3.2       -     0.1     2.3     0.1       -
see          0.1     0.2     0.4     1.2     0.7     0.9     0.8     0.7     0.1
bark         3.1     0.3     0.2     1.9     1.8     2.1     1.8     0.7     2.1
chase        2.7     2.8       1     2.1     3.1     2.2     1.1     0.9     0.8
cat          2.1     1.1     0.9     2.3     1.9     3.9     0.5     2.8     4.6
w.Av.               1.73    0.95    2.11    1.94    2.44    1.14    1.13    1.95
(- = empty cell)
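The weighted variant can be sketched in the same way. Again this is only an illustration under stated assumptions: the weights and type vectors are the six rows that appear in the weighted example (the teacher row of the unweighted example is not among them in this text, so it is left out), empty cells are read as 0, and the result is therefore not expected to reproduce the w.Av. row exactly. The names `WEIGHTS`, `VECTORS` and `weighted_token_vector` are illustrative.

```python
# PMI weights between the target type and each context word (from the example).
WEIGHTS = {"walk": 1.1, "work": 0.2, "see": 0.1,
           "bark": 3.1, "chase": 2.7, "cat": 2.1}

# Type vectors over (foot, office, night, hare, pet, pupil, milk, purr);
# empty cells are read as 0.0 (assumption).
VECTORS = {
    "walk":  [4.7, 2.3, 2.4, 0.2, 1.9, 0.1, 0.0, 0.0],
    "work":  [1.2, 4.9, 3.2, 0.0, 0.1, 2.3, 0.1, 0.0],
    "see":   [0.2, 0.4, 1.2, 0.7, 0.9, 0.8, 0.7, 0.1],
    "bark":  [0.3, 0.2, 1.9, 1.8, 2.1, 1.8, 0.7, 2.1],
    "chase": [2.8, 1.0, 2.1, 3.1, 2.2, 1.1, 0.9, 0.8],
    "cat":   [1.1, 0.9, 2.3, 1.9, 3.9, 0.5, 2.8, 4.6],
}


def weighted_token_vector(context_words, type_vectors, weights):
    """Average of the context words' type vectors, weighted per context word."""
    total_weight = 0.0
    accumulated = None
    for word in context_words:
        if word not in type_vectors or word not in weights:
            continue  # skip context words without a vector or a weight
        vector = type_vectors[word]
        if accumulated is None:
            accumulated = [0.0] * len(vector)
        for i, value in enumerate(vector):
            accumulated[i] += weights[word] * value
        total_weight += weights[word]
    if accumulated is None or total_weight == 0:
        raise ValueError("no usable context words")
    return [value / total_weight for value in accumulated]


W_TOKEN = weighted_token_vector(list(WEIGHTS), VECTORS, WEIGHTS)
```

With these weights, bark, chase and cat dominate the result, so the animal-related dimensions (hare, pet, purr) end up higher than in the unweighted average.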
Case study: Data and set-up
Corpus
- Dutch newspaper materials from 1999 to 2005
- stratified for Netherlandic (500M) and Belgian Dutch (1.3G)
- automatically lemmatized, POS-tagged and parsed with Alpino (van Noord (2006))
Dutch synsets
- 218 synsets containing 476 nouns (Ruette et al. (2012))
- dependency-based type-level SVS (Padó & Lapata (2007))
- clustered with Clustering by Committee (Pantel & Lin (2002))
Case study: Data and set-up
Concept          Nouns in synset
Infringement     inbreuk, overtreding
Genocide         volkerenmoord, genocide
Poll             peiling, opiniepeiling, rondvraag
Marihuana        cannabis, marihuana
Coup             staatsgreep, coup
Meningitis       hersenvliesontsteking, meningitis
Demonstrator     demonstrant, betoger
Airport          vliegveld, luchthaven
Collision        aanrijding, botsing
Computer screen  computerscherm, beeldscherm, monitor
Table: Dutch synsets (sample)
Case study: Data and set-up
Token model: second-order contexts (Schütze (1998))
STEP 1: type-level SVS for context words
- 1st-order context words: 573,127 words with frequency > 2
- 2nd-order context words: window of 4 left/right; 5430 words among the 7000 most frequent (minus a stoplist of 34 high-frequency function words) that also occurred at least 50 times in both the Netherlandic and the Belgian part of the corpus
- weighting: positive PMI
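Positive PMI weighting replaces each raw count by max(0, PMI), where PMI(w, c) = log(P(w, c) / (P(w) P(c))). The sketch below shows the computation on a small excerpt of the ongeval/koffie counts from the introductory example. It is an illustration only: the log base (2) and the treatment of zero counts are assumptions, and the function name `ppmi` is illustrative.

```python
import math


def ppmi(counts):
    """Turn a {word: {context: count}} matrix into positive PMI weights."""
    total = sum(sum(row.values()) for row in counts.values())
    row_sums = {word: sum(row.values()) for word, row in counts.items()}
    col_sums = {}
    for row in counts.values():
        for context, n in row.items():
            col_sums[context] = col_sums.get(context, 0) + n

    weighted = {}
    for word, row in counts.items():
        weighted[word] = {}
        for context, n in row.items():
            if n == 0:
                weighted[word][context] = 0.0  # log(0) is undefined; use 0
                continue
            # P(w,c) / (P(w) * P(c)) simplifies to n * total / (row * col).
            pmi = math.log2(n * total / (row_sums[word] * col_sums[context]))
            weighted[word][context] = max(0.0, pmi)  # clip negatives to 0
    return weighted


# Two words and two context words taken from the co-occurrence example.
EXCERPT = {
    "ongeval": {"vrachtwagen": 388, "suiker": 11},
    "koffie":  {"vrachtwagen": 18, "suiker": 72},
}
PPMI = ppmi(EXCERPT)
```

Pairs that co-occur less often than chance would predict (ongeval with suiker, koffie with vrachtwagen) get weight 0, so only positive associations remain in the vectors.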
STEP 2: token vectors
- sample: 100 Netherlandic and 100 Belgian newspaper issues
- window: 5 context words left and right of the token
- token vector = (weighted) average of context word vectors
Case study: Data and set-up
STEP 3: token-by-token similarity matrix
- similarity measure: cosine
- final output: a similarity matrix for each of the 218 synsets
- each matrix reflects how the different synonyms carve up the “semantic space” of the concept among themselves
Visualization
From high-dimensional to 2D
- the token similarity matrix is high-dimensional
- faithful rendering in 2D: Kruskal’s non-metric Multidimensional Scaling
- aim is not (yet) to find/impose latent structure
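Non-metric MDS looks for a 2D configuration whose distances preserve the rank order of the dissimilarities, and Kruskal's stress-1 measures how far a given configuration is from that goal. The sketch below only evaluates stress-1 for a fixed configuration; a full MDS routine would iteratively move the points to minimize it. It is an illustration under assumptions: dissimilarity is taken as 1 minus the similarity values of the three-word example, ties are not treated specially, and the names `pava` and `kruskal_stress` are illustrative.

```python
import math


def pava(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to values."""
    blocks = []  # each block is [mean, size]
    for value in values:
        blocks.append([value, 1])
        # Merge backwards while the sequence of block means decreases.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, size2 = blocks.pop()
            mean1, size1 = blocks.pop()
            size = size1 + size2
            blocks.append([(mean1 * size1 + mean2 * size2) / size, size])
    fitted = []
    for mean, size in blocks:
        fitted.extend([mean] * size)
    return fitted


def kruskal_stress(dissimilarities, points):
    """Stress-1 of a configuration against a square dissimilarity matrix."""
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Order the pairs by increasing dissimilarity (only the ranks matter).
    pairs.sort(key=lambda p: dissimilarities[p[0]][p[1]])
    distances = [math.dist(points[i], points[j]) for i, j in pairs]
    disparities = pava(distances)  # monotone regression of the distances
    numerator = sum((d - h) ** 2 for d, h in zip(distances, disparities))
    denominator = sum(d * d for d in distances)
    return math.sqrt(numerator / denominator) if denominator else 0.0


# 1 - similarity for ongeval, accident, koffie (assumed conversion).
DISSIM = [[0.00, 0.09, 0.92],
          [0.09, 0.00, 0.83],
          [0.92, 0.83, 0.00]]

GOOD = kruskal_stress(DISSIM, [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0)])
BAD = kruskal_stress(DISSIM, [(0.0, 0.0), (1.0, 0.0), (0.1, 0.0)])
```

The first configuration places ongeval next to accident and koffie far away, so the distance ranks match the dissimilarity ranks and the stress is 0; the second swaps accident and koffie and is penalized accordingly.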
Integrated and interactive chart
- integrate MDS plots with different types of meta-data
- let the researcher choose which data to visualize in the plot
- Motion Charts from Google Chart Tools
- also an open source implementation with the Python Imaging Library
- Demo for Infringement, Computer Screen
Traditional KWIC concordance
Conclusion and future work
Double benefit of visualizing SVSs
- For computational linguistics: making SVSs accessible for evaluation by lexical semantic experts
- For lexicology: a tool for exploring lexical semantics in large amounts of corpus data
Desiderata (due to our rather opportunistic use of Google Motion Charts)
- larger stretches of text in bubbles
- applications to historical data (cf. Sagi et al. (2009))
- provide more structure in plots (cf. Rohrdantz et al. (2011))
- show context features that make tokens similar
- allow input from users (e.g. additional coding)
- track feedback from users (e.g. misplaced tokens)
For more information:
http://wwwling.arts.kuleuven.be/qlvl
dirk.geeraerts@arts.kuleuven.be
kris.heylen@arts.kuleuven.be
References
Baroni, Marco, & Lenci, Alessandro. 2011. How we BLESSed distributional semantic evaluation. Pages 1–10 of: Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics. Edinburgh, UK: Association for Computational Linguistics.
Firth, J. 1957. A synopsis of linguistic theory 1930-1955. In: Palmer, F R (ed), Selected papers of J.R. Firth. Longman.
Geeraerts, Dirk. 2010. Theories of Lexical Semantics. Oxford: Oxford University Press.
Padó, Sebastian, & Lapata, Mirella. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2), 161–199.
Pantel, Patrick, & Lin, Dekang. 2002. Document clustering with committees. Pages 199–206 of: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. SIGIR ’02. New York, NY, USA: ACM.
Rohrdantz, Christian, Hautli, Annette, Mayer, Thomas, Butt, Miriam, Keim, Daniel A, & Plank, Frans. 2011. Towards Tracking Semantic Change by Visual Analytics. Pages 305–310 of: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics.
Ruette, Tom, Geeraerts, Dirk, Peirsman, Yves, & Speelman, Dirk. 2012. Semantic weighting mechanisms in scalable lexical sociolectometry. In: Szmrecsanyi, Benedikt, & Wälchli, Bernhard (eds), Aggregating dialectology and typology: linguistic variation in text and speech, within and across languages. Berlin: Mouton de Gruyter.
Sagi, Eyal, Kaufmann, Stefan, & Clark, Brady. 2009. Semantic Density Analysis: Comparing Word Meaning across Time and Phonetic Space. Pages 104–111 of: Proceedings of the Workshop on Geometrical Models of Natural Language Semantics. Athens, Greece: Association for Computational Linguistics.
Schütze, Hinrich. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1), 97–124.