DISSECT - DIStributional SEmantics Composition Toolkit

Georgiana Dinu and Nghia The Pham and Marco Baroni
Center for Mind/Brain Sciences (University of Trento, Italy)
(georgiana.dinu|thenghia.pham|marco.baroni)@unitn.it

Abstract

We introduce DISSECT, a toolkit to build and explore computational models of word, phrase and sentence meaning based on the principles of distributional semantics. The toolkit focuses in particular on compositional meaning, and implements a number of composition methods that have been proposed in the literature. Furthermore, DISSECT can be useful to researchers and practitioners who need models of word meaning (without composition) as well, as it supports various methods to construct distributional semantic spaces, assessing similarity and even evaluating against benchmarks, that are independent of the composition infrastructure.

1 Introduction

Distributional methods for meaning similarity are based on the observation that similar words occur in similar contexts and measure similarity based on patterns of word occurrence in large corpora (Clark, 2012; Erk, 2012; Turney and Pantel, 2010). More precisely, they represent words, or any other target linguistic elements, as high-dimensional vectors, where the dimensions represent context features. Semantic relatedness is assessed by comparing vectors, leading, for example, to determine that car and vehicle are very similar in meaning, since they have similar contextual distributions. Despite the appeal of these methods, modeling words in isolation has limited applications and ideally we want to model semantics beyond word level by representing the meaning of phrases or sentences. These combinations are infinite and compositional methods are called for to derive the meaning of a larger construction from the meaning of its parts. For this reason, the question of compositionality within the distributional paradigm has received a lot of attention in recent years and a number of compositional frameworks have been proposed in the distributional semantic literature, see, e.g., Coecke et al. (2010) and Mitchell and Lapata (2010). For example, in such frameworks, the distributional representations of red and car may be combined, through various operations, in order to obtain a vector for red car.

The DISSECT toolkit (http://clic.cimec.unitn.it/composes/toolkit) is, to the best of our knowledge, the first to provide an easy-to-use implementation of many compositional methods proposed in the literature. As such, we hope that it will foster further work on compositional distributional semantics, as well as making the relevant techniques easily available to those interested in their many potential applications, e.g., to context-based polysemy resolution, recognizing textual entailment or paraphrase detection. Moreover, the DISSECT tools to construct distributional semantic spaces from raw co-occurrence counts, to measure similarity and to evaluate these spaces might also be of use to researchers who are not interested in the compositional framework. DISSECT is freely available under the GNU General Public License.

2 Building and composing distributional semantic representations

The pipeline from corpora to compositional models of meaning can be roughly summarized as consisting of three stages:¹

1. Extraction of co-occurrence counts from corpora. In this stage, an input corpus is used to extract counts of target elements co-occurring with some contextual features. The target elements can vary from words (for lexical similarity), to pairs of words (e.g., for relation categorization), to paths in syntactic trees (for unsupervised paraphrasing). Context features can also vary from shallow window-based collocates to syntactic dependencies.

2. Transformation of the raw counts. This stage may involve the application of weighting schemes such as Pointwise Mutual Information, feature selection, dimensionality reduction methods such as Singular Value Decomposition, etc. The goal is to eliminate the biases that typically affect raw counts and to produce vectors which better approximate similarity in meaning.

3. Application of composition functions. Once meaningful representations have been constructed for the atomic target elements of interest (typically, words), various methods, such as vector addition or multiplication, can be used for combining them to derive context-sensitive representations or for constructing representations for larger phrases or even entire sentences.

DISSECT can be used for the second and third stages of this pipeline, as well as to measure similarity among the resulting word or phrase vectors. The first step is highly language-, task- and corpus-annotation-dependent. We do not attempt to implement all the corpus pre-processing and co-occurrence extraction routines that it would require to be of general use, and expect instead as input a matrix of raw target-context co-occurrence counts.² DISSECT provides various methods to re-weight the counts with association measures, dimensionality reduction methods as well as the composition functions proposed by Mitchell and Lapata (2010) (Additive, Multiplicative and Dilation), Baroni and Zamparelli (2010)/Coecke et al. (2010) (Lexfunc) and Guevara (2010)/Zanzotto et al. (2010) (Fulladd). In DISSECT we define and implement these in a unified framework and in a computationally efficient manner. The focus of DISSECT is to provide an intuitive interface for researchers and to allow easy extension by adding other composition methods.

3 DISSECT overview

DISSECT is written in Python. We provide many standard functionalities through a set of powerful command-line tools, however users with basic Python familiarity are encouraged to use the Python interface that DISSECT provides. This section focuses on this interface (see the online documentation on how to perform the same operations with the command-line tools), that consists of the following top-level packages:

    #DISSECT packages
    composes.matrix
    composes.semantic_space
    composes.transformation
    composes.similarity
    composes.composition
    composes.utils

Semantic spaces and transformations. The concept of a semantic space (composes.semantic_space) is at the core of the DISSECT toolkit. A semantic space consists of co-occurrence values, stored as a matrix, together with strings associated to the rows of this matrix (by design, the target linguistic elements) and a (potentially empty) list of strings associated to the columns (the context features). A number of transformations (composes.transformation) can be applied to semantic spaces. We implement weighting schemes such as positive Pointwise Mutual Information (ppmi) and Local Mutual Information, feature selection methods, dimensionality reduction (Singular Value Decomposition (SVD) and Nonnegative Matrix Factorization (NMF)), and new methods can be easily added.³ Going from raw counts to a transformed space is accomplished in just a few lines of code (Figure 1).

    #create a semantic space from counts in
    #dense format("dm"): word freq1 freq2 ..
    ss = Space.build(data="counts.txt",
                     format="dm")

    #apply transformations
    ss = ss.apply(PpmiWeighting())
    ss = ss.apply(Svd(300))

    #retrieve the vector of a target element
    print ss.get_row("car")

Figure 1: Creating a semantic space.

¹ See Turney and Pantel (2010) for a technical overview of distributional methods for semantics.
² These counts can be read from a text file containing two strings (the target and context items) and a number (the corresponding count) on each line (e.g., maggot food 15) or from a matrix in format word freq1 freq2 ...
³ The complete list of transformations currently supported can be found at http://clic.cimec.unitn.it/composes/toolkit/spacetrans.html#spacetrans.
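The first pipeline stage (extraction of co-occurrence counts) is left to the user, so a small illustration of what it might produce may help. The sketch below is not part of DISSECT: the function names `cooccurrence_counts` and `to_sparse_lines`, the toy sentence and the window size are all assumptions made for illustration. It counts window-based collocates and prints them in the "target context count" layout described in the footnote on input formats.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=1):
    """Count (target, context) pairs within a symmetric token window.

    Hypothetical helper, not a DISSECT function.
    """
    counts = Counter()
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                counts[(target, tokens[j])] += 1
    return counts


def to_sparse_lines(counts):
    """Render counts as 'target context count' lines, sorted by pair."""
    return ["%s %s %d" % (t, c, n) for (t, c), n in sorted(counts.items())]


# Toy corpus (assumed for illustration only).
tokens = "the red car passed the blue car".split()
counts = cooccurrence_counts(tokens, window=1)

for line in to_sparse_lines(counts):
    print(line)
```

A real extraction step would also handle tokenization, lemmatization and frequency cut-offs, which are corpus- and task-dependent, as the text notes.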
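To make the second stage concrete, here is a minimal pure-Python sketch of positive Pointwise Mutual Information weighting over a dictionary of raw counts. It is an illustration of the weighting scheme named in the text, not DISSECT's own `PpmiWeighting` implementation. The function name `ppmi`, the natural logarithm and the toy counts are assumptions.

```python
import math
from collections import defaultdict


def ppmi(counts):
    """Return positive PMI weights for a {(target, context): count} dict.

    PMI(t, c) = log( n_tc * N / (n_t * n_c) ); non-positive values are dropped.
    Hypothetical helper, not a DISSECT function.
    """
    total = float(sum(counts.values()))
    row_sums = defaultdict(float)  # marginal count of each target
    col_sums = defaultdict(float)  # marginal count of each context
    for (t, c), n in counts.items():
        row_sums[t] += n
        col_sums[c] += n

    weighted = {}
    for (t, c), n in counts.items():
        if n <= 0:
            continue  # log is undefined for zero counts
        pmi = math.log(n * total / (row_sums[t] * col_sums[c]))
        if pmi > 0:
            weighted[(t, c)] = pmi
    return weighted


# Toy counts (assumed for illustration only).
counts = {
    ("car", "drive"): 8,
    ("car", "eat"): 2,
    ("dog", "drive"): 1,
    ("dog", "eat"): 9,
}
weighted = ppmi(counts)
print(sorted(weighted))
```

Pairs that occur less often than chance would predict, such as ("car", "eat") here, receive no weight, which is how this scheme reduces the frequency bias of raw counts.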
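The third stage and the similarity measurement can be sketched in the same way. The code below illustrates the simplest composition functions mentioned in the text on plain Python lists; it does not use the `composes.composition` API. The weighted-sum form of the additive model and the dilation formula p = (u·u)v + (λ − 1)(u·v)u follow our reading of Mitchell and Lapata (2010) and should be treated as assumptions, as should the function names and the toy vectors for red and car.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def additive(u, v, alpha=1.0, beta=1.0):
    """Weighted additive composition: alpha * u + beta * v."""
    return [alpha * a + beta * b for a, b in zip(u, v)]


def multiplicative(u, v):
    """Component-wise multiplicative composition."""
    return [a * b for a, b in zip(u, v)]


def dilation(u, v, lam=2.0):
    """Dilation (assumed form): (u.u) * v + (lam - 1) * (u.v) * u."""
    uu, uv = dot(u, u), dot(u, v)
    return [uu * b + (lam - 1.0) * uv * a for a, b in zip(u, v)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


# Toy three-dimensional vectors (assumed for illustration only).
red = [1.0, 0.0, 2.0]
car = [0.0, 3.0, 1.0]

red_car_add = additive(red, car)
red_car_mult = multiplicative(red, car)
red_car_dil = dilation(red, car, lam=2.0)

print(red_car_add, red_car_mult, red_car_dil)
print(cosine(red_car_add, car))
```

Lexfunc and Fulladd are omitted here because they require parameters estimated from corpus data rather than a closed-form combination of two vectors.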
