Proceedings of the SIGLEX/SENSEVAL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, July 2002, pp. 74-80. Association for Computational Linguistics.

Combining Heterogeneous Classifiers for Word-Sense Disambiguation

Dan Klein, Kristina Toutanova, H. Tolga Ilhan, Sepandar D. Kamvar and Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305-9040, USA

Abstract

This paper discusses ensembles of simple but heterogeneous classifiers for word-sense disambiguation, examining the Stanford-CS224N system entered in the SENSEVAL-2 English lexical sample task. First-order classifiers are combined by a second-order classifier, which variously uses majority voting, weighted voting, or a maximum entropy model. While individual first-order classifiers perform comparably to middle-scoring teams' systems, the combination achieves high performance. We discuss trade-offs and empirical performance. Finally, we present an analysis of the combination, examining how ensemble performance depends on error independence and task difficulty.

1 Introduction

The problem of supervised word sense disambiguation (WSD) has been approached using many different classification algorithms, including naive-Bayes, decision trees, decision lists, and memory-based learners. While it is unquestionable that certain algorithms are better suited to the WSD problem than others (for a comparison, see Mooney (1996)), it seems that, given similar input features, various algorithms exhibit roughly similar accuracies.[1] This was supported by the SENSEVAL-2 results, where a large fraction of systems had scores clustered in a fairly narrow region (Senseval-2, 2001).

We began building our system with 23 supervised WSD systems, each submitted by a student taking the natural language processing course (CS224N) at Stanford University in Spring 2000. Students were free to implement whatever WSD method they chose. While most implemented variants of naive-Bayes, others implemented a range of other methods, including n-gram models, vector space models, and memory-based learners. Taken individually, the best of these systems would have turned in an accuracy of 61.2% in the SENSEVAL-2 English lexical sample task (which would have given it 6th place), while others would have produced middling to low performance. In this paper, we investigate how these classifiers behave in combination.

In section 2, we discuss the first-order classifiers and describe our methods of combination. In section 3, we discuss performance, analyzing what benefit was found from combination, and when. We also discuss aspects of the component systems which substantially influenced overall performance.

* This paper is based on work supported in part by the National Science Foundation under Grants IIS-0085896 and IIS-9982226, by an NSF Graduate Fellowship, and by the Research Collaboration between NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation and CSLI, Stanford University.

[1] In fact, we have observed that differences between implementations of a single classifier type, such as smoothing or window size, impacted accuracy far more than the choice of classification algorithm.

2 The System

2.1 Training Procedure

Figure 1 shows the high-level organization of our system. Individual first-order classifiers each map lists of context word tokens to word-sense predictions, and are self-contained WSD systems. The first-order classifiers are combined in a variety of ways with second-order classifiers. Second-order classifiers are selectors, taking a list of first-order outputs and choosing from among them.

[Figure 1: High-level system organization. The first-order classifiers are ranked; the top-ranked ones form ensembles fed to second-order classifiers (majority voting, weighted voting, maximum entropy), and cross-validation ranks the second-order classifiers to choose the final classifier for each word.]
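The paper itself contains no code, but the division of labor above can be summarized as two small interfaces. The Python sketch below is purely illustrative; the names FirstOrderClassifier, SecondOrderClassifier, train, predict, and combine are hypothetical and not taken from the system.

```python
from typing import List, Protocol, Tuple


class FirstOrderClassifier(Protocol):
    """A self-contained WSD system for one target word: context tokens -> sense."""

    def train(self, examples: List[Tuple[List[str], str]]) -> None:
        """Fit on (context word tokens, sense label) training pairs."""
        ...

    def predict(self, context_tokens: List[str]) -> str:
        """Return a predicted sense for one test instance."""
        ...


class SecondOrderClassifier(Protocol):
    """A selector: chooses a final sense from the first-order classifiers' outputs."""

    def combine(self, first_order_outputs: List[str]) -> str:
        ...
```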

An outline of the classifier construction process is given in Table 1. First, the training data was split into training and held-out sets for each word. This was done using 5 random bootstrap splits. Each split allocated 75% of the examples to training and 25% to held-out testing.[2] Held-out data was used both to select the subsets of first-order classifiers to be combined, and to select the combination methods.

1  Split data into multiple training and held-out parts.
2  Rank first-order classifiers globally (across all words).
3  Rank first-order classifiers locally (per word), breaking ties with global ranks.
4  For each word w:
5    For each ensemble size k:
6      Choose the ensemble E_{w,k} to be the top k classifiers.
7      For each voting method m:
8        Train the (k, m) second-order classifier with E_{w,k}.
9  Rank the second-order classifier types (k, m) globally.
10 Rank the second-order classifier instances locally.
11 Choose the top-ranked second-order classifier for each word.
12 Retrain chosen per-word classifiers on entire training data.
13 Run these classifiers on test data, and evaluate results.

Table 1: The classifier construction process.

[2] Bootstrap splits were used rather than standard n-fold cross-validation for two reasons. First, it allowed us to generate an arbitrary number of training/held-out pairs while still leaving substantial held-out data set sizes. Second, this approach is commonly used in the literature on ensembles. Its well-foundedness and theoretical properties are discussed in Breiman (1996). In retrospect, since we did not take proper advantage of the ability to generate numerous splits, it might have been just as well to use cross-validation.

For each word and each training split, the 23 first-order classifiers were (independently) trained and tested on held-out data. For each word, the first-order classifiers were ranked by their average performance on the held-out data, with the most accurate classifiers at the top of the rankings. Ties were broken by the classifiers' (weighted) average performance across all words.

For each word, we then constructed a set of candidate second-order classifiers. Second-order classifier types were identified by an ensemble size k and a combination method m. One instance of each second-order type was constructed for each word.

We originally considered ensemble sizes k in the range {1, 3, 5, 7, 9, 11, 13, 15}. For a second-order classifier with ensemble size k, the ensemble members were the top k first-order classifiers according to the local rank described above. We combined first-order ensembles using one of three methods m:

• Majority voting: The sense output by the most first-order classifiers in the ensemble was chosen. Ties were broken by sense frequency, in favor of more frequent senses.

• Weighted voting: Each first-order classifier was assigned a voting weight (see below). The sense receiving the greatest total weighted vote was chosen.

• Maximum entropy: A maximum entropy classifier was trained (see below) and run on the outputs of the first-order classifiers.

We considered all pairs of k and m, and so for each word there were 24 possible second-order classifiers, though for k = 1 all three values of m are equivalent and were merged. The k = 1 ensemble, as well as the larger ensembles (k ∈ {9, 11, 13, 15}), did not help performance once we had good first-order classifier rankings (see section 3.4).

For m = Majority, there are no parameters to set. For the other two methods, we set the parameters of the (k, m) second-order classifier for a word w using the bootstrap splits of the training data for w.
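As a rough illustration of the first two combination methods above, here is a minimal Python sketch; the function names, the sense-frequency ranking used for tie-breaking, and the per-classifier weights are hypothetical placeholders, and the maximum entropy combiner is omitted.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(outputs: List[str], sense_freq_rank: Dict[str, int]) -> str:
    """Pick the sense output by the most ensemble members.

    Ties are broken in favor of more frequent senses, where
    sense_freq_rank assigns lower numbers to more frequent senses.
    """
    counts = Counter(outputs)
    return max(counts, key=lambda s: (counts[s], -sense_freq_rank.get(s, 10**9)))


def weighted_vote(outputs: List[str], weights: List[float]) -> str:
    """Pick the sense receiving the greatest total weighted vote."""
    totals: Dict[str, float] = {}
    for sense, weight in zip(outputs, weights):
        totals[sense] = totals.get(sense, 0.0) + weight
    return max(totals, key=totals.get)


# Hypothetical usage: two of three ensemble members vote for "art_1".
# majority_vote(["art_1", "art_1", "art_2"], {"art_1": 0, "art_2": 1}) == "art_1"
```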
In the same manner as for the first-order classifiers, we then ranked the second-order classifiers. For each word, there was the local ranking of the second-order classifiers, given by their (average) accuracy on held-out data. Ties in these rankings were broken by the average performance of the classifier type across all words. The top second-order classifier for each word was selected from these tie-broken rankings.

At this point, all first-order ensemble members and chosen second-order combination methods were retrained on the unsplit training data and run on the final test data.
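To make the splitting and ranking steps of Table 1 concrete, the following is a small Python sketch under the assumptions stated above (5 splits, a 75%/25% allocation, and local rankings with a global tie-break, as used for both the first- and second-order classifiers); the helper names are invented for illustration, and the training of the classifiers themselves is not shown.

```python
import random
from typing import Dict, List, Sequence, Tuple

Example = Tuple[List[str], str]  # (context word tokens, sense label)


def bootstrap_splits(examples: Sequence[Example], n_splits: int = 5,
                     train_frac: float = 0.75, seed: int = 0):
    """Yield (train, held_out) pairs: 75% training, 25% held-out, repeated 5 times."""
    rng = random.Random(seed)
    for _ in range(n_splits):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        cut = int(train_frac * len(shuffled))
        yield shuffled[:cut], shuffled[cut:]


def rank_with_global_tiebreak(local_acc: Dict[str, float],
                              global_acc: Dict[str, float]) -> List[str]:
    """Rank classifiers by per-word held-out accuracy, breaking ties by their
    average accuracy across all words; the top-ranked entry is the one chosen."""
    return sorted(local_acc,
                  key=lambda name: (local_acc[name], global_acc.get(name, 0.0)),
                  reverse=True)
```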
