A Spectral Algorithm for Learning Class-Based n-gram Models of Natural Language

Karl Stratos† Do-kyum Kim‡ Michael Collins† Daniel Hsu†

†Department of Computer Science, Columbia University, New York, NY 10027 ‡Department of Computer Science and Engineering, University of California–San Diego, La Jolla, CA 92093

Abstract

The Brown clustering algorithm (Brown et al., 1992) is widely used in natural language processing (NLP) to derive lexical representations that are then used to improve performance on various NLP problems. The algorithm assumes an underlying model that is essentially an HMM, with the restriction that each word in the vocabulary is emitted from a single state. A greedy, bottom-up method is then used to find the clustering; this method does not have a guarantee of finding the correct underlying clustering. In this paper we describe a new algorithm for clustering under the Brown et al. model. The method relies on two steps: first, the use of canonical correlation analysis to derive a low-dimensional representation of words; second, a bottom-up hierarchical clustering over these representations. We show that given a sufficient number of training examples sampled from the Brown et al. model, the method is guaranteed to recover the correct clustering. Experiments show that the method recovers clusters of comparable quality to the algorithm of Brown et al. (1992), but is an order of magnitude more efficient.

1 INTRODUCTION

There has recently been great interest in the natural language processing (NLP) community in methods that derive lexical representations from large quantities of unlabeled data (Brown et al., 1992; Pereira et al., 1993; Ando and Zhang, 2005; Liang, 2005; Turian et al., 2010; Dhillon et al., 2011; Collobert et al., 2011; Mikolov et al., 2013a,b). These representations can be used to improve accuracy on various NLP problems, or to give significant reductions in the number of training examples required for learning. The Brown clustering algorithm (Brown et al., 1992) is one of the most widely used algorithms for this task. Brown clustering representations have been shown to be useful in a diverse set of problems including named-entity recognition (Miller et al., 2004; Turian et al., 2010), syntactic chunking (Turian et al., 2010), parsing (Koo et al., 2008), and language modeling (Kneser and Ney, 1993; Gao et al., 2001).

The Brown clustering algorithm assumes a model that is essentially a hidden Markov model (HMM), with a restriction that each word in the vocabulary can only be emitted from a single state in the HMM (i.e., there is a deterministic mapping from words to underlying states). The algorithm uses a greedy, bottom-up method in deriving the clustering. This method is a heuristic, in that there is no guarantee of recovering the correct clustering. In practice, the algorithm is quite computationally expensive: for example, in our experiments, the implementation of Liang (2005) takes over 22 hours to derive a clustering from a dataset with 205 million tokens and 300,000 distinct word types.

This paper introduces a new algorithm for clustering under the Brown et al. model (henceforth, the Brown model). Crucially, under an assumption that the data is generated from the Brown model, our algorithm is guaranteed to recover the correct clustering when given a sufficient number of training examples (see the theorems in Section 5). The algorithm draws on ideas from canonical correlation analysis (CCA) and agglomerative clustering, and has the following simple form:

1. Estimate a normalized covariance matrix from a corpus and use singular value decomposition (SVD) to derive low-dimensional vector representations for word types (Figure 4).

2. Perform a bottom-up hierarchical clustering of these vectors (Figure 5).

In our experiments, we find that our clusters are comparable to the Brown clusters in improving the performance of a supervised learner, but our method is significantly faster. For example, both our clusters and Brown clusters improve the F1 score in named-entity recognition (NER) by 2-3 points, but the runtime of our method is around 10 times faster than the Brown algorithm (Table 3).

The paper is structured as follows. In Section 2, we discuss related work. In Section 3, we establish the notation we use throughout. In Section 4, we define the Brown model. In Section 5, we present the main result and describe the algorithm. In Section 6, we report experimental results.

Input: corpus with N tokens of n distinct word types w(1), . . . , w(n) ordered by decreasing frequency; number of clusters m.
Output: hierarchical clustering of w(1), . . . , w(n).

1. Initialize active clusters C = {{w(1)}, . . . , {w(m)}}.
2. For i = m + 1 to n + m − 1:
   (a) If i ≤ n: set C = C ∪ {{w(i)}}.
   (b) Merge c, c′ ∈ C that cause the smallest decrease in the likelihood of the corpus.

Figure 1: A standard implementation of the Brown clustering algorithm.

2 BACKGROUND

2.1 THE BROWN CLUSTERING ALGORITHM

The Brown clustering algorithm (Brown et al., 1992) has been used in many NLP applications (Koo et al., 2008; Miller et al., 2004; Liang, 2005). We briefly describe the algorithm below; a part of the description was taken from Koo et al. (2008).

The input to the algorithm is a corpus of text with N tokens of n distinct word types. The algorithm initializes each word type as a distinct cluster, and repeatedly merges the pair of clusters that cause the smallest decrease in the likelihood of the corpus according to a discrete hidden Markov model (HMM). The observation parameters of this HMM are assumed to satisfy a certain disjointedness condition (Assumption 4.1). We will explicitly define the model in Section 4.

At the end of the algorithm, one obtains a hierarchy of word types which can be represented as a binary tree as in Figure 2. Within this tree, each word is uniquely identified by its path from the root, and this path can be compactly represented with a bit string. In order to obtain a clustering of the words, we select all nodes at a certain depth from the root of the hierarchy. For example, in Figure 2 we might select the four nodes at depth 2 from the root, yielding the clusters {apple, pear}, {Apple, IBM}, {bought, run}, and {of, in}. Note that the same clustering can be obtained by truncating each word's bit string to a 2-bit prefix. By using prefixes of various lengths, we can produce clusterings of different granularities.
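As a small worked example (using the leaf bit strings from Figure 2; the dictionary and function below are ours, purely for illustration), truncating to 2-bit prefixes recovers the four clusters mentioned above:

```python
# Leaf bit strings from the Figure 2 hierarchy.
leaves = {"apple": "000", "pear": "001", "Apple": "010", "IBM": "011",
          "bought": "100", "run": "101", "of": "110", "in": "111"}

def clustering_at_depth(leaves, depth):
    """Group words by the `depth`-bit prefix of their bit strings."""
    clusters = {}
    for word, bits in leaves.items():
        clusters.setdefault(bits[:depth], []).append(word)
    return clusters

print(clustering_at_depth(leaves, 2))
# {'00': ['apple', 'pear'], '01': ['Apple', 'IBM'], '10': ['bought', 'run'], '11': ['of', 'in']}
```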

Figure 2: An example of a Brown word-cluster hierarchy taken from Koo et al. (2008); its leaves are 000 apple, 001 pear, 010 Apple, 011 IBM, 100 bought, 101 run, 110 of, 111 in. Each node in the tree is labeled with a bit string indicating the path from the root node to that node, where 0 indicates a left branch and 1 indicates a right branch.

A naive implementation of this algorithm has runtime O(n^5). Brown et al. (1992) propose a technique to reduce the runtime to O(n^3). Since this is still not acceptable for large values of n, a common trick used for practical implementation is to specify the number of active clusters m ≪ n, for example, m = 1000. A sketch of this implementation is shown in Figure 1. Using this technique, it is possible to achieve O(N + nm^2) runtime.

We note that our algorithm in Figure 5 has a similar form and asymptotic runtime, but is empirically much faster. We discuss this issue in Section 6.3.1.

In this paper, we present a very different algorithm for deriving a word hierarchy based on the Brown model. In all our experiments, we compared our method against the highly optimized implementation of the Brown algorithm in Figure 1 by Liang (2005).

2.2 CCA AND AGGLOMERATIVE CLUSTERING

Our algorithm in Figure 4 operates in a fashion similar to the mechanics of CCA. CCA is a statistical technique used to maximize the correlation between a pair of random variables (Hotelling, 1936). A central operation in CCA to achieve this maximization is SVD; in this work, we also critically rely on SVD to recover the desired parameters. Recently, it has been shown that one can use CCA-style algorithms, so-called spectral methods, to learn HMMs in polynomial sample/time complexity (Hsu et al., 2012). These methods will be important to our goal since the Brown model can be viewed as a special case of an HMM.

We briefly note that one can view our approach from the perspective of spectral clustering (Ng et al., 2002). A spectral clustering algorithm typically proceeds by constructing a graph Laplacian matrix from the data and performing a standard clustering algorithm (e.g., k-means) on reduced-dimensional points that correspond to the top eigenvalues of the Laplacian. We do not make use of a graph Laplacian, but we do make use of spectral methods for dimensionality reduction before clustering.

Agglomerative clustering refers to hierarchical grouping of n points using a bottom-up style algorithm (Ward Jr, 1963; Shanbehzadeh and Ogunbona, 1997). It is commonly used for its simplicity, but a naive implementation requires O(dn^3) time where d is the dimension of a point. Franti et al. (2000) presented a faster algorithm that requires O(γdn^2) time where γ is a data-dependent quantity which is typically much smaller than n. In our work, we use a variant of this last approach that has runtime O(γdmn) where m ≪ n is the number of active clusters we specify (Figure 5). We also remark that under our derivation, the dimension d is always equal to m, thus we express the runtime simply as O(γnm^2).

3 NOTATION

Let [n] denote the set {1, . . . , n}. Let [[Γ]] denote the indicator of a predicate Γ, taking value 1 if Γ is true and 0 otherwise. Given a matrix M, we let √M denote its element-wise square root and M^+ denote its Moore-Penrose pseudoinverse. Let I_{m×m} ∈ R^{m×m} denote the identity matrix. Let diag(v) denote the diagonal matrix with the vector v ∈ R^m appearing on its diagonal. Finally, let ‖v‖ denote the Euclidean norm of a vector v, and ‖M‖ denote the spectral norm of a matrix M.
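For readers following the algorithms in code, this notation maps directly onto standard numpy operations; the snippet below is only an illustrative correspondence under that assumption, not part of the algorithm itself.

```python
import numpy as np

M = np.array([[1.0, 4.0], [9.0, 0.0]])
v = np.array([3.0, 4.0])

sqrt_M = np.sqrt(M)                   # element-wise square root, our sqrt(M)
M_pinv = np.linalg.pinv(M)            # Moore-Penrose pseudoinverse M^+
I_2 = np.eye(2)                       # identity matrix I_{2x2}
D = np.diag(v)                        # diag(v)
euclidean_norm = np.linalg.norm(v)    # ||v|| for a vector
spectral_norm = np.linalg.norm(M, 2)  # ||M|| (largest singular value) for a matrix
indicator = int(euclidean_norm > 1)   # [[Gamma]] for a predicate Gamma
```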

4 BROWN MODEL DEFINITION

A Brown model is a 5-tuple (n, m, π, t, o) for integers n, m and functions π, t, o where

• [n] is a set of states that represent word types.
• [m] is a set of states that represent clusters.
• π(c) is the probability of generating c ∈ [m] in the first position of a sequence.
• t(c′|c) is the probability of generating c′ ∈ [m] given c ∈ [m].
• o(x|c) is the probability of generating x ∈ [n] given c ∈ [m].

In addition, the model makes the following assumption on the parameters o(x|c). This assumption comes from Brown et al. (1992), who require that the word clusters partition the vocabulary.

Assumption 4.1 (Brown et al. assumption). For each x ∈ [n], there is a unique C(x) ∈ [m] such that o(x|C(x)) > 0 and o(x|c) = 0 for all c ≠ C(x).

In other words, the model is a discrete HMM with a many-to-one deterministic mapping C : [n] → [m] from word types to clusters. Under the model, a sequence of N tokens (x_1, . . . , x_N) ∈ [n]^N has probability

p(x_1, . . . , x_N) = π(C(x_1)) × ∏_{i=1}^{N} o(x_i|C(x_i)) × ∏_{i=1}^{N−1} t(C(x_{i+1})|C(x_i)).
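To make the generative process concrete, here is a minimal sketch (ours, not from the paper's implementation) that evaluates this probability with numpy arrays laid out as in the matrix form introduced below: pi is a length-m vector, T[c', c] = t(c'|c), O[x, c] = o(x|c), and C maps each word to its cluster.

```python
import numpy as np

def brown_sequence_probability(tokens, pi, T, O, C):
    """Probability of a token sequence under a Brown model (pi, T, O).

    tokens: word indices x_1, ..., x_N
    pi:     length-m vector of initial cluster probabilities
    T:      m x m matrix, T[c2, c1] = t(c2 | c1)
    O:      n x m matrix, O[x, c] = o(x | c); each row has one non-zero entry
    C:      length-n array, C[x] = the unique cluster with O[x, C[x]] > 0
    """
    p = pi[C[tokens[0]]]
    for i, x in enumerate(tokens):
        p *= O[x, C[x]]                      # emission term
        if i + 1 < len(tokens):
            p *= T[C[tokens[i + 1]], C[x]]   # transition term
    return p
```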

Figure 3: Illustration of our clustering scheme on four words (dog, cat, ate, drank). (a) Original rows of √O. (b) After row-normalization.

An equivalent definition of a Brown model is given by organizing the parameters in matrix form. Under this definition, a Brown model has parameters (π, T, O) where π ∈ R^m is a vector and T ∈ R^{m×m}, O ∈ R^{n×m} are matrices whose entries are set to:

• π_c = π(c) for c ∈ [m]
• T_{c′,c} = t(c′|c) for c, c′ ∈ [m]
• O_{x,c} = o(x|c) for c ∈ [m], x ∈ [n]

Throughout the paper, we will assume that T, O have rank m. The following is an equivalent reformulation of Assumption 4.1 and will be important to the derivation of our algorithm.

Assumption 4.2 (Brown et al. assumption). Each row of O has exactly one non-zero entry.

5 CLUSTERING UNDER THE BROWN MODEL

In this section, we develop a method for clustering words based on the Brown model. The resulting algorithm is a simple two-step procedure: an application of SVD followed by agglomerative hierarchical clustering in Euclidean space.

5.1 AN OVERVIEW OF THE APPROACH

Suppose the parameter matrix O is known. Under Assumption 4.2, a simple way to recover the correct word clustering is as follows:

1. Compute M̄ ∈ R^{n×m} whose rows are the rows of √O normalized to have length 1.

2. Put words x, x′ in the same cluster iff M̄_x = M̄_{x′}, where M̄_x is the x-th row of M̄.

This works because Assumption 4.2 implies that the rows of √O corresponding to words from the same cluster lie along the same coordinate axis in R^m. Row-normalization puts these rows precisely at the standard basis vectors. See Figure 3 for illustration.
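A few lines of numpy make this idealized procedure concrete (a sketch assuming O is known exactly; the function name is ours):

```python
import numpy as np

def cluster_from_known_O(O):
    """Idealized recovery of the word clustering when O is known (Section 5.1).

    O: n x m emission matrix satisfying Assumption 4.2 (one non-zero per row).
    Returns a length-n array of cluster labels.
    """
    M_bar = np.sqrt(O)
    M_bar /= np.linalg.norm(M_bar, axis=1, keepdims=True)  # row-normalize
    # Rows from the same cluster are now identical standard basis vectors,
    # so the cluster of word x is the position of its single non-zero entry.
    return np.argmax(M_bar, axis=1)
```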


In Section 5.2, we prove that the rows of √O can be recovered, up to an orthogonal transformation Q ∈ R^{m×m}, just from unigram and bigram word probabilities (which can be estimated from observed sequences). It is clear that the correctness of the above procedure is unaffected by the orthogonal transformation.

Let M denote the row-normalized form of √O Q^⊤: then M still satisfies the property that M_x = M_{x′} iff x, x′ belong to the same cluster. We give an algorithm to estimate this M from a sequence of words in Figure 4.

5.2 SPECTRAL ESTIMATION OF OBSERVATION PARAMETERS

To derive a method for estimating the observation parameter √O (up to an orthogonal transformation), we first define the following random variables to model a single random sentence. Let (X_1, . . . , X_N) ∈ [n]^N be a random sequence of tokens drawn from the Brown model, along with the corresponding (hidden) cluster sequence (C_1, . . . , C_N) ∈ [m]^N; independently, pick a position I ∈ [N − 1] uniformly at random. Let B ∈ R^{n×n} be a matrix of bigram probabilities, u, v ∈ R^n vectors of unigram probabilities, and π̃ ∈ R^m a vector of cluster probabilities:

B_{x,x′} := P(X_I = x, X_{I+1} = x′)   ∀x, x′ ∈ [n]
u_x := P(X_I = x)   ∀x ∈ [n]
v_x := P(X_{I+1} = x)   ∀x ∈ [n]
π̃_c := P(C_I = c)   ∀c ∈ [m].

We assume that diag(π̃) has rank m; note that this assumption is weaker than requiring diag(π) to have rank m. We will consider a matrix Ω ∈ R^{n×n} defined as

Ω := diag(u)^{-1/2} B diag(v)^{-1/2}.   (1)

Theorem 5.1. Let U ∈ R^{n×m} be the matrix of m left singular vectors of Ω corresponding to nonzero singular values. Then there exists an orthogonal matrix Q ∈ R^{m×m} such that U = √O Q^⊤.

To prove Theorem 5.1, we need to examine the structure of the matrix Ω. The following matrices A, Ã ∈ R^{n×m} will be important for this purpose:

A = diag(Oπ̃)^{-1/2} O diag(π̃)^{1/2}
Ã = diag(OTπ̃)^{-1/2} OT diag(π̃)^{1/2}

The first lemma shows that Ω can be decomposed into A and Ã^⊤.

Lemma 5.1. Ω = A Ã^⊤.

Proof. It can be algebraically verified from the definitions of B, u, v that B = O diag(π̃)(OT)^⊤, u = Oπ̃, and v = OTπ̃. Plugging these expressions into Eq. (1), we have

Ω = diag(Oπ̃)^{-1/2} O diag(π̃)^{1/2} (diag(OTπ̃)^{-1/2} OT diag(π̃)^{1/2})^⊤ = A Ã^⊤.

The second lemma shows that A is in fact the desired matrix. The proof of this lemma crucially depends on the disjoint-cluster assumption of the Brown model.

Lemma 5.2. A = √O and A^⊤A = I_{m×m}.

Proof. By Assumption 4.2, the x-th entry of Oπ̃ has value O_{x,C(x)} π̃_{C(x)}, and the (x, C(x))-th entry of O diag(π̃)^{1/2} has value O_{x,C(x)} π̃_{C(x)}^{1/2}. Thus the (x, C(x))-th entry of A is

A_{x,C(x)} = O_{x,C(x)} π̃_{C(x)}^{1/2} / (O_{x,C(x)} π̃_{C(x)})^{1/2} = O_{x,C(x)}^{1/2}.

The columns of A have disjoint supports since A has the same sparsity pattern as O. Furthermore, the squared ℓ_2 (Euclidean) norm of any column of A equals the ℓ_1 norm of the corresponding column of O, which is 1 since each column of O is a probability distribution. This implies A^⊤A = I_{m×m}.

Now we give a proof of the main theorem.

Proof of Theorem 5.1. The orthogonal projection matrix onto range(Ω) is given by UU^⊤ and also by Ω(Ω^⊤Ω)^+Ω^⊤. Hence from Lemmas 5.1 and 5.2, we have

UU^⊤ = Ω(Ω^⊤Ω)^+Ω^⊤ = (AÃ^⊤)(ÃA^⊤AÃ^⊤)^+(AÃ^⊤)^⊤ = (AÃ^⊤)(ÃÃ^⊤)^+(AÃ^⊤)^⊤ = AΠA^⊤

where Π = Ã^⊤(ÃÃ^⊤)^+Ã is the orthogonal projection matrix onto the row space of Ã. But since Ã has rank m, its row space is all of R^m and thus Π = I_{m×m}. Then we have UU^⊤ = AA^⊤ where both U and A have orthonormal columns (Lemma 5.2). This implies that there is an orthogonal matrix Q ∈ R^{m×m} such that U = AQ^⊤.
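As an informal sanity check of Theorem 5.1, the following sketch builds the population quantities B, u, v for a tiny hypothetical Brown model (all numbers below are made up), forms Ω as in Eq. (1), and verifies that the span of its top-m left singular vectors matches that of √O:

```python
import numpy as np

# A tiny hypothetical Brown model: n = 4 word types, m = 2 clusters.
C = np.array([0, 0, 1, 1])                  # word -> cluster (Assumption 4.1)
O = np.zeros((4, 2))
O[np.arange(4), C] = [0.7, 0.3, 0.4, 0.6]   # one non-zero per row; columns sum to 1
T = np.array([[0.8, 0.3],
              [0.2, 0.7]])                  # T[c2, c1] = t(c2 | c1)
pi_tilde = np.array([0.5, 0.5])             # cluster probabilities at position I

# Population quantities from Lemma 5.1: B = O diag(pi~) (O T)^T, u = O pi~, v = O T pi~.
B = O @ np.diag(pi_tilde) @ (O @ T).T
u, v = O @ pi_tilde, O @ T @ pi_tilde
Omega = np.diag(u ** -0.5) @ B @ np.diag(v ** -0.5)

U = np.linalg.svd(Omega)[0][:, :2]          # m left singular vectors of Omega
# Theorem 5.1: U = sqrt(O) Q^T for some orthogonal Q, hence U U^T = sqrt(O) sqrt(O)^T.
assert np.allclose(U @ U.T, np.sqrt(O) @ np.sqrt(O).T)
```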

5.3 ESTIMATION FROM SAMPLES

In Figure 4, we give an algorithm for computing an estimate of M from a sample of words (x_1, . . . , x_N) ∈ [n]^N (where M is described in Section 5.1). The algorithm estimates unigram and bigram word probabilities u, v, B to form a plug-in estimate Ω̂ of Ω (defined in Eq. (1)), computes a low-rank SVD of a sparse matrix, and normalizes the rows of the resulting left singular vector matrix. The following theorem implies the consistency of our algorithm, assuming the consistency of Ω̂.

Theorem 5.2. Let ε := ‖Ω̂ − Ω‖/σ_m(Ω), where σ_m(Ω) is the m-th largest singular value of Ω. If ε ≤ 0.07 min_{x∈[n]} O_{x,C(x)}^{1/2}, then the word embedding f : x → M̂_x (where M̂_x is the x-th row of M̂) satisfies the following property: for all x, x′, x′′ ∈ [n],

C(x) = C(x′) ≠ C(x′′)  ⟹  ‖f(x) − f(x′)‖ < ‖f(x) − f(x′′)‖

(i.e., the embedding of any word x is closer to that of other words x′ from the same cluster than it is to that of any word x′′ from a different cluster).


Input: sequence of N ≥ 2 words (x_1, . . . , x_N) ∈ [n]^N; number of clusters m; smoothing parameter κ.
Output: matrix M̂ ∈ R^{n×m} defining f : x → M̂_x ∀x ∈ [n].

1. Compute B̂ ∈ R^{n×n}, û ∈ R^n, and v̂ ∈ R^n where

   B̂_{x,x′} := (1/(N−1)) Σ_{i=1}^{N−1} [[x_i = x, x_{i+1} = x′]]   ∀x, x′ ∈ [n]
   û_x := (1/(N−1)) Σ_{i=1}^{N−1} [[x_i = x]] + κ/(N−1)   ∀x ∈ [n]
   v̂_x := (1/(N−1)) Σ_{i=1}^{N−1} [[x_{i+1} = x]] + κ/(N−1)   ∀x ∈ [n]

2. Compute a rank-m SVD of the sparse matrix Ω̂ := diag(û)^{-1/2} B̂ diag(v̂)^{-1/2}. Let Û ∈ R^{n×m} be a matrix of the m left singular vectors of Ω̂ corresponding to the m largest singular values.

3. Let M̂ be the result of normalizing every row of Û to have length 1.

Figure 4: Estimation of M from samples.
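A compact implementation of Figure 4 might look as follows; this is a sketch only, assuming numpy and scipy (for the sparse SVD) and a strictly positive smoothing parameter κ so that the smoothed counts are non-zero.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds

def estimate_M(tokens, n, m, kappa=1.0):
    """Sketch of the estimator in Figure 4.

    tokens: token sequence (x_1, ..., x_N) as integers in [0, n)
    n: vocabulary size; m: number of clusters; kappa: smoothing parameter (> 0)
    Returns M_hat (n x m), whose rows define the word embedding f(x) = M_hat[x].
    """
    x, y = np.asarray(tokens[:-1]), np.asarray(tokens[1:])
    N1 = len(x)  # N - 1 bigram positions
    B_hat = coo_matrix((np.ones(N1) / N1, (x, y)), shape=(n, n)).tocsr()
    u_hat = np.bincount(x, minlength=n) / N1 + kappa / N1
    v_hat = np.bincount(y, minlength=n) / N1 + kappa / N1
    # Omega_hat = diag(u_hat)^{-1/2} B_hat diag(v_hat)^{-1/2}, kept sparse.
    Omega_hat = B_hat.multiply(u_hat[:, None] ** -0.5).multiply(v_hat[None, :] ** -0.5)
    U, _, _ = svds(Omega_hat.tocsc(), k=m)  # m largest singular values/vectors
    # Row-normalize (words that never occur would have all-zero rows; a real
    # implementation should guard against that case).
    return U / np.linalg.norm(U, axis=1, keepdims=True)
```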

The property established by Theorem 5.2 (proved in the appendix) allows many distance-based clustering algorithms to recover the correct clustering (e.g., single-linkage, average-linkage; see Balcan et al., 2008). Moreover, it is possible to establish finite sample complexity bounds for the estimation error of Ω̂ (and we do so for a simplified scenario in the supplementary Appendix C).

In practice, it is important to regularize the estimates û and v̂ using a smoothing parameter κ ≥ 0. This can be viewed as adding pseudocounts to alleviate the noise from infrequent words, and it has a significant effect on the resulting representations. The practical importance of smoothing is also seen in previous methods using CCA (Cohen et al., 2013; Hardoon et al., 2004).

Another practical consideration is the use of richer context. So far, the context used for the token X_I is just the next token X_{I+1}; hence, the spectral estimation is based just on unigram and bigram probabilities. However, it is straightforward to generalize the technique to use other context; details are in Appendix A. For instance, if we use the previous and next tokens (X_{I−1}, X_{I+1}) as context, then we form Ω̂ ∈ R^{n×2n} from B̂ ∈ R^{n×2n}, û ∈ R^n, v̂ ∈ R^{2n}; however, we still extract M̂ ∈ R^{n×m} from Ω̂ in the same way to form the word embedding.

Input: vectors µ(1), . . . , µ(n) ∈ R^m corresponding to word types [n] ordered in decreasing frequency.
Output: hierarchical clustering of the input vectors.
Tightening: Given a set of clusters C, the subroutine tighten(c) for c ∈ C consists of the following three steps:

   nearest(c) := arg min_{c′∈C: c′≠c} d(c, c′)
   lowerbound(c) := min_{c′∈C: c′≠c} d(c, c′)
   tight(c) := True

Main body:
1. Initialize active clusters C = {{µ(1)}, . . . , {µ(m)}} and call tighten(c) for all c ∈ C.
2. For i = m + 1 to n + m − 1:
   (a) If i ≤ n: let c := {µ(i)}, call tighten(c), and let C := C ∪ {c}.
   (b) Let c∗ := arg min_{c∈C} lowerbound(c).
   (c) While tight(c∗) is False:
       i. Call tighten(c∗).
       ii. Let c∗ := arg min_{c∈C} lowerbound(c).
   (d) Merge c∗ and nearest(c∗) in C.
   (e) For each c ∈ C: if nearest(c) ∈ {c∗, nearest(c∗)}, set tight(c) := False.
Figure 5: Variant of Ward’s algorithm from Section 5.4. 5.4 AGGLOMERATIVE CLUSTERING As established in Theorem 5.2, the word embedding ob- tained by mapping words to their corresponding rows of ˆ M permits distance-based clustering algorithms to recover the correct clustering. However, with small sample sizes and model approximation errors, the property from Theo- rem 5.2 may not hold exactly. Therefore, we propose to compute a hierarchical clustering of the word embeddings, with the goal of finding the correct clustering (or at least a good clustering) as some pruning of the resulting tree. Simple agglomerative clustering algorithms can provably recover the correct clusters when idealized properties (such as that from Theorem 5.2) hold (Balcan et al., 2008), and can also be seen to be optimizing a sensible objective re- gardless (Dasgupta and Long, 2005). These algorithms also yield a hierarchy of word types—just as the original Brown clustering algorithm. We use a form of average-linkage agglomerative clustering called Ward’s algorithm (Ward Jr, 1963), which is particu- larly suited for hierarchical clustering in Euclidean spaces. In this algorithm, the cost of merging clusters c and c′ is defined as d(c, c′) = |c||c′| |c| + |c′| ||µc − µc′||2 (2) where |c| refers to the number of elements in cluster c and µc = |c|−1

u∈c u is the mean of cluster c. The algorithm

starts with every point (word) in its own cluster, and repeat-
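Concretely, the merge cost of Eq. (2) amounts to a few lines (a sketch; the function name is ours and cluster members are assumed to be numpy vectors):

```python
import numpy as np

def ward_merge_cost(c1, c2):
    """Ward merge cost d(c, c') from Eq. (2); c1 and c2 are lists of points in R^m."""
    n1, n2 = len(c1), len(c2)
    mu1, mu2 = np.mean(c1, axis=0), np.mean(c2, axis=0)
    return (n1 * n2) / (n1 + n2) * np.sum((mu1 - mu2) ** 2)
```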

Figure 5 sketches a variant of Ward's algorithm that only considers merges among (at most) m + 1 clusters at a time. The initial m + 1 (singleton) clusters correspond to the m + 1 most frequent words (according to û); after a merge, the next most frequent word (if one exists) is used to initialize a new singleton cluster. This heuristic is also adopted by the original Brown algorithm, and is known to be very effective.

Using an implementation trick from Franti et al. (2000), the runtime of the algorithm is O(γnm^2), where γ is a data-dependent constant often much smaller than m, as opposed to O(nm^3) in a naive implementation in which we search for the closest pair among O(m^2) pairs at every merge. The basic idea of Franti et al. (2000) is the following. For each cluster, we keep an estimate of the lower bound on the distance to the nearest cluster. We also track whether this lower bound is tight; in the beginning, every bound is tight. When searching for the nearest pair, we simply look for a cluster with the smallest lower bound among m clusters instead of O(m^2) cluster pairs. If the cluster has a tight lower bound, we merge it with its nearest cluster. Otherwise, we tighten its bound and again look for a cluster with the smallest bound. Thus γ is the effective number of searches at each iteration. At a merge, the bound of a cluster whose nearest cluster is either of the two merged clusters becomes loose. We report empirical values of γ in our experimental study (see Table 3).

6 EXPERIMENTS

To evaluate the effectiveness of our approach, we used the clusters from our algorithm as additional features in supervised models for NER. We then compared the improvement in performance and also the time required to derive the clusters against those of the Brown clustering algorithm. Additionally, we examined the mutual information (MI) of the derived clusters on the training corpus:

Σ_{c,c′} (count(c, c′)/N) log( count(c, c′) N / (count(c) count(c′)) )   (3)

where N is the number of tokens in the corpus, count(c) is the number of times cluster c appears, and count(c, c′) is the number of times clusters c, c′ appear consecutively. Note that this is the quantity the Brown algorithm directly maximizes (Brown et al., 1992).
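A direct computation of Eq. (3) from a corpus and a word-to-cluster map might look like the following sketch (the function name is ours):

```python
import math
from collections import Counter

def cluster_mutual_information(tokens, C):
    """Sketch of Eq. (3): mutual information of consecutive cluster pairs in a corpus.

    tokens: list of word identifiers; C: dict/array mapping each word to its cluster.
    """
    N = len(tokens)
    unigrams = Counter(C[x] for x in tokens)
    bigrams = Counter((C[x], C[y]) for x, y in zip(tokens, tokens[1:]))
    return sum((n_cc / N) * math.log(n_cc * N / (unigrams[c1] * unigrams[c2]))
               for (c1, c2), n_cc in bigrams.items())
```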

6.1 EXPERIMENTAL SETTINGS

For NER experiments, we used the scripts provided by Turian et al. (2010). We used the greedy perceptron for NER (Ratinov and Roth, 2009) with the standard features as our baseline models. We used the CoNLL 2003 dataset for NER with the standard train/dev/test split. For the choice of unlabeled text data, we used the Reuters-RCV1 corpus, which contains 205 million tokens with 1.6 million distinct word types. To keep the size of the vocabulary manageable and also to reduce noise from infrequent words, we used only a selected number of the most frequent word types and replaced all other types in the corpus with a special token. For the size of the vocabulary, we used 50,000 and 300,000.

Table 1: Performance gains in NER.

              vocab   context   dev     test
Baseline      -       -         90.03   84.39
Spectral      50k     LR1       92      86.72
(κ = 200)     300k    LR2       92.31   87.76
Brown         50k     -         92      88.56
              300k    -         92.68   88.76

Table 2: Mutual information computed as in Eq. (3) on the RCV1 corpus.

              vocab size   context   MI
Spectral      50k          LR2       1.48
(κ = 5000)    300k         LR2       1.54
Brown         50k          -         1.52
              300k         -         1.6

Our algorithm can be broken down into two stages: the SVD stage (Figure 4) and the clustering stage (Figure 5). In the SVD stage, we need to choose the number of clusters m and the smoothing parameter κ. As mentioned, we can easily define Ω to incorporate information beyond one word to the right. We experimented with the following configurations for context:

1. R1 (Ω ∈ R^{n×n}): 1 word to the right. This is the version presented in Figure 4.
2. LR1 (Ω ∈ R^{n×2n}): 1 word to the left/right.
3. LR2 (Ω ∈ R^{n×4n}): 2 words to the left/right.

6.2 COMPARISON TO THE BROWN ALGORITHM: QUALITY

There are multiple ways to evaluate the quality of clusters. We considered the improvement in the F1 score in NER from using the clusters as additional features. We also examined the MI on the training corpus. For all experiments in this section, we used 1,000 clusters for both the spectral algorithm (i.e., m = 1000) and the Brown algorithm.

6.2.1 NER

In NER, there is significant improvement in the F1 score from using the clusters as additional features (Table 1).


Table 3: Speed and performance comparison with the Brown algorithm for different numbers of clusters and vocabulary sizes. In all the reported runtimes, we exclude the time to read and write data. We report the F1 scores on the NER dev set; for the spectral algorithm, we report the best scores. The SVD, cluster, and total columns give the spectral algorithm's runtime.

m      vocab   γ       SVD       cluster    total      Brown runtime   Ratio (%)   Spectral F1   Brown F1
200    50k     3.35    4m24s     13s        4m37s      10m37s          43.48       91.53         90.79
400    50k     5.17    6m39s     1m8s       7m47s      37m16s          20.89       91.73         91.21
600    50k     9.80    5m29s     3m1s       8m30s      1h33m55s        9.05        91.68         91.79
800    50k     12.64   9m26s     6m59s      16m25s     2h20m40s        11.67       91.81         91.83
1000   50k     12.68   11m10s    10m25s     21m35s     3h37m           9.95        92.00         92.00
1000   300k    13.77   59m38s    1h4m37s    2h4m15s    22h19m37s       9.28        92.31         92.68

The dev F1 score is improved from 90.03 to 92 with either spectral or Brown clusters using the 50k vocabulary size; it is improved to 92.31 with the spectral clusters and to 92.68 with the Brown clusters using the 300k vocabulary size. The spectral clusters are a little behind the Brown clusters in the test set results. However, we remark that the well-known discrepancy between the dev set and the test set in the CoNLL 2003 dataset makes a conclusive interpretation difficult.

For example, Turian et al. (2010) report that the F1 score using the embeddings of Collobert and Weston (2008) is higher than the F1 score using the Brown clusters on the dev set (92.46 vs 92.32) but lower on the test set (87.96 vs 88.52).

6.2.2 MI

Table 2 shows the MI computed as in Eq. (3) on the RCV1 corpus. The Brown algorithm optimizes the MI directly and generally achieves higher MI scores than the spectral algorithm. However, the spectral algorithm also achieves a surprisingly respectable level of MI even though the MI is not its objective. That is, the Brown algorithm specifically merges clusters in order to maximize the MI score in Eq. (3). In contrast, the spectral algorithm first recovers the model parameters using SVD and performs hierarchical clustering according to the parameter estimates, without any explicit concern for the MI score.

6.3 COMPARISON TO THE BROWN ALGORITHM: SPEED

To see the runtime difference between our algorithm and the Brown algorithm, we measured how long it takes to extract clusters from the RCV1 corpus for various numbers of clusters.

In all the reported runtimes, we exclude the time to read and write data. We report results with 200, 400, 600, 800, and 1,000 clusters. All timing experiments were done on a machine with dual-socket, 8-core, 2.6GHz Intel Xeon E5-2670 (Sandy Bridge) processors. The implementations for both algorithms were written in C++. The spectral algorithm also made use of Matlab for matrix calculations such as the SVD calculation.

Table 3 shows the runtimes required to extract these clusters as well as the F1 scores on the NER dev set obtained with these clusters. The spectral algorithm is considerably faster than the Brown algorithm while providing comparable improvement in the F1 scores. The runtime difference becomes more prominent as the number of clusters increases. Moreover, the spectral algorithm scales much better with larger vocabulary size. With 1,000 clusters and 300k vocabulary size, the Brown algorithm took over 22 hours whereas the spectral algorithm took 2 hours, 4 minutes, and 15 seconds, less than 10% of the time the Brown algorithm takes.

We also note that for the Brown algorithm, the improvement varies significantly depending on how many clusters are used; it is 0.76 with 200 clusters but 1.97 with 1,000 clusters. For the spectral algorithm, this seems to be less the case; the improvement is 1.5 with 200 clusters and 1.97 with 1,000 clusters.

6.3.1 Discussion on Runtimes

The final asymptotic runtime is O(N + γnm^2) for the spectral algorithm and O(N + nm^2) for the Brown algorithm, where N is the size of the corpus, n is the number of distinct word types, m is the number of clusters, and γ is a data-dependent constant. Thus it may be puzzling why the spectral algorithm is significantly faster in practice. We explicitly discuss the issue in this section.

The spectral algorithm proceeds in two stages. First, it constructs a scaled covariance matrix in O(N) time and performs a rank-m SVD of this matrix. Table 3 shows that the SVD scales well with the value of m and the size of the corpus. Second, the algorithm performs hierarchical clustering in O(γnm^2) time. This stage consists of O(γnm) calls to an O(m) time function that computes Eq. (2), that is,

d(c, c′) = (|c||c′| / (|c| + |c′|)) ‖µ_c − µ_{c′}‖^2.

This function is quite simple: it calculates a scaled distance between two vectors in R^m. Moreover, it avails itself readily to existing optimization techniques such as vectorization.¹ Finally, we found that the empirical value of γ was typically small: it ranged from 3.35 to 13.77 in our experiments reported in Table 3 (higher m required higher γ).


Figure 6: Effect of the choice of κ and context (R1, LR1, LR2) on (a) MI and (b) NER dev F1 score, for κ ranging from 10 to 10000. We used 1,000 clusters on RCV1 with vocabulary size 50k. In (a), the horizontal line is the MI achieved by Brown clusters. In (b), the top horizontal line is the F1 score achieved with Brown clusters and the bottom horizontal line is the baseline F1 score achieved without using clusters.

computeL2usingOld(s, t, u, v, w) =
    L2[v][w]
    − q2[v][s] − q2[s][v] − q2[w][s] − q2[s][w]
    − q2[v][t] − q2[t][v] − q2[w][t] − q2[t][w]
    + (p2[v][s] + p2[w][s]) * log((p2[v][s] + p2[w][s]) / ((p1[v] + p1[w]) * p1[s]))
    + (p2[s][v] + p2[s][w]) * log((p2[s][v] + p2[s][w]) / ((p1[v] + p1[w]) * p1[s]))
    + (p2[v][t] + p2[w][t]) * log((p2[v][t] + p2[w][t]) / ((p1[v] + p1[w]) * p1[t]))
    + (p2[t][v] + p2[t][w]) * log((p2[t][v] + p2[t][w]) / ((p1[v] + p1[w]) * p1[t]))
    + q2[v][u] + q2[u][v] + q2[w][u] + q2[u][w]
    − (p2[v][u] + p2[w][u]) * log((p2[v][u] + p2[w][u]) / ((p1[v] + p1[w]) * p1[u]))
    − (p2[u][v] + p2[u][w]) * log((p2[u][v] + p2[u][w]) / ((p1[v] + p1[w]) * p1[u]))

Figure 7: An O(1) function that is called O(nm^2) times in Liang's implementation of the Brown algorithm, accounting for over 40% of the runtime. Similar functions account for the vast majority of the runtime. The values in the arrays L2, q2, p2, p1 are precomputed: p2[v][w] = p(v, w), i.e., the probability of cluster v being followed by cluster w; p1[v] = p(v) is the probability of cluster v; q2[v][w] = p(v, w) log((p(v)p(w))^{-1} p(v, w)) is the contribution of the mutual information between clusters v and w. The function recomputes L2[v][w], which is the loss in log-likelihood if clusters v and w are merged; it updates L2 after clusters s and t have been merged to form a new cluster u. There are many operations involved in this calculation: 6 divisions, 12 multiplications, 36 additions (26 additions and 10 subtractions), and 6 log operations.

In contrast, while the main body of the Brown algorithm requires O(N + nm^2) time, the constant factors are high due to the fairly complex book-keeping that is required. For example, the function in Figure 7 (obtained from Liang's implementation) is invoked O(nm^2) times in total: specifically, whenever two clusters s and t are merged to form a new cluster u (this happens O(n) times), the function is called O(m^2) times, for all pairs of clusters v, w such that v and w are not equal to s, t, or u. The function recomputes the loss in likelihood if clusters v and w are merged, after s and t are merged to form u. It requires a relatively large number of arithmetic operations, leading to high constant factors. Calls to this function alone take over 40% of the runtime for the Brown algorithm; similar functions account for the vast majority of the algorithm's runtime. It is not clear that this overhead can be reduced.

¹ Many linear algebra libraries automatically support vectorization. For instance, the Eigen library in our implementation enables vectorization by default, which gave a 2-3x speedup in our experiments.

6.4 EFFECT OF THE CHOICE OF κ AND CONTEXT

Figure 6 shows the MI and the F1 score on the NER dev set for various choices of κ and context. For NER, values of κ around 100-200 give good performance. For the MI, the value of κ needs to be much larger. LR1 and LR2 perform much better than R1 but are very similar to each other across the results, suggesting that words in the immediate vicinity are necessary and nearly sufficient for these tasks.

7 CONCLUSION

In this paper, we have presented a new and faster alternative to the Brown clustering algorithm. Our algorithm has a provable guarantee of recovering the underlying model parameters. This approach first uses SVD to consistently estimate low-dimensional representations of word types that reveal their originating clusters by exploiting the implicit disjoint-cluster assumption of the Brown model. Then agglomerative clustering is performed over these representations to build a hierarchy of word types. The resulting clusters give improvements in NER performance competitive with those from the Brown algorithm's clusters, but the spectral algorithm is significantly faster.

There are several areas for future work. One can try to speed up the algorithm even more via a top-down rather than bottom-up approach to hierarchical clustering, for example by recursively running the 2-means algorithm. Experiments with the clusters in tasks other than NER (e.g., dependency parsing), as well as larger-scale experiments, can help further verify the quality of the clusters and highlight the difference between the spectral algorithm and the Brown algorithm.

Acknowledgments

We thank Dean Foster, Matus Telgarsky, and Terry Koo for helpful discussions. This work was made possible by a research grant from Bloomberg's Knowledge Engineering team. Kim was also supported by a Samsung Scholarship.

A INCORPORATING RICHER CONTEXT

We assume a function φ such that for i ∈ [N], it returns a set of positions other than i. For example, we may define φ(i) = {i − 1, i + 1} to look at one position to the left and to the right. Let s = |φ(i)| and enumerate the elements of φ(i) as j_1, . . . , j_s. Define B^{(j)} ∈ R^{n×n}, v^{(j)} ∈ R^n for all j ∈ φ(i) as follows:

B^{(j)}_{x,x′} = P(X_i = x, X_j = x′)   ∀x, x′ ∈ [n]
v^{(j)}_x = P(X_j = x)   ∀x ∈ [n]

The new definitions of B ∈ R^{n×ns}, v ∈ R^{ns} are given by B = [B^{(j_1)}, . . . , B^{(j_s)}] and v = [(v^{(j_1)})^⊤, . . . , (v^{(j_s)})^⊤]^⊤. Defining Ω ∈ R^{n×ns} as in Eq. (1), it is easy to verify Theorem 5.1 using similar techniques.
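For illustration, the empirical counterpart of this construction for φ(i) = {i − 1, i + 1} (the LR1 context of Section 6) can be formed by horizontally stacking left and right co-occurrence matrices. The sketch below assumes numpy/scipy and, for simplicity, drops the two boundary positions; the helper names are ours.

```python
import numpy as np
from scipy.sparse import coo_matrix, hstack

def cooccurrence(rows, cols, n, num_positions):
    """Empirical co-occurrence matrix with weight 1/num_positions per observed pair."""
    data = np.ones(len(rows)) / num_positions
    return coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()

def context_matrices_lr1(tokens, n):
    """B in R^{n x 2n}, u in R^n, v in R^{2n} for the context phi(i) = {i-1, i+1}."""
    t = np.asarray(tokens)
    center, left, right = t[1:-1], t[:-2], t[2:]
    P = len(center)
    B = hstack([cooccurrence(center, left, n, P), cooccurrence(center, right, n, P)])
    u = np.bincount(center, minlength=n) / P
    v = np.concatenate([np.bincount(left, minlength=n),
                        np.bincount(right, minlength=n)]) / P
    return B, u, v
```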

B PROOF OF THEOREM 5.2

Write the rank-m SVD of Ω as Ω = USV^⊤, and similarly write the rank-m SVD of Ω̂ as ÛŜV̂^⊤. Since Ω has rank m, it follows by Eckart-Young that ‖ÛŜV̂^⊤ − Ω̂‖ ≤ ‖Ω − Ω̂‖. Therefore, by the triangle inequality,

‖ÛŜV̂^⊤ − USV^⊤‖ ≤ 2‖Ω − Ω̂‖ = 2εσ_m(Ω).

This implies, via applications of Wedin's theorem and Weyl's inequality,

‖U_⊥^⊤ Û‖ ≤ 2ε   and   ‖Û_⊥^⊤ U‖ ≤ 2ε/(1 − 2ε)   (4)

where U_⊥ ∈ R^{n×(n−m)} is a matrix whose columns form an orthonormal basis for the orthogonal complement of the range of U, and Û_⊥ ∈ R^{n×(n−m)} is similarly defined (and note that ε < 1/2 by assumption).

Recall that by Theorem 5.1, there exists an orthogonal matrix Q ∈ R^{m×m} such that U = √O Q^⊤. Define Q̂ := Û^⊤√O = Û^⊤UQ, and, for all c ∈ [m], q̂_c := Q̂e_c. The fact that ‖UQe_c‖ = 1 implies

‖q̂_c‖ = (1 − ‖Û_⊥Û_⊥^⊤UQe_c‖^2)^{1/2} ≤ 1.

Therefore, by Eq. (4),

1 ≥ ‖q̂_c‖ ≥ ‖q̂_c‖^2 ≥ 1 − (2ε/(1 − 2ε))^2.   (5)

We also have, for c ≠ c′,

q̂_c^⊤ q̂_{c′} ≤ ‖Û_⊥^⊤UQe_c‖ ‖Û_⊥^⊤UQe_{c′}‖ ≤ (2ε/(1 − 2ε))^2,   (6)

where the first inequality follows by Cauchy-Schwarz, and the second inequality follows from Eq. (4). Therefore, by Eq. (5) and Eq. (6), we have for c ≠ c′,

‖q̂_c − q̂_{c′}‖^2 ≥ 2(1 − 2(2ε/(1 − 2ε))^2).   (7)

Let ō_x := O_{x,C(x)}^{1/2}. Recall that (√O)^⊤e_x = ō_x e_{C(x)} ∈ R^m, so Q̂(√O)^⊤e_x = ō_x Q̂e_{C(x)} = ō_x q̂_{C(x)}. By the definition of Q̂, we have

Û − √O Q̂^⊤ = Û − UU^⊤Û = U_⊥U_⊥^⊤Û.

This implies, for any x ∈ [n],

‖Û^⊤e_x − ō_x q̂_{C(x)}‖ = ‖(Û − √O Q̂^⊤)^⊤e_x‖ = ‖Û^⊤U_⊥U_⊥^⊤e_x‖ ≤ 2ε   (8)

by Eq. (4). Moreover, by the triangle inequality,

|‖Û^⊤e_x‖ − ō_x ‖q̂_{C(x)}‖| ≤ 2ε.   (9)

Since M̂^⊤e_x = ‖Û^⊤e_x‖^{-1} Û^⊤e_x, we have

‖M̂^⊤e_x − q̂_{C(x)}‖ = (1/‖Û^⊤e_x‖) ‖Û^⊤e_x − ‖Û^⊤e_x‖ q̂_{C(x)}‖
  ≤ (1/ō_x) ‖Û^⊤e_x − ō_x q̂_{C(x)}‖ + |1 − ‖q̂_{C(x)}‖| + (1/ō_x) |ō_x ‖q̂_{C(x)}‖ − ‖Û^⊤e_x‖|
  ≤ 4ε/ō_x + (2ε/(1 − 2ε))^2,   (10)

where the first inequality follows by the triangle inequality and norm homogeneity, and the second inequality uses Eq. (8), Eq. (9), and Eq. (5). Using Eq. (10), we may upper bound the distance ‖M̂^⊤e_x − M̂^⊤e_{x′}‖ when C(x) = C(x′); using Eq. (7) and Eq. (10), we may lower bound the distance ‖M̂^⊤e_x − M̂^⊤e_{x′′}‖ when C(x) ≠ C(x′′). The theorem then follows by invoking the assumption on ε.


C SAMPLE COMPLEXITY

Instead of estimating B, u, and v from a single long sentence, we estimate them (via maximum likelihood) using N i.i.d. sentences, each of length 2. We make this simplification because dealing with a single long sentence requires the use of tail inequalities for fast-mixing Markov chains, which is rather involved. Using N i.i.d. sentences allows us to appeal to standard tail inequalities such as Chernoff bounds because the maximum likelihood estimates of B, u, and v are simply sample averages over the N i.i.d. sentences. Since the length of each sentence is 2, we can omit the random choice of index (since it is always 1). Call these N sentences (X_1^{(1)}, X_2^{(1)}), . . . , (X_1^{(N)}, X_2^{(N)}) ∈ [n]^2.

We use the following error decomposition for Ω̂ in terms of the estimation errors for B̂, û, and v̂:

Ω̂ − Ω = (I − diag(u)^{-1/2} diag(û)^{1/2})(Ω̂ − Ω)
  + (Ω̂ − Ω)(I − diag(v̂)^{1/2} diag(v)^{-1/2})
  − (I − diag(u)^{-1/2} diag(û)^{1/2})(Ω̂ − Ω)(I − diag(v̂)^{1/2} diag(v)^{-1/2})
  + (I − diag(u)^{-1/2} diag(û)^{1/2})Ω
  + Ω(I − diag(v̂)^{1/2} diag(v)^{-1/2})
  − (I − diag(u)^{-1/2} diag(û)^{1/2})Ω(I − diag(v̂)^{1/2} diag(v)^{-1/2})
  + diag(u)^{-1/2}(B̂ − B) diag(v)^{-1/2}.

Above, I denotes the n × n identity matrix. Provided that ε_1 := ‖I − diag(u)^{-1/2} diag(û)^{1/2}‖ and ε_2 := ‖I − diag(v̂)^{1/2} diag(v)^{-1/2}‖ are small enough, we have

‖Ω̂ − Ω‖ ≤ ((ε_1 + ε_2 + ε_1ε_2)/(1 − ε_1 − ε_2)) ‖Ω‖   (11)
  + ‖diag(u)^{-1/2}(B̂ − B) diag(v)^{-1/2}‖/(1 − ε_1 − ε_2).   (12)

Observe that

ε_1 = max_{x∈[n]} |1 − (û_x/u_x)^{1/2}|,   ε_2 = max_{x∈[n]} |1 − (v̂_x/v_x)^{1/2}|.

Using Bernstein's inequality and union bounds, it can be shown that with probability at least 1 − δ,

ε_1 ≤ c · ( (log(n/δ)/(u_min N))^{1/2} + log(n/δ)/(u_min N) ),
ε_2 ≤ c · ( (log(n/δ)/(v_min N))^{1/2} + log(n/δ)/(v_min N) ),

for some absolute constant c > 0, where u_min := min_{x∈[n]} u_x and v_min := min_{x∈[n]} v_x. Therefore we can bound the first term of the ‖Ω̂ − Ω‖ bound (Eq. (11)), and the denominator in the second term (Eq. (12)).

It remains to bound the numerator of the second term of the ‖Ω̂ − Ω‖ bound (Eq. (12)), which is the spectral norm of a sum of N i.i.d. random matrices, each with mean zero. We focus on a single random sentence (X_1, X_2) (dropping the superscript). The contribution of this sentence to the sum defining diag(u)^{-1/2}(B̂ − B) diag(v)^{-1/2} is the zero-mean random matrix

Y := (1/N)( diag(u)^{-1/2} e_{X_1} e_{X_2}^⊤ diag(v)^{-1/2} − Ω ).

We will apply the matrix Bernstein inequality (Tropp, 2011); to do so, we must bound the following quantities: ‖Y‖, ‖E[YY^⊤]‖, ‖E[Y^⊤Y]‖. To bound ‖Y‖, we simply use

‖Y‖ ≤ (1/N)( 1/(u_min v_min)^{1/2} + ‖Ω‖ ).

To bound ‖E[YY^⊤]‖, observe that

N^2 E[YY^⊤] = Σ_{x,x′} (P(X_1 = x, X_2 = x′)/(P(X_1 = x)P(X_2 = x′))) e_x e_x^⊤ − ΩΩ^⊤.

The spectral norm of the summation is

‖Σ_{x,x′} (P(X_1 = x, X_2 = x′)/(P(X_1 = x)P(X_2 = x′))) e_x e_x^⊤‖
  = max_{x∈[n]} Σ_{x′} P(X_1 = x, X_2 = x′)/(P(X_1 = x)P(X_2 = x′))
  = max_{x∈[n]} Σ_{x′} (O_{x,C(x)} π_{C(x)} T_{C(x′),C(x)} O_{x′,C(x′)})/(O_{x,C(x)} π_{C(x)} P(C_2 = C(x′)) O_{x′,C(x′)})
  = max_{x∈[n]} Σ_{x′} P(C_2 = C(x′) | C_1 = C(x))/P(C_2 = C(x′)) =: n_1.

Therefore ‖E[YY^⊤]‖ ≤ (n_1 + ‖Ω‖^2)/N^2. To bound ‖E[Y^⊤Y]‖, we use

N^2 E[Y^⊤Y] = Σ_{x,x′} (P(X_1 = x, X_2 = x′)/(P(X_1 = x)P(X_2 = x′))) e_{x′} e_{x′}^⊤ − Ω^⊤Ω,

and bound the summation as

‖Σ_{x,x′} (P(X_1 = x, X_2 = x′)/(P(X_1 = x)P(X_2 = x′))) e_{x′} e_{x′}^⊤‖
  = max_{x′∈[n]} Σ_{x} P(X_1 = x, X_2 = x′)/(P(X_1 = x)P(X_2 = x′))
  = max_{x′∈[n]} Σ_{x} (O_{x,C(x)} π_{C(x)} T_{C(x′),C(x)} O_{x′,C(x′)})/(O_{x,C(x)} π_{C(x)} P(C_2 = C(x′)) O_{x′,C(x′)})
  = max_{x′∈[n]} Σ_{x} P(C_2 = C(x′) | C_1 = C(x))/P(C_2 = C(x′)) =: n_2.


Therefore ‖E[Y^⊤Y]‖ ≤ (n_2 + ‖Ω‖^2)/N^2.

The matrix Bernstein inequality implies that with probability at least 1 − δ,

‖diag(u)^{-1/2}(B̂ − B) diag(v)^{-1/2}‖
  ≤ c′ · ( (max{n_1, n_2} log(n/δ)/N)^{1/2} + log(n/δ)/((u_min v_min)^{1/2} N) )
  + c′ · ‖Ω‖ · ( (log(n/δ)/N)^{1/2} + log(n/δ)/N )

for some absolute constant c′ > 0.

We can now finally state the sample complexity bound. Let κ_m(Ω) := σ_1(Ω)/σ_m(Ω) = ‖Ω‖/σ_m(Ω) be the rank-m condition number of Ω. There is an absolute constant c′′ > 0 such that for any ε ∈ (0, 1), if

N ≥ c′′ · κ_m(Ω)^2 log(n/δ) max{n_1, n_2, 1/u_min, 1/v_min} / ε^2,

then with probability at least 1 − δ, ‖Ω̂ − Ω‖/σ_m(Ω) ≤ ε.

We note that n_1 and n_2 are bounded by n/min_{c∈[m]} π_c and n/min_{c∈[m]} (Tπ)_c, respectively. However, these bounds can be considerably improved in some cases. For instance, if all transition probabilities in T and initial cluster probabilities in π are near uniform, then both n_1 and n_2 are approximately n.
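This simplified setting is easy to simulate. The sketch below (hypothetical parameters, numpy assumed) draws N i.i.d. length-2 sentences from a small Brown model, forms the plug-in estimate of Ω, and prints the relative error ‖Ω̂ − Ω‖/σ_m(Ω), which should shrink as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny hypothetical Brown model: n = 4 word types, m = 2 clusters.
C = np.array([0, 0, 1, 1])
O = np.zeros((4, 2)); O[np.arange(4), C] = [0.7, 0.3, 0.4, 0.6]
T = np.array([[0.8, 0.3], [0.2, 0.7]])     # T[c2, c1] = t(c2 | c1)
pi = np.array([0.5, 0.5])                  # initial cluster distribution

# Population quantities and the target matrix Omega (Eq. (1)).
B = O @ np.diag(pi) @ (O @ T).T
u, v = O @ pi, O @ T @ pi
Omega = np.diag(u ** -0.5) @ B @ np.diag(v ** -0.5)
sigma_m = np.linalg.svd(Omega, compute_uv=False)[1]   # m-th largest singular value (m = 2)

for N in [1_000, 10_000, 100_000]:
    # N i.i.d. sentences of length 2: sample clusters, then emit words.
    c1 = rng.choice(2, size=N, p=pi)
    c2 = np.array([rng.choice(2, p=T[:, c]) for c in c1])
    x1 = np.array([rng.choice(4, p=O[:, c]) for c in c1])
    x2 = np.array([rng.choice(4, p=O[:, c]) for c in c2])
    # Maximum likelihood estimates are sample averages over the N sentences.
    B_hat = np.zeros((4, 4)); np.add.at(B_hat, (x1, x2), 1.0 / N)
    u_hat = np.bincount(x1, minlength=4) / N
    v_hat = np.bincount(x2, minlength=4) / N
    Omega_hat = np.diag(u_hat ** -0.5) @ B_hat @ np.diag(v_hat ** -0.5)
    print(N, np.linalg.norm(Omega_hat - Omega, 2) / sigma_m)
```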


References

Ando, R. K. and Zhang, T. (2005). A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6, 1817–1853.

Balcan, M.-F., Blum, A., and Vempala, S. (2008). A discriminative framework for clustering via similarity functions. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 671–680.

Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467–479.

Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L. (2013). Experiments with spectral learning of latent-variable PCFGs. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12, 2493–2537.

Dasgupta, S. and Long, P. M. (2005). Performance guarantees for hierarchical clustering. Journal of Computer and System Sciences, 70(4), 555–569.

Dhillon, P. S., Foster, D., and Ungar, L. (2011). Multi-view learning of word embeddings via CCA. In Advances in Neural Information Processing Systems (NIPS), volume 24.

Franti, P., Kaukoranta, T., Shen, D.-F., and Chang, K.-S. (2000). Fast and memory efficient implementation of the exact PNN. IEEE Transactions on Image Processing, 9(5), 773–777.

Gao, J., Goodman, J., Miao, J., et al. (2001). The use of clustering techniques for language modeling–application to Asian languages. Computational Linguistics and Chinese Language Processing, 6(1), 27–60.

Hardoon, D. R., Szedmak, S., and Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639–2664.

Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28(3/4), 321–377.

Hsu, D., Kakade, S. M., and Zhang, T. (2012). A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5), 1460–1480.

Kneser, R. and Ney, H. (1993). Improved clustering techniques for class-based statistical language modelling. In Third European Conference on Speech Communication and Technology.

Koo, T., Carreras, X., and Collins, M. (2008). Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Liang, P. (2005). Semi-Supervised Learning for Natural Language. Master's thesis, Massachusetts Institute of Technology.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013b). Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT, pages 746–751.

Miller, S., Guinness, J., and Zamanian, A. (2004). Name tagging with word clusters and discriminative training. In HLT-NAACL, volume 4, pages 337–342. Citeseer.

Ng, A. Y., Jordan, M. I., Weiss, Y., et al. (2002). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2, 849–856.

Pereira, F., Tishby, N., and Lee, L. (1993). Distributional clustering of English words. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 183–190. Association for Computational Linguistics.

Ratinov, L. and Roth, D. (2009). Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155. Association for Computational Linguistics.

Shanbehzadeh, J. and Ogunbona, P. (1997). On the computational complexity of the LBG and PNN algorithms. IEEE Transactions on Image Processing, 6(4), 614–616.

Tropp, J. A. (2011). User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, pages 1–46.

Turian, J., Ratinov, L., and Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. Association for Computational Linguistics.

Ward Jr, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236–244.