Appeared in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, July 2004.
Annealing Techniques for Unsupervised Statistical Language Learning
Noah A. Smith and Jason Eisner
Department of Computer Science / Center for Language and Speech Processing
Johns Hopkins University, Baltimore, MD 21218 USA
{nasmith,jason}@cs.jhu.edu

Abstract
Exploiting unannotated natural language data is hard largely because unsupervised parameter estimation is hard. We describe deterministic annealing (Rose et al., 1990) as an appealing alternative to the Expectation-Maximization algorithm (Dempster et al., 1977). Seeking to avoid search error, DA begins by globally maximizing an easy concave function and maintains a local maximum as it gradually morphs the function into the desired non-concave likelihood function. Applying DA to parsing and tagging models is shown to be straightforward; significant improvements over EM are shown on a part-of-speech tagging task. We describe a variant, skewed DA, which can incorporate a good initializer when it is available, and show significant improvements over EM on a grammar induction task.
1 Introduction
Unlabeled data remains a tantalizing potential resource for NLP researchers. Some tasks can thrive on a nearly pure diet of unlabeled data (Yarowsky, 1995; Collins and Singer, 1999; Cucerzan and Yarowsky, 2003). But for other tasks, such as machine translation (Brown et al., 1990), the chief merit of unlabeled data is simply that nothing else is available; unsupervised parameter estimation is notorious for achieving mediocre results.

The standard starting point is the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). EM iteratively adjusts a model’s parameters from an initial guess until it converges to a local maximum. Unfortunately, likelihood functions in practice are riddled with suboptimal local maxima (e.g., Charniak, 1993, ch. 7). Moreover, maximizing likelihood is not equivalent to maximizing task-defined accuracy (e.g., Merialdo, 1994).

Here we focus on the search error problem. Assume that one has a model for which improving likelihood really will improve accuracy (e.g., at predicting hidden part-of-speech (POS) tags or parse trees). Hence, we seek methods that tend to locate mountaintops rather than hilltops of the likelihood function. Alternatively, we might want methods that find hilltops with other desirable properties.¹
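To make the search-error problem concrete, the following is a minimal sketch, added for illustration and not taken from the paper (the toy data, the function em, and all settings are invented): it fits a two-component Bernoulli mixture with EM from several random initializations. Because the data contain three natural clusters but the model has only two components, different starting points can converge to local maxima with different likelihoods.

import numpy as np

# Toy illustration (not from the paper) of EM's sensitivity to its initial guess:
# fit a 2-component Bernoulli mixture to coin-flip sequences.
rng = np.random.default_rng(0)
data = np.array([1, 1, 2, 5, 5, 6, 9, 9, 10])    # heads observed out of 10 flips each
N_FLIPS = 10

def em(heads, init_p, n_iter=300):
    """Run EM; return final heads-probabilities, mixture weights, and log-likelihood."""
    p = np.array(init_p, dtype=float)             # per-component Pr(heads)
    w = np.array([0.5, 0.5])                      # mixture weights
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sequence
        like = w * (p ** heads[:, None]) * ((1 - p) ** (N_FLIPS - heads[:, None]))
        post = like / like.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and heads-probabilities from expected counts
        w = post.mean(axis=0)
        p = (post * heads[:, None]).sum(axis=0) / (post.sum(axis=0) * N_FLIPS)
    ll = np.log((w * (p ** heads[:, None])
                 * ((1 - p) ** (N_FLIPS - heads[:, None]))).sum(axis=1)).sum()
    return p, w, ll

for _ in range(5):
    init = rng.uniform(0.05, 0.95, size=2)        # a different random initializer each run
    p, w, ll = em(data, init)
    print(f"init={np.round(init, 2)} -> p={np.round(p, 2)}, log-likelihood={ll:.2f}")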
In §2 we review deterministic annealing (DA) and show how it generalizes the EM algorithm. §3 shows how DA can be used for parameter estimation for models of language structure that use dynamic programming to compute posteriors over hidden structure, such as hidden Markov models (HMMs) and stochastic context-free grammars (SCFGs). In §4 we apply DA to the problem of learning a trigram POS tagger without labeled data. We then describe how one of the received strengths of DA—its robustness to the initializing model parameters—can be a shortcoming in situations where the initial parameters carry a helpful bias. We present a solution to this problem in the form of a new algorithm, skewed deterministic annealing (SDA; §5). Finally we apply SDA to a grammar induction model and demonstrate significantly improved performance over EM (§6). §7 highlights future directions for this work.
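As a concrete preview of the mechanism reviewed in §2, the following sketch layers deterministic annealing on the same toy Bernoulli mixture: the E-step posteriors are flattened by an inverse temperature β that starts near zero (so the posteriors are nearly uniform and the effective objective is easy) and is gradually raised to 1, at which point the updates coincide with ordinary EM. This is an illustrative rendering of the general DA-for-EM recipe, not the paper's pseudocode; the function name da_em, the annealing schedule, and all numbers are arbitrary, and whether a better optimum is actually reached depends on the schedule and the problem.

import numpy as np

# Illustrative DA-for-EM sketch (not the paper's pseudocode) on a toy 2-component
# Bernoulli mixture: raise E-step posteriors to an inverse temperature beta and
# renormalize; anneal beta from near 0 (flat, easy) up to 1 (ordinary EM).
data = np.array([1, 1, 2, 5, 5, 6, 9, 9, 10])    # heads observed out of 10 flips each
N_FLIPS = 10

def da_em(heads, init_p, betas, iters_per_beta=50):
    p = np.array(init_p, dtype=float)             # per-component Pr(heads)
    w = np.array([0.5, 0.5])                      # mixture weights
    for beta in betas:
        for _ in range(iters_per_beta):
            like = w * (p ** heads[:, None]) * ((1 - p) ** (N_FLIPS - heads[:, None]))
            post = like ** beta                   # flatten the posterior at low beta
            post /= post.sum(axis=1, keepdims=True)
            w = post.mean(axis=0)
            p = (post * heads[:, None]).sum(axis=0) / (post.sum(axis=0) * N_FLIPS)
    return p, w

schedule = np.linspace(0.1, 1.0, 10)              # slowly "cool" toward beta = 1
print(da_em(data, init_p=[0.4, 0.6], betas=schedule))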
2 Deterministic annealing
Suppose our data consist of pairs of random variables X and Y, where the value of X is observed and Y is hidden. For example, X might range over sentences in English and Y over POS tag sequences. We use 𝒳 and 𝒴 to denote the sets of possible values of X and Y, respectively. We seek to build a model that assigns probabilities to each (x, y) ∈ 𝒳 × 𝒴. Let x = {x_1, x_2, ..., x_n} be a corpus of unlabeled examples. Assume the class of models is fixed (for example, we might consider only first-order HMMs with s states, corresponding notionally to POS tags). Then the task is to find good parameters θ ∈ ℝ^N for the model. The criterion most commonly used in building such models from unlabeled data is maximum likelihood (ML); we seek the parameters θ*:
argmax_θ Pr(x | θ) = argmax_θ ∏_{i=1}^{n} ∑_{y ∈ 𝒴} Pr(x_i, y | θ)        (1)
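To make the objective in Eq. 1 concrete, the sketch below evaluates the inner sum ∑_{y ∈ 𝒴} Pr(x_i, y | θ) for one toy sentence under a hypothetical first-order HMM by enumerating every tag sequence; the tag set, vocabulary, and probability values are invented for illustration. In practice this sum is computed with dynamic programming (the forward algorithm for HMMs), as discussed in §3, since enumeration is exponential in sentence length.

import itertools
import math

# Toy first-order HMM (hypothetical parameters) used to evaluate the inner sum of Eq. 1:
# Pr(x_i | theta) = sum over all tag sequences y of Pr(x_i, y | theta).
TAGS = ["D", "N", "V"]
init  = {"D": 0.6, "N": 0.3, "V": 0.1}                    # Pr(y_1)
trans = {"D": {"D": 0.05, "N": 0.85, "V": 0.10},          # Pr(y_t | y_{t-1})
         "N": {"D": 0.10, "N": 0.30, "V": 0.60},
         "V": {"D": 0.50, "N": 0.40, "V": 0.10}}
emit  = {"D": {"the": 0.90, "dog": 0.05, "barks": 0.05},  # Pr(word | tag)
         "N": {"the": 0.05, "dog": 0.70, "barks": 0.25},
         "V": {"the": 0.05, "dog": 0.15, "barks": 0.80}}

def sentence_likelihood(words):
    """Brute-force marginal Pr(words | theta): sum the joint over all tag sequences."""
    total = 0.0
    for y in itertools.product(TAGS, repeat=len(words)):
        p = init[y[0]] * emit[y[0]][words[0]]
        for t in range(1, len(words)):
            p *= trans[y[t - 1]][y[t]] * emit[y[t]][words[t]]
        total += p
    return total

corpus = [["the", "dog", "barks"]]
log_likelihood = sum(math.log(sentence_likelihood(x)) for x in corpus)
print(log_likelihood)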
¹Wang et al. (2003) suggest that one should seek a high-entropy hilltop. They argue that to account for partially-observed (unlabeled) data, one should choose the distribution with the highest Shannon entropy, subject to certain data-driven constraints. They show that this desirable distribution is one of