arXiv:1508.01991v1 [cs.CL] 9 Aug 2015

Bidirectional LSTM-CRF Models for Sequence Tagging

Zhiheng Huang (Baidu research, huangzhiheng@baidu.com)
Wei Xu (Baidu research, xuwei06@baidu.com)
Kai Yu (Baidu research, yukai@baidu.com)

Abstract

In this paper, we propose a variety of Long Short-Term Memory (LSTM) based models for sequence tagging. These models include LSTM networks, bidirectional LSTM (BI-LSTM) networks, LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF) and bidirectional LSTM with a CRF layer (BI-LSTM-CRF). Our work is the first to apply a bidirectional LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data sets. We show that the BI-LSTM-CRF model can efficiently use both past and future input features thanks to a bidirectional LSTM component. It can also use sentence level tag information thanks to a CRF layer. The BI-LSTM-CRF model can produce state of the art (or close to) accuracy on POS, chunking and NER data sets. In addition, it is robust and has less dependence on word embedding as compared to previous observations.

1 Introduction

Sequence tagging, including part of speech tagging (POS), chunking, and named entity recognition (NER), has been a classic NLP task. It has drawn research attention for a few decades. The output of taggers can be used for downstream applications. For example, a named entity recognizer trained on user search queries can be utilized to identify which spans of text are products, thus triggering certain product ads. Another example is that such tag information can be used by a search engine to find relevant webpages.

Most existing sequence tagging models are linear statistical models, which include Hidden Markov Models (HMM), Maximum entropy Markov models (MEMMs) (McCallum et al., 2000), and Conditional Random Fields (CRF) (Lafferty et al., 2001). Convolutional network based models (Collobert et al., 2011) have been recently proposed to tackle the sequence tagging problem. We denote such a model as Conv-CRF as it consists of a convolutional network and a CRF layer on the output (the term sentence level log-likelihood (SLL) was used in the original paper). The Conv-CRF model has generated promising results on sequence tagging tasks. In the speech language understanding community, recurrent neural network (Mesnil et al., 2013; Yao et al., 2014) and convolutional net (Xu and Sarikaya, 2013) based models have been recently proposed. Other relevant work includes (Graves et al., 2005; Graves et al., 2013), which proposed a bidirectional recurrent neural network for speech recognition.

In this paper, we propose a variety of neural network based models for the sequence tagging task. These models include LSTM networks, bidirectional LSTM networks (BI-LSTM), LSTM networks with a CRF layer (LSTM-CRF), and bidirectional LSTM networks with a CRF layer (BI-LSTM-CRF). Our contributions can be summarized as follows.
1) We systematically compare the performance of the aforementioned models on NLP tagging data sets;
2) Our work is the first to apply a bidirectional LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data sets. This model can use both past and future input features thanks to a bidirectional LSTM component. In addition, this model can use sentence level tag information thanks to a CRF layer. Our model can produce state of the art (or close to) accuracy on POS, chunking and NER data sets;
3) We show that the BI-LSTM-CRF model is robust and has less dependence on word embedding as compared to previous observations (Collobert et al., 2011). It can produce accurate tagging performance without resorting to word embedding.

The remainder of the paper is organized as follows. Section 2 describes the sequence tagging models used in this paper. Section 3 shows the training procedure. Section 4 reports the experimental results. Section 5 discusses related research. Finally, Section 6 draws conclusions.

2 Models

In this section, we describe the models used in this paper: LSTM, BI-LSTM, CRF, LSTM-CRF and BI-LSTM-CRF.

2.1 LSTM Networks

Recurrent neural networks (RNN) have been employed to produce promising results on a variety of tasks including language modeling (Mikolov et al., 2010; Mikolov et al., 2011) and speech recognition (Graves et al., 2005). An RNN maintains a memory based on history information, which enables the model to predict the current output conditioned on long distance features.

Figure 1 shows the RNN structure (Elman, 1990), which has an input layer x, a hidden layer h and an output layer y. In the named entity tagging context, x represents input features and y represents tags. Figure 1 illustrates a named entity recognition system in which each word is tagged with other (O) or one of four entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). The sentence "EU rejects German call to boycott British lamb ." is tagged as B-ORG O B-MISC O O O B-MISC O O, where B- and I- tags indicate beginning and intermediate positions of entities.

[Figure 1: A simple RNN model. Input layer x (words: EU rejects German call), hidden layer h, output layer y (tags: B-ORG O B-MISC O).]

An input layer represents features at time t. They could be a one-hot encoding of the word feature, dense vector features, or sparse features. An input layer has the same dimensionality as the feature size. An output layer represents a probability distribution over labels at time t. It has the same dimensionality as the number of labels. Compared to a feedforward network, an RNN introduces a connection between the previous hidden state and the current hidden state (and thus the recurrent layer weight parameters). This recurrent layer is designed to store history information. The values in the hidden and output layers are computed as follows:

$$h(t) = f(U x(t) + W h(t-1)), \tag{1}$$

$$y(t) = g(V h(t)), \tag{2}$$

where U, W, and V are the connection weights to be computed at training time, and f(z) and g(z) are the sigmoid and softmax activation functions, as follows:

$$f(z) = \frac{1}{1 + e^{-z}}, \tag{3}$$

$$g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}. \tag{4}$$

In this paper, we apply Long Short-Term Memory (Hochreiter and Schmidhuber, 1997; Graves et al., 2005) to sequence tagging. Long Short-Term Memory networks are the same as RNNs, except that the hidden layer updates are replaced by purpose-built memory cells. As a result, they may be better at finding and exploiting long range dependencies in the data. Fig. 2 illustrates a single LSTM memory cell (Graves et al., 2005).

[Figure 2: A Long Short-Term Memory Cell. Input x_t feeds the input gate i_t, the forget gate f_t, the output gate o_t and the cell c_t; the cell produces the output h_t.]

The LSTM memory cell is implemented as follows:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$

$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$

$$h_t = o_t \tanh(c_t)$$
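As an illustration only, the LSTM cell update above can be sketched in plain Python. This is not the authors' implementation. It assumes that the gate products are elementwise and that the peephole weights (W_ci, W_cf, W_co) are diagonal, so they are stored as vectors; the text above does not state either point. The parameter dictionary layout, the helper names `affine`, `lstm_step` and `init_params`, and the constant toy weights are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x):
    """Matrix-vector product W x for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with peephole connections.

    p is a dict of parameters (hypothetical layout): matrices "W_x*" and
    "W_h*" as lists of rows, peephole vectors "w_c*" (assumed diagonal),
    and bias vectors "b_*".
    """
    n = len(h_prev)

    def pre(gate):
        # W_x{gate} x_t + W_h{gate} h_{t-1} + b_{gate}
        a = affine(p["W_x" + gate], x_t)
        r = affine(p["W_h" + gate], h_prev)
        return [a[j] + r[j] + p["b_" + gate][j] for j in range(n)]

    zi, zf, zc, zo = pre("i"), pre("f"), pre("c"), pre("o")
    i_t = [sigmoid(zi[j] + p["w_ci"][j] * c_prev[j]) for j in range(n)]
    f_t = [sigmoid(zf[j] + p["w_cf"][j] * c_prev[j]) for j in range(n)]
    c_t = [f_t[j] * c_prev[j] + i_t[j] * math.tanh(zc[j]) for j in range(n)]
    # The output gate looks at the updated cell state c_t, not c_{t-1}.
    o_t = [sigmoid(zo[j] + p["w_co"][j] * c_t[j]) for j in range(n)]
    h_t = [o_t[j] * math.tanh(c_t[j]) for j in range(n)]
    return h_t, c_t


def init_params(n_in, n_hid, value=0.1):
    """Constant toy parameters, for illustration only (not trained weights)."""
    p = {}
    for gate in "ifco":
        p["W_x" + gate] = [[value] * n_in for _ in range(n_hid)]
        p["W_h" + gate] = [[value] * n_hid for _ in range(n_hid)]
        p["b_" + gate] = [0.0] * n_hid
    for gate in "ifo":
        p["w_c" + gate] = [value] * n_hid
    return p


if __name__ == "__main__":
    params = init_params(n_in=3, n_hid=2)
    h, c = [0.0, 0.0], [0.0, 0.0]
    for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):  # two one-hot inputs
        h, c = lstm_step(x, h, c, params)
    print(h, c)
```

The state pair (h, c) is threaded through the loop, so each step sees the previous hidden output and cell state. A tagger would feed each h_t to an output layer to score the labels at time t.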
