Conditional Random Fields
Andrea Passerini, passerini@disi.unitn.it


1. Conditional Random Fields
Andrea Passerini, passerini@disi.unitn.it
Statistical relational learning

2. Generative vs discriminative models
Joint distributions
- Traditional graphical models (both BN and MN) model joint probability distributions p(x, y).
- In many situations we know in advance which variables will be observed and which will need to be predicted (i.e. x vs y).
- Hidden Markov Models (as a special case of BN) also model joint probabilities of states and observations, even though they are often used to estimate the most probable sequence of states y given the observations x.
- A problem with joint distributions is that they need to explicitly model the probability of x, which can be quite complex (e.g. a textual document).

3. Generative vs discriminative models
[Figures: graphical structures of a Naive Bayes classifier (output y connected to inputs x_1, ..., x_n) and a Hidden Markov Model (state chain y_{n-1}, y_n, y_{n+1} with observations x_{n-1}, x_n, x_{n+1})]
Generative models
- Directed graphical models are called generative when the joint probability decouples as p(x, y) = p(x | y) p(y).
- The dependencies between input and output are only from the latter to the former: the output generates the input.
- Naive Bayes classifiers and Hidden Markov Models are both generative models.

4. Generative vs discriminative models
Discriminative models
- If the purpose is choosing the most probable configuration of the output variables, we can directly model the conditional probability of the output given the input: p(y | x).
- The parameters of such a distribution have more freedom than those of the full p(x, y), as p(x) is not modelled.
- This allows us to effectively exploit the structure of x without modelling the interactions between its parts, but only those with the output.
- Such models are called discriminative as they aim at modelling the discrimination between different outputs.

5. Conditional Random Fields (CRF, Lafferty et al. 2001)
Definition
Conditional random fields are conditional Markov networks:

p(y | x) = \frac{1}{Z(x)} \exp\Big( -\sum_{(x,y)_C} E((x,y)_C) \Big)

The partition function Z(x) is summed only over y to provide a proper conditional probability:

Z(x) = \sum_{y'} \exp\Big( -\sum_{(x,y')_C} E((x,y')_C) \Big)
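As a toy illustration of this definition, the following sketch computes Z(x) and p(y | x) by brute-force enumeration of all output configurations for a tiny chain-structured model. The energy function, label set and chain length are made-up assumptions for the example; real CRFs avoid the exponential enumeration via dynamic programming, as discussed later in the slides.

```python
import itertools
import math

# Hypothetical toy setup: 3 output positions, binary labels, one local clique
# (x_t, y_t) per position and one pairwise clique (y_{t-1}, y_t) per edge.
LABELS = [0, 1]
T = 3

def energy(x, y):
    """Made-up total energy: sum of clique energies E((x, y)_C)."""
    e = 0.0
    for t in range(T):
        e += 0.5 * abs(y[t] - x[t])                    # local clique (x_t, y_t)
        if t > 0:
            e += 1.0 if y[t] != y[t - 1] else 0.0      # pairwise clique (y_{t-1}, y_t)
    return e

def partition(x):
    """Z(x): sum over all candidate outputs y' of exp(-E((x, y')))."""
    return sum(math.exp(-energy(x, yp)) for yp in itertools.product(LABELS, repeat=T))

def p_y_given_x(y, x):
    """p(y | x) = exp(-E(x, y)) / Z(x)."""
    return math.exp(-energy(x, y)) / partition(x)

x = [0, 1, 1]
print(p_y_given_x((0, 1, 1), x))   # the labeling most compatible with x gets the highest probability
```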

6. Conditional Random Fields
Feature functions

p(y | x) = \frac{1}{Z(x)} \exp\Big( \sum_{(x,y)_C} \sum_{k=1}^{K} \lambda_k f_k((x,y)_C) \Big)

- The negated energy function is often written simply as a weighted sum of real-valued feature functions.
- Each feature function should capture a certain characteristic of the clique variables.

7. Linear chain CRF
Description (simple form)

p(y | x) = \frac{1}{Z(x)} \exp\Big( \sum_t \Big[ \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}) + \sum_{h=1}^{H} \mu_h f_h(x_t, y_t) \Big] \Big)

- Models the relation between an input and an output sequence.
- Output sequences are modelled as a linear chain, with a link between each pair of consecutive output elements.
- Each output element is connected to the corresponding input.

8. Linear chain CRF
Description (more generic form)

p(y | x) = \frac{1}{Z(x)} \exp\Big( \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big)

- The linear chain CRF can model arbitrary features of the input, not only the identity of the current observation (as in HMMs).
- We can think of x_t as a vector containing the input information relevant for position t, possibly including inputs at previous or following positions.
- We can easily make transition scores (between consecutive outputs y_{t-1}, y_t) depend also on the current input x_t (see the sketch after this list).
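To make the feature-function notation concrete, here is a minimal sketch of how the unnormalized score exp(sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x_t)) of a candidate labeling could be computed for a toy word-labeling task. The feature definitions, label set and weights are illustrative assumptions, not part of the original slides.

```python
import math

# Hypothetical feature functions f_k(y_t, y_prev, x_t); x_t is a dict of
# observations about position t (current word, neighbouring words, ...).
def f_capitalized_is_name(y_t, y_prev, x_t):
    return 1.0 if x_t["word"][0].isupper() and y_t == "NAME" else 0.0

def f_prev_title_is_name(y_t, y_prev, x_t):
    return 1.0 if x_t.get("prev_word", "").lower() in {"mr", "dr"} and y_t == "NAME" else 0.0

def f_transition_name_to_other(y_t, y_prev, x_t):
    return 1.0 if y_prev == "NAME" and y_t == "OTHER" else 0.0

FEATURES = [f_capitalized_is_name, f_prev_title_is_name, f_transition_name_to_other]
LAMBDAS = [1.5, 2.0, 0.5]   # made-up weights lambda_k

def unnormalized_score(x_seq, y_seq):
    """exp( sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x_t) ), with a dummy START label at t = 0."""
    total, y_prev = 0.0, "START"
    for x_t, y_t in zip(x_seq, y_seq):
        total += sum(lam * f(y_t, y_prev, x_t) for lam, f in zip(LAMBDAS, FEATURES))
        y_prev = y_t
    return math.exp(total)

x_seq = [{"word": "Dr"}, {"word": "Smith", "prev_word": "Dr"}, {"word": "spoke", "prev_word": "Smith"}]
print(unnormalized_score(x_seq, ["OTHER", "NAME", "OTHER"]))
```

Note how the same feature can look at both the current input x_t and the transition (y_{t-1}, y_t), which is the extra flexibility the slide refers to compared with HMMs.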

9. Linear chain CRF
Parameter estimation
- The parameters λ_k of the feature functions need to be estimated from data.
- We estimate them from a training set of i.i.d. input/output sequence pairs D = {(x^(i), y^(i))}, i = 1, ..., N.
- Each example (x^(i), y^(i)) is made of a sequence of inputs and a corresponding sequence of outputs: x^(i) = {x_1^(i), ..., x_T^(i)}, y^(i) = {y_1^(i), ..., y_T^(i)}.
Note: for simplicity of notation we assume each training sequence has the same length T. The generic form would replace T with T^(i).

10. Parameter estimation
Maximum likelihood estimation
- Parameter estimation is performed by maximizing the likelihood of the data D given the parameters θ = {λ_1, ..., λ_K}.
- As usual, to simplify the derivations we equivalently maximize the log-likelihood.
- As CRFs model a conditional probability, we maximize the conditional log-likelihood:

\ell(\theta) = \log \prod_{i=1}^{N} p(y^{(i)} | x^{(i)}) = \sum_{i=1}^{N} \log p(y^{(i)} | x^{(i)})

11. Parameter estimation
Maximum likelihood estimation
Replacing the equation for the conditional probability we obtain:

\ell(\theta) = \sum_{i=1}^{N} \log \Big[ \frac{1}{Z(x^{(i)})} \exp\Big( \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) \Big) \Big]
             = \sum_{i=1}^{N} \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)})
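A brute-force sketch of this conditional log-likelihood for short chains, reusing the hypothetical features and weights from the previous sketch: Z(x^(i)) is obtained here by enumerating all labelings, which is only feasible for toy-sized sequences (in practice it is computed with the forward procedure shown at the end of the slides).

```python
import itertools
import math

LABELS = ["NAME", "OTHER"]

def log_score(x_seq, y_seq, features, lambdas):
    """sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x_t), with a dummy START label at t = 0."""
    total, y_prev = 0.0, "START"
    for x_t, y_t in zip(x_seq, y_seq):
        total += sum(lam * f(y_t, y_prev, x_t) for lam, f in zip(lambdas, features))
        y_prev = y_t
    return total

def log_partition(x_seq, features, lambdas):
    """log Z(x) by enumerating every candidate labeling (exponential, toy-sized only)."""
    scores = [log_score(x_seq, y, features, lambdas)
              for y in itertools.product(LABELS, repeat=len(x_seq))]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))   # log-sum-exp for stability

def conditional_log_likelihood(data, features, lambdas):
    """ell(theta) = sum_i [ score(y^(i), x^(i)) - log Z(x^(i)) ] over (x_seq, y_seq) pairs."""
    return sum(log_score(x, y, features, lambdas) - log_partition(x, features, lambdas)
               for x, y in data)
```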

12. Gradient of the likelihood

\frac{\partial \ell(\theta)}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_t f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \sum_t \sum_{y, y'} f_k(y, y', x_t^{(i)}) \, p_\theta(y, y' | x^{(i)})

(the first sum is the empirical expectation Ẽ[f_k], the second the model expectation E_θ[f_k])

Interpretation
- Ẽ[f_k] is the expected value of f_k under the empirical distribution p̃(y, x) represented by the training examples.
- E_θ[f_k] is the expected value of f_k under the distribution represented by the model with the current value of the parameters: p_θ(y | x) p̃(x), where p̃(x) is the empirical distribution of x.
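A brute-force sketch of this gradient for toy-sized chains, reusing LABELS, log_score and log_partition from the previous sketch: the model expectation is computed by enumerating all labelings and weighting feature counts by p_θ(y | x), instead of using the edge marginals p_θ(y, y' | x) that a forward-backward pass would provide.

```python
import itertools
import math

def gradient(data, features, lambdas):
    """d ell / d lambda_k = empirical feature counts - expected feature counts under the model."""
    grad = [0.0] * len(features)
    for x_seq, y_seq in data:
        # Empirical term: features counted on the observed labeling.
        y_prev = "START"
        for x_t, y_t in zip(x_seq, y_seq):
            for k, f in enumerate(features):
                grad[k] += f(y_t, y_prev, x_t)
            y_prev = y_t
        # Model term: expectation over all candidate labelings weighted by p_theta(y | x).
        log_Z = log_partition(x_seq, features, lambdas)
        for y_cand in itertools.product(LABELS, repeat=len(x_seq)):
            p = math.exp(log_score(x_seq, y_cand, features, lambdas) - log_Z)
            y_prev = "START"
            for x_t, y_t in zip(x_seq, y_cand):
                for k, f in enumerate(features):
                    grad[k] -= p * f(y_t, y_prev, x_t)
                y_prev = y_t
    return grad
```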

13. Gradient of the likelihood
Interpretation

\frac{\partial \ell(\theta)}{\partial \lambda_k} = \tilde{E}[f_k] - E_\theta[f_k]

- The gradient measures the difference between the expected value of the feature under the empirical and model distributions.
- The gradient is zero when the model adheres to the empirical observations.
- This highlights the risk of overfitting the training examples.

14. Parameter estimation
Adding regularization
- CRFs often have a large number of parameters, to account for different characteristics of the inputs.
- Many parameters mean a risk of overfitting the training data.
- In order to reduce the risk of overfitting, we penalize parameters with too large a norm.

15. Parameter estimation
Zero-mean Gaussian prior
A common choice is assuming a Gaussian prior over the parameters, with zero mean and covariance σ²I (where I is the identity matrix):

p(\theta) \propto \exp\Big( -\frac{\|\theta\|^2}{2\sigma^2} \Big)

- The Gaussian normalization coefficient can be ignored, as it is independent of θ.
- σ² is a free parameter determining how much to penalize feature weights for moving away from zero.
- The log prior becomes:

\log p(\theta) \propto -\frac{\|\theta\|^2}{2\sigma^2} = -\sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}
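A small sketch of how this prior typically enters the objective and its gradient as an L2 penalty; lambdas is the list of weights from the earlier sketches and sigma is the free hyperparameter σ.

```python
def log_prior(lambdas, sigma):
    """log p(theta) up to a constant: -sum_k lambda_k^2 / (2 sigma^2)."""
    return -sum(lam ** 2 for lam in lambdas) / (2.0 * sigma ** 2)

def log_prior_gradient(lambdas, sigma):
    """d log p(theta) / d lambda_k = -lambda_k / sigma^2."""
    return [-lam / sigma ** 2 for lam in lambdas]
```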

16. Parameter estimation
Maximum a-posteriori estimation
We can now estimate the maximum a-posteriori parameters:

\theta^* = \arg\max_\theta \; \ell(\theta) + \log p(\theta) = \arg\max_\theta \; \ell_r(\theta)

where the regularized likelihood ℓ_r(θ) is:

\ell_r(\theta) = \sum_{i=1}^{N} \sum_t \sum_{k=1}^{K} \lambda_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)}) - \sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}
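In code, the regularized objective is simply the sum of the two pieces sketched above (conditional_log_likelihood and log_prior from the earlier sketches):

```python
def regularized_log_likelihood(data, features, lambdas, sigma):
    """ell_r(theta) = conditional log-likelihood plus the (constant-free) log prior."""
    return conditional_log_likelihood(data, features, lambdas) + log_prior(lambdas, sigma)
```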

17. Parameter estimation
Optimizing the regularized likelihood
- Gradient ascent → usually too slow.
- Newton's method (uses the Hessian, the matrix of all second-order derivatives) → too expensive to compute the Hessian.
- Quasi-Newton methods are often employed: they compute an approximation of the Hessian using only first derivatives (e.g. BFGS).
- Limited-memory versions exist that avoid storing the full approximate Hessian (whose size is quadratic in the number of parameters).
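As an illustration, an off-the-shelf limited-memory quasi-Newton optimizer such as L-BFGS can be applied by minimizing the negative regularized log-likelihood. The sketch below uses scipy.optimize.minimize and reuses the toy helpers from the previous sketches (regularized_log_likelihood, gradient, log_prior_gradient); the data format and hyperparameter value are assumptions of this example.

```python
import numpy as np
from scipy.optimize import minimize

def fit_crf(data, features, sigma=10.0):
    """Estimate the weights lambda_k by maximizing ell_r(theta) with L-BFGS (via minimizing its negative)."""
    def neg_objective(lams):
        return -regularized_log_likelihood(data, features, list(lams), sigma)

    def neg_gradient(lams):
        lams = list(lams)
        g = gradient(data, features, lams)            # d ell / d lambda_k
        gp = log_prior_gradient(lams, sigma)          # d log p(theta) / d lambda_k
        return -np.array([gk + gpk for gk, gpk in zip(g, gp)])

    x0 = np.zeros(len(features))                      # start from all-zero weights
    result = minimize(neg_objective, x0, jac=neg_gradient, method="L-BFGS-B")
    return result.x                                   # estimated weights lambda_k
```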

18. Inference
Inference problems
- Computing the gradient requires computing the marginal distribution for each edge, p_θ(y, y' | x^(i)). This has to be computed at each gradient step, as the set of parameters θ changes in the direction of the gradient.
- Computing the likelihood requires computing the partition function Z(x).
- During testing, finding the most likely labeling requires solving y* = argmax_y p(y | x) (a Viterbi-style sketch is given below).
Inference algorithms
All such tasks can be performed efficiently by dynamic programming algorithms similar to those for HMMs.
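For the decoding task y* = argmax_y p(y | x), here is a minimal Viterbi-style sketch over the toy feature setup used earlier (LABELS, features, lambdas from the previous sketches). Since Z(x) does not depend on y, it suffices to maximize the unnormalized log-score.

```python
def viterbi_decode(x_seq, features, lambdas):
    """Most likely labeling via dynamic programming over log-scores sum_k lambda_k f_k."""
    def step_score(y_t, y_prev, x_t):
        return sum(lam * f(y_t, y_prev, x_t) for lam, f in zip(lambdas, features))

    # delta[y] = best log-score of a prefix ending in label y; back stores best predecessors.
    delta = {y: step_score(y, "START", x_seq[0]) for y in LABELS}
    back = []
    for x_t in x_seq[1:]:
        new_delta, pointers = {}, {}
        for y in LABELS:
            best_prev = max(LABELS, key=lambda yp: delta[yp] + step_score(y, yp, x_t))
            new_delta[y] = delta[best_prev] + step_score(y, best_prev, x_t)
            pointers[y] = best_prev
        delta, back = new_delta, back + [pointers]

    # Backtrack from the best final label.
    y_last = max(LABELS, key=lambda y: delta[y])
    path = [y_last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```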

19. Inference algorithms
Analogy to HMMs
Inference algorithms rely on forward, backward and Viterbi procedures analogous to those for HMMs.
To simplify notation and highlight the analogy to HMMs, we will use the formulation of the CRF in terms of clique potentials:

p(y | x) = \frac{1}{Z(x)} \prod_t \Psi_t(y_t, y_{t-1}, x_t)

where the clique potentials are:

\Psi_t(y_t, y_{t-1}, x_t) = \exp\Big( \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big)
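A sketch of how these clique potentials might be tabulated, one table of values Ψ_t(y_t, y_{t-1}, x_t) per position, using the toy features and weights from the earlier sketches (math imported there; the START handling at t = 0 is an assumption of this example).

```python
def clique_potentials(x_seq, features, lambdas, labels):
    """Psi_t(y_t, y_{t-1}, x_t) = exp(sum_k lambda_k f_k(...)), tabulated per position."""
    tables = []
    for t, x_t in enumerate(x_seq):
        prev_labels = ["START"] if t == 0 else labels   # dummy predecessor at the first position
        tables.append({
            (y, y_prev): math.exp(sum(lam * f(y, y_prev, x_t)
                                      for lam, f in zip(lambdas, features)))
            for y in labels for y_prev in prev_labels
        })
    return tables
```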

20. Inference algorithms
Forward procedure
The forward variable α_t(i) collects the unnormalized probability of output y_t = i and the sequence of inputs {x_1, ..., x_t}:

\alpha_t(i) \propto p(x_1, \ldots, x_t, y_t = i)

As for HMMs, it is computed recursively:

\alpha_t(i) = \sum_{j \in S} \Psi_t(i, j, x_t) \, \alpha_{t-1}(j)

where S is the set of possible values for the output variable.
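A sketch of this forward recursion over the potential tables from the previous sketch; summing the forward variables at the last position yields the partition function Z(x). In practice the recursion is run in log-space (or with per-step scaling) for numerical stability, which this toy version omits.

```python
def forward(x_seq, features, lambdas, labels):
    """alpha_t(i) = sum_{j in S} Psi_t(i, j, x_t) * alpha_{t-1}(j); returns all alphas and Z(x)."""
    tables = clique_potentials(x_seq, features, lambdas, labels)
    # Initialization: alpha_1(i) = Psi_1(i, START, x_1).
    alphas = [{i: tables[0][(i, "START")] for i in labels}]
    for t in range(1, len(x_seq)):
        alphas.append({i: sum(tables[t][(i, j)] * alphas[-1][j] for j in labels)
                       for i in labels})
    Z = sum(alphas[-1].values())   # partition function Z(x)
    return alphas, Z
```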
