Probabilistic Graphical Models

David Sontag
New York University

Lecture 12, April 23, 2013
What notion of “best” should learning be optimizing?

This depends on what we want to do:
1. Density estimation: we are interested in the full distribution (so later we can compute whatever conditional probabilities we want)
2. Specific prediction tasks: we are using the distribution to make a prediction
3. Structure or knowledge discovery: we are interested in the model itself
Density estimation for conditional models

- Suppose we want to predict a set of variables Y given some others X, e.g., for segmentation or stereo vision (input: two images; output: disparity map)
- We concentrate on predicting p(Y | X), and use a conditional loss function
  $$\mathrm{loss}(x, y, \hat{M}) = -\log \hat{p}(y \mid x)$$
- Since the loss function only depends on $\hat{p}(y \mid x)$, it suffices to estimate the conditional distribution, not the joint
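As a concrete illustration (not from the slides), here is a minimal sketch of evaluating the conditional log-loss for a toy log-linear conditional model; the feature map `feat`, the weights, and the binary label set are made-up placeholders.

```python
import numpy as np

def conditional_log_loss(w, feat, x, y, label_set):
    """-log p_hat(y | x) for a log-linear conditional model where
    p(y | x) is proportional to exp(w . feat(x, y)), normalized over y only (not x)."""
    scores = np.array([w @ feat(x, yp) for yp in label_set])
    log_Z_x = np.logaddexp.reduce(scores)      # log of the x-dependent normalizer
    return -(w @ feat(x, y) - log_Z_x)

# Hypothetical 2-d feature map and weights for a binary Y.
feat = lambda x, y: np.array([x * y, float(y)])
w = np.array([1.0, -0.5])
print(conditional_log_loss(w, feat, x=2.0, y=1, label_set=[0, 1]))
```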
Density estimation for conditional models

- CRF:
  $$p(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C} \phi_c(x, y_c), \qquad Z(x) = \sum_{\hat{y}} \prod_{c \in C} \phi_c(x, \hat{y}_c)$$
- Parameterization as log-linear model: weights $w \in \mathbb{R}^d$, feature vectors $f_c(x, y_c) \in \mathbb{R}^d$, and
  $$\phi_c(x, y_c; w) = \exp\big(w \cdot f_c(x, y_c)\big)$$
- Empirical risk minimization with CRFs, i.e. $\min_{\hat{M}} E_{\mathcal{D}}\big[\mathrm{loss}(x, y, \hat{M})\big]$:
  $$\begin{aligned}
  w_{ML} &= \arg\min_w \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} -\log p(y \mid x; w) \\
         &= \arg\max_w \sum_{(x,y) \in \mathcal{D}} \Big( \sum_c \log \phi_c(x, y_c; w) - \log Z(x; w) \Big) \\
         &= \arg\max_w \sum_{(x,y) \in \mathcal{D}} \sum_c w \cdot f_c(x, y_c) \;-\; \sum_{(x,y) \in \mathcal{D}} \log Z(x; w)
  \end{aligned}$$
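To make the maximum-likelihood objective concrete, here is a hedged sketch that evaluates the average conditional negative log-likelihood of a CRF, computing Z(x; w) by brute-force enumeration; the chain length, the feature map `feats`, and the data are made up, and enumeration is only feasible for toy label spaces.

```python
import itertools
import numpy as np

def crf_neg_log_lik(w, feats, data, label_set, n_vars):
    """Average -log p(y | x; w) for a log-linear CRF, with Z(x; w) computed
    by enumerating all labelings (toy-sized problems only).
    `feats(x, y)` returns the summed feature vector sum_c f_c(x, y_c)."""
    total = 0.0
    for x, y in data:
        scores = {yp: w @ feats(x, yp)
                  for yp in itertools.product(label_set, repeat=n_vars)}
        log_Z = np.logaddexp.reduce(np.array(list(scores.values())))
        total += -(scores[tuple(y)] - log_Z)
    return total / len(data)

# Hypothetical chain of length 3 with a made-up two-feature map (node + edge agreement).
def feats(x, y):
    node = sum(x[i] * (1.0 if y[i] == 1 else -1.0) for i in range(len(y)))
    edge = sum(1.0 if y[i] == y[i + 1] else 0.0 for i in range(len(y) - 1))
    return np.array([node, edge])

data = [((0.5, -1.0, 2.0), (1, 0, 1))]
print(crf_neg_log_lik(np.array([0.3, 0.7]), feats, data, label_set=[0, 1], n_vars=3))
```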
Application: Part-of-speech tagging

Example sentence and tag sequence:

    United  flies  some  large  jet
    N       V      D     A      N

(Chain-structured model over the variables United_1, flies_2, some_3, large_4, jet_5.)
Graphical model formulation of POS tagging

Given:
- a sentence of length n and a tag set T
- one variable for each word, taking values in T
- edge potentials θ(i−1, i, t′, t) for all i ∈ {2, …, n} and t′, t ∈ T

Example: United_1 flies_2 some_3 large_4 jet_5, with T = {A, D, N, V}.
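As a small illustration of this formulation (not from the slides), the sketch below computes the unnormalized log-score of one complete tagging under per-position node potentials and a shared edge potential table; the potential values are random placeholders for the 5-word example.

```python
import numpy as np

TAGS = ["A", "D", "N", "V"]

def tagging_score(node_pot, edge_pot, tags):
    """Unnormalized log-score of a full tag sequence in the chain model:
    sum of node potentials theta_i(t_i) plus edge potentials theta(i-1, i, t', t)."""
    idx = [TAGS.index(t) for t in tags]
    score = sum(node_pot[i, ti] for i, ti in enumerate(idx))
    score += sum(edge_pot[idx[i - 1], idx[i]] for i in range(1, len(idx)))
    return score

# Hypothetical potentials for the 5-word example sentence.
rng = np.random.default_rng(0)
node_pot = rng.normal(size=(5, len(TAGS)))           # one row per word position
edge_pot = rng.normal(size=(len(TAGS), len(TAGS)))   # shared across all edges
print(tagging_score(node_pot, edge_pot, ["N", "V", "D", "A", "N"]))
```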
Features for POS tagging

- Edge potentials: Fully parameterize (|T| × |T| features and weights), i.e. θ_{i−1,i}(t′, t) = w^T_{t′,t}, where the superscript “T” denotes that these are the weights for the transitions
- Node potentials: Introduce features for the presence or absence of certain attributes of each word (e.g., initial letter capitalized, suffix is “ing”), for each possible tag (|T| × #attributes features and weights)
- This part is conditional on the input sentence!
- The edge potential is the same for all edges; likewise, the node potentials are shared across positions
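A hedged sketch of such a feature parameterization: one indicator per tag pair for the transitions, and one indicator per (attribute, tag) pair for the nodes. The attribute list `ATTRS` and the helper names are hypothetical, chosen only to mirror the examples on the slide.

```python
import numpy as np

TAGS = ["A", "D", "N", "V"]
ATTRS = ["init_cap", "suffix_ing", "has_digit"]   # hypothetical word attributes

def word_attrs(word):
    return [word[0].isupper(), word.endswith("ing"), any(ch.isdigit() for ch in word)]

def edge_features(prev_tag, tag):
    """One indicator per (prev_tag, tag) pair: |T| x |T| transition features."""
    f = np.zeros((len(TAGS), len(TAGS)))
    f[TAGS.index(prev_tag), TAGS.index(tag)] = 1.0
    return f.ravel()

def node_features(word, tag):
    """One indicator per (attribute, tag) pair: |T| x #attributes features,
    conditional on the observed word."""
    f = np.zeros((len(ATTRS), len(TAGS)))
    for a, on in enumerate(word_attrs(word)):
        if on:
            f[a, TAGS.index(tag)] = 1.0
    return f.ravel()

print(node_features("United", "N"))   # fires only the (init_cap, N) indicator
```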
Structured prediction

- Often we learn a model for the purpose of structured prediction, in which given x we predict y by finding the MAP assignment:
  $$\hat{y} = \arg\max_{y} \hat{p}(y \mid x)$$
- Rather than learn using log-loss (density estimation), we use a loss function better suited to the specific task
- One reasonable choice would be the classification error:
  $$E_{(x,y) \sim p^*}\Big[\mathbb{1}\{\exists\, y' \neq y \text{ s.t. } \hat{p}(y' \mid x) \ge \hat{p}(y \mid x)\}\Big]$$
  i.e., the probability over (x, y) pairs sampled from p* that our classifier does not select the right labels
- If p* is in the model family, training with log-loss (density estimation) and training with classification error would perform similarly (given sufficient data)
- Otherwise, it is better to directly optimize what we care about (classification error)
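Since prediction reduces to MAP inference, here is a minimal Viterbi-style dynamic-programming sketch for MAP decoding in the chain model from the POS example. The score tables are random placeholders, and the interface (per-position node scores plus one shared edge-score table) is an assumption made for the sketch.

```python
import numpy as np

def viterbi_map(node_scores, edge_scores):
    """MAP assignment argmax_y [ sum_i node_scores[i, y_i] + sum_i edge_scores[y_{i-1}, y_i] ]
    for a chain model, computed by dynamic programming (Viterbi).
    node_scores: (n, K) log node potentials; edge_scores: (K, K) shared log edge potentials."""
    n, K = node_scores.shape
    best = node_scores[0].copy()                 # best[t] = best score of a prefix ending in tag t
    back = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        cand = best[:, None] + edge_scores + node_scores[i][None, :]   # rows: prev tag, cols: cur tag
        back[i] = cand.argmax(axis=0)
        best = cand.max(axis=0)
    # Trace back the best path from the final position.
    y = [int(best.argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return y[::-1]

# Tiny made-up example: 5 positions, 4 tags.
rng = np.random.default_rng(0)
print(viterbi_map(rng.normal(size=(5, 4)), rng.normal(size=(4, 4))))
```

For a chain with n positions and K tags this runs in O(nK²) time, which is what makes the inner maximization tractable.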
Structured prediction

- Consider the empirical risk for 0-1 loss (classification error):
  $$\frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \mathbb{1}\{\exists\, y' \neq y \text{ s.t. } \hat{p}(y' \mid x) \ge \hat{p}(y \mid x)\}$$
- Each constraint $\hat{p}(y' \mid x) \ge \hat{p}(y \mid x)$ is equivalent to
  $$w \cdot \sum_c f_c(x, y'_c) - \log Z(x; w) \;\ge\; w \cdot \sum_c f_c(x, y_c) - \log Z(x; w)$$
- The log-partition function cancels out on both sides. Re-arranging, we have:
  $$w \cdot \Big( \sum_c f_c(x, y'_c) - \sum_c f_c(x, y_c) \Big) \ge 0$$
- Said differently, the empirical risk is zero when $\forall (x, y) \in \mathcal{D}$ and $y' \neq y$,
  $$w \cdot \Big( \sum_c f_c(x, y_c) - \sum_c f_c(x, y'_c) \Big) > 0.$$
Structured prediction

- Empirical risk is zero when $\forall (x, y) \in \mathcal{D}$ and $y' \neq y$,
  $$w \cdot \Big( \sum_c f_c(x, y_c) - \sum_c f_c(x, y'_c) \Big) > 0.$$
- In the simplest setting, learning corresponds to finding a weight vector w that satisfies all of these constraints (when possible)
- This is a linear program (LP)! How many constraints does it have? $|\mathcal{D}| \cdot |\mathcal{Y}|$, i.e., exponentially many!
- Thus, we must avoid explicitly representing this LP
- This lecture is about algorithms for solving this LP (or some variant) in a tractable manner
Structured perceptron algorithm

- Input: training examples $\mathcal{D} = \{(x^m, y^m)\}$
- Let $f(x, y) = \sum_c f_c(x, y_c)$. Then, the constraints that we want to satisfy are
  $$w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) > 0, \quad \forall y \neq y^m$$
- The perceptron algorithm uses MAP inference in its inner loop:
  $$\mathrm{MAP}(x^m; w) = \arg\max_{y \in \mathcal{Y}} \; w \cdot f(x^m, y)$$
  The maximization can often be performed efficiently by using the structure!
- The perceptron algorithm is then:
  1. Start with w = 0
  2. While the weight vector is still changing:
  3.     For m = 1, ..., |D|:
  4.         y ← MAP(x^m; w)
  5.         w ← w + f(x^m, y^m) − f(x^m, y)
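A minimal sketch of this algorithm, assuming a `map_oracle(x, w)` that returns argmax_y w · f(x, y) (here brute force over a tiny label set); the toy multiclass data and block feature map are made up for illustration.

```python
import numpy as np

def structured_perceptron(data, feat, map_oracle, dim, epochs=10):
    """Structured perceptron: decode with the current weights and, on a mistake,
    update toward the gold features and away from the predicted features."""
    w = np.zeros(dim)
    for _ in range(epochs):
        changed = False
        for x, y_gold in data:
            y_hat = map_oracle(x, w)
            if y_hat != y_gold:
                w += feat(x, y_gold) - feat(x, y_hat)
                changed = True
        if not changed:      # an epoch with no updates: all examples decoded correctly
            break
    return w

# Toy usage: a 3-class problem treated as "structured", with brute-force MAP decoding.
labels = [0, 1, 2]

def feat(x, y):
    f = np.zeros(6)
    f[2 * y: 2 * y + 2] = x      # one feature block per label
    return f

def map_oracle(x, w):
    return max(labels, key=lambda y: w @ feat(x, y))

data = [(np.array([1.0, 0.0]), 0),
        (np.array([0.0, 1.0]), 1),
        (np.array([-1.0, -1.0]), 2)]
w = structured_perceptron(data, feat, map_oracle, dim=6, epochs=50)
print([map_oracle(x, w) for x, _ in data])   # decoded labels with the learned weights
```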
Structured perceptron algorithm

- If the training data is separable, the perceptron algorithm is guaranteed to find a weight vector which perfectly classifies all of the data
- When separable with margin γ, the number of iterations is at most
  $$\Big(\frac{2R}{\gamma}\Big)^2, \quad \text{where } R = \max_{m, y} \| f(x^m, y) \|_2$$
- In practice, one stops after a certain number of outer iterations (called epochs), and uses the average of all weights
- The averaging can be understood as a type of regularization to prevent overfitting
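A hedged sketch of the averaging variant: same update as the perceptron sketch above, but the returned weights are the running average over all inner iterations. Averaging schemes differ across implementations; this is just one simple choice.

```python
import numpy as np

def averaged_structured_perceptron(data, feat, map_oracle, dim, epochs=10):
    """Structured perceptron whose output is the average of the weight vector
    over all inner iterations (a common regularization heuristic)."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    count = 0
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = map_oracle(x, w)
            if y_hat != y_gold:
                w += feat(x, y_gold) - feat(x, y_hat)
            w_sum += w
            count += 1
    return w_sum / count
```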
Allowing slack

- We can equivalently write the constraints as
  $$w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \ge 1, \quad \forall y \neq y^m$$
- Suppose there do not exist weights w that satisfy all constraints
- Introduce slack variables $\xi_m \ge 0$, one per data point, to allow for constraint violations:
  $$w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \ge 1 - \xi_m, \quad \forall y \neq y^m$$
- Then, minimize the sum of the slack variables, $\min_{\xi \ge 0} \sum_m \xi_m$, subject to the above constraints
Structural SVM (support vector machine)

$$\min_{w, \xi} \; \sum_m \xi_m + C \|w\|^2$$

subject to:

$$w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \ge 1 - \xi_m, \quad \forall m, \; y \neq y^m$$
$$\xi_m \ge 0, \quad \forall m$$

- This is a quadratic program (QP). Solving for the slack variables in closed form, we obtain
  $$\xi^*_m = \max\Big(0, \; \max_{y \in \mathcal{Y}} \big[ 1 - w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \big]\Big)$$
- Thus, we can re-write the whole optimization problem as
  $$\min_w \; \sum_m \max\Big(0, \; \max_{y \in \mathcal{Y}} \big[ 1 - w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \big]\Big) + C \|w\|^2$$
Hinge loss

- We can view
  $$\max\Big(0, \; \max_{y \in \mathcal{Y}} \big[ 1 - w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \big]\Big)$$
  as a loss function, called hinge loss
- When $w \cdot f(x^m, y^m) \ge w \cdot f(x^m, y)$ for all y (i.e., correct prediction), this takes a value between 0 and 1
- When $\exists\, y$ such that $w \cdot f(x^m, y) \ge w \cdot f(x^m, y^m)$ (i.e., incorrect prediction), this takes a value ≥ 1
- Thus, this always upper bounds the 0-1 loss!
- Minimizing hinge loss is good because it minimizes an upper bound on the 0-1 loss (prediction error)
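The sketch below evaluates this hinge loss and the 0-1 loss by brute force over a tiny label space, so the upper-bounding relationship can be checked numerically. One interpretive choice: the inner max is taken over y ≠ y^m, matching the constraint set on the previous slide; the feature map and weights are made up.

```python
import numpy as np

def structured_hinge(w, feat, x, y_gold, label_space):
    """Unit-margin structured hinge loss max(0, max_{y != y_gold} [1 - w.(f(x,y_gold) - f(x,y))]),
    with the inner max computed by brute force (toy label spaces only)."""
    gold = w @ feat(x, y_gold)
    worst = max(1.0 - (gold - w @ feat(x, y)) for y in label_space if y != y_gold)
    return max(0.0, worst)

def zero_one(w, feat, x, y_gold, label_space):
    """0-1 loss: 1 if some other labeling scores at least as high as the gold one."""
    gold = w @ feat(x, y_gold)
    return float(any(w @ feat(x, y) >= gold for y in label_space if y != y_gold))

# Hypothetical 3-labeling toy: the hinge loss always upper-bounds the 0-1 loss.
feat = lambda x, y: np.array([x * (y == 0), x * (y == 1), x * (y == 2)])
w = np.array([0.2, 1.0, -0.3])
for y_gold in [0, 1, 2]:
    print(zero_one(w, feat, 2.0, y_gold, [0, 1, 2]),
          structured_hinge(w, feat, 2.0, y_gold, [0, 1, 2]))
```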
Better metrics

- It doesn't always make sense to penalize all incorrect predictions equally!
- We can change the constraints to
  $$w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \ge \Delta(y, y^m) - \xi_m, \quad \forall y,$$
  where $\Delta(y, y^m) \ge 0$ is a measure of how far the assignment y is from the true assignment $y^m$
- This is called margin scaling (as opposed to slack scaling)
- We assume that $\Delta(y, y) = 0$, which allows us to say that the constraint holds for all y, rather than just $y \neq y^m$
- A frequently used metric for MRFs is Hamming distance, where
  $$\Delta(y, y^m) = \sum_{i \in V} \mathbb{1}[y_i \neq y^m_i]$$
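For example, a minimal Hamming task loss over a labeling represented as a tuple of per-variable assignments:

```python
def hamming_delta(y, y_gold):
    """Hamming task loss: number of positions where the labeling disagrees
    with the true assignment (so Delta(y, y) = 0)."""
    return sum(yi != gi for yi, gi in zip(y, y_gold))

print(hamming_delta((1, 0, 1, 1), (1, 1, 1, 0)))   # -> 2
```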
Structural SVM with margin scaling

$$\min_w \; \sum_m \max_{y \in \mathcal{Y}} \Big[ \Delta(y, y^m) - w \cdot \big( f(x^m, y^m) - f(x^m, y) \big) \Big] + C \|w\|^2$$

How to solve this? Many methods!
1. Cutting-plane algorithm (Tsochantaridis et al., 2005)
2. Stochastic subgradient method (Ratliff et al., 2007)
3. Dual Loss Primal Weights algorithm (Meshi et al., 2010)
4. Frank-Wolfe algorithm (Lacoste-Julien et al., 2013)
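A hedged sketch of option 2, the stochastic subgradient method: for each example, loss-augmented decoding finds the most violating labeling argmax_y [Δ(y, y^m) + w · f(x^m, y)] (brute force here), and the weights take a step along the resulting subgradient. The step-size schedule and the way the regularizer is split across examples are arbitrary choices for this sketch, not details of the cited method; the toy data and feature map below are made up.

```python
import numpy as np

def ssvm_subgradient(data, feat, delta, label_space, dim, C=0.1, lr=0.5, epochs=50):
    """Stochastic subgradient sketch for the margin-scaled structural SVM objective
        sum_m max_y [ Delta(y, y^m) - w.(f(x^m, y^m) - f(x^m, y)) ] + C ||w||^2 ,
    with loss-augmented decoding done by brute force over `label_space` (toy sizes only)."""
    w = np.zeros(dim)
    M = len(data)
    for t in range(epochs):
        for x, y_gold in data:
            # Loss-augmented decoding: most violating labeling under the current weights.
            y_hat = max(label_space, key=lambda y: delta(y, y_gold) + w @ feat(x, y))
            # Subgradient of this example's hinge term plus its share of the regularizer.
            g = feat(x, y_hat) - feat(x, y_gold) + (2.0 * C / M) * w
            w -= lr / (1.0 + t) * g
    return w

# Toy usage: three labels, a block feature map, and a 0/1 task loss.
labels = [0, 1, 2]
def feat(x, y):
    f = np.zeros(6)
    f[2 * y: 2 * y + 2] = x
    return f
delta = lambda y, y_gold: float(y != y_gold)
data = [(np.array([1.0, 0.0]), 0),
        (np.array([0.0, 1.0]), 1),
        (np.array([-1.0, -1.0]), 2)]
w = ssvm_subgradient(data, feat, delta, labels, dim=6)
print([max(labels, key=lambda y: w @ feat(x, y)) for x, _ in data])   # decoded labels
```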