  1. Conditional Random Fields Dietrich Klakow

  2. Overview • Sequence Labeling • Bayesian Networks • Markov Random Fields • Conditional Random Fields • Software example

  3. Sequence Labeling Tasks

  4. Sequence: a sentence Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .

  5. POS Labels Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.

  6. Chunking Task: find phrase boundaries:

  7. Chunking Pierre/B-NP Vinken/I-NP ,/O 61/B-NP years/I-NP old/B-ADJP ,/O will/B-VP join/I-VP the/B-NP board/I-NP as/B-PP a/B-NP nonexecutive/I-NP director/I-NP Nov./B-NP 29/I-NP ./O

  8. Named Entity Tagging Pierre/B-PERSON Vinken/I-PERSON ,/O 61/B-DATE:AGE years/I-DATE:AGE old/I-DATE:AGE ,/O will/O join/O the/O board/B-ORG_DESC:OTHER as/O a/O nonexecutive/O director/B-PER_DESC Nov./B-DATE:DATE 29/I-DATE:DATE ./O

  9. Supertagging Pierre N/N Vinken N , , 61 N/N years N old (S[adj]\NP)\NP , , will (S[dcl]\NP)/(S[b]\NP) join ((S[b]\NP)/PP)/NP the NP[nb]/N board N as PP/NP a NP[nb]/N nonexecutive N/N director N Nov. ((S\NP)\(S\NP))/N[num] 29 N[num] . .

  10. Hidden Markov Model

  11. HMM: just an Application of a Bayes Classifier $(\hat{\pi}_1, \hat{\pi}_2, \ldots, \hat{\pi}_N) = \arg\max_{\pi_1, \pi_2, \ldots, \pi_N} P(x_1, x_2, \ldots, x_N, \pi_1, \pi_2, \ldots, \pi_N)$

  12. Decomposition of Probabilities $P(x_1, x_2, \ldots, x_N, \pi_1, \pi_2, \ldots, \pi_N) = \prod_{i=1}^{N} P(x_i \mid \pi_i)\, P(\pi_i \mid \pi_{i-1})$ with $P(\pi_i \mid \pi_{i-1})$ the transition probability and $P(x_i \mid \pi_i)$ the emission probability
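To make the decomposition concrete, here is a minimal sketch (not from the slides; the probability tables are invented toy values) that evaluates the joint probability as a product of emission and transition terms:

    # Toy HMM: P(x_1..x_N, pi_1..pi_N) = prod_i P(x_i | pi_i) P(pi_i | pi_{i-1})
    trans = {("<s>", "NNP"): 0.4, ("NNP", "VB"): 0.3}       # P(pi_i | pi_{i-1})
    emit = {("NNP", "Vinken"): 0.01, ("VB", "join"): 0.05}  # P(x_i | pi_i)

    def hmm_joint(words, labels):
        p, prev = 1.0, "<s>"             # "<s>" plays the role of pi_0
        for w, t in zip(words, labels):
            p *= emit.get((t, w), 0.0) * trans.get((prev, t), 0.0)
            prev = t
        return p

    print(hmm_joint(["Vinken", "join"], ["NNP", "VB"]))  # 0.01*0.4 * 0.05*0.3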

  13. Graphical view of the HMM [Figure: observation sequence $x_1, x_2, x_3, \ldots, x_N$ on top, label sequence $\pi_1, \pi_2, \pi_3, \ldots, \pi_N$ below; the labels form a chain and each $\pi_i$ emits $x_i$]

  14. Criticism • HMMs model only limited dependencies → come up with more flexible models → come up with a graphical description

  15. Bayesian Networks

  16. Example for a Bayesian Network From Russell and Norvig 95, AI: A Modern Approach [Figure: network over the variables C, S, R, W] Corresponding joint distribution: $P(C, S, R, W) = P(W \mid S, R)\, P(S \mid C)\, P(R \mid C)\, P(C)$
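To show how a joint probability is read off the network, the sketch below evaluates $P(C, S, R, W)$ from conditional probability tables (the values follow the familiar textbook example, but treat them here as illustrative):

    # P(C,S,R,W) = P(W|S,R) P(S|C) P(R|C) P(C) with toy CPTs
    P_C = 0.5                                        # P(C=true)
    P_S = {True: 0.1, False: 0.5}                    # P(S=true | C)
    P_R = {True: 0.8, False: 0.2}                    # P(R=true | C)
    P_W = {(True, True): 0.99, (True, False): 0.90,  # P(W=true | S, R)
           (False, True): 0.90, (False, False): 0.0}

    def joint(c, s, r, w):
        pc = P_C if c else 1.0 - P_C
        ps = P_S[c] if s else 1.0 - P_S[c]
        pr = P_R[c] if r else 1.0 - P_R[c]
        pw = P_W[(s, r)] if w else 1.0 - P_W[(s, r)]
        return pw * ps * pr * pc

    print(joint(True, False, True, True))  # 0.9 * 0.9 * 0.8 * 0.5 = 0.324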

  17. Naïve Bayes Observations $x_1, \ldots, x_D$ are assumed to be conditionally independent given the class $z$: $P(x_1, \ldots, x_D \mid z) = \prod_{i=1}^{D} P(x_i \mid z)$

  18. Markov Random Fields

  19. • Undirected graphical model • New term: clique in an undirected graph: a set of nodes such that every node is connected to every other node • maximal clique: no node can be added without destroying the clique property

  20. Example [Figure: example graph with two cliques highlighted in green and blue; the blue one is a maximal clique]

  21. Factorization Let $x$ denote all nodes $x_1, \ldots, x_N$, $x_C$ the nodes in clique $C$, and $C_M$ the set of all maximal cliques; $\Psi_C(x_C)$ is a potential function with $\Psi_C(x_C) \geq 0$. The joint distribution described by the graph is $p(x) = \frac{1}{Z} \prod_{C \in C_M} \Psi_C(x_C)$ with normalization $Z = \sum_x \prod_{C \in C_M} \Psi_C(x_C)$. $Z$ is sometimes called the partition function.
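For intuition, the sketch below (three binary variables and two invented clique potentials) computes the partition function $Z$ and the resulting joint distribution by brute-force enumeration, which is feasible only for tiny graphs:

    from itertools import product

    # Two maximal cliques over binary x0, x1, x2; potentials are invented
    def psi_a(x0, x1): return 2.0 if x0 == x1 else 0.5   # Psi on {x0, x1}
    def psi_b(x1, x2): return 1.5 if x1 == x2 else 1.0   # Psi on {x1, x2}

    def unnormalized(x):                  # product over maximal cliques
        return psi_a(x[0], x[1]) * psi_b(x[1], x[2])

    Z = sum(unnormalized(x) for x in product((0, 1), repeat=3))

    def p(x):                             # joint distribution from the graph
        return unnormalized(x) / Z

    print(Z, p((0, 0, 0)))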

  22. Example [Figure: graph over $x_1, x_2, x_3, x_4, x_5$] What are the maximal cliques? Write down the joint probability described by this graph → white board

  23. Energy Function Define $\Psi_C(x_C) = e^{-E_C(x_C)}$. Inserted into the joint distribution: $p(x) = \frac{1}{Z} e^{-\sum_{C \in C_M} E_C(x_C)}$

  24. Conditional Random Fields

  25. Definition A Markov random field where each random variable $y_i$ is conditioned on the complete input sequence $x_1, \ldots, x_n$ [Figure: label chain $y = (y_1, \ldots, y_n)$; every $y_i$ is connected to the whole input $x = (x_1, \ldots, x_n)$]

  26. Distribution $p(y \mid x) = \frac{1}{Z(x)} e^{-\sum_{i=1}^{n} \sum_{j=1}^{N} \lambda_j f_j(y_{i-1}, y_i, x, i)}$ with $\lambda_j$: parameters to be trained, and $f_j(y_{i-1}, y_i, x, i)$: feature function (see maximum entropy models)

  27. Example feature functions Modeling transitions: $f_1(y_{i-1}, y_i, x, i) = 1$ if $y_{i-1} = \text{IN}$ and $y_i = \text{NNP}$, else $0$. Modeling emissions: $f_2(y_{i-1}, y_i, x, i) = 1$ if $y_i = \text{NNP}$ and $x_i = \text{September}$, else $0$.
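These two feature functions translate directly into code. The sketch below (with invented weights $\lambda_j$; in practice they are learned in training) computes the unnormalized score $e^{-\sum_i \sum_j \lambda_j f_j(y_{i-1}, y_i, x, i)}$, i.e. $p(y \mid x)$ up to the factor $1/Z(x)$:

    import math

    def f1(y_prev, y, x, i):   # transition feature: IN followed by NNP
        return 1.0 if y_prev == "IN" and y == "NNP" else 0.0

    def f2(y_prev, y, x, i):   # emission feature: NNP emitting "September"
        return 1.0 if y == "NNP" and x[i] == "September" else 0.0

    features, lambdas = [f1, f2], [-1.2, -0.8]   # weights invented for the demo

    def unnormalized_score(y, x):
        total = 0.0
        y_ext = ["START"] + list(y)              # y_0 = START, as in decoding
        for i in range(len(x)):
            for lam, f in zip(lambdas, features):
                total += lam * f(y_ext[i], y_ext[i + 1], x, i)
        return math.exp(-total)

    print(unnormalized_score(["IN", "NNP"], ["in", "September"]))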

  28. Training • Like in maximum entropy models: generalized iterative scaling • Convergence: the log-likelihood of $p(y \mid x)$ is concave → unique maximum • Convergence is slow • Improved algorithms exist

  29. Decoding: Auxiliary Matrix Define an additional start symbol $y_0 = \text{START}$ and stop symbol $y_{n+1} = \text{STOP}$. Define the matrix $M^i(x)$ such that $[M^i(x)]_{y_{i-1} y_i} = M^i_{y_{i-1} y_i}(x) = e^{-\sum_{j=1}^{N} \lambda_j f_j(y_{i-1}, y_i, x, i)}$

  30. Reformulate Probability With that definition we have $p(y \mid x) = \frac{1}{Z(x)} \prod_{i=1}^{n+1} M^i_{y_{i-1} y_i}(x)$ with $Z(x) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_n} M^1_{y_0 y_1}(x)\, M^2_{y_1 y_2}(x) \cdots M^{n+1}_{y_n y_{n+1}}(x)$

  31. Use Matrix Properties Using the matrix product $[M^1(x)\, M^2(x)]_{y_0 y_2} = \sum_{y_1} M^1_{y_0 y_1}(x)\, M^2_{y_1 y_2}(x)$, the partition function becomes a single matrix element: $Z(x) = [M^1(x)\, M^2(x) \cdots M^{n+1}(x)]_{y_0 = \text{START},\, y_{n+1} = \text{STOP}}$
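A short sketch of this computation (the entries of each $M^i(x)$ are random positive placeholders standing in for $e^{-\sum_j \lambda_j f_j(y_{i-1}, y_i, x, i)}$; the label set and sentence length are invented):

    import numpy as np

    labels = ["START", "IN", "NNP", "STOP"]   # toy label set with boundaries
    n = 3                                     # sentence length
    rng = np.random.default_rng(0)
    Ms = [rng.random((len(labels), len(labels))) for _ in range(n + 1)]

    prod = np.linalg.multi_dot(Ms)            # M^1(x) M^2(x) ... M^{n+1}(x)
    Z = prod[labels.index("START"), labels.index("STOP")]
    print(Z)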

  32. Software

  33. CRF++ • See http://crfpp.sourceforge.net/
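A typical CRF++ session, sketched from the tool's documentation (file names are placeholders): training data has one token per line with the gold label in the last column, a template file declares the features, and training and testing run from the command line:

    # template file: %x[row,col] addresses the input relative to the
    # current token (row offset, column index); B adds label bigrams
    U00:%x[-1,0]
    U01:%x[0,0]
    U02:%x[1,0]
    B

    # train, then tag new data
    crf_learn template train.data model
    crf_test -m model test.data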

  34. Summary • Sequence labeling problems • CRFs are: flexible, expensive to train, fast to decode
