Conditional Random Fields: LING 572 Advanced Statistical Methods in NLP (PowerPoint presentation)


SLIDE 1

Conditional Random Fields

LING 572 Advanced Statistical Methods in NLP February 11, 2020

SLIDE 2

Announcements

  • HW4 grades out: 93.1 mean
  • HW6 posted later today
  • Implement beam search
  • Note: pay attention to data format + feature vectors (at test time)
  • Reading #2 posted!
  • Due Feb 18 at 11AM

SLIDE 3

Highlights

  • CRF is a form of undirected graphical model
  • Proposed by Lafferty, McCallum and Pereira in 2001
  • Used in many NLP tasks: e.g., Named-entity detection
  • Often conjoined with neural models, e.g. LSTM + CRF
  • Types:
  • Linear-chain CRF
  • Skip-chain CRF
  • General CRF

SLIDE 4

Outline

  • Graphical models
  • Linear-chain CRF
  • Skip-chain CRF

SLIDE 5

Graphical models

SLIDE 6

Graphical model

  • A graphical model is a probabilistic model for which a graph denotes the

conditional independence structure between random variables:

  • Nodes: random variables
  • Edges: dependency relation between random variables
  • Types of graphical models:
  • Bayesian network: directed acyclic graph (DAG)
  • Markov random fields: undirected graph

SLIDE 7

Bayesian network

SLIDE 8

Bayesian network

  • Graph: directed acyclic graph (DAG)
  • Nodes: random variables
  • Edges: conditional dependencies
  • Each node X is associated with a probability function P(X | parents(X))
  • Learning and inference: efficient algorithms exist.

SLIDE 9

An example


(from http://en.wikipedia.org/wiki/Bayesian_network)

[Figure: the rain/sprinkler/grassWet Bayesian network]

P(grassWet, sprinkler, rain) = P(grassWet | sprinkler, rain) P(sprinkler | rain) P(rain)
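The chain-rule factorization for the sprinkler network can be checked numerically. A minimal Python sketch; the CPT values below are made up for illustration, not taken from the slide:

```python
# Chain-rule joint for the sprinkler network:
#   P(grassWet, sprinkler, rain) = P(grassWet|sprinkler, rain) * P(sprinkler|rain) * P(rain)
# CPT numbers below are illustrative, not from the slide.

p_rain = {True: 0.2, False: 0.8}
p_sprinkler_given_rain = {True: {True: 0.01, False: 0.99},   # rain -> sprinkler -> prob
                          False: {True: 0.4, False: 0.6}}
p_wet_true = {(True, True): 0.99, (True, False): 0.9,        # (sprinkler, rain) -> P(wet)
              (False, True): 0.8, (False, False): 0.0}

def joint(wet, sprinkler, rain):
    p_w = p_wet_true[(sprinkler, rain)] if wet else 1.0 - p_wet_true[(sprinkler, rain)]
    return p_w * p_sprinkler_given_rain[rain][sprinkler] * p_rain[rain]

# a proper joint distribution sums to 1 over all eight assignments
total = sum(joint(w, s, r) for w in (True, False)
            for s in (True, False) for r in (True, False))
print(round(total, 10))  # 1.0
```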

SLIDE 10

Another example

[Figure: a Bayesian network over nodes A, B, C, D, E]

P(A, B, C, D, E) = P(A | B, E) P(B) P(E) P(D | E) P(C | A)

SLIDE 11

Bayesian network: properties

SLIDE 12

[Figure: the example Bayesian network over nodes A, B, C, D, E]
SLIDE 13

Naïve Bayes Model

[Figure: Naïve Bayes as a directed graphical model: class node Y with arrows to feature nodes f1, f2, …, fn]
SLIDE 14

HMM

  • State sequence: X1:n+1
  • Output sequence: O1:n

[Figure: chain X1 → X2 → X3 → … → Xn+1, emitting outputs O1, O2, …, On]

P(O1:n, X1:n+1) = π(X1) Π_{i=1..n} P(Xi+1 | Xi) P(Oi | Xi+1)
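The factored HMM joint can be computed directly by multiplying along the chain. A minimal Python sketch; the state/output sets and all probabilities below are made up for illustration:

```python
# Joint probability of an HMM, following the factorization:
#   P(O_1:n, X_1:n+1) = pi(X_1) * prod_{i=1..n} P(X_{i+1}|X_i) * P(O_i|X_{i+1})
# (each output is emitted by the state just entered). All parameters are made up.

pi = {"A": 0.6, "B": 0.4}
trans = {("A", "A"): 0.7, ("A", "B"): 0.3,
         ("B", "A"): 0.4, ("B", "B"): 0.6}
emit = {("A", "x"): 0.5, ("A", "y"): 0.5,
        ("B", "x"): 0.1, ("B", "y"): 0.9}

def hmm_joint(states, outputs):
    """states has length n+1; outputs has length n."""
    assert len(states) == len(outputs) + 1
    p = pi[states[0]]
    for i, o in enumerate(outputs):
        p *= trans[(states[i], states[i + 1])] * emit[(states[i + 1], o)]
    return p

# probabilities over all length-3 state paths and length-2 output strings sum to 1
total = sum(hmm_joint([s0, s1, s2], [o0, o1])
            for s0 in "AB" for s1 in "AB" for s2 in "AB"
            for o0 in "xy" for o1 in "xy")
print(round(total, 10))  # 1.0
```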

SLIDE 15

Generative model

  • A directed graphical model in which the output (i.e., what to predict) topologically precedes the input (i.e., what is given as observation).

  • Naïve Bayes and HMM are generative models.

SLIDE 16

Markov Random Field

SLIDE 17

Markov random field

  • Also called “Markov network”
  • A graphical model in which a set of random variables has a Markov property:
  • Local Markov property: a variable is conditionally independent of all other variables given its neighbors.

SLIDE 18

Cliques

  • A clique in an undirected graph is a subset of its vertices such that every two vertices in the subset are connected by an edge.

  • A maximal clique is a clique that cannot be extended by adding one more vertex.
  • A maximum clique is a clique of the largest possible size in a given graph.

[Figure: example graph over vertices A, B, C, D, E, with a clique, a maximal clique, and a maximum clique highlighted]
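The three clique notions can be checked mechanically by brute force. A small Python sketch; the edge set below is made up, since the slide's figure did not survive extraction:

```python
from itertools import combinations

# Brute-force check of clique / maximal clique / maximum clique on a small graph.
# The edge set is invented for illustration.
vertices = {"A", "B", "C", "D", "E"}
edges = {frozenset(e) for e in [("A", "B"), ("A", "C"), ("B", "C"),
                                ("C", "D"), ("D", "E")]}

def is_clique(s):
    # every two vertices in s must be connected by an edge
    return all(frozenset(p) in edges for p in combinations(s, 2))

cliques = [set(s) for k in range(1, len(vertices) + 1)
           for s in combinations(sorted(vertices), k) if is_clique(s)]

# maximal: cannot be extended by adding one more vertex
maximal = [c for c in cliques
           if not any(is_clique(c | {v}) for v in vertices - c)]
# maximum: largest possible size in the graph
maximum = [c for c in cliques if len(c) == max(map(len, cliques))]

print(sorted(tuple(sorted(c)) for c in maximal))  # [('A', 'B', 'C'), ('C', 'D'), ('D', 'E')]
print([tuple(sorted(c)) for c in maximum])        # [('A', 'B', 'C')]
```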

SLIDE 19

Clique factorization

[Figure: the example graph over vertices A, B, C, D, E]
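The factorization formula itself did not survive extraction; the standard clique factorization of a Markov random field, which this slide presumably presents, is:

```latex
P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C)
```

where C ranges over the (maximal) cliques of the graph, the ψ_C ≥ 0 are potential functions, and Z is the partition function that normalizes the product into a distribution.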

SLIDE 20

Conditional Random Field


A CRF is a random field globally conditioned on the observation X.

SLIDE 21

Linear-chain CRF

SLIDE 22

Motivation

  • Sequence labeling problem: e.g., POS tagging
  • HMM: Find best sequence, but cannot use rich features
  • MaxEnt: Use rich features, but may not find the best sequence
  • Linear-chain CRF: HMM + MaxEnt

SLIDE 23

Relations between NB, MaxEnt, HMM, and CRF

SLIDE 24

Most Basic Linear-chain CRF

SLIDE 25

Linear-chain CRF (**)
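The defining formula did not survive extraction; the standard linear-chain CRF form, with feature functions f_j and weights λ_j, is:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y_{i-1}, y_i, x, i)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_{i=1}^{n} \sum_{j} \lambda_j \, f_j(y'_{i-1}, y'_i, x, i)\Big)
```

Note that Z(x) is conditioned only on the observation x (it sums over all label sequences y'), matching the earlier definition of a CRF as a random field globally conditioned on X.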

SLIDE 26

Training and decoding

  • Training: estimate the weights λj
  • similar to the one used for MaxEnt
  • Ex: L-BFGS
  • Decoding: find the best sequence y
  • similar to the one used for HMM
  • Viterbi algorithm
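Viterbi decoding for a linear-chain CRF maximizes the summed per-position scores over tag sequences. A minimal Python sketch, where score() stands in for Σ_j λ_j f_j(y_{i-1}, y_i, x, i); the tag set and feature weights are made up for illustration:

```python
# Viterbi decoding for a linear-chain CRF (sketch, log-space so scores add).
# score(prev, tag, x, i) stands in for sum_j lambda_j * f_j(prev, tag, x, i);
# the tags and weights below are toy values invented for illustration.

TAGS = ["N", "V"]

def score(prev, tag, x, i):
    s = 0.0
    if x[i].endswith("s") and tag == "V":   # toy observation feature
        s += 1.0
    if prev == "N" and tag == "V":          # toy transition feature
        s += 0.5
    return s

def viterbi(x):
    # delta[t]: best score of any tag prefix for x[:i+1] ending in tag t
    delta = {t: score(None, t, x, 0) for t in TAGS}
    backptrs = []
    for i in range(1, len(x)):
        new_delta, ptr = {}, {}
        for t in TAGS:
            best_prev = max(TAGS, key=lambda p: delta[p] + score(p, t, x, i))
            new_delta[t] = delta[best_prev] + score(best_prev, t, x, i)
            ptr[t] = best_prev
        delta = new_delta
        backptrs.append(ptr)
    # recover the best sequence by following back-pointers from the best final tag
    best = max(TAGS, key=lambda t: delta[t])
    seq = [best]
    for ptr in reversed(backptrs):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

print(viterbi(["dogs", "runs"]))  # ['V', 'V'] under these toy weights
```

The same dynamic program underlies HMM decoding; only the local score changes, which is why the slide pairs CRF decoding with the Viterbi algorithm.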

SLIDE 27

Skip-chain CRF

SLIDE 28

Motivation

  • Sometimes, we need to handle long-distance dependency, which is not allowed by a linear-chain CRF

  • An example: NE detection
  • “Senator John Green … Green ran …”

SLIDE 29

[Figure: a linear-chain CRF vs. a skip-chain CRF, which adds long-distance edges between related labels (e.g., the two mentions of "Green")]
SLIDE 30

CRFs in Larger Models

SLIDE 31

CRFs in Larger Models

SLIDE 32


Source: NLP Progress

SLIDE 33

Summary

  • Graphical models:
  • Bayesian network (BN)
  • Markov random field (MRF)
  • CRF is a variant of MRF:
  • Linear-chain CRF: HMM + MaxEnt
  • Skip-chain CRF: can handle long-distance dependency
  • General CRF
  • Pros and cons of CRF:
  • Pros: higher accuracy than HMM and MaxEnt
  • Cons: training and inference can be very slow
