

SLIDE 1

Discriminative Models

Joakim Nivre

Uppsala University
Department of Linguistics and Philology
joakim.nivre@lingfil.uu.se

Discriminative Models 1(11)

SLIDE 2
  • 1. Generative and Discriminative Models
  • 2. Log-Linear Models
  • 3. Local Discriminative Models
  • 4. Global Discriminative Models
  • 5. Reranking


SLIDE 3

Generative and Discriminative Models

A generative statistical model defines the joint probability P(x, y) of input x and output y

◮ Pros:

◮ Learning problems have closed-form solutions
◮ Related probabilities can be derived:
  ◮ Conditionalization: P(y|x) = P(x, y) / P(x)
  ◮ Marginalization: P(x) = Σ_y P(x, y)

◮ Cons:

◮ Rigid independence assumptions (or intractable parsing)
◮ Indirect modeling of parsing problem
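To make conditionalization and marginalization concrete, here is a minimal sketch over an invented toy joint distribution (the outcome names and probabilities are illustrative, not from the slides):

```python
# Sketch: deriving related probabilities from a generative model's joint P(x, y).
# The toy joint distribution below is invented for illustration.
joint = {
    ("x1", "y1"): 0.3, ("x1", "y2"): 0.1,
    ("x2", "y1"): 0.2, ("x2", "y2"): 0.4,
}

def marginal_x(x):
    # Marginalization: P(x) = sum_y P(x, y)
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional(y, x):
    # Conditionalization: P(y|x) = P(x, y) / P(x)
    return joint[(x, y)] / marginal_x(x)

print(marginal_x("x1"))         # 0.4
print(conditional("y1", "x1"))  # 0.75
```

This is exactly the convenience that discriminative models give up: a conditional model stores only the last quantity and cannot recover the first.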

SLIDE 4

Generative and Discriminative Models

A discriminative statistical model defines the conditional probability P(y|x) of output y given input x

◮ Pros:

◮ No rigid independence assumptions
◮ More direct modeling of parsing problem

◮ Cons:

◮ Learning problems require numerical approximation
◮ Related probabilities cannot be derived:
  ◮ No way to compute P(x, y) from P(y|x)
  ◮ No way to compute P(x) or P(y) from P(y|x)

SLIDE 5

Generative and Discriminative Models

Two classes of discriminative models:

◮ Conditional models:

◮ Explicitly model the conditional probability P(y|x)
◮ Used in mapping X → Y: argmax_y P(y|x)

◮ Purely discriminative models:

◮ Directly optimize mapping X → Y
◮ No explicit model of conditional probability P(y|x)

SLIDE 6

Log-Linear Models

P(y|x) = exp(Σ_{i=1}^{k} f_i(x, y) · w_i) / Σ_{y′∈GEN(x)} exp(Σ_{i=1}^{k} f_i(x, y′) · w_i)

◮ f_i(x, y) = feature function
◮ w_i = feature weight
◮ exp(Σ_{i=1}^{k} f_i(x, y) · w_i) > 0
◮ exp(Σ_{i=1}^{k} f_i(x, y) · w_i) ≤ Σ_{y′∈GEN(x)} exp(Σ_{i=1}^{k} f_i(x, y′) · w_i)
◮ Hence 0 ≤ P(y|x) ≤ 1 and Σ_{y′} P(y′|x) = 1
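A minimal sketch of this normalization, with invented feature vectors and weights (the candidate set stands in for GEN(x)):

```python
import math

# Sketch of a log-linear model; features and weights are invented.
def score(f, w):
    # sum_i f_i(x, y) * w_i, with f already evaluated on (x, y)
    return sum(fi * wi for fi, wi in zip(f, w))

def log_linear_prob(feats_by_y, w, y):
    # P(y|x) = exp(score(y)) / sum over y' in GEN(x) of exp(score(y'))
    z = sum(math.exp(score(f, w)) for f in feats_by_y.values())
    return math.exp(score(feats_by_y[y], w)) / z

feats = {"y1": [1.0, 0.0], "y2": [0.0, 1.0]}  # f_i(x, y) per candidate y
w = [2.0, 1.0]
probs = {y: log_linear_prob(feats, w, y) for y in feats}
print(probs)  # each value lies in [0, 1]; values sum to 1
```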

SLIDE 7

Log-Linear Models

y∗ = argmax_y P(y|x)
   = argmax_y exp(Σ_{i=1}^{k} f_i(x, y) · w_i) / Σ_{y′∈GEN(x)} exp(Σ_{i=1}^{k} f_i(x, y′) · w_i)
   = argmax_y exp(Σ_{i=1}^{k} f_i(x, y) · w_i)
   = argmax_y Σ_{i=1}^{k} f_i(x, y) · w_i

(The normalizer is constant in y, and exp is strictly increasing, so both can be dropped.)
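The derivation can be checked numerically with invented candidate scores: the most probable candidate is also the one with the highest raw linear score, so decoding never needs the normalizer.

```python
import math

# Sketch: argmax over probabilities vs. argmax over raw linear scores.
# The candidate scores below are invented for illustration.
scores = {"y1": 1.5, "y2": -0.3, "y3": 2.2}  # sum_i f_i(x, y) * w_i per candidate

z = sum(math.exp(s) for s in scores.values())
probs = {y: math.exp(s) / z for y, s in scores.items()}

best_by_prob = max(probs, key=probs.get)
best_by_score = max(scores, key=scores.get)
print(best_by_prob, best_by_score)  # same candidate either way
```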

SLIDE 8

Local Discriminative Models

P(y|x) = Π_{i=1}^{m} P(d_i | Φ(d_1, …, d_{i−1}, x))

P(d_i | Φ(d_1, …, d_{i−1}, x)) = exp(Σ_{j=1}^{k} f_j(Φ(d_1, …, d_{i−1}, x), d_i) · w_j) / Σ_{d′∈GEN(x)} exp(Σ_{j=1}^{k} f_j(Φ(d_1, …, d_{i−1}, x), d′) · w_j)

◮ Conditional model over local decisions
◮ Pros: unconstrained features, efficient learning/decoding
◮ Cons: approximate search (beam search or similar)
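Beam-search decoding over a sequence of local decisions can be sketched as follows; the local scoring function here is an invented stand-in for a real locally normalized model, not an actual parser:

```python
import math

# Sketch: beam search over local decisions d_1..d_m.
# local_scores is a toy stand-in for P(d | Phi(d_1..d_{i-1}, x)).
def local_scores(history, x):
    # Invented logits; a real model would compute features of (history, x).
    logits = {d: -abs(len(history) - d) for d in (0, 1, 2)}
    z = math.log(sum(math.exp(v) for v in logits.values()))
    return {d: v - z for d, v in logits.items()}  # log-probabilities

def beam_search(x, steps, beam_size=2):
    beams = [((), 0.0)]  # (decision sequence, total log-probability)
    for _ in range(steps):
        expanded = [
            (hist + (d,), lp + dlp)
            for hist, lp in beams
            for d, dlp in local_scores(hist, x).items()
        ]
        # Keep only the beam_size best partial sequences (approximate search)
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0]

best, logprob = beam_search("some input", steps=3)
print(best, logprob)
```

With beam_size=1 this degenerates to greedy search; widening the beam trades speed for a better approximation of the true argmax.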

SLIDE 9

Global Discriminative Models

P(y|x) = exp(Σ_{i=1}^{k} f_i(x, y) · w_i) / Σ_{y′∈GEN(x)} exp(Σ_{i=1}^{k} f_i(x, y′) · w_i)

◮ Conditional model over global structure
◮ Factorization for efficient inference (dynamic programming)
◮ Pros: exact learning/decoding
◮ Cons: only local features, less efficient
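A minimal sketch of why factorization buys exact decoding: when the global score decomposes into per-position and adjacent-label terms, Viterbi-style dynamic programming finds the exact argmax. The labels and scores below are invented for illustration:

```python
# Sketch: exact argmax by dynamic programming when features factor locally.
# Labels, per-position scores, and transition scores are invented.
LABELS = ("A", "B")
emit = [{"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 2.0}]   # per-position scores
trans = {("A", "A"): 0.5, ("A", "B"): 1.0,
         ("B", "A"): 0.0, ("B", "B"): 0.5}            # adjacent-label scores

def viterbi(emit, trans):
    # best[y] = max total score of a label prefix ending in y
    best = dict(emit[0])
    back = []
    for scores in emit[1:]:
        prev, best, ptr = best, {}, {}
        for y in LABELS:
            p = max(LABELS, key=lambda yp: prev[yp] + trans[(yp, y)])
            best[y] = prev[p] + trans[(p, y)] + scores[y]
            ptr[y] = p
        back.append(ptr)
    y = max(best, key=best.get)
    seq = [y]
    for ptr in reversed(back):  # follow back-pointers
        y = ptr[y]
        seq.append(y)
    return seq[::-1]

print(viterbi(emit, trans))  # exact argmax under the factored score
```

The "only local features" con is visible in the code: nothing in the recurrence can look at labels more than one position back.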

SLIDE 10

Reranking

P(y|x) = exp(Σ_{i=1}^{k} f_i(x, y) · w_i) / Σ_{y′∈GEN_n(x)} exp(Σ_{i=1}^{k} f_i(x, y′) · w_i)

◮ Conditional model over global structure
◮ GEN_n(x) = n-best list for efficient inference
◮ Pros: unconstrained features, (almost) exact learning/decoding
◮ Cons: can be inefficient
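Reranking can be sketched as scoring each candidate in the n-best list with a log-linear model over arbitrary global features; the candidates, feature function, and weights below are invented for illustration:

```python
import math

# Sketch: reranking an n-best list with a log-linear model.
# Candidates, features, and weights are invented.
n_best = ["parse_a", "parse_b", "parse_c"]  # GEN_n(x) from a base model

def features(x, y):
    # Stand-in global feature function f(x, y); scope is unconstrained.
    return [float(len(y)), 1.0 if y.endswith("b") else 0.0]

w = [0.1, 2.0]

def rerank(x, candidates):
    # Normalization runs over the n-best list only, so inference is a max.
    scores = {y: sum(f * wi for f, wi in zip(features(x, y), w))
              for y in candidates}
    z = sum(math.exp(s) for s in scores.values())
    probs = {y: math.exp(s) / z for y, s in scores.items()}
    return max(probs, key=probs.get), probs

best, probs = rerank("sentence", n_best)
print(best)
```

Restricting GEN(x) to the n-best list is what makes rich features affordable: the model only ever evaluates n candidates, at the cost of missing any analysis the base model left out of the list.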

SLIDE 11

Reranking

On Exact and Approximate Methods

What if the objective function we want to maximize is not efficiently computable in our favorite model?

  • 1. Use a simpler model (e.g., restrict feature scope)
  • 2. Use approximate inference (e.g., beam search or reranking)
  • 3. Use another objective function (e.g., labeled recall)

Which strategy works best is (usually) an empirical question!
