Maxent Models, Conditional Estimation, and Optimization
Dan Klein and Chris Manning Stanford University http://nlp.stanford.edu/
HLT-NAACL 2003 and ACL 2003 Tutorial
Without Magic: That is, With Math!

Introduction
In recent years there has been extensive use of conditional or discriminative probabilistic models in NLP, IR, and speech, because:
They give high accuracy performance.
They make it easy to incorporate lots of linguistically important features.
They allow automatic building of language-independent, retargetable NLP modules.
Joint vs Conditional Models
Joint (generative) models place probabilities over both observed data and the hidden stuff, generating the observed data from the hidden stuff: P(c,d).
All the best-known StatNLP models are generative: n-gram models, Naïve Bayes classifiers, hidden Markov models, probabilistic context-free grammars.
Discriminative (conditional) models take the data as given, and put a probability over the hidden structure given the data: P(c|d).
Bayes net notation: circles (nodes) for variables, lines (arcs) for direct dependencies; each node carries a local distribution (conditional probability table) based on its incoming arcs.
[Figure: the generative P(c,d) and discriminative P(c|d) models drawn as Bayes nets over classes c1 c2 c3 and data d1 d2 d3.]
Features
Features are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict.
Here, features are indicator functions of properties of the input and a particular class (every one we present is); the value is 0 or 1, and each feature picks out a subset of the data-class pairs.
We will freely say that a property Φ(d) is a feature of the data, and that Φ(d) ∧ c = ci is a feature of the data-class pair (c, d).
f1(c, d) ≡ [c = “NN” ∧ islower(w0) ∧ ends(w0, “d”)]
f2(c, d) ≡ [c = “NN” ∧ w−1 = “to” ∧ t−1 = “TO”]
f3(c, d) ≡ [c = “VB” ∧ islower(w0)]
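A minimal sketch of these indicator features in Python; the datum field names (word, prev_word, prev_tag) are illustrative assumptions, not the tutorial's code:

    # Indicator features over a (class, datum) pair; value is 0 or 1.
    def f1(c, d):
        return 1 if c == "NN" and d["word"].islower() and d["word"].endswith("d") else 0

    def f2(c, d):
        return 1 if c == "NN" and d["prev_word"] == "to" and d["prev_tag"] == "TO" else 0

    def f3(c, d):
        return 1 if c == "VB" and d["word"].islower() else 0

    d = {"word": "passed", "prev_word": "to", "prev_tag": "TO"}
    print(f1("NN", d), f2("NN", d), f3("VB", d))  # 1 1 1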
Feature Expectations
We will crucially make use of two expectations: actual (empirical) counts and predicted counts.
Empirical count of a feature: empirical E(fi) = Σ(c,d)∈(C,D) fi(c,d), summed over the observed data (C,D).
Model expectation of a feature: E(fi) = Σ(c,d) P(c,d) fi(c,d).
Example: text classification.
Features are a word in the document together with a class (they do feature selection to use reliable indicators).
Tests on the classic Reuters data set (and others):
Naïve Bayes: 77.0% F1
Linear regression: 86.0%
Logistic regression: 86.4%
Support vector machine: 86.5%
This work emphasizes the importance of regularization (smoothing) for successful use of discriminative methods, which was not used in most early NLP/IR work.
Example: POS tagging.
[Slides: the local context around a decision point, the features extracted from it, and the learned feature weights.]
Example: sentence boundary detection. Is a period the end of a sentence or an abbreviation?
Example: PP attachment. Features of the head noun, preposition, etc.
Example: language modeling. P(w0|w−1,…,w−n). Features are word n-gram features, and trigger features which model repetitions of the same word.
Example: parsing. Either local classifications decide parser actions, or feature counts choose a parse.
We have some data {(c, d)} and we want to place probability distributions over it.
A joint model places a probability P(c,d) over both the class and the data, and we choose weights to maximize this likelihood.
It turns out to be trivial to choose the weights: just relative frequencies.
A conditional model takes the data as given and models only the conditional probability P(c|d) of the class.
We seek to maximize conditional likelihood.
This is harder to do (as we'll see…).
But it is more closely related to classification error.
Linear classifiers: classify from feature sets {fi} to classes {c}.
Assign a weight λi to each feature fi.
For a pair (c,d), features vote with their weights: vote(c) = Σi λi fi(c,d).
Choose the class c which maximizes vote(c) (in the running POS example, VB).
There are many ways to choose these weights; a sketch of the voting scheme follows.
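A minimal sketch of this voting classifier, with the indicator features inlined and the λi values made up for illustration:

    # Linear classifier: vote(c) = sum_i lambda_i * f_i(c, d); predict the argmax class.
    features = [
        lambda c, d: 1 if c == "NN" and d["word"].islower() and d["word"].endswith("d") else 0,
        lambda c, d: 1 if c == "NN" and d["prev_word"] == "to" and d["prev_tag"] == "TO" else 0,
        lambda c, d: 1 if c == "VB" and d["word"].islower() else 0,
    ]
    weights = [0.5, 1.2, 1.8]  # illustrative lambda_i values

    def vote(c, d):
        return sum(w * f(c, d) for w, f in zip(weights, features))

    def classify(d, classes=("NN", "VB")):
        return max(classes, key=lambda c: vote(c, d))

    print(classify({"word": "passed", "prev_word": "to", "prev_tag": "TO"}))  # VB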
Exponential (log-linear, maxent, logistic, Gibbs) models: make the votes positive by exponentiating, and normalize them into a probability distribution:

P(c|d,λ) = exp Σi λi fi(c,d) / Σc′ exp Σi λi fi(c′,d)
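A sketch of this normalized exponential form (a plain softmax over the feature votes; all names illustrative):

    import math

    # P(c|d, lambda) = exp(score(c, d)) / sum_c' exp(score(c', d))
    def probs(d, classes, features, weights):
        scores = {c: sum(w * f(c, d) for w, f in zip(weights, features)) for c in classes}
        z = sum(math.exp(s) for s in scores.values())  # normalizer
        return {c: math.exp(s) / z for c, s in scores.items()}

    features = [lambda c, d: 1 if c == "NN" else 0]
    print(probs("any datum", ["NN", "VB"], features, [1.0]))  # NN gets e/(e+1), about 0.73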
The weights {λi} are the parameters of the model; learning means deciding how to weight the features, given data.
This construction gives us probability distributions over classifications, not just scores.
There are other ways of turning feature votes into classes: SVMs, boosting, even perceptrons, though these methods are not as trivial to interpret as distributions over classes.
We will also see later what maximizing the likelihood according to the exponential model has to do with entropy.
Given a model form, choose values of parameters to maximize the (conditional) likelihood of the data.
log P(C|D,λ) = Σ(c,d)∈(C,D) log P(c|d,λ)
The (log) conditional likelihood is a function of the iid data (C,D) and the parameters λ. Plugging in the model form, it is easy to compute when there aren't many classes:

log P(C|D,λ) = Σ(c,d)∈(C,D) log [ exp Σi λi fi(c,d) / Σc′ exp Σi λi fi(c′,d) ]

We can separate this into two components:

log P(C|D,λ) = Σ(c,d)∈(C,D) log exp Σi λi fi(c,d) − Σ(c,d)∈(C,D) log Σc′ exp Σi λi fi(c′,d) = N(λ) − M(λ)

The derivative is the difference between the derivatives of the numerator term N(λ) and the denominator term M(λ).
The Derivative I: Numerator

∂N(λ)/∂λi = ∂/∂λi Σ(c,d)∈(C,D) Σi′ λi′ fi′(c,d) = Σ(c,d)∈(C,D) fi(c,d)

Derivative of the numerator is the empirical count(fi, C).
The Derivative II: Denominator

∂M(λ)/∂λi = ∂/∂λi Σ(c,d)∈(C,D) log Σc′ exp Σi′ λi′ fi′(c′,d)
= Σ(c,d)∈(C,D) [ 1 / Σc″ exp Σi′ λi′ fi′(c″,d) ] Σc′ exp(Σi′ λi′ fi′(c′,d)) fi(c′,d)
= Σ(c,d)∈(C,D) Σc′ P(c′|d,λ) fi(c′,d)

Derivative of the denominator is the predicted count(fi, λ).
This is a perfect situation for general optimization (Part II).
The Derivative III: summary.

log P(C|D,λ) = Σ(c,d)∈(C,D) log [ exp Σi λi fi(c,d) / Σc′ exp Σi λi fi(c′,d) ]

∂ log P(C|D,λ) / ∂λi = actual count(fi, C) − predicted count(fi, λ)

At the optimum, each feature's predicted count equals its empirical count.
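A sketch of this gradient in code: for every feature, the empirical count over the data minus the expected count under the current model. probs() is repeated from the earlier sketch so this block runs on its own; all names are illustrative.

    import math

    def probs(d, classes, features, weights):
        scores = {c: sum(w * f(c, d) for w, f in zip(weights, features)) for c in classes}
        z = sum(math.exp(s) for s in scores.values())
        return {c: math.exp(s) / z for c, s in scores.items()}

    def gradient(data, classes, features, weights):
        # d log P(C|D, lambda) / d lambda_i = actual count(f_i) - predicted count(f_i)
        grad = [0.0] * len(features)
        for c, d in data:
            p = probs(d, classes, features, weights)
            for i, f in enumerate(features):
                grad[i] += f(c, d)                                  # empirical count
                grad[i] -= sum(p[c2] * f(c2, d) for c2 in classes)  # predicted count
        return grad

    features = [lambda c, d: 1 if c == "NN" and d.endswith("d") else 0]
    data = [("NN", "passed"), ("VB", "pass")]
    print(gradient(data, ["NN", "VB"], features, [0.0]))  # [0.5]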
But first… what has all this got to do with maximum entropy models?
Lots of distributions out there, most of them very spiked, specific, overfit.
We want a distribution which is uniform except in specific ways we require.
Uniformity means high entropy: we can search for distributions which have the properties we desire, but also have high entropy.
Entropy: the uncertainty of a distribution.
Quantifying uncertainty ('surprise'): an event x with probability px has surprise log(1/px).
Entropy is the expected surprise: H(p) = Ep[log2(1/px)] = −Σx px log2 px.
[Plot: entropy H(pHEADS) of a coin flip with pHEADS + pTAILS = 1, e.g. pHEADS = 0.3; entropy is maximal for a fair coin. The function −p log p peaks at p = 1/e.]
Maximum entropy principle: choose p* = arg max H(p) subject to the feature constraints Ep[fi] = empirical E[fi].
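A quick numeric check of the coin entropy (standard arithmetic, not code from the tutorial):

    import math

    def H(p):
        # Entropy in bits of a coin with P(HEADS) = p: -sum_x p_x log2 p_x
        return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

    print(H(0.5))  # 1.0: a fair coin is maximally uncertain
    print(H(0.3))  # ~0.881: less entropy at pHEADS = 0.3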
Example: a maxent distribution over the tags {NN, NNS, NNP, NNPS, VBZ, VBD}.
With only the sum-to-one constraint, maximum entropy gives the uniform distribution: 6/36 each.
Add the feature fN = [tag is one of NN, NNS, NNP, NNPS] with empirical expectation 32/36: the maxent solution is uniform within groups, 8/36 each for the four N tags and 2/36 each for VBZ and VBD.
Add fP = [tag is NNP or NNPS] with empirical expectation 24/36: the solution becomes 12/36 each for NNP and NNPS, 4/36 each for NN and NNS, and 2/36 each for VBZ and VBD.
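A quick sanity check of these numbers, assuming the constraints are as reconstructed above: the stated distribution meets both feature expectations and has higher entropy than an alternative distribution that also meets them.

    import math

    def H(p):
        return -sum(q * math.log(q) for q in p if q > 0)

    # Order: NN, NNS, NNP, NNPS, VBZ, VBD
    maxent = [4/36, 4/36, 12/36, 12/36, 2/36, 2/36]
    other = [6/36, 2/36, 14/36, 10/36, 3/36, 1/36]  # also has fN = 32/36, fP = 24/36
    for p in (maxent, other):
        print(sum(p[:4]), p[2] + p[3], H(p))  # fN, fP, entropy (maxent's H is larger)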
[Tables: the feature weights λ corresponding to these constrained distributions, and the local-context feature weights from the POS tagging example.]

Naïve Bayes as a maxent model. Naïve Bayes is a model of a class c generating independent evidence features φ1, φ2, φ3:
P(c|d,λ) = P(c) Πi P(φi|c) / Σc′ P(c′) Πi P(φi|c′)
= exp[ log P(c) + Σi log P(φi|c) ] / Σc′ exp[ log P(c′) + Σi log P(φi|c′) ]
= exp Σi λic fic(d,c) / Σc′ exp Σi λic′ fic′(d,c′)

So Naïve Bayes is just an exponential (maxent) model with particular, locally estimated weights.
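A numeric illustration of this equivalence, with made-up probabilities: using log P(c) and log P(φi|c) as the maxent weights reproduces the Naïve Bayes posterior exactly.

    import math

    priors = {"c1": 0.6, "c2": 0.4}
    lik = {"c1": 0.9, "c2": 0.2}  # P(phi = 1 | c) for a single observed feature phi

    def nb_posterior(c):
        z = sum(priors[c2] * lik[c2] for c2 in priors)
        return priors[c] * lik[c] / z

    def maxent_posterior(c):
        # Weights: lambda_c = log P(c) and lambda_{phi,c} = log P(phi|c)
        score = {c2: math.log(priors[c2]) + math.log(lik[c2]) for c2 in priors}
        z = sum(math.exp(s) for s in score.values())
        return math.exp(score[c]) / z

    print(nb_posterior("c1"), maxent_posterior("c1"))  # identical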
Comparison to Naïve Bayes.
Naïve Bayes: features assumed to supply independent evidence; feature weights can be set independently; features must be of the conjunctive Φ(d) ∧ c = ci form; trained to maximize the joint likelihood of data and classes.
Maxent: feature weights take feature dependence into account; feature weights must be jointly estimated; features need not be of the conjunctive form (but we'll assume they are); trained to maximize the conditional likelihood of classes.

Example: two sensors reporting on the weather (Raining vs Sunny).
Reality: P(+,+,r) = 3/8, P(−,−,r) = 1/8, P(+,+,s) = 1/8, P(−,−,s) = 3/8; the two sensors always agree.
[Figure: Raining? with sensors M1, M2: NB model vs reality.]
NB multi-counts the correlated evidence. Maxent behavior: take a model over (M1,…,Mn,R) with features:
fri: Mi = +, R = r, with weight λri
fsi: Mi = +, R = s, with weight λsi
exp(λri − λsi) is the factor analogous to P(+|r)/P(+|s)…
… but instead of being 3, it will be 3^(1/n)…
… because if it were 3, E[fri] would be far higher than the target of 3/8!
Example: stoplights.
Reality: lights working: P(g,r,w) = 3/7 and P(r,g,w) = 3/7; lights broken: P(r,r,b) = 1/7.
[Figure: Working? with the NS and EW lights: NB model vs reality.]
NB model: P(b,r,r) = (1/7)(1)(1) = 1/7 = 4/28, and P(w,r,r) = (6/7)(1/2)(1/2) = 6/28.
So P(w|r,r) = 6/10: the model prefers 'working' even though two red lights are a sure sign of broken lights!
If we adjust the prior to P(b) = 1/2: P(b,r,r) = (1/2)(1)(1) = 1/2 = 4/8 and P(w,r,r) = (1/2)(1/2)(1/2) = 1/8.
Now P(b|r,r) = 4/5, the right prediction.
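A quick check of the stoplights arithmetic (plain probability bookkeeping):

    from fractions import Fraction as F

    def p_broken_given_rr(p_b):
        # NB factors: P(red|broken) = 1 per light, P(red|working) = 1/2 per light
        joint_b = p_b * 1 * 1                    # P(b, r, r)
        joint_w = (1 - p_b) * F(1, 2) * F(1, 2)  # P(w, r, r)
        return joint_b / (joint_b + joint_w)

    print(p_broken_given_rr(F(1, 7)))  # 2/5, i.e. 4/10, with the empirical prior
    print(p_broken_given_rr(F(1, 2)))  # 4/5 with the adjusted prior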
Smoothing: issues of scale. Consider a simple model of a coin with features {HEADS} and {TAILS} and weights λHEADS, λTAILS:

pHEADS = e^λHEADS / (e^λHEADS + e^λTAILS)

Really there is only one degree of freedom, λ = λHEADS − λTAILS:

pHEADS = e^λ / (e^λ + 1)

The data likelihood for h observed heads and t tails is log P(h,t|λ) = h log pHEADS + t log pTAILS.
[Plots: log P as a function of λ for heads:tails counts of 2:2, 1:3, and 4:0.]
For 4:0 data the optimal λ → ∞: the learned distribution is just as spiked as the empirical one, with no smoothing, and the optimization takes forever.
Smoothing: early stopping. If we stop the optimization after a few iterations, λ will be finite (but presumably big), and the optimization won't take forever; this was commonly used in early maxent work.
Smoothing: priors (MAP). What if we had a prior expectation that parameter values wouldn't be very large? We can balance the evidence against the prior by maximizing posterior likelihood: Posterior = Prior × Evidence.
Gaussian (quadratic) priors: the intuition is that parameters shouldn't be large; formally, each parameter is a priori distributed as a Gaussian with mean μi and variance σi²:

P(λi) = [ 1 / (σi √(2π)) ] exp( −(λi − μi)² / (2σi²) )

This penalizes parameters for drifting too far from their mean prior value, which is usually μ = 0.
[Plot: the Gaussian prior for 2σ² = 1, 2σ² = 10, and 2σ² = ∞.]
Change the objective:

log P(C,λ|D) = log P(C|D,λ) + log P(λ) = Σ(c,d)∈(C,D) log P(c|d,λ) − Σi (λi − μi)² / (2σi²) + k

Change the derivative:

∂ log P(C,λ|D) / ∂λi = actual count(fi, C) − predicted count(fi, λ) − (λi − μi)/σi²

[Plot: with a prior (2σ² = 1 or 2σ² = 10) the 4:0 coin example now has a finite optimum; 2σ² = ∞ recovers the unsmoothed case.]
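A sketch of the penalized derivative, as an adjustment applied to an unpenalized gradient (for example the gradient() sketch earlier); μ = 0 and the σ² value are illustrative:

    def add_gaussian_prior(grad, weights, sigma2=1.0, mu=0.0):
        # (actual - predicted) - (lambda_i - mu_i) / sigma_i^2
        return [g - (w - mu) / sigma2 for g, w in zip(grad, weights)]

    print(add_gaussian_prior([0.5], [2.0], sigma2=10.0))  # [0.3]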
Example: POS tagging with smoothing.

                    Unknown word accuracy   Overall accuracy
With smoothing:     88.20                   97.10
Without smoothing:  85.20                   96.54

Smoothing also lets us use sparser, more specific features, such as entire-word and tag-pair features, without their weights blowing up.
Smoothing: virtual data. For the 4:0 coin, add a virtual head and a virtual tail, turning the counts into 5:1.
This is equivalent to adding two extra data points, and is similar to add-one smoothing for generative models.
✂☎✄ ✆ ✝✟✞ ✠☎✡ ✄ ☛✟✞∈
) , ( ) , ( '
D C d c c i i i i i i
Part II: Optimization.
We want to maximize a function f(x): Rn → R.
∇f(x) is the n×1 vector of partial derivatives ∂f/∂xi.
∇²f(x) is the n×n matrix (the Hessian) of second derivatives ∂²f/∂xi∂xj.
Taylor approximations:
First order: f(x) ≈ f(x0) + ∇f(x0)ᵀ(x − x0)
Second order: f(x) ≈ f(x0) + ∇f(x0)ᵀ(x − x0) + ½ (x − x0)ᵀ ∇²f(x0) (x − x0)
Is there a unique maximum?
How do we find it efficiently?
Does f have a special form?
Convexity: f(Σi wi xi) ≥ Σi wi f(xi) whenever wi ≥ 0 and Σi wi = 1.
[Figure: a convex function vs a non-convex function.]
Convexity guarantees a single, global maximum, because any higher point is greedily reachable.
Start at some xi.
Repeatedly find a new xi+1 such that f(xi+1) ≥ f(xi).
Improve xi by choosing a search direction si and setting xi+1 = xi + t si, for a step size t along the search direction si.
A line search chooses t as the maximizer: xi+1 = xi + t* si, where t* = arg max_t f(xi + t si).
Gradient ascent takes the search direction to be the gradient itself: si = ∇f(xi).
Gradient ascent: until convergence, compute ∇f(x) and take a (line-search) step in the direction of ∇f(x).
Each iteration improves the value of f(x).
Guaranteed to find a local optimum (in theory it could find a saddle point).
Why would you ever want anything else?
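A minimal gradient-ascent sketch with a crude backtracking line search; the quadratic test function is made up for illustration:

    def gradient_ascent(f, grad, x, iters=100, t0=1.0):
        for _ in range(iters):
            g = grad(x)
            t = t0
            # Backtracking line search: shrink t until the step improves f.
            while t > 1e-12 and f([xi + t * gi for xi, gi in zip(x, g)]) <= f(x):
                t *= 0.5
            if t <= 1e-12:
                break  # no improving step found: (near-)optimal
            x = [xi + t * gi for xi, gi in zip(x, g)]
        return x

    f = lambda x: -(x[0] - 1) ** 2 - (x[1] + 2) ** 2   # maximum at (1, -2)
    grad = lambda x: [-2 * (x[0] - 1), -2 * (x[1] + 2)]
    print(gradient_ascent(f, grad, [0.0, 0.0]))  # close to [1, -2]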
What went wrong with plain gradient ascent? After a line search along si−1 = ∇f(xi−1), the new gradient is orthogonal to the old direction (otherwise the line search wasn't finished):

si−1ᵀ⋅∇f(xi) = ∇f(xi−1)ᵀ⋅∇f(xi) = 0

If we next set si = ∇f(xi), then for a small step the gradient changes as:

∇f(xi + t si) ≈ ∇f(xi) + t∇²f(xi)si = ∇f(xi) + t∇²f(xi)∇f(xi)

Multiplying by si−1ᵀ:

si−1ᵀ(∇f(xi) + t∇²f(xi)∇f(xi)) = ∇f(xi−1)ᵀ∇f(xi) + t∇f(xi−1)ᵀ∇²f(xi)∇f(xi) = 0 + t∇f(xi−1)ᵀ∇²f(xi)∇f(xi)

This is generally non-zero: the search along si ruined the optimization in previous directions.
Conjugate gradient: choose each new direction to keep the derivative in the previous direction(s) zero. We want ∇f(xi + t si) to stay orthogonal to si−1:

si−1ᵀ⋅[∇f(xi + t si)] = 0
si−1ᵀ⋅[∇f(xi) + t∇²f(xi)si] = 0
si−1ᵀ⋅∇f(xi) + si−1ᵀ⋅t∇²f(xi)si = 0
0 + si−1ᵀ⋅t∇²f(xi)si = 0

So we require si−1ᵀ∇²f(xi)si = 0; directions satisfying this are called conjugate. We build si from the current gradient ∇f(xi) and the previous direction si−1.
[Figure: si−1, si, and ∇f(xi): the new direction is the gradient, corrected so that its conjugate component along si−1 is removed.]
In practice ∇²f(xi) is huge: we can't compute or store it. But there are update formulas that need only gradients, e.g. Fletcher-Reeves:

si = ∇f(xi) + βi si−1, with βi = ∇f(xi)ᵀ∇f(xi) / ∇f(xi−1)ᵀ∇f(xi−1)
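A sketch of nonlinear conjugate gradient with the Fletcher-Reeves β, reusing the crude line-search idea above (illustrative, not the tutorial's implementation):

    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))

    def line_search(f, x, s, t0=1.0):
        t = t0
        while t > 1e-12 and f([xi + t * si for xi, si in zip(x, s)]) <= f(x):
            t *= 0.5
        return t

    def cg_ascent(f, grad, x, iters=100):
        g = grad(x)
        s = g[:]  # first direction: the plain gradient
        for _ in range(iters):
            t = line_search(f, x, s)
            if t <= 1e-12:
                break
            x = [xi + t * si for xi, si in zip(x, s)]
            g_new = grad(x)
            beta = dot(g_new, g_new) / dot(g, g)  # Fletcher-Reeves
            s = [gn + beta * si for gn, si in zip(g_new, s)]
            g = g_new
        return x

    f = lambda x: -(x[0] - 1) ** 2 - 10 * (x[1] + 2) ** 2  # elongated quadratic
    grad = lambda x: [-2 * (x[0] - 1), -20 * (x[1] + 2)]
    print(cg_ascent(f, grad, [0.0, 0.0]))  # close to [1, -2]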
Constrained optimization: find the x which maximizes f(x) subject to constraints gi(x) = 0.
We have to ensure we satisfy the constraints.
There is no guarantee that ∇f(x*) = 0 at the constrained optimum, so how do we recognize the max?
At a constrained optimum, the gradient of f must be entirely 'used up' by the constraints:

∇f(x*) = Σi λi ∇gi(x*)

where the λi are the Lagrange multipliers, one per constraint. Build both conditions into a single objective, the Lagrangian:

Λ(x,λ) = f(x) − Σi λi gi(x)

Setting ∂Λ/∂x = 0 recovers the gradient condition:

∂Λ(x,λ)/∂xj = ∂f(x)/∂xj − Σi λi ∂gi(x)/∂xj, i.e. ∇xΛ(x,λ) = ∇f(x) − Σi λi ∇gi(x)

Setting ∂Λ/∂λi = 0 recovers constraint i: gi(x) = 0.
At the solution, (x*, λ*) is a saddle point of the Lagrangian: Λ(x, λ*) is maximized in x at x*, while Λ(x*, λ) is flat in λ.
For any x which satisfies the constraints, every gi(x) = 0, so Λ(x,λ) = f(x) whatever λ is; in particular at x* the value of Λ is independent of λ, so we can still read off the constrained maximum f(x*).
If x violates some constraint, the λ terms can pull Λ down, so maximizing over x while the multipliers enforce the constraints picks out the constrained optimum.
A practical recipe adds a quadratic penalty for violating the constraints:

ΛPENALIZED(x, λ, k) = f(x) − Σi λi gi(x) − k Σi gi(x)²

Algorithm: start with λ = 0 and k = k0; repeat:
x = arg max_x Λ(x, λ*, k)
k = α k
λi = λi + k gi(x)
until converged; the result approximates x* and λ*.
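A toy run of this recipe on a one-dimensional problem (maximize f(x) = −x² subject to g(x) = x − 1 = 0); the inner arg max is done by a coarse-to-fine scan since the example is 1-D:

    def penalty_method(f, g, x, lam=0.0, k=1.0, alpha=1.5, rounds=12):
        # Penalized Lagrangian; reads the current lam and k when called.
        L = lambda y: f(y) - lam * g(y) - k * g(y) ** 2
        for _ in range(rounds):
            # Inner maximization of L over x by a coarse-to-fine 1-D scan.
            for step in (1.0, 0.1, 0.01, 0.001, 0.0001):
                improved = True
                while improved:
                    improved = False
                    for cand in (x - step, x + step):
                        if L(cand) > L(x):
                            x, improved = cand, True
            lam += k * g(x)  # multiplier update
            k *= alpha       # tighten the penalty
        return x, lam

    x, lam = penalty_method(lambda x: -x * x, lambda x: x - 1, 0.0)
    print(round(x, 3), round(lam, 2))  # x ~ 1, lam ~ -2 (so grad f = lam * grad g)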
Back to maxent: we want the distribution that maximizes the entropy H(p) = −Σx p(x) log p(x) subject to the feature constraints Σx p(x) fi(x) = Ci (and Σx p(x) = 1).
Form the Lagrangian:

Λ(p,λ) = −Σx p(x) log p(x) + Σi λi ( Σx p(x) fi(x) − Ci )
∂Λ(p,λ) / ∂p(x) = ∂/∂p(x) [ −Σx′ p(x′) log p(x′) ] + ∂/∂p(x) [ Σi λi ( Σx′ p(x′) fi(x′) − Ci ) ]
= −(1 + log p(x)) + Σi λi fi(x)

Setting this to zero: 1 + log p(x) = Σi λi fi(x), so the entropy-maximizing distribution has exponential form:

pλ(x) ∝ exp Σi λi fi(x)
Substituting this exponential form back into Λ turns the Lagrangian into a function of λ alone: up to sign conventions and constants, it is Σi λi Ci − log Σx exp Σi λi fi(x). When the targets Ci are the empirical feature counts, this is exactly the log-likelihood of the exponential model. So the maxent problem (maximize entropy subject to feature constraints) and maximum likelihood estimation of the exponential model are convex duals, and they share the same solution.
Comparison of optimization methods.
Gradient ascent uses only ∇f(x) and is simple, but slow.
Conjugate gradient also needs only ∇f(x), and usually converges much faster.
Quasi-Newton methods build an approximation to ∇²f(x) from successive gradients; limited-memory variants avoid storing the (huge) matrix explicitly.
In practice, limited-memory quasi-Newton methods have proved the most efficient way to train maxent models (Malouf 2002).
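For off-the-shelf training, the objective and gradient can be handed to a library optimizer; scipy's L-BFGS-B is a real API, while the toy objective below is made up (note scipy minimizes, so we negate):

    import numpy as np
    from scipy.optimize import minimize

    def neg_f(x):  # negative of the function we want to maximize
        return (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2

    def neg_grad(x):
        return np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])

    res = minimize(neg_f, np.zeros(2), jac=neg_grad, method="L-BFGS-B")
    print(res.x)  # close to [1, -2]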
Sequence Models: sequence level vs local level.
[Figure: an HMM over classes c1 c2 c3 and data d1 d2 d3, beside a Naïve Bayes model over d1 d2 d3; the local model at each position conditions on the word wi and its context.]
Beam inference: at each position keep the top k complete sequences; extend each sequence in each local way, and the extensions compete for the k slots at the next position.
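A greedy (beam size 1) left-to-right decoding sketch for a conditional sequence model; the toy local model below is invented for illustration:

    def greedy_decode(words, local_probs):
        # Factor P(c_1..c_n | w) as prod_i P(c_i | c_{i-1}, w, i)
        # and greedily pick the best class at each position.
        tags, prev = [], "<s>"
        for i in range(len(words)):
            p = local_probs(prev, words, i)  # dict: class -> probability
            prev = max(p, key=p.get)
            tags.append(prev)
        return tags

    def local_probs(prev, words, i):  # toy hand-set local model
        if words[i] == "to":
            return {"TO": 0.9, "NN": 0.05, "VB": 0.05}
        if prev == "TO":
            return {"VB": 0.8, "NN": 0.15, "TO": 0.05}
        return {"NN": 0.7, "VB": 0.25, "TO": 0.05}

    print(greedy_decode("I want to go".split(), local_probs))  # NN NN TO VB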
A) “thanks anyway, the transatlantic line died.”
B) “… phones with more than one line, plush robes, exotic flowers, and complimentary wine.”
[Figure: a joint HMM (Naïve-Bayes-style local models) vs a conditional CMM (maxent local models) over classes c1 c2 c3 and words w1 w2 w3; in the CMM each local model is a maxent classifier of the usual exponential form, P(ci|ci−1,d) = exp Σj λj fj(ci, ci−1, d) / Σc′ exp Σj λj fj(c′, ci−1, d).]
[Plot: performance of the joint HMM vs the conditional CMM.]
Software.
One package provides a classifier interface, general linear classifiers, maxent classifiers, and optimization routines (unconstrained minimizers and constrained penalty minimizers).
Jason Baldridge et al.: Java maxent model package.
Rob Malouf: frontend maxent package that uses the PETSc library for optimization; supports GIS, IIS, gradient ascent, CG, and a limited-memory variable-metric quasi-Newton technique.
Hugo WL ter Doest: Perl 5 package; GIS, IIS.