Maxent Models, Conditional Estimation, and Optimization
Dan Klein and Chris Manning Stanford University http://nlp.stanford.edu/
HLT-NAACL 2003 and ACL 2003 Tutorial
Without Magic: That is, With Math!

Introduction
In recent years there has been extensive use of conditional or discriminative probabilistic models in NLP, IR, and speech, because:
They give high accuracy performance.
They make it easy to incorporate lots of linguistically important features.
They allow automatic building of language-independent, retargetable NLP modules.
Joint vs Conditional Models
Joint (generative) models place probabilities over both observed data and the hidden stuff, generating the observed data from the hidden stuff: P(c,d).
All the best-known StatNLP models are generative: n-gram models, Naïve Bayes classifiers, hidden Markov models, probabilistic context-free grammars.
Discriminative (conditional) models take the data as given, and put a probability over the hidden structure given the data: P(c|d).
Bayes net notation: circles (nodes) for variables, lines (arcs) for direct dependencies; each node carries a local distribution (conditional probability table) based on its incoming arcs.
[Figure: the generative P(c,d) and discriminative P(c|d) models drawn as Bayes nets over classes c1 c2 c3 and data d1 d2 d3.]
Features
Features are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict.
Here, features are indicator functions of properties of the input and a particular class (every one we present is); the value is 0 or 1, and each feature picks out a subset of the data-class pairs.
We will freely say that a property Φ(d) is a feature of the data, and that Φ(d) ∧ c = ci is a feature of the data-class pair (c, d).
f1(c, d) ≡ [c = “NN” ∧ islower(w0) ∧ ends(w0, “d”)]
f2(c, d) ≡ [c = “NN” ∧ w−1 = “to” ∧ t−1 = “TO”]
f3(c, d) ≡ [c = “VB” ∧ islower(w0)]
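A minimal sketch of these indicator features in Python; the datum field names (word, prev_word, prev_tag) are illustrative assumptions, not the tutorial's code:

    # Indicator features over a (class, datum) pair; value is 0 or 1.
    def f1(c, d):
        return 1 if c == "NN" and d["word"].islower() and d["word"].endswith("d") else 0

    def f2(c, d):
        return 1 if c == "NN" and d["prev_word"] == "to" and d["prev_tag"] == "TO" else 0

    def f3(c, d):
        return 1 if c == "VB" and d["word"].islower() else 0

    d = {"word": "passed", "prev_word": "to", "prev_tag": "TO"}
    print(f1("NN", d), f2("NN", d), f3("VB", d))  # 1 1 1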
Feature Expectations
We will crucially make use of two expectations: actual (empirical) counts and predicted counts.
Empirical count of a feature: empirical E(fi) = Σ(c,d)∈(C,D) fi(c,d), summed over the observed data (C,D).
Model expectation of a feature: E(fi) = Σ(c,d) P(c,d) fi(c,d).
Example: text classification.
Features are a word in the document together with a class (they do feature selection to use reliable indicators).
Tests on the classic Reuters data set (and others):
Naïve Bayes: 77.0% F1
Linear regression: 86.0%
Logistic regression: 86.4%
Support vector machine: 86.5%
This work emphasizes the importance of regularization (smoothing) for successful use of discriminative methods, which was not used in most early NLP/IR work.
Example: POS tagging.
[Slides: the local context around a decision point, the features extracted from it, and the learned feature weights.]
Example: sentence boundary detection. Is a period the end of a sentence or an abbreviation?
Example: PP attachment. Features of the head noun, preposition, etc.
Example: language modeling. P(w0|w−1,…,w−n). Features are word n-gram features, and trigger features which model repetitions of the same word.
Example: parsing. Either local classifications decide parser actions, or feature counts choose a parse.
We have some data {(c, d)} and we want to place probability distributions over it.
A joint model places a probability P(c,d) over both the class and the data, and we choose weights to maximize this likelihood.
It turns out to be trivial to choose the weights: just relative frequencies.
A conditional model takes the data as given and models only the conditional probability P(c|d) of the class.
We seek to maximize conditional likelihood.
This is harder to do (as we'll see…).
But it is more closely related to classification error.
Linear classifiers: classify from feature sets {fi} to classes {c}.
Assign a weight λi to each feature fi.
For a pair (c,d), features vote with their weights: vote(c) = Σi λi fi(c,d).
Choose the class c which maximizes vote(c) (in the running POS example, VB).
There are many ways to choose these weights; a sketch of the voting scheme follows.
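A minimal sketch of this voting classifier, with the indicator features inlined and the λi values made up for illustration:

    # Linear classifier: vote(c) = sum_i lambda_i * f_i(c, d); predict the argmax class.
    features = [
        lambda c, d: 1 if c == "NN" and d["word"].islower() and d["word"].endswith("d") else 0,
        lambda c, d: 1 if c == "NN" and d["prev_word"] == "to" and d["prev_tag"] == "TO" else 0,
        lambda c, d: 1 if c == "VB" and d["word"].islower() else 0,
    ]
    weights = [0.5, 1.2, 1.8]  # illustrative lambda_i values

    def vote(c, d):
        return sum(w * f(c, d) for w, f in zip(weights, features))

    def classify(d, classes=("NN", "VB")):
        return max(classes, key=lambda c: vote(c, d))

    print(classify({"word": "passed", "prev_word": "to", "prev_tag": "TO"}))  # VB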
Exponential (log-linear, maxent, logistic, Gibbs) models: make the votes positive by exponentiating, and normalize them into a probability distribution:

P(c|d,λ) = exp Σi λi fi(c,d) / Σc′ exp Σi λi fi(c′,d)
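A sketch of this normalized exponential form (a plain softmax over the feature votes; all names illustrative):

    import math

    # P(c|d, lambda) = exp(score(c, d)) / sum_c' exp(score(c', d))
    def probs(d, classes, features, weights):
        scores = {c: sum(w * f(c, d) for w, f in zip(weights, features)) for c in classes}
        z = sum(math.exp(s) for s in scores.values())  # normalizer
        return {c: math.exp(s) / z for c, s in scores.items()}

    features = [lambda c, d: 1 if c == "NN" else 0]
    print(probs("any datum", ["NN", "VB"], features, [1.0]))  # NN gets e/(e+1), about 0.73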
The weights {λi} are the parameters of the model; learning means deciding how to weight the features, given data.
This construction gives us probability distributions over classifications, not just scores.
There are other ways of turning feature votes into classes: SVMs, boosting, even perceptrons, though these methods are not as trivial to interpret as distributions over classes.
We will also see later what maximizing the likelihood according to the exponential model has to do with entropy.
Given a model form, choose values of parameters to maximize the (conditional) likelihood of the data.
log P(C|D,λ) = Σ(c,d)∈(C,D) log P(c|d,λ)
The (log) conditional likelihood is a function of the iid data (C,D) and the parameters λ. Plugging in the model form, it is easy to compute when there aren't many classes:

log P(C|D,λ) = Σ(c,d)∈(C,D) log [ exp Σi λi fi(c,d) / Σc′ exp Σi λi fi(c′,d) ]

We can separate this into two components:

log P(C|D,λ) = Σ(c,d)∈(C,D) log exp Σi λi fi(c,d) − Σ(c,d)∈(C,D) log Σc′ exp Σi λi fi(c′,d) = N(λ) − M(λ)

The derivative is the difference between the derivatives of the numerator term N(λ) and the denominator term M(λ).
The Derivative I: Numerator

∂N(λ)/∂λi = ∂/∂λi Σ(c,d)∈(C,D) Σi′ λi′ fi′(c,d) = Σ(c,d)∈(C,D) fi(c,d)

Derivative of the numerator is the empirical count(fi, C).
The Derivative II: Denominator

∂M(λ)/∂λi = ∂/∂λi Σ(c,d)∈(C,D) log Σc′ exp Σi′ λi′ fi′(c′,d)
= Σ(c,d)∈(C,D) [ 1 / Σc″ exp Σi′ λi′ fi′(c″,d) ] Σc′ exp(Σi′ λi′ fi′(c′,d)) fi(c′,d)
= Σ(c,d)∈(C,D) Σc′ P(c′|d,λ) fi(c′,d)

Derivative of the denominator is the predicted count(fi, λ).
This is a perfect situation for general optimization (Part II).
The Derivative III: summary.

log P(C|D,λ) = Σ(c,d)∈(C,D) log [ exp Σi λi fi(c,d) / Σc′ exp Σi λi fi(c′,d) ]

∂ log P(C|D,λ) / ∂λi = actual count(fi, C) − predicted count(fi, λ)

At the optimum, each feature's predicted count equals its empirical count.
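A sketch of this gradient in code: for every feature, the empirical count over the data minus the expected count under the current model. probs() is repeated from the earlier sketch so this block runs on its own; all names are illustrative.

    import math

    def probs(d, classes, features, weights):
        scores = {c: sum(w * f(c, d) for w, f in zip(weights, features)) for c in classes}
        z = sum(math.exp(s) for s in scores.values())
        return {c: math.exp(s) / z for c, s in scores.items()}

    def gradient(data, classes, features, weights):
        # d log P(C|D, lambda) / d lambda_i = actual count(f_i) - predicted count(f_i)
        grad = [0.0] * len(features)
        for c, d in data:
            p = probs(d, classes, features, weights)
            for i, f in enumerate(features):
                grad[i] += f(c, d)                                  # empirical count
                grad[i] -= sum(p[c2] * f(c2, d) for c2 in classes)  # predicted count
        return grad

    features = [lambda c, d: 1 if c == "NN" and d.endswith("d") else 0]
    data = [("NN", "passed"), ("VB", "pass")]
    print(gradient(data, ["NN", "VB"], features, [0.0]))  # [0.5]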
But first… what has all this got to do with maximum entropy models?
Lots of distributions out there, most of them very spiked, specific, overfit.
We want a distribution which is uniform except in specific ways we require.
Uniformity means high entropy: we can search for distributions which have the properties we desire, but also have high entropy.
Entropy: the uncertainty of a distribution.
Quantifying uncertainty ('surprise'): an event x with probability px has surprise log(1/px).
Entropy is the expected surprise: H(p) = Ep[log2(1/px)] = −Σx px log2 px.
[Plot: entropy H(pHEADS) of a coin flip with pHEADS + pTAILS = 1, e.g. pHEADS = 0.3; entropy is maximal for a fair coin. The function −p log p peaks at p = 1/e.]
Maximum entropy principle: choose p* = arg max H(p) subject to the feature constraints Ep[fi] = empirical E[fi].
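A quick numeric check of the coin entropy (standard arithmetic, not code from the tutorial):

    import math

    def H(p):
        # Entropy in bits of a coin with P(HEADS) = p: -sum_x p_x log2 p_x
        return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

    print(H(0.5))  # 1.0: a fair coin is maximally uncertain
    print(H(0.3))  # ~0.881: less entropy at pHEADS = 0.3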
Example: a maxent distribution over the tags {NN, NNS, NNP, NNPS, VBZ, VBD}.
With only the sum-to-one constraint, maximum entropy gives the uniform distribution: 6/36 each.
Add the feature fN = [tag is one of NN, NNS, NNP, NNPS] with empirical expectation 32/36: the maxent solution is uniform within groups, 8/36 each for the four N tags and 2/36 each for VBZ and VBD.
Add fP = [tag is NNP or NNPS] with empirical expectation 24/36: the solution becomes 12/36 each for NNP and NNPS, 4/36 each for NN and NNS, and 2/36 each for VBZ and VBD.
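A quick sanity check of these numbers, assuming the constraints are as reconstructed above: the stated distribution meets both feature expectations and has higher entropy than an alternative distribution that also meets them.

    import math

    def H(p):
        return -sum(q * math.log(q) for q in p if q > 0)

    # Order: NN, NNS, NNP, NNPS, VBZ, VBD
    maxent = [4/36, 4/36, 12/36, 12/36, 2/36, 2/36]
    other = [6/36, 2/36, 14/36, 10/36, 3/36, 1/36]  # also has fN = 32/36, fP = 24/36
    for p in (maxent, other):
        print(sum(p[:4]), p[2] + p[3], H(p))  # fN, fP, entropy (maxent's H is larger)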
[Tables: the feature weights λ corresponding to these constrained distributions, and the local-context feature weights from the POS tagging example.]

Naïve Bayes as a maxent model. Naïve Bayes is a model of a class c generating independent evidence features φ1, φ2, φ3:
P(c|d,λ) = P(c) Πi P(φi|c) / Σc′ P(c′) Πi P(φi|c′)
= exp[ log P(c) + Σi log P(φi|c) ] / Σc′ exp[ log P(c′) + Σi log P(φi|c′) ]
= exp Σi λic fic(d,c) / Σc′ exp Σi λic′ fic′(d,c′)

So Naïve Bayes is just an exponential (maxent) model with particular, locally estimated weights.
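A numeric illustration of this equivalence, with made-up probabilities: using log P(c) and log P(φi|c) as the maxent weights reproduces the Naïve Bayes posterior exactly.

    import math

    priors = {"c1": 0.6, "c2": 0.4}
    lik = {"c1": 0.9, "c2": 0.2}  # P(phi = 1 | c) for a single observed feature phi

    def nb_posterior(c):
        z = sum(priors[c2] * lik[c2] for c2 in priors)
        return priors[c] * lik[c] / z

    def maxent_posterior(c):
        # Weights: lambda_c = log P(c) and lambda_{phi,c} = log P(phi|c)
        score = {c2: math.log(priors[c2]) + math.log(lik[c2]) for c2 in priors}
        z = sum(math.exp(s) for s in score.values())
        return math.exp(score[c]) / z

    print(nb_posterior("c1"), maxent_posterior("c1"))  # identical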
Comparison to Naïve Bayes.
Naïve Bayes: features assumed to supply independent evidence; feature weights can be set independently; features must be of the conjunctive Φ(d) ∧ c = ci form; trained to maximize the joint likelihood of data and classes.
Maxent: feature weights take feature dependence into account; feature weights must be jointly estimated; features need not be of the conjunctive form (but we'll assume they are); trained to maximize the conditional likelihood of classes.

Example: two sensors reporting on the weather (Raining vs Sunny).
Reality: P(+,+,r) = 3/8, P(−,−,r) = 1/8, P(+,+,s) = 1/8, P(−,−,s) = 3/8; the two sensors always agree.
[Figure: Raining? with sensors M1, M2: NB model vs reality.]
NB multi-counts the correlated evidence. Maxent behavior: take a model over (M1,…,Mn,R) with features:
fri: Mi = +, R = r, with weight λri
fsi: Mi = +, R = s, with weight λsi
exp(λri − λsi) is the factor analogous to P(+|r)/P(+|s)…
… but instead of being 3, it will be 3^(1/n)…
… because if it were 3, E[fri] would be far higher than the target of 3/8!
Example: stoplights.
Reality: lights working: P(g,r,w) = 3/7 and P(r,g,w) = 3/7; lights broken: P(r,r,b) = 1/7.
[Figure: Working? with the NS and EW lights: NB model vs reality.]
NB model: P(b,r,r) = (1/7)(1)(1) = 1/7 = 4/28, and P(w,r,r) = (6/7)(1/2)(1/2) = 6/28.
So P(w|r,r) = 6/10: the model prefers 'working' even though two red lights are a sure sign of broken lights!
If we adjust the prior to P(b) = 1/2: P(b,r,r) = (1/2)(1)(1) = 1/2 = 4/8 and P(w,r,r) = (1/2)(1/2)(1/2) = 1/8.
Now P(b|r,r) = 4/5, the right prediction.
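A quick check of the stoplights arithmetic (plain probability bookkeeping):

    from fractions import Fraction as F

    def p_broken_given_rr(p_b):
        # NB factors: P(red|broken) = 1 per light, P(red|working) = 1/2 per light
        joint_b = p_b * 1 * 1                    # P(b, r, r)
        joint_w = (1 - p_b) * F(1, 2) * F(1, 2)  # P(w, r, r)
        return joint_b / (joint_b + joint_w)

    print(p_broken_given_rr(F(1, 7)))  # 2/5, i.e. 4/10, with the empirical prior
    print(p_broken_given_rr(F(1, 2)))  # 4/5 with the adjusted prior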
Smoothing: issues of scale. Consider a simple model of a coin with features {HEADS} and {TAILS} and weights λHEADS, λTAILS:

pHEADS = e^λHEADS / (e^λHEADS + e^λTAILS)

Really there is only one degree of freedom, λ = λHEADS − λTAILS:

pHEADS = e^λ / (e^λ + 1)

The data likelihood for h observed heads and t tails is log P(h,t|λ) = h log pHEADS + t log pTAILS.
[Plots: log P as a function of λ for heads:tails counts of 2:2, 1:3, and 4:0.]
For 4:0 data the optimal λ → ∞: the learned distribution is just as spiked as the empirical one, with no smoothing, and the optimization takes forever.
Smoothing: early stopping. If we stop the optimization after a few iterations, λ will be finite (but presumably big), and the optimization won't take forever; this was commonly used in early maxent work.
Smoothing: priors (MAP). What if we had a prior expectation that parameter values wouldn't be very large? We can balance the evidence against the prior by maximizing posterior likelihood: Posterior = Prior × Evidence.
Gaussian (quadratic) priors: the intuition is that parameters shouldn't be large; formally, each parameter is a priori distributed as a Gaussian with mean μi and variance σi²:

P(λi) = [ 1 / (σi √(2π)) ] exp( −(λi − μi)² / (2σi²) )

This penalizes parameters for drifting too far from their mean prior value, which is usually μ = 0.
[Plot: the Gaussian prior for 2σ² = 1, 2σ² = 10, and 2σ² = ∞.]
Change the objective:

log P(C,λ|D) = log P(C|D,λ) + log P(λ) = Σ(c,d)∈(C,D) log P(c|d,λ) − Σi (λi − μi)² / (2σi²) + k

Change the derivative:

∂ log P(C,λ|D) / ∂λi = actual count(fi, C) − predicted count(fi, λ) − (λi − μi)/σi²

[Plot: with a prior (2σ² = 1 or 2σ² = 10) the 4:0 coin example now has a finite optimum; 2σ² = ∞ recovers the unsmoothed case.]
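A sketch of the penalized derivative, as an adjustment applied to an unpenalized gradient (for example the gradient() sketch earlier); μ = 0 and the σ² value are illustrative:

    def add_gaussian_prior(grad, weights, sigma2=1.0, mu=0.0):
        # (actual - predicted) - (lambda_i - mu_i) / sigma_i^2
        return [g - (w - mu) / sigma2 for g, w in zip(grad, weights)]

    print(add_gaussian_prior([0.5], [2.0], sigma2=10.0))  # [0.3]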
Example: POS tagging with smoothing.

                    Unknown word accuracy   Overall accuracy
With smoothing:     88.20                   97.10
Without smoothing:  85.20                   96.54

Smoothing also lets us use sparser, more specific features, such as entire-word and tag-pair features, without their weights blowing up.
Smoothing: virtual data. For the 4:0 coin, add a virtual head and a virtual tail, turning the counts into 5:1.
This is equivalent to adding two extra data points, and is similar to add-one smoothing for generative models.
✂☎✄ ✆ ✝✟✞ ✠☎✡ ✄ ☛✟✞∈
) , ( ) , ( '
D C d c c i i i i i i
Part II: Optimization.
We want to maximize a function f(x): Rn → R.
∇f(x) is the n×1 vector of partial derivatives ∂f/∂xi.
∇²f(x) is the n×n matrix (the Hessian) of second derivatives ∂²f/∂xi∂xj.
Taylor approximations:
First order: f(x) ≈ f(x0) + ∇f(x0)ᵀ(x − x0)
Second order: f(x) ≈ f(x0) + ∇f(x0)ᵀ(x − x0) + ½ (x − x0)ᵀ ∇²f(x0) (x − x0)
Is there a unique maximum?
How do we find it efficiently?
Does f have a special form?
Convexity: f(Σi wi xi) ≥ Σi wi f(xi) whenever wi ≥ 0 and Σi wi = 1.
[Figure: a convex function vs a non-convex function.]
Convexity guarantees a single, global maximum, because any higher point is greedily reachable.
Start at some xi.
Repeatedly find a new xi+1 such that f(xi+1) ≥ f(xi).
Improve xi by choosing a search direction si and setting xi+1 = xi + t si, for a step size t along the search direction si.
A line search chooses t as the maximizer: xi+1 = xi + t* si, where t* = arg max_t f(xi + t si).
Gradient ascent takes the search direction to be the gradient itself: si = ∇f(xi).
Gradient ascent: until convergence, compute ∇f(x) and take a (line-search) step in the direction of ∇f(x).
Each iteration improves the value of f(x).
Guaranteed to find a local optimum (in theory it could find a saddle point).
Why would you ever want anything else?
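A minimal gradient-ascent sketch with a crude backtracking line search; the quadratic test function is made up for illustration:

    def gradient_ascent(f, grad, x, iters=100, t0=1.0):
        for _ in range(iters):
            g = grad(x)
            t = t0
            # Backtracking line search: shrink t until the step improves f.
            while t > 1e-12 and f([xi + t * gi for xi, gi in zip(x, g)]) <= f(x):
                t *= 0.5
            if t <= 1e-12:
                break  # no improving step found: (near-)optimal
            x = [xi + t * gi for xi, gi in zip(x, g)]
        return x

    f = lambda x: -(x[0] - 1) ** 2 - (x[1] + 2) ** 2   # maximum at (1, -2)
    grad = lambda x: [-2 * (x[0] - 1), -2 * (x[1] + 2)]
    print(gradient_ascent(f, grad, [0.0, 0.0]))  # close to [1, -2]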
What went wrong with plain gradient ascent? After a line search along si−1 = ∇f(xi−1), the new gradient is orthogonal to the old direction (otherwise the line search wasn't finished):

si−1ᵀ⋅∇f(xi) = ∇f(xi−1)ᵀ⋅∇f(xi) = 0

If we next set si = ∇f(xi), then for a small step the gradient changes as:

∇f(xi + t si) ≈ ∇f(xi) + t∇²f(xi)si = ∇f(xi) + t∇²f(xi)∇f(xi)

Multiplying by si−1ᵀ:

si−1ᵀ(∇f(xi) + t∇²f(xi)∇f(xi)) = ∇f(xi−1)ᵀ∇f(xi) + t∇f(xi−1)ᵀ∇²f(xi)∇f(xi) = 0 + t∇f(xi−1)ᵀ∇²f(xi)∇f(xi)

This is generally non-zero: the search along si ruined the optimization in previous directions.
Conjugate gradient: choose each new direction to keep the derivative in the previous direction(s) zero. We want ∇f(xi + t si) to stay orthogonal to si−1:

si−1ᵀ⋅[∇f(xi + t si)] = 0
si−1ᵀ⋅[∇f(xi) + t∇²f(xi)si] = 0
si−1ᵀ⋅∇f(xi) + si−1ᵀ⋅t∇²f(xi)si = 0
0 + si−1ᵀ⋅t∇²f(xi)si = 0

So we require si−1ᵀ∇²f(xi)si = 0; directions satisfying this are called conjugate. We build si from the current gradient ∇f(xi) and the previous direction si−1.
[Figure: si−1, si, and ∇f(xi): the new direction is the gradient, corrected so that its conjugate component along si−1 is removed.]
In practice ∇²f(xi) is huge: we can't compute or store it. But there are update formulas that need only gradients, e.g. Fletcher-Reeves:

si = ∇f(xi) + βi si−1, with βi = ∇f(xi)ᵀ∇f(xi) / ∇f(xi−1)ᵀ∇f(xi−1)
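A sketch of nonlinear conjugate gradient with the Fletcher-Reeves β, reusing the crude line-search idea above (illustrative, not the tutorial's implementation):

    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))

    def line_search(f, x, s, t0=1.0):
        t = t0
        while t > 1e-12 and f([xi + t * si for xi, si in zip(x, s)]) <= f(x):
            t *= 0.5
        return t

    def cg_ascent(f, grad, x, iters=100):
        g = grad(x)
        s = g[:]  # first direction: the plain gradient
        for _ in range(iters):
            t = line_search(f, x, s)
            if t <= 1e-12:
                break
            x = [xi + t * si for xi, si in zip(x, s)]
            g_new = grad(x)
            beta = dot(g_new, g_new) / dot(g, g)  # Fletcher-Reeves
            s = [gn + beta * si for gn, si in zip(g_new, s)]
            g = g_new
        return x

    f = lambda x: -(x[0] - 1) ** 2 - 10 * (x[1] + 2) ** 2  # elongated quadratic
    grad = lambda x: [-2 * (x[0] - 1), -20 * (x[1] + 2)]
    print(cg_ascent(f, grad, [0.0, 0.0]))  # close to [1, -2]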
Constrained optimization: find the x which maximizes f(x) subject to constraints gi(x) = 0.
We have to ensure we satisfy the constraints.
There is no guarantee that ∇f(x*) = 0 at the constrained optimum, so how do we recognize the max?
At a constrained optimum, the gradient of f must be entirely 'used up' by the constraints:

∇f(x*) = Σi λi ∇gi(x*)

where the λi are the Lagrange multipliers, one per constraint. Build both conditions into a single objective, the Lagrangian:

Λ(x,λ) = f(x) − Σi λi gi(x)

Setting ∂Λ/∂x = 0 recovers the gradient condition:

∂Λ(x,λ)/∂xj = ∂f(x)/∂xj − Σi λi ∂gi(x)/∂xj, i.e. ∇xΛ(x,λ) = ∇f(x) − Σi λi ∇gi(x)

Setting ∂Λ/∂λi = 0 recovers constraint i: gi(x) = 0.
At the solution, (x*, λ*) is a saddle point of the Lagrangian: Λ(x, λ*) is maximized in x at x*, while Λ(x*, λ) is flat in λ.
For any x which satisfies the constraints, every gi(x) = 0, so Λ(x,λ) = f(x) whatever λ is; in particular at x* the value of Λ is independent of λ, so we can still read off the constrained maximum f(x*).
If x violates some constraint, the λ terms can pull Λ down, so maximizing over x while the multipliers enforce the constraints picks out the constrained optimum.
A practical recipe adds a quadratic penalty for violating the constraints:

ΛPENALIZED(x, λ, k) = f(x) − Σi λi gi(x) − k Σi gi(x)²

Algorithm: start with λ = 0 and k = k0; repeat:
x = arg max_x Λ(x, λ*, k)
k = α k
λi = λi + k gi(x)
until converged; the result approximates x* and λ*.
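A toy run of this recipe on a one-dimensional problem (maximize f(x) = −x² subject to g(x) = x − 1 = 0); the inner arg max is done by a coarse-to-fine scan since the example is 1-D:

    def penalty_method(f, g, x, lam=0.0, k=1.0, alpha=1.5, rounds=12):
        # Penalized Lagrangian; reads the current lam and k when called.
        L = lambda y: f(y) - lam * g(y) - k * g(y) ** 2
        for _ in range(rounds):
            # Inner maximization of L over x by a coarse-to-fine 1-D scan.
            for step in (1.0, 0.1, 0.01, 0.001, 0.0001):
                improved = True
                while improved:
                    improved = False
                    for cand in (x - step, x + step):
                        if L(cand) > L(x):
                            x, improved = cand, True
            lam += k * g(x)  # multiplier update
            k *= alpha       # tighten the penalty
        return x, lam

    x, lam = penalty_method(lambda x: -x * x, lambda x: x - 1, 0.0)
    print(round(x, 3), round(lam, 2))  # x ~ 1, lam ~ -2 (so grad f = lam * grad g)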
Back to maxent: we want the distribution that maximizes the entropy H(p) = −Σx p(x) log p(x) subject to the feature constraints Σx p(x) fi(x) = Ci (and Σx p(x) = 1).
Form the Lagrangian:

Λ(p,λ) = −Σx p(x) log p(x) + Σi λi ( Σx p(x) fi(x) − Ci )
∂Λ(p,λ) / ∂p(x) = ∂/∂p(x) [ −Σx′ p(x′) log p(x′) ] + ∂/∂p(x) [ Σi λi ( Σx′ p(x′) fi(x′) − Ci ) ]
= −(1 + log p(x)) + Σi λi fi(x)

Setting this to zero: 1 + log p(x) = Σi λi fi(x), so the entropy-maximizing distribution has exponential form:

pλ(x) ∝ exp Σi λi fi(x)
Substituting this exponential form back into Λ turns the Lagrangian into a function of λ alone: up to sign conventions and constants, it is Σi λi Ci − log Σx exp Σi λi fi(x). When the targets Ci are the empirical feature counts, this is exactly the log-likelihood of the exponential model. So the maxent problem (maximize entropy subject to feature constraints) and maximum likelihood estimation of the exponential model are convex duals, and they share the same solution.
Comparison of optimization methods.
Gradient ascent uses only ∇f(x) and is simple, but slow.
Conjugate gradient also needs only ∇f(x), and usually converges much faster.
Quasi-Newton methods build an approximation to ∇²f(x) from successive gradients; limited-memory variants avoid storing the (huge) matrix explicitly.
In practice, limited-memory quasi-Newton methods have proved the most efficient way to train maxent models (Malouf 2002).
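For off-the-shelf training, the objective and gradient can be handed to a library optimizer; scipy's L-BFGS-B is a real API, while the toy objective below is made up (note scipy minimizes, so we negate):

    import numpy as np
    from scipy.optimize import minimize

    def neg_f(x):  # negative of the function we want to maximize
        return (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2

    def neg_grad(x):
        return np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])

    res = minimize(neg_f, np.zeros(2), jac=neg_grad, method="L-BFGS-B")
    print(res.x)  # close to [1, -2]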
Sequence Models: sequence level vs local level.
[Figure: an HMM over classes c1 c2 c3 and data d1 d2 d3, beside a Naïve Bayes model over d1 d2 d3; the local model at each position conditions on the word wi and its context.]
Beam inference: at each position keep the top k complete sequences; extend each sequence in each local way, and the extensions compete for the k slots at the next position.
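A greedy (beam size 1) left-to-right decoding sketch for a conditional sequence model; the toy local model below is invented for illustration:

    def greedy_decode(words, local_probs):
        # Factor P(c_1..c_n | w) as prod_i P(c_i | c_{i-1}, w, i)
        # and greedily pick the best class at each position.
        tags, prev = [], "<s>"
        for i in range(len(words)):
            p = local_probs(prev, words, i)  # dict: class -> probability
            prev = max(p, key=p.get)
            tags.append(prev)
        return tags

    def local_probs(prev, words, i):  # toy hand-set local model
        if words[i] == "to":
            return {"TO": 0.9, "NN": 0.05, "VB": 0.05}
        if prev == "TO":
            return {"VB": 0.8, "NN": 0.15, "TO": 0.05}
        return {"NN": 0.7, "VB": 0.25, "TO": 0.05}

    print(greedy_decode("I want to go".split(), local_probs))  # NN NN TO VB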
A) “thanks anyway, the transatlantic line died.”
B) “… phones with more than one line, plush robes, exotic flowers, and complimentary wine.”
[Figure: a joint HMM (Naïve-Bayes-style local models) vs a conditional CMM (maxent local models) over classes c1 c2 c3 and words w1 w2 w3; in the CMM each local model is a maxent classifier of the usual exponential form, P(ci|ci−1,d) = exp Σj λj fj(ci, ci−1, d) / Σc′ exp Σj λj fj(c′, ci−1, d).]
[Plot: performance of the joint HMM vs the conditional CMM.]
Software.
One package provides a classifier interface, general linear classifiers, maxent classifiers, and optimization routines (unconstrained minimizers and constrained penalty minimizers).
Jason Baldridge et al.: Java maxent model package.
Rob Malouf: frontend maxent package that uses the PETSc library for optimization; supports GIS, IIS, gradient ascent, CG, and a limited-memory variable-metric quasi-Newton technique.
Hugo WL ter Doest: Perl 5 package; GIS, IIS.