[PPT] - Structure and Support Vector Machines SPFLODD October PowerPoint Presentation

SLIDE 1

Structure and Support Vector Machines

SPFLODD October 31, 2013

SLIDE 2

Outline

SVMs for structured outputs

– Declarative view
– Procedural view

SLIDE 3

Warning: Math Ahead

SLIDE 4

Notation for Linear Models

Training data: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$
Testing data: $\{(x_{N+1}, y_{N+1}), \ldots, (x_{N+N'}, y_{N+N'})\}$
Feature function: $g$
Weights: $w$
Decoding: $\mathrm{decode}(w, x) = \arg\max_{y} w^\top g(x, y)$
Learning: $\mathrm{learn}\left(\{(x_i, y_i)\}_{i=1}^{N}\right) = \arg\max_{w} \Phi\left(w, \{(x_i, y_i)\}_{i=1}^{N}\right)$
Evaluation: $\frac{1}{N'} \sum_{i=1}^{N'} \mathrm{cost}\left(\mathrm{decode}\left(\mathrm{learn}\left(\{(x_j, y_j)\}_{j=1}^{N}\right), x_{N+i}\right), y_{N+i}\right)$
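To make the notation concrete, here is a minimal Python sketch of decoding and evaluation for the special case where the output space is small enough to enumerate. The names `OUTPUTS`, `g`, `zero_one_cost`, and the toy weights are assumptions made for illustration, not part of the slides; a structured problem would replace the brute-force `max` with a dynamic program.

```python
import numpy as np

OUTPUTS = [0, 1, 2]  # hypothetical label set, small enough to enumerate

def g(x, y):
    """Toy feature function: copy x into the block of features owned by label y."""
    feats = np.zeros(len(x) * len(OUTPUTS))
    feats[len(x) * y: len(x) * (y + 1)] = x
    return feats

def zero_one_cost(y_pred, y_gold):
    return 0.0 if y_pred == y_gold else 1.0

def decode(w, x):
    """arg max over y of the linear score w . g(x, y)."""
    return max(OUTPUTS, key=lambda y: float(np.dot(w, g(x, y))))

def evaluate(w, test_data, cost):
    """Average cost of the decoded outputs over held-out (x, y) pairs."""
    return sum(cost(decode(w, x), y) for x, y in test_data) / len(test_data)

# Hand-picked weights standing in for the output of learn(...).
w = np.array([1.0, 0.0, 0.0, 1.0, -1.0, -1.0])
test_data = [
    (np.array([2.0, 1.0]), 0),
    (np.array([0.0, 3.0]), 1),
    (np.array([1.0, 1.0]), 2),
]
avg_cost = evaluate(w, test_data, zero_one_cost)
print(avg_cost)
```

With these toy weights the first two test points are decoded correctly and the third is not, so the average 0-1 cost is one third.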

SLIDE 5

The Ideal Loss Function

Convex
Continuous
Cost-aware

SLIDE 6

Cost and Margin

The “margin” is an important concept when we take the linear models point of view.

– A “large margin” means that the correct output is well-separated from the incorrect outputs.

Neither log loss nor “perceptron loss” takes into account the cost function, though.

– In other words, some incorrect outputs are worse than others.

SLIDE 7

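Multiclass SVM (Crammer and Singer, 2001)

$$\max_{w} \gamma \quad \text{s.t.} \quad \|w\| \le 1, \quad \forall i, \forall y: \; w^\top g(x_i, y_i) - w^\top g(x_i, y) \ge \begin{cases} \gamma & \text{if } y \ne y_i \\ 0 & \text{otherwise} \end{cases}$$

The above can be understood as a 0-1 cost; let’s generalize a bit:

$$\max_{w} \gamma \quad \text{s.t.} \quad \|w\| \le 1, \quad \forall i, \forall y: \; w^\top g(x_i, y_i) - w^\top g(x_i, y) \ge \gamma \, \mathrm{cost}(y, y_i)$$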

SLIDE 8

Max-Margin Markov Networks

Starting point: multiclass SVM (Crammer and Singer, 2001)

$$\max_{w} \gamma \quad \text{s.t.} \quad \|w\| \le 1, \quad \forall i, \forall y: \; w^\top g(x_i, y_i) - w^\top g(x_i, y) \ge \gamma \, \mathrm{cost}(y, y_i)$$

SLIDE 9

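Max-Margin Markov Networks

Standard transformation to get rid of explicit mention of γ, plus slack variables in case the constraints cannot be met:

$$\min_{w} \frac{C}{2} \|w\|_2^2 + \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \forall i, \forall y: \; w^\top g(x_i, y_i) - w^\top g(x_i, y) \ge \mathrm{cost}(y, y_i) - \xi_i$$

Notice:

$$\forall i, \forall y: \; \xi_i \ge -w^\top g(x_i, y_i) + w^\top g(x_i, y) + \mathrm{cost}(y, y_i)$$

$$\forall i: \; \xi_i \ge \max_{y} \left[ -w^\top g(x_i, y_i) + w^\top g(x_i, y) + \mathrm{cost}(y, y_i) \right]$$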
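The “standard transformation” can be spelled out in a few lines. The sketch below is the usual SVM rescaling argument rather than a derivation taken from the slides; the shorthand $\Delta_i(y)$ and the auxiliary vector $u$ are introduced here only for brevity.

```latex
% Shorthand (introduced here): \Delta_i(y) = g(x_i, y_i) - g(x_i, y)
\begin{align*}
&\max_{w,\gamma}\ \gamma
  \quad \text{s.t.}\ \|w\| \le 1,\ \ w^\top \Delta_i(y) \ge \gamma\,\mathrm{cost}(y, y_i) \\
&\text{substitute } u = w/\gamma \ (\gamma > 0):\quad
  u^\top \Delta_i(y) \ge \mathrm{cost}(y, y_i),\qquad \|u\| \le 1/\gamma \\
&\text{so maximizing } \gamma \text{ amounts to}\quad
  \min_{u}\ \tfrac{1}{2}\|u\|_2^2
  \quad \text{s.t.}\ u^\top \Delta_i(y) \ge \mathrm{cost}(y, y_i) \\
&\text{add slack } \xi_i \text{ and a trade-off constant } C:\quad
  \min_{u,\xi}\ \tfrac{C}{2}\|u\|_2^2 + \sum_{i=1}^{N} \xi_i
  \quad \text{s.t.}\ u^\top \Delta_i(y) \ge \mathrm{cost}(y, y_i) - \xi_i
\end{align*}
```

Renaming $u$ back to $w$ gives the slack-variable problem. Taking $y = y_i$ in the constraint yields $\xi_i \ge 0$ whenever $\mathrm{cost}(y_i, y_i) = 0$, so no separate nonnegativity constraint is needed.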

SLIDE 10

Max-Margin Markov Networks

Having solved for the slack variables, we can plug in; we now have an unconstrained problem:

$$\min_{w} \frac{C}{2} \|w\|_2^2 + \sum_{i=1}^{N} \left[ -w^\top g(x_i, y_i) + \max_{y} \left( w^\top g(x_i, y) + \mathrm{cost}(y, y_i) \right) \right]$$

Ratliff, Bagnell, and Zinkevich (2007): subgradient descent (or a stochastic version) is a much, much simpler approach to optimizing this function.

– And more perceptron-like!

Subgradient of one example's loss term with respect to $w_j$:

$$-g_j(x, y) + g_j(x, \mathrm{cost\_augmented\_decode}(w, x, y))$$
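As a rough illustration of the stochastic subgradient approach, the sketch below trains on a three-class toy problem with an enumerable output space. Everything named here (`OUTPUTS`, `g`, `zero_one_cost`, the step size `eta`, the number of epochs, and the choice to split the regularizer evenly across examples) is an assumption made for the example, not the procedure of Ratliff et al.

```python
import numpy as np

OUTPUTS = [0, 1, 2]  # hypothetical label set

def g(x, y):
    """Toy feature function: copy x into the block of features owned by label y."""
    feats = np.zeros(len(x) * len(OUTPUTS))
    feats[len(x) * y: len(x) * (y + 1)] = x
    return feats

def zero_one_cost(y_pred, y_gold):
    return 0.0 if y_pred == y_gold else 1.0

def decode(w, x):
    return max(OUTPUTS, key=lambda y: float(np.dot(w, g(x, y))))

def cost_augmented_decode(w, x, y_gold, cost):
    return max(OUTPUTS, key=lambda y: float(np.dot(w, g(x, y))) + cost(y, y_gold))

def objective(w, data, cost, C):
    """(C/2)||w||^2 plus the summed structured hinge losses."""
    total = 0.5 * C * float(np.dot(w, w))
    for x, y in data:
        y_hat = cost_augmented_decode(w, x, y, cost)
        total += float(np.dot(w, g(x, y_hat))) + cost(y_hat, y) - float(np.dot(w, g(x, y)))
    return total

def train_ssgd(data, cost, C=0.1, eta=0.1, epochs=20, seed=0):
    """Stochastic subgradient descent on the unconstrained objective."""
    rng = np.random.default_rng(seed)
    n = len(data)
    w = np.zeros(len(data[0][0]) * len(OUTPUTS))
    for _ in range(epochs):
        for i in rng.permutation(n):
            x, y = data[i]
            y_hat = cost_augmented_decode(w, x, y, cost)
            # Per-example subgradient: this example's share of the regularizer
            # plus -g(x, y) + g(x, y_hat).
            w -= eta * ((C / n) * w - g(x, y) + g(x, y_hat))
    return w

data = [
    (np.array([1.0, 0.0]), 0),
    (np.array([0.0, 1.0]), 1),
    (np.array([-1.0, -1.0]), 2),
]
w = train_ssgd(data, zero_one_cost)
print([decode(w, x) for x, _ in data], objective(w, data, zero_one_cost, 0.1))
```

When the cost-augmented decoder returns the gold output, the feature difference vanishes and only the regularizer moves the weights, which is where the perceptron-like flavor comes from.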

SLIDE 11

Structured Hinge Loss

Small change to the perceptron loss:

$$L(w, x, y) = -w^\top g(x, y) + \max_{y'} \left( w^\top g(x, y') + \mathrm{cost}(y', y) \right)$$

Resulting subgradient (with respect to $w_j$):

$$-g_j(x, y) + g_j(x, \mathrm{cost\_augmented\_decode}(w, x, y))$$

– Rather than merely decoding, find a candidate y’ that is both high-scoring and dangerous.
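A small numeric example may help show what “high-scoring and dangerous” means. In this sketch the label set, the feature function `g`, and the ordinal `graded_cost` are hypothetical; the point is only that plain decoding returns the gold label while cost-augmented decoding prefers a lower-scoring but more costly one.

```python
import numpy as np

OUTPUTS = [0, 1, 2]  # hypothetical label set

def g(x, y):
    """Toy feature function: copy x into the block of features owned by label y."""
    feats = np.zeros(len(x) * len(OUTPUTS))
    feats[len(x) * y: len(x) * (y + 1)] = x
    return feats

def graded_cost(y_pred, y_gold):
    """Hypothetical cost: labels are ordered, and far-off labels are worse."""
    return float(abs(y_pred - y_gold))

def structured_hinge(w, x, y_gold, cost):
    """Return (loss, subgradient, y_hat) for the margin-rescaled structured hinge."""
    def augmented_score(y):
        return float(np.dot(w, g(x, y))) + cost(y, y_gold)
    y_hat = max(OUTPUTS, key=augmented_score)  # cost-augmented decode
    loss = augmented_score(y_hat) - float(np.dot(w, g(x, y_gold)))
    subgrad = g(x, y_hat) - g(x, y_gold)
    return loss, subgrad, y_hat

w = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
x = np.array([1.0, 0.5])
plain = max(OUTPUTS, key=lambda y: float(np.dot(w, g(x, y))))
loss, subgrad, y_hat = structured_hinge(w, x, 0, graded_cost)
print(plain, y_hat, loss, subgrad)
```

Here the model scores are 1.0, 0.5, and 0.0, so plain decoding picks label 0, but adding costs 0, 1, and 2 makes label 2 the maximizer and the loss is 1.0 even though the prediction is correct.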

SLIDE 12

Structured Hinge

Three different lines of work all arrived at this idea, or something very close.

– Max-margin Markov networks (Taskar, Guestrin, and Koller, 2003)
– Structural support vector machines (Tsochantaridis, Joachims, Hofmann, and Altun, 2005)
– Online passive-aggressive algorithms (Crammer, Dekel, Keshet, Shalev-Shwartz, and Singer, 2006)

Important developments in optimization techniques since then!

– I’ll highlight what I think is most useful to know.

SLIDE 13

I’m Taking Liberties

The M3N view of the world really thinks about outputs as configurations in a Markov network.
They assume y corresponds to a set of random variables, each of which gets a label in a finite set.
Their cost function is Hamming cost: “how many r.v.s do I predict incorrectly?”

– This is convenient and makes sense for their applications. But it’s not as general as it could be.

SLIDE 14

Cost-Augmented Decoding

$$\mathrm{decode}(w, x) = \arg\max_{y'} w^\top g(x, y')$$

$$\mathrm{cost\_augmented\_decode}(w, x, y) = \arg\max_{y'} \left( w^\top g(x, y') + \mathrm{cost}(y', y) \right)$$

Efficient decoding is possible when the features factor locally:

$$g(x, y) = \sum_{p} f(x, \mathrm{part}_p(y))$$

Efficient cost-augmented decoding requires that the cost function break into parts the same way:

$$\mathrm{cost}(y', y) = \sum_{p} \mathrm{local\_cost}(\mathrm{part}_p(y'), y)$$
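One case where the cost breaks into the same parts as the features is Hamming cost on a first-order sequence model: the local cost lives on single positions, so it can be folded into the per-position scores before running an unmodified decoder. The sketch below assumes scores are already collected in an emission matrix `emit[t, k]` and a transition matrix `trans[j, k]`; those names and the toy numbers are illustrative only.

```python
import numpy as np

def viterbi(emit, trans):
    """Best label sequence under scores emit[t, k] + trans[j, k] (first-order chain)."""
    T, K = emit.shape
    best = emit[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = best[:, None] + trans          # cand[j, k]: come from j, land on k
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + emit[t]
    path = [int(best.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def cost_augmented_viterbi(emit, trans, y_gold):
    """Fold Hamming local costs into the emission scores, then run plain Viterbi."""
    augmented = emit + 1.0                               # every label costs 1 ...
    augmented[np.arange(len(y_gold)), y_gold] -= 1.0     # ... except the gold label
    return viterbi(augmented, trans)

emit = np.array([[1.0, 0.5], [0.2, 0.0], [0.0, 2.0]])
trans = np.array([[0.5, 0.0], [0.0, 0.5]])
y_gold = [0, 0, 1]
print(viterbi(emit, trans), cost_augmented_viterbi(emit, trans, y_gold))
```

On this toy input plain Viterbi recovers the gold sequence, while the cost-augmented version returns a sequence that disagrees with the gold labels at two positions, because the added costs outweigh the small score differences there.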

SLIDE 15

An Exercise

If the features are such that we can use the Viterbi algorithm for decoding, what are some cost functions we could use inside an efficient cost-augmented decoding algorithm that’s a very small change to Viterbi?

SLIDE 16

Max-Margin Markov Networks

Taskar et al. actually work through a dual version of the problem.

– Primal and dual are both QPs; exponentially many constraints or variables, respectively.

Key trick: factored dual.

– Enables kernelized factors in the MN.
– Actual algorithm is sequential minimal optimization (SMO) for SVMs, a coordinate ascent method (Platt, 1999).

The paper includes a generalization bound that is argued to improve over the Collins perceptron.

Experiments: handwriting recognition, text classification for hyperlinked documents.

SLIDE 17

Structural SVM

Tsochantaridis et al. (2005) extends their 2004 paper.
Slightly different version of the loss function:

$$\min_{w} \frac{C}{2} \|w\|_2^2 + \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \forall i, \forall y: \; w^\top g(x_i, y_i) - w^\top g(x_i, y) \ge 1 - \frac{\xi_i}{\mathrm{cost}(y, y_i)}$$

– Alternative version of cost-augmented decoding (“slack rescaling” as opposed to Taskar et al.’s “margin rescaling”)
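The difference between the two rescalings is easiest to see by computing the smallest feasible slack for one example under each constraint. This is a sketch on a toy enumerable output space; `OUTPUTS`, `g`, and `graded_cost` are hypothetical, and the slack-rescaled constraint is rearranged to $\xi_i \ge \mathrm{cost}(y, y_i)\,(1 - \text{margin})$ so that the gold output (cost zero) needs no special case.

```python
import numpy as np

OUTPUTS = [0, 1, 2]  # hypothetical label set

def g(x, y):
    """Toy feature function: copy x into the block of features owned by label y."""
    feats = np.zeros(len(x) * len(OUTPUTS))
    feats[len(x) * y: len(x) * (y + 1)] = x
    return feats

def graded_cost(y_pred, y_gold):
    return float(abs(y_pred - y_gold))

def margin(w, x, y_gold, y):
    """Score of the gold output minus the score of a competitor y."""
    return float(np.dot(w, g(x, y_gold))) - float(np.dot(w, g(x, y)))

def margin_rescaled_slack(w, x, y_gold, cost):
    # Smallest xi with: margin >= cost - xi for all y.
    return max(cost(y, y_gold) - margin(w, x, y_gold, y) for y in OUTPUTS)

def slack_rescaled_slack(w, x, y_gold, cost):
    # Smallest xi with: margin >= 1 - xi / cost for all y, i.e. xi >= cost * (1 - margin).
    return max(cost(y, y_gold) * (1.0 - margin(w, x, y_gold, y)) for y in OUTPUTS)

w = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
x = np.array([1.0, 0.5])
mr = margin_rescaled_slack(w, x, 0, graded_cost)
sr = slack_rescaled_slack(w, x, 0, graded_cost)
print(mr, sr)
```

In this example margin rescaling is driven by label 2 (cost 2, margin 1), while slack rescaling is driven by label 1 (cost 1, margin 0.5): a competitor that already has margin 1 contributes nothing under slack rescaling, however costly it is.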

SLIDE 18

Optimization Algorithms for SSVMs

Taskar et al. (2003): SMO based on factored dual
Bartlett et al. (2004) and Collins et al. (2008): exponentiated gradient
Tsochantaridis et al. (2005): cutting planes (based on dual)
Taskar et al. (2005): dual extragradient

Easiest to use, in my opinion:

Ratliff et al. (2006): (stochastic) subgradient descent
Crammer et al. (2006): online “passive-aggressive” algorithms

SLIDE 19

“Passive Aggressive” Learners

Starting point is the perceptron, and the focus is on the step size.

In NLP, people often use a specific instance called “1-best MIRA” (margin infused relaxed algorithm).

– Sometimes with regular decoding, sometimes cost-augmented decoding.

I do not understand the name.

SLIDE 20

Passive-Aggressive Update in a Nutshell (“1-best MIRA”)

Given x (and y), perform decoding (or cost-augmented decoding) to obtain y’.

To get the updated weights, solve:

$$\min_{w'} \|w' - w\|_2^2 \quad \text{s.t.} \quad w'^\top g(x, y) - w'^\top g(x, y') \ge \mathrm{cost}(y', y)$$

Closed form solution!

– Essentially, a subgradient update with a closed-form step size.
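The closed form follows from projecting the current weights onto the half-space defined by the single constraint. A minimal sketch, assuming the feature vectors for the gold output and for y’ have already been computed (the function and argument names are hypothetical):

```python
import numpy as np

def pa_update(w, feats_gold, feats_pred, cost_value):
    """Closest w' to w satisfying w'.(feats_gold - feats_pred) >= cost_value."""
    delta = feats_gold - feats_pred
    violation = cost_value - float(np.dot(w, delta))
    sq_norm = float(np.dot(delta, delta))
    if violation <= 0.0 or sq_norm == 0.0:
        return w.copy()                      # passive: constraint already holds
    step = violation / sq_norm               # closed-form step size
    return w + step * delta                  # aggressive: move just enough

w = np.zeros(3)
feats_gold = np.array([1.0, 0.0, 1.0])
feats_pred = np.array([0.0, 1.0, 1.0])
w_new = pa_update(w, feats_gold, feats_pred, cost_value=1.0)
w_again = pa_update(w_new, feats_gold, feats_pred, cost_value=1.0)
print(w_new, w_again)
```

After one update the constraint holds with equality, so applying the same update a second time leaves the weights unchanged.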

SLIDE 21

Perceptron and PA

The PA papers (e.g., Crammer et al., 2006) take a procedural view of online learning and prove convergence and regret-style bounds.

An alternative view, described by Martins et al. (2010), derives the same updates via dual coordinate ascent.

– Confusing name: it doesn’t work in the dual!
– More general: applies to many other loss functions, so you can get a closed-form step size for perceptron and CRFs.
– Assumes L2 regularization; role of regularization constant C is very clear in the form of the update.

SLIDE 22

Dual Coordinate Ascent Update

$$w \leftarrow w - \underbrace{\min\left\{ \frac{1}{C}, \frac{L(w, x, y)}{\|\nabla_w L(w, x, y)\|_2^2} \right\}}_{\text{step size}} \; \underbrace{\nabla_w L(w, x, y)}_{\text{subgradient}}$$

Assumes L2 regularization.
1-best MIRA is a special case with structured hinge loss.
Can get regularization into perceptron this way (use perceptron loss).
Can get closed-form step size for SGD training of CRFs.
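The update rule translates almost directly into code. This sketch takes the loss value and a subgradient as inputs, so it is agnostic to which loss produced them; the function name and the guard for a zero loss or zero subgradient are assumptions added here for safety.

```python
import numpy as np

def dca_update(w, loss, subgrad, C):
    """w <- w - min(1/C, loss / ||subgrad||^2) * subgrad."""
    sq_norm = float(np.dot(subgrad, subgrad))
    if loss <= 0.0 or sq_norm == 0.0:
        return w.copy()                      # nothing to correct on this example
    step = min(1.0 / C, loss / sq_norm)
    return w - step * subgrad

w = np.zeros(3)
# Structured hinge at w = 0 with cost 1: loss is 1 and the subgradient is
# g(x, y') - g(x, y) for the cost-augmented prediction y'.
subgrad = np.array([0.0, 1.0, 1.0]) - np.array([1.0, 0.0, 1.0])
loose = dca_update(w, loss=1.0, subgrad=subgrad, C=0.1)   # cap 1/C = 10 is inactive
tight = dca_update(w, loss=1.0, subgrad=subgrad, C=4.0)   # cap 1/C = 0.25 is active
print(loose, tight)
```

With a small C the cap is inactive and the step equals the passive-aggressive step size; with a large C the step is clipped at 1/C, which is where the role of the regularization constant shows up.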

SLIDE 23

Hinge Loss and Log Loss

Hinge loss (M3N):

$$-w^\top g(x, y) + \max_{y'} \left( w^\top g(x, y') + \mathrm{cost}(y', y) \right)$$

Log loss (CRF):

$$-w^\top g(x, y) + \log \sum_{y'} \exp w^\top g(x, y')$$
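Side by side, the two losses differ only in replacing a cost-augmented max with a log-sum-exp. The sketch below evaluates both on a vector of precomputed scores $w^\top g(x, y')$, one entry per candidate output; the arrays `scores` and `costs` are toy values chosen for illustration.

```python
import numpy as np

def hinge_loss(scores, costs, gold):
    """Margin-rescaled hinge: max over y' of (score + cost), minus the gold score."""
    return float(np.max(scores + costs) - scores[gold])

def log_loss(scores, gold):
    """CRF log loss: log-sum-exp of all scores, minus the gold score."""
    shift = np.max(scores)  # subtract the max for numerical stability
    return float(shift + np.log(np.sum(np.exp(scores - shift))) - scores[gold])

costs = np.array([0.0, 1.0, 2.0])         # cost of each candidate against gold = 0
scores = np.array([1.0, 0.5, 0.0])        # gold wins, but by a small margin
confident = np.array([10.0, 0.5, 0.0])    # gold wins by a wide margin

h, l = hinge_loss(scores, costs, 0), log_loss(scores, 0)
h_conf, l_conf = hinge_loss(confident, costs, 0), log_loss(confident, 0)
print(h, l, h_conf, l_conf)
```

Once the gold output beats every competitor by at least its cost, the hinge loss is exactly zero, whereas the log loss stays strictly positive and keeps pushing the scores apart.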

SLIDE 24

Aside: Probabilities and Cost?

“Softmax margin” (Gimpel and Smith, 2010):

$$-w^\top g(x, y) + \log \sum_{y'} \exp \left( w^\top g(x, y') + \mathrm{cost}(y', y) \right)$$
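Because log-sum-exp upper-bounds max and costs are nonnegative, softmax margin sits above both the hinge loss and the log loss on the same scores. A small numeric check, with toy `scores` and `costs` that are assumptions of this sketch:

```python
import numpy as np

def softmax_margin(scores, costs, gold):
    """log-sum-exp of (score + cost) over all outputs, minus the gold score."""
    augmented = scores + costs
    shift = np.max(augmented)  # subtract the max for numerical stability
    return float(shift + np.log(np.sum(np.exp(augmented - shift))) - scores[gold])

scores = np.array([1.0, 0.5, 0.0])   # w . g(x, y') for each candidate y'
costs = np.array([0.0, 1.0, 2.0])    # cost(y', y) with gold index 0
gold = 0

sm = softmax_margin(scores, costs, gold)
hinge = float(np.max(scores + costs) - scores[gold])
log_loss = softmax_margin(scores, np.zeros_like(costs), gold)  # zero costs recover log loss
print(sm, hinge, log_loss)
```

Setting every cost to zero recovers the CRF log loss, which is one way to see softmax margin as a cost-aware version of it.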

SLIDE 25

Loss Functions You Know

| Name | Expression of $L(w, x, y)$ |
|---|---|
| Log loss (joint) | $-\log p(x, y \mid w)$ |
| Log loss (conditional) | $-\log p(y \mid x, w)$ |
| Cost | $\mathrm{cost}(\mathrm{decode}(w, x), y)$ |
| Expected cost, a.k.a. “risk” | $E_{p(Y \mid x, w)}[\mathrm{cost}(Y, y)]$ |
| Perceptron loss | $\max_{y'} w^\top g(x, y') - w^\top g(x, y)$ |
| Hinge (margin rescaling version) | $\max_{y'} \left( w^\top g(x, y') + \mathrm{cost}(y', y) \right) - w^\top g(x, y)$ |
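Three of these losses can be compared on the same scores: cost, risk, and perceptron loss. The sketch assumes the conditional distribution has the CRF form $p(Y \mid x, w) \propto \exp w^\top g(x, Y)$, which the table itself does not fix, and uses toy `scores` and `costs` arrays indexed by candidate output.

```python
import numpy as np

def posterior(scores):
    """p(Y | x, w) under the assumed form: softmax of the linear scores."""
    p = np.exp(scores - np.max(scores))
    return p / p.sum()

def cost_of_decode(scores, costs):
    """Cost of the single highest-scoring output."""
    return float(costs[int(np.argmax(scores))])

def risk(scores, costs):
    """Expected cost under the model's distribution over outputs."""
    return float(np.dot(posterior(scores), costs))

def perceptron_loss(scores, gold):
    return float(np.max(scores) - scores[gold])

scores = np.array([1.0, 0.5, 0.0])
costs = np.array([0.0, 1.0, 2.0])    # cost(y', y) with gold index 0

c = cost_of_decode(scores, costs)
r = risk(scores, costs)
p = perceptron_loss(scores, 0)
print(c, r, p)
```

The decoded output is correct here, so cost and perceptron loss are both zero, but risk is still positive because the model puts probability mass on the costly outputs.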

SLIDE 26

On Regularization

In principle, this choice is orthogonal to the loss function.

L2 is the most common starting place.
L1 and other sparsity-inducing regularizers are attracting more attention lately.

– But they make optimization more complicated!

SLIDE 27

Does this matter?

SLIDE 28

Practical Advice

Features still more important than the loss function.

– But general, easy-to-implement algorithms are quite useful!

Perceptron is easiest to implement.
CRFs and SSVMs usually do better.
If the cost function factors locally, I recommend using a hinge loss and stochastic subgradient descent.
Tune the regularization constant.
