Machine Learning
A Geometric Approach
Professor Liang Huang
Linear Classification: Perceptron
some slides from Alex Smola (CMU)
The perceptron (Frank Rosenblatt) sits at the root of a whole family of models: linear regression, SVMs, CRFs, the structured perceptron, the multilayer perceptron, and modern deep learning all descend from it.
A brief history of the perceptron and its descendants (many of the key players were at AT&T Research, or are ex-AT&T researchers and their students):

1959 Rosenblatt: invention of the perceptron
1962 Novikoff: convergence proof
1969* Minsky/Papert: their book killed it (the perceptron was considered DEAD for decades)
1997 Cortes/Vapnik: SVM (+ max margin + kernels + soft margin)
1999 Freund/Schapire: voted/averaged perceptron revived it (handles the inseparable case)
2002 Collins: structured perceptron
2003 Crammer/Singer: MIRA (conservative, max-margin-style updates)
2005* McDonald/Crammer/Pereira: structured MIRA
2006 Singer group: aggressive variant
2007--2010* Singer group: Pegasos (subgradient descent, minibatch)

*mentioned in lectures but optional (the other papers are all covered in detail)
The biological neuron that inspired the perceptron:
- Dendrites: combine the inputs from several other nerve cells.
- Cell body: combines the signals.
- Synapse: interface and parameter store between neurons.
- Axon: may be up to 1 m long and transports the activation signal to neurons at different locations.
The artificial neuron mirrors this structure: inputs $x_1, x_2, \dots, x_n$ arrive through synaptic weights $w_1, \dots, w_n$; the combination is the linear score

    $f(x) = \sum_i w_i x_i = \langle w, x \rangle$

which is then fed to a decision function $\sigma(\cdot)$.
With the decision function applied to the combination (and a bias $b$ added), the model computed from inputs $x_1, \dots, x_n$ and synaptic weights $w_1, \dots, w_n$ is

    $f(x) = \sigma(\langle w, x \rangle + b)$
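As a minimal sketch (not from the slides; the name decision is illustrative), the model above is one dot product plus a sign decision function:

    import numpy as np

    def decision(w, x, b=0.0):
        """Combination <w, x> + b followed by a sign decision function sigma."""
        return np.sign(np.dot(w, x) + b)

    # made-up weights and inputs
    w = np.array([0.5, -1.0, 2.0])    # synaptic weights w1..wn
    x = np.array([1.0, 0.0, 0.25])    # inputs x1..xn
    print(decision(w, x, b=-0.2))     # prints 1.0 (the predicted class)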
The bias trick: add a constant input $x_0 = 1$ with weight $w_0 = b$. The decision function $f(x) = \sigma(\langle w, x \rangle + b)$ then becomes a hyperplane through the origin in the augmented space, with the bias absorbed as just another synaptic weight next to $w_1, \dots, w_n$.
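A small sketch of the bias trick in code (illustrative, not from the slides): append the constant feature x0 = 1, fold the bias in as w0 = b, and check that the score is unchanged.

    import numpy as np

    def augment(x):
        """The bias trick: prepend the constant feature x0 = 1."""
        return np.concatenate(([1.0], x))

    w, b = np.array([0.5, -1.0]), 0.3
    x = np.array([2.0, 1.0])
    w_aug = np.concatenate(([b], w))   # w0 = b becomes just another weight
    # same score, but the separator now passes through the origin of the augmented space
    assert np.isclose(np.dot(w, x) + b, np.dot(w_aug, augment(x)))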
Why augmenting helps: data that can't be separated in 1D by a separator through the origin can be separated in 2D through the origin; likewise, data that can't be separated in 2D through the origin can be separated in 3D through the origin.
Running example: spam vs. ham classification.
The perceptron algorithm (a runnable sketch follows below):

    initialize $w = 0$ and $b = 0$
    repeat
        if $y_i[\langle w, x_i \rangle + b] \le 0$ then
            $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
        end if
    until all classified correctly

Since every update adds $y_i x_i$ to the weights, the final weights are a sum over the updated examples, $w = \sum_{i \in I} y_i x_i$, and the decision function can be written purely in terms of inner products:

    $f(x) = \sigma\Big(\sum_{i \in I} y_i \langle x_i, x \rangle + b\Big)$
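A runnable sketch of this algorithm (one possible implementation; the toy data below is made up):

    import numpy as np

    def perceptron(X, y, max_epochs=100):
        """On each mistake: w += y_i * x_i and b += y_i."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(max_epochs):
            mistakes = 0
            for x_i, y_i in zip(X, y):
                if y_i * (np.dot(w, x_i) + b) <= 0:    # mistake (or exactly on the boundary)
                    w += y_i * x_i
                    b += y_i
                    mistakes += 1
            if mistakes == 0:                          # all classified correctly
                return w, b
        return w, b

    # toy separable data
    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w, b = perceptron(X, y)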
(Figure: a single perceptron update, drawn with bias = 0: the weight vector $w$ moves toward $y_i x_i$ after a mistake on $x_i$.)
Convergence theorem (Novikoff): if there is a unit-norm vector $u$ ($\|u\| = 1$) such that $y_i(u \cdot x_i) \ge \delta$ for all $i$, then the perceptron converges to a linear separator after a number of updates bounded by $R^2/\delta^2$, where $R = \max_i \|x_i\|$.
Proof. Assume $w_i$ is the weight vector before the $i$-th update (which happens on example $\langle x_i, y_i \rangle$), with initial $w_0 = 0$.

Part 1: each update increases the projection onto $u$ (more agreement with the oracle):
$w_{i+1} = w_i + y_i x_i$, so $u \cdot w_{i+1} = u \cdot w_i + y_i(u \cdot x_i) \ge u \cdot w_i + \delta$ (using $y_i(u \cdot x_i) \ge \delta$ for all $i$), hence $u \cdot w_{i+1} \ge i\delta$. By Cauchy-Schwarz, $\|w_{i+1}\| = \|u\|\|w_{i+1}\| \ge u \cdot w_{i+1} \ge i\delta$.

Part 2: the norm grows slowly. Since the update was a mistake on $x_i$, we have $y_i(w_i \cdot x_i) \le 0$, and $\|x_i\| \le R$ (the radius), so
$\|w_{i+1}\|^2 = \|w_i + y_i x_i\|^2 = \|w_i\|^2 + \|x_i\|^2 + 2 y_i (w_i \cdot x_i) \le \|w_i\|^2 + R^2 \le iR^2$.

Combining Part 1 and Part 2: $i^2\delta^2 \le \|w_{i+1}\|^2 \le iR^2$, so $i \le R^2/\delta^2$.
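A quick numeric sanity check of the bound (the toy data and the unit separator u are made up for illustration): count the actual number of updates and compare with $R^2/\delta^2$.

    import numpy as np

    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    u = np.array([1.0, 0.0])                                       # ||u|| = 1, separates this data

    R = max(np.linalg.norm(x_i) for x_i in X)                      # R = max_i ||x_i||
    delta = min(y_i * np.dot(u, x_i) for x_i, y_i in zip(X, y))    # y_i (u . x_i) >= delta for all i

    w, updates, converged = np.zeros(2), 0, False                  # bias dropped (data separable through origin)
    while not converged:
        converged = True
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:                          # mistake
                w = w + y_i * x_i
                updates += 1
                converged = False
    print(updates, "<=", R ** 2 / delta ** 2)                      # e.g. 1 <= 10.0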
Properties of the bound: $R^2/\delta^2$ depends only on the geometry of the data ($R$ and $\delta$), not on the dimensionality or the number of examples. In practice, convergence speed also depends on the order of the examples (shuffling helps) and on how updates are scaled (a rate like 1/total#errors helps).
What about convergence when the data is not linearly separable? The separable case is easy; the inseparable case is hard: finding the minimum-error linear separator is NP-hard (this is what killed neural networks in the 1970s).
(Plots: test error in a low-dimensional setting (less separable) and in a high-dimensional setting (more separable).)
Averaged perceptron (naive version): run the perceptron as before, but keep a running sum of the weight vector after each example, not each update, and return the average:

    initialize $w = 0$, $b = 0$, $w_0 = 0$, $c = 0$
    repeat
        if $y_i[\langle w, x_i \rangle + b] \le 0$ then
            $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
        end if
        $w_0 \leftarrow w_0 + w$ and $c \leftarrow c + 1$    (after each example, not each update)
    until all classified correctly
    output $w_0 / c$
Why averaging can be done lazily: writing $\Delta w^{(t)}$ for the $t$-th update,

    $w^{(0)} = 0$
    $w^{(1)} = \Delta w^{(1)}$
    $w^{(2)} = \Delta w^{(1)} + \Delta w^{(2)}$
    $w^{(3)} = \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)}$
    $w^{(4)} = \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)} + \Delta w^{(4)}$

so each $w^{(t)}$ is the sum of all updates $\Delta w^{(s)}$ with $s \le t$, and summing the $w^{(t)}$'s weights earlier updates more heavily.
Efficient (lazy) version: run the same perceptron loop, but additionally keep a counter $c$ (incremented after each example, $c \leftarrow c + 1$) and an auxiliary vector $w_a$ (initialized to 0) that accumulates $c\,y_i x_i$ at each update ($w_a \leftarrow w_a + c\,y_i x_i$). It is then easy to show that the averaged weight vector can be recovered from $w$, $w_a$, and $c$ alone (e.g. as $w - w_a/c$, up to the exact counter convention), without ever summing the intermediate weight vectors explicitly; a sketch follows below.
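A hedged Python sketch of the lazy averaging (counter conventions differ slightly between write-ups; here c is incremented at the start of each example, so the closed form at the end is exact for the average of the per-example weight vectors):

    import numpy as np

    def averaged_perceptron(X, y, epochs=10):
        """Accumulate w_a = sum of c * y_i * x_i on each update; never store all the w's."""
        w, b = np.zeros(X.shape[1]), 0.0
        w_a, b_a, c = np.zeros(X.shape[1]), 0.0, 0
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                c += 1                                   # counts examples, not updates
                if y_i * (np.dot(w, x_i) + b) <= 0:      # mistake: usual update plus lazy bookkeeping
                    w += y_i * x_i;  b += y_i
                    w_a += c * y_i * x_i;  b_a += c * y_i
        # average of w (and b) taken after each of the c examples:
        return ((c + 1) * w - w_a) / c, ((c + 1) * b - b_a) / c

    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.5]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w_avg, b_avg = averaged_perceptron(X, y)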
MIRA, the margin-infused relaxation algorithm (Crammer/Singer), replaces the fixed perceptron step with the smallest step along $x_i$ that restores a functional margin of 1 on the current example:

    $w_{i+1} = w_i + \dfrac{y_i - w_i \cdot x_i}{\|x_i\|^2}\, x_i$

and it is easy to check that the margin is exactly 1 after the update:

    $y_i(w_{i+1} \cdot x_i) = y_i \Big(w_i + \dfrac{y_i - w_i \cdot x_i}{\|x_i\|^2}\, x_i\Big) \cdot x_i = y_i\,(w_i \cdot x_i + y_i - w_i \cdot x_i) = y_i^2 = 1$

(Figure: perceptron step vs. MIRA step along $x_i$; the distances marked in the picture are $\frac{w_i \cdot x_i}{\|x_i\|}$ and $\frac{1}{\|x_i\|}$.)
Geometric intuition (figures drawn with bias = 0):
- In one case the perceptron over-corrects the mistake on $x_i$.
- In another case the perceptron under-corrects the mistake (the figure marks a margin of $1/\|x_i\|$).
- MIRA makes sure that after the update the dot product satisfies $y_i(w \cdot x_i) = 1$.
Optimization view: the MIRA update is the minimal change to $w$ that ensures the margin,

    $\min_{w'} \|w' - w\|^2 \quad \text{s.t.} \quad y_i(w' \cdot x_i) \ge 1$

so MIRA ≈ a 1-step SVM. (Recall the distinction between the functional margin $y_i(w \cdot x_i)$ and the geometric margin $y_i(w \cdot x_i)/\|w\|$.)
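A minimal sketch of the MIRA step (the function name mira_update is illustrative; real MIRA variants may cap the step size): the step along x_i is chosen so that y_i (w . x_i) = 1 exactly after the update.

    import numpy as np

    def mira_update(w, x_i, y_i):
        """Minimal change to w such that y_i * (w . x_i) = 1 afterwards."""
        step = (y_i - np.dot(w, x_i)) / np.dot(x_i, x_i)
        return w + step * x_i

    w = np.array([0.5, -0.5])
    x_i, y_i = np.array([2.0, 1.0]), 1.0        # made-up example that sits inside the margin
    w_new = mira_update(w, x_i, y_i)
    assert np.isclose(y_i * np.dot(w_new, x_i), 1.0)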
Question: the bias trick (and feature augmentation in general) changes the geometry, so what happens to the margin? A big margin in 1D can become a small margin in 2D. Answer: the margin shrinks in the augmented space. The same caution applies to explicit feature maps such as $(x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$.
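A toy numeric illustration (the 1D points and the brute-force search are made up for this sketch): a 1D problem whose best threshold has margin 1.0 becomes, after the bias trick, a 2D problem whose best origin-crossing separator has a noticeably smaller geometric margin.

    import numpy as np

    pts, labels = np.array([3.0, 1.0]), np.array([1.0, -1.0])
    print(min(labels * (pts - 2.0)))                     # 1D margin around threshold 2: 1.0

    aug = np.stack([pts, np.ones_like(pts)], axis=1)     # bias trick: (x, 1), separator through origin
    best = 0.0
    for theta in np.linspace(0.0, 2 * np.pi, 100_000):   # brute-force search over unit directions
        u = np.array([np.cos(theta), np.sin(theta)])
        best = max(best, min(labels * (aug @ u)))
    print(best)                                          # about 0.447: the margin shrank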
Practical tricks for the perceptron (a preprocessing sketch follows below):
- averaging, shuffling, variable learning rate, fixing the feature scale;
- categorical => binary features, feature bucketing (binning/quantization) for numeric fields.

Example record:

    Age, Workclass, Education, Marital_Status, Occupation, Race, Sex, Hours, Country, Target
    40, Private, Doctorate, Married-civ-spouse, Prof-specialty, White, Male, 60, United-States, >50K
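A small sketch of this preprocessing (field names follow the example record; the vocabularies and bin edges are arbitrary illustrative choices):

    import numpy as np

    record = {"Age": 40, "Workclass": "Private", "Sex": "Male", "Hours": 60}

    def one_hot(field, value, vocab):
        """Categorical => binary: one indicator feature per (field, value) pair."""
        return {f"{field}={v}": float(v == value) for v in vocab}

    def bucketize(field, value, edges):
        """Feature bucketing (binning): indicator for the bin the numeric value falls into."""
        return {f"{field}_bin={int(np.digitize(value, edges))}": 1.0}

    features = {}
    features.update(one_hot("Workclass", record["Workclass"], ["Private", "Self-emp", "Gov"]))
    features.update(one_hot("Sex", record["Sex"], ["Male", "Female"]))
    features.update(bucketize("Age", record["Age"], edges=[25, 35, 45, 55, 65]))
    features.update(bucketize("Hours", record["Hours"], edges=[20, 40, 60]))
    print(features)    # sparse binary features for a linear classifier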