SLIDE 1

Machine Learning

A Geometric Approach

Professor Liang Huang

Linear Classification: Perceptron

some slides from Alex Smola (CMU)

SLIDE 2

Perceptron

Frank Rosenblatt

SLIDE 3

Perceptron

(Diagram: the perceptron and its relatives: linear regression, SVM, CRF, structured perceptron, multilayer perceptron, deep learning.)

SLIDE 4

Brief History of Perceptron

  • 1959  Rosenblatt: invention of the perceptron
  • 1962  Novikoff: convergence proof
  • 1969* Minsky/Papert: book ("Perceptrons") killed it; the perceptron was considered DEAD
  • 1997  Cortes/Vapnik: SVM (+ max margin, + kernels, + soft margin)
  • 1999  Freund/Schapire: voted/averaged perceptron; revived
  • 2002  Collins: structured perceptron
  • 2003  Crammer/Singer: MIRA (online approximation of max margin; conservative updates; handles the inseparable case)
  • 2005* McDonald/Crammer/Pereira: structured MIRA
  • 2006  Singer group: aggressive MIRA
  • 2007--2010*  Singer group: Pegasos (subgradient descent, minibatch; spans online, minibatch, and batch)

*mentioned in lectures but optional (the other papers are all covered in detail)

Much of this line of work came from AT&T Research, ex-AT&T researchers, and their students.

SLIDE 5

Neurons

  • Soma (CPU)

Cell body - combines signals

  • Dendrite (input bus)

Combines the inputs from several other nerve cells

  • Synapse (interface)

Interface and parameter store between neurons

  • Axon (output cable)

May be up to 1m long and will transport the activation signal to neurons at different locations

SLIDE 6

Neurons

$$f(x) = \sum_i w_i x_i = \langle w, x \rangle$$

(Diagram: inputs x1, x2, x3, ..., xn are multiplied by synaptic weights w1, ..., wn, summed, and passed through σ(·) to produce the output.)
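As a minimal sketch (not from the slides, names are my own), the weighted combination above is just a dot product; taking the sign as the nonlinearity σ gives the perceptron's decision rule:

```python
import numpy as np

def neuron(w, x, sigma=np.sign):
    """Weighted linear combination <w, x> passed through a nonlinearity sigma."""
    return sigma(np.dot(w, x))

# toy example: 3 inputs, 3 synaptic weights
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 1.5])
print(neuron(w, x))   # sign(0.5 + 0.0 + 3.0) = 1.0
```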

SLIDE 7

Frank Rosenblatt’s Perceptron

SLIDE 8

Multilayer Perceptron (Neural Net)

SLIDE 9

Perceptron w/ bias

  • Weighted linear combination
  • Nonlinear decision function
  • Linear offset (bias)
  • Linear separating hyperplanes
  • Learning: w and b

$$f(x) = \sigma(\langle w, x \rangle + b)$$

(Diagram: inputs x1, x2, x3, ..., xn, synaptic weights w1, ..., wn, output through σ.)

SLIDE 10

Perceptron w/o bias

  • Weighted linear combination
  • Nonlinear decision function
  • No linear offset (bias): hyperplane through the origin
  • Linear separating hyperplanes
  • Learning: w

$$f(x) = \sigma(\langle w, x \rangle)$$

(Diagram: same network as before, but with an extra constant input x0 = 1 whose weight w0 plays the role of the bias; inputs x0, x1, ..., xn, synaptic weights w0, w1, ..., wn, output through σ.)

SLIDE 11

Augmented Space

(Figure: 1D points on a line through the origin O, and the same points lifted to 2D by the constant coordinate x0 = 1.)

  • can’t separate in 1D from the origin, but can separate in 2D from the origin
  • can’t separate in 2D from the origin, but can separate in 3D from the origin
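A small illustrative sketch (data and names are my own, not from the slides): augmenting each example with a constant coordinate x0 = 1 lets a bias-free perceptron, whose hyperplane must pass through the origin, separate 1D data that is not separable from the origin.

```python
import numpy as np

# 1D data that is NOT separable by a threshold at the origin:
# both points are positive, but they have different labels.
X = np.array([[1.0], [3.0]])
y = np.array([-1, +1])

# augment: prepend a constant feature x0 = 1, so the bias becomes weight w0
X_aug = np.hstack([np.ones((len(X), 1)), X])   # shape (2, 2)

# perceptron without bias in the augmented 2D space
w = np.zeros(2)
for _ in range(100):                           # epochs
    mistakes = 0
    for xi, yi in zip(X_aug, y):
        if yi * np.dot(w, xi) <= 0:            # mistake (or on the boundary)
            w += yi * xi
            mistakes += 1
    if mistakes == 0:
        break

print(w, [np.sign(np.dot(w, xi)) for xi in X_aug])   # a separator through the origin
```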

SLIDE 12

Perceptron

Spam Ham

SLIDE 13

The Perceptron w/o bias

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the (mistaken) training examples
  • Classifier is a linear combination of inner products

initialize w = 0
repeat
  if yi ⟨w, xi⟩ ≤ 0 then
    w ← w + yi xi
  end if
until all classified correctly

$$w = \sum_{i \in I} y_i x_i \qquad f(x) = \sigma\Big(\sum_{i \in I} y_i \langle x_i, x \rangle\Big)$$

where I is the set of examples on which updates were made.

SLIDE 14

The Perceptron w/ bias

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the (mistaken) training examples
  • Classifier is a linear combination of inner products

initialize w = 0 and b = 0
repeat
  if yi [⟨w, xi⟩ + b] ≤ 0 then
    w ← w + yi xi and b ← b + yi
  end if
until all classified correctly

$$w = \sum_{i \in I} y_i x_i \qquad f(x) = \sigma\Big(\sum_{i \in I} y_i \langle x_i, x \rangle + b\Big)$$

where I is the set of examples on which updates were made.
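A minimal Python sketch of the pseudocode above (function and variable names are my own), trained on a toy dataset:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train a perceptron with bias: update w, b whenever yi*(<w, xi> + b) <= 0."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:                       # all classified correctly
            break
    return w, b

# toy linearly separable data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([+1, +1, -1, -1])
w, b = perceptron(X, y)
print(w, b, np.sign(X @ w + b))   # predictions should match y
```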

SLIDE 15

Demo

(Figure: weight vector w and example xi; bias = 0.)

SLIDE 16

Demo

SLIDE 17

Demo

SLIDE 18

Demo

SLIDE 19

SLIDE 20

Convergence Theorem

  • If there exists some oracle unit vector

$$u : \|u\| = 1 \quad \text{with} \quad y_i (u \cdot x_i) \ge \delta \ \text{ for all } i,$$

then the perceptron converges to a linear separator after a number of updates bounded by

$$R^2/\delta^2 \quad \text{where } R = \max_i \|x_i\|.$$

  • Dimensionality independent
  • Order independent (but order matters in output)
  • Dataset size independent
  • Scales with the ‘difficulty’ of the problem
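For a concrete instance (the numbers are mine, not from the slides): with R = 1 and margin δ = 0.1,

$$\frac{R^2}{\delta^2} = \frac{1}{0.01} = 100,$$

so the perceptron makes at most 100 updates, regardless of dimensionality or dataset size.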

SLIDE 21

Geometry of the Proof

  • part 1: progress (alignment) on oracle projection

Assume wi is the weight vector before the i-th update (on ⟨xi, yi⟩) and that the initial w0 = 0. Using the oracle u with ‖u‖ = 1 and yi (u · xi) ≥ δ for all i:

$$
\begin{aligned}
w_{i+1} &= w_i + y_i x_i \\
u \cdot w_{i+1} &= u \cdot w_i + y_i (u \cdot x_i) \\
u \cdot w_{i+1} &\ge u \cdot w_i + \delta \\
u \cdot w_{i+1} &\ge i\delta \\
\|w_{i+1}\| &= \|u\| \, \|w_{i+1}\| \ge u \cdot w_{i+1} \ge i\delta
\end{aligned}
$$

projection on u increases! (more agreement with the oracle)

(Figure: the update wi+1 = wi + yi xi after a mistake on xi, shown against the oracle direction u and margin δ.)

SLIDE 22

Geometry of the Proof

  • part 2: bound the norm of the weight vector

$$
\begin{aligned}
w_{i+1} &= w_i + y_i x_i \\
\|w_{i+1}\|^2 &= \|w_i + y_i x_i\|^2 = \|w_i\|^2 + \|x_i\|^2 + 2 y_i (w_i \cdot x_i) \\
&\le \|w_i\|^2 + R^2 \\
&\le i R^2
\end{aligned}
$$

(the cross term yi (wi · xi) ≤ 0 because there was a mistake on xi, and ‖xi‖ ≤ R, the radius of the data)

Combine with part 1, ‖wi+1‖ ≥ iδ, to get i²δ² ≤ iR², i.e.

$$i \le R^2 / \delta^2$$

(Figure: the same picture as part 1, with u the unit oracle vector and δ the margin.)

SLIDE 23

Convergence Bound

  • the bound R²/δ² is independent of:
  • dimensionality
  • number of examples
  • starting weight vector w
  • order of examples
  • constant learning rate
  • and is dependent on:
  • separation difficulty
  • feature scale
  • but test accuracy is dependent on:
  • order of examples (shuffling helps)
  • variable learning rate (1/total#errors helps)
  • can you still prove convergence?

SLIDE 24

Hardness: margin vs. size

(Figure: example datasets ranging from hard, with a small margin, to easy, with a large margin.)

SLIDE 25

XOR

  • XOR - not linearly separable
  • Nonlinear separation is trivial
  • Caveat from “Perceptrons” (Minsky & Papert, 1969)

Finding the minimum error linear separator is NP-hard (this killed Neural Networks in the 70s).

SLIDE 26

Brief History of Perceptron

(Timeline repeated from Slide 4.)

SLIDE 27

Extensions of Perceptron

  • Problems with Perceptron
  • doesn’t converge with inseparable data
  • update might often be too “bold”
  • doesn’t optimize margin
  • is sensitive to the order of examples
  • Ways to alleviate these problems
  • voted perceptron and average perceptron
  • MIRA (margin-infused relaxation algorithm)
SLIDE 28

Voted/Avged Perceptron

  • motivation: updates on later examples taking over!
  • voted perceptron (Freund and Schapire, 1999)
  • record the weight vector after each example in D
  • (not just after each update)
  • and vote on a new example using |D| models
  • shown to have better generalization power
  • averaged perceptron (from the same paper)
  • an approximation of voted perceptron
  • just use the average of all weight vectors
  • can be implemented efficiently
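Before moving on, a sketch of the voted perceptron itself (my own code, not the authors' reference implementation; it stores each weight vector with a survival count, which is equivalent to recording w after every example):

```python
import numpy as np

def train_voted_perceptron(X, y, epochs=10):
    """Return a list of (w, b, c) models; c counts how many examples each model survived."""
    models = []                                    # (weights, bias, survival count)
    w, b, c = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:      # mistake: retire the current model
                models.append((w.copy(), b, c))
                w, b, c = w + yi * xi, b + yi, 1
            else:
                c += 1                             # current model survives one more example
    models.append((w.copy(), b, c))
    return models

def predict_voted(models, x):
    """Each recorded model casts c votes of sign(<w, x> + b)."""
    score = sum(c * np.sign(np.dot(w, x) + b) for w, b, c in models)
    return np.sign(score)
```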
SLIDE 29

Voted Perceptron

SLIDE 30

Voted/Avged Perceptron

test error (low dim - less separable)

SLIDE 31

Voted/Avged Perceptron

test error

(high dim - more separable)

SLIDE 32

Averaged Perceptron

  • voted perceptron is not scalable
  • and does not output a single model
  • avg perceptron is an approximation of voted perceptron

initialize w = 0, b = 0, w′ = 0, c = 0
repeat
  if yi [⟨w, xi⟩ + b] ≤ 0 then
    w ← w + yi xi and b ← b + yi
  end if
  w′ ← w′ + w    (after each example, not each update)
  c ← c + 1
until all classified correctly
output w′ / c
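A direct Python transcription of the running-sum version above (a sketch with my own naming, kept deliberately naive; the next slide shows why this doesn't scale):

```python
import numpy as np

def averaged_perceptron_naive(X, y, epochs=10):
    """Averaged perceptron, naive version: accumulate w after EVERY example."""
    w, b = np.zeros(X.shape[1]), 0.0
    w_sum, b_sum, c = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:
                w += yi * xi
                b += yi
            w_sum += w        # after each example, not each update: O(d) work every time
            b_sum += b
            c += 1
    return w_sum / c, b_sum / c
```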

SLIDE 33

Efficient Implementation of Averaging

  • naive implementation (running sum) doesn’t scale
  • very clever trick from Daume (2006, PhD thesis)

$$
\begin{aligned}
w^{(0)} &= 0 \\
w^{(1)} &= \Delta w^{(1)} \\
w^{(2)} &= \Delta w^{(1)} + \Delta w^{(2)} \\
w^{(3)} &= \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)} \\
w^{(4)} &= \Delta w^{(1)} + \Delta w^{(2)} + \Delta w^{(3)} + \Delta w^{(4)}
\end{aligned}
$$

Summing the w(t) counts each update Δw(t) once for every later step, so instead of accumulating w after every example we only need a counter-weighted accumulator wa of the updates, and the average is recovered as w − wa/c:

initialize w = 0, b = 0, wa = 0, c = 0
repeat
  c ← c + 1
  if yi [⟨w, xi⟩ + b] ≤ 0 then
    w ← w + yi xi and b ← b + yi
    wa ← wa + c yi xi
  end if
until all classified correctly
output w − wa/c
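A runnable sketch of the trick (my own transcription, also accumulating the bias, which the pseudocode above omits; the exact counter convention, incrementing c before or after the example, differs across implementations by a harmless off-by-one):

```python
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Averaged perceptron with Daume's trick: only touch the accumulator on updates."""
    w, b = np.zeros(X.shape[1]), 0.0
    wa, ba = np.zeros(X.shape[1]), 0.0    # counter-weighted accumulators
    c = 0                                 # example counter
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            c += 1
            if yi * (np.dot(w, xi) + b) <= 0:
                w += yi * xi
                b += yi
                wa += c * yi * xi         # weight each update by when it happened
                ba += c * yi
    return w - wa / c, b - ba / c         # recover the averaged weights and bias
```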
SLIDE 34

MIRA

  • perceptron often makes too bold updates
  • but hard to tune learning rate
  • the smallest update to correct the mistake?

easy to show:

$$w_{i+1} = w_i + \frac{y_i - w_i \cdot x_i}{\|x_i\|^2} \, x_i$$

$$y_i (w_{i+1} \cdot x_i) = y_i \Big( w_i + \frac{y_i - w_i \cdot x_i}{\|x_i\|^2} \, x_i \Big) \cdot x_i = 1$$

margin-infused relaxation algorithm (MIRA)

(Figure: on a mistake on xi, the perceptron update wi + yi xi can over-correct this mistake, while MIRA moves wi just far enough along xi that yi (wi+1 · xi) = 1, i.e. a functional margin of 1 and a geometric margin of 1/‖xi‖.)
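A sketch of the corresponding training loop (my own code; bias omitted, matching the bias = 0 demos):

```python
import numpy as np

def mira(X, y, epochs=100):
    """MIRA: on a mistake, make the smallest update so that yi * <w, xi> becomes 1."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:                           # mistake
                w += (yi - np.dot(w, xi)) / np.dot(xi, xi) * xi   # now yi*<w, xi> = 1
                mistakes += 1
        if mistakes == 0:
            break
    return w
```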

SLIDE 35

Perceptron

(Figure: weight vector w and example xi, bias = 0; the perceptron update under-corrects this mistake.)

SLIDE 36

MIRA

(Figure: the same mistake on xi, bias = 0; the perceptron update under-corrects it, while MIRA makes sure that after the update the dot product w · xi = 1, i.e. a margin of 1/‖xi‖.)

$$\min_{w'} \|w' - w\|^2 \quad \text{s.t.} \quad y_i (w' \cdot x_i) \ge 1$$

minimal change to ensure margin

MIRA ≈ 1-step SVM

SLIDE 37

Aggressive MIRA

  • aggressive version of MIRA
  • also update if correct but the margin isn’t big enough
  • functional margin: yi (w · xi)
  • geometric margin: yi (w · xi) / ‖w‖
  • update if the functional margin is ≤ p (0 ≤ p < 1)
  • update rule is the same as MIRA
  • called p-aggressive MIRA (MIRA: p = 0)
  • larger p leads to a larger geometric margin
  • but slower convergence
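The change from plain MIRA is a single condition; a sketch (my own code, bias = 0):

```python
import numpy as np

def aggressive_mira(X, y, p=0.2, epochs=1000):
    """p-aggressive MIRA: update whenever the functional margin yi*<w, xi> is <= p."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= p:                           # margin too small (p = 0 is MIRA)
                w += (yi - np.dot(w, xi)) / np.dot(xi, xi) * xi   # same update rule as MIRA
                updated = True
        if not updated:
            break
    return w
```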

SLIDE 38

Aggressive MIRA

(Figure: decision boundaries found by the perceptron, 0.2-aggressive MIRA (p = 0.2), and 0.9-aggressive MIRA (p = 0.9).)

SLIDE 39

Demo

  • perceptron vs. 0.2-aggressive vs. 0.9-aggressive

(Figure: decision boundaries for the perceptron, p = 0.2, and p = 0.9.)
SLIDE 40

Demo

  • perceptron vs. 0.2-aggressive vs. 0.9-aggressive
  • why is this dataset so slow to converge?
  • perceptron: 22, p=0.2: 87, p=0.9: 2,518 epochs

(Figure: data on a 1D line through the origin O, lifted to 2D with the constant coordinate x0 = 1; big margin in 1D, small margin in 2D.)

answer: margin shrinks in augmented space!

SLIDE 41

Demo

  • perceptron vs. 0.2-aggressive vs. 0.9-aggressive
  • why does this dataset converge so fast?
  • perceptron: 3, p=0.2: 1, p=0.9: 5 epochs

(Figure: data on a 1D line through the origin O, lifted to 2D with the constant coordinate x0 = 1; big margin in 1D, still an OK margin in 2D.)

answer: margin shrinks in augmented space!
SLIDE 42

What if the data is not separable?

  • in practice, data is almost always inseparable
  • wait, what does that mean??
  • perceptron cycling theorem (1970)
  • weights will remain bounded and not diverge
  • use dev set for when to stop (prevents overfitting)
  • higher-order features by combining atomic ones
  • kernels => separable in higher dimensions
SLIDE 43

SLIDE 44

Solving XOR

  • XOR not linearly separable
  • Mapping into 3 dimensions makes it easily solvable

$$(x_1, x_2) \mapsto (x_1, x_2, x_1 x_2)$$
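A quick sketch (my own code) showing that the lifted XOR data becomes separable by a bias-free perceptron: with ±1-coded inputs, the product feature x1·x2 alone already determines the label.

```python
import numpy as np

# XOR with +/-1 coding: label is +1 iff exactly one input is +1
X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
y = np.array([-1, +1, +1, -1])

# lift (x1, x2) -> (x1, x2, x1*x2)
X3 = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])

# perceptron without bias in the lifted space
w = np.zeros(3)
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X3, y):
        if yi * np.dot(w, xi) <= 0:
            w += yi * xi
            converged = False

print(w)                        # e.g. [0, 0, -2]: the x1*x2 feature does the work
print(np.sign(X3 @ w) == y)     # all True
```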

SLIDE 45

Useful Engineering Tips:

averaging, shuffling, variable learning rate, fixing feature scale

  • averaging helps significantly; MIRA helps a tiny little bit
  • perceptron < MIRA < avg. perceptron ≈ avg. MIRA
  • shuffling the data helps hugely if classes were ordered
  • shuffling before each epoch helps a little bit
  • variable learning rate often helps a little
  • 1/(total#updates) or 1/(total#examples) helps
  • any requirement in order to converge?
  • how to prove convergence now?
  • centering of each feature dim helps
  • why? => R smaller, margin bigger
  • unit variance also helps (why?)
  • 0-mean, 1-var => each feature ≈ a unit Gaussian

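A sketch of the last two tips, centering and unit variance (my own code; this is just per-feature standardization, fit on the training set and reused on the test set):

```python
import numpy as np

def fit_standardizer(X_train):
    """Per-feature mean and std, computed on training data only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0           # guard against constant features
    return mu, sigma

def standardize(X, mu, sigma):
    """0-mean, 1-var per feature: R gets smaller, the margin relatively bigger."""
    return (X - mu) / sigma

# usage
X_train = np.array([[20.0, 40.0], [45.0, 60.0], [30.0, 20.0]])
mu, sigma = fit_standardizer(X_train)
X_train_std = standardize(X_train, mu, sigma)
```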

SLIDE 46

Useful Engineering Tips:

categorical=>binary, feature bucketing (binning/quantization)

  • HW1 Adult income dataset: <=50K or >50K?
  • 2 numerical features
  • age and hours-per-week
  • Option 1: treat them as numerical features
  • but is older and more hours always better?
  • Option 2: treat them as binary features
  • e.g., age=22, hours=38, ...
  • Option 3: bin them (e.g., age=0-25, hours=41-60, ...)
  • 7 categorical features: convert to binary features
  • country, race, occupation, etc.
  • e.g., country:United_States, education:Doctorate, ...
  • Optional: you can probably add a numerical feature “edu_level”
  • perceptron: ~20% error, avg. perceptron: ~16% error

Age, Workclass, Education, Marital_Status, Occupation, Race, Sex, Hours, Country, Target
40, Private, Doctorate, Married-civ-spouse, Prof-specialty, White, Male, 60, United-States, >50K
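A sketch of Options 2/3 plus the categorical conversion (my own code, field names, and bin boundaries, following the CSV row above): every feature becomes a binary "field=value" indicator, with the numerical fields bucketed first.

```python
def bucket_age(age):
    """Option 3: bin a numerical feature into coarse ranges."""
    if age <= 25:
        return "0-25"
    elif age <= 45:
        return "26-45"
    elif age <= 65:
        return "46-65"
    return "66+"

def featurize(row):
    """Turn one CSV row (list of strings) into a set of binary feature names."""
    age, workclass, education, marital, occupation, race, sex, hours, country = row[:9]
    feats = {
        "age=" + bucket_age(int(age)),                 # binned numerical feature
        "hours=" + ("<=40" if int(hours) <= 40 else "41-60" if int(hours) <= 60 else "60+"),
        "workclass=" + workclass,                      # categorical -> binary indicators
        "education=" + education,
        "marital=" + marital,
        "occupation=" + occupation,
        "race=" + race,
        "sex=" + sex,
        "country=" + country,
    }
    return feats

row = ["40", "Private", "Doctorate", "Married-civ-spouse", "Prof-specialty",
       "White", "Male", "60", "United-States"]
print(sorted(featurize(row)))
```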

SLIDE 47

Brief History of Perceptron

(Timeline repeated from Slide 4.)