

SLIDE 1

Applied Machine Learning

Professor Liang Huang

Week 5: Extensions and Variations of Perceptron, and Practical Issues

CIML Chaps 4-5

(A Geometric Approach)

“A ship in port is safe, but that is not what ships are for.” 
 – Grace Hopper (1906-1992)

some slides from A. Zisserman (Oxford)

SLIDE 2

Trivia: Grace Hopper and the first bug

  • Edison coined the term “bug” around 1878, and it has since been widely used in engineering
  • Hopper was associated with the discovery of the first computer bug in 1947: a moth stuck in a relay

Smithsonian National Museum of American History

SLIDE 3

Week 5: Perceptron in Practice

  • Problems with Perceptron
      • doesn’t converge with inseparable data
      • update might often be too “bold”
      • doesn’t optimize margin
      • result is sensitive to the order of examples
  • Ways to alleviate these problems (without SVM/kernels)
      • Part II: voted perceptron and averaged perceptron
      • Part III: MIRA (margin-infused relaxation algorithm)
      • Part IV: Practical Issues and HW1
      • Part V: “Soft” Perceptron: Logistic Regression

“A ship in port is safe, but that is not what ships are for.” 
 – Grace Hopper (1906-1992)

SLIDE 4

Recap of Week 4

[Figure: setup for the convergence proof. A unit oracle vector u (‖u‖ = 1) separates the data with margin δ, i.e. u · x ≥ δ, and R bounds the radius of the data.]

input: training data D
output: weights w

initialize w ← 0
while not converged
    for (x, y) ∈ D
        if y(w · x) ≤ 0
            w ← w + yx
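A minimal NumPy sketch of this loop (the function name, the (x, y)-pair data format, and the `epochs` cap are my assumptions, not part of the slide):

    import numpy as np

    def perceptron_train(D, epochs=10):
        """Vanilla perceptron. D is a list of (x, y) pairs with y in {-1, +1};
        assumes a bias feature is already folded into x."""
        w = np.zeros(len(D[0][0]))
        for _ in range(epochs):            # caps "while not converged"
            converged = True
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(w, x) <= 0:  # mistake (or exactly on the boundary)
                    w += y * x             # w <- w + yx
                    converged = False
            if converged:
                break
        return w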

[Diagrams: three training pipelines, each mapping Input x to Output y via Model w. “Idealized” ML trains directly on x; “actual” ML first applies a hand-designed feature map ϕ; in deep learning ≈ representation learning, the feature map ϕ is itself learned.]

SLIDE 5

Python Demo


$ python perc_demo.py

(requires numpy and matplotlib)

SLIDE 6

Part II: Voted and Averaged Perceptron

[Plot: dev set error for the vanilla perceptron, voted perceptron, and averaged perceptron.]
SLIDE 7

Brief History of Perceptron

1959  Rosenblatt: invention
1962  Novikoff: proof
1969* Minsky/Papert: book killed it (perceptron declared DEAD)
1997  Cortes/Vapnik: SVM (+max margin, +kernels, +soft-margin)
1999  Freund/Schapire: voted/avg. perceptron: revived (inseparable case)
2002  Collins: structured perceptron
2003  Crammer/Singer: MIRA (online approx. of max margin; conservative updates)
2005* McDonald/Crammer/Pereira: structured MIRA
2006  Singer group: aggressive MIRA
2007–2010* Singer group: Pegasos (subgradient descent; minibatch, between online and batch)

*mentioned in lectures but optional (the other papers are all covered in detail)

(most of this work is from AT&T Research, ex-AT&T researchers, and their students)

SLIDE 8

Voted/Avged Perceptron

  • problem: later examples dominate earlier examples
  • solution: voted perceptron (Freund and Schapire, 1999)
      • record the weight vector after each example in D
      • not just after each update!
      • and vote on a new example using |D| models
      • shown to have better generalization power
  • averaged perceptron (from the same paper)
      • an approximation of the voted perceptron
      • just use the average of all weight vectors
      • can be implemented efficiently

SLIDE 9

Voted Perceptron

  • our notation: (x^(1), y^(1)), …
  • v is a weight vector; c is its # of votes
  • if correct, increase the current model’s # of votes; otherwise create a new model with 1 vote
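A sketch of this bookkeeping in NumPy, under the same assumptions as before (data as (x, y) pairs, y ∈ {−1, +1}; the helper names are mine):

    import numpy as np

    def voted_perceptron_train(D, epochs=10):
        """Voted perceptron (Freund & Schapire, 1999). Each weight vector v
        is stored with c, the number of examples it classified correctly."""
        v = np.zeros(len(D[0][0]))
        c = 1
        models = []                           # list of (v, c) pairs
        for _ in range(epochs):
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(v, x) <= 0:
                    models.append((v.copy(), c))  # retire current model with its votes
                    v = v + y * x                 # otherwise create a new model...
                    c = 1                         # ...starting with 1 vote
                else:
                    c += 1                        # correct: one more vote
        models.append((v.copy(), c))
        return models

    def voted_predict(models, x):
        """Each stored model casts c votes for sign(v . x)."""
        s = sum(c * np.sign(np.dot(v, x)) for v, c in models)
        return 1 if s >= 0 else -1

Note that prediction touches every stored model, which is exactly why the voted perceptron does not scale (next slides).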

SLIDE 10

Experiments

[Plot: dev set error for the vanilla perceptron, voted perceptron, and averaged perceptron.]
SLIDE 11

Averaged Perceptron

  • voted perceptron is not scalable
      • and does not output a single model
  • avg perceptron is an approximation of the voted perceptron
  • actually, summing all weight vectors is enough; no need to divide

accumulate after each example, not after each update!

w^(1) = Δw^(1)
w^(2) = Δw^(1) + Δw^(2)
w^(3) = Δw^(1) + Δw^(2) + Δw^(3)
w^(4) = Δw^(1) + Δw^(2) + Δw^(3) + Δw^(4)

initialize w ← 0; w_s ← 0
while not converged
    for (x, y) ∈ D
        if y(w · x) ≤ 0
            w ← w + yx
        w_s ← w_s + w
output: summed weights w_s
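A direct transcription of this pseudocode in NumPy (a sketch; fine for HW1-scale dimensions):

    import numpy as np

    def averaged_perceptron_naive(D, epochs=10):
        """Accumulate the running sum ws after every example."""
        w = np.zeros(len(D[0][0]))
        ws = np.zeros_like(w)
        for _ in range(epochs):
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(w, x) <= 0:
                    w += y * x
                ws += w        # after each example, not after each update!
        return ws              # summing is enough; no need to divide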
SLIDE 12

Efficient Implementation of Averaging

  • naive implementation (running sum ws) doesn’t scale either
  • OK for low dim. (HW1); too slow for high-dim. (HW3)
  • very clever trick from Hal Daumé (2006, PhD thesis)

w^(t) is the weight vector after example t; Δw^(t) is the update (possibly zero) made at example t:

w^(1) = Δw^(1)
w^(2) = Δw^(1) + Δw^(2)
w^(3) = Δw^(1) + Δw^(2) + Δw^(3)
w^(4) = Δw^(1) + Δw^(2) + Δw^(3) + Δw^(4)

initialize w ← 0; w_a ← 0; c ← 0
while not converged
    for (x, y) ∈ D
        if y(w · x) ≤ 0
            w ← w + yx
            w_a ← w_a + c·yx
        c ← c + 1
output: c·w − w_a

(w_a is updated after each update, not after each example!)
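A sketch of the trick in NumPy; the returned c·w − w_a equals the running sum w_s from the naive version:

    import numpy as np

    def averaged_perceptron(D, epochs=10):
        """Daume's trick: keep an auxiliary vector wa of c-weighted updates,
        then recover the summed weights as c*w - wa at the end."""
        w = np.zeros(len(D[0][0]))
        wa = np.zeros_like(w)
        c = 0
        for _ in range(epochs):
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(w, x) <= 0:
                    w += y * x
                    wa += c * y * x    # after each update, not after each example!
                c += 1                 # c counts every example seen
        return c * w - wa              # same vector as the naive running sum ws

The point of the design is that w_a is touched only on mistakes, so averaging costs almost nothing per example even in high dimensions.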

SLIDE 13

Part III: MIRA

  • perceptron often makes bold updates (over-correction)
  • and sometimes too small updates (under-correction)
  • but hard to tune learning rate
  • “just enough” update to correct the mistake?

MIRA (margin-infused relaxation algorithm) replaces the perceptron’s over-correcting (or under-correcting) update with a “just enough” one:

    w′ = w + ((y − w · x) / ‖x‖²) x

easy to show:

    w′ · x = (w + ((y − w · x) / ‖x‖²) x) · x = w · x + (y − w · x) = y

so after the update the functional margin is y(w′ · x) = y² = 1.

[Figure: perceptron over-correction vs. under-correction; x lies at distance (w · x)/‖x‖ from the boundary, and the MIRA target corresponds to distance 1/‖x‖.]
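The update as a NumPy sketch (the training-loop wrapper is my addition; plain MIRA updates only on mistakes):

    import numpy as np

    def mira_update(w, x, y):
        """'Just enough' update: move w minimally along x so that
        the functional margin becomes exactly 1, i.e. y * (w' . x) = 1."""
        x = np.asarray(x, dtype=float)
        return w + (y - np.dot(w, x)) / np.dot(x, x) * x

    def mira_train(D, epochs=10):
        w = np.zeros(len(D[0][0]))
        for _ in range(epochs):
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(w, x) <= 0:      # mistake: correct it exactly
                    w = mira_update(w, x, y)
        return w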

SLIDE 14

Example: Perceptron under-correction

[Figure: an example where the perceptron update under-corrects: after w ← w + yx, the new weight vector w′ still misclassifies x.]

SLIDE 15

MIRA: just enough

Unlike the perceptron, MIRA solves:

    min_{w′} ‖w′ − w‖²   s.t.  w′ · x ≥ 1

the minimal change to ensure a functional margin of 1 (dot-product w′ · x = 1); MIRA ≈ 1-step SVM.

functional margin: y(w · x)    geometric margin: y(w · x)/‖w‖

[Figure: perceptron vs. MIRA updates on example x; the distance 1/‖x‖ marks the MIRA target.]

SLIDE 16

MIRA: functional vs geom. margin

MIRA solves:

    min_{w′} ‖w′ − w‖²   s.t.  w′ · x ≥ 1

the minimal change to ensure a functional margin of 1 (dot-product w′ · x = 1); MIRA ≈ 1-step SVM.

functional margin: y(w · x)    geometric margin: y(w · x)/‖w‖

[Figure: after the update, the hyperplane w′ · x = 1 passes through x, so the geometric margin is 1/‖w′‖.]

SLIDE 17

Optional: Aggressive MIRA

  • aggressive version of MIRA
      • also update if correct but not confident enough
      • i.e., functional margin y(w · x) not big enough
  • p-aggressive MIRA: update if y(w · x) < p (0 ≤ p < 1)
      • MIRA is the special case p = 0: only update if misclassified!
  • update equation is the same as MIRA
      • i.e., after the update, the functional margin becomes 1
  • larger p leads to a larger geometric margin but slower convergence

[Figure: decision boundary w · x = 0 with the parallel lines w · x = 0.7 and w · x = 1; 1/‖w‖ is the geometric margin at functional margin 1.]
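A sketch of the p-aggressive variant, reusing the same update (names and defaults are mine):

    import numpy as np

    def aggressive_mira_train(D, p=0.5, epochs=10):
        """p-aggressive MIRA: update whenever the functional margin
        y * (w . x) falls below p, even if the example is correct."""
        assert 0 <= p < 1      # p = 0 recovers plain MIRA (only misclassifications)
        w = np.zeros(len(D[0][0]))
        for _ in range(epochs):
            for x, y in D:
                x = np.asarray(x, dtype=float)
                if y * np.dot(w, x) < p:                        # not confident enough
                    w += (y - np.dot(w, x)) / np.dot(x, x) * x  # margin becomes exactly 1
        return w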

SLIDE 18

Demo

SLIDE 19

Demo

SLIDE 20

Part IV: Practical Issues


“A ship in port is safe, but that is not what ships are for.” 
 – Grace Hopper (1906-1992)

  • you will build your own linear classifiers for HW2 (same data as HW1)
  • slightly different binarizations
      • for k-NN, we binarize all categorical fields but keep the two numerical ones
      • for perceptron (and most other classifiers), we binarize numerical fields as well
      • why? hint: larger “age” always better? more “hours” always better?
SLIDE 21

Useful Engineering Tips:

averaging, shuffling, variable learning rate, fixing feature scale

  • averaging helps significantly; MIRA helps a tiny little bit
      • perceptron < MIRA < avg. perceptron ≈ avg. MIRA ≈ SVM
  • shuffling the data helps hugely if classes were ordered (HW1)
      • shuffling before each epoch helps a little bit
  • variable (decaying) learning rate often helps a little
      • 1/(total # updates) or 1/(total # examples) helps
      • any requirement in order to converge?
      • how to prove convergence now?
  • centering of each dimension helps (Ex1/HW1); see the sketch after the figure below
      • why? => smaller radius, bigger margin!
  • unit variance also helps (why?) (Ex1/HW1)
      • 0-mean, 1-var => each feature ≈ a unit Gaussian

[Figure: the same data before centering (small margin) and after centering (big margin).]
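A minimal sketch of the centering/scaling tip (hypothetical helper; statistics come from the training data only, then are reused on test data):

    import numpy as np

    def standardize(X_train, X_test):
        """Map each feature to 0-mean, 1-variance, so each feature is
        roughly a unit Gaussian; this shrinks the radius R and enlarges
        the relative margin."""
        mu = X_train.mean(axis=0)
        sigma = X_train.std(axis=0)
        sigma[sigma == 0] = 1.0            # guard against constant features
        return (X_train - mu) / sigma, (X_test - mu) / sigma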

SLIDE 22

Feature Maps in Other Domains

  • how to convert an image or text to a vector?

  • image: a 28×28 grayscale image flattens to x ∈ ℝ^(28×28); a 23×23 RGB image flattens to x ∈ ℝ^(23×23×3)
  • text: “one-hot” representation of words (all binary features)

in deep learning there are other feature maps
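A small sketch of both feature maps (the toy vocabulary is hypothetical):

    import numpy as np

    # image: flatten the pixel grid into one long feature vector
    gray = np.random.rand(28, 28)       # stand-in for a 28x28 grayscale image
    x_gray = gray.reshape(-1)           # x in R^784

    rgb = np.random.rand(23, 23, 3)     # stand-in for a 23x23 RGB image
    x_rgb = rgb.reshape(-1)             # x in R^(23*23*3)

    # text: "one-hot" representation of words (all binary features)
    vocab = {"ship": 0, "port": 1, "safe": 2}   # toy vocabulary
    def one_hot(word):
        x = np.zeros(len(vocab))
        x[vocab[word]] = 1.0
        return x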

SLIDE 23

Part V: Perceptron vs. Logistic Regression

  • logistic regression is another popular linear classifier
  • can be viewed as “soft” or “probabilistic” perceptron
  • same decision rule (sign of dot-product), but prob. output

perceptron:           f(x) = sign(w · x)
logistic regression:  f(x) = σ(w · x) = 1 / (1 + e^(−w · x))
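Side by side as a NumPy sketch (function names are mine):

    import numpy as np

    def perceptron_predict(w, x):
        return np.sign(np.dot(w, x))                 # hard label in {-1, +1}

    def logistic_predict(w, x):
        return 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # P(y = 1 | x), in (0, 1)

    # same linear decision boundary w . x = 0: the sign flips there,
    # and the sigmoid crosses 0.5 there.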

SLIDE 24

Logistic vs. Linear Regression

  • linear regression is regression applied to a real-valued output using a linear function
  • logistic regression is regression applied to a 0-1 output using the sigmoid function

https://florianhartl.com/logistic-regression-geometric-intuition.html

[Figure: linear vs. logistic regression fits, each shown with 1 feature and with 2 features.]

SLIDE 25

Why Logistic instead of Linear


  • linear regression is easily dominated by distant points
      • causing misclassification

http://www.robots.ox.ac.uk/~az/lectures/ml/2011/lect4.pdf

SLIDE 26

Why Logistic instead of Linear

  • linear regression is easily dominated by distant points
      • causing misclassification
SLIDE 27

Why 0/1 instead of +/-1

  • perceptron: y = +1 or −1; logistic regression: y = 1 or 0
  • reason: want the output to be a probability
  • decision boundary is still linear: p(y = 1 | x) = 0.5

SLIDE 28

Logistic Regression: Large Margin


  • perceptron can be viewed roughly as “step” regression
  • logistic regression favors large margin; SVM: max margin
  • in practice: perc. << avg. perc. ≈ logistic regression ≈ SVM
SLIDE 29

[Diagram: family tree of linear classifiers]

  • perceptron (1958)
  • voted/avg. perceptron (1999)
  • structured perceptron (2002)
  • multilayer perceptron / deep learning (~1986; 2006–now)
  • kernels (1964)
  • SVM (1964; 1995)
  • structured SVM (2003)
  • logistic regression (1958)
  • cond. random fields (2001)