SLIDE 1

Machine Learning Basics Lecture 3: Perceptron

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Perceptron

SLIDE 3

Overview

  • Previous lectures: (Principle for loss function) MLE to derive loss
  • Example: linear regression; some linear classification models
  • This lecture: (Principle for optimization) local improvement
  • Example: Perceptron; SGD
SLIDE 4

Task

(π‘₯βˆ—)π‘ˆπ‘¦ = 0 Class +1 Class -1 π‘₯βˆ— (π‘₯βˆ—)π‘ˆπ‘¦ > 0 (π‘₯βˆ—)π‘ˆπ‘¦ < 0

SLIDE 5

Attempt

  • Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
  • Hypothesis f_w(x) = w^T x
  • y = +1 if w^T x > 0
  • y = -1 if w^T x < 0
  • Prediction: y = sign(f_w(x)) = sign(w^T x)
  • Goal: minimize classification error
SLIDE 6

Perceptron Algorithm

  • Assume for simplicity: all x_i have length 1

Perceptron: figure from the lecture note of Nina Balcan
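The algorithm itself is given in the referenced figure and is not reproduced here. As a rough substitute, here is a minimal NumPy sketch of the standard Perceptron loop (the function name, the multi-pass loop, and the stopping rule are illustrative choices, not from the slides): start from w = 0 and, whenever an example is misclassified, add y_i x_i to w.

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Perceptron on examples X (one x_i per row, assumed to have length 1)
    with labels y in {+1, -1}."""
    w = np.zeros(X.shape[1])              # start from the zero vector
    mistakes = 0
    for _ in range(max_passes):
        made_mistake = False
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # mistake (wrong sign, or on the boundary)
                w = w + y_i * x_i         # add x_i for positives, subtract for negatives
                mistakes += 1
                made_mistake = True
        if not made_mistake:              # a full pass without mistakes: stop
            break
    return w, mistakes
```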

SLIDE 7

Intuition: correct the current mistake

  • If mistake on a positive example x:

w_{t+1}^T x = (w_t + x)^T x = w_t^T x + x^T x = w_t^T x + 1

  • If mistake on a negative example x:

w_{t+1}^T x = (w_t - x)^T x = w_t^T x - x^T x = w_t^T x - 1

SLIDE 8

The Perceptron Theorem

  • Suppose there exists w* that correctly classifies all (x_i, y_i)
  • W.L.O.G., all x_i and w* have length 1, so the minimum distance of any example to the decision boundary is γ = min_i |(w*)^T x_i|
  • Then Perceptron makes at most (1/γ)² mistakes

SLIDE 9

The Perceptron Theorem

  • Suppose there exists w* that correctly classifies all (x_i, y_i)
  • W.L.O.G., all x_i and w* have length 1, so the minimum distance of any example to the decision boundary is γ = min_i |(w*)^T x_i|
  • Then Perceptron makes at most (1/γ)² mistakes

Note: the examples need not be i.i.d., and the bound does not depend on n, the length of the data sequence!
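As an illustration (not from the slides), a quick empirical check of the bound: sample unit-length points that a unit-length w* separates with margin at least γ, run the Perceptron sketch from above, and confirm the mistake count stays below 1/γ². The data-generation choices here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, gamma = 5, 200, 0.1

# A unit-length "true" separator and unit-length examples with margin >= gamma.
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)

X, y = [], []
while len(X) < n:
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)
    if abs(w_star @ x) >= gamma:            # keep only points with margin at least gamma
        X.append(x)
        y.append(np.sign(w_star @ x))
X, y = np.array(X), np.array(y)

w, mistakes = perceptron(X, y)              # perceptron() is the sketch given earlier
print(mistakes, "<=", 1 / gamma**2)         # the theorem says this must hold
```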

SLIDE 10

Analysis

  • First look at the quantity w_t^T w*
  • Claim 1: w_{t+1}^T w* ≥ w_t^T w* + γ
  • Proof: If mistake on a positive example x:

w_{t+1}^T w* = (w_t + x)^T w* = w_t^T w* + x^T w* ≥ w_t^T w* + γ

  • If mistake on a negative example x:

w_{t+1}^T w* = (w_t - x)^T w* = w_t^T w* - x^T w* ≥ w_t^T w* + γ

SLIDE 11

Analysis

  • Next look at the quantity ||w_t||
  • Claim 2: ||w_{t+1}||² ≤ ||w_t||² + 1
  • Proof: If mistake on a positive example x:

||w_{t+1}||² = ||w_t + x||² = ||w_t||² + ||x||² + 2 w_t^T x

where ||x||² = 1 and the term 2 w_t^T x is negative since we made a mistake on x.

SLIDE 12

Analysis: putting things together

  • Claim 1: w_{t+1}^T w* ≥ w_t^T w* + γ
  • Claim 2: ||w_{t+1}||² ≤ ||w_t||² + 1

After M mistakes:

  • w_{M+1}^T w* ≥ γM
  • ||w_{M+1}|| ≤ √M
  • w_{M+1}^T w* ≤ ||w_{M+1}||

So γM ≤ √M, and thus M ≤ (1/γ)².
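For completeness, the same argument written out as one chain of inequalities (the middle step is Cauchy-Schwarz together with ||w*|| = 1):

```latex
\[
\gamma M \;\le\; w_{M+1}^\top w^*
\;\le\; \|w_{M+1}\|\,\|w^*\|
\;=\; \|w_{M+1}\|
\;\le\; \sqrt{M}
\quad\Longrightarrow\quad
M \le \frac{1}{\gamma^2}.
\]
```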

SLIDE 13

Intuition

  • Claim 1: w_{t+1}^T w* ≥ w_t^T w* + γ
  • Claim 2: ||w_{t+1}||² ≤ ||w_t||² + 1

Claim 1 says the correlation w_t^T w* gets larger. This could be because:

  • 1. w_{t+1} gets closer to w*
  • 2. w_{t+1} gets much longer

Claim 2 rules out the bad case "2. w_{t+1} gets much longer".

SLIDE 14

Some side notes on Perceptron

SLIDE 15

History

Figure from Pattern Recognition and Machine Learning, Bishop

SLIDE 16

Note: connectionism vs symbolism

  • Symbolism: AI can be achieved by representing concepts as symbols
  • Example: rule-based expert system, formal grammar
  • Connectionism: explain intellectual abilities using connections between neurons (i.e., artificial neural networks)

  • Example: perceptron, larger scale neural networks
SLIDE 17

Symbolism example: Credit Risk Analysis

Example from Machine Learning lecture notes by Tom Mitchell

SLIDE 18

Connectionism example

Figure from Pattern Recognition and Machine Learning, Bishop

Neuron/perceptron

SLIDE 19

Note: connectionism vs symbolism

  • "Formal theories of logical reasoning, grammar, and other higher mental faculties compel us to think of the mind as a machine for rule-based manipulation of highly structured arrays of symbols. What we know of the brain compels us to think of human information processing in terms of manipulation of a large unstructured set of numbers, the activity levels of interconnected neurons."

  • --- The Central Paradox of Cognition (Smolensky et al., 1992)
SLIDE 20

Note: online vs batch

  • Batch: given training data (x_i, y_i), 1 ≤ i ≤ n, typically i.i.d.
  • Online: data points arrive one by one
  • 1. The algorithm receives an unlabeled example x_i
  • 2. The algorithm predicts a classification of this example
  • 3. The algorithm is then told the correct answer y_i, and updates its model (a sketch of this protocol follows)
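A minimal sketch of the online protocol (the learner object and its predict/update methods are hypothetical placeholders, not an interface from the slides):

```python
def run_online(learner, stream):
    """Feed examples to `learner` one by one; the stream need not be i.i.d."""
    mistakes = 0
    for x, y in stream:
        y_hat = learner.predict(x)    # 1-2. receive the unlabeled x, predict
        mistakes += (y_hat != y)
        learner.update(x, y)          # 3. told the correct answer y, update the model
    return mistakes
```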
SLIDE 21

Stochastic gradient descent (SGD)

SLIDE 22

Gradient descent

  • Minimize loss L̂(θ), where the hypothesis is parametrized by θ
  • Gradient descent
  • Initialize θ^0
  • θ^{t+1} = θ^t - η_t ∇L̂(θ^t)

SLIDE 23

Stochastic gradient descent (SGD)

  • Suppose data points arrive one by one
  • L̂(θ) = (1/n) Σ_{t=1}^n l(θ, x_t, y_t), but we only know l(θ, x_t, y_t) at time t
  • Idea: simply do what you can based on local information
  • Initialize θ^0
  • θ^{t+1} = θ^t - η_t ∇l(θ^t, x_t, y_t)
SLIDE 24

Example 1: linear regression

  • Find f_w(x) = w^T x that minimizes L̂(f_w) = (1/n) Σ_{t=1}^n (w^T x_t - y_t)²
  • l(w, x_t, y_t) = (1/n) (w^T x_t - y_t)²
  • w^{t+1} = w^t - η_t ∇l(w^t, x_t, y_t) = w^t - (2η_t/n) ((w^t)^T x_t - y_t) x_t

SLIDE 25

Example 2: logistic regression

  • Find π‘₯ that minimizes

ΰ·  𝑀 π‘₯ = βˆ’ 1 π‘œ ෍

𝑧𝑒=1

log𝜏(π‘₯π‘ˆπ‘¦π‘’) βˆ’ 1 π‘œ ෍

𝑧𝑒=βˆ’1

log[1 βˆ’ 𝜏 π‘₯π‘ˆπ‘¦π‘’ ] ΰ·  𝑀 π‘₯ = βˆ’ 1 π‘œ ෍

𝑒

log𝜏(𝑧𝑒π‘₯π‘ˆπ‘¦π‘’) π‘š π‘₯, 𝑦𝑒, 𝑧𝑒 = βˆ’1 π‘œ log𝜏(𝑧𝑒π‘₯π‘ˆπ‘¦π‘’)

SLIDE 26

Example 2: logistic regression

  • Find π‘₯ that minimizes

π‘š π‘₯, 𝑦𝑒, 𝑧𝑒 = βˆ’1 π‘œ log𝜏(𝑧𝑒π‘₯π‘ˆπ‘¦π‘’) π‘₯𝑒+1 = π‘₯𝑒 βˆ’ πœƒπ‘’π›Όπ‘š π‘₯𝑒, 𝑦𝑒, 𝑧𝑒 = π‘₯𝑒 +

πœƒπ‘’ π‘œ 𝜏 𝑏 1βˆ’πœ 𝑏 𝜏(𝑏)

𝑧𝑒𝑦𝑒 Where 𝑏 = 𝑧𝑒π‘₯𝑒

π‘ˆπ‘¦π‘’

SLIDE 27

Example 3: Perceptron

  • Hypothesis: y = sign(w^T x)
  • Define hinge loss

l(w, x_t, y_t) = -y_t w^T x_t · 𝕀[mistake on x_t],   L̂(w) = -Σ_t y_t w^T x_t · 𝕀[mistake on x_t]

w^{t+1} = w^t - η_t ∇l(w^t, x_t, y_t) = w^t + η_t y_t x_t · 𝕀[mistake on x_t]

SLIDE 28

Example 3: Perceptron

  • Hypothesis: y = sign(w^T x)

w^{t+1} = w^t - η_t ∇l(w^t, x_t, y_t) = w^t + η_t y_t x_t · 𝕀[mistake on x_t]

  • Set η_t = 1. If mistake on a positive example:

w^{t+1} = w^t + y_t x_t = w^t + x_t

  • If mistake on a negative example:

w^{t+1} = w^t + y_t x_t = w^t - x_t

SLIDE 29

Pros & Cons

Pros:

  • Widely applicable
  • Easy to implement in most cases
  • Guarantees for many losses
  • Good performance: error/running time/memory etc.

Cons:

  • No guarantees for non-convex opt (e.g., those in deep learning)
  • Hyper-parameters: initialization, learning rate
SLIDE 30

Mini-batch

  • Instead of one data point, work with a small batch of b points

(x_{tb+1}, y_{tb+1}), …, (x_{tb+b}, y_{tb+b})

  • Update rule

θ^{t+1} = θ^t - η_t ∇ (1/b) Σ_{1 ≤ i ≤ b} l(θ^t, x_{tb+i}, y_{tb+i})

  • Other variants: variance reduction etc.
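A minimal sketch of the mini-batch update (the batch size b, the per-example gradient `grad_l`, and dropping an incomplete final batch are illustrative choices):

```python
def minibatch_sgd(grad_l, theta0, X, y, b=32, eta=0.1):
    """theta^{t+1} = theta^t - eta_t * (1/b) * sum_i grad l(theta^t, x_{tb+i}, y_{tb+i})."""
    theta = theta0
    for start in range(0, len(X) - b + 1, b):   # one update per batch of b points
        g = sum(grad_l(theta, x_i, y_i)
                for x_i, y_i in zip(X[start:start + b], y[start:start + b])) / b
        theta = theta - eta * g
    return theta
```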
SLIDE 31

Homework

SLIDE 32

Homework 1

  • Assignment online
  • Course website:

http://www.cs.princeton.edu/courses/archive/spring16/cos495/

  • Piazza: https://piazza.com/princeton/spring2016/cos495
  • Due date: Feb 17th (one week)
  • Submission
  • Math part: hand-written/print; submit to TA (Office: EE, C319B)
  • Coding part: in Matlab/Python; submit the .m/.py file on Piazza
SLIDE 33

Homework 1

  • Grading policy: every late day reduces the attainable credit for the exercise by 10%.

  • Collaboration:
  • Discussion on the problem sets is allowed
  • Students are expected to finish the homework on their own
  • The people you discussed with on assignments should be clearly acknowledged: before the solution to each question, list all people that you discussed that particular question with.