Machine Learning
The Perceptron Algorithm
Some slides based on lectures from Dan Roth, Avrim Blum and others
Outline:
- The Perceptron Algorithm
- Variants of Perceptron
- Perceptron Mistake Bound
[Figure: a single perceptron unit. Inputs x_1, …, x_8 are multiplied by weights w_1, …, w_8, a constant input 1 carries the bias weight b, and the sign function sgn is applied to the weighted sum to produce the prediction.]
In higher dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces
We only care about the sign of the output, not the magnitude.
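As a minimal sketch of prediction (assuming NumPy; the function name predict is mine, not from the slides):

```python
import numpy as np

def predict(w, x):
    """Classify x by which side of the hyperplane {z : w.z = 0} it falls on.

    The convention for the boundary case w.x == 0 is arbitrary; here it
    counts as a positive prediction.
    """
    return 1 if np.dot(w, x) >= 0 else -1
```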
Remember: Prediction = sgn(wᵀx). There is typically a bias term as well (wᵀx + b), but the bias may be treated as a constant feature and folded into w.
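For instance (a NumPy sketch with made-up numbers), folding the bias in just means appending a constant feature 1 to each example and appending b to the weight vector:

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])   # original features
w = np.array([0.1, 0.4, -0.2])   # original weights
b = 0.7                          # bias term

x_aug = np.append(x, 1.0)        # add the constant feature
w_aug = np.append(w, b)          # fold the bias into w

# Both forms compute the same score, so sgn agrees too
assert np.isclose(np.dot(w, x) + b, np.dot(w_aug, x_aug))
```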
Footnote: for some algorithms it is mathematically easier to represent False as -1, and at other times as 0. For the Perceptron algorithm, treat -1 as false and +1 as true.
Mistake on a positive example: w_{t+1} ← w_t + r x_i
Mistake on a negative example: w_{t+1} ← w_t − r x_i

r is the learning rate, a small positive number less than 1. The weights are updated only on an error: the Perceptron is a mistake-driven algorithm. A mistake of either kind can be written compactly as y_i w_tᵀ x_i ≤ 0. This is the simplest version of the algorithm; we will see more robust versions shortly.

Why does the update help? Suppose we make a mistake on a positive example, so w_tᵀ x ≤ 0. After the update, w_{t+1}ᵀ x = (w_t + r x)ᵀ x = w_tᵀ x + r xᵀ x ≥ w_tᵀ x, since r xᵀ x ≥ 0: the score moves toward the correct (positive) side.
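A quick numeric check of this claim, with numbers chosen purely for illustration and r = 1:

```latex
\mathbf{w}_t = (1,\,-1), \qquad \mathbf{x} = (1,\,2), \qquad y = +1
\\
\mathbf{w}_t^\top \mathbf{x} = 1 - 2 = -1 \le 0 \quad \text{(a mistake on a positive example)}
\\
\mathbf{w}_{t+1} = \mathbf{w}_t + r\,\mathbf{x} = (2,\,1)
\\
\mathbf{w}_{t+1}^\top \mathbf{x} = 2 + 2 = 4 \;>\; \mathbf{w}_t^\top \mathbf{x}
```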
[Figure, built up over several slides: geometric intuition for a mistake on a positive example. We predict with w_old and misclassify the example (x, +1). The update w ← w + y x adds the vector y x to w_old, producing w_new, which points closer to x and now classifies it correctly.]
[Figure, built up over several slides: the same intuition for a mistake on a negative example. We predict with w_old and misclassify the example (x, -1). The update w ← w + y x (here y = -1, so x is subtracted) produces w_new, which points away from x and now classifies it correctly.]
T, the number of passes over the data, is a hyper-parameter of the algorithm. The condition y_i wᵀ x_i ≤ 0 is another way of writing that the prediction on example i is an error.
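Putting the pieces together, here is a minimal sketch of the full algorithm in NumPy (function and variable names are my own, not from the slides). Note how the single update w + r * y[i] * X[i] covers both the mistake-on-positive and mistake-on-negative rules, because y[i] is +1 or -1:

```python
import numpy as np

def perceptron(X, y, T=10, r=1.0):
    """Mistake-driven Perceptron training.

    X: (n, d) array of examples, with the bias folded in as a constant feature.
    y: (n,) array of labels in {-1, +1}.
    T: number of passes over the data (a hyper-parameter).
    r: learning rate.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        for i in range(n):
            # y_i * w^T x_i <= 0 is another way of writing "the prediction is an error"
            if y[i] * np.dot(w, X[i]) <= 0:
                w = w + r * y[i] * X[i]
    return w
```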
Voted perceptron:
- Remember every weight vector in your sequence of updates.
- At final prediction time, each weight vector gets to vote on the label. The number of votes it gets is the number of iterations it survived before being updated.
- Comes with strong theoretical guarantees about generalization, but is impractical because of storage issues.

Averaged perceptron:
- Instead of using all the weight vectors, use the average weight vector (i.e., longer-surviving weight vectors get more say).
- A more practical alternative, and widely used.
This is the simplest version of the averaged perceptron. There are some easy programming tricks to make sure that a is also updated.
If you want to use the Perceptron algorithm, use averaging.
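A minimal sketch of the averaged version (again, names are mine). This is the naive form, where the running sum a is updated after every example; the programming tricks mentioned above avoid touching a at each step, but the naive form is the easiest to read:

```python
import numpy as np

def averaged_perceptron(X, y, T=10, r=1.0):
    """Averaged Perceptron: return the average of every intermediate weight
    vector, so weight vectors that survive more steps get more say."""
    n, d = X.shape
    w = np.zeros(d)
    a = np.zeros(d)                      # running sum of all weight vectors
    for _ in range(T):
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:
                w = w + r * y[i] * X[i]
            a += w                       # naive: accumulate after every example
    return a / (T * n)                   # averaged weight vector
```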
The New Yorker, December 6, 1958, p. 44; The New York Times, July 8, 1958.
The IBM 704 computer