

SLIDE 1

Perceptrons

“From the heights of error, To the valleys of Truth”

Piyush Kumar
Advanced Computational Geometry

SLIDE 2

Reading Material

Duda/Hart/Stork: 5.4/5.5/9.6.8
Any neural network book (Haykin, Anderson…)
Look at papers of related people:

  • Santosh Vempala
  • A. Blum
  • J. Dunagan
  • F. Rosenblatt
  • T. Bylander
SLIDE 3

Introduction

Supervised Learning

Input pattern → output pattern; compare with the desired output and correct if necessary.

SLIDE 4

Linear discriminant functions

Definition

It is a function that is a linear combination of the components of $x$:

$$g(x) = w^t x + w_0 \qquad (1)$$

where $w$ is the weight vector and $w_0$ the bias.

  • A two-category classifier with a discriminant function of the form (1) uses the following rule: decide $\omega_1$ if $g(x) > 0$ and $\omega_2$ if $g(x) < 0$, i.e., decide $\omega_1$ if $w^t x > -w_0$ and $\omega_2$ otherwise.
  • If $g(x) = 0$, $x$ is assigned to either class.
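A minimal MATLAB sketch of rule (1); the function and variable names are mine, not from the slides:

    % Decision rule (1): a sketch, assuming w and x are column vectors.
    function c = classify(w, w0, x)
        g = w' * x + w0;   % g(x) = w^t x + w0
        if g > 0
            c = 1;         % decide omega_1
        else
            c = 2;         % decide omega_2 (g(x) = 0 may go to either class)
        end
    end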

SLIDE 5

LDFs

The equation $g(x) = 0$ defines the decision surface that separates points assigned to the category $\omega_1$ from points assigned to the category $\omega_2$.

When $g(x)$ is linear, the decision surface is a hyperplane.

SLIDE 6

Classification using LDFs

Two main approaches:

Fisher's Linear Discriminant
Project data onto a line with 'good' discrimination; then classify on the real line.

Linear Discrimination in d dimensions
Classify data using suitable hyperplanes. (We'll use perceptrons to construct these.)

SLIDE 7

Perceptron: The first NN

Proposed by Frank Rosenblatt in 1957.

Neural net researchers accuse Rosenblatt of promising 'too much' ☺

Numerous variants; we'll cover the one that's most geometric to explain ☺

One of the simplest neural networks.

SLIDE 8

Perceptrons: A Picture

[Figure: a single perceptron unit. Inputs $x_0 = -1, x_1, x_2, x_3, \ldots, x_n$ enter through weights $w_0, w_1, w_2, w_3, \ldots, w_n$; the thresholded sum is the output $y$, which is compared with the desired label and the weights are corrected.]

$$y = \begin{cases} +1 & \text{if } \sum_{i=0}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}$$

SLIDE 9

Where is the geometry?

[Figure: points of Class 1 (+1) and Class 2 (−1) separated by a hyperplane. Is this hyperplane unique?]

SLIDE 10

Assumption

Let's assume for this talk that the red and green points in 'feature space' are separable using a hyperplane.

(Two-category, linearly separable case.)

SLIDE 11

What's the problem?

Why not just take the convex hull of one of the sets and find one of the 'right' facets?

Because it's too much work to do in d dimensions.

What else can we do?

Linear programming == Perceptrons
Quadratic programming == SVMs

SLIDE 12

Perceptrons

Aka learning half-spaces.

Can be solved in polynomial time using interior-point (IP) algorithms.

Can also be solved using a simple and elegant greedy algorithm (which I present today).

SLIDE 13

In Math notation

N samples: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x \in \mathbb{R}^d$ and $y = \pm 1$ are labels for the data.

Can we find a hyperplane $w \cdot x = 0$ that separates the two classes (labeled by $y$)? I.e.

$w \cdot x_j > 0$ for all $j$ such that $y_j = +1$
$w \cdot x_j < 0$ for all $j$ such that $y_j = -1$
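In MATLAB, this separation test is one line (the helper name is mine): $w$ separates the data exactly when every $y_j (w \cdot x_j)$ is positive.

    % A sketch, my names: X is n-by-d (one point per row), y is n-by-1
    % with entries +1/-1, w is a 1-by-d row vector.
    function ok = separates(w, X, y)
        ok = all(y .* (X * w') > 0);   % y_j * (w . x_j) > 0 for every j
    end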

SLIDE 14

Further Assumption 1

Let's assume that the hyperplane that we are looking for passes through the origin.

(Which we will relax later!)

SLIDE 15

Further Assumption 2

Let's assume that we are looking for a halfspace that contains a set of points.

(Relax now!! ☺)

SLIDE 16

Let's relax FA 1 now

"Homogenize" the coordinates by adding a new coordinate to the input. Think of it as moving the whole set of red and blue points one dimension higher: from 2D to 3D, it is just the x-y plane shifted to z = 1. This takes care of the "bias", i.e., our assumption that the halfspace can pass through the origin. (A code sketch covering FA 1-FA 3 follows Further Assumption 3 below.)

SLIDE 17

Further Assumption 3

Assume all points lie on a unit sphere! If they do not after applying the transformations for FA 1 and FA 2, make them so: scaling a point does not change which side of a hyperplane through the origin it lies on.

(Relax now! ☺)
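A minimal MATLAB sketch of all three transformations together; the helper name and layout are mine (the elementwise expansion needs MATLAB R2016b or later):

    % FA 1-FA 3 in one step: X is n-by-d, y is n-by-1 with entries +1/-1.
    function Z = preprocess(X, y)
        n = size(X, 1);
        Z = [X, ones(n, 1)];            % FA 1: homogenize (append a coordinate)
        Z = Z .* y;                     % FA 2: flip class -1, so we want w.z > 0 for all z
        Z = Z ./ sqrt(sum(Z.^2, 2));    % FA 3: scale every point onto the unit sphere
    end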

SLIDE 18

Restatement 1

Given: a set of points on a sphere in d dimensions, such that all of them lie in a halfspace.
Output: find one such halfspace.

Note: this is the LP feasibility problem, and if you can solve LP feasibility, you can solve any general LP!!

(Take Estie's class if you want to know why. ☺)

SLIDE 19

Restatement 2

Given a convex body (in V-form), find a halfspace passing through the origin that contains it.

SLIDE 20

Support Vector Machines

A small break from perceptrons

SLIDE 21

Support Vector Machines

  • Linear learning machines, like perceptrons.
  • Map non-linearly to a higher dimension to overcome the linearity constraint.
  • Select between hyperplanes: use the margin as a test. (This is what perceptrons don't do.)

From learning theory, maximum margin is good.

SLIDE 22

SVMs

[Figure: the margin, i.e., the gap between the separating hyperplane and the nearest points of each class.]

SLIDE 23

Another Reformulation

Unlike perceptrons, SVMs have a unique solution, but they are harder to solve. (QP)

SLIDE 24

Support Vector Machines

There are very simple algorithms to solve SVMs (as simple as perceptrons). (If there is enough demand, I can try to cover one. And if my job hunting lets me ;))

SLIDE 25

Back to perceptrons

SLIDE 26

Perceptrons

So how do we solve the LP?

  • Simplex
  • Ellipsoid
  • Interior-point methods
  • Perceptrons == gradient descent

So we could solve the classification problem using any LP method.

SLIDE 27

Why learn Perceptrons?

You can write an LP solver in 5 minutes! A very slight modification gives you a polynomial-time guarantee (using smoothed analysis)!

SLIDE 28

Why learn Perceptrons?

Multiple perceptrons clubbed together can be used to learn almost anything in practice. (This is the idea behind multi-layer neural networks.)

Perceptrons have a finite capacity and so cannot represent all classifications; the amount of training data required needs to be larger than the capacity. We'll talk about capacity when we introduce VC-dimension.

From learning theory, limited capacity is good.

SLIDE 29

Another twist: Linearization

If the data is separable with, say, a sphere, how would you use a perceptron to separate it? (Ellipsoids?)

SLIDE 30

Linearization

Lift the points to a paraboloid in one higher dimension. For instance, if the data is in 2D: $(x, y) \to (x, y, x^2 + y^2)$.

(Delaunay!??)
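A one-line MATLAB sketch of this lifting map (the function name is mine). After lifting, a circle in the plane becomes the cut of the paraboloid by a hyperplane, so a spherical separator turns into a linear one:

    % Lift 2D points onto the paraboloid z = x^2 + y^2.
    function L = lift(X)            % X: n-by-2, one (x, y) point per row
        L = [X, sum(X.^2, 2)];      % (x, y) -> (x, y, x^2 + y^2)
    end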

SLIDE 31

The kernel Matrix

Another trick that the ML community uses for linearization is to use a function that redefines distances between points.

Example: $K(x, z) = e^{-\|x - z\|^2 / 2\sigma^2}$

There are even papers on how to learn kernels from data!
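The example above is the Gaussian (RBF) kernel; a direct MATLAB sketch, where sigma is the bandwidth parameter:

    % Gaussian kernel between two points x and z (sigma > 0).
    function k = gaussian_kernel(x, z, sigma)
        k = exp(-norm(x - z)^2 / (2 * sigma^2));
    end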

SLIDE 32

Perceptron Smoothed Complexity

Let $L$ be a linear program and let $L'$ be the same linear program under a Gaussian perturbation of variance $\sigma^2$, where $\sigma^2 \le 1/2d$. For any $\delta$, with probability at least $1 - \delta$, either

  • the perceptron finds a feasible solution in poly($d$, $m$, $1/\sigma$, $1/\delta$), or
  • $L'$ is infeasible or unbounded.

SLIDE 33

The Algorithm

In one line

SLIDE 34

The 1 Line LP Solver!

Start with a random vector $w$, and if a point $x_k$ is misclassified do:

$$w_{k+1} = w_k + x_k$$

(until done). One of the most beautiful LP solvers I've ever come across…

SLIDE 35

A better description

    Initialize w = 0, i = 0
    do
        i = (i + 1) mod n
        if x_i is misclassified by w then w = w + x_i
    until all patterns are classified
    Return w

SLIDE 36

An even better description

That's the entire code! Written in 10 mins.

    function w = perceptron(r, b)
    r = [r ones(size(r, 1), 1)];        % Homogenize
    b = -[b ones(size(b, 1), 1)];       % Homogenize and flip
    data = [r; b];                      % Make one point set
    s = size(data);                     % Size of data
    w = zeros(1, s(1, 2));              % Initialize zero vector
    is_error = true;
    while is_error
        is_error = false;
        for k = 1:s(1, 1)
            if dot(w, data(k, :)) <= 0  % Misclassified (or on the boundary)?
                w = w + data(k, :);     % Perceptron update
                is_error = true;
            end
        end
    end

And it can solve any LP!
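A hypothetical usage sketch; the points are mine, chosen to be linearly separable:

    % Two separable 2D point sets. On return, every red point x satisfies
    % dot(w, [x 1]) > 0 and every blue point satisfies dot(w, [x 1]) < 0.
    r = [1 1; 2 1; 1 2];        % red points
    b = [-1 -1; -2 -1; -1 -2];  % blue points
    w = perceptron(r, b)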

SLIDE 37

An output

SLIDE 38

In other words

At each step, the algorithm picks any misclassified vector $x$ (one on the wrong side of the halfspace) and brings the normal vector $w$ into closer agreement with that point.

SLIDE 39

The math behind…

Still: why the hell does it work?

Back to the most advanced presentation tools available on earth: the blackboard ☺ Wait (lemme try the whiteboard).

The Convergence Proof
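For reference, a standard sketch of the convergence bound (Novikoff's theorem) under the assumptions above; the margin $\gamma$ and the unit vector $w^*$ are my notation, not the slides'. Assume every preprocessed point satisfies $\|x_k\| \le 1$ (FA 3) and some unit vector $w^*$ has $w^* \cdot x_k \ge \gamma > 0$ for all $k$ (separability). Start from $w_0 = 0$ and update $w_{t+1} = w_t + x_k$ only when $w_t \cdot x_k \le 0$. Then after $t$ updates:

$$w_t \cdot w^* \ge t\gamma \quad \text{and} \quad \|w_t\|^2 \le \|w_{t-1}\|^2 + 2\, w_{t-1} \cdot x_k + \|x_k\|^2 \le \|w_{t-1}\|^2 + 1 \le t,$$

so $t\gamma \le w_t \cdot w^* \le \|w_t\| \le \sqrt{t}$, and the algorithm stops after at most $1/\gamma^2$ updates.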

SLIDE 40

Proof

SLIDE 41

Proof

SLIDE 42

Proof

SLIDE 43

Proof

SLIDE 44

Proof

SLIDE 45

Proof

SLIDE 46

That’s all folks ☺