SLIDE 1

CS 4100: Artificial Intelligence

Perceptrons and Logistic Regression

Jan-Willem van de Meent, Northeastern University

[These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Linear Classifiers

Feature Vectors

Example (spam filtering): the email "Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just …" maps to the feature vector

  # free : 2
  YOUR_NAME : 0
  MISSPELLED : 2
  FROM_FRIEND : 0
  ...

and is classified as SPAM.

Example (digit recognition): a handwritten digit image maps to the feature vector

  PIXEL-7,12 : 1
  PIXEL-7,13 : 0
  ...
  NUM_LOOPS : 1
  ...

and is classified as "2".

Some (Simplified) Biology

  • Very loose inspiration: human neurons

SLIDE 2

Linear Classifiers

  • Inputs are feature values
  • Each feature has a weight
  • Sum is the activation:

    activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)

  • If the activation is:
      • Positive, output +1
      • Negative, output -1

(Figure: features f1, f2, f3 are multiplied by weights w1, w2, w3, summed at a node Σ, and tested against >0.)
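The weighted sum and threshold above can be sketched in a few lines of Python; a minimal illustration, not the course's code (the dictionary-based feature representation is an assumption):

```python
def activation(weights, features):
    """The activation is the dot product of the weight and feature vectors."""
    return sum(weights.get(name, 0) * value for name, value in features.items())

def classify(weights, features):
    """Output +1 if the activation is positive, -1 otherwise."""
    return 1 if activation(weights, features) > 0 else -1

# Weight and feature vectors from the spam example on the slides
w = {"# free": 4, "YOUR_NAME": -1, "MISSPELLED": 1, "FROM_FRIEND": -3}
f = {"# free": 2, "YOUR_NAME": 0, "MISSPELLED": 2, "FROM_FRIEND": 0}
print(activation(w, f))  # 4*2 + (-1)*0 + 1*2 + (-3)*0 = 10
print(classify(w, f))    # 10 > 0, so +1 (SPAM)
```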

Weights

  • Binary case: compare features to a weight vector
  • Learning: figure out the weight vector from examples
  • Dot product positive means the positive class

(Figure: a weight vector [# free: 4, YOUR_NAME: -1, MISSPELLED: 1, FROM_FRIEND: -3, ...] compared against feature vectors such as [# free: 2, YOUR_NAME: 0, MISSPELLED: 2, FROM_FRIEND: 0, ...] and [# free: 0, YOUR_NAME: 1, MISSPELLED: 1, FROM_FRIEND: 1, ...].)

Decision Rules

Binary Decision Rule

  • In the space of feature vectors:
      • Examples are points
      • Any weight vector is a hyperplane
      • One side corresponds to Y = +1
      • The other side corresponds to Y = -1

Example: with the weight vector [BIAS: -3, free: 4, money: 2, ...], the email "free money" has features [BIAS: 1, free: 1, money: 1], so the activation is -3 + 4 + 2 = 3 > 0 and the prediction is +1 = SPAM (the other side is -1 = HAM).
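This worked decision can be checked in code; a minimal sketch (the whitespace tokenizer and dictionary representation are assumptions, not the course's code):

```python
# Weight vector from the slide
w = {"BIAS": -3, "free": 4, "money": 2}

def features(text):
    """Bag-of-words features plus an always-on BIAS feature."""
    f = {"BIAS": 1}
    for word in text.split():
        f[word] = f.get(word, 0) + 1
    return f

def predict(weights, f):
    score = sum(weights.get(k, 0) * v for k, v in f.items())
    return "SPAM" if score > 0 else "HAM"

print(predict(w, features("free money")))  # -3 + 4 + 2 = 3 > 0 -> SPAM
```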
SLIDE 3

Weight Updates

Learning: Binary Perceptron

  • Start with weights = 0
  • For each training instance:
      • Classify with current weights
      • If correct (i.e., y = y*): no change!
      • If wrong: adjust the weight vector

Learning: Binary Perceptron

  • Start with weights = 0
  • For each training instance:
      • Classify with current weights
      • If correct (i.e., y = y*): no change!
      • If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1:

        w ← w + y* · f(x)

Examples: Perceptron

  • Separable Case
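The update rule above can be sketched as a training loop; a minimal illustration (the two-example dataset is invented):

```python
def perceptron_train(data, epochs=10):
    """Binary perceptron. data: list of (feature_dict, label) pairs, label in {+1, -1}."""
    w = {}
    for _ in range(epochs):
        for f, y_star in data:
            score = sum(w.get(k, 0) * v for k, v in f.items())
            y = 1 if score > 0 else -1        # classify with current weights
            if y != y_star:                   # if wrong: w <- w + y* . f(x)
                for k, v in f.items():
                    w[k] = w.get(k, 0) + y_star * v
    return w

# A tiny separable dataset (invented for illustration)
data = [({"BIAS": 1, "free": 1, "money": 1}, 1),
        ({"BIAS": 1, "friend": 1}, -1)]
w = perceptron_train(data)
print(w)  # weights that classify both examples correctly
```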
SLIDE 4

Multiclass Decision Rule

  • If we have multiple classes:
      • A weight vector for each class: w_y
      • Score (activation) of a class y: w_y · f(x)
      • Prediction: the class with the highest score wins, y = argmax_y w_y · f(x)
  • Binary = multiclass where the negative class has weight zero

Learning: Multiclass Perceptron

  • Start with all weights = 0
  • Pick training examples one by one
  • Predict with current weights
  • If correct: no change!
  • If wrong: lower the score of the wrong answer, raise the score of the right answer:

    w_y ← w_y − f(x)      (wrong answer y)
    w_y* ← w_y* + f(x)    (right answer y*)

Example: Multiclass Perceptron

Question: what will the weights w be for each class after 3 updates, given

  y1 = "politics", x1 = "win the vote"
  y2 = "politics", x2 = "win the election"
  y3 = "sports",   x3 = "win the game"

Update 1: the initial weights are

  w_politics = [BIAS: 0, win: 0, game: 0, vote: 0, the: 0, ...]
  w_sports   = [BIAS: 1, win: 0, game: 0, vote: 0, the: 0, ...]
  w_tech     = [BIAS: 0, win: 0, game: 0, vote: 0, the: 0, ...]

For x1 = "win the vote", f(x1) = [BIAS: 1, win: 1, the: 1, vote: 1], so the scores are
w_politics · f(x1) = 0, w_sports · f(x1) = 1, w_tech · f(x1) = 0.
Prediction: "sports" (wrong; the answer is "politics"), so
w_politics ← w_politics + f(x1) and w_sports ← w_sports − f(x1).

SLIDE 5

Example: Multiclass Perceptron (continued)

Update 2: the weights are now

  w_politics = [BIAS: 1, win: 1, game: 0, vote: 1, the: 1, ...]
  w_sports   = [BIAS: 0, win: -1, game: 0, vote: -1, the: -1, ...]
  w_tech     = [BIAS: 0, win: 0, game: 0, vote: 0, the: 0, ...]

For x2 = "win the election", f(x2) = [BIAS: 1, win: 1, the: 1], so the scores are
w_politics · f(x2) = 3, w_sports · f(x2) = -2, w_tech · f(x2) = 0.
Prediction: "politics" (correct), so no update.

Example: Multiclass Perceptron (continued)

Update 3: with the same weights, for x3 = "win the game", f(x3) = [BIAS: 1, win: 1, the: 1, game: 1], the scores are
w_politics · f(x3) = 3, w_sports · f(x3) = -2, w_tech · f(x3) = 0.
Prediction: "politics" (wrong; the answer is "sports"), so
w_sports ← w_sports + f(x3) and w_politics ← w_politics − f(x3).

Example: Multiclass Perceptron (conclusion)

The final weights after the three training examples are

  w_politics = [BIAS: 0, win: 0, game: -1, vote: 1, the: 0, ...]
  w_sports   = [BIAS: 1, win: 0, game: 1, vote: -1, the: 0, ...]
  w_tech     = [BIAS: 0, win: 0, game: 0, vote: 0, the: 0, ...]
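The three updates can be replayed in code; a minimal sketch that follows the slide's setup, where only w_sports starts with a nonzero BIAS weight:

```python
def features(text):
    """Bag-of-words features plus an always-on BIAS feature."""
    f = {"BIAS": 1}
    for word in text.split():
        f[word] = f.get(word, 0) + 1
    return f

def score(weights, f):
    return sum(weights.get(k, 0) * v for k, v in f.items())

def add(weights, f, sign):
    for k, v in f.items():
        weights[k] = weights.get(k, 0) + sign * v

# Initial weights as on the slide: only w_sports starts with BIAS = 1
w = {"politics": {}, "sports": {"BIAS": 1}, "tech": {}}

data = [("win the vote", "politics"),
        ("win the election", "politics"),
        ("win the game", "sports")]

for text, y_star in data:
    f = features(text)
    y = max(w, key=lambda c: score(w[c], f))   # highest score wins
    if y != y_star:                            # lower wrong answer, raise right answer
        add(w[y_star], f, +1)
        add(w[y], f, -1)

print(w["sports"])  # {'BIAS': 1, 'win': 0, 'the': 0, 'vote': -1, 'game': 1}
```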

Properties of Perceptrons

  • Separability: true if there exist weights w that get the training set perfectly correct
  • Convergence: if the training data are separable, the perceptron will eventually converge (binary case)
  • Mistake bound: the maximum number of mistakes (updates) in the binary case is related to the number of features k and the margin δ, the degree of separability: mistakes < k / δ²

(Figure: a separable dataset with margin δ, and a non-separable one.)

SLIDE 6

Problems with the Perceptron

  • Noise: if the data isn't separable, weights might thrash
      • Averaging weight vectors over time can help (averaged perceptron)
  • Mediocre generalization: finds a "barely" separating solution
  • Overtraining: test / held-out accuracy usually rises, then falls
      • Overtraining is a kind of overfitting

Improving the Perceptron

Non-Separable Case: Deterministic Decision

Even the best linear boundary makes at least one mistake.

Non-Separable Case: Probabilistic Decision

(Figure: instead of a hard +1/-1 boundary, points get probability pairs that shift smoothly across the boundary: 0.9 | 0.1, 0.7 | 0.3, 0.5 | 0.5, 0.3 | 0.7, 0.1 | 0.9.)

SLIDE 7

How to get probabilistic decisions?

  • Perceptron scoring: z = w · f(x)
  • If z is very positive → want probability going to 1
  • If z is very negative → want probability going to 0
  • Sigmoid function:

    φ(z) = 1 / (1 + e^(−z))
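A quick pure-Python check that the sigmoid behaves as the bullets require:

```python
import math

def sigmoid(z):
    """phi(z) = 1 / (1 + e^(-z)): squashes a score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5: right on the decision boundary
print(sigmoid(10))   # very positive score -> probability near 1
print(sigmoid(-10))  # very negative score -> probability near 0
```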

Best w?

  • Maximum likelihood estimation:

    max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

    with:

    P(y^(i) = +1 | x^(i); w) = 1 / (1 + e^(−w · f(x^(i))))
    P(y^(i) = −1 | x^(i); w) = 1 − 1 / (1 + e^(−w · f(x^(i))))

This is called Logistic Regression.
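The objective can be evaluated directly for a candidate w; a minimal sketch with an invented two-example dataset (the weights and data are illustrative only):

```python
import math

def log_likelihood(w, data):
    """ll(w) = sum_i log P(y_i | x_i; w), labels y in {+1, -1}."""
    ll = 0.0
    for f, y in data:
        z = sum(w.get(k, 0) * v for k, v in f.items())
        p_plus = 1.0 / (1.0 + math.exp(-z))          # P(y = +1 | x; w)
        ll += math.log(p_plus if y == 1 else 1.0 - p_plus)
    return ll

# Invented data: one positive, one negative example
data = [({"BIAS": 1, "free": 2}, 1), ({"BIAS": 1}, -1)]
print(log_likelihood({"BIAS": -1, "free": 1}, data))  # always <= 0; larger is better
```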

Separable Case: Deterministic Decision – Many Options

Separable Case: Probabilistic Decision – Clear Preference

(Figure: several boundaries separate the data equally well deterministically, but the probabilistic view — with probability pairs such as 0.5 | 0.5, 0.3 | 0.7, 0.7 | 0.3 near each boundary — clearly prefers the one that assigns confident probabilities to the training points.)

SLIDE 8

Multiclass Logistic Regression

  • Recall the perceptron:
      • A weight vector for each class: w_y
      • Score (activation) of a class y: w_y · f(x)
      • Prediction: the class with the highest score wins
  • How to turn scores into probabilities? Apply the softmax:

    z1, z2, z3 → e^{z1} / (e^{z1} + e^{z2} + e^{z3}), e^{z2} / (e^{z1} + e^{z2} + e^{z3}), e^{z3} / (e^{z1} + e^{z2} + e^{z3})

    (original activations → softmax activations)
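The softmax mapping can be sketched directly (the input scores are illustrative):

```python
import math

def softmax(zs):
    """Turn a list of scores into probabilities: e^{z_i} / sum_j e^{z_j}."""
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([3, -2, 0])
print(probs)       # the largest score gets the largest probability
print(sum(probs))  # 1.0, up to floating-point rounding
```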

Best w?

  • Maximum likelihood estimation:

    max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)

    with:

    P(y^(i) | x^(i); w) = e^{w_y(i) · f(x^(i))} / Σ_y e^{w_y · f(x^(i))}

This is called Multi-Class Logistic Regression.
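Plugging the final weights from the multiclass perceptron example into this formula turns the class scores into probabilities; a minimal sketch (zero-valued weights are omitted from the dictionaries for brevity):

```python
import math

def class_probabilities(w, f):
    """P(y | x; w) = e^{w_y . f(x)} / sum_y' e^{w_y' . f(x)} (softmax over scores)."""
    scores = {c: sum(wc.get(k, 0) * v for k, v in f.items()) for c, wc in w.items()}
    total = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / total for c, s in scores.items()}

# Final weights from the perceptron example (zeros omitted); features of "win the game"
w = {"politics": {"game": -1, "vote": 1},
     "sports": {"BIAS": 1, "vote": -1, "game": 1},
     "tech": {}}
f = {"BIAS": 1, "win": 1, "the": 1, "game": 1}
p = class_probabilities(w, f)
print(p)  # "sports" now gets the highest probability
```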

Next Lecture

  • Optimization
  • i.e., how do we solve:

    max_w ll(w) = max_w Σ_i log P(y^(i) | x^(i); w)