Aykut Erdem // Hacettepe University // Fall 2019
Lecture 10:
Linear Discriminant Functions Perceptron
Illustration: Frank Rosenblatt's Perceptron
Assignment 2 is out!
− It is due November 22 (i.e. in 2 weeks)
− Implement a Naive Bayes classifier for fake news detection
image credit: Frederick Burr Opper
Last time… Logistic Regression
Assumes the following functional form for P(Y|X): the logistic function applied to a linear function of the data,

P(Y = 1 | X) = 1 / (1 + exp(−(w₀ + Σⱼ wⱼXⱼ)))

Logistic function (or Sigmoid): σ(z) = 1 / (1 + e^(−z))

Features can be discrete or continuous!

slide by Aarti Singh & Barnabás Póczos
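As a quick concrete sketch of this functional form (the weights below are arbitrary placeholders, not fitted to any data):

```python
import math

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def p_y_given_x(x, w, w0):
    # P(Y = 1 | X = x): sigmoid applied to a linear function of x
    z = w0 + sum(wj * xj for wj, xj in zip(w, x))
    return sigmoid(z)

# Arbitrary illustrative weights; features may be discrete or continuous
p = p_y_given_x([1.0, 2.0], w=[0.5, -0.25], w0=0.1)
```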
Last time… Logistic Regression vs. Gaussian Naïve Bayes

− decision rule is a hyperplane
− no closed-form solution
− concave → global optimum with gradient ascent
− representationally equivalent to LR
− solution differs because of the objective (loss) function
− NB: features independent given class → assumption on P(X|Y)
− LR: functional form of P(Y|X), no assumption on P(X|Y)
− GNB (usually) needs less data
− LR (usually) gets to better solutions in the limit

slide by Aarti Singh & Barnabás Póczos
Linear Discriminant Functions

y(x) = wᵀx + w₀

where w is called the weight vector, and w₀ is a bias. The classifier is

C(x) = sign(wᵀx + w₀)

where the step function sign(·) is defined as

sign(a) = { +1 if a > 0; −1 if a < 0 }
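A minimal sketch of such a classifier (the weight vector and bias here are arbitrary placeholders; ties at y(x) = 0 are broken toward −1, since sign(0) is undefined above):

```python
def classify(x, w, w0):
    # C(x) = sign(w^T x + w0); tie at y = 0 broken toward -1
    y = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if y > 0 else -1

# Decision surface x1 + x2 - 1 = 0 (illustrative choice)
side_a = classify([1.0, 1.0], [1.0, 1.0], -1.0)   # y = 1 > 0
side_b = classify([0.0, 0.0], [1.0, 1.0], -1.0)   # y = -1 < 0
```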
slide by Ce Liu

Properties of Linear Discriminant Functions

− The decision surface y(x) = 0 is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w₀: the normal distance from the origin to the decision surface is wᵀx/‖w‖ = −w₀/‖w‖.
− The signed distance of a general point x from the decision surface is given by y(x)/‖w‖.

[figure: the decision surface y = 0 in the (x₁, x₂) plane, with regions R₁ (y > 0) and R₂ (y < 0), a point x, its projection x⊥ onto the surface, and the distances y(x)/‖w‖ and −w₀/‖w‖]
slide by Ce Liu

To compute the perpendicular distance r of the point x from the decision surface, write

x = x⊥ + r · w/‖w‖

where x⊥ is the projection of x on the decision surface. Then

wᵀx = wᵀx⊥ + r · wᵀw/‖w‖
wᵀx + w₀ = wᵀx⊥ + w₀ + r‖w‖
y(x) = r‖w‖
r = y(x)/‖w‖

With the augmented vectors w̃ = (w₀, w) and x̃ = (1, x), we can write ỹ(x) = w̃ᵀx̃.
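The result r = y(x)/‖w‖ is easy to check numerically; a small sketch with an arbitrarily chosen hyperplane:

```python
import math

def signed_distance(x, w, w0):
    # Signed perpendicular distance r = y(x) / ||w|| of point x
    # from the decision surface w^T x + w0 = 0
    y = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return y / math.sqrt(sum(wi * wi for wi in w))

# For w = (3, 4), w0 = -5 we have ||w|| = 5: the origin lies at
# signed distance w0/||w|| = -1, and points on the surface at 0
r_origin  = signed_distance([0.0, 0.0], [3.0, 4.0], -5.0)  # -1.0
r_surface = signed_distance([1.0, 0.5], [3.0, 4.0], -5.0)  # 0.0
```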
slide by Ce Liu

Multiple Classes: Simple Extension

− One-versus-the-rest: build K−1 classifiers, each separating points in class Ck from points not in Ck.
− One-versus-one: build K(K−1)/2 binary classifiers, one for every pair of classes.
− Both constructions leave ambiguous regions of the input space.

[figure: ambiguous regions (marked "?") for the one-versus-the-rest construction (C1 vs. not C1, C2 vs. not C2) and the one-versus-one construction (C1/C2, C1/C3, C2/C3), with regions R1, R2, R3]
Multiple Classes: K-Class Discriminant
A K-class discriminant comprises K linear functions

yk(x) = wkᵀx + wk0

and assigns x to class Ck if yk(x) is largest:

C(x) = k, if yk(x) > yj(x) ∀ j ≠ k

The decision boundary between classes Ck and Cj is given by yk(x) = yj(x), i.e. the hyperplane

(wk − wj)ᵀx + (wk0 − wj0) = 0
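The argmax form of this rule, sketched with arbitrary illustrative weight vectors:

```python
def k_class_discriminant(x, W, b):
    # y_k(x) = w_k^T x + w_k0; C(x) = argmax_k y_k(x), which
    # enforces y_k(x) > y_j(x) for all j != k
    scores = [sum(wi * xi for wi, xi in zip(wk, x)) + bk
              for wk, bk in zip(W, b)]
    return max(range(len(scores)), key=lambda k: scores[k])

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # arbitrary example weights
b = [0.0, 0.0, 0.0]
k = k_class_discriminant([2.0, 0.5], W, b)   # scores: 2.0, 0.5, -2.5
```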
slide by Ce Liu

Fisher's Linear Discriminant

A way to view a linear classification model is in terms of dimensionality reduction: project x down to one dimension via y = wᵀx, and choose w so that the two classes can be maximally separated. The class means are

m1 = (1/N1) Σ_{n∈C1} xn,   m2 = (1/N2) Σ_{n∈C2} xn

[figure: the same data projected onto the difference of means vs. onto Fisher's linear discriminant]
slide by Ce Liu

The separation of the projected class means gives the between-class variance:

(wᵀ(m1 − m2))² = wᵀ(m1 − m2)(m1 − m2)ᵀw = wᵀSB w

where SB = (m1 − m2)(m1 − m2)ᵀ is called the between-class covariance matrix. The within-class variance of the projected data is wᵀSW w, where

SW = Σ_{n∈C1} (xn − m1)(xn − m1)ᵀ + Σ_{n∈C2} (xn − m2)(xn − m2)ᵀ
slide by Ce Liu

Fisher's criterion is the ratio

J(w) = Between-class variance / Within-class variance = (wᵀSB w) / (wᵀSW w)

Setting the derivative to zero (using the quotient rule: for f(x) = g(x)/h(x), f′(x) = (g′(x)h(x) − g(x)h′(x)) / h²(x)) gives

(wᵀSB w) SW w = (wᵀSW w) SB w
(wᵀSB w) SW w = (wᵀSW w) (m2 − m1) ((m2 − m1)ᵀw)

We only care about directions, so the scalars are dropped. Therefore

w ∝ SW⁻¹ (m2 − m1)
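A sketch of this closed-form solution with NumPy, on tiny hand-made classes (the data here are illustrative):

```python
import numpy as np

def fisher_direction(X1, X2):
    # w ∝ S_W^{-1} (m2 - m1); returns the unit-norm direction
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # avoids an explicit inverse
    return w / np.linalg.norm(w)

# Two toy classes separated along the first axis
X1 = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
X2 = X1 + np.array([3.0, 0.0])
w = fisher_direction(X1, X2)   # points along (1, 0)
```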
slide by Ce Liu

From Fisher's Linear Discriminant to Classifiers

Fisher's linear discriminant reduces the classification problem to 1D; we then threshold the projected value (nonlinear thresholds lead to nonlinear classifiers). The final classifier has the form

y(x) = sign(wᵀx + w₀)

where the nonlinear activation function sign(·) is a step function:

sign(a) = { +1 if a > 0; −1 if a < 0 }
slide by Ce Liu

Early theories of the brain and learning:
− Successful behavior is rewarded (e.g. food).
− Unsuccessful behavior is punished (or not rewarded). This improves system fitness.
− The wrongly coded animal does not reproduce.
slide by Alex Smola

Neurons
− Dendrites: combine the inputs from several other nerve cells
− Synapses: interface and parameter store between neurons
− Axon: may be up to 1m long and will transport the activation signal to neurons at different locations
− Cell body: combines the signals
slide by Alex Smola

[figure: inputs x1, x2, x3, …, xn feeding an output unit through synaptic weights w1, …, wn]

f(x) = Σᵢ wᵢxᵢ = ⟨w, x⟩
slide by Alex Smola

− Weighted linear combination of the inputs
− Nonlinear decision function

f(x) = σ(⟨w, x⟩ + b)

− Output used for binary decisions (spam/ham, novel/typical, click/no click)
− Learning = estimating the parameters w and b

[figure: inputs x1, x2, x3, …, xn with synaptic weights w1, …, wn feeding a decision unit]
slide by Alex Smola

[figure: example data, separating Spam from Ham]
slide by Alex Smola

Rosenblatt and Widrow
slide by Alex Smola

Perceptron Algorithm

initialize w = 0 and b = 0
repeat
  if yi [⟨w, xi⟩ + b] ≤ 0 then
    w ← w + yi xi and b ← b + yi
  end if
until all classified correctly

The solution is a linear combination of the training points, so the classifier can be written entirely in terms of inner products:

w = Σ_{i∈I} yi xi   and   f(x) = Σ_{i∈I} yi ⟨xi, x⟩ + b
slide by Alex Smola
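The pseudocode above translates almost line by line into Python; a sketch on a small hand-made separable set (the data are illustrative):

```python
def train_perceptron(data, max_epochs=100):
    # data: list of (x, y) pairs with y in {+1, -1}
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            # Update only on errors: y_i [<w, x_i> + b] <= 0
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:          # all classified correctly
            break
    return w, b

# AND-like labels: linearly separable, so the loop terminates
data = [([0.0, 0.0], -1), ([0.0, 1.0], -1),
        ([1.0, 0.0], -1), ([1.0, 1.0], 1)]
w, b = train_perceptron(data)
```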
Convergence Theorem: if the data are separable with margin ρ, i.e. there exists some (w∗, b∗) with ‖w∗‖ = 1 such that

yi [⟨xi, w∗⟩ + b∗] ≥ ρ for all i,

and ‖xi‖ ≤ r for all i, then the perceptron converges to a linear separator after a number of steps bounded by

(b∗² + 1)(r² + 1) / ρ²
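The bound can be checked empirically on a toy set; here (w∗, b∗) = ((1, 0), 0) separates the data with margin ρ = 2 (all of these numbers are illustrative choices, not from the slides):

```python
import math

# Toy data separated by w* = (1, 0), b* = 0 with margin rho = 2
data = [([2.0, 0.0], 1), ([-2.0, 0.0], -1),
        ([3.0, 1.0], 1), ([-3.0, 1.0], -1)]

rho = 2.0                                           # min_i y_i <x_i, w*>
r = max(math.hypot(x[0], x[1]) for x, _ in data)    # max_i ||x_i||
bound = (0.0 ** 2 + 1) * (r ** 2 + 1) / rho ** 2    # (b*^2+1)(r^2+1)/rho^2

w, b, updates = [0.0, 0.0], 0.0, 0
converged = False
while not converged:            # terminates: the data are separable
    converged = True
    for x, y in data:
        if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:
            w = [w[0] + y * x[0], w[1] + y * x[1]]
            b += y
            updates += 1
            converged = False
```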
− Only need to store the errors: this gives a compression bound for the perceptron.
− The update is a stochastic gradient step on the loss

l(xi, yi, w, b) = max(0, 1 − yi [⟨w, xi⟩ + b])

− do NOT train your avatar with perceptrons

[image: screenshot from the game Black & White]

slide by Alex Smola
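The loss on the slide can be evaluated directly; a small sketch with placeholder weights:

```python
def perceptron_loss(x, y, w, b):
    # l(x, y, w, b) = max(0, 1 - y [<w, x> + b]); zero once the
    # point is classified with functional margin at least 1
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

l_ok  = perceptron_loss([2.0, 0.0],  1, [1.0, 0.0], 0.0)  # max(0, 1-2) = 0.0
l_bad = perceptron_loss([2.0, 0.0], -1, [1.0, 0.0], 0.0)  # max(0, 1+2) = 3.0
```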
[figure: separation problems ranging from hard to easy]
slide by Alex Smola

…the concept space
slide by Alex Smola

Finding the minimum-error linear separator is NP-hard (this killed neural networks in the 70s).
slide by Alex Smola

We got nonlinear functions by preprocessing: map the data via x → φ(x) and replace every inner product ⟨x, x′⟩ by ⟨φ(x), φ(x′)⟩, so the weight vector becomes a combination of the mapped points φ(xi).
slide by Alex Smola

This yields richer decision boundaries: circles, hyperbolae, parabolae.
slide by Alex Smola

Constructing Features (very naive OCR system)

Construct features manually, e.g. for OCR we could…
slide by Alex Smola

Feature Engineering for Spam Filtering
Delivered-To: alex.smola@gmail.com Received: by 10.216.47.73 with SMTP id s51cs361171web; Tue, 3 Jan 2012 14:17:53 -0800 (PST) Received: by 10.213.17.145 with SMTP id s17mr2519891eba.147.1325629071725; Tue, 03 Jan 2012 14:17:51 -0800 (PST) Return-Path: <alex+caf_=alex.smola=gmail.com@smola.org> Received: from mail-ey0-f175.google.com (mail-ey0-f175.google.com [209.85.215.175]) by mx.google.com with ESMTPS id n4si29264232eef.57.2012.01.03.14.17.51 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:51 -0800 (PST) Received-SPF: neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of alex+caf_=alex.smola=gmail.com@smola.org) client- ip=209.85.215.175; Authentication-Results: mx.google.com; spf=neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of alex+caf_=alex.smola=gmail.com@smola.org) smtp.mail=alex+caf_=alex.smola=gmail.com@smola.org; dkim=pass (test mode) header.i=@googlemail.com Received: by eaal1 with SMTP id l1so15092746eaa.6 for <alex.smola@gmail.com>; Tue, 03 Jan 2012 14:17:51 -0800 (PST) Received: by 10.205.135.18 with SMTP id ie18mr5325064bkc.72.1325629071362; Tue, 03 Jan 2012 14:17:51 -0800 (PST) X-Forwarded-To: alex.smola@gmail.com X-Forwarded-For: alex@smola.org alex.smola@gmail.com Delivered-To: alex@smola.org Received: by 10.204.65.198 with SMTP id k6cs206093bki; Tue, 3 Jan 2012 14:17:50 -0800 (PST) Received: by 10.52.88.179 with SMTP id bh19mr10729402vdb.38.1325629068795; Tue, 03 Jan 2012 14:17:48 -0800 (PST) Return-Path: <althoff.tim@googlemail.com> Received: from mail-vx0-f179.google.com (mail-vx0-f179.google.com [209.85.220.179]) by mx.google.com with ESMTPS id dt4si11767074vdb.93.2012.01.03.14.17.48 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:48 -0800 (PST) Received-SPF: pass (google.com: domain of althoff.tim@googlemail.com designates 209.85.220.179 as permitted sender) client-ip=209.85.220.179; Received: by vcbf13 with SMTP id 
f13so11295098vcb.10 for <alex@smola.org>; Tue, 03 Jan 2012 14:17:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; h=mime-version:sender:date:x-google-sender-auth:message-id:subject :from:to:content-type; bh=WCbdZ5sXac25dpH02XcRyDOdts993hKwsAVXpGrFh0w=; b=WK2B2+ExWnf/gvTkw6uUvKuP4XeoKnlJq3USYTm0RARK8dSFjyOQsIHeAP9Yssxp6O 7ngGoTzYqd+ZsyJfvQcLAWp1PCJhG8AMcnqWkx0NMeoFvIp2HQooZwxSOCx5ZRgY+7qX uIbbdna4lUDXj6UFe16SpLDCkptd8OZ3gr7+o= MIME-Version: 1.0 Received: by 10.220.108.81 with SMTP id e17mr24104004vcp.67.1325629067787; Tue, 03 Jan 2012 14:17:47 -0800 (PST) Sender: althoff.tim@googlemail.com Received: by 10.220.17.129 with HTTP; Tue, 3 Jan 2012 14:17:47 -0800 (PST) Date: Tue, 3 Jan 2012 14:17:47 -0800 X-Google-Sender-Auth: 6bwi6D17HjZIkxOEol38NZzyeHs Message-ID: <CAFJJHDGPBW+SdZg0MdAABiAKydDk9tpeMoDijYGjoGO-WC7osg@mail.gmail.com> Subject: CS 281B. Advanced Topics in Learning and Decision Making From: Tim Althoff <althoff@eecs.berkeley.edu> To: alex@smola.org Content-Type: multipart/alternative; boundary=f46d043c7af4b07e8d04b5a7113a

Transform the data into a radial and angular part:

(x1, x2) = (r sin φ, r cos φ)
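Inverting this transform gives the radial and angular features; a small sketch (note the slide's convention pairs sin with x1 and cos with x2):

```python
import math

def to_polar(x1, x2):
    # Recover (r, phi) from (x1, x2) = (r sin(phi), r cos(phi))
    r = math.hypot(x1, x2)
    phi = math.atan2(x1, x2)   # sin-component first, per the slide
    return r, phi

r, phi = to_polar(1.0, 1.0)                       # r = sqrt(2), phi = pi/4
x1, x2 = r * math.sin(phi), r * math.cos(phi)     # round-trips to (1, 1)
```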
slide by Alex Smola

Perceptron on features: again everything depends on the data only via inner products.

initialize w = 0 and b = 0
repeat
  pick (xi, yi) from the data
  if yi (w · φ(xi) + b) ≤ 0 then
    w ← w + yi φ(xi)
    b ← b + yi
until yi (w · φ(xi) + b) > 0 for all i

w = Σ_{i∈I} yi φ(xi)   and   f(x) = Σ_{i∈I} yi ⟨φ(xi), φ(x)⟩ + b
slide by Alex Smola

(x1, x2) → (x1, x2, x1x2)
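XOR-labelled points are the classic case: not linearly separable in (x1, x2), but after this map the third coordinate x1·x2 carries the label's sign, so w = (0, 0, 1), b = 0 separates them. A sketch:

```python
def phi(x):
    # Feature map (x1, x2) -> (x1, x2, x1*x2)
    x1, x2 = x
    return (x1, x2, x1 * x2)

# XOR-style labels y = sign(x1 * x2): not separable in the original space
data = [((1.0, 1.0), 1), ((-1.0, -1.0), 1),
        ((1.0, -1.0), -1), ((-1.0, 1.0), -1)]

w, b = (0.0, 0.0, 1.0), 0.0
margins = [y * (sum(wi * zi for wi, zi in zip(w, phi(x))) + b)
           for x, y in data]   # all positive -> separated in feature space
```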
slide by Alex Smola

Multi-layer Perceptron