Lecture 10:
−Linear Discriminant Functions (cont’d.) −Perceptron
Aykut Erdem
November 2016 Hacettepe University
Last time… Logistic Regression
Assumes the following functional form for P(Y|X), the logistic function applied to a linear function of the data:

$$P(Y = 1 \mid X) = \frac{1}{1 + \exp\!\big(-(w_0 + \textstyle\sum_i w_i X_i)\big)}$$

Logistic function (or Sigmoid): $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

Features can be discrete or continuous!
slide by Aarti Singh & Barnabás Póczos
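To make the recap concrete, here is a minimal Python sketch of this functional form; the weights w, w0 and the input x are made-up illustrative values, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def p_y_given_x(x, w, w0):
    # P(Y = 1 | X = x) under the logistic regression model above
    return sigmoid(np.dot(w, x) + w0)

x = np.array([1.0, -2.0])            # features (discrete or continuous)
w, w0 = np.array([0.5, 0.3]), 0.1    # illustrative parameters
print(p_y_given_x(x, w, w0))         # 0.5 here, since w.x + w0 = 0
```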
Last time… Gaussian Naive Bayes vs. Logistic Regression

− In both, the decision rule is a hyperplane.
− LR has no closed-form solution, but its log-likelihood is concave → gradient ascent reaches the global optimum.
− GNB is representationally equivalent to LR; the solutions differ because of the objective (loss) function.
− NB: assumes features are independent given the class, i.e. an assumption on P(X|Y).
− LR: assumes a functional form for P(Y|X), no assumption on P(X|Y).
− GNB (usually) needs less data; LR (usually) gets to better solutions in the limit.
Last time… Linear Discriminant Functions

A two-class linear discriminant:

$$y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0$$

where $\mathbf{w}$ is called the weight vector and $w_0$ is a bias. The decision rule is

$$C(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + w_0)$$

where the step function $\mathrm{sign}(\cdot)$ is defined as

$$\mathrm{sign}(a) = \begin{cases} +1, & a > 0 \\ -1, & a < 0 \end{cases}$$
slide by Ce Liu
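A small Python sketch of this decision rule; the weights are arbitrary illustrative values, and since the slide leaves sign(0) undefined, the sketch maps ties to −1.

```python
import numpy as np

def classify(x, w, w0):
    # C(x) = sign(w^T x + w0); ties (a == 0) mapped to -1 here
    a = np.dot(w, x) + w0
    return 1 if a > 0 else -1

w, w0 = np.array([2.0, -1.0]), 0.5            # illustrative parameters
print(classify(np.array([1.0, 1.0]), w, w0))  # 2 - 1 + 0.5 = 1.5 > 0 -> +1
```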
Geometry: the decision surface $y(\mathbf{x}) = 0$ is perpendicular to $\mathbf{w}$, and the perpendicular distance from the origin to the decision surface is

$$\frac{\mathbf{w}^T \mathbf{x}}{\|\mathbf{w}\|} = -\frac{w_0}{\|\mathbf{w}\|}$$

so the displacement of the surface from the origin is controlled by the bias parameter $w_0$. The perpendicular distance $r$ of a general point $\mathbf{x}$ from the decision surface is given by $y(\mathbf{x})/\|\mathbf{w}\|$.

[Figure: decision surface y = 0 separating regions R1 (y > 0) and R2 (y < 0); a point x decomposes into its projection x⊥ onto the surface plus a displacement y(x)/‖w‖ along w.]
slide by Ce Liu
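A quick numeric check of this geometry in Python, using a made-up hyperplane 3·x1 + 4·x2 − 5 = 0:

```python
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0   # illustrative hyperplane 3x1 + 4x2 - 5 = 0
x = np.array([2.0, 1.0])

y = np.dot(w, x) + w0                # y(x) = 5.0
print(y / np.linalg.norm(w))         # signed distance of x: 5/5 = 1.0
print(-w0 / np.linalg.norm(w))       # distance of surface from origin: 1.0
```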
Last time… Multiple Classes: Simple Extension
One-versus-the-rest: use K − 1 classifiers, each separating points in class Ck from points not in Ck. One-versus-one: use K(K − 1)/2 binary classifiers, one for each pair of classes. Both can leave ambiguous regions.

[Figure: ambiguous regions "?" for a one-versus-the-rest classifier (C1 vs. not C1, C2 vs. not C2) and a one-versus-one classifier (C1/C2, C1/C3, C2/C3) over regions R1, R2, R3.]
slide by Ce Liu
A single K-class discriminant comprising K linear functions:

$$y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}$$

$$C(\mathbf{x}) = k, \quad \text{if } y_k(\mathbf{x}) > y_j(\mathbf{x}) \ \ \forall j \neq k$$

The decision boundary between class $C_k$ and class $C_j$ is given by $y_k(\mathbf{x}) = y_j(\mathbf{x})$, i.e. the hyperplane

$$(\mathbf{w}_k - \mathbf{w}_j)^T \mathbf{x} + (w_{k0} - w_{j0}) = 0$$
slide by Ce Liu
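A minimal Python sketch of this K-class rule; the three weight vectors and biases are invented for illustration (classes are zero-indexed in the code):

```python
import numpy as np

W = np.array([[ 1.0,  0.0],    # w_1
              [ 0.0,  1.0],    # w_2
              [-1.0, -1.0]])   # w_3
w0 = np.array([0.0, 0.2, -0.1])

def classify(x):
    y = W @ x + w0              # all y_k(x) = w_k^T x + w_k0 at once
    return int(np.argmax(y))    # the k with y_k(x) > y_j(x) for all j != k

print(classify(np.array([2.0, 0.5])))  # y = [2.0, 0.7, -2.6] -> class 0
```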
Theorem. The decision regions of the K-class discriminant $y_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + w_{k0}$ are singly connected and convex.

Proof. Suppose two points $\mathbf{x}_A$ and $\mathbf{x}_B$ both lie inside decision region $R_k$. Any point $\hat{\mathbf{x}}$ on the line between $\mathbf{x}_A$ and $\mathbf{x}_B$ can be expressed as

$$\hat{\mathbf{x}} = \lambda \mathbf{x}_A + (1 - \lambda)\mathbf{x}_B, \qquad 0 \leq \lambda \leq 1$$

By linearity of $y_k$,

$$y_k(\hat{\mathbf{x}}) = \lambda y_k(\mathbf{x}_A) + (1 - \lambda) y_k(\mathbf{x}_B) > \lambda y_j(\mathbf{x}_A) + (1 - \lambda) y_j(\mathbf{x}_B) = y_j(\hat{\mathbf{x}}) \quad (\forall j \neq k)$$

Therefore, the region $R_k$ is singly connected and convex.
slide by Ce Liu
[Figure: a convex decision region Rk containing the segment from xA to xB through x̂, bordered by regions Ri and Rj.]
If two points xA and xB both lie inside the same decision region Rk, then any point x that lies on the line connecting these two points must also lie in Rk, and hence the decision region must be singly connected and convex.
slide by Ce Liu
Fisher's Linear Discriminant

A way to view a linear classification model is in terms of dimensionality reduction: project the input down to one dimension via

$$y = \mathbf{w}^T \mathbf{x}$$

and choose $\mathbf{w}$ so that the two classes can be maximally separated. The class means are

$$\mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in C_1} \mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in C_2} \mathbf{x}_n$$

[Figure: two classes projected onto the difference of means (left) vs. onto Fisher's linear discriminant (right); Fisher's direction gives much better separation.]
slide by Ce Liu
The squared separation of the projected means is

$$\big(\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)\big)^2 = \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T \mathbf{w} = \mathbf{w}^T S_B \mathbf{w}$$

where $S_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$ is called the between-class covariance matrix. The within-class variance of the projected data is $\mathbf{w}^T S_W \mathbf{w}$, where

$$S_W = \sum_{n \in C_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in C_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$$

is the within-class covariance matrix.
slide by Ce Liu
Fisher's criterion maximizes the ratio

$$J(\mathbf{w}) = \frac{\text{between-class variance}}{\text{within-class variance}} = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}$$

Using the quotient rule, for $f(x) = \dfrac{g(x)}{h(x)}$,

$$f'(x) = \frac{g'(x)h(x) - g(x)h'(x)}{h^2(x)}$$

setting the derivative of $J(\mathbf{w})$ to zero gives

$$(\mathbf{w}^T S_B \mathbf{w})\, S_W \mathbf{w} = (\mathbf{w}^T S_W \mathbf{w})\, S_B \mathbf{w} = (\mathbf{w}^T S_W \mathbf{w})\,(\mathbf{m}_2 - \mathbf{m}_1)\big((\mathbf{m}_2 - \mathbf{m}_1)^T \mathbf{w}\big)$$

Since we only care about the direction of $\mathbf{w}$, the scalar factors are dropped. Therefore

$$\mathbf{w} \propto S_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$$
slide by Ce Liu
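A short Python sketch of the resulting recipe, w ∝ S_W⁻¹(m₂ − m₁), on synthetic two-class data; it assumes S_W is invertible and uses no regularization.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 0.5, size=(50, 2))   # class 1 samples (synthetic)
X2 = rng.normal([2, 1], 0.5, size=(50, 2))   # class 2 samples (synthetic)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)    # class means
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
w = np.linalg.solve(SW, m2 - m1)             # direction maximizing J(w)
w /= np.linalg.norm(w)
print(w)                                     # Fisher projection direction
```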
Fisher's linear discriminant reduces the classification problem to 1D; the projected value is then compared against a threshold (nonlinear thresholds lead to nonlinear classifiers). The final classifier has the form

$$y(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x} + w_0)$$

where the nonlinear activation function $\mathrm{sign}(\cdot)$ is a step function:

$$\mathrm{sign}(a) = \begin{cases} +1, & a > 0 \\ -1, & a < 0 \end{cases}$$
slide by Ce Liu
Perceptron

Early theories of the brain
slide by Alex Smola
Biology and learning:
− Successful behavior is rewarded (e.g. food); unsuccessful behavior is punished (or not rewarded). This improves system fitness.
− The wrongly coded animal does not reproduce.
slide by Alex Smola
A neuron:
− Soma (cell body): combines signals; combines the inputs from several other nerve cells
− Synapse: interface and parameter store between neurons
− Axon: may be up to 1 m long and will transport the activation signal to neurons at different locations
slide by Alex Smola
The perceptron computes a weighted linear combination of its inputs:

$$f(\mathbf{x}) = \sum_i w_i x_i = \langle \mathbf{w}, \mathbf{x} \rangle$$

[Figure: inputs x1, x2, x3, …, xn connected to a single output through synaptic weights w1, …, wn.]
slide by Alex Smola
Adding a nonlinearity to this weighted combination yields a decision function

$$f(\mathbf{x}) = \sigma\big(\langle \mathbf{w}, \mathbf{x} \rangle + b\big)$$

for binary decisions (spam/ham, novel/typical, click/no click). Learning amounts to estimating the parameters $\mathbf{w}$ and $b$.

[Figure: inputs x1, x2, x3, …, xn combined through synaptic weights w1, …, wn, followed by the nonlinearity σ.]
slide by Alex Smola
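A minimal Python sketch of such a unit, using a hard threshold as the nonlinearity σ; the weights and input are illustrative.

```python
import numpy as np

def neuron(x, w, b):
    a = np.dot(w, x) + b       # weighted combination (synaptic weights)
    return 1 if a > 0 else -1  # nonlinear activation: fire / don't fire

w, b = np.array([0.4, -0.2, 0.7]), -0.1
print(neuron(np.array([1.0, 1.0, 1.0]), w, b))  # 0.4 - 0.2 + 0.7 - 0.1 = 0.8 -> 1
```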
[Figure: spam and ham documents separated by a linear decision boundary.]
slide by Alex Smola
The Perceptron: Rosenblatt and Widrow
slide by Alex Smola
The Perceptron Algorithm

    initialize w = 0 and b = 0
    repeat
        if y_i [⟨w, x_i⟩ + b] ≤ 0 then
            w ← w + y_i x_i and b ← b + y_i
        end if
    until all classified correctly

Nothing happens when an example is classified correctly. The weight vector is a linear combination of the data points, so the classifier is a linear combination of inner products:

$$\mathbf{w} = \sum_{i \in I} y_i \mathbf{x}_i, \qquad f(\mathbf{x}) = \sum_{i \in I} y_i \langle \mathbf{x}_i, \mathbf{x} \rangle + b$$

where $I$ is the set of examples on which updates occurred.
slide by Alex Smola
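A runnable Python sketch of the algorithm above. It assumes the toy data is linearly separable (otherwise the loop would never reach zero errors, hence the epoch cap):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:  # mistake (or on the boundary)
                w += yi * xi                   # w <- w + y_i x_i
                b += yi                        # b <- b + y_i
                errors += 1
        if errors == 0:                        # all classified correctly
            break
    return w, b

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(w, b, np.sign(X @ w + b))                # matches y on this toy set
```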
Convergence Theorem. If there exists some $(\mathbf{w}^*, b^*)$ with $\|\mathbf{w}^*\| = 1$ such that

$$y_i \big[\langle \mathbf{x}_i, \mathbf{w}^* \rangle + b^*\big] \geq \rho \quad \text{for all } i$$

and $\|\mathbf{x}_i\| \leq r$ for all $i$, then the perceptron converges to a linear separator after a number of updates bounded by

$$\frac{(b^{*2} + 1)(r^2 + 1)}{\rho^2}$$
slide by Alex Smola
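For instance, under these assumptions a dataset with radius r = 1, offset b* = 0, and margin ρ = 0.1 gives a bound of (0² + 1)(1² + 1)/0.1² = 200 updates, independent of the number of training points.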
Consequences
− Only the examples on which mistakes were made contribute to the solution, so only the errors need to be stored. This gives a compression bound for the perceptron.
− Each update is a stochastic gradient step on the loss

$$l(\mathbf{x}_i, y_i, \mathbf{w}, b) = \max\big(0,\; 1 - y_i[\langle \mathbf{w}, \mathbf{x}_i \rangle + b]\big)$$

− The perceptron fails with noisy data: do NOT train your avatar with perceptrons.
slide by Alex Smola
[Figure: hard vs. easy separation problems; a large margin makes the problem easy, a small margin makes it hard.]
slide by Alex Smola
Finding the minimum error linear separator is NP hard (this killed Neural Networks in the 70s).
slide by Alex Smola
Way out: map the data into a feature space

$$\mathbf{x} \to \phi(\mathbf{x})$$

and replace every inner product $\langle \mathbf{x}, \mathbf{x}' \rangle$ by $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle$, i.e. run the perceptron on the mapped points $\phi(\mathbf{x}_i)$.
slide by Alex Smola
Construct features manually, e.g. for OCR.

[Figure: examples of hand-designed features for optical character recognition.]
slide by Alex Smola
Example: a raw email, headers and all, from which spam/ham features can be constructed:
Delivered-To: alex.smola@gmail.com Received: by 10.216.47.73 with SMTP id s51cs361171web; Tue, 3 Jan 2012 14:17:53 -0800 (PST) Received: by 10.213.17.145 with SMTP id s17mr2519891eba.147.1325629071725; Tue, 03 Jan 2012 14:17:51 -0800 (PST) Return-Path: <alex+caf_=alex.smola=gmail.com@smola.org> Received: from mail-ey0-f175.google.com (mail-ey0-f175.google.com [209.85.215.175]) by mx.google.com with ESMTPS id n4si29264232eef.57.2012.01.03.14.17.51 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:51 -0800 (PST) Received-SPF: neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of alex+caf_=alex.smola=gmail.com@smola.org) client- ip=209.85.215.175; Authentication-Results: mx.google.com; spf=neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of alex+caf_=alex.smola=gmail.com@smola.org) smtp.mail=alex+caf_=alex.smola=gmail.com@smola.org; dkim=pass (test mode) header.i=@googlemail.com Received: by eaal1 with SMTP id l1so15092746eaa.6 for <alex.smola@gmail.com>; Tue, 03 Jan 2012 14:17:51 -0800 (PST) Received: by 10.205.135.18 with SMTP id ie18mr5325064bkc.72.1325629071362; Tue, 03 Jan 2012 14:17:51 -0800 (PST) X-Forwarded-To: alex.smola@gmail.com X-Forwarded-For: alex@smola.org alex.smola@gmail.com Delivered-To: alex@smola.org Received: by 10.204.65.198 with SMTP id k6cs206093bki; Tue, 3 Jan 2012 14:17:50 -0800 (PST) Received: by 10.52.88.179 with SMTP id bh19mr10729402vdb.38.1325629068795; Tue, 03 Jan 2012 14:17:48 -0800 (PST) Return-Path: <althoff.tim@googlemail.com> Received: from mail-vx0-f179.google.com (mail-vx0-f179.google.com [209.85.220.179]) by mx.google.com with ESMTPS id dt4si11767074vdb.93.2012.01.03.14.17.48 (version=TLSv1/SSLv3 cipher=OTHER); Tue, 03 Jan 2012 14:17:48 -0800 (PST) Received-SPF: pass (google.com: domain of althoff.tim@googlemail.com designates 209.85.220.179 as permitted sender) client-ip=209.85.220.179; Received: by vcbf13 with SMTP id f13so11295098vcb.10 for <alex@smola.org>; Tue, 03 Jan 2012 14:17:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma; h=mime-version:sender:date:x-google-sender-auth:message-id:subject :from:to:content-type; bh=WCbdZ5sXac25dpH02XcRyDOdts993hKwsAVXpGrFh0w=; b=WK2B2+ExWnf/gvTkw6uUvKuP4XeoKnlJq3USYTm0RARK8dSFjyOQsIHeAP9Yssxp6O 7ngGoTzYqd+ZsyJfvQcLAWp1PCJhG8AMcnqWkx0NMeoFvIp2HQooZwxSOCx5ZRgY+7qX uIbbdna4lUDXj6UFe16SpLDCkptd8OZ3gr7+o= MIME-Version: 1.0 Received: by 10.220.108.81 with SMTP id e17mr24104004vcp.67.1325629067787; Tue, 03 Jan 2012 14:17:47 -0800 (PST) Sender: althoff.tim@googlemail.com Received: by 10.220.17.129 with HTTP; Tue, 3 Jan 2012 14:17:47 -0800 (PST) Date: Tue, 3 Jan 2012 14:17:47 -0800 X-Google-Sender-Auth: 6bwi6D17HjZIkxOEol38NZzyeHs Message-ID: <CAFJJHDGPBW+SdZg0MdAABiAKydDk9tpeMoDijYGjoGO-WC7osg@mail.gmail.com> Subject: CS 281B. Advanced Topics in Learning and Decision Making From: Tim Althoff <althoff@eecs.berkeley.edu> To: alex@smola.org Content-Type: multipart/alternative; boundary=f46d043c7af4b07e8d04b5a7113a
Content-Type: text/plain; charset=ISO-8859-1
slide by Alex Smola
Or transform the data into a radial and an angular part:

$$(x_1, x_2) = (r \sin \phi, r \cos \phi)$$

[Figure: data separable by radius but not by a line in the original (x1, x2) coordinates.]
slide by Alex Smola
The Perceptron on features

    initialize w = 0 and b = 0
    repeat
        pick (x_i, y_i) from the data
        if y_i (⟨w, φ(x_i)⟩ + b) ≤ 0 then
            w ← w + y_i φ(x_i) and b ← b + y_i
        end if
    until y_i (⟨w, φ(x_i)⟩ + b) > 0 for all i

As before, the weight vector is a linear combination of the (mapped) data points, and the classifier is a linear combination of inner products in feature space:

$$\mathbf{w} = \sum_{i \in I} y_i \phi(\mathbf{x}_i), \qquad f(\mathbf{x}) = \sum_{i \in I} y_i \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}) \rangle + b$$
slide by Alex Smola
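Since only inner products of mapped points appear, we never need w or φ explicitly; a kernel k(a, b) = ⟨φ(a), φ(b)⟩ suffices. Below is a hedged Python sketch of this update in a dual "counting" form; the polynomial kernel and the XOR-style toy data are illustrative choices, not from the lecture.

```python
import numpy as np

def kernel(a, b):
    return (np.dot(a, b) + 1.0) ** 2       # k(a,b) = <phi(a), phi(b)>, poly kernel

def train(X, y, max_epochs=100):
    alpha, b = np.zeros(len(X)), 0.0       # alpha_i counts updates on x_i
    K = np.array([[kernel(a, c) for c in X] for a in X])
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(len(X)):
            f = np.dot(alpha * y, K[i]) + b  # f(x_i) = sum_j a_j y_j k(x_j, x_i) + b
            if y[i] * f <= 0:
                alpha[i] += 1.0              # w <- w + y_i phi(x_i), implicitly
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([-1, -1, 1, 1])               # XOR-style labels
alpha, b = train(X, y)
preds = [int(np.sign(np.dot(alpha * y, [kernel(xj, x) for xj in X]) + b)) for x in X]
print(preds)                               # [-1, -1, 1, 1] on this toy set
```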
Example: adding a product feature,

$$(x_1, x_2) \to (x_1, x_2, x_1 x_2)$$

[Figure: data that is not linearly separable in (x1, x2) becomes separable with the extra feature x1·x2.]
slide by Alex Smola
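A quick Python check that the extra product feature makes XOR-style labels linearly separable; the separating weights below are hand-picked for illustration.

```python
import numpy as np

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([-1, -1, 1, 1])                   # XOR labeling

# feature map (x1, x2) -> (x1, x2, x1*x2)
phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
w, b = np.array([1.0, 1.0, -3.0]), -0.5        # one separating hyperplane in 3D
print(np.sign(phi @ w + b))                    # [-1, -1, 1, 1], equal to y
```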