Support Vector Machine (streamlined) - Michael Biehl - PowerPoint PPT Presentation

Support Vector Machine (streamlined)
extended version: Biehl-Part1.pdf

Michael Biehl
Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence
University of Groningen
www.cs.rug.nl/biehl

IAC Winter School, November 2018, La Laguna


SLIDE 1

Support Vector Machine (streamlined)

Michael Biehl
Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence
University of Groningen
www.cs.rug.nl/biehl

extended version: Biehl-Part1.pdf

SLIDE 2

the storage problem revisited

Solving the perceptron storage problem: re-write the problem.

Consider a given data set D = {ξ^µ, S_R^µ}, µ = 1, …, P.
Find a vector w with S_H^µ = sign(w · ξ^µ) = S_R^µ for all µ.

Note: sign(w · ξ^µ) = S_R^µ  ⇔  sign(w · ξ^µ S_R^µ) = 1  ⇔  E^µ = w · ξ^µ S_R^µ > 0
(the quantities E^µ are the local potentials).

Equivalent problem: solve a set of linear inequalities (in w).
Find a vector w with E^µ = w · ξ^µ S_R^µ ≥ c > 0 for all µ.

Note that the actual value of c is irrelevant: if w satisfies E^µ ≥ c > 0, then w/c satisfies E^µ ≥ 1.
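As a concrete illustration of the storage check stated above (my addition, not part of the slides; the data and function names are made up), a minimal NumPy sketch that computes the local potentials E^µ and tests the inequalities:

```python
import numpy as np

def local_potentials(w, xi, S):
    """E^mu = w . xi^mu * S^mu for all mu (rows of xi are the patterns xi^mu)."""
    return (xi @ w) * S

def stores_all_patterns(w, xi, S, c=0.0):
    """True if E^mu > c for every pattern, i.e. w solves the storage problem."""
    return np.all(local_potentials(w, xi, S) > c)

# tiny example with random data and a random candidate vector
rng = np.random.default_rng(0)
N, P = 20, 10
xi = rng.normal(size=(P, N))
S = rng.choice([-1, 1], size=P)
w = rng.normal(size=N)
print(local_potentials(w, xi, S))
print(stores_all_patterns(w, xi, S))
```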

SLIDE 3

solving equations?

(A) Instead of the inequalities, try to solve P equations for the N unknowns:

E^µ = Σ_{j=1}^{N} w_j ξ^µ_j S^µ = 1   for all µ = 1, 2, …, P

If no solution exists, find an approximation by least-squares deviation:

minimize  f = (1/2) Σ_{µ=1}^{P} (1 − E^µ)²

e.g. by means of gradient descent, with gradient

∇_w f = − Σ_{µ=1}^{P} (1 − E^µ) ξ^µ S^µ
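A minimal NumPy sketch of approach (A) as batch gradient descent (my illustration, not from the slides; the function name, learning rate and step count are assumptions):

```python
import numpy as np

# Sketch of approach (A): batch gradient descent on
# f = 1/2 * sum_mu (1 - E^mu)^2 with E^mu = w . xi^mu * S^mu.
def least_squares_potentials(xi, S, eta=0.05, steps=500):
    P, N = xi.shape
    w = np.zeros(N)
    for _ in range(steps):
        E = (xi @ w) * S                          # local potentials E^mu
        grad = -(xi * S[:, None]).T @ (1.0 - E)   # gradient of f w.r.t. w
        w -= eta * grad                           # descent step
    return w
```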

SLIDE 4

solving equations?

(B) If the system is under-determined → find a unique solution:

minimize  (1/2) |w|²  under the constraints  {E^µ = 1}, µ = 1, …, P

Lagrange function:

L = (1/2) |w|² + Σ_{µ=1}^{P} λ^µ (1 − E^µ)

Necessary conditions for the optimum:

∂L/∂λ^µ = (1 − E^µ) = 0
∇_w L = w − Σ_{µ=1}^{P} λ^µ ξ^µ S^µ = 0   ⇒   w = Σ_{µ=1}^{P} λ^µ ξ^µ S^µ

Lagrange parameters ≈ embedding strengths λ^µ (rescaled with N); the solution is a linear combination of the data.
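A quick numerical illustration of case (B) (my addition, not from the slides): for an under-determined, consistent system the minimum-norm solution can be obtained directly with the pseudo-inverse; the function name and the use of np.linalg.pinv are my choices.

```python
import numpy as np

# Sketch of case (B): the minimum-norm solution of the under-determined system
# E^mu = w . xi^mu S^mu = 1 is given by the pseudo-inverse.
def min_norm_solution(xi, S):
    A = xi * S[:, None]           # rows are xi^mu * S^mu, shape (P, N)
    b = np.ones(len(S))           # right-hand side: E^mu = 1 for all mu
    return np.linalg.pinv(A) @ b  # minimum-norm w with A w = b (for consistent P < N systems)
```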

SLIDE 5

solving equations?

Eliminate the weights. With the correlation matrix

C_{νµ} ≡ (1/N) Σ_{k=1}^{N} (ξ^µ_k S^µ)(ξ^ν_k S^ν),   so that   E^ν = Σ_{µ=1}^{P} C_{νµ} λ^µ   and   Σ_{j=1}^{N} w_j² ∝ Σ_{µ,ν} λ^ν C_{νµ} λ^µ,

the simplified problem reads

max_λ  L = − (1/2) Σ_{µ,ν} λ^ν C_{νµ} λ^µ + Σ_µ λ^µ,   with   ∂L/∂λ^ρ = 1 − Σ_µ C_{ρµ} λ^µ = (1 − E^ρ).

In terms of the weights this is the same as in (A)! Gradient ascent corresponds to

Δw ∝ Σ_ρ (1 − E^ρ) ξ^ρ S^ρ.

SLIDE 6

solving equations?

Rename the Lagrange parameters (λ^µ → x^µ) and re-write the problem. With

E^ν = Σ_{µ=1}^{P} C_{νµ} x^µ,   C_{νµ} ≡ (1/N) Σ_{k=1}^{N} (ξ^µ_k S^µ)(ξ^ν_k S^ν),   and   Σ_{j=1}^{N} w_j² ∝ Σ_{µ,ν} x^ν C_{νµ} x^µ,

the simplified problem reads

max_x  L = − (1/2) Σ_{µ,ν} x^ν C_{νµ} x^µ + Σ_µ x^µ,   with   ∂L/∂x^ρ = 1 − Σ_µ C_{ρµ} x^µ = (1 − E^ρ).

In terms of the weights this is the same as in (A)! Gradient ascent corresponds to

Δw ∝ Σ_ρ (1 − E^ρ) ξ^ρ S^ρ.

SLIDE 7

classical algorithm: ADALINE

Adaline: Adaptive Linear Neuron (Widrow and Hoff, 1960)
  • gradient-based learning for linear regression (MSE)
  • frequent strategy: regression as a proxy for classification
  • more general: training of a linear unit with continuous output,

minimize  f = (1/2) Σ_{µ=1}^{P} (h^µ − E^µ)²   with targets h^µ ∈ ℝ, µ = 1, 2, …, P,

equivalently  f = (1/2) Σ_{µ=1}^{P} (y^µ − wᵀξ^µ)²   with y^µ = h^µ S^µ.

Iteration of weights / embedding strengths over a sequence µ(t) of examples (a sketch follows below):

w(t) = w(t−1) + η (1 − E^{µ(t)}) ξ^{µ(t)} S^{µ(t)}
x^{µ(t)}(t) = x^{µ(t)}(t−1) + η (1 − E^{µ(t)})
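A minimal sketch of the Adaline-style sequential update shown above (my illustration, not Widrow and Hoff's original code; targets h^µ = 1, learning rate and epoch count are assumptions):

```python
import numpy as np

# Sequential (online) update of weights and embedding strengths, with h^mu = 1.
def adaline_train(xi, S, eta=0.1, epochs=20, rng=None):
    rng = rng or np.random.default_rng(0)
    P, N = xi.shape
    w = np.zeros(N)
    x = np.zeros(P)                     # embedding strengths
    for _ in range(epochs):
        for mu in rng.permutation(P):   # sequence mu(t) of examples
            E = w @ xi[mu] * S[mu]      # local potential E^mu
            w += eta * (1.0 - E) * xi[mu] * S[mu]
            x[mu] += eta * (1.0 - E)
    return w, x
```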

SLIDE 8

hardware realization, “Science in Action”, ca. 1960
http://www.youtube.com/watch?v=IEFRtz68m-8

YouTube video “Science in Action” with Bernard Widrow

SLIDE 9

Introduction:
  • supervised learning, classification, regression
  • machine learning “vs.” statistical modeling

Early (important!) approaches:
  • linear threshold classifier, Rosenblatt’s Perceptron
  • adaptive linear neuron, Widrow and Hoff’s Adaline

From Perceptron to Support Vector Machine:
  • large margin classification
  • beyond linear separability

Distance-based systems:
  • prototypes: K-means and Vector Quantization
  • from K-Nearest-Neighbors to Learning Vector Quantization
  • adaptive distance measures and relevance learning
SLIDE 10

Optimal stability by quadratic optimization

minimize  (1/2) w²  subject to the inequality constraints

E^µ = wᵀξ^µ S_R^µ ≥ 1   for µ = 1, …, P.

Note: the solution w_max of this problem yields the optimal stability κ_max = 1 / |w_max|.

SLIDE 11

Notation: correlation matrix C (outputs incorporated) with elements C_{νµ} = (1/N) (ξ^ν S^ν) · (ξ^µ S^µ),
P-vectors x = (x¹, …, x^P)ᵀ of embedding strengths, and the “one-vector” 1 = (1, …, 1)ᵀ.
The inequality constraints then read C x ≥ 1 (C is positive semi-definite).

SLIDE 12

Optimal stability by quadratic optimization

In terms of the weights: minimize (1/2) w² subject to the inequality constraints E^µ = wᵀξ^µ S_R^µ ≥ 1, µ = 1, …, P.

We can formulate optimal stability completely in terms of the embedding strengths:

minimize  (1/2) xᵀ C x   subject to the linear constraints   C x ≥ 1.

This is a special case of a standard problem in Quadratic Programming: minimize a non-linear (quadratic) function under linear inequality constraints.

Note: the solution w_max of the problem yields the optimal stability κ_max = 1 / |w_max|.
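As an illustration (my sketch, not the training algorithms discussed on the later slides), the quadratic program above can be handed to a generic constrained solver; the function name, the SLSQP choice and the 1/N rescaling when reconstructing w are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: minimize 1/2 x^T C x subject to C x >= 1, x = embedding strengths.
def optimal_stability(xi, S):
    P, N = xi.shape
    A = xi * S[:, None]
    C = (A @ A.T) / N                    # C_{nu mu} = (1/N) (xi^nu S^nu).(xi^mu S^mu)
    res = minimize(
        fun=lambda x: 0.5 * x @ C @ x,
        x0=np.ones(P),
        jac=lambda x: C @ x,
        constraints=[{"type": "ineq", "fun": lambda x: C @ x - 1.0,
                      "jac": lambda x: C}],
        method="SLSQP",
    )
    x = res.x                            # optimal embedding strengths
    w = A.T @ x / N                      # w = (1/N) sum_mu x^mu xi^mu S^mu (rescaling convention assumed)
    return w, x
```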

SLIDE 13

Optimization theory: Kuhn–Tucker theorem

see, e.g., R. Fletcher, Practical Methods of Optimization (Wiley, 1987)
  • or http://wikipedia.org, “Karush–Kuhn–Tucker conditions”, for a quick start

necessary conditions for a local solution of a general non-linear optimization problem with equality and inequality constraints

SLIDE 14

Max. stability: Kuhn–Tucker theorem for a special non-linear optimization problem

minimize_x  (1/2) xᵀ C x   subject to   C x ≥ 1

Lagrange function:  L(x, λ) = (1/2) xᵀ C x − λᵀ (C x − 1)

Straightforward to show: any solution can be represented by a Kuhn–Tucker (KT) point with
  • non-negative embedding strengths (← MinOver)
  • linear separability
  • complementarity

This also implies:
  → all KT points yield the same, unique perceptron weight vector
  → any local solution is globally optimal
SLIDE 15

Duality, theory of Lagrange multipliers → equivalent formulation (Wolfe dual):

maximize_x  f̃ = − (1/2) xᵀ C x + xᵀ 1   subject to   x ≥ 0

(the constraint x ≥ 0 is absent in the Adaline problem)

SLIDE 16

Duality, theory of Lagrange multipliers → equivalent formulation (Wolfe dual):

maximize_x  f̃ = − (1/2) xᵀ C x + xᵀ 1   subject to   x ≥ 0

AdaTron algorithm (Adaptive Perceptron) [Anlauf and Biehl, 1989]:
  – sequential presentation of examples D = {ξ^µ, S^µ}
  – gradient ascent w.r.t. f̃, projected onto x ≥ 0:

x^µ → max { 0, x^µ + η (1 − [C x]^µ) }   with 0 < η < 2,

where η (1 − [C x]^µ) = η [∇_x f̃]^µ.

SLIDE 17

AdaTron algorithm (Adaptive Perceptron) [Anlauf and Biehl, 1989]:
  – sequential presentation of examples D = {ξ^µ, S^µ}
  – projected gradient ascent w.r.t. the Wolfe dual f̃ = − (1/2) xᵀ C x + xᵀ 1 on x ≥ 0:

x^µ → max { 0, x^µ + η (1 − [C x]^µ) }   with 0 < η < 2

For the proof of convergence one can show (a sketch of the update itself follows below):
  • for an arbitrary x ≥ 0 and a KT point x*:  f̃(x*) ≥ f̃(x)
  • f̃(x) is bounded from above in x ≥ 0
  • f̃(x) increases in every cycle through D, unless a KT point has been reached
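A minimal NumPy sketch of the AdaTron iteration stated above (my paraphrase of the update rule, not the authors' original code; learning rate, epoch count and the 1/N rescaling of w are assumptions):

```python
import numpy as np

# Projected gradient ascent on the Wolfe dual: x^mu -> max{0, x^mu + eta*(1 - [Cx]^mu)}.
def adatron(xi, S, eta=1.0, epochs=100):
    P, N = xi.shape
    A = xi * S[:, None]
    C = (A @ A.T) / N                 # correlation matrix C_{nu mu}
    x = np.zeros(P)                   # embedding strengths
    for _ in range(epochs):
        for mu in range(P):           # sequential presentation of the examples
            x[mu] = max(0.0, x[mu] + eta * (1.0 - C[mu] @ x))
    w = A.T @ x / N                   # w = (1/N) sum_mu x^mu xi^mu S^mu (rescaling assumed)
    return w, x
```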

SLIDE 18

Support Vectors

Complementarity condition: x^µ (1 − E^µ) = 0 for all µ, i.e. either

  E^µ = 1 and x^µ ≥ 0   (these examples, the support vectors, have to be embedded), or
  E^µ > 1 and x^µ = 0   (these examples are stabilized “automatically”).

The weights w ∝ Σ_µ x^µ ξ^µ S^µ depend (explicitly) only on a subset of D, the support vectors.
If these support vectors were known in advance, training could be restricted to the subset
(unfortunately they are not...).
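A tiny follow-up to the AdaTron sketch above (my illustration): after training, the support vectors can be read off as the examples with non-zero embedding strength.

```python
import numpy as np

# Support vectors are the examples that end up with x^mu > 0 (complementarity then forces E^mu = 1).
def support_vectors(x, tol=1e-10):
    return np.flatnonzero(np.asarray(x) > tol)
```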

SLIDE 19

learning in version space?

Learning in version space (including max. stability) is only possible if
  • the data set is linearly separable

... and even then, it only makes sense if
  • the unknown rule is a linearly separable function
  • the data set is reliable (noise-free)

[figure: sketches of a linearly separable data set, a non-linear class boundary, and noisy data (?)]

SLIDE 20

Classification beyond linear separability

Assume the data set is not linearly separable (potential reasons: noisy data, a more complex problem) - what can we do?
  • accept an approximation by a linearly separable function
    → see “pocket algorithm” and large margin with errors
  • construct more complex architectures
    → see “committee and parity machine”

Large margins with errors: admit disagreements w.r.t. the training data, but keep the basic idea of optimal stability:

minimize_{w,β}  (1/2) w² + γ Σ_{µ=1}^{P} β^µ   subject to   E^µ ≥ 1 − β^µ  and  β^µ ≥ 0  for all µ

slack variables β^µ:   β^µ = 0 ↔ E^µ ≥ 1;   β^µ > 0 ↔ E^µ < 1 (includes errors with E^µ < 0)

SLIDE 21

Rewritten in terms of embedding strengths (see above for the notation):

minimize_{x,β}  (1/2) xᵀ C x + γ βᵀ 1   subject to   C x ≥ 1 − β  and  β ≥ 0

Dual problem (elimination of the slack variables!):

maximize_x  − (1/2) xᵀ C x + 1ᵀ x   subject to   0 ≤ x ≤ γ 1

i.e. positive and upper-bounded embedding strengths. The parameter γ
  • limits the growth of x^µ for misclassified data points
  • controls a compromise between the aims of large margin and low error
  • has to be chosen appropriately, e.g. by validation methods (later chapter)

Note: even for linearly separable data the optimum can include misclassifications!
It does not (in general) minimize the number of errors.

SLIDE 22

Example algorithm: AdaTron with errors (projected gradient ascent)

x̃^µ ← x^µ + η (1 − [C x]^µ)    gradient step
x̂^µ ← max { 0, x̃^µ }           enforce non-negative embeddings
x^µ ← min { γ, x̂^µ }            limit embedding strengths to x^µ ≤ γ
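A minimal sketch of the three-step update above (my illustration; η, γ and the fixed epoch count are assumptions):

```python
import numpy as np

# Projected, clipped gradient ascent with upper bound gamma on the embedding strengths.
def adatron_with_errors(xi, S, gamma=1.0, eta=1.0, epochs=100):
    P, N = xi.shape
    A = xi * S[:, None]
    C = (A @ A.T) / N
    x = np.zeros(P)
    for _ in range(epochs):
        for mu in range(P):
            x_tilde = x[mu] + eta * (1.0 - C[mu] @ x)   # gradient step
            x[mu] = min(gamma, max(0.0, x_tilde))       # project onto 0 <= x^mu <= gamma
    w = A.T @ x / N
    return w, x
```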

SLIDE 23

Classification beyond linear separability

Assume the data set is not linearly separable (potential reasons: noisy data, a more complex problem) - what can we do?
  • accept an approximation by a linearly separable function
    → see “pocket algorithm” and large margin with errors
  • construct more complex architectures from perceptron-like units,
    e.g. multilayer networks (universal classifiers, difficult training)
    → see “committee and parity machine”
  • consider ensembles of perceptrons:
    train several student perceptrons with the competing aims that
      – each student should make a small number of errors
      – the perceptrons should differ significantly
    and combine them into an ensemble classifier, e.g. by majority vote

see also: Decision Trees and Forests (lectures by Dalya Baron)

SLIDE 24

  • employ a linear decision boundary, but after a non-linear transformation of the data
    to an M-dim. feature space (M = N is possible, but not required):
    a non-linear transformation Ψ(ξ) together with an M-dim. weight vector W.
    For a given, explicit transformation Ψ, perceptron training can be applied - as in Support Vector Machines.
  • most frequent approach: approximate classification by continuous regression
SLIDE 25

The Support Vector Machine

  • Perceptron of optimal stability: support vectors
  • SVM: non-linear transformation to a high-dim. feature space
  • implicit kernel formulation, Mercer’s theorem

history: www.svms.org

SLIDE 26

Basic idea: assume the data set is not linearly separable - what can we do?
  • accept an approximation by a linearly separable function (limited flexibility and usefulness)
  • construct more complex architectures from perceptron units, e.g. multilayer networks
    (universal approximators, difficult training)
  • generate a non-linear decision surface for the original data:
    employ a linear decision boundary, but after a non-linear transformation of the data,

    H = sign [ W · Ψ(ξ^µ) ],   ξ ∈ ℝ^N → Ψ(ξ) ∈ ℝ^M,   with weights W ∈ ℝ^M

    (in general M ≠ N, mostly M > N)

SVM: transformation with M > N to a high-dim. feature space.

An illustrative example (c/o R. Dietrich, PhD thesis): consider original, two-dimensional data (x1, x2)
and the non-linearly transformed data Ψ(x1, x2); a classification that is non-separable in ℝ² becomes
linearly separable in ℝ³.

SLIDE 27

(continued) Generate a non-linear decision surface for the original data: employ a linear decision
boundary after the non-linear transformation

H = sign [ W · Ψ(ξ^µ) ],   ξ ∈ ℝ^N → Ψ(ξ) ∈ ℝ^M,   weights W ∈ ℝ^M.

Illustrative example (c/o R. Dietrich, PhD thesis): for two-dimensional data (x1, x2), take the
non-linear transformation

Ψ(x1, x2) = ( x1², √2 x1 x2, x2² ) ∈ ℝ³.

SLIDE 28

(continued) Illustrative example (c/o R. Dietrich, PhD thesis): consider original, two-dimensional
data (x1, x2) and the non-linearly transformed data

Ψ(x1, x2) = ( x1², √2 x1 x2, x2² ) ∈ ℝ³.

A linearly separable classification in ℝ³, e.g.

S^µ = sign ( W · Ψ(x1, x2) )   with   W = (1, 1, −1),

corresponds to a classification that is non-separable in ℝ² but becomes linearly separable in ℝ³ (see the sketch below).
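A small numerical illustration of this example (my addition; the random data set is assumed, only the feature map Ψ and W = (1, 1, −1) are taken from the slide):

```python
import numpy as np

# Labels defined by W = (1, 1, -1) in feature space Psi(x1,x2) = (x1^2, sqrt(2)*x1*x2, x2^2)
# give a non-linear boundary in R^2, but a plane through the origin, i.e. linear
# separability, in R^3.
def psi(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))           # two-dimensional data (x1, x2)
W = np.array([1.0, 1.0, -1.0])
S = np.sign(psi(X) @ W)                 # S^mu = sign( W . Psi(x1, x2) )

# in R^3 the two classes are separated by the plane W . z = 0 by construction:
print(np.all(S * (psi(X) @ W) > 0))     # True
```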

SLIDE 29

Assume the transformation guarantees linear separability of { Ψ(ξ^µ), S^µ }
→ a vector W exists with S_H^µ = sign ( W · Ψ(ξ^µ) ) for all µ.

Optimal stability:

maximize_W  κ(W)   where   κ(W) = min_µ { κ^µ = W · Ψ(ξ^µ) S^µ / |W| }

Exact same structure as the original perceptron problem – all of the above results from
optimization theory apply accordingly.

SLIDE 30

Assume the transformation guarantees linear separability of { Ψ(ξ^µ), S^µ }
→ a vector W exists with S_H^µ = sign ( W · Ψ(ξ^µ) ) for all µ.

Optimal stability:

maximize_W  κ(W)   where   κ(W) = min_µ { κ^µ = W · Ψ(ξ^µ) S^µ / |W| }

Exact same structure as the original perceptron problem – all of the above results from
optimization theory apply accordingly.

Re-formulate:

minimize_X  (1/2) Xᵀ Γ X   subject to   Γ X ≥ 1,

here with

W = (1/M) Σ_{µ=1}^{P} X^µ Ψ(ξ^µ) S^µ,    Γ_{µν} = (1/M) S^µ Ψ(ξ^µ) · Ψ(ξ^ν) S^ν,    W² = (1/M) Xᵀ Γ X.

SLIDE 31

Kernel formulation

Consider the function K: ℝ^N × ℝ^N → ℝ with

K(ξ^µ, ξ^ν) = (1/M) Ψ(ξ^µ) · Ψ(ξ^ν)

and re-write everything in terms of this kernel function:

  • the classification scheme:

    S_H(ξ) = sign ( W · Ψ(ξ) ) = sign ( Σ_{µ=1}^{P} X^µ S^µ Ψ(ξ^µ) · Ψ(ξ) ) = sign ( Σ_{µ=1}^{P} X^µ S^µ K(ξ^µ, ξ) )

  • training algorithms for the embedding strengths (one example on the next slide):
    – no explicit use of the transformed feature vectors
    – only dot-products are required, which can be expressed in terms of the kernel

SLIDE 32

Kernel formulation (continued)

K(ξ^µ, ξ^ν) = (1/M) Ψ(ξ^µ) · Ψ(ξ^ν),    S_H(ξ) = sign ( Σ_{µ=1}^{P} X^µ S^µ K(ξ^µ, ξ) )

  • training algorithms for the embedding strengths, just one example – the Kernel AdaTron:

    X^µ → max { 0, X^µ + η ( 1 − S^µ Σ_{ν=1}^{P} S^ν X^ν K(ξ^µ, ξ^ν) ) }

    – no explicit use of the transformed feature vectors
    – only dot-products are required, which can be expressed in terms of the kernel

SLIDE 33

Kernel formulation (continued)

Kernel AdaTron:

X^µ → max { 0, X^µ + η ( 1 − S^µ Σ_{ν=1}^{P} S^ν X^ν K(ξ^µ, ξ^ν) ) }

  – no explicit use of the transformed feature vectors Ψ(ξ)
  – only dot-products are required, which can be expressed in terms of the kernel (see the sketch below)
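A minimal sketch of the Kernel AdaTron update stated above (my paraphrase; the function names, learning rate and fixed epoch count are assumptions):

```python
import numpy as np

# Kernel AdaTron: X^mu -> max{0, X^mu + eta*(1 - S^mu * sum_nu S^nu X^nu K(xi^mu, xi^nu))}.
def kernel_adatron(xi, S, kernel, eta=1.0, epochs=100):
    P = len(S)
    K = np.array([[kernel(xi[m], xi[n]) for n in range(P)] for m in range(P)])
    X = np.zeros(P)                       # embedding strengths
    for _ in range(epochs):
        for mu in range(P):
            grad = 1.0 - S[mu] * np.sum(S * X * K[mu])
            X[mu] = max(0.0, X[mu] + eta * grad)
    return X

def predict(x_new, xi, S, X, kernel):
    """S_H(x) = sign( sum_mu X^mu S^mu K(xi^mu, x) )."""
    return np.sign(sum(X[m] * S[m] * kernel(xi[m], x_new) for m in range(len(S))))
```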

SLIDE 34

So far: define a non-linear Ψ(ξ) ∈ ℝ^M, then find the corresponding kernel function K(ξ^µ, ξ^ν).

Now: as we will never use Ψ(ξ) explicitly, why not start by defining a kernel function in the first place?
For practical purposes, we need not know Ψ nor its dimension M.

Question: does a given kernel K correspond to some valid transformation Ψ? (→ next slide: Mercer's theorem, a sufficient condition)

SLIDE 35

So far: define a non-linear Ψ(ξ) ∈ ℝ^M, then find the corresponding kernel function K(ξ^µ, ξ^ν).
Now: as we will never use Ψ(ξ) explicitly, why not start by defining a kernel function in the first place?
For practical purposes, we need not know Ψ nor its dimension M.

Question: does a given kernel K correspond to some valid transformation Ψ?

Mercer’s Theorem (sufficient condition): a given kernel function K can be written as
K(ξ^µ, ξ^ν) = Ψ(ξ^µ) · Ψ(ξ^ν), if

∫∫ g(ξ^µ) K(ξ^µ, ξ^ν) g(ξ^ν) d^Nξ^µ d^Nξ^ν ≥ 0

holds true for all functions g with finite norm ∫ g(ξ)² d^Nξ < ∞.

SLIDE 36

Popular classes of kernels (which satisfy Mercer’s condition)

  • polynomial kernels of degree (up to) q, e.g.

    linear kernel
    = perceptron with a threshold in the original space
SLIDE 37

Popular classes of kernels (which satisfy Mercer’s condition)

  • polynomial kernels of degree (up to) q, e.g.

    linear kernel
    = perceptron with a threshold in the original space

    quadratic kernel
    → perceptron with respect to feature vectors containing all single features and all products of 2 original features
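A small check of the quadratic case (my illustration; the slide's explicit kernel formulas are not in the extracted text): the homogeneous quadratic kernel K(a, b) = (a · b)² reproduces the dot product of the feature vectors Ψ(x1, x2) = (x1², √2 x1 x2, x2²) from the earlier example, which is exactly the "products of 2 original features" interpretation.

```python
import numpy as np

# Verify (a.b)^2 == Psi(a).Psi(b) for Psi(x1,x2) = (x1^2, sqrt(2) x1 x2, x2^2).
def psi(x):
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def quadratic_kernel(a, b):
    return (a @ b) ** 2

rng = np.random.default_rng(2)
a, b = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(quadratic_kernel(a, b), psi(a) @ psi(b)))   # True
```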
SLIDE 38

  • Radial basis function (RBF) kernel
    involves all powers of the features, “M → ∞”

Attractive aspects of the SVM approach (“kernelized” max. stability algorithms):
  • the optimization problem is uniquely solvable (no local minima)
  • efficient training algorithms are known
  • maximum stability facilitates good generalization ability
    … if the kernel (and its parameters) is (are) appropriately chosen

In practice:
  • select simple kernels, allow for violations of some of the linear constraints
    by means of slack variables (e.g. the kernel version of AdaTron with errors, see above)
  • choose the kernel (kernel parameters) by means of cross-validation procedures
  • use approximate schemes for huge amounts of data (many support vectors)

So much for the “curse of dimensionality” ☺
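The slide's explicit RBF formula is not in the extracted text; a common Gaussian form (my assumption), which can be plugged directly into the kernel_adatron sketch above, is:

```python
import numpy as np

# Assumed Gaussian RBF kernel; its series expansion involves all powers of the features ("M -> infinity").
def rbf_kernel(a, b, sigma=1.0):
    """K(a, b) = exp( -|a - b|^2 / (2 sigma^2) )."""
    diff = np.asarray(a) - np.asarray(b)
    return np.exp(-(diff @ diff) / (2.0 * sigma**2))

# usage with the earlier sketch (names assumed): X = kernel_adatron(xi, S, rbf_kernel)
```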

SLIDE 39