SLIDE 1

Kernels

Course of Machine Learning, Master Degree in Computer Science. Giorgio Gambosi, a.a. 2018-2019

SLIDE 2

Idea

  • Thus far, we have been assuming that each object we deal with can be represented as a fixed-size feature vector x ∈ ℝ^d
  • For certain kinds of objects (text documents, protein sequences, parse trees, etc.) it is not clear how to best represent them in this way
  • 1. First approach: define a generative model of the data (with latent variables) and represent an object by the inferred values of its latent variables
  • 2. Second approach: do not rely on a vector representation at all, but just assume that a similarity measure between objects is defined

SLIDE 3

Representation by pairwise comparison

Idea

  • Define a comparison function κ : χ × χ → ℝ
  • Represent a set of data items x1, . . . , xn by the n × n Gram matrix G such that

    Gij = κ(xi, xj)

  • G is always an n × n matrix, whatever the nature of the data: the same algorithm will work for any type of data (vectors, strings, …); a sketch follows below
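A minimal sketch of this idea (Python, not from the slides; the helper name gram_matrix is ours): the same Gram-matrix construction works for vectors and for strings, only the comparison function changes.

```python
import numpy as np

def gram_matrix(items, kappa):
    """Build the n x n Gram matrix G with G[i, j] = kappa(items[i], items[j])."""
    n = len(items)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = kappa(items[i], items[j])
    return G

# The same code works for vectors...
vectors = [np.array([1.0, 2.0]), np.array([0.0, 1.0]), np.array([3.0, 1.0])]
G_vec = gram_matrix(vectors, lambda a, b: float(a @ b))

# ...and for strings, with a different comparison function.
strings = ["GATTACA", "GATCA", "TACCA"]
common_chars = lambda s, t: len(set(s) & set(t))
G_str = gram_matrix(strings, common_chars)
```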

SLIDE 4

Kernel definition

Given a set χ, a function κ : χ² → ℝ is a kernel on χ if there exists a Hilbert space H (essentially, a vector space with a dot product ·) and a map φ : χ → H such that for all x1, x2 ∈ χ we have κ(x1, x2) = φ(x1) · φ(x2)

We shall consider the particular but common case in which H = ℝ^d for some d > 0, φ(x) = (ϕ1(x), . . . , ϕd(x)) and φ(x1) · φ(x2) = φ(x1)^T φ(x2). φ is called a feature map and H a feature space of κ.

SLIDE 5

Kernel definition

Positive semidefiniteness of κ is a relevant property in this framework.

Positive semidefiniteness. Given a set χ, a function κ : χ² → ℝ is positive semidefinite if for all n ∈ ℕ and all (x1, . . . , xn) ∈ χ^n the corresponding Gram matrix G is positive semidefinite, that is, z^T Gz ≥ 0 for all vectors z ∈ ℝ^n

SLIDE 6

Why is positive semidefiniteness relevant?

Let κ : χ × χ → ℝ. Then κ is a kernel iff for all finite sets {x1, x2, . . . , xn} the corresponding Gram matrix G is symmetric and positive semidefinite.

Only if: if Gij = φ(xi)^T φ(xj), then clearly Gij = Gji. Moreover, for any z ∈ ℝ^n,

z^T Gz = ∑_{i=1}^{n} ∑_{j=1}^{n} zi Gij zj
       = ∑_{i=1}^{n} ∑_{j=1}^{n} zi φ(xi)^T φ(xj) zj
       = ∑_{i=1}^{n} ∑_{j=1}^{n} zi ( ∑_{k=1}^{d} ϕk(xi) ϕk(xj) ) zj
       = ∑_{k=1}^{d} ∑_{i=1}^{n} ∑_{j=1}^{n} zi ϕk(xi) ϕk(xj) zj
       = ∑_{k=1}^{d} ( ∑_{i=1}^{n} zi ϕk(xi) )²
       ≥ 0
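As a side illustration (not in the slides), positive semidefiniteness of a Gram matrix can be checked numerically through its eigenvalues:

```python
import numpy as np

def is_psd(G, tol=1e-10):
    """A symmetric matrix is PSD iff all its eigenvalues are >= 0 (up to tolerance)."""
    return np.all(np.linalg.eigvalsh(G) >= -tol)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))          # 5 items with explicit 3-dimensional features
G = Phi @ Phi.T                        # Gram matrix of the linear kernel
print(is_psd(G))                       # True: G = Phi Phi^T is always PSD

M = rng.normal(size=(5, 5))
M = (M + M.T) / 2                      # a generic symmetric matrix
print(is_psd(M))                       # typically False: not every symmetric matrix is a Gram matrix
```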

SLIDE 7

Why are positive definite kernels relevant?

If: given {x1, x2, . . . , xn}, if G is positive definite it is possible to compute an eigendecomposition

G = U^T ΛU

where Λ is the diagonal matrix of the eigenvalues λi > 0 and U is the corresponding orthogonal matrix of eigenvectors. Denoting by ui the i-th column of U, we have

Gij = (Λ^{1/2} ui)^T (Λ^{1/2} uj)

Then, if we define φ(xi) = Λ^{1/2} ui, we get

κ(xi, xj) = φ(xi)^T φ(xj) = Gij

This result is valid only wrt the domain {x1, x2, . . . , xn}. For the general case, consider n → ∞ (as, for example, in Gaussian processes).
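The "if" direction can be illustrated numerically; a sketch (not from the slides, assuming G is built from explicit features so that it is certainly PSD) that recovers vectors whose dot products reproduce G exactly on the given items:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(4, 2))
G = Phi @ Phi.T                                  # a PSD Gram matrix on 4 items

lam, U = np.linalg.eigh(G)                       # G = U diag(lam) U^T
lam = np.clip(lam, 0.0, None)                    # clip tiny negative values due to round-off
features = U * np.sqrt(lam)                      # row i plays the role of phi(x_i)

print(np.allclose(features @ features.T, G))     # True: phi(x_i)^T phi(x_j) = G_ij
```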

SLIDE 8

Why are positive definite kernels relevant?

Using positive definite kernels makes it possible to apply the kernel trick wherever useful.

Kernel trick. Any algorithm which processes finite-dimensional vectors in such a way that only pairwise dot products are considered can be applied to higher (possibly infinite) dimensional vectors, by replacing each dot product with a suitable application of a positive definite kernel.

  • Many practical applications
  • Vectors in the new space are manipulated only implicitly, through pairwise dot products, computed by evaluating the kernel function on the original pair of vectors

Example: support vector machines. Also, many linear models for regression and classification can be reformulated in terms of a dual representation involving only dot products.

SLIDE 9

Dual representations: example

Regularized sum of squares in regression with a predefined basis function φ(x):

J(w) = (1/2) ∑_{i=1}^{n} ( w^T φ(xi) − ti )² + (λ/2) w^T w = (1/2) (Φw − t)^T (Φw − t) + (λ/2) w^T w

where, by definition of Φ ∈ ℝ^{n×d}, Φij = ϕj(xi).

Setting ∂J(w)/∂w = 0, the resulting solution is

ŵ = (Φ^T Φ + λI_d)^{-1} Φ^T t = Φ^T (ΦΦ^T + λI_n)^{-1} t

since it is possible to prove that for any matrix A ∈ ℝ^{r×c} it holds that (A^T A + λI_c)^{-1} A^T = A^T (AA^T + λI_r)^{-1}
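The matrix identity used in the last step is easy to verify numerically; a quick sketch with a random Φ (Python, illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 6, 3, 0.1
Phi = rng.normal(size=(n, d))
t = rng.normal(size=n)

# Primal form: invert the d x d matrix Phi^T Phi + lam I_d
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)
# Dual form: invert the n x n matrix Phi Phi^T + lam I_n
w_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), t)

print(np.allclose(w_primal, w_dual))   # True: the two expressions for w coincide
```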

SLIDE 10

Dual representations: example

If we define the dual variables a = (ΦΦ^T + λI_n)^{-1} t, we get w = Φ^T a. By substituting Φ^T a for w we express the cost function in terms of a instead of w, obtaining a dual formulation of J:

J(a) = (1/2) a^T ΦΦ^T ΦΦ^T a + (1/2) t^T t − a^T ΦΦ^T t + (λ/2) a^T ΦΦ^T a = (1/2) a^T GGa + (1/2) t^T t − a^T Gt + (λ/2) a^T Ga

where G = ΦΦ^T is the Gram matrix, such that by definition

Gij = ∑_{k=1}^{d} ϕk(xi) ϕk(xj) = φ(xi)^T φ(xj)

SLIDE 11

Dual representations: example

Setting the gradient ∂J(a)/∂a = 0, it results that â = (G + λI_n)^{-1} t.

We can use this to make predictions in a different way:

y(x) = w^T φ(x) = a^T Φφ(x) = t^T (G + λI_n)^{-1} Φφ(x) = k(x)^T (G + λI_n)^{-1} t

where

k(x) = Φφ(x) = (φ(x1)^T φ(x), . . . , φ(xn)^T φ(x))^T = (κ(x1, x), . . . , κ(xn, x))^T = (κ1(x), . . . , κn(x))^T

The prediction can thus be expressed in terms of dot products between pairs of mapped vectors φ(·), or equivalently in terms of the kernel function κ(xi, xj) = φ(xi)^T φ(xj)
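Putting the dual formulas together, here is a minimal kernel ridge regression sketch (Python; the Gaussian kernel and the helper names fit_dual and predict are our own choices for illustration, not code from the course):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def fit_dual(X, t, kappa, lam=0.1):
    """Compute a_hat = (G + lam I_n)^{-1} t from the training data."""
    n = len(X)
    G = np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.linalg.solve(G + lam * np.eye(n), t)

def predict(x, X, a, kappa):
    """y(x) = k(x)^T a, with k_i(x) = kappa(x_i, x)."""
    k = np.array([kappa(xi, x) for xi in X])
    return k @ a

# Tiny usage example on 1-d inputs
X = [np.array([v]) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)]
t = np.array([4.1, 0.9, 0.1, 1.2, 3.9])          # roughly t = x^2
a = fit_dual(X, t, gaussian_kernel)
print(predict(np.array([1.5]), X, a, gaussian_kernel))
```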

SLIDE 12

Dual representations: another example

  • As is well known, a perceptron is a linear classifier with prediction y(x) = w^T x
  • Its update rule is: if xi is misclassified, that is w^T xi ti < 0, then w := w + ti xi
  • If we assume a zero initial value for all wk, then w is the sum of all items on which the algorithm has made a mistake, each weighted by its target value times the number of times it has been misclassified
  • We may then define a dual formulation by setting w = ∑_{k=1}^{n} ak xk, which results in the prediction y(x) = ∑_{k=1}^{n} ak xk^T x
  • and in the update rule: if xi is misclassified, that is ti ∑_{k=1}^{n} ak xk^T xi < 0, then ai := ai + ti
  • a kernelized perceptron can then be defined with y(x) = ∑_{k=1}^{n} ak φ(xk)^T φ(x), or equivalently y(x) = ∑_{k=1}^{n} ak κ(xk, x), by just using a positive definite kernel κ (a sketch follows below)
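A minimal kernelized perceptron along these lines (Python sketch, assuming targets ti ∈ {−1, +1}; the helper names are ours, and this is an illustration rather than the course's code):

```python
import numpy as np

def train_kernel_perceptron(X, t, kappa, epochs=10):
    """Dual perceptron: a_i accumulates t_i every time x_i is misclassified."""
    n = len(X)
    a = np.zeros(n)
    K = np.array([[kappa(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            y_i = a @ K[:, i]              # y(x_i) = sum_k a_k kappa(x_k, x_i)
            if t[i] * y_i <= 0:            # misclassified (or on the boundary)
                a[i] += t[i]
    return a

def predict(x, X, a, kappa):
    return np.sign(sum(a_k * kappa(x_k, x) for a_k, x_k in zip(a, X)))

# Usage: XOR-like data, separable with a quadratic kernel but not linearly
quad = lambda u, v: (u @ v + 1.0) ** 2
X = [np.array(p, dtype=float) for p in ((0, 0), (1, 1), (0, 1), (1, 0))]
t = np.array([-1, -1, 1, 1])
a = train_kernel_perceptron(X, t, quad, epochs=200)
print([predict(x, X, a, quad) for x in X])   # should reproduce the training labels
```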

SLIDE 13

Kernelization: one more example

  • The nearest-neighbor (k-NN) classifier selects the label of the nearest neighbor(s); assume the Euclidean distance is used:

    ||xi − xj||² = xi^T xi + xj^T xj − 2 xi^T xj

  • We can now replace the dot products by a valid positive definite kernel and we obtain:

    d(xi, xj)² = κ(xi, xi) + κ(xj, xj) − 2κ(xi, xj)

  • This is a kernelized nearest-neighbor classifier
  • We never explicitly compute the feature vectors (see the sketch below)
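A sketch of the kernelized distance and of a 1-nearest-neighbor prediction built on it (Python, illustration only; helper names are ours):

```python
import numpy as np

def kernel_distance_sq(xi, xj, kappa):
    """Squared distance in feature space, computed through the kernel only."""
    return kappa(xi, xi) + kappa(xj, xj) - 2 * kappa(xi, xj)

def nn_predict(x, X, t, kappa):
    """Label of the training item nearest to x in the (implicit) feature space."""
    d2 = [kernel_distance_sq(x, xi, kappa) for xi in X]
    return t[int(np.argmin(d2))]

# Usage with a Gaussian kernel on 1-d points
gauss = lambda u, v: np.exp(-np.sum((u - v) ** 2) / 2.0)
X = [np.array([0.0]), np.array([1.0]), np.array([5.0])]
t = ["a", "a", "b"]
print(nn_predict(np.array([4.0]), X, t, gauss))   # 'b': nearest to 5.0
```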

Why refer to the dual representation?

  • While in the original formulation of linear regression w can be derived by inverting the m × m matrix Φ^T Φ, in the dual formulation computing a requires inverting the n × n matrix G + λI_n
  • Since usually n ≫ m, this seems to lead to a loss of efficiency
  • However, the dual approach makes it possible to refer only to the kernel function, and not to the set of basis functions: this makes it possible to implicitly use a feature space of very high dimension (much larger than m, even infinite)

SLIDE 14

Dealing with kernels

Since not all functions κ : χ × χ → ℝ are positive definite kernels, some method to define them must be applied.

  • the straightforward way is just to define a basis function φ and set κ(x1, x2) = φ(x1)^T φ(x2). κ is then a positive definite kernel, since
  • 1. φ(x1)^T φ(x2) = φ(x2)^T φ(x1)
  • 2. ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj κ(xi, xj) = ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj φ(xi)^T φ(xj) = ||∑_{i=1}^{n} ci φ(xi)||² ≥ 0

SLIDE 15

Dealing with kernels

  • a second method defines a possible kernel function κ directly: in order to ensure that such a function is a valid kernel, apply Mercer's theorem and prove that κ is a positive definite kernel by showing that it is symmetric and that the corresponding Gram matrix G is positive semidefinite for all possible sets of items. In this case we do not define φ

SLIDE 16

A simple positive definite kernel

Let χ = ℝ: the function κ : ℝ² → ℝ defined as κ(x1, x2) = x1 x2 is a positive definite kernel. In fact,

  • x1 x2 = x2 x1
  • ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj κ(xi, xj) = ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj xi xj = ( ∑_{i=1}^{n} ci xi )² ≥ 0

SLIDE 17

Another simple positive definite kernel

Let χ = ℝ^d: the function κ : χ² → ℝ defined as κ(x1, x2) = x1^T x2 is a positive definite kernel. In fact,

  • x1^T x2 = x2^T x1
  • ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj κ(xi, xj) = ∑_{i=1}^{n} ∑_{j=1}^{n} ci cj xi^T xj = ||∑_{i=1}^{n} ci xi||² ≥ 0

SLIDE 18

Dealing with kernels

  • a third method again defines a possible kernel function κ directly: in order to ensure that such a function is a valid kernel, a basis function φ must be found such that κ(x1, x2) = φ(x1)^T φ(x2) for all x1, x2

SLIDE 19

Example

A polynomial kernel in 2 dimensions: for x = (x1, x2), let φ(x) = (x1², √2 x1 x2, x2²). Then, writing x1 = (x11, x12) and x2 = (x21, x22),

κ(x1, x2) = (x11², √2 x11 x12, x12²)^T (x21², √2 x21 x22, x22²)
          = x11² x21² + 2 x11 x12 x21 x22 + x12² x22²
          = (x1^T x2)²

Example. If x1, x2 ∈ ℝ^d, define κ(x1, x2) = (x1 · x2)² = φ(x1)^T φ(x2), where

φ(x) = (x1², . . . , xd², x1 x2, . . . , x1 xd, x2 x1, . . . , xd xd−1)^T

SLIDE 20

Example

κ(x1, x2) = (x1 · x2)² is a valid kernel function, since (for x1, x2 ∈ ℝ²)

κ(x1, x2) = (x11 x21 + x12 x22)² = x11² x21² + x12² x22² + 2 x11 x12 x21 x22
          = (x11², x12², x11 x12, x11 x12) · (x21², x22², x21 x22, x21 x22)
          = φ(x1) · φ(x2)

The corresponding basis function is thus φ(x) = (x1², x2², x1 x2, x1 x2)^T.

SLIDE 21

Example

  • In general, if x1, x2 ∈ ℝ^d, then κ(x1, x2) = (x1 · x2)² = φ(x1)^T φ(x2), where φ(x) = (x1², . . . , xd², x1 x2, . . . , x1 xd, x2 x1, . . . , xd xd−1)^T
  • the d-dimensional input space is mapped onto a space of dimension m = d²
  • observe that computing κ(x1, x2) directly requires time O(d), while deriving it from φ(x1)^T φ(x2) requires O(d²) steps (see the sketch below)
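The equivalence, and the cost difference, are easy to check numerically; a sketch for d-dimensional inputs (Python, illustration only):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: all products x_i * x_j (d^2 components)."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(3)
x1, x2 = rng.normal(size=5), rng.normal(size=5)

k_direct = (x1 @ x2) ** 2              # O(d) work
k_explicit = phi(x1) @ phi(x2)         # O(d^2) work in the feature space

print(np.allclose(k_direct, k_explicit))   # True
```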

SLIDE 22

Example

κ(x1, x2) = (x1 · x2 + c)² is a kernel function, since

κ(x1, x2) = (x1 · x2 + c)² = ∑_{i=1}^{d} ∑_{j=1}^{d} x1i x1j x2i x2j + ∑_{i=1}^{d} (√(2c) x1i)(√(2c) x2i) + c² = φ(x1)^T φ(x2)

for

φ(x) = (x1², . . . , xd², x1 x2, . . . , x1 xd, x2 x1, . . . , xd xd−1, √(2c) x1, . . . , √(2c) xd, c)^T

This implies a mapping from a d-dimensional to a (d + 1)²-dimensional space.

SLIDE 23

Example

κ(x1, x2) = (x1 · x2 + c)^t is a kernel function corresponding to a mapping from a d-dimensional space to a space of dimension

m = ∑_{i=0}^{t} d^i = (d^{t+1} − 1) / (d − 1)

corresponding to all products xi1 xi2 . . . xil with 0 ≤ l ≤ t. Observe that, even if the feature space has dimension O(d^t), evaluating the kernel function requires just time O(d).

SLIDE 24

Constructing kernels from kernels

More complex kernels can be derived from simpler ones by applying suitable transformation and composition rules. In fact, given kernel functions κ1(x1, x2) and κ2(x1, x2), the function κ(x1, x2) is a kernel in all the following cases (a coded sketch of these rules follows below):

  • κ(x1, x2) = e^{κ1(x1, x2)}
  • κ(x1, x2) = κ1(x1, x2) + κ2(x1, x2)
  • κ(x1, x2) = κ1(x1, x2) κ2(x1, x2)
  • κ(x1, x2) = c κ1(x1, x2), for any c > 0
  • κ(x1, x2) = x1^T A x2, with A positive definite
  • κ(x1, x2) = f(x1) κ1(x1, x2) f(x2), for any f : χ → ℝ
  • κ(x1, x2) = p(κ1(x1, x2)), for any polynomial p : ℝ → ℝ with non-negative coefficients
  • κ(x1, x2) = κ3(φ(x1), φ(x2)), for any vector φ of m functions ϕi : ℝ^n → ℝ and any kernel function κ3(x1, x2) on ℝ^m
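These closure rules translate naturally into code; a small sketch (Python; the helper names scale, add, multiply, exponentiate and conjugate are ours) in which kernels are plain functions and new ones are built by composing them. As a usage example, the Gaussian kernel discussed on the following slides is rebuilt from the rules:

```python
import numpy as np

# Base kernel: the ordinary dot product
linear = lambda x1, x2: float(x1 @ x2)

# Closure rules as higher-order functions
def scale(k, c):        return lambda x1, x2: c * k(x1, x2)               # c * k1, c > 0
def add(k1, k2):        return lambda x1, x2: k1(x1, x2) + k2(x1, x2)     # k1 + k2
def multiply(k1, k2):   return lambda x1, x2: k1(x1, x2) * k2(x1, x2)     # k1 * k2
def exponentiate(k):    return lambda x1, x2: np.exp(k(x1, x2))           # exp(k1)
def conjugate(k, f):    return lambda x1, x2: f(x1) * k(x1, x2) * f(x2)   # f(x1) k1 f(x2)

# Rebuild the Gaussian kernel from the rules (see the derivation two slides ahead)
sigma = 1.0
f = lambda x: np.exp(-float(x @ x) / (2 * sigma ** 2))
gaussian = conjugate(exponentiate(scale(linear, 1 / sigma ** 2)), f)

x1, x2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(np.isclose(gaussian(x1, x2),
                 np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))))     # True
```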

SLIDE 25

Constructing kernel functions

κ(x1, x2) = (x1 · x2 + c)^d is a kernel function. In fact,

  • 1. x1 · x2 = x1^T x2 is a kernel function, corresponding to the basis functions φ = (ϕ1, . . . , ϕn) with ϕi(x) = xi
  • 2. c is a kernel function, corresponding to the basis functions φ = (ϕ1, . . . , ϕn) with ϕi(x) = √(c/n)
  • 3. x1 · x2 + c is a kernel function, since it is the sum of two kernel functions
  • 4. (x1 · x2 + c)^d is a kernel function, since it is a polynomial with non-negative coefficients (in particular p(z) = z^d) of a kernel function
SLIDE 26

Constructing kernel functions

κ(x1, x2) = e^{−||x1 − x2||² / (2σ²)} is a kernel function. In fact,

  • 1. since ||x1 − x2||² = x1^T x1 + x2^T x2 − 2 x1^T x2, it results that

    κ(x1, x2) = e^{−x1^T x1 / (2σ²)} · e^{−x2^T x2 / (2σ²)} · e^{x1^T x2 / σ²}

  • 2. x1^T x2 is a kernel function (see above)
  • 3. then x1^T x2 / σ² is a kernel function, being the product of a kernel function with a constant c = 1/σ²
  • 4. e^{x1^T x2 / σ²} is the exponential of a kernel function, and as a consequence a kernel function itself
  • 5. e^{−x1^T x1 / (2σ²)} · e^{x1^T x2 / σ²} · e^{−x2^T x2 / (2σ²)} is a kernel function, being of the form f(x1) κ'(x1, x2) f(x2) with κ'(x1, x2) = e^{x1^T x2 / σ²} and f(x) = e^{−x^T x / (2σ²)}

SLIDE 27

Relevant kernel functions

  • Polynomial kernel

    κ(x1, x2) = (x1 · x2 + 1)^d

  • Sigmoidal kernel

    κ(x1, x2) = tanh(c1 x1 · x2 + c2)

  • Gaussian kernel

    κ(x1, x2) = exp( −||x1 − x2||² / (2σ²) )

    where σ ∈ ℝ

Observe that a Gaussian kernel can also be derived starting from a nonlinear kernel function κ(x1, x2) instead of x1^T x2. Implementations of these kernels are sketched below.
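For concreteness, direct implementations of the three kernels (Python sketch; the default parameter values are arbitrary choices, not prescribed by the slides):

```python
import numpy as np

def polynomial_kernel(x1, x2, d=2):
    return (float(x1 @ x2) + 1.0) ** d

def sigmoidal_kernel(x1, x2, c1=1.0, c2=0.0):
    # Note: for general c1, c2 the tanh kernel is not guaranteed to be positive semidefinite
    return np.tanh(c1 * float(x1 @ x2) + c2)

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

x1, x2 = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(polynomial_kernel(x1, x2), sigmoidal_kernel(x1, x2), gaussian_kernel(x1, x2))
```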

SLIDE 28

Kernels of structured objects

Kernels are particularly useful when applied to structured objects. Consider the case of strings (for example, sequences of DNA bases or amino acids). Given two strings x1, x2 over the same alphabet A, we can define their similarity to be the number of substrings they have in common. More formally, let ϕs(x) be the number of times substring s occurs in x and let φ(x) be the corresponding vector of such counts over all substrings s: a kernel can then be defined as

κ(x1, x2) = φ(x1)^T φ(x2) = ∑_{s ∈ A*} ws ϕs(x1) ϕs(x2)

where ws ≥ 0 are predefined weights.

SLIDE 29

Kernels of structured objects

If ws = 1 for all considered substrings and we define φ′(x) = φ(x) / ||φ(x)||₂ as a normalized version of φ, we get

κ(x1, x2) = φ(x1)^T φ(x2) / ( ||φ(x1)||₂ ||φ(x2)||₂ )

that is, the well-known cosine similarity measure. Borrowing from information retrieval methods, a better similarity measure can be obtained by defining ϕs(x) through more sophisticated weightings, such as tf-idf, instead of raw occurrence counts.

SLIDE 30

Kernels of structured objects

Special cases:

  • ws = 0 for |s| > 1 gives a bag-of-characters kernel, with ϕc(x) being the number of occurrences of character c in x
  • If only substrings s delimited by white space are considered, we get a bag-of-words kernel
  • If only strings of fixed length |s| = k are considered, we have a k-spectrum kernel (a sketch follows below)

The approach can be extended to the case of trees, in order to deal with, for example, parse or evolutionary trees. More complex kernel construction techniques have also been defined.
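A small sketch of the k-spectrum kernel (Python; it counts every length-k substring, illustration only):

```python
from collections import Counter

def spectrum(x, k):
    """phi(x): counts of every length-k substring of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def k_spectrum_kernel(x1, x2, k=2):
    """kappa(x1, x2) = sum_s phi_s(x1) * phi_s(x2) over shared length-k substrings."""
    c1, c2 = spectrum(x1, k), spectrum(x2, k)
    return sum(c1[s] * c2[s] for s in c1.keys() & c2.keys())

# Number of shared 2-mer occurrences, weighted by their counts in each string
print(k_spectrum_kernel("GATTACA", "TACCACA", k=2))
```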
