SLIDE 1

Nonparametric methods

Course of Machine Learning, Master's Degree in Computer Science, University of Rome “Tor Vergata”. Giorgio Gambosi, a.a. 2018-2019.

SLIDE 2

Probability distribution estimates

  • The statistical approach to classification requires (at least approximate) knowledge of $p(C_i|\mathbf{x})$: an item $\mathbf{x}$ is assigned to the class $C_i$ such that $i = \operatorname{argmax}_k p(C_k|\mathbf{x})$.

  • The same holds in the regression case, where $p(y|\mathbf{x})$ has to be estimated.

SLIDE 3

Probability distribution estimates: hypotheses

What do we assume to know about the class distributions, given a training set $\mathbf{X}, \mathbf{t}$?

  • Case 1. The probabilities $p(\mathbf{x}|C_i)$ are known: an item $\mathbf{x}$ is assigned to the class $C_i$ such that $i = \operatorname{argmax}_j p(C_j|\mathbf{x})$, where $p(C_j|\mathbf{x})$ can be derived through Bayes' rule and the prior probabilities, since $p(C_k|\mathbf{x}) \propto p(\mathbf{x}|C_k)p(C_k)$.

SLIDE 4

Probability distribution estimates: hypotheses

  • Case 2. The type of probability distribution $p(\mathbf{x}|\theta)$ is known: an estimate of the parameter values $\theta_i$ is performed for each class $C_i$, taking into account the subset $\mathbf{X}_i, \mathbf{t}_i$ of items belonging to the class, that is, such that $t = i$. Different approaches to parameter estimation:

  • 1. Maximum likelihood: $\theta_i^{ML} = \operatorname{argmax}_\theta p(\mathbf{X}_i, \mathbf{t}_i|\theta)$ is computed. Item $\mathbf{x}$ is assigned to class $C_i$ if
$$i = \operatorname{argmax}_j p(C_j|\mathbf{x}) = \operatorname{argmax}_j p(\mathbf{x}|\theta_j^{ML})\,p(C_j)$$

  • 2. Maximum a posteriori: $\theta_i^{MAP} = \operatorname{argmax}_\theta p(\theta|\mathbf{X}_i, \mathbf{t}_i)$ is computed. Item $\mathbf{x}$ is assigned to class $C_i$ if
$$i = \operatorname{argmax}_j p(C_j|\mathbf{x}) = \operatorname{argmax}_j p(\mathbf{x}|\theta_j^{MAP})\,p(C_j)$$

  • 3. Bayesian estimate: the distributions $p(\theta|\mathbf{X}_i, \mathbf{t}_i)$ are estimated for each class and, from them, $p(\mathbf{x}|C_i) = \int_\theta p(\mathbf{x}|\theta)\,p(\theta|\mathbf{X}_i, \mathbf{t}_i)\,d\theta$. Item $\mathbf{x}$ is assigned to class $C_i$ if
$$i = \operatorname{argmax}_j p(C_j|\mathbf{x}) = \operatorname{argmax}_j p(C_j)\,p(\mathbf{x}|C_j) = \operatorname{argmax}_j p(C_j)\int_\theta p(\mathbf{x}|\theta)\,p(\theta|\mathbf{X}_j, \mathbf{t}_j)\,d\theta$$

SLIDE 5

Probability distribution estimates: hypotheses

  • Case 3. No knowledge of the probabilities is assumed: the class distributions $p(\mathbf{x}|C_i)$ are estimated directly from data.

  • In the previous cases, (parametric) models are used as a synthetic description of the data in $\mathbf{X}, \mathbf{t}$.

  • In this case, there are no models (and no parameters): training set items appear explicitly in the class distribution estimates.

  • These are denoted as nonparametric models: indeed, an unbounded number of parameters is used.

SLIDE 6

Histograms

  • An elementary type of nonparametric estimate.

  • The domain is partitioned into $m$ $d$-dimensional intervals (bins).

  • The probability $P_x$ that an item belongs to the bin containing item $\mathbf{x}$ is estimated as $\frac{n(\mathbf{x})}{N}$, where $n(\mathbf{x})$ is the number of elements in that bin and $N$ is the overall number of items.

  • The probability density in the interval corresponding to the bin containing $\mathbf{x}$ is then estimated as the ratio between the above probability and the bin width $\Delta(\mathbf{x})$ (typically, a constant $\Delta$):
$$p_H(\mathbf{x}) = \frac{n(\mathbf{x})/N}{\Delta(\mathbf{x})} = \frac{n(\mathbf{x})}{N\Delta(\mathbf{x})}$$

[Figure: histogram density estimates for bin widths ∆ = 0.04, 0.08, 0.25]
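As a concrete illustration, here is a minimal numpy sketch of this estimator for one-dimensional data; the function name, bin count, and toy data are illustrative assumptions, not part of the course material.

```python
import numpy as np

def histogram_density(data, x, n_bins=20):
    """Histogram estimate p_H(x) = n(x) / (N * delta), equal-width bins."""
    lo, hi = data.min(), data.max()
    delta = (hi - lo) / n_bins                    # constant bin width
    counts, _ = np.histogram(data, bins=n_bins, range=(lo, hi))
    # index of the bin containing each query point (clipped to the range)
    idx = np.clip(((x - lo) / delta).astype(int), 0, n_bins - 1)
    return counts[idx] / (data.size * delta)      # n(x) / (N * delta)

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)
print(histogram_density(samples, np.array([0.0, 1.0])))
```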

SLIDE 7

Histograms: problems

  • The density estimate depends on the position of the first bin and, in the case of multivariate data, also on bin orientation.

  • The resulting estimate is not continuous.

  • Curse of dimensionality: the number of bins grows exponentially with the dimension $d$ ($m^d$ for $m$ bins per dimension): in high-dimensional spaces many bins may turn out empty, unless a large number of items is available.

  • In practice, histograms can be applied only to low-dimensional datasets ($d = 1, 2$).

SLIDE 8

Kernel density estimators

  • Probability that an item is in the region $R(\mathbf{x})$ containing $\mathbf{x}$:
$$P_x = \int_{R(\mathbf{x})} p(\mathbf{z})\,d\mathbf{z}$$

  • Given $n$ items $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, the probability that $k$ among them are in $R(\mathbf{x})$ is given by the binomial distribution
$$p(k) = \binom{n}{k} P_x^k (1 - P_x)^{n-k} = \frac{n!}{k!(n-k)!}\, P_x^k (1 - P_x)^{n-k}$$

  • Mean and variance of the ratio $r = \frac{k}{n}$ are
$$E[r] = P_x \qquad \operatorname{var}[r] = \frac{P_x(1 - P_x)}{n}$$

  • $P_x$ is the expected fraction of items in $R(\mathbf{x})$, and the ratio $r$ is an estimate of it. As $n \to \infty$ the variance decreases and $r$ tends to $E[r] = P_x$. Hence, in general, $r = \frac{k}{n} \simeq P_x$

SLIDE 9

Nonparametric estimates

  • Let the volume of $R(\mathbf{x})$ be sufficiently small. Then the density $p(\mathbf{x})$ is almost constant in the region and
$$P_x = \int_{R(\mathbf{x})} p(\mathbf{z})\,d\mathbf{z} \simeq p(\mathbf{x})\,V$$
where $V$ is the volume of $R(\mathbf{x})$

  • Since $P_x \simeq \frac{k}{n}$, it then derives that $p(\mathbf{x}) \simeq \frac{k}{nV}$

SLIDE 10

Approaches to nonparametric estimates

Two alternative ways to exploit the estimate $p(\mathbf{x}) \simeq \frac{k}{nV}$:

  • 1. Fix $V$ and derive $k$ from the data (kernel density estimation)
  • 2. Fix $k$ and derive $V$ from the data ($k$-nearest neighbors)

It can be shown that in both cases, under suitable conditions, the estimator tends to the true density $p(\mathbf{x})$ as $n \to \infty$.

SLIDE 11

Kernel density estimation: Parzen windows

  • Region associated to a point $\mathbf{x}$: a hypercube with edge length $h$ (and volume $h^d$) centered at $\mathbf{x}$.

  • A kernel function $k(\mathbf{u})$ (Parzen window) is used to count the number of items in the unit hypercube centered at the origin $\mathbf{0}$:
$$k(\mathbf{u}) = \begin{cases} 1 & |u_i| \le 1/2, \quad i = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}$$

  • As a consequence, $k\left(\frac{\mathbf{x} - \mathbf{x}'}{h}\right) = 1$ iff $\mathbf{x}'$ is in the hypercube of edge length $h$ centered at $\mathbf{x}$.

  • The number of items in the hypercube is then
$$K = \sum_{i=1}^{n} k\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$$

SLIDE 12

Kernel density estimation: Parzen windows

  • The estimated density is
$$p(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, k\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$$

  • Since $k(\mathbf{u}) \ge 0$ and $\int k(\mathbf{u})\,d\mathbf{u} = 1$, it derives that
$$k\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) \ge 0 \qquad \int k\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) d\mathbf{x} = h^d$$
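A minimal numpy sketch of this hypercube estimator follows; the function name and the toy data are assumptions.

```python
import numpy as np

def parzen_density(X, x, h):
    """p(x) = (1/n) * sum_i h^{-d} k((x - x_i)/h), hypercube kernel k."""
    n, d = X.shape
    u = (x - X) / h                               # (n, d) scaled differences
    inside = np.all(np.abs(u) <= 0.5, axis=1)     # k(u) = 1 iff in unit cube
    return inside.sum() / (n * h ** d)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
print(parzen_density(X, np.zeros(2), h=0.5))
```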

SLIDE 13

Kernel density estimation: Parzen windows

As a consequence, it results that $p(\mathbf{x})$ is a probability density. In fact,
$$p(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, k\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) \ge 0$$
and
$$\int p(\mathbf{x})\,d\mathbf{x} = \int \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, k\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) d\mathbf{x} = \frac{1}{nh^d} \sum_{i=1}^{n} \int k\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) d\mathbf{x} = \frac{1}{nh^d}\, n h^d = 1$$
Clearly, the window size has a relevant effect on the estimate.

SLIDE 14

Kernel density estimation: Parzen windows

[Figure: Parzen-window estimates for h = ε, h = 1, h = 2]

SLIDE 15

Kernels and smoothing

Drawbacks:

  • 1. discontinuity of the estimate
  • 2. items in a region centered at $\mathbf{x}$ have uniform weights: their distance from $\mathbf{x}$ is not taken into account

  • Solution. Use smooth kernel functions $\kappa_h(u)$ to assign larger weights to points nearer to the origin. Assumed characteristics of $\kappa_h(u)$:
$$\int \kappa_h(x)\,dx = 1 \qquad \int x\,\kappa_h(x)\,dx = 0 \qquad \int x^2\kappa_h(x)\,dx > 0$$

SLIDE 16

Kernels and smoothing

Usually, kernels are based on smooth radial functions (functions of the distance from the origin):

  • 1. Gaussian: $\kappa(u) = \frac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{u^2}{2\sigma^2}}$, unlimited support
  • 2. Epanechnikov: $\kappa(u) = \frac{3}{4}\left(1 - u^2\right)$ for $|u| \le 1$, limited support
  • 3. ...

[Figure: profiles of the box, Gaussian, and Epanechnikov kernels]

Resulting estimate:
$$p(\mathbf{x}) = \frac{1}{nh} \sum_{i=1}^{n} \kappa\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) = \frac{1}{n} \sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)$$
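A minimal numpy sketch of this smooth estimate with a Gaussian kernel (σ = 1), evaluated for several window sizes to illustrate the effect of h shown in the next slides; names, grid, and toy data are assumptions.

```python
import numpy as np

def kde_gaussian(X, x, h):
    """p(x) = (1/(n h)) * sum_i kappa((x - x_i)/h), Gaussian kappa."""
    u = (x[:, None] - X[None, :]) / h             # (queries, n) differences
    kappa = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kappa.sum(axis=1) / (X.size * h)

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(1, 1.0, 300)])
grid = np.linspace(-4, 4, 9)
for h in (0.25, 0.5, 1.0):                        # the window size matters
    print(h, np.round(kde_gaussian(X, grid, h), 3))
```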

SLIDE 17

Kernels and smoothing

[Figure: kernel density estimate p(x) with h = 1]

SLIDE 18

Kernels and smoothing

[Figure: kernel density estimate p(x) with h = 2]

SLIDE 19

Kernels and smoothing

[Figure: kernel density estimate p(x) with h = 0.5]

SLIDE 20

Kernels and smoothing

[Figure: kernel density estimate p(x) with h = 0.25]

SLIDE 21

Kernel regression

Kernel smoothing methods can also be applied to regression: in this case, the value corresponding to an item $\mathbf{x}$ is predicted by referring to items in the training set (in particular, to the items closest to $\mathbf{x}$). The conditional expectation
$$f(\mathbf{x}) = E[y|\mathbf{x}] = \int y\, p(y|\mathbf{x})\,dy = \int y\, \frac{p(\mathbf{x}, y)}{p(\mathbf{x})}\, dy = \frac{\int y\, p(\mathbf{x}, y)\,dy}{p(\mathbf{x})} = \frac{\int y\, p(\mathbf{x}, y)\,dy}{\int p(\mathbf{x}, y)\,dy}$$
should be returned. Applying kernels, we have
$$p(\mathbf{x}, y) \approx \frac{1}{n} \sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)\,\kappa_h(y - t_i)$$

SLIDE 22

Kernel regression

This results in
$$f(\mathbf{x}) = \frac{\int y\, \frac{1}{n}\sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)\,\kappa_h(y - t_i)\,dy}{\int \frac{1}{n}\sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)\,\kappa_h(y - t_i)\,dy} = \frac{\sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)\int y\,\kappa_h(y - t_i)\,dy}{\sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)\int \kappa_h(y - t_i)\,dy}$$
and, since $\int \kappa_h(y - t_i)\,dy = 1$ and $\int y\,\kappa_h(y - t_i)\,dy = t_i$, we get
$$f(\mathbf{x}) = \frac{\sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)\,t_i}{\sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)}$$

SLIDE 23

Kernel regression

By setting
$$w_i(\mathbf{x}) = \frac{\kappa_h(\mathbf{x} - \mathbf{x}_i)}{\sum_{j=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_j)}$$
we can write
$$f(\mathbf{x}) = \sum_{i=1}^{n} w_i(\mathbf{x})\,t_i$$
that is, the predicted value is computed as a linear combination of all target values, weighted by kernels (the Nadaraya-Watson model).
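A minimal numpy sketch of the Nadaraya-Watson predictor with a Gaussian kernel; the function name and the toy data are assumptions.

```python
import numpy as np

def nadaraya_watson(X, t, x, h):
    """f(x) = sum_i w_i(x) t_i with normalized kernel weights w_i(x)."""
    k = np.exp(-0.5 * ((x[:, None] - X[None, :]) / h) ** 2)
    w = k / k.sum(axis=1, keepdims=True)          # normalized weights
    return w @ t

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80))
t = np.sin(X) + rng.normal(0, 0.2, X.size)        # noisy targets
grid = np.linspace(0, 6, 7)
print(np.round(nadaraya_watson(X, t, grid, h=0.3), 2))
```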

SLIDE 24

Locally weighted regression

In the Nadaraya-Watson model, the prediction is performed by means of a weighted combination of constant values (the target values in the training set). Locally weighted regression improves on that approach by referring to a weighted version of the sum-of-squared-differences loss function used in regression. If a value $y$ has to be predicted for a given item $\mathbf{x}$, a "local" version of the loss function is considered, with weights dependent on the "distance" between $\mathbf{x}$ and $\mathbf{x}_i$:
$$L(\mathbf{x}) = \sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)\,(\mathbf{w}^T \mathbf{x}_i - t_i)^2$$

SLIDE 25

Locally weighted regression

An instance of the weighted regression problem must be solved:
$$\hat{\mathbf{w}}(\mathbf{x}) = \operatorname{argmin}_{\mathbf{w}} \sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)\,(\mathbf{w}^T \mathbf{x}_i - t_i)^2$$
which has solution
$$\hat{\mathbf{w}}(\mathbf{x}) = (\mathbf{X}^T \Psi(\mathbf{x})\mathbf{X})^{-1}\mathbf{X}^T \Psi(\mathbf{x})\,\mathbf{t}$$
where $\Psi(\mathbf{x})$ is a diagonal $n \times n$ matrix with $\Psi(\mathbf{x})_{ii} = \kappa_h(\mathbf{x} - \mathbf{x}_i)$. The prediction is then performed as usual, as $y = \hat{\mathbf{w}}(\mathbf{x})^T \mathbf{x}$. The same considerations apply if polynomial regression is used.
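A minimal numpy sketch of locally weighted linear regression, solving the weighted normal equations above at a single query point; the added bias feature and the toy data are assumptions.

```python
import numpy as np

def lwr_predict(X, t, x, h):
    """Solve w(x) = (X^T Psi(x) X)^{-1} X^T Psi(x) t, then predict w(x)^T x."""
    Xb = np.column_stack([np.ones(X.size), X])    # add a bias feature
    xb = np.array([1.0, x])
    psi = np.exp(-0.5 * ((x - X) / h) ** 2)       # diagonal of Psi(x)
    A = (Xb.T * psi) @ Xb                         # X^T Psi(x) X
    b = (Xb.T * psi) @ t                          # X^T Psi(x) t
    return xb @ np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 80))
t = np.sin(X) + rng.normal(0, 0.2, X.size)
print(round(lwr_predict(X, t, x=3.0, h=0.5), 3))
```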

SLIDE 26

Density estimation through kNN

  • The region around $\mathbf{x}$ is extended so as to include $k$ items.

  • The estimated density is
$$p(\mathbf{x}) \simeq \frac{k}{nV} = \frac{k}{n\, c_d\, r_k^d(\mathbf{x})}$$
where:

  • $c_d$ is the volume of the $d$-dimensional sphere of unit radius
  • $r_k(\mathbf{x})$ is the distance from $\mathbf{x}$ to the $k$-th nearest item (the radius of the smallest sphere with center $\mathbf{x}$ containing $k$ items)
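A minimal numpy sketch of this estimator; the unit-ball volume c_d is computed from the Gamma function, and names and toy data are assumptions.

```python
import numpy as np
from math import gamma, pi

def knn_density(X, x, k):
    """p(x) ~ k / (n * c_d * r_k(x)^d), with c_d the unit-ball volume."""
    n, d = X.shape
    r_k = np.sort(np.linalg.norm(X - x, axis=1))[k - 1]  # k-th NN distance
    c_d = pi ** (d / 2) / gamma(d / 2 + 1)               # unit d-ball volume
    return k / (n * c_d * r_k ** d)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
print(round(knn_density(X, np.zeros(2), k=10), 3))
```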

SLIDE 27

Classification through kNN

  • To classify $\mathbf{x}$, let us consider a hypersphere of volume $V$ with center $\mathbf{x}$ containing $k$ items from the training set.

  • Let $k_i$ be the number of such items belonging to class $C_i$. Then the following approximation holds:
$$p(\mathbf{x}|C_i) = \frac{k_i}{n_i V}$$
where $n_i$ is the number of items in the training set belonging to class $C_i$.

  • Similarly, for the evidence,
$$p(\mathbf{x}) = \frac{k}{nV}$$

  • And, for the prior distribution,
$$p(C_i) = \frac{n_i}{n}$$

  • The class posterior distribution is then
$$p(C_i|\mathbf{x}) = \frac{p(\mathbf{x}|C_i)\,p(C_i)}{p(\mathbf{x})} = \frac{\frac{k_i}{n_i V} \cdot \frac{n_i}{n}}{\frac{k}{nV}} = \frac{k_i}{k}$$

SLIDE 28

Classification through kNN

  • Simple rule: an item is classified on the basis of its similarity to nearby training set items.

  • To classify $\mathbf{x}$, determine the $k$ items in the training set nearest to it and assign $\mathbf{x}$ to the majority class among them (see the sketch after the figure below).

  • A metric is necessary to measure similarity.

[Figure: kNN decision regions on a two-class dataset for K = 1, K = 3, K = 31]
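A minimal numpy sketch of the majority-vote rule with a Euclidean metric; names and toy data are assumptions.

```python
import numpy as np

def knn_classify(X, labels, x, k):
    """Assign x to the majority class among its k nearest neighbors."""
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    classes, counts = np.unique(labels[nearest], return_counts=True)
    return classes[np.argmax(counts)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(knn_classify(X, labels, np.array([2.0, 2.0]), k=5))
```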

SLIDE 29

Classification through kNN

  • kNN is a simple classifier that can work quite well, provided it is given a good distance metric and enough labeled training data: it can be shown that its performance comes within a factor of 2 of the best possible as $n \to \infty$.

  • It is subject to the curse of dimensionality: due to the sparseness of data in high-dimensional spaces, the items considered by kNN can be quite far away from the query point, resulting in poor locality.

SLIDE 30

Local logistic regression

The same approach applied in the case of local regression can be applied to classification, by defining a weighted loss function with weights dependent on the item whose target must be predicted. In this case, a weighted version of the log-likelihood (the negative of the cross-entropy) is considered, which has to be maximized:
$$L(\mathbf{x}) = \sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)\,\big(t_i \log p_i + (1 - t_i)\log(1 - p_i)\big)$$
with $p_i = \sigma(\mathbf{w}^T \mathbf{x}_i)$, as usual. A suitable modification of the IRLS algorithm for logistic regression can be applied here to compute
$$\hat{\mathbf{w}}(\mathbf{x}) = \operatorname{argmax}_{\mathbf{w}} \sum_{i=1}^{n} \kappa_h(\mathbf{x} - \mathbf{x}_i)\,\big(t_i \log p_i + (1 - t_i)\log(1 - p_i)\big)$$
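A minimal sketch of local logistic regression; plain gradient ascent on the weighted log-likelihood stands in for the modified IRLS update mentioned above, and all names, settings, and data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_logreg(X, t, x, h, lr=0.05, steps=200):
    """Gradient ascent on sum_i kappa_h(x - x_i) * log-likelihood_i."""
    kappa = np.exp(-0.5 * np.sum(((x - X) / h) ** 2, axis=1))
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        w += lr * X.T @ (kappa * (t - p))         # kernel-weighted gradient
    return sigmoid(x @ w)                         # estimated p(C_1 | x)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
t = np.array([0.0] * 50 + [1.0] * 50)
print(round(local_logreg(X, t, np.array([1.5, 1.5]), h=2.0), 3))
```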

SLIDE 31

Some properties of Gaussian distributions

In order to introduce Gaussian processes and how they can be exploited for regression, let us first provide a short reminder of some properties of multivariate Gaussian distributions. Consider a random vector $\mathbf{x} = (x_1, \ldots, x_n)^T$ with $p(\mathbf{x}) = N(\boldsymbol{\mu}, \Sigma)$ and let $\mathbf{x} = (\mathbf{x}_A, \mathbf{x}_B)$ be a partition of the components of $\mathbf{x}$ such that:

  • $\mathbf{x}_A = (x_1, \ldots, x_r)^T$
  • $\mathbf{x}_B = (x_{r+1}, \ldots, x_n)^T$

Then both $\boldsymbol{\mu}$ and $\Sigma$ can be partitioned as
$$\boldsymbol{\mu} = (\boldsymbol{\mu}_A, \boldsymbol{\mu}_B)^T \qquad \Sigma = \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{AB}^T & \Sigma_{BB} \end{pmatrix}$$
Clearly, $\boldsymbol{\mu}_A \in \mathbb{R}^r$, $\boldsymbol{\mu}_B \in \mathbb{R}^{n-r}$, $\Sigma_{AA} \in \mathbb{R}^{r \times r}$, $\Sigma_{BB} \in \mathbb{R}^{(n-r) \times (n-r)}$, $\Sigma_{AB} \in \mathbb{R}^{r \times (n-r)}$

SLIDE 32

Some properties of Gaussian distributions

Properties of $\mathbf{x}_A, \mathbf{x}_B$:

  • Marginal densities are Gaussian, with
$$p(\mathbf{x}_A) = N(\boldsymbol{\mu}_A, \Sigma_{AA}) \qquad p(\mathbf{x}_B) = N(\boldsymbol{\mu}_B, \Sigma_{BB})$$

  • Conditional densities are Gaussian, with
$$p(\mathbf{x}_A|\mathbf{x}_B) = N(\boldsymbol{\mu}_A + \Sigma_{AB}\Sigma_{BB}^{-1}(\mathbf{x}_B - \boldsymbol{\mu}_B),\; \Sigma_{AA} - \Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA})$$
$$p(\mathbf{x}_B|\mathbf{x}_A) = N(\boldsymbol{\mu}_B + \Sigma_{BA}\Sigma_{AA}^{-1}(\mathbf{x}_A - \boldsymbol{\mu}_A),\; \Sigma_{BB} - \Sigma_{BA}\Sigma_{AA}^{-1}\Sigma_{AB})$$
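A minimal numpy sketch of Gaussian conditioning, implementing the formula for p(x_A|x_B) above; the example covariance matrix is an illustrative assumption.

```python
import numpy as np

def gaussian_conditional(mu, Sigma, idx_A, idx_B, x_B):
    """Mean and covariance of p(x_A | x_B), per the formulas above."""
    S_AA = Sigma[np.ix_(idx_A, idx_A)]
    S_AB = Sigma[np.ix_(idx_A, idx_B)]
    S_BB = Sigma[np.ix_(idx_B, idx_B)]
    gain = S_AB @ np.linalg.inv(S_BB)             # Sigma_AB Sigma_BB^{-1}
    mu_c = mu[idx_A] + gain @ (x_B - mu[idx_B])
    Sigma_c = S_AA - gain @ S_AB.T                # Sigma_BA = Sigma_AB^T
    return mu_c, Sigma_c

mu = np.zeros(3)
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
print(gaussian_conditional(mu, Sigma, [0], [1, 2], np.array([1.0, -1.0])))
```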

SLIDE 33

Gaussian processes

Multivariate Gaussian distributions are useful for modeling finite collections of real-valued variables and have nice analytical properties. Gaussian processes extend multivariate Gaussians to infinite-sized collections of real-valued variables: we may think of Gaussian processes as distributions not just over random vectors but over random functions.

SLIDE 34

Probability distributions over functions with finite domains

Let $\chi = (x_1, \ldots, x_m)$ be any finite array of elements, and let $H$ be the set of functions from $\chi$ to $\mathbb{R}$. A function $f \in H$ can be described by the array $(f(x_1), \ldots, f(x_m))$, and any array $(y_1, \ldots, y_m)$ can be seen as the description of a function $f \in H$ such that $f(x_i) = y_i$. The set $H$ is then in one-to-one correspondence with the set of vectors in $\mathbb{R}^m$. A density distribution $p(f)$ over functions $f \in H$ then corresponds to a density distribution $p(\mathbf{x})$, $\mathbf{x} \in \mathbb{R}^m$.

SLIDE 35

Gaussian distributions over functions with finite domains

If we assume that $p(f)$ is a multivariate Gaussian distribution centered at $\mathbf{0}$ with diagonal covariance $\sigma^2 I$, it results in
$$p(f) = N(f|\mathbf{0}, \sigma^2 I) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\sigma}\, e^{-\frac{f(x_i)^2}{2\sigma^2}}$$
This can be seen as a prior distribution over functions, prior to the observation of any pair $(x_j, \hat{y}_j)$, $1 \le j \le m$.

SLIDE 36

Gaussian distributions over functions with finite domains

Given a set $I \subset \{1, \ldots, m\}$ of indices, let $X = \{(x_j, \hat{y}_j) : j \in I\}$ be a set of observed pairs. Then, according to the Bayesian framework:

  • The posterior distribution $p(f|X)$ over functions (w.r.t. $X$) can be defined and derived according to Bayes' rule, provided a likelihood model is defined, such as $\hat{y} = f(x) + \varepsilon$ with $p(\varepsilon) = N(\varepsilon|0, \beta)$.

  • The predictive distribution $p(y_k|x_k, X) = \int p(y_k|f, x_k)\,p(f|X)\,df$ for any $k \notin I$ can also be derived.

SLIDE 37

Gaussian distributions over functions with infinite domains

In the case of an infinite $\chi$, we have to deal with an infinite collection of random variables. In this case, the role of multivariate distributions is played by stochastic processes.

  • A stochastic process is a collection of random variables $\{f(x) : x \in \chi\}$, indexed by elements from some set $\chi$, known as the index set.

  • A Gaussian process is a stochastic process such that any finite subset of its random variables has a multivariate Gaussian distribution.

SLIDE 38

Gaussian processes

In order to define a Gaussian process, both a mean and a covariance function must be specified:

  • a mean function $m(x)$, mapping points of $\chi$ onto $\mathbb{R}$
  • a kernel function $\kappa(x_1, x_2)$, mapping pairs from $\chi^2$ onto $\mathbb{R}$

Given a finite subset $X = (x_1, \ldots, x_m)$ of $\chi$, the distribution of the corresponding set of random variables $(f(x_1), \ldots, f(x_m))$ is given by
$$p(f(x_1), \ldots, f(x_m)) = N(\mathbf{f}|\boldsymbol{\mu}(X), \Sigma(X))$$
where $\boldsymbol{\mu}(X) = (m(x_1), \ldots, m(x_m))^T$ and $\Sigma(X)_{ij} = \kappa(x_i, x_j)$. That is, in a Gaussian process the expectation of $f(x)$ is provided by the function $m(x)$, and the covariance between $f(x_1)$ and $f(x_2)$, defined as $E_f[(f(x_1) - m(x_1))(f(x_2) - m(x_2))]$, is provided by the kernel function $\kappa(x_1, x_2)$. It can be shown that $\kappa$ must be such that, for any subset $X$ of $\chi$, the corresponding covariance matrix $\Sigma(X)$ is positive semidefinite (that is, for any $\mathbf{x}$ it must hold that $\mathbf{x}^T \Sigma(X)\,\mathbf{x} \ge 0$).

SLIDE 39

Gaussian processes

Given a Gaussian process $GP(m, \kappa)$, a function $f$ drawn from it can intuitively be seen as an extremely high-dimensional vector drawn from an extremely high-dimensional multivariate Gaussian. Each dimension of the Gaussian corresponds to an element $x$ of $\chi$, and the corresponding component of the random vector represents the value of $f(x)$. Using the marginalization property of multivariate Gaussians, we can obtain the multivariate Gaussian density corresponding to any finite subset of variables.

SLIDE 40

RBF kernel

One of the most widely applied kernels is the RBF kernel
$$\kappa(x_1, x_2) = \sigma^2 e^{-\frac{||x_1 - x_2||^2}{2\tau^2}}$$
which tends to assign higher covariance between $f(x_1)$ and $f(x_2)$ when $x_1$ and $x_2$ are nearby points. Functions drawn from a Gaussian process with an RBF kernel tend to be smooth (values computed at nearby points tend to be similar). Smoothing is larger for larger $\tau$.
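A minimal numpy sketch of drawing functions from a GP prior with an RBF kernel, using the fact that their values on any finite grid are jointly Gaussian; names and settings are assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0, tau=1.0):
    """kappa(x1, x2) = sigma^2 * exp(-(x1 - x2)^2 / (2 tau^2)), 1-d inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return sigma ** 2 * np.exp(-d2 / (2 * tau ** 2))

grid = np.linspace(-5, 5, 100)
Sigma = rbf_kernel(grid, grid) + 1e-8 * np.eye(grid.size)  # jitter for stability
rng = np.random.default_rng(0)
# each draw is a vector (f(x_1), ..., f(x_m)) ~ N(0, Sigma(X))
samples = rng.multivariate_normal(np.zeros(grid.size), Sigma, size=3)
print(samples.shape)  # (3, 100): three sampled functions on the grid
```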

SLIDE 41

RBF kernel

Larger smoothing

[Figure: functions drawn from a GP prior with an RBF kernel, larger τ]

SLIDE 42

RBF kernel

Smaller smoothing

[Figure: functions drawn from a GP prior with an RBF kernel, smaller τ]

SLIDE 43

Gaussian process regression

  • Assume, as usual, that the observed data derive from a function $f$ with additive, independent Gaussian noise:
$$y = f(\mathbf{x}) + \varepsilon \qquad p(\varepsilon) = N(\varepsilon|0, \sigma^2)$$
that is, $p(y|f, \mathbf{x}) = N(y|f(\mathbf{x}), \sigma^2)$.

  • A Gaussian process $p(f) = GP(m, \kappa)$ provides a prior distribution over functions in $H$.

SLIDE 44

Gaussian process regression

Given a set of observations $\{X = \{\mathbf{x}_i\},\ \mathbf{y} = \{y_i\},\ i = 1, \ldots, m\}$, we can now derive a posterior distribution over functions. It is possible to show that $p(f|X, \mathbf{y}) = GP(m_p, \kappa_p)$ with
$$m_p(\mathbf{x}) = \mathbf{k}(\mathbf{x}, X)\,(\Sigma(X) + \sigma^2 I)^{-1}\mathbf{y}$$
$$\kappa_p(\mathbf{x}, \mathbf{x}') = \kappa(\mathbf{x}, \mathbf{x}') - \mathbf{k}(\mathbf{x}, X)\,(\Sigma(X) + \sigma^2 I)^{-1}\mathbf{k}(X, \mathbf{x}')$$
where
$$\mathbf{k}(\mathbf{x}, X) = \begin{bmatrix} \kappa(\mathbf{x}, \mathbf{x}_1) & \cdots & \kappa(\mathbf{x}, \mathbf{x}_m) \end{bmatrix} \qquad \mathbf{k}(X, \mathbf{x}') = \begin{bmatrix} \kappa(\mathbf{x}_1, \mathbf{x}') & \cdots & \kappa(\mathbf{x}_m, \mathbf{x}') \end{bmatrix}^T = \mathbf{k}(\mathbf{x}', X)^T$$
This makes it possible to sample functions from the posterior distribution.

SLIDE 45

Gaussian process regression

In order to derive the predictive distribution for a new item $\mathbf{x}$, let us observe that for any function $f$ drawn from $GP(0, \kappa)$, the marginal distribution over any set of input points $X = (\mathbf{x}_1, \ldots, \mathbf{x}_m)$ is a multivariate Gaussian:
$$p(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_m)) = N(\mathbf{f}|\mathbf{0}, \Sigma(X))$$
Let $X$ be the set of points in the training set, and let $\mathbf{x}$ be a new item. Then
$$(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_m), f(\mathbf{x})) \sim N(\mathbf{0}, \Sigma(X \cup \{\mathbf{x}\}))$$
where
$$\Sigma(X \cup \{\mathbf{x}\}) = \begin{bmatrix} \Sigma(X) & \mathbf{k}(X, \mathbf{x}) \\ \mathbf{k}(\mathbf{x}, X) & \kappa(\mathbf{x}, \mathbf{x}) \end{bmatrix}$$

SLIDE 46

Gaussian process regression

By the assumption of Gaussian noise, $y_i = f(\mathbf{x}_i) + \varepsilon$ for $i = 1, \ldots, m$ and $y = f(\mathbf{x}) + \varepsilon$ with $p(\varepsilon) = N(\varepsilon|0, \sigma^2)$, it derives that
$$(y_1, \ldots, y_m, y) \sim N(\mathbf{0}, \Sigma(X \cup \{\mathbf{x}\}) + \sigma^2 I)$$
From the properties of multivariate Gaussian distributions, namely
$$p(\mathbf{x}_B|\mathbf{x}_A) = N(\boldsymbol{\mu}_B + \Sigma_{BA}\Sigma_{AA}^{-1}(\mathbf{x}_A - \boldsymbol{\mu}_A),\; \Sigma_{BB} - \Sigma_{BA}\Sigma_{AA}^{-1}\Sigma_{AB})$$
it results that $p(y|\mathbf{x}, X, \mathbf{y}) = N(y|\mu(\mathbf{x}), \sigma^2(\mathbf{x}))$ with
$$\mu(\mathbf{x}) = \mathbf{k}(\mathbf{x}, X)\,(\Sigma(X) + \sigma^2 I)^{-1}\mathbf{y} = \sum_{i=1}^{m} \alpha(\mathbf{x}_i, \mathbf{x})\,y_i$$
$$\sigma^2(\mathbf{x}) = \kappa(\mathbf{x}, \mathbf{x}) + \sigma^2 - \mathbf{k}(\mathbf{x}, X)\,(\Sigma(X) + \sigma^2 I)^{-1}\mathbf{k}(\mathbf{x}, X)^T$$
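A minimal numpy sketch implementing the predictive mean and variance formulas above; the kernel, its parameters, and the toy data are assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0, tau=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return sigma ** 2 * np.exp(-d2 / (2 * tau ** 2))

def gp_predict(X, y, x_new, sigma_noise, kernel):
    """mu(x) and sigma^2(x) from the GP predictive formulas above."""
    S = kernel(X, X) + sigma_noise ** 2 * np.eye(X.size)   # Sigma(X) + s^2 I
    k_x = kernel(np.atleast_1d(x_new), X)[0]               # row vector k(x, X)
    mu = k_x @ np.linalg.solve(S, y)
    var = (kernel(np.atleast_1d(x_new), np.atleast_1d(x_new))[0, 0]
           + sigma_noise ** 2 - k_x @ np.linalg.solve(S, k_x))
    return mu, var

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 20))
y = np.sin(X) + rng.normal(0, 0.1, 20)
print(gp_predict(X, y, 0.5, 0.1, rbf_kernel))
```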

SLIDE 47

Estimating kernel parameters

The predictive performance of Gaussian processes depends exclusively on the suitability of the chosen kernel. Let us consider the case of an RBF kernel. Then
$$\kappa(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2\, e^{-\frac{1}{2}(\mathbf{x}_i - \mathbf{x}_j)^T M (\mathbf{x}_i - \mathbf{x}_j)} + \sigma_y^2\,\delta_{ij}$$
$M$ can be defined in several ways: the simplest one is $M = l^{-2}I$. Even in this simple case, varying the values of $\sigma_f, \sigma_y, l$ returns quite different results. (Figure from K. Murphy, "Machine Learning: A Probabilistic Perspective", p. 519, with $(l, \sigma_f, \sigma_y)$ equal to $(1, 1, 0.1)$, $(0.3, 1.08, 0.00005)$, $(3.0, 1.16, 0.89)$.)

SLIDE 48

Estimating kernel parameters

Kernel parameters can be estimated, as usual, through grid search and (cross-)validation. A different, more efficient approach relies on maximizing the marginal likelihood
$$p(\mathbf{y}|X) = \int p(\mathbf{y}|\mathbf{f}, X)\,p(\mathbf{f}|X)\,d\mathbf{f} = \int N(\mathbf{f}|\mathbf{0}, \Sigma(X)) \prod_{i=1}^{m} N(y_i|f_i, \sigma_y^2)\,d\mathbf{f}$$
It can be shown that
$$\log p(\mathbf{y}|X) = \log N(\mathbf{y}|\mathbf{0}, \Sigma(X) + \sigma_y^2 I) = -\frac{1}{2}\mathbf{y}^T(\Sigma(X) + \sigma_y^2 I)^{-1}\mathbf{y} - \frac{1}{2}\log\left|\Sigma(X) + \sigma_y^2 I\right| - \frac{n}{2}\log(2\pi)$$
where the first term measures the fit of the model to the data, the second measures the complexity of the model, and the third is constant.
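A minimal numpy sketch that scores RBF length-scales by the log marginal likelihood over a crude grid; all names, grid values, and data are assumptions.

```python
import numpy as np

def log_marginal_likelihood(X, y, sigma_f, sigma_y, l):
    """log N(y | 0, Sigma(X) + sigma_y^2 I) with an RBF kernel."""
    d2 = (X[:, None] - X[None, :]) ** 2
    K = sigma_f ** 2 * np.exp(-0.5 * d2 / l ** 2) + sigma_y ** 2 * np.eye(X.size)
    _, logdet = np.linalg.slogdet(K)              # model-complexity term
    fit = -0.5 * y @ np.linalg.solve(K, y)        # data-fit term
    return fit - 0.5 * logdet - 0.5 * X.size * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 30))
y = np.sin(X) + rng.normal(0, 0.1, 30)
for l in (0.1, 0.5, 1.0, 3.0):                    # crude grid over l
    print(l, round(log_marginal_likelihood(X, y, 1.0, 0.1, l), 2))
```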