Model inference
Course of Machine Learning, Master Degree in Computer Science, University of Rome ``Tor Vergata'', Giorgio Gambosi, a.a. 2018-2019


SLIDE 1

Model inference


Course of Machine Learning Master Degree in Computer Science University of Rome ``Tor Vergata'' Giorgio Gambosi a.a. 2018-2019

1

SLIDE 2

Model inference

Purpose
Inferring a probabilistic model from a collection of observed data X = {x1, . . . , xN}. A probabilistic model is a probability distribution over the data domain.

Dataset
A dataset X is a collection of N observations, independent and identically distributed (iid): they can be seen as realizations of a single random variable.

2

SLIDE 3

Model inference

Problems considered
Inference objectives:

  • Model selection: selecting the probabilistic model M best suited for a given data collection
  • Estimation: estimating the values of the set θ = (θ1, . . . , θD) of parameters of a given model type (probability distribution) which best model the observed data X
  • Prediction: computing the probability p(x|X) of a new observation from the set of already observed data

3

SLIDE 4

Bayesian learning

Context
Model space M: a model m ∈ M is a probability distribution p(x|m) over data. Let p(m) be any prior distribution over models, with

∑_{m∈M} p(m) = 1

The corresponding predictive distribution of the data is

p(x) = ∑_{m∈M} p(x|m)p(m)

4

SLIDE 5

Inference

After observing a dataset X, the updated probabilities are

p(m|X) = p(m)p(X|m) / p(X) ∝ p(m)p(X|m) = p(m) ∏_{i=1}^{n} p(xi|m)

and the predictive distribution is

p(x|X) = ∑_{m∈M} p(x|m)p(m|X)

5

SLIDE 6

Parameters

Parametric models
Models are defined as parametric probability distributions, with parameters θ ranging over a parameter space Θ. A prior parameter distribution p(θ|m) is defined for each model. The prior predictive distribution is then

p(x|m) = ∫_Θ p(x|θ, m)p(θ|m)dθ

Posterior parameter distribution
Given a model m ∈ M, Bayes' formula makes it possible to infer the posterior distribution of the parameters, given the dataset X:

p(θ|X, m) = p(θ|m)p(X|θ, m) / p(X|m) ∝ p(θ|m)p(X|θ, m)

The posterior predictive distribution, given the model, is

p(x|X, m) = ∫_Θ p(x|θ, m)p(θ|X, m)dθ

6

SLIDE 7

Bayesian inference

According to the Bayesian approach to inference, parameters are considered as random variables, whose distributions have to be inferred from observed data. The approach relies on Bayes' classic result:

Theorem (Bayes)
Let X, Y be a pair of (sets of) random variables. Then,

p(Y|X) = p(X|Y)p(Y) / p(X) = p(X|Y)p(Y) / ∫_Z p(X, Z)dZ

where

  • p(Y) is the prior probability of Y (with respect to the observation of X)
  • p(Y|X) is the posterior probability of Y
  • p(X|Y) is the likelihood of X w.r.t. Y
  • p(X) is the evidence of X

7

SLIDE 8

Point estimate of parameters

Motivation
Given a model m, the Bayesian approach aims to derive the posterior distribution of the set of parameters θ. This requires computing

p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫_Θ p(X|θ)p(θ)dθ

and

p(x|X) = ∫_Θ p(x|θ)p(θ|X)dθ

This usually cannot be done efficiently.

Idea
Only an estimate of the ``best'' value θ̂ in Θ (according to some measure) is computed. The posterior predictive distribution can then be approximated as follows:

p(x|X) = ∫_Θ p(x|θ)p(θ|X)dθ ≈ ∫_Θ p(x|θ̂)p(θ|X)dθ = p(x|θ̂) ∫_Θ p(θ|X)dθ = p(x|θ̂)

8

SLIDE 9

Maximum likelihood estimate

Approach
Frequentist point of view: parameters are deterministic quantities, whose values are unknown and must be estimated. Determine the parameter value that maximizes the likelihood

L(θ|X) = p(X|θ) = ∏_{i=1}^{N} p(xi|θ)

The log-likelihood

ℓ(θ|X) = ln L(θ|X) = ∑_{i=1}^{N} ln p(xi|θ)

is usually preferable. The maximum occurs at the same point:

argmax_θ ℓ(θ|X) = argmax_θ L(θ|X)

Estimate

θ̂_ML = argmax_θ L(θ|X) = argmax_θ ∑_{i=1}^{N} ln p(xi|θ)

9

SLIDE 10

Maximum likelihood estimate

Solution
Solve the system

∂ℓ(θ|X)/∂θi = 0,  i = 1, . . . , D

or, more concisely, ∇_θ ℓ(θ|X) = 0

Prediction
Probability of a new observation x:

p(x|X) = ∫_Θ p(x|θ)p(θ|X)dθ ≈ ∫_Θ p(x|θ̂_ML)p(θ|X)dθ = p(x|θ̂_ML) ∫_Θ p(θ|X)dθ = p(x|θ̂_ML)

10

SLIDE 11

Maximum likelihood estimate

Example
Collection X of N binary events, modeled through a Bernoulli distribution with unknown parameter φ:

p(x|φ) = φ^x (1 − φ)^{1−x}

Likelihood

L(φ|X) = ∏_{i=1}^{N} φ^{xi} (1 − φ)^{1−xi}

Log-likelihood

ℓ(φ|X) = ∑_{i=1}^{N} (xi ln φ + (1 − xi) ln(1 − φ)) = N1 ln φ + N0 ln(1 − φ)

where N0 (N1) is the number of events x ∈ X equal to 0 (1). Setting

∂ℓ(φ|X)/∂φ = N1/φ − N0/(1 − φ) = 0  ⟹  φ̂_ML = N1/(N0 + N1) = N1/N

11
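The closed form φ̂_ML = N1/N above can be checked numerically; a minimal sketch, where the function name and toy dataset are illustrative, not from the slides:

```python
# Maximum likelihood estimate for a Bernoulli parameter: phi_ML = N1 / N.
def bernoulli_mle(xs):
    """Return the ML estimate of phi from a list of binary observations."""
    n1 = sum(xs)           # N1: number of observations equal to 1
    return n1 / len(xs)    # phi_ML = N1 / (N0 + N1) = N1 / N

X = [1, 0, 1, 1]           # toy dataset: N1 = 3, N0 = 1
print(bernoulli_mle(X))    # 0.75
```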

SLIDE 12

ML and overfitting

Overfitting
Maximizing the likelihood of the observed dataset tends to result in an estimate too sensitive to the dataset values, and hence in overfitting. The obtained estimates are suitable to model the observed data, but may be too specialized to model different datasets.

Penalty functions
An additional function P(θ) can be introduced with the aim of limiting overfitting and the overall complexity of the model. This results in the following function to maximize:

C(θ|X) = ℓ(θ|X) − P(θ)

As a common case, P(θ) = (γ/2)∥θ∥², with γ a tuning parameter.

12

SLIDE 13

Maximum a posteriori estimate

Idea
Inference through maximum a posteriori (MAP) estimation is similar to ML, but θ is now considered as a random variable, whose distribution has to be derived from observations, also taking into account previous knowledge (the prior distribution). The parameter value maximizing

p(θ|X) = p(X|θ)p(θ) / p(X)

is computed.

Estimate

θ̂_MAP = argmax_θ p(θ|X) = argmax_θ p(X|θ)p(θ) = argmax_θ L(θ|X)p(θ)
      = argmax_θ (ℓ(θ|X) + ln p(θ)) = argmax_θ (∑_{i=1}^{N} ln p(xi|θ) + ln p(θ))

13

SLIDE 14

MAP and gaussian prior

Hypothesis
Assume θ is distributed around the origin as a multivariate Gaussian with uniform variance and null covariance, that is,

p(θ) = N(θ|0, σ²I) = 1/((2π)^{d/2} σ^d) exp(−∥θ∥²/(2σ²)) ∝ exp(−∥θ∥²/(2σ²))

Inference
From the hypothesis,

θ̂_MAP = argmax_θ p(θ|X) = argmax_θ (ℓ(θ|X) + ln p(θ))
      = argmax_θ (ℓ(θ|X) + ln exp(−∥θ∥²/(2σ²))) = argmax_θ (ℓ(θ|X) − ∥θ∥²/(2σ²))

which coincides with the penalized objective introduced before, with γ = 1/σ²

14

SLIDE 15

MAP estimate

Example
Collection X of N binary events, modeled as a Bernoulli distribution with unknown parameter φ. Initial knowledge of φ is modeled as a Beta distribution:

p(φ|α, β) = Beta(φ|α, β) = Γ(α + β)/(Γ(α)Γ(β)) φ^{α−1}(1 − φ)^{β−1}

Log-likelihood

ℓ(φ|X) = ∑_{i=1}^{N} (xi ln φ + (1 − xi) ln(1 − φ)) = N1 ln φ + N0 ln(1 − φ)

Setting

∂/∂φ (ℓ(φ|X) + ln Beta(φ|α, β)) = N1/φ − N0/(1 − φ) + (α − 1)/φ − (β − 1)/(1 − φ) = 0
⟹  φ̂_MAP = (N1 + α − 1)/(N0 + N1 + α + β − 2) = (N1 + α − 1)/(N + α + β − 2)

15
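The MAP formula above can also be checked numerically; a minimal sketch with illustrative names and toy data (note that for α = β = 1, i.e. a uniform prior, the estimate reduces to the ML one):

```python
# MAP estimate for a Bernoulli parameter under a Beta(alpha, beta) prior:
# phi_MAP = (N1 + alpha - 1) / (N + alpha + beta - 2)
def bernoulli_map(xs, alpha, beta):
    n1 = sum(xs)                                   # N1
    n = len(xs)                                    # N = N0 + N1
    return (n1 + alpha - 1) / (n + alpha + beta - 2)

# toy data: N1 = 3, N0 = 1; Beta(2, 2) prior pulls the estimate toward 1/2
print(bernoulli_map([1, 1, 1, 0], 2, 2))
```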

SLIDE 16

Note

Gamma function
The function Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt is an extension of the factorial to the real numbers: hence, for any positive integer x, Γ(x) = (x − 1)!

16

SLIDE 17

Applying bayesian inference

Mode and mean
Once the posterior distribution

p(θ|X) = p(X|θ)p(θ) / p(X) = p(X|θ)p(θ) / ∫_Θ p(X|θ)p(θ)dθ

is available, the MAP estimate computes the most probable value (mode) θ_MAP of the distribution. This may lead to inaccurate estimates, as in the figure originally shown here.

[Figure omitted: a density p(x) for which the mode is not representative of the distribution]

17

SLIDE 18

Applying bayesian inference

Mode and mean
A better estimate can be obtained by applying a fully Bayesian approach and referring to the whole posterior distribution, for example by deriving the expectation of θ w.r.t. p(θ|X):

θ* = E_{p(θ|X)}[θ] = ∫_Θ θ p(θ|X)dθ

18

SLIDE 19

Bayesian estimate

Example
Collection X of N binary events, modeled as a Bernoulli distribution with unknown parameter φ. Initial knowledge of φ is modeled as a Beta distribution:

p(φ|α, β) = Beta(φ|α, β) = Γ(α + β)/(Γ(α)Γ(β)) φ^{α−1}(1 − φ)^{β−1}

Posterior distribution

p(φ|X, α, β) = (∏_{i=1}^{N} φ^{xi}(1 − φ)^{1−xi}) p(φ|α, β) / p(X)
             = φ^{N1}(1 − φ)^{N0} φ^{α−1}(1 − φ)^{β−1} Γ(α + β) / (Γ(α)Γ(β) p(X))
             = φ^{N1+α−1}(1 − φ)^{N0+β−1} / Z

Since ∫_0^1 p(φ|X, α, β)dφ = 1, Z must be equal to the normalizing coefficient of the distribution Beta(φ|α + N1, β + N0). Hence,

p(φ|X, α, β) = Beta(φ|α + N1, β + N0)

19
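The conjugate update above amounts to adding counts to the Beta hyperparameters; a minimal sketch (illustrative names, toy data), also contrasting the posterior mean with the mode:

```python
# Conjugate Beta-Bernoulli update: Beta(a, b) prior + data -> Beta(a + N1, b + N0).
def beta_posterior(xs, alpha, beta):
    n1 = sum(xs)
    n0 = len(xs) - n1
    return alpha + n1, beta + n0

def beta_mean(a, b):
    """Posterior mean (fully Bayesian point estimate)."""
    return a / (a + b)

def beta_mode(a, b):
    """Posterior mode (MAP estimate); valid for a, b > 1."""
    return (a - 1) / (a + b - 2)

a, b = beta_posterior([1, 1, 1, 0], 1, 1)   # uniform prior, N1 = 3, N0 = 1
print(a, b, beta_mean(a, b), beta_mode(a, b))
```

With a uniform prior the mode coincides with the ML estimate N1/N, while the mean is smoothed toward 1/2.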

SLIDE 20

Model comparison

Comparing different models
Let M1, . . . , Mm be a set of model types, each with its own set of parameters. Given a dataset X, we wish to select the model type which best represents X. In a Bayesian framework, we may consider the posterior probability of each model type:

p(Mi|X) = p(X|Mi)p(Mi) / p(X) ∝ p(X|Mi)p(Mi)

If we assume that no specific knowledge on model types is initially available, then the prior distribution is uniform; as a consequence, p(Mi|X) ∝ p(X|Mi).

Evidence
The distribution p(X|Mi) is the evidence of the dataset w.r.t. a model type. It can be obtained by marginalizing over the model parameters:

p(X|Mi) = ∫_Θ p(X|θ, Mi)p(θ|Mi)dθ

20

SLIDE 21

Model selection in practice

Validation

Test set
The dataset is split into a training set (used for learning parameters) and a test set (used for measuring effectiveness). Good for large datasets; otherwise, the resulting training and test sets are small (few data for fitting and validation).

Cross validation
The dataset is partitioned into K equal-sized sets. Iteratively, in K phases, use one set as test set and the union of the other K − 1 as training set (K-fold cross validation); average the validation measures. As a particular case, iteratively leave one element out and use all other points as training set (leave-one-out cross validation). Time consuming for large datasets and for models which are costly to fit.

21
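The K-fold scheme can be sketched as pure index bookkeeping; the helper below is illustrative (contiguous folds, no shuffling), with K = n giving leave-one-out:

```python
# K-fold cross validation: partition indices 0..n-1 into K folds; each fold
# serves once as test set, the remaining K-1 folds as training set.
def kfold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for K-fold cross validation."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

for train, test in kfold_splits(6, 3):
    print(train, test)
```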

SLIDE 22

Model selection in practice

Information measures
Faster methods to compare model effectiveness, based on computing measures which take into account both data fit and model complexity.

Akaike Information Criterion (AIC)
Let θ be the set of parameters of the model and let θ̂_ML be their maximum likelihood estimate on the dataset X. Then,

AIC = 2|θ| − 2 ln p(X|θ̂_ML) = 2|θ| − 2 max_θ ln p(X|θ)

Lower values correspond to models to be preferred.

Bayesian Information Criterion (BIC)
A variant of the above, defined as

BIC = |θ| ln |X| − 2 ln p(X|θ̂_ML) = |θ| ln |X| − 2 max_θ ln p(X|θ)

22
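Both criteria are simple formulas once the maximized log-likelihood is available; a minimal sketch using the Bernoulli example from the earlier slides (toy counts are illustrative):

```python
import math

def aic(num_params, max_loglik):
    """AIC = 2|theta| - 2 ln p(X|theta_ML); lower is better."""
    return 2 * num_params - 2 * max_loglik

def bic(num_params, n_obs, max_loglik):
    """BIC = |theta| ln|X| - 2 ln p(X|theta_ML); lower is better."""
    return num_params * math.log(n_obs) - 2 * max_loglik

# Bernoulli model: one parameter, N1 = 3 ones and N0 = 1 zero.
n1, n0 = 3, 1
phi = n1 / (n1 + n0)
ll = n1 * math.log(phi) + n0 * math.log(1 - phi)   # maximized log-likelihood
print(aic(1, ll), bic(1, n1 + n0, ll))
```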

SLIDE 23

Model averaging

Marginalization to reduce overfitting

  • To avoid overfitting, we may marginalize over model parameters: this corresponds to averaging among all possible models
  • Bayesian approach: use of probabilities to represent uncertainty in the choice of the model
  • Set of L models Mi, i = 1, . . . , L, each a probability distribution p(X|Mi) over the dataset X
  • Prior uncertainty about the model represented through the distribution p(Mi)
  • Observing the training set modifies the uncertainty to the posterior p(Mi|X) ∝ p(X|Mi)p(Mi)
  • p(X|Mi) is called the marginal likelihood or model evidence
  • p(X|Mi)/p(X|Mj) is the Bayes factor for models Mi, Mj

23

SLIDE 24

Model evidence

As an average

  • The evidence of a model can be expressed as an average over all possible parameter values:

    p(X|Mi) = ∫ p(X|w, Mi)p(w|Mi)dw

  • this is the normalization term in the definition of the posterior distribution of parameters:

    p(w|X, Mi) = p(X|w, Mi)p(w|Mi) / p(X|Mi)

24

SLIDE 25

Averaged model prediction

Prediction

  • Given the posterior over models, the predictive distribution can be obtained as

    p(x|X) = ∑_{i=1}^{L} p(x|Mi, X)p(Mi|X)

  • this corresponds to a weighted average of the predictions of the single models, with weights given by their posterior probabilities

25
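For a finite set of models the posterior weights and the averaged prediction can be computed directly; a sketch with three hypothetical Bernoulli models (each fixing p(x=1)) under a uniform prior:

```python
# Bayesian model averaging over a discrete set of Bernoulli models.
def model_posterior(models, prior, xs):
    """p(M_i|X) ∝ p(X|M_i) p(M_i), for models given as values of p(x=1|M_i)."""
    def lik(phi):
        p = 1.0
        for x in xs:
            p *= phi if x == 1 else (1 - phi)
        return p
    unnorm = [lik(phi) * pm for phi, pm in zip(models, prior)]
    z = sum(unnorm)                       # evidence p(X)
    return [u / z for u in unnorm]

def averaged_prediction(models, posterior):
    """p(x=1|X) = sum_i p(x=1|M_i) p(M_i|X): weighted average of predictions."""
    return sum(phi * w for phi, w in zip(models, posterior))

models = [0.2, 0.5, 0.8]                  # three candidate models
prior = [1/3, 1/3, 1/3]                   # uniform prior over models
post = model_posterior(models, prior, [1, 1, 1, 0])
print(averaged_prediction(models, post))  # ≈ 0.668
```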

SLIDE 26

Example: learning in the Dirichlet-multinomial model

26

SLIDE 27

Language modeling

A language model is a (categorical) probability distribution over a vocabulary of terms (possibly, all words which occur in a large collection of documents).

Use
A language model can be applied to predict the next term occurring in a text. The probability of occurrence of a term is related to its information content and is at the basis of a number of information retrieval techniques.

Hypothesis
It is assumed that the probability of occurrence of a term is independent of the preceding terms in the text (bag-of-words model).

Generative model
Given a language model, it is possible to sample from the distribution to generate random documents statistically equivalent to the documents in the collection used to derive the model.

27

SLIDE 28

Language model

  • Let T = {t1, . . . , tn} be the set of terms occurring in a given collection C of documents, after stop word (common, non-informative terms) removal and stemming (reduction of words to their basic form).
  • For each i = 1, . . . , n let mi be the multiplicity (number of occurrences) of term ti in C.
  • A language model can be derived as a categorical distribution associated to a vector φ̂ = (φ̂1, . . . , φ̂n)^T of probabilities, that is,

    0 ≤ φ̂i ≤ 1,  i = 1, . . . , n,    ∑_{i=1}^{n} φ̂i = 1

    where φ̂j = p(tj|C)

28

SLIDE 29

Learning a language model by ML

Applying maximum likelihood to derive the term probabilities in the language model results in setting

φ̂j = p(tj|C) = mj / ∑_{k=1}^{n} mk = mj / N

where N = ∑_{i=1}^{n} mi is the overall number of occurrences in C after stopword removal.

Smoothing
According to this estimate, a term t which never occurred in C has zero probability of being observed (black swan paradox). This is due to overfitting the model to the observed data, typical of ML estimation. Solution: assign a small, non-zero probability to events (terms) not observed so far. This is called smoothing.

29
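The ML language model is just the relative frequency of each term; a minimal sketch (toy tokens are illustrative) that also exhibits the zero-probability problem for unseen terms:

```python
from collections import Counter

# ML language model: phi_j = m_j / N (relative frequency of term t_j in C).
def ml_language_model(tokens):
    counts = Counter(tokens)
    n = len(tokens)
    return {t: m / n for t, m in counts.items()}

lm = ml_language_model(["the", "cat", "sat", "the", "mat"])
print(lm["the"])             # 0.4
print(lm.get("dog", 0.0))    # 0.0: an unseen term gets zero probability (black swan)
```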

SLIDE 30

Bayesian learning of a language model

We may apply the Dirichlet-multinomial model:

  • this implies defining a Dirichlet prior Dir(φ|α), with α = (α1, α2, . . . , αn), that is,

    p(φ1, . . . , φn|α) = 1/∆(α1, . . . , αn) ∏_{i=1}^{n} φi^{αi−1}

  • the posterior distribution of φ after C has been observed is then Dir(φ|α′), where α′ = (α1 + m1, α2 + m2, . . . , αn + mn), that is,

    p(φ1, . . . , φn|α′) = 1/∆(α1 + m1, . . . , αn + mn) ∏_{i=1}^{n} φi^{αi+mi−1}

30

SLIDE 31

Bayesian learning of a language model

The language model φ̂ corresponds to the posterior predictive distribution

φ̂j = p(tj|C, α) = ∫ p(tj|φ)p(φ|C, α)dφ = ∫ φj Dir(φ|α′)dφ = E[φj]

where the expectation E[φj] is taken w.r.t. the distribution Dir(φ|α′). Then,

φ̂j = α′j / ∑_{k=1}^{n} α′k = (αj + mj) / ∑_{k=1}^{n} (αk + mk) = (αj + mj)/(α0 + N)

where α0 = ∑_{k=1}^{n} αk. The αj term makes it impossible to obtain zero probabilities (Dirichlet smoothing).
Non-informative prior: αi = α for all i, which results in

p(tj|C, α) = (mj + α)/(αV + N)

where V is the vocabulary size.

31
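The smoothed estimate p(tj|C, α) = (mj + α)/(αV + N) can be sketched as follows; the toy vocabulary and the choice α = 1 (Laplace smoothing) are illustrative:

```python
from collections import Counter

# Dirichlet-smoothed language model under a symmetric Dirichlet(alpha) prior:
# p(t|C, alpha) = (m_t + alpha) / (alpha * V + N)
def dirichlet_smoothed_lm(tokens, vocabulary, alpha=1.0):
    counts = Counter(tokens)             # missing terms count as 0
    n = len(tokens)                      # N: total occurrences
    v = len(vocabulary)                  # V: vocabulary size
    return {t: (counts[t] + alpha) / (alpha * v + n) for t in vocabulary}

vocab = ["the", "cat", "sat", "mat", "dog"]
lm = dirichlet_smoothed_lm(["the", "cat", "sat", "the", "mat"], vocab, alpha=1.0)
print(lm["dog"])   # 0.1: the unseen term now has non-zero probability
```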

SLIDE 32

Naive bayes classifiers

A language model can be applied to derive classifiers of documents into two or more classes.

  • Given two classes C1, C2, assume that, for any document d, the probabilities p(C1|d) and p(C2|d) are known: then d can be assigned to the class with higher probability.
  • How can p(Ck|d) be derived for any document, given a collection C1 of documents known to belong to C1 and a similar collection C2 for C2? Apply Bayes' rule:

    p(Ck|d) ∝ p(d|Ck)p(Ck)

    The evidence p(d) is the same for both classes, and can be ignored.
  • We are still left with the problem of computing p(Ck) and p(d|Ck) from C1 and C2.

32

SLIDE 33

Naive bayes classifiers

Computing p(Ck)
The prior probabilities p(Ck) (k = 1, 2) can be easily estimated from C1, C2: for example, by applying ML we obtain

p(Ck) = |Ck| / (|C1| + |C2|)

Computing p(d|Ck)
As for the likelihoods p(d|Ck) (k = 1, 2), we observe that d can be seen, according to the bag-of-words assumption, as a multiset of nd terms d = {t1, t2, . . . , tnd}. By applying the product rule,

p(d|Ck) = p(t1, . . . , tnd|Ck) = p(t1|Ck)p(t2|t1, Ck) · · · p(tnd|t1, . . . , tnd−1, Ck)
33

SLIDE 34

Naive bayes classifiers

The naive Bayes assumption
Computing p(d|Ck) is much easier if we assume that terms are pairwise conditionally independent given the class Ck, that is, for i, j = 1, . . . , nd and k = 1, 2,

p(ti, tj|Ck) = p(ti|Ck)p(tj|Ck)

As a consequence,

p(d|Ck) = ∏_{j=1}^{nd} p(tj|Ck)

Language models and NB classifiers
The probabilities p(tj|Ck) are available for all terms if language models have been derived for C1 and C2, respectively from the documents in C1 and C2.

34
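Putting the pieces together (ML class priors, Dirichlet-smoothed per-class language models, naive Bayes assumption) gives a complete classifier; a minimal sketch on hypothetical toy collections, using log probabilities to avoid underflow:

```python
import math
from collections import Counter

def train_nb(docs_by_class, vocabulary, alpha=1.0):
    """Fit class priors by ML and one smoothed language model per class."""
    total = sum(len(docs) for docs in docs_by_class.values())
    priors = {c: len(docs) / total for c, docs in docs_by_class.items()}
    lms = {}
    for c, docs in docs_by_class.items():
        tokens = [t for doc in docs for t in doc]
        counts = Counter(tokens)
        denom = alpha * len(vocabulary) + len(tokens)
        lms[c] = {t: (counts[t] + alpha) / denom for t in vocabulary}
    return priors, lms

def classify(doc, priors, lms):
    """Pick argmax_k of log p(C_k) + sum_j log p(t_j|C_k) (naive Bayes)."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(lms[c][t]) for t in doc if t in lms[c])
    return max(priors, key=score)

# Hypothetical toy collections, for illustration only.
docs_by_class = {
    "spam": [["win", "money"], ["free", "money"]],
    "ham":  [["meeting", "today"], ["lunch", "today"]],
}
vocab = sorted({t for docs in docs_by_class.values() for d in docs for t in d})
priors, lms = train_nb(docs_by_class, vocab)
print(classify(["free", "money"], priors, lms))   # spam
```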

SLIDE 35

Feature selection by mutual information

Feature selection
The set of probabilities in a language model can be exploited to identify the most relevant terms for classification, that is, the terms whose presence or absence in a document best characterizes the class of the document.

Mutual information
To measure relevance, we can apply the set of mutual informations {I1, . . . , In}, where

Ij = ∑_{k=1,2} p(tj, Ck) log [p(tj, Ck) / (p(tj)p(Ck))]
   = ∑_{k=1,2} p(Ck|tj)p(tj) log [p(Ck|tj) / p(Ck)]
   = p(tj) KL(p(Ck|tj) || p(Ck))

Here, KL is a measure of the amount of information on the class distribution provided by the presence of tj; this amount is weighted by the probability of occurrence of tj.

35

SLIDE 36

Feature selection by mutual information

Mutual information
Since p(tj, Ck) = p(Ck|tj)p(tj) = p(tj|Ck)p(Ck), Ij can be estimated as

Ij = p(tj|C1)p(C1) log [p(tj|C1)/p(tj)] + p(tj|C2)p(C2) log [p(tj|C2)/p(tj)]
   = φj1 π1 log [φj1/(φj1π1 + φj2π2)] + φj2 π2 log [φj2/(φj1π1 + φj2π2)]

where φjk is the estimated probability of tj in documents of class Ck and πk is the estimated probability of a document of class Ck in the collection. The most significant terms can then be selected as those with highest mutual information Ij.

36
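The estimated Ij can be computed directly from the smoothed probabilities φjk and the priors πk; a minimal sketch with illustrative names (note that a term equally probable in both classes gets Ij = 0, i.e. it carries no class information):

```python
import math

def term_mutual_information(phi1, phi2, pi1, pi2):
    """I_j for one term: phi_k = p(t_j|C_k) (smoothed), pi_k = p(C_k).
    Follows I_j = sum_k phi_jk * pi_k * log(phi_jk / p(t_j))."""
    p_t = phi1 * pi1 + phi2 * pi2   # p(t_j) by marginalization over classes
    return (phi1 * pi1 * math.log(phi1 / p_t)
            + phi2 * pi2 * math.log(phi2 / p_t))

# a term more frequent in class 1 than in class 2, balanced classes
print(term_mutual_information(0.3, 0.1, 0.5, 0.5))
```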