Slide 1

Lecture 13

Learning in Graphical Models

Marco Chiarandini

Department of Mathematics & Computer Science University of Southern Denmark

Slide 2

Course Overview

✔ Introduction
  ✔ Artificial Intelligence
  ✔ Intelligent Agents
✔ Search
  ✔ Uninformed Search
  ✔ Heuristic Search
✔ Uncertain knowledge and Reasoning
  ✔ Probability and Bayesian approach
  ✔ Bayesian Networks
  ✔ Hidden Markov Chains
  ✔ Kalman Filters
Learning
  ✔ Supervised: Decision Trees, Neural Networks, Learning Bayesian Networks
  ✔ Unsupervised: EM Algorithm
  Reinforcement Learning
Games and Adversarial Search
  Minimax search and Alpha-beta pruning
  Multiagent search
Knowledge representation and Reasoning
  Propositional logic
  First order logic
  Inference
  Planning

Slide 3

Outline

  • 1. Learning Graphical Models
      Parameter Learning in Bayes Nets
      Bayesian Parameter Learning
  • 2. Unsupervised Learning
      k-means
      EM Algorithm

Slide 4

Outline

Methods:

  • 1. Bayesian learning
  • 2. Maximum a posteriori and maximum likelihood learning

Bayesian network learning with complete data:

  • a. ML parameter learning
  • b. Bayesian parameter learning

Slide 5

Full Bayesian learning

View learning as Bayesian updating of a probability distribution over the hypothesis space.

H is the hypothesis variable, with values h1, h2, . . . and prior Pr(h).
dj gives the outcome of random variable Dj (the jth observation); the training data are d = d1, . . . , dN.

Given the data so far, each hypothesis has a posterior probability:
P(hi|d) = α P(d|hi) P(hi)
where P(d|hi) is called the likelihood.

Predictions use a likelihood-weighted average over the hypotheses:
Pr(X|d) = Σ_i Pr(X|d, hi) P(hi|d) = Σ_i Pr(X|hi) P(hi|d)

Or predict according to the most probable hypothesis (maximum a posteriori).

Slide 6

Example

Suppose there are five kinds of bags of candies:
  10% are h1: 100% cherry candies
  20% are h2: 75% cherry candies + 25% lime candies
  40% are h3: 50% cherry candies + 50% lime candies
  20% are h4: 25% cherry candies + 75% lime candies
  10% are h5: 100% lime candies
Then we observe candies drawn from some bag: What kind of bag is it? What flavour will the next candy be?
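The updating and prediction formulas from the previous slide are straightforward to implement. The following Python sketch (not part of the original slides; the names and the all-lime observation sequence are illustrative) computes the posteriors P(hi|d) and the predictive probability that the next candy is lime for this bag example:

priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_i) for h1..h5
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(candy = lime | h_i)

def posteriors(observations):
    """P(h_i | d) after a sequence of 'lime'/'cherry' observations."""
    post = list(priors)
    for candy in observations:
        post = [p * (pl if candy == "lime" else 1 - pl)
                for p, pl in zip(post, p_lime)]
    z = sum(post)                        # 1/alpha, the normalization constant
    return [p / z for p in post]

def predict_lime(observations):
    """P(next candy is lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    return sum(pl * p for pl, p in zip(p_lime, posteriors(observations)))

print(posteriors(["lime"] * 10))         # h5 dominates after ten limes in a row
print(predict_lime(["lime"] * 10))       # prediction approaches 1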

Slide 7

Posterior probability of hypotheses

[Plot: posterior probabilities P(h1|d), . . . , P(h5|d) of the five hypotheses as a function of the number of samples in d]

Slide 8

Prediction probability

[Plot: P(next candy is lime | d) as a function of the number of samples in d]

Slide 9

MAP approximation

Summing over the hypothesis space is often intractable (e.g., there are 2^(2^6) = 18,446,744,073,709,551,616 Boolean functions of 6 attributes).

Maximum a posteriori (MAP) learning: choose hMAP maximizing P(hi|d),
i.e., maximize P(d|hi) P(hi), or log P(d|hi) + log P(hi).

The log terms can be viewed as (the negative of) the number of bits to encode the data given the hypothesis plus the number of bits to encode the hypothesis. This is the basic idea of minimum description length (MDL) learning.

For deterministic hypotheses, P(d|hi) is 1 if consistent, 0 otherwise ⇒ MAP = simplest consistent hypothesis.

Slide 10

ML approximation

For large data sets, the prior becomes irrelevant.

Maximum likelihood (ML) learning: choose hML maximizing P(d|hi),
i.e., simply get the best fit to the data; identical to MAP for a uniform prior (which is reasonable if all hypotheses have the same complexity).

ML is the “standard” (non-Bayesian) statistical learning method.

Slide 11

Parameter learning by ML

Bag from a new manufacturer; what is the fraction θ of cherry candies?
Any θ is possible: continuum of hypotheses hθ.
θ is a parameter for this simple (binomial) family of models.
Suppose we unwrap N candies, c cherries and ℓ = N − c limes.
These are i.i.d. (independent, identically distributed) observations, so

[Bayes net: a single node Flavor with parameter P(F = cherry) = θ]

P(d|hθ) = ∏_{j=1}^{N} P(dj|hθ) = θ^c · (1 − θ)^ℓ

Maximize this w.r.t. θ, which is easier for the log-likelihood:

L(d|hθ) = log P(d|hθ) = Σ_{j=1}^{N} log P(dj|hθ) = c log θ + ℓ log(1 − θ)

dL(d|hθ)/dθ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c/(c + ℓ) = c/N

Seems sensible, but causes problems with 0 counts!
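As a quick sanity check (not from the slides; the counts below are made up), the closed-form estimate θ = c/N can be compared with a brute-force maximization of the log-likelihood above:

import math

def log_likelihood(theta, c, l):
    """L(d | h_theta) = c log(theta) + l log(1 - theta)."""
    return c * math.log(theta) + l * math.log(1 - theta)

c, l = 7, 3                        # say, 7 cherries and 3 limes out of N = 10
theta_ml = c / (c + l)             # closed-form ML estimate, theta = c/N
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, c, l))
print(theta_ml, best)              # 0.7 and approximately 0.7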

Slide 12

Multiple parameters

[Bayes net: Flavor → Wrapper, with parameters P(F = cherry) = θ, P(W = red | F = cherry) = θ1, P(W = red | F = lime) = θ2]

Red/green wrapper depends probabilistically on flavor.

Likelihood for, e.g., a cherry candy in a green wrapper:
P(F = cherry, W = green|hθ,θ1,θ2) = P(F = cherry|hθ,θ1,θ2) P(W = green|F = cherry, hθ,θ1,θ2) = θ · (1 − θ1)

N candies, with rc red-wrapped cherry candies, gc green-wrapped cherries, rℓ red-wrapped limes, gℓ green-wrapped limes:

P(d|hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ

L = [c log θ + ℓ log(1 − θ)] + [rc log θ1 + gc log(1 − θ1)] + [rℓ log θ2 + gℓ log(1 − θ2)]

Slide 13

Multiple parameters contd.

Derivatives of L contain only the relevant parameter:

∂L/∂θ  = c/θ − ℓ/(1 − θ) = 0   ⇒  θ  = c/(c + ℓ)
∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0  ⇒  θ1 = rc/(rc + gc)
∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0  ⇒  θ2 = rℓ/(rℓ + gℓ)

With complete data, parameters can be learned separately.
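A small sketch (not from the slides; the data are made up) of what "learned separately" means in practice: with complete data each parameter is just a ratio of its own counts.

from collections import Counter

# complete data: (flavor, wrapper) pairs, illustrative only
data = [("cherry", "red"), ("cherry", "red"), ("cherry", "green"),
        ("lime", "green"), ("lime", "red"), ("lime", "green")]

flavor_counts = Counter(f for f, _ in data)
pair_counts = Counter(data)

theta  = flavor_counts["cherry"] / len(data)                       # θ  = c / N
theta1 = pair_counts[("cherry", "red")] / flavor_counts["cherry"]  # θ1 = rc / (rc + gc)
theta2 = pair_counts[("lime", "red")] / flavor_counts["lime"]      # θ2 = rℓ / (rℓ + gℓ)
print(theta, theta1, theta2)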

Slide 14

Continuous models

P(x) = 1/(√(2π) σ) · exp(−(x − µ)²/(2σ²))

Parameters µ and σ².

Maximum likelihood (the standard estimates obtained by maximizing the log-likelihood of N samples x1, . . . , xN):
µ = (1/N) Σ_j xj    σ² = (1/N) Σ_j (xj − µ)²

Slide 15

Continuous models, Multiple param.

[Plot: the linear Gaussian model P(y|x) and the corresponding (x, y) data]

Maximizing P(y|x) = 1/(√(2π) σ) · exp(−(y − (θ1x + θ2))²/(2σ²)) w.r.t. θ1, θ2

= minimizing E = Σ_{j=1}^{N} (yj − (θ1xj + θ2))²

That is, minimizing the sum of squared errors gives the ML solution for a linear fit, assuming Gaussian noise of fixed variance.
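A minimal sketch (not from the slides; the data are made up) of this least-squares fit, computing θ1 and θ2 in closed form:

def fit_line(xs, ys):
    """Least-squares fit y ≈ theta1 * x + theta2, i.e. the ML solution above."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    theta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
              / sum((x - mx) ** 2 for x in xs))
    theta2 = my - theta1 * mx
    return theta1, theta2

# noisy samples of y = 2x + 1
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.1, 1.4, 2.05, 2.4, 3.0]
print(fit_line(xs, ys))     # roughly (1.9, 1.0)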

Slide 16

Summary

Full Bayesian learning gives the best possible predictions but is intractable.
MAP learning balances complexity with accuracy on training data.
Maximum likelihood assumes a uniform prior, OK for large data sets.

  • 1. Choose a parameterized family of models to describe the data
      (requires substantial insight and sometimes new models)
  • 2. Write down the likelihood of the data as a function of the parameters
      (may require summing over hidden variables, i.e., inference)
  • 3. Write down the derivative of the log likelihood w.r.t. each parameter
  • 4. Find the parameter values such that the derivatives are zero
      (may be hard/impossible; gradient techniques help)

Slide 17

Bayesian Parameter Learning

With a small data set, the ML method leads to premature conclusions. From the Flavor example:

P(d|hθ) = ∏_{j=1}^{N} P(dj|hθ) = θ^c · (1 − θ)^ℓ  ⇒  θ = c/(c + ℓ)

If N = 1 and c = 1, ℓ = 0, we conclude θ = 1.
The Laplace adjustment can mitigate this result, but it is artificial.

Slide 18

Bayesian approach: P(θ|d) = α P(d|θ) P(θ)

We saw the likelihood to be p(X = 1|θ) = Bern(θ) = θ, which is known as the Bernoulli distribution. Further, for a set of n observed outcomes d = (x1, . . . , xn), of which s are 1s, we have the binomial sampling model:

p(D = d|θ) = p(s|θ) = Bin(s|θ) = (n choose s) θ^s (1 − θ)^(n−s)    (1)

Slide 19

The Beta Distribution

We define the prior probability p(θ) to be Beta distributed:

p(θ) = Beta(θ|a, b) = Γ(a + b)/(Γ(a)Γ(b)) · θ^(a−1) (1 − θ)^(b−1)

[Plot: Beta densities P(Θ = θ) for hyperparameters [a, b] = [1,1], [2,2], [5,5] and [3,1], [6,2], [30,10]]

Reasons for this choice:
  • it provides flexibility by varying the hyperparameters a and b (e.g., the uniform distribution is included in this family, with a = 1, b = 1)
  • the conjugacy property

Slide 20

E.g., we observe N = 1, c = 1, ℓ = 0:
p(θ|d) = α p(d|θ) p(θ) = α Bin(d|θ) p(θ) = Beta(θ|a + c, b + ℓ)
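A minimal sketch (not from the slides) of this conjugate update: because the prior is a Beta and the likelihood is binomial, the posterior is again a Beta whose hyperparameters are simply incremented by the counts.

def beta_posterior(a, b, c, l):
    """Posterior Beta(a + c, b + l) after observing c cherries and l limes."""
    return a + c, b + l

def beta_mean(a, b):
    """Posterior mean estimate of theta: E[theta] = a / (a + b)."""
    return a / (a + b)

a, b = 2, 2                                      # mildly informative prior centred on 0.5
a_post, b_post = beta_posterior(a, b, c=1, l=0)  # the N = 1 example above
print(beta_mean(a_post, b_post))                 # 0.6, not the premature ML answer of 1.0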

Slide 21

In Presence of Parents

Denote by pa_i^j the jth configuration of the parent variables Pa_i of X_i:

p(x_i | pa_i^j, θ_i) = θ_ij,

where pa_i^1, . . . , pa_i^{q_i}, with q_i = ∏_{X_l ∈ Pa_i} r_l (the product of the numbers of values of the parents), denote the configurations of Pa_i, and θ_i = (θ_ij), j = 1, . . . , q_i, are the local parameters of variable i.

In the case of no missing values (that is, all variables of the network have a value in every case of the random sample d) and independence among parameters, the parameters remain independent given d, that is,

p(θ|d) = ∏_i ∏_{j=1}^{q_i} p(θ_ij|d)

In other terms, we can update each parameter vector θ_ij independently, just as in the one-variable case. Assuming each vector has the prior distribution Beta(θ_ij|a_ij, b_ij), we obtain the posterior distribution

p(θ_ij|d) = Beta(θ_ij|a_ij + s_ij, b_ij + n − s_ij)

where s_ij is the number of cases in d in which X_i = 1 and Pa_i = pa_i^j.
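A small sketch (not from the slides; the data and variable names are made up) of this per-parent-configuration update: one Beta posterior is maintained for each (variable, parent configuration) pair and updated from its own counts.

from collections import defaultdict

# complete data for a binary child X with one binary parent P: rows are (p, x)
data = [(0, 1), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (1, 0)]

prior_a, prior_b = 1, 1               # uniform Beta(1, 1) prior for every theta_ij
counts = defaultdict(lambda: [0, 0])  # parent config -> [# cases with X=1, # cases]
for p, x in data:
    counts[p][0] += x
    counts[p][1] += 1

for p, (s, n) in sorted(counts.items()):
    a_post, b_post = prior_a + s, prior_b + n - s   # Beta(a_ij + s_ij, b_ij + n_ij - s_ij)
    print(f"P(X=1 | P={p}) ~ Beta({a_post}, {b_post}), mean {a_post / (a_post + b_post):.2f}")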

Slide 22

Outline

  • 1. Learning Graphical Models
      Parameter Learning in Bayes Nets
      Bayesian Parameter Learning
  • 2. Unsupervised Learning
      k-means
      EM Algorithm

Slide 23

K-means clustering

Init: select k cluster centers at random
repeat
    assign each data point to the nearest center
    update each cluster center to the centroid of its assigned data points
until no change
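A minimal Python sketch (not from the slides; the sample points are made up) following this pseudocode:

import random

def kmeans(points, k, rng=random.Random(0)):
    centers = rng.sample(points, k)                  # init: k random centers
    while True:
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # update step: move each center to the centroid of its cluster
        new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:                   # until no change
            return centers, clusters
        centers = new_centers

pts = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
print(kmeans(pts, 2)[0])                             # two centers, one per corner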

Slide 24

Expectation-Maximization Algorithm

Generalization of k-means that uses soft assignments.

Mixture model: exploit a hidden variable z

p(x) = Σ_z p(x, z) = Σ_z p(x | z) p(z)

Both p(x | z) and p(z) are unknown:
  assume p(x | z = i) is a multivariate Gaussian distribution N(µi, Σi)
  assume p(z) is a multinomial distribution with parameters θi
µi, Σi, θi are unknown.

Slide 25

E-step: assume we know µi, Σi, θi; calculate for each sample j the probability pij that it comes from component i:

pij = α θi |2πΣi|^(−1/2) exp{−½ (xj − µi) Σi^(−1) (xj − µi)^T}

M-step: update µi, Σi, θi:

θi = Σ_j pij / N
µi = Σ_j pij xj / Σ_j pij
Σi = Σ_j pij (xj − µi)(xj − µi)^T / Σ_j pij
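A compact sketch (not from the slides; the data are made up) of these two steps for a one-dimensional mixture of two Gaussians, using scalar variances in place of covariance matrices:

import math

def em(xs, iters=50):
    # crude initialization of means, variances and mixing weights
    mu, var, w = [min(xs), max(xs)], [1.0, 1.0], [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibilities p_ij proportional to w_i N(x_j | mu_i, var_i)
        resp = []
        for x in xs:
            dens = [w[i] / math.sqrt(2 * math.pi * var[i])
                    * math.exp(-(x - mu[i]) ** 2 / (2 * var[i])) for i in range(2)]
            z = sum(dens)
            resp.append([d / z for d in dens])
        # M-step: re-estimate parameters from the soft counts
        for i in range(2):
            ni = sum(r[i] for r in resp)
            w[i] = ni / len(xs)
            mu[i] = sum(r[i] * x for r, x in zip(resp, xs)) / ni
            var[i] = sum(r[i] * (x - mu[i]) ** 2 for r, x in zip(resp, xs)) / ni + 1e-6
    return mu, var, w

xs = [0.1, 0.2, 0.15, 0.25, 1.8, 1.9, 2.0, 2.1]
print(em(xs))    # means near 0.17 and 1.95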

Slide 26

The ML method applied directly to ∏_j p(xj | µi, Σi, θi) does not lead to a closed form.

Hence we proceed by assuming values for some parameters and deriving the others as a consequence of these choices.

The procedure finds local optima.
It can be proven that the procedure converges.
The pij are soft guesses, as opposed to the hard assignments of the k-means algorithm.
