SLIDE 1

Artificial Intelligence: Representation and Problem Solving

15-381 April 17, 2007

Probabilistic Learning

Michael S. Lewicki Carnegie Mellon Artificial Intelligence: Probabilistic Learning

Reminder

  • No class on Thursday - spring carnival.

SLIDE 2

Recall the basic algorithm for learning decision trees:

  • 1. Start with the whole training data.
  • 2. Select the attribute or value along a dimension that gives the “best” split, using information gain or another criterion.
  • 3. Create child nodes based on the split.
  • 4. Recurse on each child using the child’s data until a stopping criterion is reached:
    • all examples have the same class
    • the amount of data is too small
    • the tree is too large
  • Does this capture probabilistic relationships?


Quantifying the certainty of decisions

<2 years at current job?   missed payments?   defaulted?
N                          N                  N
Y                          N                  Y
N                          N                  N
N                          N                  N
N                          Y                  Y
Y                          N                  N
N                          Y                  N
N                          Y                  Y
Y                          N                  N
Y                          N                  N

  • Predicting credit risk
  • Suppose instead of a yes or no answer, we want some estimate of how strongly we believe a loan applicant is a credit risk.
  • This might be useful if we want some flexibility in adjusting our decision criteria.
  • E.g., suppose we’re willing to take more risk if times are good.
  • Or, we may want to examine cases we believe are higher risk more carefully.

SLIDE 3

The mushroom data

  • Or suppose we wanted to know how likely a mushroom is to be safe to eat.
  • Do decision trees give us that information?

   EDIBLE?    CAP-SHAPE  CAP-SURFACE
 1 edible     flat       fibrous
 2 poisonous  convex     smooth
 3 edible     flat       fibrous
 4 edible     convex     scaly
 5 poisonous  convex     smooth
 6 edible     convex     fibrous
 7 poisonous  flat       scaly
 8 poisonous  flat       scaly
 9 poisonous  convex     fibrous
10 poisonous  convex     fibrous
11 poisonous  flat       smooth
12 edible     convex     smooth
13 poisonous  knobbed    scaly
14 poisonous  flat       smooth
15 poisonous  flat       fibrous
 ...

Mushroom data

Fisher’s Iris data

[Figure: scatter plot of petal length (cm) vs. petal width (cm) for Iris virginica, Iris setosa, and Iris versicolor]

In which example would you be more confident about the class? Decision trees provide a classification but not uncertainty.

SLIDE 4

The general classification problem

  • The input is a set of T observations, D = {x1, ..., xT}, each an N-dimensional vector xi = (x1, ..., xN)i (binary, discrete, or continuous).
  • The desired output is a binary classification vector y = {y1, ..., yK}, where yi = 1 if x ∈ Ci ≡ class i, and 0 otherwise.
  • The model (e.g. a decision tree) is defined by M parameters, θ = {θ1, ..., θM}.
  • Given data, we want to learn a model that can correctly classify novel observations.

How do we approach this probabilistically?


The answer to all questions of uncertainty

  • Let’s apply Bayes’ rule to infer the most probable class given the observation:
  • This is the answer, but what does it mean?
  • How do we specify the terms?
  • p(Ck) is the prior probability of the different classes
  • p(x|Ck) is the data likelihood, i.e. the probability of x given class Ck
  • How should we define this?

p(Ck|x) = p(x|Ck) p(Ck) / p(x) = p(x|Ck) p(Ck) / Σk' p(x|Ck') p(Ck')
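As a concrete numeric check of Bayes’ rule, the sketch below normalizes likelihood × prior over two classes; the prior and likelihood values are made up for illustration:

```python
# Hypothetical numbers: two classes with known priors and likelihoods
# for a single observation x.
priors = {"C1": 0.3, "C2": 0.7}        # p(Ck), assumed values
likelihoods = {"C1": 0.5, "C2": 0.1}   # p(x | Ck), assumed values

# p(x) = sum_k p(x | Ck) p(Ck): the normalizing constant
evidence = sum(likelihoods[k] * priors[k] for k in priors)

# p(Ck | x) = p(x | Ck) p(Ck) / p(x)
posterior = {k: likelihoods[k] * priors[k] / evidence for k in priors}
print(posterior)  # the posteriors sum to 1
```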
SLIDE 5

What classifier would give “optimal” performance?

  • Consider the iris data again.
  • How would we minimize the number of future misclassifications?
  • We would need to know the true distribution of the classes.
  • Assume they follow a Gaussian distribution.
  • The number of samples in each class is the same (50), so assume p(Ck) is equal for all classes.
  • Because p(x) is the same for all classes, we have:

p(Ck|x) = p(x|Ck) p(Ck) / p(x) ∝ p(x|Ck) p(Ck)

[Figure: the iris scatter plot with the class-conditional densities p(petal length | C2) and p(petal length | C3)]

Where do we put the boundary?

[Figure: the class-conditional densities p(petal length | C2) and p(petal length | C3)]

SLIDE 6

Where do we put the boundary?

[Figure: the two densities with a decision boundary; R32 = region where C3 is misclassified as C2, R23 = region where C2 is misclassified as C3]

Where do we put the boundary?

Shifting the boundary trades off the two errors: R32 (C3 misclassified as C2) against R23 (C2 misclassified as C3).

SLIDE 7

Where do we put the boundary?

  • The misclassification error is defined by

p(error) = ∫R32 p(C3|x) dx + ∫R23 p(C2|x) dx

where R32 is the region in which C3 is misclassified as C2 and R23 is the region in which C2 is misclassified as C3.

  • In our case this is proportional to the data likelihood.

Where do we put the boundary?

p(error) = ∫R32 p(C3|x) dx + ∫R23 p(C2|x) dx

  • With the boundary shifted, there is a region where p(C3|x) > p(C2|x), but we’re still classifying that region as C2!

SLIDE 8

The optimal decision boundary

  • The minimal misclassification error is at the point where

p(C3|x) = p(C2|x)
⇒ p(x|C3) p(C3) / p(x) = p(x|C2) p(C2) / p(x)
⇒ p(x|C3) = p(x|C2)     (using the equal priors)

[Figure: the posteriors p(C2 | petal length) and p(C3 | petal length), crossing at the optimal decision boundary]

Note: this assumes we have only two classes.
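A minimal sketch of finding this boundary numerically, assuming two hypothetical Gaussian class-conditional densities with equal priors and a shared standard deviation (the means and σ below are illustrative, not the actual iris fits):

```python
from math import exp, pi, sqrt

def gauss(x, mu, sigma):
    """Gaussian density with mean mu and std dev sigma."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Assumed class-conditional densities for petal length
mu2, mu3, sigma = 4.3, 5.5, 0.5

# With equal priors, the boundary is where p(x|C2) = p(x|C3);
# scan a grid and take the point where the densities are closest.
xs = [i / 1000 for i in range(3000, 7000)]
boundary = min(xs, key=lambda x: abs(gauss(x, mu2, sigma) - gauss(x, mu3, sigma)))
print(boundary)  # for equal variances this is the midpoint of the means, 4.9
```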


Bayesian classification for more complex models

  • Recall the class conditional probability:

p(Ck|x) = p(x|Ck) p(Ck) / p(x) = p(x|Ck) p(Ck) / Σk' p(x|Ck') p(Ck')

  • How do we define the data likelihood p(x|Ck), i.e. the probability of x given class Ck?
SLIDE 9

Defining a probabilistic classification model

  • How would we define the credit risk problem?
  • Class:

C1 = “defaulted” C2 = “didn’t default”

  • Data:

x = { “<2 years”, “missed payments” }

  • Prior (from data):

p(C1) = 3/10; p(C2) = 7/10;

  • Likelihood:

p(x1, x2 | C1) = ? p(x1, x2 | C2) = ?

  • How would we determine these?


  • Predicting credit risk


Defining a probabilistic model by counting

  • The “prior” is obtained by counting the number of examples of each class in the data:

p(Ck = k) = Count(Ck = k) / #records

  • The likelihood is obtained the same way:

p(x = v|Ck = k) = Count(x = v ∧ Ck = k) / Count(Ck = k)

p(x1 = v1, ..., xN = vN|Ck = k) = Count(x1 = v1 ∧ ... ∧ xN = vN ∧ Ck = k) / Count(Ck = k)

  • This is the maximum likelihood estimate (MLE) of the probabilities.
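The counting estimates can be sketched in code. The records below follow the credit-risk table; since that table was reconstructed from a flattened extraction, treat the exact rows as illustrative:

```python
from collections import Counter

# (x1 = "<2 years at current job", x2 = "missed payments", defaulted)
records = [
    ("N", "N", "N"), ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"),
    ("N", "Y", "Y"), ("Y", "N", "N"), ("N", "Y", "N"), ("N", "Y", "Y"),
    ("Y", "N", "N"), ("Y", "N", "N"),
]

# Prior: count classes / #records
class_counts = Counter(y for _, _, y in records)
prior = {c: n / len(records) for c, n in class_counts.items()}

def likelihood(x1, x2, c):
    """MLE of p(x1, x2 | C=c): joint count divided by class count."""
    joint = sum(1 for a, b, y in records if (a, b, y) == (x1, x2, c))
    return joint / class_counts[c]

print(prior["Y"])                 # 3/10 defaulted
print(likelihood("N", "Y", "Y"))  # 2 of the 3 defaulters had (N, Y)
```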

SLIDE 10

Defining a probabilistic classification model

  • Determining the likelihood:

p(x1, x2 | C1) = ?   p(x1, x2 | C2) = ?

  • Simple approach: look at counts in the data (predicting credit risk).

x1 (<2 years at current job?)   x2 (missed payments?)   C1 (did default)   C2 (did not default)
N                               N
N                               Y
Y                               N
Y                               Y

Filling in the table cell by cell from the counts:

x1 (<2 years at current job?)   x2 (missed payments?)   C1 (did default)   C2 (did not default)
N                               N                       0/3                3/3
N                               Y                       2/3                1/3
Y                               N                       1/4                3/4
Y                               Y                       0/0                0/0

What do we do about the 0/0 entries?

SLIDE 13

Being (proper) Bayesians: Recall our coin-flipping example

  • In Bernoulli trials, each sample is either 1 (e.g. heads) with probability θ, or 0 (tails) with probability 1 − θ.
  • The binomial distribution specifies the probability of the total number of heads, y, out of n trials:

p(y|θ, n) = (n choose y) θ^y (1 − θ)^(n−y)

[Figure: p(y | θ = 0.25, n = 10) plotted against y]
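A small sketch of the binomial pmf, using Python’s math.comb for the binomial coefficient:

```python
from math import comb

def binom_pmf(y, n, theta):
    """p(y | theta, n): probability of exactly y heads in n Bernoulli trials."""
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

# The distribution over #heads for theta = 0.25, n = 10, as in the plot
pmf = [binom_pmf(y, 10, 0.25) for y in range(11)]
mode = max(range(11), key=lambda y: pmf[y])
print(sum(pmf))  # sums to 1 (up to rounding)
print(mode)      # 2, near n * theta = 2.5
```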


Applying Bayes’ rule

  • Given n trials with y heads, what do we know about θ?
  • We can apply Bayes’ rule to see how our knowledge changes as we acquire new observations:

p(θ|y, n) = p(y|θ, n) p(θ|n) / p(y|n)

posterior = likelihood × prior / normalizing constant, where p(y|n) = ∫ p(y|θ, n) p(θ|n) dθ.

  • We know the likelihood; what about the prior? Uniform on [0, 1] is a reasonable assumption, i.e. “we don’t know anything”.
  • In this case, the posterior is just proportional to the likelihood:

p(θ|y, n) ∝ (n choose y) θ^y (1 − θ)^(n−y)

  • What is the form of the posterior?

SLIDE 14

Evaluating the posterior

[Figure: p(θ | y=0, n=0), uniform over θ ∈ [0, 1]]

What do we know initially, before observing any trials?


Coin tossing

[Figure: p(θ | y=0, n=1)]

What is our belief about θ after observing one “tail”?

SLIDE 15

Coin tossing

[Figure: p(θ | y=1, n=2)]

Now after two trials we observe 1 head and 1 tail.


Coin tossing

[Figure: p(θ | y=1, n=3)]

3 trials: 1 head and 2 tails.

SLIDE 16

Coin tossing

[Figure: p(θ | y=1, n=4)]

4 trials: 1 head and 3 tails.


Coin tossing

[Figure: p(θ | y=1, n=5)]

5 trials: 1 head and 4 tails.

SLIDE 17

Evaluating the normalizing constant

  • To get proper probability density functions, we need to evaluate p(y|n):

p(θ|y, n) = p(y|θ, n) p(θ|n) / p(y|n)

Bayes, in his original 1763 paper, showed that:

p(y|n) = ∫₀¹ p(y|θ, n) p(θ|n) dθ = 1/(n + 1)

⇒ p(θ|y, n) = (n choose y) θ^y (1 − θ)^(n−y) (n + 1)


The ratio estimate

  • What about after just one trial: 0 heads and 1 tail?

[Figure: p(θ | y=0, n=1), with its maximum at θ = 0]

The MAP and ratio estimates would both say 0: y/n = 0, and the MAP estimate is 0. Does this make sense? What would a better estimate be?

SLIDE 18

The expected value estimate

  • The expected value of the pdf is:

E(θ|y, n) = ∫₀¹ θ p(θ|y, n) dθ = (y + 1)/(n + 2)

[Figure: p(θ | y=0, n=1)]

After one tail: E(θ|y = 0, n = 1) = 1/3. What happens for zero trials? E(θ|y = 0, n = 0) = 1/2.

This is called “smoothing” or “regularization”.
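Both estimators side by side, as a sketch:

```python
def ratio_estimate(y, n):
    """MLE / ratio estimate y/n; undefined for zero trials."""
    return y / n if n else float("nan")

def expected_value_estimate(y, n):
    """Posterior mean under a uniform prior: (y + 1) / (n + 2)."""
    return (y + 1) / (n + 2)

# One trial, zero heads: the ratio estimate says theta = 0,
# while the posterior mean hedges toward 1/2.
print(ratio_estimate(0, 1))           # 0.0
print(expected_value_estimate(0, 1))  # 1/3
print(expected_value_estimate(0, 0))  # 1/2 before any data
```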


On to the mushrooms!

   EDIBLE?    CAP-SHAPE  CAP-SURFACE  CAP-COLOR  ODOR   STALK-SHAPE  POPULATION  HABITAT
 1 edible     flat       fibrous      red        none   tapering     several     woods
 2 poisonous  convex     smooth       red        foul   tapering     several     paths
 3 edible     flat       fibrous      brown      none   tapering     abundant    grasses
 4 edible     convex     scaly        gray       none   tapering     several     woods
 5 poisonous  convex     smooth       red        foul   tapering     several     woods
 6 edible     convex     fibrous      gray       none   tapering     several     woods
 7 poisonous  flat       scaly        brown      fishy  tapering     several     leaves
 8 poisonous  flat       scaly        brown      spicy  tapering     several     leaves
 9 poisonous  convex     fibrous      yellow     foul   enlarging    several     paths
10 poisonous  convex     fibrous      yellow     foul   enlarging    several     woods
11 poisonous  flat       smooth       brown      spicy  tapering     several     woods
12 edible     convex     smooth       yellow     anise  tapering     several     woods
13 poisonous  knobbed    scaly        red        foul   tapering     several     leaves
14 poisonous  flat       smooth       brown      foul   tapering     several     leaves
15 poisonous  flat       fibrous      gray       foul   enlarging    several     woods
16 edible     sunken     fibrous      brown      none   enlarging    solitary    urban
17 poisonous  flat       smooth       brown      foul   tapering     several     woods
18 poisonous  convex     smooth       white      foul   tapering     scattered   urban
19 poisonous  flat       scaly        yellow     foul   enlarging    solitary    paths
20 edible     convex     fibrous      gray       none   tapering     several     woods
 ...
SLIDE 19

The scaling problem

p(x = v|Ck) = Count(x = v ∧ Ck = k) / Count(Ck = k)

p(x1 = v1, ..., xN = vN|Ck = k) = Count(x1 = v1 ∧ ... ∧ xN = vN ∧ Ck = k) / Count(Ck = k)

  • The prior is easy enough.
  • But for the likelihood, the table is huge!

Mushroom attributes and values

EDIBLE: edible poisonous (2 values)
CAP-SHAPE: bell conical convex flat knobbed sunken (6)
CAP-SURFACE: fibrous grooves scaly smooth (4)
CAP-COLOR: brown buff cinnamon gray green pink purple red white yellow (10)
BRUISES: bruises no (2)
ODOR: almond anise creosote fishy foul musty none pungent spicy (9)
GILL-ATTACHMENT: attached free (2)
GILL-SPACING: close crowded (2)
GILL-SIZE: broad narrow (2)
GILL-COLOR: black brown buff chocolate gray green orange pink purple red white yellow (12)
STALK-SHAPE: enlarging tapering (2)
STALK-ROOT: bulbous club equal rooted (4)
STALK-SURFACE-ABOVE-RING: fibrous scaly silky smooth (4)
STALK-SURFACE-BELOW-RING: fibrous scaly silky smooth (4)
STALK-COLOR-ABOVE-RING: brown buff cinnamon gray orange pink red white yellow (9)
STALK-COLOR-BELOW-RING: brown buff cinnamon gray orange pink red white yellow (9)
VEIL-TYPE: partial universal (2)
VEIL-COLOR: brown orange white yellow (4)
RING-NUMBER: none one two (3)
RING-TYPE: evanescent flaring large none pendant (5)
SPORE-PRINT-COLOR: black brown buff chocolate green orange purple white yellow (9)
POPULATION: abundant clustered numerous scattered several solitary (6)
HABITAT: grasses leaves meadows paths urban waste woods (7)

22 attributes with an average of 5 values!

SLIDE 20

Simplifying with “Naïve” Bayes

  • What if we assume the features are independent?
  • We know that’s not precisely true, but it might make a good approximation.
  • Now we only need to specify N different likelihoods:

p(x|Ck) = p(x1, ..., xN|Ck) = ∏n=1..N p(xn|Ck)

p(xi = vi|Ck = k) = Count(xi = vi ∧ Ck = k) / Count(Ck = k)

  • Huge savings in the number of parameters.


Inference with Naïve Bayes

  • Inference is just like before, but with the independence approximation:

p(Ck|x) = p(x|Ck) p(Ck) / p(x) = p(Ck) ∏n p(xn|Ck) / Σk' p(Ck') ∏n p(xn|Ck')

  • Classification performance is often surprisingly good.
  • Easy to implement.
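A minimal naïve Bayes classifier by counting, on a toy mushroom-like dataset; the rows and attribute values are invented for illustration:

```python
from collections import Counter, defaultdict

# Toy categorical data: (cap_shape, odor) -> class
data = [
    (("flat", "none"), "edible"),
    (("convex", "foul"), "poisonous"),
    (("flat", "none"), "edible"),
    (("convex", "none"), "edible"),
    (("flat", "foul"), "poisonous"),
]

class_counts = Counter(c for _, c in data)
# (feature index, class) -> counts of each value
feature_counts = defaultdict(Counter)
for x, c in data:
    for i, v in enumerate(x):
        feature_counts[(i, c)][v] += 1

def posterior(x):
    """p(Ck | x) under the naive independence assumption."""
    scores = {}
    for c, nc in class_counts.items():
        p = nc / len(data)                       # prior p(Ck)
        for i, v in enumerate(x):
            p *= feature_counts[(i, c)][v] / nc  # p(xi | Ck) by counting
        scores[c] = p
    z = sum(scores.values())                     # normalize
    return {c: s / z for c, s in scores.items()}

print(posterior(("flat", "none")))  # strongly favors "edible"
```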

SLIDE 21

Implementation issues

  • If you implement Naïve Bayes naïvely, you’ll run into trouble. Why?
  • It’s never good to compute products of a long list of numbers
  • They’ll quickly go to zero with machine precision, even using doubles (64 bit)
  • Strategy: compute log probabilities
  • What about that constant? It still has a product.

p(Ck|x) = p(Ck) ∏n p(xn|Ck) / Σk' p(Ck') ∏n p(xn|Ck')

log p(Ck|x) = log p(Ck) + Σn log p(xn|Ck) − log Σk' p(Ck') ∏n p(xn|Ck')
            = log p(Ck) + Σn log p(xn|Ck) − constant


Converting back to probabilities

  • The only requirement of the denominator is that it normalize the numerator to

yield a valid probability distribution.

  • We used a log transformation: gi = log pi + constant
  • The form of the probability is the same for any constant c:

pi / Σi pi = e^gi / Σi e^gi = e^c e^gi / Σi e^c e^gi = e^(gi + c) / Σi e^(gi + c)

  • A common choice: choose c so that the log probabilities are shifted to zero:

c = − maxi gi
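A sketch of this shift-then-normalize trick; with log scores around −1000, calling exp() directly would underflow to zero:

```python
from math import exp

def normalize_log_probs(gs):
    """Convert unnormalized log probabilities to probabilities,
    shifting by c = -max(g) so exp() never underflows."""
    c = -max(gs)
    ws = [exp(g + c) for g in gs]  # largest term becomes exp(0) = 1
    z = sum(ws)
    return [w / z for w in ws]

gs = [-1000.0, -1001.0, -1002.0]  # exp(-1000) alone underflows to 0.0
ps = normalize_log_probs(gs)
print(ps)  # same ratios as exp(0) : exp(-1) : exp(-2)
```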

SLIDE 22

Recall another simplifying assumption: Noisy-OR

  • We assume each cause Cj can produce effect Ei with probability fij.
  • The noisy-OR model assumes the parent causes of effect Ei contribute independently.
  • The probability that none of them caused effect Ei is simply the product of the probabilities that each one did not cause Ei.
  • The probability that any of them caused Ei is just one minus the above, i.e.

P(Ei|par(Ei)) = P(Ei|C1, ..., Cn) = 1 − ∏j (1 − P(Ei|Cj)) = 1 − ∏j (1 − fij)

Example: catch cold (C) from hit by viral droplet (D), touch contaminated object (O), or eat contaminated food (F):

P(C|D, O, F) = 1 − (1 − fCD)(1 − fCO)(1 − fCF)
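The noisy-OR combination rule as a sketch, with made-up per-cause probabilities for the cold example:

```python
def noisy_or(f_active):
    """P(effect | active causes) = 1 - prod(1 - f) over the active causes,
    where each f is the probability that that cause alone produces the effect."""
    p_none = 1.0
    for f in f_active:
        p_none *= 1.0 - f  # probability this cause did NOT produce the effect
    return 1.0 - p_none

# Hypothetical per-cause probabilities for catching a cold
f_droplet, f_object, f_food = 0.4, 0.2, 0.1
print(noisy_or([f_droplet, f_object, f_food]))  # 1 - 0.6*0.8*0.9 = 0.568
```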

A general one-layer causal network

  • Could either model causes and effects, or equivalently stochastic binary features.
  • Each input xi encodes the probability that the ith binary input feature is present.
  • The set of features represented by j is defined by weights fij, which encode the probability that feature i is an instance of j.

SLIDE 23

The data: a set of stochastic binary patterns

  • Each column is a distinct eight-dimensional binary feature.
  • There are five underlying causal feature patterns. What are they?

[Figure: the true hidden causes of the data and the inferred causes of the data]
SLIDE 24


Hierarchical Statistical Models

A Bayesian belief network: states Si, each with parents pa(Si), generating data D.

The joint probability of binary states is

P(S|W) = ∏i P(Si|pa(Si), W)

The probability of Si depends only on its parents:

P(Si|pa(Si), W) = h(Σj Sj wji)       if Si = 1
P(Si|pa(Si), W) = 1 − h(Σj Sj wji)   if Si = 0

The function h specifies how causes are combined: h(u) = 1 − exp(−u), u > 0.

Main points:

  • hierarchical structure allows the model to form high-order representations
  • upper states are priors for lower states
  • weights encode higher-order features

  • Model represents stochastic binary features.
  • Each input xi encodes the probability that the ith binary input feature is present.
  • The set of features represented by j is defined by weights fij, which encode the probability that feature i is an instance of j.
  • Trick: it’s easier to adapt weights in an unbounded space, so use the transformation fij = 1 − exp(−wij) and optimize in w-space.

Gibbs sampling (back to the example from last lecture)

SLIDE 25

The data: a set of stochastic binary patterns

  • Each column is a distinct eight-dimensional binary feature.

[Figure: the true hidden causes of the data and the inferred causes of the data]

Hierarchical Statistical Models

A Bayesian belief network: the joint probability of binary states is

P(S|W) = ∏i P(Si|pa(Si), W)

where Si depends only on its parents:

P(Si|pa(Si), W) = h(Σj Sj wji) if Si = 1, and 1 − h(Σj Sj wji) if Si = 0,

with h(u) = 1 − exp(−u), u > 0. Main points:

  • hierarchical structure allows the model to form high-order representations
  • upper states are priors for lower states
  • weights encode higher-order features

SLIDE 26


Learning Objective

Adapt W to find the most probable explanation of the input patterns. The probability of the data is

P(D1:N|W) = ∏n P(Dn|W)

P(Dn|W) is computed by marginalizing over the states:

P(Dn|W) = Σk P(Dn|Sk, W) P(Sk|W)

Computing this sum exactly is intractable, but we can still make accurate approximations.


Approximating P(Dn|W)

Good representations should have just one or a few possible explanations for most patterns.

  • in this case, most P(Dn|Sk, W) will be zero
  • the following approximation will be very accurate once weights have adapted

P(Dn|W) ≈ P(Dn|Ŝ, W) P(Ŝ|W), where Ŝ is the most probable explanation.

  • approximation becomes increasingly accurate as learning proceeds


SLIDE 27

Adapting the Weights

The complexity of the model is controlled by placing a prior on the weights.

  • assume the prior to be a product of gamma distributions
  • the objective function becomes

L = P(D1:N|W) P(W|α, β)

A simple and efficient EM formula for adapting the weights can be derived using the transformations fij = 1 − exp(−wij) and gi = 1 − exp(−ui):

fij ← (α − 1 + 2 fij + Σn Si^(n) Sj^(n) fij / gj^(n)) / (α + β + Σn Si^(n))

  • fij can be interpreted as the frequency of state Sj given cause Si
  • fij is a weighted average of the number of times Sj was active given Si
  • the ratio fij/gj inversely weights each term by the number of causes for Sj

Using EM to adapt the network parameters


Inferring the best representation of the observed variables

  • Given the input D, there is no simple way to determine which states are the input’s most likely causes.
  • Computing the most probable network state is an inference process:
  • we want to find the explanation of the data with the highest probability
  • this can be done efficiently with Gibbs sampling
  • Gibbs sampling is another example of an MCMC method
  • Key idea: the samples are guaranteed to converge to the true posterior probability distribution

SLIDE 28

Gibbs Sampling

Gibbs sampling is a way to select an ensemble of states that are representative of the posterior distribution P(S|D, W).

  • Each state of the network is updated iteratively according to the probability of Si given the remaining states.
  • This conditional probability can be computed using (Neal, 1992):

P(Si = a | Sj : j ≠ i, W) ∝ P(Si = a | pa(Si), W) ∏j∈ch(Si) P(Sj | pa(Sj), Si = a, W)

  • the limiting ensemble of states will be typical samples from P(S|D, W)
  • this also works if any subset of states is fixed and the rest are sampled

Network Interpretation of the Gibbs Sampling Equation

The probability of Si changing state given the remaining states is

P(Si = 1 − Si | Sj : j ≠ i, W) = 1 / (1 + exp(−Δxi))

Δxi indicates how much changing the state Si changes the probability of the whole network state:

Δxi = log h(ui; 1 − Si) − log h(ui; Si) + Σj∈ch(Si) [log h(uj + δij; Sj) − log h(uj; Sj)]

  • ui is the causal input to Si: ui = Σk Sk wki
  • δij specifies the change in uj for a change in Si: δij = +Sj wij if Si = 0, or −Sj wij if Si = 1

The Gibbs sampling equations (derivation omitted)
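The flip rule has the familiar logistic form. A minimal sketch (Δxi itself depends on the whole network, so here it is simply taken as an input):

```python
from math import exp

def flip_probability(delta_x):
    """Probability of flipping state S_i, given the change delta_x in the
    log probability of the whole network state: a logistic function."""
    return 1.0 / (1.0 + exp(-delta_x))

# If flipping S_i makes the joint state much more probable, the flip is
# near-certain; much less probable, near-impossible; no change, a coin flip.
print(flip_probability(5.0))   # close to 1
print(flip_probability(-5.0))  # close to 0
print(flip_probability(0.0))   # 0.5
```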

SLIDE 29

Interpretation of the Gibbs sampling equation

  • The Gibbs equation can be interpreted as feedback + feedforward:
  • feedback: how consistent is Si with the current causes?
  • feedforward: how likely is Si a cause of its children?
  • feedback allows the lower-level units to use information only computable at higher levels
  • feedback determines (disambiguates) the state when the feedforward input is ambiguous


The higher-order lines problem

[Figure: the true generative model, with its connection probabilities, and patterns sampled from the model]

Can we infer the structure of the network given only the patterns?

SLIDE 30

Weights in a 25-10-5 belief network after learning

The first layer of weights learns that the patterns are combinations of lines. The second layer learns combinations of the first-layer features.


The Shifter Problem

[Figure: shift patterns (A, B) and the weights of a 32-20-2 network after learning]

SLIDE 31

Gibbs sampling: feedback disambiguates lower-level states

Once the structure is learned, Gibbs updating converges in two sweeps.


Next time (which is next Tuesday)


  • classifying with other models

Don’t forget: Spring Carnival this week - No class on Thursday.