

SLIDE 1

COSC343: Artificial Intelligence

Lecture 16: Introduction to probability theory

Alistair Knott

  • Dept. of Computer Science, University of Otago

SLIDES 2-3

Probabilistic learning algorithms


In the next two lectures, I’ll introduce probabilistic learning algorithms. These algorithms take a set of training data and learn a probabilistic model of the data. The model can be used to assess the probabilities of events—including events not seen in the training data. For instance:

Training data: people with meningitis—what symptoms do they show?
Model: takes symptoms, and estimates the probability of meningitis.

SLIDE 4

Defining a sample space

A sample space is a model of ‘all possible ways the world can be’. Formally, it’s the space of all possible values of the inputs and outputs to the function f(x1, ..., xn). Each of these defines one dimension of the sample space. Each possible combination of values is called a sample point.

Formally, a probability model assigns a probability to each sample point in a sample space. Each probability is between 0 and 1 inclusive. The probabilities for all points in the space sum to 1.
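To make this concrete, here is a minimal sketch (my own illustration, not code from the lecture) that builds the two-dice sample space used below and checks that a uniform assignment is a valid probability model. The names `sample_space` and `model` are hypothetical:

```python
# Sketch: a sample space for two dice. Each sample point is one
# combination of values of Roll_1 and Roll_2; a probability model
# maps each sample point to a number in [0, 1].
from itertools import product

values = [1, 2, 3, 4, 5, 6]
sample_space = list(product(values, values))   # 36 sample points

# A uniform probability model: every sample point gets 1/36.
model = {point: 1 / 36 for point in sample_space}

assert all(0 <= p <= 1 for p in model.values())
assert abs(sum(model.values()) - 1.0) < 1e-9   # probabilities sum to 1
```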

SLIDE 5

A simple probability model

Imagine we roll a single die. There’s just one variable in our sample space (call it Roll), which has 6 possible values:

Roll:  1    2    3    4    5    6
       p1   p2   p3   p4   p5   p6

We can estimate the probability at each point by generating a training set of die rolls and using relative frequencies of events in this set:

p(Roll = n) = count(Roll = n) / size(training_set)

Terminology: note that variables are capitalised!
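As an illustrative sketch (assuming a simulated fair die; this is not the lecture’s own code), the relative-frequency estimate looks like this:

```python
# Sketch: estimate p(Roll = n) from a training set of die rolls,
# using relative frequencies. A fair die is simulated here.
import random
from collections import Counter

training_set = [random.randint(1, 6) for _ in range(10_000)]
counts = Counter(training_set)

# p(Roll = n) = count(Roll = n) / size(training_set)
p = {n: counts[n] / len(training_set) for n in range(1, 7)}
print(p)   # each value should be close to 1/6 ≈ 0.167
```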

SLIDE 6

A two-dimensional probability model

If we roll two dice many times, we can build a probability model looking something like this:

Roll_2 \ Roll_1:    1     2     3     4     5     6
        1          1/36  1/36  1/36  1/36  1/36  1/36
        2          1/36  1/36  1/36  1/36  1/36  1/36
        3          1/36  1/36  1/36  1/36  1/36  1/36
        4          1/36  1/36  1/36  1/36  1/36  1/36
        5          1/36  1/36  1/36  1/36  1/36  1/36
        6          1/36  1/36  1/36  1/36  1/36  1/36

SLIDE 7

Some terminology

An event is any subset of points in a sample space. The probability of an event E is the sum of the probabilities of each sample point it contains:

p(E) = Σ_{ω ∈ E} p(ω)

SLIDE 8

Events

What’s P(Roll_1 = 5)? In the two-dice table above, this event is the Roll_1 = 5 column: six sample points, each with probability 1/36, so P(Roll_1 = 5) = 6/36 = 1/6.

SLIDE 9

Events

Events can also be partial descriptions of outcomes. What’s P(Roll_1 ≥ 4)? This event covers the columns Roll_1 = 4, 5 and 6 of the table above: 18 sample points of probability 1/36 each, so P(Roll_1 ≥ 4) = 18/36 = 1/2.
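A short sketch (again my own illustration, reusing the uniform two-dice model) that computes these event probabilities by summing over sample points:

```python
# Sketch: the probability of an event is the sum of the probabilities
# of the sample points it contains: p(E) = sum of p(omega), omega in E.
from itertools import product

model = {point: 1 / 36 for point in product(range(1, 7), repeat=2)}

def p_event(event):
    """Sum the model's probabilities over sample points in the event."""
    return sum(prob for point, prob in model.items() if event(point))

# point = (Roll_1, Roll_2)
print(p_event(lambda pt: pt[0] == 5))   # P(Roll_1 = 5)  = 6/36  ≈ 0.167
print(p_event(lambda pt: pt[0] >= 4))   # P(Roll_1 >= 4) = 18/36 = 0.5
```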

SLIDE 10

Continuous and discrete variables

The sample spaces we’ve seen so far have been built from discrete random variables. But you can build probability models using continuous variables too. E.g. we can define a random variable Temperature, whose domain is the real numbers.

Terminology: For Boolean variables (e.g. Stiff_neck), lower-case is shorthand for ‘true’, and ‘¬’ means ‘not’:

stiff_neck ≡ Stiff_neck = true
¬stiff_neck ≡ Stiff_neck = false

SLIDE 11

Probability distributions

A probability model induces a probability distribution for each random variable. This distribution is a function, whose domain is all possible values for the random variable, and which returns a probability for each possible value. The area under the graph has to sum to 1.

Terminology (note capitalisation!):
p(E) is the probability of an event.
P(V) is a probability distribution for the variable V.

E.g. P(Roll_1):

Roll_1:  1    2    3    4    5    6
         1/6  1/6  1/6  1/6  1/6  1/6

[Bar chart: a uniform distribution over the values 1-6, each bar at height 1/6.]

SLIDE 12

Probability for continuous variables

For continuous variables, distributions are continuous. Here’s a function which gives a uniform probability for values between 18 and 26:

P(X = x) = U[18, 26](x) = 0.125 for 18 ≤ x ≤ 26 (and 0 elsewhere)

Here P is a density, which integrates to 1. So P(X = 20.5) = 0.125 really means

lim_{dx→0} P(20.5 ≤ X ≤ 20.5 + dx) / dx = 0.125
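A tiny numerical sketch (my own, assuming the U[18, 26] density above; `uniform_cdf` is a hypothetical helper) of this limit definition:

```python
# Sketch: the density value P(X = 20.5) = 0.125 as the ratio
# P(20.5 <= X <= 20.5 + dx) / dx under the uniform density U[18, 26].
def uniform_cdf(x, a=18.0, b=26.0):
    """P(X <= x) for X uniform on [a, b]."""
    return min(max((x - a) / (b - a), 0.0), 1.0)

for dx in (1.0, 0.1, 0.001):
    mass = uniform_cdf(20.5 + dx) - uniform_cdf(20.5)
    print(dx, mass / dx)   # 0.125 each time; in general the ratio
                           # tends to the density as dx -> 0
```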

SLIDE 13

Gaussian density

A particularly useful probability function for continuous variables is the Gaussian function:

P(x) = (1 / (σ√(2π))) · e^(−(x−µ)² / (2σ²))

Lots of real-world variables have this distribution.
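As a sketch (my own illustration, not from the lecture), the density can be coded directly and checked to integrate to approximately 1:

```python
# Sketch: the Gaussian density, checked numerically to integrate to ~1.
import math

def gaussian(x, mu=0.0, sigma=1.0):
    """P(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)**2 / (2*sigma**2))"""
    return (math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

# Crude numerical integration over [-10, 10], which covers almost all the mass.
dx = 0.001
total = sum(gaussian(-10 + i * dx) * dx for i in range(int(20 / dx)))
print(total)   # ≈ 1.0
```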

SLIDE 14

A simple medical example

Consider a medical scenario, with 3 Boolean variables:

  • Cavity (does the patient have a cavity or not?)
  • Toothache (does the patient have a toothache or not?)
  • Catch (does the dentist’s probe catch on the patient’s tooth?)

Here’s an example probability model: the joint probability distribution P(Toothache, Catch, Cavity). (Note the capital letters: we’re enumerating all possible values for each variable.)

                 toothache             ¬toothache
                 catch     ¬catch      catch     ¬catch
cavity           .108      .012        .072      .008
¬cavity          .016      .064        .144      .576
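A sketch of this joint distribution as a plain Python dictionary (my own encoding, not from the lecture; the name `joint` is hypothetical). The examples further below reuse it:

```python
# Sketch: the joint distribution P(Toothache, Catch, Cavity) as a
# dictionary from sample points to probabilities.
joint = {
    # (toothache, catch, cavity): probability
    (True,  True,  True):  0.108,
    (True,  False, True):  0.012,
    (False, True,  True):  0.072,
    (False, False, True):  0.008,
    (True,  True,  False): 0.016,
    (True,  False, False): 0.064,
    (False, True,  False): 0.144,
    (False, False, False): 0.576,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9   # a valid probability model
```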

SLIDES 15-18

Inference from a joint distribution

Given a full joint distribution, we can compute the probability of any event simply by summing the probabilities of the relevant sample points.

(Using the joint distribution table P(Toothache, Catch, Cavity) given above.)

E.g. how to calculate p(toothache)?

p(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2

E.g. how to calculate p(cavity ∨ toothache)?

p(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
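A sketch of this inference by enumeration (reusing the hypothetical `joint` dictionary from the earlier sketch; `p_event` is my own helper name):

```python
# Sketch: compute the probability of any event by summing the
# probabilities of the relevant sample points in the joint distribution.
def p_event(event):
    """Sum joint probabilities over sample points satisfying the event."""
    return sum(prob for point, prob in joint.items() if event(point))

# point = (toothache, catch, cavity)
print(p_event(lambda pt: pt[0]))            # p(toothache) = 0.2
print(p_event(lambda pt: pt[2] or pt[0]))   # p(cavity or toothache) = 0.28
```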

SLIDE 19

Set-theoretic relationships in probability

Note that we can describe the probabilities of logically related events in set-theoretic terms. For instance:

p(a ∨ b) = p(a) + p(b) − p(a ∧ b)

[Venn diagram: two overlapping circles A and B; adding p(a) and p(b) counts the overlap a ∧ b twice, so it is subtracted once.]
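A quick check of this identity on the dentist example (my own sketch, reusing `p_event` and the `joint` dictionary from the sketches above):

```python
# Sketch: verify p(a or b) = p(a) + p(b) - p(a and b) on the joint above.
p_cavity    = p_event(lambda pt: pt[2])             # 0.2
p_toothache = p_event(lambda pt: pt[0])             # 0.2
p_both      = p_event(lambda pt: pt[2] and pt[0])   # 0.12
print(p_cavity + p_toothache - p_both)              # 0.28, as computed above
```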

SLIDES 20-21

Prior probabilities and conditional probabilities

Assume we have built a probability model from some training data, and we are now considering a test item. If we don’t know anything about this item, all we can compute is prior probabilities: e.g. p(toothache). But if we know some of the item’s properties, we can compute conditional probabilities based on these properties.

Terminology:
p(cavity|toothache): the probability of cavity, given that the patient has a toothache.
P(Cavity|Toothache): a conditional probability distribution (a table of conditional probabilities for all combinations of values of Cavity and Toothache).

SLIDES 22-25

Computing conditional probabilities

Assume we begin with the prior probabilities in the joint distribution table above...


...and then learn that a ‘test patient’ has a toothache. What’s the new probability that the patient has no cavity?


Once we know toothache, the toothache columns of the table are the only remaining possibilities; within them, the ¬cavity row is the event we’re interested in. (The original slide highlighted these regions in red and green respectively.)


P(¬cavity|toothache) = P(¬cavity ∧ toothache) / P(toothache)
                     = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064)
                     = 0.08 / 0.2
                     = 0.4
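The same computation as a sketch (reusing the hypothetical `p_event` and `joint` from the earlier sketches):

```python
# Sketch: P(not cavity | toothache) from the joint distribution above.
p_toothache = p_event(lambda pt: pt[0])                                # 0.2
p_no_cavity_and_toothache = p_event(lambda pt: (not pt[2]) and pt[0])  # 0.08
print(p_no_cavity_and_toothache / p_toothache)                         # 0.4
```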

SLIDES 26-30

Conditional probability

The general definition of conditional probability:

P(a|b) = P(a ∧ b) / P(b)    (provided P(b) ≠ 0)

The product rule gives an alternative formulation:

P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a)

The chain rule is derived by successive application of the product rule:

p(X1 ∧ ... ∧ Xn) = p(X1 ∧ ... ∧ Xn−1) p(Xn | X1 ∧ ... ∧ Xn−1)
                 = p(X1 ∧ ... ∧ Xn−2) p(Xn−1 | X1 ∧ ... ∧ Xn−2) p(Xn | X1 ∧ ... ∧ Xn−1)
                 = ∏_{i=1}^{n} p(Xi | X1 ∧ ... ∧ Xi−1)
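A small numeric check of the product rule on the dentist example (my own sketch, reusing the hypothetical `p_event` and `joint` from above):

```python
# Sketch: verify p(toothache and cavity) = p(toothache | cavity) * p(cavity).
p_cavity = p_event(lambda pt: pt[2])                                      # 0.2
p_toothache_given_cavity = p_event(lambda pt: pt[0] and pt[2]) / p_cavity # 0.6
lhs = p_event(lambda pt: pt[0] and pt[2])                                 # 0.12
rhs = p_toothache_given_cavity * p_cavity                                 # 0.12
assert abs(lhs - rhs) < 1e-12
```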

SLIDE 31

Conditional probability for whole distributions

Conditional probabilities can also be computed for whole distributions.

(Again using the joint distribution table above.)

For instance, the conditional probability distribution of Cavity and Catch given toothache shows all possible values of Cavity and Catch given toothache.

P(Cavity, Catch|toothache):

             catch       ¬catch
cavity       .108/.2     .012/.2
¬cavity      .016/.2     .064/.2

SLIDE 32

Normalization

Notice that each cell in the conditional joint distribution is obtained by dividing by the same number, p(toothache); equivalently, by multiplying by the constant 1/p(toothache). We can think of this number as a normalisation constant (often called α), which ensures that the values in the conditional distribution still sum to 1.

Normalising a conditional joint distribution allows some useful shortcuts. A non-normalised distribution can be written like this:

P(Cavity, Catch, toothache)

We can then state:

P(Cavity, Catch|toothache) = α × P(Cavity, Catch, toothache)

We can compute α simply by summing all the numbers in the non-normalised distribution and taking the reciprocal of the total.
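A sketch of this normalisation shortcut (reusing the hypothetical `joint` dictionary from above; `unnormalised`, `alpha` and `conditional` are my own names):

```python
# Sketch: build the non-normalised P(Cavity, Catch, toothache), then
# multiply by alpha = 1 / (sum of entries) to get P(Cavity, Catch | toothache).
unnormalised = {
    (cavity, catch): joint[(True, catch, cavity)]   # Toothache fixed to true
    for cavity in (True, False)
    for catch in (True, False)
}
alpha = 1 / sum(unnormalised.values())   # = 1 / 0.2 = 5.0
conditional = {k: alpha * v for k, v in unnormalised.items()}
print(conditional)
# {(True, True): 0.54, (True, False): 0.06, (False, True): 0.08, (False, False): 0.32}
```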

SLIDE 33

Summary

  • Machine learning algorithms let us find out about systems when we don’t have full knowledge of their structure.
  • Supervised learning algorithms induce a general hypothesis from a finite set of training instances.
  • There’s a trade-off between consistency with training data and generalisation to test data.
  • Probability theory is the foundation for many learning algorithms. Key concepts: sample space, probability model, random variable, probability distribution, prior and conditional probability.

SLIDE 34

Reading

For today: AIMA Sections 13.1–13.2
For next lecture: AIMA Sections 13.3–13.5
