

SLIDE 1

Artificial Intelligence: Methods and Applications

Lecture 6: Probability theory

Henrik Björklund
Umeå University
4 December 2012

What is probability theory?

Probability theory deals with mathematical models of random phenomena. We often use models of randomness to model uncertainty. Uncertainty can have different causes:

◮ Laziness: it is too difficult or computationally expensive to get to a certain answer
◮ Theoretical ignorance: We don't know all the rules that influence the processes we are studying
◮ Practical ignorance: We know the rules in principle, but we don't have all the data to apply them

SLIDE 2

Random experiments

Mathematical models of randomness are based on the concept of random experiments. Such experiments should have two important properties:

1. The experiment must be repeatable
2. Future outcomes cannot be exactly predicted based on previous outcomes, even if we can control all aspects of the experiment

Examples:

◮ Coin tossing
◮ Quality control
◮ Genetics

Deterministic vs. random models

Deterministic models often give a macroscopic view of random phenomena. They describe an average behavior but ignore local random variations. Examples:

◮ Water molecules in a river
◮ Gas molecules in a heated container

Lesson to be learned: Model on the right level of detail!

SLIDE 3

Key observation

Consider a random experiment for which outcome A sometimes occurs and sometimes doesn’t occur.

◮ Repeat the experiment a large number of times and note, for each repetition, whether A occurs or not
◮ Let fn(A) be the number of times A occurred in the first n experiments
◮ Let rn(A) = fn(A)/n be the relative frequency of A in the first n experiments

Key observation: As n → ∞, the relative frequency rn(A) converges to a real number.
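A small simulation makes the observation concrete. The sketch below (assuming a fair coin, so the limiting value should be 0.5; the function name is ours, not from the lecture) estimates rn(A) for growing n:

```python
import random

def relative_frequency(n_trials, p=0.5):
    """Simulate n_trials independent experiments in which outcome A
    has probability p, and return r_n(A) = f_n(A) / n."""
    f_n = sum(1 for _ in range(n_trials) if random.random() < p)
    return f_n / n_trials

# The relative frequency settles near 0.5 as n grows.
for n in (10, 100, 10_000, 1_000_000):
    print(n, relative_frequency(n))
```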

Kolmogorov

The consequences of the key observation were axiomatized by the Russian mathematician Andrey Kolmogorov (1903-1987) in his book Grundbegriffe der Wahrscheinlichkeitsrechnung (1933).

SLIDE 4

Intuitions about probability

i. Since 0 ≤ fn(A) ≤ n, we have 0 ≤ rn(A) ≤ 1. Thus the probability of A should be in [0, 1].
ii. fn(∅) = 0 and fn(Everything) = n. Thus the probability of ∅ should be 0 and the probability of Everything should be 1.
iii. Let B be Everything except A. Then fn(A) + fn(B) = n and rn(A) + rn(B) = 1. Thus the probability of A plus the probability of B should be 1.
iv. Let A ⊆ B. Then rn(A) ≤ rn(B), and thus the probability of A should be no bigger than that of B.
v. Let A ∩ B = ∅ and C = A ∪ B. Then rn(C) = rn(A) + rn(B). Thus the probability of C should be the probability of A plus the probability of B.
vi. Let C = A ∪ B. Then fn(C) ≤ fn(A) + fn(B) and rn(C) ≤ rn(A) + rn(B). Thus the probability of C should be at most the sum of the probabilities of A and B.
vii. Let C = A ∪ B and D = A ∩ B. Then fn(C) = fn(A) + fn(B) − fn(D), and thus the probability of C should be the probability of A plus the probability of B minus the probability of D.

The probability space

Definition

A probability space is a tuple (Ω, F, P) where

◮ Ω is the sample space or set of all elementary events
◮ F is the set of events (for our purposes, we can consider F = P(Ω))
◮ P : F → R is the probability function

Note: We often use logical formulas to describe events: Sunny ∧ ¬Freezing
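For a finite Ω with F = P(Ω), such a space is easy to sketch in code. The following is a minimal, hypothetical encoding (our own, using the two-coin-flip space that appears later in the lecture): P is stored on the elementary events and extended to arbitrary events by summation.

```python
from fractions import Fraction

# Sample space Omega and the probabilities of its elementary events.
omega = {"HH", "HT", "TH", "TT"}
p_elem = {w: Fraction(1, 4) for w in omega}

def P(event):
    """Probability of an event A ⊆ Omega: the sum over its elementary events."""
    assert event <= omega, "an event must be a subset of the sample space"
    return sum((p_elem[w] for w in event), Fraction(0))

print(P(set()))          # 0
print(P(omega))          # 1
print(P({"HH", "HT"}))   # 1/2
```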

SLIDE 5

Kolmogorov’s axioms

Kolmogorov formulated three axioms that the probability function P must satisfy. The rest of probability theory can be built from these axioms.

A1 For any A ∈ F, there is a nonnegative real number P(A)
A2 P(Ω) = 1
A3 Let {An | n ≥ 1} be a collection of pairwise disjoint events and let A = ⋃n≥1 An be their union. Then P(A) = Σn≥1 P(An)

Intuition i: ∀A : P(A) ∈ [0, 1]

P(A) ≥ 0  (A1)
Let B = Ω \ A
P(B) ≥ 0  (A1)
P(A ∪ B) = P(A) + P(B)  (A3)
P(A ∪ B) = P(Ω) = 1  (A2)
P(A) ≤ 1

SLIDE 6

Intuition ii: P(∅) = 0 and P(Ω) = 1

P(Ω) = 1  (A2)
P(∅ ∪ Ω) = P(∅) + P(Ω)  (A3)
0 ≤ P(∅ ∪ Ω) ≤ 1  (i)
P(∅) = 0

Intuition vi: Let C = A ∪ B. Then P(C) ≤ P(A) + P(B).

Let A′ = A \ B, B′ = B \ A, D = A ∩ B
0 ≤ P(A′), P(B′), P(D) ≤ 1  (i)
P(A) = P(A′ ∪ D) = P(A′) + P(D)  (A3)
P(B) = P(B′ ∪ D) = P(B′) + P(D)  (A3)
P(A) + P(B) = P(A′) + P(B′) + 2 · P(D)
P(C) = P(A′ ∪ B′ ∪ D) = P(A′) + P(B′) + P(D)  (A3)
P(C) ≤ P(A) + P(B)

SLIDE 7

Flipping coins

Example

Consider the random experiment of flipping a coin two times, one after the other.

We have Ω = {HH, HT, TH, TT} and P({HH}) = P({HT}) = P({TH}) = P({TT}) = 1/4.

◮ Let H1 = {HH, HT} = the first flip results in a head
◮ Let H2 = {HH, TH} = the second flip results in a head

We have

◮ P(H1) = P({HH}) + P({HT}) = 1/2
◮ P(H2) = P({HH}) + P({TH}) = 1/2
◮ P(H1 ∩ H2) = P({HH}) = 1/4 = P(H1) · P(H2)

Drawing from an urn

Example

Consider the random experiment of drawing two balls, one after the other, from an urn that contains a red, a blue, and a green ball.

We have Ω = {RB, RG, BR, BG, GR, GB} and P({RB}) = P({RG}) = P({BR}) = P({BG}) = P({GR}) = P({GB}) = 1/6.

◮ Let R1 = {RB, RG} = the first ball is red
◮ Let B2 = {RB, GB} = the second ball is blue

We have

◮ P(R1) = P({RB}) + P({RG}) = 1/3
◮ P(B2) = P({RB}) + P({GB}) = 1/3
◮ P(R1 ∩ B2) = P({RB}) = 1/6 ≠ P(R1) · P(B2) = 1/9
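Both computations can be checked mechanically. A sketch (the helper name is ours): enumerate each sample space, then compare P(A ∩ B) with P(A) · P(B).

```python
from fractions import Fraction

def check_product(p_elem, A, B, label):
    """Compare P(A ∩ B) with P(A) · P(B) for events given as sets."""
    P = lambda e: sum((p_elem[w] for w in e), Fraction(0))
    print(f"{label}: P(A∩B) = {P(A & B)}, P(A)·P(B) = {P(A) * P(B)}")

# Two coin flips: the two values agree (1/4 and 1/4).
coins = {w: Fraction(1, 4) for w in ("HH", "HT", "TH", "TT")}
check_product(coins, {"HH", "HT"}, {"HH", "TH"}, "coins")

# Urn without replacement: they differ (1/6 vs 1/9).
urn = {w: Fraction(1, 6) for w in ("RB", "RG", "BR", "BG", "GR", "GB")}
check_product(urn, {"RB", "RG"}, {"RB", "GB"}, "urn")
```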

SLIDE 8

Independent events

The difference between the two examples is that in the first one, the two events are independent while in the second they are not.

Definition

Events A and B are independent if P(A ∩ B) = P(A) · P(B).

Conditional probability

Definition

Let A and B be events, with P(B) > 0. The conditional probability P(A|B) of A given B is given by

P(A|B) = P(A ∩ B) / P(B)

Notice: If B = Ω, then P(A|B) = P(A ∩ Ω) / P(Ω) = P(A) / P(Ω) = P(A) / 1 = P(A).
Notice: If A and B are independent, then P(A|B) = P(A ∩ B) / P(B) = (P(A) · P(B)) / P(B) = P(A).
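The definition translates directly into code. A minimal sketch (the function name is ours), reusing the urn space from above:

```python
from fractions import Fraction

def conditional(p_elem, A, B):
    """P(A|B) = P(A ∩ B) / P(B); only defined when P(B) > 0."""
    P = lambda e: sum((p_elem[w] for w in e), Fraction(0))
    if P(B) == 0:
        raise ValueError("P(A|B) is undefined when P(B) = 0")
    return P(A & B) / P(B)

urn = {w: Fraction(1, 6) for w in ("RB", "RG", "BR", "BG", "GR", "GB")}
print(conditional(urn, {"RB", "GB"}, {"RB", "RG"}))  # P(B2|R1) = 1/2
```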

SLIDE 9

Flipping coins

Example

What is the probability of the second flip resulting in a head given that the first one results in a head?

P(H2|H1) = P(H2 ∩ H1) / P(H1) = P({HH}) / P({HH, HT}) = (1/4) / (1/2) = 1/2 = P(H2)

Drawing from an urn

Example

What is the probability of the second ball being blue given that the first one is red?

P(B2|R1) = P(B2 ∩ R1) / P(R1) = P({RB}) / P({RB, RG}) = (1/6) / (1/3) = 1/2 ≠ P(B2) = 1/3

SLIDE 10

The product rule

If we rewrite the definition of conditional probability, we get the product rule.

Conditional probability: P(A|B) = P(A ∩ B) / P(B)
Product rule: P(A ∩ B) = P(A|B) · P(B)

Bayes’ rule

The product rule can be written in two different ways:

P(A ∩ B) = P(A|B) · P(B)
P(A ∩ B) = P(B|A) · P(A)

Equating the two right-hand sides, we get P(B|A) · P(A) = P(A|B) · P(B). Dividing by P(A) gives Bayes' rule:

P(B|A) = P(A|B) · P(B) / P(A)
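A quick numeric sanity check of the rule on the urn example (the values are those computed on the earlier slides; the variable names are ours):

```python
from fractions import Fraction

p_R1 = Fraction(1, 3)            # P(first ball is red)
p_B2 = Fraction(1, 3)            # P(second ball is blue)
p_B2_given_R1 = Fraction(1, 2)   # P(B2 | R1), computed on an earlier slide

# Bayes' rule: P(R1 | B2) = P(B2 | R1) · P(R1) / P(B2)
p_R1_given_B2 = p_B2_given_R1 * p_R1 / p_B2
print(p_R1_given_B2)  # 1/2, matching P(R1 ∩ B2) / P(B2) = (1/6) / (1/3)
```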

SLIDE 11

Thomas Bayes

Thomas Bayes (1701-1761) was an English mathematician and Presbyterian minister. His most important results were published after his death.

Example

Consider the random experiment of throwing three dice. We have 6³ = 216 elementary events:

Ω = {1, . . . , 6} × {1, . . . , 6} × {1, . . . , 6}

One benefit of this sample space is that the elementary events all have the same probability: P({(1, 3, 5)}) = P({(2, 2, 6)}) = 1/216. We may, however, be more interested in the number of eyes than the precise result of each die. There are 16 such outcomes (3 being the lowest and 18 being the highest). These outcomes are not, however, equally probable.

◮ The probability of having 3 eyes showing is P({(1, 1, 1)}) = 1/216
◮ The probability of having 4 eyes showing is P({(1, 1, 2), (1, 2, 1), (2, 1, 1)}) = 3/216 = 1/72

SLIDE 12

Example

For the number of eyes, we introduce the random variable Eyes. Eyes can take values from the domain {3, . . . , 18}. We can now talk about the probability of Eyes taking on certain values:

◮ P(Eyes = 3) = P({(1, 1, 1)}) = 1/216
◮ P(Eyes = 4) = P({(1, 1, 2), (1, 2, 1), (2, 1, 1)}) = 1/72
◮ P(Eyes = 5) = 1/36
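The whole distribution of Eyes can be found by enumerating all 216 elementary events. A sketch using only the standard library:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count, for each possible number of eyes, how many of the 6^3 = 216
# equally likely elementary events produce it.
counts = Counter(sum(dice) for dice in product(range(1, 7), repeat=3))
p_eyes = {k: Fraction(counts[k], 216) for k in sorted(counts)}

print(p_eyes[3], p_eyes[4], p_eyes[5])  # 1/216 1/72 1/36
```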

Example

Consider our urn example again.

◮ Let B1 be a random variable that takes values from {red, blue, green} and represents the color of the first ball
◮ Let B2 represent the color of the second ball

P(B2 = blue | B1 = red) = P({RB, GB} | {RB, RG}) = P({RB}) / P({RB, RG}) = (1/6) / (1/3) = 1/2

SLIDE 13

Probability distributions

For random variables with finite domains, the probability distribution simply defines the probability of the variable taking on each of the different values:

P(Eyes) = ⟨1/216, 1/72, 1/36, . . . , 1/36, 1/72, 1/216⟩

We can also talk about the probability distribution of conditional probabilities. With rows indexed by the value of B1 and columns by the value of B2 (in the order red, blue, green), the diagonal entries are 0, since the two balls cannot have the same color:

P(B2 | B1):
            B2=red  B2=blue  B2=green
B1=red        0      1/2      1/2
B1=blue      1/2      0       1/2
B1=green     1/2     1/2       0

Continuous distributions

For random variables with continuous domains, we cannot simply list the probabilities of all outcomes. Instead, we use probability density functions:

P(x) = lim dx→0 P(x ≤ X ≤ x + dx) / dx

SLIDE 14

Joint probability distributions

We also need to be able to talk about distributions over multiple variables. For the urn example (rows indexed by B1, columns by B2, in the order red, blue, green), the off-diagonal entries are 1/6 and the diagonal entries are 0:

P(B1, B2):
            B2=red  B2=blue  B2=green
B1=red        0      1/6      1/6
B1=blue      1/6      0       1/6
B1=green     1/6     1/6       0

We can also use rules, such as the product rule, over distributions:

P(B1, B2) = P(B1 | B2) · P(B2)

This notation summarizes 9 equations of the form P(B1 = red, B2 = blue) = P(B1 = red | B2 = blue) · P(B2 = blue).
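All nine equations can be verified in one loop. A sketch (the joint values, with 0 on the diagonal, follow from the urn experiment; the variable names are ours):

```python
from fractions import Fraction

colors = ("red", "blue", "green")

# Joint distribution P(B1, B2): 0 when both balls have the same color,
# 1/6 otherwise.
joint = {(c1, c2): Fraction(0) if c1 == c2 else Fraction(1, 6)
         for c1 in colors for c2 in colors}

for c1 in colors:
    for c2 in colors:
        p_b2 = sum(joint[(c, c2)] for c in colors)       # P(B2 = c2)
        p_b1_given_b2 = joint[(c1, c2)] / p_b2           # P(B1 = c1 | B2 = c2)
        assert joint[(c1, c2)] == p_b1_given_b2 * p_b2   # product rule holds
```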

Possible worlds and full joint distributions

Assigning a value to each of the random variables in our model defines a possible world. Let X1, . . . , Xk be all the random variables of our model. Then the full joint probability distribution P(X1, . . . , Xk) completely determines the probability of each possible world and thus the whole probability model.

The probability of a logical proposition is the sum of the probabilities of all the possible worlds in which the proposition is true.

SLIDE 15

Example

In the following example, we have three random variables:

◮ The Boolean variable Alarm (domain {true, false})
◮ The Boolean variable Moving (domain {true, false})
◮ The variable Engine (domain {off, working, broken})

The following is the full joint distribution P(Alarm, Moving, Engine):

                      alarm                ¬alarm
                  moving  ¬moving     moving  ¬moving
Engine = off       1/81     4/81       3/81    13/81
Engine = working   6/81     1/81      31/81     1/81
Engine = broken    2/81    13/81       2/81     4/81

For instance, the probability of the proposition alarm ∨ Engine = off is

P(alarm ∨ Engine = off) = P(alarm) + P(Engine = off) − P(alarm ∧ Engine = off)
= (1 + 6 + 2 + 4 + 1 + 13)/81 + (1 + 4 + 3 + 13)/81 − (1 + 4)/81 = 43/81
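The same computation can be done by brute force: store the table and sum the probabilities of the possible worlds in which the proposition is true. A sketch (the encoding of worlds as (alarm, moving, engine) tuples is ours):

```python
from fractions import Fraction

# Full joint distribution P(Alarm, Moving, Engine); numerators over 81,
# transcribed from the table above.
joint = {
    (True,  True,  "off"):      1,  (True,  False, "off"):      4,
    (False, True,  "off"):      3,  (False, False, "off"):     13,
    (True,  True,  "working"):  6,  (True,  False, "working"):  1,
    (False, True,  "working"): 31,  (False, False, "working"):  1,
    (True,  True,  "broken"):   2,  (True,  False, "broken"):  13,
    (False, True,  "broken"):   2,  (False, False, "broken"):   4,
}
joint = {w: Fraction(n, 81) for w, n in joint.items()}

def P(prop):
    """Probability of a proposition: sum over the worlds where it holds."""
    return sum(p for (alarm, moving, engine), p in joint.items()
               if prop(alarm, moving, engine))

print(P(lambda a, m, e: a or e == "off"))  # 43/81
```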

Example

                      alarm                ¬alarm
                  moving  ¬moving     moving  ¬moving
Engine = off       1/81     4/81       3/81    13/81
Engine = working   6/81     1/81      31/81     1/81
Engine = broken    2/81    13/81       2/81     4/81

Once the alarm goes off, what is the probability of the engine being broken?

P(broken | alarm) = P(broken ∧ alarm) / P(alarm) = (15/81) / (27/81) = 15/27 = 5/9

And the probability of the engine not being broken:

P(working ∨ off | alarm) = P((working ∨ off) ∧ alarm) / P(alarm) = (12/81) / (27/81) = 12/27 = 4/9

SLIDE 16

Normalization

Example

In both the above computations, the factor 1/P(alarm) was a constant: 81/27 = 3. Call this constant α.

P(Engine | alarm) = α · P(Engine, alarm)
= α · (P(Engine, alarm, moving) + P(Engine, alarm, ¬moving))
= α · (⟨1/81, 6/81, 2/81⟩ + ⟨4/81, 1/81, 13/81⟩)
= α · ⟨5/81, 7/81, 15/81⟩
= ⟨15/81, 21/81, 45/81⟩

Notice the second to last result: α · ⟨5/81, 7/81, 15/81⟩. We know that the elements of P(Engine | alarm) must sum to 1. This means that we don't have to know the value of α, and thus we never need P(alarm)!
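In code, the trick amounts to dividing an unnormalized vector by its own sum. A sketch (variable names are ours; the numerators come from the joint table above):

```python
from fractions import Fraction

# Unnormalized values P(Engine = e, alarm), summed over Moving
# (numerators over 81, taken from the joint table).
unnormalized = {
    "off":     Fraction(1 + 4, 81),
    "working": Fraction(6 + 1, 81),
    "broken":  Fraction(2 + 13, 81),
}

# Normalize: divide by the sum so the entries add up to 1.
# P(alarm) itself is never computed explicitly.
total = sum(unnormalized.values())
posterior = {e: p / total for e, p in unnormalized.items()}

for e, p in posterior.items():
    print(e, p)  # off 5/27, working 7/27, broken 5/9
```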