Artificial Intelligence: Methods and Applications
Lecture 8: Review of Probability Theory Juan Carlos Nieves Sánchez November 28, 2014
Review of Probability Theory 3
Outline
- Probability Axioms
- Independence
- Bayes' rule
- Inference Using Full Joint Distributions
What is probability theory?
Probability theory deals with mathematical models of random phenomena. We often use models of randomness to model uncertainty. Uncertainty can have different causes:
- Laziness: it is too difficult or computationally expensive to get to a certain answer.
- Theoretical ignorance: we don't know all the rules that influence the processes we are studying.
- Practical ignorance: we know the rules in principle, but we don't have all the data to apply them.
Random experiments
Mathematical models of randomness are based on the concept of random experiments. Such experiments should have two important properties:
- 1. The experiment must be repeatable.
- 2. Future outcomes cannot be exactly predicted based on previous outcomes, even if we can control all aspects of the experiment.
Examples:
- Coin tossing
- Genetics
Deterministic vs. random models
Deterministic models often give a macroscopic view of random phenomena. They describe an average behavior but ignore local random variations. Examples:
- Water molecules in a river.
- Gas molecules in a heated container.
Lesson to be learned: model at the right level of detail!
Random Variables
The basic element of probability is the random variable. We can think of a random variable as an event with some degree of uncertainty as to whether it occurs. Each random variable has a domain of values it can take on. There are two types of random variables:
- 1. Discrete random variables.
- 2. Continuous random variables.
Examples of Random Variables
A discrete random variable takes values from a finite set of values. For example:
- P(DrinkSize=Small) = 0.1
- P(DrinkSize=Medium) = 0.2
- P(DrinkSize=Large) = 0.7
Note: We will mainly be dealing with discrete random variables. Continuous random variables can take values from the real numbers; e.g., they can take values from the interval [0, 1].
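The drink-size example above can be written as a small Python dictionary (a minimal sketch; the probabilities are the ones from the slide):

```python
# Distribution of the discrete random variable DrinkSize, using the slide's numbers.
drink_size = {"Small": 0.1, "Medium": 0.2, "Large": 0.7}

# A valid discrete distribution assigns each value a probability in [0, 1],
# and the probabilities sum to 1.
assert all(0.0 <= p <= 1.0 for p in drink_size.values())
assert abs(sum(drink_size.values()) - 1.0) < 1e-9

print(drink_size["Medium"])  # → 0.2
```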
Probability
Given a random variable A, P(A) denotes the fraction of possible worlds in which A is true.
[Figure: Venn-style diagram of the event space of all possible worlds, split into the worlds in which A is true — a fraction P(A) of the space — and the worlds in which A is false.]
Key observation
Consider a random experiment for which outcome A sometimes occurs and sometimes doesn't occur.
- Repeat the experiment a large number of times and note, for each repetition, whether A occurs or not.
- Let f_n(A) be the number of times A occurred in the first n experiments.
- Let r_n(A) = f_n(A) / n be the relative frequency of A in the first n experiments.
Key observation: As n → ∞, the relative frequency r_n(A) converges to a real number.
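A short simulation illustrates the key observation (a sketch; the fair coin, the use of Python's random module, and the fixed seed are assumptions, not from the slides): as n grows, the relative frequency of heads settles near 0.5.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def relative_frequency(n):
    """r_n(A): fraction of the first n fair-coin tosses that come up heads."""
    heads = sum(1 for _ in range(n) if random.random() < 0.5)  # f_n(A)
    return heads / n

# The relative frequency fluctuates for small n and stabilizes for large n.
for n in (10, 1000, 100000):
    print(n, relative_frequency(n))
```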
Intuitions about probability
- I. Since 0 ≤ f_n(A) ≤ n, we have 0 ≤ r_n(A) ≤ 1. Thus the probability of A should be in [0, 1].
- II. f_n(∅) = 0 and f_n(Everything) = n. Thus the probability of ∅ should be 0 and the probability of Everything should be 1.
- III. Let B be Everything except A. Then f_n(A) + f_n(B) = n and r_n(A) + r_n(B) = 1. Thus the probability of A plus the probability of B should be 1.
- IV. Let A ⊆ B. Then r_n(A) ≤ r_n(B), and thus the probability of A should be no bigger than that of B.
- V. Let A ∩ B = ∅ and C = A ∪ B. Then r_n(C) = r_n(A) + r_n(B). Thus the probability of C should be the probability of A plus the probability of B.
- VI. Let C = A ∪ B. Then f_n(C) ≤ f_n(A) + f_n(B) and r_n(C) ≤ r_n(A) + r_n(B). Thus the probability of C should be at most the sum of the probabilities of A and B.
- VII. Let C = A ∪ B and D = A ∩ B. Then f_n(C) = f_n(A) + f_n(B) − f_n(D), and thus the probability of C should be the probability of A plus the probability of B minus the probability of D.
The probability space
A probability space is a tuple (Ω, F, P) where:
- Ω is the sample space, the set of all elementary events;
- F is the set of events (for our purposes, we can consider F = 2^Ω);
- P is the probability function.
Note: We often use logical formulas to describe events:
Sunny ∧ ¬Freezing
Kolmogorov’s axioms
Kolmogorov formulated three axioms that the probability function P must satisfy. The rest of probability theory can be built from these axioms.
- 1. A1: For any event A ∈ F, there is a nonnegative real number P(A) ≥ 0.
- 2. A2: P(Ω) = 1.
- 3. A3: Let A1, A2, … be a collection of pairwise disjoint events, and let A be their union. Then P(A) = P(A1) + P(A2) + …
These axioms are often called Kolmogorov’s axioms in honor of the Russian mathematician Andrei Kolmogorov.
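For a finite sample space, the axioms can be checked directly. Below is a small sketch (the weather states and their probabilities are illustrative assumptions, not from the slides) verifying A1 and A2 and the finite additivity of A3:

```python
# Toy distribution over a finite sample space of weather states (illustrative numbers).
P = {"sunny": 0.6, "rain": 0.1, "cloudy": 0.29, "snow": 0.01}

def prob(event):
    """P(event) for an event given as a set of elementary outcomes."""
    return sum(P[w] for w in event)

# A1: every probability is nonnegative.
assert all(p >= 0 for p in P.values())
# A2: the probability of the whole sample space is 1.
assert abs(prob(set(P)) - 1.0) < 1e-9
# A3 (finite case): for disjoint events A and B, P(A ∪ B) = P(A) + P(B).
A, B = {"sunny"}, {"rain", "snow"}
assert abs(prob(A | B) - (prob(A) + prob(B))) < 1e-9
print("axioms hold for this distribution")
```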
Kolmogorov’s axioms
Kolmogorov's axioms express which properties a probability function has to satisfy; however, they do not say how to calculate the probabilities of events.
Flipping coins
Consider the random experiment of flipping a coin two times, one after the other.
Drawing from an urn
Consider the random experiment of drawing two balls, one after the other, from an urn that contains a red (R), a blue (B), and a green (G) ball.
Independent events
The difference between the two examples is that in the first one, the two events are independent while in the second they are not.
Conditional probability
The conditional probability of A given B is defined as P(A | B) = P(A ∧ B) / P(B), provided P(B) > 0.
Flipping coins
What is the probability of the second throw resulting in a head given that the first one results in a head?
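Enumerating the four equally likely outcomes of two tosses gives the answer: 1/2, the same as the unconditional probability, because the tosses are independent. A minimal sketch:

```python
from itertools import product

# Sample space of two coin tosses: (first, second), all four outcomes equally likely.
outcomes = list(product("HT", repeat=2))  # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

first_head = [o for o in outcomes if o[0] == "H"]   # ('H','H'), ('H','T')
both_heads = [o for o in first_head if o[1] == "H"]  # ('H','H')

# P(second = H | first = H) = |first = H and second = H| / |first = H|
p = len(both_heads) / len(first_head)
print(p)  # → 0.5
```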
Drawing from an urn
What is the probability of the second ball being blue given that the first one is red?
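Here the draws are not independent: once the red ball is gone, two balls remain and one of them is blue, so the answer is 1/2. A sketch by enumeration of the ordered draws without replacement:

```python
from itertools import permutations

# All ordered draws of two balls from the urn {R, B, G}, without replacement.
draws = list(permutations("RBG", 2))  # 6 equally likely outcomes

first_red = [d for d in draws if d[0] == "R"]        # ('R','B'), ('R','G')
second_blue = [d for d in first_red if d[1] == "B"]  # ('R','B')

# P(second = B | first = R)
p = len(second_blue) / len(first_red)
print(p)  # → 0.5
```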
The product rule
If we rewrite the definition of conditional probability, we get the product rule.
Conditional probability: P(A | B) = P(A ∧ B) / P(B). Product rule: P(A ∧ B) = P(A | B) P(B).
Bayes' rule
From the product rule, P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A); dividing by P(B) gives Bayes' rule: P(A | B) = P(B | A) P(A) / P(B).
Bayes' Rule Example
Meningitis causes stiff necks with probability 0.5. The prior probability of having meningitis is 0.00002. The prior probability of having a stiff neck is 0.05. What is the probability of having meningitis given that you have a stiff neck?
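Plugging the slide's numbers into Bayes' rule gives P(m | s) = P(s | m) P(m) / P(s) = 0.5 × 0.00002 / 0.05 = 0.0002:

```python
p_s_given_m = 0.5      # P(StiffNeck = true | Meningitis = true)
p_m = 0.00002          # prior P(Meningitis = true)
p_s = 0.05             # prior P(StiffNeck = true)

# Bayes' rule: P(m | s) = P(s | m) * P(m) / P(s)
p_m_given_s = p_s_given_m * p_m / p_s
print(round(p_m_given_s, 6))  # → 0.0002
```

Note how small the posterior stays: even given the symptom, meningitis remains very unlikely because its prior is so low.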
When is Bayes’ Rule Useful?
- Sometimes it's easier to get P(X | Y) than P(Y | X).
- Information is typically available in the form P(effect | cause) rather than P(cause | effect).
- P(effect | cause) quantifies the relationship in the causal direction, whereas P(cause | effect) describes the diagnostic direction.
- For example, P(symptom | disease) is easy to measure empirically, but obtaining P(disease | symptom) is harder.
How is Bayes' Rule Used?
In machine learning, we use Bayes' rule in the following way: P(hypothesis | data) = P(data | hypothesis) P(hypothesis) / P(data), where P(hypothesis | data) is the posterior probability, P(data | hypothesis) is the likelihood of the data, and P(hypothesis) is the prior probability.
Probability distributions
For random variables with finite domains, the probability distribution simply defines the probability of the variable taking on each of its different values. For instance, P(Weather) assigns a probability to each possible state of the weather.
- The bold P indicates that the result is a vector of numbers representing the probabilities of each individual state of the weather, where we assume a predefined ordering.
- Because a probability distribution represents a normalized frequency distribution, the probabilities must sum to 1.
P notation and Conditional Distributions
Possible worlds and full joint distributions
Full Joint Probability Distributions
Toothache  Cavity  Catch   Probability
false      false   false   0.576
false      false   true    0.144
false      true    false   0.008
false      true    true    0.072
true       false   false   0.064
true       false   true    0.016
true       true    false   0.012
true       true    true    0.108
This cell means P(Toothache = true, Cavity = true, Catch = true) = 0.108.
Joint Probability Distribution
Full joint probability distributions are very powerful: they can be used to answer any probabilistic query involving the three random variables.
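A sketch of answering a query from the full joint distribution: to compute, say, P(cavity ∨ toothache), we sum the probabilities of all worlds in which the query proposition holds (the table values are the ones from the slide):

```python
# Full joint distribution P(Toothache, Cavity, Catch),
# keyed by (toothache, cavity, catch).
joint = {
    (False, False, False): 0.576, (False, False, True): 0.144,
    (False, True,  False): 0.008, (False, True,  True): 0.072,
    (True,  False, False): 0.064, (True,  False, True): 0.016,
    (True,  True,  False): 0.012, (True,  True,  True): 0.108,
}

# P(cavity ∨ toothache): add up the worlds where either proposition is true.
p = sum(pr for (t, cav, _), pr in joint.items() if cav or t)
print(round(p, 3))  # → 0.28
```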
Marginalization
We can even calculate marginal probabilities (the probability distribution over a subset of the variables).
The general marginalization rule for any sets of variables Y and Z: P(Y) = Σ_z P(Y, z), where the sum ranges over all possible combinations of values z of the variables in Z.
Normalization
Note that 1/P(Toothache = true) remains constant in the two equations. In fact, 1/P(Toothache = true) can be viewed as a normalization constant for P(Cavity | toothache), ensuring that it adds up to 1.
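A sketch of computing P(Cavity | Toothache = true) by normalization: sum the matching entries of the joint table for each value of Cavity, then divide by their total (which equals P(toothache)) so the conditional distribution sums to 1.

```python
# Full joint distribution keyed by (toothache, cavity, catch), from the earlier table.
joint = {
    (False, False, False): 0.576, (False, False, True): 0.144,
    (False, True,  False): 0.008, (False, True,  True): 0.072,
    (True,  False, False): 0.064, (True,  False, True): 0.016,
    (True,  True,  False): 0.012, (True,  True,  True): 0.108,
}

# Unnormalized values: P(Cavity = v, toothache) for each value v.
unnorm = {v: sum(pr for (t, cav, _), pr in joint.items() if t and cav == v)
          for v in (True, False)}

alpha = 1.0 / sum(unnorm.values())  # 1 / P(toothache): the normalization constant
posterior = {v: alpha * p for v, p in unnorm.items()}

print(round(posterior[True], 3), round(posterior[False], 3))  # → 0.6 0.4
```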
General Inference Procedure
- Let X be the query variable (Cavity in our example).
- Let E be the set of evidence variables (just Toothache in this case).
- Let e be the observed values for them.
- Let Y be the remaining unobserved variables (just Catch in our example).
Then the query P(X | e) can be evaluated as P(X | e) = α Σ_y P(X, e, y), where α is the normalization constant and the summation is over all possible y's (i.e., all possible combinations of values of the unobserved variables Y).
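The procedure above can be sketched as a small enumeration routine over the joint table (a sketch for Boolean variables only; the variable ordering of the tuple keys is an assumption of this implementation):

```python
# Full joint distribution P(Toothache, Cavity, Catch), keyed by
# (toothache, cavity, catch) — the table from the earlier slide.
joint = {
    (False, False, False): 0.576, (False, False, True): 0.144,
    (False, True,  False): 0.008, (False, True,  True): 0.072,
    (True,  False, False): 0.064, (True,  False, True): 0.016,
    (True,  True,  False): 0.012, (True,  True,  True): 0.108,
}
VARS = ("Toothache", "Cavity", "Catch")  # order of the tuple keys

def enumerate_query(X, evidence):
    """Return P(X | evidence) for a Boolean query variable X by summing out
    the unobserved variables Y and then normalizing."""
    dist = {}
    for x in (True, False):
        total = 0.0
        for world, pr in joint.items():
            a = dict(zip(VARS, world))
            if a[X] == x and all(a[e] == v for e, v in evidence.items()):
                total += pr  # sums over all values of the unobserved variables
        dist[x] = total
    alpha = 1.0 / sum(dist.values())  # normalization constant alpha = 1 / P(e)
    return {x: alpha * p for x, p in dist.items()}

result = enumerate_query("Cavity", {"Toothache": True})
print(result)  # ≈ {True: 0.6, False: 0.4}
```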
Sources of this Lecture
- S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Third Edition.