Artificial Intelligence: Methods and Applications

Lecture 8: Review of Probability Theory Juan Carlos Nieves Sánchez November 28, 2014


Review of Probability Theory 3

Outline

  • Probability Axioms
  • Independence
  • Bayes’ rule
  • Inference Using Full Joint Distributions


What is probability theory?

Probability theory deals with mathematical models of random phenomena. We often use models of randomness to model uncertainty. Uncertainty can have different causes:

  • Laziness: it is too difficult or computationally expensive to get to a certain answer.
  • Theoretical ignorance: we don’t know all the rules that influence the processes we are studying.
  • Practical ignorance: we know the rules in principle, but we don’t have all the data to apply them.



Random experiments

Mathematical models of randomness are based on the concept of random experiments. Such experiments should have two important properties:

  • 1. The experiment must be repeatable.
  • 2. Future outcomes cannot be exactly predicted based on previous outcomes, even if we can control all aspects of the experiment.

Examples:

  • Coin tossing
  • Genetics



Deterministic vs. random models

Deterministic models often give a macroscopic view of random phenomena. They describe an average behavior but ignore local random variations. Examples:

  • Water molecules in a river.
  • Gas molecules in a heated container.

Lesson to be learned: Model on the right level of detail!



Random Variables

The basic element of probability is the random variable. We can think of a random variable as an event with some degree of uncertainty as to whether it occurs. Each random variable has a domain of values it can take on. There are two types of random variables:

  • 1. Discrete random variables.
  • 2. Continuous random variables.



Examples of Random Variables

A discrete random variable takes values from a finite set of values. For example:

  • P(DrinkSize=Small) = 0.1
  • P(DrinkSize=Medium) = 0.2
  • P(DrinkSize=Large) = 0.7


Note: We will mainly be dealing with discrete random variables. Continuous random variables can take values from the real numbers, e.g., from the interval [0, 1].


Probability

Given a random variable A, P(A) denotes the fraction of possible worlds in which A is true.


[Figure: the event space of all possible worlds, divided into the worlds in which A is true (a region of area P(A)) and the worlds in which A is false.]


Key observation

Consider a random experiment for which outcome A sometimes occurs and sometimes doesn’t occur.

  • Repeat the experiment a large number of times and note, for each repetition, whether A occurs or not.
  • Let f_n(A) be the number of times A occurred in the first n experiments.
  • Let r_n(A) = f_n(A) / n be the relative frequency of A in the first n experiments.

Key observation: As n → ∞, the relative frequency r_n(A) converges to a real number.
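This key observation can be illustrated with a small simulation. The sketch below is not from the lecture; it assumes a fair coin, with A being the event “the coin shows heads”:

```python
import random

def relative_frequency(n, seed=0):
    """r_n(A) for the event A = "the coin shows heads" in n fair-coin tosses."""
    rng = random.Random(seed)
    f_n = sum(rng.random() < 0.5 for _ in range(n))  # f_n(A): times A occurred
    return f_n / n                                   # r_n(A) = f_n(A) / n

# As n grows, r_n(A) settles near P(A) = 0.5.
for n in (10, 1000, 100_000):
    print(n, relative_frequency(n))
```

With a fixed seed the run is reproducible; the exact values printed depend on the random number generator, but the drift toward 0.5 is the point.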



Intuitions about probability

  • I. Since 0 ≤ f_n(A) ≤ n, we have 0 ≤ r_n(A) ≤ 1. Thus the probability of A should be in [0, 1].
  • II. f_n(∅) = 0 and f_n(Everything) = n. Thus the probability of ∅ should be 0 and the probability of Everything should be 1.
  • III. Let B be Everything except A. Then f_n(A) + f_n(B) = n and r_n(A) + r_n(B) = 1. Thus the probability of A plus the probability of B should be 1.
  • IV. Let A ⊆ B. Then r_n(A) ≤ r_n(B), and thus the probability of A should be no bigger than that of B.
  • V. Let A ∩ B = ∅ and C = A ∪ B. Then r_n(C) = r_n(A) + r_n(B). Thus the probability of C should be the probability of A plus the probability of B.
  • VI. Let C = A ∪ B. Then f_n(C) ≤ f_n(A) + f_n(B) and r_n(C) ≤ r_n(A) + r_n(B). Thus the probability of C should be at most the sum of the probabilities of A and B.
  • VII. Let C = A ∪ B and D = A ∩ B. Then f_n(C) = f_n(A) + f_n(B) − f_n(D), and thus the probability of C should be the probability of A plus the probability of B minus the probability of D.



The probability space

A probability space is a tuple (Ω, F, P) where:

  • Ω is the sample space, or set of all elementary events
  • F is the set of events (for our purposes, we can consider F = 2^Ω, the set of all subsets of Ω)
  • P is the probability function

Note: We often use logical formulas to describe events: Sunny ∧ ¬Freezing.


Kolmogorov’s axioms

Kolmogorov formulated three axioms that the probability function P must satisfy. The rest of probability theory can be built from these axioms.

  • 1. A1: For any event A, P(A) is a nonnegative real number: P(A) ≥ 0.
  • 2. A2: P(Ω) = 1.
  • 3. A3: Let A₁, A₂, … be a collection of pairwise disjoint events and let A be their union. Then P(A) = Σᵢ P(Aᵢ).

These axioms are often called Kolmogorov’s axioms in honor of the Russian mathematician Andrei Kolmogorov.
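For a finite sample space the axioms can be checked mechanically. A minimal sketch, reusing the DrinkSize distribution from the earlier example:

```python
# P over the sample space of drink sizes (values from the earlier slide).
P = {"Small": 0.1, "Medium": 0.2, "Large": 0.7}

# A1: every event has a nonnegative probability.
assert all(p >= 0 for p in P.values())

# A2: the probability of the whole sample space is 1.
assert abs(sum(P.values()) - 1.0) < 1e-9

# A3 (finite case): for disjoint events, the probability of the union
# is the sum of the probabilities.
p_not_small = P["Medium"] + P["Large"]   # union of two disjoint events
assert abs(p_not_small - (1 - P["Small"])) < 1e-9
print(round(p_not_small, 1))  # 0.9
```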


Kolmogorov’s axioms express which properties a probability function has to satisfy; however, they do not say how to calculate the probabilities of events.


Flipping coins


Consider the random experiment of flipping a coin two times, one after the other.
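As a sketch, the sample space of this experiment can be enumerated directly (assuming a fair coin, so all four outcomes are equally likely):

```python
from itertools import product

# Sample space of two successive flips: HH, HT, TH, TT.
omega = list(product("HT", repeat=2))
P = {outcome: 1 / len(omega) for outcome in omega}  # equally likely outcomes

# Probability of the elementary event "heads on both flips".
p_hh = P[("H", "H")]
print(p_hh)  # 0.25
```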


Drawing from an urn

Consider the random experiment of drawing two balls, one after the other, from an urn that contains a red (R), a blue (B), and a green (G) ball.



Independent events

The difference between the two examples is that in the first one, the two events are independent while in the second they are not.
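The difference can be verified numerically by checking whether P(A ∩ B) = P(A) · P(B) in each experiment. A sketch (the particular events chosen are illustrative):

```python
from itertools import product, permutations

def prob(event, omega):
    """Probability of an event (a predicate on outcomes) when all
    outcomes in the sample space omega are equally likely."""
    return sum(1 for o in omega if event(o)) / len(omega)

# Two coin flips: the events "first is heads" and "second is heads".
coins = list(product("HT", repeat=2))
first_h = lambda o: o[0] == "H"
second_h = lambda o: o[1] == "H"
both = lambda o: first_h(o) and second_h(o)
print(prob(both, coins) == prob(first_h, coins) * prob(second_h, coins))  # True

# Two draws without replacement from an urn {R, B, G}:
# the events "first is red" and "second is blue".
urn = list(permutations("RBG", 2))   # 6 equally likely ordered draws
first_r = lambda o: o[0] == "R"
second_b = lambda o: o[1] == "B"
both2 = lambda o: first_r(o) and second_b(o)
print(prob(both2, urn) == prob(first_r, urn) * prob(second_b, urn))  # False
```

The product rule holds with equality for the coins (independent) but fails for the urn (dependent), matching the slide's claim.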



Conditional probability

The conditional probability of A given B is defined as P(A | B) = P(A ∧ B) / P(B), provided that P(B) > 0.


Flipping coins

What is the probability of the second throw resulting in a head given that the first one results in a head?



Drawing from an urn

What is the probability of the second ball being blue given that the first one is red?
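Applying the definition of conditional probability, P(A | B) = P(A ∧ B) / P(B), to the three-ball urn; a sketch:

```python
from itertools import permutations

urn = list(permutations("RBG", 2))   # 6 equally likely ordered draws

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return sum(1 for o in urn if event(o)) / len(urn)

p_first_red = prob(lambda o: o[0] == "R")             # P(first is red) = 1/3
p_both = prob(lambda o: o[0] == "R" and o[1] == "B")  # P(red then blue) = 1/6
p_second_blue_given_first_red = p_both / p_first_red
print(p_second_blue_given_first_red)  # 0.5
```

After the red ball is removed, only blue and green remain, so the answer 1/2 also matches direct counting.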



The product rule

If we rewrite the definition of conditional probability, we get the product rule.


Conditional probability: P(A | B) = P(A ∧ B) / P(B). Product rule: P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A).


Bayes’ rule

Combining the two forms of the product rule gives Bayes’ rule: P(A | B) = P(B | A) P(A) / P(B).


Bayes’ Rule Example

Meningitis causes stiff necks with probability 0.5. The prior probability of having meningitis is 0.00002. The prior probability of having a stiff neck is 0.05. What is the probability of having meningitis given that you have a stiff neck?
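The answer follows directly from Bayes’ rule, P(m | s) = P(s | m) P(m) / P(s), with the numbers given above:

```python
p_s_given_m = 0.5      # P(s | m): stiff neck given meningitis
p_m = 0.00002          # P(m): prior probability of meningitis
p_s = 0.05             # P(s): prior probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # ≈ 0.0002, i.e. about 1 in 5000
```

Even given a stiff neck, meningitis remains very unlikely, because its prior probability is so small.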



When is Bayes’ Rule Useful?

  • Sometimes it’s easier to get P(X | Y) than P(Y | X).
  • Information is typically available in the form P(effect | cause) rather than P(cause | effect).
  • P(effect | cause) quantifies the relationship in the causal direction, whereas P(cause | effect) describes the diagnostic direction.
  • For example, P(symptom | disease) is easy to measure empirically, but obtaining P(disease | symptom) is harder.



How is Bayes’ Rule Used

In machine learning, we use Bayes’ rule in the following way:

P(hypothesis | data) = P(data | hypothesis) P(hypothesis) / P(data)

where P(hypothesis | data) is the posterior probability, P(data | hypothesis) is the likelihood of the data, and P(hypothesis) is the prior probability.


Probability distributions

For random variables with finite domains, the probability distribution simply defines the probability of the variable taking on each of its different values. For instance, P(Weather) denotes the vector of probabilities of the values of the random variable Weather.

  • The bold P indicates that the result is a vector of numbers representing the probabilities of each individual state of the weather, where we assume a predefined ordering of the domain.
  • Because a probability distribution represents a normalized frequency distribution, the probabilities must sum to 1.
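As a sketch, with invented illustrative values (the actual numbers on the slide are not in the text) and an assumed ordering (sunny, rain, cloudy, snow):

```python
# P(Weather): a vector of probabilities, one entry per value of the
# variable, under a predefined ordering of its domain.
weather_domain = ("sunny", "rain", "cloudy", "snow")
P_weather = (0.6, 0.1, 0.29, 0.01)   # illustrative values, not from the slide

# A probability distribution is a normalized frequency distribution:
assert abs(sum(P_weather) - 1.0) < 1e-9
print(dict(zip(weather_domain, P_weather)))
```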


P notation and Conditional Distributions

The notation P(X | Y) denotes the table of conditional distributions P(X = x | Y = y): one distribution over X for each possible value y of Y.


Possible worlds and full joint distributions

A possible world is an assignment of values to all of the random variables under consideration. The full joint probability distribution assigns a probability to every possible world.


Full Joint Probability Distributions

  Toothache | Cavity | Catch | Probability
  ----------+--------+-------+------------
  false     | false  | false | 0.576
  false     | false  | true  | 0.144
  false     | true   | false | 0.008
  false     | true   | true  | 0.072
  true      | false  | false | 0.064
  true      | false  | true  | 0.016
  true      | true   | false | 0.012
  true      | true   | true  | 0.108

The last cell means P(Toothache = true, Cavity = true, Catch = true) = 0.108.


Joint Probability Distribution


Full joint probability distributions are very powerful: they can be used to answer any probabilistic query involving the three random variables.
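As a sketch, the full joint table can be stored as a dictionary over possible worlds and queried by summing the probabilities of the worlds where the event holds (here, P(cavity ∨ toothache)):

```python
from itertools import product

# Full joint distribution P(Toothache, Cavity, Catch), in the row order
# of the table above; keys are (toothache, cavity, catch).
probs = [0.576, 0.144, 0.008, 0.072, 0.064, 0.016, 0.012, 0.108]
joint = dict(zip(product([False, True], repeat=3), probs))
assert abs(sum(joint.values()) - 1.0) < 1e-9  # the worlds exhaust the space

# Any query is a sum over the possible worlds where the event holds.
p_cav_or_tooth = sum(p for (tooth, cav, catch), p in joint.items() if cav or tooth)
print(round(p_cav_or_tooth, 2))  # 0.28
```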


Marginalization

We can even calculate marginal probabilities (the probability distribution over a subset of the variables)


The general marginalization rule for any sets of variables Y and Z: P(Y) = Σ_z P(Y, z), where the sum is over all possible combinations of values z of Z.
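The rule can be applied directly to the full joint table from the earlier slide; a sketch computing the marginal P(Cavity) by summing out Toothache and Catch:

```python
from itertools import product

# Full joint P(Toothache, Cavity, Catch); keys are (toothache, cavity, catch).
probs = [0.576, 0.144, 0.008, 0.072, 0.064, 0.016, 0.012, 0.108]
joint = dict(zip(product([False, True], repeat=3), probs))

# Marginalization: P(Cavity = c) = sum over all values z of the
# remaining variables of P(c, z).
P_cavity = {c: sum(p for (t, cav, ca), p in joint.items() if cav == c)
            for c in (False, True)}
print(round(P_cavity[True], 2), round(P_cavity[False], 2))  # 0.2 0.8
```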

Normalization


Note that 1/P(Toothache = true) remains constant in the two equations. In fact, 1/P(Toothache = true) can be viewed as a normalization constant for P(Cavity | toothache), ensuring that it adds up to 1.


General Inference Procedure

  • Let X be the query variable (Cavity in our example).
  • Let E be the set of evidence variables (just Toothache in this case).
  • Let e be the observed values for them.
  • Let Y be the remaining unobserved variables (just Catch in our example).

Then the query P(X | e) can be evaluated as P(X | e) = α P(X, e) = α Σ_y P(X, e, y), where the summation is over all possible combinations of values y of the unobserved variables Y, and α is a normalization constant.
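This procedure can be sketched over the toothache example's full joint table (the helper function and variable names below are my own, not from the lecture):

```python
from itertools import product

VARS = ("Toothache", "Cavity", "Catch")
probs = [0.576, 0.144, 0.008, 0.072, 0.064, 0.016, 0.012, 0.108]
joint = dict(zip(product([False, True], repeat=3), probs))  # full joint table

def query(X, evidence):
    """P(X | evidence): sum out the unobserved variables, then normalize."""
    dist = {}
    for x in (False, True):
        total = 0.0
        for world, p in joint.items():
            w = dict(zip(VARS, world))
            if w[X] == x and all(w[v] == val for v, val in evidence.items()):
                total += p   # the sum over the unobserved variables Y
        dist[x] = total
    alpha = 1 / sum(dist.values())          # normalization constant
    return {x: alpha * q for x, q in dist.items()}

result = query("Cavity", {"Toothache": True})
print({x: round(q, 2) for x, q in result.items()})  # {False: 0.4, True: 0.6}
```

Looping over every world is exponential in the number of variables, which is why the full-joint approach does not scale; it is, however, the conceptual baseline for the inference methods that follow in the course.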




Sources of this Lecture

  • S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Third Edition.