Review of probability
Nuno Vasconcelos, UCSD
Probability
- probability is the language to deal with processes that
are non-deterministic
- examples:
– if I flip a coin 100 times, how many heads can I expect to see?
– what is the weather going to be like tomorrow?
– are my stocks going to be up or down?
– am I in front of a classroom or is this just a picture of it?
Sample space
- the most important concept is that of a sample space
- our process defines a set of events
– these are the outcomes or states of the process
- example:
– we roll a pair of dice
– call the value on the up face at the nth toss xn
– note that possible events such as
  - odd number on second throw
  - two sixes
  - x1 = 2 and x2 = 6
– can all be expressed as combinations of the sample space events

[figure: 6×6 grid of the sample space events (x1, x2), x1, x2 ∈ {1, …, 6}]
Sample space
- is the list of possible events that satisfies the following
properties:
– finest grain: all possible distinguishable events are listed separately
– mutually exclusive: if one event happens the others do not (if x1 = 5 it cannot be anything else)
– collectively exhaustive: any possible outcome can be expressed as a union of sample space events
- mutually exclusive property simplifies the calculation of
the probability of complex events
- collectively exhaustive means that there is no possible
outcome to which we cannot assign a probability
Probability measure
- probability of an event:
– number expressing the chance that the event will be the outcome of the process
- probability measure: satisfies three axioms
– P(A) ≥ 0 for any event A
– P(universal event) = 1
– if A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
- e.g.
– P(x1 ≥ 0) = 1
– P(x1 even ∪ x1 odd) = P(x1 even) + P(x1 odd)
Probability measure
- the last axiom
– combined with the mutually exclusive property of the sample space
– allows us to easily assign probabilities to all possible events
- back to our dice example:
– suppose that the probability of any pair (x1, x2) is 1/36
– we can compute probabilities of all “union” events
– P(x2 odd) = 18 × 1/36 = 1/2
– P(U) = 36 × 1/36 = 1
– P(two sixes) = 1/36
– P(x1 = 2 and x2 = 6) = 1/36
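The dice probabilities above can be checked directly by enumerating the sample space. A minimal Python sketch (the variable names are my own, not from the slides):

```python
from itertools import product

# Sample space for a roll of two dice: 36 finest-grain events (x1, x2),
# each with probability 1/36 under the uniform measure.
space = list(product(range(1, 7), repeat=2))
p = 1 / len(space)

# "Union" events are sums over the mutually exclusive elementary events.
p_x2_odd = sum(p for (x1, x2) in space if x2 % 2 == 1)   # 18 * 1/36
p_two_sixes = sum(p for e in space if e == (6, 6))       # 1/36
p_universal = sum(p for _ in space)                      # 36 * 1/36
```

Because the elementary events are mutually exclusive, the third axiom lets us just add their probabilities.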
Probability measure
- note that there are many ways to
define the universal event U
– e.g. A = {x2 odd}, B = {x2 even}, U = A ∪ B
– on the other hand, U = (1,1) ∪ (1,2) ∪ (1,3) ∪ … ∪ (6,6)
– the fact that the sample space is finest grain, exhaustive, and mutually exclusive, together with the measure axioms, makes the whole procedure consistent
Random variables
- random variable X
– is a function that assigns a real value to each sample space event
– we have already seen one such function: PX(x1,x2) = 1/36 for all (x1,x2)
- notation:
– specify both the random variable and the value that it takes in your probability statements
– we do this by specifying the random variable as a subscript and the value as the argument: PX(x1,x2) = 1/36 means Prob[X = (x1,x2)] = 1/36
– without this, probability statements can be hopelessly confusing
Random variables
- two types of random variables:
– discrete and continuous
– this really refers to the types of values the RV can take
- if it can take only one of a finite set of possibilities, we call
it discrete
– this is the dice example we saw, there are only 36 possibilities
Random variables
- if it can take values in a real interval we say that the
random variable is continuous
- e.g. consider the sample space of weather temperature
– we know that it could be any number between -50 and 150 degrees
– random variable T ∈ [-50, 150]
– note that the extremes do not have to be very precise; we can just say that P(T < -45°) = 0
- most probability notions apply equally well to discrete and
continuous random variables
Discrete RV
- for a discrete RV the probability assignments are given by a
probability mass function (PMF)
– this can be thought of as a normalized histogram
– satisfies the following properties:

0 ≤ PX(a) ≤ 1, ∀a        Σa PX(a) = 1

- example: the random variable
– X ∈ {1, 2, 3, …, 20} where X = i if the grade of student z on the class is between 5i and 5(i+1)
– we see that PX(14) = α
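The “normalized histogram” view of a PMF can be sketched in a few lines of Python. The grades below are made up for illustration; the binning X = i for grades in [5i, 5(i+1)) follows the slide's example:

```python
from collections import Counter

# Hypothetical grades, for illustration only; X = i when the grade falls
# in [5i, 5(i+1)), as in the slide's example.
grades = [72, 68, 95, 70, 88, 73, 60, 71]
bins = [g // 5 for g in grades]

# Normalized histogram = PMF: each value lies in [0, 1], total mass is 1.
pmf = {i: c / len(bins) for i, c in Counter(bins).items()}
```

Normalizing by the number of samples is exactly what makes the histogram satisfy the two PMF properties.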
Continuous RV
- for a continuous RV the probability assignments are given
by a probability density function (PDF)
– this is just a continuous function
– satisfies the following properties:

PX(a) ≥ 0, ∀a        ∫ PX(a) da = 1

- example: the Gaussian random variable of mean µ and
variance σ²

PX(a) = (1 / (√(2π) σ)) exp{ -(a - µ)² / (2σ²) }
Discrete vs continuous RVs
- in general the same, up to replacing summations by
integrals
- note that PDF means “density of probability”
– this is probability per unit length
– the probability of a particular value is always zero (unless there is a discontinuity)
– we can only talk about probabilities of intervals:

Pr(a ≤ X ≤ b) = ∫_a^b PX(t) dt

– note also that PDFs are not upper bounded
– e.g. the Gaussian goes to a Dirac delta when the variance goes to zero
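As a quick numeric sketch of the interval probability (my own helper names; the Gaussian PDF is written out from the formula above):

```python
import math

def gaussian_pdf(a, mu, sigma):
    # Gaussian PDF with mean mu and variance sigma^2.
    return math.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def prob_interval(a, b, mu, sigma, n=20000):
    # Pr(a <= X <= b): midpoint-rule approximation of the PDF's integral.
    h = (b - a) / n
    return h * sum(gaussian_pdf(a + (k + 0.5) * h, mu, sigma) for k in range(n))

# Pr(mu - sigma <= X <= mu + sigma) is about 0.68 for any Gaussian.
p_one_sigma = prob_interval(-1.0, 1.0, 0.0, 1.0)
```

Note that individual points contribute nothing to the sum; only intervals carry probability mass.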
Multiple random variables
- frequently we have problems with multiple random
variables
– e.g. at the doctor’s, you are mostly a collection of random variables:
x1: temperature
x2: blood pressure
x3: weight
x4: cough
…
- we can summarize this as
– a vector X = (x1, …, xn) of n random variables
– PX(x1, …, xn) is the joint probability distribution
Marginalization
- important notion for multiple random
variables is marginalization
– e.g. having a cold does not depend on blood pressure and weight
– all that matters are fever and cough
– that is, we need to know PX1,X4(a,b)
- we marginalize with respect to a subset of variables
– (in this case X1 and X4)
– this is done by summing (or integrating) the others out

P(cold)?

PX1,X4(x1, x4) = Σx2,x3 PX1,X2,X3,X4(x1, x2, x3, x4)

PX1,X4(x1, x4) = ∫∫ PX1,X2,X3,X4(x1, x2, x3, x4) dx2 dx3
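Marginalization in the discrete case is just a sum over the unused variables. A small Python sketch with a made-up joint over four binary variables (the weights are arbitrary, chosen only so the joint normalizes):

```python
from itertools import product

# Toy joint PMF over four binary variables; the weights are made up and
# normalized so the joint sums to 1.
outcomes = list(product([0, 1], repeat=4))
total = sum(range(1, len(outcomes) + 1))
joint = {o: (i + 1) / total for i, o in enumerate(outcomes)}

# Marginalize x2 and x3 out: PX1,X4(x1, x4) = sum over x2, x3 of joint.
marginal = {}
for (x1, x2, x3, x4), p in joint.items():
    marginal[(x1, x4)] = marginal.get((x1, x4), 0.0) + p
```

The marginal is itself a valid PMF: it inherits non-negativity and unit total mass from the joint.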
Conditional probability
- another very important notion:
– so far, the doctor has PX1,X4(fever, cough)
– this still does not allow a diagnosis
– for this we need a new variable Y with two states, Y ∈ {sick, not sick}
– the doctor measures the fever and cough levels; these are no longer unknowns, or even random quantities
– the question of interest is “what is the probability that the patient is sick given the measured values of fever and cough?”
- this is exactly the definition of conditional probability
– what is the probability that Y takes a given value, given observations for X?

PY|X1,X4(sick | 98, high)        PY|X(sick | cough)?
Conditional probability
- note the very important difference between conditional
and joint probability
- joint probability is a hypothetical question with respect to
all variables
– what is the probability that you will be sick and cough a lot?

PY,X(sick, cough)?
Conditional probability
- conditional probability means that you know the values of
some variables
– what is the probability that you are sick given that you cough a lot?
– “given” is the key word here
– conditional probability is very important because it allows us to structure our thinking
– it shows up again and again in the design of intelligent systems
PY|X(sick | cough)?
Conditional probability
- fortunately it is easy to compute
– we simply normalize the joint by the probability of what we know:

PY|X1(sick | 98) = PY,X1(sick, 98) / PX1(98)

– note that this makes sense since

PY|X1(sick | 98) + PY|X1(not sick | 98) = 1

(given what we know, we still have a valid probability measure; the universal event {sick} ∪ {not sick} still has probability 1 after the observation)

– and, by the marginalization equation,

PY,X1(sick, 98) + PY,X1(not sick, 98) = PX1(98)

– the definition of conditional probability just makes these two statements coherent
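The normalize-the-joint recipe can be sketched in Python. The joint values below are purely illustrative (not medical data), and the helper name is my own:

```python
# Toy joint over Y in {'sick', 'not sick'} and a temperature reading X1;
# the numbers are illustrative only.
joint = {('sick', 98): 0.05, ('not sick', 98): 0.75,
         ('sick', 104): 0.15, ('not sick', 104): 0.05}

def conditional(y, x):
    # PY|X1(y | x) = PY,X1(y, x) / PX1(x), with PX1 by marginalization.
    p_x = sum(p for (yy, xx), p in joint.items() if xx == x)
    return joint[(y, x)] / p_x

# Given the observation, we still have a valid probability measure.
total_given_98 = conditional('sick', 98) + conditional('not sick', 98)
```

Dividing by PX1(x) is exactly what makes the conditional probabilities of {sick} and {not sick} sum back to 1.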
The chain rule of probability
- is an important consequence of the definition of
conditional probability
– note that, from this definition,

PY,X1(y, x1) = PY|X1(y | x1) PX1(x1)

– more generally, it has the form

PX1,…,Xn(x1, …, xn) = PX1|X2,…,Xn(x1 | x2, …, xn)
                    × PX2|X3,…,Xn(x2 | x3, …, xn)
                    × …
                    × PXn-1|Xn(xn-1 | xn) PXn(xn)

- combination with marginalization allows us to make hard
probability questions simple
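The two-variable chain rule can be verified on the dice example (a Python sketch; the helper names are my own):

```python
from itertools import product

# Uniform joint for two dice; check PX1,X2(a, b) = PX1|X2(a | b) PX2(b).
joint = {pair: 1 / 36 for pair in product(range(1, 7), repeat=2)}

def p_x2(b):
    # Marginal of x2, by summing x1 out.
    return sum(p for (a, bb), p in joint.items() if bb == b)

def p_x1_given_x2(a, b):
    # Conditional from the definition: joint normalized by the marginal.
    return joint[(a, b)] / p_x2(b)

lhs = joint[(3, 5)]
rhs = p_x1_given_x2(3, 5) * p_x2(5)
```

The identity holds for every pair, since the conditional was defined as exactly the ratio that makes it hold.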
The chain rule of probability
- e.g. what is the probability that you will be sick and have
104° of fever?
– breaks down a hard question (prob of sick and 104) into two easier questions

PY,X1(sick, 104) = PY|X1(sick | 104) PX1(104)

– Prob(sick | 104): everyone knows that this is close to one

PY|X1(sick | 104) ≈ 1    (you have a cold!)
The chain rule of probability
- e.g. what is the probability that you will be sick and have
104° of fever?
– Prob(104): still hard, but easier than P(sick, 104) since we
only have one random variable (temperature)
– it does not depend on sickness; it is just the question “what is the probability that someone will have 104°?”
– gather a number of people, measure their temperatures, and make a histogram that everyone can use after that

PY,X1(sick, 104) = PY|X1(sick | 104) PX1(104)
The chain rule of probability
- in fact, the chain rule is so handy, that most times we use
it to compute probabilities
– e.g.

PY(sick) = ∫ PY,X1(sick, t) dt        (marginalization)
         = ∫ PY|X1(sick | t) PX1(t) dt

– in this way we can get away with knowing
– PX1(t), which we may know because it was needed for some other problem
– PY|X1(sick | t), which we can ask a doctor about or approximate with a rule of thumb:

PY|X1(sick | t) ≈ 1 if t > 102,  0.5 if 98 < t < 102,  0 if t < 98
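Putting marginalization and the chain rule together can be sketched in Python. Both the temperature histogram and the piecewise conditional below are assumptions made up for illustration (the rule of thumb follows the slide's reconstruction, with the discrete sum standing in for the integral):

```python
# PY(sick) = sum over t of PY|X1(sick | t) PX1(t), with a made-up
# temperature histogram and an assumed rule-of-thumb conditional.
temp_pmf = {97: 0.2, 99: 0.5, 101: 0.25, 103: 0.05}

def p_sick_given(t):
    # Rule-of-thumb conditional, as in the slide's reconstruction.
    if t > 102:
        return 1.0
    if t > 98:
        return 0.5
    return 0.0

p_sick = sum(p_sick_given(t) * p for t, p in temp_pmf.items())
```

Neither factor required knowing the joint directly, which is the point of the decomposition.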
Independence
- another fundamental concept for multiple variables
– two variables are independent if the joint is the product of the marginals:

PX1,X2(a, b) = PX1(a) PX2(b)

– note: this implies

PX1|X2(a | b) = PX1,X2(a, b) / PX2(b) = PX1(a)

– which captures the intuitive notion: “if X1 is independent of X2, knowing X2 does not change the probability of X1”
– e.g. knowing that it is sunny does not change the probability that it will rain in three months
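The definition can be checked directly on two independent dice (a Python sketch; names are my own):

```python
from itertools import product

# Build a joint as the product of two marginals (two independent dice);
# then knowing x2 does not change the probability of x1.
p_x1 = {a: 1 / 6 for a in range(1, 7)}
p_x2 = {b: 1 / 6 for b in range(1, 7)}
joint = {(a, b): p_x1[a] * p_x2[b] for a, b in product(p_x1, p_x2)}

# Conditional of x1 given x2 = 3, from the definition.
p_x2_is_3 = sum(p for (a, b), p in joint.items() if b == 3)
p_x1_given_x2 = joint[(5, 3)] / p_x2_is_3   # equals p_x1[5]
```

The conditioning event cancels out of the ratio, leaving the marginal of x1 untouched.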
Independence
- extremely useful in the design of intelligent
systems
– frequently, knowing X makes Y independent of Z
– e.g. consider the shivering symptom:
– if you have a temperature you sometimes shiver
– it is a symptom of having a cold
– but once you measure the temperature, the two become independent
- simplifies considerably the estimation of the probabilities

PY,X1,S(sick, 98, shiver) = PY|X1,S(sick | 98, shiver) × PS|X1(shiver | 98) PX1(98)
                          = PY|X1(sick | 98) × PS|X1(shiver | 98) PX1(98)
Independence
- useful property: if you add two independent random
variables their probability distributions convolve
– i.e. if z = x + y and x, y are independent then

PZ(z) = PX(z) * PY(z)

where * is the convolution operator
– for discrete random variables

PZ(z) = Σk PX(k) PY(z - k)

– for continuous random variables

PZ(z) = ∫ PX(t) PY(z - t) dt
Moments
- important properties of random variables
– they summarize the distribution
- important moments:
– mean: µ = E[x]
– variance: σ² = E[(x - µ)²]
– various distributions are completely specified by a small number of moments

             mean                      variance
discrete:    µ = Σk k PX(k)            σ² = Σk (k - µ)² PX(k)
continuous:  µ = ∫ k PX(k) dk          σ² = ∫ (k - µ)² PX(k) dk
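The discrete-case formulas are one-liners in Python; as a quick sketch, here they are applied to a fair die:

```python
# Mean and variance of a fair die from the discrete-case formulas:
# mu = sum of k * PX(k), sigma^2 = sum of (k - mu)^2 * PX(k).
pmf = {k: 1 / 6 for k in range(1, 7)}
mu = sum(k * p for k, p in pmf.items())
var = sum((k - mu) ** 2 * p for k, p in pmf.items())
```

These two numbers already summarize the uniform die distribution quite well.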
Mean
- µ = E[x] is the center of mass of the distribution
- is a linear quantity
– if Z = X + Y, then E[Z] = E[X] + E[Y]
– this does not require any special relation between X and Y
– always holds
- other moments are the mean of powers of X
– nth order moment is E[Xn]
– nth central moment is E[(X - µ)n]

discrete:    µ = Σk k PX(k)
continuous:  µ = ∫ k PX(k) dk
Variance
- σ2 = E[(x-µ)2] measures the dispersion around the mean
– it is the second central moment
- in general, not linear
– if Z = X + Y, then Var[Z] = Var[X] + Var[Y]
– only holds if X and Y are independent
- it is related to the 2nd order moment by

σ² = E[(x - µ)²] = E[x² - 2µx + µ²] = E[x²] - 2µE[x] + µ² = E[x²] - µ²

discrete:    σ² = Σk (k - µ)² PX(k)
continuous:  σ² = ∫ (k - µ)² PX(k) dk
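The identity σ² = E[x²] - µ² can be checked numerically, again on a fair die (a Python sketch):

```python
# Check sigma^2 = E[x^2] - mu^2 on a fair die: compare the central
# second moment against the raw second moment minus the squared mean.
pmf = {k: 1 / 6 for k in range(1, 7)}
mu = sum(k * p for k, p in pmf.items())
second_moment = sum(k ** 2 * p for k, p in pmf.items())
central = sum((k - mu) ** 2 * p for k, p in pmf.items())
```

This is often the more convenient way to compute a variance, since E[x²] and µ can be accumulated in a single pass over the data.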