SLIDE 1

Review of probability

Nuno Vasconcelos UCSD

SLIDE 2

Probability

  • probability is the language for dealing with processes that are non-deterministic
  • examples:

– if I flip a coin 100 times, how many heads can I expect to see?
– what is the weather going to be like tomorrow?
– are my stocks going to be up or down?
– am I in front of a classroom or is this just a picture of it?

SLIDE 3

Sample space

  • the most important concept is that of a sample space
  • our process defines a set of events

– these are the outcomes or states of the process

  • example:

– we roll a pair of dice
– call the value on the up face at the nth toss xn
– note that possible events such as

  • odd number on second throw
  • two sixes
  • x1 = 2 and x2 = 6

– can all be expressed as combinations of the sample space events

[Figure: 6×6 grid of dice outcomes, axes x1 and x2, each ranging from 1 to 6]

SLIDE 4

Sample space

  • is the list of possible events that satisfies the following properties:

– finest grain: all possible distinguishable events are listed separately
– mutually exclusive: if one event happens the other does not (if x1 = 5 it cannot be anything else)
– collectively exhaustive: any possible outcome can be expressed as a union of sample space events

  • the mutually exclusive property simplifies the calculation of the probability of complex events
  • collectively exhaustive means that there is no possible outcome to which we cannot assign a probability


SLIDE 5

Probability measure

  • probability of an event:

– a number expressing the chance that the event will be the outcome of the process

  • probability measure: satisfies three axioms

– P(A) ≥ 0 for any event A
– P(universal event) = 1
– if A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)

  • e.g.

– P(x1 ≥ 0) = 1
– P(x1 even ∪ x1 odd) = P(x1 even) + P(x1 odd)


SLIDE 6

Probability measure

  • the last axiom

– combined with the mutually exclusive property of the sample set
– allows us to easily assign probabilities to all possible events

  • back to our dice example:

– suppose that the probability of any pair (x1,x2) is 1/36
– we can compute probabilities of all “union” events
– P(x2 odd) = 18 × 1/36 = 1/2
– P(U) = 36 × 1/36 = 1
– P(two sixes) = 1/36
– P(x1 = 2 and x2 = 6) = 1/36
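To check these numbers, here is a minimal Python sketch (not from the slides) that enumerates the 36 dice pairs and computes event probabilities by counting; the helper name prob is hypothetical:

# A minimal sketch: enumerate the dice sample space and reproduce the
# probabilities quoted above by summing atomic probabilities.
from fractions import Fraction

sample_space = [(x1, x2) for x1 in range(1, 7) for x2 in range(1, 7)]
p = Fraction(1, 36)  # probability of each atomic pair (x1, x2)

def prob(event):
    # P(event) = sum of atomic probabilities over pairs satisfying it
    return sum(p for outcome in sample_space if event(outcome))

print(prob(lambda o: o[1] % 2 == 1))            # P(x2 odd)      -> 1/2
print(prob(lambda o: True))                     # P(U)           -> 1
print(prob(lambda o: o == (6, 6)))              # P(two sixes)   -> 1/36
print(prob(lambda o: o[0] == 2 and o[1] == 6))  # P(x1=2, x2=6)  -> 1/36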


SLIDE 7

Probability measure

  • note that there are many ways to define the universal event U

– e.g. A = {x2 odd}, B = {x2 even}, U = A ∪ B
– on the other hand, U = (1,1) ∪ (1,2) ∪ (1,3) ∪ … ∪ (6,6)
– the fact that the sample space is finest grain, exhaustive, and mutually exclusive, together with the measure axioms, makes the whole procedure consistent


SLIDE 8

Random variables

  • random variable X

– is a function that assigns a real value to each sample space event
– we have already seen one such function: PX(x1,x2) = 1/36 for all (x1,x2)

  • notation:

– specify both the random variable and the value that it takes in your probability statements
– we do this by specifying the random variable as a subscript PX and the value as an argument: PX(x1,x2) = 1/36 means Prob[X = (x1,x2)] = 1/36
– without this, probability statements can be hopelessly confusing

SLIDE 9

Random variables

  • two types of random variables:

– discrete and continuous
– really means what types of values the RV can take

  • if it can take only one of a finite set of possibilities, we call it discrete

– this is the dice example we saw, there are only 36 possibilities


SLIDE 10

Random variables

  • if it can take values in a real interval we say that the random variable is continuous
  • e.g. consider the sample space of weather temperature

– we know that it could be any number between -50 and 150 degrees
– random variable T ∈ [-50,150]
– note that the extremes do not have to be very precise, we can just say that P(T < -45°) = 0

  • most probability notions apply equally well to discrete and continuous random variables

SLIDE 11

Discrete RV

  • for a discrete RV the probability assignments are given by a probability mass function (PMF)

– this can be thought of as a normalized histogram
– it satisfies the following properties

0 ≤ PX(a) ≤ 1, ∀a        Σa PX(a) = 1

  • example for the random variable

– X ∈ {1,2,3, …, 20} where X = i if the grade of a student in the class is between 5i and 5(i+1)
– we see that PX(14) = α

[Figure: histogram-like PMF with the bar at X = 14 having height α]
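To make the “normalized histogram” picture concrete, here is a minimal Python sketch with hypothetical grade data; the bin index plays the role of X and the assertions check the two PMF properties:

# A minimal sketch (assumed data): build a PMF as a normalized histogram
# of grades, mirroring the slide's X = i for grades in [5i, 5(i+1)).
from collections import Counter

grades = [72, 88, 91, 72, 65, 88, 70, 95, 73, 88]  # hypothetical scores
bins = [g // 5 for g in grades]    # X = i  <=>  5i <= grade < 5(i+1)

counts = Counter(bins)
pmf = {i: c / len(bins) for i, c in counts.items()}

assert abs(sum(pmf.values()) - 1.0) < 1e-12        # Σa PX(a) = 1
assert all(0.0 <= p <= 1.0 for p in pmf.values())  # 0 ≤ PX(a) ≤ 1
print(pmf[14])   # the height α of the bar at X = 14 (here 0.4)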

SLIDE 12

Continuous RV

  • for a continuous RV the probability assignments are given by a probability density function (PDF)

– this is just a continuous function
– it satisfies the following properties

PX(a) ≥ 0, ∀a        ∫ PX(a) da = 1

  • example: the Gaussian random variable of mean µ and variance σ²

PX(a) = (1/(√(2π)σ)) exp{−(a − µ)² / (2σ²)}
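As an illustration (a sketch, not from the slides), the Gaussian density above can be coded directly, and a crude numerical integration confirms that it integrates to 1:

# A minimal sketch: the Gaussian PDF, with a rough Riemann-sum check
# of the normalization property over a range far into the tails.
import math

def gaussian_pdf(a, mu=0.0, sigma=1.0):
    # PX(a) = (1/(√(2π)σ)) exp{−(a − µ)² / (2σ²)}
    return math.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

xs = [-10 + 20 * i / 20000 for i in range(20001)]
dx = xs[1] - xs[0]
print(sum(gaussian_pdf(x) for x in xs) * dx)   # ≈ 1.0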

SLIDE 13

Discrete vs continuous RVs

  • in general the same, up to replacing summations by integrals
  • note that PDF means “density of probability”

– this is probability per unit
– the probability of a particular event is always zero (unless there is a discontinuity)
– we can only talk about probabilities of intervals

Pr(a ≤ X ≤ b) = ∫_a^b PX(t) dt

– note also that PDFs are not upper bounded
– e.g. the Gaussian goes to a Dirac delta when the variance goes to zero

SLIDE 14

Multiple random variables

  • frequently we have problems with multiple random variables

– e.g. when at the doctor’s, you are mostly a collection of random variables

x1: temperature
x2: blood pressure
x3: weight
x4: cough
…

  • we can summarize this as

– a vector X = (x1, …, xn) of n random variables
– PX(x1, …, xn) is the joint probability distribution

SLIDE 15

Marginalization

  • an important notion for multiple random variables is marginalization

– e.g. having a cold does not depend on blood pressure and weight
– all that matters are fever and cough
– that is, we need to know PX1,X4(a,b)

P(cold)?

  • we marginalize with respect to a subset of variables

– (in this case X1 and X4)
– this is done by summing (or integrating) the others out

PX1,X4(x1,x4) = Σx2,x3 PX1,X2,X3,X4(x1,x2,x3,x4)

PX1,X4(x1,x4) = ∫∫ PX1,X2,X3,X4(x1,x2,x3,x4) dx2 dx3
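For the discrete case, marginalization is just a sum over axes of the joint table. A minimal numpy sketch with a hypothetical random joint:

# A minimal sketch (hypothetical joint): marginalize a discrete joint
# PX1,X2,X3,X4 down to PX1,X4 by summing out axes 1 and 2 (X2 and X3).
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 3, 5, 2))    # made-up table over (x1, x2, x3, x4)
joint /= joint.sum()                # normalize so it is a valid joint PMF

p_x1_x4 = joint.sum(axis=(1, 2))    # Σx2,x3 PX1,X2,X3,X4(x1,x2,x3,x4)

print(p_x1_x4.shape)  # (4, 2): one entry per (x1, x4) pair
print(p_x1_x4.sum())  # 1.0: the marginal is itself a valid PMF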

SLIDE 16

Conditional probability

  • another very important notion:

– so far, the doctor has PX1,X4(fever,cough)
– this still does not allow a diagnosis
– for this we need a new variable Y with two states, Y ∈ {sick, not sick}
– the doctor measures fever and cough levels; these are no longer unknowns, or even random quantities
– the question of interest is “what is the probability that the patient is sick given the measured values of fever and cough?”

  • this is exactly the definition of conditional probability

– what is the probability that Y takes a given value, given observations for X

PY|X1,X4(sick | 98, high)

PY|X(sick | cough)?

SLIDE 17

Conditional probability

  • note the very important difference between conditional and joint probability
  • joint probability is a hypothetical question with respect to all variables

– what is the probability that you will be sick and cough a lot?

PY,X(sick, cough)?

SLIDE 18

Conditional probability

  • conditional probability means that you know the values of some variables

– what is the probability that you are sick given that you cough a lot?
– “given” is the key word here
– conditional probability is very important because it allows us to structure our thinking
– it shows up again and again in the design of intelligent systems

PY|X(sick | cough)?

SLIDE 19

Conditional probability

  • fortunately it is easy to compute

– we simply normalize the joint by the probability of what we know

PY|X1(sick | 98) = PY,X1(sick, 98) / PX1(98)

– note that this makes sense, since

PY|X1(sick | 98) + PY|X1(not sick | 98) = 1

– and, by the marginalization equation,

PY,X1(sick, 98) + PY,X1(not sick, 98) = PX1(98)

– the definition of conditional probability just makes these two statements coherent
– it simply says that, given what we know, we still have a valid probability measure: the universal event {sick} ∪ {not sick} still has probability 1 after the observation
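A minimal Python sketch (with made-up numbers, not from the slides) shows this normalization at work, and checks that the conditional probabilities still sum to 1:

# A minimal sketch (hypothetical numbers): conditioning is normalizing
# the joint by the marginal of what we observed.
joint = {                       # PY,X1(y, t): made-up values, sum to 1
    ("sick", 98): 0.05, ("not sick", 98): 0.45,
    ("sick", 104): 0.09, ("not sick", 104): 0.41,
}

def conditional(y, t):
    # PY|X1(y | t) = PY,X1(y, t) / PX1(t), with PX1(t) by marginalization
    p_t = sum(p for (yy, tt), p in joint.items() if tt == t)
    return joint[(y, t)] / p_t

print(conditional("sick", 98))                                # 0.1
print(conditional("sick", 98) + conditional("not sick", 98))  # 1.0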

SLIDE 20

The chain rule of probability

  • is an important consequence of the definition of conditional probability

– note that, from this definition,

PY,X1(y, x1) = PY|X1(y | x1) PX1(x1)

– more generally, it has the form

PX1,…,Xn(x1, …, xn) = PX1|X2,…,Xn(x1 | x2, …, xn)
                      × PX2|X3,…,Xn(x2 | x3, …, xn)
                      × …
                      × PXn−1|Xn(xn−1 | xn) × PXn(xn)

  • combination with marginalization allows us to make hard probability questions simple
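In practice the chain rule is used to build a joint from simpler factors. A minimal Python sketch with made-up numbers (the dictionaries and their values are hypothetical):

# A minimal sketch: building a joint from factors via the chain rule,
# PY,X1(y, t) = PY|X1(y | t) · PX1(t).
p_t = {98: 0.9, 104: 0.1}              # PX1(t), hypothetical
p_sick_given_t = {98: 0.1, 104: 0.95}  # PY|X1(sick | t), hypothetical

joint = {}
for t, pt in p_t.items():
    joint[("sick", t)] = p_sick_given_t[t] * pt
    joint[("not sick", t)] = (1 - p_sick_given_t[t]) * pt

print(joint)
print(sum(joint.values()))  # 1.0: the chain rule yields a valid joint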

SLIDE 21

The chain rule of probability

  • e.g. what is the probability that you will be sick and have 104° of fever?

– breaks down a hard question (prob. of sick and 104°) into two easier questions
– Prob(sick | 104°): everyone knows that this is close to one

PY,X1(sick, 104) = PY|X1(sick | 104) PX1(104)

PY|X1(sick | 104) ≈ 1!

You have a cold!

SLIDE 22

The chain rule of probability

  • e.g. what is the probability that you will be sick and have 104° of fever?

– Prob(104°): still hard, but easier than P(sick, 104°), since now we only have one random variable (temperature)
– it does not depend on sickness; it is just the question “what is the probability that someone will have 104°?”
– gather a number of people, measure their temperatures, and make a histogram that everyone can use after that

PY,X1(sick, 104) = PY|X1(sick | 104) PX1(104)

SLIDE 23

The chain rule of probability

  • in fact, the chain rule is so handy that most times we use it to compute probabilities

– e.g.

PY(sick) = ∫ PY,X1(sick, t) dt               (marginalization)

         = ∫ PY|X1(sick | t) PX1(t) dt       (chain rule)

– in this way we can get away with knowing only

PX1(t), which we may know because it was needed for some other problem
PY|X1(sick | t), which we can ask a doctor for, or approximate with a rule of thumb:

PY|X1(sick | t) ≈ { 1,    t > 102
                  { 0.5,  98 < t < 102
                  { 0,    t < 98
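Putting the two steps together numerically: a sketch with an assumed Gaussian temperature distribution (µ = 98.6, σ = 1), which is not from the slides, combined with the rule of thumb above:

# A minimal sketch (assumed inputs): PY(sick) = ∫ PY|X1(sick|t) PX1(t) dt,
# evaluated by a Riemann sum. PX1 is an assumed Gaussian, for illustration.
import math

def p_t(t, mu=98.6, sigma=1.0):   # assumed PX1(t)
    return math.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def p_sick_given_t(t):            # the slide's rule of thumb
    if t > 102: return 1.0
    if t > 98:  return 0.5
    return 0.0

ts = [90 + 20 * i / 100000 for i in range(100001)]   # covers essentially all of PX1
dt = ts[1] - ts[0]
p_sick = sum(p_sick_given_t(t) * p_t(t) for t in ts) * dt
print(p_sick)   # ≈ 0.5 · P(98 < T < 102) + 1 · P(T > 102)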

SLIDE 24

Independence

  • another fundamental concept for multiple variables

– two variables are independent if the joint is the product of the marginals

PX1,X2(a, b) = PX1(a) PX2(b)

– note: this implies that

PX1|X2(a | b) = PX1,X2(a, b) / PX2(b) = PX1(a)

– which captures the intuitive notion: “if X1 is independent of X2, knowing X2 does not change the probability of X1”
– e.g. knowing that it is sunny does not change the probability that it will rain in three months
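A minimal numpy sketch (hypothetical marginals): an independent joint is the outer product of its marginals, and conditioning on X2 then returns exactly the X1 marginal:

# A minimal sketch: independence as an outer product of marginals.
import numpy as np

p_x1 = np.array([0.2, 0.5, 0.3])        # hypothetical PX1
p_x2 = np.array([0.6, 0.4])             # hypothetical PX2
joint = np.outer(p_x1, p_x2)            # PX1,X2(a, b) = PX1(a) PX2(b)

cond = joint[:, 0] / joint[:, 0].sum()  # PX1|X2(a | b=0)
print(np.allclose(cond, p_x1))          # True: knowing X2 changes nothing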

SLIDE 25

Independence

  • extremely useful in the design of intelligent systems

– frequently, knowing X makes Y independent of Z
– e.g. consider the shivering symptom:

if you have a temperature you sometimes shiver
it is a symptom of having a cold
but once you measure the temperature, the two become independent

  • simplifies considerably the estimation of the probabilities

PY,X1,S(sick, 98, shiver) = PY|X1,S(sick | 98, shiver) × PS|X1(shiver | 98) × PX1(98)
                          = PY|X1(sick | 98) × PS|X1(shiver | 98) × PX1(98)
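A minimal Python sketch (all numbers made up): with Y (sick) and S (shiver) conditionally independent given X1 (temperature), the joint is a product of three small factors instead of one large table:

# A minimal sketch (hypothetical factors): conditional independence
# lets the joint factorize as PY|X1 · PS|X1 · PX1.
p_t = {98: 0.9, 104: 0.1}         # PX1(t), hypothetical
p_sick_t = {98: 0.1, 104: 0.95}   # PY|X1(sick | t), hypothetical
p_shiver_t = {98: 0.05, 104: 0.7} # PS|X1(shiver | t), hypothetical

def joint(t):
    # PY,X1,S(sick, t, shiver) = PY|X1(sick|t) PS|X1(shiver|t) PX1(t)
    return p_sick_t[t] * p_shiver_t[t] * p_t[t]

print(joint(98), joint(104))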

SLIDE 26

Independence

  • useful property: if you add two independent random variables their probability distributions convolve

– i.e. if z = x + y and x, y are independent, then

PZ(z) = PX(z) * PY(z)

where * is the convolution operator

– for discrete random variables

PZ(z) = Σk PX(k) PY(z − k)

– for continuous random variables

PZ(z) = ∫ PX(t) PY(z − t) dt
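A minimal numpy sketch: convolving the PMF of one fair die with itself yields the familiar triangular distribution of the sum of two dice:

# A minimal sketch: PMF of the sum of two independent fair dice via
# discrete convolution, PZ(z) = Σk PX(k) PY(z − k).
import numpy as np

die = np.ones(6) / 6           # PX(k) = 1/6 for faces 1..6 (index 0 ↔ face 1)
p_sum = np.convolve(die, die)  # length 11: sums 2..12

for total, p in enumerate(p_sum, start=2):
    print(total, round(p, 4))  # peaks at 7 with probability 6/36 ≈ 0.1667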

SLIDE 27

Moments

  • important properties of random variables

– they summarize the distribution

  • important moments

– mean: µ = E[x]
– variance: σ² = E[(x − µ)²]
– various distributions are completely specified by a small number of moments

[Figure: distribution annotated with mean µ and variance σ²]

              discrete                     continuous
mean          µ = Σk k PX(k)               µ = ∫ k PX(k) dk
variance      σ² = Σk (k − µ)² PX(k)       σ² = ∫ (k − µ)² PX(k) dk
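As a quick illustration (a sketch, not from the slides), the discrete formulas applied to a fair die:

# A minimal sketch: mean and variance of a fair die from its PMF,
# µ = Σk k PX(k) and σ² = Σk (k − µ)² PX(k).
pmf = {k: 1 / 6 for k in range(1, 7)}

mu = sum(k * p for k, p in pmf.items())               # 3.5
var = sum((k - mu) ** 2 * p for k, p in pmf.items())  # 35/12 ≈ 2.9167
print(mu, var)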

SLIDE 28

Mean

  • µ = E[x] is the center of mass of the distribution
  • it is a linear quantity

– if Z = X + Y, then E[Z] = E[X] + E[Y]
– this does not require any special relation between X and Y
– it always holds

  • other moments are the means of powers of X

– the nth order moment is E[Xⁿ]
– the nth central moment is E[(X − µ)ⁿ]

              discrete                     continuous
mean          µ = Σk k PX(k)               µ = ∫ k PX(k) dk
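A minimal Python sketch of the “no special relation required” point: linearity of expectation holds even for the extreme dependence Y = X (the choice Y = X is just for illustration):

# A minimal sketch: E[X + Y] = E[X] + E[Y] even when X and Y are
# strongly dependent (here Y = X on a fair die).
pmf = {k: 1 / 6 for k in range(1, 7)}

e_x = sum(k * p for k, p in pmf.items())
e_z = sum((k + k) * p for k, p in pmf.items())  # Z = X + Y with Y = X
print(e_z, e_x + e_x)                           # both 7.0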

SLIDE 29

Variance

  • σ² = E[(x − µ)²] measures the dispersion around the mean

– it is the second central moment

  • in general, not linear

– if Z = X + Y, then Var[Z] = Var[X] + Var[Y]
– this only holds if X and Y are independent

  • it is related to the 2nd order moment by

σ² = E[(x − µ)²] = E[x² − 2µx + µ²] = E[x²] − 2µE[x] + µ² = E[x²] − µ²

              discrete                     continuous
variance      σ² = Σk (k − µ)² PX(k)       σ² = ∫ (k − µ)² PX(k) dk
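The identity σ² = E[x²] − µ² is easy to verify numerically; a minimal sketch on the fair-die PMF from before:

# A minimal sketch: checking σ² = E[x²] − µ² on the fair-die PMF.
pmf = {k: 1 / 6 for k in range(1, 7)}

mu = sum(k * p for k, p in pmf.items())
e_x2 = sum(k ** 2 * p for k, p in pmf.items())
var_central = sum((k - mu) ** 2 * p for k, p in pmf.items())

assert abs(var_central - (e_x2 - mu ** 2)) < 1e-12
print(var_central, e_x2 - mu ** 2)   # both ≈ 2.9167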
