Refresher on Discrete Probability

STAT 27725/CMSC 25400: Machine Learning
Shubhendu Trivedi

University of Chicago

October 2015


Background

Things you should have seen before

  • Events, Event Spaces
  • Probability as limit of frequency
  • Compound Events
  • Joint and Conditional Probability
  • Random Variables
  • Expectation, variance and covariance
  • Independence and Conditional Independence
  • Estimation


This refresher WILL revise these topics.


Three types of Probability

  • Frequency of repeated trials: if an experiment is repeated infinitely many times, 0 ≤ p(A) ≤ 1 is the fraction of times that the outcome will be A. Typical example: the number of times that a coin comes up heads. Frequentist probability.
  • Degree of belief: a quantity obeying the same laws as the above, describing how likely we think a (possibly deterministic) event is. Typical example: the probability that the Earth will be warmer by more than 5°F by 2100. Bayesian probability.
  • Subjective probability: "I'm 110% sure that I'll go out to dinner with you tonight."

Mixing these three notions is a source of lots of trouble. We will start with the frequentist interpretation and then discuss the Bayesian one.


Why do we need Probability in Machine Learning


  • To analyze, understand and predict the performance of learning algorithms (Vapnik–Chervonenkis theory, PAC model, etc.)
  • To build flexible and intuitive probabilistic models.


Basic Notions


Sample space

Random Experiment: An experiment whose outcome cannot be determined in advance, but is nonetheless subject to analysis

  1. Tossing a coin
  2. Selecting a group of 100 people and observing the number of left-handers

There are three main ingredients in the model of a random experiment. We can't predict the outcome of a random experiment with certainty, but we can specify a set of possible outcomes.

Sample Space: The sample space Ω of a random experiment is the set of all possible outcomes of the experiment.

  1. {H, T}
  2. {0, 1, 2, . . . , 100}


Events

We are often not interested in a single outcome, but in whether or not one of a group of outcomes occurs. Such subsets of the sample space are called events. Events are sets, so we can apply the usual set operations to them:

  1. A ∪ B: event that A or B or both occur
  2. A ∩ B: event that A and B both occur
  3. Aᶜ: event that A does not occur
  4. A ⊂ B: event A implies event B
  5. A ∩ B = ∅: disjoint events


Axioms of Probability

The third ingredient in the model for a random experiment is the specification of the probability of events. The probability of an event A, denoted by P(A), is defined such that P(A) satisfies the following axioms:

  1. P(A) ≥ 0
  2. P(Ω) = 1
  3. For any sequence A1, A2, . . . of disjoint events: P(⋃i Ai) = Σi P(Ai)

Kolmogorov showed that these three axioms lead to the rules of probability theory. de Finetti, Cox and Carnap have also provided compelling reasons for these axioms.


Some Consequences

  • Probability of the empty set: P(∅) = 0
  • Monotonicity: if A ⊆ B then P(A) ≤ P(B)
  • Numeric bound: 0 ≤ P(A) ≤ 1 for every event A
  • Addition law: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
  • Complement: P(Aᶜ) = P(Ω \ A) = 1 − P(A)

The axioms of probability are the only system with this property: if you gamble using them, you can't be unfairly exploited by an opponent using some other system (de Finetti, 1931).
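A quick numeric check of the addition law and complement rule on a toy sample space (a minimal sketch, not from the slides; the die events are arbitrary examples):

```python
from fractions import Fraction

omega = range(1, 7)                      # sample space of one fair die roll
p = {w: Fraction(1, 6) for w in omega}   # equally likely elementary events

def P(event):
    """P(A) = sum of elementary probabilities of outcomes in A."""
    return sum(p[w] for w in omega if w in event)

A = {2, 4, 6}   # "even"
B = {4, 5, 6}   # "at least 4"

assert P(A | B) == P(A) + P(B) - P(A & B)   # addition law
assert P(set(omega) - A) == 1 - P(A)        # complement rule
print(P(A | B))   # 2/3
```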


Discrete Sample Spaces

For now, we focus on the case when the sample space is countable: Ω = {ω1, ω2, . . . , ωn}.

The probability P on a discrete sample space can be specified by first specifying the probability pi of each elementary event ωi and then defining

P(A) = Σ_{i: ωi ∈ A} pi   for all A ⊆ Ω

In many applications, each elementary event is equally likely; the probability of an elementary event is then 1 divided by the total number of elements in Ω.

Equally likely principle: If Ω has a finite number of outcomes, and all are equally likely, then the probability of each event A is defined as P(A) = |A| / |Ω|.

Finding P(A) then reduces to counting. What is the probability of getting a full house in poker? Writing C(n, k) for the binomial coefficient "n choose k":

P(full house) = 13 · C(4, 3) · 12 · C(4, 2) / C(52, 5) ≈ 0.0014 (about 0.14%)
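The full-house count can be checked directly with math.comb (a sketch; the numbers match the computation above):

```python
from math import comb

# rank of the triple, its suits, rank of the pair, its suits
full_house = 13 * comb(4, 3) * 12 * comb(4, 2)
hands = comb(52, 5)   # all 5-card hands
print(full_house, hands, full_house / hands)   # 3744 2598960 ≈ 0.00144
```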


Counting

Counting is not easy! Fortunately, many counting problems can be cast into the framework of drawing balls from an urn: with replacement or without replacement, ordered or not ordered.


Choosing k of n distinguishable objects

                 with replacement        without replacement
  ordered        n^k                     n(n − 1) · · · (n − k + 1)
  not ordered    C(n + k − 1, n − 1)     C(n, k)

These counts usually go in the denominator of P(A) = |A| / |Ω|.
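All four entries of the table can be verified by brute force for small n and k, e.g. with itertools (a sketch, assuming Python 3.8+ for math.perm and math.comb):

```python
from itertools import (product, permutations, combinations,
                       combinations_with_replacement)
from math import comb, perm

n, k = 5, 3
items = range(n)

# ordered, with replacement: n^k
assert len(list(product(items, repeat=k))) == n**k
# ordered, without replacement: n(n-1)...(n-k+1)
assert len(list(permutations(items, k))) == perm(n, k)
# not ordered, with replacement: C(n+k-1, n-1)
assert len(list(combinations_with_replacement(items, k))) == comb(n + k - 1, n - 1)
# not ordered, without replacement: C(n, k)
assert len(list(combinations(items, k))) == comb(n, k)
```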


Indistinguishable Objects

If we choose k balls from an urn with n1 red balls and n2 green balls, what is the probability of getting a particular sequence of x red balls and k − x green ones? What is the probability of any such sequence? How many ways can this happen? (This goes in the numerator.)

                 with replacement              without replacement
  ordered        n1^x · n2^(k−x)               n1(n1 − 1) · · · (n1 − x + 1) · n2(n2 − 1) · · · (n2 − k + x + 1)
  not ordered    C(k, x) · n1^x · n2^(k−x)     k! · C(n1, x) · C(n2, k − x)
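As a sanity check on the "without replacement" column, a Monte Carlo draw can be compared against the resulting hypergeometric probability C(n1, x) C(n2, k − x) / C(n1 + n2, k) (a sketch; the urn sizes below are arbitrary):

```python
import random
from math import comb

n1, n2, k, x = 6, 4, 5, 3
urn = ["red"] * n1 + ["green"] * n2

# probability of exactly x red in k draws without replacement
formula = comb(n1, x) * comb(n2, k - x) / comb(n1 + n2, k)

trials = 100_000
hits = sum(random.sample(urn, k).count("red") == x for _ in range(trials))
print(formula, hits / trials)   # both ≈ 0.476
```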


Joint and conditional probability

Joint: P(A, B) = P(A ∩ B)
Conditional: P(A|B) = P(A ∩ B) / P(B)
AI is all about conditional probabilities.


Conditional Probability

P(A|B) = fraction of worlds in which B is true that also have A true.

H = "Have a headache", F = "Have flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2.

"Headaches are rare and flu is rarer, but if you are coming down with flu, there is a 50-50 chance you'll have a headache."

P(H|F): fraction of flu-inflicted worlds in which you have a headache:

P(H|F) = (number of worlds with flu and headache) / (number of worlds with flu)
       = (area of H ∩ F region) / (area of F region)
       = P(H ∩ F) / P(F)

Conditional probability: P(A|B) = P(A ∩ B) / P(B)

Corollary (the chain rule): P(A ∩ B) = P(A|B) P(B)


Probabilistic Inference

H = "Have a headache", F = "Have flu". P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2.

Suppose you wake up one day with a headache and think: "50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with flu." Is this reasoning good?
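The numbers above let us answer this with the definition of conditional probability; a sketch of the calculation (what the given numbers imply is P(F|H) = 1/8, not a 50-50 chance):

```python
from fractions import Fraction

P_H = Fraction(1, 10)          # P(headache)
P_F = Fraction(1, 40)          # P(flu)
P_H_given_F = Fraction(1, 2)   # P(headache | flu)

# Bayes: P(F|H) = P(H|F) P(F) / P(H)
P_F_given_H = P_H_given_F * P_F / P_H
print(P_F_given_H)   # 1/8
```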


Bayes' Rule: relates P(A|B) to P(B|A)


Sensitivity and Specificity

                 disease (TRUE)    healthy (FALSE)
  predict +      true positive     false positive
  predict −      false negative    true negative

Sensitivity = P(+|disease)     FNR = P(−|disease) = 1 − sensitivity
Specificity = P(−|healthy)     FPR = P(+|healthy) = 1 − specificity


Mammography

Sensitivity of screening mammogram: P(+|cancer) ≈ 90%
Specificity of screening mammogram: P(−|no cancer) ≈ 91%
Probability that a woman age 40 has breast cancer: ≈ 1%

If a previously unscreened 40-year-old woman's mammogram is positive, what is the probability that she has breast cancer?

P(cancer|+) = P(cancer, +) / P(+)
            = P(+|cancer) P(cancer) / P(+)
            = (0.01 × 0.9) / (0.01 × 0.9 + 0.99 × 0.09)
            ≈ 0.009 / (0.009 + 0.089)
            ≈ 0.009 / 0.098 ≈ 9%

Message: P(A|B) ≠ P(B|A).
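The same calculation wrapped in a small helper (a sketch; `posterior` is a hypothetical name, and its arguments are the approximate values from the slide):

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | +) via Bayes' rule, with P(+) expanded by total probability."""
    p_pos = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_pos

print(posterior(prior=0.01, sensitivity=0.90, specificity=0.91))   # ≈ 0.092
```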


Bayes’ rule

P(B|A) = P(A|B) P(B) / P(A)

(Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London.)

Rev. Thomas Bayes (1701–1761)


Prosecutor’s fallacy: Sally Clark

Sally Clark (1964–2007). Two kids died with no explanation. Sir Roy Meadow testified that the chance of this happening due to SIDS is (1/8500)² ≈ (73 × 10⁶)⁻¹. Sally Clark was found guilty and imprisoned. Later the verdict was overturned and Meadow was struck off the medical register.

Fallacy: P(guilty|2 deaths) = 1 − P(not guilty|2 deaths), which is not the same as 1 − P(2 deaths|not guilty); the tiny number quoted was the latter kind of quantity, not the former.


Independence

Two events A and B are independent, denoted A ⊥ B, if P(A, B) = P(A) P(B).

P(A|B) = P(A, B) / P(B) = P(A) P(B) / P(B) = P(A)

P(Aᶜ|B) = (P(B) − P(A, B)) / P(B) = P(B)(1 − P(A)) / P(B) = P(Aᶜ)

A collection of events A1, . . . , An is mutually independent if for any subset of indices {i1, . . . , ik}:

P(Ai1 ∩ · · · ∩ Aik) = P(Ai1) · · · P(Aik)

If A is independent of B and C, that does not necessarily mean that it is independent of (B, C) (see the example sketched below).
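A standard example of this caveat (assumed here, not spelled out on the slide): let B and C be fair coin flips and A their XOR; then A is independent of B and of C separately, but not of the pair (B, C).

```python
from itertools import product
from fractions import Fraction

outcomes = [(b, c) for b, c in product([0, 1], repeat=2)]   # equally likely

def P(pred):
    return Fraction(sum(pred(b, c) for b, c in outcomes), len(outcomes))

A = lambda b, c: b ^ c   # A = XOR(B, C)

assert P(lambda b, c: A(b, c) and b) == P(lambda b, c: A(b, c)) * P(lambda b, c: b)  # A ⊥ B
assert P(lambda b, c: A(b, c) and c) == P(lambda b, c: A(b, c)) * P(lambda b, c: c)  # A ⊥ C
# but A is determined by (B, C): P(A=1, B=1, C=1) = 0 ≠ P(A=1) P(B=1, C=1) = 1/8
assert P(lambda b, c: A(b, c) and b and c) != P(lambda b, c: A(b, c)) * P(lambda b, c: b and c)
```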


Conditional independence

A is conditionally independent of B given C, denoted A ⊥ B | C, if P(A, B|C) = P(A|C) P(B|C). A ⊥ B | C does not imply, and is not implied by, A ⊥ B.


Common cause

p(xA, xB, xC) = p(xC) p(xA|xC) p(xB|xC)

XA ⊥̸ XB, but XA ⊥ XB | XC (marginally dependent, conditionally independent).

Example: Lung cancer ⊥ Yellow teeth | Smoking


Explaining away

p(xA, xB, xC) = p(xA) p(xB) p(xC|xA, xB)

XA ⊥ XB, but XA ⊥̸ XB | XC.

Example: Burglary ⊥ Earthquake, but Burglary ⊥̸ Earthquake | Alarm.

Even if two variables are independent, they can become dependent when we observe an effect that they can both influence (see the sketch below).
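A numeric illustration of explaining away (a sketch; the burglary/earthquake/alarm probabilities below are made up for illustration):

```python
from itertools import product

pB, pE = 0.01, 0.02   # independent causes

def pA(b, e):
    """P(alarm = 1 | b, e), hypothetical values."""
    return {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95}[(b, e)]

# joint p(b, e, a) = p(b) p(e) p(a | b, e)
joint = {}
for b, e, a in product([0, 1], repeat=3):
    p = (pB if b else 1 - pB) * (pE if e else 1 - pE)
    joint[b, e, a] = p * (pA(b, e) if a else 1 - pA(b, e))

def P(cond):
    return sum(p for (b, e, a), p in joint.items() if cond(b, e, a))

# marginally independent (by construction):
print(P(lambda b, e, a: b and e), P(lambda b, e, a: b) * P(lambda b, e, a: e))
# conditionally dependent: P(B=1 | A=1) drops sharply once E=1 is also known
print(P(lambda b, e, a: b and a) / P(lambda b, e, a: a))                 # ≈ 0.57
print(P(lambda b, e, a: b and e and a) / P(lambda b, e, a: e and a))     # ≈ 0.03
```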


Bayesian Networks

Simple case: POS Tagging. Want to predict an output vector y = {y0, y1, . . . , yT } of random variables given an observed feature vector x (Hidden Markov Model)


Random Variables

A random variable is a function X : Ω → R.

Example: the sum of two fair dice.

The set of all possible values a random variable X can take is called its range. Discrete random variables can only take isolated values (so the probability of a random variable taking a particular value reduces to counting).


Discrete Distributions

Assume X is a discrete random variable. We would like to specify the probabilities of events {X = x}.

If we can specify all the probabilities involving X, we say that we have specified the probability distribution of X.

For a countable set of values x1, x2, . . . , xn, we have P(X = xi) > 0, i = 1, 2, . . . , n, and Σi P(X = xi) = 1.

We can then define the probability mass function (pmf) f of X by f(x) = P(X = x). It is sometimes written fX.

Example: toss a die and let X be its face value. X is discrete with range {1, 2, 3, 4, 5, 6}, and the pmf is f(x) = 1/6 for each x in the range.

Another example: toss two dice and let X be the largest face value. The pmf is f(x) = (2x − 1)/36 for x = 1, . . . , 6 (checked below).
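Enumerating the 36 equally likely outcomes confirms the closed form (a small sketch):

```python
from collections import Counter
from itertools import product
from fractions import Fraction

counts = Counter(max(d1, d2) for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}
assert all(pmf[x] == Fraction(2 * x - 1, 36) for x in range(1, 7))
print(pmf)   # {1: 1/36, 2: 3/36, ..., 6: 11/36}
```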


Expectation

Assume X is a discrete random variable with pmf f. The expectation of X, E[X], is defined by

E[X] = Σx x P(X = x) = Σx x f(x)

It is sometimes written µX. It is a kind of "weighted average" of the values that X can take (another interpretation is as a center of mass).

Example: the expected outcome of the toss of a fair die is 7/2.


If X is a random variable, then a function of X, such as X², is also a random variable. The following statement is easy to prove:


Theorem

If X is discrete with pmf f, then for any real-valued function g,

E[g(X)] = Σx g(x) f(x)

Example: E[X²], when X is the outcome of the toss of a fair die, is 91/6.


Linearity of Expectation

A consequence of the obvious theorem from earlier is that expectation is linear, i.e. it has the following two properties for a, b ∈ R and functions g, h:

  1. E(aX + b) = a E[X] + b
     (Proof: suppose X has pmf f. Then E(aX + b) = Σx (ax + b) f(x) = a Σx x f(x) + b Σx f(x) = a E[X] + b.)

  2. E(g(X) + h(X)) = E[g(X)] + E[h(X)]
     (Proof: E(g(X) + h(X)) = Σx (g(x) + h(x)) f(x) = Σx g(x) f(x) + Σx h(x) f(x) = E[g(X)] + E[h(X)].)


Variance

The variance of a random variable X, denoted Var(X), is defined as

Var(X) = E(X − E[X])²

It is a measure of dispersion. The following two properties follow easily from the definitions of expectation and variance:

  1. Var(X) = E[X²] − (E[X])²
     (Proof: write E[X] = µ. Expanding, Var(X) = E(X − µ)² = E(X² − 2µX + µ²). Using linearity of expectation yields E[X²] − µ².)

  2. Var(aX + b) = a² Var(X)
     (Proof: Var(aX + b) = E(aX + b − (aµ + b))² = E(a²(X − µ)²) = a² Var(X).)
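Both the die expectations above and property 1 are easy to check numerically with exact fractions (a sketch):

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # fair die
E  = sum(x * p for x, p in pmf.items())          # 7/2
E2 = sum(x**2 * p for x, p in pmf.items())       # 91/6
print(E, E2, E2 - E**2)                          # Var(X) = 35/12
```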


Joint Distributions

Let X1, . . . , Xn be discrete random variables. The function f defined by f(x1, . . . , xn) = P(X1 = x1, . . . , Xn = xn) is called the joint probability mass function of X1, . . . , Xn.

X1, . . . , Xn are independent if and only if P(X1 = x1, . . . , Xn = xn) = P(X1 = x1) · · · P(Xn = xn) for all x1, x2, . . . , xn.

If X1, . . . , Xn are independent, then E[X1 X2 · · · Xn] = E[X1] E[X2] · · · E[Xn]. (Also: if X and Y are independent, then Var(X + Y) = Var(X) + Var(Y).)

Covariance: the covariance of two random variables X and Y is defined as the number Cov(X, Y) = E[(X − E[X])(Y − E[Y])]. It is a measure of the amount of linear dependency between the variables. If X and Y are independent, the covariance is zero.
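A quick empirical look at covariance (a sketch; note the converse is false in general, since zero covariance does not imply independence):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=100_000)   # one die
y = rng.integers(1, 7, size=100_000)   # an independent die
print(np.cov(x, y)[0, 1])              # ≈ 0 (independent)
print(np.cov(x, x + y)[0, 1])          # ≈ Var(x) = 35/12 ≈ 2.92 (dependent)
```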


Some Important Discrete Distributions


Bernoulli Distribution: Coin Tossing

We say X has a Bernoulli distribution with success probability p if X can only take values 0 and 1, with probabilities P(X = 1) = p = 1 − P(X = 0).

Expectation: E[X] = 0 · P(X = 0) + 1 · P(X = 1) = p

Variance: Var(X) = E[X²] − (E[X])² = E[X] − (E[X])² = p(1 − p)


Binomial Distribution

Consider a sequence of n coin tosses. Suppose X counts the total number of heads. If the probability of "heads" is p, then we say X has a binomial distribution with parameters n and p, and write X ∼ Bin(n, p).

The pmf is f(x) = P(X = x) = C(n, x) p^x (1 − p)^(n−x), for x = 0, 1, . . . , n.

Expectation: E[X] = np. We could evaluate the sum, but that is messy; use linearity of expectation instead (X can be viewed as a sum X = X1 + X2 + · · · + Xn of n independent Bernoulli random variables).

Variance: Var(X) = np(1 − p) (shown in a similar way to the expectation).
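The pmf, mean and variance can be checked directly, and the mean again by simulation (a sketch):

```python
from math import comb
import random

n, p = 10, 0.3
pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
assert abs(sum(pmf) - 1) < 1e-12

mean = sum(x * f for x, f in enumerate(pmf))                # np = 3.0
var  = sum(x**2 * f for x, f in enumerate(pmf)) - mean**2   # np(1-p) = 2.1
print(mean, var)

draws = [sum(random.random() < p for _ in range(n)) for _ in range(50_000)]
print(sum(draws) / len(draws))   # ≈ 3.0
```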


Geometric Distribution

Again look at coin tosses, but count a different thing: the number of tosses until the first head.

P(X = x) = (1 − p)^(x−1) p, for x = 1, 2, 3, . . .. X is said to have a geometric distribution with parameter p, written X ∼ G(p).

Expectation: E[X] = 1/p

Variance: Var(X) = (1 − p)/p²


Poisson Distribution

A random variable X for which P(X = x) = (λ^x / x!) e^(−λ), for x = 0, 1, 2, . . . and fixed λ > 0.

We write X ∼ Poi(λ). It can be seen as a limiting distribution of Bin(n, λ/n), as checked numerically below.
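Comparing the two pmfs for large n illustrates the limit (a sketch):

```python
from math import comb, exp, factorial

lam, n = 4.0, 10_000
for x in range(8):
    binom = comb(n, x) * (lam / n)**x * (1 - lam / n)**(n - x)
    poisson = lam**x / factorial(x) * exp(-lam)
    print(x, round(binom, 6), round(poisson, 6))   # the columns nearly agree
```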


Law of Large Numbers

To discuss the law of large numbers, we will first prove the Chebyshev inequality.

Theorem (Chebyshev Inequality)

Let X be a discrete random variable with E[X] = µ, and let ε > 0 be any positive real number. Then

P(|X − µ| ≥ ε) ≤ Var(X) / ε²

Basically, it states that the probability of a deviation from the mean of more than k standard deviations is ≤ 1/k².

Proof.

Let f(x) denote the pmf of X. Then the probability that X differs from µ by at least ε is given by

P(|X − µ| ≥ ε) = Σ_{x: |x−µ| ≥ ε} f(x)

We know that Var(X) = Σx (x − µ)² f(x), and this is at least as large as Σ_{x: |x−µ| ≥ ε} (x − µ)² f(x), since all the summands are nonnegative and we have restricted the range of summation. But this last sum is at least

Σ_{x: |x−µ| ≥ ε} ε² f(x) = ε² Σ_{x: |x−µ| ≥ ε} f(x) = ε² P(|X − µ| ≥ ε)

So P(|X − µ| ≥ ε) ≤ Var(X) / ε².


Law of Large Numbers (Weak Form)

Theorem (Law of Large Numbers)

Let X1, X2, . . . , Xn be an independent trials process, with finite expected value µ = E[Xj] and finite variance σ² = Var(Xj). Let Sn = X1 + X2 + · · · + Xn. Then for any ε > 0,

P(|Sn/n − µ| ≥ ε) → 0 as n → ∞

and equivalently

P(|Sn/n − µ| < ε) → 1 as n → ∞

The sample average converges in probability towards the expected value.

Proof.

Since X1, X2, . . . , Xn are independent and have the same distribution, we have Var(Sn) = nσ² and Var(Sn/n) = σ²/n. We also know that E[Sn/n] = µ. By Chebyshev's inequality, for any ε > 0,

P(|Sn/n − µ| ≥ ε) ≤ σ² / (nε²)

Thus for fixed ε, letting n → ∞ implies the statement.
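A simulation of the weak law for die rolls, together with the Chebyshev bound σ²/(nε²) from the proof (a sketch):

```python
import random

mu, var, eps = 3.5, 35 / 12, 0.1   # fair die: E[X] = 7/2, Var(X) = 35/12
for n in [10, 100, 1000, 10_000, 100_000]:
    s = sum(random.randint(1, 6) for _ in range(n))
    print(n, s / n, "Chebyshev bound:", var / (n * eps**2))
# the sample average drifts toward 3.5 while the bound shrinks like 1/n
```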


Roadmap

Today: Discrete Probability
Next time: Continuous Probability
