Review of Probabilities and Basic Statistics
10-701 Recitation 1: Statistics Intro – 1/25/2013


SLIDE 1

Carnegie Mellon University – 10-701 Machine Learning, Spring 2013

Review of Probabilities and Basic Statistics

TA: Ina Fiterau, 4th-year PhD student, MLD
Instructors: Alex Smola, Barnabas Poczos

SLIDE 2

Overview

  • Introduction to Probability Theory
  • Random Variables. Independent RVs
  • Properties of Common Distributions
  • Estimators. Unbiased estimators. Risk
  • Conditional Probabilities/Independence
  • Bayes Rule and Probabilistic Inference

SLIDE 3

Review: the concept of probability

Sample space Ω – the set of all possible outcomes
Event E ⊆ Ω – a subset of the sample space
Probability measure – maps events to the unit interval: “How likely is it that event E will occur?”

Kolmogorov axioms:

  • P(E) ≥ 0
  • P(Ω) = 1
  • Countable additivity: for pairwise disjoint events E₁, E₂, …,
    P(E₁ ∪ E₂ ∪ ⋯) = Σ_{i=1}^∞ P(Eᵢ)

[Diagram: an event E inside the sample space Ω]

SLIDE 4

Reasoning with events

Venn Diagrams: P(A) = Vol(A) / Vol(Ω)

Event union and intersection:

  P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Properties of event union/intersection:

  • Commutativity: A ∪ B = B ∪ A; A ∩ B = B ∩ A
  • Associativity: A ∪ (B ∪ C) = (A ∪ B) ∪ C
  • Distributivity: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
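As a quick check of the inclusion–exclusion formula above, here is a minimal sketch (not from the slides) that verifies it on a fair-die example, where P(E) = |E| / |Ω|:

```python
# A minimal sketch (not from the slides): verifying inclusion-exclusion by
# counting outcomes of a fair die.
from fractions import Fraction

omega = set(range(1, 7))          # sample space of a fair die
A = {2, 4, 6}                     # "even"
B = {4, 5, 6}                     # "at least 4"

def prob(event):
    return Fraction(len(event), len(omega))

lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
assert lhs == rhs                 # P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
print(lhs)                        # 2/3
```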

SLIDE 5

Reasoning with events

DeMorgan’s Laws:

  (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
  (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ

Proof for law #1 – by double containment:

  • (A ∪ B)ᶜ ⊆ Aᶜ ∩ Bᶜ: if x ∉ A ∪ B, then x ∉ A and x ∉ B, so x ∈ Aᶜ ∩ Bᶜ
  • Aᶜ ∩ Bᶜ ⊆ (A ∪ B)ᶜ: if x ∈ Aᶜ ∩ Bᶜ, then x is in neither A nor B, so x ∉ A ∪ B

SLIDE 6

Reasoning with events

Disjoint (mutually exclusive) events:

  P(A ∩ B) = 0
  P(A ∪ B) = P(A) + P(B)

Examples:

  • A and Aᶜ
  • partitions

NOT the same as independent events – for instance, successive coin flips are independent but not disjoint.

[Diagram: partition of Ω into events S₁, …, S₆]

SLIDE 7

Partitions

Partition S₁, …, Sₙ:

  • Events cover the sample space: S₁ ∪ ⋯ ∪ Sₙ = Ω
  • Events are pairwise disjoint: Sᵢ ∩ Sⱼ = ∅ for i ≠ j

Event reconstruction (law of total probability):

  P(A) = Σ_{i=1}^n P(A ∩ Sᵢ)

Boole’s inequality:

  P(⋃_{i=1}^∞ Aᵢ) ≤ Σ_{i=1}^∞ P(Aᵢ)

Bayes’ Rule:

  P(Sᵢ | A) = P(A | Sᵢ) P(Sᵢ) / Σ_{k=1}^n P(A | Sₖ) P(Sₖ)
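The partition machinery above is easy to exercise numerically. A minimal sketch (not from the slides; the three-event partition and its probabilities are made-up illustration values):

```python
# A minimal sketch (illustrative numbers, not from the slides): Bayes' rule
# over a partition S_1, ..., S_n of the sample space.
priors = [0.5, 0.3, 0.2]          # P(S_i): must sum to 1
likelihoods = [0.9, 0.5, 0.1]     # P(A | S_i)

# Law of total probability gives the evidence P(A).
evidence = sum(p * l for p, l in zip(priors, likelihoods))
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]

print(evidence)     # P(A) = 0.5*0.9 + 0.3*0.5 + 0.2*0.1 = 0.62
print(posteriors)   # P(S_i | A); the posteriors sum to 1
```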

SLIDE 8

Overview (section divider – same agenda as Slide 2)

SLIDE 9

Random Variables

Random variable – associates a value to the outcome of a randomized event
Sample space 𝒳: possible values of the r.v. X

Example (event to random variable): draw 2 numbers between 1 and 4; let the r.v. X be their sum.

  E     11 12 13 14 21 22 23 24 31 32 33 34 41 42 43 44
  X(E)   2  3  4  5  3  4  5  6  4  5  6  7  5  6  7  8

Induced probability function on 𝒳:

  x       2     3     4     5     6     7     8
  P(X=x)  1/16  2/16  3/16  4/16  3/16  2/16  1/16
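The induced pmf in the table can be reproduced by brute-force enumeration. A short sketch (not from the slides):

```python
# A short sketch (not from the slides): enumerate all 16 equally likely draws
# and count how often each sum occurs.
from fractions import Fraction
from itertools import product
from collections import Counter

counts = Counter(a + b for a, b in product(range(1, 5), repeat=2))
pmf = {x: Fraction(c, 16) for x, c in sorted(counts.items())}
print(pmf)   # {2: 1/16, 3: 1/8, 4: 3/16, 5: 1/4, 6: 3/16, 7: 1/8, 8: 1/16}
```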

SLIDE 10

Cumulative Distribution Functions

  F_X(x) = P(X ≤ x) ∀x ∈ 𝒳

The CDF completely determines the probability distribution of an RV.
A function F(x) is a CDF iff:

  • lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1
  • F(x) is a non-decreasing function of x
  • F(x) is right-continuous: ∀x₀, lim_{x→x₀, x>x₀} F(x) = F(x₀)

SLIDE 11

Identically distributed RVs

Two random variables X₁ and X₂ are identically distributed iff for all sets of values A,

  P(X₁ ∈ A) = P(X₂ ∈ A)

So does that mean the variables are equal? NO. Example: toss a coin 3 times and let X_H and X_T represent the number of heads and tails, respectively. They have the same distribution, but X_H = 3 − X_T.

SLIDE 12

Discrete vs. Continuous RVs

Discrete: step CDF, 𝒳 is discrete. Probability mass function:

  f_X(x) = P(X = x) ∀x

Continuous: continuous CDF, 𝒳 is continuous. Probability density function f_X with:

  F_X(x) = ∫_{−∞}^{x} f_X(u) du ∀x

SLIDE 13

Interval Probabilities

Obtained by integrating the area under the density curve:

  P(x₁ ≤ X ≤ x₂) = ∫_{x₁}^{x₂} f_X(x) dx

This explains why P(X = x) = 0 for continuous distributions:

  P(X = x) ≤ lim_{ε→0, ε>0} [F_X(x) − F_X(x − ε)] = 0

[Plot: density curve with the area between x₁ and x₂ shaded]
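As a numerical sanity check (a sketch, not from the slides), the interval probability of a standard normal can be computed both as a CDF difference and by direct quadrature:

```python
# A small sketch (not from the slides): P(x1 <= X <= x2) for X ~ N(0, 1),
# computed as F(x2) - F(x1) and, equivalently, by integrating the density.
from scipy.stats import norm
from scipy.integrate import quad

x1, x2 = -1.0, 2.0
via_cdf = norm.cdf(x2) - norm.cdf(x1)
via_quad, _ = quad(norm.pdf, x1, x2)

print(via_cdf, via_quad)   # both ≈ 0.8186
```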

SLIDE 14

Moments

Expectations: the expected value of a function g of a r.v. X ~ P is defined as

  E[g(X)] = ∫ g(x) P(x) dx

nth moment of a probability distribution:

  μₙ = ∫ xⁿ P(x) dx        (mean: μ = μ₁)

nth central moment:

  μₙ′ = ∫ (x − μ)ⁿ P(x) dx   (variance: σ² = μ₂′)
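These integrals can be evaluated numerically. A brief sketch (not from the slides), using the Exponential(1) distribution as a stand-in example:

```python
# A brief sketch (not from the slides): first moment and second central moment
# of an Exponential(rate=1) distribution via numerical integration.
import numpy as np
from scipy.integrate import quad

pdf = lambda x: np.exp(-x)                      # density of Exp(1) on [0, inf)
mean, _ = quad(lambda x: x * pdf(x), 0, np.inf)
var, _ = quad(lambda x: (x - mean) ** 2 * pdf(x), 0, np.inf)

print(mean, var)   # both ≈ 1.0, matching the known Exp(1) moments
```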

SLIDE 15

Multivariate Distributions

Example: uniformly draw (X, Y) from the set {1,2,3}²; let V = X + Y and W = |X − Y|.

Joint:

  P((X, Y) ∈ A) = Σ_{(x,y)∈A} f(x, y)

Marginal:

  f_Y(y) = Σ_x f(x, y)

For independent RVs:

  f(x₁, …, xₙ) = f_{X₁}(x₁) ⋯ f_{Xₙ}(xₙ)

Joint distribution of V and W, with marginals:

  V\W    0     1     2   | P_V
  2     1/9    –     –   | 1/9
  3      –    2/9    –   | 2/9
  4     1/9    –    2/9  | 3/9
  5      –    2/9    –   | 2/9
  6     1/9    –     –   | 1/9
  P_W   3/9   4/9   2/9  | 1
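The joint table above can be rebuilt by enumeration. A compact sketch (not from the slides):

```python
# A compact sketch (not from the slides): joint table of V = X + Y and
# W = |X - Y| for (X, Y) uniform on {1,2,3}^2, plus the marginal of W.
from fractions import Fraction
from itertools import product
from collections import Counter

joint = Counter((x + y, abs(x - y)) for x, y in product([1, 2, 3], repeat=2))
for (v, w), c in sorted(joint.items()):
    print(f"P(V={v}, W={w}) = {Fraction(c, 9)}")

# Marginal of W: sum the joint over v.
pw = Counter()
for (v, w), c in joint.items():
    pw[w] += c
print({w: Fraction(c, 9) for w, c in sorted(pw.items())})  # {0: 1/3, 1: 4/9, 2: 2/9}
```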

SLIDE 16

Overview (section divider – same agenda as Slide 2)

SLIDE 17

Bernoulli

  X = 1 with probability p, X = 0 with probability 1 − p,  0 ≤ p ≤ 1

Mean and Variance:

  E[X] = 1·p + 0·(1 − p) = p
  Var[X] = (1 − p)²·p + (0 − p)²·(1 − p) = p(1 − p)

MLE: the sample mean.

Connections to other distributions:

  • If X₁, …, Xₙ ~ Bern(p), then Y = Σ_{i=1}^n Xᵢ is Binomial(n, p)
  • Geometric distribution – the number of Bernoulli trials needed to get one success
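A quick simulation sketch (not from the slides) of the two claims: the MLE of p is the sample mean, and a sum of n Bernoulli(p) draws behaves like a Binomial(n, p):

```python
# A quick simulation sketch (not from the slides).
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.3, 20, 100_000

samples = rng.random((trials, n)) < p        # Bernoulli(p) draws
print(samples.mean())                        # MLE of p over all draws, ≈ 0.3

sums = samples.sum(axis=1)                   # each row sums to a Binomial(n, p) draw
print(sums.mean(), sums.var())               # ≈ n*p = 6 and n*p*(1-p) = 4.2
```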

SLIDE 18

Binomial

  P(X = x; n, p) = C(n, x) pˣ (1 − p)^{n−x}

Mean and Variance:

  E[X] = Σ_{x=0}^n x C(n, x) pˣ (1 − p)^{n−x} = … = np
  Var[X] = np(1 − p)     (using Var[X] = E[X²] − (E[X])²)

NOTE:

  • a sum of independent Binomials with the same p is Binomial
  • conditionals on Binomials are Binomial

SLIDE 19

Properties of the Normal Distribution

Operations on normally-distributed variables:

  • If X₁, X₂ ~ Norm(0, 1) independently, then X₁ ± X₂ ~ Norm(0, 2) and X₁/X₂ ~ Cauchy(0, 1)
  • If X₁ ~ Norm(μ₁, σ₁²), X₂ ~ Norm(μ₂, σ₂²) and X₁ ⊥ X₂, then
    Z = X₁ + X₂ ~ Norm(μ₁ + μ₂, σ₁² + σ₂²)
  • If (X, Y) is jointly normal with means (μ_X, μ_Y), variances (σ_X², σ_Y²) and correlation ρ,
    then X + Y is still normally distributed: the mean is the sum of the means, and the variance is
    σ_{X+Y}² = σ_X² + σ_Y² + 2ρσ_Xσ_Y
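The last variance formula is easy to check by Monte Carlo. A sketch (not from the slides; the means, variances, and correlation below are arbitrary illustration values):

```python
# A Monte Carlo sketch (not from the slides): variance of a sum of correlated
# jointly normal variables.
import numpy as np

rng = np.random.default_rng(1)
mu = [1.0, -2.0]
sx, sy, rho = 2.0, 3.0, 0.5
cov = [[sx**2, rho * sx * sy],
       [rho * sx * sy, sy**2]]

xy = rng.multivariate_normal(mu, cov, size=500_000)
s = xy.sum(axis=1)

print(s.mean())   # ≈ mu_X + mu_Y = -1
print(s.var())    # ≈ sx^2 + sy^2 + 2*rho*sx*sy = 4 + 9 + 6 = 19
```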

SLIDE 20

Overview (section divider – same agenda as Slide 2)

SLIDE 21

Estimating Distribution Parameters

Let X₁, …, Xₙ be a sample from a distribution parameterized by θ. How can we estimate:

  • The mean of the distribution? Possible estimator: (1/n) Σ_{i=1}^n Xᵢ
  • The median of the distribution? Possible estimator: median(X₁, …, Xₙ)
  • The variance of the distribution? Possible estimator: (1/n) Σ_{i=1}^n (Xᵢ − X̄)²
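Connecting to the “unbiased estimators” item on the agenda: the 1/n variance estimator above is biased; it underestimates σ² by a factor of (n − 1)/n, which dividing by n − 1 removes. A simulation sketch (not from the slides):

```python
# A simulation sketch (not from the slides): the 1/n variance estimator is
# biased; the 1/(n-1) version is not.
import numpy as np

rng = np.random.default_rng(2)
n, trials = 5, 200_000

x = rng.normal(0.0, 1.0, size=(trials, n))   # samples with true variance 1
biased = x.var(axis=1, ddof=0)               # divides by n
unbiased = x.var(axis=1, ddof=1)             # divides by n - 1

print(biased.mean())    # ≈ (n-1)/n * sigma^2 = 0.8
print(unbiased.mean())  # ≈ sigma^2 = 1.0
```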

SLIDE 22

Bias-Variance Tradeoff

When estimating a quantity θ, we evaluate the performance of an estimator θ̂ by computing its risk – the expected value of a loss function:

  R(θ, θ̂) = E[L(θ, θ̂)], where L could be

  • Mean Squared Error Loss
  • 0/1 Loss
  • Hinge Loss (used for SVMs)

Bias-Variance Decomposition: for Y = f(x) + ε with noise variance σ_ε²,

  Err(x) = E[(Y − f̂(x))²]
         = (E[f̂(x)] − f(x))²  +  E[(f̂(x) − E[f̂(x)])²]  +  σ_ε²
         =        Bias²        +        Variance         +  noise
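The decomposition can be verified numerically. A sketch (not from the slides; the shrinkage estimator f̂ = c·Ȳ and all constants are illustration choices):

```python
# A sketch (not from the slides): check the decomposition for a shrinkage
# estimator f_hat = c * mean(Y_i) of a constant signal f.
import numpy as np

rng = np.random.default_rng(3)
f, sigma_eps, n, c, trials = 2.0, 1.0, 10, 0.8, 400_000

y = f + rng.normal(0.0, sigma_eps, size=(trials, n))   # Y = f + eps
f_hat = c * y.mean(axis=1)                             # the estimator

y_new = f + rng.normal(0.0, sigma_eps, size=trials)    # fresh test observation
err = np.mean((y_new - f_hat) ** 2)                    # total expected error

bias2 = (f_hat.mean() - f) ** 2                        # (E f_hat - f)^2
var = f_hat.var()                                      # E (f_hat - E f_hat)^2
print(err, bias2 + var + sigma_eps**2)                 # the two should match
```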

SLIDE 23

Overview (section divider – same agenda as Slide 2)

SLIDE 24

Review: Conditionals

Conditional probability:

  P(X | Y) = P(X, Y) / P(Y)     (note: X | Y is a different r.v.)

Conditional independence X ⊥ Y | Z: X and Y are conditionally independent given Z iff

  P(X, Y | Z) = P(X | Z) P(Y | Z)

Properties of conditional independence (can you prove these?):

  • Symmetry: X ⊥ Y | Z ⟺ Y ⊥ X | Z
  • Decomposition: X ⊥ Y, W | Z ⇒ X ⊥ Y | Z
  • Weak Union: X ⊥ Y, W | Z ⇒ X ⊥ Y | Z, W
  • Contraction: (X ⊥ W | Z, Y) and (X ⊥ Y | Z) ⇒ X ⊥ Y, W | Z

SLIDE 25

Overview (section divider – same agenda as Slide 2)

SLIDE 26

Priors and Posteriors

We’ve so far introduced the likelihood function P(Data | θ) – the likelihood of the data given the parameter of the distribution:

  θ_MLE = argmax_θ P(Data | θ)

What if not all values of θ are equally likely? Then θ itself is distributed according to the prior P(θ). Apply Bayes rule:

  • P(θ | Data) = P(Data | θ) P(θ) / P(Data)
  • θ_MAP = argmax_θ P(θ | Data) = argmax_θ P(Data | θ) P(θ)
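A sketch (not from the slides) contrasting the two estimates on coin-flip data, using a grid search for the argmax and a Beta(2, 2) prior as an illustration choice:

```python
# A sketch (not from the slides): MLE vs. MAP for a coin, maximizing over a
# grid of candidate parameter values.
import numpy as np

heads, n = 8, 10                          # observed data: 8 heads in 10 flips
theta = np.linspace(0.001, 0.999, 999)    # grid over the parameter

log_lik = heads * np.log(theta) + (n - heads) * np.log(1 - theta)
log_prior = np.log(theta) + np.log(1 - theta)   # Beta(2, 2), up to a constant

print(theta[np.argmax(log_lik)])               # MLE ≈ 0.8 (the sample mean)
print(theta[np.argmax(log_lik + log_prior)])   # MAP ≈ 0.75 = (8+2-1)/(10+4-2)
```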

SLIDE 27

Conjugate Priors

If the posterior distributions P(θ | Data) are in the same family as the prior distribution P(θ), then the prior and the posterior are called conjugate distributions, and P(θ) is called a conjugate prior.

Some examples:

  Likelihood                            Conjugate Prior
  Bernoulli/Binomial                    Beta
  Poisson                               Gamma
  (MV) Normal with known (co)variance   Normal
  Exponential                           Gamma
  Multinomial                           Dirichlet

How to compute the parameters of the posterior? I’ll send a derivation.
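For the Bernoulli–Beta pair in the table, the posterior is available in closed form: a Beta(α₀, β₀) prior combined with h heads and t tails gives a Beta(α₀ + h, β₀ + t) posterior. A minimal sketch (not from the slides):

```python
# A minimal sketch (not from the slides): conjugate Beta update for Bernoulli data.
from scipy.stats import beta

a0, b0 = 2.0, 2.0            # Beta(a0, b0) prior
h, t = 8, 2                  # observed heads and tails

posterior = beta(a0 + h, b0 + t)   # conjugate update: Beta(a0 + h, b0 + t)
print(posterior.mean())            # (a0 + h) / (a0 + b0 + h + t) ≈ 0.714
```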

SLIDE 28

Probabilistic Inference

Problem: You’re planning a weekend biking trip with your best friend, Min. Alas, your path to outdoor leisure is strewn with many hurdles. If it happens to rain, your chances of biking reduce to half, not counting other factors. Independent of this, Min might be able to bring a tent, the lack of which will only matter if you notice the symptoms of a flu before the trip. Finally, the trip won’t happen if your advisor is unhappy with your weekly progress report.

SLIDE 29

Probabilistic Inference

The same problem, now with named variables:

  O – the outdoor trip happens
  A – advisor is happy
  R – it rains that day
  T – you have a tent
  F – you show flu symptoms

SLIDE 30

Probabilistic Inference

The problem and variables again, now as a graphical model:

[Diagram: graphical model over the variables A, O, F, R, T]

SLIDE 31

Probabilistic Inference

How many parameters determine this model?

  • P(A|O) => 1 parameter
  • P(R|O) => 1 parameter
  • P(F, T|O) => 3 parameters

In this problem the values are given; otherwise, we would have had to estimate them.

[Diagram: graphical model over the variables A, O, F, R, T]

SLIDE 32

Probabilistic Inference

The weather forecast is optimistic: the chances of rain are 20%. You’ve barely slacked off this week, so your advisor is probably happy – let’s give it an 80%. Luckily, you don’t seem to have the flu. What are the chances that the trip will happen?

Think of how you would do this.

  • Hint #1: do the variables F and T influence the result in this case?
  • Hint #2: the combinations of values for A and R form a partition – use one of the partition formulas we learned.

[Diagram: graphical model over the variables A, O, F, R, T]
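A sketch of the suggested computation. The slides give P(R) = 0.2 and P(A) = 0.8 but not the full conditional table, so the base biking probability below is an assumption for illustration; with no flu, F and T drop out, and the law of total probability over the partition formed by (A, R) gives P(O) = Σ_{a,r} P(O | a, r) P(a) P(r):

```python
# A sketch under stated assumptions (the slides give P(rain)=0.2 and
# P(advisor happy)=0.8 but not the full CPT; here we ASSUME the trip is
# certain when the advisor is happy and it doesn't rain, halved by rain,
# and impossible when the advisor is unhappy).
p_rain, p_happy = 0.2, 0.8
base = 1.0                                   # ASSUMED P(O=1 | A=happy, R=no rain)

def p_trip(advisor_happy, rains):
    if not advisor_happy:
        return 0.0                           # advisor unhappy => no trip
    return base / 2 if rains else base       # rain halves the chances

# Law of total probability over the partition defined by (A, R); A and R independent.
total = sum(
    p_trip(a, r) * (p_happy if a else 1 - p_happy) * (p_rain if r else 1 - p_rain)
    for a in (True, False) for r in (True, False)
)
print(total)   # 0.8 * (0.8 * 1.0 + 0.2 * 0.5) = 0.72 under these assumptions
```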

SLIDE 33

Overview (section divider – same agenda as Slide 2)

SLIDE 34

Overview (closing slide – same agenda as Slide 2)