Probability and Statistics for Computer Science



slide-1
SLIDE 1


Probability and Statistics for Computer Science

“In statistics we apply probability to draw conclusions from data.”

--Prof. J. Orloff

Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 10.13.2020 Credit: wikipedia

slide-2
SLIDE 2

Last time

✺ Exponential Distribution
✺ Sample mean and confidence interval

slide-3
SLIDE 3

Objectives

✺ Bootstrap simulation
✺ Hypothesis test

slide-4
SLIDE 4

Motivation of sampling: the poll example

✺ This senate election poll tells us:
  ✺ The sample has 1211 likely voters
  ✺ Ms. Hyde-Smith has a realized sample mean equal to 51%
✺ What is the estimate of the percentage of votes for Hyde-Smith?
✺ How confident is that estimate?

Source: FiveThirtyEight.com

slide-5
SLIDE 5

Expected value of one random sample is the population mean

✺ Since each sample is drawn uniformly from the population, $E[X^{(1)}] = \mathrm{popmean}(\{X\})$, therefore $E[X^{(N)}] = \mathrm{popmean}(\{X\})$
✺ We say that $X^{(N)}$ is an unbiased estimator of the population mean.

slide-6
SLIDE 6

Standard deviation of the sample mean

✺ We can also rewrite another result from the lecture on the weak law of large numbers:

$$\mathrm{var}[X^{(N)}] = \frac{\mathrm{popvar}(\{X\})}{N}$$

✺ The standard deviation of the sample mean:

$$\mathrm{std}[X^{(N)}] = \frac{\mathrm{popsd}(\{X\})}{\sqrt{N}}$$

✺ But we need the population standard deviation in order to calculate $\mathrm{std}[X^{(N)}]$!

slide-7
SLIDE 7

Unbiased estimate of population standard deviation & Stderr

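✺ The unbiased estimate of $\mathrm{popsd}(\{X\})$ is defined as

$$\mathrm{stdunbiased}(\{x\}) = \sqrt{\frac{1}{N-1} \sum_{x_i \in \text{sample}} \left(x_i - \mathrm{mean}(\{x_i\})\right)^2}$$

✺ So the standard error is an estimate of $\mathrm{std}[X^{(N)}]$

$$\mathrm{std}[X^{(N)}] = \frac{\mathrm{popsd}(\{X\})}{\sqrt{N}} \doteq \frac{\mathrm{stdunbiased}(\{x\})}{\sqrt{N}} = \mathrm{stderr}(\{x\})$$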
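The two definitions above can be sketched in a few lines of Python. This is only an illustration: the data values are hypothetical, and the helper name `stderr` is chosen here to mirror the slide's notation.

```python
import math
import statistics


def stderr(sample):
    """Standard error of the mean: stdunbiased({x}) / sqrt(N)."""
    n = len(sample)
    # statistics.stdev divides by N - 1, which matches stdunbiased on the slide.
    return statistics.stdev(sample) / math.sqrt(n)


# Hypothetical data, used only for illustration.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
sample_mean = statistics.mean(sample)
sample_stderr = stderr(sample)

print(f"mean = {sample_mean:.3f}, stderr = {sample_stderr:.3f}")
```

Note that `statistics.pstdev` divides by N instead, so it would not match the N − 1 definition used here.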

slide-8
SLIDE 8

Standard error: election poll

✺ What is the estimate of the percentage of votes for Hyde-Smith? 51%

Number of sampled voters who selected Ms. Hyde-Smith: 1211(0.51) ≅ 618
Number of sampled voters who did not select Ms. Hyde-Smith: 1211(0.49) ≅ 593

slide-9
SLIDE 9

Standard error: election poll

✺ $\mathrm{stdunbiased}(\{x\}) = \sqrt{\frac{1}{1211-1}\left(618(1-0.51)^2 + 593(0-0.51)^2\right)} = 0.5001001$

✺ $\mathrm{stderr}(\{x\}) \simeq \frac{0.5}{\sqrt{1211}} \simeq 0.0144$
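The arithmetic above can be checked with a short script. It is a sketch that encodes each sampled voter as 1 or 0 using the rounded counts from the previous slide. The sample mean of that encoding is 618/1211 rather than exactly 0.51, which does not change the result at the precision shown.

```python
import math
import statistics

n_yes, n_no = 618, 593            # rounded counts from the slide
votes = [1] * n_yes + [0] * n_no  # 1 = sampled voter chose Hyde-Smith

n = len(votes)
std_unbiased = statistics.stdev(votes)  # divides by N - 1
std_err = std_unbiased / math.sqrt(n)

print(f"N = {n}, stdunbiased = {std_unbiased:.4f}, stderr = {std_err:.4f}")
```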

slide-10
SLIDE 10

Interpreting the standard error

✺ The sample mean is a random variable and has its own probability distribution; stderr is an estimate of the sample mean’s standard deviation:

$$\mathrm{stderr}(\{x\}) = \frac{\mathrm{stdunbiased}(\{x\})}{\sqrt{N}}$$

✺ When N is very large, according to the Central Limit Theorem, the sample mean approaches a normal distribution with

$$\mu = \mathrm{popmean}(\{X\}); \quad \sigma = \frac{\mathrm{popsd}(\{X\})}{\sqrt{N}} \doteq \mathrm{stderr}(\{x\})$$

slide-11
SLIDE 11

Interpreting the standard error

Credit: wikipedia

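The probability distribution of the sample mean tends to normal when N is large.

[Figure: normal curve centered on the population mean, horizontal axis in units of standard error, with the 68%, 95% and 99.7% regions marked]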

slide-12
SLIDE 12

Confidence intervals

✺ A confidence interval for a population mean is defined by a fraction
✺ Given a percentage, find how many units of stderr it covers.

For 95% of the realized sample means, the population mean lies in [sample mean − 2 stderr, sample mean + 2 stderr]

[Figure: standard normal density dnorm(x) for x from −4 to 4, with the central 95% region between −2 and 2 marked]
slide-13
SLIDE 13

Confidence intervals when N is large

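✺ For about 68% of realized sample means
  $\mathrm{mean}(\{x\}) - \mathrm{stderr}(\{x\}) \le \mathrm{popmean}(\{X\}) \le \mathrm{mean}(\{x\}) + \mathrm{stderr}(\{x\})$
✺ For about 95% of realized sample means
  $\mathrm{mean}(\{x\}) - 2\,\mathrm{stderr}(\{x\}) \le \mathrm{popmean}(\{X\}) \le \mathrm{mean}(\{x\}) + 2\,\mathrm{stderr}(\{x\})$
✺ For about 99.7% of realized sample means
  $\mathrm{mean}(\{x\}) - 3\,\mathrm{stderr}(\{x\}) \le \mathrm{popmean}(\{X\}) \le \mathrm{mean}(\{x\}) + 3\,\mathrm{stderr}(\{x\})$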
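The 68%, 95% and 99.7% figures are the areas of a normal distribution within 1, 2 and 3 standard deviations of its mean. As a quick check (a sketch using Python's standard-library `NormalDist`; the helper name `coverage` is made up for this example):

```python
from statistics import NormalDist


def coverage(k):
    """Probability that a normal variable falls within k standard deviations of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


coverages = {k: coverage(k) for k in (1, 2, 3)}
for k, p in coverages.items():
    print(f"within {k} stderr: {p:.4f}")
```

The exact area within 2 standard deviations is slightly more than 95%, so "2 stderr" is a convenient rounding of the 95% interval.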

slide-14
SLIDE 14
Q. Confidence intervals

✺ What is the 68% confidence interval for a population mean?

  • A. [sample mean-2stderr, sample mean+2stderr]
  • B. [sample mean-stderr, sample mean+stderr]
  • C. [sample mean-std, sample mean+std]
slide-15
SLIDE 15

Standard error: election poll

✺ We estimate the population mean as 51% with stderr 1.44%
✺ The 95% confidence interval is [51% − 2×1.44%, 51% + 2×1.44%] = [48.12%, 53.88%]
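The interval above is just the sample mean plus or minus a multiple of the standard error. A minimal sketch, with a hypothetical helper `confidence_interval` and the rounded values from the slide:

```python
def confidence_interval(mean, stderr, k=2):
    """Return (low, high) = mean -/+ k * stderr; k=2 is the rule of thumb for 95%."""
    return (mean - k * stderr, mean + k * stderr)


# Values from the poll example: sample mean 51%, stderr 1.44%.
low, high = confidence_interval(0.51, 0.0144)
print(f"95% CI: [{low:.2%}, {high:.2%}]")
```

Because the interval contains 50%, this poll alone does not settle which side of 50% the population mean falls on.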

slide-16
SLIDE 16

Q.

✺ A store's staff mixed their Fuji and Gala apples, which were individually wrapped, so they are indistinguishable. If I pick 30 apples and find 21 Fuji, what is my 95% confidence interval for the estimate that the population mean for Fuji is 70%? (hint: stderr > 0.05)

  • A. [0.7-0.17, 0.7+0.17]
  • B. [0.7-0.056, 0.7+0.056]
slide-17
SLIDE 17

What if N is small? When is N large enough?

✺ If samples are taken from a normally distributed population, the following variable is a random variable whose distribution is Student’s t-distribution with N−1 degrees of freedom.

$$T = \frac{\mathrm{mean}(\{x\}) - \mathrm{popmean}(\{X\})}{\mathrm{stderr}(\{x\})}$$

The degrees of freedom are N−1 due to this constraint:

$$\sum_i \left(x_i - \mathrm{mean}(\{x\})\right) = 0$$

slide-18
SLIDE 18

t-distribution is a family of distributions with different degrees of freedom

William Sealy Gosset 1876-1937 Credit: wikipedia

[Figure: pdf of the t-distribution with N=5 (4 degrees of freedom) and N=30 (29 degrees of freedom); horizontal axis X from −10 to 10, vertical axis density]

slide-19
SLIDE 19

When N=30, t-distribution is almost Normal

The t-distribution looks very similar to the normal when N=30, so N=30 is a rule of thumb for deciding whether N is large or not.

[Figure: pdf of the t-distribution (N=30, 29 degrees of freedom) and the standard normal distribution; horizontal axis X from −10 to 10, vertical axis density]
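The comparison in the plots can be reproduced numerically. The sketch below uses the standard closed-form density of Student's t-distribution, which is not stated on the slides, and measures the gap to the standard normal at a few arbitrarily chosen points.

```python
import math
from statistics import NormalDist


def t_pdf(x, dof):
    """Density of Student's t-distribution with `dof` degrees of freedom."""
    coef = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    return coef * (1 + x * x / dof) ** (-(dof + 1) / 2)


normal = NormalDist()
xs = [0.0, 1.0, 2.0, 3.0]  # evaluation points chosen for illustration

# Largest absolute difference between the t density and the normal density.
max_gap_n5 = max(abs(t_pdf(x, 4) - normal.pdf(x)) for x in xs)     # N = 5
max_gap_n30 = max(abs(t_pdf(x, 29) - normal.pdf(x)) for x in xs)   # N = 30

print(f"N=5:  max gap {max_gap_n5:.4f}")
print(f"N=30: max gap {max_gap_n30:.4f}")
```

The gap for N=30 is several times smaller than for N=5, which is the numerical counterpart of the two curves nearly overlapping.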

slide-20
SLIDE 20

Confidence intervals when N< 30

✺ If the sample size N < 30, we should use the t-distribution with its parameter (the degrees of freedom) set to N−1

slide-21
SLIDE 21

Centered Confidence intervals

✺ A centered confidence interval for a population mean is specified by an α value, where

$$P(T \ge b) = \alpha$$

For a fraction 1−2α of the realized sample means, the population mean lies in [sample mean − b×stderr, sample mean + b×stderr]

[Figure: density dnorm(x) for x from −4 to 4, with a tail of area α shaded on each side]
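For large N, where T is approximately standard normal, b can be read off the inverse normal CDF. The sketch below makes that large-N assumption explicitly. For N < 30 the t-distribution with N−1 degrees of freedom would be needed instead, and the Python standard library does not provide it. The helper name `critical_value` is invented for this example.

```python
from statistics import NormalDist


def critical_value(alpha):
    """Return b such that P(Z >= b) = alpha for a standard normal Z (large-N assumption)."""
    return NormalDist().inv_cdf(1 - alpha)


alpha = 0.05                 # each tail holds 5%, so the centered interval covers 1 - 2*alpha = 90%
b = critical_value(alpha)
print(f"alpha = {alpha}: b = {b:.3f}, interval = sample mean +/- {b:.3f} * stderr")
```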


slide-23
SLIDE 23

Q.

✺ The 95% confidence interval for a population mean is equivalent to what 1−2α interval?

  • A. α= 0.05
  • B. α= 0.025
  • C. α= 0.1
slide-24
SLIDE 24
slide-25
SLIDE 25

Sample statistic

✺ A statistic is a function of a dataset
  ✺ For example, the mean or median of a dataset is a statistic
✺ Sample statistic
  ✺ Is a statistic of the data set that is formed by the realized sample
  ✺ For example, the realized sample mean

slide-26
SLIDE 26
Q. Is this a sample statistic?

✺ The largest integer that is smaller than or equal to the mean of a sample

  • A. Yes
  • B. No.
slide-27
SLIDE 27
Q. Is this a sample statistic?

✺ The interquartile range of a sample

  • A. Yes
  • B. No.
slide-28
SLIDE 28

Confidence intervals for other sample statistics

✺ Sample statistics such as the median and others are also interesting for drawing conclusions about the population
✺ It’s often difficult to derive the analytical expression in terms of stderr for the corresponding random variable
✺ So we can use simulation…

slide-29
SLIDE 29

Bootstrap for confidence interval of other sample statistics

✺ Bootstrap is a method to construct confidence intervals for any* sample statistic using resampling of the sample data set
✺ Bootstrapping is essentially uniform random sampling with replacement on the sample of size N

slide-30
SLIDE 30

Bootstrap for confidence interval of other sample statistics

Credit: E S. Banjanovic and J. W. Osborne, 2016, PAREonline

slide-31
SLIDE 31

Example of Bootstrap for confidence interval of sample median

✺ The realized sample of student attendance: {12,10,9,8,10,11,12,7,5,10}, N=10, median=10
✺ Generate a random index uniformly from [1,10] that corresponds to one of the 10 numbers in the sample, e.g. if index=6, the bootstrap sample’s number will be 11.
✺ Repeat the process 10 times to get one bootstrap sample

Bootstrap replicate | Sample median
{11, 11, 12, 10, 10, 10, 12, 10, 7, 10} | 10
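The index-drawing procedure above can be sketched as follows. The function name and the seed are arbitrary choices for this illustration, so the replicate it produces will generally differ from the one shown on the slide.

```python
import random
import statistics

# Realized sample of student attendance from the slide.
attendance = [12, 10, 9, 8, 10, 11, 12, 7, 5, 10]


def bootstrap_replicate(sample, rng):
    """Draw len(sample) values from sample, uniformly and with replacement."""
    n = len(sample)
    # randint(1, n) includes both endpoints, mirroring the 1-based index on the slide.
    indices = [rng.randint(1, n) for _ in range(n)]
    return [sample[i - 1] for i in indices]


rng = random.Random(361)  # arbitrary seed so that the run is repeatable
replicate = bootstrap_replicate(attendance, rng)
replicate_median = statistics.median(replicate)

print(replicate, replicate_median)
```

`rng.choices(sample, k=len(sample))` performs the same uniform sampling with replacement in a single call.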

slide-32
SLIDE 32

Example of Bootstrap for confidence interval of sample median

✺ The realized sample of student attendance: {12,10,9,8,10,11,12,7,5,10}, N=10, median=10

Bootstrap replicate | Sample median
{11, 11, 12, 10, 10, 10, 12, 10, 7, 10} | 10
{7, 10, 10, 10, 9, 7, 9, 10, 12, 10} | 10
{9, 7, 10, 8, 5, 10, 7, 10, 12, 8} | 8.5
… | …

slide-33
SLIDE 33
Q. How many possible bootstrap replicates?

  • A. 10^10
  • B. 10!
  • C. e^10

slide-34
SLIDE 34

Example of Bootstrap for confidence interval of sample median

✺ Do the bootstrapping for r = 10000 times, then draw the histogram and also find the stderr of the sample median

slide-35
SLIDE 35

Example of Bootstrap for confidence interval of sample median

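✺ Bootstrapping for r = 10000 times, then draw the histogram and also find the stderr of the sample median.

mean(Sample Median) = 9.73625
stderr(Sample Median) = 0.7724446

[Figure: histogram of sample_median; horizontal axis sample_median from 5 to 12, vertical axis Frequency from 0 to 5000]

Is this similar to Normal?

$$\mathrm{stderr}(\{S\}) = \sqrt{\frac{\sum_i \left[S(\{x\}_i) - \bar{S}\right]^2}{r - 1}}$$

Here $S(\{x\}_i)$ is the statistic computed on the i-th bootstrap replicate and $\bar{S}$ is the mean of those r values.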
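A minimal sketch of the whole simulation, assuming the attendance data from the earlier slide. The seed and helper name are arbitrary, and because the resampling is random the printed values will be close to, but not identical to, the numbers above.

```python
import random
import statistics

attendance = [12, 10, 9, 8, 10, 11, 12, 7, 5, 10]  # realized sample from the slides


def bootstrap_statistics(sample, stat, r, rng):
    """Compute `stat` on r bootstrap replicates of `sample`."""
    n = len(sample)
    return [stat(rng.choices(sample, k=n)) for _ in range(r)]


rng = random.Random(0)  # arbitrary seed so that the run is repeatable
medians = bootstrap_statistics(attendance, statistics.median, 10_000, rng)

boot_mean = statistics.mean(medians)
boot_stderr = statistics.stdev(medians)  # divides by r - 1, as in the formula above

print(f"mean(Sample Median) = {boot_mean:.3f}, stderr(Sample Median) = {boot_stderr:.3f}")
```

Passing a different function as `stat` (for example an interquartile-range helper) reuses the same machinery for other sample statistics.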

slide-36
SLIDE 36

Errors in Bootstrapping

✺ The distribution simulated from bootstrapping is called the empirical distribution. It is not the true population distribution. There is a statistical error.
✺ The number of bootstrap replicates may not be enough. There is a numerical error.
✺ When the statistic is not a well-behaved one, such as the maximum or minimum of a data set, the bootstrap method may fail to simulate the true distribution.
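One way to see why the maximum is awkward (an illustration that is not part of the slides): a bootstrap replicate can never exceed the sample maximum, and when the sample values are distinct it contains the sample maximum with probability 1 − (1 − 1/N)^N, which is about 63% for moderate N. The simulated distribution therefore piles up on a single value. The sample below is hypothetical.

```python
import random


def prob_replicate_contains_max(n):
    """Chance that a size-n bootstrap replicate includes the maximum of n distinct values."""
    return 1 - (1 - 1 / n) ** n


rng = random.Random(1)        # arbitrary seed so that the run is repeatable
sample = list(range(1, 51))   # hypothetical sample of 50 distinct values
r = 5_000                     # number of bootstrap replicates

hits = sum(max(rng.choices(sample, k=len(sample))) == max(sample) for _ in range(r))
simulated = hits / r
exact = prob_replicate_contains_max(len(sample))

print(f"simulated = {simulated:.3f}, exact = {exact:.3f}")
```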

slide-37
SLIDE 37

CEO salary example with larger N = 59

✺ The realized sample of CEO salary: N=59, median=350 K
✺ r = 10000

mean(Sample Median) = 348.0378
stderr(Sample Median) = 27.30539

[Figure: histogram of the bootstrap sample medians; horizontal axis sample_median from 250 to 550, vertical axis Frequency from 0 to 3000]

slide-38
SLIDE 38

CEO salary example with larger N = 59


Is this similar to Normal?

slide-39
SLIDE 39

Checking whether it’s normal by Normal Q-Q plot

✺ A Q-Q plot compares a distribution with the normal by pairing the kth smallest value with the matching normal quantile and plotting each pair as a point in the graph
✺ Linear means similar to normal!

Read Pg 64, 3.2.3, “Introductory statistics with R”
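The pairing of quantiles can be sketched as below. The plotting positions (k − 0.5)/n are an assumption (one common convention; plotting libraries may use slightly different positions), and the function name is made up for this example.

```python
from statistics import NormalDist


def normal_qq_points(values):
    """Pair the kth smallest value with the standard normal quantile at (k - 0.5) / n."""
    n = len(values)
    z = NormalDist()
    theoretical = [z.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return list(zip(theoretical, sorted(values)))


# Tiny hypothetical data set, just to show the shape of the output.
points = normal_qq_points([3.0, 1.0, 2.0])
for theoretical_q, sample_q in points:
    print(f"{theoretical_q:+.3f}  {sample_q}")
```

Plotting these pairs with theoretical quantiles on the horizontal axis gives the Q-Q plot; points close to a straight line suggest the data are close to normal.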

slide-40
SLIDE 40

Checking whether it’s normal by Normal Q-Q plot


[Figure: Q-Q plot; axis label: Normal Distribution’s Quantile]

slide-41
SLIDE 41

CEO salary sample median’s Q-Q plot

✺ Q-Q plot of the CEO salary’s bootstrap sample medians
✺ It’s roughly linear, so it’s close to normal.
✺ We can use the normal distribution to construct the confidence intervals

[Figure: “CEO Bootstrap Sample Median Q-Q Plot”; horizontal axis Theoretical Quantiles from −4 to 4, vertical axis Sample Quantiles from 250 to 500; outliers marked]
slide-42
SLIDE 42

CEO salary sample median’s Q-Q plot

✺ 95% confidence interval for the median CEO salary from the bootstrap simulation
✺ 348.0378 ± 2×27.30539 = [293.427, 402.6486]

[Figure: “CEO Bootstrap Sample Median Q-Q Plot”; horizontal axis Theoretical Quantiles, vertical axis Sample Quantiles; outliers marked]
slide-43
SLIDE 43

Assignments

✺ Read Chapter 7 of the textbook
✺ Next time: more on hypothesis testing

slide-44
SLIDE 44

Additional References

✺ Charles M. Grinstead and J. Laurie Snell, "Introduction to Probability"
✺ Morris H. DeGroot and Mark J. Schervish, "Probability and Statistics"

slide-45
SLIDE 45

See you next time

See you!