Probability and Statistics for Computer Science
“In statistics we apply probability to draw conclusions from data.”
--Prof. J. Orloff
Hongye Liu, Teaching Assistant Prof, CS361, UIUC, 10.13.2020 Credit: wikipedia
Last time
✺ This senate election poll tells us:
✺ The sample has 1211 likely voters
✺ Ms. Hyde-Smith has a realized sample mean equal to 51%
✺ What is the estimate of the percentage of votes?
✺ How confident is that estimate?
Source: FiveThirtyEight.com
✺ Since each sample is drawn uniformly from the population, the sample mean X(N) has expectation popmean({X})
✺ We say that X(N) is an unbiased estimator of the population mean
✺ We can also rewrite another result from the lecture:

var[X(N)] = popvar({X}) / N
std[X(N)] = popsd({X}) / √N

✺ The standard deviation of the sample mean is std[X(N)], but computing it needs the population standard deviation popsd({X}), which is usually unknown; in practice we estimate it with the unbiased sample standard deviation:

stdunbiased({x}) = √( Σᵢ (xᵢ − mean({x}))² / (N − 1) )

std[X(N)] = popsd({X}) / √N ≈ stdunbiased({x}) / √N = stderr({x})
✺ What is the estimate of the percentage of votes?
Number of sampled voters who selected Ms. Hyde-Smith: 1211 × 0.51 ≈ 618
Number of sampled voters who didn't select Ms. Hyde-Smith: 1211 × 0.49 ≈ 593
The estimate (realized sample mean) is 51%

stdunbiased({x}) = √( (618 × (1 − 0.51)² + 593 × (0 − 0.51)²) / (1211 − 1) ) = 0.5001001 ≈ 0.5
stderr({x}) = 0.5 / √1211 ≈ 0.0144
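The poll arithmetic can be checked with a short script. A minimal sketch in Python (the course demos use R), encoding the 618 "selected" voters as 1s and the 593 others as 0s:

```python
import math

# Poll from the slide: 1211 likely voters, 51% for Ms. Hyde-Smith,
# coded as 618 ones (selected her) and 593 zeros (did not)
n = 1211
ones, zeros = 618, 593

sample_mean = ones / n                      # about 0.51

# Unbiased sample standard deviation (divide by N - 1)
sum_sq = ones * (1 - sample_mean) ** 2 + zeros * (0 - sample_mean) ** 2
std_unbiased = math.sqrt(sum_sq / (n - 1))  # about 0.5

# stderr: the estimate of the sample mean's standard deviation
stderr = std_unbiased / math.sqrt(n)        # about 0.0144
print(round(std_unbiased, 4), round(stderr, 4))
```

This reproduces the slide's 0.5001 and 0.0144 to the printed precision.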
✺ The sample mean is a random variable and has its own probability distribution; stderr is an estimate of the sample mean's standard deviation
✺ When N is very large, according to the Central Limit Theorem, the sample mean approaches a normal distribution with

µ = popmean({X})
σ = popsd({X}) / √N ≈ stdunbiased({x}) / √N = stderr({x})
[Figure: normal curve with the 68%, 95%, and 99.7% bands around the population mean µ, in units of standard error; the probability distribution of the sample mean tends to normal when N is large. Credit: wikipedia]
✺ A confidence interval for a population mean is defined by the fraction of realized sample means whose interval covers the population mean
✺ Given a percentage, find how many units of stderr it covers

[Figure: standard normal density; the central 95% of the area lies within about ±2 units]

For 95% of the realized sample means, the population mean lies in [sample mean − 2×stderr, sample mean + 2×stderr]
✺ For about 68% of realized sample means: mean({x}) − stderr({x}) ≤ popmean({X}) ≤ mean({x}) + stderr({x})
✺ For about 95% of realized sample means: mean({x}) − 2×stderr({x}) ≤ popmean({X}) ≤ mean({x}) + 2×stderr({x})
✺ For about 99.7% of realized sample means: mean({x}) − 3×stderr({x}) ≤ popmean({X}) ≤ mean({x}) + 3×stderr({x})
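The 95% coverage claim can be sanity-checked by simulation: draw many samples, build the ±2×stderr interval for each, and count how often it covers the population mean. A minimal Python sketch (the population parameters, sample size, and trial count are arbitrary choices):

```python
import math
import random
import statistics

random.seed(0)
POP_MEAN, POP_SD = 10.0, 2.0  # arbitrary normal population
N, TRIALS = 50, 2000          # sample size and number of repeated samples

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)  # stdev divides by N - 1
    if m - 2 * se <= POP_MEAN <= m + 2 * se:
        covered += 1

coverage = covered / TRIALS
print(coverage)  # close to 0.95
```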
✺ What is the 68% confidence interval for the poll? With a realized sample mean of 51% and stderr ≈ 0.0144, it is 51% ± 1.44%, roughly [49.6%, 52.4%]
✺ A store's staff mixed their Fuji and Gala apples
✺ If samples are taken from a normally distributed population but N is small, the standardized sample mean follows a t-distribution rather than a normal
The degrees of freedom is N − 1, due to this constraint:
Σᵢ (xᵢ − mean({x})) = 0
t-distribution with N=5 and N=30
William Sealy Gosset, 1876-1937. Credit: wikipedia
[Figure: pdf of the t-distribution with 4 degrees of freedom (N=5) and 29 degrees of freedom (N=30)]
The t-distribution looks very similar to the normal when N=30, so N=30 is a rule of thumb for deciding whether N is large.
[Figure: pdf of the t-distribution (29 degrees of freedom, N=30) overlaid on the standard normal; the two curves nearly coincide]
✺ If the sample size is N < 30, we should use the t-distribution rather than the normal
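The reason the normal is a poor approximation for small N can be seen by simulating the standardized sample mean T = mean({x}) / stderr({x}) at N = 5 and counting how often |T| exceeds 2. A minimal Python sketch (the trial count is an arbitrary choice):

```python
import math
import random
import statistics

random.seed(1)
N, TRIALS = 5, 50_000  # small sample size; many repeated samples

exceed = 0
for _ in range(TRIALS):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    # Standardized sample mean: true population mean is 0 here
    t = statistics.mean(sample) / (statistics.stdev(sample) / math.sqrt(N))
    if abs(t) > 2:
        exceed += 1

frac = exceed / TRIALS
print(frac)  # near 0.12 (t with 4 degrees of freedom); the normal gives about 0.046
```

The heavier tails mean ±2×stderr intervals are too narrow at N = 5, which is why the t-distribution's wider critical values are used instead.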
✺ Centered confidence interval

[Figure: standard normal density with two tails of area α shaded]

For 1 − 2α of the realized sample means, the population mean lies in [sample mean − b×stderr, sample mean + b×stderr], where b satisfies P(T ≥ b) = α
✺ The 95% confidence interval for a population mean
✺ The largest integer that is smaller than or equal to
✺ The interquartile range of a sample
Credit: E. S. Banjanovic and J. W. Osborne, 2016, PAREonline
✺ The realized sample of student attendance is {12, 10, 9, 8, 10, 11, 12, 7, 5, 10}, N=10, median=10
✺ Generate a random index uniformly from [1,10] corresponding to the 10 numbers in the sample, i.e., if index=6, the bootstrap sample's number will be 11
✺ Repeat the process 10 times to get one bootstrap replicate

Bootstrap replicate                        Sample median
{11, 11, 12, 10, 10, 10, 12, 10, 7, 10}   10
✺ The realized sample of student attendance (continued)

Bootstrap replicate                        Sample median
{11, 11, 12, 10, 10, 10, 12, 10, 7, 10}   10
{7, 10, 10, 10, 9, 7, 9, 10, 12, 10}      10
{9, 7, 10, 8, 5, 10, 7, 10, 12, 8}        8.5
…                                          …
✺ Quiz: how many distinct ordered bootstrap replicates are possible? A. 10^10  B. 10!  C. e^10
✺ Do the bootstrapping r = 10000 times, then draw the histogram and also find the stderr of the sample median
mean(Sample Median) = 9.73625, stderr(Sample Median) = 0.7724446
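The bootstrap procedure above can be sketched in Python (the slide's numbers came from a different random stream, so an arbitrary seed here gives slightly different values than 9.73625 and 0.7724446):

```python
import random
import statistics

random.seed(361)  # arbitrary seed
sample = [12, 10, 9, 8, 10, 11, 12, 7, 5, 10]  # realized attendance sample
N, r = len(sample), 10_000

medians = []
for _ in range(r):
    # One bootstrap replicate: N uniform random indices into the sample
    replicate = [sample[random.randrange(N)] for _ in range(N)]
    medians.append(statistics.median(replicate))

boot_mean = statistics.mean(medians)
boot_stderr = statistics.stdev(medians)  # divides by r - 1, as in the stderr formula
print(round(boot_mean, 3), round(boot_stderr, 3))
```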
[Figure: histogram of the 10000 bootstrap sample medians, ranging from about 5 to 12]
Is this similar to Normal?
stderr({S}) = √( Σᵢ (Sᵢ − mean({S}))² / (r − 1) ), where {S} is the set of r bootstrap replicate statistics
✺ The distribution simulated from bootstrapping is called the empirical distribution. It is not the true population distribution; there is a statistical error
✺ The number of bootstrap replicates may not be large enough
✺ When the statistic is not a well-behaved one, such as the maximum or minimum of a data set, the bootstrap method may fail to simulate the true distribution
✺ The realized sample: CEO salaries
✺ r = 10000

[Figure: histogram of the bootstrap sample medians, ranging from about 250 to 550]

mean(Sample Median) = 348.0378, stderr(Sample Median) = 27.30539
Is this similar to Normal?
✺ A Q-Q plot compares a distribution with the normal by matching the kth smallest quantile value pairs and plotting each pair as a point in the graph
✺ Linear means similar to normal!
Read p. 64, section 3.2.3, "Introductory Statistics with R"
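The quantile-matching idea can be checked numerically: pair the kth smallest data value with the standard normal quantile at (k − 0.5)/n (one common plotting-position convention; an assumption here, not from the slides) and measure how linear the pairs are with a correlation coefficient. A minimal Python sketch:

```python
import math
import random
import statistics

def qq_pairs(data):
    """Pair the k-th smallest data value with the standard normal
    quantile at (k - 0.5)/n, as a Q-Q plot does."""
    xs = sorted(data)
    n = len(xs)
    nd = statistics.NormalDist()
    theo = [nd.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return theo, xs

def pearson(x, y):
    """Pearson correlation: near 1 means the Q-Q points lie on a line."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(7)
data = [random.gauss(50, 5) for _ in range(500)]  # normal-looking data
theo, xs = qq_pairs(data)
r = pearson(theo, xs)
print(round(r, 3))  # close to 1, i.e. the Q-Q points are nearly linear
```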
✺ The Q-Q plot of the CEO salary's bootstrap sample medians
✺ It's roughly linear, so it's close to normal
✺ We can use the normal distribution to construct the confidence intervals
[Figure: CEO Bootstrap Sample Median Q-Q Plot; theoretical quantiles vs sample quantiles, roughly linear from about 250 to 500]
✺ The 95% confidence interval for the median CEO salary from the bootstrap simulation:
348.0378 ± 2 × 27.30539 = [293.427, 402.6486]
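The interval arithmetic, as a two-line check using the bootstrap summary values from the slide:

```python
# 95% interval: bootstrap mean of medians +/- 2 * bootstrap stderr
center, stderr = 348.0378, 27.30539
lo, hi = center - 2 * stderr, center + 2 * stderr
print(round(lo, 3), round(hi, 4))  # 293.427 402.6486
```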