SLIDE 1
Implementing Bootstrap Methods in R
GETTING STARTED WITH BOOTSTRAPPING IN R
Janani Ravi
CO-FOUNDER, LOONYCORN
www.loonycorn.com
SLIDE 2
Overview
Estimating statistics and calculating confidence intervals
The Central Limit Theorem
Conventional methods vs. bootstrap methods
Advantages of bootstrapping techniques
SLIDE 3
Prerequisites and Course Outline
SLIDE 4
Prerequisites
Exposure to statistics at the level of mean, median, and standard deviation
Familiarity with probability distributions
Familiarity with regression models
Some exposure to R programming
SLIDE 5
Prerequisites
R Programming Fundamentals
SLIDE 6
Course Outline
Introducing bootstrap methods
Bootstrapping for summary statistics
- Non-parametric bootstrapping
- Bayesian bootstrapping
- Smoothed bootstrapping
Bootstrapping for regression models
- Case resampling
- Residual resampling
SLIDE 7
Sample Statistics and Confidence Intervals
SLIDE 8
Two Questions
What is the average height of an American male?
How confident are you of your answer?
SLIDE 9
Answering Two Questions
Take sample from population; estimate mean
Calculate confidence intervals around estimate
SLIDE 10
Generalizing to Any Statistic
What is the _____ of some population?
How confident are you of your answer?
SLIDE 11
Generalizing to Any Statistic
Take sample from population; estimate statistic
Calculate confidence intervals around estimate
You need answers to the same two questions
SLIDE 12
Example Statistics
Mean, mode, median, standard deviation
Correlations, covariances
Regression coefficients, R-squared values
Proportions, odds ratio
SLIDE 13
Estimating Population Statistic
Conventional Approach: Sample population; calculate sample statistic
Bootstrap Approach: Sample once; resample that sample with replacement
SLIDE 14
Establishing Confidence Intervals Around Estimate
Once the estimate has been obtained from the sample, we need to answer the second question
Need to establish confidence intervals around the estimate
SLIDE 15
Establishing Confidence Intervals
Conventional Approach:
- Sample multiple times with or without replacement
- Sample once; make strong assumptions about population
Bootstrap Approach:
- Sample once; resample that sample with replacement
SLIDE 16
Sample Mean and Confidence Intervals for Normally Distributed Data
SLIDE 17
Estimating Population Statistic
Conventional Approach: Sample population; calculate sample statistic
Bootstrap Approach: Sample once; resample that sample with replacement
Estimate the mean
SLIDE 18
Establishing Confidence Intervals
Conventional Approach:
- Sample multiple times with or without replacement
- Sample once; make strong assumptions about population
Bootstrap Approach:
- Sample once; resample that sample with replacement
Assume population normally distributed
SLIDE 19
Estimating Population Mean
What is the average height/weight/income of the population?
Common question in science, business, finance
Need to estimate mean value of some property of the population
Assume population is normally distributed
SLIDE 20
Normal Distribution
Values close to the mean are more likely than values far away from the mean
SLIDE 21
Draw Sample from Population
Population: All the data out there in the universe
Sample: A subset, hopefully representative, of the population
SLIDE 22
Mean and Variance
Sample values: $x_1, x_2, \ldots, x_n$
Mean: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
Variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
- These statistics only apply to the sample of data, and so are known as sample statistics
- The corresponding figures for all possible data points out there are called population statistics
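The two formulas above can be checked directly in base R. This is a minimal sketch; the height values are made up for illustration and are not course data.

```r
# Hypothetical sample of heights in cm (illustrative values only)
x <- c(172, 168, 181, 175, 169, 178, 174, 171)
n <- length(x)

# Sample mean: sum of the values divided by n
x_bar <- sum(x) / n

# Sample variance: sum of squared deviations from the mean, divided by n - 1
s2 <- sum((x - x_bar)^2) / (n - 1)

# Base R's mean() and var() use the same definitions
all.equal(x_bar, mean(x))
all.equal(s2, var(x))
```

Note that `var()` divides by n - 1, matching the slide, rather than by n.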
SLIDE 23
From Sample to Population
Sample Mean: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
Population Mean: $\mu = ?$
SLIDE 24
Estimating Population Mean
Aim: Estimate a statistical property (mean) of the population
Will need to do so from a sample
Use properties of sample to estimate property of population
SLIDE 25
Sampling Distribution
Tricky part is going from properties of sample to property of population
Can’t be completely sure of population property
Can, however, be sure of probability distribution of the population property
This distribution depends on the sample alone: the Sampling Distribution
SLIDE 26
Sampling Distribution
Probability distribution of a population statistic (e.g. population mean), given a particular sample.
SLIDE 29
Sampling Distribution
Sample Mean: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
Population Mean
SLIDE 30
Estimating Population Mean
Turns out, $\bar{x}$ is the best estimate of μ (Law of Large Numbers)
Sample mean is the best unbiased estimator of the population mean
Even so, how sure are we of our estimate?
Confidence levels help answer this question
SLIDE 31
Confidence Intervals
“We can be 99% confident that the average is between ___ and ___”
SLIDE 32
Sampling Distribution
Population mean μ has a distribution called the sampling distribution
This is a normal distribution
- Mean = Sample mean
- Variance ≈ Sample variance / n
- Std dev. = Sample std dev. / sqrt(n)
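The three quantities listed above can be computed from a single sample in base R. This is a minimal sketch; the sample is simulated, and the seed, sample size, and distribution parameters are arbitrary assumptions.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
x <- rnorm(100, mean = 175, sd = 7)   # hypothetical sample of 100 heights
n <- length(x)

sampling_mean <- mean(x)              # Mean = sample mean
sampling_var  <- var(x) / n           # Variance ~ sample variance / n
sampling_sd   <- sd(x) / sqrt(n)      # Std dev. = sample std dev. / sqrt(n)

c(mean = sampling_mean, variance = sampling_var, sd = sampling_sd)
```

The last quantity is usually called the standard error of the mean.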
SLIDE 33
68% Confidence That μ is Within 1σ of $\bar{x}$
[Figure: normal curve centered on $\bar{x}$; the central 68% lies between $\bar{x} - 1 \cdot s/\sqrt{n}$ and $\bar{x} + 1 \cdot s/\sqrt{n}$]
We can state with 68% confidence that the population mean μ lies in the range
$$\bar{x} - 1 \cdot \frac{s}{\sqrt{n}} \quad \text{to} \quad \bar{x} + 1 \cdot \frac{s}{\sqrt{n}}$$
SLIDE 36
99% Confidence That μ is Within 2.576σ of $\bar{x}$
[Figure: normal curve centered on $\bar{x}$; the central 99% lies between $\bar{x} - 2.576 \cdot s/\sqrt{n}$ and $\bar{x} + 2.576 \cdot s/\sqrt{n}$]
We can state with 99% confidence that the population mean μ lies in the range
$$\bar{x} - 2.576 \cdot \frac{s}{\sqrt{n}} \quad \text{to} \quad \bar{x} + 2.576 \cdot \frac{s}{\sqrt{n}}$$
SLIDE 39
(100-p)% Confidence That μ is Within Zσ of $\bar{x}$
[Figure: normal curve centered on $\bar{x}$; the central (100-p)% lies between $\bar{x} - Z \cdot s/\sqrt{n}$ and $\bar{x} + Z \cdot s/\sqrt{n}$]
We can state with (100-p)% confidence that the population mean μ lies in the range
$$\bar{x} - Z \cdot \frac{s}{\sqrt{n}} \quad \text{to} \quad \bar{x} + Z \cdot \frac{s}{\sqrt{n}}$$
SLIDE 41
Sampling Distribution
p is the level of significance
Z is the number of standard deviations from the mean corresponding to p
s and $\bar{x}$ are calculated from the sample
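A minimal sketch of this interval in base R follows. The function name `z_interval` and the simulated heights are hypothetical. The slides write p as a percentage, whereas this sketch takes the confidence level as a proportion such as 0.99.

```r
# Hypothetical helper: normal-theory confidence interval around the sample mean
z_interval <- function(x, conf_level = 0.95) {
  n <- length(x)
  p <- 1 - conf_level                 # level of significance, as a proportion
  z <- qnorm(1 - p / 2)               # standard deviations for a two-sided interval
  half_width <- z * sd(x) / sqrt(n)   # Z * s / sqrt(n)
  c(lower = mean(x) - half_width, upper = mean(x) + half_width)
}

set.seed(1)                                  # arbitrary seed
heights <- rnorm(200, mean = 175, sd = 7)    # simulated sample, not course data
ci_99 <- z_interval(heights, conf_level = 0.99)
ci_99
```

For small samples a t-based interval is the more usual choice. The z-based form is used here only because it mirrors the slides.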
SLIDE 42
Sampling Distribution
Confidence Interval    Z
80%                    1.282
85%                    1.440
90%                    1.645
95%                    1.960
99%                    2.576
99.5%                  2.807
99.9%                  3.291
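The Z values in this table are two-sided quantiles of the standard normal distribution, so they can be reproduced with base R's `qnorm`. A short sketch:

```r
# Confidence levels from the table, written as proportions
conf_level <- c(0.80, 0.85, 0.90, 0.95, 0.99, 0.995, 0.999)

# A two-sided interval leaves (1 - conf_level) / 2 in each tail
z <- qnorm(1 - (1 - conf_level) / 2)

data.frame(confidence = paste0(conf_level * 100, "%"), Z = round(z, 3))
```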
SLIDE 43
Sampling Distribution
Range is centered around sample mean
Extends symmetrically on both sides
The greater the range, the greater our confidence that the population mean lies within it
SLIDE 44
Sample Mean and Confidence Intervals for Any Data
SLIDE 45
Estimating Population Statistic
Conventional Approach: Sample population; calculate sample statistic
Bootstrap Approach: Sample once; resample that sample with replacement
Estimate the mean
SLIDE 46
Establishing Confidence Intervals
Conventional Approach:
- Sample multiple times with or without replacement
- Sample once; make strong assumptions about population
Bootstrap Approach:
- Sample once; resample that sample with replacement
Make no assumptions about the population distribution, but draw a large number of samples from the population
SLIDE 47
Sampling Distribution of the Mean
Tricky part is going from properties of samples to property of population
Can’t be completely sure of population property
Need to know the probability distribution of the population property
Use the Sampling Distribution, i.e. the distribution of estimates from the samples
SLIDE 48
Sampling Distribution of the Mean
[Figure: Population → Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]
Draw many samples, calculate mean of each, plot histogram of these means
SLIDE 49
Confidence Intervals from Non-normal Data
[Figure: Population → Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]
Using the Sampling Distribution of the mean, we can calculate confidence intervals for our estimate
SLIDE 50
Confidence Intervals from Non-normal Data
[Figure: Population → sample values → sample means, with the 2.5% and 97.5% percentiles of the sample means marked]
Calculate mean of each sample
Repeat multiple times
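The conventional multi-sample procedure can be simulated in base R. This is a sketch under loud assumptions: the skewed "population" is synthetic (exponential with mean 50), and the sample size of 40 and the 5000 repetitions are arbitrary choices.

```r
set.seed(123)                                  # arbitrary seed

# Hypothetical skewed (non-normal) population of 100,000 values
population <- rexp(100000, rate = 1 / 50)

# Draw many independent samples from the population and record each sample's mean
sample_means <- replicate(5000, mean(sample(population, size = 40)))

# The middle 95% of the sample means lies between these two percentiles
limits <- quantile(sample_means, probs = c(0.025, 0.975))
limits
```

In practice, drawing thousands of fresh samples from a real population is rarely possible. This is the drawback the later slides raise.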
SLIDE 55
Establishing Confidence Intervals
Conventional Approach:
- Sample multiple times with or without replacement
- Sample once; make strong assumptions about population
Bootstrap Approach:
- Sample once; resample that sample with replacement
The Central Limit Theorem can be used to estimate the mean of even non-normally distributed data
SLIDE 56
Central Limit Theorem
A group of means of N samples drawn from any distribution (even a non-normal distribution) approaches normality as N approaches infinity.
SLIDE 60
Implication of the Central Limit Theorem
Mean of non-normal population can be estimated easily by sampling
Draw N samples, compute mean of each sample
Compute mean of these means
As N → ∞ this mean of means approaches population mean
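A small simulation illustrates the behaviour described above. This is a sketch, assuming an exponential distribution with mean 1 as the non-normal source; the sample size of 50 and the 10,000 samples are arbitrary choices.

```r
set.seed(7)                                    # arbitrary seed

# Means of 10,000 samples, each of size 50, from an exponential distribution
means <- replicate(10000, mean(rexp(50, rate = 1)))

mean(means)    # mean of the means, close to the population mean of 1
sd(means)      # spread of the means, close to 1 / sqrt(50)

# Under normality, about 68% of the means fall within one standard deviation
within_1sd <- mean(abs(means - mean(means)) < sd(means))
within_1sd

# hist(means, breaks = 50) would show the roughly bell-shaped histogram
```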
SLIDE 61
Establishing Confidence Intervals
Conventional Approach:
- Sample multiple times with or without replacement
- Sample once; make strong assumptions about population
Bootstrap Approach:
- Sample once; resample that sample with replacement
The Central Limit Theorem only applies to a group of means, so drawing multiple samples is key
SLIDE 62
Establishing Confidence Intervals
Conventional Approach:
- Sample multiple times with or without replacement
- Sample once; make strong assumptions about population
Bootstrap Approach:
- Sample once; resample that sample with replacement
Not a very realistic approach in the real world
SLIDE 64
Establishing Confidence Intervals
Conventional Approach:
- Sample multiple times with or without replacement
- Sample once; make strong assumptions about population
Bootstrap Approach:
- Sample once; resample that sample with replacement
Instead, modelers choose to work only with data whose distributions are known
SLIDE 65
Establishing Confidence Intervals
Conventional Approach:
- Sample multiple times with or without replacement
- Sample once; make strong assumptions about population
Bootstrap Approach:
- Sample once; resample that sample with replacement
For normally distributed data we can often work with just one sample to estimate the mean
SLIDE 66
Confidence Intervals from Normal Data
[Figure: Population → simple random sample → sample mean]
Calculate mean of sample
Compute confidence intervals analytically
SLIDE 68
Demo
The central limit theorem
SLIDE 69
Demo
Observing the central limit theorem on a real dataset
SLIDE 70
Drawbacks of Conventional Methods
SLIDE 71
Drawbacks of Conventional Methods
Make strong assumptions about the distribution of data
Use analytical formulae to estimate statistics based on data distributions
An analytical formula may not exist for certain combinations of statistic and distribution
SLIDE 72
Drawbacks of Conventional Methods
Need to draw a large number of samples from the population
Estimate statistics based on sampling distribution
May not be practical or realistic
SLIDE 73
Estimating Population Statistic
Conventional Approach: Sample population; calculate sample statistic
Bootstrap Approach: Sample once; resample that sample with replacement
SLIDE 74
Establishing Confidence Intervals
Conventional Approach:
- Sample multiple times with or without replacement (non-parametric method)
- Sample once; make strong assumptions about population (parametric method)
Bootstrap Approach:
- Sample once; resample that sample with replacement (non-parametric method)
SLIDE 77
The basic Bootstrap method is non-parametric; however, parametric variants exist too
SLIDE 78
The Bootstrap Method
SLIDE 79
Conventional Methods
[Figure: Population → Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]
SLIDE 80
Confidence Intervals from Non-normal Data
[Figure: Population → sample values → sample means, with the 2.5% and 97.5% percentiles of the sample means marked]
Calculate mean of each sample
Repeat multiple times
SLIDE 81
Bootstrap Method
[Figure: Population → Sample 1]
Draw just one sample from the population
SLIDE 83
The Bootstrap Sample
[Figure: Population → Bootstrap Sample]
Treat that one sample as if it were the population
SLIDE 84
Bootstrap Method
[Figure: Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]
Draw multiple samples from the one sample with replacement
SLIDE 85
Bootstrap Method
Each of these samples is sometimes called a Bootstrap Replication
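A single bootstrap replication can be drawn with base R's `sample`. This is a minimal sketch; the eight values standing in for the bootstrap sample are made up.

```r
set.seed(10)                                     # arbitrary seed

# The one sample drawn from the population (illustrative values only)
boot_sample <- c(12, 15, 9, 20, 14, 11, 17, 13)

# One bootstrap replication: same size as the sample, drawn with replacement
replication <- sample(boot_sample, size = length(boot_sample), replace = TRUE)

replication
table(replication)   # some values typically appear more than once, others not at all
```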
SLIDE 86
Estimate Statistics using the Bootstrap Method
[Figure: Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]
With each bootstrap replication, calculate the statistic, e.g. mean
SLIDE 87
Estimate Statistics using the Bootstrap Method
Each estimate from a bootstrap replication is called a bootstrap realization of the statistic
SLIDE 88
Confidence Intervals using the Bootstrap Method
Calculate confidence intervals using the bootstrap distribution of the statistic
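The replication and realization steps can be sketched in base R as follows. The data are simulated, and the sample size of 60 and the 2000 replications are arbitrary assumptions rather than course settings.

```r
set.seed(2024)                                   # arbitrary seed

# Hypothetical single sample from an unknown, skewed population
boot_sample <- rexp(60, rate = 1 / 50)

# Each replication resamples the sample with replacement;
# each mean is one bootstrap realization of the statistic
boot_means <- replicate(2000, mean(sample(boot_sample, replace = TRUE)))

mean(boot_means)                 # centre of the bootstrap distribution
boot_se <- sd(boot_means)        # bootstrap standard error of the mean
boot_se

# Analytical standard error, shown only for comparison
sd(boot_sample) / sqrt(length(boot_sample))
```

For the mean the two standard errors should agree closely. The bootstrap version becomes useful for statistics that have no such formula.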
SLIDE 89
Sampling with replacement is essential
Otherwise each Bootstrap Replication will merely reproduce the Bootstrap Sample
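A short check of this point in base R, using made-up values: resampling without replacement at full size only shuffles the sample, so every replication has the same mean.

```r
set.seed(3)                                      # arbitrary seed
boot_sample <- c(12, 15, 9, 20, 14, 11, 17, 13)  # illustrative values only

# Without replacement: each "replication" is a permutation of the sample
means_without <- replicate(5, mean(sample(boot_sample, replace = FALSE)))

# With replacement: replications differ, so their means vary
means_with <- replicate(5, mean(sample(boot_sample, replace = TRUE)))

means_without   # identical to mean(boot_sample) every time
means_with
```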
SLIDE 90
Sampling with Replacement
Reusing the same data multiple times
“Bootstrapping” comes from the phrase “pulling yourself up by your bootstraps”
Has empirically been shown to produce meaningful results
SLIDE 91
Sampling with Replacement
Bootstrapping does not create new data
Creates the samples that could have been drawn from the original population
Assumes that the bootstrap sample accurately represents the population
SLIDE 92
The Bootstrap Method seems like cheating, but it is both theoretically sound and very robust
SLIDE 93
The Bootstrap Method and Confidence Intervals
SLIDE 94
Confidence Intervals with the Bootstrap Method
[Figure: Bootstrap Sample (treated as Population) → sample values with replacement → sample means, with the 2.5% and 97.5% percentiles of the sample means marked]
Calculate mean of each sample
Repeat multiple times
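The percentile step in the figure can be sketched in base R. The data are simulated and the number of replications (5000) is an arbitrary assumption.

```r
set.seed(99)                                     # arbitrary seed

# Hypothetical bootstrap sample, treated as if it were the population
boot_sample <- rexp(80, rate = 1 / 50)

# Resample with replacement, calculate the mean each time, repeat many times
boot_means <- replicate(5000, mean(sample(boot_sample, replace = TRUE)))

# 95% percentile interval: the 2.5% and 97.5% percentiles of the bootstrap means
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

Swapping `mean` for another function such as `median` or `sd` gives an interval for that statistic with no other change to the procedure.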
SLIDE 99
The Bootstrap Method
Conventional Approach:
- Sample population just once if no confidence intervals needed
- No need to re-sample for confidence intervals for common use-cases
- Re-sample population if confidence intervals needed for complex cases
Bootstrap Method:
- Sample population just once under all circumstances
- Re-sample bootstrap sample with replacement under all circumstances
- No change in procedure, works equally well for common and complex cases
SLIDE 100
The Bootstrap Method
Great for
- Arbitrary population (unknown distribution)
- Arbitrary statistics (not commonly studied for arbitrary population)
- Confidence interval around arbitrary statistics
SLIDE 101
The Bootstrap Method
Tends to systematically under-estimate variances
Various measures to mitigate this bias
- Compute correction based on difference between bootstrap and sample estimate
- Add back to each bootstrap value
- “Balanced Bootstrap”
Performs poorly for highly skewed data
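One literal reading of the first two correction steps is sketched below in base R. It is an illustration under stated assumptions, not necessarily the procedure used later in the course. The data are simulated, and the names `theta_hat`, `correction`, and `corrected` are hypothetical.

```r
set.seed(8)                                      # arbitrary seed
x <- rnorm(30, mean = 100, sd = 15)              # hypothetical sample

theta_hat <- var(x)                              # sample estimate of the variance

# Bootstrap realizations of the variance
boot_vars <- replicate(5000, var(sample(x, replace = TRUE)))

# Correction: difference between the sample estimate and the bootstrap average
correction <- theta_hat - mean(boot_vars)
correction                                       # positive when the bootstrap under-estimates

# Add the correction back to each bootstrap value
corrected <- boot_vars + correction
mean(corrected)                                  # now centred on the sample estimate
```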
SLIDE 102
The Bootstrap Method
Can be used to compute just about any statistic
From just about any data
However, most widely used to calculate
- Confidence intervals
- Standard errors
- Of complex, hard-to-estimate statistics
SLIDE 103
Main use-case of the Bootstrap Method: Calculate confidence interval around a complex statistic
SLIDE 104
The Bootstrap Method
Computing confidence intervals around the mean of a normal distribution
- No need for the bootstrap; the parametric method is simpler
Computing confidence intervals around the R-squared of a regression
- Bootstrap method is simple, robust, and effective
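The R-squared case can be sketched in base R by resampling whole rows (case resampling). The choice of the built-in `mtcars` dataset, the model `mpg ~ wt`, and the 1000 replications are assumptions made for illustration only.

```r
set.seed(11)                                     # arbitrary seed
n <- nrow(mtcars)

# Hypothetical helper: R-squared of a simple regression fitted to data frame d
r2 <- function(d) summary(lm(mpg ~ wt, data = d))$r.squared

# Resample rows with replacement and refit the model on each replication
boot_r2 <- replicate(1000, {
  idx <- sample(n, replace = TRUE)
  r2(mtcars[idx, ])
})

r2(mtcars)                                       # R-squared on the original sample
ci <- quantile(boot_r2, probs = c(0.025, 0.975)) # 95% percentile interval
ci
```

No closed-form interval is needed here, which is the point the slide makes.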
SLIDE 105
Types of Bootstrap Confidence Intervals
Basic bootstrap
Percentile bootstrap
Studentized bootstrap
Bias-corrected bootstrap
Accelerated bootstrap
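The first two interval types can be computed by hand from the same bootstrap distribution. This is a base R sketch with simulated data and the median as the statistic; the variable names are hypothetical. The remaining types need extra machinery, and the `boot` package's `boot.ci` function is one place to find them.

```r
set.seed(5)                                      # arbitrary seed
x <- rexp(80, rate = 1 / 50)                     # hypothetical sample
theta_hat <- median(x)                           # statistic on the original sample

boot_medians <- replicate(4000, median(sample(x, replace = TRUE)))
q <- quantile(boot_medians, probs = c(0.025, 0.975), names = FALSE)

# Percentile interval: the bootstrap percentiles themselves
percentile_ci <- c(lower = q[1], upper = q[2])

# Basic interval: the percentiles reflected around the sample estimate
basic_ci <- c(lower = 2 * theta_hat - q[2], upper = 2 * theta_hat - q[1])

rbind(percentile = percentile_ci, basic = basic_ci)
```

Both intervals have the same width and differ only in how they sit around the sample estimate.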
SLIDE 106
Summary
Estimating statistics and calculating confidence intervals
The Central Limit Theorem
Conventional methods vs. bootstrap methods
Advantages of bootstrapping techniques
SLIDE 107
Up Next: Implementing Bootstrap Methods for Summary Statistics