Implementing Bootstrap Methods in R: Getting Started with Bootstrapping in R



slide-1
SLIDE 1

Implementing Bootstrap Methods in R

GETTING STARTED WITH BOOTSTRAPPING IN R

Janani Ravi

CO-FOUNDER, LOONYCORN

www.loonycorn.com

slide-2
SLIDE 2

Overview

  • Estimating statistics and calculating confidence intervals
  • The Central Limit Theorem
  • Conventional methods vs. bootstrap methods
  • Advantages of bootstrapping techniques

slide-3
SLIDE 3

Prerequisites and Course Outline

slide-4
SLIDE 4

Prerequisites

  • Exposure to statistics at the level of mean, median, and standard deviation
  • Familiarity with probability distributions
  • Familiarity with regression models
  • Some exposure to R programming

slide-5
SLIDE 5

Prerequisites

R Programming Fundamentals

slide-6
SLIDE 6

Course Outline

Introducing bootstrap methods

  • Benefits and limitations

Bootstrapping for summary statistics

  • Non-parametric bootstrapping
  • Bayesian bootstrapping
  • Smoothed bootstrapping

Bootstrapping for regression models

  • Case resampling
  • Residual resampling

slide-7
SLIDE 7

Sample Statistics and Confidence Intervals

slide-8
SLIDE 8

Two Questions

  • What is the average height of an American male?
  • How confident are you of your answer?

slide-9
SLIDE 9

Answering Two Questions

  • Take a sample from the population; estimate the mean
  • Calculate confidence intervals around the estimate

slide-10
SLIDE 10

Generalizing to Any Statistic

  • What is the _____ of some population?
  • How confident are you of your answer?

slide-11
SLIDE 11

Generalizing to Any Statistic

  • Take a sample from the population; estimate the statistic
  • Calculate confidence intervals around the estimate
  • You need answers to the same two questions

slide-12
SLIDE 12

Example Statistics

  • Mean, mode, median, standard deviation
  • Correlations, covariances
  • Regression coefficients, R-squared values
  • Proportions, odds ratios

slide-13
SLIDE 13

Estimating Population Statistic

Conventional Approach
  • Sample population once; calculate sample statistic

Bootstrap Approach
  • Sample once; resample that sample with replacement

slide-14
SLIDE 14

Establishing Confidence Intervals Around Estimate

  • Once the estimate has been obtained from the sample, we need to answer the second question
  • Need to establish confidence intervals around the estimate

slide-15
SLIDE 15

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

slide-16
SLIDE 16

Sample Mean and Confidence Intervals for Normally Distributed Data

slide-17
SLIDE 17

Estimating Population Statistic

Conventional Approach
  • Sample population once; calculate sample statistic

Bootstrap Approach
  • Sample once; resample that sample with replacement

Estimate the mean

slide-18
SLIDE 18

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

Assume the population is normally distributed

slide-19
SLIDE 19

Estimating Population Mean

  • What is the average height/weight/income of the population?
  • Common question in science, business, finance
  • Need to estimate the mean value of some property of the population
  • Assume the population is normally distributed

slide-20
SLIDE 20

Normal Distribution

Values close to the mean are more likely than values far away from the mean

slide-21
SLIDE 21

Draw Sample from Population

Population: All the data out there in the universe
Sample: A subset, hopefully representative, of the population

slide-22
SLIDE 22

Mean and Variance

Sample values: $x_1, x_2, \ldots, x_n$

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

$$\text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

  • These statistics only apply to the sample of data, and so are known as sample statistics
  • The corresponding figures for all possible data points out there are called population statistics
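A minimal R sketch of the two formulas on this slide, assuming a small made-up vector of heights (the values and the names `x`, `x_bar`, and `s2` are illustrative, not from the course). It computes the sample mean and the n - 1 variance by hand, next to R's built-in `mean()` and `var()`.

```r
# Hypothetical sample of heights in cm (illustrative values only).
x <- c(172, 168, 181, 175, 169, 178, 174)
n <- length(x)

# Sample mean: sum of the values divided by the sample size.
x_bar <- sum(x) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
s2 <- sum((x - x_bar)^2) / (n - 1)

# The built-in functions use the same definitions.
builtin_mean <- mean(x)
builtin_var <- var(x)
```

Note that `var()` divides by n - 1, so it matches the sample-variance formula rather than the population version.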

slide-23
SLIDE 23

From Sample to Population

Sample Mean

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Population Mean

$$\mu = \,?$$

slide-24
SLIDE 24

Estimating Population Mean

  • Aim: Estimate a statistical property (mean) of the population
  • Will need to do so from a sample
  • Use properties of the sample to estimate the property of the population

slide-25
SLIDE 25

Sampling Distribution

  • Tricky part is going from properties of the sample to the property of the population
  • Can't be completely sure of the population property
  • Can, however, describe the uncertainty in the estimate with a probability distribution
  • This distribution, estimated from the sample alone, is the Sampling Distribution

slide-26
SLIDE 26

Sampling Distribution

Probability distribution of a sample statistic (e.g. the sample mean) across repeated samples, used to express uncertainty about the corresponding population statistic (e.g. the population mean).

slide-27
SLIDE 27

From Sample to Population

Sample Mean

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Population Mean

$$\mu = \,?$$

slide-28
SLIDE 28

From Sample to Population

Sample Mean

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Population Mean

slide-29
SLIDE 29

Sampling Distribution

Sample Mean

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

Population Mean

slide-30
SLIDE 30

Estimating Population Mean

  • Turns out, x̄ is the best estimate of μ (Law of Large Numbers)
  • The sample mean is the best unbiased estimator of the population mean
  • Even so, how sure are we of our estimate?
  • Confidence intervals help answer this question

slide-31
SLIDE 31

Confidence Intervals

“We can be 99% confident that the average is between ___ and ____”

slide-32
SLIDE 32

Sampling Distribution

The sample mean x̄ has a distribution called the sampling distribution
For a normally distributed population this is a normal distribution

  • Mean = μ, estimated by the sample mean x̄
  • Variance ≈ Sample variance / n
  • Std dev. ≈ Sample std dev. / sqrt(n)

slide-33
SLIDE 33

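68% Confidence That μ Is Within 1σ of x̄

[Figure: sampling distribution centered on x̄, central 68% marked]

slide-34
SLIDE 34

68% Confidence That μ Is Within 1σ of x̄

[Figure: central 68% of the sampling distribution, spanning x̄ - 1·s/√n to x̄ + 1·s/√n]

slide-35
SLIDE 35

68% Confidence That μ Is Within 1σ of x̄

[Figure: central 68% of the sampling distribution, spanning x̄ - 1·s/√n to x̄ + 1·s/√n]

We can state with 68% confidence that the population mean μ lies in the range

$$\bar{x} - 1 \cdot \frac{s}{\sqrt{n}} \quad \text{to} \quad \bar{x} + 1 \cdot \frac{s}{\sqrt{n}}$$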

slide-36
SLIDE 36

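99% Confidence That μ Is Within 2.576σ of x̄

[Figure: sampling distribution centered on x̄, central 99% marked]

slide-37
SLIDE 37

99% Confidence That μ Is Within 2.576σ of x̄

[Figure: central 99% of the sampling distribution, spanning x̄ - 2.576·s/√n to x̄ + 2.576·s/√n]

slide-38
SLIDE 38

99% Confidence That μ Is Within 2.576σ of x̄

[Figure: central 99% of the sampling distribution, spanning x̄ - 2.576·s/√n to x̄ + 2.576·s/√n]

We can state with 99% confidence that the population mean μ lies in the range

$$\bar{x} - 2.576\,\frac{s}{\sqrt{n}} \quad \text{to} \quad \bar{x} + 2.576\,\frac{s}{\sqrt{n}}$$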

slide-39
SLIDE 39

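(100-p)% Confidence That μ Is Within Zσ of x̄

[Figure: central (100-p)% of the sampling distribution, spanning x̄ - Z·s/√n to x̄ + Z·s/√n]

slide-40
SLIDE 40

(100-p)% Confidence That μ Is Within Zσ of x̄

[Figure: central (100-p)% of the sampling distribution, spanning x̄ - Z·s/√n to x̄ + Z·s/√n]

We can state with (100-p)% confidence that the population mean μ lies in the range

$$\bar{x} - Z\,\frac{s}{\sqrt{n}} \quad \text{to} \quad \bar{x} + Z\,\frac{s}{\sqrt{n}}$$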

slide-41
SLIDE 41

Sampling Distribution

  • p is the level of significance
  • Z is the number of standard deviations from the mean corresponding to p
  • s and x̄ are calculated from the sample

slide-42
SLIDE 42

Sampling Distribution

Confidence Level    Z
80%                 1.282
85%                 1.440
90%                 1.645
95%                 1.960
99%                 2.576
99.5%               2.807
99.9%               3.291
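As a hedged illustration, the Z values in this table can be reproduced with R's `qnorm()`, and the same multiplier gives the interval x̄ ± Z·s/√n. The helper name `mean_ci` and the simulated `heights` data are assumptions made for this sketch, not part of the course material.

```r
# Two-sided Z multiplier for each confidence level in the table.
conf_levels <- c(0.80, 0.85, 0.90, 0.95, 0.99, 0.995, 0.999)
z_values <- qnorm(1 - (1 - conf_levels) / 2)
z_table <- data.frame(confidence = conf_levels, z = round(z_values, 3))

# Hypothetical helper: normal-theory confidence interval for a mean.
mean_ci <- function(x, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  se <- sd(x) / sqrt(length(x))      # standard error: s / sqrt(n)
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Simulated, normally distributed heights (illustrative only).
set.seed(1)
heights <- rnorm(50, mean = 175, sd = 7)
ci95 <- mean_ci(heights, conf = 0.95)
ci99 <- mean_ci(heights, conf = 0.99)
```

The 99% interval is wider than the 95% interval because it uses the larger multiplier with the same standard error.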

slide-43
SLIDE 43

Sampling Distribution

  • Range is centered around the sample mean
  • Extends symmetrically on both sides
  • The greater the range, the greater our confidence that the population mean lies within it

slide-44
SLIDE 44

Sample Mean and Confidence Intervals for Any Data

slide-45
SLIDE 45

Estimating Population Statistic

Conventional Approach
  • Sample population once; calculate sample statistic

Bootstrap Approach
  • Sample once; resample that sample with replacement

Estimate the mean

slide-46
SLIDE 46

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

Make no assumptions about the population distribution, but draw a large number of samples from the population

slide-47
SLIDE 47

Sampling Distribution of the Mean

  • Tricky part is going from properties of samples to the property of the population
  • Can’t be completely sure of the population property
  • Need to know the probability distribution of the population property
  • Use the Sampling Distribution, i.e. the distribution of estimates from the samples

slide-48
SLIDE 48

Sampling Distribution of the Mean

[Diagram: Population → Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]

Draw many samples, calculate the mean of each, plot a histogram of these means

slide-49
SLIDE 49

Confidence Intervals from Non-normal Data

[Diagram: Population → Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]

Using the sampling distribution of the mean, we can calculate confidence intervals for our estimate

slide-50
SLIDE 50

Confidence Intervals from Non-normal Data

[Diagram: Population]

slide-51
SLIDE 51

Confidence Intervals from Non-normal Data

[Diagram: Population → sample values]

slide-52
SLIDE 52

Confidence Intervals from Non-normal Data

[Diagram: Population → sample values]

  • Calculate mean of each sample
  • Repeat multiple times

slide-53
SLIDE 53

Confidence Intervals from Non-normal Data

[Diagram: Population → sample values → sample means]

  • Calculate mean of each sample
  • Repeat multiple times

slide-54
SLIDE 54

Confidence Intervals from Non-normal Data

[Diagram: Population → sample values → sample means, with the 2.5th and 97.5th percentiles marked]

  • Calculate mean of each sample
  • Repeat multiple times
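The repeated-sampling procedure on this slide can be sketched in R as follows. This is only an illustration under stated assumptions: the "population" is simulated from a skewed exponential distribution, and the sample size (50) and number of repetitions (2000) are arbitrary choices, not values from the course.

```r
set.seed(42)

# Simulated non-normal population (assumption: exponential, mean about 10).
population <- rexp(100000, rate = 1 / 10)

# Draw many samples from the population and record the mean of each.
sample_means <- replicate(2000, mean(sample(population, size = 50)))

# The 2.5th and 97.5th percentiles of the sample means bound the middle 95%.
interval <- quantile(sample_means, probs = c(0.025, 0.975))

# hist(sample_means) would show the sampling distribution of the mean.
```

This approach requires access to the whole population for every new sample, which is what makes it hard to apply outside of simulations.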

slide-55
SLIDE 55

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

The Central Limit Theorem can be used to estimate the mean of even non-normally distributed data

slide-56
SLIDE 56

Central Limit Theorem

The distribution of the means of samples of size N drawn from any distribution with finite variance (even a non-normal distribution) approaches a normal distribution as N approaches infinity.

slide-60
SLIDE 60

Implication of the Central Limit Theorem

  • Mean of a non-normal population can be estimated easily by sampling
  • Draw N samples, compute the mean of each sample
  • Compute the mean of these means
  • As N → ∞ this mean of means approaches the population mean
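A small simulation can make the Central Limit Theorem concrete. The sketch below assumes an exponential population with mean 1 and uses a hand-written `skewness` helper (both are choices made for this illustration, not part of the course). As the sample size grows, the skewness of the sample means shrinks toward 0, the value for a normal distribution, and the mean of the means stays close to the population mean.

```r
set.seed(7)

# Hypothetical helper: moment-based skewness (0 for a symmetric distribution).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Means of `reps` samples, each of size n, from an exponential population.
means_of <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

# Skewness of the sample means for increasing sample sizes.
sample_sizes <- c(2, 20, 200)
skew_by_n <- sapply(sample_sizes, function(n) skewness(means_of(n)))

# The mean of the sample means estimates the population mean (1 here).
mean_of_means <- mean(means_of(200))
```

Plotting `hist(means_of(2))` next to `hist(means_of(200))` shows the same effect visually.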

slide-61
SLIDE 61

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

The Central Limit Theorem only applies to a group of means, so computing multiple samples is key

slide-62
SLIDE 62

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

Not a very realistic approach in the real world

slide-63
SLIDE 63

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

slide-64
SLIDE 64

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

Instead, modelers choose only to work with data whose distributions are known

slide-65
SLIDE 65

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

For normally distributed data we can often work with just one sample to estimate the mean

slide-66
SLIDE 66

Confidence Intervals from Normal Data

[Diagram: Population → simple random sample]

slide-67
SLIDE 67

Confidence Intervals from Normal Data

[Diagram: Population → simple random sample]

  • Calculate mean of sample
  • Compute confidence intervals analytically

slide-68
SLIDE 68

Demo

The central limit theorem

slide-69
SLIDE 69

Demo

Observing the central limit theorem on a real dataset

slide-70
SLIDE 70

Drawbacks of Conventional Methods

slide-71
SLIDE 71

Drawbacks of Conventional Methods

  • Make strong assumptions about the distribution of the data
  • Use analytical formulae to estimate statistics based on data distributions
  • The analytical formula may not exist for certain combinations

slide-72
SLIDE 72

Drawbacks of Conventional Methods

  • Need to draw a large number of samples from the population
  • Estimate statistics based on the sampling distribution
  • May not be practical or realistic

slide-73
SLIDE 73

Estimating Population Statistic

Conventional Approach
  • Sample population once; calculate sample statistic

Bootstrap Approach
  • Sample once; resample that sample with replacement

slide-74
SLIDE 74

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

slide-75
SLIDE 75

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

Parametric Method

slide-76
SLIDE 76

Establishing Confidence Intervals

Conventional Approach
  • Sample multiple times with or without replacement
  • Sample once; make strong assumptions about population

Bootstrap Approach
  • Sample once; resample that sample with replacement

Non-parametric Methods

slide-77
SLIDE 77

The basic Bootstrap method is non-parametric; however, parametric variants exist too

slide-78
SLIDE 78

The Bootstrap Method

slide-79
SLIDE 79

Conventional Methods

[Diagram: Population → Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]

slide-80
SLIDE 80

Confidence Intervals from Non-normal Data

[Diagram: Population → sample values → sample means, with the 2.5th and 97.5th percentiles marked]

  • Calculate mean of each sample
  • Repeat multiple times

slide-81
SLIDE 81

Bootstrap Method

[Diagram: Population → Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]

Draw just one sample from the population

slide-82
SLIDE 82

Bootstrap Method

[Diagram: Population → Sample 1]

Draw just one sample from the population

slide-83
SLIDE 83

The Bootstrap Sample

[Diagram: Population → Bootstrap Sample]

Treat that one sample as if it were the population

slide-84
SLIDE 84

Bootstrap Method

[Diagram: Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]

Draw multiple samples from the one sample with replacement

slide-85
SLIDE 85

Bootstrap Method

[Diagram: Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]

Each of these samples is sometimes called a Bootstrap Replication

slide-86
SLIDE 86

Estimate Statistics using the Bootstrap Method

[Diagram: Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]

With each bootstrap replication, calculate the statistic, e.g. the mean

slide-87
SLIDE 87

Estimate Statistics using the Bootstrap Method

[Diagram: Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]

Each estimate from a bootstrap replication is called a bootstrap realization of the statistic

slide-88
SLIDE 88

Confidence Intervals using the Bootstrap Method

[Diagram: Sample 1, Sample 2, Sample 3, Sample 4, …, Sample ∞]

Calculate confidence intervals using the bootstrap distribution of the statistic

slide-89
SLIDE 89

Sampling with replacement is essential. Otherwise each Bootstrap Replication (of the same size) will merely reproduce the Bootstrap Sample
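A short R sketch of why replacement matters, using a made-up six-value sample (the values and variable names are illustrative). Resampling all n values without replacement only reshuffles the sample, so every replication has the same mean; resampling with replacement produces replications whose means vary.

```r
set.seed(1)

# Hypothetical bootstrap sample (illustrative values only).
original <- c(3, 8, 1, 9, 4, 7)

# Without replacement: each replication is a permutation of the sample.
means_without <- replicate(1000, mean(sample(original, replace = FALSE)))

# With replacement: values can repeat or be left out, so the means vary.
means_with <- replicate(1000, mean(sample(original, replace = TRUE)))

spread_without <- max(means_without) - min(means_without)
spread_with <- max(means_with) - min(means_with)
```

Here `spread_without` is zero up to floating-point error, while `spread_with` is clearly positive; only the latter gives a distribution from which an interval can be read.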

slide-90
SLIDE 90

Sampling with Replacement

  • Reusing the same data multiple times
  • “Bootstrapping” comes from the phrase “pulling yourself up by your own bootstraps”
  • Has empirically been shown to produce meaningful results

slide-91
SLIDE 91

Sampling with Replacement

  • Bootstrapping does not create new data
  • Creates the samples that could have been drawn from the original population
  • Assumes that the bootstrap sample accurately represents the population

slide-92
SLIDE 92

The Bootstrap Method seems like cheating, but it is both theoretically sound and very robust

slide-93
SLIDE 93

The Bootstrap Method and Confidence Intervals

slide-94
SLIDE 94

Confidence Intervals with the Bootstrap Method

[Diagram: Bootstrap Sample (treated as Population)]

slide-95
SLIDE 95

Confidence Intervals with the Bootstrap Method

[Diagram: Bootstrap Sample (treated as Population) → sample values with replacement]

slide-96
SLIDE 96

Confidence Intervals with the Bootstrap Method

[Diagram: Bootstrap Sample (treated as Population) → sample values with replacement]

  • Calculate mean of each sample
  • Repeat multiple times

slide-97
SLIDE 97

Confidence Intervals with the Bootstrap Method

[Diagram: Bootstrap Sample (treated as Population) → sample values with replacement → sample means]

  • Calculate mean of each sample
  • Repeat multiple times

slide-98
SLIDE 98

Confidence Intervals with the Bootstrap Method

[Diagram: Bootstrap Sample (treated as Population) → sample values with replacement → sample means, with the 2.5th and 97.5th percentiles marked]

  • Calculate mean of each sample
  • Repeat multiple times
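The procedure on this slide can be sketched in base R as a percentile bootstrap for the mean. The single observed sample is simulated here (an assumption made so the example is self-contained), and the number of replications (5000) is an arbitrary choice.

```r
set.seed(123)

# The one sample we have (simulated, skewed data for illustration).
original_sample <- rexp(40, rate = 1 / 10)

# Resample the sample with replacement and record each replication's mean.
n_reps <- 5000
boot_means <- replicate(
  n_reps,
  mean(sample(original_sample, replace = TRUE))
)

# 95% percentile interval from the bootstrap distribution of the mean.
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

# Bootstrap estimate of the standard error of the mean.
boot_se <- sd(boot_means)
```

Unlike the repeated-sampling approach, only `original_sample` is ever drawn from the population; all further sampling happens inside it.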

slide-99
SLIDE 99

The Bootstrap Method

Conventional Approach
  • Sample population just once if no confidence intervals needed
  • No need to re-sample for confidence intervals for common use-cases
  • Re-sample population if confidence intervals needed for complex cases

Bootstrap Method
  • Sample population just once under all circumstances
  • Re-sample bootstrap sample with replacement under all circumstances
  • No change in procedure, works equally well for common and complex cases

slide-100
SLIDE 100

The Bootstrap Method

Great for

  • Arbitrary population (unknown distribution)
  • Arbitrary statistics (not commonly studied for arbitrary population)
  • Confidence interval around arbitrary statistics

slide-101
SLIDE 101

The Bootstrap Method

Tends to systematically under-estimate variances

Various measures to mitigate this bias

  • Compute correction based on difference between bootstrap and sample estimate
  • Add back to each bootstrap value
  • “Balanced Bootstrap”

Performs poorly for highly skewed data
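One simple way to read the correction described on this slide is sketched below; it is an assumption about the intended procedure, not the course's own code. The statistic is the plug-in variance (dividing by n), which the bootstrap tends to under-estimate. The bias is estimated as the difference between the average bootstrap value and the sample estimate, and then removed.

```r
set.seed(5)

# Small simulated sample (illustrative only).
x <- rnorm(15, mean = 0, sd = 2)

# Plug-in variance: divides by n rather than n - 1.
plug_in_var <- function(v) mean((v - mean(v))^2)

observed <- plug_in_var(x)
boot_var <- replicate(5000, plug_in_var(sample(x, replace = TRUE)))

# Estimated bias: average bootstrap value minus the sample estimate.
bias <- mean(boot_var) - observed        # typically negative here

# Shift the estimate and each bootstrap value by the estimated bias.
corrected_estimate <- observed - bias
corrected_boot <- boot_var - bias
```

After the shift, the bootstrap values are centered on the sample estimate instead of sitting below it.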

slide-102
SLIDE 102

The Bootstrap Method

Can be used to compute just about any statistic from just about any data

However, most widely used to calculate, for complex, hard-to-estimate statistics,

  • Confidence intervals
  • Standard errors

slide-103
SLIDE 103

Main use-case of the Bootstrap Method: Calculate confidence interval around a complex statistic

slide-104
SLIDE 104

The Bootstrap Method

Computing confidence intervals around the mean of a normal distribution

  • No need for bootstrap; the parametric method is simpler

Computing confidence intervals around the R-squared of a regression

  • Bootstrap method is simple, robust, and effective
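As a hedged example of the second case, the sketch below bootstraps the R-squared of a simple regression by resampling rows (case resampling). The built-in `mtcars` dataset and the model `mpg ~ wt` are choices made for this illustration only, and `r_squared` is a hypothetical helper.

```r
set.seed(2024)

# Hypothetical helper: R-squared of a simple linear regression on data d.
r_squared <- function(d) summary(lm(mpg ~ wt, data = d))$r.squared

observed_r2 <- r_squared(mtcars)

# Case resampling: draw rows with replacement, refit, record R-squared.
boot_r2 <- replicate(2000, {
  rows <- sample(nrow(mtcars), replace = TRUE)
  r_squared(mtcars[rows, ])
})

# 95% percentile interval for R-squared.
r2_ci <- quantile(boot_r2, probs = c(0.025, 0.975))
```

No analytical formula for the sampling distribution of R-squared is needed; the interval comes directly from the bootstrap realizations.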

slide-105
SLIDE 105

Types of Bootstrap Confidence Intervals

  • Basic bootstrap
  • Percentile bootstrap
  • Studentized bootstrap
  • Bias-corrected bootstrap
  • Accelerated bootstrap
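To illustrate how two of these interval types differ, the sketch below computes a percentile interval and a basic interval by hand for the median of a simulated sample. The data and variable names are assumptions for this example. The percentile interval reads the quantiles of the bootstrap distribution directly; the basic interval reflects them around the sample estimate. In practice the `boot` package's `boot.ci()` function reports several interval types.

```r
set.seed(99)

# Simulated skewed sample (illustrative only).
x <- rexp(60, rate = 1 / 5)
theta_hat <- median(x)

# Bootstrap realizations of the median.
boot_stat <- replicate(4000, median(sample(x, replace = TRUE)))
q <- quantile(boot_stat, probs = c(0.025, 0.975), names = FALSE)

# Percentile interval: quantiles of the bootstrap distribution.
percentile_ci <- c(lower = q[1], upper = q[2])

# Basic interval: quantiles reflected around the sample estimate.
basic_ci <- c(lower = 2 * theta_hat - q[2], upper = 2 * theta_hat - q[1])
```

Both intervals have the same width; they differ in position when the bootstrap distribution is not symmetric around the sample estimate.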

slide-106
SLIDE 106

Summary

  • Estimating statistics and calculating confidence intervals
  • The Central Limit Theorem
  • Conventional methods vs. bootstrap methods
  • Advantages of bootstrapping techniques

slide-107
SLIDE 107

Up Next: Implementing Bootstrap Methods for Summary Statistics