Simulation - part 2 | Stat 133 | Gaston Sanchez | Department of Statistics, UC Berkeley


slide-1
SLIDE 1

Simulation - part 2

Stat 133 Gaston Sanchez

Department of Statistics, UC Berkeley gastonsanchez.com github.com/gastonstat Course web: gastonsanchez.com/stat133

slide-2
SLIDE 2

Random Numbers in R

2

slide-3
SLIDE 3

Generating Random Numbers

Generation of random numbers is at the heart of many statistical methods.

3

slide-4
SLIDE 4

Use of Random Numbers

Some uses of random numbers:

◮ Sampling procedures
◮ Simulation studies of stochastic processes
◮ Evaluation of analytically intractable mathematical expressions
◮ Simulation of a population distribution by resampling from a given sample from that population
◮ More generally: Simulation, Monte Carlo, Resampling

slide-5
SLIDE 5

Random Samples

◮ Many statistical methods rely on random samples:
  – Sampling techniques
  – Design of experiments
  – Surveys
◮ Hence, we need a source of random numbers
◮ Before computers, statisticians used tables of random numbers
◮ Now we use computers to generate “random” numbers
◮ The random sampling required in most analyses is usually done by the computer

5

slide-6
SLIDE 6

Generating Random Numbers

◮ We cannot generate truly random numbers on a computer
◮ Instead, we generate pseudo-random numbers
◮ i.e. numbers that have the appearance of random numbers
◮ they seem to be randomly drawn from some known distribution
◮ Many methods have been proposed to generate pseudo-random numbers

6

slide-7
SLIDE 7

Generating Random Numbers

An important advantage of pseudo-random numbers is that, because they are deterministic, they can be reproduced (i.e. repeated).

7

slide-8
SLIDE 8

Multiple Recursion

◮ Generate a sequence of numbers in a manner that appears to be random
◮ Use a deterministic generator that yields numbers recursively (in a fixed sequence)
◮ The previous k numbers determine the next one:

  x_i = f(x_{i-1}, ..., x_{i-k})

8
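The recursion above can be sketched in R. The rule f (sum of the previous k values mod m) and the seed values below are hypothetical, chosen only to illustrate a generator in which the previous k numbers determine the next one:

```r
# Sketch: a recursive generator x_i = f(x_{i-1}, ..., x_{i-k}).
# f and the seed values are made up for illustration.
recur_gen <- function(n, seed = c(1, 2, 3), m = 64) {
  k <- length(seed)
  x <- c(seed, numeric(n))
  for (i in (k + 1):(k + n)) {
    # next value depends on the previous k values
    x[i] <- sum(x[(i - k):(i - 1)]) %% m
  }
  x[-(1:k)]   # drop the seed values
}

recur_gen(5)   # returns 6 11 20 37 4
```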

slide-9
SLIDE 9

Simple Congruential Generator

◮ Congruential generators were the first reasonable class of pseudo-random number generators
◮ The congruential method uses modular arithmetic to generate “random” numbers

9

slide-10
SLIDE 10

Ingredients

◮ An integer m
◮ An integer a such that a < m
◮ A starting integer x0 (a.k.a. seed)

slide-11
SLIDE 11

Simple Congruential Generator

The first number is obtained as:

  x1 = (a × x0) mod m

The rest of the pseudo-random numbers are generated as:

  x(n+1) = (a × xn) mod m

11

slide-12
SLIDE 12

Simple Congruential Generator

For example, if we take m = 64 and a = 3, then for x0 = 17 we have:

  x1 = (3 × 17) mod 64 = 51
  x2 = (3 × 51) mod 64 = 25
  x3 = (3 × 25) mod 64 = 11
  x4 = (3 × 11) mod 64 = 33
  x5 = (3 × 33) mod 64 = 35
  x6 = (3 × 35) mod 64 = 41
  ...

12

slide-13
SLIDE 13

Congruential algorithm

a <- 3; m <- 64; seed <- 17
x <- numeric(20)
x[1] <- (a * seed) %% m
for (i in 2:20) {
  x[i] <- (a * x[i-1]) %% m
}
x[1:16]
## [1] 51 25 11 33 35 41 59 49 19 57 43  1  3  9 27 17
x[17:20]
## [1] 51 25 11 33

13

slide-14
SLIDE 14

Congruential algorithm

cong <- function(n, a = 69069, m = 2^32, seed = 17) {
  x <- numeric(n)
  x[1] <- (a * seed) %% m
  for (i in 2:n) {
    x[i] <- (a * x[i-1]) %% m
  }
  x
}

y <- cong(20, a = 3, m = 64, seed = 17)

14

slide-15
SLIDE 15

Congruential algorithm

n <- length(y)
plot(y[1:(n-1)], y[2:n], pch = 19,
     xlab = 'current value', ylab = 'next value')

[Scatterplot of current value vs next value for the a = 3, m = 64 generator; values range from 10 to 60]

slide-16
SLIDE 16

cong(n, a = 69069, m = 2^32, seed = 17)

[Scatterplot of current value vs next value for a = 69069, m = 2^32; values range from 0 to about 4e+09]

16

slide-17
SLIDE 17

Ingredients

◮ m is the modulus (0 < m)
◮ a is the multiplier (0 < a < m)
◮ x0 is the seed (0 ≤ x0 < m)

The period length of a random number generator is at most m, and for some choices of a it is much less than that.

17
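The period can be measured directly by iterating until the sequence returns to its starting value. A small sketch, using the a = 3, m = 64 generator from the earlier slides (the function name is ours):

```r
# Sketch: count how many values the congruential generator
# produces before it cycles back to its first value
lcg_period <- function(a, m, seed) {
  first <- (a * seed) %% m
  x <- first
  period <- 1
  repeat {
    x <- (a * x) %% m
    if (x == first) break
    period <- period + 1
  }
  period
}

lcg_period(3, 64, 17)   # 16: the sequence 51, 25, 11, ... repeats well before m = 64
```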

slide-18
SLIDE 18

About the seed

◮ You can reproduce your simulation results by controlling the seed
◮ set.seed() allows you to do this
◮ Every time you perform simulation studies, indicate the random number generator and the seed that was used

18

slide-19
SLIDE 19

About the seed

# set the seed
set.seed(69069)
runif(3)  # call the uniform RNG
## [1] 0.1648855 0.9564664 0.3345479
runif(3)  # call runif() again
## [1] 0.01109596 0.18654873 0.94657805

# set the seed back to 69069
set.seed(69069)
runif(3)
## [1] 0.1648855 0.9564664 0.3345479

19

slide-20
SLIDE 20

Simple Congruential Generator

We can use a congruential algorithm to generate uniform numbers. We’ll describe one of the simplest methods for simulating independent uniform random variables on the interval [0, 1].

20

slide-21
SLIDE 21

Generating Uniform Numbers

The generator proceeds as follows:

  x1 = (a x0) mod m
  u1 = x1/m

u1 is the first pseudo-random number, taking some value between 0 and 1.

21

slide-22
SLIDE 22

Ingredients

The second pseudo-random number is obtained as:

  x2 = (a x1) mod m
  u2 = x2/m

u2 is another pseudo-random number.

22

slide-23
SLIDE 23

Generating Uniform Numbers

◮ If m and a are chosen properly, it is difficult to predict the value u2 given u1
◮ For most practical purposes, u2 is approximately independent of u1

23

slide-24
SLIDE 24

Simple Congruential Generator

For example, if we take m = 7 and a = 3, then for x0 = 2 we have:

  x1 = (3 × 2) mod 7 = 6,  u1 = 0.857
  x2 = (3 × 6) mod 7 = 4,  u2 = 0.571
  x3 = (3 × 4) mod 7 = 5,  u3 = 0.714
  x4 = (3 × 5) mod 7 = 1,  u4 = 0.143
  x5 = (3 × 1) mod 7 = 3,  u5 = 0.429
  x6 = (3 × 3) mod 7 = 2,  u6 = 0.286
  ...

24
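The m = 7 example can be reproduced with a small function. The function name is ours; the numbers follow the slide:

```r
# Sketch: uniform pseudo-random numbers via the congruential
# method; the x values are scaled by m to land in [0, 1)
unif_cong <- function(n, a = 3, m = 7, seed = 2) {
  x <- numeric(n)
  x[1] <- (a * seed) %% m
  for (i in 2:n) {
    x[i] <- (a * x[i - 1]) %% m
  }
  x / m
}

round(unif_cong(6), 3)
## [1] 0.857 0.571 0.714 0.143 0.429 0.286
```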

slide-25
SLIDE 25

Random Numbers in R

R uses a pseudo-random number generator:

◮ It starts with a seed and an algorithm (i.e. a function)
◮ The seed is plugged into the algorithm and a number is returned
◮ That number is then plugged into the algorithm and the next number is created
◮ The algorithm is such that the produced numbers behave like random numbers

25

slide-26
SLIDE 26

RNG functions in R

Function       Description
runif()        Uniform
rbinom()       Binomial
rmultinom()    Multinomial
rnbinom()      Negative binomial
rpois()        Poisson
rnorm()        Normal
rbeta()        Beta
rgamma()       Gamma
rchisq()       Chi-squared
rcauchy()      Cauchy

See more info: ?Distributions

26

slide-27
SLIDE 27

Random Number Functions

runif(n, min = 0, max = 1) samples from the uniform distribution on the interval (0, 1). The chance that the value drawn is:

◮ between 0 and 1/3 is 1/3
◮ between 1/3 and 1/2 is 1/6
◮ between 9/10 and 1 is 1/10
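Those interval chances can be checked empirically by drawing many values. A small sketch; the sample size and seed are arbitrary:

```r
# Sketch: empirical check of the interval probabilities
set.seed(133)            # arbitrary seed
u <- runif(1e5)
mean(u < 1/3)            # should be close to 1/3
mean(u > 1/3 & u < 1/2)  # should be close to 1/6
mean(u > 9/10)           # should be close to 1/10
```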

slide-28
SLIDE 28

Random Number Functions

rnorm(n, mean = 0, sd = 1) samples from the normal distribution with center = mean and spread = sd.

28

slide-29
SLIDE 29

Random Number Functions

rbinom(n, size, prob) samples from the binomial distribution with number of trials = size and chance of success = prob.

28

slide-30
SLIDE 30

More Simulations

29

slide-31
SLIDE 31

Simulation Probability examples

For instance

◮ Simulate flipping a coin
◮ Simulate rolling a die
◮ Simulate drawing a card from a deck
◮ Simulate a probability experiment with balls in an urn
◮ Simulate the “Monty Hall Problem”

slide-32
SLIDE 32

Flipping a Coin

31

slide-33
SLIDE 33

Simulating a Coin

How to simulate tossing a coin?

32

slide-34
SLIDE 34

Simulating a coin

One way to simulate a coin

coin <- c("heads", "tails")

33

slide-35
SLIDE 35

Simulating a coin

One way to simulate a coin

coin <- c("heads", "tails")

One way to simulate flipping a coin

sample(coin, size = 1)
## [1] "heads"

33

slide-36
SLIDE 36

Probability

Probability allows us to quantify statements about the chance of an event taking place.

For example, flip a fair coin:

◮ What’s the chance it lands heads?
◮ Flip it 4 times; what proportion of heads do you expect?
◮ Will you get exactly that proportion?
◮ What happens when you flip the coin 1000 times?

slide-37
SLIDE 37

Simulating a Coin

When you flip a coin

◮ it may land heads
◮ it may land tails
◮ with what probability does it land heads?
◮ If it is a fair coin: p = 0.5

slide-38
SLIDE 38

Simulating a Coin

Tossing a coin can be modeled with a random variable following a Bernoulli distribution:

◮ heads (X = 1) with probability p
◮ tails (X = 0) with probability q = 1 − p

The Bernoulli distribution is a special case of the Binomial distribution: B(1, p)

36

slide-39
SLIDE 39

Simulating tossing a coin

Tossing a coin simulated with a binomial distribution:

# binomial distribution generator
rbinom(n = 1, size = 1, prob = 0.5)
## [1] 0

37

slide-40
SLIDE 40

Flipping a Coin function

Function that simulates flipping a coin

# flipping coin function
coin <- function(prob = 0.5) {
  rbinom(n = 1, size = 1, prob = prob)
}

coin()
## [1] 1

38

slide-41
SLIDE 41

Flipping a Coin function

It’s better if we assign labels

# flipping coin function
coin <- function(prob = 0.5) {
  out <- rbinom(n = 1, size = 1, prob = prob)
  ifelse(out, "heads", "tails")
}

coin()
## [1] "tails"

39

slide-42
SLIDE 42

Flipping a Coin function

# 10 flips
for (i in 1:10) {
  print(coin())
}
## [1] "heads"
## [1] "tails"
## [1] "tails"
## [1] "heads"
## [1] "tails"
## [1] "tails"
## [1] "tails"
## [1] "tails"
## [1] "tails"
## [1] "heads"

40

slide-43
SLIDE 43

4 Flips

In 4 flips

◮ Possible outputs:
  – HHHH, THHH, HTHH, HHTH, HHHT, ...
◮ we can get 0, 1, 2, 3, 4 heads
◮ so the proportion of heads can be: 0, 0.25, 0.5, 0.75, 1
◮ we expect the proportion to be 0.5
◮ but a proportion of 0.25 is also possible

slide-44
SLIDE 44

4 Flips

◮ we can think of the proportion of Heads in 4 flips as a statistic because it summarizes data
◮ this proportion is a random quantity: it takes on 5 possible values, each with some probability
  – 0 → 1/16
  – 0.25 → 4/16
  – 0.50 → 6/16
  – 0.75 → 4/16
  – 1.0 → 1/16

42
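The five probabilities above can be obtained directly from the binomial density. A quick check with dbinom():

```r
# Sketch: exact probabilities for the number of heads
# in 4 flips of a fair coin, via the binomial density
probs <- dbinom(0:4, size = 4, prob = 0.5)
probs * 16   # in sixteenths: 1 4 6 4 1
```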

slide-45
SLIDE 45

Simulating flipping n coins

Function that simulates flipping a coin n times (i.e. flipping n coins):

# generic function
flip_coins <- function(n = 1, prob = 0.5) {
  out <- rbinom(n = n, size = 1, prob = prob)
  ifelse(out, "heads", "tails")
}

flip_coins(5)
## [1] "tails" "heads" "heads" "tails" "heads"

43

slide-46
SLIDE 46

Proportion of Heads

# number of heads
num_heads <- function(x) {
  sum(x == "heads")
}

# proportion of heads
prop_heads <- function(x) {
  num_heads(x) / length(x)
}

44

slide-47
SLIDE 47

1000 Flips

◮ when we flip the coin 1000 times, we can get many different possible proportions of Heads:
◮ 0, 0.001, 0.002, 0.003, ..., 0.999, 1.000
◮ It’s highly unlikely that we would get a proportion of 0. But how unlikely?
◮ what does the distribution of the proportion of heads in 1000 flips look like?

45

slide-48
SLIDE 48

1000 Flips

◮ With some probability theory and math tools we can figure this out
◮ But we can also get a good idea using simulation
◮ In our simulation we’ll assume that the chance of Heads is 0.5 (i.e. a fair coin)
◮ we can find out what the possible values for the proportion of heads in 1000 flips look like

46

slide-49
SLIDE 49

Flipping coins

set.seed(99900)
flips <- flip_coins(1000)
num_heads(flips)
## [1] 494
prop_heads(flips)
## [1] 0.494

47

slide-50
SLIDE 50

Flipping coins

set.seed(76547)
a_flips <- flip_coins(1000)
b_flips <- flip_coins(1000)
num_heads(a_flips)
## [1] 493
num_heads(b_flips)
## [1] 507

48

slide-51
SLIDE 51

1000 Flips 1000 times

times <- 1000
head_props <- numeric(times)
for (s in 1:times) {
  flips <- flip_coins(1000)
  head_props[s] <- prop_heads(flips)
}

49

slide-52
SLIDE 52

Empirical distribution of 1000 flips

[Histogram of head_props: roughly bell-shaped, centered at 0.50, spanning about 0.44 to 0.56]

50

slide-53
SLIDE 53

Flipping coins

Experiment: flipping a coin 100 times and counting the number of heads. You flip a coin 100 times and you get 65 heads. Is it a fair coin?

51

slide-54
SLIDE 54

Flipping coins

Experiment: flipping a coin 100 times and counting the number of heads. You flip a coin 100 times and you get 65 heads. Is it a fair coin? We could perform a hypothesis test, or we could use resampling.

51

slide-55
SLIDE 55

Flipping coins

# repeat the experiment 10000 times
times <- 10000
head_times <- numeric(times)
for (s in 1:times) {
  flips <- flip_coins(100)
  head_times[s] <- num_heads(flips)
}

sum(head_times >= 65)
## [1] 18
sum(head_times >= 65) / times
## [1] 0.0018

52

slide-56
SLIDE 56

Flipping coins

[Histogram of head_times: centered near 50, ranging from about 35 to 65]

53

slide-57
SLIDE 57

Bootstrapping

54

slide-58
SLIDE 58

Let’s Generalize

◮ A statistic is just a function of a random sample
◮ Statistics are used as estimators of quantities of interest about the distribution, called parameters
◮ Statistics are random variables
◮ Parameters are NOT random variables

slide-59
SLIDE 59

Let’s Generalize

◮ In simple cases, we can study the sampling distribution of the statistic analytically
◮ e.g. we can prove under mild conditions that the distribution of the sample proportion is close to normal for large sample sizes
◮ In more complicated cases we can turn to simulation

slide-60
SLIDE 60

Sampling Distributions

◮ In our example X1, X2, . . . , Xn are independent observations from the same distribution
◮ The distribution has center (mean value) µ and spread (standard deviation) σ
◮ e.g. interest in the distribution of median(X1, X2, . . . , Xn)
◮ We take many samples of size n, and study the behavior of the sample medians

57

slide-61
SLIDE 61

Some Limitations

◮ Consider t-test procedures for inference about means
◮ Most classical methods rest on the use of Normal distributions
◮ However, most real data are not Normal
◮ We cannot use t confidence intervals for strongly skewed data (unless samples are large)
◮ What about inference for a ratio of means? (no simple traditional inference)

58

slide-62
SLIDE 62

Fundamental Reasoning

◮ Apply computer power to relax some of the conditions needed in traditional tests
◮ Have tools to do inference in new settings
◮ What would happen if we applied this method many times?

slide-63
SLIDE 63

Bootstrap Idea

◮ Statistical inference is based on the sampling distributions of sample statistics
◮ A sampling distribution is based on many random samples from the population
◮ The bootstrap is a way of finding the sampling distribution (approximately)

60

slide-64
SLIDE 64

Bootstrap Samples

x <- c(3.15, 0, 1.58, 19.65, 0.23, 2.21)
mean(x)
## [1] 4.47

61

slide-65
SLIDE 65

Bootstrap Samples

x <- c(3.15, 0, 1.58, 19.65, 0.23, 2.21)
mean(x)
## [1] 4.47

(x1 <- sample(x, size = 6, replace = TRUE))
## [1] 19.65  3.15 19.65  0.00 19.65  2.21
mean(x1)
## [1] 10.71833

61

slide-66
SLIDE 66

Bootstrap Samples

(x2 <- sample(x, size = 6, replace = TRUE))
## [1] 3.15 0.23 3.15 2.21 2.21 1.58
mean(x2)
## [1] 2.088333

(x3 <- sample(x, size = 6, replace = TRUE))
## [1] 19.65  2.21 19.65  2.21  3.15  1.58
mean(x3)
## [1] 8.075

62

slide-67
SLIDE 67

Procedure for Bootstrapping

◮ Repeatedly sample with replacement from a random sample
◮ Each bootstrap sample is the same size as the original sample
◮ Calculate the statistic of interest (e.g. mean, median, sd)
◮ Draw hundreds or thousands of samples
◮ Obtain a bootstrap distribution
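The procedure works for statistics other than the mean. A sketch using replicate() to bootstrap the median; the data vector is made up for illustration:

```r
# Sketch: bootstrap distribution of the median
# (the data vector is hypothetical; any numeric sample works)
set.seed(42)                      # arbitrary seed
dat <- c(3.15, 0, 1.58, 19.65, 0.23, 2.21)
boot_medians <- replicate(1000, {
  median(sample(dat, size = length(dat), replace = TRUE))
})

length(boot_medians)   # 1000 resample medians
summary(boot_medians)
```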

slide-68
SLIDE 68

Bootstrap Samples

bootstrap <- numeric(1000)
for (b in 1:1000) {
  boot_sample <- sample(x, size = 6, replace = TRUE)
  bootstrap[b] <- mean(boot_sample)
}

64

slide-69
SLIDE 69

Bootstrap Distribution

hist(bootstrap, col = 'gray70', las = 1)

[Histogram of bootstrap means, ranging from about 2 to 14]

slide-70
SLIDE 70

How does Bootstrapping work?

◮ We are not using the resamples as if they were real data
◮ Bootstrap samples are not a substitute for gathering more data to improve accuracy
◮ The bootstrap idea is to use the resample statistics to estimate how the sample statistic varies from the studied random sample
◮ The bootstrap distribution approximates the sampling distribution of the statistic

66

slide-71
SLIDE 71

Bootstrap samples

Computing the bootstrap distribution implies:

◮ Calculate the statistic for each sample
◮ The distribution of these resample statistics is the bootstrap distribution
◮ A bootstrap sample is the same size as the original random sample

67

slide-72
SLIDE 72

Another Example

68

slide-73
SLIDE 73

Bootstrap resampling

# Iris Virginica subset (the "population")
virginica <- subset(iris, Species == 'virginica')

# random sample of Petal.Length (size = 5)
set.seed(7359)
rand_sample <- sample(virginica$Petal.Length, size = 5)
rand_sample
## [1] 5.1 5.6 5.8 5.7 5.8

69

slide-74
SLIDE 74

Bootstrap resampling

# create 500 bootstrap samples of size 5 with replacement
resamples <- 500
n <- length(rand_sample)
boot_stats <- numeric(resamples)
for (i in 1:resamples) {
  boot_sample <- sample(rand_sample, size = n, replace = TRUE)
  boot_stats[i] <- mean(boot_sample)
}

70

slide-75
SLIDE 75

Bootstrap resampling

# "population" mean
mean(virginica$Petal.Length)
## [1] 5.552

# bootstrap mean
mean(boot_stats)
## [1] 5.60028

71

slide-76
SLIDE 76

Bootstrap Distribution

[Histogram of boot_stats: centered near 5.6, ranging from about 5.2 to 5.8]

72

slide-77
SLIDE 77

Bootstrap standard error

The bootstrap standard error is just the standard deviation of the bootstrap statistics.

# descriptive statistics
summary(boot_stats)
##  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
##  5.20    5.52   5.60  5.60    5.70  5.80

# Bootstrap Standard Error
sd(boot_stats)
## [1] 0.1167183

73
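Beyond the standard error, a common use of the bootstrap distribution is a percentile confidence interval: take the middle 95% of the bootstrap statistics. A sketch, rebuilding boot_stats so the snippet stands alone; the seed here is arbitrary, so the exact endpoints will differ from run to run on other seeds:

```r
# Sketch: 95% percentile bootstrap confidence interval for the mean
set.seed(2021)                                 # arbitrary seed
rand_sample <- c(5.1, 5.6, 5.8, 5.7, 5.8)      # the sample from the slides
boot_stats <- replicate(500, mean(sample(rand_sample, 5, replace = TRUE)))

# endpoints of the percentile interval
quantile(boot_stats, probs = c(0.025, 0.975))
```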