MTH 412/512 - Ch5_bootstrap

Example of using a bootstrap distribution for the case when the probability distribution and parameters are known.

Here, take a sample of size n=50 from \(X\sim N(\mu,\sigma^2)\) and treat this as the given data. Then find the bootstrap distribution from the data and compare that to the known theoretical results for the sampling distribution.

Generate a sample of size n from the standard normal distribution:

n <- 50      # sample size
mu <- 0      # true mean
sigma <- 1   # true standard deviation
data <- rnorm(n, mu, sigma)
data_mean <- mean(data)
data_sd <- sd(data)

theory mean = 0
data mean = -0.0382881
theory std dev = 1
data std dev = 0.9888208

Plot the pdf of the probability distribution

xfine<-seq(-3,3,length=101)
plot(xfine,dnorm(xfine,mu,sigma),type='l',xlab='x',ylab='pdf')
abline(v=mu,col='green')

Plot the histogram of the sample

hist(data,xlim=c(-3,3))
abline(v=mu,col='green')
abline(v=data_mean,col='blue',lty='dashed')

NOTE: The mean of the data (blue dashed) is not necessarily close to the theoretical mean (green solid).

Find the bootstrap distribution from the data

N <- 10^3  # number of bootstrap resamples
xboot <- numeric(N)  # vector to hold xbar from each resample
# loop: draw a resample with replacement, calculate xbar, store it in xboot
for (i in seq_len(N)) {
  resample <- sample(data, n, replace = TRUE)
  xbar <- mean(resample)
  xboot[i] <- xbar
}
head(xboot)  # bootstrap approximation to the sampling distribution of xbar

## [1]  0.090801866 -0.006834807 -0.126397665  0.223555216  0.102916255
## [6] -0.196761022
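
The resampling loop can also be written without an explicit index. The following is a minimal sketch, assuming the same setup as above; the seed value and the name `xboot_vec` are illustrative and not part of the original notes, so the numbers it produces will differ from those printed here.

```r
set.seed(412)  # arbitrary seed (an assumption), only so the sketch is repeatable
n <- 50
data <- rnorm(n, mean = 0, sd = 1)  # stand-in for the given data

N <- 10^3  # number of bootstrap resamples
# replicate() evaluates the expression N times and collects the results
xboot_vec <- replicate(N, mean(sample(data, n, replace = TRUE)))

length(xboot_vec)
head(xboot_vec)
```

Both versions draw N resamples of size n with replacement and keep the mean of each one; `replicate()` only removes the bookkeeping of preallocating and indexing the result vector.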

xboot_mean <- mean(xboot)
xboot_sd <- sd(xboot)

sampling dist mean = 0
bootstrap mean = -0.03379
sampling dist std dev = 0.1414214
bootstrap std dev = 0.1442237
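
The sampling distribution standard deviation quoted above is the standard error \(\sigma/\sqrt{n} = 1/\sqrt{50} \approx 0.1414\). As a rough sketch (the seed and the names `se_theory`, `se_plugin`, and `se_boot` are assumptions introduced for illustration), the three versions of this quantity can be compared side by side:

```r
set.seed(412)  # arbitrary seed (an assumption), only so the sketch is repeatable
n <- 50; mu <- 0; sigma <- 1
data <- rnorm(n, mu, sigma)
xboot <- replicate(10^3, mean(sample(data, n, replace = TRUE)))

se_theory <- sigma / sqrt(n)     # exact standard error, requires the true sigma
se_plugin <- sd(data) / sqrt(n)  # plug-in estimate using only the data
se_boot   <- sd(xboot)           # standard deviation of the bootstrap distribution

c(theory = se_theory, plugin = se_plugin, bootstrap = se_boot)
```

The bootstrap value tracks the plug-in estimate rather than the exact value, because the bootstrap only ever sees the data and never the true \(\sigma\).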

Plot the pdf of the exact sampling distribution

xfine<-seq(-1,1,length=101)
plot(xfine,dnorm(xfine,mu,sigma/sqrt(n)),type='l',xlab='x',ylab='pdf')
abline(v=mu,col='green')

Plot the histogram of the bootstrap distribution

hist(xboot,xlim=c(-1,1))
abline(v=mu,col='green')
abline(v=data_mean,col='blue',lty='dashed')
abline(v=xboot_mean,col='red',lty='dotted')

NOTES:
(1) The mean of the bootstrap distribution is close to the mean of the data, which may or may not be close to \(\mu\), so the bootstrap mean is not necessarily a good estimate of \(\mu\).
(2) The shape and spread of the bootstrap distribution are a good approximation to those of the sampling distribution.
(3) The spread of the bootstrap distribution gives a good estimate of the range of plausible values of \(\mu\) that could have generated the data.
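
To make the first two notes concrete, the bootstrap histogram can be drawn on the density scale with the exact sampling density overlaid. This is only a sketch: the seed, the plot title, and the variable `shift` are assumptions added for illustration.

```r
set.seed(412)  # arbitrary seed (an assumption), only so the sketch is repeatable
n <- 50; mu <- 0; sigma <- 1
data <- rnorm(n, mu, sigma)
xboot <- replicate(10^3, mean(sample(data, n, replace = TRUE)))

# histogram on the density scale so that pdf curves can be overlaid
hist(xboot, freq = FALSE, xlim = c(-1, 1),
     main = "Bootstrap vs exact sampling distribution")
# exact sampling density of xbar, centered at the true mean
curve(dnorm(x, mu, sigma / sqrt(n)), add = TRUE, col = "green")
# the same density shifted so that it is centered at the data mean
curve(dnorm(x, mean(data), sigma / sqrt(n)), add = TRUE,
      col = "blue", lty = "dashed")

shift <- mean(xboot) - mu  # offset of the bootstrap center from the true mean
shift
```

The dashed curve should follow the histogram more closely than the solid one: the bootstrap reproduces the shape and width of the sampling distribution, but it is centered at the data mean rather than at \(\mu\).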

95% confidence interval for the true mean of the sampling distribution

qlow <- quantile(xboot,0.025)
qhigh <- quantile(xboot,0.975)

The 95% bootstrap percentile confidence interval for the mean \(\mu\) of the sampling distribution is: qlow < \(\mu\) < qhigh which is
-0.3143697 < \(\mu\) < 0.2402024
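
Because the data here really are normal, the bootstrap percentile interval can be checked against the classical t interval. The sketch below assumes the same setup with an arbitrary seed; `ci_boot` and `ci_t` are names introduced for illustration, and the t interval is a comparison added here, not part of the original example.

```r
set.seed(412)  # arbitrary seed (an assumption), only so the sketch is repeatable
n <- 50
data <- rnorm(n, mean = 0, sd = 1)
xboot <- replicate(10^3, mean(sample(data, n, replace = TRUE)))

# bootstrap percentile interval: the middle 95% of the bootstrap distribution
ci_boot <- quantile(xboot, c(0.025, 0.975))
# classical interval based on the t distribution
ci_t <- t.test(data, conf.level = 0.95)$conf.int

rbind(bootstrap = as.numeric(ci_boot), t_interval = as.numeric(ci_t))
```

For a sample of this size from a normal distribution the two intervals should be close, with the percentile interval typically slightly narrower.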

Plot the ecdf to show the 95% confidence interval graphically

plot(ecdf(xboot))
abline(v=qlow,col='purple')
abline(v=qhigh,col='purple')
abline(v=mu,col='green')
abline(v=data_mean,col='blue',lty='dashed')
abline(v=xboot_mean,col='red',lty='dotted')

The ecdf plot of the bootstrap distribution shows the 95% confidence interval for the mean of the sampling distribution marked by the vertical purple lines. The true mean of the sampling distribution is shown in green, while the mean of the data (blue dashed) and the mean of the bootstrap distribution (red dotted) nearly coincide.

Important take-away point: the best we can do in estimating the true mean from the sample data alone is to report the 95% confidence interval from the bootstrap distribution. An interval constructed this way contains the true mean of the distribution that generated the sample for approximately 95% of possible samples.
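
One way to check how often such an interval captures the true mean is to repeat the whole experiment many times, which is possible here only because \(\mu\) is known. The following is a rough sketch; the seed, the number of repetitions `n_experiments`, and the reduced number of resamples per experiment are assumptions chosen to keep the run short.

```r
set.seed(412)  # arbitrary seed (an assumption), only so the sketch is repeatable
n <- 50; mu <- 0; sigma <- 1
n_experiments <- 500  # hypothetical number of independent samples to draw
N <- 500              # bootstrap resamples per sample (fewer than above, for speed)

covers <- replicate(n_experiments, {
  data <- rnorm(n, mu, sigma)  # a fresh sample from the known distribution
  xboot <- replicate(N, mean(sample(data, n, replace = TRUE)))
  ci <- quantile(xboot, c(0.025, 0.975))
  ci[[1]] < mu && mu < ci[[2]]  # TRUE if this interval contains the true mean
})

coverage <- mean(covers)  # fraction of intervals that contain mu
coverage
```

The resulting fraction should be near 0.95, though the percentile interval tends to fall slightly short of the nominal level for moderate n.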

MTH 412/512 - Ch5_bootstrap_normal.Rmd