Bootstrapping

Based on Chapter 8 of ModernDive. Code for Quiz 12.

Load the R packages we will use.
library(tidyverse)
library(moderndive) #install before loading
library(infer) #install before loading
library(fivethirtyeight) #install before loading

-Replace all the instances of ???. These are the answers on your Moodle quiz.

-Run all the individual code chunks to make sure the answers in this file correspond with your quiz answers

-After you have checked that all your code chunks run, you can knit the file. It won’t knit until the ??? are replaced.

-Save a plot to be your preview plot

-Look at the variable definitions in congress_age

What is the average age of members that have served in congress?

-Set random seed generator to 123

-Take a sample of 100 from the dataset congress_age and assign it to congress_age_100

set.seed(4346)
congress_age_100 <- congress_age  %>% 
  rep_sample_n(size=100)
#18,635 rows representing members of Congress

-congress_age is the population and congress_age_100 is the sample

-18,635 is the number of observations in the population and 100 is the number of observations in your sample

Construct the confidence interval

  1. Use specify to indicate the variable from congress_age_100 that you are interested in
congress_age_100  %>% 
  specify(response = age)
Response: age (numeric)
# A tibble: 100 x 1
     age
   <dbl>
 1  58  
 2  27.3
 3  59.4
 4  47.8
 5  36.4
 6  62.3
 7  52.5
 8  55.5
 9  44  
10  48  
# ... with 90 more rows
  2. generate 1000 replicates of your sample of 100
congress_age_100  %>% 
  specify(response = age)  %>% 
  generate(reps = 1000, type= "bootstrap")
Response: age (numeric)
# A tibble: 100,000 x 2
# Groups:   replicate [1,000]
   replicate   age
       <int> <dbl>
 1         1  55.2
 2         1  40.8
 3         1  55.7
 4         1  52.5
 5         1  54.5
 6         1  35.8
 7         1  44.5
 8         1  47.9
 9         1  40.8
10         1  37.4
# ... with 99,990 more rows

The output has 100,000 rows: 1000 replicates of 100 observations each.
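As a rough illustration of what bootstrap resampling does, the sketch below rebuilds a similar long table in base R. It is not how infer implements generate(). The vector sample_ages is a hypothetical stand-in for the 100 sampled ages, and the names n_reps, boot_matrix and bootstrap_rows are made up for this example.

```r
# Hypothetical stand-in for the 100 sampled ages (not the real congress_age_100 data)
set.seed(1)
sample_ages <- round(rnorm(100, mean = 50, sd = 10), 1)

n_reps <- 1000

# Each column is one bootstrap replicate: 100 draws from the sample, with replacement
boot_matrix <- replicate(
  n_reps,
  sample(sample_ages, size = length(sample_ages), replace = TRUE)
)

# Stack the replicates into a long table similar to the generate() output
bootstrap_rows <- data.frame(
  replicate = rep(seq_len(n_reps), each = length(sample_ages)),
  age = as.vector(boot_matrix)
)

nrow(bootstrap_rows)  # 1000 * 100 = 100000
```

Because every replicate is drawn with replacement, some ages appear more than once within a replicate and others not at all, which is why the replicate means differ from one another.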

  3. calculate the mean for each replicate

-Assign to bootstrap_distribution_mean_age

-Display bootstrap_distribution_mean_age

bootstrap_distribution_mean_age  <- congress_age_100  %>% 
  specify(response = age)  %>% 
  generate(reps = 1000, type = "bootstrap")  %>% 
  calculate(stat = "mean")

bootstrap_distribution_mean_age
# A tibble: 1,000 x 2
   replicate  stat
 *     <int> <dbl>
 1         1  51.3
 2         2  48.2
 3         3  49.7
 4         4  50.5
 5         5  51.6
 6         6  47.9
 7         7  49.5
 8         8  50.0
 9         9  51.0
10        10  51.0
# ... with 990 more rows

The bootstrap_distribution_mean_age has 1000 means

  4. visualize the bootstrap distribution
visualize(bootstrap_distribution_mean_age)

Calculate the 95% confidence interval using the percentile method

-Assign the output to congress_ci_percentile

-Display congress_ci_percentile

congress_ci_percentile <- bootstrap_distribution_mean_age %>% 
  get_confidence_interval(type = "percentile", level = 0.95)

congress_ci_percentile
# A tibble: 1 x 2
  lower_ci upper_ci
     <dbl>    <dbl>
1     48.5     52.7
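The percentile method takes the middle 95% of the bootstrap means, so the endpoints are the 2.5th and 97.5th percentiles. The base R sketch below shows the idea under stated assumptions: sample_ages is a hypothetical stand-in for the sampled ages, and boot_means and ci_percentile are names invented for this example. The result will not match the interval shown above, which comes from the real data.

```r
# Hypothetical stand-in for the 100 sampled ages (not the real congress_age_100 data)
set.seed(1)
sample_ages <- rnorm(100, mean = 50, sd = 10)

# 1000 bootstrap means: resample with replacement, then average
boot_means <- replicate(1000, mean(sample(sample_ages, replace = TRUE)))

# Percentile method: cut off 2.5% in each tail of the bootstrap distribution
ci_percentile <- quantile(boot_means, probs = c(0.025, 0.975))
ci_percentile
```

Changing level to 0.90 in get_confidence_interval corresponds to probs = c(0.05, 0.95) here, which gives a narrower interval.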

-Calculate the observed point estimate of the mean and assign it to obs_mean_age

-Display obs_mean_age

obs_mean_age  <- congress_age_100  %>% 
  specify(response = age)  %>% 
  calculate(stat = "mean")  %>% 
  pull()

obs_mean_age
[1] 50.533

-Shade the confidence interval

-Add a line at the observed mean, obs_mean_age, to your visualization and color it “hotpink”

# endpoints takes the percentile interval stored in congress_ci_percentile
visualize(bootstrap_distribution_mean_age) +
  shade_confidence_interval(endpoints = congress_ci_percentile) +
  geom_vline(xintercept = obs_mean_age, color = "hotpink", size = 1)

-Calculate the population mean to see if it is in the 95% confidence interval

-Assign the output to pop_mean_age

-Display pop_mean_age

# compute the mean of the original data (the population) and assign it to pop_mean_age
pop_mean_age  <- congress_age  %>% 
  summarize(pop_mean= mean(age))  %>% pull()

pop_mean_age
[1] 53.31373

-Add a line to the visualization at the population mean, pop_mean_age, and color it “purple”

visualize(bootstrap_distribution_mean_age) +
  shade_confidence_interval(endpoints = congress_ci_percentile) +
  # hot pink line at the observed mean, same as in the previous plot
  geom_vline(xintercept = obs_mean_age, color = "hotpink", size = 1) +
  # purple line at the population mean, pop_mean_age
  geom_vline(xintercept = pop_mean_age, color = "purple", size = 3)

-Is the population mean in the 95% confidence interval constructed using the bootstrap distribution? yes

-Change set.seed(123) to set.seed(4346). Rerun all the code.

-When you change the seed, is the population mean in the 95% confidence interval constructed using the bootstrap distribution? no

-If you construct 100 95% confidence intervals, approximately how many do you expect will contain the population mean? 95
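The answer of 95 can be checked with a small simulation. This is a sketch in base R, not part of the quiz: population is a hypothetical normally distributed stand-in for congress_age, and bootstrap_ci is a helper function written for this example, not an infer function. It draws 100 samples of size 100, builds a 95% percentile bootstrap interval from each, and counts how many intervals contain the population mean.

```r
# Hypothetical population of 18,635 ages (not the real congress_age data)
set.seed(2024)
population <- rnorm(18635, mean = 53, sd = 11)
pop_mean <- mean(population)

# Helper written for this sketch: percentile bootstrap interval for the mean
bootstrap_ci <- function(x, reps = 1000, level = 0.95) {
  boot_means <- replicate(reps, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Draw 100 samples of size 100 and record whether each interval covers pop_mean
contains_pop_mean <- replicate(100, {
  ci <- bootstrap_ci(sample(population, size = 100))
  ci[1] <= pop_mean && pop_mean <= ci[2]
})

# Number of intervals (out of 100) that contain the population mean
sum(contains_pop_mean)
```

The count is usually close to 95 but changes with the seed. This is also why one seed gave an interval that contained the population mean and the other did not: any single interval is one of the roughly 95 that cover it or one of the roughly 5 that miss.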