diff --git a/docs/12-thinking-with-data.html b/docs/12-thinking-with-data.html
index 4fd77c28e..83688422f 100644
--- a/docs/12-thinking-with-data.html
+++ b/docs/12-thinking-with-data.html
@@ -1044,9 +1044,7 @@

12.2 Case study: Effective data storytelling

-\begin{center}
-\end{center}

As we’ve progressed throughout this book, you’ve seen how to work with data in a variety of ways. You’ve learned effective strategies for plotting data by understanding which types of plots work best for which combinations of variable types. You’ve summarized data in table form and calculated summary statistics for a variety of different variables. Further, you’ve seen the value of inference as a process to come to conclusions about a population by using a random sample. Lastly, you’ve explored how to use linear regression and the importance of checking the conditions required to make it a valid procedure. All throughout, you’ve learned many computational techniques and focused on reproducible research in writing R code. We now present another case study, but this time of the “effective data storytelling” done by data journalists around the world. Great data stories don’t mislead the reader; rather, they use the captivation of storytelling to immerse the reader in understanding the important role that data plays in our lives.

diff --git a/docs/search_index.json b/docs/search_index.json index 85c489573..4391ab340 100644 --- a/docs/search_index.json +++ b/docs/search_index.json @@ -10,7 +10,7 @@ ["9-confidence-intervals.html", "Chapter 9 Confidence Intervals 9.1 Bootstrapping 9.2 The infer package for statistical inference 9.3 Now to confidence intervals 9.4 Comparing bootstrap and sampling distributions 9.5 Interpreting the confidence interval 9.6 Example: One proportion 9.7 Example: Comparing two proportions 9.8 Conclusion", " Chapter 9 Confidence Intervals In preparation for our first print edition to be published by CRC Press in Fall 2019, we’re remodeling this chapter a bit. Don’t expect major changes in content, but rather only minor changes in presentation. Our remodeling will be complete and available online at ModernDive.com by early Summer 2019! In Chapter 8, we explored the process of sampling from a representative sample to build a sampling distribution. The motivation there was to use multiple samples from the same population to visualize and attempt to understand the variability in the statistic from one sample to another. Furthermore, recall our concepts and terminology related to sampling from the beginning of Chapter 8: Generally speaking, we learned that if the sampling of a sample of size \\(n\\) is done at random, then the resulting sample is unbiased and representative of the population, thus any result based on the sample can generalize to the population, and hence the point estimate/sample statistic computed from this sample is a “good guess” of the unknown population parameter of interest Specific to the bowl, we learned that if we properly mix the balls first thereby ensuring the randomness of samples extracted using the shovel with \\(n=50\\) slots, then the contents of the shovel will “look like” the contents of the bowl, thus any results based on the sample of \\(n=50\\) balls can generalize to the large bowl of \\(N=2400\\) balls, and hence the sample proportion red \\(\\widehat{p}\\) of the \\(n=50\\) balls in the shovel is a “good guess” of the true population proportion red \\(p\\) of the \\(N=2400\\) balls in the bowl. We emphasize that we used a point estimate/sample statistic, in this case the sample proportion \\(\\widehat{p}\\), to estimate the unknown value of the population parameter, in this case the population proportion \\(p\\). In other words, we are using the sample to infer about the population. We can however consider inferential situations other than just those involving proportions. We present a wide array of such scenarios in Table 9.1. In all 7 cases, the point estimate/sample statistic estimates the unknown population parameter. It does so by computing summary statistics based on a sample of size \\(n\\). TABLE 9.1: Scenarios of sampling for inference Scenario Population parameter Notation Point estimate Notation. 
1 Population proportion \\(p\\) Sample proportion \\(\\widehat{p}\\) 2 Population mean \\(\\mu\\) Sample mean \\(\\widehat{\\mu}\\) or \\(\\overline{x}\\) 3 Difference in population proportions \\(p_1 - p_2\\) Difference in sample proportions \\(\\widehat{p}_1 - \\widehat{p}_2\\) 4 Difference in population means \\(\\mu_1 - \\mu_2\\) Difference in sample means \\(\\overline{x}_1 - \\overline{x}_2\\) 5 Population regression slope \\(\\beta_1\\) Sample regression slope \\(\\widehat{\\beta}_1\\) or \\(b_1\\) 6 Population regression intercept \\(\\beta_0\\) Sample regression intercept \\(\\widehat{\\beta}_0\\) or \\(b_0\\) We’ll cover the first four scenarios in this chapter on confidence intervals and the following one on hypothesis testing: Scenario 2 about means. Ex: the average age of pennies. Scenario 3 about differences in proportions between two groups. Ex: the difference in high school completion rates for Canadians vs non-Canadians. We call this a situation of two-sample inference. Scenario 4 is similar to 3, but its about the means of two groups. Ex: the difference in average test scores for the morning section of a class versus the afternoon section of a class. This is another situation of two-sample inference. In Chapter 11 on inference for regression, we’ll cover Scenarios 5 & 6 about the regression line. In particular we’ll see that the fitted regression line from Chapter 6 on basic regression, \\(\\widehat{y} = b_0 + b_1 \\cdot x\\), is in fact an estimate of some true population regression line \\(y = \\beta_0 + \\beta_1 \\cdot x\\) based on a sample of \\(n\\) pairs of points \\((x, y)\\). Ex: Recall our sample of \\(n=463\\) instructors at the UT Austin from the evals data set in Chapter 6. Based on the results of the fitted regression model of teaching score with beauty score as an explanatory/predictor variable, what can we say about this relationship for all instructors, not just those at the UT Austin? In contrast to these, Scenario 7 involves a measure of spread: the standard deviation. Does the spread/variability of a sample match the spread/variability of the population? However, we leave this topic for a more intermediate course on statistical inference. In most cases, we don’t have the population values as we did with the bowl of balls. We only have a single sample of data from a larger population. We’d like to be able to make some reasonable guesses about population parameters using that single sample to create a range of plausible values for a population parameter. This range of plausible values is known as a confidence interval and will be the focus of this chapter. And how do we use a single sample to get some idea of how other samples might vary in terms of their statistic values? One common way this is done is via a process known as bootstrapping that will be the focus of the beginning sections of this chapter. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section 2.3 for information on how to install and load R packages. library(dplyr) library(ggplot2) library(janitor) library(moderndive) library(infer) 9.1 Bootstrapping 9.1.1 Data explanation The moderndive package contains a sample of 40 pennies collected and minted in the United States. 
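If you want a quick structural overview of this data before printing it, glimpse() from dplyr (loaded above) is handy; this optional sketch is not part of the original walkthrough:
library(dplyr)       # glimpse() is re-exported by dplyr
library(moderndive)  # pennies_sample lives here
glimpse(pennies_sample)
# Shows 40 rows and 2 columns: year (integer) and age_in_2011 (integer)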
Let’s explore this sample data first: pennies_sample # A tibble: 40 x 2 year age_in_2011 <int> <int> 1 2005 6 2 1981 30 3 1977 34 4 1992 19 5 2005 6 6 2006 5 7 2000 11 8 1992 19 9 1988 23 10 1996 15 # … with 30 more rows The pennies_sample data frame has rows corresponding to a single penny with two variables: year of minting as shown on the penny and age_in_2011 giving the years the penny had been in circulation from 2011 as an integer, e.g. 15, 2, etc. Suppose we are interested in understanding some properties of the mean age of all US pennies from this data collected in 2011. How might we go about that? Let’s begin by understanding some of the properties of pennies_sample using data wrangling from Chapter 4 and data visualization from Chapter 3. 9.1.2 Exploratory data analysis First, let’s visualize the values in this sample as a histogram: ggplot(pennies_sample, aes(x = age_in_2011)) + geom_histogram(bins = 10, color = "white") We see a roughly symmetric distribution here that has quite a few values near 20 years in age with only a few larger than 40 years or smaller than 5 years. If pennies_sample is a representative sample from the population, we’d expect the age of all US pennies collected in 2011 to have a similar shape, a similar spread, and similar measures of central tendency like the mean. So where does the mean value fall for this sample? This point will be known as our point estimate and provides us with a single number that could serve as the guess to what the true population mean age might be. Recall how to find this using the dplyr package: x_bar <- pennies_sample %>% summarize(stat = mean(age_in_2011)) x_bar # A tibble: 1 x 1 stat <dbl> 1 25.1 We’ve denoted this sample mean as \\(\\bar{x}\\), which is the standard symbol for denoting the mean of a sample. Our point estimate is, thus, \\(\\bar{x} = 25.1\\). Note that this is just one sample though providing just one guess at the population mean. What if we’d like to have another guess? This should all sound similar to what we did in Chapter 8. There instead of collecting just a single scoop of balls we had many different students use the shovel to scoop different samples of red and white balls. We then calculated a sample statistic (the sample proportion) from each sample. But, we don’t have a population to pull from here with the pennies. We only have this one sample. The process of bootstrapping allows us to use a single sample to generate many different samples that will act as our way of approximating a sampling distribution using a created bootstrap distribution instead. We will pull ourselves up from our bootstraps using a single sample (pennies_sample) to get an idea of the grander sampling distribution. 9.1.3 The Bootstrapping Process Bootstrapping uses a process of sampling with replacement from our original sample to create new bootstrap samples of the same size as our original sample. We can again make use of the rep_sample_n() function to explore what one such bootstrap sample would look like. Remember that we are randomly sampling from the original sample here with replacement and that we always use the same sample size for the bootstrap samples as the size of the original sample (pennies_sample). 
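To make the phrase “sampling with replacement” concrete, here is first a minimal base-R sketch; the set.seed() value is arbitrary and only there so the resample is reproducible. The rep_sample_n() call that follows does the same thing while keeping the result in a data frame:
set.seed(2019)  # arbitrary seed for reproducibility
resampled_ages <- sample(pennies_sample$age_in_2011, size = 40, replace = TRUE)
resampled_ages        # 40 ages: some original values appear several times, others not at all
mean(resampled_ages)  # the bootstrap statistic for this one resample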
bootstrap_sample1 <- pennies_sample %>% rep_sample_n(size = 40, replace = TRUE, reps = 1) bootstrap_sample1 # A tibble: 40 x 3 # Groups: replicate [1] replicate year age_in_2011 <int> <int> <int> 1 1 1983 28 2 1 2000 11 3 1 2004 7 4 1 1981 30 5 1 1993 18 6 1 2006 5 7 1 1981 30 8 1 2004 7 9 1 1992 19 10 1 1994 17 # … with 30 more rows Let’s visualize what this new bootstrap sample looks like: ggplot(bootstrap_sample1, aes(x = age_in_2011)) + geom_histogram(bins = 10, color = "white") We now have another sample from what we could assume comes from the population of interest. We can similarly calculate the sample mean of this bootstrap sample, called a bootstrap statistic. bootstrap_sample1 %>% summarize(stat = mean(age_in_2011)) # A tibble: 1 x 2 replicate stat <int> <dbl> 1 1 23.2 We can see that this sample mean is smaller than the x_bar value we calculated earlier for the pennies_sample data. We’ll come back to analyzing the different bootstrap statistic values shortly. Let’s recap what was done to get to this bootstrap sample using a tactile explanation: First, pretend that each of the 40 values of age_in_2011 in pennies_sample were written on a small piece of paper. Recall that these values were 6, 30, 34, 19, 6, etc. Now, put the 40 small pieces of paper into a receptacle such as a baseball cap. Shake up the pieces of paper. Draw “at random” from the cap to select one piece of paper. Write down the value on this piece of paper. Say that it is 28. Now, place this piece of paper containing 28 back into the cap. Draw “at random” again from the cap to select a piece of paper. Note that this is the sampling with replacement part since you may draw 28 again. Repeat this process until you have drawn 40 pieces of paper and written down the values on these 40 pieces of paper. Completing this repetition produces ONE bootstrap sample. If you look at the values in bootstrap_sample1, you can see how this process plays out. We originally drew 28, then we drew 11, then 7, and so on. Of course, we didn’t actually use pieces of paper and a cap here. We just had the computer perform this process for us to produce bootstrap_sample1 using rep_sample_n() with replace = TRUE set. The process of sampling with replacement is how we can use the original sample to take a guess as to what other values in the population may be. Sometimes in these bootstrap samples, we will select lots of larger values from the original sample, sometimes we will select lots of smaller values, and most frequently we will select values that are near the center of the sample. Let’s explore what the distribution of values of age_in_2011 for six different bootstrap samples looks like to further understand this variability. six_bootstrap_samples <- pennies_sample %>% rep_sample_n(size = 40, replace = TRUE, reps = 6) ggplot(six_bootstrap_samples, aes(x = age_in_2011)) + geom_histogram(bins = 10, color = "white") + facet_wrap(~ replicate) We can also look at the six different means using dplyr syntax: six_bootstrap_samples %>% group_by(replicate) %>% summarize(stat = mean(age_in_2011)) # A tibble: 6 x 2 replicate stat <int> <dbl> 1 1 23.6 2 2 24.1 3 3 25.2 4 4 23.1 5 5 24.0 6 6 24.7 Instead of doing this six times, we could do it 1000 times and then look at the distribution of stat across all 1000 of the replicates. This sets the stage for the infer R package (Bray et al. 2018) that was created to help users perform statistical inference such as confidence intervals and hypothesis tests using verbs similar to what you’ve seen with dplyr. 
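Before switching to infer, here is a sketch of what “doing it 1000 times” could look like with only the tools used so far (rep_sample_n() from moderndive plus dplyr and ggplot2); the bootstrap_means name is ours, used only for this sketch:
bootstrap_means <- pennies_sample %>%
  rep_sample_n(size = 40, replace = TRUE, reps = 1000) %>%
  group_by(replicate) %>%
  summarize(stat = mean(age_in_2011))
ggplot(bootstrap_means, aes(x = stat)) +
  geom_histogram(bins = 10, color = "white")
The infer pipeline below produces this same kind of bootstrap distribution with more readable, purpose-built verbs.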
We’ll walk through setting up each of the infer verbs for confidence intervals using this pennies_sample example, while also explaining the purpose of the verbs in a general framework. 9.2 The infer package for statistical inference The infer package makes great use of the %>% to create a pipeline for statistical inference. The goal of the package is to provide a way for its users to explain the computational process of confidence intervals and hypothesis tests using the code as a guide. The verbs build in order here, so you’ll want to start with specify() and then continue through the others as needed. 9.2.1 Specify variables The specify() function is used primarily to choose which variables will be the focus of the statistical inference. In addition, a setting of which variable will act as the explanatory and which acts as the response variable is done here. For proportion problems similar to those in Chapter 8, we can also give which of the different levels we would like to have as a success. We’ll see further examples of these options in this chapter, Chapter 10, and in Appendix B. To begin to create a confidence interval for the population mean age of US pennies in 2011, we start by using specify() to choose which variable in our pennies_sample data we’d like to work with. This can be done in one of two ways: Using the response argument: pennies_sample %>% specify(response = age_in_2011) Response: age_in_2011 (integer) # A tibble: 40 x 1 age_in_2011 <int> 1 6 2 30 3 34 4 19 5 6 6 5 7 11 8 19 9 23 10 15 # … with 30 more rows Using formula notation: pennies_sample %>% specify(formula = age_in_2011 ~ NULL) Response: age_in_2011 (integer) # A tibble: 40 x 1 age_in_2011 <int> 1 6 2 30 3 34 4 19 5 6 6 5 7 11 8 19 9 23 10 15 # … with 30 more rows Note that the formula notation uses the common R methodology to include the response \\(y\\) variable on the left of the ~ and the explanatory \\(x\\) variable on the right of the “tilde.” Recall that you used this notation frequently with the lm() function in Chapters 6 and 7 when fitting regression models. Either notation works just fine, but a preference is usually given here for the formula notation to further build on the ideas from earlier chapters. 9.2.2 Generate replicates After specify()ing the variables we’d like in our inferential analysis, we next feed that into the generate() verb. The generate() verb’s main argument is reps, which is used to give how many different repetitions one would like to perform. Another argument here is type, which is automatically determined by the kinds of variables passed into specify(). We can also be explicit and set this type to be type = "bootstrap". This type argument will be further used in hypothesis testing in Chapter 10 as well. Make sure to check out ?generate to see the options here and use the ? operator to better understand other verbs as well. Let’s generate() 1000 bootstrap samples: thousand_bootstrap_samples <- pennies_sample %>% specify(response = age_in_2011) %>% generate(reps = 1000) Setting `type = "bootstrap"` in `generate()`. We can use the dplyr count() function to help us understand what the thousand_bootstrap_samples data frame looks like: thousand_bootstrap_samples %>% count(replicate) # A tibble: 1,000 x 2 # Groups: replicate [1,000] replicate n <int> <int> 1 1 40 2 2 40 3 3 40 4 4 40 5 5 40 6 6 40 7 7 40 8 8 40 9 9 40 10 10 40 # … with 990 more rows Notice that each replicate has 40 entries here. 
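As mentioned above, you can also state the resampling type explicitly rather than letting generate() determine it; this sketch is equivalent to the call above and should also avoid the message about setting the type:
thousand_bootstrap_samples <- pennies_sample %>%
  specify(response = age_in_2011) %>%
  generate(reps = 1000, type = "bootstrap")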
Now that we have 1000 different bootstrap samples, our next step is to calculate the bootstrap statistics for each sample. 9.2.3 Calculate summary statistics After generate()ing many different samples, we next want to condense those samples down into a single statistic for each replicated sample. As seen in the diagram, the calculate() function is helpful here. As we did at the beginning of this chapter, we now want to calculate the mean age_in_2011 for each bootstrap sample. To do so, we use the stat argument and set it to "mean" below. The stat argument has a variety of different options here and we will see further examples of this throughout the remaining chapters. bootstrap_distribution <- pennies_sample %>% specify(response = age_in_2011) %>% generate(reps = 1000) %>% calculate(stat = "mean") Setting `type = "bootstrap"` in `generate()`. bootstrap_distribution # A tibble: 1,000 x 2 replicate stat <int> <dbl> 1 1 26.5 2 2 25.4 3 3 26.0 4 4 26 5 5 25.2 6 6 29.0 7 7 22.8 8 8 26.4 9 9 24.9 10 10 28.1 # … with 990 more rows We see that the resulting data has 1000 rows and 2 columns corresponding to the 1000 replicates and the mean for each bootstrap sample. Observed statistic / point estimate calculations Just as group_by() %>% summarize() produces a useful workflow in dplyr, we can also use specify() %>% calculate() to compute summary measures on our original sample data. It’s often helpful both in confidence interval calculations, but also in hypothesis testing to identify what the corresponding statistic is in the original data. For our example on penny age, we computed above a value of x_bar using the summarize() verb in dplyr: pennies_sample %>% summarize(stat = mean(age_in_2011)) # A tibble: 1 x 1 stat <dbl> 1 25.1 This can also be done by skipping the generate() step in the pipeline feeding specify() directly into calculate(): pennies_sample %>% specify(response = age_in_2011) %>% calculate(stat = "mean") # A tibble: 1 x 1 stat <dbl> 1 25.1 This shortcut will be particularly useful when the calculation of the observed statistic is tricky to do using dplyr alone. This is particularly the case when working with more than one variable as will be seen in Chapter 10. 9.2.4 Visualize the results The visualize() verb provides a simple way to view the bootstrap distribution as a histogram of the stat variable values. It has many other arguments that one can use as well including the shading of the histogram values corresponding to the confidence interval values. bootstrap_distribution %>% visualize() The shape of this resulting distribution may look familiar to you. It resembles the well-known normal (bell-shaped) curve. The following diagram recaps the infer pipeline for creating a bootstrap distribution. 9.3 Now to confidence intervals Definition: Confidence Interval A confidence interval gives a range of plausible values for a parameter. It depends on a specified confidence level with higher confidence levels corresponding to wider confidence intervals and lower confidence levels corresponding to narrower confidence intervals. Common confidence levels include 90%, 95%, and 99%. Usually we don’t just begin sections with a definition, but confidence intervals are simple to define and play an important role in the sciences and any field that uses data. You can think of a confidence interval as playing the role of a net when fishing. 
Instead of just trying to catch a fish with a single spear (estimating an unknown parameter by using a single point estimate/statistic), we can use a net to try to provide a range of possible locations for the fish (use a range of possible values based around our statistic to make a plausible guess as to the location of the parameter). The bootstrapping process will provide bootstrap statistics that have a bootstrap distribution with center at (or extremely close to) the mean of the original sample. This can be seen by giving the observed statistic obs_stat argument the value of the point estimate x_bar. bootstrap_distribution %>% visualize(obs_stat = x_bar) We can also compute the mean of the bootstrap distribution of means to see how it compares to x_bar: bootstrap_distribution %>% summarize(mean_of_means = mean(stat)) # A tibble: 1 x 1 mean_of_means <dbl> 1 25.1 In this case, we can see that the bootstrap distribution provides us a guess as to what the variability in different sample means may look like only using the original sample as our guide. We can quantify this variability in the form of a 95% confidence interval in a couple different ways. 9.3.1 The percentile method One way to calculate a range of plausible values for the unknown mean age of coins in 2011 is to use the middle 95% of the bootstrap_distribution to determine our endpoints. Our endpoints are thus at the 2.5th and 97.5th percentiles. This can be done with infer using the get_ci() function. (You can also use the conf_int() or get_confidence_interval() functions here as they are aliases that work the exact same way.) bootstrap_distribution %>% get_ci(level = 0.95, type = "percentile") # A tibble: 1 x 2 `2.5%` `97.5%` <dbl> <dbl> 1 21.0 29.3 These options are the default values for level and type so we can also just do: percentile_ci <- bootstrap_distribution %>% get_ci() percentile_ci # A tibble: 1 x 2 `2.5%` `97.5%` <dbl> <dbl> 1 21.0 29.3 Using the percentile method, our range of plausible values for the mean age of US pennies in circulation in 2011 is 20.972 years to 29.252 years. We can use the visualize() function to view this using the endpoints and direction arguments, setting direction to "between" (between the values) and endpoints to be those stored with name percentile_ci. bootstrap_distribution %>% visualize(endpoints = percentile_ci, direction = "between") You can see that 95% of the data stored in the stat variable in bootstrap_distribution falls between the two endpoints with 2.5% to the left outside of the shading and 2.5% to the right outside of the shading. The cut-off points that provide our range are shown with the darker lines. 9.3.2 The standard error method If the bootstrap distribution is close to symmetric and bell-shaped, we can also use a shortcut formula for determining the lower and upper endpoints of the confidence interval. This is done by using the formula \\(\\bar{x} \\pm (multiplier * SE),\\) where \\(\\bar{x}\\) is our original sample mean and \\(SE\\) stands for standard error and corresponds to the standard deviation of the bootstrap distribution. The value of \\(multiplier\\) here is the appropriate percentile of the standard normal distribution. These are automatically calculated when level is provided with level = 0.95 being the default. (95% of the values in a standard normal distribution fall within 1.96 standard deviations of the mean, so \\(multiplier = 1.96\\) for level = 0.95, for example.) 
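To see where the 1.96 comes from, here is a short sketch that computes the multiplier for any confidence level using qnorm(), the standard normal quantile function in base R:
level <- 0.95
multiplier <- qnorm(1 - (1 - level) / 2)
multiplier                  # approximately 1.96 for a 95% level
qnorm(1 - (1 - 0.90) / 2)   # approximately 1.64 for a 90% level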
As mentioned, this formula assumes that the bootstrap distribution is symmetric and bell-shaped. This is often the case with bootstrap distributions, especially those in which the original distribution of the sample is not highly skewed. Definition: standard error The standard error is the standard deviation of the sampling distribution. The variability of the sampling distribution may be approximated by the variability of the bootstrap distribution. Traditional theory-based methodologies for inference also have formulas for standard errors, assuming some conditions are met. This \\(\\bar{x} \\pm (multiplier * SE)\\) formula is implemented in the get_ci() function as shown with our pennies problem using the bootstrap distribution’s variability as an approximation for the sampling distribution’s variability. We’ll see more on this approximation shortly. Note that the center of the confidence interval (the point_estimate) must be provided for the standard error confidence interval. standard_error_ci <- bootstrap_distribution %>% get_ci(type = "se", point_estimate = x_bar) standard_error_ci # A tibble: 1 x 2 lower upper <dbl> <dbl> 1 21.0 29.3 bootstrap_distribution %>% visualize(endpoints = standard_error_ci, direction = "between") We see that both methods produce nearly identical confidence intervals with the percentile method being \\([20.97, 29.25]\\) and the standard error method being \\([20.97, 29.28]\\). 9.4 Comparing bootstrap and sampling distributions To help build up the idea of a confidence interval, we weren’t completely honest in our initial discussion. The pennies_sample data frame represents a sample from a larger number of pennies stored as pennies in the moderndive package. The pennies data frame (also in the moderndive package) contains 800 rows of data and two columns pertaining to the same variables as pennies_sample. Let’s begin by understanding some of the properties of the age_by_2011 variable in the pennies data frame. ggplot(pennies, aes(x = age_in_2011)) + geom_histogram(bins = 10, color = "white") pennies %>% summarize(mean_age = mean(age_in_2011), median_age = median(age_in_2011)) # A tibble: 1 x 2 mean_age median_age <dbl> <dbl> 1 21.2 20 We see that pennies is slightly right-skewed with the mean being pulled towards the upper outliers. Recall that pennies_sample was more symmetric than pennies. In fact, it actually exhibited some left-skew as we compare the mean and median values. ggplot(pennies_sample, aes(x = age_in_2011)) + geom_histogram(bins = 10, color = "white") pennies_sample %>% summarize(mean_age = mean(age_in_2011), median_age = median(age_in_2011)) # A tibble: 1 x 2 mean_age median_age <dbl> <dbl> 1 25.1 25.5 Sampling distribution Let’s assume that pennies represents our population of interest. We can then create a sampling distribution for the population mean age of pennies, denoted by the Greek letter \\(\\mu\\), using the rep_sample_n() function seen in Chapter 8. First we will create 1000 samples from the pennies data frame. thousand_samples <- pennies %>% rep_sample_n(size = 40, reps = 1000, replace = FALSE) When creating a sampling distribution, we do not replace the items when we create each sample. This is in contrast to the bootstrap distribution. 
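To keep the two resampling schemes straight, here is a side-by-side sketch; the first call repeats the thousand_samples code from just above, while the second mirrors what generate() did when building the bootstrap distribution (the thousand_bootstraps name is ours, used only for this sketch):
# Sampling distribution: many samples drawn from the population, WITHOUT replacement
thousand_samples <- pennies %>%
  rep_sample_n(size = 40, reps = 1000, replace = FALSE)
# Bootstrap distribution: many resamples of the single original sample, WITH replacement
thousand_bootstraps <- pennies_sample %>%
  rep_sample_n(size = 40, reps = 1000, replace = TRUE)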
It’s important to remember that the sampling distribution is sampling without replacement from the population to better understand sample-to-sample variability, whereas the bootstrap distribution is sampling with replacement from our original sample to better understand potential sample-to-sample variability. After sampling from pennies 1000 times, we next want to compute the mean age for each of the 1000 samples: sampling_distribution <- thousand_samples %>% group_by(replicate) %>% summarize(stat = mean(age_in_2011)) ggplot(sampling_distribution, aes(x = stat)) + geom_histogram(bins = 10, fill = "salmon", color = "white") FIGURE 9.1: Sampling distribution for n=40 samples of pennies We can also examine the variability in this sampling distribution by calculating the standard deviation of the stat column. Remember that the standard deviation of the sampling distribution is the standard error, frequently denoted as se. sampling_distribution %>% summarize(se = sd(stat)) # A tibble: 1 x 1 se <dbl> 1 2.01 Bootstrap distribution Let’s now see how the shape of the bootstrap distribution compares to that of the sampling distribution. We’ll shade the bootstrap distribution blue to further assist with remembering which is which. bootstrap_distribution %>% visualize(bins = 10, fill = "blue") bootstrap_distribution %>% summarize(se = sd(stat)) # A tibble: 1 x 1 se <dbl> 1 2.12 Notice that while the standard deviations are similar, the center of the sampling distribution and the bootstrap distribution differ: sampling_distribution %>% summarize(mean_of_sampling_means = mean(stat)) # A tibble: 1 x 1 mean_of_sampling_means <dbl> 1 21.2 bootstrap_distribution %>% summarize(mean_of_bootstrap_means = mean(stat)) # A tibble: 1 x 1 mean_of_bootstrap_means <dbl> 1 25.1 Since the bootstrap distribution is centered at the original sample mean, it doesn’t necessarily provide a good estimate of the overall population mean \\(\\mu\\). Let’s calculate the mean of age_in_2011 for the pennies data frame to see how it compares to the mean of the sampling distribution and the mean of the bootstrap distribution. pennies %>% summarize(overall_mean = mean(age_in_2011)) # A tibble: 1 x 1 overall_mean <dbl> 1 21.2 Notice that this value matches up well with the mean of the sampling distribution. This is actually an artifact of the Central Limit Theorem introduced in Chapter 8. The mean of the sampling distribution is expected to be the mean of the overall population. The unfortunate fact though is that we don’t know the population mean in nearly all circumstances. The motivation of presenting it here was to show that the theory behind the Central Limit Theorem works using the tools you’ve worked with so far using the ggplot2, dplyr, moderndive, and infer packages. If we aren’t able to use the sample mean as a good guess for the population mean, how should we best go about estimating what the population mean may be if we can only select samples from the population. We’ve now come full circle and can discuss the underpinnings of the confidence interval and ways to interpret it. 9.5 Interpreting the confidence interval As shown above in Subsection 9.3.1, one range of plausible values for the population mean age of pennies in 2011, denoted by \\(\\mu\\), is \\([20.97, 29.25]\\). Recall that this confidence interval is based on bootstrapping using pennies_sample. Note that the mean of pennies (21.152) does fall in this confidence interval. 
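Since we happen to know the population here, we can also verify this containment in code; a small sketch using the percentile_ci object computed earlier (the 2.5% and 97.5% column names are those returned by get_ci()):
mu <- pennies %>%
  summarize(mu = mean(age_in_2011)) %>%
  pull(mu)
mu >= percentile_ci$`2.5%` & mu <= percentile_ci$`97.5%`
# TRUE: 21.152 falls inside [20.972, 29.252]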
If we had a different sample of size 40 and constructed a confidence interval using the same method, would we be guaranteed that it contained the population parameter value as well? Let’s try it out: pennies_sample2 <- pennies %>% sample_n(size = 40) Note the use of the sample_n() function in the dplyr package here. This does the same thing as rep_sample_n(reps = 1) but omits the extra replicate column. We next create an infer pipeline to generate a percentile-based 95% confidence interval for \\(\\mu\\): percentile_ci2 <- pennies_sample2 %>% specify(formula = age_in_2011 ~ NULL) %>% generate(reps = 1000) %>% calculate(stat = "mean") %>% get_ci() Setting `type = "bootstrap"` in `generate()`. percentile_ci2 # A tibble: 1 x 2 `2.5%` `97.5%` <dbl> <dbl> 1 18.4 25.3 This new confidence interval also contains the value of \\(\\mu\\). Let’s further investigate by repeating this process 100 times to get 100 different confidence intervals derived from 100 different samples of pennies. Each sample will have size of 40 just as the original sample. We will plot each of these confidence intervals as horizontal lines. We will also show a line corresponding to the known population value of 21.152 years. Of the 100 confidence intervals based on samples of size \\(n = 40\\), 96 of them captured the population mean \\(\\mu = 21.152\\), whereas 4 of them did not include it. If we repeated this process of building confidence intervals more times with more samples, we’d expect 95% of them to contain the population mean. In other words, the procedure we have used to generate confidence intervals is “95% reliable” in that we can expect it to include the true population parameter 95% of the time if the process is repeated. To further accentuate this point, let’s perform a similar procedure using 90% confidence intervals instead. This time we will use the standard error method instead of the percentile method for computing the confidence intervals. Of the 100 confidence intervals based on samples of size \\(n = 40\\), 87 of them captured the population mean \\(\\mu = 21.152\\), whereas 13 of them did not include it. Repeating this process for more samples would result in us getting closer and closer to 90% of the confidence intervals including the true value. It is common to say while interpreting a confidence interval to be “95% confident” or “90% confident” that the true value falls within the range of the specified confidence interval. We will use this “confident” language throughout the rest of this chapter, but remember that it has more to do with a measure of reliability of the building process. Back to our pennies example After this elaboration on what the level corresponds to in a confidence interval, let’s conclude by providing an interpretation of the original confidence interval result we found in Subsection 9.3.1. Interpretation: We are 95% confident that the true mean age of pennies in circulation in 2011 is between 20.972 and 29.252 years. This level of confidence is based on the percentile-based method including the true mean 95% of the time if many different samples (not just the one we used) were collected and confidence intervals were created. 9.6 Example: One proportion Let’s revisit our exercise of trying to estimate the proportion of red balls in the bowl from Chapter 8. We are now interested in determining a confidence interval for population parameter \\(p\\), the proportion of balls that are red out of the total \\(N = 2400\\) red and white balls. 
We will use the first sample reported from Ilyas and Yohan in Subsection 8.1.3 for our point estimate. They observed 21 red balls out of the 50 in their shovel. This data is stored in the tactile_shovel1 data frame in the moderndive package. tactile_shovel1 # A tibble: 50 x 1 color <chr> 1 red 2 red 3 white 4 red 5 white 6 red 7 red 8 white 9 red 10 white # … with 40 more rows 9.6.1 Observed Statistic To compute the proportion that are red in this data we can use the specify() %>% calculate() workflow. Note the use of the success argument here to clarify which of the two colors "red" or "white" we are interested in. p_hat <- tactile_shovel1 %>% specify(formula = color ~ NULL, success = "red") %>% calculate(stat = "prop") p_hat # A tibble: 1 x 1 stat <dbl> 1 0.42 9.6.2 Bootstrap distribution Next we want to calculate many different bootstrap samples and their corresponding bootstrap statistic (the proportion of red balls). We’ve done 1000 in the past, but let’s go up to 10,000 now to better see the resulting distribution. Recall that this is done by including a generate() function call in the middle of our pipeline: tactile_shovel1 %>% specify(formula = color ~ NULL, success = "red") %>% generate(reps = 10000) Setting `type = "bootstrap"` in `generate()`. This results in 50 rows for each of the 10,000 replicates. Lastly, we finish the infer pipeline by adding back in the calculate() step. bootstrap_props <- tactile_shovel1 %>% specify(formula = color ~ NULL, success = "red") %>% generate(reps = 10000) %>% calculate(stat = "prop") Let’s visualize() what the resulting bootstrap distribution looks like as a histogram. We’ve adjusted the number of bins here as well to better see the resulting shape. bootstrap_props %>% visualize(bins = 25) We see that the resulting distribution is symmetric and bell-shaped so it doesn’t much matter which confidence interval method we choose. Let’s use the standard error method to create a 95% confidence interval. standard_error_ci <- bootstrap_props %>% get_ci(type = "se", level = 0.95, point_estimate = p_hat) standard_error_ci # A tibble: 1 x 2 lower upper <dbl> <dbl> 1 0.284 0.556 bootstrap_props %>% visualize(bins = 25, endpoints = standard_error_ci) We are 95% confident that the true proportion of red balls in the bowl is between 0.284 and 0.556. This level of confidence is based on the standard error-based method including the true proportion 95% of the time if many different samples (not just the one we used) were collected and confidence intervals were created. 9.6.3 Theory-based confidence intervals When the bootstrap distribution has the nice symmetric, bell shape that we saw in the red balls example above, we can also use a formula to quantify the standard error. This provides another way to compute a confidence interval, but is a little more tedious and mathematical. The steps are outlined below. We’ve also shown how we can use the confidence interval (CI) interpretation in this case as well to support your understanding of this tricky concept. 
Procedure for building a theory-based CI for \\(p\\) To construct a theory-based confidence interval for \\(p\\), the unknown true population proportion we Collect a sample of size \\(n\\) Compute \\(\\widehat{p}\\) Compute the standard error \\[\\text{SE} = \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] Compute the margin of error \\[\\text{MoE} = 1.96 \\cdot \\text{SE} = 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] Compute both end points of the confidence interval: The lower end point lower_ci: \\[\\widehat{p} - \\text{MoE} = \\widehat{p} - 1.96 \\cdot \\text{SE} = \\widehat{p} - 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] The upper end point upper_ci: \\[\\widehat{p} + \\text{MoE} = \\widehat{p} + 1.96 \\cdot \\text{SE} = \\widehat{p} + 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\] Alternatively, you can succinctly summarize a 95% confidence interval for \\(p\\) using the \\(\\pm\\) symbol: \\[ \\widehat{p} \\pm \\text{MoE} = \\widehat{p} \\pm 1.96 \\cdot \\text{SE} = \\widehat{p} \\pm 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}} \\] Confidence intervals based on 33 tactile samples Let’s load the tactile sampling data for the 33 groups from Chapter 8. Recall this data was saved in the tactile_prop_red data frame included in the moderndive package. tactile_prop_red Let’s now apply the above procedure for constructing confidence intervals for \\(p\\) using the data saved in tactile_prop_red by adding/modifying new columns using the dplyr package data wrangling tools seen in Chapter 4: Rename prop_red to p_hat, the official name of the sample proportion Make explicit the sample size n of \\(n=50\\) the standard error SE the margin of error MoE the left endpoint of the confidence interval lower_ci the right endpoint of the confidence interval upper_ci conf_ints <- tactile_prop_red %>% rename(p_hat = prop_red) %>% mutate( n = 50, SE = sqrt(p_hat * (1 - p_hat) / n), MoE = 1.96 * SE, lower_ci = p_hat - MoE, upper_ci = p_hat + MoE ) conf_ints TABLE 9.2: 33 confidence intervals from 33 tactile samples of size n=50 group red_balls p_hat n SE MoE lower_ci upper_ci Ilyas, Yohan 21 0.42 50 0.070 0.137 0.283 0.557 Morgan, Terrance 17 0.34 50 0.067 0.131 0.209 0.471 Martin, Thomas 21 0.42 50 0.070 0.137 0.283 0.557 Clark, Frank 21 0.42 50 0.070 0.137 0.283 0.557 Riddhi, Karina 18 0.36 50 0.068 0.133 0.227 0.493 Andrew, Tyler 19 0.38 50 0.069 0.135 0.245 0.515 Julia 19 0.38 50 0.069 0.135 0.245 0.515 Rachel, Lauren 11 0.22 50 0.059 0.115 0.105 0.335 Daniel, Caroline 15 0.30 50 0.065 0.127 0.173 0.427 Josh, Maeve 17 0.34 50 0.067 0.131 0.209 0.471 Emily, Emily 16 0.32 50 0.066 0.129 0.191 0.449 Conrad, Emily 18 0.36 50 0.068 0.133 0.227 0.493 Oliver, Erik 17 0.34 50 0.067 0.131 0.209 0.471 Isabel, Nam 21 0.42 50 0.070 0.137 0.283 0.557 X, Claire 15 0.30 50 0.065 0.127 0.173 0.427 Cindy, Kimberly 20 0.40 50 0.069 0.136 0.264 0.536 Kevin, James 11 0.22 50 0.059 0.115 0.105 0.335 Nam, Isabelle 21 0.42 50 0.070 0.137 0.283 0.557 Harry, Yuko 15 0.30 50 0.065 0.127 0.173 0.427 Yuki, Eileen 16 0.32 50 0.066 0.129 0.191 0.449 Ramses 23 0.46 50 0.070 0.138 0.322 0.598 Joshua, Elizabeth, Stanley 15 0.30 50 0.065 0.127 0.173 0.427 Siobhan, Jane 18 0.36 50 0.068 0.133 0.227 0.493 Jack, Will 16 0.32 50 0.066 0.129 0.191 0.449 Caroline, Katie 21 0.42 50 0.070 0.137 0.283 0.557 Griffin, Y 18 0.36 50 0.068 0.133 0.227 0.493 Kaitlin, Jordan 17 0.34 50 0.067 0.131 0.209 0.471 Ella, Garrett 18 0.36 50 0.068 0.133 0.227 0.493 Julie, Hailin 15 0.30 50 
0.065 0.127 0.173 0.427 Katie, Caroline 21 0.42 50 0.070 0.137 0.283 0.557 Mallory, Damani, Melissa 21 0.42 50 0.070 0.137 0.283 0.557 Katie 16 0.32 50 0.066 0.129 0.191 0.449 Francis, Vignesh 19 0.38 50 0.069 0.135 0.245 0.515 Let’s plot: These 33 confidence intervals for \\(p\\): from lower_ci to upper_ci The true population proportion \\(p = 900 / 2400 = 0.375\\) with a red vertical line FIGURE 9.2: 33 confidence intervals based on 33 tactile samples of size n=50 We see that: In 31 cases, the confidence intervals “capture” the true \\(p = 900 / 2400 = 0.375\\) In 2 cases, the confidence intervals do not “capture” the true \\(p = 900 / 2400 = 0.375\\) Thus, the confidence intervals capture the true proportion $31 / 33 = 93.939% of the time using this theory-based methodology. Confidence intervals based on 100 virtual samples Let’s say however, we repeated the above 100 times, not tactilely, but virtually. Let’s do this only 100 times instead of 1000 like we did before so that the results can fit on the screen. Again, the steps for compute a 95% confidence interval for \\(p\\) are: Collect a sample of size \\(n = 50\\) as we did in Chapter 8 Compute \\(\\widehat{p}\\): the sample proportion red of these \\(n=50\\) balls Compute the standard error \\(\\text{SE} = \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\) Compute the margin of error \\(\\text{MoE} = 1.96 \\cdot \\text{SE} = 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\) Compute both end points of the confidence interval: lower_ci: \\(\\widehat{p} - \\text{MoE} = \\widehat{p} - 1.96 \\cdot \\text{SE} = \\widehat{p} - 1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\) upper_ci: \\(\\widehat{p} + \\text{MoE} = \\widehat{p} + 1.96 \\cdot \\text{SE} = \\widehat{p} +1.96 \\cdot \\sqrt{\\frac{\\widehat{p}(1-\\widehat{p})}{n}}\\) Run the following three steps, being sure to View() the resulting data frame after each step so you can convince yourself of what’s going on: # First: Take 100 virtual samples of n=50 balls virtual_samples <- bowl %>% rep_sample_n(size = 50, reps = 100) # Second: For each virtual sample compute the proportion red virtual_prop_red <- virtual_samples %>% group_by(replicate) %>% summarize(red = sum(color == "red")) %>% mutate(prop_red = red / 50) # Third: Compute the 95% confidence interval as above virtual_prop_red <- virtual_prop_red %>% rename(p_hat = prop_red) %>% mutate( n = 50, SE = sqrt(p_hat*(1-p_hat)/n), MoE = 1.96 * SE, lower_ci = p_hat - MoE, upper_ci = p_hat + MoE ) Here are the results: FIGURE 9.3: 100 confidence intervals based on 100 virtual samples of size n=50 We see that of our 100 confidence intervals based on samples of size \\(n=50\\), 96 of them captured the true \\(p = 900/2400\\), whereas 4 of them missed. As we create more and more confidence intervals based on more and more samples, about 95% of these intervals will capture. In other words our procedure is “95% reliable.” Theoretical methods like this have largely been used in the past since we didn’t have the computing power to perform the simulation-based methods such as bootstrapping. They are still commonly used though and if the normality assumptions are met, they can provide a nice option for finding confidence intervals and performing hypothesis tests as we will see in Chapter 10. 9.7 Example: Comparing two proportions If you see someone else yawn, are you more likely to yawn? In an episode of the show Mythbusters, they tested the myth that yawning is contagious. 
The snippet from the show is available to view in the United States on the Discovery Network website here. More information about the episode is also available on IMDb here. Fifty adults who thought they were being considered for an appearance on the show were interviewed by a show recruiter (“confederate”) who either yawned or did not. Participants then sat by themselves in a large van and were asked to wait. While in the van, the Mythbusters watched via hidden camera to see if the unaware participants yawned. The data frame containing the results is available at mythbusters_yawn in the moderndive package. Let’s check it out. mythbusters_yawn # A tibble: 50 x 3 subj group yawn <int> <chr> <chr> 1 1 seed yes 2 2 control yes 3 3 seed no 4 4 seed yes 5 5 seed no 6 6 control no 7 7 seed yes 8 8 control no 9 9 control no 10 10 seed no # … with 40 more rows The participant ID is stored in the subj variable with values of 1 to 50. The group variable is either "seed" for when a confederate was trying to influence the participant or "control" if a confederate did not interact with the participant. The yawn variable is either "yes" if the participant yawned or "no" if the participant did not yawn. We can use the janitor package to get a glimpse into this data in a table format: mythbusters_yawn %>% tabyl(group, yawn) %>% adorn_percentages() %>% adorn_pct_formatting() %>% # To show original counts adorn_ns() group no yes control 75.0% (12) 25.0% (4) seed 70.6% (24) 29.4% (10) We are interested in comparing the proportion of those that yawned after seeing a seed versus those that yawned with no seed interaction. We’d like to see if the difference between these two proportions is significantly larger than 0. If so, we’d have evidence to support the claim that yawning is contagious based on this study. In looking over this problem, we can make note of some important details to include in our infer pipeline: We are calling a success having a yawn value of "yes". Our response variable will always correspond to the variable used in the success so the response variable is yawn. The explanatory variable is the other variable of interest here: group. To summarize, we are looking to see the examine the relationship between yawning and whether or not the participant saw a seed yawn or not. 9.7.1 Compute the point estimate mythbusters_yawn %>% specify(formula = yawn ~ group) Error: A level of the response variable `yawn` needs to be specified for the `success` argument in `specify()`. Note that the success argument must be specified in situations such as this where the response variable has only two levels. mythbusters_yawn %>% specify(formula = yawn ~ group, success = "yes") Response: yawn (factor) Explanatory: group (factor) # A tibble: 50 x 2 yawn group <fct> <fct> 1 yes seed 2 yes control 3 no seed 4 yes seed 5 no seed 6 no control 7 yes seed 8 no control 9 no control 10 no seed # … with 40 more rows We next want to calculate the statistic of interest for our sample. This corresponds to the difference in the proportion of successes. mythbusters_yawn %>% specify(formula = yawn ~ group, success = "yes") %>% calculate(stat = "diff in props") Error: Statistic is based on a difference; specify the `order` in which to subtract the levels of the explanatory variable. `order = c("first", "second")` means `("first" - "second")`. Check `?calculate` for details. We see another error here. 
To further check to make sure that R knows exactly what we are after, we need to provide the order in which R should subtract these proportions of successes. As the error message states, we’ll want to put "seed" first after c() and then "control": order = c("seed", "control"). Our point estimate is thus calculated: obs_diff <- mythbusters_yawn %>% specify(formula = yawn ~ group, success = "yes") %>% calculate(stat = "diff in props", order = c("seed", "control")) obs_diff # A tibble: 1 x 1 stat <dbl> 1 0.0441 This value represents the proportion of those that yawned after seeing a seed yawn (0.2941) minus the proportion of those that yawned with not seeing a seed (0.25). 9.7.2 Bootstrap distribution Our next step in building a confidence interval is to create a bootstrap distribution of statistics (differences in proportions of successes). We saw how it works with both a single variable in computing bootstrap means in Subsection 9.1.3 and in computing bootstrap proportions in Section 9.6, but we haven’t yet worked with bootstrapping involving multiple variables though. In the infer package, bootstrapping with multiple variables means that each row is potentially resampled. Let’s investigate this by looking at the first few rows of mythbusters_yawn: head(mythbusters_yawn) # A tibble: 6 x 3 subj group yawn <int> <chr> <chr> 1 1 seed yes 2 2 control yes 3 3 seed no 4 4 seed yes 5 5 seed no 6 6 control no When we bootstrap this data, we are potentially pulling the subject’s readings multiple times. Thus, we could see the entries of "seed" for group and "no" for yawn together in a new row in a bootstrap sample. This is further seen by exploring the sample_n() function in dplyr on this smaller 6 row data frame comprised of head(mythbusters_yawn). The sample_n() function can perform this bootstrapping procedure and is similar to the rep_sample_n() function in infer, except that it is not repeated but rather only performs one sample with or without replacement. set.seed(2019) head(mythbusters_yawn) %>% sample_n(size = 6, replace = TRUE) # A tibble: 6 x 3 subj group yawn <int> <chr> <chr> 1 5 seed no 2 5 seed no 3 2 control yes 4 4 seed yes 5 1 seed yes 6 1 seed yes We can see that in this bootstrap sample generated from the first six rows of mythbusters_yawn, we have some rows repeated. The same is true when we perform the generate() step in infer as done below. bootstrap_distribution <- mythbusters_yawn %>% specify(formula = yawn ~ group, success = "yes") %>% generate(reps = 1000) %>% calculate(stat = "diff in props", order = c("seed", "control")) Setting `type = "bootstrap"` in `generate()`. bootstrap_distribution %>% visualize(bins = 20) This distribution is roughly symmetric and bell-shaped but isn’t quite there. Let’s use the percentile-based method to compute a 95% confidence interval for the true difference in the proportion of those that yawn with and without a seed presented. The arguments are explicitly listed here but remember they are the defaults and simply get_ci() can be used. bootstrap_distribution %>% get_ci(type = "percentile", level = 0.95) # A tibble: 1 x 2 `2.5%` `97.5%` <dbl> <dbl> 1 -0.219 0.293 The confidence interval shown here includes the value of 0. We’ll see in Chapter 10 further what this means in terms of this difference being statistically significant or not, but let’s examine a bit here first. The range of plausible values for the difference in the proportion of that that yawned with and without a seed is between -0.219 and 0.293. 
Therefore, we are not sure which proportion is larger. Some of the bootstrap statistics showed the proportion without a seed to be higher and others showed the proportion with a seed to be higher. If the confidence interval was entirely above zero, we would be relatively sure (about “95% confident”) that the seed group had a higher proportion of yawning than the control group. Note that this all relates to the importance of denoting the order argument in the calculate() function. Since we specified "seed" and then "control" positive values for the statistic correspond to the "seed" proportion being higher, whereas negative values correspond to the "control" group being higher. We, therefore, have evidence via this confidence interval suggesting that the conclusion from the Mythbusters show that “yawning is contagious” being “confirmed” is not statistically appropriate. Learning check Practice problems to come soon! 9.8 Conclusion 9.8.1 What’s to come? This chapter introduced the notions of bootstrapping and confidence intervals as ways to build intuition about population parameters using only the original sample information. We also concluded with a glimpse into statistical significance and we’ll dig much further into this in Chapter 10 up next! 9.8.2 Script of R code An R script file of all R code used in this chapter is available here. References "], ["10-hypothesis-testing.html", "Chapter 10 Hypothesis Testing 10.1 When inference is not needed 10.2 Basics of hypothesis testing 10.3 Criminal trial analogy 10.4 Types of errors in hypothesis testing 10.5 Statistical significance 10.6 Hypothesis testing with infer 10.7 Example: Comparing two means 10.8 Building theory-based methods using computation 10.9 Conclusion", " Chapter 10 Hypothesis Testing In preparation for our first print edition to be published by CRC Press in Fall 2019, we’re remodeling this chapter a bit. Don’t expect major changes in content, but rather only minor changes in presentation. Our remodeling will be complete and available online at ModernDive.com by early Summer 2019! We saw some of the main concepts of hypothesis testing introduced in Chapters 8 and 9. We will expand further on these ideas here and also provide a framework for understanding hypothesis tests in general. Instead of presenting you with lots of different formulas and scenarios, we hope to build a way to think about all hypothesis tests. You can then adapt to different scenarios as needed down the road when you encounter different statistical situations. The same can be said for confidence intervals. There was one general framework that applies to all confidence intervals and we elaborated on this using the infer package pipeline in Chapter 9. The specifics may change slightly for each variation, but the important idea is to understand the general framework so that you can apply it to more specific problems. We believe that this approach is much better in the long-term than teaching you specific tests and confidence intervals rigorously. You can find fully-worked out examples for five common hypothesis tests and their corresponding confidence intervals in Appendix B. We recommend that you carefully review these examples as they also cover how the general frameworks apply to traditional normal-based methodologies like the \\(t\\)-test and normal-theory confidence intervals. You’ll see there that these methods are just approximations for the general computational frameworks, but require conditions to be met for their results to be valid. 
The general frameworks using randomization, simulation, and bootstrapping do not hold the same sorts of restrictions and further advance computational thinking, which is one big reason for their emphasis throughout this textbook. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section 2.3 for information on how to install and load R packages. library(dplyr) library(ggplot2) library(infer) library(nycflights13) library(ggplot2movies) library(broom) 10.1 When inference is not needed Before we delve into hypothesis testing, it’s good to remember that there are cases where you need not perform a rigorous statistical inference. An important and time-saving skill is to ALWAYS do exploratory data analysis using dplyr and ggplot2 before thinking about running a hypothesis test. Let’s look at such an example selecting a sample of flights traveling to Boston and to San Francisco from New York City in the flights data frame in the nycflights13 package. (We will remove flights with missing data first using na.omit and then sample 100 flights going to each of the two airports.) bos_sfo <- flights %>% na.omit() %>% filter(dest %in% c("BOS", "SFO")) %>% group_by(dest) %>% sample_n(100) Suppose we were interested in seeing if the air_time to SFO in San Francisco was statistically greater than the air_time to BOS in Boston.
As suggested, let’s begin with some exploratory data analysis to get a sense for how the two variables of air_time and dest relate for these two destination airports: bos_sfo_summary <- bos_sfo %>% group_by(dest) %>% summarize(mean_time = mean(air_time), sd_time = sd(air_time)) bos_sfo_summary # A tibble: 2 x 3 dest mean_time sd_time <chr> <dbl> <dbl> 1 BOS 39.0 4.51 2 SFO 349. 18.7 Looking at these results, we can clearly see that SFO air_time is much larger than BOS air_time. The standard deviation is also extremely informative here. Learning check (LC10.1) Could we make the same type of immediate conclusion that SFO had a statistically greater air_time if, say, its corresponding standard deviation was 200 minutes? What about 100 minutes? Explain. To further understand just how different the air_time variable is for BOS and SFO, let’s look at a boxplot: ggplot(data = bos_sfo, mapping = aes(x = dest, y = air_time)) + geom_boxplot() Since there is no overlap at all, we can conclude that the air_time for San Francisco flights is statistically greater (at any level of significance) than the air_time for Boston flights. This is a clear example of not needing to do anything more than some simple exploratory data analysis with descriptive statistics and data visualization to get an appropriate inferential conclusion. This is one reason why you should ALWAYS investigate the sample data first using dplyr and ggplot2 via exploratory data analysis. As you get more and more practice with hypothesis testing, you’ll be better able to determine in many cases whether or not the results will be statistically significant. There are circumstances where it is difficult to tell, but you should always try to make a guess FIRST about significance after you have completed your data exploration and before you actually begin the inferential techniques. 10.2 Basics of hypothesis testing In a hypothesis test, we will use data from a sample to help us decide between two competing hypotheses about a population. We make these hypotheses more concrete by specifying them in terms of at least one population parameter of interest. We refer to the competing claims about the population as the null hypothesis, denoted by \\(H_0\\), and the alternative (or research) hypothesis, denoted by \\(H_a\\). The roles of these two hypotheses are NOT interchangeable. The claim for which we seek significant evidence is assigned to the alternative hypothesis. The alternative is usually what the experimenter or researcher wants to establish or find evidence for. Usually, the null hypothesis is a claim that there really is “no effect” or “no difference.” In many cases, the null hypothesis represents the status quo or that nothing interesting is happening. We assess the strength of evidence by assuming the null hypothesis is true and determining how unlikely it would be to see sample results/statistics as extreme (or more extreme) as those in the original sample. Hypothesis testing brings about many weird and incorrect notions in the scientific community and society at large. One reason for this is that statistics has traditionally been thought of as this magic box of algorithms and procedures to get to results and this has been readily apparent if you do a Google search of “flowchart statistics hypothesis tests.” There are so many different complex ways to determine which test is appropriate. You’ll see that we don’t need to rely on these complicated series of assumptions and procedures to conduct a hypothesis test any longer. 
These methods were introduced in a time when computers weren’t powerful. Your cellphone (in 2016) has more power than the computers that sent NASA astronauts to the moon after all. We’ll see that ALL hypothesis tests can be broken down into the following framework given by Allen Downey here: FIGURE 10.1: Hypothesis Testing Framework Before we hop into this framework, we will provide another way to think about hypothesis testing that may be useful. 10.3 Criminal trial analogy We can think of hypothesis testing in the same context as a criminal trial in the United States. A criminal trial in the United States is a familiar situation in which a choice between two contradictory claims must be made. The accused must be judged either guilty or not guilty. Under the U.S. system of justice, the individual on trial is initially presumed not guilty. Only STRONG EVIDENCE to the contrary causes the not guilty claim to be rejected in favor of a guilty verdict. The phrase “beyond a reasonable doubt” is often used to set the cutoff value for when enough evidence has been given to convict. Theoretically, we should never say “The person is innocent,” but instead “There is not sufficient evidence to show that the person is guilty.” Now let’s compare that to how we look at a hypothesis test. The decision about the population parameter(s) must be judged to follow one of two hypotheses. We initially assume that \\(H_0\\) is true. The null hypothesis \\(H_0\\) will be rejected (in favor of \\(H_a\\)) only if the sample evidence strongly suggests that \\(H_0\\) is false. If the sample does not provide such evidence, \\(H_0\\) will not be rejected. The analogy to “beyond a reasonable doubt” in hypothesis testing is what is known as the significance level. This will be set before conducting the hypothesis test and is denoted as \\(\\alpha\\). Common values for \\(\\alpha\\) are 0.1, 0.01, and 0.05. 10.3.1 Two possible conclusions Therefore, we have two possible conclusions with hypothesis testing: Reject \\(H_0\\) Fail to reject \\(H_0\\) Gut instinct says that “Fail to reject \\(H_0\\)” should say “Accept \\(H_0\\)” but this technically is not correct. Accepting \\(H_0\\) is the same as saying that a person is innocent. We cannot show that a person is innocent; we can only say that there was not enough substantial evidence to find the person guilty. When you run a hypothesis test, you are the jury of the trial. You decide whether there is enough evidence to convince yourself that \\(H_a\\) is true (“the person is guilty”) or that there was not enough evidence to convince yourself \\(H_a\\) is true (“the person is not guilty”). You must convince yourself (using statistical arguments) which hypothesis is the correct one given the sample information. Important note: Therefore, DO NOT WRITE “Accept \\(H_0\\)” any time you conduct a hypothesis test. Instead write “Fail to reject \\(H_0\\).” 10.4 Types of errors in hypothesis testing Unfortunately, just as a jury or a judge can make an incorrect decision in regards to a criminal trial by reaching the wrong verdict, there is some chance we will reach the wrong conclusion via a hypothesis test about a population parameter. As with criminal trials, this comes from the fact that we don’t have complete information, but rather a sample from which to try to infer about a population. The possible erroneous conclusions in a criminal trial are an innocent person is convicted (found guilty) or a guilty person is set free (found not guilty). 
The possible errors in a hypothesis test are rejecting \\(H_0\\) when in fact \\(H_0\\) is true (Type I Error) or failing to reject \\(H_0\\) when in fact \\(H_0\\) is false (Type II Error). The risk of error is the price researchers pay for basing an inference about a population on a sample. With any reasonable sample-based procedure, there is some chance that a Type I error will be made and some chance that a Type II error will occur. To help understand the concepts of Type I error and Type II error, observe the following table: FIGURE 10.2: Type I and Type II errors If we are using sample data to make inferences about a parameter, we run the risk of making a mistake. Obviously, we want to minimize our chance of error; we want a small probability of drawing an incorrect conclusion. The probability of a Type I Error occurring is denoted by \\(\\alpha\\) and is called the significance level of a hypothesis test The probability of a Type II Error is denoted by \\(\\beta\\). Formally, we can define \\(\\alpha\\) and \\(\\beta\\) in regards to the table above, but for hypothesis tests instead of a criminal trial. \\(\\alpha\\) corresponds to the probability of rejecting \\(H_0\\) when, in fact, \\(H_0\\) is true. \\(\\beta\\) corresponds to the probability of failing to reject \\(H_0\\) when, in fact, \\(H_0\\) is false. Ideally, we want \\(\\alpha = 0\\) and \\(\\beta = 0\\), meaning that the chance of making an error does not exist. When we have to use incomplete information (sample data), it is not possible to have both \\(\\alpha = 0\\) and \\(\\beta = 0\\). We will always have the possibility of at least one error existing when we use sample data. Usually, what is done is that \\(\\alpha\\) is set before the hypothesis test is conducted and then the evidence is judged against that significance level. Common values for \\(\\alpha\\) are 0.05, 0.01, and 0.10. If \\(\\alpha = 0.05\\), we are using a testing procedure that, used over and over with different samples, rejects a TRUE null hypothesis five percent of the time. So if we can set \\(\\alpha\\) to be whatever we want, why choose 0.05 instead of 0.01 or even better 0.0000000000000001? Well, a small \\(\\alpha\\) means the test procedure requires the evidence against \\(H_0\\) to be very strong before we can reject \\(H_0\\). This means we will almost never reject \\(H_0\\) if \\(\\alpha\\) is very small. If we almost never reject \\(H_0\\), the probability of a Type II Error – failing to reject \\(H_0\\) when we should – will increase! Thus, as \\(\\alpha\\) decreases, \\(\\beta\\) increases and as \\(\\alpha\\) increases, \\(\\beta\\) decreases. We, therefore, need to strike a balance in \\(\\alpha\\) and \\(\\beta\\) and the common values for \\(\\alpha\\) of 0.05, 0.01, and 0.10 usually lead to a nice balance. Learning check (LC10.2) Reproduce the table above about errors, but for a hypothesis test, instead of the one provided for a criminal trial. 10.4.1 Logic of hypothesis testing Take a random sample (or samples) from a population (or multiple populations) If the sample data are consistent with the null hypothesis, do not reject the null hypothesis. If the sample data are inconsistent with the null hypothesis (in the direction of the alternative hypothesis), reject the null hypothesis and conclude that there is evidence the alternative hypothesis is true (based on the particular sample collected). 
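To make the trade-off between \\(\\alpha\\) and \\(\\beta\\) concrete, here is a minimal simulation sketch of our own (not part of the original text). The normal populations, the true difference in means of 0.5, the sample size of 30, and the use of base R’s t.test() are all illustrative assumptions:

set.seed(2019)
# Simulate 10,000 studies in which H0 is actually false: the true means differ by 0.5
p_values <- replicate(10000, {
  x <- rnorm(30, mean = 0)
  y <- rnorm(30, mean = 0.5)
  t.test(x, y)$p.value
})
# Estimated Type II error rate (beta) at two significance levels
mean(p_values >= 0.05)  # beta when alpha = 0.05
mean(p_values >= 0.01)  # beta when alpha = 0.01

The second estimate comes out larger, matching the statement above that shrinking \\(\\alpha\\) raises \\(\\beta\\).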
10.5 Statistical significance The idea that sample results are more extreme than we would reasonably expect to see by random chance if the null hypothesis were true is the fundamental idea behind statistical hypothesis tests. If data at least as extreme would be very unlikely if the null hypothesis were true, we say the data are statistically significant. Statistically significant data provide convincing evidence against the null hypothesis in favor of the alternative, and allow us to generalize our sample results to the claim about the population. Learning check (LC10.3) What is wrong about saying “The defendant is innocent.” based on the US system of criminal trials? (LC10.4) What is the purpose of hypothesis testing? (LC10.5) What are some flaws with hypothesis testing? How could we alleviate them? 10.6 Hypothesis testing with infer The “There is Only One Test” diagram mentioned in Section 10.2 was the inspiration for the infer pipeline that you saw for confidence intervals in Chapter 9. For hypothesis tests, we include one more verb into the pipeline: the hypothesize() verb. Its main argument is null, which is either "point" for point hypotheses involving a single sample or "independence" for testing for independence between two variables. We’ll first explore the two variable case by comparing two means. Note the section headings here that refer to the “There is Only One Test” diagram. We will lay out the specifics for each problem using this framework and the infer pipeline together. 10.7 Example: Comparing two means 10.7.1 Randomization/permutation We will now focus on building hypotheses looking at the difference between two population means in an example. We will denote population means using the Greek symbol \\(\\mu\\) (pronounced “mu”). Thus, we will be looking to see if one group “out-performs” another group. This is quite possibly the most common type of statistical inference and serves as a basis for many other types of analyses when comparing the relationship between two variables. Our null hypothesis will be of the form \\(H_0: \\mu_1 = \\mu_2\\), which can also be written as \\(H_0: \\mu_1 - \\mu_2 = 0\\). Our alternative hypothesis will be of the form \\(H_a: \\mu_1 \\star \\mu_2\\) (or \\(H_a: \\mu_1 - \\mu_2 \\, \\star \\, 0\\)) where \\(\\star\\) = \\(<\\), \\(\\ne\\), or \\(>\\) depending on the context of the problem. You needn’t focus on these new symbols too much at this point. They will just be a shortcut way for us to describe our hypotheses. As we saw in Chapter 9, bootstrapping is a valuable tool when conducting inferences based on one or two population variables. We will see that the process of randomization (also known as permutation) will be valuable in conducting tests comparing quantitative values from two groups. 10.7.2 Comparing action and romance movies The movies dataset in the ggplot2movies package contains information on a large number of movies that have been rated by users of IMDB.com (Wickham 2015). We are interested in the question here of whether Action movies are rated higher on IMDB than Romance movies. We will first need to do a little bit of data wrangling using the ideas from Chapter 4 to get the data in the form that we would like: movies_trimmed <- movies %>% select(title, year, rating, Action, Romance) Note that Action and Romance are binary variables here. 
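Before removing that overlap in the next step, it can help to see how many movies fall into each combination of these binary variables. Here is a quick sketch of our own using dplyr’s count() (this check is not code from the text):

movies_trimmed %>% 
  count(Action, Romance)

The row with Action equal to 1 and Romance equal to 1 contains the overlapping movies that get removed next.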
To avoid any overlap of movies (and potential confusion), we will remove movies that are both Action and Romance from our population: movies_trimmed <- movies_trimmed %>% filter(!(Action == 1 & Romance == 1)) We will now create a new variable called genre that specifies whether a movie in our movies_trimmed data frame is an "Action" movie, a "Romance" movie, or "Neither". We aren’t really interested in the "Neither" category here so we will exclude those rows as well. Lastly, the Action and Romance columns are not needed anymore since they are encoded in the genre column. movies_trimmed <- movies_trimmed %>% mutate(genre = case_when(Action == 1 ~ "Action", Romance == 1 ~ "Romance", TRUE ~ "Neither")) %>% filter(genre != "Neither") %>% select(-Action, -Romance) The case_when function is useful for assigning values in a new variable based on the values of another variable. The last step of TRUE ~ "Neither" is used when a particular movie is not set to either Action or Romance. We are left with 8878 movies in our population dataset that focuses on only "Action" and "Romance" movies. Learning check (LC10.6) Why are the different genre variables stored as binary variables (1s and 0s) instead of just listing the genre as a column of values like “Action”, “Comedy”, etc.? (LC10.7) What complications could come about with us excluding movies that are both action and romance? Should we question the results of our hypothesis test? Explain. Let’s now visualize the distributions of rating across both levels of genre. Think about what type(s) of plot is/are appropriate here before you proceed: ggplot(data = movies_trimmed, aes(x = genre, y = rating)) + geom_boxplot() FIGURE 10.3: Rating vs genre in the population We can see that the middle 50% of ratings for "Action" movies is more spread out than that of "Romance" movies in the population. "Romance" has outliers at both the top and bottom of the scale though. We are initially interested in comparing the mean rating across these two groups so a faceted histogram may also be useful: ggplot(data = movies_trimmed, mapping = aes(x = rating)) + geom_histogram(binwidth = 1, color = "white") + facet_grid(genre ~ .) FIGURE 10.4: Faceted histogram of genre vs rating Important note: Remember that we hardly ever have access to the population values as we do here. This example and the nycflights13 dataset were used to create a common flow from chapter to chapter. In nearly all circumstances, we’ll be needing to use only a sample of the population to try to infer conclusions about the unknown population parameter values. These examples do show a nice relationship between statistics (where data is usually small and more focused on experimental settings) and data science (where data is frequently large and collected without experimental conditions). 10.7.3 Sampling \\(\\rightarrow\\) randomization We can use hypothesis testing to determine, for example, whether a treatment has an effect over a control, and more generally to statistically analyze whether one group performs better than, worse than, or differently than another. We are interested here in seeing how we can use a random sample of action movies and a random sample of romance movies from movies to determine if a statistical difference exists in the mean ratings of each group. Learning check (LC10.8) Define the relevant parameters here in terms of the populations of movies. 10.7.4 Data Let’s select a random sample of 34 action movies and a random sample of 34 romance movies. 
(The number 34 was chosen somewhat arbitrarily here.) set.seed(2017) movies_genre_sample <- movies_trimmed %>% group_by(genre) %>% sample_n(34) %>% ungroup() Note the addition of the ungroup() function here. This will be useful shortly in allowing us to permute the values of rating across genre. Our analysis does not work without this ungroup() function since the data stays grouped by the levels of genre without it. We can now observe the distributions of our two sample ratings for both groups. Remember that these plots should be rough approximations of our population distributions of movie ratings for "Action" and "Romance" in our population of all movies in the movies data frame. ggplot(data = movies_genre_sample, aes(x = genre, y = rating)) + geom_boxplot() FIGURE 10.5: Genre vs rating for our sample ggplot(data = movies_genre_sample, mapping = aes(x = rating)) + geom_histogram(binwidth = 1, color = "white") + facet_grid(genre ~ .) FIGURE 10.6: Genre vs rating for our sample as faceted histogram Learning check (LC10.9) What single value could we change to improve the approximation using the sample distribution on the population distribution? Do we have reason to believe, based on the sample distributions of rating over the two groups of genre, that there is a significant difference between the mean rating for action movies compared to romance movies? It’s hard to say just based on the plots. The boxplot does show that the median sample rating is higher for romance movies, but the histogram isn’t as clear. The two groups have somewhat differently shaped distributions but they are both over similar values of rating. It’s often useful to calculate the mean and standard deviation as well, conditioned on the two levels. summary_ratings <- movies_genre_sample %>% group_by(genre) %>% summarize(mean = mean(rating), std_dev = sd(rating), n = n()) summary_ratings # A tibble: 2 x 4 genre mean std_dev n <chr> <dbl> <dbl> <int> 1 Action 5.11 1.49 34 2 Romance 6.06 1.15 34 Learning check (LC10.10) Why did we not specify na.rm = TRUE here as we did in Chapter 4? We see that the sample mean rating for romance movies, \\(\\bar{x}_{r}\\), is greater than the similar measure for action movies, \\(\\bar{x}_a\\). But is it statistically significantly greater (thus, leading us to conclude that the means are statistically different)? The standard deviation can provide some insight here but with these standard deviations being so similar it’s still hard to say for sure. Learning check (LC10.11) Why might the standard deviation provide some insight about the means being statistically different or not? 10.7.5 Model of \\(H_0\\) The hypotheses we specified can also be written in another form to better give us an idea of what we will be simulating to create our null distribution. \\(H_0: \\mu_r - \\mu_a = 0\\) \\(H_a: \\mu_r - \\mu_a \\ne 0\\) 10.7.6 Test statistic \\(\\delta\\) We are, therefore, interested in seeing whether the difference in the sample means, \\(\\bar{x}_r - \\bar{x}_a\\), is statistically different than 0. We can now come back to our infer pipeline for computing our observed statistic. Note the order argument that shows the mean value for "Action" being subtracted from the mean value of "Romance". 
10.7.7 Observed effect \\(\\delta^*\\) obs_diff <- movies_genre_sample %>% specify(formula = rating ~ genre) %>% calculate(stat = "diff in means", order = c("Romance", "Action")) obs_diff # A tibble: 1 x 1 stat <dbl> 1 0.95 Our goal next is to figure out a random process with which to simulate the null hypothesis being true. Recall that \\(H_0: \\mu_r - \\mu_a = 0\\) corresponds to us assuming that the population means are the same. We would like to assume this is true and perform a random process to generate() data in the model of the null hypothesis. 10.7.8 Simulated data Tactile simulation Here, with us assuming the two population means are equal (\\(H_0: \\mu_r - \\mu_a = 0\\)), we can look at this from a tactile point of view by using index cards. There are \\(n_r = 34\\) data elements corresponding to romance movies and \\(n_a = 34\\) for action movies. We can write the 34 ratings from our sample for romance movies on one set of 34 index cards and the 34 ratings for action movies on another set of 34 index cards. (Note that the sample sizes need not be the same.) The next step is to put the two stacks of index cards together, creating a new set of 68 cards. If we assume that the two population means are equal, we are saying that there is no association between ratings and genre (romance vs action). We can use the index cards to create two new stacks for romance and action movies. Note that the new “romance movie stack” will likely have some of the original action movies in it and likewise for the “action movie stack” including some romance movies from our original set. Since we are assuming that each card is equally likely to have appeared in either one of the stacks this makes sense. First, we must shuffle all the cards thoroughly. After doing so, in this case with equal values of sample sizes, we split the deck in half. We then calculate the new sample mean rating of the romance deck, and also the new sample mean rating of the action deck. This creates one simulation of the samples that were collected originally. We next want to calculate a statistic from these two samples. Instead of actually doing the calculation using index cards, we can use R as we have before to simulate this process. Let’s do this just once and compare the results to what we see in movies_genre_sample. movies_genre_sample %>% specify(formula = rating ~ genre) %>% hypothesize(null = "independence") %>% generate(reps = 1) %>% calculate(stat = "diff in means", order = c("Romance", "Action")) # A tibble: 1 x 1 stat <dbl> 1 0.515 Learning check (LC10.12) How would the tactile shuffling of index cards change if we had different samples of say 20 action movies and 60 romance movies? Describe each step that would change. (LC10.13) Why are we taking the difference in the means of the cards in the new shuffled decks? 10.7.9 Distribution of \\(\\delta\\) under \\(H_0\\) The generate() step completes a permutation sending values of ratings to potentially different values of genre from which they originally came. It simulates a shuffling of the ratings between the two levels of genre just as we could have done with index cards. We can now proceed in a similar way to what we have done previously with bootstrapping by repeating this process many times to create simulated samples, assuming the null hypothesis is true. 
generated_samples <- movies_genre_sample %>% specify(formula = rating ~ genre) %>% hypothesize(null = "independence") %>% generate(reps = 5000) A null distribution of simulated differences in sample means is created with the specification of stat = "diff in means" for the calculate() step. The null distribution is similar to the bootstrap distribution we saw in Chapter 9, but remember that it consists of statistics generated assuming the null hypothesis is true. We can now plot the distribution of these simulated differences in means: null_distribution_two_means %>% visualize() FIGURE 10.7: Simulated differences in means histogram 10.7.10 The p-value Remember that we are interested in seeing where our observed sample mean difference of 0.95 falls on this null/randomization distribution. We are interested in simply a difference here so “more extreme” corresponds to values in both tails on the distribution. Let’s shade our null distribution to show a visual representation of our \\(p\\)-value: null_distribution_two_means %>% visualize(obs_stat = obs_diff, direction = "both") FIGURE 10.8: Shaded histogram to show p-value Remember that the observed difference in means was 0.95. We have shaded red all values at or above that value and also shaded red those values at or below its negative value (since this is a two-tailed test). By giving obs_stat = obs_diff a vertical darker line is also shown at 0.95. To better estimate how large the \\(p\\)-value will be, we also increase the number of bins to 100 here from 20: null_distribution_two_means %>% visualize(bins = 100, obs_stat = obs_diff, direction = "both") FIGURE 10.9: Histogram with vertical lines corresponding to observed statistic At this point, it is important to take a guess as to what the \\(p\\)-value may be. We can see that there are only a few permuted differences as extreme or more extreme than our observed effect (in both directions). Maybe we guess that this \\(p\\)-value is somewhere around 2%, or maybe 3%, but certainly not 30% or more. Lastly, we calculate the \\(p\\)-value directly using infer: pvalue <- null_distribution_two_means %>% get_pvalue(obs_stat = obs_diff, direction = "both") pvalue # A tibble: 1 x 1 p_value <dbl> 1 0.006 We have around 0.6% of values as extreme or more extreme than our observed statistic in both directions. Assuming we are using a 5% significance level for \\(\\alpha\\), we have evidence supporting the conclusion that the mean rating for romance movies is different from that of action movies. The next important idea is to better understand just how much higher of a mean rating can we expect the romance movies to have compared to that of action movies. 10.7.11 Corresponding confidence interval One of the great things about the infer pipeline is that going between hypothesis tests and confidence intervals is incredibly simple. 
To create a null distribution, we ran null_distribution_two_means <- movies_genre_sample %>% specify(formula = rating ~ genre) %>% hypothesize(null = "independence") %>% generate(reps = 5000) %>% calculate(stat = "diff in means", order = c("Romance", "Action")) To get the corresponding bootstrap distribution with which we can compute a confidence interval, we can just remove or comment out the hypothesize() step since we are no longer assuming the null hypothesis is true when we bootstrap: percentile_ci_two_means <- movies_genre_sample %>% specify(formula = rating ~ genre) %>% # hypothesize(null = "independence") %>% generate(reps = 5000) %>% calculate(stat = "diff in means", order = c("Romance", "Action")) %>% get_ci() Setting `type = "bootstrap"` in `generate()`. percentile_ci_two_means # A tibble: 1 x 2 `2.5%` `97.5%` <dbl> <dbl> 1 0.333 1.59 Thus, we can expect the true mean rating of Romance movies on IMDB to be 0.333 to 1.593 points higher than that of Action movies. Remember that this is based on bootstrapping using movies_genre_sample as our original sample and the confidence interval process being 95% reliable. Learning check (LC10.14) Conduct the same analysis comparing action movies versus romantic movies using the median rating instead of the mean rating. What was different and what was the same? (LC10.15) What conclusions can you make from viewing the faceted histogram looking at rating versus genre that you couldn’t see when looking at the boxplot? (LC10.16) Describe in a paragraph how we used Allen Downey’s diagram to conclude if a statistical difference existed between mean movie ratings for action and romance movies. (LC10.17) Why are we relatively confident that the distributions of the sample ratings will be good approximations of the population distributions of ratings for the two genres? (LC10.18) Using the definition of “\\(p\\)-value”, write in words what the \\(p\\)-value represents for the hypothesis test above comparing the mean rating of romance to action movies. (LC10.19) What is the value of the \\(p\\)-value for the hypothesis test comparing the mean rating of romance to action movies? (LC10.20) Do the results of the hypothesis test match up with the original plots we made looking at the population of movies? Why or why not? 10.7.12 Summary To review, these are the steps to take whenever you’d like to do a hypothesis test comparing values from the distributions of two groups: Simulate many samples using a random process that matches the way the original data were collected and that assumes the null hypothesis is true. Collect the values of a sample statistic for each sample created using this random process to build a null distribution. Assess the significance of the original sample by determining where its sample statistic lies in the null distribution. If the proportion of values as extreme or more extreme than the observed statistic in the randomization distribution is smaller than the pre-determined significance level \\(\\alpha\\), we reject \\(H_0\\). Otherwise, we fail to reject \\(H_0\\). (If no significance level is given, one can assume \\(\\alpha = 0.05\\).) 10.8 Building theory-based methods using computation As a point of reference, we will now discuss the traditional theory-based way to conduct the hypothesis test for determining if there is a statistically significant difference in the sample mean rating of Action movies versus Romance movies. 
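As an aside, here is a hedged sketch of how this traditional test is often run directly with base R’s t.test() on the same sample; it is shown only for reference and is not part of the infer pipeline emphasized in this book:

# Welch two-sample t-test on the same 68 sampled movies.
# Note: R orders the groups alphabetically ("Action" before "Romance"),
# so the sign of the reported difference can be flipped relative to our
# order = c("Romance", "Action") convention above.
t.test(rating ~ genre, data = movies_genre_sample)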
This method and ones like it work very well when the assumptions needed to run the test are met. They are based on probability models and distributions such as the normal and \\(t\\)-distributions. These traditional methods have been used for many decades, dating back to the time when researchers didn’t have access to computers that could run 5000 simulations in a few seconds. They had to base their methods on probability theory instead. Many fields and researchers continue to use these methods and that is the biggest reason for their inclusion here. It’s important to remember that a \\(t\\)-test or a \\(z\\)-test is really just an approximation of what you have seen in this chapter already using simulation and randomization. The focus here is on understanding how the shape of the \\(t\\)-curve comes about without digging deep into the mathematical underpinnings. 10.8.1 Example: \\(t\\)-test for two independent samples What is commonly done in statistics is the process of standardization. What this entails is calculating the mean and standard deviation of a variable. Then you subtract the mean from each value of your variable and divide by the standard deviation. The most common standardization is known as the \\(z\\)-score. The formula for a \\(z\\)-score is \\[Z = \\frac{x - \\mu}{\\sigma},\\] where \\(x\\) represents the value of a variable, \\(\\mu\\) represents the mean of the variable, and \\(\\sigma\\) represents the standard deviation of the variable. Thus, if your variable has 10 elements, each one has a corresponding \\(z\\)-score that gives how many standard deviations away that value is from its mean. If the variable itself follows a normal distribution, its \\(z\\)-scores are normally distributed with mean 0 and standard deviation 1. They have the common, bell-shaped pattern seen below. Recall that we hardly ever know the mean and standard deviation of the population of interest. This is almost always the case when considering the means of two independent groups. To help account for us not knowing the population parameter values, we can use the sample statistics instead, but this comes with a bit of a price in terms of complexity. Another form of standardization occurs when we need to use the sample standard deviations as estimates for the unknown population standard deviations. This standardization is often called the \\(t\\)-score. For the two independent samples case like what we have for comparing action movies to romance movies, the formula is \\[T =\\dfrac{ (\\bar{x}_1 - \\bar{x}_2) - (\\mu_1 - \\mu_2)}{ \\sqrt{\\dfrac{{s_1}^2}{n_1} + \\dfrac{{s_2}^2}{n_2}} }\\] There is a lot to try to unpack here. \\(\\bar{x}_1\\) is the sample mean response of the first group \\(\\bar{x}_2\\) is the sample mean response of the second group \\(\\mu_1\\) is the population mean response of the first group \\(\\mu_2\\) is the population mean response of the second group \\(s_1\\) is the sample standard deviation of the response of the first group \\(s_2\\) is the sample standard deviation of the response of the second group \\(n_1\\) is the sample size of the first group \\(n_2\\) is the sample size of the second group Assuming that the null hypothesis is true (\\(H_0: \\mu_1 - \\mu_2 = 0\\)), \\(T\\) is said to be distributed following a \\(t\\) distribution with degrees of freedom equal to the smaller value of \\(n_1 - 1\\) and \\(n_2 - 1\\). The “degrees of freedom” can be thought of as measuring how different the \\(t\\) distribution will be compared to a normal distribution. 
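To see visually what the degrees of freedom control, here is a small illustrative sketch of our own (the choices of 5 and 30 degrees of freedom are arbitrary) overlaying \\(t\\)-curves on a standard normal curve using ggplot2:

ggplot(data.frame(x = c(-4, 4)), aes(x = x)) +
  stat_function(fun = dnorm, aes(color = "standard normal")) +
  stat_function(fun = dt, args = list(df = 5), aes(color = "t, df = 5")) +
  stat_function(fun = dt, args = list(df = 30), aes(color = "t, df = 30")) +
  labs(x = "value", y = "density", color = "curve")

The curve with 5 degrees of freedom has noticeably heavier tails, while the curve with 30 degrees of freedom is nearly indistinguishable from the normal curve.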
Small sample sizes lead to small degrees of freedom and, thus, \\(t\\) distributions that have more values in the tails of their distributions. Large sample sizes lead to large degrees of freedom and, thus, \\(t\\) distributions that closely align with the standard normal, bell-shaped curve. So, assuming \\(H_0\\) is true, our formula simplifies a bit: \\[T =\\dfrac{ \\bar{x}_1 - \\bar{x}_2}{ \\sqrt{\\dfrac{{s_1}^2}{n_1} + \\dfrac{{s_2}^2}{n_2}} }.\\] We have already built an approximation for what we think the distribution of \\(\\delta = \\bar{x}_1 - \\bar{x}_2\\) looks like using randomization above. Recall this distribution: ggplot(data = null_distribution_two_means, aes(x = stat)) + geom_histogram(color = "white", bins = 20) FIGURE 10.10: Simulated differences in means histogram The infer package also includes some built-in theory-based statistics, so instead of going through the process of trying to transform the difference into a standardized form, we can just provide a different value for stat in calculate(). Recall the generated_samples data frame created via: generated_samples <- movies_genre_sample %>% specify(formula = rating ~ genre) %>% hypothesize(null = "independence") %>% generate(reps = 5000) We can now create a null distribution of \\(t\\) statistics: null_distribution_t <- generated_samples %>% calculate(stat = "t", order = c("Romance", "Action")) null_distribution_t %>% visualize() We see that the shape of this stat = "t" distribution is the same as that of stat = "diff in means". The scale has changed though, with the \\(t\\) values having less spread than the difference in means. A traditional \\(t\\)-test doesn’t look at this simulated distribution, but instead it looks at the \\(t\\)-curve with degrees of freedom equal to 62.029. We can overlay this distribution over the top of our permuted \\(t\\) statistics using the method = "both" setting in visualize(). null_distribution_t %>% visualize(method = "both") We can see that the curve does a good job of approximating the randomization distribution here. (More on when to expect this to be the case when we discuss conditions for the \\(t\\)-test in a bit.) To calculate the \\(p\\)-value in this case, we need to figure out how much of the total area under the \\(t\\)-curve is at our observed \\(T\\)-statistic or more, plus also adding the area under the curve at the negative value of the observed \\(T\\)-statistic or below. (Remember this is a two-tailed test so we are looking for a difference–values in the tails of either direction.) Just as we converted all of the simulated values to \\(T\\)-statistics, we must also do so for our observed effect \\(\\delta^*\\): obs_t <- movies_genre_sample %>% specify(formula = rating ~ genre) %>% calculate(stat = "t", order = c("Romance", "Action")) So graphically we are interested in finding the percentage of values that are at or above 2.945 or at or below -2.945. null_distribution_t %>% visualize(method = "both", obs_stat = obs_t, direction = "both") As we might have expected, since this is just a standardization of the difference in means statistic that produced a small \\(p\\)-value, we also have a very small \\(p\\)-value here. 10.8.2 Conditions for t-test The infer package does not automatically check conditions for the theoretical methods to work and this warning was given when we used method = "both". 
In order for the results of the \\(t\\)-test to be valid, three conditions must be met: Independent observations in both samples Nearly normal populations OR large sample sizes (\\(n \\ge 30\\)) Independently selected samples Condition 1: This is met since we sampled at random using R from our population. Condition 2: Recall from Figure 10.4 that we know how the populations are distributed. Both of them are close to normally distributed. If we are a little concerned about this assumption, we also do have samples of size larger than 30 (\\(n_1 = n_2 = 34\\)). Condition 3: This is met since there is no natural pairing of a movie in the Action group to a movie in the Romance group. Since all three conditions are met, we can be reasonably certain that the theory-based test will match the results of the randomization-based test using shuffling. Remember that theory-based tests can produce some incorrect results if these assumptions are not carefully checked. The only assumption for randomization and computational-based methods is that the sample is selected at random. They are our preference and we strongly believe they should be yours as well, but it’s important to also see how the theory-based tests can be done and used as an approximation for the computational techniques, at least until more researchers are using these techniques that utilize the power of computers. 10.9 Conclusion We conclude by showing the infer pipeline diagram. In Chapter 11, we’ll come back to regression and see how the ideas covered in Chapter 9 and this chapter can help in understanding the significance of predictors in modeling. 10.9.1 Script of R code An R script file of all R code used in this chapter is available here. References "], ["11-inference-for-regression.html", "Chapter 11 Inference for Regression 11.1 Simulation-based Inference for Regression 11.2 Bootstrapping for the regression slope 11.3 Inference for multiple regression 11.4 Residual analysis", " Chapter 11 Inference for Regression In preparation for our first print edition to be published by CRC Press in Fall 2019, we’re remodeling this chapter a bit. Don’t expect major changes in content, but rather only minor changes in presentation. Our remodeling will be complete and available online at ModernDive.com by early Summer 2019! Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Read Section 2.3 for information on how to install and load R packages. library(ggplot2) library(dplyr) library(moderndive) library(infer) library(gapminder) library(ISLR) 11.1 Simulation-based Inference for Regression We can also use the concept of permuting to determine the standard error of our null distribution and conduct a hypothesis test for a population slope. Let’s go back to our example on teacher evaluations from Chapters 6 and 7. We’ll begin in the basic regression setting to test whether we have evidence that a statistically significant positive relationship exists between teaching and beauty scores for the University of Texas professors. As we did in Chapter 6, teaching score will act as our outcome variable and bty_avg will be our explanatory variable. We will set up this hypothesis testing process as we have before via the “There is Only One Test” diagram in Figure 10.1 using the infer package. 11.1.1 Data Our data is stored in evals and we are focused on the measurements of the score and bty_avg variables there. 
Note that we don’t choose a subset of variables here since we will specify() the variables of interest using infer. evals %>% specify(score ~ bty_avg) Response: score (numeric) Explanatory: bty_avg (numeric) # A tibble: 463 x 2 score bty_avg <dbl> <dbl> 1 4.7 5 2 4.1 5 3 3.9 5 4 4.8 5 5 4.6 3 6 4.3 3 7 2.8 3 8 4.1 3.33 9 3.4 3.33 10 4.5 3.17 # … with 453 more rows 11.1.2 Test statistic \\(\\delta\\) Our test statistic here is the sample slope coefficient that we denote with \\(b_1\\). 11.1.3 Observed effect \\(\\delta^*\\) We can use the specify() %>% calculate() shortcut here to determine the slope value seen in our observed data: slope_obs <- evals %>% specify(score ~ bty_avg) %>% calculate(stat = "slope") The calculated slope value from our observed sample is \\(b_1 = 0.067\\). 11.1.4 Model of \\(H_0\\) We are looking to see if a positive relationship exists so \\(H_A: \\beta_1 > 0\\). Our null hypothesis is always in terms of equality so we have \\(H_0: \\beta_1 = 0\\). In other words, when we assume the null hypothesis is true, we are assuming there is NOT a linear relationship between teaching and beauty scores for University of Texas professors. 11.1.5 Simulated data Now to simulate the null hypothesis being true and recreating how our sample was created, we need to think about what it means for \\(\\beta_1\\) to be zero. If \\(\\beta_1 = 0\\), we said above that there is no relationship between the teaching and beauty scores. If there is no relationship, then any one of the teaching score values could have just as likely occurred with any of the other beauty score values instead of the one that it actually did fall with. We, therefore, have another example of permuting in our simulating of data under the null hypothesis. Tactile simulation We could use a deck of 926 note cards to create a tactile simulation of this permuting process. We would write the 463 different values of beauty scores on each of the 463 cards, one per card. We would then do the same thing for the 463 teaching scores putting them on one per card. Next, we would lay out each of the 463 beauty score cards and we would shuffle the teaching score deck. Then, after shuffling the deck well, we would disperse the cards one per each one of the beauty score cards. We would then enter these new values in for teaching score and compute a sample slope based on this permuting. We could repeat this process many times, keeping track of our sample slope after each shuffle. 11.1.6 Distribution of \\(\\delta\\) under \\(H_0\\) We can build our null distribution in much the same way we did in Chapter 10 using the generate() and calculate() functions. Note also the addition of the hypothesize() function, which lets generate() know to perform the permuting instead of bootstrapping. null_slope_distn <- evals %>% specify(score ~ bty_avg) %>% hypothesize(null = "independence") %>% generate(reps = 10000) %>% calculate(stat = "slope") null_slope_distn %>% visualize(obs_stat = slope_obs, direction = "greater") In viewing the distribution above with shading to the right of our observed slope value of 0.067, we can see that we expect the p-value to be quite small. Let’s calculate it next using a similar syntax to what was done with visualize(). 11.1.7 The p-value null_slope_distn %>% get_pvalue(obs_stat = slope_obs, direction = "greater") # A tibble: 1 x 1 p_value <dbl> 1 0 Since 0.067 falls far to the right of this plot beyond where any of the histogram bins have data, we can say that we have a \\(p\\)-value of 0. 
We, thus, have evidence to reject the null hypothesis in support of there being a positive association between the beauty score and teaching score of University of Texas faculty members. Learning check (LC11.1) Repeat the inference above but this time for the correlation coefficient instead of the slope. Note the implementation of stat = "correlation" in the calculate() function of the infer package. 11.2 Bootstrapping for the regression slope With the p-value calculated as 0 in the hypothesis test above, we can next determine just how strong of a positive slope value we might expect between the variables of teaching score and beauty score (bty_avg) for University of Texas faculty. Recall the infer pipeline above to compute the null distribution. Recall that this assumes, via the hypothesize() function, that the null hypothesis of no relationship between teaching score and beauty score is true. null_slope_distn <- evals %>% specify(score ~ bty_avg) %>% hypothesize(null = "independence") %>% generate(reps = 10000, type = "permute") %>% calculate(stat = "slope") To further reinforce the process being done in the pipeline, we’ve added the type argument to generate(). This is automatically added based on the entries for specify() and hypothesize(), but it provides a useful way to check to make sure generate() is creating the samples in the desired way. In this case, we permuted the values of one variable across the values of the other 10,000 times and calculated a "slope" coefficient for each of these 10,000 generated samples. If instead we’d like to get a range of plausible values for the true slope value, we can use the process of bootstrapping: bootstrap_slope_distn <- evals %>% specify(score ~ bty_avg) %>% generate(reps = 10000, type = "bootstrap") %>% calculate(stat = "slope") bootstrap_slope_distn %>% visualize() Next we can use the get_ci() function to determine the confidence interval. Let’s do this in two different ways, obtaining 99% confidence intervals. Remember that these denote a range of plausible values for an unknown true population slope parameter regressing teaching score on beauty score. percentile_slope_ci <- bootstrap_slope_distn %>% get_ci(level = 0.99, type = "percentile") percentile_slope_ci # A tibble: 1 x 2 `0.5%` `99.5%` <dbl> <dbl> 1 0.0229 0.110 se_slope_ci <- bootstrap_slope_distn %>% get_ci(level = 0.99, type = "se", point_estimate = slope_obs) se_slope_ci # A tibble: 1 x 2 lower upper <dbl> <dbl> 1 0.0220 0.111 With the bootstrap distribution being close to symmetric, it makes sense that the two resulting confidence intervals are similar. 11.3 Inference for multiple regression 11.3.1 Refresher: Professor evaluations data Let’s revisit the professor evaluations data that we analyzed using multiple regression with one numerical and one categorical predictor. In particular \\(y\\): outcome variable of instructor evaluation score predictor variables \\(x_1\\): numerical explanatory/predictor variable of age \\(x_2\\): categorical explanatory/predictor variable of gender library(ggplot2) library(dplyr) library(moderndive) evals_multiple <- evals %>% select(score, ethnicity, gender, language, age, bty_avg, rank) First, recall that we had two competing potential models to explain professors’ teaching scores: Model 1: No interaction term. i.e. both male and female profs have the same slope describing the associated effect of age on teaching score Model 2: Includes an interaction term. i.e. 
we allow for male and female profs to have different slopes describing the associated effect of age on teaching score 11.3.2 Refresher: Visualizations Recall the plots we made for both these models: FIGURE 11.1: Model 1: no interaction effect included FIGURE 11.2: Model 2: interaction effect included 11.3.3 Refresher: Regression tables Last, let’s recall the regressions we fit. First, the regression with no interaction effect: note the use of + in the formula in Table 11.1. score_model_2 <- lm(score ~ age + gender, data = evals_multiple) get_regression_table(score_model_2) TABLE 11.1: Model 1: Regression table with no interaction effect included term estimate std_error statistic p_value lower_ci upper_ci intercept 4.484 0.125 35.79 0.000 4.238 4.730 age -0.009 0.003 -3.28 0.001 -0.014 -0.003 gendermale 0.191 0.052 3.63 0.000 0.087 0.294 Second, the regression with an interaction effect: note the use of * in the formula. score_model_3 <- lm(score ~ age * gender, data = evals_multiple) get_regression_table(score_model_3) TABLE 11.2: Model 2: Regression table with interaction effect included term estimate std_error statistic p_value lower_ci upper_ci intercept 4.883 0.205 23.80 0.000 4.480 5.286 age -0.018 0.004 -3.92 0.000 -0.026 -0.009 gendermale -0.446 0.265 -1.68 0.094 -0.968 0.076 age:gendermale 0.014 0.006 2.45 0.015 0.003 0.024 11.3.4 Script of R code An R script file of all R code used in this chapter is available here. 11.4 Residual analysis 11.4.1 Residual analysis Recall the residuals can be thought of as the error or the “lack-of-fit” between the observed value \\(y\\) and the fitted value \\(\\widehat{y}\\) on the blue regression line in Figure 6.6. Ideally when we fit a regression model, we’d like there to be no systematic pattern to these residuals. We’ll be more specific as to what we mean by no systematic pattern when we see Figure 11.4 below, but let’s keep this notion imprecise for now. Investigating any such patterns is known as residual analysis and is the theme of this section. We’ll perform our residual analysis in two ways: Creating a scatterplot with the residuals on the \\(y\\)-axis and the original explanatory variable \\(x\\) on the \\(x\\)-axis. Creating a histogram of the residuals, thereby showing the distribution of the residuals. First, recall in Figure 6.8 above we created a scatterplot where on the vertical axis we had the teaching score \\(y\\), on the horizontal axis we had the beauty score \\(x\\), and the blue arrow represented the residual for one particular instructor. Instead, in Figure 11.3 below, let’s create a scatterplot where On the vertical axis we have the residual \\(y-\\widehat{y}\\) instead On the horizontal axis we have the beauty score \\(x\\) as before: # Get data evals_ch6 <- evals %>% select(score, bty_avg, age) # Fit regression model: score_model <- lm(score ~ bty_avg, data = evals_ch6) # Get regression table: get_regression_table(score_model) # A tibble: 2 x 7 term estimate std_error statistic p_value lower_ci upper_ci <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 intercept 3.88 0.076 51.0 0 3.73 4.03 2 bty_avg 0.067 0.016 4.09 0 0.035 0.099 # Get regression points regression_points <- get_regression_points(score_model) ggplot(regression_points, aes(x = bty_avg, y = residual)) + geom_point() + labs(x = "Beauty Score", y = "Residual") + geom_hline(yintercept = 0, col = "blue", size = 1) FIGURE 11.3: Plot of residuals over beauty score You can think of Figure 11.3 as Figure 6.8 but with the blue line flattened out to \\(y=0\\). 
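If you want to convince yourself that the residual column really is \\(y - \\widehat{y}\\), a quick hedged check of our own (not from the text) is to compare it against base R’s residuals() on the same fitted model; the two should agree up to any rounding done by get_regression_points():

head(regression_points$residual)
head(residuals(score_model))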
Does it seem like there is no systematic pattern to the residuals? This question is rather qualitative and subjective in nature, thus different people may respond with different answers to the above question. However, it can be argued that there isn’t a drastic pattern in the residuals. Let’s now get a little more precise in our definition of no systematic pattern in the residuals. Ideally, the residuals should behave randomly. In addition, the residuals should be on average 0. In other words, sometimes the regression model will make a positive error in that \\(y - \\widehat{y} > 0\\), sometimes the regression model will make a negative error in that \\(y - \\widehat{y} < 0\\), but on average the error is 0. Further, the value and spread of the residuals should not depend on the value of \\(x\\). In Figure 11.4 below, we display some hypothetical examples where there are drastic patterns to the residuals. In Example 1, the value of the residual seems to depend on \\(x\\): the residuals tend to be positive for small and large values of \\(x\\) in this range, whereas values of \\(x\\) more in the middle tend to have negative residuals. In Example 2, while the residuals seem to be on average 0 for each value of \\(x\\), the spread of the residuals varies for different values of \\(x\\); this situation is known as heteroskedasticity. FIGURE 11.4: Examples of less than ideal residual patterns The second way to perform a residual analysis is to look at the histogram of the residuals: ggplot(regression_points, aes(x = residual)) + geom_histogram(binwidth = 0.25, color = "white") + labs(x = "Residual") FIGURE 11.5: Histogram of residuals This histogram seems to indicate that we have more positive residuals than negative. Since the residual \\(y-\\widehat{y}\\) is positive when \\(y > \\widehat{y}\\), it seems our fitted teaching score from the regression model tends to underestimate the true teaching score. This histogram has a slight left-skew in that there is a long tail on the left. Another way to say this is that this data exhibits a negative skew. Is this a problem? Again, there is a certain amount of subjectivity in the response. In the authors’ opinion, while there is a slight skew/pattern to the residuals, it isn’t a large concern. On the other hand, others might disagree with our assessment. Here are examples of an ideal and less than ideal pattern to the residuals when viewed in a histogram: FIGURE 11.6: Examples of ideal and less than ideal residual patterns In fact, we’ll see later on that we would like the residuals to be normally distributed with mean 0. In other words, be bell-shaped and centered at 0! While this requirement and residual analysis in general may not seem overly critical to some of you at this point, we’ll see later when we cover inference for regression in Chapter 11 that for the last five columns of the regression table from earlier (std error, statistic, p_value, lower_ci, and upper_ci) to have valid interpretations, the above three conditions should roughly hold. Learning check (LC11.2) Continuing with our regression using age as the explanatory variable and teaching score as the outcome variable, use the get_regression_points() function to get the observed values, fitted values, and residuals for all 463 instructors. Perform a residual analysis and look for any systematic patterns in the residuals. Ideally, there should be little to no pattern. 
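As a hedged starting point for this Learning check (a sketch of our own, not a full solution), the pipeline mirrors the one above with age in place of bty_avg:

score_age_model <- lm(score ~ age, data = evals)
regression_points_age <- get_regression_points(score_age_model)
ggplot(regression_points_age, aes(x = age, y = residual)) +
  geom_point() +
  labs(x = "Age", y = "Residual") +
  geom_hline(yintercept = 0, col = "blue")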
11.4.2 Residual analysis # Get data: gapminder2007 <- gapminder %>% filter(year == 2007) %>% select(country, continent, lifeExp, gdpPercap) # Fit regression model: lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007) # Get regression table: get_regression_table(lifeExp_model) # A tibble: 5 x 7 term estimate std_error statistic p_value lower_ci upper_ci <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 intercept 54.8 1.02 53.4 0 52.8 56.8 2 continentAmericas 18.8 1.8 10.4 0 15.2 22.4 3 continentAsia 15.9 1.65 9.68 0 12.7 19.2 4 continentEurope 22.8 1.70 13.5 0 19.5 26.2 5 continentOceania 25.9 5.33 4.86 0 15.4 36.4 # Get regression points regression_points <- get_regression_points(lifeExp_model) Recall our discussion on residuals from Section 11.4.1 where our goal was to investigate whether or not there was a systematic pattern to the residuals. Ideally since residuals can be thought of as error, there should be no such pattern. While there are many ways to do such residual analysis, we focused on two approaches based on visualizations. A plot with residuals on the vertical axis and the predictor (in this case continent) on the horizontal axis A histogram of all residuals First, let’s plot the residuals versus continent in Figure 11.7, but also let’s plot all 142 points with a little horizontal random jitter by setting the width = 0.1 parameter in geom_jitter(): ggplot(regression_points, aes(x = continent, y = residual)) + geom_jitter(width = 0.1) + labs(x = "Continent", y = "Residual") + geom_hline(yintercept = 0, col = "blue") FIGURE 11.7: Plot of residuals over continent We observe There seems to be a rough balance of both positive and negative residuals for all 5 continents. However, there is one clear outlier in Asia, which has a residual with the largest deviation away from 0. Let’s investigate the 5 countries in Asia with the shortest life expectancy: gapminder2007 %>% filter(continent == "Asia") %>% arrange(lifeExp) TABLE 11.3: Countries in Asia with shortest life expectancy country continent lifeExp gdpPercap Afghanistan Asia 43.8 975 Iraq Asia 59.5 4471 Cambodia Asia 59.7 1714 Myanmar Asia 62.1 944 Yemen, Rep. Asia 62.7 2281 This was the earlier identified residual for Afghanistan of -26.9. Unfortunately given recent geopolitical turmoil, individuals who live in Afghanistan and, in particular in 2007, have a drastically lower life expectancy. Second, let’s look at a histogram of all 142 values of residuals in Figure 11.8. In this case, the residuals form a rather nice bell-shape, although there are a couple of very low and very high values at the tails. As we said previously, searching for patterns in residuals can be somewhat subjective, but ideally we hope there are no “drastic” patterns. ggplot(regression_points, aes(x = residual)) + geom_histogram(binwidth = 5, color = "white") + labs(x = "Residual") FIGURE 11.8: Histogram of residuals Learning check (LC11.3) Continuing with our regression using gdpPercap as the outcome variable and continent as the explanatory variable, use the get_regression_points() function to get the observed values, fitted values, and residuals for all 142 countries in 2007 and perform a residual analysis to look for any systematic patterns in the residuals. Is there a pattern? Please keep in mind that these types of questions are somewhat subjective and different people will most likely have different answers. The focus should be on being able to justify the conclusions made. 
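As a hedged starting point for this Learning check (our own sketch; the binwidth of 5000 is an arbitrary choice), the same two kinds of plots can be built for a gdpPercap model:

gdpPercap_model <- lm(gdpPercap ~ continent, data = gapminder2007)
regression_points_gdp <- get_regression_points(gdpPercap_model)
ggplot(regression_points_gdp, aes(x = continent, y = residual)) +
  geom_jitter(width = 0.1) +
  labs(x = "Continent", y = "Residual") +
  geom_hline(yintercept = 0, col = "blue")
ggplot(regression_points_gdp, aes(x = residual)) +
  geom_histogram(binwidth = 5000, color = "white") +
  labs(x = "Residual")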
11.4.3 Residual analysis Recall that in Section 11.4.1, our first residual analysis plot investigated the presence of any systematic pattern in the residuals when we had a single numerical predictor: bty_avg. For the Credit card dataset, since we have two numerical predictors, Limit and Income, we must perform this twice: # Get data: Credit <- Credit %>% select(Balance, Limit, Income, Rating, Age) # Fit regression model: Balance_model <- lm(Balance ~ Limit + Income, data = Credit) # Get regression table: get_regression_table(Balance_model) # A tibble: 3 x 7 term estimate std_error statistic p_value lower_ci upper_ci <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 intercept -385. 19.5 -19.8 0 -423. -347. 2 Limit 0.264 0.006 45.0 0 0.253 0.276 3 Income -7.66 0.385 -19.9 0 -8.42 -6.91 # Get regression points regression_points <- get_regression_points(Balance_model) ggplot(regression_points, aes(x = Limit, y = residual)) + geom_point() + labs(x = "Credit limit (in $)", y = "Residual", title = "Residuals vs credit limit") ggplot(regression_points, aes(x = Income, y = residual)) + geom_point() + labs(x = "Income (in $1000)", y = "Residual", title = "Residuals vs income") FIGURE 11.9: Residuals vs credit limit and income In this case, there does appear to be a systematic pattern to the residuals, as the scatter of the residuals around the line \\(y=0\\) is definitely not consistent. This behavior of the residuals is further evidenced by the histogram of residuals in Figure 11.10. We observe that the residuals have a slight right-skew (recall we say that data is right-skewed, or positively-skewed, if there is a tail to the right). Ideally, these residuals should be bell-shaped around a residual value of 0. ggplot(regression_points, aes(x = residual)) + geom_histogram(color = "white") + labs(x = "Residual") `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. FIGURE 11.10: Relationship between credit card balance and credit limit/income Another way to interpret this histogram is that since the residual is computed as \\(y - \\widehat{y}\\) = balance - balance_hat, we have some values where the fitted value \\(\\widehat{y}\\) is very much lower than the observed value \\(y\\). In other words, we are underestimating certain credit card holders’ balances by a very large amount. Learning check (LC11.4) Continuing with our regression using Rating and Age as the explanatory variables and credit card Balance as the outcome variable, use the get_regression_points() function to get the observed values, fitted values, and residuals for all 400 credit card holders. Perform a residual analysis and look for any systematic patterns in the residuals.
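If you want a scaffold for this Learning check as well, the same steps apply; a minimal sketch where Balance_model_2 and regression_points_2 are illustrative names:
Balance_model_2 <- lm(Balance ~ Rating + Age, data = Credit)
get_regression_table(Balance_model_2)
regression_points_2 <- get_regression_points(Balance_model_2)
# One residual scatterplot per numerical predictor:
ggplot(regression_points_2, aes(x = Rating, y = residual)) + geom_point() + labs(x = "Credit rating", y = "Residual")
ggplot(regression_points_2, aes(x = Age, y = residual)) + geom_point() + labs(x = "Age", y = "Residual")
# Histogram of all 400 residuals:
ggplot(regression_points_2, aes(x = residual)) + geom_histogram(color = "white") + labs(x = "Residual")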
11.4.4 Residual analysis # Get data: evals_ch7 <- evals %>% select(score, age, gender) # Fit regression model: score_model_2 <- lm(score ~ age + gender, data = evals_ch7) # Get regression table: get_regression_table(score_model_2) # A tibble: 3 x 7 term estimate std_error statistic p_value lower_ci upper_ci <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 intercept 4.48 0.125 35.8 0 4.24 4.73 2 age -0.009 0.003 -3.28 0.001 -0.014 -0.003 3 gendermale 0.191 0.052 3.63 0 0.087 0.294 # Get regression points regression_points <- get_regression_points(score_model_2) As always, let’s perform a residual analysis first with a histogram, which we can facet by gender: ggplot(regression_points, aes(x = residual)) + geom_histogram(binwidth = 0.25, color = "white") + labs(x = "Residual") + facet_wrap(~gender) FIGURE 11.11: Interaction model histogram of residuals Second, the residuals as compared to the predictor variables: \\(x_1\\): numerical explanatory/predictor variable of age \\(x_2\\): categorical explanatory/predictor variable of gender ggplot(regression_points, aes(x = age, y = residual)) + geom_point() + labs(x = "age", y = "Residual") + geom_hline(yintercept = 0, col = "blue", size = 1) + facet_wrap(~ gender) FIGURE 11.12: Interaction model residuals vs predictor "], -["12-thinking-with-data.html", "Chapter 12 Thinking with Data 12.1 Case study: Seattle house prices 12.2 Case study: Effective data storytelling Concluding remarks", " Chapter 12 Thinking with Data In preparation for our first print edition to be published by CRC Press in Fall 2019, we’re remodeling this chapter a bit. Don’t expect major changes in content, but rather only minor changes in presentation. Our remodeling will be complete and available online at ModernDive.com by early Summer 2019! Recall in Section 1.1 “Introduction for students” and at the end of chapters throughout this book, we displayed the “ModernDive flowchart” mapping your journey through this book. FIGURE 12.1: ModernDive Flowchart Let’s get a refresher of what you’ve covered so far. You first got started with with data in Chapter 2, where you learned about the difference between R and RStudio, started coding in R, started understanding what R packages are, and explored your first dataset: all domestic departure flights from a New York City airport in 2013. Then: Data science: You assembled your data science toolbox using tidyverse packages. In particular: Ch.3: Visualizing data via the ggplot2 package. Ch.5: Understanding the concept of “tidy” data as a standardized data input format for all packages in the tidyverse Ch.4: Wrangling data via the dplyr package. Data modeling: Using these data science tools and helper functions from the moderndive package, you started performing data modeling. In particular: Ch.6: Constructing basic regression models. Ch.7: Constructing multiple regression models. Statistical inference: Once again using your newly acquired data science tools, we unpacked statistical inference using the infer package. In particular: Ch.8: Understanding the role that sampling variability plays in statistical inference using both tactile and virtual simulations of sampling from a “bowl” with an unknown proportion of red balls. Ch.9: Building confidence intervals. Ch.10: Conducting hypothesis tests. Data modeling revisited: Armed with your new understanding of statistical inference, you revisited and reviewed the models you constructed in Ch.6 & Ch.7. 
In particular: Ch.11: Interpreting both the statistical and practice significance of the results of the models. All this was our approach of guiding you through your first experiences of “thinking with data”, an expression originally coined by Diane Lambert of Google. How the philosophy underlying this expression guided our mapping of the flowchart above was well put in the introduction to the “Practical Data Science for Stats” collection of preprints focusing on the practical side of data science workflows and statistical analysis, curated by Jennifer Bryan and Hadley Wickham: There are many aspects of day-to-day analytical work that are almost absent from the conventional statistics literature and curriculum. And yet these activities account for a considerable share of the time and effort of data analysts and applied statisticians. The goal of this collection is to increase the visibility and adoption of modern data analytical workflows. We aim to facilitate the transfer of tools and frameworks between industry and academia, between software engineering and statistics and computer science, and across different domains. In other words, in order to be equipped to “think with data” in the 21st century, future analysts need preparation going through the entirety of the “Data/Science Pipeline” we also saw earlier and not just parts of it. FIGURE 12.2: Data/Science Pipeline In Section 12.1, we’ll take you through full-pass of the “Data/Science Pipeline” where we’ll analyze the sale price of houses in Seattle, WA, USA. In Section 12.2, we’ll present you with examples of effective data storytelling, in particular the articles from the data journalism website FiveThirtyEight.com, many of whose source datasets are accessible from the fivethirtyeight R package. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Read Section 2.3 for information on how to install and load R packages. library(ggplot2) library(dplyr) library(moderndive) library(fivethirtyeight) 12.1 Case study: Seattle house prices Kaggle.com is a machine learning and predictive modeling competition website that hosts datasets uploaded by companies, governmental organizations, and other individuals. One of their datasets is the House Sales in King County, USA consisting of homes sold in between May 2014 and May 2015 in King County, Washington State, USA, which includes the greater Seattle metropolitan area. This CC0: Public Domain licensed dataset is included in the moderndive package in the house_prices data frame, which we’ll refer to as the “Seattle house prices” dataset. The dataset consists 21,613 houses and 21 variables describing these houses; for a full list of these variables see the help file by running ?house_prices in the console. In this case study, we’ll create a model using multiple regression where: The outcome variable \\(y\\) is the sale price of houses The two explanatory/predictor variables we’ll use are : \\(x_1\\): house size sqft_living, as measured by square feet of living space, where 1 square foot is about 0.09 square meters. \\(x_2\\): house condition, a categorical variable with 5 levels where 1 indicates “poor” and 5 indicates “excellent.” Let’s load all the packages needed for this case study (this assumes you’ve already installed them). If needed, read Section 2.3 for information on how to install and load R packages. 
library(ggplot2) library(dplyr) library(moderndive) 12.1.1 Exploratory data analysis (EDA) A crucial first step before any formal modeling is an exploratory data analysis, commonly abbreviated at EDA. Exploratory data analysis can give you a sense of your data, help identify issues with your data, bring to light any outliers, and help inform model construction. There are three basic approaches to EDA: Most fundamentally, just looking at the raw data. For example using RStudio’s View() spreadsheet viewer or the glimpse() function from the dplyr package Creating visualizations like the ones using ggplot2 from Chapter 3 Computing summary statistics using the dplyr data wrangling tools from Chapter 4 First, let’s look the raw data using View() and the glimpse() function. Explore the dataset. Which variables are numerical and which are categorical? For the categorical variables, what are their levels? Which do you think would useful variables to use in a model for house price? In this case study, we’ll only consider the variables price, sqft_living, and condition. An important thing to observe is that while the condition variable has values 1 through 5, these are saved in R as fct factors i.e. R’s way of saving categorical variables. So you should think of these as the “labels” 1 through 5 and not the numerical values 1 through 5. View(house_prices) glimpse(house_prices) Observations: 21,613 Variables: 21 $ id <chr> "7129300520", "6414100192", "5631500400", "2487200875",… $ date <date> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-0… $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500… $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2… $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2… $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 18… $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470… $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, … $ waterfront <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,… $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0… $ condition <fct> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4… $ grade <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, … $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 18… $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0,… $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2… $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ zipcode <fct> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198,… $ lat <dbl> 47.5, 47.7, 47.7, 47.5, 47.6, 47.7, 47.3, 47.4, 47.5, 4… $ long <dbl> -122, -122, -122, -122, -122, -122, -122, -122, -122, -… $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 2… $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113,… Let’s now perform the second possible approach to EDA: creating visualizations. Since price and sqft_living are numerical variables, an appropriate way to visualize of these variables’ distributions would be using a histogram using a geom_histogram() as seen in Section 3.5. However, since condition is categorical, a barplot using a geom_bar() yields an appropriate visualization of its distribution. Recall from Section 3.8 that since condition is not “pre-counted”, we use a geom_bar() and not a geom_col(). In Figure 12.3, we display all three of these visualizations at once. 
# Histogram of house price: ggplot(house_prices, aes(x = price)) + geom_histogram(color = "white") + labs(x = "price (USD)", title = "House price") # Histogram of sqft_living: ggplot(house_prices, aes(x = sqft_living)) + geom_histogram(color = "white") + labs(x = "living space (square feet)", title = "House size") # Barplot of condition: ggplot(house_prices, aes(x = condition)) + geom_bar() + labs(x = "condition", title = "House condition") FIGURE 12.3: Exploratory visualizations of Seattle house prices data We observe the following: In the histogram for price: Since e+06 means \\(10^6\\), or one million, we see that a majority of houses are less than 2 million dollars. The x-axis stretches out far to the right to 8 million dollars, even though there appear to be no houses. In the histogram for size sqft_living Most houses appear to have less than 5000 square feet of living space. For comparison a standard American football field is about 57,600 square feet, where as a standard soccer AKA association football field is about 64,000 square feet. The x-axis exhibits the same stretched out behavior to the right as for price Most houses are of condition 3, 4, or 5. In the case of price, why does the x-axis stretch so far to the right? It is because there are a very small number of houses with price closer to 8 million; these prices are outliers in this case. We say variable is “right skewed” as exhibited by the long right tail. This skew makes it difficult to compare prices of the less expensive houses as the more expensive houses dominate the scale of the x-axis. This is similarly the case for sqft_living. Let’s now perform the third possible approach to EDA: computing summary statistics. In particular, let’s compute 4 summary statistics using the summarize() data wrangling verb from Section 4.3. Two measures of center: the mean and median Two measures of variability/spread: the standard deviation and interquartile-range (IQR = 3rd quartile - 1st quartile) house_prices %>% summarize( mean_price = mean(price), median_price = median(price), sd_price = sd(price), IQR_price = IQR(price) ) # A tibble: 1 x 4 mean_price median_price sd_price IQR_price <dbl> <dbl> <dbl> <dbl> 1 540088. 450000 367127. 323050 Observe the following: The mean price of $540,088 is larger than the median of $450,000. This is because the small number of very expensive outlier houses prices are inflating the average, whereas since the median is the “middle” value, it is not as sensitive to such large values at the high end. This is why the news typically reports median house prices and not average house prices when describing the real estate market. We say here that the median more “robust to outliers” than the mean. Similarly, while both the standard deviation and IQR are both measures of spread and variability, the IQR is more “robust to outliers.” If you repeat the above summarize() for sqft_living, you’ll find a similar relationship between mean vs median and standard deviation vs IQR given its similar right-skewed nature. Is there anything we can do about this right-skew? Again, this could potentially be an issue because we’ll have a harder time discriminating between houses at the lower end of price and sqft_living, which might lead to a problem when modeling. We can in fact address this issue by using a log base 10 transformation, which we cover next. 12.1.2 log10 transformations At its simplest, log10() transformations returns base 10 logarithms. For example, since \\(1000 = 10^3\\), log10(1000) returns 3. 
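You can check this directly at the console (a quick sketch):
log10(1000)
[1] 3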
To undo a log10-transformation, we raise 10 to this value. For example, to undo the previous log10-transformation and return the original value of 1000, we raise 10 to this value \\(10^{3}\\) by running 10^(3) = 1000. log-transformations allow us to focus on multiplicative changes instead of additive ones, thereby emphasizing changes in “orders of magnitude.” Let’s illustrate this idea in Table 12.1 with examples of prices of consumer goods in US dollars. TABLE 12.1: log10-transformed prices, orders of magnitude, and examples Price log10(Price) Order of magnitude Examples $1 0 Singles Cups of coffee $10 1 Tens Books $100 2 Hundreds Mobile phones $1,000 3 Thousands High definition TV’s $10,000 4 Tens of thousands Cars $100,000 5 Hundreds of thousands Luxury cars & houses $1,000,000 6 Millions Luxury houses Let’s break this down: When purchasing a cup of coffee, we tend to think of prices ranging in single dollars e.g. $2 or $3. However when purchasing say mobile phones, we don’t tend to think in prices in single dollars e.g. $676 or $757, but tend to round to the nearest unit of hundreds of dollars e.g. $200 or $500. Let’s say we want to know the log10-transformed value of $76. Even if this would be hard to compute without a calculator, we know that its log10 value is between 1 and 2, since $76 is between $10 and $100. In fact, log10(76) is 1.880814. log10-transformations are monotonic, meaning they preserve orderings. So if Price A is lower than Price B, then log10(Price A) will also be lower than log10(Price B). Most importantly, increments of one in log10 correspond to multiplicative changes and not additive ones. For example, increasing from log10(Price) of 3 to 4 corresponds to a multiplicative increase by a factor of 10: $100 to $1000. Let’s create new log10-transformed versions of the right-skewed variable price and sqft_living using the mutate() function from Section 4.5, but we’ll give the latter the name log10_size, which is a little more succinct and descriptive a variable name. house_prices <- house_prices %>% mutate( log10_price = log10(price), log10_size = log10(sqft_living) ) Let’s first display the before and after effects of this transformation on these variables for only the first 10 rows of house_prices: house_prices %>% select(price, log10_price, sqft_living, log10_size) # A tibble: 10 x 4 price log10_price sqft_living log10_size <dbl> <dbl> <int> <dbl> 1 221900 5.35 1180 3.07 2 538000 5.73 2570 3.41 3 180000 5.26 770 2.89 4 604000 5.78 1960 3.29 5 510000 5.71 1680 3.23 6 1225000 6.09 5420 3.73 7 257500 5.41 1715 3.23 8 291850 5.47 1060 3.03 9 229500 5.36 1780 3.25 10 323000 5.51 1890 3.28 Observe in particular: The house in the 6th row with price $1,225,000, which is just above one million dollars. Since \\(10^6\\) is one million, its log10_price is 6.09. Contrast this with all other houses with log10_price less than 6. Similarly, there is only one house with size sqft_living less than 1000. Since \\(1000 = 10^3\\), its the lone house with log10_size less than 3. Let’s now visualize the before and after effects of this transformation for price in Figure 12.4. 
# Before: ggplot(house_prices, aes(x = price)) + geom_histogram(color = "white") + labs(x = "price (USD)", title = "House price: Before") # After: ggplot(house_prices, aes(x = log10_price)) + geom_histogram(color = "white") + labs(x = "log10 price (USD)", title = "House price: After") FIGURE 12.4: House price before and after log10-transformation Observe that after the transformation, the distribution is much less skewed, and in this case, more symmetric and bell-shaped, although this isn’t always necessarily the case. Now you can now better discriminate between house prices at the lower end of the scale. Let’s do the same for size where the before variable is sqft_living and the after variable is log10_size. Observe in Figure 12.5 that the log10-transformation has a similar effect of un-skewing the variable. Again, we emphasize that while in these two cases the resulting distributions are more symmetric and bell-shaped, this is not always necessarily the case. # Before: ggplot(house_prices, aes(x = sqft_living)) + geom_histogram(color = "white") + labs(x = "living space (square feet)", title = "House size: Before") # After: ggplot(house_prices, aes(x = log10_size)) + geom_histogram(color = "white") + labs(x = "log10 living space (square feet)", title = "House size: After") FIGURE 12.5: House size before and after log10-transformation Given the now un-skewed nature of log10_price and log10_size, we are going to revise our modeling structure: We’ll use a new outcome variable \\(y\\) log10_price of houses The two explanatory/predictor variables we’ll use are: \\(x_1\\): A modified version of house size: log10_size \\(x_2\\): House condition will remain unchanged 12.1.3 EDA Part II Let’s continue our exploratory data analysis from Subsection 12.1.1 above. The earlier EDA you performed was univariate in nature in that we only considered one variable at a time. The goal of modeling, however, is to explore relationships between variables. So we must jointly consider the relationship between the outcome variable log10_price and the explanatory/predictor variables log10_size (numerical) and condition (categorical). We viewed such a modeling scenario in Section 7.2 using the evals dataset, where the outcome variable was teaching score, the numerical explanatory/predictor variable was instructor age and the categorical explanatory/predictor variable was (binary) gender. We have two possible visual models. Either a parallel slopes model in Figure 12.6 where we have a different regression line for each of the 5 possible condition levels, each with a different intercept but the same slope: FIGURE 12.6: Parallel slopes model Or an interaction model in Figure 12.7, where we allow each regression line to not only have different intercepts, but different slopes as well: ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) + geom_point(alpha = 0.1) + labs(y = "log10 price", x = "log10 size", title = "House prices in Seattle") + geom_smooth(method = "lm", se = FALSE) FIGURE 12.7: Interaction model In both cases, we see there is a positive relationship between house price and size, meaning as houses are larger, they tend to be more expensive. Furthermore, in both plots it seems that houses of condition 5 tend to be the most expensive for most house sizes as evidenced by the fact that the purple line is highest, followed by condition 4 and 3. 
As for condition 1 and 2, this pattern isn’t as clear, as if you recall from the univariate barplot of condition in Figure 12.3 there are very few houses of condition 1 or 2. This reality is more apparent in an alternative visualization to Figure 12.7 displayed in Figure 12.8 that uses facets instead: ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) + geom_point(alpha = 0.3) + labs(y = "log10 price", x = "log10 size", title = "House prices in Seattle") + geom_smooth(method = "lm", se = FALSE) + facet_wrap(~condition) FIGURE 12.8: Interaction model with facets Which exploratory visualization of the interaction model is better, the one in Figure 12.7 or Figure 12.8? There is no universal right answer, you need to make a choice depending on what you want to convey, and own it. 12.1.4 Regression modeling For now let’s focus on the latter, interaction model we’ve visualized in Figure 12.8 above. What are the 5 different slopes and intercepts for the condition = 1, condition = 2, …, and condition = 5 lines in Figure 12.8? To determine these, we first need the values from the regression table: # Fit regression model: price_interaction <- lm(log10_price ~ log10_size * condition, data = house_prices) # Get regression table: get_regression_table(price_interaction) # A tibble: 10 x 7 term estimate std_error statistic p_value lower_ci upper_ci <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 intercept 3.33 0.451 7.38 0 2.45 4.22 2 log10_size 0.69 0.148 4.65 0 0.399 0.98 3 condition2 0.047 0.498 0.094 0.925 -0.93 1.02 4 condition3 -0.367 0.452 -0.812 0.417 -1.25 0.519 5 condition4 -0.398 0.453 -0.879 0.38 -1.29 0.49 6 condition5 -0.883 0.457 -1.93 0.053 -1.78 0.013 7 log10_size:condition2 -0.024 0.163 -0.148 0.882 -0.344 0.295 8 log10_size:condition3 0.133 0.148 0.893 0.372 -0.158 0.424 9 log10_size:condition4 0.146 0.149 0.979 0.328 -0.146 0.437 10 log10_size:condition5 0.31 0.15 2.07 0.039 0.016 0.604 Recall from Section 7.2.3 on how to interpret the outputs where there exists an interaction term, where in this case the “baseline for comparison” group for the categorical variable condition are the condition 1 houses. We’ll write our answers as: \\[\\widehat{\\log10(\\text{price})} = \\hat{\\beta}_0 + \\hat{\\beta}_{\\text{size}} * \\log10(\\text{size})\\] for all five condition levels separately: Condition 1: \\(\\widehat{\\log10(\\text{price})} = 3.33 + 0.69 * \\log10(\\text{size})\\) Condition 2: \\(\\widehat{\\log10(\\text{price})} = (3.33 + 0.047) + (0.69 - 0.024) * \\log10(\\text{size}) = 3.38 + 0.666 * \\log10(\\text{size})\\) Condition 3: \\(\\widehat{\\log10(\\text{price})} = (3.33 - 0.367) + (0.69 + 0.133) * \\log10(\\text{size}) = 2.96 + 0.823 * \\log10(\\text{size})\\) Condition 4: \\(\\widehat{\\log10(\\text{price})} = (3.33 - 0.398) + (0.69 + 0.146) * \\log10(\\text{size}) = 2.93 + 0.836 * \\log10(\\text{size})\\) Condition 5: \\(\\widehat{\\log10(\\text{price})} = (3.33 - 0.883) + (0.69 + 0.31) * \\log10(\\text{size}) = 2.45 + 1 * \\log10(\\text{size})\\) These correspond to the regression lines in the exploratory visualization of the interaction model in Figure 12.7 above. For homes of all 5 condition types, as the size of the house increases, the prices increases. This is what most would expect. However, the rate of increase of price with size is fastest for the homes with condition 3, 4, and 5 of 0.823, 0.836, and 1 respectively; these are the 3 most largest slopes out of the 5. 
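Rather than adding these estimates up by hand, you can also compute them from the estimate column of the regression table above; a small sketch using base R indexing, where the row positions match the printed table:
interaction_coefs <- get_regression_table(price_interaction)$estimate
# Intercept and slope for condition 5 homes (condition 1 is the baseline):
interaction_coefs[1] + interaction_coefs[6]   # 3.33 - 0.883 = 2.45
interaction_coefs[2] + interaction_coefs[10]  # 0.69 + 0.31 = 1.00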
12.1.5 Making predictions Say you’re a realtor and someone calls you asking you how much their home will sell for. They tell you that it’s in condition = 5 and is sized 1900 square feet. What do you tell them? We first make this prediction visually in Figure 12.9. The predicted log10_price of this house is marked with a black dot: it is where the two following lines intersect: The purple regression line for the condition = 5 homes and The vertical dashed black line at log10_size equals 3.28, since our predictor variable is the log10-transformed square feet of living space and \\(\\log10(1900) = 3.28\\) . FIGURE 12.9: Interaction model with prediction Eyeballing it, it seems the predicted log10_price seems to be around 5.72. Let’s now obtain the an exact numerical value for the prediction using the values of the intercept and slope for the condition = 5 that we computed using the regression table output. We use the equation for the condition = 5 line, being sure to log10() the square footage first. 2.45 + 1 * log10(1900) [1] 5.73 This value is very close to our earlier visually made prediction of 5.72. But wait! We were using the outcome variable log10_price as our outcome variable! So if we want a prediction in terms of price in dollar units, we need to un-log this by taking a power of 10 as described in Section 12.1.2. 10^(2.45 + 1 * log10(1900)) [1] 535493 So we our predicted price for this home of condition 5 and size 1900 square feet is $535,493. Learning check (LC12.1) Repeat the regression modeling in Subsection 12.1.4 and the prediction making you just did on the house of condition 5 and size 1900 square feet in Subsection 12.1.5, but using the parallel slopes model you visualized in Figure 12.6. Hint: it’s $524,807! 12.2 Case study: Effective data storytelling Note: This section is still under construction. If you would like to contribute, please check us out on GitHub at https://github.com/moderndive/moderndive_book. /begin{center} /end{center} As we’ve progressed throughout this book, you’ve seen how to work with data in a variety of ways. You’ve learned effective strategies for plotting data by understanding which types of plots work best for which combinations of variable types. You’ve summarized data in table form and calculated summary statistics for a variety of different variables. Further, you’ve seen the value of inference as a process to come to conclusions about a population by using a random sample. Lastly, you’ve explored how to use linear regression and the importance of checking the conditions required to make it a valid procedure. All throughout, you’ve learned many computational techniques and focused on reproducible research in writing R code. We now present another case study, but this time of the “effective data storytelling” done by data journalists around the world. Great data stories don’t mislead the reader, but rather engulf them in understanding the importance that data plays in our lives through the captivation of storytelling. 12.2.1 Bechdel test for Hollywood gender representation We recommend you read and analyze this article by Walt Hickey entitled The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women on the Bechdel test, an informal test of gender representation in a movie. As you read over it, think carefully about how Walt is using data, graphics, and analyses to paint the picture for the reader of what the story is he wants to tell. 
In the spirit of reproducibility, the members of FiveThirtyEight have also shared the data and R code that they used to create for this story and many more of their articles on GitHub. ModernDive co-authors Chester Ismay and Albert Y. Kim along with Jennifer Chunn went one step further by creating the fivethirtyeight R package. The fivethirtyeight package takes FiveThirtyEight’s article data from GitHub, “tames” it so that it’s novice-friendly, and makes all data, documentation, and the original article easily accessible via an R package. The package homepage also includes a list of all fivethirtyeight data sets included. Furthermore, example “vignettes” of fully reproducible start-to-finish analyses of some of these data using dplyr, ggplot2, and other packages in the tidyverse is available here. For example, a vignette showing how to reproduce one of the plots at the end of the above article on the Bechdel test is available here. 12.2.2 US Births in 1999 Here is another example involving the US_births_1994_2003 data frame of all births in the United States between 1994 and 2003. For more information on this data frame including a link to the original article on FiveThirtyEight.com, check out the help file by running ?US_births_1994_2003 in the console. First, let’s load all necessary packages: library(ggplot2) library(dplyr) library(fivethirtyeight) It’s always a good idea to preview your data, either by using RStudio’s spreadsheet View() function or using glimpse() from the dplyr package below: # Preview data glimpse(US_births_1994_2003) Observations: 3,652 Variables: 6 $ year <int> 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1… $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ date_of_month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … $ date <date> 1994-01-01, 1994-01-02, 1994-01-03, 1994-01-04, 1994-0… $ day_of_week <ord> Sat, Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tu… $ births <int> 8096, 7772, 10142, 11248, 11053, 11406, 11251, 8653, 79… We’ll focus on the number of births for each date, but only for births that occurred in 1999. Recall we achieve this using the filter() command from dplyr package: US_births_1999 <- US_births_1994_2003 %>% filter(year == 1999) Since date is a notion of time, which has a sequential ordering to it, a linegraph AKA a “time series” plot would be more appropriate than a scatterplot. In other words, use a geom_line() instead of geom_point(): ggplot(US_births_1999, aes(x = date, y = births)) + geom_line() + labs(x = "Data", y = "Number of births", title = "US Births in 1999") We see a big valley occurring just before January 1st, 2000, mostly likely due to the holiday season. However, what about the major peak of over 14,000 births occurring just before October 1st, 1999? What could be the reason for this anomalously high spike in ? Time to think with data! 12.2.3 Other examples Stand by! 12.2.4 Script of R code An R script file of all R code used in this chapter is available here. Concluding remarks If you’ve come to this point in the book, I’d suspect that you know a thing or two about how to work with data in R. You’ve also gained a lot of knowledge about how to use simulation techniques to determine statistical significance and how these techniques build an intuition about traditional inferential methods like the \\(t\\)-test. The hope is that you’ve come to appreciate data wrangling, tidy datasets, and the power of data visualization. 
Actually, the data visualization part may be the most important thing here. If you can create truly beautiful graphics that display information in ways that the reader can clearly decipher, you’ve picked up a great skill. Let’s hope that that skill keeps you creating great stories with data into the near and far distant future. Thanks for coming along for the ride as we dove into modern data analysis using R! "], +["12-thinking-with-data.html", "Chapter 12 Thinking with Data 12.1 Case study: Seattle house prices 12.2 Case study: Effective data storytelling Concluding remarks", " Chapter 12 Thinking with Data In preparation for our first print edition to be published by CRC Press in Fall 2019, we’re remodeling this chapter a bit. Don’t expect major changes in content, but rather only minor changes in presentation. Our remodeling will be complete and available online at ModernDive.com by early Summer 2019! Recall in Section 1.1 “Introduction for students” and at the end of chapters throughout this book, we displayed the “ModernDive flowchart” mapping your journey through this book. FIGURE 12.1: ModernDive Flowchart Let’s get a refresher of what you’ve covered so far. You first got started with with data in Chapter 2, where you learned about the difference between R and RStudio, started coding in R, started understanding what R packages are, and explored your first dataset: all domestic departure flights from a New York City airport in 2013. Then: Data science: You assembled your data science toolbox using tidyverse packages. In particular: Ch.3: Visualizing data via the ggplot2 package. Ch.5: Understanding the concept of “tidy” data as a standardized data input format for all packages in the tidyverse Ch.4: Wrangling data via the dplyr package. Data modeling: Using these data science tools and helper functions from the moderndive package, you started performing data modeling. In particular: Ch.6: Constructing basic regression models. Ch.7: Constructing multiple regression models. Statistical inference: Once again using your newly acquired data science tools, we unpacked statistical inference using the infer package. In particular: Ch.8: Understanding the role that sampling variability plays in statistical inference using both tactile and virtual simulations of sampling from a “bowl” with an unknown proportion of red balls. Ch.9: Building confidence intervals. Ch.10: Conducting hypothesis tests. Data modeling revisited: Armed with your new understanding of statistical inference, you revisited and reviewed the models you constructed in Ch.6 & Ch.7. In particular: Ch.11: Interpreting both the statistical and practice significance of the results of the models. All this was our approach of guiding you through your first experiences of “thinking with data”, an expression originally coined by Diane Lambert of Google. How the philosophy underlying this expression guided our mapping of the flowchart above was well put in the introduction to the “Practical Data Science for Stats” collection of preprints focusing on the practical side of data science workflows and statistical analysis, curated by Jennifer Bryan and Hadley Wickham: There are many aspects of day-to-day analytical work that are almost absent from the conventional statistics literature and curriculum. And yet these activities account for a considerable share of the time and effort of data analysts and applied statisticians. The goal of this collection is to increase the visibility and adoption of modern data analytical workflows. 
We aim to facilitate the transfer of tools and frameworks between industry and academia, between software engineering and statistics and computer science, and across different domains. In other words, in order to be equipped to “think with data” in the 21st century, future analysts need preparation going through the entirety of the “Data/Science Pipeline” we also saw earlier and not just parts of it. FIGURE 12.2: Data/Science Pipeline In Section 12.1, we’ll take you through a full pass of the “Data/Science Pipeline” where we’ll analyze the sale price of houses in Seattle, WA, USA. In Section 12.2, we’ll present you with examples of effective data storytelling, in particular the articles from the data journalism website FiveThirtyEight.com, many of whose source datasets are accessible from the fivethirtyeight R package. Needed packages Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). Read Section 2.3 for information on how to install and load R packages. library(ggplot2) library(dplyr) library(moderndive) library(fivethirtyeight) 12.1 Case study: Seattle house prices Kaggle.com is a machine learning and predictive modeling competition website that hosts datasets uploaded by companies, governmental organizations, and other individuals. One of their datasets is the House Sales in King County, USA consisting of homes sold between May 2014 and May 2015 in King County, Washington State, USA, which includes the greater Seattle metropolitan area. This CC0: Public Domain licensed dataset is included in the moderndive package in the house_prices data frame, which we’ll refer to as the “Seattle house prices” dataset. The dataset consists of 21,613 houses and 21 variables describing these houses; for a full list of these variables see the help file by running ?house_prices in the console. In this case study, we’ll create a model using multiple regression where: The outcome variable \\(y\\) is the sale price of houses The two explanatory/predictor variables we’ll use are: \\(x_1\\): house size sqft_living, as measured by square feet of living space, where 1 square foot is about 0.09 square meters. \\(x_2\\): house condition, a categorical variable with 5 levels where 1 indicates “poor” and 5 indicates “excellent.” Let’s load all the packages needed for this case study (this assumes you’ve already installed them). If needed, read Section 2.3 for information on how to install and load R packages. library(ggplot2) library(dplyr) library(moderndive) 12.1.1 Exploratory data analysis (EDA) A crucial first step before any formal modeling is an exploratory data analysis, commonly abbreviated as EDA. Exploratory data analysis can give you a sense of your data, help identify issues with your data, bring to light any outliers, and help inform model construction. There are three basic approaches to EDA: Most fundamentally, just looking at the raw data. For example, using RStudio’s View() spreadsheet viewer or the glimpse() function from the dplyr package Creating visualizations like the ones using ggplot2 from Chapter 3 Computing summary statistics using the dplyr data wrangling tools from Chapter 4 First, let’s look at the raw data using View() and the glimpse() function. Explore the dataset. Which variables are numerical and which are categorical? For the categorical variables, what are their levels? Which do you think would be useful variables to use in a model for house price? In this case study, we’ll only consider the variables price, sqft_living, and condition.
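Since these are the only variables we will use, one option is to narrow the preview down to them first (a minimal sketch using the select() verb from Chapter 4):
house_prices %>% select(price, sqft_living, condition) %>% glimpse()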
An important thing to observe is that while the condition variable has values 1 through 5, these are saved in R as fct factors i.e. R’s way of saving categorical variables. So you should think of these as the “labels” 1 through 5 and not the numerical values 1 through 5. View(house_prices) glimpse(house_prices) Observations: 21,613 Variables: 21 $ id <chr> "7129300520", "6414100192", "5631500400", "2487200875",… $ date <date> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-0… $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500… $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2… $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2… $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 18… $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470… $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, … $ waterfront <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,… $ view <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0… $ condition <fct> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4… $ grade <fct> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, … $ sqft_above <int> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 18… $ sqft_basement <int> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0,… $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2… $ yr_renovated <int> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ zipcode <fct> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198,… $ lat <dbl> 47.5, 47.7, 47.7, 47.5, 47.6, 47.7, 47.3, 47.4, 47.5, 4… $ long <dbl> -122, -122, -122, -122, -122, -122, -122, -122, -122, -… $ sqft_living15 <int> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 2… $ sqft_lot15 <int> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113,… Let’s now perform the second possible approach to EDA: creating visualizations. Since price and sqft_living are numerical variables, an appropriate way to visualize of these variables’ distributions would be using a histogram using a geom_histogram() as seen in Section 3.5. However, since condition is categorical, a barplot using a geom_bar() yields an appropriate visualization of its distribution. Recall from Section 3.8 that since condition is not “pre-counted”, we use a geom_bar() and not a geom_col(). In Figure 12.3, we display all three of these visualizations at once. # Histogram of house price: ggplot(house_prices, aes(x = price)) + geom_histogram(color = "white") + labs(x = "price (USD)", title = "House price") # Histogram of sqft_living: ggplot(house_prices, aes(x = sqft_living)) + geom_histogram(color = "white") + labs(x = "living space (square feet)", title = "House size") # Barplot of condition: ggplot(house_prices, aes(x = condition)) + geom_bar() + labs(x = "condition", title = "House condition") FIGURE 12.3: Exploratory visualizations of Seattle house prices data We observe the following: In the histogram for price: Since e+06 means \\(10^6\\), or one million, we see that a majority of houses are less than 2 million dollars. The x-axis stretches out far to the right to 8 million dollars, even though there appear to be no houses. In the histogram for size sqft_living Most houses appear to have less than 5000 square feet of living space. For comparison a standard American football field is about 57,600 square feet, where as a standard soccer AKA association football field is about 64,000 square feet. 
The x-axis exhibits the same stretched out behavior to the right as for price Most houses are of condition 3, 4, or 5. In the case of price, why does the x-axis stretch so far to the right? It is because there are a very small number of houses with price closer to 8 million; these prices are outliers in this case. We say variable is “right skewed” as exhibited by the long right tail. This skew makes it difficult to compare prices of the less expensive houses as the more expensive houses dominate the scale of the x-axis. This is similarly the case for sqft_living. Let’s now perform the third possible approach to EDA: computing summary statistics. In particular, let’s compute 4 summary statistics using the summarize() data wrangling verb from Section 4.3. Two measures of center: the mean and median Two measures of variability/spread: the standard deviation and interquartile-range (IQR = 3rd quartile - 1st quartile) house_prices %>% summarize( mean_price = mean(price), median_price = median(price), sd_price = sd(price), IQR_price = IQR(price) ) # A tibble: 1 x 4 mean_price median_price sd_price IQR_price <dbl> <dbl> <dbl> <dbl> 1 540088. 450000 367127. 323050 Observe the following: The mean price of $540,088 is larger than the median of $450,000. This is because the small number of very expensive outlier houses prices are inflating the average, whereas since the median is the “middle” value, it is not as sensitive to such large values at the high end. This is why the news typically reports median house prices and not average house prices when describing the real estate market. We say here that the median more “robust to outliers” than the mean. Similarly, while both the standard deviation and IQR are both measures of spread and variability, the IQR is more “robust to outliers.” If you repeat the above summarize() for sqft_living, you’ll find a similar relationship between mean vs median and standard deviation vs IQR given its similar right-skewed nature. Is there anything we can do about this right-skew? Again, this could potentially be an issue because we’ll have a harder time discriminating between houses at the lower end of price and sqft_living, which might lead to a problem when modeling. We can in fact address this issue by using a log base 10 transformation, which we cover next. 12.1.2 log10 transformations At its simplest, log10() transformations returns base 10 logarithms. For example, since \\(1000 = 10^3\\), log10(1000) returns 3. To undo a log10-transformation, we raise 10 to this value. For example, to undo the previous log10-transformation and return the original value of 1000, we raise 10 to this value \\(10^{3}\\) by running 10^(3) = 1000. log-transformations allow us to focus on multiplicative changes instead of additive ones, thereby emphasizing changes in “orders of magnitude.” Let’s illustrate this idea in Table 12.1 with examples of prices of consumer goods in US dollars. TABLE 12.1: log10-transformed prices, orders of magnitude, and examples Price log10(Price) Order of magnitude Examples $1 0 Singles Cups of coffee $10 1 Tens Books $100 2 Hundreds Mobile phones $1,000 3 Thousands High definition TV’s $10,000 4 Tens of thousands Cars $100,000 5 Hundreds of thousands Luxury cars & houses $1,000,000 6 Millions Luxury houses Let’s break this down: When purchasing a cup of coffee, we tend to think of prices ranging in single dollars e.g. $2 or $3. However when purchasing say mobile phones, we don’t tend to think in prices in single dollars e.g. 
$676 or $757, but tend to round to the nearest unit of hundreds of dollars e.g. $200 or $500. Let’s say we want to know the log10-transformed value of $76. Even if this would be hard to compute without a calculator, we know that its log10 value is between 1 and 2, since $76 is between $10 and $100. In fact, log10(76) is 1.880814. log10-transformations are monotonic, meaning they preserve orderings. So if Price A is lower than Price B, then log10(Price A) will also be lower than log10(Price B). Most importantly, increments of one in log10 correspond to multiplicative changes and not additive ones. For example, increasing from log10(Price) of 3 to 4 corresponds to a multiplicative increase by a factor of 10: $100 to $1000. Let’s create new log10-transformed versions of the right-skewed variable price and sqft_living using the mutate() function from Section 4.5, but we’ll give the latter the name log10_size, which is a little more succinct and descriptive a variable name. house_prices <- house_prices %>% mutate( log10_price = log10(price), log10_size = log10(sqft_living) ) Let’s first display the before and after effects of this transformation on these variables for only the first 10 rows of house_prices: house_prices %>% select(price, log10_price, sqft_living, log10_size) # A tibble: 10 x 4 price log10_price sqft_living log10_size <dbl> <dbl> <int> <dbl> 1 221900 5.35 1180 3.07 2 538000 5.73 2570 3.41 3 180000 5.26 770 2.89 4 604000 5.78 1960 3.29 5 510000 5.71 1680 3.23 6 1225000 6.09 5420 3.73 7 257500 5.41 1715 3.23 8 291850 5.47 1060 3.03 9 229500 5.36 1780 3.25 10 323000 5.51 1890 3.28 Observe in particular: The house in the 6th row with price $1,225,000, which is just above one million dollars. Since \\(10^6\\) is one million, its log10_price is 6.09. Contrast this with all other houses with log10_price less than 6. Similarly, there is only one house with size sqft_living less than 1000. Since \\(1000 = 10^3\\), its the lone house with log10_size less than 3. Let’s now visualize the before and after effects of this transformation for price in Figure 12.4. # Before: ggplot(house_prices, aes(x = price)) + geom_histogram(color = "white") + labs(x = "price (USD)", title = "House price: Before") # After: ggplot(house_prices, aes(x = log10_price)) + geom_histogram(color = "white") + labs(x = "log10 price (USD)", title = "House price: After") FIGURE 12.4: House price before and after log10-transformation Observe that after the transformation, the distribution is much less skewed, and in this case, more symmetric and bell-shaped, although this isn’t always necessarily the case. Now you can now better discriminate between house prices at the lower end of the scale. Let’s do the same for size where the before variable is sqft_living and the after variable is log10_size. Observe in Figure 12.5 that the log10-transformation has a similar effect of un-skewing the variable. Again, we emphasize that while in these two cases the resulting distributions are more symmetric and bell-shaped, this is not always necessarily the case. 
# Before: ggplot(house_prices, aes(x = sqft_living)) + geom_histogram(color = "white") + labs(x = "living space (square feet)", title = "House size: Before") # After: ggplot(house_prices, aes(x = log10_size)) + geom_histogram(color = "white") + labs(x = "log10 living space (square feet)", title = "House size: After") FIGURE 12.5: House size before and after log10-transformation Given the now un-skewed nature of log10_price and log10_size, we are going to revise our modeling structure: We’ll use a new outcome variable \\(y\\) log10_price of houses The two explanatory/predictor variables we’ll use are: \\(x_1\\): A modified version of house size: log10_size \\(x_2\\): House condition will remain unchanged 12.1.3 EDA Part II Let’s continue our exploratory data analysis from Subsection 12.1.1 above. The earlier EDA you performed was univariate in nature in that we only considered one variable at a time. The goal of modeling, however, is to explore relationships between variables. So we must jointly consider the relationship between the outcome variable log10_price and the explanatory/predictor variables log10_size (numerical) and condition (categorical). We viewed such a modeling scenario in Section 7.2 using the evals dataset, where the outcome variable was teaching score, the numerical explanatory/predictor variable was instructor age and the categorical explanatory/predictor variable was (binary) gender. We have two possible visual models. Either a parallel slopes model in Figure 12.6 where we have a different regression line for each of the 5 possible condition levels, each with a different intercept but the same slope: FIGURE 12.6: Parallel slopes model Or an interaction model in Figure 12.7, where we allow each regression line to not only have different intercepts, but different slopes as well: ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) + geom_point(alpha = 0.1) + labs(y = "log10 price", x = "log10 size", title = "House prices in Seattle") + geom_smooth(method = "lm", se = FALSE) FIGURE 12.7: Interaction model In both cases, we see there is a positive relationship between house price and size, meaning as houses are larger, they tend to be more expensive. Furthermore, in both plots it seems that houses of condition 5 tend to be the most expensive for most house sizes as evidenced by the fact that the purple line is highest, followed by condition 4 and 3. As for condition 1 and 2, this pattern isn’t as clear, as if you recall from the univariate barplot of condition in Figure 12.3 there are very few houses of condition 1 or 2. This reality is more apparent in an alternative visualization to Figure 12.7 displayed in Figure 12.8 that uses facets instead: ggplot(house_prices, aes(x = log10_size, y = log10_price, col = condition)) + geom_point(alpha = 0.3) + labs(y = "log10 price", x = "log10 size", title = "House prices in Seattle") + geom_smooth(method = "lm", se = FALSE) + facet_wrap(~condition) FIGURE 12.8: Interaction model with facets Which exploratory visualization of the interaction model is better, the one in Figure 12.7 or Figure 12.8? There is no universal right answer, you need to make a choice depending on what you want to convey, and own it. 12.1.4 Regression modeling For now let’s focus on the latter, interaction model we’ve visualized in Figure 12.8 above. What are the 5 different slopes and intercepts for the condition = 1, condition = 2, …, and condition = 5 lines in Figure 12.8? 
To determine these, we first need the values from the regression table: # Fit regression model: price_interaction <- lm(log10_price ~ log10_size * condition, data = house_prices) # Get regression table: get_regression_table(price_interaction) # A tibble: 10 x 7 term estimate std_error statistic p_value lower_ci upper_ci <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 intercept 3.33 0.451 7.38 0 2.45 4.22 2 log10_size 0.69 0.148 4.65 0 0.399 0.98 3 condition2 0.047 0.498 0.094 0.925 -0.93 1.02 4 condition3 -0.367 0.452 -0.812 0.417 -1.25 0.519 5 condition4 -0.398 0.453 -0.879 0.38 -1.29 0.49 6 condition5 -0.883 0.457 -1.93 0.053 -1.78 0.013 7 log10_size:condition2 -0.024 0.163 -0.148 0.882 -0.344 0.295 8 log10_size:condition3 0.133 0.148 0.893 0.372 -0.158 0.424 9 log10_size:condition4 0.146 0.149 0.979 0.328 -0.146 0.437 10 log10_size:condition5 0.31 0.15 2.07 0.039 0.016 0.604 Recall from Section 7.2.3 on how to interpret the outputs where there exists an interaction term, where in this case the “baseline for comparison” group for the categorical variable condition are the condition 1 houses. We’ll write our answers as: \\[\\widehat{\\log10(\\text{price})} = \\hat{\\beta}_0 + \\hat{\\beta}_{\\text{size}} * \\log10(\\text{size})\\] for all five condition levels separately: Condition 1: \\(\\widehat{\\log10(\\text{price})} = 3.33 + 0.69 * \\log10(\\text{size})\\) Condition 2: \\(\\widehat{\\log10(\\text{price})} = (3.33 + 0.047) + (0.69 - 0.024) * \\log10(\\text{size}) = 3.38 + 0.666 * \\log10(\\text{size})\\) Condition 3: \\(\\widehat{\\log10(\\text{price})} = (3.33 - 0.367) + (0.69 + 0.133) * \\log10(\\text{size}) = 2.96 + 0.823 * \\log10(\\text{size})\\) Condition 4: \\(\\widehat{\\log10(\\text{price})} = (3.33 - 0.398) + (0.69 + 0.146) * \\log10(\\text{size}) = 2.93 + 0.836 * \\log10(\\text{size})\\) Condition 5: \\(\\widehat{\\log10(\\text{price})} = (3.33 - 0.883) + (0.69 + 0.31) * \\log10(\\text{size}) = 2.45 + 1 * \\log10(\\text{size})\\) These correspond to the regression lines in the exploratory visualization of the interaction model in Figure 12.7 above. For homes of all 5 condition types, as the size of the house increases, the prices increases. This is what most would expect. However, the rate of increase of price with size is fastest for the homes with condition 3, 4, and 5 of 0.823, 0.836, and 1 respectively; these are the 3 most largest slopes out of the 5. 12.1.5 Making predictions Say you’re a realtor and someone calls you asking you how much their home will sell for. They tell you that it’s in condition = 5 and is sized 1900 square feet. What do you tell them? We first make this prediction visually in Figure 12.9. The predicted log10_price of this house is marked with a black dot: it is where the two following lines intersect: The purple regression line for the condition = 5 homes and The vertical dashed black line at log10_size equals 3.28, since our predictor variable is the log10-transformed square feet of living space and \\(\\log10(1900) = 3.28\\) . FIGURE 12.9: Interaction model with prediction Eyeballing it, it seems the predicted log10_price seems to be around 5.72. Let’s now obtain the an exact numerical value for the prediction using the values of the intercept and slope for the condition = 5 that we computed using the regression table output. We use the equation for the condition = 5 line, being sure to log10() the square footage first. 2.45 + 1 * log10(1900) [1] 5.73 This value is very close to our earlier visually made prediction of 5.72. But wait! 
But wait! Remember that we used log10_price as our outcome variable. So if we want a prediction in terms of price in dollar units, we need to un-log this by taking a power of 10, as described in Section 12.1.2. 10^(2.45 + 1 * log10(1900)) [1] 535493 So our predicted price for this home of condition 5 and size 1900 square feet is $535,493. Learning check (LC12.1) Repeat the regression modeling in Subsection 12.1.4 and the prediction you just made for the house of condition 5 and size 1900 square feet in Subsection 12.1.5, but using the parallel slopes model you visualized in Figure 12.6. Hint: it’s $524,807! 12.2 Case study: Effective data storytelling Note: This section is still under construction. If you would like to contribute, please check us out on GitHub at https://github.com/moderndive/moderndive_book. As we’ve progressed throughout this book, you’ve seen how to work with data in a variety of ways. You’ve learned effective strategies for plotting data by understanding which types of plots work best for which combinations of variable types. You’ve summarized data in table form and calculated summary statistics for a variety of different variables. Further, you’ve seen the value of inference as a process to come to conclusions about a population by using a random sample. Lastly, you’ve explored how to use linear regression and the importance of checking the conditions required to make it a valid procedure. All throughout, you’ve learned many computational techniques and focused on reproducible research in writing R code. We now present another case study, but this time of the “effective data storytelling” done by data journalists around the world. Great data stories don’t mislead the reader, but rather engulf them in understanding the importance that data plays in our lives through the captivation of storytelling. 12.2.1 Bechdel test for Hollywood gender representation We recommend you read and analyze this article by Walt Hickey entitled The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women on the Bechdel test, an informal test of gender representation in a movie. As you read over it, think carefully about how Walt is using data, graphics, and analyses to paint the picture for the reader of the story he wants to tell. In the spirit of reproducibility, the members of FiveThirtyEight have also shared the data and R code that they used to create this story and many more of their articles on GitHub. ModernDive co-authors Chester Ismay and Albert Y. Kim along with Jennifer Chunn went one step further by creating the fivethirtyeight R package. The fivethirtyeight package takes FiveThirtyEight’s article data from GitHub, “tames” it so that it’s novice-friendly, and makes all data, documentation, and the original article easily accessible via an R package. The package homepage also includes a list of all fivethirtyeight data sets included. Furthermore, example “vignettes” of fully reproducible start-to-finish analyses of some of these data using dplyr, ggplot2, and other packages in the tidyverse are available here. For example, a vignette showing how to reproduce one of the plots at the end of the above article on the Bechdel test is available here.
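If you want to explore these data yourself, a minimal way to get started (assuming the fivethirtyeight package is installed) is to load it and take a quick look at its bechdel data frame, which accompanies the article above:
library(fivethirtyeight)
library(dplyr)
# Take a quick look at the Bechdel test data used in the article
glimpse(bechdel)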
12.2.2 US Births in 1999 Here is another example involving the US_births_1994_2003 data frame of all births in the United States between 1994 and 2003. For more information on this data frame, including a link to the original article on FiveThirtyEight.com, check out the help file by running ?US_births_1994_2003 in the console. First, let’s load all necessary packages: library(ggplot2) library(dplyr) library(fivethirtyeight) It’s always a good idea to preview your data, either by using RStudio’s spreadsheet-style viewer via View() or by using glimpse() from the dplyr package below: # Preview data glimpse(US_births_1994_2003) Observations: 3,652 Variables: 6 $ year <int> 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1994, 1… $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ date_of_month <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, … $ date <date> 1994-01-01, 1994-01-02, 1994-01-03, 1994-01-04, 1994-0… $ day_of_week <ord> Sat, Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tu… $ births <int> 8096, 7772, 10142, 11248, 11053, 11406, 11251, 8653, 79… We’ll focus on the number of births for each date, but only for births that occurred in 1999. Recall we achieve this using the filter() command from the dplyr package: US_births_1999 <- US_births_1994_2003 %>% filter(year == 1999) Since date is a notion of time, which has a sequential ordering to it, a linegraph, AKA a “time series” plot, would be more appropriate than a scatterplot. In other words, use geom_line() instead of geom_point(): ggplot(US_births_1999, aes(x = date, y = births)) + geom_line() + labs(x = "Date", y = "Number of births", title = "US Births in 1999") We see a big valley occurring just before January 1st, 2000, most likely due to the holiday season. However, what about the major peak of over 14,000 births occurring just before October 1st, 1999? What could be the reason for this anomalously high spike in births? Time to think with data! 12.2.3 Other examples Stand by! 12.2.4 Script of R code An R script file of all R code used in this chapter is available here. Concluding remarks If you’ve come to this point in the book, I’d suspect that you know a thing or two about how to work with data in R. You’ve also gained a lot of knowledge about how to use simulation techniques to determine statistical significance and how these techniques build an intuition about traditional inferential methods like the \\(t\\)-test. The hope is that you’ve come to appreciate data wrangling, tidy datasets, and the power of data visualization. Actually, the data visualization part may be the most important thing here. If you can create truly beautiful graphics that display information in ways that the reader can clearly decipher, you’ve picked up a great skill. Let’s hope that skill keeps you creating great stories with data into the near and distant future. Thanks for coming along for the ride as we dove into modern data analysis using R! "], ["A-appendixA.html", "A Statistical Background A.1 Basic statistical terms A.2 Normal distribution discussion", " A Statistical Background A.1 Basic statistical terms A.1.1 Mean The mean, AKA the average, is the most commonly reported measure of center, though the term “average” can be a little ambiguous. The mean is the sum of all of the data elements divided by how many elements there are. If we have \\(n\\) data points, the mean is given by: \\[Mean = \\frac{x_1 + x_2 + \\cdots + x_n}{n}\\] A.1.2 Median The median is calculated by first sorting a variable’s data from smallest to largest. After sorting the data, the middle element in the list is the median.
If the middle falls between two values, then the median is the mean of those two values. A.1.3 Standard deviation We will next discuss the standard deviation of a sample dataset pertaining to one variable. The formula can be a little intimidating at first but it is important to remember that it is essentially a measure of how far to expect a given data value is from its mean: \\[Standard \\, deviation = \\sqrt{\\frac{(x_1 - Mean)^2 + (x_2 - Mean)^2 + \\cdots + (x_n - Mean)^2}{n - 1}}\\] A.1.4 Five-number summary The five-number summary consists of five values: minimum, first quantile AKA 25th percentile, second quantile AKA median AKA 50th percentile, third quantile AKA 75th, and maximum. The quantiles are calculated as first quantile (\\(Q_1\\)): the median of the first half of the sorted data third quantile (\\(Q_3\\)): the median of the second half of the sorted data The interquartile range is defined as \\(Q_3 - Q_1\\) and is a measure of how spread out the middle 50% of values is. The five-number summary is not influenced by the presence of outliers in the ways that the mean and standard deviation are. It is, thus, recommended for skewed datasets. A.1.5 Distribution The distribution of a variable/dataset corresponds to generalizing patterns in the dataset. It often shows how frequently elements in the dataset appear. It shows how the data varies and gives some information about where a typical element in the data might fall. Distributions are most easily seen through data visualization. A.1.6 Outliers Outliers correspond to values in the dataset that fall far outside the range of “ordinary” values. In regards to a boxplot (by default), they correspond to values below \\(Q_1 - (1.5 * IQR)\\) or above \\(Q_3 + (1.5 * IQR)\\). Note that these terms (aside from Distribution) only apply to quantitative variables. A.2 Normal distribution discussion "], ["B-appendixB.html", "B Inference Examples Needed packages B.1 Inference mind map B.2 One mean B.3 One proportion B.4 Two proportions B.5 Two means (independent samples) B.6 Two means (paired samples)", " B Inference Examples This appendix is designed to provide you with examples of the five basic hypothesis tests and their corresponding confidence intervals. Traditional theory-based methods as well as computational-based methods are presented. Note: This appendix is still under construction. If you would like to contribute, please check us out on GitHub at https://github.com/moderndive/moderndive_book. Please check out our sneak peak of infer below in the meanwhile. For more details on infer visit https://infer.netlify.com/. include_image(path = "images/sign-2408065_1920.png", html_opts="height=100px", latex_opts = "width=20%") ![](images/sign-2408065_1920.png){ height=100px } Needed packages library(dplyr) library(ggplot2) library(infer) library(knitr) library(kableExtra) library(readr) library(janitor) B.1 Inference mind map To help you better navigate and choose the appropriate analysis, we’ve created a mind map on http://coggle.it available here and below. FIGURE B.1: Mind map for Inference B.2 One mean B.2.1 Problem statement The National Survey of Family Growth conducted by the Centers for Disease Control gathers information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health. One of the variables collected on this survey is the age at first marriage. 5,534 randomly sampled US women between 2006 and 2010 completed the survey. 
The women sampled here had been married at least once. Do we have evidence that the mean age of first marriage for all US women from 2006 to 2010 is greater than 23 years? (Tweaked a bit from Diez, Barr, and Çetinkaya-Rundel 2014 [Chapter 4]) B.2.2 Competing hypotheses In words Null hypothesis: The mean age of first marriage for all US women from 2006 to 2010 is equal to 23 years. Alternative hypothesis: The mean age of first marriage for all US women from 2006 to 2010 is greater than 23 years. In symbols (with annotations) \\(H_0: \\mu = \\mu_{0}\\), where \\(\\mu\\) represents the mean age of first marriage for all US women from 2006 to 2010 and \\(\\mu_0\\) is 23. \\(H_A: \\mu > 23\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. B.2.3 Exploring the sample data age_at_marriage <- read_csv("https://moderndive.com/data/ageAtMar.csv") age_summ <- age_at_marriage %>% summarize(sample_size = n(), mean = mean(age), sd = sd(age), minimum = min(age), lower_quartile = quantile(age, 0.25), median = median(age), upper_quartile = quantile(age, 0.75), max = max(age)) kable(age_summ) %>% kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16), latex_options = c("HOLD_position")) sample_size mean sd minimum lower_quartile median upper_quartile max 5534 23.4 4.72 10 20 23 26 43 The histogram below also shows the distribution of age. ggplot(data = age_at_marriage, mapping = aes(x = age)) + geom_histogram(binwidth = 3, color = "white") The observed statistic of interest here is the sample mean: x_bar <- age_at_marriage %>% specify(response = age) %>% calculate(stat = "mean") x_bar # A tibble: 1 x 1 stat <dbl> 1 23.4 Guess about statistical significance We are looking to see if the observed sample mean of 23.44 is statistically greater than \\(\\mu_0 = 23\\). They seem to be quite close, but we have a large sample size here. Let’s guess that the large sample size will lead us to reject this practically small difference. B.2.4 Non-traditional methods Bootstrapping for hypothesis test In order to look to see if the observed sample mean of 23.44 is statistically greater than \\(\\mu_0 = 23\\), we need to account for the sample size. We also need to determine a process that replicates how the original sample of size 5534 was selected. We can use the idea of bootstrapping to simulate the population from which the sample came and then generate samples from that simulated population to account for sampling variability. Recall how bootstrapping would apply in this context: Sample with replacement from our original sample of 5534 women and repeat this process 10,000 times, calculate the mean for each of the 10,000 bootstrap samples created in Step 1., combine all of these bootstrap statistics calculated in Step 2 into a boot_distn object, and shift the center of this distribution over to the null value of 23. (This is needed since it will be centered at 23.44 via the process of bootstrapping.) set.seed(2018) null_distn_one_mean <- age_at_marriage %>% specify(response = age) %>% hypothesize(null = "point", mu = 23) %>% generate(reps = 10000) %>% calculate(stat = "mean") null_distn_one_mean %>% visualize() We can next use this distribution to observe our \\(p\\)-value. Recall this is a right-tailed test so we will be looking for values that are greater than or equal to 23.44 for our \\(p\\)-value. 
null_distn_one_mean %>% visualize(obs_stat = x_bar, direction = "greater") Calculate \\(p\\)-value pvalue <- null_distn_one_mean %>% get_pvalue(obs_stat = x_bar, direction = "greater") pvalue # A tibble: 1 x 1 p_value <dbl> 1 0 So our \\(p\\)-value is 0 and we reject the null hypothesis at the 5% level. You can also see this from the histogram above that we are far into the tail of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\mu\\) using our sample data using bootstrapping. Note that we don’t need to shift this distribution since we want the center of our confidence interval to be our point estimate \\(\\bar{x}_{obs} = 23.44\\). boot_distn_one_mean <- age_at_marriage %>% specify(response = age) %>% generate(reps = 10000) %>% calculate(stat = "mean") ci <- boot_distn_one_mean %>% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` <dbl> <dbl> 1 23.3 23.6 boot_distn_one_mean %>% visualize(endpoints = ci, direction = "between") We see that 23 is not contained in this confidence interval as a plausible value of \\(\\mu\\) (the unknown population mean) and the entire interval is larger than 23. This matches with our hypothesis test results of rejecting the null hypothesis in favor of the alternative (\\(\\mu > 23\\)). Interpretation: We are 95% confident the true mean age of first marriage for all US women from 2006 to 2010 is between 23.316 and 23.565. B.2.5 Traditional methods Check conditions Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: The observations are collected independently. The cases are selected independently through random sampling so this condition is met. Approximately normal: The distribution of the response variable should be normal or the sample size should be at least 30. The histogram for the sample above does show some skew. The Q-Q plot below also shows some skew. ggplot(data = age_at_marriage, mapping = aes(sample = age)) + stat_qq() The sample size here is quite large though (\\(n = 5534\\)) so both conditions are met. Test statistic The test statistic is a random variable based on the sample data. Here, we want to look at a way to estimate the population mean \\(\\mu\\). A good guess is the sample mean \\(\\bar{X}\\). Recall that this sample mean is actually a random variable that will vary as different samples are (theoretically, would be) collected. We are looking to see how likely is it for us to have observed a sample mean of \\(\\bar{x}_{obs} = 23.44\\) or larger assuming that the population mean is 23 (assuming the null hypothesis is true). If the conditions are met and assuming \\(H_0\\) is true, we can “standardize” this original test statistic of \\(\\bar{X}\\) into a \\(T\\) statistic that follows a \\(t\\) distribution with degrees of freedom equal to \\(df = n - 1\\): \\[ T =\\dfrac{ \\bar{X} - \\mu_0}{ S / \\sqrt{n} } \\sim t (df = n - 1) \\] where \\(S\\) represents the standard deviation of the sample and \\(n\\) is the sample size. Observed test statistic While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the t_test() function to perform this analysis for us. 
t_test_results <- age_at_marriage %>% infer::t_test(formula = age ~ NULL, alternative = "greater", mu = 23) t_test_results # A tibble: 1 x 6 statistic t_df p_value alternative lower_ci upper_ci <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 6.94 5533 2.25e-12 greater 23.3 Inf We see here that the \\(t_{obs}\\) value is 6.936. Compute \\(p\\)-value The \\(p\\)-value—the probability of observing an \\(t_{obs}\\) value of 6.936 or more in our null distribution of a \\(t\\) with 5533 degrees of freedom—is essentially 0. State conclusion We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that our observed sample mean was statistically greater than the hypothesized mean has supporting evidence here. Based on this sample, we have evidence that the mean age of first marriage for all US women from 2006 to 2010 is greater than 23 years. Confidence interval t.test(x = age_at_marriage$age, alternative = "two.sided", mu = 23)$conf [1] 23.3 23.6 attr(,"conf.level") [1] 0.95 B.2.6 Comparing results Observing the bootstrap distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions also being met (the large sample size was the driver here) leads us to better guess that using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) will lead to similar results. B.3 One proportion B.3.1 Problem statement The CEO of a large electric utility claims that 80 percent of his 1,000,000 customers are satisfied with the service they receive. To test this claim, the local newspaper surveyed 100 customers, using simple random sampling. 73 were satisfied and the remaining were unsatisfied. Based on these findings from the sample, can we reject the CEO’s hypothesis that 80% of the customers are satisfied? [Tweaked a bit from http://stattrek.com/hypothesis-test/proportion.aspx?Tutorial=AP] B.3.2 Competing hypotheses In words Null hypothesis: The proportion of all customers of the large electric utility satisfied with service they receive is equal 0.80. Alternative hypothesis: The proportion of all customers of the large electric utility satisfied with service they receive is different from 0.80. In symbols (with annotations) \\(H_0: \\pi = p_{0}\\), where \\(\\pi\\) represents the proportion of all customers of the large electric utility satisfied with service they receive and \\(p_0\\) is 0.8. \\(H_A: \\pi \\ne 0.8\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. B.3.3 Exploring the sample data elec <- c(rep("satisfied", 73), rep("unsatisfied", 27)) %>% as_data_frame() %>% rename(satisfy = value) The bar graph below also shows the distribution of satisfy. ggplot(data = elec, aes(x = satisfy)) + geom_bar() The observed statistic is computed as p_hat <- elec %>% specify(response = satisfy, success = "satisfied") %>% calculate(stat = "prop") p_hat # A tibble: 1 x 1 stat <dbl> 1 0.73 Guess about statistical significance We are looking to see if the sample proportion of 0.73 is statistically different from \\(p_0 = 0.8\\) based on this sample. They seem to be quite close, and our sample size is not huge here (\\(n = 100\\)). Let’s guess that we do not have evidence to reject the null hypothesis. 
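As a quick cross-check, the same observed proportion can be computed without infer by taking the mean of a logical vector, a trick we will lean on again when discussing the bootstrap below; a dplyr sketch:
# Proportion of customers who are satisfied, computed as the mean of a logical vector
elec %>%
  summarize(prop_satisfied = mean(satisfy == "satisfied"))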
B.3.4 Non-traditional methods Simulation for hypothesis test In order to look to see if 0.73 is statistically different from 0.8, we need to account for the sample size. We also need to determine a process that replicates how the original sample of size 100 was selected. We can use the idea of an unfair coin to simulate this process. We will simulate flipping an unfair coin (with probability of success 0.8 matching the null hypothesis) 100 times. Then we will keep track of how many heads come up in those 100 flips. Our simulated statistic matches with how we calculated the original statistic \\(\\hat{p}\\): the number of heads (satisfied) out of our total sample of 100. We then repeat this process many times (say 10,000) to create the null distribution looking at the simulated proportions of successes: set.seed(2018) null_distn_one_prop <- elec %>% specify(response = satisfy, success = "satisfied") %>% hypothesize(null = "point", p = 0.8) %>% generate(reps = 10000) %>% calculate(stat = "prop") null_distn_one_prop %>% visualize() We can next use this distribution to observe our \\(p\\)-value. Recall this is a two-tailed test so we will be looking for values that are 0.8 - 0.73 = 0.07 away from 0.8 in BOTH directions for our \\(p\\)-value: null_distn_one_prop %>% visualize(obs_stat = p_hat, direction = "both") Calculate \\(p\\)-value pvalue <- null_distn_one_prop %>% get_pvalue(obs_stat = p_hat, direction = "both") pvalue # A tibble: 1 x 1 p_value <dbl> 1 0.114 So our \\(p\\)-value is 0.114 and we fail to reject the null hypothesis at the 5% level. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\pi\\) using our sample data. To do so, we use bootstrapping, which involves sampling with replacement from our original sample of 100 survey respondents and repeating this process 10,000 times, calculating the proportion of successes for each of the 10,000 bootstrap samples created in Step 1., combining all of these bootstrap statistics calculated in Step 2 into a boot_distn object, identifying the 2.5th and 97.5th percentiles of this distribution (corresponding to the 5% significance level chosen) to find a 95% confidence interval for \\(\\pi\\), and interpret this confidence interval in the context of the problem. boot_distn_one_prop <- elec %>% specify(response = satisfy, success = "satisfied") %>% generate(reps = 10000) %>% calculate(stat = "prop") Just as we use the mean function for calculating the mean over a numerical variable, we can also use it to compute the proportion of successes for a categorical variable where we specify what we are calling a “success” after the ==. (Think about the formula for calculating a mean and how R handles logical statements such as satisfy == "satisfied" for why this must be true.) ci <- boot_distn_one_prop %>% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` <dbl> <dbl> 1 0.64 0.81 boot_distn_one_prop %>% visualize(endpoints = ci, direction = "between") We see that 0.80 is contained in this confidence interval as a plausible value of \\(\\pi\\) (the unknown population proportion). This matches with our hypothesis test results of failing to reject the null hypothesis. Interpretation: We are 95% confident the true proportion of customers who are satisfied with the service they receive is between 0.64 and 0.81. B.3.5 Traditional methods Check conditions Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met. 
Independent observations: The observations are collected independently. The cases are selected independently through random sampling so this condition is met. Approximately normal: The number of expected successes and the number of expected failures are both at least 10. This condition is met since 73 and 27 are both greater than 10. Test statistic The test statistic is a random variable based on the sample data. Here, we want to look at a way to estimate the population proportion \\(\\pi\\). A good guess is the sample proportion \\(\\hat{P}\\). Recall that this sample proportion is actually a random variable that will vary as different samples are (theoretically, would be) collected. We are looking to see how likely it is for us to have observed a sample proportion of \\(\\hat{p}_{obs} = 0.73\\) or one even more extreme, assuming that the population proportion is 0.80 (assuming the null hypothesis is true). If the conditions are met and assuming \\(H_0\\) is true, we can standardize this original test statistic of \\(\\hat{P}\\) into a \\(Z\\) statistic that follows a \\(N(0, 1)\\) distribution. \\[ Z =\\dfrac{ \\hat{P} - p_0}{\\sqrt{\\dfrac{p_0(1 - p_0)}{n} }} \\sim N(0, 1) \\] Observed test statistic While one could compute this observed test statistic by “hand” by plugging the observed values into the formula, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. The calculation has been done in R below for completeness though: p_hat <- 0.73 p0 <- 0.8 n <- 100 (z_obs <- (p_hat - p0) / sqrt( (p0 * (1 - p0)) / n)) [1] -1.75 We see here that the \\(z_{obs}\\) value is around -1.75. Our observed sample proportion of 0.73 is 1.75 standard errors below the hypothesized parameter value of 0.8. Visualize and compute \\(p\\)-value elec %>% specify(response = satisfy, success = "satisfied") %>% hypothesize(null = "point", p = 0.8) %>% calculate(stat = "z") %>% visualize(method = "theoretical", obs_stat = z_obs, direction = "both") 2 * pnorm(z_obs) [1] 0.0801 The \\(p\\)-value—the probability of observing a \\(z_{obs}\\) value of -1.75 or more extreme (in both directions) in our null distribution—is around 8%. Note that we could also do this test directly using the prop.test function. stats::prop.test(x = table(elec$satisfy), n = length(elec$satisfy), alternative = "two.sided", p = 0.8, correct = FALSE) 1-sample proportions test without continuity correction data: table(elec$satisfy), null probability 0.8 X-squared = 3, df = 1, p-value = 0.08 alternative hypothesis: true p is not equal to 0.8 95 percent confidence interval: 0.636 0.807 sample estimates: p 0.73 prop.test does a \\(\\chi^2\\) test here but this matches up exactly with what we would expect: \\(\\chi^2_{obs} = 3.06 = (-1.75)^2 = (z_{obs})^2\\) and the \\(p\\)-values are the same because we are focusing on a two-tailed test. Note that the 95 percent confidence interval given above matches well with the one calculated using bootstrapping. State conclusion We, therefore, do not have sufficient evidence to reject the null hypothesis. Our initial guess that our observed sample proportion was not statistically different from the hypothesized proportion has not been invalidated. Based on this sample, we do not have evidence that the proportion of all customers of the large electric utility satisfied with the service they receive is different from 0.80, at the 5% level.
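For completeness, base R also offers an exact test of a single proportion via binom.test(); a sketch using the observed 73 successes out of 100 is below. Because it uses the exact binomial distribution rather than a normal approximation, its \\(p\\)-value and confidence interval will differ slightly from those above.
# Exact binomial test of H0: p = 0.8 against a two-sided alternative
binom.test(x = 73, n = 100, p = 0.8, alternative = "two.sided")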
B.3.6 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions also being met leads us to better guess that using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) will lead to similar results. B.4 Two proportions B.4.1 Problem statement A 2010 survey asked 827 randomly sampled registered voters in California “Do you support? Or do you oppose? Drilling for oil and natural gas off the Coast of California? Or do you not know enough to say?” Conduct a hypothesis test to determine if the data provide strong evidence that the proportion of college graduates who do not have an opinion on this issue is different than that of non-college graduates. (Tweaked a bit from Diez, Barr, and Çetinkaya-Rundel 2014 [Chapter 6]) B.4.2 Competing hypotheses In words Null hypothesis: There is no association between having an opinion on drilling and having a college degree for all registered California voters in 2010. Alternative hypothesis: There is an association between having an opinion on drilling and having a college degree for all registered California voters in 2010. Another way in words Null hypothesis: The probability that a California voter in 2010 who is a college graduate has no opinion on drilling is the same as that for a non-college graduate. Alternative hypothesis: These parameter probabilities are different. In symbols (with annotations) \\(H_0: \\pi_{college} = \\pi_{no\\_college}\\) or \\(H_0: \\pi_{college} - \\pi_{no\\_college} = 0\\), where \\(\\pi\\) represents the probability of not having an opinion on drilling. \\(H_A: \\pi_{college} - \\pi_{no\\_college} \\ne 0\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. B.4.3 Exploring the sample data offshore <- read_csv("https://moderndive.com/data/offshore.csv") offshore %>% tabyl(college_grad, response) college_grad no opinion opinion no 131 258 yes 104 334 off_summ <- offshore %>% group_by(college_grad) %>% summarize(prop_no_opinion = mean(response == "no opinion"), sample_size = n()) ggplot(offshore, aes(x = college_grad, fill = response)) + geom_bar(position = "fill") + coord_flip() Guess about statistical significance We are looking to see if a difference exists in the size of the bars corresponding to no opinion for the plot. Based solely on the plot, we have little reason to believe that a difference exists since the bars seem to be about the same size, BUT…it’s important to use statistics to see if that difference is actually statistically significant! B.4.4 Non-traditional methods Collecting summary info The observed statistic is d_hat <- offshore %>% specify(response ~ college_grad, success = "no opinion") %>% calculate(stat = "diff in props", order = c("yes", "no")) d_hat # A tibble: 1 x 1 stat <dbl> 1 -0.0993 Randomization for hypothesis test In order to look to see if the observed sample proportion of no opinion for college graduates of 0.237 is statistically different than that of 0.337 for non-college graduates, we need to account for the sample sizes. Note that this is the same as looking to see if \\(\\hat{p}_{college} - \\hat{p}_{no\\_college}\\) is statistically different than 0.
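Before building the null distribution, it can be reassuring to recompute this observed difference by hand; a dplyr sketch, building on the off_summ summary above, is:
# Cross-check of d_hat: difference in "no opinion" proportions between the two groups
offshore %>%
  group_by(college_grad) %>%
  summarize(prop_no_opinion = mean(response == "no opinion")) %>%
  summarize(diff_in_props = diff(prop_no_opinion))
# diff() subtracts the first row (college_grad == "no") from the second ("yes"),
# matching the order = c("yes", "no") used in calculate() above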
We also need to determine a process that replicates how the original group sizes of 389 and 438 were selected. We can use the idea of randomization testing (also known as permutation testing) to simulate the population from which the sample came (with two groups of different sizes) and then generate samples using shuffling from that simulated population to account for sampling variability. set.seed(2018) null_distn_two_props <- offshore %>% specify(response ~ college_grad, success = "no opinion") %>% hypothesize(null = "independence") %>% generate(reps = 10000) %>% calculate(stat = "diff in props", order = c("yes", "no")) null_distn_two_props %>% visualize() We can next use this distribution to observe our \\(p\\)-value. Recall this is a two-tailed test so we will be looking for values that are less than or equal to -0.099 or greater than or equal to 0.099 for our \\(p\\)-value. null_distn_two_props %>% visualize(obs_stat = d_hat, direction = "two_sided") Calculate \\(p\\)-value pvalue <- null_distn_two_props %>% get_pvalue(obs_stat = d_hat, direction = "two_sided") pvalue # A tibble: 1 x 1 p_value <dbl> 1 0.003 So our \\(p\\)-value is 0.003 and we reject the null hypothesis at the 5% level. You can also see from the histogram above that we are far into the tails of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\pi_{college} - \\pi_{no\\_college}\\) using our sample data with bootstrapping. boot_distn_two_props <- offshore %>% specify(response ~ college_grad, success = "no opinion") %>% generate(reps = 10000) %>% calculate(stat = "diff in props", order = c("yes", "no")) ci <- boot_distn_two_props %>% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` <dbl> <dbl> 1 -0.161 -0.0378 boot_distn_two_props %>% visualize(endpoints = ci, direction = "between") We see that 0 is not contained in this confidence interval as a plausible value of \\(\\pi_{college} - \\pi_{no\\_college}\\) (the unknown population parameter). This matches with our hypothesis test results of rejecting the null hypothesis. Since zero is not a plausible value of the population parameter, we have evidence that the proportion of college graduates in California with no opinion on drilling is different than that of non-college graduates. Interpretation: We are 95% confident the true proportion of college graduates with no opinion on offshore drilling in California is between 0.04 and 0.16 smaller than that of non-college graduates. B.4.5 Traditional methods B.4.6 Check conditions Remember that in order to use the short-cut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: Each case that was selected must be independent of all the other cases selected. This condition is met since cases were selected at random to observe. Sample size: The number of pooled successes and pooled failures must be at least 10 for each group. We need to first figure out the pooled success rate: \\[\\hat{p}_{obs} = \\dfrac{131 + 104}{827} = 0.28.\\] We now determine the expected (pooled) success and failure counts (a quick R sketch of this arithmetic is given just after this list): \\(0.28 \\cdot (131 + 258) = 108.92\\), \\(0.72 \\cdot (131 + 258) = 280.08\\) \\(0.28 \\cdot (104 + 334) = 122.64\\), \\(0.72 \\cdot (104 + 334) = 315.36\\) Independent selection of samples: The cases are not paired in any meaningful way. We have no reason to suspect that a college graduate selected would have any relationship to a non-college graduate selected.
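As promised above, here is a quick sketch of that pooled-count arithmetic in R, using the counts from the tabyl output earlier (the text above rounds the pooled proportion to 0.28, so the numbers printed here will differ very slightly):
# Pooled proportion of "no opinion" responses across both groups
p_pool <- (131 + 104) / 827
# Expected successes and failures for the non-college (n = 389) and college (n = 438) groups
c(p_pool * 389, (1 - p_pool) * 389)
c(p_pool * 438, (1 - p_pool) * 438)
# All four expected counts are comfortably above 10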
B.4.7 Test statistic The test statistic is a random variable based on the sample data. Here, we are interested in seeing if our observed difference in sample proportions corresponding to no opinion on drilling (\\(\\hat{p}_{college, obs} - \\hat{p}_{no\\_college, obs} = -0.099\\)) is statistically different than 0. Assuming that conditions are met and the null hypothesis is true, we can use the standard normal distribution to standardize the difference in sample proportions (\\(\\hat{P}_{college} - \\hat{P}_{no\\_college}\\)) using the standard error of \\(\\hat{P}_{college} - \\hat{P}_{no\\_college}\\) and the pooled estimate: \\[ Z =\\dfrac{ (\\hat{P}_1 - \\hat{P}_2) - 0}{\\sqrt{\\dfrac{\\hat{P}(1 - \\hat{P})}{n_1} + \\dfrac{\\hat{P}(1 - \\hat{P})}{n_2} }} \\sim N(0, 1) \\] where \\(\\hat{P} = \\dfrac{\\text{total number of successes} }{ \\text{total number of cases}}.\\) Observed test statistic While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the infer pipeline below to compute it for us. z_hat <- offshore %>% specify(response ~ college_grad, success = "no opinion") %>% calculate(stat = "z", order = c("yes", "no")) z_hat # A tibble: 1 x 1 stat <dbl> 1 -3.16 The observed difference in sample proportions is 3.16 standard errors smaller than 0. The \\(p\\)-value—the probability of observing a \\(Z\\) value of -3.16 or more extreme in our null distribution—is 0.0016. This can also be calculated in R directly: 2 * pnorm(-3.16, lower.tail = TRUE) [1] 0.00158 B.4.8 State conclusion We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that a statistically significant difference did not exist in the proportions of no opinion on offshore drilling between college-educated and non-college-educated Californians was not validated. We do have evidence to suggest that there is a dependency between college graduation and position on offshore drilling for Californians. B.4.9 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions also being met leads us to better guess that using any of the methods, whether they are traditional (formula-based) or non-traditional (computational-based), will lead to similar results. B.5 Two means (independent samples) B.5.1 Problem statement Average income varies from one region of the country to another, and it often reflects both lifestyles and regional living expenses. Suppose a new graduate is considering a job in two locations, Cleveland, OH and Sacramento, CA, and he wants to see whether the average income in one of these cities is higher than the other. He would like to conduct a hypothesis test based on two randomly selected samples from the 2000 Census. (Tweaked a bit from Diez, Barr, and Çetinkaya-Rundel 2014 [Chapter 5]) B.5.2 Competing hypotheses In words Null hypothesis: There is no association between income and location (Cleveland, OH and Sacramento, CA). Alternative hypothesis: There is an association between income and location (Cleveland, OH and Sacramento, CA). Another way in words Null hypothesis: The mean income is the same for both cities.
Alternative hypothesis: The mean income is different for the two cities. In symbols (with annotations) \\(H_0: \\mu_{sac} = \\mu_{cle}\\) or \\(H_0: \\mu_{sac} - \\mu_{cle} = 0\\), where \\(\\mu\\) represents the average income. \\(H_A: \\mu_{sac} - \\mu_{cle} \\ne 0\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. B.5.3 Exploring the sample data cle_sac <- read.delim("https://moderndive.com/data/cleSac.txt") %>% rename(metro_area = Metropolitan_area_Detailed, income = Total_personal_income) %>% na.omit() inc_summ <- cle_sac %>% group_by(metro_area) %>% summarize(sample_size = n(), mean = mean(income), sd = sd(income), minimum = min(income), lower_quartile = quantile(income, 0.25), median = median(income), upper_quartile = quantile(income, 0.75), max = max(income)) kable(inc_summ) %>% kable_styling(font_size = ifelse(knitr:::is_latex_output(), 10, 16), latex_options = c("HOLD_position")) metro_area sample_size mean sd minimum lower_quartile median upper_quartile max Cleveland_ OH 212 27467 27681 0 8475 21000 35275 152400 Sacramento_ CA 175 32428 35774 0 8050 20000 49350 206900 The boxplot below also shows the mean for each group highlighted by the red dots. ggplot(cle_sac, aes(x = metro_area, y = income)) + geom_boxplot() + stat_summary(fun.y = "mean", geom = "point", color = "red") Guess about statistical significance We are looking to see if a difference exists in the mean income of the two levels of the explanatory variable. Based solely on the boxplot, we have reason to believe that no difference exists. The distributions of income seem similar and the means fall in roughly the same place. B.5.4 Non-traditional methods Collecting summary info We now compute the observed statistic: d_hat <- cle_sac %>% specify(income ~ metro_area) %>% calculate(stat = "diff in means", order = c("Sacramento_ CA", "Cleveland_ OH")) d_hat # A tibble: 1 x 1 stat <dbl> 1 4960. Randomization for hypothesis test In order to look to see if the observed sample mean for Sacramento of 27467.066 is statistically different than that for Cleveland of 32427.543, we need to account for the sample sizes. Note that this is the same as looking to see if \\(\\bar{x}_{sac} - \\bar{x}_{cle}\\) is statistically different than 0. We also need to determine a process that replicates how the original group sizes of 212 and 175 were selected. We can use the idea of randomization testing (also known as permutation testing) to simulate the population from which the sample came (with two groups of different sizes) and then generate samples using shuffling from that simulated population to account for sampling variability. set.seed(2018) null_distn_two_means <- cle_sac %>% specify(income ~ metro_area) %>% hypothesize(null = "independence") %>% generate(reps = 10000) %>% calculate(stat = "diff in means", order = c("Sacramento_ CA", "Cleveland_ OH")) null_distn_two_means %>% visualize() We can next use this distribution to observe our \\(p\\)-value. Recall this is a two-tailed test so we will be looking for values that are greater than or equal to 4960.477 or less than or equal to -4960.477 for our \\(p\\)-value. null_distn_two_means %>% visualize(obs_stat = d_hat, direction = "both") Calculate \\(p\\)-value pvalue <- null_distn_two_means %>% get_pvalue(obs_stat = d_hat, direction = "both") pvalue # A tibble: 1 x 1 p_value <dbl> 1 0.130 So our \\(p\\)-value is 0.13 and we fail to reject the null hypothesis at the 5% level. 
You can also see this from the histogram above that we are not very far into the tail of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\mu_{sac} - \\mu_{cle}\\) using our sample data with bootstrapping. Here we will bootstrap each of the groups with replacement instead of shuffling. This is done using the groups argument in the resample function to fix the size of each group to be the same as the original group sizes of 175 for Sacramento and 212 for Cleveland. boot_distn_two_means <- cle_sac %>% specify(income ~ metro_area) %>% generate(reps = 10000) %>% calculate(stat = "diff in means", order = c("Sacramento_ CA", "Cleveland_ OH")) ci <- boot_distn_two_means %>% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` <dbl> <dbl> 1 -1446. 11308. boot_distn_two_means %>% visualize(endpoints = ci, direction = "between") We see that 0 is contained in this confidence interval as a plausible value of \\(\\mu_{sac} - \\mu_{cle}\\) (the unknown population parameter). This matches with our hypothesis test results of failing to reject the null hypothesis. Since zero is a plausible value of the population parameter, we do not have evidence that Sacramento incomes are different than Cleveland incomes. Interpretation: We are 95% confident the true mean yearly income for those living in Sacramento is between 1445.53 dollars smaller to 11307.82 dollars higher than for Cleveland. Note: You could also use the null distribution based on randomization with a shift to have its center at \\(\\bar{x}_{sac} - \\bar{x}_{cle} = \\$4960.48\\) instead of at 0 and calculate its percentiles. The confidence interval produced via this method should be comparable to the one done using bootstrapping above. B.5.5 Traditional methods Check conditions Remember that in order to use the short-cut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: The observations are independent in both groups. This metro_area variable is met since the cases are randomly selected from each city. Approximately normal: The distribution of the response for each group should be normal or the sample sizes should be at least 30. ggplot(cle_sac, aes(x = income)) + geom_histogram(color = "white", binwidth = 20000) + facet_wrap(~ metro_area) We have some reason to doubt the normality assumption here since both the histograms show deviation from a normal model fitting the data well for each group. The sample sizes for each group are greater than 100 though so the assumptions should still apply. Independent samples: The samples should be collected without any natural pairing. There is no mention of there being a relationship between those selected in Cleveland and in Sacramento. B.5.6 Test statistic The test statistic is a random variable based on the sample data. Here, we are interested in seeing if our observed difference in sample means (\\(\\bar{x}_{sac, obs} - \\bar{x}_{cle, obs}\\) = 4960.477) is statistically different than 0. Assuming that conditions are met and the null hypothesis is true, we can use the \\(t\\) distribution to standardize the difference in sample means (\\(\\bar{X}_{sac} - \\bar{X}_{cle}\\)) using the approximate standard error of \\(\\bar{X}_{sac} - \\bar{X}_{cle}\\) (invoking \\(S_{sac}\\) and \\(S_{cle}\\) as estimates of unknown \\(\\sigma_{sac}\\) and \\(\\sigma_{cle}\\)). 
\\[ T =\\dfrac{ (\\bar{X}_1 - \\bar{X}_2) - 0}{ \\sqrt{\\dfrac{S_1^2}{n_1} + \\dfrac{S_2^2}{n_2}} } \\sim t (df = min(n_1 - 1, n_2 - 1)) \\] where 1 = Sacramento and 2 = Cleveland with \\(S_1^2\\) and \\(S_2^2\\) the sample variance of the incomes of both cities, respectively, and \\(n_1 = 175\\) for Sacramento and \\(n_2 = 212\\) for Cleveland. Observed test statistic Note that we could also do (ALMOST) this test directly using the t.test function. The x and y arguments are expected to both be numeric vectors here so we’ll need to appropriately filter our datasets. cle_sac %>% specify(income ~ metro_area) %>% calculate(stat = "t", order = c("Cleveland_ OH", "Sacramento_ CA")) # A tibble: 1 x 1 stat <dbl> 1 -1.50 We see here that the observed test statistic value is around -1.5. While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. B.5.7 Compute \\(p\\)-value The \\(p\\)-value—the probability of observing an \\(t_{174}\\) value of -1.501 or more extreme (in both directions) in our null distribution—is 0.13. This can also be calculated in R directly: 2 * pt(-1.501, df = min(212 - 1, 175 - 1), lower.tail = TRUE) [1] 0.135 We can also approximate by using the standard normal curve: 2 * pnorm(-1.501) [1] 0.133 Note that the 95 percent confidence interval given above matches well with the one calculated using bootstrapping. B.5.8 State conclusion We, therefore, do not have sufficient evidence to reject the null hypothesis. Our initial guess that a statistically significant difference not existing in the means was backed by this statistical analysis. We do not have evidence to suggest that the true mean income differs between Cleveland, OH and Sacramento, CA based on this data. B.5.9 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions also being met leads us to better guess that using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) will lead to similar results. B.6 Two means (paired samples) Problem statement Trace metals in drinking water affect the flavor and an unusually high concentration can pose a health hazard. Ten pairs of data were taken measuring zinc concentration in bottom water and surface water at 10 randomly selected locations on a stretch of river. Do the data suggest that the true average concentration in the surface water is smaller than that of bottom water? (Note that units are not given.) [Tweaked a bit from https://onlinecourses.science.psu.edu/stat500/node/51] B.6.1 Competing hypotheses In words Null hypothesis: The mean concentration in the bottom water is the same as that of the surface water at different paired locations. Alternative hypothesis: The mean concentration in the surface water is smaller than that of the bottom water at different paired locations. In symbols (with annotations) \\(H_0: \\mu_{diff} = 0\\), where \\(\\mu_{diff}\\) represents the mean difference in concentration for surface water minus bottom water. \\(H_A: \\mu_{diff} < 0\\) Set \\(\\alpha\\) It’s important to set the significance level before starting the testing using the data. Let’s set the significance level at 5% here. 
B.6.2 Exploring the sample data zinc_tidy <- read_csv("https://moderndive.com/data/zinc_tidy.csv") We want to look at the differences in surface - bottom for each location: zinc_diff <- zinc_tidy %>% group_by(loc_id) %>% summarize(pair_diff = diff(concentration)) %>% ungroup() Next we calculate the mean difference as our observed statistic: d_hat <- zinc_diff %>% specify(response = pair_diff) %>% calculate(stat = "mean") d_hat # A tibble: 1 x 1 stat <dbl> 1 -0.0804 The histogram below also shows the distribution of pair_diff. ggplot(zinc_diff, aes(x = pair_diff)) + geom_histogram(binwidth = 0.04, color = "white") Guess about statistical significance We are looking to see if the sample paired mean difference of -0.08 is statistically less than 0. They seem to be quite close, but we have a small number of pairs here. Let’s guess that we will fail to reject the null hypothesis. B.6.3 Non-traditional methods Bootstrapping for hypothesis test In order to look to see if the observed sample mean difference \\(\\bar{x}_{diff} = -0.0804\\) is statistically less than 0, we need to account for the number of pairs. We also need to determine a process that replicates how the paired data was selected, in a way similar to how we calculated our original sample mean difference. Treating the differences as our data of interest, we next use the process of bootstrapping to build other simulated samples and then calculate the mean of the bootstrap samples. We hypothesize that the mean difference is zero. This process is similar to the One Mean example seen above, but using the differences between the two groups as a single sample with a hypothesized mean difference of 0. set.seed(2018) null_distn_paired_means <- zinc_diff %>% specify(response = pair_diff) %>% hypothesize(null = "point", mu = 0) %>% generate(reps = 10000) %>% calculate(stat = "mean") null_distn_paired_means %>% visualize() We can next use this distribution to observe our \\(p\\)-value. Recall this is a left-tailed test so we will be looking for values that are less than or equal to -0.0804 for our \\(p\\)-value. null_distn_paired_means %>% visualize(obs_stat = d_hat, direction = "less") Calculate \\(p\\)-value pvalue <- null_distn_paired_means %>% get_pvalue(obs_stat = d_hat, direction = "less") pvalue # A tibble: 1 x 1 p_value <dbl> 1 0 So our \\(p\\)-value is essentially 0 and we reject the null hypothesis at the 5% level. You can also see from the histogram above that we are far into the left tail of the null distribution. Bootstrapping for confidence interval We can also create a confidence interval for the unknown population parameter \\(\\mu_{diff}\\) using our sample data (the calculated differences) with bootstrapping. This is similar to the bootstrapping done in the one sample mean case, except now our data are differences instead of raw numerical data. Note that this code is identical to the pipeline shown in the hypothesis test above except that the hypothesize() function is not called. boot_distn_paired_means <- zinc_diff %>% specify(response = pair_diff) %>% generate(reps = 10000) %>% calculate(stat = "mean") ci <- boot_distn_paired_means %>% get_ci() ci # A tibble: 1 x 2 `2.5%` `97.5%` <dbl> <dbl> 1 -0.112 -0.0503 boot_distn_paired_means %>% visualize(endpoints = ci, direction = "between") We see that 0 is not contained in this confidence interval as a plausible value of \\(\\mu_{diff}\\) (the unknown population parameter). This matches with our hypothesis test results of rejecting the null hypothesis.
Since zero is not a plausible value of the population parameter and since the entire confidence interval falls below zero, we have evidence that surface zinc concentration levels are lower, on average, than bottom level zinc concentrations. Interpretation: We are 95% confident the true mean zinc concentration on the surface is between 0.11 units smaller to 0.05 units smaller than on the bottom. B.6.4 Traditional methods Check conditions Remember that in order to use the shortcut (formula-based, theoretical) approach, we need to check that some conditions are met. Independent observations: The observations among pairs are independent. The locations are selected independently through random sampling so this condition is met. Approximately normal: The distribution of population of differences is normal or the number of pairs is at least 30. The histogram above does show some skew so we have reason to doubt the population being normal based on this sample. We also only have 10 pairs which is fewer than the 30 needed. A theory-based test may not be valid here. Test statistic The test statistic is a random variable based on the sample data. Here, we want to look at a way to estimate the population mean difference \\(\\mu_{diff}\\). A good guess is the sample mean difference \\(\\bar{X}_{diff}\\). Recall that this sample mean is actually a random variable that will vary as different samples are (theoretically, would be) collected. We are looking to see how likely is it for us to have observed a sample mean of \\(\\bar{x}_{diff, obs} = 0.0804\\) or larger assuming that the population mean difference is 0 (assuming the null hypothesis is true). If the conditions are met and assuming \\(H_0\\) is true, we can “standardize” this original test statistic of \\(\\bar{X}_{diff}\\) into a \\(T\\) statistic that follows a \\(t\\) distribution with degrees of freedom equal to \\(df = n - 1\\): \\[ T =\\dfrac{ \\bar{X}_{diff} - 0}{ S_{diff} / \\sqrt{n} } \\sim t (df = n - 1) \\] where \\(S\\) represents the standard deviation of the sample differences and \\(n\\) is the number of pairs. Observed test statistic While one could compute this observed test statistic by “hand”, the focus here is on the set-up of the problem and in understanding which formula for the test statistic applies. We can use the t_test function on the differences to perform this analysis for us. t_test_results <- zinc_diff %>% infer::t_test(formula = pair_diff ~ NULL, alternative = "less", mu = 0) t_test_results # A tibble: 1 x 6 statistic t_df p_value alternative lower_ci upper_ci <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 -4.86 9 0.000446 less -Inf -0.0501 We see here that the \\(t_{obs}\\) value is -4.864. Compute \\(p\\)-value The \\(p\\)-value—the probability of observing a \\(t_{obs}\\) value of -4.864 or less in our null distribution of a \\(t\\) with 9 degrees of freedom—is 0. This can also be calculated in R directly: pt(-4.8638, df = nrow(zinc_diff) - 1, lower.tail = TRUE) [1] 0.000446 State conclusion We, therefore, have sufficient evidence to reject the null hypothesis. Our initial guess that our observed sample mean difference was not statistically less than the hypothesized mean of 0 has been invalidated here. Based on this sample, we have evidence that the mean concentration in the bottom water is greater than that of the surface water at different paired locations. 
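The same analysis can also be run with base R’s t.test() applied to the differences as a one-sample test, which is equivalent to a paired t-test on the original measurements; a sketch:
# One-sample t-test on the paired differences (left-tailed)
t.test(zinc_diff$pair_diff, mu = 0, alternative = "less")
Its test statistic, degrees of freedom, and \\(p\\)-value should match the infer::t_test() output above.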
B.6.5 Comparing results Observing the bootstrap distribution and the null distribution that were created, it makes quite a bit of sense that the results are so similar for traditional and non-traditional methods in terms of the \\(p\\)-value and the confidence interval since these distributions look very similar to normal distributions. The conditions were not met since the number of pairs was small, but the sample data was not highly skewed. Using any of the methods whether they are traditional (formula-based) or non-traditional (computational-based) lead to similar results here. References "], ["C-appendixC.html", "C Reach for the Stars Needed packages C.1 Sorted barplots C.2 Interactive graphics", " C Reach for the Stars Needed packages library(dplyr) library(ggplot2) library(knitr) library(dygraphs) library(nycflights13) C.1 Sorted barplots Building upon the example in Section 3.8: flights_table <- table(flights$carrier) flights_table 9E AA AS B6 DL EV F9 FL HA MQ OO UA US 18460 32729 714 54635 48110 54173 685 3260 342 26397 32 58665 20536 VX WN YV 5162 12275 601 We can sort this table from highest to lowest counts by using the sort function: sorted_flights <- sort(flights_table, decreasing = TRUE) names(sorted_flights) [1] "UA" "B6" "EV" "DL" "AA" "MQ" "US" "9E" "WN" "VX" "FL" "AS" "F9" "YV" "HA" [16] "OO" It is often preferred for barplots to be ordered corresponding to the heights of the bars. This allows the reader to more easily compare the ordering of different airlines in terms of departed flights (Robbins 2013). We can also much more easily answer questions like “How many airlines have more departing flights than Southwest Airlines?”. We can use the sorted table giving the number of flights defined as sorted_flights to reorder the carrier. ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() + scale_x_discrete(limits = names(sorted_flights)) FIGURE C.1: Number of flights departing NYC in 2013 by airline - Descending numbers The last addition here specifies the values of the horizontal x axis on a discrete scale to correspond to those given by the entries of sorted_flights. C.2 Interactive graphics C.2.1 Interactive linegraphs Another useful tool for viewing linegraphs such as this is the dygraph function in the dygraphs package in combination with the dyRangeSelector function. This allows us to zoom in on a selected range and get an interactive plot for us to work with: library(dygraphs) flights_day <- mutate(flights, date = as.Date(time_hour)) flights_summarized <- flights_day %>% group_by(date) %>% summarize(median_arr_delay = median(arr_delay, na.rm = TRUE)) rownames(flights_summarized) <- flights_summarized$date flights_summarized <- select(flights_summarized, -date) dyRangeSelector(dygraph(flights_summarized)) The syntax here is a little different than what we have covered so far. The dygraph function is expecting for the dates to be given as the rownames of the object. We then remove the date variable from the flights_summarized data frame since it is accounted for in the rownames. Lastly, we run the dygraph function on the new data frame that only contains the median arrival delay as a column and then provide the ability to have a selector to zoom in on the interactive plot via dyRangeSelector. (Note that this plot will only be interactive in the HTML version of this book.) References "],