
Commit

Version 0.1.3
ismayc committed Feb 9, 2017
1 parent db04ba4 commit 1e05d4f
Showing 101 changed files with 554 additions and 1,318 deletions.
7 changes: 4 additions & 3 deletions 03-tidy.Rmd
@@ -21,9 +21,10 @@ knitr::opts_chunk$set(tidy = FALSE, out.width = '\\textwidth')

At the beginning of this and all subsequent chapters, we'll always have a list of packages you should have installed and loaded. In particular, we load the `nycflights13` package, which we'll discuss shortly, and the `dplyr` package for data manipulation, the subject of Chapter \@ref(manip).

-```{r warning=FALSE}
+```{r warning=FALSE, message=FALSE}
library(nycflights13)
library(dplyr)
+library(tibble)
```

<!--Subsection on Tidy Data -->
@@ -311,11 +312,11 @@ kable(data_frame("role" = role, `Sociology?` = sociology,
***
***

-```{block tidy_rc, type='review'}
+```{block, type='review'}
**_Review questions_**
```

-**`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
+Review questions have been designed using the `fivethirtyeight` R package [@R-fivethirtyeight] with links to the corresponding FiveThirtyEight.com articles in our free DataCamp course **Effective Data Storytelling using the `tidyverse`**. The material in this chapter is covered in the **Tidy Data** chapter of the DataCamp course available [here](https://campus.datacamp.com/courses/effective-data-storytelling-using-the-tidyverse/tidy-data).

***
***
19 changes: 19 additions & 0 deletions 04-viz.Rmd
@@ -704,6 +704,25 @@ knitr::purl("04-viz.Rmd", "docs/scripts/04-viz.R")

An R script file of all R code used in this chapter is available [here](http://ismayc.github.io/moderndiver-book/scripts/04-viz.R).

+***
+***
+
+```{block, type='review', purl=FALSE}
+**_Review questions_**
+```
+
+Review questions have been designed using the `fivethirtyeight` R package [@R-fivethirtyeight] with links to the corresponding FiveThirtyEight.com articles in our free DataCamp course **Effective Data Storytelling using the `tidyverse`**. The material in this chapter is covered in the chapters of the DataCamp course available below:
+
+- [Scatter-plots & Line-graphs](https://campus.datacamp.com/courses/effective-data-storytelling-using-the-tidyverse/scatter-plots-line-graphs)
+
+- [Histograms & Boxplots](https://campus.datacamp.com/courses/effective-data-storytelling-using-the-tidyverse/histograms-boxplots)
+
+- [Barplots](https://campus.datacamp.com/courses/effective-data-storytelling-using-the-tidyverse/barplots)
+
+- A **ggplot2 Review** DataCamp course is currently in development.
+
+***
+***

### What's to come?

14 changes: 7 additions & 7 deletions 06-sim.Rmd
@@ -269,10 +269,10 @@ library(knitr)

### Repeated sampling via `do`

-We have looked at two random samples above, but using `mosaic` we can repeat this process over and over again with the `do` function. Below, we repeat this sampling process 10,000 times. We can then plot the different values of the sample means to get a sense for what a reasonable range of values for the population parameter mean `height` is in the `profiles_subset` data frame.
+We have looked at two random samples above, but using `mosaic` we can repeat this process over and over again with the `do` function. Below, we repeat this sampling process 5000 times. We can then plot the different values of the sample means to get a sense for what a reasonable range of values for the population parameter mean `height` is in the `profiles_subset` data frame.

```{r do-first, cache=TRUE}
-sample_means <- do(10000) *
+sample_means <- do(5000) *
  (profiles_subset %>% resample(size = 100, replace = FALSE) %>%
     summarize(mean_height = mean(height)))
ggplot(data = sample_means, mapping = aes(x = mean_height)) +
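  # (The rest of this chunk is elided by the diff view; one plausible
  #  continuation, drawing the simulated sampling distribution as a histogram,
  #  following the style used in the bootstrap chunk later in this commit:)
  geom_histogram(bins = 30, color = "white")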
@@ -381,7 +381,7 @@ We need to think about this problem from the standpoint of hypothesis testing.

***

-Let's begin with an experiment. I will flip a coin 10 times. Your job is to try to predict the sequence of my 10 flips. Write down 10 H's and T's corresponding to your predictions. We could compare your guesses with my actual flips and then we will note how many correct guesses you have.
+Let's begin with an experiment. I will flip a coin 10 times. Your job is to try to predict the sequence of my 10 flips. Write down 10 H's and T's corresponding to your predictions. We could compare your guesses with my actual flips and then we would note how many correct guesses you have.

You may be asking yourself how this models a way to test whether the person was just guessing or not. All we are trying to do is see how likely it is to have 9 matches out of 10 if the person was truly guessing. When we say "truly guessing" we are assuming that we have a 50/50 chance of guessing correctly. This can be modeled using a coin flip and then seeing whether we guessed correctly for each of the coin flips. If we guessed correctly, we can think of that as a "success."

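This modeling step can be sketched directly with the `mosaic` package (a minimal sketch, assuming `mosaic` is loaded as elsewhere in this chapter):

```{r}
library(mosaic)
# One simulated "student": 10 fair coin flips, where heads = a correct guess
rflip(10)
```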
@@ -401,21 +401,21 @@ do(13) * rflip(10)

We've now done a simulation of what actually happened when you flipped a coin ten times. We have 13 different simulations of flipping a coin 10 times. Note here that `heads` now corresponds to the number of correct guesses and `tails` corresponds to the number of incorrect guesses. (This can be tricky to understand at first since we've switched what "heads" and "tails" mean.)

-If you look at the output above for our simulation of 13 student guesses, we can begin to get a sense for what an "expected" sample proportion of successes may be. Around five out of 10 seems to be the most likely value. What does this say about what we actually observed with a success rate of 9/10? To better answer this question, we can simulate 10,000 student guesses and then look at the distribution of the simulated sample proportion of successes, also known as the **null distribution**.
+If you look at the output above for our simulation of 13 student guesses, we can begin to get a sense for what an "expected" sample proportion of successes may be. Around five out of 10 seems to be the most likely value. What does this say about what we actually observed with a success rate of 9/10? To better answer this question, we can simulate 5000 student guesses and then look at the distribution of the simulated sample proportion of successes, also known as the **null distribution**.

```{r}
library(dplyr)
-simGuesses <- do(10000) * rflip(10)
+simGuesses <- do(5000) * rflip(10)
simGuesses %>%
  group_by(heads) %>%
  summarize(count = n())
```

-We can see here that we have created a count of how many of each of the 10,000 sets of 10 flips resulted in 0, 1, 2, $\ldots$, up to 10 heads. Note the use of the `group_by` and `summarize` functions from Chapter \@ref(manip) here.
+We can see here that we have created a count of how many of each of the 5000 sets of 10 flips resulted in 0, 1, 2, $\ldots$, up to 10 heads. Note the use of the `group_by` and `summarize` functions from Chapter \@ref(manip) here.

In addition, we can plot the distribution of these simulated `heads` using the ideas from Chapter \@ref(viz). `heads` is a quantitative variable. Think about which type of plot is most appropriate here before reading further.

-We already have an idea of an appropriate plot from the data summarization in the chunk above. We'd like to see how many heads occurred in the 10,000 sets of 10 flips. In particular, we'd like to see how frequently 9 or more heads occurred in the 10 flips:
+We already have an idea of an appropriate plot from the data summarization in the chunk above. We'd like to see how many heads occurred in the 5000 sets of 10 flips. In particular, we'd like to see how frequently 9 or more heads occurred in the 10 flips:

```{r fig.cap="Histogram of number of heads in simulation - needs tweaking"}
library(ggplot2)
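# (The rest of this chunk is elided by the diff view; a plausible continuation,
#  assuming the simGuesses data frame from above, plots the number of heads:)
ggplot(data = simGuesses, mapping = aes(x = heads)) +
  geom_histogram(binwidth = 1, color = "white")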
14 changes: 7 additions & 7 deletions 07-hypo.Rmd
@@ -259,10 +259,10 @@ $t_1 = `r t1`$, $t_2 = `r t2`$, $t_3 = `r t3`$

### Distribution of $\delta$ under $H_0$

-We could continue this process, say, 10,000 times by flipping a coin in sets of 10 for 10,000 repetitions and noting how many heads out of 10 we have for each set. It's at this point that you surely realize that a computer can do this procedure much faster
+We could continue this process, say, 5000 times by flipping a coin in sets of 10 for 5000 repetitions and noting how many heads out of 10 we have for each set. It's at this point that you surely realize that a computer can do this procedure much faster
and more efficiently than the tactile experiment with a coin.

-Recall that we've already created the distribution of 10,000 such coin flips and we've stored these values in the `heads` variable in the `simGuesses` data frame:
+Recall that we've already created the distribution of 5000 such coin flips and we've stored these values in the `heads` variable in the `simGuesses` data frame:

```{r}
library(ggplot2)
@@ -300,7 +300,7 @@ Let's walk through each step of this calculation:

4. Now that we have changed the focus to only those rows that have number of heads out of 10 flips corresponding to 9 or more, we count how many of those there are. The function `nrow` gives how many entries are in this filtered data frame and lastly we calculate the proportion that are at least as extreme as our observed value of 9 by dividing by the number of total simulations (`r format(nrow(simGuesses), big.mark = ",")`).

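To make this step concrete, here is a minimal sketch of the full calculation, which is elided in this diff view; it assumes the `simGuesses` data frame built above:

```{r}
library(dplyr)
# Keep only the simulations at least as extreme as the observed 9 heads
extreme_count <- simGuesses %>%
  filter(heads >= 9) %>%
  nrow()
# Simulated p-value: proportion of all simulations at least that extreme
extreme_count / nrow(simGuesses)
```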
-We can see that the observed statistic of 9 correct guesses is not a likely outcome assuming the null hypothesis is true. Only around 1% of the outcomes in our 10,000 simulations fall at or above 9 successes. We have evidence supporting the conclusion that the person is actually better than just guessing at random at determining whether milk has been added first or not. To better visualize this we can also make use of blue shading on the histogram corresponding to the $p$-value:
+We can see that the observed statistic of 9 correct guesses is not a likely outcome assuming the null hypothesis is true. Only around 1% of the outcomes in our 5000 simulations fall at or above 9 successes. We have evidence supporting the conclusion that the person is actually better than just guessing at random at determining whether milk has been added first or not. To better visualize this we can also make use of blue shading on the histogram corresponding to the $p$-value:

```{r fig.cap="Barplot of heads with p-value highlighted"}
library(ggplot2)
@@ -309,7 +309,7 @@ library(ggplot2)
  labs(x = "heads")
```

-This helps us better see just how few of the values of `heads` are at our observed value or more extreme. This idea of a $p$-value can be extended to the methods based on the normal and $t$ distributions in the way introductory statistics has traditionally been presented. Those methods were used because statisticians haven't always been able to do 10,000 simulations on the computer within seconds. We'll elaborate on this more in a few sections.
+This helps us better see just how few of the values of `heads` are at our observed value or more extreme. This idea of a $p$-value can be extended to the methods based on the normal and $t$ distributions in the way introductory statistics has traditionally been presented. Those methods were used because statisticians haven't always been able to do 5000 simulations on the computer within seconds. We'll elaborate on this more in a few sections.

***
```{block lc6-2, type='learncheck', purl=FALSE}
@@ -535,15 +535,15 @@ The only new command here is `shuffle` from the `mosaic` package, which does wha

```{r cache=TRUE}
set.seed(2016)
-many_shuffles <- do(10000) *
+many_shuffles <- do(5000) *
  (movies_trimmed %>%
     mutate(rating = shuffle(rating)) %>%
     group_by(genre) %>%
     summarize(mean = mean(rating))
  )
```

-It is a good idea here to `View` the `many_shuffles` data frame via `View(many_shuffles)`. We need to figure out a way to subtract the first value of `mean` from the second value of `mean` for each of the 10,000 simulations. This is a little tricky but the `group_by` function comes to our rescue here:
+It is a good idea here to `View` the `many_shuffles` data frame via `View(many_shuffles)`. We need to figure out a way to subtract the first value of `mean` from the second value of `mean` for each of the 5000 simulations. This is a little tricky but the `group_by` function comes to our rescue here:

```{r}
rand_distn <- many_shuffles %>%
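  # (Continuation elided by the diff view; one plausible way to compute the
  #  difference in genre means within each shuffle, assuming the .index
  #  column that mosaic's do() adds to identify each repetition:)
  group_by(.index) %>%
  summarize(diffmean = diff(mean))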
@@ -623,7 +623,7 @@ we fail to reject $H_0$. (If no significance level is given, one can assume $\a

As a point of reference, we will now discuss the traditional theory-based way to conduct the hypothesis test for determining if there is a statistically significant difference in the sample mean rating of Action movies versus Romance movies. This method and ones like it work very well when the assumptions needed to run the test are met. They are based on probability models and distributions such as the normal and $t$-distributions.

-These traditional methods have been used for many decades, dating back to the time when researchers didn't have access to computers that could run 10,000 simulations in under a minute. They had to base their methods on probability theory instead. Many fields and researchers continue to use these methods and that is the biggest reason for their inclusion here. It's important to remember that a $t$-test or a $z$-test is really just an approximation of what you have seen in this chapter already using simulation and randomization. The focus here is on understanding how the shape of the $t$-curve comes about without digging deep into the mathematical underpinnings.
+These traditional methods have been used for many decades, dating back to the time when researchers didn't have access to computers that could run 5000 simulations in under a minute. They had to base their methods on probability theory instead. Many fields and researchers continue to use these methods and that is the biggest reason for their inclusion here. It's important to remember that a $t$-test or a $z$-test is really just an approximation of what you have seen in this chapter already using simulation and randomization. The focus here is on understanding how the shape of the $t$-curve comes about without digging deep into the mathematical underpinnings.

### EXAMPLE: $t$-test for two independent samples

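The body of this example is elided in the diff view; as a hedged sketch (not necessarily the book's own code), the theory-based analogue of the randomization test above could use base R's `t.test` on the `movies_trimmed` data frame from earlier:

```{r}
# Welch two-sample t-test: is the mean rating different between the genres?
# Assumes movies_trimmed has columns `genre` (two levels) and `rating`.
t.test(rating ~ genre, data = movies_trimmed)
```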
2 changes: 1 addition & 1 deletion 08-ci.Rmd
@@ -149,7 +149,7 @@ You should see some variability begin to tease its way out here. Many of the si
So what's the next step? Just as we repeated the process thousands of times with the "Lady Tasting Tea" example, we can do a similar thing here:

```{r cache=TRUE, fig.cap="Bootstrapped means histogram"}
-trials <- do(10000) * summarize(resample(movies_sample),
+trials <- do(5000) * summarize(resample(movies_sample),
mean = mean(rating))
ggplot(data = trials, mapping = aes(x = mean)) +
geom_histogram(bins = 30, color = "white")
6 changes: 5 additions & 1 deletion NEWS.md
@@ -1,8 +1,12 @@
-# ModernDive 0.1.2.9000
+# ModernDive 0.1.3.9000

+# ModernDive 0.1.3

* Attempting to fix Shiny app in Figure 6.2 appearing as white box in published site noted [here](https://github.com/ismayc/moderndiver-book/issues/2)
+* Reverted to using screenshot with link instead
* Updated link to `dplyr` [cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf) and `ggplot2` [cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf)
* Began adding DataCamp chapters as Review Questions to the end of Chapters 3 and 4 (More to come)
* Updated link to MailChimp

# ModernDive 0.1.2

2 changes: 1 addition & 1 deletion _bookdown.yml
@@ -1,3 +1,3 @@
book_filename: "ismaykim"
output_dir: "docs-devel"
output_dir: "docs"
#chapter_name: "Chapter "
24 changes: 15 additions & 9 deletions bib/packages.bib
@@ -42,12 +42,11 @@ @Manual{R-dygraphs
url = {https://CRAN.R-project.org/package=dygraphs},
}
@Manual{R-fivethirtyeight,
-  title = {fivethirtyeight: Data and Code Behind the Stories and Interactives at
-'FiveThirtyEight'},
+  title = {fivethirtyeight: Data and Code Behind the Stories and Interactives at 'FiveThirtyEight'},
author = {Chester Ismay and Jennifer Chunn},
-  year = {2017},
-  note = {R package version 0.1.0},
-  url = {https://CRAN.R-project.org/package=fivethirtyeight},
+  note = {R package version 0.1.0.9000},
+  url = {https://github.com/rudeboybert/fivethirtyeight},
+  year = {2016},
}
@Manual{R-ggplot2,
title = {ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics},
@@ -80,8 +79,8 @@ @Manual{R-lattice
@Manual{R-Matrix,
title = {Matrix: Sparse and Dense Matrix Classes and Methods},
author = {Douglas Bates and Martin Maechler},
-  year = {2016},
-  note = {R package version 1.2-7.1},
+  year = {2017},
+  note = {R package version 1.2-8},
url = {https://CRAN.R-project.org/package=Matrix},
}
@Manual{R-mosaic,
@@ -108,8 +107,8 @@ @Manual{R-mvtnorm
@Manual{R-nycflights13,
title = {nycflights13: Flights that Departed NYC in 2013},
author = {Hadley Wickham},
-  year = {2016},
-  note = {R package version 0.2.1},
+  year = {2017},
+  note = {R package version 0.2.2},
url = {https://CRAN.R-project.org/package=nycflights13},
}
@Manual{R-okcupiddata,
Expand All @@ -134,6 +133,13 @@ @Manual{R-rmarkdown
note = {R package version 1.3},
url = {https://CRAN.R-project.org/package=rmarkdown},
}
+@Manual{R-tibble,
+  title = {tibble: Simple Data Frames},
+  author = {Hadley Wickham and Romain Francois and Kirill Müller},
+  year = {2016},
+  note = {R package version 1.2},
+  url = {https://CRAN.R-project.org/package=tibble},
+}
@Manual{R-tufte,
title = {tufte: Tufte's Styles for R Markdown Documents},
author = {Yihui Xie and JJ Allaire},
2 changes: 1 addition & 1 deletion docs/10-effective-data-storytelling.html
@@ -26,7 +26,7 @@
<meta name="author" content="Chester Ismay and Albert Y. Kim">


<meta name="date" content="2017-01-22">
<meta name="date" content="2017-02-09">

<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="apple-mobile-web-app-capable" content="yes">
2 changes: 1 addition & 1 deletion docs/2-intro.html
@@ -26,7 +26,7 @@
<meta name="author" content="Chester Ismay and Albert Y. Kim">


<meta name="date" content="2017-01-22">
<meta name="date" content="2017-02-09">

<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="apple-mobile-web-app-capable" content="yes">
