Merge pull request #54 from moderndive/final-edits
Final edits
Chester Ismay authored Jul 22, 2018
2 parents d3d71fe + 015ae3d commit 84b6d0d
Showing 159 changed files with 11,465 additions and 1,402 deletions.
4 changes: 2 additions & 2 deletions 02-getting-started.Rmd
@@ -126,7 +126,7 @@ Learning to code/program is very much like learning a foreign language, it can b

* **Computers are stupid**: You have to tell a computer everything it needs to do. Furthermore, your instructions can't have any mistakes in them, nor can they be ambiguous in any way.
* **Take the "copy/paste/tweak" approach**: Especially when learning your first programming language, it is often much easier to taking existing code that you know works and modify it to suit your ends, rather than trying to write new code from scratch. We call this the *copy/paste/tweak* approach. So early on, we suggest not trying to code from scratch, but please take the code we provide throughout this book and play around with it!
* **Practice is key**: Just as the only solution to improving your foreign language skills is practice, so also the only way to get better at R is through pracitice. Don't worry however, we'll give you plenty of opportunities to practice!
* **Practice is key**: Just as the only solution to improving your foreign language skills is practice, so also the only way to get better at R is through practice. Don't worry however, we'll give you plenty of opportunities to practice!



@@ -144,7 +144,7 @@ Another point of confusion with new R users is the notion of a package. R packag
There are two key things to remember about R packages:

1. *Installation*: Most packages are not installed by default when you install R and RStudio. You need to install a package before you can use it. Once you've installed it, you likely don't need to install it again unless you want to update it to a newer version of the package.
1. *Loading*: Packages are not loaded automatically when you open RStudio. You need to load them everytime you open RStudio using the `library()` command.
1. *Loading*: Packages are not loaded automatically when you open RStudio. You need to load them every time you open RStudio using the `library()` command.
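
As a minimal sketch of the difference between these two steps, the R code below installs a package only if it is missing and then loads it. The helper name `load_or_install()` is our own illustration, not something from the book or from any package, and `stats` is used only because it ships with R, so nothing gets downloaded.

```r
# Hypothetical helper for illustration; not part of the book or of any package.
load_or_install <- function(pkg) {
  # Installation: only needed once per machine (requires an internet connection).
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Loading: needed in every new R session.
  library(pkg, character.only = TRUE)
  # Report whether the package is now attached to the session.
  invisible(pkg %in% .packages())
}

# "stats" ships with R, so this call loads it without downloading anything.
loaded <- load_or_install("stats")
loaded
```

In everyday use you would more likely run `install.packages("ggplot2")` once in the console and then put `library(ggplot2)` at the top of each script.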

A good analogy for R packages is they are like apps you can download onto a mobile phone:

16 changes: 8 additions & 8 deletions 03-visualization.Rmd
@@ -67,7 +67,7 @@ library(readr)

### DataCamp {-}

Our approach to introducing data visualization via the Grammar of Graphics and the `ggplot2` package is very similar to the approach taken in [David Robinson's](https://twitter.com/drob) DataCamp course "Introduction to the Tidyverse," a course targetted at people new to R and the tidyverse. If you're interested in complementing your learning below in an interactive online environment, click on the image below to access the course. The relevant chapters are Chapter 2 on "Data visualization" and Chapter 4 on "Types of visualizations".
Our approach to introducing data visualization via the Grammar of Graphics and the `ggplot2` package is very similar to the approach taken in [David Robinson's](https://twitter.com/drob) DataCamp course "Introduction to the Tidyverse," a course targeted at people new to R and the tidyverse. If you're interested in complementing your learning below in an interactive online environment, click on the image below to access the course. The relevant chapters of the course are Chapter 2 on "Data visualization" and Chapter 4 on "Types of visualizations".

<center>
<a target="_blank" class="page-link" href="https://www.datacamp.com/courses/introduction-to-the-tidyverse"><img src="images/datacamp_intro_to_tidyverse.png" alt="Drawing" style="height: 150px;"/></a>
@@ -266,7 +266,7 @@ ggplot(data = all_alaska_flights, mapping = aes(x = dep_delay, y = arr_delay)) +
geom_point()
```

In Figure \@ref(fig:noalpha) we see that a positive relationship exists between `dep_delay` and `arr_delay`: as departure delays increase, arrival delays tend to also increase. We also note that the majority of points fall near the point (0, 0). There is a large mass of points clustered there. Furthermore after executing this code, R returns a warning message alerting us to the fact that 5 rows were ignored due to mising values. For 5 rows either the value for `dep_delay` or `arr_delay` or both were missing, and thus these rows were ignored in our plot.
In Figure \@ref(fig:noalpha) we see that a positive relationship exists between `dep_delay` and `arr_delay`: as departure delays increase, arrival delays tend to also increase. We also note that the majority of points fall near the point (0, 0). There is a large mass of points clustered there. Furthermore after executing this code, R returns a warning message alerting us to the fact that 5 rows were ignored due to missing values. For 5 rows either the value for `dep_delay` or `arr_delay` or both were missing, and thus these rows were ignored in our plot.

Let's go back to the `ggplot()` function call that created this visualization, keeping in mind our discussion in Section \@ref(grammarofgraphics):

@@ -750,11 +750,11 @@ weather %>%

What the boxplot does is summarize the `r weather %>% filter(month == 11) %>% nrow()` points for you, in particular:

1. 25% of points (about 534 observations) fall below the bottom edge of the box which is the first quartile of `r quartiles[1] %>% round(3)` degrees Farenheit (2.2 degrees Celsius). In other words 25% of observations were colder than `r quartiles[1] %>% round(3)` degrees Farenheit.
1. 25% of points fall between the bottom edge of the box and the solid middle line which is the median of `r quartiles[2] %>% round(3)` degrees Farenheit (7.8 degrees Celsius). In other words 25% of observations were between `r quartiles[1] %>% round(3)` and `r quartiles[2] %>% round(3)` degrees Farenheit.
1. 25% of points fall between the solid middle line and the top edge of the box which is the third quartile of `r quartiles[3] %>% round(3)` degrees Farenheit (11.1 degrees Celsius). In other words 25% of observations were between `r quartiles[2] %>% round(3)` and `r quartiles[3] %>% round(3)` degrees Farenheit.
1. 25% of points fall over the top edge of the box. In other words 25% of observations were warmer than `r quartiles[3] %>% round(3)` degrees Farenheit.
1. The middle 50% of points lie within the interquartile range `r (quartiles[3] - quartiles[1]) %>% round(3)` degrees Farenheit.
1. 25% of points (about 534 observations) fall below the bottom edge of the box which is the first quartile of `r quartiles[1] %>% round(3)` degrees Fahrenheit (2.2 degrees Celsius). In other words 25% of observations were colder than `r quartiles[1] %>% round(3)` degrees Fahrenheit.
1. 25% of points fall between the bottom edge of the box and the solid middle line which is the median of `r quartiles[2] %>% round(3)` degrees Fahrenheit (7.8 degrees Celsius). In other words 25% of observations were between `r quartiles[1] %>% round(3)` and `r quartiles[2] %>% round(3)` degrees Fahrenheit.
1. 25% of points fall between the solid middle line and the top edge of the box which is the third quartile of `r quartiles[3] %>% round(3)` degrees Fahrenheit (11.1 degrees Celsius). In other words 25% of observations were between `r quartiles[2] %>% round(3)` and `r quartiles[3] %>% round(3)` degrees Fahrenheit.
1. 25% of points fall over the top edge of the box. In other words 25% of observations were warmer than `r quartiles[3] %>% round(3)` degrees Fahrenheit.
1. The middle 50% of points lie within the interquartile range `r (quartiles[3] - quartiles[1]) %>% round(3)` degrees Fahrenheit.

```{block lc-boxplot, type='learncheck', purl=FALSE}
**_Learning check_**
@@ -1139,7 +1139,7 @@ Barplots are the preferred way of displaying categorical variables. They are ea

### Putting it all together

Let's recap all five of the Five Named Graphs (5NG) in Table \@ref(tab:viz-summary-table) summarizing their differences. Using these 5NG, you'll be able to visualize the distributions and relationships of variables contained in a wide array of datasets. This will be even more the case as we start to map more variables to more of each `geom`etic object's `aes`thetic attribute options, further unlocking the awesome power of the `ggplot2` package.
Let's recap all five of the Five Named Graphs (5NG) in Table \@ref(tab:viz-summary-table) summarizing their differences. Using these 5NG, you'll be able to visualize the distributions and relationships of variables contained in a wide array of datasets. This will be even more the case as we start to map more variables to more of each `geom`etric object's `aes`thetic attribute options, further unlocking the awesome power of the `ggplot2` package.

```{r viz-summary-table, echo=FALSE, message=FALSE}
# Original at https://docs.google.com/spreadsheets/d/1vzqlFiT6qm5wzy_L_0nL7EWAd6jiUZmLSCFhDhztDSg/edit#gid=0
16 changes: 6 additions & 10 deletions 04-tidy.Rmd
@@ -241,7 +241,7 @@ glimpse(airports)

The variables `faa` and `name` are what we will call *identification variables*: variables that uniquely identify each observational unit. They are mainly used to provide a unique name to each observational unit, thereby allowing us to uniquely identify them. `faa` gives the unique code provided by the FAA for that airport, while the `name` variable gives the longer more natural name of the airport. The remaining variables (`lat`, `lon`, `alt`, `tz`, `dst`, `tzone`) are often called *measurement* or *characteristic* variables: variables that describe properties of each observational unit, in other words each observation in each row. For example, `lat` and `long` describe the latitude and longitude of each airport.

So in our above example of a spreadsheet of all students enrolled at a university, email address could be treated as an identifical variable since it uniquely identifies each observational unit i.e. each student, while date of birth could not since it is possible (and highly probable) that two students share the same birthday.
So in our above example of a spreadsheet of all students enrolled at a university, email address could be treated as an identical variable since it uniquely identifies each observational unit i.e. each student, while date of birth could not since it is possible (and highly probable) that two students share the same birthday.

Furthermore, sometimes a single variable might not be enough to uniquely identify each observational unit: combinations of variables might be needed (see Learning Check below). While it is not an absolute rule, for organizational purposes it is considered good practice to have your identification variables in the far left-most columns of your data frame.

@@ -287,18 +287,14 @@ We'll cover two methods for importing data in R: one using the R console and the

### Method 1: From the console

First, let's download a *Comma Separated Values* (CSV) file of ratings of the level of democracy in different countries spanning 1952 to 1992: <http://ismayc.github.io/dem_score.csv>. We use the `read_csv()` function from the `readr` package to read it off the web and then take a look.
First, let's download a *Comma Separated Values* (CSV) file of ratings of the level of democracy in different countries spanning 1952 to 1992: <https://moderndive.com/data/dem_score.csv>. We use the `read_csv()` function from the `readr` package to read it off the web and then take a look.

```{r message=FALSE, eval=FALSE}
library(readr)
dem_score <- read_csv("http://ismayc.github.io/dem_score.csv")
dem_score <- read_csv("https://moderndive.com/data/dem_score.csv")
dem_score
```
```{r message=FALSE, echo=FALSE}
if(!file.exists("data/dem_score.csv")){
download.file(url = "http://ismayc.github.io/dem_score.csv",
destfile = "data/dem_score.csv")
}
dem_score <- read_csv("data/dem_score.csv")
dem_score
```
@@ -307,7 +303,7 @@ In this `dem_score` data frame, the minimum value of -10 corresponds to a highly

### Method 2: Using RStudio's interface

Let's read in the same data saved in Excel format this time at <http://ismayc.github.io/dem_score.xlsx>, but using RStudio's graphical interface instead of via the R console. First download the Excel file, then go to the Files pane of RStudio -> Navigate to the directory where your downloaded `dem_score.xlsx` is saved -> Click on `dem_score.xlsx` -> Click "Import Dataset..." -> Click "Import Dataset..." At this point you should see an image like in
Let's read in the same data saved in Excel format this time at <https://moderndive.com/data/dem_score.xlsx>, but using RStudio's graphical interface instead of via the R console. First download the Excel file, then go to the Files pane of RStudio -> Navigate to the directory where your downloaded `dem_score.xlsx` is saved -> Click on `dem_score.xlsx` -> Click "Import Dataset..." -> Click "Import Dataset..." At this point you should see an image like in

![](images/read_excel.png)

@@ -386,7 +382,7 @@ We'll see in Chapter \@ref(wrangling) how we could use the `mutate()` function t
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Convert the `dem_score` data frame into
a tidy data frame and assign the name of `dem_score_tidy` to the resulting long-formatted data frame.

**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Read in the life expectancy data stored at http://ismayc.github.io/le_mess.csv and convert it to a tidy data frame.
**`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`** Read in the life expectancy data stored at <https://moderndive.com/data/le_mess.csv> and convert it to a tidy data frame.


```{asis lc4-3solutions, include=show_solutions('4-3')}
@@ -408,7 +404,7 @@ dem_score_tidy
**`r paste0("(LC", chap, ".", (lc - 1), ")")`** The code is similar
```
```{r lc4-3solutions-6, include=show_solutions('4-3'), echo=show_solutions('4-3'), message=FALSE, warning=FALSE}
life_expectancy <- read_csv('http://ismayc.github.io/le_mess.csv')
life_expectancy <- read_csv('https://moderndive.com/data/le_mess.csv')
life_expectancy_tidy <- gather(data = life_expectancy, key = year, value = life_expectancy, -country)
```
```{asis lc4-3solutions-7, include=show_solutions('4-3')}