diff --git a/docs/index.html b/docs/index.html
index 78506b504..a00ff6721 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -417,12 +417,12 @@

1.1 Principles of this Book

  • Ultimately the best textbook is one you’ve written yourself
  • +
    @@ -850,7 +850,7 @@

    Colophon

    Book was last updated:

    -
    ## [1] "By Chester on Saturday, January 07, 2017 09:14:56 EST"
    +
    ## [1] "By Chester on Saturday, January 07, 2017 11:29:21 EST"
diff --git a/docs/ismaykim.pdf b/docs/ismaykim.pdf
index 9e97a54a1..12be10103 100644
Binary files a/docs/ismaykim.pdf and b/docs/ismaykim.pdf differ
diff --git a/docs/ismaykim.tex b/docs/ismaykim.tex
index 00f2ff2a5..dd28a7242 100644
--- a/docs/ismaykim.tex
+++ b/docs/ismaykim.tex
@@ -289,19 +289,19 @@ \section{Principles of this Book}\label{principles-of-this-book}
 \item We encourage use of R Markdown to foster notions of reproducible research.
- \end{itemize}
-\item
- \textbf{Ultimately the best textbook is one you've written yourself}
-
- \begin{itemize}
- \tightlist
- \item
- You best know your audience, their background, and their priorities
- and you know best your own style and types of examples and problems
- you like best. Customizability is the ultimate end.
 \item
- A new paradigm for textbooks? Versions, not editions? Pull requests,
- crowd-sourcing, and development versions?
+ \textbf{Ultimately the best textbook is one you've written yourself}
+
+ \begin{itemize}
+ \tightlist
+ \item
+ You best know your audience, their background, and their
+ priorities and you know best your own style and types of examples
+ and problems you like best. Customizability is the ultimate end.
+ \item
+ A new paradigm for textbooks? Versions, not editions? Pull
+ requests, crowd-sourcing, and development versions?
+ \end{itemize}
 \end{itemize}
 \end{enumerate}
@@ -452,7 +452,7 @@ \section*{Colophon}\label{colophon}
 \textbf{Book was last updated:}
 \begin{verbatim}
-## [1] "By Chester on Saturday, January 07, 2017 09:13:12 EST"
+## [1] "By Chester on Saturday, January 07, 2017 11:27:14 EST"
 \end{verbatim}
 \chapter{Introduction}\label{intro}
diff --git a/docs/ismaykim_files/figure-html/jitter-1.png b/docs/ismaykim_files/figure-html/jitter-1.png
index 8aa31581a..d37e3161d 100644
Binary files a/docs/ismaykim_files/figure-html/jitter-1.png and b/docs/ismaykim_files/figure-html/jitter-1.png differ
diff --git a/docs/search_index.json b/docs/search_index.json
index 1e35ec8e7..f92fd0343 100644
--- a/docs/search_index.json
+++ b/docs/search_index.json
@@ -1,5 +1,5 @@
 [
-["index.html", "ModernDive 1 Preamble 1.1 Principles of this Book 1.2 Contribute 1.3 Getting Started Colophon", " ModernDive An Introduction to Statistical and Data Sciences via R Chester Ismay and Albert Y. Kim 2017-01-07 1 Preamble 1.1 Principles of this Book These are some principles we keep in mind. If you agree with them, this might be the book for you. Blur the lines between lecture and lab Laptops and open source software are rendering the lab/lecture dichotomy ever more archaic. It’s much harder for students to understand the importance of using the software if they only use it once a week or less. They forget the syntax in much the same way someone learning a foreign language forgets the rules. Focus on the entire data/science research pipeline Grolemund and Wickham’s graphic George Cobb argued for “Minimizing prerequisites to research” It’s all about data, data, data We leverage R packages for rich/complex yet easy-to-load data sets. We’ve heard it before: “You can’t teach ggplot2 for data visualization in intro stats!” We, like David Robinson, are more optimistic and we’ve had success doing so. dplyr is a game changer for data manipulation: the verb describing your desired data action is the command name! 
Use simulation/resampling for intro stats, not probability/large sample approximation Reinforce concepts, not equations, formulas, and probability tables. To this end, we’re big fans of the mosaic package’s shuffle(), resample(), and do() functions for sampling and simulation. Don’t fence off students from the computation pool, throw them in! Don’t teach them coding/programming per se, but computation and algorithmic thinking. Drawing Venn diagrams delineating statistics, computer science, and data science is also ever more archaic; embrace computation! Complete reproducibility We find it frustrating when textbooks give examples but not the source code and the data itself. We not only give you the source code for all examples, but also the source code for the whole book! We encourage use of R Markdown to foster notions of reproducible research. Ultimately the best textbook is one you’ve written yourself You best know your audience, their background, and their priorities and you know best your own style and types of examples and problems you like best. Customizability is the ultimate end. A new paradigm for textbooks? Versions, not editions? Pull requests, crowd-sourcing, and development versions? 1.2 Contribute This book is in beta testing and is currently at Version 0.1.0. If you would like to receive periodic updates on this book and other similar projects, please fill out this Google Form. The source code for this book is available for download/forking on GitHub. If you find typos or other errors or have suggestions on how to better word something in the book, please create a pull request too! Please feel free to modify the book as you wish for your own needs! All we ask is that you list the authors field above as “Chester Ismay, Albert Y. Kim, and YOU!” We’d also appreciate if you let us now what changes you’ve made and how you’ve used the textbook. We’d love some data on what’s working well and what’s not working so well. 
1.3 Getting Started This book was written using the bookdown R package from Yihui Xie. In order to follow along and run the code in this book on your own, you’ll need to have access to R and RStudio. You can find more information on both of these with a simple Google search for “R” and for “RStudio.” An introduction to using R, RStudio, and R Markdown is also available in a free book here (Ismay 2016). It is recommended that you refer back to this book frequently as it has GIF screen recordings that you can follow along with as you learn. We will keep a running list of R packages you will need to have installed to complete the analysis as well here in the needed_pkgs character vector. You can check if you have all of the needed packages installed by running all of the lines below. The last lines including the if will install them as needed (i.e., download their needed files from the internet to your hard drive). You can run the library function on them to load them into your current analysis. Prior to each analysis where a package is needed, you will see the corresponding library function in the text. Make sure to check the top of the chapter to see if a package was loaded there. needed_pkgs <- c("nycflights13", "dplyr", "ggplot2", "knitr", "okcupiddata", "dygraphs", "rmarkdown", "mosaic", "ggplot2movies") new.pkgs <- needed_pkgs[!(needed_pkgs %in% installed.packages())] if(length(new.pkgs)) { install.packages(new.pkgs, repos = "http://cran.rstudio.com") } Colophon The source of the book is available here and was built with versions of R packages (and their dependent packages) given below. This may not be of importance for initial readers of this book, but the hope is you can reproduce a duplicate of this book by installing these versions of the packages. 
package * version date source assertthat 0.1 2013-12-06 CRAN (R 3.3.0) backports 1.0.4 2016-10-24 CRAN (R 3.3.0) base64enc 0.1-3 2015-07-28 CRAN (R 3.3.0) BH 1.62.0-1 2016-11-19 CRAN (R 3.3.2) bitops 1.0-6 2013-08-17 CRAN (R 3.3.0) caTools 1.17.1 2014-09-10 CRAN (R 3.3.0) colorspace 1.3-2 2016-12-14 CRAN (R 3.3.2) curl 2.3 2016-11-24 CRAN (R 3.3.2) DBI 0.5-1 2016-09-10 CRAN (R 3.3.0) dichromat 2.0-0 2013-01-24 CRAN (R 3.3.0) digest 0.6.11 2017-01-03 CRAN (R 3.3.2) dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.0) dygraphs * 1.1.1.4 2017-01-04 CRAN (R 3.3.2) evaluate 0.10 2016-10-11 CRAN (R 3.3.0) ggdendro 0.1-20 2016-04-27 CRAN (R 3.3.0) ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2) ggplot2movies * 0.0.1 2015-08-25 CRAN (R 3.3.0) gridExtra 2.2.1 2016-02-29 CRAN (R 3.3.0) gtable 0.2.0 2016-02-26 CRAN (R 3.3.0) highr 0.6 2016-05-09 CRAN (R 3.3.0) hms 0.3 2016-11-22 CRAN (R 3.3.2) htmltools 0.3.5 2016-03-21 CRAN (R 3.3.0) htmlwidgets 0.8 2016-11-09 CRAN (R 3.3.2) jsonlite 1.2 2016-12-31 CRAN (R 3.3.2) knitr * 1.15.1 2016-11-22 CRAN (R 3.3.2) labeling 0.3 2014-08-23 CRAN (R 3.3.0) lattice * 0.20-34 2016-09-06 CRAN (R 3.3.2) latticeExtra 0.6-28 2016-02-09 CRAN (R 3.3.0) lazyeval 0.2.0 2016-06-12 CRAN (R 3.3.0) magrittr 1.5 2014-11-22 CRAN (R 3.3.0) markdown 0.7.7 2015-04-22 CRAN (R 3.3.0) MASS 7.3-45 2016-04-21 CRAN (R 3.3.2) Matrix * 1.2-7.1 2016-09-01 CRAN (R 3.3.2) mime 0.5 2016-07-07 CRAN (R 3.3.0) mosaic * 0.14.4 2016-07-29 CRAN (R 3.3.0) mosaicData * 0.14.0 2016-06-17 CRAN (R 3.3.0) munsell 0.4.3 2016-02-13 CRAN (R 3.3.0) nycflights13 * 0.2.1 2016-12-30 CRAN (R 3.3.2) okcupiddata * 0.1.0 2016-08-19 CRAN (R 3.3.0) plyr 1.8.4 2016-06-08 CRAN (R 3.3.0) R6 2.2.0 2016-10-05 CRAN (R 3.3.0) RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.3.0) Rcpp 0.12.8 2016-11-17 CRAN (R 3.3.2) readr * 1.0.0 2016-08-03 CRAN (R 3.3.0) reshape2 1.4.2 2016-10-22 CRAN (R 3.3.0) rmarkdown 1.3 2016-12-21 CRAN (R 3.3.2) rprojroot 1.1 2016-10-29 CRAN (R 3.3.0) scales 0.4.1 2016-11-09 CRAN (R 3.3.2) stringi 1.1.2 
2016-10-01 CRAN (R 3.3.0) stringr 1.1.0 2016-08-19 CRAN (R 3.3.0) tibble 1.2 2016-08-26 CRAN (R 3.3.0) tidyr 0.6.0 2016-08-12 CRAN (R 3.3.0) xts 0.9-7 2014-01-02 CRAN (R 3.3.0) yaml 2.1.14 2016-11-12 CRAN (R 3.3.2) zoo 1.7-14 2016-12-16 CRAN (R 3.3.2) Book was last updated: ## [1] "By Chester on Saturday, January 07, 2017 09:14:56 EST" References "], +["index.html", "ModernDive 1 Preamble 1.1 Principles of this Book 1.2 Contribute 1.3 Getting Started Colophon", " ModernDive An Introduction to Statistical and Data Sciences via R Chester Ismay and Albert Y. Kim 2017-01-07 1 Preamble 1.1 Principles of this Book These are some principles we keep in mind. If you agree with them, this might be the book for you. Blur the lines between lecture and lab Laptops and open source software are rendering the lab/lecture dichotomy ever more archaic. It’s much harder for students to understand the importance of using the software if they only use it once a week or less. They forget the syntax in much the same way someone learning a foreign language forgets the rules. Focus on the entire data/science research pipeline Grolemund and Wickham’s graphic George Cobb argued for “Minimizing prerequisites to research” It’s all about data, data, data We leverage R packages for rich/complex yet easy-to-load data sets. We’ve heard it before: “You can’t teach ggplot2 for data visualization in intro stats!” We, like David Robinson, are more optimistic and we’ve had success doing so. dplyr is a game changer for data manipulation: the verb describing your desired data action is the command name! Use simulation/resampling for intro stats, not probability/large sample approximation Reinforce concepts, not equations, formulas, and probability tables. To this end, we’re big fans of the mosaic package’s shuffle(), resample(), and do() functions for sampling and simulation. Don’t fence off students from the computation pool, throw them in! 
Don’t teach them coding/programming per se, but computation and algorithmic thinking. Drawing Venn diagrams delineating statistics, computer science, and data science is also ever more archaic; embrace computation! Complete reproducibility We find it frustrating when textbooks give examples but not the source code and the data itself. We not only give you the source code for all examples, but also the source code for the whole book! We encourage use of R Markdown to foster notions of reproducible research. Ultimately the best textbook is one you’ve written yourself You best know your audience, their background, and their priorities and you know best your own style and types of examples and problems you like best. Customizability is the ultimate end. A new paradigm for textbooks? Versions, not editions? Pull requests, crowd-sourcing, and development versions? 1.2 Contribute This book is in beta testing and is currently at Version 0.1.0. If you would like to receive periodic updates on this book and other similar projects, please fill out this Google Form. The source code for this book is available for download/forking on GitHub. If you find typos or other errors or have suggestions on how to better word something in the book, please create a pull request too! Please feel free to modify the book as you wish for your own needs! All we ask is that you list the authors field above as “Chester Ismay, Albert Y. Kim, and YOU!” We’d also appreciate if you let us now what changes you’ve made and how you’ve used the textbook. We’d love some data on what’s working well and what’s not working so well. 1.3 Getting Started This book was written using the bookdown R package from Yihui Xie. In order to follow along and run the code in this book on your own, you’ll need to have access to R and RStudio. 
You can find more information on both of these with a simple Google search for “R” and for “RStudio.” An introduction to using R, RStudio, and R Markdown is also available in a free book here (Ismay 2016). It is recommended that you refer back to this book frequently as it has GIF screen recordings that you can follow along with as you learn. We will keep a running list of R packages you will need to have installed to complete the analysis as well here in the needed_pkgs character vector. You can check if you have all of the needed packages installed by running all of the lines below. The last lines including the if will install them as needed (i.e., download their needed files from the internet to your hard drive). You can run the library function on them to load them into your current analysis. Prior to each analysis where a package is needed, you will see the corresponding library function in the text. Make sure to check the top of the chapter to see if a package was loaded there. needed_pkgs <- c("nycflights13", "dplyr", "ggplot2", "knitr", "okcupiddata", "dygraphs", "rmarkdown", "mosaic", "ggplot2movies") new.pkgs <- needed_pkgs[!(needed_pkgs %in% installed.packages())] if(length(new.pkgs)) { install.packages(new.pkgs, repos = "http://cran.rstudio.com") } Colophon The source of the book is available here and was built with versions of R packages (and their dependent packages) given below. This may not be of importance for initial readers of this book, but the hope is you can reproduce a duplicate of this book by installing these versions of the packages. 
package * version date source assertthat 0.1 2013-12-06 CRAN (R 3.3.0) backports 1.0.4 2016-10-24 CRAN (R 3.3.0) base64enc 0.1-3 2015-07-28 CRAN (R 3.3.0) BH 1.62.0-1 2016-11-19 CRAN (R 3.3.2) bitops 1.0-6 2013-08-17 CRAN (R 3.3.0) caTools 1.17.1 2014-09-10 CRAN (R 3.3.0) colorspace 1.3-2 2016-12-14 CRAN (R 3.3.2) curl 2.3 2016-11-24 CRAN (R 3.3.2) DBI 0.5-1 2016-09-10 CRAN (R 3.3.0) dichromat 2.0-0 2013-01-24 CRAN (R 3.3.0) digest 0.6.11 2017-01-03 CRAN (R 3.3.2) dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.0) dygraphs * 1.1.1.4 2017-01-04 CRAN (R 3.3.2) evaluate 0.10 2016-10-11 CRAN (R 3.3.0) ggdendro 0.1-20 2016-04-27 CRAN (R 3.3.0) ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2) ggplot2movies * 0.0.1 2015-08-25 CRAN (R 3.3.0) gridExtra 2.2.1 2016-02-29 CRAN (R 3.3.0) gtable 0.2.0 2016-02-26 CRAN (R 3.3.0) highr 0.6 2016-05-09 CRAN (R 3.3.0) hms 0.3 2016-11-22 CRAN (R 3.3.2) htmltools 0.3.5 2016-03-21 CRAN (R 3.3.0) htmlwidgets 0.8 2016-11-09 CRAN (R 3.3.2) jsonlite 1.2 2016-12-31 CRAN (R 3.3.2) knitr * 1.15.1 2016-11-22 CRAN (R 3.3.2) labeling 0.3 2014-08-23 CRAN (R 3.3.0) lattice * 0.20-34 2016-09-06 CRAN (R 3.3.2) latticeExtra 0.6-28 2016-02-09 CRAN (R 3.3.0) lazyeval 0.2.0 2016-06-12 CRAN (R 3.3.0) magrittr 1.5 2014-11-22 CRAN (R 3.3.0) markdown 0.7.7 2015-04-22 CRAN (R 3.3.0) MASS 7.3-45 2016-04-21 CRAN (R 3.3.2) Matrix * 1.2-7.1 2016-09-01 CRAN (R 3.3.2) mime 0.5 2016-07-07 CRAN (R 3.3.0) mosaic * 0.14.4 2016-07-29 CRAN (R 3.3.0) mosaicData * 0.14.0 2016-06-17 CRAN (R 3.3.0) munsell 0.4.3 2016-02-13 CRAN (R 3.3.0) nycflights13 * 0.2.1 2016-12-30 CRAN (R 3.3.2) okcupiddata * 0.1.0 2016-08-19 CRAN (R 3.3.0) plyr 1.8.4 2016-06-08 CRAN (R 3.3.0) R6 2.2.0 2016-10-05 CRAN (R 3.3.0) RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.3.0) Rcpp 0.12.8 2016-11-17 CRAN (R 3.3.2) readr * 1.0.0 2016-08-03 CRAN (R 3.3.0) reshape2 1.4.2 2016-10-22 CRAN (R 3.3.0) rmarkdown 1.3 2016-12-21 CRAN (R 3.3.2) rprojroot 1.1 2016-10-29 CRAN (R 3.3.0) scales 0.4.1 2016-11-09 CRAN (R 3.3.2) stringi 1.1.2 
2016-10-01 CRAN (R 3.3.0) stringr 1.1.0 2016-08-19 CRAN (R 3.3.0) tibble 1.2 2016-08-26 CRAN (R 3.3.0) tidyr 0.6.0 2016-08-12 CRAN (R 3.3.0) xts 0.9-7 2014-01-02 CRAN (R 3.3.0) yaml 2.1.14 2016-11-12 CRAN (R 3.3.2) zoo 1.7-14 2016-12-16 CRAN (R 3.3.2) Book was last updated: ## [1] "By Chester on Saturday, January 07, 2017 11:29:21 EST" References "], ["2-intro.html", "2 Introduction 2.1 Preamble 2.2 Three driving data sources 2.3 Data/science pipeline 2.4 Reproducibility 2.5 Who is this book for?", " 2 Introduction 2.1 Preamble This book is inspired by three books: “Mathematical Statistics with Resampling and R” (Chihara and Hesterberg 2011), “Intro Stat with Randomization and Simulation” (Diez, Barr, and Çetinkaya-Rundel 2014), and “R for Data Science” (Grolemund and Wickham 2016). The first book, while designed for upper-level undergraduates and graduate students, provides an excellent resource on how to use resampling to build statistical concepts like normal distributions using computers instead of focusing on memorization of formulas. The last two books also provide a path towards free alternatives to the traditionally expensive introductory statistics textbook. When looking over the vast number of introductory statistics textbooks we found that there wasn’t one that incorporated many of the new R packages directly into the text. Additionally, there wasn’t an open-source, free textbook available that showed new learners all of the following how to use R to explore and visualize data how to use randomization and simulation to build inferential ideas how to effectively create stories using these ideas to convey information to a lay audience. We will introduce sometimes difficult statistics concepts through the medium of data visualization. In today’s world, we are bombarded with graphics that attempt to convey ideas. We will explore what makes a good graphic and what the standard ways are to convey relationships with data. 
You’ll also see the use of visualization to introduce concepts like mean, median, standard deviation, distributions, etc. In general, we’ll use visualization as a way of building almost all of the ideas in this book. Additionally, this book will focus on the triad of computational thinking, data thinking, and inferential thinking. We’ll see throughout the book how these three modes of thinking can build effective ways to work with, describe, and convey statistical knowledge. In order to do so, you’ll see the importance of literate programming to develop literate data science. In other words, you’ll see how to write code and descriptions that are useful not just for a computer to execute but also for readers to understand exactly what a statistical analysis is doing and how it works. Hal Abelson coined the phrase that we will follow throughout this book: “Programs must be written for people to read, and only incidentally for machines to execute.” 2.2 Three driving data sources Instead of hopping from one data set to the next, we’ve decided to focus throughout the book on three different data sources: flights leaving New York City in 2013 profiles of OKCupid users in San Francisco IMDB movie ratings By focusing on just three large data sources, it is our hope that you’ll be able to see how each of the chapters is interconnected. You’ll see how the data being tidy leads into data visualization and manipulation and how those concepts tie into inference and regression. 2.3 Data/science pipeline You may think of statistics as just being a bunch of numbers. We commonly hear the phrase “statistician” when listening to broadcasts of sporting events. Statistics (in particular, data analysis), in addition to describing numbers like with baseball batting averages, plays a vital role in all of the sciences. You’ll commonly hear the phrase “statistically significant” thrown around in the media. 
You’ll see things that say “Science now shows that chocolate is good for you.” Underpinning these claims is data analysis. By the end of this book, you’ll be able to better understand whether these claims should be trusted or whether we should be wary. Inside data analysis are many sub-fields that we will discuss throughout this book (not necessarily in this order): data collection data manipulation data visualization data modeling inference interpretation of results data storytelling This can be summarized in a graphic that is commonly used by Hadley Wickham: Figure 2.1: Hadley’s workflow graphic We will begin with a discussion on what is meant by tidy data and then dig into the gray Understand portion of the cycle and conclude by talking about interpreting and discussing the results of our models via Communication. These steps are vital to any statistical analysis. But why should you care about statistics? “Why did they make me take this class?” There’s a reason so many fields require a statistics course. Scientific knowledge grows through an understanding of statistical significance and data analysis. You needn’t be intimidated by statistics. It’s not the beast that it used to be and, paired with computation, you’ll see how reproducible research in the sciences particularly increases scientific knowledge. 2.4 Reproducibility “The most important tool is the mindset, when starting, that the end product will be reproducible.” – Keith Baggerly Another large goal of this book is to help readers understand the importance of reproducible analyses. The hope is to get readers into the habit of making their analyses reproducible from the very beginning. This means we’ll be trying to help you build new habits. This will take practice and be difficult at times. You’ll see just why it is so important for you to keep track of your code and document it well, both to help yourself later and to help any potential collaborators. 
Copying and pasting is not the way that efficient and effective scientific research is conducted. It’s much more important for time to be spent on data collection and data analysis and not on copying and pasting plots back and forth across a variety of programs. In a traditional analysis, if an error was made with the original data, we’d need to step through the entire process again: recreate the plots and copy and paste all of the new plots and our statistical analysis into our document. This is an error-prone and frustrating use of time. We’ll see how to use R Markdown to get away from this tedious activity so that we can spend more time doing science. “We are talking about computational reproducibility.” - Yihui Xie Reproducibility means different things in different scientific fields. Are experiments conducted in a way that another researcher could follow the steps and get similar results? In this book, we will focus on what is known as computational reproducibility. This refers to being able to pass all of one’s data analysis and conclusions to someone else and have them get exactly the same results on their machine. This allows for time to be spent doing actual science and interpreting results and assumptions instead of the more error-prone way of starting from scratch or following a list of steps that may be different from machine to machine. 2.5 Who is this book for? This book is targeted at students taking a traditional intro stats class in a small college environment using RStudio and preferably RStudio Server. We assume no prerequisites: no calculus and no prior programming experience. This is intended to be a gentle and nice introduction to the practice of statistics in terms of how data scientists, statisticians, and other scientists analyze data and write stories about data. 
We have intentionally avoided throwing formulas at you and instead have focused on developing statistical concepts via data visualization and statistical computing. We hope this is a more intuitive experience than the way statistics has traditionally been taught (and how it is commonly perceived from the outside). We additionally hope that you see the value of reproducible research via R as you continue in your studies. We understand that there will initially be growing pains in learning to program but we are here to help you and you should know that there is a huge community of R users who are always happy to help newbies along. Now let’s get into learning about how to create good stories about and with data! References "], ["3-tidy.html", "3 Tidy Data 3.1 What is tidy data? 3.2 The nycflights13 datasets 3.3 How is flights tidy? 3.4 Normal forms of data 3.5 What’s to come?", " 3 Tidy Data In this chapter, we’ll discuss the importance of tidy data. You may think that this means just having your data in a spreadsheet, but you’ll see that it is actually more specific than that. Data actually comes to us in a variety of formats from pictures to text to just numbers. We’ll focus on datasets that can be stored in a spreadsheet throughout this book as that is the most common way data is collected in the sciences. Having tidy data will allow us to more easily create data visualizations as we will see in Chapter ??. It will also help us with manipulating data in Chapter ?? and in all subsequent chapters when we discuss statistical inference. You may not necessarily understand the importance of tidy data but it will become more and more apparent as we proceed through the book. 3.1 What is tidy data? 
You have surely heard the word “tidy” in your life: “Tidy up your room!” “Please write your homework in a tidy way so that it is easier to grade and to provide feedback.” Marie Kondo’s best-selling book The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing “I am not by any stretch of the imagination a tidy person, and the piles of unread books on the coffee table and by my bed have a plaintive, pleading quality to me - ‘Read me, please!’” - Linda Grant So what does it mean for your data to be tidy? Put simply: it means that your data is organized. But it’s more than just that. It means that your data follows the same standard format making it easy for others to find elements of your data, to manipulate and transform your data, and for our purposes continuing with the common theme: it makes it easier to visualize your data and the relationships between different variables in your data. We will follow Hadley Wickham’s definition of tidy data here (Wickham 2014): A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes. Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data: Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table. Figure 3.1: Tidy data graphic from http://r4ds.had.co.nz/tidy-data.html Reading over this definition, you can begin to think about datasets that won’t follow this nice format. 
Learning check (LC3.1) Give an example dataset that doesn’t follow this format. What features of this dataset might make it difficult to visualize? How could the dataset be tweaked to make it tidy? 3.2 The nycflights13 datasets We likely have all flown on airplanes or know someone that has. Air travel has become an ever-present aspect of our daily lives. If you live in or are visiting a relatively large city and you walk around that city’s airport, you see gates showing flight information from many different airlines. And you will frequently see that some flights are delayed because of a variety of conditions. Are there ways that we can avoid having to deal with these flight delays? We’d all like to arrive at our destinations on time whenever possible. (Unless you secretly love hanging out at airports. If you are one of these people, pretend for the moment that you are very much anticipating being at your final destination.) Hadley Wickham (herein just referred to as “Hadley”) created multiple datasets containing information about departing flights from the New York City area in 2013 (Wickham 2016). We will begin by loading in one of these datasets, the flights dataset, and getting an idea of its structure: library(nycflights13) data(flights) The library function here loads the R package nycflights13 into the current R environment in which you are working. The data(flights) loads in the flights dataset that is stored in the nycflights13 package. Note that you’ll get an error if you try to load this package in and it hasn’t been downloaded and installed. You can ensure it is installed by running the code below: if(!require(nycflights13)) install.packages("nycflights13", repos = "http://cran.rstudio.org") This code checks to see if nycflights13 is installed and, if not, then goes to the specified repository of “http://cran.rstudio.org” and downloads the package from there and installs it. 
If it is already installed you can see it listed in the Packages tab in the bottom right portion of RStudio and the code will not install the package again since this is redundant and you won’t need to do it over and over again. This dataset and most others presented in this book will be in the data.frame format in R. Data frames are ways to look at collections of variables that are tightly coupled together. Frequently, the best way to get a feel for a data frame is to use the View function in RStudio. This command will be given throughout the book as a reminder, but the actual output will be hidden. View(flights) Learning check (LC3.2) What does any ONE row in this flights dataset refer to? A. Data on an airline B. Data on a flight C. Data on an airport D. Data on multiple flights By running View(flights), we see the different variables listed in the columns and we see that there are different types of variables. Some of the variables like distance, day, and arr_delay are what we will call quantitative variables. These variables vary in a numerical way. Other variables here are categorical. Note that if you look in the leftmost column of the View(flights) output, you will see a column of numbers. These are the row numbers of the dataset. If you glance across a row with the same number, say row 5, you can get an idea of what each row corresponds to. In other words, this will allow you to identify what object is being referred to in a given row. This is often called the observational unit. The observational unit in this example is an individual flight departing New York City in 2013. Note: Frequently the first thing you should do when given a dataset is to identify the observation unit, specify the variables, and give the types of variables you are presented with. str(flights) ## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 19 variables: ## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... ## $ month : int 1 1 1 1 1 1 1 1 1 1 ... 
## $ day : int 1 1 1 1 1 1 1 1 1 1 ... ## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ... ## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ... ## $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ... ## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ... ## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ... ## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ... ## $ carrier : chr "UA" "UA" "AA" "B6" ... ## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ... ## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ... ## $ origin : chr "EWR" "LGA" "JFK" "JFK" ... ## $ dest : chr "IAH" "IAH" "MIA" "BQN" ... ## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ... ## $ distance : num 1400 1416 1089 1576 762 ... ## $ hour : num 5 5 5 5 6 5 6 6 6 6 ... ## $ minute : num 15 29 40 45 0 58 0 0 0 0 ... ## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" ... Learning check (LC3.3) What are some examples in this dataset of categorical variables? What makes them different than quantitative variables? (LC3.4) What does int, num, and chr mean in the output above? (LC3.5) How many different columns are in this dataset? (LC3.6) How many different rows are in this dataset? Another way to view the properties of a dataset is to use the str function (“str” is short for “structure”). The str function is expecting an object for its argument. In this case, the object is a data frame named flights. You can use the str function on other objects and data frames using the syntax str(object) where object is the name of an object in R. This will give you the first few entries of each variable in a row after the variable. In addition, the type of the variable is given immediately after the : following each variable’s name. Here, int and num refer to quantitative variables. In contrast, chr refers to categorical variables. One more type of variable is given here with the time_hour variable: POSIXct. 
As you may suspect, this variable corresponds to a specific date and time of day. Another nice feature of R is the help system. You can get help in R by simply entering a question mark before the name of a function or an object and you will be presented with a page showing the documentation. Note that this output help file is omitted here but can be accessed here on page 3 of the PDF document. ?str ?flights Another aspect of tidy data is a description of what each variable in the dataset represents. This helps others to understand what your variable names mean and what they correspond to. If we look at the output of ?flights, we can see that a description of each variable by name is given. An important feature to ALWAYS include with your data is the appropriate units of measurement. We’ll see this further when we work with the dep_delay variable in Chapter ??. (It’s in minutes, but you’d get some really strange interpretations if you thought it was in hours or seconds. UNITS MATTER!) 3.3 How is flights tidy? We see that flights has a rectangular shape with each row corresponding to a different flight and each column corresponding to a characteristic of that flight. This matches exactly with how Hadley defined tidy data: Each variable forms a column. Each observation forms a row. But what about the third property? Each type of observational unit forms a table. We identified earlier that the observational unit in the flights dataset is an individual flight. And we have shown that this dataset consists of 336,776 flights with 19 variables. In other words, some rows of this dataset don’t refer to a measurement on an airline or on an airport. They specifically refer to characteristics/measurements on a given flight from New York City in 2013. 
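As a quick check of the counts quoted above, the dimensions of flights can be inspected directly. This is a minimal sketch, assuming the nycflights13 package is installed; dim, nrow, and ncol are base R functions.

```r
library(nycflights13)

# dim() returns the number of rows (flights) followed by the number of columns (variables)
dim(flights)

# The same two numbers, one at a time
nrow(flights)
ncol(flights)
```

If the package version matches the one used in this book, these should agree with the first line of the str(flights) output shown earlier.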
By contrast, also included in the nycflights13 package are datasets with different observational units (Wickham 2016): weather: hourly meteorological data for each airport planes: construction information about each plane airports: airport names and locations airlines: translation between two-letter carrier codes and names You may have been asking yourself what carrier refers to in the str(flights) output above. The airlines dataset provides a description of this with each airline being the observational unit: data(airlines) airlines ## # A tibble: 16 × 2 ## carrier name ## <chr> <chr> ## 1 9E Endeavor Air Inc. ## 2 AA American Airlines Inc. ## 3 AS Alaska Airlines Inc. ## 4 B6 JetBlue Airways ## 5 DL Delta Air Lines Inc. ## 6 EV ExpressJet Airlines Inc. ## 7 F9 Frontier Airlines Inc. ## 8 FL AirTran Airways Corporation ## 9 HA Hawaiian Airlines Inc. ## 10 MQ Envoy Air ## 11 OO SkyWest Airlines Inc. ## 12 UA United Air Lines Inc. ## 13 US US Airways Inc. ## 14 VX Virgin America ## 15 WN Southwest Airlines Co. ## 16 YV Mesa Airlines Inc. As can be seen here, when you just enter the name of an object in R, it will by default print the contents of that object to the screen. Be careful! It’s usually better to use the View() function in RStudio since larger objects may take a while to print to the screen and it likely won’t be helpful to you to have hundreds of lines of output. 3.4 Normal forms of data The datasets included in the nycflights13 package are in a form that minimizes redundancy of data. We will see that there are ways to merge (or join) the different tables together easily. We are capable of doing so because each of the tables has keys in common to relate one to another. This is an important property of normal forms of data. The process of decomposing data frames into less redundant tables without losing information is called normalization. More information is available on Wikipedia. We saw an example of this above with the airlines dataset.
While the flights data frame could also include a column with the names of the airlines instead of the carrier code, this would be repetitive since there is a unique mapping of the carrier code to the name of the airline/carrier. Below an example is given showing how to join the airlines data frame together with the flights data frame by linking together the two datasets via a common key of "carrier". Note that this “joined” data frame is assigned to a new data frame called joined_flights. if(!require(nycflights13)) install.packages("nycflights13", repos = "http://cran.rstudio.org") library(dplyr) joined_flights <- inner_join(x = flights, y = airlines, by = "carrier") View(joined_flights) If we View this dataset, we see that a new variable called name has been created. (We will see in Subsection 5.1.1 ways to change name to a more descriptive variable name.) More discussion about joining data frames together will be given in Chapter ??. We will see there that the names of the columns to be linked need not match as they did here with "carrier". Review questions (RQ3.1) What are common characteristics of “tidy” datasets? (RQ3.2) What makes “tidy” datasets useful for organizing data? (RQ3.3) How many variables are presented in the table below? What does each row correspond to? (Hint: You may not be able to answer both of these questions immediately but take your best guess.) students faculty 4 2 6 3 (RQ3.4) The confusion you may have encountered in (RQ3.3) is one that those who work with data commonly face. This dataset is not tidy. Actually, the dataset in (RQ3.3) has three variables, not the two that were presented. Make a guess as to what these variables are and present a tidy dataset instead of the untidy one given in (RQ3.3). (RQ3.5) The actual data presented in (RQ3.3) is given below in tidy data format: role Sociology?
Type of School student TRUE Public student TRUE Public student TRUE Public student TRUE Public student FALSE Public student FALSE Public student FALSE Private student FALSE Private student FALSE Private student FALSE Private faculty TRUE Public faculty TRUE Public faculty FALSE Public faculty FALSE Private faculty FALSE Private What does each row correspond to? What are the different variables in this data frame? The Sociology? variable is known as a logical variable. What types of values does a logical variable take on? (RQ3.6) What are some advantages of data in normal forms? What are some disadvantages? 3.5 What’s to come? In Chapter ??, we will further explore the distribution of a variable in a related dataset to flights: the temp variable in the weather dataset. We’ll be interested in understanding how this variable varies in relation to the values of other variables in the dataset. We will see that visualization is often a powerful tool in helping us see what is going on in a dataset. It will be a useful way to expand on the str function we have seen here for tidy data. References "], ["4-data-visualization-via-ggplot2.html", "4 Data Visualization via ggplot2 Needed packages 4.1 The Grammar of Graphics 4.2 Five Named Graphs - The 5NG 4.3 5NG#1: Scatter-plots 4.4 5NG#2: Line-graphs 4.5 5NG#3: Histograms 4.6 Facets 4.7 5NG#4: Boxplots 4.8 5NG#5: Barplots 4.9 Conclusion", " 4 Data Visualization via ggplot2 In Chapter 3, we discussed the importance of datasets being tidy. You will see in examples here why having a tidy dataset helps us immensely when plotting our data. In plotting our data, we will be able to gain valuable insights from our data that we couldn’t initially see from just looking at the raw data. We will focus on using Hadley Wickham’s ggplot2 package in doing so, which was developed to work specifically on datasets that are tidy. 
It provides an easy way to customize your plots and is based on data visualization theory given in The Grammar of Graphics (Wilkinson 2005). At the most basic level, graphics/plots/charts provide a nice way for us to get a sense for how quantitative variables compare in terms of their center and their spread. The most important thing to know about graphics is that they should make the findings you want to get across obvious to your audience. This requires a balance of not including too much in your plots, but also including enough so that relationships and interesting findings can be easily seen. As we will see, plots/graphics also help us to identify patterns and outliers in our data. We will see that a common extension of these ideas is to compare the distribution of one quantitative variable (i.e., what the spread of a variable looks like) as we go across the levels of a different categorical variable. Needed packages Before we proceed with this chapter, let’s load all the necessary packages, in particular the nycflights13 package introduced in Chapter 3 containing various data sets. library(dplyr) library(ggplot2) library(nycflights13) 4.1 The Grammar of Graphics We begin with a discussion of a theoretical framework for data visualization known as “The Grammar of Graphics”, which serves as the basis for the ggplot2 package. Much like the way we construct sentences in any language using a linguistic grammar (nouns, verbs, subjects, objects, etc.), the theoretical framework given by Leland Wilkinson (Wilkinson 2005) allows us to specify the components of a statistical graphic. 4.1.1 Components of Grammar In short, the grammar tells us that: A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects. Specifically, we can break a graphic into the following three essential components: data: the data set composed of the variables that we map. geom: the geometric object in question.
This refers to the type of object we can observe in our plot. For example, points, lines, bars, etc. aes: aesthetic attributes of the geometric object that we can perceive on a graphic. For example, x/y position, color, shape, and size. Each assigned aesthetic attribute can be mapped to a variable in our data set. If not assigned, they are set to defaults. 4.1.2 Napoleon’s March on Moscow In 1812, Napoleon led a French invasion of Russia, marching on Moscow. It was one of the biggest military disasters, due in large part to the Russian winter. In 1869, a French civil engineer named Charles Joseph Minard published arguably one of the greatest statistical visualizations of all time, which summarized this march: Figure 4.1: Minard’s Visualization of Napoleon’s March This was considered a revolution in statistical graphics because between the map on top and the line graph on the bottom, there are 6 dimensions of information (i.e. variables) being displayed on a 2-dimensional page. Let’s view this graphic through the lens of the Grammar of Graphics: Table 4.1: Grammar of Map (Top) and Line-Graph (Bottom) in Minard’s Graphic of Napoleon’s March data aes geom longitude x point latitude y point army size size path army direction color path data aes geom date x line & text temperature y line & text For example, the data variable longitude gets mapped to the x aesthetic of the point geometric objects on the map, while the annotated line-graph displays date and temperature variable information via its mapping to the x and y aesthetics of the line geometric object. 4.1.3 Other Components of the Grammar There are other components of the Grammar of Graphics we can control: facet: how to break up a plot into subsets statistical transformations: this includes smoothing, binning values into a histogram, or leaving the data untransformed ("identity").
scales: these both convert data units to physical units the computer can display and draw a legend and/or axes, which provide an inverse mapping to make it possible to read the original data values from the graph. coordinate system for x/y values: typically Cartesian, but can also be polar or map. position adjustments. In this text, we will only focus on the first two: faceting (introduced in Section 4.6) and statistical transformations (in a limited sense when we consider barplots in Section 4.8); the other components are left to a more advanced text. This is not a problem when producing a plot as each of these components has default settings. There are other extra attributes that can be tweaked as well, including the plot title, axis labels, and over-arching themes for the plot. In general, the Grammar of Graphics allows for customization but also a consistent framework that allows the user to easily tweak their creations as needed in order to convey a message about their data. 4.1.4 The ggplot2 Package We introduce Hadley Wickham’s ggplot2 package, which is an implementation of the Grammar of Graphics for R (Wickham and Chang 2016). You may have noticed that a lot of previous text in this chapter is written in computer font. This is because the various components of the Grammar of Graphics are specified using the ggplot function, which expects at a bare minimum as arguments the data frame where the variables exist (the data argument) and the names of the variables to be plotted (the mapping argument). The names of the variables will be entered into the aes function as arguments where aes stands for “aesthetics”. The plot given above is not a histogram, but the output does show us a bit of what is going on with ggplot(data = weather, mapping = aes(x = temp)). It is producing a backdrop onto which we will “paint” elements. We next proceed by adding a layer (hence the use of the + symbol) to the plot to produce a histogram.
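The two steps described above can be sketched as follows. This is only an illustration, assuming the ggplot2 and nycflights13 packages are installed; the object names backdrop and temp_histogram are hypothetical and introduced here for clarity.

```r
library(ggplot2)
library(nycflights13)

# Step 1: supply only the data and the aesthetic mapping.
# Printing this object draws an empty backdrop with temp on the x-axis.
backdrop <- ggplot(data = weather, mapping = aes(x = temp))

# Step 2: add a geometric-object layer with +, ending the line with the +
# so that R knows the expression continues on the next line.
temp_histogram <- backdrop +
  geom_histogram()

# Printing the object renders the histogram
temp_histogram
```

Storing the backdrop in its own object is a choice made here to separate the two steps; in practice the call is usually written as a single chained expression.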
(Note also here that we don’t have to specify the data = and mapping = text in our function calls. This is covered in more detail in Chapter 5 of the “Getting Used to R, RStudio, and R Markdown” book (Ismay 2016).) You are encouraged to press Return on your keyboard after entering the +. As we add more and more elements, it will be nice to keep them indented as you see below. Note that this will not work if you begin the line with the +. An excellent resource as you begin to create plots using the ggplot2 package is a cheatsheet that RStudio has put together entitled “Data Visualization with ggplot2”, available by clicking here or via the RStudio menu bar: Help -> Cheatsheets -> “Data Visualization with ggplot2”. This covers more than what we’ve discussed in this chapter but provides nice visual descriptions of what each function produces. 4.2 Five Named Graphs - The 5NG For our purposes, we will be limiting consideration to five different types of graphs (note that in this text we use the terms “graphs”, “plots”, and “charts” interchangeably). We term these five named graphs the 5NG: scatter-plots line-graphs boxplots histograms barplots With this repertoire of plots, you can visualize a wide array of data variables thrown at you. We will discuss some variations of these, but with the 5NG in your toolbox you can do big things! Something we will also stress here is that certain plots only work for categorical/logical variables and others only for quantitative variables. You’ll want to quiz yourself often as we go along on which plot makes sense given a particular problem set-up. 4.3 5NG#1: Scatter-plots The simplest of the 5NG are scatter-plots (also called bivariate plots); they allow you to investigate the relationship between two continuous variables. While you may already be familiar with such plots, let’s view them through the lens of the Grammar of Graphics.
Specifically, we will graphically investigate the relationship between the following two continuous variables in the flights data frame: dep_delay: departure delay on the horizontal “x” axis arr_delay: arrival delay on the vertical “y” axis for Alaska Airlines flights leaving NYC in 2013. This requires paring down the flights data frame to a smaller data frame alaska_flights consisting of only Alaska Airlines (carrier code “AS”) flights. data(flights) alaska_flights <- flights %>% filter(carrier == "AS") This code snippet makes use of functions in the dplyr package for data manipulation to achieve our goal: it takes the flights data frame and filters it to only return the rows which meet the condition carrier == "AS" (recall equality is specified with == and not =). You will see many more examples using this function in Chapter ??. Learning check (LC3.1) Take a look at both the flights and alaska_flights data frames by running View(flights) and View(alaska_flights) in the console. In what respect do these data frames differ? 4.3.1 Scatter-plots via geom_point We proceed to create the scatter-plot using the ggplot() function: ggplot(data=alaska_flights, aes(x = dep_delay, y = arr_delay)) + geom_point() Figure 4.2: Arrival Delays vs Departure Delays for Alaska Airlines flights from NYC in 2013 Let’s break down this keeping in mind our discussion in Section 4.1: Within the ggplot() function call, we specify two of the components of the grammar: The data frame to be alaska_flights by setting data=alaska_flights The aesthetic mapping by setting aes(x = dep_delay, y = arr_delay). Specifically dep_delay maps to the x position arr_delay maps to the y position We add a layer to the ggplot() function call using the + sign The layer in question specifies the third component of the grammar: the geometric object in question. 
In this case the geometric objects are points, set by specifying geom_point(). In Figure 4.2 we see that a positive relationship exists between dep_delay and arr_delay: as departure delays increase, arrival delays tend to also increase. We also note that the majority of points fall near the point (0, 0). There is a large mass of points clustered there. Learning check (LC3.2) What are some practical reasons why dep_delay and arr_delay have a positive relationship? (LC3.3) What variables (not necessarily in the flights data frame) would you expect to have a negative correlation (i.e. a negative relationship) with dep_delay? Why? Remember that we are focusing on continuous variables here. (LC3.4) Why do you believe there is a cluster of points near (0, 0)? What does (0, 0) correspond to in terms of the Alaska Airlines flights? (LC3.5) What are some other features of the plot that stand out to you? (LC3.6) Create a new scatter-plot using different variables in the alaska_flights data frame by modifying the example above. 4.3.2 Over-Plotting The large mass of points near (0, 0) can cause some confusion. This is the result of a phenomenon called over-plotting. As one may guess, this corresponds to values being plotted on top of each other over and over again. It is often difficult to know just how many values are plotted in this way when looking at a basic scatter-plot as we have here. There are two ways to address this issue: By adjusting the transparency of the points via the alpha argument By jittering the points via geom_jitter() The first way of relieving over-plotting is by changing the alpha argument of geom_point(), which controls the transparency of the points. By default, this value is set to 1.
We can change this value to a fraction between 0 and 1 to make the points in the plot more transparent: ggplot(data=alaska_flights, aes(x = dep_delay, y = arr_delay)) + geom_point(alpha = 0.2) Figure 4.3: Delay scatterplot with alpha=0.2 Note how this function call is identical to the one in Section 4.3, but with alpha = 0.2 added to geom_point(). The second way of relieving over-plotting is to jitter the points a bit. In other words, we are going to add just a bit of random noise to the points to better see them and remove some of the over-plotting. You can think of “jittering” as shaking the points a bit on the plot. Instead of using geom_point, we use geom_jitter to perform this shaking and specify roughly how much jitter to add with the width and height arguments. This corresponds to how hard you’d like to shake the plot in units corresponding to those for both the horizontal and vertical variables (in this case minutes). ggplot(data=alaska_flights, aes(x = dep_delay, y = arr_delay)) + geom_jitter(width = 30, height = 30) Figure 4.4: Jittered delay scatterplot Note how this function call is identical to the one in Section ??, but with geom_point() replaced with geom_jitter(). The plot in Figure 4.4 helps us a little bit in getting a sense for the over-plotting, but with a relatively large dataset like this one (714 flights), it can be argued that changing the transparency of the points by setting alpha proved more effective. Learning check (LC3.7) Why is setting the alpha argument value useful with scatter-plots? What further information does it give you that a regular scatter-plot cannot? (LC3.8) After viewing Figure 4.3 above, give a range of arrival delays and departure delays that occur most frequently. How has that region changed compared to when you observed the same plot without the alpha = 0.2 set in Figure 4.2?
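One rough way to gauge how much over-plotting occurs in the scatter-plots above is to count how many flights share exactly the same pair of delay values. The sketch below assumes the dplyr and nycflights13 packages are installed; overplotted is a hypothetical name introduced for illustration.

```r
library(dplyr)
library(nycflights13)

# Same subset as in Section 4.3: Alaska Airlines flights only
alaska_flights <- flights %>%
  filter(carrier == "AS")

# Count the flights at each exact (dep_delay, arr_delay) position,
# listing the most heavily stacked positions first
overplotted <- alaska_flights %>%
  count(dep_delay, arr_delay, sort = TRUE)

head(overplotted)
```

Any position with a count above 1 is drawn as a single point in the basic scatter-plot, which is the stacking that alpha and geom_jitter() help reveal.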
4.3.3 Summary Scatter-plots display the relationship between two continuous variables and may be the most used plot today as they can provide an immediate way to see the trend in one variable versus another. If you try to create a scatter-plot where either one of the two variables is not quantitative, however, you will get strange results. Be careful! With medium to large datasets, you may need to play with either geom_jitter or the alpha argument in order to get a good feel for relationships in your data. This tweaking is often a fun part of data visualization since you’ll have the chance to see different relationships come about as you make subtle changes to your plots. 4.4 5NG#2: Line-graphs The next of the 5NG is the line-graph. Line-graphs are most frequently used when the x-axis represents time and the y-axis represents some other numerical variable; such plots are known as time series. Time represents a variable that is connected together by each day following the previous day. In other words, time has a natural ordering. Line-graphs should be avoided when there is not a clear sequential ordering to the explanatory variable (i.e., the x-variable). Our focus turns to the temp variable in the weather dataset. By looking over the weather dataset by typing View(weather) in the console and running ?weather to bring up the help file, we can see that the temp variable corresponds to hourly temperature (in Fahrenheit) recordings at weather stations near airports in New York City. Instead of considering all hours in 2013 for all three airports in NYC, let’s focus on the hourly temperature at Newark airport (origin code “EWR”) for the first 15 days in January 2013. The weather data frame in the nycflights13 package contains this data, but we first need to filter it to only include those rows that correspond to Newark in the first 15 days of January.
data(weather) early_january_weather <- weather %>% filter(origin=="EWR" & month == 1 & day <= 15) This is very similar to the previous use of the filter command in Section 4.3; however, we now use the & operator. The above selects only those rows in weather where origin == "EWR" and month == 1 and day <= 15. Learning check (LC3.9) Take a look at both the weather and early_january_weather data frames by running View(weather) and View(early_january_weather) in the console. In what respect do these data frames differ? (LC3.10) The weather data is recorded hourly. Why does the time_hour variable correctly identify the hour of the measurement, whereas the hour variable alone does not? 4.4.1 Line-graphs via geom_line We plot a line-graph of hourly temperature using geom_line(): ggplot(data=early_january_weather, aes(x=time_hour, y=temp)) + geom_line() Figure 4.5: Hourly Temperature in Newark for Jan 1-15 2013 Much as with the ggplot() call in Section ??, we specify the components of the Grammar of Graphics: Within the ggplot() function call, we specify two of the components of the grammar: The data frame to be early_january_weather by setting data=early_january_weather The aesthetic mapping by setting aes(x = time_hour, y = temp). Specifically time_hour (i.e. the time variable) maps to the x position temp maps to the y position We add a layer to the ggplot() function call using the + sign The layer in question specifies the third component of the grammar: the geometric object in question. In this case the geometric object is a line, set by specifying geom_line() Learning check (LC3.11) Why should line-graphs be avoided when there is not a clear ordering of the horizontal axis? (LC3.12) Why are line-graphs frequently used when time is the explanatory variable? (LC3.13) Plot a time series of a variable other than temp for Newark Airport in the first 15 days of January 2013.
4.4.2 Summary Line-graphs, just like scatter-plots, display the relationship between two continuous variables. However, the variable on the x-axis (i.e. the explanatory variable) should have a natural ordering, like some notion of time. We can mislead our audience if that isn’t the case. 4.5 5NG#3: Histograms Let’s consider the temp variable in the weather data frame once again, but now, unlike with the line-graphs in Section 4.4, let’s say we don’t care about the relationship of temperature to time, but rather we care about the (statistical) distribution of temperatures. We could just produce points where each of the different values appears on something similar to a number line: Figure 4.6: Strip Plot of Hourly Temperature Recordings from NYC in 2013 This gives us a general idea of how the values of temp differ. We see that temperatures vary from around 11 up to 100 degrees Fahrenheit. The area between 40 and 60 degrees appears to have more points plotted than outside that range. 4.5.1 Histograms via geom_histogram What is commonly produced instead of this strip plot is a plot known as a histogram. The histogram shows how many elements of a single numerical variable fall in specified bins. In this case, these bins may correspond to 0-10°F, 10-20°F, etc. We produce a histogram of the hourly temperatures at all three NYC airports in 2013: ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram() ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ## Warning: Removed 1 rows containing non-finite values (stat_bin). Figure 4.7: Histogram of Hourly Temperature Recordings from NYC in 2013 Note here: There is only one variable being mapped in aes(): the single continuous variable temp. You don’t need to compute the y-aesthetic: it gets computed automatically. We set the geometric object to be geom_histogram() We got a warning message of 1 rows containing non-finite values being removed.
This is due to one of the values of temperature being missing. R is alerting us that this happened. 4.5.2 Adjusting the Bins We can adjust the number/size of the bins in two ways: By adjusting the number of bins via the bins argument By adjusting the width of the bins via the binwidth argument First, we have the power to specify how many bins we would like to put the data into as an argument in the geom_histogram function. By default, this is chosen somewhat arbitrarily to be 30, and we received a message above our plot that this was done. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(bins = 60) Figure 4.8: Histogram of Hourly Temperature Recordings from NYC in 2013 - 60 Bins Second, instead of specifying the number of bins, we can also specify the width of the bins by using the binwidth argument in the geom_histogram function. ggplot(data = weather, mapping = aes(x = temp)) + geom_histogram(binwidth = 10) Figure 4.9: Histogram of Hourly Temperature Recordings from NYC in 2013 - Binwidth = 10 Learning check (LC3.14) What does changing the number of bins from 30 to 60 tell us about the distribution of temperatures? (LC3.15) Would you classify the distribution of temperatures as symmetric or skewed? (LC3.16) What would you guess is the “center” value in this distribution? Why did you make that choice? (LC3.17) Is this data spread out greatly from the center or is it close? Why? 4.5.3 Summary Histograms, unlike scatter-plots and line-graphs, present information on only a single continuous variable. In particular, they are visualizations of the (statistical) distribution of values. 4.6 Facets Before continuing the 5NG, we briefly introduce a new concept called faceting. Faceting is used when we’d like to create small multiples of the same plot over a different categorical variable. By default, all of the small multiples will have the same vertical axis.
For example, suppose we were interested in looking at how the temperature histograms we saw in Chapter 4.5 varied by month. This is what is meant by “the distribution of a variable over another variable”: temp is one variable and month is the other variable. In order to look at histograms of temp for each month, we add a layer facet_wrap(~month). ggplot(data = weather, aes(x = temp)) + geom_histogram(binwidth = 5) + facet_wrap(~month) Figure 4.10: Faceted histogram As we might expect, the temperature tends to increase as summer approaches and then decrease as winter approaches. Learning check (LC3.18) What other things do you notice about the faceted plot above? How does a faceted plot help us see how relationships between two variables? (LC3.19) What do the numbers 1-12 correspond to in the plot above? What about 25, 50, 75, 100? (LC3.21) For which types of datasets would these types of faceted plots not work well in comparing relationships between variables? Give an example describing the variability of the variables and other important characteristics. (LC3.22) Does the temp variable in the weather data set have a lot of variability? Why do you say that? 4.7 5NG#4: Boxplots While using faceted histograms can provide a way to compare distributions of a continuous variable split by groups of a categorical variable as in Chapter 4.6, an alternative plot called a boxplot (also called a side-by-side boxplot) achieves the same task. The boxplot uses the information provided in the five-number summary referred to in Appendix A. It gives a way to compare this summary information across the different levels of a categorical variable. 4.7.1 Boxplots via geom_boxplot Let’s create a boxplot to compare the monthly temperatures as we did above with the faceted histograms. ggplot(data = weather, aes(x = month, y = temp)) + geom_boxplot() Figure 4.11: Invalid boxplot specification Note the first warning that is given here. 
(The second one corresponds to missing values in the data frame and it is turned off on subsequent plots.) Observe that this plot does not look like what we were expecting. We were expecting to see the distribution of temperatures for each month (so 12 different boxplots). This gives us the overall boxplot without any other groupings. We can get around this by introducing a new function for our x variable: ggplot(data = weather, mapping = aes(x = factor(month), y = temp)) + geom_boxplot() Figure 4.12: Month by temp boxplot We have introduced a new function called factor() here. One of the things this function does is to convert a discrete value like month (1, 2, …, 12) into a categorical variable. The “box” part of this plot represents the 25th percentile, the median (50th percentile), and the 75th percentile. The dots correspond to outliers. (The specific formulation for these outliers is discussed in Appendix A.) The lines show how the data outside the center 50%, which is defined by the first and third quartiles, vary. Longer lines correspond to more variability and shorter lines correspond to less variability. Learning check (LC3.23) What does the dot at the bottom of the plot for May correspond to? Explain what might have occurred in May to produce this point. (LC3.24) Which months have the highest variability in temperature? What reasons can you give for this? (LC3.25) We looked at the distribution of a continuous variable over a categorical variable here with this boxplot. Why can’t we look at the distribution of one continuous variable over the distribution of another continuous variable? Say temperature across pressure, for example? (LC3.26) Boxplots provide a simple way to identify outliers. Why may outliers be easier to identify when looking at a boxplot instead of a faceted histogram? 4.7.2 Summary Boxplots provide a way to compare and contrast the distribution of one quantitative variable across multiple levels of one categorical variable.
One can easily look to see where the median falls across the different groups by looking at the center line in the box. You can also see how spread out the variable is across the different groups by looking at the width of the box and also how far out the lines stretch from the box. If the lines stretch far from the box but the box has a small width, the variability of the values closer to the center is much smaller than the variability of the values at the outer ends. Lastly, outliers are even more easily identified when looking at a boxplot than when looking at a histogram. 4.8 5NG#5: Barplots Both histograms and boxplots represent ways to visualize the variability of continuous variables. Another common task is to present the distribution of a categorical variable. This is a simpler task since we will be interested in how many elements from our data fall into the different categories of the categorical variable. 4.8.1 Barplots via geom_bar Frequently, the best way to visualize these different counts (also known as frequencies) is via a barplot. Consider the distribution of airlines that flew out of New York City in 2013. Here we explore the number of flights from each airline/carrier. This can be plotted by invoking the geom_bar function in ggplot2: ggplot(data = flights, mapping = aes(x = carrier)) + geom_bar() Figure 4.13: Number of flights departing NYC in 2013 by airline We see that United Air Lines, JetBlue Airways, and ExpressJet Airlines had the most flights depart New York City in 2013. To get the actual number of flights by each airline we can use the count function in the dplyr package on the carrier variable in flights, which we will introduce formally in Chapter @ref{manip}. ## # A tibble: 1 × 1 ## `1.n` ## <int> ## 1 336776 Learning check (LC3.27) Why are histograms inappropriate for visualizing categorical variables? (LC3.28) What is the difference between histograms and barplots? (LC3.29) How many Envoy Air flights departed NYC in 2013?
(LC3.30) What was the seventh highest airline in terms of departed flights from NYC in 2013? How can we better present the table to get this answer quickly? 4.8.2 Must avoid pie charts! Unfortunately, one of the most common plots seen today for categorical data is the pie chart. While they may seem harmless enough, they actually present a problem in that humans are unable to judge angles well. As Naomi Robbins describes in her book “Creating More Effective Graphs” (Robbins 2013), we overestimate angles greater than 90 degrees and we underestimate angles less than 90 degrees. In other words, it is difficult for us to determine the relative size of one piece of the pie compared to another. Let’s examine our previous barplot example on the number of flights departing NYC by airline. This time we will use a pie chart. As you review this chart, try to identify how much larger the portion of the pie is for ExpressJet Airlines (EV) compared to US Airways (US), what the third largest carrier is in terms of departing flights, and how many carriers have fewer flights than United Airlines (UA). Figure 4.14: The dreaded pie chart While it is quite easy to look back at the barplot to get the answer to these questions, it’s quite difficult to get the answers correct when looking at the pie graph. Barplots can always present the information in a way that is easier for the eye to determine relative position. There may be one exception from Nathan Yau at FlowingData.com, but we will leave this for the reader to decide: Figure 4.15: The only good pie chart Learning check (LC3.31) Why should pie charts be avoided and replaced by barplots? (LC3.32) What is your opinion as to why pie charts continue to be used? 4.8.3 Using barplots to compare two variables Barplots are the go-to way to visualize the frequency of different categories of a categorical variable. They make it easy to order the counts and to compare one group’s frequency to another.
Another use of barplots (unfortunately, sometimes inappropriately and confusingly) is to compare two categorical variables together. Let’s examine the distribution of outgoing flights from NYC by carrier and airport. We begin by getting the names of the airports in NYC that were included in the flights dataset. Remember from Chapter 3 that this can be done by using the inner_join function (more in Chapter ??). flights_namedports <- flights %>% inner_join(airports, by = c("origin" = "faa")) After running View(flights_namedports), we see that name now corresponds to the name of the airport as referenced by the origin variable. We will now plot carrier as the horizontal variable. When we specify geom_bar, it will specify count as being the vertical variable. A new addition here is fill = name. Look over what was produced from the plot to get an idea of what this argument gives. ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) + geom_bar() Figure 4.16: Stacked barplot comparing the number of flights by carrier and airport This plot is what is known as a stacked barplot. While simple to make, it often leads to many problems. Learning check (LC3.33) What kinds of questions are not easily answered by looking at the above figure? (LC3.34) What can you say, if anything, about the relationship between airline and airport in NYC in 2013 with regard to the number of departing flights? Another variation on the stacked barplot is the side-by-side barplot. ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) + geom_bar(position = "dodge") Figure 4.17: Side-by-side barplot comparing the number of flights by carrier and airport Learning check (LC3.35) Why might the side-by-side barplot be preferable to a stacked barplot in this case? (LC3.36) What are the disadvantages of using a side-by-side barplot, in general? Lastly, an often preferred type of barplot is the faceted barplot.
We already saw this concept of faceting and small multiples in Section 4.6. This gives us a nicer way to compare the distributions across both carrier and airport/name. ggplot(data = flights_namedports, mapping = aes(x = carrier, fill = name)) + geom_bar() + facet_grid(name ~ .) Figure 4.18: Faceted barplot comparing the number of flights by carrier and airport Note how the facet_grid function arguments are written here. We want the names of the airports arranged vertically and the carriers listed horizontally. As you may have guessed, this argument and other formulas of this sort in R are in y ~ x order. We will see more examples of this in Chapter ??. Learning check (LC3.37) Why is the faceted barplot preferred to the side-by-side and stacked barplots in this case? (LC3.38) What information about the different carriers at different airports is more easily seen in the faceted barplot? 4.8.4 Summary Barplots are the preferred way of displaying categorical variables. They are easy to understand and make it easy to compare across groups of a categorical variable. When dealing with more than one categorical variable, faceted barplots are frequently preferred over side-by-side or stacked barplots. Stacked barplots are sometimes nice to look at, but it is quite difficult to compare across the levels since the bars are all of different sizes. Side-by-side barplots can provide an improvement on this, but the issue about comparing across groups still must be dealt with. 4.9 Conclusion 4.9.1 What’s to come? In Chapter ??, we’ll further explore data by grouping our data, creating summaries based on those groupings, filtering our data to match conditions, selecting specific columns of our data, and other manipulations with our data including defining new columns/variables. These data manipulation procedures will go hand-in-hand with the data visualizations you’ve produced here. 4.9.2 Script of R code An R script file of all R code used in this chapter is available here.
References "], diff --git a/index.Rmd b/index.Rmd index 3a168283b..85a3e75a6 100755 --- a/index.Rmd +++ b/index.Rmd @@ -46,9 +46,9 @@ These are some principles we keep in mind. If you agree with them, this might be 1. **Complete reproducibility** + We find it frustrating when textbooks give examples but not the source code and the data itself. We not only give you the source code for all examples, but also the source code for the whole book! + We encourage use of R Markdown to foster notions of reproducible research. -1. **Ultimately the best textbook is one you’ve written yourself** - + You best know your audience, their background, and their priorities and you know best your own style and types of examples and problems you like best. Customizability is the ultimate end. - + A new paradigm for textbooks? Versions, not editions? Pull requests, crowd-sourcing, and development versions? + + **Ultimately the best textbook is one you’ve written yourself** + - You best know your audience, their background, and their priorities and you know best your own style and types of examples and problems you like best. Customizability is the ultimate end. + - A new paradigm for textbooks? Versions, not editions? Pull requests, crowd-sourcing, and development versions? ## Contribute